chapter-2.-数据挖掘认识数据.ppt

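Data Warehousing and Data Mining
Chapter 2: Getting to Know Your Data
Li Cheng'an (李成安), School of Economics and Trade, South China University of Technology, 2013-2014 academic year
July 4, 2018, Data Mining: Concepts and Techniques

Chapter 2: Getting to Know Your Data
- Data objects and attribute types
- Basic statistical descriptions of data
- Data visualization
- Measuring data similarity and dissimilarity
- Summary

Types of Data Sets
- Record
  - Relational records
  - Data matrix, e.g., numerical matrix, crosstabs
  - Document data: text documents represented as term-frequency vectors
  - Transaction data
- Graph and network
  - World Wide Web
  - Social or information networks
  - Molecular structures
- Ordered
  - Video data: sequence of images
  - Temporal data: time series
  - Sequential data: transaction sequences
  - Genetic sequence data
- Spatial, image, and multimedia
  - Spatial data: maps
  - Image data
  - Video data

Important Characteristics of Structured Data
- Dimensionality: the curse of dimensionality
- Sparsity: only presence counts
- Resolution: patterns depend on the scale
- Distribution: centrality and dispersion

Data Objects
- Data sets are made up of data objects.
- A data object represents an entity.
- Examples:
  - Sales database: customers, store items, sales
  - Medical database: patients, treatments
  - University database: students, professors, courses
- Also called samples, instances, data points, objects, or tuples.
- Data objects are described by attributes.
- Database rows correspond to data objects; columns correspond to attributes.

Attributes
- Attribute (or dimension, feature, variable): a data field representing a characteristic of a data object, e.g., customer_ID, name, address
- Types:
  - Nominal
  - Binary
  - Numeric (quantitative): interval-scaled, ratio-scaled

Attribute Types
- Nominal: categories, states, or "names of things"
  - Hair_color = {auburn, black, blond, brown, grey, red, white}
  - Marital status, occupation, ID numbers, zip codes
- Binary: nominal attribute with only 2 states (0 and 1)
  - Symmetric binary: both outcomes equally important, e.g., gender
  - Asymmetric binary: outcomes not equally important, e.g., medical test (positive vs. negative)
  - Convention: assign 1 to the most important outcome (e.g., HIV positive)
- Ordinal
  - Values have a meaningful order (ranking), but the magnitude between successive values is not known.
  - Size = {small, medium, large}, grades, army rankings

Numeric Attribute Types
- Quantity (integer or real-valued)
- Interval-scaled
  - Measured on a scale of equal-sized units
  - Values have order
  - E.g., temperature in °C or °F, calendar dates
  - No true zero-point
- Ratio-scaled
  - Inherent zero-point
  - We can speak of values as being an order of magnitude larger than the unit of measurement (10 K is twice as high as 5 K).
  - E.g., temperature in Kelvin, length, counts, monetary quantities

Discrete vs. Continuous Attributes
- Discrete attribute
  - Has only a finite or countably infinite set of values
  - E.g., zip codes, profession, or the set of words in a collection of documents
  - Sometimes represented as integer variables
  - Note: binary attributes are a special case of discrete attributes
- Continuous attribute
  - Has real numbers as attribute values
  - E.g., temperature, height, or weight
  - Practically, real values can only be measured and represented using a finite number of digits
  - Continuous attributes are typically represented as floating-point variables

Basic Statistical Descriptions of Data
- Motivation: to better understand the data (central tendency, variation, and spread)
- Data dispersion characteristics: median, max, min, quantiles, outliers, variance, etc.
- Numerical dimensions correspond to sorted intervals
  - Data dispersion: analyzed with multiple granularities of precision
  - Boxplot or quantile analysis on sorted intervals
- Dispersion analysis on computed measures
  - Folding measures into numerical dimensions
  - Boxplot or quantile analysis on the transformed cube

Measuring the Central Tendency
- Mean (algebraic measure), sample vs. population: $\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$ and $\mu = \frac{\sum x}{N}$
  - Note: n is the sample size and N is the population size.
  - Weighted arithmetic mean: $\bar{x} = \frac{\sum_{i=1}^{n} w_i x_i}{\sum_{i=1}^{n} w_i}$
  - Trimmed mean: the mean after chopping extreme values
- Median
  - Middle value if there is an odd number of values, or the average of the middle two values otherwise
  - Estimated by interpolation (for grouped data): $\text{median} = L_1 + \left(\frac{n/2 - (\sum \text{freq})_l}{\text{freq}_{\text{median}}}\right) \times \text{width}$, where $L_1$ is the lower boundary of the median interval, $(\sum \text{freq})_l$ is the total frequency of all intervals below it, $\text{freq}_{\text{median}}$ is the frequency of the median interval, and width is the interval width
- Mode
  - Value that occurs most frequently in the data
  - Unimodal, bimodal, trimodal
  - Empirical formula: $\text{mean} - \text{mode} = 3 \times (\text{mean} - \text{median})$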
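
The central tendency measures above can be sketched in a few lines of Python. This is only an illustration using the standard library: the `salaries` list and the helper names `weighted_mean` and `trimmed_mean` are assumptions made for this example, not part of the slides.

```python
from statistics import mean, median, multimode


def weighted_mean(values, weights):
    """Weighted arithmetic mean: sum(w_i * x_i) / sum(w_i)."""
    return sum(w * x for w, x in zip(weights, values)) / sum(weights)


def trimmed_mean(values, fraction):
    """Mean after chopping `fraction` of the values from each end of the sorted data."""
    ordered = sorted(values)
    k = int(len(ordered) * fraction)  # number of values dropped per tail
    return mean(ordered[k:len(ordered) - k])


# Illustrative salaries in thousands (assumed data); 110 is an extreme value.
salaries = [30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110]

sample_mean = mean(salaries)      # pulled upward by the extreme value
sample_median = median(salaries)  # average of the two middle values (even n)
modes = multimode(salaries)       # two modes, so this data is bimodal

print(sample_mean, sample_median, modes, trimmed_mean(salaries, 0.1))
```

Dropping one value from each tail removes the extreme salary, so the trimmed mean lies closer to the median than the plain mean does.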

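Symmetric vs. Skewed Data
- Median, mean and mode of symmetric, positively and negatively skewed data
- [Figure: three distribution curves labeled right-skewed (positively skewed), left-skewed (negatively skewed), and symmetric]

Measuring the Dispersion of Data
- Quantiles, outliers, and boxplots
  - Quartiles: Q1 (25th percentile), Q3 (75th percentile)
  - Interquartile range: IQR = Q3 - Q1
  - Five-number summary: min, Q1, median, Q3, max
  - Boxplot: the ends of the box are the quartiles; the median is marked; add whiskers, and plot outliers individually
  - Outlier: usually, a value more than 1.5 x IQR above Q3 or below Q1
- Variance and standard deviation (sample: $s$, population: $\sigma$)
  - Variance (algebraic, scalable computation): $s^2 = \frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2$ and $\sigma^2 = \frac{1}{N}\sum_{i=1}^{N}(x_i - \mu)^2 = \frac{1}{N}\sum_{i=1}^{N} x_i^2 - \mu^2$
  - Standard deviation $s$ (or $\sigma$) is the square root of the variance $s^2$ (or $\sigma^2$)

Boxplot Analysis
- Five-number summary of a distribution: Minimum, Q1, Median, Q3, Maximum
- Boxplot
  - Data is represented with a box
  - The ends of the box are at the first and third quartiles, i.e., the height of the box is the IQR
  - The median is marked by a line within the box
  - Whiskers: two lines outside the box extended to Minimum and Maximum
  - Outliers: points beyond a specified outlier threshold, plotted individually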
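
As a rough sketch of the dispersion measures above, the following Python computes a five-number summary, the 1.5 x IQR outlier fences, and the variance. Quartile conventions differ between textbooks and libraries; this example assumes the "median of each half" convention, and the data and function names are illustrative only.

```python
from statistics import median, pvariance, variance


def quartiles(values):
    """Return (Q1, median, Q3), taking Q1 and Q3 as the medians of the lower and upper halves."""
    ordered = sorted(values)
    n = len(ordered)
    half = n // 2
    lower = ordered[:half]
    upper = ordered[half + n % 2:]  # skip the middle value when n is odd
    return median(lower), median(ordered), median(upper)


def five_number_summary(values):
    """Return (min, Q1, median, Q3, max)."""
    q1, med, q3 = quartiles(values)
    return min(values), q1, med, q3, max(values)


def iqr_outliers(values):
    """Values more than 1.5 x IQR below Q1 or above Q3."""
    q1, _, q3 = quartiles(values)
    iqr = q3 - q1
    low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return [x for x in values if x < low or x > high]


# Illustrative salaries in thousands (assumed data).
salaries = [30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110]

summary = five_number_summary(salaries)
outliers = iqr_outliers(salaries)
sample_var = variance(salaries)       # divides by n - 1
population_var = pvariance(salaries)  # divides by N

print(summary, outliers, sample_var, population_var)
```

A boxplot of this data would draw the box from Q1 to Q3 and plot any value returned by `iqr_outliers` as an individual point.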

Visualization of Data Dispersion: 3-D Boxplots

Properties of the Normal Distribution Curve
- The normal (distribution) curve, with $\mu$ the mean and $\sigma$ the standard deviation
  - From $\mu - \sigma$ to $\mu + \sigma$: contains about 68% of the measurements
  - From $\mu - 2\sigma$ to $\mu + 2\sigma$: contains about 95% of them
  - From $\mu - 3\sigma$ to $\mu + 3\sigma$: contains about 99.7% of them

Graphic Displays of Basic Statistical Descriptions
- Boxplot: graphic display of the five-number summary
- Histogram: the x-axis shows values, the y-axis represents frequencies
- Quantile plot: each value $x_i$ is paired with $f_i$, indicating that approximately $100 f_i\%$ of the data are $\le x_i$
- Quantile-quantile (q-q) plot: graphs the quantiles of one univariate distribution against the corresponding quantiles of another
- Scatter plot: each pair of values is a pair of coordinates and plotted as points in the plane

Histogram Analysis
- Histogram: graph display of tabulated frequencies, shown as bars
- It shows what proportion of cases fall into each of several categories
- Differs from a bar chart in that it is the area of the bar that denotes the value, not the height as in bar charts, a crucial distinction when the categories are not of uniform width
- The categories are usually specified as non-overlapping intervals of some variable. The categories (bars) must be adjacent

Histograms Often Tell More than Boxplots
- The two histograms shown on the left may have the same boxplot representation: the same values for min, Q1, median, Q3, max
- But they have rather different data distributions

Quantile Plot
- Displays all of the data (allowing the user to assess both the overall behavior and unusual occurrences)
- Plots quantile information: for data $x_i$ sorted in increasing order, $f_i$ indicates that approximately $100 f_i\%$ of the data are below or equal to the value $x_i$

Quantile-Quantile (Q-Q) Plot
- Graphs the quantiles of one univariate distribution against the corresponding quantiles of another
- View: is there a shift in going from one distribution to another?
- The example shows the unit price of items sold at Branch 1 vs. Branch 2 for each quantile. Unit prices of items sold at Branch 1 tend to be lower than those at Branch 2.

Scatter Plot
- Provides a first look at bivariate data to see clusters of points, outliers, etc.
- Each pair of values is treated as a pair of coordinates and plotted as points in the plane

Positively and Negatively Correlated Data
- The upper-left panel shows positively correlated data; the upper-right panel shows negatively correlated data

Uncorrelated Data

Data Visualization
- Why data visualization?
  - Gain insight into an information space by mapping data onto graphical primitives
  - Provide a qualitative overview of large data sets
  - Search for patterns, trends, structure, irregularities, and relationships among data
  - Help find interesting regions and suitable parameters for further quantitative analysis
  - Provide a visual proof of computer representations derived
- Visualization methods:
  - Pixel-oriented visualization techniques
  - Geometric projection visualization techniques
  - Icon-based visualization techniques
  - Hierarchical visualization techniques
  - Visualizing complex data and relations

Pixel-Oriented Visualization Techniques
- For a data set of m dimensions, create m windows, one for each dimension
- The m dimension values of a record are mapped to m pixels at the corresponding positions in the windows
- The color of each pixel reflects the corresponding value
- [Figure panels: (a) income, (b) credit limit, (c) transaction volume, (d) age]

Laying Out Pixels in Circle Segments
- To save space and show the connections among multiple dimensions, space filling is done in a circle-segment window
- [Figure panels: (a) representing a data record in circle segments, (b) laying out pixels in a circle segment]

Geometric Projection Visualization Techniques
- Visualization of geometric transformations and projections of the data
- Methods
  - Direct visualization
  - Scatterplots and scatterplot matrices
  - Landscapes
  - Projection pursuit technique: helps users find meaningful projections of multidimensional data
  - Prosection views
  - Hyperslice
  - Parallel coordinates

Direct Data Visualization
- [Figure: Ribbons with Twists Based on Vorticity]

Scatterplot Matrices
- Matrix of scatterplots (x-y diagrams) of the k-dimensional data; there are $k(k-1)/2$ distinct attribute pairs to plot
- Used by permission of M. Ward, Worcester Polytechnic Institute

Landscapes
- [Figure: news articles visualized as a landscape. Used by permission of B. Wright, Visible Decisions Inc.]
- Visualization of the data as a perspective landscape
- The data needs to be transformed into a (possibly artificial) 2D spatial representation which preserves the characteristics of the data

Parallel Coordinates
- n equidistant, mutually parallel axes, one for each attribute
- The axes are scaled to the [minimum, maximum] range of the corresponding attribute
- Each data record is drawn as a polygonal line that intersects each axis at the point corresponding to the record's value for that attribute

Visualization Using Parallel Coordinates

Icon-Based Visualization Techniques
- Use a small number of icons to represent multidimensional data values
- Typical methods
  - Chernoff faces
  - Stick figures
- General techniques
  - Shape coding: use shape to represent certain information encoding
  - Color icons: use color icons to encode more information
  - Tile bars: use small icons to represent the relevant feature vectors in document retrieval

Chernoff Faces
- A way to display variables on a two-dimensional surface, e.g., let x be eyebrow slant, y be eye size, z be nose length, etc.
- The figure shows faces produced using 10 characteristics (head eccentricity, eye size, eye spacing, eye eccentricity, pupil size, eyebrow slant, nose size, mouth shape, mouth size, and mouth opening), each assigned one of 10 possible values, generated using Mathematica (S. Dickson)
- Reference: Gonick, L. and Smith, W. The Cartoon Guide to Statistics. New York: Harper Perennial, p. 212, 1993
- Reference: Weisstein, Eric W. "Chernoff Face." From MathWorld, A Wolfram Web Resource

Stick Figures
- [Figure: demographic data. Used by permission of G. Grinstein, University of Massachusetts at Lowell]
- A 5-piece stick figure (four limbs and one body, with different angles/lengths)

Hierarchical Visualization Techniques
- Visualization of the data using a hierarchical partitioning into subspaces
- Methods
  - Dimensional stacking
  - Worlds-within-Worlds
  - Tree-map
  - Cone trees
  - InfoCube

Dimensional Stacking
- Partitioning of the n-dimensional attribute space into 2-D subspaces, which are stacked into each other
- Partitioning of the attribute value ranges into classes. The important attributes should be used on the outer levels.
- Adequate for data with ordinal attributes of low cardinality
- But difficult to display more than nine dimensions
- Important to map dimensions appropriately

Dimensional Stacking (figure)
- Visualization of oil mining data with longitude and latitude mapped to the outer x-, y-axes and ore grade and depth mapped to the inner x-, y-axes
- Used by permission of M. Ward, Worcester Polytechnic Institute

Worlds-within-Worlds
- Assign the function and the two most important parameters to the innermost world
- Fix all other parameters at constant values, and draw other (1-, 2-, or 3-dimensional) worlds choosing these as the axes
- Nvision: dynamic interaction through data glove and stereo displays, including rotation, scaling (inner) and translation (inner/outer)
- Auto Visual: static interaction by means of queries

Tree-Map
- Displays hierarchical data as a set of nested rectangles
- The x- and y-dimensions of the screen are partitioned alternately according to the attribute values (classes)
- [Figure: Shneiderman, UMD: tree-map of a file system]
- [Figure: Shneiderman, UMD: tree-map to support large data sets of a million items]

InfoCube
- A 3-D visualization technique where hierarchical information is displayed as nested semi-transparent cubes
- The outermost cubes correspond to the top-level data, while the subnodes or lower-level data are represented as smaller cubes inside the outermost cubes, and so on

Three-D Cone Trees
- The 3D cone tree visualization technique works well for up to a thousand nodes or so
- First build a 2D circle tree that arranges its nodes in concentric circles centered on the root node
- Cannot avoid overlaps when projected to 2D
- G. Robertson, J. Mackinlay, S. Card. "Cone Trees: Animated 3D Visualizations of Hierarchical Information", ACM SIGCHI'91
- Graph from the Nadeau Software Consulting website: visualizes a social network data set that models the way an infection spreads from one person to the next

Visualizing Complex Data and Relations
- Visualizing non-numerical data: text and social networks
- Tag cloud: visualizes user-generated tags; the importance of a tag is represented by font size/color
- Besides text data, there are also methods to visualize relationships, such as visualizing social networks
- [Figure: Newsmap: Google News Stories in 2005]

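Similarity and Dissimilarity
- Similarity
  - Numerical measure of how alike two data objects are
  - The value is higher when objects are more alike
  - Often falls in the range [0, 1]
- Dissimilarity (e.g., distance)
  - Numerical measure of how different two data objects are
  - Lower when objects are more alike
  - Minimum dissimilarity is often 0
  - The upper limit varies
- Proximity refers to a similarity or dissimilarity

Data Matrix and Dissimilarity Matrix
- Data matrix
  - n data points with p dimensions (an n x p matrix)
  - Two modes
- Dissimilarity matrix
  - n data points, but registers only the distance $d(i, j)$
  - A triangular matrix
  - Single mode

Proximity Measure for Nominal Attributes
- Can take 2 or more states, e.g., red, yellow, blue, green (generalization of a binary attribute)
- Method 1: simple matching, $d(i, j) = \frac{p - m}{p}$
  - m: number of matches, p: total number of variables
- Method 2: use a large number of binary attributes
  - Creating a new binary attribute for each of the M nominal states

Proximity Measure for Binary Attributes
- A contingency table for binary data on objects i and j: q is the number of attributes that equal 1 for both objects, r the number that equal 1 for i but 0 for j, s the number that equal 0 for i but 1 for j, and t the number that equal 0 for both
- Distance measure for symmetric binary variables: $d(i, j) = \frac{r + s}{q + r + s + t}$
- Distance measure for asymmetric binary variables: $d(i, j) = \frac{r + s}{q + r + s}$
- Jaccard coefficient (similarity measure for asymmetric binary variables): $sim_{Jaccard}(i, j) = \frac{q}{q + r + s}$
- Note: the Jaccard coefficient is the same as "coherence": $coherence(i, j) = \frac{sup(i, j)}{sup(i) + sup(j) - sup(i, j)} = \frac{q}{(q + r) + (q + s) - q}$

Dissimilarity between Binary Variables
- Example (the data table is not preserved in this text)
  - Gender is a symmetric attribute
  - The remaining attributes are asymmetric binary
  - Let the values Y and P be 1, and the value N be 0

Standardizing Numeric Data
- Z-score: $z = \frac{x - \mu}{\sigma}$
  - x: raw score to be standardized, $\mu$: mean of the population, $\sigma$: standard deviation
  - The distance between the raw score and the population mean in units of the standard deviation
  - Negative when the raw score is below the mean, positive when above
- An alternative way: calculate the mean absolute deviation $s_f = \frac{1}{n}\left(|x_{1f} - m_f| + |x_{2f} - m_f| + \cdots + |x_{nf} - m_f|\right)$, where $m_f = \frac{1}{n}\left(x_{1f} + x_{2f} + \cdots + x_{nf}\right)$
  - Standardized measure (z-score): $z_{if} = \frac{x_{if} - m_f}{s_f}$
- Using the mean absolute deviation is more robust than using the standard deviation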
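
The binary proximity measures above reduce to counting the four contingency-table cells. The sketch below assumes objects are encoded as equal-length lists of 0/1 values; the two records and all function names are hypothetical, chosen only for illustration.

```python
def contingency(i, j):
    """Count (q, r, s, t) for two equal-length 0/1 vectors.

    q: both 1, r: i is 1 and j is 0, s: i is 0 and j is 1, t: both 0.
    """
    q = sum(1 for a, b in zip(i, j) if a == 1 and b == 1)
    r = sum(1 for a, b in zip(i, j) if a == 1 and b == 0)
    s = sum(1 for a, b in zip(i, j) if a == 0 and b == 1)
    t = sum(1 for a, b in zip(i, j) if a == 0 and b == 0)
    return q, r, s, t


def symmetric_dissimilarity(i, j):
    """(r + s) / (q + r + s + t): every attribute counts."""
    q, r, s, t = contingency(i, j)
    return (r + s) / (q + r + s + t)


def asymmetric_dissimilarity(i, j):
    """(r + s) / (q + r + s): negative matches (t) are ignored.

    Assumes at least one attribute is 1 in either object.
    """
    q, r, s, _ = contingency(i, j)
    return (r + s) / (q + r + s)


def jaccard_similarity(i, j):
    """q / (q + r + s), the complement of the asymmetric dissimilarity."""
    q, r, s, _ = contingency(i, j)
    return q / (q + r + s)


# Hypothetical patient records with six asymmetric binary attributes (Y/P -> 1, N -> 0).
patient_a = [1, 0, 1, 0, 0, 0]
patient_b = [1, 0, 1, 0, 1, 0]

print(contingency(patient_a, patient_b))
print(asymmetric_dissimilarity(patient_a, patient_b))
print(jaccard_similarity(patient_a, patient_b))
```

Because the three shared zeros are ignored, the asymmetric dissimilarity here is larger than the symmetric one, which is the intended effect when only positive outcomes carry information.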

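The two standardization options from the numeric-data slide can be compared with a short sketch. It assumes the population standard deviation for the classic z-score; the sample values and function names are illustrative, not from the slides.

```python
from statistics import mean, pstdev


def z_scores(values):
    """Classic z-score: (x - mean) / population standard deviation."""
    mu = mean(values)
    sigma = pstdev(values)
    return [(x - mu) / sigma for x in values]


def mad_scores(values):
    """Standardize with the mean absolute deviation s_f instead of sigma."""
    m_f = mean(values)
    s_f = sum(abs(x - m_f) for x in values) / len(values)
    return [(x - m_f) / s_f for x in values]


# Illustrative measurements (assumed data): mean 5, standard deviation 2.
values = [2, 4, 4, 4, 5, 5, 7, 9]

print(z_scores(values))
print(mad_scores(values))
```

The mean absolute deviation does not square the deviations, so a single extreme value inflates the scaling factor less than it inflates the standard deviation.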
Example: Data Matrix and Dissimilarity Matrix
- [Figure panels: Data Matrix; Dissimilarity Matrix (with Euclidean Distance)]

Distance on Numeric Data: Minkowski Distance
- Minkowski distance: a popular distance measure
  - $d(i, j) = \sqrt[h]{|x_{i1} - x_{j1}|^h + |x_{i2} - x_{j2}|^h + \cdots + |x_{ip} - x_{jp}|^h}$
  - where $i = (x_{i1}, x_{i2}, \ldots, x_{ip})$ and $j = (x_{j1}, x_{j2}, \ldots, x_{jp})$ are two p-dimensional data objects, and h is the order (the distance so defined is also called the L-h norm)
- Properties
  - $d(i, j) > 0$ if $i \ne j$, and $d(i, i) = 0$ (positive definiteness)
  - $d(i, j) = d(j, i)$ (symmetry)
  - $d(i, j) \le d(i, k) + d(k, j)$ (triangle inequality)
- A distance that satisfies these properties is called a metric

Special Cases of Minkowski Distance
- h = 1: Manhattan (city block, $L_1$ norm) distance, $d(i, j) = \sum_{f=1}^{p} |x_{if} - x_{jf}|$
  - E.g., the Hamming distance: the number of bits that are different between two binary vectors
- h = 2: Euclidean ($L_2$ norm) distance, $d(i, j) = \sqrt{\sum_{f=1}^{p} (x_{if} - x_{jf})^2}$
- $h \to \infty$: "supremum" ($L_{max}$ norm, $L_\infty$ norm) distance, $d(i, j) = \max_{f} |x_{if} - x_{jf}|$
  - This is the maximum difference between any component (attribute) of the vectors

Example: Minkowski Distance
- [Figure panels: dissimilarity matrices for Manhattan ($L_1$), Euclidean ($L_2$), and supremum distance]

Ordinal Variables
- An ordinal variable can be discrete or continuous
- Order is important, e.g., rank
- Can be treated like interval-scaled variables
  - Replace $x_{if}$ by its rank $r_{if} \in \{1, \ldots, M_f\}$
  - Map the range of each variable onto [0, 1] by replacing the i-th object in the f-th variable by $z_{if} = \frac{r_{if} - 1}{M_f - 1}$
  - Compute the dissimilarity using methods for interval-scaled variables

Attributes of Mixed Type
- A database may contain all attribute types: nominal, symmetric binary, asymmetric binary, numeric, ordinal
- The combined dissimilarity is defined as $d(i, j) = \frac{\sum_{f=1}^{p} \delta_{ij}^{(f)} d_{ij}^{(f)}}{\sum_{f=1}^{p} \delta_{ij}^{(f)}}$, where the indicator $\delta_{ij}^{(f)}$ is 0 when attribute f cannot contribute (for example, a missing value) and 1 otherwise
  - f is binary or nominal: $d_{ij}^{(f)} = 0$ if $x_{if} = x_{jf}$, or $d_{ij}^{(f)} = 1$ otherwise
  - f is numeric: use the normalized distance
  - f is ordinal: compute ranks $r_{if}$ and $z_{if} = \frac{r_{if} - 1}{M_f - 1}$, then treat $z_{if}$ as interval-scaled

Cosine Similarity
- A document can be represented by thousands of attributes, each recording the frequency of a particular word (such as a keyword) or phrase in the document
- Other vector objects: gene features in micro-arrays, ...
- Applications: information retrieval, biological taxonomy, gene feature mapping, ...
- Cosine measure: if $d_1$ and $d_2$ are two vectors (e.g., term-frequency vectors), then $\cos(d_1, d_2) = \frac{d_1 \cdot d_2}{\|d_1\| \, \|d_2\|}$, where $\cdot$ indicates the vector dot product and $\|d\|$ is the length of vector d

Example: Cosine Similarity
- $\cos(d_1, d_2) = \frac{d_1 \cdot d_2}{\|d_1\| \, \|d_2\|}$, where $\cdot$ indicates the vector dot product and $\|d\|$ is the length of d
- Example: find the similarity between documents 1 and 2
  - $d_1 = (5, 0, 3, 0, 2, 0, 0, 2, 0, 0)$
  - $d_2 = (3, 0, 2, 0, 1, 1, 0, 1, 0, 1)$
  - $d_1 \cdot d_2 = 5 \cdot 3 + 0 \cdot 0 + 3 \cdot 2 + 0 \cdot 0 + 2 \cdot 1 + 0 \cdot 1 + 0 \cdot 0 + 2 \cdot 1 + 0 \cdot 0 + 0 \cdot 1 = 25$
  - $\|d_1\| = (5^2 + 0^2 + 3^2 + 0^2 + 2^2 + 0^2 + 0^2 + 2^2 + 0^2 + 0^2)^{0.5} = 42^{0.5} = 6.481$
  - $\|d_2\| = (3^2 + 0^2 + 2^2 + 0^2 + 1^2 + 1^2 + 0^2 + 1^2 + 0^2 + 1^2)^{0.5} = 17^{0.5} = 4.12$
  - $\cos(d_1, d_2) = 0.94$

Summary
- Data attribute types: nominal, binary, ordinal, interval-scaled, ratio-scaled
- Many types of data sets, e.g., numerical, text, graph, Web, image
- Main methods:
  - Basic statistical data description: central tendency, dispersion, graphical displays
  - Data visualization: map data onto graphical primitives
  - Measure data similarity
- These steps are the beginning of data preprocessing. Many methods have been developed, but this remains an active area of research.
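
Returning to the Minkowski distance family from the distance slides, the three special cases can be sketched with one general function plus a separate supremum helper. The points and function names are illustrative assumptions.

```python
def minkowski(x, y, h):
    """Minkowski distance of order h between two equal-length numeric vectors."""
    if len(x) != len(y):
        raise ValueError("vectors must have the same number of dimensions")
    return sum(abs(a - b) ** h for a, b in zip(x, y)) ** (1.0 / h)


def supremum(x, y):
    """Limit of the Minkowski distance as h grows: the largest per-attribute difference."""
    if len(x) != len(y):
        raise ValueError("vectors must have the same number of dimensions")
    return max(abs(a - b) for a, b in zip(x, y))


# Two illustrative 2-dimensional objects (assumed data).
x1 = (1, 2)
x2 = (3, 5)

manhattan = minkowski(x1, x2, 1)  # |1-3| + |2-5|
euclidean = minkowski(x1, x2, 2)  # sqrt(2^2 + 3^2)
sup = supremum(x1, x2)            # max(2, 3)

print(manhattan, euclidean, sup)
```

For large h the general formula approaches the supremum distance, which is why the supremum is usually computed directly with `max` rather than by raising differences to a high power.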

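The worked cosine similarity example for documents 1 and 2 can be checked with a few lines of Python. The function name is an assumption; the two term-frequency vectors are the ones given in the example.

```python
import math


def cosine_similarity(d1, d2):
    """cos(d1, d2) = (d1 . d2) / (||d1|| * ||d2||) for equal-length vectors."""
    dot = sum(a * b for a, b in zip(d1, d2))
    norm1 = math.sqrt(sum(a * a for a in d1))
    norm2 = math.sqrt(sum(b * b for b in d2))
    return dot / (norm1 * norm2)


# Term-frequency vectors of documents 1 and 2 from the example.
d1 = (5, 0, 3, 0, 2, 0, 0, 2, 0, 0)
d2 = (3, 0, 2, 0, 1, 1, 0, 1, 0, 1)

similarity = cosine_similarity(d1, d2)
print(round(similarity, 2))
```

The function divides by both vector lengths, so it assumes neither document is an all-zero vector.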