Data Mining: Concepts and Techniques, Chapter 2: Getting to Know Your Data (.ppt)

Resource description

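Slide 1: Data Mining: Concepts and Techniques
Chapter 2
Jiawei Han, Micheline Kamber, and Jian Pei
University of Illinois at Urbana-Champaign; Simon Fraser University
Translated and revised by Yang Kun (杨昆)

Slide 2: Chapter 2: Getting to Know Your Data
- Data Objects and Attribute Types
- Basic Statistical Descriptions of Data
- Data Visualization
- Measuring Data Similarity and Dissimilarity
- Summary

Slide 3: Types of Data Sets
- Record
  - Relational records
  - Data matrix, e.g., numerical matrix, crosstabs
  - Document data: text documents represented as term-frequency vectors
  - Transaction data
- Graph and network
  - World Wide Web
  - Social or information networks
  - Molecular structures
- Ordered
  - Video data: sequence of images
  - Temporal data: time series
  - Sequential data: transaction sequences
  - Genetic sequence data
- Spatial, image, and multimedia
  - Spatial data: maps
  - Image data
  - Video data

Slide 4: Important Characteristics of Structured Data
- Dimensionality: curse of dimensionality
- Sparsity: only presence counts
- Resolution: patterns depend on the scale
- Distribution: centrality and dispersion

Slide 5: Data Objects
- A data set is made up of data objects.
- A data object represents an entity.
- Examples:
  - Sales database: customers, store items, sales
  - Medical database: patients, treatments
  - University database: students, professors, courses
- Also called samples, examples, instances, data points, objects, or tuples.
- Data objects are described by attributes.
- Database rows correspond to data objects; columns correspond to attributes.

Slide 6: Attributes
- Attribute (or dimension, feature, variable): a data field representing a characteristic or feature of a data object, e.g., customer_ID, name, address
- Types:
  - Nominal
  - Binary
  - Numeric (quantitative): interval-scaled, ratio-scaled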

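Slide 7: Attribute Types
- Nominal: categories, states, or "names of things"
  - Hair_color = {auburn, black, blond, brown, grey, red, white}
  - Marital status, occupation, ID numbers, zip codes
- Binary: a nominal attribute with only 2 states (0 and 1)
  - Symmetric binary: both outcomes are equally important, e.g., gender
  - Asymmetric binary: the outcomes are not equally important, e.g., a medical test (positive vs. negative)
  - Convention: assign 1 to the most important outcome (e.g., HIV positive)
- Ordinal
  - Values have a meaningful order (ranking), but the magnitude between successive values is not known
  - Size = {small, medium, large}, grades, army rankings

Slide 8: Numeric Attribute Types
- Quantity (integer or real-valued)
- Interval-scaled
  - Measured on a scale of equal-sized units
  - Values have order
  - E.g., temperature in °C or °F, calendar dates
  - No true zero point
- Ratio-scaled
  - Inherent zero point
  - We can speak of one value as being a multiple of another (10 K is twice as high as 5 K)
  - E.g., temperature in Kelvin, length, counts, monetary quantities

Slide 9: Discrete vs. Continuous Attributes
- Discrete attribute
  - Has only a finite or countably infinite set of values
  - E.g., zip codes, the set of words in a collection of documents
  - Sometimes represented as integer variables
  - Note: binary attributes are a special case of discrete attributes
- Continuous attribute
  - Has real numbers as attribute values
  - E.g., temperature, height, or weight
  - In practice, real values can only be measured and represented using a finite number of digits
  - Continuous attributes are typically represented as floating-point variables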

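Slide 11: Basic Statistical Descriptions of Data
- Motivation: to better understand the data (central tendency, variation, and spread)
- Data dispersion characteristics: median, max, min, quantiles, outliers, variance, etc.
- Numerical dimensions correspond to sorted intervals
  - Data dispersion: analyzed with multiple granularities of precision
  - Boxplot or quantile analysis on sorted intervals
- Dispersion analysis on computed measures
  - Folding measures into numerical dimensions
  - Boxplot or quantile analysis on the transformed cube

Slide 12: Distributive, Algebraic, and Holistic Measures
- From a data mining perspective, we need to examine how measures can be computed efficiently in large databases.
- Distributive measure: a measure (function) that can be computed by partitioning the data into smaller subsets, computing the measure for each subset, and merging the results to obtain the value for the whole data set. E.g., sum, count.
- Algebraic measure: a measure that can be computed by applying an algebraic function to one or more distributive measures.
- Holistic measure: a measure that must be computed on the entire data set as a whole.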
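
The difference between distributive and algebraic measures can be illustrated with a small sketch. It assumes the data are already split into in-memory partitions; the function names and the sample values are hypothetical and are not taken from the slides. Sum and count are computed per partition and merged (distributive), and the mean is then derived from those two merged values (algebraic).

```python
from typing import Iterable, Sequence, Tuple


def partial_sum_count(partition: Iterable[float]) -> Tuple[float, int]:
    """Distributive step: summarize one partition as (sum, count)."""
    values = list(partition)
    return sum(values), len(values)


def merge_partials(partials: Sequence[Tuple[float, int]]) -> Tuple[float, int]:
    """Merge per-partition (sum, count) pairs into totals for the whole data set."""
    total = sum(s for s, _ in partials)
    count = sum(c for _, c in partials)
    return total, count


def mean_from_partitions(partitions: Sequence[Iterable[float]]) -> float:
    """Algebraic measure: the mean is sum / count, a function of two distributive measures."""
    total, count = merge_partials([partial_sum_count(p) for p in partitions])
    if count == 0:
        raise ValueError("mean of an empty data set is undefined")
    return total / count


# Hypothetical data split across three partitions.
partitions = [[1.0, 2.0, 3.0], [4.0, 5.0], [6.0]]
print(mean_from_partitions(partitions))
```

A holistic measure such as the median cannot in general be obtained by merging per-partition results in this way, which is why it has to be computed over the data set as a whole.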

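Slide 13: Measuring the Central Tendency
- Mean (algebraic measure) (sample vs. population):
  $\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$ (sample), $\mu = \frac{\sum x}{N}$ (population)
  Note: n is the sample size and N is the population size.
  - Weighted arithmetic mean: $\bar{x} = \frac{\sum_{i=1}^{n} w_i x_i}{\sum_{i=1}^{n} w_i}$
  - Trimmed mean: chop off the extreme high and low values
- Median: the middle value of the ordered set if the number of values is odd; otherwise the average of the two middle values
  - For grouped data it can be estimated by interpolation: $\text{median} \approx L_1 + \left(\frac{n/2 - (\sum \text{freq})_l}{\text{freq}_{\text{median}}}\right) \times \text{width}$, where $L_1$ is the lower boundary of the median interval, $(\sum \text{freq})_l$ is the sum of the frequencies of all intervals below it, $\text{freq}_{\text{median}}$ is the frequency of the median interval, and width is the interval width
- Mode: the value that occurs most frequently (not necessarily unique; if every value occurs only once, there is no mode)
  - Unimodal, bimodal, trimodal: one, two, or three modes
  - Empirical formula: $\text{mean} - \text{mode} \approx 3 \times (\text{mean} - \text{median})$

Slide 14: Symmetric vs. Skewed Data
- Median, mean, and mode of symmetric, positively skewed, and negatively skewed data
- [Figure panels: symmetric, positively skewed, negatively skewed]
- [Slide footer: Thursday, June 13, 2019; Data Mining: Concepts and Techniques]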
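
The central-tendency measures listed above (mean, weighted mean, trimmed mean, median, mode) can be sketched with Python's standard `statistics` module. The sample values and the two helper functions are hypothetical illustrations rather than part of the slides, and `statistics.multimode` assumes Python 3.8 or later.

```python
import statistics
from typing import List, Sequence


def weighted_mean(values: Sequence[float], weights: Sequence[float]) -> float:
    """Weighted arithmetic mean: sum(w_i * x_i) / sum(w_i)."""
    if len(values) != len(weights):
        raise ValueError("values and weights must have the same length")
    return sum(w * x for x, w in zip(values, weights)) / sum(weights)


def trimmed_mean(values: Sequence[float], k: int) -> float:
    """Drop the k smallest and k largest values, then average the rest."""
    ordered = sorted(values)
    if 2 * k >= len(ordered):
        raise ValueError("k removes every value")
    kept = ordered[k:len(ordered) - k]
    return sum(kept) / len(kept)


# Hypothetical sample with one extreme high value.
sample: List[int] = [30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110]

print("mean:", statistics.mean(sample))
print("trimmed mean (k=1):", trimmed_mean(sample, 1))
print("median:", statistics.median(sample))
print("modes:", statistics.multimode(sample))
```

With an even number of values the median is the average of the two middle values, and `multimode` returns every most-frequent value, so a bimodal sample yields two entries.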

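Slide 15: Measuring the Dispersion of Data
- Quartiles, outliers, and boxplots
  - Quartiles: Q1 (25th percentile), Q3 (75th percentile)
  - Interquartile range: IQR = Q3 - Q1
  - Five-number summary: min, Q1, median, Q3, max
  - Boxplot: the ends of the box are the quartiles; the median is marked; whiskers are added; outliers are plotted individually
  - Outlier: usually a value more than 1.5 x IQR above Q3 or below Q1
- Variance and standard deviation (sample: s, population: σ)
  - Variance (algebraic, scalable computation):
    $s^2 = \frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2 = \frac{1}{n-1}\left[\sum_{i=1}^{n} x_i^2 - \frac{1}{n}\left(\sum_{i=1}^{n} x_i\right)^2\right]$
    $\sigma^2 = \frac{1}{N}\sum_{i=1}^{N}(x_i - \mu)^2 = \frac{1}{N}\sum_{i=1}^{N} x_i^2 - \mu^2$
  - The standard deviation s (or σ) is the square root of the variance s² (or σ²)

Slide 16: Boxplot Analysis
- Five-number summary of a distribution: minimum, Q1, median, Q3, maximum
- Boxplot
  - The data are represented with a box
  - The ends of the box are at the first and third quartiles, i.e., the height of the box is the IQR
  - The median is marked by a line within the box
  - Whiskers: extend to the smallest and largest data points that lie within 1.5 x IQR of the quartiles
  - Outliers: points beyond a specified outlier threshold, plotted individually

Slide 17: Visualization of Data Dispersion: 3-D Boxplots

Slide 18: Properties of the Normal Distribution Curve
- The normal (distribution) curve (μ: mean, σ: standard deviation)
  - From μ - σ to μ + σ: contains about 68% of the measurements
  - From μ - 2σ to μ + 2σ: contains about 95% of the measurements
  - From μ - 3σ to μ + 3σ: contains about 99.7% of the measurements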
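
The five-number summary, the 1.5 x IQR outlier rule, and the shortcut form of the sample variance can be sketched as follows. Quartile conventions differ between tools; this sketch assumes the "inclusive" method of `statistics.quantiles` (Python 3.8+), and the sample values are hypothetical.

```python
import statistics
from typing import List, Sequence, Tuple


def five_number_summary(values: Sequence[float]) -> Tuple[float, float, float, float, float]:
    """Return (min, Q1, median, Q3, max) using the 'inclusive' quartile method."""
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return min(values), q1, median, q3, max(values)


def iqr_outliers(values: Sequence[float], factor: float = 1.5) -> List[float]:
    """Values more than factor * IQR below Q1 or above Q3."""
    _, q1, _, q3, _ = five_number_summary(values)
    iqr = q3 - q1
    low, high = q1 - factor * iqr, q3 + factor * iqr
    return [x for x in values if x < low or x > high]


def sample_variance_shortcut(values: Sequence[float]) -> float:
    """s^2 = (sum(x^2) - (sum(x))^2 / n) / (n - 1), computable in one pass."""
    n = len(values)
    sum_x = sum(values)
    sum_x2 = sum(x * x for x in values)
    return (sum_x2 - sum_x ** 2 / n) / (n - 1)


# Hypothetical sample with one extreme high value.
sample = [30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110]

print(five_number_summary(sample))
print(iqr_outliers(sample))
print(sample_variance_shortcut(sample), statistics.variance(sample))
```

The shortcut variance only needs the running totals of x and x², which is what makes it an algebraic measure; it can lose precision on data with a large mean and a small spread, so the two-pass definition is the safer choice when the data fit in memory.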

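Slide 19: Graphic Displays of Basic Statistical Descriptions
- Boxplot: graphic display of the five-number summary
- Histogram: the x-axis shows the values, the y-axis shows the frequencies
- Quantile plot: each value $x_i$ is paired with $f_i$, indicating that approximately $100 f_i\%$ of the data are $\le x_i$
- Quantile-quantile (q-q) plot: graphs the quantiles of one univariate distribution against the corresponding quantiles of another
- Scatter plot: each pair of values is a pair of coordinates and is plotted as a point in the plane

Slide 20: Histogram Analysis
- Histogram: a graph display of tabulated frequencies, shown as bars
- It shows what proportion of cases fall into each of several categories
- It differs from a bar chart in that the area of the bar, not its height, denotes the value; this distinction is crucial when the categories are not of uniform width
- The categories are usually specified as non-overlapping intervals of some variable; the categories (bars) must be adjacent

Slide 21: Histograms Often Tell More than Boxplots
- The two histograms shown on the left may have the same boxplot representation, i.e., the same values for min, Q1, median, Q3, max
- But they have rather different data distributions

Slide 22: Quantile Plot
- Displays all of the data (allowing the user to assess both the overall behavior and unusual occurrences)
- Plots quantile information
  - For data $x_i$ sorted in increasing order, $f_i$ indicates that approximately $100 f_i\%$ of the data are below or equal to the value $x_i$; the pairs $(x_i, f_i)$ are plotted

Slide 23: Quantile-Quantile (Q-Q) Plot
- Graphs the quantiles of one univariate distribution against the corresponding quantiles of another
- View: is there a shift in going from one distribution to the other?
- The example shows the unit price of items sold at Branch 1 vs. Branch 2 for each quantile. Unit prices of items sold at Branch 1 tend to be lower than those at Branch 2.

Slide 24: Scatter Plot
- Provides a first look at bivariate data to see clusters of points, outliers, etc.
- Each pair of values is treated as a pair of coordinates and plotted as a point in the plane

Slide 25: Positively and Negatively Correlated Data
- The left half fragment is positively correlated
- The right half is negatively correlated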
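
As a sketch of how the points of a quantile plot and a q-q plot might be computed: the slides do not say how $f_i$ is obtained, so the convention $f_i = (i - 0.5)/N$ used below is an assumption (it is one common choice), and the branch price lists are hypothetical. The q-q helper only handles two samples of equal size; unequal sizes would need interpolated quantiles.

```python
from typing import List, Sequence, Tuple


def quantile_plot_points(values: Sequence[float]) -> List[Tuple[float, float]]:
    """Pair each sorted value x_i with f_i = (i - 0.5) / N (assumed convention)."""
    ordered = sorted(values)
    n = len(ordered)
    return [(x, (i - 0.5) / n) for i, x in enumerate(ordered, start=1)]


def qq_plot_points(sample_a: Sequence[float], sample_b: Sequence[float]) -> List[Tuple[float, float]]:
    """Pair the i-th smallest value of one sample with the i-th smallest of the other."""
    if len(sample_a) != len(sample_b):
        raise ValueError("this sketch only handles samples of equal size")
    return list(zip(sorted(sample_a), sorted(sample_b)))


# Hypothetical unit prices at two branches.
branch_1 = [40, 45, 52, 60, 78]
branch_2 = [48, 50, 61, 72, 90]

print(quantile_plot_points(branch_1))
print(qq_plot_points(branch_1, branch_2))
```

If every q-q pair lies on the same side of the line y = x, as it does for these made-up prices, one distribution is shifted relative to the other.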

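Slide 26: Uncorrelated Data

Slide 27: Scatter Plot Example

Slide 29: Data Visualization
- Why data visualization?
  - Gain insight into an information space by mapping data onto graphical primitives
  - Provide a qualitative overview of large data sets
  - Search for patterns, trends, structure, irregularities, and relationships among data
  - Help find interesting regions and suitable parameters for further quantitative analysis
  - Provide a visual proof of derived computer representations
- Categorization of visualization methods:
  - Pixel-oriented visualization techniques
  - Geometric projection visualization techniques
  - Icon-based visualization techniques
  - Hierarchical visualization techniques
  - Visualizing complex data and relations

Slide 30: Pixel-Oriented Visualization Techniques
- For a data set of m dimensions, create m windows on the screen, one for each dimension
- The m dimension values of a record are mapped to m pixels at the corresponding positions in the windows
- The colors of the pixels reflect the corresponding values
- [Figure panels: (a) Income, (b) Credit limit, (c) Transaction volume, (d) Age]

Slide 31: Laying Out Pixels in Circle Segments
- To save space and show the connections among multiple dimensions, space filling is often done in a circle segment
- [Figure panels: (a) Representing a data record in circle segment, (b) Laying out pixels in circle segment]

Slide 32: Pixel Plot Example

Slide 33: Geometric Projection Visualization Techniques
- Visualization of geometric transformations and projections of the data
- Methods:
  - Direct visualization
  - Scatterplots and scatterplot matrices
  - Landscapes
  - Projection pursuit technique: helps users find meaningful projections of multidimensional data
  - Prosection views (projections and sections): sections, i.e., intersections of subspaces with a high-dimensional object, can easily display structure of only low codimension
  - Hyperslice
  - Parallel coordinates

Slide 34: Direct Data Visualization
- [Figure: ribbons with twists based on vorticity]

Slide 35: Scatterplot Matrices
- Matrix of scatterplots (x-y diagrams) of the k-dimensional data; one scatterplot per unordered pair of dimensions gives a total of $(k^2 - k)/2$ distinct scatterplots
- Used by permission of M. Ward, Worcester Polytechnic Institute

Slide 36: Landscapes
- [Figure: news articles visualized as a landscape. Used by permission of B. Wright, Visible Decisions Inc.]
- Visualization of the data as a perspective landscape
- The data need to be transformed into a (possibly artificial) 2D spatial representation that preserves the characteristics of the data

Slide 37: Parallel Coordinates
- n equidistant axes, parallel to one of the screen axes, correspond to the attributes
- The axes are scaled to the [minimum, maximum] range of the corresponding attribute
- Every data item corresponds to a polygonal line that intersects each axis at the point corresponding to the item's value for that attribute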
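
The parallel-coordinates mapping (one axis per attribute, each axis scaled to the attribute's minimum and maximum, one polygonal line per data item) can be sketched without any plotting library. The function name, the output layout, and the records are hypothetical; the sketch only computes the vertices a renderer would connect.

```python
from typing import List, Sequence, Tuple


def parallel_coordinates(records: Sequence[Sequence[float]]) -> List[List[Tuple[int, float]]]:
    """Return one polyline per record as a list of (axis_index, scaled_value) vertices.

    Each attribute is min-max scaled to [0, 1] so that every axis spans
    the observed range of its attribute.
    """
    columns = list(zip(*records))
    lows = [min(col) for col in columns]
    highs = [max(col) for col in columns]
    polylines = []
    for record in records:
        line = []
        for axis, (value, low, high) in enumerate(zip(record, lows, highs)):
            # A constant attribute has no range; place it mid-axis (an arbitrary choice).
            scaled = 0.5 if high == low else (value - low) / (high - low)
            line.append((axis, scaled))
        polylines.append(line)
    return polylines


# Hypothetical records with two attributes each.
records = [(1.0, 10.0), (3.0, 30.0), (2.0, 20.0)]
for polyline in parallel_coordinates(records):
    print(polyline)
```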

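Slide 38: Parallel Coordinates of a Data Set

Slide 39: Icon-Based Visualization Techniques
- Visualization of the data values as features of icons
- Typical visualization methods:
  - Chernoff faces
  - Stick figures
- General techniques:
  - Shape coding: use shape to represent certain information encoding
  - Color icons: use color icons to encode more information
  - Tile bars: use small icons to represent the relevant feature vectors in document retrieval

Slide 40: Chernoff Faces
- A way to display variables on a two-dimensional surface, e.g., let x be eyebrow slant, y be eye size, z be nose length, etc.
- The figure shows faces produced using 10 characteristics (head eccentricity, eye size, eye spacing, eye eccentricity, pupil size, eyebrow slant, nose size, mouth shape, mouth size, and mouth opening), each assigned one of 10 possible values, generated using Mathematica (S. Dickson)
- Reference: Gonick, L. and Smith, W. The Cartoon Guide to Statistics. New York: Harper Perennial, p. 212, 1993
- Weisstein, Eric W. "Chernoff Face." From MathWorld, A Wolfram Web Resource

Slide 41: Stick Figure
- A 5-piece stick figure (a body and four limbs): two attributes are mapped to the display axes, and the remaining attributes are mapped to the angle or length of the limbs

Slide 42: Hierarchical Visualization Techniques
- Visualization of the data using a hierarchical partitioning into subspaces
- Methods:
  - Dimensional Stacking
  - Worlds-within-Worlds
  - Tree-Map
  - Cone Trees
  - InfoCube

Slide 43: Dimensional Stacking
- Partition the n-dimensional attribute space into 2-D subspaces, which are stacked into each other
- Partition the attribute value ranges into classes; the important attributes should be used on the outer levels
- Adequate for data with ordinal attributes of low cardinality
- Difficult to display with more than nine dimensions
- It is important to map the dimensions appropriately

Slide 44: Dimensional Stacking
- Used by permission of M. Ward, Worcester Polytechnic Institute
- Visualization of oil mining data, with longitude and latitude mapped to the outer x- and y-axes, and ore grade and depth mapped to the inner x- and y-axes

Slide 45: Worlds-within-Worlds
- Assign the function and the two most important parameters to the innermost world
- Fix all other parameters at constant values; draw other (1-, 2-, or 3-dimensional) worlds choosing these as the axes
- Software that uses this paradigm:
  - N-vision: dynamic interaction through data glove and stereo displays, including rotation, scaling (inner), and translation (inner/outer)
  - Auto Visual: static interaction by means of queries

Slide 46: Tree-Map
- Screen-filling method that uses a hierarchical partitioning of the screen into regions depending on the attribute values
- The x- and y-dimensions of the screen are partitioned alternately according to the attribute values (classes)
- [Figure: MSR Netscan image]

Slide 47: Tree-Map of a File System (Shneiderman)

Slide 48: InfoCube
- A 3-D visualization technique in which hierarchical information is displayed as nested semi-transparent cubes
- The outermost cubes correspond to the top-level data, while the subnodes or lower-level data are represented as smaller cubes inside the outermost cubes, and so on

Slide 49: Three-D Cone Trees
- The 3D cone tree technique can be used for hierarchies of up to a few thousand nodes
- First build a 2D circle tree that arranges its nodes in concentric circles centered on the root node
- Overlaps cannot be avoided when the tree is projected to 2D
- G. Robertson, J. Mackinlay, S. Card. "Cone Trees: Animated 3D Visualizations of Hierarchical Information", ACM SIGCHI '91
- Graph from the Nadeau Software Consulting website: a visualization of social network data that models the way an infection spreads from one person to the next

Slide 50: Visualizing Complex Data and Relations
- Visualizing non-numerical data: text and social networks
- Tag cloud: visualizing user-generated tags
  - The importance of a tag is represented by font size/color
- Besides text data, there are also methods to visualize relationships, such as visualizing social networks
- [Figure: Newsmap: Google News Stories in 2005]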

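Slide 52: Similarity and Dissimilarity
- Similarity
  - Numerical measure of how alike two data objects are
  - The value is higher when the objects are more alike
  - Often falls in the range [0, 1]
- Dissimilarity (e.g., distance)
  - Numerical measure of how different two data objects are
  - Lower when objects are more alike
  - Minimum dissimilarity is often 0
  - Upper limit varies
- Proximity refers to a similarity or dissimilarity

Slide 53: Data Matrix and Dissimilarity Matrix
- Data matrix
  - n data points with p dimensions
  - Two modes
- Dissimilarity matrix
  - n data points, but registers only the distance
  - A triangular matrix
  - Single mode

Slide 54: Proximity Measure for Nominal Attributes
- A nominal attribute can take 2 or more states, e.g., red, yellow, blue, green (a generalization of a binary attribute)
- Method 1: simple matching
  $d(i, j) = \frac{p - m}{p}$, where m is the number of matches among the p variables and p is the total number of variables
- Method 2: use a large number of binary attributes
  - Create a new binary attribute for each of the M nominal states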
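
Both methods for nominal attributes can be sketched in a few lines. Simple matching is taken here as d(i, j) = (p - m) / p, with m the number of matching attributes and p the total number of attributes; the function names and the example objects are hypothetical.

```python
from typing import List, Sequence


def simple_matching_dissimilarity(obj_i: Sequence[str], obj_j: Sequence[str]) -> float:
    """Method 1: d(i, j) = (p - m) / p for two objects described by p nominal attributes."""
    if len(obj_i) != len(obj_j):
        raise ValueError("objects must have the same number of attributes")
    p = len(obj_i)
    m = sum(1 for a, b in zip(obj_i, obj_j) if a == b)
    return (p - m) / p


def to_binary_attributes(value: str, states: Sequence[str]) -> List[int]:
    """Method 2: encode one nominal value as one binary attribute per state."""
    if value not in states:
        raise ValueError(f"unknown state: {value!r}")
    return [1 if value == state else 0 for state in states]


# Hypothetical objects with three nominal attributes each.
print(simple_matching_dissimilarity(["red", "small", "round"], ["red", "large", "round"]))
print(to_binary_attributes("blue", ["red", "yellow", "blue", "green"]))
```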

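Slide 55: Proximity Measure for Binary Attributes
- A contingency table for binary data:

                      Object j
                      1        0        sum
  Object i    1       q        r        q + r
              0       s        t        s + t
              sum     q + s    r + t    p

- Distance measure for symmetric binary variables: $d(i, j) = \frac{r + s}{q + r + s + t}$
- Distance measure for asymmetric binary variables: $d(i, j) = \frac{r + s}{q + r + s}$
- Jaccard coefficient (similarity measure for asymmetric binary variables): $sim_{Jaccard}(i, j) = \frac{q}{q + r + s}$
- Note: the Jaccard coefficient is the same as "coherence": $coherence(i, j) = \frac{sup(i, j)}{sup(i) + sup(j) - sup(i, j)} = \frac{q}{(q + r) + (q + s) - q}$

Slide 56: Dissimilarity between Binary Variables
- Example
  - Gender is a symmetric attribute
  - The remaining attributes are asymmetric binary
  - Let the values Y and P be 1, and the value N be 0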
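
A sketch of the binary proximity measures, using the usual contingency counts: q is the number of attributes where both objects are 1, r where object i is 1 and object j is 0, s where i is 0 and j is 1, and t where both are 0. The two example vectors are hypothetical encodings in the spirit of the example (Y and P mapped to 1, N mapped to 0); they are not the values from the original slide's table.

```python
from typing import Sequence, Tuple


def contingency_counts(obj_i: Sequence[int], obj_j: Sequence[int]) -> Tuple[int, int, int, int]:
    """Return (q, r, s, t) for two equally long 0/1 vectors."""
    if len(obj_i) != len(obj_j):
        raise ValueError("objects must have the same number of attributes")
    q = r = s = t = 0
    for a, b in zip(obj_i, obj_j):
        if a not in (0, 1) or b not in (0, 1):
            raise ValueError("attributes must be coded as 0 or 1")
        if a == 1 and b == 1:
            q += 1
        elif a == 1 and b == 0:
            r += 1
        elif a == 0 and b == 1:
            s += 1
        else:
            t += 1
    return q, r, s, t


def symmetric_binary_dissimilarity(obj_i: Sequence[int], obj_j: Sequence[int]) -> float:
    """d = (r + s) / (q + r + s + t): both outcomes count equally."""
    q, r, s, t = contingency_counts(obj_i, obj_j)
    return (r + s) / (q + r + s + t)


def asymmetric_binary_dissimilarity(obj_i: Sequence[int], obj_j: Sequence[int]) -> float:
    """d = (r + s) / (q + r + s): negative matches (t) are ignored."""
    q, r, s, _ = contingency_counts(obj_i, obj_j)
    if q + r + s == 0:
        raise ValueError("undefined when neither object has a 1")
    return (r + s) / (q + r + s)


def jaccard_similarity(obj_i: Sequence[int], obj_j: Sequence[int]) -> float:
    """sim = q / (q + r + s) = 1 - asymmetric dissimilarity."""
    return 1.0 - asymmetric_binary_dissimilarity(obj_i, obj_j)


# Hypothetical asymmetric test results for two patients (1 = Y or P, 0 = N).
patient_a = (1, 0, 1, 0, 0, 0)
patient_b = (1, 0, 1, 0, 1, 0)
print(contingency_counts(patient_a, patient_b))
print(asymmetric_binary_dissimilarity(patient_a, patient_b))
```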

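Slide 57: Standardizing Numeric Data
- Z-score: $z = \frac{x - \mu}{\sigma}$
  - x: raw score to be standardized, μ: mean of the population, σ: standard deviation
  - The distance between the raw score and the population mean, in units of the standard deviation
  - Negative ("-") when the raw score is below the mean, positive ("+") when it is above
- An alternative way: calculate the mean absolute deviation
  $s_f = \frac{1}{n}\left(|x_{1f} - m_f| + |x_{2f} - m_f| + \cdots + |x_{nf} - m_f|\right)$, where $m_f = \frac{1}{n}\left(x_{1f} + x_{2f} + \cdots + x_{nf}\right)$
  - Standardized measure (z-score): $z_{if} = \frac{x_{if} - m_f}{s_f}$
- Using the mean absolute deviation is more robust than using the standard deviation

Slide 58: Example: Data Matrix and Dissimilarity Matrix
- Data matrix
- Dissimilarity matrix (with Euclidean distance)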
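
The two standardizations can be compared side by side in a short sketch. It assumes the population standard deviation (dividing by n) for the z-score, and the sample values are hypothetical.

```python
import math
from typing import List, Sequence


def z_scores(values: Sequence[float]) -> List[float]:
    """z = (x - mu) / sigma, with sigma the population standard deviation."""
    n = len(values)
    mu = sum(values) / n
    sigma = math.sqrt(sum((x - mu) ** 2 for x in values) / n)
    if sigma == 0:
        raise ValueError("all values are identical; z-score is undefined")
    return [(x - mu) / sigma for x in values]


def mad_standardized(values: Sequence[float]) -> List[float]:
    """z_if = (x_if - m_f) / s_f, with s_f the mean absolute deviation."""
    n = len(values)
    m_f = sum(values) / n
    s_f = sum(abs(x - m_f) for x in values) / n
    if s_f == 0:
        raise ValueError("all values are identical; standardization is undefined")
    return [(x - m_f) / s_f for x in values]


# Hypothetical attribute values.
values = [2, 4, 4, 4, 5, 5, 7, 9]
print(z_scores(values))
print(mad_standardized(values))
```

Because deviations are not squared in the mean absolute deviation, a single extreme value inflates the scale less than it inflates the standard deviation, so the standardized score of an outlier stays more visible.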

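Slide 59: Distance on Numeric Data: Minkowski Distance
- Minkowski distance: a popular distance measure
  $d(i, j) = \sqrt[h]{|x_{i1} - x_{j1}|^h + |x_{i2} - x_{j2}|^h + \cdots + |x_{ip} - x_{jp}|^h}$
  where $i = (x_{i1}, x_{i2}, \ldots, x_{ip})$ and $j = (x_{j1}, x_{j2}, \ldots, x_{jp})$ are two p-dimensional data objects, and h is the order (the distance so defined is also called the L-h norm)
- Properties
  - $d(i, j) > 0$ if $i \ne j$, and $d(i, i) = 0$ (positive definiteness)
  - $d(i, j) = d(j, i)$ (symmetry)
  - $d(i, j) \le d(i, k) + d(k, j)$ (triangle inequality)
- A distance that satisfies these properties is a metric

Slide 60: Special Cases of Minkowski Distance
- h = 1: Manhattan (city block, $L_1$ norm) distance
  - E.g., the Hamming distance: the number of bits that are different between two binary vectors
  $d(i, j) = |x_{i1} - x_{j1}| + |x_{i2} - x_{j2}| + \cdots + |x_{ip} - x_{jp}|$
- h = 2: Euclidean ($L_2$ norm) distance
  $d(i, j) = \sqrt{|x_{i1} - x_{j1}|^2 + |x_{i2} - x_{j2}|^2 + \cdots + |x_{ip} - x_{jp}|^2}$
- $h \to \infty$: "supremum" ($L_{max}$ norm, $L_\infty$ norm) distance
  - This is the maximum difference between any component (attribute) of the vectors
  $d(i, j) = \lim_{h \to \infty}\left(\sum_{f=1}^{p} |x_{if} - x_{jf}|^h\right)^{1/h} = \max_{f} |x_{if} - x_{jf}|$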
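
A minimal sketch of the Minkowski distance and its three special cases. The order h = infinity is handled explicitly as the maximum component difference, and the example points are hypothetical.

```python
import math
from typing import Sequence


def minkowski_distance(x: Sequence[float], y: Sequence[float], h: float) -> float:
    """L-h norm distance between two p-dimensional points.

    h = 1 gives the Manhattan distance, h = 2 the Euclidean distance,
    and h = math.inf the supremum distance.
    """
    if len(x) != len(y):
        raise ValueError("points must have the same dimensionality")
    if h < 1:
        raise ValueError("h must be at least 1")
    diffs = [abs(a - b) for a, b in zip(x, y)]
    if math.isinf(h):
        return max(diffs)
    return sum(d ** h for d in diffs) ** (1.0 / h)


# Hypothetical two-dimensional points.
p1, p2 = (1, 2), (3, 5)
print(minkowski_distance(p1, p2, 1))         # Manhattan
print(minkowski_distance(p1, p2, 2))         # Euclidean
print(minkowski_distance(p1, p2, math.inf))  # supremum

# On binary vectors, the h = 1 case counts differing bits (Hamming distance).
print(minkowski_distance((1, 0, 1, 1), (0, 0, 1, 0), 1))
```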

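Slide 61: Example: Minkowski Distance
- Dissimilarity matrices: Manhattan ($L_1$), Euclidean ($L_2$), Supremum

Slide 62: Ordinal Variables
- An ordinal variable can be discrete or continuous
- Order is important, e.g., rank
- Can be treated like interval-scaled variables:
  - Replace $x_{if}$ by its rank $r_{if} \in \{1, \ldots, M_f\}$
  - Map the range of each variable onto [0, 1] by replacing the i-th object in the f-th variable by $z_{if} = \frac{r_{if} - 1}{M_f - 1}$
  - Compute the dissimilarity using methods for interval-scaled variables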
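
The treatment of ordinal variables can be sketched as follows: each value is replaced by its rank r in 1..M, mapped to z = (r - 1) / (M - 1) in [0, 1], and the z values are then compared like interval-scaled values. Using the absolute difference as the distance, and the state list itself, are assumptions made for illustration.

```python
from typing import Sequence


def ordinal_to_unit_interval(value: str, ordered_states: Sequence[str]) -> float:
    """Map an ordinal value to z = (r - 1) / (M - 1), where r is its 1-based rank."""
    m = len(ordered_states)
    if m < 2:
        raise ValueError("need at least two ordered states")
    rank = list(ordered_states).index(value) + 1  # raises ValueError for unknown states
    return (rank - 1) / (m - 1)


def ordinal_dissimilarity(a: str, b: str, ordered_states: Sequence[str]) -> float:
    """Distance between two ordinal values after mapping them onto [0, 1]."""
    return abs(ordinal_to_unit_interval(a, ordered_states)
               - ordinal_to_unit_interval(b, ordered_states))


# Hypothetical ordered states for a size attribute.
sizes = ["small", "medium", "large"]
print([ordinal_to_unit_interval(s, sizes) for s in sizes])
print(ordinal_dissimilarity("small", "large", sizes))
```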

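Slide 63: Attributes of Mixed Type
- A database may contain all attribute types: nominal, symmetric binary, asymmetric binary, numeric, ordinal
- One may use a weighted formula to combine their effects:
  $d(i, j) = \frac{\sum_{f=1}^{p} \delta_{ij}^{(f)} d_{ij}^{(f)}}{\sum_{f=1}^{p} \delta_{ij}^{(f)}}$
  where the indicator $\delta_{ij}^{(f)}$ is 0 if $x_{if}$ or $x_{jf}$ is missing, or if $x_{if} = x_{jf} = 0$ and f is asymmetric binary, and 1 otherwise
  - f is binary or nominal: $d_{ij}^{(f)} = 0$ if $x_{if} = x_{jf}$, or $d_{ij}^{(f)} = 1$ otherwise
  - f is numeric: use the normalized distance $d_{ij}^{(f)} = \frac{|x_{if} - x_{jf}|}{\max_h x_{hf} - \min_h x_{hf}}$
  - f is ordinal: compute the ranks $r_{if}$ and $z_{if} = \frac{r_{if} - 1}{M_f - 1}$, and treat $z_{if}$ as interval-scaled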
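
A sketch of the combined dissimilarity for mixed attribute types. Each attribute f contributes a dissimilarity d_f in [0, 1] and an indicator that is 0 when a value is missing (or when both values are 0 for an asymmetric binary attribute) and 1 otherwise; the result is the sum of the indicated d_f divided by the sum of the indicators. The schema format (a list of dictionaries with a "kind" key) and the example objects are hypothetical.

```python
from typing import Any, Dict, Optional, Sequence


def mixed_dissimilarity(obj_i: Sequence[Optional[Any]],
                        obj_j: Sequence[Optional[Any]],
                        schema: Sequence[Dict[str, Any]]) -> float:
    """Indicator-weighted average of per-attribute dissimilarities."""
    numerator = 0.0
    denominator = 0
    for a, b, spec in zip(obj_i, obj_j, schema):
        kind = spec["kind"]
        if a is None or b is None:
            continue  # indicator = 0: missing value
        if kind == "asymmetric_binary" and a == 0 and b == 0:
            continue  # indicator = 0: a shared 0 carries no information
        if kind in ("nominal", "symmetric_binary", "asymmetric_binary"):
            d = 0.0 if a == b else 1.0
        elif kind == "numeric":
            # Normalized distance: |a - b| divided by the attribute's range.
            d = abs(a - b) / (spec["max"] - spec["min"])
        elif kind == "ordinal":
            states = spec["states"]
            # index() is rank - 1, so this is (r - 1) / (M - 1).
            z_a = states.index(a) / (len(states) - 1)
            z_b = states.index(b) / (len(states) - 1)
            d = abs(z_a - z_b)
        else:
            raise ValueError(f"unknown attribute kind: {kind!r}")
        numerator += d
        denominator += 1
    if denominator == 0:
        raise ValueError("no comparable attributes")
    return numerator / denominator


# Hypothetical schema: color (nominal), age (numeric, range 20..60), size (ordinal).
schema = [
    {"kind": "nominal"},
    {"kind": "numeric", "min": 20.0, "max": 60.0},
    {"kind": "ordinal", "states": ["small", "medium", "large"]},
]
print(mixed_dissimilarity(("red", 30.0, "small"), ("blue", 50.0, "large"), schema))
```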

Slide 64: Cosine Similarity
- A document can be represented by thousands of attributes, each recording the frequency of a particular word (such as a keyword) or phrase in the document.
- Other vector objects: gene features in micro-arrays, ...
- Applications: information retrieval, biologic taxonomy, gene feature mapping, ...
- Cosine measure: if $d_1$ and $d_2$ are two vectors (e.g., term-frequency vectors), then
  $\cos(d_1, d_2) = \frac{d_1 \cdot d_2}{\|d_1\| \, \|d_2\|}$
  where $\cdot$ indicates the vector dot product and $\|d\|$ is the length of vector d

Slide 65: Example: Cosine Similarity
- $\cos(d_1, d_2) = \frac{d_1 \cdot d_2}{\|d_1\| \, \|d_2\|}$, where $\cdot$ indicates the vector dot product and $\|d\|$ is the length of vector d
- Example: find the similarity between documents 1 and 2.
  d1 = (5, 0, 3, 0, 2, 0, 0, 2, 0, 0)
  d2 = (3, 0, 2, 0, 1, 1, 0, 1, 0, 1)
  d1 · d2 = 5*3 + 0*0 + 3*2 + 0*0 + 2*1 + 0*1 + 0*0 + 2*1 + 0*0 + 0*1 = 25
  ||d1|| = (5*5 + 0*0 + 3*3 + 0*0 + 2*2 + 0*0 + 0*0 + 2*2 + 0*0 + 0*0)^0.5 = (42)^0.5 = 6.481
  ||d2|| = (3*3 + 0*0 + 2*2 + 0*0 + 1*1 + 1*1 + 0*0 + 1*1 + 0*0 + 1*1)^0.5 = (17)^0.5 = 4.123
  cos(d1, d2) = 25 / (6.481 * 4.123) = 0.94
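
The worked cosine-similarity example can be reproduced with a short sketch. The vectors d1 and d2 are the ones given in the example; the function name is hypothetical.

```python
import math
from typing import Sequence


def cosine_similarity(d1: Sequence[float], d2: Sequence[float]) -> float:
    """cos(d1, d2) = (d1 . d2) / (||d1|| * ||d2||)."""
    if len(d1) != len(d2):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(d1, d2))
    norm1 = math.sqrt(sum(a * a for a in d1))
    norm2 = math.sqrt(sum(b * b for b in d2))
    if norm1 == 0 or norm2 == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm1 * norm2)


# Term-frequency vectors from the example.
d1 = (5, 0, 3, 0, 2, 0, 0, 2, 0, 0)
d2 = (3, 0, 2, 0, 1, 1, 0, 1, 0, 1)
print(round(cosine_similarity(d1, d2), 2))  # 0.94
```

Because the measure depends only on the angle between the vectors, scaling a document's term counts by a constant (for example, concatenating the document with itself) leaves the similarity unchanged.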

Slide 66: Summary
- Data attribute types: nominal, binary, ordinal, interval-scaled, ratio-scaled
- Many types of data sets, e.g., numerical, text, graph, Web, image
- Gain insight into the data by:
  - Basic statistical data description: central tendency, dispersion, graphical displays
  - Data visualization: map data onto graphical primitives
  - Measuring data similarity
- The above steps are the beginning of data preprocessing
- Many methods have been developed, but this is still an active area of research
