Cluster Analysis

Outline
What Is Cluster Analysis?
Types of Data in Cluster Analysis
A Categorization of Major Clustering Methods
Typical Clustering Methods
Outlier Analysis

What Is Cluster Analysis?
Clustering is the process of grouping data into classes, or clusters, so that objects are:
Similar to one another within the same cluster
Dissimilar to the objects in other clusters

Examples of Clustering Applications
Text clustering: web search engines
Marketing: helping marketers discover distinct groups in their customer bases
Insurance: identifying groups of motor insurance policy holders with a high
average claim cost
Land use: identifying areas of similar land use in an earth observation database
City planning: identifying groups of houses according to their house type, value, and geographical location

Data Structures
Data matrix (object-by-variable): n objects described by p variables.
Dissimilarity matrix (object-by-object): d(i,j) is the measured difference or dissimilarity between object i and object j, a nonnegative number.

Types of Data in Cluster Analysis
Interval-scaled variables
Binary discrete variables
Multi-valued discrete attributes

Interval-scaled variables
Interval-scaled variables are continuous measurements, such as distance, age, income, and weight.
Standardization: different measurement units can lead to different clustering structures, so variables are standardized before clustering.

Similarity and Dissimilarity Between Objects
Distances are normally used to measure the similarity or dissimilarity between two data objects. Some popular ones include:
Minkowski distance
Manhattan distance
Euclidean distance

Minkowski distance:
\[ d(i,j) = \left( |x_{i1} - x_{j1}|^q + |x_{i2} - x_{j2}|^q + \cdots + |x_{ip} - x_{jp}|^q \right)^{1/q} \]
where i = (x_{i1}, x_{i2}, ..., x_{ip}) and j = (x_{j1}, x_{j2}, ..., x_{jp}) are two p-dimensional data objects, and q is a positive integer.
If q = 1, d is the Manhattan distance:
\[ d(i,j) = |x_{i1} - x_{j1}| + |x_{i2} - x_{j2}| + \cdots + |x_{ip} - x_{jp}| \]
If q = 2, d is the Euclidean distance:
\[ d(i,j) = \sqrt{ |x_{i1} - x_{j1}|^2 + |x_{i2} - x_{j2}|^2 + \cdots + |x_{ip} - x_{jp}|^2 } \]

Binary discrete variables
A binary variable has only two states: 0 or 1.
Symmetric: for example, gender
Asymmetric: for example, a disease test

Dissimilarity Between Binary Variables
Contingency table for object i and object j:

                  Object j
                  1      0      sum
Object i    1     a      b      a+b
            0     c      d      c+d
            sum   a+c    b+d    p

a is the number of variables that equal 1 for both object i and object j, b is the number that equal 1 for i and 0 for j, c is the number that equal 0 for i and 1 for j, and d is the number that equal 0 for both.
p is the total number of variables, p = a + b + c + d.
Symmetric binary variables:
\[ d(i,j) = \frac{b + c}{a + b + c + d} \]
Asymmetric binary variables (0-0 matches are ignored):
\[ d(i,j) = \frac{b + c}{a + b + c} \]

Example: gender is a symmetric attribute; the remaining attributes are asymmetric binary. Let the values Y and P be set to 1, and the value N be set to 0.

Multi-valued discrete attributes
A generalization of the binary variable in that it can take more than two states, such as a map-color variable: red, yellow, blue, green, and pink.

Dissimilarity
Method 1: simple matching approach
\[ d(i,j) = \frac{p - m}{p} \]
where m is the number of matches and p is the total number of variables.
Method 2: use a large number of binary variables, one for each state.

Variables of mixed types
Suppose that the data set contains p variables of mixed type. The dissimilarity d(i,j) between objects i and j is defined as:
\[ d(i,j) = \frac{\sum_{f=1}^{p} \delta_{ij}^{(f)} d_{ij}^{(f)}}{\sum_{f=1}^{p} \delta_{ij}^{(f)}} \]
where \delta_{ij}^{(f)} = 0 if either (1) x_{if} or x_{jf} is missing, or (2) x_{if} = x_{jf} = 0 and variable f is asymmetric binary; otherwise \delta_{ij}^{(f)} = 1.
The contribution d_{ij}^{(f)} depends on the type of variable f:
f is binary or multi-valued discrete: d_{ij}^{(f)} = 0 if x_{if} = x_{jf}; otherwise d_{ij}^{(f)} = 1.
f is interval-based: use the normalized distance
\[ d_{ij}^{(f)} = \frac{|x_{if} - x_{jf}|}{\max_h x_{hf} - \min_h x_{hf}} \]

Major Clustering Approaches
Partitioning method
Hierarchical method
Density-based method
Grid-based method
Model-based method

Partitioning method
Given a database of n objects and k, the
number of clusters to form, a partitioning algorithm organizes the objects into k partitions (k ≤ n), where each partition represents a cluster.

The K-Means Clustering Method
Example (figure omitted; both axes run from 0 to 10), with K = 2:
1. Arbitrarily choose K objects as the initial cluster centers.
2. Assign each object to the most similar center.
3. Update the cluster means.
4. Reassign the objects and update the means again, repeating until the squared-error criterion function converges.

Squared-error criterion:
\[ E = \sum_{i=1}^{k} \sum_{p \in C_i} |p - m_i|^2 \]
E is the sum of the squared error for all objects in the database, p is the point representing a given object, and m_i is the mean of cluster C_i.

Comments on the K-Means Method
Weaknesses:
Applicable only when the mean is defined; what about categorical data?
Need to specify k, the number of clusters, in advance
Unable to handle noisy data and outliers
Not suitable for discovering clusters with non-convex shapes

K-Medoids
Instead of taking the mean value of the objects in a cluster as a reference
point, a medoid can be used, which is the most centrally located object in a cluster.

Hierarchical method
A hierarchical clustering method works by grouping data objects into a tree of clusters. Hierarchical clustering methods can be further classified into agglomerative (bottom-up merging) and divisive (top-down splitting) hierarchical clustering.

Four widely used measures for the distance between clusters:
Minimum distance:
\[ d_{min}(C_i, C_j) = \min_{p \in C_i,\, p' \in C_j} |p - p'| \]
Maximum distance:
\[ d_{max}(C_i, C_j) = \max_{p \in C_i,\, p' \in C_j} |p - p'| \]
Mean distance:
\[ d_{mean}(C_i, C_j) = |m_i - m_j| \]
Average distance:
\[ d_{avg}(C_i, C_j) = \frac{1}{n_i n_j} \sum_{p \in C_i} \sum_{p' \in C_j} |p - p'| \]
where m_i is the mean of cluster C_i and n_i is the number of objects in C_i.

Weakness of hierarchical methods
Inability to adjust once a merge or split decision has been executed.
One promising direction for improving the clustering quality of hierarchical methods is to integrate hierarchical clustering with other clustering techniques for multiple-phase clustering.

Density-based method
To discover clusters with arbitrary shape, density-based clustering methods have been developed. The general
idea is to continue growing a given cluster as long as the density (number of objects or data points) in the “neighborhood” exceeds some threshold.
Examples: DBSCAN, OPTICS

Grid-based method
Grid-based methods quantize the object space into a finite number of cells that form a grid structure.
Advantage: fast processing time, dependent only on the number of cells in each dimension.
Example: STING

Model-based method
Model-based methods hypothesize a model for each of the clusters and find the best fit of the data to the given model.
Two major approaches: a statistical approach or a neural network approach.

What is an outlier?
There often exist data objects that do not comply with the general behavior or model of the data. Such data objects, which are grossly
different from or inconsistent with the remaining set of data, are called outliers.

Importance of outlier detection
Are outliers useless? Must they be discarded? No, that assumption is an error.

Outlier detection applications
Credit card fraud detection
Telecom fraud detection
Medical analysis

Outlier detection methods
Graphical method
Clustering-based outlier detection

Requirements of Clustering in Data Mining
Scalability
Ability to deal with different types of attributes
Discovery of clusters with arbitrary shape
Minimal requirements for domain knowledge to determine input parameters
Ability to deal with noise and outliers
Insensitivity to the order of input records
High dimensionality
Constraint-based clustering
Interpretability and usability
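
Sketch: Minkowski distance
The distance slide covers the Minkowski distance and its two special cases, Manhattan (q = 1) and Euclidean (q = 2). A minimal Python sketch follows; the function name minkowski and the two sample objects are illustrative assumptions, not part of the slides.

```python
def minkowski(x, y, q):
    """Minkowski distance of order q between two equal-length numeric sequences."""
    if len(x) != len(y):
        raise ValueError("objects must have the same number of variables")
    if q < 1:
        raise ValueError("q must be a positive integer")
    return sum(abs(a - b) ** q for a, b in zip(x, y)) ** (1.0 / q)


# Two made-up 2-dimensional objects (assumption, for illustration only).
obj_i = (1.0, 2.0)
obj_j = (4.0, 6.0)

manhattan = minkowski(obj_i, obj_j, 1)  # |1-4| + |2-6|
euclidean = minkowski(obj_i, obj_j, 2)  # sqrt(3^2 + 4^2)
print(manhattan, euclidean)
```

A single function covers both special cases, so changing the metric only means changing q.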
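
Sketch: dissimilarity between binary variables
The binary-variable slides count matches and mismatches between two objects. The sketch below assumes the usual contingency counts: a for 1/1, b for 1/0, c for 0/1, d for 0/0. It also assumes the common convention that asymmetric variables ignore 0/0 matches. The function names and the two sample vectors are hypothetical.

```python
def binary_counts(x, y):
    """Return (a, b, c, d): counts of 1/1, 1/0, 0/1 and 0/0 positions."""
    a = b = c = d = 0
    for xi, yi in zip(x, y):
        if xi == 1 and yi == 1:
            a += 1
        elif xi == 1 and yi == 0:
            b += 1
        elif xi == 0 and yi == 1:
            c += 1
        else:
            d += 1
    return a, b, c, d


def binary_dissimilarity(x, y, symmetric):
    """(b + c) / (a + b + c + d) if symmetric, else (b + c) / (a + b + c)."""
    a, b, c, d = binary_counts(x, y)
    denom = a + b + c + d if symmetric else a + b + c
    return (b + c) / denom if denom else 0.0


# Hypothetical objects with Y/P already coded as 1 and N as 0.
obj_i = [1, 0, 1, 0, 0, 0]
obj_j = [1, 0, 1, 0, 1, 0]

print(binary_counts(obj_i, obj_j))                          # a=2, b=0, c=1, d=3
print(binary_dissimilarity(obj_i, obj_j, symmetric=False))  # 1 / 3
print(binary_dissimilarity(obj_i, obj_j, symmetric=True))   # 1 / 6
```

The asymmetric variant gives a larger dissimilarity here because the three 0/0 positions no longer count as agreement.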
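
Sketch: dissimilarity for variables of mixed types
The mixed-type slides combine per-variable contributions, skipping a variable when a value is missing or when both values are 0 on an asymmetric binary variable. The sketch below is one possible reading of that rule. The kind labels, the use of None for a missing value, and the precomputed ranges argument (max minus min of each interval variable) are assumptions made for illustration.

```python
def mixed_dissimilarity(x, y, kinds, ranges):
    """Weighted average of per-variable dissimilarities for mixed-type objects.

    kinds[f]  : "interval", "nominal", "symmetric" or "asymmetric" (assumed labels)
    ranges[f] : max - min of variable f over the data set (interval variables only)
    None in x or y marks a missing value.
    """
    numerator = 0.0
    denominator = 0
    for f, (xf, yf) in enumerate(zip(x, y)):
        if xf is None or yf is None:
            continue  # indicator is 0: a value is missing
        if kinds[f] == "asymmetric" and xf == 0 and yf == 0:
            continue  # indicator is 0: 0/0 match on an asymmetric binary variable
        if kinds[f] == "interval":
            contribution = abs(xf - yf) / ranges[f] if ranges[f] else 0.0
        else:
            contribution = 0.0 if xf == yf else 1.0
        numerator += contribution
        denominator += 1
    return numerator / denominator if denominator else 0.0


# Hypothetical objects: (age, map color, gender code, test result).
kinds = ("interval", "nominal", "symmetric", "asymmetric")
ranges = (50, None, None, None)  # assumed age range of 50 in the data set
obj_i = (30, "red", 1, 0)
obj_j = (40, "blue", 1, 0)

# Contributions: 10/50 for age, 1 for color, 0 for gender; the test variable is skipped.
print(mixed_dissimilarity(obj_i, obj_j, kinds, ranges))  # about 0.4
```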
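
Sketch: the k-means loop
The k-means slides describe an assign-and-update loop that stops when the squared-error criterion no longer changes. The sketch below is a simplified illustration, not a production implementation. It assumes squared Euclidean distance, takes the first k objects as initial centers instead of an arbitrary choice so that the run is reproducible, and stops when the assignment is unchanged. All names and sample points are made up.

```python
def sq_dist(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeans(points, k, max_iter=100):
    """Return (centers, assignment, squared_error) for a list of point tuples."""
    centers = [tuple(p) for p in points[:k]]  # simplification: first k objects
    assignment = None
    for _ in range(max_iter):
        # Assign each object to its most similar (closest) center.
        new_assignment = [
            min(range(k), key=lambda c: sq_dist(p, centers[c])) for p in points
        ]
        if new_assignment == assignment:
            break  # no object changed cluster, so the criterion has converged
        assignment = new_assignment
        # Update each cluster mean; an empty cluster keeps its old center.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:
                centers[c] = tuple(sum(v) / len(members) for v in zip(*members))
    error = sum(sq_dist(p, centers[a]) for p, a in zip(points, assignment))
    return centers, assignment, error


# Two visually separated groups of made-up points.
data = [(1, 1), (1, 2), (2, 1), (8, 8), (9, 8), (8, 9)]
centers, labels, error = kmeans(data, k=2)
print(labels)   # [0, 0, 0, 1, 1, 1]
print(centers)  # roughly (1.33, 1.33) and (8.33, 8.33)
print(error)    # about 2.667
```

Even with both initial centers in the same group, the second pass moves (1, 2) to the first cluster. A single extreme point would pull a mean away from its group, which is the outlier weakness listed on the slides.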
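
Sketch: distances between clusters
The hierarchical-method slides name four measures for the distance between two clusters: minimum, maximum, mean, and average. The sketch below assumes Euclidean distance between points and the common definitions: closest pair, farthest pair, distance between cluster means, and average over all cross-cluster pairs. Function names and sample clusters are hypothetical.

```python
import math


def euclid(p, q):
    """Euclidean distance between two points."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def mean_point(cluster):
    """Coordinate-wise mean of a non-empty cluster."""
    return tuple(sum(v) / len(cluster) for v in zip(*cluster))


def d_min(ci, cj):
    return min(euclid(p, q) for p in ci for q in cj)


def d_max(ci, cj):
    return max(euclid(p, q) for p in ci for q in cj)


def d_mean(ci, cj):
    return euclid(mean_point(ci), mean_point(cj))


def d_avg(ci, cj):
    return sum(euclid(p, q) for p in ci for q in cj) / (len(ci) * len(cj))


# Two made-up clusters of two points each.
cluster_i = [(0, 0), (0, 2)]
cluster_j = [(3, 0), (3, 2)]

print(d_min(cluster_i, cluster_j))   # 3.0
print(d_max(cluster_i, cluster_j))   # sqrt(13)
print(d_mean(cluster_i, cluster_j))  # 3.0
print(d_avg(cluster_i, cluster_j))   # (3 + 3 + 2 * sqrt(13)) / 4
```

An agglomerative method would merge the pair of clusters with the smallest value of whichever measure is chosen.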