Data Mining Lab Report

Harbin Institute of Technology
Data Mining Theory and Algorithms: Lab Report
(Fall Semester 2015)

Course code: S1300019C
Instructor: 邹兆年
Student name: 谢浩哲
Student ID: 15S103172
School: School of Computer Science and Technology

Harbin Institute of Technology    Page 1 of 10    Designed by 谢浩哲

NOTE: All code referenced in this report is open source on GitHub: https:/

I. Experiment Content

NOTE: The idea behind each implementation is described in the next section.

1. K-Means

K-means clustering is a method of vector quantization, originally from signal processing, that is popular for cluster analysis in data mining. It aims to partition n observations into k clusters in which each observation belongs to the cluster with the nearest mean, which serves as a prototype of the cluster.

2. AGNES (hierarchical clustering)

AGNES, also known as agglomerative hierarchical clustering, groups the data one step at a time on the basis of the smallest of all pairwise distances between data points. After each merge the distances are recalculated, but which distance should be used once groups have been formed? Several methods are available, including:

- Single linkage (nearest distance)
- Complete linkage (farthest distance)
- Average linkage (average distance)
- Centroid distance
- Ward's method (the sum of squared Euclidean distances is minimized)

3. DBSCAN

Density-based spatial clustering of applications with noise (DBSCAN) is a density-based clustering algorithm: given a set of points in some space, it groups together points that are closely packed (points with many nearby neighbors) and marks as outliers the points that lie alone in low-density regions (whose nearest neighbors are too far away). DBSCAN is one of the most common clustering algorithms and among the most cited in the scientific literature.

II. Experiment Design

1. K-Means

Algorithm idea: pick any k points from the point set as the initial centers. Compare every point with the k centers and assign it to the cluster built around the nearest center. Once all points are assigned, recompute the center of each cluster. Repeat this process until the centers no longer change.

Program flowchart: (figure not preserved in this text)

Core code:

    public class KMeans {
        public Cluster[] getClusters(int k, Point[] points) {
            // The comparison operator was lost in extraction; >= is assumed.
            if ( k >= points.length ) {
                return null;
            }

            Cluster[] clusters = getInitialClusters(k, points);
            Cluster[] newClusters = null;
            do {
                // Reassign every point and recompute the centers.
                newClusters = getClusters(k, points, clusters);

                // Stop once an iteration leaves the clusters unchanged.
                if ( isClustersTheSame(clusters, newClusters) ) {
                    break;
                }

                clusters = newClusters;
            } while ( true );
            return clusters;
        }

        private Cluster[] getClusters(int k, Point[] points, Cluster[] cluster) {
            for ( int i = 0; i ...
            // The rest of this listing is missing from the source.
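The listing above breaks off before the assignment and update steps, so the following is a minimal, self-contained sketch of the loop described under "Algorithm idea". It is not the report's code: the class `KMeansSketch`, the method `cluster`, and the helper `squaredDistance` are hypothetical names, points are plain `double[]` arrays rather than the report's `Point` and `Cluster` types, and the first k points are assumed as initial centers. It stops when no point changes cluster, which implies that the centers no longer change.

```java
import java.util.Arrays;

public class KMeansSketch {
    /** Returns, for each point, the index of the cluster it ends up in. */
    public static int[] cluster(double[][] points, int k) {
        int n = points.length;
        if (k <= 0 || k > n) {
            throw new IllegalArgumentException("k must be in 1..n");
        }
        int dim = points[0].length;

        // Assumption: use the first k points as the initial centers.
        double[][] centers = new double[k][];
        for (int c = 0; c < k; ++c) {
            centers[c] = points[c].clone();
        }

        int[] labels = new int[n];
        Arrays.fill(labels, -1);
        boolean changed = true;
        while (changed) {
            changed = false;

            // Assignment step: move each point to its nearest center.
            for (int i = 0; i < n; ++i) {
                int best = 0;
                double bestDist = Double.MAX_VALUE;
                for (int c = 0; c < k; ++c) {
                    double d = squaredDistance(points[i], centers[c]);
                    if (d < bestDist) {
                        bestDist = d;
                        best = c;
                    }
                }
                if (labels[i] != best) {
                    labels[i] = best;
                    changed = true;
                }
            }

            // Update step: each center becomes the mean of its cluster.
            double[][] sums = new double[k][dim];
            int[] counts = new int[k];
            for (int i = 0; i < n; ++i) {
                counts[labels[i]]++;
                for (int d = 0; d < dim; ++d) {
                    sums[labels[i]][d] += points[i][d];
                }
            }
            for (int c = 0; c < k; ++c) {
                if (counts[c] == 0) {
                    continue; // an empty cluster keeps its previous center
                }
                for (int d = 0; d < dim; ++d) {
                    centers[c][d] = sums[c][d] / counts[c];
                }
            }
        }
        return labels;
    }

    private static double squaredDistance(double[] a, double[] b) {
        double sum = 0.0;
        for (int d = 0; d < a.length; ++d) {
            double diff = a[d] - b[d];
            sum += diff * diff;
        }
        return sum;
    }
}
```

Calling `KMeansSketch.cluster(points, 5)` on a 200-point array would mirror the run parameters used later in the report. Squared distances are compared instead of true distances because the ordering is the same and the square root can be skipped.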

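AGNES core code (excerpt):

    // The start of this listing is missing from the source.
    // The comparison operator was lost in extraction; > is assumed.
    while ( clusters.size() > 1 ) {
        double minProximity = Double.MAX_VALUE;
        int minProximityIndex1 = 0, minProximityIndex2 = 0;

        for ( int i = 0; i ...
        // The rest of this listing is missing from the source.

DBSCAN core code (excerpt):

    // The start of this listing is missing from the source;
    // the return type is inferred from the return statement.
    List<Cluster> getClusters(List<Point> points, int minPoints, double eps) {
        List<Point> corePoints = getCorePoints(points, minPoints, eps);
        Map<Point, Cluster> clusters = getClustersOfCorePoints(corePoints, eps);

        List<Point> borderPoints = getBorderPoints(points, corePoints, minPoints, eps);
        getClustersOfBorderPoints(corePoints, borderPoints, clusters, eps);

        return new ArrayList<Cluster>(clusters.values());
    }

    private List<Point> getCorePoints(List<Point> points, int minPoints, double eps) {
        List<Point> corePoints = new ArrayList<Point>();

        for ( int i = 0; i ...
            // Part of this loop is missing from the source; only the end
            // of the neighbor-count test survives.
            if ( ... >= minPoints ) {
                currentPoint.pointType = PointType.CorePoint;
                corePoints.add(currentPoint);
            }
        }
        return corePoints;
    }

    private Map<Point, Cluster> getClustersOfCorePoints(List<Point> corePoints, double eps) {
        Map<Point, Cluster> clusters = new HashMap<Point, Cluster>();

        for ( int i = 0; i < corePoints.size(); ++ i ) {
            Point currentPoint = corePoints.get(i);
            Point representPoint = null;
            // Look for an earlier core point within eps and adopt its representative.
            for ( int j = 0; j < i; ++ j ) {
                Point anotherPoint = corePoints.get(j);
                if ( currentPoint.isPointsInEpsCircle(anotherPoint, eps) ) {
                    representPoint = anotherPoint.representPoint;
                    currentPoint.representPoint = representPoint;
                    break;
                }
            }

            if ( representPoint == null ) {
                // No earlier core point is close enough: start a new cluster.
                currentPoint.representPoint = currentPoint;
                clusters.put(currentPoint, new Cluster(currentPoint));
            } else {
                Cluster cluster = clusters.get(representPoint);
                cluster.points.add(currentPoint);
            }
        }
        return clusters;
    }

III. Test Data

1. K-Means

For K-Means, the program randomly generates uniformly distributed two-dimensional coordinates, with both the x and y values in the range 0 to 100.

Run parameters:
- Number of points: n = 200
- Expected number of clusters: k = 5

Data sample (see KMeans/Runtime/input.txt in the attachment for the full data):
(16.38, 7.41), (39.14, 10.49), (66.43, 38.65), (44.11, 51.71), (66.99, 6.14), ...

2. AGNES (hierarchical clustering)

For AGNES, the program randomly generates two-dimensional coordinates forming k circular clusters. For each cluster it randomly chooses a center and a radius r no greater than 24, and the number of points in the cluster is kept between 0 and r^2 + 64.

Run parameters:
- Number of clusters: k = 2

Data sample (see AGNES/Runtime/input.txt in the attachment for the full data):
(16.38, 7.41), (39.14, 10.49), (66.43, 38.65), (44.11, 51.71), (66.99, 6.14), ...

3. DBSCAN

For DBSCAN, the program randomly generates two-dimensional coordinates forming k circular clusters. For each cluster it randomly chooses a center and a radius r no greater than 24, and the number of points in the cluster is kept between 0 and r^2 + 64.

Run parameters:
- Number of clusters: k = 3
- Number of neighboring points required for a CorePoint: minPt = 5
- Radius searched for neighbors of a CorePoint: Eps = 4

See DBSCAN/Runtime/input.txt in the attachment for the full data.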

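(16.38, 7.41), (39.14, 10.49), (66.43, 38.65), (44.11, 51.71), (66.99, 6.14), ...

IV. Experiment Results

1. K-Means

Run parameters: n = 200, k = 5

2. AGNES (hierarchical clustering)

Run parameters: k = 2

3. DBSCAN

Run parameters: k = 3, minPt = 5, Eps = 4

NOTE: In the figure, green points are Core Points, blue points are Border Points, and red points are Noise Points.

V. Difficulties, Solutions, and Reflections

K-Means is, on the whole, fairly simple.

AGNES itself is not complicated, but the visualization took a long time. The current solution splits one cluster in the figure into two every second (which feels like Bisecting K-Means) in order to show the "hierarchical" clustering process.

DBSCAN is more complex, and finding a balance between time and space is not easy. One solution that uses more space is to scan the neighbors of every point and add them to a list, with each point maintaining its own list; as one would expect, this consumes a lot of memory. The current implementation first scans to identify the Core Points, then uses the Core Points to find the surrounding Border Points; whatever remains is a Noise Point. This requires two scans, however, so its time complexity is worse than that of the first approach.

Overall, these experiments gave me a good understanding of clustering algorithms and a deeper appreciation of the strengths and weaknesses of each.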
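The two-scan approach described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the report's implementation: `DbscanSketch`, `cluster`, `within`, and `NOISE` are hypothetical names, points are two-dimensional `double[]` arrays, a point counts itself as one of its own neighbors, and core points are connected transitively with a stack rather than through the report's `representPoint` field. No per-point neighbor lists are stored, so memory stays linear in the number of points while distances are recomputed in each scan.

```java
import java.util.ArrayDeque;
import java.util.Arrays;
import java.util.Deque;

public class DbscanSketch {
    public static final int NOISE = -1;

    /** Returns a cluster id for each point, or NOISE for noise points. */
    public static int[] cluster(double[][] pts, double eps, int minPts) {
        int n = pts.length;

        // Scan 1: a point is a core point if at least minPts points
        // (itself included, by assumption) lie within eps of it.
        boolean[] core = new boolean[n];
        for (int i = 0; i < n; ++i) {
            int count = 0;
            for (int j = 0; j < n; ++j) {
                if (within(pts[i], pts[j], eps)) {
                    ++count;
                }
            }
            core[i] = count >= minPts;
        }

        int[] label = new int[n];
        Arrays.fill(label, NOISE);

        // Scan 2: grow one cluster from each core point that is still unlabeled.
        int nextId = 0;
        for (int i = 0; i < n; ++i) {
            if (!core[i] || label[i] != NOISE) {
                continue;
            }
            int id = nextId++;
            Deque<Integer> stack = new ArrayDeque<>();
            label[i] = id;
            stack.push(i);
            while (!stack.isEmpty()) {
                int p = stack.pop(); // p is always a core point
                for (int q = 0; q < n; ++q) {
                    if (label[q] != NOISE || !within(pts[p], pts[q], eps)) {
                        continue;
                    }
                    label[q] = id; // core or border point joins the cluster
                    if (core[q]) {
                        stack.push(q); // only core points keep expanding
                    }
                }
            }
        }
        // Points never reached from a core point keep the NOISE label.
        return label;
    }

    private static boolean within(double[] a, double[] b, double eps) {
        double dx = a[0] - b[0];
        double dy = a[1] - b[1];
        return dx * dx + dy * dy <= eps * eps;
    }
}
```

With the report's parameters the call would be `DbscanSketch.cluster(points, 4.0, 5)`. A border point that lies within Eps of core points from two clusters is assigned to whichever cluster reaches it first, which is one common convention and may differ from the report's `getClustersOfBorderPoints`.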
