1. A food chain store's weekly transaction records are shown in the table below; each transaction lists the items sold in one cash-register sale. Assume supmin = 20% and confmin = 40%. Use the Apriori algorithm to generate the association rules, showing the candidate sets and the large (frequent) itemsets for each database scan.

Transaction   Items
T1            bread, jelly, peanut butter
T2            bread, peanut butter
T3            bread, milk, peanut butter
T4            beer, bread
T5            beer, milk

Solution:

1) Scan the database and count the support of each candidate in C1:

C1   itemset          support
     bread            4/5
     peanut butter    3/5
     milk             2/5
     beer             2/5
     jelly            1/5

2) Compare each candidate's support with the minimum support (20% = 1/5); every candidate qualifies, so the frequent itemset L1 contains all five items with the supports above.

3) From L1, generate the candidate set C2:

C2   {bread, peanut butter}, {bread, milk}, {bread, beer}, {bread, jelly},
     {peanut butter, milk}, {peanut butter, beer}, {peanut butter, jelly},
     {milk, beer}, {milk, jelly}, {beer, jelly}

4) Scan the database and count the support of each candidate:

C2   itemset                     support
     {bread, peanut butter}      3/5
     {bread, milk}               1/5
     {bread, beer}               1/5
     {bread, jelly}              1/5
     {peanut butter, milk}       1/5
     {peanut butter, beer}       0
     {peanut butter, jelly}      1/5
     {milk, beer}                1/5
     {milk, jelly}               0
     {beer, jelly}               0

5) Compare the candidate supports with the minimum support to obtain the frequent itemset L2:

L2   itemset                     support
     {bread, peanut butter}      3/5
     {bread, milk}               1/5
     {bread, beer}               1/5
     {bread, jelly}              1/5
     {peanut butter, milk}       1/5
     {peanut butter, jelly}      1/5
     {milk, beer}                1/5

6) From L2, generate the candidate set C3:

C3   {bread, peanut butter, milk}, {bread, peanut butter, beer}, {bread, peanut butter, jelly},
     {bread, milk, beer}, {bread, milk, jelly}, {bread, beer, jelly},
     {peanut butter, milk, jelly}, {peanut butter, milk, beer}

7) Scan the database and count the support of each candidate:

C3   itemset                           support
     {bread, peanut butter, milk}      1/5
     {bread, peanut butter, beer}      0
     {bread, peanut butter, jelly}     1/5
     {bread, milk, beer}               0
     {bread, milk, jelly}              0
     {bread, beer, jelly}              0
     {peanut butter, milk, jelly}      0
     {peanut butter, milk, beer}       0

8) Compare the candidate supports with the minimum support to obtain the frequent itemset L3:

L3   itemset                           support
     {bread, peanut butter, milk}      1/5
     {bread, peanut butter, jelly}     1/5

Now derive the association rules. The nonempty proper subsets of {bread, peanut butter, milk} are {bread, peanut butter}, {bread, milk}, {peanut butter, milk}, {bread}, {peanut butter}, {milk}:

{bread, peanut butter} → {milk}    confidence = (1/5)/(3/5) ≈ 33.3%
{bread, milk} → {peanut butter}    confidence = (1/5)/(1/5) = 100%
{peanut butter, milk} → {bread}    confidence = (1/5)/(1/5) = 100%
{bread} → {peanut butter, milk}    confidence = (1/5)/(4/5) = 25%
{peanut butter} → {bread, milk}    confidence = (1/5)/(3/5) ≈ 33.3%
{milk} → {bread, peanut butter}    confidence = (1/5)/(2/5) = 50%

So the strong rules are {bread, milk} → {peanut butter}, {peanut butter, milk} → {bread}, and {milk} → {bread, peanut butter}.

The nonempty proper subsets of {bread, peanut butter, jelly} are {bread, peanut butter}, {bread, jelly}, {peanut butter, jelly}, {bread}, {peanut butter}, {jelly}:

{bread, peanut butter} → {jelly}    confidence = (1/5)/(3/5) ≈ 33.3%
{bread, jelly} → {peanut butter}    confidence = (1/5)/(1/5) = 100%
{peanut butter, jelly} → {bread}    confidence = (1/5)/(1/5) = 100%
{bread} → {peanut butter, jelly}    confidence = (1/5)/(4/5) = 25%
{peanut butter} → {bread, jelly}    confidence = (1/5)/(3/5) ≈ 33.3%
{jelly} → {bread, peanut butter}    confidence = (1/5)/(1/5) = 100%

So the strong rules are {bread, jelly} → {peanut butter}, {peanut butter, jelly} → {bread}, and {jelly} → {bread, peanut butter}.
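As a cross-check, the level-wise search and rule generation above can be sketched in Python (a minimal sketch, assuming the English item names used in this write-up; the final loop enumerates confident rules from every frequent itemset, so it covers the six rule evaluations listed for each 3-itemset as well as rules from the pairs):

```python
from itertools import combinations

# Transactions from the table above.
transactions = [
    {"bread", "jelly", "peanut butter"},
    {"bread", "peanut butter"},
    {"bread", "milk", "peanut butter"},
    {"beer", "bread"},
    {"beer", "milk"},
]
SUP_MIN, CONF_MIN = 0.20, 0.40
n = len(transactions)

def support(itemset):
    """Fraction of transactions containing every item of `itemset`."""
    return sum(itemset <= t for t in transactions) / n

# Level-wise search: size-(k+1) candidates are unions of frequent size-k itemsets.
items = sorted(set().union(*transactions))
frequent = {}  # frozenset -> support
level = [frozenset([i]) for i in items]
while level:
    level = [c for c in level if support(c) >= SUP_MIN]
    frequent.update({c: support(c) for c in level})
    # Join step: union pairs that differ in exactly one item.
    level = list({a | b for a in level for b in level if len(a | b) == len(a) + 1})

# Rules X -> Y with confidence = sup(X ∪ Y) / sup(X) >= CONF_MIN.
rules = []
for itemset, sup in frequent.items():
    for r in range(1, len(itemset)):
        for lhs in combinations(sorted(itemset), r):
            lhs = frozenset(lhs)
            conf = sup / frequent[lhs]
            if conf >= CONF_MIN:
                rules.append((lhs, itemset - lhs, conf))
```

The join step here generates the same eight C3 candidates as step 6 above (it omits the Apriori prune step, as the hand computation does).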
2. The following shows a history of customers with their incomes, ages, and an attribute called "Have_iPhone" indicating whether they have an iPhone. We also indicate whether they will buy an iPad in the last column.

No.  Income   Age    Have_iPhone  Buy_iPad
1    high     young  yes          yes
2    high     old    yes          yes
3    medium   young  no           yes
4    high     old    no           yes
5    medium   young  no           no
6    medium   young  no           no
7    medium   old    no           no
8    medium   old    no           no

(a) We want to train a CART decision tree classifier to predict whether a new customer will buy an iPad or not. We define the value of attribute Buy_iPad to be the label of a record.
(i) Please find a CART decision tree according to the above example. In the decision tree, whenever we process a node containing at most 3 records, we stop splitting that node.
(ii) Consider a new young customer whose income is medium and who has an iPhone. Please predict whether this new customer will buy an iPad or not.
(b) What is the difference between the C4.5 decision tree and the ID3 decision tree? Why is there a difference?

Solution:

(a)(i) The expected information (entropy) of the given samples is:

Info(D) = −(4/8) log2(4/8) − (4/8) log2(4/8) = 1

For the attribute Income:

Info(high) = −(3/3) log2(3/3) − 0 = 0   (all three high-income records are "yes")
Info(medium) = −(1/5) log2(1/5) − (4/5) log2(4/5) = 0.72193

Expected information: E(Income) = (3/8)·0 + (5/8)·0.72193 = 0.45121
Information gain: Gain(Income) = 1 − E(Income) = 0.54879

Computing in the same way:

Gain(Age) = 0   (both age groups split the labels 2 "yes" / 2 "no")
Gain(Have_iPhone) = 0.31128

Of the three attributes, Income has the largest gain, so Income is chosen as the best feature and the root node produces two children. The high branch holds three records, all labeled "yes", so it becomes a leaf. For the other (medium) branch we repeat the procedure on the remaining attributes; within this subset Have_iPhone is constantly "no", so Age is selected. The resulting tree is:

Income = high    → Yes
Income = medium  → split on Age:
    Age = young  → No   (three records, one "yes" and two "no": splitting stops, majority label "No")
    Age = old    → No   (both records are "no")

(ii) The new customer is young, has a medium income, and owns an iPhone. Following the tree (Income = medium, then Age = young), the prediction is that this customer will not buy an iPad.

(b) The C4.5 algorithm is similar to ID3 but is an improvement on it: when growing the tree, ID3 selects features by information gain, choosing the feature with the largest gain, whereas C4.5 selects features by the information gain ratio, choosing the feature with the largest gain ratio. The difference exists because the magnitude of the information gain is relative to the training set and has no absolute meaning: when classification is difficult, that is, when the empirical entropy of the training set is large, information gains tend to be large, and conversely they tend to be small; plain information gain is also biased toward attributes with many distinct values. Dividing the gain by the attribute's split information, as the gain ratio does, corrects for this.
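The gain figures above, and the ID3 versus C4.5 criteria discussed in (b), can be reproduced with a short Python sketch (attribute columns are indexed 0 = Income, 1 = Age, 2 = Have_iPhone):

```python
from math import log2
from collections import Counter

# Records: (Income, Age, Have_iPhone, Buy_iPad) from the table above.
data = [
    ("high", "young", "yes", "yes"),
    ("high", "old", "yes", "yes"),
    ("medium", "young", "no", "yes"),
    ("high", "old", "no", "yes"),
    ("medium", "young", "no", "no"),
    ("medium", "young", "no", "no"),
    ("medium", "old", "no", "no"),
    ("medium", "old", "no", "no"),
]
ATTRS = ["Income", "Age", "Have_iPhone"]

def entropy(values):
    n = len(values)
    return -sum(c / n * log2(c / n) for c in Counter(values).values())

def info_gain(col):
    """ID3 criterion: label entropy minus expected entropy after splitting on column `col`."""
    labels = [r[-1] for r in data]
    expected = sum(
        len(part) / len(data) * entropy(part)
        for v in set(r[col] for r in data)
        for part in [[r[-1] for r in data if r[col] == v]]
    )
    return entropy(labels) - expected

def gain_ratio(col):
    """C4.5 criterion: information gain divided by the attribute's split information."""
    return info_gain(col) / entropy([r[col] for r in data])

gains = {a: info_gain(i) for i, a in enumerate(ATTRS)}
best = max(gains, key=gains.get)  # Income has the largest gain
```

Note that `gain_ratio` normalizes each gain by the entropy of the attribute column itself, which is the split-information correction described in (b).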
3. Consider the following eight two-dimensional data points:

x1: (23, 12), x2: (6, 6), x3: (15, 0), x4: (15, 28), x5: (20, 9), x6: (8, 9), x7: (20, 11), x8: (8, 13)

Consider the k-means algorithm and answer the following questions. You are required to show the information about each final cluster (including the mean of the cluster and all data points in this cluster). You can consider writing a program for this part, but you are not required to submit the program.

(a) If k = 2 and the initial means are (20, 9) and (8, 9), what is the output of the algorithm?
(b) If k = 2 and the initial means are (15, 0) and (15, 29), what is the output of the algorithm?

Solution:

(a) With k = 2 and initial means (20, 9) and (8, 9), the iterations are:

M1           M2            K1                                          K2
(20, 9)      (8, 9)        (23,12), (15,0), (15,28), (20,9), (20,11)   (6,6), (8,9), (8,13)
(18.6, 12)   (7.33, 9.33)  (23,12), (15,28), (20,9), (20,11)           (15,0), (6,6), (8,9), (8,13)
(19.5, 15)   (9.25, 7)     (23,12), (15,28), (20,9), (20,11)           (15,0), (6,6), (8,9), (8,13)

The assignments no longer change, so the algorithm outputs two clusters:
K1 = {x1, x4, x5, x7} with mean (19.5, 15)
K2 = {x2, x3, x6, x8} with mean (9.25, 7)

(b) With k = 2 and initial means (15, 0) and (15, 29), the iterations are:

M1            M2         K1                                                       K2
(15, 0)       (15, 29)   (23,12), (6,6), (15,0), (20,9), (8,9), (20,11), (8,13)   (15,28)
(14.3, 8.6)   (15, 28)   (23,12), (6,6), (15,0), (20,9), (8,9), (20,11), (8,13)   (15,28)

The assignments no longer change, so the algorithm outputs two clusters:
K1 = {x1, x2, x3, x5, x6, x7, x8} with mean (14.3, 8.6)
K2 = {x4} with mean (15, 28)
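Both runs above can be reproduced with a minimal pure-Python k-means sketch (Euclidean distance; the sketch assumes no cluster ever becomes empty, which holds for both seedings here):

```python
points = [(23, 12), (6, 6), (15, 0), (15, 28), (20, 9), (8, 9), (20, 11), (8, 13)]

def kmeans(points, means):
    """Alternate assignment and update steps until the means stop changing."""
    while True:
        clusters = [[] for _ in means]
        for p in points:
            # Assignment step: p joins the nearest current mean (squared distance suffices).
            d2 = [(p[0] - m[0]) ** 2 + (p[1] - m[1]) ** 2 for m in means]
            clusters[d2.index(min(d2))].append(p)
        # Update step: each mean becomes the centroid of its cluster.
        new_means = [(sum(x for x, _ in c) / len(c), sum(y for _, y in c) / len(c))
                     for c in clusters]
        if new_means == means:  # converged: assignments can no longer change
            return means, clusters
        means = new_means

means_a, clusters_a = kmeans(points, [(20, 9), (8, 9)])   # part (a)
means_b, clusters_b = kmeans(points, [(15, 0), (15, 29)]) # part (b)
```

Running it confirms the hand computation, including the centroid (9.25, 7) for K2 in part (a) and the singleton cluster {x4} in part (b).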
4. Consider eight data points. The following matrix shows the pairwise distances between any two points:

    1   2   3   4   5   6   7   8
1   0
2  11   0
3   5  13   0
4  12   2  14   0
5   7  17   1  18   0
6  13   4  15   5  20   0
7   9  15  12  16  15  19   0
8  11  20  12  21  17  22  30   0

Please use the agglomerative approach to cluster these eight points into two groups/clusters by using complete linkage. Please write down all data points in each cluster and write down the distance between the two clusters.

Solution: Under complete linkage, the distance between two clusters is the maximum pairwise distance between their points. At each step we merge the two closest clusters.

Points 3 and 5 are at distance 1; merge them into cluster (3,5):

       1   2  (3,5)  4   6   7   8
1      0
2     11   0
(3,5)  7  17   0
4     12   2  18   0
6     13   4  20   5   0
7      9  15  15  16  19   0
8     11  20  17  21  22  30   0

Points 2 and 4 are at distance 2; merge them into cluster (2,4):

       1  (2,4) (3,5)  6   7   8
1      0
(2,4) 12    0
(3,5)  7   18    0
6     13    5   20    0
7      9   16   15   19   0
8     11   21   17   22  30   0

Cluster (2,4) and point 6 are at distance 5; merge them into cluster (2,4,6):

         1  (2,4,6) (3,5)  7   8
1        0
(2,4,6) 13     0
(3,5)    7    20     0
7        9    19    15    0
8       11    22    17   30   0

Point 1 and cluster (3,5) are at distance 7; merge them into cluster (1,3,5):

        (1,3,5) (2,4,6)  7   8
(1,3,5)    0
(2,4,6)   20       0
7         15      19     0
8         17      22    30   0

Cluster (1,3,5) and point 7 are at distance 15; merge them into cluster (1,3,5,7):

          (1,3,5,7) (2,4,6)  8
(1,3,5,7)     0
(2,4,6)      20        0
8            30       22     0

Clusters (1,3,5,7) and (2,4,6) are at distance 20, the smallest remaining distance, so they are merged and two clusters remain:

                (1,2,3,4,5,6,7)   8
(1,2,3,4,5,6,7)        0
8                     30          0

The two output clusters are {1, 2, 3, 4, 5, 6, 7} and {8}, and the complete-linkage distance between them is 30.
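The merge sequence can be verified with a small pure-Python sketch of complete-linkage agglomeration over the distance matrix (points are 0-indexed here, so index 0 is point 1):

```python
# Pairwise distance matrix from the problem statement.
D = [
    [0, 11, 5, 12, 7, 13, 9, 11],
    [11, 0, 13, 2, 17, 4, 15, 20],
    [5, 13, 0, 14, 1, 15, 12, 12],
    [12, 2, 14, 0, 18, 5, 16, 21],
    [7, 17, 1, 18, 0, 20, 15, 17],
    [13, 4, 15, 5, 20, 0, 19, 22],
    [9, 15, 12, 16, 15, 19, 0, 30],
    [11, 20, 12, 21, 17, 22, 30, 0],
]

def complete_linkage(a, b):
    """Distance between clusters = maximum pairwise point distance."""
    return max(D[i][j] for i in a for j in b)

clusters = [[i] for i in range(8)]
while len(clusters) > 2:  # stop when two clusters remain
    # Find and merge the closest pair of clusters.
    pairs = [(complete_linkage(a, b), ai, bi)
             for ai, a in enumerate(clusters)
             for bi, b in enumerate(clusters) if ai < bi]
    _, ai, bi = min(pairs)
    clusters[ai] = sorted(clusters[ai] + clusters[bi])
    del clusters[bi]

gap = complete_linkage(clusters[0], clusters[1])
```

With these distances the merge order is unambiguous (no ties at any step), so the sketch reproduces the hand computation exactly.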