Introduction to Data Mining
Instructor's Solution Manual

Pang-Ning Tan
Michael Steinbach
Vipin Kumar

Copyright © 2006 Pearson Addison-Wesley. All rights reserved.

Contents
1 Introduction
2 Data
3 Exploring Data
4 Classification: Basic Concepts, Decision Trees, and Model Evaluation
5 Classification: Alternative Techniques
6 Association Analysis: Basic Concepts and Algorithms
7 Association Analysis: Advanced Concepts
8 Cluster Analysis: Basic Concepts and Algorithms
9 Cluster Analysis: Additional Issues and Algorithms
10 Anomaly Detection

1 Introduction

1. Discuss whether or not each
of the following activities is a data mining task.
(a) Dividing the customers of a company according to their gender.
No. This is a simple database query.
(b) Dividing the customers of a company according to their profitability.
No. This is an accounting calculation, followed by the application of a threshold. However, predicting the profitability of a new customer would be data mining.
(c) Computing the total sales of a company.
No. Again, this is simple accounting.
(d) Sorting a student database based on student identification numbers.
No. Again, this is a simple database query.
(e) Predicting the outcomes of tossing a (fair) pair of dice.
No. Since the dice are fair, this is a probability calculation. If the dice were not fair, and we needed to estimate the probabilities of each outcome from the data, then this would be more like the problems considered by data mining. However, in this specific case, solutions to this problem were developed by mathematicians a long time ago, and thus, we wouldn't consider it to be data mining.
(f) Predicting the future stock price of a company using historical records.
Yes. We would attempt to create a model that can predict the continuous value of the stock price. This
is an example of the area of data mining known as predictive modelling. We could use regression for this modelling, although researchers in many fields have developed a wide variety of techniques for predicting time series.
(g) Monitoring the heart rate of a patient for abnormalities.
Yes. We would build a model of the normal behavior of heart rate and raise an alarm when an unusual heart behavior occurred. This would involve the area of data mining known as anomaly detection. This could also be considered as a classification problem if we had examples of both normal and abnormal heart behavior.
(h) Monitoring seismic waves for earthquake activities.
Yes. In this case, we would build a model of different types of seismic wave behavior associated with earthquake activities and raise an alarm when one of these different types of seismic activity was observed. This is an example of the area of data mining known as classification.
(i) Extracting the frequencies of a sound wave.
No. This is signal processing.

2. Suppose that you are employed as a data mining consultant for an Internet search engine company. Describe how data mining can help the company by giving specific examples of
how techniques, such as clustering, classification, association rule mining, and anomaly detection can be applied.
The following are examples of possible answers.
• Clustering can group results with a similar theme and present them to the user in a more concise form, e.g., by reporting the 10 most frequent words in the cluster.
• Classification can assign results to pre-defined categories such as “Sports,” “Politics,” etc.
• Sequential association analysis can detect that certain queries follow certain other queries with a high probability, allowing for more efficient caching.
• Anomaly detection techniques can discover unusual patterns of user traffic, e.g., that one subject has suddenly become much more popular. Advertising strategies could be adjusted to take advantage of such developments.

3. For each of the following data sets, explain whether or not data privacy is an important issue.
(a) Census data
collected from 1900–1950. No
(b) IP addresses and visit times of Web users who visit your Website. Yes
(c) Images from Earth-orbiting satellites. No
(d) Names and addresses of people from the telephone book. No
(e) Names and email addresses collected from the Web. No

2 Data

1. In the initial example of Chapter 2, the statistician says, “Yes, fields 2 and 3 are basically the same.” Can you tell from the three lines of sample data that are shown why she says that?
Field 2 / Field 3 ≈ 7 for the values displayed. While it can be dangerous to draw conclusions from such a small sample, the two fields seem to contain essentially the same information.

2. Classify the following attributes as binary, discrete, or continuous. Also classify them as qualitative (nominal or ordinal) or quantitative (interval or ratio). Some cases may have more than one interpretation, so briefly indicate your reasoning if you think there
may be some ambiguity.
Example: Age in years. Answer: Discrete, quantitative, ratio
(a) Time in terms of AM or PM. Binary, qualitative, ordinal
(b) Brightness as measured by a light meter. Continuous, quantitative, ratio
(c) Brightness as measured by people's judgments. Discrete, qualitative, ordinal
(d) Angles as measured in degrees between 0° and 360°. Continuous, quantitative, ratio
(e) Bronze, Silver, and Gold medals as awarded at the Olympics. Discrete, qualitative, ordinal
(f) Height above sea level. Continuous, quantitative, interval/ratio (depends on whether sea level is regarded as an arbitrary origin)
(g) Number of patients in a hospital. Discrete, quantitative, ratio
(h) ISBN numbers for books. (Look up the format on the Web.) Discrete, qualitative, nominal (ISBN numbers do have order information, though)
(i) Ability to pass light in terms of the following values: opaque, translucent, transparent. Discrete, qualitative, ordinal
(j) Military rank. Discrete, qualitative, ordinal
(k) Distance from the center of campus. Continuous, quantitative, interval/ratio (depends)
(l) Density of a substance in grams per cubic centimeter. Continuous, quantitative, ratio
(m) Coat check number. (When you attend an event, you can often give your coat to someone who, in turn, gives you a number that you can use to claim your coat when you leave.) Discrete, qualitative, nominal

3. You are approached by the marketing director of a local company, who believes that he has devised a foolproof way to measure customer satisfaction. He explains his scheme as follows: “It's so simple that I can't believe that no one has thought of it before. I just keep track of the number of customer complaints for each product. I read in a data mining book that counts are ratio attributes, and so, my measure of product satisfaction must be a ratio attribute. But when I rated the products based on my new customer satisfaction measure and showed them to my boss, he told me that I had overlooked the obvious, and that my measure was worthless. I think that he was just mad because our best-selling product had the
worst satisfaction since it had the most complaints. Could you help me set him straight?”
(a) Who is right, the marketing director or his boss? If you answered, his boss, what would you do to fix the measure of satisfaction?
The boss is right. A better measure is given by
Satisfaction(product) = (number of complaints for the product) / (total number of sales for the product).
(b) What can you say about the attribute type of the original product satisfaction attribute?
Nothing can be said about the attribute type of the original measure. For example, two products that have the same level of customer satisfaction may have different numbers of complaints and vice-versa.

4. A few months later, you are again approached by the same marketing director as in Exercise 3. This time, he has devised a better approach to measure the extent to which a customer prefers one product over other, similar products. He explains,
“When we develop new products, we typically create several variations and evaluate which one customers prefer. Our standard procedure is to give our test subjects all of the product variations at one time and then ask them to rank the product variations in order of preference. However, our test subjects are very indecisive, especially when there are more than two products. As a result, testing takes forever. I suggested that we perform the comparisons in pairs and then use these comparisons to get the rankings. Thus, if we have three product variations, we have the customers compare variations 1 and 2, then 2 and 3, and finally 3 and 1. Our testing time with my new procedure is a third of what it was for the old procedure, but the employees conducting the tests complain that they cannot come up with a consistent ranking from the results. And my boss wants the latest product evaluations, yesterday. I should also mention that he was the person who came up with the old product evaluation approach. Can you help me?”
(a) Is the marketing director in trouble? Will his approach work for generating an ordinal ranking of the product variations in terms of customer preference? Explain.
Yes, the marketing
director is in trouble. A customer may give inconsistent rankings. For example, a customer may prefer 1 to 2, 2 to 3, but 3 to 1.
(b) Is there a way to fix the marketing director's approach? More generally, what can you say about trying to create an ordinal measurement scale based on pairwise comparisons?
One solution: For three items, do only the first two comparisons. A more general solution: Put the choice to the customer as one of ordering the product, but still only allow pairwise comparisons. In general, creating an ordinal measurement scale based on pairwise comparison is difficult because of possible inconsistencies.
(c) For the original product evaluation scheme, the overall rankings of each product variation are found by computing its average over all test subjects. Comment on whether you think that this is a reasonable approach. What other approaches might you take?
First, there is the issue that the scale is likely not an interval or ratio scale. Nonetheless, for practical purposes, an average may be good enough. A more important concern is that a few extreme ratings might result in an overall rating that is misleading. Thus, the median or a trimmed mean (see Chapter 3) might be a better choice.

5. Can you think of a situation in which identification numbers would be useful for prediction?
One example: Student IDs are a good predictor of graduation date.

6. An educational psychologist wants to use association analysis to analyze test results. The test consists of 100 questions with four possible answers each.
(a) How would you convert this data into a form suitable for association analysis?
Association rule analysis works with binary attributes, so you have to convert original data into binary form as follows:

Q1=A  Q1=B  Q1=C  Q1=D  ...  Q100=A  Q100=B  Q100=C  Q100=D
  1     0     0     0   ...     1       0       0       0
  0     0     1     0   ...     0       1       0       0

(b) In particular, what type of attributes would you have and how many of them are there?
400 asymmetric binary attributes.

7. Which of the following quantities is likely to show more temporal autocorrelation: daily rainfall or daily temperature? Why?
A feature shows spatial autocorrelation if locations that are closer to each other are more similar with respect to the values of that feature than locations that are farther away. It is more common for physically close locations to have similar temperatures than similar amounts of rainfall since rainfall can be very
localized; i.e., the amount of rainfall can change abruptly from one location to another. Therefore, daily temperature shows more spatial autocorrelation than daily rainfall.

8. Discuss why a document-term matrix is an example of a data set that has asymmetric discrete or asymmetric continuous features.
The ij-th entry of a document-term matrix is the number of times that term j occurs in document i. Most documents contain only a small fraction of all the possible terms, and thus, zero entries are not very meaningful, either in describing or comparing documents. Thus, a document-term matrix has asymmetric discrete features. If we apply a TF-IDF normalization to terms and normalize the documents to have an L2 norm of 1, then this creates a document-term matrix with continuous features. However, the features are still asymmetric because these transformations do not create non-zero entries for any entries
that were previously 0, and thus, zero entries are still not very meaningful.

9. Many sciences rely on observation instead of (or in addition to) designed experiments. Compare the data quality issues involved in observational science with those of experimental science and data mining.
Observational sciences have the issue of not being able to completely control the quality of the data that they obtain. For example, until Earth-orbiting satellites became available, measurements of sea surface temperature relied on measurements from ships. Likewise, weather measurements are often taken from stations located in towns or cities. Thus, it is necessary to work with the data available, rather than data from a carefully designed experiment. In that sense, data analysis for observational science resembles data mining.

10. Discuss the difference between the precision of a measurement and the terms single and double precision, as they are used in computer science, typically to represent floating-point numbers that require 32 and 64 bits, respectively.
The precision of floating-point numbers is a maximum precision. More explicitly, precision is often expressed in terms of the number of significant digits used to represent a value. Thus, a single precision number can only represent values with up to 32 bits, or about 9 decimal digits, of precision. However, often the precision of a value represented using 32 bits (64 bits) is far less than 32 bits (64 bits).

11. Give at least two advantages to working with data
stored in text files instead of in a binary format.
(1) Text files can be easily inspected by typing the file or viewing it with a text editor.
(2) Text files are more portable than binary files, both across systems and programs.
(3) Text files can be more easily modified, for example, using a text editor or Perl.

12. Distinguish between noise and outliers. Be sure to consider the following questions.
(a) Is noise ever interesting or desirable? Outliers?
No, by definition. Yes. (See Chapter 10.)
(b) Can noise objects be outliers?
Yes. Random distortion of the data is often responsible for outliers.
(c) Are
noise objects always outliers?
No. Random distortion can result in an object or value much like a normal one.
(d) Are outliers always noise objects?
No. Often outliers merely represent a class of objects that are different from normal objects.
(e) Can noise make a typical value into an unusual one, or vice versa?
Yes.

13. Consider the problem of finding the K nearest neighbors of a data object. A programmer designs Algorithm 2.1 for this task.

Algorithm 2.1 Algorithm for finding K nearest neighbors.
1: for i = 1 to number of data objects do
2:   Find the distances of th