Chapter 2: Organizing and Visualizing Variables

In this chapter you learn about:
Organizing categorical variables.
Organizing numerical variables.
Visualizing categorical variables.
Visualizing numerical variables.

Organizing Data Creates Both Tabular And Visual Summaries

Summaries both guide further exploration and sometimes facilitate decision making.
Visual summaries enable rapid review of larger amounts of data and show possible significant patterns.
Often, the Organize and Visualize steps in DCOVA occur concurrently.

Categorical Data Are Organized By Utilizing Tables

Categorical data are organized by tallying the data:
One categorical variable: summary table.
Two categorical variables: contingency table.

Organizing Data: Summary Table (One Categorical Variable)

A summary table tallies the frequencies (counts) or percentages of items in a set of categories so that you can see differences between categories.
A summary table that tallies the frequencies (counts) is also called a frequency table.
A summary table that tallies the relative frequencies is also called a relative frequency table.

Main Reason Young Adults Shop Online

Reason For Shopping Online?           Percent
Better prices                         37%
Avoiding holiday crowds or hassles    29%
Convenience                           18%
Better selection                      13%
Ships directly                         3%

Source: Data extracted and adapted from "Main Reason Young Adults Shop Online?" USA Today, December 5, 2012, p. 1A.

Contingency Table (Two Categorical Variables)

A random sample of 400 invoices is drawn.
Each invoice is categorized as a small, medium, or large amount.
Each invoice is also examined to identify whether it contains any errors.
These data are then organized in the contingency table below.

Contingency Table Showing Frequency of Invoices Categorized By Size and The Presence Of Errors

                 No Errors   Errors   Total
Small Amount     170         20       190
Medium Amount    100         40       140
Large Amount     65          5        70
Total            335         65       400

Contingency Table Based On Percentage Of Overall Total

                 No Errors   Errors   Total
Small Amount     42.50%      5.00%    47.50%
Medium Amount    25.00%      10.00%   35.00%
Large Amount     16.25%      1.25%    17.50%
Total            83.75%      16.25%   100.0%

42.50% = 170 / 400
25.00% = 100 / 400
16.25% = 65 / 400

83.75% of sampled invoices have no errors and 47.50% of sampled invoices are for small amounts.

Contingency Table Based On Percentage of Row Totals

                 No Errors   Errors   Total
Small Amount     89.47%      10.53%   100.0%
Medium Amount    71.43%      28.57%   100.0%
Large Amount     92.86%      7.14%    100.0%
Total            83.75%      16.25%   100.0%

89.47% = 170 / 190
71.43% = 100 / 140
92.86% = 65 / 70

Medium invoices have a larger chance (28.57%) of having errors than small (10.53%) or large (7.14%) invoices.
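
The percentage tables for the invoice sample can be reproduced with a few lines of Python. This is a minimal sketch, not part of the original slides: the function name `percentages` and its `basis` parameter are illustrative choices, and the `column` option is included only for completeness.

```python
# Invoice counts from the contingency table (rows: invoice size, columns: error status).
counts = {
    "Small Amount": {"No Errors": 170, "Errors": 20},
    "Medium Amount": {"No Errors": 100, "Errors": 40},
    "Large Amount": {"No Errors": 65, "Errors": 5},
}


def percentages(table, basis="overall"):
    """Return each cell as a percentage of the overall, row, or column total."""
    columns = list(next(iter(table.values())))
    row_totals = {row: sum(cells.values()) for row, cells in table.items()}
    col_totals = {col: sum(table[row][col] for row in table) for col in columns}
    grand_total = sum(row_totals.values())

    result = {}
    for row, cells in table.items():
        result[row] = {}
        for col, count in cells.items():
            if basis == "overall":
                denominator = grand_total
            elif basis == "row":
                denominator = row_totals[row]
            elif basis == "column":
                denominator = col_totals[col]
            else:
                raise ValueError(f"unknown basis: {basis}")
            result[row][col] = round(100 * count / denominator, 2)
    return result


overall = percentages(counts, "overall")
by_row = percentages(counts, "row")

print(overall["Small Amount"]["No Errors"])  # 170 / 400 as a percentage
print(by_row["Medium Amount"]["Errors"])     # 40 / 140 as a percentage
```

Changing only the denominator is the whole difference between the two tables: the overall basis describes the sample as a whole, while the row basis compares error rates across invoice sizes.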

Tables Used For Organizing Numerical Data

Numerical data are organized with:
Ordered array.
Frequency distributions.
Cumulative distributions.

Organizing Numerical Data: Ordered Array

An ordered array is a sequence of data, in rank order, from the smallest value to the largest value.
Shows range (minimum value to maximum value).
May help identify outliers (unusual observations).

Age of Surveyed College Students

Day Students:    16 17 17 18 18 18 19 19 20 20 21 22 22 25 27 32 38 42
Night Students:  18 18 19 19 20 21 23 28 32 33 41 45

Organizing Numerical Data: Frequency Distribution

The frequency distribution is a summary table in which the data are arranged into numerically ordered classes.
You must give attention to selecting the appropriate number of class groupings for the table, determining a suitable width of a class grouping, and establishing the boundaries of each class grouping to avoid overlapping.
The number of classes depends on the number of values in the data. With a larger number of values, there are typically more classes.
To determine the width of a class interval, divide the range (highest value - lowest value) of the data by the number of class groupings desired.

Organizing Numerical Data: Frequency Distribution Example

Example: A manufacturer of insulation randomly selects 20 winter days and records the daily high temperature.
24, 35, 17, 21, 24, 37, 26, 46, 58, 30, 32, 13, 12, 38, 41, 43, 44, 27, 53, 27

Sort raw data in ascending order: 12, 13, 17, 21, 24, 24, 26, 27, 27, 30, 32, 35, 37, 38, 41, 43, 44, 46, 53, 58.
Step 1: Find range: 58 - 12 = 46.
Step 2: Select number of classes: the book chooses 5 (usually between 5 and 15).
Follow the "2 to the k" rule: with n observations, choose the smallest k such that 2^k > n. Here n = 20 and 2^5 = 32 > 20, while 2^4 = 16 is too small, so k = 5.
Step 3: Compute class interval (width): 10 (46 / 5 = 9.2, then round up).
Step 4: Determine class boundaries (limits):
Class 1: 10 but less than 20.
Class 2: 20 but less than 30.
Class 3: 30 but less than 40.
Class 4: 40 but less than 50.
Class 5: 50 but less than 60.
Compute class midpoints: 15, 25, 35, 45, 55.
Count observations and assign to classes.

Class                  Midpoint   Frequency
10 but less than 20    15         3
20 but less than 30    25         6
30 but less than 40    35         5
40 but less than 50    45         4
50 but less than 60    55         2
Total                             20
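
The steps of the temperature example can be sketched in Python as follows. This is an illustrative sketch under stated assumptions rather than the textbook's own procedure: the function names are invented here, the raw width is rounded up with `math.ceil`, and the first lower class limit (10) is passed in by hand because the slide picks it as a convenient round number.

```python
import math

# Daily high temperatures from the insulation example (20 winter days).
temperatures = [24, 35, 17, 21, 24, 37, 26, 46, 58, 30,
                32, 13, 12, 38, 41, 43, 44, 27, 53, 27]


def number_of_classes(n):
    """Smallest k with 2**k > n (the "2 to the k" rule)."""
    k = 1
    while 2 ** k <= n:
        k += 1
    return k


def frequency_distribution(values, first_lower_limit):
    """Return (lower, upper, midpoint, frequency) for each class.

    Classes are half-open (lower <= value < upper), so they never overlap.
    """
    k = number_of_classes(len(values))
    data_range = max(values) - min(values)
    width = math.ceil(data_range / k)  # round the raw width up
    classes = []
    for i in range(k):
        lower = first_lower_limit + i * width
        upper = lower + width
        frequency = sum(1 for v in values if lower <= v < upper)
        classes.append((lower, upper, (lower + upper) / 2, frequency))
    # Guard: every observation must fall into exactly one class.
    if sum(c[3] for c in classes) != len(values):
        raise ValueError("classes do not cover all values; adjust the first lower limit")
    return classes


# Assumption: the first class starts at 10, as on the slide.
distribution = frequency_distribution(temperatures, first_lower_limit=10)
for lower, upper, midpoint, frequency in distribution:
    print(f"{lower} but less than {upper}: midpoint {midpoint:g}, frequency {frequency}")
```

Using half-open classes mirrors the "10 but less than 20" wording, so a value such as 30 is counted once, in the class that starts at 30.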

Organizing Numerical Data: Relative & Percent Frequency Distribution Example

Class                  Frequency   Relative Frequency   Percentage
10 but less than 20    3           .15                  15%
20 but less than 30    6           .30                  30%
30 but less than 40    5           .25                  25%
40 but less than 50    4           .20                  20%
50 but less than 60    2           .10                  10%
Total                  20          1.00                 100%

Relative Frequency = Frequency / Total, e.g. 0.10 = 2 / 20

Organizing Numerical Data: Cumulative Frequency Distribution Example

Class                  Frequency   Percentage   Cumulative Frequency   Cumulative Percentage
10 but less than 20    3           15%          3                      15%
20 but less than 30    6           30%          9                      45%
30 but less than 40    5           25%          14                     70%
40 but less than 50    4           20%          18                     90%
50 but less than 60    2           10%          20                     100%
Total                  20          100%         20                     100%

Cumulative Percentage = Cumulative Frequency / Total * 100, e.g. 45% = 100 * 9 / 20

Visualizing Categorical Data Through Graphical Displays

Categorical data are visualized with:
Summary table for one variable: bar chart, pie or doughnut chart, Pareto chart.
Contingency table for two variables: side by side bar chart, doughnut chart.

Visualizing Categorical Data: The Bar Chart

The bar chart visualizes a categorical variable as a series of bars. The length of each bar represents either the frequency or percentage of values for each category.
Each bar is separated by a space called a gap.

Reason For Shopping Online?           Percent
Better prices                         37%
Avoiding holiday crowds or hassles    29%
Convenience                           18%
Better selection                      13%
Ships directly                         3%

Visualizing Categorical Data: The Pie Chart and The Doughnut Chart

The pie chart is a circle broken up into slices that represent categories. The size of each slice of the pie varies according to the percentage in each category.
The doughnut chart is the outer part of a circle broken up into pieces that represent categories. The size of each piece of the doughnut varies according to the percentage in each category.
[Figure: Doughnut Chart of Reasons to Shop Online]

Visualizing Categorical Data: Side By Side Bar Charts

The side by side bar chart represents the data from a contingency table.

                 No Errors   Errors   Total
Small Amount     50.75%      30.77%   47.50%
Medium Amount    29.85%      61.54%   35.00%
Large Amount     19.40%      7.69%    17.50%
Total            100.0%      100.0%   100.0%

[Figure: Invoice Size Split Out By Errors & No Errors. Side by side bars for Large, Medium, and Small invoices; percentage axis from 0.0% to 70.0%.]

Visualizing Numerical Data By Using Graphical Displays

Numerical data are visualized with:
Ordered array: stem-and-leaf display.
Frequency distributions and cumulative distributions: histogram, polygon, ogive.

Organizing Numerical Data: Stem and Leaf Display

A stem-and-leaf display organizes data into groups (called stems) so that the values within each group (the leaves) branch out to the right on each row.
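
The grouping into stems and leaves can be sketched in Python using the surveyed college student ages. This is a minimal sketch that assumes non-negative whole-number ages, takes the tens digit as the stem and the units digit as the leaf, and omits stems that have no values; the function name `stem_and_leaf` is an illustrative choice.

```python
from collections import defaultdict

# Ages of surveyed college students, as listed in the ordered arrays.
day_students = [16, 17, 17, 18, 18, 18, 19, 19, 20, 20, 21, 22, 22, 25, 27, 32, 38, 42]
night_students = [18, 18, 19, 19, 20, 21, 23, 28, 32, 33, 41, 45]


def stem_and_leaf(values):
    """Map each stem (tens digit) to a string of its leaves (units digits)."""
    groups = defaultdict(list)
    for value in sorted(values):
        stem, leaf = divmod(value, 10)
        groups[stem].append(str(leaf))
    return {stem: "".join(leaves) for stem, leaves in sorted(groups.items())}


for label, ages in (("Day Students", day_students), ("Night Students", night_students)):
    print(label)
    for stem, leaves in stem_and_leaf(ages).items():
        print(f"{stem} | {leaves}")
```

Sorting before splitting keeps the leaves in ascending order on each row, which is what lets the display double as an ordered array.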
METHOD: Separate the sorted data series into leading digits (the stems) and trailing digits (the leaves).

Age of Surveyed College Students

Day Students:    16 17 17 18 18 18 19 19 20 20 21 22 22 25 27 32 38 42
Night Students:  18 18 19 19 20 21 23 28 32 33 41 45

Age of College Students

Day Students          Night Students
Stem   Leaf           Stem   Leaf
1      67788899       1      8899
2      0012257        2      0138
3      28             3      23
4      2              4      15

Visualizing Numerical Data: The Histogram

A vertical bar chart of the data in a frequency distribution is called a histogram.
In a histogram there are no gaps between adjacent bars.
The class boundaries (or class midpoints) are shown on the horizontal axis.
The vertical axis is either frequency, relative frequency, or percentage.
The height of the bars represents the frequency, relative frequency, or percentage.

Class                  Frequency   Relative Frequency   Percentage
10 but less than 20    3           .15                  15
20 but less than 30    6           .30                  30
30 but less than 40    5           .25                  25
40 but less than 50    4           .20                  20
50 but less than 60    2           .10                  10
Total                  20          1.00                 100

[Figure: Histogram: Temperature. Horizontal axis: class midpoints 5, 15, 25, 35, 45, 55, More; vertical axis: Frequency, 0 to 8.]
(In a percentage histogram the vertical axis would be defined to show the percentage of observations per class.)

Visualizing Numerical Data: The Polygon

A percentage polygon is formed by having the midpoint of each class represent the data in that class and then connecting the sequence of midpoints at their respective class percentages.
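
The points of a percentage polygon for the temperature data can be computed as below. This is a small sketch that assumes the class limits and frequencies from the frequency distribution example; the function name `polygon_points` is illustrative, and drawing the connecting line segments is left to whatever charting tool is in use.

```python
# Class limits and frequencies from the temperature frequency distribution.
classes = [(10, 20, 3), (20, 30, 6), (30, 40, 5), (40, 50, 4), (50, 60, 2)]


def polygon_points(class_table):
    """Return (midpoint, percentage) pairs, one per class, in class order."""
    total = sum(frequency for _, _, frequency in class_table)
    return [((lower + upper) / 2, 100 * frequency / total)
            for lower, upper, frequency in class_table]


points = polygon_points(classes)
for midpoint, percentage in points:
    print(f"midpoint {midpoint:g}: {percentage:g}%")
```

Because each class collapses to a single point, polygons for several groups can share one set of axes without the bars of one group hiding another.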
The cumulative percentage polygon, or ogive, displays the variable of interest along the X axis and the cumulative percentages along the Y axis.
Polygons are useful when there are two or more groups to compare.
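
The coordinates of an ogive for the temperature data can be sketched as follows. The slides do not say where along the X axis each cumulative percentage is plotted; this sketch assumes one common convention, plotting at each class boundary the percentage of values below that boundary. The function name `ogive_points` is illustrative.

```python
from itertools import accumulate

# Class limits and frequencies from the temperature frequency distribution.
classes = [(10, 20, 3), (20, 30, 6), (30, 40, 5), (40, 50, 4), (50, 60, 2)]


def ogive_points(class_table):
    """Return (boundary, cumulative percentage of values below that boundary)."""
    total = sum(frequency for _, _, frequency in class_table)
    running = accumulate(frequency for _, _, frequency in class_table)
    # Assumption: no value lies below the first lower boundary, so the curve starts at 0%.
    points = [(class_table[0][0], 0.0)]
    for (_, upper, _), cumulative in zip(class_table, running):
        points.append((upper, 100 * cumulative / total))
    return points


for boundary, cumulative_percentage in ogive_points(classes):
    print(f"less than {boundary}: {cumulative_percentage:g}%")
```

The Y values match the cumulative percentages in the cumulative frequency distribution table (15%, 45%, 70%, 90%, 100%), with a 0% starting point added.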

Visualizing Numerical Data: The Percentage Polygon

[Figure: percentage polygon]

Visualizing Two Numerical Variables By Using Graphical Displays

Two numerical variables are visualized with:
Scatter plot.
Time-series plot.

The Scatter Plot

Scatter plots are used to examine possible relationships between two numerical variables.
One variable is measured on the vertical axis and the other variable is measured on the horizontal axis.

Volume per day   Cost per day
23               125
26               140
29               146
33               160
38               167
42               170
50               188
55               195
60               200

[Figure: Cost per Day vs. Production Volume. Horizontal axis: Volume per Day, 20 to 70; vertical axis: Cost per Day, 0 to 250.]

The Time Series Plot

A time-series plot is used to study patterns in the values of a numeric variable over time (some variable vs. time).

Year   Number of Franchises
2007   43
2008   54
2009   60
2010   73
2011   82
2012   95
2013   107
2014   99
2015   95

[Figure: Number of Franchises, 2007 to 2015. Horizontal axis: Year; vertical axis: Number of Franchises, 0 to 120.]

Organizing Many Categorical Variables: The Multidimensional Contingency Table

A multidimensional contingency table is constructed by tallying the responses of three or more categorical variables.
In Excel you create a PivotTable to yield an interactive display of this type.
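
Outside Excel, the same kind of tally can be sketched in Python. The records below are hypothetical and exist only to illustrate the mechanics: the third variable (payment method) and every row are invented for this example and do not come from the invoice sample. The helper name `collapse` is also an illustrative choice.

```python
from collections import Counter

# Hypothetical records for illustration only:
# (invoice size, error status, payment method).
records = [
    ("Small", "No Errors", "Card"),
    ("Small", "Errors", "Card"),
    ("Medium", "No Errors", "Transfer"),
    ("Medium", "No Errors", "Card"),
    ("Large", "Errors", "Transfer"),
    ("Small", "No Errors", "Card"),
]

# Tally every combination of the three categorical variables.
cell_counts = Counter(records)


def collapse(counts, keep):
    """Sum the cells over the variables not listed in `keep` (tuple positions)."""
    collapsed = Counter()
    for combination, count in counts.items():
        collapsed[tuple(combination[i] for i in keep)] += count
    return collapsed


# Dropping the third variable gives back an ordinary two-way contingency table.
size_by_error = collapse(cell_counts, keep=(0, 1))

print(cell_counts[("Small", "No Errors", "Card")])
print(size_by_error[("Medium", "No Errors")])
```

Collapsing over one variable plays roughly the role of removing a field from a PivotTable: the remaining cells are sums of the finer-grained counts.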