Chapter 4: Regression Models

Learning Objectives
Students will be able to:
1. Identify variables and use them in a regression model.
2. Develop simple linear regression equations from sample data and interpret the slope and intercept.
3. Compute the coefficient of determination and the coefficient of correlation and interpret their meanings.
4. Interpret the F-test in a linear regression model.
5. List the assumptions used in regression and use residual plots to identify problems.
6. Develop a multiple regression model and use it to predict.
7. Use dummy variables to model categorical data.
8. Determine which variables should be included in a multiple regression model.
9. Transform a nonlinear function into a linear one for use in regression.
10. Understand and avoid common mistakes made in the use of regression analysis.

Chapter Outline
4.1 Introduction
4.2 Scatter Diagrams
4.3 Simple Linear Regression
4.4 Measuring the Fit of a Regression Model
4.5 Using Computer Software for Regression
4.6 Assumptions of the Regression Model
4.7 Testing the Model for Significance
4.8 Multiple Regression Analysis
4.9 Binary or Dummy Variables
4.10 Model Building
4.11 Nonlinear Regression
4.12 Cautions and Pitfalls in Regression Analysis

Introduction
Regression analysis is a very valuable tool for today's manager. Regression is used to:
- understand the relationship between variables.
- predict the value of one variable based on another variable.
Cost estimation models are a good example.

A regression model consists of a dependent, or response, variable and one or more independent, or predictor, variables.
Dependent Variable = Independent Variable(s) (prediction relationship)

Scatter Diagram
A scatter diagram is used to graphically investigate the relationship between the dependent and independent variables. Plot the
dependent variable on the Y axis. Plot the independent variable on the X axis.

Triple A Construction Example
Triple A Construction Company renovates old homes in Albany. The company has found that its dollar volume of renovation work depends on the Albany area payroll.

Triple A Sales ($100,000s)    Local Payroll ($100,000,000s)
6                             3
8                             4
9                             6
5                             4
4.5                           2
9.5                           5

[Figure: scatter diagram (payroll line fit plot) with the dependent variable, Sales ($100,000s), on the Y axis and the independent variable, Payroll ($100,000,000s), on the X axis.]

Simple Linear Regression
\( Y = \beta_0 + \beta_1 X + \epsilon \)
where
Y = dependent variable (response)
X = independent variable (predictor / explanatory)
\( \beta_0 \) = intercept (value of Y when X = 0)
\( \beta_1 \) = slope of the regression line
\( \epsilon \) = random error

Regression models are used to test whether a relationship exists between variables; that is, whether one variable can be used to predict another. However, there is some random error that cannot be predicted.

Sample data are used to estimate the true
values for the intercept and slope.
\( \hat{Y} = b_0 + b_1 X \)
where \( \hat{Y} \) = predicted value of Y.
The difference between the actual value of Y and the predicted value (using sample data) is known as the error.
Error = (actual value) - (predicted value)
\( e = Y - \hat{Y} \)

Least Squares Regression
Least squares regression minimizes the sum of the squared errors.
[Figure: payroll line fit plot of Sales ($100,000s) against Payroll ($100,000,000s).]

Least Squares Regression Equations
The least squares regression equations are:
\( \hat{Y} = b_0 + b_1 X \)
\( b_1 = \frac{\sum (X - \bar{X})(Y - \bar{Y})}{\sum (X - \bar{X})^2} = \frac{\sum XY - \frac{\sum X \sum Y}{n}}{\sum X^2 - \frac{(\sum X)^2}{n}} = \frac{\sum XY - n \bar{X} \bar{Y}}{\sum X^2 - n \bar{X}^2} \)
\( b_0 = \bar{Y} - b_1 \bar{X} = \frac{\sum Y}{n} - b_1 \frac{\sum X}{n} \)

Calculating the Regression Line: Triple A Construction

Sales (Y)    Payroll (X)    \( (X - \bar{X})^2 \)    \( (X - \bar{X})(Y - \bar{Y}) \)
6            3              1                        1
8            4              0                        0
9            6              4                        4
5            4              0                        0
4.5          2              4                        5
9.5          5              1                        2.5
Summations for each column:
42           24             10                       12.5

\( \bar{Y} = 42/6 = 7 \)
\( \bar{X} = 24/6 = 4 \)

Calculating the required parameters:
\( b_1 = \frac{\sum (X - \bar{X})(Y - \bar{Y})}{\sum (X - \bar{X})^2} = \frac{12.5}{10} = 1.25 \)
\( b_0 = \bar{Y} - b_1 \bar{X} = 7 - (1.25)(4) = 2 \)
So, \( \hat{Y} = 2 + 1.25X \)

Using the Regression Line
If the payroll estimate for next year were $600 million, what is the predicted value of Triple A's sales?
\( \hat{Y} = 2 + 1.25X \)
Sales = 2 + 1.25 (payroll)
Payroll is measured in $100,000,000s, so $600 million is X = 6.
So, next year's sales = 2 + 1.25(6) = 9.5, that is, $950,000.

Measuring the Fit of the Regression Model
To understand how well the model predicts the response variable, we evaluate the following:
- The variability in the Y variable:
  SST: total variability about the mean
  SSE: variability about the regression line
  SSR: variability that is explained
- Coefficient of determination, \( r^2 \): proportion of explained variation
- Correlation coefficient, r: strength of the relationship between the Y and X variables
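
The Triple A slope, intercept, and next-year forecast can be reproduced in a few lines of Python. This is a minimal sketch, not part of the original slides: the function name `fit_line` is hypothetical, and it uses the deviation form of the least squares equations.

```python
# Triple A Construction data from the slides.
sales = [6, 8, 9, 5, 4.5, 9.5]    # Y, in $100,000s
payroll = [3, 4, 6, 4, 2, 5]      # X, in $100,000,000s


def fit_line(x, y):
    """Return (b0, b1) for the least squares line y = b0 + b1 * x."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    # Sum of cross-deviations and sum of squared X deviations.
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    b1 = sxy / sxx
    b0 = y_bar - b1 * x_bar
    return b0, b1


b0, b1 = fit_line(payroll, sales)
forecast = b0 + b1 * 6            # a $600 million payroll is X = 6
print(b0, b1, forecast)           # 2.0 1.25 9.5
```

The forecast of 9.5 is in units of $100,000, so it corresponds to $950,000 in sales.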
Measuring the Fit of the Regression Model (continued)
Sum of Squares Total (SST) measures the total variability in Y.
Sum of the Squared Error (SSE) is less than the SST because the regression line reduces the variability.
Sum of Squares due to Regression (SSR) indicates how much of the total variability is explained by the regression model.
Errors (deviations) may be positive or negative. Summing the errors would be misleading, so we square the terms before summing.
\( SST = \sum (Y - \bar{Y})^2 \)
\( SSE = \sum e^2 = \sum (Y - \hat{Y})^2 \)
\( SSR = \sum (\hat{Y} - \bar{Y})^2 \)

For Triple A Construction:
\( SST = \sum (Y - \bar{Y})^2 = 22.5 \)
\( SSE = \sum e^2 = \sum (Y - \hat{Y})^2 = 6.875 \)
\( SSR = \sum (\hat{Y} - \bar{Y})^2 = 15.625 \)
Note: SST = SSR + SSE (explained variability + unexplained variability)

Coefficient of Determination
The coefficient of determination (\( r^2 \)) is the proportion of the variability in Y that is explained by the regression equation.
\( r^2 = \frac{SSR}{SST} = 1 - \frac{SSE}{SST} \)
For Triple A Construction:
\( r^2 = \frac{15.625}{22.5} = 0.6944 \)
About 69% of the variability in sales is explained by the regression based on payroll.
Note: \( 0 \le r^2 \le 1 \)
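
As a check on the Triple A figures, the three sums of squares and the coefficient of determination can be computed directly from the data. The sketch below assumes the fitted line with intercept 2 and slope 1.25 reported for Triple A; the variable names are illustrative only.

```python
sales = [6, 8, 9, 5, 4.5, 9.5]    # Y, in $100,000s
payroll = [3, 4, 6, 4, 2, 5]      # X, in $100,000,000s
b0, b1 = 2.0, 1.25                # fitted line reported for Triple A

y_bar = sum(sales) / len(sales)
predicted = [b0 + b1 * x for x in payroll]

# Total, unexplained, and explained variability.
sst = sum((y - y_bar) ** 2 for y in sales)
sse = sum((y - p) ** 2 for y, p in zip(sales, predicted))
ssr = sum((p - y_bar) ** 2 for p in predicted)

r_squared = ssr / sst
print(sst, sse, ssr)              # 22.5 6.875 15.625
print(round(r_squared, 4))        # 0.6944
```

Both forms of the formula agree here: 15.625/22.5 and 1 - 6.875/22.5 give the same value because SST = SSR + SSE.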
Correlation Coefficient
The correlation coefficient (r) measures the strength of the linear relationship.
\( r = \frac{n \sum XY - \sum X \sum Y}{\sqrt{\left[ n \sum X^2 - (\sum X)^2 \right] \left[ n \sum Y^2 - (\sum Y)^2 \right]}} \)
For Triple A Construction, r = 0.8333.
Note: \( -1 \le r \le 1 \)

Computer Software for Regression
In Excel, use Tools / Data Analysis. This is an add-in option.

After you select the regression option, the regression settings appear. Specify:
- X and Y ranges
- labels, if they are included in the range
- output area
- residual (error) output
- scatter diagram output

[Figure: Excel regression output, annotated as follows.]
- High \( r^2 \) (close to 1)
- Multiple R is the correlation coefficient (r)
- Regression coefficients
- A scatter diagram will be given.
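
Outside a spreadsheet, the correlation coefficient for Triple A can be checked from raw sums. The following is a sketch under the usual computational formula for r (numerator n ΣXY - ΣX ΣY, denominator the square root of the product of the two bracketed terms); the function name `correlation` is hypothetical.

```python
import math

sales = [6, 8, 9, 5, 4.5, 9.5]    # Y, in $100,000s
payroll = [3, 4, 6, 4, 2, 5]      # X, in $100,000,000s


def correlation(x, y):
    """Correlation coefficient r from raw sums."""
    n = len(x)
    sum_x, sum_y = sum(x), sum(y)
    sum_xy = sum(a * b for a, b in zip(x, y))
    sum_x2 = sum(a * a for a in x)
    sum_y2 = sum(b * b for b in y)
    numerator = n * sum_xy - sum_x * sum_y
    denominator = math.sqrt(
        (n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2)
    )
    return numerator / denominator


r = correlation(payroll, sales)
print(round(r, 4))        # 0.8333
print(round(r ** 2, 4))   # 0.6944
```

Squaring r recovers the coefficient of determination, and the sign of r matches the sign of the slope (positive here).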
Assumptions of the Regression Model
We make certain assumptions about the errors in a regression model which allow for statistical testing.
Assumptions:
- Errors are independent.
- Errors are normally distributed.
- Errors have a mean of zero.
- Errors have a constant variance.

Residual Analysis
Residual analyses (plots) will highlight glaring violations of the assumptions.
[Figure: healthy residual plot, residuals against X scattered around 0: no violations.]

Residual Analysis: Nonlinear Violation
[Figure: nonlinear residual plot, residuals against X: violation.]

Residual Analysis: Nonconstant Error
[Figure: nonconstant error residual plot, residuals against X: violation.]

Estimating the Variance
The mean squared error (MSE) is the estimate of the error variance of the regression equation.
\( s^2 = MSE = \frac{SSE}{n - k - 1} \)
where
n = number of observations in the sample
k = number of independent variables
For Triple A Construction, \( s^2 = 1.7188 \).

The standard deviation of the regression is used in many statistical tests about the regression model.
\( s = \sqrt{MSE} \)
For Triple A Construction, s = 1.31.

Testing the Model for Significance: F-test
An F-test is used to statistically test the null hypothesis that there is no linear relationship between the X and Y variables (i.e., \( \beta_1 = 0 \)). If the significance level for the F-test is low, we reject \( H_0 \) and conclude there is a linear relationship.
\( F = \frac{MSR}{MSE} \), where \( MSR = \frac{SSR}{k} \)

For Triple A Construction:
\( MSR = \frac{15.625}{1} = 15.625 \)
\( F = \frac{15.625}{1.7188} = 9.0909 \)
The significance level for F = 9.0909 is 0.0394, so we reject \( H_0 \) and conclude that a linear relationship exists between sales and payroll.
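
The variance estimate and the F statistic for Triple A follow from the sums of squares in a few arithmetic steps. The sketch below takes SSE = 6.875 and SSR = 15.625 as given; the function name `f_statistic` is hypothetical. It stops at the F value, because the significance level needs the F distribution with k and n - k - 1 degrees of freedom, which the Python standard library does not provide.

```python
import math


def f_statistic(sse, ssr, n, k):
    """Return (mse, s, msr, f) for a regression with k predictors."""
    mse = sse / (n - k - 1)   # estimate of the error variance, s^2
    s = math.sqrt(mse)        # standard deviation of the regression
    msr = ssr / k
    return mse, s, msr, msr / mse


# Triple A Construction: 6 observations, 1 independent variable.
mse, s, msr, f = f_statistic(sse=6.875, ssr=15.625, n=6, k=1)
print(round(mse, 4), round(s, 2), msr, round(f, 4))   # 1.7188 1.31 15.625 9.0909

# If SciPy happens to be installed (an assumption, not a requirement),
# scipy.stats.f.sf(f, 1, 4) gives the significance level of this F value.
```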
Testing the Model for Significance: \( r^2 \)
\( r^2 \) is the best measure of the strength of the prediction relationship between the X and Y variables.
Values closer to 1 indicate a strong prediction relationship.
Good regression models have a significant F-test and high \( r^2 \) values.

Testing the Model for Significance: Coefficient Hypotheses
Statistical tests of significance can be performed on the coefficients.
The null hypothesis is that the coefficient of X (i.e., the slope of the line) is 0.
P-values are the observed significance levels and can be used to test the null hypothesis.
For a simple linear regression, the test of the regression coefficient gives the same information as the F-test.

ANOVA Tables
When developing a regression model, an ANOVA table is computed by most statistical software. The general form of the ANOVA table is helpful for understanding the interrelatedness of the error terms.

              DF           SS     MS     F          Significance
Regression    k            SSR    MSR    MSR/MSE    P-value
Residual      n - k - 1    SSE    MSE
Total         n - 1        SST

Multiple Regression
Multiple regression models are similar to simple linear regression models except they include more than one X variable.
\( \hat{Y} = b_0 + b_1 X_1 + b_2 X_2 + \dots + b_k X_k \)
where \( X_1, \dots, X_k \) are the independent variables and \( b_1, \dots, b_k \) are their slopes.
Multiple Regression: Wilson Realty Example
Wilson Realty wants to develop a model to determine the suggested listing price for a house based on size and age.

Price     Sq. Feet    Age    Condition
35000     1926        30     Good
47000     2069        40     Excellent
58900     1706        32     Mint
60000     1847        38     Mint
78500     2285        26     Mint
79000     3752        35     Good
95000     3800        40     Excellent
87500     2300        18     Good
93000     2525        17     Good
67000     1950        27     Mint
70000     2323        30     Excellent
49900     1720        30     Excellent
55000     1396        15     Good
97000     1740        12     Mint

Regression results:
- 67% of the variation in sales price is explained by size and age.
- \( H_0 \): no linear relationship, is rejected.
- \( H_0 \): \( \beta_1 = 0 \), is rejected.
- \( H_0 \): \( \beta_2 = 0 \), is rejected.
\( \hat{Y} = 60815.45 + 21.91(\text{size}) - 1449.34(\text{age}) \)

Wilson Realty has found a linear relationship between price and size and age. The coefficient for size indicates that each additional square foot increases the value by $21.91, while each additional year of age decreases the value by $1,449.34.
For a 1,900-square-foot house that is 10 years old, the following prediction can be made:
$87,951 = 60815.45 + 21.91(1900) - 1449.34(10)

Binary Variables
Binary (or dummy) variables are special variables that are created for qualitative data.
A dummy variable is assigned a value of 1 if a particular condition is met and a value of 0 otherwise.
The number of dummy variables must equal one less than the number of categories of the qualitative variable.
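
Two of the ideas above lend themselves to a short sketch: evaluating the size-and-age equation for a given house, and turning a qualitative variable into one-fewer-than-the-categories dummy variables. The function names `predict_price` and `dummy_encode` are hypothetical, and treating Good as the baseline category for the condition column is an illustrative choice.

```python
def predict_price(size_sqft, age_years):
    """Suggested price from the Wilson Realty size-and-age equation."""
    return 60815.45 + 21.91 * size_sqft - 1449.34 * age_years


def dummy_encode(value, categories, baseline):
    """One 0/1 indicator per category except the baseline.

    With c categories this yields c - 1 dummy variables; a row in the
    baseline category has every indicator equal to 0.
    """
    return {c: int(value == c) for c in categories if c != baseline}


price = predict_price(1900, 10)
print(round(price))    # 87951

conditions = ["Good", "Excellent", "Mint"]
print(dummy_encode("Mint", conditions, baseline="Good"))
# {'Excellent': 0, 'Mint': 1}
print(dummy_encode("Good", conditions, baseline="Good"))
# {'Excellent': 0, 'Mint': 0}
```

Leaving out one category avoids a redundant column: once the other indicators are known to be 0, the baseline is implied.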
Wilson Realty Example: Binary Variables
Return to Wilson Realty, and let's evaluate how to use property condition in the regression model. There are three categories: Mint, Excellent, and Good.
\( X_3 \) = 1 if the house is in excellent condition, = 0 otherwise
\( X_4 \) = 1 if the house is in mint condition, = 0 otherwise
Note: If both \( X_3 \) and \( X_4 \) = 0, then the house is in good condition.

\( \hat{Y} = 48329.23 + 28.21(\text{size}) - 1981.41(\text{age}) + 23684.62(\text{if mint}) + 16581.32(\text{if excellent}) \)
What can you say about the new model?

Model Building
As more variables are added to the model, \( r^2 \) usually increases.
The adjusted \( r^2 \) takes into account the number of independent variables in the model.
The best model is a statistically significant model with a high \( r^2 \) and few variables.
Note: When variables are added to the model, the value of \( r^2 \) can never decrease; however, the adjusted \( r^2 \) may decrease.

Collinearity or multicollinearity exists when an independent variable is correlated with another independent variable.
Collinearity and multicollinearity create problems in the coefficients. The overall model prediction is still good; however, individual interpretation of the variables is questionable.

Nonlinear Regression
Nonlinear relationships may exist between variables, thereby requiring a transformation of one or more variables to achieve linearity.
Transformations may be used to turn a nonlinear model into a linear model.
Automobile Example: Nonlinear Regression
Engineers at Colonel Motors want to use regression analysis to improve fuel efficiency. They are studying the impact of weight on miles per gallon (MPG).

MPG    Weight (1,000 lbs)
12     4.58
13     4.66
15     4.02
18     2.53
19     3.09
19     3.11
20     3.18
23     2.68
24     2.65
33     1.70
36     1.95
42     1.92

[Figure: MPG against Weight (1,000 lbs), showing a linear regression line and a nonlinear regression line.]
Perhaps a nonlinear relationship exists?

Linear regression model:
\( \text{MPG} = 47.6 - 8.2(\text{weight}) \)
F significance = .0003
\( r^2 = .7446 \)

Nonlinear (transformed variable) regression model:
\( \text{MPG} = 79.8 - 30.2(\text{weight}) + 3.4(\text{weight})^2 \)
F significance = .0002
\( r^2 = .8478 \)

Which model is best? What are the difficulties
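
The transformed-variable model treats weight squared as a second predictor, so the curve can be fitted with ordinary multiple regression. The sketch below is one way to reproduce both fits using only the standard library. It assumes the twelve MPG and weight pairs are matched as shown in the comments, and the helper names `solve` and `fit` are hypothetical. The normal equations are solved by Gaussian elimination purely to keep the example self-contained; a statistics package would normally do this step.

```python
# MPG and weight (1,000 lbs) for the twelve cars; pairing is assumed.
mpg = [12, 13, 15, 18, 19, 19, 20, 23, 24, 33, 36, 42]
weight = [4.58, 4.66, 4.02, 2.53, 3.09, 3.11,
          3.18, 2.68, 2.65, 1.70, 1.95, 1.92]


def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]   # augmented matrix
    for col in range(n):
        # Partial pivoting: move the largest entry into the pivot row.
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):                     # back substitution
        tail = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x


def fit(predictors, y):
    """Least squares fit of y on an intercept plus the predictor columns.

    Returns (coefficients, r_squared); coefficients[0] is the intercept.
    """
    rows = [[1.0] + list(vals) for vals in zip(*predictors)]
    p = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(p)]
           for i in range(p)]
    xty = [sum(r[i] * yi for r, yi in zip(rows, y)) for i in range(p)]
    coef = solve(xtx, xty)
    fitted = [sum(c * v for c, v in zip(coef, r)) for r in rows]
    y_bar = sum(y) / len(y)
    sse = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))
    sst = sum((yi - y_bar) ** 2 for yi in y)
    return coef, 1 - sse / sst


linear_coef, linear_r2 = fit([weight], mpg)
weight_sq = [w ** 2 for w in weight]     # the transformed variable
quad_coef, quad_r2 = fit([weight, weight_sq], mpg)

print([round(c, 1) for c in linear_coef], round(linear_r2, 4))
print([round(c, 1) for c in quad_coef], round(quad_r2, 4))
```

The model stays linear in its coefficients even though it is curved in weight, which is why the same least squares machinery applies. Adding the squared term cannot lower \( r^2 \), so the higher value alone does not settle which model is better.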