
STAT3004 Multivariate Data Analysis, Semester 1, AY 2007/2008

Part 2: Logistic Regression

1. Introduction

So far we have considered linear regression models where the response variable is continuous with a normal distribution. Often, though, we have data where the interest is in predicting or "explaining" a binary (or dichotomous) outcome variable by a set of explanatory variables. This type of outcome variable is a two-category variable. Examples are:

- Success/failure of a treatment, explained by dosage of medicine administered, patient's age, sex, weight and severity of condition.
- High/low cholesterol level, explained by sex, age, whether a person smokes or not, etc.
- Use/non-use of contraception, explained by gender, age, whether married, education level, religion, etc.
- Vote for/against a political party, explained by age, gender, education level, region, ethnicity, etc.
- Yes/No or Agree/Disagree responses to questionnaire items in a survey.

Logistic regression is the most popular technique available for modelling dichotomous dependent variables.

2. Modelling Dichotomous Outcome Variables

Let's get technical just for a while! Going back to the simple linear regression model for a moment, what we were doing there was actually specifying a model for the mean value of our outcome variable Y. Statisticians often use the notation E(Y) for the mean of the variable Y in the population; the E stands for "expected value". Our model assumed that this mean was related to the value of the explanatory variable, x, via the linear equation

    E(Y) = \beta_0 + \beta_1 x .    (2.1)

Once we have fitted this model, the fitted line provides an estimate of the mean of the outcome variable Y for a "population" of subjects with explanatory variable value x.

So what happens if Y is dichotomous (rather than continuous)? Now let Y be a dichotomous outcome variable, coded as Y = 1 for the outcome of interest (denoted a "success"), and Y = 0 for the other possible outcome (denoted a "failure"). We use the Greek character π to represent the probability that the "success" outcome occurs in the population. The probability of a "failure" outcome is then 1 − π. In this special case where Y is a binary variable, the mean of Y in the population is equal to π, since E(Y) = 1 × π + 0 × (1 − π) = π. So our model becomes a model for the probability of a "success" outcome,

    \pi = \beta_0 + \beta_1 x .    (2.2)

This is called the linear probability model.

Example

The variable Y denotes presence or absence of evidence of significant coronary heart disease (CHD), so that Y = 1 indicates CHD is present and Y = 0 indicates that it is not present in an individual. We are interested in the relationship between age (X) and the presence or absence of CHD in a population, and we have a sample of 100 subjects selected to participate in a study. See Table 1.1 on page 3 of Hosmer and Lemeshow (2000) for the data.

In a linear regression problem the first thing we would do (I hope!) to get an idea of the relationship between the two variables would be to draw a scatterplot (see Figure 2.1). This is not particularly useful here, and it does not help us to establish whether the mean of Y (i.e. the probability π) is linear in x. To get a better picture of the relationship we could create intervals for the explanatory variable X and compute the mean of Y within each group. In this example, we create age groups 20-29, 30-34, 35-39, ..., 55-59, 60-69, calculate the proportion of "successes" in each age group, and then plot the proportion of "successes" against the mid-point of the age group (see Figure 2.2). This plot shows better the dependence of π on the value of the explanatory variable, x. It would not appear to be linear in x; note the S-shaped curve. The change in π per unit change in x becomes progressively smaller the closer π gets to 0 or 1.

[Figure 2.1: Scatterplot of CHD by Age for 100 Subjects. Age (20-70) on the horizontal axis; CHD (0/1) on the vertical axis.]

[Figure 2.2: Plot of the Proportion of Subjects with CHD in Each Age Group. Age (20-70) on the horizontal axis; proportion with CHD (0.0-1.0) on the vertical axis.]
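For readers who want to reproduce this grouping step, here is a minimal sketch in Python with pandas (our choice of tool; the toy data and the interior group boundaries are assumptions of ours, since the real data are in Hosmer and Lemeshow's Table 1.1 and the notes elide the middle group boundaries):

    import pandas as pd

    # Toy (age, chd) pairs purely to make the sketch runnable -- NOT the
    # Hosmer and Lemeshow data, which are in their Table 1.1.
    df = pd.DataFrame({
        "age": [23, 28, 31, 34, 37, 42, 44, 48, 52, 56, 58, 63, 66, 69],
        "chd": [0,  0,  0,  1,  0,  0,  1,  1,  0,  1,  1,  1,  1,  1],
    })

    # Age-group boundaries: the notes list 20-29, 30-34, 35-39, ..., 55-59,
    # 60-69; the boundaries between 40 and 55 are assumed here.
    bins = [20, 30, 35, 40, 45, 50, 55, 60, 70]
    df["agegrp"] = pd.cut(df["age"], bins=bins, right=False)

    # Proportion of "successes" (CHD = 1) within each age group -- the
    # quantity plotted against the group mid-points in Figure 2.2.
    print(df.groupby("agegrp", observed=True)["chd"].mean())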

So what is the problem with applying the standard linear regression model? It is possible to fit the model (2.2) using the ordinary least squares (OLS) method that we have already come across in linear regression. Indeed, such a model might produce sensible results in some cases. However:

- The predicted values of π obtained from fitting this model may be outside the interval [0, 1]. Since π is a probability, its value must lie within the interval [0, 1]. However, the right-hand side (RHS) of equation (2.2) is unbounded so that, theoretically, the RHS can take on values from −∞ to ∞. This means we could get a predicted probability of, for example, 2.13 from our fitted model, which is rather nonsensical! (It turns out that if 0.25 < π < 0.75 the linear probability model produces fairly sensible results, though.)

- The usual regression assumption of normality of Y is not satisfied: Y is not continuous, it only takes a value of 0 or 1.

What is the solution to this problem? Instead of fitting a model for π, we use a transformation of π. We shall consider the most commonly used transformation, the log of the odds of a "success" outcome, i.e. we shall model \log_e[\pi/(1 - \pi)]. First, though, what do we mean by the "odds"?

3. Probabilities and Odds

The odds are defined as the probability of a "success" outcome divided by the probability of a "failure" outcome:

    \mathrm{odds} = \frac{\Pr(\mathrm{success})}{\Pr(\mathrm{failure})} = \frac{\Pr(\mathrm{success})}{1 - \Pr(\mathrm{success})} = \frac{\pi}{1 - \pi} .    (3.1)

It is easy to convert from probabilities to odds and back again. Note: since π lies between 0 and 1, the odds can take values between 0 and ∞.

Examples

1. If π = 0.8, then the odds are equal to 0.8/(1 − 0.8) = 0.8/0.2 = 4.
2. If π = 0.5, then the odds are equal to 0.5/(1 − 0.5) = 0.5/0.5 = 1. So, if the odds are equal to 1 the chance of "success" is the same as the chance of "failure".
3. If the odds are equal to 0.3, then solving π/(1 − π) = 0.3 gives π = 0.3/1.3 = 0.2308.

So, we can think of the odds as another scale for representing probabilities. We also note here that since division by zero is not allowed, the odds are undefined when the probability of "failure" (i.e. 1 − π) is 0.
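The conversions in these examples are one-liners; a minimal sketch in Python (the function names are ours, not part of any package):

    def prob_to_odds(pi):
        # odds = pi / (1 - pi); undefined (division by zero) when pi = 1
        return pi / (1.0 - pi)

    def odds_to_prob(odds):
        # invert (3.1): pi = odds / (1 + odds)
        return odds / (1.0 + odds)

    print(prob_to_odds(0.8))    # 4.0        (Example 1)
    print(prob_to_odds(0.5))    # 1.0        (Example 2)
    print(odds_to_prob(0.3))    # 0.2308...  (Example 3)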

4. The Logistic Regression Model

The logistic regression model can be written in terms of the log of the odds, called the logit, as

    \mathrm{logit}(\pi) = \log_e\left[\frac{\pi}{1 - \pi}\right] = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \cdots + \beta_k X_k .    (4.1)

The logit is just the (natural) logarithm of the odds. With this model, the range of values that the left-hand side can potentially take is now between −∞ and ∞, which is the same range as that of the right-hand side. Now we have something that looks very familiar: we have a linear model on the logit scale. This is the most common form of the logistic regression model.

An alternative and equivalent way of writing the logistic regression model in (4.1) is in terms of the odds,

    \frac{\pi}{1 - \pi} = e^{\beta_0 + \beta_1 X_1 + \beta_2 X_2 + \cdots + \beta_k X_k} .    (4.2)

With this form of the model, the range of values that the right-hand side can take is now between 0 and ∞ (since the exponential function is non-negative), the same as for the odds.

A third way in which you will see the model written is in terms of the underlying probability of a "success" outcome,

    \pi = \frac{e^{\beta_0 + \beta_1 X_1 + \cdots + \beta_k X_k}}{1 + e^{\beta_0 + \beta_1 X_1 + \cdots + \beta_k X_k}} .    (4.3)

This form is just obtained by re-arranging (4.2). Notice that (4.3) can be written in a slightly different way as

    \pi = \frac{1}{1 + e^{-(\beta_0 + \beta_1 X_1 + \cdots + \beta_k X_k)}} .    (4.4)

We emphasise here that all three forms of the model, (4.1)-(4.3), are equivalent.

Figure 4.1 shows the logit function, i.e. a plot of logit(π) against π. Note that the curve is almost linear for 0.25 < π < 0.75. This is why the linear probability model produces sensible results for π within this range.

[Figure 4.1: A plot of the logit function, logit(π) = \log_e[\pi/(1 - \pi)], against π. π (0.0-1.0) on the horizontal axis; logit(π) (−4 to 4) on the vertical axis.]
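The equivalence of the forms (4.1)-(4.4) is easy to check numerically; a small sketch (the coefficient values here are arbitrary, chosen only for illustration):

    import math

    # Arbitrary illustrative coefficients for a single-predictor model.
    b0, b1, x = -2.0, 0.5, 3.0
    eta = b0 + b1 * x                   # linear predictor: the logit, form (4.1)

    pi = 1.0 / (1.0 + math.exp(-eta))   # form (4.4) gives pi directly
    odds = pi / (1.0 - pi)              # odds of "success"

    print(math.log(odds))               # equals eta:  form (4.1)
    print(math.exp(eta))                # equals odds: form (4.2)
    print(math.exp(eta) / (1 + math.exp(eta)))   # equals pi: form (4.3)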

5. Performing Logistic Regression

Our aim is to quantify the relationship between the probability of a "success" outcome, π, and the explanatory variables X_1, X_2, ..., X_k based on some sample data. For now, we assume that in the population there is a relationship between π and a single continuous explanatory variable X and that this relationship is of the form

    \mathrm{logit}(\pi) = \log_e\left[\frac{\pi}{1 - \pi}\right] = \beta_0 + \beta_1 X .    (5.1)

This model can be estimated using SPSS (or practically any other general-purpose statistical software) as

    \mathrm{logit}(\hat{\pi}) = b_0 + b_1 X ,    (5.2)

where b_0 and b_1 are the estimated regression coefficients. The estimation for logistic regression is commonly performed using the statistical method of maximum likelihood estimation. Explicit closed-form formulae for the estimated regression coefficients, like those obtained in the case of simple linear regression (see equations (1.6) and (1.7) in the linear regression handouts), are not usually available, so numerical methods are used. A proper introduction to these methods is beyond the scope of this course, and so we'll just let SPSS do the estimation for us.

Example: Low Birth Weight Babies

Suppose that we are interested in whether or not the gestational age (GAGE) of the human foetus (number of weeks from conception to birth) is related to the birth weight. The dependent variable is birth weight (BWGHT), which is coded as 1 = normal, 0 = low. The data for 24 babies (7 of whom were classified as having low weight at birth) are shown in Table 5.1.

Table 5.1: Gestational Ages (in Weeks) of 24 Babies by Birth Weight

    Normal Birth Weight (BWGHT = 1):  40, 40, 37, 41, 40, 38, 40, 40, 38, 40, 40, 42, 39, 40, 36, 38, 39
    Low Birth Weight (BWGHT = 0):     38, 35, 36, 37, 36, 38, 37

The model we shall fit is:

    \mathrm{logit}(\pi) = \beta_0 + \beta_1 \mathrm{GAGE} .    (5.3)

The output given by SPSS from fitting this model is shown in Figure 5.1. Under Block 0, SPSS first produces some output which corresponds to fitting the "constant model": this is the model logit(π) = β_0, i.e. the model with the predictor variable excluded. Under Block 1, SPSS presents output for the model that includes the predictor variable. The estimated regression coefficients (see the final table in the output, labelled "Variables in the Equation") are b_0 = −48.908 and b_1 = 1.313, and so our fitted model is

    \mathrm{logit}(\hat{\pi}) = -48.908 + 1.313\,\mathrm{GAGE} .    (5.4)
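The notes do the estimation in SPSS; purely as a cross-check, here is a minimal sketch of the same fit in Python using the statsmodels package (our choice of tool, not part of the course materials), with the data typed in from Table 5.1. The estimates should agree with (5.4) up to the optimiser's tolerance:

    import numpy as np
    import statsmodels.api as sm

    # Table 5.1: gestational ages (weeks); BWGHT coded 1 = normal, 0 = low.
    gage = np.array([40, 40, 37, 41, 40, 38, 40, 40, 38, 40, 40, 42, 39, 40,
                     36, 38, 39,                       # 17 normal birth weight
                     38, 35, 36, 37, 36, 38, 37],      # 7 low birth weight
                    dtype=float)
    bwght = np.array([1] * 17 + [0] * 7)

    # Maximum likelihood fit of logit(pi) = b0 + b1 * GAGE.
    X = sm.add_constant(gage)       # prepend the intercept column
    fit = sm.Logit(bwght, X).fit()
    print(fit.params)               # expect roughly b0 = -48.908, b1 = 1.313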

Two obvious questions present themselves at this stage:

1. How do we know if this model fits the data well?
2. How do we interpret this fitted model?

6. How Good is the Model?

We shall consider several approaches to assess the "fit" of the model. Note that in practice, reporting two or three of these is normally sufficient.

6.1 Classification Table

One way of assessing how well the model fits the observed data is to produce a classification table. This is a simple tool which indicates how good the model is at predicting the outcome variable. As an example, consider the fitted model (5.4). First, we choose a "cut-off" value c (usually 0.5). For each individual in the sample we "predict" their BWGHT condition as 1 (i.e. normal) if their fitted probability of being of normal birth weight is greater than c; otherwise we predict it as 0 (i.e. low). We then construct a table showing how many of the observations we have predicted correctly.
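A sketch of this classification procedure in Python, applying the fitted coefficients from (5.4) directly to the Table 5.1 data (the variable names are ours):

    import numpy as np
    import pandas as pd

    # Table 5.1 data and the fitted model (5.4): logit(pi) = -48.908 + 1.313*GAGE.
    gage = np.array([40, 40, 37, 41, 40, 38, 40, 40, 38, 40, 40, 42, 39, 40,
                     36, 38, 39, 38, 35, 36, 37, 36, 38, 37], dtype=float)
    bwght = np.array([1] * 17 + [0] * 7)

    logit = -48.908 + 1.313 * gage
    fitted_prob = 1.0 / (1.0 + np.exp(-logit))   # form (4.4)

    c = 0.5                                      # the cut-off value
    predicted = (fitted_prob > c).astype(int)

    # Cross-tabulate observed against predicted BWGHT; the diagonal cells are
    # the correct predictions (compare the Block 1 classification table).
    print(pd.crosstab(bwght, predicted, rownames=["Observed"], colnames=["Predicted"]))
    print("overall % correct:", 100 * (predicted == bwght).mean())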

Figure 5.1: SPSS Output from Logistic Regression (Gestational Age Data)

    Case Processing Summary (a)
    Unweighted Cases                           N    Percent
    Selected Cases   Included in Analysis     24      100.0
                     Missing Cases             0         .0
                     Total                    24      100.0
    Unselected Cases                           0         .0
    Total                                     24      100.0
    a. If weight is in effect, see classification table for the total number of cases.

    Dependent Variable Encoding
    Original Value    Internal Value
    .0000             0
    1.0000            1

Block 0: Beginning Block

    Classification Table (a,b)
                                 Predicted BWGHT       Percentage
    Observed BWGHT             .0000       1.0000       Correct
    .0000                          0            7            .0
    1.0000                         0           17         100.0
    Overall Percentage                                     70.8
    a. Constant is included in the model.  b. The cut value is .500.

    Variables in the Equation
                          B     S.E.     Wald    df    Sig.    Exp(B)
    Step 0  Constant   .887     .449    3.904     1    .048     2.429

    Variables not in the Equation
                                 Score    df    Sig.
    Step 0  GAGE                10.427     1    .001
            Overall Statistics  10.427     1    .001

Block 1: Method = Enter

    Omnibus Tests of Model Coefficients
                   Chi-square    df    Sig.
    Step 1  Step       12.676     1    .000
            Block      12.676     1    .000
            Model      12.676     1    .000

    Model Summary
    Step    -2 Log likelihood    Cox & Snell R Square    Nagelkerke R Square
    1       16.298                .410                    .585

    Hosmer and Lemeshow Test
    Step    Chi-square    df    Sig.
    1       1.626          5    .898

    Contingency Table for Hosmer and Lemeshow Test
               BWGHT = .0000           BWGHT = 1.0000
    Group    Observed   Expected     Observed   Expected    Total
    1            1        .951           0        .049         1
    2            2       2.518           1        .482         3
    3            2       1.753           1       1.247         3
    4            2       1.372           3       3.628         5
    5            0        .185           2       1.815         2
    6            0        .213           8       7.787         8
    7            0        .009           2       1.991         2

    Classification Table (a)
                                 Predicted BWGHT       Percentage
    Observed BWGHT             .0000       1.0000       Correct
    .0000                          5            2          71.4
    1.0000                         2           15          88.2
    Overall Percentage                                     83.3
    a. The cut value is .500.

    Variables in the Equation
                           B       S.E.     Wald    df    Sig.    Exp(B)
    Step 1  GAGE        1.313      .541    5.890     1    .015     3.716
            Constant  -48.908    20.338    5.783     1    .016      .000
    a. Variable(s) entered on step 1: GAGE.

In this example (see the penultimate table in the Figure 5.1 SPSS output under Block 1, labelled "Classification Table"), we have 24 cases altogether. Of these, 7 were observed as having low birth weight (BWGHT = 0), and 5 of these 7 we correctly predict, i.e. they have a fitted probability (calculated from our model) of less than 0.5. Similarly, 15 out of the 17 observed as having normal birth weight are correctly predicted, i.e. they have a fitted probability of greater than 0.5. Generally, the higher the overall percentage of correct predictions (in this case 20/24 = 83%) the better the model.
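As an aside, the two R-square values in the Model Summary can be reproduced from quantities elsewhere in the output; a sketch using the standard Cox & Snell and Nagelkerke formulae (the formulae are standard ones, not derived in these notes):

    import math

    n = 24
    m2ll_model = 16.298             # -2 log likelihood of the fitted model (Model Summary)
    chisq = 12.676                  # model chi-square (Omnibus Tests)
    m2ll_null = m2ll_model + chisq  # -2 log likelihood of the constant-only model

    # Cox & Snell: 1 - (L0/L1)^(2/n), which here equals 1 - exp(-chisq/n).
    cox_snell = 1 - math.exp(-chisq / n)
    # Nagelkerke rescales Cox & Snell by its maximum attainable value.
    nagelkerke = cox_snell / (1 - math.exp(-m2ll_null / n))

    print(round(cox_snell, 3))    # 0.410, matching the SPSS output
    print(round(nagelkerke, 3))   # 0.585, matching the SPSS output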
