Logistic regression

Programme

• 2:05 pm Talk

• 3:15 pm Coffee break

• 3:45 pm A short exercise

• 4:30 pm Finish

Revision

Correlation and regression

A study

• Does watching screened violence promote violent behaviour in children?

• In a study of the effects of media violence, some children were measured on their Actual violence and on their Exposure to screened violence.

• Here is a scatterplot of Actual violence against Exposure.

The scatterplot

• Each point in the plot represents one child.

• The coordinates of the point are the child’s scores on Exposure to and Actual violence.

• A strong statistical ASSOCIATION between Exposure to and Actual violence is evident from the shape of the cloud of points.

A linear association

• An elliptical scatterplot such as this indicates a LINEAR association between the variables.

The Pearson correlation

• Where the association is linear, its strength is measured by the Pearson correlation, the formula for which is:
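
In the sums-of-squares-and-products notation defined on the next slide, the standard formula is:

$$ r = \frac{SP_{XY}}{\sqrt{SS_X \, SS_Y}} $$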

Sums of squares and products
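
For scores X and Y with means X̄ and Ȳ, the quantities in the formula above are:

$$ SS_X = \sum (X - \bar{X})^2, \qquad SS_Y = \sum (Y - \bar{Y})^2, \qquad SP_{XY} = \sum (X - \bar{X})(Y - \bar{Y}) $$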

Alternative formula for the Pearson correlation
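
Equivalently, r can be written in terms of the covariance of X and Y and their standard deviations:

$$ r = \frac{\operatorname{cov}(X, Y)}{s_X \, s_Y} $$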

Regression

• Regression is a set of statistical techniques enabling the researcher to exploit an association among variables to PREDICT the values of one variable from those of others.

• From regression, you can also ascertain the extent to which the variance of a target variable can be EXPLAINED or accounted for in terms of the other variables.

Some key terms

• The variable we are trying to predict or account for is known variously as the DEPENDENT VARIABLE (DV), the CRITERION, or the TARGET VARIABLE.

• The predictors are known as the INDEPENDENT VARIABLES (IVs) or REGRESSORS.

• In our current example, the DV is Actual violence and the IV is Exposure to screen violence.

Simple regression and multiple regression

• In SIMPLE REGRESSION, there is just ONE IV or regressor.

• In MULTIPLE REGRESSION, there are TWO OR MORE IVs or regressors.

The regression line

• In simple regression, a line called the REGRESSION LINE is drawn through the points.

• The regression line is the line that fits the points most closely, according to the LEAST SQUARES CRITERION.

• The traditional approach to regression is therefore referred to as ORDINARY LEAST SQUARES (OLS) REGRESSION.

A linear equation

The equation of a straight line
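
In the notation used in the rest of this deck (slope b, intercept c), a straight line is:

$$ Y = bX + c $$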

The regression line

The regression equation
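
The regression line gives a predicted score Y′ for each value of X:

$$ Y' = bX + c $$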

On the graph …

Interpretation of the slope or regression coefficient

• The slope or REGRESSION COEFFICIENT is the average number of units of change in the DV that result from a change of one unit on the IV.

• In our example, slope = .74.

• So an increase of one unit in Exposure produces, on average, an increase of .74 of a unit (rather less than one unit) in Actual violence.

The residual (e)
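
The residual is the difference between a case's observed and predicted scores:

$$ e = Y - Y' $$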

Joe’s residual score

• Joe scored 8 on Exposure and 9 on Actual.

• Joe’s predicted score from the regression, Y′, is 8.

• Joe’s residual score is (9 – 8) = 1.

Least squares criterion for goodness-of-fit

• The values of the slope and intercept of the regression line are such that the sum of the squares of the residuals (SS_error) is a minimum.

A unique solution

• The values of b and c needed to achieve the least squares criterion are given by the formulae below.

• Clearly, the regression coefficient b is closely related to the Pearson correlation.
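
The least-squares values, in the notation of the earlier slides, are:

$$ b = \frac{SP_{XY}}{SS_X} = r \, \frac{s_Y}{s_X}, \qquad c = \bar{Y} - b\bar{X} $$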

Relation between the regression coefficient and the correlation coefficient

• The value of the regression coefficient is directly proportional to the value of the correlation coefficient.

Regression and correlation

• Regression and correlation are two sides of the same associative coin.

• The higher the correlation, the narrower the elliptical cloud of points in the scatterplot.

• For fixed values of the variances of X and Y, the higher the correlation, the greater will be the value of the regression coefficient.

The violence data

A negative correlation

• So far, I have considered only positive correlations. Here’s a negative one.

• Does the number of complaints made against GPs vary inversely with the average length of their appointments?

• The following scatterplot supports this hypothesis.

A strong negative correlation

The signs of b and r

• The regression coefficient and the correlation always have the same sign.

• For the violence data, both are positive.

• For the data on GPs’ appointments, both are negative.

• Note that a negative correlation of –0.89 indicates an association as strong as one for which the correlation is +0.89.

Complete independence

• I take two random samples, each of size 10,000 from a normal population with mean 100 and SD 25. (The Syntax for doing this is shown in the appendix.)

• Since there should be no association between the two samples, the correlation between them should be zero.

• The scatterplot will be CIRCULAR.

• The regression line will be HORIZONTAL, that is, with zero slope.

No association

The case of independence

Intercept-only regression

• The regression line is horizontal and passes through the value 100 on the y-axis. This is the MEAN VALUE OF THE DISTRIBUTION OF THE DEPENDENT VARIABLE.

• Here the intercept of the regression line is equal to the mean value of Y and its slope is zero.

• When X and Y are independent, you can only predict the mean value of Y, WHATEVER THE VALUE OF X.

• This is known as INTERCEPT-ONLY REGRESSION.

Model-building

• When testing the goodness-of-fit of regression models to the data, a useful baseline is provided by the INDEPENDENCE MODEL, which makes intercept-only predictions of the dependent variable by predicting the mean value of the DV whatever the value of the IV.

• In several computing procedures, this is labelled as STEP 0 in the analysis.

• A good regression model should be a big improvement upon the independence model.

The coefficient of determination (r²)

• The square of the Pearson correlation is known as the COEFFICIENT OF DETERMINATION.

• It is so called because r² is the proportion of the variance of Y that is accounted for by regression upon X.
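
In terms of sums of squares:

$$ r^2 = \frac{SS_{\text{regression}}}{SS_Y} = 1 - \frac{SS_{\text{error}}}{SS_Y} $$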

Coefficient of determination

Cohen’s (1988) guidelines

• Cohen’s (1988) benchmarks for the Pearson correlation are: small, r = .10; medium, r = .30; large, r = .50.

• In terms of the coefficient of determination, these correspond to roughly 1%, 9% and 25% of variance accounted for.

Multiple regression model

• We could try to expand our regression model to include additional variables, such as the level of parental violence and the parents’ levels of formal education.

• We should then have to determine the relative importance of the various IVs and whether we needed to include all of them in the regression model.

• These are problems in MULTIPLE REGRESSION.

Equations for simple and multiple regression

• In the multiple regression equation, b0 is the CONSTANT and b1, b2, …, bp are the PARTIAL REGRESSION COEFFICIENTS.
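
The two equations, side by side:

$$ Y' = bX + c \quad \text{(simple)} \qquad\qquad Y' = b_0 + b_1 X_1 + b_2 X_2 + \dots + b_p X_p \quad \text{(multiple)} $$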

Partial regression coefficients

• In multiple regression, a PARTIAL REGRESSION COEFFICIENT is the estimated average change in the DV resulting from an increase of one unit in one particular IV with ALL THE OTHER IVs HELD CONSTANT.

The multiple correlation coefficient R

• The MULTIPLE CORRELATION COEFFICIENT (R) is the correlation between the target variable Y and the corresponding predictions Y′ of Y from regression upon the IVs.

Notation

• When it is necessary to specify which variables are involved in a multiple regression, a subscript notation is used.

• The multiple correlation between Y and X1, X2, …, Xp is written R_Y·12…p.

Properties of R

• R can never take a negative value, because the sign of the slope of the regression line is always the same as that of the correlation.

• Recall that the Pearson correlation varies within the range from –1 to +1, inclusive.

• The multiple correlation R, however, can only take values between zero and +1, inclusive.

The case of one IV

• The multiple correlation coefficient is defined even in simple regression, where there is only one IV.

• Here, since R can never be negative, it takes the ABSOLUTE VALUE of the Pearson correlation (r) between X and Y, even when r has a negative value.

• So in SPSS, R is included in the output for simple regression as well as in the output for multiple regression.

The coefficient of multiple determination R²

• In multiple regression, THE COEFFICIENT OF MULTIPLE DETERMINATION R² is the proportion of variance of the dependent variable Y that is accounted for by regression upon the IVs.

A spatial representation of the coefficient of multiple determination

What if the DV is a set of categories?

• Simple and multiple OLS regression assume that all the variables are CONTINUOUS, that is, measured on an interval scale with units.

• But suppose we want to predict whether a person will suffer from a heart attack or contract a certain illness with known risk factors.

• Here, we are predicting not a VALUE, but CATEGORY MEMBERSHIP.

Regression with a categorical DV

The two most commonly used techniques are:

1. Logistic regression

2. Discriminant analysis

Discriminant analysis

• If all (or most) IVs are continuous, you might consider using DISCRIMINANT ANALYSIS (DA).

• But the DA model makes assumptions about the distributions of the IVs (such as multivariate normality) which research data often fail to satisfy.

• Moreover, DA doesn’t like qualitative IVs, such as sex or nationality.

• For these reasons, logistic regression is increasingly preferred to DA when the DV is categorical.

Categorical IVs

• Unlike DA, logistic regression is happy with qualitative IVs; in fact, logistic regression is happy even if ALL the IVs are qualitative.

A research question

• It is suspected that smoking and drinking are risk factors in the incidence of a pre-morbid blood condition.

• Records of the incidence of the condition in 100 patients are available, together with estimates of the amounts that they smoke and drink.

The data

How many of the patients have the antibody?

Use Frequencies

Frequencies dialog

Forty-four of the hundred patients have the antibody

The odds

• In an EXPERIMENT OF CHANCE (tossing a coin, rolling a die) the ODDS in favour of an event is the number of ways in which the event could occur, divided by the number of ways in which it could fail to occur.

• If a die is rolled, there is one way of getting a six and there are five ways of not getting a six.

• The odds in favour of a six are therefore 1/5.

Odds in favour of having the antibody

• We know that out of 100 patients, 44 have the antibody. We select a person at random from this group.

• There are 44 ways of selecting a person with the antibody; and there are 56 ways of selecting someone without it.

• The odds in favour of the person having the antibody are 44/56 = .79.

Probability

• A probability is a measure of likelihood ranging from 0 (an impossibility) to 1 (a certainty).

• The CLASSICAL DEFINITION of probability, like that of the odds, also arises in the context of an experiment of chance.

• The probability p of an event is the number of ways it can happen divided by the TOTAL number of possible outcomes.

• When a die is rolled, there are six possible outcomes.

• There is one way of getting a six.

• The probability of a six when a die is rolled is therefore 1/6.

Relationship between probability and the odds

• Probability and the odds are both measures of likelihood and have been defined in the same context – an experiment of chance.

• They are related according to the equations below.
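
$$ \text{odds} = \frac{p}{1 - p}, \qquad p = \frac{\text{odds}}{1 + \text{odds}} $$

For the die, p = 1/6 gives odds = (1/6)/(5/6) = 1/5, as before.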

Logarithms

• In a logarithmic system, numbers are expressed as powers (logarithms) of a constant called the BASE of the system.

• In COMMON LOGS, the base is 10.

• In NATURAL LOGS, the base is the mathematical constant e, where e is approximately 2.72.

Definition of a logarithm

• The logarithm of a number is the POWER to which the base must be raised to give the number itself.

• So the log of 100 to the base ten is 2, because 10² = 100.

• The log of 1000 to the base ten is 3, because 10³ = 1000.

Notation

• The log of x to base 10 (the common log) is written log₁₀x, or simply log x; the log of x to base e (the natural log) is written ln x.

Logs and antilogs

• Before the IT revolution, calculations involving large numbers were done by converting the numbers to logs, working with the logs (which are much smaller numbers), then reversing the log function with the ANTILOG FUNCTION to get back to the original number scale.

Logs of sums and products

• The log of the PRODUCT of two numbers is the SUM of their logs.

• The log of the QUOTIENT of two numbers is the DIFFERENCE between their logs.
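
In symbols, for any base:

$$ \log(xy) = \log x + \log y, \qquad \log\!\left(\frac{x}{y}\right) = \log x - \log y $$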

Two advantages of logs

• You are working with SMALLER numbers.

• You are replacing the operations of MULTIPLICATION and DIVISION with the easier ones of ADDITION and SUBTRACTION.

Antilogs

• The ANTILOG function reverses the process of obtaining a logarithm.
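
For the two bases used here, the antilogs are:

$$ \operatorname{antilog}_{10}(x) = 10^x, \qquad \operatorname{antilog}_e(x) = e^x = \exp(x) $$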

The common antilog

The natural antilog

Antilogs (base 10)

Antilogs (base e)

The antilog of a sum

• The antilog of the SUM of two logs is the PRODUCT of the numbers themselves.
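
In base e:

$$ e^{\ln x + \ln y} = e^{\ln x} \, e^{\ln y} = xy $$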

An asymmetrical measure

• The ODDS suffers from ASYMMETRY OF RANGE.

• Extremely unlikely events have odds confined between 0 and 1; whereas very likely events can have huge odds running into millions.

• Two very likely events could be separated by millions in terms of odds; two very unlikely events will be separated by minute fractions.

The log odds or logit

• The LOG ODDS (LOGIT) is the natural logarithm (log to the base e) of the odds.
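
In symbols:

$$ \operatorname{logit} = \ln(\text{odds}) = \ln\!\left(\frac{p}{1 - p}\right) $$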

Even Steven

• Suppose the odds were 50 to 50 (50/50 = 1).

• The natural log of 1 is zero (e⁰ = 1).

• So for raw odds of 50 to 50, the logit (log odds) is zero.

Range of the logit

• The logit has a symmetrical range: a positive sign means the odds are in favour; a negative sign means the odds are against.

• Unlike the odds, which has a lower limit of zero, the logit has neither an upper nor a lower finite limit.

Example

• In our current example, the odds in favour of a case having the antibody are 44/56 = 11/14 = .79.

• logit = ln(.79) = –.24. The event is less likely than not, hence the negative sign.

• If the odds in favour had been 56/44, the logit would have been ln(56/44) = ln(1.27) = +.24.

• Notice the symmetry of the scale of magnitude around the neutral point at 0.

Odds as antilogs

• A number such as the odds can be written as an ANTILOG, that is, the base e to the power of the natural log of the odds (the logit):
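
$$ \text{odds} = e^{\ln(\text{odds})} = e^{\operatorname{logit}} $$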

Probability and the logit

• We can therefore express the probability in terms of the logit, rather than the odds.
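
Substituting odds = e^logit into p = odds/(1 + odds):

$$ p = \frac{e^{\operatorname{logit}}}{1 + e^{\operatorname{logit}}} $$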

The logistic regression function

• We have arrived at the LOGISTIC REGRESSION FUNCTION:
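
Writing the logit as a linear function of the IVs (the logit equation, discussed below), the function is:

$$ p = \frac{e^{b_0 + b_1 X_1 + \dots + b_p X_p}}{1 + e^{b_0 + b_1 X_1 + \dots + b_p X_p}} = \frac{1}{1 + e^{-(b_0 + b_1 X_1 + \dots + b_p X_p)}} $$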

Assumptions of logistic regression

• Either you have the antibody or you don’t.

• As smoking and alcohol increase, however, the probability of having the antibody is assumed to increase CONTINUOUSLY as a function of the IVs.

• In logistic regression, we estimate the probability of having the antibody with the LOGISTIC REGRESSION FUNCTION.

• If the estimated probability exceeds a cut-off (usually set at 0.5), the case is classified by the program as a Yes, rather than a No.

A logistic regression function

Logistic regression and logit functions

• We have seen that the logistic regression function is non-linear.

• The logit function, however, is assumed to be linear.

The logit equation

• The logit function looks like an OLS linear regression equation, with a constant and partial regression coefficients.
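
$$ \operatorname{logit}(p) = \ln\!\left(\frac{p}{1 - p}\right) = b_0 + b_1 X_1 + b_2 X_2 + \dots + b_p X_p $$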

Typical graph of the logit function

The decision rule

• If the estimated probability exceeds the cut-off of 0.5, the case is classified as a Yes; otherwise it is classified as a No.

Interpretation of a logistic regression coefficient

• The partial regression coefficient is the increase in the LOG ODDS or LOGIT arising from an increase of one unit in the independent variable.

• If one unit is added, the logit becomes:
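
For a single IV with coefficient b:

$$ \operatorname{logit}_{\text{new}} = b_0 + b(X + 1) = \operatorname{logit}_{\text{old}} + b $$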

New odds

• To find the new odds you must MULTIPLY the old odds by exp(b).
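
Taking antilogs of the result on the previous slide:

$$ \text{odds}_{\text{new}} = e^{\operatorname{logit}_{\text{old}} + b} = \text{odds}_{\text{old}} \times e^b $$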

Example

• In terms of the ODDS, an increase of one unit in the IV MULTIPLIES the original odds by the ANTILOG of b, that is, by e^b, or exp(b).

• If b = 1.1, exp(1.1) = 3.0.

• So an increase of one smoking unit results in the odds being MULTIPLIED by 3: the odds in favour of the antibody being present are THREE times as great for those who smoke one unit more.

The regression problem

• In the logit equation, we must find values of the constant and partial regression coefficients such that correct assignment to categories by the logistic regression function is maximised.

No mathematical solution

• In logistic regression, there is no equivalent of the formulae for the intercept and coefficients in OLS regression.

• A ‘brute force’ computing algorithm is used whereby, starting at arbitrary values of the coefficients, the values are progressively adjusted to try to arrive at a set which maximises the likelihood of obtaining the observed frequencies.
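
The idea can be sketched in a few lines of Python. This is a minimal illustration of the iterative scheme (a plain Newton–Raphson loop, with hypothetical variable names), not the exact algorithm SPSS uses:

```python
import numpy as np

def fit_logistic(X, y, n_iter=25, tol=1e-8):
    """Iteratively estimate logistic regression coefficients.

    X: predictor matrix whose first column is all 1s (for the constant);
    y: 0/1 outcomes. Returns the estimated coefficients.
    """
    beta = np.zeros(X.shape[1])              # start at arbitrary values
    for step in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-X @ beta))  # fitted probabilities
        grad = X.T @ (y - p)                 # gradient of the log-likelihood
        W = p * (1.0 - p)                    # weights for the Hessian
        hess = X.T @ (X * W[:, None])        # information matrix
        delta = np.linalg.solve(hess, grad)  # Newton-Raphson adjustment
        beta += delta                        # progressively adjust estimates
        if np.max(np.abs(delta)) < tol:      # 'convergence': estimates stable
            print(f"Converged after {step + 1} iterations")
            break
    return beta
```

Printing beta at each pass of the loop is the analogue of the ITERATION HISTORY discussed below.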

Iteration and ‘convergence’

• In a process known as ITERATION, estimates of the parameters are calculated again and again in the hope that they will ‘converge’ to stable values.

• IT DOESN’T ALWAYS HAPPEN!

• We must therefore check that this ‘convergence’ really has been achieved by examining the ITERATION HISTORY in the SPSS output.

Potential difficulties

• The algorithm will not run successfully if the IVs are too highly correlated. This is the familiar MULTICOLLINEARITY PROBLEM sometimes encountered in OLS regression.

Centring

• As with OLS multiple regression, it is a good idea to CENTRE variables, by subtracting the mean from each score, so that the mean of the transformed scores is zero.

• CENTRING leaves the correlations among the variables unchanged.

• But centring makes the algorithm less likely to crash when there are substantial correlations among the variables.
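
Centring is a one-line transformation. A sketch in Python, with illustrative scores:

```python
import numpy as np

smoke = np.array([10.0, 25.0, 0.0, 40.0, 15.0])  # illustrative raw scores
smoke_c = smoke - smoke.mean()                    # centred scores
print(smoke_c.mean())                             # 0.0: the new mean is zero
```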

Finding binary logistic regression

Covariates

In SPSS logistic regression dialogs, IVs that are scale or continuous variables are known as COVARIATES.

Always ask for the ITERATION HISTORY, so that you can check whether the algorithm was able to arrive at a stable estimate.

Dire warning!

• Should the iteration history show failure to converge, the results of the analysis can be ridiculous!

• The effects of failure to converge are not limited to the IV concerned: they can mess up the whole analysis!

The logistic regression dialog

Fitting a model

• The goodness-of-fit of a model is measured by a log likelihood chi-square statistic.

• The SMALLER the value of chi-square, the BETTER the fit.

• The LARGER the p-value, the better.

Step 0 in logistic regression

• We know that 44/100 people have the condition.

• Armed only with this fact, and with no knowledge of any associations there might be among the variables, we shall maximise our hit rate if we predict ABSENCE of the condition for ANY person selected at random.

• This is the equivalent, in logistic regression, of INTERCEPT-ONLY (no-regression) prediction in OLS regression: there you just guess the mean of Y (M_Y), whatever the value of X; here you always guess that the condition is absent.

Here is the logistic regression output for Step 0

Iteration history at Step 0

Classification table at Step 0

Iteration history after applying regression model

Classification table at Step 1 (after the regression model has been applied)

The Nagelkerke R² statistic

• The NAGELKERKE statistic is the counterpart of the coefficient of determination R² in OLS multiple regression.

• It can be interpreted, approximately, as the PROPORTION OF THE TOTAL VARIANCE in incidence of the antibody accounted for by the regression.

The Nagelkerke R² statistic

Hosmer and Lemeshow contingency table

Goodness-of-fit test

The Wald statistic

• The WALD STATISTIC tests a regression coefficient for significance.

• The null hypothesis is that, in the population, the coefficient is zero.

• The Wald statistic is distributed approximately as chi-square on one degree of freedom.

Some regression statistics

• The Wald statistic confirms that Smoking has an effect (p-value is very small) but Alcohol does not (the p-value is large).

The regression coefficient

The logit equation

The logistic regression equation

Graph of accuracy of prediction

Conclusion

• The incidence of the blood condition is indeed predictable from regression: the model raises the hit rate from 56% (the Step 0 baseline) to 85%.

• Smoking contributes significantly to the model.

• Alcohol does not contribute significantly to the model.

The next step

• This session has been merely an introduction to the technique of logistic regression.

• The next step is to do some further reading.

Getting started

• There’s an elementary section on logistic regression in

– Kinnear, P., & Gray, C. (2011). IBM SPSS Statistics 18 made simple. Hove: Psychology Press. Chapter 14.

• This is mainly a practical, get-started guide; but there is an outline of the rationale of the technique as well.

The next step

• Dugard, P., Todman, J., & Staines, H. (2010). Approaching multivariate analysis: A practical introduction (2nd ed.). London & New York: Routledge.

Sage paperbacks

• Menard, S. (2002). Applied logistic regression analysis (2nd ed.). London: Sage.

• Jaccard, J. (2001). Interaction effects in logistic regression. London: Sage.

• Tabachnick, B. G., & Fidell, L. S. (2007). Using multivariate statistics (5th ed.). Boston: Allyn & Bacon. Chapter 10.

• Field, A. (2009). Discovering statistics using SPSS for Windows: Advanced techniques for the beginner (3rd ed.). London: Sage. Chapter 6.

Appendix

Using syntax to draw random samples from specified populations

Drawing two samples from a normal distribution with mean 100 and SD 25
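
A sketch of the same idea in Python (the deck’s own listing used SPSS Syntax); it draws the two independent samples and confirms the near-zero correlation claimed earlier:

```python
import numpy as np

rng = np.random.default_rng(seed=1)

# Two independent samples of 10,000 from a normal population
# with mean 100 and SD 25
x1 = rng.normal(loc=100, scale=25, size=10_000)
x2 = rng.normal(loc=100, scale=25, size=10_000)

# The samples are independent, so the correlation should be close to zero
r = np.corrcoef(x1, x2)[0, 1]
print(f"r = {r:.4f}")
```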