
Page 1: November  13, 2013

November 13, 2013

• Collect model fits for 4 problems
• Return reports
• VIFs
• Launch chapter 11

1

Page 2: November  13, 2013

Grocery Data Assignment

X3 (holiday) and X1 (cases shipped) do the job and X2 adds nothing; normality and constant variance are okay; no need for quadratic terms or interactions; and no need at all to square X3 (a two-level factor).

2

Page 3: November  13, 2013

Some notes
• State the conclusion (final model) up front
• Report the model from a fit of X1 and X3, rather than from a fit with X1, X2, X3 that just drops X2
• Check assumptions for the X1, X3 model, not the model with X2 as well
• Box-Cox suggests no transformation even though λ = 2 is “best”
• Interaction bit

3

Page 4: November  13, 2013

Notes on writing aspects
• Avoid the imperative form of verbs (“Fit the multivariate. Run the model. Be good.” — i.e., (You) Verb …)
• Don’t use contractions. It’s bad form. They’re considered informal.
• Spell check does not catch wrong words (e.g., “blow” instead of “below”, “not” instead of “note”)
• Writing skills are important (benefits considerable)

4

Page 5: November  13, 2013

VIFs (not BFFs)
• Variance Inflation Factors
• VIFi = (1 − Ri²)⁻¹, where Ri² is the R-squared from regressing Xi on the other Xi’s
• Available in JMP if you know where to look
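As an illustration outside JMP, here is a minimal Python sketch of this calculation, assuming a pandas DataFrame X whose columns are the predictors:

import pandas as pd
from sklearn.linear_model import LinearRegression

def vif(X: pd.DataFrame) -> pd.Series:
    """Variance inflation factors: VIF_i = 1 / (1 - R_i^2), where R_i^2 comes
    from regressing column i of X on the remaining columns."""
    out = {}
    for col in X.columns:
        others = X.drop(columns=col)
        r2 = LinearRegression().fit(others, X[col]).score(others, X[col])
        out[col] = 1.0 / (1.0 - r2)
    return pd.Series(out)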

5

Page 6: November  13, 2013

• x3 appears to be both a “dud” for predicting y and not very collinear with either x1 or x2

6

Page 7: November  13, 2013

• In computing VIFs, need to regress x3 on x1 and x2, and compute 1/(1-RSquare) which here is about 100. What gives?

7

Page 8: November  13, 2013

8

Page 9: November  13, 2013

Added variable plot to the rescue

Note that 1.6084963 matches the earlier value. (Suggest you run the other way as well.)

Regress x3 and x2 on x1 and save the residuals

9

Page 10: November  13, 2013

10

Page 11: November  13, 2013

11

Page 12: November  13, 2013

12

Page 13: November  13, 2013

13

Page 14: November  13, 2013

Interesting body fat example

• Looks like very little collinearity
• However, massive multicollinearity!
• This is why it can be challenging at times!!!
• Why we don’t throw extra dud variables into the model

14

Page 15: November  13, 2013

Steps in the analysis
• Multivariate to get acquainted with data
• (Analyze distribution, all variables)
• Looking for a decent model—parsimonious
  – Linear, interactions, quadratic
  – Stepwise if many variables
  – PRESS vs. root mean square error
  – Added variable plots according to taste
• Check assumptions (lots of plots)
• Check for outliers, influential observations (hats and Cook’s Di)

15

Page 16: November  13, 2013

16

Chapter 11: Remedial Measures
We’ll cover in some detail:
11.1 Weighted Least Squares
11.2 Ridge regression
11.4 Regression trees
11.5 Bootstrapping

Page 17: November  13, 2013

17

11.1 Weighted Least Squares
Suppose that the constant variance assumption does not hold. Each residual has a different variance, but keep the zero covariances:
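A hedged reconstruction of the displayed variance-covariance matrix (the original equation image is not preserved in this transcript): zero covariances, but each error gets its own variance.

\[
\sigma^2\{\boldsymbol{\varepsilon}\} =
\begin{bmatrix}
\sigma_1^2 & 0 & \cdots & 0 \\
0 & \sigma_2^2 & \cdots & 0 \\
\vdots & \vdots & \ddots & \vdots \\
0 & 0 & \cdots & \sigma_n^2
\end{bmatrix}
\]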

Least squares is out—what should we do?

Page 18: November  13, 2013

18

Use Maximum Likelihood for inspiration!
Likelihood:

Now define the i-th weight to be:

Then the likelihood is:
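A hedged reconstruction of the three displayed equations on this slide (the images are not in the transcript): the normal likelihood with unequal variances, the weight definition, and the likelihood rewritten in terms of the weights.

\[
L(\boldsymbol{\beta}) = \prod_{i=1}^{n} \frac{1}{\sqrt{2\pi}\,\sigma_i}
\exp\!\left[-\frac{(Y_i - \mathbf{X}_i'\boldsymbol{\beta})^2}{2\sigma_i^2}\right],
\qquad
w_i = \frac{1}{\sigma_i^2},
\qquad
L(\boldsymbol{\beta}) = \prod_{i=1}^{n} \sqrt{\frac{w_i}{2\pi}}
\exp\!\left[-\frac{w_i\,(Y_i - \mathbf{X}_i'\boldsymbol{\beta})^2}{2}\right]
\]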

Page 19: November  13, 2013

19

Taking logarithms, we get: the log likelihood is a constant plus the weighted sum of squares shown below.
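A hedged reconstruction of the displayed log likelihood and the resulting criterion Qw (the images are not in the transcript):

\[
\log L(\boldsymbol{\beta}) = \text{constant} - \tfrac{1}{2}\sum_{i=1}^{n} w_i\,(Y_i - \mathbf{X}_i'\boldsymbol{\beta})^2,
\qquad
Q_w = \sum_{i=1}^{n} w_i\,(Y_i - \mathbf{X}_i'\boldsymbol{\beta})^2
\]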

Criterion is same as least squares, except each squared residual is weighted by wi -- hence the weighted least squares criterion.

The coefficient vector bw that minimizes Qw is the vector of weighted least squares estimates

Page 20: November  13, 2013

20

Matrix Approach to WLS

Let:

Then:
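A hedged reconstruction of the two displays: the weight matrix W and the resulting weighted least squares estimator.

\[
\mathbf{W} =
\begin{bmatrix}
w_1 & 0 & \cdots & 0 \\
0 & w_2 & \cdots & 0 \\
\vdots & \vdots & \ddots & \vdots \\
0 & 0 & \cdots & w_n
\end{bmatrix},
\qquad
\mathbf{b}_w = (\mathbf{X}'\mathbf{W}\mathbf{X})^{-1}\,\mathbf{X}'\mathbf{W}\mathbf{Y},
\qquad
\sigma^2\{\mathbf{b}_w\} = (\mathbf{X}'\mathbf{W}\mathbf{X})^{-1}
\]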

Page 21: November  13, 2013

21

Page 22: November  13, 2013

22

Usual Case: Variances are Unknown

Need to estimate each variance!

Recall:
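A hedged reconstruction of the “Recall” display: since E{εi} = 0,

\[
\sigma_i^2 = E\{\varepsilon_i^2\} - \big(E\{\varepsilon_i\}\big)^2 = E\{\varepsilon_i^2\}
\]

so a squared residual behaves like a rough one-observation estimate of the variance.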

Give a statistic that can estimate σi²:

Give a statistic that can estimate σi:

Page 23: November  13, 2013

23

Estimating a Standard Deviation Function
Step 1: Do ordinary least squares; obtain the residuals.
Step 2: Regress the absolute values of the residuals against Ŷ, or whatever predictor(s) seem to be associated with changes in the variances of the residuals.
Step 3: Use the predicted absolute residual for case i, |êi|, as the estimated standard deviation of ei; call it ŝi.
Step 4: Then wi = (1/ŝi)².

Page 24: November  13, 2013

• Subset x and y for Table 11.1
• Fit y on x and save the residuals; compute the absolute values of the residuals
• Regress these absolute residuals on x
• The predicted values are the estimated standard deviations
• The weights are the reciprocals of the squared estimated standard deviations
• Use these weights with WLS on the original y and x variables to get ŷ = 55.566 + .5963 x
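A minimal Python sketch of the same recipe, using statsmodels in place of JMP and assuming 1-D NumPy arrays x and y already hold the Table 11.1 data:

import numpy as np
import statsmodels.api as sm

# x, y: 1-D NumPy arrays, assumed already loaded with the Table 11.1 data
X = sm.add_constant(x)

# Step 1: ordinary least squares; keep the residuals
ols_fit = sm.OLS(y, X).fit()
abs_resid = np.abs(ols_fit.resid)

# Step 2: regress |e_i| on x; the fitted values estimate the standard deviations
sd_fit = sm.OLS(abs_resid, X).fit()
s_hat = sd_fit.fittedvalues

# Steps 3-4: weights are the reciprocals of the squared estimated standard deviations
w = 1.0 / s_hat**2

# WLS on the original y and x
wls_fit = sm.WLS(y, X, weights=w).fit()
print(wls_fit.params)  # should come out close to (55.566, 0.5963)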

24

Page 25: November  13, 2013

25

Pictures

Page 26: November  13, 2013

26

Example

Page 27: November  13, 2013

27

Notes on WLS Estimates
1. WLS estimates are minimum variance, unbiased.
2. If you use Ordinary Least Squares (OLS) when the variance is not constant, the estimates are still unbiased, just not minimum variance.
3. If you have replicates at each unique X category, you can just use the sample standard deviation of the responses at each category to determine the weight for any response in the category.
4. R² has no clear-cut meaning here.
5. Must use the standard deviation function value (instead of s) for confidence intervals for prediction.

Page 28: November  13, 2013

28

11.2 Ridge Regression
Biased regression to reduce the effect of multicollinearity.

Shrinkage estimation: Reduce the variance of the parameters by shrinking them (a bit) in absolute magnitude. This will introduce some bias, but may reduce the MSE overall. Recall: MSE = bias squared plus variance:
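A hedged reconstruction of the MSE decomposition referred to above, for an estimator θ̂ of θ:

\[
\mathrm{MSE}(\hat{\theta}) = E\big[(\hat{\theta} - \theta)^2\big]
= \mathrm{Var}(\hat{\theta}) + \big[\mathrm{Bias}(\hat{\theta})\big]^2
\]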

Page 29: November  13, 2013

29

Page 30: November  13, 2013

30

How to Shrink?

Penalized least squares!

Start with standardized regression model:

Add a “penalty” proportional to the total size of the parameters (proportionality or biasing constant is c):
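A hedged sketch of the penalized criterion for the standardized model (the slide's own display is not in the transcript; starred symbols denote standardized variables and coefficients): minimize the usual sum of squares plus c times the total squared size of the coefficients.

\[
Q_R = \sum_{i=1}^{n}\Big(Y_i^{*} - \sum_{k} \beta_k^{*} X_{ik}^{*}\Big)^{2}
\;+\; c \sum_{k} \big(\beta_k^{*}\big)^{2}
\]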

Page 31: November  13, 2013

31

Matrix Ridge Solution
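A hedged reconstruction of the displayed matrix solution for the standardized (correlation-form) model, where r_XX is the predictor correlation matrix and r_YX is the vector of correlations of the predictors with Y:

\[
\mathbf{b}^{R} = \big(\mathbf{r}_{XX} + c\,\mathbf{I}\big)^{-1}\,\mathbf{r}_{YX}
\]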

Start with small c and increase (iteratively) until the coefficients stabilize. Plot is called “ridge trace”

Here, use c about equal to .02

Page 32: November  13, 2013

32

Example

Page 33: November  13, 2013

• Decision trees… section 11.4
  – Kddnuggets.com suggests…
  – A bit of history
  – Breiman et al.
  – Respectability
  – Oldie but goodie slides SA
  – Titanic data

• Some bootstrapping stuff (probably not tonight)

33

Page 34: November  13, 2013

34

Taxonomy of Methods

Rows give the Response (Dependent) Variable type; columns give the Predictor (or Explanatory or Independent) Variable type.

• Quantitative response
  – Quantitative predictors: Regression
  – Qualitative predictors: Regression or ANOVA
  – Mixture: Regression or Analysis of Covariance
• Qualitative response, 2 levels
  – Quantitative predictors: Logistic Regression or Discriminant Analysis
  – Qualitative predictors: Logistic Regression or Log-linear Models
  – Mixture: Logistic Regression
• Qualitative response, more than 2 levels
  – Quantitative predictors: Polytomous Logistic Regression or Discriminant Analysis
  – Qualitative predictors: Polytomous Logistic Regression or Log-linear Models
  – Mixture: Polytomous Logistic Regression

Page 35: November  13, 2013

35

Data mining and Predictive Modeling
Predictive modeling mainly involves the application of:
• Regression
• Logistic regression
• Regression trees
• Classification trees
• Neural networks
to very large data sets.

The difference, technically, is that because we have so much data, we can rely on the use of validation techniques---the use of training, validation and test sets to assess our models. There is much less concern about:
- Statistical significance (everything is significant!)
- Outliers/influence (a few outliers have no effect)
- Meaning of coefficients (models may have thousands of predictors)
- Distributional assumptions, independence, etc.

Page 36: November  13, 2013

36

Data mining and Predictive Modeling
We will talk about some of the statistical techniques used in predictive modeling, once the data have been gathered, cleaned, and organized. But data gathering usually involves merging disparate data from different sources, data warehouses, etc., and usually represents at least 80% of the work.

General Rule: Your organization’s data warehouse will not have the information you need to build your predictive model. (Paraphrased, Usama Fayyad, VP data, Yahoo)

Page 37: November  13, 2013

37

Regression Trees

Idea: Can we cut up the predictor space into rectangles such that the response is roughly constant in each rectangle, but the mean changes from rectangle to rectangle?

We’ll just use the sample average (Ȳ) in each rectangle as our predictor!

Simple, easy-to-calculate, assumption-free, nonparametric regression method. Note there is no “equation.” The predictive model takes the form of a decision tree.

Page 38: November  13, 2013

Steroid Data
• See file ch11ta08steroidSplitTreeCalc.jmp
• Overall average of y is 17.64; SSE is 1284.8

38

Page 39: November  13, 2013

39

Example: Steroid Data

[Figure: the fitted regression tree (“Fit”) and the resulting predictive model; the region means are 3.55, 8.133, 13.675, 16.95, and 22.2.]

Example: What is Ŷ at Age = 9.5?

Page 40: November  13, 2013

40

How do we find the regions (i.e., grow the tree)?

For one predictor X, it’s easy.

Step 1: To find the first split point Xs, make a grid of possible split points along the X axis. Each possible split point divides the X axis into two regions, R21 and R22. Now compute SSE for the two-region regression tree:

Do this for every grid point. The point that leads to the minimum SSE is the split point.

Step 2: If you now have r regions, determine the best split point for each of the r regions as you did in Step 1; choose the one that leads to the lowest SSE for the r + 1 regions.

Step 3: Repeat Step 2 until SSE levels off (more later on stopping).
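The displayed SSE formula is not in the transcript; the sketch below uses the standard two-region criterion (within-region sums of squared deviations from the region means). A minimal Python sketch of the Step 1 grid search for one predictor, assuming 1-D NumPy arrays x and y:

import numpy as np

def best_first_split(x, y):
    """Return (split point, SSE) minimizing SSE of a two-region mean predictor."""
    order = np.argsort(x)
    xs, ys = x[order], y[order]
    # candidate split points: midpoints between consecutive x values
    mids = np.unique((xs[1:] + xs[:-1]) / 2.0)
    best_s, best_sse = None, np.inf
    for s in mids:
        left, right = ys[xs <= s], ys[xs > s]
        if len(left) == 0 or len(right) == 0:
            continue
        sse = ((left - left.mean()) ** 2).sum() + ((right - right.mean()) ** 2).sum()
        if sse < best_sse:
            best_s, best_sse = s, sse
    return best_s, best_sse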

Page 41: November  13, 2013

Illustrate first split with Steroid Data
• See file ch11ta08steroidSplitTreeCalc.jmp
• Overall average of y is 17.64; SSE is 1284.8

41

Page 42: November  13, 2013

In the JMP file, aforementioned…

• Point out the calculations needed to determine the optimal first split
• Easy but a bit tedious
• Binary vs. multiple splits
• Run it in JMP; be sure to set the min # in splits
• Fit a conventional model as well…

42

Page 43: November  13, 2013

43

Growing the Steroid Level Tree

Split 1 Split 2

Split 3 Split 4

Page 44: November  13, 2013

44

When do we stop growing?

If you let the growth process go on forever, you’ll eventually have n regions, each with just one observation. The mean of each region is the value of that observation, and R² = 100%. (You fitted n means (parameters), so you have n – n = 0 degrees of freedom for error.) Where to stop??

We decide by data splitting and cross-validation. After each split, use your model (tree) to predict each observation in a hold-out sample and compute MSPR or R² (holdout). As we saw with OLS regression, MSPR will start to increase (R² for the holdout will decrease) when we overfit.

We can rely on this because we have very large sample sizes.

Page 45: November  13, 2013

45

What about multiple predictors?
For two or more predictors, no problem.

For each region, we have to determine the best predictor to split on AND the best split point for that predictor. So if we have p – 1 predictors, and at stage r we have r regions, there are r(p – 1) possible split points.

Example: Three splits for two predictors

Page 46: November  13, 2013

46

GPA Data Results (text)

Page 47: November  13, 2013

47

Using JMP for Regression Trees

• Analyze >> Modeling >> Partition
• Exclude at least 1/3 for a validation sample using: Rows >> Row Selection >> Select Randomly; then Rows >> Exclude
• JMP will automatically give the predicted R² value (1 – SSE/SSTO for the validation set)
• You need to manually call for a split (it doesn’t fit the tree automatically)

Page 48: November  13, 2013

48

Note: R² for the hold-out sample — as you grow the tree this value will peak and begin to decline!

Split button: clicking the red triangle gives options; select “split history” to see a plot of predicted R² vs. number of splits.

Page 49: November  13, 2013

49

Classification Trees
• The regression tree equivalent of logistic regression
• Response is binary 0–1; the average response in each region is now p̂, not Ȳ
• For each possible split point, instead of SSE, we compute the G² statistic for the resulting 2-by-r contingency table.

Split goes to the smallest value. (Can also use the negative of the log(p-value), where the p-value is adjusted in a Bonferroni-like manner. This is called the “Logworth” statistic. Again, you want a small value.)

\[
G^2 = 2 \sum_{i=1}^{2} \sum_{j=1}^{r} \text{Observed}_{ij}\,
\log\!\left(\frac{\text{Observed}_{ij}}{\text{Expected}_{ij}}\right)
\]
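A minimal Python sketch of the G² formula above for a single candidate split, assuming a 2-by-2 table of observed counts (split region by 0/1 response) and expected counts from the usual independence margins (the slide’s “Expected” may be defined differently); the counts in the last line are made up purely for illustration:

import numpy as np

def g_squared(observed):
    """G^2 = 2 * sum( Observed * log(Observed / Expected) ) over all cells."""
    observed = np.asarray(observed, dtype=float)
    row = observed.sum(axis=1, keepdims=True)
    col = observed.sum(axis=0, keepdims=True)
    expected = row * col / observed.sum()
    mask = observed > 0  # treat 0 * log(0) as 0
    return 2.0 * np.sum(observed[mask] * np.log(observed[mask] / expected[mask]))

print(g_squared([[8, 2], [3, 7]]))  # hypothetical counts for two regions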

Page 50: November  13, 2013

50

Page 51: November  13, 2013

51

Page 52: November  13, 2013

52

Page 53: November  13, 2013

53

Understanding ROC and Lift Charts
Assessing the ability to classify a case (predict) correctly in logistic regression, classification trees, or neural networks (with binary responses), as a function of the cutoff value chosen.

ROC Curve: Plot the true positive rate [P(Ŷ = 1 | Y = 1)] vs. the false positive rate [P(Ŷ = 1 | Y = 0)].

Example 1: Classify the top 40% (of predicted probabilities) as 1 and the bottom 60% as 0. Same as a cutoff of .45 here.

Pred prob:       .49  .48  .47  .46   .43  .41  .38  .36  .32  .29
Data:             1    1    1    0     0    1    1    0    1    0
Classification:   1    1    1    1     0    0    0    0    0    0
                  ---- Top 40% ---     ------- Bottom 60% -------

Page 54: November  13, 2013

54

Calculating sensitivity (true pos) and 1-specificity (false pos)

True positive rate: P(Ŷ = 1 | Y = 1) = 3/6 = .5 (Y axis value)
False positive rate: P(Ŷ = 1 | Y = 0) = 1/4 = .25 (X axis value)
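A minimal Python sketch of this calculation, using the Example 1 numbers from the previous slide:

import numpy as np

p_hat = np.array([.49, .48, .47, .46, .43, .41, .38, .36, .32, .29])
y     = np.array([  1,   1,   1,   0,   0,   1,   1,   0,   1,   0])

cutoff = 0.45
y_hat = (p_hat >= cutoff).astype(int)   # top 40% classified as 1

tpr = np.mean(y_hat[y == 1] == 1)  # P(Yhat = 1 | Y = 1) -> 3/6 = 0.5
fpr = np.mean(y_hat[y == 0] == 1)  # P(Yhat = 1 | Y = 0) -> 1/4 = 0.25
print(tpr, fpr)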

[Figure: scatterplot of TruePos (y axis) vs. FalsePos (x axis) marking this ROC point.]

Page 55: November  13, 2013

55

Example 2: Classify the top 40% (of predicted probabilities) as 1 and the bottom 60% as 0. Same as a cutoff of .45 here.

Pred prob:       .49  .48  .47  .46   .43  .41  .38  .36  .32  .29
Data:             1    1    1    0     0    0    0    0    0    0
Classification:   1    1    1    1     0    0    0    0    0    0
                  ---- Top 40% ---     ------- Bottom 60% -------

True positive rate: P(Ŷ = 1 | Y = 1) = 3/3 = 1.0 (Y axis value)
False positive rate: P(Ŷ = 1 | Y = 0) = 1/7 = .1428 (X axis value)

[Figure: scatterplot of TruePos (y axis) vs. FalsePos (x axis) marking this ROC point.]

Page 56: November  13, 2013

56

11.5 Bootstrapping in Regression

Bootstrapping:

A method that uses computer simulation, rather than theory and analytical results, to obtain sampling distributions of statistics. From these we can estimate the precision of an estimator.

Page 57: November  13, 2013

57

Background: Simulated Intervals

Suppose:

1. Our objective is to get a confidence interval for the slope in a simple linear regression setting.

2. We know the distribution of Y at each X value.

How can we use computer simulation (Minitab) to get a confidence interval for the slope?

Page 58: November  13, 2013

58

Background: Simulated Intervals

Easy:

1. Obtain a random Y value for each of the n X points

2. Compute the regression.

3. Store b1

Do the above, say 1,000,000 times. Do a histogram of the b1 values, use the .025 and .975 percentiles!

Page 59: November  13, 2013

59

Simulated Intervals: Example

Toluca company data: Assume E(Y) = 62 + 3.6 X and σ = 50

That is: Y ~ N(62 + 3.6 X, σ = 50).

Exec:
let k1 = k1 + 1
random 25 c3;
normal 0 50.
let c4 = 62 + 3.6*X + c3
Regress c4 1 'X';
Coefficients c5.
let c6(k1) = c5(2)

Page 60: November  13, 2013

60

What if we don’t know the distribution of errors?

Answer: Use the empirical distribution:

1. Fit the model (assume it’s true).
2. Then in the simulation, for each run, obtain a random sample of n residuals (with replacement) from the n observed residuals.
3. Compute the new Y values, run the regression, and store the bootstrap slope value, b1*.

This is the basic approach to the fixed-X sampling bootstrap
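A minimal Python sketch of this fixed-X residual bootstrap for a simple linear regression (an illustration, not the course’s Minitab macro), assuming 1-D NumPy arrays x and y are already loaded:

import numpy as np

rng = np.random.default_rng(0)

# x, y: 1-D NumPy arrays, assumed already loaded
X = np.column_stack([np.ones(len(x)), x])
b = np.linalg.lstsq(X, y, rcond=None)[0]      # base OLS fit
fitted, resid = X @ b, y - X @ b

boot_b1 = []
for _ in range(5000):
    e_star = rng.choice(resid, size=len(resid), replace=True)  # resample residuals
    y_star = fitted + e_star                                   # new responses; X stays fixed
    b_star = np.linalg.lstsq(X, y_star, rcond=None)[0]
    boot_b1.append(b_star[1])

lo, hi = np.percentile(boot_b1, [2.5, 97.5])  # percentile interval for the slope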

Page 61: November  13, 2013

61

To obtain a confidence interval:

1. Could use percentiles, as previously.

2. A better approach is the reflection method:

d1 = b1 – b1*(α/2)
d2 = b1*(1 – α/2) – b1

b1 – d2 ≤ β1 ≤ b1 + d1

Page 62: November  13, 2013

62

Random-X Sampling Version

When error variances are not constant or predictor variables cannot be regarded as fixed constants, random X sampling is used:

For each bootstrap sample, we sample a (Y, X) pair with replacement from the data set.

In effect we sample rows of the data set with replacement.
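A minimal Python sketch of the random-X (pairs) version, again assuming 1-D NumPy arrays x and y are already loaded: resample whole rows rather than residuals.

import numpy as np

rng = np.random.default_rng(0)

n = len(y)                                    # x, y assumed already loaded
boot_b1 = []
for _ in range(5000):
    idx = rng.integers(0, n, size=n)          # sample (X, Y) rows with replacement
    Xb = np.column_stack([np.ones(n), x[idx]])
    b_star = np.linalg.lstsq(Xb, y[idx], rcond=None)[0]
    boot_b1.append(b_star[1])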

Page 63: November  13, 2013

63

Fixed X Example—Toluca Data

let k1 = k1 + 1
sample 25 c3 c5;
replace.
let c6 = c4 + c5
Regress c6 1 'X';
Coefficients c7.
let c8(k1) = c7(2)

Assume that the base regression has been run. We have stored the residuals in column c3, predicted values in c4.

Page 64: November  13, 2013

64

Neural Networks
The i-th observation is modeled as a nonlinear function of m derived predictors, H0, …, Hm−1.

Page 65: November  13, 2013

65

Neural Networks
OK, so what is gY and how are the predictors derived?

gY is usually a logistic function, and the Hj are nonlinear functions of linear combinations of the predictors X.

Here Xi is the i-th row of the X matrix

Page 66: November  13, 2013

66

Neural Networks
Put these together and you get the neural network model:

A common choice for all of the nonlinear functions is again the logistic:
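A hedged sketch of the model these slides describe (the displayed equations are not in the transcript; the notation here is assumed): Hi collects the derived predictors for observation i, and the logistic function is the common choice for both gY and the gj.

\[
Y_i = g_Y\!\big(\boldsymbol{\beta}'\mathbf{H}_i\big) + \varepsilon_i,
\qquad
H_{ij} = g_j\!\big(\boldsymbol{\alpha}_j'\mathbf{X}_i\big),
\qquad
g(z) = \frac{1}{1 + e^{-z}}
\]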

Page 67: November  13, 2013

67

Neural Networks
The gj functions are sometimes called the “activation” functions:

The original idea was that when a linear combination of the predictors got large enough, a brain synapse would “fire” or “activate.” So this was an attempt to model a “step” input function.

Page 68: November  13, 2013

68


Page 69: November  13, 2013

69

Neural Networks
Using the logistic for the gY and gj functions leads to the single-hidden-layer, feedforward neural network. Sometimes called the single-layer perceptron.

\[
Y_i = \big[\,1 + \exp(-\boldsymbol{\beta}'\mathbf{H}_i)\,\big]^{-1} + \varepsilon_i
\]

Page 70: November  13, 2013

70

Network Representation
Useful to view as a network and compare to multiple regression:

Page 71: November  13, 2013

71

Parameter Estimation: Penalized Least Squares

Recall that we found that if too many parameters are fit in OLS, our ability to predict hold-out data can deteriorate. So we looked at adjusted R², AIC, BIC, and Mallows Cp, which all have built-in penalties for having too many parameters.

Dropping some predictors is like setting the corresponding parameter estimates to zero, which “shrinks” the size of the regression coefficient vector.

Another way to do this would be to leave all of the predictors in, but require that there be a penalty on the estimation method for Size(b), where

\[
\text{Size}(\mathbf{b}) = \sum_i b_i^2
\]

Page 72: November  13, 2013

72

Parameter Estimation: Penalized Least Squares

This leads to the “penalized least squares” method. Choose the parameter estimates to minimize:

Where the overfit penalty:

is the sum of squares of the estimates.
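A hedged reconstruction of the displayed criterion, writing the penalty weight as λ:

\[
Q = \sum_{i=1}^{n}\big(Y_i - \hat{Y}_i\big)^2 + p_{\lambda},
\qquad
p_{\lambda} = \lambda \sum_{\text{all coefficients } b} b^2
\]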

Page 73: November  13, 2013

73

Example Using JMP (SAS) Software
We’ll consider the Ischemic Heart Disease data set in Appendix C.9. The response is log(total cost of subscriber claims), and the predictors considered are:

Note: X1 is variable 5, X2 is variable 6, X3 is variable 9, and X4 is variable 8. The first 400 observations are used to fit (train) the model, and the last 388 are held out for validation

Page 74: November  13, 2013

74

Example Using JMP (SAS) Software

Page 75: November  13, 2013

75

Example Using JMP (SAS) Software

Page 76: November  13, 2013

76

Example Using JMP (SAS) Software

Page 77: November  13, 2013

77

Comparison with Linear and Quadratic OLS Fits

Page 78: November  13, 2013

78

Comparison of Statistical and NN Terms

Page 79: November  13, 2013

A Sampling Application

• Frequently, we have an idea about the variability in y off of an x-variable
• Forestry application:
  – What is the average age of trees in a stand?
  – Diameter of a tree is “easy”
  – Age of a tree via ?????

79

Page 80: November  13, 2013

• 20 trees in the sample
• 1132 trees in the forest
• Average diameter is 10.3; 118.3 off of the fit
• (Raw average age is 107.4; average diameter in the sample is 9.44, i.e., the sample tended to have “smaller” trees in it. The fit corrects for this.)
• Note: got 118.46 using estimated standard deviations

80