PREDICTIVE ANALYTICS USING REGRESSION
Sumeet Gupta, Associate Professor, Indian Institute of Management Raipur


Outline
- Basic Concepts
- Applications of Predictive Modeling
- Linear Regression in One Variable using OLS
- Multiple Linear Regression
- Assumptions in Regression
- Explanatory vs. Predictive Modeling
- Performance Evaluation of Predictive Models
- Practical Exercises
- Case: Nils Baker
- Case: Pedigree vs. Grit

BASIC CONCEPTS

Predictive Modeling: Applications
- Predicting customer activity on credit cards from demographic and historical activity patterns
- Predicting the time to failure of equipment based on utilization and environmental conditions
- Predicting expenditures on vacation travel based on historical frequent-flyer data
- Predicting staffing requirements at help desks based on historical data and product and sales information
- Predicting sales from cross-selling of products based on historical information
- Predicting the impact of discounts on sales in retail outlets

Basic Concepts: Relationships
- Examples of relationships: sales and earnings, cost and number produced, Microsoft and the stock market, effort and results
- Scatterplot: a picture to explore the relationship in bivariate data
- Correlation r: measures the strength of the relationship (from -1 to 1)
- Regression: predicting one variable from the other

Basic Concepts: Correlation
- r = 1: a perfect straight line tilting up to the right
- r = 0: no overall tilt; no (linear) relationship
- r = -1: a perfect straight line tilting down to the right

Basic Concepts: Simple Linear Model
- Linear model for the population: the foundation for statistical inference in regression
- Observed Y is a straight line, plus randomness: Y = α + βX + ε
- α + βX is the population relationship on average; ε is the randomness of individuals

Basic Concepts: Simple Linear Model (Time Spent vs. Internet Pages Viewed)
- Two measures of the abilities of 25 Internet sites; at the top right are eBay, Yahoo!, and MSN
- Correlation is r = 0.964: very strong positive association (r close to 1)
- Linear relationship: straight line with scatter
- Increasing relationship: tilts up and to the right
- (Scatterplot: Pages per person vs. Minutes per person)

Basic Concepts: Simple Linear Model (Dollars vs. Deals)
- For mergers and acquisitions by investment bankers: 244 deals worth $756 billion by Goldman Sachs
- Correlation is r = 0.419: positive association
- Linear relationship: straight line with scatter
- Increasing relationship: tilts up and to the right
- (Scatterplot: Deals vs. Dollars (billions))

Basic Concepts: Simple Linear Model (Interest Rate vs. Loan Fee)
- For mortgages: if the interest rate is lower, does the bank make it up with a higher loan fee?
- Correlation is r = -0.890: strong negative association
- Linear relationship: straight line with scatter
- Decreasing relationship: tilts down and to the right
- (Scatterplot: Loan fee vs. Interest rate)

Basic Concepts: Simple Linear Model (Today's vs. Yesterday's Percent Change)
- Is there momentum? If the market was up yesterday, is it more likely to be up today, or is each day's performance independent?
- Correlation is r = 0.11: a weak relationship, or perhaps no relationship
- The tilt is neither clearly up nor down
- (Scatterplot: Yesterday's change vs. Today's change)
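The correlation coefficient r quoted in these examples can be computed from any set of paired observations. A minimal Python sketch; the numbers below are illustrative placeholders, not the datasets behind the slides:

```python
import numpy as np

# Illustrative bivariate data (assumed values, not the slide datasets)
minutes = np.array([12, 25, 40, 55, 70, 95])     # e.g. minutes per person
pages   = np.array([30, 48, 80, 110, 150, 190])  # e.g. pages per person

r = np.corrcoef(minutes, pages)[0, 1]  # Pearson correlation, always between -1 and 1
print(f"r = {r:.3f}")                  # close to +1 => strong increasing linear relationship
```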
Basic Concepts: Simple Linear Model (Call Price vs. Strike Price)
- For stock options: the call price is the price of the option contract to buy stock at the strike price
- The right to buy at a lower strike price has more value
- A nonlinear relationship: not a straight line, but a curved relationship
- Correlation r = -0.895: a negative relationship; a higher strike price goes with a lower call price
- (Scatterplot: Strike price vs. Call price)

Basic Concepts: Simple Linear Model (Output Yield vs. Temperature)
- For an industrial process with a best (optimal) temperature setting
- A nonlinear relationship: not a straight line, but a curved relationship
- Correlation r = 0.0155: r suggests no relationship, but the relationship is strong; it tilts neither up nor down
- (Scatterplot: Temperature vs. Yield of process)

Basic Concepts: Simple Linear Model (Circuit Miles vs. Investment)
- For telecommunications firms
- A relationship with unequal variability: more vertical variation at the right than at the left
- Variability is stabilized by taking logarithms of both variables
- Correlation r = 0.820 on the raw scale; r = 0.957 after taking logs
- (Scatterplots: Investment ($ millions) vs. Circuit miles (millions); Log of investment vs. Log of miles)

Basic Concepts: Simple Linear Model (Price vs. Coupon Payment)
- For trading in the bond market: bonds paying a higher coupon generally cost more
- Two clusters are visible: ordinary bonds (value comes from the coupon) and inflation-indexed bonds (payout rises with inflation)
- Correlation r = 0.950 for all bonds; r = 0.994 for ordinary bonds only
- (Scatterplot: Coupon rate vs. Bid price)

Basic Concepts: Simple Linear Model (Cost vs. Number Produced)
- For a production facility: it usually costs more to produce more
- An outlier is visible: a disaster (a fire at the factory) with high cost but few units produced
- Correlation r = 0.623 with the outlier included; r = 0.869 with the outlier removed
- (Scatterplot: Number produced vs. Cost)

Basic Concepts: OLS Modeling (Salary vs. Years of Experience)
- For n = 6 employees
- Linear (straight-line) relationship
- Increasing relationship: higher salary generally goes with higher experience
- Correlation r = 0.8667

  Experience (years):    15  10  20   5  15   5
  Salary ($ thousand):   30  35  55  22  40  27

Basic Concepts: OLS Modeling
- The least-squares line summarizes bivariate data: it predicts Y from X with the smallest errors (in the vertical direction, for the Y axis)
- Intercept is 15.32: the predicted salary at 0 years of experience
- Slope is 1.673: the additional salary, on average, for each additional year of experience

Basic Concepts: OLS Modeling
- The predicted value comes from the least-squares line
- For example, Mary (with 20 years of experience) has predicted salary 15.32 + 1.673(20) = 48.8, and so does anyone else with 20 years of experience
- The residual is the actual Y minus the predicted Y
- Mary earns 55 thousand, so her residual is 55 - 48.8 = 6.2: she earns about $6,200 more than the predicted salary for a person with 20 years of experience
- A person who earns less than predicted has a negative residual

Basic Concepts: OLS Modeling (Standard Error of Estimate)
- The standard error of estimate gives the approximate size of the prediction errors (residuals): actual Y minus predicted Y, i.e. Y - [a + bX]
- Formula: S_e = S_Y * sqrt((1 - r^2) * (n - 1) / (n - 2))
- Example (Salary vs. Experience): S_e = 11.686 * sqrt((1 - 0.8667^2) * 5/4) = 6.52
- Predicted salaries are about 6.52 (i.e., $6,520) away from actual salaries
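A minimal Python sketch reproducing the least-squares results above from the six (experience, salary) pairs given on the slide; small differences from the quoted figures are due to rounding:

```python
import numpy as np

experience = np.array([15, 10, 20, 5, 15, 5], dtype=float)    # X, years
salary     = np.array([30, 35, 55, 22, 40, 27], dtype=float)  # Y, $ thousand

# Least-squares slope and intercept (polyfit returns highest power first)
slope, intercept = np.polyfit(experience, salary, 1)
print(intercept, slope)               # ~15.32 and ~1.673

# Predicted value and residual for Mary (20 years of experience, salary 55)
predicted = intercept + slope * 20
print(predicted, 55 - predicted)      # ~48.8 and ~6.2

# Standard error of estimate: S_e = S_Y * sqrt((1 - r^2) * (n - 1) / (n - 2))
n = len(salary)
r = np.corrcoef(experience, salary)[0, 1]
s_y = salary.std(ddof=1)
s_e = s_y * np.sqrt((1 - r**2) * (n - 1) / (n - 2))
print(s_e)                            # ~6.52
```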
Basic Concepts: OLS Modeling (Interpreting the Standard Error of Estimate)
- Interpretation is similar to a standard deviation: the least-squares line can be moved up and down by S_e
- About 68% of the data fall within one standard error of estimate of the least-squares line (for a bivariate normal distribution)

Multiple Linear Regression
- Linear model for the population: Y = (α + β1 X1 + β2 X2 + ... + βk Xk) + ε = (population relationship) + randomness
- ε has a normal distribution with mean 0 and constant standard deviation σ, and this randomness is independent from one case to another
- This assumption is needed for statistical inference

Multiple Linear Regression: Results
- Intercept a: the predicted value of Y when every X is 0
- Regression coefficients b1, b2, ..., bk: the effect of each X on Y, holding all other X variables constant
- Prediction (regression) equation: (Predicted Y) = a + b1 X1 + b2 X2 + ... + bk Xk, the predicted Y given the values of all the X variables
- Prediction errors (residuals): (Actual Y) - (Predicted Y)

Multiple Linear Regression: Results
- t tests for individual regression coefficients: significant or not, for each X variable; each tests whether a particular X variable has an effect on Y, holding the other X variables constant; should be performed only if the F test is significant
- Standard errors of the regression coefficients S_b1, S_b2, ..., S_bk (with n - k - 1 degrees of freedom): the estimated sampling standard deviation of each regression coefficient, used in the usual way to find confidence intervals and hypothesis tests for individual coefficients

Multiple Linear Regression: Results (Magazine Example)
- Predicted Page Costs for Audubon = a + b1 X1 + b2 X2 + b3 X3
  = $4,043 + 3.79(Audience) - 124(Percent Male) + 0.903(Median Income)
  = $4,043 + 3.79(1,645) - 124(51.1) + 0.903(38,787) = $38,966
- Actual Page Costs are $25,315
- Residual is $25,315 - $38,966 = -$13,651
- Audubon has Page Costs $13,651 lower than you would expect for a magazine with its characteristics (Audience, Percent Male, and Median Income)

Standard Error of Estimate
- S_e indicates the approximate size of the prediction errors: about how far are the Y values from their predictions?
- For the magazine data, S_e = $21,578: actual Page Costs are about $21,578 from their regression predictions for this group of magazines
- Compare to S_Y = $45,446: actual Page Costs are about $45,446 from their average (not using regression)
- Using the regression equation to predict Page Costs (instead of simply using the average of Y), the typical error is reduced from $45,446 to $21,578

Coefficient of Determination
- The strength of association is measured by the square of the multiple correlation coefficient, R², also called the coefficient of multiple determination: R² = SS_reg / SS_y
- R² is adjusted for the number of independent variables and the sample size: Adjusted R² = R² - k(1 - R²)/(n - k - 1)
- R² indicates the percentage of the variation in Y that is explained by (attributed to) all of the X variables: how well do the X variables explain Y?
- For the magazine data, R² = 0.787 = 78.7%: the X variables (Audience, Percent Male, and Median Income) taken together explain 78.7% of the variance of Page Costs, leaving 100% - 78.7% = 21.3% of the variation in Page Costs unexplained
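The Audubon figures above are simply an evaluation of the reported prediction equation. A short Python check, using only the coefficients and values quoted on the slide:

```python
# Coefficients of the prediction equation as reported on the slide
a, b_audience, b_pct_male, b_income = 4043, 3.79, -124, 0.903

# Audubon's characteristics as reported on the slide
audience, pct_male, median_income = 1645, 51.1, 38787

predicted_page_costs = (a + b_audience * audience
                          + b_pct_male * pct_male
                          + b_income * median_income)        # ~ $38,966

actual_page_costs = 25315
residual = actual_page_costs - predicted_page_costs          # ~ -$13,651 (cheaper than expected)
print(round(predicted_page_costs), round(residual))
```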
The F Test
- Is the regression significant? Do the X variables, taken together, explain a significant amount of the variation in Y?
- The null hypothesis claims that, in the population, the X variables do not help explain Y; all coefficients are 0: H0: β1 = β2 = ... = βk = 0
- The research hypothesis claims that, in the population, at least one of the X variables does help explain Y: H1: at least one of β1, β2, ..., βk ≠ 0
- The overall null hypothesis can also be stated as H0: R²_pop = 0, which is equivalent to the following null hypothesis:

H0: β1 = β2 = β3 = ... = βk = 0

The overall test can be conducted by using an F statistic:

F = (SS_reg / k) / (SS_res / (n - k - 1)) = (R² / k) / ((1 - R²) / (n - k - 1))

which has an F distribution with k and (n - k - 1) degrees of freedom.

Performing the F Test
- Three equivalent methods for performing the F test; they always give the same result
- Use the p-value: if p < 0.05, the test is significant (same interpretation as p-values in Chapter 10)
- Use the R² value: if R² is larger than the value in the R² table, the result is significant (do the X variables explain more than just randomness?)
- Use the F statistic: if the F statistic is larger than the value in the F table, the result is significant

Example: F Test
- For the magazine data, the X variables (Audience, Percent Male, and Median Income) explain a very highly significant percentage of the variation in Page Costs
- The p-value, listed as 0.000, is less than 0.0005 and is therefore very highly significant (since it is less than 0.001)
- The R² value, 78.7%, is greater than 27.1% (from the R² table at level 0.1% with n = 55 and k = 3) and is therefore very highly significant
- The F statistic, 62.84, is greater than the value (between 7.054 and 6.171) from the F table at level 0.1% and is therefore very highly significant

t Tests
- A t test for each regression coefficient, to be used only if the F test is significant; if F is not significant, you should not look at the t tests
- Does the jth X variable have a significant effect on Y, holding the other X variables constant?
- Hypotheses: H0: βj = 0, H1: βj ≠ 0
- Test using the confidence interval b_j ± t * S_bj, with the t table value at n - k - 1 degrees of freedom
- Or use the t statistic, t = b_j / S_bj, compared to the t table value with n - k - 1 degrees of freedom

Example: t Tests
- Testing b1, the coefficient for Audience: b1 = 3.79, t = 13.5, p = 0.000; Audience has a very highly significant effect on Page Costs, after adjusting for Percent Male and Median Income
- Testing b2, the coefficient for Percent Male: b2 = -124, t = -0.90, p = 0.374; Percent Male does not have a significant effect on Page Costs, after adjusting for Audience and Median Income
- Testing b3, the coefficient for Median Income: b3 = 0.903, t = 2.44, p = 0.018; Median Income has a significant effect on Page Costs, after adjusting for Audience and Percent Male
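A hedged sketch of how the F test and the coefficient t tests are typically read off a fitted multiple regression in software. The data here are synthetic stand-ins (the magazine dataset itself is not reproduced on the slides), and statsmodels is assumed:

```python
import numpy as np
import statsmodels.api as sm

# Illustrative data only: three predictors, of which the second has no real effect
rng = np.random.default_rng(0)
n = 55
X = rng.normal(size=(n, 3))
y = 5 + 2.0 * X[:, 0] + 0.5 * X[:, 2] + rng.normal(size=n)

model = sm.OLS(y, sm.add_constant(X)).fit()
print(model.fvalue, model.f_pvalue)        # overall F test: are all slopes 0?
print(model.params)                        # a, b1, b2, b3
print(model.bse)                           # standard errors S_b1, S_b2, S_b3
print(model.tvalues, model.pvalues)        # t = b_j / S_bj, with n - k - 1 df
print(model.rsquared, model.rsquared_adj)  # R^2 and adjusted R^2
```

As on the slides, the t test for the predictor that has no real effect (the second column here) will typically be non-significant even when the overall F test is highly significant.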
Assumptions in Regression
- Assumptions underlying the statistical techniques should be tested twice: first for the separate variables, and second for the multivariate model (the variate), which acts collectively for the variables in the analysis and thus must meet the same assumptions as the individual variables
- The specific assumptions differ for different multivariate techniques

Assumptions in Regression
- Linearity: the independent variables have a linear relationship with the dependent variable
- Normality: the residuals (or the dependent variable) follow a normal distribution
- Multicollinearity: some X variables are too similar to one another (should be absent)
- Homoskedasticity: the variability in Y values for a given set of predictors is the same regardless of the values of the predictors
- Independence among cases (absence of correlated errors): the cases are independent of each other

Assumptions in Regression: Normality
- The residuals or the dependent variable should follow a normal distribution; if the departure from normality is significant, the statistical tests are invalid
- Graphical analysis: histogram and normal probability plot; peaked and skewed distributions result in non-normality
- Statistical analysis: if the test (Z) value exceeds the critical value, the distribution is non-normal; Kolmogorov-Smirnov test, Shapiro-Wilk test

Assumptions in Regression: Homoskedasticity
- An assumption related primarily to dependence relationships between variables
- The dependent variable(s) should exhibit equal levels of variance across the range of the predictor variable(s); the variance of the dependent variable should not be concentrated in only a limited range of the independent values
- Common sources: the type of variable involved and skewed distributions
- Graphical analysis: analysis of the residuals, in the case of regression
- Statistical analysis: variances within groups formed by non-metric variables; Levene test, Box's M test
- Remedy: data transformation

Assumptions in Regression: Linearity
- An assumption for all multivariate techniques based on correlational measures, such as multiple regression, logistic regression, factor analysis, and structural equation modeling
- Correlation represents only the linear association between variables
- Identification: scatterplots, or examination of the residuals from a regression
- Remedy: data transformations

Assumptions in Regression: Absence of Correlated Errors
- Prediction errors should not be correlated with each other
- Identification: the most likely cause is the data collection process, such as two separate groups in the data collection process
- Remedy: include the omitted causal factor in the multivariate analysis
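One possible way to run some of these assumption checks in Python, assuming SciPy and statsmodels are available. The data, the two-group split used for the Levene test, and the variable names are illustrative assumptions, not part of the lecture material:

```python
import numpy as np
import statsmodels.api as sm
from scipy import stats
from statsmodels.stats.stattools import durbin_watson

# Illustrative data (assumed); in practice use your own predictor matrix X and outcome y
rng = np.random.default_rng(1)
X = rng.normal(size=(100, 2))
y = 1 + X @ np.array([2.0, -1.0]) + rng.normal(size=100)

resid = sm.OLS(y, sm.add_constant(X)).fit().resid

# Normality of residuals: Shapiro-Wilk (a small p-value suggests non-normality)
print(stats.shapiro(resid))

# Homoskedasticity: compare residual spread across groups (here, split on one predictor)
groups = [resid[X[:, 0] < 0], resid[X[:, 0] >= 0]]
print(stats.levene(*groups))

# Independence of errors: Durbin-Watson statistic (values near 2 suggest uncorrelated errors)
print(durbin_watson(resid))
```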
Assumptions in Regression: Multicollinearity
- Multicollinearity arises when the intercorrelations among the predictors are very high; it can result in several problems:
  - The partial regression coefficients may not be estimated precisely, and the standard errors are likely to be high
  - The magnitudes as well as the signs of the partial regression coefficients may change from sample to sample
  - It becomes difficult to assess the relative importance of the independent variables in explaining the variation in the dependent variable
  - Predictor variables may be incorrectly included in, or removed from, a stepwise regression
- The ability of an independent variable to improve the prediction of the dependent variable is related not only to its correlation with the dependent variable, but also to its correlation(s) with the independent variable(s) already in the regression equation
- Collinearity is the association, measured as the correlation, between two independent variables; multicollinearity refers to correlation among three or more independent variables
- Impact: reduces any single IV's predictive power by the extent to which it is associated with the other independent variables

Assumptions in Regression: Measuring Multicollinearity
- Tolerance: the amount of variability of the selected independent variable not explained by the other independent variables; tolerance values should be high (the cut-off is 0.1, but values greater than 0.5 give better results)
- VIF: the inverse of tolerance; should be low (typically below 2.0 and usually below 10); see the sketch at the end of this subsection

Assumptions in Regression: Remedy for Multicollinearity
- A simple procedure for adjusting for multicollinearity is to use only one of the variables in a highly correlated set of variables
- Omit highly correlated independent variables and identify other independent variables to help the prediction
- Alternatively, the set of independent variables can be transformed into a new set of mutually independent predictors using techniques such as principal components analysis
- More specialized techniques, such as ridge regression and latent root regression, can also be used

Assumptions in Regression: Data Transformations
- Used to correct violations of the statistical assumptions underlying the multivariate techniques, and to improve the relationship between variables
- Transformations to achieve normality and homoscedasticity:
  - Flat distribution: inverse transformation
  - Negatively skewed distribution: square root transformation
  - Positively skewed distribution: logarithmic transformation
- If the residuals in a regression are cone-shaped:
  - Cone opens to the right: inverse transformation
  - Cone opens to the left: square root transformation
- Transformations can also be used to achieve linearity

Assumptions in Regression: General Guidelines for Transformation
- For a noticeable effect of a transformation, the ratio of a variable's mean to its standard deviation should be less than 4.0
- When the transformation can be performed on either of two variables, select the one with the smallest mean/standard-deviation ratio
- Transformations should be applied to the independent variables, except in the case of heteroscedasticity
- Heteroscedasticity can only be remedied by transforming the dependent variable in a dependence relationship; if the heteroscedastic relationship is also nonlinear, the dependent variable, and perhaps the independent variables, must be transformed
- Transformations may change the interpretation of the variables

Issues in Regression
- Variable selection: how to choose from a long list of X variables? Too many wastes the information in the data; too few risks ignoring useful predictive information
- Model misspecification: perhaps the multiple regression linear model is wrong. Unequal variability? Nonlinearity? Interaction?
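Returning to the multicollinearity measures above, a minimal sketch of computing VIF and tolerance with statsmodels; the predictor matrix is an assumed example, constructed so that two columns are nearly collinear:

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

# Illustrative predictors (assumed data); x3 is nearly a copy of x1, creating collinearity
rng = np.random.default_rng(2)
df = pd.DataFrame({"x1": rng.normal(size=200), "x2": rng.normal(size=200)})
df["x3"] = df["x1"] + 0.05 * rng.normal(size=200)

X = sm.add_constant(df)            # include the intercept column before computing VIFs
for i, name in enumerate(X.columns):
    if name == "const":
        continue
    vif = variance_inflation_factor(X.values, i)
    print(name, round(vif, 1), "tolerance =", round(1 / vif, 3))  # tolerance is 1 / VIF
```

Here x1 and x3 would show large VIFs (and tolerances well below 0.1), flagging the redundancy that the remedies above address.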
EXPLANATORY VS PREDICTIVE MODELING

Explanatory vs. Predictive Modeling
- A good explanatory model fits the data closely, whereas a good predictive model predicts new cases accurately
- Explanatory models use the entire dataset to estimate the best-fit model and to maximize explained variance (R²); predictive models estimate the model on a training set and assess it on new, unobserved data
- Performance measures for explanatory models capture how closely the data fit the model, whereas for predictive models performance is measured by predictive accuracy
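A minimal sketch of this train-and-assess workflow, assuming scikit-learn and illustrative data; the accuracy measures computed at the end (MAE, RMSE, MAPE) are the ones defined in the next section:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_error, mean_squared_error

# Illustrative data (assumed), standing in for any predictor matrix X and outcome y
rng = np.random.default_rng(3)
X = rng.normal(size=(200, 3))
y = 10 + X @ np.array([3.0, -2.0, 0.5]) + rng.normal(size=200)

# Predictive workflow: fit on a training set, assess on held-out (unobserved) cases
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)
model = LinearRegression().fit(X_train, y_train)
pred = model.predict(X_test)

mae  = mean_absolute_error(y_test, pred)                  # mean absolute error (MAD)
rmse = np.sqrt(mean_squared_error(y_test, pred))          # root mean squared error
mape = np.mean(np.abs((y_test - pred) / y_test)) * 100    # mean absolute percentage error
print(round(mae, 2), round(rmse, 2), round(mape, 1))
```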

Performance Evaluation
- Prediction error for observation i = (actual y value) - (predicted y value)
- Popular numerical measures of predictive accuracy:
  - MAE or MAD (mean absolute error / deviation)
  - Average error
  - MAPE (mean absolute percentage error)
  - RMSE (root mean squared error)
  - Total SSE (total sum of squared errors)

CASES

Case: Pedigree vs. Grit
- Why does a low R² not make the regression useless? Describe a situation in which a useless regression has a high R².
- Check the validity of the linear regression model assumptions.
- Estimate the excess returns of Bob's and Putney's funds. Between them, who is expected to obtain higher returns at their current funds, and by how much? If hired by the firm, who is expected to obtain higher returns, and by how much?
- Can you prove at the 5% level of significance that Bob would get higher expected returns if he had attended Princeton instead of Ohio State?
- Can you prove at the 10% level of significance that Bob would get at least 1% higher expected returns by managing a growth fund?
- Is there strong evidence that fund managers with an MBA perform worse than fund managers without an MBA? What is held constant in this comparison?
- Based on your analysis of the case, which candidate do you support for AMBTPM's job opening: Bob or Putney? Discuss.

Case: Nils Baker
- Is the presence of a physical bank branch creating demand for checking accounts?

Thank You