© Pristine – www.edupristine.com
Business Analytics Multivariate Linear Regression (Using Ms-Excel & “R”)
Multivariate Linear Regression- Fixing Heteroskedasticity
Univariate scenario:
• Find the “standard deviation” of the “response” variable for each level of the “independent” variable
• Divide each individual value of the response variable by the standard deviation of its level
• The scaled values become the new “response variable”
• E.g. if variable is “Fuel Type”
– If Fuel Type is “D”, divide “Capped Losses” by SquareRoot(33862) = 184
– If Fuel Type is “P”, divide “Capped Losses” by SquareRoot(16400) = 128
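The univariate adjustment above can be sketched in Python. This is only an illustration, not the course's Excel workbook: the function name `scale_by_group_sd`, the toy loss values and the use of the sample standard deviation (`statistics.stdev`) are all assumptions.

```python
from collections import defaultdict
from statistics import stdev


def scale_by_group_sd(levels, responses):
    """Divide each response by the standard deviation of its level.

    Assumption: sample standard deviation (n - 1 denominator); the slides
    do not say which variant the Excel demo uses.
    """
    by_level = defaultdict(list)
    for level, y in zip(levels, responses):
        by_level[level].append(y)
    sd = {level: stdev(values) for level, values in by_level.items()}
    scaled = [y / sd[level] for level, y in zip(levels, responses)]
    return scaled, sd


# Hypothetical toy data: fuel type and capped losses for six policies.
fuel_type = ["D", "D", "D", "P", "P", "P"]
capped_losses = [900.0, 500.0, 700.0, 300.0, 100.0, 200.0]

scaled_losses, level_sd = scale_by_group_sd(fuel_type, capped_losses)
```

After scaling, each fuel type has a standard deviation of 1 in the new response variable, which is the point of the adjustment.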
Multivariate scenario:
• Create all possible unique combinations of independent variables
• For each of the combinations, find “Standard Deviations”
• Divide each individual value of the response variable by the standard deviation of its combination
• This is too cumbersome to do manually in MS Excel, and the process is iterative.
• It is more convenient to do with statistical packages like R.
Course approach
• First fit a multivariate regression without fixing heteroskedasticity to get a final set of significant variables
• Then do a manual adjustment and re-fit the regression using MS Excel. This is for demonstration only, as manual adjustment is always questionable.
• Demonstrate linear regression using R
Linear Regression- Preparing MS Excel
Linear Regression- Using MS Excel (Demo.)
Multivariate Linear Regression- Variable Selection
Variable selection to be done on the basis of
• Multicollinearity (correlation between independent variables)
• Banding of variables e.g. whether to use “Age” or “Age Band” (also called custom bands)
• Statistical significance of variables tested after performing above two steps
List of independent variables:
1. Age
2. Age Band
3. Years of Driving Experience
4. Number of Vehicles
5. Gender
6. Married
7. Vehicle Age
8. Vehicle Age Band
9. Fuel Type
Multivariate Linear Regression- Variable Selection (Multicollinearity)
“Age” and “Years of Driving Experience” are highly correlated (correlation coefficient = 0.9972), so only one of the two should be used in the regression.
Q: Which one to use and which one to reject?
Sol: Fit two separate models, each using one of the two variables. Compare the goodness of fit (R2 in this case) and accept the variable that produces the higher R2.
Regression Statistics    Age         Years of Driving Experience
Multiple R               0.475766    0.475273
R Square                 0.226354    0.225885
Adjusted R Square        0.226303    0.225834
Standard Error           201.2306    201.2916
Observations             15290       15290
R2 for Age > R2 for Years of Driving Experience
Reject Years of Driving Experience
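The multicollinearity check and the one-variable comparison above can be sketched in Python. The helper names and the five-row toy data set are hypothetical. The sketch relies on the fact that, for a regression with a single independent variable, R2 equals the squared correlation between that variable and the response, which is also how “Multiple R” and “R Square” relate in the Excel output.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def r_squared_single(x, y):
    """R2 of a one-variable least-squares fit (squared correlation)."""
    return pearson_r(x, y) ** 2


# Hypothetical toy data, not the course data set.
age = [18, 25, 40, 60, 72]
years_driving = [1, 7, 22, 41, 54]
losses = [900, 700, 400, 250, 200]

# Step 1: how strongly are the two candidate variables correlated?
r_between_predictors = pearson_r(age, years_driving)

# Step 2: keep whichever variable explains the response better on its own.
candidates = {"Age": age, "Years of Driving Experience": years_driving}
keep = max(candidates, key=lambda name: r_squared_single(candidates[name], losses))
```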
Multivariate Linear Regression- Custom Bands
Investigate whether to use “Age” or “Age band”
Fit regression independently using “Age” and “Age Band”
Before fitting the regression, “Age Band” needs to be converted from categorical to numerical form: replace each “Age Band” value with the “Average Age” of that band.
R2 for Average Age > R2 for Age
Select “Average Age”
Age Band    Sum of Age    # Policies    Average Age
16-25       93,770        4,563         20.6
26-59       270,793       6,384         42.4
60+         282,636       4,343         65.1
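The band-to-number conversion can be sketched in Python using the totals from the table above. The dictionary layout and the `bands_of_policies` sample are hypothetical; only the sums and policy counts come from the slide.

```python
# Band totals copied from the table: (sum of ages, number of policies).
age_bands = {
    "16-25": (93_770, 4_563),
    "26-59": (270_793, 6_384),
    "60+": (282_636, 4_343),
}

# Average age per band = sum of ages / number of policies in the band.
average_age = {band: total / count for band, (total, count) in age_bands.items()}

# Replace the categorical band with its numeric average.
# Hypothetical sample: the band of four policies.
bands_of_policies = ["16-25", "60+", "26-59", "16-25"]
avg_age_column = [average_age[band] for band in bands_of_policies]
```

The same replacement applies to “Vehicle Age Band” with its own band totals.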
Regression results using “Age” and “Average Age”
Regression Statistics    Average Age    Age
Multiple R               0.509969       0.475766
R Square                 0.260068       0.226354
Adjusted R Square        0.26002        0.226303
Standard Error           196.7971       201.2306
Observations             15290          15290
Multivariate Linear Regression- Custom Bands
Investigate whether to use “Vehicle Age” or “Vehicle Age Band”
Fit regression independently using “Vehicle Age” and “Vehicle Age Band”
Before fitting the regression, “Vehicle Age Band” needs to be converted from categorical to numerical form: replace each “Vehicle Age Band” value with the “Average Vehicle Age” of that band.
Vehicle Age Band    Sum of Vehicle Age    # Policies    Average Vehicle Age
0-5                 9,229                 3,688         2.50
6-10                44,298                5,523         8.02
11+                 78,819                6,079         12.97
Regression results using “Vehicle Age” and “Average Vehicle Age”
Regression Statistics    Average Vehicle Age    Vehicle Age
Multiple R               0.303099405            0.289431325
R Square                 0.09186925             0.083770492
Adjusted R Square        0.091809848            0.083710561
Standard Error           218.0203272            218.9903277
Observations             15290                  15290
R2 for Average Vehicle Age > R2 for Vehicle Age
Select “Average Vehicle Age”
Multivariate Linear Regression- Variable Selection
List of shortlisted variables:
1. Age Band in the form of “Average Age” of the band (selected out of “Age” and “Age Band”). Also got selected over “Years of Driving Experience”.
2. Number of Vehicles
3. Gender
4. Married
5. Vehicle Age Band in the form of “Average Vehicle Age” of the band (selected out of “Vehicle Age” and “Vehicle Age Band”).
6. Fuel Type
We will run the regression in “multivariate” fashion and then select the final list of variables on the basis of “statistical significance”.
Multivariate Linear Regression- Categorical variable conversion
Categorical variables in Binary form need to be converted to their numerical equivalent (0, 1)
1. Gender (F = 0 and M = 1)
2. Married (Married = 0 and Single = 1)
3. Fuel Type (P = 0, D = 1)
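The 0/1 conversion listed above can be sketched in Python. The mappings follow the slide; the record layout (a dictionary keyed by column name) and the function name `to_dummies` are assumptions for illustration.

```python
# Mappings taken from the list above.
GENDER_DUMMY = {"F": 0, "M": 1}
MARRIED_DUMMY = {"Married": 0, "Single": 1}
FUEL_TYPE_DUMMY = {"P": 0, "D": 1}


def to_dummies(record):
    """Return the three dummy variables for one policy record."""
    return {
        "Gender Dummy": GENDER_DUMMY[record["Gender"]],
        "Married Dummy": MARRIED_DUMMY[record["Married"]],
        "Fuel Type Dummy": FUEL_TYPE_DUMMY[record["Fuel Type"]],
    }


# Hypothetical policy record.
policy = {"Gender": "M", "Married": "Single", "Fuel Type": "D"}
dummies = to_dummies(policy)
```

An unexpected category raises `KeyError`, which surfaces data-quality problems instead of silently coding them as 0.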
Snapshot of the final data on which we will run the multivariate regression
Multivariate Linear Regression- Output
SUMMARY OUTPUT

Regression Statistics
Multiple R           0.865972274
R Square             0.749907979
Adjusted R Square    0.749809794
Standard Error       114.4310136
Observations         15290

ANOVA
              df       SS             MS             F              Significance F
Regression    6        600073213.5    100012202.3    7637.751088    0
Residual      15283    200122584.4    13094.45688
Total         15289    800195798

                      Coefficients    Standard Error    t Stat       P-value    Lower 95%    Upper 95%    Lower 95.0%    Upper 95.0%
Intercept             624.56529       5.29192           118.02233    0.00000    614.19249    634.93809    614.19249      634.93809
Avg Age               -5.55974        0.06546           -84.93889    0.00000    -5.68804     -5.43144     -5.68804       -5.43144
Number of Vehicles    0.17875         0.97039           0.18420      0.85386    -1.72333     2.08082      -1.72333       2.08082
Gender Dummy          50.88326        1.89081           26.91084     0.00000    47.17705     54.58947     47.17705       54.58947
Married Dummy         78.39837        1.92148           40.80106     0.00000    74.63204     82.16469     74.63204       82.16469
Avg Vehicle Age       -15.14220       0.26734           -56.63987    0.00000    -15.66623    -14.61818    -15.66623      -14.61818
Fuel Type Dummy       267.93559       2.74845           97.48614     0.00000    262.54830    273.32287    262.54830      273.32287
Multivariate Linear Regression- Output
Independent Vars              Coefficients (b)    Standard Error (σ)    t Stat (b/σ)    P-value (t-dist table)    Lower 95% (b-1.96*σ)    Upper 95% (b+1.96*σ)
Intercept               a     624.565             5.292                 118.022         0.000                     614.192                 634.938
X1 Avg Age              b1    -5.560              0.065                 -84.939         0.000                     -5.688                  -5.431
X2 Number of Vehicles   b2    0.179               0.970                 0.184           0.854                     -1.723                  2.081
X3 Gender Dummy         b3    50.883              1.891                 26.911          0.000                     47.177                  54.589
X4 Married Dummy        b4    78.398              1.921                 40.801          0.000                     74.632                  82.165
X5 Avg Vehicle Age      b5    -15.142             0.267                 -56.640         0.000                     -15.666                 -14.618
X6 Fuel Type Dummy      b6    267.936             2.748                 97.486          0.000                     262.548                 273.323
Number of Vehicles (X2) is insignificant.

ANOVA
              Definition of SS              df               SS             MS (SS/df)     F (MSReg/MSRes)    Significance F (from F dist table)
Regression    ∑(ypredicted - ymean)^2       p = 6            600073213.5    100012202.3    7637.75            0
Residual      ∑(yactual - ypredicted)^2     n-p-1 = 15283    200122584.4    13094.457
Total         ∑(yactual - ymean)^2          n-1 = 15289      800195798

Regression Statistics
Multiple R           SquareRoot(R Square)               0.8659723
R Square             SS Regression/SS Total             0.7499080
Adjusted R Square    R2 - (1 - R2)*{p/(n-p-1)}          0.7498098
Standard Error       SquareRoot{SS Residual/(n-p-1)}    114.4310136
Observations         n                                  15290
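As a cross-check, the regression statistics can be recomputed from the ANOVA sums of squares with the formulas annotated on this slide. This is a minimal Python sketch; the variable names are illustrative, and the inputs are the SS values, n and p from the Excel output.

```python
from math import sqrt

n = 15290  # observations
p = 6      # independent variables

ss_regression = 600073213.5
ss_residual = 200122584.4
ss_total = ss_regression + ss_residual

# Mean squares: SS divided by the respective degrees of freedom.
ms_regression = ss_regression / p
ms_residual = ss_residual / (n - p - 1)

f_stat = ms_regression / ms_residual
r_square = ss_regression / ss_total
multiple_r = sqrt(r_square)
adjusted_r_square = r_square - (1 - r_square) * p / (n - p - 1)
standard_error = sqrt(ms_residual)
```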
Multivariate Linear Regression- Output (Significance Test)
Significance test of coefficients based on the Normal distribution
H0: b is no different from 0 (i.e. 0 is the coefficient when the variable is not included in the regression)
H1: b is different from 0
Test statistic: Z = (b - 0)/σ (at the 95% two-tailed confidence level, the critical value of Z is 1.96)
Confidence interval = (b - 1.96 * σ, b + 1.96 * σ)
For a variable to be significant, the interval must not contain 0.
Example 1: Avg Age. Confidence interval = (-5.560 - 1.96*0.065, -5.560 + 1.96*0.065) = (-5.688, -5.431), computed from the unrounded b and σ. No zero in the interval, hence significant.
Example 2: Number of Vehicles. Confidence interval = (0.179 - 1.96*0.970, 0.179 + 1.96*0.970) = (-1.723, 2.081), computed from the unrounded b and σ. Zero is present in the interval, hence insignificant.
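The interval test can be sketched in Python. The function names are hypothetical; the inputs are the unrounded coefficients and standard errors from the Excel summary output, and 1.96 is the 95% two-tailed critical value used on the slide.

```python
Z_95 = 1.96  # 95% two-tailed critical value used on the slide


def confidence_interval(b, se, z=Z_95):
    """Return (lower, upper) = (b - z*se, b + z*se)."""
    return b - z * se, b + z * se


def is_significant(b, se, z=Z_95):
    """A coefficient is significant when its interval excludes zero."""
    lower, upper = confidence_interval(b, se, z)
    return not (lower <= 0.0 <= upper)


# Unrounded values from the Excel output.
avg_age_ci = confidence_interval(-5.55974, 0.06546)
num_vehicles_ci = confidence_interval(0.17875, 0.97039)

avg_age_significant = is_significant(-5.55974, 0.06546)
num_vehicles_significant = is_significant(0.17875, 0.97039)
```

Excel's own bounds differ in the fourth decimal place because it uses the exact t critical value rather than 1.96.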
Multivariate Linear Regression- Output (Significance Test)
Significance test of coefficients based on the t distribution
• b/StdErr(b) follows a t distribution with n-p-1 degrees of freedom (n-2 when there is a single independent variable)
H0: b is no different from 0 (i.e. 0 is the coefficient when the variable is not included in the regression)
H1: b is different from 0
At the 95% two-tailed confidence level and df greater than 120, the critical value of t is approximately 1.96.
Confidence interval = (b - 1.96 * σ, b + 1.96 * σ)
For a variable to be significant, the interval must not contain 0.
Example 1: Avg Age. Confidence interval = (-5.560 - 1.96*0.065, -5.560 + 1.96*0.065) = (-5.688, -5.431), computed from the unrounded b and σ. No zero in the interval, hence significant.
Example 2: Number of Vehicles. Confidence interval = (0.179 - 1.96*0.970, 0.179 + 1.96*0.970) = (-1.723, 2.081), computed from the unrounded b and σ. Zero is present in the interval, hence insignificant.
Multivariate Linear Regression- Output at 95% Confidence Interval
SUMMARY OUTPUT

Regression Statistics    Excluding "Num Vehicles"    Including "Num Vehicles"
Multiple R               0.865971953                 0.865972274
R Square                 0.749907424                 0.749907979
Adjusted R Square        0.749825608                 0.749809794
Standard Error           114.4273971                 114.4310136
Observations             15290                       15290

ANOVA
              df       SS             MS             F           Significance F
Regression    5        600072769.2    120014553.8    9165.874    0
Residual      15284    200123028.7    13093.6292
Total         15289    800195798

                   Coefficients    Standard Error    t Stat     P-value    Lower 95%    Upper 95%    Lower 95.0%    Upper 95.0%
Intercept          625.005         4.723             132.333    0.00       615.7474     634.2625     615.7474       634.2625
Avg Age            -5.560          0.065             -84.942    0.00       -5.6879      -5.4314      -5.6879        -5.4314
Gender Dummy       50.883          1.891             26.912     0.00       47.1768      54.5890      47.1768        54.5890
Married Dummy      78.402          1.921             40.806     0.00       74.6356      82.1677      74.6356        82.1677
Avg Vehicle Age    -15.142         0.267             -56.641    0.00       -15.6660     -14.6180     -15.6660       -14.6180
Fuel Type Dummy    267.935         2.748             97.489     0.00       262.5480     273.3223     262.5480       273.3223

Adjusted R-square improved
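Why adjusted R-square improves even though R-square drops slightly can be seen by applying the adjustment formula to both models. This is a small Python sketch; the function name is hypothetical, and the R-square values, n and p are taken from the two Excel outputs.

```python
def adjusted_r_square(r_square, n, p):
    """Adjusted R2 = R2 - (1 - R2) * p / (n - p - 1)."""
    return r_square - (1 - r_square) * p / (n - p - 1)


n = 15290

# Six independent variables, including Number of Vehicles.
including_num_vehicles = adjusted_r_square(0.749907979, n, p=6)

# Five independent variables, after dropping Number of Vehicles.
excluding_num_vehicles = adjusted_r_square(0.749907424, n, p=5)

improved = excluding_num_vehicles > including_num_vehicles
```

Dropping the variable costs almost nothing in R-square, while the penalty term shrinks because p falls from 6 to 5.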
Multivariate Linear Regression- Regression Equation
Predicted Losses = 625.004932715948
                   - 5.5596551344537 * Avg Age
                   + 50.8828923910091 * Gender Dummy
                   + 78.4016899779131 * Married Dummy
                   - 15.1420259903571 * Avg Vehicle Age
                   + 267.935139741526 * Fuel Type Dummy
Interpretation:
Illustration of using the equation given in MS Excel
Variable           Coefficient    Sign of Coefficient    Inference
Intercept          625.005
Avg Age            -5.560         -ve                    The higher the age, the lower the loss
Gender Dummy       50.883         +ve                    Average loss for Males is higher than for Females
Married Dummy      78.402         +ve                    Average loss for Single is higher than for Married
Avg Vehicle Age    -15.142        -ve                    The older the vehicle, the lower the losses
Fuel Type Dummy    267.935        +ve                    Losses are higher for Fuel Type D
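The regression equation can be applied to a single policy as follows. This is a Python sketch rather than the Excel illustration the slide refers to; the function name and the example policy (a married male aged in the 26-59 band, with a 6-10 year old vehicle) are hypothetical, while the coefficients are those of the fitted equation.

```python
COEFFICIENTS = {
    "Intercept": 625.004932715948,
    "Avg Age": -5.5596551344537,
    "Gender Dummy": 50.8828923910091,
    "Married Dummy": 78.4016899779131,
    "Avg Vehicle Age": -15.1420259903571,
    "Fuel Type Dummy": 267.935139741526,
}


def predicted_losses(avg_age, gender_dummy, married_dummy,
                     avg_vehicle_age, fuel_type_dummy):
    """Evaluate the fitted regression equation for one policy."""
    c = COEFFICIENTS
    return (c["Intercept"]
            + c["Avg Age"] * avg_age
            + c["Gender Dummy"] * gender_dummy
            + c["Married Dummy"] * married_dummy
            + c["Avg Vehicle Age"] * avg_vehicle_age
            + c["Fuel Type Dummy"] * fuel_type_dummy)


# Hypothetical policy: male (1), married (0), band averages 42.4 and 8.02.
petrol = predicted_losses(42.4, 1, 0, 8.02, fuel_type_dummy=0)
diesel = predicted_losses(42.4, 1, 0, 8.02, fuel_type_dummy=1)
```

The gap between `diesel` and `petrol` is exactly the Fuel Type Dummy coefficient, which is what the "+ve" inference in the table expresses.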
Multivariate Linear Regression- Residual Plot
Residual plot:
• Residuals are calculated as “Actual Capped Losses” - “Predicted Capped Losses”
• Residuals should be spread evenly around zero with no visible pattern; otherwise there is some bias in the model
• Except for a few observations (circled in red), the residuals are evenly spread
[Chart: Capped Losses- Residual. Vertical axis from -400 to 1200; horizontal axis from 0 to 14000]
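The residual calculation can be sketched in Python. The four data points and the cut-off of 200 used to flag large residuals are hypothetical; the slides identify the outlying observations visually, not with a threshold.

```python
def residuals(actual, predicted):
    """Residual = actual capped loss - predicted capped loss."""
    return [a - p for a, p in zip(actual, predicted)]


# Hypothetical toy values, not the course data.
actual_losses = [320.0, 150.0, 980.0, 410.0]
predicted = [300.0, 180.0, 600.0, 400.0]

res = residuals(actual_losses, predicted)
mean_residual = sum(res) / len(res)

# Flag observations whose residual is far from zero (cut-off is an assumption).
LARGE_RESIDUAL = 200.0
flagged = [i for i, r in enumerate(res) if abs(r) > LARGE_RESIDUAL]
```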
Multivariate Linear Regression- Gains Chart and Gini
A gains chart represents the effectiveness of a model's predictions, which is quantified by means of the Gini coefficient
Methodology illustrated using MS Excel
Equal Obs Bin    # Policies    Predicted Loss    Actual Loss    Cumulative Actual Loss    Random    Cumulative % Obs    % Cumulative Actual Loss    Area Under Gains Curve
0                0             0                 0              0                         0         0                   0                           0
1                1528          1,167,070         1,230,474      1,230,474                 10%       10%                 20.87%                      0.0104
2                1529          1,046,034         991,944        2,222,418                 10%       20%                 37.69%                      0.0293
3                1529          757,330           746,854        2,969,272                 10%       30%                 50.36%                      0.0440
4                1529          589,366           552,534        3,521,806                 10%       40%                 59.73%                      0.0550
5                1529          531,160           553,919        4,075,725                 10%       50%                 69.12%                      0.0644
6                1529          485,428           477,284        4,553,009                 10%       60%                 77.22%                      0.0732
7                1529          432,934           385,411        4,938,420                 10%       70%                 83.75%                      0.0805
8                1529          385,595           423,814        5,362,234                 10%       80%                 90.94%                      0.0873
9                1529          308,050           310,846        5,673,081                 10%       90%                 96.21%                      0.0936
10               1530          193,465           223,351        5,896,432                 10%       100%                100.00%                     0.0981
Gini Coeff = 0.27177
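The table's last columns can be reproduced in Python. The sketch assumes the bins are already ordered by predicted loss (highest first, as in the table), treats each bin as 10% of the observations, computes the area under the gains curve with the trapezoid rule, and takes Gini = 2 * area - 1. That last formula is an assumption: the slides defer the methodology to MS Excel, but it reproduces the 0.27177 shown.

```python
# Actual loss per bin, copied from the table (bins 1 to 10).
actual_loss_by_bin = [1_230_474, 991_944, 746_854, 552_534, 553_919,
                      477_284, 385_411, 423_814, 310_846, 223_351]


def gains_curve(losses_by_bin):
    """Cumulative share of actual loss after each bin, starting at 0."""
    total = sum(losses_by_bin)
    curve, running = [0.0], 0.0
    for loss in losses_by_bin:
        running += loss
        curve.append(running / total)
    return curve


def area_under_curve(curve):
    """Trapezoid rule with equal-width bins on the horizontal axis."""
    width = 1.0 / (len(curve) - 1)
    return sum(width * (curve[i - 1] + curve[i]) / 2
               for i in range(1, len(curve)))


curve = gains_curve(actual_loss_by_bin)
area = area_under_curve(curve)
gini = 2 * area - 1  # assumed definition; matches the table's figure
```

A model with no predictive power would follow the "Random" diagonal, giving an area of 0.5 and a Gini of 0.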
[Chart: Actual vs Predicted Losses. Horizontal axis: Bins of Equal # Policies; vertical axes: # Policies and Losses; series: # Policies, Predicted Loss, Actual Loss]
[Chart: Gains Chart. Horizontal axis: Bins of Equal # Policies; vertical axis: % Cumulative Actual Loss; series: Cumulative % Obs, % Cumulative Actual Loss]
Multivariate Linear Regression- Fixing Heteroskedasticity (Demo.)
1. Create unique combinations of the variables - Avg Age, Gender Dummy, Married Dummy, Avg Vehicle Age and Fuel Type Dummy
2. Find the “Standard Deviation” of Capped Losses for the segments. Detailed methodology explained in MS Excel.
3. Calculate “Standardized Capped Losses” as “Capped Losses / Segment Std Dev”. This becomes the new response variable.
Doing this kind of exercise manually can be flawed, as some of the segments could be sparsely populated.
This demo is just to explain the underlying technique/methodology.
Statistical packages like SAS and R have built-in capability to take care of this.
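The three steps can be sketched in Python, including a guard for the sparse-segment problem mentioned above. Everything here is illustrative: the function name, the two-variable segments, the toy rows, the minimum segment size of 3 and the choice of the sample standard deviation are assumptions, not the Excel methodology.

```python
from collections import defaultdict
from statistics import stdev

MIN_SEGMENT_SIZE = 3  # hypothetical cut-off for "sparsely populated"


def standardize_by_segment(rows, segment_keys, response_key,
                           min_size=MIN_SEGMENT_SIZE):
    """Divide each response by the standard deviation of its segment."""
    # Step 1: group responses by the unique combination of variables.
    segments = defaultdict(list)
    for row in rows:
        segments[tuple(row[k] for k in segment_keys)].append(row[response_key])

    # Step 2: standard deviation per segment; sparse segments are skipped.
    segment_sd, sparse = {}, []
    for segment, values in segments.items():
        if len(values) < min_size:
            sparse.append(segment)
        else:
            segment_sd[segment] = stdev(values)

    # Step 3: standardized response (None where no usable deviation exists).
    standardized = []
    for row in rows:
        sd = segment_sd.get(tuple(row[k] for k in segment_keys))
        standardized.append(row[response_key] / sd if sd else None)
    return standardized, sorted(sparse)


# Hypothetical rows using only two of the five segment variables.
rows = [
    {"Gender Dummy": 1, "Fuel Type Dummy": 1, "Capped Losses": 900.0},
    {"Gender Dummy": 1, "Fuel Type Dummy": 1, "Capped Losses": 500.0},
    {"Gender Dummy": 1, "Fuel Type Dummy": 1, "Capped Losses": 700.0},
    {"Gender Dummy": 0, "Fuel Type Dummy": 0, "Capped Losses": 300.0},
]
standardized, sparse_segments = standardize_by_segment(
    rows, ["Gender Dummy", "Fuel Type Dummy"], "Capped Losses")
```

Rows in sparse segments come back as `None` so that they can be reviewed rather than scaled by an unreliable deviation.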
Multivariate Linear Regression- Fixing Heteroskedasticity (Demo.)
SUMMARY OUTPUT

Regression Statistics
Multiple R           0.359167467
R Square             0.129001269
Adjusted R Square    0.128716331
Standard Error       4.77078689
Observations         15290

ANOVA
              df       SS           MS          F         Significance F
Regression    5        51522.10     10304.42    452.73    0
Residual      15284    347870.07    22.76
Total         15289    399392.17

                   Coefficients    Standard Error    t Stat     P-value    Lower 95%    Upper 95%    Lower 95.0%    Upper 95.0%
Intercept          12.476          0.197             63.374     0.000      12.091       12.862       12.091         12.862
Avg Age            -0.086          0.003             -31.554    0.000      -0.091       -0.081       -0.091         -0.081
Gender Dummy       0.213           0.079             2.702      0.007      0.058        0.368        0.058          0.368
Married Dummy      -0.204          0.080             -2.552     0.011      -0.361       -0.047       -0.361         -0.047
Avg Vehicle Age    -0.376          0.011             -33.770    0.000      -0.398       -0.354       -0.398         -0.354
Fuel Type Dummy    0.136           0.115             1.188      0.235      -0.088       0.361        -0.088         0.361

Fuel Type Dummy comes out insignificant, which is questionable as “D” and “P” have significantly different mean losses.
Multivariate Linear Regression- Using R
Step 1: Download and install R software from http://www.r-project.org/
Step 2: Convert the data to an R-readable format, e.g. *.csv.
• D:\Linear Reg using R\Linear_Reg_Sample_Data.csv
Writing R code for
• Reading the data
• Fitting the Linear Regression
> # Read the sample data from the CSV file
> LinRegData <- read.csv(file = "D:\\Linear Reg using R\\Linear_Reg_Sample_Data.csv")
> # Fit the linear regression of Capped_Losses on the six independent variables
> FitLinReg <- lm(Capped_Losses ~ Number_of_Vehicles + Avg_Age + Gender_Dummy + Married_Dummy + Avg_Vehicle_Age + Fuel_Type_Dummy, data = LinRegData)
Multivariate Linear Regression- Using R
Output
Thank you!
Pristine
702, Raaj Chambers, Old Nagardas Road, Andheri (E), Mumbai-400 069. INDIA
www.edupristine.com
Ph. +91 22 3215 6191