© Pristine www.edupristine.com Business Analytics Multivariate Linear Regression (Using Ms-Excel & “R”)


Multivariate Linear Regression- Fixing Heteroskedasticity

Univariate scenario:

• Find the “Standard Deviation” of “response” variable for the different levels of “independent” variable

• Divide the individual values of the response variable by the respective “standard deviation”

• The scaled values become the new “response variable”

• E.g. if variable is “Fuel Type”

– If Fuel Type is “D”, divide “Capped Losses” by SquareRoot(33862) = 184

– If Fuel Type is “P”, divide “Capped Losses” by SquareRoot(16400) = 128

Multivariate scenario:

• Create all possible unique combinations of independent variables

• For each of the combinations, find “Standard Deviations”

• Divide the individual values of the response variable by the respective “standard deviation”

• Too cumbersome to do manually using MS Excel. Also the process is iterative.

• More convenient to do using Statistical packages like R.
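The scaling steps above can be sketched in R. This is a minimal illustration on hypothetical toy data, not the course workbook: the data frame `policies`, its column names and the simulated values are all assumptions made for the example.

```r
# Hypothetical toy data; column names and values are illustrative assumptions.
set.seed(1)
policies <- data.frame(
  Fuel_Type     = rep(c("D", "P"), each = 50),
  Gender        = rep(c("F", "M"), times = 50),
  Capped_Losses = c(rnorm(50, mean = 600, sd = 184),
                    rnorm(50, mean = 300, sd = 128))
)

# Univariate scenario: standard deviation of the response per level of one variable.
policies$sd_fuel     <- ave(policies$Capped_Losses, policies$Fuel_Type, FUN = sd)
policies$scaled_fuel <- policies$Capped_Losses / policies$sd_fuel

# Multivariate scenario: one segment per unique combination of variables.
policies$sd_segment     <- ave(policies$Capped_Losses,
                               policies$Fuel_Type, policies$Gender, FUN = sd)
policies$scaled_segment <- policies$Capped_Losses / policies$sd_segment

# After scaling, the response has a standard deviation of 1 within every group.
sd_by_fuel <- tapply(policies$scaled_fuel, policies$Fuel_Type, sd)
print(round(sd_by_fuel, 6))
```

`ave()` returns the group statistic aligned with the original rows, which avoids a separate lookup step for each segment.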

Course approach

• First fit a multivariate regression without fixing heteroskedasticity to get a final set of significant variables

• Then do a manual adjustment and re-fit the regression using MS Excel. This is for demonstration only, as manual adjustment is always questionable.

• Demonstrate linear regression using R


Linear Regression- Preparing MS Excel


Linear Regression- Using MS Excel (Demo.)


Multivariate Linear Regression- Variable Selection


Variable selection to be done on the basis of

• Multicollinearity (correlation between independent variables)

• Banding of variables e.g. whether to use “Age” or “Age Band” (also called custom bands)

• Statistical significance of variables tested after performing above two steps

List of independent variables:

1. Age

2. Age Band

3. Years of Driving Experience

4. Number of Vehicles

5. Gender

6. Married

7. Vehicle Age

8. Vehicle Age Band

9. Fuel Type


Multivariate Linear Regression- Variable Selection (Multicollinearity)


“Age” and “Years of Driving Experience” are highly correlated (Correlation Coefficient = 0.9972). We can use either one of the variables in the regression, but not both.

Q: Which one to use and which one to reject?

Sol: Fit two separate models, using one of the variables at a time. Check the goodness of fit (R2 in this case). The variable producing the higher R2 is accepted.

Regression Statistics    Age         Years of Driving Experience
Multiple R               0.475766    0.475273
R Square                 0.226354    0.225885
Adjusted R Square        0.226303    0.225834
Standard Error           201.2306    201.2916
Observations             15290       15290

R2 for Age > R2 for Years of Driving Experience

Reject Years of Driving Experience
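A minimal R sketch of this comparison, on hypothetical simulated data (the data frame `drivers` and its columns are assumptions for illustration, not the course data): check the correlation, fit one model per candidate variable, and keep the one with the higher R2.

```r
# Hypothetical simulated data; names and values are illustrative assumptions.
set.seed(42)
n   <- 500
age <- round(runif(n, 18, 80))
# Driving experience tracks age closely, so the two are highly correlated.
years_driving <- pmax(age - 18 + rnorm(n, sd = 1), 0)
losses  <- 600 - 5 * age + rnorm(n, sd = 100)
drivers <- data.frame(age, years_driving, losses)

# Step 1: correlation between the two candidate variables.
r <- cor(drivers$age, drivers$years_driving)

# Step 2: fit one univariate model per candidate and compare R2.
fit_age <- lm(losses ~ age, data = drivers)
fit_exp <- lm(losses ~ years_driving, data = drivers)
r2 <- c(age           = summary(fit_age)$r.squared,
        years_driving = summary(fit_exp)$r.squared)

# Step 3: keep the variable with the higher R2.
keep <- names(which.max(r2))
print(round(r, 4))
print(round(r2, 4))
print(keep)
```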


Multivariate Linear Regression- Custom Bands


Investigate whether to use “Age” or “Age band”

Fit regression independently using “Age” and “Age Band”

Before fitting regression, “Age Band” needs to be converted to numerical form from categorical. Replace “Age Band” values with “Average Age” for the particular band.

R2 for Average Age > R2 for Age

Select “Average Age”

Age Band    Sum of Age    # Policies    Average Age
16-25       93,770.0      4,563.0       20.6
26-59       270,793.0     6,384.0       42.4
60+         282,636.0     4,343.0       65.1

Regression results using “Age” and “Average Age”

Regression Statistics    Average Age    Age
Multiple R               0.509969       0.475766
R Square                 0.260068       0.226354
Adjusted R Square        0.26002        0.226303
Standard Error           196.7971       201.2306
Observations             15290          15290
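The band-to-average conversion can be sketched in R as follows. The individual ages and the band cut points passed to `cut()` are hypothetical assumptions; only the band totals in the second half come from the table above.

```r
# Hypothetical individual ages; band edges chosen to match the labels 16-25, 26-59, 60+.
ages     <- c(18, 22, 25, 30, 45, 59, 60, 72, 80)
age_band <- cut(ages, breaks = c(15, 25, 59, Inf),
                labels = c("16-25", "26-59", "60+"))

# Replace each band with the average age of the policies that fall in it.
avg_age <- ave(ages, age_band, FUN = mean)
print(data.frame(ages, age_band, avg_age))

# The same calculation from the band totals shown in the table above.
sum_of_age  <- c("16-25" = 93770, "26-59" = 270793, "60+" = 282636)
n_policies  <- c("16-25" = 4563,  "26-59" = 6384,   "60+" = 4343)
average_age <- sum_of_age / n_policies
print(round(average_age, 1))
```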


Multivariate Linear Regression- Custom Bands

Investigate whether to use “Vehicle Age” or “Vehicle Age Band”

Fit regression independently using “Vehicle Age” and “Vehicle Age Band”

Before fitting regression, “Vehicle Age Band” needs to be converted to numerical form from categorical. Replace “Vehicle Age Band” values with “Average Vehicle Age” for the particular band.

Vehicle Age Band    Sum of Vehicle Age    # Policies    Average Vehicle Age
0-5                 9,229                 3,688         2.50
6-10                44,298                5,523         8.02
11+                 78,819                6,079         12.97

Regression results using “Vehicle Age” and “Average Vehicle Age”

Regression Statistics    Average Vehicle Age    Vehicle Age
Multiple R               0.303099405            0.289431325
R Square                 0.09186925             0.083770492
Adjusted R Square        0.091809848            0.083710561
Standard Error           218.0203272            218.9903277
Observations             15290                  15290

R2 for Average Vehicle Age > R2 for Vehicle Age

Select “Average Vehicle Age”


Multivariate Linear Regression- Variable Selection


List of shortlisted variables:

1. Age Band in the form of “Average Age” of the band (selected out of “Age” and “Age Band”). Also got selected over “Years of Driving Experience”.

2. Number of Vehicles

3. Gender

4. Married

5. Vehicle Age Band in the form of “Average Vehicle Age” of the band (selected out of “Vehicle Age” and “Vehicle Age Band”).

6. Fuel Type

We will run the regression in “multivariate” fashion and then select the final list of variables on the basis of “statistical significance”.


Multivariate Linear Regression- Categorical variable conversion


Categorical variables in Binary form need to be converted to their numerical equivalent (0, 1)

1. Gender (F = 0 and M = 1)

2. Married (Married = 0 and Single = 1)

3. Fuel Type (P = 0, D = 1)
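A minimal R sketch of this 0/1 coding, assuming the raw columns hold the labels listed above; the data frame `policies` and its four rows are hypothetical.

```r
# Hypothetical raw data using the category labels from the list above.
policies <- data.frame(
  Gender    = c("F", "M", "M", "F"),
  Married   = c("Married", "Single", "Married", "Single"),
  Fuel_Type = c("P", "D", "P", "D"),
  stringsAsFactors = FALSE
)

# A logical comparison converted to integer gives the 0/1 dummy directly.
policies$Gender_Dummy    <- as.integer(policies$Gender == "M")        # F = 0, M = 1
policies$Married_Dummy   <- as.integer(policies$Married == "Single")  # Married = 0, Single = 1
policies$Fuel_Type_Dummy <- as.integer(policies$Fuel_Type == "D")     # P = 0, D = 1

print(policies)
```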

Snapshot of the final data on which we will run the multivariate regression


Multivariate Linear Regression- Output


SUMMARY OUTPUT

Regression Statistics
Multiple R           0.865972274
R Square             0.749907979
Adjusted R Square    0.749809794
Standard Error       114.4310136
Observations         15290

ANOVA
              df       SS             MS             F              Significance F
Regression    6        600073213.5    100012202.3    7637.751088    0
Residual      15283    200122584.4    13094.45688
Total         15289    800195798

                      Coefficients    Standard Error    t Stat       P-value    Lower 95%    Upper 95%    Lower 95.0%    Upper 95.0%
Intercept             624.56529       5.29192           118.02233    0.00000    614.19249    634.93809    614.19249      634.93809
Avg Age               -5.55974        0.06546           -84.93889    0.00000    -5.68804     -5.43144     -5.68804       -5.43144
Number of Vehicles    0.17875         0.97039           0.18420      0.85386    -1.72333     2.08082      -1.72333       2.08082
Gender Dummy          50.88326        1.89081           26.91084     0.00000    47.17705     54.58947     47.17705       54.58947
Married Dummy         78.39837        1.92148           40.80106     0.00000    74.63204     82.16469     74.63204       82.16469
Avg Vehicle Age       -15.14220       0.26734           -56.63987    0.00000    -15.66623    -14.61818    -15.66623      -14.61818
Fuel Type Dummy       267.93559       2.74845           97.48614     0.00000    262.54830    273.32287    262.54830      273.32287


Multivariate Linear Regression- Output


How each column is derived:
• Coefficients (b)
• Standard Error (σ)
• t Stat = b/σ
• P-value: from the t-dist table
• Lower 95% = b - 1.96*σ
• Upper 95% = b + 1.96*σ

Independent Vars                 Coefficients (b)    Standard Error (σ)    t Stat      P-value    Lower 95%    Upper 95%    Lower 95%    Upper 95%
     Intercept             a     624.565             5.292                 118.022     0.000      614.192      634.938      614.192      634.938
X1   Avg Age               b1    -5.560              0.065                 -84.939     0.000      -5.688       -5.431       -5.688       -5.431
X2   Number of Vehicles    b2    0.179               0.970                 0.184       0.854      -1.723       2.081        -1.723       2.081
X3   Gender Dummy          b3    50.883              1.891                 26.911      0.000      47.177       54.589       47.177       54.589
X4   Married Dummy         b4    78.398              1.921                 40.801      0.000      74.632       82.165       74.632       82.165
X5   Avg Vehicle Age       b5    -15.142             0.267                 -56.640     0.000      -15.666      -14.618      -15.666      -14.618
X6   Fuel Type Dummy       b6    267.936             2.748                 97.486      0.000      262.548      273.323      262.548      273.323

X2 (Number of Vehicles) is insignificant.

ANOVA
              SS formula                        df formula    df       SS             MS (SS/df)     F (MSReg/MSRes)    Significance F (from F dist table)
Regression    ∑(y_predicted - y_mean)^2         p             6        600073213.5    100012202.3    7637.75            0
Residual      ∑(y_actual - y_predicted)^2       n-p-1         15283    200122584.4    13094.457
Total         ∑(y_actual - y_mean)^2            n-1           15289    800195798

Regression Statistics
Multiple R           SquareRoot(R Square)               0.8659723
R Square             SS Regression/SS Total             0.7499080
Adjusted R Square    R2 - (1 - R2)*{p/(n-p-1)}          0.7498098
Standard Error       SquareRoot{SS Residual/(n-p-1)}    114.4310136
Observations         n                                  15290
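The formulas above can be checked by recomputing the regression statistics from the ANOVA sums of squares. This is a short R sketch; the variable names are chosen for illustration, and the inputs are the SS values and counts reported in the Excel output.

```r
# Inputs taken from the ANOVA table above.
n      <- 15290          # observations
p      <- 6              # independent variables
ss_reg <- 600073213.5    # SS Regression
ss_res <- 200122584.4    # SS Residual
ss_tot <- ss_reg + ss_res

r2         <- ss_reg / ss_tot                          # R Square
multiple_r <- sqrt(r2)                                 # Multiple R
adj_r2     <- r2 - (1 - r2) * (p / (n - p - 1))        # Adjusted R Square
std_error  <- sqrt(ss_res / (n - p - 1))               # Standard Error
f_stat     <- (ss_reg / p) / (ss_res / (n - p - 1))    # F = MSReg / MSRes

print(c(multiple_r = multiple_r, r2 = r2, adj_r2 = adj_r2,
        std_error = std_error, f_stat = f_stat))
```

The adjusted R Square formula used here is algebraically the same as the more common form 1 - (1 - R2)*(n-1)/(n-p-1).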


Multivariate Linear Regression- Output (Significance Test)


Significance test of coefficients based on the Normal distribution

H0: b is no different than 0 (i.e. 0 is the coefficient when the variable is not included in the regression)
H1: b is different than 0

Test statistic: Z = (b-0)/σ (at the 95% two-tailed confidence level, the critical value is Z = 1.96)
Confidence interval = (b - 1.96 * σ, b + 1.96 * σ)
For a variable to be significant, the interval must not contain “0”.

Example 1: Avg Age
Confidence interval = (-5.560-1.96*0.065, -5.560+1.96*0.065) = (-5.688, -5.431)
No zero in the interval. Hence significant.

Example 2: Number of Vehicles
Confidence interval = (0.179-1.96*0.970, 0.179+1.96*0.970) = (-1.723, 2.081)
Zero is present in the interval. Hence insignificant.

(Interval endpoints are taken from the unrounded Excel output, so they can differ in the last digit from a calculation with the rounded b and σ.)
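A minimal R sketch of the interval test, using the rounded coefficients and standard errors quoted in the two examples; the vector names are illustrative assumptions.

```r
# Rounded values from the two examples above.
b  <- c(Avg_Age = -5.560, Number_of_Vehicles = 0.179)
se <- c(Avg_Age =  0.065, Number_of_Vehicles = 0.970)

# Two-tailed 95% critical value of the standard Normal distribution (about 1.96).
z <- qnorm(0.975)

lower <- b - z * se
upper <- b + z * se

# A coefficient is significant when its interval does not contain zero.
significant <- (lower > 0) | (upper < 0)

print(round(cbind(lower, upper), 3))
print(significant)
```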


Multivariate Linear Regression- Output (Significance Test)


Significance test of coefficients based on the t distribution

• b/StdErr(b) follows a t distribution with n-p-1 degrees of freedom (n-2 when there is a single independent variable)

H0: b is no different than 0 (i.e. 0 is the coefficient when the variable is not included in the regression)
H1: b is different than 0

At the 95% two-tailed confidence level and df greater than 120, t ≈ 1.96
Confidence interval = (b - 1.96 * σ, b + 1.96 * σ)
For a variable to be significant, the interval must not contain “0”.

Example 1: Avg Age
Confidence interval = (-5.560-1.96*0.065, -5.560+1.96*0.065) = (-5.688, -5.431)
No zero in the interval. Hence significant.

Example 2: Number of Vehicles
Confidence interval = (0.179-1.96*0.970, 0.179+1.96*0.970) = (-1.723, 2.081)
Zero is present in the interval. Hence insignificant.

(Interval endpoints are taken from the unrounded Excel output, so they can differ in the last digit from a calculation with the rounded b and σ.)
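As a sketch of the t-based version in R, assuming the residual degrees of freedom n-p-1 = 15283 reported in the ANOVA output, the critical value and the P-value for “Number of Vehicles” can be obtained with `qt()` and `pt()`. The variable names are illustrative.

```r
n  <- 15290
p  <- 6
df <- n - p - 1            # residual degrees of freedom (15283)

# Two-tailed 95% critical value; with a df this large it is very close to 1.96.
t_crit <- qt(0.975, df)

# Number of Vehicles: coefficient and standard error from the Excel output.
t_stat  <- 0.17875 / 0.97039
p_value <- 2 * pt(-abs(t_stat), df)

print(c(t_crit = t_crit, t_stat = t_stat, p_value = p_value))
```

A P-value above 0.05 leads to the same conclusion as a 95% interval that contains zero.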


Multivariate Linear Regression- Output at 95% Confidence Interval


SUMMARY OUTPUT

Regression Statistics    Excluding "Num Vehicles"    Including "Num Vehicles"
Multiple R               0.865971953                 0.865972274
R Square                 0.749907424                 0.749907979
Adjusted R Square        0.749825608                 0.749809794
Standard Error           114.4273971                 114.4310136
Observations             15290                       15290

ANOVA
              df       SS             MS             F           Significance F
Regression    5        600072769.2    120014553.8    9165.874    0
Residual      15284    200123028.7    13093.6292
Total         15289    800195798

                   Coefficients    Standard Error    t Stat     P-value    Lower 95%    Upper 95%    Lower 95.0%    Upper 95.0%
Intercept          625.005         4.723             132.333    0.00       615.7474     634.2625     615.7474       634.2625
Avg Age            -5.560          0.065             -84.942    0.00       -5.6879      -5.4314      -5.6879        -5.4314
Gender Dummy       50.883          1.891             26.912     0.00       47.1768      54.5890      47.1768        54.5890
Married Dummy      78.402          1.921             40.806     0.00       74.6356      82.1677      74.6356        82.1677
Avg Vehicle Age    -15.142         0.267             -56.641    0.00       -15.6660     -14.6180     -15.6660       -14.6180
Fuel Type Dummy    267.935         2.748             97.489     0.00       262.5480     273.3223     262.5480       273.3223

Adjusted R-square improved


Multivariate Linear Regression- Regression Equation


Predicted Losses = 625.004932715948
                   - 5.5596551344537 * Avg Age
                   + 50.8828923910091 * Gender Dummy
                   + 78.4016899779131 * Married Dummy
                   - 15.1420259903571 * Avg Vehicle Age
                   + 267.935139741526 * Fuel Type Dummy

Interpretation:

An illustration of using the equation is given in MS Excel.

                   Coefficients    Sign of Coefficient    Inference
Intercept          625.005
Avg Age            -5.560          -ve                    The higher the age, the lower the loss
Gender Dummy       50.883          +ve                    Average loss for Males is higher than for Females
Married Dummy      78.402          +ve                    Average loss for Single is higher than for Married
Avg Vehicle Age    -15.142         -ve                    The older the vehicle, the lower the losses
Fuel Type Dummy    267.935         +ve                    Losses are higher for Fuel Type D
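The regression equation can be wrapped in a small R function for scoring a policy. This is a sketch only: the function name `predict_losses` and the example policy (a married male in the 26-59 age band, vehicle in the 6-10 band, fuel type P) are hypothetical, while the coefficients are the ones in the equation above.

```r
# Coefficients copied from the regression equation above.
predict_losses <- function(avg_age, gender_dummy, married_dummy,
                           avg_vehicle_age, fuel_type_dummy) {
  625.004932715948 -
    5.5596551344537  * avg_age +
    50.8828923910091 * gender_dummy +
    78.4016899779131 * married_dummy -
    15.1420259903571 * avg_vehicle_age +
    267.935139741526 * fuel_type_dummy
}

# Hypothetical policy: Avg Age 42.4, Male (1), Married (0),
# Avg Vehicle Age 8.02, Fuel Type P (0).
example_loss <- predict_losses(avg_age = 42.4, gender_dummy = 1, married_dummy = 0,
                               avg_vehicle_age = 8.02, fuel_type_dummy = 0)
print(round(example_loss, 2))

# Switching the same policy to Fuel Type D adds the Fuel Type Dummy coefficient.
diesel_loss <- predict_losses(42.4, 1, 0, 8.02, 1)
print(round(diesel_loss - example_loss, 3))
```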


Multivariate Linear Regression- Residual Plot


Residual plot:

• Residuals are calculated as “Actual Capped Losses” - “Predicted Capped Losses”

• Residuals should be spread evenly around zero with no visible pattern; otherwise there is some bias in the model

• Except for a few observations (circled in red), the residuals are evenly spread

[Figure: Capped Losses- Residual. Residuals (axis from -400 to 1200) plotted against observations (axis from 0 to 14000).]
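A minimal R sketch of the residual calculation, on hypothetical simulated data (the data frame `toy` and its columns are assumptions, not the course data). The plot itself is only drawn in an interactive session.

```r
# Hypothetical simulated data for illustration.
set.seed(7)
n   <- 200
x   <- runif(n, 20, 70)
y   <- 600 - 5 * x + rnorm(n, sd = 50)
toy <- data.frame(x, y)

fit <- lm(y ~ x, data = toy)

# Residual = actual - predicted.
predicted    <- fitted(fit)
resid_manual <- toy$y - predicted

# The residuals of a least-squares fit with an intercept average to zero.
print(round(mean(resid_manual), 10))

# Residuals against observation index, with a reference line at zero.
if (interactive()) {
  plot(seq_along(resid_manual), resid_manual,
       xlab = "Observation", ylab = "Residual")
  abline(h = 0, lty = 2)
}
```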


Multivariate Linear Regression- Gains Chart and Gini


Gains chart is used to represent the effectiveness of a model prediction which is quantified by means of Gini Coefficient

Methodology illustrated using MS Excel

Equal Obs Bin    # Policies    Predicted Loss    Actual Loss    Cumulative Actual Loss    Random    Cumulative % Obs    % Cumulative Actual Loss    Area Under Gains Curve
0                0             0                 0              0                         0         0                   0                           0
1                1528          1,167,070         1,230,474      1,230,474                 10%       10%                 20.87%                      0.0104
2                1529          1,046,034         991,944        2,222,418                 10%       20%                 37.69%                      0.0293
3                1529          757,330           746,854        2,969,272                 10%       30%                 50.36%                      0.0440
4                1529          589,366           552,534        3,521,806                 10%       40%                 59.73%                      0.0550
5                1529          531,160           553,919        4,075,725                 10%       50%                 69.12%                      0.0644
6                1529          485,428           477,284        4,553,009                 10%       60%                 77.22%                      0.0732
7                1529          432,934           385,411        4,938,420                 10%       70%                 83.75%                      0.0805
8                1529          385,595           423,814        5,362,234                 10%       80%                 90.94%                      0.0873
9                1529          308,050           310,846        5,673,081                 10%       90%                 96.21%                      0.0936
10               1530          193,465           223,351        5,896,432                 10%       100%                100.00%                     0.0981

Gini Coeff: 0.27177
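The gains table can be reproduced in R from the per-bin policy counts and actual losses. This sketch assumes the bins are already sorted by predicted loss, highest first, as in the table. It uses the trapezoid rule for the area under the gains curve and Gini = 2 * area - 1, a formula that is consistent with the tabulated areas and the Gini Coeff of 0.27177 but is not stated explicitly in the slides.

```r
# Per-bin values copied from the gains table (bins 1 to 10).
policies    <- c(1528, rep(1529, 8), 1530)
actual_loss <- c(1230474, 991944, 746854, 552534, 553919,
                 477284, 385411, 423814, 310846, 223351)

# Cumulative shares, starting from the origin (bin 0).
cum_obs  <- c(0, cumsum(policies) / sum(policies))
cum_loss <- c(0, cumsum(actual_loss) / sum(actual_loss))

# Trapezoid area under the gains curve for each bin.
area <- diff(cum_obs) * (head(cum_loss, -1) + tail(cum_loss, -1)) / 2

# Assumed relation: Gini = 2 * (area under gains curve) - 1.
gini <- 2 * sum(area) - 1

print(round(100 * cum_loss[-1], 2))
print(round(area, 4))
print(round(gini, 5))
```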

[Figure: Actual vs Predicted Losses. x-axis: Bins of Equal # Policies; y-axes: # Policies and Losses; series: # Policies, Predicted Loss, Actual Loss.]

[Figure: Gains Chart. x-axis: Bins of Equal # Policies; y-axis: % Cumulative Actual Loss; series: Cumulative % Obs, % Cumulative Actual Loss.]


Multivariate Linear Regression- Fixing Heteroskedasticity (Demo.)

1. Create unique combinations of the variables - Avg Age, Gender Dummy, Married Dummy, Avg Vehicle Age and Fuel Type Dummy

2. Find the “Standard Deviation” of Capped Losses for the segments. Detailed methodology explained in MS Excel.

3. Calculate “Standardized Capped Losses” as “Capped Losses / Segment Std Dev”. This becomes the new response variable.

Doing this kind of exercise manually can be flawed, as some of the segments could be sparsely populated.

This demo is just to explain the underlying technique/methodology.

Statistical packages such as SAS and R have built-in capability to take care of this.
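One option R offers here is weighted least squares through the `weights` argument of `lm()`. Note that this is a related but different technique from dividing only the response by the segment standard deviation, and the slides do not say which facility they have in mind. The sketch below uses hypothetical simulated data; the data frame `toy`, its columns and the noise levels are assumptions.

```r
# Hypothetical simulated data whose noise level depends on fuel type.
set.seed(3)
n   <- 400
toy <- data.frame(
  Avg_Age         = sample(c(20.6, 42.4, 65.1), n, replace = TRUE),
  Fuel_Type_Dummy = rep(c(0, 1), each = n / 2)
)
noise_sd <- ifelse(toy$Fuel_Type_Dummy == 1, 184, 128)
toy$Capped_Losses <- 600 - 5 * toy$Avg_Age + 270 * toy$Fuel_Type_Dummy +
  rnorm(n, sd = noise_sd)

# Standard deviation of the response within each segment (unique combination).
toy$segment_sd <- ave(toy$Capped_Losses, toy$Avg_Age, toy$Fuel_Type_Dummy, FUN = sd)

# Weighted least squares: each observation is weighted by 1 / variance of its segment.
toy$w   <- 1 / toy$segment_sd^2
fit_ols <- lm(Capped_Losses ~ Avg_Age + Fuel_Type_Dummy, data = toy)
fit_wls <- lm(Capped_Losses ~ Avg_Age + Fuel_Type_Dummy, data = toy, weights = w)

print(rbind(OLS = coef(fit_ols), WLS = coef(fit_wls)))
```

Unlike rescaling only the response, weighting keeps the coefficients on the original loss scale, so a variable such as fuel type keeps its interpretation.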


Multivariate Linear Regression- Fixing Heteroskedasticity (Demo.)

SUMMARY OUTPUT

Regression Statistics
Multiple R           0.359167467
R Square             0.129001269
Adjusted R Square    0.128716331
Standard Error       4.77078689
Observations         15290

ANOVA
              df       SS           MS          F         Significance F
Regression    5        51522.10     10304.42    452.73    0
Residual      15284    347870.07    22.76
Total         15289    399392.17

                   Coefficients    Standard Error    t Stat     P-value    Lower 95%    Upper 95%    Lower 95.0%    Upper 95.0%
Intercept          12.476          0.197             63.374     0.000      12.091       12.862       12.091         12.862
Avg Age            -0.086          0.003             -31.554    0.000      -0.091       -0.081       -0.091         -0.081
Gender Dummy       0.213           0.079             2.702      0.007      0.058        0.368        0.058          0.368
Married Dummy      -0.204          0.080             -2.552     0.011      -0.361       -0.047       -0.361         -0.047
Avg Vehicle Age    -0.376          0.011             -33.770    0.000      -0.398       -0.354       -0.398         -0.354
Fuel Type Dummy    0.136           0.115             1.188      0.235      -0.088       0.361        -0.088         0.361

Fuel Type Dummy comes out insignificant, which is questionable as “D” and “P” have significantly different mean losses.


Multivariate Linear Regression- Using R

Step 1: Download and install R software from http://www.r-project.org/

Step 2: Convert the data to R readable format e.g. *.csv.

• D:\Linear Reg using R\Linear_Reg_Sample_Data.csv

Writing R code for

• Reading the data

• Fitting the Linear Regression

> LinRegData <- read.csv(file = "D:\\Linear Reg using R\\Linear_Reg_Sample_Data.csv")
> FitLinReg <- lm(Capped_Losses ~ Number_of_Vehicles + Avg_Age + Gender_Dummy + Married_Dummy + Avg_Vehicle_Age + Fuel_Type_Dummy, data = LinRegData)


Multivariate Linear Regression- Using R


Output
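A minimal sketch of how the output of a fitted model can be inspected in R. It runs on hypothetical simulated data with only two of the predictors, because the course CSV is not available here; the simulated values are assumptions, and the object names mirror the ones used in the course code.

```r
# Hypothetical simulated stand-in for the course data.
set.seed(11)
n <- 300
LinRegData <- data.frame(
  Avg_Age         = sample(c(20.6, 42.4, 65.1), n, replace = TRUE),
  Fuel_Type_Dummy = rbinom(n, 1, 0.3)
)
LinRegData$Capped_Losses <- 625 - 5.5 * LinRegData$Avg_Age +
  268 * LinRegData$Fuel_Type_Dummy + rnorm(n, sd = 100)

FitLinReg <- lm(Capped_Losses ~ Avg_Age + Fuel_Type_Dummy, data = LinRegData)

# summary() holds the coefficient table (estimate, std. error, t value, p-value)
# together with R-squared and adjusted R-squared.
fit_summary <- summary(FitLinReg)
print(fit_summary$coefficients)
print(c(r2 = fit_summary$r.squared, adj_r2 = fit_summary$adj.r.squared))

# 95% confidence intervals for the coefficients.
print(confint(FitLinReg, level = 0.95))
```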


Thank you!


Pristine

702, Raaj Chambers, Old Nagardas Road, Andheri (E), Mumbai-400 069. INDIA

www.edupristine.com

Ph. +91 22 3215 6191