Regularization for the Linear Models and Extensions
R-Project : Ferritin Data Set
Students' Names:
Chantrea SAM
Sothea HAS
December 11, 2016
Contents
1 Introduction
2 Data description and descriptive analysis
  2.1 Data description
  2.2 Descriptive analysis
    2.2.1 Response Variable: Ferritin (Ferr)
    2.2.2 Predictors and correlations
3 Construction of Multilinear Models
  3.1 Some Trivial Multilinear models
  3.2 Exhaustive search
  3.3 Construction by stepwise (Forward/Backward)
4 Penalised Regression
  4.1 Ridge Regression
  4.2 Lasso Regression
    4.2.1 Lasso.min & Lasso.1se
    4.2.2 Lasso.BIC & Lasso.mBIC
  4.3 Residuals of Ridge & Lasso regression
5 Evaluation of obtained models
6 Elastic Net
7 Conclusion
1 Introduction
Ferritin is the major iron storage protein of the body, and its level can be used to measure the iron level in the body indirectly. Its normal value range is:
• Male: 12 to 300 nanograms per milliliter (ng/mL)
• Female: 12 to 150 ng/mL
Objective
In this project, our goal is to build an appropriate model for predicting the Ferritin level in the body of individuals, taking the Ferr variable of the Ferritin data set as the response variable, using only significant predictors and reducing redundant variables as much as possible. We will do this using all the methods and algorithms seen during the course, together with some related methods.
2 Data description and descriptive analysis
2.1 Data description
This data set was collected at the Australian National Sport Institute, recording the Ferritin concentration and various covariates for 102 men and 100 women. The data set therefore consists of 202 rows and 13 columns, corresponding to 13 variables:

• Sex : male or female
• Sport : kind of sport practised
• RCC : Red Cell Count
• WCC : White Cell Count
• Hc : Hematocrit
• Hg : Hemoglobin
• Ferr : Plasma ferritin concentration (ng/mL)
• BMI : Body Mass Index = weight / height²
• SSF : Sum of Skin Folds
• X.Bfat : Body Fat (%)
• LBM : Lean Body Mass
• Ht : Height (cm)
• Wt : Weight (kg)
Sex Sport RCC WCC Hc Hg Ferr BMI SSF X.Bfat LBM Ht Wt
1 female BBall 3.96 7.5 37.5 12.3 60 20.56 109.1 19.75 63.32 195.9 78.9
2 female BBall 4.41 8.3 38.2 12.7 68 20.67 102.8 21.30 58.55 189.7 74.4
3 female BBall 4.14 5.0 36.4 11.6 21 21.86 104.6 19.88 55.36 177.8 69.1
4 female BBall 4.11 5.3 37.3 12.6 69 21.88 126.4 23.66 57.18 185.0 74.9
5 female BBall 4.45 6.8 41.5 14.0 29 18.96 80.3 17.64 53.20 184.6 64.6
6 female BBall 4.10 4.4 37.4 12.5 42 21.04 75.2 15.58 53.77 174.0 63.7
2.2 Descriptive analysis
2.2.1 Response Variable : Ferritin (Ferr)
In this data set, the response variable Ferr takes values between 8.00 ng/mL and 234.00 ng/mL, with an average value of 76.88 ng/mL, a variance of 2256.368, and the other summary statistics shown in Table 2.2.1.a.
Min. 1st Qu. Median Mean 3rd Qu. Max.
8.00 41.25 65.50 76.88 97.00 234.00
Table 2.2.1.a Ferritin summary
This variable does not appear to follow a Gaussian distribution, as is immediately visible from its distribution in Figure 2.2.1.a. The Shapiro test of normality in Table 2.2.1.b supports this claim: the p-value of 5.265e−11 is very small, so we can reject the normality of this variable.
Moreover, the boxplot in Figure 2.2.1.a shows some unusual data points, which pose another problem for our regression. We will examine the effect of those outliers on our model later.
[Figure: three panels describing the distribution of Ferritin: a histogram (density scale), a boxplot, and the repartition function Fn(x), over the range 0 to 200 ng/mL]
Figure 2.2.1.a Distribution of Ferritin
Shapiro-Wilk normality test
data: ferritin0$Ferr
W = 0.89008, p-value = 5.265e-11
Table 2.2.1.b Shapiro Test of Ferritin
2.2.2 Predictors and correlations
There are 12 predictors in our data set, 2 of which are categorical variables. In this part, we will focus on two important kinds of correlations: one between the independent variables themselves, and another between the independent variables and the
response variable. Since there are two kinds of predictors in our data set, we will look at these relationships separately.
Figure 2.2.2.a shows the correlations between the numerical variables. Some pairs of predictors appear to be highly correlated with each other, which is undesirable: such redundant variables can lead to overfitting in our regression. On the other hand, no predictor appears to be highly correlated with the response variable, which is not good for our regression either: these poor correlations between predictors and response lead to poor prediction (a small value of R²).
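These correlations can be computed directly in R. A minimal sketch, assuming the data are loaded in a data frame named ferritin0 (the name appearing in the Shapiro output above) and that the GGally package is available:

# Sketch: correlations among the numerical variables (assumes the data
# frame is named ferritin0, as in the Shapiro test output above)
library(GGally)   # for the ggpairs scatterplot matrix

num_vars <- c("RCC", "WCC", "Hc", "Hg", "Ferr", "BMI",
              "SSF", "X.Bfat", "LBM", "Ht", "Wt")

# Plain correlation matrix, rounded for readability
round(cor(ferritin0[, num_vars]), 3)

# Pairwise scatterplot matrix with correlations, as in Figure 2.2.2.a
ggpairs(ferritin0[, num_vars])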
[Figure: pairwise scatterplot matrix of the numerical variables RCC, WCC, Hc, Hg, Ferr, BMI, SSF, X.Bfat, LBM, Ht and Wt, with pairwise correlations ranging roughly from −0.53 to 0.96]
Figure 2.2.2.a Correlations between numerical variables
Figure 2.2.2.b shows the relationship between the response variable and the categorical predictors.
[Figure: pairwise plots of Sex, Sport and Ferr: category counts and the distribution of Ferr within each level of the categorical predictors]
Figure 2.2.2.b Response variable and categorical variables
3 Construction of Multilinear Models
We consider the model Y = Xβ + ε, where X is an n × (p + k + 1) matrix and:

• Y is the response variable

• X is the matrix of predictors

• n is the number of observations (202)

• p is the number of numerical predictors (10)

• k is the number of dummy variables (10: 1 from the variable Sex and 9 from the variable Sport)

The vector β = (β₀, β₁, β₂, . . . , β_{p+k})ᵀ collects the parameters, and the error term ε satisfies E(ε) = 0 and V(ε) = σ².
3.1 Some Trivial Multilinear models
We try some trivial models, as follows:

• null0 : the model using only the intercept to predict the response variable Ferr.

• full0 : the model using all independent variables to predict the response variable.

• null1 : the intercept-only model, after applying a logarithmic transformation to the response variable Ferr.

• full1 : the model using all predictors, after applying a logarithmic transformation to the response variable Ferr.

• null : the intercept-only model with the log-transformed response, after removing one data point which disturbed the normality of the residuals.

• full : the model with all predictors and the log-transformed response, after removing that one data point (outlier).
Comment: null1 and full1 are better models than null0 and full0, since the log transformation increases the adjusted R² of the previous models. Furthermore, after removing one data point from our sample, we obtain the still better models null and full, whose residuals are closer to normal, as shown by the Shapiro test in Table 3.1.a and in Figure 3.1.a. A sketch of how these models can be fitted follows.
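A minimal sketch of the six models, assuming the data frame ferritin0 from above; the index of the removed outlier, here called out, is an assumption taken from the diagnostic plots:

# Sketch of the six trivial models; `out` (index of the removed outlier)
# is an assumption, suggested by the diagnostic plots
null0 <- lm(Ferr ~ 1, data = ferritin0)        # intercept only
full0 <- lm(Ferr ~ ., data = ferritin0)        # all predictors

null1 <- lm(log(Ferr) ~ 1, data = ferritin0)   # log-transformed response
full1 <- lm(log(Ferr) ~ ., data = ferritin0)

ferritin <- ferritin0[-out, ]                  # remove the disturbing point
null <- lm(log(Ferr) ~ 1, data = ferritin)
full <- lm(log(Ferr) ~ ., data = ferritin)

shapiro.test(full$residuals)                   # Table 3.1.a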
[Figure: diagnostic plots of the full model: Residuals vs Fitted, Normal Q–Q, Cook's distance and Residuals vs Leverage; observations 35, 101, 144 and 181 are flagged]
Figure 3.1.a Diagnostic plots of the full model
Shapiro-Wilk normality test
data: full$residuals
W = 0.98901, p-value = 0.1344
Table 3.1.a Shapiro Test of residual full model
3.2 Exhaustive search
This algorithm considers all possible models with different numbers of predictors. By looking at the values of the Residual Sum of Squares (RSS), adjusted R², Cp and BIC in Figure 3.2.a, we can identify the approximate number of variables which provides good values of those criteria. In other words, we look for the subset size (number of variables) which at the
same time gives small values of RSS, Cp and BIC, and a high adjusted R². We will consider such a model a good one; a sketch of the search is given below.
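A sketch of this search using the leaps package, with the same log-transformed response and cleaned data frame as the full model above:

# Sketch: exhaustive subset search with the leaps package
library(leaps)

exh <- regsubsets(log(Ferr) ~ ., data = ferritin, nvmax = 15)
s   <- summary(exh)

# Criteria per subset size, as plotted in Figure 3.2.a
par(mfrow = c(2, 2))
plot(s$rss,   type = "b", xlab = "subset size", ylab = "RSS")
plot(s$adjr2, type = "b", xlab = "subset size", ylab = "Adjusted R2")
plot(s$cp,    type = "b", xlab = "subset size", ylab = "Mallows' Cp")
plot(s$bic,   type = "b", xlab = "subset size", ylab = "BIC")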
[Figure: exhaustive search criteria versus subset size (0 to 15): RSS, adjusted R², Mallows' Cp and BIC]
Figure 3.2.a Exhaustive search and criteria
3.3 Construction by stepwise (Forward/Backward)

This algorithm explores models in two ways: starting from the null model (intercept only) and adding variables to build more complex models, called Forward regression; or starting from the full model (all variables) and removing variables to reach simpler models, called Backward regression. Here we use stepwise regression in both directions, and we evaluate each model using the two criteria AIC and BIC. We name the two models with the best AIC and BIC values step.AIC and step.BIC respectively; a sketch follows.
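A sketch of both searches with R's step() function, reusing the null and full models defined in Section 3.1:

# Sketch: stepwise search in both directions. AIC uses the default penalty
# k = 2; BIC is obtained by setting k = log(n).
n <- nrow(ferritin)

step.AIC <- step(null, scope = formula(full), direction = "both",
                 trace = 0)               # AIC: k = 2 (default)
step.BIC <- step(null, scope = formula(full), direction = "both",
                 trace = 0, k = log(n))   # BIC: k = log(n)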
[Figure: diagnostic plots (Residuals vs Fitted, Normal Q–Q, Cook's distance) for step.AIC and step.BIC; observations 35, 76, 96, 101, 144 and 181 are flagged]
Figure 3.3.a Residuals of step.AIC & step.BIC
Shapiro-Wilk normality test
data: step.AIC$residuals
W = 0.99047, p-value = 0.2184
Shapiro-Wilk normality test
data: step.BIC$residuals
W = 0.99167, p-value = 0.3194
Table 3.3.a Shapiro test of step.AIC & step.BIC
We check the diagnostics of these two models, and they appear to be good. The residuals look symmetric in Figure 3.3.a, and the Shapiro tests of normality in Table 3.3.a show that we cannot reject the normality of the residuals, since the p-values are both higher than 20%.
4 Penalised Regression
In this part we are interested in the regularised methods Ridge and LASSO, used to constrain the variance of our estimator and eventually improve the prediction error. We trade off the Residual Sum of Squares (RSS) against the variance of the estimator by restricting the estimator to a limited region:
β̂ = argmin_β ‖Y − Xβ‖₂² + λΩ(β)
4.1 Ridge Regression
Ridge regression constrains the variance of the estimator through the ‖·‖₂ norm. We evaluate each model by cross-validation, using the minimum rule and the one-standard-error rule (the most penalised model within one standard error of the model with the smallest error). We name the corresponding models ridge.min and ridge.1se.
β̂_ridge = argmin_β ‖Y − Xβ‖₂² + λ‖β‖₂²
We can plot the regularisation path (one path per predictor) as a function of λ, as shown in Figure 4.1.a. Figure 4.1.b shows the 10-fold cross-validation of the Ridge regression; a sketch of the computation follows.
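A sketch of the ridge fit with the glmnet package; glmnet needs a numeric design matrix, so the categorical predictors are expanded into dummies with model.matrix:

# Sketch: ridge regression with 10-fold cross-validation (glmnet)
library(glmnet)

x <- model.matrix(log(Ferr) ~ ., data = ferritin)[, -1]  # drop the intercept
y <- log(ferritin$Ferr)

cv.ridge <- cv.glmnet(x, y, alpha = 0, nfolds = 10)      # alpha = 0 <=> ridge

ridge.min <- glmnet(x, y, alpha = 0, lambda = cv.ridge$lambda.min)
ridge.1se <- glmnet(x, y, alpha = 0, lambda = cv.ridge$lambda.1se)

plot(cv.ridge)   # cross-validation curve, as in Figure 4.1.b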
[Figure: ridge regularisation path: coefficients versus log(λ); all 19 predictors stay in the model along the whole path]
Figure 4.1.a Coefficients of the Ridge regression
[Figure: 10-fold cross-validation for ridge: mean squared error versus log(λ), with 19 active variables at every λ]
Figure 4.1.b Cross validation of Ridge regression
4.2 Lasso Regression
4.2.1 Lasso.min & Lasso.1se
It is similar to Ridge regression; the difference is that LASSO constrains the estimator through the ‖·‖₁ norm. Because of the geometry of this norm, some coefficients are set exactly to 0, and the remaining variables are considered significant for our model. We name lasso.min and lasso.1se the models obtained by the minimum and one-standard-error rules respectively (see the sketch after the formula below).
β̂_lasso = argmin_β ‖Y − Xβ‖₂² + λ‖β‖₁
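The same pipeline as for ridge, with alpha = 1, gives the lasso models. A sketch, reusing x and y from the ridge sketch above:

# Sketch: lasso with 10-fold cross-validation; the L1 penalty sets some
# coefficients exactly to zero
cv.lasso <- cv.glmnet(x, y, alpha = 1, nfolds = 10)

lasso.min <- glmnet(x, y, alpha = 1, lambda = cv.lasso$lambda.min)
lasso.1se <- glmnet(x, y, alpha = 1, lambda = cv.lasso$lambda.1se)

# Variables kept by lasso.1se (nonzero coefficients)
b <- coef(lasso.1se)
rownames(b)[which(b != 0)]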
[Figure: lasso regularisation path: coefficients versus log(λ); the number of active variables decreases from 19 to 2 as λ grows]
Figure 4.2.1.a Coefficients of the Lasso regression
[Figure: 10-fold cross-validation for lasso: mean squared error versus log(λ); the number of active variables shrinks from 19 to 1 along the path]
Figure 4.2.1.b Cross validation of Lasso regression
4.2.2 Lasso.BIC & Lasso.mBIC
Figure 4.2.2.a shows the values of four different criteria along the lasso path. We pick two more lasso models using the following two criteria:

BIC = n log(err_D) + log(n) df(λ)

mBIC = n log(err_D) + (log(n) + 2 log(p)) df(λ)

We name the corresponding models lasso.BIC and lasso.mBIC; a sketch of this selection follows.
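A sketch of this selection along the lasso path, reusing x and y from above; taking df(λ) as the number of nonzero coefficients and err_D as the mean squared residual is an assumption consistent with the formulas:

# Sketch: BIC and mBIC along the lasso path; df(lambda) is the number of
# nonzero coefficients, errD the mean squared residual
lasso.path <- glmnet(x, y, alpha = 1)
n  <- length(y)
p  <- ncol(x)
df <- lasso.path$df                                   # nonzero coefs per lambda
errD <- colMeans((y - predict(lasso.path, newx = x))^2)

BIC  <- n * log(errD) + log(n) * df
mBIC <- n * log(errD) + (log(n) + 2 * log(p)) * df

lambda.BIC  <- lasso.path$lambda[which.min(BIC)]      # -> lasso.BIC
lambda.mBIC <- lasso.path$lambda[which.min(mBIC)]     # -> lasso.mBIC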
[Figure: AIC, BIC, eBIC and mBIC criteria (value versus λ, log scale) along the lasso path]
Figure 4.2.2.a Standard criteria of Lasso regression
4.3 Residuals of Ridge & Lasso regression
Their residuals appear to be well behaved and symmetric, as shown in Figure 4.3.a. The Shapiro tests in Table 4.3.a show that we cannot reject the normality of those residuals, since the p-values are all high.
[Figure: residuals versus fitted values for ridge.min, ridge.1se, lasso.min, lasso.1se, lasso.BIC and lasso.mBIC]
Figure 4.3.a Residuals of Ridge & Lasso regression
Normality tests of the residuals:
Shapiro-Wilk normality test
data: err_ridge.min
W = 0.97618, p-value = 0.4043
Shapiro-Wilk normality test
data: err_ridge.1se
W = 0.98264, p-value = 0.6673
Shapiro-Wilk normality test
data: err_lasso.min
W = 0.97662, p-value = 0.4197
Shapiro-Wilk normality test
data: err_lasso.1se
W = 0.96595, p-value = 0.1577
Shapiro-Wilk normality test
data: err_lasso.BIC
W = 0.97046, p-value = 0.2416
Shapiro-Wilk normality test
data: err_lasso.mBIC
W = 0.96554, p-value = 0.1515
Table 4.3.a Shapiro tests of Ridge & Lasso regression
Overall, the residuals are symmetric, and their normality is not rejected at high significance levels.
5 Evaluation of obtained models
We reran the program 100 times (one round is sketched below) and observed the following results.
• Ridge.min : the model with the smallest error among all models (more than 80 times out of 100), but it always kept around nineteen variables.

• Lasso.min : the second-best model in terms of error (more than 60 times out of 100), with between 13 and 16 variables.

• Lasso.BIC : the best of the remaining models (more than 80 times out of 100). It has a small error and between 2 and 5 variables, which is a good number.
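One round of this evaluation might look as follows; a sketch, in which the 70/30 train/test split is an assumption and only the lasso.min error is computed:

# Sketch of one evaluation round: hold out a test set, fit on the rest,
# record the mean squared prediction error (here for lasso.min only)
test  <- sample(nrow(ferritin), round(0.3 * nrow(ferritin)))
train <- setdiff(seq_len(nrow(ferritin)), test)

cv   <- cv.glmnet(x[train, ], y[train], alpha = 1)
pred <- predict(cv, newx = x[test, ], s = "lambda.min")
mean((y[test] - pred)^2)   # prediction error for lasso.min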
Table 5.a shows a particular case of all the obtained models.
model lambda error num_variables
1 null NA 0.6000601 1
2 full NA 0.4870134 21
3 new.full NA 0.4902630 16
4 step.AIC NA 0.4817147 15
5 step.BIC NA 0.5083180 4
6 ridge.min 0.0384273960205227 0.2137813 19
7 ridge.1se 0.326538118014613 0.2102107 19
8 lasso.min 0.0162518790801139 0.2325777 12
9 lasso.1se 0.104468267020236 0.2239882 3
10 lasso.BIC 0.0597805832624959 0.2133363 4
11 lasso.mBIC 0.114653794087908 0.2242685 2
Table 5.a Evaluation table
6 Elastic Net
Elastic net is a hybrid approach that blends the L2 and L1 penalties; Ridge and LASSO are particular cases of the Elastic net.
β̂_elas = argmin_β ‖Y − Xβ‖₂² + λ((1 − α)‖β‖₂² + α‖β‖₁)
• If α = 0⇔ ridge regression.
• If α = 1⇔ LASSO regression.
• If 0 < α < 1⇔ Elastic net.
We consider different values of α ∈ {0.05, 0.10, . . . , 0.95}; a sketch of this grid search is given below, and Table 6.a shows a particular run of the Elastic net we obtained.
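A sketch of the grid search over α, reusing x and y from above; for each α it records the λ, the cross-validation error and the number of selected variables under both rules:

# Sketch: elastic net over a grid of alpha values
alphas <- seq(0.05, 0.95, by = 0.05)

enet <- t(sapply(alphas, function(a) {
  cv <- cv.glmnet(x, y, alpha = a, nfolds = 10)
  i.min <- which(cv$lambda == cv$lambda.min)
  i.1se <- which(cv$lambda == cv$lambda.1se)
  c(alpha      = a,
    lambda.min = cv$lambda.min, error.min = cv$cvm[i.min],
    var.min    = unname(cv$nzero[i.min]),
    lambda.1se = cv$lambda.1se, error.1se = cv$cvm[i.1se],
    var.1se    = unname(cv$nzero[i.1se]))
}))
enet   # one run gives a table like Table 6.a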
alpha lambda.min error.min var.min lambda.1se error.1se var.1se
1 0.05 0.015086908 0.2154716 17 0.32503758 0.2153363 12
2 0.10 0.012011312 0.2172676 18 0.25877587 0.2138133 9
3 0.15 0.018498459 0.2189751 18 0.33087260 0.2143673 8
4 0.20 0.015226527 0.2192108 17 0.29890292 0.2156557 7
5 0.25 0.009214647 0.2212308 16 0.31610565 0.2145268 6
6 0.30 0.017739192 0.2207122 17 0.21869705 0.2131368 5
7 0.35 0.016687493 0.2211267 16 0.15562812 0.2137326 5
8 0.40 0.010064286 0.2218687 16 0.28663449 0.2166657 4
9 0.45 0.002929422 0.2224653 17 0.16001316 0.2184666 5
10 0.50 0.039150892 0.2232277 17 0.19037519 0.2207121 4
11 0.55 0.018557552 0.2234966 11 0.14368439 0.2184268 4
12 0.60 0.001514353 0.2250908 17 0.17411378 0.2150205 2
13 0.65 0.025002891 0.2276682 14 0.14644245 0.2185514 3
14 0.70 0.025480596 0.2237563 16 0.12390199 0.2233571 4
15 0.75 0.010294612 0.2249556 12 0.07970744 0.2234584 3
16 0.80 0.011624903 0.2267506 12 0.09878302 0.2169137 2
17 0.85 0.020984020 0.2278238 9 0.09297225 0.2236783 2
18 0.90 0.023871142 0.2260372 15 0.10576399 0.2211997 3
19 0.95 0.009789392 0.2300646 12 0.10019747 0.2191753 3
Table 6.a Particular case of Elastic net
7 Conclusion
As we saw during the analysis, the obtained models with fewer predictors (4 or 5) are lasso.min, lasso.1se, lasso.BIC and lasso.mBIC. Moreover, the predictors retained by these models are mostly Sex Male, Sport Tennis, Sport WPolo and BMI. Some of these predictors, such as Sport Tennis and Sport WPolo, are not reasonable at all for predicting the Ferritin level. This is a consequence of the poor correlation between the predictors and the response variable, as well as of the strong correlations among some predictors. It therefore seems that these predictors do not carry enough significant information to predict our response variable Ferr.