Regularization for the Linear Models and Extensions
R-Project : Ferritin Data Set
Students' Names:
Chantrea SAM
Sothea HAS
December 11, 2016
Contents
1 Introduction
2 Data description and descriptive analysis
  2.1 Data description
  2.2 Descriptive analysis
    2.2.1 Response Variable: Ferritin (Ferr)
    2.2.2 Predictors and correlations
3 Construction of Multilinear Models
  3.1 Some Trivial Multilinear models
  3.2 Exhaustive search
  3.3 Construction by stepwise (Forward/Backward)
4 Penalised Regression
  4.1 Ridge Regression
  4.2 Lasso Regression
    4.2.1 Lasso.min & Lasso.1se
    4.2.2 Lasso.BIC & Lasso.mBIC
  4.3 Residuals of Ridge & Lasso regression
5 Evaluation of obtained models
6 Elastic Net
7 Conclusion
1 Introduction
Ferritin is the major iron storage protein of the body, and its level can be used to measure the iron level in the body indirectly. Its normal value range is:
• Male: 12 to 300 nanograms per milliliter (ng/mL)
• Female: 12 to 150 ng/mL
Objective
In this project, our goal is to build an appropriate model for predicting the Ferritin level in the body of individuals, taking the Ferr variable of the Ferritin data set as the response variable, using only significant predictors and reducing redundant variables as much as possible. We will do this using all the methods and algorithms seen during the course, together with some related methods.
2 Data description and descriptive analysis
2.1 Data description
This data set was collected at the Australian National Sport Institute, recording the Ferritin concentration and various covariates for 102 men and 100 women. The data set therefore consists of 202 rows and 13 columns, corresponding to 13 variables:

• Sex : male or female
• Sport : kind of sport practised
• RCC : Red Cell Count
• WCC : White Cell Count
• Hc : Hematocrit
• Hg : Hemoglobin
• Ferr : Plasma ferritin concentration (ng/mL)
• BMI : Body Mass Index = weight / height²
• SSF : Sum of Skin Folds
• X.Bfat : Body Fat (%)
• LBM : Lean Body Mass
• Ht : Height (cm)
• Wt : Weight (kg)
Sex Sport RCC WCC Hc Hg Ferr BMI SSF X.Bfat LBM Ht Wt
1 female BBall 3.96 7.5 37.5 12.3 60 20.56 109.1 19.75 63.32 195.9 78.9
2 female BBall 4.41 8.3 38.2 12.7 68 20.67 102.8 21.30 58.55 189.7 74.4
3 female BBall 4.14 5.0 36.4 11.6 21 21.86 104.6 19.88 55.36 177.8 69.1
4 female BBall 4.11 5.3 37.3 12.6 69 21.88 126.4 23.66 57.18 185.0 74.9
5 female BBall 4.45 6.8 41.5 14.0 29 18.96 80.3 17.64 53.20 184.6 64.6
6 female BBall 4.10 4.4 37.4 12.5 42 21.04 75.2 15.58 53.77 174.0 63.7
2.2 Descriptive analysis
2.2.1 Response Variable : Ferritin (Ferr)
In this data set, the response variable Ferr takes values between 8.00 ng/mL and 234.00 ng/mL, with an average value of 76.88 ng/mL, a variance of 2256.368, and the other summary statistics shown in Table 2.2.1.a.
Min. 1st Qu. Median Mean 3rd Qu. Max.
8.00 41.25 65.50 76.88 97.00 234.00
Table 2.2.1.a Ferritin summary
This variable does not appear to follow a Gaussian distribution, as is immediately visible from its distribution in Figure 2.2.1.a. The Shapiro test of normality in Table 2.2.1.b supports this claim: the p-value of 5.265e−11 is very small, so we can reject the normality of this variable.
Moreover, the boxplot in Figure 2.2.1.a shows some unusual data points, which pose another problem for our regression. We will examine the effect of those outliers on our model later.
[Figure: three panels describing the distribution of Ferritin: a histogram (density scale), a boxplot, and the repartition function Fn(x), over the range 0 to 200 ng/mL]
Figure 2.2.1.a Distribution of Ferritin
Shapiro-Wilk normality test
data: ferritin0$Ferr
W = 0.89008, p-value = 5.265e-11
Table 2.2.1.b Shapiro Test of Ferritin
2.2.2 Predictors and correlations
There are 12 predictors in our data set, 2 of which are categorical variables. In this part, we will focus on two important kinds of correlations: one between the independent variables themselves, and another between the independent variables and the
response variable. Since there are two kinds of predictors in our data set, we will look at these relationships separately.
Figure 2.2.2.a shows the correlations between the numerical variables. Some pairs of predictors appear to be highly correlated with each other, which is undesirable: such redundant variables can lead to overfitting in our regression. On the other hand, no predictor appears to be highly correlated with the response variable, which is not good for our regression either: these poor correlations between predictors and response lead to poor prediction (a small value of R²).
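These correlations can be computed directly in R. A minimal sketch, assuming the data are loaded in a data frame named ferritin0 (the name appearing in the Shapiro output above) and that the GGally package is available:

# Sketch: correlations among the numerical variables (assumes the data
# frame is named ferritin0, as in the Shapiro test output above)
library(GGally)   # for the ggpairs scatterplot matrix

num_vars <- c("RCC", "WCC", "Hc", "Hg", "Ferr", "BMI",
              "SSF", "X.Bfat", "LBM", "Ht", "Wt")

# Plain correlation matrix, rounded for readability
round(cor(ferritin0[, num_vars]), 3)

# Pairwise scatterplot matrix with correlations, as in Figure 2.2.2.a
ggpairs(ferritin0[, num_vars])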
[Figure: pairwise scatterplot matrix of the numerical variables RCC, WCC, Hc, Hg, Ferr, BMI, SSF, X.Bfat, LBM, Ht and Wt, with pairwise correlations ranging roughly from −0.53 to 0.96]
Figure 2.2.2.a Correlations between numerical variables
Figure 2.2.2.b shows the relationship between the response variable and the categorical predictors.
[Figure: pairwise plots of Sex, Sport and Ferr: category counts and the distribution of Ferr within each level of the categorical predictors]
Figure 2.2.2.b Response variable and categorical variables
3 Construction of Multilinear Models
We consider the model Y = Xβ + ε, where X is an n × (p + k + 1) matrix and:

• Y is the response variable

• X is the matrix of predictors

• n is the number of observations (202)

• p is the number of numerical predictors (10)

• k is the number of dummy variables (10: 1 from the variable Sex and 9 from the variable Sport)

The vector β = (β₀, β₁, β₂, . . . , β_{p+k})ᵀ collects the parameters, and the error term ε satisfies E(ε) = 0 and V(ε) = σ².
3.1 Some Trivial Multilinear models
We try some trivial models, as follows:

• null0 : the model using only the intercept to predict the response variable Ferr.

• full0 : the model using all independent variables to predict the response variable.

• null1 : the intercept-only model, after applying a logarithmic transformation to the response variable Ferr.

• full1 : the model using all predictors, after applying a logarithmic transformation to the response variable Ferr.

• null : the intercept-only model with the log-transformed response, after removing one data point which disturbed the normality of the residuals.

• full : the model with all predictors and the log-transformed response, after removing that one data point (outlier).
Comment: null1 and full1 are better models than null0 and full0, since the log transformation increases the adjusted R² of the previous models. Furthermore, after removing one data point from our sample, we obtain the still better models null and full, whose residuals are closer to normal, as shown by the Shapiro test in Table 3.1.a and in Figure 3.1.a. A sketch of how these models can be fitted follows.
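A minimal sketch of the six models, assuming the data frame ferritin0 from above; the index of the removed outlier, here called out, is an assumption taken from the diagnostic plots:

# Sketch of the six trivial models; `out` (index of the removed outlier)
# is an assumption, suggested by the diagnostic plots
null0 <- lm(Ferr ~ 1, data = ferritin0)        # intercept only
full0 <- lm(Ferr ~ ., data = ferritin0)        # all predictors

null1 <- lm(log(Ferr) ~ 1, data = ferritin0)   # log-transformed response
full1 <- lm(log(Ferr) ~ ., data = ferritin0)

ferritin <- ferritin0[-out, ]                  # remove the disturbing point
null <- lm(log(Ferr) ~ 1, data = ferritin)
full <- lm(log(Ferr) ~ ., data = ferritin)

shapiro.test(full$residuals)                   # Table 3.1.a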
[Figure: diagnostic plots of the full model: Residuals vs Fitted, Normal Q–Q, Cook's distance and Residuals vs Leverage; observations 35, 101, 144 and 181 are flagged]
Figure 3.1.a Diagnostic plots of the full model
Shapiro-Wilk normality test
data: full$residuals
W = 0.98901, p-value = 0.1344
Table 3.1.a Shapiro Test of residual full model
3.2 Exhaustive search
This algorithm considers all possible models with different numbers of predictors. By looking at the values of the Residual Sum of Squares (RSS), adjusted R², Cp and BIC in Figure 3.2.a, we can identify the approximate number of variables which provides good values of those criteria. In other words, we look for the subset size (number of variables) which at the
same time gives small values of RSS, Cp and BIC, and a high adjusted R². We will consider such a model a good one; a sketch of the search is given below.
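A sketch of this search using the leaps package, with the same log-transformed response and cleaned data frame as the full model above:

# Sketch: exhaustive subset search with the leaps package
library(leaps)

exh <- regsubsets(log(Ferr) ~ ., data = ferritin, nvmax = 15)
s   <- summary(exh)

# Criteria per subset size, as plotted in Figure 3.2.a
par(mfrow = c(2, 2))
plot(s$rss,   type = "b", xlab = "subset size", ylab = "RSS")
plot(s$adjr2, type = "b", xlab = "subset size", ylab = "Adjusted R2")
plot(s$cp,    type = "b", xlab = "subset size", ylab = "Mallows' Cp")
plot(s$bic,   type = "b", xlab = "subset size", ylab = "BIC")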
[Figure: exhaustive search criteria versus subset size (0 to 15): RSS, adjusted R², Mallows' Cp and BIC]
Figure 3.2.a Exhaustive search and criteria
3.3 Construction by stepwise (Forward/Backward)

This algorithm explores models in two ways: starting from the null model (intercept only) and adding variables to build more complex models, called Forward regression; or starting from the full model (all variables) and removing variables to reach simpler models, called Backward regression. Here we use stepwise regression in both directions, and we evaluate each model using the two criteria AIC and BIC. We name the two models with the best AIC and BIC values step.AIC and step.BIC respectively; a sketch follows.
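A sketch of both searches with R's step() function, reusing the null and full models defined in Section 3.1:

# Sketch: stepwise search in both directions. AIC uses the default penalty
# k = 2; BIC is obtained by setting k = log(n).
n <- nrow(ferritin)

step.AIC <- step(null, scope = formula(full), direction = "both",
                 trace = 0)               # AIC: k = 2 (default)
step.BIC <- step(null, scope = formula(full), direction = "both",
                 trace = 0, k = log(n))   # BIC: k = log(n)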
[Figure: diagnostic plots (Residuals vs Fitted, Normal Q–Q, Cook's distance) for step.AIC and step.BIC; observations 35, 76, 96, 101, 144 and 181 are flagged]
Figure 3.3.a Residuals of step.AIC & step.BIC
Shapiro-Wilk normality test
data: step.AIC$residuals
W = 0.99047, p-value = 0.2184
Shapiro-Wilk normality test
data: step.BIC$residuals
W = 0.99167, p-value = 0.3194
Table 3.3.a Shapiro test of step.AIC & step.BIC
We check the diagnostics of these two models, and they appear to be good. The residuals look symmetric in Figure 3.3.a, and the Shapiro tests of normality in Table 3.3.a show that we cannot reject the normality of the residuals, since the p-values are both higher than 20%.
4 Penalised Regression
In this part we are interested in the regularised methods Ridge and LASSO, used to constrain the variance of our estimator and eventually improve the prediction error. We trade off the Residual Sum of Squares (RSS) against the variance of the estimator by restricting the estimator to a limited region:
β̂ = argmin_β ‖Y − Xβ‖₂² + λΩ(β)
4.1 Ridge Regression
Ridge regression constrains the variance of the estimator through the ‖·‖₂ norm. We evaluate each model by cross-validation, using the minimum rule and the one-standard-error rule (the most penalised model within one standard error of the model with the smallest error). We name the corresponding models ridge.min and ridge.1se.
β̂_ridge = argmin_β ‖Y − Xβ‖₂² + λ‖β‖₂²
We can plot the regularisation path (one path per predictor) as a function of λ, as shown in Figure 4.1.a. Figure 4.1.b shows the 10-fold cross-validation of the Ridge regression; a sketch of the computation follows.
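A sketch of the ridge fit with the glmnet package; glmnet needs a numeric design matrix, so the categorical predictors are expanded into dummies with model.matrix:

# Sketch: ridge regression with 10-fold cross-validation (glmnet)
library(glmnet)

x <- model.matrix(log(Ferr) ~ ., data = ferritin)[, -1]  # drop the intercept
y <- log(ferritin$Ferr)

cv.ridge <- cv.glmnet(x, y, alpha = 0, nfolds = 10)      # alpha = 0 <=> ridge

ridge.min <- glmnet(x, y, alpha = 0, lambda = cv.ridge$lambda.min)
ridge.1se <- glmnet(x, y, alpha = 0, lambda = cv.ridge$lambda.1se)

plot(cv.ridge)   # cross-validation curve, as in Figure 4.1.b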
[Figure: ridge regularisation path: coefficients versus log(λ); all 19 predictors stay in the model along the whole path]
Figure 4.1.a Coefficients of the Ridge regression
[Figure: 10-fold cross-validation for ridge: mean squared error versus log(λ), with 19 active variables at every λ]
Figure 4.1.b Cross validation of Ridge regression
4.2 Lasso Regression
4.2.1 Lasso.min & Lasso.1se
It is similar to Ridge regression; the difference is that LASSO constrains the estimator through the ‖·‖₁ norm. Because of the geometry of this norm, some coefficients are set exactly to 0, and the remaining variables are considered significant for our model. We name lasso.min and lasso.1se the models obtained by the minimum and one-standard-error rules respectively (see the sketch after the formula below).
β̂_lasso = argmin_β ‖Y − Xβ‖₂² + λ‖β‖₁
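The same pipeline as for ridge, with alpha = 1, gives the lasso models. A sketch, reusing x and y from the ridge sketch above:

# Sketch: lasso with 10-fold cross-validation; the L1 penalty sets some
# coefficients exactly to zero
cv.lasso <- cv.glmnet(x, y, alpha = 1, nfolds = 10)

lasso.min <- glmnet(x, y, alpha = 1, lambda = cv.lasso$lambda.min)
lasso.1se <- glmnet(x, y, alpha = 1, lambda = cv.lasso$lambda.1se)

# Variables kept by lasso.1se (nonzero coefficients)
b <- coef(lasso.1se)
rownames(b)[which(b != 0)]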
[Figure: lasso regularisation path: coefficients versus log(λ); the number of active variables decreases from 19 to 2 as λ grows]
Figure 4.2.1.a Coefficients of the Lasso regression
[Figure: 10-fold cross-validation for lasso: mean squared error versus log(λ); the number of active variables shrinks from 19 to 1 along the path]
Figure 4.2.1.b Cross validation of Lasso regression
4.2.2 Lasso.BIC & Lasso.mBIC
Figure 4.2.2.a shows the values of four different criteria along the lasso path. We pick two more lasso models using the following two criteria:

BIC = n log(err_D) + log(n) df(λ)

mBIC = n log(err_D) + (log(n) + 2 log(p)) df(λ)

We name the corresponding models lasso.BIC and lasso.mBIC; a sketch of this selection follows.
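A sketch of this selection along the lasso path, reusing x and y from above; taking df(λ) as the number of nonzero coefficients and err_D as the mean squared residual is an assumption consistent with the formulas:

# Sketch: BIC and mBIC along the lasso path; df(lambda) is the number of
# nonzero coefficients, errD the mean squared residual
lasso.path <- glmnet(x, y, alpha = 1)
n  <- length(y)
p  <- ncol(x)
df <- lasso.path$df                                   # nonzero coefs per lambda
errD <- colMeans((y - predict(lasso.path, newx = x))^2)

BIC  <- n * log(errD) + log(n) * df
mBIC <- n * log(errD) + (log(n) + 2 * log(p)) * df

lambda.BIC  <- lasso.path$lambda[which.min(BIC)]      # -> lasso.BIC
lambda.mBIC <- lasso.path$lambda[which.min(mBIC)]     # -> lasso.mBIC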
[Figure: AIC, BIC, eBIC and mBIC criteria (value versus λ, log scale) along the lasso path]
Figure 4.2.2.a Standard criteria of Lasso regression
4.3 Residuals of Ridge & Lasso regression
Their residuals appear to be well behaved and symmetric, as shown in Figure 4.3.a. The Shapiro tests in Table 4.3.a show that we cannot reject the normality of those residuals, since the p-values are all high.
[Figure: residuals versus fitted values for ridge.min, ridge.1se, lasso.min, lasso.1se, lasso.BIC and lasso.mBIC]
Figure 4.3.a Residuals of Ridge & Lasso regression
Normality tests of the residuals:
Shapiro-Wilk normality test
data: err_ridge.min
W = 0.97618, p-value = 0.4043
Shapiro-Wilk normality test
data: err_ridge.1se
W = 0.98264, p-value = 0.6673
Shapiro-Wilk normality test
data: err_lasso.min
W = 0.97662, p-value = 0.4197
Shapiro-Wilk normality test
data: err_lasso.1se
W = 0.96595, p-value = 0.1577
Shapiro-Wilk normality test
data: err_lasso.BIC
W = 0.97046, p-value = 0.2416
Shapiro-Wilk normality test
data: err_lasso.mBIC
W = 0.96554, p-value = 0.1515
Table 4.3.a Shapiro tests of Ridge & Lasso regression
Overall, the residuals are symmetric, and their normality is not rejected at high significance levels.
5 Evaluation of obtained models
We reran the program 100 times (one round is sketched below) and observed the following results.
• Ridge.min : the model with the smallest error among all models (more than 80 times out of 100), but it always kept around nineteen variables.

• Lasso.min : the second-best model in terms of error (more than 60 times out of 100), with between 13 and 16 variables.

• Lasso.BIC : the best of the remaining models (more than 80 times out of 100). It has a small error and between 2 and 5 variables, which is a good number.
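One round of this evaluation might look as follows; a sketch, in which the 70/30 train/test split is an assumption and only the lasso.min error is computed:

# Sketch of one evaluation round: hold out a test set, fit on the rest,
# record the mean squared prediction error (here for lasso.min only)
test  <- sample(nrow(ferritin), round(0.3 * nrow(ferritin)))
train <- setdiff(seq_len(nrow(ferritin)), test)

cv   <- cv.glmnet(x[train, ], y[train], alpha = 1)
pred <- predict(cv, newx = x[test, ], s = "lambda.min")
mean((y[test] - pred)^2)   # prediction error for lasso.min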
Table 5.a shows a particular case of all the obtained models.
model lambda error num_variables
1 null NA 0.6000601 1
2 full NA 0.4870134 21
3 new.full NA 0.4902630 16
4 step.AIC NA 0.4817147 15
5 step.BIC NA 0.5083180 4
6 ridge.min 0.0384273960205227 0.2137813 19
7 ridge.1se 0.326538118014613 0.2102107 19
8 lasso.min 0.0162518790801139 0.2325777 12
9 lasso.1se 0.104468267020236 0.2239882 3
10 lasso.BIC 0.0597805832624959 0.2133363 4
11 lasso.mBIC 0.114653794087908 0.2242685 2
Table 5.a Evaluation table
6 Elastic Net
Elastic net is a hybrid approach that blends the L2 and L1 penalties; Ridge and LASSO are particular cases of the Elastic net.
β̂_elas = argmin_β ‖Y − Xβ‖₂² + λ((1 − α)‖β‖₂² + α‖β‖₁)
• If α = 0⇔ ridge regression.
• If α = 1⇔ LASSO regression.
• If 0 < α < 1⇔ Elastic net.
We consider different values of α ∈ {0.05, 0.10, . . . , 0.95}; a sketch of this grid search is given below, and Table 6.a shows a particular run of the Elastic net we obtained.
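A sketch of the grid search over α, reusing x and y from above; for each α it records the λ, the cross-validation error and the number of selected variables under both rules:

# Sketch: elastic net over a grid of alpha values
alphas <- seq(0.05, 0.95, by = 0.05)

enet <- t(sapply(alphas, function(a) {
  cv <- cv.glmnet(x, y, alpha = a, nfolds = 10)
  i.min <- which(cv$lambda == cv$lambda.min)
  i.1se <- which(cv$lambda == cv$lambda.1se)
  c(alpha      = a,
    lambda.min = cv$lambda.min, error.min = cv$cvm[i.min],
    var.min    = unname(cv$nzero[i.min]),
    lambda.1se = cv$lambda.1se, error.1se = cv$cvm[i.1se],
    var.1se    = unname(cv$nzero[i.1se]))
}))
enet   # one run gives a table like Table 6.a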
alpha lambda.min error.min var.min lambda.1se error.1se var.1se
1 0.05 0.015086908 0.2154716 17 0.32503758 0.2153363 12
2 0.10 0.012011312 0.2172676 18 0.25877587 0.2138133 9
3 0.15 0.018498459 0.2189751 18 0.33087260 0.2143673 8
4 0.20 0.015226527 0.2192108 17 0.29890292 0.2156557 7
5 0.25 0.009214647 0.2212308 16 0.31610565 0.2145268 6
6 0.30 0.017739192 0.2207122 17 0.21869705 0.2131368 5
7 0.35 0.016687493 0.2211267 16 0.15562812 0.2137326 5
8 0.40 0.010064286 0.2218687 16 0.28663449 0.2166657 4
9 0.45 0.002929422 0.2224653 17 0.16001316 0.2184666 5
10 0.50 0.039150892 0.2232277 17 0.19037519 0.2207121 4
11 0.55 0.018557552 0.2234966 11 0.14368439 0.2184268 4
12 0.60 0.001514353 0.2250908 17 0.17411378 0.2150205 2
13 0.65 0.025002891 0.2276682 14 0.14644245 0.2185514 3
14 0.70 0.025480596 0.2237563 16 0.12390199 0.2233571 4
15 0.75 0.010294612 0.2249556 12 0.07970744 0.2234584 3
16 0.80 0.011624903 0.2267506 12 0.09878302 0.2169137 2
17 0.85 0.020984020 0.2278238 9 0.09297225 0.2236783 2
18 0.90 0.023871142 0.2260372 15 0.10576399 0.2211997 3
19 0.95 0.009789392 0.2300646 12 0.10019747 0.2191753 3
Table 6.a Particular case of Elastic net
7 Conclusion
As we saw during the analysis, the obtained models with fewer predictors (4 or 5) are lasso.min, lasso.1se, lasso.BIC and lasso.mBIC. Moreover, the predictors retained by these models are mostly Sex Male, Sport Tennis, Sport WPolo and BMI. Some of these predictors, such as Sport Tennis and Sport WPolo, are not reasonable at all for predicting the Ferritin level. This is a consequence of the poor correlation between the predictors and the response variable, as well as of the strong correlations among some predictors. It therefore seems that these predictors do not carry enough significant information to predict our response variable Ferr.