Machine Learning & Data Science for Actuaries, with R
Arthur Charpentier (Université de Rennes 1 & UQÀM)
Universitat de Barcelona, April 2016.
http://freakonometrics.hypotheses.org
Professor, Economics Department, Univ. Rennes 1. In charge of the Data Science for Actuaries program, IA. Research Chair actinfo (Institut Louis Bachelier). Previously Actuarial Sciences at UQÀM & ENSAE Paristech; actuary in Hong Kong; IT & Stats, FFSA.
PhD in Statistics (KU Leuven), Fellow Institute of Actuaries. MSc in Financial Mathematics (Paris Dauphine) & ENSAE. Editor of the freakonometrics.hypotheses.org blog. Editor of Computational Actuarial Science, CRC.
Agenda
0. Introduction, see slides
1. Classification, $y\in\{0,1\}$
2. Regression Models, $y\in\mathbb{R}$
3. Model Choice, Feature Selection, etc
4. Data Visualisation & Maps
https://freakonometrics.hypotheses.org/48127
Part 1. Classification, $y\in\{0,1\}$
Classification?
Examples: fraud detection, automatic reading (classifying handwriting symbols), face recognition, accident occurrence, death, purchase of optional insurance cover, etc. Here $y_i\in\{0,1\}$, or $y_i\in\{-1,+1\}$, or $y_i\in\{\bullet,\bullet\}$.
We look for a (good) predictive model here. There will be two steps,
• the score function, $s(x)=\mathbb{P}(Y=1|X=x)\in[0,1]$
• the classification function $s(x)\mapsto \widehat{Y}\in\{0,1\}$.
Modeling a 0/1 random variable
Myocardial infarction of patients admitted in the E.R.:
• heart rate (FRCAR)
• heart index (INCAR)
• stroke index (INSYS)
• diastolic pressure (PRDIA)
• pulmonary arterial pressure (PAPUL)
• ventricular pressure (PVENT)
• lung resistance (REPUL)
• death or survival (PRONO)

> myocarde = read.table("http://freakonometrics.free.fr/myocarde.csv",
+   head=TRUE, sep=";")
Logistic Regression

Assume that $\mathbb{P}(Y_i=1)=\pi_i$, with
$$\text{logit}(\pi_i)=\boldsymbol{x}_i^{\mathsf T}\boldsymbol\beta,\ \text{where}\ \text{logit}(\pi_i)=\log\left(\frac{\pi_i}{1-\pi_i}\right),$$
or
$$\pi_i=\text{logit}^{-1}(\boldsymbol{x}_i^{\mathsf T}\boldsymbol\beta)=\frac{\exp[\boldsymbol{x}_i^{\mathsf T}\boldsymbol\beta]}{1+\exp[\boldsymbol{x}_i^{\mathsf T}\boldsymbol\beta]}.$$
The log-likelihood is
$$\log\mathcal{L}(\boldsymbol\beta)=\sum_{i=1}^{n}y_i\log(\pi_i)+(1-y_i)\log(1-\pi_i)=\sum_{i=1}^{n}y_i\log(\pi_i(\boldsymbol\beta))+(1-y_i)\log(1-\pi_i(\boldsymbol\beta))$$
and the first order conditions are solved numerically
$$\frac{\partial\log\mathcal{L}(\boldsymbol\beta)}{\partial\beta_k}=\sum_{i=1}^{n}x_{k,i}[y_i-\pi_i(\boldsymbol\beta)]=0.$$
Logistic Regression, Output (with R)

> logistic <- glm(PRONO~., data=myocarde, family=binomial)
> summary(logistic)

Coefficients:
              Estimate Std. Error z value Pr(>|z|)
(Intercept) -10.187642  11.895227  -0.856    0.392
FRCAR         0.138178   0.114112   1.211    0.226
INCAR        -5.862429   6.748785  -0.869    0.385
INSYS         0.717084   0.561445   1.277    0.202
PRDIA        -0.073668   0.291636  -0.253    0.801
PAPUL         0.016757   0.341942   0.049    0.961
PVENT        -0.106776   0.110550  -0.966    0.334
REPUL        -0.003154   0.004891  -0.645    0.519

(Dispersion parameter for binomial family taken to be 1)

Number of Fisher Scoring iterations: 7
Logistic Regression, Output (with R)

> library(VGAM)
> mlogistic <- vglm(PRONO~., data=myocarde, family=multinomial)
> summary(mlogistic)

Coefficients:
              Estimate Std. Error   z value
(Intercept) 10.1876411 11.8941581  0.856525
FRCAR       -0.1381781  0.1141056 -1.210967
INCAR        5.8624289  6.7484319  0.868710
INSYS       -0.7170840  0.5613961 -1.277323
PRDIA        0.0736682  0.2916276  0.252610
PAPUL       -0.0167565  0.3419255 -0.049006
PVENT        0.1067760  0.1105456  0.965901
REPUL        0.0031542  0.0048907  0.644939

Name of linear predictor: log(mu[,1]/mu[,2])
Logistic (Multinomial) Regression

In the Bernoulli case, $y\in\{0,1\}$,
$$\mathbb{P}(Y=1)=\frac{e^{\boldsymbol{x}^{\mathsf T}\boldsymbol\beta}}{1+e^{\boldsymbol{x}^{\mathsf T}\boldsymbol\beta}}=\frac{p_1}{p_0+p_1}\propto p_1\quad\text{and}\quad\mathbb{P}(Y=0)=\frac{1}{1+e^{\boldsymbol{x}^{\mathsf T}\boldsymbol\beta}}=\frac{p_0}{p_0+p_1}\propto p_0$$
In the multinomial case, $y\in\{A,B,C\}$,
$$\mathbb{P}(Y=A)=\frac{p_A}{p_A+p_B+p_C}\propto p_A,\ \text{i.e.}\ \mathbb{P}(Y=A)=\frac{e^{\boldsymbol{x}^{\mathsf T}\boldsymbol\beta_A}}{e^{\boldsymbol{x}^{\mathsf T}\boldsymbol\beta_A}+e^{\boldsymbol{x}^{\mathsf T}\boldsymbol\beta_B}+1}$$
$$\mathbb{P}(Y=B)=\frac{p_B}{p_A+p_B+p_C}\propto p_B,\ \text{i.e.}\ \mathbb{P}(Y=B)=\frac{e^{\boldsymbol{x}^{\mathsf T}\boldsymbol\beta_B}}{e^{\boldsymbol{x}^{\mathsf T}\boldsymbol\beta_A}+e^{\boldsymbol{x}^{\mathsf T}\boldsymbol\beta_B}+1}$$
$$\mathbb{P}(Y=C)=\frac{p_C}{p_A+p_B+p_C}\propto p_C,\ \text{i.e.}\ \mathbb{P}(Y=C)=\frac{1}{e^{\boldsymbol{x}^{\mathsf T}\boldsymbol\beta_A}+e^{\boldsymbol{x}^{\mathsf T}\boldsymbol\beta_B}+1}$$
Logistic Regression, Numerical Issues

The algorithm to compute $\widehat{\boldsymbol\beta}$ is
1. start with some initial value $\boldsymbol\beta_0$
2. define $\boldsymbol\beta_k=\boldsymbol\beta_{k-1}-H(\boldsymbol\beta_{k-1})^{-1}\nabla\log\mathcal{L}(\boldsymbol\beta_{k-1})$
where $\nabla\log\mathcal{L}(\boldsymbol\beta)$ is the gradient, and $H(\boldsymbol\beta)$ the Hessian matrix, also called Fisher's score.
The generic term of the Hessian is
$$\frac{\partial^2\log\mathcal{L}(\boldsymbol\beta)}{\partial\beta_k\partial\beta_\ell}=-\sum_{i=1}^{n}X_{k,i}X_{\ell,i}\,\pi_i(\boldsymbol\beta)[1-\pi_i(\boldsymbol\beta)]$$
Define $\Omega=[\omega_{i,j}]=\text{diag}(\widehat\pi_i(1-\widehat\pi_i))$ so that the gradient is written
$$\nabla\log\mathcal{L}(\boldsymbol\beta)=\frac{\partial\log\mathcal{L}(\boldsymbol\beta)}{\partial\boldsymbol\beta}=\boldsymbol{X}^{\mathsf T}(\boldsymbol{y}-\boldsymbol\pi)$$
Logistic Regression, Numerical Issues

and the Hessian
$$H(\boldsymbol\beta)=\frac{\partial^2\log\mathcal{L}(\boldsymbol\beta)}{\partial\boldsymbol\beta\,\partial\boldsymbol\beta^{\mathsf T}}=-\boldsymbol{X}^{\mathsf T}\Omega\boldsymbol{X}$$
The gradient descent algorithm is then
$$\boldsymbol\beta_k=(\boldsymbol{X}^{\mathsf T}\Omega\boldsymbol{X})^{-1}\boldsymbol{X}^{\mathsf T}\Omega\boldsymbol{Z}\quad\text{where}\quad\boldsymbol{Z}=\boldsymbol{X}\boldsymbol\beta_{k-1}+\Omega^{-1}(\boldsymbol{y}-\boldsymbol\pi),$$
From maximum likelihood properties, if $\widehat{\boldsymbol\beta}=\lim_{k\to\infty}\boldsymbol\beta_k$,
$$\sqrt{n}(\widehat{\boldsymbol\beta}-\boldsymbol\beta)\ \overset{\mathcal{L}}{\to}\ \mathcal{N}(\boldsymbol{0},I(\boldsymbol\beta)^{-1}).$$
From a numerical point of view, this asymptotic variance $I(\boldsymbol\beta)^{-1}$ satisfies $I(\boldsymbol\beta)=-H(\boldsymbol\beta)$.
Logistic Regression, Numerical Issues

> X = cbind(1, as.matrix(myocarde[,1:7]))
> Y = myocarde$PRONO == "Survival"
> beta = as.matrix(lm(Y~0+X)$coefficients, ncol=1)
> for(s in 1:9){
+   pi = exp(X %*% beta[,s]) / (1 + exp(X %*% beta[,s]))
+   gradient = t(X) %*% (Y - pi)
+   omega = matrix(0, nrow(X), nrow(X)); diag(omega) = (pi * (1 - pi))
+   Hessian = -t(X) %*% omega %*% X
+   beta = cbind(beta, beta[,s] - solve(Hessian) %*% gradient)}
> beta
> -solve(Hessian)
> sqrt(-diag(solve(Hessian)))
Predicted Probability
Let $m(x)=\mathbb{E}(Y|X=x)$. With a logistic regression, we can get a prediction
$$\widehat{m}(x)=\frac{\exp[\boldsymbol{x}^{\mathsf T}\widehat{\boldsymbol\beta}]}{1+\exp[\boldsymbol{x}^{\mathsf T}\widehat{\boldsymbol\beta}]}$$

> predict(logistic, type="response")[1:5]
        1         2         3         4         5
0.6013894 0.1693769 0.3289560 0.8817594 0.1424219
> predict(mlogistic, type="response")[1:5,]
      Death  Survival
1 0.3986106 0.6013894
2 0.8306231 0.1693769
3 0.6710440 0.3289560
4 0.1182406 0.8817594
5 0.8575781 0.1424219
Predicted Probability
$$\widehat{m}(x)=\frac{\exp[\boldsymbol{x}^{\mathsf T}\widehat{\boldsymbol\beta}]}{1+\exp[\boldsymbol{x}^{\mathsf T}\widehat{\boldsymbol\beta}]}=\frac{\exp[\widehat\beta_0+\widehat\beta_1x_1+\cdots+\widehat\beta_kx_k]}{1+\exp[\widehat\beta_0+\widehat\beta_1x_1+\cdots+\widehat\beta_kx_k]}$$
use

> predict(fit_glm, newdata=data, type="response")

e.g.

> GLM <- glm(PRONO ~ PVENT + REPUL, data=myocarde, family=binomial)
> pred_GLM = function(p,r){
+   return(predict(GLM, newdata=
+     data.frame(PVENT=p, REPUL=r), type="response"))}

[Figure: predicted probability as a function of PVENT (0-20) and REPUL (500-3000)]
Predictive Classifier
To go from a score to a class:
if $s(x)>s$, then $\widehat{Y}(x)=1$, and if $s(x)\leq s$, then $\widehat{Y}(x)=0$.
Plot $TP(s)=\mathbb{P}[\widehat{Y}=1|Y=1]$ against $FP(s)=\mathbb{P}[\widehat{Y}=1|Y=0]$.
Predictive Classifier
With a threshold (e.g. $s=50\%$) and the predicted probabilities, one can get a classifier and the confusion matrix

> probabilities <- predict(logistic, type="response")
> predictions <- c("Death","Survival")[(probabilities > .5) + 1]
> table(predictions, myocarde$PRONO)

predictions Death Survival
   Death       25        3
   Survival     4       39
Visualization of a Classifier in Higher Dimension...
[Figure: the 71 individuals plotted on the first two principal components, Dim 1 (54.26%) and Dim 2 (18.64%), with the observed classes (Death / Survival) and the 0.5 level curve of the classifier]

Point $\boldsymbol{z}=(z_1,z_2,0,\cdots,0)\ \mapsto\ \boldsymbol{x}=(x_1,x_2,\cdots,x_k)$.
... but be careful about interpretation!

> prediction = predict(logistic, type="response")

Use a 25% probability threshold

> table(prediction > .25, myocarde$PRONO)
        Death Survival
  FALSE    19        2
  TRUE     10       40

or a 75% probability threshold

> table(prediction > .75, myocarde$PRONO)
        Death Survival
  FALSE    27        9
  TRUE      2       33
Why a Logistic and not a Probit Regression?
Bliss (1934) suggested a model such that
$$\mathbb{P}(Y=1|X=x)=H(\boldsymbol{x}^{\mathsf T}\boldsymbol\beta)\ \text{where}\ H(\cdot)=\Phi(\cdot),$$
the c.d.f. of the $\mathcal{N}(0,1)$ distribution. This is the probit model. This yields a latent model, $y_i=\boldsymbol{1}(y_i^\star>0)$ where $y_i^\star=\boldsymbol{x}_i^{\mathsf T}\boldsymbol\beta+\varepsilon_i$ is a non-observable score.
In the logistic regression, we model the odds ratio,
$$\frac{\mathbb{P}(Y=1|X=x)}{\mathbb{P}(Y\neq 1|X=x)}=\exp[\boldsymbol{x}^{\mathsf T}\boldsymbol\beta]$$
$$\mathbb{P}(Y=1|X=x)=H(\boldsymbol{x}^{\mathsf T}\boldsymbol\beta)\ \text{where}\ H(\cdot)=\frac{\exp[\cdot]}{1+\exp[\cdot]},$$
which is the c.d.f. of the logistic variable, see Verhulst (1845).
http://www.sciencemag.org/content/79/2037/38
http://gdz.sub.uni-goettingen.de/dms/load/img/?PPN=PPN129323640_0018&DMDID=dmdlog7
k-Nearest Neighbors (a.k.a. k-NN)
In pattern recognition, the k-Nearest Neighbors algorithm (or k-NN for short) is a non-parametric method used for classification and regression. (Source: wikipedia).
$$\mathbb{E}[Y|X=x]\approx\frac{1}{k}\sum_{i:\ \Vert\boldsymbol{x}_i-\boldsymbol{x}\Vert\ \text{small}}y_i$$
For k-Nearest Neighbors, the class is usually the majority vote of the k closest neighbors of $\boldsymbol{x}$.

> library(caret)
> KNN <- knn3(PRONO ~ PVENT + REPUL, data=myocarde, k=10)   # value of k assumed
> pred_KNN = function(p,r){
+   return(predict(KNN, newdata=
+     data.frame(PVENT=p, REPUL=r), type="prob")[,2])}
[Figure: k-NN predicted probability as a function of PVENT (0-20) and REPUL (500-3000)]
https://en.wikipedia.org/wiki/K-nearest_neighbors_algorithm
k-Nearest Neighbors
Distance should not be sensitive to units: normalize by the standard deviation

> sP <- sd(myocarde$PVENT)
> sR <- sd(myocarde$REPUL)
> knn3(PRONO ~ I(PVENT/sP) + I(REPUL/sR), data=myocarde, k=10)   # same pattern as the knn3 call on the camping data below; k assumed
k-Nearest Neighbors and Curse of Dimensionality
The higher the dimension, the larger the distance to the closest neighbor,
$$\min_{i\in\{1,\cdots,n\}}\{d(\boldsymbol{a},\boldsymbol{x}_i)\},\qquad\boldsymbol{x}_i\in\mathbb{R}^d,$$
e.g. $\boldsymbol{x}$'s drawn from $\mathcal{U}([0,1]^d)$ and $\boldsymbol{a}=\boldsymbol{0}$,
[Figure: boxplots of the minimum distance, for dimensions 1 to 5, with n = 10 (left) and n = 100 (right)]
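A minimal sketch of that experiment (not from the original deck): draw n points uniformly on $[0,1]^d$ and look at the distance from the origin to the closest one.

> min_dist <- function(n, d){
+   x <- matrix(runif(n * d), n, d)   # n points in [0,1]^d
+   min(sqrt(rowSums(x^2)))           # distance from a = 0 to the closest point
+ }
> boxplot(sapply(1:5, function(d) replicate(1e3, min_dist(10, d))),
+   names = paste0("dim", 1:5))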
Classification (and Regression) Trees, CART
one of the predictive modelling approaches used in statistics, data mining and machine learning [...] In tree structures, leaves represent class labels and branches represent conjunctions of features that lead to those class labels. (Source: wikipedia).

> library(rpart)
> cart <- rpart(PRONO ~ ., data=myocarde)
> library(rpart.plot)
> library(rattle)
> prp(cart, type=2, extra=1)

or

> fancyRpartPlot(cart, sub="")
https://en.wikipedia.org/wiki/Decision_tree_learning
Classification (and Regression) Trees, CART
The impurity is a function $\varphi$ of the probability to have 1 at node $N$, i.e. $\mathbb{P}[Y=1|\ \text{node}\ N]$, and
$$I(N)=\varphi(\mathbb{P}[Y=1|\ \text{node}\ N])$$
where $\varphi$ is nonnegative ($\varphi\geq 0$), symmetric ($\varphi(p)=\varphi(1-p)$), with a minimum in 0 and 1 ($\varphi(0)=\varphi(1)<\varphi(p)$), e.g.
• Bayes error: $\varphi(p)=\min\{p,1-p\}$
• cross-entropy: $\varphi(p)=-p\log(p)-(1-p)\log(1-p)$
• Gini index: $\varphi(p)=p(1-p)$
Those functions are concave, with a minimum at $p=0$ and $1$, and a maximum at $p=1/2$.
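A quick sketch (not from the deck) comparing the three impurity functions listed above:

> p <- seq(0.01, 0.99, by=0.01)
> bayes <- pmin(p, 1 - p)
> entropy <- -p * log(p) - (1 - p) * log(1 - p)
> gini <- p * (1 - p)
> matplot(p, cbind(bayes, entropy, gini), type="l", lty=1, ylab="impurity")
> legend("topright", c("Bayes error", "cross-entropy", "Gini"), col=1:3, lty=1)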
Classification (and Regression) Trees, CART
To split $N$ into two $\{N_L,N_R\}$, consider
$$I(N_L,N_R)=\sum_{x\in\{L,R\}}\frac{n_x}{n}I(N_x)$$
e.g. the Gini index (used originally in CART, see Breiman et al. (1984))
$$\text{gini}(N_L,N_R)=-\sum_{x\in\{L,R\}}\frac{n_x}{n}\sum_{y\in\{0,1\}}\frac{n_{x,y}}{n_x}\left(1-\frac{n_{x,y}}{n_x}\right)$$
and the cross-entropy (used in C4.5 and C5.0)
$$\text{entropy}(N_L,N_R)=-\sum_{x\in\{L,R\}}\frac{n_x}{n}\sum_{y\in\{0,1\}}\frac{n_{x,y}}{n_x}\log\left(\frac{n_{x,y}}{n_x}\right)$$
https://books.google.ca/books?id=JwQx-WOmSyQC&hl=fr
https://en.wikipedia.org/wiki/C4.5_algorithm
Classification (and Regression) Trees, CART
[Figure: impurity as a function of the splitting point, for each covariate (INCAR, INSYS, PRDIA, PAPUL, PVENT, REPUL)]
$N_L:\{x_{i,j}\leq s\}$   $N_R:\{x_{i,j}>s\}$
solve $\max_{j\in\{1,\cdots,k\},\,s}\{I(N_L,N_R)\}$
[Figure: the same impurity curves at the first split and at the second split, for each covariate]
Pruning Trees
One can grow a big tree, until leaves have a (preset) small number of observations, and then possibly go back and prune branches (or leaves) that do not sufficiently improve the classification.
Or we can decide, at each node, whether we split, or not.
Pruning Trees
In trees, overfitting increases with the number of steps, and leaves. The drop in impurity at node $N$ is defined as
$$\Delta I(N_L,N_R)=I(N)-I(N_L,N_R)=I(N)-\left(\frac{n_L}{n}I(N_L)+\frac{n_R}{n}I(N_R)\right)$$

> library(rpart)
> CART <- rpart(PRONO ~ PVENT + REPUL, data=myocarde)
> pred_CART = function(p,r){
+   return(predict(CART, newdata=
+     data.frame(PVENT=p, REPUL=r))[,"Survival"])}

[Figure: CART predicted probability as a function of PVENT (0-20) and REPUL (500-3000)]
→ we cut if $\Delta I(N_L,N_R)/I(N)$ (relative gain) exceeds cp (complexity parameter, default 1%).
Pruning Trees
See also

> library(mvpart)
> ?prune

Define the misclassification rate of a tree, $R(\text{tree})$.
Pruning Trees
Given a cost-complexity parameter cp (see the tuning parameter $\lambda$ in Ridge-Lasso), define a penalized $R(\cdot)$
$$R_{cp}(\text{tree})=\underbrace{R(\text{tree})}_{\text{loss}}+\underbrace{cp\,\Vert\text{tree}\Vert}_{\text{complexity}}$$
If cp is small the optimal tree is large; if cp is large the optimal tree has no leaf, see Breiman et al. (1984).

> cart <- rpart(PRONO ~ ., data=myocarde)
> plotcp(cart)
> prune(cart, cp=0.06)

[Figure: cross-validated relative error against cp (Inf, 0.27, 0.06, 0.024, 0.013) and size of tree (1, 2, 3, 7, 9)]
https://books.google.ca/books?id=JwQx-WOmSyQC&hl=fr
Bagging
Bootstrapped Aggregation (Bagging) is a machine learning ensemble meta-algorithm designed to improve the stability and accuracy of machine learning algorithms used in statistical classification (Source: wikipedia).
It is an ensemble method that creates multiple models of the same type from different sub-samples of the same dataset [bootstrap]. The predictions from each separate model are combined together to provide a superior result [aggregation].
It can be used on any kind of model, but it is interesting for trees, see Breiman (1996).
Bootstrap can be used to define the concept of margin,
$$\text{margin}_i=\frac{1}{B}\sum_{b=1}^{B}\boldsymbol{1}(\widehat{y}_i=y_i)-\frac{1}{B}\sum_{b=1}^{B}\boldsymbol{1}(\widehat{y}_i\neq y_i)$$
Remark: the probability that the $i$-th row is not selected is $(1-n^{-1})^n\to e^{-1}\approx 36.8\%$, cf. training / validation samples (2/3 - 1/3).
https://en.wikipedia.org/wiki/Bootstrap_aggregating
http://statistics.berkeley.edu/sites/default/files/tech-reports/421.pdf
Bagging Trees
[Figure: bagged-trees predicted probability as a function of PVENT (5-20) and REPUL (500-3000)]

> n <- nrow(myocarde)
> margin <- matrix(NA, 1e4, n)
> for(b in 1:1e4){
+   idx = sample(1:n, size=n, replace=TRUE)
+   cart <- rpart(PRONO ~ PVENT + REPUL, data=myocarde[idx,])
+   margin[b,] <- (predict(cart, newdata=myocarde, type="prob")[,"Survival"] > .5) !=
+     (myocarde$PRONO == "Survival")
+ }
> apply(margin, 2, mean)
Bagging Trees
Bagging is interesting because of the instability of CARTs (in terms of tree structure, not necessarily prediction).
Bagging and Variance, Bagging and Bias
Assume that $y=m(x)+\varepsilon$. The mean squared error over repeated random samples can be decomposed in three parts (Hastie et al. (2001))
$$\mathbb{E}[(Y-\widehat{m}(x))^2]=\underbrace{\sigma^2}_{1}+\underbrace{\left[\mathbb{E}[\widehat{m}(x)]-m(x)\right]^2}_{2}+\underbrace{\mathbb{E}\left(\left[\widehat{m}(x)-\mathbb{E}[\widehat{m}(x)]\right]^2\right)}_{3}$$
1 reflects the variance of $Y$ around $m(x)$
2 is the squared bias of $\widehat{m}(x)$
3 is the variance of $\widehat{m}(x)$
→ bias-variance tradeoff. Bootstrap can be used to reduce the bias, and the variance (but be careful with outliers).
http://statweb.stanford.edu/~tibs/ElemStatLearn/
> library(ipred)
> BAG <- bagging(PRONO ~ PVENT + REPUL, data=myocarde)
> pred_BAG = function(p,r){
+   return(predict(BAG, newdata=
+     data.frame(PVENT=p, REPUL=r), type="prob")[,2])}

[Figure: bagging predicted probability as a function of PVENT (0-20) and REPUL (500-3000)]
Random Forests
Strictly speaking, when bootstrapping among observations, and aggregating, we use a bagging algorithm.
In the random forest algorithm, we combine Breiman's bagging idea and the random selection of features, introduced independently by Ho (1995) and Amit & Geman (1997).

> library(randomForest)
> RF <- randomForest(PRONO ~ PVENT + REPUL, data=myocarde)
> pred_RF = function(p,r){
+   return(predict(RF, newdata=
+     data.frame(PVENT=p, REPUL=r), type="prob")[,2])}

[Figure: random forest predicted probability as a function of PVENT (0-20) and REPUL (500-3000)]
http://cm.bell-labs.com/cm/cs/who/tkh/papers/odt.pdf
http://www.cis.jhu.edu/publications/papers_in_database/GEMAN/shape.pdf
Random Forest
At each node, select $\sqrt{k}$ covariates out of $k$ (randomly).
→ can deal with small $n$, large $k$ problems.
Random Forests are used not only for prediction, but also to assess variable importance (discussed later on).
Support Vector Machine
SVMs were developed in the 90s based on previous work, from Vapnik & Lerner (1963), see Valiant (1984).
Assume that points are linearly separable, i.e. there are $\boldsymbol\omega$ and $b$ such that
$$Y=\begin{cases}+1 & \text{if}\ \boldsymbol\omega^{\mathsf T}\boldsymbol{x}+b>0\\ -1 & \text{if}\ \boldsymbol\omega^{\mathsf T}\boldsymbol{x}+b<0\end{cases}$$
Problem: infinite number of solutions, need a good one, that separates the data, (somehow) far from the data.
Concept: VC dimension. Let $\mathcal{H}=\{h:\mathbb{R}^d\mapsto\{-1,+1\}\}$. Then $\mathcal{H}$ is said to shatter a set of points $X$ if all dichotomies can be achieved. E.g. with three points, all configurations can be achieved.
http://www.cs.iastate.edu/~cs573x/vapnik-portraits1963.pdf
https://people.mpi-inf.mpg.de/~mehlhorn/SeminarEvolvability/ValiantLearnable.pdf
Support Vector Machine
E.g. with those four points, several configurations cannot be achieved (with some linear separator, but they can with some quadratic one).
Support Vector Machine
Vapnik's (VC) dimension is the size of the largest shattered subset of $X$.
This dimension is interesting to get an upper bound of the probability of misclassification (with some complexity penalty, function of VC($\mathcal{H}$)).
Now, in practice, where is the optimal hyperplane?
The distance from $\boldsymbol{x}_0$ to the hyperplane $H_{\omega,b}$ is
$$d(\boldsymbol{x}_0,H_{\omega,b})=\frac{\boldsymbol\omega^{\mathsf T}\boldsymbol{x}_0+b}{\Vert\boldsymbol\omega\Vert}$$
and the optimal hyperplane (in the separable case) is
$$\text{argmax}\left\{\min_{i=1,\cdots,n}d(\boldsymbol{x}_i,H_{\omega,b})\right\}$$
Support Vector Machine
Define support vectors as observations such that
$$|\boldsymbol\omega^{\mathsf T}\boldsymbol{x}_i+b|=1$$
The margin is the distance between the hyperplanes defined by the support vectors.
The distance from the support vectors to $H_{\omega,b}$ is $\Vert\boldsymbol\omega\Vert^{-1}$, and the margin is then $2\Vert\boldsymbol\omega\Vert^{-1}$.
→ the algorithm is to minimize the inverse of the margin s.t. $H_{\omega,b}$ separates the $\pm1$ points, i.e.
$$\min\left\{\frac{1}{2}\boldsymbol\omega^{\mathsf T}\boldsymbol\omega\right\}\ \text{s.t.}\ Y_i(\boldsymbol\omega^{\mathsf T}\boldsymbol{x}_i+b)\geq 1,\ \forall i.$$
Support Vector Machine
Problem: difficult to solve, many inequality constraints ($n$)
→ solve the dual problem...
In the primal space, the solution was
$$\boldsymbol\omega=\sum_i\alpha_iY_i\boldsymbol{x}_i\ \text{with}\ \sum_{i=1}^{n}\alpha_iY_i=0.$$
In the dual space, the problem becomes (hint: consider the Lagrangian)
$$\max\left\{\sum_{i=1}^{n}\alpha_i-\frac{1}{2}\sum_{i=1}^{n}\sum_j\alpha_i\alpha_jY_iY_j\boldsymbol{x}_i^{\mathsf T}\boldsymbol{x}_j\right\}\ \text{s.t.}\ \sum_{i=1}^{n}\alpha_iY_i=0,$$
which is usually written
$$\min_{\boldsymbol\alpha}\left\{\frac{1}{2}\boldsymbol\alpha^{\mathsf T}Q\boldsymbol\alpha-\boldsymbol{1}^{\mathsf T}\boldsymbol\alpha\right\}\ \text{s.t.}\ \begin{cases}0\leq\alpha_i,\ \forall i\\ \boldsymbol{y}^{\mathsf T}\boldsymbol\alpha=0\end{cases}$$
where $Q=[Q_{i,j}]$ and $Q_{i,j}=y_iy_j\boldsymbol{x}_i^{\mathsf T}\boldsymbol{x}_j$.
Support Vector Machine
Now, what about the non-separable case?
Here, we cannot have $y_i(\boldsymbol\omega^{\mathsf T}\boldsymbol{x}_i+b)\geq 1$ $\forall i$.
→ introduce slack variables,
$$\begin{cases}\boldsymbol\omega^{\mathsf T}\boldsymbol{x}_i+b\geq +1-\xi_i & \text{when}\ y_i=+1\\ \boldsymbol\omega^{\mathsf T}\boldsymbol{x}_i+b\leq -1+\xi_i & \text{when}\ y_i=-1\end{cases}$$
where $\xi_i\geq 0$ $\forall i$. There is a classification error when $\xi_i>1$.
The idea is then to solve
$$\min\left\{\frac{1}{2}\boldsymbol\omega^{\mathsf T}\boldsymbol\omega+C\boldsymbol{1}^{\mathsf T}\boldsymbol{1}_{\boldsymbol\xi>1}\right\},\ \text{instead of}\ \min\left\{\frac{1}{2}\boldsymbol\omega^{\mathsf T}\boldsymbol\omega\right\}$$
Support Vector Machines, with a Linear Kernel
So far,
$$d(\boldsymbol{x}_0,H_{\omega,b})=\min_{\boldsymbol{x}\in H_{\omega,b}}\{\Vert\boldsymbol{x}_0-\boldsymbol{x}\Vert_{\ell_2}\}$$
where $\Vert\cdot\Vert_{\ell_2}$ is the Euclidean ($\ell_2$) norm,
$$\Vert\boldsymbol{x}_0-\boldsymbol{x}\Vert_{\ell_2}=\sqrt{(\boldsymbol{x}_0-\boldsymbol{x})\cdot(\boldsymbol{x}_0-\boldsymbol{x})}=\sqrt{\boldsymbol{x}_0\cdot\boldsymbol{x}_0-2\,\boldsymbol{x}_0\cdot\boldsymbol{x}+\boldsymbol{x}\cdot\boldsymbol{x}}$$

> library(kernlab)
> SVM2 <- ksvm(PRONO ~ PVENT + REPUL, data=myocarde,
+   kernel="vanilladot", prob.model=TRUE)   # linear kernel; arguments assumed
> pred_SVM2 = function(p,r){
+   return(predict(SVM2, newdata=
+     data.frame(PVENT=p, REPUL=r), type="probabilities")[,2])}

[Figure: linear-kernel SVM predicted probability as a function of PVENT (0-20) and REPUL (500-3000)]
Support Vector Machines, with a Non Linear Kernel
More generally,
$$d(\boldsymbol{x}_0,H_{\omega,b})=\min_{\boldsymbol{x}\in H_{\omega,b}}\{\Vert\boldsymbol{x}_0-\boldsymbol{x}\Vert_{k}\}$$
where $\Vert\cdot\Vert_k$ is some kernel-based norm,
$$\Vert\boldsymbol{x}_0-\boldsymbol{x}\Vert_{k}=\sqrt{k(\boldsymbol{x}_0,\boldsymbol{x}_0)-2k(\boldsymbol{x}_0,\boldsymbol{x})+k(\boldsymbol{x},\boldsymbol{x})}$$

> library(kernlab)
> SVM2 <- ksvm(PRONO ~ PVENT + REPUL, data=myocarde,
+   kernel="rbfdot", prob.model=TRUE)   # radial (non-linear) kernel; kernel choice assumed
> pred_SVM2 = function(p,r){
+   return(predict(SVM2, newdata=
+     data.frame(PVENT=p, REPUL=r), type="probabilities")[,2])}

[Figure: non-linear-kernel SVM predicted probability as a function of PVENT (0-20) and REPUL (500-3000)]
Heuristics on SVMs
An interpretation is that the data aren't linearly separable in the original space, but might be separable after some kernel transformation,
Still Hungry ?
There are still several (machine learning) techniques that can be used for classification
• Fisher's Linear or Quadratic Discrimination (closely related to logistic regression, and PCA), see Fisher (1936)
$$X|Y=0\sim\mathcal{N}(\boldsymbol\mu_0,\boldsymbol\Sigma_0)\ \text{and}\ X|Y=1\sim\mathcal{N}(\boldsymbol\mu_1,\boldsymbol\Sigma_1)$$
http://onlinelibrary.wiley.com/doi/10.1111/j.1469-1809.1936.tb02137.x/abstract;jsessionid=C4721283C0ECF3893F91BFEBBDF3D094.f03t02)
Still Hungry ?
• Perceptron or more generally Neural Networks. In machine learning, neural networks are a family of statistical learning models inspired by biological neural networks and are used to estimate or approximate functions that can depend on a large number of inputs and are generally unknown. (Source: wikipedia), see Rosenblatt (1957)
• Boosting (see next section)
• Naive Bayes. In machine learning, naive Bayes classifiers are a family of simple probabilistic classifiers based on applying Bayes' theorem with strong (naive) independence assumptions between the features. (Source: wikipedia), see Russell & Norvig (2003)
See also the (great) package

> library(caret)
https://en.wikipedia.org/wiki/Artificial_neural_network
http://www.researchgate.net/post/Where_to_find_the_original_paper_of_Rosenblatts_Perceptron_algorithm_The_perceptron_a_perceiving_and_recognizing_automaton_Rosenblatt_F_1957
https://en.wikipedia.org/wiki/Naive_Bayes_classifier
http://aima.cs.berkeley.edu/
Difference in Differences
In many applications (e.g. marketing), we do need two models to analyze the impact of a treatment. We need two groups, a control and a treatment group.
Data: $\{(\boldsymbol{x}_i,y_i)\}$ with $y_i\in\{\bullet,\bullet\}$
Data: $\{(\boldsymbol{x}_j,y_j)\}$ with $y_j\in\{\bullet,\bullet\}$
See clinical trials, treatment vs. control group. E.g. a direct mail campaign in a bank

              Control  Promotion
No Purchase    85.17%     61.60%
Purchase       14.83%     38.40%

overall uplift effect +23.57%, see Guelman et al. (2014) for more details.
http://www.sciencedirect.com/science/article/pii/S0167668714000717
Application on Motor Insurance Claims
Consider a (standard) logistic regression, on two covariates (the age of the driver, and the age of the camping-car)
$$\pi=\text{logit}^{-1}(\beta_0+\beta_1x_1+\beta_2x_2)$$

> reg_glm = glm(nombre ~ ageconducteur + agevehicule,
+   data=camping, family=binomial)

[Figure: predicted probability (0.00-0.15) as a function of the driver's age (20-90) and the vehicle's age (0-40)]
Application on Motor Insurance Claims
Consider a (standard) logistic regression, on two covariates (the age of the driver, and the age of the camping-car), smoothed with splines
$$\pi=\text{logit}^{-1}(\beta_0+s_1(x_1)+s_2(x_2))$$

> reg_add = glm(nombre ~ bs(ageconducteur) + bs(agevehicule),
+   data=camping, family=binomial)

[Figure: predicted probability (0.00-0.15) as a function of the driver's age (20-90) and the vehicle's age (0-40)]
Application on Motor Insurance Claims
Consider a (standard) logistic regression, on two covariates (the age of the driver, and the age of the camping-car), smoothed with a bivariate spline
$$\pi=\text{logit}^{-1}(\beta_0+s(x_1,x_2))$$

> library(mgcv)
> reg_gam = gam(nombre ~ s(ageconducteur, agevehicule),
+   data=camping, family=binomial)

[Figure: predicted probability (0.00-0.15) as a function of the driver's age (20-90) and the vehicle's age (0-40)]
Application on Motor Insurance Claims
One can also use k-Nearest Neighbours (k-NN)

> library(caret)
> sc = sd(camping$ageconducteur)
> sv = sd(camping$agevehicule)
> knn = knn3((nombre==1) ~ I(ageconducteur/sc) + I(agevehicule/sv),
+   data=camping, k=100)

[Figure: predicted probability (0.00-0.15) as a function of the driver's age (20-90) and the vehicle's age (0-40)]

(be careful about scaling problems)
Application on Motor Insurance Claims
We can also use a tree

> tree = rpart((nombre==1) ~ ageconducteur + agevehicule,
+   data=camping, cp=7e-4)

[Figure: predicted probability (0.00-0.15) as a function of the driver's age (20-90) and the vehicle's age (0-40), for two tree sizes]
Application on Motor Insurance Claims

or bagging techniques (rather close to random forests)

> library(ipred)
> bag = bagging((nombre==1) ~ ageconducteur + agevehicule, data=camping)
> library(randomForest)
> rf = randomForest((nombre==1) ~ ageconducteur + agevehicule, data=camping)

[Figure: predicted probability (0.00-0.15) as a function of the driver's age (20-90) and the vehicle's age (0-40)]
Application on Motor Insurance Claims
Boosting algorithms can also be considered (see next time)

> library(dismo)
> library(gbm)
> fit <- gbm.step(data=camping, gbm.x=c("ageconducteur","agevehicule"),
+   gbm.y="nombre", family="bernoulli")   # gbm.x / gbm.y arguments assumed
> predict(fit, type="response", n.trees=700)

[Figure: predicted probability (0.00-0.15) as a function of the driver's age (20-90)]
Application on Motor Insurance Claims
Boosting algorithms can also be considered (see next time)

> library(dismo)
> library(gbm)
> fit <- gbm.step(data=camping, gbm.x=c("ageconducteur","agevehicule"),
+   gbm.y="nombre", family="bernoulli")   # gbm.x / gbm.y arguments assumed
> predict(fit, type="response", n.trees=400)

[Figure: predicted probability (0.00-0.15) as a function of the driver's age (20-90) and the vehicle's age (0-40)]
Part 2. Regression
Regression?
In statistics, regression analysis is a statistical process for estimating the relationships among variables [...] In a narrower sense, regression may refer specifically to the estimation of continuous response variables, as opposed to the discrete response variables used in classification. (Source: wikipedia).
Here regression is opposed to classification (as in the CART algorithm): $y$ is either a continuous variable $y\in\mathbb{R}$ or a counting variable $y\in\mathbb{N}$.
https://en.wikipedia.org/wiki/Regression_analysis
Regression? Parametrics, nonparametrics and machine learning
In many cases in the econometric and actuarial literature we simply want a good fit for the conditional expectation, $\mathbb{E}[Y|X=x]$.
Regression analysis estimates the conditional expectation of the dependent variable given the independent variables (Source: wikipedia).
Example: a popular nonparametric technique, kernel based regression,
$$\widehat{m}(x)=\frac{\sum_i Y_i\,K_h(X_i-x)}{\sum_i K_h(X_i-x)}$$
In the econometric literature, the interest is on asymptotic normality properties and plug-in techniques.
In machine learning, the interest is on out-of-sample cross-validation algorithms.
Linear, Non-Linear and Generalized Linear
Linear Model:
$$(Y|X=x)\sim\mathcal{N}(\mu_x,\sigma^2),\qquad\mathbb{E}[Y|X=x]=\mu_x=\boldsymbol{x}^{\mathsf T}\boldsymbol\beta$$

> fit <- lm(y ~ x, data=df)   # generic call; data frame name assumed
Linear, Non-Linear and Generalized Linear
Non-Linear / Non-Parametric Model:
$$(Y|X=x)\sim\mathcal{N}(\mu_x,\sigma^2),\qquad\mathbb{E}[Y|X=x]=\mu_x=m(x)$$

> fit <- loess(y ~ x, data=df)   # one possible smoother; call assumed
Linear, Non-Linear and Generalized Linear
Generalized Linear Model:
$$(Y|X=x)\sim\mathcal{L}(\theta_x,\varphi),\qquad\mathbb{E}[Y|X=x]=h^{-1}(\theta_x)=h^{-1}(\boldsymbol{x}^{\mathsf T}\boldsymbol\beta)$$

> fit <- glm(y ~ x, data=df, family=poisson(link="log"))   # family assumed
Linear Model
Consider a linear regression model, $y_i=\boldsymbol{x}_i^{\mathsf T}\boldsymbol\beta+\varepsilon_i$.
$\boldsymbol\beta$ is estimated using ordinary least squares, $\widehat{\boldsymbol\beta}=[\boldsymbol{X}^{\mathsf T}\boldsymbol{X}]^{-1}\boldsymbol{X}^{\mathsf T}\boldsymbol{Y}$
→ best linear unbiased estimator.
Unbiased estimators are important in statistics because they have nice mathematical properties (see the Cramér-Rao lower bound).
Looking for biased estimators (bias-variance tradeoff) becomes important in high dimension, see Burr & Fry (2005).
http://www.jstor.org/stable/25471022
Linear Model and Loss Functions
Consider a linear model, with some general loss function $\ell$, set $\ell(x,y)=R(x-y)$ and consider,
$$\widehat{\boldsymbol\beta}=\text{argmin}\left\{\sum_{i=1}^{n}\ell(y_i,\boldsymbol{x}_i^{\mathsf T}\boldsymbol\beta)\right\}$$
If $R$ is differentiable, the first order condition would be
$$\sum_{i=1}^{n}R'\left(y_i-\boldsymbol{x}_i^{\mathsf T}\boldsymbol\beta\right)\boldsymbol{x}_i^{\mathsf T}=\boldsymbol{0},$$
i.e.
$$\sum_{i=1}^{n}\underbrace{\omega\left(y_i-\boldsymbol{x}_i^{\mathsf T}\boldsymbol\beta\right)}_{\omega_i}\left(y_i-\boldsymbol{x}_i^{\mathsf T}\boldsymbol\beta\right)\boldsymbol{x}_i^{\mathsf T}=\boldsymbol{0}\ \text{with}\ \omega(x)=\frac{R'(x)}{x},$$
It is the first order condition of a weighted $\ell_2$ regression.
Linear Model and Loss Functions
But the weights are unknown: use an iterative algorithm

> e <- residuals(lm(y ~ x))
> for(i in 1:100){
+   W <- omega(e)   # weights from the previous residuals
+   e <- residuals(lm(y ~ x, weights=W))
+ }
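For instance, a self-contained sketch (not from the deck) with the $\ell_1$ loss, $R(x)=|x|$, so that $\omega(x)=R'(x)/x=1/|x|$:

> set.seed(1)
> x <- runif(100); y <- 1 + 2*x + rt(100, df=2)   # toy data, heavy-tailed noise
> e <- residuals(lm(y ~ x))
> for(i in 1:100){
+   W <- 1 / pmax(abs(e), 1e-8)   # omega(e), guarded away from zero
+   fit <- lm(y ~ x, weights=W)
+   e <- residuals(fit)
+ }
> coef(fit)   # close to a median (l1) regression fit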
Bagging Linear Models
> V = matrix(NA, 100, 251)
> for(i in 1:100){
+   ind <- sample(1:n, size=n, replace=TRUE)
+   V[i,] <- predict(lm(y ~ x, data=df[ind,]))   # body of the loop assumed: store each bootstrap fit
+ }
Regression Smoothers, natura non facit saltus
In statistical learning procedures, a key role is played by basis functions. We will see that it is common to assume that
$$m(x)=\sum_{m=0}^{M}\beta_m h_m(x),$$
where $h_0$ is usually a constant function and the $h_m$ are basis functions.
For instance, $h_m(x)=x^m$ for a polynomial expansion with a single predictor, or $h_m(x)=(x-s_m)_+$ for some knots $s_m$'s (for linear splines, but one can consider quadratic or cubic ones).
Regression Smoothers: Polynomial Functions
Stone-Weierstrass theorem: every continuous function defined on a closed interval $[a,b]$ can be uniformly approximated as closely as desired by a polynomial function

> fit <- lm(Y ~ poly(X, 5))   # degree assumed
> predict(fit, newdata=data.frame(X=x))
Regression Smoothers: Spline Functions
> library(splines)
> fit <- lm(Y ~ bs(X))   # B-spline basis; knots assumed
> predict(fit, newdata=data.frame(X=x))
Regression Smoothers: Spline Functions

> fit <- lm(Y ~ bs(X, df=10))   # more knots; df assumed
> predict(fit, newdata=data.frame(X=x))

see Generalized Additive Models.
Fixed Knots vs. Optimized Ones
> library(freeknotsplines)
> gen <- freelsgen(db$x, db$y, degree=2, numknot=s)   # optimized knot locations
> fit <- lm(y ~ bs(x, degree=2, knots=gen@optknot), data=db)
> predict(fit, newdata=data.frame(X=x))
Interpretation of Penalty
Unbiased estimators are important in mathematical statistics, but are they the best estimators?
Consider a sample, i.i.d., $\{y_1,\cdots,y_n\}$ with distribution $\mathcal{N}(\theta,\sigma^2)$. Define $\widetilde\theta=\alpha\overline{Y}$. What is the optimal $\alpha^\star$ to get the best estimator of $\theta$?
• bias: $\text{bias}(\widetilde\theta)=\mathbb{E}(\widetilde\theta)-\theta=(\alpha-1)\theta$
• variance: $\text{Var}(\widetilde\theta)=\dfrac{\alpha^2\sigma^2}{n}$
• mse: $\text{mse}(\widetilde\theta)=(\alpha-1)^2\theta^2+\dfrac{\alpha^2\sigma^2}{n}$
The optimal value is $\alpha^\star=\dfrac{\theta^2}{\theta^2+\dfrac{\sigma^2}{n}}<1$.
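A quick numerical check of that optimal shrinkage factor, with toy values (assumed: theta = 1, sigma = 2, n = 10):

> theta <- 1; sigma <- 2; n <- 10
> alpha <- seq(0, 1.2, by=0.01)
> mse <- (alpha - 1)^2 * theta^2 + alpha^2 * sigma^2 / n
> alpha[which.min(mse)]               # ~ 0.71
> theta^2 / (theta^2 + sigma^2 / n)   # 0.7142857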
Linear Model
Consider some linear model $y_i=\boldsymbol{x}_i^{\mathsf T}\boldsymbol\beta+\varepsilon_i$ for all $i=1,\cdots,n$.
Assume that the $\varepsilon_i$ are i.i.d. with $\mathbb{E}(\varepsilon)=0$ (and finite variance). Write
$$\underbrace{\begin{pmatrix}y_1\\ \vdots\\ y_n\end{pmatrix}}_{\boldsymbol{y},\,n\times 1}=\underbrace{\begin{pmatrix}1 & x_{1,1} & \cdots & x_{1,k}\\ \vdots & \vdots & \ddots & \vdots\\ 1 & x_{n,1} & \cdots & x_{n,k}\end{pmatrix}}_{\boldsymbol{X},\,n\times(k+1)}\underbrace{\begin{pmatrix}\beta_0\\ \beta_1\\ \vdots\\ \beta_k\end{pmatrix}}_{\boldsymbol\beta,\,(k+1)\times 1}+\underbrace{\begin{pmatrix}\varepsilon_1\\ \vdots\\ \varepsilon_n\end{pmatrix}}_{\boldsymbol\varepsilon,\,n\times 1}.$$
Assuming $\boldsymbol\varepsilon\sim\mathcal{N}(\boldsymbol{0},\sigma^2\mathbb{I})$, the maximum likelihood estimator of $\boldsymbol\beta$ is
$$\widehat{\boldsymbol\beta}=\text{argmin}\{\Vert\boldsymbol{y}-\boldsymbol{X}\boldsymbol\beta\Vert_{\ell_2}\}=(\boldsymbol{X}^{\mathsf T}\boldsymbol{X})^{-1}\boldsymbol{X}^{\mathsf T}\boldsymbol{y}$$
... under the assumption that $\boldsymbol{X}^{\mathsf T}\boldsymbol{X}$ is a full-rank matrix.
What if $\boldsymbol{X}^{\mathsf T}\boldsymbol{X}$ cannot be inverted? Then $\widehat{\boldsymbol\beta}=[\boldsymbol{X}^{\mathsf T}\boldsymbol{X}]^{-1}\boldsymbol{X}^{\mathsf T}\boldsymbol{y}$ does not exist, but $\widehat{\boldsymbol\beta}_\lambda=[\boldsymbol{X}^{\mathsf T}\boldsymbol{X}+\lambda\mathbb{I}]^{-1}\boldsymbol{X}^{\mathsf T}\boldsymbol{y}$ always exists if $\lambda>0$.
Ridge Regression
The estimator $\widehat{\boldsymbol\beta}_\lambda=[\boldsymbol{X}^{\mathsf T}\boldsymbol{X}+\lambda\mathbb{I}]^{-1}\boldsymbol{X}^{\mathsf T}\boldsymbol{y}$ is the Ridge estimate, obtained as the solution of
$$\widehat{\boldsymbol\beta}_\lambda=\underset{\boldsymbol\beta}{\text{argmin}}\left\{\sum_{i=1}^{n}[y_i-\beta_0-\boldsymbol{x}_i^{\mathsf T}\boldsymbol\beta]^2+\lambda\Vert\boldsymbol\beta\Vert_{\ell_2}^2\right\}$$
for some tuning parameter $\lambda$. One can also write
$$\widehat{\boldsymbol\beta}_\lambda=\underset{\boldsymbol\beta;\,\Vert\boldsymbol\beta\Vert_{\ell_2}\leq s}{\text{argmin}}\{\Vert\boldsymbol{y}-\boldsymbol{X}\boldsymbol\beta\Vert_{\ell_2}\}$$
Remark: note that we solve $\widehat{\boldsymbol\beta}=\text{argmin}\{\text{objective}(\boldsymbol\beta)\}$ where
$$\text{objective}(\boldsymbol\beta)=\underbrace{\mathcal{L}(\boldsymbol\beta)}_{\text{training loss}}+\underbrace{\mathcal{R}(\boldsymbol\beta)}_{\text{regularization}}$$
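A minimal sketch of that closed form on the myocarde covariates (assumptions: standardized X, centered response):

> X <- scale(as.matrix(myocarde[,1:7]))
> y <- (myocarde$PRONO == "Survival") * 1
> ridge <- function(lambda)
+   solve(t(X) %*% X + lambda * diag(ncol(X)), t(X) %*% (y - mean(y)))
> cbind(ridge(0.1), ridge(10))   # coefficients shrink as lambda grows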
Going further on sparsity issues

In several applications, $k$ can be (very) large, but a lot of features are just noise: $\beta_j=0$ for many $j$'s. Let $s$ denote the number of relevant features, with $s\ll k$.
Going further on sparsity issues

Define $\Vert\boldsymbol{a}\Vert_{\ell_0}=\sum\boldsymbol{1}(|a_i|>0)$. Here $\text{dim}(\boldsymbol\beta)=s$.
We wish we could solve
$$\widehat{\boldsymbol\beta}=\underset{\boldsymbol\beta;\,\Vert\boldsymbol\beta\Vert_{\ell_0}\leq s}{\text{argmin}}\{\Vert\boldsymbol{y}-\boldsymbol{X}\boldsymbol\beta\Vert_{\ell_2}\}$$
Problem: it is usually not possible to describe all possible constraints, since $\binom{k}{s}$ subsets of coefficients should be considered here (with $k$ (very) large).
Idea: solve the dual problem
$$\widehat{\boldsymbol\beta}=\underset{\boldsymbol\beta;\,\Vert\boldsymbol{y}-\boldsymbol{X}\boldsymbol\beta\Vert_{\ell_2}\leq h}{\text{argmin}}\{\Vert\boldsymbol\beta\Vert_{\ell_0}\}$$
where we might convexify the $\ell_0$ norm, $\Vert\cdot\Vert_{\ell_0}\to\Vert\cdot\Vert_{\ell_1}$.
Regularization $\ell_0$, $\ell_1$ and $\ell_2$
Optimal LASSO Penalty
Use cross validation, e.g. K-fold,
$$\widehat{\boldsymbol\beta}_{(-k)}(\lambda)=\text{argmin}\left\{\sum_{i\notin I_k}[y_i-\boldsymbol{x}_i^{\mathsf T}\boldsymbol\beta]^2+\lambda\sum_j|\beta_j|\right\}$$
then compute the sum of the squared errors,
$$Q_k(\lambda)=\sum_{i\in I_k}[y_i-\boldsymbol{x}_i^{\mathsf T}\widehat{\boldsymbol\beta}_{(-k)}(\lambda)]^2$$
and finally solve
$$\lambda^\star=\text{argmin}\left\{\overline{Q}(\lambda)=\frac{1}{K}\sum_k Q_k(\lambda)\right\}$$
Note that this might overfit, so Hastie, Tibshirani & Friedman (2009) suggest the largest $\lambda$ such that
$$\overline{Q}(\lambda)\leq\overline{Q}(\lambda^\star)+\text{se}[\lambda^\star]\ \text{with}\ \text{se}[\lambda]^2=\frac{1}{K^2}\sum_{k=1}^{K}[Q_k(\lambda)-\overline{Q}(\lambda)]^2$$
http://statweb.stanford.edu/~tibs/ElemStatLearn/
Going further on sparsity issues

On $[-1,+1]^k$, the convex hull of $\Vert\cdot\Vert_{\ell_0}$ is $\Vert\cdot\Vert_{\ell_1}$.
On $[-a,+a]^k$, the convex hull of $\Vert\cdot\Vert_{\ell_0}$ is $a^{-1}\Vert\cdot\Vert_{\ell_1}$.
Hence,
$$\widehat{\boldsymbol\beta}=\underset{\boldsymbol\beta;\,\Vert\boldsymbol\beta\Vert_{\ell_1}\leq s}{\text{argmin}}\{\Vert\boldsymbol{y}-\boldsymbol{X}\boldsymbol\beta\Vert_{\ell_2}\}$$
is equivalent (Kuhn-Tucker theorem) to the Lagrangian optimization problem
$$\widehat{\boldsymbol\beta}=\text{argmin}\{\Vert\boldsymbol{y}-\boldsymbol{X}\boldsymbol\beta\Vert_{\ell_2}+\lambda\Vert\boldsymbol\beta\Vert_{\ell_1}\}$$
LASSO: Least Absolute Shrinkage and Selection Operator

$$\widehat{\boldsymbol\beta}=\text{argmin}\{\Vert\boldsymbol{y}-\boldsymbol{X}\boldsymbol\beta\Vert_{\ell_2}+\lambda\Vert\boldsymbol\beta\Vert_{\ell_1}\}$$
is a convex problem (several algorithms), but not strictly convex (no unicity of the minimum). Nevertheless, predictions $\widehat{y}=\boldsymbol{x}^{\mathsf T}\widehat{\boldsymbol\beta}$ are unique.
MM, minimize majorization; coordinate descent, see Hunter (2003).
http://sites.stat.psu.edu/~dhunter/papers/mmtutorial.pdf
LASSO, third party

> freq = merge(contrat, nombre_RC)
> freq = merge(freq, nombre_DO)
> freq[,10] = as.factor(freq[,10])
> mx = cbind(freq[,c(4,5,6)], freq[,9]=="D",
+   freq[,3] %in% c("A","B","C"))
> colnames(mx) = c(names(freq)[c(4,5,6)], "diesel", "zone")
> for(i in 1:ncol(mx)) mx[,i] = (mx[,i] - mean(mx[,i])) / sd(mx[,i])
> names(mx)
[1] "puissance" "agevehicule" "ageconducteur" "diesel" "zone"
> library(glmnet)
> fit = glmnet(x=as.matrix(mx), y=freq[,11],
+   offset=log(freq[,2]), family="poisson")
> plot(fit, xvar="lambda", label=TRUE)

[Figure: LASSO coefficient paths against log lambda]
LASSO, third party

> plot(fit, label=TRUE)
> cvfit = cv.glmnet(x=as.matrix(mx), y=freq[,11],
+   offset=log(freq[,2]), family="poisson")
> plot(cvfit)
> cvfit$lambda.min
[1] 0.0002845703
> log(cvfit$lambda.min)
[1] -8.16453

Cross validation curve + error bars

[Figure: coefficient paths against the l1 norm (left) and Poisson deviance against log lambda (right)]
With the same design matrix mx as above,

> fit = glmnet(x=as.matrix(mx), y=freq[,12],
+   offset=log(freq[,2]), family="poisson")
> plot(fit, xvar="lambda", label=TRUE)

LASSO, DO frequency

[Figure: LASSO coefficient paths against log lambda]
LASSO, material

> plot(fit, label=TRUE)
> cvfit = cv.glmnet(x=as.matrix(mx), y=freq[,12],
+   offset=log(freq[,2]), family="poisson")
> plot(cvfit)
> cvfit$lambda.min
[1] 0.0004744917
> log(cvfit$lambda.min)
[1] -7.653266

Cross validation curve + error bars

[Figure: coefficient paths against the l1 norm (left) and Poisson deviance against log lambda (right)]
Some thoughts about Tuning parameters
Regularization is a key issue in machine learning, to avoid overfitting.
In (traditional) econometrics, choices are based on plug-in methods: see Silverman's bandwidth rule in kernel density estimation,
$$h^\star=\widehat\sigma\left(\frac{4}{3n}\right)^{1/5}\approx 1.06\,\widehat\sigma\,n^{-1/5}.$$
In the machine learning literature, out-of-sample cross-validation methods are used for choosing the amount of regularization.
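That rule of thumb is essentially what base R's bandwidth selectors implement; a quick illustration (toy sample assumed):

> x <- rnorm(200)
> 1.06 * sd(x) * length(x)^(-1/5)
> bw.nrd(x)   # 1.06 * min(sd, IQR/1.34) * n^(-1/5)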
Big Data, Oracle and Sparsity

Assume that $k$ is large, and that $\boldsymbol\beta\in\mathbb{R}^k$ can be partitioned as $\boldsymbol\beta=(\boldsymbol\beta_{\text{imp}},\boldsymbol\beta_{\text{non-imp}})$, as well as the covariates $\boldsymbol{x}=(\boldsymbol{x}_{\text{imp}},\boldsymbol{x}_{\text{non-imp}})$, with important and non-important variables, i.e. $\boldsymbol\beta_{\text{non-imp}}\approx\boldsymbol{0}$.
Goal: achieve variable selection and make inference on $\boldsymbol\beta_{\text{imp}}$.
Oracle property of high dimensional model selection and estimation, see Fan & Li (2001). Only the oracle knows which variables are important...
If the sample size is large enough ($n>>k_{\text{imp}}\left(1+\log\frac{k}{k_{\text{imp}}}\right)$) we can do inference as if we knew which covariates were important: we can ignore the selection-of-covariates part, which is not relevant for the confidence intervals. This provides cover for ignoring the shrinkage and using regular standard errors, see Athey & Imbens (2015).
http://orfe.princeton.edu/~jqfan/papers/01/penlike.pdf
http://www.nber.org/econometrics_minicourse_2015/
Why Shrinkage Regression Estimates ?
Shrinkage is interesting for model selection (an alternative to penalized criteria) and to get a good balance between bias and variance.
In decision theory, an admissible decision rule is a rule for making a decision such that there is no other rule that is always better.
When $k\geq 3$, ordinary least squares are not admissible, see the improvement by the James-Stein estimator.
Regularization and Scalability
What if $k$ is (extremely) large? never trust ols with more than five regressors (attributed to Zvi Griliches in Athey & Imbens (2015))
→ use regularization techniques, see Ridge, Lasso, or subset selection
$$\widehat{\boldsymbol\beta}=\text{argmin}\left\{\sum_{i=1}^{n}[y_i-\beta_0-\boldsymbol{x}_i^{\mathsf T}\boldsymbol\beta]^2+\lambda\Vert\boldsymbol\beta\Vert_{\ell_0}\right\}\ \text{where}\ \Vert\boldsymbol\beta\Vert_{\ell_0}=\sum_k\boldsymbol{1}(\beta_k\neq 0).$$
http://www.nber.org/econometrics_minicourse_2015/nber_slides11.pdf
Penalization and Splines
In order to get a sufficiently smooth model, why not penalize the sum of squares of errors,
$$\sum_{i=1}^{n}[y_i-m(x_i)]^2+\lambda\int[m''(t)]^2dt$$
for some tuning parameter $\lambda$. Consider some cubic spline basis, so that
$$m(x)=\sum_{j=1}^{J}\theta_jN_j(x)$$
then the optimal expression for $m$ is obtained using
$$\widehat{\boldsymbol\theta}=[\boldsymbol{N}^{\mathsf T}\boldsymbol{N}+\lambda\boldsymbol\Omega]^{-1}\boldsymbol{N}^{\mathsf T}\boldsymbol{y}$$
where $\boldsymbol{N}_{i,j}$ is the matrix of the $N_j(X_i)$'s and $\boldsymbol\Omega_{i,j}=\int N_i''(t)N_j''(t)dt$.
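This penalized criterion is what smooth.spline() minimizes in base R; a minimal sketch on toy data (assumed):

> set.seed(1)
> x <- runif(200, 0, 10)
> y <- sin(x) + rnorm(200, sd=0.3)
> fit <- smooth.spline(x, y)   # lambda chosen by (generalized) cross-validation
> plot(x, y); lines(predict(fit, seq(0, 10, by=0.1)), col="red")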
Smoothing with Multiple Regressors
Actually
$$\sum_{i=1}^{n}[y_i-m(\boldsymbol{x}_i)]^2+\lambda\int[m''(\boldsymbol{t})]^2d\boldsymbol{t}$$
is based on some multivariate penalty functional, e.g.
$$\int[m''(\boldsymbol{t})]^2d\boldsymbol{t}=\int\left[\sum_i\left(\frac{\partial^2m(\boldsymbol{t})}{\partial t_i^2}\right)^2+2\sum_{i,j}\left(\frac{\partial^2m(\boldsymbol{t})}{\partial t_i\partial t_j}\right)^2\right]d\boldsymbol{t}$$
Regression Trees
The partitioning is sequential, one covariate at a time (see adaptive neighbor estimation).
Start with $Q=\sum_{i=1}^{n}[y_i-\overline{y}]^2$.
For covariate $k$ and threshold $t$, split the data according to $\{x_{i,k}\leq t\}$ ($L$) or $\{x_{i,k}>t\}$ ($R$). Compute
$$\overline{y}_L=\frac{\sum_{i,x_{i,k}\leq t}y_i}{\sum_{i,x_{i,k}\leq t}1}\ \text{and}\ \overline{y}_R=\frac{\sum_{i,x_{i,k}>t}y_i}{\sum_{i,x_{i,k}>t}1}$$
and let
$$\widehat{m}_i^{(k,t)}=\begin{cases}\overline{y}_L & \text{if}\ x_{i,k}\leq t\\ \overline{y}_R & \text{if}\ x_{i,k}>t\end{cases}$$
Regression Trees
Then compute $(k^\star,t^\star)=\text{argmin}\left\{\sum_{i=1}^{n}[y_i-\widehat{m}_i^{(k,t)}]^2\right\}$, and partition the space into two subspaces, whether $x_{k^\star}\leq t^\star$, or not.
Then repeat this procedure, and minimize
$$\sum_{i=1}^{n}[y_i-\widehat{m}_i]^2+\lambda\,\#\{\text{leaves}\},$$
(cf. LASSO).
One can also consider random forests with regression trees; the split search is sketched below.
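A bare-bones sketch (not from the deck) of the first split search: for each covariate k and threshold t, compute the within-leaf means and keep the pair minimizing the squared error.

> best_split <- function(X, y){
+   best <- list(k=NA, t=NA, sse=Inf)
+   for(k in seq_len(ncol(X))){
+     for(t in sort(unique(X[,k]))[-1]){
+       left <- X[,k] < t
+       sse <- sum((y[left] - mean(y[left]))^2) +
+              sum((y[!left] - mean(y[!left]))^2)
+       if(sse < best$sse) best <- list(k=k, t=t, sse=sse)
+     }
+   }
+   best
+ }
> set.seed(1)
> X <- matrix(runif(200), 100, 2); y <- 2*(X[,1] > .5) + rnorm(100, sd=.1)
> best_split(X, y)   # recovers k = 1 and a threshold close to .5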
Local Regression

> W <- (abs(X - x0) < h) * 1   # weights: uniform kernel on a window around x0 (form assumed)
> fit <- lm(Y ~ X, weights=W)
Local Regression : Nearest Neighbor

> W <- (rank(abs(X - x0)) <= k) * 1   # weights: the k nearest neighbors of x0 (form assumed)
> fit <- lm(Y ~ X, weights=W)
Local Regression : Kernel Based Smoothing

> W <- dnorm((X - x0)/h)   # weights: Gaussian kernel (form assumed)
> fit <- lm(Y ~ X, weights=W)
> library(KernSmooth)
> library(sp)
Local Regression : Kernel Based Smoothing
> library(np)
> fit <- npreg(Y ~ X, data=db)   # call assumed
> predict(fit, newdata=data.frame(X=x))
From Linear to Generalized Linear Models
The (Gaussian) Linear Model and the logistic regression have been extended to the wide class of the exponential family,
$$f(y|\theta,\varphi)=\exp\left(\frac{y\theta-b(\theta)}{a(\varphi)}+c(y,\varphi)\right),$$
where $a(\cdot)$, $b(\cdot)$ and $c(\cdot)$ are functions, $\theta$ is the natural - canonical - parameter and $\varphi$ is a nuisance parameter.
The Gaussian distribution $\mathcal{N}(\mu,\sigma^2)$ belongs to this family
$$\theta=\underbrace{\mu}_{\mathbb{E}(Y)},\ \varphi=\underbrace{\sigma^2}_{\text{Var}(Y)},\ a(\varphi)=\varphi,\ b(\theta)=\theta^2/2$$
From Linear to Generalized Linear Models
The Bernoulli distribution $\mathcal{B}(p)$ belongs to this family
$$\theta=\log\frac{p}{1-p}=g_\star(\mathbb{E}(Y)),\ a(\varphi)=1,\ b(\theta)=\log(1+\exp(\theta)),\ \text{and}\ \varphi=1$$
where $g_\star(\cdot)$ is some link function (here the logistic transformation): the canonical link.
Canonical links are

binomial(link = "logit")
gaussian(link = "identity")
Gamma(link = "inverse")
inverse.gaussian(link = "1/mu^2")
poisson(link = "log")
quasi(link = "identity", variance = "constant")
quasibinomial(link = "logit")
quasipoisson(link = "log")
From Linear to Generalized Linear Models
Observe that
$$\mu=\mathbb{E}(Y)=b'(\theta)\ \text{and}\ \text{Var}(Y)=b''(\theta)=b''([b']^{-1}(\mu))=\underbrace{V(\mu)}_{\text{variance function}}$$
→ distributions are characterized by this variance function, e.g. $V(\mu)=1$ for the Gaussian family (homoscedastic models), $V(\mu)=\mu$ for the Poisson and $V(\mu)=\mu^2$ for the Gamma distribution, $V(\mu)=\mu^3$ for the inverse-Gaussian family.
Note that $g_\star(\cdot)=[b']^{-1}(\cdot)$ is the canonical link.
Tweedie (1984) suggested a power-type variance function $V(\mu)=\mu^\gamma$. When $\gamma\in[1,2]$, then $Y$ has a compound Poisson distribution with Gamma jumps.

> library(tweedie)
From the Exponential Family to GLMs
So far, there is no regression model. Assume that
$$f(y_i|\theta_i,\varphi)=\exp\left(\frac{y_i\theta_i-b(\theta_i)}{a(\varphi)}+c(y_i,\varphi)\right)\ \text{where}\ \theta_i=g_\star^{-1}(g(\boldsymbol{x}_i^{\mathsf T}\boldsymbol\beta))$$
so that the log-likelihood is
$$\mathcal{L}(\boldsymbol\theta,\varphi|\boldsymbol{y})=\prod_{i=1}^{n}f(y_i|\theta_i,\varphi)=\exp\left(\frac{\sum_{i=1}^{n}y_i\theta_i-\sum_{i=1}^{n}b(\theta_i)}{a(\varphi)}+\sum_{i=1}^{n}c(y_i,\varphi)\right).$$
To derive the first order condition, observe that we can write
$$\frac{\partial\log\mathcal{L}(\boldsymbol\theta,\varphi|y_i)}{\partial\beta_j}=\omega_{i,j}x_{i,j}[y_i-\mu_i]$$
for some $\omega_{i,j}$ (see e.g. Müller (2004)), which are simple when $g_\star=g$.
http://www.marlenemueller.de/publications/HandbookCS.pdf
From the Exponential Family to GLMs
The first order conditions can be written
$$\boldsymbol{X}^{\mathsf T}\boldsymbol{W}^{-1}[\boldsymbol{y}-\widehat{\boldsymbol\mu}]=\boldsymbol{0}$$
which are the first order conditions for a weighted linear regression model.
As for the logistic regression, $\boldsymbol{W}$ depends on unknown $\boldsymbol\beta$'s: use an iterative algorithm.
1. Set $\widehat{\boldsymbol\mu}_0=\boldsymbol{y}$, $\widehat{\boldsymbol\eta}_0=g(\widehat{\boldsymbol\mu}_0)$ and
$$\boldsymbol{z}_0=\widehat{\boldsymbol\eta}_0+(\boldsymbol{y}-\widehat{\boldsymbol\mu}_0)g'(\widehat{\boldsymbol\mu}_0).$$
Define $\boldsymbol{W}_0=\text{diag}[g'(\widehat{\boldsymbol\mu}_0)^2\text{Var}(\widehat{\boldsymbol{y}})]$ and fit a (weighted) linear regression of $\boldsymbol{z}_0$ on $\boldsymbol{X}$, i.e.
$$\widehat{\boldsymbol\beta}_1=[\boldsymbol{X}^{\mathsf T}\boldsymbol{W}_0^{-1}\boldsymbol{X}]^{-1}\boldsymbol{X}^{\mathsf T}\boldsymbol{W}_0^{-1}\boldsymbol{z}_0$$
2. Set $\widehat{\boldsymbol\eta}_k=\boldsymbol{X}\widehat{\boldsymbol\beta}_k$, $\widehat{\boldsymbol\mu}_k=g^{-1}(\widehat{\boldsymbol\eta}_k)$ and
$$\boldsymbol{z}_k=\widehat{\boldsymbol\eta}_k+(\boldsymbol{y}-\widehat{\boldsymbol\mu}_k)g'(\widehat{\boldsymbol\mu}_k).$$
From the Exponential Family to GLMs
Define $\boldsymbol{W}_k=\text{diag}[g'(\widehat{\boldsymbol\mu}_k)^2\text{Var}(\widehat{\boldsymbol{y}})]$ and fit a (weighted) linear regression of $\boldsymbol{z}_k$ on $\boldsymbol{X}$, i.e.
$$\widehat{\boldsymbol\beta}_{k+1}=[\boldsymbol{X}^{\mathsf T}\boldsymbol{W}_k^{-1}\boldsymbol{X}]^{-1}\boldsymbol{X}^{\mathsf T}\boldsymbol{W}_k^{-1}\boldsymbol{z}_k$$
and loop... until changes in $\widehat{\boldsymbol\beta}_{k+1}$ are (sufficiently) small. Then set $\widehat{\boldsymbol\beta}=\widehat{\boldsymbol\beta}_\infty$.
Under some technical conditions, we can prove that $\widehat{\boldsymbol\beta}\overset{\mathbb{P}}{\to}\boldsymbol\beta$ and
$$\sqrt{n}(\widehat{\boldsymbol\beta}-\boldsymbol\beta)\overset{\mathcal{L}}{\to}\mathcal{N}(\boldsymbol{0},I(\boldsymbol\beta)^{-1}),$$
where numerically $I(\boldsymbol\beta)=[\boldsymbol{X}^{\mathsf T}\boldsymbol{W}^{-1}\boldsymbol{X}]$.
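A minimal sketch of this recursion for a Poisson GLM with log link (canonical), on simulated data; everything below is an illustration, not from the deck:

> set.seed(1)
> X <- cbind(1, runif(200)); beta <- c(0.5, 1)
> y <- rpois(200, exp(X %*% beta))
> mu <- pmax(y, 0.5)                  # step 1: mu_0 = y, guarded away from 0
> for(k in 1:10){
+   eta <- log(mu)                    # eta = g(mu)
+   z <- eta + (y - mu) / mu          # z = eta + (y - mu) g'(mu), with g'(mu) = 1/mu
+   Winv <- diag(as.vector(mu))       # W = g'(mu)^2 Var(y) = 1/mu, so W^{-1} = mu
+   b <- solve(t(X) %*% Winv %*% X, t(X) %*% Winv %*% z)
+   mu <- as.vector(exp(X %*% b))
+ }
> cbind(b, coef(glm(y ~ X[,2], family=poisson)))   # the two columns agree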
From the Exponential Family to GLMs
We estimate $\varphi$ (see linear regression estimation) by
$$\widehat\varphi=\frac{1}{n-\text{dim}(\boldsymbol{X})}\sum_{i=1}^{n}\widehat\omega_{i,i}\frac{[y_i-\widehat\mu_i]^2}{\text{Var}(\widehat\mu_i)}$$
This asymptotic expression can be used to derive confidence intervals, or tests. But it might be a poor approximation when $n$ is small. See the use of bootstrap in claims reserving.
Those are theoretical results: in practice, the algorithm may fail to converge.
GLMs outside the Exponential Family?
Actually, it is possible to consider more general distributions, see Yee (2014)

> library(VGAM)
> vglm(y ~ x, family = Makeham)
> vglm(y ~ x, family = Gompertz)
> vglm(y ~ x, family = Erlang)
> vglm(y ~ x, family = Frechet)
> vglm(y ~ x, family = pareto1(location=100))

Those functions can also be used for a multivariate response $\boldsymbol{y}$.
https://www.stat.auckland.ac.nz/~yee/VGAM/doc/VGAMrefcard.pdf
GLM: Link and Distribution
GLM: Distribution?
From a computational point of view, the Poisson regression is not (really) related to the Poisson distribution.
Here we solve the first order conditions (or normal equations)
$$\sum_i[Y_i-\exp(\boldsymbol{X}_i^{\mathsf T}\boldsymbol\beta)]X_{i,j}=0\quad\forall j$$
with unconstrained $\boldsymbol\beta$, using Fisher's scoring technique $\boldsymbol\beta_{k+1}=\boldsymbol\beta_k-\boldsymbol{H}_k^{-1}\nabla_k$
where $\boldsymbol{H}_k=-\sum_i\exp(\boldsymbol{X}_i^{\mathsf T}\boldsymbol\beta_k)\boldsymbol{X}_i\boldsymbol{X}_i^{\mathsf T}$ and $\nabla_k=\sum_i\boldsymbol{X}_i^{\mathsf T}[Y_i-\exp(\boldsymbol{X}_i^{\mathsf T}\boldsymbol\beta_k)]$
→ There is no assumption here about $Y\in\mathbb{N}$: it is possible to run a Poisson regression on non-integers.
The Exposure and (Annual) Claim Frequency
In General Insurance, we should predict yearly claims frequency. Let $N_i$ denote the number of claims over one year for contract $i$.
We observe the contract only over a period of time $E_i$ (the exposure). Let $Y_i$ denote the observed number of claims, over the period $[0,E_i]$.
The Exposure and (Annual) Claim Frequency
Assuming that claims occurrence is driven by a Poisson process of intensity $\lambda$, if $N_i\sim\mathcal{P}(\lambda)$, then $Y_i\sim\mathcal{P}(\lambda\cdot E_i)$, where $\lambda$ is the annual frequency.
$$\mathcal{L}(\lambda,\boldsymbol{Y},\boldsymbol{E})=\prod_{i=1}^{n}\frac{e^{-\lambda E_i}[\lambda E_i]^{Y_i}}{Y_i!}$$
the first order condition is
$$\frac{\partial}{\partial\lambda}\log\mathcal{L}(\lambda,\boldsymbol{Y},\boldsymbol{E})=-\sum_{i=1}^{n}E_i+\frac{1}{\lambda}\sum_{i=1}^{n}Y_i=0$$
for
$$\widehat\lambda=\frac{\sum_{i=1}^{n}Y_i}{\sum_{i=1}^{n}E_i}=\sum_{i=1}^{n}\omega_i\frac{Y_i}{E_i}\ \text{where}\ \omega_i=\frac{E_i}{\sum_{i=1}^{n}E_i}$$
http://www.ub.edu/riskcenter
The Exposure and (Annual) Claim Frequency
Assume thatYi P(i Ei) where i = exp[XTi ].
Here E(Yi|Xi) = Var(Yi|Xi) = i = exp[XTi + logEi].
logL(;Y ) =ni=1
Yi [XTi + logEi] (exp[Xi] + logEi) log(Yi!)
1 > model model
http://www.ub.edu/riskcenter
Boosting
Boosting is a machine learning ensemble meta-algorithm for reducing biasprimarily and also variance in supervised learning, and a family of machinelearning algorithms which convert weak learners to strong ones. (source:Wikipedia)
The heuristics is simple: we consider an iterative process where we keep modelingthe errors.
Fit model for y, m1() from y and X, and compute the error, 1 = y m1(X).
Fit model for 1, m2() from 1 and X, and compute the error,2 = 1 m2(X), etc. Then set
m() = m1() y
+m2() 1
+m3() 2
+ +mk() k1
https://en.wikipedia.org/wiki/Boosting_(machine_learning)
Boosting
With (very) general notations, we want to solve
$$m^\star=\text{argmin}\{\mathbb{E}[\ell(Y,m(X))]\}$$
for some loss function $\ell$.
It is an iterative procedure: assume that at some step $k$ we have an estimator $m_k(X)$. Why not construct a new model that might improve our model,
$$m_{k+1}(X)=m_k(X)+h(X).$$
What could $h(\cdot)$ be?
Boosting
In a perfect world, $h(X)=y-m_k(X)$, which can be interpreted as a residual.
Note that this residual is the gradient of $\frac{1}{2}[y-m(x)]^2$.
A gradient descent is based on the Taylor expansion
$$\underbrace{f(x_k)}_{\langle f,x_k\rangle}\sim\underbrace{f(x_{k-1})}_{\langle f,x_{k-1}\rangle}+\underbrace{(x_k-x_{k-1})}_{\alpha}\underbrace{\nabla f(x_{k-1})}_{\langle\nabla f,x_{k-1}\rangle}$$
But here, it is different. We claim we can write
$$\underbrace{f_k(x)}_{\langle f_k,x\rangle}\sim\underbrace{f_{k-1}(x)}_{\langle f_{k-1},x\rangle}+\underbrace{(f_k-f_{k-1})}_{\beta}\underbrace{\star}_{\langle f_{k-1},\nabla x\rangle}$$
where $\star$ is interpreted as a gradient.
Boosting
Here, $f_k$ is a $\mathbb{R}^d\to\mathbb{R}$ function, so the gradient should be in such a (big) functional space → we want to approximate that function.
$$m_k(\boldsymbol{x})=m_{k-1}(\boldsymbol{x})+\underset{f\in\mathcal{F}}{\text{argmin}}\left\{\sum_{i=1}^{n}\ell(Y_i,m_{k-1}(\boldsymbol{x})+f(\boldsymbol{x}))\right\}$$
where $f\in\mathcal{F}$ means that we seek in a class of weak learner functions.
If learners are too strong, the first loop leads to some fixed point, and there is no learning procedure; see linear regression $y=\boldsymbol{x}^{\mathsf T}\boldsymbol\beta+\varepsilon$: since $\varepsilon\perp\boldsymbol{x}$ we cannot learn from the residuals.
Boosting with some Shrinkage
Consider here some quadratic loss function.
In order to make sure that we learn weakly, we can use some shrinkage parameter $\nu$ (or collection of parameters $\nu_j$) so that
$$\mathbb{E}[Y|X=x]=m(x)\sim m_M(x)=\sum_{j=1}^{M}\nu_j h_j(x)$$
The problem is always the same. At stage $j$, we should solve
$$\min_{h(\cdot)}\left\{\sum_{i=1}^{n}[\underbrace{y_i-m_{j-1}(\boldsymbol{x}_i)}_{\varepsilon_{i,j-1}}-h(\boldsymbol{x}_i)]^2\right\}$$
Boosting with some Shrinkage
The algorithm is then
• start with some (simple) model $y=h_1(x)$
• compute the residuals (including $\nu$), $\varepsilon_1=y-\nu h_1(x)$
and at step $j$,
• consider some (simple) model $\varepsilon_j=h_j(x)$
• compute the residuals (including $\nu$), $\varepsilon_{j+1}=\varepsilon_j-\nu h_j(x)$
and loop. And set finally
$$\widehat{y}=\sum_{j=1}^{M}\nu h_j(x)$$
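A compact sketch of that loop (not from the deck), with rpart regression trees as the simple models and toy data:

> library(rpart)
> set.seed(1)
> x <- runif(300, 0, 10); y <- sin(x) + rnorm(300, sd=.3)
> db <- data.frame(x=x, eps=y)
> nu <- 0.1; M <- 100
> yhat <- rep(0, 300)
> for(j in 1:M){
+   h <- rpart(eps ~ x, data=db, maxdepth=2)   # weak learner fitted on the residuals
+   yhat <- yhat + nu * predict(h, db)
+   db$eps <- db$eps - nu * predict(h, db)     # update the residuals
+ }
> plot(x, y); points(x, yhat, col="red", cex=.4)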
Boosting with Piecewise Linear Spline Functions
Boosting with Trees (Stump Functions)
Boosting for Classification
Still seek $m^\star(\cdot)=\text{argmin}\{\mathbb{E}[\ell(Y,m(X))]\}$
Here $y\in\{-1,+1\}$, and use $\ell(y,m(x))=e^{-y\cdot m(x)}$: the AdaBoost algorithm.
Note that
$$\mathbb{P}[Y=+1|X=x]=\frac{1}{1+e^{-2m^\star(x)}}$$
cf. the logit transform... It can be seen as an iteration on weights. At step $k$, solve
$$\underset{h(\cdot)}{\text{argmin}}\left\{\sum_{i=1}^{n}\underbrace{e^{-y_im_k(\boldsymbol{x}_i)}}_{\omega_{i,k}}e^{-y_ih(\boldsymbol{x}_i)}\right\}$$
Boosting for Classification
> gbm.step(data=myocarde, gbm.x = 1:7, gbm.y = 8,
+   family = "bernoulli", tree.complexity = 5,
+   learning.rate = 0.01, bag.fraction = 0.5)

[Figure: holdout deviance against the number of trees (200-1000); PRONO01, d 5, lr 0.01]
Exponential distribution, deviance, loss function, residuals, etc
Gaussian distribution ↔ $\ell_2$ loss function
Deviance is $\sum_{i=1}^{n}(y_i-m(\boldsymbol{x}_i))^2$, with gradient $\varepsilon_i=y_i-m(\boldsymbol{x}_i)$
Laplace distribution ↔ $\ell_1$ loss function
Deviance is $\sum_{i=1}^{n}|y_i-m(\boldsymbol{x}_i)|$, with gradient $\varepsilon_i=\text{sign}(y_i-m(\boldsymbol{x}_i))$
Exponential distribution, deviance, loss function, residuals, etc
Bernoulli $\{-1,+1\}$ distribution ↔ adaboost loss function
Deviance is $\sum_{i=1}^{n}e^{-y_im(\boldsymbol{x}_i)}$, with gradient $\varepsilon_i=y_i\,e^{-y_im(\boldsymbol{x}_i)}$
Bernoulli $\{0,1\}$ distribution
Deviance is $2\sum_{i=1}^{n}\left[y_i\log\left(\frac{y_i}{m(\boldsymbol{x}_i)}\right)+(1-y_i)\log\left(\frac{1-y_i}{1-m(\boldsymbol{x}_i)}\right)\right]$ with gradient
$$\varepsilon_i=y_i-\frac{\exp[m(\boldsymbol{x}_i)]}{1+\exp[m(\boldsymbol{x}_i)]}$$
Poisson distribution
Deviance is $2\sum_{i=1}^{n}\left(y_i\log\left(\frac{y_i}{m(\boldsymbol{x}_i)}\right)-[y_i-m(\boldsymbol{x}_i)]\right)$ with gradient $\varepsilon_i=\frac{y_i-m(\boldsymbol{x}_i)}{m(\boldsymbol{x}_i)}$
Regularized GLM
In Regularized GLMs, we introduce a penalty in the loss function (the deviance), see e.g. the $\ell_1$ regularized logistic regression
$$\max\left\{\sum_{i=1}^{n}\left(y_i[\beta_0+\boldsymbol{x}_i^{\mathsf T}\boldsymbol\beta]-\log[1+e^{\beta_0+\boldsymbol{x}_i^{\mathsf T}\boldsymbol\beta}]\right)-\lambda\sum_{j=1}^{k}|\beta_j|\right\}$$

> library(glmnet)
> y <- myocarde$PRONO
> x <- as.matrix(myocarde[,1:7])
> glm_ridge <- glmnet(x, y, family="binomial")   # remaining arguments assumed
> plot(glm_ridge)

[Figure: coefficient paths against the l1 norm, for FRCAR, INCAR, INSYS, PRDIA, PAPUL, PVENT, REPUL]
Collective vs. Individual Model
Consider a Tweedie distribution, with variance function power $p\in(1,2)$, mean $\mu$ and scale parameter $\varphi$. Then it is a compound Poisson model,
• $N\sim\mathcal{P}(\lambda)$ with $\lambda=\varphi\dfrac{\mu^{2-p}}{2-p}$
• $Y_i\sim\mathcal{G}(\alpha,\beta)$ with $\alpha=-\dfrac{p-2}{p-1}$ and $\beta=\varphi\dfrac{\mu^{1-p}}{p-1}$
Conversely, consider a compound Poisson model $N\sim\mathcal{P}(\lambda)$ and $Y_i\sim\mathcal{G}(\alpha,\beta)$,
• variance function power is $p=\dfrac{\alpha+2}{\alpha+1}$
• mean is $\mu=\dfrac{\lambda\alpha}{\beta}$
• scale parameter is $\varphi=\dfrac{[\lambda\alpha]^{\frac{\alpha+2}{\alpha+1}-1}\beta^{2-\frac{\alpha+2}{\alpha+1}}}{\alpha+1}$
seems to be equivalent... but it's not.
Collective vs. Individual Model
In the context of regression
• $N_i\sim\mathcal{P}(\lambda_i)$ with $\lambda_i=\exp[\boldsymbol{X}_i^{\mathsf T}\boldsymbol\beta_\lambda]$
• $Y_{j,i}\sim\mathcal{G}(\mu_i,\alpha)$ with $\mu_i=\exp[\boldsymbol{X}_i^{\mathsf T}\boldsymbol\beta_\mu]$
Then $S_i=Y_{1,i}+\cdots+Y_{N,i}$ has a Tweedie distribution
• variance function power is $p=\dfrac{\alpha+2}{\alpha+1}$
• mean is $\lambda_i\mu_i$
• scale parameter is $\varphi=\lambda_i^{-\frac{1}{\alpha+1}}\mu_i^{\frac{\alpha}{\alpha+1}}\left(\dfrac{1+\alpha}{\alpha}\right)$
There are $1+2\,\text{dim}(\boldsymbol{X})$ degrees of freedom.
Collective vs. Individual Model
Note that the scale parameter should not depend on $i$. A Tweedie regression is
• variance function power $p\in(1,2)$
• mean $\mu_i=\exp[\boldsymbol{X}_i^{\mathsf T}\boldsymbol\beta_{\text{Tweedie}}]$
• scale parameter $\varphi$
There are $2+\text{dim}(\boldsymbol{X})$ degrees of freedom.
Note that one can easily boost a Tweedie model

> library(TDboost)
Part 3. Model Choice, Feature Selection, etc.
AIC, BIC
AIC and BIC are both maximum-likelihood driven, and penalize useless parameters (to avoid overfitting)
$$AIC=-2\log[\text{likelihood}]+2k\ \text{and}\ BIC=-2\log[\text{likelihood}]+\log(n)\cdot k$$
AIC focuses on overfit, while BIC depends on $n$ so it might also avoid underfit.
BIC penalizes complexity more than AIC does.
Minimizing AIC ⇔ minimizing the leave-one-out cross-validation value, Stone (1977).
Minimizing BIC ⇔ $k$-fold leave-out cross-validation, Shao (1997), with $k=n[1-(\log n-1)^{-1}]$
→ used in econometric stepwise procedures.
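For instance, on the logistic fit from Part 1:

> AIC(logistic)          # -2 logLik + 2k
> BIC(logistic)          # -2 logLik + log(n) k
> step(logistic, k=2)    # stepwise selection with an AIC penalty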
http://www.jstor.org/stable/2984877
http://www3.stat.sinica.edu.tw/statistica/oldpdf/A7n21.pdf
Cross-Validation
Formally, the leave-one-out cross validation is based on
$$CV=\frac{1}{n}\sum_{i=1}^{n}\ell(y_i,\widehat{m}_{-i}(\boldsymbol{x}_i))$$
where $\widehat{m}_{-i}$ is obtained by fitting the model on the sample where observation $i$ has been dropped, e.g.
$$CV=\frac{1}{n}\sum_{i=1}^{n}[y_i-\widehat{m}_{-i}(\boldsymbol{x}_i)]^2$$
The Generalized cross-validation, for a quadratic loss function, is defined as
$$GCV=\frac{1}{n}\sum_{i=1}^{n}\left[\frac{y_i-\widehat{m}_{-i}(\boldsymbol{x}_i)}{1-\text{trace}(\boldsymbol{S})/n}\right]^2$$
Cross-Validation for kernel based local regression
Econometric approach
Define $\widehat{m}(x)=\widehat\beta_0^{[x]}+\widehat\beta_1^{[x]}x$ with
$$(\widehat\beta_0^{[x]},\widehat\beta_1^{[x]})=\underset{(\beta_0,\beta_1)}{\text{argmin}}\left\{\sum_{i=1}^{n}\omega_{h^\star}^{[x]}(x_i)\,[y_i-(\beta_0+\beta_1x_i)]^2\right\}$$
where $h^\star$ is given by some rule of thumb (see previous discussion).

[Figure: local regression fit on a scatterplot of simulated data]
Cross-Validation for kernel based local regression
Bootstrap based approach
Use bootstrap samples, compute $h_b^\star$, and get the $\widehat{m}_b(x)$'s.

[Figure: bootstrap local regression fits (left) and the distribution of the optimal bandwidths (right)]
Cross-Validation for kernel based local regression
Statistical learning approach (Cross Validation (leave-one-out))
Given $i\in\{1,\cdots,n\}$, given $h$, solve
$$(\widehat\beta_0^{[(i),h]},\widehat\beta_1^{[(i),h]})=\underset{(\beta_0,\beta_1)}{\text{argmin}}\left\{\sum_{j\neq i}\omega_h^{(i)}(x_j)\,[y_j-(\beta_0+\beta_1x_j)]^2\right\}$$
and compute $\widehat{m}_{(i)}^{[h]}(x_i)=\widehat\beta_0^{[(i),h]}+\widehat\beta_1^{[(i),h]}x_i$. Define
$$\text{mse}(h)=\sum_{i=1}^{n}[y_i-\widehat{m}_{(i)}^{[h]}(x_i)]^2$$
and set $h^\star=\text{argmin}\{\text{mse}(h)\}$.
Then compute $\widehat{m}(x)=\widehat\beta_0^{[x]}+\widehat\beta_1^{[x]}x$ with
$$(\widehat\beta_0^{[x]},\widehat\beta_1^{[x]})=\underset{(\beta_0,\beta_1)}{\text{argmin}}\left\{\sum_{i=1}^{n}\omega_{h^\star}^{[x]}(x_i)\,[y_i-(\beta_0+\beta_1x_i)]^2\right\}$$
Cross-Validation for kernel based local regression
Statistical learning approach (Cross Validation (k-fold))
Given $I\subset\{1,\cdots,n\}$, given $h$, solve
$$(\widehat\beta_0^{[(I),h]},\widehat\beta_1^{[(I),h]})=\underset{(\beta_0,\beta_1)}{\text{argmin}}\left\{\sum_{j\notin I}\omega_h^{(I)}(x_j)\,[y_j-(\beta_0+\beta_1x_j)]^2\right\}$$
and compute $\widehat{m}_{(I)}^{[h]}(x_i)=\widehat\beta_0^{[(I),h]}+\widehat\beta_1^{[(I),h]}x_i$, $\forall i\in I$. Define
$$\text{mse}(h)=\sum_{I}\sum_{i\in I}[y_i-\widehat{m}_{(I)}^{[h]}(x_i)]^2$$
and set $h^\star=\text{argmin}\{\text{mse}(h)\}$.
Then compute $\widehat{m}(x)=\widehat\beta_0^{[x]}+\widehat\beta_1^{[x]}x$ with
$$(\widehat\beta_0^{[x]},\widehat\beta_1^{[x]})=\underset{(\beta_0,\beta_1)}{\text{argmin}}\left\{\sum_{i=1}^{n}\omega_{h^\star}^{[x]}(x_i)\,[y_i-(\beta_0+\beta_1x_i)]^2\right\}$$
Cross-Validation for Ridge & Lasso
> library(glmnet)
> y <- myocarde$PRONO
> x <- as.matrix(myocarde[,1:7])
> cvfit <- cv.glmnet(x, y, alpha=0, family="binomial")   # ridge; alpha assumed
> cvfit$lambda.min
[1] 0.0408752
> plot(cvfit)
> cvfit <- cv.glmnet(x, y, alpha=1, family="binomial")   # lasso; alpha assumed
> cvfit$lambda.min
[1] 0.03315514
> plot(cvfit)

[Figure: cross-validated binomial deviance against log lambda, for ridge (left) and lasso (right)]
Variable Importance for Trees
Given some random forest with $M$ trees, set
$$I(X_k)=\frac{1}{M}\sum_m\sum_t\frac{N_t}{N}\Delta I(t)$$
where the first sum is over all trees, and the second one is over all nodes where the split is done based on variable $X_k$.

> RF = randomForest(PRONO ~ ., data=myocarde)
> varImpPlot(RF, main="")
> importance(RF)
      MeanDecreaseGini
FRCAR         1.107222
INCAR         8.194572
INSYS         9.311138
PRDIA         2.614261
PAPUL         2.341335
PVENT         3.313113
REPUL         7.078838
Partial Response Plots
One can also compute Partial Response Plots,
$$x\mapsto\frac{1}{n}\sum_{i=1}^{n}\widehat{\mathbb{E}}[Y|X_k=x,\boldsymbol{X}_{i,(k)}=\boldsymbol{x}_{i,(k)}]$$

> importanceOrder <- order(-importance(RF))
> names <- rownames(importance(RF))[importanceOrder]
> for (name in names)
+   partialPlot(RF, myocarde, eval(name), col="red",
+     main="", xlab=name)
Feature Selection
Use Mallows' $C_p$, from Mallows (1973), on all subsets of predictors, in a regression
$$C_p=\frac{1}{S^2}\sum_{i=1}^{n}[Y_i-\widehat{Y}_i]^2-n+2p,$$

> library(leaps)
> y <- as.numeric(myocarde$PRONO)   # y, x assumed
> x <- data.frame(myocarde[,1:7])
> selec = leaps(x, y, method="Cp")
> plot(selec$size-1, selec$Cp)
http://www.stat.washington.edu/courses/stat527/s13/readings/technometrics1973.pdf
Feature Selection
Use the random forest algorithm, removing the least relevant features at each iteration.
The algorithm uses shadow attributes (obtained from existing features by shuffling the values).

> library(Boruta)
> B <- Boruta(PRONO ~ ., data=myocarde)   # default arguments assumed
> plot(B)

[Figure: boxplots of variable importance (5-20) for PVENT, PAPUL, PRDIA, REPUL, INCAR, INSYS]
Feature Selection
Use random forests, and variable importance plots

> library(varSelRF)
> library(varSelRFBoot)
> X <- as.matrix(myocarde[,1:7])
> Y <- myocarde$PRONO
> library(randomForest)
> rf <- randomForest(X, Y)
> V <- varSelRF(X, Y)        # calls reconstructed with default arguments
> VB <- varSelRFBoot(X, Y)
> plot(VB)

[Figure: OOB Error rate vs. Number of variables in predictor]
@freakonometrics 144
http://www.ub.edu/riskcenter
ROC (and beyond)

Counts, with the true class Y in columns (prevalence = N•1/n) and the predicted class Ŷ in rows:

            Y = 0                           Y = 1
Ŷ = 0      true negative, N00              false negative (type II), N01
Ŷ = 1      false positive (type I), N10    true positive, N11

With row totals N0•, N1• and column totals N•0, N•1, one can define
negative predictive value NPV = N00/N0• and false omission rate FOR = N01/N0•
false discovery rate FDR = N10/N1• and positive predictive value (precision) PPV = N11/N1•
true negative rate (specificity) TNR = N00/N•0 and false positive rate (fall out) FPR = N10/N•0
false negative rate FNR = N01/N•1 and true positive rate (sensitivity) TPR = N11/N•1
negative likelihood ratio LR− = FNR/TNR and positive likelihood ratio LR+ = TPR/FPR
diagnostic odds ratio = LR+/LR−
Comparing Classifiers: ROC Curves
> library(randomForest)
> fit <- randomForest(PRONO ~ ., data = train_myocarde)
> train_Y <- (train_myocarde$PRONO == "Survival")
> test_Y  <- (test_myocarde$PRONO == "Survival")
> train_S <- predict(fit, type = "prob", newdata = train_myocarde)[ ,2]
> test_S  <- predict(fit, type = "prob", newdata = test_myocarde)[ ,2]
> vp <- seq(0, 1, length = 101)
> roc_train <- t(Vectorize(function(u) roc.curve(train_Y, train_S, s = u))(vp))
> roc_test  <- t(Vectorize(function(u) roc.curve(test_Y, test_S, s = u))(vp))
> plot(roc_train, type = "b", col = "blue", xlim = 0:1, ylim = 0:1)
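The roc.curve() helper used above is not defined in the slides; a minimal version, consistent with how it is called (false and true positive rates at threshold s), could be:

roc.curve <- function(Y, S, s){
  Yhat <- (S > s)                       # classify with threshold s
  c(FPR = sum(Yhat & !Y) / sum(!Y),     # false positive rate
    TPR = sum(Yhat &  Y) / sum( Y))     # true positive rate
}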
Comparing Classifiers: ROC Curves
The Area Under the Curve (AUC) can be interpreted as the probability that a classifier will rank a randomly chosen positive instance higher than a randomly chosen negative one, see Swets, Dawes & Monahan (2000).
Many other quantities can be computed, see
> library(hmeasure)
> HMeasure(Y, S)$metrics[ ,1:5]
Class labels have been switched from (Death, Survival) to (0,1)
               H      Gini       AUC      AUCH        KS
scores 0.7323154 0.8834154 0.9417077 0.9568966 0.8144499
with the H-measure (see hmeasure.net), Gini and AUC, as well as the area under the convex hull (AUCH).
http://ist-socrates.berkeley.edu/~maccoun/LP_SwetsDawesMonahan2000.pdf
http://www.hmeasure.net/
Comparing Classifiers: ROC Curves
Consider our previous logistic regression (on heart attacks):
> logistic <- glm(PRONO ~ ., data = myocarde, family = binomial)
> Y <- myocarde$PRONO
> S <- predict(logistic, type = "response")
> library(ROCR)
> pred <- prediction(S, Y)
> perf <- performance(pred, "tpr", "fpr")
> plot(perf)
(figure: ROC curve, true positive rate against false positive rate)
Comparing Classifiers: ROC Curves
One can get confidence bands (obtained using bootstrap procedures):
> library(pROC)
> roc <- plot.roc(Y, S, percent = TRUE, ci = TRUE)
> roc.se <- ci.se(roc, specificities = seq(0, 100, 5))
> plot(roc.se, type = "shape", col = "light blue")
(figure: ROC curve with bootstrap confidence band, sensitivity against specificity, in %)
See also the gains package for Gains and Lift curves:
> library(gains)
Comparing Classifiers: Accuracy and Kappa
The Kappa statistic compares an Observed Accuracy with an Expected Accuracy (random chance), see Landis & Koch (1977).
            Y = 0   Y = 1
Ŷ = 0       TN      FN      TN+FN
Ŷ = 1       FP      TP      FP+TP
            TN+FP   FN+TP   n

See also the Observed and Random (expected under independence) confusion tables:

            Y = 0   Y = 1                       Y = 0   Y = 1
Ŷ = 0       25      3       28      Ŷ = 0       11.44   16.56   28
Ŷ = 1       4       39      43      Ŷ = 1       17.56   25.44   43
            29      42      71                  29      42      71
$$\text{total accuracy}=\frac{TP+TN}{n}\approx 90.14\%$$
$$\text{random accuracy}=\frac{[TN+FN]\cdot[TN+FP]+[TP+FP]\cdot[TP+FN]}{n^2}\approx 51.93\%$$
$$\kappa=\frac{\text{total accuracy}-\text{random accuracy}}{1-\text{random accuracy}}\approx 79.48\%$$
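Those three quantities can be reproduced directly from the observed table:

TN <- 25; FN <- 3; FP <- 4; TP <- 39
n  <- TN + FN + FP + TP                               # 71
p_obs <- (TP + TN) / n                                # 0.9014
p_exp <- ((TN+FN)*(TN+FP) + (TP+FP)*(TP+FN)) / n^2    # 0.5193
(p_obs - p_exp) / (1 - p_exp)                         # kappa, 0.7948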
http://www.jstor.org/pss/2529310
Comparing Models on the myocarde Dataset
If we average over all training samples:
(figure: distribution of out-of-sample performance, across training samples, for loess, glm, aic, knn, svm, tree, bag, nn, boost, gbm and rf)
Gini and Lorenz Type Curves
Consider an ordered sample $\{y_1,\dots,y_n\}$ of incomes, with $y_1\le y_2\le\cdots\le y_n$; then the Lorenz curve is
$$\{F_i,L_i\}\ \text{ with }\ F_i=\frac{i}{n}\ \text{ and }\ L_i=\frac{\sum_{j=1}^i y_j}{\sum_{j=1}^n y_j}$$
> L <- cumsum(sort(y)) / sum(y)   # cumulated income shares (assumed continuation of the truncated line)
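For instance, to draw the empirical curve (y being an assumed income vector):

n <- length(y)
L <- cumsum(sort(y)) / sum(y)
plot(c(0, (1:n)/n), c(0, L), type = "l", xlab = "F", ylab = "L")
abline(a = 0, b = 1, lty = 2)          # line of perfect equality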
Gini and Lorenz Type Curves
The theoretical curve, given a distribution $F$, is
$$u\mapsto L(u)=\frac{\int_{-\infty}^{F^{-1}(u)}t\,dF(t)}{\int_{-\infty}^{+\infty}t\,dF(t)}$$
see Gastwirth (1972).
One can also sort them from high to low incomes, $y_1\ge y_2\ge\cdots\ge y_n$:
> L <- cumsum(sort(y, decreasing = TRUE)) / sum(y)   # assumed continuation of the truncated line
Gini and Lorenz Type Curves
We want to compare two regression models, $m_1(\cdot)$ and $m_2(\cdot)$, in the context of insurance pricing, see Frees, Meyers & Cummins (2014). We have observed losses $y_i$ and premiums $\widehat m(\boldsymbol x_i)$. Consider a sample ordered by the model,
$$\widehat m(\boldsymbol x_1)\ge \widehat m(\boldsymbol x_2)\ge\cdots\ge \widehat m(\boldsymbol x_n),$$
then plot $\{F_i,L_i\}$ with $F_i=\dfrac{i}{n}$ and $L_i=\dfrac{\sum_{j=1}^i y_j}{\sum_{j=1}^n y_j}$
> L <- cumsum(y[order(pred, decreasing = TRUE)]) / sum(y)   # pred: fitted premiums (assumed continuation)
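A sketch to compare two models on the same portfolio, where pred1 and pred2 are assumed vectors of fitted premiums:

lorenz <- function(y, pred){
  ord <- order(pred, decreasing = TRUE)        # sort policies by model premium
  cbind(F = seq_along(y) / length(y),
        L = cumsum(y[ord]) / sum(y))
}
plot(lorenz(y, pred1), type = "l", col = "blue")
lines(lorenz(y, pred2), col = "red")
abline(a = 0, b = 1, lty = 2)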
Gini and Lorenz Type Curves
See Frees et al. (2010) or Tevet (2013).
https://www.casact.org/education/spring/2010/handouts/C17-Frees2.pdf
http://www.casact.org/newsletter/index.cfm?fa=viewart&id=6540
Model Selection, or Aggregation?
We have $k$ models, $\widehat m_1(\boldsymbol x),\dots,\widehat m_k(\boldsymbol x)$, for the same $y$-variable; they can be trees, svm, regressions, etc.

Instead of selecting the best model, why not consider
$$m^\star(\boldsymbol x)=\sum_{j=1}^k\omega_j\,\widehat m_j(\boldsymbol x)$$
for some weights $\omega_j$.

New problem: solve
$$\min_{\omega_1,\dots,\omega_k}\left\{\sum_{i=1}^n\ell\left(y_i-\sum_{j=1}^k\omega_j\,\widehat m_j(\boldsymbol x_i)\right)\right\}$$
Note that it might be interesting to regularize, by adding a Lasso-type penalty term, based on $\lambda\sum_{j=1}^k|\omega_j|$.
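With a squared loss, a minimal sketch of that program, where M is an (assumed) n × k matrix whose columns are the model predictions:

obj <- function(w) sum((y - M %*% w)^2)          # quadratic loss of the blend
k   <- ncol(M)
opt <- optim(rep(1/k, k), obj, method = "L-BFGS-B", lower = 0)   # nonnegative weights
w   <- opt$par / sum(opt$par)                    # normalized to sum to one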
Part 4. Visualization and Maps
Projection(s)
"When people thought the earth was flat, they were wrong. Whenpeople thought the earth was spherical, they were wrong. But ifyou think that thinking the earth is spherical is just as wrong asthinking the earth is flat, then your view is wronger than both ofthem put together." Isaac Asimov, cited in kjordahl.github.io
Need some Spatial Reference Systems
E.g. EPSG:4326 latitude, longitude in WGS-84 coordinate system
E.g. EPSG:900913 and EPSG:3857: Google spherical Mercator
E.g. ESRI:102718: NAD 1983 State Plane New York Long Island
http://kjordahl.github.io/SciPy2013/
library(rgdal)
library(maptools)
Proj <- CRS("+proj=longlat +datum=WGS84")   # assumed continuation of the truncated line
library(sp)
library(rworldmap)
data(countriesLow)

crs.WGS84 <- CRS("+proj=longlat +datum=WGS84")   # assumed continuation of the truncated line
Projection(s)
Source: explainxkcd.com
http://www.explainxkcd.com/wiki/index.php/977:_Map_Projections
Maps and Polygons
> library(rgdal)
> ita1 <- readOGR(dsn = "italy", layer = "ITA_adm1")   # GADM level-1 polygons, path assumed
> plot(ita1, col = "light green")
Maps and Polygons
1 > ita1@data [,"NAME_1" ][1:6]
2 [1] Abruzzo Apulia Basilicata Calabria
3 [5] Campania Emilia - Romagna
4 > pos Poly_ Sicily plot(ita1 ,col=" light green ")
7 > plot(Poly_Sicily ,col=" yellow ",add=TRUE)
Maps and Polygons
1 > plot(ita1 ,col=" light green ")
2 > plot(Poly_Sicily ,col=" yellow ",add=TRUE)
3 > abline (v=5:20 , col=" light blue")
4 > abline (h=35:50 , col=" light blue")
5 > axis (1)
6 > axis (2)
5 10 15 20
3638
4042
4446
Maps and Polygons
> pos <- which(ita1$NAME_1 %in% c("Abruzzo", "Apulia", "Basilicata",
+                                 "Calabria", "Campania"))   # list of southern regions assumed
> ita_south <- ita1[pos,]
> ita_north <- ita1[-pos,]
> plot(ita1)
> plot(ita_south, col = "red", add = TRUE)
> plot(ita_north, col = "blue", add = TRUE)
Maps and Polygons
> library(xlsx)
> data_codes <- read.xlsx(..., sheetIndex = 1)      # file name truncated in the original slide
> names(data_codes)[1] <- "NAME_1"
> ita2 <- merge(ita1, data_codes, by = "NAME_1")    # assumed merge with region-level data
Maps and Polygons
Here we have only used colors, but it is also possible to merge the polygons together,
> library(rgeos)
> ita_s <- gUnionCascaded(ita_south)
> ita_n <- gUnionCascaded(ita_north)
> plot(ita1)
> plot(ita_s, col = "red", add = TRUE)
> plot(ita_n, col = "blue", add = TRUE)
Maps and Polygons
On polygons, it is also possible to visualize centroids,
1 > plot(ita1 ,col=" light green ")
2 > plot(Poly_Sicily ,col=" yellow ",add=TRUE)
3 > gCentroid (Poly_Sicily ,byid=TRUE)
4 SpatialPoints :
5 x y
6 14 14.14668 37.58842
7 Coordinate Reference System (CRS) arguments :
8 +proj= longlat + ellps = WGS84
9 + datum = WGS84 +no_defs + towgs84 =0 ,0 ,0
10 > points ( gCentroid (Poly_Sicily ,byid=TRUE),
pch =19 , col="red")
Maps and Polygons
or
> G <- as.data.frame(gCentroid(ita1, byid = TRUE))   # coerced, to access G$x and G$y
> plot(ita1, col = "light green")
> text(G$x, G$y, 1:20)
(figure: the 20 region centroids, labelled 1 to 20)
Maps and Polygons
Consider two trajectories, characterized by a series of knots (from the centroids list):
> c1 <- c(...)   # indices of the knots, truncated in the original slide
> c2 <- c(...)
> L1 <- SpatialLines(list(Lines(list(Line(G[c1,])), "path1")))   # construction assumed
> L2 <- SpatialLines(list(Lines(list(Line(G[c2,])), "path2")))
> L1@proj4string <- ita1@proj4string
> L2@proj4string <- ita1@proj4string
> cross_road <- gIntersection(L1, L2)   # assumed: crossing point(s) of the two paths
> plot(ita1, col = "light green")
> plot(L1, col = "red", lwd = 2, add = TRUE)
> plot(L2, col = "blue", lwd = 2, add = TRUE)
> plot(cross_road, pch = 19, cex = 1.5, add = TRUE)
Maps and Polygons
To add elements on maps (a graticule, the surrounding sea, etc.), consider, e.g.
plot(ita1, col = "light green")
grat <- gridlines(ita1)                   # assumed continuation of the truncated line
plot(grat, add = TRUE, col = "light blue")
Maps, with R
There are 36,552 communes in France. Why not illustrate with some smaller datasets? Paris is split into 850 IRIS areas (Îlots Regroupés pour des Indicateurs Statistiques, see insee.fr).
library(RColorBrewer)
plotclr <- brewer.pal(7, "RdYlBu")   # palette assumed, continuation of the truncated line
Maps, with R
To visualize all those polygons, use
> plot(paris)
(figure: the 850 IRIS polygons of Paris)
Maps, with R
Let us focus on one very specific polygon (the 100th):
> length(paris)
[1] 850
> poly_paris <- SpatialPolygons2PolySet(paris)   # PolySet representation (maptools/PBSmapping), assumed
> sub_poly <- poly_paris[poly_paris$PID == 100,]
> sub_poly
   PID SID POS        X        Y
1  100   1   1 2.332904 48.85836
2  100   1   2 2.333240 48.85826
3  100   1   3 2.331740 48.85644
18 100   1  18 2.332460 48.85850
19 100   1  19 2.332904 48.85836
> plot(sub_poly[, c("X","Y")])
> polygon(sub_poly[, c("X","Y")])
Maps, with R
Given some GPS coordinates, we want to know in which IRIS it is located.
Use the point.in.polygon function (over all possible polygons... it can be long):
> library(sp)
> point <- c(2.33, 48.859)                      # GPS coordinates (lon, lat), assumed
> point.in.polygon(point[1], point[2], sub_poly[,"X"], sub_poly[,"Y"])
[1] 1
> point_in_i <- function(i, point)
+   point.in.polygon(point[1], point[2], poly_paris[poly_paris$PID == i, "X"],
+                    poly_paris[poly_paris$PID == i, "Y"])
> where_is_point <- function(point)
+   which(Vectorize(function(i) point_in_i(i, point))(1:length(paris)) > 0)
Maps, with R
> where_is_point(point)
[1] 100
> library(RColorBrewer)
> plotclr <- brewer.pal(7, "RdYlBu")
> vizualize_point <- function(point){    # sketch of the helper, truncated in the original
+   plot(paris)
+   polygon(poly_paris[poly_paris$PID == where_is_point(point), c("X","Y")],
+           col = plotclr[1]) }
Maps, with R
Simplify using bounding boxes, then use the previous function only on a few possible candidates:
> minmax <- function(i){
+   vect_xy <- poly_paris[poly_paris$PID == i, c("X","Y")]
+   box_xy <- c(min(vect_xy[,"X"]), min(vect_xy[,"Y"]),
+               max(vect_xy[,"X"]), max(vect_xy[,"Y"]))
+   return(box_xy)}
> do.call(rect, as.list(minmax(100)))   # rect() needs the four corners as separate arguments
(figure: the bounding box of polygon 100)
Maps, with R
Then simply use
> is.box <- function(i, loc){
+   box_i <- minmax(i)
+   (loc[1] >= box_i[1]) & (loc[1] <= box_i[3]) &
+     (loc[2] >= box_i[2]) & (loc[2] <= box_i[4]) }
Maps, with R
Then, to get all possible candidates, use
> which.box <- function(loc)
+   which(Vectorize(function(i) is.box(i, loc))(1:length(paris)))
> which.box(point)
[1]   1 100 101
see
> plot(sub_poly[, c("X","Y")], col = "white")
> polygon(poly_paris[poly_paris$PID == 1,   c("X","Y")], col = plotclr[2])
> polygon(poly_paris[poly_paris$PID == 100, c("X","Y")], col = plotclr[1])
> polygon(poly_paris[poly_paris$PID == 101, c("X","Y")], col = plotclr[2])
(figure: polygon 100 and the two other candidates, 1 and 101)
Maps, with R
Finally use
> which.poly <- function(point){
+   idx_box <- which.box(point)
+   idx_valid <- NULL
+   for(i in idx_box){
+     if(point_in_i(i, point) > 0) idx_valid <- c(idx_valid, i)}
+   return(idx_valid)}
to identify the IRIS polygon:
> which.poly(point)
[1] 100
Maps, with R
> vizualize_point(point)
(figure: the IRIS polygon containing the point, highlighted)
Google Maps and Open Street Map
> library(ggmap)
> library(geosphere)
> MAIF <- geocode("200 avenue Salvador Allende, 79000 Niort, France")
> Rennes <- geocode("Place Hoche, 35000 Rennes, France")
> distHaversine(MAIF, Rennes, r = 6378.137)
[1] 217.9371
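distHaversine implements the haversine formula (see the link below); a direct transcription, with points given as (lon, lat) pairs in degrees:

haversine <- function(p1, p2, r = 6378.137){
  rad  <- pi / 180
  dlat <- (p2[2] - p1[2]) * rad
  dlon <- (p2[1] - p1[1]) * rad
  a <- sin(dlat/2)^2 + cos(p1[2]*rad) * cos(p2[2]*rad) * sin(dlon/2)^2
  2 * r * asin(sqrt(a))                 # great-circle distance, in km here
}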
http://en.wikipedia.org/wiki/Haversine_formula
(in km), while the driving distance is
> mapdist(as.numeric(MAIF), as.numeric(Rennes), mode = "driving")
by using this function you are agreeing to the terms at:
http://code.google.com/apis/maps/documentation/distancematrix/

  from: 200 Avenue Salvador Allende, 79000 Niort, France
  to:   10-11 Place Hoche, 35000 Rennes, France
  m: 257002   km: 257.002   miles: 159.701
  seconds: 9591   minutes: 159.85   hours: 2.664167
Visualizing Maps, via Google Maps
> Loc <- data.frame(rbind(MAIF, Rennes))
> library(ggmap)
> library(RgoogleMaps)
> CenterOfMap <- c(lon = mean(Loc$lon), lat = mean(Loc$lat))   # centring assumed
> W <- get_map(location = CenterOfMap, zoom = 7)
> WMap <- ggmap(W)
> WMap
or
> W <- get_map(location = CenterOfMap, zoom = 7, maptype = "toner")   # variant assumed
> ggmap(W)
(figures: two static maps of western France, centred between Niort and Rennes)
Visualizing Maps, via Google Maps
From a technical point of view, those are ggplot2 maps, see the ggmap cheatsheet:
> W <- get_map("Singapore", zoom = 11)                          # location assumed from the figure
> ggmap(W)
or
> W <- get_map("Singapore", zoom = 11, maptype = "watercolor")  # variant assumed
> ggmap(W)
(figures: two maps of Singapore, lon 103.7 to 104.1, lat 1.2 to 1.5)
https://www.nceas.ucsb.edu/~frazier/RSpatialGuides/ggmap/ggmapCheatsheet.pdf
Visualizing Maps, via Google Maps
> Paris <- get_map("Paris", zoom = 12)                     # assumed
> ParisMap <- ggmap(Paris)
> library(maptools)
> library(rgdal)
> paris <- readShapeSpatial("paris-cartelec.shp")
> proj4string(paris) <- CRS("+init=epsg:2154")             # CRS assumed (Lambert-93)
> paris <- spTransform(paris, CRS("+proj=longlat +datum=WGS84"))
> ParisMapPlus <- ParisMap + geom_polygon(aes(long, lat, group = group),
+                               data = fortify(paris), alpha = .5)   # overlay assumed
> ParisMapPlus
(figures: the Google map of Paris, and the IRIS polygons overlaid on it)
OpenStreet Map
> library(OpenStreetMap)
> map <- openmap(upperLeft = c(48.90, 2.25), lowerRight = c(48.81, 2.45),
+                type = "osm")            # bounding box assumed (Paris)
> map <- openproj(map)                    # re-projection step assumed
> plot(map)
OpenStreet Map
> library(OpenStreetMap)
> map <- openmap(upperLeft = c(51.516, -0.141), lowerRight = c(51.511, -0.133),
+                type = "osm")            # bounding box assumed (Soho, London)
> plot(map)
It is a standard R plot, we can add points on that graph.
OpenStreet Map
> library(maptools)
> deaths <- readShapePoints("Cholera_Deaths")   # shapefile name assumed
> head(deaths@coords)
  coords.x1 coords.x2
0  529308.7  181031.4
1  529312.2  181025.2
2  529314.4  181020.3
3  529317.4  181014.3
4  529320.7  181007.9
> points(deaths@coords, col = "red", pch = 19, cex = .7)
Cholera outbreak, in London, 1854; dataset collected by John (not Jon) Snow.