Introduction · Additive Regression · Generalized Additive Regression · Rate of Convergence · Model Selection Criteria · Adaptive Methods
Semiparametric and Nonparametric Additive Regression Models
Matúš Maciak
Department of Probability and Mathematical Statistics
March 30, 2007

Matúš Maciak, MFF UK - [email protected]
Contents
1 Introduction: Motivation; Curse of Dimensionality; Additive Decomposition
2 Additive Regression: Spline Estimates; Kernel Estimates
3 Generalized Additive Regression
4 Rate of Convergence
5 Model Selection Criteria
6 Adaptive Methods: Well-known Algorithms; Real-data Example
Motivation · Curse of Dimensionality · Additive Decomposition
The main objectives...
1 The “curse of dimensionality” problem - the main reason for applying additive semiparametric and nonparametric regression approaches.
2 The most frequently used methods for obtaining additive estimates.
3 Generalized additive regression models - in the special case of binary data samples.
4 Expectation - to achieve the same rates of convergence for additive estimates as in the univariate regression problem.
5 Model selection criteria - the optimal choice of the final model from the set of all proposed models.
6 Adaptive strategies (CIM) - RPR, MARS, PPR, etc.
Multivariate Regression
Let X ∈ χ ⊆ R^J be a J-dimensional random vector and consider a random variable Y with mean µ ∈ R and finite second moment EY² < ∞.
Let f : χ ⊆ R^J → R be such that E[Y |X = x] = f(x) - the regression function of Y on X.
The regression function f is assumed to be smooth up to a specific order.
No assumptions other than smoothness are placed on the functional form of f(·).
Multivariate Kernel regression
Multidimensional Smoothing ⇒ Multidimensional Regression

f(x) = E[Y |X = x] = ∫ y g(y |x) dy = ∫ y p₁(y, x) dy / p₂(x)

Estimates of the densities p₁, p₂ ⇒ kernel density estimation:

f_h(x) = Σ_{i=1}^{N} κ_h(X_i − x) Y_i / Σ_{i=1}^{N} κ_h(X_i − x),

where κ_h is a multivariate, multiplicative kernel and h = (h₁, …, h_J) is a vector of appropriate bandwidths.
Problems: the “curse of dimensionality” and a slow asymptotic rate of convergence...
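A minimal sketch of this estimator in Python (the Gaussian product kernel, the bandwidths, and the simulated regression function are illustrative assumptions, not taken from the slides):

```python
import numpy as np

def nw_estimate(x, X, Y, h):
    """Multivariate Nadaraya-Watson estimate f_h(x) with a
    multiplicative (product) Gaussian kernel and bandwidth vector h."""
    U = (X - x) / h                        # (N, J) scaled differences
    K = np.exp(-0.5 * U**2).prod(axis=1)   # product kernel weights, (N,)
    return (K @ Y) / K.sum()

rng = np.random.default_rng(0)
X = rng.uniform(0, 1, size=(500, 2))
Y = np.sin(2 * np.pi * X[:, 0]) + X[:, 1]**2 + 0.1 * rng.standard_normal(500)

# true f(0.5, 0.5) = sin(pi) + 0.25 = 0.25
fhat = nw_estimate(np.array([0.5, 0.5]), X, Y, h=np.array([0.1, 0.1]))
```

Since the weights are nonnegative and sum to one, the estimate is always a convex combination of the observed Y_i.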
Additive approaches...
Let (X, Y) ∈ R^{J+1} be a pair of random variables with X = (X₁, …, X_J), where Y is real valued with mean EY = µ and finite second moment 0 < EY² ≤ K < ∞.
Consider an unknown regression function f : [0, 1]^J → R of Y on X, so that f(x) = E[Y |X = x].
We impose one more condition:

f(x₁, …, x_J) = µ + Σ_{j=1}^{J} f_j(x_j)

The functional components f_j are uniquely determined and satisfy Ef_j(X_j) = 0.
The smoothness assumption remains (smoothness of the functional components).
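A sketch of data generated from such an additive model (the concrete components and uniform covariates are assumptions for illustration); each component is centred so that Ef_j(X_j) = 0:

```python
import numpy as np

rng = np.random.default_rng(1)
N, J = 2000, 3
X = rng.uniform(0, 1, size=(N, J))

mu = 2.0
# centred components: each integrates to 0 over [0, 1]
f1 = lambda x: np.sin(2 * np.pi * x)   # integral over [0, 1] is 0
f2 = lambda x: x - 0.5                 # centred linear term
f3 = lambda x: x**2 - 1/3              # centred quadratic term

Y = mu + f1(X[:, 0]) + f2(X[:, 1]) + f3(X[:, 2]) + 0.1 * rng.standard_normal(N)
```

The centring makes µ identifiable: the sample mean of Y estimates µ directly.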
Additive Estimates...
Let (X₁, Y₁), (X₂, Y₂), …, (X_N, Y_N) denote an independent random sample, where each pair (X_i, Y_i) has the same distribution as (X, Y).
Estimates of the true underlying regression function are obtained by different approaches (spline techniques, B-splines, and kernel estimates).
The semiparametric (nonparametric) estimate based on the random sample of size N can be written in the additive form:

f_N(x₁, …, x_J) = Ȳ_N + Σ_{j=1}^{J} f_{Nj}(x_j)

In line with the assumption on the functional components f_j, one requires Σ_{i=1}^{N} f_{Nj}(X_{ij}) = 0 for all j ∈ {1, …, J}.
Splines vs. Kernels
Spline estimates:
1 Semiparametric approaches
2 High-dimensional data
3 Extra-large sample sizes
4 No asymptotic distribution
5 No uniform convergence over the whole interval
6 No measure of uniform accuracy (except the L₂ norm)
7 The so-called “sledge-hammer” technique

Kernel estimates:
1 Nonparametric techniques
2 Too costly for large dimensions
3 Too costly for large sample sizes N
4 Asymptotic (normal) distribution (confidence intervals)
5 Uniform convergence over the whole interval
6 The so-called “sharp-knife” technique
What is the “Curse of Dimensionality” problem?
1 The sample size N needed to fit a J-dimensional regression surface increases exponentially with the number of dimensions.
2 Limitations imposed by the estimation ability of most multivariate regression approaches (splines, kernels).
3 The asymptotic rate of convergence decreases as the number of dimensions J grows (according to the exponent r = p/(2p + J)).
4 Algorithms dealing with high-dimensional data without a dimensionality-reduction principle (straightforward methods) are too costly.
5 A special case - the components of X are not independent...
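The rate exponent r = p/(2p + J) from point 3 can be tabulated directly (p = 2, i.e. twice-differentiable components, is an assumed illustration):

```python
# optimal L2 rate exponent r = p / (2p + J) for p-smooth functions in J dimensions
def rate_exponent(p, J):
    return p / (2 * p + J)

# with p = 2 the exponent shrinks quickly with the dimension J:
# J = 1 gives 2/5 = 0.4, while J = 10 gives only 2/14
rates = {J: rate_exponent(2, J) for J in (1, 2, 5, 10)}
```

This is exactly why the additive structure, which keeps the univariate exponent p/(2p + 1) for every component, is attractive.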
Curse of Dimensionality - examples:
Consider a random variable Y = (Y₁, …, Y_J) and a random sample {X_i = (X_{i1}, …, X_{iJ}); i = 1, …, N} such that X_i ∼ R([0, 1]^J) and Y ∼ R([0, 1]^J).
Maximum Distance vs. Euclidean Distance
Maximum distance: ‖x‖_max = max_{j=1,…,J} |x_j|
Euclidean distance: ‖x‖²_euc = Σ_{j=1,…,J} x_j²

Maximum distance

            J = 1      J = 2      J = 3      J = 5      J = 10     J = 20
N = 100     0.003838   0.054951   0.094651   0.232593   0.366015   0.571504
N = 1000    0.000506   0.015051   0.053464   0.129761   0.273968   0.440982
N = 10000   0.000044   0.004691   0.021613   0.044339   0.213186   0.402223
N = 100000  0.000006   0.001178   0.009108   0.030709   0.159703   0.353620

Euclidean distance

            J = 1      J = 2      J = 3      J = 5      J = 10     J = 20
N = 100     0.003838   0.060434   0.118966   0.328987   0.660090   1.264582
N = 1000    0.000506   0.017274   0.063800   0.191530   0.498007   1.000749
N = 10000   0.000041   0.005546   0.027003   0.060081   0.376753   0.909363
N = 100000  0.000005   0.001376   0.011672   0.052891   0.289131   0.795231

The empirical average minimum distance between two uniformly distributed random points in the hypercube [0, 1]^J.
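A small Monte Carlo sketch of the quantity tabulated above (fewer replications than the table, so the values are only approximate):

```python
import numpy as np

def avg_min_distance(N, J, reps=50, rng=None):
    """Average (over reps) of the minimum max-norm distance from one
    uniform point Y to a uniform sample of size N in [0, 1]^J."""
    rng = rng or np.random.default_rng(2)
    dists = []
    for _ in range(reps):
        X = rng.uniform(0, 1, size=(N, J))
        y = rng.uniform(0, 1, size=J)
        dists.append(np.abs(X - y).max(axis=1).min())
    return float(np.mean(dists))

d_low = avg_min_distance(100, 2)    # table entry is about 0.055
d_high = avg_min_distance(100, 10)  # table entry is about 0.366
```

Even with the sample size fixed, the nearest neighbour drifts away as the dimension grows - the essence of the curse of dimensionality for local smoothers.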
The Lower Bounds – bandwidth selection...
Lemma (Packing density in a hypercube – maximum distance)
Let Y ∼ R([0, 1]^J) and let X_i, i = 1, …, N, be a random sample with each X_i ∼ R([0, 1]^J). Then B_max(N, J) is a lower bound for the average minimum distance in the maximum distance, where

B_max(N, J) = (1/2) · (1/N^{1/J}) · J/(J + 1)   (1)

Lemma (Packing density in a hypercube – Euclidean distance)
Let Y ∼ R([0, 1]^J) and let X_i, i = 1, …, N, be a random sample with each X_i ∼ R([0, 1]^J). Then B_euc(N, J) is a lower bound for the average minimum distance in the Euclidean distance, where

B_euc(N, J) = (1/2) · (√J / N^{1/J}) · J/(J + 1)   (2)
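The two bounds (1) and (2) are straightforward to evaluate; note that they differ only by the factor √J:

```python
def b_max(N, J):
    """Lower bound (1): maximum-norm packing bound."""
    return 0.5 * N**(-1 / J) * J / (J + 1)

def b_euc(N, J):
    """Lower bound (2): Euclidean-norm packing bound, sqrt(J) times (1)."""
    return 0.5 * J**0.5 * N**(-1 / J) * J / (J + 1)
```

Since the bound shrinks only like N^{-1/J}, a bandwidth smaller than this order leaves most kernel windows empty - the connection to bandwidth selection made in the slide title.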
The additive form of regression function
Is the true underlying regression function genuinely additive?
1 YES → straightforward estimation of the functional components (only an occasional case)
2 NO → one has to find an additive approximation, which is subsequently estimated
How does one define a measure of accuracy between the underlying regression function and its approximation?
Additive Decomposition - approximation
Consider a regression function f which is not genuinely additive. In such a case f can still be decomposed into main effects (additive decomposition).

Condition 1
Let the distribution of X ∈ [0, 1]^J be absolutely continuous with a density g bounded away from zero and infinity:
∃ b > 0 ∃ B > b such that b ≤ g(x) ≤ B for all x ∈ C = [0, 1]^J.

The additive approximation to f can be obtained as a sum of J univariate functions f*_j(x_j), where

f*_j(x_j) = E[f(X) | X_j = x_j] − E[f(X)],   x = (x₁, …, x_J) ∈ [0, 1]^J.

If interactions between some variables are required, they can be obtained in a similar way...
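A Monte Carlo sketch of the main effect f*_j for a non-additive test function (the independent-uniform design and the concrete f are assumptions for the illustration):

```python
import numpy as np

rng = np.random.default_rng(3)

def f(x):
    # a non-additive test function: interaction between x1 and x2
    return np.sin(2 * np.pi * x[..., 0]) + x[..., 1] * x[..., 0]

def f_star(j, xj, M=20000):
    """Monte Carlo version of f*_j(x_j) = E[f(X) | X_j = x_j] - E f(X),
    assuming independent U(0, 1) coordinates."""
    X = rng.uniform(0, 1, size=(M, 2))
    Xc = X.copy()
    Xc[:, j] = xj           # condition on X_j = x_j
    return f(Xc).mean() - f(X).mean()
```

For this f one can check by hand that f*_1(x₁) = sin(2πx₁) + x₁/2 − 1/4, e.g. f*_1(0.5) = 0 and f*_1(0.25) = 0.875; the Monte Carlo values land close to these.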
Additive Decomposition – definiteness...
Lemma 1
Let the random variable Σ_j h_j(X_j) have a finite second moment, where the h_j are functions on [0, 1]. Set δ = √(1 − b/B) and let SD(·) denote the standard deviation. Then each h_j(X_j) has a finite second moment and

SD( Σ_j h_j(X_j) ) ≥ ((1 − δ)/2)^{(J−1)/2} · ( SD(h₁(X₁)) + · · · + SD(h_J(X_J)) ).

Under Condition 1 it follows from the lemma that the functional components are uniquely determined up to a set of measure zero.
Spline Estimates · Kernel Estimates
Recap...
1 Regression function f(x) = E[Y |X = x]
2 Estimates based on a random sample {(X_i, Y_i), i = 1, …, N}
3 The random variable Y has mean µ ∈ R and finite second moment EY² < ∞
4 Without loss of generality we assume that f has an additive form - otherwise we use the additive decomposition
5 Functional components f_j with zero mean Ef_j(X_j) = 0 (to avoid constant functional components)
Regression splines
The first method - polynomial estimates (over-fitting, etc.)
Polynomial regression with penalties (no longer used)
To avoid the problems related to polynomial regression ⇒ implementation of spline approaches (piecewise polynomials)

Definition 1 - Spline function
A spline is a piecewise polynomial function of degree n whose polynomial pieces join at the knot points, obeying continuity conditions for the function itself and its first n − 1 derivatives.

Problem: how to choose the number and the positions of the knots?
Regression splines - power basis
Spline estimation approaches are based on a set of basis functions:
1 The spline power basis takes the form {1, x, x², …, xⁿ, (x − ξ₁)^n_+, …, (x − ξ_K)^n_+} (n - spline order).
2 The estimate of each functional component f_j is defined as
f_j(x_j) = Σ_{l=0}^{n} β_{0l} x_j^l + Σ_{k=1}^{K} β_{kn} (x_j − ξ_k)^n_+
3 The estimate of the underlying additive regression function is defined by the minimization problem
Σ_{i=1}^{N} ( Y_i − Ȳ_N − Σ_{j=1}^{J} f_j(x_{ij}) )²
with respect to the basis coefficients β₀₁, …, β₀ₙ, β₁ₙ, …, β_{Kn}.
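A univariate sketch of the power-basis fit (cubic order, nine equispaced knots, and the simulated curve are assumptions; the additive case simply stacks one such design block per covariate):

```python
import numpy as np

def power_basis(x, knots, n=3):
    """Truncated power basis {1, x, ..., x^n, (x - xi_k)_+^n}."""
    cols = [x**l for l in range(n + 1)]
    cols += [np.clip(x - xi, 0, None)**n for xi in knots]
    return np.column_stack(cols)

rng = np.random.default_rng(4)
x = np.sort(rng.uniform(0, 1, 300))
y = np.sin(2 * np.pi * x) + 0.1 * rng.standard_normal(300)

B = power_basis(x, knots=np.linspace(0.1, 0.9, 9))
beta, *_ = np.linalg.lstsq(B, y, rcond=None)   # least-squares spline fit
fit = B @ beta
```

The design has n + 1 + K = 13 columns; each knot contributes exactly one truncated-power column, which is the one-to-one knot/basis relation mentioned later in the basis comparison.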
Regression splines with penalties
If we redefine the former minimization problem so that the minimization is also carried out with respect to the knot positions ⇒ there is a risk of over-fitting (interpolation).

Regression splines ⇒ regression splines with penalties:
to ensure better flexibility of the final estimate
to gain the ability to control the amount of smoothness

The estimate of the true underlying regression function f = (f₁, …, f_J) is given by the minimization problem:

Minimize Σ_{i=1}^{N} ( Y_i − Ȳ_N − Σ_{j=1}^{J} f_j(x_{ij}) )² + λ ∫₀¹ (f′′(x))² dx,

where λ is the so-called smoothing parameter.
B-spline basis
Consider a B-spline basis of order n. Then:
1 Each B-spline basis function consists of n + 1 polynomial pieces
2 The pieces join at n inner knots
3 At the knot points, continuity holds up to order n − 1
4 Each B-spline basis function is positive over a domain spanned by n + 2 knots - everywhere else it is zero by definition
5 Each B-spline function overlaps with 2n neighbouring basis functions
6 At any point x ∈ [0, 1] there are n + 1 nonzero basis functions
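The listed properties can be verified numerically with a small Cox-de Boor implementation (the uniform knot grid is an assumption of the sketch):

```python
import numpy as np

def bspline_basis(x, t, n):
    """All B-spline basis functions of degree n on the knot vector t,
    evaluated at the points x via the Cox-de Boor recursion."""
    x = np.asarray(x)[:, None]
    B = ((t[:-1] <= x) & (x < t[1:])).astype(float)   # degree 0: interval indicators
    for d in range(1, n + 1):
        left = (x - t[:-d-1]) / (t[d:-1] - t[:-d-1])
        right = (t[d+1:] - x) / (t[d+1:] - t[1:-d])
        B = left * B[:, :-1] + right * B[:, 1:]
    return B

t = np.arange(-3.0, 14.0)          # uniform knots, no repeats
x = np.linspace(0.01, 9.99, 50)    # interior points between t[3] and t[13]
B = bspline_basis(x, t, 3)         # cubic: n = 3
```

On the interior, the basis forms a partition of unity and exactly n + 1 = 4 functions are nonzero at every evaluation point - properties 4 and 6 above.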
B-splines - estimation
The estimate of each functional component is written as a linear combination of spline basis functions (piecewise polynomials of degree n ∈ N):

f_{j∆}(x_j) = Σ_{k=1}^{K+n+1} ϑ_{jk} · B_{kn}(x_j)

The estimate of the whole unknown regression function f is defined by the minimization problem

min_{ϑ_{jk} ∈ R} Σ_{i=1}^{N} [ Y_i − Ȳ_N − Σ_{j=1}^{J} Σ_{k=1}^{K+n+1} ϑ_{jk} · B_{kn}(x_{ij}) ]²
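A sketch of this additive least-squares fit, stacking one B-spline design block per covariate (it relies on scipy's `BSpline.design_matrix`, available in SciPy 1.8+; the components and knot grid are illustrative assumptions):

```python
import numpy as np
from scipy.interpolate import BSpline

rng = np.random.default_rng(5)
N, J, deg = 400, 2, 3
X = rng.uniform(0, 1, size=(N, J))
f1 = lambda x: np.sin(2 * np.pi * x)      # centred on U(0, 1)
f2 = lambda x: (x - 0.5)**2 - 1/12        # centred on U(0, 1)
Y = 1.0 + f1(X[:, 0]) + f2(X[:, 1]) + 0.1 * rng.standard_normal(N)

# clamped cubic B-spline basis on [0, 1], one block per covariate
t = np.r_[[0.0]*deg, np.linspace(0, 1, 8), [1.0]*deg]
D = np.hstack([BSpline.design_matrix(X[:, j], t, deg).toarray()
               for j in range(J)])

Yc = Y - Y.mean()                          # Ybar_N absorbs the constant mu
theta, *_ = np.linalg.lstsq(D, Yc, rcond=None)
fit = Y.mean() + D @ theta
rmse = float(np.sqrt(np.mean((fit - Y)**2)))
```

Each covariate contributes len(t) − deg − 1 = 10 basis columns, so the joint design has 20 columns; `lstsq` returns the minimum-norm solution, which handles the one-constant-per-block redundancy.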
B-splines with penalties (P-splines)
To ensure better control over smoothness and better flexibility of the final estimate, B-splines with penalties were proposed.

The estimate is given by the minimization problem

Minimize Σ_{i=1}^{N} ( f_N(X_i) − Y_i )² + Σ_{j=1}^{J} λ_j · ∫_{ξ₀}^{ξ_{K+1}} (f′′_{j∆}(x_j))² dx_j

with respect to the basis coefficients ϑ_{jk} and the parameters λ_j, with the same B-spline basis as in the case of simple B-spline estimates.

The optimal choice of the smoothing parameter λ ⇒ model selection criteria.
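A univariate P-spline sketch; following Eilers and Marx, the derivative penalty is replaced here by a second-order difference penalty on the coefficients (an assumed, standard stand-in for ∫(f′′)², not the slides' exact formulation; needs SciPy 1.8+):

```python
import numpy as np
from scipy.interpolate import BSpline

rng = np.random.default_rng(6)
x = np.sort(rng.uniform(0, 1, 200))
y = np.sin(2 * np.pi * x) + 0.1 * rng.standard_normal(200)

deg = 3
t = np.r_[[0.0]*deg, np.linspace(0, 1, 21), [1.0]*deg]  # deliberately rich knot grid
B = BSpline.design_matrix(x, t, deg).toarray()
K = B.shape[1]

D2 = np.diff(np.eye(K), n=2, axis=0)   # second-order difference matrix
def pspline_fit(lam):
    # minimise ||y - B theta||^2 + lam * ||D2 theta||^2
    theta = np.linalg.solve(B.T @ B + lam * D2.T @ D2, B.T @ y)
    return B @ theta

wiggly = pspline_fit(1e-6)   # nearly unpenalised
smooth = pspline_fit(1e2)    # heavy penalty, close to a straight line
```

With the knot grid intentionally oversized, λ alone controls the effective smoothness - the point of the penalized formulation.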
Power basis vs. B-spline basis
Power basis:
1 Direct relation between a knot and the corresponding basis function
2 Greater correlation between basis functions

B-spline basis:
1 Numerically much more stable set of basis functions
2 Smaller correlation between basis functions
Additive Kernel Estimates - progress
A multidimensional estimate of the unknown regression function f ⇒ subsequently we estimate the single components f₁, …, f_J.
1 Motivated by additive linear regression
2 First iterative procedures (the backfitting algorithm)
3 Other iterative procedures (RPR, PPR, MARS)
4 The so-called Direct Integration Method, proposed in 1994
↪ the statistical properties of such an estimate are straightforward to derive (bias, variance, asymptotic properties, confidence intervals, etc.)
↪ asymptotic normality of the DIM estimates
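A sketch of the backfitting algorithm from point 2, using a plain univariate Nadaraya-Watson smoother as the inner smoother (the smoother choice, bandwidth, and simulated model are assumptions of the sketch):

```python
import numpy as np

def kernel_smooth(x, y, h):
    """Univariate Nadaraya-Watson smoother evaluated at the sample points."""
    W = np.exp(-0.5 * ((x[:, None] - x[None, :]) / h)**2)
    return (W @ y) / W.sum(axis=1)

def backfit(X, Y, h=0.1, iters=10):
    """Backfitting for Y = mu + sum_j f_j(X_j): cycle over the components,
    smooth the partial residuals, and re-centre each component."""
    N, J = X.shape
    mu = Y.mean()
    F = np.zeros((N, J))                    # current component fits
    for _ in range(iters):
        for j in range(J):
            partial = Y - mu - F.sum(axis=1) + F[:, j]
            F[:, j] = kernel_smooth(X[:, j], partial, h)
            F[:, j] -= F[:, j].mean()       # enforce sum_i f_j(X_ij) = 0
    return mu, F

rng = np.random.default_rng(7)
X = rng.uniform(0, 1, size=(300, 2))
Y = 1.0 + np.sin(2 * np.pi * X[:, 0]) + (X[:, 1] - 0.5) + 0.1 * rng.standard_normal(300)
mu, F = backfit(X, Y)
```

The recentring step inside the loop is what keeps the decomposition identifiable, matching the zero-mean constraint on the components stated earlier.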
Direct Integration Method

Consider a multivariate unknown regression function f(x) of additive form. Let X = (X1, X̄) ∈ ℝ × ℝ^{J−1}, where X̄ collects the remaining covariates, and define the functional ϕ1(x1) as follows:

ϕ1(x1) = ∫₀¹ f(x1, x̄) p2(x̄) dx̄

Under the assumption of the additive form of f with components (f1, . . . , fJ), it holds that ϕ1 = f1 up to the additive constant µ.

- Multivariate Nadaraya-Watson kernel estimate
- Kernel estimates of the functions f(·) and p2
Direct Integration Method

- The estimate of f1(x1) is given as the sample version of the functional ϕ1(x1):

  f̂1(x1) = (1/N) ∑_{i=1}^N f̂(x1, X̄i)

- The estimate f̂1(x1) can be written in the form

  f̂1(x1) = ∑_{i=1}^N wi(x1) Yi,

  where wi(x1) = N^{−1} ∑_{l=1}^N wi(x1, X̄l). The weights wi(x1, X̄l) are given by the equation f̂(x1, X̄) = ∑_{i=1}^N wi(x1, X̄i) Yi.
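The direct-integration estimate can be sketched in a few lines for J = 2: fit a bivariate Nadaraya-Watson surface, then average it over the empirical distribution of the nuisance covariate. The test functions, sample size, and bandwidths h = g = 0.1 below are illustrative assumptions of mine, not values from the slides.

```python
import numpy as np

rng = np.random.default_rng(0)
N = 400
X = rng.uniform(0, 1, size=(N, 2))
f1 = lambda x: np.sin(2 * np.pi * x)        # illustrative additive components
f2 = lambda x: (x - 0.5) ** 2
Y = f1(X[:, 0]) + f2(X[:, 1]) + rng.normal(0, 0.1, N)

def nw(x1, x2, h=0.1, g=0.1):
    """Bivariate Nadaraya-Watson estimate f^(x1, x2) with Gaussian kernels."""
    w = np.exp(-0.5 * ((X[:, 0] - x1) / h) ** 2) \
      * np.exp(-0.5 * ((X[:, 1] - x2) / g) ** 2)
    return np.sum(w * Y) / np.sum(w)

def phi1(x1):
    """phi_1(x1): average the bivariate fit over the empirical law of X2."""
    return np.mean([nw(x1, x2) for x2 in X[:, 1]])

grid = np.linspace(0.05, 0.95, 19)
est = np.array([phi1(u) for u in grid])
truth = f1(grid)
# phi_1 equals f_1 only up to an additive constant, so compare centred versions
err = np.max(np.abs((est - est.mean()) - (truth - truth.mean())))
```

Because the integration step is an explicit average rather than a fixed point of an iteration, bias and variance of this estimator can be worked out directly, which is the point made above about the tractability of DIM.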
Asymptotical normality of Kernel Additive Estimate
the functional components f2, . . . , fJ can be by obtained by thesimilar process, considering the functional ϕk (xk ) and a partition(Xk , X) ∈ R× RJ−1, and X = (X1, . . . , Xk−1, Xk+1, . . . XJ).
Theorem (Asymptotical normality)
Under some assumptions on N ∈ N and smoothness bandwidth hand g for Kernel estimates, it holds
N25[ϕj(xj)− ϕj(xj)
]→ N(bj(xj), vj(xj))
Generalization into the GAM

In the case of binary data (survival time data) it is more convenient to use Generalized Additive Models (GAM).

1 Full model specification - the conditional distribution of Y given X belongs to an exponential family, with a known link function G:

  G[f(x)] = µ + ∑_{j=1}^J fj(xj)

2 Partial model specification - no restriction to an exponential family; the variance function stays unrestricted

If one takes the link function G to be the identity ⇒ classical Additive Regression model (other choices: logit, probit, logarithm).
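The link functions named above can be written down directly; the stdlib `statistics.NormalDist` supplies Φ and Φ⁻¹ for the probit link. The quick sketch below (my own illustration) checks that each link and its inverse compose to the identity, which is what lets one recover the mean response from the additive predictor, f(x) = G⁻¹[µ + ∑j fj(xj)].

```python
import math
from statistics import NormalDist

logit      = lambda mu: math.log(mu / (1.0 - mu))   # G for the logistic GAM
inv_logit  = lambda eta: 1.0 / (1.0 + math.exp(-eta))
probit     = NormalDist().inv_cdf                   # G = Phi^{-1}
inv_probit = NormalDist().cdf                       # G^{-1} = Phi
log_link   = math.log                               # G = log (Poisson-type models)
inv_log    = math.exp

mu = 0.3                  # a mean response in (0, 1), e.g. for binary data
eta = logit(mu)           # value on the additive-predictor scale
back = inv_logit(eta)     # mapped back to the mean scale
```

With the identity link both scales coincide and the model reduces to classical additive regression, exactly as stated above.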
GAM - estimation

The estimation procedure is similar to that for Additive Kernel estimates. Let X = (X1, X̄) with X̄ = (X2, . . . , XJ), and define ϕ1(x1):

ϕ1(x1) = ∫ G[f(x1, x̄)] · p2(x̄) dx̄

Multidimensional Nadaraya-Watson kernel estimator ⇒ nonparametric multivariate kernel estimates of p2 and f.

- The estimate of f1 is identified with the estimate of ϕ1.
- The estimate of ϕ1 is given by

  ϕ̂1(x1) = (1/N) ∑_{i=1}^N G[f̂(x1, X̄i)], where X̄i = (Xi2, . . . , XiJ).
ADVANTAGES or DISADVANTAGES?
What is the main advantage of an additive approach?
The Optimal Global Rate of Convergence

The sequence {bN} is the optimal rate of convergence if:

lim_{c→0} lim inf_{N→∞} sup_{f∈κ} P[ ‖TN − f‖q > c · bN ] = 1

lim_{c→∞} lim sup_{N→∞} sup_{f∈κ} P[ ‖TN − f‖q > c · bN ] = 0

The optimal global rate of convergence given by Stone:

Theorem (Rate of Convergence for Nonparametric Estimates)
Let β ∈ (0, 1] and set p = k + β. Let 0 < q ≤ ∞ and set r = (p − m)/(2p + J). Then the optimal global rate of convergence is

{N^{−r}} for q ∈ (0, ∞), and {(N^{−1} · ln N)^r} for q = ∞.
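A quick numeric reading of the exponent r = (p − m)/(2p + J) makes the dimensionality effect concrete; the values below are my own illustration (p = 2, m = 0), not figures from the slides.

```python
def rate(p, J, m=0):
    """Stone's optimal exponent r: the estimation error decays like N**(-r)."""
    return (p - m) / (2 * p + J)

p = 2                       # smoothness p = k + beta
full_5d  = rate(p, J=5)     # five covariates: r = 2/9
additive = rate(p, J=1)     # additive reduction: r = 2/5, as for J = 1

# sample size needed to push the error down to ~0.01 in each case:
N_full = 0.01 ** (-1 / full_5d)     # of the order 10**9
N_add  = 0.01 ** (-1 / additive)    # of the order 10**5
```

The additive model thus attains the univariate rate regardless of J, which is precisely the motivation for the additive reduction discussed next.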
Additive Reduction Principle

- The effect of the additive reduction principle on the simplicity and interpretability of the model.
- Prevention of the "curse of dimensionality".
- The improvement of the optimal global rate of convergence: r = (p − m)/(2p + J) −→ r = (p − m)/(2p + 1).
[Figure: a fitted additive regression surface of Response Y on Predictors X and Z, with the estimated smooth components s(income, 3.12) and s(education, 3.18) plotted against income and education.]
Additive Expansion in L2 Norm

Consider an additive estimate f̂N of the regression function f. Set γ = 1/(2p + 1) and r = (p − m)/(2p + 1).

Theorem (Rate of Convergence for Additive Estimates)
Suppose that all necessary conditions hold and let NN ∼ N^γ. Then:

‖f̂_{Nj}^{(m)} − (f_j*)^{(m)}‖²_j = O_pr(N^{−2r}),   ‖f̂_{Nj} − f_j*‖²_j = O_pr(N^{−2r}),
‖f̂_N − f*‖² = O_pr(N^{−2r}),   (Ȳ_N − µ)² = O_pr(N^{−2r}).

- The only reasonable derivatives are partial derivatives with respect to the same variable (∂²f̂_N/∂x_{j1}∂x_{j2} = 0 for j1 ≠ j2).
- The Theorem holds under a redefinition of the mth derivative of the additive function (a linear combination of partial derivatives).
Additive Expansion in L∞ Norm

- The effect of the additive decomposition on the rate of convergence in the supremum norm: r = p/(2p + J) −→ r = p/(2p + 1).
- We decompose not only the unknown regression function but the whole regression problem (⇒ J univariate regression problems).

Theorem
Let all necessary conditions hold and let N^γ ∼ NN. Suppose that EY = µ = 0 and let r = p/(2p + 1). Then:

‖f̂_N − f*‖∞ = sup_{x∈[0,1]^J} |f̂_N(x) − f*(x)| = O_pr(N^{−r} · log^r N)   (3)

‖f̂_{Nj} − f_j*‖∞,j = sup_{xj∈[0,1]} |f̂_{Nj}(xj) − f_j*(xj)| = O_pr(N^{−r} · log^r N)   (4)
The Effectiveness of the Additive Expansion
[Plot: the rate of convergence against the number of observations N, with curves for J = 1 and J = 2 under both the supremum norm and the Euclidean norm.]

Figure: The optimal global rate of convergence for the additive models in the case of a two-dimensional regression surface, for the supremum norm and the Euclidean norm.
Optimal model selection

1 Spline estimates: introducing the smoothing parameter λ yields a whole set of "good" admissible models ⇒ one needs to select a single one.
2 Penalized splines: the set of admissible models grows even further once the minimization is carried out over the smoothing parameter λ and the knot positions ∆ as well.
3 Kernel regression: the problem of a proper selection of the smoothing parameter h, a measure of localness (or a multivariate bandwidth parameter h).
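One concrete way to resolve the selection problem is Generalized Cross-Validation. The sketch below (a ridge-penalized polynomial smoother on simulated data, my own illustrative setup rather than the estimator from the slides) evaluates GCV(λ) = N·RSS(λ)/(N − tr S_λ)² over a grid of smoothing parameters and picks the minimizer.

```python
import numpy as np

rng = np.random.default_rng(1)
N = 100
x = np.sort(rng.uniform(0, 1, N))
y = np.sin(2 * np.pi * x) + rng.normal(0, 0.3, N)

# polynomial design with a ridge-type penalty on the non-intercept coefficients
deg = 9
Xd = np.vander(x, deg + 1, increasing=True)
D = np.eye(deg + 1)
D[0, 0] = 0.0                       # do not penalize the intercept

def gcv(lam):
    """GCV(lambda) = N * RSS / (N - tr S_lambda)**2 for the linear smoother S."""
    S = Xd @ np.linalg.solve(Xd.T @ Xd + lam * D, Xd.T)
    resid = y - S @ y
    return N * (resid @ resid) / (N - np.trace(S)) ** 2

lams = 10.0 ** np.arange(-8, 3)     # candidate smoothing parameters
best = lams[np.argmin([gcv(l) for l in lams])]
```

GCV approximates leave-one-out cross-validation at the cost of a single trace computation, which is why it is a standard default for choosing λ in penalized spline fitting.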
Model Selection Criteria
[Plot, repeated for the smoothing parameters λ = 0.000008, 0.001357, 0.037821, 0.199624 and 1.053625: y values against x values with the corresponding spline fit, comparing the criteria Cross-Validation, Generalized Cross-Validation, Akaike Information Criterion, and Bayesian Information Criterion.]
Iterative Methods - Backfitting Algorithm

The first proposals → iterative methods (based on the additive decomposition method):

fj(xj) = E[ Y − µ − ∑_{t=1, t≠j}^J ft(xt) | Xj ]

1 Initialization: µ̂0 = (1/N) ∑_{i=1}^N Yi; f̂j = f̂j⁰, j = 1, . . . , J
2 For each j: f̂j = Sj[ Y − µ̂0 − ∑_{k≠j} f̂k(Xk) | Xj ];
  µ̂0 = µ̂0 + (1/N) ∑_{i=1}^N f̂j(Xij);
  f̂j = f̂j − (1/N) ∑_{i=1}^N f̂j(Xij)
3 Repeat step 2 until sufficient convergence.
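The loop above can be sketched as follows for J = 2, with a Nadaraya-Watson smoother playing the role of Sj; the data-generating functions, bandwidth, and iteration count are illustrative assumptions of mine.

```python
import numpy as np

rng = np.random.default_rng(2)
N = 300
X = rng.uniform(0, 1, size=(N, 2))
f1 = lambda x: np.sin(2 * np.pi * x)       # illustrative additive components
f2 = lambda x: 4 * (x - 0.5) ** 2
Y = 1.0 + f1(X[:, 0]) + f2(X[:, 1]) + rng.normal(0, 0.2, N)

def smooth(x, r, h=0.08):
    """S_j: Nadaraya-Watson smooth of the partial residuals r against x."""
    K = np.exp(-0.5 * ((x[:, None] - x[None, :]) / h) ** 2)
    return K @ r / K.sum(axis=1)

mu = Y.mean()                               # step 1: initialization
fhat = [np.zeros(N), np.zeros(N)]           # f_j evaluated at the data points
for _ in range(20):                         # step 2, repeated until convergence
    for j in range(2):
        partial = Y - mu - fhat[1 - j]      # partial residuals for component j
        fhat[j] = smooth(X[:, j], partial)
        fhat[j] -= fhat[j].mean()           # centre to keep mu identified

fit = mu + fhat[0] + fhat[1]
resid_var = np.var(Y - fit)
```

Each pass smooths the partial residuals of one component while holding the others fixed; the centring step is what pins down the additive constant µ.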
Iterative Techniques - Computer-Intensive Methods

Recursive Partitioning Regression (RPR)
- spline estimate of degree zero
- locally constant estimate with great interpretability

MARS Algorithm
- multivariate adaptive regression spline estimates
- a modification of the RPR algorithm (continuity condition)

Projection Pursuit Regression (PPR)
- projection into lower dimensions
- additivity in a different sense
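The RPR idea, greedy axis-aligned splits with a locally constant fit on each cell, can be sketched in a few lines. This toy version (quantile candidate splits, fixed depth, squared-error criterion) is my own minimal illustration of the principle, not the full published algorithm.

```python
import numpy as np

def fit_tree(X, y, depth=3, min_leaf=10):
    """Greedy recursive partitioning: nested dict of splits with leaf means."""
    if depth == 0 or len(y) < 2 * min_leaf:
        return {'value': float(y.mean())}
    best = None
    for j in range(X.shape[1]):                       # candidate split variables
        for s in np.quantile(X[:, j], np.linspace(0.1, 0.9, 9)):
            left = X[:, j] <= s
            if left.sum() < min_leaf or (~left).sum() < min_leaf:
                continue
            sse = ((y[left] - y[left].mean()) ** 2).sum() + \
                  ((y[~left] - y[~left].mean()) ** 2).sum()
            if best is None or sse < best[0]:
                best = (sse, j, s, left)
    if best is None:
        return {'value': float(y.mean())}
    _, j, s, left = best
    return {'j': j, 's': s,
            'L': fit_tree(X[left], y[left], depth - 1, min_leaf),
            'R': fit_tree(X[~left], y[~left], depth - 1, min_leaf)}

def predict(tree, x):
    """Follow the splits down to a leaf; the fit is piecewise constant."""
    while 'value' not in tree:
        tree = tree['L'] if x[tree['j']] <= tree['s'] else tree['R']
    return tree['value']

# illustrative data: a step in the first covariate plus noise
rng = np.random.default_rng(3)
Xd = rng.uniform(0, 1, size=(300, 2))
yd = (Xd[:, 0] > 0.5).astype(float) + rng.normal(0, 0.1, 300)
tree = fit_tree(Xd, yd)
pred = np.array([predict(tree, row) for row in Xd])
mse = float(np.mean((pred - yd) ** 2))
```

The resulting fit is a degree-zero spline on a data-driven partition, which is exactly the "locally constant estimate with great interpretability" described above; MARS replaces the constants with continuous piecewise-linear basis functions.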
Example: Polynomial Regression

Polynomial regression estimate - spline estimate of the 3rd degree.
Model: Life expectancy ~ S12[Log(People/TV), Log(People/physician)]

[Plot: fitted surface of Average Life Expectancy against Log(people per TV) and Log(people per physician).]

Residual Sum of Squares: 10.56021
Example: Additive Regression Model

Additive regression estimate (a special case of PPR) - additive spline estimate of the 3rd degree.
Model: Life expectancy ~ S1[Log(People/TV)] + S2[Log(People/physician)]

[Plot: fitted additive surface of Average Life Expectancy against Log(people per TV) and Log(people per physician).]

Residual Sum of Squares: 11.89261
Example: Recursive Partitioning Regression

Recursive partitioning regression estimate - locally constant estimate (spline of degree zero).
Model: Life expectancy ~ ∑_{v=1}^V πv 1{x ∈ Bv}

[Plot: piecewise-constant fit of Average Life Expectancy against Log(people per TV) and Log(people per physician).]

Residual Sum of Squares: NaN
Example: MARS Algorithm

Multivariate Adaptive Regression Splines (MARS) - a modification of the RPR algorithm.
Model: Life expectancy ~ s0 + ∑_{v=1}^V sv Bv(x)

[Plot: MARS fit of Average Life Expectancy against Log(people per TV) and Log(people per physician).]

Residual Sum of Squares: 11.32777
Example: Projection Pursuit Regression

Projection Pursuit Regression (PPR) - projection into lower dimensions.
Model: Life expectancy ~ ∑_{v=1}^V gv(bvᵀ x)

[Plot: PPR fit of Average Life Expectancy against Log(people per TV) and Log(people per physician).]

Residual Sum of Squares: 6.129001
Additive Regression Models with Regression Splines
Thank you for your attention...
Matúš Maciak: [email protected]