Stochastic Models: Machine Learning
Walt Pohl
Universität Zürich, Department of Business Administration
March 19, 2015
What is Machine Learning?
Machine learning is aimed at prediction, not hypothesis testing.
Use a high-dimensional approximation to extract the maximum predictability.
No effort is made to interpret individual parameters.
The Prediction Problem
The basic framework is:
Predict Y, given some vector of predictors, X.
Find a function, f(X), to predict Y:
Y = f(X) + ε,
where ε is random.
The space of all possible f's is infinite-dimensional, but we choose a finite-dimensional approximation.
Example: Polynomial Regression
Use regression to fit a high-degree polynomial – degree 10 or 20.
f(X) = b_0 + b_1 X + ⋯ + b_N X^N
Coefficients are hard to interpret. What does the coefficient of X^8 mean?
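As a concrete illustration, a minimal sketch in Python with NumPy (the data is synthetic, invented purely for the example):

import numpy as np

# Synthetic data: a noisy nonlinear signal.
rng = np.random.default_rng(0)
x = np.linspace(-1, 1, 50)
y = np.sin(3 * x) + 0.1 * rng.standard_normal(50)

# Fit f(X) = b_0 + b_1 X + ... + b_N X^N with N = 10.
coeffs = np.polyfit(x, y, deg=10)
f = np.poly1d(coeffs)
print(f(0.5))  # prediction at X = 0.5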
In Versus Out of Sample
Use a high enough degree, and you can fit the data perfectly – in sample.
Out of sample, the fit will be terrible – much worse than a linear regression.
This is the problem of overfitting.
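A quick illustrative sketch of the problem (synthetic data; the exact numbers will vary):

import numpy as np

rng = np.random.default_rng(1)
x = np.sort(rng.uniform(-1, 1, 40))
y = x + 0.3 * rng.standard_normal(40)

# Split into training and test halves.
x_tr, y_tr = x[::2], y[::2]
x_te, y_te = x[1::2], y[1::2]

for deg in (1, 10):
    f = np.poly1d(np.polyfit(x_tr, y_tr, deg))
    print(deg,
          np.mean((y_tr - f(x_tr)) ** 2),   # in-sample error shrinks with degree
          np.mean((y_te - f(x_te)) ** 2))   # out-of-sample error grows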
The solution? Penalize complexity.
Loss Function
Choose a loss function, L(Y, f(X)).
Choose f to minimize
E(L(Y, f(X))).
L can be modified to penalize complexity.
Penalizing Complexity via Loss Function
One natural choice for L is squared-error loss:
(Y − f(X))²
This leads to regression.
But now, let’s introduce a term to penalize complexity.
Example:
(Y − f(X))² + λ ∑_i β_i²
Ridge Regression
Minimizing this penalized loss gives you ridge regression.
Note that λ – known as the ridge parameter – cannot be estimated from the data. It must be given.
λ = 0 is ordinary regression. λ → ∞ will force the coefficients towards zero.
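In practice, scikit-learn provides this directly; a minimal sketch on synthetic data (scikit-learn calls the ridge parameter alpha):

import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)
X = rng.standard_normal((100, 5))
y = X @ np.array([1.0, 0.5, 0.0, 0.0, -2.0]) + 0.1 * rng.standard_normal(100)

model = Ridge(alpha=1.0).fit(X, y)  # alpha plays the role of lambda
print(model.coef_)  # all coefficients shrunk towards zero, none exactly zero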
Variable Selection Methods
Ridge regression works by pushing all coefficient estimates towards zero.
A natural alternative is to set some coefficients exactly to zero – that is, to drop some variables entirely.
Subset Selection
Subset selection works by choosing a subset of the variables and regressing on only those.
Several standard techniques:
Best subset
Forward stepwise
Backward stepwise
Best subset
For a fixed k, choose the k variables that maximize the R².
Downside: can be computationally expensive.
Unspecified: k.
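A brute-force sketch (illustrative only; the exhaustive search over subsets is what makes it expensive):

import numpy as np
from itertools import combinations
from sklearn.linear_model import LinearRegression

def best_subset(X, y, k):
    # Try every size-k subset of columns; keep the one with the highest R^2.
    best_r2, best_cols = -np.inf, None
    for subset in combinations(range(X.shape[1]), k):
        cols = list(subset)
        r2 = LinearRegression().fit(X[:, cols], y).score(X[:, cols], y)
        if r2 > best_r2:
            best_r2, best_cols = r2, cols
    return best_cols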
Forward selection
Start with only the intercept, and add one variable at a time. Choose the variable that increases the R² the most.
Downside: not optimal fit.
Unspecified: when to stop.
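A greedy sketch of forward selection (illustrative; here we simply stop after k steps, since the stopping rule is left open):

import numpy as np
from sklearn.linear_model import LinearRegression

def forward_select(X, y, k):
    selected, remaining = [], list(range(X.shape[1]))
    for _ in range(k):
        # Add whichever remaining variable raises R^2 the most.
        def r2_with(j):
            cols = selected + [j]
            return LinearRegression().fit(X[:, cols], y).score(X[:, cols], y)
        best = max(remaining, key=r2_with)
        selected.append(best)
        remaining.remove(best)
    return selected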
Backward selection
Start with all variables, and remove one variable at a time. Choose the variable that decreases the R² the least.
Downside: not optimal fit.
Unspecified: when to stop.
The Lasso
The lasso superficially resembles ridge regression, but has some of the aspects of subset selection.
It’s regression with a penalty term,
∑_i ( y_i − α − ∑_j β_j x_ij )² + λ ∑_j |β_j|,
but the minimizer typically sets some of the coefficients exactly to zero.
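scikit-learn's Lasso implements this; a minimal sketch on synthetic data (alpha plays the role of λ):

import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
X = rng.standard_normal((100, 5))
y = X @ np.array([1.0, 0.0, 0.0, 0.0, -2.0]) + 0.1 * rng.standard_normal(100)

model = Lasso(alpha=0.1).fit(X, y)
print(model.coef_)  # typically several coefficients come out exactly zero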
Other Penalties
Other penalties appear in the literature:
p-norm (for 1 ≤ p ≤ 2): ∑_j |β_j|^p
elastic net: ∑_j ( α β_j² + (1 − α) |β_j| )
Both are somewhere between ridge and lasso in behavior.
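For example, scikit-learn's ElasticNet implements an elastic-net penalty, though its parameterization differs somewhat from the formula above (its alpha is the overall λ; l1_ratio is the mixing weight):

import numpy as np
from sklearn.linear_model import ElasticNet

rng = np.random.default_rng(0)
X = rng.standard_normal((100, 5))
y = X[:, 0] - 2 * X[:, 4] + 0.1 * rng.standard_normal(100)

model = ElasticNet(alpha=0.1, l1_ratio=0.5).fit(X, y)
print(model.coef_)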
Nonlinear Models
The previous techniques are for linear models of large numbers of variables. What about nonlinear models of a few variables?
In theory we can treat this problem as a special case of the previous problem: choose a big family of basis functions, à la polynomial regression.
We now consider some more intrinsically nonlinear methods:
Splines
Local regression
Generalized additive models
Piecewise Basis Functions
Polynomial regression has the downside that every observation affects every coefficient.
Alternative: divide the x-axis into intervals. (The endpoints are known as knots.) For each interval, choose a set of basis functions that are zero outside that interval.
Regression coefficients will be unaffected by observations outside an interval.
Choosing Piecewise Basis Functions
General recipe:
Choose knots ξ_1, …, ξ_n.
Choose an arbitrary set of basis functions: constants, linear functions, polynomials, etc.
Let I(ξ_i ≤ x < ξ_{i+1}) be the indicator function that is 1 on the interval [ξ_i, ξ_{i+1}) and 0 otherwise. Then functions of the form
f_i(x − ξ_i) I(ξ_i ≤ x < ξ_{i+1})
do the job.
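The simplest instance of this recipe takes constant basis functions: on each interval, the least-squares fit is just the mean of the y values there. A minimal sketch (assuming every interval contains data):

import numpy as np

def piecewise_constant_fit(x, y, knots):
    # With constant basis functions, least squares on each interval
    # [xi_i, xi_{i+1}) reduces to the mean of the y values in it.
    means = []
    for lo, hi in zip(knots[:-1], knots[1:]):
        mask = (x >= lo) & (x < hi)
        means.append(y[mask].mean())
    return means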
Continuity
Downside: A regression fit will usually be discontinuous at the boundary.
Fix: Choose coefficients so that the basis functions match up on either side of a knot.
This imposes a linear constraint on the coefficients at each knot.
Derivatives
We can go further, and choose coefficients so that the derivatives agree on either side of a knot. This is again a linear constraint.
A spline is a piecewise polynomial of degree d such that the derivatives up to order d − 1 agree on each side of the knots.
The usual case is cubic splines (d = 3).
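SciPy can fit such a regression spline by least squares, given the interior knots; a sketch on synthetic data:

import numpy as np
from scipy.interpolate import LSQUnivariateSpline

rng = np.random.default_rng(0)
x = np.sort(rng.uniform(0, 10, 200))
y = np.sin(x) + 0.2 * rng.standard_normal(200)

knots = [2.5, 5.0, 7.5]  # interior knots xi_1, ..., xi_3
# k=3 gives a cubic spline: derivatives up to order 2 agree at the knots.
spline = LSQUnivariateSpline(x, y, knots, k=3)
print(spline(4.0))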
Smoothing Splines
Splines can be fit by constrained linear regression – ordinary linear regression with linear constraints on the coefficients.
These fits can be pretty wiggly, especially as the number of knots increases.
An alternative technique is smoothing splines – add a penalty term for wiggliness.
Smoothness Penalty
A fairly general objective with penalty is of the form
∑_i (y_i − f(x_i))² + λ ∫ f″(t)² dt.
f is chosen from some family, such as cubic splines, to minimize this.
The f″ term is easy to compute for polynomials.
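SciPy's UnivariateSpline is one implementation of this idea; a sketch on synthetic data (note its smoothing parameter s bounds the residual sum of squares, rather than being λ itself):

import numpy as np
from scipy.interpolate import UnivariateSpline

rng = np.random.default_rng(0)
x = np.sort(rng.uniform(0, 10, 200))
y = np.sin(x) + 0.2 * rng.standard_normal(200)

# Larger s means a smoother (less wiggly) curve.
smooth = UnivariateSpline(x, y, k=3, s=10.0)
print(smooth(4.0))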
Local Regression
Piecewise polynomials are an extreme solution to the problem of making coefficients depend only on nearby data.
Local regression is a less extreme solution, where for each point we blend together nearby points.
Nearest-neighbor average
Simplest technique: for a point x, let f(x) be the average of the y_i for the k x_i's nearest to x.
f is a discontinuous step function, because an observation is either used or not used.
Alternative: take a weighted average, where the weights die off smoothly.
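The simple version takes only a few lines of NumPy (an illustrative sketch):

import numpy as np

def knn_average(x0, x, y, k):
    # Average the y values of the k observations nearest to x0.
    nearest = np.argsort(np.abs(x - x0))[:k]
    return y[nearest].mean()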
Kernels
Let K(x, x′) be a function with a maximum at x = x′ that goes to zero as x − x′ → ±∞. K is called the kernel.
Let f(x) be
f(x) = ∑_i K(x, x_i) y_i / ∑_i K(x, x_i).
The K(x, x_i) are the weights.
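Written out in NumPy, with a Gaussian kernel as one common choice of K (an illustrative sketch, not the only option):

import numpy as np

def gaussian_kernel(x, xi, h=1.0):
    # h is the bandwidth: it controls how quickly the weights die off.
    return np.exp(-0.5 * ((x - xi) / h) ** 2)

def kernel_regression(x0, x, y, h=1.0):
    # f(x0) = sum_i K(x0, x_i) y_i / sum_i K(x0, x_i)
    w = gaussian_kernel(x0, x, h)
    return np.sum(w * y) / np.sum(w)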
Nonlinear multivariate models
Once we combine nonlinearity and many variables, things get considerably harder. Some techniques:
Multidimensional splines.
Multidimensional local regression.
The curse of dimensionality: these methods do not scale to many variables.
High-tech approaches
There are high-tech approaches such as
Neural nets
Genetic algorithms
Support vector machines
Generalized Additive Models
Generalized additive models are a low-tech approach:
Assume that the influence of each variable is well explained by a spline.
Add them together to get the total influence:
y = f_1(x_1) + ⋯ + f_n(x_n).
Fitting Generalized Additive Models
These models can be fit in several ways. One natural way generalizes smoothing splines.
Minimize a penalized objective of the form
∑_i ( y_i − ∑_j f_j(x_ij) )² + ∑_j λ_j ∫ f_j″(t)² dt.
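One standard way to do this is backfitting: cycle through the variables, smoothing the partial residuals against each in turn. A bare-bones sketch (illustrative; it uses SciPy smoothing splines as the per-variable smoother and assumes no duplicate x values):

import numpy as np
from scipy.interpolate import UnivariateSpline

def backfit_gam(X, y, n_iter=10, s=10.0):
    n, p = X.shape
    alpha = y.mean()
    fits = [np.zeros(n) for _ in range(p)]
    for _ in range(n_iter):
        for j in range(p):
            # Partial residuals: remove everything except variable j's effect.
            resid = y - alpha - sum(fits[i] for i in range(p) if i != j)
            order = np.argsort(X[:, j])
            spl = UnivariateSpline(X[order, j], resid[order], k=3, s=s)
            fits[j] = spl(X[:, j])
            fits[j] -= fits[j].mean()  # keep each f_j centered
    return alpha, fits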