Stochastic Models: Machine Learning
Walt Pohl
Universität Zürich, Department of Business Administration
March 19, 2015
What is Machine Learning?
Machine learning is aimed at prediction, not hypothesis testing.
Use a high-dimensional approximation to extract the maximum predictability.
No effort is made to interpret individual parameters.
The Prediction Problem
The basic framework is:
Predict Y, given some vector of predictors, X.
Find a function, f(X), to predict Y:
Y = f(X) + ε,
where ε is random.
The space of all possible f's is infinite-dimensional, but we choose a finite-dimensional approximation.
Example: Polynomial Regression
Use regression to fit a high-degree polynomial – degree 10 or 20.
f(X) = b_0 + b_1 X + ⋯ + b_N X^N
Coefficients are hard to interpret. What does the coefficient of X^8 mean?
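As a concrete illustration, a minimal sketch in Python with NumPy (the data is synthetic, invented purely for the example):

import numpy as np

# Synthetic data: a noisy nonlinear signal.
rng = np.random.default_rng(0)
x = np.linspace(-1, 1, 50)
y = np.sin(3 * x) + 0.1 * rng.standard_normal(50)

# Fit f(X) = b_0 + b_1 X + ... + b_N X^N with N = 10.
coeffs = np.polyfit(x, y, deg=10)
f = np.poly1d(coeffs)
print(f(0.5))  # prediction at X = 0.5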
In Versus Out of Sample
Use a high enough degree, and you can fit the data perfectly – in sample.
Out of sample, the fit will be terrible – much worse than a linear regression.
This is the problem of overfitting.
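A quick illustrative sketch of the problem (synthetic data; the exact numbers will vary):

import numpy as np

rng = np.random.default_rng(1)
x = np.sort(rng.uniform(-1, 1, 40))
y = x + 0.3 * rng.standard_normal(40)

# Split into training and test halves.
x_tr, y_tr = x[::2], y[::2]
x_te, y_te = x[1::2], y[1::2]

for deg in (1, 10):
    f = np.poly1d(np.polyfit(x_tr, y_tr, deg))
    print(deg,
          np.mean((y_tr - f(x_tr)) ** 2),   # in-sample error shrinks with degree
          np.mean((y_te - f(x_te)) ** 2))   # out-of-sample error grows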
The solution? Penalize complexity.
Loss Function
Choose a loss function, L(Y, f(X)).
Choose f to minimize
E(L(Y, f(X))).
L can be modified to penalize complexity.
Penalizing Complexity via Loss Function
One natural choice for L is squared-error loss:
(Y − f(X))²
This leads to regression.
But now, let’s introduce a term to penalize complexity.
Example:
(Y − f(X))² + λ ∑_i β_i²
Ridge Regression
Minimizing this penalized loss gives you ridge regression.
Note that λ – known as the ridge parameter – cannot be estimated from the data. It must be given.
λ = 0 is ordinary regression. λ → ∞ will force the coefficients towards zero.
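In practice, scikit-learn provides this directly; a minimal sketch on synthetic data (scikit-learn calls the ridge parameter alpha):

import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)
X = rng.standard_normal((100, 5))
y = X @ np.array([1.0, 0.5, 0.0, 0.0, -2.0]) + 0.1 * rng.standard_normal(100)

model = Ridge(alpha=1.0).fit(X, y)  # alpha plays the role of lambda
print(model.coef_)  # all coefficients shrunk towards zero, none exactly zero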
Variable Selection Methods
Ridge regression works by pushing all coefficient estimates towards zero.
A natural alternative is to set some coefficients exactly to zero – that is, to drop some variables entirely.
Subset Selection
Subset selection works by choosing a subset of the variables and regressing on only those.
Several standard techniques:
Best subset
Forward stepwise
Backward stepwise
Best subset
For a fixed k, choose the k variables that maximize the R².
Downside: can be computationally expensive.
Unspecified: k.
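A brute-force sketch (illustrative only; the exhaustive search over subsets is what makes it expensive):

import numpy as np
from itertools import combinations
from sklearn.linear_model import LinearRegression

def best_subset(X, y, k):
    # Try every size-k subset of columns; keep the one with the highest R^2.
    best_r2, best_cols = -np.inf, None
    for subset in combinations(range(X.shape[1]), k):
        cols = list(subset)
        r2 = LinearRegression().fit(X[:, cols], y).score(X[:, cols], y)
        if r2 > best_r2:
            best_r2, best_cols = r2, cols
    return best_cols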
Forward selection
Start with only the intercept, and add one variable at a time. Choose the variable that increases the R² the most.
Downside: not optimal fit.
Unspecified: when to stop.
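A greedy sketch of forward selection (illustrative; here we simply stop after k steps, since the stopping rule is left open):

import numpy as np
from sklearn.linear_model import LinearRegression

def forward_select(X, y, k):
    selected, remaining = [], list(range(X.shape[1]))
    for _ in range(k):
        # Add whichever remaining variable raises R^2 the most.
        def r2_with(j):
            cols = selected + [j]
            return LinearRegression().fit(X[:, cols], y).score(X[:, cols], y)
        best = max(remaining, key=r2_with)
        selected.append(best)
        remaining.remove(best)
    return selected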
Backward selection
Start with all variables, and remove one variable at a time. Choose the variable that decreases the R² the least.
Downside: not optimal fit.
Unspecified: when to stop.
The Lasso
The lasso superficially resembles ridge regression, but has some of the aspects of subset selection.
It’s regression with a penalty term,
∑_i ( y_i − α − ∑_j β_j x_ij )² + λ ∑_j |β_j|,
but the minimizer typically sets some of the coefficients exactly to zero.
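scikit-learn's Lasso implements this; a minimal sketch on synthetic data (alpha plays the role of λ):

import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
X = rng.standard_normal((100, 5))
y = X @ np.array([1.0, 0.0, 0.0, 0.0, -2.0]) + 0.1 * rng.standard_normal(100)

model = Lasso(alpha=0.1).fit(X, y)
print(model.coef_)  # typically several coefficients come out exactly zero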
Other Penalties
Other penalties appear in the literature:
p-norm (for 1 ≤ p ≤ 2): ∑_j |β_j|^p
elastic net: ∑_j ( α β_j² + (1 − α) |β_j| )
Both are somewhere between ridge and lasso in behavior.
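For example, scikit-learn's ElasticNet implements an elastic-net penalty, though its parameterization differs somewhat from the formula above (its alpha is the overall λ; l1_ratio is the mixing weight):

import numpy as np
from sklearn.linear_model import ElasticNet

rng = np.random.default_rng(0)
X = rng.standard_normal((100, 5))
y = X[:, 0] - 2 * X[:, 4] + 0.1 * rng.standard_normal(100)

model = ElasticNet(alpha=0.1, l1_ratio=0.5).fit(X, y)
print(model.coef_)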
Nonlinear Models
The previous techniques are for linear models of large numbers of variables. What about nonlinear models of a few variables?
In theory we can treat this problem as a special case of the previous problem: choose a big family of basis functions, à la polynomial regression.
We now consider some more intrinsically nonlinear methods:
Splines
Local regression
Generalized additive models
Piecewise Basis Functions
Polynomial regression has the downside that every observation affects every coefficient.
Alternative: divide the x-axis into intervals. (The endpoints are known as knots.) For each interval, choose a set of basis functions that are zero outside that interval.
Regression coefficients will be unaffected by observations outside an interval.
Choosing Piecewise Basis Functions
General recipe:
Choose knots ξ_1, …, ξ_n.
Choose an arbitrary set of basis functions: constants, linear functions, polynomials, etc.
Let I(ξ_i ≤ x < ξ_{i+1}) be the indicator function that is 1 on the interval [ξ_i, ξ_{i+1}) and 0 otherwise. Then functions of the form
f_i(x − ξ_i) I(ξ_i ≤ x < ξ_{i+1})
do the job.
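The simplest instance of this recipe takes constant basis functions: on each interval, the least-squares fit is just the mean of the y values there. A minimal sketch (assuming every interval contains data):

import numpy as np

def piecewise_constant_fit(x, y, knots):
    # With constant basis functions, least squares on each interval
    # [xi_i, xi_{i+1}) reduces to the mean of the y values in it.
    means = []
    for lo, hi in zip(knots[:-1], knots[1:]):
        mask = (x >= lo) & (x < hi)
        means.append(y[mask].mean())
    return means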
Continuity
Downside: A regression fit will usually be discontinuous at the boundary.
Fix: Choose coefficients so that the basis functions match up on either side of a knot.
This imposes a linear constraint on the coefficients at each knot.
Derivatives
We can go further, and choose coefficients so that the derivatives agree on either side of a knot. This is again a linear constraint.
A spline is a piecewise polynomial of degree d such that the derivatives up to order d − 1 agree on each side of the knots.
The usual case is cubic splines (d = 3).
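SciPy can fit such a regression spline by least squares, given the interior knots; a sketch on synthetic data:

import numpy as np
from scipy.interpolate import LSQUnivariateSpline

rng = np.random.default_rng(0)
x = np.sort(rng.uniform(0, 10, 200))
y = np.sin(x) + 0.2 * rng.standard_normal(200)

knots = [2.5, 5.0, 7.5]  # interior knots xi_1, ..., xi_3
# k=3 gives a cubic spline: derivatives up to order 2 agree at the knots.
spline = LSQUnivariateSpline(x, y, knots, k=3)
print(spline(4.0))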
Smoothing Splines
Splines can be fit by constrained linear regression – ordinary linear regression with linear constraints on the coefficients.
These fits can be pretty wiggly, especially as the number of knots increases.
An alternative technique is smoothing splines – add a penalty term for wiggliness.
Smoothness Penalty
A fairly general objective with penalty is of the form
∑_i (y_i − f(x_i))² + λ ∫ f″(t)² dt.
f is chosen from some family, such as cubic splines, to minimize this.
The f″ term is easy to compute for polynomials.
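SciPy's UnivariateSpline is one implementation of this idea; a sketch on synthetic data (note its smoothing parameter s bounds the residual sum of squares, rather than being λ itself):

import numpy as np
from scipy.interpolate import UnivariateSpline

rng = np.random.default_rng(0)
x = np.sort(rng.uniform(0, 10, 200))
y = np.sin(x) + 0.2 * rng.standard_normal(200)

# Larger s means a smoother (less wiggly) curve.
smooth = UnivariateSpline(x, y, k=3, s=10.0)
print(smooth(4.0))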
Local Regression
Piecewise polynomials are an extreme solution to the problem of making coefficients depend only on nearby data.
Local regression is a less extreme solution, where for each point we blend together nearby points.
Nearest-neighbor average
Simplest technique: for a point x, let f(x) be the average of the y_i for the k x_i's nearest to x.
f is a discontinuous step function, because an observation is either used or not used.
Alternative: take a weighted average, where the weights die off smoothly.
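The simple version takes only a few lines of NumPy (an illustrative sketch):

import numpy as np

def knn_average(x0, x, y, k):
    # Average the y values of the k observations nearest to x0.
    nearest = np.argsort(np.abs(x - x0))[:k]
    return y[nearest].mean()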
Kernels
Let K(x, x′) be a function with a maximum at x = x′ that goes to zero as x − x′ → ±∞. K is called the kernel.
Let f(x) be
f(x) = ∑_i K(x, x_i) y_i / ∑_i K(x, x_i).
The K(x, x_i) are the weights.
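Written out in NumPy, with a Gaussian kernel as one common choice of K (an illustrative sketch, not the only option):

import numpy as np

def gaussian_kernel(x, xi, h=1.0):
    # h is the bandwidth: it controls how quickly the weights die off.
    return np.exp(-0.5 * ((x - xi) / h) ** 2)

def kernel_regression(x0, x, y, h=1.0):
    # f(x0) = sum_i K(x0, x_i) y_i / sum_i K(x0, x_i)
    w = gaussian_kernel(x0, x, h)
    return np.sum(w * y) / np.sum(w)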
Nonlinear multivariate models
Once we combine nonlinearity and many variables, things get considerably harder. Some techniques:
Multidimensional splines.
Multidimensional local regression.
The curse of dimensionality: these methods do not scale to many variables.
High-tech approaches
There are high-tech approaches such as
Neural nets
Genetic algorithms
Support vector machines
Generalized Additive Models
Generalized additive models are a low-tech approach:
Assume that the influence of each variable is well explained by a spline.
Add them together to get the total influence:
y = f_1(x_1) + ⋯ + f_n(x_n).
Fitting Generalized Additive Models
These models can be fit in several ways. One natural way generalizes smoothing splines.
Minimize a penalized objective of the form
∑_i ( y_i − ∑_j f_j(x_ij) )² + ∑_j λ_j ∫ f_j″(t)² dt.
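One standard way to do this is backfitting: cycle through the variables, smoothing the partial residuals against each in turn. A bare-bones sketch (illustrative; it uses SciPy smoothing splines as the per-variable smoother and assumes no duplicate x values):

import numpy as np
from scipy.interpolate import UnivariateSpline

def backfit_gam(X, y, n_iter=10, s=10.0):
    n, p = X.shape
    alpha = y.mean()
    fits = [np.zeros(n) for _ in range(p)]
    for _ in range(n_iter):
        for j in range(p):
            # Partial residuals: remove everything except variable j's effect.
            resid = y - alpha - sum(fits[i] for i in range(p) if i != j)
            order = np.argsort(X[:, j])
            spl = UnivariateSpline(X[order, j], resid[order], k=3, s=s)
            fits[j] = spl(X[:, j])
            fits[j] -= fits[j].mean()  # keep each f_j centered
    return alpha, fits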