Chapter 1: Introduction
Chapter 2: Overview of Supervised Learning
2006.01.20
Supervised learning
- Training data set: several features plus an outcome for each example
- Build a learner from the training data
- Predict the unseen outcome of new data from its observed features
An example of supervised learning: email spam
[Figure: a learner is built from known emails labeled "normal" or "spam", then classifies new, unknown emails as normal or spam]
Input & Output
- Input = predictor = independent variable
- Output = response = dependent variable
Output Types
- Quantitative >> regression. Ex) stock price, temperature, age
- Qualitative >> classification. Ex) yes/no, spam/normal
Input Types
- Quantitative
- Qualitative
- Ordered categorical. Ex) small, medium, big
Terminology
- X : input; X_j : the j-th component of X
- X (boldface) : matrix of observed inputs; x_j : the j-th observed value
- Y : quantitative output; \hat{Y} : the prediction of Y
- G : qualitative output
General model
- Given input X and output Y, assume Y depends on X through some function f
- f is unknown; we want to estimate f from a known data set (the training data)
Two simple methods
- Linear model (linear regression)
- Nearest-neighbor method
Linear model
- Given a vector of input features X = (X_1, ..., X_p), assume the linear relationship:
  \hat{Y} = \hat{\beta}_0 + \sum_{j=1}^{p} X_j \hat{\beta}_j
- Least squares criterion: choose the coefficients minimizing the residual sum of squares (a fitting sketch in code follows)
  \min_{\beta} \sum_{i=1}^{N} (y_i - x_i^T \beta)^2
[Figure: classification example in two dimensions (1), linear decision boundary]
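As a minimal sketch of the least-squares fit (the solver call and toy data are illustrative, not from the slides):

```python
import numpy as np

def fit_least_squares(X, y):
    """beta_hat = argmin_b sum_i (y_i - x_i^T b)^2, via numpy's least-squares solver."""
    Xb = np.column_stack([np.ones(len(X)), X])   # prepend 1s so beta[0] is the intercept
    beta, *_ = np.linalg.lstsq(Xb, y, rcond=None)
    return beta

# Toy usage: recover a known linear relationship from noisy observations.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = 1.0 + X @ np.array([2.0, -1.0, 0.5]) + 0.1 * rng.normal(size=100)
print(fit_least_squares(X, y))  # approximately [1.0, 2.0, -1.0, 0.5]
```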
Nearest neighbor method
- Majority vote within the k nearest neighbors of the new point
- The choice of k matters: in the figure, the new point is classified brown with k = 1 but green with k = 3 (a code sketch follows)
[Figure: classification example in two dimensions (2), k-nearest-neighbor decision boundary]
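A minimal majority-vote sketch; the toy points are chosen so that k = 1 and k = 3 disagree, mirroring the slide's brown-vs-green example:

```python
import numpy as np
from collections import Counter

def knn_classify(X_train, g_train, x_new, k):
    """Classify x_new by majority vote among its k nearest training points."""
    dists = np.linalg.norm(X_train - x_new, axis=1)   # Euclidean distances
    nearest = np.argsort(dists)[:k]                   # indices of the k closest points
    return Counter(g_train[i] for i in nearest).most_common(1)[0][0]

# One brown point sits closest, with two green points just beyond it.
X_train = np.array([[0.5, 0.0], [1.0, 0.0], [1.0, 0.3], [-1.0, 0.0]])
g_train = np.array(["brown", "green", "green", "brown"])
x_new = np.array([0.55, 0.0])
print(knn_classify(X_train, g_train, x_new, k=1))  # brown: the single closest point wins
print(knn_classify(X_train, g_train, x_new, k=3))  # green: two of the three nearest are green
```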
Linear model vs. k-nearest neighbor
- Linear model: p parameters; stable, smooth; low variance, high bias
- k-nearest neighbor: effectively N/k parameters; unstable, wiggly; high variance, low bias
- Each method has its own situations for which it works best.
Misclassification curves
Enhanced Methods
- Kernel methods using weights (a weighted-average sketch follows this list)
- Modifying the distance kernels
- Locally weighted least squares
- Expansion of inputs for arbitrarily complex models
- Projection pursuit & neural networks
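As one concrete instance of the kernel-weighting idea, a minimal Nadaraya-Watson-style weighted average (the Gaussian kernel and bandwidth lam are illustrative assumptions):

```python
import numpy as np

def kernel_smooth(x_train, y_train, x0, lam=0.5):
    """Kernel-weighted average: nearby points get large weights, distant ones ~ 0."""
    w = np.exp(-0.5 * ((x_train - x0) / lam) ** 2)   # Gaussian kernel, bandwidth lam
    return np.sum(w * y_train) / np.sum(w)

# Toy usage: smooth noisy samples of sin(x) at a query point.
rng = np.random.default_rng(1)
x_train = rng.uniform(0, 2 * np.pi, 200)
y_train = np.sin(x_train) + 0.3 * rng.normal(size=200)
print(kernel_smooth(x_train, y_train, x0=np.pi / 2))  # close to sin(pi/2) = 1
```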
Statistical decision theory (1)
- Given input X in R^p and output Y in R, with joint distribution Pr(X, Y)
- Looking for a predicting function f(X)
- Squared error loss: EPE(f) = E[(Y - f(X))^2]
- Minimizing EPE pointwise gives the regression function f(x) = E(Y | X = x) (derivation below)
- Nearest-neighbor methods approximate this directly by a local average:
  \hat{f}(x) = Ave(y_i | x_i \in N_k(x))
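Written out, the minimization conditions on X and then optimizes pointwise (the standard argument from ESL §2.4):

```latex
\begin{aligned}
\mathrm{EPE}(f) &= \mathrm{E}\,[Y - f(X)]^2
                 = \mathrm{E}_X\,\mathrm{E}_{Y|X}\!\big([Y - f(X)]^2 \mid X\big),\\
f(x) &= \operatorname*{argmin}_{c}\; \mathrm{E}_{Y|X}\!\big([Y - c]^2 \mid X = x\big)
      = \mathrm{E}(Y \mid X = x).
\end{aligned}
```

k-nearest-neighbor averaging estimates this conditional expectation directly, with two approximations: the expectation is replaced by a sample average, and conditioning on a point is relaxed to conditioning on a neighborhood.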
Statistical decision theory (2)
- k-nearest neighbor: \hat{f}(x) = Ave(y_i | x_i \in N_k(x)) converges to E(Y | X = x)
  as N, k \to \infty with k/N \to 0
- In practice we rarely have that many samples: insufficient samples! Curse of dimensionality!
- Linear model: assume f(x) \approx x^T \beta; minimizing EPE then gives \beta = [E(X X^T)]^{-1} E(X Y)
- But the true function might not be linear!
Statistical decision theory (3)
- If we replace the squared error loss with the L1 loss E|Y - f(X)|, the solution becomes the conditional median:
  \hat{f}(x) = median(Y | X = x)
- More robust than the conditional mean, but the L1 criterion is discontinuous in its derivatives
Statistical decision theory (4)
- G : categorical output variable; L : loss function; EPE = E[L(G, \hat{G}(X))]
- With 0-1 loss, minimizing EPE yields the Bayes classifier (sketched below):
  \hat{G}(x) = argmax_g Pr(g | X = x)
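A minimal sketch of the Bayes rule when the priors and class-conditional densities are known; the two Gaussian classes below are illustrative assumptions, not from the slides:

```python
import numpy as np
from scipy.stats import multivariate_normal

# Hypothetical two-class problem with known priors and class densities.
priors = {"orange": 0.5, "blue": 0.5}
densities = {
    "orange": multivariate_normal(mean=[1.0, 0.0], cov=np.eye(2)),
    "blue": multivariate_normal(mean=[-1.0, 0.0], cov=np.eye(2)),
}

def bayes_classify(x):
    # argmax_g Pr(g) * Pr(x | g), which is proportional to the posterior Pr(g | X = x)
    return max(priors, key=lambda g: priors[g] * densities[g].pdf(x))

print(bayes_classify([0.8, 0.2]))   # orange: the point is closer to the orange mean
print(bayes_classify([-0.5, 0.0]))  # blue
```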
References
- Reading group on "The Elements of Statistical Learning", overview.ppt, http://sifaka.cs.uiuc.edu/taotao/stat.html
- Welcome to STAT 894, SupervisedLearningOVERVIEW05.pdf, http://www.stat.ohio-state.edu/~goel/STATLEARN/
- The Matrix Cookbook, http://www2.imm.dtu.dk/pubdb/views/edoc_download.php/3274/pdf/imm3274.pdf
- Sheldon Ross, A First Course in Probability
2.5 Local Methods in High Dimensions
- With a reasonably large set of training data, we could always approximate the theoretically optimal conditional expectation by k-nearest-neighbor averaging.
- The curse of dimensionality: in p = 10 dimensions, to capture 1% of the data to form a local average, we must cover 63% of the range of each input variable.
  Expected edge length: e_p(r) = r^{1/p}, so e_{10}(0.01) = 0.01^{1/10} \approx 0.63
- All sample points are close to an edge of the sample. Median distance from the origin to the closest of N data points uniform in the p-dimensional unit ball (numeric check below):
  d(p, N) = (1 - (1/2)^{1/N})^{1/p}
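A quick numeric check of both formulas (function names are ad hoc):

```python
# Expected edge length of a sub-cube capturing a fraction r of uniformly
# distributed data in p dimensions: e_p(r) = r**(1/p).
def edge_length(r, p):
    return r ** (1.0 / p)

# Median distance from the origin to the closest of N points uniform in the
# p-dimensional unit ball: d(p, N) = (1 - (1/2)**(1/N))**(1/p).
def median_closest_distance(p, N):
    return (1.0 - 0.5 ** (1.0 / N)) ** (1.0 / p)

print(edge_length(0.01, p=10))           # ~0.63: the 63% on the slide
print(median_closest_distance(10, 500))  # ~0.52: more than halfway to the boundary
```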
2.5 Local Methods in High Dimensions: Example, 1-NN vs. Linear
- 1-NN: as p increases, the MSE and its bias component tend to 1.0
- Linear model: the expected EPE at x_0 increases only linearly as a function of p
- Bias-variance decomposition of the MSE at a test point x_0:
  MSE(x_0) = E_T[(f(x_0) - \hat{y}_0)^2]
           = E_T[(\hat{y}_0 - E_T(\hat{y}_0))^2] + [E_T(\hat{y}_0) - f(x_0)]^2
           = Var_T(\hat{y}_0) + Bias^2(\hat{y}_0)
- Expected prediction error at x_0 for the linear model with Y = X^T \beta + \epsilon:
  EPE(x_0) = E_{y_0|x_0} E_T[(y_0 - \hat{y}_0)^2]
           = Var(y_0 | x_0) + E_T[\hat{y}_0 - E_T(\hat{y}_0)]^2 + [E_T(\hat{y}_0) - x_0^T \beta]^2
           = Var(y_0 | x_0) + Var_T(\hat{y}_0) + Bias^2(\hat{y}_0)
           = \sigma^2 + E_T[x_0^T (X^T X)^{-1} x_0] \sigma^2 + 0^2
  The bias is 0, and for large N the variance term is approximately (p/N) \sigma^2.
By relying on rigid assumptions, the linear model has no bias at all and negligible variance, while the error in 1-nearest neighbor is larger.
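The example behind these numbers (cf. ESL) uses f(x) = exp(-8 ||x||^2), no noise, and N = 1000 design points uniform on [-1, 1]^p; below is a minimal Monte Carlo sketch of the 1-NN MSE at x_0 = 0 (trial count and seed are arbitrary choices):

```python
import numpy as np

def mse_1nn_at_origin(p, N=1000, trials=200, seed=0):
    """Monte Carlo MSE of the 1-NN estimate of f(0) = 1 for f(x) = exp(-8 ||x||^2)."""
    rng = np.random.default_rng(seed)
    errs = []
    for _ in range(trials):
        X = rng.uniform(-1.0, 1.0, size=(N, p))
        i = np.argmin(np.linalg.norm(X, axis=1))   # training point nearest to the origin
        y_hat = np.exp(-8.0 * np.sum(X[i] ** 2))   # its (noiseless) response
        errs.append((1.0 - y_hat) ** 2)
    return np.mean(errs)

for p in [1, 2, 4, 8, 10]:
    print(p, mse_1nn_at_origin(p))  # the MSE climbs toward 1.0 as p grows: bias dominates
```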
2.6 Statistical Models, Supervised Learning and Function Approximation
- Goal: finding a useful approximation \hat{f}(x) to the function f(x) that underlies the predictive relationship between the inputs and outputs
- Supervised learning: the machine learning point of view
- Function approximation: the mathematics and statistics point of view
2.7 Structured Regression Models
- Nearest-neighbor and other local methods face problems in high dimensions; they may be inappropriate even in low dimensions
- Hence the need for structured approaches
- Difficulty of the problem: there are infinitely many solutions minimizing the residual sum of squares (a toy demonstration follows)
  RSS(f) = \sum_{i=1}^{N} (y_i - f(x_i))^2
- A unique solution comes only from restrictions on f
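A toy demonstration of the non-uniqueness: two different interpolants of the same hypothetical data points both drive RSS to zero:

```python
import numpy as np

x = np.array([0.0, 1.0, 2.0])
y = np.array([0.0, 1.0, 4.0])

f1 = lambda t: np.interp(t, x, y)   # piecewise-linear interpolant through the points
coef = np.polyfit(x, y, 2)          # exact quadratic through the same three points
f2 = lambda t: np.polyval(coef, t)

for f in (f1, f2):
    print(np.sum((y - f(x)) ** 2))  # ~0 for both: two distinct functions minimize RSS
```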
2.8 Classes of Restricted Estimators
Methods categorized by the nature of the restrictions:
- Roughness penalty and Bayesian methods: penalize functions that vary too rapidly over small regions of input space
- Kernel methods and local regression: explicitly specify the nature of the local neighborhood (the kernel function); need adaptation in high dimensions
- Basis functions and dictionary methods: linear expansion of basis functions (sketched after this list)
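A minimal sketch of a basis-expansion fit; the polynomial dictionary and toy data are illustrative choices:

```python
import numpy as np

def poly_basis(x, M=6):
    """Dictionary h_m(x) = x^m, m = 0..M-1, stacked as an N x M design matrix."""
    return np.column_stack([x ** m for m in range(M)])

# Fit f_theta(x) = sum_m theta_m h_m(x) to noisy samples of sin(3x) by least squares.
rng = np.random.default_rng(2)
x = rng.uniform(-1.0, 1.0, 100)
y = np.sin(3.0 * x) + 0.1 * rng.normal(size=100)

H = poly_basis(x)
theta, *_ = np.linalg.lstsq(H, y, rcond=None)
print(np.sum((y - H @ theta) ** 2))  # training RSS of the fitted expansion
```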
2.9 Model Selection and the Bias-Variance Tradeoff
- All models have a smoothing or complexity parameter to be determined: the multiplier of the penalty term, the width of the kernel, or the number of basis functions
- Bias-variance tradeoff: the irreducible error contributed by \epsilon cannot be reduced, no matter the model
- Bias and variance, however, can be traded: reducing one tends to increase the other. Tradeoff!
Bias-Variance tradeoff in kNN
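Assuming Y = f(X) + \epsilon with Var(\epsilon) = \sigma^2 and the training inputs held fixed, the standard kNN decomposition (cf. ESL) makes the tradeoff explicit:

```latex
\mathrm{EPE}_k(x_0)
  = \sigma^2
  + \Bigl[f(x_0) - \frac{1}{k}\sum_{\ell=1}^{k} f\bigl(x_{(\ell)}\bigr)\Bigr]^2
  + \frac{\sigma^2}{k}
```

Here x_{(\ell)} denotes the \ell-th nearest neighbor of x_0: the variance term \sigma^2/k shrinks as k grows, while the squared-bias term typically grows as the neighborhood pulls in points where f differs from f(x_0).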
Model complexity
[Figure: training error and test error as a function of model complexity (low to high). Left end: high bias, low variance; right end: low bias, high variance.]