Chapter 1: Introduction
Chapter 2: Overview of Supervised Learning
2006.01.20
Supervised learning
- Training data set: several features plus an outcome for each example
- Build a learner from the training data
- Predict the unseen outcome of new data from its observed features
An example of supervised learning: email spam
[Figure: a learner is built from known emails labeled "normal" or "spam", then classifies new, unknown emails as normal or spam]
Input & Output
- Input = predictor = independent variable
- Output = response = dependent variable
Output Types
- Quantitative >> regression. Ex) stock price, temperature, age
- Qualitative >> classification. Ex) yes/no, spam/normal
Input Types
- Quantitative
- Qualitative
- Ordered categorical. Ex) small, medium, big
Terminology
- X : input; X_j : the j-th component of X
- X (boldface) : matrix of observed inputs; x_j : the j-th observed value
- Y : quantitative output; \hat{Y} : the prediction of Y
- G : qualitative output
General model
- Given input X and output Y, assume Y depends on X through some function f
- f is unknown; we want to estimate f from a known data set (the training data)
Two simple methods
- Linear model (linear regression)
- Nearest-neighbor method
Linear model
- Given a vector of input features X = (X_1, ..., X_p), assume the linear relationship:
  \hat{Y} = \hat{\beta}_0 + \sum_{j=1}^{p} X_j \hat{\beta}_j
- Least squares criterion: choose the coefficients minimizing the residual sum of squares (a fitting sketch in code follows)
  \min_{\beta} \sum_{i=1}^{N} (y_i - x_i^T \beta)^2
[Figure: classification example in two dimensions (1), linear decision boundary]
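As a minimal sketch of the least-squares fit (the solver call and toy data are illustrative, not from the slides):

```python
import numpy as np

def fit_least_squares(X, y):
    """beta_hat = argmin_b sum_i (y_i - x_i^T b)^2, via numpy's least-squares solver."""
    Xb = np.column_stack([np.ones(len(X)), X])   # prepend 1s so beta[0] is the intercept
    beta, *_ = np.linalg.lstsq(Xb, y, rcond=None)
    return beta

# Toy usage: recover a known linear relationship from noisy observations.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = 1.0 + X @ np.array([2.0, -1.0, 0.5]) + 0.1 * rng.normal(size=100)
print(fit_least_squares(X, y))  # approximately [1.0, 2.0, -1.0, 0.5]
```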
Nearest neighbor method
- Majority vote within the k nearest neighbors of the new point
- The choice of k matters: in the figure, the new point is classified brown with k = 1 but green with k = 3 (a code sketch follows)
[Figure: classification example in two dimensions (2), k-nearest-neighbor decision boundary]
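A minimal majority-vote sketch; the toy points are chosen so that k = 1 and k = 3 disagree, mirroring the slide's brown-vs-green example:

```python
import numpy as np
from collections import Counter

def knn_classify(X_train, g_train, x_new, k):
    """Classify x_new by majority vote among its k nearest training points."""
    dists = np.linalg.norm(X_train - x_new, axis=1)   # Euclidean distances
    nearest = np.argsort(dists)[:k]                   # indices of the k closest points
    return Counter(g_train[i] for i in nearest).most_common(1)[0][0]

# One brown point sits closest, with two green points just beyond it.
X_train = np.array([[0.5, 0.0], [1.0, 0.0], [1.0, 0.3], [-1.0, 0.0]])
g_train = np.array(["brown", "green", "green", "brown"])
x_new = np.array([0.55, 0.0])
print(knn_classify(X_train, g_train, x_new, k=1))  # brown: the single closest point wins
print(knn_classify(X_train, g_train, x_new, k=3))  # green: two of the three nearest are green
```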
Linear model vs. k-nearest neighbor
- Linear model: p parameters; stable, smooth; low variance, high bias
- k-nearest neighbor: effectively N/k parameters; unstable, wiggly; high variance, low bias
- Each method has its own situations for which it works best.
Misclassification curves
Enhanced Methods
- Kernel methods using weights (a weighted-average sketch follows this list)
- Modifying the distance kernels
- Locally weighted least squares
- Expansion of inputs for arbitrarily complex models
- Projection pursuit & neural networks
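As one concrete instance of the kernel-weighting idea, a minimal Nadaraya-Watson-style weighted average (the Gaussian kernel and bandwidth lam are illustrative assumptions):

```python
import numpy as np

def kernel_smooth(x_train, y_train, x0, lam=0.5):
    """Kernel-weighted average: nearby points get large weights, distant ones ~ 0."""
    w = np.exp(-0.5 * ((x_train - x0) / lam) ** 2)   # Gaussian kernel, bandwidth lam
    return np.sum(w * y_train) / np.sum(w)

# Toy usage: smooth noisy samples of sin(x) at a query point.
rng = np.random.default_rng(1)
x_train = rng.uniform(0, 2 * np.pi, 200)
y_train = np.sin(x_train) + 0.3 * rng.normal(size=200)
print(kernel_smooth(x_train, y_train, x0=np.pi / 2))  # close to sin(pi/2) = 1
```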
Statistical decision theory (1)
- Given input X in R^p and output Y in R, with joint distribution Pr(X, Y)
- Looking for a predicting function f(X)
- Squared error loss: EPE(f) = E[(Y - f(X))^2]
- Minimizing EPE pointwise gives the regression function f(x) = E(Y | X = x) (derivation below)
- Nearest-neighbor methods approximate this directly by a local average:
  \hat{f}(x) = Ave(y_i | x_i \in N_k(x))
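Written out, the minimization conditions on X and then optimizes pointwise (the standard argument from ESL §2.4):

```latex
\begin{aligned}
\mathrm{EPE}(f) &= \mathrm{E}\,[Y - f(X)]^2
                 = \mathrm{E}_X\,\mathrm{E}_{Y|X}\!\big([Y - f(X)]^2 \mid X\big),\\
f(x) &= \operatorname*{argmin}_{c}\; \mathrm{E}_{Y|X}\!\big([Y - c]^2 \mid X = x\big)
      = \mathrm{E}(Y \mid X = x).
\end{aligned}
```

k-nearest-neighbor averaging estimates this conditional expectation directly, with two approximations: the expectation is replaced by a sample average, and conditioning on a point is relaxed to conditioning on a neighborhood.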
Statistical decision theory (2)
- k-nearest neighbor: \hat{f}(x) = Ave(y_i | x_i \in N_k(x)) converges to E(Y | X = x)
  as N, k \to \infty with k/N \to 0
- In practice we rarely have that many samples: insufficient samples! Curse of dimensionality!
- Linear model: assume f(x) \approx x^T \beta; minimizing EPE then gives \beta = [E(X X^T)]^{-1} E(X Y)
- But the true function might not be linear!
Statistical decision theory (3)
- If we replace the squared error loss with the L1 loss E|Y - f(X)|, the solution becomes the conditional median:
  \hat{f}(x) = median(Y | X = x)
- More robust than the conditional mean, but the L1 criterion is discontinuous in its derivatives
Statistical decision theory (4)
- G : categorical output variable; L : loss function; EPE = E[L(G, \hat{G}(X))]
- With 0-1 loss, minimizing EPE yields the Bayes classifier (sketched below):
  \hat{G}(x) = argmax_g Pr(g | X = x)
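A minimal sketch of the Bayes rule when the priors and class-conditional densities are known; the two Gaussian classes below are illustrative assumptions, not from the slides:

```python
import numpy as np
from scipy.stats import multivariate_normal

# Hypothetical two-class problem with known priors and class densities.
priors = {"orange": 0.5, "blue": 0.5}
densities = {
    "orange": multivariate_normal(mean=[1.0, 0.0], cov=np.eye(2)),
    "blue": multivariate_normal(mean=[-1.0, 0.0], cov=np.eye(2)),
}

def bayes_classify(x):
    # argmax_g Pr(g) * Pr(x | g), which is proportional to the posterior Pr(g | X = x)
    return max(priors, key=lambda g: priors[g] * densities[g].pdf(x))

print(bayes_classify([0.8, 0.2]))   # orange: the point is closer to the orange mean
print(bayes_classify([-0.5, 0.0]))  # blue
```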
References
- Reading group on "The Elements of Statistical Learning", overview.ppt, http://sifaka.cs.uiuc.edu/taotao/stat.html
- Welcome to STAT 894, SupervisedLearningOVERVIEW05.pdf, http://www.stat.ohio-state.edu/~goel/STATLEARN/
- The Matrix Cookbook, http://www2.imm.dtu.dk/pubdb/views/edoc_download.php/3274/pdf/imm3274.pdf
- Sheldon Ross, A First Course in Probability
2.5 Local Methods in High Dimensions
- With a reasonably large set of training data, we could always approximate the theoretically optimal conditional expectation by k-nearest-neighbor averaging.
- The curse of dimensionality: in p = 10 dimensions, to capture 1% of the data to form a local average, we must cover 63% of the range of each input variable.
  Expected edge length: e_p(r) = r^{1/p}, so e_{10}(0.01) = 0.01^{1/10} \approx 0.63
- All sample points are close to an edge of the sample. Median distance from the origin to the closest of N data points uniform in the p-dimensional unit ball (numeric check below):
  d(p, N) = (1 - (1/2)^{1/N})^{1/p}
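A quick numeric check of both formulas (function names are ad hoc):

```python
# Expected edge length of a sub-cube capturing a fraction r of uniformly
# distributed data in p dimensions: e_p(r) = r**(1/p).
def edge_length(r, p):
    return r ** (1.0 / p)

# Median distance from the origin to the closest of N points uniform in the
# p-dimensional unit ball: d(p, N) = (1 - (1/2)**(1/N))**(1/p).
def median_closest_distance(p, N):
    return (1.0 - 0.5 ** (1.0 / N)) ** (1.0 / p)

print(edge_length(0.01, p=10))           # ~0.63: the 63% on the slide
print(median_closest_distance(10, 500))  # ~0.52: more than halfway to the boundary
```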
2.5 Local Methods in High Dimensions: Example, 1-NN vs. Linear
- 1-NN: as p increases, the MSE and its bias component tend to 1.0
- Linear model: the expected EPE at x_0 increases only linearly as a function of p
- Bias-variance decomposition of the MSE at a test point x_0:
  MSE(x_0) = E_T[(f(x_0) - \hat{y}_0)^2]
           = E_T[(\hat{y}_0 - E_T(\hat{y}_0))^2] + [E_T(\hat{y}_0) - f(x_0)]^2
           = Var_T(\hat{y}_0) + Bias^2(\hat{y}_0)
- Expected prediction error at x_0 for the linear model with Y = X^T \beta + \epsilon:
  EPE(x_0) = E_{y_0|x_0} E_T[(y_0 - \hat{y}_0)^2]
           = Var(y_0 | x_0) + E_T[\hat{y}_0 - E_T(\hat{y}_0)]^2 + [E_T(\hat{y}_0) - x_0^T \beta]^2
           = Var(y_0 | x_0) + Var_T(\hat{y}_0) + Bias^2(\hat{y}_0)
           = \sigma^2 + E_T[x_0^T (X^T X)^{-1} x_0] \sigma^2 + 0^2
  The bias is 0, and for large N the variance term is approximately (p/N) \sigma^2.
By relying on rigid assumptions, the linear model has no bias at all and negligible variance, while the error in 1-nearest neighbor is larger.
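The example behind these numbers (cf. ESL) uses f(x) = exp(-8 ||x||^2), no noise, and N = 1000 design points uniform on [-1, 1]^p; below is a minimal Monte Carlo sketch of the 1-NN MSE at x_0 = 0 (trial count and seed are arbitrary choices):

```python
import numpy as np

def mse_1nn_at_origin(p, N=1000, trials=200, seed=0):
    """Monte Carlo MSE of the 1-NN estimate of f(0) = 1 for f(x) = exp(-8 ||x||^2)."""
    rng = np.random.default_rng(seed)
    errs = []
    for _ in range(trials):
        X = rng.uniform(-1.0, 1.0, size=(N, p))
        i = np.argmin(np.linalg.norm(X, axis=1))   # training point nearest to the origin
        y_hat = np.exp(-8.0 * np.sum(X[i] ** 2))   # its (noiseless) response
        errs.append((1.0 - y_hat) ** 2)
    return np.mean(errs)

for p in [1, 2, 4, 8, 10]:
    print(p, mse_1nn_at_origin(p))  # the MSE climbs toward 1.0 as p grows: bias dominates
```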
2.6 Statistical Models, Supervised Learning and Function Approximation
- Goal: finding a useful approximation \hat{f}(x) to the function f(x) that underlies the predictive relationship between the inputs and outputs
- Supervised learning: the machine learning point of view
- Function approximation: the mathematics and statistics point of view
2.7 Structured Regression Models
- Nearest-neighbor and other local methods face problems in high dimensions; they may be inappropriate even in low dimensions
- Hence the need for structured approaches
- Difficulty of the problem: there are infinitely many solutions minimizing the residual sum of squares (a toy demonstration follows)
  RSS(f) = \sum_{i=1}^{N} (y_i - f(x_i))^2
- A unique solution comes only from restrictions on f
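A toy demonstration of the non-uniqueness: two different interpolants of the same hypothetical data points both drive RSS to zero:

```python
import numpy as np

x = np.array([0.0, 1.0, 2.0])
y = np.array([0.0, 1.0, 4.0])

f1 = lambda t: np.interp(t, x, y)   # piecewise-linear interpolant through the points
coef = np.polyfit(x, y, 2)          # exact quadratic through the same three points
f2 = lambda t: np.polyval(coef, t)

for f in (f1, f2):
    print(np.sum((y - f(x)) ** 2))  # ~0 for both: two distinct functions minimize RSS
```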
2.8 Classes of Restricted Estimators
Methods categorized by the nature of the restrictions:
- Roughness penalty and Bayesian methods: penalize functions that vary too rapidly over small regions of input space
- Kernel methods and local regression: explicitly specify the nature of the local neighborhood (the kernel function); need adaptation in high dimensions
- Basis functions and dictionary methods: linear expansion of basis functions (sketched after this list)
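A minimal sketch of a basis-expansion fit; the polynomial dictionary and toy data are illustrative choices:

```python
import numpy as np

def poly_basis(x, M=6):
    """Dictionary h_m(x) = x^m, m = 0..M-1, stacked as an N x M design matrix."""
    return np.column_stack([x ** m for m in range(M)])

# Fit f_theta(x) = sum_m theta_m h_m(x) to noisy samples of sin(3x) by least squares.
rng = np.random.default_rng(2)
x = rng.uniform(-1.0, 1.0, 100)
y = np.sin(3.0 * x) + 0.1 * rng.normal(size=100)

H = poly_basis(x)
theta, *_ = np.linalg.lstsq(H, y, rcond=None)
print(np.sum((y - H @ theta) ** 2))  # training RSS of the fitted expansion
```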
2.9 Model Selection and the Bias-Variance Tradeoff
- All models have a smoothing or complexity parameter to be determined: the multiplier of the penalty term, the width of the kernel, or the number of basis functions
- Bias-variance tradeoff: the irreducible error contributed by \epsilon cannot be reduced, no matter the model
- Bias and variance, however, can be traded: reducing one tends to increase the other. Tradeoff!
Bias-Variance tradeoff in kNN
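Assuming Y = f(X) + \epsilon with Var(\epsilon) = \sigma^2 and the training inputs held fixed, the standard kNN decomposition (cf. ESL) makes the tradeoff explicit:

```latex
\mathrm{EPE}_k(x_0)
  = \sigma^2
  + \Bigl[f(x_0) - \frac{1}{k}\sum_{\ell=1}^{k} f\bigl(x_{(\ell)}\bigr)\Bigr]^2
  + \frac{\sigma^2}{k}
```

Here x_{(\ell)} denotes the \ell-th nearest neighbor of x_0: the variance term \sigma^2/k shrinks as k grows, while the squared-bias term typically grows as the neighborhood pulls in points where f differs from f(x_0).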
Model complexity
[Figure: training error and test error as a function of model complexity (low to high). Left end: high bias, low variance; right end: low bias, high variance.]