

Introduction to Statistical Learning Theory

Petra Philips

Friedrich Miescher Laboratory, Tübingen

Lecture (Vorlesung), Winter Semester 2006/2007
Eberhard Karls Universität Tübingen

24 January 2007

http://www.fml.mpg.de/raetsch/lectures/amsa

Retrospection

Supervised Learning in a Nutshell

Given
Training data: a finite set of examples xi ∈ X and their associated labels yi ∈ Y.

Wanted
The 'best' estimator modelling the relationship between the xi and the associated labels yi, i.e. the 'best' function

f : X → Y.

Approach
- Restrict the possible functions (e.g. to hyperplanes).
- Quantify 'best' as the optimum of some computable objective function (usually the error on the training data).
- Evaluate prediction performance on new test data.
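To make the recipe concrete, here is a minimal sketch (not from the slides; the data and all names are illustrative): the function class is restricted to hyperplanes, the 'best' one is picked by a computable surrogate objective on the training data, and performance is then checked on fresh test data.

```python
# Illustrative sketch of the supervised-learning recipe above.
import numpy as np

rng = np.random.default_rng(0)

def make_data(n):
    """Two noisy blobs in R^2 with labels in {-1, +1}."""
    centers = np.where(rng.random(n) < 0.5, -1.0, 1.0)[:, None]  # blob near (-1,-1) or (1,1)
    X = rng.normal(size=(n, 2)) + centers
    y = np.where(X[:, 0] + X[:, 1] > 0, 1.0, -1.0)               # 'true' relationship
    return X, y

X_train, y_train = make_data(100)
X_test, y_test = make_data(1000)

# Restricted function class: hyperplanes f(x) = sign(w . x + b).
# 'Best' = least-squares fit of the +/-1 labels (a computable surrogate objective).
A = np.hstack([X_train, np.ones((len(X_train), 1))])
w_b, *_ = np.linalg.lstsq(A, y_train, rcond=None)

def predict(X):
    return np.where(np.hstack([X, np.ones((len(X), 1))]) @ w_b > 0, 1.0, -1.0)

train_err = np.mean(predict(X_train) != y_train)   # error on the training data
test_err = np.mean(predict(X_test) != y_test)      # performance on new test data
print(f"training error = {train_err:.3f}, test error = {test_err:.3f}")
```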

Challenge

Is there an a priori way to guarantee good performance?

Recall - Loss, Risk

Loss
The error for a particular example: ℓ(f(xi), yi).
Examples: 0-1 loss, hinge loss, squared loss.

Risk
The expected loss over all data, including unseen data:

R(f) = ∫ ℓ(f(x), y) dρ.

Empirical Risk
The average loss on the training data only:

Remp(f) = (1/n) ∑_{i=1}^{n} ℓ(f(xi), yi).

'Best' Function
The one that minimizes the risk.

Empirical Risk Minimization
Instead of minimizing the risk we minimize the empirical risk!
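As a small illustration (a hypothetical sketch, not part of the lecture), the empirical risk is just an average of per-example losses; the three example losses above differ only in how they score a single prediction.

```python
# Sketch: computing Remp(f) = (1/n) sum_i loss(f(x_i), y_i) for the example losses.
import numpy as np

def zero_one_loss(score, y):
    return (np.sign(score) != y).astype(float)

def hinge_loss(score, y):
    return np.maximum(0.0, 1.0 - y * score)

def squared_loss(score, y):
    return (score - y) ** 2

def empirical_risk(f, X, y, loss):
    """Average loss of predictor f on the training sample (X, y)."""
    return np.mean(loss(f(X), y))

# Example: a fixed linear predictor f(x) = w . x, labels in {-1, +1}.
w = np.array([1.0, -0.5])
f = lambda X: X @ w
X = np.array([[1.0, 0.0], [0.0, 2.0], [-1.0, 1.0]])
y = np.array([1, -1, -1])

for name, loss in [("0-1", zero_one_loss), ("hinge", hinge_loss), ("squared", squared_loss)]:
    print(name, empirical_risk(f, X, y, loss))
```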

Questions

How can we know we are doing 'the right thing'? Why should a small error on the training data ensure a small error on unseen test data?

Assumption
Training and test data are 'similar' because they represent the same phenomenon.

No Free Lunch
Without assumptions and restrictions, no inference and generalization is possible!

Why should the minimizer of the empirical risk be the same as the minimizer of the risk?


More Precise Questions

How to restrict the possible set of functions?

Occam's Razor
Of two equivalent models, choose the simplest one.

Can we quantify the 'complexity' of a learning problem?
Is more data always better data?
How much data do we need?


Statistical Learning Theory

Provides a theoretical framework to study these questions.
Started with Vapnik and Chervonenkis [1971], which led to VC theory and SVMs.
Models the machine learning setting as a statistical phenomenon.
Answers are probabilistic in nature.
Tools: statistics, functional analysis, empirical processes, combinatorics, high-dimensional geometry, complexity theory.
Newer view: Bousquet et al. [2004].

Probabilistic Learning Model

Assumption
All data is generated by the same hidden probabilistic source!

Formally
ρ is an unknown joint probability distribution over X × Y.
Training data ((x1, y1), . . . , (xn, yn)) is drawn i.i.d. ∼ ρ.

Aim: find the best function f** that minimizes the risk

R(f) = ∫ ℓ(f(x), y) dρ

(within a restricted class F, the best such function is denoted f* ∈ F).

ERM: find the best fn ∈ F that minimizes the empirical risk

Remp(f) = (1/n) ∑_{i=1}^{n} ℓ(f(xi), yi).
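A minimal sketch of ERM under this model, assuming a toy distribution ρ and a small finite class F of threshold classifiers (all choices here are illustrative, not from the slides):

```python
# ERM sketch: fn is the member of a finite class F with smallest empirical risk.
import numpy as np

rng = np.random.default_rng(3)

# Hidden source rho: x ~ Uniform[0, 1], y = +1 if x > 0.6, with 10% label noise.
def sample_rho(n):
    x = rng.uniform(0, 1, n)
    y = np.where(x > 0.6, 1, -1)
    flip = rng.random(n) < 0.1
    return x, np.where(flip, -y, y)

# Finite class F = { f_t(x) = sign(x - t) : t in a grid of thresholds }.
thresholds = np.linspace(0, 1, 21)

def empirical_risk(t, x, y):
    return np.mean(np.where(x > t, 1, -1) != y)   # 0-1 loss

x_train, y_train = sample_rho(100)
risks = np.array([empirical_risk(t, x_train, y_train) for t in thresholds])
t_n = thresholds[np.argmin(risks)]                 # fn = empirical risk minimizer

x_test, y_test = sample_rho(100000)                # large sample ~ estimate of R(fn)
print(f"chosen threshold t = {t_n:.2f}, Remp(fn) = {risks.min():.3f}, "
      f"R(fn) ≈ {empirical_risk(t_n, x_test, y_test):.3f}")
```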


Challenge Question

Is R(fn) small, i.e. is R(fn) ≈ R(f**)?

Magic?

Approximation & Estimation Error

R(fn) − R(f**) = [R(fn) − R(f*)] + [R(f*) − R(f**)]
                 (estimation error)  (approximation error)

F large: small approximation error, but overfitting.

F small: large approximation error; better generalization but poor performance.

Model selection
Choose F to get an optimal tradeoff between approximation and estimation error.
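An illustrative sketch of this tradeoff (assumed toy setup, not from the slides): nested classes F_d of polynomials of degree d, fitted by least squares; small d suffers from approximation error, large d from estimation error (overfitting), and a test set exposes the tradeoff.

```python
# Approximation vs. estimation error with nested classes F_d = degree-d polynomials.
import numpy as np

rng = np.random.default_rng(1)
n = 30
x = rng.uniform(-1, 1, n)
y = np.sin(3 * x) + 0.3 * rng.normal(size=n)            # unknown 'true' relationship + noise
x_test = rng.uniform(-1, 1, 1000)
y_test = np.sin(3 * x_test) + 0.3 * rng.normal(size=1000)

for d in [1, 2, 4, 8, 12]:
    coeffs = np.polyfit(x, y, deg=d)                     # ERM within F_d (squared loss)
    train_err = np.mean((np.polyval(coeffs, x) - y) ** 2)
    test_err = np.mean((np.polyval(coeffs, x_test) - y_test) ** 2)
    print(f"degree {d:2d}: training error {train_err:.3f}, test error {test_err:.3f}")
```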

Estimation Error

R(f*) − R(fn) ?

- depends on the training data
- depends on F
- depends on how the algorithm chooses fn
- depends on the unknown ρ, through f* and the risk

For ERM, use the uniform differences trick!

Uniform differences

|R(f*) − R(fn)| ≤ 2 sup_{f ∈ F} |Remp(f) − R(f)|

(Reason: since fn minimizes the empirical risk, Remp(fn) ≤ Remp(f*), so
R(fn) − R(f*) ≤ [R(fn) − Remp(fn)] + [Remp(f*) − R(f*)] ≤ 2 sup_{f ∈ F} |Remp(f) − R(f)|.)


Empirical and Actual Risk


Remp(f) ≈ R(f) ?

Asymptotics: Law of Large Numbers
For any fixed f, |Remp(f) − R(f)| → 0 as n → ∞.

Finite Sample Result [Chernoff-Hoeffding]
For any fixed f, with high probability,

|Remp(f) − R(f)| ≈ 1/√n.

Does this mean that ERM finds the optimal estimator f* when the training sample is getting large?

NO! fn is a random variable and not fixed. A uniform LLN is needed, which holds simultaneously for all f ∈ F. This is true only for classes F which are 'not too complex'.
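A small simulation sketch of this point (assumed toy setup, not from the slides): for one fixed f the gap |Remp(f) − R(f)| shrinks roughly like 1/√n, but for the data-dependent ERM choice fn the gap is systematically larger, because fn is selected to look good on the training sample.

```python
# Labels are pure coin flips, so EVERY classifier here has true risk R(f) = 0.5.
import numpy as np

rng = np.random.default_rng(2)

def sample(n):
    X = rng.integers(0, 2, size=(n, 20))        # 20 binary features
    y = rng.integers(0, 2, size=n)              # labels independent of X
    return X, y

# Function class F: predict with a single feature j, i.e. f_j(x) = x_j.
def emp_risks(X, y):
    return np.mean(X != y[:, None], axis=0)     # Remp(f_j) for all j at once

for n in [50, 200, 800, 3200]:
    gaps_fixed, gaps_erm = [], []
    for _ in range(200):
        X, y = sample(n)
        risks = emp_risks(X, y)
        gaps_fixed.append(abs(risks[0] - 0.5))  # fixed f = feature 0
        gaps_erm.append(abs(risks.min() - 0.5)) # fn = empirically best feature
    print(f"n={n:4d}  fixed f: {np.mean(gaps_fixed):.3f}   ERM fn: {np.mean(gaps_erm):.3f}"
          f"   1/sqrt(n) = {1/np.sqrt(n):.3f}")
```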

Estimation Error

R(f*) − R(fn) ?

Uniform differences

|R(f*) − R(fn)| ≤ 2 sup_{f ∈ F} |Remp(f) − R(f)|

Finite Sample Results
One fixed function: |R(f*) − R(fn)| ≈ 1/√n
F finite: |R(f*) − R(fn)| ≈ √(log |F|) / √n
F infinite: ?

VC Dimension

A model class shatters a set of data points if it can realize every possible labeling of those points.

Linear classifiers (halfplanes) in R² shatter any 3 points in general position, but no set of 4 points.

VC dimension [Vapnik, 1995]
The VC dimension of a model class is the maximum h such that some set of h data points can be shattered by the model (e.g. the VC dimension of linear classifiers in R² is 3).

A small VC dimension implies small complexity.
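A brute-force sketch of shattering (illustrative code, not from the slides): enumerate all labelings of a small point set and check whether some linear classifier in R² realizes each one. The random search over hyperplanes is crude but sufficient for such tiny examples.

```python
# Can halfplanes in R^2 realize every labeling of a given point set?
import itertools
import numpy as np

def linearly_realizable(points, labels):
    """Is there (w, b) with sign(w . x + b) matching the labels?
    Crude random search over candidate hyperplanes; fine for tiny examples."""
    rng = np.random.default_rng(0)
    for _ in range(20000):
        w, b = rng.normal(size=2), rng.normal()
        if np.all(np.sign(points @ w + b) == labels):
            return True
    return False

def shattered(points):
    n = len(points)
    return all(linearly_realizable(points, np.array(lab))
               for lab in itertools.product([-1, 1], repeat=n))

three = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])            # general position
four = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
print("3 points shattered:", shattered(three))   # expected: True
print("4 points shattered:", shattered(four))    # expected: False (e.g. XOR labeling)
```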

Optimal and Empirical Estimator

R(f*) ≈ R(fn) ?

Uniform differences

|R(f*) − R(fn)| ≤ 2 sup_{f ∈ F} |Remp(f) − R(f)|

Finite Sample Results
One fixed function: |R(f*) − R(fn)| ≈ 1/√n
F finite: |R(f*) − R(fn)| ≈ √(log |F|) / √n
F infinite: |R(f*) − R(fn)| ≈ √(VCdim(F)) / √n

All results hold with high probability over the random draw of training samples!
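A quick numeric sketch of how these rates scale (the values of |F| and VCdim(F) below are arbitrary choices for illustration):

```python
# All three bounds shrink like 1/sqrt(n); searching over F costs a factor
# sqrt(log|F|) for a finite class or sqrt(VCdim(F)) for an infinite one.
import numpy as np

for n in [100, 1000, 10000, 100000]:
    fixed = 1 / np.sqrt(n)
    finite = np.sqrt(np.log(1000)) / np.sqrt(n)      # e.g. |F| = 1000 functions
    infinite = np.sqrt(10) / np.sqrt(n)              # e.g. VCdim(F) = 10
    print(f"n={n:6d}  fixed: {fixed:.4f}  log|F|: {finite:.4f}  VCdim: {infinite:.4f}")
```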

Implications

VC dimension is a meaningful complexity measure.
Do model selection by minimizing VC dimension.
More data makes a good predictor more likely.


Large Margin Classifiers


Large Margin ⇒ Small VC Dimension
Hyperplane classifiers with a large margin have small VC dimension [Vapnik, 1995].

Maximum Margin ⇒ Minimum Complexity
Minimize complexity by maximizing the margin (irrespective of the dimension of the space).
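A hedged sketch of this connection using scikit-learn's linear SVM (the tooling is an assumption, not part of the lecture): for a hard-margin fit the geometric margin equals 1/‖w‖, so maximizing the margin is exactly minimizing ‖w‖, the quantity that controls the capacity.

```python
# Fit an (approximately) hard-margin linear SVM and read off its geometric margin.
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(4)
X_pos = rng.normal(loc=[2, 2], scale=0.5, size=(20, 2))
X_neg = rng.normal(loc=[-2, -2], scale=0.5, size=(20, 2))
X = np.vstack([X_pos, X_neg])
y = np.array([1] * 20 + [-1] * 20)

clf = SVC(kernel="linear", C=1e6)    # very large C ~ hard-margin SVM
clf.fit(X, y)

w = clf.coef_[0]                     # separating hyperplane: sign(w . x + b)
margin = 1.0 / np.linalg.norm(w)     # geometric margin of the max-margin hyperplane
print(f"||w|| = {np.linalg.norm(w):.3f}, margin = {margin:.3f}")
```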

Summary - SLT

Provides a statistical framework to study learning algorithms.

Quantifies the generalization ability in terms of
- the complexity of the estimator functions,
- the number of training examples.

Results are probabilistic in nature (confidences).

Results teach us
- when and why our intuitive solutions were right (SVM, cross-validation),
- why and how to restrict the class of estimators and to regularize,
- that more data is better because it increases the confidence in the result.

But: limited model, many questions not yet understood!


References

O. Bousquet, S. Boucheron, and G. Lugosi. Introduction to statistical learning theory. In Machine Learning Summer School 2003, volume 3176 of LNAI, pages 208–240. Springer-Verlag, 2004.

V. N. Vapnik and A. Y. Chervonenkis. On the uniform convergence of relative frequencies of events to their probabilities. Theory of Probability and its Applications, 16:264–280, 1971.

V. N. Vapnik. The Nature of Statistical Learning Theory. Springer-Verlag, New York, 1995.