CSC 411: Lecture 17: Ensemble Methods I

Richard Zemel, Raquel Urtasun and Sanja Fidler

University of Toronto

Today

Ensemble Methods

Bagging

Boosting

Ensemble methods

Typical application: classification

An ensemble of classifiers is a set of classifiers whose individual decisions are combined in some way to classify new examples

Simplest approach:
1. Generate multiple classifiers
2. Each votes on the test instance
3. Take the majority as the classification

Classifiers are different due to different sampling of the training data, or randomized parameters within the classification algorithm

Aim: take a simple, mediocre algorithm and transform it into a super classifier without requiring any fancy new algorithm
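As a minimal illustration of the voting step (the classifier outputs below are hypothetical, not from the lecture), majority voting over hard 0/1 predictions takes a couple of lines:

    import numpy as np

    # Hypothetical hard votes from 5 classifiers (rows) on 4 test examples (columns).
    votes = np.array([[1, 0, 1, 1],
                      [1, 1, 0, 1],
                      [0, 1, 0, 1],
                      [1, 0, 0, 1],
                      [1, 1, 1, 0]])

    # A test example is labelled 1 if more than half of the classifiers vote 1.
    majority = (votes.mean(axis=0) > 0.5).astype(int)
    print(majority)  # [1 1 0 1]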


Ensemble methods: Summary

Ensemble methods differ in training strategy and combination method:
- Parallel training with different training sets: bagging (bootstrap aggregation) – train separate models on overlapping training sets, average their predictions
- Sequential training, iteratively re-weighting training examples so the current classifier focuses on hard examples: boosting
- Parallel training with an objective encouraging division of labor: mixture of experts

Notes:
- Also known as meta-learning
- Typically applied to weak models, such as decision stumps (single-node decision trees) or linear classifiers


Bias-Variance Decomposition


Variance-bias Tradeoff

Minimize two sets of errors:
1. Variance: error from sensitivity to small fluctuations in the training set
2. Bias: erroneous assumptions in the model

The variance-bias decomposition is a way of analyzing the generalization error as a sum of 3 terms: variance, bias, and irreducible error (resulting from the problem itself)
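For squared-error regression, the decomposition the slide refers to is usually written as follows (a standard statement, added here for reference since the slide only names the three terms); the expectation is over training sets and label noise with variance $\sigma^2$:

$\mathbb{E}\left[(y - \hat{f}(x))^2\right] = \underbrace{\left(\mathbb{E}[\hat{f}(x)] - f(x)\right)^2}_{\text{bias}^2} + \underbrace{\mathbb{E}\left[\left(\hat{f}(x) - \mathbb{E}[\hat{f}(x)]\right)^2\right]}_{\text{variance}} + \underbrace{\sigma^2}_{\text{irreducible error}}$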


Why do Ensemble Methods Work?

Based on one of two basic observations:

1. Variance reduction: if the training sets are completely independent, it will always help to average an ensemble because this will reduce variance without affecting bias (e.g., bagging)
   - reduce sensitivity to individual data points

2. Bias reduction: for simple models, the average of models has much greater capacity than a single model (e.g., hyperplane classifiers, Gaussian densities)
   - averaging models can reduce bias substantially by increasing capacity, and control variance by fitting one component at a time (e.g., boosting)


Ensemble Methods: Justification

Ensemble methods are more accurate than any of their individual members if the members are:
- Accurate (better than guessing)
- Diverse (make different errors on new examples)

Why? With independent errors, the probability that exactly k of N classifiers (each with independent error rate $\epsilon$) are wrong is

$P(\text{num errors} = k) = \binom{N}{k} \, \epsilon^k (1 - \epsilon)^{N-k}$

The probability that the majority vote is wrong is the probability, under this distribution, that more than N/2 of the classifiers are wrong.


Ensemble Methods: Justification

Figure: Example: the probability that exactly k (out of 21) classifiers will make an error, assuming each classifier has an error rate of $\epsilon = 0.3$ and makes its errors independently of the other classifiers. The area under the curve for 11 or more classifiers being simultaneously wrong is 0.026 (much less than $\epsilon$).

[Credit: T. G. Dietterich, Ensemble Methods in Machine Learning]
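The 0.026 in the caption is just a binomial tail probability; a short check (a sketch, not part of the original slides):

    from math import comb

    def p_majority_wrong(N, eps):
        """P(more than N/2 of N independent classifiers, each with error rate eps, are wrong)."""
        return sum(comb(N, k) * eps**k * (1 - eps)**(N - k)
                   for k in range(N // 2 + 1, N + 1))

    print(p_majority_wrong(21, 0.3))    # ~0.026, matching the caption above
    print(p_majority_wrong(21, 0.49))   # ~0.46: near-chance members need a much larger ensemble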

Ensemble Methods: Justification

Figure: $\epsilon = 0.3$: (left) N = 11 classifiers, (middle) N = 21, (right) N = 121.

Figure: $\epsilon = 0.49$: (left) N = 11, (middle) N = 121, (right) N = 10001.

Ensemble Methods: Netflix

Clear demonstration of the power of ensemble methods

The original progress prize winner (BellKor) was an ensemble of 107 models!
- "Our experience is that most efforts should be concentrated in deriving substantially different approaches, rather than refining a simple technique."
- "We strongly believe that the success of an ensemble approach depends on the ability of its various predictors to expose different complementing aspects of the data. Experience shows that this is very different than optimizing the accuracy of each individual predictor."


[Wired article screenshot, 09.21.09: "BellKor's Pragmatic Chaos Wins $1 Million Netflix Prize by Mere Minutes" – after nearly three years of late nights and heavy collaboration, a team led by AT&T Research engineers won the $1 million Netflix Prize for devising the best way to improve the company's movie recommendation algorithm, which generates an average of 30 billion predictions per day, by 10 percent or more. According to Netflix Prize chief Neil Hunt, the decision came down to a matter of minutes. (Eliot Van Buskirk, Wired Business)]

Bootstrap Estimation

Repeatedly draw n samples from D

For each set of samples, estimate a statistic

The bootstrap estimate is the mean of the individual estimates

Used to estimate a statistic (parameter) and its variance

Bagging: bootstrap aggregation (Breiman 1994)
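A minimal numpy sketch of the procedure above, using the sample median as the statistic (the data and the choice of statistic are illustrative assumptions, not from the slides):

    import numpy as np

    rng = np.random.default_rng(0)
    D = rng.normal(loc=5.0, scale=2.0, size=100)            # the dataset D (n = 100 points)

    B = 1000
    estimates = np.empty(B)
    for b in range(B):
        sample = rng.choice(D, size=len(D), replace=True)   # draw n samples from D, with replacement
        estimates[b] = np.median(sample)                    # estimate the statistic on each set

    print("bootstrap estimate:", estimates.mean())          # mean of the individual estimates
    print("estimated variance:", estimates.var())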

[Figure from The Elements of Statistical Learning (2nd Ed.), Hastie, Tibshirani & Friedman, 2009, Chap. 8 – Figure 8.2: (Top left) B-spline smooth of data. (Top right) B-spline smooth plus and minus 1.96× standard error bands. (Bottom left) Ten bootstrap replicates of the B-spline smooth. (Bottom right) B-spline smooth with 95% standard error bands computed from the bootstrap distribution.]

Bagging

Simple idea: generate M bootstrap samples from your original training set. Train on each one to get $y_m$, and average them:

$y^{M}_{\text{bag}}(x) = \frac{1}{M} \sum_{m=1}^{M} y_m(x)$

For regression: average the predictions

For classification: average the class probabilities (or take the majority vote if only hard outputs are available)

Bagging approximates the Bayesian posterior mean. The more bootstraps the better, so use as many as you have time for.

Each bootstrap sample is drawn with replacement, so each one contains some duplicates of certain training points and leaves out other training points completely.
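A small sketch of bagging with hard outputs and a majority vote, using 1-nearest-neighbour as a deliberately high-variance base classifier (the toy data, base learner, and M are illustrative choices, not from the slides):

    import numpy as np

    rng = np.random.default_rng(0)

    # Toy 2-class data: two overlapping Gaussian blobs in 2-D.
    N = 200
    X = np.vstack([rng.normal(0.0, 1.0, size=(N // 2, 2)),
                   rng.normal(2.0, 1.0, size=(N // 2, 2))])
    t = np.array([0] * (N // 2) + [1] * (N // 2))

    def predict_1nn(X_train, t_train, X_query):
        """1-nearest-neighbour prediction: a simple, high-variance base classifier."""
        d = ((X_query[:, None, :] - X_train[None, :, :]) ** 2).sum(axis=-1)
        return t_train[d.argmin(axis=1)]

    M = 25
    X_test = rng.normal(1.0, 1.5, size=(50, 2))
    votes = np.zeros((M, len(X_test)), dtype=int)
    for m in range(M):
        idx = rng.integers(0, N, size=N)            # bootstrap sample: N indices drawn with replacement
        votes[m] = predict_1nn(X[idx], t[idx], X_test)

    y_bag = (votes.mean(axis=0) > 0.5).astype(int)  # majority vote over the M bagged classifiers
    print(y_bag)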


Boosting (AdaBoost): Summary

Also works by manipulating the training set, but the classifiers are trained sequentially

Each classifier is trained given knowledge of the performance of previously trained classifiers: focus on hard examples

Final classifier: weighted sum of component classifiers


Making Weak Learners Stronger

Suppose you have a weak learning module (a base classifier) that can always get $(0.5 + \epsilon)$ correct when given a two-way classification task
- That seems like a weak assumption, but beware!

Can you apply this learning module many times to get a strong learner that can get close to zero error rate on the training data?
- Theorists showed how to do this, and it actually led to an effective new learning procedure (Freund & Schapire, 1996)


Boosting (AdaBoost)

First train the base classifier on all the training data with equal importance weights on each case.

Then re-weight the training data to emphasize the hard cases and train a second model.
- How do we re-weight the data?

Keep training new models on re-weighted data.

Finally, use a weighted committee of all the models for the test data.
- How do we weight the models in the committee?


How to Train Each Classifier

Input: $x$;  Output: $y(x) \in \{1, -1\}$

Target: $t \in \{-1, 1\}$

Weight on example n for classifier m: $w_n^m$

Cost function for classifier m:

$J_m = \sum_{n=1}^{N} w_n^m \, \underbrace{[y_m(x^{(n)}) \neq t^{(n)}]}_{\text{1 if error, 0 otherwise}} = \text{sum of weighted errors}$


How to weight each training case for classifier m

Recall the cost function is

$J_m = \sum_{n=1}^{N} w_n^m \, \underbrace{[y_m(x^{(n)}) \neq t^{(n)}]}_{\text{1 if error, 0 otherwise}} = \text{sum of weighted errors}$

Weighted error rate of a classifier:

$\epsilon_m = \frac{J_m}{\sum_n w_n^m}$

The quality of the classifier is

$\alpha_m = \ln\left(\frac{1 - \epsilon_m}{\epsilon_m}\right)$

It is zero if the classifier has a weighted error rate of 0.5, and infinity if the classifier is perfect.

The weights for the next round are then

$w_n^{m+1} = \exp\left(-\tfrac{1}{2} t^{(n)} \sum_{i=1}^{m} \alpha_i y_i(x^{(n)})\right) = w_n^m \exp\left(-\tfrac{1}{2} t^{(n)} \alpha_m y_m(x^{(n)})\right)$
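As a concrete check of these formulas (numbers chosen for illustration, not from the slides): if classifier m has weighted error rate $\epsilon_m = 0.25$, then $\alpha_m = \ln(0.75/0.25) = \ln 3 \approx 1.10$; each correctly classified example has its weight multiplied by $e^{-\alpha_m/2} = 1/\sqrt{3} \approx 0.58$, and each misclassified example has its weight multiplied by $e^{\alpha_m/2} = \sqrt{3} \approx 1.73$, so the hard cases count for more in the next round.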


How to make predictions using a committee of classifiers

Weight the binary prediction of each classifier by the quality of that classifier:

$y_M(x) = \text{sign}\left(\sum_{m=1}^{M} \tfrac{1}{2} \alpha_m y_m(x)\right)$

This is how to do inference, i.e., how to compute the prediction for each new example.
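A quick numeric illustration (values hypothetical): with three classifiers of quality $\alpha = (1.1, 0.6, 0.4)$ predicting $(+1, -1, -1)$ on some input, the weighted sum is $\tfrac{1}{2}(1.1 - 0.6 - 0.4) = 0.05 > 0$, so the committee outputs $+1$ even though two of the three members vote $-1$: a high-quality classifier can outvote several mediocre ones.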


AdaBoost Algorithm

Input: $\{x^{(n)}, t^{(n)}\}_{n=1}^{N}$, and WeakLearn: a learning procedure that produces a classifier $y(x)$

Initialize example weights: $w_n^1 = 1/N$

For m = 1 : M
- $y_m(x) = \text{WeakLearn}(\{x\}, t, w)$, fit the classifier by minimizing
  $J_m = \sum_{n=1}^{N} w_n^m [y_m(x^{(n)}) \neq t^{(n)}]$
- Compute the weighted error rate
  $\epsilon_m = \frac{J_m}{\sum_n w_n^m}$
- Compute the classifier coefficient $\alpha_m = \log\frac{1 - \epsilon_m}{\epsilon_m}$
- Update the data weights
  $w_n^{m+1} = w_n^m \exp\left(-\tfrac{1}{2} t^{(n)} \alpha_m y_m(x^{(n)})\right)$

Final model:

$Y(x) = \text{sign}(y_M(x)) = \text{sign}\left(\sum_{m=1}^{M} \alpha_m y_m(x)\right)$
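A compact numpy sketch of the algorithm above, with weighted decision stumps as WeakLearn (the stump learner, toy data, and M are illustrative assumptions, not from the slides):

    import numpy as np

    def fit_stump(X, t, w):
        """Weighted decision stump: choose (feature, threshold, polarity) minimizing
        J_m = sum_n w_n [y_m(x^(n)) != t^(n)], with targets t in {-1, +1}."""
        best = None
        for j in range(X.shape[1]):
            for thr in np.unique(X[:, j]):
                for s in (1, -1):
                    y = np.where(X[:, j] > thr, s, -s)
                    err = np.sum(w * (y != t))
                    if best is None or err < best[0]:
                        best = (err, j, thr, s)
        return best[1:]

    def stump_predict(stump, X):
        j, thr, s = stump
        return np.where(X[:, j] > thr, s, -s)

    def adaboost(X, t, M):
        N = len(t)
        w = np.full(N, 1.0 / N)                     # initialize example weights to 1/N
        stumps, alphas = [], []
        for _ in range(M):
            stump = fit_stump(X, t, w)              # WeakLearn on the weighted data
            y = stump_predict(stump, X)
            eps = np.sum(w * (y != t)) / np.sum(w)  # weighted error rate
            alpha = np.log((1 - eps) / eps)         # classifier coefficient
            w = w * np.exp(-0.5 * t * alpha * y)    # up-weight mistakes, down-weight correct cases
            stumps.append(stump)
            alphas.append(alpha)
        return stumps, alphas

    def predict(stumps, alphas, X):
        F = sum(a * stump_predict(s, X) for s, a in zip(stumps, alphas))
        return np.sign(F)                           # Y(x) = sign(sum_m alpha_m y_m(x))

    # Toy 1-D problem that no single stump can solve: label +1 inside an interval, -1 outside.
    rng = np.random.default_rng(0)
    X = rng.uniform(-1, 1, size=(200, 1))
    t = np.where(np.abs(X[:, 0]) < 0.5, 1, -1)
    stumps, alphas = adaboost(X, t, M=10)
    print("training error:", np.mean(predict(stumps, alphas, X) != t))

Stumps here stand in for the "decision stumps (single-node decision trees)" mentioned earlier; any weak learner that accepts example weights could be swapped in.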

AdaBoost Example

[Figure slides: Training data; Round 1; Round 2; Round 3; Final classifier.]

[Slide credit: Verma & Thrun]

AdaBoost example

Each figure shows the number m of base learners trained so far, the decision of the most recent learner (dashed black), and the boundary of the ensemble (green).

AdaBoost Applet: http://cseweb.ucsd.edu/~yfreund/adaboost/

An alternative derivation of AdaBoost

Just write down the right cost function and optimize each parameter to minimize it
- stagewise additive modeling (Friedman et al. 2000)

At each step employ the exponential loss function for classifier m:

$E = \sum_{n=1}^{N} \exp\{-t^{(n)} f_m(x^{(n)})\}$

Real-valued prediction by the committee of models up to m:

$f_m(x) = \frac{1}{2} \sum_{i=1}^{m} \alpha_i y_i(x)$

We want to minimize E w.r.t. $\alpha_m$ and the parameters of the classifier $y_m(x)$

We do this in a sequential manner, one classifier at a time


Loss Functions

Misclassification: 0/1 loss

Exponential loss: $\exp(-t \cdot f(x))$ (AdaBoost)

Squared error: $(t - f(x))^2$

Soft-margin support vector (hinge loss): $\max(0, 1 - t \cdot y)$
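For a single (target, prediction) pair these losses are easy to compare directly; a tiny sketch with made-up numbers:

    import numpy as np

    t, f = 1.0, -0.3   # target +1, real-valued prediction on the wrong side of the boundary
    print("0/1 loss:        ", float(np.sign(f) != t))   # 1.0
    print("exponential loss:", np.exp(-t * f))           # ~1.35
    print("squared error:   ", (t - f) ** 2)             # 1.69
    print("hinge loss:      ", max(0.0, 1 - t * f))      # 1.3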

Learning classifier m using exponential loss

At iteration m, the energy is computed as

$E = \sum_{n=1}^{N} \exp\{-t^{(n)} f_m(x^{(n)})\}$

with

$f_m(x) = \frac{1}{2} \sum_{i=1}^{m} \alpha_i y_i(x) = \frac{1}{2} \alpha_m y_m(x) + \frac{1}{2} \sum_{i=1}^{m-1} \alpha_i y_i(x)$

We can compute the part that is relevant for the m-th classifier:

$E_{\text{relevant}} = \sum_{n=1}^{N} \exp\left(-t^{(n)} f_{m-1}(x^{(n)}) - \tfrac{1}{2} t^{(n)} \alpha_m y_m(x^{(n)})\right) = \sum_{n=1}^{N} w_n^m \exp\left(-\tfrac{1}{2} t^{(n)} \alpha_m y_m(x^{(n)})\right)$

with $w_n^m = \exp\left(-t^{(n)} f_{m-1}(x^{(n)})\right)$

Page 86: CSC 411: Lecture 17: Ensemble Methods I

Continuing the derivation

    E_{\mathrm{relevant}} = \sum_{n=1}^{N} w_n^m \exp\left(-t^{(n)} \frac{\alpha_m}{2} y_m(x^{(n)})\right)

                          = e^{-\alpha_m/2} \sum_{\mathrm{right}} w_n^m + e^{\alpha_m/2} \sum_{\mathrm{wrong}} w_n^m

                          = \underbrace{\left(e^{\alpha_m/2} - e^{-\alpha_m/2}\right)}_{\text{multiplicative constant}} \; \underbrace{\sum_{n} w_n^m \, [t^{(n)} \neq y_m(x^{(n)})]}_{\text{wrong cases}} \; + \; \underbrace{e^{-\alpha_m/2} \sum_{n} w_n^m}_{\text{unmodifiable}}

The second term is constant w.r.t. ym(x)

Thus we minimize the weighted number of wrong examples

Zemel, Urtasun, Fidler (UofT) CSC 411: 17-Ensemble Methods I 29 / 34
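The algebra above can be sanity-checked numerically. The sketch below (illustrative names; it assumes t and y_pred are ±1 arrays and w holds the current weights) evaluates both the direct form and the decomposed form, which agree, so for a fixed alpha_m the only quantity the weak learner can influence is the weighted count of wrong cases.

import numpy as np

def relevant_loss(w, t, y_pred, alpha):
    # direct form: sum_n w_n * exp(-t^(n) * (alpha/2) * y_m(x^(n)))
    direct = np.sum(w * np.exp(-t * (alpha / 2.0) * y_pred))
    # decomposed form: (e^{a/2} - e^{-a/2}) * (weighted # wrong) + e^{-a/2} * sum_n w_n
    wrong = (y_pred != t).astype(float)
    decomposed = ((np.exp(alpha / 2.0) - np.exp(-alpha / 2.0)) * np.sum(w * wrong)
                  + np.exp(-alpha / 2.0) * np.sum(w))
    return direct, decomposed   # identical up to floating point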

Page 91: CSC 411: Lecture 17: Ensemble Methods I

AdaBoost Algorithm

Input: \{x^{(n)}, t^{(n)}\}_{n=1}^{N}, and WeakLearn: a learning procedure that produces a classifier y(x)

Initialize example weights: w_n^1 = 1/N

For m = 1:M

- y_m(x) = WeakLearn({x}, t, w), fit classifier by minimizing

    J_m = \sum_{n=1}^{N} w_n^m [y_m(x^{(n)}) \neq t^{(n)}]

- Compute weighted error rate

    \epsilon_m = \frac{J_m}{\sum_n w_n^m}

- Compute classifier coefficient \alpha_m = \log \frac{1 - \epsilon_m}{\epsilon_m}

- Update data weights

    w_n^{m+1} = w_n^m \exp\left(-\frac{1}{2} t^{(n)} \alpha_m y_m(x^{(n)})\right)

Final model

    Y(x) = \mathrm{sign}(f_M(x)) = \mathrm{sign}\left(\sum_{m=1}^{M} \alpha_m y_m(x)\right)

Zemel, Urtasun, Fidler (UofT) CSC 411: 17-Ensemble Methods I 30 / 34
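A compact, self-contained sketch of the algorithm above in Python, using one-dimensional decision stumps as WeakLearn; the stump representation and the names weak_learn_stump, fit_adaboost are illustrative choices (not from the slides), and targets are assumed to be in {-1, +1}.

import numpy as np

def weak_learn_stump(X, t, w):
    # WeakLearn: pick the (feature, threshold, sign) stump minimizing the
    # weighted error J_m = sum_n w_n [y_m(x^(n)) != t^(n)]
    best = None
    for d in range(X.shape[1]):
        for thresh in np.unique(X[:, d]):
            for sign in (+1, -1):
                pred = np.where(X[:, d] > thresh, sign, -sign)
                err = np.sum(w * (pred != t))
                if best is None or err < best[0]:
                    best = (err, d, thresh, sign)
    return best[1:]

def stump_predict(stump, X):
    d, thresh, sign = stump
    return np.where(X[:, d] > thresh, sign, -sign)

def fit_adaboost(X, t, M):
    N = X.shape[0]
    w = np.full(N, 1.0 / N)                          # initialize example weights
    stumps, alphas = [], []
    for m in range(M):
        stump = weak_learn_stump(X, t, w)
        pred = stump_predict(stump, X)
        eps = np.sum(w * (pred != t)) / np.sum(w)    # weighted error rate
        eps = np.clip(eps, 1e-10, 1 - 1e-10)         # numerical safety only
        alpha = np.log((1 - eps) / eps)              # classifier coefficient
        w = w * np.exp(-0.5 * t * alpha * pred)      # update data weights
        stumps.append(stump)
        alphas.append(alpha)
    return stumps, alphas

def predict_adaboost(stumps, alphas, X):
    # Y(x) = sign( sum_m alpha_m y_m(x) )
    scores = sum(a * stump_predict(s, X) for s, a in zip(stumps, alphas))
    return np.sign(scores)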

Page 92: CSC 411: Lecture 17: Ensemble Methods I

AdaBoost Example

Zemel, Urtasun, Fidler (UofT) CSC 411: 17-Ensemble Methods I 31 / 34

Page 93: CSC 411: Lecture 17: Ensemble Methods I

An impressive example of boosting

Viola and Jones created a very fast face detector that can be scanned across a large image to find the faces.

The base classifier/weak learner just compares the total intensity in two rectangular pieces of the image.

- There is a neat trick for computing the total intensity in a rectangle in a few operations.
- So it's easy to evaluate a huge number of base classifiers, and they are very fast at runtime.
- The algorithm adds classifiers greedily based on their quality on the weighted training cases.

Zemel, Urtasun, Fidler (UofT) CSC 411: 17-Ensemble Methods I 32 / 34
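The "neat trick" is the integral image (summed-area table): precompute cumulative sums once, then any rectangle's total intensity is four lookups. A minimal sketch (function names are illustrative):

import numpy as np

def integral_image(img):
    # ii[r, c] = sum of img[:r, :c]; padded with a leading row/column of zeros
    ii = np.zeros((img.shape[0] + 1, img.shape[1] + 1))
    ii[1:, 1:] = np.cumsum(np.cumsum(img, axis=0), axis=1)
    return ii

def rect_sum(ii, r1, c1, r2, c2):
    # total intensity of img[r1:r2, c1:c2] in four lookups
    return ii[r2, c2] - ii[r1, c2] - ii[r2, c1] + ii[r1, c1]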

Page 98: CSC 411: Lecture 17: Ensemble Methods I

AdaBoost in Face Detection

Famous application of boosting: detecting faces in images

Two twists on the standard algorithm

- Pre-define weak classifiers, so optimization = selection
- Change the loss function for weak learners: false positives are less costly than misses

Zemel, Urtasun, Fidler (UofT) CSC 411: 17-Ensemble Methods I 33 / 34
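"Optimization = selection" just means the boosting round picks, from a fixed pool of pre-defined weak classifiers, the one with the lowest weighted error on the current weights. A minimal, illustrative sketch (the pool representation and names are assumptions, with feature_values[j] holding feature j's value on every training example):

import numpy as np

def select_weak_classifier(feature_values, thresholds, signs, t, w):
    # pool: classifier j predicts signs[j] when feature_values[j] > thresholds[j], else -signs[j]
    best_j, best_err = None, np.inf
    for j in range(len(thresholds)):
        pred = np.where(feature_values[j] > thresholds[j], signs[j], -signs[j])
        err = np.sum(w * (pred != t))     # weighted error on current weights
        if err < best_err:
            best_j, best_err = j, err
    return best_j, best_err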

Page 101: CSC 411: Lecture 17: Ensemble Methods I

AdaBoost Face Detection Results

Zemel, Urtasun, Fidler (UofT) CSC 411: 17-Ensemble Methods I 34 / 34