The Boosting Approach to Machine Learning


Maria-Florina Balcan

04/05/2019

Boosting

• Works by creating a series of challenge datasets such that even modest performance on these can be used to produce an overall high-accuracy predictor.

• Backed up by solid foundations.

• Works well in practice: Adaboost and its variants are among the top ML algorithms.

• General method for improving the accuracy of any given learning algorithm.

Readings:

• The Boosting Approach to Machine Learning: An Overview. Rob Schapire, 2001

• Theory and Applications of Boosting. NIPS tutorial.

http://www.cs.princeton.edu/~schapire/talks/nips-tutorial.pdf

Plan for today:

• Motivation.

• A bit of history.

• Adaboost: algorithm, guarantees, discussion.

• Focus on supervised classification.

An Example: Spam Detection

Key observation/motivation:

• Goal: classify which emails are spam and which are important.

• Easy to find rules of thumb that are often correct.

• E.g., “If ‘buy now’ appears in the message, then predict spam.”

• E.g., “If ‘say good-bye to debt’ appears in the message, then predict spam.”

• Harder to find a single rule that is very highly accurate.

• Boosting: a meta-procedure that takes in an algorithm for finding rules of thumb (a weak learner) and produces a highly accurate rule by calling the weak learner repeatedly on cleverly chosen datasets.

An Example: Spam Detection

• apply weak learner to a subset of emails, obtain rule of thumb

• apply to 2nd subset of emails, obtain 2nd rule of thumb

• apply to 3rd subset of emails, obtain 3rd rule of thumb

• repeat T times; combine weak rules into a single highly accurate rule.

[Figure: emails fed to the weak learner over the rounds, producing rules of thumb h_1, h_2, h_3, …, h_T.]
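To make the meta-procedure concrete, here is a minimal sketch of the loop just described. The names weak_learn, reweight, and combine are placeholders introduced for illustration; Adaboost's concrete choices for the reweighting and the combination appear later in the lecture.

def boost(examples, labels, weak_learn, reweight, combine, T):
    # Generic boosting meta-loop: repeatedly call the weak learner on
    # reweighted data, then combine the resulting rules of thumb.
    m = len(examples)
    weights = [1.0 / m] * m                               # start by treating all examples equally
    rules = []
    for t in range(T):
        h = weak_learn(examples, labels, weights)         # rule of thumb for this round
        rules.append(h)
        weights = reweight(weights, h, examples, labels)  # focus on the "hardest" examples
    return combine(rules)                                 # e.g., a (weighted) majority vote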

Boosting: Important Aspects

How to choose examples on each round?

• Typically, concentrate on the “hardest” examples (those most often misclassified by previous rules of thumb).

How to combine rules of thumb into a single prediction rule?

• Take a (weighted) majority vote of the rules of thumb.

Historically….

Weak Learning vs Strong Learning

• [Kearns & Valiant ’88]: defined weak learning: being able to predict better than random guessing (error ≤ 1/2 − γ), consistently.

• Posed an open problem: “Does there exist a boosting algorithm that turns a weak learner into a strong learner (one that can produce arbitrarily accurate hypotheses)?”

• Informally, given a “weak” learning algorithm that can consistently find classifiers of error ≤ 1/2 − γ, a boosting algorithm would provably construct a single classifier with error ≤ ε.

Surprisingly….

Weak Learning = Strong Learning

Original Construction [Schapire ’89]:

• Poly-time boosting algorithm; exploits the fact that we can learn a little on every distribution.

• A modest booster is obtained by calling the weak learning algorithm on 3 carefully chosen distributions: error β < 1/2 − γ → error 3β² − 2β³.

• The modest boost in accuracy is then amplified by applying this construction recursively.

• Cool conceptually and technically, but not very practical.

An explosion of subsequent work

Adaboost (Adaptive Boosting)

[Freund-Schapire, JCSS’97]

Gödel Prize winner 2003

“A Decision-Theoretic Generalization of On-Line Learning and an Application to Boosting”

Informal Description of Adaboost

• Boosting: turns a weak algorithm into a strong learner.

• Input: S = {(x_1, y_1), …, (x_m, y_m)}, with x_i ∈ X, y_i ∈ Y = {−1, 1}; weak learning algorithm A (e.g., Naïve Bayes, decision stumps).

• For t = 1, 2, …, T
  • Construct D_t on {x_1, …, x_m}.
  • Run A on D_t, producing a weak classifier h_t: X → {−1, 1}, with error ε_t = Pr_{x_i ~ D_t}(h_t(x_i) ≠ y_i) over D_t.

• Roughly speaking, D_{t+1} increases the weight on x_i if h_t is incorrect on x_i, and decreases it on x_i if h_t is correct.

• Output H_final(x) = sign(Σ_{t=1}^T α_t h_t(x)).

[Figure: a weak classifier h_t separating + and − examples.]

Adaboost (Adaptive Boosting)

• Weak learning algorithm A (e.g., Naïve Bayes, decision stumps).

• For t = 1, 2, …, T
  • Construct D_t on {x_1, …, x_m}.
  • Run A on D_t, producing h_t.

Constructing D_t:

• D_1 uniform on {x_1, …, x_m} [i.e., D_1(i) = 1/m].

• Given D_t and h_t, set α_t = (1/2) ln((1 − ε_t)/ε_t) > 0 and

  D_{t+1}(i) = (D_t(i)/Z_t) · e^{−α_t}  if y_i = h_t(x_i)
  D_{t+1}(i) = (D_t(i)/Z_t) · e^{α_t}   if y_i ≠ h_t(x_i)

  i.e., D_{t+1}(i) = (D_t(i)/Z_t) · e^{−α_t y_i h_t(x_i)}, where Z_t is the normalizer.

• D_{t+1} puts half of its weight on the examples x_i where h_t is incorrect and half on the examples where h_t is correct.

Final hypothesis: H_final(x) = sign(Σ_t α_t h_t(x)).
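A minimal Python sketch of the algorithm above. The weak learner interface weak_learn(X, y, D), returning a classifier h with h(x) ∈ {−1, +1}, is an illustrative assumption and not something specified on the slides.

import math

def adaboost(X, y, weak_learn, T):
    # X: list of examples, y: list of labels in {-1, +1},
    # weak_learn(X, y, D): returns a classifier h with h(x) in {-1, +1}.
    m = len(X)
    D = [1.0 / m] * m                                          # D_1 uniform on the m examples
    hs, alphas = [], []
    for t in range(T):
        h = weak_learn(X, y, D)                                # run A on D_t, producing h_t
        eps = sum(D[i] for i in range(m) if h(X[i]) != y[i])   # error of h_t over D_t
        eps = min(max(eps, 1e-12), 1.0 - 1e-12)                # guard against eps_t = 0 or 1
        alpha = 0.5 * math.log((1.0 - eps) / eps)              # alpha_t = 1/2 ln((1 - eps_t)/eps_t)
        # D_{t+1}(i) proportional to D_t(i) * exp(-alpha_t * y_i * h_t(x_i)), normalized by Z_t
        D = [D[i] * math.exp(-alpha * y[i] * h(X[i])) for i in range(m)]
        Z = sum(D)
        D = [d / Z for d in D]
        hs.append(h)
        alphas.append(alpha)
    def H_final(x):
        # H_final(x) = sign(sum_t alpha_t h_t(x))
        return 1 if sum(a * h(x) for a, h in zip(alphas, hs)) >= 0 else -1
    return H_final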

Adaboost: A toy example

Weak classifiers: vertical or horizontal half-planes (a.k.a. decision stumps).

[Figures: successive rounds of Adaboost on the toy dataset, showing the reweighted examples and the combined classifier.]
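For weak classifiers of this kind, a simple weighted decision-stump learner compatible with the adaboost sketch above might look as follows; this is an illustrative sketch, assuming each example is a tuple of real-valued features.

def stump_weak_learn(X, y, D):
    # Return the axis-aligned decision stump (feature j, threshold thr, sign s)
    # with the lowest weighted error under the distribution D.
    m, n = len(X), len(X[0])
    best, best_err = None, float("inf")
    for j in range(n):                                   # each feature = one axis
        for thr in sorted(set(x[j] for x in X)):
            for s in (+1, -1):                           # which side gets label +1
                preds = [s if x[j] > thr else -s for x in X]
                err = sum(D[i] for i in range(m) if preds[i] != y[i])
                if err < best_err:
                    best, best_err = (j, thr, s), err
    j, thr, s = best
    return lambda x, j=j, thr=thr, s=s: (s if x[j] > thr else -s)

# Hypothetical usage with the adaboost sketch above:
# X = [(1.0, 2.0), (2.0, 0.5), (0.2, 1.5), (3.0, 3.0)]
# y = [+1, -1, +1, -1]
# H = adaboost(X, y, stump_weak_learn, T=10)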


Nice Features of Adaboost

• Very general: a meta-procedure, it can use any weak learning algorithm!!!

• Very fast (single pass through data each round) & simple to code, no parameters to tune.

• Grounded in rich theory.

• Shift in mindset: goal is now just to find classifiers a bit better than random guessing.

• Relevant for big data age: quickly focuses on “core difficulties”, well-suited to distributed settings, where data must be communicated efficiently [Balcan-Blum-Fine-Mansour COLT’12].


Analyzing Training Error

Theorem: Let ε_t = 1/2 − γ_t (the error of h_t over D_t). Then

  err_S(H_final) ≤ exp(−2 Σ_t γ_t²).

So, if ∀t, γ_t ≥ γ > 0, then err_S(H_final) ≤ exp(−2γ²T).

The training error drops exponentially in T!!!

To get err_S(H_final) ≤ ε, we need only T = O((1/γ²) log(1/ε)) rounds.

Adaboost is adaptive:

• Does not need to know γ or T a priori.

• Can exploit γ_t ≫ γ.
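As a quick sanity check on this bound (the numbers here are chosen purely for illustration): with edge γ = 0.1 and target training error ε = 0.01,

\[
\exp(-2\gamma^2 T) \le \epsilon
\iff T \ge \frac{\ln(1/\epsilon)}{2\gamma^2}
= \frac{\ln 100}{2\,(0.1)^2} \approx 231 \text{ rounds.}
\]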

Understanding the Updates & Normalization

Claim: D_{t+1} puts half of the weight on the x_i where h_t was incorrect and half of the weight on the x_i where h_t was correct.

Recall D_{t+1}(i) = (D_t(i)/Z_t) · e^{−α_t y_i h_t(x_i)}, with normalizer

  Z_t = (1 − ε_t) e^{−α_t} + ε_t e^{α_t} = 2 √(ε_t(1 − ε_t)).

Weight on correctly classified examples:

  Pr_{D_{t+1}}[y_i = h_t(x_i)] = Σ_{i: y_i = h_t(x_i)} (D_t(i)/Z_t) e^{−α_t} = ((1 − ε_t)/Z_t) e^{−α_t} = ((1 − ε_t)/Z_t) √(ε_t/(1 − ε_t)) = √(ε_t(1 − ε_t)) / Z_t.

Weight on misclassified examples:

  Pr_{D_{t+1}}[y_i ≠ h_t(x_i)] = Σ_{i: y_i ≠ h_t(x_i)} (D_t(i)/Z_t) e^{α_t} = (ε_t/Z_t) e^{α_t} = (ε_t/Z_t) √((1 − ε_t)/ε_t) = √(ε_t(1 − ε_t)) / Z_t.

The two probabilities are equal!
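Combining the two displays with the value of Z_t makes the claim explicit:

\[
\Pr_{D_{t+1}}[y_i = h_t(x_i)]
= \Pr_{D_{t+1}}[y_i \ne h_t(x_i)]
= \frac{\sqrt{\epsilon_t(1-\epsilon_t)}}{Z_t}
= \frac{\sqrt{\epsilon_t(1-\epsilon_t)}}{2\sqrt{\epsilon_t(1-\epsilon_t)}}
= \frac{1}{2}.
\]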

Analyzing Training Error

Theorem: Let ε_t = 1/2 − γ_t (the error of h_t over D_t). Then

  err_S(H_final) ≤ exp(−2 Σ_t γ_t²).

So, if ∀t, γ_t ≥ γ > 0, then err_S(H_final) ≤ exp(−2γ²T).

The training error drops exponentially in T!!!

To get err_S(H_final) ≤ ε, we need only T = O((1/γ²) log(1/ε)) rounds.

Think about generalization aspects!!!!

What you should know

• The difference between weak and strong learners.

• AdaBoost algorithm and intuition for the distribution update step.

• The training error bound for Adaboost.

Advanced (Not required) Material

Analyzing Training Error: Proof Math

Step 1 (unwrapping the recurrence):

  D_{T+1}(i) = (1/m) · exp(−y_i f(x_i)) / Π_t Z_t,

where f(x_i) = Σ_t α_t h_t(x_i) is the unthresholded weighted vote of the h_t on x_i.

Step 2: err_S(H_final) ≤ Π_t Z_t.

Step 3: Π_t Z_t = Π_t 2√(ε_t(1 − ε_t)) = Π_t √(1 − 4γ_t²) ≤ e^{−2 Σ_t γ_t²}.
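The last inequality in Step 3 is the standard estimate 1 + x ≤ e^x applied with x = −4γ_t²:

\[
\sqrt{1 - 4\gamma_t^2} \le \sqrt{e^{-4\gamma_t^2}} = e^{-2\gamma_t^2},
\qquad\text{hence}\qquad
\prod_t \sqrt{1 - 4\gamma_t^2} \le e^{-2\sum_t \gamma_t^2}.
\]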

Analyzing Training Error: Proof Math

Step 1 (unwrapping the recurrence): D_{T+1}(i) = (1/m) · exp(−y_i f(x_i)) / Π_t Z_t, where f(x_i) = Σ_t α_t h_t(x_i).

Recall D_1(i) = 1/m and D_{t+1}(i) = D_t(i) · exp(−y_i α_t h_t(x_i)) / Z_t. Unrolling the recurrence:

  D_{T+1}(i) = [exp(−y_i α_T h_T(x_i)) / Z_T] · D_T(i)
             = [exp(−y_i α_T h_T(x_i)) / Z_T] · [exp(−y_i α_{T−1} h_{T−1}(x_i)) / Z_{T−1}] · D_{T−1}(i)
             = …
             = [exp(−y_i α_T h_T(x_i)) / Z_T] · ⋯ · [exp(−y_i α_1 h_1(x_i)) / Z_1] · (1/m)
             = (1/m) · exp(−y_i (α_1 h_1(x_i) + ⋯ + α_T h_T(x_i))) / (Z_1 ⋯ Z_T)
             = (1/m) · exp(−y_i f(x_i)) / Π_t Z_t.

Analyzing Training Error: Proof Math

Step 1 (recall): D_{T+1}(i) = (1/m) · exp(−y_i f(x_i)) / Π_t Z_t, where f(x_i) = Σ_t α_t h_t(x_i).

Step 2: err_S(H_final) ≤ Π_t Z_t.

  err_S(H_final) = (1/m) Σ_i 1[y_i ≠ H_final(x_i)]
                 = (1/m) Σ_i 1[y_i f(x_i) ≤ 0]
                 ≤ (1/m) Σ_i exp(−y_i f(x_i))      [the 0/1 loss is upper bounded by the exp loss]
                 = Σ_i D_{T+1}(i) · Π_t Z_t
                 = Π_t Z_t.

Analyzing Training Error: Proof Math

Step 1 (recall): D_{T+1}(i) = (1/m) · exp(−y_i f(x_i)) / Π_t Z_t, where f(x_i) = Σ_t α_t h_t(x_i).

Step 2 (recall): err_S(H_final) ≤ Π_t Z_t.

Step 3: Π_t Z_t = Π_t 2√(ε_t(1 − ε_t)) = Π_t √(1 − 4γ_t²) ≤ e^{−2 Σ_t γ_t²}.

Note: recall Z_t = (1 − ε_t) e^{−α_t} + ε_t e^{α_t} = 2√(ε_t(1 − ε_t)), and that α_t is the minimizer of α ↦ (1 − ε_t) e^{−α} + ε_t e^{α}.
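For completeness, setting the derivative of that map to zero recovers Adaboost's choice of α_t:

\[
\frac{d}{d\alpha}\Big[(1-\epsilon_t)e^{-\alpha} + \epsilon_t e^{\alpha}\Big]
= -(1-\epsilon_t)e^{-\alpha} + \epsilon_t e^{\alpha} = 0
\;\Longrightarrow\;
e^{2\alpha} = \frac{1-\epsilon_t}{\epsilon_t}
\;\Longrightarrow\;
\alpha_t = \frac{1}{2}\ln\frac{1-\epsilon_t}{\epsilon_t},
\]

and substituting back gives Z_t = 2√(ε_t(1 − ε_t)).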

Generalization Guarantees (informal)

Recall the training error theorem, where ε_t = 1/2 − γ_t:  err_S(H_final) ≤ exp(−2 Σ_t γ_t²).

How about generalization guarantees?

Original analysis [Freund & Schapire ’97]:

• Let H be the set of rules that the weak learner can use.
• Let G be the set of weighted majority rules over T elements of H (i.e., the things that AdaBoost might output).

Theorem [Freund & Schapire ’97]: ∀g ∈ G,  err(g) ≤ err_S(g) + Õ(√(Td/m)),

where T = # of rounds and d = VC dimension of H.

Generalization Guarantees

Theorem [Freund & Schapire ’97]: ∀g ∈ G,  err(g) ≤ err_S(g) + Õ(√(Td/m)), where T = # of rounds and d = VC dimension of H.

[Figure: error vs. T; the training error decreases with T, while the complexity term, and hence the bound on generalization error, grows with the number of rounds.]


Generalization Guarantees

• Experiments with boosting showed that the test error of the generated classifier usually does not increase as its size becomes very large.

• Experiments showed that continuing to add new weak learners after correct classification of the training set had been achieved could further improve test set performance!!!

• These results seem to contradict the FS’97 bound and Occam’s razor (in order to achieve good test error, the classifier should be as simple as possible)!

How can we explain the experiments?

Key Idea:

R. Schapire, Y. Freund, P. Bartlett, and W. S. Lee give a nice theoretical explanation in “Boosting the margin: A new explanation for the effectiveness of voting methods”.

Training error does not tell the whole story.

We also need to consider the classification confidence!!
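The notion of confidence used there is the normalized margin of the voting classifier; a standard way to write it (following that paper, it is not spelled out on these slides) is

\[
\mathrm{margin}(x_i, y_i) \;=\; \frac{y_i \sum_t \alpha_t h_t(x_i)}{\sum_t \alpha_t} \;\in\; [-1, 1],
\]

which is positive exactly when H_final classifies (x_i, y_i) correctly and large when the vote is confident; their analysis shows that Adaboost tends to keep increasing these margins even after the training error reaches zero, which helps explain the experiments above.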
