The Boosting Approach to Machine Learning


Maria-Florina Balcan

04/05/2019

Boosting

• Works by creating a series of challenge datasets such that even modest performance on these can be used to produce an overall high-accuracy predictor.

• Backed up by solid foundations.

• Works well in practice: Adaboost and its variants are among the top ML algorithms.

• General method for improving the accuracy of any given learning algorithm.

Readings:

• The Boosting Approach to Machine Learning: An Overview. Rob Schapire, 2001

• Theory and Applications of Boosting. NIPS tutorial.

http://www.cs.princeton.edu/~schapire/talks/nips-tutorial.pdf

Plan for today:

• Motivation.

• A bit of history.

• Adaboost: algorithm, guarantees, discussion.

• Focus on supervised classification.

An Example: Spam Detection

Key observation/motivation:

• Goal: classify which emails are spam and which are important.

• Easy to find rules of thumb that are often correct.

• E.g., “If ‘buy now’ appears in the message, then predict spam.”

• E.g., “If ‘say good-bye to debt’ appears in the message, then predict spam.”

• Harder to find a single rule that is very highly accurate.

• Boosting: a meta-procedure that takes in an algorithm for finding rules of thumb (a weak learner) and produces a highly accurate rule by calling the weak learner repeatedly on cleverly chosen datasets.

An Example: Spam Detection

• apply weak learner to a subset of emails, obtain rule of thumb

• apply to 2nd subset of emails, obtain 2nd rule of thumb

• apply to 3rd subset of emails, obtain 3rd rule of thumb

• repeat T times; combine weak rules into a single highly accurate rule.

[Figure: emails fed to the weak learner over the rounds, producing rules of thumb h_1, h_2, h_3, …, h_T.]
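To make the meta-procedure concrete, here is a minimal sketch of the loop just described. The names weak_learn, reweight, and combine are placeholders introduced for illustration; Adaboost's concrete choices for the reweighting and the combination appear later in the lecture.

def boost(examples, labels, weak_learn, reweight, combine, T):
    # Generic boosting meta-loop: repeatedly call the weak learner on
    # reweighted data, then combine the resulting rules of thumb.
    m = len(examples)
    weights = [1.0 / m] * m                               # start by treating all examples equally
    rules = []
    for t in range(T):
        h = weak_learn(examples, labels, weights)         # rule of thumb for this round
        rules.append(h)
        weights = reweight(weights, h, examples, labels)  # focus on the "hardest" examples
    return combine(rules)                                 # e.g., a (weighted) majority vote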

Boosting: Important Aspects

How to choose examples on each round?

• Typically, concentrate on the “hardest” examples (those most often misclassified by previous rules of thumb).

How to combine rules of thumb into a single prediction rule?

• Take a (weighted) majority vote of the rules of thumb.

Historically….

Weak Learning vs Strong Learning

• [Kearns & Valiant ’88]: defined weak learning: being able to predict better than random guessing (error ≤ 1/2 − γ), consistently.

• Posed an open problem: “Does there exist a boosting algorithm that turns a weak learner into a strong learner (one that can produce arbitrarily accurate hypotheses)?”

• Informally, given a “weak” learning algorithm that can consistently find classifiers of error ≤ 1/2 − γ, a boosting algorithm would provably construct a single classifier with error ≤ ε.

Surprisingly….

Weak Learning = Strong Learning

Original Construction [Schapire ’89]:

• Poly-time boosting algorithm; exploits the fact that we can learn a little on every distribution.

• A modest booster is obtained by calling the weak learning algorithm on 3 carefully chosen distributions: error β < 1/2 − γ → error 3β² − 2β³.

• The modest boost in accuracy is then amplified by applying this construction recursively.

• Cool conceptually and technically, but not very practical.

An explosion of subsequent work

Adaboost (Adaptive Boosting)

[Freund-Schapire, JCSS’97]

Gödel Prize winner 2003

“A Decision-Theoretic Generalization of On-Line Learning and an Application to Boosting”

Informal Description of Adaboost

• Boosting: turns a weak algorithm into a strong learner.

• Input: S = {(x_1, y_1), …, (x_m, y_m)}, with x_i ∈ X, y_i ∈ Y = {−1, 1}; weak learning algorithm A (e.g., Naïve Bayes, decision stumps).

• For t = 1, 2, …, T
  • Construct D_t on {x_1, …, x_m}.
  • Run A on D_t, producing a weak classifier h_t: X → {−1, 1}, with error ε_t = Pr_{x_i ~ D_t}(h_t(x_i) ≠ y_i) over D_t.

• Roughly speaking, D_{t+1} increases the weight on x_i if h_t is incorrect on x_i, and decreases it on x_i if h_t is correct.

• Output H_final(x) = sign(Σ_{t=1}^T α_t h_t(x)).

[Figure: a weak classifier h_t separating + and − examples.]

Adaboost (Adaptive Boosting)

• Weak learning algorithm A (e.g., Naïve Bayes, decision stumps).

• For t = 1, 2, …, T
  • Construct D_t on {x_1, …, x_m}.
  • Run A on D_t, producing h_t.

Constructing D_t:

• D_1 uniform on {x_1, …, x_m} [i.e., D_1(i) = 1/m].

• Given D_t and h_t, set α_t = (1/2) ln((1 − ε_t)/ε_t) > 0 and

  D_{t+1}(i) = (D_t(i)/Z_t) · e^{−α_t}  if y_i = h_t(x_i)
  D_{t+1}(i) = (D_t(i)/Z_t) · e^{α_t}   if y_i ≠ h_t(x_i)

  i.e., D_{t+1}(i) = (D_t(i)/Z_t) · e^{−α_t y_i h_t(x_i)}, where Z_t is the normalizer.

• D_{t+1} puts half of its weight on the examples x_i where h_t is incorrect and half on the examples where h_t is correct.

Final hypothesis: H_final(x) = sign(Σ_t α_t h_t(x)).
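A minimal Python sketch of the algorithm above. The weak learner interface weak_learn(X, y, D), returning a classifier h with h(x) ∈ {−1, +1}, is an illustrative assumption and not something specified on the slides.

import math

def adaboost(X, y, weak_learn, T):
    # X: list of examples, y: list of labels in {-1, +1},
    # weak_learn(X, y, D): returns a classifier h with h(x) in {-1, +1}.
    m = len(X)
    D = [1.0 / m] * m                                          # D_1 uniform on the m examples
    hs, alphas = [], []
    for t in range(T):
        h = weak_learn(X, y, D)                                # run A on D_t, producing h_t
        eps = sum(D[i] for i in range(m) if h(X[i]) != y[i])   # error of h_t over D_t
        eps = min(max(eps, 1e-12), 1.0 - 1e-12)                # guard against eps_t = 0 or 1
        alpha = 0.5 * math.log((1.0 - eps) / eps)              # alpha_t = 1/2 ln((1 - eps_t)/eps_t)
        # D_{t+1}(i) proportional to D_t(i) * exp(-alpha_t * y_i * h_t(x_i)), normalized by Z_t
        D = [D[i] * math.exp(-alpha * y[i] * h(X[i])) for i in range(m)]
        Z = sum(D)
        D = [d / Z for d in D]
        hs.append(h)
        alphas.append(alpha)
    def H_final(x):
        # H_final(x) = sign(sum_t alpha_t h_t(x))
        return 1 if sum(a * h(x) for a, h in zip(alphas, hs)) >= 0 else -1
    return H_final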

Adaboost: A toy example

Weak classifiers: vertical or horizontal half-planes (a.k.a. decision stumps).

[Figures: successive rounds of Adaboost on the toy dataset, showing the reweighted examples and the combined classifier.]
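For weak classifiers of this kind, a simple weighted decision-stump learner compatible with the adaboost sketch above might look as follows; this is an illustrative sketch, assuming each example is a tuple of real-valued features.

def stump_weak_learn(X, y, D):
    # Return the axis-aligned decision stump (feature j, threshold thr, sign s)
    # with the lowest weighted error under the distribution D.
    m, n = len(X), len(X[0])
    best, best_err = None, float("inf")
    for j in range(n):                                   # each feature = one axis
        for thr in sorted(set(x[j] for x in X)):
            for s in (+1, -1):                           # which side gets label +1
                preds = [s if x[j] > thr else -s for x in X]
                err = sum(D[i] for i in range(m) if preds[i] != y[i])
                if err < best_err:
                    best, best_err = (j, thr, s), err
    j, thr, s = best
    return lambda x, j=j, thr=thr, s=s: (s if x[j] > thr else -s)

# Hypothetical usage with the adaboost sketch above:
# X = [(1.0, 2.0), (2.0, 0.5), (0.2, 1.5), (3.0, 3.0)]
# y = [+1, -1, +1, -1]
# H = adaboost(X, y, stump_weak_learn, T=10)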


Nice Features of Adaboost

• Very general: a meta-procedure, it can use any weak learning algorithm!!!

• Very fast (single pass through data each round) & simple to code, no parameters to tune.

• Grounded in rich theory.

• Shift in mindset: goal is now just to find classifiers a bit better than random guessing.

• Relevant for big data age: quickly focuses on “core difficulties”, well-suited to distributed settings, where data must be communicated efficiently [Balcan-Blum-Fine-Mansour COLT’12].


Analyzing Training Error

Theorem: Let ε_t = 1/2 − γ_t (the error of h_t over D_t). Then

  err_S(H_final) ≤ exp(−2 Σ_t γ_t²).

So, if ∀t, γ_t ≥ γ > 0, then err_S(H_final) ≤ exp(−2γ²T).

The training error drops exponentially in T!!!

To get err_S(H_final) ≤ ε, we need only T = O((1/γ²) log(1/ε)) rounds.

Adaboost is adaptive:

• Does not need to know γ or T a priori.

• Can exploit γ_t ≫ γ.
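As a quick sanity check on this bound (the numbers here are chosen purely for illustration): with edge γ = 0.1 and target training error ε = 0.01,

\[
\exp(-2\gamma^2 T) \le \epsilon
\iff T \ge \frac{\ln(1/\epsilon)}{2\gamma^2}
= \frac{\ln 100}{2\,(0.1)^2} \approx 231 \text{ rounds.}
\]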

Understanding the Updates & Normalization

Claim: D_{t+1} puts half of the weight on the x_i where h_t was incorrect and half of the weight on the x_i where h_t was correct.

Recall D_{t+1}(i) = (D_t(i)/Z_t) · e^{−α_t y_i h_t(x_i)}, with normalizer

  Z_t = (1 − ε_t) e^{−α_t} + ε_t e^{α_t} = 2 √(ε_t(1 − ε_t)).

Weight on correctly classified examples:

  Pr_{D_{t+1}}[y_i = h_t(x_i)] = Σ_{i: y_i = h_t(x_i)} (D_t(i)/Z_t) e^{−α_t} = ((1 − ε_t)/Z_t) e^{−α_t} = ((1 − ε_t)/Z_t) √(ε_t/(1 − ε_t)) = √(ε_t(1 − ε_t)) / Z_t.

Weight on misclassified examples:

  Pr_{D_{t+1}}[y_i ≠ h_t(x_i)] = Σ_{i: y_i ≠ h_t(x_i)} (D_t(i)/Z_t) e^{α_t} = (ε_t/Z_t) e^{α_t} = (ε_t/Z_t) √((1 − ε_t)/ε_t) = √(ε_t(1 − ε_t)) / Z_t.

The two probabilities are equal!
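Combining the two displays with the value of Z_t makes the claim explicit:

\[
\Pr_{D_{t+1}}[y_i = h_t(x_i)]
= \Pr_{D_{t+1}}[y_i \ne h_t(x_i)]
= \frac{\sqrt{\epsilon_t(1-\epsilon_t)}}{Z_t}
= \frac{\sqrt{\epsilon_t(1-\epsilon_t)}}{2\sqrt{\epsilon_t(1-\epsilon_t)}}
= \frac{1}{2}.
\]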

Analyzing Training Error

Theorem: Let ε_t = 1/2 − γ_t (the error of h_t over D_t). Then

  err_S(H_final) ≤ exp(−2 Σ_t γ_t²).

So, if ∀t, γ_t ≥ γ > 0, then err_S(H_final) ≤ exp(−2γ²T).

The training error drops exponentially in T!!!

To get err_S(H_final) ≤ ε, we need only T = O((1/γ²) log(1/ε)) rounds.

Think about generalization aspects!!!!

What you should know

• The difference between weak and strong learners.

• AdaBoost algorithm and intuition for the distribution update step.

• The training error bound for Adaboost.

Advanced (Not required) Material

Analyzing Training Error: Proof Math

Step 1 (unwrapping the recurrence):

  D_{T+1}(i) = (1/m) · exp(−y_i f(x_i)) / Π_t Z_t,

where f(x_i) = Σ_t α_t h_t(x_i) is the unthresholded weighted vote of the h_t on x_i.

Step 2: err_S(H_final) ≤ Π_t Z_t.

Step 3: Π_t Z_t = Π_t 2√(ε_t(1 − ε_t)) = Π_t √(1 − 4γ_t²) ≤ e^{−2 Σ_t γ_t²}.
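The last inequality in Step 3 is the standard estimate 1 + x ≤ e^x applied with x = −4γ_t²:

\[
\sqrt{1 - 4\gamma_t^2} \le \sqrt{e^{-4\gamma_t^2}} = e^{-2\gamma_t^2},
\qquad\text{hence}\qquad
\prod_t \sqrt{1 - 4\gamma_t^2} \le e^{-2\sum_t \gamma_t^2}.
\]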

Analyzing Training Error: Proof Math

Step 1 (unwrapping the recurrence): D_{T+1}(i) = (1/m) · exp(−y_i f(x_i)) / Π_t Z_t, where f(x_i) = Σ_t α_t h_t(x_i).

Recall D_1(i) = 1/m and D_{t+1}(i) = D_t(i) · exp(−y_i α_t h_t(x_i)) / Z_t. Unrolling the recurrence:

  D_{T+1}(i) = [exp(−y_i α_T h_T(x_i)) / Z_T] · D_T(i)
             = [exp(−y_i α_T h_T(x_i)) / Z_T] · [exp(−y_i α_{T−1} h_{T−1}(x_i)) / Z_{T−1}] · D_{T−1}(i)
             = …
             = [exp(−y_i α_T h_T(x_i)) / Z_T] · ⋯ · [exp(−y_i α_1 h_1(x_i)) / Z_1] · (1/m)
             = (1/m) · exp(−y_i (α_1 h_1(x_i) + ⋯ + α_T h_T(x_i))) / (Z_1 ⋯ Z_T)
             = (1/m) · exp(−y_i f(x_i)) / Π_t Z_t.

Analyzing Training Error: Proof Math

Step 1 (recall): D_{T+1}(i) = (1/m) · exp(−y_i f(x_i)) / Π_t Z_t, where f(x_i) = Σ_t α_t h_t(x_i).

Step 2: err_S(H_final) ≤ Π_t Z_t.

  err_S(H_final) = (1/m) Σ_i 1[y_i ≠ H_final(x_i)]
                 = (1/m) Σ_i 1[y_i f(x_i) ≤ 0]
                 ≤ (1/m) Σ_i exp(−y_i f(x_i))      [the 0/1 loss is upper bounded by the exp loss]
                 = Σ_i D_{T+1}(i) · Π_t Z_t
                 = Π_t Z_t.

Analyzing Training Error: Proof Math

Step 1 (recall): D_{T+1}(i) = (1/m) · exp(−y_i f(x_i)) / Π_t Z_t, where f(x_i) = Σ_t α_t h_t(x_i).

Step 2 (recall): err_S(H_final) ≤ Π_t Z_t.

Step 3: Π_t Z_t = Π_t 2√(ε_t(1 − ε_t)) = Π_t √(1 − 4γ_t²) ≤ e^{−2 Σ_t γ_t²}.

Note: recall Z_t = (1 − ε_t) e^{−α_t} + ε_t e^{α_t} = 2√(ε_t(1 − ε_t)), and that α_t is the minimizer of α ↦ (1 − ε_t) e^{−α} + ε_t e^{α}.
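For completeness, setting the derivative of that map to zero recovers Adaboost's choice of α_t:

\[
\frac{d}{d\alpha}\Big[(1-\epsilon_t)e^{-\alpha} + \epsilon_t e^{\alpha}\Big]
= -(1-\epsilon_t)e^{-\alpha} + \epsilon_t e^{\alpha} = 0
\;\Longrightarrow\;
e^{2\alpha} = \frac{1-\epsilon_t}{\epsilon_t}
\;\Longrightarrow\;
\alpha_t = \frac{1}{2}\ln\frac{1-\epsilon_t}{\epsilon_t},
\]

and substituting back gives Z_t = 2√(ε_t(1 − ε_t)).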

Generalization Guarantees (informal)

Recall the training error theorem, where ε_t = 1/2 − γ_t:  err_S(H_final) ≤ exp(−2 Σ_t γ_t²).

How about generalization guarantees?

Original analysis [Freund & Schapire ’97]:

• Let H be the set of rules that the weak learner can use.
• Let G be the set of weighted majority rules over T elements of H (i.e., the things that AdaBoost might output).

Theorem [Freund & Schapire ’97]: ∀g ∈ G,  err(g) ≤ err_S(g) + Õ(√(Td/m)),

where T = # of rounds and d = VC dimension of H.

Generalization Guarantees

Theorem [Freund & Schapire ’97]: ∀g ∈ G,  err(g) ≤ err_S(g) + Õ(√(Td/m)), where T = # of rounds and d = VC dimension of H.

[Figure: error vs. T; the training error decreases with T, while the complexity term, and hence the bound on generalization error, grows with the number of rounds.]


Generalization Guarantees

• Experiments with boosting showed that the test error of the generated classifier usually does not increase as its size becomes very large.

• Experiments showed that continuing to add new weak learners after correct classification of the training set had been achieved could further improve test set performance!!!

• These results seem to contradict the FS’97 bound and Occam’s razor (in order to achieve good test error, the classifier should be as simple as possible)!

How can we explain the experiments?

Key Idea:

R. Schapire, Y. Freund, P. Bartlett, and W. S. Lee give a nice theoretical explanation in “Boosting the margin: A new explanation for the effectiveness of voting methods”.

Training error does not tell the whole story.

We also need to consider the classification confidence!!
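The notion of confidence used there is the normalized margin of the voting classifier; a standard way to write it (following that paper, it is not spelled out on these slides) is

\[
\mathrm{margin}(x_i, y_i) \;=\; \frac{y_i \sum_t \alpha_t h_t(x_i)}{\sum_t \alpha_t} \;\in\; [-1, 1],
\]

which is positive exactly when H_final classifies (x_i, y_i) correctly and large when the vote is confident; their analysis shows that Adaboost tends to keep increasing these margins even after the training error reaches zero, which helps explain the experiments above.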
