
A Decision-Theoretic Generalization of On-Line Learning and an Application to Boosting


Page 1: A Decision-Theoretic Generalization of On-Line Learning and an Application to Boosting

A DECISION-THEORETIC GENERALIZATION OF ON-LINE LEARNING AND AN APPLICATION TO BOOSTING
By Yoav Freund and Robert E. Schapire
Presented by David Leach
Original Slides by Glenn Rachlin

1

Page 2: A Decision-Theoretic Generalization of On-Line Learning and an Application to Boosting

2

OUTLINE:
• Background
• On-line allocation of resources: Introduction, The Problem, The Hedge Algorithm, Analysis
• Boosting: Introduction, The Problem, The AdaBoost Algorithm, Analysis, Applications, Extensions
• Conclusions
• Questions for the final exam

Page 3: A Decision-Theoretic Generalization of On-Line Learning and an Application to Boosting

3

USEFUL DEFINITIONS:
On-Line Learning – Information arrives one step at a time; the learner must apply its model, make a prediction, observe the true value, and then adjust the model accordingly.

Weak Learner – Algorithm has higher accuracy than random guessing, but is impractical by itself for most real-world applications.

PAC – Probably Approximately Correct; with high probability, the learned hypothesis will be close to the true target (its error will be small).

Page 4: A Decision-Theoretic Generalization of On-Line Learning and an Application to Boosting

4

ENSEMBLE LEARNING: A machine learning paradigm where multiple learners are used to solve the same problem.

[Diagram: previously, a single learner was applied to the problem; in an ensemble, multiple learners are applied to the problem and their outputs are combined.]

• The generalization ability of the ensemble is usually significantly better than that of an individual learner.
• Boosting is one of the most important families of ensemble methods.

Page 5: A Decision-Theoretic Generalization of On-Line Learning and an Application to Boosting

5

BOOSTING: A BACKGROUND
Significant advantages:
• Solid theoretical foundation
• High level of accuracy
• Simple to implement
• Wide range of applications

R. Schapire and Y. Freund won the 2003 Gödel Prize (one of the most prestigious awards in theoretical computer science).

Prize-winning paper (which introduced AdaBoost): "A decision-theoretic generalization of on-line learning and an application to boosting," Journal of Computer and System Sciences, 1997, 55: 119–139.

Page 6: A Decision-Theoretic Generalization of On-Line Learning and an Application to Boosting

6

HOW WAS ADABOOST BORN?
In 1988, M. Kearns and L. G. Valiant posed an interesting question: can a "weak" learning algorithm that performs just slightly better than random guessing be "boosted" into an arbitrarily accurate "strong" learning algorithm?

More simply, can we transform one or more weak learners into a single strong learner?

Page 7: A Decision-Theoretic Generalization of On-Line Learning and an Application to Boosting

7

HOW WAS ADABOOST BORN?
In R. Schapire's MLJ90 paper, Rob answered "yes" and gave a proof. The proof is a construction, which became the first boosting algorithm ("recursive majority gate formulation").

Then, in Y. Freund's PhD thesis (1993), Yoav gave a scheme for combining weak learners by "majority vote."

Though theoretically strong, both algorithms relied on prior knowledge of each weak learner's accuracy.

Later, at AT&T Bell Labs, they published the 1997 paper (the work was in fact done in 1995), which proposed AdaBoost, a practical, "adaptable" algorithm.

Page 8: A Decision-Theoretic Generalization of On-Line Learning and an Application to Boosting

8

BOOSTING TIMELINE
• 1990 – Boost-by-majority algorithm (Freund)
• 1995 – AdaBoost (Freund & Schapire)
• 1997 – Generalized version of AdaBoost (Schapire & Singer)
• 2001 – AdaBoost in face detection (Viola & Jones)

Page 9: A Decision-Theoretic Generalization of On-Line Learning and an Application to Boosting

9

ON-LINE ALLOCATION OF RESOURCES: INTRODUCTION
Problem: "... dynamically apportioning resources among a set of options ..."
In other words, "Given a set of individual predictions, how much should we value each one?"

The Gambler Example (A recurring theme)

Page 10: A Decision-Theoretic Generalization of On-Line Learning and an Application to Boosting

10

THE GAMBLER:
A gambler wants to make money on horse racing by consulting a group of experts.
He discovers that the experts tend to use certain "rules of thumb" that predict race results to some degree ("the horse with the best odds," etc.).
It is hard to find one particular rule that works across multiple circumstances.

How can he use the network of various predictions (each of which tends to use a given rule of thumb) to win money?

More specifically, how should he split his money among the experts?

Page 11: A Decision-Theoretic Generalization of On-Line Learning and an Application to Boosting

11

ON-LINE ALLOCATION OF RESOURCES: PROBLEM FORMULATION
The on-line allocation model:
• Allocation agent A – the gambler
• Strategy i – one expert's behavior
• Number of options/strategies {1, 2, ..., N} – the experts to choose from
• Number of time steps {1, 2, ..., T} – the races
• Distribution over strategies p^t – how much money he spends on each expert at time t
• Loss ℓ^t – money lost (or not gained)

Page 12: A Decision-Theoretic Generalization of On-Line Learning and an Application to Boosting

12

ON-LINE ALLOCATION OF RESOURCES: HEDGE(β)
Basis: "The algorithm and its analysis are direct generalizations of Littlestone and Warmuth's weighted majority algorithm."
Assumptions:
• The loss suffered by any strategy is bounded
• All weights are nonnegative
• The initial weights sum to 1 (optional)

Page 13: A Decision-Theoretic Generalization of On-Line Learning and an Application to Boosting

13

Algorithm Hedge(β)
Parameters: β ∈ [0, 1]; initial weight vector w^1 ∈ [0, 1]^N with Σ_{i=1}^N w_i^1 = 1; number of trials T.

Do for t = 1, 2, ..., T:
1. Choose the allocation p^t = w^t / Σ_{i=1}^N w_i^t
2. Receive the loss vector ℓ^t ∈ [0, 1]^N from the environment
3. Suffer loss p^t · ℓ^t
4. Set the new weight vector to w_i^{t+1} = w_i^t · β^{ℓ_i^t}

Goal: minimize the difference between the algorithm's expected total loss, Σ_t p^t · ℓ^t, and the minimal total loss of repeating one action, min_i Σ_t ℓ_i^t.
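As a concrete reference, here is a minimal, illustrative Python sketch of the Hedge(β) loop above; the function name and interface are my own, not from the paper or the slides.

import numpy as np

def hedge(loss_vectors, beta, n_strategies):
    # Run Hedge(beta) over a sequence of loss vectors (each of length n_strategies).
    w = np.ones(n_strategies) / n_strategies   # uniform initial weights, summing to 1
    total_loss = 0.0
    for loss in loss_vectors:
        loss = np.asarray(loss, dtype=float)
        p = w / w.sum()                        # step 1: choose allocation p^t
        total_loss += float(p @ loss)          # steps 2-3: receive loss vector, suffer p^t . l^t
        w = w * beta ** loss                   # step 4: w_i <- w_i * beta^(l_i)
    return total_loss

Strategies that accumulate loss see their weights decay geometrically, so the allocation gradually concentrates on the better strategies.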

Page 14: A Decision-Theoretic Generalization of On-Line Learning and an Application to Boosting

14

THE GAMBLER REVISITED
The gambler uses his fancy new algorithm as follows:
1. The gambler splits his money evenly between 3 experts, giving $5 to each:
   p^1 = <.33, .33, .33>
2. The gambler records the loss to each expert:
   Expert 1 loses $2
   Expert 2 loses $1
   Expert 3 loses $4
   loss vector ℓ^1 = <2, 1, 4>
   total loss = .33 × 2 + .33 × 1 + .33 × 4 = 2.33

Page 15: A Decision-Theoretic Generalization of On-Line Learning and an Application to Boosting

15

THE GAMBLER REVISITED
3. The gambler sets new weights using this data and a β of .5:
   Expert 1 is now weighted .33 × .5^2 ≈ .083
   Expert 2 is now weighted .33 × .5^1 ≈ .167
   Expert 3 is now weighted .33 × .5^4 ≈ .021
   Total weight ≈ .083 + .167 + .021 = .271
4. The gambler repeats the process, now "hedging" his bets as follows:
   p^2 = <.083/.271, .167/.271, .021/.271>
       = <.308, .615, .077>
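To sanity-check these numbers, the single round can be unrolled by hand (purely illustrative; it mirrors the hedge sketch shown earlier):

import numpy as np

w = np.ones(3) / 3                          # p1 = <.33, .33, .33>
loss = np.array([2.0, 1.0, 4.0])
round_loss = float((w / w.sum()) @ loss)    # 2.33
w = w * 0.5 ** loss                         # [.083, .167, .021]
p2 = w / w.sum()                            # [.308, .615, .077]
print(round_loss, w.round(3), p2.round(3))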

Page 16: A Decision-Theoretic Generalization of On-Line Learning and an Application to Boosting

16

ON-LINE ALLOCATION OF RESOURCES: BOUNDS OF HEDGE(β)
For any sequence of loss vectors, the total loss of Hedge(β) is bounded by
L_Hedge(β) ≤ ( min_i L_i · ln(1/β) + ln N ) / (1 − β)
where L_i = Σ_t ℓ_i^t is the total loss of strategy i.

Page 17: A Decision-Theoretic Generalization of On-Line Learning and an Application to Boosting

17

BOUNDS, CON'T
• β = given parameter
• T = total number of time steps or trials
• N = number of options

Page 18: A Decision-Theoretic Generalization of On-Line Learning and an Application to Boosting

18

CHOOSING BETA
Set
β = 1 / ( 1 + sqrt( 2 ln N / L~ ) )
where L~ is an upper bound on the total loss of the best strategy (min_i L_i ≤ L~).

Then:
L_Hedge(β) ≤ min_i L_i + sqrt( 2 L~ ln N ) + ln N

And if we know T, we can take L~ = T (since no strategy can lose more than T), giving:
L_Hedge(β) ≤ min_i L_i + sqrt( 2 T ln N ) + ln N
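As a purely hypothetical illustration of these formulas (the numbers are mine, not the slides'): with N = 3 experts and T = 100 races, taking L~ = T gives the following parameter and overhead.

import math

N, T = 3, 100
L_tilde = T                                            # crude bound: no strategy can lose more than T
beta = 1.0 / (1.0 + math.sqrt(2.0 * math.log(N) / L_tilde))
overhead = math.sqrt(2.0 * T * math.log(N)) + math.log(N)
print(round(beta, 3), round(overhead, 1))              # beta ~ 0.871, extra loss vs. best expert ~ 15.9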

Page 19: A Decision-Theoretic Generalization of On-Line Learning and an Application to Boosting

19

ON-LINE ALLOCATION OF RESOURCES: EVALUATION
The authors show that the Hedge(β) algorithm "yield[s] bounds that are slightly weaker in some cases" than those produced by the algorithm of Littlestone and Warmuth (1994), "but applicable to a considerably more general class of learning problems":
• Not only binary decisions
• Not only discrete loss

Page 20: A Decision-Theoretic Generalization of On-Line Learning and an Application to Boosting

20

MORE GAMBLING
The gambler now wants to avoid the experts, and opts to write a program that predicts the winner.
He must take input data:
• Odds
• Previous results
• Track conditions
and predict the outcome (win or loss).

He notices that "rules of thumb" once again emerge, where simple heuristics can provide some predictive accuracy, but not enough.

How can he use this information to make money?

Page 21: A Decision-Theoretic Generalization of On-Line Learning and an Application to Boosting

21

BOOSTING: INTRODUCTION
Aim: "... converting a weak learning algorithm that performs just slightly better than random guessing into one with arbitrarily high accuracy."
Example: constructing an expert computer program.
Two problems:
• Choosing data
• Combining rules

"Boosting refers to this general problem of producing a very accurate prediction rule by combining rough and moderately inaccurate rules-of-thumb."

Page 22: A Decision-Theoretic Generalization of On-Line Learning and an Application to Boosting

22

TRADITIONAL BOOSTING
1. Split a training data set into multiple overlapping subsets.
2. Train a weak learner on one equally weighted example set until its accuracy is > 50%.
3. Train a weak learner on a new example set, now weighted to focus on the previous errors.
4. Repeat until all example sets are exhausted.
5. Apply all learners to the test set to determine the final hypothesis.

Page 23: A Decision-Theoretic Generalization of On-Line Learning and an Application to Boosting

23

THE PROBLEM
Previous algorithms by the same authors "work by calling a given weak learning algorithm WeakLearn multiple times, each time presenting it with a different distribution [of examples], and finally combining all the generated hypotheses into a single hypothesis."
Problems:
• Too much has to be known in advance
• Improvement of the overall performance depends on the weakest rules

Page 24: A Decision-Theoretic Generalization of On-Line Learning and an Application to Boosting

24

ADABOOST: ADAPTIVE BOOSTING

Instead of sampling, re-weight

Can be used to train weak classifiers

Final classification based on weighted vote of weak classifiers

Page 25: A Decision-Theoretic Generalization of On-Line Learning and an Application to Boosting

25

ADABOOST:
If the underlying classifiers are linear networks, then AdaBoost builds multilayer perceptrons one node at a time.

However, the underlying classifier can be anything: decision trees, neural networks, etc.

Page 26: A Decision-Theoretic Generalization of On-Line Learning and an Application to Boosting

26

THE ADABOOST ALGORITHM:
Input: N labeled examples (x_1, y_1), ..., (x_N, y_N) with y_i ∈ {0, 1}; a distribution D over the examples; a weak learning algorithm WeakLearn; the number of iterations T.
Initialize the weight vector: w_i^1 = D(i) for i = 1, ..., N.
Do for t = 1, 2, ..., T:
1. Set p^t = w^t / Σ_i w_i^t
2. Call WeakLearn with distribution p^t; get back a hypothesis h_t: X → [0, 1]
3. Calculate the error of h_t: ε_t = Σ_i p_i^t |h_t(x_i) − y_i|
4. Set β_t = ε_t / (1 − ε_t)
5. Set the new weight vector to w_i^{t+1} = w_i^t · β_t^{1 − |h_t(x_i) − y_i|}
Output the final hypothesis: h_f(x) = 1 if Σ_t (log 1/β_t) h_t(x) ≥ (1/2) Σ_t log 1/β_t, and 0 otherwise.

Page 27: A Decision-Theoretic Generalization of On-Line Learning and an Application to Boosting

27

INITIAL ANALYSIS:

The weight of each example is adjusted so that the multiplier is β_t (< 1) if the example was classified correctly, or 1 (= β_t^0) if it was classified incorrectly. Remember that the weights are then normalized, so "no decrease" is effectively an increase in relative weight.

Each learner's vote in the final hypothesis is log(1/β_t); since β_t = ε_t / (1 − ε_t) grows with the error, more accurate learners get larger votes.
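To make this weighting scheme concrete, here is a minimal, illustrative Python sketch of the binary (labels in {0, 1}) AdaBoost loop described above; the function names and the idea of passing in the weak learner as a callable are my own choices, not the paper's.

import numpy as np

def adaboost(X, y, weak_learn, T):
    # Illustrative sketch: y is an array of 0/1 labels.
    # weak_learn(X, y, p) must return a hypothesis h with h(X) -> array of 0/1 predictions.
    n = len(y)
    w = np.ones(n) / n                     # initial example weights
    hypotheses, votes = [], []
    for _ in range(T):
        p = w / w.sum()                    # distribution p^t over examples
        h = weak_learn(X, y, p)            # weak hypothesis h_t
        miss = np.abs(h(X) - y)            # 1 where h_t is wrong, 0 where correct
        eps = float(p @ miss)              # weighted error
        if eps <= 0.0 or eps >= 0.5:       # degenerate round: stop
            break
        beta = eps / (1.0 - eps)
        w = w * beta ** (1.0 - miss)       # correct examples shrink by beta, wrong ones keep their weight
        hypotheses.append(h)
        votes.append(np.log(1.0 / beta))   # vote weight log(1/beta_t)
    return hypotheses, votes

def adaboost_predict(hypotheses, votes, X):
    # Final hypothesis: weighted vote compared against half the total vote mass.
    score = np.zeros(len(X))
    for v, h in zip(votes, hypotheses):
        score += v * h(X)
    return (score >= 0.5 * sum(votes)).astype(int)

When eps is small, beta is small and the correctly classified examples lose most of their weight, which is exactly the "focus on the mistakes" behavior described above.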

Page 28: A Decision-Theoretic Generalization of On-Line Learning and an Application to Boosting

28

BETA
β_t = ε_t / (1 − ε_t)

What implications does this have?
1. If the error is .5, i.e., equivalent to random guessing, then β_t = 1: no information is gained and the round contributes nothing.
2. For a typical error < .5, examples are re-weighted according to the error, and votes are weighted inversely to the error.
3. In the final hypothesis, if the error was > .5, the round would actually contribute an inverse (negative) vote.
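A few hypothetical error values make these three cases concrete:

import math

for eps in (0.1, 0.3, 0.5, 0.6):          # hypothetical weak-learner errors
    beta = eps / (1.0 - eps)
    vote = math.log(1.0 / beta)
    print(eps, round(beta, 3), round(vote, 3))
# eps = 0.5 gives beta = 1 and a zero vote; eps = 0.6 gives beta = 1.5 and a negative vote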

Page 29: A Decision-Theoretic Generalization of On-Line Learning and an Application to Boosting

29

STILL MORE GAMBLING
The gambler now has a pretty good scheme to make money, and downloads the entire race history from the track's database.
1. He finds that the odds are a PAC predictor, and comes up with hypotheses accordingly.
2. He calculates the error of using this predictor.
3. He looks at the data in a different way, focusing on examples that the odds could not easily predict, and comes up with a new heuristic (when the track is muddy, the horse with the most experienced jockey wins).
4. He repeats this process until no more viable heuristics can be determined.
5. When enough of the heuristics indicate a win for a given horse, he places a bet.

Page 30: A Decision-Theoretic Generalization of On-Line Learning and an Application to Boosting

30

THEORETICAL PROPERTIES:
Y. Freund and R. Schapire [JCSS97] proved that the training error of AdaBoost is bounded by:

(1/s) |{ i : H(x_i) ≠ y_i }| ≤ ∏_t 2 sqrt( ε_t (1 − ε_t) ) = ∏_t sqrt( 1 − 4 γ_t^2 )

where γ_t = 1/2 − ε_t.

Thus, if each base classifier is slightly better than random, so that γ_t ≥ γ for some γ > 0, then the training error drops exponentially fast in T, since the above bound is at most exp(−2 γ^2 T).
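For example, under the hypothetical assumption that every round has edge γ = 0.1 (error 0.4), the bound after T = 100 rounds is already small:

import math

gamma, T = 0.1, 100
bound = math.exp(-2 * gamma**2 * T)    # exp(-2 * 0.01 * 100) = exp(-2)
print(round(bound, 3))                 # ~0.135, i.e. at most ~13.5% training error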

Page 31: A Decision-Theoretic Generalization of On-Line Learning and an Application to Boosting

31

THEORETICAL PROPERTIES CON'T
Y. Freund and R. Schapire [JCSS97] bounded the generalization error as:

Pr[H(x) ≠ y] ≤ P̂r[H(x) ≠ y] + Õ( sqrt( T d / s ) )

where P̂r[·] denotes empirical probability on the training sample, s is the sample size, and d is the VC-dimension of the base learner.

The above bound suggests that boosting will overfit if T is large. However, empirical studies show that boosting often does not overfit.

R. Schapire et al. [AnnStat98] gave a margin-based bound: for any θ > 0, with high probability,

Pr[margin_f(x, y) ≤ 0] ≤ P̂r[margin_f(x, y) ≤ θ] + Õ( sqrt( d / (s θ^2) ) )

where margin_f(x, y) = y · Σ_t α_t h_t(x) / Σ_t α_t.

Page 32: A Decision-Theoretic Generalization of On-Line Learning and an Application to Boosting

32

TOY EXAMPLE – TAKEN FROM ANTONIO TORRALBA @ MIT

Weak learners come from the family of lines.
A line h with p(error) = 0.5 is at chance.
Each data point has a class label y_t ∈ {+1, −1} and a weight w_t = 1.

Page 33: A Decision-Theoretic Generalization of On-Line Learning and an Application to Boosting

33

TOY EXAMPLE

This one seems to be the best

Each data point hasa class label:

wt =1and a weight:

+1 ( )-1 ( )

yt =

This is a ‘weak classifier’: It performs slightly better than chance.

Page 34: A Decision-Theoretic Generalization of On-Line Learning and an Application to Boosting

34

TOY EXAMPLE

We set a new problem for which the previous weakclassifier performs better than chance again

Each data point hasa class label:

wt wt exp{-yt Ht}We update the weights:

+1 ( )-1 ( )

yt =

Page 35: A Decision-Theoretic Generalization of On-Line Learning and an Application to Boosting

35

TOY EXAMPLE

We set a new problem for which the previous weak classifier performs better than chance again

Each data point hasa class label:

wt wt exp{-yt Ht}We update the weights:

+1 ( )-1 ( )yt

=

Page 36: A Decision-Theoretic Generalization of On-Line Learning and an Application to Boosting

36

TOY EXAMPLE

We set a new problem for which the previous weak classifier performs at chance again

Each data point hasa class label:

wt wt exp{-yt Ht}We update the weights:

+1 ( )-1 ( )

yt =

Page 37: A Decision-Theoretic Generalization of On-Line Learning and an Application to Boosting

37

TOY EXAMPLE

We set a new problem for which the previous weak classifier performs better than chance again

Each data point hasa class label:

wt wt exp{-yt Ht}We update the weights:

+1 ( )-1 ( )

yt =

Page 38: A Decision-Theoretic Generalization of On-Line Learning and an Application to Boosting

38

TOY EXAMPLE

The strong (non- linear) classifier is built as the combination of all the weak (linear) classifiers.

f1 f2

f3

f4

Page 39: A Decision-Theoretic Generalization of On-Line Learning and an Application to Boosting

39

FORMAL PROCEDURE OF ADABOOST

Page 40: A Decision-Theoretic Generalization of On-Line Learning and an Application to Boosting

40

PROCEDURE OF ADABOOST:

Page 41: A Decision-Theoretic Generalization of On-Line Learning and an Application to Boosting

41

ERROR ON TRAINING SET:

Page 42: A Decision-Theoretic Generalization of On-Line Learning and an Application to Boosting

42

OVERFITTING
Will AdaBoost eventually go wrong with a large, complex final classifier?

Occam's razor – simpler is better.

Should we stop training before overfitting sets in? Only if overfitting actually happens.

Page 43: A Decision-Theoretic Generalization of On-Line Learning and an Application to Boosting

43

ACTUAL TYPICAL RUN

Page 44: A Decision-Theoretic Generalization of On-Line Learning and an Application to Boosting

44

AN EXPLANATION BY MARGIN
Note: this margin is not the same notion as the margin in SVMs.

Page 45: A Decision-Theoretic Generalization of On-Line Learning and an Application to Boosting

45

MARGIN DISTRIBUTION

Although the final classifier keeps growing, the margins are still increasing; the final classifier is effectively behaving like a simpler classifier.

Page 46: A Decision-Theoretic Generalization of On-Line Learning and an Application to Boosting

46

PRACTICAL ADVANTAGES OF ADABOOST:
• Simple and easy to program.
• No parameters to tune (except T).
• Effective, provided it can consistently find rough rules of thumb.
• The goal is only to find hypotheses barely better than guessing.
• Can be combined with any (or many) classifiers to find weak hypotheses: neural networks, decision trees, simple rules of thumb, nearest-neighbor classifiers, etc.

Page 47: A Decision-Theoretic Generalization of On-Line Learning and an Application to Boosting

47

EXTENSIONS AdaBoost.M1: First Multiclass

AdaBoost.M2: Second Multiclass

AdaBoost.R: Weak Regression

Page 48: A Decision-Theoretic Generalization of On-Line Learning and an Application to Boosting

48

ADABOOST.M1
We modify the error calculation as follows:

ε_t = Σ_{i : h_t(x_i) ≠ y_i} p_i^t

with the caveat that ε_t must be less than 1/2 (if ε_t > 1/2, the loop is aborted),

and come up with a final hypothesis by a weighted plurality vote:

h_f(x) = argmax_{y ∈ Y} Σ_{t : h_t(x) = y} log(1/β_t)
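A minimal sketch of that weighted plurality vote, assuming we already have the weak hypotheses and their β_t values (names here are illustrative, not from the paper):

import math
from collections import defaultdict

def m1_predict(x, hypotheses, betas):
    # AdaBoost.M1 final hypothesis: each weak hypothesis votes for one label
    # with weight log(1/beta_t); return the label with the largest total vote.
    scores = defaultdict(float)
    for h, beta in zip(hypotheses, betas):
        scores[h(x)] += math.log(1.0 / beta)
    return max(scores, key=scores.get)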

Page 49: A Decision-Theoretic Generalization of On-Line Learning and an Application to Boosting

49

CONCLUDING REMARKS
This paper introduced AdaBoost, an award-winning, widely used algorithm featured among the top 10 data mining algorithms.

The paper also introduced the Hedge algorithm, which is much less widely known.

It should be noted that Hedge solved a problem that had prevented adaptable boosting for a long time, and it has therefore had a significant impact on data mining since.

Page 50: A Decision-Theoretic Generalization of On-Line Learning and an Application to Boosting

50

EXAM QUESTIONS:
1. What are we seeking to minimize in resource allocation?
2. What is the goal of boosting?
3. What makes AdaBoost adaptable?

Page 51: A Decision-Theoretic Generalization of On-Line Learning and an Application to Boosting

51

EXAM QUESTION I: WHAT ARE WE SEEKING TO MINIMIZE IN RESOURCE ALLOCATION?

More simply, we seek to minimize the total loss of the allocator with respect to the loss of the best learner.

This gives us a consistent “worst case scenario”, effectively hedging our bets.

Page 52: A Decision-Theoretic Generalization of On-Line Learning and an Application to Boosting

52

EXAM QUESTION II: WHAT IS THE GOAL OF BOOSTING?

The goal is to combine one or more weak learners into an arbitrarily accurate strong learner.

In other words, to use better-than-chance heuristics in an ensemble to achieve high predictive accuracy.

Page 53: A Decision-Theoretic Generalization of On-Line Learning and an Application to Boosting

53

EXAM QUESTION III: WHAT MAKES ADABOOST ADAPTABLE?

The classifiers used in the final decision function have each been trained to account for the weaknesses of the preceding classifiers.

As long as we have at least one initial learner that tells us something about the data, the algorithm will infer everything else it needs to.