
FilterBoost: Regression and Classification on Large Datasets

NIPS’07 paper by Joseph K. Bradley and Robert E. Schapire

* some slides reused from the NIPS ’07 presentation


Typical Boosting Framework

• Batch framework: Dataset → Booster → Final Hypothesis.

• The dataset is sampled i.i.d. from the target distribution D.

• The booster must always have access to the entire dataset.

Motivation for a New Framework

• In the typical framework, the boosting algorithm must have access to the entire data set.

• This makes boosting hard to apply when data sets are very large.

• It is computationally infeasible because, in each round, the distribution weight of every point must be updated.

• New ideas:

• use a data stream instead of the entire (fixed) data set.

• train on a new subset of the data in each round.

Filtering Framework

• Boost for 1000 rounds: only store ~1/1000 of the data at a time.

• Data Oracle (samples ~ D) → Booster → Final Hypothesis; the booster stores only a tiny fraction of the data!

Paper’s Results

• A new boosting-by-filtering algorithm.

• Provable guarantees.

• Applicable to both classification and conditional probability estimation.

• Good empirical performance.


AdaBoost (batch boosting)

• Given: Fixed data set S.

• In each round t,

• choose a distribution $D_t$ over S.

• choose a hypothesis $h_t$.

• estimate the error $\varepsilon_t$ of $h_t$ with respect to $D_t$.

• give $h_t$ a weight of $\alpha_t = \frac{1}{2}\log\frac{1-\varepsilon_t}{\varepsilon_t}$.

• Output: hypothesis $h(x) := \sum_{t=1}^{T} \alpha_t h_t(x)$.

• $D_t$ puts higher weight on misclassified examples, forcing the algorithm to correctly classify "harder" examples in later rounds.
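As a concrete illustration (not code from the paper), here is a minimal sketch of the batch loop above. The `train_weak_learner` helper and the numpy-style interface are my own assumptions.

```python
import numpy as np

def adaboost(X, y, train_weak_learner, T=100):
    """Minimal batch AdaBoost sketch (labels y in {-1, +1}).

    train_weak_learner(X, y, D) is an assumed helper that returns a
    function h: X -> {-1, +1} fit under the distribution D over S.
    """
    n = len(y)
    D = np.full(n, 1.0 / n)            # D_1: uniform over the fixed data set S
    hypotheses, alphas = [], []

    for t in range(T):
        h = train_weak_learner(X, y, D)
        pred = h(X)
        eps = np.sum(D[pred != y])     # weighted error of h_t under D_t
        eps = np.clip(eps, 1e-12, 1 - 1e-12)
        alpha = 0.5 * np.log((1 - eps) / eps)   # alpha_t = 1/2 log((1 - eps_t)/eps_t)

        # Higher weight to misclassified examples in the next round.
        D *= np.exp(-alpha * y * pred)
        D /= D.sum()

        hypotheses.append(h)
        alphas.append(alpha)

    # Final hypothesis: sign of the weighted vote sum_t alpha_t h_t(x).
    def H(Xnew):
        return np.sign(sum(a * h(Xnew) for a, h in zip(alphas, hypotheses)))
    return H
```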

AdaBoost (batch boosting)

• Same algorithm as above, with weights $\alpha_t = \frac{1}{2}\log\frac{1-\varepsilon_t}{\varepsilon_t}$ and output $h(x) := \sum_{t=1}^{T} \alpha_t h_t(x)$.

• In filtering there is no "fixed" data set S. What about $D_t$?

Filtering Idea

• Key idea: simulate $D_t$ using the filter, which accepts only "hard" examples.

• Data Oracle (D) → Filter → accept (use in the boosting algorithm) or reject.

FilterBoost: Main Algorithm

• Given: Oracle access to the distribution D.

• In each round t,

• use Filter_t to obtain (really, to sample) a set $S_t \sim D_t$.

• choose a hypothesis $h_t$ that does well on $S_t$.

• estimate the error $\varepsilon_t$ of $h_t$ by using the oracle again.

• give $h_t$ a weight of $\alpha_t = \frac{1}{2}\log\frac{1-\varepsilon_t}{\varepsilon_t}$.

• Output: hypothesis $h(x) := \sum_{t=1}^{T} \alpha_t h_t(x)$.

Filtering: Details

• Simulate $D_t$ using rejection sampling: Data Oracle (D) → Filter → accept (use in the boosting algorithm) or reject.

• Give higher weight to misclassified examples:

• if $yH(x) > 0$, the label and the hypothesis agree, so the example should have a low probability of being accepted.

• otherwise, if $yH(x) < 0$, the example is misclassified and should have a high probability of being accepted.

• So, try $D(x, y) = \exp(-yH(x))$.

• Difficulty: too much weight on too few examples.

Filtering: Details

• Truncated exponential weights work for filtering [M-AdaBoost, Domingo & Watanabe, 00].

• (Figure: Data Oracle (D) → Filter → accept or reject; plot of the truncated exponential acceptance weight.)

Filtering: Details

• FilterBoost is based on AdaBoost for logistic regression: minimizing the logistic loss leads to logistic weights.

• FilterBoost: $\Pr[\mathrm{accept}(x, y)] = \dfrac{1}{1 + e^{yH(x)}}$.

• M-AdaBoost: $\Pr[\mathrm{accept}(x, y)] = \min\{1, e^{-yH(x)}\}$.

• (Figure: Data Oracle (D) → Filter → accept or reject; plot comparing the two acceptance probabilities as functions of $yH(x)$.)
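To make the weighting schemes concrete, here is a small sketch (my own illustration, not the paper's code) of the candidate acceptance weights as functions of the margin yH(x). Only the truncated exponential and the logistic weight are valid acceptance probabilities for rejection sampling; the raw exponential is unbounded, which is exactly the difficulty noted above.

```python
import numpy as np

def raw_exponential_weight(margin):
    """exp(-yH(x)): unbounded, so it cannot be used directly as an
    acceptance probability (too much weight on too few examples)."""
    return np.exp(-margin)

def madaboost_accept(margin):
    """M-AdaBoost [Domingo & Watanabe, 00]: truncated exponential weight."""
    return np.minimum(1.0, np.exp(-margin))

def filterboost_accept(margin):
    """FilterBoost: logistic weight 1 / (1 + exp(yH(x)))."""
    return 1.0 / (1.0 + np.exp(margin))

margins = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])   # example yH(x) values
for m in margins:
    print(f"yH(x)={m:+.1f}  exp={raw_exponential_weight(m):6.3f}  "
          f"M-AdaBoost={madaboost_accept(m):5.3f}  "
          f"FilterBoost={filterboost_accept(m):5.3f}")
```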


FilterBoost: Main Algorithm

• Q1. How much time does Filter_t take?

• Q2. How many boosting rounds are needed?

• Q3. How can we estimate the hypothesis errors?

• A1. If the filter takes too long to accept an example, the current hypothesis is already accurate enough.

• A2. If each weak hypothesis' error is bounded away from 1/2, we make good progress in each round.

• A3. Adaptive sampling [Watanabe, 00].
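To tie the pieces together, below is a hedged end-to-end sketch of a boosting-by-filtering loop in the spirit of the algorithm and answers above. The `oracle` and `train_weak_learner` callables, the fixed sample sizes, and the simple time-out are my own illustrative assumptions; the paper itself uses adaptive sampling [Watanabe, 00] and a confidence parameter to control these quantities.

```python
import numpy as np

def filterboost(oracle, train_weak_learner, T=100,
                n_train=500, n_edge_est=2000, max_filter_tries=100_000):
    """Sketch of a boosting-by-filtering loop (labels in {-1, +1}).

    oracle() is assumed to return one i.i.d. example (x, y) from D.
    train_weak_learner(X, y) returns a function h: x -> {-1, +1}.
    """
    hypotheses, alphas = [], []

    def H(x):
        return sum(a * h(x) for a, h in zip(alphas, hypotheses))

    def filter_example():
        # Rejection sampling: accept (x, y) with probability 1 / (1 + exp(yH(x))).
        for _ in range(max_filter_tries):
            x, y = oracle()
            if np.random.rand() < 1.0 / (1.0 + np.exp(y * H(x))):
                return x, y
        return None  # filter "timed out": current H is already accurate enough (A1)

    for t in range(T):
        # Draw a filtered sample S_t ~ D_t and fit the weak hypothesis on it.
        sample = [filter_example() for _ in range(n_train)]
        if any(s is None for s in sample):
            break
        X, y = zip(*sample)
        h = train_weak_learner(np.array(X), np.array(y))

        # Estimate the edge of h_t on freshly filtered examples.
        # (The paper uses adaptive sampling here; a fixed sample is a simplification.)
        est = [filter_example() for _ in range(n_edge_est)]
        if any(s is None for s in est):
            break
        edge = np.mean([yy * h(xx) for xx, yy in est]) / 2.0   # gamma_t ~ E_q[y h]/2
        edge = np.clip(edge, 1e-6, 0.5 - 1e-6)
        alphas.append(0.5 * np.log((0.5 + edge) / (0.5 - edge)))
        hypotheses.append(h)

    return lambda x: np.sign(H(x))
```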

FilterBoost: Analysis

• Interpreted as an additive logistic regression model. Suppose $\log\dfrac{\Pr[y = 1 \mid x]}{\Pr[y = -1 \mid x]} = \sum_t f_t(x) = F(x)$.

• This implies $\Pr[y = 1 \mid x] = \dfrac{1}{1 + e^{-F(x)}}$.

• In the case of FilterBoost, $f_t(x) = \alpha_t h_t(x)$.

• The expected negative log-likelihood of an example is $\ell(F) = \mathbb{E}\!\left[-\ln\dfrac{1}{1 + e^{-yF(x)}}\right]$.

• FilterBoost minimizes this function. Like AdaBoost, it uses gradient descent on $\ell(F + \alpha_t h_t)$ to determine the weak learner $h_t$ and the step size $\alpha_t$.
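The "implies" step above is a one-line manipulation; writing $p = \Pr[y = 1 \mid x]$:

$$
\log\frac{p}{1-p} = F(x)
\;\Longrightarrow\; \frac{p}{1-p} = e^{F(x)}
\;\Longrightarrow\; p = \frac{e^{F(x)}}{1 + e^{F(x)}} = \frac{1}{1 + e^{-F(x)}}.
$$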

FilterBoost: Details

• Second-order expansion of $\ell(F + \alpha h)$ around $\alpha = 0$:

  $\ell(F + \alpha h) \approx \ell(F) + \alpha h\,\ell'(F) + \frac{\alpha^2 h^2}{2}\,\ell''(F) = \mathbb{E}\!\left[\ln\!\left(1 + e^{-yF(x)}\right) - \dfrac{y \alpha h}{1 + e^{yF(x)}} + \dfrac{1}{2}\,\dfrac{y^2 \alpha^2 h^2\, e^{yF(x)}}{\left(1 + e^{yF(x)}\right)^2}\right]$

• Since $y$ and $h(x)$ are both $\pm 1$, the quadratic term does not depend on $h$, so for positive $\alpha$ this expression is minimized when we maximize $\mathbb{E}\!\left[\dfrac{y h(x)}{1 + e^{yF(x)}}\right] \propto \mathbb{E}_q[y h(x)]$.

• This is maximized for $h(x) = \mathrm{sign}\!\left(\mathbb{E}_q[y \mid x]\right)$.

• Once $h(x)$ is fixed, $\alpha$ is determined by the edge $\gamma$ so as to minimize the upper bound $\ell(F + \alpha h) \le \mathbb{E}\!\left[e^{-y(F(x) + \alpha h(x))}\right]$, which gives $\alpha = \dfrac{1}{2}\log\dfrac{1/2 + \gamma}{1/2 - \gamma}$.
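A short worked connection between the objective and the edge (my own filling-in, assuming the usual definition of the edge $\gamma = 1/2 - \mathrm{err}_q(h)$ under the filtered distribution $q$, with $y, h(x) \in \{-1, +1\}$):

$$
\mathbb{E}_q[y\,h(x)] = \Pr_q[h(x) = y] - \Pr_q[h(x) \neq y] = 1 - 2\,\mathrm{err}_q(h) = 2\gamma,
$$

so maximizing $\mathbb{E}_q[y h(x)]$ is the same as maximizing the edge, and the step size $\alpha = \frac{1}{2}\log\frac{1/2+\gamma}{1/2-\gamma} = \frac{1}{2}\log\frac{1-\mathrm{err}_q(h)}{\mathrm{err}_q(h)}$ is the AdaBoost-style weight with the error measured under the filtered distribution.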

FilterBoost: Details (1)

FilterBoost: Details (2)

FilterBoost: Details (3)

FilterBoost: Theory

• Theorem: Assume that the weak hypotheses have edge at least $\gamma$, and let $\varepsilon$ be the target error rate. FilterBoost produces a final hypothesis with error less than $\varepsilon$ within $T$ rounds, where $T = \tilde{O}\!\left(\frac{1}{\varepsilon \gamma^2}\right)$.

• The exact bound is $T > \dfrac{2 \ln 2}{\varepsilon\left(1 - 2\sqrt{1/4 - \gamma^2}\right)}$.

• Proof elements:

• Step 1: $\mathrm{err}_t \le 2 p_t$, where $\mathrm{err}_t$ is the hypothesis error in round $t$ and $p_t$ is the probability of accepting an example in round $t$.

• Step 2: $\ell_t - \ell_{t+1} \ge p_t\left(1 - 2\sqrt{1/4 - \gamma_t^2}\right)$.

• Assume for contradiction that $\mathrm{err}_t \ge \varepsilon$ for all $t \in \{1, \dots, T\}$; the two steps then force the loss to decrease by more than its total available amount, a contradiction.
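For a sense of scale, plugging in illustrative values (my own, not from the paper), say $\gamma = 0.1$ and $\varepsilon = 0.05$:

$$
1 - 2\sqrt{1/4 - (0.1)^2} \approx 1 - 2 \times 0.4899 \approx 0.0202,
\qquad
T > \frac{2 \ln 2}{0.05 \times 0.0202} \approx 1.4 \times 10^{3},
$$

i.e. on the order of a thousand rounds. This matches the $\tilde{O}\!\left(\frac{1}{\varepsilon\gamma^{2}}\right)$ rate, since $1 - 2\sqrt{1/4 - \gamma^{2}} \approx 2\gamma^{2}$ for small $\gamma$.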

FilterBoost: Theory

• Step 1:

• recall that $\Pr[\mathrm{accept}(x, y)] = \dfrac{1}{1 + e^{yH(x)}}$; call it $q_t(x, y)$.

• (Figure: plot of $q_t(x, y)$ as a function of $yH(x)$.)
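One way to fill in Step 1, consistent with the definitions above (my own derivation): every misclassified example is accepted with probability at least $1/2$, since $yH(x) \le 0$ implies $q_t(x, y) = \frac{1}{1 + e^{yH(x)}} \ge \frac{1}{2}$. Hence

$$
p_t = \mathbb{E}_{(x,y)\sim D}\big[q_t(x, y)\big]
\;\ge\; \mathbb{E}_D\big[q_t(x, y)\,\mathbf{1}\{yH(x) \le 0\}\big]
\;\ge\; \tfrac{1}{2}\Pr_D[yH(x) \le 0]
\;=\; \tfrac{1}{2}\,\mathrm{err}_t,
$$

and rearranging gives $\mathrm{err}_t \le 2 p_t$.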

FilterBoost: Theory

• Step 2:

• recall that $\ell(F) = \mathbb{E}\!\left[-\ln\dfrac{1}{1 + e^{-yF(x)}}\right]$.

• expanding the expectation, $\ell_t = \sum_{(x,y)} D(x, y)\,\ln\dfrac{1}{1 - q_t(x, y)}$, so that

• $\ell_t - \ell_{t+1} = \sum_{(x,y)} D(x, y)\,\ln\dfrac{1 - q_{t+1}(x, y)}{1 - q_t(x, y)}$.

FilterBoost: Theory

• Let …

• Recall that …
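A sketch of how the two steps combine, reconstructed from the theorem statement and Steps 1 and 2 rather than taken from the slide: with $H \equiv 0$ initially, $\ell_0 = \ln 2$, and $\ell_t \ge 0$ for all $t$. If $\mathrm{err}_t \ge \varepsilon$ for every $t \le T$, then Step 1 gives $p_t \ge \varepsilon/2$, and Step 2 gives

$$
\ln 2 \;\ge\; \ell_0 - \ell_T \;=\; \sum_{t=0}^{T-1}\left(\ell_t - \ell_{t+1}\right)
\;\ge\; T\,\frac{\varepsilon}{2}\left(1 - 2\sqrt{1/4 - \gamma^2}\right),
$$

which is impossible once $T > \frac{2\ln 2}{\varepsilon\left(1 - 2\sqrt{1/4 - \gamma^2}\right)}$; hence some round must already satisfy $\mathrm{err}_t < \varepsilon$.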

FilterBoost vs. Rest

• Comparison (as reported in the paper):

  Algorithm                           | Needs decreasing edges? | Needs bound on min edge? | Infinite weak learner space? | # rounds
  M-AdaBoost [Domingo & Watanabe, 00] | Y                       | N                        | Y                            | 1/ε
  AdaFlat [Gavinsky, 02]              | N                       | N                        | Y                            | 1/ε²
  GiniBoost [Hatano, 06]              | N                       | N                        | N                            | 1/ε
  FilterBoost                         | N                       | N                        | Y                            | 1/ε

FilterBoost: Details

• The previous analysis overlooked the probability of failure introduced by three steps:

• training the weak learner,

• deciding when to stop boosting,

• estimating the edges.

Experiments

• Tested FilterBoost against M-AdaBoost, AdaBoost, Logistic AdaBoost.

• Synthetic and real data sets.

• Tested: classification and conditional probability estimation.


Experiments: Classification

• Noisy Majority Vote (synthetic).

• Decision stumps are the weak learners.

• 500,000 examples.

• FilterBoost achieves high accuracy fast.

Experiments: CPE

• Conditional Probability Estimation


Conclusion

• FilterBoost is well suited to boosting over large data sets.

• Fewer assumptions and better guarantees than earlier filtering algorithms.

• Validated empirically on both classification and conditional probability estimation.