
FilterBoost: Regression and Classification on Large Datasets

NIPS’07 paper by Joseph K. Bradley and Robert E. Schapire

* some slides reused from the NIPS ’07 presentation


Typical Boosting Framework

• Batch framework: Dataset → Booster → Final Hypothesis.

• The dataset is sampled i.i.d. from the target distribution D.

• The booster must always have access to the entire dataset.

Motivation for a New Framework

• In the typical framework, the boosting algorithm must have access to the entire data set.

• This makes boosting hard to apply when data sets are very large.

• It is computationally infeasible because, in each round, the distribution weight of every point must be updated.

• New ideas:

• use a data stream instead of the entire (fixed) data set.

• train on a new subset of the data in each round.

Filtering Framework

• Boost for 1000 rounds: only store ~1/1000 of the data at a time.

• Data Oracle (samples ~ D) → Booster → Final Hypothesis; the booster stores only a tiny fraction of the data!

Paper’s Results

• A new boosting-by-filtering algorithm.

• Provable guarantees.

• Applicable to both classification and conditional probability estimation.

• Good empirical performance.


AdaBoost (batch boosting)

• Given: Fixed data set S.

• In each round t,

• choose a distribution $D_t$ over S.

• choose a hypothesis $h_t$.

• estimate the error $\varepsilon_t$ of $h_t$ with respect to $D_t$.

• give $h_t$ a weight of $\alpha_t = \frac{1}{2}\log\frac{1-\varepsilon_t}{\varepsilon_t}$.

• Output: hypothesis $h(x) := \sum_{t=1}^{T} \alpha_t h_t(x)$.

• $D_t$ puts higher weight on misclassified examples, forcing the algorithm to correctly classify "harder" examples in later rounds.
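As a concrete illustration (not code from the paper), here is a minimal sketch of the batch loop above. The `train_weak_learner` helper and the numpy-style interface are my own assumptions.

```python
import numpy as np

def adaboost(X, y, train_weak_learner, T=100):
    """Minimal batch AdaBoost sketch (labels y in {-1, +1}).

    train_weak_learner(X, y, D) is an assumed helper that returns a
    function h: X -> {-1, +1} fit under the distribution D over S.
    """
    n = len(y)
    D = np.full(n, 1.0 / n)            # D_1: uniform over the fixed data set S
    hypotheses, alphas = [], []

    for t in range(T):
        h = train_weak_learner(X, y, D)
        pred = h(X)
        eps = np.sum(D[pred != y])     # weighted error of h_t under D_t
        eps = np.clip(eps, 1e-12, 1 - 1e-12)
        alpha = 0.5 * np.log((1 - eps) / eps)   # alpha_t = 1/2 log((1 - eps_t)/eps_t)

        # Higher weight to misclassified examples in the next round.
        D *= np.exp(-alpha * y * pred)
        D /= D.sum()

        hypotheses.append(h)
        alphas.append(alpha)

    # Final hypothesis: sign of the weighted vote sum_t alpha_t h_t(x).
    def H(Xnew):
        return np.sign(sum(a * h(Xnew) for a, h in zip(alphas, hypotheses)))
    return H
```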

AdaBoost (batch boosting)

• Same algorithm as above, with weights $\alpha_t = \frac{1}{2}\log\frac{1-\varepsilon_t}{\varepsilon_t}$ and output $h(x) := \sum_{t=1}^{T} \alpha_t h_t(x)$.

• In filtering there is no "fixed" data set S. What about $D_t$?

Filtering Idea

• Key idea: simulate $D_t$ using the filter, which accepts only "hard" examples.

• Data Oracle (D) → Filter → accept (use in the boosting algorithm) or reject.

FilterBoost: Main Algorithm

• Given: Oracle access to the distribution D.

• In each round t,

• use Filter_t to obtain (really, to sample) a set $S_t \sim D_t$.

• choose a hypothesis $h_t$ that does well on $S_t$.

• estimate the error $\varepsilon_t$ of $h_t$ by using the oracle again.

• give $h_t$ a weight of $\alpha_t = \frac{1}{2}\log\frac{1-\varepsilon_t}{\varepsilon_t}$.

• Output: hypothesis $h(x) := \sum_{t=1}^{T} \alpha_t h_t(x)$.

Filtering: Details

• Simulate $D_t$ using rejection sampling: Data Oracle (D) → Filter → accept (use in the boosting algorithm) or reject.

• Give higher weight to misclassified examples:

• if $yH(x) > 0$, the label and the hypothesis agree, so the example should have a low probability of being accepted.

• otherwise, if $yH(x) < 0$, the example is misclassified and should have a high probability of being accepted.

• So, try $D(x, y) = \exp(-yH(x))$.

• Difficulty: too much weight on too few examples.

Filtering: Details

• Truncated exponential weights work for filtering [M-AdaBoost, Domingo & Watanabe, 00].

• (Figure: Data Oracle (D) → Filter → accept or reject; plot of the truncated exponential acceptance weight.)

Filtering: Details

• FilterBoost is based on AdaBoost for logistic regression: minimizing the logistic loss leads to logistic weights.

• FilterBoost: $\Pr[\mathrm{accept}(x, y)] = \dfrac{1}{1 + e^{yH(x)}}$.

• M-AdaBoost: $\Pr[\mathrm{accept}(x, y)] = \min\{1, e^{-yH(x)}\}$.

• (Figure: Data Oracle (D) → Filter → accept or reject; plot comparing the two acceptance probabilities as functions of $yH(x)$.)
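To make the weighting schemes concrete, here is a small sketch (my own illustration, not the paper's code) of the candidate acceptance weights as functions of the margin yH(x). Only the truncated exponential and the logistic weight are valid acceptance probabilities for rejection sampling; the raw exponential is unbounded, which is exactly the difficulty noted above.

```python
import numpy as np

def raw_exponential_weight(margin):
    """exp(-yH(x)): unbounded, so it cannot be used directly as an
    acceptance probability (too much weight on too few examples)."""
    return np.exp(-margin)

def madaboost_accept(margin):
    """M-AdaBoost [Domingo & Watanabe, 00]: truncated exponential weight."""
    return np.minimum(1.0, np.exp(-margin))

def filterboost_accept(margin):
    """FilterBoost: logistic weight 1 / (1 + exp(yH(x)))."""
    return 1.0 / (1.0 + np.exp(margin))

margins = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])   # example yH(x) values
for m in margins:
    print(f"yH(x)={m:+.1f}  exp={raw_exponential_weight(m):6.3f}  "
          f"M-AdaBoost={madaboost_accept(m):5.3f}  "
          f"FilterBoost={filterboost_accept(m):5.3f}")
```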


FilterBoost: Main Algorithm

• Q1. How much time does Filter_t take?

• Q2. How many boosting rounds are needed?

• Q3. How can we estimate the hypothesis errors?

• A1. If the filter takes too long to accept an example, the current hypothesis is already accurate enough.

• A2. If each weak hypothesis' error is bounded away from 1/2, we make good progress in each round.

• A3. Adaptive sampling [Watanabe, 00].
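To tie the pieces together, below is a hedged end-to-end sketch of a boosting-by-filtering loop in the spirit of the algorithm and answers above. The `oracle` and `train_weak_learner` callables, the fixed sample sizes, and the simple time-out are my own illustrative assumptions; the paper itself uses adaptive sampling [Watanabe, 00] and a confidence parameter to control these quantities.

```python
import numpy as np

def filterboost(oracle, train_weak_learner, T=100,
                n_train=500, n_edge_est=2000, max_filter_tries=100_000):
    """Sketch of a boosting-by-filtering loop (labels in {-1, +1}).

    oracle() is assumed to return one i.i.d. example (x, y) from D.
    train_weak_learner(X, y) returns a function h: x -> {-1, +1}.
    """
    hypotheses, alphas = [], []

    def H(x):
        return sum(a * h(x) for a, h in zip(alphas, hypotheses))

    def filter_example():
        # Rejection sampling: accept (x, y) with probability 1 / (1 + exp(yH(x))).
        for _ in range(max_filter_tries):
            x, y = oracle()
            if np.random.rand() < 1.0 / (1.0 + np.exp(y * H(x))):
                return x, y
        return None  # filter "timed out": current H is already accurate enough (A1)

    for t in range(T):
        # Draw a filtered sample S_t ~ D_t and fit the weak hypothesis on it.
        sample = [filter_example() for _ in range(n_train)]
        if any(s is None for s in sample):
            break
        X, y = zip(*sample)
        h = train_weak_learner(np.array(X), np.array(y))

        # Estimate the edge of h_t on freshly filtered examples.
        # (The paper uses adaptive sampling here; a fixed sample is a simplification.)
        est = [filter_example() for _ in range(n_edge_est)]
        if any(s is None for s in est):
            break
        edge = np.mean([yy * h(xx) for xx, yy in est]) / 2.0   # gamma_t ~ E_q[y h]/2
        edge = np.clip(edge, 1e-6, 0.5 - 1e-6)
        alphas.append(0.5 * np.log((0.5 + edge) / (0.5 - edge)))
        hypotheses.append(h)

    return lambda x: np.sign(H(x))
```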

FilterBoost: Analysis

• Interpreted as an additive logistic regression model. Suppose $\log\dfrac{\Pr[y = 1 \mid x]}{\Pr[y = -1 \mid x]} = \sum_t f_t(x) = F(x)$.

• This implies $\Pr[y = 1 \mid x] = \dfrac{1}{1 + e^{-F(x)}}$.

• In the case of FilterBoost, $f_t(x) = \alpha_t h_t(x)$.

• The expected negative log-likelihood of an example is $\ell(F) = \mathbb{E}\!\left[-\ln\dfrac{1}{1 + e^{-yF(x)}}\right]$.

• FilterBoost minimizes this function. Like AdaBoost, it uses gradient descent on $\ell(F + \alpha_t h_t)$ to determine the weak learner $h_t$ and the step size $\alpha_t$.
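The "implies" step above is a one-line manipulation; writing $p = \Pr[y = 1 \mid x]$:

$$
\log\frac{p}{1-p} = F(x)
\;\Longrightarrow\; \frac{p}{1-p} = e^{F(x)}
\;\Longrightarrow\; p = \frac{e^{F(x)}}{1 + e^{F(x)}} = \frac{1}{1 + e^{-F(x)}}.
$$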

FilterBoost: Details

• Second-order expansion of $\ell(F + \alpha h)$ around $\alpha = 0$:

  $\ell(F + \alpha h) \approx \ell(F) + \alpha h\,\ell'(F) + \frac{\alpha^2 h^2}{2}\,\ell''(F) = \mathbb{E}\!\left[\ln\!\left(1 + e^{-yF(x)}\right) - \dfrac{y \alpha h}{1 + e^{yF(x)}} + \dfrac{1}{2}\,\dfrac{y^2 \alpha^2 h^2\, e^{yF(x)}}{\left(1 + e^{yF(x)}\right)^2}\right]$

• Since $y$ and $h(x)$ are both $\pm 1$, the quadratic term does not depend on $h$, so for positive $\alpha$ this expression is minimized when we maximize $\mathbb{E}\!\left[\dfrac{y h(x)}{1 + e^{yF(x)}}\right] \propto \mathbb{E}_q[y h(x)]$.

• This is maximized for $h(x) = \mathrm{sign}\!\left(\mathbb{E}_q[y \mid x]\right)$.

• Once $h(x)$ is fixed, $\alpha$ is determined by the edge $\gamma$ so as to minimize the upper bound $\ell(F + \alpha h) \le \mathbb{E}\!\left[e^{-y(F(x) + \alpha h(x))}\right]$, which gives $\alpha = \dfrac{1}{2}\log\dfrac{1/2 + \gamma}{1/2 - \gamma}$.
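A short worked connection between the objective and the edge (my own filling-in, assuming the usual definition of the edge $\gamma = 1/2 - \mathrm{err}_q(h)$ under the filtered distribution $q$, with $y, h(x) \in \{-1, +1\}$):

$$
\mathbb{E}_q[y\,h(x)] = \Pr_q[h(x) = y] - \Pr_q[h(x) \neq y] = 1 - 2\,\mathrm{err}_q(h) = 2\gamma,
$$

so maximizing $\mathbb{E}_q[y h(x)]$ is the same as maximizing the edge, and the step size $\alpha = \frac{1}{2}\log\frac{1/2+\gamma}{1/2-\gamma} = \frac{1}{2}\log\frac{1-\mathrm{err}_q(h)}{\mathrm{err}_q(h)}$ is the AdaBoost-style weight with the error measured under the filtered distribution.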

FilterBoost: Details (1)

FilterBoost: Details (2)

FilterBoost: Details (3)

FilterBoost: Theory

• Theorem: Assume that the weak hypotheses have edge at least $\gamma$, and let $\varepsilon$ be the target error rate. FilterBoost produces a final hypothesis with error less than $\varepsilon$ within $T$ rounds, where $T = \tilde{O}\!\left(\frac{1}{\varepsilon \gamma^2}\right)$.

• The exact bound is $T > \dfrac{2 \ln 2}{\varepsilon\left(1 - 2\sqrt{1/4 - \gamma^2}\right)}$.

• Proof elements:

• Step 1: $\mathrm{err}_t \le 2 p_t$, where $\mathrm{err}_t$ is the hypothesis error in round $t$ and $p_t$ is the probability of accepting an example in round $t$.

• Step 2: $\ell_t - \ell_{t+1} \ge p_t\left(1 - 2\sqrt{1/4 - \gamma_t^2}\right)$.

• Assume for contradiction that $\mathrm{err}_t \ge \varepsilon$ for all $t \in \{1, \dots, T\}$; the two steps then force the loss to decrease by more than its total available amount, a contradiction.
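For a sense of scale, plugging in illustrative values (my own, not from the paper), say $\gamma = 0.1$ and $\varepsilon = 0.05$:

$$
1 - 2\sqrt{1/4 - (0.1)^2} \approx 1 - 2 \times 0.4899 \approx 0.0202,
\qquad
T > \frac{2 \ln 2}{0.05 \times 0.0202} \approx 1.4 \times 10^{3},
$$

i.e. on the order of a thousand rounds. This matches the $\tilde{O}\!\left(\frac{1}{\varepsilon\gamma^{2}}\right)$ rate, since $1 - 2\sqrt{1/4 - \gamma^{2}} \approx 2\gamma^{2}$ for small $\gamma$.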

FilterBoost: Theory

• Step 1:

• recall that $\Pr[\mathrm{accept}(x, y)] = \dfrac{1}{1 + e^{yH(x)}}$; call it $q_t(x, y)$.

• (Figure: plot of $q_t(x, y)$ as a function of $yH(x)$.)
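One way to fill in Step 1, consistent with the definitions above (my own derivation): every misclassified example is accepted with probability at least $1/2$, since $yH(x) \le 0$ implies $q_t(x, y) = \frac{1}{1 + e^{yH(x)}} \ge \frac{1}{2}$. Hence

$$
p_t = \mathbb{E}_{(x,y)\sim D}\big[q_t(x, y)\big]
\;\ge\; \mathbb{E}_D\big[q_t(x, y)\,\mathbf{1}\{yH(x) \le 0\}\big]
\;\ge\; \tfrac{1}{2}\Pr_D[yH(x) \le 0]
\;=\; \tfrac{1}{2}\,\mathrm{err}_t,
$$

and rearranging gives $\mathrm{err}_t \le 2 p_t$.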

FilterBoost: Theory

• Step 2:

• recall that $\ell(F) = \mathbb{E}\!\left[-\ln\dfrac{1}{1 + e^{-yF(x)}}\right]$.

• expanding the expectation, $\ell_t = \sum_{(x,y)} D(x, y)\,\ln\dfrac{1}{1 - q_t(x, y)}$, so that

• $\ell_t - \ell_{t+1} = \sum_{(x,y)} D(x, y)\,\ln\dfrac{1 - q_{t+1}(x, y)}{1 - q_t(x, y)}$.

FilterBoost: Theory

• Let …

• Recall that …
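A sketch of how the two steps combine, reconstructed from the theorem statement and Steps 1 and 2 rather than taken from the slide: with $H \equiv 0$ initially, $\ell_0 = \ln 2$, and $\ell_t \ge 0$ for all $t$. If $\mathrm{err}_t \ge \varepsilon$ for every $t \le T$, then Step 1 gives $p_t \ge \varepsilon/2$, and Step 2 gives

$$
\ln 2 \;\ge\; \ell_0 - \ell_T \;=\; \sum_{t=0}^{T-1}\left(\ell_t - \ell_{t+1}\right)
\;\ge\; T\,\frac{\varepsilon}{2}\left(1 - 2\sqrt{1/4 - \gamma^2}\right),
$$

which is impossible once $T > \frac{2\ln 2}{\varepsilon\left(1 - 2\sqrt{1/4 - \gamma^2}\right)}$; hence some round must already satisfy $\mathrm{err}_t < \varepsilon$.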

FilterBoost vs. Rest

• Comparison (as reported in the paper):

  Algorithm                           | Needs decreasing edges? | Needs bound on min edge? | Infinite weak learner space? | # rounds
  M-AdaBoost [Domingo & Watanabe, 00] | Y                       | N                        | Y                            | 1/ε
  AdaFlat [Gavinsky, 02]              | N                       | N                        | Y                            | 1/ε²
  GiniBoost [Hatano, 06]              | N                       | N                        | N                            | 1/ε
  FilterBoost                         | N                       | N                        | Y                            | 1/ε

FilterBoost: Details

• The previous analysis overlooked the probability of failure introduced by three steps:

• training the weak learner,

• deciding when to stop boosting,

• estimating the edges.

Experiments

• Tested FilterBoost against M-AdaBoost, AdaBoost, Logistic AdaBoost.

• Synthetic and real data sets.

• Tested: classification and conditional probability estimation.


Experiments: Classification

• Noisy Majority Vote (synthetic).

• Decision stumps are the weak learners.

• 500,000 examples.

• FilterBoost achieves high accuracy fast.

Experiments: CPE

• Conditional Probability Estimation


Conclusion

• FilterBoost is well suited to boosting over large data sets.

• Fewer assumptions and better guarantees than earlier filtering algorithms.

• Validated empirically on both classification and conditional probability estimation.