Machine Learning for Image Classification, Part II: Ensemble Approaches
Jianping Fan Dept of Computer Science
UNC-Charlotte
Course Website: http://webpages.uncc.edu/jfan/itcs5152.html
Ensemble Learning
A machine learning paradigm in which multiple learners are trained and combined to solve the same problem
[Diagram: the same problem handed to a single learner vs. to multiple learners whose outputs are combined]
Previously: single classifier
Ensemble: multiple classifiers
$H(x) = \mathrm{sign}\left(\sum_{t=1}^{T} \alpha_t h_t(x)\right)$
Ensemble Classifier
[Diagram: the dataset is sampled into Subset 1, Subset 2, ..., Subset T; each subset is fed to a learning algorithm (ML) to produce a classifier f1, f2, ..., fT; the individual outputs are combined into the final classifier f by the weighted vote above]
It is not a good idea to randomly combine multiple classifiers!
More helps most of the time, but sometimes more is less!
Wish List:
Each weak classifier should be different from the others, with different focuses and capabilities; they should even be able to compensate for each other!
Each of them plays a different role!
Ensemble Classifier
Majority voting: winner takes all!
Weighted voting: combine with weights
Averaging: combine with equal weights
(A minimal sketch of these three rules follows below.)
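To make the three combination rules concrete, here is a minimal sketch (my own illustration, not from the slides; the NumPy helpers and function names are assumptions) that combines the ±1 outputs of T weak classifiers:

```python
import numpy as np

def majority_vote(preds):
    """preds: (T, n) array of +/-1 predictions. Winner takes all."""
    return np.sign(preds.sum(axis=0))

def weighted_vote(preds, alphas):
    """Combine with per-classifier weights: sign(sum_t alpha_t * h_t(x))."""
    return np.sign(np.tensordot(alphas, preds, axes=1))

def averaging(preds):
    """Combine with equal weights (same decision as majority voting for +/-1 outputs)."""
    return np.sign(preds.mean(axis=0))
```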
Why do we learn from data subsets?
Ensemble Classifier
What may affect an ensemble classifier?
Diversity of the weak classifiers: we would not hire two nearly identical persons.
Weights for combining the weak classifiers: we know they play different, not equal, roles in the final decision.
Ensemble Classifier
How to train a set of classifiers with
diverse capabilities?
1. Using different datasets for the same data-driven learning algorithm
2. Using different learning algorithms to train different classifiers
from the same dataset
Ensemble Classifier
We may prefer weighted voting for the ensemble:
$H(x) = \mathrm{sign}\left(\sum_{t=1}^{T} \alpha_t h_t(x)\right)$
How do we determine the weights $\alpha_t$ automatically?
Wish List: NBA Championship Rule
Without Shaq, Kobe could not even reach the playoffs for several years!
Kobe finally found the solution!
How about this man, with enough helpers?
Yes, you can, after I retire or move to the Lakers!
The weak classifiers are not ``weak'' at all!
They are all very strong in some places, and they know how to balance and compensate for each other!
Our observations from the NBA example
Diversity of the weak classifiers is not sufficient; they should compensate for each other!
Weak classifiers are not weak! They are very strong in certain places!
Weights should depend on their importance, potential contributions, or capabilities!
Diversity of Weak Classifiers
Train different weak classifiers from various data subsets!
Data-driven learning algorithms make these weak classifiers different!
Sample various subsets from the same big dataset!
A Brief History
• Bootstrapping: resampling for estimating a statistic
• Bagging
• Boosting (Schapire 1989)
• AdaBoost (Schapire 1995)
Bagging, boosting, and AdaBoost use resampling for classifier design. How do we make the weak classifiers diverse?
Bootstrap Estimation
• Repeatedly draw n samples from D
• For each set of samples, estimate a statistic
• The bootstrap estimate is the mean of the individual estimates
• Used to estimate a statistic (parameter) and its variance (a minimal sketch follows below)
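A minimal sketch of bootstrap estimation as described above (my own illustration; the function name, the NumPy usage, and the choice of the mean as the statistic are assumptions):

```python
import numpy as np

def bootstrap_estimate(data, statistic=np.mean, num_resamples=1000, seed=0):
    """Repeatedly draw n samples from the data with replacement, estimate the
    statistic on each resample, and return the mean and variance of the estimates."""
    rng = np.random.default_rng(seed)
    n = len(data)
    estimates = np.array([statistic(rng.choice(data, size=n, replace=True))
                          for _ in range(num_resamples)])
    return estimates.mean(), estimates.var()
```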
Bagging - Aggregate Bootstrapping
• For i = 1 .. M: draw n* < n samples from D with replacement and learn classifier Ci
• Final classifier is a vote of C1 .. CM
• Increases classifier stability / reduces variance (a minimal sketch follows below)
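A minimal bagging sketch following the loop above (my own illustration; the names are hypothetical, labels are assumed to be in {-1, +1}, and a scikit-learn decision tree is used only as an example base learner):

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def bagging_train(X, y, num_models=25, sample_frac=0.8, seed=0):
    """For i = 1..M: draw n* < n samples from D with replacement and learn classifier Ci."""
    rng = np.random.default_rng(seed)
    n_sub = int(sample_frac * len(X))
    models = []
    for _ in range(num_models):
        idx = rng.choice(len(X), size=n_sub, replace=True)
        models.append(DecisionTreeClassifier().fit(X[idx], y[idx]))
    return models

def bagging_predict(models, X):
    """The final classifier is a (majority) vote of C1 .. CM over the +/-1 labels."""
    votes = np.array([m.predict(X) for m in models])
    return np.sign(votes.sum(axis=0))
```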
Bagging
[Diagram: the dataset is resampled into subsets; each subset is fed to a learning algorithm (ML) to produce classifiers f1, f2, ..., fT, which are combined into f]
Boosting
[Diagram: the original training sample trains f1; a reweighted sample trains f2; further reweighted samples train the remaining classifiers up to fT; f1, ..., fT are combined into f]
Revisit Bagging vs. the Boosting Classifier
Differences: Bagging vs. Boosting
Boosting introduces connections and compensation between the data subsets, i.e., they know each other!
Boosting has special combination rules for integrating the weak classifiers.
Bagging vs. Boosting
• Bagging: the construction of complementary base learners is left to chance and to the instability of the learning methods.
• Boosting: actively seeks to generate complementary base learners by training the next base learner on the mistakes of the previous learners.
Boosting (Schapire 1989)
• Randomly select n1 < n samples from D without replacement to obtain D1
• Train weak learner C1
• Select n2 < n samples from D, with half of the samples misclassified by C1, to obtain D2
• Train weak learner C2
• Select all samples from D on which C1 and C2 disagree
• Train weak learner C3
• Final classifier is a vote of the weak learners (see the sketch below)
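A rough sketch of this three-classifier procedure (my own reading of the recipe above, with hypothetical names; it assumes decision stumps as weak learners, labels in {-1, +1}, and that C1 makes enough mistakes and C1 and C2 disagree somewhere, so the selected subsets are non-empty):

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def boosting_1989(X, y, n1, n2, seed=0):
    rng = np.random.default_rng(seed)
    stump = lambda: DecisionTreeClassifier(max_depth=1)
    # D1: n1 < n samples drawn from D without replacement
    idx1 = rng.choice(len(X), size=n1, replace=False)
    c1 = stump().fit(X[idx1], y[idx1])
    # D2: n2 samples, roughly half of them misclassified by C1
    wrong = np.flatnonzero(c1.predict(X) != y)
    right = np.flatnonzero(c1.predict(X) == y)
    k = min(n2 // 2, len(wrong))
    idx2 = np.concatenate([rng.choice(wrong, size=k, replace=False),
                           rng.choice(right, size=n2 - k, replace=False)])
    c2 = stump().fit(X[idx2], y[idx2])
    # D3: all samples on which C1 and C2 disagree
    idx3 = np.flatnonzero(c1.predict(X) != c2.predict(X))
    c3 = stump().fit(X[idx3], y[idx3])
    return c1, c2, c3

def boosting_1989_predict(classifiers, X):
    # Final classifier is a vote of the three weak learners
    votes = np.array([c.predict(X) for c in classifiers])
    return np.sign(votes.sum(axis=0))
```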
AdaBoost (Schapire 1995)
• Instead of sampling, re-weight the training examples
• The previous weak learner has only 50% accuracy over the new distribution
• Can be used to learn weak classifiers
• Final classification is based on a weighted vote of the weak classifiers
Adaboost Terms
• Learner = Hypothesis = Classifier
• Weak Learner: < 50% error over any distribution
• Strong Classifier: thresholded linear combination of weak learner outputs
AdaBoost = Adaptive Boosting: a learning algorithm for building a strong classifier out of a lot of weaker ones.
AdaBoost Concept
Weak classifiers, each slightly better than random:
$h_1(x) \in \{-1,+1\},\; h_2(x) \in \{-1,+1\},\; \ldots,\; h_T(x) \in \{-1,+1\}$
Strong classifier:
$H(x) = \mathrm{sign}\left(\sum_{t=1}^{T} \alpha_t h_t(x)\right)$
AdaBoost
$H(x) = \mathrm{sign}\left(\sum_{t=1}^{T} \alpha_t h_t(x)\right)$
How do we train the weak classifiers and make them compensate for each other?
How do we determine the weights automatically? We expect such weights to depend on the classifiers' performance and capabilities.
The Weak Classifiers
The weak classifiers $h_1(x), h_2(x), \ldots, h_T(x) \in \{-1,+1\}$, each slightly better than random, are combined into the strong classifier $H(x) = \mathrm{sign}\left(\sum_{t=1}^{T} \alpha_t h_t(x)\right)$.
• Each weak classifier learns by considering one simple feature
• The T most beneficial features for classification should be selected
• How to: define features? select beneficial features? train weak classifiers? manage (weight) training samples? associate a weight with each weak classifier?
The Strong Classifier
The strong classifier $H(x) = \mathrm{sign}\left(\sum_{t=1}^{T} \alpha_t h_t(x)\right)$ is a weighted combination of the weak classifiers $h_1, \ldots, h_T$, each of which is only slightly better than random.
The AdaBoost Algorithm
Given: $(x_1, y_1), \ldots, (x_m, y_m)$ where $x_i \in X$, $y_i \in \{-1, +1\}$
Initialization: $D_1(i) = \frac{1}{m}$, $i = 1, \ldots, m$
For $t = 1, \ldots, T$:
• Find the classifier $h_t: X \to \{-1,+1\}$ that minimizes the error with respect to $D_t$, i.e. $h_t = \arg\min_{h_j} \epsilon_j$ where $\epsilon_j = \sum_{i=1}^{m} D_t(i)\,[\,y_i \neq h_j(x_i)\,]$  (minimize the weighted error; $D_t(i)$ is the probability distribution over the $x_i$ at time $t$)
• Weight the classifier: $\alpha_t = \frac{1}{2}\ln\frac{1-\epsilon_t}{\epsilon_t}$  ($\alpha_t$ is chosen to minimize the exponential loss)
• Update the distribution: $D_{t+1}(i) = \frac{D_t(i)\exp[-\alpha_t y_i h_t(x_i)]}{Z_t}$, where $Z_t$ is a normalization factor  (gives wrongly classified patterns more chance to be learned)
Output the final classifier: $H(x) = \mathrm{sign}\left(\sum_{t=1}^{T} \alpha_t h_t(x)\right)$
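The algorithm above translates almost line by line into code. Below is a minimal sketch (my own, not from the course materials); it assumes labels in {-1, +1} and uses scikit-learn decision stumps, trained with the sample weights D_t, as an approximation to the arg-min weak learner:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def adaboost_train(X, y, T=50):
    """AdaBoost with decision stumps; y must be in {-1, +1}."""
    m = len(X)
    D = np.full(m, 1.0 / m)                     # D_1(i) = 1/m
    stumps, alphas = [], []
    for t in range(T):
        # Find h_t (approximately) minimizing the weighted error w.r.t. D_t.
        h = DecisionTreeClassifier(max_depth=1).fit(X, y, sample_weight=D)
        pred = h.predict(X)
        eps = D[pred != y].sum()                # weighted error epsilon_t
        if eps >= 0.5:                          # no better than random: stop early
            break
        eps = max(eps, 1e-12)                   # guard against division by zero
        alpha = 0.5 * np.log((1 - eps) / eps)   # alpha_t = 1/2 ln((1 - eps_t) / eps_t)
        D = D * np.exp(-alpha * y * pred)       # D_{t+1}(i) ~ D_t(i) exp(-alpha_t y_i h_t(x_i))
        D /= D.sum()                            # Z_t normalization
        stumps.append(h)
        alphas.append(alpha)
    return stumps, np.array(alphas)

def adaboost_predict(stumps, alphas, X):
    """H(x) = sign(sum_t alpha_t h_t(x))."""
    scores = sum(a * h.predict(X) for a, h in zip(alphas, stumps))
    return np.sign(scores)
```

A small usage example on toy data appears after the restated basic algorithm near the end of this section.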
Observations for AdaBoost
• Diversity of the weak classifiers is enhanced by compensation: the current weak classifier focuses on the samples that the previous ones predicted wrongly:
$h_t = \arg\min_{h_j} \epsilon_j$, where $\epsilon_j = \sum_{i=1}^{m} D_t(i)\,[\,y_i \neq h_j(x_i)\,]$
• Weights for combining the weak classifiers depend largely on their performance or capabilities:
$\alpha_t = \frac{1}{2}\ln\frac{1-\epsilon_t}{\epsilon_t}$, $\quad H(x) = \mathrm{sign}\left(\sum_{t=1}^{T} \alpha_t h_t(x)\right)$
• Compare these with our wish list!
Boosting illustration: Weak Classifier 1
Some samples are misclassified!
Boosting illustration: weights are increased for the misclassified samples, and the new weak classifier will pay more attention to them!
The AdaBoost Algorithm: the weights of incorrectly classified examples are increased so that the base learner is forced to focus on the hard examples in the training set.
Boosting illustration: Weak Classifier 2
Weak classifier 2 does not pay much attention to the samples that are already predicted well by weak classifier 1!
Boosting illustration: weights are increased again for the misclassified samples, and the next weak classifier will pay more attention to them!
Boosting illustration: Weak Classifier 3
Weak classifier 3 does not pay much attention to the samples that are already predicted well by weak classifiers 1 & 2!
Boosting illustration: the final classifier is a combination of the 3 weak classifiers.
Observations from this intuitive example
The current weak classifier pays more attention to the samples that are misclassified by the previous weak classifiers, so the classifiers compensate for each other in the final decision!
This provides an easy-to-hard solution for weak classifier training!
Weights for combining the weak classifiers depend largely on their performance or capabilities!
The AdaBoost Algorithm (restated above). What goal does AdaBoost want to reach? In particular, where do the classifier weight $\alpha_t = \frac{1}{2}\ln\frac{1-\epsilon_t}{\epsilon_t}$ and the distribution update come from?
They are goal dependent.
Goal: minimize the exponential loss of the final classifier $H(x) = \mathrm{sign}\left(\sum_{t=1}^{T} \alpha_t h_t(x)\right)$:
$loss(H(x)) = E_{x,y}\left[e^{-yH(x)}\right]$
Minimizing $e^{-yH(x)}$ maximizes the margin $yH(x)$.
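One standard justification for this choice of loss (my addition, not explicitly on the slides): the exponential loss is a smooth upper bound on the 0/1 classification error, since for $y \in \{-1,+1\}$

$[\![\, yH(x) < 0 \,]\!] \;\le\; e^{-yH(x)},$

so driving $E_{x,y}\left[e^{-yH(x)}\right]$ down also pushes the training error down while rewarding large margins $yH(x)$.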
Goal: minimize $loss(H(x)) = E_{x,y}\left[e^{-yH(x)}\right]$ for the final classifier $H(x) = \mathrm{sign}\left(\sum_{t=1}^{T} \alpha_t h_t(x)\right)$.
Define $H_t(x) = H_{t-1}(x) + \alpha_t h_t(x)$ with $H_0(x) = 0$; then $H(x) = H_T(x)$.
$E_{x,y}\left[e^{-yH_t(x)}\right] = E_x\!\left[E_{y}\!\left[e^{-yH_t(x)} \mid x\right]\right] = E_x\!\left[E_{y}\!\left[e^{-y[H_{t-1}(x)+\alpha_t h_t(x)]} \mid x\right]\right] = E_x\!\left[E_{y}\!\left[e^{-yH_{t-1}(x)}\,e^{-\alpha_t y h_t(x)} \mid x\right]\right]$
$= E_x\!\left[e^{-yH_{t-1}(x)}\left(e^{-\alpha_t}\,P(y = h_t(x)) + e^{\alpha_t}\,P(y \neq h_t(x))\right)\right]$
Choosing $\alpha_t$: set $\frac{\partial\, E_{x,y}\left[e^{-yH_t(x)}\right]}{\partial \alpha_t} = 0$, i.e.
$E_x\!\left[e^{-yH_{t-1}(x)}\left(-e^{-\alpha_t}\,P(y = h_t(x)) + e^{\alpha_t}\,P(y \neq h_t(x))\right)\right] = 0$
which gives
$\alpha_t = \frac{1}{2}\ln\frac{P(y = h_t(x))}{P(y \neq h_t(x))} = \frac{1}{2}\ln\frac{1-\epsilon_t}{\epsilon_t}$
where $\epsilon_t = P(\text{error}) = \sum_{i=1}^{m} D_t(i)\,[\,y_i \neq h_t(x_i)\,]$ and $P(x_i, y_i) = D_t(i)$.
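As a quick sanity check on this formula (the numbers are my own, not from the slides):

$\epsilon_t = 0.1:\;\; \alpha_t = \tfrac{1}{2}\ln\tfrac{0.9}{0.1} = \tfrac{1}{2}\ln 9 \approx 1.10$
$\epsilon_t = 0.4:\;\; \alpha_t = \tfrac{1}{2}\ln\tfrac{0.6}{0.4} \approx 0.20$
$\epsilon_t = 0.5:\;\; \alpha_t = 0$

So more accurate weak classifiers receive larger weights, and a classifier no better than random guessing receives zero weight, exactly as the wish list asked.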
This is exactly the classifier weight used in the AdaBoost algorithm stated above. The remaining question: where does the distribution update $D_{t+1}$ come from?
Finding $h_t$ and $D_{t+1}$:
$E_{x,y}\left[e^{-yH_t}\right] = E_{x,y}\left[e^{-yH_{t-1}}\,e^{-\alpha_t y h_t}\right] \approx E_{x,y}\left[e^{-yH_{t-1}}\left(1 - \alpha_t y h_t + \tfrac{1}{2}\alpha_t^2 y^2 h_t^2\right)\right]$
$h_t = \arg\min_{h} E_{x,y}\left[e^{-yH_{t-1}}\left(1 - \alpha_t y h + \tfrac{1}{2}\alpha_t^2 y^2 h^2\right)\right] = \arg\min_{h} E_{x,y}\left[e^{-yH_{t-1}}\left(1 - \alpha_t y h + \tfrac{1}{2}\alpha_t^2\right)\right]$  (since $y^2 h^2 = 1$)
$= \arg\min_{h} E_x\!\left[E_y\!\left[e^{-yH_{t-1}}\left(1 - \alpha_t y h + \tfrac{1}{2}\alpha_t^2\right) \mid x\right]\right] = \arg\max_{h} E_x\!\left[E_y\!\left[e^{-yH_{t-1}}\,y\,h(x) \mid x\right]\right]$
$= \arg\max_{h} E_x\!\left[h(x)\,e^{-H_{t-1}(x)}\,P(y=1 \mid x) - h(x)\,e^{H_{t-1}(x)}\,P(y=-1 \mid x)\right]$
$= \arg\max_{h} E_{x,\,y \sim e^{-yH_{t-1}(x)}P(y|x)}\left[y\,h(x)\right]$
This is maximized when $h(x)$ agrees with $y$ for every $x$, i.e.
$h_t(x) = \mathrm{sign}\left(P_{x,\,y \sim e^{-yH_{t-1}(x)}P(y|x)}(y=1 \mid x) - P_{x,\,y \sim e^{-yH_{t-1}(x)}P(y|x)}(y=-1 \mid x)\right) = \mathrm{sign}\left(E_{x,\,y \sim e^{-yH_{t-1}(x)}P(y|x)}\left[y \mid x\right]\right)$
Thus, at time $t$ the weak classifier $h_t$ should be trained on samples drawn as $x, y \sim e^{-yH_{t-1}(x)}\,P(y \mid x)$.
At time 1: $x, y \sim P(y \mid x)$ with $P(y_i \mid x_i) = 1$, so $D_1(i) = \frac{1}{m}$ (with $Z_1$ the normalization factor).
At time $t+1$: $x, y \sim e^{-yH_t(x)}\,P(y \mid x) \propto D_t\,e^{-\alpha_t y h_t(x)}$, which gives the distribution update used in the algorithm:
$D_{t+1}(i) = \frac{D_t(i)\exp[-\alpha_t y_i h_t(x_i)]}{Z_t}$, where $Z_t$ is a normalization factor.
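A small worked example of this update (the numbers are my own): suppose $\epsilon_t = 0.2$, so $\alpha_t = \tfrac{1}{2}\ln 4 = \ln 2 \approx 0.69$. Then, before normalization by $Z_t$,

correctly classified samples: $D_{t+1}(i) \propto D_t(i)\,e^{-\alpha_t} \approx 0.5\,D_t(i)$
misclassified samples: $D_{t+1}(i) \propto D_t(i)\,e^{+\alpha_t} \approx 2\,D_t(i)$

so after normalization a misclassified sample carries four times as much relative weight as a correctly classified one, which is exactly how the hard examples get more chance to be learned in the next round.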
Pros and cons of AdaBoost
Advantages:
• Very simple to implement
• Does feature selection, resulting in a relatively simple classifier
• Fairly good generalization
Disadvantages:
• Suboptimal solution
• Sensitive to noisy data and outliers
Intuition
• Train a set of weak hypotheses: h1, ..., hT.
• The combined hypothesis H is a weighted majority vote of the T weak hypotheses. Each hypothesis ht has a weight αt.
• During training, focus on the examples that are misclassified. At round t, example xi has the weight Dt(i).
Basic Setting
• Binary classification problem
• Training data: $(x_1, y_1), \ldots, (x_m, y_m)$, where $x_i \in X$ and $y_i \in Y = \{-1, +1\}$
• $D_t(i)$: the weight of $x_i$ at round $t$; $D_1(i) = 1/m$
• A learner $L$ that finds a weak hypothesis $h_t: X \to Y$ given the training set and $D_t$
• The error of a weak hypothesis $h_t$: $\epsilon_t = \sum_{i:\, h_t(x_i) \neq y_i} D_t(i)$
The basic AdaBoost algorithm
For $t = 1, \ldots, T$:
• Train the weak learner using the training data and $D_t$
• Get $h_t: X \to \{-1, +1\}$ with error $\epsilon_t = \sum_{i:\, h_t(x_i) \neq y_i} D_t(i)$
• Choose $\alpha_t = \frac{1}{2}\ln\frac{1-\epsilon_t}{\epsilon_t}$
• Update
$D_{t+1}(i) = \frac{D_t(i)}{Z_t} \times \begin{cases} e^{-\alpha_t} & \text{if } h_t(x_i) = y_i \\ e^{\alpha_t} & \text{if } h_t(x_i) \neq y_i \end{cases} = \frac{D_t(i)\exp[-\alpha_t\, y_i\, h_t(x_i)]}{Z_t}$
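For completeness, here is how the adaboost_train / adaboost_predict sketch given earlier in this section (hypothetical helpers, not part of the course code) could be exercised on a toy problem; the data generation below is purely illustrative:

```python
import numpy as np

# Toy 2-D problem with labels in {-1, +1}: +1 inside a diagonal band, -1 outside.
rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(500, 2))
y = np.where(np.abs(X[:, 0] + X[:, 1]) < 0.5, 1, -1)

stumps, alphas = adaboost_train(X, y, T=100)
accuracy = np.mean(adaboost_predict(stumps, alphas, X) == y)
print(f"{len(stumps)} stumps, training accuracy = {accuracy:.3f}")
```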
The general AdaBoost algorithm