Generalized Optimal Kernel-based Ensemble Learning for Hyperspectral Classification Problems
Prudhvi Gurram, Heesung Kwon
Image Processing Branch, U.S. Army Research Laboratory
Outline
Current Issues
Sparse Kernel-Based Ensemble Learning (SKEL)
Generalized Kernel-Based Ensemble Learning (GKEL)
Simulation Results
Conclusions
Sample Hyperspectral Data (Visible + Near IR, 210 bands)
[Image panels: Grass; Military vehicle]
High dimensionality of hyperspectral data vs. a small set of training samples (small targets): the curse of dimensionality
The decision function of a classifier becomes overfitted to the small number of training samples
Idea is to find the underlying discriminant structure, NOT the noisy nature of the data
Goal is to regularize the learning to make the decision surface robust to noisy samples and outliers
Approach: use ensemble learning
Current Issues
Kernel-based Ensemble Learning (Suboptimal Technique)

[Diagram: training data → random subsets of spectral bands, each with the same number of bands ($d_1 = d_2 = \cdots = d_N$) → SVM 1, SVM 2, SVM 3, ..., SVM N, each producing a decision surface $f_1, f_2, f_3, \ldots, f_N$ → majority voting → ensemble decision]

Sub-classifiers used: Support Vector Machine (SVM)
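A minimal sketch of this suboptimal ensemble, assuming scikit-learn; X (samples x bands), y (labels in {-1, +1}), and the subset sizes are placeholders:

```python
# Random-subspace SVM ensemble with majority voting: N SVMs, each trained on
# a random subset of spectral bands of the same size d_1 = ... = d_N.
import numpy as np
from sklearn.svm import SVC

def train_random_subspace_ensemble(X, y, n_svms=10, subset_size=20, seed=0):
    rng = np.random.default_rng(seed)
    ensemble = []
    for _ in range(n_svms):
        bands = rng.choice(X.shape[1], size=subset_size, replace=False)
        clf = SVC(kernel="rbf", gamma="scale").fit(X[:, bands], y)
        ensemble.append((bands, clf))
    return ensemble

def predict_majority_vote(ensemble, X):
    votes = np.stack([clf.predict(X[:, bands]) for bands, clf in ensemble])
    return np.sign(votes.sum(axis=0))  # ensemble decision by majority vote
```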
Idea: not all the subsets are useful for the given task, so select a small number of subsets that are useful for the task
[Diagram: training data → random subsets of features (random bands) → SVM 1, SVM 2, ..., SVM N with decision functions $f_1, f_2, f_3, \ldots, f_N$ → combined kernel matrix]

Gaussian kernel for each SVM:
$$k(\mathbf{x}, \mathbf{x}') = \exp\left(-\frac{\|\mathbf{x} - \mathbf{x}'\|^2}{2\sigma^2}\right)$$

MKL drives the weights sparse, e.g. $d_1 = 0$, $d_2 = 0.2$, $d_3 = 0$, ..., $d_N = 0.1$, under the L1-norm constraint
$$\sum_m d_m = 1, \qquad d_m \ge 0$$
Sparse Kernel-based Ensemble Learning (SKEL)
To find useful subsets, SKEL was developed, built on the idea of multiple kernel learning (MKL)
SKEL jointly optimizes the SVM-based sub-classifiers in conjunction with the weights
In the joint optimization, an L1 constraint is imposed on the weights to make them sparse
Result: optimal subsets useful for the given task
Optimization Problem (Multiple Kernel Learning, Rakotomamonjy et al.):

$$\min_{\{f_m\},\, b,\, \{d_m\}} \;\; \frac{1}{2} \sum_m \frac{1}{d_m}\, \|f_m\|_{H_m}^2$$
$$\text{s.t. } y_i\Big(\sum_m f_m(\mathbf{x}_i) + b\Big) \ge 1 \quad \forall i$$
$$\sum_m d_m = 1, \qquad d_m \ge 0 \quad \forall m$$

$f_m$: kernel-based decision function
$d_m$: weighting coefficient
The L1 norm on the weights induces sparsity (a sketch of this joint optimization follows below)
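A rough sketch of this joint optimization in the spirit of SimpleMKL (assuming scikit-learn; Ks is a list of precomputed Gaussian kernel matrices, one per random band subset, and the fixed-step projected gradient stands in for SimpleMKL's reduced-gradient line search):

```python
# Alternate an SVM solve on the combined kernel K = sum_m d_m K_m with a
# projected-gradient step on the weights d over the simplex
# {d : sum_m d_m = 1, d_m >= 0}; the L1/simplex constraint drives d sparse.
import numpy as np
from sklearn.svm import SVC

def project_to_simplex(v):
    """Euclidean projection onto {d : sum(d) = 1, d >= 0} (sort-based)."""
    u = np.sort(v)[::-1]
    css = np.cumsum(u)
    rho = np.nonzero(u * np.arange(1, len(v) + 1) > (css - 1.0))[0][-1]
    theta = (css[rho] - 1.0) / (rho + 1.0)
    return np.maximum(v - theta, 0.0)

def skel_like_mkl(Ks, y, C=1.0, lr=0.05, n_iters=50):
    d = np.full(len(Ks), 1.0 / len(Ks))
    for _ in range(n_iters):
        K = sum(dm * Km for dm, Km in zip(d, Ks))
        clf = SVC(kernel="precomputed", C=C).fit(K, y)
        v = np.zeros(len(y))               # v = Y alpha (signed dual coefs)
        v[clf.support_] = clf.dual_coef_[0]
        grad = np.array([-0.5 * v @ Km @ v for Km in Ks])  # dJ/dd_m at alpha*
        d = project_to_simplex(d - lr * grad)
    return d                               # sparse kernel weights
```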
SKEL is a useful classifier with improved performance; however, it has some constraints:
SKEL has to use a large number of initial SVMs to maximize the ensemble performance, which can cause a memory error due to the limited memory size
The number of features selected must be the same for all the SVMs, which also causes sub-optimality in choosing feature subspaces
Generalized Kernel-based Ensemble Learning (GKEL)

GKEL relaxes the constraints of SKEL
GKEL uses a bottom-up approach: starting from a single classifier, sub-classifiers are added one by one until the ensemble converges, while a subset of features is optimized for each sub-classifier
Sparse SVM Problem

GKEL is built on the sparse SVM problem* that finds optimal sparse features maximizing the margin of the hyperplane
Goal is to find an optimal $d$ resulting in an optimal $\tilde{w}$ that maximizes the margin of the hyperplane
Elementwise product: $\tilde{w} = w \odot d$

Primal optimization problem:
$$\min_{d \in \mathcal{D}} \; \min_{\tilde{w},\, \xi} \;\; \frac{1}{2}\|\tilde{w}\|^2 + C \sum_i \xi_i$$
$$\text{subject to } y_i(\tilde{w}^T \tilde{\mathbf{x}}_i + b) \ge 1 - \xi_i \;\; \text{for all } i$$
$$\text{where } \tilde{\mathbf{x}}_i = \mathbf{x}_i \odot d, \qquad \mathcal{D} = \{d \mid d_j \in \{0, 1\},\; j = 1, \ldots, m\}$$
i.e., $d$ is a binary vector, e.g. $d = (1, 0, 0, 1, \ldots, 0)$
* Tan et al., "Learning Sparse SVM for Feature Selection on Very High Dimensional Datasets," ICML 2010
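As a concrete instance of the masking (a worked example, not from the slides): with $m = 5$ features and $d = (1, 0, 0, 1, 0)$,
$$\tilde{\mathbf{x}} = \mathbf{x} \odot d = (x_1, 0, 0, x_4, 0), \qquad \tilde{w} = w \odot d = (w_1, 0, 0, w_4, 0)$$
so the decision function $\tilde{w}^T \mathbf{x} + b = w_1 x_1 + w_4 x_4 + b$ depends only on features 1 and 4: choosing $d$ selects the features.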
Dual Problem of Sparse SVM

Using Lagrange multipliers and the KKT conditions, the primal problem can be converted to the dual problem:

$$\min_{d \in \mathcal{D}} \; \max_{\alpha \in \mathbb{R}^n} \;\; e^T\alpha - \frac{1}{2}\alpha^T Y K(d) Y \alpha$$
$$\text{subject to } \alpha^T y = 0, \qquad 0 \le \alpha \le C$$

where $\alpha$: a vector of Lagrange multipliers
$e$: a vector of all ones
$Y = \mathrm{diag}(y)$, $y_i$: class labels
$K(d)$: kernel matrix based on the sparse feature vectors $\tilde{\mathbf{x}}_i = \mathbf{x}_i \odot d$

This mixed integer programming problem is NP-hard
Since there are a large number of different combinations of sparse features, the number of possible kernel matrices $K(d)$ is huge: a combinatorial problem!
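To make the combinatorics concrete (a worked example, not from the slides): for the Gaussian kernel and a binary mask $d$ (so $d_p^2 = d_p$),
$$K(d)_{ij} = \exp\left(-\frac{\|\mathbf{x}_i \odot d - \mathbf{x}_j \odot d\|^2}{2\sigma^2}\right) = \exp\left(-\frac{\sum_p d_p (x_{ip} - x_{jp})^2}{2\sigma^2}\right)$$
and with $m = 210$ bands there are $|\mathcal{D}| = 2^{210} \approx 10^{63}$ possible masks, hence up to $2^{210}$ distinct kernel matrices.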
Relaxation into QCLP

To make the mixed integer problem tractable, exchange the max and min and relax the problem into Quadratically Constrained Linear Programming (QCLP): the objective function is converted into inequality constraints lower bounded by a real value $t$

$$\max_{\alpha \in \mathbb{R}^n,\, t \in \mathbb{R}} \; t$$
$$\text{subject to } \alpha^T y = 0, \quad 0 \le \alpha \le C, \quad t \le S(\alpha, d_l) \;\; \forall\, d_l \in \mathcal{D}$$
$$\text{where } S(\alpha, d_l) = e^T\alpha - \frac{1}{2}\alpha^T Y K(d_l) Y \alpha$$

Since the number of possible $K(d)$ is huge, so is the number of constraints $t \le S(\alpha, d)$, and the QCLP problem is still hard to solve
But among the many constraints, most are not actively used in solving the optimization problem
Goal is to find the small number of constraints that are actively used
Illustrative Example
(Yisong Yue, "Diversified Retrieval as Structured Prediction," ICML 2008)

Suppose an optimization problem has a large number of inequality constraints (e.g., an SVM): among the many constraints, most are not used to determine the feasible region and an optimal solution; only a small number of active constraints define the feasible region
[Diagram: constraint boundary separating $aX + b \ge 0$ from $aX + b < 0$]
Use a technique called the restricted master problem that finds the active constraints by identifying the most violated constraints one by one iteratively:
Find the first most violated constraint
Based on previously found constraints, find the next most violated constraint
Continue the iterative search until no violated constraints are found
(A small demo of the active-constraint idea follows below)
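A tiny numerical demo of the active-constraint point (assuming SciPy; the 2-D LP and all numbers are purely illustrative):

```python
# An LP with many random inequality constraints: at the optimum of a 2-D LP,
# only a handful of the 500 constraints are tight (active).
import numpy as np
from scipy.optimize import linprog

rng = np.random.default_rng(0)
n_constraints = 500
A = rng.normal(size=(n_constraints, 2))           # constraint normals a
b = np.abs(rng.normal(size=n_constraints)) + 1.0  # offsets; x = 0 is feasible
c = np.array([-1.0, -1.0])                        # maximize x1 + x2

res = linprog(c, A_ub=A, b_ub=b, bounds=[(None, None)] * 2)
slack = b - A @ res.x
print(f"{n_constraints} constraints, {int(np.sum(slack < 1e-6))} active")
```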
Flow Chart
0 0
0
Initialize ( , ),
1,
t
IN
α
α = =∅
$1 1
ˆFind given ( , )
I=I
i i i
i
d t
d
α− −
∪
$
Update ( , )
given I = I
i i
i
t
d
α
∪
,max
subject to 0,0
ˆ ˆ ( , ),
1( , ) ( )
2
nR t R
T
l l
T T
t
Y C
t S d d
S d e YK d Y
α
α αα
α α α α
∈ ∈
= ≤ ≤
≤ ∀ ∈
= −
I
$
$
: Restricted set of spase features
: subset of features that maximally violates
ˆ ( , ), i.e., min ( , )
1Find ma x ) ( ) (
2
d
T
d
I
d a
t S d S d
M d YK d Y
α α
α α
≤
• =
$1 1( , )ii iS d tα − −≥Yes
No
Terminate
Flow chart of the QCLP problem based on the restricted master problem
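A simplified sketch of this loop for the linear-kernel case, where the most violated subset can be found by per-feature scoring (next slide); scikit-learn is assumed, `budget` is an illustrative subset size, and the exact restricted QCLP/MKL solve is replaced by an SVM on the uniform average of the selected kernels, a deliberate simplification:

```python
import numpy as np
from sklearn.svm import SVC

def most_violated_mask(X, v, budget):
    """For a linear kernel, M(d) = 0.5 * sum_p d_p * (X^T v)_p^2 is separable
    in the features, so the maximizing mask keeps the top-scoring ones."""
    scores = (X.T @ v) ** 2
    d = np.zeros(X.shape[1], dtype=bool)
    d[np.argsort(scores)[-budget:]] = True
    return d

def restricted_master_loop(X, y, budget=10, C=1.0, max_iters=20):
    v = y / len(y)                 # crude initial signed duals (v = Y alpha)
    masks = []
    for _ in range(max_iters):
        d_hat = most_violated_mask(X, v, budget)
        if any((d_hat == m).all() for m in masks):
            break                  # no new violated constraint: terminate
        masks.append(d_hat)        # I = I U {d_hat}
        K = sum(X[:, m] @ X[:, m].T for m in masks) / len(masks)
        clf = SVC(kernel="precomputed", C=C).fit(K, y)
        v = np.zeros(len(y))
        v[clf.support_] = clf.dual_coef_[0]
    return masks
```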
Most Violated Features

$$\min_d S(\alpha, d) \;\Rightarrow\; \max_d M(d) = \frac{1}{2}\alpha^T Y K(d) Y \alpha, \quad \text{subject to } t \le S(\alpha, \hat{d})$$

Linear kernel:
- Calculate $M(d_i)$ for each feature $i$ separately and select the features with the top values
- Does not work for non-linear kernels

Non-linear kernel (e.g., Gaussian RBF), which exploits non-linear correlations among all the features, so individual feature ranking no longer works:
- Calculate $M(d_i)$ for all $i$, where $d_i$ contains all the features except the $i$-th
- Eliminate the least contributing feature
- Repeat the elimination until a threshold condition is met (e.g., stop the iteration if the change in $M(d)$ exceeds 30%)
- This yields variable-length feature subsets for different SVMs (see the sketch below)

[Diagram: feature list $f_1\, f_2\, f_3 \cdots f_n$ shrinking as features are eliminated]
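A sketch of the non-linear (Gaussian RBF) elimination above, assuming v = Y alpha comes from the current SVM solve; sigma and the 30% threshold follow the slide, everything else is illustrative:

```python
import numpy as np

def rbf_kernel(X, sigma):
    sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    return np.exp(-sq / (2.0 * sigma ** 2))

def M(X, v, feats, sigma):
    """M(d) = 0.5 * v^T K(d) v, with d restricted to the features in `feats`."""
    return 0.5 * v @ rbf_kernel(X[:, feats], sigma) @ v

def eliminate_features(X, v, sigma, drop_frac=0.3):
    feats = list(range(X.shape[1]))
    M_full = M(X, v, feats, sigma)
    while len(feats) > 1:
        # score each remaining feature by M with that one feature removed
        trials = [M(X, v, [f for f in feats if f != i], sigma) for i in feats]
        least = int(np.argmax(trials))   # removing this feature hurts M least
        if trials[least] < (1.0 - drop_frac) * M_full:
            break                        # change in M would exceed 30%: stop
        feats.pop(least)
    return feats                         # variable-length feature subset
```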
How GKEL Works

A bottom-up approach is used: starting from a single SVM, the restricted feature set grows one subset at a time,
$$\hat{I} = \{\hat{d}_0\} \;\rightarrow\; \{\hat{d}_0, \hat{d}_1\} \;\rightarrow\; \{\hat{d}_0, \hat{d}_1, \hat{d}_2\} \;\rightarrow\; \cdots \;\rightarrow\; \{\hat{d}_0, \hat{d}_1, \ldots, \hat{d}_{N-1}\}$$

[Diagram: SVM 1 $(\alpha_i^1, w_1)$, SVM 2 $(\alpha_i^2, w_2)$, SVM 3 $(\alpha_i^3, w_3)$, ..., SVM N $(\alpha_i^N, w_N)$, combined by a weighted sum $\sum$]

$\hat{I} = \{\hat{d}_0, \hat{d}_1, \ldots, \hat{d}_N\}$: selected feature subsets (variable lengths), with $\hat{d}_i \ne \hat{d}_j$
$W = \{w_1, w_2, \ldots, w_N\}$: weights
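A minimal sketch of the resulting weighted ensemble decision (the triple structure, names, and scikit-learn-style sub-classifiers are illustrative assumptions):

```python
import numpy as np

def gkel_decision(X, ensemble):
    """ensemble: list of (feature_idx, fitted_svm, weight) triples, one per
    sub-classifier, each with its own variable-length feature subset d_i."""
    scores = sum(w * svm.decision_function(X[:, idx])
                 for idx, svm, w in ensemble)
    return np.sign(scores)   # weighted-sum ensemble decision
```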
Images for Performance Evaluation
Forest Radiance I
Desert Radiance II
Hyperspectral Images (HYDICE) (210 bands, 0.4 – 2.5 microns)
[Markers in the images: training samples]
Performance Comparison (FR I)
[Detection maps: Single SVM (Gaussian kernel); SKEL, 10 to 2 SVMs (Gaussian kernel); GKEL, 3 SVMs (Gaussian kernel)]
ROC Curves (FR I)
Since each SKEL run uses different random subsets of spectral bands, 10 SKEL runs were used to generate 10 ROC curves
Performance Comparison (DR II)
[Detection maps: Single SVM (Gaussian kernel); GKEL, 3 SVMs (Gaussian kernel); SKEL, 10 to 2 SVMs (Gaussian kernel)]
Performance Comparison (DR II)
10 ROC curves from 10 SKEL runs, each run with different random subsets of spectral bands
Spambase Data
Performance Comparison
SKEL: 25 initial SVMs, 12 after optimization
GKEL: 14 SVMs with nonzero weights
Data downloaded from the UCI Machine Learning Repository (Spambase dataset), used to predict whether an email is spam or not
Conclusions
SKEL and a generalized version of SKEL have been introduced
SKEL starts from a large number of initial SVMs, which are then optimized down to a small number of SVMs useful for the given task
GKEL starts from a single SVM, and individual classifiers are added one by one optimally to the ensemble until the ensemble converges
GKEL and SKEL generally perform better than a regular SVM
GKEL performs as well as SKEL while using fewer resources (memory) than SKEL
Q&A
Optimally Tuning Kernel Parameters

Gaussian kernel (sphere kernel), single bandwidth $\sigma$ (e.g., $\sigma_1, \sigma_2, \sigma_3$):
$$k(\mathbf{x}, \mathbf{x}') = \exp\left(-\frac{\|\mathbf{x} - \mathbf{x}'\|^2}{2\sigma^2}\right)$$

Full-band diagonal Gaussian kernel:
$$k(\mathbf{x}, \mathbf{x}') = \exp\left(-(\mathbf{x} - \mathbf{x}')^T \Sigma^{-1} (\mathbf{x} - \mathbf{x}')\right), \qquad \Sigma = \mathrm{diag}(\sigma_1, \sigma_2, \ldots, \sigma_L)$$

Prior to the L1 optimization, the kernel parameters of each SVM are optimally tuned; a Gaussian kernel with a single bandwidth treats all the bands equally, which is suboptimal
Estimate an upper bound on the leave-one-out (LOO) error, the radius-margin (RM) bound:
$$L \le f_{RM} = \frac{1}{l}\,\frac{R^2}{\gamma^2}$$
$R$: the radius of the minimum enclosing hypersphere
$\gamma$: the margin of the hyperplane
Goal is to minimize the RM bound by gradient descent using $\partial f_{RM} / \partial \sigma_p$, the gradient of $f_{RM}$
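The gradient used in this descent is not spelled out on the slide; a standard form, following Chapelle et al., "Choosing Multiple Parameters for Support Vector Machines" (2002) and stated here as an assumption, is
$$\frac{1}{\gamma^2} = \|w\|^2, \qquad \frac{\partial f_{RM}}{\partial \sigma_p} = \frac{1}{l}\left(\|w\|^2\,\frac{\partial R^2}{\partial \sigma_p} + R^2\,\frac{\partial \|w\|^2}{\partial \sigma_p}\right)$$
$$\frac{\partial \|w\|^2}{\partial \sigma_p} = -\sum_{i,j} \alpha_i^* \alpha_j^* y_i y_j\, \frac{\partial k(\mathbf{x}_i, \mathbf{x}_j)}{\partial \sigma_p}, \qquad \frac{\partial R^2}{\partial \sigma_p} = \sum_i \beta_i^*\, \frac{\partial k(\mathbf{x}_i, \mathbf{x}_i)}{\partial \sigma_p} - \sum_{i,j} \beta_i^* \beta_j^*\, \frac{\partial k(\mathbf{x}_i, \mathbf{x}_j)}{\partial \sigma_p}$$
where $\alpha^*$ solves the SVM dual and $\beta^*$ the minimum-enclosing-hypersphere dual.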
Ensemble Learning

[Diagram: sub-classifier 1, sub-classifier 2, ..., sub-classifier N, combined by $\sum$ into the ensemble decision in $\{-1, +1\}$: a regularized decision function robust to noise and outliers]

Assumptions: the performance of each classifier is better than a random guess, and the classifiers are independent of each other; increasing the number of classifiers then improves performance
SKEL: Comparison (Top-Down Approach)

[Diagram: training data → random subsets of features (random bands) → SVM 1, SVM 2, ..., SVM N with decision functions $f_1, f_2, f_3, \ldots, f_N$ → combination of decision results]

Gaussian kernel for each SVM:
$$k(\mathbf{x}, \mathbf{x}') = \exp\left(-\frac{\|\mathbf{x} - \mathbf{x}'\|^2}{2\sigma^2}\right)$$

MKL drives the weights sparse, e.g. $d_1 = 0$, $d_2 = 0.2$, $d_3 = 0$, ..., $d_N = 0.1$, under the L1-norm constraint
$$\sum_m d_m = 1, \qquad d_m \ge 0$$
Iterative Approach to Solve QCLP

Due to the very large number of quadratic constraints, the full QCLP problem is hard to solve
So, take an iterative approach: iteratively update $(t, \alpha)$ based on a limited number of active constraints
Each Iteration of QCLP

$$\max_{\alpha \in \mathbb{R}^n,\, t \in \mathbb{R}} \; t \quad \text{subject to } \alpha^T y = 0, \;\; 0 \le \alpha \le C, \;\; t \le S(\alpha, \hat{d}_l) \;\; \forall\, \hat{d}_l \in \hat{I}$$
$$\text{where } S(\alpha, d) = e^T\alpha - \frac{1}{2}\alpha^T Y K(d) Y \alpha$$

Lagrangian:
$$L(t, \alpha, u) = t + \sum_l u_l\big(S(\alpha, \hat{d}_l) - t\big), \qquad u_l \ge 0$$
From the KKT condition:
$$\frac{\partial L}{\partial t} = 0 \;\rightarrow\; \sum_l u_l = 1$$

The intermediate solution pair $(t, \alpha)$ is therefore obtained
Iterative QCLP vs. MKL

Substituting the KKT condition $\sum_l u_l = 1$ back into the Lagrangian gives

$$\max_\alpha \min_u \; \sum_l u_l\, S(\alpha, \hat{d}_l) = \max_\alpha \min_u \;\; e^T\alpha - \frac{1}{2}\alpha^T Y \Big(\sum_l u_l K(\hat{d}_l)\Big) Y \alpha$$
$$\text{subject to } \sum_l u_l = 1, \qquad u_l \ge 0$$

Each iteration of the QCLP is therefore equivalent to an MKL problem over the restricted set of kernels $K(\hat{d}_l)$
Variable Length Features

$$M(d) = \frac{1}{2}\alpha^T Y K(d) Y \alpha = \frac{1}{2}\|w\|^2, \qquad \text{starting from } d = (1, 1, 1, \ldots, 1)$$

Apply a threshold to obtain variable-length features: stop the elimination iterations when the portion of the 2-norm of $w$ (i.e., of $M(d)$) lost to the least contributing features exceeds a predefined threshold (e.g., 30%)
This leads to variable-length feature subsets for different SVMs
GKEL Preliminary Performance
Chemical Plume Data
SKEL: 50 initial SVMs, 8 after optimization
GKEL: SVMs with nonzero weights: 7 (22)
Relaxation into QCLP

$$\max_{\alpha \in \mathbb{R}^n} \; \min_{d \in \mathcal{D}} \; S(\alpha, d), \qquad \text{subject to } \alpha^T y = 0, \;\; 0 \le \alpha \le C$$
$$\text{where } S(\alpha, d) = e^T\alpha - \frac{1}{2}\alpha^T Y K(d) Y \alpha$$

At the optimal pair $(\alpha^*, d^*)$:
1. Fix $d^*$ and optimize $\alpha$: then $S(\alpha^*, d^*) \ge S(\alpha, d^*)$
2. Increase $t$ up to $t = S(\alpha^*, d^*)$
3. For a fixed $d$, increase $t$ to find the $\alpha$ maximizing $S(\alpha, d)$
QCLP:
$$\max_{\alpha \in \mathbb{R}^n,\, t \in \mathbb{R}} \; t$$
$$\text{subject to } \alpha^T y = 0, \quad 0 \le \alpha \le C, \quad t \le S(\alpha, d_l) \;\; \forall\, d_l \in \mathcal{D}$$
$$\text{where } S(\alpha, d) = e^T\alpha - \frac{1}{2}\alpha^T Y K(d) Y \alpha$$

$\mathcal{D}$ is prohibitively large, so the full QCLP is nearly impossible to solve directly
L1 and Sparsity

[Figure: L2 optimization vs. L1 optimization under linear inequality constraints; the corners of the L1 ball make the solution land on sparse points]
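A tiny illustration of why L1 induces sparsity while L2 only shrinks, using a regression analogy (assuming scikit-learn; the synthetic setup is purely illustrative):

```python
# Same data, two penalties: Lasso (L1) zeroes most coefficients, Ridge (L2)
# keeps all 20 nonzero.
import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 20))
w_true = np.zeros(20)
w_true[:3] = [2.0, -1.5, 1.0]                 # only 3 useful features
y = X @ w_true + 0.1 * rng.normal(size=100)

print("Lasso nonzeros:", np.sum(Lasso(alpha=0.1).fit(X, y).coef_ != 0))
print("Ridge nonzeros:", np.sum(Ridge(alpha=0.1).fit(X, y).coef_ != 0))
```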