Generalized Optimal Kernel-based Ensemble Learning for Hyperspectral Classification Problems
Prudhvi Gurram, Heesung Kwon
Image Processing Branch, U.S. Army Research Laboratory
Outline
Current Issues
Sparse Kernel-Based Ensemble Learning (SKEL)
Generalized Kernel-Based Ensemble Learning (GKEL)
Simulation Results
Conclusions
Sample Hyperspectral Data (Visible + Near IR, 210 bands)
[Image panels: Grass; Military vehicle]
High dimensionality of hyperspectral data vs. a small set of training samples (small targets): the curse of dimensionality
The decision function of a classifier becomes overfitted to the small number of training samples
Idea is to find the underlying discriminant structure, NOT the noisy nature of the data
Goal is to regularize the learning to make the decision surface robust to noisy samples and outliers
Approach: use ensemble learning
Current Issues
Kernel-based Ensemble Learning (Suboptimal Technique)

[Diagram: training data → random subsets of spectral bands, each with the same number of bands ($d_1 = d_2 = \cdots = d_N$) → SVM 1, SVM 2, SVM 3, ..., SVM N, each producing a decision surface $f_1, f_2, f_3, \ldots, f_N$ → majority voting → ensemble decision]

Sub-classifiers used: Support Vector Machine (SVM)
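A minimal sketch of this suboptimal ensemble, assuming scikit-learn; X (samples x bands), y (labels in {-1, +1}), and the subset sizes are placeholders:

```python
# Random-subspace SVM ensemble with majority voting: N SVMs, each trained on
# a random subset of spectral bands of the same size d_1 = ... = d_N.
import numpy as np
from sklearn.svm import SVC

def train_random_subspace_ensemble(X, y, n_svms=10, subset_size=20, seed=0):
    rng = np.random.default_rng(seed)
    ensemble = []
    for _ in range(n_svms):
        bands = rng.choice(X.shape[1], size=subset_size, replace=False)
        clf = SVC(kernel="rbf", gamma="scale").fit(X[:, bands], y)
        ensemble.append((bands, clf))
    return ensemble

def predict_majority_vote(ensemble, X):
    votes = np.stack([clf.predict(X[:, bands]) for bands, clf in ensemble])
    return np.sign(votes.sum(axis=0))  # ensemble decision by majority vote
```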
Idea: not all the subsets are useful for the given task, so select a small number of subsets that are useful for the task
[Diagram: training data → random subsets of features (random bands) → SVM 1, SVM 2, ..., SVM N with decision functions $f_1, f_2, f_3, \ldots, f_N$ → combined kernel matrix]

Gaussian kernel for each SVM:
$$k(\mathbf{x}, \mathbf{x}') = \exp\left(-\frac{\|\mathbf{x} - \mathbf{x}'\|^2}{2\sigma^2}\right)$$

MKL drives the weights sparse, e.g. $d_1 = 0$, $d_2 = 0.2$, $d_3 = 0$, ..., $d_N = 0.1$, under the L1-norm constraint
$$\sum_m d_m = 1, \qquad d_m \ge 0$$
Sparse Kernel-based Ensemble Learning (SKEL)
To find useful subsets, SKEL was developed, built on the idea of multiple kernel learning (MKL)
SKEL jointly optimizes the SVM-based sub-classifiers in conjunction with the weights
In the joint optimization, an L1 constraint is imposed on the weights to make them sparse
Result: optimal subsets useful for the given task
Optimization Problem (Multiple Kernel Learning, Rakotomamonjy et al.):

$$\min_{\{f_m\},\, b,\, \{d_m\}} \;\; \frac{1}{2} \sum_m \frac{1}{d_m}\, \|f_m\|_{H_m}^2$$
$$\text{s.t. } y_i\Big(\sum_m f_m(\mathbf{x}_i) + b\Big) \ge 1 \quad \forall i$$
$$\sum_m d_m = 1, \qquad d_m \ge 0 \quad \forall m$$

$f_m$: kernel-based decision function
$d_m$: weighting coefficient
The L1 norm on the weights induces sparsity (a sketch of this joint optimization follows below)
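A rough sketch of this joint optimization in the spirit of SimpleMKL (assuming scikit-learn; Ks is a list of precomputed Gaussian kernel matrices, one per random band subset, and the fixed-step projected gradient stands in for SimpleMKL's reduced-gradient line search):

```python
# Alternate an SVM solve on the combined kernel K = sum_m d_m K_m with a
# projected-gradient step on the weights d over the simplex
# {d : sum_m d_m = 1, d_m >= 0}; the L1/simplex constraint drives d sparse.
import numpy as np
from sklearn.svm import SVC

def project_to_simplex(v):
    """Euclidean projection onto {d : sum(d) = 1, d >= 0} (sort-based)."""
    u = np.sort(v)[::-1]
    css = np.cumsum(u)
    rho = np.nonzero(u * np.arange(1, len(v) + 1) > (css - 1.0))[0][-1]
    theta = (css[rho] - 1.0) / (rho + 1.0)
    return np.maximum(v - theta, 0.0)

def skel_like_mkl(Ks, y, C=1.0, lr=0.05, n_iters=50):
    d = np.full(len(Ks), 1.0 / len(Ks))
    for _ in range(n_iters):
        K = sum(dm * Km for dm, Km in zip(d, Ks))
        clf = SVC(kernel="precomputed", C=C).fit(K, y)
        v = np.zeros(len(y))               # v = Y alpha (signed dual coefs)
        v[clf.support_] = clf.dual_coef_[0]
        grad = np.array([-0.5 * v @ Km @ v for Km in Ks])  # dJ/dd_m at alpha*
        d = project_to_simplex(d - lr * grad)
    return d                               # sparse kernel weights
```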
SKEL is a useful classifier with improved performance; however, it has some constraints:
SKEL has to use a large number of initial SVMs to maximize the ensemble performance, which can cause a memory error due to the limited memory size
The number of features selected must be the same for all the SVMs, which also causes sub-optimality in choosing feature subspaces
Generalized Kernel-based Ensemble Learning (GKEL)

GKEL relaxes the constraints of SKEL
GKEL uses a bottom-up approach: starting from a single classifier, sub-classifiers are added one by one until the ensemble converges, while a subset of features is optimized for each sub-classifier
Sparse SVM Problem

GKEL is built on the sparse SVM problem* that finds optimal sparse features maximizing the margin of the hyperplane
Goal is to find an optimal $d$ resulting in an optimal $\tilde{w}$ that maximizes the margin of the hyperplane
Elementwise product: $\tilde{w} = w \odot d$

Primal optimization problem:
$$\min_{d \in \mathcal{D}} \; \min_{\tilde{w},\, \xi} \;\; \frac{1}{2}\|\tilde{w}\|^2 + C \sum_i \xi_i$$
$$\text{subject to } y_i(\tilde{w}^T \tilde{\mathbf{x}}_i + b) \ge 1 - \xi_i \;\; \text{for all } i$$
$$\text{where } \tilde{\mathbf{x}}_i = \mathbf{x}_i \odot d, \qquad \mathcal{D} = \{d \mid d_j \in \{0, 1\},\; j = 1, \ldots, m\}$$
i.e., $d$ is a binary vector, e.g. $d = (1, 0, 0, 1, \ldots, 0)$
* Tan et al., "Learning Sparse SVM for Feature Selection on Very High Dimensional Datasets," ICML 2010
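As a concrete instance of the masking (a worked example, not from the slides): with $m = 5$ features and $d = (1, 0, 0, 1, 0)$,
$$\tilde{\mathbf{x}} = \mathbf{x} \odot d = (x_1, 0, 0, x_4, 0), \qquad \tilde{w} = w \odot d = (w_1, 0, 0, w_4, 0)$$
so the decision function $\tilde{w}^T \mathbf{x} + b = w_1 x_1 + w_4 x_4 + b$ depends only on features 1 and 4: choosing $d$ selects the features.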
Dual Problem of Sparse SVM

Using Lagrange multipliers and the KKT conditions, the primal problem can be converted to the dual problem:

$$\min_{d \in \mathcal{D}} \; \max_{\alpha \in \mathbb{R}^n} \;\; e^T\alpha - \frac{1}{2}\alpha^T Y K(d) Y \alpha$$
$$\text{subject to } \alpha^T y = 0, \qquad 0 \le \alpha \le C$$

where $\alpha$: a vector of Lagrange multipliers
$e$: a vector of all ones
$Y = \mathrm{diag}(y)$, $y_i$: class labels
$K(d)$: kernel matrix based on the sparse feature vectors $\tilde{\mathbf{x}}_i = \mathbf{x}_i \odot d$

This mixed integer programming problem is NP-hard
Since there are a large number of different combinations of sparse features, the number of possible kernel matrices $K(d)$ is huge: a combinatorial problem!
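To make the combinatorics concrete (a worked example, not from the slides): for the Gaussian kernel and a binary mask $d$ (so $d_p^2 = d_p$),
$$K(d)_{ij} = \exp\left(-\frac{\|\mathbf{x}_i \odot d - \mathbf{x}_j \odot d\|^2}{2\sigma^2}\right) = \exp\left(-\frac{\sum_p d_p (x_{ip} - x_{jp})^2}{2\sigma^2}\right)$$
and with $m = 210$ bands there are $|\mathcal{D}| = 2^{210} \approx 10^{63}$ possible masks, hence up to $2^{210}$ distinct kernel matrices.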
Relaxation into QCLP

To make the mixed integer problem tractable, exchange the max and min and relax the problem into Quadratically Constrained Linear Programming (QCLP): the objective function is converted into inequality constraints lower bounded by a real value $t$

$$\max_{\alpha \in \mathbb{R}^n,\, t \in \mathbb{R}} \; t$$
$$\text{subject to } \alpha^T y = 0, \quad 0 \le \alpha \le C, \quad t \le S(\alpha, d_l) \;\; \forall\, d_l \in \mathcal{D}$$
$$\text{where } S(\alpha, d_l) = e^T\alpha - \frac{1}{2}\alpha^T Y K(d_l) Y \alpha$$

Since the number of possible $K(d)$ is huge, so is the number of constraints $t \le S(\alpha, d)$, and the QCLP problem is still hard to solve
But among the many constraints, most are not actively used in solving the optimization problem
Goal is to find the small number of constraints that are actively used
Illustrative Example
(Yisong Yue, "Diversified Retrieval as Structured Prediction," ICML 2008)

Suppose an optimization problem has a large number of inequality constraints (e.g., an SVM): among the many constraints, most are not used to determine the feasible region and an optimal solution; only a small number of active constraints define the feasible region
[Diagram: constraint boundary separating $aX + b \ge 0$ from $aX + b < 0$]
Use a technique called the restricted master problem that finds the active constraints by identifying the most violated constraints one by one iteratively:
Find the first most violated constraint
Based on previously found constraints, find the next most violated constraint
Continue the iterative search until no violated constraints are found
(A small demo of the active-constraint idea follows below)
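A tiny numerical demo of the active-constraint point (assuming SciPy; the 2-D LP and all numbers are purely illustrative):

```python
# An LP with many random inequality constraints: at the optimum of a 2-D LP,
# only a handful of the 500 constraints are tight (active).
import numpy as np
from scipy.optimize import linprog

rng = np.random.default_rng(0)
n_constraints = 500
A = rng.normal(size=(n_constraints, 2))           # constraint normals a
b = np.abs(rng.normal(size=n_constraints)) + 1.0  # offsets; x = 0 is feasible
c = np.array([-1.0, -1.0])                        # maximize x1 + x2

res = linprog(c, A_ub=A, b_ub=b, bounds=[(None, None)] * 2)
slack = b - A @ res.x
print(f"{n_constraints} constraints, {int(np.sum(slack < 1e-6))} active")
```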
Flow Chart
0 0
0
Initialize ( , ),
1,
t
IN
α
α = =∅
$1 1
ˆFind given ( , )
I=I
i i i
i
d t
d
α− −
∪
$
Update ( , )
given I = I
i i
i
t
d
α
∪
,max
subject to 0,0
ˆ ˆ ( , ),
1( , ) ( )
2
nR t R
T
l l
T T
t
Y C
t S d d
S d e YK d Y
α
α αα
α α α α
∈ ∈
= ≤ ≤
≤ ∀ ∈
= −
I
$
$
: Restricted set of spase features
: subset of features that maximally violates
ˆ ( , ), i.e., min ( , )
1Find ma x ) ( ) (
2
d
T
d
I
d a
t S d S d
M d YK d Y
α α
α α
≤
• =
$1 1( , )ii iS d tα − −≥Yes
No
Terminate
Flow chart of the QCLP problem based on the restricted master problem
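A simplified sketch of this loop for the linear-kernel case, where the most violated subset can be found by per-feature scoring (next slide); scikit-learn is assumed, `budget` is an illustrative subset size, and the exact restricted QCLP/MKL solve is replaced by an SVM on the uniform average of the selected kernels, a deliberate simplification:

```python
import numpy as np
from sklearn.svm import SVC

def most_violated_mask(X, v, budget):
    """For a linear kernel, M(d) = 0.5 * sum_p d_p * (X^T v)_p^2 is separable
    in the features, so the maximizing mask keeps the top-scoring ones."""
    scores = (X.T @ v) ** 2
    d = np.zeros(X.shape[1], dtype=bool)
    d[np.argsort(scores)[-budget:]] = True
    return d

def restricted_master_loop(X, y, budget=10, C=1.0, max_iters=20):
    v = y / len(y)                 # crude initial signed duals (v = Y alpha)
    masks = []
    for _ in range(max_iters):
        d_hat = most_violated_mask(X, v, budget)
        if any((d_hat == m).all() for m in masks):
            break                  # no new violated constraint: terminate
        masks.append(d_hat)        # I = I U {d_hat}
        K = sum(X[:, m] @ X[:, m].T for m in masks) / len(masks)
        clf = SVC(kernel="precomputed", C=C).fit(K, y)
        v = np.zeros(len(y))
        v[clf.support_] = clf.dual_coef_[0]
    return masks
```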
Most Violated Features

$$\min_d S(\alpha, d) \;\Rightarrow\; \max_d M(d) = \frac{1}{2}\alpha^T Y K(d) Y \alpha, \quad \text{subject to } t \le S(\alpha, \hat{d})$$

Linear kernel:
- Calculate $M(d_i)$ for each feature $i$ separately and select the features with the top values
- Does not work for non-linear kernels

Non-linear kernel (e.g., Gaussian RBF), which exploits non-linear correlations among all the features, so individual feature ranking no longer works:
- Calculate $M(d_i)$ for all $i$, where $d_i$ contains all the features except the $i$-th
- Eliminate the least contributing feature
- Repeat the elimination until a threshold condition is met (e.g., stop the iteration if the change in $M(d)$ exceeds 30%)
- This yields variable-length feature subsets for different SVMs (see the sketch below)

[Diagram: feature list $f_1\, f_2\, f_3 \cdots f_n$ shrinking as features are eliminated]
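A sketch of the non-linear (Gaussian RBF) elimination above, assuming v = Y alpha comes from the current SVM solve; sigma and the 30% threshold follow the slide, everything else is illustrative:

```python
import numpy as np

def rbf_kernel(X, sigma):
    sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    return np.exp(-sq / (2.0 * sigma ** 2))

def M(X, v, feats, sigma):
    """M(d) = 0.5 * v^T K(d) v, with d restricted to the features in `feats`."""
    return 0.5 * v @ rbf_kernel(X[:, feats], sigma) @ v

def eliminate_features(X, v, sigma, drop_frac=0.3):
    feats = list(range(X.shape[1]))
    M_full = M(X, v, feats, sigma)
    while len(feats) > 1:
        # score each remaining feature by M with that one feature removed
        trials = [M(X, v, [f for f in feats if f != i], sigma) for i in feats]
        least = int(np.argmax(trials))   # removing this feature hurts M least
        if trials[least] < (1.0 - drop_frac) * M_full:
            break                        # change in M would exceed 30%: stop
        feats.pop(least)
    return feats                         # variable-length feature subset
```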
How GKEL Works

A bottom-up approach is used: starting from a single SVM, the restricted feature set grows one subset at a time,
$$\hat{I} = \{\hat{d}_0\} \;\rightarrow\; \{\hat{d}_0, \hat{d}_1\} \;\rightarrow\; \{\hat{d}_0, \hat{d}_1, \hat{d}_2\} \;\rightarrow\; \cdots \;\rightarrow\; \{\hat{d}_0, \hat{d}_1, \ldots, \hat{d}_{N-1}\}$$

[Diagram: SVM 1 $(\alpha_i^1, w_1)$, SVM 2 $(\alpha_i^2, w_2)$, SVM 3 $(\alpha_i^3, w_3)$, ..., SVM N $(\alpha_i^N, w_N)$, combined by a weighted sum $\sum$]

$\hat{I} = \{\hat{d}_0, \hat{d}_1, \ldots, \hat{d}_N\}$: selected feature subsets (variable lengths), with $\hat{d}_i \ne \hat{d}_j$
$W = \{w_1, w_2, \ldots, w_N\}$: weights
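A minimal sketch of the resulting weighted ensemble decision (the triple structure, names, and scikit-learn-style sub-classifiers are illustrative assumptions):

```python
import numpy as np

def gkel_decision(X, ensemble):
    """ensemble: list of (feature_idx, fitted_svm, weight) triples, one per
    sub-classifier, each with its own variable-length feature subset d_i."""
    scores = sum(w * svm.decision_function(X[:, idx])
                 for idx, svm, w in ensemble)
    return np.sign(scores)   # weighted-sum ensemble decision
```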
Images for Performance Evaluation
Forest Radiance I
Desert Radiance II
Hyperspectral Images (HYDICE) (210 bands, 0.4 – 2.5 microns)
[Markers in the images: training samples]
Performance Comparison (FR I)
[Detection maps: Single SVM (Gaussian kernel); SKEL, 10 to 2 SVMs (Gaussian kernel); GKEL, 3 SVMs (Gaussian kernel)]
ROC Curves (FR I)
Since each SKEL run uses different random subsets of spectral bands, 10 SKEL runs were used to generate 10 ROC curves
Performance Comparison (DR II)
[Detection maps: Single SVM (Gaussian kernel); GKEL, 3 SVMs (Gaussian kernel); SKEL, 10 to 2 SVMs (Gaussian kernel)]
Performance Comparison (DR II)
10 ROC curves from 10 SKEL runs, each run with different random subsets of spectral bands
Spambase Data
Performance Comparison
SKEL: 25 initial SVMs, 12 after optimization
GKEL: 14 SVMs with nonzero weights
Data downloaded from the UCI Machine Learning Repository (Spambase dataset), used to predict whether an email is spam or not
Conclusions
SKEL and a generalized version of SKEL have been introduced
SKEL starts from a large number of initial SVMs, which are then optimized down to a small number of SVMs useful for the given task
GKEL starts from a single SVM, and individual classifiers are added one by one optimally to the ensemble until the ensemble converges
GKEL and SKEL generally perform better than a regular SVM
GKEL performs as well as SKEL while using fewer resources (memory) than SKEL
Q&A
Optimally Tuning Kernel Parameters

Gaussian kernel (sphere kernel), single bandwidth $\sigma$ (e.g., $\sigma_1, \sigma_2, \sigma_3$):
$$k(\mathbf{x}, \mathbf{x}') = \exp\left(-\frac{\|\mathbf{x} - \mathbf{x}'\|^2}{2\sigma^2}\right)$$

Full-band diagonal Gaussian kernel:
$$k(\mathbf{x}, \mathbf{x}') = \exp\left(-(\mathbf{x} - \mathbf{x}')^T \Sigma^{-1} (\mathbf{x} - \mathbf{x}')\right), \qquad \Sigma = \mathrm{diag}(\sigma_1, \sigma_2, \ldots, \sigma_L)$$

Prior to the L1 optimization, the kernel parameters of each SVM are optimally tuned; a Gaussian kernel with a single bandwidth treats all the bands equally, which is suboptimal
Estimate an upper bound on the leave-one-out (LOO) error, the radius-margin (RM) bound:
$$L \le f_{RM} = \frac{1}{l}\,\frac{R^2}{\gamma^2}$$
$R$: the radius of the minimum enclosing hypersphere
$\gamma$: the margin of the hyperplane
Goal is to minimize the RM bound by gradient descent using $\partial f_{RM} / \partial \sigma_p$, the gradient of $f_{RM}$
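The gradient used in this descent is not spelled out on the slide; a standard form, following Chapelle et al., "Choosing Multiple Parameters for Support Vector Machines" (2002) and stated here as an assumption, is
$$\frac{1}{\gamma^2} = \|w\|^2, \qquad \frac{\partial f_{RM}}{\partial \sigma_p} = \frac{1}{l}\left(\|w\|^2\,\frac{\partial R^2}{\partial \sigma_p} + R^2\,\frac{\partial \|w\|^2}{\partial \sigma_p}\right)$$
$$\frac{\partial \|w\|^2}{\partial \sigma_p} = -\sum_{i,j} \alpha_i^* \alpha_j^* y_i y_j\, \frac{\partial k(\mathbf{x}_i, \mathbf{x}_j)}{\partial \sigma_p}, \qquad \frac{\partial R^2}{\partial \sigma_p} = \sum_i \beta_i^*\, \frac{\partial k(\mathbf{x}_i, \mathbf{x}_i)}{\partial \sigma_p} - \sum_{i,j} \beta_i^* \beta_j^*\, \frac{\partial k(\mathbf{x}_i, \mathbf{x}_j)}{\partial \sigma_p}$$
where $\alpha^*$ solves the SVM dual and $\beta^*$ the minimum-enclosing-hypersphere dual.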
Ensemble Learning

[Diagram: sub-classifier 1, sub-classifier 2, ..., sub-classifier N, combined by $\sum$ into the ensemble decision in $\{-1, +1\}$: a regularized decision function robust to noise and outliers]

Assumptions: the performance of each classifier is better than a random guess, and the classifiers are independent of each other; increasing the number of classifiers then improves performance
SKEL: Comparison (Top-Down Approach)

[Diagram: training data → random subsets of features (random bands) → SVM 1, SVM 2, ..., SVM N with decision functions $f_1, f_2, f_3, \ldots, f_N$ → combination of decision results]

Gaussian kernel for each SVM:
$$k(\mathbf{x}, \mathbf{x}') = \exp\left(-\frac{\|\mathbf{x} - \mathbf{x}'\|^2}{2\sigma^2}\right)$$

MKL drives the weights sparse, e.g. $d_1 = 0$, $d_2 = 0.2$, $d_3 = 0$, ..., $d_N = 0.1$, under the L1-norm constraint
$$\sum_m d_m = 1, \qquad d_m \ge 0$$
Iterative Approach to Solve QCLP

Due to the very large number of quadratic constraints, the full QCLP problem is hard to solve
So, take an iterative approach: iteratively update $(t, \alpha)$ based on a limited number of active constraints
Each Iteration of QCLP

$$\max_{\alpha \in \mathbb{R}^n,\, t \in \mathbb{R}} \; t \quad \text{subject to } \alpha^T y = 0, \;\; 0 \le \alpha \le C, \;\; t \le S(\alpha, \hat{d}_l) \;\; \forall\, \hat{d}_l \in \hat{I}$$
$$\text{where } S(\alpha, d) = e^T\alpha - \frac{1}{2}\alpha^T Y K(d) Y \alpha$$

Lagrangian:
$$L(t, \alpha, u) = t + \sum_l u_l\big(S(\alpha, \hat{d}_l) - t\big), \qquad u_l \ge 0$$
From the KKT condition:
$$\frac{\partial L}{\partial t} = 0 \;\rightarrow\; \sum_l u_l = 1$$

The intermediate solution pair $(t, \alpha)$ is therefore obtained
Iterative QCLP vs. MKL

Substituting the KKT condition $\sum_l u_l = 1$ back into the Lagrangian gives

$$\max_\alpha \min_u \; \sum_l u_l\, S(\alpha, \hat{d}_l) = \max_\alpha \min_u \;\; e^T\alpha - \frac{1}{2}\alpha^T Y \Big(\sum_l u_l K(\hat{d}_l)\Big) Y \alpha$$
$$\text{subject to } \sum_l u_l = 1, \qquad u_l \ge 0$$

Each iteration of the QCLP is therefore equivalent to an MKL problem over the restricted set of kernels $K(\hat{d}_l)$
Variable Length Features

$$M(d) = \frac{1}{2}\alpha^T Y K(d) Y \alpha = \frac{1}{2}\|w\|^2, \qquad \text{starting from } d = (1, 1, 1, \ldots, 1)$$

Apply a threshold to obtain variable-length features: stop the elimination iterations when the portion of the 2-norm of $w$ (i.e., of $M(d)$) lost to the least contributing features exceeds a predefined threshold (e.g., 30%)
This leads to variable-length feature subsets for different SVMs
GKEL Preliminary Performance
Chemical Plume Data
SKEL: 50 initial SVMs, 8 after optimization
GKEL: SVMs with nonzero weights: 7 (22)
Relaxation into QCLP

$$\max_{\alpha \in \mathbb{R}^n} \; \min_{d \in \mathcal{D}} \; S(\alpha, d), \qquad \text{subject to } \alpha^T y = 0, \;\; 0 \le \alpha \le C$$
$$\text{where } S(\alpha, d) = e^T\alpha - \frac{1}{2}\alpha^T Y K(d) Y \alpha$$

At the optimal pair $(\alpha^*, d^*)$:
1. Fix $d^*$ and optimize $\alpha$: then $S(\alpha^*, d^*) \ge S(\alpha, d^*)$
2. Increase $t$ up to $t = S(\alpha^*, d^*)$
3. For a fixed $d$, increase $t$ to find the $\alpha$ maximizing $S(\alpha, d)$
QCLP:
$$\max_{\alpha \in \mathbb{R}^n,\, t \in \mathbb{R}} \; t$$
$$\text{subject to } \alpha^T y = 0, \quad 0 \le \alpha \le C, \quad t \le S(\alpha, d_l) \;\; \forall\, d_l \in \mathcal{D}$$
$$\text{where } S(\alpha, d) = e^T\alpha - \frac{1}{2}\alpha^T Y K(d) Y \alpha$$

$\mathcal{D}$ is prohibitively large, so the full QCLP is nearly impossible to solve directly
L1 and Sparsity

[Figure: L2 optimization vs. L1 optimization under linear inequality constraints; the corners of the L1 ball make the solution land on sparse points]
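A tiny illustration of why L1 induces sparsity while L2 only shrinks, using a regression analogy (assuming scikit-learn; the synthetic setup is purely illustrative):

```python
# Same data, two penalties: Lasso (L1) zeroes most coefficients, Ridge (L2)
# keeps all 20 nonzero.
import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 20))
w_true = np.zeros(20)
w_true[:3] = [2.0, -1.5, 1.0]                 # only 3 useful features
y = X @ w_true + 0.1 * rng.normal(size=100)

print("Lasso nonzeros:", np.sum(Lasso(alpha=0.1).fit(X, y).coef_ != 0))
print("Ridge nonzeros:", np.sum(Ridge(alpha=0.1).fit(X, y).coef_ != 0))
```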