
Linear Programming Boosting

by Column and Row Generation

Kohei Hatano and Eiji Takimoto, Kyushu University, Japan

DS 2009

1. Introduction  2. Preliminaries  3. Our algorithm  4. Experiment  5. Summary

Example: given a web page, predict if the topic is "DS 2009".

Hypothesis set = words

[Figure: a weighted majority vote of decision stumps over words, e.g. 0.5*(DS 2009?) + 0.3*(ALT?) + 0.2*(Porto?), where each stump answers +1 (yes) or -1 (no).]

Weighted majority vote

Modern Machine Learning Approach: find a weighted majority of hypotheses (a hyperplane) which enlarges the margin.

1-norm soft margin optimization

(a.k.a. 1-norm SVM) A popular formulation, like 2-norm soft margin optimization: "find a hyperplane which separates positive and negative instances well."

\[
\max_{\rho,\,\alpha,\,\xi}\;\; \rho \;-\; \frac{1}{\nu}\sum_{i=1}^{m}\xi_i
\qquad (\nu:\ \text{constant s.t.}\ 1 \le \nu \le m)
\]
\[
\text{sub. to}\quad
y_i \sum_{j=1}^{n} \alpha_j h_j(x_i) \;\ge\; \rho - \xi_i \;\; (i=1,\dots,m),\qquad
\sum_{j=1}^{n} \alpha_j = 1,\quad \alpha \ge 0,\quad \xi \ge 0.
\]

[Figure: the margin ρ and the losses ξ_i of instances relative to the hyperplane; both margin and loss are normalized with the 1-norm of α.]

Note: this is a Linear Program, and it has a good generalization guarantee [Schapire et al., 98].
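Since it is a plain LP, it can be handed directly to an off-the-shelf solver. Below is a minimal sketch (not the authors' code): the matrix U with U[i, j] = y_i h_j(x_i) and the parameter ν are assumed inputs, and SciPy's linprog stands in for a commercial solver such as CPLEX.

import numpy as np
from scipy.optimize import linprog

def solve_primal(U, nu):
    """max rho - (1/nu)*sum(xi)  s.t.  U @ alpha >= rho - xi,
    sum(alpha) = 1, alpha >= 0, xi >= 0.  Variables: z = [alpha, xi, rho]."""
    m, n = U.shape
    c = np.concatenate([np.zeros(n), np.full(m, 1.0 / nu), [-1.0]])  # minimize the negated objective
    A_ub = np.hstack([-U, -np.eye(m), np.ones((m, 1))])              # -U@alpha - xi + rho <= 0
    b_ub = np.zeros(m)
    A_eq = np.concatenate([np.ones(n), np.zeros(m), [0.0]]).reshape(1, -1)  # sum(alpha) = 1
    b_eq = np.array([1.0])
    bounds = [(0, None)] * (n + m) + [(None, None)]                  # alpha, xi >= 0; rho is free
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq, bounds=bounds)
    alpha, xi, rho = res.x[:n], res.x[n:n + m], res.x[-1]
    return alpha, xi, rho, -res.fun                                  # -res.fun = optimal objective value

The point of the deck, of course, is that solving this LP directly becomes expensive when m and n are large.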

1-norm soft margin optimization (2)

Advantage of 1-norm soft margin optimization: the solution is likely to be sparse ⇒ useful for feature selection.

Sparse hyperplane: 0.5*(DS 2009?) + 0.3*(ALT?) + 0.2*(Porto?)
Non-sparse hyperplane: 0.2*(DS 2009?) + 0.1*(ALT?) + 0.1*(Porto?) + 0.1*(wine?) + 0.05*(tasting?) + 0.05*(discovery?) + 0.03*(science?) + 0.02*(…) + …

Recent Results

2-norm soft margin optimization: Quadratic Programming. There are state-of-the-art solvers:
・ SMO [Platt, 1999]
・ SVMlight [Joachims, 1999]
・ SVMPerf [Joachims, 2006]
・ Pegasos [Shalev-Shwartz et al., 2007]

1-norm soft margin optimization: Linear Programming.
・ LPBoost [Demiriz et al., 2003]
・ Entropy Regularized LPBoost [Warmuth et al., 2008]
・ others [Mangasarian, 2006] [Sra, 2006]
Not efficient enough for large data; room for improvements.

Our result: a new algorithm for 1-norm soft margin optimization.

1. Introduction  2. Preliminaries  3. Our algorithm  4. Experiment  5. Summary

Boosting

Classification: frog "+1", others "-1". Hypotheses: color and size.

[Figure: instances plotted in the (color, size) plane, labeled +1 (frogs) and -1 (others).]

1. d_1: the uniform distribution.
2. For t = 1, ..., T:
   (i) choose the hypothesis h_t maximizing the edge w.r.t. d_t;
   (ii) update the distribution d_t to d_{t+1}.
3. Output a weighting of the chosen hypotheses.

Boosting (2)

[Figure: the first chosen hypothesis h_1 splits the (color, size) plane; h_1(x_i) is shown for each instance.]

Edge of a hypothesis h w.r.t. a distribution d:
\[
\mathrm{edge}_{d}(h) \;=\; \sum_{i=1}^{m} d_i\, y_i\, h(x_i),
\qquad h(x_i) \in \{-1,+1\}\ \text{or}\ [-1,+1].
\]
Here y_i h(x_i) > 0 if h is correct on x_i, so the edge measures the accuracy of h.
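Read as code (illustrative only; the array names d, y, hx are mine):

import numpy as np

def edge(d, y, hx):
    # edge_d(h) = sum_i d_i * y_i * h(x_i); larger when h agrees with the labels
    # on the instances that the distribution d emphasizes.
    return float(np.dot(d, y * hx))

# toy check: uniform weights over 4 instances, h correct on 3 of them -> edge 0.5
d = np.full(4, 0.25)
y = np.array([+1, -1, +1, +1])
hx = np.array([+1, -1, +1, -1])
print(edge(d, y, hx))  # 0.5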

Boosting (2)

[Figure: same (color, size) plot with h_1; the instances misclassified by h_1 are highlighted.]

Update: put more weight on the misclassified instances.

Boosting (3)

[Figure: the second chosen hypothesis h_2 in the (color, size) plane.]

Note: more weight goes to "difficult" instances.

Boosting (4)

[Figure: the final combined classifier in the (color, size) plane.]

Output: 0.4 h_1 + 0.6 h_2.

Boosting & 1-norm soft margin optimization

Primal:
\[
\max_{\rho,\,\alpha,\,\xi}\;\; \rho \;-\; \frac{1}{\nu}\sum_{i=1}^{m}\xi_i
\qquad
\text{sub. to}\;\;
y_i \sum_{j=1}^{n} \alpha_j h_j(x_i) \ge \rho - \xi_i \;\; (i=1,\dots,m),\;\;
\sum_{j=1}^{n} \alpha_j = 1,\;\; \alpha \ge 0,\;\; \xi \ge 0.
\]
"Find the large-margin hyperplane which separates positive and negative instances as much as possible."

Dual:
\[
\min_{d,\,\gamma}\;\; \gamma
\qquad
\text{sub. to}\;\;
\mathrm{edge}_{d}(h_j) = \sum_{i=1}^{m} d_i\, y_i\, h_j(x_i) \le \gamma \;\; (j=1,\dots,n),\;\;
\sum_{i=1}^{m} d_i = 1,\;\; 0 \le d_i \le \frac{1}{\nu}.
\]
"Find the distribution which minimizes the maximum edge of the hypotheses" (i.e., the most "difficult" distribution w.r.t. the hypotheses) ≈ solvable using Boosting.

The two problems are equivalent by LP duality.

LPBoost [Demiriz et al., 2003]

Update: solve the dual problem restricted to the hypotheses chosen so far, {h_1, …, h_t}:
\[
(d_{t+1}, \gamma_t) \;=\; \arg\min_{d,\,\gamma}\; \gamma
\qquad
\text{sub. to}\;\;
\mathrm{edge}_{d}(h_j) \le \gamma \;\; (j=1,\dots,t),\;\;
\sum_{i=1}^{m} d_i = 1,\;\; 0 \le d_i \le \frac{1}{\nu}.
\]
Output: the convex combination whose weights α are given by the solution of the (restricted) primal problem.

Theorem [Demiriz et al.]: Given ε > 0, LPBoost outputs an ε-approximation of the optimal solution.

LPBoost's final output is \( f(x) = \sum_{t=1}^{T} \alpha_t h_t(x) \).
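A compact column-generation sketch in the spirit of LPBoost (my simplification, not the authors' implementation): each round solves the dual restricted to the hypotheses chosen so far, then adds the hypothesis with the largest edge under the new distribution, stopping once no edge exceeds γ by more than ε. U[i, j] = y_i h_j(x_i), ν and ε are assumed inputs; the final weights α would come from the restricted primal, which is omitted here.

import numpy as np
from scipy.optimize import linprog

def restricted_dual(U_cols, nu):
    """min gamma  s.t.  edge_d(h_j) <= gamma for the chosen columns,
    sum(d) = 1, 0 <= d_i <= 1/nu.  Variables: z = [d, gamma]."""
    m, t = U_cols.shape
    c = np.concatenate([np.zeros(m), [1.0]])
    A_ub = np.hstack([U_cols.T, -np.ones((t, 1))])   # d @ U_cols[:, j] - gamma <= 0
    b_ub = np.zeros(t)
    A_eq = np.concatenate([np.ones(m), [0.0]]).reshape(1, -1)
    b_eq = np.array([1.0])
    bounds = [(0, 1.0 / nu)] * m + [(None, None)]
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq, bounds=bounds)
    return res.x[:m], res.x[-1]                      # (distribution d, value gamma)

def lpboost(U, nu, eps, max_iter=1000):
    m, n = U.shape
    d = np.full(m, 1.0 / m)                          # d_1: uniform distribution
    chosen, gamma = [], -1.0
    for _ in range(max_iter):
        edges = d @ U                                # edge of every hypothesis w.r.t. d
        j = int(np.argmax(edges))
        if chosen and edges[j] <= gamma + eps:
            break                                    # no hypothesis violates the dual by more than eps
        chosen.append(j)
        d, gamma = restricted_dual(U[:, chosen], nu)
    return chosen, d, gamma

Only the columns touched so far ever enter the LP, which is exactly the "columns" entry in the comparison table later in the talk.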

Properties of the optimal solution

[Figure: margin ρ* and losses ξ*_i at the optimum.]

Let (α*, ξ*, ρ*) be the solution of the primal problem and (d*, γ*) the solution of the dual problem. The KKT conditions imply:
・ instance i with margin > ρ*:  d*_i = 0
・ instance i with margin = ρ*:  0 ≤ d*_i ≤ 1/ν
・ instance i with margin < ρ*:  d*_i = 1/ν

Note: the optimal solution can be recovered using only the instances with positive weights.

Properties of the optimal solution (2)

By the KKT conditions:
・ hypothesis j with edge w.r.t. d* < γ*:  α*_j = 0
・ hypothesis j with α*_j > 0:  edge w.r.t. d* = γ*

Sparseness of the optimal solution: 1. sparseness w.r.t. the hypothesis weighting; 2. sparseness w.r.t. the instances.

Note: the optimal solution can be recovered using only the hypotheses with positive coefficients.

1. Introduction  2. Preliminaries  3. Our algorithm  4. Experiment  5. Summary

Our idea: Sparse LPBoost — take advantage of the sparseness w.r.t. both hypotheses and instances.

2. For t = 1, ...:
   (i) pick up instances with margin < ρ_t;
   (iii) solve the dual problem w.r.t. the instances chosen so far by Boosting (ρ_{t+1}: the solution).
3. Output the solution of the primal problem.

Theorem: Given ε > 0, Sparse LPBoost outputs an ε-approximation of the optimal solution.
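A rough sketch of this idea (my own simplification, not the authors' exact procedure): rows (instances) are generated in an outer loop, and for brevity the inner Boosting over the chosen rows is replaced by solving the restricted primal LP directly over all hypotheses; the seed set and the 2^t-per-round choice of violated instances are assumptions. U[i, j] = y_i h_j(x_i), ν and ε are assumed inputs.

import numpy as np
from scipy.optimize import linprog

def primal_on_rows(U_rows, nu):
    """1-norm soft margin primal restricted to the chosen instances (rows).
    Returns (alpha, rho).  Variables: z = [alpha, xi, rho]."""
    r, n = U_rows.shape
    c = np.concatenate([np.zeros(n), np.full(r, 1.0 / nu), [-1.0]])
    A_ub = np.hstack([-U_rows, -np.eye(r), np.ones((r, 1))])
    b_ub = np.zeros(r)
    A_eq = np.concatenate([np.ones(n), np.zeros(r), [0.0]]).reshape(1, -1)
    b_eq = np.array([1.0])
    bounds = [(0, None)] * (n + r) + [(None, None)]
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq, bounds=bounds)
    return res.x[:n], res.x[-1]

def sparse_lpboost(U, nu, eps, max_iter=100):
    m, n = U.shape
    # the restricted primal stays bounded only if at least nu instances are present,
    # so seed the working set with ceil(nu) arbitrary instances
    rows = list(range(min(int(np.ceil(nu)), m)))
    for t in range(max_iter):
        alpha, rho = primal_on_rows(U[rows, :], nu)
        margins = U @ alpha                          # margin of every instance under alpha
        violated = [i for i in np.argsort(margins) if margins[i] < rho - eps and i not in rows]
        if not violated:
            break                                    # every instance already has margin >= rho - eps
        rows.extend(violated[: 2 ** t])              # add at most 2^t of the worst-margin instances
    return alpha, rho, rows

In the actual algorithm the inner solve is itself LPBoost (column generation) run only on the chosen rows, which is what makes both the rows and the columns of the constraint matrix sparse.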

Our idea (matrix form)

\[
\begin{array}{c|ccccc}
 & h_1 & \cdots & h_j & \cdots & h_n \\ \hline
d_1 & y_1 h_1(x_1) & \cdots & y_1 h_j(x_1) & \cdots & y_1 h_n(x_1) \\
\vdots & \vdots & & \vdots & & \vdots \\
d_i & y_i h_1(x_i) & \cdots & y_i h_j(x_i) & \cdots & y_i h_n(x_i) \\
\vdots & \vdots & & \vdots & & \vdots \\
d_m & y_m h_1(x_m) & \cdots & y_m h_j(x_m) & \cdots & y_m h_n(x_m)
\end{array}
\]

Each row i corresponds to instance i (weighted by d_i); each column j corresponds to hypothesis j (its edge compared with γ).

Inequality constraints of the dual problem used by each method:
  LP:              the whole matrix
  LPBoost:         columns
  Sparse LPBoost:  intersections of columns and rows

"Effective" constraints for the optimal solution: intersections of columns and rows.

How to choose examples (hypotheses)?

1st attempt: add instances one by one.

Assumptions: the number of hypotheses is constant, and the time complexity of the LP solver is m^k (m: # of instances).

Since at least ν instances get positive weight at the optimum, roughly t ≥ ν iterations are needed, so the computation time is
\[
\sum_{j=1}^{t} j^{k} \;=\; \Omega\!\left(\frac{\nu^{k+1}}{k+1}\right)
\]
— less efficient than the LP solver!

Our method: at iteration t, choose at most 2^t instances with margin < ρ (t: # of iterations). Suppose the algorithm terminates after it has chosen cm instances (0 < c < 1). Then the number of iterations is about log_2(cm), and the computation time is
\[
\sum_{j=1}^{\log_2(cm)} \bigl(2^{\,j}\bigr)^{k} \;=\; O\!\bigl((cm)^{k}\bigr),
\]
i.e., about (1/c)^k times cheaper than solving the LP over all m instances.

Note: the same argument holds for hypotheses.
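For concreteness, the arithmetic behind that bound, worked out under the stated assumptions (an m^k-time LP solver and the 2^t-per-round schedule; the example numbers are mine, not from the slides):
\[
\sum_{j=1}^{\log_2(cm)} \bigl(2^{\,j}\bigr)^{k}
 \;=\; \sum_{j=1}^{\log_2(cm)} 2^{\,jk}
 \;\le\; \frac{2^{k}}{2^{k}-1}\,(cm)^{k}
 \;\le\; 2\,(cm)^{k},
\]
so relative to one solve of cost m^k the saving is roughly a factor of (1/c)^k; for instance, c = 0.2 and k = 2 give a factor of about 25, in the range of the 3- to 100-fold speedups reported in the experiments below.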

1. Introduction  2. Preliminaries  3. Our algorithm  4. Experiment  5. Summary

Experiments (new experiments, not in the proceedings)

Data set        # of examples m    # of hypotheses n
Reuters-21578   10,170             30,389
RCV1            20,242             47,237
news20          19,996             1,335,193

Parameters: ν = 0.2m, ε = 0.001. Each algorithm is implemented in C++ with the LP solver CPLEX 11.0.

Experimental Results (running time in seconds)

Sparse LPBoost improves the computation time by a factor of 3 to 100.

1. Introduction  2. Preliminaries  3. Our algorithm  4. Experiment  5. Summary

Summary & Open problems

Our result:
• Sparse LPBoost: a provable decomposition algorithm which ε-approximates 1-norm soft margin optimization.
• 3 to 100 times faster than an LP solver or LPBoost.

Open problems:
• A theoretical guarantee on the number of iterations.
• Better methods for choosing instances (hypotheses).
