Linear Programming Boosting
by Column and Row Generation
Kohei Hatano and Eiji Takimoto, Kyushu University, Japan
DS 2009
1. Introduction
2. Preliminaries
3. Our algorithm
4. Experiment
5. Summary
Example
Given a web page, predict if the topic is “DS 2009”.
Hypothesis set = words.
[Figure: decision stumps such as “DS 2009?” (yes → +1, no → −1), combined into a weighted majority vote, e.g. 0.5·(DS 2009?) + 0.3·(ALT?) + 0.2·(Porto?).]
Modern Machine Learning Approach
Find a weighted majority of hypotheses (i.e., a hyperplane) which enlarges the margin.
1-norm soft margin optimization (a.k.a. 1-norm SVM)
A popular formulation, like the 2-norm soft margin optimization: “find a hyperplane which separates positive and negative instances well.”

$$
\begin{aligned}
\max_{\rho,\,\alpha,\,\xi}\quad & \rho - \frac{1}{\nu}\sum_{i=1}^{m}\xi_i\\
\text{sub.to}\quad & y_i\sum_{j=1}^{n}\alpha_j h_j(x_i) \ge \rho - \xi_i \quad (i=1,\dots,m),\\
& \sum_{j=1}^{n}\alpha_j = 1,\quad \alpha \ge 0,\quad \xi \ge 0
\end{aligned}
$$

(ν: a constant s.t. 1 ≤ ν ≤ m; ρ: the margin; ξ_i: the loss of instance i; α is normalized with the 1-norm)

Note: this is a linear program, with a good generalization guarantee [Schapire et al. 98].
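To make the formulation concrete, here is a minimal sketch that solves this LP with SciPy's linprog; the function name soft_margin_lp and the convention U[i, j] = y_i·h_j(x_i) are my own, not from the slides:

```python
import numpy as np
from scipy.optimize import linprog

def soft_margin_lp(U, nu):
    """Solve the 1-norm soft margin LP.
    U: (m, n) array with U[i, j] = y_i * h_j(x_i); nu: constant with 1 <= nu <= m.
    Returns (alpha, xi, rho)."""
    m, n = U.shape
    # Variable vector z = [alpha (n entries), xi (m entries), rho].
    c = np.concatenate([np.zeros(n), np.ones(m) / nu, [-1.0]])  # min -rho + (1/nu) sum xi
    # Margin constraints: rho - xi_i - sum_j alpha_j y_i h_j(x_i) <= 0.
    A_ub = np.hstack([-U, -np.eye(m), np.ones((m, 1))])
    b_ub = np.zeros(m)
    # Normalization: sum_j alpha_j = 1.
    A_eq = np.concatenate([np.ones(n), np.zeros(m), [0.0]])[None, :]
    bounds = [(0, None)] * (n + m) + [(None, None)]  # alpha, xi >= 0; rho is free
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=[1.0],
                  bounds=bounds, method="highs")
    return res.x[:n], res.x[n:n + m], res.x[-1]
```

Given hypothesis predictions H (with H[i, j] = h_j(x_i)) and labels y, U is y[:, None] * H; the returned rho is the soft margin and alpha the hypothesis weighting.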
1-norm soft margin optimization (2)
Advantage of 1-norm soft margin optimization: the solution is likely to be sparse ⇒ useful for feature selection.
sparse hyperplane: 0.5·(DS 2009?) + 0.3·(ALT?) + 0.2·(Porto?)
non-sparse hyperplane: 0.2·(DS 2009?) + 0.1·(ALT?) + 0.1·(Porto?) + 0.1·(wine?) + 0.05·(tasting?) + 0.05·(discovery?) + 0.03·(science?) + 0.02·(…) + …
Recent Results
2-norm soft margin optimization: Quadratic Programming. There are state-of-the-art solvers:
・SMO [Platt, 1999]
・SVMlight [Joachims, 1999]
・SVMPerf [Joachims, 2006]
・Pegasos [Shalev-Shwartz et al., 2007]
1-norm soft margin optimization: Linear Programming.
・LPBoost [Demiriz et al., 2003]
・Entropy Regularized LPBoost [Warmuth et al., 2008]
・others [Mangasarian 2006][Sra 2006]
Not efficient enough for large data; room for improvements.
Our result: a new algorithm for 1-norm soft margin optimization.
1. Introduction
2. Preliminaries
3. Our algorithm
4. Experiment
5. Summary
Boosting
Classification: frog “+1”, others “−1”. Hypotheses: color and size.
1. d_1: uniform distribution
2. For t = 1,…,T: (i) choose hypothesis h_t maximizing the edge w.r.t. d_t; (ii) update distribution d_t to d_{t+1}
3. Output a weighting of the chosen hypotheses
[Figure: instances plotted on the color/size plane, labeled +1 (frog) and −1]
Boosting (2)
Step 2(i): choose hypothesis h_1 maximizing the edge w.r.t. d_1.
[Figure: h_1 applied to the instances; h_1(x_i) shown for each instance]
Edge of hypothesis h w.r.t. distribution d:
$$
\mathrm{edge}_{d}(h) = \sum_{i=1}^{m} d_i\, y_i\, h(x_i), \qquad h(x_i) \in \{-1,+1\} \ (\text{or } [-1,+1])
$$
y_i h(x_i) > 0 if h is correct on x_i, so the edge measures the accuracy of h.
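For concreteness, a minimal NumPy sketch of this edge computation; the helper names and the prediction matrix H (H[i, j] = h_j(x_i)) are my notation, not from the slides:

```python
import numpy as np

def edge(d, y, h_values):
    """edge_d(h) = sum_i d_i * y_i * h(x_i), with h_values[i] = h(x_i) in [-1, +1]."""
    return float(np.dot(d, y * h_values))

def best_hypothesis(d, y, H):
    """Pick the column of H (a hypothesis) maximizing the edge w.r.t. d."""
    edges = (d * y) @ H  # all n edges in one product
    return int(np.argmax(edges))
```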
Boosting (2)
Step 2(ii): update d_1 to d_2, putting more weight on misclassified instances.
[Figure: instances misclassified by h_1 receive larger weights]
Boosting (3)
Step 2(i) again: choose hypothesis h_2 maximizing the edge w.r.t. d_2.
Note: more weight is on “difficult” instances.
[Figure: h_2 applied to the reweighted instances]
Boosting (4)
Step 3: output the weighting of the chosen hypotheses: 0.4·h_1 + 0.6·h_2.
[Figure: the combined classifier 0.4 h_1 + 0.6 h_2 on the color/size plane]
Boosting & 1-norm soft margin optimization

Primal:
$$
\begin{aligned}
\max_{\rho,\,\alpha,\,\xi}\quad & \rho - \frac{1}{\nu}\sum_{i=1}^{m}\xi_i\\
\text{sub.to}\quad & y_i\sum_{j=1}^{n}\alpha_j h_j(x_i) \ge \rho - \xi_i \quad (i=1,\dots,m),\\
& \sum_{j=1}^{n}\alpha_j = 1,\quad \alpha \ge 0,\quad \xi \ge 0
\end{aligned}
$$
“Find the large-margin hyperplane which separates positive and negative instances as much as possible.”

Dual (equivalent by duality of LP):
$$
\begin{aligned}
\min_{d,\,\gamma}\quad & \gamma\\
\text{sub.to}\quad & \mathrm{edge}_d(h_j) \le \gamma \quad (j=1,\dots,n),\\
& \sum_{i=1}^{m} d_i = 1,\quad 0 \le d_i \le \frac{1}{\nu}
\end{aligned}
$$
“Find the distribution which minimizes the maximum edge of hypotheses (the most ‘difficult’ distribution w.r.t. the hypotheses)” ≈ solvable using Boosting.
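As with the primal, the dual can be handed directly to an LP solver; a minimal sketch (again my own naming, reusing U[i, j] = y_i·h_j(x_i)):

```python
import numpy as np
from scipy.optimize import linprog

def soft_margin_dual_lp(U, nu):
    """Solve the dual: min gamma s.t. edge_d(h_j) <= gamma for all j,
    sum_i d_i = 1, 0 <= d_i <= 1/nu.  Variables z = [d (m entries), gamma]."""
    m, n = U.shape
    c = np.concatenate([np.zeros(m), [1.0]])
    A_ub = np.hstack([U.T, -np.ones((n, 1))])  # row j: d @ U[:, j] - gamma <= 0
    A_eq = np.concatenate([np.ones(m), [0.0]])[None, :]
    res = linprog(c, A_ub=A_ub, b_ub=np.zeros(n), A_eq=A_eq, b_eq=[1.0],
                  bounds=[(0, 1.0 / nu)] * m + [(None, None)], method="highs")
    return res.x[:m], res.x[-1]  # (d, gamma)
```

By strong LP duality, the optimal γ here equals the optimal primal value ρ − (1/ν)Σξ_i, which makes for a handy numerical check.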
LPBoost [Demiriz et al., 2003]
Update: solve the dual problem w.r.t. the hypothesis set {h_1,…,h_t} chosen so far:
$$
(d^{t+1}, \gamma_t) = \operatorname*{argmin}_{d,\,\gamma}\ \gamma
\quad \text{sub.to}\ \mathrm{edge}_d(h_j) \le \gamma \ (j=1,\dots,t),\quad
\sum_{i=1}^{m} d_i = 1,\quad 0 \le d_i \le \frac{1}{\nu}
$$
Output: the convex combination
$$
f(x) = \sum_{t=1}^{T} \alpha_t h_t(x),
$$
where α is the solution of the primal problem.

Theorem [Demiriz et al.]
Given ε > 0, LPBoost outputs an ε-approximation of the optimal solution.
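The update step can be sketched as a column-generation loop around an LP solver. This is my own sketch, not the authors' implementation: the stopping rule (stop once no hypothesis has edge above γ_t + ε) is the standard LPBoost one, and α is read off the LP duals, which requires SciPy's HiGHS backend (SciPy ≥ 1.7):

```python
import numpy as np
from scipy.optimize import linprog

def lpboost(U, nu, eps, max_iter=1000):
    """Column-generation sketch of LPBoost.
    U: (m, n) array with U[i, j] = y_i * h_j(x_i); 1 <= nu <= m; eps: accuracy."""
    m, n = U.shape
    d = np.full(m, 1.0 / m)   # d_1: uniform distribution over instances
    gamma = -1.0              # value of the restricted dual so far
    chosen, alpha = [], np.array([])
    for _ in range(max_iter):
        edges = d @ U                    # edge of every hypothesis w.r.t. d_t
        j = int(np.argmax(edges))
        if edges[j] <= gamma + eps:      # no hypothesis beats gamma by more than eps
            break
        chosen.append(j)
        # Restricted dual: min gamma s.t. edge_d(h_j) <= gamma for chosen j,
        # sum_i d_i = 1, 0 <= d_i <= 1/nu.  Variables z = [d (m entries), gamma].
        c = np.concatenate([np.zeros(m), [1.0]])
        A_ub = np.hstack([U[:, chosen].T, -np.ones((len(chosen), 1))])
        A_eq = np.concatenate([np.ones(m), [0.0]])[None, :]
        res = linprog(c, A_ub=A_ub, b_ub=np.zeros(len(chosen)),
                      A_eq=A_eq, b_eq=[1.0],
                      bounds=[(0, 1.0 / nu)] * m + [(None, None)], method="highs")
        d, gamma = res.x[:m], res.x[-1]
        alpha = -res.ineqlin.marginals   # primal weights of the chosen hypotheses
    return chosen, alpha, d, gamma
```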
Properties of the optimal solution
Let (ρ*, α*, ξ*) be the solution of the primal problem and (d*, γ*) the solution of the dual problem. The KKT conditions imply:
• instance i with margin > ρ*:  d*_i = 0
• instance i with margin = ρ*:  0 ≤ d*_i ≤ 1/ν
• instance i with margin < ρ* (loss ξ*_i > 0):  d*_i = 1/ν
Note: the optimal solution can be recovered using only the instances with positive weights.
Properties of the optimal solution (2)
By the KKT conditions:
• hypothesis j with edge w.r.t. d* < γ*:  α*_j = 0
• hypothesis j with edge w.r.t. d* = γ*:  α*_j ≥ 0
Sparseness of the optimal solution:
1. sparseness w.r.t. the hypothesis weighting
2. sparseness w.r.t. the instances
Note: the optimal solution can be recovered using only the hypotheses with positive coefficients.
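Both sparseness properties are easy to check numerically once the LP above is solved. A small sketch (names are mine) that splits instances into the three margin classes of the previous slide:

```python
import numpy as np

def kkt_support(U, alpha, rho, tol=1e-8):
    """Split instances by margin relative to rho*, per the KKT conditions above."""
    margins = U @ alpha                                # margin of each instance
    above = np.flatnonzero(margins > rho + tol)        # expect d*_i = 0
    on = np.flatnonzero(np.abs(margins - rho) <= tol)  # expect 0 <= d*_i <= 1/nu
    below = np.flatnonzero(margins < rho - tol)        # expect d*_i = 1/nu
    return above, on, below
```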
1. Introduction
2. Preliminaries
3. Our algorithm
4. Experiment
5. Summary
Our idea: Sparse LPBoost
Take advantage of the sparseness w.r.t. hypotheses and instances (a code sketch follows the theorem below).
1. Start from a small initial set of instances.
2. For t = 1,…: (i) pick up the instances with margin < ρ_t; (ii) solve the dual problem w.r.t. the instances chosen so far by Boosting (ρ_{t+1}: the solution).
3. Output the solution of the primal problem.
Theorem
Given ε > 0, Sparse LPBoost outputs an ε-approximation of the optimal solution.
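Here is the promised sketch of the whole loop, under my reading of the slides: solve_restricted stands for any solver of the problem restricted to the selected instances (e.g. the lpboost sketch above run on those rows only), and the "at most 2^t instances" schedule is taken from the analysis slides that follow:

```python
import numpy as np

def sparse_lpboost(U, eps, solve_restricted):
    """Row-generation sketch of Sparse LPBoost.
    U: (m, n) array with U[i, j] = y_i * h_j(x_i).
    solve_restricted(rows) -> (alpha, rho): solves the problem on `rows` only,
    returning a full-length hypothesis weighting alpha and margin rho."""
    m, n = U.shape
    alpha = np.full(n, 1.0 / n)   # some initial hypothesis weighting
    rho = 1.0                     # optimistic initial margin target
    rows = np.array([], dtype=int)
    t = 1
    while True:
        margins = U @ alpha
        violated = np.setdiff1d(np.flatnonzero(margins < rho - eps), rows)
        if violated.size == 0:
            break                 # every instance has margin >= rho - eps: done
        worst = violated[np.argsort(margins[violated])][:2 ** t]  # at most 2^t new rows
        rows = np.union1d(rows, worst)
        alpha, rho = solve_restricted(rows)
        t += 1
    return alpha, rho
```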
Our idea (matrix form)
The inequality constraints of the dual problem form an m × n matrix, where each row i corresponds to instance i and each column j corresponds to hypothesis j:
$$
\begin{pmatrix}
y_1 h_1(x_1) & \cdots & y_1 h_j(x_1) & \cdots & y_1 h_n(x_1)\\
\vdots & & \vdots & & \vdots\\
y_i h_1(x_i) & \cdots & y_i h_j(x_i) & \cdots & y_i h_n(x_i)\\
\vdots & & \vdots & & \vdots\\
y_m h_1(x_m) & \cdots & y_m h_j(x_m) & \cdots & y_m h_n(x_m)
\end{pmatrix}
$$
The rows are weighted by d = (d_1,…,d_m); column j gives the constraint edge_d(h_j) ≤ γ.

“Effective” constraints examined for the optimal solution:
• LP: the whole matrix
• LPBoost: columns
• Sparse LPBoost: intersections of columns and rows
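In code this constraint matrix is one line; a trivial sketch in my notation, where H holds the raw hypothesis predictions:

```python
import numpy as np

def edge_matrix(H, y):
    """U[i, j] = y_i * h_j(x_i): rows = instances (weighted by d),
    columns = hypotheses (each giving one dual constraint edge_d(h_j) <= gamma)."""
    return y[:, None] * H
```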
How to choose examples (hypotheses)?
1st attempt: add the instances one by one.
Assumptions: the number of hypotheses is constant; the time complexity of the LP solver is m^k (m: number of instances).
Since at least ν instances have positive weight at the optimum (by the KKT conditions), at least ν iterations are needed:
$$
\text{computation time} \ \ge\ \sum_{t=1}^{\nu} t^k \ =\ \Omega\!\left(\nu^{k+1}\right)
$$
⇒ less efficient than the LP solver!

Our method: choose at most 2^t instances with margin < ρ (t: number of iterations). If the algorithm terminates after it has chosen cm instances (0 < c < 1; c is unknown in advance), then
$$
\text{computation time} \ \le\ \sum_{t=1}^{\log(cm)} (2^t)^k \ =\ O\!\left((cm)^k\right) \ =\ O(m^k).
$$
Note: the same argument holds for hypotheses.
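For completeness, the geometric-sum calculation behind the doubling schedule can be spelled out; this reconstructs the slide's argument under its stated assumptions (LP solver time m^k, constant k):
$$
\sum_{t=1}^{T} (2^t)^k
= \frac{2^{k(T+1)} - 2^k}{2^k - 1}
\le \frac{2^k}{2^k - 1}\, 2^{kT}
= O\!\big((cm)^k\big) \quad \text{for } T = \lceil \log_2 (cm) \rceil,
\qquad
\sum_{t=1}^{\nu} t^k \ge \int_0^{\nu} t^k \, dt = \frac{\nu^{k+1}}{k+1}.
$$
So the doubling schedule stays within a constant factor of a single LP solve on cm instances, while the one-by-one schedule is strictly worse than one solve on all m instances whenever ν = Θ(m).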
1. Introduction
2. Preliminaries
3. Our algorithm
4. Experiment
5. Summary
Experiments (new experiments, not in the proceedings)
Data set        # of examples m    # of hypotheses n
Reuters-21578   10,170             30,389
RCV1            20,242             47,237
news20          19,996             1,335,193
Parameters: ν = 0.2m, ε = 0.001.
Each algorithm is implemented in C++, with CPLEX 11.0 as the LP solver.
Experimental Results (sec.)
[Table: running times in seconds of the LP solver, LPBoost, and Sparse LPBoost on each data set]
Sparse LPBoost improves the computation time by a factor of 3 to 100.
1. Introduction
2. Preliminaries
3. Our algorithm
4. Experiment
5. Summary
Summary & Open problems
Our result
• Sparse LPBoost: a provable decomposition algorithm which ε-approximates the 1-norm soft margin optimization
• 3 to 100 times faster than the LP solver or LPBoost
Open problems
• a theoretical guarantee on the number of iterations
• better methods for choosing instances (hypotheses)