Unconstrained Optimization
Rong Jin
Recap: Gradient Ascent/Descent
Simple algorithm; only requires the first-order derivative
Problem: difficulty in determining the step size
  Small step size: slow convergence
  Large step size: oscillation or bubbling
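A minimal sketch (hypothetical example, not from the slides) on the toy objective f(x) = x^2, showing how a fixed step size drives the behaviour described above:

```python
# Minimal gradient-descent sketch on f(x) = x^2 to illustrate step-size sensitivity.
def gradient_descent(grad, x0, step_size, n_iters=50):
    x = x0
    for _ in range(n_iters):
        x = x - step_size * grad(x)   # move against the gradient
    return x

grad = lambda x: 2.0 * x              # f(x) = x^2, so f'(x) = 2x, minimum at x = 0
print(gradient_descent(grad, 5.0, step_size=0.01))  # ~1.8: small step, slow convergence
print(gradient_descent(grad, 5.0, step_size=0.45))  # ~0.0: well-chosen step
print(gradient_descent(grad, 5.0, step_size=1.05))  # large step: iterates flip sign and grow (bubbling)
```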
Recap: Newton Method
Univariate Newton method:
$x_{new} = x_{old} - \frac{f'(x_{old})}{f''(x_{old})}$
Multivariate Newton method:
$x_{new} = x_{old} - H^{-1} \nabla f(x_{old})$
where the gradient is $\nabla f = \left( \frac{\partial f}{\partial x_1}, \frac{\partial f}{\partial x_2}, \ldots, \frac{\partial f}{\partial x_m} \right)^T$
and the Hessian matrix is $H_{i,j} = \frac{\partial^2 f(x_1, x_2, \ldots, x_m)}{\partial x_i \partial x_j}$
Guaranteed to converge when the objective function is convex/concave
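A sketch of one multivariate Newton update, assuming the caller supplies callables grad(x) and hess(x) (names are illustrative); it uses the quadratic from the conjugate-gradient example later in the slides:

```python
import numpy as np

# One Newton step: x_new = x_old - H^{-1} grad f(x_old).
def newton_step(x_old, grad, hess):
    g = grad(x_old)
    H = hess(x_old)
    return x_old - np.linalg.solve(H, g)   # solve H d = g instead of forming H^{-1}

# f(x) = x1^2 + x2^2 - x1*x2 - x1 - x2 (quadratic, so Newton converges in one step)
grad = lambda x: np.array([2*x[0] - x[1] - 1, 2*x[1] - x[0] - 1])
hess = lambda x: np.array([[2.0, -1.0], [-1.0, 2.0]])
print(newton_step(np.zeros(2), grad, hess))  # [1. 1.], the optimum
```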
Recap: Problems with the Standard Newton Method
  Computing the inverse of the Hessian matrix H is expensive (O(n^3))
  The Hessian matrix H itself can be very large (O(n^2))
Quasi-Newton method (BFGS):
  Approximates the inverse of the Hessian matrix H with another matrix B
  Avoids the difficulty of computing the inverse of H
  However, still problematic when B is large
Limited-memory quasi-Newton method (L-BFGS):
  Stores a set of vectors instead of the matrix B
  Avoids the difficulty of computing the inverse of H
  Avoids the difficulty of storing the large matrix B
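A hedged usage sketch: the slides point to Nocedal's L-BFGS/L-BFGS-B code (linked below); SciPy wraps the same L-BFGS-B routine, so a toy quadratic can be minimized as follows (objective and settings are illustrative):

```python
import numpy as np
from scipy.optimize import minimize

f    = lambda x: np.sum((x - 1.0) ** 2)      # toy objective, minimum at x = 1
grad = lambda x: 2.0 * (x - 1.0)

result = minimize(f, x0=np.zeros(100), jac=grad, method='L-BFGS-B',
                  options={'maxcor': 10})    # 'maxcor': number of stored vector pairs
print(result.x[:3], result.nit)
```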
Recap
Number of variables vs. method (cost per iteration) vs. convergence rate:
  Small:  Standard Newton method, O(n^3), V-Fast convergence
  Medium: Quasi-Newton method (BFGS), O(n^2), Fast convergence
  Large:  Limited-memory quasi-Newton method (L-BFGS), O(n), R-Fast convergence
Empirical Study: Learning Conditional Exponential Model

Datasets:
  Dataset   Instances   Features
  Rule      29,602      246
  Lex       42,509      135,182
  Summary   24,044      198,467
  Shallow   8,625,782   264,142

Gradient ascent vs. limited-memory quasi-Newton method (L-BFGS):
  Dataset   Gradient ascent: Iterations / Time (s)   L-BFGS: Iterations / Time (s)
  Rule      350 / 4.8                                81 / 1.13
  Lex       1545 / 114.21                            176 / 20.02
  Summary   3321 / 190.22                            69 / 8.52
  Shallow   14527 / 85962.53                         421 / 2420.30
Free software: http://www.ece.northwestern.edu/~nocedal/software.html (L-BFGS, L-BFGS-B)
Conjugate Gradient: Another Great Numerical Optimization Method!
Linear Conjugate Gradient Method
Consider optimizing the quadratic function
$x^* = \arg\min_x f(x)$ where $f(x) = \frac{x^T A x}{2} - b^T x$
Conjugate vectors
  The set of vectors {p_1, p_2, ..., p_l} is said to be conjugate with respect to a matrix A if
  $p_i^T A p_j = 0$ for any $i \neq j$
Important property
  The quadratic function can be optimized by simply optimizing the function along the individual directions in the conjugate set.
  Optimal solution: $x^* = \alpha_1 p_1 + \alpha_2 p_2 + \ldots + \alpha_l p_l$
  where $\alpha_k$ is the minimizer along the k-th conjugate direction
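A short derivation of the step sizes, filling in the step implied above: substituting $x = \sum_j \alpha_j p_j$ into $f$ and using the conjugacy $p_i^T A p_j = 0$ for $i \neq j$, the quadratic decouples into independent one-dimensional problems,
$f(x) = \sum_j \left[ \frac{\alpha_j^2}{2}\, p_j^T A p_j - \alpha_j\, b^T p_j \right] \quad\Rightarrow\quad \alpha_k = \frac{b^T p_k}{p_k^T A p_k}$
so each $\alpha_k$ can be found by minimizing along its own direction, independently of the others.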
Example
Minimize the following function:
$f(x_1, x_2) = x_1^2 + x_2^2 - x_1 x_2 - x_1 - x_2$
Matrix A:
$A = 2 \begin{pmatrix} 1 & -0.5 \\ -0.5 & 1 \end{pmatrix}$
Conjugate directions:
$p_1 = \begin{pmatrix} 1 \\ 1 \end{pmatrix}, \quad p_2 = \begin{pmatrix} 1 \\ -1 \end{pmatrix}$
Optimization
  First direction, $x_1 = x_2 = x$: $f(x_1, x_2) = x^2 - 2x$, minimizer $x = 1$
  Second direction, $x_1 = -x_2 = x$: $f(x_1, x_2) = 3x^2$, minimizer $x = 0$
Solution: $x_1 = x_2 = 1$
[Surface plot of f(x_1, x_2) omitted]
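A quick numeric check of this example (a sketch, not part of the original slides), using the step-size formula derived earlier:

```python
import numpy as np

A  = np.array([[2.0, -1.0], [-1.0, 2.0]])   # Hessian of f, i.e. 2*[[1,-0.5],[-0.5,1]]
b  = np.array([1.0, 1.0])
p1 = np.array([1.0, 1.0])
p2 = np.array([1.0, -1.0])

print(p1 @ A @ p2)                  # 0.0 -> p1, p2 are conjugate w.r.t. A
alpha1 = (b @ p1) / (p1 @ A @ p1)   # minimizer along p1 (equals 1)
alpha2 = (b @ p2) / (p2 @ A @ p2)   # minimizer along p2 (equals 0)
print(alpha1 * p1 + alpha2 * p2)    # [1. 1.] -> the solution x1 = x2 = 1
```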
How to Efficiently Find a Set of Conjugate Directions
Iterative procedure
  Given conjugate directions {p_1, p_2, ..., p_{k-1}}
  Set p_k as follows:
  $p_k = -r_k + \beta_k p_{k-1}$, where $\beta_k = \frac{r_k^T A p_{k-1}}{p_{k-1}^T A p_{k-1}}$ and $r_k = \left. \frac{\partial f(x)}{\partial x} \right|_{x = x_k}$
Theorem: the direction generated in the above step is conjugate to all previous directions {p_1, p_2, ..., p_{k-1}}, i.e.,
  $p_k^T A p_i = 0$ for any $i \in \{1, 2, \ldots, k-1\}$
Note: computing the k-th direction p_k only requires the previous direction p_{k-1}
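A sketch of the resulting linear conjugate-gradient iteration for $f(x) = \frac{x^T A x}{2} - b^T x$ (assumed sign conventions as in the update above; illustrative code, not the slides'):

```python
import numpy as np

def linear_cg(A, b, x0, tol=1e-10, max_iters=None):
    x = x0.copy()
    r = A @ x - b              # gradient of f at x
    p = -r                     # first search direction: steepest descent
    for _ in range(max_iters or len(b)):
        if np.linalg.norm(r) < tol:
            break
        alpha = -(r @ p) / (p @ A @ p)     # exact line search along p
        x = x + alpha * p
        r = A @ x - b                      # new gradient
        beta = (r @ A @ p) / (p @ A @ p)   # makes the new direction conjugate to p
        p = -r + beta * p
    return x

A = np.array([[2.0, -1.0], [-1.0, 2.0]])
b = np.array([1.0, 1.0])
print(linear_cg(A, b, np.zeros(2)))        # [1. 1.] within two iterations
```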
Nonlinear Conjugate Gradient
Even though conjugate gradient is derived for a quadratic objective function, it can be applied directly to other nonlinear functions; convergence is guaranteed if the objective is convex/concave.
Variants:
  Fletcher-Reeves conjugate gradient (FR-CG)
  Polak-Ribiere conjugate gradient (PR-CG): more robust than FR-CG
Compared to the Newton method: it is a first-order method, usually less efficient than the Newton method, but simpler to implement.
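A hedged usage sketch of nonlinear conjugate gradient on a smooth convex (non-quadratic) toy objective; SciPy's method='CG' is a Polak-Ribiere-type nonlinear conjugate gradient (objective chosen for illustration only):

```python
import numpy as np
from scipy.optimize import minimize

f    = lambda x: np.sum(np.log(1.0 + np.exp(-x))) + 0.5 * np.sum(x ** 2)  # convex toy objective
grad = lambda x: -1.0 / (1.0 + np.exp(x)) + x

result = minimize(f, x0=np.zeros(10), jac=grad, method='CG')
print(result.x[:3], result.nit)
```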
Empirical Study: Learning Conditional Exponential Model

Datasets:
  Dataset   Instances   Features
  Rule      29,602      246
  Lex       42,509      135,182
  Summary   24,044      198,467
  Shallow   8,625,782   264,142

Conjugate gradient (PR) vs. limited-memory quasi-Newton method (L-BFGS):
  Dataset   Conjugate gradient (PR): Iterations / Time (s)   L-BFGS: Iterations / Time (s)
  Rule      142 / 1.93                                       81 / 1.13
  Lex       281 / 21.72                                      176 / 20.02
  Summary   537 / 31.66                                      69 / 8.52
  Shallow   2813 / 16251.12                                  421 / 2420.30
Free software: http://www.ece.northwestern.edu/~nocedal/software.html (CG+)
When Should We Use Which Optimization Technique?
  Use the Newton method if you can find a package
  Use conjugate gradient if you have to implement it yourself
  Use gradient ascent/descent if you are lazy
Logarithm Bound Algorithms
To maximize $f(x) = f(x_1, x_2, \ldots, x_m)$
  Start with a guess $x^0 = (x_1^0, x_2^0, \ldots, x_m^0)$
  For t = 1, 2, ..., T:
    Compute $Q(x, x^{t-1}) = f(x) - f(x^{t-1})$
    Find a decoupling function $\Phi(x) = g_1(x_1) + g_2(x_2) + \ldots + g_m(x_m)$ such that $\Phi(x) \leq Q(x, x^{t-1})$ and $\Phi(x^{t-1}) = Q(x^{t-1}, x^{t-1}) = 0$
    Find the optimal solution $x_i' = \arg\max_{x_i} g_i(x_i)$ for each $i$
    Set $x_i^t = x_i'$
Logarithm Bound Algorithm (illustration)
[Figure: f(x) and successive lower-bound functions Phi(x), built up over three slides]
• Start with an initial guess x^0
• Come up with a lower-bound function $\Phi(x) \leq f(x) - f(x^0)$
• Touch point: $\Phi(x^0) = 0$
• Find the optimal solution $x^1$ for $\Phi(x)$
• Repeat the above procedure: $x^0 \rightarrow x^1 \rightarrow x^2 \rightarrow \ldots$
• Converge to the optimal point
Property of Concave Functions
For any concave function $f(x)$:
$f(p_1 x_1 + p_2 x_2 + \ldots + p_n x_n) \geq p_1 f(x_1) + p_2 f(x_2) + \ldots + p_n f(x_n)$
for $p_i \geq 0$ and $\sum_{j=1}^{n} p_j = 1$
[Figure: a concave f(x) with points $x_1$, $x_2$, illustrating $f(\tfrac{1}{3} x_1 + \tfrac{2}{3} x_2) \geq \tfrac{1}{3} f(x_1) + \tfrac{2}{3} f(x_2)$]
Important Inequality
log(x) and -exp(x) are concave functions. Therefore:
$\log\left( \sum_{i=1}^{n} p_i x_i \right) \geq \sum_{i=1}^{n} p_i \log(x_i)$
$-\exp\left( \sum_{i=1}^{n} p_i x_i \right) \geq -\sum_{i=1}^{n} p_i \exp(x_i)$
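A quick numeric check of the log inequality (illustrative values, not from the slides), with $p = (\tfrac{1}{3}, \tfrac{2}{3})$ and $x = (1, 4)$:
$\log\left( \tfrac{1}{3} \cdot 1 + \tfrac{2}{3} \cdot 4 \right) = \log 3 \approx 1.099 \;\geq\; \tfrac{1}{3} \log 1 + \tfrac{2}{3} \log 4 \approx 0.924$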
Expectation-Maximization Algorithm
Derive the EM algorithm for the Hierarchical Mixture Model
[Figure: the input x is routed by a gating model r(x) to one of two models m_1(x), m_2(x), which produce the output y]
$p(y \mid x) = \sum_{g} p(y, m_g \mid x) = r(+1 \mid x; \theta_r)\, m_1(y \mid x; \theta_{m_1}) + r(-1 \mid x; \theta_r)\, m_2(y \mid x; \theta_{m_2})$
Log-likelihood of training data:
$l(D_{train}) = \sum_i \log p(y_i \mid x_i) = \sum_i \log\left[ r(+1 \mid x_i; \theta_r)\, m_1(y_i \mid x_i; \theta_{m_1}) + r(-1 \mid x_i; \theta_r)\, m_2(y_i \mid x_i; \theta_{m_2}) \right]$
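A sketch of how the earlier log inequality turns this log-likelihood into a decoupled lower bound (the weights $q_{i1}, q_{i2}$ are introduced here for illustration; they are not named on the slides): for any $q_{i1}, q_{i2} \geq 0$ with $q_{i1} + q_{i2} = 1$,
$\log\left[ r(+1 \mid x_i; \theta_r)\, m_1(y_i \mid x_i; \theta_{m_1}) + r(-1 \mid x_i; \theta_r)\, m_2(y_i \mid x_i; \theta_{m_2}) \right] \;\geq\; q_{i1} \log \frac{r(+1 \mid x_i; \theta_r)\, m_1(y_i \mid x_i; \theta_{m_1})}{q_{i1}} + q_{i2} \log \frac{r(-1 \mid x_i; \theta_r)\, m_2(y_i \mid x_i; \theta_{m_2})}{q_{i2}}$
with equality (the touch point) when $q_{ik}$ equals the posterior probability of model $k$ given $(x_i, y_i)$ under the current parameters; the E-step computes these posteriors and the M-step maximizes the decoupled bound.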