Noise Robustness of Optimization Algorithms via Statistical Queries
Cristóbal Guzmán, [email protected]
CWI Scientific Meeting, 27-11-2015
Motivation

• In algorithms: Faster is better, right?
• Sometimes, other objectives are as important
• This time, we care about robustness
Example: Gradient Methods

Goal: Find $f^* = \min_{x \in X} f(x)$, where $f : X \to \mathbb{R}$ is a smooth convex function and $X \subseteq \mathbb{R}^d$ is a compact convex set.

• Gradient Descent [Euler:1770]
  $x_{t+1} = x_t - \gamma_t \nabla f(x_t) \;\Rightarrow\; f(x_T) - f^* = O(1/T)$

• Accelerated Gradient Method [Nesterov:1983]
  $x_t = y_t - \gamma_t \nabla f(y_t)$, $\quad y_{t+1} = x_t + \frac{\alpha_t - 1}{\alpha_{t+1}} (x_t - x_{t-1}) \;\Rightarrow\; f(x_T) - f^* = O(1/T^2)$
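A minimal sketch (not from the slides) of the two update rules above in Python, ignoring the constraint set $X$ for simplicity. The quadratic objective, the step size $\gamma_t = 1/L$, and the momentum schedule $\alpha_{t+1} = (1 + \sqrt{1 + 4\alpha_t^2})/2$ are illustrative assumptions.

```python
import numpy as np

def gradient_descent(grad, x0, step, T):
    """Gradient descent: x_{t+1} = x_t - step * grad(x_t)."""
    x = x0.copy()
    for _ in range(T):
        x = x - step * grad(x)
    return x

def accelerated_gradient(grad, x0, step, T):
    """Nesterov's accelerated method with a standard momentum schedule
    alpha_{t+1} = (1 + sqrt(1 + 4 * alpha_t**2)) / 2 (an assumed choice)."""
    x_prev, y, alpha = x0.copy(), x0.copy(), 1.0
    for _ in range(T):
        x = y - step * grad(y)                               # gradient step at the extrapolated point
        alpha_next = (1.0 + np.sqrt(1.0 + 4.0 * alpha ** 2)) / 2.0
        y = x + (alpha - 1.0) / alpha_next * (x - x_prev)     # momentum / extrapolation step
        x_prev, alpha = x, alpha_next
    return x_prev

if __name__ == "__main__":
    # Hypothetical smooth convex test problem: f(x) = 0.5 * x^T A x with L = 1.
    A = np.diag(np.linspace(0.01, 1.0, 50))
    f = lambda x: 0.5 * x @ A @ x
    grad = lambda x: A @ x
    x0 = np.ones(50)
    print("GD :", f(gradient_descent(grad, x0, step=1.0, T=200)))       # ~ O(1/T) gap
    print("AGM:", f(accelerated_gradient(grad, x0, step=1.0, T=200)))   # ~ O(1/T^2) gap
```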
Example: Gradient Methods (continued)

[Figure: gradient methods run with an exact gradient oracle $g(x) = \nabla f(x)$ versus a noisy one $g(x) = \nabla f(x) + \xi$, with $\xi \sim \mathcal{N}(0, \sigma^2)$ and $\sigma = 0.1$. Source: Moritz Hardt's blog.]
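For reference, a self-contained sketch in the spirit of the figure: the two methods above run with a noisy gradient oracle $g(x) = \nabla f(x) + \xi$, $\xi \sim \mathcal{N}(0, \sigma^2 I)$, $\sigma = 0.1$. The quadratic objective, dimension, step size, and horizon are assumptions made only for illustration; they are not the setting of the original plot.

```python
import numpy as np

rng = np.random.default_rng(1)
d, sigma, step, T = 50, 0.1, 1.0, 500
A = np.diag(np.linspace(0.01, 1.0, d))
f = lambda x: 0.5 * x @ A @ x
noisy_grad = lambda x: A @ x + sigma * rng.standard_normal(d)   # g(x) = grad f(x) + xi

# Gradient descent with the noisy oracle.
x = np.ones(d)
for _ in range(T):
    x = x - step * noisy_grad(x)
print("noisy GD :", f(x))

# Accelerated method with the same noisy oracle; the extrapolation step keeps
# feeding past noise back into the iterates, which is the sensitivity at issue.
x_prev, y, alpha = np.ones(d), np.ones(d), 1.0
for _ in range(T):
    x_new = y - step * noisy_grad(y)
    alpha_next = (1.0 + np.sqrt(1.0 + 4.0 * alpha ** 2)) / 2.0
    y = x_new + (alpha - 1.0) / alpha_next * (x_new - x_prev)
    x_prev, alpha = x_new, alpha_next
print("noisy AGM:", f(x_prev))
```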
Why Noise Robustness?

• There are other important reasons why acceleration could be harmful
• Overfitting, privacy leakage, etc.
• Goal: A theory of optimization algorithms that incorporates noise sensitivity

[Figure: schematic relating the accuracy $\varepsilon$, the number of iterations $T$, and the data.]
Statistical Queries

• Unknown distribution $D$ supported on a domain $W$
• Algorithm has oracle access to $D$: for a query $\phi : W \to [-1, 1]$, the oracle returns $\mathbb{E}_{w \sim D}[\phi(w)] \pm \tau$
• An oracle query can be emulated by $O(1/\tau^2)$ samples
• Goals
  • Efficiency ~ small number of queries
  • Noise tolerance ~ $\tau$ as large as possible
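A minimal sketch of the emulation claim in the third bullet: averaging $\phi$ over on the order of $1/\tau^2$ i.i.d. samples from $D$ answers the query within $\pm\tau$ with high probability, by Hoeffding's inequality, since $\phi$ takes values in $[-1, 1]$. The sample-count constant, the failure probability $\delta$, and the example distribution and query are assumptions for illustration.

```python
import numpy as np

def sq_oracle_from_samples(phi, sample, tau, delta=0.05, rng=None):
    """Emulate a statistical query phi : W -> [-1, 1] with tolerance tau.

    Averages phi over n = O(log(1/delta) / tau^2) i.i.d. draws from D;
    Hoeffding's inequality gives |answer - E_{w~D}[phi(w)]| <= tau with
    probability at least 1 - delta.
    """
    rng = rng or np.random.default_rng()
    n = int(np.ceil(2.0 * np.log(2.0 / delta) / tau ** 2))
    return float(np.mean([phi(sample(rng)) for _ in range(n)]))

if __name__ == "__main__":
    # Hypothetical example: D = N(0.3, 1) on W = R, query phi(w) = clip(w, -1, 1).
    phi = lambda w: float(np.clip(w, -1.0, 1.0))
    sample = lambda rng: rng.normal(0.3, 1.0)
    print(sq_oracle_from_samples(phi, sample, tau=0.05))   # within +/- 0.05 of E[phi(w)] w.h.p.
```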
Why Statistical Queries?

SQ algorithms have additional properties, or provide the basis for designing algorithms with additional features:

• Noise tolerance [Kearns:1994]
• Differential privacy [Blum, Dwork, McSherry, Nissim:2005], [Kasiviswanathan, Lee, Nissim, Raskhodnikova, Smith:2011]
• Distributed computation [Balcan, Blum, Fine, Mansour:2012]
• Generalization in adaptive data analysis [Dwork, Feldman, Hardt, Pitassi, Reingold, Roth:2014]
Statistical Query Convex Opt.

• Stochastic convex optimization: $\min_{x \in X} \mathbb{E}_{w \sim D}[f(x, w)]$
• Statistical queries
  • Function value: $\phi(\cdot) = f(x, \cdot)$
  • Gradient: $\phi(\cdot) = \frac{\partial f(x, \cdot)}{\partial x_i}$, for $i = 1, \ldots, d$
• Difficulty: estimation in the 2-norm accumulates errors
  $g_i = \mathbb{E}_{w \sim D}\!\left[\frac{\partial f(x,w)}{\partial x_i}\right] \pm \tau \;\Rightarrow\; \|\nabla \mathbb{E}_w[f(x,w)] - g\|_2 \approx \sqrt{d}\,\tau$
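A small sketch of the difficulty in the last bullet: if each partial derivative is obtained from its own statistical query with tolerance $\tau$, a worst-case oracle can be off by $\tau$ in every coordinate, so the 2-norm error of the assembled gradient estimate grows to $\sqrt{d}\,\tau$. The dimension, tolerance, and the stand-in gradient vector below are illustrative assumptions.

```python
import numpy as np

def coordinatewise_sq_gradient(true_grad, tau, rng):
    """Simulate d statistical queries g_i = E_w[df(x, w)/dx_i] +/- tau,
    with the oracle answering each query with worst-case error of size tau."""
    signs = rng.choice([-1.0, 1.0], size=true_grad.shape[0])
    return true_grad + tau * signs

rng = np.random.default_rng(0)
d, tau = 10_000, 0.01
true_grad = rng.standard_normal(d)          # stand-in for grad E_w[f(x, w)]
g = coordinatewise_sq_gradient(true_grad, tau, rng)
print(np.linalg.norm(g - true_grad))        # = sqrt(d) * tau = 1.0 here
print(np.sqrt(d) * tau)
```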
Optimal Estimation Algorithm [Feldman, G., Vempala:2015]

Q: Is it possible to avoid noise accumulation in gradient estimation?

[Figure: illustration of the estimation procedure over the domain $W$ and distribution $D$.]

With high probability, $\|\nabla \mathbb{E}_w[f(x,w)] - g\|_2 \approx O(\sqrt{\log d})\,\tau$.
Further Consequences [Feldman, G., Vempala:2015]

New statistical query algorithms for optimization and machine learning:

• Gradient methods: mirror descent, accelerated method, strongly convex minimization
• Polynomial-time algorithms: center of gravity, interior-point method
• Improved algorithms for high-dimensional classification (Perceptron) and regression
• Improved differentially private convex optimization algorithms

We provide new structural lower bounds for convex optimization.
Thank you