Computational and Statistical Learning Theory
TTIC 31120
Prof. Nati Srebro

Lecture 17: Stochastic Optimization
Part II: Realizable vs Agnostic Rates
Part III: Nearest Neighbor Classification
Stochastic (Sub)-Gradient Descent

Optimize $F(w) = \mathbb{E}_{z\sim\mathcal{D}}[f(w,z)]$ s.t. $w \in \mathcal{W}$:
1. Initialize $w_1 = 0 \in \mathcal{W}$
2. At iteration $t = 1,2,3,\dots$
   1. Sample $z_t \sim \mathcal{D}$ (obtain $g_t$ s.t. $\mathbb{E}[g_t] \in \partial F(w_t)$)
   2. $w_{t+1} = \Pi_{\mathcal{W}}\!\left(w_t - \eta_t \nabla f(w_t, z_t)\right)$
3. Return $\bar{w}_m = \frac{1}{m}\sum_{t=1}^{m} w_t$

If $\|\nabla f(w,z)\|_2 \le G$ then with appropriate step size:
$$\mathbb{E}\big[F(\bar{w}_m)\big] \le \inf_{w \in \mathcal{W},\,\|w\|_2 \le B} F(w) + O\!\left(\sqrt{\frac{B^2 G^2}{m}}\right)$$
Similarly, also Stochastic Mirror Descent.
online2stochastic (via stochastic gradients $g_t$):
Online Gradient Descent [Zinkevich 03]  ⇒  Stochastic Gradient Descent [Nemirovski Yudin 78] [Cesa-Bianchi et al 02]
Online Mirror Descent [Shalev-Shwartz Singer 07]  ⇒  Stochastic Mirror Descent [Nemirovski Yudin 78]
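The update loop above is short enough to write out directly. Below is a minimal Python sketch of projected stochastic (sub)gradient descent with iterate averaging; `sample_z`, `subgrad`, and `project` are placeholder callables introduced here for illustration, not names from the course.

```python
import numpy as np

def projected_sgd(sample_z, subgrad, project, dim, m, eta):
    """Minimal sketch of projected stochastic (sub)gradient descent.

    sample_z: callable returning a fresh z_t ~ D
    subgrad:  callable (w, z) -> unbiased (sub)gradient estimate of F at w
    project:  Euclidean projection onto the feasible set W
    dim:      dimension of w
    m:        number of iterations
    eta:      callable t -> step size eta_t (e.g. B / (G * sqrt(t)))
    Returns the averaged iterate w_bar = (1/m) * sum_t w_t.
    """
    w = np.zeros(dim)
    w_sum = np.zeros(dim)
    for t in range(1, m + 1):
        z = sample_z()
        w = project(w - eta(t) * subgrad(w, z))   # projected stochastic step
        w_sum += w
    return w_sum / m                              # return the average of the iterates
```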
Stochastic Optimization
$$\min_{w \in \mathcal{W}} F(w) = \mathbb{E}_{z\sim\mathcal{D}}[f(w,z)]$$
based only on stochastic information on $F$:
• Only unbiased estimates of $F(w)$, $\nabla F(w)$
• No direct access to $F$

E.g., fixed $f(w,z)$ but $\mathcal{D}$ unknown:
• Optimize $F(w)$ based on an i.i.d. sample $z_1, z_2, \dots, z_m \sim \mathcal{D}$
• $g = \nabla f(w, z_t)$ is an unbiased estimate of $\nabla F(w)$

• Traditional applications:
  • Optimization under uncertainty
    • Uncertainty about network performance
    • Uncertainty about client demands
    • Uncertainty about system behavior in control problems
  • Complex systems where it is easier to sample than to integrate over $z$
Machine Learning is Stochastic Optimization

$$\min_{h} L(h) = \mathbb{E}_{z\sim\mathcal{D}}[\ell(h,z)] = \mathbb{E}_{(x,y)\sim\mathcal{D}}\big[\mathrm{loss}(h(x), y)\big]$$

• Optimization variable: predictor $h$
• Objective: generalization error $L(h)$
• Stochasticity over $z = (x,y)$

"General Learning" ≡ Stochastic Optimization

Vladimir Vapnik        Arkadi Nemirovskii
Statistical Learning vs Stochastic Optimization

Stochastic Optimization (Arkadi Nemirovskii):
• Focus on computational efficiency
• Generally assumes unlimited sampling, as in Monte-Carlo methods for complicated objectives
• Optimization variable generally a vector in a normed space; complexity control through norm
• Mostly convex objectives

Statistical Learning (Vladimir Vapnik):
• Focus on sample size
• What can be done with a fixed number of samples?
• Abstract hypothesis classes: linear predictors, but also combinatorial hypothesis classes; generic measures of complexity such as VC-dim, fat shattering, Rademacher
• Also non-convex classes and loss functions
Two Approaches to Stochastic Optimization / Learning

$$\min_{w \in \mathcal{W}} F(w) = \mathbb{E}_{z\sim\mathcal{D}}[f(w,z)]$$

• Empirical Risk Minimization (ERM) / Sample Average Approximation (SAA):
  • Collect sample $z_1,\dots,z_m$
  • Minimize $F_S(w) = \frac{1}{m}\sum_i f(w, z_i)$
  • Analysis typically based on Uniform Convergence

• Stochastic Approximation (SA): [Robbins Monro 1951]
  • Update $w_t$ based on $z_t$, e.g. based on $g_t = \nabla f(w_t, z_t)$
  • E.g.: stochastic gradient descent
  • Online-to-batch conversion of online algorithm…
SA/SGD for Machine Learning
• In learning with ERM, need to optimize
  $$\hat{w} = \arg\min_{w\in\mathcal{W}} L_S(w), \qquad L_S(w) = \frac{1}{m}\sum_i \ell(w, z_i)$$
• $L_S(w)$ is expensive to evaluate exactly: $O(md)$ time
• Cheap to get an unbiased gradient estimate: $O(d)$ time
  $$i \sim \mathrm{Unif}(1\ldots m), \qquad g = \nabla\ell(w, z_i), \qquad \mathbb{E}[g] = \sum_i \tfrac{1}{m}\nabla\ell(w, z_i) = \nabla L_S(w)$$
• SGD guarantee:
  $$\mathbb{E}\big[L_S(\bar{w}^{(T)})\big] \le \inf_{w\in\mathcal{W}} L_S(w) + \sqrt{\frac{\sup\|\nabla\ell\|_2^2\;\sup\|w\|_2^2}{T}}$$
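As a quick illustration of the unbiasedness claim, the snippet below (my own setup, using a squared loss rather than anything specific from the lecture) checks numerically that the single-example gradient at a uniformly random index matches $\nabla L_S(w)$ on average:

```python
import numpy as np

# Check that a per-example gradient at a uniformly random index i is an
# unbiased estimate of the full empirical gradient, for the squared loss
# l(w, (x, y)) = (w.x - y)^2 / 2. Data and w are arbitrary illustrations.
rng = np.random.default_rng(0)
m, d = 200, 5
X, y = rng.normal(size=(m, d)), rng.normal(size=m)
w = rng.normal(size=d)

def grad_i(w, i):                       # gradient of the i-th loss term
    return (X[i] @ w - y[i]) * X[i]

full_grad = np.mean([grad_i(w, i) for i in range(m)], axis=0)              # grad L_S(w)
est = np.mean([grad_i(w, rng.integers(m)) for _ in range(50_000)], axis=0) # MC average
print(np.allclose(full_grad, est, atol=0.05))   # should typically print True
```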
SGD for SVM
$$\min L_S(w) \quad \text{s.t. } \|w\|_2 \le B$$

Use $g_t = \nabla_w\,\mathrm{loss}_{\mathrm{hinge}}(\langle w_t, \phi(x_{i_t})\rangle;\, y_{i_t})$ for a random $i_t$.

Initialize $w^{(0)} = 0$. At iteration $t$:
• Pick $i \in 1\ldots m$ at random
• If $y_i\langle w^{(t)}, \phi(x_i)\rangle < 1$: $w^{(t+1)} \leftarrow w^{(t)} + \eta_t y_i \phi(x_i)$; else: $w^{(t+1)} \leftarrow w^{(t)}$
• If $\|w^{(t+1)}\|_2 > B$: $w^{(t+1)} \leftarrow B\,\dfrac{w^{(t+1)}}{\|w^{(t+1)}\|_2}$
Return $\bar{w}^{(T)} = \frac{1}{T}\sum_{t=1}^{T} w^{(t)}$

If $\|\phi(x)\|_2 \le G$ then $\|g_t\|_2 \le G$ and
$$L_S(\bar{w}^{(T)}) \le L_S(\hat{w}) + \sqrt{\frac{B^2 G^2}{T}}$$
(in expectation over the randomness in the algorithm)
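A minimal Python sketch of this SGD-for-SVM loop on the empirical objective, assuming a precomputed feature matrix `Phi` and labels in {−1, +1}; the step size $\eta_t = B/(G\sqrt{t})$ is one standard choice, not prescribed by the slide:

```python
import numpy as np

def sgd_svm_erm(Phi, y, B, T, rng=None):
    """Sketch of SGD on the empirical hinge loss subject to ||w||_2 <= B.
    Phi: (m, d) matrix of feature vectors phi(x_i); y: labels in {-1, +1}."""
    rng = rng or np.random.default_rng(0)
    m, d = Phi.shape
    G = np.max(np.linalg.norm(Phi, axis=1))        # bound on ||phi(x)||, hence on ||g_t||
    w, w_sum = np.zeros(d), np.zeros(d)
    for t in range(1, T + 1):
        i = rng.integers(m)                        # pick i in 1..m at random
        eta = B / (G * np.sqrt(t))                 # assumed step-size schedule
        if y[i] * (w @ Phi[i]) < 1:                # hinge loss has a nonzero subgradient
            w = w + eta * y[i] * Phi[i]
        norm = np.linalg.norm(w)
        if norm > B:                               # project back onto the B-ball
            w = B * w / norm
        w_sum += w
    return w_sum / T                               # averaged iterate w_bar^(T)
```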
Stochastic vs Batch

$$\min L_S(w) \;\; \text{s.t. } \|w\|_2 \le B, \qquad \nabla L_S(w) = \frac{1}{m}\sum_i g_i, \qquad g_i = \nabla\,\mathrm{loss}(w, (x_i, y_i))$$

[Figure: given the training examples $(x_1,y_1),\dots,(x_m,y_m)$, batch gradient descent computes all $m$ per-example gradients $g_1,\dots,g_m$ and takes a single averaged step $w \leftarrow w - \frac{1}{m}\sum_i g_i$, whereas stochastic gradient descent takes a step $w \leftarrow w - g_i$ after each individual example.]
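For concreteness, here is a small sketch of the two update patterns in the figure; `grad_loss`, the data `X`, `y`, and the step size `eta` are placeholders introduced for illustration, not names from the lecture:

```python
import numpy as np

def batch_step(w, X, y, eta, grad_loss):
    """One batch step: average all m per-example gradients, then move once."""
    g = np.mean([grad_loss(w, xi, yi) for xi, yi in zip(X, y)], axis=0)
    return w - eta * g

def stochastic_pass(w, X, y, eta, grad_loss):
    """One stochastic pass: take a small step after every single example."""
    for xi, yi in zip(X, y):
        w = w - eta * grad_loss(w, xi, yi)
    return w
```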
Stochastic vs Batch

• Intuitive argument: if only taking simple gradient steps, better to be stochastic
• To get $L_S(w) \le L_S(\hat{w}) + \epsilon_{opt}$:

               #iter                          cost/iter    runtime
  Batch GD     $B^2G^2/\epsilon_{opt}^2$      $md$         $md \cdot B^2G^2/\epsilon_{opt}^2$
  SGD          $B^2G^2/\epsilon_{opt}^2$      $d$          $d \cdot B^2G^2/\epsilon_{opt}^2$

• Comparison to methods with a $\log(1/\epsilon_{opt})$ dependence that use the structure of $L_S(w)$ (not only local access)?
• How small should $\epsilon_{opt}$ be?
• What about $L(w)$, which is what we really care about?
Overall Analysis of $L_{\mathcal{D}}(w)$

$\hat{w} = \arg\min_{\|w\|\le B} L_S(w), \qquad w^* = \arg\min_{\|w\|\le B} L_{\mathcal{D}}(w)$

• Recall for ERM: $L_{\mathcal{D}}(\hat{w}) \le L_{\mathcal{D}}(w^*) + 2\sup_w |L_{\mathcal{D}}(w) - L_S(w)|$
• For an $\epsilon_{opt}$-suboptimal ERM $\bar{w}$:
  $$L_{\mathcal{D}}(\bar{w}) \le L_{\mathcal{D}}(w^*) + 2\sup_w |L_{\mathcal{D}}(w) - L_S(w)| + \big(L_S(\bar{w}) - L_S(\hat{w})\big)$$
  with $\epsilon_{approx} = L_{\mathcal{D}}(w^*)$, $\quad \epsilon_{est} \le 2\sqrt{\dfrac{B^2G^2}{m}}$, $\quad \epsilon_{opt} \le \sqrt{\dfrac{B^2G^2}{T}}$
• Take $\epsilon_{opt} \approx \epsilon_{est}$, i.e. #iterations $T \approx$ sample size $m$
• To ensure $L_{\mathcal{D}}(\bar{w}) \le L_{\mathcal{D}}(w^*) + \epsilon$:
  $$T, m = O\!\left(\frac{B^2G^2}{\epsilon^2}\right)$$
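Making the balancing step explicit (a worked version of the slide's argument; the constant 9 comes from this particular split and is not from the slides):

$$L_{\mathcal{D}}(\bar{w}) \le \underbrace{L_{\mathcal{D}}(w^*)}_{\epsilon_{approx}} + \underbrace{2\sqrt{\tfrac{B^2G^2}{m}}}_{\epsilon_{est}} + \underbrace{\sqrt{\tfrac{B^2G^2}{T}}}_{\epsilon_{opt}}, \qquad \text{so with } T = m:\;\; \epsilon_{est} + \epsilon_{opt} = 3\sqrt{\tfrac{B^2G^2}{m}} \le \epsilon \iff m \ge \frac{9\,B^2G^2}{\epsilon^2}.$$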
Direct Online-to-Batch: SGD on $L_{\mathcal{D}}(w)$

$$\min_w L_{\mathcal{D}}(w)$$

Use $g_t = \nabla_w\,\mathrm{loss}_{\mathrm{hinge}}(y\langle w, \phi(x)\rangle)$ for a random $(x,y) \sim \mathcal{D}$, so that $\mathbb{E}[g_t] = \nabla L_{\mathcal{D}}(w)$.

Initialize $w^{(0)} = 0$. At iteration $t$:
• Draw $(x_t, y_t) \sim \mathcal{D}$
• If $y_t\langle w^{(t)}, \phi(x_t)\rangle < 1$: $w^{(t+1)} \leftarrow w^{(t)} + \eta_t y_t \phi(x_t)$; else: $w^{(t+1)} \leftarrow w^{(t)}$
Return $\bar{w}^{(T)} = \frac{1}{T}\sum_{t=1}^{T} w^{(t)}$

$$L_{\mathcal{D}}(\bar{w}^{(T)}) \le \inf_{\|w\|_2 \le B} L_{\mathcal{D}}(w) + \sqrt{\frac{B^2G^2}{T}} \qquad\Rightarrow\qquad m = T = O\!\left(\frac{B^2G^2}{\epsilon^2}\right)$$
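A minimal sketch of this direct online-to-batch (one-pass) variant; `sample_xy` is a placeholder for drawing a fresh labeled example from $\mathcal{D}$, and the step-size schedule is an assumed choice:

```python
import numpy as np

def sgd_online_to_batch(sample_xy, d, T, eta):
    """Sketch of the direct SA / online-to-batch approach: one fresh
    (x_t, y_t) ~ D per iteration, hinge-loss subgradient step, no projection.
    sample_xy: callable returning a fresh pair (phi_x, y) with y in {-1, +1}
    eta:       callable t -> step size eta_t
    """
    w, w_sum = np.zeros(d), np.zeros(d)
    for t in range(1, T + 1):
        phi_x, y = sample_xy()                 # fresh sample each iteration (m = T)
        if y * (w @ phi_x) < 1:                # hinge subgradient is nonzero
            w = w + eta(t) * y * phi_x
        w_sum += w
    return w_sum / T                           # averaged iterate
```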
SGD for Machine Learning

Direct SA (online2batch) approach: $\min_w L(w)$
  Initialize $w^{(0)} = 0$. At iteration $t$:
  • Draw $(x_t, y_t) \sim \mathcal{D}$
  • If $y_t\langle w^{(t)}, \phi(x_t)\rangle < 1$: $w^{(t+1)} \leftarrow w^{(t)} + \eta_t y_t \phi(x_t)$; else: $w^{(t+1)} \leftarrow w^{(t)}$
  Return $\bar{w}^{(T)} = \frac{1}{T}\sum_{t=1}^{T} w^{(t)}$
  • Fresh sample at each iteration, $m = T$
  • No need to project nor require $\|w\| \le B$
  • Implicit regularization via early stopping

SGD on ERM: $\min_{\|w\|_2 \le B} L_S(w)$
  Draw $(x_1,y_1),\dots,(x_m,y_m) \sim \mathcal{D}$
  Initialize $w^{(0)} = 0$. At iteration $t$:
  • Pick $i \in 1\ldots m$ at random
  • If $y_i\langle w^{(t)}, \phi(x_i)\rangle < 1$: $w^{(t+1)} \leftarrow w^{(t)} + \eta_t y_i \phi(x_i)$; else: $w^{(t+1)} \leftarrow w^{(t)}$
  • $w^{(t+1)} \leftarrow \mathrm{proj}\big(w^{(t+1)}$ onto $\|w\| \le B\big)$
  Return $\bar{w}^{(T)} = \frac{1}{T}\sum_{t=1}^{T} w^{(t)}$
  • Can have $T > m$ iterations
  • Need to project to $\|w\| \le B$
  • Explicit regularization via $\|w\|$
SGD for Machine Learning

Direct SA (online2batch) approach: $\min_w L(w)$
  [same algorithm as above: fresh $(x_t, y_t) \sim \mathcal{D}$ at each iteration]
  $$L(\bar{w}^{(T)}) \le L(w^*) + \sqrt{\frac{B^2G^2}{T}}$$

SGD on ERM: $\min_{\|w\|_2 \le B} L_S(w)$
  [same algorithm as above: random $i \in 1\ldots m$ from the sample, projection onto $\|w\| \le B$]
  $$L(\bar{w}^{(T)}) \le L(w^*) + 2\sqrt{\frac{B^2G^2}{m}} + \sqrt{\frac{B^2G^2}{T}}$$

where $w^* = \arg\min_{\|w\|\le B} L(w)$ and $\eta_t = \sqrt{B^2/(G^2 t)}$.
SGD for Machine Learning

Direct SA (online2batch) approach: $\min_w L(w)$
  [same algorithm as above: fresh $(x_t, y_t) \sim \mathcal{D}$ at each iteration]
  • Fresh sample at each iteration, $m = T$
  • No need to shrink $w$
  • Implicit regularization via early stopping

SGD on RERM: $\min_w L_S(w) + \frac{\lambda}{2}\|w\|^2$
  Draw $(x_1,y_1),\dots,(x_m,y_m) \sim \mathcal{D}$
  Initialize $w^{(0)} = 0$. At iteration $t$:
  • Pick $i \in 1\ldots m$ at random
  • If $y_i\langle w^{(t)}, \phi(x_i)\rangle < 1$: $w^{(t+1)} \leftarrow w^{(t)} + \eta_t y_i \phi(x_i)$; else: $w^{(t+1)} \leftarrow w^{(t)}$
  • $w^{(t+1)} \leftarrow w^{(t+1)} - \lambda w^{(t)}$ (shrinkage step)
  Return $\bar{w}^{(T)} = \frac{1}{T}\sum_{t=1}^{T} w^{(t)}$
  • Can have $T > m$ iterations
  • Need to shrink $w$
  • Explicit regularization via $\|w\|$
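A sketch of the regularized-ERM variant in Python; here the step size is folded into the shrinkage step ($w \leftarrow w - \eta_t \lambda w$), which is a common choice and an assumption on my part (the slide writes the shrinkage simply as $w^{(t+1)} \leftarrow w^{(t+1)} - \lambda w^{(t)}$):

```python
import numpy as np

def sgd_rerm(Phi, y, lam, T, eta, rng=None):
    """Sketch of SGD on the regularized ERM objective
    L_S(w) + (lam/2)*||w||^2: hinge step plus a shrinkage step
    in place of a projection. eta is a callable t -> step size."""
    rng = rng or np.random.default_rng(0)
    m, d = Phi.shape
    w, w_sum = np.zeros(d), np.zeros(d)
    for t in range(1, T + 1):
        i = rng.integers(m)                        # random training example
        if y[i] * (w @ Phi[i]) < 1:                # hinge-loss subgradient step
            w = w + eta(t) * y[i] * Phi[i]
        w = w - eta(t) * lam * w                   # shrinkage from the (lam/2)*||w||^2 term
        w_sum += w
    return w_sum / T
```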
SGD vs ERM

[Figure: schematic of the iterates. Starting from $w^{(0)}$, one-pass SGD returns $\bar{w}^{(m)}$, which is within $O\!\big(\sqrt{B^2G^2/m}\big)$ of $\hat{w} = \arg\min_{\|w\|\le B} L_S(w)$ and of $w^* = \arg\min_{\|w\|\le B} L(w)$; running $T > m$ iterations with projection moves toward $\hat{w}$, while the unconstrained $\arg\min_w L_S(w)$ overfits.]
Mixed Approach: SGD on ERM

[Figure: test misclassification error vs number of SGD iterations (up to 3,000,000) on the Reuters RCV1 data, CCAT task, for training set sizes m = 300,000, 400,000, 500,000.]

• The mixed approach (reusing examples) can make sense
Mixed Approach: SGD on ERM

[Same figure as above: test misclassification error vs number of SGD iterations on the Reuters RCV1 data, CCAT task, for m = 300,000, 400,000, 500,000.]

• The mixed approach (reusing examples) can make sense
• Still: fresh samples are better
  ⇒ With a larger training set, can reduce generalization error faster
  ⇒ A larger training set means less runtime to get to a target generalization error
Online Optimization vs Stochastic Approximation
• In both the Online Setting and Stochastic Approximation:
  • Receive samples sequentially
  • Update $w$ after each sample
• But in the Online Setting:
  • Objective is empirical regret, i.e. behavior on observed instances
  • $z_t$ chosen adversarially (no distribution involved)
• As opposed to Stochastic Approximation, where:
  • Objective is $\mathbb{E}[\ell(w,z)]$, i.e. behavior on "future" samples
  • Samples $z_t$ are i.i.d.
• Stochastic Approximation is a computational approach; Online Learning is an analysis setup
  • E.g. "Follow the Leader"
Part II: Realizable vs Agnostic Rates

Realizable vs Agnostic Rates

• Recall for finite hypothesis classes:
  $$L_{\mathcal{D}}(\hat{h}) \le \inf_{h\in\mathcal{H}} L_{\mathcal{D}}(h) + 2\sqrt{\frac{\log|\mathcal{H}| + \log(2/\delta)}{2m}}, \qquad m = O\!\left(\frac{\log|\mathcal{H}|}{\epsilon^2}\right)$$
• But in the realizable case, if $\inf_{h\in\mathcal{H}} L_{\mathcal{D}}(h) = 0$:
  $$L_{\mathcal{D}}(\hat{h}) \le \frac{\log|\mathcal{H}| + \log(1/\delta)}{m}, \qquad m = O\!\left(\frac{\log|\mathcal{H}|}{\epsilon}\right)$$
• Also for VC-classes: in general $m = O\!\left(\frac{VCdim}{\epsilon^2}\right)$, while in the realizable case $m = O\!\left(\frac{VCdim\cdot\log(1/\epsilon)}{\epsilon}\right)$
• What happens if $L^* = \inf_{h\in\mathcal{H}} L_{\mathcal{D}}(h)$ is low, but not zero?
Estimating the Bias of a Coin

$$|p - \hat{p}| \le \sqrt{\frac{\log(2/\delta)}{2m}}$$
$$|p - \hat{p}| \le \sqrt{\frac{2p\log(2/\delta)}{m}} + \frac{2\log(2/\delta)}{3m}$$
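To see why the second bound is called "optimistic", the snippet below evaluates both bounds for a coin with small bias; the particular values of $p$, $m$, $\delta$ are arbitrary choices for illustration:

```python
import numpy as np

# Compare the two deviation bounds on this slide for a coin with small bias p.
# For small p, the second (Bernstein-type) bound is much tighter than the
# first (Hoeffding-type) one, which does not depend on p at all.
p, m, delta = 0.01, 10_000, 0.05
log_term = np.log(2 / delta)

hoeffding = np.sqrt(log_term / (2 * m))
bernstein = np.sqrt(2 * p * log_term / m) + 2 * log_term / (3 * m)

print(f"Hoeffding-type bound: {hoeffding:.4f}")   # ~ 0.0136, independent of p
print(f"Bernstein-type bound: {bernstein:.4f}")   # ~ 0.0030, scales with sqrt(p)
```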
Optimistic VC bound (aka $L^*$-bound, multiplicative bound)

$\hat{h} = \arg\min_{h\in\mathcal{H}} L_S(h), \qquad L^* = \inf_{h\in\mathcal{H}} L(h)$

• For a hypothesis class with VC-dim $D$, w.p. $1-\delta$ over $m$ samples:
  $$L(\hat{h}) \le L^* + 2\sqrt{L^*\,\frac{D\log\frac{2em}{D} + \log\frac{2}{\delta}}{m}} + 4\,\frac{D\log\frac{2em}{D} + \log\frac{2}{\delta}}{m}
  = \inf_{\alpha}\;(1+\alpha)L^* + \left(1 + \frac{1}{\alpha}\right)\frac{4\big(D\log\frac{2em}{D} + \log\frac{2}{\delta}\big)}{m}$$
• Sample complexity to get $L(\hat{h}) \le L^* + \epsilon$:
  $$m(\epsilon) = O\!\left(\frac{D}{\epsilon}\cdot\frac{L^* + \epsilon}{\epsilon}\cdot\log\frac{1}{\epsilon}\right)$$
• Extends to bounded real-valued loss in terms of VC-subgraph dim
From Parametric to Scale Sensitive

$L(h) = \mathbb{E}\big[\mathrm{loss}(h(x), y)\big], \qquad h \in \mathcal{H}$

• Instead of VC-dim or VC-subgraph-dim (≈ #params), rely on metric scale to control complexity, e.g.:
  $$\mathcal{H} = \{\, w \mapsto \langle w, x\rangle \;:\; \|w\|_2 \le B \,\}$$
• Learning depends on:
  • Metric complexity measures: fat shattering dimension, covering numbers, Rademacher Complexity
  • Scale sensitivity of loss (bound on derivatives or "margin")
• For $\mathcal{H}$ with Rademacher Complexity $\mathcal{R}_m \le \sqrt{R/m}$, and $|\mathrm{loss}'| \le G$:
  $$L(\hat{h}) \le L^* + 2G\mathcal{R}_m + \sqrt{\frac{\log(2/\delta)}{2m}} \le L^* + O\!\left(\sqrt{\frac{G^2 R + \log(2/\delta)}{2m}}\right)$$
  e.g. for the linear class above, $\mathcal{R}_m(\mathcal{H}) = \sqrt{\dfrac{B^2\sup\|x\|^2}{m}}$, i.e. $R = B^2\sup\|x\|^2$
Non-Parametric Optimistic Rate for Smooth Loss

• Theorem: for any $\mathcal{H}$ with (worst case) Rademacher Complexity $\mathcal{R}_m(\mathcal{H})$, and any smooth loss with $|\mathrm{loss}''| \le H$, $\mathrm{loss} \le b$, w.p. $1-\delta$ over $m$ samples an optimistic guarantee holds (see the "Smooth" row of the table below).
• Sample complexity: see the table below.
Parametric vs Non-Parametric

                                        Parametric                                     Scale-Sensitive
                                        $\dim(\mathcal{H}) \le D$, $\|h\| \le 1$       $\mathcal{R}_m(\mathcal{H}) \le \sqrt{R/m}$

Lipschitz: $|\phi'| \le G$              $\frac{GD}{m} + \sqrt{\frac{L^* GD}{m}}$       $\sqrt{\frac{G^2 R}{m}}$
(e.g. hinge, $\ell_1$)

Smooth: $|\phi''| \le H$                $\frac{HD}{m} + \sqrt{\frac{L^* HD}{m}}$       $\frac{HR}{m} + \sqrt{\frac{L^* HR}{m}}$
(e.g. logistic, Huber, smoothed hinge)

Smooth & strongly convex:               $\frac{H}{\mu}\cdot\frac{HD}{m}$               $\frac{HR}{m} + \sqrt{\frac{L^* HR}{m}}$
$\mu \le \phi'' \le H$ (e.g. square loss)

Min-max tight up to poly-log factors
Optimistic Learning Guarantees

$$L(\hat{h}) \le (1+\alpha)L^* + \left(1 + \frac{1}{\alpha}\right)\tilde{O}\!\left(\frac{R}{m}\right) \qquad\qquad m(\epsilon) \le \tilde{O}\!\left(\frac{R}{\epsilon}\cdot\frac{L^* + \epsilon}{\epsilon}\right)$$

✓ Parametric classes
✓ Scale-sensitive classes with smooth loss
✓ Perceptron guarantee
✓ Margin bounds
✓ Stability-based guarantees with smooth loss
✓ Online Learning/Optimization with smooth loss
× Non-parametric (scale-sensitive) classes with non-smooth loss
× Online Learning/Optimization with non-smooth loss
Why Optimistic Guarantees?
$$L(\hat{h}) \le (1+\alpha)L^* + \left(1 + \frac{1}{\alpha}\right)\tilde{O}\!\left(\frac{R}{m}\right) \qquad\qquad m(\epsilon) \le \tilde{O}\!\left(\frac{R}{\epsilon}\cdot\frac{L^* + \epsilon}{\epsilon}\right)$$

• The optimistic regime is typically the relevant regime:
  • Approximation error $L^*$ ≈ Estimation error $\epsilon$
  • If $\epsilon \ll L^*$, better to spend energy on lowering the approximation error (use a more complex class)
• Often important in highlighting true phenomena
Part III: Nearest Neighbor Classification
The Nearest Neighbor Classifier

• Training sample $S = (x_1,y_1),\dots,(x_m,y_m)$
• Want to predict the label of a new point $x$
• The Nearest Neighbor Rule:
  • Find the closest training point: $i = \arg\min_i \rho(x, x_i)$
  • Predict the label of $x$ as $y_i$
• As a learning rule: $NN(S) = h$ where $h(x) = y_{\arg\min_i \rho(x, x_i)}$
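A minimal Python sketch of the 1-NN rule; the Euclidean default for $\rho$ is just one possible choice, as the following slides emphasize:

```python
import numpy as np

def nearest_neighbor_rule(S_x, S_y, x, rho=None):
    """Sketch of the 1-NN rule: return the label of the training point
    closest to x under the distance rho (Euclidean by default)."""
    if rho is None:
        rho = lambda a, b: np.linalg.norm(a - b)             # rho(x, x') = ||x - x'||_2
    i = min(range(len(S_x)), key=lambda j: rho(x, S_x[j]))   # argmin_i rho(x, x_i)
    return S_y[i]                                            # predict y_i
```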
Where is the Bias Hiding?

[Figure: the nearest neighbor of a point changes with the choice of distance, e.g. under $\|x - x'\|_1$, $\|x - x'\|_2$, $\|x - x'\|_\infty$, or $\|\tilde{\phi}(x) - \tilde{\phi}(x')\|_2$ with the rescaled representation $\tilde{\phi}(x) = (5x[1], x[2])$.]

• What is the right "distance" between images? Between sound waves? Between sentences?
• Option 1: $\rho(x, x') = \|\phi(x) - \phi(x')\|_2$
  • What representation $\phi(x)$?
  • Maybe a different distance? $\|\phi(x) - \phi(x')\|_1$? $\|\phi(x) - \phi(x')\|_\infty$? $\sin(\angle(\phi(x), \phi(x')))$? $KL(\phi(x)\,\|\,\phi(x'))$?
• Option 2: Special-purpose distance measure on $x$
  • E.g. edit distance, deformation measure, etc.
• Find the closest training point: $i = \arg\min_i \rho(x, x_i)$; predict the label of $x$ as $y_i$
Nearest Neighbor Learning Guarantee

• Optimal predictor: $h^* = \arg\min L_{\mathcal{D}}(h)$,
  $$h^*(x) = \begin{cases} +1, & \eta(x) > 0.5 \\ -1, & \eta(x) < 0.5 \end{cases} \qquad \eta(x) = P_{\mathcal{D}}(y = 1 \mid x)$$
• For the NN rule with $\rho(x,x') = \|\phi(x) - \phi(x')\|_2$, $\phi: \mathcal{X} \to [0,1]^d$, and $|\eta(x) - \eta(x')| \le c_{\mathcal{D}}\cdot\rho(x,x')$:
  $$\mathbb{E}_{S\sim\mathcal{D}^m}\big[L(NN(S))\big] \le 2L(h^*) + \frac{4\,c_{\mathcal{D}}\sqrt{d}}{\sqrt[d+1]{m}}$$
Data Fit / Complexity Tradeoff

• k-Nearest Neighbor: predict according to the majority among the $k$ closest points from $S$ (see the sketch below).
  $$\mathbb{E}_{S\sim\mathcal{D}^m}\big[L(NN(S))\big] \le 2L(h^*) + \frac{4\,c_{\mathcal{D}}\sqrt{d}}{\sqrt[d+1]{m}}$$
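A minimal Python sketch of the k-NN majority-vote rule (with the same caveats about the choice of $\rho$ as for 1-NN):

```python
import numpy as np
from collections import Counter

def knn_rule(S_x, S_y, x, k, rho=None):
    """Sketch of the k-NN rule: majority vote among the k training points
    closest to x under the distance rho (Euclidean by default)."""
    if rho is None:
        rho = lambda a, b: np.linalg.norm(a - b)
    dists = [rho(x, xi) for xi in S_x]
    nearest = np.argsort(dists)[:k]                  # indices of the k closest points
    votes = Counter(S_y[i] for i in nearest)
    return votes.most_common(1)[0][0]                # majority label
```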
k-Nearest Neighbor: Data Fit / Complexity Tradeoff

[Figure: decision boundaries of the k-NN rule on a sample S, compared to the optimal predictor h*, for k = 1, 5, 12, 50, 100, 200.]
k-Nearest Neighbor Guarantee

• For k-NN with $\rho(x,x') = \|\phi(x) - \phi(x')\|_2$, $\phi: \mathcal{X} \to [0,1]^d$, and $|\eta(x) - \eta(x')| \le c_{\mathcal{D}}\cdot\rho(x,x')$:
  $$\mathbb{E}_{S\sim\mathcal{D}^m}\big[L(NN_k(S))\big] \le \left(1 + \sqrt{\frac{8}{k}}\right) L(h^*) + \frac{6\,c_{\mathcal{D}}\sqrt{d} + k}{\sqrt[d+1]{m}}$$
• Should increase $k$ with the sample size $m$
  • The above theory suggests $k_m \propto L(h^*)^{2/3}\cdot m^{\frac{2}{3(d+1)}}$
• "Universal" learning: for any "smooth" $\mathcal{D}$ and representation $\phi(\cdot)$ (with continuous $P(y\mid\phi(x))$), if we increase $k$ slowly enough, we will eventually converge to the optimal $L(h^*)$
• Very non-uniform: sample complexity depends not only on $h^*$, but also on $\mathcal{D}$
Uniform and Non-Uniform Learnability

• Definition: A hypothesis class $\mathcal{H}$ is agnostically PAC-learnable if there exists a learning rule $A$ such that $\forall \epsilon,\delta > 0$, $\exists m(\epsilon,\delta)$, $\forall \mathcal{D}$, $\forall h$, w.p. $\ge 1-\delta$ over $S \sim \mathcal{D}^{m(\epsilon,\delta)}$:
  $$L_{\mathcal{D}}(A(S)) \le L_{\mathcal{D}}(h) + \epsilon$$
• Definition: A hypothesis class $\mathcal{H}$ is non-uniformly learnable if there exists a learning rule $A$ such that $\forall \epsilon,\delta > 0$, $\forall h$, $\exists m(\epsilon,\delta,h)$, $\forall \mathcal{D}$, w.p. $\ge 1-\delta$ over $S \sim \mathcal{D}^{m(\epsilon,\delta,h)}$:
  $$L_{\mathcal{D}}(A(S)) \le L_{\mathcal{D}}(h) + \epsilon$$
• Definition: A hypothesis class $\mathcal{H}$ is "consistently learnable" if there exists a learning rule $A$ such that $\forall \epsilon,\delta > 0$, $\forall h$, $\forall \mathcal{D}$, $\exists m(\epsilon,\delta,h,\mathcal{D})$, w.p. $\ge 1-\delta$ over $S \sim \mathcal{D}^{m(\epsilon,\delta,h,\mathcal{D})}$:
  $$L_{\mathcal{D}}(A(S)) \le L_{\mathcal{D}}(h) + \epsilon$$

Realizable/Optimistic guarantees: $\mathcal{D}$-dependence through $L_{\mathcal{D}}(h)$