Computational and Statistical Learning Theory
TTIC 31120
Prof. Nati Srebro

Lecture 17: Stochastic Optimization
Part II: Realizable vs Agnostic Rates
Part III: Nearest Neighbor Classification
Stochastic (Sub)-Gradient Descent

Optimize $F(w) = \mathbb{E}_{z\sim\mathcal{D}}[f(w,z)]$ s.t. $w \in \mathcal{W}$:
1. Initialize $w_1 = 0 \in \mathcal{W}$
2. At iteration $t = 1,2,3,\dots$
   1. Sample $z_t \sim \mathcal{D}$ (obtain $g_t$ s.t. $\mathbb{E}[g_t] \in \partial F(w_t)$)
   2. $w_{t+1} = \Pi_{\mathcal{W}}\!\left(w_t - \eta_t \nabla f(w_t, z_t)\right)$
3. Return $\bar{w}_m = \frac{1}{m}\sum_{t=1}^{m} w_t$

If $\|\nabla f(w,z)\|_2 \le G$ then with appropriate step size:
$$\mathbb{E}\big[F(\bar{w}_m)\big] \le \inf_{w \in \mathcal{W},\,\|w\|_2 \le B} F(w) + O\!\left(\sqrt{\frac{B^2 G^2}{m}}\right)$$
Similarly, also Stochastic Mirror Descent.
online2stochastic (via stochastic gradients $g_t$):
Online Gradient Descent [Zinkevich 03]  ⇒  Stochastic Gradient Descent [Nemirovski Yudin 78] [Cesa-Bianchi et al 02]
Online Mirror Descent [Shalev-Shwartz Singer 07]  ⇒  Stochastic Mirror Descent [Nemirovski Yudin 78]
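The update loop above is short enough to write out directly. Below is a minimal Python sketch of projected stochastic (sub)gradient descent with iterate averaging; `sample_z`, `subgrad`, and `project` are placeholder callables introduced here for illustration, not names from the course.

```python
import numpy as np

def projected_sgd(sample_z, subgrad, project, dim, m, eta):
    """Minimal sketch of projected stochastic (sub)gradient descent.

    sample_z: callable returning a fresh z_t ~ D
    subgrad:  callable (w, z) -> unbiased (sub)gradient estimate of F at w
    project:  Euclidean projection onto the feasible set W
    dim:      dimension of w
    m:        number of iterations
    eta:      callable t -> step size eta_t (e.g. B / (G * sqrt(t)))
    Returns the averaged iterate w_bar = (1/m) * sum_t w_t.
    """
    w = np.zeros(dim)
    w_sum = np.zeros(dim)
    for t in range(1, m + 1):
        z = sample_z()
        w = project(w - eta(t) * subgrad(w, z))   # projected stochastic step
        w_sum += w
    return w_sum / m                              # return the average of the iterates
```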
Stochastic Optimization
$$\min_{w \in \mathcal{W}} F(w) = \mathbb{E}_{z\sim\mathcal{D}}[f(w,z)]$$
based only on stochastic information on $F$:
• Only unbiased estimates of $F(w)$, $\nabla F(w)$
• No direct access to $F$

E.g., fixed $f(w,z)$ but $\mathcal{D}$ unknown:
• Optimize $F(w)$ based on an i.i.d. sample $z_1, z_2, \dots, z_m \sim \mathcal{D}$
• $g = \nabla f(w, z_t)$ is an unbiased estimate of $\nabla F(w)$

• Traditional applications:
  • Optimization under uncertainty
    • Uncertainty about network performance
    • Uncertainty about client demands
    • Uncertainty about system behavior in control problems
  • Complex systems where it is easier to sample than to integrate over $z$
Machine Learning is Stochastic Optimization

$$\min_{h} L(h) = \mathbb{E}_{z\sim\mathcal{D}}[\ell(h,z)] = \mathbb{E}_{(x,y)\sim\mathcal{D}}\big[\mathrm{loss}(h(x), y)\big]$$

• Optimization variable: predictor $h$
• Objective: generalization error $L(h)$
• Stochasticity over $z = (x,y)$

"General Learning" ≡ Stochastic Optimization

Vladimir Vapnik        Arkadi Nemirovskii
Statistical Learning vs Stochastic Optimization

Stochastic Optimization (Arkadi Nemirovskii):
• Focus on computational efficiency
• Generally assumes unlimited sampling, as in Monte-Carlo methods for complicated objectives
• Optimization variable generally a vector in a normed space; complexity control through norm
• Mostly convex objectives

Statistical Learning (Vladimir Vapnik):
• Focus on sample size
• What can be done with a fixed number of samples?
• Abstract hypothesis classes: linear predictors, but also combinatorial hypothesis classes; generic measures of complexity such as VC-dim, fat shattering, Rademacher
• Also non-convex classes and loss functions
Two Approaches to Stochastic Optimization / Learning

$$\min_{w \in \mathcal{W}} F(w) = \mathbb{E}_{z\sim\mathcal{D}}[f(w,z)]$$

• Empirical Risk Minimization (ERM) / Sample Average Approximation (SAA):
  • Collect sample $z_1,\dots,z_m$
  • Minimize $F_S(w) = \frac{1}{m}\sum_i f(w, z_i)$
  • Analysis typically based on Uniform Convergence

• Stochastic Approximation (SA): [Robbins Monro 1951]
  • Update $w_t$ based on $z_t$, e.g. based on $g_t = \nabla f(w_t, z_t)$
  • E.g.: stochastic gradient descent
  • Online-to-batch conversion of online algorithm…
SA/SGD for Machine Learning
• In learning with ERM, need to optimize
  $$\hat{w} = \arg\min_{w\in\mathcal{W}} L_S(w), \qquad L_S(w) = \frac{1}{m}\sum_i \ell(w, z_i)$$
• $L_S(w)$ is expensive to evaluate exactly: $O(md)$ time
• Cheap to get an unbiased gradient estimate: $O(d)$ time
  $$i \sim \mathrm{Unif}(1\ldots m), \qquad g = \nabla\ell(w, z_i), \qquad \mathbb{E}[g] = \sum_i \tfrac{1}{m}\nabla\ell(w, z_i) = \nabla L_S(w)$$
• SGD guarantee:
  $$\mathbb{E}\big[L_S(\bar{w}^{(T)})\big] \le \inf_{w\in\mathcal{W}} L_S(w) + \sqrt{\frac{\sup\|\nabla\ell\|_2^2\;\sup\|w\|_2^2}{T}}$$
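As a quick illustration of the unbiasedness claim, the snippet below (my own setup, using a squared loss rather than anything specific from the lecture) checks numerically that the single-example gradient at a uniformly random index matches $\nabla L_S(w)$ on average:

```python
import numpy as np

# Check that a per-example gradient at a uniformly random index i is an
# unbiased estimate of the full empirical gradient, for the squared loss
# l(w, (x, y)) = (w.x - y)^2 / 2. Data and w are arbitrary illustrations.
rng = np.random.default_rng(0)
m, d = 200, 5
X, y = rng.normal(size=(m, d)), rng.normal(size=m)
w = rng.normal(size=d)

def grad_i(w, i):                       # gradient of the i-th loss term
    return (X[i] @ w - y[i]) * X[i]

full_grad = np.mean([grad_i(w, i) for i in range(m)], axis=0)              # grad L_S(w)
est = np.mean([grad_i(w, rng.integers(m)) for _ in range(50_000)], axis=0) # MC average
print(np.allclose(full_grad, est, atol=0.05))   # should typically print True
```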
SGD for SVM
$$\min L_S(w) \quad \text{s.t. } \|w\|_2 \le B$$

Use $g_t = \nabla_w\,\mathrm{loss}_{\mathrm{hinge}}(\langle w_t, \phi(x_{i_t})\rangle;\, y_{i_t})$ for a random $i_t$.

Initialize $w^{(0)} = 0$. At iteration $t$:
• Pick $i \in 1\ldots m$ at random
• If $y_i\langle w^{(t)}, \phi(x_i)\rangle < 1$: $w^{(t+1)} \leftarrow w^{(t)} + \eta_t y_i \phi(x_i)$; else: $w^{(t+1)} \leftarrow w^{(t)}$
• If $\|w^{(t+1)}\|_2 > B$: $w^{(t+1)} \leftarrow B\,\dfrac{w^{(t+1)}}{\|w^{(t+1)}\|_2}$
Return $\bar{w}^{(T)} = \frac{1}{T}\sum_{t=1}^{T} w^{(t)}$

If $\|\phi(x)\|_2 \le G$ then $\|g_t\|_2 \le G$ and
$$L_S(\bar{w}^{(T)}) \le L_S(\hat{w}) + \sqrt{\frac{B^2 G^2}{T}}$$
(in expectation over the randomness in the algorithm)
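A minimal Python sketch of this SGD-for-SVM loop on the empirical objective, assuming a precomputed feature matrix `Phi` and labels in {−1, +1}; the step size $\eta_t = B/(G\sqrt{t})$ is one standard choice, not prescribed by the slide:

```python
import numpy as np

def sgd_svm_erm(Phi, y, B, T, rng=None):
    """Sketch of SGD on the empirical hinge loss subject to ||w||_2 <= B.
    Phi: (m, d) matrix of feature vectors phi(x_i); y: labels in {-1, +1}."""
    rng = rng or np.random.default_rng(0)
    m, d = Phi.shape
    G = np.max(np.linalg.norm(Phi, axis=1))        # bound on ||phi(x)||, hence on ||g_t||
    w, w_sum = np.zeros(d), np.zeros(d)
    for t in range(1, T + 1):
        i = rng.integers(m)                        # pick i in 1..m at random
        eta = B / (G * np.sqrt(t))                 # assumed step-size schedule
        if y[i] * (w @ Phi[i]) < 1:                # hinge loss has a nonzero subgradient
            w = w + eta * y[i] * Phi[i]
        norm = np.linalg.norm(w)
        if norm > B:                               # project back onto the B-ball
            w = B * w / norm
        w_sum += w
    return w_sum / T                               # averaged iterate w_bar^(T)
```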
Stochastic vs Batch

$$\min L_S(w) \;\; \text{s.t. } \|w\|_2 \le B, \qquad \nabla L_S(w) = \frac{1}{m}\sum_i g_i, \qquad g_i = \nabla\,\mathrm{loss}(w, (x_i, y_i))$$

[Figure: given the training examples $(x_1,y_1),\dots,(x_m,y_m)$, batch gradient descent computes all $m$ per-example gradients $g_1,\dots,g_m$ and takes a single averaged step $w \leftarrow w - \frac{1}{m}\sum_i g_i$, whereas stochastic gradient descent takes a step $w \leftarrow w - g_i$ after each individual example.]
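For concreteness, here is a small sketch of the two update patterns in the figure; `grad_loss`, the data `X`, `y`, and the step size `eta` are placeholders introduced for illustration, not names from the lecture:

```python
import numpy as np

def batch_step(w, X, y, eta, grad_loss):
    """One batch step: average all m per-example gradients, then move once."""
    g = np.mean([grad_loss(w, xi, yi) for xi, yi in zip(X, y)], axis=0)
    return w - eta * g

def stochastic_pass(w, X, y, eta, grad_loss):
    """One stochastic pass: take a small step after every single example."""
    for xi, yi in zip(X, y):
        w = w - eta * grad_loss(w, xi, yi)
    return w
```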
Stochastic vs Batch

• Intuitive argument: if only taking simple gradient steps, better to be stochastic
• To get $L_S(w) \le L_S(\hat{w}) + \epsilon_{opt}$:

               #iter                          cost/iter    runtime
  Batch GD     $B^2G^2/\epsilon_{opt}^2$      $md$         $md \cdot B^2G^2/\epsilon_{opt}^2$
  SGD          $B^2G^2/\epsilon_{opt}^2$      $d$          $d \cdot B^2G^2/\epsilon_{opt}^2$

• Comparison to methods with a $\log(1/\epsilon_{opt})$ dependence that use the structure of $L_S(w)$ (not only local access)?
• How small should $\epsilon_{opt}$ be?
• What about $L(w)$, which is what we really care about?
Overall Analysis of $L_{\mathcal{D}}(w)$

$\hat{w} = \arg\min_{\|w\|\le B} L_S(w), \qquad w^* = \arg\min_{\|w\|\le B} L_{\mathcal{D}}(w)$

• Recall for ERM: $L_{\mathcal{D}}(\hat{w}) \le L_{\mathcal{D}}(w^*) + 2\sup_w |L_{\mathcal{D}}(w) - L_S(w)|$
• For an $\epsilon_{opt}$-suboptimal ERM $\bar{w}$:
  $$L_{\mathcal{D}}(\bar{w}) \le L_{\mathcal{D}}(w^*) + 2\sup_w |L_{\mathcal{D}}(w) - L_S(w)| + \big(L_S(\bar{w}) - L_S(\hat{w})\big)$$
  with $\epsilon_{approx} = L_{\mathcal{D}}(w^*)$, $\quad \epsilon_{est} \le 2\sqrt{\dfrac{B^2G^2}{m}}$, $\quad \epsilon_{opt} \le \sqrt{\dfrac{B^2G^2}{T}}$
• Take $\epsilon_{opt} \approx \epsilon_{est}$, i.e. #iterations $T \approx$ sample size $m$
• To ensure $L_{\mathcal{D}}(\bar{w}) \le L_{\mathcal{D}}(w^*) + \epsilon$:
  $$T, m = O\!\left(\frac{B^2G^2}{\epsilon^2}\right)$$
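Making the balancing step explicit (a worked version of the slide's argument; the constant 9 comes from this particular split and is not from the slides):

$$L_{\mathcal{D}}(\bar{w}) \le \underbrace{L_{\mathcal{D}}(w^*)}_{\epsilon_{approx}} + \underbrace{2\sqrt{\tfrac{B^2G^2}{m}}}_{\epsilon_{est}} + \underbrace{\sqrt{\tfrac{B^2G^2}{T}}}_{\epsilon_{opt}}, \qquad \text{so with } T = m:\;\; \epsilon_{est} + \epsilon_{opt} = 3\sqrt{\tfrac{B^2G^2}{m}} \le \epsilon \iff m \ge \frac{9\,B^2G^2}{\epsilon^2}.$$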
Direct Online-to-Batch: SGD on $L_{\mathcal{D}}(w)$

$$\min_w L_{\mathcal{D}}(w)$$

Use $g_t = \nabla_w\,\mathrm{loss}_{\mathrm{hinge}}(y\langle w, \phi(x)\rangle)$ for a random $(x,y) \sim \mathcal{D}$, so that $\mathbb{E}[g_t] = \nabla L_{\mathcal{D}}(w)$.

Initialize $w^{(0)} = 0$. At iteration $t$:
• Draw $(x_t, y_t) \sim \mathcal{D}$
• If $y_t\langle w^{(t)}, \phi(x_t)\rangle < 1$: $w^{(t+1)} \leftarrow w^{(t)} + \eta_t y_t \phi(x_t)$; else: $w^{(t+1)} \leftarrow w^{(t)}$
Return $\bar{w}^{(T)} = \frac{1}{T}\sum_{t=1}^{T} w^{(t)}$

$$L_{\mathcal{D}}(\bar{w}^{(T)}) \le \inf_{\|w\|_2 \le B} L_{\mathcal{D}}(w) + \sqrt{\frac{B^2G^2}{T}} \qquad\Rightarrow\qquad m = T = O\!\left(\frac{B^2G^2}{\epsilon^2}\right)$$
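A minimal sketch of this direct online-to-batch (one-pass) variant; `sample_xy` is a placeholder for drawing a fresh labeled example from $\mathcal{D}$, and the step-size schedule is an assumed choice:

```python
import numpy as np

def sgd_online_to_batch(sample_xy, d, T, eta):
    """Sketch of the direct SA / online-to-batch approach: one fresh
    (x_t, y_t) ~ D per iteration, hinge-loss subgradient step, no projection.
    sample_xy: callable returning a fresh pair (phi_x, y) with y in {-1, +1}
    eta:       callable t -> step size eta_t
    """
    w, w_sum = np.zeros(d), np.zeros(d)
    for t in range(1, T + 1):
        phi_x, y = sample_xy()                 # fresh sample each iteration (m = T)
        if y * (w @ phi_x) < 1:                # hinge subgradient is nonzero
            w = w + eta(t) * y * phi_x
        w_sum += w
    return w_sum / T                           # averaged iterate
```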
SGD for Machine Learning

Direct SA (online2batch) approach: $\min_w L(w)$
  Initialize $w^{(0)} = 0$. At iteration $t$:
  • Draw $(x_t, y_t) \sim \mathcal{D}$
  • If $y_t\langle w^{(t)}, \phi(x_t)\rangle < 1$: $w^{(t+1)} \leftarrow w^{(t)} + \eta_t y_t \phi(x_t)$; else: $w^{(t+1)} \leftarrow w^{(t)}$
  Return $\bar{w}^{(T)} = \frac{1}{T}\sum_{t=1}^{T} w^{(t)}$
  • Fresh sample at each iteration, $m = T$
  • No need to project nor require $\|w\| \le B$
  • Implicit regularization via early stopping

SGD on ERM: $\min_{\|w\|_2 \le B} L_S(w)$
  Draw $(x_1,y_1),\dots,(x_m,y_m) \sim \mathcal{D}$
  Initialize $w^{(0)} = 0$. At iteration $t$:
  • Pick $i \in 1\ldots m$ at random
  • If $y_i\langle w^{(t)}, \phi(x_i)\rangle < 1$: $w^{(t+1)} \leftarrow w^{(t)} + \eta_t y_i \phi(x_i)$; else: $w^{(t+1)} \leftarrow w^{(t)}$
  • $w^{(t+1)} \leftarrow \mathrm{proj}\big(w^{(t+1)}$ onto $\|w\| \le B\big)$
  Return $\bar{w}^{(T)} = \frac{1}{T}\sum_{t=1}^{T} w^{(t)}$
  • Can have $T > m$ iterations
  • Need to project to $\|w\| \le B$
  • Explicit regularization via $\|w\|$
SGD for Machine Learning

Direct SA (online2batch) approach: $\min_w L(w)$
  [same algorithm as above: fresh $(x_t, y_t) \sim \mathcal{D}$ at each iteration]
  $$L(\bar{w}^{(T)}) \le L(w^*) + \sqrt{\frac{B^2G^2}{T}}$$

SGD on ERM: $\min_{\|w\|_2 \le B} L_S(w)$
  [same algorithm as above: random $i \in 1\ldots m$ from the sample, projection onto $\|w\| \le B$]
  $$L(\bar{w}^{(T)}) \le L(w^*) + 2\sqrt{\frac{B^2G^2}{m}} + \sqrt{\frac{B^2G^2}{T}}$$

where $w^* = \arg\min_{\|w\|\le B} L(w)$ and $\eta_t = \sqrt{B^2/(G^2 t)}$.
SGD for Machine Learning

Direct SA (online2batch) approach: $\min_w L(w)$
  [same algorithm as above: fresh $(x_t, y_t) \sim \mathcal{D}$ at each iteration]
  • Fresh sample at each iteration, $m = T$
  • No need to shrink $w$
  • Implicit regularization via early stopping

SGD on RERM: $\min_w L_S(w) + \frac{\lambda}{2}\|w\|^2$
  Draw $(x_1,y_1),\dots,(x_m,y_m) \sim \mathcal{D}$
  Initialize $w^{(0)} = 0$. At iteration $t$:
  • Pick $i \in 1\ldots m$ at random
  • If $y_i\langle w^{(t)}, \phi(x_i)\rangle < 1$: $w^{(t+1)} \leftarrow w^{(t)} + \eta_t y_i \phi(x_i)$; else: $w^{(t+1)} \leftarrow w^{(t)}$
  • $w^{(t+1)} \leftarrow w^{(t+1)} - \lambda w^{(t)}$ (shrinkage step)
  Return $\bar{w}^{(T)} = \frac{1}{T}\sum_{t=1}^{T} w^{(t)}$
  • Can have $T > m$ iterations
  • Need to shrink $w$
  • Explicit regularization via $\|w\|$
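A sketch of the regularized-ERM variant in Python; here the step size is folded into the shrinkage step ($w \leftarrow w - \eta_t \lambda w$), which is a common choice and an assumption on my part (the slide writes the shrinkage simply as $w^{(t+1)} \leftarrow w^{(t+1)} - \lambda w^{(t)}$):

```python
import numpy as np

def sgd_rerm(Phi, y, lam, T, eta, rng=None):
    """Sketch of SGD on the regularized ERM objective
    L_S(w) + (lam/2)*||w||^2: hinge step plus a shrinkage step
    in place of a projection. eta is a callable t -> step size."""
    rng = rng or np.random.default_rng(0)
    m, d = Phi.shape
    w, w_sum = np.zeros(d), np.zeros(d)
    for t in range(1, T + 1):
        i = rng.integers(m)                        # random training example
        if y[i] * (w @ Phi[i]) < 1:                # hinge-loss subgradient step
            w = w + eta(t) * y[i] * Phi[i]
        w = w - eta(t) * lam * w                   # shrinkage from the (lam/2)*||w||^2 term
        w_sum += w
    return w_sum / T
```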
SGD vs ERM

[Figure: schematic of the iterates. Starting from $w^{(0)}$, one-pass SGD returns $\bar{w}^{(m)}$, which is within $O\!\big(\sqrt{B^2G^2/m}\big)$ of $\hat{w} = \arg\min_{\|w\|\le B} L_S(w)$ and of $w^* = \arg\min_{\|w\|\le B} L(w)$; running $T > m$ iterations with projection moves toward $\hat{w}$, while the unconstrained $\arg\min_w L_S(w)$ overfits.]
Mixed Approach: SGD on ERM

[Figure: test misclassification error vs number of SGD iterations (up to 3,000,000) on the Reuters RCV1 data, CCAT task, for training set sizes m = 300,000, 400,000, 500,000.]

• The mixed approach (reusing examples) can make sense
Mixed Approach: SGD on ERM

[Same figure as above: test misclassification error vs number of SGD iterations on the Reuters RCV1 data, CCAT task, for m = 300,000, 400,000, 500,000.]

• The mixed approach (reusing examples) can make sense
• Still: fresh samples are better
  ⇒ With a larger training set, can reduce generalization error faster
  ⇒ A larger training set means less runtime to get to a target generalization error
Online Optimization vs Stochastic Approximation
• In both the Online Setting and Stochastic Approximation:
  • Receive samples sequentially
  • Update $w$ after each sample
• But in the Online Setting:
  • Objective is empirical regret, i.e. behavior on observed instances
  • $z_t$ chosen adversarially (no distribution involved)
• As opposed to Stochastic Approximation, where:
  • Objective is $\mathbb{E}[\ell(w,z)]$, i.e. behavior on "future" samples
  • Samples $z_t$ are i.i.d.
• Stochastic Approximation is a computational approach; Online Learning is an analysis setup
  • E.g. "Follow the Leader"
Part II: Realizable vs Agnostic Rates

Realizable vs Agnostic Rates

• Recall for finite hypothesis classes:
  $$L_{\mathcal{D}}(\hat{h}) \le \inf_{h\in\mathcal{H}} L_{\mathcal{D}}(h) + 2\sqrt{\frac{\log|\mathcal{H}| + \log(2/\delta)}{2m}}, \qquad m = O\!\left(\frac{\log|\mathcal{H}|}{\epsilon^2}\right)$$
• But in the realizable case, if $\inf_{h\in\mathcal{H}} L_{\mathcal{D}}(h) = 0$:
  $$L_{\mathcal{D}}(\hat{h}) \le \frac{\log|\mathcal{H}| + \log(1/\delta)}{m}, \qquad m = O\!\left(\frac{\log|\mathcal{H}|}{\epsilon}\right)$$
• Also for VC-classes: in general $m = O\!\left(\frac{VCdim}{\epsilon^2}\right)$, while in the realizable case $m = O\!\left(\frac{VCdim\cdot\log(1/\epsilon)}{\epsilon}\right)$
• What happens if $L^* = \inf_{h\in\mathcal{H}} L_{\mathcal{D}}(h)$ is low, but not zero?
Estimating the Bias of a Coin

$$|p - \hat{p}| \le \sqrt{\frac{\log(2/\delta)}{2m}}$$
$$|p - \hat{p}| \le \sqrt{\frac{2p\log(2/\delta)}{m}} + \frac{2\log(2/\delta)}{3m}$$
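To see why the second bound is called "optimistic", the snippet below evaluates both bounds for a coin with small bias; the particular values of $p$, $m$, $\delta$ are arbitrary choices for illustration:

```python
import numpy as np

# Compare the two deviation bounds on this slide for a coin with small bias p.
# For small p, the second (Bernstein-type) bound is much tighter than the
# first (Hoeffding-type) one, which does not depend on p at all.
p, m, delta = 0.01, 10_000, 0.05
log_term = np.log(2 / delta)

hoeffding = np.sqrt(log_term / (2 * m))
bernstein = np.sqrt(2 * p * log_term / m) + 2 * log_term / (3 * m)

print(f"Hoeffding-type bound: {hoeffding:.4f}")   # ~ 0.0136, independent of p
print(f"Bernstein-type bound: {bernstein:.4f}")   # ~ 0.0030, scales with sqrt(p)
```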
Optimistic VC bound (aka $L^*$-bound, multiplicative bound)

$\hat{h} = \arg\min_{h\in\mathcal{H}} L_S(h), \qquad L^* = \inf_{h\in\mathcal{H}} L(h)$

• For a hypothesis class with VC-dim $D$, w.p. $1-\delta$ over $m$ samples:
  $$L(\hat{h}) \le L^* + 2\sqrt{L^*\,\frac{D\log\frac{2em}{D} + \log\frac{2}{\delta}}{m}} + 4\,\frac{D\log\frac{2em}{D} + \log\frac{2}{\delta}}{m}
  = \inf_{\alpha}\;(1+\alpha)L^* + \left(1 + \frac{1}{\alpha}\right)\frac{4\big(D\log\frac{2em}{D} + \log\frac{2}{\delta}\big)}{m}$$
• Sample complexity to get $L(\hat{h}) \le L^* + \epsilon$:
  $$m(\epsilon) = O\!\left(\frac{D}{\epsilon}\cdot\frac{L^* + \epsilon}{\epsilon}\cdot\log\frac{1}{\epsilon}\right)$$
• Extends to bounded real-valued loss in terms of VC-subgraph dim
From Parametric to Scale Sensitive

$L(h) = \mathbb{E}\big[\mathrm{loss}(h(x), y)\big], \qquad h \in \mathcal{H}$

• Instead of VC-dim or VC-subgraph-dim (≈ #params), rely on metric scale to control complexity, e.g.:
  $$\mathcal{H} = \{\, w \mapsto \langle w, x\rangle \;:\; \|w\|_2 \le B \,\}$$
• Learning depends on:
  • Metric complexity measures: fat shattering dimension, covering numbers, Rademacher Complexity
  • Scale sensitivity of loss (bound on derivatives or "margin")
• For $\mathcal{H}$ with Rademacher Complexity $\mathcal{R}_m \le \sqrt{R/m}$, and $|\mathrm{loss}'| \le G$:
  $$L(\hat{h}) \le L^* + 2G\mathcal{R}_m + \sqrt{\frac{\log(2/\delta)}{2m}} \le L^* + O\!\left(\sqrt{\frac{G^2 R + \log(2/\delta)}{2m}}\right)$$
  e.g. for the linear class above, $\mathcal{R}_m(\mathcal{H}) = \sqrt{\dfrac{B^2\sup\|x\|^2}{m}}$, i.e. $R = B^2\sup\|x\|^2$
Non-Parametric Optimistic Rate for Smooth Loss

• Theorem: for any $\mathcal{H}$ with (worst case) Rademacher Complexity $\mathcal{R}_m(\mathcal{H})$, and any smooth loss with $|\mathrm{loss}''| \le H$, $\mathrm{loss} \le b$, w.p. $1-\delta$ over $m$ samples an optimistic guarantee holds (see the "Smooth" row of the table below).
• Sample complexity: see the table below.
Parametric vs Non-Parametric

                                        Parametric                                     Scale-Sensitive
                                        $\dim(\mathcal{H}) \le D$, $\|h\| \le 1$       $\mathcal{R}_m(\mathcal{H}) \le \sqrt{R/m}$

Lipschitz: $|\phi'| \le G$              $\frac{GD}{m} + \sqrt{\frac{L^* GD}{m}}$       $\sqrt{\frac{G^2 R}{m}}$
(e.g. hinge, $\ell_1$)

Smooth: $|\phi''| \le H$                $\frac{HD}{m} + \sqrt{\frac{L^* HD}{m}}$       $\frac{HR}{m} + \sqrt{\frac{L^* HR}{m}}$
(e.g. logistic, Huber, smoothed hinge)

Smooth & strongly convex:               $\frac{H}{\mu}\cdot\frac{HD}{m}$               $\frac{HR}{m} + \sqrt{\frac{L^* HR}{m}}$
$\mu \le \phi'' \le H$ (e.g. square loss)

Min-max tight up to poly-log factors
Optimistic Learning Guarantees

$$L(\hat{h}) \le (1+\alpha)L^* + \left(1 + \frac{1}{\alpha}\right)\tilde{O}\!\left(\frac{R}{m}\right) \qquad\qquad m(\epsilon) \le \tilde{O}\!\left(\frac{R}{\epsilon}\cdot\frac{L^* + \epsilon}{\epsilon}\right)$$

✓ Parametric classes
✓ Scale-sensitive classes with smooth loss
✓ Perceptron guarantee
✓ Margin bounds
✓ Stability-based guarantees with smooth loss
✓ Online Learning/Optimization with smooth loss
× Non-parametric (scale-sensitive) classes with non-smooth loss
× Online Learning/Optimization with non-smooth loss
Why Optimistic Guarantees?
$$L(\hat{h}) \le (1+\alpha)L^* + \left(1 + \frac{1}{\alpha}\right)\tilde{O}\!\left(\frac{R}{m}\right) \qquad\qquad m(\epsilon) \le \tilde{O}\!\left(\frac{R}{\epsilon}\cdot\frac{L^* + \epsilon}{\epsilon}\right)$$

• The optimistic regime is typically the relevant regime:
  • Approximation error $L^*$ ≈ Estimation error $\epsilon$
  • If $\epsilon \ll L^*$, better to spend energy on lowering the approximation error (use a more complex class)
• Often important in highlighting true phenomena
Part III: Nearest Neighbor Classification
The Nearest Neighbor Classifier

• Training sample $S = (x_1,y_1),\dots,(x_m,y_m)$
• Want to predict the label of a new point $x$
• The Nearest Neighbor Rule:
  • Find the closest training point: $i = \arg\min_i \rho(x, x_i)$
  • Predict the label of $x$ as $y_i$
• As a learning rule: $NN(S) = h$ where $h(x) = y_{\arg\min_i \rho(x, x_i)}$
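A minimal Python sketch of the 1-NN rule; the Euclidean default for $\rho$ is just one possible choice, as the following slides emphasize:

```python
import numpy as np

def nearest_neighbor_rule(S_x, S_y, x, rho=None):
    """Sketch of the 1-NN rule: return the label of the training point
    closest to x under the distance rho (Euclidean by default)."""
    if rho is None:
        rho = lambda a, b: np.linalg.norm(a - b)             # rho(x, x') = ||x - x'||_2
    i = min(range(len(S_x)), key=lambda j: rho(x, S_x[j]))   # argmin_i rho(x, x_i)
    return S_y[i]                                            # predict y_i
```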
Where is the Bias Hiding?

[Figure: the nearest neighbor of a point changes with the choice of distance, e.g. under $\|x - x'\|_1$, $\|x - x'\|_2$, $\|x - x'\|_\infty$, or $\|\tilde{\phi}(x) - \tilde{\phi}(x')\|_2$ with the rescaled representation $\tilde{\phi}(x) = (5x[1], x[2])$.]

• What is the right "distance" between images? Between sound waves? Between sentences?
• Option 1: $\rho(x, x') = \|\phi(x) - \phi(x')\|_2$
  • What representation $\phi(x)$?
  • Maybe a different distance? $\|\phi(x) - \phi(x')\|_1$? $\|\phi(x) - \phi(x')\|_\infty$? $\sin(\angle(\phi(x), \phi(x')))$? $KL(\phi(x)\,\|\,\phi(x'))$?
• Option 2: Special-purpose distance measure on $x$
  • E.g. edit distance, deformation measure, etc.
• Find the closest training point: $i = \arg\min_i \rho(x, x_i)$; predict the label of $x$ as $y_i$
Nearest Neighbor Learning Guarantee

• Optimal predictor: $h^* = \arg\min L_{\mathcal{D}}(h)$,
  $$h^*(x) = \begin{cases} +1, & \eta(x) > 0.5 \\ -1, & \eta(x) < 0.5 \end{cases} \qquad \eta(x) = P_{\mathcal{D}}(y = 1 \mid x)$$
• For the NN rule with $\rho(x,x') = \|\phi(x) - \phi(x')\|_2$, $\phi: \mathcal{X} \to [0,1]^d$, and $|\eta(x) - \eta(x')| \le c_{\mathcal{D}}\cdot\rho(x,x')$:
  $$\mathbb{E}_{S\sim\mathcal{D}^m}\big[L(NN(S))\big] \le 2L(h^*) + \frac{4\,c_{\mathcal{D}}\sqrt{d}}{\sqrt[d+1]{m}}$$
Data Fit / Complexity Tradeoff

• k-Nearest Neighbor: predict according to the majority among the $k$ closest points from $S$ (see the sketch below).
  $$\mathbb{E}_{S\sim\mathcal{D}^m}\big[L(NN(S))\big] \le 2L(h^*) + \frac{4\,c_{\mathcal{D}}\sqrt{d}}{\sqrt[d+1]{m}}$$
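A minimal Python sketch of the k-NN majority-vote rule (with the same caveats about the choice of $\rho$ as for 1-NN):

```python
import numpy as np
from collections import Counter

def knn_rule(S_x, S_y, x, k, rho=None):
    """Sketch of the k-NN rule: majority vote among the k training points
    closest to x under the distance rho (Euclidean by default)."""
    if rho is None:
        rho = lambda a, b: np.linalg.norm(a - b)
    dists = [rho(x, xi) for xi in S_x]
    nearest = np.argsort(dists)[:k]                  # indices of the k closest points
    votes = Counter(S_y[i] for i in nearest)
    return votes.most_common(1)[0][0]                # majority label
```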
k-Nearest Neighbor: Data Fit / Complexity Tradeoff

[Figure: decision boundaries of the k-NN rule on a sample S, compared to the optimal predictor h*, for k = 1, 5, 12, 50, 100, 200.]
k-Nearest Neighbor Guarantee

• For k-NN with $\rho(x,x') = \|\phi(x) - \phi(x')\|_2$, $\phi: \mathcal{X} \to [0,1]^d$, and $|\eta(x) - \eta(x')| \le c_{\mathcal{D}}\cdot\rho(x,x')$:
  $$\mathbb{E}_{S\sim\mathcal{D}^m}\big[L(NN_k(S))\big] \le \left(1 + \sqrt{\frac{8}{k}}\right) L(h^*) + \frac{6\,c_{\mathcal{D}}\sqrt{d} + k}{\sqrt[d+1]{m}}$$
• Should increase $k$ with the sample size $m$
  • The above theory suggests $k_m \propto L(h^*)^{2/3}\cdot m^{\frac{2}{3(d+1)}}$
• "Universal" learning: for any "smooth" $\mathcal{D}$ and representation $\phi(\cdot)$ (with continuous $P(y\mid\phi(x))$), if we increase $k$ slowly enough, we will eventually converge to the optimal $L(h^*)$
• Very non-uniform: sample complexity depends not only on $h^*$, but also on $\mathcal{D}$
Uniform and Non-Uniform Learnability

• Definition: A hypothesis class $\mathcal{H}$ is agnostically PAC-learnable if there exists a learning rule $A$ such that $\forall \epsilon,\delta > 0$, $\exists m(\epsilon,\delta)$, $\forall \mathcal{D}$, $\forall h$, w.p. $\ge 1-\delta$ over $S \sim \mathcal{D}^{m(\epsilon,\delta)}$:
  $$L_{\mathcal{D}}(A(S)) \le L_{\mathcal{D}}(h) + \epsilon$$
• Definition: A hypothesis class $\mathcal{H}$ is non-uniformly learnable if there exists a learning rule $A$ such that $\forall \epsilon,\delta > 0$, $\forall h$, $\exists m(\epsilon,\delta,h)$, $\forall \mathcal{D}$, w.p. $\ge 1-\delta$ over $S \sim \mathcal{D}^{m(\epsilon,\delta,h)}$:
  $$L_{\mathcal{D}}(A(S)) \le L_{\mathcal{D}}(h) + \epsilon$$
• Definition: A hypothesis class $\mathcal{H}$ is "consistently learnable" if there exists a learning rule $A$ such that $\forall \epsilon,\delta > 0$, $\forall h$, $\forall \mathcal{D}$, $\exists m(\epsilon,\delta,h,\mathcal{D})$, w.p. $\ge 1-\delta$ over $S \sim \mathcal{D}^{m(\epsilon,\delta,h,\mathcal{D})}$:
  $$L_{\mathcal{D}}(A(S)) \le L_{\mathcal{D}}(h) + \epsilon$$

Realizable/Optimistic guarantees: $\mathcal{D}$-dependence through $L_{\mathcal{D}}(h)$