On Computational Thinking, Inferential Thinking and “Data Science”

Michael I. Jordan
University of California, Berkeley

September 24, 2016


What Is the Big Data Phenomenon?

•  Science in confirmatory mode (e.g., particle physics)
   –  inferential issue: massive number of nuisance variables
•  Science in exploratory mode (e.g., astronomy, genomics)
   –  inferential issue: massive number of hypotheses
•  Measurement of human activity, particularly online activity, is generating massive datasets that can be used (e.g.) for personalization and for creating markets
   –  inferential issues: many, including heterogeneity, unknown sampling frames, and compound loss functions

A Job Description, circa 2015

•  Your Boss: “I need a Big Data system that will replace our classic service with a personalized service”
•  “It should work reasonably well for anyone and everyone; I can tolerate a few errors but not too many dumb ones that will embarrass us”
•  “It should run just as fast as our classic service”
•  “It should only improve as we collect more data; in particular it shouldn’t slow down”
•  “There are serious privacy concerns of course, and they vary across the clients”

Some Challenges Driven by Big Data

•  Big Data analysis requires a thorough blending of computational thinking and inferential thinking
•  What I mean by computational thinking:
   –  abstraction, modularity, scalability, robustness, etc.
•  Inferential thinking means (1) considering the real-world phenomenon behind the data, (2) considering the sampling pattern that gave rise to the data, and (3) developing procedures that will go “backwards” from the data to the underlying phenomenon

The Challenges are Daunting

•  The core theories in computer science and statistics developed separately, and there is an “oil and water” problem
•  Core statistical theory doesn’t have a place for runtime and other computational resources
•  Core computational theory doesn’t have a place for statistical risk

Outline

•  Inference under privacy constraints
•  Inference under communication constraints
•  The variational perspective


Part I: Inference and Privacy

with John Duchi and Martin Wainwright

Privacy and Data Analysis

•  Individuals are not generally willing to allow their personal data to be used without control over how it will be used and how much privacy loss they will incur
•  “Privacy loss” can be quantified via differential privacy
•  We want to trade privacy loss against the value we obtain from “data analysis”
•  The question becomes that of quantifying such value and juxtaposing it with privacy loss

Privacy

[Diagram: an analyst’s query is answered from the raw database, and the same query is answered from a privatized database produced by a channel Q.]

Classical problem in differential privacy: show that the answer computed from the raw database and the answer computed from the privatized database are close, under constraints on Q.

Inference

[Diagram: a query about an underlying distribution P is answered using a sample S drawn from P.]

Classical problem in statistical theory: show that the answer computed from the sample S and the corresponding population quantity under P are close, under constraints on S.

Privacy and Inference

[Diagram: a sample S is drawn from a distribution P, a channel Q produces a privatized database from S, and queries are answered from the privatized data.]

The privacy-meets-inference problem: show that the answer computed from the privatized data and the corresponding population quantity under P are close, under constraints on both Q and S.

Background on Inference

•  In the 1930s, Wald laid the foundations of statistical decision theory
•  Given a family of distributions 𝒫, a parameter θ(P) for each P ∈ 𝒫, an estimator θ̂, and a loss ℓ(θ̂, θ(P)), define the risk:

      R_P(θ̂) := E_P[ ℓ(θ̂, θ(P)) ]

•  Minimax principle [Wald ’39, ’43]: choose the estimator minimizing the worst-case risk:

      sup_{P ∈ 𝒫} E_P[ ℓ(θ̂, θ(P)) ]
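As a concrete illustration of these definitions, here is a minimal sketch that estimates the worst-case risk of the sample mean by Monte Carlo; the Bernoulli family, squared-error loss, and grid are assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

def risk(theta, n, n_trials=20000):
    """Monte Carlo estimate of R_P(theta_hat) = E_P[(theta_hat - theta)^2]
    for P = Bernoulli(theta) and theta_hat the mean of n i.i.d. draws."""
    estimates = rng.binomial(n, theta, size=n_trials) / n
    return np.mean((estimates - theta) ** 2)

n = 100
family = np.linspace(0.05, 0.95, 19)           # grid standing in for the family P
worst = max(risk(theta, n) for theta in family)
print("worst-case risk:", worst)               # ~ 0.25/n, attained near theta = 1/2
```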

Local Privacy

[Diagram: individuals with private data each pass their datum through a private channel; the estimator sees only the channel outputs.]

Differential Privacy

Definition [Dwork, McSherry, Nissim, Smith ’06]: a channel Q is α-differentially private if

      sup over inputs x, x′ and output events S of  Q(Z ∈ S | x) / Q(Z ∈ S | x′)  ≤  e^α

Given Z, one cannot reliably discriminate between x and x′.
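The simplest α-differentially private channel is randomized response; the sketch below (an illustration, not the mechanism analyzed on the following slides) releases a bit that equals the true bit with probability e^α/(1 + e^α), so the likelihood ratio for any output is exactly e^α.

```python
import numpy as np

rng = np.random.default_rng(0)

def randomized_response(x, alpha):
    """Release bit x unchanged with prob e^a/(1+e^a), else flip it.
    For any output z and inputs x, x': P(z | x) / P(z | x') <= e^alpha."""
    p_keep = np.exp(alpha) / (1.0 + np.exp(alpha))
    return x if rng.random() < p_keep else 1 - x

alpha = 0.5
p = np.exp(alpha) / (1.0 + np.exp(alpha))
print("likelihood ratio:", p / (1 - p))   # equals e^alpha = 1.6487...
```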

Private Minimax Risk

Central objects of study:
•  Parameter θ(P) of distribution P
•  Family of distributions 𝒫
•  Loss ℓ measuring error
•  Family 𝒬_α of α-private channels

The α-private minimax risk is the minimax risk under the privacy constraint, optimizing jointly over the best α-private channel and the best estimator:

      M_n(θ(𝒫), α, ℓ) := inf_{Q ∈ 𝒬_α} inf_{θ̂} sup_{P ∈ 𝒫} E_P[ ℓ(θ̂, θ(P)) ]

Vignette: Private Mean Estimation

Example: estimate reasons for hospital visits. Patients are admitted to the hospital for substance abuse; estimate the prevalence of different substances.

One patient’s binary record, alongside the population proportions to be estimated:

      Alcohol        1      proportion = .45
      Cocaine        1      proportion = .32
      Heroin         0      proportion = .16
      Cannabis       0      proportion = .20
      LSD            0      proportion = .00
      Amphetamines   0      proportion = .02

Vignette: Mean Estimation

Consider estimation of the mean θ(P) = E[X], with errors measured in the ℓ∞-norm.

Proposition: the classical minimax rate is achieved by the sample mean.

Proposition: the α-private minimax rate is polynomially slower in the dimension d.

Note: privacy reduces the effective sample size from n to nα²/d.

Optimal mechanism?

Idea 1: add independent noise to the non-private observation (e.g., the Laplace mechanism) [Dwork et al. ’06]

Problem: the noise magnitude is much too large in high dimensions (and this is unavoidable: the approach is provably sub-optimal)
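A sketch of Idea 1, under the assumption that each record lies in [0,1]^d so that its ℓ1-sensitivity is d: adding per-coordinate Laplace noise of scale d/α yields an α-differentially private release, but the noise magnitude grows with the dimension, illustrating the problem.

```python
import numpy as np

rng = np.random.default_rng(0)

def laplace_mechanism(x, alpha):
    """Privatize a record x in [0,1]^d by adding Laplace noise of scale
    d/alpha per coordinate (the l1-sensitivity of x is d)."""
    d = x.shape[0]
    return x + rng.laplace(scale=d / alpha, size=d)

alpha = 1.0
for d in (2, 20, 200):
    x = rng.random(d)
    noise = np.abs(laplace_mechanism(x, alpha) - x).max()
    print(f"d = {d:3d}:  max noise magnitude ~ {noise:8.1f}")  # grows with d
```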

Optimal mechanism

[Diagram: the non-private observation and two candidate privatized views; the closer view shares 3 coordinates with the observation (“3 overlap”), the farther view shares 2 (“2 overlap”).]

•  Draw a candidate output uniformly (from the set shown in the original figure)
•  With a probability calibrated to the privacy level, choose the closer of the two candidates to the observation; otherwise, choose the farther one
•  At the end: compute the sample average and de-bias it
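The privatize-average-de-bias pipeline can be sketched with a simpler channel than the optimal mechanism above: per-coordinate randomized response, followed by an explicit de-biasing step at the aggregator. The mechanism, privacy level, and sample size here are assumptions for illustration, and composition across coordinates is ignored.

```python
import numpy as np

rng = np.random.default_rng(0)

def privatize(X, alpha):
    """Per-coordinate randomized response on binary records X (n x d)."""
    p = np.exp(alpha) / (1.0 + np.exp(alpha))   # keep each bit w.p. p
    keep = rng.random(X.shape) < p
    return np.where(keep, X, 1 - X)

def debiased_mean(Z, alpha):
    """Invert the response bias: E[z] = (2p - 1) * theta + (1 - p)."""
    p = np.exp(alpha) / (1.0 + np.exp(alpha))
    return (Z.mean(axis=0) - (1 - p)) / (2 * p - 1)

theta = np.array([.45, .32, .16, .20, .00, .02])   # proportions from the vignette
X = (rng.random((50000, 6)) < theta).astype(float)
Z = privatize(X, alpha=1.0)
print(np.round(debiased_mean(Z, 1.0), 3))          # close to theta
```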

Empirical evidence

Estimate the proportion of emergency room visits involving different substances.

Data source: Drug Abuse Warning Network (sample size n)

Additional Examples

•  Fixed-design regression
•  Convex risk minimization
•  Multinomial estimation
•  Nonparametric density estimation

Almost always, the effective sample size reduction is n ↦ nα²/d.

Part II: Inference and Compression

with Yuchen Zhang, John Duchi and Martin Wainwright

Page 43: On Computational Thinking, Inferential Thinking and “Data ...publish.illinois.edu/frontiers-big-data-symposium/files/2016/10/Jordan-uiuc.pdfcomputational thinking and inferential

Communication Constraints • Large data necessitates distributed storage•  Independent data collection (e.g., hospitals)• Privacy

Setting: each of agents has sample of size

Messages to fusion center

Question: tradeoffs between communication and statistical

utility?

[Yao 79; Abelson 80;Tsitsiklis and Luo 87; Han & Amari 98; Tatikonda & Mitter 04; ...]

Minimax Communication

Central objects of study:
•  Parameter θ(P) of distribution P
•  Family of distributions 𝒫
•  Loss ℓ

The minimax risk with B-bounded communication: the infimum of the worst-case risk over protocols whose messages are constrained to total fewer than B bits, optimized jointly with the estimator.

Vignette: Mean Estimation

Consider estimation in the normal location family, with each message constrained to B bits.

Theorem: when each agent has a sample of size n, the classical (unconstrained) minimax rate is attainable only when B is sufficiently large; with B-bounded communication the minimax rate degrades below that threshold.

Consequence: there is a minimal number of bits that each agent must send for optimal estimation.
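A sketch of the setting, with hypothetical parameters: each of m agents quantizes its local sample mean to B bits per coordinate and the fusion center averages the quantized messages; shrinking B visibly degrades the estimate, illustrating the tradeoff (the quantizer and its range are assumptions, not the protocol from the theorem).

```python
import numpy as np

rng = np.random.default_rng(0)

def quantize(v, B, lo=-4.0, hi=4.0):
    """Uniformly quantize each coordinate of v to B bits on [lo, hi]."""
    levels = 2 ** B
    idx = np.clip(np.round((v - lo) / (hi - lo) * (levels - 1)), 0, levels - 1)
    return lo + idx * (hi - lo) / (levels - 1)

m, n, d = 100, 50, 5                      # agents, per-agent sample size, dimension
theta = rng.normal(size=d)                # unknown normal location parameter
for B in (1, 2, 4, 8):
    local = [rng.normal(theta, 1.0, (n, d)).mean(axis=0) for _ in range(m)]
    fused = np.mean([quantize(mu, B) for mu in local], axis=0)
    print(f"B = {B}:  error = {np.linalg.norm(fused - theta):.4f}")
```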

Computation and Inference

•  How does inferential quality trade off against classical computational resources such as time and space?
•  Hard!

Computation and Inference: Mechanisms and Bounds

•  Tradeoffs via convex relaxations
   –  linking runtime to convex geometry and risk to convex geometry
•  Tradeoffs via concurrency control
   –  optimistic concurrency control
•  Bounds via optimization oracles
   –  number of accesses to a gradient as a surrogate for computation
•  Bounds via communication complexity
•  Tradeoffs via subsampling
   –  bag of little bootstraps, variational consensus Monte Carlo

Part III: A Variational Framework for Accelerated Methods in Optimization

with Andre Wibisono and Ashia Wilson

Accelerated gradient descent

Setting: unconstrained convex optimization

      min_{x ∈ ℝ^d} f(x)

•  Classical gradient descent:

      x_{k+1} = x_k − β∇f(x_k)

   obtains a convergence rate of O(1/k)

•  Accelerated gradient descent:

      y_{k+1} = x_k − β∇f(x_k)
      x_{k+1} = (1 − λ_k) y_{k+1} + λ_k y_k

   obtains the (optimal) convergence rate of O(1/k²)
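A minimal numerical sketch of the two updates on a toy quadratic; the objective, step size, and the choice λ_k = −(k − 1)/(k + 2) (which makes the update above coincide with the usual Nesterov momentum form) are assumptions for illustration.

```python
import numpy as np

A = np.diag([1.0, 100.0])                  # ill-conditioned quadratic
f = lambda x: 0.5 * x @ A @ x
grad = lambda x: A @ x
beta = 1.0 / 100.0                         # 1/L for this objective

x_gd = np.array([1.0, 1.0])
x, y_prev = np.array([1.0, 1.0]), np.array([1.0, 1.0])
for k in range(1, 201):
    x_gd = x_gd - beta * grad(x_gd)        # classical gradient descent
    y = x - beta * grad(x)                 # accelerated: gradient step...
    lam = -(k - 1) / (k + 2)               # ...then x_{k+1} = (1-lam) y_{k+1} + lam y_k
    x = (1 - lam) * y + lam * y_prev
    y_prev = y
print(f"GD:  f = {f(x_gd):.2e}")           # ~ 1/k decay
print(f"AGD: f = {f(y):.2e}")              # ~ 1/k^2 decay, markedly smaller
```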

The acceleration phenomenon

Two classes of algorithms:

•  Gradient methods
   –  gradient descent, mirror descent, cubic-regularized Newton’s method (Nesterov and Polyak ’06), etc.
   –  greedy descent methods, relatively well understood
•  Accelerated methods
   –  Nesterov’s accelerated gradient descent, accelerated mirror descent, accelerated cubic-regularized Newton’s method (Nesterov ’08), etc.
   –  important for both theory (the optimal rate for first-order methods) and practice (many extensions: FISTA, the stochastic setting, etc.)
   –  not descent methods; faster than gradient methods; still mysterious

Accelerated methods

•  Analysis using Nesterov’s estimate-sequence technique
•  Common interpretation as “momentum methods” (Euclidean case)
•  Many proposed explanations:
   –  Chebyshev polynomials (Hardt ’13)
   –  linear coupling (Allen-Zhu and Orecchia ’14)
   –  optimized first-order methods (Drori and Teboulle ’14; Kim and Fessler ’15)
   –  geometric shrinking (Bubeck, Lee, Singh ’15)
   –  universal catalyst (Lin, Mairal, Harchaoui ’15)
   –  ...
   but these apply only to strongly convex functions or to first-order methods

Question: what is the underlying mechanism that generates acceleration (including for higher-order methods)?

Accelerated methods: continuous-time perspective

•  Gradient descent is a discretization of gradient flow

      Ẋ_t = −∇f(X_t)

   (and mirror descent is a discretization of natural gradient flow)

•  Su, Boyd, Candès ’14: the continuous-time limit of accelerated gradient descent is a second-order ODE

      Ẍ_t + (3/t) Ẋ_t + ∇f(X_t) = 0

•  These ODEs are obtained by taking continuous-time limits. Is there a deeper generative mechanism?

Our work: a general variational approach to acceleration, with a systematic discretization methodology.
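A quick numerical check of the Su-Boyd-Candès ODE (the quadratic objective and the simple semi-implicit Euler integrator are assumptions for illustration): along the trajectory, f(X_t) decays at roughly a 1/t² rate, i.e., t²·f(X_t) stays bounded.

```python
import numpy as np

A = np.diag([1.0, 10.0])
f = lambda x: 0.5 * x @ A @ x
grad = lambda x: A @ x

# Integrate X'' + (3/t) X' + grad f(X) = 0 as a first-order system in (X, V),
# starting just after t = 0 to avoid the 3/t singularity.
dt = 1e-3
X, V = np.array([1.0, 1.0]), np.zeros(2)
for step in range(1, 40001):
    t = step * dt
    X = X + dt * V
    V = V + dt * (-(3.0 / t) * V - grad(X))
    if step in (1000, 5000, 40000):
        print(f"t = {t:5.1f}   f(X_t) = {f(X):.3e}   t^2 f(X_t) = {t * t * f(X):.3f}")
```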

Bregman Lagrangian

Define the Bregman Lagrangian:

      𝓛(x, ẋ, t) = e^{γ_t + α_t} ( D_h(x + e^{−α_t} ẋ, x) − e^{β_t} f(x) )

•  A function of position x, velocity ẋ, and time t
•  D_h(y, x) = h(y) − h(x) − ⟨∇h(x), y − x⟩ is the Bregman divergence
•  h is the convex distance-generating function
•  f is the convex objective function
•  α_t, β_t, γ_t ∈ ℝ are arbitrary smooth functions
•  In the Euclidean setting (h = ½‖·‖²), it simplifies to the damped Lagrangian

      𝓛(x, ẋ, t) = e^{γ_t − α_t} ( ½‖ẋ‖² − e^{2α_t + β_t} f(x) )

Ideal scaling conditions:

      β̇_t ≤ e^{α_t},   γ̇_t = e^{α_t}
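A small sketch of the building block D_h for two standard choices of h (both chosen here for illustration): the Euclidean h, whose divergence is ½‖y − x‖², and the negative entropy, whose divergence is the KL divergence.

```python
import numpy as np

def bregman(h, grad_h, y, x):
    """D_h(y, x) = h(y) - h(x) - <grad h(x), y - x>; nonnegative for convex h."""
    return h(y) - h(x) - grad_h(x) @ (y - x)

# Euclidean: h(x) = 0.5 ||x||^2  =>  D_h(y, x) = 0.5 ||y - x||^2
h_euc, g_euc = lambda x: 0.5 * x @ x, lambda x: x

# Negative entropy on the simplex  =>  D_h(y, x) = KL(y || x)
h_ent, g_ent = lambda x: x @ np.log(x), lambda x: np.log(x) + 1.0

x, y = np.array([0.2, 0.3, 0.5]), np.array([0.4, 0.4, 0.2])
print(bregman(h_euc, g_euc, y, x), 0.5 * np.sum((y - x) ** 2))  # equal
print(bregman(h_ent, g_ent, y, x), np.sum(y * np.log(y / x)))   # equal (KL)
```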

Bregman Lagrangian

      𝓛(x, ẋ, t) = e^{γ_t + α_t} ( D_h(x + e^{−α_t} ẋ, x) − e^{β_t} f(x) )

Variational problem over curves:

      min_X ∫ 𝓛(X_t, Ẋ_t, t) dt

The optimal curve is characterized by the Euler-Lagrange equation:

      d/dt { ∂𝓛/∂ẋ (X_t, Ẋ_t, t) } = ∂𝓛/∂x (X_t, Ẋ_t, t)

E-L equation for the Bregman Lagrangian under ideal scaling:

      Ẍ_t + (e^{α_t} − α̇_t) Ẋ_t + e^{2α_t + β_t} [∇²h(X_t + e^{−α_t} Ẋ_t)]^{−1} ∇f(X_t) = 0

General convergence rate

Theorem: under ideal scaling, the E-L equation has convergence rate

      f(X_t) − f(x*) ≤ O(e^{−β_t})

Proof: exhibit a Lyapunov function for the dynamics:

      𝓔_t = D_h(x*, X_t + e^{−α_t} Ẋ_t) + e^{β_t} (f(X_t) − f(x*))

      𝓔̇_t = −e^{α_t + β_t} D_f(x*, X_t) + (β̇_t − e^{α_t}) e^{β_t} (f(X_t) − f(x*)) ≤ 0

Note: this only requires convexity and differentiability of f and h.
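This Lyapunov argument is easy to check numerically; the sketch below (for the p = 2 Euclidean case, where the E-L equation is Ẍ + (3/t)Ẋ + ∇f(X) = 0 with e^{−α_t} = t/2, e^{β_t} = t²/4, h = ½‖·‖², and x* = 0; all of these instantiations are assumptions for illustration) tracks 𝓔_t along the integrated trajectory.

```python
import numpy as np

A = np.diag([1.0, 10.0])
f = lambda x: 0.5 * x @ A @ x
grad = lambda x: A @ x

def energy(X, V, t):
    """E_t = D_h(x*, X + e^{-alpha_t} V) + e^{beta_t} (f(X) - f*), with
    x* = 0, f* = 0, e^{-alpha_t} = t/2, e^{beta_t} = t^2/4."""
    Z = X + (t / 2.0) * V
    return 0.5 * Z @ Z + (t * t / 4.0) * f(X)

dt = 1e-3
X, V = np.array([1.0, 1.0]), np.zeros(2)
prev, worst_increase = np.inf, 0.0
for step in range(1, 20001):
    t = step * dt
    X = X + dt * V
    V = V + dt * (-(3.0 / t) * V - grad(X))
    E = energy(X, V, t)
    worst_increase = max(worst_increase, E - prev)
    prev = E
print("largest increase of E_t:", worst_increase)  # ~0 up to discretization error
print("final E_t:", round(prev, 4))
```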

Polynomial convergence rate

For p > 0, choose the parameters:

      α_t = log p − log t
      β_t = p log t + log C
      γ_t = p log t

The E-L equation then has an O(e^{−β_t}) = O(1/t^p) convergence rate:

      Ẍ_t + ((p + 1)/t) Ẋ_t + C p² t^{p−2} [∇²h(X_t + (t/p) Ẋ_t)]^{−1} ∇f(X_t) = 0

For p = 2:

•  Recover the result of Krichene et al. with its O(1/t²) convergence rate
•  In the Euclidean case, recover the ODE of Su et al.:

      Ẍ_t + (3/t) Ẋ_t + ∇f(X_t) = 0

Time dilation property (reparameterizing time)

p = 2 (accelerated gradient descent), rate O(1/t²):

      Ẍ_t + (3/t) Ẋ_t + 4C [∇²h(X_t + (t/2) Ẋ_t)]^{−1} ∇f(X_t) = 0

            ↓ speed up time: Y_t = X_{t^{3/2}}

p = 3 (accelerated cubic-regularized Newton’s method), rate O(1/t³):

      Ÿ_t + (4/t) Ẏ_t + 9Ct [∇²h(Y_t + (t/3) Ẏ_t)]^{−1} ∇f(Y_t) = 0

•  All accelerated methods are traveling the same curve in space-time at different speeds
•  Gradient methods don’t have this property
   –  from gradient flow to rescaled gradient flow: replace ½‖·‖² by (1/p)‖·‖^p

Time dilation for general Bregman Lagrangian

      O(e^{−β_t}) : E-L for the Lagrangian 𝓛_{α,β,γ}

            ↓ speed up time: Y_t = X_{τ(t)}

      O(e^{−β_{τ(t)}}) : E-L for the Lagrangian 𝓛_{α̃,β̃,γ̃}

where

      α̃_t = α_{τ(t)} + log τ̇(t)
      β̃_t = β_{τ(t)}
      γ̃_t = γ_{τ(t)}

Question: how do we discretize the E-L equation while preserving the convergence rate?

Discretizing the dynamics (naive approach)

Write the E-L equation as a system of first-order equations:

      Z_t = X_t + (t/p) Ẋ_t
      d/dt ∇h(Z_t) = −C p t^{p−1} ∇f(X_t)

Euler discretization with time step δ > 0 (i.e., set x_k = X_t, x_{k+1} = X_{t+δ}):

      x_{k+1} = (p/(k + p)) z_k + (k/(k + p)) x_k
      z_k = argmin_z { C p k^{(p−1)} ⟨∇f(x_k), z⟩ + (1/ε) D_h(z, z_{k−1}) }

with step size ε = δ^p, where k^{(p−1)} = k(k + 1)⋯(k + p − 2) is the rising factorial.

Naive discretization doesn’t work

      x_{k+1} = (p/(k + p)) z_k + (k/(k + p)) x_k
      z_k = argmin_z { C p k^{(p−1)} ⟨∇f(x_k), z⟩ + (1/ε) D_h(z, z_{k−1}) }

We cannot obtain a convergence guarantee, and the scheme is empirically unstable.

[Figure: a spiraling trajectory in the plane, and a plot of f(x_k) versus k (up to k ≈ 160) that oscillates rather than converging.]
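A sketch contrasting the naive discretization with the modified one from the next slide, for p = 2 and Euclidean h, where the z-update has the closed form z_k = z_{k−1} − εCpk ∇f(·) and the auxiliary step is an ordinary gradient step; the objective and the constants C and ε are assumptions for illustration.

```python
import numpy as np

A = np.diag([1.0, 50.0])
f = lambda x: 0.5 * x @ A @ x
grad = lambda x: A @ x

def run(modified, K=200, C=0.25, eps=0.01):
    """p = 2, h = 0.5||.||^2: z-update is a gradient step, k^{(1)} = k."""
    x = z = np.array([1.0, 1.0])
    for k in range(1, K + 1):
        y = x - (eps / 2.0) * grad(x) if modified else x  # auxiliary sequence y_k
        z = z - eps * C * 2 * k * grad(y)                 # z_k update
        x = (2.0 / (k + 2)) * z + (k / (k + 2.0)) * y     # x_{k+1} update
    return f(x)

print("naive:    f =", run(modified=False))  # oscillates, fails to converge
print("modified: f =", run(modified=True))   # remains stable and converges
```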

Modified discretization

Introduce an auxiliary sequence y_k:

      x_{k+1} = (p/(k + p)) z_k + (k/(k + p)) y_k
      z_k = argmin_z { C p k^{(p−1)} ⟨∇f(y_k), z⟩ + (1/ε) D_h(z, z_{k−1}) }

Sufficient condition (how do we ensure this? see the next slide):

      ⟨∇f(y_k), x_k − y_k⟩ ≥ M ε^{1/(p−1)} ‖∇f(y_k)‖_*^{p/(p−1)}

Assume h is uniformly convex: D_h(y, x) ≥ (1/p) ‖y − x‖^p

Theorem:

      f(y_k) − f(x*) ≤ O(1/(εk^p))

Note the matching convergence rates: 1/(εk^p) = 1/(δk)^p = 1/t^p.

The proof uses a generalization of Nesterov’s estimate-sequence technique.

Higher-order gradient update

Higher-order Taylor approximation of f:

      f_{p−1}(y; x) = f(x) + ⟨∇f(x), y − x⟩ + ⋯ + (1/(p−1)!) ∇^{p−1} f(x) (y − x)^{p−1}

Higher-order gradient update:

      y_k = argmin_y { f_{p−1}(y; x_k) + (2/(εp)) ‖y − x_k‖^p }

Assume f is smooth of order p − 1:

      ‖∇^{p−1} f(y) − ∇^{p−1} f(x)‖_* ≤ (1/ε) ‖y − x‖

Lemma: the update then satisfies the sufficient condition

      ⟨∇f(y_k), x_k − y_k⟩ ≥ M ε^{1/(p−1)} ‖∇f(y_k)‖_*^{p/(p−1)}

We can use this lemma to complete the modified discretization process!
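For p = 2 the higher-order gradient update has the closed form y = x − (ε/2)∇f(x), an ordinary gradient step; the sketch below checks this against a numerical minimization (the quadratic f is an assumption for illustration).

```python
import numpy as np
from scipy.optimize import minimize

A = np.diag([1.0, 3.0])
f = lambda x: 0.5 * x @ A @ x
grad = lambda x: A @ x

eps, x = 0.1, np.array([1.0, -2.0])
# p = 2: f_1(y; x) + (2/(2*eps)) ||y - x||^2, with f_1 the linearization of f
obj = lambda y: f(x) + grad(x) @ (y - x) + (1.0 / eps) * np.sum((y - x) ** 2)
y_numeric = minimize(obj, x).x
y_closed = x - (eps / 2.0) * grad(x)   # closed-form update for p = 2
print(y_numeric, y_closed)             # agree
```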

Accelerated higher-order gradient method

      x_{k+1} = (p/(k + p)) z_k + (k/(k + p)) y_k
      y_k = argmin_y { f_{p−1}(y; x_k) + (2/(εp)) ‖y − x_k‖^p }      ← O(1/(εk^{p−1})) alone
      z_k = argmin_z { C p k^{(p−1)} ⟨∇f(y_k), z⟩ + (1/ε) D_h(z, z_{k−1}) }

If ∇^{p−1} f is (1/ε)-Lipschitz and h is uniformly convex of order p, then:

      f(y_k) − f(x*) ≤ O(1/(εk^p))      ← accelerated rate

•  p = 2: accelerated gradient/mirror descent
•  p = 3: accelerated cubic-regularized Newton’s method (Nesterov ’08)
•  p ≥ 2: accelerated higher-order methods

Recap: Gradient vs. accelerated methods

How do we design dynamics for minimizing a convex function f?

Continuous time:

•  Rescaled gradient flow:

      Ẋ_t = −∇f(X_t) / ‖∇f(X_t)‖_*^{(p−2)/(p−1)}      rate O(1/t^{p−1})

•  Polynomial Euler-Lagrange equation:

      Ẍ_t + ((p + 1)/t) Ẋ_t + t^{p−2} [∇²h(X_t + (t/p) Ẋ_t)]^{−1} ∇f(X_t) = 0      rate O(1/t^p)

Discrete time (when ∇^{p−1} f is (1/ε)-Lipschitz):

•  Higher-order gradient method: O(1/(εk^{p−1})), matching the flow’s rate with ε = δ^{p−1} ⇔ δ = ε^{1/(p−1)}
•  Accelerated higher-order method: O(1/(εk^p)), matching the E-L rate with ε = δ^p ⇔ δ = ε^{1/p}

Summary: Bregman Lagrangian

•  A Bregman Lagrangian family with a general convergence guarantee:

      𝓛(x, ẋ, t) = e^{γ_t + α_t} ( D_h(x + e^{−α_t} ẋ, x) − e^{β_t} f(x) )

•  The polynomial subfamily generates accelerated higher-order methods: O(1/t^p) convergence rates via higher-order smoothness
•  The exponential subfamily: O(e^{−ct}) rates via uniform convexity
•  Understand the structure and properties of the Bregman Lagrangian: gauge invariance, symmetry, gradient flows as limit points, etc.
•  Bregman Hamiltonian:

      𝓗(x, p, t) = e^{α_t + γ_t} ( D_{h*}(∇h(x) + e^{−γ_t} p, ∇h(x)) + e^{β_t} f(x) )

Discussion

•  Many conceptual and mathematical challenges arise in taking seriously the problem of “Big Data”
•  Facing these challenges will require a rapprochement between “computational thinking” and “inferential thinking”
   –  bringing computational and inferential fields together at the level of their foundations