
Minimax and Bayesian experimental design: Bridging the gap between
statistical and worst-case approaches to least squares regression

Michael W. Mahoney
ICSI and Department of Statistics, UC Berkeley

Joint work with Michał Dereziński, Feynman Liang, Manfred Warmuth, and Ken Clarkson

September 2019

1 / 40

Outline

Correcting the bias in least squares regression

Volume-rescaled sampling

Minimax experimental design

Bayesian experimental design

Conclusions

2 / 40

Bias of the least-squares estimator

[Figure: scatter plot of (x, y) data with a fitted regression line]

S = (x1, y1), . . . , (xn, yn)  i.i.d. ∼ D

Statistical regression:
    y = x · w∗ + ξ,   E[ξ] = 0

w∗(S) = argmin_w Σ_i (xi · w − yi)²

Unbiased!   E[w∗(S)] = w∗

3 / 40

Bias of the least-squares estimator

[Figure: scatter plot of (x, y) data with a fitted regression line]

S = (x1, y1), . . . , (xn, yn)  i.i.d. ∼ D

Worst-case regression:
    w∗ = argmin_w E_D[(x · w − y)²]

w∗(S) = argmin_w Σ_i (xi · w − yi)²

Biased!   E[w∗(S)] ≠ w∗

3 / 40

Correcting the worst-case bias

[Figure: scatter plot of (x, y) data with the added point xn+1 marked]

S = (x1, y1), . . . , (xn, yn)  i.i.d. ∼ D

Worst-case regression:
    Sample xn+1 ∼ x² · DX
    Query  yn+1 ∼ DY | x = xn+1
    S′ ← S ∪ {(xn+1, yn+1)}

Unbiased!   E[w∗(S′)] = w∗

4 / 40
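This one-dimensional correction is easy to sanity-check numerically. The sketch below is a minimal illustration (not the authors' code): it takes a finite pool of (x, y) pairs with a deterministic nonlinear response as D, so that plain i.i.d. least squares is biased, and compares it against the estimator that augments each sample with one extra point drawn with probability proportional to x².

```python
import numpy as np

rng = np.random.default_rng(0)

# Finite distribution D over (x, y) pairs with a deterministic, nonlinear response.
N = 50
xs = rng.uniform(-2, 2, size=N)
ys = xs + 0.5 * xs**3                       # fixed responses; no linear model holds
w_star = np.dot(xs, ys) / np.dot(xs, xs)    # argmin_w E_D[(x*w - y)^2]

n, trials = 5, 100_000
p_sq = xs**2 / np.sum(xs**2)                # Pr(x_{n+1} = xs[i]) proportional to xs[i]^2

plain, augmented = [], []
for _ in range(trials):
    S = rng.integers(0, N, size=n)                            # i.i.d. sample from D
    plain.append(np.dot(xs[S], ys[S]) / np.dot(xs[S], xs[S]))
    Sp = np.append(S, rng.choice(N, p=p_sq))                  # add x_{n+1} ~ x^2 * D_X, query y_{n+1}
    augmented.append(np.dot(xs[Sp], ys[Sp]) / np.dot(xs[Sp], xs[Sp]))

print("w*             :", w_star)
print("mean of w*(S)  :", np.mean(plain))      # generally differs from w* (bias)
print("mean of w*(S') :", np.mean(augmented))  # matches w* up to Monte Carlo error, per the slide
```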

In general: add dimension-many points (Dereziński and Warmuth)

Worst-case regression in d dimensions

S = (x1, y1), . . . , (xn, yn)  i.i.d. ∼ D,   (x, y) ∈ Rd × R

Estimate the optimum:
    w∗ = argmin_{w ∈ Rd} E_D[(x⊤w − y)²]

Volume-rescaled sampling:
    Sampled points:  (xn+1, . . . , xn+d) drawn with density ∝ det(X̄)² · (DX)^d, where X̄ has rows xn+1⊤, . . . , xn+d⊤
    Query yn+i ∼ DY | x = xn+i  for i = 1..d
    Add S̄ = (xn+1, yn+1), . . . , (xn+d, yn+d) to S

Theorem:  E[w∗(S ∪ S̄)] = w∗,  even though  E[w∗(S)] ≠ w∗

5 / 40

Effect of correcting the bias

Let w̄ = (1/T) Σ_{t=1}^{T} w∗(St), for independent samples S1, . . . , ST.

Question: Is the estimation error ‖w̄ − w∗‖ converging to 0?

Example:  x⊤ = (x1, . . . , x5),  xi i.i.d. ∼ N(0, 1),
          y = Σ_{i=1}^{5} ( xi + xi³/3 ) + ε     (the xi³/3 term is the nonlinearity)

[Figure: log–log plot of estimation error vs. number of averaged estimators, comparing
 i.i.d. samples and i.i.d. + volume, for k = 10, 20, 40.]

6 / 40
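Below is a minimal sketch of the i.i.d. half of this experiment (not the authors' code; it omits the volume-rescaled correction, whose sampler is the nontrivial part). It averages T independent size-k least-squares estimators from the model above and compares the average to the best linear predictor w∗ estimated from one very large sample; the averaged i.i.d. estimator stalls at a nonzero error, which is the bias the volume augmentation removes.

```python
import numpy as np

rng = np.random.default_rng(0)
d, sigma = 5, 1.0

def draw(m):
    """Draw m samples from the slide's example: x ~ N(0, I_5), y = sum_i (x_i + x_i^3/3) + eps."""
    X = rng.standard_normal((m, d))
    y = (X + X**3 / 3).sum(axis=1) + sigma * rng.standard_normal(m)
    return X, y

# Population-optimal linear predictor w*, estimated from one very large sample.
Xb, yb = draw(1_000_000)
w_star = np.linalg.lstsq(Xb, yb, rcond=None)[0]

k, T = 10, 2000
acc = np.zeros(d)
for t in range(T):
    X, y = draw(k)                                  # i.i.d. sample S_t of size k
    acc += np.linalg.lstsq(X, y, rcond=None)[0]     # w*(S_t)
w_bar = acc / T

print("||w_bar - w*|| =", np.linalg.norm(w_bar - w_star))
# The error stops shrinking as T grows: the i.i.d.-only estimator is biased.
```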

Discussion

• First-of-a-kind unbiased estimator for random designs, different from RandNLA sampling theory

• The augmentation uses a determinantal point process (DPP) we call volume-rescaled sampling

• There are many efficient DPP algorithms

• A new mathematical framework for computing expectations

Key application: Experimental design

• Bridge the gap between statistical and worst-case perspectives

7 / 40

Outline

Correcting the bias in least squares regression

Volume-rescaled sampling

Minimax experimental design

Bayesian experimental design

Conclusions

8 / 40

Volume-rescaled sampling (Dereziński and Warmuth)

x1, x2, . . . , xk — i.i.d. random vectors sampled from x ∼ DX

DX^k — distribution of the random k × d matrix X with rows x1⊤, . . . , xk⊤

Volume-rescaled sampling of size k from DX:

    VS^k_DX(X) ∝ det(X⊤X) · DX^k(X)

Note: For k = d, we have det(X⊤X) = det(X)².

Question: What is the normalization factor of VS^k_DX?

    E_{DX^k}[ det(X⊤X) ] = ??

Can find it through a new proof of the Cauchy–Binet formula!

9 / 40
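The normalization can be checked from the Cauchy–Binet formula plus independence of the rows: det(X⊤X) = Σ_{|S|=d} det(XS)², and each term has expectation d! · det(E[xx⊤]), so E_{DX^k}[det(X⊤X)] = k(k−1)···(k−d+1) · det(E[xx⊤]). A quick Monte Carlo sanity check of that identity (a sketch using a Gaussian choice of DX; any row distribution with a finite second moment works):

```python
import numpy as np
from math import factorial

rng = np.random.default_rng(0)
d, k, trials = 3, 7, 200_000

# Example row distribution D_X with second-moment matrix Sigma = E[x x^T].
A = rng.standard_normal((d, d))
Sigma = A @ A.T / d

X = rng.multivariate_normal(np.zeros(d), Sigma, size=(trials, k))   # trials x k x d
M = np.einsum('tki,tkj->tij', X, X)                                  # batched X^T X
lhs = np.linalg.det(M).mean()                                        # Monte Carlo E[det(X^T X)]
rhs = factorial(k) / factorial(k - d) * np.linalg.det(Sigma)         # k(k-1)...(k-d+1) * det(Sigma)
print(f"E[det(X^T X)] ~ {lhs:.4f}   vs   k!/(k-d)! * det(Sigma) = {rhs:.4f}")
```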

The decomposition of volume-rescaled sampling (Dereziński and Warmuth)

Let X ∼ VS^k_DX and let S ⊆ [k] be a random size-d set such that

    Pr(S | X) ∝ det(XS)².

Then:
• XS ∼ VS^d_DX,
• X[k]\S ∼ DX^{k−d},
• S is uniformly random,
and the three are independent.

[Figure: the random k × d matrix X with the d rows indexed by S highlighted]

10 / 40

Consequences for least squares (Dereziński and Warmuth)

Theorem ([DWH19])
Let S = (x1, y1), . . . , (xk, yk)  i.i.d. ∼ D^k, for any k ≥ 0.

    Sample x̄1, . . . , x̄d ∼ VS^d_DX,
    Query  ȳi ∼ DY | x = x̄i  for i = 1..d.

Then for S̄ = (x̄1, ȳ1), . . . , (x̄d, ȳd),

    E[w∗(S ∪ S̄)] = E_{S ∼ D^k}[ E_{S̄ ∼ VS^d_D}[ w∗(S ∪ S̄) ] ]
  (decomposition) = E_{S ∼ VS^{k+d}_D}[ w∗(S) ]
  (d-modularity)  = w∗.

11 / 40

Outline

Correcting the bias in least squares regression

Volume-rescaled sampling

Minimax experimental design

Bayesian experimental design

Conclusions

12 / 40

Classical statistical regression

We consider n parameterized experiments: x1, . . . , xn ∈ Rd.
Each experiment has a real random outcome Yi, for i = 1..n.

Classical setup:
    Yi = xi⊤w∗ + ξi,   E[ξi] = 0,   Var[ξi] = σ²,   Cov[ξi, ξj] = 0 for i ≠ j

The ordinary least squares estimator wLS = X⁺Y satisfies:

(unbiasedness)                    E[wLS] = w∗

(mean squared error)              MSE(wLS) = E‖wLS − w∗‖² = σ² tr((X⊤X)⁻¹)
                                           = (b/n) · E‖ξ‖²,   letting b = tr((X⊤X)⁻¹)

(mean squared prediction error)   MSPE(wLS) = E‖X(wLS − w∗)‖² = σ² d
                                            = (d/n) · E‖ξ‖²

13 / 40
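A small Monte Carlo sketch (mine, not from the talk) checking both identities under the classical setup, using Gaussian noise as a convenient special case of the zero-mean, uncorrelated, equal-variance assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, sigma = 40, 5, 0.7

X = rng.standard_normal((n, d))          # fixed design
w_star = rng.standard_normal(d)
Xp = np.linalg.pinv(X)                   # X^+ (Moore-Penrose pseudoinverse)

trials = 100_000
noise = sigma * rng.standard_normal((trials, n))
Y = X @ w_star + noise                   # one response vector per trial (row)
W = Y @ Xp.T                             # OLS estimates w_LS = X^+ y, one per row

mse  = np.mean(np.sum((W - w_star) ** 2, axis=1))
mspe = np.mean(np.sum(((W - w_star) @ X.T) ** 2, axis=1))

print("MSE :", mse,  " vs  sigma^2 tr((X^T X)^-1) =", sigma**2 * np.trace(np.linalg.inv(X.T @ X)))
print("MSPE:", mspe, " vs  sigma^2 d              =", sigma**2 * d)
```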

Experimental design in classical setting (summary)

Suppose we have a budget of k experiments out of the n choices.
Goal: Select a subset of k experiments, S ⊆ [n].
Question: How large does k need to be so that:

    Excess estimation error (MSE or MSPE)  ≤  ε · Total noise (E‖ξ‖²) ?

Denote L∗ = E‖ξ‖² = nσ².

Prior result: There is a design (S, w) of size k s.t. E[wS] = w∗ and:

    MSE(wS)  − MSE(wLS)  ≤ ε · L∗,   for k ≥ d + b/ε,
    MSPE(wS) − MSPE(wLS) ≤ ε · L∗,   for k ≥ d + d/ε,

where b = tr((X⊤X)⁻¹).

14 / 40

Experimental design in general setting (summary)

No assumptions on Yi.

We define w∗ := E[wLS] = X⁺E[Y].

Define the "total noise" as L∗ := E‖ξ‖², where ξ := Xw∗ − Y.

Theorem 1 (MSE). There is a random design (S, w) such that E[wS] = w∗ and

    MSE(wS) − MSE(wLS) ≤ ε · L∗,   for k = O(d log n + b/ε),

where b = tr((X⊤X)⁻¹).

Theorem 2 (MSPE). There is a random design (S, w) such that E[wS] = w∗ and

    MSPE(wS) − MSPE(wLS) ≤ ε · L∗,   for k = O(d log n + d/ε).

15 / 40

Classical experimental design

Consider n parameterized experiments: x1, . . . , xn ∈ Rd.
Each experiment has a real random response yi such that:

    yi = xi⊤w∗ + ξi,   ξi ∼ N(0, σ²)

Goal: Select k ≪ n experiments to best estimate w∗.

Select S = {4, 6, 9};  receive y4, y6, y9.

[Figure: the matrix X with rows x4⊤, x6⊤, x9⊤ selected, and the response vector y with entries y4, y6, y9 revealed]

16 / 40

A-optimal design

Find an unbiased estimator w with smallest mean squared error:

    min_w  max_{w∗}  E_w[ ‖w − w∗‖² ]      (= MSE[w])
    subject to  E[w] = w∗  for all w∗

Given every y1, . . . , yn, the optimum is least squares: w = X†y

    MSE[X†y] = tr(Var[X†y]) = σ² tr((X⊤X)⁻¹)

A-optimal design:   min_{S : |S| ≤ k}  tr((XS⊤XS)⁻¹)

Typical required assumption:  yi = xi⊤w∗ + ξi,   ξi ∼ N(0, σ²)

17 / 40

A-optimal design

Find an unbiased estimator w with smallest mean squared error:

    min_w  max_{w∗}  E_w[ ‖w − w∗‖² ]      (= MSE[w])
    subject to  E[w] = w∗  for all w∗

Given {yi : i ∈ S}, the optimum is least squares: w = XS†yS

    MSE[XS†yS] = tr(Var[XS†yS]) = σ² tr((XS⊤XS)⁻¹)

A-optimal design:   min_{S : |S| ≤ k}  tr((XS⊤XS)⁻¹)

Typical required assumption:  yi = xi⊤w∗ + ξi,   ξi ∼ N(0, σ²)

17 / 40
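For small n and k the A-optimality criterion can be searched by brute force, which makes the definition concrete. The sketch below is purely illustrative (the number of subsets grows combinatorially, which is why the relaxations and sampling schemes later in the talk matter):

```python
import numpy as np
from itertools import combinations

rng = np.random.default_rng(0)
n, d, k = 12, 3, 5
X = rng.standard_normal((n, d))

def a_value(S):
    """A-optimality objective tr((X_S^T X_S)^{-1}) for a subset S of rows."""
    XS = X[list(S)]
    return np.trace(np.linalg.inv(XS.T @ XS))

best_S = min(combinations(range(n), k), key=a_value)
print("A-optimal subset:", best_S, " value:", a_value(best_S))
print("Full-data value tr((X^T X)^-1):", np.trace(np.linalg.inv(X.T @ X)))
```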

A-optimal design: a simple guarantee

Theorem (Avron and Boutsidis, 2013)
For any X and k ≥ d there is S of size k such that:

    tr((XS⊤XS)⁻¹)  ≤  (n − d + 1)/(k − d + 1) · tr((X⊤X)⁻¹)        (denote φ := tr((X⊤X)⁻¹))

Corollary. If y = Xw∗ + ξ where Var[ξ] = σ²I and E[ξ] = 0  (is this necessary?),  then

    tr(Var[XS†yS])  =  σ² tr((XS⊤XS)⁻¹)  ≤  σ² · (n−d+1)/(k−d+1) · φ  ≤  (φ/(k−d+1)) · tr(Var[ξ]),

with ε := φ/(k−d+1) and tr(Var[ξ]) = nσ².  So k = d + φ/ε gives MSE[XS†yS] ≤ ε · tr(Var[ξ]).

18 / 40

General response model (What if ξi is not N(0, σ²)?)

Fn — the set of all random vectors in Rn with finite second moment

For y ∈ Fn:

    w∗ := argmin_w E_y[ ‖Xw − y‖² ] = X†E[y],

    ξ_{y|X} := y − Xw∗ = y − XX†E[y]     (deviation from the best linear predictor)

Two special cases:

1. Statistical regression:  E[ξ_{y|X}] = 0    (mean-zero noise)
2. Worst-case regression:   Var[ξ_{y|X}] = 0  (deterministic y)

19 / 40

Random experimental designs

Statistical: a fixed S is OK.
Worst-case: a fixed S can be exploited by the adversary.

Definition
A random experimental design (S, w) of size k is:

1. a random set variable S ⊆ {1..n} such that |S| ≤ k,
2. a (jointly with S) random function w : R^|S| → Rd.

Mean squared error of a random experimental design (S, w):

    MSE[w(yS)] = E_{S,w,y}[ ‖w(yS) − w∗‖² ]

Wk(X) — the family of unbiased random experimental designs (S, w):

    E_{S,w,y}[ w(yS) ] = X†E[y] = w∗    for all y ∈ Fn

20 / 40

Main result

Theorem
For any ε > 0, there is a random experimental design (S, w) of size

    k = O(d log n + φ/ε),   where φ = tr((X⊤X)⁻¹),

such that (S, w) ∈ Wk(X) (unbiasedness) and for any y ∈ Fn

    MSE[w(yS)] − MSE[X†y]  ≤  ε · E[ ‖ξ_{y|X}‖² ]

Toy example: Var[ξ_{y|X}] = σ²I, E[ξ_{y|X}] = 0

1. E[‖ξ_{y|X}‖²] = tr(Var[ξ_{y|X}])
2. MSE[X†y] = (φ/n) · tr(Var[ξ_{y|X}])

21 / 40

Important special instances

1. Statistical regression:  y = Xw∗ + ξ,  E[ξ] = 0

       MSE[w(yS)] − MSE[X†y]  ≤  ε · tr(Var[ξ])

   • Weighted regression: Var[ξ] = diag([σ1², . . . , σn²])
   • Generalized regression: Var[ξ] is arbitrary
   • Bayesian regression: w∗ ∼ N(0, I)

2. Worst-case regression:  y is any fixed vector in Rn

       E_{S,w}[ ‖w(yS) − w∗‖² ]  ≤  ε · ‖y − Xw∗‖²,    where w∗ = X†y

22 / 40

Main result: proof outline

1. Volume sampling:
   • to get unbiasedness and expected bounds
   • control MSE in the tail of the distribution
   1.1 well-conditioned matrices
   1.2 unbiased estimators

2. Error bounds via i.i.d. sampling:
   • to bound the sample size k
   • control MSE in the bulk of the distribution
   2.1 Leverage score sampling:  Pr(i) := (1/d) xi⊤(X⊤X)⁻¹xi
   2.2 Inverse score sampling:   Pr(i) := (1/φ) xi⊤(X⊤X)⁻²xi   (new)

3. Proving expected error bounds for least squares

23 / 40

Volume sampling

Definition
Given a full-rank matrix X ∈ Rn×d we define volume sampling VS(X) as a distribution over sets S ⊆ [n] of size d:

    Pr(S) = det(XS)² / det(X⊤X).

Pr(S) is proportional to the squared volume of the parallelepiped spanned by {xi : i ∈ S}.

Computational cost:  O(nnz(X) log n + d⁴ log d)

[Figure: parallelepiped spanned by two rows xi and xj]

24 / 40
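For small n, the definition can be implemented literally by enumerating all size-d subsets; by Cauchy–Binet, Σ_S det(XS)² = det(X⊤X), so the weights below normalize to a probability distribution. This sketch is only an illustration — the O(nnz(X) log n + d⁴ log d) cost quoted above refers to much faster algorithms.

```python
import numpy as np
from itertools import combinations

rng = np.random.default_rng(0)
n, d = 10, 3
X = rng.standard_normal((n, d))

subsets = list(combinations(range(n), d))
weights = np.array([np.linalg.det(X[list(S)]) ** 2 for S in subsets])

# Cauchy-Binet: sum_S det(X_S)^2 = det(X^T X), so this normalizes to a distribution.
probs = weights / np.linalg.det(X.T @ X)
print("probabilities sum to", probs.sum())

# Draw one volume sample: Pr(S) = det(X_S)^2 / det(X^T X).
S = subsets[rng.choice(len(subsets), p=probs)]
print("sampled subset:", S)
```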

Unbiased estimators via volume sampling

Under an arbitrary response model, any i.i.d. sampling is biased.

Theorem ([DWH19])
Volume sampling corrects the least squares bias of i.i.d. sampling.
Let q = (q1, . . . , qn) be some i.i.d. importance sampling distribution.

    volume + i.i.d.:   xi1, . . . , xid ∼ VS(X),    xi_{d+1}, xi_{d+2}, . . . , xi_k ∼ q^{k−d}

    E[ argmin_w  Σ_{t=1}^{k} (1/q_{i_t}) (x_{i_t}⊤ w − y_{i_t})² ]  =  w∗_{y|X}

25 / 40
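Once the indices and the proposal probabilities q are in hand, the reweighted objective in the theorem is an ordinary weighted least-squares problem. The sketch below shows only that final solve; the index array `idx` is a placeholder here (in the actual scheme it would consist of d volume-sampled indices followed by k − d i.i.d. draws from q, which is the nontrivial part):

```python
import numpy as np

def reweighted_ls(X, y, idx, q):
    """Solve argmin_w sum_t (1/q[idx[t]]) * (x_{idx[t]}^T w - y[idx[t]])^2
    by rescaling the selected rows with the square roots of the weights."""
    w = 1.0 / np.sqrt(q[idx])                 # sqrt weights 1/sqrt(q_i)
    Xw = X[idx] * w[:, None]
    yw = y[idx] * w
    return np.linalg.lstsq(Xw, yw, rcond=None)[0]

# Example with placeholder inputs (idx would come from volume + i.i.d. sampling).
rng = np.random.default_rng(0)
n, d, k = 100, 4, 12
X = rng.standard_normal((n, d))
y = X @ rng.standard_normal(d) + 0.1 * rng.standard_normal(n)
q = np.full(n, 1.0 / n)                       # e.g., uniform proposal probabilities
idx = rng.choice(n, size=k, replace=True)     # placeholder indices, for illustration only
print(reweighted_ls(X, y, idx, q))
```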

Key idea: volume-rescaled importance sampling

Simple volume-rescaled sampling:
• Let DX be a uniformly random xi.
• (XS, yS) ∼ VS^k_D and w = XS†yS.
Then, E[w] = w∗_{y|X}.

[Figure: fixed matrix X with rows x4⊤, x6⊤, x9⊤ selected and responses y4, y6, y9 revealed]

Problem: Not robust to worst-case noise.
Solution: Volume-rescaled importance sampling
• Let p = (p1, . . . , pn) be an importance sampling distribution,
• Define x ∼ DX as x = (1/√pi) xi for i ∼ p.
Then, for (XS, yS) ∼ VS^k_D and w = XS†yS, we have E[w] = w∗_{y|X}.

26 / 40

Importance sampling for experimental design

1. Leverage score sampling:  Pr(i) = p_i^lev := (1/d) xi⊤(X⊤X)⁻¹xi
   A standard sampling method for worst-case linear regression.

2. Inverse score sampling:   Pr(i) = p_i^inv := (1/φ) xi⊤(X⊤X)⁻²xi
   A novel sampling technique, essential for achieving O(φ/ε) sample size.

27 / 40
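Both distributions are straightforward to compute from X; the sketch below does so with explicit inverses for clarity (a numerically careful version would go through a QR or SVD of X). The sums confirm that they are probability distributions, since Σi xi⊤(X⊤X)⁻¹xi = d and Σi xi⊤(X⊤X)⁻²xi = φ.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 8, 3
X = rng.standard_normal((n, d))

C1 = np.linalg.inv(X.T @ X)        # (X^T X)^{-1}
C2 = C1 @ C1                       # (X^T X)^{-2}
phi = np.trace(C1)

p_lev = np.einsum('ij,jk,ik->i', X, C1, X) / d     # (1/d)   x_i^T (X^T X)^{-1} x_i
p_inv = np.einsum('ij,jk,ik->i', X, C2, X) / phi   # (1/phi) x_i^T (X^T X)^{-2} x_i

print("leverage scores sum to", p_lev.sum())   # = 1, since sum_i x_i^T (X^T X)^{-1} x_i = d
print("inverse scores  sum to", p_inv.sum())   # = 1, since sum_i x_i^T (X^T X)^{-2} x_i = phi
```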

Minimax A-optimality and Minimax experimental design

Definition
Minimax A-optimal value for experimental design:

    R∗k(X)  :=  min_{(S,w) ∈ Wk(X)}   max_{y ∈ Fn \ Sp(X)}   ( MSE[w(yS)] − MSE[X†y] ) / E[ ‖ξ_{y|X}‖² ]

Fact. X†y is the minimum variance unbiased estimator for Fn:
if E_{y,w}[w(y)] = X†E[y] for all y ∈ Fn, then Var[w(y)] ⪰ Var[X†y] for all y ∈ Fn.

• If d ≤ k ≤ n, then R∗k(X) ∈ [0, ∞)
• If k ≥ C · d log n, then R∗k(X) ≤ C · φ/k for some C
• If k² < εnd/3, then R∗k(X) ≥ (1 − ε) · φ/k for some X

28 / 40

Alternative: mean squared prediction error

Definition.  MSPE[w] = E[ ‖X(w − w∗)‖² ]    (V-optimality)

Theorem
There is (S, w) of size k = O(d log n + d/ε) s.t. for any y ∈ Fn,

    MSPE[w(yS)] − MSPE[X†y]  ≤  ε · E[ ‖ξ_{y|X}‖² ]

Follows from the MSE bound by reduction to X⊤X = I.
Then MSPE[w] = MSE[w] and φ = d.

Minimax V-optimal value:

    min_{(S,w) ∈ Wk(X)}   max_{y ∈ Fn \ Sp(X)}   ( MSPE[w(yS)] − MSPE[X†y] ) / E[ ‖ξ_{y|X}‖² ]

29 / 40

Questions about minimax experimental design

1. Can R∗k(X) be found, exactly or approximately?

2. What happens in the regime of k ≤ C · d log n?

3. Can we restrict Wk(X) to only tractable experimental designs?

4. Does the minimax value change when we restrict Fn?

4.1 Weighted regression

4.2 Generalized regression

4.3 Bayesian regression

4.4 Worst-case regression

30 / 40

Reduction to worst-case regression

Theorem
W.l.o.g. we can replace random y ∈ Fn with fixed y ∈ Rn:

    R∗k(X) = min_{(S,w) ∈ Wk(X)}   max_{y ∈ Rn \ Sp(X)}   E_{S,w}[ ‖w(yS) − X†y‖² ] / ‖y − XX†y‖²

Suppose (S, w), for all fixed response vectors y ∈ Rn, satisfies

    E[w(yS)] = X†y    and    E[ ‖w(yS) − X†y‖² ] ≤ ε · ‖y − XX†y‖².

Then, for all random response vectors y ∈ Fn and all w∗ ∈ Rd,

    E[ ‖w(yS) − w∗‖² ]  ≤  E[ ‖X†y − w∗‖² ]  +  ε · E[ ‖y − Xw∗‖² ],
       (= MSE[w(yS)])       (= MSE[X†y])

31 / 40

Outline

Correcting the bias in least squares regression

Volume-rescaled sampling

Minimax experimental design

Bayesian experimental design

Conclusions

32 / 40

Bayesian experimental design

Consider n parameterized experiments: x1, . . . , xn ∈ Rd.
Each experiment has a real random response yi such that:

    yi = xi⊤w∗ + ξi,   ξi ∼ N(0, σ²),   w∗ ∼ N(0, σ²A⁻¹)

Goal: Select k ≪ n experiments to best estimate w∗.

Select S = {4, 6, 9};  receive y4, y6, y9.

[Figure: fixed matrix X with rows x4⊤, x6⊤, x9⊤ selected, and the response vector y with entries y4, y6, y9 revealed]

33 / 40

Bayesian A-optimal design

Given the Bayesian assumptions, we have

    w | yS ∼ N( (XS⊤XS + A)⁻¹ XS⊤yS,   σ² (XS⊤XS + A)⁻¹ ).

Bayesian A-optimality criterion:

    fA(XS⊤XS) = tr((XS⊤XS + A)⁻¹).

Goal: Efficiently find a subset S of size k such that:

    fA(XS⊤XS)  ≤  (1 + ε) · min_{S′ : |S′| = k} fA(XS′⊤XS′)      (= OPTk)

34 / 40

Relaxation to a semi-definite program

SDP relaxation
The following can be found via an SDP solver in polynomial time:

    p∗ = argmin_{p1,...,pn}  fA( Σ_{i=1}^{n} pi xi xi⊤ ),
    subject to  0 ≤ pi ≤ 1 for all i,   Σ_i pi = k.

The solution p∗ satisfies fA( Σ_i p∗i xi xi⊤ ) ≤ OPTk.

Question: For what k can we efficiently round this to an S of size k?

35 / 40
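The relaxation can be set up directly with an off-the-shelf convex solver. The sketch below uses cvxpy (an assumption — any SDP-capable modeling tool works) and expresses fA via the matrix_frac atom, since tr((M + A)⁻¹) = matrix_frac(I, M + A) is convex in the affine matrix M(p) = Σi pi xi xi⊤:

```python
import numpy as np
import cvxpy as cp

rng = np.random.default_rng(0)
n, d, k = 30, 4, 8
X = rng.standard_normal((n, d))
A = 0.5 * np.eye(d)                            # example prior precision from the Bayesian model

p = cp.Variable(n)
M = X.T @ cp.diag(p) @ X                       # sum_i p_i x_i x_i^T, affine in p
objective = cp.matrix_frac(np.eye(d), M + A)   # = tr((M + A)^{-1}), convex in p
constraints = [p >= 0, p <= 1, cp.sum(p) == k]

prob = cp.Problem(cp.Minimize(objective), constraints)
prob.solve()
print("relaxed value f_A(sum_i p_i x_i x_i^T) =", prob.value)
print("fractional design p* =", np.round(p.value, 3))
```

Any indicator vector of a size-k subset is feasible here, so the optimal value lower-bounds OPTk; the rounding question on this slide is how to go back from the fractional p∗ to an actual subset.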

Efficient rounding for effective-dimension-many points

Definition
Define the A-effective dimension as dA = tr( X⊤X (X⊤X + A)⁻¹ ) ≤ d.

Theorem ([DLM19])
If k = Ω( dA/ε + log(1/ε)/ε² ), then there is a polynomial-time algorithm that finds a subset S of size k such that

    fA(XS⊤XS) ≤ (1 + ε) · OPTk.

Remark: Extends to other Bayesian criteria: C/D/V-optimality.

Key idea: Rounding with A-regularized volume-rescaled sampling, a new kind of determinantal point process.

36 / 40
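The A-effective dimension is a one-line computation; the toy sketch below takes A = λI (an illustrative choice) and shows dA shrinking from d as the regularization grows, which is what drives the smaller budget in the theorem above.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 100, 10
X = rng.standard_normal((n, d))
M = X.T @ X

for lam in [0.0, 1.0, 10.0, 100.0]:
    A = lam * np.eye(d)
    d_A = np.trace(M @ np.linalg.inv(M + A))   # d_A = tr(X^T X (X^T X + A)^{-1}) <= d
    print(f"lambda = {lam:6.1f}   d_A = {d_A:.3f}")
```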

Comparison with prior work

                        Criteria            Bayesian    k = Ω(·)
[WYS17]                 A, V                x           d²/ε
[AZLSW17]               A, C, D, E, G, V                d/ε²
[NSTT19]                A, D                x           d/ε + log(1/ε)/ε²
our result [DLM19]      A, C, D, V                      dA/ε + log(1/ε)/ε²

37 / 40

Outline

Correcting the bias in least squares regression

Volume-rescaled sampling

Minimax experimental design

Bayesian experimental design

Conclusions

38 / 40

Conclusions

Unbiased estimators for least squares, using volume sampling

Recent developments:

• Experimental design without any noise assumptions, i.e., arbitrary response
• Minimax experimental design: bridging the gap between statistical and worst-case perspectives
• Applications in Bayesian experimental design: bridging the gap between experimental design and determinantal point processes

Going beyond least squares:

• extensions to non-square losses,
• applications in distributed optimization.

39 / 40

References

Haim Avron and Christos Boutsidis. Faster subset selection for matrices and applications. SIAM Journal on Matrix Analysis and Applications, 34(4):1464–1499, 2013.

Zeyuan Allen-Zhu, Yuanzhi Li, Aarti Singh, and Yining Wang. Near-optimal design of experiments via regret minimization. In Proceedings of the 34th International Conference on Machine Learning, volume 70 of Proceedings of Machine Learning Research, pages 126–135, Sydney, Australia, August 2017.

Michał Dereziński, Kenneth L. Clarkson, Michael W. Mahoney, and Manfred K. Warmuth. Minimax experimental design: Bridging the gap between statistical and worst-case approaches to least squares regression. In Proceedings of the 32nd Conference on Learning Theory, 2019.

Michał Dereziński, Feynman Liang, and Michael W. Mahoney. Distributed estimation of the inverse Hessian by determinantal averaging. arXiv e-prints (to appear), June 2019.

Michał Dereziński and Manfred K. Warmuth. Reverse iterative volume sampling for linear regression. Journal of Machine Learning Research, 19(23):1–39, 2018.

Michał Dereziński, Manfred K. Warmuth, and Daniel Hsu. Correcting the bias in least squares regression with volume-rescaled sampling. In Proceedings of the 22nd International Conference on Artificial Intelligence and Statistics, 2019.

Aleksandar Nikolov, Mohit Singh, and Uthaipon Tao Tantipongpipat. Proportional volume sampling and approximation algorithms for A-optimal design. In Proceedings of the Thirtieth Annual ACM-SIAM Symposium on Discrete Algorithms, pages 1369–1386, January 2019.

Yining Wang, Adams W. Yu, and Aarti Singh. On computationally tractable selection of experiments in measurement-constrained regression models. J. Mach. Learn. Res., 18(1):5238–5278, January 2017.

40 / 40