REGRET BOUNDS FOR META BAYESIAN OPTIMIZATION WITH AN UNKNOWN GAUSSIAN PROCESS PRIOR
ZI WANG∗, BEOMJOON KIM∗, LESLIE PACK KAELBLING (∗equal contribution)
ziw, beomjoon, [email protected]
MAIN CONTRIBUTIONS
• A stand-alone Bayesian optimization module that takes in only a multi-task training data set as input and then actively selects inputs to efficiently optimize a new function.
• Constructive analyses of the regret of this module.
BAYESIAN OPTIMIZATION
• Maximize an expensive black-box function $f : X \to \mathbb{R}$ with sequential queries $x_1, \cdots, x_T$ and noisy observations of their values $y_1, \cdots, y_T$.
• Assume a Gaussian process prior $f \sim GP(\mu, k)$.
• Use acquisition functions as the decision criterion for where to query.
• Major problem: the prior is unknown. Existing approaches:
  – maximum likelihood;
  – hierarchical Bayes.
• Current theoretical results break down if the prior is not given.
• What if we have experience with similar functions? Ex. optimizing robot grasps in different environments:
OUR MODEL
[Figure: generic knowledge of functions is captured by a $GP(\mu, k)$ distribution over functions; the training functions $f_1, f_2, \cdots, f_N$ and the new function $f$ are all drawn from it, $f_1, f_2, \cdots, f_N, f \sim GP(\mu, k)$. We observe noisy function values $(x_j, y_{ij})$ for each training function $f_i$, while the values of the new function $f$ are unknown. Goal: optimize $f$ with few queries.]
Our evaluation criteria:
• best-sample simple regret $r_T = \max_{x \in X} f(x) - \max_{t \in [T]} f(x_t)$;
• simple regret $R_T = \max_{x \in X} f(x) - f(x_{t^*})$, $t^* = \arg\max_{t \in [T]} y_t$.
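As a concrete illustration, both regrets can be computed from the query history once the true optimum is known; a minimal numpy sketch, where `f_max`, `f_vals`, and `y_vals` are hypothetical names for the true optimum, the true values at the queried points, and the noisy observations:

```python
import numpy as np

def best_sample_simple_regret(f_max, f_vals):
    # r_T = max_x f(x) - max_{t in [T]} f(x_t): gap between the true
    # optimum and the best true value among the queried points.
    return f_max - np.max(f_vals)

def simple_regret(f_max, f_vals, y_vals):
    # R_T = max_x f(x) - f(x_{t*}), where t* maximizes the *noisy*
    # observations y_t, so R_T can exceed r_T under noise.
    t_star = np.argmax(y_vals)
    return f_max - f_vals[t_star]
```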
GAUSSIAN PROCESSES
Given observations $D_t = \{(x_\tau, y_\tau)\}_{\tau=1}^t$, $y_\tau \sim \mathcal{N}(f(x_\tau), \sigma^2)$, the posterior $GP(\mu_t, k_t)$ satisfies
$$\mu_t(x) = \mu(x) + k(x, \mathbf{x}_t)(k(\mathbf{x}_t) + \sigma^2 I)^{-1}(\mathbf{y}_t - \mu(\mathbf{x}_t)),$$
$$k_t(x, x') = k(x, x') - k(x, \mathbf{x}_t)(k(\mathbf{x}_t) + \sigma^2 I)^{-1} k(\mathbf{x}_t, x'),$$
where $\mathbf{y}_t = [y_\tau]_{\tau=1}^t$, $\mathbf{x}_t = [x_\tau]_{\tau=1}^t$, $\mu(\mathbf{x}) = [\mu(x_i)]_{i=1}^n$, $k(\mathbf{x}, \mathbf{x}') = [k(x_i, x'_j)]_{i \in [n], j \in [n']}$, and $k(\mathbf{x}) = k(\mathbf{x}, \mathbf{x})$.
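A minimal numpy sketch of these posterior updates; `mu` and `k` stand in for whatever prior mean function and kernel are in use:

```python
import numpy as np

def gp_posterior(x_new, X_t, y_t, mu, k, sigma2):
    """Posterior mean and covariance at x_new given t observations (X_t, y_t).

    mu: callable, array of inputs -> vector of prior means.
    k:  callable, two arrays of inputs -> kernel matrix between them.
    """
    K_tt = k(X_t, X_t) + sigma2 * np.eye(len(X_t))   # k(x_t) + sigma^2 I
    K_st = k(x_new, X_t)                             # k(x, x_t)
    alpha = np.linalg.solve(K_tt, y_t - mu(X_t))     # (k(x_t)+sigma^2 I)^{-1}(y_t - mu(x_t))
    mean = mu(x_new) + K_st @ alpha
    cov = k(x_new, x_new) - K_st @ np.linalg.solve(K_tt, K_st.T)
    return mean, cov
```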
MAIN IDEA: USE THE PAST EXPERIENCE TO ESTIMATE THE PRIOR OF $f$
• Offline phase:
  – collect meta training data: $M$ evaluations from each of the $N$ functions sampled from the same prior, $D_N = \{[(x_j, y_{ij})]_{j=1}^M\}_{i=1}^N$, $y_{ij} \sim \mathcal{N}(f_i(x_j), \sigma^2)$, $f_i \sim GP(\mu, k)$;
  – estimate the prior mean function $\mu$ and kernel $k$ from the meta training data.
• Online phase: estimate the posterior mean $\hat{\mu}_t$ and covariance $\hat{k}_t$ and use them for BO on a new function $f \sim GP(\mu, k)$ with a total of $T$ iterations.
ESTIMATE THE PRIOR $GP(\mu, k)$ AND POSTERIOR $GP(\mu_t, k_t)$: DISCRETE AND CONTINUOUS CASES
$X$ is finite: directly estimate the prior and the posterior [1]
Define the observation matrix as $Y = [Y_i]_{i \in [N]} = [y_{ij}]_{i \in [N], j \in [M]}$. Missing entries? Use matrix completion [2].
$$\hat{\mu}(X) = \frac{1}{N} Y^{\mathrm{T}} \mathbf{1}_N \sim \mathcal{N}\Big(\mu(X), \frac{1}{N}(k(X) + \sigma^2 I)\Big), \quad \hat{k}(X) = \frac{1}{N-1}(Y - \mathbf{1}_N \hat{\mu}(X)^{\mathrm{T}})^{\mathrm{T}}(Y - \mathbf{1}_N \hat{\mu}(X)^{\mathrm{T}}) \sim \mathcal{W}\Big(\frac{1}{N-1}(k(X) + \sigma^2 I), N-1\Big),$$
$$\hat{\mu}_t(x) = \hat{\mu}(x) + \hat{k}(x, \mathbf{x}_t)\hat{k}(\mathbf{x}_t, \mathbf{x}_t)^{-1}(\mathbf{y}_t - \hat{\mu}(\mathbf{x}_t)), \quad \hat{k}_t(x, x') = \frac{N-1}{N-t-1}\big(\hat{k}(x, x') - \hat{k}(x, \mathbf{x}_t)\hat{k}(\mathbf{x}_t, \mathbf{x}_t)^{-1}\hat{k}(\mathbf{x}_t, x')\big).$$
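A numpy sketch of these finite-domain estimators, assuming a complete observation matrix `Y` of shape `(N, M)` over a fixed discretization of $X$ (no missing entries, so matrix completion is not needed):

```python
import numpy as np

def estimate_prior(Y):
    # Offline phase: unbiased estimates of mu(X) and k(X) + sigma^2 I
    # from N functions observed at the same M inputs.
    N = Y.shape[0]
    mu_hat = Y.mean(axis=0)               # (1/N) Y^T 1_N
    R = Y - mu_hat                        # centered observations
    k_hat = R.T @ R / (N - 1)             # sample covariance, Wishart-distributed
    return mu_hat, k_hat

def estimate_posterior(mu_hat, k_hat, idx, y_t, N):
    # Online phase: estimated posterior over all discretized inputs,
    # given t observations y_t at the discretization indices idx.
    t = len(idx)
    K_tt = k_hat[np.ix_(idx, idx)]        # k_hat(x_t, x_t)
    K_st = k_hat[:, idx]                  # k_hat(., x_t)
    mu_t = mu_hat + K_st @ np.linalg.solve(K_tt, y_t - mu_hat[idx])
    k_t = (N - 1) / (N - t - 1) * (k_hat - K_st @ np.linalg.solve(K_tt, K_st.T))
    return mu_t, k_t
```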
$X \subset \mathbb{R}^d$ is compact: use weights to represent the prior
• Assume there exist basis functions $\Phi = [\phi_s]_{s=1}^K : X \to \mathbb{R}^K$, a mean parameter $u \in \mathbb{R}^K$ and a covariance parameter $\Sigma \in \mathbb{R}^{K \times K}$ such that $\mu(x) = \Phi(x)^{\mathrm{T}} u$ and $k(x, x') = \Phi(x)^{\mathrm{T}} \Sigma \Phi(x')$, i.e. $f = \Phi(x)^{\mathrm{T}} W \sim GP(\mu, k)$, $W \sim \mathcal{N}(u, \Sigma)$.
• Assume $M \geq K$ and $\Phi(\mathbf{x})$ has full row rank. The observation $Y_i = \Phi(\mathbf{x})^{\mathrm{T}} W_i + \epsilon_i \sim \mathcal{N}(\Phi(\mathbf{x})^{\mathrm{T}} u, \Phi(\mathbf{x})^{\mathrm{T}} \Sigma \Phi(\mathbf{x}) + \sigma^2 I)$, and we estimate the weight vector as $\hat{W}_i = (\Phi(\mathbf{x})^{\mathrm{T}})^{+} Y_i \sim \mathcal{N}(u, \Sigma + \sigma^2(\Phi(\mathbf{x})\Phi(\mathbf{x})^{\mathrm{T}})^{-1})$. Let $\hat{W} = [\hat{W}_i]_{i=1}^N \in \mathbb{R}^{N \times K}$.
• Unbiased GP prior parameter estimator: $\hat{u} = \frac{1}{N} \hat{W}^{\mathrm{T}} \mathbf{1}_N$ and $\hat{\Sigma} = \frac{1}{N-1}(\hat{W} - \mathbf{1}_N \hat{u}^{\mathrm{T}})^{\mathrm{T}}(\hat{W} - \mathbf{1}_N \hat{u}^{\mathrm{T}})$.
• Unbiased GP posterior estimator: $\hat{\mu}_t(x) = \Phi(x)^{\mathrm{T}} \hat{u}_t$ and $\hat{k}_t(x) = \Phi(x)^{\mathrm{T}} \hat{\Sigma}_t \Phi(x)$, where
$$\hat{u}_t = \hat{u} + \hat{\Sigma}\Phi(\mathbf{x}_t)(\Phi(\mathbf{x}_t)^{\mathrm{T}}\hat{\Sigma}\Phi(\mathbf{x}_t))^{-1}(\mathbf{y}_t - \Phi(\mathbf{x}_t)^{\mathrm{T}}\hat{u}), \quad \hat{\Sigma}_t = \frac{N-1}{N-t-1}\big(\hat{\Sigma} - \hat{\Sigma}\Phi(\mathbf{x}_t)(\Phi(\mathbf{x}_t)^{\mathrm{T}}\hat{\Sigma}\Phi(\mathbf{x}_t))^{-1}\Phi(\mathbf{x}_t)^{\mathrm{T}}\hat{\Sigma}\big).$$
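A corresponding numpy sketch for the compact case, where `Phi` is the $K \times M$ feature matrix $\Phi(\mathbf{x})$ of the shared training inputs ($M \geq K$, full row rank) and `Phi_t` stacks the features of the $t$ points queried online:

```python
import numpy as np

def estimate_weight_prior(Phi, Y):
    # Phi: (K, M) basis features at the M training inputs; Y: (N, M) observations.
    W_hat = Y @ np.linalg.pinv(Phi.T).T   # (N, K): row i is W_hat_i = (Phi^T)^+ Y_i
    N = Y.shape[0]
    u_hat = W_hat.mean(axis=0)            # (1/N) W_hat^T 1_N
    R = W_hat - u_hat
    Sigma_hat = R.T @ R / (N - 1)         # (K, K) sample covariance of the weights
    return u_hat, Sigma_hat

def weight_posterior(u_hat, Sigma_hat, Phi_t, y_t, N):
    # Phi_t: (K, t) features of the queried inputs; y_t: (t,) observations.
    t = Phi_t.shape[1]
    S = Phi_t.T @ Sigma_hat @ Phi_t       # Phi(x_t)^T Sigma_hat Phi(x_t)
    G = Sigma_hat @ Phi_t @ np.linalg.inv(S)
    u_t = u_hat + G @ (y_t - Phi_t.T @ u_hat)
    Sigma_t = (N - 1) / (N - t - 1) * (Sigma_hat - G @ Phi_t.T @ Sigma_hat)
    return u_t, Sigma_t
```

Evaluating $\Phi(x)^{\mathrm{T}}\hat{u}_t$ and $\Phi(x)^{\mathrm{T}}\hat{\Sigma}_t\Phi(x)$ then gives $\hat{\mu}_t(x)$ and $\hat{k}_t(x)$ at any candidate $x$.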
Lemma 1. If the size of the training dataset satisfies $N \geq T + 2$, then for any input $x \in X$, with probability at least $1 - \delta$,
$$|\hat{\mu}_t(x) - \mu_t(x)|^2 < a_t(k_t(x) + \bar{\sigma}^2(x)) \quad \text{and} \quad 1 - 2\sqrt{b_t} < \hat{k}_t(x)/(k_t(x) + \bar{\sigma}^2(x)) < 1 + 2\sqrt{b_t} + 2b_t,$$
where $a_t = \frac{4\big(N - 2 + t + 2\sqrt{t\log(4/\delta)} + 2\log(4/\delta)\big)}{\delta N(N - t - 2)}$ and $b_t = \frac{1}{N-t-1}\log\frac{4}{\delta}$. For finite $X$, $\bar{\sigma}^2(x) = \sigma^2$; for compact $X$, $\bar{\sigma}^2(x) = \sigma^2\Phi(x)^{\mathrm{T}}(\Phi(\mathbf{x})\Phi(\mathbf{x})^{\mathrm{T}})^{-1}\Phi(x)$.
NEAR-ZERO REGRET BOUNDS FOR BO WITH THE ESTIMATED PRIOR AND POSTERIOR
Acquisition functions:
$$\alpha^{\mathrm{PI}}_{t-1}(x) = \frac{\hat{\mu}_{t-1}(x) - \hat{f}^*}{\hat{k}_{t-1}(x)^{\frac{1}{2}}}, \quad \alpha^{\text{GP-UCB}}_{t-1}(x) = \hat{\mu}_{t-1}(x) + \zeta_t \hat{k}_{t-1}(x)^{\frac{1}{2}},$$
$$\hat{f}^* \geq \max_{x \in X} f(x), \quad \zeta_t = \Big(6\big(N - 3 + t + 2\sqrt{t\log\tfrac{6}{\delta}} + 2\log\tfrac{6}{\delta}\big)/\big(\delta N(N - t - 1)\big)\Big)^{\frac{1}{2}} + \frac{\big(2\log\frac{3}{\delta}\big)^{\frac{1}{2}}}{\big(1 - 2(\frac{1}{N-t}\log\frac{6}{\delta})^{\frac{1}{2}}\big)^{\frac{1}{2}}}.$$
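A sketch of the two acquisition values at a candidate point, given the estimated posterior mean `mu_t` and variance `k_t` from either estimator above; `f_star` is the assumed upper bound $\hat{f}^*$, and $N$ must be sufficiently larger than $t$ for $\zeta_t$ to be well defined:

```python
import numpy as np

def zeta(t, N, delta):
    # Confidence weight zeta_t for GP-UCB with the estimated prior (see above).
    first = 6 * (N - 3 + t + 2 * np.sqrt(t * np.log(6 / delta))
                 + 2 * np.log(6 / delta)) / (delta * N * (N - t - 1))
    denom = 1 - 2 * np.sqrt(np.log(6 / delta) / (N - t))
    return np.sqrt(first) + np.sqrt(2 * np.log(3 / delta)) / np.sqrt(denom)

def acq_gp_ucb(mu_t, k_t, t, N, delta):
    # alpha^GP-UCB_{t-1}(x) = mu_hat_{t-1}(x) + zeta_t * k_hat_{t-1}(x)^{1/2}
    return mu_t + zeta(t, N, delta) * np.sqrt(k_t)

def acq_pi(mu_t, k_t, f_star):
    # alpha^PI_{t-1}(x) = (mu_hat_{t-1}(x) - f_star) / k_hat_{t-1}(x)^{1/2}
    return (mu_t - f_star) / np.sqrt(k_t)
```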
With high probability, the simple regret decreases to a constant proportional to the noise level $\sigma$ as the numbers of iterations and training functions increase.
Theorem 2. Assume there exists a constant $c \geq \max_{x \in X} k(x)$ and a training dataset is available whose size is $N \geq 4\log\frac{6}{\delta} + T + 2$. With probability at least $1 - \delta$, the best-sample simple regret in $T$ iterations of meta BO with either GP-UCB or PI satisfies
$$r^{\mathrm{UCB}}_T < \eta^{\mathrm{UCB}}_T(N)\lambda_T, \quad r^{\mathrm{PI}}_T < \eta^{\mathrm{PI}}_T(N)\lambda_T, \quad \lambda_T^2 = O(\rho_T/T) + \bar{\sigma}(x_\tau)^2,$$
where $\eta^{\mathrm{UCB}}_T(N) = (m + C_1)\big(\frac{\sqrt{1+m}}{\sqrt{1-m}} + 1\big)$, $\eta^{\mathrm{PI}}_T(N) = (m + C_2)\big(\frac{\sqrt{1+m}}{\sqrt{1-m}} + 1\big) + C_3$, $m = O\big(\sqrt{\frac{1}{N-T}}\big)$, $C_1, C_2, C_3 > 0$ are constants, $\tau = \arg\min_{t \in [T]} k_{t-1}(x_t)$, and $\rho_T = \max_{A \subseteq X, |A| = T} \frac{1}{2}\log|I + \sigma^{-2}k(A)|$. $\bar{\sigma}$ is defined in the same way as in Lemma 1.
Lemma 3. With probability at least $1 - \delta$, the simple regret satisfies $R_T \leq r_T + 2(2\log\frac{1}{\delta})^{\frac{1}{2}}\sigma$.
EXPERIMENTS
Optimizing a grasp ($N = 1800$, $M = 162$)
[Figure: max observed value vs. number of evaluations of the test $f$ (Random, Plain-UCB, PEM-BO-UCB, TLSM-BO-UCB), and max observed value vs. proportion of the prior training dataset (Plain-UCB, PEM-BO-UCB).]
Optimizing a grasp, base pose, and placement ($N = 1500$, $M = 1000$)
[Figure: same comparison as above: max observed value vs. number of evaluations of the test $f$ (Random, Plain-UCB, PEM-BO-UCB, TLSM-BO-UCB) and vs. proportion of the prior training dataset (Plain-UCB, PEM-BO-UCB).]
Optimizing a continuous synthetic function drawn from a GP with a squared exponential kernel ($N = 100$, $M = 1000$)
[Figure: max observed value vs. number of evaluations of the test $f$ (Random, Plain-UCB w/ correct form of prior, PEM-BO-UCB, TLSM-BO), and vs. proportion of the prior training dataset (Plain-UCB w/ correct form of prior, PEM-BO-UCB).]
• PEM-BO: our point estimate meta BO approach.
• Plain-BO: plain BO using a squared exponential kernel and maximum likelihood for GP hyperparameter estimation.
• TLSM-BO: transfer learning sequential model-based optimization [5].
• Random: random selection.
SELECTED REFERENCES
[1] T. W. Anderson. An Introduction to Multivariate Statistical Analysis. Wiley, New York, 1958.
[2] E. J. Candès and B. Recht. Exact matrix completion via convex optimization. Foundations of Computational Mathematics, 9(6):717, 2009.
[3] B. Kim, L. P. Kaelbling, and T. Lozano-Pérez. Learning to guide task and motion planning using score-space representation. In ICRA, 2017.
[4] R. Neal. Bayesian Learning for Neural Networks. Lecture Notes in Statistics 118. Springer, 1996.
[5] D. Yogatama and G. Mann. Efficient transfer learning method for automatic hyperparameter tuning. In AISTATS, 2014.