
Large Scale Bayesian Optimization with Subspace Decomposition

Logan Mathesen. Joint work with: Pedrielli G.

SCHOOL OF COMPUTING, INFORMATICS & DECISION SYSTEMS ENGINEERING

July 29th, 2019


1 Introduction

2 Background

3 Proposed Method

4 Empirical Analysis

5 Conclusion


Introduction

Global Optimization in High Dimension

Increasingly complex problems are being researched in many domains:

Large scale production of individualized cancer therapy;
Control of distributed robots;
Self-driving cars.

Increase in complexity (and parameterization) of novel algorithms developed to address these problems;

Translation of the hyper-parameter optimization problem to non-convex black-box optimization.


Introduction

Non-linear Non-convex Black-box Optimization

Assume there exists $f(x) : \mathbb{X} \to \mathbb{R}$, where $\mathbb{X} \subset \mathbb{R}^d$. We want to find an optimal solution $x^*$:

$$x^* \in \arg\min_{x \in \mathbb{X}} f(x)$$

We are specifically interested in high values of $d$.


Introduction

Bayesian Optimization

State-of-the-art family within black-box optimization when function evaluations are costly and the number of allowed function evaluations is low;
Surrogate model search, often with a Gaussian process;
Derivative-free method of global optimization.

Figure: Peter I. Frazier. Bayesian Optimization. In INFORMS TutORials in Operations Research. Published online: 19 Oct 2018; 255-278.
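To make the surrogate-model search loop concrete, the following minimal sketch (ours, not taken from the tutorial above) runs Bayesian optimization with a scikit-learn Gaussian process surrogate and expected improvement maximized over random candidates; the 1-d toy objective, kernel choice, and budgets are all illustrative assumptions.

```python
# Minimal Bayesian optimization loop: GP surrogate + expected improvement.
# Illustrative sketch only; objective, kernel, and budgets are assumptions.
import numpy as np
from scipy.stats import norm
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import Matern

def objective(x):                     # toy 1-d black-box function
    return np.sin(3 * x) + 0.1 * x**2

rng = np.random.default_rng(0)
X = rng.uniform(-2, 2, size=(5, 1))   # initial design
y = objective(X).ravel()

for _ in range(20):                   # optimization budget
    gp = GaussianProcessRegressor(kernel=Matern(nu=2.5), normalize_y=True).fit(X, y)
    cand = rng.uniform(-2, 2, size=(1000, 1))   # random candidate search
    mu, s = gp.predict(cand, return_std=True)
    delta = y.min() - mu                        # improvement over incumbent
    ei = delta * norm.cdf(delta / s) + s * norm.pdf(delta / s)
    x_next = cand[np.argmax(ei)].reshape(1, -1)
    X = np.vstack([X, x_next])
    y = np.append(y, objective(x_next).ravel())

print("best value found:", y.min())
```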


Introduction

Challenges with Bayesian Optimization

Performance is not satisfactory beyond roughly 10 dimensions;

Three key and intertwined reasons for this:

1. Gaussian process learning effort grows cubically with the number of observations, $O(n^3)$;

2. To ensure reasonable closeness to $x^*$, substantial coverage of $\mathbb{X}$ is required, and the number of observations needed for coverage increases exponentially with dimension;

3. Maximizing the acquisition function generally scales exponentially with the number of dimensions.


Background

Embedding

Assumption of "low effective dimensionality" or "approximate low effective dimensionality": some dimensions do not impact the objective function.

Definition: $\varepsilon$-effective subspace ($V_\varepsilon$). A function $f : \mathbb{R}^d \to \mathbb{R}$ has a valid $V_\varepsilon$ if there exists a linear subspace $V_\varepsilon \subseteq \mathbb{R}^d$ such that, for all $x \in \mathbb{R}^d$, $|f(x) - f(x_\varepsilon)| \leq \varepsilon$, where $x_\varepsilon \in V_\varepsilon$ is the orthogonal projection of $x$ onto $V_\varepsilon$.

REMBO (Random EMbedding Bayesian Optimization)

SRE (Sequential Random Embeddings)

SI-BO (Subspace Identification – Bayesian Optimization)

Random Sampling

Wang, Z., Hutter, F., Zoghi, M., Matheson, D., & de Freitas, N. (2016). Bayesian optimization in a billion dimensions via random embeddings. Journal of Artificial Intelligence Research, 55, 361-387.
Qian, H., Hu, Y. Q., & Yu, Y. (2016, July). Derivative-free optimization of high-dimensional non-convex functions by sequential random embeddings. In IJCAI (pp. 1946-1952).
Djolonga, J., Krause, A., & Cevher, V. (2013). High-dimensional Gaussian process bandits. In Advances in Neural Information Processing Systems (pp. 1025-1033).
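To illustrate the random-embedding idea behind REMBO (a sketch under our assumptions, not the authors' implementation): draw a random Gaussian matrix $A \in \mathbb{R}^{d \times d_e}$, search in the $d_e$-dimensional space, and map each candidate up as $x = Ay$ (clipped to the box) before evaluating $f$. The dimensions, bounds, stand-in objective, and the random-search stub below are all illustrative.

```python
# Random-embedding sketch: optimize a d-dimensional function through a
# random linear map from a low-dimensional space (REMBO-style; illustrative).
import numpy as np

d, d_e = 100, 4                         # ambient and embedding dimensions (assumed)
rng = np.random.default_rng(1)
A = rng.normal(size=(d, d_e))           # random embedding matrix

def f(x):                               # stand-in high-dimensional objective
    return np.sum((x - 0.1) ** 2)

def f_embedded(y):
    x = np.clip(A @ y, -1.0, 1.0)       # map up and clip to the box X = [-1, 1]^d
    return f(x)

# Any low-dimensional optimizer can now search over y; random search as a stub.
best_y, best_val = None, np.inf
for _ in range(500):
    y = rng.uniform(-np.sqrt(d_e), np.sqrt(d_e), size=d_e)
    val = f_embedded(y)
    if val < best_val:
        best_y, best_val = y, val
print("best embedded value:", best_val)
```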


Background

Additive Modeling

Structural assumption: $f(\cdot)$ decomposes as

$$f(x) = f^1(x^1) + f^2(x^2) + \dots + f^m(x^m)$$

where each $x^j \in \mathbb{X}^j = \prod_{i=1}^{d_j} [0, 1]$ represents a "group" constituting the decomposition.

Each $f^j(x^j)$ is then independently modeled as a Gaussian process; added together, they recover the full dimensional model.

Additive kernel based acquisition functions assessed over $\mathbb{X}$ can be optimized by maximizing over each component $\mathbb{X}^j$ to reproduce sampling in $\mathbb{X}$.

Kandasamy, K., Schneider, J., & Poczos, B. (2015, June). High dimensional Bayesian optimisation and bandits via additive models. In International Conference on Machine Learning (pp. 295-304).
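As a minimal sketch of the additive-kernel idea (ours, not the cited authors' implementation): the full covariance is a sum of kernels, each acting only on its group's coordinates, so each component can be modeled and searched independently. The group assignments, lengthscale, and test signal below are assumptions.

```python
# Additive-kernel sketch: a GP kernel that is a sum of RBF kernels, each
# acting on one disjoint group of coordinates (illustrative only).
import numpy as np

groups = [[0, 1], [2, 3], [4]]          # assumed decomposition of a 5-d space

def rbf(Xa, Xb, lengthscale=0.5):
    sq = ((Xa[:, None, :] - Xb[None, :, :]) ** 2).sum(-1)
    return np.exp(-0.5 * sq / lengthscale**2)

def additive_kernel(Xa, Xb):
    # k(x, x') = sum_j k_j(x^j, x'^j): each term sees only its group's columns
    return sum(rbf(Xa[:, g], Xb[:, g]) for g in groups)

rng = np.random.default_rng(2)
X = rng.uniform(size=(20, 5))
y = np.sin(X[:, 0]) + X[:, 2] ** 2 + 0.5 * X[:, 4]   # truly additive test signal

K = additive_kernel(X, X) + 1e-8 * np.eye(len(X))    # GP prior covariance + jitter
alpha = np.linalg.solve(K, y)                        # GP posterior mean weights
X_new = rng.uniform(size=(3, 5))
mean = additive_kernel(X_new, X) @ alpha             # posterior mean at new points
print(mean)
```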


Background

Work on selecting appropriate $x^j$ "groups" includes:

MCMC methods;
Factor graph methods;
Gibbs sampling based structural kernel learning;
Gibbs sampling based graph learning;
Fourier feature approximation.

Gardner, J., Guo, C., Weinberger, K., Garnett, R., & Grosse, R. (2017). Discovering and exploiting additive structure for Bayesian optimization. In Artificial Intelligence and Statistics, 1311-1319.
Hoang, T. N., Hoang, Q. M., Ouyang, R., & Low, K. H. (2018). Decentralized high-dimensional Bayesian optimization with factor graphs.
Wang, Z., Li, C., Jegelka, S., & Kohli, P. (2017). Batched high-dimensional Bayesian optimization via structural kernel learning. In International Conference on Machine Learning (ICML).
Rolland, P., Scarlett, J., Bogunovic, I., & Cevher, V. (2018). High-dimensional Bayesian optimization via additive models with overlapping groups. In International Conference on Artificial Intelligence and Statistics, 298-307.
Mutny, M., & Krause, A. (2018). Efficient high dimensional Bayesian optimization with additivity and quadrature Fourier features. In Advances in Neural Information Processing Systems, 9005-9016.


Proposed Method

Our Problem

We aim to find the global minimum $x^*$ such that:

$$x^* \in \arg\min_{x \in \mathbb{X}} f(x)$$

f is non-linear and non-convex

X is continuous

No low dimensional structure can be easily learned about:

the changes of $f(\cdot)$ in $\mathbb{X}$ (embedding);
$f(\cdot)$ modeled as a realization of an additive structure.

f is smooth (implied assumption from algorithm implementation)

We propose the Subspace COmmunication based OPtimization (SCOOP) algorithm for efficient optimization.


Proposed Method

SCOOP Algorithm Overview

Consider the original space to be separated into $k$ subspaces $\mathbb{X}^1, \mathbb{X}^2, \dots, \mathbb{X}^k$, each with dimensionality $d_j = d - n_j$, $j = 1, \dots, k$;

Subspace generation must ensure that every dimension of the original space $\mathbb{X}$ appears in at least one of the decompositions;

In each subspace we use Bayesian optimization to yield subspace optimizers; however, any optimization technique could be used;

The key is the information sharing mechanism: every "batch" of iterations, the subspaces communicate the current state of their local optimizations in order to change the values of the "currently fixed" variables (see the sketch below).
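A minimal sketch of this outer loop as we read it from the slides (not the authors' code): each subspace optimizes only its active coordinates while the remaining coordinates are held at shared values, and after each batch the shared values are refreshed from the subspaces' best points. Random search stands in for the per-subspace Bayesian optimization; sizes, budgets, and the test objective are assumptions.

```python
# SCOOP-style outer loop (sketch): decompose dimensions into subspaces,
# optimize each over its active coordinates with the rest fixed, then share.
import numpy as np

d, ss_dim, batches, evals = 6, 2, 10, 50        # assumed sizes and budgets
groups = [list(range(i, i + ss_dim)) for i in range(0, d, ss_dim)]
rng = np.random.default_rng(3)

def f(x):                                       # stand-in objective on [0, 1]^d
    return np.sum((x - 0.3) ** 2)

shared = rng.uniform(size=d)                    # current values of all variables
for _ in range(batches):
    proposals = []
    for g in groups:
        # Optimize only coordinates in g; all others stay at the shared values.
        best_x, best_val = shared.copy(), f(shared)
        for _ in range(evals):
            x = shared.copy()
            x[g] = rng.uniform(size=len(g))
            val = f(x)
            if val < best_val:
                best_x, best_val = x, val
        proposals.append(best_x)
    # Information sharing: adopt each subspace's best active coordinates.
    for x, g in zip(proposals, groups):
        shared[g] = x[g]
print("f(shared) after communication rounds:", f(shared))
```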


Proposed Method

SCOOP Algorithm Overview

Figure: Basic Idea of SCOOP.


Proposed Method

SCOOP Visualization in R3

Figure: SCOOP visualization in $\mathbb{R}^3$, built up across several slides. The space is searched through 2-d slices: one subspace fixes x = 0.6 and searches over (y, z), another fixes y = 0.75 and searches over (x, z); after information sharing the fixed values update to x = 0.5 and y = 0.4, and the slice searches repeat.

Proposed Method

SCOOP Overview

Figure: SCOOP algorithm overview (flowchart built up across five slides).

Proposed Method

Information Sharing

Best Observed Sample Sharing (SCOOP-B)

$$x^{\text{share}}_j \in \arg\min_{x \in \mathbb{X}_{SS_j}} f(x)$$

Marginalized Expected Improvement (SCOOP-E)

$$x^{\text{share}}_j \in \arg\min_{x \in \mathbb{X}_{SS_j}} EI_M(x \mid SS_j)$$

where

$$EI_M(x \mid SS_j) = \int_{x_i} \mathbb{E}\left[\Delta f(x)\, \Phi\!\left(\frac{\Delta f(x)}{s(x)}\right) + s(x)\, \phi\!\left(\frac{\Delta f(x)}{s(x)}\right)\right] dx_i$$

Link Failure

$$x^{\text{share}}_{ij} = (1 - \xi)\, x^{\text{share}}_{ij} + \xi \eta \quad \forall i, \qquad \xi \sim \text{Ber}(\alpha), \; \eta \sim \text{Beta}(b_1, b_2)$$
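These sharing rules translate directly to code; below is a sketch of SCOOP-B plus the link-failure perturbation under our reading (the values of α, b1, b2 are assumed, and the observed data are synthetic):

```python
# Information-sharing sketch: best-observed sharing (SCOOP-B) with the
# link-failure perturbation x_share <- (1 - xi) * x_share + xi * eta,
# xi ~ Bernoulli(alpha), eta ~ Beta(b1, b2). Parameter values are assumed.
import numpy as np

rng = np.random.default_rng(4)
alpha, b1, b2 = 0.1, 2.0, 2.0            # link-failure rate and Beta shape params

def best_observed_share(X_obs, y_obs):
    """SCOOP-B: share the best sample observed so far in a subspace."""
    return X_obs[np.argmin(y_obs)]

def link_failure(x_share):
    """Randomly corrupt each shared coordinate with probability alpha."""
    xi = rng.random(x_share.shape) < alpha        # Bernoulli(alpha) per coordinate
    eta = rng.beta(b1, b2, size=x_share.shape)    # replacement values in [0, 1]
    return np.where(xi, eta, x_share)

# Toy usage: 30 observed points in a 5-d subspace, quadratic objective.
X_obs = rng.uniform(size=(30, 5))
y_obs = np.sum((X_obs - 0.5) ** 2, axis=1)
x_share = best_observed_share(X_obs, y_obs)
print("shared point after possible link failures:", link_failure(x_share))
```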


Empirical Analysis

Sharing Strategy Testing in R3

(a) Rosenbrock. (b) Rosenbrock Results. (c) Ackley. (d) Ackley Results.

Figure: Performance of SCOOP in 2-d under several sharing strategies.

It appears that SCOOP-BL is the best performer under the different set-ups.


Empirical Analysis

Algorithm Parametrization Testing in R50

Experimentation was executed to test 3 algorithm parameters (a hypothetical configuration grid is sketched after the list):

SSdim – number of dimensions active in each subspace

SSdim = 2, yielding 25 subspaces
SSdim = 5, yielding 10 subspaces

binit : bopt – ratio of initializing to optimizing samples

binit : bopt = 1 : 1
binit : bopt = 1 : 5

% Link Fail – % of hidden dimension value updates that fail

0% Link Fail
10% Link Fail
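For concreteness, the full 2 × 2 × 2 factorial grid implied by this list can be enumerated as follows (a hypothetical encoding; the dictionary keys are ours, not the authors'):

```python
# Hypothetical encoding of the 2 x 2 x 2 experimental grid described above.
from itertools import product

ss_dims = [2, 5]                    # subspace dimensionality (25 or 10 subspaces of d = 50)
init_opt_ratios = [(1, 1), (1, 5)]  # initializing : optimizing sample ratio
link_fail_rates = [0.0, 0.1]        # fraction of hidden-dimension updates that fail

configs = [
    {"ss_dim": s, "b_init": r[0], "b_opt": r[1], "link_fail": p}
    for s, r, p in product(ss_dims, init_opt_ratios, link_fail_rates)
]
print(len(configs), "configurations")   # -> 8 configurations
```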


Empirical Analysis

Algorithm Parametrization Testing in R50

Figure: Alternative SCOOP formulations tested on the 50 dimensional Ackley function; results presented are over 50 algorithm execution replications.

Better optimization at each iteration provides higher quality shared information, yielding better overall algorithm performance.


Empirical Analysis

State-of-the-Art Comparison

SCOOP against REMBO and an Additive Gaussian process approach over a 20, 50, and 100 dimensional Rosenbrock function.

(a) Rosenbrock in $\mathbb{R}^{20}$. (b) Rosenbrock in $\mathbb{R}^{50}$. (c) Rosenbrock in $\mathbb{R}^{100}$.

Figure: Average best function value achieved by SCOOP, ADD GP, and REMBO (with varying embedding dimension, $d_e$).

Wang, Z., Hutter, F., Zoghi, M., Matheson, D., & de Freitas, N. (2016). Bayesian optimization in a billion dimensions via random embeddings. Journal of Artificial Intelligence Research, 55, 361-387.
Wang, Z., Li, C., Jegelka, S., & Kohli, P. (2017). Batched high-dimensional Bayesian optimization via structural kernel learning. In International Conference on Machine Learning (ICML).


Empirical Analysis

State-of-the-Art Comparison

SCOOP against REMBO and an Additive Gaussian process approach over a 20, 50, and 100 dimensional Rosenbrock function.

(a) Rosenbrock in $\mathbb{R}^{20}$. (b) Rosenbrock in $\mathbb{R}^{50}$. (c) Rosenbrock in $\mathbb{R}^{100}$.

Figure: Average Euclidean error between identified minimum and true optimum for SCOOP, ADD GP, and REMBO (with varying $d_e$).


Conclusion

Conclusions and Future Work

SCOOP directly optimizes over multiple low dimensional subspaces, leveraging information communication among these easy optimizations to navigate the hard-to-search full dimensional space.

Further efforts are underway on tests in environments with 1000 active dimensions and on alternative information sharing strategies;

Optimal selection of active subspace dimensions and communication patterns/networks amongst subspaces.


Conclusion

Thanks for your attention! [email protected]


Conclusion

Ackley function

$$f(\mathbf{x}) = -a \exp\left(-b \sqrt{\frac{1}{d} \sum_{i=1}^{d} x_i^2}\,\right) - \exp\left(\frac{1}{d} \sum_{i=1}^{d} \cos(c x_i)\right) + a + \exp(1)$$

Rosenbrock function

$$f(\mathbf{x}) = \sum_{i=1}^{d-1} \left[\, 100\left(x_{i+1} - x_i^2\right)^2 + \left(x_i - 1\right)^2 \right]$$
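These benchmark definitions translate directly to NumPy; the sketch below is our transcription of the two formulas, with Ackley's constants set to the standard a = 20, b = 0.2, c = 2π (an assumption, since the slides leave them unspecified):

```python
# NumPy transcription of the Ackley and Rosenbrock test functions above.
# Ackley's a, b, c default to the standard values (assumed; the slides
# leave them unspecified).
import numpy as np

def ackley(x, a=20.0, b=0.2, c=2.0 * np.pi):
    x = np.asarray(x, dtype=float)
    d = x.size
    return (-a * np.exp(-b * np.sqrt(np.sum(x**2) / d))
            - np.exp(np.sum(np.cos(c * x)) / d) + a + np.e)

def rosenbrock(x):
    x = np.asarray(x, dtype=float)
    return np.sum(100.0 * (x[1:] - x[:-1]**2)**2 + (x[:-1] - 1.0)**2)

# Sanity checks at the known global minima.
print(ackley(np.zeros(50)))       # ~ 0 at the origin
print(rosenbrock(np.ones(50)))    # 0 at the all-ones point
```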


Conclusion

Configuring neural net topology with 6, 9, and 15 layers. Wall-clock time constrained optimization, accounting for both neural net selection and training.

Figure: Neural net prediction accuracy results across nets with differing numbers of layers (dimension).


Conclusion

Efficient Global Optimization (EGO), Batched Additive Bayesian Optimization via Structural Kernel Learning (Add-GP)

Figure: Combined Function

(a) Evaluations required to optimize. (b) Time required to optimize.

Figure: Performance of SCOOP.

Jones, D. R., Schonlau, M., & Welch, W. J. (1998). Efficient global optimization of expensive black-box functions. Journal of Global Optimization, 13(4), 455-492.
Wang, Z., Li, C., Jegelka, S., & Kohli, P. (2017). Batched high-dimensional Bayesian optimization via structural kernel learning. arXiv preprint arXiv:1703.01973.


Conclusion

Figure: Higher Dimensional Results
