
Optimization and Machine Learning Training Algorithms for Fitting Numerical Physics Models

Raghu Bollapragada†, Matt Menickelly†, Witold Nazarewicz‡,§, Jared O’Neal†, Paul-Gerhard Reinhard∗, Stefan M. Wild†

†Argonne National Laboratory, ‡Michigan State University, §University of Warsaw, ∗Universität Erlangen-Nürnberg

June 9, 2020

Optimizing Parameters of Fayans’ Functional

- Use Fayans functional form proposed in [S.A. Fayans, JETP Lett. 68, 169 (1998)]
- Computer model FaNDF0 (m(ν;x)) [P.-G. Reinhard & W. Nazarewicz, Phys. Rev. C (2017)]
- 13 free computer model parameters (x) to be fitted:
  ASP, ASYMM, EOVERA, COMPR, RHO NM, DASYM, HGRADP, C0NABJ, C1NABJ, H2VM, FXI, HGRADXI, H1XI
- Use “even” dataset (d1, . . . , d198) from that study corresponding to 198 observables (ν1, . . . , ν198) derived from 72 nucleus configurations:

Class                        Number of Observables
Binding Energy               63
RMS Charge Radius            52
Diffraction Radius           28
Surface Thickness            26
Neutron Single-Level Energy   4
Proton Single-Level Energy    5
Isotopic Shifts               3
Neutron Pairing Gap           5
Proton Pairing Gap           11
Other                         1

σi-weighted least squares optimization problem:

\[
\min_{x \in \mathbb{R}^{13}} f(x), \quad \text{where} \quad
f(x) = \frac{1}{198} \sum_{i=1}^{198} \left( \frac{m(\nu_i; x) - d_i}{\sigma_i} \right)^2 .
\]
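For concreteness, a minimal Python sketch of this objective, assuming a hypothetical observables_fn wrapper around the FaNDF0 evaluation (the actual computer model is an external parallel code, so the wrapper and the names chi2_objective, d, and sigma are illustrative):

```python
import numpy as np

def chi2_objective(x, observables_fn, d, sigma):
    """Sigma-weighted least-squares objective f(x).

    x              : parameter vector (length 13 for the Fayans fit)
    observables_fn : hypothetical wrapper returning m(nu_i; x) for all i
    d, sigma       : arrays of the 198 data values and their weights
    """
    m = observables_fn(x)             # model predictions m(nu_i; x)
    residuals = (m - d) / sigma       # sigma-weighted residuals
    return np.mean(residuals ** 2)    # (1/198) * sum of squared residuals
```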


Why optimizers would like this problem

σi-weighted least squares optimization problem:

\[
\min_{x \in \mathbb{R}^{13}} f(x), \quad \text{where} \quad
f(x) = \frac{1}{198} \sum_{i=1}^{198} \left( \frac{m(\nu_i; x) - d_i}{\sigma_i} \right)^2 .
\]

Computing

- We can compute all observables m(νi;x) in a couple of seconds using 192 cores
- Inexpensive: benchmarking optimization algorithms is not prohibitive

This makes for a good case study!

An eye on general nuclear data-fitting problems of the form

\[
\min_{x \in \mathbb{R}^{n}} f(x), \quad \text{where} \quad
f(x) = \frac{1}{n_d} \sum_{i=1}^{n_d} \left( \frac{m(\nu_i; x) - d_i}{\sigma_i} \right)^2 ,
\]

where

- nd might be (substantially) larger than 198 ...
- but, like the Fayans computer model, m(νi;x) doesn’t admit analytic gradients.


Supervised Machine Learning (in one slide!)

The steps

1. Given a dataset of feature (νi) - data (di) pairs {(νi, di)}, i = 1, . . . , nd.
2. Let x parameterize a model m(ν;x).
3. Define a loss function ℓ(m(ν;x), d) that (in some sense) penalizes discrepancies between the model prediction m(ν;x) and data d.
4. An optimization problem involving the empirical average results:

\[
\min_{x} \; \frac{1}{n_d} \sum_{i=1}^{n_d} \ell\bigl(m(\nu_i; x), d_i\bigr).
\]

5. Solve the problem with stochastic gradient (SG) methods.

Silly toy example

1. Height and weight (νi) - dog vs. cat (di)
2. x are the weights of a CNN m(νi;x)
3. The loss function ℓ(m(ν;x), d) is classification cross entropy
4. Assemble the optimization problem (PyTorch/TensorFlow?)
5. Optimal x∗ =⇒ classifier m(ν;x∗) (a minimal training sketch follows below)
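To make the five steps concrete, here is a minimal runnable sketch in PyTorch on synthetic stand-in data; a tiny fully connected network replaces the slide’s CNN purely for brevity, and every name and number here is illustrative rather than part of the original study:

```python
import torch
import torch.nn as nn

# Step 1: synthetic (height, weight) features nu_i and labels d_i (0 = cat, 1 = dog).
nu = torch.randn(200, 2)
d = (nu[:, 0] + nu[:, 1] > 0).long()

# Step 2: x are the weights of a small model m(nu; x) (an MLP stands in for a CNN).
model = nn.Sequential(nn.Linear(2, 16), nn.ReLU(), nn.Linear(16, 2))

# Step 3: classification cross entropy as the loss l(m(nu; x), d).
loss_fn = nn.CrossEntropyLoss()

# Steps 4-5: minimize the empirical average with plain stochastic gradient.
opt = torch.optim.SGD(model.parameters(), lr=0.1)
for epoch in range(50):
    for idx in torch.randperm(200).split(32):      # mini-batches B_k
        opt.zero_grad()
        loss = loss_fn(model(nu[idx]), d[idx])     # batch average of the loss
        loss.backward()                            # gradients w.r.t. the weights x
        opt.step()                                 # x <- x - eta * gradient
```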


Optimization and Machine Learning

\[
\min_{x \in \mathbb{R}^{n}} f(x), \quad \text{where} \quad
f(x) = \frac{1}{n_d} \sum_{i=1}^{n_d} \left( \frac{m(\nu_i; x) - d_i}{\sigma_i} \right)^2 .
\]

One can interpret f(x) as an empirical square loss function for regression - looks just like supervised training of ML models!

In that case, why don’t we just use stochastic gradient (SG) methods1?

- No gradients of m(νi;x) ... need derivative-free techniques.
- But even with gradients, no great reason to believe SG methods should outperform “more traditional” optimization methods.

We performed a comparison of derivative-free solvers and approximate SG methods on this case study problem.

Solvers

- POUNDERS
- Nelder-Mead
- Kiefer-Wolfowitz Iteration
- Two-point Bandit
- Derivative-free ADAQN

¹Or one of the many, many, many variants of SG methods, e.g., Momentum, Polyak Averaging, Adam, RMSProp, AdaGrad, AdaDelta, AdaMax, AMSGrad, NAG, Nadam, SVRG, SAG, SARAH, SPIDER, ...


The Solvers - POUNDERS

- Previous use in calibrating nuclear models, e.g., [S.M. Wild, J. Sarich & N. Schunck, JPG: NPP (2015)]
- Model-based derivative-free optimization
  - Builds (quadratic) interpolation models of the objective on evaluated pairs (x, f(x)).
  - Exploits the known square loss function structure of f(x) for a better Hessian approximation (see the sketch below).
- Treats the objective as deterministic (all nd = 198 observables are computed for each function value).
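To illustrate the structure-exploitation bullet above, here is a small sketch of how per-residual model gradients translate into a gradient and a Gauss-Newton-style Hessian approximation of f(x). This is only a cartoon of the idea, not the POUNDERS implementation (which also retains curvature information from the residual models); J and structured_derivatives are hypothetical names:

```python
import numpy as np

def structured_derivatives(residuals, J, n_d=198):
    """Derivative estimates for f(x) = (1/n_d) * sum_i r_i(x)**2,
    where r_i(x) = (m(nu_i; x) - d_i) / sigma_i.

    residuals : length-n_d array of r_i(x) at the current point
    J         : (n_d, n) array; row i approximates the gradient of r_i(x),
                e.g., taken from a per-residual interpolation model
    """
    grad = (2.0 / n_d) * (J.T @ residuals)   # chain-rule gradient of the sum of squares
    hess = (2.0 / n_d) * (J.T @ J)           # Gauss-Newton: residual curvature terms dropped
    return grad, hess
```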


The Solvers - Nelder-Mead simplex algorithm

- A very popular (direct search) method of derivative-free optimization [J.A. Nelder & R. Mead, The Computer Journal (1965)] (a minimal SciPy call is sketched below)
- Maintains a simplex (n+1 points in Rn) of evaluated pairs (x, f(x)).
- Relative values of f(x) determine which simplex vertices to delete and which new vertices to add and evaluate.
- Treats the objective as deterministic (all nd = 198 observables are computed for each function value).

[Figure: schematic of a simplex with vertices x(1), . . . , x(n+1) and candidate new vertices xnew]
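For reference, a Nelder-Mead run is available off the shelf. Below is a minimal SciPy sketch on a synthetic stand-in objective with the same sigma-weighted least-squares shape; the matrix A, data d, weights sigma, and starting point are placeholders for the real FaNDF0 evaluation and data:

```python
import numpy as np
from scipy.optimize import minimize

# Synthetic stand-in for the 198-observable, 13-parameter least-squares objective.
rng = np.random.default_rng(0)
A = rng.normal(size=(198, 13))
d = rng.normal(size=198)
sigma = np.ones(198)
f = lambda x: np.mean(((A @ x - d) / sigma) ** 2)

result = minimize(f, x0=np.zeros(13), method="Nelder-Mead",
                  options={"maxfev": 5000, "xatol": 1e-6, "fatol": 1e-8})
print(result.fun, result.nfev)   # final objective value and number of evaluations
```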


The Solvers - Kiefer-Wolfowitz

- Just replace the gradients in the stochastic gradient method with finite differences!
- The seminal paper [J. Kiefer & J. Wolfowitz, Annals of Math. Stats. (1952)] was published one year after Robbins and Monro’s classic paper on SG [H. Robbins & S. Monro, Annals of Math. Stats. (1951)].

Let B ⊂ {1, . . . , nd} be a random batch of observables drawn without replacement. Define the function

\[
f_B(x) = \frac{1}{|B|} \sum_{i \in B} \left( \frac{m(\nu_i; x) - d_i}{\sigma_i} \right)^2 .
\]

SG: \(x_{k+1} \leftarrow x_k - \eta_k \nabla f_{B_k}(x_k)\)
KW: \(x_{k+1} \leftarrow x_k - \eta_k g_k\)

gk is a finite difference approximation of ∇fBk(xk) computed from values of fBk.

Batchsize (|Bk|) and stepsizes {ηk} are typically treated as hyperparameters. (So is the finite difference parameter h.)
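A minimal sketch of one KW iteration, assuming a callable f_batch that evaluates f_B at a point for the current random batch B_k (in this application it would average the squared sigma-weighted residuals over the sampled observables); kw_step, eta, and h are illustrative names:

```python
import numpy as np

def kw_step(x, f_batch, eta, h):
    """One Kiefer-Wolfowitz step: forward finite differences of the
    mini-batch objective stand in for its gradient (n + 1 evaluations of f_B)."""
    n = x.size
    f0 = f_batch(x)
    g = np.zeros(n)
    for j in range(n):
        e = np.zeros(n)
        e[j] = h
        g[j] = (f_batch(x + e) - f0) / h   # forward-difference estimate of df_B/dx_j
    return x - eta * g                     # x_{k+1} <- x_k - eta_k * g_k
```

A fresh batch B_k would be drawn (without replacement) before each call.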


The Solvers - Two-Point Bandit Methods

- Employs a two-point approximation of the approximate (mini-batch) gradient.
- (Compare to the n+1 or 2n points needed for finite difference gradients in KW.)
- Sample a random unit direction u and evaluate fB(x) and fB(x+ hu).
- Let

\[
d = \frac{f_B(x + hu) - f_B(x)}{h}\, u .
\]

SG: \(x_{k+1} \leftarrow x_k - \eta_k \nabla f_{B_k}(x_k)\)
KW: \(x_{k+1} \leftarrow x_k - \eta_k g_k\)
Bandit: \(x_{k+1} \leftarrow x_k - \eta_k d_k\)

Once again, batchsize, stepsize, and finite difference parameter h are hyperparameters.
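An analogous sketch of one two-point bandit step, again with an assumed mini-batch objective f_batch; note that only two evaluations of f_B are needed per iteration, independent of the dimension n:

```python
import numpy as np

def bandit_step(x, f_batch, eta, h, rng):
    """One two-point bandit step along a random unit direction u."""
    u = rng.normal(size=x.size)
    u /= np.linalg.norm(u)                          # random unit direction u
    d = (f_batch(x + h * u) - f_batch(x)) / h * u   # directional estimate d_k
    return x - eta * d                              # x_{k+1} <- x_k - eta_k * d_k
```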


The Solvers - Derivative-Free ADAQN

- An L-BFGS method (a popular quasi-Newton method) ...
- but gradients are approximated by finite differencing.
- Specifically designed for empirical loss functions of the form f(x).
- Requests variable batchsizes |Bk|:
  - |Bk| depends on algorithmically determined estimates of the variance of the approximate gradients (an illustrative test is sketched below).
- Key citations: Derivative-based: [R. Bollapragada, J. Nocedal, D. Mudigere, H. Shi & P. Tang, ICML (2018)]; Derivative-free: [R. Bollapragada & S.M. Wild, ICML Workshop (2019)]
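To make the variable-batchsize idea concrete, here is an illustrative variance ("norm")-style test in the spirit of the adaptive-sampling literature cited above; it is not the ADAQN rule verbatim, and grow_batch, per_sample_grads, and theta are hypothetical names:

```python
import numpy as np

def grow_batch(per_sample_grads, batch_size, n_d, theta=0.9):
    """Suggest the next batchsize from the scatter of per-term gradient estimates.

    per_sample_grads : (|B_k|, n) array of (approximate) gradients of the
                       individual terms of f at the current iterate
    """
    g_bar = per_sample_grads.mean(axis=0)               # batch gradient estimate
    var = per_sample_grads.var(axis=0, ddof=1).sum()    # total sample variance
    denom = (theta ** 2) * float(np.dot(g_bar, g_bar))
    if denom == 0.0:
        return n_d                                      # degenerate case: use the full dataset
    if var / batch_size > denom:                        # noise dominates the signal
        return min(n_d, int(np.ceil(var / denom)))      # grow |B| (capped at n_d)
    return batch_size
```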


Overall Comparison of Function Value Trajectories

Let’s focus on POUNDERS, KW, and ADAQN ...


The Case (For?) POUNDERS

Only one hyperparameter to tune (the initial trust-region radius), and performance is largely invariant to it!


The Case (Against?) KW

Figure: Summary results for KW, fixing 3 different batchsizes and comparing across stepsizes.

- Lots of hyperparameter tuning (batchsizes and stepsizes shown here) - very expensive in core hours.
- Much variability across hyperparameters.
- Smaller batchsizes tend to result in computational failure from the computer model!


The Case (For?) ADAQN

Only one hyperparameter tuned - the initial batchsize. Reliable performance, which seems to improve with smaller initial batchsizes!


Towards larger problems

Figure: Resource utilization plots for the final solvers

A study of parallel resource exploitation. On the x-axis: the number of observables m(νi;x) one could compute simultaneously. On the y-axis: the (median) number of rounds of full utilization of parallel resources of the size on the x-axis needed to reduce the optimality gap to a fraction τ.


Future Directions

- From the physics: More models! Bigger models?!
- From the mathematics: Consider randomized (sampled) variants of POUNDERS for a “best of both worlds” in terms of reliability and speed to solution?

Thank you!

