Machine Learning for Ab Initio Simulations
Matthias Rupp, Fritz Haber Institute of the Max Planck Society, Berlin, Germany
Outline

1. Rationale: ab initio simulations, machine learning
2. Kernel learning: kernel trick, kernel ridge regression
3. Representations: many-body tensor representation
4. Model building: overfitting, validation, free parameters
5. Property prediction: energies of molecules and crystals

Hands-on session: property prediction with qmmlpack
Rationale
Challenges in quantum mechanical simulations

High-throughput screening [Castelli et al., Energy Environ. Sci. 12, 2013]
Large systems [Image: Tarini et al., IEEE Trans. Visual. Comput. Graph., 2006]
Long simulations [Liwo et al., Proc. Natl. Acad. Sci. USA 102: 2362, 2005]
Quantum effects [Image: Hiller et al., Nature 476: 236, 2011]
Accurate simulations are limited by computational cost
Methods span a trade-off from accuracy to speed:
Full Configuration Interaction → Coupled Cluster → Møller-Plesset second-order perturbation theory → Kohn-Sham Density Functional Theory → Tight Binding → Molecular Mechanics

Is it possible to be both accurate and fast? In high-throughput settings, machine learning offers a new trade-off.
Interpolating electronic structure calculations with machine learning
Correlated calculations can be rapidly and accurately interpolated
[Figure: property versus structure; • reference calculations, — ground truth, - - - machine learning model]
[Rupp, Int J Quant Chem, 2015]
Assumes that similar structures have similar properties, requiring a corresponding space or metric.
Flexible functional form, constrained by a smoothness assumption.
The physics enters through the examples; flexibility versus physical constraints.
Relationship with other models
ab initio: generally applicable; no or little fitting; form from physics; deductive; few or no parameters; high accuracy; slow; small systems

force fields: limited domain; fitted to one class; form from physics; mostly deductive; some parameters; limited accuracy; fast; large systems

machine learning: generally applicable; refitted to any dataset; form from statistics; inductive; many parameters; high accuracy; in-between; large systems
Machine learning
Machine learning (ML) studies algorithms whose performance improves with data ("learning from experience"). [Mitchell, McGraw Hill, 1997]
Data {(x_1, y_1), …, (x_m, y_m)} → learning algorithm → model f̂ : x ↦ y

• No explicitly programmed problem-specific solution
• Systematic identification of regularity in data, for prediction and analysis
• Interpolation in high-dimensional spaces
Problem types
Unsupervised learning: data do not have labels; given {x_i}_{i=1}^n, find structure. Dimensionality reduction, clustering.

Supervised learning: data have labels; given {(x_i, y_i)}_{i=1}^n, predict ỹ for new x̃. Regression, classification.

Active learning: the algorithm chooses which data to label; choose n data {x_i}_{i=1}^n to predict ỹ for new x̃. Bayesian optimization.
Learning theory
[Figure: model class F containing models B, C, D, with true model A outside the class]

prediction error = approximation error a + estimation error e + optimization error o

F = model class, A = true model, B = best model in class, C = best identifiable model (data), D = best identifiable model (optimization)

Changing the size of F trades a against e: the bias-variance trade-off. [Bottou & Bousquet, NIPS 2007]
Kernel learning
Learning with kernels
The kernel trick:

• Transform samples into a higher-dimensional space
• Implicitly compute inner products there
• Rewrite the linear algorithm using only inner products

[Figure: input space X (x from −2π to 2π) mapped by φ into feature space H; plot shows sin x]

k : X × X → R,  k(x, z) = ⟨φ(x), φ(z)⟩

[Schölkopf & Smola: Learning with Kernels, 2002; Hofmann et al.: Ann. Stat., 2008]
Kernel functions
[Figure: vectors x and z at angle θ, illustrating ‖x − z‖₂, ‖x‖₂, ‖z‖₂, and the projection ‖z‖₂ cos θ]

Kernels correspond to inner products and encode information about lengths and angles:
‖x − z‖² = ⟨x, x⟩ − 2⟨x, z⟩ + ⟨z, z⟩,  cos θ = ⟨x, z⟩ / (‖x‖ ‖z‖)

If k : X × X → R is symmetric positive semi-definite, then k(x, z) = ⟨φ(x), φ(z)⟩ for some φ : X → H.

Useful properties: kernels form a closed convex cone; the kernel matrix K_ij = k(x_i, x_j) is a natural interface; X can be any non-empty set.
Examples of kernel functions

Linear kernel: k(x, z) = ⟨x, z⟩
Gaussian kernel: k(x, z) = exp(−‖x − z‖₂² / 2σ²)
Laplacian kernel: k(x, z) = exp(−‖x − z‖₁ / σ)

[Figures: kernel profiles k(x, z) and sample functions drawn from each kernel]
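As a concrete illustration, here is a minimal NumPy/SciPy sketch of the Gaussian and Laplacian kernel matrices (function names are ours, for illustration only, not the qmmlpack API):

```python
import numpy as np
from scipy.spatial.distance import cdist

def gaussian_kernel(X, Z, sigma):
    """K[i, j] = exp(-||X[i] - Z[j]||_2^2 / (2 sigma^2))"""
    return np.exp(-cdist(X, Z, "sqeuclidean") / (2 * sigma**2))

def laplacian_kernel(X, Z, sigma):
    """K[i, j] = exp(-||X[i] - Z[j]||_1 / sigma)"""
    return np.exp(-cdist(X, Z, "cityblock") / sigma)
```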
Regression with kernels

Linear ridge regression: for models
  f(x) = ∑_{i=1}^d β_i x_i,
minimizing
  min_{β ∈ ℝ^d} ∑_{i=1}^n (f(x_i) − y_i)² + λ‖β‖²
yields
  β = (XᵀX + λI)⁻¹ Xᵀy.

Writing the weight vector as a combination of training samples, β = ∑_{j=1}^n α_j x_j, expresses the model through inner products with the training data:
  f(x) = ∑_{i=1}^d β_i x_i = ∑_{i=1}^d ∑_{j=1}^n α_j (x_j)_i x_i = ∑_{j=1}^n α_j ⟨x_j, x⟩.

Kernel ridge regression: for models
  f(x) = ∑_{i=1}^n α_i k(x_i, x),
minimizing
  min_{α ∈ ℝ^n} ∑_{i=1}^n (f(x_i) − y_i)² + λ‖α‖²_H
yields
  α = (K + λI)⁻¹ y.
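The closed-form solution is a single linear solve. A minimal sketch in NumPy, using a Cholesky factorization of the regularized kernel matrix (function names are ours):

```python
import numpy as np

def krr_train(K, y, lam):
    """Solve (K + lam * I) alpha = y via Cholesky factorization."""
    n = K.shape[0]
    L = np.linalg.cholesky(K + lam * np.eye(n))
    return np.linalg.solve(L.T, np.linalg.solve(L, y))

def krr_predict(K_new, alpha):
    """f(z) = sum_i alpha_i k(x_i, z), with K_new[a, i] = k(x_i, z_a)."""
    return K_new @ alpha
```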
A basis function picture of kernel regression
Weighted basis functions placed on training samples xi
[Figure: • training samples (x_i, y_i); — learned f(x); — basis functions k(x_i, x); - - - prediction f̂(x) = ∑_i α_i k(x_i, x)]

[Vu et al., Int. J. Quant. Chem., 2015; Rupp, Int. J. Quant. Chem., 2015]
Representer theorem
Kernel models have the form
  f(z) = ∑_{i=1}^n α_i k(x_i, z)
due to the representer theorem:

Any function minimizing a regularized risk functional
  ℓ((x_i, y_i, f(x_i))_{i=1}^n) + g(‖f‖)
admits the above representation.

Intuition:
• the model lives in the space spanned by the training data
• it is a weighted sum of basis functions

[Schölkopf, Herbrich & Smola, COLT 2001]
Centering in kernel feature space
Centering X and y is equivalent to having a bias term b.
For kernel models, center in the kernel feature space:

  k̃(x, z) = ⟨φ(x) − (1/n) ∑_{i=1}^n φ(x_i), φ(z) − (1/n) ∑_{i=1}^n φ(x_i)⟩
  ⇒ K̃ = (I − (1/n) 1) K (I − (1/n) 1),

where 1 denotes the n × n matrix of ones.

Some kernels, such as the Gaussian and Laplacian kernels, do not need centering. [Poggio et al., Tech. Rep., 2001]
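In matrix form, centering is one line; a minimal sketch for a symmetric kernel matrix over the training set:

```python
import numpy as np

def center_kernel(K):
    """K~ = (I - J/n) K (I - J/n), with J the all-ones matrix."""
    n = K.shape[0]
    H = np.eye(n) - np.ones((n, n)) / n
    return H @ K @ H
```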
Representations
Representations and Hilbert spaces
A numerical representation of arbitrary poly-atomic systems is required; its corresponding Hilbert space serves as the feature space.

There is no unique solution. Example: (a, b) ↦ (a², √2 ab, b²) and (a, b) ↦ (a², ab, ab, b²) both induce the same kernel (see the sketch after this list).

Several state-of-the-art representations exist:
• symmetry functions
• bispectrum, smooth overlap of atomic positions
• moment tensor potentials
• many-body tensor and distribution-based representations
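A quick numerical check that the two feature maps above induce the same kernel, namely ⟨φ(x), φ(z)⟩ = ⟨x, z⟩²:

```python
import numpy as np

def phi1(a, b):  # (a, b) -> (a^2, sqrt(2) a b, b^2)
    return np.array([a**2, np.sqrt(2) * a * b, b**2])

def phi2(a, b):  # (a, b) -> (a^2, a b, a b, b^2)
    return np.array([a**2, a * b, a * b, b**2])

x, z = (1.0, 2.0), (3.0, 4.0)
# both inner products equal <x, z>^2 = (1*3 + 2*4)^2 = 121
print(phi1(*x) @ phi1(*z), phi2(*x) @ phi2(*z))
```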
Hilbert spaces for arbitrary atomistic systems
Representation = numerical encoding of atomistic system for accurate interpolation
Requirements:

(i) Invariance against transformations that preserve the property: translation, rotation, homonuclear permutations
(ii) Uniqueness: different in property ⇒ different in representation (reconstructability)
(iii) Smoothness: continuous, ideally differentiable
(iv) Generality: works with any atomistic system (molecules, periodic systems)
(v) Efficiency: fast to compute, requires few reference calculations
(vi) Simplicity: conceptually straightforward

[Bartók et al., Phys. Rev. B, 2013; Shapeev, Multiscale Model. Simul., 2016; Behler, J. Chem. Phys., 2011]
The many-body tensor representation
Distribution of k-body terms stratified by chemical elements:
  f_k(x, z) = ∑_{i=1}^{N_a} w_k(i) D(x, g_k(i)) ∏_{j=1}^k C_{z_j, Z_{i_j}}

[Figure: broadened 2-body distributions of inverse distances for water, resolved by element pairs OH, HH, HO]

[Huo & Rupp, arXiv, 2017]
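To make the idea concrete, here is a simplified 2-body sketch: a broadened distribution of inverse pairwise distances, ignoring the weighting and element stratification of the full MBTR (an illustration, not the qmmlpack implementation):

```python
import numpy as np

def inverse_distance_distribution(positions, grid, sigma=0.05):
    """Gaussian-broadened distribution of inverse pairwise distances.

    positions: (n, 3) array of atomic coordinates
    grid:      1D array of inverse-distance values (1/Angstrom)
    """
    g = np.zeros_like(grid)
    n = len(positions)
    for i in range(n):
        for j in range(i + 1, n):
            d = np.linalg.norm(positions[i] - positions[j])
            g += np.exp(-((grid - 1.0 / d) ** 2) / (2 * sigma**2))
    return g
```

The full representation additionally applies the weighting w_k and the element-correlation factors C_{z_j, Z_{i_j}} from the formula above.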
Examples: molecules and bulk crystals
[Figures: MBTR distributions for an organic molecule, inverse distances and angles resolved by element combinations (HH; HC, CH; HO, OH; CC; CO, OC; OO, and the corresponding 3-body triples), and for bulk NaCl (NaNaNa, ClClCl; NaClCl, ClClNa; ClNaNa, NaNaCl; NaClNa, ClNaCl)]
Example: surfaces
[Yasunobu Ando, work in progress, 2018]
Model building
How regularization helps against overfitting
[Figure: noisy training samples in the (x, y) plane]
The effect of regularization
Underfitting: λ too large (0.123 / 0.443)
Fitting: λ right (0.044 / 0.068)
Overfitting: λ too small (0.036 / 0.939)

[Figure: three fits to the same data, with λ too large, suitable, and too small]

[Rupp, PhD thesis, 2009; Vu et al., Int. J. Quant. Chem., 1115, 2015]
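This behavior is easy to reproduce; a small sketch sweeping λ for a Gaussian-kernel KRR on noisy synthetic data, reusing the gaussian_kernel and krr_train sketches from above:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.uniform(0.0, 2.0, (20, 1))
y = np.sin(np.pi * X[:, 0]) + 0.1 * rng.standard_normal(20)
X_val = np.linspace(0.0, 2.0, 50)[:, None]
y_val = np.sin(np.pi * X_val[:, 0])

K = gaussian_kernel(X, X, 0.3)
K_val = gaussian_kernel(X_val, X, 0.3)
for lam in (1e1, 1e-3, 1e-10):  # too large / about right / too small
    alpha = krr_train(K, y, lam)
    rmse_train = np.sqrt(np.mean((K @ alpha - y) ** 2))
    rmse_val = np.sqrt(np.mean((K_val @ alpha - y_val) ** 2))
    print(f"lambda={lam:.0e}: train {rmse_train:.3f}, validation {rmse_val:.3f}")
```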
Validation
Why validate?

• Assess model performance
• Optimize free parameters (hyperparameters)

Which statistics to report?

• Root mean squared error (RMSE)
• Mean absolute error (MAE)
• Maximum absolute error
• Squared correlation coefficient (R²)

What else can we learn from validation?

• The distribution of errors, not only summary statistics
• Convergence of the error with the number of samples (learning curves)
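The reported statistics in one small sketch (function name is ours; R² here is the squared Pearson correlation, as on the slide):

```python
import numpy as np

def error_statistics(y_true, y_pred):
    """RMSE, MAE, maximum absolute error, squared correlation coefficient."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    res = y_pred - y_true
    return {
        "RMSE": np.sqrt(np.mean(res**2)),
        "MAE": np.mean(np.abs(res)),
        "MaxAE": np.max(np.abs(res)),
        "R2": np.corrcoef(y_true, y_pred)[0, 1] ** 2,
    }
```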
Validation
Golden rule: never use training data for validation.

Violating this rule leads to overfitting, because it measures flexibility in fitting rather than generalization ability. (Consider a rote learner that memorizes the training data: it fits perfectly but cannot generalize.)

If there is sufficient data:

• Divide the data into two subsets, training and validation
• Build the model on the training subset
• Estimate the error of the trained model on the validation subset

Sometimes an external validation set is used in addition.
Statistical validation
If there are too few data, statistical re-sampling methods can be used, such as cross-validation, bagging, bootstrapping, and jackknifing.

k-fold cross-validation:

• Divide the data into k evenly sized subsets
• For i = 1, …, k: build the model on the union of subsets {1, …, k} \ {i} and validate on subset i

All model-building steps must be repeated for each data split:

• All pre-processing, such as feature selection and centering
• Optimization of hyperparameters

[Hansen et al., J. Chem. Theor. Comput. 9(8): 3543, 2013]
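A minimal sketch of generating k-fold splits (illustrative; libraries such as scikit-learn provide equivalent utilities):

```python
import numpy as np

def kfold_indices(n, k, seed=0):
    """Yield (train, validation) index arrays for k-fold cross-validation."""
    perm = np.random.default_rng(seed).permutation(n)
    for fold in np.array_split(perm, k):
        yield np.setdiff1d(perm, fold), fold
```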
Hyperparameters: physically motivated choices
Length scale σ of the Gaussian kernel: choose it so that 2σ² ≈ typical ‖x − z‖².

Regularization strength λ: corresponds to the noise variance (Bayesian view), i.e. the leeway around y_i allowed when fitting ⇒ set from the target accuracy.

[Figure: RMSE as a function of log length scale and log regularization strength]

[Rupp, Int. J. Quant. Chem. 115(16): 1058, 2015]
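A data-driven version of the length-scale rule is the median heuristic (a sketch; factor conventions vary):

```python
import numpy as np
from scipy.spatial.distance import pdist

def sigma_median_heuristic(X):
    """Choose sigma such that 2 sigma^2 equals the median squared distance."""
    return np.sqrt(np.median(pdist(X, "sqeuclidean")) / 2.0)
```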
Hyperparameters: statistically motivated choices
• Data-driven choice of hyperparameters
• Grid search or gradient descent
• Statistical validation to estimate the error
• For validation and hyperparameter optimization, use nested data splits (a grid-search sketch follows below)

[Figure: RMSE over a grid of log length scale and log regularization strength]

[Rupp, Int. J. Quant. Chem. 115(16): 1058, 2015]
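A minimal grid search over (σ, λ) on a hold-out split, reusing the gaussian_kernel and krr_train sketches from above; nested validation would wrap this in an outer loop:

```python
import numpy as np

def grid_search(X_tr, y_tr, X_val, y_val, sigmas, lams):
    """Return (best_rmse, best_sigma, best_lambda) on the validation split."""
    best = (np.inf, None, None)
    for sigma in sigmas:
        K = gaussian_kernel(X_tr, X_tr, sigma)
        K_val = gaussian_kernel(X_val, X_tr, sigma)
        for lam in lams:
            alpha = krr_train(K, y_tr, lam)
            rmse = np.sqrt(np.mean((K_val @ alpha - y_val) ** 2))
            if rmse < best[0]:
                best = (rmse, sigma, lam)
    return best
```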
Nested data splits
• Never use data from training in validation
• For performance assessment and hyperparameter optimization, use nested cross-validation or nested hold-out sets
• Beware of overfitting

Example 1: plain overfitting
✗ Train on all data, predict all data
✓ Split the data, train, predict

Example 2: centering (see the sketch below)
✗ Center the data, split, train & predict
✓ Split the data, center the training set, train, center the test set with training statistics, predict

Example 3: cross-validation with feature selection
✗ Feature selection first, then cross-validation
✓ Feature selection within each split of the cross-validation
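Example 2 in code; a sketch that computes the centering constant from the training split only (reusing the KRR helpers above; sigma and lam are assumed chosen beforehand):

```python
import numpy as np

def fit_predict_centered(X_train, y_train, X_test, sigma, lam):
    """Center labels with training statistics only, then train and predict."""
    mu = y_train.mean()  # training-split mean, never the full-data mean
    K = gaussian_kernel(X_train, X_train, sigma)
    alpha = krr_train(K, y_train - mu, lam)
    return gaussian_kernel(X_test, X_train, sigma) @ alpha + mu
```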
Property prediction
Examples
• Screening: chemical interpolation [Rupp et al., Phys. Rev. Lett. 108(5): 058301, 2012]
• Dynamics: potential energy surfaces [Behler, Phys. Chem. Chem. Phys. 13(40): 17930, 2011]
• Crystal structure classification [Ghiringhelli et al., Phys. Rev. Lett. 114(10): 105503, 2015]
• Density functional theory: kinetic energies [Snyder et al., Phys. Rev. Lett. 108(25): 253002, 2012]
• Transition state theory: dividing surfaces [Pozun et al., J. Chem. Phys. 136(17): 174101, 2012]
• Amorphous systems: glassy liquids [Schoenholz et al., Nat. Phys. 12(5): 469, 2016]
• Design: stable interface search [Kiyohara et al., Jpn. J. Appl. Phys. 55(4): 045502, 2016]
• Density functional theory: adaptive basis sets [Schütt & VandeVondele, J. Chem. Theor. Comput., 2018]
• Self-learning Monte Carlo method [Liu et al., Phys. Rev. B 95(4): 041101(R), 2017]
• Phase transitions: out-of-equilibrium phases [Venderley et al., Phys. Rev. Lett. 120(15): 257204, 2018]
• Slow dynamics: dimensionality reduction [Wehmeyer & Noé, J. Chem. Phys. 148(24): 241703, 2018]
• Quantum state tomography [Torlai et al., Nat. Phys. 14(5): 447, 2018]
• Prediction of tensorial properties [Grisafi et al., Phys. Rev. Lett. 120(3): 036002, 2017]

and many more…
The combinatorial nature of chemical and materials space

[Figure: molecules decoded from randomly sampled points in the latent space of a variational autoencoder, near a given molecule (aspirin, 2-(acetyloxy)benzoic acid)]
Molecule space: graph theory
Materials space: group theory
Combinatorial explosion: machine learning

Left: aspirin derivatives
Learning potential energy surfaces
[Figure: Chang, von Lilienfeld, CHIMIA 68(9): 602, 2014]
Enthalpies of formation of small organic molecules
7211 small organic molecules; C, N, O, S, Cl, saturated with H.
Relaxed geometries; formation energies at the DFT/PBE level of theory.

                             E / kcal mol⁻¹     α / Å³
Representation   Kernel      RMSE    MAE        RMSE   MAE
CM [1]           Laplacian   4.76    3.47       0.17   0.13
BoB [2]          Laplacian   2.86    1.79       0.12   0.09
BAML [3]         Laplacian   2.54    1.15       0.12   0.07
SOAP [4]         REMatch     1.61    0.92       0.07   0.05
MBTR             Linear      1.14    0.74       0.10   0.07
MBTR             Gaussian    0.97    0.60       0.06   0.04

[1] Rupp et al., Phys. Rev. Lett. 108(5): 058301, 2012; [2] Hansen et al., J. Phys. Chem. Lett. 6(12): 2326, 2015; [3] Huang & von Lilienfeld, J. Chem. Phys. 145(16): 161102, 2016; [4] De et al., Phys. Chem. Chem. Phys. 18(20): 13754, 2016
Potential energy surfaces of organic molecules
Ab initio molecular dynamics; DFT/PBE with Tkatchenko-Scheffler van der Waals forces. All errors in kcal/mol.

                    DTNN    GDML       MBTR linear     MBTR Gaussian
Kernel              –       Matérn
Molecule            MAE     MAE        RMSE    MAE     RMSE    MAE
benzene             0.04    0.07       0.06    0.05    0.05    0.04
uracil              –       0.11       0.14    0.10    0.06    0.04
naphthalene         –       0.12       0.15    0.11    0.15    0.11
aspirin             –       0.27       0.26    0.18    0.32    0.22
salicylic acid      0.50    0.12       0.17    0.12    0.11    0.08
malonaldehyde       0.19    0.16       0.28    0.21    0.13    0.10
ethanol             –       0.15       0.22    0.16    0.10    0.07
toluene             0.18    0.12       0.16    0.12    0.15    0.11

DTNN = Deep Tensor Neural Network, Schütt et al., Nat. Commun. 8: 13890, 2017; GDML = Gradient Domain Machine Learning, Chmiela et al., Sci. Adv. 3(5): e1603015, 2017
Energies of crystalline systems
[Figure: parity plot of ML versus DFT energies (eV/atom), colored by error in meV/atom]

Periodic systems:

Elpasolites ABC₂D₆: 8 meV/atom. Density functional theory, ground state; 12 elements.
Binary alloys: 1–19 meV/atom. Unrelaxed geometries; with Gus Hart et al.
Al-Ga-In conducting oxides: 27 meV/atom. NOMAD 2018 Kaggle challenge, Vegard's rule.
Reliability
"It is not the estimate that matters so much, as the degree of confidence with the opinion" (N. Taleb, 2004)

Uncertainty estimates: extrapolation versus interpolation; domain of applicability.

Left: logarithmized predictive variance versus signed error for energy predictions [Snyder et al., Phys. Rev. Lett., 2012]
Interpretability
Explain individual predictions: is the model right for the right reason?

Sensitivity analysis: local gradients explain variation, but not the "why".

How to quantify interpretability? How to interpret basis set expansions?
Summary
[Figures: recap of earlier slides: property-versus-structure interpolation, kernel feature map, MBTR distributions, hyperparameter grid, crystal-energy parity plot]

Interpolation of ab initio simulations: accurate and fast, via machine learning
Learning with kernels: the kernel trick, implicit transformations
Representations: many-body tensor representation
Model building: regularization, validation, hyperparameters
Energy predictions: molecules, crystals, surfaces
Acknowledgements
Group: Haoyan Huo (former), Lucas Deecke (former), Marcel Langer (current), Alex Gößmann (current)

Funding
Tutorial: M. Rupp: Machine Learning for Quantum Mechanics in a Nutshell, Int. J. Quant. Chem. 115(16): 1058–1073, 2015. DOI 10.1002/qua.24954

Special Issue: Rupp, von Lilienfeld, Burke: Guest Editorial: Special Topic on Data-enabled Theoretical Chemistry, J. Chem. Phys. 148(24): 241401, 2018. DOI 10.1063/1.5043213

Code: Efficient C++ implementation with Python bindings, available as part of my qmmlpack package. To appear this year.

Datasets: Public datasets of ab initio calculations for molecules, solids, and liquids are available at qmml.org and nomad-coe.eu.