Machine Learning for Ab Initio Simulations
Matthias Rupp, Fritz Haber Institute of the Max Planck Society, Berlin, Germany
Outline

1. Rationale: ab initio simulations, machine learning
2. Kernel learning: kernel trick, kernel ridge regression
3. Representations: many-body tensor representation
4. Model building: overfitting, validation, free parameters
5. Property prediction: energies of molecules and crystals

Hands-on session: property prediction with qmmlpack
Rationale
Challenges in quantum mechanical simulations

High-throughput screening [Castelli et al., Energy Environ. Sci. 12, 2013]
Large systems [Image: Tarini et al., IEEE Trans. Visual. Comput. Graph., 2006]
Long simulations [Liwo et al., Proc. Natl. Acad. Sci. USA 102: 2362, 2005]
Quantum effects [Image: Hiller et al., Nature 476: 236, 2011]
Accurate simulations are limited by computational cost
Methods span a trade-off from accuracy to speed:
Full Configuration Interaction → Coupled Cluster → Møller-Plesset second-order perturbation theory → Kohn-Sham Density Functional Theory → Tight Binding → Molecular Mechanics

Is it possible to be both accurate and fast? In high-throughput settings, machine learning offers a new trade-off.
Interpolating electronic structure calculations with machine learning
Correlated calculations can be rapidly and accurately interpolated
[Figure: property versus structure; • reference calculations, — ground truth, - - - machine learning model]
[Rupp, Int J Quant Chem, 2015]
Assumes that similar structures have similar properties, requiring a corresponding space or metric.
Flexible functional form, constrained by a smoothness assumption.
The physics enters through the examples; flexibility versus physical constraints.
Relationship with other models
ab initio: generally applicable; no or little fitting; form from physics; deductive; few or no parameters; high accuracy; slow; small systems

force fields: limited domain; fitted to one class; form from physics; mostly deductive; some parameters; limited accuracy; fast; large systems

machine learning: generally applicable; refitted to any dataset; form from statistics; inductive; many parameters; high accuracy; in-between; large systems
Machine learning
Machine learning (ML) studies algorithms whose performance improves with data ("learning from experience"). [Mitchell, McGraw Hill, 1997]
Data {(x_1, y_1), …, (x_m, y_m)} → learning algorithm → model f̂ : x ↦ y

• No explicitly programmed problem-specific solution
• Systematic identification of regularity in data, for prediction and analysis
• Interpolation in high-dimensional spaces
Problem types
Unsupervised learning: data do not have labels; given {x_i}_{i=1}^n, find structure. Dimensionality reduction, clustering.

Supervised learning: data have labels; given {(x_i, y_i)}_{i=1}^n, predict ỹ for new x̃. Regression, classification.

Active learning: the algorithm chooses which data to label; choose n data {x_i}_{i=1}^n to predict ỹ for new x̃. Bayesian optimization.
Learning theory
[Figure: model class F containing models B, C, D, with true model A outside the class]

prediction error = approximation error a + estimation error e + optimization error o

F = model class, A = true model, B = best model in class, C = best identifiable model (data), D = best identifiable model (optimization)

Changing the size of F trades a against e: the bias-variance trade-off. [Bottou & Bousquet, NIPS 2007]
Kernel learning
Learning with kernels
The kernel trick:

• Transform samples into a higher-dimensional space
• Implicitly compute inner products there
• Rewrite the linear algorithm using only inner products

[Figure: input space X (x from −2π to 2π) mapped by φ into feature space H; plot shows sin x]

k : X × X → R,  k(x, z) = ⟨φ(x), φ(z)⟩

[Schölkopf & Smola: Learning with Kernels, 2002; Hofmann et al.: Ann. Stat., 2008]
Kernel functions
[Figure: vectors x and z at angle θ, illustrating ‖x − z‖₂, ‖x‖₂, ‖z‖₂, and the projection ‖z‖₂ cos θ]

Kernels correspond to inner products and encode information about lengths and angles:
‖x − z‖² = ⟨x, x⟩ − 2⟨x, z⟩ + ⟨z, z⟩,  cos θ = ⟨x, z⟩ / (‖x‖ ‖z‖)

If k : X × X → R is symmetric positive semi-definite, then k(x, z) = ⟨φ(x), φ(z)⟩ for some φ : X → H.

Useful properties: kernels form a closed convex cone; the kernel matrix K_ij = k(x_i, x_j) is a natural interface; X can be any non-empty set.
Examples of kernel functions

Linear kernel: k(x, z) = ⟨x, z⟩
Gaussian kernel: k(x, z) = exp(−‖x − z‖₂² / 2σ²)
Laplacian kernel: k(x, z) = exp(−‖x − z‖₁ / σ)

[Figures: kernel profiles k(x, z) and sample functions drawn from each kernel]
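As a concrete illustration, here is a minimal NumPy/SciPy sketch of the Gaussian and Laplacian kernel matrices (function names are ours, for illustration only, not the qmmlpack API):

```python
import numpy as np
from scipy.spatial.distance import cdist

def gaussian_kernel(X, Z, sigma):
    """K[i, j] = exp(-||X[i] - Z[j]||_2^2 / (2 sigma^2))"""
    return np.exp(-cdist(X, Z, "sqeuclidean") / (2 * sigma**2))

def laplacian_kernel(X, Z, sigma):
    """K[i, j] = exp(-||X[i] - Z[j]||_1 / sigma)"""
    return np.exp(-cdist(X, Z, "cityblock") / sigma)
```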
Regression with kernels

Linear ridge regression: for models
  f(x) = ∑_{i=1}^d β_i x_i,
minimizing
  min_{β ∈ ℝ^d} ∑_{i=1}^n (f(x_i) − y_i)² + λ‖β‖²
yields
  β = (XᵀX + λI)⁻¹ Xᵀy.

Writing the weight vector as a combination of training samples, β = ∑_{j=1}^n α_j x_j, expresses the model through inner products with the training data:
  f(x) = ∑_{i=1}^d β_i x_i = ∑_{i=1}^d ∑_{j=1}^n α_j (x_j)_i x_i = ∑_{j=1}^n α_j ⟨x_j, x⟩.

Kernel ridge regression: for models
  f(x) = ∑_{i=1}^n α_i k(x_i, x),
minimizing
  min_{α ∈ ℝ^n} ∑_{i=1}^n (f(x_i) − y_i)² + λ‖α‖²_H
yields
  α = (K + λI)⁻¹ y.
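The closed-form solution is a single linear solve. A minimal sketch in NumPy, using a Cholesky factorization of the regularized kernel matrix (function names are ours):

```python
import numpy as np

def krr_train(K, y, lam):
    """Solve (K + lam * I) alpha = y via Cholesky factorization."""
    n = K.shape[0]
    L = np.linalg.cholesky(K + lam * np.eye(n))
    return np.linalg.solve(L.T, np.linalg.solve(L, y))

def krr_predict(K_new, alpha):
    """f(z) = sum_i alpha_i k(x_i, z), with K_new[a, i] = k(x_i, z_a)."""
    return K_new @ alpha
```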
A basis function picture of kernel regression
Weighted basis functions placed on training samples xi
[Figure: • training samples (x_i, y_i); — learned f(x); — basis functions k(x_i, x); - - - prediction f̂(x) = ∑_i α_i k(x_i, x)]

[Vu et al., Int. J. Quant. Chem., 2015; Rupp, Int. J. Quant. Chem., 2015]
Representer theorem
Kernel models have the form
  f(z) = ∑_{i=1}^n α_i k(x_i, z)
due to the representer theorem:

Any function minimizing a regularized risk functional
  ℓ((x_i, y_i, f(x_i))_{i=1}^n) + g(‖f‖)
admits the above representation.

Intuition:
• the model lives in the space spanned by the training data
• it is a weighted sum of basis functions

[Schölkopf, Herbrich & Smola, COLT 2001]
Centering in kernel feature space
Centering X and y is equivalent to having a bias term b.
For kernel models, center in the kernel feature space:

  k̃(x, z) = ⟨φ(x) − (1/n) ∑_{i=1}^n φ(x_i), φ(z) − (1/n) ∑_{i=1}^n φ(x_i)⟩
  ⇒ K̃ = (I − (1/n) 1) K (I − (1/n) 1),

where 1 denotes the n × n matrix of ones.

Some kernels, such as the Gaussian and Laplacian kernels, do not need centering. [Poggio et al., Tech. Rep., 2001]
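In matrix form, centering is one line; a minimal sketch for a symmetric kernel matrix over the training set:

```python
import numpy as np

def center_kernel(K):
    """K~ = (I - J/n) K (I - J/n), with J the all-ones matrix."""
    n = K.shape[0]
    H = np.eye(n) - np.ones((n, n)) / n
    return H @ K @ H
```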
Representations
Representations and Hilbert spaces
A numerical representation of arbitrary poly-atomic systems is required; its corresponding Hilbert space serves as the feature space.

There is no unique solution. Example: (a, b) ↦ (a², √2 ab, b²) and (a, b) ↦ (a², ab, ab, b²) both induce the same kernel (see the sketch after this list).

Several state-of-the-art representations exist:
• symmetry functions
• bispectrum, smooth overlap of atomic positions
• moment tensor potentials
• many-body tensor and distribution-based representations
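A quick numerical check that the two feature maps above induce the same kernel, namely ⟨φ(x), φ(z)⟩ = ⟨x, z⟩²:

```python
import numpy as np

def phi1(a, b):  # (a, b) -> (a^2, sqrt(2) a b, b^2)
    return np.array([a**2, np.sqrt(2) * a * b, b**2])

def phi2(a, b):  # (a, b) -> (a^2, a b, a b, b^2)
    return np.array([a**2, a * b, a * b, b**2])

x, z = (1.0, 2.0), (3.0, 4.0)
# both inner products equal <x, z>^2 = (1*3 + 2*4)^2 = 121
print(phi1(*x) @ phi1(*z), phi2(*x) @ phi2(*z))
```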
Hilbert spaces for arbitrary atomistic systems
Representation = numerical encoding of atomistic system for accurate interpolation
Requirements:

(i) Invariance against transformations that preserve the property: translation, rotation, homonuclear permutations
(ii) Uniqueness: different in property ⇒ different in representation (reconstructability)
(iii) Smoothness: continuous, ideally differentiable
(iv) Generality: works with any atomistic system (molecules, periodic systems)
(v) Efficiency: fast to compute, requires few reference calculations
(vi) Simplicity: conceptually straightforward

[Bartók et al., Phys. Rev. B, 2013; Shapeev, Multiscale Model. Simul., 2016; Behler, J. Chem. Phys., 2011]
The many-body tensor representation
Distribution of k-body terms stratified by chemical elements:
  f_k(x, z) = ∑_{i=1}^{N_a} w_k(i) D(x, g_k(i)) ∏_{j=1}^k C_{z_j, Z_{i_j}}

[Figure: broadened 2-body distributions of inverse distances for water, resolved by element pairs OH, HH, HO]

[Huo & Rupp, arXiv, 2017]
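To make the idea concrete, here is a simplified 2-body sketch: a broadened distribution of inverse pairwise distances, ignoring the weighting and element stratification of the full MBTR (an illustration, not the qmmlpack implementation):

```python
import numpy as np

def inverse_distance_distribution(positions, grid, sigma=0.05):
    """Gaussian-broadened distribution of inverse pairwise distances.

    positions: (n, 3) array of atomic coordinates
    grid:      1D array of inverse-distance values (1/Angstrom)
    """
    g = np.zeros_like(grid)
    n = len(positions)
    for i in range(n):
        for j in range(i + 1, n):
            d = np.linalg.norm(positions[i] - positions[j])
            g += np.exp(-((grid - 1.0 / d) ** 2) / (2 * sigma**2))
    return g
```

The full representation additionally applies the weighting w_k and the element-correlation factors C_{z_j, Z_{i_j}} from the formula above.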
Examples: molecules and bulk crystals
[Figures: MBTR distributions for an organic molecule, inverse distances and angles resolved by element combinations (HH; HC, CH; HO, OH; CC; CO, OC; OO, and the corresponding 3-body triples), and for bulk NaCl (NaNaNa, ClClCl; NaClCl, ClClNa; ClNaNa, NaNaCl; NaClNa, ClNaCl)]
Example: surfaces
[Yasunobu Ando, work in progress, 2018]
Model building
How regularization helps against overfitting
[Figure: noisy training samples in the (x, y) plane]
The effect of regularization
Underfitting: λ too large (0.123 / 0.443)
Fitting: λ right (0.044 / 0.068)
Overfitting: λ too small (0.036 / 0.939)

[Figure: three fits to the same data, with λ too large, suitable, and too small]

[Rupp, PhD thesis, 2009; Vu et al., Int. J. Quant. Chem., 1115, 2015]
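This behavior is easy to reproduce; a small sketch sweeping λ for a Gaussian-kernel KRR on noisy synthetic data, reusing the gaussian_kernel and krr_train sketches from above:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.uniform(0.0, 2.0, (20, 1))
y = np.sin(np.pi * X[:, 0]) + 0.1 * rng.standard_normal(20)
X_val = np.linspace(0.0, 2.0, 50)[:, None]
y_val = np.sin(np.pi * X_val[:, 0])

K = gaussian_kernel(X, X, 0.3)
K_val = gaussian_kernel(X_val, X, 0.3)
for lam in (1e1, 1e-3, 1e-10):  # too large / about right / too small
    alpha = krr_train(K, y, lam)
    rmse_train = np.sqrt(np.mean((K @ alpha - y) ** 2))
    rmse_val = np.sqrt(np.mean((K_val @ alpha - y_val) ** 2))
    print(f"lambda={lam:.0e}: train {rmse_train:.3f}, validation {rmse_val:.3f}")
```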
Validation
Why validate?

• Assess model performance
• Optimize free parameters (hyperparameters)

Which statistics to report?

• Root mean squared error (RMSE)
• Mean absolute error (MAE)
• Maximum absolute error
• Squared correlation coefficient (R²)

What else can we learn from validation?

• The distribution of errors, not only summary statistics
• Convergence of the error with the number of samples (learning curves)
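The reported statistics in one small sketch (function name is ours; R² here is the squared Pearson correlation, as on the slide):

```python
import numpy as np

def error_statistics(y_true, y_pred):
    """RMSE, MAE, maximum absolute error, squared correlation coefficient."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    res = y_pred - y_true
    return {
        "RMSE": np.sqrt(np.mean(res**2)),
        "MAE": np.mean(np.abs(res)),
        "MaxAE": np.max(np.abs(res)),
        "R2": np.corrcoef(y_true, y_pred)[0, 1] ** 2,
    }
```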
Validation
Golden rule: never use training data for validation.

Violating this rule leads to overfitting, because it measures flexibility in fitting rather than generalization ability. (Consider a rote learner that memorizes the training data: it fits perfectly but cannot generalize.)

If there is sufficient data:

• Divide the data into two subsets, training and validation
• Build the model on the training subset
• Estimate the error of the trained model on the validation subset

Sometimes an external validation set is used in addition.
Statistical validation
If there are too few data, statistical re-sampling methods can be used, such as cross-validation, bagging, bootstrapping, and jackknifing.

k-fold cross-validation:

• Divide the data into k evenly sized subsets
• For i = 1, …, k: build the model on the union of subsets {1, …, k} \ {i} and validate on subset i

All model-building steps must be repeated for each data split:

• All pre-processing, such as feature selection and centering
• Optimization of hyperparameters

[Hansen et al., J. Chem. Theor. Comput. 9(8): 3543, 2013]
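A minimal sketch of generating k-fold splits (illustrative; libraries such as scikit-learn provide equivalent utilities):

```python
import numpy as np

def kfold_indices(n, k, seed=0):
    """Yield (train, validation) index arrays for k-fold cross-validation."""
    perm = np.random.default_rng(seed).permutation(n)
    for fold in np.array_split(perm, k):
        yield np.setdiff1d(perm, fold), fold
```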
Hyperparameters: physically motivated choices
Length scale σ of the Gaussian kernel: choose it so that 2σ² ≈ typical ‖x − z‖².

Regularization strength λ: corresponds to the noise variance (Bayesian view), i.e. the leeway around y_i allowed when fitting ⇒ set from the target accuracy.

[Figure: RMSE as a function of log length scale and log regularization strength]

[Rupp, Int. J. Quant. Chem. 115(16): 1058, 2015]
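A data-driven version of the length-scale rule is the median heuristic (a sketch; factor conventions vary):

```python
import numpy as np
from scipy.spatial.distance import pdist

def sigma_median_heuristic(X):
    """Choose sigma such that 2 sigma^2 equals the median squared distance."""
    return np.sqrt(np.median(pdist(X, "sqeuclidean")) / 2.0)
```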
Hyperparameters: statistically motivated choices
• Data-driven choice of hyperparameters
• Grid search or gradient descent
• Statistical validation to estimate the error
• For validation and hyperparameter optimization, use nested data splits (a grid-search sketch follows below)

[Figure: RMSE over a grid of log length scale and log regularization strength]

[Rupp, Int. J. Quant. Chem. 115(16): 1058, 2015]
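A minimal grid search over (σ, λ) on a hold-out split, reusing the gaussian_kernel and krr_train sketches from above; nested validation would wrap this in an outer loop:

```python
import numpy as np

def grid_search(X_tr, y_tr, X_val, y_val, sigmas, lams):
    """Return (best_rmse, best_sigma, best_lambda) on the validation split."""
    best = (np.inf, None, None)
    for sigma in sigmas:
        K = gaussian_kernel(X_tr, X_tr, sigma)
        K_val = gaussian_kernel(X_val, X_tr, sigma)
        for lam in lams:
            alpha = krr_train(K, y_tr, lam)
            rmse = np.sqrt(np.mean((K_val @ alpha - y_val) ** 2))
            if rmse < best[0]:
                best = (rmse, sigma, lam)
    return best
```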
Nested data splits
• Never use data from training in validation
• For performance assessment and hyperparameter optimization, use nested cross-validation or nested hold-out sets
• Beware of overfitting

Example 1: plain overfitting
✗ Train on all data, predict all data
✓ Split the data, train, predict

Example 2: centering (see the sketch below)
✗ Center the data, split, train & predict
✓ Split the data, center the training set, train, center the test set with training statistics, predict

Example 3: cross-validation with feature selection
✗ Feature selection first, then cross-validation
✓ Feature selection within each split of the cross-validation
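Example 2 in code; a sketch that computes the centering constant from the training split only (reusing the KRR helpers above; sigma and lam are assumed chosen beforehand):

```python
import numpy as np

def fit_predict_centered(X_train, y_train, X_test, sigma, lam):
    """Center labels with training statistics only, then train and predict."""
    mu = y_train.mean()  # training-split mean, never the full-data mean
    K = gaussian_kernel(X_train, X_train, sigma)
    alpha = krr_train(K, y_train - mu, lam)
    return gaussian_kernel(X_test, X_train, sigma) @ alpha + mu
```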
Property prediction
Examples
• Screening: chemical interpolation [Rupp et al., Phys. Rev. Lett. 108(5): 058301, 2012]
• Dynamics: potential energy surfaces [Behler, Phys. Chem. Chem. Phys. 13(40): 17930, 2011]
• Crystal structure classification [Ghiringhelli et al., Phys. Rev. Lett. 114(10): 105503, 2015]
• Density functional theory: kinetic energies [Snyder et al., Phys. Rev. Lett. 108(25): 253002, 2012]
• Transition state theory: dividing surfaces [Pozun et al., J. Chem. Phys. 136(17): 174101, 2012]
• Amorphous systems: glassy liquids [Schoenholz et al., Nat. Phys. 12(5): 469, 2016]
• Design: stable interface search [Kiyohara et al., Jpn. J. Appl. Phys. 55(4): 045502, 2016]
• Density functional theory: adaptive basis sets [Schütt & VandeVondele, J. Chem. Theor. Comput., 2018]
• Self-learning Monte Carlo method [Liu et al., Phys. Rev. B 95(4): 041101(R), 2017]
• Phase transitions: out-of-equilibrium phases [Venderley et al., Phys. Rev. Lett. 120(15): 257204, 2018]
• Slow dynamics: dimensionality reduction [Wehmeyer & Noé, J. Chem. Phys. 148(24): 241703, 2018]
• Quantum state tomography [Torlai et al., Nat. Phys. 14(5): 447, 2018]
• Prediction of tensorial properties [Grisafi et al., Phys. Rev. Lett. 120(3): 036002, 2017]

and many more…
The combinatorial nature of chemical and materials space

[Figure: molecules decoded from randomly sampled points in the latent space of a variational autoencoder, near a given molecule (aspirin, 2-(acetyloxy)benzoic acid)]
Molecule space: graph theory
Materials space: group theory
Combinatorial explosion: machine learning

Left: aspirin derivatives
Learning potential energy surfaces
[Figure: Chang, von Lilienfeld, CHIMIA 68(9): 602, 2014]
Enthalpies of formation of small organic molecules
7211 small organic molecules; C, N, O, S, Cl, saturated with H.
Relaxed geometries; formation energies at the DFT/PBE level of theory.

                             E / kcal mol⁻¹     α / Å³
Representation   Kernel      RMSE    MAE        RMSE   MAE
CM [1]           Laplacian   4.76    3.47       0.17   0.13
BoB [2]          Laplacian   2.86    1.79       0.12   0.09
BAML [3]         Laplacian   2.54    1.15       0.12   0.07
SOAP [4]         REMatch     1.61    0.92       0.07   0.05
MBTR             Linear      1.14    0.74       0.10   0.07
MBTR             Gaussian    0.97    0.60       0.06   0.04

[1] Rupp et al., Phys. Rev. Lett. 108(5): 058301, 2012; [2] Hansen et al., J. Phys. Chem. Lett. 6(12): 2326, 2015; [3] Huang & von Lilienfeld, J. Chem. Phys. 145(16): 161102, 2016; [4] De et al., Phys. Chem. Chem. Phys. 18(20): 13754, 2016
Potential energy surfaces of organic molecules
Ab initio molecular dynamics; DFT/PBE with Tkatchenko-Scheffler van der Waals forces. All errors in kcal/mol.

                    DTNN    GDML       MBTR linear     MBTR Gaussian
Kernel              –       Matérn
Molecule            MAE     MAE        RMSE    MAE     RMSE    MAE
benzene             0.04    0.07       0.06    0.05    0.05    0.04
uracil              –       0.11       0.14    0.10    0.06    0.04
naphthalene         –       0.12       0.15    0.11    0.15    0.11
aspirin             –       0.27       0.26    0.18    0.32    0.22
salicylic acid      0.50    0.12       0.17    0.12    0.11    0.08
malonaldehyde       0.19    0.16       0.28    0.21    0.13    0.10
ethanol             –       0.15       0.22    0.16    0.10    0.07
toluene             0.18    0.12       0.16    0.12    0.15    0.11

DTNN = Deep Tensor Neural Network, Schütt et al., Nat. Commun. 8: 13890, 2017; GDML = Gradient Domain Machine Learning, Chmiela et al., Sci. Adv. 3(5): e1603015, 2017
Energies of crystalline systems
[Figure: parity plot of ML versus DFT energies (eV/atom), colored by error in meV/atom]

Periodic systems:

Elpasolites ABC₂D₆: 8 meV/atom. Density functional theory, ground state; 12 elements.
Binary alloys: 1–19 meV/atom. Unrelaxed geometries; with Gus Hart et al.
Al-Ga-In conducting oxides: 27 meV/atom. NOMAD 2018 Kaggle challenge, Vegard's rule.
Reliability
"It is not the estimate that matters so much, as the degree of confidence with the opinion" (N. Taleb, 2004)

Uncertainty estimates: extrapolation versus interpolation; domain of applicability.

Left: logarithmized predictive variance versus signed error for energy predictions [Snyder et al., Phys. Rev. Lett., 2012]
Interpretability
Explain individual predictions: is the model right for the right reason?

Sensitivity analysis: local gradients explain variation, but not the "why".

How to quantify interpretability? How to interpret basis set expansions?
Summary
[Figures: recap of earlier slides: property-versus-structure interpolation, kernel feature map, MBTR distributions, hyperparameter grid, crystal-energy parity plot]

Interpolation of ab initio simulations: accurate and fast, via machine learning
Learning with kernels: the kernel trick, implicit transformations
Representations: many-body tensor representation
Model building: regularization, validation, hyperparameters
Energy predictions: molecules, crystals, surfaces
Acknowledgements
Group: Haoyan Huo (former), Lucas Deecke (former), Marcel Langer (current), Alex Gößmann (current)

Funding
Tutorial: M. Rupp: Machine Learning for Quantum Mechanics in a Nutshell, Int. J. Quant. Chem. 115(16): 1058–1073, 2015. DOI 10.1002/qua.24954

Special Issue: Rupp, von Lilienfeld, Burke: Guest Editorial: Special Topic on Data-enabled Theoretical Chemistry, J. Chem. Phys. 148(24): 241401, 2018. DOI 10.1063/1.5043213

Code: Efficient C++ implementation with Python bindings, available as part of my qmmlpack package. To appear this year.

Datasets: Public datasets of ab initio calculations for molecules, solids, and liquids are available at qmml.org and nomad-coe.eu.