Big-Data Analytics in Materials Science Luca M. Ghiringhelli Fritz Haber Institute Hands-on workshop and Humboldt-Kolleg: Density-Functional Theory and Beyond - Basic Principles and Modern Insights Isfahan University of Technology, Isfahan, Iran, May 2 to 13, 2016


Data, data, data: big data

Big-data challenge, the four V's:
- Volume (amount of data)
- Variety (heterogeneity of form and meaning of data)
- Veracity (uncertainty of data quality)
- Velocity ?

High-throughput screening: query and read out what was stored

Shouldn't we do more?

Analysis

- Identify (so far) hidden correlations
- Identify which materials should be studied next as the most promising candidates
- Identify anomalies


We have a dream

From the periodic table of the elements to a chart of materials

Mendeleev's 1871 periodic table


Ga=69.7  Ge=72.6

From the periodic table of the elements to a chart of materials: organize materials according to their properties and functions, e.g.
- figure of merit of thermoelectrics (as a function of T)
- turnover frequency of catalytic materials (as a function of T and p)
- efficiency of photovoltaic systems

Big Data Analysis

1. Training set: calculate properties and functions Pi for many materials i (density-functional theory).
2. Descriptor: find the appropriate descriptor di and build a table: | i | di | Pi |
3. Learning: find the function PSL(d) for the table; do cross validation (statistical learning).
4. Fast prediction: calculate properties and functions for new values of d (new materials).

Learning → Discovery

(Orbital period)² = C (orbit's major axis)³

Suppose we know the trajectories of all planets in the solar system, either from accurate observations (experiment) or by numerically integrating the equations of general relativity (calculations at the highest level of theory).

Data (collected by Tycho Brahe)

Statistical learning (performed by Johannes Kepler)

Physical law (assessed by Isaac Newton)
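The "statistical learning" step can be mimicked with a straight-line fit in log-log space. The sketch below is only an illustration: the planetary data are rounded textbook values (semi-major axis in AU, period in years), not data from the slides, and the variable names are chosen here.

```python
import numpy as np

# Rounded textbook values for Mercury, Venus, Earth, Mars, Jupiter, Saturn
# (illustrative "data"; semi-major axis a in AU, orbital period T in years).
a = np.array([0.387, 0.723, 1.000, 1.524, 5.203, 9.537])
T = np.array([0.241, 0.615, 1.000, 1.881, 11.862, 29.457])

# Assume a power law T = C' * a^k and fit log T = k * log a + log C'
# by linear least squares.
k, log_c = np.polyfit(np.log(a), np.log(T), 1)

print(f"fitted exponent k = {k:.3f}")  # close to 3/2, i.e. T^2 proportional to a^3
```

The fit recovers the exponent of the law from data alone; it says nothing about why the exponent is 3/2, which is the step attributed to Newton.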

Databases, platforms

“Just databases”
- ICSD, Inorganic Crystal Structure DB: http://icsd.fiz-karlsruhe.de
- COD, Crystallography Open DB: http://www.crystallography.net/
- ESP, Electronic Structure Project: http://gurka.fysik.uu.se/ESP/
- CCCBDB, Comp. Chemistry Comparison and Benchmark DB: http://cccbdb.nist.gov/

Databases + analytic tools
- Materials Project: http://www.materialsproject.org
- AFLOW, Automatic Flow for Materials Discovery: http://aflowlib.org
- AiiDA, Automated Interactive Infrastructure and DB for Atomistic Simulations: http://www.aiida.net
- OQMD, Open Quantum Materials DB: http://oqmd.org/
- NOMAD: http://nomad-repository.eu and http://nomad-coe.eu

Code-dependent raw data → conversion layer → code-independent representation

Outline

Descriptors and fingerprints

A (personal-taste compiled) zoo of machine learning / data mining techniques

Regularised regression

Linear and non-linear dimensionality reduction

Feature selection

Some words on causal descriptor-property relationship

Descriptors

Can we predict an optimal material for a complex process (e.g. heterogeneous catalysis) by looking at a simple (set of) descriptor(s)?


A simple but insightful descriptor

(Genetic-like) fingerprint: 1D polymers “eugenetics”

Data: 175 linear 4-block periodic polymers. 7 blocks: CH2, SiF2, SiCl2, GeF2, GeCl2, SnF2, SnCl2.

Descriptor: 20 dimensions [# building blocks of type i, of ii pairs, of iii triplets]

Pilania, Wang, …, and Ramprasad, Scientific Reports 3, 2810 (2013). DOI: 10.1038/srep02810

(Genetic-like) fingerprint

Isayev, …, and Curtarolo, Chemistry of Materials 27, 735 (2015)

Machine learning / data mining: a classification

Focus on “learning”: the algorithm has to improve with data size (“learning by experience”)

Supervised learning (d → P mapping)
- Support vector machines
- Neural networks
- Decision trees
- Genetic programming (symbolic regression)
- Kernel ridge regression
- Compressed sensing

Unsupervised learning (d → d'; find patterns / trends)
- Principal-components analysis
- Non-linear dimensionality reduction: sketch map
- Clustering
- Local pattern discovery

Machine learning / data mining: a classification

Supervised learning (d → P mapping)
- Kernel ridge regression
- Compressed sensing (+ symbolic regression)

Unsupervised learning (d → d'; find patterns / trends)
- Principal-components analysis
- Non-linear dimensionality reduction: sketch map

Ridge Regression: Mathematical formulation

(Linear) ridge regression

Figure of merit to be optimized:

Regularization via a norm (prefer “lower complexity” in the solution)

Explicit solver:

Alternative view, via Hilbert space representation theorem:

Sum over data points!
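The formulas on this slide did not survive extraction, so the following is a sketch under the standard textbook formulation, which is an assumption here: minimize ||P - D c||² + λ ||c||², with rows of D the descriptors d_i. The explicit solution is c = (DᵀD + λI)⁻¹ DᵀP; the alternative view writes c as a sum over data points, c = Σ_i α_i d_i with α = (DDᵀ + λI)⁻¹ P. The symbols c, α, λ and all sizes below are chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
N, M = 30, 5                      # N data points, M descriptor components (toy sizes)
D = rng.normal(size=(N, M))       # row i is the descriptor d_i
P = D @ rng.normal(size=M) + 0.1 * rng.normal(size=N)  # synthetic property values
lam = 0.5                         # regularization strength (hypothetical value)

# Explicit solver: c = (D^T D + lam I)^-1 D^T P  (an M x M linear system)
c_primal = np.linalg.solve(D.T @ D + lam * np.eye(M), D.T @ P)

# Alternative view: c = sum_i alpha_i d_i, alpha = (D D^T + lam I)^-1 P  (N x N system)
alpha = np.linalg.solve(D @ D.T + lam * np.eye(N), P)
c_dual = D.T @ alpha

print(np.allclose(c_primal, c_dual))
```

Both routes give the same coefficients. The second one only needs the inner products d_i · d_j, which is what allows them to be replaced by a kernel.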

Kernel Ridge Regression: Mathematical formulation

Linear kernel

Non-linear kernels:
- Gaussian (radial basis function) kernel
- Laplacian kernel
- Polynomial kernel

In all cases, a kernel introduces a similarity measure
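The kernel formulas themselves are missing from the extracted slides. The sketch below uses common conventions, which are assumptions here: Gaussian exp(-||x-y||² / 2σ²), Laplacian exp(-||x-y||₁ / σ), polynomial (x·y + c0)^degree. The function names, parameter values, and the toy data are illustrative.

```python
import numpy as np

def linear_kernel(X, Y):
    return X @ Y.T

def gaussian_kernel(X, Y, sigma=1.0):
    sq = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1)   # squared Euclidean distances
    return np.exp(-sq / (2.0 * sigma**2))

def laplacian_kernel(X, Y, sigma=1.0):
    l1 = np.abs(X[:, None, :] - Y[None, :, :]).sum(-1)    # Manhattan distances
    return np.exp(-l1 / sigma)

def polynomial_kernel(X, Y, degree=3, c0=1.0):
    return (X @ Y.T + c0) ** degree

def krr_fit(K, P, lam):
    """Solve (K + lam I) alpha = P for the kernel ridge coefficients."""
    return np.linalg.solve(K + lam * np.eye(len(P)), P)

# Toy problem: learn sin(x) from 40 samples, then predict at two new points.
X = np.linspace(0.0, 2.0 * np.pi, 40)[:, None]
P = np.sin(X[:, 0])
alpha = krr_fit(gaussian_kernel(X, X, sigma=0.5), P, lam=1e-6)

X_new = np.array([[1.0], [2.5]])
P_new = gaussian_kernel(X_new, X, sigma=0.5) @ alpha      # sum over training points
print(P_new, np.sin(X_new[:, 0]))
```

The prediction is a weighted sum of similarities between the new point and every training point, so the choice of kernel and of its width decides which training points count as "similar".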

KRR success stories: Gaussian Approximation Potentials

Translationally, rotationally, and permutationally invariant, unique, smooth local-environment descriptor (spherically averaged spherical-harmonic expansion of Gaussian densities centered on nuclei).


KRR success stories: Molecular properties

KRR success stories: 1D polymers “eugenetics”

Data: 175 linear 4-block periodic polymers. 7 blocks: CH2, SiF2, SiCl2, GeF2, GeCl2, SnF2, SnCl2.

Descriptor: 20 dimensions [# building blocks of type i, of ii pairs, of iii triplets]

Pilania, Wang, …, and Ramprasad, Scientific Reports 3, 2810 (2013). DOI: 10.1038/srep02810

Regularized regression in practice: beware of overfitting

Regularized regression in practice: do validation
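The figures for these slides are not preserved. As a stand-in, the sketch below shows the usual symptom on synthetic data: the training error can only decrease as the model gets more flexible, while the error on held-out points need not. Polynomial degree is used as the complexity knob purely for illustration; the data, fold count, and function names are assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.uniform(-1.0, 1.0, size=40)
y = np.sin(3.0 * x) + 0.1 * rng.normal(size=x.size)   # synthetic noisy data

def rmse(a, b):
    return float(np.sqrt(np.mean((a - b) ** 2)))

def train_and_cv_error(x, y, degree, n_folds=10):
    """Training RMSE on all data and leave-10%-out cross-validation RMSE."""
    train = rmse(np.polyval(np.polyfit(x, y, degree), x), y)
    sq_err = []
    for held_out in np.array_split(np.arange(x.size), n_folds):
        mask = np.ones(x.size, dtype=bool)
        mask[held_out] = False                       # fit without the held-out points
        coeffs = np.polyfit(x[mask], y[mask], degree)
        sq_err.append((np.polyval(coeffs, x[held_out]) - y[held_out]) ** 2)
    return train, float(np.sqrt(np.mean(np.concatenate(sq_err))))

results = {deg: train_and_cv_error(x, y, deg) for deg in range(1, 10)}
for deg, (train, cv) in results.items():
    print(f"degree {deg}: training RMSE {train:.3f}, validation RMSE {cv:.3f}")
```

The same loop can be run over a regularization strength instead of a polynomial degree; in both cases the model is chosen by the validation column, not the training column.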


Showcase: classification of the crystal structures of octet binaries

The chemical space

Ansatz: atomic features
● Valence number Zv
● Energy of valence s orbital Es
● Energy of valence p orbital Ep
● Radius of valence s orbital rs
● Radius of valence p orbital rp

Primary (atomic) features, example: Sn (tin)

[Figure: Kohn-Sham levels in eV (valence s, valence p (HOMO), LUMO) and radial probability densities of the valence s and valence p orbitals versus radius in Å; marked radii: radius at maximum, average radius, turning point]


(Linear) dimensionality reduction: principal components

Principal component analysis

Pearson, K. "On Lines and Planes of Closest Fit to Systems of Points in Space". Philosophical Magazine 2, 559 (1901)

Orthonormal transformation of coordinates, converting a set of (possibly) linearly correlated coordinates into a new set of linearly uncorrelated components (called principal or normal components), such that the first component has the largest variance and each subsequent component has the largest variance subject to being orthogonal to all the preceding components.
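A minimal sketch of this definition, using the singular value decomposition of the centered data matrix (one standard way to compute principal components). The toy data and the function name `pca` are illustrative.

```python
import numpy as np

def pca(X, n_components):
    """Return projected data, principal axes (rows), and variances along them."""
    Xc = X - X.mean(axis=0)                           # center each coordinate
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    components = Vt[:n_components]                    # orthonormal directions
    variances = S[:n_components] ** 2 / (X.shape[0] - 1)
    return Xc @ components.T, components, variances

# Toy data: the first two coordinates are strongly correlated, the third is not.
rng = np.random.default_rng(2)
t = rng.normal(size=200)
X = np.column_stack([t, 2.0 * t + 0.1 * rng.normal(size=200), rng.normal(size=200)])

scores, components, variances = pca(X, 2)
print(components)   # each row mixes (possibly all) the original coordinates
print(variances)    # sorted from largest to smallest
```

Each principal axis is a linear combination of the original coordinates, which is why the resulting axes are often hard to interpret physically.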

Saad, …, Chelikowsky, and Andreoni, PRB 85, 104104 (2012)

[Figure: principal components 1, 2, 3 on an arbitrary (linear) scale]

rs, rp, Es/Zv, Ep/Zv, for A and B atoms

What's on the axes?

Linear combination of (possibly all) the initial dimensions


(Non-linear) dimensionality reduction

Proximity matching

Sketch-map algorithm

Minimization of the stress function (for a set of landmark points)
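The stress function itself is not preserved in the extracted text. The sketch below assumes a form commonly quoted for sketch-map: a weighted sum over landmark pairs of [F(R_ij) - f(r_ij)]², where R_ij and r_ij are distances in the high- and low-dimensional spaces and F, f are sigmoid functions s(r) = 1 - (1 + (2^(a/b) - 1)(r/σ)^a)^(-b/a). Both the functional form and every parameter value here are assumptions, and only the evaluation of the stress is shown, not its minimization.

```python
import numpy as np

def sigmoid(r, sigma, a, b):
    # Assumed sketch-map-style switching function; equals 0.5 at r = sigma.
    return 1.0 - (1.0 + (2.0 ** (a / b) - 1.0) * (r / sigma) ** a) ** (-b / a)

def pairwise_distances(X):
    diff = X[:, None, :] - X[None, :, :]
    return np.sqrt((diff ** 2).sum(-1))

def stress(X_high, x_low, sigma=1.0, A=2, B=4, a=2, b=4, weights=None):
    """Weighted mismatch between transformed high-D and low-D distances."""
    iu = np.triu_indices(X_high.shape[0], k=1)       # each landmark pair once
    F = sigmoid(pairwise_distances(X_high)[iu], sigma, A, B)
    f = sigmoid(pairwise_distances(x_low)[iu], sigma, a, b)
    w = np.ones_like(F) if weights is None else weights[iu]
    return float(np.sum(w * (F - f) ** 2) / np.sum(w))

# Landmarks that lie in a plane embedded in 3D: the exact 2D coordinates give zero stress.
rng = np.random.default_rng(3)
landmarks = rng.normal(size=(20, 3))
landmarks[:, 2] = 0.0
print(stress(landmarks, landmarks[:, :2]))           # perfect projection
print(stress(landmarks, rng.normal(size=(20, 2))))   # random projection, larger stress
```

The sigmoids make the stress insensitive to very short and very long distances, so the projection concentrates on reproducing proximity at the length scale σ.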

From clusters to defects in bulk

The high-dimensional representation is still an important choice

What's on the axes?


What about a dimensionality reduction, or call it feature selection, in which the (best) low-dimensional representation is selected among (many, many) given candidates?

It is time for: compressed sensing

Reference: L. M. Ghiringhelli, J. Vybiral, S. V. Levchenko, C. Draxl, and M. Scheffler, Phys. Rev. Lett. 114, 105503 (2015). Don't overlook the Supplementary Information!

Proof of concept: descriptor for the classification “Zincblende/Wurtzite or Rocksalt?”

82 octet AB binary compounds

[Figure: the compounds in a (d1, d2) plane, labelled Rocksalt (RS), Zincblende (ZB), and RS/ZB; quantity shown: E(Rocksalt) - E(Zincblende)]

d1, d2 = ?

J. A. van Vechten, Phys. Rev. 182, 891 (1969).
J. C. Phillips, Rev. Mod. Phys. 42, 317 (1970).
J. St. John and A. N. Bloch, Phys. Rev. Lett. 33, 1095 (1974).
A. Zunger, Phys. Rev. B 22, 5839 (1980).
D. G. Pettifor, Solid State Commun. 51, 31 (1984).
Y. Saad, D. Gao, T. Ngo, S. Bobbitt, J. R. Chelikowsky, and W. Andreoni, Phys. Rev. B 85, 104104 (2012).

Ansatz: atomic features
● HOMO
● LUMO
● Ionization Potential
● Electron Affinity
● Radius of valence s orbital
● Radius of valence p orbital
● Radius of valence d orbital
● … ?

Mathematical formulation of the problem

(Linear) ridge regression

Figure of merit to be optimized:

Regularization (prefer “lower complexity” in the solution)

A more complex regularization:

NP-hard!

Mathematical formulation of the problem: sparsity

LASSO: a convex problem, equivalent to the NP-hard one if the features (columns of D) are uncorrelated
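A minimal LASSO sketch, assuming the usual objective ½ ||P - D c||² + λ ||c||₁ (the slide's own formula is not preserved) and solving it by cyclic coordinate descent with soft thresholding. The toy matrix D is built with orthonormal, hence uncorrelated, columns; all names and numbers are illustrative.

```python
import numpy as np

def soft_threshold(z, t):
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def lasso_cd(D, P, lam, n_sweeps=200):
    """Cyclic coordinate descent for 0.5*||P - D c||^2 + lam*||c||_1."""
    c = np.zeros(D.shape[1])
    col_sq = (D ** 2).sum(axis=0)
    residual = P - D @ c
    for _ in range(n_sweeps):
        for j in range(D.shape[1]):
            residual += D[:, j] * c[j]               # take feature j out of the fit
            rho = D[:, j] @ residual
            c[j] = soft_threshold(rho, lam) / col_sq[j]
            residual -= D[:, j] * c[j]               # put the updated feature back
    return c

rng = np.random.default_rng(0)
D, _ = np.linalg.qr(rng.normal(size=(50, 10)))       # orthonormal (uncorrelated) columns
c_true = np.zeros(10)
c_true[[2, 7]] = [3.0, -2.0]                         # only two features matter
P = D @ c_true + 0.01 * rng.normal(size=50)

c = lasso_cd(D, P, lam=0.5)
print(np.flatnonzero(c))                             # indices of the selected features
```

The ℓ1 penalty sets most coefficients exactly to zero, so the non-zero entries of c act as the feature selection. With correlated columns the selected set is no longer guaranteed to match the sparsest solution.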


LASSO, compressed/ive sensing in Materials Science

The task

- Find a descriptor AND an accurate evaluation for the difference in energy between RS and ZB crystal structures for all (82) AB octet semiconductors: ΔE = ΔE(d)
- Possibly identify a 2D descriptor which gives a “nice” representation of the materials in a plane

E(Rocksalt) - E(Zincblende)

Systematic construction of the feature space

Primary features (Radius 1, Radius 2, KS level 1, KS level 2) are combined through operations such as x + y, | x - y |, x / y, exp(x), (x)^n

In practice: formalism borrowed from symbolic regression

Systematic construction of the feature space: EUREQA

T. Müller et al., PRB 89, 115202 (2014):
Data: ~1000 amorphous structures of 216 Si atoms (saturated)
Property: hole trap depth

Descriptor (candidates: 242)
a. The largest distance between a H atom and its nearest Si neighbor
b. The shortest distance between a Si atom and its sixth-nearest Si neighbor
c. The maximum bond valence sum on a Si atom
d. The smallest value for the fifth-smallest relative bond length around a Si atom
e. The fourth-shortest distance between a Si atom and its eighth-nearest neighbor
f. The second-shortest distance between a Si atom and its fifth-nearest neighbor
g. The third-shortest distance between a Si atom and its sixth-nearest neighbor
h. The H-Si nearest-neighbor distance for the hydrogen atom with the fourth-smallest difference between the distances to the two Si atoms nearest to a H atom

EUREQA: genetic programming software. Global optimization (genetic algorithm). Schmidt M., Lipson H., Science, Vol. 324, No. 5923, (2009)

Ansatz: atomic features
● HOMO, H
● LUMO, L
● Ionization Potential, IP
● Electron Affinity, EA
● Radius of valence s orbital, rs
● Radius of valence p orbital, rp
● Radius of valence d orbital, rd
● Thousands of non-linear functions of the above

Finding the descriptor

“Extended” LASSO: the features are correlated, so the first 25-30 features selected by LASSO when scanning from large to small λ are retained, and all single features, all pairs, all triplets, ... among them are separately tested via linear regression (the NP-hard problem, but only with 25-30 features).

[Table: the resulting 1D, 2D, and 3D descriptors]
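The second stage, exhaustive testing of all tuples among the prescreened features, can be sketched as below. The LASSO prescreening itself is not reproduced: `candidates` is simply a given list of column indices, and the data are synthetic. With 30 candidates there are 435 pairs and 4060 triplets, so the enumeration stays cheap.

```python
import itertools
import numpy as np

def best_subset(D, P, candidates, size):
    """Least-squares fit (with intercept) for every subset of `size` candidates;
    returns (rmse, subset, coefficients) of the best one."""
    best = (np.inf, None, None)
    for subset in itertools.combinations(candidates, size):
        A = np.column_stack([D[:, list(subset)], np.ones(len(P))])
        coef, *_ = np.linalg.lstsq(A, P, rcond=None)
        err = float(np.sqrt(np.mean((A @ coef - P) ** 2)))
        if err < best[0]:
            best = (err, subset, coef)
    return best

# Synthetic check: the property depends on features 3 and 8 only.
rng = np.random.default_rng(4)
D = rng.normal(size=(60, 12))
P = 2.0 * D[:, 3] - 1.5 * D[:, 8]

err, subset, coef = best_subset(D, P, candidates=range(12), size=2)
print(subset, err)
```

Calling the function with size 1, 2, 3 in turn gives the best 1D, 2D, and 3D descriptors among the candidates, in the sense of lowest training error.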

Two-dimensional descriptor

[Figure: the materials in the plane of the 2D descriptor; color scale marks at -0.2 eV, 0, 0.2 eV, 0.45 eV, 1.0 eV]

A good model must be predictive within the data domain (interpolation): cross validation

Performance of the descriptors: accuracy, validation

Convergence with dimensionality of the descriptor

[Figure: training error and validation error versus “complexity”; leave-10%-out cross validation; errors are energies, in eV; the max absolute error is also shown]


A better model should be causal: stability analysis

A few words on causality

There are four possibilities (types of causality relationship) behind P(d):

1. d → P : P “listens” to d

2. P → d : d “listens” to P

3. A → d and A → P : There is no direct connection between d and P, but d and P both “listen” to a third “actuator”

4. There is no direct connection between d and P, but they have a common effect that listens to both and screams: “I occurred” [Judea Pearl] (Berkson paradox)

[If the admission criteria to a certain graduate school call for either high grades as an undergraduate or special musical talents, then these two attributes will be found to be correlated (negatively) in the student population of that school, even if these attributes are uncorrelated in the population at large (selection bias). Indeed, students with low grades are likely to be exceptionally gifted in music, which explains their admission to graduate school.]
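The graduate-school example can be checked with a short simulation. Everything here is synthetic: grades and musical talent are drawn as independent standard normal numbers, and the admission threshold of 1.0 is an arbitrary choice.

```python
import numpy as np

rng = np.random.default_rng(4)
n = 100_000
grades = rng.normal(size=n)
music = rng.normal(size=n)                   # independent of grades by construction

# Admission criterion: either high grades or special musical talent.
admitted = (grades > 1.0) | (music > 1.0)

r_population = np.corrcoef(grades, music)[0, 1]
r_admitted = np.corrcoef(grades[admitted], music[admitted])[0, 1]

print(f"correlation in the whole population: {r_population:+.3f}")
print(f"correlation among admitted students: {r_admitted:+.3f}")
```

The two attributes are essentially uncorrelated in the full population and clearly negatively correlated among the admitted students, although neither causes the other. A descriptor-property correlation found in a preselected data set can arise in the same way.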

We are not able to write down a scientific law that connects the descriptor directly with the total-energy difference between RS and ZB structures. However, ZA, ZB determine these descriptors, and ZA, ZB determine the many-body Hamiltonians and the total-energy difference.

ML has diligently taken over “Kepler's work”, but no Newton, yet. Question: is the latter step always necessary?

Quantitative analysis: effect of noise

The same 2D descriptor is found:

An ideal model should be predictive outside the data domain (extrapolation)

When both carbon diamond and BN are excluded from training:

        E(LDA)      E(predicted)
C       -2.64 eV    -1.44 eV
BN      -1.71 eV    -1.37 eV

Had we not known about diamond … we'd have predicted it!

If all C-containing binaries (C, SiC, GeC, and SnC) are excluded from training, i.e. no explicit information on C is given to the model:

        E(LDA)      E(predicted)
C       -2.64 eV    -1.37 eV
SiC     -0.67 eV    -0.48 eV
GeC     -0.81 eV    -0.46 eV
SnC     -0.45 eV    -0.23 eV

Had we not known about any carbon-containing binary … we'd have predicted carbon chemistry (from atomic features)

Summary

- Big-data for Materials Science: infrastructures
- Descriptors and fingerprints
- (Selected) machine-learning / data-mining methods:
  - Kernel ridge regression
  - Automatic descriptor search, dimensionality reduction: principal component analysis, sketch map
  - Automatic descriptor search, feature selection: LASSO (compressed sensing) + symbolic regression
- Application to a model materials-science problem
- Application of compressed sensing to basis-set construction
- Some words on causal descriptor-property relationship: cross-validation and stability analysis