AMASS – 7/25/03 Data Mining Approaches in Atomistic Modeling H. Aourag URMER, University of Tlemcen

Outline

• Introduction

• Ex 1: Intergranular Embrittlement of Fe

• Ex 2: Catalytic Activity - Hydrogenation

• Ex 3: Stainless Steel CrxNiyFe(1-x-y)

• Ex 4: Conductivity T7 7xxx Al Alloys

• Ex 5: Boiling Points

• Ex 6: Crystal Structure Prediction – open questions…

Predicting Properties with Atomistic Modeling

Atomistic modeling
• Atom positions
• Electronic structure
• Energies

Macroscopic properties
• Elastic properties
• Conductivity
• Toxicity

Atomistic modeling → ? → Macroscopic properties
• Direct calculation: Band Gap, Elastic Constants
• Physical laws, constitutive relations: Segregation Energies, Activation Barriers → Embrittlement, Transport
• Data mining: Atomic Scale Descriptors → Weldability, Toxicity

Power of Data Mining

• Does not require complete and accurate multiscale theories

• New physics in relationships R

• Quick, cheap screening for desired properties, errors, etc. – can be qualitative

Use known data to establish R:
Calculated Atomistic Properties Database → R → Measured Macroscopic Properties Database

Use R to predict new data:
Calculated Atomistic Properties Database → R → Predicted Macroscopic Properties Database
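The two-step workflow (establish R from known data, then apply R to new data) can be sketched as follows. This is only an illustration: it assumes R is linear and fitted by least squares, and the descriptor and property values are synthetic stand-ins, not data from any of the examples in this talk. The function names `fit_relationship` and `predict` are hypothetical.

```python
import numpy as np

def fit_relationship(descriptors, measured):
    """Fit a linear R (assumption): property ≈ descriptors @ w + b, by least squares."""
    X = np.column_stack([descriptors, np.ones(len(descriptors))])
    coeffs, *_ = np.linalg.lstsq(X, measured, rcond=None)
    return coeffs  # weights followed by the intercept

def predict(coeffs, descriptors):
    """Apply the fitted R to descriptors of systems with no measured property."""
    X = np.column_stack([descriptors, np.ones(len(descriptors))])
    return X @ coeffs

# Step 1: "known" data. Synthetic, noise-free stand-ins for calculated
# descriptors and measured properties.
rng = np.random.default_rng(0)
known_descriptors = rng.uniform(-1.0, 1.0, size=(20, 2))
known_property = known_descriptors @ np.array([2.0, -0.5]) + 1.0
R = fit_relationship(known_descriptors, known_property)

# Step 2: use R to predict the property for new descriptor values.
new_descriptors = np.array([[0.3, -0.2], [0.0, 0.0]])
predicted_property = predict(R, new_descriptors)
print(R, predicted_property)
```

In practice R could just as well be a neural network or a clustering rule; the split into a fitting step and a prediction step stays the same.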

Key Issues

– Descriptors accessible to modeling
– Descriptors optimally chosen
  • Use known relationships/physics
  • Optimize from large set of possibilities
– Descriptors→Property relationship is robust
  • Sensible choice of methods
  • Tested with cross validation, test sets
– Data
  • Large enough
  • Clean enough

Atomic scale descriptors → Data Mining → Macroscopic Properties
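Testing robustness with cross validation can be sketched as below. This is a minimal example assuming a linear descriptors→property relationship and leave-one-out splitting; the data are synthetic and the helper names are hypothetical.

```python
import numpy as np

def fit_rms_error(X, y):
    """RMS error of a linear least-squares fit evaluated on its own training data."""
    A = np.column_stack([X, np.ones(len(y))])
    coeffs, *_ = np.linalg.lstsq(A, y, rcond=None)
    return float(np.sqrt(np.mean((A @ coeffs - y) ** 2)))

def loo_rms_error(X, y):
    """Leave-one-out RMS error: refit without sample i, then predict sample i."""
    n = len(y)
    A = np.column_stack([X, np.ones(n)])
    errors = np.empty(n)
    for i in range(n):
        keep = np.arange(n) != i
        coeffs, *_ = np.linalg.lstsq(A[keep], y[keep], rcond=None)
        errors[i] = A[i] @ coeffs - y[i]
    return float(np.sqrt(np.mean(errors ** 2)))

# Synthetic stand-in data: 12 samples, 3 descriptors, only the first one matters.
rng = np.random.default_rng(1)
X = rng.normal(size=(12, 3))
y = X[:, 0] + 0.1 * rng.normal(size=12)

fit_err = fit_rms_error(X, y)
cv_err = loo_rms_error(X, y)
print(f"fit RMS = {fit_err:.3f}, leave-one-out RMS = {cv_err:.3f}")
```

For a least-squares fit the leave-one-out error is never smaller than the fit error, so a large gap between the two is a warning sign that the relationship is not robust.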

Ex 1: Intergranular Embrittlement of Fe

• Property: Fe embrittlement

• Descriptors→Property relationship:
Embrittlement ~ [Grain boundary segregation E − Free surface segregation E] = (E_GB – E_FS) (Rice ’89)

• Descriptors: (E_GB – E_FS) (calculated ab initio)

• Data: Embrittling potency for B, C, P, S.
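The descriptor (E_GB – E_FS) can be turned into a simple classifier, sketched below. The sign convention is an assumption stated here rather than taken from the slide: segregation energies are binding energies, so a positive difference means the impurity is more stable at the free surface and promotes fracture. The function names and the numbers in the usage lines are hypothetical and do not correspond to any element.

```python
def embrittling_potency(e_gb, e_fs):
    """Difference between grain-boundary and free-surface segregation energies."""
    return e_gb - e_fs

def classify_segregant(e_gb, e_fs, tol=1e-9):
    """Assumed convention: positive potency -> embrittler, negative -> cohesion enhancer."""
    potency = embrittling_potency(e_gb, e_fs)
    if potency > tol:
        return "embrittler"
    if potency < -tol:
        return "cohesion enhancer"
    return "neutral"

# Placeholder energies (arbitrary units), not calculated values.
print(classify_segregant(e_gb=-1.0, e_fs=-2.0))  # surface site more stable
print(classify_segregant(e_gb=-2.0, e_fs=-1.0))  # grain-boundary site more stable
```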

Ex 1: Intergranular Embrittlement of Fe

(Wu, et al., Phys. Rev. B, ’96)

Also correctly predicts effect of Mn and Mo on P embrittlement! (Zhong, et al., Phys. Rev. B, ’97; Geng, et al., Solid State Comm., ’01)

Ex 2: Catalytic Activity - Hydrogenation

• Property: Reaction rates (Hydrogenation of ethene, benzene on 3d transition metal M)

• Descriptors→Property relationship:

Adapted Brønsted-Evans-Polanyi free energies + Langmuir-Hinshelwood rate equations

Rate = R[E_MC, 12 fitting “constants” independent of M]

• Descriptors:

– E_MC = M-C bond strength in bulk NaCl structure (calculated ab initio)

– 12 fitting “constants” (fit to experimental data for each reaction)

• Data: 10-20 reaction rates for each of ethene and benzene

Ex 2: Catalytic Activity - Hydrogenation

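[Figure: results plotted against E_MC for ethene hydrogenation (C2H4 + H2 → C2H6) and benzene hydrogenation (C6H6 + 3H2 → C6H12); panel labels: “Cross-validation in black” and “Cross-validation with alloys”]

(Toulhoat, et al., ’02)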

Ex 3: Stainless Steel CrxNiyFe(1-x-y)

• Property: High hardness and ductility

• Descriptors→Property relationship:
Hardness ~ shear modulus = G
Ductility ~ bulk modulus/shear modulus = B/G

• Descriptors: B,G (from ab initio)

• Data: Not clearly defined
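Screening candidates on the two indicators above (G for hardness, B/G for ductility) can be sketched as a simple Pareto filter, since the two goals may pull in different directions. The candidate names and moduli below are hypothetical placeholders, not values for any real alloy, and the Pareto filter is an illustrative choice rather than the method of the cited work.

```python
def indicators(bulk_modulus, shear_modulus):
    """Return (hardness indicator G, ductility indicator B/G)."""
    return shear_modulus, bulk_modulus / shear_modulus

def pareto_front(candidates):
    """Names of candidates not dominated in both G and B/G.

    candidates: dict mapping name -> (B, G), both in GPa.
    """
    scores = {name: indicators(b, g) for name, (b, g) in candidates.items()}
    front = []
    for name, (g, ratio) in scores.items():
        dominated = any(
            g2 >= g and r2 >= ratio and (g2 > g or r2 > ratio)
            for other, (g2, r2) in scores.items()
            if other != name
        )
        if not dominated:
            front.append(name)
    return front

# Hypothetical compositions with placeholder (B, G) values in GPa.
candidates = {"A": (160.0, 80.0), "B": (180.0, 60.0), "C": (150.0, 60.0)}
print(pareto_front(candidates))
```

Here "A" has the highest G and "B" the highest B/G, so both survive the filter, while "C" is no better than "B" on either indicator.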

[Figure: Hardness vs. Shear Modulus; Vickers Hardness [GPa] plotted against Shear Modulus [GPa]]

(Teter, MRS Bulletin, ’98)

Ex 3: Stainless Steel CrxNiyFe(1-x-y)

[Figure: maps of Bulk Modulus B and Shear Modulus G as functions of Ni (at%) and Cr (at%), scale from Low to High. High G (hard) vs. high B/G (ductile): Conflict!]

(Vitos, et al., Nature Materials, ’02)

• Optimal at ~Cr18Ni24Fe58 (multiple patents)

• Predict improved mechanical properties for Ir, Os doping

Ex 4: Conductivity T7 7xxx Al Alloys

• Property: Electrical conductivity σ

• Descriptors→Property relationship:
– Linear: σ = V*d (requires only fitting)
– Neurofuzzy: σ = NF(d) (requires only fitting)
– Physical: σ = P(d) (requires thermodynamic models of relevant phases, Rayleigh–Maxwell equation for resistivity with dispersed particles, Starink-Zahra equation for precipitation, 1D diffusion equation, Matthiessen’s rule for resistivity with dissolved elements)

• Descriptors: Concentrations, ageing time: d = (xZn, xMg, xCu, xZr, xFe, xSi, t)
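One ingredient of the physical model, Matthiessen’s rule, can be sketched in isolation: each dissolved element adds its own contribution to the resistivity of the matrix. The sketch assumes each contribution is linear in the solute fraction; the function names, coefficients, and concentrations are hypothetical placeholders in arbitrary units, and the other ingredients (precipitation kinetics, dispersed particles) are not modeled here.

```python
def matthiessen_resistivity(rho_matrix, solute_fractions, coefficients):
    """Total resistivity = matrix resistivity + sum of per-solute contributions.

    solute_fractions: dict element -> dissolved fraction
    coefficients:     dict element -> resistivity increase per unit fraction
    """
    rho = rho_matrix
    for element, fraction in solute_fractions.items():
        rho += coefficients[element] * fraction
    return rho

def conductivity(resistivity):
    """Conductivity is the reciprocal of resistivity."""
    return 1.0 / resistivity

# Placeholder numbers (arbitrary units), not fitted alloy parameters.
rho = matthiessen_resistivity(
    rho_matrix=1.0,
    solute_fractions={"Zn": 0.5, "Mg": 0.25},
    coefficients={"Zn": 2.0, "Mg": 4.0},
)
print(rho, conductivity(rho))
```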

Ex 4: Conductivity T7 7xxx Al Alloys

σ measured for 36 concentration/ageing time samples

R-Model      Fitting Params   RMS Error (%)   Cross Validation (%)
Linear       7                4.75            5.25
Neurofuzzy   5                1.35            1.525
Physical     6                0.97            1.05

(Starink, et al., ’00)

Ex 5: Boiling Points (Quantitative Structure-Property Relationships: QSPR)

• Property: Boiling Point T_B

• Descriptors→Property relationship: Neural Network (10:18:1, sigmoid, backpropagation)

• Descriptors: Electrostatic and structural properties (calculated with semiempirical VAMP – AM1)

• Data: T_B for 6629 molecules containing elements H, B, C, N, O, F, Al, Si, P, S, Cl, Zn, Ge, Br, Sn, I, Hg
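A 10:18:1 network with sigmoid hidden units trained by backpropagation can be sketched in a few lines of NumPy. Only the topology, the sigmoid transfer function, and the use of backpropagation come from the slide; the linear output unit, the learning rate, the full-batch updates, and the synthetic training data are all assumptions made for illustration and are not the setup of the cited study.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def init_network(rng, n_in=10, n_hidden=18, n_out=1):
    """Random small weights for a 10:18:1 feed-forward network."""
    return {
        "W1": rng.normal(0.0, 0.5, size=(n_in, n_hidden)),
        "b1": np.zeros(n_hidden),
        "W2": rng.normal(0.0, 0.5, size=(n_hidden, n_out)),
        "b2": np.zeros(n_out),
    }

def forward(net, X):
    hidden = sigmoid(X @ net["W1"] + net["b1"])
    output = hidden @ net["W2"] + net["b2"]  # linear output unit (assumption)
    return hidden, output

def train_step(net, X, y, lr=0.05):
    """One full-batch gradient step on 0.5 * mean squared error.

    Returns the mean squared error measured before the update.
    """
    n = len(X)
    hidden, output = forward(net, X)
    err = output - y
    # Backpropagate the error from the output layer to the hidden layer.
    grad_W2 = hidden.T @ err / n
    grad_b2 = err.mean(axis=0)
    delta_hidden = (err @ net["W2"].T) * hidden * (1.0 - hidden)
    grad_W1 = X.T @ delta_hidden / n
    grad_b1 = delta_hidden.mean(axis=0)
    net["W1"] -= lr * grad_W1
    net["b1"] -= lr * grad_b1
    net["W2"] -= lr * grad_W2
    net["b2"] -= lr * grad_b2
    return float(np.mean(err ** 2))

# Synthetic stand-in for (10 descriptors -> 1 property) training data.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))
y = np.tanh(X[:, :1]) + 0.5 * X[:, 1:2]

net = init_network(rng)
losses = [train_step(net, X, y) for _ in range(500)]
print(f"MSE: first step {losses[0]:.3f}, last step {losses[-1]:.3f}")
```

With real descriptors the inputs would normally be scaled to comparable ranges first, and part of the data held back as a test set.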

Data Mining Descriptors→Property Relationships

Many general approaches
• Graphical
• Linear Regressions (normal least squares, principal component regression, partial least squares, …)
• Neural Networks (perceptrons, feed-forward, radial-basis, …)
• Clustering (k-means, nearest-neighbor, …)

Many choices in each approach
Neural Networks:
• Number of neurons/layers – 3:4:1
• Transfer functions: step, sigmoid, tansig, etc.
• Training method: backpropagation algorithms

[Figure: neural network diagram with In and Out labels]

Thousands of possible approaches!
• Many yield similar results
• Appropriate for different situations
• Problem dependent - much art!!

Descriptors

1. Partial positive surface area (sum of the surface area of positive atoms)
2. Partial negative surface area (sum of the surface area of negative atoms)
3. Total charge weighted positive surface area (descriptor 1 multiplied by the total positive charge)
4. Total charge weighted negative surface area (descriptor 2 multiplied by the total negative charge)
5. Atomic charge weighted positive surface area (sum of sasa*charge for all positive atoms)
6. Atomic charge weighted negative surface area (sum of sasa*charge for all negative atoms)
7. Difference in charged surface areas (descriptor 1 - descriptor 2)
8. Difference in total charge weighted surface areas (descriptor 3 - descriptor 4)
9. Difference in atomic charge weighted surface areas (descriptor 5 - descriptor 6)
10-15. Fractional charged partial surface areas (6 descriptors divided by total surface area)
16-21. Surface weighted charged partial surface areas (6 descriptors multiplied by total surface area)
22. Relative positive charge (charge of most positive atom divided by total positive charge)
23. Relative negative charge (charge of most negative atom divided by total negative charge)
24. Relative positive charge surface area (surface area of most positive atom divided by descriptor 22)
25. Relative negative charge surface area (surface area of most negative atom divided by descriptor 23)
26. Total hydrophobic surface area (sum of surface areas of atoms with |charge| < 0.2)
27. Total polar surface area (sum of surface areas of atoms with |charge| > 0.2)
28. Relative hydrophobic surface area (descriptor 26 divided by total surface area)
29. Relative polar surface area (descriptor 27 divided by total surface area)
30. Total solvent-accessible surface area

(http://www.accelrys.com/cerius2/descriptor.html#list)

Charged partial surface area descriptors, Accelrys QSAR module
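The first nine descriptors follow directly from per-atom partial charges and solvent-accessible surface areas (sasa). The sketch below implements the definitions exactly as worded in the list; it is not the Accelrys implementation, the function name is hypothetical, and the three-atom input is a toy example rather than a real molecule.

```python
def cpsa_descriptors(charges, areas):
    """Descriptors 1-9 from per-atom partial charges and surface areas.

    charges: per-atom partial charges
    areas:   per-atom solvent-accessible surface areas (same order)
    """
    positive = [(q, a) for q, a in zip(charges, areas) if q > 0]
    negative = [(q, a) for q, a in zip(charges, areas) if q < 0]

    d1 = sum(a for _, a in positive)        # partial positive surface area
    d2 = sum(a for _, a in negative)        # partial negative surface area
    d3 = d1 * sum(q for q, _ in positive)   # total charge weighted, positive
    d4 = d2 * sum(q for q, _ in negative)   # total charge weighted, negative
    d5 = sum(q * a for q, a in positive)    # atomic charge weighted, positive
    d6 = sum(q * a for q, a in negative)    # atomic charge weighted, negative
    return {1: d1, 2: d2, 3: d3, 4: d4, 5: d5, 6: d6,
            7: d1 - d2, 8: d3 - d4, 9: d5 - d6}

# Toy input: one positive atom and two negative atoms (placeholder values).
descriptors = cpsa_descriptors(charges=[0.5, -0.25, -0.25],
                               areas=[10.0, 20.0, 30.0])
print(descriptors)
```

Because the negative charges enter with their sign, descriptors 4 and 6 come out negative under this literal reading of the list.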

Descriptors

• Many broad categories: composition, topological, electronic, physical-chemical properties, …

• Thousands of possible descriptors
– Use physical knowledge to choose relevant ones (e.g., QSAR principle)
– Use numerical methods to choose important descriptors

Ex 5: Boiling Point Descriptors

(Chalk, et al., J. Chem. Inf. Comput. Sci., ’01)

Ex 5: Atomistic Modeling Methods

Use VAMP – AM1 and PM3 Hamiltonians
– Semi-empirical molecular orbital based
– Quantum mechanical, but matrix elements are fit to experimental data
– Can calculate optimized geometries, electronic structure (charge properties)
– Fairly accurate (known failings) and fast

Ex 5: Boiling Points

Training set (6000): 17 (max -119)
Test set (629): 19 (max -94)

Large errors often due to
• Incorrect experimental measurements of T_B (low pressure)
• Incorrect experimental structures (tautomer misidentification)
• Failure of atomistic modeling method (approximation errors)

(Chalk, et al., J. Chem. Inf. Comput. Sci., ’01)

Ex 6: Crystal Structure Prediction

• Property: Stable crystal structure

• Descriptors→Property relationship: Neighbor Clustering algorithm (Euclidean metric)

• Descriptors: Chemical scale (empirically assigned value for each element) (Pettifor, J. Phys. C, ’86)

• Data: All intermetallic binary alloys (thousands)
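Neighbor-based prediction with a Euclidean metric can be sketched as a nearest-neighbor lookup in the plane spanned by the chemical-scale values of the two elements. The coordinates below are made-up placeholders, not Pettifor’s assigned values, and taking only the single nearest neighbor is a simplification chosen for illustration.

```python
import math

def predict_structure(query, known):
    """Return the structure type of the known compound closest to `query`.

    query: (chi_A, chi_B) chemical-scale coordinates of the new binary compound
    known: list of ((chi_A, chi_B), structure_type) for compounds with known structure
    """
    coords, structure = min(known, key=lambda item: math.dist(item[0], query))
    return structure

# Placeholder map points; coordinates are hypothetical.
known_compounds = [
    ((0.20, 0.80), "NaCl"),
    ((0.30, 0.90), "NaCl"),
    ((0.70, 0.60), "CsCl"),
]

print(predict_structure((0.25, 0.85), known_compounds))
print(predict_structure((0.68, 0.62), known_compounds))
```

The prediction is only as good as the coverage of the map: a query far from every known compound still gets assigned the label of its nearest neighbor.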

Structure Maps

[Figure: structure map with NaCl and CsCl regions labeled]

(Rodgers, CRYSTMET, ’03)

Ex 6: Crystal Structure Prediction

• Powerful: structure maps can give 90-95% predictive accuracy

• Many Descriptors: ~50 have been tried based on size, atomic number, cohesive energy, electrochemistry, valence electrons

• Can’t be extended: accurate maps require ~40% of the possible systems to be known (~80% binaries known, ~0.1% quaternaries)

• Can atomistic modeling help?
– Fill in data for multicomponent systems
– Provide optimal descriptors

(Villars, Intermetallic Compounds, ’94)

Conclusions

• Atomistic modeling and data mining can provide valuable predictive ability when physical theories are incomplete

• Key issues are data quality, descriptors, and descriptor→properties relationship

• Dangers of overfitting and tuning
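The overfitting danger can be illustrated with a toy experiment: fit the same noisy data with a simple and a flexible model and compare training error with error on held-out points. Everything here is synthetic, and the underlying linear relationship, noise level, and polynomial degrees are arbitrary choices for the sketch.

```python
import numpy as np

def truth(x):
    return 1.0 + 2.0 * x  # hypothetical underlying relationship

def rms(a, b):
    return float(np.sqrt(np.mean((a - b) ** 2)))

rng = np.random.default_rng(2)
x_train = np.linspace(-1.0, 1.0, 10)
x_test = np.linspace(-0.95, 0.95, 10)
y_train = truth(x_train) + 0.2 * rng.normal(size=x_train.size)
y_test = truth(x_test) + 0.2 * rng.normal(size=x_test.size)

results = {}
for degree in (1, 7):
    # Higher degree means more fitting parameters for the same 10 data points.
    coeffs = np.polyfit(x_train, y_train, degree)
    results[degree] = {
        "train": rms(np.polyval(coeffs, x_train), y_train),
        "test": rms(np.polyval(coeffs, x_test), y_test),
    }

for degree, errs in results.items():
    print(f"degree {degree}: train RMS {errs['train']:.3f}, test RMS {errs['test']:.3f}")
```

Adding parameters can only lower the training error of a least-squares fit, so the training error alone says nothing about predictive ability; the held-out error is the number to watch.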

Bible Code

Are these words closer than by chance?
Can the Bible predict future events?

Some say yes (Witztum, et al., Stat. Sci., ’94)

Some say no (McKay, et al., Stat. Sci., ’99)

• Many articles
• >60 books on Bible Codes on Amazon
• 1 major motion picture (Omega Code)

Be careful with your statistics!

The First and Greatest Example of Atomic Level Data Mining

END