AMASS – 7/25/03
Data Mining Approaches in Atomistic Modeling
H. Aourag, URMER, University of Tlemcen
Outline
• Introduction
• Ex 1: Intergranular Embrittlement of Fe
• Ex 2: Catalytic Activity - Hydrogenation
• Ex 3: Stainless Steel CrxNiyFe(1-x-y)
• Ex 4: Conductivity T7 7xxx Al Alloys
• Ex 5: Boiling Points
• Ex 6: Crystal Structure Prediction – open questions…
Predicting Properties with Atomistic Modeling
Atomistic modeling
• Atom positions
• Electronic structure
• Energies
Macroscopic properties
• Elastic properties
• Conductivity
• Toxicity
Atomistic modeling → ? → Macroscopic properties
• Direct calculation: band gap, elastic constants
• Physical laws, constitutive relations: segregation energies, activation barriers → embrittlement, transport
• Data mining with atomic scale descriptors: weldability, toxicity
Power of Data Mining
• Does not require complete and accurate multiscale theories
• New physics in relationships R
• Quick, cheap screening for desired properties, errors, etc. – can be qualitative
Use known data to establish R:
Calculated Atomistic Properties Database → R → Measured Macroscopic Properties Database
Use R to predict new data:
Calculated Atomistic Properties Database → R → Predicted Macroscopic Properties Database
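The two-step workflow (establish R from known data, then apply R to new data) can be sketched as follows. This is a minimal illustration that assumes R is a simple linear map fitted by least squares; the function names and all numbers are hypothetical placeholders, not values from any real database.

```python
import numpy as np

def fit_relationship(descriptors, properties):
    """Establish R: least-squares fit of property ≈ descriptors @ w + b."""
    X = np.column_stack([descriptors, np.ones(len(descriptors))])
    coef, *_ = np.linalg.lstsq(X, properties, rcond=None)
    return coef

def predict(coef, descriptors):
    """Use R: apply the fitted coefficients to new descriptor rows."""
    X = np.column_stack([descriptors, np.ones(len(descriptors))])
    return X @ coef

# Synthetic "calculated atomistic properties" (two descriptors per system).
known_descriptors = np.array([[0.0, 1.0], [1.0, 0.0], [1.0, 1.0], [2.0, 1.0]])
# Synthetic "measured macroscopic property", generated from a known linear rule.
known_properties = 2.0 * known_descriptors[:, 0] - 1.0 * known_descriptors[:, 1] + 0.5

R = fit_relationship(known_descriptors, known_properties)

# A new system for which only the calculated descriptors are available.
new_descriptors = np.array([[3.0, 2.0]])
predicted = predict(R, new_descriptors)
```

In practice R may be any of the model types discussed later (neural network, clustering, physical model); the linear form is chosen here only because it is the shortest to write down.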
Key Issues
– Descriptors accessible to modeling
– Descriptors optimally chosen
• Use known relationships/physics
• Optimize from large set of possibilities
– Descriptors→Property relationship is robust
• Sensible choice of methods
• Tested with cross validation, test sets
– Data
• Large enough
• Clean enough
Atomic scale descriptors → Data Mining → Macroscopic Properties
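Testing robustness with cross validation can be sketched as below. This is a minimal leave-one-out example that assumes a linear descriptors→property model; the data are synthetic and the function names are hypothetical.

```python
import numpy as np

def fit_rmse(X, y):
    """RMS error when the model is fitted and evaluated on the same data."""
    A = np.column_stack([X, np.ones(len(y))])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return float(np.sqrt(np.mean((A @ coef - y) ** 2)))

def loo_cv_rmse(X, y):
    """Leave-one-out cross validation: refit without sample i, then predict it."""
    n = len(y)
    A = np.column_stack([X, np.ones(n)])
    errors = []
    for i in range(n):
        keep = np.arange(n) != i
        coef, *_ = np.linalg.lstsq(A[keep], y[keep], rcond=None)
        errors.append(A[i] @ coef - y[i])
    return float(np.sqrt(np.mean(np.square(errors))))

# Synthetic data: 12 samples, 2 descriptors, small noise.
rng = np.random.default_rng(0)
X = rng.uniform(size=(12, 2))
y = X @ np.array([1.5, -0.7]) + 0.1 * rng.normal(size=12)

train_error = fit_rmse(X, y)
cv_error = loo_cv_rmse(X, y)
```

A cross-validation error far above the fitting error suggests that the relationship depends on a few individual samples and may not generalize.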
Ex 1: Intergranular Embrittlement of Fe
• Property: Fe embrittlement
• Descriptors→Property relationship:
Embrittlement ~ [Grain boundary segregation E – Free surface segregation E] = (E_GB – E_FS) (Rice ’89)
• Descriptors: (E_GB – E_FS) (calculated ab initio)
• Data: Embrittling potency for B, C, P, S.
Ex 1: Intergranular Embrittlement of Fe
(Wu, et al., Phys. Rev. B, ’96)
Also correctly predicts effect of Mn and Mo on P embrittlement!
(Zhong, et al., Phys. Rev. B, ’97; Geng, et al., Solid State Comm., ’01)
Ex 2: Catalytic Activity - Hydrogenation
• Property: Reaction rates (Hydrogenation of ethene, benzene on 3d transition metal M)
• Descriptors→Property relationship:
Adapted Brønsted-Evans-Polanyi Free E
+ Langmuir-Hinshelwood Rate Equations
Rate = R[E_MC, 12 fitting “constants” independent of M]
• Descriptors:
– E_MC = M-C bond strength in bulk NaCl structure (calculated ab initio)
– 12 fitting “constants” (fit to experimental data for each reaction)
• Data: 10-20 reaction rates for each of ethene and benzene
Ex 2: Catalytic Activity - Hydrogenation
(Toulhoat, et al., ’02)
[Figure: two panels plotted against E_MC. Ethene: C2H4 + H2 → C2H6. Benzene: C6H6 + 3H2 → C6H12. Panel labels: cross-validation in black; cross-validation with alloys.]
Ex 3: Stainless Steel CrxNiyFe(1-x-y)
• Property: High hardness and ductility
• Descriptors→Property relationship:
Hardness ~ shear modulus = G
Ductility ~ bulk modulus/shear modulus = B/G
• Descriptors: B, G (from ab initio)
• Data: Not clearly defined
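Screening on the two descriptors can be sketched as a simple filter: keep compositions whose G (hardness proxy) and B/G (ductility proxy) both exceed chosen thresholds. The candidate names, moduli, and thresholds below are hypothetical placeholders, not calculated or measured values.

```python
from dataclasses import dataclass

@dataclass
class Candidate:
    name: str
    bulk_modulus: float   # B in GPa (placeholder value)
    shear_modulus: float  # G in GPa (placeholder value)

    @property
    def b_over_g(self) -> float:
        """Ductility proxy B/G."""
        return self.bulk_modulus / self.shear_modulus

def screen(candidates, min_g, min_b_over_g):
    """Keep candidates that are both hard (high G) and ductile (high B/G)."""
    return [
        c for c in candidates
        if c.shear_modulus >= min_g and c.b_over_g >= min_b_over_g
    ]

# Hypothetical candidates; thresholds are arbitrary illustration values.
candidates = [
    Candidate("alloy A", 160.0, 80.0),  # B/G = 2.0: hard but less ductile
    Candidate("alloy B", 180.0, 75.0),  # B/G = 2.4: passes both
    Candidate("alloy C", 150.0, 60.0),  # B/G = 2.5: ductile but softer
]
selected = screen(candidates, min_g=70.0, min_b_over_g=2.2)
```

Because raising G at fixed B lowers B/G, the two criteria pull in opposite directions, so the thresholds define a compromise region rather than a single optimum.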
Hardness vs. Shear Modulus
[Figure: Vickers hardness [GPa] vs. shear modulus [GPa]]
(Teter, MRS Bulletin, ’98)
Ex 3: Stainless Steel CrxNiyFe(1-x-y)
(Vitos, et al., Nature Materials, ’02)
[Figure: maps of bulk modulus B and shear modulus G over Cr (at%) vs. Ni (at%); scale from Low to High]
High G (hard)
High B/G (ductile)
Conflict!
• Optimal at ~Cr18Ni24Fe58 (multiple patents)
• Predict improved mechanical properties for Ir, Os doping
Ex 4: Conductivity T7 7xxx Al Alloys
• Property: Electrical conductivity σ
• Descriptors→Property relationship:
– Linear: σ = V*d (requires only fitting)
– Neurofuzzy: σ = NF(d) (requires only fitting)
– Physical: σ = P(d) (requires thermodynamic models of relevant phases, Rayleigh–Maxwell equation for resistivity with dispersed particles, Starink-Zahra equation for precipitation, 1D diffusion equation, Matthiessen’s rule for resistivity with dissolved elements)
• Descriptors: Concentrations, ageing time: d = xZn, xMg, xCu, xZr, xFe, xSi, t
Ex 4: Conductivity T7 7xxx Al Alloys
Conductivity σ measured for 36 concentration/ageing time samples
R-Model      Fitting Params   RMS Error (%)   Cross Validation (%)
Linear       7                4.75            5.25
Neurofuzzy   5                1.35            1.525
Physical     6                0.97            1.05
(Starink, et al., ’00)
Ex 5: Boiling Points (Quantitative Structure-Property Relationships: QSPR)
• Property: Boiling Point TB
• Descriptors→Property relationship: Neural Network (10:18:1, sigmoid, backpropagation)
• Descriptors: Electrostatic and structural properties (calculated with semiempirical VAMP – AM1)
• Data: TB for 6629 molecules containing elements H, B, C, N, O, F, Al, Si, P, S, Cl, Zn, Ge, Br, Sn, I, Hg
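A 10:18:1 feed-forward network with sigmoid hidden units and backpropagation can be sketched in plain NumPy as below. This is an illustration of the architecture only, not the published model: the linear output unit, learning rate, initialization, and the synthetic data are all assumptions, and real boiling-point targets would need to be scaled first.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def init_network(rng, n_in=10, n_hidden=18, n_out=1):
    """10:18:1 layout: 10 descriptors in, 18 hidden neurons, 1 property out."""
    return {
        "W1": rng.normal(scale=0.5, size=(n_in, n_hidden)),
        "b1": np.zeros(n_hidden),
        "W2": rng.normal(scale=0.5, size=(n_hidden, n_out)),
        "b2": np.zeros(n_out),
    }

def forward(net, X):
    hidden = sigmoid(X @ net["W1"] + net["b1"])
    output = hidden @ net["W2"] + net["b2"]  # linear output (assumption)
    return hidden, output

def train_step(net, X, y, lr):
    """One full-batch backpropagation step; returns the MSE before the update."""
    n = len(X)
    hidden, output = forward(net, X)
    err = output - y
    # Gradients of 0.5 * mean squared error, propagated backwards layer by layer.
    grad_W2 = hidden.T @ err / n
    grad_b2 = err.mean(axis=0)
    delta_hidden = (err @ net["W2"].T) * hidden * (1.0 - hidden)
    grad_W1 = X.T @ delta_hidden / n
    grad_b1 = delta_hidden.mean(axis=0)
    net["W2"] -= lr * grad_W2
    net["b2"] -= lr * grad_b2
    net["W1"] -= lr * grad_W1
    net["b1"] -= lr * grad_b1
    return float(np.mean(err ** 2))

# Synthetic stand-in for (descriptors, scaled property) pairs.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))
y = (X[:, 0] - 0.5 * X[:, 1]).reshape(-1, 1)

net = init_network(rng)
losses = [train_step(net, X, y, lr=0.1) for _ in range(500)]
```

Holding out part of the data as a test set, as done in the example, is what shows whether the fitted network predicts beyond its training molecules.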
Data Mining Descriptors→Property Relationships
Many general approaches
• Graphical
• Linear Regressions (normal least squares, principal component regression, partial least squares, …)
• Neural Networks (perceptrons, feed-forward, radial-basis, …)
• Clustering (k-means, nearest-neighbor, …)
Many choices in each approach
Neural Networks:
[Figure: network diagram, In → Out]
• Number of neurons/layers – 3:4:1
• Transfer functions: step, sigmoid, tansig, etc.
• Training method: backpropagation algorithms
Thousands of possible approaches!
• Many yield similar results
• Appropriate for different situations
• Problem dependent - much art!!
Descriptors
1. Partial positive surface area (sum of the surface area of positive atoms)
2. Partial negative surface area (sum of the surface area of negative atoms)
3. Total charge weighted positive surface area (descriptor 1 multiplied by the total positive charge)
4. Total charge weighted negative surface area (descriptor 2 multiplied by the total negative charge)
5. Atomic charge weighted positive surface area (sum of SASA*charge for all positive atoms)
6. Atomic charge weighted negative surface area (sum of SASA*charge for all negative atoms)
7. Difference in charged surface areas (descriptor 1 - descriptor 2)
8. Difference in total charge weighted surface areas (descriptor 3 - descriptor 4)
9. Difference in atomic charge weighted surface areas (descriptor 5 - descriptor 6)
10–15. Fractional charged partial surface areas (6 descriptors divided by total surface area)
16–21. Surface weighted charged partial surface areas (6 descriptors multiplied by total surface area)
22. Relative positive charge (charge of most positive atom divided by total positive charge)
23. Relative negative charge (charge of most negative atom divided by total negative charge)
24. Relative positive charge surface area (surface area of most positive atom divided by descriptor 22)
25. Relative negative charge surface area (surface area of most negative atom divided by descriptor 23)
26. Total hydrophobic surface area (sum of surface areas of atoms with |charge| < 0.2)
27. Total polar surface area (sum of surface areas of atoms with |charge| > 0.2)
28. Relative hydrophobic surface area (descriptor 26 divided by total surface area)
29. Relative polar surface area (descriptor 27 divided by total surface area)
30. Total solvent-accessible surface area
(http://www.accelrys.com/cerius2/descriptor.html#list)
Charged partial surface areas descriptors, Accelrys QSAR module
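Descriptors 1 to 9 follow directly from per-atom charges and solvent-accessible surface areas, as the sketch below shows. The function name and the three-atom input are hypothetical; this follows the definitions as listed on the slide and is not the QSAR module's own code.

```python
def cpsa_descriptors(charges, areas):
    """Descriptors 1-9 from per-atom partial charges and surface areas."""
    positive = [(q, a) for q, a in zip(charges, areas) if q > 0]
    negative = [(q, a) for q, a in zip(charges, areas) if q < 0]

    d1 = sum(a for _, a in positive)           # partial positive surface area
    d2 = sum(a for _, a in negative)           # partial negative surface area
    d3 = d1 * sum(q for q, _ in positive)      # descriptor 1 x total positive charge
    d4 = d2 * sum(q for q, _ in negative)      # descriptor 2 x total negative charge
    d5 = sum(q * a for q, a in positive)       # sum of area*charge, positive atoms
    d6 = sum(q * a for q, a in negative)       # sum of area*charge, negative atoms
    return {
        1: d1, 2: d2, 3: d3, 4: d4, 5: d5, 6: d6,
        7: d1 - d2,                            # difference in charged surface areas
        8: d3 - d4,                            # difference in total charge weighted areas
        9: d5 - d6,                            # difference in atomic charge weighted areas
    }

# Hypothetical three-atom example (charges in e, areas in arbitrary units).
example = cpsa_descriptors(charges=[0.5, -0.25, -0.25], areas=[10.0, 20.0, 30.0])
```

Descriptors 10 to 21 are then the first six values divided or multiplied by the total surface area (60.0 in this example).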
Descriptors
• Many broad categories: composition, topological, electronic, physical-chemical properties, …
• Thousands of possible descriptors
– Use physical knowledge to choose relevant ones (e.g., QSAR principle)
– Use numerical methods to choose important descriptors
Ex 5: Boiling Point Descriptors
(Chalk, et al., J. Chem. Inf. Comput. Sci., ’01)
Ex 5: Atomistic Modeling Methods
Use VAMP – AM1 and PM3 Hamiltonians
– Semi-empirical molecular orbital based
– Quantum mechanical, but matrix elements are fit to experimental data
– Can calculate optimized geometries, electronic structure (charge properties)
– Fairly accurate (known failings) and fast
Ex 5: Boiling Points
Training set (6000): 17 (max -119)
Test set (629): 19 (max -94)
Large errors often due to
• Incorrect experimental measurements of TB (low pressure)
• Incorrect experimental structures (tautomer misidentification)
• Failure of atomistic modeling method (approximation errors)
(Chalk, et al., J. Chem. Inf. Comput. Sci., ’01)
Ex 6: Crystal Structure Prediction
• Property: Stable crystal structure
• Descriptors→Property relationship:
Neighbor Clustering algorithm (Euclidean metric)
• Descriptors: Chemical scale (empirically assigned value for each element) (Pettifor, J. Phys. C, ’86)
• Data: All intermetallic binary alloys (thousands)
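Neighbor-based prediction with a Euclidean metric can be sketched as below: each known binary alloy is a point whose coordinates are the chemical-scale values of its two elements, and an unknown alloy takes the majority structure of its nearest known neighbors. The coordinates and structure labels are hypothetical placeholders, not real chemical-scale values, and the k-nearest-neighbor vote is one possible reading of "neighbor clustering".

```python
import math
from collections import Counter

def predict_structure(query, known, k=1):
    """Majority structure among the k known alloys closest to `query`.

    `known` is a list of ((scale_A, scale_B), structure_label) pairs.
    """
    by_distance = sorted(known, key=lambda item: math.dist(query, item[0]))
    votes = Counter(label for _, label in by_distance[:k])
    return votes.most_common(1)[0][0]

# Placeholder map: two groups of known alloys in (scale_A, scale_B) space.
known_alloys = [
    ((0.20, 0.90), "type-1"),
    ((0.30, 0.80), "type-1"),
    ((0.25, 0.95), "type-1"),
    ((0.70, 0.60), "type-2"),
    ((0.80, 0.70), "type-2"),
    ((0.75, 0.55), "type-2"),
]

guess_a = predict_structure((0.25, 0.85), known_alloys, k=3)
guess_b = predict_structure((0.75, 0.65), known_alloys, k=3)
```

The prediction is only as good as the coverage of known points around the query, which is why sparse data for multicomponent systems limits the approach.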
Structure Maps
[Figure: structure map; labeled structures: NaCl, CsCl]
(Rodgers, CRYSTMET, ’03)
Ex 6: Crystal Structure Prediction
• Powerful: structure maps can give 90-95% predictive accuracy
• Many Descriptors: ~50 have been tried based on size, atomic number, cohesive energy, electrochemistry, valence electrons
• Can’t be extended: accurate maps require ~40% of the possible systems to be known (~80% binaries known, ~0.1% quaternaries)
• Can atomistic modeling help?
– Fill in data for multicomponent systems
– Provide optimal descriptors
(Villars, Intermetallic Compounds, ’94)
Conclusions
• Atomistic modeling and data mining can provide valuable predictive ability when physical theories are incomplete
• Key issues are data quality, descriptors, and descriptor→properties relationship
• Dangers of overfitting and tuning
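The overfitting danger can be illustrated with a small sketch: a model with more fitting parameters always matches the training data at least as well, which says nothing about how it predicts unseen data. The data here are synthetic (a straight line plus noise) and the polynomial degrees are arbitrary choices for illustration.

```python
import numpy as np

def rmse(coefs, x, y):
    """RMS error of a polynomial (highest power first) on the points (x, y)."""
    return float(np.sqrt(np.mean((np.polyval(coefs, x) - y) ** 2)))

rng = np.random.default_rng(0)

# Synthetic training and test sets drawn from the same underlying line.
x_train = np.linspace(0.0, 1.0, 8)
y_train = 2.0 * x_train + rng.normal(scale=0.1, size=8)
x_test = np.linspace(0.05, 0.95, 8)
y_test = 2.0 * x_test + rng.normal(scale=0.1, size=8)

simple = np.polyfit(x_train, y_train, 1)     # 2 fitting parameters
flexible = np.polyfit(x_train, y_train, 6)   # 7 fitting parameters

train_simple = rmse(simple, x_train, y_train)
train_flexible = rmse(flexible, x_train, y_train)
test_simple = rmse(simple, x_test, y_test)
test_flexible = rmse(flexible, x_test, y_test)
```

Comparing `train_flexible` with `test_flexible` is the relevant check: a large gap between the two is the signature of a model tuned to its training data.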
Bible Code
Are these words closer than by chance?
Can the Bible predict future events?
Some say yes (Witztum, et al., Stat. Sci., ’94)
Some say no (McKay, et al., Stat. Sci., ’99)
• Many articles
• >60 books on Bible Codes on Amazon
• 1 major motion picture (Omega Code)
Be careful with your statistics!
The First and Greatest Example of Atomic Level Data Mining
END