View
219
Download
0
Category
Preview:
Citation preview
Krishna RajanTMS / ASM Materials Informatics WorkshopCincinatti, OH October 15th 2006
Data Mining and Materials Informatics: a primer
Krishna Rajan
Department of Materials Science and EngineeringNSF Intl. Materials Institute Combinatorial Sciences &
Materials Informatics CollaboratoryIowa State University
Krishna RajanTMS / ASM Materials Informatics WorkshopCincinatti, OH October 15th 2006
What is data? primary, secondary , derivative sources of data
What do we mean by mining data? data correlations –
dimensional analysis approachdata correlations –
data mining
What can we learn from data mining?
Classification data mining databaseshierarchy of datatrends in data
Predictionpredicting structure-property relationshipscomputational informatics vs. computational materials sciencedata mining for building databases
OUTLINE
Krishna RajanTMS / ASM Materials Informatics WorkshopCincinatti, OH October 15th 2006
INFORMATICS
• Establish new correlations
• Identify outliers
• Enlarge database / virtual libraries
• Evaluate databases
• Establish predictions
What is it ? Why ?
•
Searching for patterns of behavior among multivariate data sets
•
Can pattern recognition lead to predictions?
Krishna Rajan
Krishna RajanTMS / ASM Materials Informatics WorkshopCincinatti, OH October 15th 2006
COMPLEXITY OF BIOSYSTEMS
Krishna RajanTMS / ASM Materials Informatics WorkshopCincinatti, OH October 15th 2006
Data + Correlations + Theory = Knowledge-base
•• SimulationsSimulations•• Combinatorially derived datasetsCombinatorially derived datasets•• Digital librariesDigital libraries•• Spectral and imaging librariesSpectral and imaging libraries
•• Machine learning Machine learning •• Data compressionData compression•• Pattern recognitionPattern recognition
•• Atomistic based calculationsAtomistic based calculations•• Continuum based theoriesContinuum based theories
ULTRA LARGE SCALE INFORMATION SPACE
Informatics-BasedDesign
INFORMATICS STRATEGY
Machine learning toolkit:• Classification• Prediction• Hybrid tools
Krishna RajanTMS / ASM Materials Informatics WorkshopCincinatti, OH October 15th 2006
Ideker
and Lauffenburger:(2003)
INFORMATICS-BASED DESIGN STRATEGIES
BIOLOGICAL ANALOGUE CRYSTAL STRUCTURE ANALOGUE
Nye (1950)
Krishna RajanTMS / ASM Materials Informatics WorkshopCincinatti, OH October 15th 2006
Statistical Tools
Combinatorial & Spectral Libraries
Latent Variables /Partial Least Squares
(PLS)
Principal Component Analysis(PCA)
Support Vector Machine(SVM)
Materials DatabasesExperimental dataComputational deriveddata sets
Simulations
Association Mining(AM)
Classification
Prediction
Input & Output
Krishna RajanTMS / ASM Materials Informatics WorkshopCincinatti, OH October 15th 2006
Requirements
•
Global minima•
Categorical data•
Missing / variable data•
Skewed distributions•
Large data sets•
Scalable
Pitfalls
•
Local minima•
Categorical data difficult•
Convergence difficult for large data sets
•
Few outliers can lead to poor performance
MACHINE LEARNING AND MATERIALS DISCOVERY
Krishna RajanTMS / ASM Materials Informatics WorkshopCincinatti, OH October 15th 2006
DATA CURATION
KNOWLEDGE DISCOVERY
Database administration & management
• thermodynamic, crystallographic & propertydata bases
• combinatorial experimental data
Oracle, Unix, Supercomputing, SQL
• taxonomy and ontology of materialsscience data
• data sharing , networking / cyber infrastructure
JAVA, HTML, Python
• object oriented programming language• visualization of high dimensional data
Data mining algorithms
• Clustering analysis • Quantitative Structure-Activity Relationships
(QSAR) for materials design
DATA REPRESENTATION
DATA STORAGE
Krishna RajanTMS / ASM Materials Informatics WorkshopCincinatti, OH October 15th 2006
•
Accelerated insertion of materials into engineering systems
•
Rapid multiscale design and optimization of materials properties
•
Establishment of new structure –property correlations among large, heterogeneous and distributed data sets
•
Discovery of new chemistries and compounds
•
Formulation and / or refinement of new theories for materials behavior
•
Rapid identification of critical data and theoretical needs for future problems
WHY COUPLE COMPUTATIONAL MATERIALS SCIENCE AND INFORMATICS ?
Krishna RajanTMS / ASM Materials Informatics WorkshopCincinatti, OH October 15th 2006
OriginalData
TransformedData
Patterns
KnowledgeData
Warehousing
Feature Extraction
Data Mining & Visualization
Interpretation
SOFT MODELING vs. HARD MODELING
Data bases
DimensionalAnalysis and/or Theory
Constitutive Equations
Hard
modelingHard
modeling
Soft modelingSoft modeling
Krishna Rajan
Krishna RajanTMS / ASM Materials Informatics WorkshopCincinatti, OH October 15th 2006
22x22 (484 cells –
2500 data points):
Silicon nitride descriptors
DATA MAP
Krishna RajanTMS / ASM Materials Informatics WorkshopCincinatti, OH October 15th 2006
DIMENSIONALITY REDUCTION
Krishna RajanTMS / ASM Materials Informatics WorkshopCincinatti, OH October 15th 2006
EIGENVALUE DECOMPOSITION
eigenvalueeigenvector
If V is nonsingular, this become the eigenvalue decomposition
Eigenvalues
on the diagonal of this diagonal matrix
Eigenvectors
forming the columns of this matrix
Sp pλ=
SP P= Λ
1S P P−= ΛLoading Matrix=Eigenvector Matrix
Eigenvalue Matrix
Krishna RajanTMS / ASM Materials Informatics WorkshopCincinatti, OH October 15th 2006
PREDICTIONSPLS
‘Upgradeable’ linear predictive toolQuadratic and interaction terms :flexibility
Calculate latent factors from original predictor variablesT=XW where
T-
latent factors, X –
predictors, W -
weights
Use latent factors to predict responseY=TQ+E where Y -
responses, Q –
loadings, E-
noise
•• W maximizes covariance between Y and T
Factors
UT
Responses
Krishna RajanTMS / ASM Materials Informatics WorkshopCincinatti, OH October 15th 2006
INFORMATICS AND THE PERIODIC TABLE
Krishna RajanTMS / ASM Materials Informatics WorkshopCincinatti, OH October 15th 2006
DATA WAREHOUSING
Krishna RajanTMS / ASM Materials Informatics WorkshopCincinatti, OH October 15th 2006
0 20 400 5000 100000 2000 40001 2 30 5 100 2 40 10 202 4 6
0
20
40
0
5000
10000
0
2000
4000
1
2
3
0
5
10
0
2
4
0
10
20
2
4
6
1 2 3 4 5 6 7
8
8
7
6
5
4
3
2
1
1. # of atoms/unit cell 2. Valence electron # 3. Electronegativity 4. 1st
Ionization potential
5. Atomic radius 6. Melting point 7. Boiling point 8. Density
BIVARIATE MAPS
Krishna RajanTMS / ASM Materials Informatics WorkshopCincinatti, OH October 15th 2006
-2 -1 0 1 2-6 -4 -2 0 2 4 6-6 -4 -2 0 2 4 6
-2
-1
0
1
2
-6
-4
-2
0
2
4
6
-6
-4
-2
0
2
4
6
PC1 PC2 PC3
PC
3
P
C2
P
C1
SEEKING PATTERNS
Krishna RajanTMS / ASM Materials Informatics WorkshopCincinatti, OH October 15th 2006
-4 -3 -2 -1 0 1 2-3
-2
-1
0
1
2
3
4
5
Scores on PC 2 (22.61%)
Sco
res
on P
C 3
(12.
02%
)
1
2 3 4
5
6 7 8
9 10
11
12 13
14
15
16 17 18
19 20
21
22
23
24
25 26
27
28
29 30
31 32
33 34 35 36 37 38
39 40 41 42 43 44
45
46 47 48
49
50
51
52 53
54
55
56 57
58
Na, Mg, Al
K, Ca, Sc, Ti, V, Cr, Mn, Fe, Co, Ni, Cu, Zn, and GaSr, Y, Zr, Nb, Mo, Tc, Ru, Rh, Pd, Ag, Cd, In, and Sn
Cs, Ba, La, Ce, Pr, Nd, Sm, Eu, Gd, Tb, Dy, Ho, Er, Tm, Yb, Lu, Hf, Ta, W, Re, Os, Ir, Pt, Au, Hg, Tl, Pb, and Bi
Each cluster represent each row in the periodic table
MENDELEEV SEQUENCING
Krishna RajanTMS / ASM Materials Informatics WorkshopCincinatti, OH October 15th 2006
•
Location of property in loading plot indicates influence of property on PC
•
Atomic weight has no influence on PC1, high influence on PC2 and 3
•
Clustered properties indicate relationships
•
Electrical and thermal conductivity•
Melting point and density•
Molar volume and atomic radius•
Melting and boiling points, heats of fusion and vaporization, and valence number
•
Pauling electronegativity and first ionization potential
AtomicWeight
PseudopotentialRadius
DensityPauling
electronegativity
FirstIonizationPotential
ThermalConductivity
ElectricalConductivity
Specific Heat
Martynov-Batsanov
electronegativity
MolarVolume
Covalentradius
Metallicradius
Heat ofFusion
Heat ofVaporization
ValenceElectronnumber Melting
Point
BoilingPoint
DEVELOPING PHYSICAL LAWS
Krishna RajanTMS / ASM Materials Informatics WorkshopCincinatti, OH October 15th 2006
RPI Electrical Conductivity (0.00000001*ohm*m)1e-12 1e-10 1e-8 1e-6 1e-4 0.01
Ther
mal
Con
duct
ivity
(W/(m
*K))
1
10
100
32. Germanium (Ge)
6. Carbon, diamond (C)
52. Tellurium (Te)
83. Bismuth (Bi)
29. Copper (Cu)
34. Selenium (Se)
DEVELOPING PHYSICAL LAWS
20
30
40
50
60
70
500 1000 1500 2000 2500
Act
ivat
ion
Ener
gy, Q
[kca
l/mol
]
Tm
[K]
Pb
Ag
Au
Cu
NiCo
Fe Pt
18Rg
Linear regression
Bokshtein & Zhukhovitskii
Al
Activation energy for self-diffusion versus melting point
From Glicksman
Krishna RajanTMS / ASM Materials Informatics WorkshopCincinatti, OH October 15th 2006
9 descriptors from NIST data-
Lattice parameters (a, b, c, α, β, γ)-
c/a, b/c-
V/abc
Hexagonal
MonoclinicTetragonal(P42
/mmc)
-100
-50
0
50
-50
0
50-40
-30
-20
-10
0
10
6
1 2 9 22 8 10 23 3 15 21 16 4 25 26 31 27 34 32 28 14 12 13
PC 1 (61.71%)
33 30 5 7 24 20 11
Scores Plot
18
29
19 17
PC 2 (29.17%)
PC
3 (
5.85
%)
1 2 9 22
8 10
23 3
15
21
16 4 25 26
31 34
27
32
28
14 12 13
Tetragonal(I-42d)
Orthorhombic
Cubic
PCA Score plot
(34 binary, ternary, quaternary compounds)
PCA of 34 binary, ternary, quaternary compounds: Score plotPCA of 34 binary, ternary, quaternary compounds: Score plot
Krishna RajanTMS / ASM Materials Informatics WorkshopCincinatti, OH October 15th 2006
1 2 3 4 5 6789 10 1112 13 1415 16 1718 19202122 23 2425 262728 29 3031 32 3334
Cluster analysis of 34 binary, ternary, quaternary compoundsCluster analysis of 34 binary, ternary, quaternary compounds
Cubic(Fm-3m)
Tetragonal(I-42d)
Cubic(P41
32,P-43m,Pm-3,Pm-3,Fd-3ms)
Orthorhombic
Cubic(Im-3,Pm-3)
Tetragonal(I4/mcm)
Hexagonal (P63
/mmc,P6/mmm, P-6m2)/ Trigonal (P-3m1)
Tetragonal(P42/mmc)
Hexagonal (P-62m,P-6)Monoclinic
Hexagonal(P63
/mmc)
α, β, γ
=
90° α, β, γ ≠ 90°
Krishna RajanTMS / ASM Materials Informatics WorkshopCincinatti, OH October 15th 2006
Atomic packing of single element
Li Al Cu ScMg
BCC FCC HCP FCC HCP
sp electrons/atom† 1 3 2 11 2
Ex.) Engel’s model using “# of sp electrons/atom”†
BCC < 1.5HCP 1.7 –
2.1 FCC 2.5-3
RULE EXTRACTION
Krishna RajanTMS / ASM Materials Informatics WorkshopCincinatti, OH October 15th 2006
PATTERN DETECTION
Krishna RajanTMS / ASM Materials Informatics WorkshopCincinatti, OH October 15th 2006
-6-4
-20
24
6
-6
-4
-2
0
2
4
6
-6-4
-20
24
6
PC2(22.43
%)
PC3(
19.4
7%)
PC1(49.89%)
classification
Graduciz, Gaune-EscardUniv.Novi
Sad, SerbiaCNRS, Marseilles
covalent ionic
prediction
DATA MINING DATABASES
Krishna RajanTMS / ASM Materials Informatics WorkshopCincinatti, OH October 15th 2006
COMPUTATIONAL ISSUES
Data can come across length and time scales
Focus on properties of signal / macroscopic behavior rather than noise/ error. Assume complexity !!!
Utilize data dimensionality reduction techniques
Analyze variation and correlation in data
Covariance
Establish correlations acrossdiverse data sets ( ie. length & time scales)
Identify outliers: explore cause
Develop predictive models-
Target requirements of missing data
-
Quantitatively assess data diversity
Model relationships in data to seek heuristic relationships:
Advanced statistical learning tools can deal with:-
skewed data-
missing data-
differentiate between local and global minima
-
ultra large scale datasets-
variable uncertainty
•Singular value decomposition•Support vector machines•Association mining•Fuzzy clustering
Establish multivariate database:
Seek DIVERSITY
in datasets
Krishna Rajan
Krishna RajanTMS / ASM Materials Informatics WorkshopCincinatti, OH October 15th 2006
CRYSTAL CHEMISTRY DESIGN
1.
Assess influence of latentvariables ( i.e. electronic structure parameters) on properties of known data
2.
Establish heuristic relationships on database of all input variables instead of phenomenological relationships in bivariate manner
3.
Use statistical learning to predict new materials behavioron new multivariate input data
4. Inverse problem approach to formulate quantitative structure-
property relationships
Ching et.al J. Amer. Ceram.Soc. 85 75-80 (2002)
Krishna Rajan
Krishna RajanTMS / ASM Materials Informatics WorkshopCincinatti, OH October 15th 2006
Virtual library:via informatics
Refractory metals
Suh and Rajan (2005, 2006)
“Real” library:via first principles
calculations
Krishna Rajan
Krishna RajanTMS / ASM Materials Informatics WorkshopCincinatti, OH October 15th 2006
LINKING DISPARATE LENGTH SCALESVisualizing Associations
Krishna RajanTMS / ASM Materials Informatics WorkshopCincinatti, OH October 15th 2006
hcp
fcc
SYNTHESIZING REMOTE DATA SETS
Krishna RajanTMS / ASM Materials Informatics WorkshopCincinatti, OH October 15th 2006
• Search and retrieve
• Refining descriptors
• Developing predictions
• Filling in missing data
• Structure databases
• Diffraction spectra databases
• Hyper spectral imaging databases
Mapping / Visualization of databases:
• Correlations• Trend analysis
STRATEGY FOR MATERIALS DESIGN
Krishna RajanTMS / ASM Materials Informatics WorkshopCincinatti, OH October 15th 2006
Science was originally empirical, like Leonardo, making wonderful drawings of
nature. Next came the theorists whotried to write down the equations that
explained the observed behaviors, like Kepler or Einstein. Then when we got to
complex enough systems like the clustering of a million galaxies, there came the
computer simulations, the computational branch of science. Now we are getting into the data exploration part of science,
which
is kind of a little bit of them all
””....
Dr. Alex Szalay: Virtual Observatory Project
May 20th 2003
The Facts of the Matter: Finding, understanding, and using information about our
physical world (2000)
CONCLUSIONS
Information science based design of Information science based design of materialsmaterials…………next stage of the scientific next stage of the scientific
discovery processdiscovery process
Division of Materials Research:Division of Materials Research:Cyber infrastructure and cyber discovery in Cyber infrastructure and cyber discovery in
materials science (2006)materials science (2006)
Recommended