
Applications of Data Mining for Predicting Material Properties

Vipin Kumar, University of Minnesota

[email protected]

www.cs.umn.edu/~kumar

Research supported by NSF, ARL


August 4, 2006

Why Data Mining? Commercial Viewpoint

• Lots of data is being collected and warehoused
  – Web data
    • Yahoo! collects 10GB/hour
  – Purchases at department/grocery stores
    • Walmart records 20 million transactions per day
  – Bank/Credit Card transactions
• Computers have become cheaper and more powerful
• Competitive pressure is strong
  – Provide better, customized services for an edge (e.g. in Customer Relationship Management)


Why Data Mining? Scientific Viewpoint

• Data collected and stored at enormous speeds (GB/hour)
  – Remote sensors on a satellite
    • NASA EOSDIS archives over 1 petabyte of earth science data per year
  – Telescopes scanning the skies
    • Sky survey data
  – Gene expression data
  – Scientific simulations
    • Terabytes of data generated in a few hours
• Data mining may help scientists
  – in automated analysis of massive data sets
  – in hypothesis formation

[Figure: gridded climate data. Variables SST, Precipitation, NPP, and Pressure are recorded per grid cell, indexed by Longitude, Latitude, and Time; grid cells are grouped into zones.]

Data Mining for Climate Data

NASA ESE questions:

• How is the global Earth system changing?
• What are the primary forcings?
• How does the Earth system respond to natural & human-induced changes?
• What are the consequences of changes in the Earth system?
• How well can we predict future changes?

Global snapshots of values for a number of variables on land surfaces or water

NASA DATA MINING REVEALS A NEW HISTORY OF NATURAL DISASTERS

NASA is using satellite data to paint a detailed global picture of the interplay among natural disasters, human activities and the rise of carbon dioxide in the Earth's atmosphere during the past 20 years….

http://www.nasa.gov/centers/ames/news/releases/2003/03_51AR.html

High Resolution EOS Data:

• EOS satellites provide high resolution measurements
  – Finer spatial grids
    • A 1 km × 1 km grid produces 694,315,008 data points
    • Going from 0.5° × 0.5° data to 1 km × 1 km data results in a 2500-fold increase in the data size
  – More frequent measurements
  – Multiple instruments
• High resolution data allows us to answer more detailed questions:
  – Detecting patterns such as trajectories, fronts, and movements of regions with uniform properties
  – Finding relationships between leaf area index (LAI) and topography of a river drainage basin
  – Finding relationships between fire frequency and elevation as well as topographic position
• Leads to substantially higher computational and memory requirements

Detection of Ecosystem Disturbances:

Disturbance Viewer

This interactive module displays the locations on the earth surface where significant disturbance events have been detected.

Data Mining for Cyber Security

• Due to the proliferation of the Internet, more and more organizations are becoming vulnerable to sophisticated cyber attacks

• Traditional Intrusion Detection Systems (IDS) have well-known limitations
  – Too many false alarms
  – Unable to detect sophisticated and novel attacks
  – Unable to detect insider abuse / policy abuse

• Data Mining is well suited to address these challenges


MINDS – Minnesota Intrusion Detection System

• Incorporated into Interrogator architecture at ARL Center for Intrusion Monitoring and Protection (CIMP)
• Helps analyze data from multiple sensors at DoD sites around the country
• Routinely detects Insider Abuse / Policy Violations / Worms / Scans

Large Scale Data Analysis is needed for

• Correlation of suspicious events across network sites
  – Helps detect sophisticated attacks not identifiable by single site analyses
• Analysis of long term data (months/years)
  – Uncover suspicious stealth activities (e.g. insiders leaking/modifying information)

http://www.caida.org/outreach/papers/2003/sapphire/sql-after.gif


Data Mining for Biomedical Informatics

Recent technological advances are helping to generate large amounts of both medical and genomic data

• High-throughput experiments/techniques
  – Gene and protein sequences
  – Gene-expression data
  – Biological networks and phylogenetic profiles
• Electronic Medical Records
  – IBM-Mayo clinic partnership has created a DB of 5 million patients
  – Single Nucleotide Polymorphisms (SNPs)

Data mining offers a potential solution for analysis of large-scale data

• Automated analysis of patient histories for customized treatment
• Prediction of the functions of anonymous genes
• Identification of putative binding sites in protein structures for drug/chemical discovery

Protein Interaction Network


Origins of Data Mining

• Draws ideas from machine learning/AI, pattern recognition, statistics, and database systems
• Traditional techniques may be unsuitable due to
  – Enormity of data
  – High dimensionality of data
  – Heterogeneous, distributed nature of data

[Figure: Data Mining shown at the intersection of Machine Learning/Pattern Recognition, Statistics/AI, and Database systems.]


Data Mining Tasks...

Tid  Refund  Marital Status  Taxable Income  Cheat
1    Yes     Single          125K            No
2    No      Married         100K            No
3    No      Single          70K             No
4    Yes     Married         120K            No
5    No      Divorced        95K             Yes
6    No      Married         60K             No
7    Yes     Divorced        220K            No
8    No      Single          85K             Yes
9    No      Married         75K             No
10   No      Single          90K             Yes
…
12   Yes     Divorced        220K            No
13   No      Single          85K             Yes
…
15   No      Single          90K             Yes

[Figure: the data table at the center, surrounded by the four data mining tasks: Predictive Modeling, Clustering, Association Rules, and Anomaly Detection.]


QSAR: Quantitative Structure-Activity Relationship

• QSAR is the process by which chemical structure is quantitatively correlated with a well-defined process, such as biological activity or chemical reactivity. (Wikipedia) http://en.wikipedia.org/wiki/Quantitative_structure-activity_relationship

• Long history, but growing importance with the expansion of the pharmaceutical industry and advances in the life sciences. http://media.wiley.com/product_data/excerpt/03/04712709/0471270903.pdf

• Mostly focused on mathematical and statistical models, but machine learning and data mining techniques are increasingly being applied, particularly in the areas of prediction and molecule mining.


Some Applications of Data Mining to Chemical Informatics

George Karypis, University of Minnesota

• Computationally efficient algorithms to mine large databases of molecular graphs and identify key substructures present in active (inactive) compounds.

• Sophisticated feature selection and generation algorithms that identify and synthesize substructure-based features that simultaneously simplify the representation of the original compounds while retaining and exposing their key features.

• Kernel-based clustering and classification approaches that take into account the relationships between these substructures at different levels of granularity and complexity.


• Example: development of efficient algorithms to find frequent substructures in molecular graphs (either topological or geometric). The topological version of this algorithm, called FSG, is currently available as part of the pattern discovery toolkit PAFI, which can be downloaded from http://glaros.dtc.umn.edu/gkhome/pafi/overview.



Case Study: Predicting Chemical Properties

Goal is to predict chemical properties of interest.
  – Sensitivity of energetic materials
  – Properties of stealth coatings
  – Conformational and reactive site variations in proteins

Information available for prediction includes
  – Results from computational chemistry programs, e.g., GAUSSIAN
    • Electrostatic potential and electron density
    • Vibrational frequencies and bond energies
  – Chemical composition
  – 3D structure
    • Bond lengths, bond angles, etc.
  – Electron scattering cross section

Approach is to apply data mining techniques.

Data on 34 compounds provided by Dr. Betsy Rice, ARL.

Electron density iso-surface.

Electron density iso-surface colored by electrostatic potential

Approach Based on Computational Chemistry Data


Surface Area: Surface area of the iso-surface, computed from the physical positions of the points on the iso-surface.

Average Positive Electrostatic Potential: Average value of the electrostatic potential for all iso-surface points with positive potential.

Average Negative Electrostatic Potential: Average value of the electrostatic potential for all iso-surface points with negative potential.

Average All Electrostatic Potential: Average value of the electrostatic potential for all iso-surface points.

Most Positive Electrostatic Potential: Maximum value of the electrostatic potential for all iso-surface points with positive potential.

Most Negative Electrostatic Potential: Minimum value of the electrostatic potential for all iso-surface points with negative potential.

Sig Positive: Variance of all iso-surface points with positive potential.

Sig Negative: Variance of all iso-surface points with negative potential.

SIG(TOT): Sig Positive + Sig Negative

PI: The average “absolute” potential for all points on the iso-surface, after subtracting Average All Electrostatic Potential.

BALANCE:

$$\mathrm{BALANCE} = \frac{\text{Sig Positive} \times \text{Sig Negative}}{\left(\text{SIG(TOT)}\right)^{2}}$$
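The descriptor definitions above can be sketched in a few lines of Python. This is a minimal illustration, not the code used in the study: the function name `surface_descriptors` and the dictionary keys are hypothetical, the input is assumed to be a plain list of electrostatic potential values sampled on the iso-surface, and "variance" is assumed to mean population variance about each group's own mean. Surface area is omitted because it needs the point coordinates.

```python
from statistics import fmean, pvariance


def surface_descriptors(potentials):
    """Summarize electrostatic potential values sampled on an iso-surface.

    `potentials` is a list of floats, one per iso-surface point (assumed input).
    """
    pos = [v for v in potentials if v > 0]
    neg = [v for v in potentials if v < 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative point")

    avg_all = fmean(potentials)
    # Assumption: population variance; the slide only says "variance".
    sig_pos = pvariance(pos)
    sig_neg = pvariance(neg)
    sig_tot = sig_pos + sig_neg
    if sig_tot == 0:
        raise ValueError("BALANCE is undefined when SIG(TOT) is zero")

    return {
        "avg_pos": fmean(pos),
        "avg_neg": fmean(neg),
        "avg_all": avg_all,
        "most_pos": max(pos),
        "most_neg": min(neg),
        "sig_pos": sig_pos,
        "sig_neg": sig_neg,
        "sig_tot": sig_tot,
        # PI: mean absolute deviation from the overall average potential.
        "pi": fmean([abs(v - avg_all) for v in potentials]),
        # BALANCE = Sig Positive * Sig Negative / SIG(TOT)^2
        "balance": sig_pos * sig_neg / sig_tot ** 2,
    }
```

With this definition, BALANCE reaches its largest value, 0.25, when Sig Positive equals Sig Negative, and shrinks toward 0 as one of the two variances dominates.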



Visualization: Sensitivity vs. Balance


Visualization: Avg. Neg. Pot. vs. Balance vs. Sensitivity

Circle size proportional to sensitivity


Approach Based on Frequent Substructures

• Frequent substructures were found using FSG, a Frequent Subgraph Discovery program by M. Kuramochi and G. Karypis.

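Input: one graph per molecule

molecule1   graph of molecule1
molecule2   graph of molecule2
…           …
moleculeN   graph of moleculeN

Threshold t

Output: molecule-by-substructure table (1 = substructure present, 0 = absent)

            substr1  substr2  …  substrM
molecule1   1        0        …  1
molecule2   0        1        …  1
…           …        …        …  …
moleculeN   1        0        …  0

[Figure legend: atom types H, C, N, O]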
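The step from mined substructures to the 0/1 table can be sketched as follows. This is only an illustration and does not perform subgraph mining, which is FSG's job: it assumes each molecule has already been reduced to a set of substructure identifiers, and it assumes the threshold t is a minimum support expressed as a fraction of molecules. The function names and the identifiers in the usage lines are hypothetical.

```python
from collections import Counter


def frequent_substructures(molecule_substructs, t):
    """Return substructure ids that occur in at least a fraction t of molecules.

    `molecule_substructs` maps a molecule name to the set of substructure ids
    found in it (assumed to come from a separate mining step such as FSG).
    """
    n = len(molecule_substructs)
    support = Counter()
    for substructs in molecule_substructs.values():
        support.update(set(substructs))  # count each molecule at most once
    return sorted(s for s, count in support.items() if count / n >= t)


def presence_matrix(molecule_substructs, features):
    """Build one 0/1 row per molecule, with one column per frequent substructure."""
    return {
        name: [1 if f in substructs else 0 for f in features]
        for name, substructs in molecule_substructs.items()
    }


# Hypothetical toy input: three molecules described by made-up substructure ids.
toy = {
    "molecule1": {"s1", "s2"},
    "molecule2": {"s1", "s3"},
    "molecule3": {"s1", "s2", "s4"},
}
features = frequent_substructures(toy, t=0.6)
table = presence_matrix(toy, features)
```

Each row of `table` can then be fed to a classifier or regression model as a fixed-length feature vector.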

Example Decision Tree Based on Substructures

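[Figure: decision tree whose internal nodes test for the presence (1) or absence (0) of Substr_19, Substr_147, Substr_296, Substr_31, and Substr_298. Leaves are labeled High or Low sensitivity, with leaf counts 6, 3, 7, 4, 5, and 3. A legend depicts each of the five substructures.]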

Precision = 89.3%

• Each branch of the tree corresponds to a rule.

• If substr_19 is present, the predicted class of this molecule is “High sensitivity”.

• If substr_19 and substr_296 are not present, but substr_147 is present, the predicted class is “Low sensitivity”.
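The two rules quoted above can be written directly as code. This sketch encodes only those two rules; the remaining branches of the tree are not reproduced, so the function returns `None` when neither rule applies. The function name and the representation of a molecule as a set of substructure ids are assumptions made for illustration.

```python
def predict_sensitivity(present):
    """Apply the two quoted rules to a set of substructure ids.

    Returns "High", "Low", or None when neither quoted rule fires
    (the other branches of the tree are not encoded here).
    """
    # Rule 1: substr_19 present -> High sensitivity.
    if "substr_19" in present:
        return "High"
    # Rule 2: substr_19 and substr_296 absent, substr_147 present -> Low.
    if "substr_296" not in present and "substr_147" in present:
        return "Low"
    return None
```

Reading a tree as a list of such rules is what makes the model easy to inspect: each prediction can be traced back to the presence or absence of a handful of named substructures.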


Geometric Approach Based on Alpha-Shapes

• An alpha-shape is a computational technique for representing 3D shape at different levels of detail (alpha > 0 is a parameter which controls the detail).
  – Used extensively in the analysis of molecular structure (molecular volume and surface area).
  – Reveals interesting features such as voids, tunnels, and pockets in a molecule’s structure.

Example: Picric acid (C6H3N3O7)

[Figure: alpha-shapes of picric acid at alpha = 0, alpha = 4.4172, and alpha = 18.837.]


Regression Approach Using Multiple Sets of Features

• Want to predict the impact sensitivity of each molecule using the most predictive set of attributes.
  – Different sets of attributes
    • Frequent substructures
    • Alpha-shape features, e.g., voids
    • Topological, chemical, and electronic descriptors
  – Data provided by Dr. J.M. Cense
  – Reference: “Prediction of the Impact Sensitivity by Neural Networks”, by H. Nefati and J.M. Cense, Advance ACS Abstracts, March 1996.

• Regression techniques: Artificial Neural Networks (ANN), regression-SVM, linear least squares.
  – We used ANN to predict the sensitivity.


Regression using topological, geometrical, electronic descriptors

Index  Descriptor
0      sensitivity
1      oxygen balance
2      molecular electronegativity
3      # of CO2 groups
4      # of NO2-Csp2 bonds
5      # of NO2-Csp3 bonds
6      # of NO2-N bonds
7      # of NO2-O bonds
8      # of rings
9      number of NH2 groups
10     number of OH groups
11     number of C(NO2)3 groups
12     -CH in alpha of nitroaromatic
13     indicator of symmetry
14     number of -C=O groups
15     number of Y-O-X groups
16     number of C-C double bonds
17     number of C-C triple bonds
18     number of C-N double bonds
19     number of N-N triple bonds
20     number of N-N double bonds
21     number of C-N triple bonds
22     number of C atoms
23     number of H atoms
24     number of N atoms
25     number of O atoms
26     100/molecular weight
27     indicator of aromaticity (0 1)
28     sum of X-NO2 charges dissymmetry
29     avg of the X-NO2 discharge symmetry
30     sum of X-NO2 charges dissymmetry/mol weight
31     length of the longest X-NO2
32     length of the shortest X-NO2
33     highest potential for a X-NO2
34     smallest potential for a X-NO2
35     avg potential for a X-NO2
36     avg length of the X-NO2
37     heat of formation
38     dipole
39     ionization potential

Target: sensitivity

11 attribute set (topological only): 1 4 5 6 7 12 20 23 24 25 25

13 attribute set: 1 2 4 6 12 13 14 17 18 20 29 30 32

Data on 200 compounds provided by Dr. J.M. Cense


Replicated NN regression result

These are the best results that were reported in the paper by Nefati and Cense.

The predicted value is the average of the outputs of the 11-attribute and 13-attribute 2-hidden-layer-node NNs (NN11.2 and NN13.2).

Average absolute error 0.175

Average squared error 0.047

[Figure: NN11.2+NN13.2 avg. Scatter plot of predicted h_50 versus real h_50; both axes run from 0 to 3.]
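Averaging two models and scoring the result can be sketched as below. This is an illustration under stated assumptions, not the original evaluation code: "average absolute error" is taken to be the mean of |real − predicted| and "average squared error" the mean of (real − predicted)², the function names are hypothetical, and the numbers in the usage lines are made up rather than taken from the 200-compound data set.

```python
def average_predictions(pred_a, pred_b):
    """Average the outputs of two models, one value per compound."""
    if len(pred_a) != len(pred_b):
        raise ValueError("prediction lists must have the same length")
    return [(a + b) / 2 for a, b in zip(pred_a, pred_b)]


def average_absolute_error(real, predicted):
    """Mean of |real - predicted| over all compounds."""
    return sum(abs(r - p) for r, p in zip(real, predicted)) / len(real)


def average_squared_error(real, predicted):
    """Mean of (real - predicted)^2 over all compounds."""
    return sum((r - p) ** 2 for r, p in zip(real, predicted)) / len(real)


# Hypothetical values standing in for real h_50 and the two NN outputs.
real_h50 = [1.0, 2.0, 3.0]
nn11_out = [1.2, 1.8, 2.6]
nn13_out = [1.0, 2.4, 3.0]

combined = average_predictions(nn11_out, nn13_out)
abs_err = average_absolute_error(real_h50, combined)
sq_err = average_squared_error(real_h50, combined)
```

Averaging tends to help when the two models make partly independent errors, which is plausible here because they are trained on different attribute sets.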


An Alpha-shape feature: voids

[Figure: NN(11+void).2+NN(13+void).2 avg2. Scatter plot of predicted h_50 versus real h_50.]

Adding the number of voids for each molecule at alpha max improves the model (with exactly the same parameters as the NN from the previous page).

Average absolute error 0.133

Average squared error 0.029


Concluding Remarks

Data mining is making important contributions to data analysis in many areas of science and business.

A set of techniques and software for preprocessing, mining, and visualizing large data sets can be an important enabling cyber-infrastructure for science and engineering research.


Bibliography

• Pang-Ning Tan, Michael Steinbach, and Vipin Kumar, Introduction to Data Mining, Addison-Wesley, April 2005.

• Ananth Grama, Anshul Gupta, George Karypis, and Vipin Kumar, Introduction to Parallel Computing (Second Edition), Addison-Wesley, 2003.

• R. Grossman, C. Kamath, W. P. Kegelmeyer, V. Kumar, and R. Namburu (eds.), Data Mining for Scientific and Engineering Applications, Kluwer Academic Publishers, 2001.

• J. Han, R. B. Altman, V. Kumar, H. Mannila, and D. Pregibon, “Emerging Scientific Applications in Data Mining,” Communications of the ACM, Volume 45, Number 8, pp. 54-58, August 2002.

• Kevin W. DeRonne and George Karypis, “Effective Optimization Algorithms for Fragment-assembly based Protein Structure Prediction,” Computational Systems Bioinformatics Conference (CSB), 2006.

• Mukund Deshpande, Michihiro Kuramochi, Nikil Wale, and George Karypis, “Frequent Sub-structure Based Approaches for Classifying Chemical Compounds,” IEEE Trans. Knowl. Data Eng. 17(8): 1036-1050, 2005.

• Michihiro Kuramochi and George Karypis, “An Efficient Algorithm for Discovering Frequent Subgraphs,” IEEE Trans. Knowl. Data Eng. 16(9): 1038-1051, 2004.

• Gayle Eherenman, “Mining What Others Miss,” Mechanical Engineering Magazine, http://www.memagazine.org/backissues/feb05/features/miningwh/miningwh.html, February 2005.
