45
Linear algebra methods for data mining with applications to materials Yousef Saad Department of Computer Science and Engineering University of Minnesota SIAM 2012 Annual Meeting Twin Cities, July 13th, 2012

Yousef Saad Department of Computer Science and Engineeringsaad/PDF/SIAM_AN_12.pdf · Yousef Saad Department of Computer Science ... or patterns from data. Broad area : data analysis,

  • Upload
    lynga

  • View
    243

  • Download
    0

Embed Size (px)

Citation preview

Page 1: Yousef Saad Department of Computer Science and Engineeringsaad/PDF/SIAM_AN_12.pdf · Yousef Saad Department of Computer Science ... or patterns from data. Broad area : data analysis,

Linear algebra methods for data mining withapplications to materials

Yousef SaadDepartment of Computer Science

and Engineering

University of Minnesota

SIAM 2012 Annual MeetingTwin Cities, July 13th, 2012

Page 2: Yousef Saad Department of Computer Science and Engineeringsaad/PDF/SIAM_AN_12.pdf · Yousef Saad Department of Computer Science ... or patterns from data. Broad area : data analysis,

Introduction: a few factoids

ä Data is growing exponentially at an “alarming” rate:

• 90% of data in world today was created in last two years

• Every day, 2.3 Million terabytes (2.3×1018 bytes) created

ä “Big data” term coined to reflect this trend

ä Mixed blessing: Opportunities & big challenges.

ä Trend is re-shaping & energizing many research areas ...

ä ... including my own: numerical linear algebra

Minneapolis, 07-13-2012 p. 2

Page 3: Yousef Saad Department of Computer Science and Engineeringsaad/PDF/SIAM_AN_12.pdf · Yousef Saad Department of Computer Science ... or patterns from data. Broad area : data analysis,

Introduction: What is data mining?

Set of methods and tools to extract meaningful informationor patterns from data. Broad area : data analysis, machinelearning, pattern recognition, information retrieval, ...

ä Tools used: linear algebra; Statistics; Graph theory; Approx-imation theory; Optimization; ...

ä This talk: brief introduction – with emphasis on linear alge-bra viewpoint

ä + our initial work on materials.

ä Big emphasis on “Dimension reduction methods”

Minneapolis, 07-13-2012 p. 3

Page 4: Yousef Saad Department of Computer Science and Engineeringsaad/PDF/SIAM_AN_12.pdf · Yousef Saad Department of Computer Science ... or patterns from data. Broad area : data analysis,

Drowning in data

1010110011000111011100

0101101001100100

11001010011

1010111001101

110010

110011

110010

110010

111001

Dimension

PLEASE !

Reduction

Picture modified from http://www.123rf.com/photo_7238007_man-drowning-reaching-out-for-help.html

Minneapolis, 07-13-2012 p. 4

Page 5: Yousef Saad Department of Computer Science and Engineeringsaad/PDF/SIAM_AN_12.pdf · Yousef Saad Department of Computer Science ... or patterns from data. Broad area : data analysis,

Major tool of Data Mining: Dimension reduction

ä Goal is not as much to reduce size (& cost) but to:

• Reduce noise and redundancy in data before performing atask [e.g., classification as in digit/face recognition]

• Discover ‘important features’ or ‘patterns’

ä Map data to: Preserve proximity? Maximize variance?Preserve a certain graph?

ä Other term used: feature extraction [general term for lowdimensional representations of data]

Minneapolis, 07-13-2012 p. 5

Page 6: Yousef Saad Department of Computer Science and Engineeringsaad/PDF/SIAM_AN_12.pdf · Yousef Saad Department of Computer Science ... or patterns from data. Broad area : data analysis,

The problem of Dimension Reduction

ä Given d � m find amapping Φ:

Φ : x ∈ Rm −→ y ∈ Rd

Practically: Given: X = [x1, · · · , xn] ∈ Rm×n, find a

low-dimens. representation Y = [y1, · · · , yn] ∈ Rd×n of X

m

n

X

Y

x

y

i

id

n

Φ may be linear,or nonlinear (im-plicit). Linear case(W ∈ Rm×d):

Y = W>X

Minneapolis, 07-13-2012 p. 6

Page 7: Yousef Saad Department of Computer Science and Engineeringsaad/PDF/SIAM_AN_12.pdf · Yousef Saad Department of Computer Science ... or patterns from data. Broad area : data analysis,

Example: Principal Component Analysis (PCA)

In Principal Component Analysis W is computed to maxi-mize variance of projected data:

maxW ∈ Rm×d

W>W = I

d∑i=1

∥∥∥∥∥∥yi −1

n

n∑j=1

yj

∥∥∥∥∥∥2

2

, yi = W>xi.

ä Leads to maximizing

Tr[W>(X − µe>)(X − µe>)>W

], µ = 1

nΣn

i=1xi

Minneapolis, 07-13-2012 p. 7

Page 8: Yousef Saad Department of Computer Science and Engineeringsaad/PDF/SIAM_AN_12.pdf · Yousef Saad Department of Computer Science ... or patterns from data. Broad area : data analysis,

ä Solution W = { dominant eigenvectors } of the covariancematrix == Set of left singular vectors of X̄ = X − µe>

SVD:

X̄ = UΣV >, U>U = I, V >V = I, Σ = Diag

ä Optimal W = Ud ≡ matrix of first d columns of U

ä Solution W also minimizes ‘reconstruction error’ ..

∑i

‖xi −WW Txi‖2 =∑

i

‖xi −Wyi‖2

ä In some methods recentering to zero is not done, i.e., X̄replaced by X.

Minneapolis, 07-13-2012 p. 8

Page 9: Yousef Saad Department of Computer Science and Engineeringsaad/PDF/SIAM_AN_12.pdf · Yousef Saad Department of Computer Science ... or patterns from data. Broad area : data analysis,

Example : Digit images (a random sample of 30)

5 10 15

10

205 10 15

10

205 10 15

10

205 10 15

10

205 10 15

10

205 10 15

10

20

5 10 15

10

205 10 15

10

205 10 15

10

205 10 15

10

205 10 15

10

205 10 15

10

20

5 10 15

10

205 10 15

10

205 10 15

10

205 10 15

10

205 10 15

10

205 10 15

10

20

5 10 15

10

205 10 15

10

205 10 15

10

205 10 15

10

205 10 15

10

205 10 15

10

20

5 10 15

10

205 10 15

10

205 10 15

10

205 10 15

10

205 10 15

10

205 10 15

10

20

Minneapolis, 07-13-2012 p. 9

Page 10: Yousef Saad Department of Computer Science and Engineeringsaad/PDF/SIAM_AN_12.pdf · Yousef Saad Department of Computer Science ... or patterns from data. Broad area : data analysis,

2-D ’reductions’:

−10 −5 0 5 10−6

−4

−2

0

2

4

6PCA − digits : 0 −− 4

01234

−0.2 −0.1 0 0.1 0.2−0.2

−0.15

−0.1

−0.05

0

0.05

0.1

0.15LLE − digits : 0 −− 4

0.07 0.071 0.072 0.073 0.074−0.15

−0.1

−0.05

0

0.05

0.1

0.15

0.2K−PCA − digits : 0 −− 4

01234

−5.4341 −5.4341 −5.4341 −5.4341 −5.4341

x 10−3

−0.25

−0.2

−0.15

−0.1

−0.05

0

0.05

0.1ONPP − digits : 0 −− 4

Minneapolis, 07-13-2012 p. 10

Page 11: Yousef Saad Department of Computer Science and Engineeringsaad/PDF/SIAM_AN_12.pdf · Yousef Saad Department of Computer Science ... or patterns from data. Broad area : data analysis,

Unsupervised learning

“Unsupervised learning” : meth-ods that do not exploit known labelsä Example of digits: perform a 2-D projectionä Images of same digit tend tocluster (more or less)ä Such 2-D representations arepopular for visualizationä Can also try to find naturalclusters in data, e.g., in materialsä Basic clusterning technique: K-means

−6 −4 −2 0 2 4 6 8−5

−4

−3

−2

−1

0

1

2

3

4

5PCA − digits : 5 −− 7

567

SuperhardPhotovoltaic

Superconductors

Catalytic

Ferromagnetic

Thermo−electricMulti−ferroics

Minneapolis, 07-13-2012 p. 11

Page 12: Yousef Saad Department of Computer Science and Engineeringsaad/PDF/SIAM_AN_12.pdf · Yousef Saad Department of Computer Science ... or patterns from data. Broad area : data analysis,

Example: Sparse Matrices viewpoint (J. Chen & YS ’09)

ä Communities modeled by an ‘affinity’ graph [e.g., ’user Asends frequent e-mails to user B’]ä Adjacency Graph represented by a sparse matrix

← OriginalmatrixGoal: Find

ordering soblocks areas dense aspossible→

ä Use ‘blocking’ techniques for sparse matricesä Advantage of this viewpoint: need not know # of clusters.[data: www-personal.umich.edu/˜mejn/netdata/]

Minneapolis, 07-13-2012 p. 12

Page 13: Yousef Saad Department of Computer Science and Engineeringsaad/PDF/SIAM_AN_12.pdf · Yousef Saad Department of Computer Science ... or patterns from data. Broad area : data analysis,

Supervised learning: classification

ä Best illustration: written digits recognition example

Given: a set oflabeled samples(training set), andan (unlabeled) testimage.Problem: find

label of test image

��������

������������

������ ��

Training data Test data

Dim

ensio

n red

uctio

n

Dig

it 0

Dig

fit

1

Dig

it 2

Dig

it 9

Dig

it ?

?

Dig

it 0

Dig

fit

1

Dig

it 2

Dig

it 9

Dig

it ?

?

ä Roughly speaking: we seek dimension reduction so thatrecognition is ‘more effective’ in low-dim. space

Minneapolis, 07-13-2012 p. 13

Page 14: Yousef Saad Department of Computer Science and Engineeringsaad/PDF/SIAM_AN_12.pdf · Yousef Saad Department of Computer Science ... or patterns from data. Broad area : data analysis,

K-nearest neighbors (KNN) classification

ä Idea of a voting system: getdistances between test sampleand training samples

ä Get the k nearest neighbors(here k = 8)

ä Predominant class amongthese k items is assigned to thetest sample (“∗” here)

?kk

k k

k

n

n

nn

n

nnn

k

t

t

t

tt

tt

t

t

t

k

Minneapolis, 07-13-2012 p. 14

Page 15: Yousef Saad Department of Computer Science and Engineeringsaad/PDF/SIAM_AN_12.pdf · Yousef Saad Department of Computer Science ... or patterns from data. Broad area : data analysis,

Fisher’s Linear Discriminant Analysis (LDA)

Goal: Use label information to define a good projector, i.e.,one that can ‘discriminate’ well between given classes

ä Define “between scatter”: a measure of how well separatedtwo distinct classes are.

ä Define “within scatter”: a measure of how well clustereditems of the same class are.

ä Objective: make “between scatter” measure large and “withinscatter” small.

Idea: Find projector that maximizes the ratio of the “betweenscatter” measure over “within scatter” measure

Minneapolis, 07-13-2012 p. 15

Page 16: Yousef Saad Department of Computer Science and Engineeringsaad/PDF/SIAM_AN_12.pdf · Yousef Saad Department of Computer Science ... or patterns from data. Broad area : data analysis,

Define:

SB =c∑

k=1

nk(µ(k) − µ)(µ(k) − µ)T ,

SW =c∑

k=1

∑xi ∈Xk

(xi − µ(k))(xi − µ(k))T

Where:

• µ = mean (X)• µ(k) = mean (Xk)• Xk = k-th class• nk = |Xk|

H

GLOBAL CENTROID

CLUSTER CENTROIDS

H

X3

1X

µ

X2

Minneapolis, 07-13-2012 p. 16

Page 17: Yousef Saad Department of Computer Science and Engineeringsaad/PDF/SIAM_AN_12.pdf · Yousef Saad Department of Computer Science ... or patterns from data. Broad area : data analysis,

ä Consider 2ndmoments for a vec-tor a:

aTSBa =c∑

i=1

nk |aT(µ(k) − µ)|2,

aTSWa =c∑

k=1

∑xi ∈ Xk

|aT(xi − µ(k))|2

ä aTSBa ≡ weighted variance of projected µj’s

ä aTSWa ≡ w. sum of variances of projected classes Xj’s

ä LDA projects the data so as to maxi-mize the ratio of these two numbers:

maxa

aTSBa

aTSWa

ä Optimal a = eigenvector associated with the largest eigen-value of: SBui = λiSWui .

Minneapolis, 07-13-2012 p. 17

Page 18: Yousef Saad Department of Computer Science and Engineeringsaad/PDF/SIAM_AN_12.pdf · Yousef Saad Department of Computer Science ... or patterns from data. Broad area : data analysis,

LDA – Extension to arbitrary dimensions

ä Criterion: maximize the ratioof two traces:

Tr [UTSBU ]

Tr [UTSWU ]

ä Constraint: UTU = I (orthogonal projector).

ä Reduced dimension data: Y = UTX.

Common viewpoint: hard to maximize, therefore ...

ä ... alternative: Solve insteadthe (‘easier’) problem:

maxUTSWU=I

Tr [UTSBU ]

ä Solution: largest eigenvectors of SBui = λiSWui .

Minneapolis, 07-13-2012 p. 18

Page 19: Yousef Saad Department of Computer Science and Engineeringsaad/PDF/SIAM_AN_12.pdf · Yousef Saad Department of Computer Science ... or patterns from data. Broad area : data analysis,

LDA – Extension to arbitrary dimensions (cont.)

ä Consider the originalproblem:

maxU ∈ Rn×p, UTU=I

Tr [UTAU ]

Tr [UTBU ]

Let A, B be symmetric & assume that B is semi-positivedefinite with rank(B) > n− p. Then Tr [UTAU ]/Tr [UTBU ]has a finite maximum value ρ∗. The maximum is reached for acertain U∗ that is unique up to unitary transforms of columns.

ä Considerthe function:

f(ρ) = maxV TV =I

Tr [V T(A− ρB)V ]

ä Call V (ρ) the maximizer.

ä Note: V (ρ) = Set of eigenvectors - not unique

Minneapolis, 07-13-2012 p. 19

Page 20: Yousef Saad Department of Computer Science and Engineeringsaad/PDF/SIAM_AN_12.pdf · Yousef Saad Department of Computer Science ... or patterns from data. Broad area : data analysis,

ä Define G(ρ) ≡ A− ρB and its n eigenvalues:

µ1(ρ) ≥ µ2(ρ) ≥ · · · ≥ µn(ρ) .

ä Clearly:

f(ρ) = µ1(ρ) + µ2(ρ) + · · ·+ µp(ρ) .

ä Can express this differently. Define eigenprojector:

P (ρ) = V (ρ)V (ρ)T

ä Then:f(ρ) = Tr [V (ρ)TG(ρ)V (ρ)]

= Tr [G(ρ)V (ρ)V (ρ)T ]

= Tr [G(ρ)P (ρ)].

Minneapolis, 07-13-2012 p. 20

Page 21: Yousef Saad Department of Computer Science and Engineeringsaad/PDF/SIAM_AN_12.pdf · Yousef Saad Department of Computer Science ... or patterns from data. Broad area : data analysis,

ä Recall [e.g.Kato ’65] that:

P (ρ) =−1

2πi

∫Γ

(G(ρ)− zI)−1 dz

Γ is a smooth curve containing the p eivenvalues of interest

ä Hence: f(ρ) =−1

2πiTr

∫Γ

G(ρ)(G(ρ)− zI)−1 dz = ...

=−1

2πiTr

∫Γ

z(G(ρ)− zI)−1 dz

ä With this, can prove :

1. f is a non-increasing function of ρ;2. f(ρ) = 0 iff ρ = ρ∗;3. f ′(ρ) = −Tr [V (ρ)TBV (ρ)]

Minneapolis, 07-13-2012 p. 21

Page 22: Yousef Saad Department of Computer Science and Engineeringsaad/PDF/SIAM_AN_12.pdf · Yousef Saad Department of Computer Science ... or patterns from data. Broad area : data analysis,

Can now use Newton’s method.

ρnew = ρ−Tr [V (ρ)T(A− ρB)V (ρ)]

−Tr [V (ρ)TBV (ρ)]=

Tr [V (ρ)TAV (ρ)]

Tr [V (ρ)TBV (ρ)]

ä Newton’s method to find the zero of f ≡ a fixed point

iteration with g(ρ) =Tr [V T(ρ)AV (ρ)]

Tr [V T(ρ)BV (ρ)],

ä Idea: Compute V (ρ) by a Lanczos-type procedure

ä Note: Standard problem - [not generalized]→ inexpensive!

ä See T. Ngo, M. Bellalij, and Y.S. 2010 for details

Minneapolis, 07-13-2012 p. 22

Page 23: Yousef Saad Department of Computer Science and Engineeringsaad/PDF/SIAM_AN_12.pdf · Yousef Saad Department of Computer Science ... or patterns from data. Broad area : data analysis,

ä Recent papers advocated similar or related techniques

[1] C. Shen, H. Li, and M. J. Brooks, A convex programmingapproach to the trace quotient problem. In ACCV (2) – 2007.

[2] H. Wang, S.C. Yan, D.Xu, X.O. Tang, and T. Huang. Traceratio vs. ratio trace for dimensionality reduction. In IEEEConference on Computer Vision and Pattern Recognition, 2007

[3] S. Yan and X. O. Tang, “Trace ratio revisited” Proceedings ofthe European Conference on Computer Vision, 2006.

...

Minneapolis, 07-13-2012 p. 23

Page 24: Yousef Saad Department of Computer Science and Engineeringsaad/PDF/SIAM_AN_12.pdf · Yousef Saad Department of Computer Science ... or patterns from data. Broad area : data analysis,

Graph-based methods

ä Start with a graph of data. e.g.: graphof k nearest neighbors (k-NN graph)Want: Perform a projection which pre-

serves the graph in some sense

ä Define a graph Laplacean:

L = D −W

x

xj

i

yi

yj

e.g.,: wij =

{1 if j ∈ Ni

0 else D = diag

dii =∑j 6=i

wij

with Ni = neighborhood of i (excluding i)

Minneapolis, 07-13-2012 p. 24

Page 25: Yousef Saad Department of Computer Science and Engineeringsaad/PDF/SIAM_AN_12.pdf · Yousef Saad Department of Computer Science ... or patterns from data. Broad area : data analysis,

Example: The Laplacean eigenmaps approach

Laplacean Eigenmaps [Belkin-Niyogi ’01] *minimizes*

F(Y ) =n∑

i,j=1

wij‖yi − yj‖2 subject to Y DY > = I

Motivation: if ‖xi − xj‖ is small(orig. data), we want ‖yi − yj‖ to bealso small (low-Dim. data)ä Original data used indirectlythrough its graphä Leads to n× n sparse eigenvalueproblem [In ‘sample’ space]

x

xj

i

yi

yj

Minneapolis, 07-13-2012 p. 25

Page 26: Yousef Saad Department of Computer Science and Engineeringsaad/PDF/SIAM_AN_12.pdf · Yousef Saad Department of Computer Science ... or patterns from data. Broad area : data analysis,

Locally Linear Embedding (Roweis-Saul-00)

ä Very similar to Eigenmaps.

ä Graph Laplacean matrix is replaced by an ‘affinity’ graph

Graph: Each xi written as a convexcombination of its k nearest neighbors:xi ≈ Σwijxj,

∑j∈Ni

wij = 1ä Optimal weights computed(’local calculation’) by minimizing‖xi − Σwijxj‖ for i = 1, · · · , n

x

xj

i

ä Mapped data (Y ) computed by minimizing∑‖yi − Σwijyj‖2

Minneapolis, 07-13-2012 p. 26

Page 27: Yousef Saad Department of Computer Science and Engineeringsaad/PDF/SIAM_AN_12.pdf · Yousef Saad Department of Computer Science ... or patterns from data. Broad area : data analysis,

ONPP (Kokiopoulou and YS ’05)

ä Orthogonal Neighborhood Preserving Projections

ä A linear (orthogonoal) version of LLE obtained by writing Yin the form Y = V >X

ä Same graph as LLE. Objective: preserve the affinity graph(as in LEE) *but* with the constraint Y = V >X

ä Problem solved to obtain mapping:

minV

Tr[V >X(I −W>)(I −W )X>V

]s.t. V TV = I

ä In LLE replace V >X by Y

Minneapolis, 07-13-2012 p. 27

Page 28: Yousef Saad Department of Computer Science and Engineeringsaad/PDF/SIAM_AN_12.pdf · Yousef Saad Department of Computer Science ... or patterns from data. Broad area : data analysis,

Face Recognition – background

Problem: We are given a database of images: [arrays of pixelvalues]. And a test (new) image.

↖ ↑ ↗

Question: Does this new image correspond to one of thosein the database?

Minneapolis, 07-13-2012 p. 28

Page 29: Yousef Saad Department of Computer Science and Engineeringsaad/PDF/SIAM_AN_12.pdf · Yousef Saad Department of Computer Science ... or patterns from data. Broad area : data analysis,

Example: Eigenfaces [Turk-Pentland, ’91]

ä Idea identical with the one we saw for digits:

– Consider each picture as a (1-D) column of all pixels– Put together into an array A of size # pixels×# images.

. . . =:. . .

︸ ︷︷ ︸A

– Do an SVD of A and perform comparison with any test imagein low-dim. space

– Similar to LSI in spirit – but data is not sparse.

Minneapolis, 07-13-2012 p. 29

Page 30: Yousef Saad Department of Computer Science and Engineeringsaad/PDF/SIAM_AN_12.pdf · Yousef Saad Department of Computer Science ... or patterns from data. Broad area : data analysis,

Graph-based methods in a supervised setting

Graph-based methods can be adapted to supervised mode.Idea: Build G so that nodes in the same class are neighbors.If c = # classes, G consists of c cliques.

ä Weight matrix W =block-diagonalä Note: rank(W ) = n− c.ä As before, graph Laplacean:

Lc = D −W

W =

W1

W2. . .

Wc

ä Can be used for ONPP and other graph based methods

ä Improvement: add repulsion Laplacean [Kokiopoulou, YS09]

Minneapolis, 07-13-2012 p. 30

Page 31: Yousef Saad Department of Computer Science and Engineeringsaad/PDF/SIAM_AN_12.pdf · Yousef Saad Department of Computer Science ... or patterns from data. Broad area : data analysis,

Class 1 Class 2

Class 3

Leads to eigenvalue problemwith matrix:

Lc − ρLR

• Lc = class-Laplacean,• LR = repulsion Laplacean,• ρ = parameter

Test: ORL 40 subjects, 10 sample images each – example:

# of pixels : 112× 92; TOT. # images : 400

Minneapolis, 07-13-2012 p. 31

Page 32: Yousef Saad Department of Computer Science and Engineeringsaad/PDF/SIAM_AN_12.pdf · Yousef Saad Department of Computer Science ... or patterns from data. Broad area : data analysis,

10 20 30 40 50 60 70 80 90 1000.8

0.82

0.84

0.86

0.88

0.9

0.92

0.94

0.96

0.98ORL −− TrainPer−Class=5

onpppcaolpp−Rlaplacefisheronpp−Rolpp

ä Remarkable: there are values of ρ which give better resultsthan using the optimum ρ obtained from maximizing trace ratio

Minneapolis, 07-13-2012 p. 32

Page 33: Yousef Saad Department of Computer Science and Engineeringsaad/PDF/SIAM_AN_12.pdf · Yousef Saad Department of Computer Science ... or patterns from data. Broad area : data analysis,

Data mining for materials: Materials Informatics

*Collabor.: J. Chelikowsky (UT Austin), & Da Gao (U. of M)

ä Huge potential in exploiting two trends:

1 Improvements in efficiency and capabilities in computa-tional methods for materials

2 Recent progress in data mining techniques

ä Current practice: “One student, one alloy, one PhD” [seespecial MRS issue on materials informatics]→ Slow ..

ä Data Mining: can help speed-up process, e.g., by exploringin smarter ways

Issue 1: Who will do the work? Few researchers are familiarwith both worlds

Minneapolis, 07-13-2012 p. 33

Page 34: Yousef Saad Department of Computer Science and Engineeringsaad/PDF/SIAM_AN_12.pdf · Yousef Saad Department of Computer Science ... or patterns from data. Broad area : data analysis,

Issue 2: databases, and more generally sharing, not too com-mon in materials

The inherently fragmented and multidisciplinary nature ofthe materials community poses barriers to establishingthe required networks for sharing results and information.One of the largest challenges will be encouraging scien-tists to think of themselves not as individual researchersbut as part of a powerful network collectively analyzingand using data generated by the larger community.These barriers must be overcome.

NSTC report to the White House, June 2011.

ä Materials genome initiative [NSF]

Minneapolis, 07-13-2012 p. 34

Page 35: Yousef Saad Department of Computer Science and Engineeringsaad/PDF/SIAM_AN_12.pdf · Yousef Saad Department of Computer Science ... or patterns from data. Broad area : data analysis,

Unsupervised learning

ä 1970s: Data Mining “byhand”: Find coordinatesthat will cluster materialsaccording to structureä 2-D projection fromphysical knowledgeä ‘Anomaly Detection’:helped find that compoundCu F does not exist

see: J. R. Chelikowsky, J. C. Phillips, Phys Rev. B 19 (1978).

Minneapolis, 07-13-2012 p. 35

Page 36: Yousef Saad Department of Computer Science and Engineeringsaad/PDF/SIAM_AN_12.pdf · Yousef Saad Department of Computer Science ... or patterns from data. Broad area : data analysis,

Question: Can modern data mining achieve a similar dia-grammatic separation of structures?

ä Should use only information from the two constituent atoms

ä Experiment: 67 binary ‘octets’.

ä Use PCA – exploit only data from 2 constituent atoms:

1. Number of valence electrons;

2. Ionization energies of the s-states of the ion core;

3. Ionization energies of the p-states of the ion core;

4. Radii for the s-states as determined from model potentials;

5. Radii for the p-states as determined from model potentials.

Minneapolis, 07-13-2012 p. 36

Page 37: Yousef Saad Department of Computer Science and Engineeringsaad/PDF/SIAM_AN_12.pdf · Yousef Saad Department of Computer Science ... or patterns from data. Broad area : data analysis,

ä Result:

−2 −1 0 1 2 3−2

−1.5

−1

−0.5

0

0.5

1

1.5

2

BN

SiC

MgS

ZnO

MgSe

CuCl

MgTe

CuBr

CuICdTe

AgI

ZWDRZWWR

Minneapolis, 07-13-2012 p. 37

Page 38: Yousef Saad Department of Computer Science and Engineeringsaad/PDF/SIAM_AN_12.pdf · Yousef Saad Department of Computer Science ... or patterns from data. Broad area : data analysis,

Supervised learning: classification

ä Problem: classify an unknown binary compound into itscrystal structure classä 55 compounds, 6 crystal structure classesä “leave-one-out” experiment

Case 1: Use features 1:5 for atom A and 2:5 for atom B. Noscaling is applied.

Case 2: Features 2:5 from each atom + scale features 2 to 4by square root of # valence electrons (feature 1)

Case 3: Features 1:5 for atom A and 2:5 for atom B. Scalefeatures 2 and 3 by square root of # valence electrons.

Minneapolis, 07-13-2012 p. 38

Page 39: Yousef Saad Department of Computer Science and Engineeringsaad/PDF/SIAM_AN_12.pdf · Yousef Saad Department of Computer Science ... or patterns from data. Broad area : data analysis,

Three methods tested

1. PCA classification. Project and do identification in space ofreduced dimension (Euclidean distance in low-dim space).

2. KNN K-nearest neighbor classification –

3. Orthogonal Neighborhood Preserving Projection (ONPP) - agraph based method - [see Kokiopoulou, YS, 2005]

Recognition rates for 3different methods usingdifferent features

Case KNN ONPP PCACase 1 0.909 0.945 0.945Case 2 0.945 0.945 1.000Case 3 0.945 0.945 0.982

Minneapolis, 07-13-2012 p. 39

Page 40: Yousef Saad Department of Computer Science and Engineeringsaad/PDF/SIAM_AN_12.pdf · Yousef Saad Department of Computer Science ... or patterns from data. Broad area : data analysis,

A few experiments with a larger database

ä 529 AxBy compounds - Extract ‘valid’ entries

ä Source: Semiconductors, Data Handbook Otfried Madelung,Springer; 3rd edition (January 22, 2004)

ä Band-gap information available -

ä Experiments : [unsupervised learning]

• 2-D projection using SPDF blocks

• 2-D projection showing Insulator - Semi-conductor property

Minneapolis, 07-13-2012 p. 40

Page 41: Yousef Saad Department of Computer Science and Engineeringsaad/PDF/SIAM_AN_12.pdf · Yousef Saad Department of Computer Science ... or patterns from data. Broad area : data analysis,

S

F

D

P

Minneapolis, 07-13-2012 p. 41

Page 42: Yousef Saad Department of Computer Science and Engineeringsaad/PDF/SIAM_AN_12.pdf · Yousef Saad Department of Computer Science ... or patterns from data. Broad area : data analysis,

Results for SDP 2-D visualization

−15 −10 −5 0 5 10

−10

−5

0

5

PCA projection, SDP blocks ; 526compounds

SPDPPP

−10 −5 0 5 10 15

−6

−4

−2

0

2

4

6

8

10

12

ONPP projection (k=20), SDP blocks ; 526compounds

SP

DP

PP

Features used: val. elec.; electronegativ.; covalent rad.; den-sity; ionic rad.; chemical scale; + band-gap.

Minneapolis, 07-13-2012 p. 42

Page 43: Yousef Saad Department of Computer Science and Engineeringsaad/PDF/SIAM_AN_12.pdf · Yousef Saad Department of Computer Science ... or patterns from data. Broad area : data analysis,

Results for 2-D Semiconductor/Insulator visualization

ä Had to consider AB only compounds [119 of them]

−20 −15 −10 −5 0 5 10

−6

−4

−2

0

2

4

6

8

10

12

PCA projection, S/I labeling ; 119compounds

S

I

−15 −10 −5 0 5 10

−8

−6

−4

−2

0

2

4

6

8

ONPP projection (k=20), S/I labeling ; 119compounds

SI

features used: val. electrons; electro-negativity; cov. rad.;density; ionic rad.;

Minneapolis, 07-13-2012 p. 43

Page 44: Yousef Saad Department of Computer Science and Engineeringsaad/PDF/SIAM_AN_12.pdf · Yousef Saad Department of Computer Science ... or patterns from data. Broad area : data analysis,

Conclusion

ä Many, interesting new matrix problems in areas that involvethe effective mining of data

ä Among the most pressing issues is that of reducing compu-tational cost - [SVD, SDP, ... too costly]

ä Many online resources available in computer science/ ap-plied math...

ä .. but there is an inertia to overcome. In particular ...

ä ... data-sharing is not as widespread in materials science

ä Plus side: Materials informatics will likely be energized bythe materials genome project.

Minneapolis, 07-13-2012 p. 44

Page 45: Yousef Saad Department of Computer Science and Engineeringsaad/PDF/SIAM_AN_12.pdf · Yousef Saad Department of Computer Science ... or patterns from data. Broad area : data analysis,

When one door closes, another opens; but we often look solong and so regretfully upon the closed door that we do notsee the one which has opened for us.

Alexander Graham Bell (1847-1922)

Thank you !

ä Visit my web-site at www.cs.umn.edu/˜saad

Minneapolis, 07-13-2012 p. 45