11-A_PCA

7/28/2019 11-A_PCA


Principal Component Analysis

Factor analysis dates back to the 1930s.

It was originally used in psychology to study intelligence. Attempts were made to relate test results to other factors.

The premise was that X = S L + E, where

X = test performance

L = intrinsic intelligence factors

S = individual scores

E = residual error

Principal Component Analysis

Using an eigenvector rotation, it would be possible to decompose the X matrix into a series of loadings and scores.

Underlying or intrinsic factors related to intelligence could then be detected.

In chemistry, this approach can be used by diagonalizing the correlation or covariance matrix - Principal Component Analysis.

Principal Component Analysis

PCA is typically conducted using the covariance matrix from autoscaled data.

It is then diagonalized - eigenvector rotation.

Typically, the eigenvectors with the largest eigenvalues are the most important.
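The steps above (autoscale, form the covariance matrix, diagonalize, rank by eigenvalue) can be sketched with NumPy. This is a minimal illustration on synthetic data, not the procedure of any particular package; the function name `pca_eig` and the data are assumptions made for the example.

```python
import numpy as np

def pca_eig(X):
    """PCA by diagonalizing the covariance matrix of autoscaled data."""
    # Autoscale: subtract each column mean, divide by each column standard deviation.
    Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
    # Covariance of autoscaled data (this equals the correlation matrix of X).
    C = np.cov(Z, rowvar=False)
    # eigh is meant for symmetric matrices; eigenvalues come back in ascending order.
    eigvals, eigvecs = np.linalg.eigh(C)
    order = np.argsort(eigvals)[::-1]            # largest eigenvalue first
    eigvals, loadings = eigvals[order], eigvecs[:, order]
    scores = Z @ loadings                        # project the samples into PC space
    return eigvals, loadings, scores

# Synthetic data (hypothetical): 50 samples, 4 variables, two of them correlated.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 4))
X[:, 1] = 2.0 * X[:, 0] + 0.1 * rng.normal(size=50)

eigvals, loadings, scores = pca_eig(X)
percent_variance = 100 * eigvals / eigvals.sum()
print(np.round(percent_variance, 1))
```

Because the data are autoscaled, the eigenvalues sum to the number of variables, and the variance of each score column equals its eigenvalue.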

Covariance and Correlation

Covariance

A measure of the association of two variables.

The sum of cross products between two variables as deviations from their respective means, divided by n - 1.

Correlation

The covariance between two z-transformed variables (autoscaled).
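A small numerical check of the two definitions, using made-up values for x and y. The helper names `covariance` and `autoscale` are illustrative only; the sample (n - 1) convention is assumed throughout.

```python
import numpy as np

def covariance(x, y):
    """Sum of cross products of deviations from the means, divided by n - 1."""
    dx, dy = x - x.mean(), y - y.mean()
    return np.sum(dx * dy) / (len(x) - 1)

def autoscale(x):
    """z-transform: zero mean, unit standard deviation."""
    return (x - x.mean()) / x.std(ddof=1)

# Hypothetical measurements of two variables
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

cov_xy = covariance(x, y)
# Correlation is the covariance of the autoscaled variables.
corr_xy = covariance(autoscale(x), autoscale(y))
print(cov_xy, corr_xy)
```

Both values agree with NumPy's own `np.cov` and `np.corrcoef` for the same inputs.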

Principal Component Analysis

Approaches fall into two categories:

Complete diagonalization of the matrix.

Approximation methods that extract one component at a time.

In the end, the results are the same. The data is decomposed into a set of loadings, scores and a residual.

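$$X = t_1 p_1^T + t_2 p_2^T + \dots + t_a p_a^T + E$$

where X and E are n × m matrices, each score vector t_i has n elements and each loading vector p_i has m elements.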
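One widely used method that extracts one component at a time is NIPALS. The slides do not name a specific algorithm, so the sketch below is an assumption chosen for illustration. It compares the result with a complete decomposition (SVD) of the same mean-centered synthetic data.

```python
import numpy as np

def nipals(X, n_components, tol=1e-10, max_iter=1000):
    """Extract principal components one at a time (NIPALS)."""
    E = X - X.mean(axis=0)                  # mean-center; E holds the residual
    n, m = E.shape
    T = np.zeros((n, n_components))         # scores, one column per component
    P = np.zeros((m, n_components))         # loadings, one column per component
    for a in range(n_components):
        # Starting guess for the scores: the column with the largest variance.
        t = E[:, np.argmax(E.var(axis=0))].copy()
        for _ in range(max_iter):
            p = E.T @ t / (t @ t)           # regress the columns of E on t
            p /= np.linalg.norm(p)          # normalize the loading vector
            t_new = E @ p                   # updated scores
            done = np.linalg.norm(t_new - t) <= tol * np.linalg.norm(t_new)
            t = t_new
            if done:
                break
        T[:, a], P[:, a] = t, p
        E = E - np.outer(t, p)              # remove this component from the residual
    return T, P, E

# Synthetic data (hypothetical): 20 samples, 5 variables on different scales.
rng = np.random.default_rng(0)
X = rng.normal(size=(20, 5)) * np.array([5.0, 3.0, 1.0, 0.5, 0.2])

T, P, E = nipals(X, n_components=2)
Xc = X - X.mean(axis=0)
singular_values = np.linalg.svd(Xc, compute_uv=False)
print(np.linalg.norm(T, axis=0), singular_values[:2])
```

The scores, loadings and residual reproduce the centered data exactly (X = T Pᵀ + E), and the length of each score vector matches the corresponding singular value from the complete decomposition.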


[Figure: eigenvectors EV1 and EV2 drawn in the space of Variable 1, Variable 2 and Variable 3]

Principal component (PC).

A linear combination of related variables. It represents an intrinsic factor of your data.

Scores.

The projection of your data into PC space.

Loadings.

Show the relative significance of the original variables.

Residual.

The data that could not be correlated - typically random noise.

Varimax rotation

A secondary tweaking of the PCs to help better observe relationships.

It is essentially a secondary rotation of your data in an attempt to lump all variance from individual variables into single components.

It can often help you to better understand the effects of your original data.
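A sketch of a varimax rotation, using a common SVD-based iteration. This is one standard way to compute it and is not necessarily what any given package does. The loading matrix is invented for the example: 5 variables and 2 components with a clean structure that has been deliberately rotated away.

```python
import numpy as np

def varimax(L, gamma=1.0, max_iter=500, tol=1e-8):
    """Rotate loadings L (variables x components) toward simple structure."""
    p, k = L.shape
    R = np.eye(k)                       # accumulated orthogonal rotation
    d = 0.0
    for _ in range(max_iter):
        LR = L @ R
        col_ss = np.sum(LR**2, axis=0)  # sum of squared loadings per component
        U, s, Vt = np.linalg.svd(L.T @ (LR**3 - (gamma / p) * LR * col_ss))
        R = U @ Vt
        d_new = s.sum()
        if d_new < d * (1 + tol):       # stop once the objective no longer grows
            break
        d = d_new
    return L @ R, R

def varimax_criterion(L):
    """Sum over components of the variance of the squared loadings."""
    return np.sum(np.var(L**2, axis=0))

# Hypothetical loadings: each variable belongs to one component ...
simple = np.array([[0.9, 0.0], [0.8, 0.0], [0.7, 0.0], [0.0, 0.9], [0.0, 0.8]])
# ... then rotated by 0.5 rad so that every variable loads on both components.
c, s_ = np.cos(0.5), np.sin(0.5)
mixed = simple @ np.array([[c, s_], [-s_, c]])

rotated, R = varimax(mixed)
print(np.round(rotated, 3))
```

After rotation each variable loads almost entirely on a single component, while the total variance explained by each variable (its row sum of squares) is unchanged.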

Varimax rotation

[Figure: % variance of original variables 1 to 5 in PC1, PC2 and PC3]

Assume we have 5 original variables and are only interested in the first 3 PCs.

Varimax rotation

After varimax rotation, it might look like:

[Figure: % variance of original variables 1 to 5 in PC1, PC2 and PC3 after rotation]

The significance of each variable is easier to see.

Using PCA results

The best way to appreciate PCA is to look at a series of examples.

We'll attempt to show what types of information can be obtained and how it can be used.

Examples

Classification of artifacts

Classification of whiskey

Noise reduction of 3D data

PCA of archaeological artifacts

The information presented in this example is from:

Kowalski, Schatzki and Stross, Anal. Chem. 44, 2176 (1972).

A complete evaluation of the data is also presented in Chemometrics by Sharaf, Illman and Kowalski, John Wiley & Sons, 1986.



Summary of study.

Native American artifacts made of obsidian glass were obtained from 5 sites in northern California. Obsidian samples from 4 quarry sites were obtained in the same area.

X-ray fluorescence analysis for ten elements (Fe, Ti, Ba, Ca, K, Mn, Rb, Sr, Y and Zr) was conducted on all 75 samples.

Questions posed.

Can the different sources of obsidian be differentiated based on the chemical measurements made?

Can something be said regarding the sources of the artifacts and the migration and trading patterns of the Indians?


We will start by assigning classes to each type of sample - to be used in the labeling of the various plots.

1 - 4 Quarry samples

5 - 7 Artifacts from Indian sites

Both unscaled and autoscaled data will be evaluated using XLStat.

Archaeological artifacts - Data scaling

Archaeological artifacts - Eigenvalues

[Figure: eigenvalues (0 to 180000) and cumulative variability (%) for factors F1 to F10, unscaled data]

The bulk of the variance is in the first principal component.


We can now produce displays of our components.

A plot of the scores for PC1 vs. PC2 will result in about 98% of the original information being displayed.

A loadings plot of L1 vs. L2 will show the importance of the original variables in the construction of PC1 and PC2.

PCA of archaeological artifacts - PC1 vs. PC2

[Figure: biplot for the unscaled data, F1 (81.68 %) vs. F2 (16.14 %); samples are labeled by class (1 to 7) and loadings are shown for Zr, Y, Sr, Rb, Mn, K, Ca, Ba, Ti and Fe]

Quarry samples from site 1 tend to form an individual group.

Quarry site 3 and artifact site 7 appear to be related.


PCA of archaeological artifacts L1 vs. L2

The loadings show that Ca and Fe both have an effect on PC1.

The other variables have a smaller effect.

[Figure: plot of Zr, Y, Sr, Rb, Mn, K, Ca, Ba, Ti and Fe on axes running from -1 to 1]

Correlation plot

XLStat will also produce a plot that indicates the correlation between the original variables and the factors.

Here, it indicates that Y has little effect and the others have an impact on both PCs.

Archaeological artifacts - Autoscaling


[Figure: eigenvalues (0 to 6) and cumulative variability (%) for factors F1 to F10, autoscaled data]

Now that each variable is given equal weight, variance is no longer all smashed into the 1st component.
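The effect of autoscaling on the eigenvalue distribution can be reproduced on synthetic data. The sketch below does not use the obsidian measurements; it assumes three independent, made-up variables whose scales differ by orders of magnitude, and the function name `percent_variance` is illustrative.

```python
import numpy as np

def percent_variance(X, autoscale):
    """Percent of total variance carried by each PC, largest first."""
    Xc = X - X.mean(axis=0)
    if autoscale:
        Xc = Xc / X.std(axis=0, ddof=1)      # give every variable unit variance
    eigvals = np.linalg.eigvalsh(np.cov(Xc, rowvar=False))[::-1]
    return 100 * eigvals / eigvals.sum()

# Hypothetical data: 75 samples, 3 independent variables on very different scales.
rng = np.random.default_rng(1)
X = rng.normal(size=(75, 3)) * np.array([1000.0, 10.0, 1.0])

unscaled = percent_variance(X, autoscale=False)
scaled = percent_variance(X, autoscale=True)
print("unscaled  :", np.round(unscaled, 2), "cumulative:", np.round(np.cumsum(unscaled), 2))
print("autoscaled:", np.round(scaled, 2), "cumulative:", np.round(np.cumsum(scaled), 2))
```

Without scaling, the variable with the largest numerical range dominates the first component. After autoscaling, the variance is spread across the components.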

[Figure: biplot showing scores and loadings for the autoscaled data, F1 (52.52 %) vs. F2 (20.78 %); samples are labeled by class (1 to 7) and loadings are shown for Zr, Y, Sr, Rb, Mn, K, Ca, Ba, Ti and Fe]


So what does it all mean?

In this case, both scaled and unscaled results indicate a grouping of related samples - XRF results can be used to classify related samples.

Samples from different quarries (1-4) can easily be distinguished, although samples from site 2 are pretty scattered.

Can we tell anything about the artifacts?


No artifacts appear to have come from quarry site 4.

It's well resolved from the other samples.

[Figure: biplot of scores and loadings for the autoscaled data, F1 (52.52 %) vs. F2 (20.78 %)]

Archaeological artifacts - Results

[Figure: the same biplot, F1 (52.52 %) vs. F2 (20.78 %), with annotations]

Artifacts from site … appear to come from quarry 2, although the results are scattered.

Artifacts from site … are from quarry …

Site 5 artifacts appear to come from all over the place. It is possible that this was a nomadic tribe.

Using the loadings

The loadings indicate that many of our variables are closely related.

In addition, Y appears to have little effect on our results.

We can reprocess our data after eliminating Y and some of our correlated variables.

This might improve our results.

At a minimum it will make subsequent studies easier - less data to collect.
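One simple way to shortlist redundant variables is to scan the correlation matrix for strongly correlated pairs. The slides pick variables from the loadings plot instead, so treat this as an alternative sketch. The 0.9 threshold, the variable names and the data are all assumptions made for the example.

```python
import numpy as np

def correlated_pairs(X, names, threshold=0.9):
    """List pairs of variables whose absolute correlation is at least threshold."""
    R = np.corrcoef(X, rowvar=False)
    pairs = []
    for i in range(len(names)):
        for j in range(i + 1, len(names)):
            if abs(R[i, j]) >= threshold:
                pairs.append((names[i], names[j], R[i, j]))
    return pairs

# Hypothetical data: v2 is nearly a copy of v1; v3 and v4 are independent.
rng = np.random.default_rng(2)
v1 = rng.normal(size=75)
X = np.column_stack([
    v1,
    v1 + 0.05 * rng.normal(size=75),
    rng.normal(size=75),
    rng.normal(size=75),
])
names = ["v1", "v2", "v3", "v4"]

pairs = correlated_pairs(X, names)
for a, b, r in pairs:
    print(f"{a} and {b} are closely related (r = {r:.3f}); one of them may be enough")
```

Dropping one member of each flagged pair and rerunning the PCA shows whether the reduced variable set still gives the same grouping.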

Using the loadings

[Figure: plot of the ten variables on axes running from -1 to 1, annotated "Use K", "Use Ca", "Use Fe", "Use" and "Remove"]

Modified Study

We now have only 4 variables.

Is it enough to still give the same characterization as with the original 10?

If this works, we'll be able to save quite a bit of time and money on subsequent assays.

Also, does it improve our results?

[Figure: biplot for the 4-variable data, F1 (60.47 %) vs. F2 (27.66 %); samples are labeled by class (1 to 7), loadings are labeled Zr, K and Ca]

Our results are almost identical to our 10 variable work.


Classification of whiskey

Another example - work conducted in our laboratory.

One study involved the characterization of whiskey based on GC/MS traces.

This example shows what you might need to do in order to make your data suitable for PCA evaluations.

Representative whiskeys

Methylene chloride extracts for a series of whiskeys were assayed using a GC/MS.

Variables needed to be constructed from these traces.

Data preprocessing

Each chromatogram consisted of approximately 1800 points. This would be too much for many systems to handle.

Variables were constructed by summing response at 1 min intervals, resulting in 30 variables.
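The binning step can be sketched as follows. It assumes exactly 1800 evenly spaced points over 30 minutes (60 points per minute), which is an idealization of the "approximately 1800 points" quoted above. The trace here is a stand-in, not real GC/MS data.

```python
import numpy as np

def bin_trace(trace, n_bins):
    """Sum a trace into n_bins equal-width intervals."""
    trace = np.asarray(trace, dtype=float)
    if trace.size % n_bins != 0:
        raise ValueError("trace length must be a multiple of n_bins")
    # Each row of the reshaped array holds the points of one interval.
    return trace.reshape(n_bins, -1).sum(axis=1)

# Stand-in trace (hypothetical): 1800 points, 60 per minute for 30 minutes.
trace = np.arange(1800.0)
variables = bin_trace(trace, n_bins=30)
print(variables.shape)
```

Each chromatogram becomes one row of 30 variables, so a set of samples forms a samples-by-30 data matrix for PCA.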

Data preprocessing

To improve variable stability:

The smallest response value was treated as a baseline for background correction.

An internal standard was used to normalize detector response.

The internal standard was also used to account for small time variations.
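A minimal sketch of the first two corrections. It assumes they are applied to the binned variables and that the internal standard falls in a known bin; neither detail is given in the slides. The function name `correct_response` and the parameter `istd_index` are hypothetical, and the time alignment step is not shown.

```python
import numpy as np

def correct_response(variables, istd_index):
    """Baseline-correct a set of variables and normalize to the internal standard."""
    v = np.asarray(variables, dtype=float)
    v = v - v.min()                  # smallest response is taken as the baseline
    istd = v[istd_index]             # response of the internal standard (assumed bin)
    if istd <= 0:
        raise ValueError("internal standard response must be above the baseline")
    return v / istd                  # express every variable relative to the standard

# Made-up responses; the internal standard is assumed to sit in position 1.
corrected = correct_response([5.0, 15.0, 25.0], istd_index=1)
print(corrected)
```

After this step the internal standard variable is 1 for every sample, so run-to-run changes in detector response cancel out.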

Data preprocessing

All data was autoscaled prior to PCA.

Questions asked

Could whiskies be classified?

Could the approach be used to detect

sample dilution?

blending?

contamination?

Initial PCA analysis

S - Scotch

B - Bourbon

C - Canadian

L - Blended

T - Tennessee


Blending of one scotch into another

[Figure: PC1 vs. PC2 score plot showing brand X and brand Y samples, with blends labeled 25Y, 50Y and 75Y (% of brand Y in the blend)]
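If blending is assumed to mix the variable profiles linearly, the scores of a blend fall on the straight line joining the two pure brands, because projecting into PC space is a linear operation. The sketch below illustrates this with invented 30-variable profiles for brands X and Y. It does not use the study's data, and every name in it is hypothetical.

```python
import numpy as np

rng = np.random.default_rng(3)
n_vars = 30

# Hypothetical mean profiles for two brands, plus 10 noisy replicates of each.
profile_x = rng.uniform(1, 10, n_vars)
profile_y = rng.uniform(1, 10, n_vars)
train = np.vstack([
    profile_x + 0.1 * rng.normal(size=(10, n_vars)),
    profile_y + 0.1 * rng.normal(size=(10, n_vars)),
])

# Autoscale the training set and take the first two loading vectors.
mean, std = train.mean(axis=0), train.std(axis=0, ddof=1)
Z = (train - mean) / std
_, _, Vt = np.linalg.svd(Z, full_matrices=False)
P = Vt[:2].T                                   # loadings for PC1 and PC2

def project(sample):
    """Scores of a new sample, using the training mean, scale and loadings."""
    return ((sample - mean) / std) @ P

# Blends containing 0, 25, 50, 75 and 100 % of brand Y.
fractions = [0.0, 0.25, 0.5, 0.75, 1.0]
blend_scores = np.array(
    [project(f * profile_y + (1 - f) * profile_x) for f in fractions]
)
print(np.round(blend_scores, 2))
```

In this idealized case the blend scores are evenly spaced along the line between the pure X and pure Y positions. Real samples would scatter around that line.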

Contamination

Dilution

[Figure: PC1 vs. PC2 score plot showing brand X samples and diluted samples labeled 0, 20, 40, 60 and 80 (% by V, whiskey)]
