7/28/2019 11-A_PCA
1/7
Principal Component Analysis
Factor analysis dates back to the 1930s.
It was originally used in psychology to study intelligence. Attempts were made to relate test results to other factors.
The premise was that X = S L + E, where
X = test performance
L = intrinsic intelligence factors
S = individual scores
E = residual error
Principal Component Analysis
Using an eigenvector rotation, it would be possible to decompose the X matrix into a series of loadings and scores.
Underlying or intrinsic factors related to intelligence could then be detected.
In chemistry, this approach can be used by diagonalizing the correlation or covariance matrix: Principal Component Analysis.
Principal Component Analysis
PCA is typically conducted using the covariance matrix from autoscaled data.
It is then diagonalized (eigenvector rotation).
Typically, the eigenvectors with the largest eigenvalues are the most important.
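The steps above (autoscale, form the covariance matrix, diagonalize, rank by eigenvalue) can be sketched as follows. This is a minimal illustration using NumPy, not the procedure of any particular package; the function name `pca_eig` and the random demo data are assumptions made for the example.

```python
import numpy as np

def pca_eig(X):
    """PCA by diagonalizing the covariance matrix of autoscaled data."""
    X = np.asarray(X, dtype=float)
    # Autoscale: mean-center each column and divide by its standard deviation.
    Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
    # Covariance of autoscaled data (this equals the correlation matrix of X).
    C = np.cov(Z, rowvar=False)
    # eigh handles symmetric matrices and returns eigenvalues in ascending order.
    eigvals, eigvecs = np.linalg.eigh(C)
    order = np.argsort(eigvals)[::-1]          # largest eigenvalue first
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    scores = Z @ eigvecs                       # projection of the data into PC space
    return eigvals, eigvecs, scores

# Hypothetical demo data: 50 samples, 4 variables, two of them strongly related.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(50, 4))
X_demo[:, 1] = 2.0 * X_demo[:, 0] + 0.1 * rng.normal(size=50)
eigvals, loadings, scores = pca_eig(X_demo)
```

With autoscaled data the eigenvalues sum to the number of variables, so each eigenvalue can be read directly as "how many variables' worth" of variance that component carries.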
Covariance and Correlation
Covariance
A measure of the association of two variables.
The sum of cross products between two variables as deviations from their respective means, divided by n - 1.
Correlation
The covariance between two z-transformed (autoscaled) variables.
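A small sketch of the two definitions, checked against NumPy's built-in routines. The helper names and the five data points are made up for illustration.

```python
import numpy as np

def covariance(x, y):
    """Sum of cross products of deviations from the means, divided by n - 1."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    return np.sum((x - x.mean()) * (y - y.mean())) / (len(x) - 1)

def autoscale(x):
    """z-transform: zero mean, unit standard deviation."""
    x = np.asarray(x, dtype=float)
    return (x - x.mean()) / x.std(ddof=1)

def correlation(x, y):
    """Correlation is the covariance of the autoscaled variables."""
    return covariance(autoscale(x), autoscale(y))

# Hypothetical data points.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.0, 9.8])
cov_xy = covariance(x, y)
corr_xy = correlation(x, y)
```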
Principal Component Analysis
Approaches fall into two categories:
Complete diagonalization of the matrix.
Approximation methods that extract one component at a time.
In the end, the results are the same. The data are decomposed into a set of loadings, scores and a residual.
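$$X_{n \times m} = t_1 p_1^{T} + t_2 p_2^{T} + \dots + t_a p_a^{T} + E_{n \times m}$$
where each score vector $t_i$ is $n \times 1$ and each loading vector $p_i$ is $m \times 1$.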
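To illustrate that the two categories give the same answer, the sketch below extracts components one at a time with NIPALS (one common method of this kind, chosen here as an assumption since the slides do not name one) and compares the loadings with a complete diagonalization. The demo data are random and hypothetical.

```python
import numpy as np

def nipals(X, n_components, max_iter=500, tol=1e-12):
    """Extract components one at a time. X is assumed to be centered already."""
    E = np.array(X, dtype=float)               # working residual
    n, m = E.shape
    T = np.zeros((n, n_components))            # scores
    P = np.zeros((m, n_components))            # loadings
    for a in range(n_components):
        # Start from the column of the residual with the largest variance.
        t = E[:, np.argmax(E.var(axis=0))].copy()
        for _ in range(max_iter):
            p = E.T @ t / (t @ t)
            p /= np.linalg.norm(p)
            t_new = E @ p
            converged = np.linalg.norm(t_new - t) <= tol * np.linalg.norm(t_new)
            t = t_new
            if converged:
                break
        T[:, a], P[:, a] = t, p
        E = E - np.outer(t, p)                 # remove this component (deflation)
    return T, P, E

# Hypothetical data with clearly separated variances, then centered.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(40, 4)) * np.array([5.0, 3.0, 1.5, 0.5])
X_demo -= X_demo.mean(axis=0)

T, P, E = nipals(X_demo, n_components=2)

# Complete diagonalization for comparison.
eigvals, eigvecs = np.linalg.eigh(X_demo.T @ X_demo)
P_full = eigvecs[:, np.argsort(eigvals)[::-1][:2]]
```

The loadings agree up to sign, and `T @ P.T + E` reproduces the data, which is the scores, loadings and residual decomposition in matrix form.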
[Figure: eigenvectors EV1 and EV2 drawn in the space of Variable 1, Variable 2 and Variable 3.]
Principal component (PC).
A linear combination of related variables. It represents an intrinsic factor of your data.
Scores.
The projection of your data into PC space.
Loadings.
Show the relative significance of the original variables.
Residual.
The data that could not be correlated, typically random noise.
Varimax rotation
A secondary tweaking of the PCs to help better observe relationships.
It is essentially a secondary rotation of your data in an attempt to lump all variance from individual variables into single components.
It can often help you to better understand the effects of your original variables.
Varimax rotation
Assume we have 5 original variables and are only interested in the first 3 PCs.
[Figure: % variance of original variables 1 to 5 carried by PC1, PC2 and PC3 before rotation.]
Varimax rotation
After varimax rotation, it might look like:
[Figure: % variance of original variables 1 to 5 carried by PC1, PC2 and PC3 after rotation.]
The significance of each variable is easier to see.
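A minimal sketch of a varimax rotation, using pairwise planar rotations with the closed-form angle from the classical derivation. It implements raw varimax (no Kaiser row normalization), which is an assumption; statistical packages may normalize first. The loading matrix below is invented to show the effect.

```python
import numpy as np

def varimax_criterion(L):
    """Sum over components of the variance of the squared loadings."""
    return float(np.sum(np.var(np.asarray(L) ** 2, axis=0)))

def varimax(loadings, max_sweeps=50, tol=1e-10):
    """Raw varimax by rotating one pair of components at a time."""
    L = np.array(loadings, dtype=float)
    p, k = L.shape
    for _ in range(max_sweeps):
        largest_angle = 0.0
        for i in range(k - 1):
            for j in range(i + 1, k):
                x, y = L[:, i].copy(), L[:, j].copy()
                u = x ** 2 - y ** 2
                v = 2.0 * x * y
                A, B = u.sum(), v.sum()
                C = np.sum(u ** 2 - v ** 2)
                D = 2.0 * np.sum(u * v)
                # Angle that maximizes the criterion for this pair.
                phi = 0.25 * np.arctan2(D - 2.0 * A * B / p,
                                        C - (A ** 2 - B ** 2) / p)
                c, s = np.cos(phi), np.sin(phi)
                L[:, i] = c * x + s * y
                L[:, j] = -s * x + c * y
                largest_angle = max(largest_angle, abs(phi))
        if largest_angle < tol:                # no pair needed further rotation
            break
    return L

# Hypothetical loadings with a simple structure, then mixed by a 30 degree rotation.
simple = np.array([[0.9, 0.0], [0.8, 0.0], [0.7, 0.1],
                   [0.0, 0.9], [0.1, 0.8], [0.0, 0.7]])
angle = np.deg2rad(30.0)
mix = np.array([[np.cos(angle), np.sin(angle)],
                [-np.sin(angle), np.cos(angle)]])
mixed = simple @ mix
rotated = varimax(mixed)
```

The rotation leaves each variable's total contribution (the row sum of squared loadings) unchanged; it only redistributes that contribution so each variable loads mainly on one component.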
Using PCA results
The best way to appreciate PCA is to look at a series of examples.
We'll attempt to show what types of information can be obtained and how it can be used.
Examples
Classification of artifacts
Classification of whiskey
Noise reduction of 3D data
PCA of archaeological artifacts
The information presented in this example is from:
Kowalski, Schatzki and Stross, Anal. Chem. 44, 2176 (1972).
A complete evaluation of the data is also presented in Chemometrics by Sharaf, Illman and Kowalski, John Wiley & Sons, 1986.
PCA of archaeological artifacts
Summary of study.
Native American artifacts made of obsidian glass were obtained from 5 sites in northern California. Obsidian samples from 4 quarry sites in the same area were also obtained.
X-ray fluorescence analysis for ten elements (Fe, Ti, Ba, Ca, K, Mn, Rb, Sr, Y and Zr) was conducted on all 75 samples.
Questions posed.
Can the different sources of obsidian be differentiated based on the chemical measurements made?
Can something be said regarding the sources of the artifacts and the migration and trading patterns of the Indians?
PCA of archaeological artifacts
We will start by assigning a class to each type of sample, to be used in labeling the various plots.
1 - 4 Quarry samples
5 - 7 Artifacts from Indian sites
Both unscaled and autoscaled data will be evaluated using XLStat.
Archaeological artifacts - Data scaling
Archaeological artifacts - Eigenvalues
[Figure: scree plot for the unscaled data. Eigenvalue (0 to 180000) and cumulative variability (%) for factors F1 to F10.]
Virtually all of the variance is in the first principal component.
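The quantities on a scree plot follow directly from the eigenvalues. The sketch below computes the % variability of each factor and the cumulative total; the eigenvalues used are hypothetical round numbers, not the values from this study.

```python
import numpy as np

def variability(eigvals):
    """Percent of total variance per factor, and its running total."""
    eigvals = np.asarray(eigvals, dtype=float)
    percent = 100.0 * eigvals / eigvals.sum()
    return percent, np.cumsum(percent)

def n_components_for(eigvals, target_percent):
    """Smallest number of factors whose cumulative variability reaches the target."""
    _, cumulative = variability(eigvals)
    count = int(np.searchsorted(cumulative, target_percent)) + 1
    return min(count, len(cumulative))

# Hypothetical eigenvalues, largest first.
demo_eigvals = [5.0, 3.0, 1.5, 0.5]
percent, cumulative = variability(demo_eigvals)
needed = n_components_for(demo_eigvals, 90.0)
```

Here the factors carry 50, 30, 15 and 5 % of the variance, so three factors are needed to pass 90 %.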
PCA of archaeological artifacts
We can now produce displays of our
components.
A plot of the scores for PC1 vs. PC2 will
result in about 98% of the original
information being displayed.
A loadings plot of L1 vs. L2 will show the
importance of the original variables in the
construction of PC1 and PC2.
PCA of archaeological artifacts - PC1 vs. PC2
[Figure: biplot of scores (samples labeled by class, 1 to 7) and loadings (Fe, Ti, Ba, Ca, K, Mn, Rb, Sr, Y, Zr) for the unscaled data. Axes: F1 (81.68 %) and F2 (16.14 %).]
Quarry samples from site 1 tend to form an individual group.
Quarry site 3 and artifact site 7 appear to be related.
PCA of archaeological artifacts - L1 vs. L2
The loadings show that Ca and Fe both have an effect on PC1. The other variables have a smaller effect.
[Figure: loadings plot of L1 vs. L2 for Fe, Ti, Ba, Ca, K, Mn, Rb, Sr, Y and Zr. Both axes run from -1 to 1.]
Correlation plot
XLStat will also produce a plot that indicates the correlation between the original variables and the factors.
Here, it indicates that Y has little effect and the others have an impact on both PCs.
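The correlation between an original variable and a factor can be computed from the loadings and eigenvalues. The sketch below is a generic NumPy calculation, not a description of what XLStat does internally; the function name and demo data are assumptions.

```python
import numpy as np

def variable_factor_correlations(X, n_factors=2):
    """Correlation of each original variable with each of the first factors."""
    X = np.asarray(X, dtype=float)
    Xc = X - X.mean(axis=0)
    eigvals, eigvecs = np.linalg.eigh(np.cov(Xc, rowvar=False))
    order = np.argsort(eigvals)[::-1][:n_factors]
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    scores = Xc @ eigvecs
    sd = X.std(axis=0, ddof=1)
    # corr(variable j, factor k) = loading_jk * sqrt(eigenvalue_k) / sd_j
    corr = eigvecs * np.sqrt(eigvals) / sd[:, None]
    return corr, scores

# Hypothetical data: 60 samples, 5 variables on different scales.
rng = np.random.default_rng(3)
X_demo = rng.normal(size=(60, 5)) * np.array([10.0, 5.0, 2.0, 1.0, 0.5])
X_demo[:, 1] += 0.4 * X_demo[:, 0]
corr, scores = variable_factor_correlations(X_demo)
```

A variable whose correlations with both factors are close to zero sits near the center of the correlation plot and contributes little to either component.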
Archaeological artifacts - Autoscaling
PCA of archaeological artifacts
[Figure: scree plot for the autoscaled data. Eigenvalue (0 to 6) and cumulative variability (%) for factors F1 to F10.]
Now that each variable is given equal weight, the variance is no longer all smashed into the 1st component.
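The effect of autoscaling on the eigenvalue distribution is easy to reproduce. The sketch below uses made-up, independent variables on very different scales (an assumption for illustration, not the obsidian data) and compares the % variance per component with and without autoscaling.

```python
import numpy as np

def explained_percent(X, autoscale):
    """Percent of variance per component, largest first."""
    X = np.asarray(X, dtype=float)
    Z = X - X.mean(axis=0)
    if autoscale:
        Z = Z / X.std(axis=0, ddof=1)          # give every variable unit variance
    eigvals = np.linalg.eigvalsh(np.cov(Z, rowvar=False))[::-1]
    return 100.0 * eigvals / eigvals.sum()

# Hypothetical independent variables with standard deviations of 1000, 10 and 0.1.
rng = np.random.default_rng(1)
X_demo = rng.normal(size=(200, 3)) * np.array([1000.0, 10.0, 0.1])

raw = explained_percent(X_demo, autoscale=False)
scaled = explained_percent(X_demo, autoscale=True)
```

Without scaling, the first component is essentially just the variable with the largest numerical range; after autoscaling the variance is spread across the components.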
[Figure: biplot showing scores (samples labeled by class, 1 to 7) and loadings (Fe, Ti, Ba, Ca, K, Mn, Rb, Sr, Y, Zr) for the autoscaled data. Axes: F1 (52.52 %) and F2 (20.78 %).]
PCA of archaeological artifacts
So what does it all mean?
In this case, both scaled and unscaled results indicate a grouping of related samples: XRF results can be used to classify related samples.
Samples from different quarries (1-4) can easily be distinguished, although samples from site 2 are pretty scattered.
Can we tell anything about the artifacts?
No artifacts appear to have come from quarry site 4. It's well resolved from the other samples.
[Figure: biplot of scores and loadings for the autoscaled data. Axes: F1 (52.52 %) and F2 (20.78 %).]
Archaeological artifacts - Results
[Figure: biplot of scores and loadings for the autoscaled data. Axes: F1 (52.52 %) and F2 (20.78 %).]
Artifacts from site … appear to come from quarry 2, although the results are scattered.
Artifacts from site … are from quarry …
Site 5 artifacts appear to come from all over the place. It is possible that this was a nomadic tribe.
Using the loadings
The loadings indicate that many of our variables are closely related.
In addition, Y appears to have little effect on our results.
We can reprocess our data after eliminating Y and some of our correlated variables.
This might improve our results.
At a minimum it will make subsequent studies easier: less data to collect.
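One simple way to thin out closely related variables is to keep a variable only if it is not strongly correlated with any variable already kept. This is a sketch of that idea, not the selection procedure used in the study (which was done by inspecting the loadings plot); the 0.9 threshold, the function name and the variables a, b, c are all hypothetical.

```python
import numpy as np

def select_uncorrelated(X, names, threshold=0.9):
    """Greedy selection: drop a variable if |r| with a kept variable reaches the threshold."""
    R = np.corrcoef(np.asarray(X, dtype=float), rowvar=False)
    kept = []
    for j in range(R.shape[0]):
        if all(abs(R[j, k]) < threshold for k in kept):
            kept.append(j)
    return [names[j] for j in kept]

# Hypothetical variables: b is nearly a copy of a, c is unrelated.
rng = np.random.default_rng(4)
a = rng.normal(size=100)
b = a + 0.05 * rng.normal(size=100)
c = rng.normal(size=100)
X_demo = np.column_stack([a, b, c])

selected = select_uncorrelated(X_demo, ["a", "b", "c"])
```

The result depends on the order of the columns, since the first member of each correlated group is the one retained. A variable with little effect on any component has to be removed separately, because low correlation with other variables does not make it informative.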
Using the loadings
[Figure: loadings plot of Fe, Ti, Ba, Ca, K, Mn, Rb, Sr, Y and Zr (both axes -1 to 1), annotated with the variables to keep or drop: "Use K", "Use Ca", "Use Fe", one further "Use" label whose variable name was lost, and "Remove".]
Modified Study
We now have only 4 variables.
Is it enough to still give the same characterization as with the original 10?
If this works, we'll be able to save quite a bit of time and money on subsequent assays.
Also, does it improve our results?
[Figure: biplot of scores (samples labeled by class, 1 to 7) and loadings for the reduced data set; the loading labels that survive are Zr, K and Ca. Axes: F1 (60.47 %) and F2 (27.66 %).]
Our results are almost identical to our 10 variable work.
Classification of whiskey
Another example: work conducted in our laboratory.
One study involved the characterization of whiskey based on GC/MS traces.
This example shows what you might need to do in order to make your data suitable for PCA evaluations.
Representative whiskeys
Methylene chloride extracts for a series of whiskeys were assayed using a GC/MS.
Variables needed to be constructed from these traces.
Data preprocessing
Each chromatogram consisted of approximately 1800 points. This would be too much for many systems to handle.
Variables were constructed by summing the response at 1 min intervals, resulting in 30 variables.
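Summing the response over fixed intervals is a reshape-and-sum operation. The sketch below assumes evenly spaced points, so that about 1800 points and 30 one-minute intervals give 60 points per variable; the function name and the handling of leftover points are assumptions.

```python
import numpy as np

def bin_trace(trace, n_bins=30):
    """Sum a trace over n_bins equal-width intervals, dropping any leftover tail points."""
    trace = np.asarray(trace, dtype=float)
    usable = (len(trace) // n_bins) * n_bins
    return trace[:usable].reshape(n_bins, -1).sum(axis=1)

# Hypothetical flat trace of 1800 points, and one with 5 extra points at the end.
flat = bin_trace(np.ones(1800))
ragged = bin_trace(np.ones(1805))
```

Each chromatogram then becomes one row of 30 variables in the X matrix.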
Data preprocessing
To improve variable stability:
The smallest response value was treated as a baseline for background correction.
An internal standard was used to normalize detector response.
The internal standard was also used to account for small time variations.
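The three corrections can be sketched on a single trace as follows. The details are assumptions, since the slides do not give them: the internal-standard peak is located as the maximum inside a search window, time variation is handled by a rigid shift, and the response is normalized to the summed internal-standard signal. All names and numbers are hypothetical.

```python
import numpy as np

def align_to_standard(trace, istd_window, reference_index):
    """Shift the trace so the internal-standard peak sits at reference_index."""
    trace = np.asarray(trace, dtype=float)
    start, stop = istd_window
    peak = start + int(np.argmax(trace[start:stop]))
    # np.roll wraps points around the ends, so this suits small shifts only.
    return np.roll(trace, reference_index - peak)

def correct_and_normalize(trace, istd_slice):
    """Subtract the smallest response as baseline, then divide by the standard's response."""
    trace = np.asarray(trace, dtype=float)
    corrected = trace - trace.min()
    istd_response = corrected[istd_slice].sum()
    return corrected / istd_response

# Hypothetical trace: constant background of 5 with one standard peak at index 42.
raw = np.full(100, 5.0)
raw[42] += 10.0
aligned = align_to_standard(raw, istd_window=(30, 60), reference_index=40)
clean = correct_and_normalize(aligned, istd_slice=slice(35, 45))
```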
Data preprocessing
All data was autoscaled prior to PCA.
Questions asked
Could whiskies be classified?
Could the approach be used to detect
sample dilution?
blending?
contamination?
Initial PCA analysis
S - Scotch
B - Bourbon
C - Canadian
L - Blended
T - Tennessee
Blending of one scotch into another
[Figure: PC1 vs. PC2 scores (both axes -3 to 3) for brand X, brand Y and blends labeled 25Y, 50Y and 75Y, where the label gives the % of brand Y in the blend.]
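If detector response is assumed to add linearly on blending (an assumption, not something the slides state), a blend's scores fall on the straight line between the scores of the two pure brands, at a position set by the blend fraction. The sketch below shows this with invented 30-variable profiles and a PCA model built from replicate measurements of the pure brands.

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical 30-variable profiles for two brands, plus noisy replicates of each.
profile_x = rng.uniform(1.0, 10.0, size=30)
profile_y = rng.uniform(1.0, 10.0, size=30)
train = np.vstack([
    profile_x + rng.normal(scale=0.2, size=(10, 30)),
    profile_y + rng.normal(scale=0.2, size=(10, 30)),
])

# Build the PCA model on autoscaled training data and keep two components.
mean = train.mean(axis=0)
sd = train.std(axis=0, ddof=1)
_, _, Vt = np.linalg.svd((train - mean) / sd, full_matrices=False)
P = Vt[:2].T

def project(samples):
    """Scores of new samples, using the training mean, scale and loadings."""
    return ((np.atleast_2d(samples) - mean) / sd) @ P

# Blends containing 25, 50 and 75 % of brand Y.
fractions = np.array([0.25, 0.50, 0.75])
blends = np.array([(1.0 - f) * profile_x + f * profile_y for f in fractions])
blend_scores = project(blends)
```

New samples must be scaled with the training mean and standard deviation; rescaling each blend on its own would move it off the line.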
Contamination
Dilution
[Figure: PC1 (-7 to 5) vs. PC2 (-3 to 3) scores for whiskey X and diluted samples labeled 0, 20, 40, 60 and 80, where the label is "% by V, whiskey".]