11-A_PCA


  • 7/28/2019 11-A_PCA


    Principal Component Analysis

    Principal Component Analysis

    Factor analysis dates back to the 1930s.

    It was originally used in psychology to study intelligence. Attempts were made to relate test results to other factors.

    The premise was that X = S L + E, where

        X = test performance

        L = intrinsic intelligence factors

        S = individual scores

        E = residual error

    Principal Component Analysis

    Using an eigenvector rotation, it would be possible to decompose the X matrix into a series of loadings and scores.

    Underlying or intrinsic factors related to intelligence could then be detected.

    In chemistry, this approach can be used by diagonalizing the correlation or covariance matrix: Principal Component Analysis.

    Principal Component Analysis

    PCA is typically conducted using the covariance matrix from autoscaled data (which is the correlation matrix of the original data).

    It is then diagonalized (eigenvector rotation).

    Typically, the eigenvectors with the largest eigenvalues are the most important.
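
The steps on this slide can be sketched in a few lines of NumPy. This is a minimal illustration, not the procedure of any particular package: the function name `pca_eig` and the synthetic demo data are assumptions made for the example.

```python
import numpy as np

def pca_eig(X):
    """PCA by diagonalizing the covariance matrix of autoscaled data."""
    X = np.asarray(X, dtype=float)
    # Autoscale: subtract each column mean, divide by each column standard deviation.
    Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
    # Covariance matrix of the autoscaled data (the correlation matrix of X).
    C = np.cov(Z, rowvar=False)
    # eigh handles symmetric matrices and returns eigenvalues in ascending order.
    eigvals, eigvecs = np.linalg.eigh(C)
    # Reorder so the largest eigenvalue (most important component) comes first.
    order = np.argsort(eigvals)[::-1]
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    # Scores: projection of the autoscaled data onto the eigenvectors (loadings).
    scores = Z @ eigvecs
    return eigvals, eigvecs, scores

# Synthetic demo data (hypothetical): 50 samples, 4 variables, two of them related.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(50, 4))
X_demo[:, 1] = 2.0 * X_demo[:, 0] + 0.1 * rng.normal(size=50)

eigvals, loadings, scores = pca_eig(X_demo)
print(eigvals / eigvals.sum())  # fraction of variance carried by each component
```

Because the data are autoscaled, the eigenvalues sum to the number of variables, and the variance of each score column equals the corresponding eigenvalue.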

    Covariance and Correlation

    Covariance

        A measure of the association between two variables.

        The sum of cross products between two variables, expressed as deviations from their respective means (usually divided by n - 1).

    Correlation

        The covariance between two z-transformed (autoscaled) variables.
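
Written out, the two definitions look as follows. This assumes the usual sample convention of dividing by n - 1; the symbols (x, y for the two variables, s for a standard deviation, z for an autoscaled variable) are notation introduced here, not taken from the slides.

```latex
\[
\begin{aligned}
\operatorname{cov}(x, y) &= \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y}) \\
z_{x,i} &= \frac{x_i - \bar{x}}{s_x}, \qquad z_{y,i} = \frac{y_i - \bar{y}}{s_y} \\
r_{xy} &= \operatorname{cov}(z_x, z_y) = \frac{\operatorname{cov}(x, y)}{s_x \, s_y}
\end{aligned}
\]
```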

    Principal Component Analysis

    Approaches fall into two categories:

        Complete diagonalization of the matrix.

        Approximation methods that extract one component at a time.

    In the end, the results are the same. The data is decomposed into a set of loadings, scores and a residual.

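    \[ X = t_1 p_1^T + t_2 p_2^T + \cdots + t_a p_a^T + E \]

    Here X and E are n x m matrices, each score vector t_i has n elements, and each loading vector p_i has m elements.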
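
To illustrate that extracting one component at a time gives the same loadings as a complete decomposition, here is a sketch of a NIPALS-style iteration compared against NumPy's SVD. NIPALS is named here as one common one-at-a-time method; the slides do not say which approximation method they have in mind, and the function name and demo data are hypothetical.

```python
import numpy as np

def nipals(X, n_components, tol=1e-12, max_iter=1000):
    """Extract principal components one at a time (NIPALS-style sketch)."""
    E = np.asarray(X, dtype=float).copy()   # residual, deflated after each component
    n, m = E.shape
    T = np.zeros((n, n_components))         # scores, one column per component
    P = np.zeros((m, n_components))         # loadings, one column per component
    for a in range(n_components):
        # Start from the column of the residual with the largest variance.
        t = E[:, np.argmax(E.var(axis=0))].copy()
        for _ in range(max_iter):
            p = E.T @ t / (t @ t)           # regress residual columns on the score vector
            p /= np.linalg.norm(p)          # keep the loading vector at unit length
            t_new = E @ p                   # updated scores
            converged = np.linalg.norm(t_new - t) <= tol * np.linalg.norm(t_new)
            t = t_new
            if converged:
                break
        T[:, a], P[:, a] = t, p
        E = E - np.outer(t, p)              # remove this component from the residual
    return T, P, E

# Synthetic, mean-centered demo data (hypothetical).
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(30, 5)) * np.array([5.0, 3.0, 1.0, 0.5, 0.2])
Xc = X_demo - X_demo.mean(axis=0)

T, P, E = nipals(Xc, n_components=2)
_, _, Vt = np.linalg.svd(Xc, full_matrices=False)

# Loadings agree with the first two right singular vectors up to sign.
print(np.abs(P.T @ Vt[:2].T).round(6))
```

By construction, `T @ P.T + E` reproduces the centered data exactly, which is the decomposition into scores, loadings and a residual described above.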


    [Figure: eigenvectors EV1 and EV2 drawn in the space of Variable 1, Variable 2 and Variable 3.]

    Principal component (PC).

        A linear combination of related variables. It represents an intrinsic factor of your data.

    Scores.

        The projection of your data into PC space.

    Loadings.

        Show the relative significance of the original variables.

    Residual.

        The data that could not be correlated; typically random noise.

    Varimax rotation

    A secondary tweaking of the PCs to help better observe relationships.

    It is essentially a secondary rotation of your data in an attempt to lump all variance from individual variables into single components.

    It can often help you to better understand the effects of your original variables.
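
A minimal sketch of a varimax rotation follows, using the widely circulated SVD-based iteration. It is offered as an illustration only: the function name, its parameters (`gamma`, `max_iter`, `tol`) and the demo loading matrix are assumptions, and statistical packages may normalize or iterate differently.

```python
import numpy as np

def varimax(loadings, gamma=1.0, max_iter=100, tol=1e-8):
    """Orthogonal varimax rotation of a (variables x components) loading matrix."""
    A = np.asarray(loadings, dtype=float)
    p, k = A.shape
    R = np.eye(k)                 # rotation matrix, starts as "no rotation"
    total = 0.0
    for _ in range(max_iter):
        L = A @ R                 # loadings under the current rotation
        # Gradient of the varimax criterion with respect to the rotation.
        grad = A.T @ (L**3 - (gamma / p) * L @ np.diag((L**2).sum(axis=0)))
        U, s, Vt = np.linalg.svd(grad)
        R = U @ Vt                # closest orthogonal matrix to the gradient
        new_total = s.sum()
        if total > 0 and new_total < total * (1 + tol):
            break                 # criterion has stopped improving
        total = new_total
    return A @ R, R

# Hypothetical demo: 5 variables, 3 components, with a simple structure that has
# been hidden by an arbitrary orthogonal rotation Q.
rng = np.random.default_rng(0)
simple = np.array([[0.9, 0.0, 0.0],
                   [0.8, 0.0, 0.0],
                   [0.0, 0.7, 0.0],
                   [0.0, 0.9, 0.0],
                   [0.0, 0.0, 0.8]])
Q, _ = np.linalg.qr(rng.normal(size=(3, 3)))
unrotated = simple @ Q

rotated, R = varimax(unrotated)
print(np.round(rotated, 2))
```

The rotation is orthogonal, so the total variance each variable contributes across the retained components (its row sum of squared loadings) is unchanged; only its distribution over the components moves.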

    Varimax rotation

    Assume we have 5 original variables and are only interested in the first 3 PCs.

    [Figure: % variance of each of the 5 original variables in PC1, PC2 and PC3, before rotation.]

    Varimax rotation

    After varimax rotation, it might look like:

    [Figure: % variance of each of the 5 original variables in PC1, PC2 and PC3, after rotation.]

    The significance of each variable is easier to see.

    Using PCA results

    The best way to appreciate PCA is to look at a series of examples.

    We'll attempt to show what types of information can be obtained and how it can be used.

    Examples

        Classification of artifacts

        Classification of whiskey

        Noise reduction of 3D data

    PCA of archaeological artifacts

    The information presented in this example is from:

        Kowalski, Schatzki and Stross, Anal. Chem. 44, 2176 (1972).

    A complete evaluation of the data is also presented in Chemometrics by Sharaf, Illman and Kowalski, John Wiley & Sons, 1986.


    PCA of archaeological artifacts

    Summary of study.

        Native American artifacts made of obsidian glass were obtained from 5 sites in northern California. Obsidian samples from 4 quarry sites were obtained in the same area.

        X-ray fluorescence analysis for ten elements (Fe, Ti, Ba, Ca, K, Mn, Rb, Sr, Y and Zr) was conducted on all 75 samples.

    Questions posed.

        Can the different sources of obsidian be differentiated based on the chemical measurements made?

        Can something be said regarding the sources of the artifacts and the migration and trading patterns of the Indians?

    PCA of archaeological artifacts

    We will start by assigning classes to each type of sample, to be used in the labeling of the various plots.

        1 - 4 Quarry samples

        5 - 7 Artifacts from Indian sites

    Both unscaled and autoscaled data will be evaluated using XLStat.

    Archaeological artifacts - Data scaling

    Archaeological artifacts - Eigenvalues

    [Figure: scree plot for the unscaled data; eigenvalue (0 to 180000) and cumulative variability (%) for factors F1 to F10.]

    Virtually all of the variance is in the first principal component.

    PCA of archaeological artifacts

    We can now produce displays of our

    components.

    A plot of the scores for PC1 vs. PC2 will

    result in about 98% of the original

    information being displayed.

    A loadings plot of L1 vs. L2 will show the

    importance of the original variables in the

    construction of PC1 and PC2.
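
The "about 98%" figure is simply the sum of the variance percentages of the first two components, 81.68 % and 16.14 % as reported on the F1 and F2 axes of the unscaled scores plot. A short sketch of that arithmetic, with a hypothetical helper for turning a list of eigenvalues into cumulative percentages:

```python
import numpy as np

def cumulative_variance_percent(eigenvalues):
    """Cumulative % of total variance, given eigenvalues sorted largest first."""
    ev = np.asarray(eigenvalues, dtype=float)
    return 100.0 * np.cumsum(ev) / ev.sum()

# Percentages reported on the F1 and F2 axes for the unscaled data.
pc_percent = np.array([81.68, 16.14])
total_percent = pc_percent.sum()
print(round(total_percent, 2))  # 97.82, i.e. about 98 %
```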

    PCA of archaeological artifacts

    PC1 vs. PC2

    [Figure: biplot of the unscaled data, F1 (81.68 %) vs. F2 (16.14 %). Samples are plotted by class number (1 to 7) and loadings by element (Fe, Ti, Ba, Ca, K, Mn, Rb, Sr, Y, Zr).]

    Quarry samples from site 1 tend to form an individual group.

    Quarry site 3 and artifact site 7 appear to be related.


    PCA of archaeological artifacts

    L1 vs. L2

    The loadings show that Ca and Fe both have an effect on PC1.

    The other variables have a smaller effect.

    Correlation plot

    [Figure: correlation plot of the ten elements (Fe, Ti, Ba, Ca, K, Mn, Rb, Sr, Y, Zr) against the first two factors; both axes run from -1 to 1.]

    XLStat will also produce a plot that indicates the correlation between the original variables and the factors.

    Here, it indicates that Y has little effect and the others have an impact on both PCs.

    Archaeological artifacts - Autoscaling

    PCA of archaeological artifacts

    [Figure: scree plot for the autoscaled data; eigenvalue (0 to 6) and cumulative variability (%) for factors F1 to F10.]

    Now that each variable is given equal weight, variance is no longer all smashed into the 1st component.
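
The effect of autoscaling can be reproduced on synthetic data. The sketch below is not the obsidian data set: it uses three hypothetical, independent variables, one of which is measured on a scale a thousand times larger than the others, and compares the fraction of variance per component with and without autoscaling.

```python
import numpy as np

def explained_fraction(X, autoscale):
    """Fraction of total variance per principal component, largest first."""
    X = np.asarray(X, dtype=float)
    Z = X - X.mean(axis=0)                    # mean-center every variable
    if autoscale:
        Z = Z / X.std(axis=0, ddof=1)         # give every variable unit variance
    # eigvalsh returns ascending eigenvalues of a symmetric matrix; reverse them.
    eigvals = np.linalg.eigvalsh(np.cov(Z, rowvar=False))[::-1]
    return eigvals / eigvals.sum()

# Hypothetical data: 200 samples, 3 independent variables on very different scales.
rng = np.random.default_rng(1)
X_syn = rng.normal(size=(200, 3)) * np.array([1000.0, 1.0, 1.0])

raw = explained_fraction(X_syn, autoscale=False)
scaled = explained_fraction(X_syn, autoscale=True)
print("unscaled:  ", np.round(raw, 4))
print("autoscaled:", np.round(scaled, 4))
```

Without scaling, the first component is little more than the large-scale variable; after autoscaling, the variance is spread over the components.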

    [Figure: biplot showing scores and loadings for the autoscaled data, F1 (52.52 %) vs. F2 (20.78 %). Samples are plotted by class number (1 to 7) and loadings by element (Fe, Ti, Ba, Ca, K, Mn, Rb, Sr, Y, Zr).]

    PCA of archaeological artifacts

    So what does it all mean?

    In this case, both scaled and unscaled results indicate a grouping of related samples. XRF results can be used to classify related samples.

    Samples from different quarries (1-4) can easily be distinguished, although samples from site 2 are pretty scattered.

    Can we tell anything about the artifacts?


    No artifacts appear to have come from quarry site 4. It's well resolved from the other samples.

    [Figure: biplot of the autoscaled data, F1 (52.52 %) vs. F2 (20.78 %), with samples plotted by class number (1 to 7) and loadings by element.]

    Archaeological artifacts - Results

    [Figure: biplot of the autoscaled data, F1 (52.52 %) vs. F2 (20.78 %), with samples plotted by class number (1 to 7) and loadings by element.]

    Artifacts from site [...] appear to come from quarry 2, although the results are scattered.

    Artifacts from site [...] are from quarry [...].

    Site 5 artifacts appear to come from all over the place. It is possible that this was a nomadic tribe.

    Using the loadings

    The loadings indicate that many of our variables are closely related.

    In addition, Y appears to have little effect on our results.

    We can reprocess our data after eliminating Y and some of our correlated variables.

    This might improve our results. At a minimum, it will make subsequent studies easier: less data to collect.
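
In the slides the redundant variables are picked by eye from the loadings plot. As a rough programmatic stand-in, one can drop any variable that is highly correlated with a variable already kept. The sketch below is only that, a stand-in: the function name, the 0.9 threshold, and the variable names `a`, `b`, `c` are all hypothetical, and the data are synthetic.

```python
import numpy as np

def drop_correlated(X, names, threshold=0.9):
    """Greedy filter: keep a variable only if |r| with every kept variable is below threshold."""
    R = np.corrcoef(np.asarray(X, dtype=float), rowvar=False)
    kept = []
    for j in range(R.shape[0]):
        if all(abs(R[j, k]) < threshold for k in kept):
            kept.append(j)
    return [names[j] for j in kept]

# Synthetic demo: "b" is nearly a copy of "a"; "c" is unrelated to both.
rng = np.random.default_rng(2)
a = rng.normal(size=100)
b = a + 0.05 * rng.normal(size=100)
c = rng.normal(size=100)
X_syn = np.column_stack([a, b, c])

selected = drop_correlated(X_syn, ["a", "b", "c"])
print(selected)  # ['a', 'c']
```

A filter like this only handles redundancy between variables. A variable with little effect on the retained components, such as one with small loadings on all of them, has to be identified separately from the loadings.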

    Using the loadings

    [Figure: correlation plot of the ten elements, both axes from -1 to 1, with annotations reading "Use K", "Use Ca", "Use Fe", "Use ..." and "Remove".]

    Modified Study

    We now have only 4 variables.

    Is it enough to still give the same characterization as with the original 10?

    If this works, we'll be able to save quite a bit of time and money on subsequent assays.

    Also, does it improve our results?

    [Figure: biplot for the reduced data set, F1 (60.47 %) vs. F2 (27.66 %), with samples plotted by class number (1 to 7) and loadings by element.]

    Our results are almost identical to our 10 variable work.


    Classification of whiskey

    Another example: work conducted in our laboratory.

    One study involved the characterization of whiskey based on GC/MS traces.

    This example shows what you might need to do in order to make your data suitable for PCA evaluations.

    Representative whiskeys

    Methylene chloride extracts for a series of whiskeys were assayed using a GC/MS.

    Variables needed to be constructed from these traces.

    Data preprocessing

    Each chromatogram consisted of approximately 1800 points. This would be too much for many systems to handle.

    Variables were constructed by summing the response at 1 min intervals, resulting in 30 variables.

    Data preprocessing

    To improve variable stability:

        The smallest response value was treated as a baseline for background correction.

        An internal standard was used to normalize detector response.

        The internal standard was also used to account for small time variations.

    Data preprocessing

    All data was autoscaled prior to PCA.
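
The preprocessing described on these slides can be sketched as follows. Everything specific in the example is an assumption: the trace is synthetic, the position of the internal standard peak (`istd_slice`) is made up, 60 points per 1 min interval is inferred only from 1800 points giving 30 variables, and the time-shift correction mentioned above is not implemented.

```python
import numpy as np

def preprocess_trace(trace, istd_slice, n_bins=30):
    """Baseline-correct, normalize to an internal standard, and bin one chromatogram."""
    y = np.asarray(trace, dtype=float)
    y = y - y.min()                           # smallest response treated as the baseline
    y = y / y[istd_slice].sum()               # normalize to the internal standard peak area
    usable = (y.size // n_bins) * n_bins      # drop trailing points that do not fill a bin
    return y[:usable].reshape(n_bins, -1).sum(axis=1)   # sum the response within each bin

# Synthetic trace (hypothetical): constant background plus two Gaussian peaks.
points = np.arange(1800)
trace = (5.0
         + 100.0 * np.exp(-0.5 * ((points - 400) / 8.0) ** 2)
         + 50.0 * np.exp(-0.5 * ((points - 930) / 8.0) ** 2))

# Assume the internal standard elutes within points 900..959 (the 16th interval).
variables = preprocess_trace(trace, istd_slice=slice(900, 960))
print(variables.shape)  # (30,)
```

Each whiskey then contributes one row of 30 variables, and the resulting samples-by-variables table is autoscaled column by column before PCA.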

    Questions asked

    Could whiskies be classified?

    Could the approach be used to detect

        sample dilution?

        blending?

        contamination?

    Initial PCA analysis

    S - Scotch

    B - Bourbon

    C - Canadian

    L - Blended

    T - Tennessee


    Blending of one scotch into another

    [Figure: scores plot, PC1 vs. PC2 (both axes from -3 to 3), with points labeled X and Y and blends labeled 75Y, 50Y and 25Y according to the % of brand Y in the blend.]

    Contamination

    Dilution

    [Figure: scores plot, PC1 (-7 to 5) vs. PC2 (-3 to 3), with points labeled X and diluted samples labeled 0, 20, 40, 60 and 80 (% by V, whiskey).]