
  • Data - Examples Studying individuals Studying variables Interpretation aids

    Principal Component Analysis (PCA)

    François Husson

    Applied Mathematics Department - Rennes Agrocampus


  • Principal Component Analysis (PCA)

    1 Data - examples - notation
    2 Studying individuals
    3 Studying variables
    4 Interpreting the data

  • Which kinds of data?

    PCA applies to data tables where rows are considered as individuals and columns as quantitative variables.

    Figure: Data table in PCA

    For variable k, we note:

    the mean: $\bar{x}_k = \frac{1}{I} \sum_{i=1}^{I} x_{ik}$

    the standard deviation: $s_k = \sqrt{\frac{1}{I} \sum_{i=1}^{I} (x_{ik} - \bar{x}_k)^2}$
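
The two definitions above translate directly into code. A minimal Python sketch, assuming the scores of one variable are held in a plain list; the data values and the names `mean` and `std_dev` are illustrative and not part of the slides. The divisor is 1/I, as on the slide, whereas R's built-in `sd` divides by I − 1.

```python
import math

def mean(values):
    """Arithmetic mean: (1/I) * sum of the values."""
    return sum(values) / len(values)

def std_dev(values):
    """Standard deviation with the 1/I divisor used on the slide."""
    m = mean(values)
    return math.sqrt(sum((x - m) ** 2 for x in values) / len(values))

# Hypothetical scores of I = 5 individuals on one variable k.
x_k = [2.0, 4.0, 4.0, 4.0, 6.0]
x_bar_k = mean(x_k)   # mean of variable k
s_k = std_dev(x_k)    # standard deviation of variable k
print(x_bar_k, s_k)
```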

  • Examples

    • Sensory analysis: score for attribute k of product i
    • Ecology: concentration of pollutant k in river i
    • Economics: indicator value k for year i
    • Genetics: expression of gene k for patient i
    • Biology: measure k for animal i
    • Marketing: value of measure k for brand i
    • Sociology: time spent on activity k by individuals from social class i
    • etc.

    ⇒ Many data tables have this structure

  • Wine data

    • 10 individuals (rows): white wines from the Loire region
    • 30 variables (columns):
      • 27 continuous variables: sensory descriptors
      • 2 continuous variables: odor and overall preference
      • 1 categorical variable: wine label (Vouvray or Sauvignon)

    Wine              O.fruity  O.passion  O.citrus  …  Sweetness  Acidity  Bitterness  Astringency  Aroma.intensity  Aroma.persistency  Visual.intensity  Odor.preference  Overall.preference  Label
    S Michaud              4,3        2,4       5,7  …        3,5      5,9         4,1          1,4              7,1                6,7               5,0              6,0                 5,0  Sauvignon
    S Renaudie             4,4        3,1       5,3  …        3,3      6,8         3,8          2,3              7,2                6,6               3,4              5,4                 5,5  Sauvignon
    S Trotignon            5,1        4,0       5,3  …        3,0      6,1         4,1          2,4              6,1                6,1               3,0              5,0                 5,5  Sauvignon
    S Buisse Domaine       4,3        2,4       3,6  …        3,9      5,6         2,5          3,0              4,9                5,1               4,1              5,3                 4,6  Sauvignon
    S Buisse Cristal       5,6        3,1       3,5  …        3,4      6,6         5,0          3,1              6,1                5,1               3,6              6,1                 5,0  Sauvignon
    V Aub Silex            3,9        0,7       3,3  …        7,9      4,4         3,0          2,4              5,9                5,6               4,0              5,0                 5,5  Vouvray
    V Aub Marigny          2,1        0,7       1,0  …        3,5      6,4         5,0          4,0              6,3                6,7               6,0              5,1                 4,1  Vouvray
    V Font Domaine         5,1        0,5       2,5  …        3,0      5,7         4,0          2,5              6,7                6,3               6,4              4,4                 5,1  Vouvray
    V Font Brûlés          5,1        0,8       3,8  …        3,9      5,4         4,0          3,1              7,0                6,1               7,4              4,4                 6,4  Vouvray
    V Font Coteaux         4,1        0,9       2,7  …        3,8      5,1         4,3          4,3              7,3                6,6               6,3              6,0                 5,7  Vouvray

  • Issues – goals

    The data table can be seen as a set of rows or a set of columns

    Studying individuals
    • When can we say that 2 individuals are similar (or dissimilar) with respect to all the variables?
    • If there are many individuals, is it possible to categorize them?

    ⇒ groups of individuals, partitions between them

  • Issues – goals

    Studying variables
    • For individuals, we interpret similarity in terms of the variables' values
    • Between variables, we talk instead of "relationships"
    • Linear relationships are commonplace, and a first approximation of many links ⇒ correlation coefficient

    ⇒ visualization of the correlation matrix
    ⇒ find a small number of synthetic variables to summarize many variables (the mean is an example of a synthetic variable chosen a priori; here we look for synthetic variables built a posteriori from the data)

  • Issues – goals

    Links between the two points of view
    • Characterize groups of individuals using the variables ⇒ need an automatic procedure
    • Use specific individuals to better understand links between variables ⇒ use of extreme individuals (returning to the individuals makes the interpretation simpler)

    PCA issues:
    • Descriptive method to explore data: visualization of data with simple plots
    • Data compression: summarize a big data table of individuals × quantitative variables

  • Two point clouds

    Figure: the data table X (rows 1, …, i, …, I; columns 1, …, k, …, K) read in two ways. Individuals study: each row (ind 1, …, ind i) is a point in $\mathbb{R}^K$. Variables study: each column (var 1, …, var k) is a point in $\mathbb{R}^I$.

  • Studying individuals

  • The cloud of individuals $N_I$

    1 individual = 1 row of the data table ⇒ 1 point in $\mathbb{R}^K$

    • If K = 1: axial representation
    • If K = 2: scatter plot
    • If K = 3: 3D graphical representation (more difficult)
    • If K = 4: impossible to "see", BUT the concept is easy

    Notion of similarity: (squared) distance between individuals i and i′:

    $d^2(i, i') = \sum_{k=1}^{K} (x_{ik} - x_{i'k})^2$ (thanks, Mr Pythagoras)

    Studying the individuals ≡ studying the shape of the cloud $N_I$
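
The squared distance between two individuals can be computed in a few lines. A small Python sketch, assuming each individual is a list of its K values; the two individuals and the name `squared_distance` are hypothetical.

```python
def squared_distance(row_i, row_j):
    """Squared Euclidean distance between two individuals described by the same K variables."""
    if len(row_i) != len(row_j):
        raise ValueError("both individuals must have the same number of variables")
    return sum((a - b) ** 2 for a, b in zip(row_i, row_j))

# Two hypothetical individuals described by K = 4 variables.
ind_i = [1.0, 2.0, 0.0, -1.0]
ind_j = [4.0, 6.0, 0.0, -1.0]
d2 = squared_distance(ind_i, ind_j)  # (1 - 4)^2 + (2 - 6)^2 + 0 + 0
print(d2)
```

Nothing changes when K grows beyond 3: the sum simply runs over more variables, which is why the concept stays easy even when the cloud can no longer be drawn.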

  • The cloud of individuals $N_I$

    • Study the structure, i.e., the shape, of the cloud of individuals
    • Individuals are in $\mathbb{R}^K$

  • Centering – standardizing data

    • Centering data does not modify the shape of the cloud ⇒ centering is always done

    Figure: the same cloud of individuals (size against weight) drawn three times with different units and axis scales: weight (kg) against size (m); weight (kg) against size (m) on a stretched size axis; weight (quintal) against size (cm).

    • Standardizing data is necessary if units are different between variables:

    $x_{ik} \mapsto \frac{x_{ik} - \bar{x}_k}{s_k}$
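
To see why standardizing removes the dependence on units, here is a hedged Python sketch. The weights are invented, `standardize` is an illustrative name, and the 1/I divisor of the earlier standard-deviation formula is assumed.

```python
import math

def standardize(values):
    """Center a variable and divide by its standard deviation (1/I divisor)."""
    n = len(values)
    m = sum(values) / n
    s = math.sqrt(sum((x - m) ** 2 for x in values) / n)
    return [(x - m) / s for x in values]

# Hypothetical weights of five people, in kilograms and in quintals (1 quintal = 100 kg).
weights_kg = [55.0, 62.0, 70.0, 78.0, 85.0]
weights_quintal = [w / 100.0 for w in weights_kg]

z_kg = standardize(weights_kg)
z_quintal = standardize(weights_quintal)

# Up to floating-point error, both lists are identical: the unit no longer matters.
print(z_kg)
print(z_quintal)
```

After this step every variable has mean 0 and standard deviation 1, so each one weighs equally in the distances between individuals.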

  • Centering – standardizing data

    Wine              O.fruity  O.passion  O.citrus  …  Sweetness  Acidity  Bitterness  Astringency  Aroma.intensity  Aroma.persistency  Visual.intensity
    S Michaud            -0,17       0,45      1,50  …      -0,30     0,11        0,20        -1,79             0,95               1,07              0,06
    S Renaudie            0,02       1,03      1,16  …      -0,46     1,39       -0,31        -0,65             0,99               0,82             -1,08
    S Trotignon           0,79       1,73      1,16  …      -0,67     0,48        0,20        -0,60            -0,44               0,07             -1,34
    S Buisse Domaine     -0,17       0,45     -0,07  …      -0,02    -0,25       -2,01         0,19            -2,24              -1,66             -0,55
    S Buisse Cristal      1,30       1,03     -0,12  …      -0,39     1,20        1,39         0,34            -0,44              -1,66             -0,90
    V Aub Silex          -0,60      -0,97     -0,27  …       2,93    -2,07       -1,33        -0,60            -0,84              -0,92             -0,64
    V Aub Marigny        -2,44      -0,97     -1,94  …      -0,30     0,84        1,39         1,45            -0,18               0,98              0,76
    V Font Domaine        0,79      -1,11     -0,85  …      -0,67    -0,12        0,03        -0,44             0,29               0,41              1,03
    V Font Brûlés         0,79      -0,84      0,13  …      -0,02    -0,61        0,03         0,34             0,75               0,07              1,73
    V Font Coteaux       -0,29      -0,82     -0,69  …      -0,11    -0,98        0,37         1,76             1,15               0,82              0,94

    PCA ≡ studying the standardized data set
    Difficult to visualize the cloud $N_I$ ⇒ try to get an approximate view of it

  • Fitting the cloud of individuals

    PCA searches for the best summary space for optimal visualization of $N_I$
    ⇐⇒ Find the subspace that best sums up the data

    Viewpoint quality:
    • faithfully reproduce the cloud's shape (animation: 1st proposition, 2nd proposition, 3rd proposition)
    • best representation of diversity, variability
    • doesn't distort distances between individuals

    How to quantify the quality of a viewpoint?

    Notion of dispersion, of variability, also called inertia
    inertia ≡ variance generalized to several dimensions
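
As a worked formula for "variance generalized to several dimensions", the total inertia of the centered cloud can be written as follows. This is a sketch that assumes every individual carries the weight 1/I, as in the mean and standard deviation defined earlier; O denotes the origin, which is the center of gravity once the data are centered.

```latex
\[
\operatorname{Inertia}(N_I)
  = \frac{1}{I}\sum_{i=1}^{I} d^2(i, O)
  = \frac{1}{I}\sum_{i=1}^{I}\sum_{k=1}^{K} x_{ik}^2
  = \sum_{k=1}^{K} s_k^2
\]
% With standardized variables, s_k = 1 for every k, so the total inertia equals K.
```

Under that assumption, a PCA on K standardized variables always has total inertia K, and each axis accounts for a share of it.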

  • Fit the individuals' cloud

    Figure: Camel or dromedary? (illustration by J.P. Fénelon)

  • Fit the individuals' cloud

    How to find the best view to approximate the cloud?

    1 Find the axis that distorts the cloud the least

      Figure: an individual i and its projection $H_i$ on the axis $u_1$ through the origin O; the distance $iH_i$ should be minimal and the distance $OH_i$ maximal.

      $(iH_i)^2$ small with $H_i \in$ axis ⇔ $(OH_i)^2$ large (Pythagoras)
      ⇒ we want $\sum_i (OH_i)^2$ large

    2 Find the best plane: maximize $\sum_i (OH_i)^2$ with $H_i \in$ plane
      The best plane contains the best axis: we search for $u_2 \perp u_1$ maximizing $\sum_i (OH_i)^2$

    3 We can look for a third axis (etc.) with maximum inertia
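
Step 1 above can be sketched numerically. The slides do not say how the axis is computed; the Python sketch below swaps in power iteration on the matrix X'X, one standard way to obtain the direction that maximizes the sum of squared projections. The data and the names `first_axis` and `projected_inertia` are hypothetical.

```python
import math

def first_axis(rows, iterations=200):
    """Unit vector u1 maximizing sum_i (OH_i)^2 for centered rows, by power iteration on X'X.

    Limitation: the all-ones starting vector must not be orthogonal to the solution.
    """
    k = len(rows[0])
    # Cross-product matrix X'X (K x K).
    s = [[sum(r[a] * r[b] for r in rows) for b in range(k)] for a in range(k)]
    u = [1.0] * k
    for _ in range(iterations):
        v = [sum(s[a][b] * u[b] for b in range(k)) for a in range(k)]
        norm = math.sqrt(sum(x * x for x in v))
        u = [x / norm for x in v]
    return u

def projected_inertia(rows, u):
    """Sum over individuals of the squared coordinate OH_i along the unit vector u."""
    return sum(sum(x * c for x, c in zip(r, u)) ** 2 for r in rows)

# Hypothetical centered cloud of I = 5 individuals in K = 2 dimensions.
rows = [(-2.0, -1.0), (-1.0, -1.0), (0.0, 0.0), (1.0, 1.0), (2.0, 1.0)]
u1 = first_axis(rows)
print(u1, projected_inertia(rows, u1))
```

The axis found this way captures more projected inertia than either original coordinate axis; the second axis would be sought in the same way among directions orthogonal to `u1`.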

  • Illustration of the construction of PCA components

    (Illustration: J. Li)

  • Example: wine data

    • Sensory descriptors are used as active variables: only these variables are used to construct the axes
    • Variables are (centered and) standardized

    Table: the wine data table from the "Wine data" slide is shown again here.

  • Example: graphing the individuals

    Figure: map of the ten wines on the first two dimensions, Dim 1 (43.48%) and Dim 2 (25.14%).

    How to interpret the dimensions? Why are S Trotignon and V Font Brûlés far apart?
    ⇒ Need variables to interpret the directions of variability

  • Individuals' coordinates considered as variables

    Figure: the individuals' map on Dim 1 (43.48%) and Dim 2 (25.14%), next to the data table X (values $x_{ik}$, rows 1 to I, columns 1 to K) extended with two new columns F.1 and F.2 that hold the coordinates $F_{i1}$ and $F_{i2}$ of each individual; for the highlighted individual, $F_{i1} \approx 1.1$ and $F_{i2} = -6.0$.

  • Representation of the variables as an interpretation aid for the individuals' cloud

    • Correlations between the variable x.k and F.1 (and F.2)

    Figure: a variable x.k (here O.Vanilla) placed inside the unit circle at coordinates (r(F.1, x.k), r(F.2, x.k)); both axes run from −1 to 1.

    ⇒ Correlation circle

  • Representation of the variables as an interpretation aid for the individuals' cloud

    Figure: correlation circle of the sensory descriptors on Dim 1 (43.48%) and Dim 2 (25.14%). Variables shown: O.fruity, O.passion, O.citrus, O.candied.fruit, O.vanilla, O.wooded, O.mushroom, O.plante, O.flower, O.alcohol, O.Intensity.before.shaking, O.Intensity.after.shaking, Expression, Typicity, Grade, Surface.feeling, Freshness, Oxidation, Smoothness, Attack.intensity, Sweetness, Acidity, Bitterness, Astringency, Aroma.intensity, Aroma.persistency, Visual.intensity.

    How to interpret the first dimension?
    How to interpret the second dimension?

    Main directions of variability:
    1 - fruity and flowery odors versus woody and vegetal odors;
    2 - bitterness and acidity versus sweetness

  • Studying the variables

  • Fitting the variables' cloud $N_K$

    1 variable = 1 point in an I-dimensional space

    $\cos(\theta_{kl}) = \frac{\langle x_{.k}, x_{.l} \rangle}{\|x_{.k}\| \, \|x_{.l}\|} = \frac{\sum_{i=1}^{I} x_{ik} x_{il}}{\sqrt{\sum_{i=1}^{I} x_{ik}^2} \, \sqrt{\sum_{i=1}^{I} x_{il}^2}}$

    Since variables are centered: $\cos(\theta_{kl}) = r(x_{.k}, x_{.l})$

    If variables are standardized ⇒ points are on a sphere of radius 1 in $\mathbb{R}^I$
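
The identity between the cosine of the angle and the correlation coefficient is easy to check numerically. A minimal Python sketch with invented data; the helper names are illustrative, and the "radius 1" check assumes the norm is taken with weights 1/I, matching the 1/I divisor of the standard deviation.

```python
import math

def center(v):
    """Subtract the mean from every value."""
    m = sum(v) / len(v)
    return [x - m for x in v]

def cosine(a, b):
    """Cosine of the angle between two vectors of R^I."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def pearson(a, b):
    """Correlation coefficient computed from covariance and standard deviations (1/I divisor)."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b)) / n
    sa = math.sqrt(sum((x - ma) ** 2 for x in a) / n)
    sb = math.sqrt(sum((y - mb) ** 2 for y in b) / n)
    return cov / (sa * sb)

def weighted_norm(v):
    """Norm of a variable when each individual has weight 1/I (an assumption of this sketch)."""
    return math.sqrt(sum(x * x for x in v) / len(v))

# Two hypothetical variables measured on I = 5 individuals.
x_k = [4.3, 4.4, 5.1, 4.3, 5.6]
x_l = [2.4, 3.1, 4.0, 2.4, 3.1]

cos_kl = cosine(center(x_k), center(x_l))
r_kl = pearson(x_k, x_l)

# Standardized version of x_k: centered, then divided by its standard deviation.
c_k = center(x_k)
z_k = [x / weighted_norm(c_k) for x in c_k]
print(cos_kl, r_kl, weighted_norm(z_k))
```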

  • Fitting the variables' cloud $N_K$

    Same strategy as for individuals: sequentially find orthogonal axes:

    $\arg\max_{v_1 \in \mathbb{R}^I} \sum_{k=1}^{K} r(v_1, x_{.k})^2$

    ⇒ $v_1$ is the best synthetic variable for summarizing the variables

    Find the 2nd axis, then the 3rd, etc.
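
One way to see what this criterion computes is a short derivation, given here as a sketch under assumptions the slide does not spell out: the $x_{.k}$ are standardized with the 1/I divisor, $v$ is taken centered (which does not change any correlation), and $X$ is the I × K standardized table.

```latex
\[
r(v, x_{.k}) = \frac{\langle v, x_{.k} \rangle}{\sqrt{I}\,\|v\|}
\quad\Longrightarrow\quad
\sum_{k=1}^{K} r(v, x_{.k})^2 = \frac{v^\top X X^\top v}{I \, v^\top v}
\]
% The ratio is maximized by the leading eigenvector of (1/I) X X^T, and its maximum is the
% leading eigenvalue \lambda_1. The nonzero eigenvalues of (1/I) X X^T and (1/I) X^T X coincide,
% and v_1 is proportional to X u_1, the vector of individuals' coordinates on the first axis.
```

Under these assumptions the best synthetic variable is, up to scale, the first column of individuals' coordinates, so the variables study and the individuals study share the same eigenvalues.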

  • Fitting the variables' cloud $N_K$

    Figure: representation of the variables on Dim 1 (43.48%) and Dim 2 (25.14%).

    ⇒ Same graph as before!

    • interpretation aid for the individuals' graph
    • optimal representation of the variables' cloud
    • visualization of the correlation matrix

    INCREDIBLE!!! OUTSTANDING!!!

  • Linking the two representations: transition formulas

    Scores: $F_{\bullet s}$
    Loadings: $G_{\bullet s} / \sqrt{\lambda_s}$

    $F_{is} = \frac{1}{\sqrt{\lambda_s}} \sum_{k=1}^{K} x_{ik} G_{ks}$        $G_{ks} = \frac{1}{\sqrt{\lambda_s}} \sum_{i=1}^{I} x_{ik} F_{is}$

    ⇒ Individuals are on the same side as the variables for which they take high values

    Figure: individuals' map and correlation circle, side by side, on Dim 1 (43.48%) and Dim 2 (25.14%).

    Standardized values for V Aub Silex:

    O.intensity.after.shaking   -2,54
    O.intensity.before.shaking  -2,37
    Expression                  -2,25
    Acidity                     -2,07
    Attack.intensity            -1,36
    Bitterness                  -1,33
    Freshness                   -1,15
    …                               …
    Typicity                     1,01
    Sweetness                    2,93
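
The transition formulas can be checked on a toy case. The Python sketch below uses K = 2 standardized variables, for which the eigen-decomposition of the correlation matrix is known in closed form. Everything in it is an assumption made for illustration: the data, the variable names, and in particular the explicit 1/I weight in the second formula, which appears when the standard deviation uses the 1/I divisor.

```python
import math

def standardize(col):
    """Center and reduce a variable with the 1/I divisor."""
    n = len(col)
    m = sum(col) / n
    s = math.sqrt(sum((x - m) ** 2 for x in col) / n)
    return [(x - m) / s for x in col]

# Hypothetical data: I = 5 individuals, K = 2 positively correlated variables.
x1 = standardize([1.0, 2.0, 3.0, 4.0, 5.0])
x2 = standardize([2.0, 1.0, 4.0, 3.0, 5.0])
n = len(x1)

r = sum(a * b for a, b in zip(x1, x2)) / n   # correlation between the two variables
lam = 1.0 + r                                # largest eigenvalue of [[1, r], [r, 1]] when r > 0
u = (1 / math.sqrt(2), 1 / math.sqrt(2))     # matching unit eigenvector

# Scores F_i1: coordinates of the individuals on the first axis.
F = [u[0] * a + u[1] * b for a, b in zip(x1, x2)]

# Second transition formula, with the 1/I weight made explicit (assumption):
# G_k1 = (1 / sqrt(lambda_1)) * (1/I) * sum_i x_ik F_i1, the correlation between x_k and F.
G = [sum(x * f for x, f in zip(col, F)) / (n * math.sqrt(lam)) for col in (x1, x2)]

# First transition formula: F_i1 = (1 / sqrt(lambda_1)) * sum_k x_ik G_k1.
F_from_G = [(a * G[0] + b * G[1]) / math.sqrt(lam) for a, b in zip(x1, x2)]
print(F)
print(F_from_G)
```

An individual with high standardized values on variables that have large positive coordinates gets a large positive score, which is the reading rule stated on the slide.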

  • Projections…

    $r(A, B) = \cos(\theta_{A,B})$
    $\cos(\theta_{A,B}) \approx \cos(\theta_{H_A,H_B})$ if the variables are well-projected

    Figure: variables A, B, C, D, E and their projections $H_A$, $H_B$, $H_C$, $H_D$, $H_E$.

    Only well-projected variables can be interpreted!

  • Interpretation aids

  • Choosing the number of dimensions

    Bar chart of eigenvalues, tests, confidence intervals, cross-validation (estim_ncp function), etc.

    Figure: bar chart of the percentage of inertia explained by each dimension (dimensions 1 to 9; vertical axis from 0 to 40).

    Two goals:
    ⇒ Interpretation
    ⇒ Separate structure from noise

    Figure: PCA turns the data (variables x.1, …, x.k, …, x.K) into components $F_1$, …, $F_Q$, …, $F_K$, split between structure and noise.
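
The bar chart plots each eigenvalue as a share of the total inertia. A minimal Python sketch of that computation; the eigenvalues are invented and the function names are illustrative.

```python
def percentage_of_inertia(eigenvalues):
    """Share of the total inertia carried by each dimension, in percent."""
    total = sum(eigenvalues)
    return [100.0 * ev / total for ev in eigenvalues]

def cumulative(percentages):
    """Running total of the percentages, dimension after dimension."""
    out, running = [], 0.0
    for p in percentages:
        running += p
        out.append(running)
    return out

# Hypothetical eigenvalues of a PCA on 8 standardized variables (they sum to 8).
eigenvalues = [4.0, 2.0, 1.0, 1.0]
pct = percentage_of_inertia(eigenvalues)
cum = cumulative(pct)
print(pct)
print(cum)
```

A sharp drop in the percentages, or a cumulative share judged sufficient, is a common informal guide for how many dimensions to keep; the tests and cross-validation mentioned on the slide are more formal options.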

  • Percentage of variance obtained under independence

    ⇒ Is there structure in my data?

                                     Number of variables
    nbind     4     5     6     7     8     9    10    11    12    13    14    15    16
        5  96.5  93.1  90.2  87.6  85.5  83.4  81.9  80.7  79.4  78.1  77.4  76.6  75.5
        6  93.3  88.6  84.8  81.5  79.1  76.9  75.1  73.2  72.2  70.8  69.8  68.7  68.0
        7  90.5  84.9  80.9  77.4  74.4  72.0  70.1  68.3  67.0  65.3  64.3  63.2  62.2
        8  88.1  82.3  77.2  73.8  70.7  68.2  66.1  64.0  62.8  61.2  60.0  59.0  58.0
        9  86.1  79.5  74.8  70.7  67.4  65.1  62.9  61.1  59.4  57.9  56.5  55.4  54.3
       10  84.5  77.5  72.3  68.2  65.0  62.4  60.1  58.3  56.5  55.1  53.7  52.5  51.5
       11  82.8  75.7  70.3  66.3  62.9  60.1  58.0  56.0  54.4  52.7  51.3  50.1  49.2
       12  81.5  74.0  68.6  64.4  61.2  58.3  55.8  54.0  52.4  50.9  49.3  48.2  47.2
       13  80.0  72.5  67.2  62.9  59.4  56.7  54.4  52.2  50.5  48.9  47.7  46.6  45.4
       14  79.0  71.5  65.7  61.5  58.1  55.1  52.8  50.8  49.0  47.5  46.2  45.0  44.0
       15  78.1  70.3  64.6  60.3  57.0  53.9  51.5  49.4  47.8  46.1  44.9  43.6  42.5
       16  77.3  69.4  63.5  59.2  55.6  52.9  50.3  48.3  46.6  45.2  43.6  42.4  41.4
       17  76.5  68.4  62.6  58.2  54.7  51.8  49.3  47.1  45.5  44.0  42.6  41.4  40.3
       18  75.5  67.6  61.8  57.1  53.7  50.8  48.4  46.3  44.6  43.0  41.6  40.4  39.3
       19  75.1  67.0  60.9  56.5  52.8  49.9  47.4  45.5  43.7  42.1  40.7  39.6  38.4
       20  74.1  66.1  60.1  55.6  52.1  49.1  46.6  44.7  42.9  41.3  39.8  38.7  37.5
       25  72.0  63.3  57.1  52.5  48.9  46.0  43.4  41.4  39.6  38.1  36.7  35.5  34.5
       30  69.8  61.1  55.1  50.3  46.7  43.6  41.1  39.1  37.3  35.7  34.4  33.2  32.1
       35  68.5  59.6  53.3  48.6  44.9  41.9  39.5  37.4  35.6  34.0  32.7  31.6  30.4
       40  67.5  58.3  52.0  47.3  43.4  40.5  38.0  36.0  34.1  32.7  31.3  30.1  29.1
       45  66.4  57.1  50.8  46.1  42.4  39.3  36.9  34.8  33.1  31.5  30.2  29.0  27.9
       50  65.6  56.3  49.9  45.2  41.4  38.4  35.9  33.9  32.1  30.5  29.2  28.1  27.0
      100  60.9  51.4  44.9  40.0  36.3  33.3  31.0  28.9  27.2  25.8  24.5  23.3  22.3

    Table: 95% quantile of the percentage of inertia on the first two axes, over 10 000 PCAs on data with independent variables (rows: number of individuals, nbind; columns: number of variables)

  • Percentage of variance obtained under independence

                                     Number of variables
    nbind    17    18    19    20    25    30    35    40    50    75   100   150   200
        5  74.9  74.2  73.5  72.8  70.7  68.8  67.4  66.4  64.7  62.0  60.5  58.5  57.4
        6  67.0  66.3  65.6  64.9  62.3  60.4  58.9  57.6  55.8  52.9  51.0  49.0  47.8
        7  61.3  60.7  59.7  59.1  56.4  54.3  52.6  51.4  49.5  46.4  44.6  42.4  41.2
        8  57.0  56.2  55.4  54.5  51.8  49.7  47.8  46.7  44.6  41.6  39.8  37.6  36.4
        9  53.6  52.5  51.8  51.2  48.1  45.9  44.4  42.9  41.0  38.0  36.1  34.0  32.7
       10  50.6  49.8  49.0  48.3  45.2  42.9  41.4  40.1  38.0  35.0  33.2  31.0  29.8
       11  48.1  47.2  46.5  45.8  42.8  40.6  39.0  37.7  35.6  32.6  30.8  28.7  27.5
       12  46.2  45.2  44.4  43.8  40.7  38.5  36.9  35.5  33.5  30.5  28.8  26.7  25.5
       13  44.4  43.4  42.8  41.9  39.0  36.8  35.1  33.9  31.8  28.8  27.1  25.0  23.9
       14  42.9  42.0  41.3  40.4  37.4  35.2  33.6  32.3  30.4  27.4  25.7  23.6  22.4
       15  41.6  40.7  39.8  39.1  36.2  34.0  32.4  31.1  29.0  26.0  24.3  22.4  21.2
       16  40.4  39.5  38.7  37.9  35.0  32.8  31.1  29.8  27.9  24.9  23.2  21.2  20.1
       17  39.4  38.5  37.6  36.9  33.8  31.7  30.1  28.8  26.8  23.9  22.2  20.3  19.2
       18  38.3  37.4  36.7  35.8  32.9  30.7  29.1  27.8  25.9  22.9  21.3  19.4  18.3
       19  37.4  36.5  35.8  34.9  32.0  29.9  28.3  27.0  25.1  22.2  20.5  18.6  17.5
       20  36.7  35.8  34.9  34.2  31.3  29.1  27.5  26.2  24.3  21.4  19.8  18.0  16.9
       25  33.5  32.5  31.8  31.1  28.1  26.0  24.5  23.3  21.4  18.6  17.0  15.2  14.2
       30  31.2  30.3  29.5  28.8  26.0  23.9  22.3  21.1  19.3  16.6  15.1  13.4  12.5
       35  29.5  28.6  27.9  27.1  24.3  22.2  20.7  19.6  17.8  15.2  13.7  12.1  11.1
       40  28.1  27.3  26.5  25.8  23.0  21.0  19.5  18.4  16.6  14.1  12.7  11.1  10.2
       45  27.0  26.1  25.4  24.7  21.9  20.0  18.5  17.4  15.7  13.2  11.8  10.3   9.4
       50  26.1  25.3  24.6  23.8  21.1  19.1  17.7  16.6  14.9  12.5  11.1   9.6   8.7
      100  21.5  20.7  19.9  19.3  16.7  14.9  13.6  12.5  11.0   8.9   7.7   6.4   5.7

    Table: 95% quantile of the percentage of inertia on the first two axes, over 10 000 PCAs on data with independent variables (rows: number of individuals, nbind; columns: number of variables)
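
As a worked reading of the table, consider the wine example, assuming its PCA uses the 10 wines and the 27 sensory descriptors as active variables. The table has no column for 27 variables, so the two neighbouring columns (25 and 30) of the row nbind = 10 are used as brackets.

```latex
\[
\underbrace{43.48\,\% + 25.14\,\%}_{\text{Dim 1} + \text{Dim 2}} = 68.62\,\%
\;>\; 45.2\,\% \ (K = 25)
\;>\; 42.9\,\% \ (K = 30)
\]
```

The first plane of the wine PCA carries far more inertia than the 95% quantile obtained with independent variables, which suggests that the data do contain structure.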

  • Data - Examples Studying individuals Studying variables Interpretation aids

    Supplementary information• For the quantitative variables: project supplementary variablesonto the axes

    • For categorical variables: project the barycenter of individualsin each category

    Figure: individuals map (wines, with the Sauvignon and Vouvray category points) and variables map (correlation circle); axes Dim 1 (43.48%) and Dim 2 (25.14%)

    ⇒ Supplementary information is not used to build the axes
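The two projection rules for supplementary information can be sketched in base R. This is an illustration on synthetic data, not the wine data set; it uses `prcomp` rather than FactoMineR, and the names `X_active`, `sup_quanti` and `sup_quali` are hypothetical.

```r
# Synthetic data (assumption): 10 individuals, 4 active variables,
# one supplementary quantitative variable, one supplementary factor.
set.seed(1)
X_active <- matrix(rnorm(40), nrow = 10,
                   dimnames = list(NULL, paste0("x", 1:4)))
sup_quanti <- X_active[, 1] + rnorm(10, sd = 0.5)
sup_quali  <- factor(rep(c("A", "B"), each = 5))

# The axes are built from the active variables only.
pca    <- prcomp(X_active, center = TRUE, scale. = TRUE)
scores <- pca$x[, 1:2]

# Supplementary quantitative variable: its coordinates are its
# correlations with the principal components.
sup_var_coord <- cor(sup_quanti, scores)

# Supplementary categories: barycenter of the individuals in each category.
sup_cat_coord <- apply(scores, 2, function(f) tapply(f, sup_quali, mean))

sup_var_coord
sup_cat_coord
```

Neither `sup_quanti` nor `sup_quali` enters the call to `prcomp`, so they have no influence on the axes.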

    Quality of the representation: cos²

    • cos² for the individuals (squared cosine of the angle θ between individual i and its projection H_i): distances between individuals can only be interpreted for well-projected individuals

    > round(res.pca$ind$cos2,2)
                Dim.1 Dim.2
    S Michaud    0.62  0.07
    S Renaudie   0.73  0.15
    S Trotignon  0.78  0.07

    • cos² for the variables (squared cosine of the angle θ between variable k and its projection H_k): only well-projected variables (high cos²) can be interpreted

    > round(res.pca$var$cos2,2)
              Dim.1 Dim.2
    O.fruity   0.08  0.02
    O.passion  0.80  0.03
    O.citrus   0.69  0.00
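As a rough illustration of what these cos² values measure, they can be recomputed from the coordinates in base R. The sketch uses synthetic data and `prcomp`, not the FactoMineR object `res.pca` shown above.

```r
# Synthetic data (assumption): 10 individuals, 5 variables.
set.seed(2)
X <- matrix(rnorm(50), nrow = 10,
            dimnames = list(paste0("ind", 1:10), paste0("x", 1:5)))
pca    <- prcomp(X, center = TRUE, scale. = TRUE)
scores <- pca$x   # coordinates of the individuals on all components

# Individuals: squared coordinate on an axis divided by the squared
# distance to the origin (sum of squared coordinates over all axes).
ind_cos2 <- scores^2 / rowSums(scores^2)

# Variables (standardized PCA): squared correlation with each component.
var_cos2 <- cor(X, scores)^2

round(ind_cos2[, 1:2], 2)
round(var_cos2[, 1:2], 2)
```

Summed over all components, the cos² of an individual or of a variable equals 1, so the value on a plane is the share of the element that the plane represents.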

    Contributions

    ⇒ Contributions to components:

    • for an individual: $\mathrm{Ctr}_s(i) = \dfrac{F_{is}^2}{\sum_{i=1}^{I} F_{is}^2} = \dfrac{F_{is}^2}{\lambda_s}$

    ⇒ Individuals with a large coordinate value contribute the most

    > round(res.pca$ind$contrib,2)
                Dim.1 Dim.2
    S Michaud   15.49  3.10
    S Renaudie  15.56  5.56
    S Trotignon 15.46  2.43

    • for a variable: $\mathrm{Ctr}_s(k) = \dfrac{r(x_{.k}, v_s)^2}{\sum_{k=1}^{K} r(x_{.k}, v_s)^2} = \dfrac{r(x_{.k}, v_s)^2}{\lambda_s}$

    ⇒ Variables highly correlated with the principal component contribute the most

    > round(res.pca$var$contrib,2)
              Dim.1 Dim.2
    O.fruity   0.67  0.34
    O.passion  6.84  0.40
    O.citrus   5.89  0.02
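The ratio form of both contributions can be sketched in base R as follows. This is an illustration on synthetic data with `prcomp`; contributions are expressed in percent, as in the output above, and only the "share of the column total" form of the formulas is used.

```r
# Synthetic data (assumption): 10 individuals, 5 variables.
set.seed(3)
X <- matrix(rnorm(50), nrow = 10,
            dimnames = list(paste0("ind", 1:10), paste0("x", 1:5)))
pca    <- prcomp(X, center = TRUE, scale. = TRUE)
scores <- pca$x

# Individuals: squared coordinate as a percentage of the sum of squared
# coordinates on the same component.
ind_contrib <- 100 * sweep(scores^2, 2, colSums(scores^2), "/")

# Variables: squared correlation with the component as a percentage of
# the sum of squared correlations on that component.
r2          <- cor(X, scores)^2
var_contrib <- 100 * sweep(r2, 2, colSums(r2), "/")

round(ind_contrib[, 1:2], 2)
round(var_contrib[, 1:2], 2)
```

For a standardized PCA the column sums of `r2` coincide with the eigenvalues (`pca$sdev^2`), which is the second form of the variable contribution.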

    Characterizing the axes

    Using the continuous variables:

    • the correlation between each variable and the principal component of rank s is calculated

    • the correlation coefficients are sorted and the significant ones are output

    > dimdesc(res.pca)

    $Dim.1$quanti
                      corr p.value
    O.candied.fruit   0.93 9.5e-05
    Grade             0.93 1.2e-04
    Surface.feeling   0.89 5.5e-04
    Typicity          0.86 1.4e-03
    O.mushroom        0.84 2.3e-03
    Visual.intensity  0.83 3.1e-03
    ...                ...     ...
    O.plante         -0.87 1.0e-03
    O.flower         -0.89 4.9e-04
    O.passion        -0.90 4.5e-04
    Freshness        -0.91 2.9e-04

    $Dim.2$quanti
                                 corr p.value
    O.intensity.before.shaking   0.97 3.1e-06
    O.intensity.after.shaking    0.95 3.6e-05
    Attack.intensity             0.85 1.7e-03
    Expression                   0.84 2.2e-03
    Aroma.persistency            0.75 1.3e-02
    Bitterness                   0.71 2.3e-02
    Aroma.intensity              0.66 4.0e-02
    Sweetness                   -0.78 8.0e-03
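The two steps (correlate, then sort and filter) can be imitated in base R. The sketch below is not the `dimdesc` implementation; it runs on synthetic data, and the 5% significance threshold is an assumption made for the illustration.

```r
# Synthetic data (assumption): 12 individuals, 5 variables.
set.seed(4)
X <- matrix(rnorm(60), nrow = 12,
            dimnames = list(NULL, paste0("x", 1:5)))
pca <- prcomp(X, center = TRUE, scale. = TRUE)
f1  <- pca$x[, 1]   # coordinates of the individuals on the first component

# Correlation of each variable with the component, with its p-value.
desc <- t(apply(X, 2, function(x) {
  ct <- cor.test(x, f1)
  c(corr = unname(ct$estimate), p.value = ct$p.value)
}))

# Keep the significant correlations (assumed threshold: 0.05) and sort
# them from the most positive to the most negative.
desc <- desc[desc[, "p.value"] < 0.05, , drop = FALSE]
desc <- desc[order(desc[, "corr"], decreasing = TRUE), , drop = FALSE]
desc
```

Variables at the top of the result are positively linked to the axis and those at the bottom negatively, which mirrors the layout of the output above.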

    Characterizing the axes

    Using the categorical variables:

    • Do a one-way analysis of variance with the coordinates of the individuals (F.s) described by the categorical variable

      • an F-test by variable
      • for each category, a Student's t-test to compare the average of the category with the general mean

    > dimdesc(res.pca)

    $Dim.1$quali
             R2  p.value
    Label 0.874 7.30e-05

    $Dim.1$category
              Estimate  p.value
    Vouvray      3.203 7.30e-05
    Sauvignon   -3.203 7.30e-05
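The analysis of variance described above can be sketched in base R with `lm`. This is an illustration under assumptions: the coordinates `f1` and the two categories `A` and `B` are synthetic, and sum-to-zero contrasts are chosen so that each estimate is the deviation of a category from the mean of the category means.

```r
# Synthetic coordinates on one axis (assumption) for two categories.
set.seed(5)
label <- factor(rep(c("A", "B"), each = 5))
f1    <- c(rnorm(5, mean = -3), rnorm(5, mean = 3))

# One-way ANOVA of the coordinates on the categorical variable,
# with sum-to-zero contrasts.
fit <- lm(f1 ~ label, contrasts = list(label = "contr.sum"))

# F-test for the variable as a whole.
r2      <- summary(fit)$r.squared
p_value <- anova(fit)[["Pr(>F)"]][1]

# Estimate for the first category; with two categories and sum-to-zero
# contrasts the second estimate is its opposite.
est_A     <- unname(coef(fit)[2])
estimates <- c(A = est_A, B = -est_A)

c(R2 = r2, p.value = p_value)
estimates
summary(fit)$coefficients   # t-test of the first category's estimate
```

With only two categories the two estimates are opposite and share one p-value, the pattern visible for Vouvray and Sauvignon in the output above.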

    PCA in practice

    1 Choose active variables
    2 Rescale (or not) the variables
    3 Perform PCA
    4 Choose the number of dimensions to interpret
    5 Joint analysis of the cloud of individuals and the cloud of variables
    6 Use indicators to enrich interpretation
    7 Go back to raw data for interpretation
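The seven steps can be walked through on a small example. The sketch below is one possible base R rendering, not the FactoMineR workflow of the slides: it uses the built-in `iris` data as a stand-in, treats `Species` as a supplementary categorical variable, and retains two dimensions purely for illustration.

```r
data(iris)

# 1. Choose active variables: the four measurements; Species is kept aside.
active <- iris[, 1:4]

# 2-3. Rescale the variables and perform the PCA.
pca <- prcomp(active, center = TRUE, scale. = TRUE)

# 4. Choose the number of dimensions from the percentages of inertia.
pct <- 100 * pca$sdev^2 / sum(pca$sdev^2)
round(cumsum(pct), 1)

# 5. Joint analysis: coordinates of the individuals and of the variables
#    (correlations with the components) on the retained plane.
scores    <- pca$x[, 1:2]
var_coord <- cor(active, scores)

# 6. Indicators: quality of representation of the individuals on the
#    plane, and barycenters of the supplementary categories.
cos2_plane <- rowSums(scores^2) / rowSums(pca$x^2)
bary <- apply(scores, 2, function(f) tapply(f, iris$Species, mean))

# 7. Go back to the raw data, here through the means per category.
aggregate(active, by = list(Species = iris$Species), FUN = mean)
```

Step 7 matters because a position on the map is only a summary; the raw means show which measurements actually differ between groups.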

    More

    Husson F., Lê S. & Pagès J. (2017). Exploratory Multivariate Analysis by Example Using R, 2nd edition, 230 p., CRC Press (Chapman & Hall/CRC Computer Science & Data Analysis Series). ISBN: 978-1-138-19634-6.

    The FactoMineR package for doing PCA: http://factominer.free.fr/

    Videos on Youtube:
    • Youtube channel: youtube.com/HussonFrancois
    • a playlist with movies in English
    • a playlist with movies in French

    Playlist links:
    https://www.youtube.com/playlist?list=PLnZgp6epRBbTsZEFXi_p6W48HhNyqwxIu
    https://www.youtube.com/playlist?list=PLnZgp6epRBbQu2QtCyqYL80In1P-A_Iud
