Upload
others
View
23
Download
0
Embed Size (px)
Citation preview
Data - Examples Studying individuals Studying variables Interpretation aids
Principal Component Analysis (PCA)
François Husson
Applied Mathematics Department - Rennes Agrocampus
1 / 36
mailto:[email protected]
Data - Examples Studying individuals Studying variables Interpretation aids
Principal Component Analysis (PCA)
1 Data - examples - notation2 Studying individuals3 Studying variables4 Interpreting the data
1 / 36
Data - Examples Studying individuals Studying variables Interpretation aids
Which kinds of data?PCA applies to data tables where rows are considered asindividuals and columns as quantitative variables
Figure: Data table in PCA
For variable k, we note:
the mean: x̄k =1I
I∑i=1
xik
the standard-deviation:
sk =
√√√√1I
I∑i=1
(xik − x̄k)2
2 / 36
Data - Examples Studying individuals Studying variables Interpretation aids
Examples
• Sensory analysis: score for attribute k of product i• Ecology: concentration of pollutant k in river i• Economics: indicator value k for year i• Genetics: expression of gene k for patient i• Biology: measure k for animal i• Marketing: value of measure k for brand i• Sociology: time spent on activity k by individuals from socialclass i
• etc.
⇒ There exist many data tables like these
3 / 36
Data - Examples Studying individuals Studying variables Interpretation aids
Wine data
• 10 individuals (rows): white wines from the Loire region• 30 variables (columns):
• 27 continuous variables: sensory descriptors• 2 continuous variables: odor and overall preference• 1 categorical variable: wine label (Vouvray or Sauvignon)
O.fr
uity
O.p
assi
on
O.c
itrus
…
Sw
eetn
ess
Aci
dity
Bitt
erne
ss
Ast
ringe
ncy
Aro
ma.
inte
nsity
Aro
ma.
pers
iste
ncy
Vis
ual.i
nten
sity
Odo
r.pr
efer
ence
Ove
rall.
pref
eren
ce
Labe
l
S Michaud 4,3 2,4 5,7 … 3,5 5,9 4,1 1,4 7,1 6,7 5,0 6,0 5,0 SauvignonS Renaudie 4,4 3,1 5,3 … 3,3 6,8 3,8 2,3 7,2 6,6 3,4 5,4 5,5 SauvignonS Renaudie 4,4 3,1 5,3 … 3,3 6,8 3,8 2,3 7,2 6,6 3,4 5,4 5,5 SauvignonS Trotignon 5,1 4,0 5,3 … 3,0 6,1 4,1 2,4 6,1 6,1 3,0 5,0 5,5 SauvignonS Buisse Domaine 4,3 2,4 3,6 … 3,9 5,6 2,5 3,0 4,9 5,1 4,1 5,3 4,6 SauvignonS Buisse Cristal 5,6 3,1 3,5 … 3,4 6,6 5,0 3,1 6,1 5,1 3,6 6,1 5,0 SauvignonV Aub Silex 3,9 0,7 3,3 … 7,9 4,4 3,0 2,4 5,9 5,6 4,0 5,0 5,5 VouvrayV Aub Marigny 2,1 0,7 1,0 … 3,5 6,4 5,0 4,0 6,3 6,7 6,0 5,1 4,1 VouvrayV Font Domaine 5,1 0,5 2,5 … 3,0 5,7 4,0 2,5 6,7 6,3 6,4 4,4 5,1 VouvrayV Font Brûlés 5,1 0,8 3,8 … 3,9 5,4 4,0 3,1 7,0 6,1 7,4 4,4 6,4 VouvrayV Font Coteaux 4,1 0,9 2,7 … 3,8 5,1 4,3 4,3 7,3 6,6 6,3 6,0 5,7 Vouvray
4 / 36
Data - Examples Studying individuals Studying variables Interpretation aids
Issues – goals
The data table can be seen as a set of rows or a set of columns
Studying individuals• When can we say that 2 individuals are similar (or dissimilar)with respect to all the variables?
• If there are many individuals, is it possible to categorize them?
⇒ groups of individuals, partitions between them
5 / 36
Data - Examples Studying individuals Studying variables Interpretation aids
Issues – goals
Studying variables• For individuals, we interpret similarity in terms of thevariables’ values
• Between variables, we talk instead of “relationships”• Linear relationships are commonplace, and a firstapproximation of many links ⇒ correlation coefficient
⇒ visualization of the correlation matrix⇒ find a small number of synthetic variables to summarize manyvariables (e.g. of a prior synthetic variable: the mean. But here wesearch for posterior synthetic variables from the data)
6 / 36
Data - Examples Studying individuals Studying variables Interpretation aids
Issues – goals
Links between the two points-of-view
• Characterize groups of individuals using the variables⇒ need an automatic procedure
• Use specific individuals to better understand links betweenvariables⇒ use of extreme individuals (return to individuals tounderstand more simply)
PCA issues:• Descriptive method to explore data: visualization of data withsimple plots
• Data compression - summarize a big data table of individuals× quantitative variables
7 / 36
Data - Examples Studying individuals Studying variables Interpretation aids
Two point clouds
X X
ind ivar k
RI
RK
ind 1var 1
Individuals study Variables study
1
i
I
1 k K1
i
I
1 k K
8 / 36
Data - Examples Studying individuals Studying variables Interpretation aids
Studying individuals
9 / 36
Data - Examples Studying individuals Studying variables Interpretation aids
The cloud of individuals NI1 individual = 1 row of the data table ⇒ 1 point in Rk
• If K = 1: axial representation• If K = 2: scatter plot• If K = 3: 3D graphical representation (more difficult)• If K = 4: impossible to “see” BUT the concept is easy
Notion of similarity: (squared) distance between individuals i andi ′:
d2(i , i ′) =K∑
k=1(xik − xi ′k)2 (thanks Mr Pythagoras)
Studying the individuals ≡ Studying the shape of the cloud NI
9 / 36
Data - Examples Studying individuals Studying variables Interpretation aids
The cloud of individuals NI
• Study the structure, i.e., the shape, of the cloud of individuals• Individuals are in RK
10 / 36
Data - Examples Studying individuals Studying variables Interpretation aids
Centering – standardizing data• Centering data does not modify the shape of the cloud⇒ centering is always done
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+++
+
+
+
+
+
+
+
+
+
+
+
++
+
++
+
+
+
+
+
+
+
++
+
+
+
+
+
+
55 60 65 70 75 80 85
1.5
1.6
1.7
1.8
1.9
weight (kg)
Size
(m
)
+ ++ +++ + +++ +++ +++++ + +++ ++ ++ +++ + ++ + +++ ++ +++ ++ ++ + + +++
50 55 60 65 70 75 80 85
−10
−5
05
1015
weight (kg)
Size
(m
)
+
+
+
+
+
++
++
+
+
++
++
+
+
+
+++
+
+
+
+
+++
+
+
+
+
++
+
+
+
+
++
+
+
+
+
+
+
+++
+
−20 −10 0 10 20
150
160
170
180
190
weight (quintal)
Size
(cm
)
• Standardizing data is necessary if units are different betweenvariables
xik ↪→xik − x̄k
sk
11 / 36
Data - Examples Studying individuals Studying variables Interpretation aids
Centering – standardizing data• Centering data does not modify the shape of the cloud⇒ centering is always done
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+++
+
+
+
+
+
+
+
+
+
+
+
++
+
++
+
+
+
+
+
+
+
++
+
+
+
+
+
+
55 60 65 70 75 80 85
1.5
1.6
1.7
1.8
1.9
weight (kg)
Size
(m
)
+ ++ +++ + +++ +++ +++++ + +++ ++ ++ +++ + ++ + +++ ++ +++ ++ ++ + + +++
50 55 60 65 70 75 80 85
−10
−5
05
1015
weight (kg)
Size
(m
)+
+
+
+
+
++
++
+
+
++
++
+
+
+
+++
+
+
+
+
+++
+
+
+
+
++
+
+
+
+
++
+
+
+
+
+
+
+++
+
−20 −10 0 10 20
150
160
170
180
190
weight (quintal)
Size
(cm
)
• Standardizing data is necessary if units are different betweenvariables
xik ↪→xik − x̄k
sk
11 / 36
Data - Examples Studying individuals Studying variables Interpretation aids
Centering – standardizing data• Centering data does not modify the shape of the cloud⇒ centering is always done
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+++
+
+
+
+
+
+
+
+
+
+
+
++
+
++
+
+
+
+
+
+
+
++
+
+
+
+
+
+
55 60 65 70 75 80 85
1.5
1.6
1.7
1.8
1.9
weight (kg)
Size
(m
)
+ ++ +++ + +++ +++ +++++ + +++ ++ ++ +++ + ++ + +++ ++ +++ ++ ++ + + +++
50 55 60 65 70 75 80 85
−10
−5
05
1015
weight (kg)
Size
(m
)+
+
+
+
+
++
++
+
+
++
++
+
+
+
+++
+
+
+
+
+++
+
+
+
+
++
+
+
+
+
++
+
+
+
+
+
+
+++
+
−20 −10 0 10 20
150
160
170
180
190
weight (quintal)
Size
(cm
)
• Standardizing data is necessary if units are different betweenvariables
xik ↪→xik − x̄k
sk11 / 36
Data - Examples Studying individuals Studying variables Interpretation aids
Centering – standardizing data
O.fr
uity
O.p
assi
on
O.c
itrus
…
Sw
eetn
ess
Aci
dity
Bitt
erne
ss
Ast
ringe
ncy
Aro
ma.
inte
nsity
Aro
ma.
pers
iste
ncy
Vis
ual.i
nten
sity
O
.frui
ty
O.p
assi
on
O.c
itrus
Sw
eetn
ess
Aci
dity
Bitt
erne
ss
Ast
ringe
ncy
Aro
ma.
inte
nsity
Aro
ma.
pers
iste
ncy
Vis
ual.i
nten
sity
S Michaud -0,17 0,45 1,50 … -0,30 0,11 0,20 -1,79 0,95 1,07 0,06S Renaudie 0,02 1,03 1,16 … -0,46 1,39 -0,31 -0,65 0,99 0,82 -1,08S Trotignon 0,79 1,73 1,16 … -0,67 0,48 0,20 -0,60 -0,44 0,07 -1,34S Buisse Domaine -0,17 0,45 -0,07 … -0,02 -0,25 -2,01 0,19 -2,24 -1,66 -0,55S Buisse Cristal 1,30 1,03 -0,12 … -0,39 1,20 1,39 0,34 -0,44 -1,66 -0,90V Aub Silex -0,60 -0,97 -0,27 … 2,93 -2,07 -1,33 -0,60 -0,84 -0,92 -0,64V Aub Marigny -2,44 -0,97 -1,94 … -0,30 0,84 1,39 1,45 -0,18 0,98 0,76V Font Domaine 0,79 -1,11 -0,85 … -0,67 -0,12 0,03 -0,44 0,29 0,41 1,03V Font Brûlés 0,79 -0,84 0,13 … -0,02 -0,61 0,03 0,34 0,75 0,07 1,73V Font Coteaux -0,29 -0,82 -0,69 … -0,11 -0,98 0,37 1,76 1,15 0,82 0,94
PCA ≡ Studying the standardized data setDifficult to visualize the cloud NI ⇒ try to get an approximateview of it
12 / 36
Data - Examples Studying individuals Studying variables Interpretation aids
Fitting the cloud of individualsPCA searches for the best summary space for optimal visualizationof NI⇐⇒ Find a subspace that sums up the data the bestViewpoint quality:
• faithfully reproduce the cloud’s shape (animation)
13 / 36
Data - Examples Studying individuals Studying variables Interpretation aids
Fitting the cloud of individuals
PCA searches for the best summary space for optimal visualizationof NI⇐⇒ Find a subspace that sums up the data the best
Viewpoint quality:• faithfully reproduce the cloud’s shape (animation)
1st proposition 2nd proposition 3rd proposition
13 / 36
Data - Examples Studying individuals Studying variables Interpretation aids
Fitting the cloud of individuals
PCA searches for the best summary space for optimal visualizationof NI⇐⇒ Find a subspace that sums up the data the best
Viewpoint quality:• faithfully reproduce the cloud’s shape (animation)• best representation of diversity, variability• doesn’t distort distances between individuals
How to quantify the quality of a viewpoint?
notion of dispersion, of variability, also called inertiainertia ≡ variance generalized to several dimensions
14 / 36
Data - Examples Studying individuals Studying variables Interpretation aids
Fit the individuals’ cloud
Figure: Camel or dromedary? (illustration by J.P. Fénelon)
15 / 36
Data - Examples Studying individuals Studying variables Interpretation aids
Fit the individuals’ cloud
Figure: Camel or dromedary? (illustration by J.P. Fénelon)
15 / 36
Data - Examples Studying individuals Studying variables Interpretation aids
Fit the individuals’ cloud
How to find the best view to approximate the cloud?
1 find an axis that distorts the cloud the least
u1
min
max Hi
i
O
(iHi )2 small with Hi ∈ axis ⇔(OHi )2 large (Pythagoras)⇒ we want
∑i (OHi )2 large
2 Find the best plane: maximize∑
i (OHi )2 with Hi ∈ planeThe best plane contains the best axis: we search for u2⊥u1and maximizing
∑i (OHi )2
3 we can look for a third axis (etc.) with maximum inertia
16 / 36
Data - Examples Studying individuals Studying variables Interpretation aids
Illustration of the construction of PCA components
(Illustration: J. Li)16 / 36
Data - Examples Studying individuals Studying variables Interpretation aids
Example: wine data
• Sensory descriptors are used as active variables: only thesevariables are used to construct the axes
• Variables are (centered and) standardized
O.fr
uity
O.p
assi
on
O.c
itrus
…
Sw
eetn
ess
Aci
dity
Bitt
erne
ss
Ast
ringe
ncy
Aro
ma.
inte
nsity
Aro
ma.
pers
iste
ncy
Vis
ual.i
nten
sity
Odo
r.pr
efer
ence
Ove
rall.
pref
eren
ce
Labe
l
S Michaud 4,3 2,4 5,7 … 3,5 5,9 4,1 1,4 7,1 6,7 5,0 6,0 5,0 SauvignonS Renaudie 4,4 3,1 5,3 … 3,3 6,8 3,8 2,3 7,2 6,6 3,4 5,4 5,5 SauvignonS Renaudie 4,4 3,1 5,3 … 3,3 6,8 3,8 2,3 7,2 6,6 3,4 5,4 5,5 SauvignonS Trotignon 5,1 4,0 5,3 … 3,0 6,1 4,1 2,4 6,1 6,1 3,0 5,0 5,5 SauvignonS Buisse Domaine 4,3 2,4 3,6 … 3,9 5,6 2,5 3,0 4,9 5,1 4,1 5,3 4,6 SauvignonS Buisse Cristal 5,6 3,1 3,5 … 3,4 6,6 5,0 3,1 6,1 5,1 3,6 6,1 5,0 SauvignonV Aub Silex 3,9 0,7 3,3 … 7,9 4,4 3,0 2,4 5,9 5,6 4,0 5,0 5,5 VouvrayV Aub Marigny 2,1 0,7 1,0 … 3,5 6,4 5,0 4,0 6,3 6,7 6,0 5,1 4,1 VouvrayV Font Domaine 5,1 0,5 2,5 … 3,0 5,7 4,0 2,5 6,7 6,3 6,4 4,4 5,1 VouvrayV Font Brûlés 5,1 0,8 3,8 … 3,9 5,4 4,0 3,1 7,0 6,1 7,4 4,4 6,4 VouvrayV Font Coteaux 4,1 0,9 2,7 … 3,8 5,1 4,3 4,3 7,3 6,6 6,3 6,0 5,7 Vouvray
17 / 36
Data - Examples Studying individuals Studying variables Interpretation aids
Example: graphing the individuals
20
2D
im 2
(25
.14%
)
S MichaudS Renaudie
S Trotignon
S Buisse Cristal
V Aub Marigny
V Font Domaine
V Font Brules
V Font Coteaux
-6 -4 -2 0 2 4
-6-4
-2
Dim 1 (43.48%)
Dim
2 (
25.1
4%)
S Buisse Domaine
V Aub Silex
How to interpret the dimensions? Why are S. Trotignon and V.Font Brules far apart? ⇒ Need variables to interpret the directionsof variability
18 / 36
Data - Examples Studying individuals Studying variables Interpretation aids
Individuals’ coordinates considered as variables
Fi1 = 1.102
4
Dim
2 (2
5.14
%)
S Michaud
S Renaudie
S Trotignon
S Buisse Cristal
V Aub Marigny
V Font Domaine
V Font Brules
V Font Coteaux
x
1
F F
1
K1 k
-6.01.1
F.1 F.2
-6 -4 -2 0 2 4 6
-6-4
-2
Dim 1 (43.48%)
Dim
2 (2
5.14
%)
S Buisse Domaine
V Aub Silex
xik
i
I
Fi1Fi2
Fi2 = -6.0
19 / 36
Data - Examples Studying individuals Studying variables Interpretation aids
Representation of the variables as an interpretation aid forthe individuals’ cloud
• Correlations between the variable x.k and F.1 (and F.2)
x.kO.Vanilla
10
-1
1
-1
r(F.1, x.k)
r(F.2, x.k)
⇒ Correlation circle20 / 36
Data - Examples Studying individuals Studying variables Interpretation aids
Representation of the variables as an interpretation aid forthe individuals’ cloud
0.0
0.5
1.0
Dim
2 (
25.1
4%)
O.passion
O.citrus
O.Intensity.before.shakingO.Intensity.after.shaking
Expression
O.vanillaO.wooded
O.mushroom
O.flower
O.alcohol
Grade
Surface.feeling
Freshness
Attack.intensity
AcidityBitterness
Astringency
Aroma.intensity
Aroma.persistency
Visual.intensity
-1.0 -0.5 0.0 0.5 1.0
-1.0
-0.5
0.0
Dim 1 (43.48%)
Dim
2 (
25.1
4%)
O.fruityO.candied.fruitO.mushroom
O.plante
Typicity
OxidationSmoothness
Sweetness
How to interpret the firstdimension?
How to interpret thesecond dimension?
Main directions of variability: ???
1 - fruity and flowery odors versus woody and vegetal odors;2 - bitterness and acidity versus sweetness
21 / 36
Data - Examples Studying individuals Studying variables Interpretation aids
Representation of the variables as an interpretation aid forthe individuals’ cloud
0.0
0.5
1.0
Dim
2 (
25.1
4%)
O.passion
O.citrus
O.Intensity.before.shakingO.Intensity.after.shaking
Expression
O.vanillaO.wooded
O.mushroom
O.flower
O.alcohol
Grade
Surface.feeling
Freshness
Attack.intensity
AcidityBitterness
Astringency
Aroma.intensity
Aroma.persistency
Visual.intensity
-1.0 -0.5 0.0 0.5 1.0
-1.0
-0.5
0.0
Dim 1 (43.48%)
Dim
2 (
25.1
4%)
O.fruityO.candied.fruitO.mushroom
O.plante
Typicity
OxidationSmoothness
Sweetness
How to interpret the firstdimension?
How to interpret thesecond dimension?
Main directions of variability:1 - fruity and flowery odors versus woody and vegetal odors;2 - bitterness and acidity versus sweetness
21 / 36
Data - Examples Studying individuals Studying variables Interpretation aids
Studying the variables
22 / 36
Data - Examples Studying individuals Studying variables Interpretation aids
Fitting the variables’ cloud NK
O
1xikxik
1 variable = 1 point in an I-dimensional space
cos(θkl ) =< x.k , x.l >‖x.k‖ ‖x.l‖
=∑I
i=1 xikxil√∑Ii=1 x2ik
√∑Ii=1 x2il )
Since variables are centered: cos(θkl )= r(x.k , x.l )
If variables are standardized ⇒ points are on a I-sphere of radius 1
22 / 36
Data - Examples Studying individuals Studying variables Interpretation aids
Fitting the variables’ cloud NK
O
1xikxik
1 variable = 1 point in an I-dimensional space
cos(θkl ) =< x.k , x.l >‖x.k‖ ‖x.l‖
=∑I
i=1 xikxil√∑Ii=1 x2ik
√∑Ii=1 x2il )
Since variables are centered: cos(θkl )= r(x.k , x.l )
If variables are standardized ⇒ points are on a I-sphere of radius 1
22 / 36
Data - Examples Studying individuals Studying variables Interpretation aids
Fitting the variables’ cloud NK
O
1xikxik
1 variable = 1 point in an I-dimensional space
cos(θkl ) =< x.k , x.l >‖x.k‖ ‖x.l‖
=∑I
i=1 xikxil√∑Ii=1 x2ik
√∑Ii=1 x2il )
Since variables are centered: cos(θkl )= r(x.k , x.l )
If variables are standardized ⇒ points are on a I-sphere of radius 1
22 / 36
Data - Examples Studying individuals Studying variables Interpretation aids
Fitting the variables’ cloud NK
Similar strategy as for individuals: sequentially find orthogonalaxes:
argmaxv1∈RI
K∑k=1
r(v1, x.k)2
⇒ v1 is the best synthetic variable for summarizing the variables
Find the 2nd axis, then the 3rd, etc.
23 / 36
Data - Examples Studying individuals Studying variables Interpretation aids
Fitting the variables’ cloud NK0.
00.
51.
0
Dim
2 (
25.1
4%)
O.passion
O.citrus
O.Intensity.before.shakingO.Intensity.after.shaking
Expression
O.vanillaO.wooded
O.mushroom
O.flower
O.alcohol
Grade
Surface.feeling
Freshness
Attack.intensity
AcidityBitterness
Astringency
Aroma.intensity
Aroma.persistency
Visual.intensity
-1.0 -0.5 0.0 0.5 1.0
-1.0
-0.5
0.0
Dim 1 (43.48%)
Dim
2 (
25.1
4%)
O.fruityO.candied.fruitO.mushroom
O.plante
Typicity
OxidationSmoothness
Sweetness
⇒ Same graph as before!!!!
24 / 36
Data - Examples Studying individuals Studying variables Interpretation aids
INCREDIBLE!!! AMAZING!!!
24 / 36
Data - Examples Studying individuals Studying variables Interpretation aids
Fitting the variables’ cloud NK0.
00.
51.
0
Dim
2 (
25.1
4%)
O.passion
O.citrus
O.Intensity.before.shakingO.Intensity.after.shaking
Expression
O.vanillaO.wooded
O.mushroom
O.flower
O.alcohol
Grade
Surface.feeling
Freshness
Attack.intensity
AcidityBitterness
Astringency
Aroma.intensity
Aroma.persistency
Visual.intensity
-1.0 -0.5 0.0 0.5 1.0
-1.0
-0.5
0.0
Dim 1 (43.48%)
Dim
2 (
25.1
4%)
O.fruityO.candied.fruitO.mushroom
O.plante
Typicity
OxidationSmoothness
Sweetness
⇒ Same graph as before!!!!
• interpretation aid forthe individuals’ graph
• optimal representationof the variables’ cloud
• visualization of thecorrelation matrix
INCREDIBLE!!! OUTSTANDING!!!
24 / 36
Data - Examples Studying individuals Studying variables Interpretation aids
Linking the two representations: transition formulas
Scores: F•sLoadings: G•s/
√λs Fis =
1√λs
K∑k=1
xikGks Gks =1√λs
I∑i=1
xikFis
=⇒ Individuals are on the same side as their correspondingvariables with high values
20
2D
im 2
(25
.14%
)
S MichaudS Renaudie
S Trotignon
S Buisse Cristal
V Aub Marigny
V Font Domaine
V Font Brules
V Font Coteaux
-6 -4 -2 0 2 4
-6-4
-2
Dim 1 (43.48%)
Dim
2 (
25.1
4%)
S Buisse Domaine
V Aub Silex
Aub SilexO.intensity.after.shaking -2,54O.intensity.before.shaking -2,37Expression -2,25Acidity -2,07Attack.intensity -1,36Attack.intensity -1,36Bitterness -1,33Freshness -1,15… …Typicity 1,01Sweetness 2,93
0.0
0.5
1.0
Dim
2 (
25.1
4%)
O.passion
O.citrus
O.Intensity.before.shakingO.Intensity.after.shaking
Expression
O.vanillaO.wooded
O.mushroom
O.flower
O.alcohol
Grade
Surface.feeling
Freshness
Attack.intensity
AcidityBitterness
Astringency
Aroma.intensity
Aroma.persistency
Visual.intensity
-1.0 -0.5 0.0 0.5 1.0
-1.0
-0.5
0.0
Dim 1 (43.48%)
Dim
2 (
25.1
4%)
O.fruityO.candied.fruitO.mushroom
O.plante
Typicity
OxidationSmoothness
Sweetness
25 / 36
Data - Examples Studying individuals Studying variables Interpretation aids
Linking the two representations: transition formulas
Scores: F•sLoadings: G•s/
√λs Fis =
1√λs
K∑k=1
xikGks Gks =1√λs
I∑i=1
xikFis
=⇒ Individuals are on the same side as their correspondingvariables with high values
20
2D
im 2
(25
.14%
)
S MichaudS Renaudie
S Trotignon
S Buisse Cristal
V Aub Marigny
V Font Domaine
V Font Brules
V Font Coteaux
-6 -4 -2 0 2 4
-6-4
-2
Dim 1 (43.48%)
Dim
2 (
25.1
4%)
S Buisse Domaine
V Aub Silex
Aub SilexO.intensity.after.shaking -2,54O.intensity.before.shaking -2,37Expression -2,25Acidity -2,07Attack.intensity -1,36Attack.intensity -1,36Bitterness -1,33Freshness -1,15… …Typicity 1,01Sweetness 2,93
0.0
0.5
1.0
Dim
2 (
25.1
4%)
O.passion
O.citrus
O.Intensity.before.shakingO.Intensity.after.shaking
Expression
O.vanillaO.wooded
O.mushroom
O.flower
O.alcohol
Grade
Surface.feeling
Freshness
Attack.intensity
AcidityBitterness
Astringency
Aroma.intensity
Aroma.persistency
Visual.intensity
-1.0 -0.5 0.0 0.5 1.0
-1.0
-0.5
0.0
Dim 1 (43.48%)
Dim
2 (
25.1
4%)
O.fruityO.candied.fruitO.mushroom
O.plante
Typicity
OxidationSmoothness
Sweetness
25 / 36
Data - Examples Studying individuals Studying variables Interpretation aids
Projections. . .
r(A,B) = cos(θA,B)cos(θA,B) ≈ cos(θHA,HB ) if the variables are well-projected
A
B
C
DHAHB
HC
HD
HA
HB
HC
HD
HEE
HE
Only well-projected variables can be interpreted!
26 / 36
Data - Examples Studying individuals Studying variables Interpretation aids
Interpretation aids
27 / 36
Data - Examples Studying individuals Studying variables Interpretation aids
Choosing the number of dimensions
Bar chart of eigenvalues,tests,confidence intervals,cross-validation (estim_ncp function),etc.
1 2 3 4 5 6 7 8 9
Percentage of variance
Dimension
Per
cent
age
of in
ertia
exp
lain
ed b
y a
dim
ensi
on
010
2030
40
Two goals:⇒ Interpretation⇒ Separate structure from noise
Data NoiseStructurePCA
x.1 x.Kx.k F1 FQ FK
27 / 36
Data - Examples Studying individuals Studying variables Interpretation aids
Percentage of variance obtained under independence⇒ Is there structure in my data?
Number of variablesnbind 4 5 6 7 8 9 10 11 12 13 14 15 16
5 96.5 93.1 90.2 87.6 85.5 83.4 81.9 80.7 79.4 78.1 77.4 76.6 75.56 93.3 88.6 84.8 81.5 79.1 76.9 75.1 73.2 72.2 70.8 69.8 68.7 68.07 90.5 84.9 80.9 77.4 74.4 72.0 70.1 68.3 67.0 65.3 64.3 63.2 62.28 88.1 82.3 77.2 73.8 70.7 68.2 66.1 64.0 62.8 61.2 60.0 59.0 58.09 86.1 79.5 74.8 70.7 67.4 65.1 62.9 61.1 59.4 57.9 56.5 55.4 54.310 84.5 77.5 72.3 68.2 65.0 62.4 60.1 58.3 56.5 55.1 53.7 52.5 51.511 82.8 75.7 70.3 66.3 62.9 60.1 58.0 56.0 54.4 52.7 51.3 50.1 49.212 81.5 74.0 68.6 64.4 61.2 58.3 55.8 54.0 52.4 50.9 49.3 48.2 47.213 80.0 72.5 67.2 62.9 59.4 56.7 54.4 52.2 50.5 48.9 47.7 46.6 45.414 79.0 71.5 65.7 61.5 58.1 55.1 52.8 50.8 49.0 47.5 46.2 45.0 44.015 78.1 70.3 64.6 60.3 57.0 53.9 51.5 49.4 47.8 46.1 44.9 43.6 42.516 77.3 69.4 63.5 59.2 55.6 52.9 50.3 48.3 46.6 45.2 43.6 42.4 41.417 76.5 68.4 62.6 58.2 54.7 51.8 49.3 47.1 45.5 44.0 42.6 41.4 40.318 75.5 67.6 61.8 57.1 53.7 50.8 48.4 46.3 44.6 43.0 41.6 40.4 39.319 75.1 67.0 60.9 56.5 52.8 49.9 47.4 45.5 43.7 42.1 40.7 39.6 38.420 74.1 66.1 60.1 55.6 52.1 49.1 46.6 44.7 42.9 41.3 39.8 38.7 37.525 72.0 63.3 57.1 52.5 48.9 46.0 43.4 41.4 39.6 38.1 36.7 35.5 34.530 69.8 61.1 55.1 50.3 46.7 43.6 41.1 39.1 37.3 35.7 34.4 33.2 32.135 68.5 59.6 53.3 48.6 44.9 41.9 39.5 37.4 35.6 34.0 32.7 31.6 30.440 67.5 58.3 52.0 47.3 43.4 40.5 38.0 36.0 34.1 32.7 31.3 30.1 29.145 66.4 57.1 50.8 46.1 42.4 39.3 36.9 34.8 33.1 31.5 30.2 29.0 27.950 65.6 56.3 49.9 45.2 41.4 38.4 35.9 33.9 32.1 30.5 29.2 28.1 27.0100 60.9 51.4 44.9 40.0 36.3 33.3 31.0 28.9 27.2 25.8 24.5 23.3 22.3
Table: 95 % quantile for inertia in the two first axes of 10 000 PCA ondata with independent variables 28 / 36
Data - Examples Studying individuals Studying variables Interpretation aids
Percentage of variance obtained under independence
Number of variablesnbind 17 18 19 20 25 30 35 40 50 75 100 150 200
5 74.9 74.2 73.5 72.8 70.7 68.8 67.4 66.4 64.7 62.0 60.5 58.5 57.46 67.0 66.3 65.6 64.9 62.3 60.4 58.9 57.6 55.8 52.9 51.0 49.0 47.87 61.3 60.7 59.7 59.1 56.4 54.3 52.6 51.4 49.5 46.4 44.6 42.4 41.28 57.0 56.2 55.4 54.5 51.8 49.7 47.8 46.7 44.6 41.6 39.8 37.6 36.49 53.6 52.5 51.8 51.2 48.1 45.9 44.4 42.9 41.0 38.0 36.1 34.0 32.710 50.6 49.8 49.0 48.3 45.2 42.9 41.4 40.1 38.0 35.0 33.2 31.0 29.811 48.1 47.2 46.5 45.8 42.8 40.6 39.0 37.7 35.6 32.6 30.8 28.7 27.512 46.2 45.2 44.4 43.8 40.7 38.5 36.9 35.5 33.5 30.5 28.8 26.7 25.513 44.4 43.4 42.8 41.9 39.0 36.8 35.1 33.9 31.8 28.8 27.1 25.0 23.914 42.9 42.0 41.3 40.4 37.4 35.2 33.6 32.3 30.4 27.4 25.7 23.6 22.415 41.6 40.7 39.8 39.1 36.2 34.0 32.4 31.1 29.0 26.0 24.3 22.4 21.216 40.4 39.5 38.7 37.9 35.0 32.8 31.1 29.8 27.9 24.9 23.2 21.2 20.117 39.4 38.5 37.6 36.9 33.8 31.7 30.1 28.8 26.8 23.9 22.2 20.3 19.218 38.3 37.4 36.7 35.8 32.9 30.7 29.1 27.8 25.9 22.9 21.3 19.4 18.319 37.4 36.5 35.8 34.9 32.0 29.9 28.3 27.0 25.1 22.2 20.5 18.6 17.520 36.7 35.8 34.9 34.2 31.3 29.1 27.5 26.2 24.3 21.4 19.8 18.0 16.925 33.5 32.5 31.8 31.1 28.1 26.0 24.5 23.3 21.4 18.6 17.0 15.2 14.230 31.2 30.3 29.5 28.8 26.0 23.9 22.3 21.1 19.3 16.6 15.1 13.4 12.535 29.5 28.6 27.9 27.1 24.3 22.2 20.7 19.6 17.8 15.2 13.7 12.1 11.140 28.1 27.3 26.5 25.8 23.0 21.0 19.5 18.4 16.6 14.1 12.7 11.1 10.245 27.0 26.1 25.4 24.7 21.9 20.0 18.5 17.4 15.7 13.2 11.8 10.3 9.450 26.1 25.3 24.6 23.8 21.1 19.1 17.7 16.6 14.9 12.5 11.1 9.6 8.7100 21.5 20.7 19.9 19.3 16.7 14.9 13.6 12.5 11.0 8.9 7.7 6.4 5.7
Table: 95 % quantile for inertia in the two first axes of 10 000 PCA ondata with independent variables
29 / 36
Data - Examples Studying individuals Studying variables Interpretation aids
Supplementary information• For the quantitative variables: project supplementary variablesonto the axes
• For categorical variables: project the barycenter of individualsin each category
20
2D
im 2
(25
.14%
)
S MichaudS Renaudie
S Trotignon
S Buisse Cristal
V Aub Marigny
V Font Domaine
V Font Brules
V Font Coteaux
Sauvignon
Vouvray
SauvignonVouvray
-6 -4 -2 0 2 4
-6-4
-2
Dim 1 (43.48%)
Dim
2 (
25.1
4%)
S Buisse Domaine
V Aub Silex
0.0
0.5
1.0
Dim
2 (
25.1
4%)
O.passion
O.citrus
O.Intensity.before.shakingO.Intensity.after.shaking
Expression
O.vanillaO.wooded
O.mushroom
O.flower
O.alcohol
Grade
Surface.feeling
Freshness
Attack.intensity
AcidityBitterness
Astringency
Aroma.intensity
Aroma.persistency
Visual.intensityOdor.preference
-1.0 -0.5 0.0 0.5 1.0
-1.0
-0.5
0.0
Dim 1 (43.48%)
Dim
2 (
25.1
4%)
O.fruityO.candied.fruitO.mushroom
O.plante
Typicity
OxidationSmoothness
Sweetness
Overall.preference
⇒ Supplementary information not used to build the axes 30 / 36
Data - Examples Studying individuals Studying variables Interpretation aids
Quality of the representation: cos2
• cos2(θiHi ) for the individuals: distance between individuals canonly be interpreted for well-projected individuals> round(res.pca$ind$cos2,2)
Dim.1 Dim.2S Michaud 0.62 0.07S Renaudie 0.73 0.15S Trotignon 0.78 0.07
• cos2(θkHk ) for the variables: only well-projected variables(high cos2) can be interpreted!> round(res.pca$var$cos2,2)
Dim.1 Dim.2O.fruity 0.08 0.02O.passion 0.80 0.03O.citrus 0.69 0.00
31 / 36
Data - Examples Studying individuals Studying variables Interpretation aids
Contributions⇒ Contributions to components:
• for an individual: Ctrs(i) =F 2is∑Ii=1 F
2is
= F2isλs
⇒ Individuals with a large coordinate value contribute most> round(res.pca$ind$contrib,2)
Dim.1 Dim.2S Michaud 15.49 3.10S Renaudie 15.56 5.56S Trotignon 15.46 2.43
• for a variable: Ctrs(k) = r(x.k ,vs)2∑K
k=1 r(x.k ,vs)2
= r(x.k ,vs)2
λs
⇒ Variables highly correlated with the principal componentcontribute the most> round(res.pca$var$contrib,2)
Dim.1 Dim.2O.fruity 0.67 0.34O.passion 6.84 0.40O.citrus 5.89 0.02
32 / 36
Data - Examples Studying individuals Studying variables Interpretation aids
Characterizing the axesUsing the continuous variables:
• correlation between each variable and the principal componentof rank s is calculated
• correlation coefficients are sorted and significant ones areoutput
> dimdesc(res.pca)$Dim.1$quanti $Dim.2$quanticorr p.value corr p.value
O.candied.fruit 0.93 9.5e-05 O.intensity.before.shaking 0.97 3.1e-06Grade 0.93 1.2e-04 O.intensity.after.shaking 0.95 3.6e-05Surface.feeling 0.89 5.5e-04 Attack.intensity 0.85 1.7e-03Typicity 0.86 1.4e-03 Expression 0.84 2.2e-03O.mushroom 0.84 2.3e-03 Aroma.persistency 0.75 1.3e-02Visual.intensity 0.83 3.1e-03 Bitterness 0.71 2.3e-02
... ... ... Aroma.intensity 0.66 4.0e-02O.plante -0.87 1.0e-03O.flower -0.89 4.9e-04O.passion -0.90 4.5e-04Freshness -0.91 2.9e-04 Sweetness -0.78 8.0e-03
33 / 36
Data - Examples Studying individuals Studying variables Interpretation aids
Characterizing the axesUsing the categorical variables:
• Do one-way analysis of variance with the coordinates of theindividuals (F.s) described by the categorical variable
• an F-test by variable• for each category, a Student’s t-test to compare the average of
the category with the general mean> dimdesc(res.pca)Dim.1$quali
R2 p.valueLabel 0.874 7.30e-05
Dim.1$categoryEstimate p.value
Vouvray 3.203 7.30e-05Sauvignon -3.203 7.30e-05
34 / 36
Data - Examples Studying individuals Studying variables Interpretation aids
PCA in practice
1 Choose active variables2 Rescale (or not) the variables3 Perform PCA4 Choose the number of dimensions to interpret5 Joint analysis of the cloud of individuals and the cloud of
variables6 Use indicators to enrich interpretation7 Go back to raw data for interpretation
35 / 36
Data - Examples Studying individuals Studying variables Interpretation aids
More
K31116
François Husson • Sébastien LêJérôme Pagès
Husson Lê
Pagès
Statistics
Exploratory Multivariate Analysis by Example Using R
S E C O N D E D I T I O N
Exploratory Multivariate Analysis by Exam
ple Using R
Chapman & Hall/CRC Computer Science & Data Analysis SeriesChapman & Hall/CRC Computer Science & Data Analysis Series
Second Edition
ISBN: 978-1-138-19634-6
9 781 138 196346
90000
Husson F., Lê S. & Pagès J. (2017)Exploratory Multivariate Analysis by ExampleUsing R2nd edition, 230 p., CRC/Press.
The FactoMineR package for doing PCA:http://factominer.free.fr/
Videos on Youtube:• Youtube channel: youtube.com/HussonFrancois• a playlist with movies in English• a playlist with movies in French
36 / 36
http://factominer.free.fr/https://www.youtube.com/HussonFrancoisyoutube.com/HussonFrancoishttps://www.youtube.com/playlist?list=PLnZgp6epRBbTsZEFXi_p6W48HhNyqwxIuhttps://www.youtube.com/playlist?list=PLnZgp6epRBbQu2QtCyqYL80In1P-A_Iud
Data - ExamplesStudying individualsStudying variablesInterpretation aids
fd@rm@0: fd@rm@1: