Correspondence Analysis – F. Murtagh

Correspondence Analysis

Topics:
• Basics, and preliminary example (student exam scores)
• Metrics, clouds of points, masses, inertia
• Factors, decomposition of inertia, contributions, dual spaces
• Hierarchical agglomerative clustering
• Minimum variance criterion
• Examples in depth (ppt file)
• Java application: http://astro.u-strasbg.fr/~fmurtagh/mda-sw

Basics
• Observations × variables matrix.
• Through display and through quantitative measures, investigate relationships between observations, and between variables.
• Similar in these objectives to principal components analysis, multidimensional scaling, Kohonen self-organizing feature map, and others.
• Correspondence analysis is often used in conjunction with clustering.
• Input data, and input data coding, are the major issues which distinguish correspondence analysis from other algorithmically similar (or alternative algorithmic) methods.

Scores: 5 students in 6 subjects

      CSc  CPg  CGr  CNw  DbM  SwE
A      54   55   31   36   46   40
B      35   56   20   20   49   45
C      47   73   39   30   48   57
D      54   72   33   42   57   21
E      18   24   11   14   19    7

                CSc  CPg  CGr  CNw  DbM  SwE
mean profile:   .18  .24  .12  .12  .19  .15
profile of D:   .19  .26  .12  .15  .20  .08
profile of E:   .19  .26  .12  .15  .20  .08

Scores (out of 100) of 5 students, A–E, in 6 subjects. Subjects: CSc: Computer Science Proficiency, CPg: Computer Programming, CGr: Computer Graphics, CNw: Computer Networks, DbM: Database Management, SwE: Software Engineering.
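
As a minimal sketch (Python is used for illustration; the names `scores`, `profile`, `row_profiles` and `mean_profile` are not from the slides), the profiles quoted above can be recomputed by dividing each row by its own total, and the column totals by the grand total:

```python
# Scores from the table above (rows: students A-E; columns: the six subjects).
scores = {
    "A": [54, 55, 31, 36, 46, 40],
    "B": [35, 56, 20, 20, 49, 45],
    "C": [47, 73, 39, 30, 48, 57],
    "D": [54, 72, 33, 42, 57, 21],
    "E": [18, 24, 11, 14, 19, 7],
}

def profile(values):
    """Divide each value by the total, so that the result sums to 1."""
    total = sum(values)
    return [v / total for v in values]

# Row profiles: each student's scores as proportions of that student's total.
row_profiles = {name: profile(row) for name, row in scores.items()}

# Mean profile: column totals divided by the grand total.
column_totals = [sum(col) for col in zip(*scores.values())]
mean_profile = profile(column_totals)

print([round(p, 2) for p in mean_profile])
print([round(p, 2) for p in row_profiles["D"]])
print([round(p, 2) for p in row_profiles["E"]])
```

Because E's scores are exactly one third of D's, the two row profiles coincide, which is the situation the next slides address.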

Scores: 5 students in 6 subjects (cont'd.)
• Correspondence analysis highlights the similarities and the differences in the profiles.
• Note that all the scores of D and E are in the same proportion (E's scores are one-third those of D).
• Note also that E has the lowest scores both in absolute and relative terms in all the subjects.
• D and E have identical profiles: without data coding they would be located at the same location in the output display.
• Both D and E show a positive association with CNw (computer networks) and a negative association with SwE (software engineering) because, in comparison with the mean profile, D and E have in their profile a relatively larger component of CNw and a relatively smaller component of SwE.

• We need to clearly differentiate between the profiles of D and E, which we do by doubling the data.
• Doubling: we attribute two scores per subject instead of a single score. The "score awarded", $k(i, j^+)$, is equal to the initial score. The "score not awarded", $k(i, j^-)$, is equal to its complement, i.e., $100 - k(i, j^+)$.
• Lever principle: a "+" variable and its corresponding "−" variable lie on opposite sides of the origin and are collinear with it.
• And: if the mass of the profile of $j^+$ is greater than the mass of the profile of $j^-$ (which means that the average score for the subject $j$ was greater than 50 out of 100), the point $j^+$ is closer to the origin than $j^-$.
• We will find that, except in CPg, the average score of the students was below 50 in all the subjects.

Data coding: Doubling

     CSc+ CSc- CPg+ CPg- CGr+ CGr- CNw+ CNw- DbM+ DbM- SwE+ SwE-
A      54   46   55   45   31   69   36   64   46   54   40   60
B      35   65   56   44   20   80   20   80   49   51   45   55
C      47   53   73   27   39   61   30   70   48   52   57   43
D      54   46   72   28   33   67   42   58   57   43   21   79
E      18   82   24   76   11   89   14   86   19   81    7   93

Doubled table of scores derived from previous table. Note: all rows now have the same total.
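
The doubling can be sketched in a few lines. This is an illustrative Python sketch under the assumption that every score is out of 100; the helper names `double` and `profile` are hypothetical, not part of the original material.

```python
MAX_SCORE = 100  # scores are out of 100, as stated in the table caption

scores = {
    "A": [54, 55, 31, 36, 46, 40],
    "B": [35, 56, 20, 20, 49, 45],
    "C": [47, 73, 39, 30, 48, 57],
    "D": [54, 72, 33, 42, 57, 21],
    "E": [18, 24, 11, 14, 19, 7],
}

def double(row, max_score=MAX_SCORE):
    """Return [k+, k-, k+, k-, ...]: each awarded score followed by its complement."""
    doubled_row = []
    for k in row:
        doubled_row.extend([k, max_score - k])
    return doubled_row

def profile(values):
    """Divide each value by the total, so that the result sums to 1."""
    total = sum(values)
    return [v / total for v in values]

doubled = {name: double(row) for name, row in scores.items()}

# Every doubled row has the same total (6 subjects x 100), so all rows get equal mass.
row_totals = {name: sum(row) for name, row in doubled.items()}

# After doubling, D and E no longer share a profile.
profile_d = profile(doubled["D"])
profile_e = profile(doubled["E"])
print(row_totals)
print(profile_d != profile_e)
```

The equal row totals are what the note under the table refers to: after doubling, the row profiles are just the doubled scores divided by 600, so absolute level is no longer lost.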

[Figure: factor plane for the doubled score table. Axes: Factor 1 (77% inertia) and Factor 2 (18% inertia). Points shown: students A–E and the doubled variables CSc+, CSc-, CPg+, CPg-, CGr+, CGr-, CNw+, CNw-, DbM+, DbM-, SwE+, SwE-.]

Metrics
• The notion of distance is crucial, since we want to investigate relationships between observations and/or variables.
• Recall: for vectors $x = (x_1, \dots, x_n)$ and $y = (y_1, \dots, y_n)$, the scalar product is $\langle x, y \rangle = \langle y, x \rangle = x'y = \sum_j x_j y_j$.
• Euclidean norm: $\|x\|^2 = \langle x, x \rangle = \sum_j x_j^2$.
• Euclidean distance: $d(x, y) = \|x - y\|$. The squared Euclidean distance is: $d^2(x, y) = \sum_j (x_j - y_j)^2$.
• Orthogonality: $x$ is orthogonal to $y$ if $\langle x, y \rangle = 0$.
• Distance is symmetric ($d(x, y) = d(y, x)$), positive ($d(x, y) \geq 0$), and definite ($d(x, y) = 0 \Rightarrow x = y$).

Metrics (cont'd.)
• Any symmetric, positive, definite matrix $M$ defines a generalized Euclidean space. Scalar product is $\langle x, y \rangle_M = x'My$, norm is $\|x\|_M^2 = x'Mx$, and Euclidean distance is $d(x, y) = \|x - y\|_M$.
• Classical case: $M = I_n$, the identity matrix.
• Normalization to unit variance: $M$ is diagonal matrix with $i$th diagonal term $1/\sigma_i^2$.
• Mahalanobis distance: $M$ is inverse variance-covariance matrix.
• Next topic: Scalar product defines orthogonal projection.

Metrics (cont'd.)
• Projected value, projection, coordinate: $x_1 = (x'Mu / u'Mu)\, u$. Here $x_1$ and $u$ are both vectors.
• Norm of vector $x_1$: $\|x_1\| = (x'Mu / u'Mu)\, \|u\| = (x'Mu) / \|u\|$.
• The quantity $(x'Mu) / (\|x\| \|u\|)$ can be interpreted as the cosine of the angle $a$ between vectors $x$ and $u$.

         + x
        /|
       / |
      /  |
     /   |
    / a  |
   +-----+----- u
   O     x1
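
The projection formulas can be tried out numerically. The following is a small sketch that assumes a diagonal metric $M$, passed as the list of its diagonal terms (a simplification made for this example; the slides allow any symmetric positive definite $M$). The function names are illustrative.

```python
import math

def scalar_product(x, y, m_diag):
    """Scalar product x'My for a diagonal metric M given by its diagonal terms."""
    return sum(m * a * b for m, a, b in zip(m_diag, x, y))

def norm(x, m_diag):
    """Norm of x in the metric M."""
    return math.sqrt(scalar_product(x, x, m_diag))

def project(x, u, m_diag):
    """Orthogonal projection of x onto the axis spanned by u: (x'Mu / u'Mu) u."""
    coeff = scalar_product(x, u, m_diag) / scalar_product(u, u, m_diag)
    return [coeff * c for c in u]

def cosine(x, u, m_diag):
    """Cosine of the angle between x and u in the metric M."""
    return scalar_product(x, u, m_diag) / (norm(x, m_diag) * norm(u, m_diag))

identity = [1.0, 1.0]          # classical case, M = I
x = [3.0, 4.0]
u = [1.0, 0.0]

x1 = project(x, u, identity)   # [3.0, 0.0]
cos_a = cosine(x, u, identity) # 3 / 5 = 0.6
print(x1, cos_a)
```

Replacing `identity` with, for instance, inverse variances gives the unit-variance normalization mentioned on the earlier slide without changing any of the functions.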

Metrics (cont'd.)
• Consider the case of centred $n$-valued coordinates or variables, $x_i$.
• The sum of variable vectors is a constant, proportional to the mean variable.
• Therefore the centred vectors lie on a hyperplane $H$, or a sub-space, of dimension $n - 1$.
• Consider a probability distribution $p$ defined on $I$, i.e. for all $i$ we have $p_i > 0$ (note: $> 0$ to avoid inconvenience of lower dim. subspace) and $\sum_{i \in I} p_i = 1$.
• Covariance matrix: $M_{p_I}$, diagonal matrix with diagonal elements consisting of the $p_i$ terms.
• Have: $x' M_{p_I} x = \sum_{i \in I} p_i x_i^2 = \mathrm{var}(x)$; and $x' M_{p_I} y = \sum_{i \in I} p_i x_i y_i = \mathrm{cov}(x, y)$.

Metrics (cont'd.)
• Use of metric $M_{1/p_I}$ on $I$ is associated with the following $\chi^2$ distance relative to centre $p_I$.
• This new distance is a generalized Euclidean $M_{1/p_I}$ metric.
• Let both $q_I$ and $r_I$ be probability densities.
• Then: $\|q_I - r_I\|^2_{p_I} = \sum_{i \in I} (q_i - r_i)^2 / p_i$.
• Link with $\chi^2$ statistic: let $p_{IJ}$ be a data table of probabilities derived from frequencies or counts. $p_{IJ} = \{p_{ij} \mid i \in I, j \in J\}$.
• Marginals of this table are $p_I$ and $p_J$. Consider independence of effects where the data table is $q_{IJ} = p_I p_J$.
• Then the $\chi^2$ distance of centre $q_{IJ}$ between the densities $p_{IJ}$ and $q_{IJ}$ is $\|p_{IJ} - q_{IJ}\|^2_{q_{IJ}} = \sum_{(i,j) \in I \times J} (p_{ij} - p_i p_j)^2 / (p_i p_j)$.

• With the coefficient $n$, this is the quantity which can be assessed with a $\chi^2$ test with $n - 1$ degrees of freedom.
• The $\chi^2$ distance is used in correspondence analysis.
• Clearly, under appropriate circumstances (when $p_I = p_J =$ constant) then it becomes a classical Euclidean distance.
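
The link between the $\chi^2$ distance to the independence table and the usual $\chi^2$ statistic can be checked numerically. The sketch below is illustrative only: it treats the raw score table as if it were a table of counts (an assumption made purely to have numbers to work with), and the variable names are not from the slides.

```python
# Raw score table, used here as a stand-in for a table of counts (assumption).
table = [
    [54, 55, 31, 36, 46, 40],
    [35, 56, 20, 20, 49, 45],
    [47, 73, 39, 30, 48, 57],
    [54, 72, 33, 42, 57, 21],
    [18, 24, 11, 14, 19, 7],
]

n = sum(sum(row) for row in table)              # grand total
p = [[k / n for k in row] for row in table]     # p_ij
p_i = [sum(row) for row in p]                   # row marginals
p_j = [sum(col) for col in zip(*p)]             # column marginals

# Squared chi-squared distance, of centre p_i * p_j, between p_IJ and p_I p_J.
dist2 = sum(
    (p[i][j] - p_i[i] * p_j[j]) ** 2 / (p_i[i] * p_j[j])
    for i in range(len(p_i))
    for j in range(len(p_j))
)

# Pearson statistic computed from observed and expected counts, for comparison.
expected = [[n * a * b for b in p_j] for a in p_i]
pearson = sum(
    (table[i][j] - expected[i][j]) ** 2 / expected[i][j]
    for i in range(len(p_i))
    for j in range(len(p_j))
)

print(n * dist2, pearson)  # the two values agree up to rounding
```

Multiplying the squared distance by the grand total $n$ reproduces the statistic, which is the sense of "with the coefficient $n$" above.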

Input data table, marginals, and masses
• The given contingency table data are denoted $k_{IJ} = \{k_{IJ}(i, j) = k(i, j);\ i \in I, j \in J\}$.
• We have $k(i) = \sum_{j \in J} k(i, j)$. Analogously $k(j)$ is defined, and $k = \sum_{i \in I, j \in J} k(i, j)$.
• From frequencies to probabilities: $f_{IJ} = \{f_{ij} = k(i, j)/k;\ i \in I, j \in J\} \subset \mathbb{R}_{I \times J}$, similarly $f_I$ is defined as $\{f_i = k(i)/k;\ i \in I\} \subset \mathbb{R}_I$, and $f_J$ analogously.
• The conditional distribution of $f_J$ knowing $i \in I$, also termed the $i$th profile with coordinates indexed by the elements of $J$, is $f_J^i = \{f_j^i = f_{ij}/f_i = (k_{ij}/k)/(k_i/k);\ f_i \neq 0;\ j \in J\}$ and likewise for $f_I^j$.

Clouds of points, masses, and inertia
• Moment of inertia of a cloud of points in a Euclidean space, with both distances and masses defined: $M^2(N_J(I)) = \sum_{i \in I} f_i \|f_J^i - f_J\|^2_{f_J} = \sum_{i \in I} f_i \rho^2(i)$.
• Here: $\rho$ is the Euclidean distance from the cloud centre, and $f_i$ is the mass of element $i$.
• The mass is the marginal distribution of the input data table.
• Correspondence analysis is, as will be seen, a decomposition of the inertia of a cloud of points, endowed with masses.
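
To make the definition concrete, here is a short sketch that computes the masses $f_i$, the squared $\chi^2$ distances $\rho^2(i)$ of each row profile from the mean profile, and the resulting inertia for the raw score table. The Python names (`rho2`, `inertia`) are illustrative choices, not notation from the slides.

```python
table = [
    [54, 55, 31, 36, 46, 40],   # A
    [35, 56, 20, 20, 49, 45],   # B
    [47, 73, 39, 30, 48, 57],   # C
    [54, 72, 33, 42, 57, 21],   # D
    [18, 24, 11, 14, 19, 7],    # E
]

total = sum(sum(row) for row in table)
f = [[k / total for k in row] for row in table]   # f_ij
f_i = [sum(row) for row in f]                     # row masses
f_j = [sum(col) for col in zip(*f)]               # column masses = mean row profile

def rho2(i):
    """Squared chi-squared distance of row profile i from the mean profile f_J."""
    return sum(
        (f[i][j] / f_i[i] - f_j[j]) ** 2 / f_j[j]
        for j in range(len(f_j))
    )

# Inertia of the cloud of row profiles: masses times squared distances to the centre.
inertia = sum(f_i[i] * rho2(i) for i in range(len(f_i)))

print([round(rho2(i), 4) for i in range(len(f_i))])
print(inertia)
```

Rows D and E have the same profile, so they sit at the same distance from the centre and differ only in mass.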

Inertia and Distributional Equivalence
• Another expression for inertia: $M^2(N_J(I)) = M^2(N_I(J)) = \|f_{IJ} - f_I f_J\|^2_{f_I f_J} = \sum_{i \in I, j \in J} (f_{ij} - f_i f_j)^2 / (f_i f_j)$.
• The term $\|f_{IJ} - f_I f_J\|^2_{f_I f_J}$ is the $\chi^2$ metric between the probability distribution $f_{IJ}$ and the product of marginal distributions $f_I f_J$, with as centre of the metric the product $f_I f_J$.
• Principle of distributional equivalence: Consider two elements $j_1$ and $j_2$ of $J$ with identical profiles: i.e. $f_I^{j_1} = f_I^{j_2}$. Consider now that elements (or columns) $j_1$ and $j_2$ are replaced with a new element $j_s$ such that the new coordinates are aggregated profiles, $f_{i j_s} = f_{i j_1} + f_{i j_2}$, and the new masses are similarly aggregated: $f_{j_s} = f_{j_1} + f_{j_2}$. Then there is no effect on the distribution of distances between elements of $I$. The distance between elements of $J$, other than $j_1$ and $j_2$, is naturally not modified.

Inertia and Distributional Equivalence (cont'd.)
• The principle of distributional equivalence leads to representational self-similarity: aggregation of rows or columns, as defined above, leads to the same analysis. Therefore it is very appropriate to analyze a contingency table with fine granularity, and seek in the analysis to merge rows or columns, through aggregation.
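
Distributional equivalence can be illustrated on the raw score table, where rows D and E have identical profiles. By the symmetry of rows and columns, merging these two rows should leave the $\chi^2$ distances between column profiles unchanged. The sketch below checks this; the function name `chi2_column_distances` is a hypothetical helper written for this example.

```python
def chi2_column_distances(table):
    """Squared chi-squared distances between all pairs of column profiles."""
    total = sum(sum(row) for row in table)
    row_mass = [sum(row) / total for row in table]    # weights 1 / f_i
    col_total = [sum(col) for col in zip(*table)]
    n_cols = len(col_total)
    distances = {}
    for g in range(n_cols):
        for j in range(g + 1, n_cols):
            distances[(g, j)] = sum(
                (row[g] / col_total[g] - row[j] / col_total[j]) ** 2 / mass
                for row, mass in zip(table, row_mass)
            )
    return distances

original = [
    [54, 55, 31, 36, 46, 40],   # A
    [35, 56, 20, 20, 49, 45],   # B
    [47, 73, 39, 30, 48, 57],   # C
    [54, 72, 33, 42, 57, 21],   # D
    [18, 24, 11, 14, 19, 7],    # E (same profile as D)
]

# Replace rows D and E by their sum: coordinates and masses are aggregated.
merged_row = [d + e for d, e in zip(original[3], original[4])]
aggregated = original[:3] + [merged_row]

before = chi2_column_distances(original)
after = chi2_column_distances(aggregated)
largest_change = max(abs(before[pair] - after[pair]) for pair in before)
print(largest_change)  # zero up to floating-point rounding
```

Merging two rows whose profiles differ (for example A and B) would change these distances, which is why the principle is stated only for identical profiles.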

Factors
• Correspondence Analysis produces an ordered sequence of pairs, called factors, $(F_\alpha, G_\alpha)$, associated with real numbers called eigenvalues $0 \leq \lambda_\alpha \leq 1$.
• We denote $F_\alpha(i)$ the value of the factor of rank $\alpha$ for element $i$ of $I$; and similarly $G_\alpha(j)$ is the value of the factor of rank $\alpha$ for element $j$ of $J$.
• We see that $F$ is a function on $I$, and $G$ is a function on $J$.
• The number of eigenvalues and associated factor couples is: $\alpha = 1, 2, \dots, N = \min(|I| - 1, |J| - 1)$, where $|\cdot|$ denotes set cardinality.
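
Properties of factors
$\sum_{i \in I} f_i F_\alpha(i) = 0$; $\sum_{j \in J} f_j G_\alpha(j) = 0$
$\sum_{i \in I} f_i F_\alpha^2(i) = \lambda_\alpha$; $\sum_{j \in J} f_j G_\alpha^2(j) = \lambda_\alpha$
$\sum_{i \in I} f_i F_\alpha(i) F_\beta(i) = \lambda_\alpha \delta_{\alpha\beta}$; $\sum_{j \in J} f_j G_\alpha(j) G_\beta(j) = \lambda_\alpha \delta_{\alpha\beta}$
• Notation: $\delta_{\alpha\beta} = 0$ if $\alpha \neq \beta$ and $= 1$ if $\alpha = \beta$.
• Normalized factors: on the sets $I$ and $J$, we next define the functions $\phi_I$ and $\psi_J$ of zero mean, of unit variance, pairwise uncorrelated on $I$ (resp. $J$), and associated with masses $f_I$ (resp. $f_J$).
$\sum_{i \in I} f_i \phi_\alpha(i) = 0$; $\sum_{j \in J} f_j \psi_\alpha(j) = 0$
$\sum_{i \in I} f_i \phi_\alpha^2(i) = 1$; $\sum_{j \in J} f_j \psi_\alpha^2(j) = 1$
$\sum_{i \in I} f_i \phi_\alpha(i) \phi_\beta(i) = \delta_{\alpha\beta}$; $\sum_{j \in J} f_j \psi_\alpha(j) \psi_\beta(j) = \delta_{\alpha\beta}$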

• Between unnormalized and normalized factors, we have the following relations.
$\phi_\alpha(i) = \lambda_\alpha^{-1/2} F_\alpha(i) \quad \forall i \in I,\ \alpha = 1, 2, \dots, N$
$\psi_\alpha(j) = \lambda_\alpha^{-1/2} G_\alpha(j) \quad \forall j \in J,\ \alpha = 1, 2, \dots, N$
• The moment of inertia of the clouds $N_J(I)$ and $N_I(J)$ in the direction of the $\alpha$ axis is $\lambda_\alpha$.

Forward transform
• Have that the $\chi^2$ metric is defined in direct space, i.e. space of profiles.
• The Euclidean metric is defined for the factors.
• We can characterize correspondence analysis as the mapping of a cloud in $\chi^2$ space to Euclidean space.
• Distances between profiles are as follows.
$\|f_J^i - f_J^{i'}\|^2_{f_J} = \sum_{j \in J} (f_j^i - f_j^{i'})^2 / f_j = \sum_{\alpha = 1..N} (F_\alpha(i) - F_\alpha(i'))^2$
$\|f_I^j - f_I^{j'}\|^2_{f_I} = \sum_{i \in I} (f_i^j - f_i^{j'})^2 / f_i = \sum_{\alpha = 1..N} (G_\alpha(j) - G_\alpha(j'))^2$
• Norm, or distance of a point $i \in N_J(I)$ from the origin or centre of gravity of the cloud $N_J(I)$, is as follows.
$\rho^2(i) = \|f_J^i - f_J\|^2_{f_J} = \sum_{\alpha = 1..N} F_\alpha^2(i)$; likewise $\rho^2(j) = \|f_I^j - f_I\|^2_{f_I} = \sum_{\alpha = 1..N} G_\alpha^2(j)$
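
Inverse transform
• The correspondence analysis transform, taking profiles into a factor space, is reversed with no loss of information as follows, $\forall (i, j) \in I \times J$.
$f_{ij} = f_i f_j \left( 1 + \sum_{\alpha = 1..N} \lambda_\alpha^{-1/2} F_\alpha(i) G_\alpha(j) \right)$
• For profiles we have the following.
$f_i^j = f_i \left( 1 + \sum_\alpha \lambda_\alpha^{-1/2} F_\alpha(i) G_\alpha(j) \right)$; $f_j^i = f_j \left( 1 + \sum_\alpha \lambda_\alpha^{-1/2} F_\alpha(i) G_\alpha(j) \right)$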

Decomposition of inertia
• The distance of a point from the centre of gravity of the cloud is as follows.
$\rho^2(i) = \|f_J^i - f_J\|^2_{f_J} = \sum_{j \in J} (f_j^i - f_j)^2 / f_j$
• Decomposition of the cloud's inertia is as follows.
$M^2(N_J(I)) = \sum_{\alpha = 1..N} \lambda_\alpha = \sum_{i \in I} f_i \rho^2(i)$
• In greater detail, we have the following for this decomposition.
$\lambda_\alpha = \sum_{i \in I} f_i F_\alpha^2(i)$ and $\rho^2(i) = \sum_{\alpha = 1..N} F_\alpha^2(i)$

Relative and absolute contributions
• $f_i \rho^2(i)$ is the absolute contribution of point $i$ to the inertia of the cloud, $M^2(N_J(I))$, or the variance of point $i$.
• $f_i F_\alpha^2(i)$ is the absolute contribution of point $i$ to the moment of inertia $\lambda_\alpha$.
• $f_i F_\alpha^2(i) / \lambda_\alpha$ is the relative contribution of point $i$ to the moment of inertia $\lambda_\alpha$. (Often denoted CTR.)
• $F_\alpha^2(i)$ is the contribution of point $i$ to the $\chi^2$ distance between $i$ and the centre of the cloud $N_J(I)$.
• $\cos^2 a = F_\alpha^2(i) / \rho^2(i)$ is the relative contribution of the factor $\alpha$ to point $i$. (Often denoted COR.)
• Based on the latter term, we have: $\sum_{\alpha = 1..N} F_\alpha^2(i) / \rho^2(i) = 1$.
• Analogous formulas hold for the points $j$ in the cloud $N_I(J)$.

Reduction of dimensionality
• Interpretation is usually limited to the first few factors.
• Decomposition of inertia is usually far less decisive than (cumulative) percentage variance explained in principal components analysis. One reason for this: in CA, recoding often tends to bring input data coordinates closer to vertices of a hypercube.
• QLT$(i) = \sum_{\alpha = 1..N'} \cos^2 a$, where the angle $a$ has been defined above (previous section) and $N' < N$, is the quality of representation of element $i$ in the factor space of dimension $N'$.
• INR$(i) = \rho^2(i)$ is the squared distance of element $i$ from the centre of gravity of the cloud.
• POID$(i) = f_i$ is the mass or marginal frequency of the element $i$.

Interpretation of results
1. Projections onto factors 1 and 2, 2 and 3, 1 and 3, etc. of set $I$, set $J$, or both sets simultaneously.
2. Spectrum of non-increasing values of eigenvalues.
3. Interpretation of axes. We can distinguish between the general (latent semantic, conceptual) meaning of axes, and axes which have something specific to say about groups of elements. Usually contrast is important: what is found to be analogous at one extremity versus the other extremity; or oppositions or polarities.
4. Factors are determined by how much the elements contribute to their dispersion. Therefore the values of CTR are examined in order to identify or to name the factors (for example, with higher order concepts). (Informally, CTR allows us to work from the elements towards the factors.)
5. The values of COR are squared cosines, which can be considered as being like correlation coefficients. If COR$(i, \alpha)$ is large (say, around 0.8) then we can say that that element is well explained by the axis of rank $\alpha$. (Informally, COR allows us to work from the factors towards the elements.)

Analysis of the dual spaces
• We have the following.
$F_\alpha(i) = \lambda_\alpha^{-1/2} \sum_{j \in J} f_j^i G_\alpha(j)$ for $\alpha = 1, 2, \dots, N;\ i \in I$
$G_\alpha(j) = \lambda_\alpha^{-1/2} \sum_{i \in I} f_i^j F_\alpha(i)$ for $\alpha = 1, 2, \dots, N;\ j \in J$
• These are termed the transition formulas. The coordinate of element $i \in I$ is the barycentre of the coordinates of the elements $j \in J$, with associated masses of value given by the coordinates $f_j^i$ of the profile $f_J^i$. This is all to within the $\lambda_\alpha^{-1/2}$ constant.
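
One way to see the transition formulas at work is reciprocal averaging: apply the two barycentre steps alternately, re-standardizing each time, until the column values settle on the first non-trivial factor. This is a swapped-in illustrative technique (a power iteration), not a description of how any particular program computes the factors. The sketch runs on the doubled score table; all function names are hypothetical.

```python
import math

scores = [
    [54, 55, 31, 36, 46, 40],
    [35, 56, 20, 20, 49, 45],
    [47, 73, 39, 30, 48, 57],
    [54, 72, 33, 42, 57, 21],
    [18, 24, 11, 14, 19, 7],
]
# Doubled table: each score followed by its complement to 100.
k = [[v for s in row for v in (s, 100 - s)] for row in scores]

total = sum(sum(row) for row in k)
f = [[v / total for v in row] for row in k]
f_i = [sum(row) for row in f]
f_j = [sum(col) for col in zip(*f)]
n_rows, n_cols = len(f_i), len(f_j)

def rows_from_cols(g):
    """Barycentre of the column values g, weighted by each row profile f_j^i."""
    return [sum(f[i][j] / f_i[i] * g[j] for j in range(n_cols)) for i in range(n_rows)]

def cols_from_rows(h):
    """Barycentre of the row values h, weighted by each column profile f_i^j."""
    return [sum(f[i][j] / f_j[j] * h[i] for i in range(n_rows)) for j in range(n_cols)]

def standardize(g, mass):
    """Centre g to weighted mean 0 (drops the trivial factor), scale to weighted variance 1."""
    mean = sum(m * v for m, v in zip(mass, g))
    centred = [v - mean for v in g]
    sd = math.sqrt(sum(m * v * v for m, v in zip(mass, centred)))
    return [v / sd for v in centred]

psi = standardize([float(j) for j in range(n_cols)], f_j)   # arbitrary starting values
for _ in range(500):
    psi = standardize(cols_from_rows(rows_from_cols(psi)), f_j)

F = rows_from_cols(psi)                          # F(i) = sum_j f_j^i psi(j)
lam = sum(m * v * v for m, v in zip(f_i, F))     # eigenvalue: sum_i f_i F(i)^2
G = [math.sqrt(lam) * v for v in psi]            # G(j) = lam^(1/2) psi(j)
print(lam)
print([round(v, 3) for v in F])
```

The sign of the factor is arbitrary, and only the first factor is extracted here; further factors would need deflation or a full eigen-decomposition.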

Analysis of the dual spaces (cont'd.)
• We also have the following.
$\phi_\alpha(i) = \lambda_\alpha^{-1/2} \sum_{j \in J} f_j^i \psi_\alpha(j)$
$\psi_\alpha(j) = \lambda_\alpha^{-1/2} \sum_{i \in I} f_i^j \phi_\alpha(i)$
• This implies that we can pass easily from one space to the other. I.e. we carry out the diagonalization, or eigen-reduction, in the more computationally favourable space, which is usually $\mathbb{R}^J$. In the output display, the barycentric principle comes into play: this allows us to simultaneously view and interpret observations and attributes.

Supplementary elements
• Overly-preponderant elements (i.e. row or column profiles), or exceptional elements (e.g. a sex attribute, given other performance or behavioural attributes) may be placed as supplementary elements.
• This means that they are given zero mass in the analysis, and their projections are determined using the transition formulas.
• This amounts to carrying out a correspondence analysis first, without these elements, and then projecting them into the factor space following the determination of all properties of this space.

Summary
Space $\mathbb{R}^m$:
1. $n$ row points, each of $m$ coordinates.
2. The $j$th coordinate is $x_{ij} / x_i$.
3. The mass of point $i$ is $x_i$.
4. The $\chi^2$ distance between row points $i$ and $k$ is: $d^2(i, k) = \sum_j \frac{1}{x_j} \left( \frac{x_{ij}}{x_i} - \frac{x_{kj}}{x_k} \right)^2$. Hence this is a Euclidean distance, with respect to the weighting $1/x_j$ (for all $j$), between profile values $x_{ij}/x_i$ etc.
5. The criterion to be optimized: the weighted sum of squares of projections, where the weighting is given by $x_i$ (for all $i$).

Space $\mathbb{R}^n$:
1. $m$ column points, each of $n$ coordinates.
2. The $i$th coordinate is $x_{ij} / x_j$.
3. The mass of point $j$ is $x_j$.
4. The $\chi^2$ distance between column points $g$ and $j$ is: $d^2(g, j) = \sum_i \frac{1}{x_i} \left( \frac{x_{ig}}{x_g} - \frac{x_{ij}}{x_j} \right)^2$. Hence this is a Euclidean distance, with respect to the weighting $1/x_i$ (for all $i$), between profile values $x_{ig}/x_g$ etc.
5. The criterion to be optimized: the weighted sum of squares of projections, where the weighting is given by $x_j$ (for all $j$).

Hierarchical clustering
• Hierarchical agglomeration on $n$ observation vectors, $i \in I$, involves a series of $1, 2, \dots, n - 1$ pairwise agglomerations of observations or clusters, with the following properties.
• A hierarchy $H = \{q \mid q \in 2^I\}$ such that:
1. $I \in H$
2. $i \in H\ \forall i$
3. for each $q \in H, q' \in H$: $q \cap q' \neq \emptyset \Rightarrow q \subset q'$ or $q' \subset q$
• An indexed hierarchy is the pair $(H, \nu)$ where the positive function defined on $H$, i.e., $\nu: H \to \mathbb{R}^+$, satisfies:
1. $\nu(i) = 0$ if $i \in H$ is a singleton
2. $q \subset q' \Rightarrow \nu(q) < \nu(q')$
• Function $\nu$ is the agglomeration level.

• Take two clusters $q, q' \in H$, let $q \subset q''$ and $q' \subset q''$, and let $q''$ be the lowest level cluster for which this is true. Then if we define $D(q, q') = \nu(q'')$, $D$ is an ultrametric.
• Recall: Distances satisfy the triangle inequality $d(x, z) \leq d(x, y) + d(y, z)$. An ultrametric satisfies $d(x, z) \leq \max(d(x, y), d(y, z))$. In an ultrametric space triangles formed by any three points are isosceles. An ultrametric is a special distance associated with rooted trees. Ultrametrics are used in other fields also – in quantum mechanics, numerical optimization, number theory, and algorithmic logic.
• In practice, we start with a Euclidean distance or other dissimilarity, use some criterion such as minimizing the change in variance resulting from the agglomerations, and then define $\nu(q)$ as the dissimilarity associated with the agglomeration carried out.
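
The ultrametric inequality is easy to test directly. The sketch below (illustrative; `is_ultrametric` and both small distance tables are made up for this example) checks every ordered triple of points, once for levels read off a tiny tree and once for ordinary distances on a line.

```python
from itertools import permutations

def is_ultrametric(d, tol=1e-12):
    """Check d(x, z) <= max(d(x, y), d(y, z)) for every ordered triple of points."""
    return all(
        d[x][z] <= max(d[x][y], d[y][z]) + tol
        for x, y, z in permutations(d, 3)
    )

# Agglomeration levels from a small hypothetical tree:
# a and b merge at level 1, then c joins them at level 2.
tree_levels = {
    "a": {"a": 0, "b": 1, "c": 2},
    "b": {"a": 1, "b": 0, "c": 2},
    "c": {"a": 2, "b": 2, "c": 0},
}

# Ordinary distances between the points 0, 1 and 3 on a line.
line = {
    "a": {"a": 0, "b": 1, "c": 3},
    "b": {"a": 1, "b": 0, "c": 2},
    "c": {"a": 3, "b": 2, "c": 0},
}

print(is_ultrametric(tree_levels))  # True
print(is_ultrametric(line))         # False: d(a, c) = 3 > max(1, 2)
```

In the tree example the triangle (a, b, c) has sides 1, 2, 2, which is the isosceles shape mentioned above.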

Minimum variance agglomeration
• For Euclidean distance inputs, the following definitions hold for the minimum variance or Ward error sum of squares agglomerative criterion.
• Coordinates of the new cluster center, following agglomeration of $q$ and $q'$, where $m_q$ is the mass of cluster $q$ defined as cluster cardinality, and (vector) $q$ denotes, using overloaded notation, the center of (set) cluster $q$: $q'' = (m_q q + m_{q'} q') / (m_q + m_{q'})$.
• Following the agglomeration of $q$ and $q'$, we define the following dissimilarity: $\frac{m_q m_{q'}}{m_q + m_{q'}} \|q - q'\|^2$.
• Hierarchical clustering is usually based on factor projections, if desired using a limited number of factors (e.g. 7) in order to filter out the most useful information in our data.
• In such a case, hierarchical clustering can be seen to be a mapping of Euclidean distances into ultrametric distances.
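
The two formulas for the minimum variance criterion translate directly into code. In this sketch, clusters are represented by a centre (list of coordinates) and a mass; the function names and the example numbers are illustrative assumptions.

```python
def merged_centre(q, m_q, r, m_r):
    """Centre of the cluster obtained by agglomerating (q, m_q) and (r, m_r)."""
    return [(m_q * a + m_r * b) / (m_q + m_r) for a, b in zip(q, r)]

def ward_dissimilarity(q, m_q, r, m_r):
    """Minimum variance (Ward) dissimilarity between two clusters."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(q, r))
    return m_q * m_r / (m_q + m_r) * squared_distance

# A singleton at the origin and a cluster of three points centred at (2, 0).
q, m_q = [0.0, 0.0], 1
r, m_r = [2.0, 0.0], 3

print(merged_centre(q, m_q, r, m_r))       # [1.5, 0.0]
print(ward_dissimilarity(q, m_q, r, m_r))  # 3.0
```

The dissimilarity equals the increase in the within-cluster sum of squares caused by the merge, which is why it can serve as the agglomeration level.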

Efficient NN chain algorithm
[Figure: a NN-chain (nearest neighbour chain) over the points a, b, c, d, e.]

Efficient NN chain algorithm (cont'd.)
• An NN-chain consists of an arbitrary point followed by its NN; followed by the NN from among the remaining points of this second point; and so on until we necessarily have some pair of points which can be termed reciprocal or mutual NNs. (Such a pair of RNNs may be the first two points in the chain; and we have assumed that no two dissimilarities are equal.)
• In constructing a NN-chain, irrespective of the starting point, we may agglomerate a pair of RNNs as soon as they are found.
• Exactness of the resulting hierarchy is guaranteed when the cluster agglomeration criterion respects the reducibility property.
• Inversion impossible if: $d(i, j) < d(i, k)$ or $d(j, k) \Rightarrow d(i, j) < d(i \cup j, k)$
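
A compact sketch of the NN-chain idea, combined with the minimum variance dissimilarity, may help fix the procedure. It is written for readability rather than speed (each nearest-neighbour search scans all clusters), and the function names, the cluster numbering and the example points are assumptions of this illustration.

```python
import math

def ward(a, b):
    """Minimum variance dissimilarity between clusters given as (centre, mass)."""
    (ca, ma), (cb, mb) = a, b
    return ma * mb / (ma + mb) * sum((x - y) ** 2 for x, y in zip(ca, cb))

def nn_chain(points):
    """Return the agglomerations as (cluster_a, cluster_b, level) triples."""
    clusters = {i: (list(p), 1) for i, p in enumerate(points)}
    next_id = len(points)      # new clusters are numbered after the initial points
    merges, chain = [], []
    while len(clusters) > 1:
        if not chain:
            chain.append(next(iter(clusters)))   # arbitrary starting point
        while True:
            tip = chain[-1]
            prev = chain[-2] if len(chain) > 1 else None
            best, best_d = None, math.inf
            for c in clusters:
                if c == tip:
                    continue
                d = ward(clusters[tip], clusters[c])
                # On ties, prefer the previous chain element so the chain terminates.
                if d < best_d or (d == best_d and c == prev):
                    best, best_d = c, d
            if best == prev:
                break                            # tip and prev are reciprocal NNs
            chain.append(best)
        b, a = chain.pop(), chain.pop()
        (ca, ma), (cb, mb) = clusters.pop(a), clusters.pop(b)
        centre = [(ma * x + mb * y) / (ma + mb) for x, y in zip(ca, cb)]
        clusters[next_id] = (centre, ma + mb)
        merges.append((a, b, best_d))
        next_id += 1
    return merges

merges = nn_chain([[0.0], [1.0], [3.0], [7.0]])
for a, b, level in merges:
    print(a, b, round(level, 4))
```

After a merge, the remainder of the chain is kept and growth resumes from its last element; the reducibility property is what makes this safe. Merges may be found out of level order in general, so sorting by level gives the hierarchy.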

Minimum variance method: properties
• We seek to agglomerate two clusters, $c_1$ and $c_2$, into cluster $c$ such that the within-class variance of the partition thereby obtained is minimum.
• Alternatively, the between-class variance of the partition obtained is to be maximized.
• Let $P$ and $Q$ be the partitions prior to, and subsequent to, the agglomeration; let $p_1$, $p_2$, ... be classes of the partitions.
$P = \{p_1, p_2, \dots, p_k, c_1, c_2\}$
$Q = \{p_1, p_2, \dots, p_k, c\}$
• Total variance of the cloud of objects in $m$-dimensional space is decomposed into the sum of within-class variance and between-class variance. This is Huyghens' theorem in classical mechanics.
• Total variance, between-class variance, and within-class variance are as follows:
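
$V(I) = \frac{1}{n} \sum_{i \in I} (i - g)^2$; $V(P) = \sum_{p \in P} \frac{|p|}{n} (p - g)^2$; and $\frac{1}{n} \sum_{p \in P} \sum_{i \in p} (i - p)^2$. Here $g$ is the centre of gravity of the cloud, and $p$ denotes both a class and its centre.
• For two partitions, before and after an agglomeration, we have respectively:
$V(I) = V(P) + \sum_{p \in P} V(p)$
$V(I) = V(Q) + \sum_{p \in Q} V(p)$
• From this, it can be shown that the criterion to be optimized in agglomerating $c_1$ and $c_2$ into new class $c$ is:
$V(P) - V(Q) = V(c) - V(c_1) - V(c_2) = \frac{|c_1|\,|c_2|}{|c_1| + |c_2|} \|c_1 - c_2\|^2$
(up to the constant factor $1/n$ when the variances carry the $1/n$ normalization used above).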
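
The variance decomposition and the resulting criterion can be verified on a toy example. The data below are hypothetical, and the normalization (every variance divided by the total number of points $n$) is a choice made for this sketch, so the criterion appears with an explicit factor $1/n$.

```python
def centre(points):
    """Centre of gravity (coordinate-wise mean) of a list of points."""
    return [sum(c) / len(points) for c in zip(*points)]

def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

# Hypothetical 2-D data: one untouched class p1, and two classes c1, c2 to merge.
p1 = [[0.0, 0.0], [1.0, 0.0]]
c1 = [[5.0, 5.0], [6.0, 5.0], [5.0, 6.0]]
c2 = [[8.0, 5.0]]
everything = p1 + c1 + c2
n = len(everything)
g = centre(everything)

def between(partition):
    """V(P): between-class variance, with classes weighted by |p| / n."""
    return sum(len(p) / n * sq_dist(centre(p), g) for p in partition)

def within(partition):
    """Within-class variance: (1/n) * sum of squared distances to class centres."""
    return sum(sq_dist(i, centre(p)) for p in partition for i in p) / n

total = sum(sq_dist(i, g) for i in everything) / n    # V(I)
P = [p1, c1, c2]       # partition before the agglomeration
Q = [p1, c1 + c2]      # partition after merging c1 and c2

loss = between(P) - between(Q)                         # V(P) - V(Q)
ward = (len(c1) * len(c2) / (len(c1) + len(c2))) * sq_dist(centre(c1), centre(c2)) / n

print(total, between(P) + within(P), between(Q) + within(Q))
print(loss, ward)
```

Choosing at each step the pair with the smallest value of this quantity gives the smallest loss of between-class variance, which is the minimum variance criterion.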

FACOR and VACOR: Analysis of clusters
• The barycentric principle allows both row points and column points to be displayed simultaneously as projections.
• We therefore can consider:
  – simultaneous display of $I$ and $J$
  – tree on $I$
  – tree on $J$
• To help analyze these outputs we can explore the representation of clusters (derived from the hierarchical trees) in factor space, leading to programs traditionally called FACOR.
• And the representation of clusters in the profile coordinate space, leading to programs traditionally called VACOR.

• In the case of FACOR, for every couple $q, q'$ of a partition of $I$, we calculate $\frac{f_q f_{q'}}{f_q + f_{q'}} \|q - q'\|^2_{f_J}$. This can be decomposed using the axes of $\mathbb{R}^J$, as well as using the factorial axes.
• In the case of VACOR, we can explore the cluster dipoles, which take account of the "elder" and "younger" cluster components:

          n
         / \
        /   \
       /     \
    a(n)     b(n)

• We have $F_\alpha(n) = \left( f_{a(n)} F_\alpha(a(n)) + f_{b(n)} F_\alpha(b(n)) \right) / f_n$. We consider the vectors defining the dipole: $[n, a(n)]$ and $[n, b(n)]$.
• We then study the squared cosine of the angle between vector $[a(n), b(n)]$ and the factorial axis of rank $\alpha$.
• This squared cosine defines the relative contribution of the pair $q, \alpha$ to the level index $\nu(q)$ of the class $q$.

Summary
• Correspondence analysis displays observation profiles in a low-dimensional factorial space.
• Profiles are points endowed with $\chi^2$ distance.
• Under appropriate circumstances, the $\chi^2$ distance reduces to a Euclidean distance.
• A factorial space is nearly always Euclidean.
• Simultaneously a hierarchical clustering is built using the observation profiles.
• Usually one or a small number of partitions are derived from the hierarchical clustering.
• A hierarchical clustering defines an ultrametric distance.
• Input for the hierarchical clustering is usually factor projections.

• In summary, correspondence analysis involves mapping a $\chi^2$ distance into a particular Euclidean distance; and mapping this Euclidean distance into an ultrametric distance.
• The aim is to have different but complementary analytic tools to facilitate interpretation of our data.