Correspondence Analysis – F Murtagh 1 Correspondence Analysis Topics: Basics, and preliminary example (student exam scores) Metrics, clouds of points, masses, inertia Factors, decomposition of inertia, contributions, dual spaces Hierarchical agglomerative clustering Minimum variance criterion Examples in depth (ppt file) Java application: http://astro.u-strasbg.fr/ fmurtagh/mda-sw

expose · Title: expose.dvi Created Date: 7/19/2003 2:31:51 PM


  • Correspondence Analysis – F Murtagh

    Topics:
    Basics, and preliminary example (student exam scores)
    Metrics, clouds of points, masses, inertia
    Factors, decomposition of inertia, contributions, dual spaces
    Hierarchical agglomerative clustering
    Minimum variance criterion
    Examples in depth (ppt file)
    Java application: http://astro.u-strasbg.fr/~fmurtagh/mda-sw

  • Basics

    Observations × variables matrix.
    Through display and through quantitative measures, investigate relationships between observations, and between variables.
    Similar in these objectives to principal components analysis, multidimensional scaling, Kohonen self-organizing feature map, and others.
    Correspondence analysis is often used in conjunction with clustering.
    Input data, and input data coding, are the major issues which distinguish correspondence analysis from other algorithmically-similar (or alternative algorithmic) methods.

  • Scores: 5 students in 6 subjects

         CSc  CPg  CGr  CNw  DbM  SwE
    A     54   55   31   36   46   40
    B     35   56   20   20   49   45
    C     47   73   39   30   48   57
    D     54   72   33   42   57   21
    E     18   24   11   14   19    7

                   CSc  CPg  CGr  CNw  DbM  SwE
    mean profile:  .18  .24  .12  .12  .19  .15
    profile of D:  .19  .26  .12  .15  .20  .08
    profile of E:  .19  .26  .12  .15  .20  .08

    Scores (out of 100) of 5 students, A–E, in 6 subjects.
    Subjects: CSc: Computer Science Proficiency, CPg: Computer Programming, CGr: Computer Graphics, CNw: Computer Networks, DbM: Database Management, SwE: Software Engineering.
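
The profiles quoted in the table can be recomputed directly from the scores. The sketch below is only an illustration: the names `SCORES`, `row_profile` and `mean_profile` are hypothetical helpers, not part of the author's Java application. A row profile is the row divided by its total, and the mean profile is the vector of column totals divided by the grand total.

```python
# Scores of students A-E in the six subjects (table from the slide).
SUBJECTS = ["CSc", "CPg", "CGr", "CNw", "DbM", "SwE"]
SCORES = {
    "A": [54, 55, 31, 36, 46, 40],
    "B": [35, 56, 20, 20, 49, 45],
    "C": [47, 73, 39, 30, 48, 57],
    "D": [54, 72, 33, 42, 57, 21],
    "E": [18, 24, 11, 14, 19, 7],
}


def row_profile(row):
    """Divide each score by the row total, so that the profile sums to 1."""
    total = sum(row)
    return [value / total for value in row]


def mean_profile(table):
    """Column totals divided by the grand total of the table."""
    rows = list(table.values())
    grand_total = sum(sum(row) for row in rows)
    return [sum(column) / grand_total for column in zip(*rows)]


profiles = {name: row_profile(row) for name, row in SCORES.items()}
mean = mean_profile(SCORES)

if __name__ == "__main__":
    print("mean", [round(v, 2) for v in mean])
    for name in ("D", "E"):
        print(name, [round(v, 2) for v in profiles[name]])
```

Rounded to two decimals, the profiles of D and E coincide, which is the point the slides go on to make.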

  • Scores: 5 students in 6 subjects (cont'd.)

    Correspondence analysis highlights the similarities and the differences in the profiles.
    Note that all the scores of D and E are in the same proportion (E's scores are one-third those of D).
    Note also that E has the lowest scores both in absolute and relative terms in all the subjects.
    D and E have identical profiles: without data coding they would be located at the same location in the output display.
    Both D and E show a positive association with CNw (computer networks) and a negative association with SwE (software engineering) because, in comparison with the mean profile, D and E have in their profile a relatively larger component of CNw and a relatively smaller component of SwE.
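
To see why D and E would coincide in the display, one can compute the squared $\chi^2$ distance between two row profiles, $\sum_j (f_j^i - f_j^{i'})^2 / f_j$, the distance these slides use for profiles. This is a minimal sketch; `chi2_distance_sq` is a hypothetical helper name.

```python
SCORES = {
    "A": [54, 55, 31, 36, 46, 40],
    "B": [35, 56, 20, 20, 49, 45],
    "C": [47, 73, 39, 30, 48, 57],
    "D": [54, 72, 33, 42, 57, 21],
    "E": [18, 24, 11, 14, 19, 7],
}


def chi2_distance_sq(table, a, b):
    """Squared chi-squared distance between the profiles of rows a and b."""
    rows = list(table.values())
    grand_total = sum(sum(row) for row in rows)
    # Column masses f_j weight each coordinate by 1 / f_j.
    col_mass = [sum(column) / grand_total for column in zip(*rows)]
    profile_a = [v / sum(table[a]) for v in table[a]]
    profile_b = [v / sum(table[b]) for v in table[b]]
    return sum((x - y) ** 2 / m
               for x, y, m in zip(profile_a, profile_b, col_mass))


if __name__ == "__main__":
    print("d2(D, E) =", chi2_distance_sq(SCORES, "D", "E"))
    print("d2(A, B) =", chi2_distance_sq(SCORES, "A", "B"))
```

The distance between D and E vanishes because their rows are proportional, whereas A and B are separated.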

  • Data coding: doubling (principle)

    We need to clearly differentiate between the profiles of D and E, which we do by doubling the data.
    Doubling: we attribute two scores per subject instead of a single score. The "score awarded", $k(i, j^+)$, is equal to the initial score. The "score not awarded", $k(i, j^-)$, is equal to its complement, i.e., $100 - k(i, j^+)$.
    Lever principle: a "+" variable and its corresponding "−" variable lie on opposite sides of the origin and are collinear with it.
    And: if the mass of the profile of $j^+$ is greater than the mass of the profile of $j^-$ (which means that the average score for the subject was greater than 50 out of 100), the point $j^+$ is closer to the origin than $j^-$.
    We will find that, except in CPg, the average score of the students was below 50 in all the subjects.

  • Data coding: Doubling

         CSc+ CSc-  CPg+ CPg-  CGr+ CGr-  CNw+ CNw-  DbM+ DbM-  SwE+ SwE-
    A     54   46    55   45    31   69    36   64    46   54    40   60
    B     35   65    56   44    20   80    20   80    49   51    45   55
    C     47   53    73   27    39   61    30   70    48   52    57   43
    D     54   46    72   28    33   67    42   58    57   43    21   79
    E     18   82    24   76    11   89    14   86    19   81     7   93

    Doubled table of scores derived from previous table. Note: all rows now have the same total.
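
The doubling step is mechanical, so a short sketch may help. It assumes a maximum score of 100, as in the slides; `double_row` and `chi2_distance_sq` are hypothetical helper names. After doubling, every row totals 600 and the profiles of D and E no longer coincide.

```python
SCORES = {
    "A": [54, 55, 31, 36, 46, 40],
    "B": [35, 56, 20, 20, 49, 45],
    "C": [47, 73, 39, 30, 48, 57],
    "D": [54, 72, 33, 42, 57, 21],
    "E": [18, 24, 11, 14, 19, 7],
}


def double_row(row, max_score=100):
    """Replace each score s by the pair (s, max_score - s)."""
    doubled = []
    for score in row:
        doubled.extend([score, max_score - score])
    return doubled


def chi2_distance_sq(table, a, b):
    """Squared chi-squared distance between the profiles of rows a and b."""
    rows = list(table.values())
    grand_total = sum(sum(row) for row in rows)
    col_mass = [sum(column) / grand_total for column in zip(*rows)]
    profile_a = [v / sum(table[a]) for v in table[a]]
    profile_b = [v / sum(table[b]) for v in table[b]]
    return sum((x - y) ** 2 / m
               for x, y, m in zip(profile_a, profile_b, col_mass))


DOUBLED = {name: double_row(row) for name, row in SCORES.items()}

if __name__ == "__main__":
    print(DOUBLED["E"])
    print({name: sum(row) for name, row in DOUBLED.items()})
    print("d2(D, E) after doubling =", chi2_distance_sq(DOUBLED, "D", "E"))
```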

  • [Figure: factor plane for the doubled scores table. Axes: Factor 1 (77% inertia) and Factor 2 (18% inertia). Points plotted: students A, B, C, D, E and the doubled subject variables CSc+, CSc-, CPg+, CPg-, CGr+, CGr-, CNw+, CNw-, DbM+, DbM-, SwE+, SwE-.]

  • Metrics

    The notion of distance is crucial, since we want to investigate relationships between observations and/or variables.
    Recall: $x = \{3, 4, 1, 2\}$, $y = \{1, 3, 0, 1\}$, then: scalar product $\langle x, y \rangle = \langle y, x \rangle = x'y = 3 \times 1 + 4 \times 3 + 1 \times 0 + 2 \times 1 = 17$.
    Euclidean norm: $\|x\|^2 = x'x = \sum_i x_i^2$.
    Euclidean distance: $d(x, y) = \|x - y\|$. The squared Euclidean distance is: $\|x - y\|^2 = \langle x - y, x - y \rangle$.
    Orthogonality: $x$ is orthogonal to $y$ if $\langle x, y \rangle = 0$.
    Distance is symmetric ($d(x, y) = d(y, x)$), positive ($d(x, y) \geq 0$), and definite ($d(x, y) = 0 \Rightarrow x = y$).

  • Metrics (cont'd.)

    Any symmetric, positive, definite matrix $M$ defines a generalized Euclidean space. Scalar product is $\langle x, y \rangle_M = x'My$, norm is $\|x\|_M^2 = x'Mx$, and Euclidean distance is $d(x, y) = \|x - y\|_M$.
    Classical case: $M = I_n$, the identity matrix.
    Normalization to unit variance: $M$ is the diagonal matrix with $i$th diagonal term $1/\sigma_i^2$.
    Mahalanobis distance: $M$ is the inverse variance-covariance matrix.
    Next topic: scalar product defines orthogonal projection.

  • Metrics (cont'd.)

    Projected value, projection, coordinate: $x_1 = (x'Mu / u'Mu)\, u$. Here $x_1$ and $u$ are both vectors.
    Norm of vector $x_1$: $\|x_1\|_M = (x'Mu / u'Mu)\, \|u\|_M = x'Mu / \|u\|_M$.
    The quantity $x'Mu / (\|x\|_M \|u\|_M)$ can be interpreted as the cosine of the angle $a$ between vectors $x$ and $u$.
    [Diagram: vector $x$ drawn from the origin $O$ at angle $a$ to the axis $u$, with its orthogonal projection $x_1$ on $u$.]
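
The generalized scalar product, norm, distance and projection can be written out in a few lines. This is a sketch under the assumption that $M$ is supplied as a symmetric positive definite matrix in list-of-lists form; all function names are illustrative.

```python
import math


def scalar_product(x, y, M):
    """<x, y>_M = x' M y for a symmetric positive definite matrix M."""
    return sum(x[i] * M[i][j] * y[j]
               for i in range(len(x)) for j in range(len(y)))


def norm(x, M):
    """Generalized Euclidean norm ||x||_M."""
    return math.sqrt(scalar_product(x, x, M))


def distance(x, y, M):
    """Generalized Euclidean distance ||x - y||_M."""
    return norm([a - b for a, b in zip(x, y)], M)


def project(x, u, M):
    """Orthogonal projection x1 = (x'Mu / u'Mu) u of x on the axis u."""
    coefficient = scalar_product(x, u, M) / scalar_product(u, u, M)
    return [coefficient * component for component in u]


def cosine(x, u, M):
    """Cosine of the angle between x and u in the metric M."""
    return scalar_product(x, u, M) / (norm(x, M) * norm(u, M))


IDENTITY = [[1.0, 0.0], [0.0, 1.0]]          # classical Euclidean case
UNIT_VARIANCE = [[0.25, 0.0], [0.0, 1.0]]    # diagonal terms 1 / sigma_i^2

if __name__ == "__main__":
    x, u = [3.0, 4.0], [1.0, 0.0]
    print(project(x, u, IDENTITY), cosine(x, u, IDENTITY))
    print(distance([2.0, 0.0], [0.0, 0.0], UNIT_VARIANCE))
```

With the identity matrix the projection of (3, 4) on the first axis is (3, 0) and the cosine is 3/5; with the diagonal metric a first coordinate with standard deviation 2 is rescaled before distances are taken.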

  • Metrics (cont'd.)

    Consider the case of centred $n$-valued coordinates or variables, $x_i$.
    The sum of variable vectors is a constant, proportional to the mean variable.
    Therefore the centred vectors lie on a hyperplane $H$, or a sub-space, of dimension $n - 1$.
    Consider a probability distribution $p$ defined on $I$, i.e. for all $i$ we have $p_i > 0$ (note: $> 0$ to avoid the inconvenience of a lower-dimensional subspace) and $\sum_{i \in I} p_i = 1$.
    Matrix used for variances and covariances: $M_{p_I}$, the diagonal matrix with diagonal elements consisting of the $p_i$ terms.
    Have: $x' M_{p_I} x = \sum_{i \in I} p_i x_i^2 = \mathrm{var}(x)$; and $x' M_{p_I} y = \sum_{i \in I} p_i x_i y_i = \mathrm{cov}(x, y)$.

  • Metrics (cont'd.)

    Use of metric $M_{p_I}$ on $I$ is associated with the following $\chi^2$ distance relative to centre $p_I$.
    This new distance is a generalized Euclidean $M_{1/p_I}$ metric.
    Let both $q_I$ and $r_I$ be probability densities. Then:
    $\|q_I - r_I\|_{p_I}^2 = \sum_{i \in I} (q_i - r_i)^2 / p_i$.
    Link with $\chi^2$ statistic: let $p_{IJ}$ be a data table of probabilities derived from frequencies or counts, $p_{IJ} = \{p_{ij} \mid i \in I, j \in J\}$.
    Marginals of this table are $p_I$ and $p_J$. Consider independence of effects, where the data table is $q_{IJ} = p_I p_J$.
    Then the $\chi^2$ distance of centre $q_{IJ}$ between the densities $p_{IJ}$ and $q_{IJ}$ is
    $\|p_{IJ} - q_{IJ}\|_{q_{IJ}}^2 = \sum_{(i,j) \in I \times J} (p_{ij} - p_i p_j)^2 / (p_i p_j)$.

  • Metrics (cont'd.)

    With the coefficient $n$, this is the quantity which can be assessed with a $\chi^2$ test with $n - 1$ degrees of freedom.
    The $\chi^2$ distance is used in correspondence analysis. Clearly, under appropriate circumstances (when $p_I = p_J =$ constant) it becomes a classical Euclidean distance.
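
The link with the $\chi^2$ statistic can be checked numerically. The sketch below assumes the coefficient is the grand total of the contingency table (my reading of the slide, not something it states explicitly): multiplying the $\chi^2$ distance to the independence table by that total gives the usual Pearson statistic $\sum (O - E)^2 / E$. Function names are illustrative.

```python
def chi2_distance_to_independence(table):
    """Sum over cells of (p_ij - p_i p_j)^2 / (p_i p_j)."""
    k = sum(sum(row) for row in table)
    p_i = [sum(row) / k for row in table]
    p_j = [sum(column) / k for column in zip(*table)]
    return sum((table[i][j] / k - p_i[i] * p_j[j]) ** 2 / (p_i[i] * p_j[j])
               for i in range(len(table)) for j in range(len(table[0])))


def pearson_chi2(table):
    """Classical statistic: sum of (observed - expected)^2 / expected."""
    k = sum(sum(row) for row in table)
    row_totals = [sum(row) for row in table]
    col_totals = [sum(column) for column in zip(*table)]
    statistic = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / k
            statistic += (observed - expected) ** 2 / expected
    return statistic


COUNTS = [[20, 30], [30, 20]]   # made-up 2 x 2 contingency table

if __name__ == "__main__":
    total = sum(sum(row) for row in COUNTS)
    print(total * chi2_distance_to_independence(COUNTS), pearson_chi2(COUNTS))
```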

  • Input data table, marginals, and masses

    The given contingency table data are denoted $k_{IJ} = \{k_{IJ}(i, j) = k(i, j);\ i \in I, j \in J\}$.
    We have $k(i) = \sum_{j \in J} k(i, j)$. Analogously $k(j)$ is defined, and $k = \sum_{i \in I, j \in J} k(i, j)$.
    From frequencies to probabilities: $f_{IJ} = \{f_{ij} = k(i, j)/k;\ i \in I, j \in J\} \subset \mathbb{R}_{I \times J}$; similarly $f_I$ is defined as $\{f_i = k(i)/k;\ i \in I\} \subset \mathbb{R}_I$, and $f_J$ analogously.
    The conditional distribution of $f_J$ knowing $i \in I$, also termed the $i$th profile with coordinates indexed by the elements of $J$, is
    $f_J^i = \{f_j^i = f_{ij}/f_i = (k_{ij}/k)/(k_i/k);\ f_i \neq 0;\ j \in J\}$
    and likewise for $f_I^j$.

  • Clouds of points, masses, and inertia

    Moment of inertia of a cloud of points in a Euclidean space, with both distances and masses defined:
    $M^2(N_J(I)) = \sum_{i \in I} f_i \|f_J^i - f_J\|_{f_J}^2 = \sum_{i \in I} f_i \rho^2(i)$.
    Here: $\rho$ is the Euclidean distance from the cloud centre, and $f_i$ is the mass of element $i$.
    The mass is the marginal distribution of the input data table.
    Correspondence analysis is, as will be seen, a decomposition of the inertia of a cloud of points, endowed with masses.
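
As an illustration of the moment of inertia, the sketch below computes $\sum_i f_i \rho^2(i)$ for the cloud of row profiles of a small made-up table. `cloud_inertia` is a hypothetical name. Applying the same function to the transposed table gives the inertia of the column cloud, which agrees with that of the row cloud.

```python
def cloud_inertia(table):
    """Inertia of the cloud of row profiles: sum of f_i * rho^2(i)."""
    k = sum(sum(row) for row in table)
    f_i = [sum(row) / k for row in table]               # row masses
    f_j = [sum(column) / k for column in zip(*table)]   # cloud centre
    inertia = 0.0
    for row, mass in zip(table, f_i):
        profile = [value / sum(row) for value in row]
        # Squared chi-squared distance of the profile from the centre f_J.
        rho_sq = sum((p - c) ** 2 / c for p, c in zip(profile, f_j))
        inertia += mass * rho_sq
    return inertia


def transpose(table):
    return [list(column) for column in zip(*table)]


COUNTS = [[20, 30], [30, 20]]          # made-up contingency table
SCORES = [
    [54, 55, 31, 36, 46, 40],
    [35, 56, 20, 20, 49, 45],
    [47, 73, 39, 30, 48, 57],
    [54, 72, 33, 42, 57, 21],
    [18, 24, 11, 14, 19, 7],
]

if __name__ == "__main__":
    print(cloud_inertia(COUNTS))
    print(cloud_inertia(SCORES), cloud_inertia(transpose(SCORES)))
```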

  • Inertia and Distributional Equivalence

    Another expression for inertia:
    $M^2(N_J(I)) = M^2(N_I(J)) = \|f_{IJ} - f_I f_J\|_{f_I f_J}^2 = \sum_{i \in I, j \in J} (f_{ij} - f_i f_j)^2 / (f_i f_j)$.
    The term $\|f_{IJ} - f_I f_J\|_{f_I f_J}^2$ is the $\chi^2$ metric between the probability distribution $f_{IJ}$ and the product of marginal distributions $f_I f_J$, with as centre of the metric the product $f_I f_J$.
    Principle of distributional equivalence: Consider two elements $j_1$ and $j_2$ of $J$ with identical profiles: i.e. $f_I^{j_1} = f_I^{j_2}$.
    Consider now that elements (or columns) $j_1$ and $j_2$ are replaced with a new element $j_s$ such that the new coordinates are aggregated profiles, $f_{i j_s} = f_{i j_1} + f_{i j_2}$, and the new masses are similarly aggregated: $f_{j_s} = f_{j_1} + f_{j_2}$.
    Then there is no effect on the distribution of distances between elements of $I$. The distance between elements of $J$, other than $j_1$ and $j_2$, is naturally not modified.
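
The principle of distributional equivalence can be verified on a toy table. In this sketch (made-up data, hypothetical function names) the first two columns are proportional, so they have identical profiles; merging them leaves every $\chi^2$ distance between rows unchanged.

```python
from itertools import combinations


def row_distances_sq(table):
    """Squared chi-squared distances between all pairs of row profiles."""
    k = sum(sum(row) for row in table)
    f_j = [sum(column) / k for column in zip(*table)]
    profiles = [[value / sum(row) for value in row] for row in table]
    return {
        (a, b): sum((x - y) ** 2 / m
                    for x, y, m in zip(profiles[a], profiles[b], f_j))
        for a, b in combinations(range(len(table)), 2)
    }


def merge_columns(table, j1, j2):
    """Replace columns j1 and j2 by a single column holding their sum."""
    merged = []
    for row in table:
        new_row = [v for j, v in enumerate(row) if j not in (j1, j2)]
        new_row.append(row[j1] + row[j2])
        merged.append(new_row)
    return merged


# Column 1 is twice column 0, so both columns have the same profile.
TABLE = [
    [10, 20, 5],
    [30, 60, 7],
    [20, 40, 9],
]

if __name__ == "__main__":
    before = row_distances_sq(TABLE)
    after = row_distances_sq(merge_columns(TABLE, 0, 1))
    for pair in before:
        print(pair, before[pair], after[pair])
```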

  • Inertia and Distributional Equivalence (cont'd.)

    The principle of distributional equivalence leads to representational self-similarity: aggregation of rows or columns, as defined above, leads to the same analysis.
    Therefore it is very appropriate to analyze a contingency table with fine granularity, and seek in the analysis to merge rows or columns, through aggregation.

  • Factors

    Correspondence Analysis produces an ordered sequence of pairs, called factors, $(F_\alpha, G_\alpha)$, associated with real numbers called eigenvalues, $0 \leq \lambda_\alpha \leq 1$.
    We denote $F_\alpha(i)$ the value of the factor of rank $\alpha$ for element $i$ of $I$; and similarly $G_\alpha(j)$ is the value of the factor of rank $\alpha$ for element $j$ of $J$.
    We see that $F$ is a function on $I$, and $G$ is a function on $J$.
    The number of eigenvalues and associated factor couples is: $\alpha = 1, 2, \ldots, N = \min(|I| - 1, |J| - 1)$, where $|\cdot|$ denotes set cardinality.

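  • Properties of factors

    $\sum_{i \in I} f_i F_\alpha(i) = 0$;  $\sum_{j \in J} f_j G_\alpha(j) = 0$
    $\sum_{i \in I} f_i F_\alpha^2(i) = \lambda_\alpha$;  $\sum_{j \in J} f_j G_\alpha^2(j) = \lambda_\alpha$
    $\sum_{i \in I} f_i F_\alpha(i) F_\beta(i) = \lambda_\alpha \delta_{\alpha\beta}$;  $\sum_{j \in J} f_j G_\alpha(j) G_\beta(j) = \lambda_\alpha \delta_{\alpha\beta}$
    Notation: $\delta_{\alpha\beta} = 0$ if $\alpha \neq \beta$ and $= 1$ if $\alpha = \beta$.
    Normalized factors: on the sets $I$ and $J$, we next define the functions $\phi_I$ and $\psi_J$ of zero mean, of unit variance, pairwise uncorrelated on $I$ (resp. $J$), and associated with masses $f_I$ (resp. $f_J$).
    $\sum_{i \in I} f_i \phi_\alpha(i) = 0$;  $\sum_{j \in J} f_j \psi_\alpha(j) = 0$
    $\sum_{i \in I} f_i \phi_\alpha^2(i) = 1$;  $\sum_{j \in J} f_j \psi_\alpha^2(j) = 1$
    $\sum_{i \in I} f_i \phi_\alpha(i) \phi_\beta(i) = \delta_{\alpha\beta}$;  $\sum_{j \in J} f_j \psi_\alpha(j) \psi_\beta(j) = \delta_{\alpha\beta}$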

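  • Properties of factors (cont'd.)

    Between unnormalized and normalized factors, we have the following relations.
    $\phi_\alpha(i) = \lambda_\alpha^{-1/2} F_\alpha(i) \quad \forall i \in I,\ \forall \alpha = 1, 2, \ldots, N$
    $\psi_\alpha(j) = \lambda_\alpha^{-1/2} G_\alpha(j) \quad \forall j \in J,\ \forall \alpha = 1, 2, \ldots, N$
    The moment of inertia of the clouds $N_J(I)$ and $N_I(J)$ in the direction of the $\alpha$ axis is $\lambda_\alpha$.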

  • Forward transform

    Have that the $\chi^2$ metric is defined in direct space, i.e. space of profiles.
    The Euclidean metric is defined for the factors.
    We can characterize correspondence analysis as the mapping of a cloud in $\chi^2$ space to Euclidean space.
    Distances between profiles are as follows.
    $\|f_J^i - f_J^{i'}\|_{f_J}^2 = \sum_{j \in J} (f_j^i - f_j^{i'})^2 / f_j = \sum_{\alpha = 1..N} (F_\alpha(i) - F_\alpha(i'))^2$
    $\|f_I^j - f_I^{j'}\|_{f_I}^2 = \sum_{i \in I} (f_i^j - f_i^{j'})^2 / f_i = \sum_{\alpha = 1..N} (G_\alpha(j) - G_\alpha(j'))^2$
    Norm, or distance of a point $i \in N_J(I)$ from the origin or centre of gravity of the cloud $N_J(I)$, is as follows.
    $\rho^2(i) = \|f_J^i - f_J\|_{f_J}^2 = \sum_{\alpha = 1..N} F_\alpha^2(i)$
    $\rho^2(j) = \|f_I^j - f_I\|_{f_I}^2 = \sum_{\alpha = 1..N} G_\alpha^2(j)$

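  • Inverse transform

    The correspondence analysis transform, taking profiles into a factor space, is reversed with no loss of information as follows, $\forall (i, j) \in I \times J$.
    $f_{ij} = f_i f_j \left( 1 + \sum_{\alpha = 1..N} \lambda_\alpha^{-1/2} F_\alpha(i) G_\alpha(j) \right)$
    For profiles we have the following.
    $f_i^j = f_i \left( 1 + \sum_\alpha \lambda_\alpha^{-1/2} F_\alpha(i) G_\alpha(j) \right)$
    $f_j^i = f_j \left( 1 + \sum_\alpha \lambda_\alpha^{-1/2} F_\alpha(i) G_\alpha(j) \right)$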

  • Decomposition of inertia

    The distance of a point from the centre of gravity of the cloud is as follows.
    $\rho^2(i) = \|f_J^i - f_J\|_{f_J}^2 = \sum_{j \in J} (f_j^i - f_j)^2 / f_j$
    Decomposition of the cloud's inertia is as follows.
    $M^2(N_J(I)) = \sum_{\alpha = 1..N} \lambda_\alpha = \sum_{i \in I} f_i \rho^2(i)$
    In greater detail, we have the following for this decomposition.
    $\lambda_\alpha = \sum_{i \in I} f_i F_\alpha^2(i)$ and $\rho^2(i) = \sum_{\alpha = 1..N} F_\alpha^2(i)$

  • Relative and absolute contributions

    $f_i \rho^2(i)$ is the absolute contribution of point $i$ to the inertia of the cloud, $M^2(N_J(I))$, or the variance of point $i$.
    $f_i F_\alpha^2(i)$ is the absolute contribution of point $i$ to the moment of inertia $\lambda_\alpha$.
    $f_i F_\alpha^2(i) / \lambda_\alpha$ is the relative contribution of point $i$ to the moment of inertia $\lambda_\alpha$. (Often denoted CTR.)
    $F_\alpha^2(i)$ is the contribution of point $i$ to the $\chi^2$ distance between $i$ and the centre of the cloud $N_J(I)$.
    $\cos^2 a = F_\alpha^2(i) / \rho^2(i)$ is the relative contribution of the factor $\alpha$ to point $i$. (Often denoted COR.)
    Based on the latter term, we have: $\sum_{\alpha = 1..N} F_\alpha^2(i) / \rho^2(i) = 1$.
    Analogous formulas hold for the points $j$ in the cloud $N_I(J)$.

  • Reduction of dimensionality

    Interpretation is usually limited to the first few factors.
    Decomposition of inertia is usually far less decisive than (cumulative) percentage variance explained in principal components analysis. One reason for this: in CA, recoding often tends to bring input data coordinates closer to vertices of a hypercube.
    QLT$(i) = \sum_{\alpha = 1..N'} \cos^2 a$, where angle $a$ has been defined above (previous section) and where $N' < N$, is the quality of representation of element $i$ in the factor space of dimension $N'$.
    INR$(i) = \rho^2(i)$ is the distance of element $i$ from the centre of gravity of the cloud.
    POID$(i) = f_i$ is the mass or marginal frequency of the element $i$.

  • Interpretation of results

    1. Projections onto factors 1 and 2, 2 and 3, 1 and 3, etc. of set $I$, set $J$, or both sets simultaneously.
    2. Spectrum of non-increasing values of eigenvalues.
    3. Interpretation of axes. We can distinguish between the general (latent semantic, conceptual) meaning of axes, and axes which have something specific to say about groups of elements. Usually contrast is important: what is found to be analogous at one extremity versus the other extremity; or oppositions or polarities.
    4. Factors are determined by how much the elements contribute to their dispersion. Therefore the values of CTR are examined in order to identify or to name the factors (for example, with higher order concepts). (Informally, CTR allows us to work from the elements towards the factors.)
    5. The values of COR are squared cosines, which can be considered as being like correlation coefficients. If COR$(i, \alpha)$ is large (say, around 0.8) then we can say that that element is well explained by the axis of rank $\alpha$. (Informally, COR allows us to work from the factors towards the elements.)

  • Analysis of the dual spaces

    We have the following.
    $F_\alpha(i) = \lambda_\alpha^{-1/2} \sum_{j \in J} f_j^i G_\alpha(j)$ for $\alpha = 1, 2, \ldots, N$; $i \in I$
    $G_\alpha(j) = \lambda_\alpha^{-1/2} \sum_{i \in I} f_i^j F_\alpha(i)$ for $\alpha = 1, 2, \ldots, N$; $j \in J$
    These are termed the transition formulas. The coordinate of element $i \in I$ is the barycentre of the coordinates of the elements $j \in J$, with associated masses of value given by the coordinates $f_j^i$ of the profile $f_J^i$. This is all to within the $\lambda_\alpha^{-1/2}$ constant.

  • Analysis of the dual spaces (cont'd.)

    We also have the following.
    $\phi_\alpha(i) = \lambda_\alpha^{-1/2} \sum_{j \in J} f_j^i \psi_\alpha(j)$
    $\psi_\alpha(j) = \lambda_\alpha^{-1/2} \sum_{i \in I} f_i^j \phi_\alpha(i)$
    This implies that we can pass easily from one space to the other. I.e. we carry out the diagonalization, or eigen-reduction, in the more computationally favourable space, which is usually $\mathbb{R}^J$.
    In the output display, the barycentric principle comes into play: this allows us to simultaneously view and interpret observations and attributes.
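
The transition formulas also suggest a simple way to obtain the first factor without a full eigen-reduction: alternate between the two barycentric averages and re-standardise, a power-iteration scheme often called reciprocal averaging. This is a swapped-in illustrative technique, not the diagonalization the slides refer to, and `first_factor` and `transition_residual` are hypothetical names. The input is the doubled scores table from the earlier slide.

```python
import math

DOUBLED = [
    [54, 46, 55, 45, 31, 69, 36, 64, 46, 54, 40, 60],
    [35, 65, 56, 44, 20, 80, 20, 80, 49, 51, 45, 55],
    [47, 53, 73, 27, 39, 61, 30, 70, 48, 52, 57, 43],
    [54, 46, 72, 28, 33, 67, 42, 58, 57, 43, 21, 79],
    [18, 82, 24, 76, 11, 89, 14, 86, 19, 81, 7, 93],
]


def _profiles(table):
    k = sum(sum(row) for row in table)
    f_i = [sum(row) / k for row in table]
    f_j = [sum(col) / k for col in zip(*table)]
    row_prof = [[v / sum(row) for v in row] for row in table]          # f_j^i
    col_prof = [[v / sum(col) for v in col] for col in zip(*table)]    # f_i^j
    return f_i, f_j, row_prof, col_prof


def _standardise(values, masses):
    """Centre and scale to zero weighted mean and unit weighted variance."""
    mean = sum(m * v for m, v in zip(masses, values))
    centred = [v - mean for v in values]
    sd = math.sqrt(sum(m * v * v for m, v in zip(masses, centred)))
    return [v / sd for v in centred]


def first_factor(table, iterations=500):
    """Return (lambda_1, F_1, G_1) by iterating the transition formulas."""
    f_i, f_j, row_prof, col_prof = _profiles(table)
    psi = _standardise([float(j) for j in range(len(f_j))], f_j)  # arbitrary start
    for _ in range(iterations):
        F = [sum(p * s for p, s in zip(prof, psi)) for prof in row_prof]
        g = [sum(p * f for p, f in zip(prof, F)) for prof in col_prof]
        psi = _standardise(g, f_j)
    F = [sum(p * s for p, s in zip(prof, psi)) for prof in row_prof]
    lam = sum(m * v * v for m, v in zip(f_i, F))   # inertia along the axis
    G = [math.sqrt(lam) * s for s in psi]
    return lam, F, G


def transition_residual(table, lam, F, G):
    """Largest deviation from G(j) = lam^(-1/2) * sum_i f_i^j F(i)."""
    _, _, _, col_prof = _profiles(table)
    return max(abs(G[j] - sum(p * f for p, f in zip(col_prof[j], F)) / math.sqrt(lam))
               for j in range(len(G)))


if __name__ == "__main__":
    lam, F, G = first_factor(DOUBLED)
    print("lambda_1 =", lam)
    print("F_1 =", [round(v, 3) for v in F])
```

The sign of the factor is arbitrary, as with any eigenvector; only the relative positions of the points along the axis are meaningful.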

  • Supplementary elements

    Overly-preponderant elements (i.e. row or column profiles), or exceptional elements (e.g. a sex attribute, given other performance or behavioural attributes) may be placed as supplementary elements.
    This means that they are given zero mass in the analysis, and their projections are determined using the transition formulas.
    This amounts to carrying out a correspondence analysis first, without these elements, and then projecting them into the factor space following the determination of all properties of this space.

  • Summary

    Space $\mathbb{R}^m$:
    1. $n$ row points, each of $m$ coordinates.
    2. The $j$th coordinate is $x_{ij}/x_i$.
    3. The mass of point $i$ is $x_i$.
    4. The $\chi^2$ distance between row points $i$ and $k$ is:
       $d^2(i, k) = \sum_j \frac{1}{x_j} \left( \frac{x_{ij}}{x_i} - \frac{x_{kj}}{x_k} \right)^2$
       Hence this is a Euclidean distance, with respect to the weighting $1/x_j$ (for all $j$), between profile values $x_{ij}/x_i$ etc.
    5. The criterion to be optimized: the weighted sum of squares of projections, where the weighting is given by $x_i$ (for all $i$).

  • Summary (cont'd.)

    Space $\mathbb{R}^n$:
    1. $m$ column points, each of $n$ coordinates.
    2. The $i$th coordinate is $x_{ij}/x_j$.
    3. The mass of point $j$ is $x_j$.
    4. The $\chi^2$ distance between column points $g$ and $j$ is:
       $d^2(g, j) = \sum_i \frac{1}{x_i} \left( \frac{x_{ig}}{x_g} - \frac{x_{ij}}{x_j} \right)^2$
       Hence this is a Euclidean distance, with respect to the weighting $1/x_i$ (for all $i$), between profile values $x_{ig}/x_g$ etc.
    5. The criterion to be optimized: the weighted sum of squares of projections, where the weighting is given by $x_j$ (for all $j$).


  • Hierarchical clustering

    Hierarchical agglomeration on $n$ observation vectors, $i \in I$, involves a series of $1, 2, \ldots, n - 1$ pairwise agglomerations of observations or clusters, with the following properties.
    A hierarchy $H = \{q \mid q \in 2^I\}$ such that:
    1. $I \in H$
    2. $i \in H\ \forall i$
    3. for each $q \in H$, $q' \in H$: $q \cap q' \neq \emptyset \Rightarrow q \subset q'$ or $q' \subset q$
    An indexed hierarchy is the pair $(H, \nu)$ where the positive function defined on $H$, i.e., $\nu : H \rightarrow \mathbb{R}^+$, satisfies:
    1. $\nu(i) = 0$ if $i \in H$ is a singleton
    2. $q \subset q' \Rightarrow \nu(q) < \nu(q')$
    Function $\nu$ is the agglomeration level.

  • Hierarchical clustering (cont'd.)

    Take $q, q'$, let $q \subset q''$ and $q' \subset q''$, and let $q''$ be the lowest level cluster for which this is true. Then if we define $D(q, q') = \nu(q'')$, $D$ is an ultrametric.
    Recall: Distances satisfy the triangle inequality $d(x, z) \leq d(x, y) + d(y, z)$. An ultrametric satisfies $d(x, z) \leq \max(d(x, y), d(y, z))$. In an ultrametric space triangles formed by any three points are isosceles. An ultrametric is a special distance associated with rooted trees. Ultrametrics are used in other fields also – in quantum mechanics, numerical optimization, number theory, and algorithmic logic.
    In practice, we start with a Euclidean distance or other dissimilarity, use some criterion such as minimizing the change in variance resulting from the agglomerations, and then define $\nu(q)$ as the dissimilarity associated with the agglomeration carried out.
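
The ultrametric inequality is easy to test on a small dissimilarity matrix. In this sketch (illustrative names, made-up values) the first matrix holds the agglomeration levels of a three-point tree, while the second holds ordinary distances between the points 0, 1 and 3 on a line, which satisfy the triangle inequality but not the ultrametric one.

```python
from itertools import permutations


def is_ultrametric(d, tol=1e-12):
    """Check d(x, z) <= max(d(x, y), d(y, z)) for every triple of points."""
    n = len(d)
    return all(d[x][z] <= max(d[x][y], d[y][z]) + tol
               for x, y, z in permutations(range(n), 3))


# Tree distances: points 0 and 1 merge at level 1, point 2 joins at level 3.
TREE_DISTANCES = [
    [0, 1, 3],
    [1, 0, 3],
    [3, 3, 0],
]

# Euclidean distances between the points 0, 1 and 3 on the real line.
LINE_DISTANCES = [
    [0, 1, 3],
    [1, 0, 2],
    [3, 2, 0],
]

if __name__ == "__main__":
    print(is_ultrametric(TREE_DISTANCES), is_ultrametric(LINE_DISTANCES))
```

In the tree matrix every triangle is isosceles with the two larger sides equal, as the slide notes.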

  • Minimum variance agglomeration

    For Euclidean distance inputs, the following definitions hold for the minimum variance or Ward error sum of squares agglomerative criterion.
    Coordinates of the new cluster center, following agglomeration of $q$ and $q'$, where $m_q$ is the mass of cluster $q$ defined as cluster cardinality, and (vector) $q$ denotes, using overloaded notation, the center of (set) cluster $q$:
    $q'' = (m_q q + m_{q'} q') / (m_q + m_{q'})$.
    Following the agglomeration of $q$ and $q'$, we define the following dissimilarity:
    $(m_q m_{q'}) / (m_q + m_{q'})\, \|q - q'\|^2$.
    Hierarchical clustering is usually based on factor projections, if desired using a limited number of factors (e.g. 7) in order to filter out the most useful information in our data.
    In such a case, hierarchical clustering can be seen to be a mapping of Euclidean distances into ultrametric distances.
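
The two Ward formulas translate directly into code. A minimal sketch, with cluster centres as coordinate lists and masses as cluster cardinalities; the function names are illustrative.

```python
def merged_centre(q, m_q, r, m_r):
    """Centre of the cluster obtained by agglomerating q and r."""
    return [(m_q * a + m_r * b) / (m_q + m_r) for a, b in zip(q, r)]


def ward_dissimilarity(q, m_q, r, m_r):
    """(m_q m_r) / (m_q + m_r) * ||q - r||^2 for cluster centres q and r."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(q, r))
    return m_q * m_r / (m_q + m_r) * squared_distance


if __name__ == "__main__":
    # Two singletons two units apart.
    print(merged_centre([0.0, 0.0], 1, [2.0, 0.0], 1),
          ward_dissimilarity([0.0, 0.0], 1, [2.0, 0.0], 1))
    # A cluster of mass 3 and a singleton four units away.
    print(merged_centre([0.0, 0.0], 3, [4.0, 0.0], 1),
          ward_dissimilarity([0.0, 0.0], 3, [4.0, 0.0], 1))
```

The heavier cluster pulls the new centre towards itself, and the dissimilarity grows with both the separation and the masses involved.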

  • Efficient NN chain algorithm

    [Figure: an NN-chain (nearest neighbour chain) linking the points a, b, c, d, e.]

  • Efficient NN chain algorithm (cont'd.)

    An NN-chain consists of an arbitrary point followed by its NN; followed by the NN from among the remaining points of this second point; and so on until we necessarily have some pair of points which can be termed reciprocal or mutual NNs. (Such a pair of RNNs may be the first two points in the chain; and we have assumed that no two dissimilarities are equal.)
    In constructing a NN-chain, irrespective of the starting point, we may agglomerate a pair of RNNs as soon as they are found.
    Exactness of the resulting hierarchy is guaranteed when the cluster agglomeration criterion respects the reducibility property.
    Inversion impossible if: $d(i, j) < d(i, k)$ or $d(j, k) \Rightarrow d(i, j) < d(i \cup j, k)$
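
A compact sketch of the NN-chain idea with the Ward criterion follows. It is an illustration under simplifying assumptions (nearest neighbours found by brute force, clusters held as centre and mass pairs, ties resolved in favour of the previous chain element), and `nn_chain` is a hypothetical name rather than the author's implementation.

```python
def ward(a, b):
    """Ward dissimilarity between clusters given as (centre, mass) pairs."""
    (centre_a, mass_a), (centre_b, mass_b) = a, b
    squared = sum((x - y) ** 2 for x, y in zip(centre_a, centre_b))
    return mass_a * mass_b / (mass_a + mass_b) * squared


def nn_chain(points):
    """Return the list of merges (cluster_a, cluster_b, level)."""
    clusters = {i: (list(p), 1) for i, p in enumerate(points)}
    next_id = len(points)
    merges, chain = [], []
    while len(clusters) > 1:
        if not chain:
            chain.append(next(iter(clusters)))      # arbitrary starting point
        while True:
            tip = chain[-1]
            dists = {other: ward(clusters[tip], clusters[other])
                     for other in clusters if other != tip}
            nearest = min(dists, key=dists.get)
            # Reciprocal NNs: the tip's nearest neighbour is its predecessor.
            if len(chain) > 1 and dists[chain[-2]] <= dists[nearest]:
                break
            chain.append(nearest)
        b, a = chain.pop(), chain.pop()
        level = ward(clusters[a], clusters[b])
        (centre_a, mass_a), (centre_b, mass_b) = clusters.pop(a), clusters.pop(b)
        centre = [(mass_a * x + mass_b * y) / (mass_a + mass_b)
                  for x, y in zip(centre_a, centre_b)]
        clusters[next_id] = (centre, mass_a + mass_b)
        merges.append((a, b, level))
        next_id += 1                                 # rest of the chain is kept
    return merges


POINTS = [[0.0], [1.0], [5.0], [5.5], [20.0]]        # made-up 1-D data

if __name__ == "__main__":
    for merge in nn_chain(POINTS):
        print(merge)
```

Merges are produced in the order the reciprocal pairs are discovered, which need not be the order of increasing level; sorting the merges by level gives the usual dendrogram ordering.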

  • Correspondence

    Analysis

    –F

    Murtagh

    39

    ��

    ��

    Minim

    umvariance

    method:

    properties

    We

    seekto

    agglomerate

    two

    clusters,

    '�

    and

    '� ,into

    cluster

    '

    suchthatthe

    within-class

    varianceof

    thepartition

    therebyobtained

    ism

    inimum

    .

    Alternatively,the

    between-class

    varianceof

    thepartition

    obtainedis

    tobe

    maxim

    ized.

    Let

    (

    and

    )

    bethe

    partitionsprior

    to,andsubsequentto,the

    agglomeration;let

    �� ,

    �� ,...be

    classesof

    thepartitions.

    (

    ��� ��� ��������'� �'� �

    )

    ��� ��� ��������'��

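    Total variance of the cloud of objects in $m$-dimensional space is decomposed into the sum of within-class variance and between-class variance. This is Huygens' theorem in classical mechanics.

    Total variance, between-class variance, and within-class variance are as follows:

    $V(I) = \frac{1}{n} \sum_{i \in I} (i - g)^2$, $V(P) = \sum_{p \in P} \frac{|p|}{n} (p - g)^2$; and $\frac{1}{n} \sum_{p \in P} \sum_{i \in p} (i - p)^2$.

    Here $I$ is the set of objects, $n = |I|$, $g$ is the centre of gravity of the cloud, and each class $p$ is identified with its own centre of gravity.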
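The decomposition of total variance into between-class and within-class variance can be checked numerically. The sketch below is illustrative only: the helper names (`centre`, `sqdist`, `variances`) and the three-class partition of five points are assumptions for the example, and all objects are given equal mass.

```python
def centre(points):
    """Centre of gravity of a list of equally weighted points."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))


def sqdist(x, y):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(x, y))


def variances(partition):
    """Return (total, between-class, within-class) variance of a partition."""
    cloud = [x for cls in partition for x in cls]
    n, g = len(cloud), centre(cloud)
    total = sum(sqdist(x, g) for x in cloud) / n
    between = sum(len(cls) / n * sqdist(centre(cls), g) for cls in partition)
    within = sum(sqdist(x, centre(cls)) for cls in partition for x in cls) / n
    return total, between, within


# Hypothetical partition of five points in the plane
P = [[(0.0, 0.0), (2.0, 0.0)], [(10.0, 4.0), (10.0, 6.0)], [(5.0, 20.0)]]
total, between, within = variances(P)
print(total, between + within)  # the two values agree up to rounding
```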

    For two partitions, before and after an agglomeration, we have respectively:

    $V(I) = V(P) + \sum_{p \in P} V(p)$

    $V(I) = V(Q) + \sum_{p \in Q} V(p)$

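    From this, it can be shown that the criterion to be optimized in agglomerating $c_1$ and $c_2$ into new class $c$ is:

    $V(P) - V(Q) = V(c) - V(c_1) - V(c_2) = \frac{|c_1| \, |c_2|}{|c_1| + |c_2|} \, \|c_1 - c_2\|^2$

    The last expression is stated up to the constant factor $1/n$ carried by the variance definitions, which does not affect the choice of the pair to agglomerate.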
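As a hedged numerical check of this criterion, the sketch below merges two classes of a small partition and compares the drop in between-class variance with the closed-form expression. The data and helper names are hypothetical; with variances normalised by $n$, the closed form is divided by $n$ before the comparison.

```python
def centre(points):
    """Centre of gravity of a list of equally weighted points."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))


def sqdist(x, y):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(x, y))


def between_variance(partition, n, g):
    """Between-class variance V(P) of a partition of n objects with centre g."""
    return sum(len(cls) / n * sqdist(centre(cls), g) for cls in partition)


# Hypothetical classes: c1 and c2 are agglomerated, `rest` is left unchanged
c1 = [(0.0, 0.0), (2.0, 0.0)]
c2 = [(10.0, 4.0), (10.0, 6.0)]
rest = [[(5.0, 20.0)]]

P = rest + [c1, c2]      # partition before the agglomeration
Q = rest + [c1 + c2]     # partition after the agglomeration
cloud = [x for cls in P for x in cls]
n, g = len(cloud), centre(cloud)

drop = between_variance(P, n, g) - between_variance(Q, n, g)
criterion = (len(c1) * len(c2) / (len(c1) + len(c2))
             * sqdist(centre(c1), centre(c2)))
print(drop, criterion / n)  # the two values agree up to rounding
```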

  • FACOR and VACOR: Analysis of clusters

    The barycentric principle allows both row points and column points to be displayed simultaneously as projections.

    We therefore can consider:

    – simultaneous display of $I$ and $J$
    – tree on $I$
    – tree on $J$

    To help analyze these outputs we can explore the representation of clusters (derived from the hierarchical trees) in factor space, leading to programs traditionally called FACOR.

    We can also explore the representation of clusters in the profile coordinate space, leading to programs traditionally called VACOR.

    In the case of FACOR, for every couple $q, q'$ of a partition of $I$, we calculate

    $\frac{f_q f_{q'}}{f_q + f_{q'}} \, \|q - q'\|^2$

    This can be decomposed using the axes of $\mathbb{R}^J$, as well as using the factorial axes.

    In the case of VACOR, we can explore the cluster dipoles, which take account of the “elder” and “younger” cluster components, $a(n)$ and $b(n)$, of a node $n$:

            n
           / \
          /   \
         /     \
      a(n)     b(n)

    We have [equation not recoverable from the extracted text].

    We consider the vectors defining the dipole: $[n, a(n)]$ and $[n, b(n)]$.

    We then study the squared cosine of the angle between vector $[a(n), b(n)]$ and the factorial axis of rank $\alpha$.

    This squared cosine defines the relative contribution of the axis of rank $\alpha$ to the level index $\nu(n)$ of the class $n$.
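The squared cosine of a dipole with respect to each factorial axis can be sketched as follows. This is an illustration under stated assumptions, not the VACOR program itself: the function name `dipole_cos2` and the factor coordinates of $a(n)$ and $b(n)$ are hypothetical, and all factors are assumed to be retained so that the squared cosines sum to 1.

```python
def dipole_cos2(F_a, F_b):
    """Squared cosines between the dipole vector [a(n), b(n)] and each axis.

    F_a and F_b are the factor coordinates of the two cluster centres.
    Entry k of the result relates to the factorial axis of rank k + 1.
    """
    diff = [x - y for x, y in zip(F_a, F_b)]
    norm2 = sum(d * d for d in diff)
    return [d * d / norm2 for d in diff]


# Hypothetical factor projections of the elder and younger components
F_a = (0.6, -0.1, 0.0)
F_b = (-0.2, 0.3, 0.1)
cos2 = dipole_cos2(F_a, F_b)
for rank, value in enumerate(cos2, start=1):
    print(f"axis {rank}: squared cosine {value:.3f}")
```

An axis with a squared cosine close to 1 is one along which the two components of the node are mainly separated.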

  • Summary

    Correspondence analysis displays observation profiles in a low-dimensional factorial space.

    Profiles are points endowed with the $\chi^2$ distance.

    Under appropriate circumstances, the $\chi^2$ distance reduces to a Euclidean distance.

    A factorial space is nearly always Euclidean.

    Simultaneously, a hierarchical clustering is built using the observation profiles.

    Usually one or a small number of partitions are derived from the hierarchical clustering.

    A hierarchical clustering defines an ultrametric distance.

    Input for the hierarchical clustering is usually factor projections.
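The $\chi^2$ distance between two row profiles is a weighted Euclidean distance, with weights given by the inverse column masses. Rescaling each profile coordinate by $1/\sqrt{f_j}$ therefore turns it into an ordinary Euclidean distance. The sketch below illustrates this on a small contingency table; the table values and function names are assumptions made for the example.

```python
from math import sqrt


def profiles_and_masses(table):
    """Row profiles and column masses of a contingency table."""
    total = sum(map(sum, table))
    col_mass = [sum(col) / total for col in zip(*table)]
    profiles = [[x / sum(row) for x in row] for row in table]
    return profiles, col_mass


def chi2_dist2(p, q, col_mass):
    """Squared chi-squared distance between two profiles."""
    return sum((a - b) ** 2 / f for a, b, f in zip(p, q, col_mass))


def rescale(p, col_mass):
    """Divide each profile coordinate by the square root of its column mass."""
    return [a / sqrt(f) for a, f in zip(p, col_mass)]


# Hypothetical 3 x 3 contingency table
table = [[10, 20, 30], [20, 20, 20], [5, 5, 40]]
profiles, col_mass = profiles_and_masses(table)

d2_chi2 = chi2_dist2(profiles[0], profiles[1], col_mass)
x, y = rescale(profiles[0], col_mass), rescale(profiles[1], col_mass)
d2_euclid = sum((a - b) ** 2 for a, b in zip(x, y))
print(d2_chi2, d2_euclid)  # the two values agree up to rounding
```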

    In summary, correspondence analysis involves mapping a $\chi^2$ distance into a particular Euclidean distance; and mapping this Euclidean distance into an ultrametric distance.

    The aim is to have different but complementary analytic tools to facilitate interpretation of our data.
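To make the ultrametric distance concrete, the sketch below derives it from a list of agglomerations: the distance between two objects is the level of the smallest cluster containing both. The merge list is hypothetical, the function name `ultrametric` is an assumption, and new clusters are assumed to be labelled $n$, $n+1$, ... in merge order. The final check is the ultrametric (strong triangle) inequality.

```python
from itertools import permutations


def ultrametric(n, merges):
    """Ultrametric distances induced by merges [(a, b, level), ...].

    Objects are labelled 0 .. n-1; the cluster created by merge number
    `step` (counting from 0) is labelled n + step.
    """
    members = {i: [i] for i in range(n)}
    d = {}
    for step, (a, b, level) in enumerate(merges):
        # Objects meeting for the first time are at distance `level`
        for i in members[a]:
            for j in members[b]:
                d[i, j] = d[j, i] = level
        members[n + step] = members.pop(a) + members.pop(b)
    return d


# Hypothetical hierarchy on five objects
merges = [(0, 1, 0.5), (2, 3, 2.0), (5, 6, 30.25), (4, 7, 224.45)]
d = ultrametric(5, merges)

# Strong triangle inequality: d(i, k) <= max(d(i, j), d(j, k))
ok = all(d[i, k] <= max(d[i, j], d[j, k])
         for i, j, k in permutations(range(5), 3))
print(d[0, 3], ok)
```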

  • To read further

    Ch. Bastin, J.P. Benzécri, Ch. Bourgarit and P. Cazes, Pratique de l’Analyse des Données, Tome 2, Dunod, Paris, 1980.

    J.P. Benzécri and F. Benzécri, Pratique de l’Analyse des Données, Vol. 1: Analyse des Correspondances. Exposé Élémentaire, Dunod, Paris, 1980.

    J.P. Benzécri, L’Analyse des Données. Tome 1. La Taxinomie, 2nd ed., Dunod, Paris, 1976.

    J.P. Benzécri, L’Analyse des Données. Tome 2. L’Analyse des Correspondances, 2nd ed., Dunod, Paris, 1976.

    J.P. Benzécri, Correspondence Analysis Handbook, Marcel Dekker, Basel, 1992.

    M. Jambu, Classification Automatique pour l’Analyse des Données. 1. Méthodes et Algorithmes, Dunod, Paris, 1978.

    L. Lebart, A. Morineau and K.M. Warwick, Multivariate Descriptive Statistical Analysis, Wiley, New York, 1984.

    F. Murtagh, “A survey of recent advances in hierarchical clustering algorithms”, The Computer Journal, 26, 354-359, 1983.

    F. Murtagh, Multidimensional Clustering Algorithms, COMPSTAT Lectures Volume 4, Physica-Verlag, Vienna, 1985.

    F. Murtagh and A. Heck, Multivariate Data Analysis, Kluwer, 1987.

    H. Rouanet and B. Le Roux, Analyse des Données Multidimensionnelles, Dunod, Paris, 1993.

    M. Volle, Analyse des Données, 2nd Edition, Economica, Paris, 1980.