Learn Correspondence analysis


Learn Correspondence analysis using SPSS


  • 7/13/2019 Learn Correspondence analysis

    1/33

    Statistics for Marketing & Consumer Research. Copyright 2008 - Mario Mazzocchi

    Correspondence Analysis

    Chapter 14


    Correspondence analysis

    Multivariate statistical technique which looks into the association of two or more categorical variables and displays them jointly on a bivariate graph

    It can be used to apply multidimensional scaling to categorical variables.


    Correspondence analysis and data reduction techniques

    Factor and principal component analyses are only applied to metric (interval or ratio) quantitative variables

    Traditional multidimensional scaling deals with non-metric preference and perceptual data when those are on an ordinal scale

    Correspondence analysis allows data reduction (and graphical representation of dissimilarities) on non-metric nominal (categorical) variables

    The issue with categorical (non-ordinal) variables is how to measure distances between two objects: correspondence analysis exploits contingency tables and association measures


    Example (Trust data)

    Do consumers with different jobs (q55) show preferences

    for some specific type of chicken (q6)?

    Correspondence Table

    Rows: If employed, what is your occupation?
    Columns: In a typical week, what type of fresh or frozen chicken do you buy for your household's home consumption?

    Occupation                     'Value'   'Standard'   'Organic'   'Luxury'   Active
                                   chicken   chicken      chicken     chicken    Margin
    I am not employed                   17           50          10         17       94
    Non manual employee                 11           74          14         28      127
    Manual employee                      6           19           4          8       37
    Executive                            0            7           6         14       27
    Self employed professional           1           18           7          3       29
    Farmer / agricultural worker         1            1           1          0        3
    Employer / Entrepreneur              0            4           2          3        9
    Other                               11           31           1          1       44
    Active Margin                       47          204          45         74      370


    Independence

    If the two characters are independent then the number in the cells of the table should simply depend on the row and column totals (lecture 9)

    Measure the distance between the expected frequency in each cell and the actual (observed) frequency

    Compute a statistic (the Chi-square statistic) which allows one to test whether the difference between the expected and actual value is statistically significant
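The independence check can be sketched in a few lines of plain Python. This is a minimal illustration, not the SPSS routine: it uses the counts from the Trust-data correspondence table, and the function name `chi_square` is a hypothetical helper introduced here.

```python
# Observed counts from the Trust-data correspondence table
# (rows: occupation; columns: Value, Standard, Organic, Luxury chicken).
observed = [
    [17, 50, 10, 17],  # I am not employed
    [11, 74, 14, 28],  # Non manual employee
    [6, 19, 4, 8],     # Manual employee
    [0, 7, 6, 14],     # Executive
    [1, 18, 7, 3],     # Self employed professional
    [1, 1, 1, 0],      # Farmer / agricultural worker
    [0, 4, 2, 3],      # Employer / Entrepreneur
    [11, 31, 1, 1],    # Other
]


def chi_square(table):
    """Return (Chi-square statistic, expected counts, degrees of freedom)."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    # Under independence each expected count depends only on the margins.
    expected = [[r * c / n for c in col_totals] for r in row_totals]
    stat = sum(
        (o - e) ** 2 / e
        for obs_row, exp_row in zip(table, expected)
        for o, e in zip(obs_row, exp_row)
    )
    dof = (len(row_totals) - 1) * (len(col_totals) - 1)
    return stat, expected, dof


stat, expected, dof = chi_square(observed)
print(f"Chi-square = {stat:.2f} with {dof} degrees of freedom")
```

Some rows have very few respondents (for example farmers), so several expected counts are small and the statistic should be read as indicative only.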


    Reducing the number of dimensions

    The elements composing the Chi-square statistic are standardized metric values, one for each of the cells

    They become larger as the association between two specific characters increases

    These elements can be interpreted as a metric measure of distance

    The resulting matrix is similar to a covariance matrix

    A method similar to principal component analysis can be applied to this matrix to reduce the number of dimensions


    Coordinates

    The principal component scores provide standardized values that can be used as coordinates

    One may apply the same data reduction technique
      first by rows (synthesizing occupation as a function of types of chicken)
      then by columns (synthesizing types of chicken as a function of occupation)

    The first two components for each application generate a bivariate plot which shows both the occupation and the type of chicken in the same space


    Output from Correspondence Analysis

    Executives prefer Luxury chicken

    Unemployed are closer to Value chicken


    Applications

    It is possible to represent on the same graph consumer preferences for different brands and characteristics of a specific product (e.g. car brands together with colour, power, size, etc.)

    This allows one to explore brand choice in relation to characteristics, opening the way to product modifications and innovations to meet consumer preferences

    Correspondence analysis is particularly useful when the variables have many categories

    The application to metric (continuous) data is not ruled out but data need to be categorized first


    Summary

    Correspondence analysis is a compositional technique which starts from a set of product attributes to portray the overall preference for a brand

    This technique is very similar to PCA and can be employed for data reduction purposes or to plot perceptual maps

    Because of the way it is constructed, correspondence analysis can be applied to either the rows or the columns of the data matrix

    For example, if rows represent brands and columns are different attributes:

    1. By applying the method by rows one obtains the coordinates for the brands

    2. The application by columns allows one to represent the attributes in the same graph


    Steps to run correspondence analysis

    Represent the data in a contingency table

    Translate the frequencies of the contingency table into a matrix of metric (continuous) distances through a set of Chi-square association measures on the row and column profiles

    Extract the dimensions (in a similar fashion to PCA)

    Evaluate the explanatory power of the selected number of dimensions

    Plot row and column objects in the same co-ordinate space
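The steps can be sketched with NumPy on the Trust-data table. This is a minimal sketch under stated assumptions, not the SPSS algorithm: the dimensions are extracted with a singular value decomposition of the standardized residuals, and both rows and columns are given principal coordinates. All variable names are illustrative.

```python
import numpy as np

# Step 1: contingency table (occupation by type of chicken, Trust data).
N = np.array([
    [17, 50, 10, 17],
    [11, 74, 14, 28],
    [6, 19, 4, 8],
    [0, 7, 6, 14],
    [1, 18, 7, 3],
    [1, 1, 1, 0],
    [0, 4, 2, 3],
    [11, 31, 1, 1],
], dtype=float)

# Step 2: relative frequencies, masses and standardized residuals.
P = N / N.sum()
r = P.sum(axis=1)              # row masses
c = P.sum(axis=0)              # column masses
E = np.outer(r, c)             # expected relative frequencies under independence
S = (P - E) / np.sqrt(E)       # signed square roots of the cell Chi-square terms

# Step 3: extract the dimensions.
U, sv, Vt = np.linalg.svd(S, full_matrices=False)
inertia = sv ** 2              # one eigenvalue (principal inertia) per dimension

# Step 4: explanatory power of each dimension.
total_inertia = inertia.sum()
explained = inertia / total_inertia

# Step 5: coordinates for rows and columns in the same space.
row_coords = (U * sv) / np.sqrt(r)[:, None]
col_coords = (Vt.T * sv) / np.sqrt(c)[:, None]

print("total inertia:", round(float(total_inertia), 3))
print("share of first two dimensions:", round(float(explained[:2].sum()), 3))
```

Plotting the first two columns of `row_coords` and `col_coords` on the same axes would give the kind of joint map described above. The last singular value is numerically zero, because a table with four columns supports at most three dimensions.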


    The frequency table

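    Categorical variable X (k categories) in rows, categorical variable Y (l categories) in columns

                     y1    y2    ...   yj    ...   yl    Row masses
    x1               f11   f12   ...   f1j   ...   f1l   f10
    x2               f21   f22   ...   f2j   ...   f2l   f20
    ...
    xi               fi1   fi2   ...   fij   ...   fil   fi0
    ...
    xk               fk1   fk2   ...   fkj   ...   fkl   fk0
    Column masses    f01   f02   ...   f0j   ...   f0l   1

    Row profile: a row of the table (fi1, ..., fil); row masses: the row totals fi0

    Column profile: a column of the table (f1j, ..., fkj); column masses: the column totals f0j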


    Interpretation of coordinates

    The categories of the x variable can be seen as different coordinates for the points identified by the y variable

    The categories of the y variable can be seen as different coordinates for the points identified by the x variable

    Thus it is possible to represent the x and y categories as points in space, imposing (as in multidimensional scaling) that they respect some distance measure


    Representations

    Take the row profile (the categories of x) and plot the categories in a bi-dimensional graph, using the categories of y to define the distances

    This allows one to compare nominal categories within the same variable: those categories of x which show similar levels of association with a given category of y can be considered as closer than those with very different levels of association with the same category of y

    The same procedure is carried out transposing the table, which means that the categories of y can be represented using the categories of x to define the distances


    Computing the distances

    When the coordinates are defined simultaneously for the categories of x and y, the Chi-square value can be computed for each cell as follows

    Obtain the expected table frequencies

    $$n^*_{ij} = \frac{n_{i0}\, n_{0j}}{n_{00}}, \qquad f^*_{ij} = \frac{f_{i0}\, f_{0j}}{f_{00}} = f_{i0}\, f_{0j}$$

    where $n_{ij}$ and $f_{ij}$ are the absolute and relative frequencies, respectively, $n_{i0}$ and $n_{0j}$ (or $f_{i0}$ and $f_{0j}$) are the marginal totals for row i and column j (the row masses and column masses) respectively and $n_{00}$ is the sample size (hence the total relative frequency $f_{00}$ equals one)

    The Chi-square value can now be computed for each cell (i,j)

    $$\chi^2_{ij} = \frac{(f_{ij} - f^*_{ij})^2}{f^*_{ij}}$$

    These are the quad[...] between category i of the x variable [...]


    The distance matrix

    The matrix $\chi^2$ measures all of the associations between the categories of the first variable and those of the second one.

    A generalization to the multivariate case (MCA) is possible by stacking the matrix
      Stacking: compose a large matrix by blocks, where each block is the contingency matrix for two variables (all possible associations are taken into consideration)
      The stacked matrix is referred to as the Burt Table

    To obtain similarity values from the $\chi^2$ matrix:
      compute the square root of the elemental Chi-square values
      use the appropriate sign (the sign of the difference $f_{ij} - f^*_{ij}$)
      large positive values correspond to strongly associated categories
      large negative values identify those categories where the association is strong but negative, indicating dissimilarity
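As an illustration of the signed square roots, the sketch below builds the similarity matrix for the Trust-data table in plain Python. The helper name `similarity` and the variable names are assumptions made for this example.

```python
import math

# Trust-data counts (rows: occupation; columns: Value, Standard, Organic, Luxury).
counts = [
    [17, 50, 10, 17],  # I am not employed
    [11, 74, 14, 28],  # Non manual employee
    [6, 19, 4, 8],     # Manual employee
    [0, 7, 6, 14],     # Executive
    [1, 18, 7, 3],     # Self employed professional
    [1, 1, 1, 0],      # Farmer / agricultural worker
    [0, 4, 2, 3],      # Employer / Entrepreneur
    [11, 31, 1, 1],    # Other
]

n = sum(sum(row) for row in counts)
f = [[x / n for x in row] for row in counts]   # relative frequencies f_ij
row_mass = [sum(row) for row in f]             # f_i0
col_mass = [sum(col) for col in zip(*f)]       # f_0j


def similarity(i, j):
    """Signed square root of the Chi-square term for cell (i, j)."""
    expected = row_mass[i] * col_mass[j]       # f*_ij under independence
    diff = f[i][j] - expected
    return math.copysign(math.sqrt(diff ** 2 / expected), diff)


D = [[similarity(i, j) for j in range(len(col_mass))]
     for i in range(len(row_mass))]

print("Executive / Luxury:", round(D[3][3], 3))
print("Other / Luxury:", round(D[7][3], 3))
```

Executives buy more Luxury chicken than independence would predict, so that cell is positive; the Other category buys less, so its Luxury cell is negative.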


    Estimation

    The resulting matrix D contains metric and continuous similarity data

    It is possible to apply PCA to translate such a matrix into coordinates for each of the categories, first those of x then those of y

    Before PCA can be applied some normalization is required so that the input matrix becomes similar to a correlation matrix

    The use of the square root of the row masses (columns) for normalizing the values in D represents the key difference from PCA

    The rest of the estimation process follows the results of the PCA

    As for PCA, eigenvalues are computed, one for each dimension, which can be used to evaluate the proportion of dissimilarity maintained by that dimension


    Inertia

    Inertia is a measure of association between two categorical variables based on the Chi-squared statistic.

    In correspondence analysis the proportion of inertia explained by each of the dimensions can be regarded as a measure of goodness-of-fit because the effectiveness of correspondence analysis depends on the degree of association between x and y

    Total inertia
      is a measure of the overall association between x and y
      is equal to the sum of the eigenvalues
      corresponds to the Chi-square value divided by the number of observations
      A total inertia above 0.20 is expected for adequate representations

    Inertia values can be computed for each of the dimensions and represent the contribution of that dimension to the association (Chi-square) between the two variables


    SPSS example

    EFS data set:
      economic position of the household reference person (a093)
      type of tenure (a121)

    Their Pearson Chi-square value is 274, which means significant association at the 99.9% confidence level


    Analysis

    Define the range, i.e. the categories for each variable that enter the analysis

    Some categories can be indicated as supplementary: they appear in the graphical representation, but do not influence the actual estimation of the scores


    Model options

    Choose the number of dimensions to be retained

    Choice of distance measure

    Standardization (only for Euclidean distance)

    Normalization: which variable should be privileged?


    Number of dimensions

    The maximum number of dimensions for the analysis is equal to the number of rows minus one, or the number of columns minus one (whichever is the smaller)

    In our example, the maximum number of dimensions would be five, which reduces to four due to missing values in one row category.

    As shown later in this section, one may then choose to graphically represent only a sub-set of the extracted dimensions (usually two or three) to make interpretation easier
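The rule for the maximum number of dimensions is simple enough to write down directly. The function name `max_dimensions` is hypothetical, and the category counts (six row categories and eight column categories for the EFS example) are read off the row and column score tables reported later in these slides.

```python
def max_dimensions(n_rows: int, n_cols: int) -> int:
    """Largest number of dimensions extractable from an n_rows x n_cols table."""
    if n_rows < 2 or n_cols < 2:
        raise ValueError("need at least two categories per variable")
    return min(n_rows - 1, n_cols - 1)


print(max_dimensions(6, 8))  # all six row categories usable -> 5
print(max_dimensions(5, 8))  # one row category unusable     -> 4
```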


    Distance measure

    Chi-square distance (as discussed earlier)

    Euclidean distance
      uses the square root of the sum of squared differences between pairs of rows and pairs of columns
      this also requires one to choose a method for centering the data (see the SPSS manual for details)

    For this example, standard correspondence analysis (with the Chi-square distance) does not require a standardization method.
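To make the two distance measures concrete, the sketch below compares them for pairs of row profiles of the Trust-data table. It is an illustration only: applying the Euclidean distance to uncentered row profiles is an assumption made here for comparability and is not necessarily what SPSS computes; the function names are hypothetical.

```python
import math

# Trust-data counts (rows: occupation; columns: Value, Standard, Organic, Luxury).
counts = [
    [17, 50, 10, 17],  # 0: I am not employed
    [11, 74, 14, 28],  # 1: Non manual employee
    [6, 19, 4, 8],     # 2: Manual employee
    [0, 7, 6, 14],     # 3: Executive
    [1, 18, 7, 3],     # 4: Self employed professional
    [1, 1, 1, 0],      # 5: Farmer / agricultural worker
    [0, 4, 2, 3],      # 6: Employer / Entrepreneur
    [11, 31, 1, 1],    # 7: Other
]

n = sum(sum(row) for row in counts)
col_mass = [sum(col) / n for col in zip(*counts)]   # f_0j


def row_profile(i):
    """Row i divided by its total, so that the profile sums to one."""
    total = sum(counts[i])
    return [x / total for x in counts[i]]


def chi_square_distance(i, k):
    """Profile differences weighted by the inverse column masses."""
    return math.sqrt(sum(
        (a - b) ** 2 / m
        for a, b, m in zip(row_profile(i), row_profile(k), col_mass)
    ))


def euclidean_distance(i, k):
    """Unweighted distance between the same two profiles."""
    return math.sqrt(sum(
        (a - b) ** 2 for a, b in zip(row_profile(i), row_profile(k))
    ))


print("not employed vs executive (Chi-square):", round(chi_square_distance(0, 3), 3))
print("not employed vs executive (Euclidean): ", round(euclidean_distance(0, 3), 3))
```

The Chi-square distance gives more weight to differences in rarely chosen columns (here Value and Organic chicken), which is why it is never smaller than the unweighted version on the same profiles.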


    Normalization method

    Defines how correspondence analysis is run: whether to give priority to comparisons between the categories for x (rows) or those for y (columns)

    This choice influences the way distances are summarized by the first dimensions

    Row principal normalization: the Euclidean distances in the final bivariate plot of x and y are as close as possible to the Chi-square distances between the rows, that is the categories of x

    The opposite is valid for the column principal method

    Symmetrical normalization: the distances on the graph resemble as much as possible distances for both x and y by spreading the total inertia symmetrically

    Principal normalization: inertia is first spread over the scores for x, then y

    Weighted normalization: defines a weighting value between minus one and plus one, where minus one is the column principal, zero is symmetrical and plus one is the row principal

    EFS example: the row principal method is more appropriate, as it is more relevant to see how differences in socio-economic conditions impact on the tenure type than it is to look at distances between tenure types.
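One way to picture the normalization options is as different ways of sharing the singular values between row and column scores. The NumPy sketch below is an assumption-laden illustration, not SPSS code: the function `ca_scores`, the weight `q`, and the exponent rule (rows scaled by the singular values to the power (1+q)/2, columns to the power (1-q)/2) are choices made here to reproduce the three named cases, and `q=None` stands for principal normalization.

```python
import numpy as np

# Trust-data counts (rows: occupation; columns: type of chicken).
table = [
    [17, 50, 10, 17],
    [11, 74, 14, 28],
    [6, 19, 4, 8],
    [0, 7, 6, 14],
    [1, 18, 7, 3],
    [1, 1, 1, 0],
    [0, 4, 2, 3],
    [11, 31, 1, 1],
]


def ca_scores(table, q=None):
    """Row scores, column scores and singular values for a two-way table.

    q = 1 row principal, q = 0 symmetrical, q = -1 column principal;
    q = None gives principal normalization (full inertia on both sets).
    """
    P = np.asarray(table, dtype=float)
    P = P / P.sum()
    r, c = P.sum(axis=1), P.sum(axis=0)
    S = (P - np.outer(r, c)) / np.sqrt(np.outer(r, c))
    U, sv, Vt = np.linalg.svd(S, full_matrices=False)
    keep = sv > 1e-12                       # drop numerically empty dimensions
    U, sv, V = U[:, keep], sv[keep], Vt.T[:, keep]
    rows_std = U / np.sqrt(r)[:, None]      # standard coordinates
    cols_std = V / np.sqrt(c)[:, None]
    if q is None:
        row_power, col_power = 1.0, 1.0
    else:
        row_power, col_power = (1 + q) / 2, (1 - q) / 2
    return rows_std * sv ** row_power, cols_std * sv ** col_power, sv


rows, cols, sv = ca_scores(table, q=1)      # row principal
print("row scores, dimension 1:", np.round(rows[:, 0], 3))
print("column scores, dimension 1:", np.round(cols[:, 0], 3))
```

Under the row principal choice the row scores carry all the inertia of each dimension while the column scores stay standardized, which is consistent with the much larger column scores in the EFS output.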


    Additional statistics

    Although CA is a nonparametric method, it is possible to compute standard deviations and correlations under the assumption of multinomial distribution of the cell frequencies (when data are obtained as a random sample from a normally distributed population)

    The permuted correspondence table allows one to order the categories of x and y using scores obtained from CA

    E.g. the tenure types and the socio-economic conditions might follow some ordering but cannot be defined with sufficient precision to consider these variables as ordinal.

    One can use the scores in the first dimension (or the first two) to order the categories and produce a permuted correspondence table.
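Ordering categories by their first-dimension score can be sketched as follows. The scores are the dimension-1 row scores reported in the Overview Row Points table of the EFS example; the cell counts of that table are not reproduced in these slides, so only the ordering itself is shown.

```python
# Dimension-1 row scores from the Overview Row Points table
# (the supplementary category has no score and is left out).
scores_dim1 = {
    "Self-employed": 0.296,
    "Fulltime employee": 0.527,
    "Pt employee": -0.239,
    "Unemployed": -0.154,
    "Ret unoc over min ni age": -0.999,
}

# Order the categories along the first dimension, lowest score first.
ordered = sorted(scores_dim1, key=scores_dim1.get)

for label in ordered:
    print(f"{scores_dim1[label]:+.3f}  {label}")
```

The rows of the correspondence table would then be rearranged in this order, and the same would be done for the columns using the column scores.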


    Plots

    Three graphs:
      Biplot (both x & y)
      x only (rows)
      y only (columns)

    One usually chooses to represent only the first two or three of the extracted dimensions


    Output

    Summary

                                                        Proportion of Inertia       Confidence Singular Value
    Dimension  Singular  Inertia  Chi Square  Sig.      Accounted for  Cumulative   Standard    Correlation
               Value                                                                Deviation   2       3       4
    1          .669      .447                           .850           .850         .031        .094    -.032   -.022
    2          .209      .044                           .083           .933         .055                .011    .081
    3          .173      .030                           .057           .990         .055                        -.042
    4          .072      .005                           .010           1.000        .053
    Total                .526     231.402     .000 (a)  1.000          1.000

    a. 24 degrees of freedom

    The SV is the square root of inertia (the eigenvalue)

    The Chi-square stat suggests strong and significant association

    The first dimension explains 85%, the first two 93% of total inertia. However, note that total inertia does not correspond to total variability, but to the variability of the extracted dimensions

    Usually a value of total inertia above 0.2 is regarded as acceptable

    These precision measures are based on the multinomial distribution assumption
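The relations between the columns of the Summary table can be checked by hand. The short sketch below uses only the figures reported in the table; the variable names are illustrative, and small discrepancies are expected because the published values are rounded to three decimals.

```python
# Figures copied from the Summary table.
singular_values = [0.669, 0.209, 0.173, 0.072]
chi_square = 231.402
total_inertia_reported = 0.526

# Inertia of each dimension is the squared singular value (the eigenvalue).
inertia = [sv ** 2 for sv in singular_values]
total_inertia = sum(inertia)

# Proportion of inertia accounted for, and its running total.
proportion = [value / total_inertia for value in inertia]
cumulative = [sum(proportion[: k + 1]) for k in range(len(proportion))]

# Total inertia is Chi-square divided by the number of observations,
# so the table implies the sample size used in the analysis.
implied_n = chi_square / total_inertia_reported

print("inertia per dimension:", [round(v, 3) for v in inertia])
print("total inertia:", round(total_inertia, 3))
print("cumulative proportion:", [round(v, 3) for v in cumulative])
print("implied number of observations:", round(implied_n))
```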


    Row scores

    Overview Row Points (b)

    Columns: Mass; Score in Dimension 1 to 4 (S1-S4); Inertia; Contribution of Point to Inertia of Dimension 1 to 4 (P1-P4); Contribution of Dimension to Inertia of Point 1 to 4 and Total (D1-D4, Total)

    Economic position of
    Household Reference Person         Mass     S1     S2     S3     S4  Inertia     P1     P2     P3     P4     D1     D2     D3     D4  Total
    Self-employed                      .080   .296   .025   .433  -.164     .024   .016   .001   .496   .407   .290   .002   .620   .089  1.000
    Fulltime employee                  .539   .527   .049  -.039   .026     .152   .334   .030   .027   .071   .984   .008   .005   .002  1.000
    Pt employee                        .077  -.239  -.409  -.352  -.143     .028   .010   .295   .318   .300   .156   .453   .336   .055  1.000
    Unemployed                         .018  -.154 -1.223   .509   .241     .033   .001   .622   .157   .202   .013   .814   .141   .032  1.000
    Work related govt train prog (a)   .000      .      .      .      .        .   .000   .000   .000   .000      .      .      .      .      .
    Ret unoc over min ni age           .286  -.999   .089   .015   .019     .288   .639   .052   .002   .020   .992   .008   .000   .000  1.000
    Active Total                      1.000                                 .526  1.000  1.000  1.000  1.000

    a. Supplementary point
    b. Row Principal normalization

    The mass column shows the relative weight of each category on the sample

    Scores are computed for each category but the supplementary one, provided there are no missing data

    Scores are the coordinates for the map

    The inertia column shows how total inertia has been distributed across rows (similar to communalities)

    These categories have a higher relevance because they are more important categories in the original correspondence table. These two categories (especially retirement) strongly contribute to explaining the first dimension

    The second dimension is characterized by unemployed and part-time employees
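The contribution of each row point to the first dimension can be reconstructed from the mass and score columns. The sketch below assumes the usual relation under row principal normalization (mass times squared score, divided by the inertia of the dimension) and compares it with the reported values; the function name `contribution` is hypothetical and the inputs are rounded figures from the tables.

```python
eigenvalue_1 = 0.447  # inertia of the first dimension (Summary table)

# (mass, score in dimension 1, reported contribution to dimension 1)
rows = {
    "Self-employed": (0.080, 0.296, 0.016),
    "Fulltime employee": (0.539, 0.527, 0.334),
    "Pt employee": (0.077, -0.239, 0.010),
    "Unemployed": (0.018, -0.154, 0.001),
    "Ret unoc over min ni age": (0.286, -0.999, 0.639),
}


def contribution(mass, score, eigenvalue):
    """Share of a dimension's inertia attributable to one point."""
    return mass * score ** 2 / eigenvalue


computed = {
    label: contribution(mass, score, eigenvalue_1)
    for label, (mass, score, _) in rows.items()
}

for label, (_, _, reported) in rows.items():
    print(f"{label:26s} computed {computed[label]:.3f}  reported {reported:.3f}")
```

Retired and full-time employed reference persons account for almost all of the first dimension, which matches the comment on the slide.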


    Column scores

    The same exercise is carried out on columns, however the row principal method does not normalize by column

    Overview Column Points (b)

    Columns: Mass; Score in Dimension 1 to 4 (S1-S4); Inertia; Contribution of Point to Inertia of Dimension 1 to 4 (P1-P4); Contribution of Dimension to Inertia of Point 1 to 4 and Total (D1-D4, Total)

    Tenure - type                        Mass     S1     S2     S3     S4  Inertia     P1     P2     P3     P4     D1     D2     D3     D4  Total
    Local Authority rented unfurnished   .098  -.699 -1.993   .051  1.106     .039   .048   .388   .000   .120   .548   .436   .000   .016  1.000
    Housing association                  .066  -.781 -1.263  2.821 -1.273     .039   .040   .105   .524   .107   .462   .118   .405   .014  1.000
    Other rented unfurnished             .050   .487 -2.023 -2.190   .891     .022   .012   .205   .240   .040   .245   .413   .333   .010  1.000
    Rented furnished                     .032   .531 -1.098 -2.270 -4.585     .014   .009   .038   .164   .669   .284   .119   .349   .248  1.000
    Owned with mortgage                  .457   .971   .371   .233   .133     .196   .431   .063   .025   .008   .982   .014   .004   .000  1.000
    Owned by rental purchase             .002  1.179  1.120 -1.287  5.002     .002   .003   .003   .004   .057   .725   .064   .058   .153  1.000
    Owned outright                       .295 -1.244   .819  -.382   .018     .214   .457   .198   .043   .000   .954   .040   .006   .000  1.000
    Rent free (a)                        .009  -.957 -1.039 -2.996 -3.705     .007   .000   .000   .000   .000   .512   .059   .338   .090  1.000
    Active Total                        1.000                                 .526  1.000  1.000  1.000  1.000

    a. Supplementary point
    b. Row Principal normalization

    By column the first dimension is especially related to the owned by mortgage and owned outright categories


    Bi-plot

    Employed individuals are closer to owned accommodations

    Retired individuals are also close to owned accommodations

    Part-time employees and unemployed individuals are closer to rented accommodations and other forms of accommodations


    Multiple Correspondence Analysis (MCA)

    When all variables are multiple nominal, then optimal scaling applies MCA


    Plot with 3 variables

    The analysis now also includes the government office region


    SAS correspondence analysis

    SAS procedure: proc CORRESP

    simple correspondence analysis

    multiple correspondence analysis (option MCA)

    same types of normalization as SPSS

    option PROFILE (ROW, COLUMN or BOTH)