Learn Correspondence analysis

Statistics for Marketing & Consumer Research, Copyright 2008 - Mario Mazzocchi

Correspondence Analysis

Chapter 14


Correspondence analysis

A multivariate statistical technique which looks into the association of two or more categorical variables and displays them jointly on a bivariate graph

It can be used to apply multidimensional scaling to categorical variables.


Correspondence analysis and data reduction techniques

Factor and principal component analyses are only applied to metric (interval or ratio) quantitative variables

Traditional multidimensional scaling deals with non-metric preference and perceptual data when those are on an ordinal scale

Correspondence analysis allows data reduction (and graphical representation of dissimilarities) on non-metric nominal (categorical) variables

The issue with categorical (non-ordinal) variables is how to measure distances between two objects: correspondence analysis exploits contingency tables and association measures


Example (Trust data)

Do consumers with different jobs (q55) show preferences for some specific type of chicken (q6)?

Correspondence Table

Rows: If employed, what is your occupation?
Columns: In a typical week, what type of fresh or frozen chicken do you buy for your household's home consumption?

                                 'Value'  'Standard'   'Organic'    'Luxury'      Active
Occupation                       chicken     chicken     chicken     chicken      Margin
I am not employed                     17          50          10          17          94
Non manual employee                   11          74          14          28         127
Manual employee                        6          19           4           8          37
Executive                              0           7           6          14          27
Self employed professional             1          18           7           3          29
Farmer / agricultural worker           1           1           1           0           3
Employer / Entrepreneur                0           4           2           3           9
Other                                 11          31           1           1          44
Active Margin                         47         204          45          74         370


Independence

If the two characters are independent then the number in the cells of the table should simply depend on the row and column totals (lecture 9)

Measure the distance between the expected frequency in each cell and the actual (observed) frequency

Compute a statistic (the Chi-square statistic) which allows one to test whether the difference between the expected and actual value is statistically significant
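
The independence check can be sketched numerically. The snippet below is a minimal illustration in Python with NumPy (an assumption of this example, since the course itself relies on SPSS and SAS), using the counts of the Trust correspondence table. Variable names are illustrative.

```python
import numpy as np

# Observed counts from the Trust correspondence table
# (rows: occupation, columns: Value, Standard, Organic, Luxury chicken)
observed = np.array([
    [17, 50, 10, 17],   # I am not employed
    [11, 74, 14, 28],   # Non manual employee
    [ 6, 19,  4,  8],   # Manual employee
    [ 0,  7,  6, 14],   # Executive
    [ 1, 18,  7,  3],   # Self employed professional
    [ 1,  1,  1,  0],   # Farmer / agricultural worker
    [ 0,  4,  2,  3],   # Employer / Entrepreneur
    [11, 31,  1,  1],   # Other
], dtype=float)

n = observed.sum()
row_totals = observed.sum(axis=1)
col_totals = observed.sum(axis=0)

# Under independence each cell depends only on its row and column totals
expected = np.outer(row_totals, col_totals) / n

# Distance between observed and expected frequency, cell by cell
cell_chi2 = (observed - expected) ** 2 / expected

# Chi-square statistic and its degrees of freedom
chi2 = cell_chi2.sum()
dof = (observed.shape[0] - 1) * (observed.shape[1] - 1)

print(f"Chi-square = {chi2:.2f} on {dof} degrees of freedom")
```

The statistic is then compared with a Chi-square distribution with `dof` degrees of freedom to judge whether the association is significant.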


Reducing the number of dimensions

The elements composing the Chi-square statistic are standardized metric values, one for each of the cells

They become larger as the association between two specific characters increases

These elements can be interpreted as a metric measure of distance

The resulting matrix is similar to a covariance matrix

A method similar to principal component analysis can be applied to this matrix to reduce the number of dimensions


Coordinates

The principal component scores provide standardized values that can be used as coordinates

One may apply the same data reduction technique

first by rows (synthesizing occupation as a function of types of chicken)

then by column (synthesizing types of chicken as a function of occupation)

The first two components for each application generate a bivariate plot which shows both the occupation and the type of chicken in the same space


Output from Correspondence Analysis

Executives prefer Luxury chicken

Unemployed are closer to Value chicken


Applications

It is possible to represent on the same graph consumer preferences for different brands and characteristics of a specific product (e.g. car brands together with colour, power, size, etc.)

This allows one to explore brand choice in relation to characteristics, opening the way to product modifications and innovations to meet consumer preferences

Correspondence analysis is particularly useful when the variables have many categories

The application to metric (continuous) data is not ruled out but data need to be categorized first


Summary

Correspondence analysis is a compositional technique which starts from a set of product attributes to portray the overall preference for a brand

This technique is very similar to PCA and can be employed for data reduction purposes or to plot perceptual maps

Because of the way it is constructed, correspondence analysis can be applied to either the rows or the columns of the data matrix

For example, if rows represent brands and columns are different attributes:

1. By applying the method by rows one obtains the coordinates for the brands

2. The application by columns allows one to represent the attributes in the same graph


Steps to run correspondence analysis

Represent the data in a contingency table

Translate the frequencies of the contingency table into a matrix of metric (continuous) distances through a set of Chi-square association measures on the row and column profiles

Extract the dimensions (in a similar fashion to PCA)

Evaluate the explanatory power of the selected number of dimensions

Plot row and column objects in the same co-ordinate space
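
The steps listed on this slide can be sketched end to end. The example below is an illustration in Python with NumPy, not the SPSS or SAS routine: it assumes the usual formulation in which the dimensions are extracted with a singular value decomposition of the standardized residuals, and it reuses the Trust counts. All variable names are illustrative.

```python
import numpy as np

# Step 1: contingency table (Trust data, occupation by type of chicken)
observed = np.array([
    [17, 50, 10, 17],
    [11, 74, 14, 28],
    [ 6, 19,  4,  8],
    [ 0,  7,  6, 14],
    [ 1, 18,  7,  3],
    [ 1,  1,  1,  0],
    [ 0,  4,  2,  3],
    [11, 31,  1,  1],
], dtype=float)

# Step 2: relative frequencies, masses and signed Chi-square distances
n = observed.sum()
P = observed / n
r = P.sum(axis=1)                      # row masses
c = P.sum(axis=0)                      # column masses
S = (P - np.outer(r, c)) / np.sqrt(np.outer(r, c))

# Step 3: extract the dimensions (assumed here: SVD, as in PCA)
U, sv, Vt = np.linalg.svd(S, full_matrices=False)

# Step 4: explanatory power of each dimension
inertia = sv ** 2                      # eigenvalues
explained = inertia / inertia.sum()

# Step 5: coordinates of rows and columns in the same space
# (both scaled by the singular values in this sketch)
row_coords = (U * sv) / np.sqrt(r)[:, None]
col_coords = (Vt.T * sv) / np.sqrt(c)[:, None]

print(np.round(explained, 3))
print(np.round(row_coords[:, :2], 3))
print(np.round(col_coords[:, :2], 3))
```

The first two columns of `row_coords` and `col_coords` are the ones that would be drawn on a bivariate plot. The last dimension returned by the decomposition carries essentially no inertia, because the table has at most min(rows, columns) minus one meaningful dimensions.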


The frequency table

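                                       Categorical variable Y (l categories)
                                 y1     y2     ...    yj     ...    yl     Row masses
Categorical variable X     x1    f11    f12    ...    f1j    ...    f1l    f10
(k categories)             x2    f21    f22    ...    f2j    ...    f2l    f20
                           ...
                           xi    fi1    fi2    ...    fij    ...    fil    fi0
                           ...
                           xk    fk1    fk2    ...    fkj    ...    fkl    fk0
Column masses                    f01    f02    ...    f0j    ...    f0l    1

Row profile: a row of the table (fi1 ... fil); row masses: the final column (fi0)

Column profile: a column of the table (f1j ... fkj); column masses: the final row (f0j)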


Interpretation of coordinates

The categories of the x variable can be seen as different coordinates for the points identified by the y variable

The categories of the y variable can be seen as different coordinates for the points identified by the x variable

Thus it is possible to represent the x and y categories as points in space, imposing (as in multidimensional scaling) that they respect some distance measure


Representations

Take the row profile (the categories of x) and plot the categories in a bi-dimensional graph, using the categories of y to define the distances

This allows one to compare nominal categories within the same variable: those categories of x which show similar levels of association with a given category of y can be considered as closer than those with very different levels of association with the same category of y

The same procedure is carried out transposing the table, which means that the categories of y can be represented using the categories of x to define the distances


Computing the distances

When the coordinates are defined simultaneously for the categories of x and y, the Chi-square value can be computed for each cell as follows

Obtain the expected table frequencies

$$f^*_{ij} = \frac{n_{i0}\, n_{0j}}{n_{00}^2} = \frac{f_{i0}\, f_{0j}}{f_{00}} = f_{i0}\, f_{0j}$$

where $n_{ij}$ and $f_{ij}$ are the absolute and relative frequencies, respectively, $n_{i0}$ and $n_{0j}$ (or $f_{i0}$ and $f_{0j}$) are the marginal totals for row $i$ and column $j$ (the row masses and column masses) respectively and $n_{00}$ is the sample size (hence the total relative frequency $f_{00}$ equals one)

The Chi-square value can now be computed for each cell (i,j)

$$\chi^2_{ij} = \frac{(f_{ij} - f^*_{ij})^2}{f^*_{ij}}$$

These are the quadratic distances between category i of the x variable and category j of the y variable


The distance matrix

The matrix $\chi^2$ measures all of the associations between the categories of the first variable and those of the second one.

A generalization to the multivariate case (MCA) is possible by stacking the matrix

Stacking: compose a large matrix by blocks, where each block is the contingency matrix for two variables (all possible associations are taken into consideration)

The stacked matrix is referred to as the Burt Table

To obtain similarity values from the $\chi^2$ matrix:

compute the square root of the elemental Chi-square values

use the appropriate sign (the sign of the difference $f_{ij} - f^*_{ij}$)

large positive values correspond to strongly associated categories

large negative values identify those categories where the association is strong but negative, indicating dissimilarity
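
The signed square-root transformation can be illustrated on the Trust table. This is a minimal sketch in Python with NumPy (an assumed tool, not part of the course material); the names `f`, `f_star` and `D` mirror the symbols used in the slides.

```python
import numpy as np

# Trust data: occupation (rows) by type of chicken (columns)
observed = np.array([
    [17, 50, 10, 17], [11, 74, 14, 28], [6, 19, 4, 8], [0, 7, 6, 14],
    [1, 18, 7, 3], [1, 1, 1, 0], [0, 4, 2, 3], [11, 31, 1, 1],
], dtype=float)

f = observed / observed.sum()                    # relative frequencies
f_star = np.outer(f.sum(axis=1), f.sum(axis=0))  # expected under independence

chi2_cells = (f - f_star) ** 2 / f_star          # elemental Chi-square values

# Square root with the sign of (f - f_star): positive values indicate
# association, negative values indicate dissimilarity
D = np.sign(f - f_star) * np.sqrt(chi2_cells)

print(np.round(D, 3))
```

In this table the Executive row (fourth row) has a positive entry for Luxury chicken and a negative entry for Value chicken, since executives buy luxury chicken more often, and value chicken less often, than independence would predict.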


Estimation

The resulting matrix D contains metric and continuous similarity data

It is possible to apply PCA to translate such a matrix into coordinates for each of the categories, first those of x then those of y

Before PCA can be applied some normalization is required so that the input matrix becomes similar to a correlation matrix

The use of the square root of the row masses (columns) for normalizing the values in D represents the key difference from PCA

The rest of the estimation process follows the results of the PCA

As for PCA, eigenvalues are computed, one for each dimension, which can be used to evaluate the proportion of dissimilarity maintained by that dimension


Inertia

Inertia is a measure of association between two categorical variables based on the Chi-squared statistic.

In correspondence analysis the proportion of inertia explained by each of the dimensions can be regarded as a measure of goodness-of-fit because the effectiveness of correspondence analysis depends on the degree of association between x and y

Total inertia

is a measure of the overall association between x and y

is equal to the sum of the eigenvalues

corresponds to the Chi-square value divided by the number of observations

A total inertia above 0.20 is expected for adequate representations

Inertia values can be computed for each of the dimensions and represent the contribution of that dimension to the association (Chi-square) between the two variables


SPSS example

EFS data set: economic position of the household reference person (a093) and type of tenure (a121)

Their Pearson Chi-square value is 274, which means significant association at the 99.9% confidence level


Analysis

Define the range, i.e. the categories for each variable that enter the analysis

Some categories can be indicated as supplementary: they appear in the graphical representation, but do not influence the actual estimation of the scores


Model options

Choose the number of dimensions to be retained

Choice of distance measure

Standardization (only for Euclidean distance)

Normalization

Which variable should be privileged?


Number of dimensions

The maximum number of dimensions for the analysis is equal to the number of rows minus one, or the number of columns minus one (whichever is smaller)

In our example, the maximum number of dimensions would be five, which reduces to four due to missing values in one row category.

As shown later in this section one may then choose to graphically represent only a sub-set of the extracted dimensions (usually two or three) to make interpretation easier
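
The rule on the maximum number of dimensions can be written as a one-line function. The sketch below is in Python; the function name is hypothetical, and the category counts (six economic positions and eight tenure types, one of each treated as supplementary) are read from the SPSS output tables of this example.

```python
def max_dimensions(n_rows: int, n_cols: int) -> int:
    """Largest number of dimensions a two-way table can yield."""
    return min(n_rows - 1, n_cols - 1)

# All categories of the EFS example: 6 economic positions, 8 tenure types
print(max_dimensions(6, 8))

# Active categories only: 5 economic positions, 7 tenure types
print(max_dimensions(5, 7))

# Degrees of freedom of the Chi-square test on the active table
dof = (5 - 1) * (7 - 1)
print(dof)
```

The degrees of freedom computed on the active table agree with the 24 reported in the footnote of the SPSS summary output.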


Distance measure

Chi-square distance (as discussed earlier)

Euclidean distance

uses the square root of the sum of squared differences between pairs of rows and pairs of columns

this also requires one to choose a method for centering the data (see the SPSS manual for details)

For this example, standard correspondence analysis (with the Chi-square distance) does not require a standardization method.


Normalization method

Defines how correspondence analysis is run: whether to give priority to comparisons between the categories for x (rows) or those for y (columns)

This choice influences the way distances are summarized by the first dimensions

Row principal normalization: the Euclidean distances in the final bivariate plot of x and y are as close as possible to the Chi-square distances between the rows, that is the categories of x

The opposite is valid for the column principal method

Symmetrical normalization: the distances on the graph resemble as much as possible distances for both x and y by spreading the total inertia symmetrically

Principal normalization: inertia is first spread over the scores for x, then y

Weighted normalization: defines a weighting value between minus one and plus one, where minus one is the column principal, zero is symmetrical and plus one is the row principal

EFS example: the row principal method is more appropriate as it is more relevant to see how differences in socio-economic conditions impact on the tenure type than it is to look at distances between tenure types.
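
The effect of the weighting value can be sketched in code. The example below is in Python with NumPy and rests on an assumption not spelled out in the slides: the singular values are raised to the power (1 + q) / 2 for the row scores and (1 - q) / 2 for the column scores, so that q = 1 gives row principal, q = 0 symmetrical and q = -1 column principal normalization. The function `ca_scores` is a hypothetical helper, and the Trust table is used as input.

```python
import numpy as np

def ca_scores(observed, q):
    """Row and column scores; q in [-1, 1] sets how inertia is spread."""
    P = observed / observed.sum()
    r, c = P.sum(axis=1), P.sum(axis=0)            # row and column masses
    S = (P - np.outer(r, c)) / np.sqrt(np.outer(r, c))
    U, sv, Vt = np.linalg.svd(S, full_matrices=False)
    k = min(observed.shape) - 1                    # drop the empty last dimension
    rows = U[:, :k] * sv[:k] ** ((1 + q) / 2) / np.sqrt(r)[:, None]
    cols = Vt.T[:, :k] * sv[:k] ** ((1 - q) / 2) / np.sqrt(c)[:, None]
    return rows, cols

observed = np.array([
    [17, 50, 10, 17], [11, 74, 14, 28], [6, 19, 4, 8], [0, 7, 6, 14],
    [1, 18, 7, 3], [1, 1, 1, 0], [0, 4, 2, 3], [11, 31, 1, 1],
], dtype=float)

# Row principal normalization (q = 1)
rows, cols = ca_scores(observed, q=1)

# Chi-square distance between the profiles of row 0 and row 3
P = observed / observed.sum()
profiles = P / P.sum(axis=1, keepdims=True)
c = P.sum(axis=0)
chi2_dist = np.sqrt(((profiles[0] - profiles[3]) ** 2 / c).sum())

# Euclidean distance between the same two rows in the score space
plot_dist = np.linalg.norm(rows[0] - rows[3])

print(round(chi2_dist, 4), round(plot_dist, 4))
```

When all dimensions are retained the two printed distances coincide. A two-dimensional plot keeps only the first two columns of `rows`, so the distances on the graph approximate the Chi-square distances rather than reproduce them exactly.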


Additional statistics

Although CA is a nonparametric method, it is possible to compute standard deviations and correlations under the assumption of multinomial distribution of the cell frequencies (when data are obtained as a random sample from a normally distributed population)

Allows one to order the categories of x and y using scores obtained from CA

E.g. the tenure types and the socio-economic conditions might follow some ordering but cannot be defined with sufficient precision to consider these variables as ordinal.

One can use the scores in the first dimension (or the first two) to order the categories and produce a permuted correspondence table.


Plots

Three graphs:

Biplot (both x & y)

x only (rows)

y only (columns)

One usually chooses to represent only the first two or three of the extracted dimensions


Output

Summary

                                                        Proportion of Inertia       Confidence Singular Value
                                                                                      Standard     Correlation
Dimension  Singular Value  Inertia  Chi Square  Sig.     Accounted for  Cumulative   Deviation       2       3       4
1                    .669     .447                                .850        .850        .031    .094   -.032   -.022
2                    .209     .044                                .083        .933        .055            .011    .081
3                    .173     .030                                .057        .990        .055                   -.042
4                    .072     .005                                .010       1.000        .053
Total                         .526     231.402  .000(a)          1.000       1.000

a. 24 degrees of freedom

The SV is the square root of inertia (the eigenvalue)

The Chi-square stat suggests strong and significant association

The first dimension explains 85%, the first two 93% of total inertia. However, note that total inertia does not correspond to total variability, but to the variability of the extracted dimensions

Usually a value of total inertia above 0.2 is regarded as acceptable

These precision measures are based on the multinomial distribution assumption
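
The relations between the columns of the summary table can be checked with a few lines of arithmetic. The sketch below is in Python with NumPy (an assumed tool) and uses only the figures reported in the SPSS output; small discrepancies are expected because the output is rounded to three decimals.

```python
import numpy as np

# Figures as reported in the SPSS summary table
singular_values = np.array([0.669, 0.209, 0.173, 0.072])
inertia = np.array([0.447, 0.044, 0.030, 0.005])
total_inertia = 0.526

# Inertia of a dimension is the squared singular value
squared_sv = singular_values ** 2

# Proportion of inertia accounted for, and its running total
accounted_for = inertia / total_inertia
cumulative = np.cumsum(accounted_for)

print(np.round(squared_sv, 3))
print(np.round(accounted_for, 3))
print(np.round(cumulative, 3))
```

The per-dimension inertia values add up to the total inertia, which is the statement that total inertia equals the sum of the eigenvalues.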


Row scores

Overview Row Points (b)

Economic position of Household Reference Person: mass, score in dimension and inertia

                                        Mass    Dim 1    Dim 2    Dim 3    Dim 4  Inertia
Self-employed                           .080     .296     .025     .433    -.164     .024
Fulltime employee                       .539     .527     .049    -.039     .026     .152
Pt employee                             .077    -.239    -.409    -.352    -.143     .028
Unemployed                              .018    -.154   -1.223     .509     .241     .033
Work related govt train prog (a)        .000        .        .        .        .        .
Ret unoc over min ni age                .286    -.999     .089     .015     .019     .288
Active Total                           1.000                                         .526

Contribution of point to inertia of dimension

                                       Dim 1    Dim 2    Dim 3    Dim 4
Self-employed                           .016     .001     .496     .407
Fulltime employee                       .334     .030     .027     .071
Pt employee                             .010     .295     .318     .300
Unemployed                              .001     .622     .157     .202
Work related govt train prog (a)        .000     .000     .000     .000
Ret unoc over min ni age                .639     .052     .002     .020
Active Total                           1.000    1.000    1.000    1.000

Contribution of dimension to inertia of point

                                       Dim 1    Dim 2    Dim 3    Dim 4    Total
Self-employed                           .290     .002     .620     .089    1.000
Fulltime employee                       .984     .008     .005     .002    1.000
Pt employee                             .156     .453     .336     .055    1.000
Unemployed                              .013     .814     .141     .032    1.000
Work related govt train prog (a)           .        .        .        .        .
Ret unoc over min ni age                .992     .008     .000     .000    1.000

a. Supplementary point

b. Row Principal normalization

The mass column shows the relative weight of each category on the sample

Scores are computed for each category but the supplemental one, provided there are no missing data

Scores are the coordinates for the map

Shows how total inertia has been distributed across rows (similar to communalities)

These categories have a higher relevance because they are more important categories in the original correspondence table. These two categories (especially retirement) strongly contribute to explaining the first dimension

The second dimension is characterized by unemployed and part-time employees
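
The contribution of each row point to the first dimension can be recovered from the mass and score columns. The sketch below is in Python with NumPy and assumes the relation that holds for principal coordinates: contribution = mass × score² / inertia of the dimension. The inputs are the rounded figures of the SPSS output for the active categories, with 0.447 being the inertia of the first dimension.

```python
import numpy as np

# Active row categories (the supplementary one is left out)
labels = ["Self-employed", "Fulltime employee", "Pt employee",
          "Unemployed", "Ret unoc over min ni age"]
mass = np.array([0.080, 0.539, 0.077, 0.018, 0.286])
score_dim1 = np.array([0.296, 0.527, -0.239, -0.154, -0.999])
inertia_dim1 = 0.447

# Share of the first dimension's inertia due to each category
contribution = mass * score_dim1 ** 2 / inertia_dim1

for label, value in zip(labels, contribution):
    print(f"{label:26s} {value:.3f}")
```

Up to rounding, the result matches the first column of the contribution of point to inertia of dimension, with full-time employees and the retired accounting for nearly all of the first dimension.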


Column scores

The same exercise is carried out on columns, however the row principal method does not normalize by column

Overview Column Points (b)

Tenure - type: mass, score in dimension and inertia

                                          Mass    Dim 1    Dim 2    Dim 3    Dim 4  Inertia
Local Authority rented unfurnished        .098    -.699   -1.993     .051    1.106     .039
Housing association                       .066    -.781   -1.263    2.821   -1.273     .039
Other rented unfurnished                  .050     .487   -2.023   -2.190     .891     .022
Rented furnished                          .032     .531   -1.098   -2.270   -4.585     .014
Owned with mortgage                       .457     .971     .371     .233     .133     .196
Owned by rental purchase                  .002    1.179    1.120   -1.287    5.002     .002
Owned outright                            .295   -1.244     .819    -.382     .018     .214
Rent free (a)                             .009    -.957   -1.039   -2.996   -3.705     .007
Active Total                             1.000                                         .526

Contribution of point to inertia of dimension

                                         Dim 1    Dim 2    Dim 3    Dim 4
Local Authority rented unfurnished        .048     .388     .000     .120
Housing association                       .040     .105     .524     .107
Other rented unfurnished                  .012     .205     .240     .040
Rented furnished                          .009     .038     .164     .669
Owned with mortgage                       .431     .063     .025     .008
Owned by rental purchase                  .003     .003     .004     .057
Owned outright                            .457     .198     .043     .000
Rent free (a)                             .000     .000     .000     .000
Active Total                             1.000    1.000    1.000    1.000

Contribution of dimension to inertia of point

                                         Dim 1    Dim 2    Dim 3    Dim 4    Total
Local Authority rented unfurnished        .548     .436     .000     .016    1.000
Housing association                       .462     .118     .405     .014    1.000
Other rented unfurnished                  .245     .413     .333     .010    1.000
Rented furnished                          .284     .119     .349     .248    1.000
Owned with mortgage                       .982     .014     .004     .000    1.000
Owned by rental purchase                  .725     .064     .058     .153    1.000
Owned outright                            .954     .040     .006     .000    1.000
Rent free (a)                             .512     .059     .338     .090    1.000

a. Supplementary point

b. Row Principal normalization

By column, the first dimension is especially related to the owned with mortgage and owned outright categories
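
Because the row principal method leaves the column scores unscaled, their mass-weighted sum of squares in each dimension is one, and the contribution of a column point to a dimension is simply mass × score². The Python sketch below (NumPy assumed) checks this on the first dimension using the rounded figures of the SPSS output for the active tenure types; the relation itself is an assumption about how the output was produced, although the reported numbers agree with it.

```python
import numpy as np

# Active tenure categories (the supplementary "Rent free" is left out)
mass = np.array([0.098, 0.066, 0.050, 0.032, 0.457, 0.002, 0.295])
score_dim1 = np.array([-0.699, -0.781, 0.487, 0.531, 0.971, 1.179, -1.244])

# With unscaled (standard) scores no division by the inertia is needed
contribution = mass * score_dim1 ** 2

print(np.round(contribution, 3))
print(round(contribution.sum(), 3))
```

The two largest values belong to owned with mortgage and owned outright, which is why the first dimension is read as relating mainly to those two categories.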


Bi-plot

Employed individuals are closer to owned accommodations

Retired individuals are also close to owned accommodations

Part-time employees and unemployed individuals are closer to rented accommodations and other forms of accommodations


Multiple Correspondence Analysis (MCA)

When all variables are multiple nominal, then optimal scaling applies MCA
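
MCA works on a stacked block matrix of all pairwise contingency tables, known as the Burt table. A minimal sketch of how such a table can be assembled is given below in Python with NumPy. The data are hypothetical (five respondents, three nominal variables coded as integers) and the helper `indicator_matrix` is an illustrative name, not part of any package.

```python
import numpy as np

# Hypothetical responses: 5 respondents, 3 nominal variables
# (categories coded 0, 1, 2, ...)
data = np.array([
    [0, 1, 0],
    [1, 0, 2],
    [0, 1, 1],
    [2, 0, 2],
    [1, 1, 0],
])

def indicator_matrix(codes):
    """One dummy column per category, blocks placed side by side."""
    blocks = []
    for column in codes.T:
        n_categories = int(column.max()) + 1
        blocks.append(np.eye(n_categories, dtype=int)[column])
    return np.hstack(blocks)

Z = indicator_matrix(data)

# Burt table: each off-diagonal block is the contingency table of two
# variables, each diagonal block holds the category counts of one variable
burt = Z.T @ Z

print(burt)
```

Applying correspondence analysis to `burt` rather than to a single two-way table is what allows more than two variables to be plotted in the same space.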


Plot with 3 variables

The analysis now also includes the government office region


SAS correspondence analysis

SAS procedure: proc CORRESP

simple correspondence analysis

multiple correspondence analysis (option MCA)

same types of normalization as SPSS

option PROFILE (ROW, COLUMN or BOTH)
