Learn Correspondence analysis using SPSS
7/13/2019 Learn Correspondence analysis
Statistics for Marketing & Consumer Research, Copyright 2008 - Mario Mazzocchi
Correspondence Analysis
Chapter 14
Correspondence analysis
A multivariate statistical technique which looks into the association of two or more categorical variables and displays them jointly on a bivariate graph
It can be used to apply multidimensional scaling to categorical variables.
Correspondence analysis and data reduction techniques
Factor and principal component analyses are only applied to metric (interval or ratio) quantitative variables
Traditional multidimensional scaling deals with non-metric preference and perceptual data when those are on an ordinal scale
Correspondence analysis allows data reduction (and graphical representation of dissimilarities) on non-metric nominal (categorical) variables
The issue with categorical (non-ordinal) variables is how to measure distances between two objects: correspondence analysis exploits contingency tables and association measures
Example (Trust data)
Do consumers with different jobs (q55) show preferences
for some specific type of chicken (q6)?
Correspondence Table
Rows (q55): If employed, what is your occupation?
Columns (q6): In a typical week, what type of fresh or frozen chicken do you buy for your household's home consumption?

Occupation                     'Value'  'Standard'  'Organic'  'Luxury'  Active Margin
I am not employed                   17          50         10        17             94
Non manual employee                 11          74         14        28            127
Manual employee                      6          19          4         8             37
Executive                            0           7          6        14             27
Self employed professional           1          18          7         3             29
Farmer / agricultural worker         1           1          1         0              3
Employer / Entrepreneur              0           4          2         3              9
Other                               11          31          1         1             44
Active Margin                       47         204         45        74            370
Independence
If the two characters are independent then the number in the cells of the table should simply depend on the row and column totals (lecture 9)
Measure the distance between the expected frequency in each cell and the actual (observed) frequency
Compute a statistic (the Chi-square statistic) which allows one to test whether the difference between the expected and actual value is statistically significant
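The three points above can be illustrated on the Trust data table. The following is a minimal sketch in plain Python, not the author's own procedure; the function name `chi_square_independence` is introduced here purely for illustration.

```python
# Observed counts from the Trust data correspondence table
# (rows: occupation; columns: Value, Standard, Organic, Luxury chicken).
observed = [
    [17, 50, 10, 17],  # I am not employed
    [11, 74, 14, 28],  # Non manual employee
    [6, 19, 4, 8],     # Manual employee
    [0, 7, 6, 14],     # Executive
    [1, 18, 7, 3],     # Self employed professional
    [1, 1, 1, 0],      # Farmer / agricultural worker
    [0, 4, 2, 3],      # Employer / Entrepreneur
    [11, 31, 1, 1],    # Other
]


def chi_square_independence(table):
    """Return (statistic, degrees of freedom, expected counts) for a two-way table."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    # Under independence each expected count depends only on the margins.
    expected = [[r * c / n for c in col_totals] for r in row_totals]
    # Sum the standardized squared gaps between observed and expected counts.
    statistic = sum(
        (obs - exp) ** 2 / exp
        for obs_row, exp_row in zip(table, expected)
        for obs, exp in zip(obs_row, exp_row)
    )
    dof = (len(row_totals) - 1) * (len(col_totals) - 1)
    return statistic, dof, expected


statistic, dof, expected = chi_square_independence(observed)
print(f"Chi-square = {statistic:.2f} with {dof} degrees of freedom")
```

A p-value could then be obtained from the Chi-square distribution, for instance with `scipy.stats.chi2.sf(statistic, dof)` if SciPy is available.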
Reducing the number of dimensions
The elements composing the Chi-square statistic are standardized metric values, one for each of the cells
They become larger as the association between two specific characters increases
These elements can be interpreted as a metric measure of distance
The resulting matrix is similar to a covariance matrix
A method similar to principal component analysis can be applied to this matrix to reduce the number of dimensions
Coordinates
The principal component scores provide standardized values that can be used as coordinates
One may apply the same data reduction technique
  first by rows (synthesizing occupation as a function of types of chicken)
  then by column (synthesizing types of chicken as a function of occupation)
The first two components for each application generate a bivariate plot which shows both the occupation and the type of chicken in the same space
Output from Correspondence Analysis
Executives prefer Luxury chicken
Unemployed are closer to Value chicken
Applications
It is possible to represent on the same graph consumer preferences for different brands and characteristics of a specific product (e.g. car brands together with colour, power, size, etc.)
This allows one to explore brand choice in relation to characteristics, opening the way to product modifications and innovations to meet consumer preferences
Correspondence analysis is particularly useful when the variables have many categories
The application to metric (continuous) data is not ruled out but data need to be categorized first
Summary
Correspondence analysis is a compositional technique which starts from a set of product attributes to portray the overall preference for a brand
This technique is very similar to PCA and can be employed for data reduction purposes or to plot perceptual maps
Because of the way it is constructed, correspondence analysis can be applied to either the rows or the columns of the data matrix
For example, if rows represent brands and columns are different attributes:
1. By applying the method by rows one obtains the coordinates for the brands
2. The application by columns allows one to represent the attributes in the same graph
Steps to run correspondence analysis
1. Represent the data in a contingency table
2. Translate the frequencies of the contingency table into a matrix of metric (continuous) distances through a set of Chi-square association measures on the row and column profiles
3. Extract the dimensions (in a similar fashion to PCA)
4. Evaluate the explanatory power of the selected number of dimensions
5. Plot row and column objects in the same co-ordinate space
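The steps listed above can be sketched end to end with NumPy. This is an illustrative implementation under stated assumptions, not the routine used by SPSS or SAS: the dimensions are extracted with a singular value decomposition of the standardized residuals, both sets of coordinates are returned in principal form (scaled by the singular values), which is only one of several possible normalizations, and the function name `correspondence_analysis` is hypothetical.

```python
import numpy as np

# Step 1: the contingency table (Trust data, occupation by type of chicken).
chicken = np.array([
    [17, 50, 10, 17],
    [11, 74, 14, 28],
    [6, 19, 4, 8],
    [0, 7, 6, 14],
    [1, 18, 7, 3],
    [1, 1, 1, 0],
    [0, 4, 2, 3],
    [11, 31, 1, 1],
], dtype=float)


def correspondence_analysis(counts):
    """Simple correspondence analysis of a two-way table (illustrative sketch)."""
    counts = np.asarray(counts, dtype=float)
    P = counts / counts.sum()          # relative frequencies
    r = P.sum(axis=1)                  # row masses
    c = P.sum(axis=0)                  # column masses
    E = np.outer(r, c)                 # expected relative frequencies
    # Step 2: signed square roots of the cell Chi-square terms.
    S = (P - E) / np.sqrt(E)
    # Step 3: extract the dimensions.
    U, sv, Vt = np.linalg.svd(S, full_matrices=False)
    inertia = sv ** 2                  # one eigenvalue per dimension
    # Step 4: explanatory power of each dimension.
    share = inertia / inertia.sum()
    # Step 5: coordinates for rows and columns in the same space.
    row_coords = (U * sv) / np.sqrt(r)[:, None]
    col_coords = (Vt.T * sv) / np.sqrt(c)[:, None]
    return {
        "row_masses": r,
        "col_masses": c,
        "inertia": inertia,
        "share": share,
        "row_coords": row_coords,
        "col_coords": col_coords,
    }


result = correspondence_analysis(chicken)
print("Share of inertia by dimension:", np.round(result["share"], 3))
print("Row coordinates (first two dimensions):")
print(np.round(result["row_coords"][:, :2], 3))
```

Plotting the first two columns of `row_coords` and `col_coords` on the same axes gives a bivariate map of the kind shown in the output slides. The signs of the axes are arbitrary, so a map may appear mirrored relative to another program's output.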
The frequency table
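                               Categorical variable Y (l categories)
                               y1    y2    ...   yj    ...   yl     Row masses
Categorical variable X   x1    f11   f12   ...   f1j   ...   f1l    f10
(k categories)           x2    f21   f22   ...   f2j   ...   f2l    f20
                         ...
                         xi    fi1   fi2   ...   fij   ...   fil    fi0
                         ...
                         xk    fk1   fk2   ...   fkj   ...   fkl    fk0
Column masses                  f01   f02   ...   f0j   ...   f0l    1
Row profile: the cells of row i (fi1, ..., fil)
Column profile: the cells of column j (f1j, ..., fkj)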
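As a small numerical illustration of masses and profiles, the sketch below uses a made-up 2 x 3 table (the counts are hypothetical, not taken from the slides). It follows the common convention of expressing each profile relative to its own marginal total, so that every profile sums to one; the function name `masses_and_profiles` is introduced here for illustration only.

```python
# Hypothetical 2 x 3 table of counts (not from the slides).
table = [
    [10, 20, 10],
    [30, 20, 10],
]


def masses_and_profiles(counts):
    """Return row masses, column masses, row profiles and column profiles."""
    n = sum(sum(row) for row in counts)
    columns = list(zip(*counts))
    row_masses = [sum(row) / n for row in counts]        # f_i0
    col_masses = [sum(col) / n for col in columns]       # f_0j
    # Each profile is a row (or column) divided by its own total.
    row_profiles = [[x / sum(row) for x in row] for row in counts]
    col_profiles = [[x / sum(col) for x in col] for col in columns]
    return row_masses, col_masses, row_profiles, col_profiles


row_masses, col_masses, row_profiles, col_profiles = masses_and_profiles(table)
print("Row masses:   ", row_masses)
print("Column masses:", col_masses)
print("Row profiles: ", row_profiles)
```

With this convention the average of the row profiles, weighted by the row masses, reproduces the column masses, which is why the masses act as the reference point from which distances are measured.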
Interpretation of coordinates
The categories of the x variable can be seen as different coordinates for the points identified by the y variable
The categories of the y variable can be seen as different coordinates for the points identified by the x variable
Thus it is possible to represent the x and y categories as points in space, imposing (as in multidimensional scaling) that they respect some distance measure
Representations
Take the row profile (the categories of x) and plot the categories in a bi-dimensional graph, using the categories of y to define the distances
This allows one to compare nominal categories within the same variable: those categories of x which show similar levels of association with a given category of y can be considered as closer than those with very different levels of association with the same category of y
The same procedure is carried out transposing the table, which means that the categories of y can be represented using the categories of x to define the distances
Computing the distances
When the coordinates are defined simultaneously for the categories of x and y, the Chi-square value can be computed for each cell as follows
Obtain the expected table frequencies
$$n_{ij}^* = \frac{n_{i0}\, n_{0j}}{n_{00}}, \qquad f_{ij}^* = \frac{n_{ij}^*}{n_{00}} = \frac{f_{i0}\, f_{0j}}{f_{00}} = f_{i0}\, f_{0j}$$
where $n_{ij}$ and $f_{ij}$ are the absolute and relative frequencies, respectively, $n_{i0}$ and $n_{0j}$ (or $f_{i0}$ and $f_{0j}$) are the marginal totals for row i and column j (the row masses and column masses) respectively, and $n_{00}$ is the sample size (hence the total relative frequency $f_{00}$ equals one)
The Chi-square value can now be computed for each cell (i,j)
$$\chi_{ij}^2 = \frac{(f_{ij} - f_{ij}^*)^2}{f_{ij}^*}$$
These are the quadratic distances between category i of the x variable and category j of the y variable
The distance matrix
The matrix $\chi^2$ measures all of the associations between the categories of the first variable and those of the second one.
A generalization to the multivariate case (MCA) is possible by stacking the matrix
  Stacking: compose a large matrix by blocks, where each block is the contingency matrix for two variables (all possible associations are taken into consideration)
  The stacked matrix is referred to as the Burt Table
To obtain similarity values from the $\chi^2$ matrix:
  compute the square root of the elemental Chi-square values
  use the appropriate sign (the sign of the difference $f_{ij} - f_{ij}^*$)
  large positive values correspond to strongly associated categories
  large negative values identify those categories where the association is strong but negative, indicating dissimilarity
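The stacking that produces a Burt table can be sketched in a few lines of plain Python. The six records, the variable names and the function name `burt_table` below are all hypothetical and serve only to show the block structure: every pair of variables contributes one contingency block, and the diagonal blocks hold the category counts of each variable.

```python
from itertools import product

# Hypothetical records with three categorical variables (not from the slides).
records = [
    {"job": "executive", "chicken": "luxury", "region": "north"},
    {"job": "executive", "chicken": "organic", "region": "south"},
    {"job": "manual", "chicken": "value", "region": "north"},
    {"job": "manual", "chicken": "standard", "region": "south"},
    {"job": "manual", "chicken": "value", "region": "south"},
    {"job": "executive", "chicken": "luxury", "region": "north"},
]
variables = ["job", "chicken", "region"]


def burt_table(rows, names):
    """Stack all pairwise contingency tables into one symmetric matrix."""
    # One label per (variable, category) pair, grouped by variable.
    labels = [(v, cat) for v in names for cat in sorted({row[v] for row in rows})]
    index = {label: k for k, label in enumerate(labels)}
    size = len(labels)
    burt = [[0] * size for _ in range(size)]
    for row in rows:
        active = [index[(v, row[v])] for v in names]
        # Each record adds one count to every pair of its own categories.
        for a, b in product(active, repeat=2):
            burt[a][b] += 1
    return labels, burt


labels, burt = burt_table(records, variables)
for label, line in zip(labels, burt):
    print(f"{label[0]}={label[1]:<10}", line)
```

The resulting matrix is symmetric, and the block formed by the rows of one variable and the columns of another is exactly their two-way contingency table.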
Estimation
The resulting matrix D contains metric and continuous similarity data
It is possible to apply PCA to translate such a matrix into coordinates for each of the categories, first those of x then those of y
Before PCA can be applied some normalization is required so that the input matrix becomes similar to a correlation matrix
The use of the square root of the row masses (columns) for normalizing the values in D represents the key difference from PCA
The rest of the estimation process follows the results of the PCA
As for PCA, eigenvalues are computed, one for each dimension, which can be used to evaluate the proportion of dissimilarity maintained by that dimension
Inertia
Inertia is a measure of association between two categorical variables based on the Chi-squared statistic.
In correspondence analysis the proportion of inertia explained by each of the dimensions can be regarded as a measure of goodness-of-fit, because the effectiveness of correspondence analysis depends on the degree of association between x and y
Total inertia
  is a measure of the overall association between x and y
  is equal to the sum of the eigenvalues
  corresponds to the Chi-square value divided by the number of observations
  A total inertia above 0.20 is expected for adequate representations
Inertia values can be computed for each of the dimensions and represent the contribution of that dimension to the association (Chi-square) between the two variables
SPSS example
EFS data set: economic position of the household reference person (a093) and type of tenure (a121)
Their Pearson Chi-square value is 274, which means significant association at the 99.9% confidence level
Analysis
Define the range, i.e. the categories for each variable that enter the analysis
Some categories can be indicated as supplementary: they appear in the graphical representation, but do not influence the actual estimation of the scores
Model options
Choose the number of dimensions to be retained
Choice of distance measure
Standardization (only for Euclidean distance)
Normalization: which variable should be privileged?
Number of dimensions
The maximum number of dimensions for the analysis is equal to the number of rows minus one, or the number of columns minus one (whichever is the smaller)
In our example, the maximum number of dimensions would be five, which reduces to four due to missing values in one row category.
As shown later in this section, one may then choose to graphically represent only a sub-set of the extracted dimensions (usually two or three) to make interpretation easier
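The rule above amounts to a one-line calculation. In the sketch below the category counts are an assumption read off the row and column score tables shown later in these slides (six economic positions and seven active tenure types); the function name `max_dimensions` is hypothetical.

```python
def max_dimensions(n_rows, n_cols):
    """Largest number of dimensions that can be extracted from a two-way table."""
    return min(n_rows - 1, n_cols - 1)


# Assumed counts: 6 row categories and 7 active column categories.
print(max_dimensions(6, 7))  # all six row categories usable
print(max_dimensions(5, 7))  # one row category dropped for lack of data
```

The second call corresponds to the situation described above, where one row category contributes no data and the maximum falls by one.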
Distance measure
Chi-square distance (as discussed earlier)
Euclidean distance
  uses the square root of the sum of squared differences between pairs of rows and pairs of columns
  this also requires one to choose a method for centering the data (see the SPSS manual for details)
For this example standard correspondence analysis (with the Chi-square distance) does not require a standardization method.
Normalization method
Defines how correspondence analysis is run: whether to give priority to comparisons between the categories for x (rows) or those for y (columns)
This choice influences the way distances are summarized by the first dimensions
Row principal normalization: the Euclidean distances in the final bivariate plot of x and y are as close as possible to the Chi-square distances between the rows, that is the categories of x
The opposite is valid for the column principal method
Symmetrical normalization: the distances on the graph resemble as much as possible distances for both x and y by spreading the total inertia symmetrically
Principal normalization: inertia is first spread over the scores for x, then y
Weighted normalization: defines a weighting value between minus one and plus one, where minus one is the column principal, zero is symmetrical and plus one is the row principal
EFS example: the row principal method is more appropriate, as it is more relevant to see how differences in socio-economic conditions impact on the tenure type than to look at distances between tenure types.
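One way to picture the row principal, symmetrical and column principal options is as different ways of sharing the singular values between row and column scores. The sketch below is an assumption made for illustration, not the documented SPSS algorithm: a weight `q` between minus one and plus one sets the exponents `(1 + q) / 2` for rows and `(1 - q) / 2` for columns. The table `example` and the function name `ca_scores` are hypothetical.

```python
import numpy as np

# Hypothetical 3 x 3 table of counts (not from the slides).
example = np.array([
    [20, 10, 5],
    [10, 15, 10],
    [5, 10, 25],
], dtype=float)


def ca_scores(counts, q=1.0):
    """Row and column scores for a weight q (assumed scheme).

    q = 1 gives row principal, q = 0 symmetrical, q = -1 column principal.
    """
    P = np.asarray(counts, dtype=float)
    P = P / P.sum()
    r, c = P.sum(axis=1), P.sum(axis=0)
    expected = np.outer(r, c)
    S = (P - expected) / np.sqrt(expected)
    U, sv, Vt = np.linalg.svd(S, full_matrices=False)
    keep = sv > 1e-12                      # drop null dimensions
    U, sv, V = U[:, keep], sv[keep], Vt.T[:, keep]
    # Share the singular values between rows and columns according to q.
    rows = (U / np.sqrt(r)[:, None]) * sv ** ((1 + q) / 2)
    cols = (V / np.sqrt(c)[:, None]) * sv ** ((1 - q) / 2)
    return rows, cols, sv ** 2


row_scores, col_scores, inertia = ca_scores(example, q=1.0)
print("Row principal row scores:")
print(np.round(row_scores, 3))
```

With `q=1.0` the squared Euclidean distance between two rows of `row_scores` equals the Chi-square distance between the corresponding row profiles, which is the property the row principal method is designed to preserve.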
Additional statistics
Although CA is a nonparametric method, it is possible to compute standard deviations and correlations under the assumption of multinomial distribution of the cell frequencies (when data are obtained as a random sample from a normally distributed population)
Permutations allow one to order the categories of x and y using scores obtained from CA
E.g. the tenure types and the socio-economic conditions might follow some ordering but cannot be defined with sufficient precision to consider these variables as ordinal.
One can use the scores in the first dimension (or the first two) to order the categories and produce a permuted correspondence table.
Plots
Three graphs:
Biplot (both x & y)
x only (rows)
y only (columns)
One usually chooses to represent only the first two or three of the extracted dimensions
Output
Summary
                                                       Proportion of Inertia       Confidence Singular Value
           Singular              Chi                   Accounted                   Standard     Correlation
Dimension  Value     Inertia     Square     Sig.       for         Cumulative      Deviation    2       3       4
1          .669      .447                              .850        .850            .031         .094    -.032   -.022
2          .209      .044                              .083        .933            .055                 .011    .081
3          .173      .030                              .057        .990            .055                         -.042
4          .072      .005                              .010        1.000           .053
Total                .526        231.402    .000(a)    1.000       1.000
a. 24 degrees of freedom
The SV is the square root of inertia (the eigenvalue)
The Chi-square stat suggests strong and significant association
The first dimension explains 85%, the first two 93% of total inertia. However, note that total inertia does not correspond to total variability, but to the variability of the extracted dimensions
Usually a value of total inertia above 0.2 is regarded as acceptable
These precision measures are based on the multinomial distribution assumption
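The relationships noted on this output can be checked by hand from the figures in the Summary table. The short sketch below only re-does that arithmetic; small gaps are expected because the table reports values rounded to three decimals.

```python
# Figures copied from the Summary table.
singular_values = [0.669, 0.209, 0.173, 0.072]
reported_inertia = [0.447, 0.044, 0.030, 0.005]
reported_share = [0.850, 0.083, 0.057, 0.010]
chi_square = 231.402

# Inertia of each dimension is the squared singular value.
squared = [sv ** 2 for sv in singular_values]

# Total inertia is the sum of the eigenvalues.
total_inertia = sum(reported_inertia)

# Proportion of inertia accounted for by each dimension, and its running sum.
share = [value / total_inertia for value in reported_inertia]
cumulative = [sum(share[: k + 1]) for k in range(len(share))]

# Total inertia is Chi-square divided by the number of observations,
# so the ratio gives the implied number of observations (up to rounding).
implied_n = chi_square / total_inertia

for k in range(4):
    print(f"Dimension {k + 1}: SV^2 = {squared[k]:.3f}, "
          f"share = {share[k]:.3f}, cumulative = {cumulative[k]:.3f}")
print(f"Total inertia = {total_inertia:.3f}, implied observations = {implied_n:.0f}")
```

The implied number of observations is only approximate, since both the Chi-square value and the total inertia are rounded in the output.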
Row scores
Overview Row Points (b)

Economic position of                       Score in Dimension
Household Reference Person         Mass    1       2        3       4       Inertia
Self-employed                      .080    .296    .025     .433    -.164   .024
Fulltime employee                  .539    .527    .049     -.039   .026    .152
Pt employee                        .077    -.239   -.409    -.352   -.143   .028
Unemployed                         .018    -.154   -1.223   .509    .241    .033
Work related govt train prog (a)   .000    .       .        .       .       .
Ret unoc over min ni age           .286    -.999   .089     .015    .019    .288
Active Total                       1.000                                    .526

Contribution of Point to Inertia of Dimension
                                   1       2       3       4
Self-employed                      .016    .001    .496    .407
Fulltime employee                  .334    .030    .027    .071
Pt employee                        .010    .295    .318    .300
Unemployed                         .001    .622    .157    .202
Work related govt train prog (a)   .000    .000    .000    .000
Ret unoc over min ni age           .639    .052    .002    .020
Active Total                       1.000   1.000   1.000   1.000

Contribution of Dimension to Inertia of Point
                                   1       2       3       4       Total
Self-employed                      .290    .002    .620    .089    1.000
Fulltime employee                  .984    .008    .005    .002    1.000
Pt employee                        .156    .453    .336    .055    1.000
Unemployed                         .013    .814    .141    .032    1.000
Work related govt train prog (a)   .       .       .       .       .
Ret unoc over min ni age           .992    .008    .000    .000    1.000

a. Supplementary point
b. Row Principal normalization
The mass column shows the relative weight of each category on the sample
Scores are computed for each category but the supplementary one, provided there are no missing data
Scores are the coordinates for the map
The inertia column shows how total inertia has been distributed across rows (similar to communalities)
Fulltime employees and retired persons have a higher relevance because they are more important categories in the original correspondence table. These two categories (especially retirement) strongly contribute to explaining the first dimension
The second dimension is characterized by unemployed and part-time employees
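The two contribution panels can be reproduced from the mass, score and inertia columns. Under row principal normalization the row scores are principal coordinates, so a point's contribution to a dimension is its mass times the squared score divided by that dimension's inertia, and a dimension's contribution to a point is the same numerator divided by the point's own inertia. The sketch below checks this for the two heaviest categories on the first dimension; the function name `contributions` is introduced for illustration.

```python
# Inertia (eigenvalue) of the first dimension, from the Summary table.
eigenvalue_dim1 = 0.447

# (mass, score in dimension 1, inertia of the point) from the row points table.
row_points = {
    "Fulltime employee": (0.539, 0.527, 0.152),
    "Ret unoc over min ni age": (0.286, -0.999, 0.288),
}


def contributions(mass, score, point_inertia, eigenvalue):
    """Return (point-to-dimension, dimension-to-point) contributions."""
    weighted_square = mass * score ** 2
    return weighted_square / eigenvalue, weighted_square / point_inertia


for name, (mass, score, point_inertia) in row_points.items():
    to_dimension, to_point = contributions(mass, score, point_inertia, eigenvalue_dim1)
    print(f"{name}: point to dimension = {to_dimension:.3f}, "
          f"dimension to point = {to_point:.3f}")
```

The results agree with the tabulated values (.334 and .639 for the first panel, .984 and .992 for the second) up to the rounding of the published figures.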
Column scores
The same exercise is carried out on columns; however, the row principal method does not normalize by column
Overview Column Points (b)

                                            Score in Dimension
Tenure - type                       Mass    1        2        3        4        Inertia
Local Authority rented unfurnished  .098    -.699    -1.993   .051     1.106    .039
Housing association                 .066    -.781    -1.263   2.821    -1.273   .039
Other rented unfurnished            .050    .487     -2.023   -2.190   .891     .022
Rented furnished                    .032    .531     -1.098   -2.270   -4.585   .014
Owned with mortgage                 .457    .971     .371     .233     .133     .196
Owned by rental purchase            .002    1.179    1.120    -1.287   5.002    .002
Owned outright                      .295    -1.244   .819     -.382    .018     .214
Rent free (a)                       .009    -.957    -1.039   -2.996   -3.705   .007
Active Total                        1.000                                       .526

Contribution of Point to Inertia of Dimension
                                    1       2       3       4
Local Authority rented unfurnished  .048    .388    .000    .120
Housing association                 .040    .105    .524    .107
Other rented unfurnished            .012    .205    .240    .040
Rented furnished                    .009    .038    .164    .669
Owned with mortgage                 .431    .063    .025    .008
Owned by rental purchase            .003    .003    .004    .057
Owned outright                      .457    .198    .043    .000
Rent free (a)                       .000    .000    .000    .000
Active Total                        1.000   1.000   1.000   1.000

Contribution of Dimension to Inertia of Point
                                    1       2       3       4       Total
Local Authority rented unfurnished  .548    .436    .000    .016    1.000
Housing association                 .462    .118    .405    .014    1.000
Other rented unfurnished            .245    .413    .333    .010    1.000
Rented furnished                    .284    .119    .349    .248    1.000
Owned with mortgage                 .982    .014    .004    .000    1.000
Owned by rental purchase            .725    .064    .058    .153    1.000
Owned outright                      .954    .040    .006    .000    1.000
Rent free (a)                       .512    .059    .338    .090    1.000

a. Supplementary point
b. Row Principal normalization
By column the first dimension is especially related to the owned with mortgage and owned outright categories
Bi-plot
Employed individuals are closer to owned accommodations
Retired individuals are also close to owned accommodations
Part-time employees and unemployed individuals are closer to rented accommodations and other forms of accommodations
Multiple Correspondence Analysis (MCA)
When all variables are multiple nominal, then optimal scaling applies MCA
Plot with 3 variables
The analysis now also includes the government office region
SAS correspondence analysis
SAS procedure: proc CORRESP
simple correspondence analysis
multiple correspondence analysis (option MCA)
same types of normalization as SPSS
option PROFILE (ROW, COLUMN or BOTH)