Correspondence Analysis and Related Methods
Michael Greenacre
Universitat Pompeu Fabra, Barcelona
[email protected]
www.globalsong.net  www.econ.upf.es/~michael
[Timeline graphic: publication milestones, 1961–2007]
First XLSTAT Users Conference
Paris, 7–8 June 2007
Correspondence Analysis and Related Methods – Part 1
1. What is correspondence analysis (CA)?
2. Why is CA so useful as a method of visualizing tabular data?
3. How is CA implemented in XLSTAT?
(by CA I mean “simple” CA, as opposed to “multiple” CA, which is discussed in the next talk)
Jean-Paul Benzécri ... creator of correspondence analysis

Correspondence analysis: in which areas of research is it useful?

CA visualizes complex data, primarily data on categorical measurement scales, facilitating understanding and interpretation – a neglected aspect of statistical enquiry (cf. the usual modelling approach).

• linguistics, textual analysis: word frequencies
• sociology: cross-tabulations and large sets of categorical data from questionnaires; useful for qualitative research and visualization of case-study data
• ecology: species abundance data at several locations, often with explanatory variables
• market research: perceptual mapping of brands/products, ...
• archeology: large sparse data matrices
• biology, geology, chemistry, psychology, ...
Simple Correspondence Analysis (CA)
• CA is a method of data visualization
• It applies in the first instance to a cross-tabulation (contingency table).
• The results of CA are in the form of a map of points.
• The points represent the rows and columns of the table; it is not the absolute values which are represented (as in principal component analysis, for example) but their relative values.
• The positions of the points in the map tell you something about similarities between the rows, similarities between the columns, and the association between rows and columns.
A simple example
• 312 respondents, all readers of a certain newspaper, cross-tabulated according to their education group and level of reading of the newspaper
        C1   C2   C3
E1       5    7    2
E2      18   46   20
E3      19   29   39
E4      12   40   49
E5       3    7   16

• E1: some primary   E2: primary completed   E3: some secondary   E4: secondary completed   E5: some tertiary
• C1: glance   C2: fairly thorough   C3: very thorough
[CA map of the readership data: education groups E1–E5 and reading categories C1–C3; principal inertias 0.0704 (84.52%) and 0.0129 (15.48%)]
Three basic geometric concepts

• profile – the coordinates (position) of the point
• mass – the weight given to the point
• distance – the measure of proximity between points
Four derived geometrical concepts

• centroid – the weighted average position
• inertia – the weighted sum of squared distances to the centroid: inertia = Σ_i m_i d_i²
• projection – the closest point in the subspace
• subspace – a space of reduced dimensionality within the full space
Profile

• A profile is a set of relative frequencies, that is, a set of frequencies expressed relative to their total (often in percentage form).
• Each row and each column of a table of frequencies defines a different profile.
• It is these profiles which CA visualizes as points in a map.
original data

        C1   C2   C3   Total
E1       5    7    2     14
E2      18   46   20     84
E3      19   29   39     87
E4      12   40   49    101
E5       3    7   16     26
Total   57  129  126    312

row profiles (each row sums to 1)

        C1    C2    C3
E1     .36   .50   .14
E2     .21   .55   .24
E3     .22   .33   .45
E4     .12   .40   .49
E5     .12   .27   .62

column profiles (each column sums to 1)

        C1    C2    C3
E1     .09   .05   .02
E2     .32   .36   .16
E3     .33   .22   .31
E4     .21   .31   .39
E5     .05   .05   .13
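These definitions can be checked numerically; a minimal sketch, assuming only numpy, with the readership table above:

```python
import numpy as np

# readership data: rows E1..E5, columns C1..C3
N = np.array([[ 5,  7,  2],
              [18, 46, 20],
              [19, 29, 39],
              [12, 40, 49],
              [ 3,  7, 16]], dtype=float)

n = N.sum()                                      # grand total: 312
row_profiles = N / N.sum(axis=1, keepdims=True)  # e.g. E1 -> (.36, .50, .14)
col_profiles = N / N.sum(axis=0, keepdims=True)  # e.g. C1 -> (.09, .32, .33, .21, .05)
row_masses = N.sum(axis=1) / n                   # (.045, .269, .279, .324, .083)
col_masses = N.sum(axis=0) / n                   # (.183, .413, .404)
```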
Row profiles viewed in 3-d
Plotting profiles in profile space (triangular coordinates)

E1 = (0.36, 0.50, 0.14)
Weighted average (centroid)

The average of two points is the point at which they balance; the weighted average moves toward the point with the larger mass. The situation is identical for multidimensional points...
Plotting profiles in profile space (barycentric – or weighted average – principle)

E1 = (0.36, 0.50, 0.14)
E2 = (0.21, 0.55, 0.24)
E5 = (0.12, 0.27, 0.62)

Each profile is positioned at the weighted average of the triangle's corner points, with the profile elements as the weights, as sketched below.
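An illustrative sketch of the barycentric placement (the assignment of corners to C1, C2, C3 is my assumption for the example):

```python
import numpy as np

# corners of an equilateral triangle representing the unit profiles
# (1,0,0), (0,1,0), (0,0,1) -- the "pure" C1, C2, C3 vertices
corners = np.array([[0.0, 0.0],              # C1
                    [1.0, 0.0],              # C2
                    [0.5, np.sqrt(3) / 2]])  # C3

def to_triangle(profile):
    """Map a profile (summing to 1) to 2-D triangular coordinates."""
    return np.asarray(profile) @ corners

print(to_triangle([0.36, 0.50, 0.14]))  # position of E1
print(to_triangle([0.12, 0.27, 0.62]))  # position of E5
```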
Masses of the profiles

original data, with row totals and masses (mass = row total / 312):

        C1   C2   C3   Total   Mass
E1       5    7    2     14    .045
E2      18   46   20     84    .269
E3      19   29   39     87    .279
E4      12   40   49    101    .324
E5       3    7   16     26    .083
Total   57  129  126    312      1

average row profile: (.183, .413, .404) = (57, 129, 126) / 312
Readership data

Education group           C1           C2           C3          Total   Mass
E1  Some primary           5 (0.357)    7 (0.500)    2 (0.143)    14    0.045
E2  Primary completed     18 (0.214)   46 (0.548)   20 (0.238)    84    0.269
E3  Some secondary        19 (0.218)   29 (0.333)   39 (0.448)    87    0.279
E4  Secondary completed   12 (0.119)   40 (0.396)   49 (0.485)   101    0.324
E5  Some tertiary          3 (0.115)    7 (0.269)   16 (0.615)    26    0.083
Total                     57 (0.183)  129 (0.413)  126 (0.404)   312

C1: glance   C2: fairly thorough   C3: very thorough
(row profiles in parentheses; the bottom row is the average profile)
Calculating chi-square

χ² = 12 similar terms ...
   + (3 − 4.76)²/4.76 + (7 − 10.74)²/10.74 + (16 − 10.50)²/10.50

The last three terms come from row E5 (some tertiary): the observed frequencies are 3, 7 and 16; the expected frequencies are 4.76, 10.74 and 10.50, which also sum to the row total, 26.0. For example, the expected frequency of cell (E5, C1) is 0.183 × 26 = 4.76.
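A minimal sketch of the same calculation, assuming numpy (N is the readership table):

```python
import numpy as np

N = np.array([[ 5,  7,  2],
              [18, 46, 20],
              [19, 29, 39],
              [12, 40, 49],
              [ 3,  7, 16]], dtype=float)

n = N.sum()                                   # 312
# expected frequency = row total * column total / grand total
expected = np.outer(N.sum(axis=1), N.sum(axis=0)) / n
print(expected[4, 0])    # cell (E5, C1): 4.75 (the slide's 4.76 uses the rounded mass 0.183)

chi2 = ((N - expected)**2 / expected).sum()   # Pearson chi-square, ~26.0
print(chi2, chi2 / n)                         # total inertia = chi2 / 312, ~0.0833
```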
Calculating chi-square

Taking the row total 26 out of the E5 terms:

χ² = 12 similar terms ...
   + 26 [ (3/26 − 4.76/26)²/(4.76/26) + (7/26 − 10.74/26)²/(10.74/26) + (16/26 − 10.50/26)²/(10.50/26) ]

and dividing throughout by the grand total 312 (note that 26/312 = 0.083 is the mass of E5, and e.g. 4.76/26 = 0.183 is an element of the average profile):

χ²/312 = 12 similar terms ...
   + 0.083 [ (0.115 − 0.183)²/0.183 + (0.269 − 0.413)²/0.413 + (0.615 − 0.404)²/0.404 ]
Calculating inertia

Inertia = χ²/312 = similar terms for the first four rows ...
   + 0.083 [ (0.115 − 0.183)²/0.183 + (0.269 − 0.413)²/0.413 + (0.615 − 0.404)²/0.404 ]

where 0.083 is the mass of row E5 and the term in brackets is the squared chi-square distance between the profile of E5 and the average profile. In general:

Inertia = Σ mass × (chi-square distance)²

The bracketed expression is a weighted Euclidean distance: an ordinary Euclidean sum of squares, with each squared difference divided by the corresponding element of the average profile.
How can we see chi-square distances?

The squared chi-square distance of E5 from the average is

(0.115 − 0.183)²/0.183 + (0.269 − 0.413)²/0.413 + (0.615 − 0.404)²/0.404

= ((0.115 − 0.183)/√0.183)² + ((0.269 − 0.413)/√0.413)² + ((0.615 − 0.404)/√0.404)²

So the answer is to divide all profile elements by the square roots of their averages. In this “stretched” space, chi-square distances become ordinary (“Pythagorean”) Euclidean distances.

[Figure: “stretched” row profiles viewed in 3-D chi-square space, showing the profiles and the vertices]
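A sketch of the stretching trick in numpy (profile values taken from the readership table above):

```python
import numpy as np

row_profiles = np.array([[0.357, 0.500, 0.143],
                         [0.214, 0.548, 0.238],
                         [0.218, 0.333, 0.448],
                         [0.119, 0.396, 0.485],
                         [0.115, 0.269, 0.615]])
avg = np.array([0.183, 0.413, 0.404])   # average profile

# stretch: divide every profile element by the square root of the
# corresponding average element; ordinary Euclidean distance in the
# stretched space then equals the chi-square distance
stretched = row_profiles / np.sqrt(avg)
d = np.linalg.norm(stretched - avg / np.sqrt(avg), axis=1)
print(d[4])   # chi-square distance of E5 from the centroid, ~0.43
```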
What CA does…

• … centres the row and column profiles with respect to their average profiles, so that the origin of the map represents the average.
• … re-defines the dimensions of the space in an ordered way: the first dimension “explains” the maximum amount of inertia possible in one dimension; the second adds the maximum amount to the first (hence the first two explain the maximum possible in two dimensions), and so on, until all the inertia is “explained”.
• … decomposes the total inertia along the principal axes into principal inertias, usually expressed as percentages of the total.
• … so if we want a low-dimensional version, we just take the first (principal) dimensions.

The row and column solutions are closely related: one can be obtained from the other, and there are simple scaling factors along each dimension relating the two problems. (A computational sketch follows below.)
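The whole computation can be sketched with one singular value decomposition of the matrix of standardized residuals; this is a minimal numpy sketch, not XLSTAT's actual code (N is the readership table):

```python
import numpy as np

N = np.array([[ 5,  7,  2],
              [18, 46, 20],
              [19, 29, 39],
              [12, 40, 49],
              [ 3,  7, 16]], dtype=float)

P = N / N.sum()                    # correspondence matrix
r = P.sum(axis=1)                  # row masses
c = P.sum(axis=0)                  # column masses

# standardized residuals: centring and chi-square metric in one step
S = (P - np.outer(r, c)) / np.sqrt(np.outer(r, c))

U, sv, Vt = np.linalg.svd(S, full_matrices=False)

principal_inertias = sv**2               # 0.0704, 0.0129 (and ~0)
F = (U / np.sqrt(r)[:, None]) * sv       # row principal coordinates
G = (Vt.T / np.sqrt(c)[:, None]) * sv    # column principal coordinates
```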
Asymmetric Maps using XLSTAT

[Two asymmetric maps of the readership data, the row-principal and column-principal versions; principal inertias 0.07037 (84.5%) and 0.01289 (15.5%)]
Symmetric Map using XLSTAT
some tertiary
secondary complete
secondaryincomplete
primarycomplete
primary incomplete
very thorough
fairly thorough
glance
-0.2
0
0.2
-0.6 -0.4 -0.2 0 0.2 0.4 0.6
.07037 (84,5%)
.01289 (15,5%)
Asymmetric and symmetric maps

Asymmetric maps represent the rows and columns jointly in principal and standard coordinates; asymmetric maps are also biplots.

Because the principal coordinates can be much “smaller” than the standard coordinates, especially when λk is small, the generally accepted way of making the joint map is the symmetric map, where both rows and columns are in principal coordinates. Symmetric maps are – strictly speaking – not biplots, but they are almost so (see Gabriel, Biometrika, 2002).
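Continuing the SVD sketch above, the two coordinate systems differ only by the scaling factors √λk along each dimension (my illustration, not XLSTAT output):

```python
# standard coordinates: principal coordinates without the sqrt(lambda_k) scaling
X_std = U / np.sqrt(r)[:, None]       # rows
Y_std = Vt.T / np.sqrt(c)[:, None]    # columns

# principal = standard * singular value (= square root of principal inertia)
assert np.allclose(F, X_std * sv)
assert np.allclose(G, Y_std * sv)
```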
Data set “product” (McFie et al.)

Companies  ProdQual  Innovatn  ProdRange  Environm  PriceLevel  ModImage  PriceSens  GlobProd
A              3        16        14         13         14         18         6         18
B              1        15         6          8         10         13        14          9
C             13        11         4         13         11          4        10          2
D              9        11         4          9         11          9        11          3
E              6        14        15         17         14         16         8         15
F              3        16        14         15         12         14         7         16
G             18        12        13         16         13          5         4          7
H              2        14         7          6         10          4        14          8
I             10        14        13         12         14         16         4          8
ours           4        15        15         16         14          7         6         15

• Our company wishes to identify the perceptions of itself and its nine major competitors.
• Data are gathered from representatives of 18 companies that make up the potential client base: each has to say which companies they associate with which of the 8 attributes.
• The aim is to get an idea of the relationships between the competitors and the attributes, and of where our company is situated in the overall scheme.
Reduction of dimensionality

• data centred (with respect to the means)
• points weighted (row masses) – in the case of frequency data, points are weighted by their row masses, that is, the relative frequencies of each row (i.e. proportional to the sample sizes n)
• metric weighted (column weights) – distances between points i and i′ are weighted Euclidean (see the sketch after this list):

  d²ii′ = Σj wj (yij − yi′j)²

  e.g. wj = 1/σj², the inverse of the variance, in PCA;
       wj = 1/cj, the inverse of the expected value, in CA
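A small sketch of this weighted metric (the function name is my own):

```python
import numpy as np

def weighted_sq_dist(y1, y2, w):
    """d^2 = sum_j w_j * (y1_j - y2_j)^2 -- weighted Euclidean metric."""
    y1, y2, w = map(np.asarray, (y1, y2, w))
    return float(w @ (y1 - y2)**2)

# CA's chi-square metric: w_j = 1/c_j, the inverse of the average profile;
# here: squared chi-square distance of E5's profile from the centroid
avg = np.array([0.183, 0.413, 0.404])
print(weighted_sq_dist([0.115, 0.269, 0.615], avg, 1 / avg))  # ~0.186
```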
Fat Freddy’s Cat – Dimensional Transmogrifier
with thanks to Jörg Blasius
Data set “product” (McFie et al.) – recap of the table and description shown earlier.
Data set “product” (McFie et al.)

• First note that this is NOT a contingency table, so the chi-square test is not applicable (a permutation test could test for significance, but for that we would need the original respondent-level data).
• This is an interesting example because it can be analyzed “as is” or it can be recoded to bring out certain features.
• Analyzing it with no recoding means that the “size” effect (sometimes called the “halo effect”) is removed, since we analyze profiles, i.e. the counts relative to their total counts. In other words, if a company gets relatively few associations, it is the highest of these (lower) associations that are determinant. Hence, in the extreme case, a pattern of [18 18 18 …] is identical to a pattern of [1 1 1 …]!
• The masses assigned to the companies will be proportional to the number of associations they get.
• If the “size” effect also needs to be visualized, the data table should be doubled.
Data set “product” (McFie et al.)

counts, with row totals:

Company   PQ   In   PR   En   PL   MI   PS   GP   Total
A          3   16   14   13   14   18    6   18     102
B          1   15    6    8   10   13   14    9      76
C         13   11    4   13   11    4   10    2      68
D          9   11    4    9   11    9   11    3      67
E          6   14   15   17   14   16    8   15     105
F          3   16   14   15   12   14    7   16      97
G         18   12   13   16   13    5    4    7      88
H          2   14    7    6   10    4   14    8      65
I         10   14   13   12   14   16    4    8      91
ours       4   15   15   16   14    7    6   15      92

row profiles, expressed as percentages:

Company    PQ     In     PR     En     PL     MI     PS     GP   Total
A         2.9   15.7   13.7   12.7   13.7   17.6    5.9   17.6     102
B         1.3   19.7    7.9   10.5   13.2   17.1   18.4   11.8      76
C        19.1   16.2    5.9   19.1   16.2    5.9   14.7    2.9      68
D        13.4   16.4    6.0   13.4   16.4   13.4   16.4    4.5      67
E         5.7   13.3   14.3   16.2   13.3   15.2    7.6   14.3     105
F         3.1   16.5   14.4   15.5   12.4   14.4    7.2   16.5      97
G        20.5   13.6   14.8   18.2   14.8    5.7    4.5    8.0      88
H         3.1   21.5   10.8    9.2   15.4    6.2   21.5   12.3      65
I        11.0   15.4   14.3   13.2   15.4   17.6    4.4    8.8      91
ours      4.3   16.3   16.3   17.4   15.2    7.6    6.5   16.3      92
Data set “product” (McFie et al.)

• Doubling involves also coding the counts of the numbers (out of 18) that DON'T associate the company with the attribute in each case.
• There are now two columns per attribute – each attribute is represented by the positive and negative ends of the 0-to-18 scale of counts.

Doubled table:

Com.   PQ  PQ-  In  In-  PR  PR-  En  En-  PL  PL-  MI  MI-  PS  PS-  GP  GP-  Total
A       3   15  16    2  14    4  13    5  14    4  18    0   6   12  18    0    144
B       1   17  15    3   6   12   8   10  10    8  13    5  14    4   9    9    144
C      13    5  11    7   4   14  13    5  11    7   4   14  10    8   2   16    144
D       9    9  11    7   4   14   9    9  11    7   9    9  11    7   3   15    144
E       6   12  14    4  15    3  17    1  14    4  16    2   8   10  15    3    144
F       3   15  16    2  14    4  15    3  12    6  14    4   7   11  16    2    144
G      18    0  12    6  13    5  16    2  13    5   5   13   4   14   7   11    144
H       2   16  14    4   7   11   6   12  10    8   4   14  14    4   8   10    144
I      10    8  14    4  13    5  12    6  14    4  16    2   4   14   8   10    144
ours    4   14  15    3  15    3  16    2  14    4   7   11   6   12  15    3    144
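A sketch of the doubling recode in numpy (X is the product table above):

```python
import numpy as np

# product data: 10 companies x 8 attributes, counts out of 18 respondents
X = np.array([[ 3, 16, 14, 13, 14, 18,  6, 18],
              [ 1, 15,  6,  8, 10, 13, 14,  9],
              [13, 11,  4, 13, 11,  4, 10,  2],
              [ 9, 11,  4,  9, 11,  9, 11,  3],
              [ 6, 14, 15, 17, 14, 16,  8, 15],
              [ 3, 16, 14, 15, 12, 14,  7, 16],
              [18, 12, 13, 16, 13,  5,  4,  7],
              [ 2, 14,  7,  6, 10,  4, 14,  8],
              [10, 14, 13, 12, 14, 16,  4,  8],
              [ 4, 15, 15, 16, 14,  7,  6, 15]], dtype=float)

# doubling: pair each attribute column with its complement (18 - count),
# interleaved so each attribute's + and - poles sit side by side
doubled = np.empty((X.shape[0], 2 * X.shape[1]))
doubled[:, 0::2] = X          # positive pole (e.g. PQ)
doubled[:, 1::2] = 18 - X     # negative pole (e.g. PQ-)
# every row of the doubled table now sums to 8 * 18 = 144
```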
Row asymmetric map

[Row asymmetric map of the product data: companies A–I and “ours” together with the eight attributes (ProdQual, Innovatn, ProdRange, Environm, PriceLevel, ModImage, PriceSens, GlobProd); principal inertias 0.0765 (53.1%) and 0.0478 (33.2%)]

• Row points are projections of row profiles – they have inertias along the axes equal to the principal inertias (hence principal coordinates).
• Column points are projections of the extreme “corner” profiles, or vertices (cf. the triangle…) – they have inertia along each axis equal to 1 (hence standard coordinates).
• The profile points are generally close to the average.
Symmetric map

• Row points and column points are both displayed in principal coordinates – both have inertias along the axes equal to the principal inertias.
• Both sets of points occupy similar regions of the map: aesthetically a better graphic.

[Symmetric map of the product data: companies A–I and “ours” with the eight attributes; principal inertias 0.0765 (53.1%) and 0.0478 (33.2%)]
Doubled data: symmetric map

• Attributes have a positive and a negative pole – the average association is at the origin of the map; e.g. In(novation) has a high average, P(roduct)Q(uality) a low average.
• Fairly similar configuration to the undoubled analysis: there is no strong halo effect.

[Symmetric map of the doubled product data: companies A–I and “ours” with both poles of each attribute (PQ/PQ-, In/In-, PR/PR-, En/En-, PL/PL-, MI/MI-, PS/PS-, GP/GP-); principal inertias 0.1173 (54.5%) and 0.0682 (31.7%). Regions annotated: high product quality; high price sensitivity with low environment, product range and price level; high product range, modern image and global products]
Inertia contributions in CA

• Correspondence analysis (CA) is a method of data visualization which represents the profile points in the map that comes closest to all the points, closest in the sense of weighted least squares.
• The inertia explained in the map applies to all the points together: if we say 83% of the inertia is explained in the map, 71% on the first dimension and 12% on the second, these figures are calculated for all row (or column) points jointly.
Inertia contributions in CA

• This type of “inertia-explained-by-axes” calculation can also be made for individual points.
• These more detailed results are aids to interpretation in the form of numerical diagnostics, called contributions.
• Especially when the map does not explain a high percentage of the inertia, these contributions help us to identify points which are represented inaccurately.
• The inertias and their percentages tell us how much of the variance in the table is explained by the principal axes. The contributions do the same, but for each point individually, and help us to see:
  (a) which points are explained better than others;
  (b) which points contribute more to the solution than others.
Geometry of inertia contributions

[Diagram: the i-th point a_i, with mass m_i, lies at chi-square distance d_i from the centroid c; f_ik is the coordinate of its projection on the k-th principal axis]

Total inertia of the cloud of points = Σ_i m_i d_i² = Σ_i m_i Σ_k f_ik² = Σ_k λ_k

Inertia of the i-th point = m_i d_i² = m_i Σ_k f_ik²

Inertia contribution of the i-th point to the k-th axis = m_i f_ik²
Inertia contributions

The inertia contributions m_i f_ik² form a table with one row per point and one column per axis:

          Axis 1      Axis 2    ...   Axis p     Row sum
1        m1 f11²     m1 f12²    ...   m1 f1p²    m1 d1²
2        m2 f21²     m2 f22²    ...   m2 f2p²    m2 d2²
3        m3 f31²     m3 f32²    ...   m3 f3p²    m3 d3²
:           :           :                :          :
n        mn fn1²     mn fn2²    ...   mn fnp²    mn dn²
Col sum    λ1          λ2       ...     λp

Row sums are the inertias of the individual points; column sums are the principal inertias.
Inertia contributions

• m_i f_ik² / λ_k : the amount of the inertia of axis k explained by point i (absolute contribution, CTR)
• m_i f_ik² / (m_i d_i²) : the amount of the inertia of point i explained by axis k (relative contribution, COR)
• m_i f_ik² / (m_i d_i²) = f_ik² / d_i², i.e. the square of f_ik / d_i = cos(θ_ik), where θ_ik is the angle between the point and the axis
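Continuing the earlier SVD sketch (F are the row principal coordinates, r the row masses, sv the singular values), the CTR and COR diagnostics can be sketched as follows; keeping K = 2 axes is my choice for the example:

```python
import numpy as np

K = 2                                       # number of retained axes
point_inertias = r[:, None] * F[:, :K]**2   # m_i * f_ik^2
lam = (sv**2)[:K]                           # principal inertias of retained axes

CTR = point_inertias / lam                  # contribution of point i to axis k
d2 = (F**2).sum(axis=1)                     # squared chi-square distances d_i^2
COR = F[:, :K]**2 / d2[:, None]             # squared cosines cos^2(theta_ik)

# sanity check: each axis's contributions sum to 1 over the points
assert np.allclose(CTR.sum(axis=0), 1.0)
```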
Contributions to axes and contributions to points

Contributions (rows):

        Weight (relative)    F1      F2
A            0.100          0.200   0.010
B            0.100          0.006   0.266
C            0.100          0.249   0.031
D            0.100          0.153   0.011
E            0.100          0.113   0.010
F            0.100          0.113   0.004
G            0.100          0.037   0.414
H            0.100          0.074   0.202
I            0.100          0.009   0.044
ours         0.100          0.048   0.010

Squared cosines (rows):

         F1      F2
A       0.922   0.027
B       0.033   0.914
C       0.901   0.065
D       0.856   0.035
E       0.827   0.045
F       0.929   0.017
G       0.129   0.839
H       0.320   0.510
I       0.087   0.259   ← not so well represented
ours    0.389   0.046   ← not so well represented

Eigenvalues and percentages of inertia:

                             F1        F2
Eigenvalue                  0.117     0.068
Rows depend on columns (%)  54.482    31.656
Cumulative %                54.482    86.139
CARME 2007
Correspondence Analysis & Related Methods
Erasmus University Rotterdam
25–27 June 2007
http://www.carme-n.org

After:
• Correspondence Analysis in the Social Sciences (Cologne, 1991)
• Visualizing Categorical Data (Cologne, 1995)
• Large Scale Data Analysis (Cologne, 1999)
• Correspondence Analysis and Related Methods (CARME 2003) (Barcelona, 2003)

Just published by Chapman & Hall / CRC Press