
The fit of graphical displays to patterns of expectations




Computational Statistics & Data Analysis 36 (2001) 47–67
www.elsevier.com/locate/csda

The fit of graphical displays to patterns of expectations

Moonseong Heo a, K. Ruben Gabriel b,∗

a Obesity Research Center, St. Luke's/Roosevelt Hospital Center, Columbia University College of Physicians and Surgeons, USA

bDepartment of Statistics, University of Rochester, Rochester, NY 14627, USA

Received 1 May 2000; received in revised form 1 June 2000

Abstract

Graphical displays of multivariate data often clearly exhibit features of the expectations even though the data themselves are poorly fitted by the displays. Thus, it often occurs that ordinations and biplots that poorly fit the sample data still reveal salient characteristics such as clusters of similar individuals and patterns of correlation. This paper provides an explanation of this seemingly paradoxical phenomenon and shows that when many variables are analyzed, the common measure of goodness of fit of a lower rank approximation often seriously underestimates the closeness of the fit to underlying patterns. The paper also provides some guidelines on better estimates of the latter goodness of fit. © 2001 Elsevier Science B.V. All rights reserved.

Keywords: Graphical displays; Multivariate analysis; Rank; Biplots; Goodness of fit

1. The problem

Graphical displays of multivariate data often reveal interesting patterns of the individuals and variables even though the data themselves are not well fitted by the displays. Thus, it often occurs that ordinations and biplots that poorly fit the data still reveal salient characteristics such as clusters of similar individuals and structures of correlations. This paper provides an explanation of this seemingly paradoxical phenomenon and makes a contribution to the issue of estimating how well a display matches an underlying pattern.

∗ Corresponding author.

0167-9473/01/$ - see front matter © 2001 Elsevier Science B.V. All rights reserved.
PII: S 0167-9473(00)00016-5


The issue is of practical importance since displays of data are often used to infer patterns of the expectations of the observations (see, in particular, Caussinus, 1986). Such inferences can only be as good as the displays' fits to the expectations or to their functions, such as inter-individual distances. But for any given data set, the closeness of that fit cannot be measured since the expectations are not known. What is usually done instead is to gauge the quality of the display by how well it fits the data itself. That, however, is misleading if the display's fit to the data – which can be calculated – is different from its fit to the expectations – which cannot be calculated but is the relevant criterion.

An analogy is with fitting a linear model by multiple regression: Goodness of fit of y onto the columns of X is usually gauged by the multiple correlation coefficient R, which is the correlation of the data y with the fit Xb, where b are the estimated coefficients; a more relevant measure of the closeness of fit to the population would be the correlation, ρ say, of Xb with the expectations Xβ, where β are the true coefficients. Since, however, ρ cannot be measured from the data, R is used instead even though it is, as confirmed by simulation studies, most often less than ρ.
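This regression analogy is easy to check numerically. The sketch below is not from the paper; the sample size, error scale and coefficients are illustrative assumptions. It compares the sample multiple correlation R = corr(y, Xb) with the population-side correlation ρ = corr(Xb, Xβ) over repeated simulated samples:

```python
import numpy as np

rng = np.random.default_rng(0)

def corr(a, b):
    """Pearson correlation of two vectors."""
    a, b = a - a.mean(), b - b.mean()
    return a @ b / np.sqrt((a @ a) * (b @ b))

n, p, runs = 50, 10, 2000
X = rng.normal(size=(n, p))          # fixed design matrix
beta = rng.normal(size=p)            # true coefficients (known only to the simulation)
R_vals, rho_vals = [], []
for _ in range(runs):
    y = X @ beta + rng.normal(scale=3.0, size=n)   # data = expectations + error
    b, *_ = np.linalg.lstsq(X, y, rcond=None)      # estimated coefficients
    R_vals.append(corr(y, X @ b))                  # multiple correlation R = corr(y, Xb)
    rho_vals.append(corr(X @ b, X @ beta))         # fit to expectations: corr(Xb, X beta)

print(np.mean(R_vals), np.mean(rho_vals))
```

With a substantial error variance, the average R falls well below the average ρ, in line with the claim above.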

Similarly, the closeness of representation of data by an approximate fit or display is usually judged by how small the deviations are, because the closeness to the expectations cannot be calculated from the data. Thus, a two-dimensional approximation to a multivariate data set is usually judged by a coefficient of goodness of fit: In the case of simple least squares, the rank two approximation to X is X̃ = Σ_{k=1}^{2} λ_k u_k v_k′, where X = Σ_k λ_k u_k v_k′ is the singular value decomposition, and the goodness of fit coefficient is

√(1 − ||X̃ − X||² / ||X||²) = √((λ₁² + λ₂²) / Σ_k λ_k²),

where ||·|| is the Euclidean norm, and λ₁ ≥ λ₂ ≥ ⋯ ≥ 0 are the singular values of X (Eckart and Young, 1936; Householder and Young, 1938). Like the multiple correlation, this goodness of fit coefficient apparently underestimates the fit to the expectations, as is discussed in this paper.
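The rank-two approximation and its goodness-of-fit coefficient follow directly from the singular value decomposition; a minimal sketch (numpy assumed, the data purely illustrative):

```python
import numpy as np

def rank2_fit(X):
    """Rank-two least-squares approximation of X and the goodness-of-fit
    coefficient sqrt((l1^2 + l2^2) / sum_k l_k^2) (Eckart-Young)."""
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    X2 = U[:, :2] * s[:2] @ Vt[:2, :]                 # rank-2 approximation
    fit = np.sqrt((s[0]**2 + s[1]**2) / np.sum(s**2))
    return X2, fit

rng = np.random.default_rng(1)
X = rng.normal(size=(30, 10))
X2, fit = rank2_fit(X)
# the two expressions for the coefficient agree:
alt = np.sqrt(1 - np.linalg.norm(X2 - X)**2 / np.linalg.norm(X)**2)
print(fit, alt)
```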

2. An example

The apparent paradox is illustrated by an example in which the display appears to fit an underlying pattern more closely than it fits the data from which it was constructed. Strauss et al. (1979) analyzed data of 30 psychiatric variables measured on 100 “archetypal patients”, 20 from each of five diagnostic groups. The two-dimensional approximation of these data, centered on each variable's mean, had a goodness of fit of only 0.67, i.e., it accounted for only 0.45 (= 0.67²) of the data's variation; and yet the biplot of these data, Fig. 1, showed the markers for a sample of 30 of the patients clustering clearly into groups corresponding to the different diagnoses, and the pattern of groups conformed to the well-known distinctions between the diagnoses, as, for example, psychotic and neurotic depressives being more similar to each other than either group is to manics.


Fig. 1. Biplot of psychiatric patients with 30 variables – goodness of fit to sample rSample = 0.67.

The apparent paradox is that such a display could be meaningful even though it fitted the data poorly. Its explanation was conjectured to be that the data consisted of model-plus-independent error: (i) an underlying five-group model pattern on the 30 variables, plus (ii) considerable uncorrelated error due to individual variation and imprecise measurement.

To test this explanation, we modeled a five-group pattern that emulated the one found by Strauss et al., and added simulated random errors. We did so by defining a 30-variate model vector for each diagnostic group, in such a manner that the group differences became similar to those in Fig. 1. We assigned that model vector to each one of the 20 individuals in that diagnostic group. This resulted in 100 model vectors to which we added random errors by independent simulation of one hundred 30-variate uncorrelated normals. We made several runs of this model-plus-error, changing the variance σ² of the normal errors from one run to another, and we then biplotted the results of each run – four of these are shown in Figs. 2a–d. We also measured the goodness of fit of the biplot to the simulated data as well as to the


Fig. 2. Biplots of simulations of 100 psychiatric patients with 30 variables: (a) with σ = 0.25 – goodness of fit rSample = 0.919 to sample and rModel = 0.994 to model; (b) with σ = 0.5 – goodness of fit rSample = 0.762 to sample and rModel = 0.976 to model; (c) with σ = 0.75 – goodness of fit rSample = 0.644 to sample and rModel = 0.944 to model; (d) with σ = 2.00 – goodness of fit rSample = 0.399 to sample and rModel = 0.612 to model.

model vectors, expressing them by coefficients rSample and rModel, respectively; these are the similarity coefficients defined in Section 3, below.

It was not surprising to find that for very small σ² the biplots were very close to those of the model vectors and had high goodness of fit rSample, as in Fig. 2a. Equally unsurprisingly, large σ² resulted in biplots that were very different from the model pattern and had low goodness of fit, as in Fig. 2d.


What was remarkable was to find that for certain intermediate σ²'s the biplots were very similar to that of Strauss et al., both (i) in the clustering of the individuals and the pattern of the clusters, as well as (ii) in the goodness of fit to the sample, rSample = 0.644, Fig. 2c. This finding supported our conjecture that the data consisted of a distinct configuration plus uncorrelated random error.

These simulations further showed that the goodness of fit of the biplot to the model rModel was always higher than that to the sample rSample and indeed did not decrease much for larger σ²'s. Thus, the biplot of Fig. 2c fitted the model to the extent of rModel = 0.944 even though it fitted the sample only to the extent of rSample = 0.644.
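The experiment can be emulated in a few lines. The sketch below uses assumed group means (random planar means, not the actual Strauss et al. values) and the similarity coefficients of Section 3, computed here via the identity trace{(X̃ᵀΘΘᵀX̃)^{1/2}} = sum of the singular values of ΘᵀX̃:

```python
import numpy as np

rng = np.random.default_rng(2)

def rank2(X):
    """Rank-two least-squares approximation via the SVD."""
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    return U[:, :2] * s[:2] @ Vt[:2, :]

def similarity(A, B):
    """Gower / Lingoes-Schonemann similarity: nuclear norm of B'A over ||A|| ||B||."""
    return np.linalg.svd(B.T @ A, compute_uv=False).sum() / (
        np.linalg.norm(A) * np.linalg.norm(B))

n_per, m = 20, 30
# hypothetical 5-group planar pattern: 5 group means spanned by two 30-variate directions
means = rng.normal(scale=2.0, size=(5, 2)) @ rng.normal(size=(2, m))
Theta = np.repeat(means, n_per, axis=0)            # 100 model vectors
Theta -= Theta.mean(axis=0)                        # center on column means

results = []
for sigma in (0.25, 0.75, 2.0):
    X = Theta + rng.normal(scale=sigma, size=Theta.shape)  # model plus error
    X -= X.mean(axis=0)
    Xt = rank2(X)
    r_sample = similarity(Xt, X)
    r_model = similarity(Xt, Theta)
    results.append((sigma, r_sample, r_model))
    print(sigma, round(r_sample, 3), round(r_model, 3))
```

As in the paper's runs, the fit to the model stays high while the fit to the sample drops as the error grows.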

3. Formulation of a model

For a more precise formulation of the paradoxical phenomenon and the conjectured explanation, some definitions and assumptions were made along the lines of those by Fine (1992), the model itself having been discussed much earlier by Young (1940).

A multivariate matrix X of n rows and m columns (n > m) was assumed to be the sum

X = Θ + E,

of Θ of affine rank 2 with a given pattern and E consisting of nm i.i.d. normals of mean zero and variance σ². The matrix patterns focused on were equally sized clusters of rows whose coordinates were in an offset plane (see, however, Section 8, below). The affine planar requirement could be stated as rank-two for the matrix centered on its column means. As Fine (1992) points out, this expectations-plus-error model is distinct from the better known model in which the columns of X are specified as variables having a joint (Gaussian) distribution, generally not independent, and there is a constant expectation for each variable. The latter model, which underlies much of the factor analytic literature, is not considered here.

The display was taken to be an affine planar approximation X̃ of X and was computed by applying the Eckart–Householder–Young theorem to X after centering.

Closeness of X̃ to Θ was defined as the similarity of the configuration of the rows of X̃ and the configuration of the rows of Θ, where each configuration was understood as consisting of the collection of the Euclidean distances between the rows of its matrix. Since configurations are invariant over location, it was convenient to continue the discussion in terms of centered matrices X, X̃ and Θ, except when simulations were discussed in terms of non-centered matrices, which were subsequently centered for display and measurement of similarities.

Closeness of configurations X̃ and Θ was measured by the following similarity coefficient due to Gower (1971) and Lingoes and Schönemann (1974):

rModel(X) = trace{(X̃ᵀΘΘᵀX̃)^{1/2}} / (||X̃|| ||Θ||),


where, for any non-negative matrix N, N^{1/2} is the symmetric matrix that satisfies N^{1/2}N^{1/2} = N. Closeness of X̃ to X was similarly measured by

rSample(X) = trace{(X̃ᵀXXᵀX̃)^{1/2}} / (||X̃|| ||X||),

which is known to equal the goodness of fit coefficient √((λ₁² + λ₂²) / Σ_k λ_k²).

It will be noted that these are measures of similarity of relative distances, rather than of absolute distances.
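Numerically, the matrix square root in these coefficients need not be formed explicitly: since X̃ᵀΘΘᵀX̃ = (ΘᵀX̃)ᵀ(ΘᵀX̃), the trace of its symmetric square root equals the sum of the singular values (the nuclear norm) of ΘᵀX̃. A sketch cross-checking the two routes on arbitrary illustrative matrices (numpy assumed):

```python
import numpy as np

rng = np.random.default_rng(3)
Xt = rng.normal(size=(30, 6))      # stand-in for the display matrix X-tilde
Theta = rng.normal(size=(30, 6))   # stand-in for the pattern matrix Theta

# route 1: explicit symmetric square root via an eigendecomposition
M = Xt.T @ Theta @ Theta.T @ Xt                    # X~' Theta Theta' X~, PSD
w, V = np.linalg.eigh((M + M.T) / 2)               # symmetrize against roundoff
num_root = np.trace(V * np.sqrt(np.clip(w, 0, None)) @ V.T)

# route 2: nuclear norm of Theta' X~
num_nuc = np.linalg.svd(Theta.T @ Xt, compute_uv=False).sum()

r = num_nuc / (np.linalg.norm(Xt) * np.linalg.norm(Theta))
print(num_root, num_nuc, r)
```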

The above approach considers the display of multivariate data as a representation of the configuration of distances between the expectations of different individuals (see the principal component analysis examples in Caussinus, 1993, Caussinus and Ferré, 1992, and Caussinus and Ruiz-Gazen, 1993, 1995, all of which illustrate the display of configurations of individuals). An alternative approach considers a display as a representation of the multivariate expectations themselves, as in the case of biplots in which the inner products, or coordinates of row marker points with respect to column marker axes, can be seen as estimates of expectations (see the discussion of goodness of fit criteria in Caussinus, 1986, Caussinus, 1992a,b). The alternative approach would define closeness simply as the norm of the difference between X̃ and Θ (Caussinus, 1986, 1992a) and use a goodness of fit coefficient such as

rExpect(X) = √(1 − ||X̃ − Θ||² / ||Θ||²)  if ||X̃ − Θ||² ≤ ||Θ||²,
           = 0                          otherwise.

It can be shown that rExpect(X) ≤ rModel(X) for all data. Some comments on results for this criterion will be offered in Section 8, below. Remarkably, the corresponding sample coefficient √(1 − ||X̃ − X||² / ||X||²) also equals √((λ₁² + λ₂²) / Σ_k λ_k²) and hence equals rSample(X) as well.

The paradoxical phenomenon could then be stated as follows: It often occurs that rModel(X) is large while rSample(X) is small. The conjecture was then reformulated more specifically as: This occurs when the number m of variables is large and σ is not small relative to the average magnitude of the elements of Θ.

This phenomenon has been observed mainly on two-dimensional displays that revealed a structure of clusters. The present study is limited to representations of data by affine rank-two approximations and to models consisting of patterns of clusters in a plane (see, however, Section 8). The former limitation is justified by the fact that the most practically useful types of displays, including most biplots, are planar. Nor was the focus on cluster models considered serious, since the study compared entire configurations, not just affiliation with particular groups, and the similarity coefficients used are more heavily influenced by the inter-group distances than by distances within groups. For example, for five clusters aligned on a semicircle – see pattern t = 3, below – the configuration expresses the ordering and relative distances between the clusters as much as the affiliation of individual points with particular clusters.


A more serious limitation is that all the models used here have the clusters and the individuals located on a plane. This paper must therefore be seen as a first step in a wider study which includes models with a higher-dimensional structure.

This paper studies the paradoxical phenomenon and its conjectured explanation both by simulations and by asymptotic approximations. It also provides some estimates of the bias and the variance of the discrepancies between the available rSample(X) and the relevant rModel(X).

The study is formulated in terms of closeness of row configurations, which is what the above similarity coefficients measure. Its measurements are therefore the same for all matrices Θ with the same configuration, i.e., the same collection of relative Euclidean distances between the rows. Hence the study could, without loss of generality, focus on pattern matrices of the form

Θ = [ θ(1)  θ(2)  0  ⋯  0 ],

with θ(1) and θ(2) subject to

1ₙᵀ θ(1) = 1ₙᵀ θ(2) = θ(1)ᵀ θ(2) = 0.

To see this, note that Euclidean distances between rows are invariant under centering and rotation and reflection of columns. For any orthogonal matrix R, therefore, the between-row distances are the same for ΘR as for Θ. It follows that, by suitable choice of R, any Θ of affine rank 2 can be reflected and rotated into ΘR of the form above without changing the configuration.
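This reduction is constructive: for a centered rank-two pattern, right-multiplying by the V of its SVD rotates and reflects the columns into the canonical two-column form while preserving all between-row distances. A sketch (numpy assumed; the triangle coordinates follow pattern t = 1 of Table 1, the two-direction loadings are illustrative):

```python
import numpy as np

rng = np.random.default_rng(4)

# affine-rank-2 pattern: 3 cluster means (equilateral triangle) in m = 8 variables
means = np.array([[-5.0, -2.89], [0.0, 5.77], [5.0, -2.89]])
Theta = np.repeat(means, 10, axis=0) @ rng.normal(size=(2, 8))  # 30 x 8
Theta -= Theta.mean(axis=0)                  # center -> rank 2

U, s, Vt = np.linalg.svd(Theta, full_matrices=False)
ThetaR = Theta @ Vt.T                        # orthogonal R = V: rotate/reflect columns

# columns beyond the second vanish; the first two are centered and orthogonal
print(np.abs(ThetaR[:, 2:]).max())
print(ThetaR[:, 0] @ ThetaR[:, 1], ThetaR[:, 0].sum(), ThetaR[:, 1].sum())
```

Since V is orthogonal, every between-row distance of ThetaR equals the corresponding distance of Theta.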

The average magnitude of the elements of such an expectation matrix is

||Θ|| / √(2n) = √( (1/(2n)) Σ_i (θ_{i(1)}² + θ_{i(2)}²) ),

where θ_{i(1)} and θ_{i(2)} are the ith elements of θ(1) and θ(2), respectively. Also, the ratio of the error standard deviation to this expectation magnitude, referred to as the relative error, is

κ = σ / (||Θ|| / √(2n)).

4. The simulations

The simulations were carried out by generating 100 uncentered matrices X = Θ + E for each of 378 combinations of n, m, t and σ, as listed in Table 1, and then centering them for approximation and comparison. The number of rows n was 30, 60 or 90 or thereabouts, and the number of columns m varied from 3 to n.

All Θ consisted of c equal-sized clusters of rows with constant coordinates (θ_{i(1)}, θ_{i(2)}) = (θ_{c(1)}, θ_{c(2)}) for all rows i in cluster c = 1, 2, 3 or 1, …, 4 or 1, …, 5. The clusters were arranged in one of three patterns. The first pattern (t = 1) was of three clusters at the vertices of an equilateral triangle; the second pattern (t = 2) had four clusters at the vertices of a square; and the third pattern (t = 3) was of five clusters at equal intervals along a semicircle. All three patterns were planar, so Θ


Table 1
Parameters of the simulations^a

Three patterns of planar configurations:

  Three equidistant clusters (t = 1):
    θ(1) = [ −5.0·1_{10k}, 0·1_{10k}, 5.0·1_{10k} ]
    θ(2) = [ −2.89·1_{10k}, 5.77·1_{10k}, −2.89·1_{10k} ]
    ||Θ||/√(2n) = 4.08

  Four clusters at the corners of a square (t = 2):
    θ(1) = [ 3.95·1_{8k}, −3.95·1_{8k}, 3.95·1_{8k}, −3.95·1_{8k} ]
    θ(2) = [ 3.95·1_{8k}, 3.95·1_{8k}, −3.95·1_{8k}, −3.95·1_{8k} ]
    ||Θ||/√(2n) = 3.95

  Five clusters at equal intervals on a semicircle (t = 3):
    θ(1) = [ 6.59·1_{6k}, 4.66·1_{6k}, 0.00·1_{6k}, −4.66·1_{6k}, −6.59·1_{6k} ]
    θ(2) = [ −3.18·1_{6k}, 1.48·1_{6k}, 3.40·1_{6k}, 1.48·1_{6k}, −3.18·1_{6k} ]
    ||Θ||/√(2n) = 4.08

Three sample sizes each:
  n = 30k, k = 1, 2, 3 (t = 1);  n = 32k, k = 1, 2, 3 (t = 2);  n = 30k, k = 1, 2, 3 (t = 3)

Seven numbers of variables for each of the above:
  for n = 30 or 32:  m = 3, 4, 5, 10, 15, 20, 30
  for n = 60 or 64:  m = 3, 5, 10, 20, 30, 40, 60
  for n = 90 or 96:  m = 3, 5, 10, 30, 45, 60, 90

Six error standard deviations:
  σ = 0.25, 0.50, 0.75, 1.00, 2.00, 3.00, which correspond to relative errors
  κ = σ / (||Θ||/√(2n)) ≈ 0.06, 0.12, 0.18, 0.24, 0.49, 0.73

^a 1_{qk} stands for a column vector of qk unities.

was of affine rank 2. All patterns were centered and were set to have average row magnitude ||Θ||/√(2n) equal, or close to, 10/√6.

The error standard deviation σ was varied from 0.25 to 3. Relative to the average magnitude ||Θ||/√(2n) of the expectations, this corresponds to the relative error κ varying from about 0.06 to 0.73.

The simulations were carried out for each of 378 parameter combinations (Table 1). At each such combination {t, k, m, κ}, 100 matrices X were generated as described above, and then, from each X, the approximation X̃ was computed and the pair of coefficients rModel(X) and rSample(X) calculated. The collection of such pairs of coefficients was then analyzed, and so was the discrepancy

b(X) = rSample(X) − rModel(X).

Averages of the coefficients and their discrepancy were taken over the 100 simulations at each combination {t, k, m, κ}. These averages are the mean model fit

r̄_M^{t,k,m,κ} = AveSim^{t,k,m,κ} rModel(X),

the mean sample fit

r̄_S^{t,k,m,κ} = AveSim^{t,k,m,κ} rSample(X)


and the mean discrepancy

b̄^{t,k,m,κ} = AveSim^{t,k,m,κ} b(X) = r̄_S^{t,k,m,κ} − r̄_M^{t,k,m,κ}.

These may be seen as estimates of, respectively, the expected model fit

ρ_M^{t,k,m,κ} = E^{t,k,m,κ} rModel(X),

the expected sample fit

ρ_S^{t,k,m,κ} = E^{t,k,m,κ} rSample(X),

and the bias

β^{t,k,m,κ} = ρ_S^{t,k,m,κ} − ρ_M^{t,k,m,κ}.

In all these expressions the superscript indicates the conditions under which the averaging was done.

5. Analysis of the simulations: bias

The analysis first related the mean coefficients only to the number of variables m and relative error κ, averaging over the levels of k and t to yield mean model fit r̄_M^{m,κ} = Ave_{k,t}[ r̄_M^{t,k,m,κ} ] and mean sample fit r̄_S^{m,κ} = Ave_{k,t}[ r̄_S^{t,k,m,κ} ], since pattern and sample size were thought to have less effect. (Evidence that this was indeed so is presented below.)

Table 2 shows that both mean fit coefficients decrease with increasing relative error κ, as was to be expected. Moreover, for each κ, both mean coefficients decrease with increasing number of variables m, but r̄_S^{m,κ} decreases very much faster than r̄_M^{m,κ}, especially for the larger κ's.

The fast decrease of sample fit r̄_S^{m,κ} with m was to be expected: A larger number of variables is fitted more poorly by a rank 2 approximation than a smaller number. The slower decrease of model fit r̄_M^{m,κ} shows that the fit of the approximation to the model is not much affected by the number of variables, so long as the error is not large.

As a result of these different trends, mean sample fit r̄_S^{m,κ} considerably underestimates mean model fit r̄_M^{m,κ} for large m, especially when κ is large. For a very small number of variables m, there is a slight positive bias, especially if κ is large, so rSample(X) slightly overestimates rModel(X). When m is about 4 or 5 the bias is close to zero. For larger m, on the other hand, the bias is negative and at times quite considerable. Thus, for m of 20 or 30, the bias may reach a magnitude of −0.2 or −0.3, and for m of 60 or more, it may go as far as −0.5. (The bias is also seen to depend on κ, in that it is not as great for small κ, which is as one would have expected. But the simulations also show it to be less severe for very large κ, a puzzling result, possibly due to the restriction m ≤ n in the domain of simulations.)

These results reproduce the paradoxical phenomenon that motivated this study: In other words, unless there are quite few variables, the mean sample fit r̄_S falls short


Table 2
Average goodness of fit coefficients r̄_S^{m,κ} and r̄_M^{m,κ} by number of variables m and relative error κ^a

                      m
κ     Coef.    3      4      5      10     15     20     30     40     45     60     90

0.06  r̄_M     0.998  0.998  0.998  0.998  0.998  0.998  0.998  0.998  0.998  0.998  0.998
      r̄_S     0.999  0.998  0.997  0.993  0.989  0.984  0.976  0.967  0.963  0.957  0.928
0.12  r̄_M     0.993  0.993  0.993  0.993  0.993  0.993  0.993  0.993  0.992  0.993  0.992
      r̄_S     0.997  0.993  0.990  0.973  0.959  0.943  0.914  0.887  0.873  0.841  0.783
0.18  r̄_M     0.984  0.984  0.984  0.984  0.984  0.984  0.983  0.983  0.983  0.983  0.983
      r̄_S     0.992  0.985  0.979  0.943  0.916  0.886  0.836  0.794  0.771  0.725  0.649
0.24  r̄_M     0.972  0.972  0.972  0.971  0.971  0.971  0.970  0.970  0.970  0.970  0.969
      r̄_S     0.987  0.975  0.962  0.907  0.867  0.826  0.758  0.706  0.679  0.629  0.548
0.49  r̄_M     0.899  0.898  0.897  0.894  0.887  0.885  0.879  0.875  0.881  0.870  0.862
      r̄_S     0.957  0.921  0.886  0.765  0.701  0.637  0.554  0.495  0.464  0.425  0.360
0.73  r̄_M     0.806  0.797  0.798  0.781  0.734  0.733  0.721  0.699  0.724  0.679  0.654
      r̄_S     0.928  0.877  0.824  0.674  0.617  0.548  0.472  0.420  0.386  0.360  0.306

^a Each entry is an average of the coefficients calculated for all the simulations at that combination of κ and m. Thus, for m = 3 and κ = 0.06 it is an average of 900 simulations, 100 at each combination of k = 1, 2, 3 and the three configuration types; for m = 90 and κ = 0.06, it is an average of 300 simulations, 100 at k = 3 and each of the three configuration types.

Fig. 3. Bias b̄^{m,κ}, contours by number of variables m and relative error κ.

of the mean model fit r̄_M, and often falls considerably short of it. It therefore appears that our conjecture and the model that incorporated it may explain the phenomenon. In other words, for large numbers of variables, sample errors swamp the expectations and reduce the usual measure of goodness of fit, even though the display is close to the expectations' pattern.


Fig. 4. Bias b̄^{m,t,κ}, contours by number of variables m and relative error κ, drawn separately for each pattern t, as distinguished by the types of the lines.

Fig. 5. Bias b̄^{k,m,κ}, contours by number of variables m and relative error κ, drawn separately for each sample size group k, as distinguished by the types of the lines.

The effects of patterns t and of sample sizes n (= 30k or 32k) can be seen from the separate displays of bias contours for different t and k, Figs. 4 and 5, respectively. The close similarity of the contours drawn with different types of line – representing different t and k, respectively – shows there was little difference between the bias contours for different patterns and for different sample sizes. It is not surprising that neither


of these factors has much effect on rSample(X), which is a comparison within the sample X. That these factors also have little effect on rModel(X), especially for the higher values, was revealed by detailed checks of the coefficients (not reproduced here). What little differences there are consist of a slightly stronger negative bias for larger sample sizes, and for triangular and square cluster patterns as compared to semicircles. But these differences are negligible compared to those related to κ and especially to m. This finding justifies the averaging over t and k that was presented in Tables 1 and 2 and Fig. 3.

Some check of these simulation results could be obtained by large sample approximations, as follows. An idea of ρ_M^{t,k,m,κ} and ρ_S^{t,k,m,κ} may be obtained by using the following approximations based on equating sums of squares with their expectations:

||X||² ≈ ||Θ||² + nmσ²,

||X̃||² ≈ ||Θ||² + 2nσ²,

as well as

trace{(Θ′X̃X̃′Θ)^{1/2}} ≈ ||Θ||²

and

trace{(X′X̃X̃′X)^{1/2}} ≈ ||Θ||² + 2nσ²,

which are obtained by using the lower bound from the lemma of Sibson (1978). From the above it would follow, and is heuristically verified, that

ρ_M^{t,k,m,κ} ≈ √( ||Θ||² / (||Θ||² + 2nσ²) ) = √( 1 / (1 + κ²) )

and

ρ_S^{t,k,m,κ} ≈ √( (||Θ||² + 2nσ²) / (||Θ||² + mnσ²) ) = √( (1 + κ²) / (1 + (m/2)κ²) ),

so that the bias can be approximated by

β^{t,k,m,κ} ≈ √( (1 + κ²) / (1 + (m/2)κ²) ) − √( 1 / (1 + κ²) ).

This is seen to go down as m increases. It is zero, i.e., lack of bias, at about m = 2κ² + 4, positive for smaller m and negative for larger m. For relative errors of κ = 0.06, 0.12, 0.18, 0.24, 0.49, and 0.72 the lack of bias would occur at m = 4.01, 4.03, 4.07, 4.12, 4.48, and 5.08, respectively, that is, between m = 4 for small κ and m = 5 for large κ. This is exactly what was noted above from the simulations plotted in Fig. 3.
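A quick numerical check of this bias approximation and its zero-bias point (a sketch, not the paper's code):

```python
import numpy as np

def bias_approx(m, kappa):
    """Large-sample bias approximation:
    beta ~ sqrt((1 + k^2)/(1 + (m/2) k^2)) - sqrt(1/(1 + k^2))."""
    k2 = kappa**2
    return np.sqrt((1 + k2) / (1 + m / 2 * k2)) - np.sqrt(1 / (1 + k2))

for kappa in (0.06, 0.12, 0.24, 0.49, 0.73):
    m0 = 4 + 2 * kappa**2                    # zero-bias point m = 2 kappa^2 + 4
    print(kappa, round(m0, 2),
          round(bias_approx(3, kappa), 4),   # positive below m0
          round(bias_approx(30, kappa), 4))  # negative above m0
```

Substituting m = 2κ² + 4 makes the two square roots equal, so the approximate bias vanishes there exactly.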

More detailed comparisons of the simulations with these approximations are shown in Fig. 6, which gives contours of bias obtained both from simulations (solid lines reproducing Fig. 3) and from the preceding approximate formulas (dotted lines). The two sets of contours generally agree very well, though simulations suggest slightly less bias than the approximations do. A noticeable difference is, however, that the


Fig. 6. Bias b̄^{m,κ}, contours by number of variables m and relative error κ, as computed from simulations and from large sample approximations, as distinguished by the types of the lines.

approximate formulas do not reproduce the above puzzling simulation result of an apparent reduction of bias for the highest relative errors. So perhaps that was a random deviation after all.

6. Analysis of the simulations: variability

The findings discussed so far are about means over repeated samples: Biases are means of discrepancies. The practical relevance of these findings to any particular sample display depends on how representative these biases are of the discrepancies in individual data sets. Data analysts who have calculated rSample(X) for their displays need to decide what that says about rModel(X) of these particular sample displays, but that depends not only on the mean discrepancies but also on how much the individual samples' discrepancies b(X) = rSample(X) − rModel(X) vary about the bias.

Since the bias is estimated as the mean discrepancy, the corresponding standard deviations of discrepancies (Table 3) are estimated as

s^{m,κ} = √( Ave_{k,t} [ (s^{t,k,m,κ})² ] ),

where

s^{t,k,m,κ} = √( AveSim^{t,k,m,κ} [ rSample(X) − rModel(X) − b̄^{t,k,m,κ} ]² ).

The standard deviations of the discrepancies turn out to depend on the relative error κ as well as on the number of variables m. For κ < 0.25, the standard deviations are less than 0.01, and in most situations well below that. Hence use of the bias


Table 3
Standard deviations of discrepancies s^{m,κ}, by number of variables m and relative error κ^a

                m
κ       3       4       5       10      15      20      30      40      45      60      90

0.06  0.0003  0.0006  0.0005  0.0007  0.0010  0.0010  0.0011  0.0012  0.0010  0.0012  0.0012
0.12  0.0013  0.0020  0.0017  0.0023  0.0035  0.0034  0.0035  0.0038  0.0032  0.0037  0.0036
0.18  0.0028  0.0042  0.0035  0.0046  0.0066  0.0067  0.0069  0.0069  0.0054  0.0064  0.0061
0.24  0.0049  0.0069  0.0063  0.0079  0.0120  0.0100  0.0103  0.0097  0.0082  0.0086  0.0073
0.49  0.0168  0.0225  0.0188  0.0207  0.0286  0.0242  0.0238  0.0247  0.0150  0.0199  0.0192
0.73  0.0315  0.0443  0.0337  0.0406  0.0761  0.0629  0.0590  0.0634  0.0444  0.0659  0.0715

^a See Table 2.

to gauge the discrepancy is pretty reliable, that is, within 0.02 or so. For larger relative errors, the standard deviations become too large, and the bias is not a reliable indicator of the discrepancy in any particular application. One needs at least a rough idea of the magnitude of the relative error to say how reliably one can gauge an individual discrepancy from the bias.

7. A closer approximation to model fit

The use of rSample(X) as an approximation of rModel(X) has been shown to be biased. An improved approximation may be obtained by means of the large-sample approximations described above. If ε is eliminated in the approximations for ρM{t, k, m, ε} and ρS{t, k, m, ε}, one obtains the following relation:

   ρM{t, k, m, ε} ≈ √[(m/2 − [ρS{t, k, m, ε}]^−2)/(m/2 − 1)].

This suggests data analysts might convert their observed rSample(X)'s into approximations

   r̃Sample(X) = √[(m/2 − [rSample(X)]^−2)/(m/2 − 1)]

of rModel(X). This approximation formula is free of ε and thus does not require guesses of the error σ and the average expectation magnitude ||μ||/√(2n). (Note that r²Sample(X) ≥ 2/m by the Eckart–Householder–Young theorem, so that the above formula is always well defined.)

The bias of the approximation formula r̃Sample(X) is shown in Table 4 and its root mean square, along with that of the original formula rSample(X), is presented in Table 5, where

   z{m, ε} = √(Ave over k, t of z²{t, k, m, ε})

and

   z̃{m, ε} = √(Ave over k, t of z̃²{t, k, m, ε});


Table 4
Average discrepancy, or bias, of approximation r̃Sample(X), by number of variables m and relative error ε^a

ε       m = 3    4        5        10       15       20       30       40       45       60       90
0.06    0.000    0.000    0.000    0.000    0.000    0.000    0.000    0.000    0.000    0.000    0.000
0.12    0.000    0.000    0.000    0.000    0.000    0.000    0.000    0.000    0.000    0.000    0.000
0.18    0.000    −0.001   0.000    0.000    −0.001   −0.001   −0.001   −0.001   −0.001   −0.001   −0.001
0.24    −0.001   −0.001   −0.001   −0.001   −0.003   −0.002   −0.003   −0.003   −0.002   −0.004   −0.004
0.49    −0.005   −0.007   −0.007   −0.012   −0.030   −0.029   −0.037   −0.041   −0.030   −0.048   −0.058
0.73    −0.014   −0.037   −0.028   −0.053   −0.130   −0.126   −0.143   −0.169   −0.133   −0.196   −0.229

^a See Table 2.

Table 5
Root-mean-square (rms) discrepancy z{m, ε} of sample fit rSample(X), and z̃{m, ε} of approximation r̃Sample(X), by number of variables m and relative error ε^a

ε      rms        m = 3   4       5       10      15      20      30      40      45      60      90
0.06   z{m, ε}    0.001   0.001   0.001   0.005   0.009   0.014   0.022   0.031   0.035   0.047   0.070
       z̃{m, ε}   0.000   0.001   0.000   0.000   0.000   0.000   0.000   0.000   0.000   0.000   0.000
0.12   z{m, ε}    0.004   0.002   0.004   0.020   0.034   0.050   0.079   0.105   0.119   0.152   0.210
       z̃{m, ε}   0.002   0.002   0.001   0.001   0.002   0.001   0.001   0.001   0.001   0.001   0.001
0.18   z{m, ε}    0.009   0.004   0.008   0.041   0.068   0.098   0.148   0.190   0.212   0.258   0.333
       z̃{m, ε}   0.004   0.004   0.003   0.003   0.003   0.003   0.003   0.002   0.002   0.003   0.002
0.24   z{m, ε}    0.016   0.007   0.012   0.065   0.105   0.146   0.213   0.264   0.291   0.341   0.421
       z̃{m, ε}   0.007   0.007   0.006   0.005   0.006   0.005   0.006   0.005   0.004   0.005   0.006
0.49   z{m, ε}    0.061   0.032   0.022   0.131   0.188   0.250   0.326   0.380   0.418   0.446   0.503
       z̃{m, ε}   0.026   0.028   0.021   0.022   0.039   0.037   0.047   0.048   0.033   0.054   0.064
0.73   z{m, ε}    0.126   0.091   0.044   0.118   0.141   0.200   0.263   0.287   0.341   0.329   0.356
       z̃{m, ε}   0.051   0.063   0.049   0.072   0.152   0.147   0.164   0.182   0.142   0.211   0.240

^a Each entry is the root-mean-square of the discrepancies calculated for all the simulations at that combination of ε and m.

where

   z{t, k, m, ε} = √(AveSim[rSample(X) − rModel(X)]²)

and

   z̃{t, k, m, ε} = √(AveSim[r̃Sample(X) − rModel(X)]²),

AveSim denoting the average over all simulations at the combination {t, k, m, ε}.

It is evident from Table 4 that the approximation formula r̃Sample(X) has minimal bias in estimating rModel(X), except when ε = σ√(2n)/||μ|| > 1/4 and the number m of variables is large. Furthermore, Table 5 shows that the approximation r̃Sample(X) reduces the average error of estimating rModel(X) appreciably as compared to the original rSample(X). For relative error ε < 1/4 the mean error of using r̃Sample(X) to approximate rModel(X) is generally well below one half of 1%, which is more than satisfactory for practical use. For relative error ε of about 1/2 it is between 2% and 6%, and thus generally quite acceptable. But for ε of around 3/4 the average error may rise above 20% when there are many variables, which makes the estimation of rModel(X) very dubious indeed.
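As a practical illustration, the conversion from an observed rSample(X) to r̃Sample(X) is a one-line computation. The sketch below (Python; not part of the original paper) applies the formula of Section 7 to a hypothetical observed fit:

```python
import math

def adjusted_fit(r_sample: float, m: int) -> float:
    """Convert an observed rank-2 data fit r_Sample(X) into the
    approximation r~_Sample(X) of the model fit r_Model(X)."""
    # r_sample**2 >= 2/m by the Eckart-Householder-Young theorem,
    # so the expression under the square root is non-negative
    return math.sqrt((m / 2 - r_sample ** -2) / (m / 2 - 1))

# e.g. with m = 30 variables, an observed data fit of 0.7665
# converts to an estimated model fit of about 0.975
print(round(adjusted_fit(0.7665, 30), 3))  # → 0.975
```

The values agree with the σ = 1, m = 30 column of Table 7, where the average rSample(X) is 0.7665 and the average r̃Sample(X) is 0.9745.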

An approximation is useful only if it can be calculated entirely from the data, like r̃Sample(X), and does not involve the unknown parameters. That does not mean, however, that the properties of such an approximation, like its bias, may not depend on those parameters, or that having some idea of the magnitude of ε would not be important for judging whether this approximation is acceptable: for reasonably small ε it is, for larger ε it is not. The point is that the improved prediction of the model fit by means of r̃Sample(X) can be used when one knows merely that ε is small.

Further attempts to improve the approximation by regressing rModel(X) onto rSample(X) or r̃Sample(X) have not yielded worthwhile results, nor have attempts to adjust for sample size. These are therefore not reported.

8. Additional results and conclusions

The problem in data analysis this paper has addressed is that when 2-D displays of multivariate data are used to infer patterns in the expectations of the observations, the quality of the inference is commonly judged by the goodness of fit of the display to the data. This measure is used because it is not possible to compute the goodness of fit to the expectations, but it is often misleading since the fit to the data can be much worse than the fit to the expectations, especially if the number of variables is large. This paper gives an idea of the resulting discrepancies, both in terms of their average, i.e., the bias, and the variation of individual data samples about that bias.

8.1. Effects of relative error and numbers of variables and observations

(i) The bias in using the data fit to estimate the expectations fit depends most strongly on the relative error, that is, on the magnitude of the error of the observations relative to that of the expectations.

(ii) The bias generally increases sharply with the number of variables. That is because the data fit goes down but, unless the relative error is very large, the fit of a display to the configuration of expectations is not much affected by the addition of further variables which are not correlated with the expectations. In other words, little is gained by attempting to sharpen an analysis through the exclusion of apparently redundant variables. It therefore is not bad practice to include many variables in an exploratory study, even though some of them are poorly correlated with the pattern (provided, of course, they do not have a secondary pattern of their own).

(iii) Sample size appears to have little effect on the goodness of fit to expectations, but there is a slightly stronger negative bias for larger sample sizes.
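Findings (i)–(ii) are easy to reproduce in a small simulation. The sketch below (Python with NumPy; an illustration in the spirit of the paper's fixed model X = μ + E, with all names our own) holds a rank-2 expectation matrix fixed and adds further pure-noise variables, showing the data fit fall with m while the fitted plane remains close to μ:

```python
import numpy as np

def rank2_fit(A):
    # goodness of fit of the least-squares rank-2 approximation: square root
    # of the share of the total sum of squares captured by the two largest
    # singular values (Eckart-Young)
    s = np.linalg.svd(A, compute_uv=False)
    return np.sqrt((s[0] ** 2 + s[1] ** 2) / (s ** 2).sum())

rng = np.random.default_rng(0)
n, sigma = 30, 1.0
# rank-2 expectations: two orthogonal zero-sum columns with sums of squares 500
z = rng.standard_normal((n, 2))
q, _ = np.linalg.qr(z - z.mean(axis=0))
base = q * np.sqrt(500.0)

fits = {}
for m in (5, 30, 90):
    mu = np.zeros((n, m))
    mu[:, :2] = base                  # the other m - 2 columns carry no signal
    X = mu + sigma * rng.standard_normal((n, m))
    # crude proxy for the fit to expectations: share of ||mu|| captured by
    # the plane spanned by the top-2 right singular vectors of X
    V = np.linalg.svd(X)[2][:2].T
    fits[m] = (rank2_fit(X), np.linalg.norm(mu @ V @ V.T) / np.linalg.norm(mu))
    print(m, round(fits[m][0], 2), round(fits[m][1], 2))
```

With these magnitudes (||μ||² = 1000, σ = 1, hence ε ≈ 0.24) the printed data fit drops sharply as m grows, while the proxy for the fit to the expectations stays near 1, mirroring the rSample(X) versus rModel(X) rows of Table 7.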


8.2. Fit of expectations versus fit of configurations

In considering the display of the configuration of individuals, i.e., of inter-individual distances, the simulations have shown considerable bias, that is, average discrepancy, of the data fit as a measure of the fit of configurations of expectations. On the other hand, the variation in discrepancies from one sample to another was not found to be quite so large.

The paper presents an improved approximation of the fit of configurations of expectations by means of a transformation of the data fit coefficient. That reduces the bias very much and results in a much smaller average discrepancy. Indeed, the data analyst can use this approximation with considerable confidence of correctly estimating the fit of configurations of expectations unless the data contain random error of a magnitude of more than one-half the average magnitude of the expectations and the number of variables is quite large. In practice, in any one application, the quality of this approximation formula will only be known if one has some idea of the relative size of random errors and expectations. Hopefully, data analysts will be sufficiently familiar with their kinds of data to be able to judge that relative magnitude.

The alternative measure of goodness of fit rExpect(X) may be thought appropriate if the display is considered as a representation of the expectations themselves, rather than of their configurations, i.e., the distances between the rows. However, for centered X,

   rModel(X) = sup over scalar λ, max over orthonormal R, of rExpect(λXR) ≥ rExpect(X)

(see Heo, 1996, p. 39). Table 6 shows that for small σ and m, rExpect(X) is barely smaller than rModel(X), but for larger σ and m, rExpect(X) becomes much smaller than rModel(X); for large σ the coefficient rExpect(X) indeed goes down to zero. A remarkable finding is (iv) that only in cases of excellent fit can the expectations themselves be approximated well without rotating, reflecting and dilating the fit. In most practical cases the fit without rotation, reflection and dilation is very poor.
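The role of rotation, reflection and dilation can be made concrete with a small Procrustes computation. The sketch below (Python with NumPy; our own illustration, not the paper's code) measures the direct fit of a display A to expectations μ, and then the fit after the best orthonormal R and scalar dilation, via the orthogonal Procrustes solution:

```python
import numpy as np

def r_direct(A, mu):
    # direct fit of A to mu: sqrt(1 - ||mu - A||^2 / ||mu||^2), floored at 0
    val = 1.0 - ((mu - A) ** 2).sum() / (mu ** 2).sum()
    return np.sqrt(max(val, 0.0))

def r_procrustes(A, mu):
    # best fit after replacing A by c * A @ R, with R orthonormal and c a
    # scalar dilation, chosen to minimise ||mu - c * A @ R||
    u, s, vt = np.linalg.svd(A.T @ mu)
    R = u @ vt                        # orthogonal Procrustes rotation
    c = s.sum() / (A ** 2).sum()      # optimal dilation
    return r_direct(c * (A @ R), mu)

rng = np.random.default_rng(1)
mu = rng.standard_normal((30, 2))
angle = 0.5                           # a display rotated and dilated away from mu
R0 = np.array([[np.cos(angle), -np.sin(angle)],
               [np.sin(angle),  np.cos(angle)]])
A = 1.3 * mu @ R0
print(round(r_direct(A, mu), 3), round(r_procrustes(A, mu), 3))
```

Here A fits μ perfectly once rotated back and rescaled (the second coefficient is 1.0), while the direct coefficient is noticeably smaller — the situation finding (iv) describes.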

8.3. Effect of clustering

The simulations described in Tables 2–5 were carried out on three types of cluster configurations of the expectations μ, but the results were noted to be almost the same for all three configurations. Subsequent simulations were carried out on random configurations of the same magnitude, i.e., on μ's which were randomly sampled from a bivariate normal and adjusted to have zero sums 1ᵀμ = 0 and, for n = 30k, sums of squares and products μᵀμ = 500k·I₂, so that ||μ||² ≈ 1000k. (This compares to the cluster configurations' magnitudes of roughly

   ||μ||²/(2n) = (1/(2n)) Σᵢ (μ²ᵢ₍₁₎ + μ²ᵢ₍₂₎) = 16

per average entry in the non-zero columns of the expectation matrix, that is, roughly ||μ||² = 1000k for n = 30k.)
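A configuration of this kind can be generated in a few lines. The sketch below (Python with NumPy; our own illustration) draws the columns of μ from a standard normal, centres them, and orthogonalises them so that 1ᵀμ = 0 and μᵀμ = 500k·I₂:

```python
import numpy as np

rng = np.random.default_rng(2)
n, k = 30, 1                       # n = 30k observations
z = rng.standard_normal((n, 2))    # bivariate normal draws
z -= z.mean(axis=0)                # zero column sums: 1'mu = 0
q, _ = np.linalg.qr(z)             # orthonormal columns, still zero-sum
mu = q * np.sqrt(500.0 * k)        # mu'mu = 500k * I_2, hence ||mu||^2 = 1000k
```

Centering each column makes it orthogonal to the vector of ones, and QR orthonormalisation stays within that column space, so the zero-sum property survives the rescaling.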

Table 7 shows averages of 1000 simulated coefficients rModel(X), rSample(X), r̃Sample(X), and rExpect(X) from three-cluster (t = 1), four-cluster (t = 2), five-cluster (t = 3) and random (t = 4) configurations of expectations, for several choices of σ and m = 10, 30 and n = 30. It is evident that the average coefficients are virtually the same for all four configurations. The findings have been the same for other σ, m and n. The conclusions from the earlier simulations on cluster configurations (Tables 2–5 and Figs. 3–6) can therefore be taken to apply equally to the expectations of any configuration.

Table 6
Average goodness-of-fit coefficients for 30-by-m data matrices simulated under models of clusters or randomness, by error variability σ and number of variables m^a. All models have expected sums of squares and products μᵀμ = 500 I₂.

                             m = 10                            m = 30
σ (ε)         Average of     t = 1   t = 2   t = 3   t = 4     t = 1   t = 2   t = 3   t = 4
0.25 (0.06)   rModel(X)      0.9982  0.9980  0.9982  0.9982    0.9982  0.9981  0.9982  0.9982
              rSample(X)     0.9931  0.9927  0.9931  0.9931    0.9766  0.9749  0.9765  0.9765
              r̃Sample(X)    0.9983  0.9981  0.9983  0.9983    0.9983  0.9981  0.9983  0.9983
              rExpect(X)     0.9976  0.9974  0.9976  0.9976    0.9964  0.9963  0.9964  0.9964
1.00 (0.24)   rModel(X)      0.9721  0.9702  0.9720  0.9722    0.9709  0.9693  0.9706  0.9706
              rSample(X)     0.9101  0.9052  0.9109  0.9098    0.7667  0.7570  0.7683  0.7665
              r̃Sample(X)    0.9737  0.9720  0.9740  0.9736    0.9746  0.9730  0.9749  0.9745
              rExpect(X)     0.9601  0.9586  0.9595  0.9599    0.9357  0.9332  0.9354  0.9354
2.00 (0.49)   rModel(X)      0.8944  0.8887  0.8970  0.8934    0.8755  0.8705  0.8782  0.8767
              rSample(X)     0.7724  0.7652  0.7755  0.7746    0.5720  0.5630  0.5754  0.5723
              r̃Sample(X)    0.9109  0.9065  0.9128  0.9123    0.9232  0.9194  0.9246  0.9234
              rExpect(X)     0.8056  0.7929  0.8082  0.8073    0.6117  0.5856  0.6101  0.6110
3.00 (0.73)   rModel(X)      0.7692  0.7600  0.7727  0.7635    0.6707  0.6685  0.6843  0.6677
              rSample(X)     0.6871  0.6817  0.6891  0.6860    0.5031  0.4970  0.5054  0.5019
              r̃Sample(X)    0.8471  0.8421  0.8489  0.8462    0.8878  0.8839  0.8892  0.8871
              rExpect(X)     0.2676  0.2480  0.2871  0.2542    0.0     0.0     0.0     0.0

^a For each σ, m and t, 1000 simulated 30-by-m data matrices X = μ + E were generated, where E consisted of random normal (0, σ) entries — for t = 1, 2, 3 the μ was computed from the formulas of Table 1 with each column's sums of squares adjusted to equal 500; for t = 4, the columns of μ were generated from independent normals, then centered and orthogonalized to have sums of squares and products μᵀμ = 500 I₂.

8.4. Open questions

The present study has been limited to rank 2 affine approximations of expectations that have a planar configuration. Its importance for statistical practice would be enhanced if the study were extended (a) to approximations of expectations with non-planar configurations, such as single-dimensional ones, or 3-D patterns (Besse et al., 1988; Ferré, 1995), as well as (b) to approximations whose rank was not fixed at 2 and, indeed, might be data-driven by some criterion such as the shape of the scree (Krzanowski, 1988, p. 67), or by resampling (Besse and de Falguerolles, 1993).

Table 7
Average coefficients of goodness of fit from simulations of 30 observations on m = 3, 5, 10, 20, 30, 60, 90 variables with σ = 0.25, 1, 2, 3 and randomly generated expectations having eigenvalues 500, 500, 0, …, 0; 1000 simulations for each m and σ.

σ (ε)         Average of     m = 3   5       10      20      30      60      90
0.25 (0.06)   rModel(X)      0.9982  0.9982  0.9982  0.9982  0.9982  0.9982  0.9982
              rSample(X)     0.9991  0.9974  0.9931  0.9847  0.9765  0.9533  0.9318
              r̃Sample(X)    0.9983  0.9983  0.9983  0.9983  0.9983  0.9983  0.9983
              rExpect(X)     0.9980  0.9979  0.9977  0.9970  0.9964  0.9944  0.9926
1.00 (0.24)   rModel(X)      0.9724  0.9724  0.9722  0.9717  0.9706  0.9693  0.9677
              rSample(X)     0.9872  0.9631  0.9098  0.8281  0.7665  0.6490  0.5799
              r̃Sample(X)    0.9735  0.9736  0.9736  0.9742  0.9745  0.9760  0.9773
              rExpect(X)     0.9680  0.9675  0.9601  0.9476  0.9354  0.8976  0.8575
2.00 (0.49)   rModel(X)      0.8976  0.8986  0.8934  0.8843  0.8767  0.8534  0.8121
              rSample(X)     0.9575  0.8895  0.7746  0.6436  0.5723  0.4718  0.4233
              r̃Sample(X)    0.9038  0.9070  0.9123  0.9175  0.9234  0.9376  0.9463
              rExpect(X)     0.8652  0.8488  0.8072  0.7164  0.6125  0.1148  0.0
3.00 (0.73)   rModel(X)      0.8086  0.8002  0.7635  0.7284  0.6677  0.5801  0.5057
              rSample(X)     0.9297  0.8318  0.6860  0.5646  0.5019  0.4258  0.3925
              r̃Sample(X)    0.8246  0.8362  0.8462  0.8722  0.8871  0.9186  0.9354
              rExpect(X)     0.6272  0.5523  0.2790  0.0078  0.0     0.0     0.0

Finally, the present study has focused on the representation of individuals rather than of random samples from a distribution. It has therefore used the fixed model in which the data X are modelled as expectations of rank 2 plus uncorrelated error (the "functional model" according to Fine, 1992). This is the model under which graphical displays can serve to explore the pattern of the expectations, and the measures studied here are relevant to such exploration. This study has not dealt with data from a random model in which X is multivariate with a covariance matrix of reduced rank (the "structural model" according to Fine, 1992). Under that model, the possible purposes of graphical display are different since the individuals are of no inherent interest, being randomly drawn. Its goodness of fit should be measured for the purposes for which these displays could be used (e.g., shape of multivariate distribution, structure of correlation), and the coefficients studied here might not be


appropriate. The conclusions of the present study do not apply to that random model but are specific to the fixed model under which the representation of individuals is meaningful.

Acknowledgements

Part of this work is the result of the first author's thesis work (Heo, 1996) under the guidance of the second author.

References

Besse, Ph., Caussinus, H., Fine, J., 1988. Principal components analysis and optimization of graphical displays. Statistics 19, 301–312.

Besse, Ph., de Falguerolles, A., 1993. Application of resampling methods to the choice of dimension in principal component analysis. In: Härdle, W., Simar, L. (Eds.), Computer Intensive Methods in Statistics. Physica Verlag, Heidelberg, pp. 166–176.

Caussinus, H., 1986. Models and uses of principal component analysis. In: DeLeeuw, J., Heiser, W., Meulman, J., Critchley, F. (Eds.), Multidimensional Data Analysis. DSWO Press, Leiden, pp. 149–170.

Caussinus, H., 1992a. Modèles fonctionnels: quelques développements et applications. In: Droesbeke, J.J., Fichet, B., Tassi, P. (Eds.), Modèles pour l'analyse des Données Multidimensionnelles. Economica, Paris, pp. 61–81.

Caussinus, H., 1992b. Projections révélatrices. In: Droesbeke, J.J., Fichet, B., Tassi, P. (Eds.), Modèles pour l'analyse des Données Multidimensionnelles. Economica, Paris, pp. 241–265.

Caussinus, H., 1993. Modèles probabilistes et analyse des données multidimensionnelles. Journal de la Société de Statistique de Paris 134, 15–32.

Caussinus, H., Ferré, L., 1992. Comparing the parameters of a model for several units by means of principal components analysis. Computational Statistics and Data Analysis 13, 269–280.

Caussinus, H., Ruiz-Gazen, A., 1993. Projection pursuit and generalized principal component analysis. In: New Directions in Statistical Data Analysis and Robustness. Birkhäuser Verlag, Basel, pp. 35–46.

Caussinus, H., Ruiz-Gazen, A., 1995. Metrics for finding typical structures by means of principal component analysis. In: Data Analysis and its Applications. Harcourt Brace, pp. 177–192.

Eckart, C., Young, G., 1936. The approximation of one matrix by another of lower rank. Psychometrika 1, 211–218.

Ferré, L., 1995. Improvement of some multidimensional estimates by reduction of dimensionality. J. Multivariate Anal. 54, 147–162.

Fine, J., 1992. Modèles fonctionnels et structurels. In: Droesbeke, J.J., Fichet, B., Tassi, P. (Eds.), Modèles pour l'analyse des Données Multidimensionnelles. Economica, Paris, pp. 21–60.

Gower, J.C., 1971. Statistical methods of comparing different multivariate analyses of the same data. In: Hodson, F.R., Kendall, D.G., Tautu, P. (Eds.), Mathematics in the Archeological and Historical Sciences. The University Press, Edinburgh, pp. 138–149.

Heo, M., 1996. On the fit of sample graphical displays to patterns in populations. Unpublished Ph.D. Dissertation, University of Rochester, Rochester, NY.

Householder, A.S., Young, G., 1938. Matrix approximation and latent roots. Amer. Math. Monthly 45, 165–171.

Krzanowski, W.J., 1988. Principles of Multivariate Analysis. Clarendon Press, Oxford.

Lingoes, J.C., Schönemann, P.H., 1974. Alternative measures of fit for the Schönemann–Carroll matrix fitting algorithm. Psychometrika 39, 423–427.

Strauss, J.S., Gabriel, K.R., Kokes, R.F., Ritzler, B.A., VanOrd, A., Tarana, E., 1979. Do psychiatric patients fit their diagnosis? Patterns of symptomatology as described with the biplot. J. Nervous Mental Dis. 167, 105–113.

Sibson, R., 1978. Studies in the robustness of multidimensional scaling: Procrustes statistics. J. Roy. Statist. Soc. B 41, 217–229.

Young, G., 1940. Maximum likelihood estimation and factor analysis. Psychometrika 6, 49–53.