
Self-Consistency and Principal Component Analysis


Journal of the American Statistical Association. Publication details, including instructions for authors and subscription information: http://www.tandfonline.com/loi/uasa20

Thaddeus Tarpey, Department of Mathematics and Statistics, Wright State University, Dayton, OH 45435, USA. Published online: 17 Feb 2012.

To cite this article: Thaddeus Tarpey (1999) Self-Consistency and Principal Component Analysis, Journal of the American Statistical Association, 94:446, 456-467

To link to this article: http://dx.doi.org/10.1080/01621459.1999.10474140


Thaddeus TARPEY

I examine the self-consistency of a principal component axis; that is, when a distribution is centered about a principal component axis. A principal component axis of a random vector X is self-consistent if each point on the axis corresponds to the mean of X given that X projects orthogonally onto that point. A large class of symmetric multivariate distributions is examined in terms of self-consistency of principal component subspaces. Elliptical distributions are characterized by the preservation of self-consistency of principal component axes after arbitrary linear transformations. A "lack-of-fit" test is proposed that tests for self-consistency of a principal axis. The test is applied to two real datasets.

KEY WORDS: Bootstrap; Elliptical distribution; k-means clustering; Principal point; Principal curve; Self-consistent point; Spherical distribution; Symmetric distribution.

© 1999 American Statistical Association
Journal of the American Statistical Association, June 1999, Vol. 94, No. 446, Theory and Methods

Thaddeus Tarpey is Assistant Professor, Department of Mathematics and Statistics, Wright State University, Dayton, OH 45435 (E-mail: [email protected]). The author thanks two anonymous referees, the associate editor, and the editor for constructive comments, and also Trevor Hastie for permission to use the gold assay data as well as Rob Tibshirani for providing the gold assay data.

1. INTRODUCTION

The first principal component axis provides a simple straight-line approximation to a multivariate distribution. A natural question to ask for many applications is if the straight-line approximation provided by the first principal component axis is adequate. For instance, the first principal component axis may be a poor approximation to a distribution exhibiting nonlinear structure. A single straight line may be an insufficient summary for a population consisting of nonhomogeneous subgroups.

A natural criterion for a principal component axis to provide a good fit to a distribution is if the distribution is centered about the axis. Hastie and Stuetzle (1989) introduced the term "self-consistent" to define the property that each point on a smooth curve is the average of all points in the distribution that project orthogonally onto this point. Thus a curve is self-consistent for a distribution if the distribution is centered about the curve. In this article I address the problem of determining whether a principal component axis is self-consistent.

Traditionally, a principal component of a p-variate random vector X with mean μ is any linear combination U = α′(X − μ), where α is a normalized eigenvector of the covariance matrix of X. Pearson (1901) introduced principal components by finding the q-dimensional plane that minimizes the sum of squared distances between each observation in a dataset and the plane. The solution to the problem is given by the plane that is spanned by q eigenvectors of the sample covariance matrix and translated so that it passes through the sample mean. Thus the first principal component axis passes through the mean, and its direction is determined by the eigenvector of the sample covariance matrix associated with the largest eigenvalue. Because the average squared distance between a random vector X and an arbitrary point is uniquely minimized by the mean of X, the zero-dimensional principal component subspace is simply the mean of the distribution. A distribution is centered about its mean, because the mean is the center of gravity for a distribution. When is a distribution centered about higher-dimensional principal component subspaces?

For instance, consider the uniform distributions on a circle and on a square centered at the origin (Fig. 1). For both distributions, the covariance matrix is of the form σ²I₂, and thus any line through the origin corresponds to a principal component axis. The circle is centered about any line through the origin, but there are only four lines about which the uniform distribution on the square is centered: the horizontal, vertical, and diagonal lines through the origin (see Fig. 1). Every line passing through the origin is "self-consistent" for the uniform distribution on a circle, but there are only four "self-consistent" projections onto lines for the uniform distribution on a square.

Generally speaking, a random vector Y is self-consistent (Tarpey and Flury 1996) for X if

E[X|Y] = Y a.s. (1)

Self-consistency provides a unified framework for many statistical techniques that provide a simpler structure for representing distributions, such as principal components, principal points (Flury 1990, 1993; Tarpey, Li, and Flury 1995), and principal curves (Hastie and Stuetzle 1989).

Let X denote a mean 0 random vector. We shall call a linear subspace V self-consistent for X if almost every point x ∈ V corresponds to the mean of X given that X projects orthogonally onto the point x. If P is an orthogonal projection matrix onto a line through the origin, then PX is self-consistent for X if for almost every point x on the line, E[X|PX = x] = x. The connection between a self-consistent line (or hyperplane) and principal component subspaces is made by the following theorem.

Theorem 1 (Tarpey and Flury 1996). Let X denote a p-variate random vector and assume without loss of generality that E[X] = 0. Suppose that P is a projection matrix associated with an orthogonal projection from ℝp into a linear subspace V of dimension q < p. If PX is self-consistent for X, then V is spanned by some q eigenvectors of the covariance matrix of X.

Therefore, a self-consistent line must be a principal component axis. However, it is not always the case that a principal component axis is self-consistent. Returning to Figure 1, any line through the origin for the uniform distribution on a circle is a self-consistent principal component. Only the four dashed lines shown for the square correspond to self-consistent projections.
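The circle-and-square example can be probed numerically from the definition in (1). The following is a minimal simulation sketch (my own illustration, not part of the original article, assuming numpy is available): it projects draws from the uniform distribution on the square onto a candidate axis, slices the projection scores, and compares each slice mean of X with its projection onto the axis. For the horizontal axis the slice means lie essentially on the axis; for a 30-degree axis they do not.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(100_000, 2))   # uniform distribution on the square

def max_offset(theta, X, bins=20):
    """Largest distance between a slice mean of X and the candidate axis.

    The line through the origin at angle theta is self-consistent when
    E[X | PX = x] = x, i.e., when every slice mean lies on the line.
    """
    a = np.array([np.cos(theta), np.sin(theta)])   # unit direction of the axis
    t = X @ a                                      # projection scores
    edges = np.quantile(t, np.linspace(0, 1, bins + 1))
    idx = np.clip(np.digitize(t, edges[1:-1]), 0, bins - 1)
    offsets = []
    for j in range(bins):
        m = X[idx == j].mean(axis=0)               # conditional mean of X
        offsets.append(np.linalg.norm(m - (m @ a) * a))  # distance to the axis
    return max(offsets)

print(max_offset(0.0, X))        # horizontal axis: near 0 (self-consistent)
print(max_offset(np.pi / 6, X))  # 30-degree axis: clearly away from 0
```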


Figure 1. For the Uniform Distribution on the Circle and the Square, Any Line Through the Center Is a Principal Component Axis. The circle is centered about any line through the center, and thus every principal component axis is self-consistent. The square, however, has only four principal component axes about which it is centered: the vertical and horizontal lines, and the diagonal lines, denoted by the dashed lines.

In practical applications, a principal component axis may fail to be self-consistent for a variety of reasons, such as curvilinear relationships between variables or the existence of nonhomogeneous subgroups in the population. Our first example provides an illustration.

Example 1: Gold Assay Data. Hastie and Stuetzle (1989) examined data on the gold content of computer chip waste. Each sample was split in two and assayed by an outside laboratory and also by the company that collected and sold the gold. The company wanted to know which laboratory produced lower gold content assays on average for a given sample. Figure 2 shows a scatterplot of 249 pairs of log-transformed gold assays.

Figure 2. Gold Assay Data: Plot of the Log Assays for the Outside Lab Versus the In-House Lab.

Hastie and Stuetzle considered a generalization of the errors-in-variables model: X1i = f(τi) + e1i and X2i = τi + e2i, where τi is the expected gold content for sample i using the in-house lab, f(τi) is the expected gold content for the outside lab, and the eij are measurement errors. If the function f(τ) is linear, then the curve is estimated by the first principal component. If the first principal component is not self-consistent, then the function f(τ) is nonlinear and must be estimated. As in regression problems, it is important to know whether a straight-line model provides a good fit to the data. In Section 5 I propose a lack-of-fit test to assess self-consistency of the first principal component axis. I apply this lack-of-fit test to the gold assay data in Section 6.

If a principal component axis is not self-consistent, then one may want to estimate a nonlinear self-consistent curve. Principal components are easy to compute and provide a simple approximation to the original distribution. Estimating nonlinear principal curves, on the other hand, is a much more difficult problem. Hastie and Stuetzle (1989) adapted their principal curve algorithm to estimate a principal curve for the gold assay data using a locally weighted running lines smoother. Tibshirani (1992) also examined the gold assay data using a maximum likelihood approach and applying the EM algorithm to estimate the principal curve for the gold assay data. Interestingly, the two methods give different results with different interpretations.

The first principal component axis has a simple interpretation as the extrema of the expected minimum squared distance function in the class of straight-line approximations to a distribution. However, Duchamp and Stuetzle (1996) showed that nonlinear principal curves, although critical points of the expected minimum squared distance function, correspond to saddlepoints. If a principal component axis is self-consistent, then complications introduced by nonlinear generalizations can be avoided.

In Section 2 a large class of symmetric multivariate distributions is defined for which principal component subspaces are self-consistent. The class of elliptical distributions is characterized by self-consistency of principal axes after arbitrary linear transformations in Section 3. In Section 4 I consider the relationship between self-consistent points and self-consistent principal component subspaces, which is the basis of the lack-of-fit test introduced in Section 5.

Throughout I assume that the distributions under consideration are continuous with finite second moments. Furthermore, when I state that a principal component or a principal component axis (subspace) is self-consistent for a random vector X, I mean that the associated projection of X onto the principal component axis (subspace) is self-consistent according to the definition in (1).

2. SELF-CONSISTENCY FOR SYMMETRIC DISTRIBUTIONS

In this section a large class of symmetric multivariate distributions is defined for which the projections onto principal component subspaces are self-consistent. Recall that a multivariate random vector X is symmetric about its mean μ if

X − μ ~ −(X − μ),


where "~" means "has the same distribution as." Symmetric multivariate distributions do not necessarily have self-consistent principal components, as the next example illustrates.

Example 2. Suppose that X = (X1, X2)′ is uniformly distributed on [−a, 0] × [−1, 0] ∪ [0, a] × [0, 1], a ≥ 1. Then X is bivariate symmetric. For a = 1, the projection of X onto the 45-degree diagonal line corresponds to a self-consistent projection onto the first principal component axis. However, for a > 1, the principal components are not self-consistent.

When considering self-consistency of principal component subspaces, a natural subset of symmetric multivariate distributions to consider is the strongly symmetric distributions. A p-variate random vector Z = (Z1, ..., Zp)′ has orthant symmetry if

(Z1, ..., Zp)′ ~ (±Z1, ..., ±Zp)′ (Efron 1969).

I say that a p-variate random vector X with E[X] = μ is strongly symmetric if there exists an orthogonal p × p matrix B such that

X ~ μ + BZ,

where Z has orthant symmetry and μ ∈ ℝp (Tarpey 1995). The distribution in Example 2 is symmetric but not strongly symmetric when a > 1. The class of strongly symmetric distributions contains elliptical distributions and distributions whose principal components are independent and symmetric.

The following lemma allows me to restrict attention to orthant symmetric distributions when proving self-consistency results for strongly symmetric distributions.

Lemma 1. Suppose that X = μ + BZ, where B is an orthogonal matrix, Z is a p-variate random vector, and μ ∈ ℝp. Y is self-consistent for X if and only if W = B′(Y − μ) is self-consistent for Z.

Proof. See the Appendix.

From Lemma 1, it may be assumed, without loss of generality, that strongly symmetric distributions are orthant symmetric when proving self-consistency results. Next I state an elementary theorem that will be useful.

Theorem 2. Partition a p-variate orthant symmetric random vector X as X = (X1′, X2′)′, where X2 is q-variate, q < p. Let Y = f(X2), where f is some measurable function of X2. Then E[X1|Y] = 0 a.s.

Proof. See the Appendix.

If X = (X1, ..., Xp)′ is orthant symmetric, then it is easy to see that E[X] = 0 and the components Xj of X are principal components of X. Thus if f is taken to be the identity function in Theorem 2, then it follows immediately that any q-dimensional principal component subspace determined by q of the components Xj of X will be self-consistent. This gives the following corollary.

Corollary 1. Suppose that X has a strongly symmetric distribution. Then the projection of X into any principal component subspace is self-consistent provided that the eigenvalues of the covariance matrix are distinct. If the eigenvalues are not all distinct, in which case the principal components are not uniquely defined, then there exists an orthogonal matrix B so that the components of Z = B′X are self-consistent principal components.

The matrix B of Corollary 1 is the orthogonal matrix from the definition of a strongly symmetric distribution. The need to distinguish the case of equal eigenvalues of the covariance matrix in Corollary 1 can be simply illustrated by the example of the uniform distribution on the square discussed in Section 1. If X1 and X2 are independent and uniform on [−1, 1], then (X1, X2)′ has the uniform distribution on a square and X1 and X2 are self-consistent principal components of the bivariate distribution. However, the two eigenvalues of the covariance matrix of (X1, X2)′ are equal, and any line through the mean is also a principal component axis. As noted in Section 1, though, only four of the lines through the mean are self-consistent.

Note that if the principal components of a mean 0 p-variate random vector X are independent, then the projection of X onto a principal component axis will be self-consistent for X regardless of whether the marginal univariate distributions are symmetric. To see this, let P be a projection matrix onto a principal component subspace of dimension q < p and let Q = Ip − P. If the principal components are independent, then PX and QX are independent and E[X|PX] = E[PX + QX|PX] = PX + QE[X] = PX + 0 = PX, which shows that PX is self-consistent for X.

3. CHARACTERIZATION OF ELLIPTICAL DISTRIBUTIONS

Recall that the uniform distributions on a circle and a square can be distinguished by the self-consistency of projections onto lines through the origin. The first result in this section shows that spherical distributions are characterized by self-consistency of any projection onto a line through the origin. The result follows from the fact (Fang, Kotz, and Ng 1990, p. 97) that X is spherically symmetric if and only if for any perpendicular vectors α ≠ 0 and β ≠ 0,

E[α′X|β′X] = 0. (2)

Theorem 3. Suppose that X is a p-variate, mean 0 random vector with covariance matrix σ²I. Then X is spherically symmetric if and only if for every unit vector α ∈ ℝp, PX is self-consistent for X, where P = αα′.

Proof. See the Appendix.

Sometimes a transformation of the variables will precede a principal component analysis. For example, if variables are measured on different scales, then one can rescale the variables to unit variance before performing a principal component analysis. A natural question to ask is: Do principal components remain self-consistent after linear transformations of the variables?


The answer to this question is no in general, as can be seen by the following example.

Example 3. Consider the strongly symmetric bivariate random vector (X1, X2)′ whose density is given by

f(x1, x2) = (2/(3π)) [(1 + (√3 x1 + x2)²/96)(1 + (√3 x2 − x1)²/96)]⁻², (x1, x2)′ ∈ ℝ².

This distribution results by taking two independent t random variables on 3 degrees of freedom, multiplying the first variable by two, and rotating the distribution by 30 degrees.

The contours of equal density for this distribution in the coordinate system of the principal components are shown in Figure 3(a). The contours of equal density are symmetric about the principal axes, and hence the principal component axes are self-consistent. Rescaling the variables X1 and X2 to unit variance to get a correlation matrix yields a distribution whose contours of equal density are shown in Figure 3(b), plotted in the coordinate system of the principal components of the transformed variables. As shown in Figure 3(b), the principal components are no longer self-consistent.

Figure 3. The Contours of Equal Density in the Coordinate System of the Principal Components (a) for the Distribution of Example 3 and (b) After Transforming the Distribution to Get a Correlation Matrix. Self-consistency of the principal axes is not preserved after the transformation.

For elliptical distributions, principal components will remain self-consistent after linear transformations. In fact, as the following corollary shows, the class of elliptical distributions is characterized by self-consistency of principal components under arbitrary linear transformations.

Corollary 2. Let X denote a p-variate random vector whose covariance matrix has full rank. Then X has an elliptical distribution if and only if every principal component subspace of B′X is self-consistent, where B is an arbitrary p × p matrix of full rank.

Proof. See the Appendix.

4. PRINCIPAL COMPONENTS AND PRINCIPAL POINTS

In this section I show that a self-consistent principal component axis is approximated by a set of k "self-consistent" points that lie on the principal component axis. The approximation will be utilized in the next section to approximate the hypothesis that a principal component axis is self-consistent.

Principal points are defined as the set of k points that optimally represent a distribution in terms of mean squared error. Given a p-variate random vector X, let {y1, ..., yk} denote a set of k distinct points in ℝp. Define the domain of attraction for point yj as

Dj = {x ∈ ℝp : ||x − yj|| < ||x − yi||, i ≠ j},

and let Y = yj if X ∈ Dj. The points {y1, ..., yk} are called k self-consistent points of X if Y is self-consistent for X. The points {y1, ..., yk} are called k principal points (Flury 1990) if they give the minimum mean squared error E||X − Y||² compared to any other random vector supported by a set of k points. Principal points are self-consistent points (Flury 1993).

Self-consistency of a principal component axis or subspace can be assessed by the self-consistency of a set of k principal points of the marginal distribution in the principal component subspace. For Theorem 4, define P as a projection matrix onto a principal component subspace of a mean 0 random vector X. Let Yk denote the projection of X onto k principal points for the marginal distribution of X in this principal component subspace.

Theorem 4. Let X denote a p-variate, mean 0 random vector with a bounded and continuous density function. Then the projection PX onto a principal component subspace is self-consistent for X if and only if Yk is self-consistent for each k = 2, 3, ....

Proof. See the Appendix.

Thus the projection of a random vector onto a self-consistent principal component axis can be viewed as the limiting distribution of k collinear self-consistent points as k tends to infinity. As noted in the proof of Theorem 4, Yk converges almost surely to PX as k tends to infinity. The asymptotic behavior of principal points as k tends to infinity has been studied for univariate distributions (see, e.g., Cambanis and Gerr 1983 or Potzelberger and Felsenstein 1994 and, for bivariate distributions, Su 1997). It is interesting to note that a distribution cannot have an arbitrarily large number of self-consistent points that lie on a straight line that is not a principal component axis, because as k goes to infinity, a self-consistent line results. From Theorem 1, if a line is self-consistent, then it must be a principal component axis.
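Theorem 4 is easy to probe numerically. The sketch below is my own illustration (the sample size, the choice k = 4, and the Lloyd-style k-means with quantile starting values are arbitrary): it draws from a bivariate normal distribution, whose first principal component axis is self-consistent, estimates k principal points of the first-component marginal, and checks that the mean of X over each domain of attraction lies close to the corresponding point (cj, 0)′ on the axis.

```python
import numpy as np

rng = np.random.default_rng(2)
# bivariate normal with diagonal covariance: the x-axis is the first PC axis
X = rng.normal(size=(100_000, 2)) * np.array([2.0, 1.0])

# k principal points of the first-PC marginal via 1-d k-means (Lloyd)
k = 4
t = X[:, 0]
c = np.quantile(t, (np.arange(k) + 0.5) / k)       # quantile starting values
for _ in range(200):
    idx = np.abs(t[:, None] - c[None, :]).argmin(axis=1)
    c = np.array([t[idx == j].mean() for j in range(k)])

# Y_k equals (c_j, 0)' on the domain of attraction D_j; self-consistency
# E[X | Y_k] = Y_k means each slice mean of X should be close to (c_j, 0)'
for j in range(k):
    print(np.round(X[idx == j].mean(axis=0), 3), "vs", np.round([c[j], 0.0], 3))
```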


5. SELF-CONSISTENCY FOR SAMPLE DATA

Until now I have discussed self-consistency for probability distributions. I now turn to the problem of determining self-consistency of principal axes based on sample data. If the goal of a principal component analysis is to summarize a multivariate data set by a straight line using the first principal component axis, then it will often be of interest to the investigator if the straight line provides a good fit, as in the gold assay example of Section 1. Similarly, in regression analysis, one needs to determine whether a straight-line model is sufficient or a more complex model is needed. Lack of fit of the first principal component axis may indicate that a nonlinear generalization such as principal curves (Hastie and Stuetzle 1989) should be utilized (see Example 5). If the data were collected from a nonhomogeneous population (such as different species of birds; see Example 6), then lack of fit of a principal component axis may indicate that a single principal component axis does not suffice for all of the subgroups. In classic and logistic regression analysis, lack-of-fit tests are available to assess the fit of the model (see, e.g., Hosmer and Lemeshow 1989 and Montgomery and Peck 1992). In this section I propose a lack-of-fit test to assess the self-consistency of a principal component axis.

The hypothesis of self-consistency of the projection of X onto the first principal component axis can be stated as

H: PX is self-consistent for X,

where P = α1α1′ and α1 is the eigenvector of the covariance matrix associated with the largest eigenvalue.

Of course, exact self-consistency will never hold for a sample, because for almost every point x0 on a sample principal component axis there will be no points in the sample that project orthogonally onto x0. From Theorem 4, H can be restated as

H: for all k, Yk is self-consistent for X,

where Yk is supported by k principal points of α1′X along the first principal component axis. Thus I propose an approximate test of self-consistency of a principal component axis using self-consistent points and principal points. To show that a principal component axis is not self-consistent, one need only show that Yk is not self-consistent for any given value of k. The lack-of-fit testing procedure is described for the hypothesis

Hk: Yk is self-consistent for X.

The testing procedure requires estimating the eigenvectors of the covariance matrix and principal points. The eigenvectors of the covariance matrix are estimated in the usual way using eigenvectors from the sample covariance matrix. The k-means algorithm (Hartigan 1975) is used to estimate the principal points for the marginal principal component distributions. This is justified by the fact that cluster means from the k-means algorithm are strongly consistent estimators of principal points (Pollard 1981).

A bootstrap (Efron 1982) procedure will be used to test Hk. Let x1, ..., xn denote a sample from a p-variate distribution. The description of the procedure is simplified if one works with the data transformed into the coordinate system of the sample principal components and centered at 0. That is, let zi = A′(xi − x̄), i = 1, ..., n, where A is the p × p orthogonal matrix whose columns correspond to the eigenvectors of the sample covariance matrix. Choose A so that the first column is the eigenvector associated with the largest eigenvalue. Thus the first component of zi corresponds to the first principal component. Partition each observation as zi = (zi1, zi2′)′, where zi2 is a (p − 1)-vector of the second through pth principal components. To test Hk, perform the following steps.

Lack-of-Fit Test

1. Estimate principal points. Run the k-means algorithm to estimate k principal points for the univariate data z11, z21, ..., zn1 of the first principal component. Denote the k cluster means from the algorithm by ξ1, ..., ξk.

2. Partition the data into k slices Dj, j = 1, ..., k, where

Dj = {zi : |ξj − zi1| < |ξl − zi1|, l ≠ j}.

3. Compute means over each slice. Compute the means μ̂j of the zi2 over each of the k slices:

μ̂j = Σ_{zi∈Dj} zi2/nj,

where nj is the number of observations in Dj.

4. Compute the test statistic. If the first principal component is self-consistent, then by Theorem 4, E[μ̂j] = 0, j = 1, ..., k. Thus the test statistic is based on the squared standard distance of each μ̂j from 0:

T² = Σ_{j=1}^{k} nj μ̂j′ Sj⁻¹ μ̂j,

where Sj is the sample covariance matrix computed from the zi2 in the jth slice Dj.

5. Perform centering. To ensure that the resampling of step 6 is done in a way that reflects Hk (Hall and Wilson 1991), the data are centered over each of the k slices:

zi2* = zi2 − μ̂j if zi ∈ Dj, i = 1, ..., n.

Now transform the "centered" data zi* = (zi1, zi2*′)′ back into the original coordinate system: xi* = Azi* + x̄.

6. Obtain bootstrap samples. Sample n observations with replacement from xi*, i = 1, ..., n, obtaining a bootstrap sample. Compute the test statistic, T*², for the resampled data by repeating steps 1-4 for the bootstrap sample. Repeat a large number N times.

7. Determine the p value. The p value for this test is the proportion of bootstrap samples that yield a test statistic T*² larger than T².

After the data have been centered in each of the slices Dj in step 5, any difference between the cluster means μ̂j and 0 in each bootstrap sample will be due approximately to natural variation and not to systematic departures from self-consistency of the first principal component axis. The foregoing procedure takes into account the variability of estimating the first principal component axis as well.
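The seven steps translate directly into code. Below is a minimal numpy sketch of the procedure as I read it; the function names, the Lloyd-style k-means with quantile starting values, and the absence of guards against very small or degenerate slices are my own choices, so treat it as an outline rather than a definitive implementation.

```python
import numpy as np

def kmeans1d(t, k, iters=100):
    """Lloyd's algorithm on univariate data; returns centers and labels."""
    c = np.quantile(t, (np.arange(k) + 0.5) / k)      # quantile starting values
    for _ in range(iters):
        idx = np.abs(t[:, None] - c[None, :]).argmin(axis=1)
        c = np.array([t[idx == j].mean() for j in range(k)])
    return c, np.abs(t[:, None] - c[None, :]).argmin(axis=1)

def t2_statistic(X, k):
    """Steps 1-4: T^2 from the slice means of PCs 2..p, sliced on PC 1."""
    Xc = X - X.mean(axis=0)
    A = np.linalg.eigh(np.cov(Xc.T))[1][:, ::-1]      # largest eigenvalue first
    Z = Xc @ A
    _, idx = kmeans1d(Z[:, 0], k)
    T2 = 0.0
    for j in range(k):
        Zj = Z[idx == j, 1:]
        mu = Zj.mean(axis=0)
        S = np.cov(Zj.T).reshape(Zj.shape[1], Zj.shape[1])
        T2 += len(Zj) * mu @ np.linalg.solve(S, mu)   # squared standard distance
    return T2

def lack_of_fit_test(X, k=4, B=1000, seed=0):
    """Bootstrap lack-of-fit test for self-consistency of the first PC axis."""
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    T2 = t2_statistic(X, k)
    # Step 5: center the PC-2..p coordinates within each slice so that the
    # resampled data reflect H_k, then map back to the original coordinates.
    xbar = X.mean(axis=0)
    A = np.linalg.eigh(np.cov((X - xbar).T))[1][:, ::-1]
    Z = (X - xbar) @ A
    _, idx = kmeans1d(Z[:, 0], k)
    for j in range(k):
        Z[idx == j, 1:] -= Z[idx == j, 1:].mean(axis=0)
    Xstar = Z @ A.T + xbar
    # Steps 6-7: resample with replacement, recompute T^2, report the p value.
    boots = [t2_statistic(Xstar[rng.integers(0, n, n)], k) for _ in range(B)]
    return T2, np.mean(np.array(boots) > T2)
```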


The lack-of-fit test just described is similar in spirit to nonparametric smoothing lack-of-fit tests in the regression literature. Hart (1997, p. 145) described a lack-of-fit test based on a linear smooth function ĝ of the residuals from regression. If the null hypothesis in the regression framework is true, then the residuals should behave as a random scatter about 0, and the estimated function ĝ should be relatively flat. In the current setting, principal components 2 through p are the multivariate analog of the residuals, and instead of a smooth function ĝ, a piecewise constant function determined by the cluster means in step 3 is estimated. The test statistic of step 4 is a multivariate analog of the univariate lack-of-fit test statistic based on a linear smooth for the residuals from regression.

Note that the lack-of-fit test will likely fail if the largest eigenvalue of the sample covariance matrix is statistically indistinguishable from the second largest eigenvalue, in which case the estimated first principal component axis may vary widely between bootstrap samples. (For a formal test of "sphericity," that is, equality of eigenvalues of the sample covariance matrix, see Flury 1988, p. 19; Mardia, Kent, and Bibby 1979, p. 235.)

The following example illustrates the bootstrap lack-of-fit test.

Example 4. I simulate a sample of 200 bivariate observations for which a strong nonlinear trend exists between the components. Let X1 ~ uniform[0, 1] and X2 = log(X1) + e, where e ~ uniform[−.3, .3]. This is a model considered by Li (1997) as an example of nonlinear confounding between regressors. Figure 4(a) shows a scatterplot of the data in the coordinate system of the principal components. Four principal points are estimated by the k-means algorithm for the first principal component and are represented by the four circles on the horizontal axis. The four points partition the plane into four slices (whose boundaries are given by the vertical dashed lines), and the mean of each slice is denoted by the triangles. The fact that the triangles and the circles differ considerably highlights the fact that the first principal component axis is not self-consistent. The T² statistic for this data is T² = 129.19. Next, the data in each slice are centered, and bootstrap samples are taken. Figure 4(b) shows one such bootstrap sample along with k = 4 estimated principal points estimated from the bootstrap sample, denoted again by the circles. Likewise, the triangles in Figure 4(b) correspond to the means over each slice. For the bootstrap sample, the circles and triangles nearly overlap, as is to be expected, and the test statistic for this bootstrap sample is T*² = 3.95. The p value based on 1,000 bootstrap samples is 0, providing very strong evidence against self-consistency of the first principal component axis, as is readily evident from the plot.

Figure 4. (a) A Scatterplot of the Simulated Data of Example 4 in the Coordinate System of the Sample Principal Components and (b) a Bootstrap Sample Obtained From the Data After Centering in Each Slice. In (a), the solid horizontal line is the first principal component, and the open circles are k-means estimates of four principal points along the first principal component axis. The dashed vertical lines represent the boundaries of the slices, and the triangles are the corresponding means of each slice. The discrepancy between the circles and triangles in (a) is due to the lack of self-consistency of the first principal component axis.
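For concreteness, the simulated data of Example 4 and the k = 4 statistic can be generated as follows. This is my own sketch (a fixed seed and a slimmed-down, bivariate version of the statistic of step 4), so the T² it prints will differ from the 129.19 reported above because the simulated sample differs.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200
x1 = rng.uniform(0, 1, n)
x2 = np.log(x1) + rng.uniform(-0.3, 0.3, n)       # strong nonlinear trend
X = np.column_stack([x1, x2])

# coordinates of the sample principal components, centered at the mean
Xc = X - X.mean(axis=0)
A = np.linalg.eigh(np.cov(Xc.T))[1][:, ::-1]      # largest eigenvalue first
z1, z2 = (Xc @ A).T

# k = 4 principal point estimates of the first PC scores via Lloyd's algorithm
k = 4
c = np.quantile(z1, (np.arange(k) + 0.5) / k)
for _ in range(100):
    idx = np.abs(z1[:, None] - c[None, :]).argmin(axis=1)
    c = np.array([z1[idx == j].mean() for j in range(k)])

# T^2: squared standard distances of the slice means of the second PC from 0
T2 = sum(len(z2[idx == j]) * z2[idx == j].mean() ** 2 / z2[idx == j].var(ddof=1)
         for j in range(k))
print(T2)   # a large value signals lack of self-consistency
```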


Further simulations were run for the previous example, but with the dimension of the data increased to p = 4 by adding two independent mean 0 normal variates, each with an equal standard deviation σ. Running simulations for standard deviations σ = .2, .3, .4, .5 yielded p values of .004, .013, .044, and .948 using k = 4 and 1,000 bootstrap samples. For σ = .1, a scatterplot (not shown here) of the first two principal components shows a strong nonlinear pattern. For .2 ≤ σ ≤ .4, a nonlinear pattern is not noticeable in a scatterplot of the first two principal components, but the lack-of-fit test nonetheless detects the nonlinear structure. For σ = .5, the data become almost spherical in the first two principal components (the two largest eigenvalues of the sample covariance matrix are nearly equal), and the bootstrap procedure is unable to detect the nonlinearity in the data.

A simulation study was conducted to evaluate the performance of the lack-of-fit test when the null hypothesis is true using the multivariate normal, multivariate t, and uniform distributions on a rectangle. The simulations indicate that the lack-of-fit test is conservative, and in each case the distribution of p values was skewed to the left. In each scenario, the largest value of k leads to smaller p values on average, except for the uniform distribution on a rectangle. The test became more conservative as the dimension of the distribution increased and was most conservative for heavier-tailed distributions.

A natural question that arises when applying the lack-of-fit test is: What value of k should one use? That is, how many principal points should be estimated along the first principal component axis? Large values of k improve the approximation of Hk to H. But if k is too large, then some of the slices Dj will contain too few points to provide a stable estimate of the cluster means. My experience with simulations and real data indicates that strong statistical evidence of a lack of self-consistency of the first principal component axis can often be determined with k as small as 3 or 4. One very important point is that the lack-of-fit test described here will usually fail to reject Hk for k = 2 even if the data exhibit a very strong nonlinear pattern. Larger values of k are restricted by sample size and dimensionality. In the examples discussed here, I required each slice to contain at least 2p points, where p is the dimension of the data.

In the simulation study, the test statistic and its associated p value were computed using k from 3 to 5 or 6 for each dataset. Generating multiple p values typically leads to multiplicity concerns if a decision is to be made. However, the results of the simulation study reveal that basing a decision according to the smallest of the p values still yields a conservative test in dimension p > 2. As a descriptive tool, computing the test statistic for different values of k can be useful. For example, if the departure from self-consistency is subtle, then small values of k may not detect a problem with self-consistency, whereas larger values of k will indicate a lack of self-consistency (see Example 6).

As a practical guideline for making decisions, it appears relatively safe to make a decision based on the smallest p value when the dimension is greater than two, due to the conservative nature of the test. For two-dimensional data (or if multiplicity remains a concern for higher-dimensional data), I recommend basing decisions on the p value from the largest value of k used for computing the test statistic, because larger values of k better approximate the self-consistency hypothesis.

Determining the largest value of k for computing the lack-of-fit test statistic can be aided by the use of the univariate normal distribution, since projections of high-dimensional data tend to be approximately normal (Diaconis and Freedman 1984). To illustrate, consider using k = 6 slices. The proportions for each of the six slices for a univariate normal distribution as determined by k = 6 principal points are 7.4%, 18.1%, 24.5%, 24.5%, 18.1%, and 7.4% (see, e.g., Cox 1957, table 1). As a rough guideline, the slice with the smallest cardinality should contain approximately 7.4% of the sample size n. Thus to use k = 6 for the lack-of-fit test, one can check whether 7.4% of the sample size exceeds 2p, where p is the dimension of the data.

One final comment regards the use of the k-means algorithm for the lack-of-fit test. By construction, the k-means algorithm converges to a set of k self-consistent points for the empirical distribution, and there may exist several sets of k self-consistent points. But if the underlying distribution has a log-concave density (as does the univariate normal distribution), then there exists a unique set of k self-consistent points that must be the k principal points of the distribution (Kieffer 1982; Li and Flury 1995; Trushkin 1982).
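The slice proportions 7.4%, 18.1%, and 24.5% quoted above for k = 6 can be reproduced numerically. The following is my own sketch: it runs Lloyd's algorithm on a density-weighted grid (rather than on sample data) to approximate the k = 6 principal points of the standard normal and the probabilities of their domains of attraction.

```python
import numpy as np

# Lloyd's algorithm for the k principal points of the standard normal,
# run on a fine grid weighted by the normal density.
x = np.linspace(-5, 5, 20_001)
w = np.exp(-x**2 / 2)                     # unnormalized N(0, 1) density

def normal_principal_points(k, iters=500):
    c = np.linspace(-2, 2, k)             # symmetric starting centers
    for _ in range(iters):
        idx = np.abs(x[:, None] - c[None, :]).argmin(axis=1)
        c = np.array([np.average(x[idx == j], weights=w[idx == j])
                      for j in range(k)])
    idx = np.abs(x[:, None] - c[None, :]).argmin(axis=1)
    probs = np.array([w[idx == j].sum() / w.sum() for j in range(k)])
    return c, probs

centers, probs = normal_principal_points(6)
print(np.round(centers, 3))  # six principal points, symmetric about 0
print(np.round(probs, 3))    # slice probabilities, approx .074 .181 .245 ...
```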

6. EXAMPLES

I now apply the lack-of-fit test of Section 5 to real datasets. The examples illustrate the importance of assessing self-consistency of a principal component axis. In the first example I apply the lack-of-fit test to assess whether the straight-line model provided by the first principal component axis is appropriate or the nonlinear generalization given by principal curves should be used. In each application of the lack-of-fit test, 1,000 bootstrap samples were obtained.

Example 5: Gold Assay Data (Example 1 Continued). Using the bootstrap lack-of-fit test of Section 5 on the gold assay data encountered in Example 1 yields the results given in Table 1.

Table 1. Gold Assay Data

k    T²       p value
3    16.43    .000
4    18.5     .003
5    19.34    .017
6    28.20    .003

The evidence is fairly strong against self-consistency of the first principal component axis. The departure from self-consistency can be seen in Figure 5. Figure 5(a) shows a scatterplot of the gold assay data. The solid line is the first principal component axis, the circles represent k = 6 principal point estimates along the first principal component axis, and the triangles are the corresponding means over each of the six slices. Figure 5(a) tends to support the conclusion reached by Hastie and Stuetzle that the outside assay tends to be higher than the in-house assay in the middle range, but the difference is reversed at lower levels; see Hastie and Stuetzle's Fig. 12(b). In Figure 5(b), 25 bootstrap samples from the raw data (without centering in each slice) were obtained. For each bootstrap sample, the conditional means over each of the six slices were joined by line segments and plotted. The connected line segments of Figure 5(b) can be thought of as providing a rough first approximation to the principal curve. Because Figure 5(b) is computed from the raw data instead of the slice-centered data, one can see the systematic departure from the first principal component axis (the thick straight line) in the joined bootstrap line segments, which indicates why the lack-of-fit test rejects the hypothesis of self-consistency for the first principal component axis.


Figure 5. Gold Assay Data: (a) A Scatterplot of the Outside Lab Assay and the In-House Assay Along With the First Principal Component Axis (Solid Line) and (b) Results of 25 Bootstrap Samples From the Raw Data. In (a), the circles represent estimated principal points along the first principal component axis and the triangles are the corresponding means over each slice. In (b), the thick straight line is the first principal component axis. For each bootstrap sample, the means over each of the six slices are connected by line segments and joined together. Part (b) shows a systematic departure from self-consistency of the first principal component axis because the line segments lie mostly above the first principal axis in the middle and below the principal axis on the left.

It is well known that principal components are not scale invariant. A common problem in principal component analysis is how to choose an appropriate transformation if variables are measured on different scales. In regression, if a nonlinear pattern exists, a common strategy is to find appropriate transformations of the regressor(s) and/or response so that a simple straight-line model will suffice. In principal component analysis, if a simple transformation exists that yields a self-consistent principal component axis, then estimating a principal curve may be avoided. The next example looks at the effects of transforming the variables on self-consistency.

Example 6: Dimensions of Birds. Figure 6 shows a scatterplot of the wing area (cm²) and wing spread (cm) for 232 different species of birds (Greenewalt 1962). Clearly, the first principal axis will not be self-consistent for this dataset (T² = 117.6 and p = 0 for k = 4).

Figure 6. Bird Data: Scatterplot of Wing Length (cm) Versus Wing Area (cm²). The first principal component axis is not self-consistent due to the strong nonlinear pattern.

Taking the square root of wing area so that both variables are in units of centimeters, we get T² = 10.49 and a p value of .439 using k = 4. Thus the square root transformation of wing area gives a first principal component axis consistent with the hypothesis of self-consistency. (In fact, the p values for the test using k = 5, ..., 10 all exceeded .5.) Incidentally, the Box-Cox transformation, which corresponds to raising the wing area variable to approximately the .4 power, yielded a p value of .206 when k = 4. Figure 7(a) shows a scatterplot of the square root-transformed bird data along with the first principal component axis. As in Figure 5, the circles correspond to k = 6 estimated principal points along the first principal component axis, and the triangles are the corresponding means over each of the six slices. The circles and triangles overlap considerably (the p value for k = 6 is .669). The first principal component accounts for about 99.5% of the total variability for the transformed data. Thus, even if the first principal component were not self-consistent, it would be hard to distinguish the circles and triangles. Therefore, in Figure 7(b) the plotting is done in the coordinate system of the principal components, where the scale is chosen so that one can see the variability of the second principal component. For Figure 7(b), 25 bootstrap samples were obtained from the raw data [as in Fig. 5(b)], and the means of each slice were connected by line segments.


It is difficult to distinguish any systematic departure from self-consistency, except that for the fifth point from the left, most of the line segments lie below the first principal component axis. Figure 7(b) also shows the dramatic increase in variability of the second principal component as one goes from left to right along the first principal component axis.

Figure 7. Bird Data: Square Root Transformation. (a) A scatterplot of wing length versus square root of wing area. The solid line is the first principal component. The circles represent k = 6 estimated principal points along the first principal component axis and the triangles are the corresponding means. (b) Results of taking 25 bootstrap samples from the square root transformed data. For each bootstrap sample, the means over each of the six slices are computed and connected by line segments and plotted in the coordinate system of the principal components. The departure from self-consistency of the first principal component axis (horizontal line) appears slight, because the line segments appear to cluster about the first principal component axis.

In a study of allometric growth, it is natural to take a logarithm transformation of each variable. A model of allometric growth assumes that the wing area and wing length increase at a constant relative rate as you go from smaller to larger birds. One can interpret the coefficients of the first eigenvector of the covariance matrix of the log-transformed data as constants of allometric growth (Jolicoeur 1963). The allometric model rests on the assumption that the first principal component then is self-consistent. Taking the logarithm of wing area and wing spread gives T² = 33.545 and an associated p value of 0 using k = 4. However, a more reasonable transformation is to take the logarithm of spread and one-half the logarithm of area, because area is in units of cm² (see Hills 1982), which then gives T² = 11.51 and a p value of .197, indicating a better fit. It is interesting to note that if the log-transformed data were from an elliptical distribution, then according to Corollary 2, the linear transformation from log-log to .5 × log-log should not affect the self-consistency of the first principal component axis.

Running the lack-of-fit test for k = 6 for the transformation .5 × log(wing area) and log(wing length) yields T² = 50.5 with a p value of .004. Thus using only k = 4 points is insufficient for determining that the first principal component apparently is not self-consistent. Figure 8(a) shows a scatterplot of the transformed bird data along with the first principal component axis. In Figure 8(b), 25 bootstrap samples were again generated from the raw log-transformed data [as in Fig. 7(b)], and the means over each of the slices are plotted and connected by line segments in the coordinate system of the principal components. The bootstrap line segments show a very strong systematic departure from the horizontal line, indicating a lack of self-consistency of the first principal component. The implication is that a single allometric model does not seem to hold for all species of birds. Perhaps this is due to the fact that wings accommodate many different modes of flight among birds: horizontal flight, zigzagging, climbing, diving, bounding, soaring, and hovering (Lighthill 1975). In particular, note that in Figure 8(a) the points corresponding to species of hummingbirds have been plotted with circles as opposed to points. Each of these circles lies on or above the first principal component axis. The flight of hummingbirds is characterized by hovering, which is considerably different from other types of winged flight. Thus the conclusions of the lack-of-fit test indicate that perhaps constants for allometric growth will not be the same for different types of flight.

Figure 8. Bird Data: Log Transformed. (a) A scatterplot of log(wing length) versus .5 log(wing area). The solid line is the first principal component axis. The observations plotted with circles in the lower left correspond to species of hummingbirds. (b) Results of 25 bootstrap samples plotted in the coordinate system of the principal components. The means over each of the six slices are joined by line segments. The lack of self-consistency of the first principal component axis is highlighted by the systematic variation of the line segments about the horizontal line.
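The sequence of transformations in Example 6 amounts to re-running the test on re-expressed variables. Purely as a usage illustration of the lack_of_fit_test sketch given after the steps in Section 5 (the file name birds.txt is hypothetical, and any printed values would depend on the actual data), the comparisons could be organized as:

```python
import numpy as np

# hypothetical two-column file: wing area (cm^2), wing spread (cm);
# lack_of_fit_test is the sketch from Section 5, assumed to be in scope
area, spread = np.loadtxt("birds.txt", unpack=True)

candidates = [
    ("raw",         np.column_stack([area, spread])),
    ("sqrt(area)",  np.column_stack([np.sqrt(area), spread])),
    ("log-log",     np.column_stack([np.log(area), np.log(spread)])),
    (".5 log-log",  np.column_stack([0.5 * np.log(area), np.log(spread)])),
]
for name, W in candidates:
    T2, p = lack_of_fit_test(W, k=4)
    print(f"{name:12s}  T2 = {T2:7.2f}   p = {p:.3f}")
```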


7. DISCUSSION

I have discussed the situation of a distribution centered about a principal component axis. The mean is the zero-dimensional center of gravity for a distribution. If a distribution is centered about a higher-dimensional linear subspace, then this subspace must be a principal component subspace. As the examples in Sections 5 and 6 illustrate, however, the one-dimensional principal component axes are not always self-consistent.

The class of elliptical distributions lends itself naturally to principal component analysis. The contours of equal density are ellipsoids whose axes correspond to self-consistent principal component axes. The class of elliptical distributions is characterized by the fact that after arbitrary linear transformations, such as transformations to a correlation matrix, the principal component axes remain self-consistent. Analogously, ellipsoids remain ellipsoids after arbitrary linear transformations.

The bootstrap lack-of-fit test described in Section 5 provides a relatively simple procedure for assessing self-consistency of principal axes from sample data and appears to work well in the examples that I have investigated. One concern with using the k-means algorithm for the lack-of-fit test is that the algorithm may produce different clusters for the first principal component, depending on which initial starting values are used. Clusters may be formed containing very few observations, in which case the sample covariance matrix for these clusters will be singular or nearly singular. One may consider using different versions of the k-means algorithm (Lloyd 1982) or using parametric or semiparametric estimators of principal points instead of the nonparametric k-means algorithm (Tarpey 1997).

The methodology and examples discussed here deal only with self-consistency of the first principal axis. The procedure can be easily adapted to assess self-consistency of higher-dimensional principal component subspaces, or principal component axes besides the first principal component axis. In fact, Theorem 4, on which the inference procedure is based, was stated for arbitrary dimensions. Determining the principal points for the lack-of-fit test for higher-dimensional subspaces is complicated by the fact that typically several different sets of k self-consistent points exist for principal component subspaces of dimension two or more (Tarpey 1998). However, the k-means algorithm converges by construction to a set of k self-consistent points of the empirical distribution, and the methodology described here works for principal points and self-consistent points.

APPENDIX: PROOFS

Proof of Lemma 1

Suppose that Y is self-consistent for X; then

E[Z|W = w] = E[B′(X − μ)|B′(Y − μ) = w]
= B′E[X|Y = Bw + μ] − B′μ
= B′(Bw + μ) − B′μ a.s. (by self-consistency)
= w.

Conversely, if W is self-consistent for Z, then E[X|Y = y] = μ + BE[Z|W = B′(y − μ)] = μ + B(B′(y − μ)) = y.

Proof of Theorem 2

Because (X1′, X2′)′ ~ (−X1′, X2′)′ by orthant symmetry,

E[X1|Y] ~ −E[X1|Y].

For any set A measurable with respect to Y such that P(Y ∈ A) > 0,

E[X1 1A(Y)]/P(Y ∈ A) = E[X1|Y ∈ A] (by definition)
= E[−X1|Y ∈ A] (by orthant symmetry)
= E[−X1 1A(Y)]/P(Y ∈ A) (by definition).


(IA (.) denotes the indicator function for the set A.) Therefore,t"[XlIA(Y)] = t"[-XlIA(Y)], which means that t"[XlIY E A] =o. Because this holds for any non-zero probability set A measur­able with respect to Y, it follows that t"[XlIY] = 0 a.s.

Proof of Theorem 3

Suppose that X has a spherical distribution. Then, for any pairof orthogonal unit vectors a and (3in ~RP, from (2), t"[(3/Xla/X] =o. Choose a p x (p - 1) matrix A 2 = [al : ... : ap-l] such thatA = [a: AI] is orthogonal. Let P = ao' and Q = A2A~. Then

Y = Ale;] = t"[XIXl E Dj]. By self-consistency of the principalpoints for the marginal distribution of X,, t"[XlIXl E Dj] = ej.It needs to be shown that t"[XiIXl E Dj] = O,i = q + 1, ... ,p.This is equivalent to showing that

t"[XJDj (Xl)] = J.. ·1. I: X;f(Xl, Xi) dx, dXl = 0,J

where f is the marginal joint density of (Xl, Xd, i > q. BecausePX = (X~, 0/)/ is self-consistent for X, t"[Xi IXl = x] = 0 foralmost every x. Thus

Proof of Corollary 2

Without loss of generality, assume $E[X] = 0$. If X has an elliptical distribution, then $B'X$ also has an elliptical distribution (see theorem 2.16 of Fang et al. 1990, p. 43). Therefore any principal component subspace of $B'X$ is self-consistent by Corollary 1.

Conversely, suppose that for any $p \times p$ matrix B of full rank, the principal component subspaces of $B'X$ are self-consistent. Let $\Psi$ denote the covariance matrix of X, and let $\Psi = ADA'$ be the spectral decomposition of $\Psi$, where A is a $p \times p$ orthogonal matrix and D is a diagonal matrix. Set $B = AD^{-1/2}$. Then the covariance matrix of $Z = B'X$ is the identity matrix. By the hypothesis, any principal component of Z is self-consistent. Thus for any unit vector $\alpha \in \mathbb{R}^p$, setting $P = \alpha\alpha'$ gives that PZ is self-consistent for Z. This implies that Z has a spherical distribution by Theorem 3, which in turn implies that X has an elliptical distribution.
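The matrix $B = AD^{-1/2}$ constructed in the converse is just a whitening transformation. A minimal sketch (illustrative; it uses the sample covariance in place of $\Psi$):

```python
import numpy as np

rng = np.random.default_rng(4)
Psi = np.array([[4.0, 1.5], [1.5, 1.0]])  # covariance of an elliptical X
X = rng.multivariate_normal([0.0, 0.0], Psi, size=100_000)

# Spectral decomposition Psi ~= A D A', then the whitening matrix B = A D^{-1/2}.
evals, A = np.linalg.eigh(np.cov(X, rowvar=False))
B = A @ np.diag(evals ** -0.5)

Z = X @ B  # rows are B'X
print(np.round(np.cov(Z, rowvar=False), 3))  # ~ identity matrix
```

If every principal component of the whitened Z is self-consistent, Theorem 3 forces Z to be spherical, which is exactly how the converse proceeds.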

Proof of Theorem 4

Let $\alpha_1, \ldots, \alpha_q$ denote $q < p$ eigenvectors of the covariance matrix of X. Define $A_1 = [\alpha_1 : \cdots : \alpha_q]$ and let $P = A_1A_1'$, the projection matrix onto the principal component subspace of dimension $q < p$. Let $\xi_1, \ldots, \xi_k$ denote a set of k principal points of $A_1'X$. The projection of X onto the k principal points for the marginal distribution of X in this principal component subspace is given by

$$Y_k = \sum_{j=1}^{k} A_1\xi_j I_{D_j}(X),$$

where $D_j \subset \mathbb{R}^p$ is the domain of attraction of $A_1\xi_j$, $j = 1, \ldots, k$. By Lemma 1 it may be assumed, without loss of generality, that $\mathrm{cov}(X) = \mathrm{diag}(\sigma_1^2, \ldots, \sigma_p^2)$ with $\sigma_1 \ge \sigma_j$, $j \ne 1$; that is, $\alpha_1, \ldots, \alpha_q$ can be taken to be the first q basis vectors.

First, consider the case where PX is self-consistent for X. Let $X = (X_1, \ldots, X_p)'$; then $\mathbf{X}_1 := A_1'X = (X_1, \ldots, X_q)'$. Let $D_j \subset \mathbb{R}^q$ be the domain of attraction of $\xi_j$, $j = 1, \ldots, k$. Then $E[X \mid Y_k = A_1\xi_j] = E[X \mid \mathbf{X}_1 \in D_j]$. By self-consistency of the principal points for the marginal distribution of $\mathbf{X}_1$, $E[\mathbf{X}_1 \mid \mathbf{X}_1 \in D_j] = \xi_j$. It needs to be shown that $E[X_i \mid \mathbf{X}_1 \in D_j] = 0$, $i = q+1, \ldots, p$. This is equivalent to showing that

$$E[X_i I_{D_j}(\mathbf{X}_1)] = \int \cdots \int_{D_j} \int_{-\infty}^{\infty} x_i f(\mathbf{x}_1, x_i) \, dx_i \, d\mathbf{x}_1 = 0,$$

where f is the marginal joint density of $(\mathbf{X}_1, X_i)$, $i > q$. Because $PX = (\mathbf{X}_1', \mathbf{0}')'$ is self-consistent for X, $E[X_i \mid \mathbf{X}_1 = \mathbf{x}] = 0$ for almost every $\mathbf{x}$. Thus the following can be written:

$$E[X_i I_{D_j}(\mathbf{X}_1)] = \int \cdots \int_{D_j} E[X_i \mid \mathbf{X}_1 = \mathbf{x}] f_1(\mathbf{x}) \, d\mathbf{x} = 0,$$

where $f_1$ is the marginal density of $\mathbf{X}_1$, and thus $Y_k$ is self-consistent for X.

On the other hand, suppose that $Y_k$ is self-consistent for X for every k. If $\xi_1, \ldots, \xi_k$ represent k principal points of $A_1'X$, then the random vector $\sum_{j=1}^{k} \xi_j I_{D_j}(X)$ converges almost surely to $A_1'X$ (see, e.g., Pärna and Lember 1997). Thus $Y_k$ converges almost surely to PX. It needs to be shown that $E[X_i \mid \mathbf{X}_1] = 0$ for $i = q+1, \ldots, p$. Let $(\Omega, \mathcal{F}, P)$ denote the probability space on which X is defined. Because $Y_k$ converges almost surely to PX, let $\mathbf{X}_1(\omega) = \mathbf{x}_1^0$ be a point for which the marginal density of $\mathbf{X}_1$ is positive and $A_1'Y_k(\omega) \to \mathbf{x}_1^0$. For each k, let $D_{jk}$, $j = 1, \ldots, k$, denote the domains of attraction of the k principal points of $\mathbf{X}_1$. The $D_{jk}$ form a partition of the support of $\mathbf{X}_1$, and thus there exists a sequence $D_{j_k k}$ such that $\mathbf{x}_1^0 \in D_{j_k k}$ for all k. Now, because $Y_k$ is self-consistent for X, it follows that $E[X_i \mid Y_k] = 0$ for all k, which means that $E[X_i \mid \mathbf{X}_1 \in D_{jk}] = 0$ for $j = 1, \ldots, k$. Applying this last equality to the $D_{j_k k}$ gives

$$\int \cdots \int_{D_{j_k k}} \int_{-\infty}^{\infty} x_i f(\mathbf{x}_1, x_i) \, dx_i \, d\mathbf{x}_1 \Big/ \pi_{j_k k} = 0,$$

where $\pi_{j_k k} = P(\mathbf{X}_1 \in D_{j_k k})$. Applying the mean value theorem,

$$\mathrm{vol}(D_{j_k k}) \int_{-\infty}^{\infty} x_i f(\mathbf{x}_{1k}, x_i) \, dx_i \Big/ \pi_{j_k k} = 0$$

for some $\mathbf{x}_{1k} \in D_{j_k k}$. Because $\mathbf{x}_1^0 \in D_{j_k k}$ for all k, $\mathbf{x}_{1k}$ converges to $\mathbf{x}_1^0$ as k goes to infinity. Furthermore, $\mathrm{vol}(D_{j_k k})/\pi_{j_k k}$ converges to $1/f_1(\mathbf{x}_1^0)$ as k tends to infinity. Therefore,

$$\int \cdots \int_{D_{j_k k}} \int_{-\infty}^{\infty} x_i f(\mathbf{x}_1, x_i) \, dx_i \, d\mathbf{x}_1 \Big/ \pi_{j_k k} \;\to\; \int_{-\infty}^{\infty} x_i f(x_i \mid \mathbf{x}_1^0) \, dx_i,$$



but the left-hand side is identically 0. Thus $E[X_i \mid \mathbf{X}_1] = 0$, which gives that PX is self-consistent for X, and the theorem is proved.
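For completeness, the projection $Y_k$ appearing in the theorem can be computed directly from data: estimate the k principal points of the marginal $A_1'X$ with k-means and send each observation to the lifted principal point of its domain of attraction. A sketch under those assumptions (illustrative code of my own, not the paper's):

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(5)
X = rng.multivariate_normal(np.zeros(3), np.diag([4.0, 2.0, 1.0]), size=5_000)

q, k = 1, 4
# Columns of A1 span the q-dimensional principal component subspace.
_, A = np.linalg.eigh(np.cov(X, rowvar=False))
A1 = A[:, ::-1][:, :q]  # leading eigenvector(s), largest eigenvalue first

marginal = X @ A1  # observations of A1'X
km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(marginal)
xi = km.cluster_centers_  # estimated principal points xi_1, ..., xi_k

# Y_k = sum_j A1 xi_j I_{D_j}(X): map each point to its domain's lifted point.
Yk = xi[km.labels_] @ A1.T
print(np.unique(np.round(Yk, 2), axis=0))  # k distinct points on the PC axis
```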

[Received March 1997. Revised August 1998.]

REFERENCES

Cambanis, S., and Gerr, N. (1983), "A Simple Class of Asymptotically Optimal Quantizers," IEEE Transactions on Information Theory, IT-29, 664-676.
Cox, D. R. (1957), "Note on Grouping," Journal of the American Statistical Association, 52, 543-547.
Diaconis, P., and Freedman, D. (1984), "Asymptotics of Graphical Projection Pursuit," The Annals of Statistics, 12, 793-815.
Duchamp, T., and Stuetzle, W. (1996), "Extremal Properties of Principal Curves in the Plane," The Annals of Statistics, 24, 1511-1520.
Efron, B. (1969), "Student's t-Test Under Symmetry Conditions," Journal of the American Statistical Association, 64, 1278-1302.
--- (1982), The Jackknife, the Bootstrap, and Other Resampling Plans, Philadelphia, PA: Society for Industrial and Applied Mathematics.
Fang, K., Kotz, S., and Ng, K. (1990), Symmetric Multivariate and Related Distributions, New York: Chapman and Hall.
Flury, B. (1988), Common Principal Components and Related Multivariate Models, New York: Wiley.
--- (1990), "Principal Points," Biometrika, 77, 33-41.
--- (1993), "Estimation of Principal Points," Applied Statistics, 42, 139-151.
Greenwalt, C. (1962), "Dimensional Relationships for Flying Animals," Smithsonian Miscellaneous Collections, 144(2), 1-46.
Hall, P., and Wilson, S. R. (1991), "Two Guidelines for Bootstrap Hypothesis Testing," Biometrics, 47, 757-762.
Hart, J. D. (1997), Nonparametric Smoothing and Lack-of-Fit Tests, New York: Springer.
Hartigan, J. A. (1975), Clustering Algorithms, New York: Wiley.
Hastie, T., and Stuetzle, W. (1989), "Principal Curves," Journal of the American Statistical Association, 84, 502-516.
Hills, M. (1982), "Allometry," in Encyclopedia of Statistical Sciences, eds. S. Kotz and N. L. Johnson, New York: Wiley, pp. 48-54.
Hosmer, D., and Lemeshow, S. (1989), Applied Logistic Regression, New York: Wiley.
Jolicoeur, P. (1963), "The Multivariate Generalization of the Allometry Equation," Biometrics, 19, 497-499.
Kieffer, J. (1982), "Exponential Rate of Convergence for Lloyd's Method I," IEEE Transactions on Information Theory, Special Issue on Quantization, IT-28, 205-210.
Li, K.-C. (1997), "Nonlinear Confounding in High-Dimensional Regression," The Annals of Statistics, 25, 577-612.
Li, L., and Flury, B. (1995), "Uniqueness of Principal Points for Univariate Distributions," Statistics & Probability Letters, 25, 323-327.
Lighthill, J. (1975), Mathematical Biofluiddynamics, Philadelphia, PA: Society for Industrial and Applied Mathematics.
Lloyd, S. P. (1982), "Least Squares Quantization in PCM," IEEE Transactions on Information Theory, Special Issue on Quantization, IT-28, 129-137.
Mardia, K., Kent, J., and Bibby, J. (1979), Multivariate Analysis, New York: Academic Press.
Montgomery, D., and Peck, E. (1992), Introduction to Linear Regression Analysis, New York: Wiley.
Pärna, K., and Lember, J. (1997), "On Some Properties of k-Variance," unpublished manuscript.
Pearson, K. (1901), "On Lines and Planes of Closest Fit to Systems of Points in Space," Philosophical Magazine, 2, 559-572.
Pollard, D. (1981), "Strong Consistency of K-Means Clustering," The Annals of Statistics, 9, 135-140.
Pötzelberger, K., and Felsenstein, K. (1994), "An Asymptotic Result on Principal Points for Univariate Distributions," Optimization, 28, 397-406.
Su, Y. (1997), "On Asymptotics of Quantizers in Two Dimensions," Journal of Multivariate Analysis, 61, 67-85.
Tarpey, T. (1995), "Principal Points and Self-Consistent Points of Symmetric Multivariate Distributions," Journal of Multivariate Analysis, 52, 39-51.
--- (1997), "Estimating Principal Points of Univariate Distributions," Journal of Applied Statistics, 24, 499-512.
--- (1998), "Self-Consistent Patterns for Symmetric Multivariate Distributions," Journal of Classification, 15, 57-79.
Tarpey, T., and Flury, B. (1996), "Self-Consistency: A Fundamental Concept in Statistics," Statistical Science, 11, 229-243.
Tarpey, T., Li, L., and Flury, B. (1995), "Principal Points and Self-Consistent Points of Elliptical Distributions," The Annals of Statistics, 23, 103-112.
Tibshirani, R. (1992), "Principal Curves Revisited," Statistics and Computing, 2, 183-190.
Trushkin, A. (1982), "Sufficient Conditions for Uniqueness of a Locally Optimal Quantizer," IEEE Transactions on Information Theory, Special Issue on Quantization, IT-28, 187-198.
