This article was downloaded by: [Colorado College], on 16 October 2014, at 17:17. Publisher: Taylor & Francis. Informa Ltd, registered in England and Wales, registered number 1072954; registered office: Mortimer House, 37-41 Mortimer Street, London W1T 3JH, UK.

Journal of the American Statistical Association. Publication details, including instructions for authors and subscription information: http://www.tandfonline.com/loi/uasa20

Self-Consistency and Principal Component Analysis. Thaddeus Tarpey, Department of Mathematics and Statistics, Wright State University, Dayton, OH 45435, USA. Published online: 17 Feb 2012.

To cite this article: Thaddeus Tarpey (1999), Self-Consistency and Principal Component Analysis, Journal of the American Statistical Association, 94:446, 456-467.

To link to this article: http://dx.doi.org/10.1080/01621459.1999.10474140
Self-Consistency and Principal Component Analysis

Thaddeus TARPEY
I examine the self-consistency of a principal component axis; that is, when a distribution is centered about a principal component axis. A principal component axis of a random vector X is self-consistent if each point on the axis corresponds to the mean of X given that X projects orthogonally onto that point. A large class of symmetric multivariate distributions is examined in terms of self-consistency of principal component subspaces. Elliptical distributions are characterized by the preservation of self-consistency of principal component axes after arbitrary linear transformations. A "lack-of-fit" test is proposed that tests for self-consistency of a principal axis. The test is applied to two real datasets.
KEY WORDS: Bootstrap; Elliptical distribution; k-means clustering; Principal point; Principal curve; Self-consistent point; Spherical distribution; Symmetric distribution.
© 1999 American Statistical Association
Journal of the American Statistical Association, June 1999, Vol. 94, No. 446, Theory and Methods
1. INTRODUCTION

The first principal component axis provides a simple straight-line approximation to a multivariate distribution. A natural question to ask for many applications is whether the straight-line approximation provided by the first principal component axis is adequate. For instance, the first principal component axis may be a poor approximation to a distribution exhibiting nonlinear structure. A single straight line may be an insufficient summary for a population consisting of nonhomogeneous subgroups.

A natural criterion for a principal component axis to provide a good fit to a distribution is that the distribution be centered about the axis. Hastie and Stuetzle (1989) introduced the term "self-consistent" to define the property that each point on a smooth curve is the average of all points in the distribution that project orthogonally onto this point. Thus a curve is self-consistent for a distribution if the distribution is centered about the curve. In this article I address the problem of determining whether a principal component axis is self-consistent.

Traditionally, a principal component of a p-variate random vector X with mean μ is any linear combination U = α'(X − μ), where α is a normalized eigenvector of the covariance matrix of X. Pearson (1901) introduced principal components by finding the q-dimensional plane that minimizes the sum of squared distances between each observation in a dataset and the plane. The solution to the problem is given by the plane that is spanned by q eigenvectors of the sample covariance matrix and translated so that it passes through the sample mean. Thus the first principal component axis passes through the mean, and its direction is determined by the eigenvector of the sample covariance matrix associated with the largest eigenvalue. Because the average squared distance between a random vector X and an arbitrary point is uniquely minimized by the mean of X, the zero-dimensional principal component subspace is simply the mean of the distribution. A distribution is centered about its mean, because the mean is the center of gravity of a distribution. When is a distribution centered about higher-dimensional principal component subspaces?

For instance, consider the uniform distributions on a circle and on a square centered at the origin (Fig. 1). For both distributions, the covariance matrix is of the form σ²I_2, and thus any line through the origin corresponds to a principal component axis. The circle is centered about any line through the origin, but there are only four lines about which the uniform distribution on the square is centered: the horizontal, vertical, and diagonal lines through the origin (see Fig. 1). Every line passing through the origin is "self-consistent" for the uniform distribution on a circle, but there are only four "self-consistent" projections onto lines for the uniform distribution on a square.

Generally speaking, a random vector Y is self-consistent (Tarpey and Flury 1996) for X if

E[X | Y] = Y a.s. (1)

Self-consistency provides a unified framework for many statistical techniques that provide a simpler structure for representing distributions, such as principal components, principal points (Flury 1990, 1993; Tarpey, Li, and Flury 1995), and principal curves (Hastie and Stuetzle 1989).

Let X denote a mean 0 random vector. We shall call a linear subspace V self-consistent for X if almost every point x ∈ V corresponds to the mean of X given that X projects orthogonally onto the point x. If P is an orthogonal projection matrix onto a line through the origin, then PX is self-consistent for X if, for almost every point x on the line, E[X | PX = x] = x. The connection between a self-consistent line (or hyperplane) and principal component subspaces is made by the following theorem.

Theorem 1 (Tarpey and Flury 1996). Let X denote a p-variate random vector and assume without loss of generality that E[X] = 0. Suppose that P is a projection matrix associated with an orthogonal projection from ℝ^p into a linear subspace V of dimension q < p. If PX is self-consistent for X, then V is spanned by some q eigenvectors of the covariance matrix of X.

Thaddeus Tarpey is Assistant Professor, Department of Mathematics and Statistics, Wright State University, Dayton, OH 45435 (E-mail: [email protected]). The author thanks two anonymous referees, the associate editor, and the editor for constructive comments, and also Trevor Hastie for permission to use the gold assay data as well as Rob Tibshirani for providing the gold assay data.
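The circle-versus-square discussion and definition (1) can be checked by simulation. The sketch below is my own illustration, not part of the article: it draws from the uniform distribution on the square [−1, 1]², projects onto a candidate axis at angle θ, and estimates the conditional mean of the orthogonal coordinate within slices of the projection. For a self-consistent axis this conditional mean is 0 in every slice; the function name and thresholds are assumptions for the demo.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 400_000
xy = rng.uniform(-1.0, 1.0, size=(n, 2))  # uniform on the square [-1, 1]^2

def max_slice_bias(theta, n_bins=20):
    """Project onto the line at angle theta, slice the projections into
    quantile bins, and return the largest absolute conditional mean of the
    orthogonal coordinate; near 0 means the axis looks self-consistent."""
    c, s = np.cos(theta), np.sin(theta)
    u = xy @ np.array([c, s])    # coordinate along the candidate axis
    v = xy @ np.array([-s, c])   # orthogonal coordinate
    # quantile bin edges so that every slice holds the same number of points
    edges = np.quantile(u, np.linspace(0.0, 1.0, n_bins + 1))
    idx = np.clip(np.searchsorted(edges, u, side="right") - 1, 0, n_bins - 1)
    return max(abs(v[idx == j].mean()) for j in range(n_bins))

# Horizontal and diagonal axes of the square: conditional means stay near 0.
print(max_slice_bias(0.0))        # small: self-consistent axis
print(max_slice_bias(np.pi / 4))  # small: self-consistent axis
# A 30-degree line is also a principal axis (covariance is sigma^2 I_2),
# but it is not self-consistent: slices near the corners are biased.
print(max_slice_bias(np.pi / 6))  # clearly bounded away from 0
```

The contrast mirrors Figure 1: every line through the center is a principal component axis, yet only the horizontal, vertical, and diagonal lines pass the conditional-mean check.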
Figure 2. Gold Assay Data: Plot of the Log Assays for the Outside Lab Versus the In-House Lab.
Therefore, a self-consistent line must be a principal component axis. However, it is not always the case that a principal component axis is self-consistent. Returning to Figure 1, any line through the origin for the uniform distribution on a circle is a self-consistent principal component. Only the four dashed lines shown for the square correspond to self-consistent projections.

In practical applications, a principal component axis may fail to be self-consistent for a variety of reasons, such as curvilinear relationships between variables or the existence of nonhomogeneous subgroups in the population. Our first example provides an illustration.

Figure 1. For the Uniform Distribution on the Circle and the Square, Any Line Through the Center Is a Principal Component Axis. The circle is centered about any line through the center, and thus every principal component axis is self-consistent. The square, however, has only four principal component axes about which it is centered: the vertical and horizontal lines, and the diagonal lines, denoted by the dashed lines.

Example 1: Gold Assay Data. Hastie and Stuetzle (1989) examined data on the gold content of computer chip waste. Each sample was split in two and assayed by an outside laboratory and also by the company that collected and sold the gold. The company wanted to know which laboratory produced lower gold content assays on average for a given sample. Figure 2 shows a scatterplot of 249 pairs of log-transformed gold assays.

Hastie and Stuetzle considered a generalization of the errors-in-variables model: X_1i = f(τ_i) + e_1i and X_2i = τ_i + e_2i, where τ_i is the expected gold content for sample i using the in-house lab, f(τ_i) is the expected gold content for the outside lab, and the e_ij are measurement errors. If the function f(τ) is linear, then the curve is estimated by the first principal component. If the first principal component is not self-consistent, then the function f(τ) is nonlinear and must be estimated. As in regression problems, it is important to know whether a straight-line model provides a good fit to the data. In Section 5 I propose a lack-of-fit test to assess self-consistency of the first principal component axis. I apply this lack-of-fit test to the gold assay data in Section 6.

If a principal component axis is not self-consistent, then one may want to estimate a nonlinear self-consistent curve. Principal components are easy to compute and provide a simple approximation to the original distribution. Estimating nonlinear principal curves, on the other hand, is a much more difficult problem. Hastie and Stuetzle (1989) adapted their principal curve algorithm to estimate a principal curve for the gold assay data using a locally weighted running-lines smoother. Tibshirani (1992) also examined the gold assay data using a maximum likelihood approach and applying the EM algorithm to estimate the principal curve. Interestingly, the two methods give different results with different interpretations.

The first principal component axis has a simple interpretation as the extremum of the expected minimum squared distance function in the class of straight-line approximations to a distribution. However, Duchamp and Stuetzle (1996) showed that nonlinear principal curves, although critical points of the expected minimum squared distance function, correspond to saddlepoints. If a principal component axis is self-consistent, then complications introduced by nonlinear generalizations can be avoided.

In Section 2 a large class of symmetric multivariate distributions is defined for which principal component subspaces are self-consistent. The class of elliptical distributions is characterized by self-consistency of principal axes after arbitrary linear transformations in Section 3. In Section 4 I consider the relationship between self-consistent points and self-consistent principal component subspaces, which is the basis of the lack-of-fit test introduced in Section 5.

Throughout I assume that the distributions under consideration are continuous with finite second moments. Furthermore, when I state that a principal component or a principal component axis (subspace) is self-consistent for a random vector X, I mean that the associated projection of X onto the principal component axis (subspace) is self-consistent according to the definition in (1).

2. SELF-CONSISTENCY FOR SYMMETRIC DISTRIBUTIONS

In this section a large class of symmetric multivariate distributions is defined for which the projections onto principal component subspaces are self-consistent. Recall that a multivariate random vector X is symmetric about its mean μ if

X − μ ~ −(X − μ),
where "~" means "has the same distribution as." Symmetric multivariate distributions do not necessarily have self-consistent principal components, as the next example illustrates.

Example 2. Suppose that X = (X_1, X_2)' is uniformly distributed on [−a, 0] × [−1, 0] ∪ [0, a] × [0, 1], a ≥ 1. Then X is bivariate symmetric. For a = 1, the projection of X onto the 45-degree diagonal line corresponds to a self-consistent projection onto the first principal component axis. However, for a > 1, the principal components are not self-consistent.

When considering self-consistency of principal component subspaces, a natural subset of symmetric multivariate distributions to consider is the strongly symmetric distributions. A p-variate random vector Z = (Z_1, …, Z_p)' has orthant symmetry if

(Z_1, …, Z_p)' ~ (±Z_1, …, ±Z_p)' (Efron 1969).

I say that a p-variate random vector X with E[X] = μ is strongly symmetric if there exists an orthogonal p × p matrix B such that

X ~ μ + BZ,

where Z has orthant symmetry and μ ∈ ℝ^p (Tarpey 1995). The distribution in Example 2 is symmetric but not strongly symmetric when a > 1. The class of strongly symmetric distributions contains elliptical distributions and distributions whose principal components are independent and symmetric.

The following lemma allows me to restrict attention to orthant symmetric distributions when proving self-consistency results for strongly symmetric distributions.

Lemma 1. Suppose that X = μ + BZ, where B is an orthogonal matrix, Z is a p-variate random vector, and μ ∈ ℝ^p. Y is self-consistent for X if and only if W = B'(Y − μ) is self-consistent for Z.

Proof. See the Appendix.

From Lemma 1, it may be assumed, without loss of generality, that strongly symmetric distributions are orthant symmetric when proving self-consistency results.

Next I state an elementary theorem that will be useful.

Theorem 2. Partition a p-variate orthant symmetric random vector X as X = (X_1', X_2')', where X_2 is q-variate, q < p. Let Y = f(X_2), where f is some measurable function of X_2. Then E[X_1 | Y] = 0 a.s.

Proof. See the Appendix.

If X = (X_1, …, X_p)' is orthant symmetric, then it is easy to see that E[X] = 0 and the components X_j of X are principal components of X. Thus if f is taken to be the identity function in Theorem 2, then it follows immediately that any q-dimensional principal component subspace determined by q of the components X_j of X will be self-consistent. This gives the following corollary.

Corollary 1. Suppose that X has a strongly symmetric distribution. Then the projection of X into any principal component subspace is self-consistent, provided that the eigenvalues of the covariance matrix are distinct. If the eigenvalues are not all distinct, in which case the principal components are not uniquely defined, then there exists an orthogonal matrix B so that the components of Z = B'X are self-consistent principal components.

The matrix B of Corollary 1 is the orthogonal matrix from the definition of a strongly symmetric distribution. The need to distinguish the case of equal eigenvalues of the covariance matrix in Corollary 1 can be simply illustrated by the example of the uniform distribution on the square discussed in Section 1. If X_1 and X_2 are independent and uniform on [−1, 1], then (X_1, X_2)' has the uniform distribution on a square, and X_1 and X_2 are self-consistent principal components of the bivariate distribution. However, the two eigenvalues of the covariance matrix of (X_1, X_2)' are equal, and any line through the mean is also a principal component axis. As noted in Section 1, though, only four of the lines through the mean are self-consistent.

Note that if the principal components of a mean 0 p-variate random vector X are independent, then the projection of X onto a principal component axis will be self-consistent for X regardless of whether the marginal univariate distributions are symmetric. To see this, let P be a projection matrix onto a principal component subspace of dimension q < p and let Q = I_p − P. If the principal components are independent, then PX and QX are independent and E[X | PX] = E[PX + QX | PX] = PX + QE[X] = PX + 0 = PX, which shows that PX is self-consistent for X.

3. CHARACTERIZATION OF ELLIPTICAL DISTRIBUTIONS

Recall that the uniform distributions on a circle and a square can be distinguished by the self-consistency of projections onto lines through the origin. The first result in this section shows that spherical distributions are characterized by self-consistency of any projection onto a line through the origin. The result follows from the fact (Fang, Kotz, and Ng 1990, p. 97) that X is spherically symmetric if and only if, for any perpendicular vectors α ≠ 0 and β ≠ 0,

E[α'X | β'X] = 0. (2)

Theorem 3. Suppose that X is a p-variate, mean 0 random vector with covariance matrix σ²I. Then X is spherically symmetric if and only if for every unit vector α ∈ ℝ^p, PX is self-consistent for X, where P = αα'.

Proof. See the Appendix.

Sometimes a transformation of the variables will precede a principal component analysis. For example, if variables are measured on different scales, then one can rescale the variables to unit variance before performing a principal component analysis. A natural question to ask is: Do principal components remain self-consistent after linear trans-
formations of the variables? The answer to this question is no in general, as can be seen by the following example.

Example 3. Consider the strongly symmetric bivariate random vector (X_1, X_2)' whose density is given by

f(x_1, x_2) = (2/(3π)) [(1 + (√3 x_1 + x_2)²/96)(1 + (√3 x_2 − x_1)²/96)]^(−2), (x_1, x_2)' ∈ ℝ².

This distribution results from taking two independent t random variables on 3 degrees of freedom, multiplying the first variable by two, and rotating the distribution by 30 degrees.

The contours of equal density for this distribution in the coordinate system of the principal components are shown in Figure 3(a). The contours of equal density are symmetric about the principal axes, and hence the principal component axes are self-consistent. Rescaling the variables X_1 and X_2 to unit variance to get a correlation matrix yields a distribution whose contours of equal density are shown in Figure 3(b), plotted in the coordinate system of the principal components of the transformed variables. As shown in Figure 3(b), the principal components are no longer self-consistent.

Figure 3. The Contours of Equal Density in the Coordinate System of the Principal Components (a) for the Distribution of Example 3.1 and (b) After Transforming the Distribution to Get a Correlation Matrix. Self-consistency of the principal axes is not preserved after the transformation.

For elliptical distributions, principal components will remain self-consistent after linear transformations. In fact, as the following corollary shows, the class of elliptical distributions is characterized by self-consistency of principal components under arbitrary linear transformations.

Corollary 2. Let X denote a p-variate random vector whose covariance matrix has full rank. Then X has an elliptical distribution if and only if every principal component subspace of B'X is self-consistent, where B is an arbitrary p × p matrix of full rank.

Proof. See the Appendix.

4. PRINCIPAL COMPONENTS AND PRINCIPAL POINTS

In this section I show that a self-consistent principal component axis is approximated by a set of k "self-consistent" points that lie on the principal component axis. The approximation will be utilized in the next section to approximate the hypothesis that a principal component axis is self-consistent.

Principal points are defined as the set of k points that optimally represent a distribution in terms of mean squared error. Given a p-variate random vector X, let {y_1, …, y_k} denote a set of k distinct points in ℝ^p. Define the domain of attraction for point y_j as

D_j = {x ∈ ℝ^p : ||x − y_j|| < ||x − y_i||, i ≠ j},

and let Y = y_j if X ∈ D_j. The points {y_1, …, y_k} are called k self-consistent points of X if Y is self-consistent for X. The points {y_1, …, y_k} are called k principal points (Flury 1990) if they give the minimum mean squared error E||X − Y||² compared to any other random vector supported by a set of k points. Principal points are self-consistent points (Flury 1993).

Self-consistency of a principal component axis or subspace can be assessed by the self-consistency of a set of k principal points of the marginal distribution in the principal component subspace. For Theorem 4, define P as a projection matrix onto a principal component subspace of a mean 0 random vector X. Let Y_k denote the projection of X onto k principal points for the marginal distribution of X in this principal component subspace.

Theorem 4. Let X denote a p-variate, mean 0 random vector with a bounded and continuous density function. Then the projection PX onto a principal component subspace is self-consistent for X if and only if Y_k is self-consistent for each k = 2, 3, ….

Proof. See the Appendix.

Thus the projection of a random vector onto a self-consistent principal component axis can be viewed as the limiting distribution of k collinear self-consistent points as k tends to infinity. As noted in the proof of Theorem 4, Y_k converges almost surely to PX as k tends to infinity. The asymptotic behavior of principal points as k tends to infinity has been studied for univariate distributions (see, e.g., Cambanis and Gerr 1983 or Pötzelberger and Felsenstein 1994, and, for bivariate distributions, Su 1997). It is interesting to note that a distribution cannot have an arbitrarily large number of self-consistent points that lie on a straight line that is not a principal component axis, because as k goes to infinity, a self-consistent line results. From Theorem 1, if a line is self-consistent, then it must be a principal component axis.

5. SELF-CONSISTENCY FOR SAMPLE DATA

Until now I have discussed self-consistency for probability distributions. I now turn to the problem of determining self-consistency of principal axes based on sample data. If the goal of a principal component analysis is to summarize a multivariate data set by a straight line using the first principal component axis, then it will often be of interest to the investigator whether the straight line provides a good fit, as in the gold assay example of Section 1. Similarly, in regression analysis, one needs to determine whether a
straight-line model is sufficient or a more complex model is needed. Lack of fit of the first principal component axis may indicate that a nonlinear generalization such as principal curves (Hastie and Stuetzle 1989) should be utilized (see Example 5). If the data were collected from a nonhomogeneous population (such as different species of birds; see Example 6), then lack of fit of a principal component axis may indicate that a single principal component axis does not suffice for all of the subgroups. In classical and logistic regression analysis, lack-of-fit tests are available to assess the fit of the model (see, e.g., Hosmer and Lemeshow 1989 and Montgomery and Peck 1992). In this section I propose a lack-of-fit test to assess the self-consistency of a principal component axis.
The hypothesis of self-consistency of the projection of X onto the first principal component axis can be stated as

H: PX is self-consistent for X,

where P = α_1 α_1' and α_1 is the eigenvector of the covariance matrix associated with the largest eigenvalue.

Of course, exact self-consistency will never hold for a sample, because for almost every point x_0 on a sample principal component axis there will be no points in the sample that project orthogonally onto x_0. From Theorem 4, H can be restated as
H: for all k, Y_k is self-consistent for X,

where Y_k is supported by k principal points of α_1'X along the first principal component axis. Thus I propose an approximate test of self-consistency of a principal component axis using self-consistent points and principal points. To show that a principal component axis is not self-consistent, one need only show that Y_k is not self-consistent for any given value of k. The lack-of-fit testing procedure is described for the hypothesis

H_k: Y_k is self-consistent for X.

The testing procedure requires estimating the eigenvectors of the covariance matrix and principal points. The eigenvectors of the covariance matrix are estimated in the usual way using eigenvectors from the sample covariance matrix. The k-means algorithm (Hartigan 1975) is used to estimate the principal points for the marginal principal component distributions. This is justified by the fact that cluster means from the k-means algorithm are strongly consistent estimators of principal points (Pollard 1981).

A bootstrap (Efron 1982) procedure will be used to test H_k. Let x_1, …, x_n denote a sample from a p-variate distribution. The description of the procedure is simplified if one works with the data transformed into the coordinate system of the sample principal components and centered at 0. That is, let z_i = A'(x_i − x̄), i = 1, …, n, where A is the p × p orthogonal matrix whose columns correspond to the eigenvectors of the sample covariance matrix. Choose A so that the first column is the eigenvector associated with the largest eigenvalue. Thus the first component of z_i corresponds to the first principal component. Partition each observation as z_i = (z_i1, z_i2')', where z_i2 is a (p − 1)-vector of the second through pth principal components. To test H_k, perform the following steps.

Lack-of-Fit Test

1. Estimate principal points. Run the k-means algorithm to estimate k principal points for the univariate data z_11, z_21, …, z_n1 of the first principal component. Denote the k cluster means from the algorithm by ξ̂_1, …, ξ̂_k.

2. Partition the data into k slices D_j, j = 1, …, k, where

D_j = {z_i : ||ξ̂_j − z_i1|| < ||ξ̂_l − z_i1||, l ≠ j}.

3. Compute means over each slice. Compute the means μ̂_j of the z_i2 over each of the k slices:

μ̂_j = Σ_{z_i ∈ D_j} z_i2 / n_j,

where n_j is the number of observations in D_j.

4. Compute the test statistic. If the first principal component is self-consistent, then by Theorem 4, E[μ̂_j] = 0, j = 1, …, k. Thus the test statistic is based on the squared standard distance of each μ̂_j from 0:

T² = Σ_{j=1}^{k} n_j μ̂_j' S_j^(−1) μ̂_j,

where S_j is the sample covariance matrix computed from the z_i2 in the jth slice D_j.

5. Perform centering. To ensure that the resampling of step 6 is done in a way that reflects H_k (Hall and Wilson 1991), the data are centered over each of the k slices:

z*_i2 = z_i2 − μ̂_j if z_i ∈ D_j, i = 1, …, n.

Now transform the "centered" data back into the original coordinate system: x*_i = Az*_i + x̄.

6. Obtain bootstrap samples. Sample n observations with replacement from x*_i, i = 1, …, n, obtaining a bootstrap sample. Compute the test statistic, T*², for the resampled data by repeating steps 1-4 for the bootstrap sample. Repeat a large number N of times.

7. Determine the p-value. The p-value for this test is the proportion of bootstrap samples that yield a test statistic T*² larger than T².

After the data have been centered in each of the slices D_j in step 5, any difference between the cluster means μ̂_j and 0 in each bootstrap sample will be due approximately to natural variation and not to systematic departures from self-consistency of the first principal component axis. The foregoing procedure takes into account the variability of estimating the first principal component axis as well.

The lack-of-fit test just described is similar in spirit to nonparametric smoothing lack-of-fit tests in the regression literature. Hart (1997, p. 145) described a lack-of-fit test based on a linear smooth function ĝ of the residuals from regression. If the null hypothesis in the regression framework is true, then the residuals should behave as a random
principal component 1
principal component 1
(a)
(b)
1I1II1
."
....... r"·. .. ~ ·1····
1 I 1I 1 II I I1 1 1I I II I I1 I."I .-1 •••••• ·1I .' i· ", - T·.r" -.: .:' 1 ":' {'"I· ....:. I .. " I :'J( .
• rr« > • '",,': A·Q '. --:.'.""'":.1.- .. , ·1" • ·.1 ,.,'.
" -I" .:··.··r .' I· ...• I .... I····:· ·...1·
~I '1· t , -:. 1
1 I'"' 1I I I1 1 1I I I
1I1IIII1I
', I . : ...., . ::··A I" . I~'
j-<;.'. : •••• :r-::' . 'A': . ~:'.'-J ".'-/ • '-! .. \".;/
I .,. I -, • '., ."1 .: 1Y. ..,' .' I'"
~I'- .' . :1 .... : .....\I" :" .• '1' :. -:. I. ,I .ea- .1 I1 1 11 1 1I I I
'.m..-. "
'.
'.
c::Q)
c::oa.Eou
ca.~~ t-------.......,~~-t-~~J--;t-'-1'~_+___t_h_-_t
a.
N
Figure 4. (a) A Scatterplot of the Simulated Data of Example 4 in the Coordinate System of the Sample Principal Components and (b) a Bootstrap Sample Obtained From the Data After Centering in Each Slice. In (a), the solid horizontal line is the first principal component, and the open circles are k-means estimates of four principal points along the first principal component axis. The dashed vertical lines represent the boundaries of the slices, and the triangles are the corresponding means of each slice. The discrepancy between the circles and triangles in (a) is due to the lack of self-consistency on the first principal component axis.

scatter about 0, and the estimated function g should be relatively flat. In the current setting, principal components 2 through p are the multivariate analog of the residuals, and instead of a smooth function g, a piecewise constant function determined by the cluster means in step 3 is estimated. The test statistic of step 4 is a multivariate analog of the univariate lack-of-fit test statistic based on a linear smooth for the residuals from regression.

Note that the lack-of-fit test will likely fail if the largest eigenvalue of the sample covariance matrix is statistically indistinguishable from the second largest eigenvalue, in which case the estimated first principal component axis may vary widely between bootstrap samples. (For a formal test of "sphericity," that is, equality of eigenvalues of the sample covariance matrix, see Flury 1988, p. 19; Mardia, Kent, and Bibby 1979, p. 235.)

The following example illustrates the bootstrap lack-of-fit test.

Example 4. I simulate a sample of 200 bivariate observations for which a strong nonlinear trend exists between the components. Let X1 ~ uniform[0, 1] and X2 = log(X1) + e, where e ~ uniform[-.3, .3]. This is a model considered by Li (1997) as an example of nonlinear confounding between regressors. Figure 4(a) shows a scatterplot of the data in the coordinate system of the principal components. Four principal points are estimated by the k-means algorithm for the first principal component and are represented by the four circles on the horizontal axis. The four points partition the plane into four slices (whose boundaries are given by the vertical dashed lines), and the mean of each slice is denoted by the triangles. The fact that the triangles and the circles differ considerably highlights the fact that the first principal component axis is not self-consistent. The T^2 statistic for these data is T^2 = 129.19. Next, the data in each slice are centered, and bootstrap samples are taken. Figure 4(b) shows one such bootstrap sample along with k = 4 principal points estimated from the bootstrap sample, denoted again by the circles. Likewise, the triangles in Figure 4(b) correspond to the means over each slice. For the bootstrap sample, the circles and triangles nearly overlap, as is to be expected, and the test statistic for this bootstrap sample is T*^2 = 3.95. The p value based on 1,000 bootstrap samples is 0, providing very strong evidence against self-consistency of the first principal component axis, as is readily evident from the plot.

Further simulations were run for the previous example, but with the dimension of the data increased to p = 4 by adding two independent mean 0 normal variates, each with an equal standard deviation σ. Running simulations for standard deviations σ = .2, .3, .4, .5 yielded p values of .004, .013, .044, and .948 using k = 4 and 1,000 bootstrap samples. For σ = .1, a scatterplot (not shown here) of the first two principal components shows a strong nonlinear pattern. For .2 ≤ σ ≤ .4, a nonlinear pattern is not noticeable in a scatterplot of the first two principal components, but the lack-of-fit test nonetheless detects the nonlinear structure. For σ = .5, the data become almost spherical in the first two principal components (the two largest eigenvalues of the sample covariance matrix are nearly equal), and the bootstrap procedure is unable to detect the nonlinearity in the data.

A simulation study was conducted to evaluate the performance of the lack-of-fit test when the null hypothesis is true using the multivariate normal, multivariate t, and uniform distributions on a rectangle. The simulations indicate that the lack-of-fit test is conservative, and in each case the distribution of p values was skewed to the left. In each scenario, the largest value of k leads to smaller p values on average, except for the uniform distribution on a rectangle. The test became more conservative as the dimension of the distribution increased and was most conservative for heavier-tailed distributions.
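As a rough illustration, the bootstrap procedure applied to data simulated as in Example 4 can be sketched as follows. The paper's exact T^2 statistic is defined earlier in Section 5 (not reproduced in this excerpt), so the statistic below, a slice-size-weighted sum of squared slice means of the components orthogonal to the first principal component, is only a stand-in with the same flavor; all function names are mine, not the paper's.

```python
import numpy as np

def kmeans_1d(s, k, iters=100):
    # Lloyd's algorithm on the first-PC scores; centers start at equally
    # spaced quantiles, and slice boundaries are midpoints between centers.
    c = np.quantile(s, (np.arange(k) + 0.5) / k)
    lab = np.zeros(len(s), dtype=int)
    for _ in range(iters):
        lab = np.searchsorted((c[:-1] + c[1:]) / 2, s)
        c_new = np.array([s[lab == j].mean() if np.any(lab == j) else c[j]
                          for j in range(k)])
        if np.allclose(c_new, c):
            break
        c = c_new
    return lab

def lof_stat(X, k):
    # Stand-in lack-of-fit statistic: large when the slice means stray from
    # the first principal component axis.
    Xc = X - X.mean(axis=0)
    vals, vecs = np.linalg.eigh(np.cov(Xc.T))
    Z = Xc @ vecs[:, np.argsort(vals)[::-1]]   # PC coordinates; Z[:, 0] = PC1
    lab = kmeans_1d(Z[:, 0], k)
    t = 0.0
    for j in range(k):
        Rj = Z[lab == j, 1:]                   # orthogonal components, slice j
        if len(Rj):
            m = Rj.mean(axis=0)
            t += len(Rj) * (m @ m)
    return t / Z[:, 1:].var(), lab, Z

def bootstrap_pvalue(X, k=4, B=199, seed=1):
    t_obs, lab, Z = lof_stat(X, k)
    # Center the orthogonal components within each slice, so the resampled
    # data satisfy the null hypothesis of self-consistency.
    Z0 = Z.copy()
    for j in range(k):
        if np.any(lab == j):
            Z0[lab == j, 1:] -= Z0[lab == j, 1:].mean(axis=0)
    # Resample rows and recompute the statistic from scratch each time.
    rng = np.random.default_rng(seed)
    count = sum(lof_stat(Z0[rng.integers(0, len(Z0), len(Z0))], k)[0] >= t_obs
                for _ in range(B))
    return (count + 1) / (B + 1), t_obs

# Example 4-style data: X1 ~ uniform[0, 1], X2 = log(X1) + e, e ~ uniform[-.3, .3]
rng = np.random.default_rng(0)
x1 = rng.uniform(size=200)
X = np.column_stack([x1, np.log(x1) + rng.uniform(-0.3, 0.3, size=200)])
p, t_obs = bootstrap_pvalue(X, k=4)
print(p, round(t_obs, 1))   # small p: strong evidence against self-consistency
```

The p value is computed as (count + 1)/(B + 1), the usual finite-sample bootstrap convention, rather than the raw proportion reported as 0 in the text.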
A natural question that arises when applying the lack-of-fit test is: What value of k should one use? That is, how many principal points should be estimated along the first principal component axis? Large values of k improve the approximation of H_k to H. But if k is too large, then some of the slices D_j will contain too few points to provide a stable estimate of the cluster means. My experience with simulations and real data indicates that strong statistical evidence of a lack of self-consistency of the first principal component axis can often be determined with k as small as 3 or 4. One very important point is that the lack-of-fit test described here will usually fail to reject H_k for k = 2 even if the data exhibit a very strong nonlinear pattern. Larger values of k are restricted by sample size and dimensionality. In the examples discussed here, I required each slice to contain at least 2p points, where p is the dimension of the data.
In the simulation study, the test statistic and its associated p value were computed using k from 3 to 5 or 6 for each dataset. Generating multiple p values typically leads to multiplicity concerns if a decision is to be made. However, the results of the simulation study reveal that basing a decision according to the smallest of the p values still yields a conservative test in dimension p > 2. As a descriptive tool, computing the test statistic for different values of k can be useful. For example, if the departure from self-consistency is subtle, then small values of k may not detect a problem with self-consistency, whereas larger values of k will indicate a lack of self-consistency (see Example 6).

As a practical guideline for making decisions, it appears relatively safe to make a decision based on the smallest p value when the dimension is greater than two, due to the conservative nature of the test. For two-dimensional data (or if multiplicity remains a concern for higher-dimensional data), I recommend basing decisions on the p value from the largest value of k used for computing the test statistic, because larger values of k better approximate the self-consistency hypothesis.
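This decision guideline can be captured in a small helper; the function name and dict interface are my own illustration, not from the paper.

```python
def choose_pvalue(pvals_by_k, dim):
    """Pick the reported p value from a set computed over several k.

    pvals_by_k: dict mapping k (number of principal points) -> bootstrap p value.
    dim: dimension p of the data.

    In dimension p > 2 the test is conservative enough that the smallest
    p value may be used; for two-dimensional data, use the p value from
    the largest k, since larger k better approximates self-consistency.
    """
    if dim > 2:
        return min(pvals_by_k.values())
    return pvals_by_k[max(pvals_by_k)]

# e.g., gold-assay-style results over k = 3..6 (values from Table 1)
pvals = {3: .000, 4: .003, 5: .017, 6: .003}
print(choose_pvalue(pvals, dim=2))   # p value at the largest k, here k = 6
```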
Determining the largest value of k for computing the lack-of-fit test statistic can be aided by the use of the univariate normal distribution, since projections of high-dimensional data tend to be approximately normal (Diaconis and Freedman 1984). To illustrate, consider using k = 6 slices. The proportions for each of the six slices for a univariate normal distribution as determined by k = 6 principal points are 7.4%, 18.1%, 24.5%, 24.5%, 18.1%, and 7.4% (see, e.g., Cox 1957, table 1). As a rough guideline, the slice with the smallest cardinality should contain approximately 7.4% of the sample size n. Thus to use k = 6 for the lack-of-fit test, one can check whether 7.4% of the sample size exceeds 2p, where p is the dimension of the data.
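These slice proportions can be reproduced numerically. The sketch below runs Lloyd's algorithm on a dense discretization of the standard normal density to approximate the k = 6 principal points and their slice probabilities, and then applies the 2p rule of thumb; the helper names are mine.

```python
import numpy as np

def normal_principal_points(k, iters=500):
    # Approximate the k principal points of N(0, 1) and the probabilities of
    # their domains of attraction via Lloyd's algorithm on a dense grid.
    x = np.linspace(-8.0, 8.0, 20_001)
    w = np.exp(-x**2 / 2)
    w /= w.sum()                               # discretized normal density
    c = np.linspace(-1.5, 1.5, k)              # symmetric starting centers
    lab = np.zeros(len(x), dtype=int)
    for _ in range(iters):
        lab = np.searchsorted((c[:-1] + c[1:]) / 2, x)
        c_new = np.array([np.average(x[lab == j], weights=w[lab == j])
                          for j in range(k)])
        if np.allclose(c_new, c, atol=1e-12):
            break
        c = c_new
    probs = np.array([w[lab == j].sum() for j in range(k)])
    return c, probs

centers, probs = normal_principal_points(6)
print(np.round(probs, 3))   # roughly [0.074, 0.181, 0.245, 0.245, 0.181, 0.074]

def can_use_k(n, p, min_prop):
    # Rule of thumb: the smallest slice should still contain >= 2p points.
    return min_prop * n >= 2 * p

print(can_use_k(n=200, p=2, min_prop=probs.min()))   # True for n = 200, p = 2
```

Because the normal density is log-concave, Lloyd's algorithm started from any reasonable symmetric configuration converges to the unique set of k self-consistent points, which are the principal points.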
One final comment regards the use of the k-means algorithm for the lack-of-fit test. By construction, the k-means algorithm converges to a set of k self-consistent points for the empirical distribution, and there may exist several sets of k self-consistent points. But if the underlying distribution has a log-concave density (as does the univariate normal distribution), then there exists a unique set of k self-consistent points, which must be the k principal points of the distribution (Kieffer 1982; Li and Flury 1995; Trushkin 1982).
6. EXAMPLES
I now apply the lack-of-fit test of Section 5 to real datasets. The examples illustrate the importance of assessing self-consistency of a principal component axis. In the first example, I apply the lack-of-fit test to assess whether the straight-line model provided by the first principal component axis is appropriate or whether the nonlinear generalization given by principal curves should be used. In each application of the lack-of-fit test, 1,000 bootstrap samples were obtained.
Example 5: Gold Assay (Example 1, Continued). Using the bootstrap lack-of-fit test of Section 5 on the gold assay data encountered in Example 1 yields the results given in Table 1.

The evidence is fairly strong against self-consistency of the first principal component axis. The departure from self-consistency can be seen in Figure 5. Figure 5(a) shows a scatterplot of the gold assay data. The solid line is the first principal component axis, the circles represent k = 6 principal point estimates along the first principal component axis, and the triangles are the corresponding means over each of the six slices. Figure 5(a) tends to support the conclusion reached by Hastie and Stuetzle that the outside assay tends to be higher than the in-house assay in the middle range, but the difference is reversed at lower levels; see Hastie and Stuetzle's Fig. 12(b). In Figure 5(b), 25 bootstrap samples from the raw data (without centering in each slice) were obtained. For each bootstrap sample, the conditional means over each of the six slices were joined by line segments and plotted. The connected line segments of Figure 5(b) can be thought of as providing a rough first approximation to the principal curve. Because Figure 5(b) is computed from the raw data instead of the slice-centered data, one can see the systematic departure from the first principal component axis (the thick straight line) in the joined bootstrap line segments, which indicates why the lack-of-fit test rejects the hypothesis of self-consistency for the first principal component axis.
Table 1. Gold Assay Data

k    T^2      p value
3    16.43    .000
4    18.5     .003
5    19.34    .017
6    28.20    .003
Figure 5. Gold Assay Data: (a) A Scatterplot of the Outside Lab Assay and the In-House Assay Along With the First Principal Component Axis (Solid Line) and (b) Results of 25 Bootstrap Samples From the Raw Data. In (a), the circles represent estimated principal points along the first principal component axis and the triangles are the corresponding means over each slice. In (b), the thick straight line is the first principal component axis. For each bootstrap sample, the means over each of the six slices are connected by line segments and joined together. Part (b) shows a systematic departure from self-consistency of the first principal component axis because the line segments lie mostly above the first principal axis in the middle and below the principal axis on the left.

It is well known that principal components are not scale invariant. A common problem in principal component analysis is how to choose an appropriate transformation if variables are measured on different scales. In regression, if a nonlinear pattern exists, a common strategy is to find appropriate transformations of the regressor(s) and/or response so that a simple straight-line model will suffice. In principal component analysis, if a simple transformation exists that yields a self-consistent principal component axis, then estimating a principal curve may be avoided. The next example looks at the effects of transforming the variables on self-consistency.

Example 6: Dimensions of Birds. Figure 6 shows a scatterplot of the wing area (cm^2) and wing spread (cm) for 232 different species of birds (Greenwalt 1962). Clearly, the first principal axis will not be self-consistent for this dataset (T^2 = 117.6 and p = 0 for k = 4).

Figure 6. Bird Data: Scatterplot of Wing Length (cm) Versus Wing Area (cm^2). The first principal component axis is not self-consistent due to the strong nonlinear pattern.

Taking the square root of wing area so that both variables are in units of centimeters, we get T^2 = 10.49 and a p value of .439 using k = 4. Thus the square root transformation of wing area gives a first principal component axis consistent with the hypothesis of self-consistency. (In fact, the p values for the test using k = 5, ..., 10 all exceeded .5.) Incidentally, the Box-Cox transformation, which corresponds to raising the wing area variable approximately to the .4 power, yielded a p value of .206 when k = 4. Figure 7(a) shows a scatterplot of the square root-transformed bird data along with the first principal component axis. As in Figure 5, the circles correspond to k = 6 estimated principal points along the first principal component axis, and the triangles are the corresponding means over each of the six slices. The circles and triangles overlap considerably (the p value for k = 6 is .669). The first principal component accounts for about 99.5% of the total variability for the transformed data. Thus, even if the first principal component were not self-consistent, it would be hard to distinguish the circles and triangles. Therefore, in Figure 7(b) the plotting is done in the coordinate system of the principal components, where the scale is chosen so that one can see the variability of the second principal component. For Figure 7(b), 25 bootstrap samples were obtained from the raw data [as in Fig. 5(b)], and the means of each slice were connected by line segments. It is difficult to distinguish any systematic departure from self-consistency, except that for the fifth point from the left, most of the line segments lie below the first principal component axis. Figure 7(b) also shows the dramatic increase in variability of the second principal component as one goes from left to right along the first principal component axis.

Figure 7. Bird Data: Square Root Transformation. (a) A scatterplot of wing length versus square root of wing area. The solid line is the first principal component. The circles represent k = 6 estimated principal points along the first principal component axis and the triangles are the corresponding means. (b) Results of taking 25 bootstrap samples from the square root transformed data. For each bootstrap sample, the means over each of the six slices are computed, connected by line segments, and plotted in the coordinate system of the principal components. The departure from self-consistency of the first principal component axis (horizontal line) appears slight, because the line segments appear to cluster about the first principal component axis.

In a study of allometric growth, it is natural to take a logarithm transformation of each variable. A model of allometric growth assumes that the wing area and wing length increase at a constant relative rate as you go from smaller to larger birds. One can interpret the coefficients of the first eigenvector of the covariance matrix of the log-transformed data as constants of allometric growth (Jolicoeur 1963). The allometric model rests on the assumption that the first principal component is then self-consistent. Taking the logarithm of wing area and wing spread gives T^2 = 33.545 and an associated p value of 0 using k = 4. However, a more reasonable transformation is to take the logarithm of spread and one-half the logarithm of area, because area is in units of cm^2 (see Hills 1982), which then gives T^2 = 11.51 and a p value of .197, indicating a better fit. It is interesting to note that if the log-transformed data were from an elliptical distribution, then according to Corollary 2, the linear transformation from log-log to .5 × log-log should not affect the self-consistency of the first principal component axis.

Running the lack-of-fit test with k = 6 for the transformation .5 × log(wing area) and log(wing length) yields T^2 = 50.5 with a p value of .004. Thus using only k = 4 points is insufficient for determining that the first principal component apparently is not self-consistent. Figure 8(a) shows a scatterplot of the transformed bird data along with the first principal component axis. In Figure 8(b), 25 bootstrap samples were again generated from the raw log-transformed data [as in Fig. 7(b)], and the means over each of the slices are plotted and connected by line segments in the coordinate system of the principal components. The bootstrap line segments show a very strong systematic departure from the horizontal line, indicating a lack of self-consistency of the first principal component. The implication is that a single allometric model does not seem to hold for all species of birds. Perhaps this is due to the fact that wings accommodate many different modes of flight among birds: horizontal flight, zigzagging, climbing, diving, bounding, soaring, and hovering (Lighthill 1975). In particular, note that in Figure 8(a) the points corresponding to species of hummingbirds have been plotted with circles as opposed to points. Each of these circles lies on or above the first principal component axis. The flight of hummingbirds is characterized by hovering, which is considerably different from other types of winged flight. Thus the conclusions of the lack-of-fit test indicate that perhaps constants for allometric growth will not be the same for different types of flight.

7. DISCUSSION

I have discussed the situation of a distribution centered about a principal component axis. The mean is the zero-dimensional center of gravity for a distribution. If a distribution is centered about a higher-dimensional linear subspace, then this subspace must be a principal component
subspace. As the examples in Sections 5 and 6 illustrate, however, the one-dimensional principal component axes are not always self-consistent.

Figure 8. Bird Data: Log Transformed. (a) A scatterplot of log(wing length) versus .5 log(wing area). The solid line is the first principal component axis. The observations plotted with circles in the lower left correspond to species of hummingbirds. (b) Results of 25 bootstrap samples plotted in the coordinate system of the principal components. The means over each of the six slices are joined by line segments. The lack of self-consistency of the first principal component axis is highlighted by the systematic variation of the line segments about the horizontal line.

The class of elliptical distributions lends itself naturally to principal component analysis. The contours of equal density are ellipsoids whose axes correspond to self-consistent principal component axes. The class of elliptical distributions is characterized by the fact that after arbitrary linear transformations, such as transformations to a correlation matrix, the principal component axes remain self-consistent. Analogously, ellipsoids remain ellipsoids after arbitrary linear transformations.

The bootstrap lack-of-fit test described in Section 5 provides a relatively simple procedure for assessing self-consistency of principal axes from sample data and appears to work well in the examples that I have investigated. One concern with using the k-means algorithm for the lack-of-fit test is that the algorithm may produce different clusters for the first principal component, depending on which initial starting values are used. Clusters may be formed containing very few observations, in which case the sample covariance matrix for these clusters will be singular or nearly singular. One may consider using different versions of the k-means algorithm (Lloyd 1982) or using parametric or semiparametric estimators of principal points instead of the nonparametric k-means algorithm (Tarpey 1997).

The methodology and examples discussed here deal only with self-consistency of the first principal axis. The procedure can be easily adapted to assess self-consistency of higher-dimensional principal component subspaces, or of principal component axes besides the first principal component axis. In fact, Theorem 4, on which the inference procedure is based, was stated for arbitrary dimensions. Determining the principal points for the lack-of-fit test for higher-dimensional subspaces is complicated by the fact that typically several different sets of k self-consistent points exist for principal component subspaces of dimension two or more (Tarpey 1998). However, the k-means algorithm converges by construction to a set of k self-consistent points of the empirical distribution, and the methodology described here works for principal points and self-consistent points.

APPENDIX: PROOFS

Proof of Lemma 1

Suppose that Y is self-consistent for X; then

E[Z | W = w] = E[B'(X − μ) | B'(Y − μ) = w]
             = B'E[X | Y = Bw + μ] − B'μ
             = B'(Bw + μ) − B'μ   a.s., by self-consistency
             = w.

Conversely, if W is self-consistent for Z, then E[X | Y = y] = μ + B E[Z | W = B'(y − μ)] = μ + B(B'(y − μ)) = y.

Proof of Theorem 2

Because (X1', X2')' ~ (−X1', X2')' by orthant symmetry,

E[X1 | Y] ~ −E[X1 | Y].

For any set A measurable with respect to Y such that P(Y ∈ A) > 0,

E[X1 I_A(Y)] / P(Y ∈ A) = E[X1 | Y ∈ A]   (by definition)
                        = E[−X1 | Y ∈ A]   (by orthant symmetry)
                        = E[−X1 I_A(Y)] / P(Y ∈ A)   (by definition).

(I_A(·) denotes the indicator function for the set A.) Therefore, E[X1 I_A(Y)] = E[−X1 I_A(Y)], which means that E[X1 | Y ∈ A] = 0. Because this holds for any non-zero probability set A measurable with respect to Y, it follows that E[X1 | Y] = 0 a.s.

Proof of Theorem 3

Suppose that X has a spherical distribution. Then, for any pair of orthogonal unit vectors α and β in R^p, from (2), E[β'X | α'X] = 0. Choose a p × (p − 1) matrix A2 = [α1 : ... : α_{p−1}] such that A = [α : A2] is orthogonal. Let P = αα' and Q = A2A2'. Then

E[X | PX] = E[PX + QX | PX]
          = PX + E[QX | PX]
          = PX + A2 E[A2'X | α'X]
          = PX + A2 (E[α1'X | α'X], ..., E[α_{p−1}'X | α'X])'
          = PX + (0, ..., 0)'   by (2)
          = PX.

Therefore, PX is self-consistent for X.

Conversely, suppose that every principal component of X is self-consistent. That is, for every unit vector α ∈ R^p, E[X | PX] = PX, where P = αα'. Choose any unit vector β that is perpendicular to α. Multiplying both sides of the equation E[X | PX] = PX by β' gives E[β'X | α'X] = 0, establishing (2), which is equivalent to X being spherically symmetric.

Proof of Corollary 2

Without loss of generality, assume E[X] = 0. If X has an elliptical distribution, then B'X also has an elliptical distribution (see theorem 2.16 of Fang, Kotz, and Ng 1990, p. 43). Therefore any principal component subspace of B'X is self-consistent by Corollary 1.

Conversely, suppose that for any p × p matrix B of full rank, the principal component subspaces of B'X are self-consistent. Let Σ denote the covariance matrix of X, and let Σ = ADA' be the spectral decomposition of Σ, where A is a p × p orthogonal matrix and D is a diagonal matrix. Set B = AD^{−1/2}. Then the covariance matrix of Z = B'X is the identity matrix. By the hypothesis, any principal component of Z is self-consistent. Thus for any unit vector α ∈ R^p, setting P = αα' gives that PZ is self-consistent for Z. This implies that Z has a spherical distribution by Theorem 3, which in turn implies that X has an elliptical distribution.

Proof of Theorem 4

Let α1, ..., αq denote q < p eigenvectors of the covariance matrix of X. Define A1 = [α1 : ... : αq] and let P = A1A1', the projection matrix onto the principal component subspace of dimension q < p. Let ξ1, ..., ξk denote a set of k principal points of A1'X. The projection of X onto the k principal points for the marginal distribution of X in this principal component subspace is given by

Y_k = Σ_{j=1}^{k} A1 ξ_j I_{D_j}(X),

where D_j ∈ R^p is the domain of attraction of A1 ξ_j, j = 1, ..., k. By Lemma 1 it may be assumed, without loss of generality, that cov(X) = diag(σ1^2, ..., σp^2), σ1 ≥ σj, j ≠ 1. That is, α1, ..., αq can be taken to be the first q basis vectors.

First, consider the case where PX is self-consistent for X. Let X = (X1, ..., Xp)'; then X_I := A1'X = (X1, ..., Xq)'. Let D_j ⊂ R^q be the domain of attraction of ξ_j, j = 1, ..., k. Then E[X | Y = A1 ξ_j] = E[X | X_I ∈ D_j]. By self-consistency of the principal points for the marginal distribution of X_I, E[X_I | X_I ∈ D_j] = ξ_j. It needs to be shown that E[X_i | X_I ∈ D_j] = 0, i = q + 1, ..., p. This is equivalent to showing that

E[X_i I_{D_j}(X_I)] = ∫···∫_{D_j} ∫_{−∞}^{∞} x_i f(x_I, x_i) dx_i dx_I = 0,

where f is the marginal joint density of (X_I, X_i), i > q. Because PX = (X_I', 0')' is self-consistent for X, E[X_i | X_I = x] = 0 for almost every x. Thus the integral can be written

∫···∫_{D_j} E[X_i | X_I = x] f_I(x) dx = 0,

where f_I is the marginal density of X_I, and thus Y_k is self-consistent for X.

On the other hand, suppose that Y_k is self-consistent for X for every k. If ξ1, ..., ξk represent k principal points of A1'X, then the random vector Σ_{j=1}^{k} ξ_j I_{D_j}(X) converges almost surely to A1'X (see, e.g., Parna and Lember 1997). Thus Y_k converges almost surely to PX. It needs to be shown that E[X_i | X_I] = 0 for i = q + 1, ..., p. Let (Ω, F, P) denote the probability space on which X is defined. Because Y_k converges almost surely to PX, let X_I(ω) = x_I^0 be a point for which the marginal density of X_I is positive and A1'Y_k(ω) → x_I^0. For each k, let D_{jk}, j = 1, ..., k, denote the domains of attraction of the k principal points of X_I. The D_{jk} form a partition of the support of X_I, and thus there exists a sequence D_{j_k k} such that x_I^0 ∈ D_{j_k k} for all k. Now, because Y_k is self-consistent for X, it follows that E[X_i | Y_k] = 0 for all k, which means that E[X_i | X_I ∈ D_{jk}] = 0, for j = 1, ..., k. Applying this last equality to the D_{j_k k} gives

∫···∫_{D_{j_k k}} ∫_{−∞}^{∞} x_i f(x_I, x_i) dx_i dx_I / π_{j_k k} = 0,

where π_{j_k k} = P(X_I ∈ D_{j_k k}). Applying the mean value theorem,

vol(D_{j_k k}) ∫_{−∞}^{∞} x_i f(x_{Ik}, x_i) dx_i / π_{j_k k} = 0

for some x_{Ik} ∈ D_{j_k k}. Because x_I^0 ∈ D_{j_k k} for all k, x_{Ik} converges to x_I^0 as k goes to infinity. Furthermore, vol(D_{j_k k}) / π_{j_k k} converges to 1 / f_I(x_I^0) as k tends to infinity. Therefore,

∫···∫_{D_{j_k k}} ∫_{−∞}^{∞} x_i f(x_I, x_i) dx_i dx_I / π_{j_k k} → ∫_{−∞}^{∞} x_i f(x_i | x_I^0) dx_i,

but the left-hand side is identically 0. Thus E[X_i | X_I] = 0, which gives that PX is self-consistent for X, and the theorem is proved.
[Received March 1997. Revised August 1998.]
REFERENCES

Cambanis, S., and Gerr, N. (1983), "A Simple Class of Asymptotically Optimal Quantizers," IEEE Transactions on Information Theory, IT-29, 664-676.
Cox, D. R. (1957), "Note on Grouping," Journal of the American Statistical Association, 52, 543-547.
Diaconis, P., and Freedman, D. (1984), "Asymptotics of Graphical Projection Pursuit," The Annals of Statistics, 12, 793-815.
Duchamp, T., and Stuetzle, W. (1996), "Extremal Properties of Principal Curves in the Plane," The Annals of Statistics, 24, 1511-1520.
Efron, B. (1969), "Student's t-Test Under Symmetry Conditions," Journal of the American Statistical Association, 64, 1278-1302.
--- (1982), The Jackknife, the Bootstrap, and Other Resampling Plans, Philadelphia, PA: Society for Industrial and Applied Mathematics.
Fang, K., Kotz, S., and Ng, K. (1990), Symmetric Multivariate and Related Distributions, New York: Chapman and Hall.
Flury, B. (1988), Common Principal Components and Related Multivariate Models, New York: Wiley.
--- (1990), "Principal Points," Biometrika, 77, 33-41.
--- (1993), "Estimation of Principal Points," Applied Statistics, 42, 139-151.
Greenwalt, C. (1962), "Dimensional Relationships for Flying Animals," Smithsonian Miscellaneous Collections, 144(2), 1-46.
Hall, P., and Wilson, S. R. (1991), "Two Guidelines for Bootstrap Hypothesis Testing," Biometrics, 47, 757-762.
Hart, J. D. (1997), Nonparametric Smoothing and Lack-of-Fit Tests, New York: Springer.
Hartigan, J. A. (1975), Clustering Algorithms, New York: Wiley.
Hastie, T., and Stuetzle, W. (1989), "Principal Curves," Journal of the American Statistical Association, 84, 502-516.
Hills, M. (1982), "Allometry," in Encyclopedia of Statistical Sciences, eds. S. Kotz and N. L. Johnson, New York: Wiley, pp. 48-54.
Hosmer, D., and Lemeshow, S. (1989), Applied Logistic Regression, New York: Wiley.
Jolicoeur, P. (1963), "The Multivariate Generalization of the Allometry Equation," Biometrics, 19, 497-499.
Kieffer, J. (1982), "Exponential Rate of Convergence for Lloyd's Method I," IEEE Transactions on Information Theory, Special Issue on Quantization, IT-28, 205-210.
Li, K.-C. (1997), "Nonlinear Confounding in High-Dimensional Regression," The Annals of Statistics, 25, 577-612.
Li, L., and Flury, B. (1995), "Uniqueness of Principal Points for Univariate Distributions," Statistics & Probability Letters, 25, 323-327.
Lighthill, J. (1975), Mathematical Biofluiddynamics, Philadelphia, PA: Society for Industrial and Applied Mathematics.
Lloyd, S. P. (1982), "Least Squares Quantization in PCM," IEEE Transactions on Information Theory, Special Issue on Quantization, IT-28, 129-137.
Mardia, K., Kent, J., and Bibby, J. (1979), Multivariate Analysis, New York: Academic Press.
Montgomery, D., and Peck, E. (1992), Introduction to Linear Regression Analysis, New York: Wiley.
Parna, K., and Lember, J. (1997), "On Some Properties of k-Variance," unpublished manuscript.
Pearson, K. (1901), "On Lines and Planes of Closest Fit to Systems of Points in Space," Philosophical Magazine, 2, 559-572.
Potzelberger, K., and Felsenstein, K. (1994), "An Asymptotic Result on Principal Points for Univariate Distributions," Optimization, 28, 397-406.
Pollard, D. (1981), "Strong Consistency of K-Means Clustering," The Annals of Statistics, 9, 135-140.
Su, Y. (1997), "On Asymptotics of Quantizers in Two Dimensions," Journal of Multivariate Analysis, 61, 67-85.
Tarpey, T. (1995), "Principal Points and Self-Consistent Points of Symmetric Multivariate Distributions," Journal of Multivariate Analysis, 52, 39-51.
--- (1997), "Estimating Principal Points of Univariate Distributions," Journal of Applied Statistics, 24, 499-512.
--- (1998), "Self-Consistent Patterns for Symmetric Multivariate Distributions," Journal of Classification, 15, 57-79.
Tarpey, T., and Flury, B. (1996), "Self-Consistency: A Fundamental Concept in Statistics," Statistical Science, 11, 229-243.
Tarpey, T., Li, L., and Flury, B. (1995), "Principal Points and Self-Consistent Points of Elliptical Distributions," The Annals of Statistics, 23, 103-112.
Trushkin, A. (1982), "Sufficient Conditions for Uniqueness of a Locally Optimal Quantizer," IEEE Transactions on Information Theory, Special Issue on Quantization, IT-28, 187-198.
Tibshirani, R. (1992), "Principal Curves Revisited," Statistics and Computing, 2, 183-190.