10.1.1.142.8994 PRINCIPAL CURVES



PRINCIPAL CURVES

Trevor Hastie
AT&T Bell Laboratories
Murray Hill, New Jersey 07974

and

Werner Stuetzle
Department of Statistics
University of Washington
Seattle, 98195

AUTHORS' FOOTNOTE: Trevor Hastie is a Member of the Technical Staff at AT&T Bell Laboratories, Murray Hill, New Jersey 07974. Werner Stuetzle is Associate Professor, Department of Statistics, University of Washington, Seattle, 98195. This work was developed for the most part at Stanford University with partial support from the Department of Energy under contracts DE-AC03-76SF and DE-AT03-81-ER10843, the Office of Naval Research contract ONR N00014-81-K-0340, and the U.S. Army Research

We thank Andreas Buja, Tom Duchamp, Iain Johnstone, and Larry Shepp for their theoretical support; Robert Tibshirani, Brad Efron and Jerry Friedman for many helpful discussions and suggestions; Horst Friedsam and Will Oren for supplying the SLC example, and for their help with the analysis; and both referees for their constructive criticism of earlier drafts.


Abstract

Principal curves are smooth one dimensional curves that pass through the middle of a p dimensional data set, providing a non-linear summary of the data. They are non-parametric, and their shape is suggested by the data. The algorithm for constructing principal curves starts with some prior summary, such as the usual principal component line. The curve in each successive iteration is a smooth or local average of the p dimensional points, where the definition of local is based on the distance in arc length of the projections of the points onto the curve found in the previous iteration.

In this paper we define principal curves, give an algorithm for their construction, present some theoretical results, and compare our procedure to other generalizations of principal components.

Two applications illustrate the use of principal curves. In the first application, two different assays for gold content in a number of samples of computer-chip waste appear to show some systematic differences which are blurred by measurement error. The classical approach using linear errors-in-variables regression can detect systematic linear differences, but is not able to account for nonlinearities. When we replace the first linear principal component by a principal curve, a local "bump" is revealed, and we use bootstrapping to verify its presence.

As a second example we describe how the principal curve procedure was used to align the magnets of the Stanford Linear Collider. The collider uses around 950 magnets in a roughly circular arrangement to bend electron and positron beams and bring them to collision. After construction it was found that some of the magnets had ended up significantly out of place. As a result the beams had to be bent too sharply and could not be focused. The engineers realized that the magnets did not have to be moved to their originally planned locations but rather to a sufficiently smooth arc through the "middle" of the existing positions. This arc was found using the principal curve procedure.

Keywords: Principal Components, Smoother, Non-parametric, Non-linear, Errors in Variables, Symmetric, Self-Consistency.


1. Introduction

Consider a data set consisting of n observations on two variables, x and y. We can represent the n points in a scatterplot, as in figure 1a. It is natural to try and summarize the pattern exhibited by the points in the scatterplot. The form of summary we choose depends on the goal of our analysis. A trivial summary is the mean vector, which simply locates the center of the cloud but conveys no information about the joint behavior of the two variables.

It is often sensible to treat one of the variables as a response variable, and the other as an explanatory variable. The aim of the analysis is then to seek a rule for predicting the response using the value of the explanatory variable. Standard linear regression produces a linear prediction rule: the expectation of y is modeled as a linear function of x and is usually estimated by least squares. This procedure is equivalent to finding the line that minimizes the sum of vertical squared deviations, as depicted in figure 1a.

Linear regression has been generalized to include nonlinear functions of x. This has been achieved using predefined parametric functions, and, with the reduced cost and increased speed of computing, nonparametric scatterplot smoothers have gained popularity. These include kernel smoothers (Watson 1964), nearest neighbor smoothers (Cleveland 1979), and spline smoothers (Silverman 1985). In general, scatterplot smoothers produce a curve that attempts to minimize the vertical deviations, as depicted in figure 1c, subject to some form of smoothness constraint. The non-parametric versions referred to above allow the data to dictate the form of the non-linear dependency.

In many situations we do not have a preferred variable that we wish to label response, but would still like to summarize the joint behavior of x and y. The dashed line in figure 1a shows what happens if we used x as the response: simply assigning the role of response to one of the variables can lead to a poor summary. An obvious alternative is to summarize the data by a straight line that treats the two variables symmetrically. The first principal component line in figure 1b does just this; it is found by minimizing the orthogonal deviations.

In this paper we consider similar generalizations for the symmetric situation: instead of summarizing the data with a straight line, we use a smooth curve; in fitting the curve we treat the two variables symmetrically.


Principal curves, like linear principal components, focus on the orthogonal or shortest distance to the points. We formally define principal curves to be those smooth curves that are self consistent for a distribution or data set. This means that if we pick any point on the curve, collect all the data that "project" onto this point, and average them, then this average coincides with the point on the curve.

The algorithm for finding principal curves is equally intuitive. Starting with any smooth curve (usually the largest principal component), it checks if this curve is self consistent by projecting and averaging. If it is not, the procedure is repeated, using the new curve obtained by averaging as a starting guess. This is iterated until (hopefully) convergence.

The largest principal component line plays roles other than that of a data summary:

In errors-in-variables regression it is assumed that there is randomness in the predictors as well as

the response. This can occur in practice when the predictors are measurements of some underlying variables, and there is error in the measurements. It also occurs in observational studies where neither variable is fixed by design. The errors-in-variables regression technique models the expectation of y as a linear function of the systematic component of x. In the case of a single predictor, the model is estimated by the principal component line. This is also the total least squares method of Golub and van Loan (1979). More details are given in an example in section 8.

Often we want to replace a number of highly correlated variables by a single variable, such as a normalized linear combination of the original set. The first principal component is the normalized linear combination with the largest variance.

In factor analysis we model the systematic component of the data by linear functions of a small set of unobservable variables called factors. Often the models are estimated using linear principal components; in the simplest case one would use the largest principal component. Many variations of this model have appeared in the literature.

In all the above situations the model can be written as

    x_i = u_0 + a λ_i + e_i,    (1)

where u_0 and a are fixed vectors, the λ_i are scalars, and the e_i are a mean zero random component. The least squares estimate of the systematic component u_0 + a λ_i is given by the first principal component. The natural generalization replaces the straight line by a smooth curve f:

    x_i = f(λ_i) + e_i.    (2)


This might then be a factor analysis or structural model, and, for two variables and some restrictions, an errors-in-variables regression model. In the same spirit as above, where we used the first linear principal component to estimate (1), the techniques described in this article can be used to estimate the systematic component in (2).

This paper focuses on the definition of principal curves, and on an algorithm for finding them. We also present some theoretical results, although many open questions remain.

2. The principal curves of a probability distribution

We first give a brief introduction to one dimensional curves, and then define the principal curves of smooth probability distributions in p space. Subsequent sections give algorithms for finding the curves, both for distributions and finite realizations. This is analogous to motivating a scatterplot smoother, such as a moving average or kernel smoother, as an estimator for the conditional expectation of the underlying distribution. We also discuss briefly an alternative approach via regularization using smoothing splines.

2.1. One dimensional curves

A one dimensional curve in p dimensional space is a vector f(λ) of p functions of a single variable λ. These functions are called the coordinate functions, and λ provides an ordering along the curve. If the coordinate functions are smooth, then f is by definition a smooth curve. We can apply any monotone transformation to λ, and by modifying the coordinate functions appropriately the curve remains unchanged. The parametrization, however, is different. There is a natural parametrization for curves in terms of arc length. The arc length of a curve f from λ_0 to λ_1 is given by

    l = ∫ from λ_0 to λ_1 of ||f'(z)|| dz.

If ||f'(z)|| ≡ 1 then l = λ_1 − λ_0. This is a rather desirable situation, since if all the coordinate variables are in the same units of measurement, then λ is also in those units.
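Numerically, the arc-length parametrization above can be computed for a polygonal approximation of a curve; a minimal sketch (the function name is ours, not the paper's):

```python
import numpy as np

def arc_length_params(points):
    """Cumulative arc length along the polygon through `points` (n x p).

    Returns lambda_i with lambda_1 = 0, i.e. the arc-length
    parametrization described in the text."""
    steps = np.linalg.norm(np.diff(points, axis=0), axis=1)
    return np.concatenate([[0.0], np.cumsum(steps)])

# For a unit speed curve, arc length equals the parameter difference:
# approximate f(t) = (cos t, sin t), a unit speed circle, over [0, pi].
t = np.linspace(0.0, np.pi, 2001)
circle = np.column_stack([np.cos(t), np.sin(t)])
lam = arc_length_params(circle)
# lam[-1] is close to pi, the arc length of the half circle
```

With a fine enough polygon the chord lengths recover the arc length to high accuracy, which is why the data algorithms later in the paper can work entirely with polygons.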

The vector f'(λ) is tangent to the curve at λ and is sometimes called the velocity vector at λ. A curve with ||f'|| ≡ 1 is a unit speed curve. We can reparametrize any smooth curve to make it unit speed. For a unit speed curve our notion of smoothness relates more naturally to the curve itself: smoothness translates directly into a smooth visual appearance of the point set {f(λ), λ ∈ Λ} (absence of sharp bends).


  • 7/28/2019 10.1.1.142.8994 PRINCIPAL CURVES

    7/52

If v is a unit vector, then f(λ) = u_0 + λv is a unit speed straight line. This parametrization is not unique: f*(λ) = (u_0 + av) + λv is another unit speed parametrization for the same line. In the following we will always assume that ⟨u_0, v⟩ = 0.

The vector f''(λ) is called the acceleration of the curve at λ, and for a unit speed curve it is easy to check that it is orthogonal to the tangent vector. In this case f''/||f''|| is called the principal normal to the curve at λ. The two vectors f'(λ) and f''(λ) span a plane. There is a unique unit speed circle in the plane that goes through f(λ) and has the same velocity and acceleration at f(λ) as the curve itself. The radius r_f(λ) of this circle is called the radius of curvature of the curve f at λ; it is easy to see that r_f(λ) = 1/||f''(λ)||. The center c_f(λ) of the circle is called the center of curvature of f at λ. Thorpe (1979) gives a clear introduction to these and related ideas in differential geometry.

|| Insert figure 3 around here ||

2.2. Definition of principal curves

Denote by X a random vector in R^p with density h and finite second moments. Without loss of generality assume E(X) = 0. Let f denote a smooth (C-infinity) unit speed curve in R^p parametrized over Λ ⊆ R^1, a closed (possibly infinite) interval, which does not intersect itself (λ_1 ≠ λ_2 implies f(λ_1) ≠ f(λ_2)) and has finite length inside any finite ball in R^p.

We define the projection index λ_f : R^p → R^1 by

    λ_f(x) = sup { λ : ||x − f(λ)|| = inf over μ of ||x − f(μ)|| }.    (3)

The projection index λ_f(x) of x is the value of λ for which f(λ) is closest to x. If there are several such values, we pick the largest one. We show in the appendix that λ_f(x) is well defined and measurable.

Definition
The curve f is called self-consistent or a principal curve of h if E(X | λ_f(X) = λ) = f(λ) for a.e. λ.
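Definition (3) is straightforward to compute for the polygonal curves used later in the paper. A hypothetical helper (ours, not the authors'), with the tie-breaking rule of (3) — pick the largest λ among equally close points — made explicit:

```python
import numpy as np

def projection_index(x, points, lam):
    """Projection index lambda_f(x) of equation (3) for a polygonal curve.

    `points` (n x p) are the vertices and `lam` their arc-length
    parameters. Among equally close points the largest lambda wins."""
    best_d, best_lam = np.inf, None
    for i in range(len(points) - 1):
        a, b = points[i], points[i + 1]
        seg = b - a
        # fraction along the segment of the orthogonal projection, clipped
        t = np.clip(np.dot(x - a, seg) / np.dot(seg, seg), 0.0, 1.0)
        p = a + t * seg
        d = np.linalg.norm(x - p)
        l = lam[i] + t * (lam[i + 1] - lam[i])
        if d < best_d - 1e-12 or (abs(d - best_d) <= 1e-12 and l > best_lam):
            best_d, best_lam = d, l
    return best_lam

# Straight line through the origin: the projection index is simply the
# coordinate along the line.
pts = np.array([[0.0, 0.0], [1.0, 0.0], [2.0, 0.0]])
lam = np.array([0.0, 1.0, 2.0])
print(projection_index(np.array([1.5, 3.0]), pts, lam))  # 1.5
```

For a point beyond an endpoint the projection is clipped to that endpoint, which matters for the endpoint behavior discussed in section 4.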

Figure 3 illustrates the intuitive motivation behind our definition of a principal curve. For any particular parameter value λ we collect all the observations that have f(λ) as their closest point on the curve. If f(λ) is the average of those observations, and if this holds for all λ, then f is called a self-consistent curve or principal curve. In the algorithms to come, we average in a neighborhood to estimate conditional expectations.


The definition of principal curves immediately gives rise to a number of interesting questions: for what kinds of distributions do principal curves exist, how many principal curves are there for a given distribution, and what are their properties? We are unable to answer those questions in general. We can, however, show that the definition is not vacuous: there are densities which do have principal curves.

It is easy to check that for ellipsoidal distributions the principal components are principal curves. For a spherically symmetric distribution, any line through the mean vector is a principal curve. For any two dimensional spherically symmetric distribution, a circle with center at the origin and radius E||X|| is a principal curve. (Strictly speaking, a circle does not fit our definition, because it does intersect itself. See, however, our note at the beginning of the appendix.)
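The circle claim is easy to check by simulation. The following sketch (ours, not from the paper) estimates E||X|| for a standard bivariate normal, where the projection onto a centered circle depends only on the angle, and verifies approximate self-consistency for points projecting into a thin angular wedge:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((200_000, 2))   # spherically symmetric sample

# Radius of the self-consistent circle: E||X||, which equals sqrt(pi/2)
# for a standard bivariate normal.
r_hat = float(np.linalg.norm(X, axis=1).mean())

# Self-consistency check: projection onto a centered circle is determined
# by the angle alone, so points in a thin angular wedge all project near
# the same circle point; their average should lie at distance ~ E||X||.
theta = np.arctan2(X[:, 1], X[:, 0])
wedge = X[np.abs(theta) < 0.05]
center = wedge.mean(axis=0)
# ||center|| is close to r_hat, i.e. the average falls on the circle
```

The same experiment with the wedge placed at any other angle gives the same radius, by symmetry.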

We show in the appendix that for compact Λ it is always possible to construct densities with carrier in a thin tube around f, which have f as a principal curve.

What about data generated from the model X = f(λ) + e, with f smooth and E(e) = 0: is f a principal curve for this distribution? The answer seems to be no in general. We will show in section 7, in the more restrictive setting of data scattered around the arc of a circle, that the mean of the conditional distribution of x, given λ_f(x) = λ_0, lies outside the circle of curvature at λ_0; this implies that f cannot be a principal curve. So in this situation the principal curve will be biased for the functional model. We have some evidence that this bias is small, and decreases to zero as the variance of the errors gets small relative to the radius of curvature. We discuss this bias as well as estimation bias (which fortunately appears to operate in the opposite direction) in section 7.

3. Connections between principal curves and principal components

In this section we establish some facts that make principal curves appear as a reasonable generalization of linear principal components.

Proposition 1: If a straight line l(λ) = u_0 + λv_0 is self consistent, then it is a principal component.

Proof: The line has to pass through the mean vector, so u_0 = 0; assume ||v_0|| = 1. Then λ_l(X) = X^t v_0, and self-consistency gives E(X | X^t v_0 = λ) = λv_0. Consider


the covariance of X:

    Σ v_0 = E(X X^t) v_0
          = E_λ E(X X^t v_0 | λ_l(X) = λ)
          = E_λ E(X X^t v_0 | X^t v_0 = λ)
          = E_λ E(λ X | X^t v_0 = λ)
          = E_λ (λ² v_0).

So Σ v_0 = E(λ²) v_0, i.e. v_0 is an eigenvector of the covariance matrix Σ, and the line is a principal component.

Principal components need not be self consistent in the sense of the definition; however, they are self consistent with respect to linear regression.

Proposition 2: Suppose l(λ) is a straight line, and we linearly regress the p components X_j of X on the projection λ_l(X), resulting in linear functions f_j(λ). Then f = l iff v_0 is an eigenvector of Σ and u_0 = 0.

The proof of this requires only elementary linear algebra and is omitted.

3.1. A distance property of principal curves

An important property of principal components is that they are critical points of the distance from the observations.

Let d(x, f) denote the usual euclidean distance from a point x to its "projection" on f: d(x, f) = ||x − f(λ_f(x))||, and define the expected squared distance D²(h, f) = E_h d²(X, f).

Consider a straight line l(λ) = u + λv. The distance D²(h, l) in this case may be regarded as a function of u and v: D²(h, l) ≡ D²(h, u, v). It is well known that grad over (u, v) of D²(h, u, v) = 0 iff u = 0 and v is an eigenvector of Σ, i.e., the line l is a principal component line.

We now restate this fact in a variational setting and extend it to principal curves. Let G denote a class of curves parametrized over Λ. For g ∈ G define f_t = f + tg. This creates a perturbed version of f.

Definition
The curve f is called a critical point of the distance function for variations in the class G if

    d/dt D²(h, f_t) at t = 0 equals 0   for every g ∈ G.


Proposition 3: Let G_l denote the class of straight lines g(λ) = a + λb. A straight line l_0(λ) = a_0 + λb_0 is a critical point of the distance function for variations in G_l iff b_0 is an eigenvector of Cov(X) and a_0 = 0.

The proof involves straightforward linear algebra and is omitted. A result analogous to proposition 3 holds for principal curves.

Proposition 4: Let G_B denote the class of smooth (C-infinity) curves parametrized over Λ, with ||g|| ≤ 1 and ||g'|| ≤ 1. Then f is a principal curve of h iff f is a critical point of the distance function for perturbations in G_B.

A proof of proposition 4 is given in the appendix. The condition that ||g|| is bounded guarantees that f_t lies in a thin tube around f and that the tubes shrink uniformly as t → 0. The boundedness of ||g'|| ensures that for t small enough, f_t' is well behaved and in particular is bounded away from 0 for t < 1. Both conditions together guarantee that, for small enough t, the projection index for f_t is well defined.

4. An algorithm for finding principal curves

In analogy to linear principal component analysis, we are particularly interested in finding smooth curves corresponding to local minima of the distance function. Our strategy is to start with a smooth curve such as the largest linear principal component, and check if it is a principal curve. This involves projecting the data onto the curve, and then evaluating their expectation conditional on where they project. Either this conditional expectation coincides with the curve, or we get a new curve as a by-product. We then check if the new curve is self consistent, and so on. If the self consistency condition is met, we have found a principal curve. It is easy to show that both the operations of projection and conditional expectation reduce the expected distance from the points to the curve.

4.1. The principal curve algorithm

The above discussion motivates the following iterative algorithm.

initialization: Set f^(0)(λ) = x-bar + aλ, where a is the first linear principal component of h. Set λ^(0)(x) to the projection index of x on f^(0).
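The iteration steps themselves fall on a page lost to scanning. The following is our illustrative sketch of the alternating procedure described above, not the authors' code: it alternates a projection step (project each point onto the current polygonal curve) with a conditional-expectation step (smooth each coordinate against the projection parameters, here with a crude running-mean smoother standing in for the locally weighted running-lines smoother the paper uses), re-parametrizing by arc length after each pass.

```python
import numpy as np

def project(curve, lam, X):
    """Projection step: for each row of X find the closest point on the
    polygon with vertices `curve` and arc-length parameters `lam`."""
    A, B = curve[:-1], curve[1:]
    seg = B - A
    seg2 = np.maximum((seg ** 2).sum(axis=1), 1e-12)
    lam_out, d2_out = np.empty(len(X)), np.empty(len(X))
    for k, x in enumerate(X):
        t = np.clip(((x - A) * seg).sum(axis=1) / seg2, 0.0, 1.0)
        P = A + t[:, None] * seg
        d2 = ((x - P) ** 2).sum(axis=1)
        i = int(np.argmin(d2))
        lam_out[k] = lam[i] + t[i] * (lam[i + 1] - lam[i])
        d2_out[k] = d2[i]
    return lam_out, d2_out

def running_mean(lam, y, span):
    """Crude local-averaging smoother (a stand-in for the locally
    weighted running-lines smoother of section 5.2.2)."""
    n = len(lam)
    k = max(2, int(span * n) // 2)
    order = np.argsort(lam)
    ys = y[order]
    sm = np.array([ys[max(0, j - k):j + k + 1].mean() for j in range(n)])
    out = np.empty(n)
    out[order] = sm
    return out

def principal_curve(X, n_iter=10, span=0.2):
    # initialization: project onto the first linear principal component
    mu = X.mean(axis=0)
    Vt = np.linalg.svd(X - mu, full_matrices=False)[2]
    lam_i = (X - mu) @ Vt[0]
    order = np.argsort(lam_i)
    curve = mu + np.outer(lam_i[order], Vt[0])
    lam = lam_i[order] - lam_i.min()
    for _ in range(n_iter):
        lam_i, _ = project(curve, lam, X)            # projection step
        smoothed = np.column_stack(                  # expectation step
            [running_mean(lam_i, X[:, j], span) for j in range(X.shape[1])])
        order = np.argsort(lam_i)
        curve = smoothed[order]
        steps = np.linalg.norm(np.diff(curve, axis=0), axis=1)
        lam = np.concatenate([[0.0], np.cumsum(steps)])  # arc length
    _, d2 = project(curve, lam, X)
    return curve, float(np.sqrt(d2.mean()))

# Demo: noisy arc of a circle of radius 5, in the spirit of section 5.
rng = np.random.default_rng(1)
th = np.sort(rng.uniform(0.0, np.pi, 100))
X = np.column_stack([5 * np.sin(th), 5 * np.cos(th)]) \
    + rng.standard_normal((100, 2))
curve, rms = principal_curve(X)

# RMS orthogonal distance to the first principal component line, for reference
Xc = X - X.mean(axis=0)
v = np.linalg.svd(Xc, full_matrices=False)[2][0]
rms_line = float(np.sqrt(((Xc - np.outer(Xc @ v, v)) ** 2).sum(axis=1).mean()))
# rms < rms_line: the fitted curve lies closer to the data than the line
```

Each pass reduces the average squared distance, in line with the remark above that both projection and averaging reduce the expected distance.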


There are potential problems with the above algorithm. While principal curves are by definition differentiable, there is no guarantee that curves produced by the conditional expectation step of the algorithm have this property. Discontinuities can certainly occur at the endpoints of a curve. The problem is illustrated in figure 4, where the expected values of the observations projecting onto f(λ_min) and f(λ_max) are disjoint from the new curve.

If this situation occurs, we have to join f(λ_min) and f(λ_max) to the rest of the curve in a differentiable fashion. In light of the above, we cannot prove that the algorithm converges. All we have is some evidence in its favor:

By definition, principal curves are fixed points of the algorithm. Assuming that each iteration is well defined and produces a differentiable curve, we can show that the expected distance D²(h, f^(j)) converges. If the conditional expectation operation in the principal curve algorithm is replaced by fitting a least squares straight line, then the procedure converges to the largest principal component.

5. Principal curves for data sets

So far, we have considered principal curves of a multivariate probability distribution. In reality, however, we always work with finite multivariate data sets. Suppose then that X is an n × p matrix of n observations on p variables. We regard the data set as a sample from an underlying probability distribution.

A curve f(λ) is represented by n tuples (λ_i, f_i), joined up in increasing order of λ to form a polygon. Clearly the geometric shape of the polygon depends only on the order, not on the actual values of the λ_i. We will assume that the tuples are sorted in increasing order of λ, and we use the arc-length parametrization, for which λ_1 = 0 and λ_i is the arc length along the polygon from f_1 to f_i.

As in the distribution case, the algorithm alternates between a projection step and an expectation step. In the absence of further information, we use the projections of the n observations onto the first linear principal component as a starting curve, with parameters found by adding up the distances between successive projections. Iteration continues until the change in the estimated distance falls below some threshold.


The distance from the data to the curve is estimated by

    D²(f) = Σ from i = 1 to n of ||x_i − f(λ_i)||².    (4)

The quantity to be smoothed in the expectation step is p dimensional, so we smooth each coordinate function separately against the current parameters λ_i. Local averaging to estimate a conditional expectation is not a new idea; in the more common regression context, scatterplot smoothers are used to estimate the conditional expectation of a response against a covariate in exactly this way. Any scatterplot smoother could be used here; commonly used ones include kernel smoothers, nearest neighbor smoothers, and locally weighted running lines (Cleveland 1979). Our current implementation uses the latter. We have


experience with all those mentioned above, although all the examples were fitted using locally weighted running lines. We give a brief description; for details see Cleveland (1979).

5.2.2 Locally weighted running lines smoother

Consider the estimation of E(x | λ), i.e. a single coordinate function, based on a sample of pairs (λ_1, x_1), ..., (λ_n, x_n), and assume the λ_i are ordered. To estimate E(x | λ_i), the smoother fits a straight line to the wn observations x_j closest in λ_j to λ_i. The estimate is taken to be the fitted value of the line at λ_i. The fraction w of points in the neighborhood is called the span. In fitting the line, weighted least squares regression is used. The weights are derived from a symmetric kernel centered at λ_i that dies smoothly to zero within the neighborhood. Specifically, if h_i is the distance to the wn-th nearest neighbor, then the points x_j in the neighborhood get weights w_ij = (1 − |(λ_i − λ_j)/h_i|³)³.

5.3. A demonstration of the algorithm

To illustrate the principal curve procedure, we generated a set of 100 data points from a circle in two dimensions with independent Gaussian errors in both coordinates:

    x_1 = 5 sin(λ) + e_1
    x_2 = 5 cos(λ) + e_2    (5)

where λ is uniformly distributed on [0, 2π) and e_1 and e_2 are independent N(0, 1).
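For concreteness, here is our sketch of the locally weighted running-lines smoother of section 5.2.2 (a simplified re-implementation, not the paper's code), applied to one coordinate of data drawn from model (5):

```python
import numpy as np

def running_lines(lam, y, span=0.2):
    """Locally weighted running-lines smoother in the spirit of
    Cleveland (1979): at each lambda_i, fit a weighted least squares line
    to the w*n nearest points (tricube weights) and evaluate at lambda_i."""
    n = len(lam)
    k = max(3, int(round(span * n)))
    out = np.empty(n)
    for i in range(n):
        d = np.abs(lam - lam[i])
        idx = np.argsort(d)[:k]
        h = max(d[idx].max(), 1e-12)          # distance to k-th neighbor
        w = (1.0 - (d[idx] / h) ** 3) ** 3    # tricube kernel weights
        sw = np.sqrt(w)
        A = np.column_stack([np.ones(k), lam[idx]])
        beta, *_ = np.linalg.lstsq(A * sw[:, None], y[idx] * sw, rcond=None)
        out[i] = beta[0] + beta[1] * lam[i]
    return out

# One coordinate of model (5): x1 = 5 sin(lambda) + e1.
rng = np.random.default_rng(2)
lam = np.sort(rng.uniform(0.0, 2 * np.pi, 200))
x1 = 5 * np.sin(lam) + rng.standard_normal(200)
fit = running_lines(lam, x1)
rmse = float(np.sqrt(np.mean((fit - 5 * np.sin(lam)) ** 2)))
# rmse is well below the noise standard deviation of 1
```

Fitting a local line rather than a local mean keeps the estimate nearly unbiased at the boundaries of the λ range, which matters for the endpoint issues discussed in section 4.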

|| Insert figures 5a-d about here ||

Figures 5a-d show the data, the circle (dashed line) and the estimated curve (solid line) for selected steps of the iteration. The starting curve is the first principal component, in figure 5a. Any line through the origin is a principal curve for the population model (5), but in general this will not be the case for data. Here the algorithm converges to an estimate for another population principal curve, the circle.

The example is admittedly artificial, but it presents the principal curve procedure with a particularly tough job: the starting line is far from the solution curve, and the projection of the points onto this line is very different from their eventual projection onto the solution curve.


found by projecting the points onto the new curve. It can be seen that the ordering of the projected points along the new curve can be different from the ordering along the previous curve. This enables the successive curves to bend to shapes that could not be parametrized as a function of the first linear principal component.

5.4. Span selection for the scatterplot smoother

The crucial parameter of any local averaging smoother is the size of the neighborhood over which averaging takes place. We discuss here the choice of the span w for the locally weighted running lines smoother. A fixed span is used until the procedure has converged to a consistent (with respect to the smoother) curve for the span last used. We do not want the fitted curve to be too wiggly relative to the density of the data. As the span decreases, the

average distance decreases and the curve follows the data more closely. The human eye is skilled at making tradeoffs between smoothness and fidelity to the data, but we would like to make this judgment automatically. A similar situation arises in non-parametric regression, where we smooth a response y against a covariate x. One way of making the smoothness judgment automatically is cross-validation, which proceeds as follows. The response y_i is predicted



using a smooth estimated from the data with the ith observation removed. Let y-hat_(i) be this predicted value, and define the cross-validated residual sum of squares CVRSS = Σ_i (y_i − y-hat_(i))². CVRSS/n is

an approximately unbiased estimate of the expected squared prediction error. If the span is too large, the curve will miss features in the data, and the bias component of the prediction error will dominate. If the span is too small, the curve begins to fit the noise in the data, and the variance component of the prediction error will increase. We pick the span that corresponds to the minimum CVRSS.

In the principal curve algorithm, we can use the same procedure for estimating the spans for each coordinate function separately, as a final smoothing step. Since most smoothers have this feature built in as an option, cross-validation in this manner is trivial to implement.
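A leave-one-out version of this span selection is straightforward to sketch for a running-lines smoother. The code below is our illustration, not the paper's implementation: for each candidate span it predicts each observation from a local line fitted without that observation, accumulates CVRSS, and keeps the span with the smallest score.

```python
import numpy as np

def loo_cvrss(lam, y, span):
    """Leave-one-out cross-validated residual sum of squares for a
    locally weighted running-lines smoother (illustrative sketch)."""
    n = len(lam)
    k = max(3, int(round(span * n)))
    rss = 0.0
    for i in range(n):
        d = np.abs(lam - lam[i])
        idx = np.argsort(d)[1:k + 1]          # nearest neighbors, excluding i
        h = max(d[idx].max(), 1e-12)
        w = (1.0 - (d[idx] / h) ** 3) ** 3    # tricube weights
        sw = np.sqrt(w)
        A = np.column_stack([np.ones(len(idx)), lam[idx]])
        beta, *_ = np.linalg.lstsq(A * sw[:, None], y[idx] * sw, rcond=None)
        rss += (y[i] - (beta[0] + beta[1] * lam[i])) ** 2
    return float(rss)

rng = np.random.default_rng(3)
lam = np.sort(rng.uniform(0.0, 2 * np.pi, 150))
y = np.sin(lam) + 0.3 * rng.standard_normal(150)
spans = [0.05, 0.1, 0.2, 0.4, 0.7]
scores = {s: loo_cvrss(lam, y, s) for s in spans}
best = min(scores, key=scores.get)
# very large spans miss the curvature of the sine and score badly
```

As the text predicts, the score is inflated by bias for large spans and by variance for very small ones, with the chosen span in between.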

Figure 7(a) shows the final curve after one more smoothing step, using cross-validation to select the span; nothing much has changed.

Figure 7b, on the other hand, shows what happens if we continue iterating with the cross-validated smoothers. The spans get successively smaller, until the curve almost interpolates the data. In some situations, such as the SLC example in section 8, this may be exactly what we want. It is unlikely, however, that in that event cross-validation would be used to pick the span. A possible explanation for this behavior is that the errors in the co-ordinates


5.5. An alternative approach using smoothing splines

As mentioned in section 2, an alternative approach is regularization using smoothing splines. Consider the penalized least squares criterion

    D²_μ(f, λ) = Σ from i = 1 to n of ||x_i − f(λ_i)||² + μ ∫ ||f''(λ)||² dλ,    (7)

where the minimization is over curves f with coordinate functions f_j ∈ S₂[0, 1]. Notice we have confined the functions to the unit interval, and thus do not use the unit speed parametrization. Intuitively, for a fixed smoothing parameter μ, functions defined over an arbitrarily large interval can satisfy the second-derivative smoothness criterion and visit every point; it is easy to make this argument rigorous.

We now apply our alternating algorithm to this criterion. The projection step only involves the first part of (7) and is our usual projection, with the λ_i rescaled to lie in [0, 1]. Given the λ_i, (7) splits up into p expressions of the form (6), one for each coordinate function. These are optimized by smoothing the p coordinates against λ_i using a cubic spline smoother with parameter μ. The usual penalized least squares arguments show that if a minimum exists, it must be a cubic spline in each coordinate. We make no claims about its existence, or about global convergence properties of this algorithm.

An advantage of the spline smoothing algorithm is that it can be computed in O(n) operations, and thus is a strong competitor for the kernel type smoothers, which take O(n²) unless approximations are used. Although it is difficult to guess the smoothing parameter μ, alternative methods, such as using the approximate degrees of freedom (see Cleveland, 1979), are available for assessing the amount of smoothing and thus selecting the parameter. The procedure worked on the circle example and on a number of other artificial examples; it is difficult to distinguish the performance of the two smoothing approaches in practice.

5.6. Further illustrations and discussion of the algorithm

We return to the circle data of model (5). The circle itself is a principal curve, as are the principal component lines through the origin.


When the principal curve procedure is started from the circle, it does not move much, except at the endpoints, as depicted in figure 9a. This is a consequence of the smoother's endpoint behavior, in that it is not constrained to be periodic. Figure 9b shows what happens when we use a periodic version of the smoother. However, starting from the linear principal component (where theoretically it should stay), the algorithm iterates to a curve that, apart from the endpoints, appears to be attempting to model the circle (see figure 9c); this behavior occurred repeatedly over a number of simulations of this

example. The ends of the curve are "stuck", and further iterations do not free them. The example illustrates the fact that the algorithm will tend to find curves that are minima of

the distance function. This is not surprising; after all, the principal curve algorithm is a generalization of the power method for finding eigenvectors, which exhibits the same behavior: the power method will tend to converge to an eigenvector for the largest eigenvalue, unless special precautions are taken.

Interestingly, the algorithm using the periodic smoother and starting from the linear principal component finds the identical circle to that in figure 9b.

6. Bias considerations: model and estimation bias

Model bias occurs when the data are of the form x = f(λ) + e, and we wish to recover f(λ). In general, if f(λ) has curvature, it will not be a principal curve for the resulting distribution. As a consequence the principal curve procedure can only find a biased version of f(λ), even if it is started at the generating curve. This bias goes to zero with the ratio of the noise variance to the radius of curvature.

Estimation bias occurs because we use scatterplot smoothers to estimate conditional expectations. The bias is introduced by averaging over neighborhoods, which usually has a flattening effect. We demonstrate this bias with a simple example.

6.1. A simple model for investigating bias

The curve f is an arc of a circle with radius ρ and center at the origin, and the data x are generated by adding spherically symmetric errors to a point chosen uniformly on the arc. Consider the set of points that project onto a small segment of the arc. Intuitively it seems clear what happens in this situation: as the segment shrinks down to


a point, this set shrinks down to the normal to the curve at that point; but there is always more mass outside the circle than inside, which implies that the conditional expectation lies outside the circle.

We can prove (Hastie, 1984) that the stationary circle of the averaging process has radius

    r_e = r* · sin(θ/2) / (θ/2),    (8)

where r* > ρ is the radius of the principal curve of the model and θ is the angle of the arc over which the local averaging takes place; finally, r* → ρ as σ/ρ → 0.

Equation (8) nicely separates the two components of bias. Even if we had infinitely many observations, and thus would not need local averaging to estimate conditional expectations, the circle with radius ρ would not be a stationary point of the algorithm; the principal curve is a circle with radius r* > ρ. The factor sin(θ/2)/(θ/2) is due to local averaging. There is clearly an optimal span at which the two bias components cancel exactly. In practice this is not much help, since knowledge of the radius of curvature and the error variance is needed to determine it.
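The local-averaging factor in (8) is easy to verify directly: the mean of points spread uniformly over an arc of angle θ on a circle of radius r lies at radius r · sin(θ/2)/(θ/2), since the mean of (r cos φ, r sin φ) over φ in [−θ/2, θ/2] is (r sin(θ/2)/(θ/2), 0). A small check (ours, not from the paper):

```python
import numpy as np

def flattening(theta):
    """Estimation-bias factor sin(theta/2)/(theta/2) from equation (8)."""
    return np.sin(theta / 2.0) / (theta / 2.0)

# Monte Carlo: average points spread uniformly over an arc of angle theta
# on a circle of radius r; the mean is pulled inside the circle.
rng = np.random.default_rng(0)
r, theta = 5.0, 1.0
phi = rng.uniform(-theta / 2, theta / 2, 100_000)
mean_pt = np.array([(r * np.cos(phi)).mean(), (r * np.sin(phi)).mean()])
shrunk = float(np.linalg.norm(mean_pt))
# shrunk is close to r * flattening(theta), i.e. strictly less than r
```

The factor tends to 1 as θ → 0, which is why shrinking the span removes the estimation bias while leaving the model bias r* > ρ untouched.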

Typically these quantities will be changing as we move along the curve. Hastie (1984) gives a demonstration that these bias patterns persist in a situation where the curvature changes along the curve.

7. Examples

This section contains three examples that illustrate the use of the procedure.

7.1. The Stanford Linear Collider (SLC) project

This application of principal curves was implemented by a group of geodetic engineers at the Stanford Linear Accelerator Center (SLAC), in software written by

Jerome Friedman of SLAC. The collider accelerates an electron and a positron beam and bends them through two collider arcs to bring them to collision. The collisions of the two intense beams are recorded in a collision chamber and studied by particle physicists, whose aim is to discover new sub-atomic particles.



Each of the two collider arcs contains roughly 475 magnets (23 segments of 20, plus some extras), which guide the positron and electron beams. Ideally these magnets lie on a smooth curve with a circumference of about 3km, as depicted in the schematic (figure 11a). The collider has a third dimension, and actually resembles a floppy tennis racket: the tunnel containing the magnets goes underground, whereas the accelerator is above ground.

Measurement errors were inevitable in the procedure used to place the magnets. This resulted in the magnets lying close to the planned curve, but with errors in the range of ±1.0mm. A consequence of these errors was that the beam could not be adequately focussed.

The engineers realized that it was not necessary to move the magnets to the ideal curve, but rather to a curve through the existing magnet positions that was smooth enough to allow focused bending of the beam. This strategy would hopefully reduce the amount of magnet movement necessary. The principal curve procedure was used to find this curve. The remainder of this section describes some special features of this simple but important application.

Initial attempts at fitting curves used the data in the measured 3 dimensional geodetic coordinates, but it was found that the magnet displacements were small relative to the bias induced by smoothing. The theoretical arc was then "removed", and subsequent curve fitting was based on the residuals. This was achieved by replacing the 3 coordinates of each magnet by three new coordinates:

1) the arc length from the beginning of the arc to the point of projection onto the ideal curve (x),
2) the distance from the magnet to this projection in the horizontal plane (y), and
3) the distance in the vertical plane (z).

This technique effectively removed the major component of the bias, and is an illustration of how special situations lend themselves to adaptations of the basic procedure. Of course knowledge of the ideal curve is not available in other applications.

There was a natural way of choosing the span in this application. The fitted curve, once transformed back to the original coordinates, can be represented by a polygon with a vertex at the position of each magnet. The angle at each vertex is of vital importance, since the beam has to be bent through this angle into the next segment without hitting the sides of the magnets. The engineers could therefore specify a threshold angle; the sharper the bend, the harder it is to launch the beam into the next segment. No smoothing at


all results in no movement (no work), but with many magnets violating the threshold. As the amount of smoothing (span) is increased, the angles tend to decrease, and the residuals, and thus the amounts of magnet movement, increase. The strategy used was to increase the span until no magnets violated the angle constraint. Figure 11b gives the fitted vertical and horizontal components of the chosen curve, for a section of the north arc consisting of 149 magnets. This relatively rough curve was then translated back to the original coordinates, and the appropriate adjustments for each magnet were determined. The systematic trend in these coordinate functions represents systematic departures of the magnets from the theoretical curve. It turned out that only 66% of the magnets needed to be moved, since the remaining 34% of the residuals were below 60 μm in length and considered negligible.
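The span search just described can be sketched as follows. This is a hypothetical stand-in, not the SLAC code: the running-mean smoother, the zigzag test data, and the 0.2 radian threshold are all illustrative choices.

```python
import math

def running_mean(ys, half_width):
    """Stand-in smoother: average each value over a symmetric window."""
    n = len(ys)
    out = []
    for i in range(n):
        lo, hi = max(0, i - half_width), min(n, i + half_width + 1)
        out.append(sum(ys[lo:hi]) / (hi - lo))
    return out

def max_turning_angle(xs, ys):
    """Largest angle (radians) through which the polygon through the
    points (xs[i], ys[i]) bends at any vertex."""
    worst = 0.0
    for i in range(1, len(xs) - 1):
        ax, ay = xs[i] - xs[i - 1], ys[i] - ys[i - 1]
        bx, by = xs[i + 1] - xs[i], ys[i + 1] - ys[i]
        cos = ((ax * bx + ay * by) /
               (math.hypot(ax, ay) * math.hypot(bx, by)))
        worst = max(worst, math.acos(max(-1.0, min(1.0, cos))))
    return worst

def choose_span(xs, ys, threshold):
    """Increase the span until no vertex violates the angle constraint."""
    for half_width in range(len(ys)):
        smoothed = running_mean(ys, half_width)
        if max_turning_angle(xs, smoothed) <= threshold:
            return half_width, smoothed
    raise ValueError("no span satisfies the constraint")

# zigzag 'magnet residuals' standing in for the measured displacements
xs = [float(i) for i in range(60)]
ys = [0.5 if i % 2 == 0 else -0.5 for i in range(60)]
span, fitted = choose_span(xs, ys, threshold=0.2)
```

With no smoothing (half_width 0) the raw zigzag violates the threshold badly; widening the window reduces the vertex angles at the cost of larger residuals, mirroring the work/constraint trade-off in the text.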

There were natural constraints on the system. Some of the magnets were fixed by design, and thus could not be moved. The beam enters the arc parallel to the accelerator, so the initial magnets do no bending. Similarly, there are junction points at which no bending is allowed. These constraints are accommodated by attaching weights to the points representing the magnets, and using a weighted version of the smoother in the algorithm. By giving the fixed magnets sufficiently large weights, the constraints are met. Figure 11b has the parallel constraints built in at the endpoints.

Finally, since some of the magnets were way off target, we used a resistant version of the fitting procedure. Points are weighted according to their distance from the fitted curve, and deviations beyond a fixed threshold are given weight 0.

7.2. Gold assay pairs

A California based company collects computer chip waste in order to sell it for its content of gold and other precious metals. Before bidding for a particular cargo, the company takes a sample to estimate the gold content of the lot. The sample is split in two: one sub-sample is assayed by an outside laboratory, the other by their own inhouse laboratory. The company wishes to eventually use only one of the assays. It is in their interest to know which laboratory produces on average lower gold content assays for a given sample.

Figure 12a shows a plot of the log assays. Each point represents a sample; x_1i corresponds to the log of the outside assay and x_2i to the log of the inhouse assay for sample i. There were many more assays at low gold content levels than at high levels, and taking logs produces a more even scatter.


Our model for the above data is

(x_1i, x_2i)' = (f(τ_i), τ_i)' + (e_1i, e_2i)',    (9)

where τ_i is the expected gold content (in oz per ton) for sample i using the inhouse lab assay, f(τ_i) is the expected assay result for the outside lab relative to the inhouse lab, and the e_ji are measurement errors, assumed i.i.d. with Var(e_1i) = Var(e_2i) ∀i.

This is a generalization of the linear errors in variables model, or the structural model (if we regard the τ_i themselves as random variables), or the functional model (if the τ_i are regarded as fixed):

(x_1i, x_2i)' = (α + β τ_i, τ_i)' + (e_1i, e_2i)'.    (10)

Model (9) is a special case of the principal curve model, where one of the coordinate functions is the identity. This identifies the systematic component of variable x_2 with the arc-length parameter. Similarly, we estimate f in (9) using a natural variant of the principal curve algorithm: in the smoothing step we smooth only x_1 against the current value of τ, and then update τ by projecting the data onto the curve defined by (f(τ), τ). Model (10) essentially looks for deviations from the 45° line, and is estimated by the first principal component, which essentially projects the data onto a straight line.

The dashed curve in figure 12a is the usual scatterplot smooth of x_1 against x_2, and is clearly misleading as a scatterplot summary. The principal curve lies above the 45° line in the interval 1.4 to 4, which on the untransformed scale implies that the inhouse assay tends to be lower than that of the outside lab. The difference is reversed at lower levels, but this is of less practical importance, since at these levels the lot is of little value.

A natural question arising at this point is whether the bend in the curve is real, or whether the linear model (10) is adequate. If we had access to more data sets of the same kind, we could simply calculate the curves for each and see how often the bend appeared. In the absence of such replicates, we use the bootstrap (Efron, 1982) to simulate additional samples. We compute residual vectors from the original fit; because the fitted values are projections onto the curve, the residuals are shorter than the errors in model


(9). We therefore scale them up by a factor of √2. We then sampled with replacement from this pool, and reconstructed a bootstrap replicate by adding a sampled residual vector to each of the fitted values of the original fit. For each of these bootstrapped data sets the full curve fitting procedure was applied and the fitted curves were saved. This method of bootstrapping is aimed at exposing both bias and variance.
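In outline the residual bootstrap looks like the sketch below. Here `fit_curve` is a least squares line standing in for the full principal curve procedure, the data are synthetic, and the √2 rescaling of the residuals is omitted for brevity:

```python
import random

def fit_curve(points):
    """Stand-in for the principal curve procedure: a least squares line."""
    n = len(points)
    mx = sum(x for x, _ in points) / n
    my = sum(y for _, y in points) / n
    sxx = sum((x - mx) ** 2 for x, _ in points)
    sxy = sum((x - mx) * (y - my) for x, y in points)
    slope = sxy / sxx
    return my - slope * mx, slope  # (intercept, slope)

def bootstrap_fits(points, n_boot, seed=0):
    """Residual bootstrap: attach resampled residuals to the fitted
    values and refit the curve, once per replicate."""
    rng = random.Random(seed)
    a, b = fit_curve(points)
    fitted = [(x, a + b * x) for x, _ in points]
    residuals = [y - (a + b * x) for x, y in points]
    fits = []
    for _ in range(n_boot):
        boot = [(fx, fy + rng.choice(residuals)) for fx, fy in fitted]
        fits.append(fit_curve(boot))
    return fits

# synthetic assay-like data scattered around the line y = 1 + 0.5 x
data = [(i / 10.0, 1.0 + 0.5 * (i / 10.0) + 0.1 * (-1) ** i)
        for i in range(30)]
fits = bootstrap_fits(data, n_boot=25)
```

The spread of the 25 refitted curves estimates the variance of the fit, and the offset of their average from the original fit estimates the bias, exactly as in the text.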

Figure 12b shows the errors in variables principal curves obtained for 25 bootstrap samples. The spread of these curves gives an idea of the variance of the fitted curve. The difference between their average and the original fit estimates the bias, which in this case is negligible.
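The variant of the algorithm used to fit model (9), in which x_1 is smoothed against the current τ and τ is then updated by projecting each point onto the curve (f(τ), τ), can be sketched as follows. The Gaussian kernel smoother, the grid resolution, and the quadratic test function are illustrative choices, not those used for the assay data:

```python
import math

def smooth(ts, ys, grid, bandwidth):
    """Gaussian kernel (Nadaraya-Watson) smoother of ys against ts,
    evaluated at each point of `grid`."""
    out = []
    for g in grid:
        w = [math.exp(-0.5 * ((t - g) / bandwidth) ** 2) for t in ts]
        sw = sum(w)
        out.append(sum(wi * yi for wi, yi in zip(w, ys)) / sw)
    return out

def fit_semilinear_curve(x1, x2, n_iter=10, bandwidth=0.3):
    """Principal curve variant for (x1, x2) = (f(tau), tau) + error:
    smooth x1 against the current tau, then update each tau by
    projecting the point onto the fitted curve (f(tau), tau)."""
    lo, hi = min(x2), max(x2)
    grid = [lo + (hi - lo) * i / 200.0 for i in range(201)]
    tau = list(x2)  # the identity coordinate supplies the initial guess
    for _ in range(n_iter):
        f = smooth(tau, x1, grid, bandwidth)
        tau = [grid[min(range(201),
                        key=lambda k: (a - f[k]) ** 2 + (b - grid[k]) ** 2)]
               for a, b in zip(x1, x2)]
    return grid, smooth(tau, x1, grid, bandwidth)

# synthetic data from x1 = tau^2 + e1, x2 = tau + e2
tau_true = [i / 30.0 for i in range(60)]
x2 = [t + 0.05 * (-1) ** i for i, t in enumerate(tau_true)]
x1 = [t ** 2 - 0.05 * (-1) ** i for i, t in enumerate(tau_true)]
grid, f_hat = fit_semilinear_curve(x1, x2)
```

Because one coordinate function is fixed at the identity, only one smooth is needed per iteration, and the projection step reduces to a one dimensional search over the grid.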

Figure 12c shows the result of a different bootstrap experiment. Our null hypothesis is that the relationship is linear. We proceeded in the same way as above, but replacing the principal curve by the linear errors in variables line. The observed curve (thick solid curve) lies outside the band of curves fitted to 25 bootstrapped data sets, providing additional evidence that the bend is indeed real.

7.3. One dimensional color data

The data in this example were used by Shepard and Carroll (1966) (who cite the original source as Boynton and Gordon (1965)) to illustrate a version of their parametric data representation techniques called proximity analysis. We give more details of this technique in the discussion.

The data set consists of 23 observations on four variables. It was obtained by exposing 100 observers to light of each of 23 different wavelengths. The four variables are the percentages of observers assigning the color descriptions blue, green, yellow or red to each light. Figure 13a shows the projection of the data and the principal curve onto the plane spanned by the first two linear principal components. The observations lie very close to the curve, which provides an adequate one dimensional summary of the data set. (Note that this cannot be deduced from figure 13a alone.) Figure 13b shows the coordinate functions of the curve. Arc length along the curve turns out to be a monotone function of wavelength.

An interesting feature of the curve is that the distance between neighboring wavelengths is smaller at the two extremes of the spectrum than in the middle: lights at the extremes are judged uniformly extreme, while lights in the middle are more finely differentiated.
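Projecting data onto the plane spanned by the first two linear principal components, as in figure 13a, can be sketched generically as follows. This is a plain power-iteration implementation, and the 4-dimensional data below are synthetic, not the color data:

```python
import random

def covariance(data):
    """Mean vector and sample covariance matrix of the rows of data."""
    n, p = len(data), len(data[0])
    means = [sum(row[j] for row in data) / n for j in range(p)]
    cov = [[sum((row[i] - means[i]) * (row[j] - means[j])
                for row in data) / n
            for j in range(p)] for i in range(p)]
    return means, cov

def power_iteration(mat, n_steps=500, seed=1):
    """Leading eigenvector of a symmetric matrix by power iteration."""
    rng = random.Random(seed)
    v = [rng.random() + 0.1 for _ in mat]
    for _ in range(n_steps):
        w = [sum(mij * vj for mij, vj in zip(row, v)) for row in mat]
        norm = sum(x * x for x in w) ** 0.5
        v = [x / norm for x in w]
    return v

def deflate(mat, v):
    """Subtract the leading eigencomponent lam * v v' from the matrix."""
    p = len(v)
    lam = sum(v[i] * sum(mat[i][j] * v[j] for j in range(p))
              for i in range(p))
    return [[mat[i][j] - lam * v[i] * v[j] for j in range(p)]
            for i in range(p)]

def project_on_pc_plane(data):
    """Coordinates of each centered observation on the first two PCs."""
    means, cov = covariance(data)
    v1 = power_iteration(cov)
    v2 = power_iteration(deflate(cov, v1))
    centered = [[x - m for x, m in zip(row, means)] for row in data]
    return [(sum(a * b for a, b in zip(row, v1)),
             sum(a * b for a, b in zip(row, v2)))
            for row in centered]

# synthetic 4-dimensional data with variance concentrated in two directions
rng = random.Random(0)
data = [[3.0 * rng.gauss(0, 1), 2.0 * rng.gauss(0, 1),
         0.1 * rng.gauss(0, 1), 0.1 * rng.gauss(0, 1)]
        for _ in range(200)]
coords = project_on_pc_plane(data)
```

A plot of `coords` is the analogue of figure 13a: the two dimensional view in which the principal curve and the data can be displayed together.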


8. Extension to higher dimensions - principal surfaces

We have had some success in extending the definitions and algorithms for curves to cater for two dimensional (globally parametrized) surfaces.

A continuous 2-dimensional globally parametrized surface in ℝ^p is a function f : Λ → ℝ^p for Λ ⊆ ℝ², where f is a vector of continuous functions:

f(λ) = (f_1(λ_1, λ_2), f_2(λ_1, λ_2), …, f_p(λ_1, λ_2))'.    (11)

|| Insert figure 14 about here ||

Let X be defined as before, and let f denote a smooth 2-dimensional surface in ℝ^p, parametrized over Λ ⊆ ℝ². Here the projection index λ_f(x) is defined to be the parameter value corresponding to the point on the surface closest to x.

The principal surfaces of h are those members of the class of smooth two dimensional surfaces which are self consistent:

E(X | λ_f(X) = λ) = f(λ) for a.e. λ.

Figure 14 illustrates the situation. We do not yet have a rigorous justification for these definitions, although we have had success in implementing an algorithm.

The algorithm is similar to the curve algorithm; two dimensional surface smoothers are used instead of one dimensional scatterplot smoothers. We refer the reader to Hastie (1984) for more details of principal surfaces, the algorithm to compute them, and examples.

9. Discussion

There have been a number of previous approaches to fitting nonlinear manifolds to multivariate data; in discussing them we will restrict ourselves to one dimensional manifolds, the case treated in this paper. A model similar to ours was suggested by Carroll (1969), who estimates polynomial coordinate functions.


The definition of principal curves as conditional expectations agrees with the intuitive notion of a curve passing through the middle of the data, and the fitted curve provides a genuine summary. From the operational point of view it is advantageous that there is no need to specify a parametric form for the coordinate functions. Because the curve is represented as a polygon, finding the optimal λ's is easy. These features make the application of principal curves to large data sets attractive.
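Because the fitted curve is stored as a polygon, computing the projection index λ for a point reduces to checking the closest point on each segment; a minimal sketch in the plane:

```python
import math

def project_to_polyline(vertices, point):
    """Projection index of `point` onto the polygonal curve through
    `vertices`: returns (arc length of the closest point, closest point)."""
    best = (float("inf"), 0.0, vertices[0])
    arc = 0.0
    for (x0, y0), (x1, y1) in zip(vertices, vertices[1:]):
        dx, dy = x1 - x0, y1 - y0
        seg = math.hypot(dx, dy)
        # parameter of the orthogonal projection, clipped to the segment
        t = ((point[0] - x0) * dx + (point[1] - y0) * dy) / (seg * seg)
        t = max(0.0, min(1.0, t))
        cx, cy = x0 + t * dx, y0 + t * dy
        d = math.hypot(point[0] - cx, point[1] - cy)
        if d < best[0]:
            best = (d, arc + t * seg, (cx, cy))
        arc += seg
    return best[1], best[2]

# closest point on the L-shaped polygon is (0.5, 0.0), at arc length 0.5
lam, closest = project_to_polyline(
    [(0.0, 0.0), (1.0, 0.0), (1.0, 1.0)], (0.5, -1.0))
```

The cost is linear in the number of vertices per point, which is what makes the projection step of the algorithm cheap in practice.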

Shepard and Carroll (1966) proceed from the assumption that the p dimensional observation vectors lie exactly on a smooth one dimensional manifold. In this case it is possible to find parameter values λ_1, …, λ_n such that for each one of the p coordinates, x_ij varies smoothly with λ_i. The basis of their method is a measure for the degree of smoothness of the dependence of x_ij on λ_i. This measure of smoothness, summed over the p coordinates, is optimized with respect to the λ's: one finds those values of λ_1, …, λ_n which make the dependence of the coordinates on the λ's as smooth as possible.

We will not go into the definition and motivation of the smoothness measure; it is quite subtle, and we refer the interested reader to the original source. We just wish to point out that instead of optimizing smoothness, one could optimize a combination of smoothness and fidelity to the data as described in section 5.5, which would lead to modeling the coordinate functions as spline functions and should allow the method to better deal with noise in the data.

    In view of this previous work, what do we think is the contribution of the present paper?

The model of Etezadi-Amoli and McDonald (1983) is the same as Carroll's, but they use different goodness-of-fit measures. Their goal is to minimize the off-diagonal elements of the error covariance matrix.


Our definition of principal curves is a natural generalization of the definition of linear principal components. This close connection is further emphasized by the fact that linear principal curves are principal components, and that the algorithm converges to the largest principal component if conditional expectations are replaced by least squares straight lines.

Appendix - Proofs of Propositions.

Assumptions

Throughout the appendix we make the following assumptions:

Denote by X a random vector in ℝ^p with density h and finite second moments. Let f denote a smooth (C^∞) unit speed curve in ℝ^p parametrized over a closed, possibly infinite interval Λ ⊆ ℝ¹. We assume that f does not intersect itself (λ_1 ≠ λ_2 ⇒ f(λ_1) ≠ f(λ_2)) and has finite length inside any finite ball.

Note: Under these conditions, the set {f(λ), λ ∈ Λ} forms a smooth, connected 1-dimensional manifold diffeomorphic to the interval Λ. Any smooth, connected 1-dimensional manifold is diffeomorphic either to an interval or to a circle (Milnor, 1965). The results and proofs below could be slightly modified to cover the latter case (closed curves).

Existence of the projection index.

Existence of the projection index is a consequence of the following two lemmas.

Lemma 5.1: For every z ∈ ℝ^p and for any r > 0 the set Q = {λ | ||z − f(λ)|| ≤ r} is compact.

Proof: Q is closed, because ||z − f(λ)|| is a continuous function of λ. It remains to show that Q is bounded. Suppose it were not. Then there would exist an unbounded monotone sequence λ_1, λ_2, … with ||z − f(λ_i)|| ≤ r. Let B denote the ball around z with radius 2r. Consider the segment of the curve between f(λ_i) and f(λ_{i+1}). The segment either leaves and reenters B, or it stays entirely inside. This means that it contributes at least min(2r, |λ_{i+1} − λ_i|) to the length of the curve inside B. As there are infinitely many such segments, the curve would have infinite length inside B, which is a contradiction. □

Lemma 5.2: For every z ∈ ℝ^p there exists λ ∈ Λ for which

||z − f(λ)|| = inf_{μ ∈ Λ} ||z − f(μ)||.

Proof: Define r = inf_{μ ∈ Λ} ||z − f(μ)|| and B = {λ | ||z − f(λ)|| ≤ r + 1}. Since B is non-empty and compact (lemma 5.1), the infimum over B is attained, and it equals the infimum over Λ. □


Proof: The set {λ | ||x − f(λ)|| = d(x, f)} is non-empty (lemma 5.2) and compact (lemma 5.1), and therefore has a largest element. □

It is not hard to show that λ_f(x) is measurable; a proof is available upon request.

Stationarity of the distance function.

We will first establish some simple facts which are of interest in themselves.

Lemma 6.1: If f(λ_0) is a closest point to x, and λ_0 ∈ Λ⁰, the interior of the parameter interval, then x is in the normal hyperplane to f at f(λ_0):

⟨x − f(λ_0), f′(λ_0)⟩ = 0.

Proof: The derivative of the squared distance ||x − f(λ)||² is defined for λ_0 in the interior of the interval, and must vanish at a minimum. □

A point x is called an ambiguity point for a curve f if it has more than one closest point on the curve: card{λ | ||x − f(λ)|| = d(x, f)} > 1.

Let A denote the set of ambiguity points. Our next goal will be to show that A is measurable and has measure 0.

Define M_λ, the orthogonal hyperplane to f at λ, by

M_λ = {x | ⟨x − f(λ), f′(λ)⟩ = 0}.

By lemma 6.1, every point x whose closest point f(λ) lies in the interior of the curve lies in the corresponding hyperplane M_λ. It will be useful to define a mapping that maps Λ × ℝ^(p−1) onto ∪_λ M_λ. Choose p − 1 smooth vector fields n_1(λ), …, n_(p−1)(λ) such that for every λ the vectors f′(λ), n_1(λ), …, n_(p−1)(λ) are orthogonal. It is well known that such vector fields do exist. Define χ : Λ × ℝ^(p−1) → ℝ^p by

χ(λ, v) = f(λ) + Σ_i v_i n_i(λ),

and set M = χ(Λ, ℝ^(p−1)), the set of all points in ℝ^p lying in some hyperplane M_λ for some point on the curve. The mapping χ is smooth, because f and n_1, …, n_(p−1) are assumed to be smooth.

We shall now present a few observations which simplify showing that A has measure 0.

Lemma 6.2: P(A ∩ M^c) = 0.

Proof: According to lemma 6.1, an ambiguity point x ∉ M must have all its closest points at the boundaries of the parameter interval. This is only possible if x lies in the normal hyperplane at a boundary point of the interval; the (at most two) such hyperplanes form a set of measure 0. □

Lemma 6.3: Let E be a measurable set of measure 0. To show that A has measure 0, it is then sufficient to show that the set of ambiguity points in M \ E has measure 0.


However, some additional technical care is needed at the boundaries of the interval.


In the following, let G_B denote the class of smooth curves g parametrized over Λ, with ||g(λ)|| ≤ 1 and ||g′(λ)|| ≤ 1. For g ∈ G_B, define f_t(λ) = f(λ) + t g(λ). It is easy to see that f_t has finite length inside any finite ball, and that for t < 1, λ_(f_t) is well defined. Moreover, we have

Lemma 4.1: If z is not an ambiguity point for f, then

lim_(t→0) λ_(f_t)(z) = λ_f(z).

Proof: We have to show that for every ε > 0 there exists δ > 0 such that for all t < δ, |λ_(f_t)(z) − λ_f(z)| < ε. Set C = Λ ∩ (λ_f(z) − ε, λ_f(z) + ε)^c and d_C = inf_(λ∈C) ||z − f(λ)||. The infimum is attained, and d_C > ||z − f(λ_f(z))||, because z is not an ambiguity point. Choose δ = (d_C − ||z − f(λ_f(z))||)/3. For every λ ∈ C and t < δ,

||z − f_t(λ)|| − ||z − f_t(λ_f(z))|| ≥ d_C − δ − ||z − f(λ_f(z))|| − δ = δ > 0,

and therefore λ_(f_t)(z) ∈ (λ_f(z) − ε, λ_f(z) + ε).

□

We are now ready to prove:

Proposition 4: The curve f is a principal curve of h iff

(d/dt) D²(h, f_t) |_(t=0) = 0    ∀ g ∈ G_B.

Proof: We use the dominated convergence theorem to show that we can interchange the orders of integration and differentiation in the expression

(d/dt) D²(h, f_t) = (d/dt) E ||X − f_t(λ_(f_t)(X))||².

We need to find a pair of integrable random variables that almost surely bound

Z_t = ( ||X − f_t(λ_(f_t)(X))||² − ||X − f(λ_f(X))||² ) / t

for all sufficiently small t > 0.

Now

Z_t ≤ ( ||X − f_t(λ_f(X))||² − ||X − f(λ_f(X))||² ) / t,

because λ_(f_t)(X) minimizes ||X − f_t(λ)||.

Expanding the first norm, we get

Z_t ≤ −2 ⟨X − f(λ_f(X)), g(λ_f(X))⟩ + t ||g(λ_f(X))||²,


and thus, using the Cauchy-Schwarz inequality and the assumption that ||g|| ≤ 1,

Z_t ≤ 2 ||X − f(λ_f(X))|| + 1
    ≤ 2 ||X − f(λ_0)|| + 1

∀ t < 1 and arbitrary λ_0. As ||X|| was assumed to be integrable, so is ||X − f(λ_0)||, and therefore so is the upper bound for Z_t.

Similarly we have

Z_t ≥ ( ||X − f_t(λ_(f_t)(X))||² − ||X − f(λ_(f_t)(X))||² ) / t.    (13)

Expanding the first norm we get

Z_t ≥ −2 ⟨X − f(λ_(f_t)(X)), g(λ_(f_t)(X))⟩ + t ||g(λ_(f_t)(X))||²
    ≥ −2 ||X − f(λ_(f_t)(X))||
    ≥ −2 ||X − f(λ_0)||,    (14)

which is once again integrable. By the dominated convergence theorem, the interchange is justified. Moreover, from (13) and (14), and because f and g are continuous functions, we see that the limit lim_(t→0) Z_t exists whenever λ_(f_t)(X) is continuous in t at t = 0. We have proved this continuity for a.e. x in lemma 4.1. This limit is given by

lim_(t→0) Z_t = −2 ⟨X − f(λ_f(X)), g(λ_f(X))⟩,

by (13) and (14). Denoting the distribution of λ_f(X) by h_λ, we get

(d/dt) D²(h, f_t) |_(t=0) = −2 ∫ ⟨E(X | λ_f(X) = λ) − f(λ), g(λ)⟩ h_λ(λ) dλ.

If f(λ) is a principal curve of h, then by definition E(X | λ_f(X) = λ) = f(λ) for a.e. λ, and thus

(d/dt) D²(h, f_t) |_(t=0) = 0    ∀ g ∈ G_B.

Conversely, suppose that (d/dt) D²(h, f_t) |_(t=0) = 0 for all g ∈ G_B. Consider each coordinate separately, and re-express the condition as

∫ (E(X_j | λ_f(X) = λ) − f_j(λ)) g_j(λ) h_λ(λ) dλ = 0    for j = 1, …, p.

If E(X_j | λ_f(X) = λ) ≠ f_j(λ) on a set of positive h_λ-measure for some j, a suitable choice of g_j would make the integral non-zero. This implies that E(X | λ_f(X) = λ) = f(λ) for a.e. λ, and f is a principal curve of h. □


References

Becker, R., Chambers, J. and Wilks, A. (1988), The New S Language, Wadsworth, New York.
Boynton, R.M. and Gordon, J. (1965), "Bezold-Brücke Hue Shift Measured by Color-Naming Technique", Journal of the Optical Society of America, 55, 78-86.
Carroll, D.J. (1969), "Polynomial Factor Analysis", Proceedings of the 77th Annual Convention, APA, 103-104.
Cleveland, W.S. (1979), "Robust Locally Weighted Regression and Smoothing Scatterplots", Journal of the American Statistical Association, 74, 829-836.


Efron, B. (1982), "The Jackknife, the Bootstrap and Other Resampling Plans", SIAM-CBMS, 38.
Etezadi-Amoli, J. and McDonald, R.P. (1983), "A Second Generation Nonlinear Factor Analysis", Psychometrika, 48, #3, 315-342.
Golub, G.H. and van Loan, C. (1979), "Total Least Squares", in Smoothing Techniques for Curve Estimation, Proceedings, Heidelberg, Springer Verlag.
Hart, J. and Wehrly, T. (1986), "Kernel Regression Estimation Using Repeated Measurement Data", Journal of the American Statistical Association, 81, 1080-1088.
Hastie, T.J. (1984), "Principal Curves and Surfaces", Lab. for Comp. Stat. tech. rep. #11 and Ph.D. thesis, Statistics Department, Stanford University.
Milnor, J.W. (1965), Topology from the Differentiable Viewpoint, University Press of Virginia, Charlottesville.
Shepard, R.N. and Carroll, D.J. (1966), "Parametric Representations of Non-Linear Data Structures", in Multivariate Analysis (Krishnaiah, P.R., ed), Academic Press, New York.
Silverman, B.W. (1985), "Some Aspects of the Spline Smoothing Approach to Non-Parametric Regression Curve Fitting", Journal of the Royal Statistical Society B, 47, 1-52.
Stone, M. (1974), "Cross-validatory Choice and Assessment of Statistical Predictions (with Discussion)", Journal of the Royal Statistical Society B, 36, 111-147.
Thorpe, J.A. (1979), Elementary Topics in Differential Geometry, Springer-Verlag, New York. Undergraduate Texts in Mathematics.
Wahba, G. and Wold, S. (1975), "A Completely Automatic French Curve: Fitting Spline Functions by Cross-Validation", Communications in Statistics, 4, 1-17.
Watson, G.S. (1964), "Smooth Regression Analysis", Sankhyā, Series A, 26, 359-372.


(a) The generating curve is a circle centered at the origin, the parameter is selected uniformly on it, and the errors are bivariate spherical gaussian; the smoother ensures that the fitted curve passes through the local mean of the points projecting onto it. The algorithm, applied to the bivariate spherical gaussian model and started at a line, finds a stationary circle with radius larger

than the generating curve.

FIGURE 11. (a) A rough schematic of the Stanford linear accelerator and the linear collider ring.

(b) The fitted coordinate functions for the magnet positions in a section of the SLC. The data represent residuals from the theoretical curve. Some (35%) of the deviations from the fitted curve were small enough that these magnets were not moved.

FIGURE 12. (a) Plot of the log assays for the inhouse and outside labs. The solid curve is the principal curve, the dashed curve the scatterplot smooth. (b) A band of 25 bootstrap curves. Each curve is the principal curve of a bootstrap sample; a bootstrap sample is obtained by randomly assigning residual vectors to the fitted values on the original principal curve (solid curve). The band of curves appears to be centered around the solid curve, and the spread of the curves gives an indication of its variance. (c) Another band of 25 bootstrap curves. Each curve is the principal curve of a bootstrap sample, based on the linear errors in variables regression line (solid line). This simulation is aimed at testing the null hypothesis of no kink. There is evidence that the kink is real, since the principal curve (solid curve) lies outside this band in the region of the kink.

FIGURE 13. (a) The 4 dimensional color data and the principal curve projected onto the first principal component plane. (b) The estimated coordinate functions plotted against the arc length of the principal curve. The arc length is a monotone function of wavelength.

    FIGURE 14. Each point on a principal surface is the average of the points that project there.

The figures follow; only scattered annotations survive in this extraction: "start: PC line - 12.91", "iteration 1: - 10.43", "iteration 4: - 2.58", "final - 1.55" (iterations of the algorithm); "collision chamber" and "linear accelerator" (figure 11a); the axis label "arc length (m)" (figure 11b); the axis label "inhouse assay" (figures 12a-c); "First principal axis" (figure 13a); and "f(λ)" (figure 14).