University of Washington
Abstract
Additive Principal Components : A Method for Estimating
Additive Equations with Small Variance From Multivariate Data
by Deborah J. Donnell
Chairperson of the Supervisory Committee: Professor Werner Stuetzle
Department of Statistics
Additive equations or additive principal components are a generalization of linear principal
components, with sums of arbitrary transformations replacing linear combinations of variables.
The presence of additive principal components with small variances indicates the concentration of
the observations around a possibly nonlinear manifold, implying strong dependencies between the
variables. Additive principal components thus have diagnostic applications: additive dependencies
among predictor variables of an additive regression model cause problems that are similar to
those caused by collinearity among predictors in a linear model.
Additive principal components are the solution of an eigenproblem in an appropriate function
space. An iterative algorithm is given for the sequential computation of the smallest additive
principal components, and convergence to the correct minimizing solution is shown. The estimation
technique is evaluated on data generated from certain symmetric distributions for which the
solution can be determined explicitly.
A transparent method for interpretation of the additive dependencies using dynamic graphics is
suggested. The effectiveness of this technique is illustrated on several data sets. An application of
the additive principal component as a diagnostic for instability of predictor transforms in additive
regression models is demonstrated.
Table of Contents
List of Figures  v
List of Tables  vii
Chapter 1: Introduction  1
Chapter 2: Definition and Theory of the Additive Principal Component  5
2.1 Introduction  5
2.2 Definition of the Smallest Additive Principal Component  6
2.3 Finding the Additive Principal Component  7
2.4 Further Additive Principal Components  17
2.5 A Null Distribution for Additive Principal Components  20
2.6 A Linear Characterization  21
2.7 Alternating Conditional Expectation Regression and Additive Principal Component Analysis  22
Chapter 3: Additive Principal Component Solutions for some Multivariate Distributions  26
3.1 Introduction  26
3.2 Distributions with Bivariate Symmetry  27
3.3 The Additive Principal Components of Distributions with Bivariate Symmetry  28
3.4 Polynomial Biorthogonality  32
3.5 Additive Principal Components of the Gaussian Distribution  33
3.6 Additive Principal Components of the Gegenbauer Distribution  35
3.7 Zero Variance Additive Principal Components for Clustered and Categorical Data  40
Chapter 4: Estimation of Additive Principal Components  42
4.1 Introduction  42
4.2 Algorithm Implementation Details  43
4.3 Algorithm Improvement: A Linear Principal Component Step  47
Chapter 5: Simulations of Additive Principal Component Estimation  49
5.1 Introduction  49
5.2 Evaluation Measures  50
5.3 Simulations using the Gaussian Distribution  53
5.4 Simulations using the Uniform Distribution on an Ellipsoid  63
5.5 Simulations using Manifolds defined by Specified Constraints  69
5.6 APC Estimation for Uncorrelated Variables  81
5.7 APC Estimation for Distributions with Exact Additive Dependencies  83
5.8 Conclusions  85
Chapter 6: Applied Additive Principal Component Analysis  91
6.1 Introduction  91
6.2 Interpretation Techniques for Data Analysis  92
6.3 Guidelines for Detecting Real Structure  101
6.4 The Infant Mortality Data  104
6.5 The Boston Housing Data  117
6.6 A Diagnostic for Additive Regression Transform Stability  126
Chapter 7: Literature Review  133
7.1 Linear Principal Component Analysis  133
7.2 Nonlinear Generalizations of Principal Component Analysis  137
7.3 Additive Models  141
Chapter 8: Conclusion  145
Bibliography  149
Appendix A: Statistical Programming on the Symbolics 36xx Lisp Machine  153
List of Figures
5.1 GAU-S1: Correlation Plots  55
5.2 GAU-S1: APC-function Estimation  58
5.3 GAU-S1: Variance of APC-function Estimation  59
5.4 GAU-S2: Correlation Plots  61
5.5 GAU-S2: APC-function Estimation for Smallest APC  62
5.6 UNI-S: Correlation Plots  66
5.7 UNI-S: APC-function Estimation  67
5.8 UNI-S: Variance of APC-function Estimation  68
5.9 SCM-S1: APC-function Estimation for Component 1  75
5.10 SCM-S1: APC-function Estimation for Component 2  76
5.11 SCM-S2: APC-function Estimation for Component 1  79
5.12 SCM-S2: APC-function Estimation for Component 2  80
5.13 Independent Gaussian: Estimates of the Three Smallest APCs  82
5.14 Uniform on Ball: Estimates of the Three Smallest APCs  84
5.15 Discrete APC: Estimates of the Two Smallest APCs  86
6.1 Interpretation Example: The APC-function plots  97
6.2 Interpretation Example: The APC-function plots  98
6.3 Interpretation Example: Added Variable Plots  99
6.4 Interpretation Example: The APC-function plots  100
6.5 TIM: The Additive Regression Models  106
6.6 TIM-5var: APC-functions of the smallest APC  108
6.7 TIM-5var: APC-functions of the second APC  109
6.8 TIM-5var: APC-functions for the third APC  110
6.9 TIM-4var: APC-functions for the smallest APC  112
6.10 TIM-4var: APC-functions for the second APC  113
6.11 TIM-4var: APC-functions for the third APC  114
6.12 TIM-4var: The residuals of the smallest APC  116
6.13 BH-small: The smallest APC-function plots  120
6.14 BH-small: The second APC-function plots  121
6.15 BH-small: The third APC-function plots  122
6.16 BH-small: The smallest APC outliers  123
6.17 BH-small: The ACE regression models  127
6.18 APC Diagnostic for TIM Regression: Smallest APC  131
6.19 APC Diagnostic for TIM Regression: Second smallest APC  132
List of Tables
5.1 GAU-S1: Correlations between True and Estimated APCs  54
5.2 GAU-S1: Loading Metric  54
5.3 GAU-S1: Eigenvalue and Variable Loadings  57
5.4 GAU-S1: Canonical Metric  60
5.5 GAU-S2: Eigenvalue and Variable Loadings  61
5.6 GAU-S2: Canonical Metric  62
5.7 UNI-S: Eigenvalue and Variable Loadings  64
5.8 UNI-S: Correlation between True and Estimated APCs  65
5.9 UNI-S: Loading Metric  65
5.10 UNI-S: Canonical Metric  66
5.11 SCM-S1: Eigenvalue and Variable Loadings  73
5.12 SCM-S1: Correlation between True and Estimated APCs  73
5.13 SCM-S1: Loading Metric  74
5.14 SCM-S1: Canonical Metric  74
5.15 SCM-S2: Correlation between True and Estimated APCs  77
5.16 SCM-S2: Loading Metric  77
5.17 SCM-S2: Eigenvalue and Variable Loadings  78
5.18 SCM-S2: Canonical Metric  78
6.1 TIM: The Additive Regression Models  105
6.2 TIM-5var: Eigenvalues and Variable Loadings  107
6.3 TIM-4var: Eigenvalues and Variable Loadings  112
6.4 BH-small: Eigenvalues and Variable Loadings  119
6.5 BH-full: Eigenvalues and Variable Loadings  124
Acknowledgements
I gratefully acknowledge the guidance, patience and assistance of Werner Stuetzle and
Andreas Buja, whose generosity in sharing their time and ideas provided an invaluable
resource for this dissertation.
I would like to express my appreciation to the entire staff and faculty of the Statistics
department, who provided funding throughout my program; with special thanks to Peter
Guttorp and Jon Wellner, from whom help was never sought in vain.
No thanks would be complete without mention of the warm and constant support of my
friends in Seattle: to my housemates Catherine, Anne, Scott, Kay, Anne, Stefan, Heather
and Robert for all the dinners, fun and companionship; to my fellow students, especially
Robert, Jeff, Nuala, Katrina, Keith, Gary and Russell for the never ending invitations to
coffee.
Finally, to Andrew, who was always prepared to endure the worst and celebrate the
best with me, my warmest and deepest thanks.
The research for this dissertation was completed while the author was supported by
DOE Grant number DE-FG06-85ER25006.
Gloria Deo - To God be the Glory
Chapter 1
Introduction
Computers have made it possible to collect and analyze ever larger data sets. This development
has created a need for new statistical methods. Small sample size necessarily
limits the complexity of models that can be fitted and of structure that can be reliably
detected; thus one is restricted to classical parametric methods like linear regression and
linear principal component analysis. On the other hand, large sample size allows the
detection of complicated structure, and the fitting of complex models. This means that
nonparametric methods for description and inference making fewer assumptions about the
underlying situation are called for.
These considerations are the reasons why in recent years there has been a surge of
interest in methods for nonparametric multiple regression, in particular the models of
additive regression and Alternating Least Squares (ALS) [YdLT76] or Alternating Conditional
Expectation (ACE) regression [BF85]. The former models the response as an
additive function of the predictors:

    Y ≈ Σ_{i=1}^p φ_i(X_i),

whereas ACE regression finds transformations φ_1, ..., φ_p of the predictors as well as a
transformation θ of the response:

    θ(Y) ≈ Σ_{i=1}^p φ_i(X_i).
This dissertation contributes to the development of methodology suitable for detecting
complex structure in data.
Our intent is to estimate additive equations from multivariate data which satisfy as
nearly as possible the constraint:

    Σ_{i=1}^p φ_i(X_i) = 0.

Such an additive constraint describes high-dimensional structure in the data. Recall the
linear structure implied by a linear constraint, l(x) = a'x = 0. If the data nearly satisfy
this constraint, they lie close to a linear manifold of co-dimension 1 (dimension p - 1). Analogously,
an additive constraint Σ φ_i(X_i) = 0 defines an additive manifold of co-dimension 1, and
data nearly satisfying this constraint lie near this additive manifold.
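As an illustrative sketch (assuming Python with numpy; the data and transformations are hypothetical), consider points scattered around the unit circle: the centered transformations φ_1(x) = x² and φ_2(y) = y² nearly satisfy an additive constraint, so their sum has variance close to zero once the component variances are scaled to sum to one.

    import numpy as np

    rng = np.random.default_rng(0)
    n = 1000
    theta = rng.uniform(0.0, 2.0 * np.pi, size=n)
    # Points near the unit circle: x^2 + y^2 is almost constant.
    x = np.cos(theta) + 0.02 * rng.standard_normal(n)
    y = np.sin(theta) + 0.02 * rng.standard_normal(n)

    # Candidate additive constraint: phi_1(x) + phi_2(y) ~ 0 after centering.
    phi1 = x**2 - np.mean(x**2)
    phi2 = y**2 - np.mean(y**2)

    # Rescale so that var(phi1) + var(phi2) = 1, as in the APC standardization.
    scale = np.sqrt(phi1.var() + phi2.var())
    phi1, phi2 = phi1 / scale, phi2 / scale

    print("sum of component variances:", phi1.var() + phi2.var())   # 1 by construction
    print("variance of the sum:       ", (phi1 + phi2).var())       # close to 0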
We present a method for estimating the transformations φ_1(X_1), φ_2(X_2), ..., φ_p(X_p)
describing an additive manifold close to the data. The additive equation is defined by
generalizing the definition of the linear principal component, resulting in the Additive
Principal Component.
Detecting high-dimensional structure in data is intrinsically a difficult task, even with
sophisticated graphical tools. The estimation of constraints will be an appropriate analysis
tool when the search for structure in the data is undirected, that is, no variables are
designated a priori as intrinsically more important, or dependent rather than independent.
Although the additive form of the equation places some restrictions on the surfaces that
can be modelled, they nevertheless are considerably more general than linear manifolds. If
they can be reliably estimated, and properly displayed and interpreted, additive principal
components have the potential to be an important tool for better understanding the
multivariate nature of data.
The importance of recognizing nonlinear dependencies among the predictor variables
when fitting additive regression models is analogous to the importance of detecting
collinearity patterns when fitting linear models [Sil69]. In a linear model, collinearity between
carriers results in inflated variance of the estimated regression coefficients. It is then
not possible to infer the separate influence of the collinear explanatory variables on the
response variable. In the additive case, similar difficulties arise. Suppose we fit an additive
model Y ≈ Σ_{j=1}^p φ_j(X_j) to the data. We often want to make both qualitative and
quantitative statements about the contributions of each X_j in the model, based on the
estimated φ_j. Consider the analogy to the extreme case of exact collinearity, where there
exist functions of the variables such that Σ g_j(X_j) = 0. In this situation, the alternative
fit:

    Y ≈ Σ_{j=1}^p (φ_j + g_j)(X_j),

is indistinguishable from the initial one. If the data come close to satisfying this constraint,
some or all of the estimated φ_j will not be stable. We are clearly in no position to interpret
the component functions of the fitted model when this is the case. A method which enables
us to examine how close the data come to satisfying an additive constraint would thus be
a diagnostic check for global stability of the transforms in additive or ACE regression.
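A minimal numeric sketch of this indistinguishability (assuming Python with numpy; the component functions are hypothetical): when the predictors satisfy an exact additive constraint Σ g_j(X_j) = 0, adding the g_j to any set of fitted component functions leaves the fitted values unchanged.

    import numpy as np

    rng = np.random.default_rng(1)
    n = 500
    x1 = rng.standard_normal(n)
    x2 = rng.standard_normal(n)
    x3 = -(x1 + x2)                       # exact additive constraint: x1 + x2 + x3 = 0

    # Hypothetical fitted component functions, and a g_j triple summing to zero on the data.
    phi = [np.sin, lambda x: 0.5 * x, lambda x: x**2 - 1.0]
    g   = [lambda x: 2.0 * x, lambda x: 2.0 * x, lambda x: 2.0 * x]

    fit_a = phi[0](x1) + phi[1](x2) + phi[2](x3)
    fit_b = (phi[0](x1) + g[0](x1)) + (phi[1](x2) + g[1](x2)) + (phi[2](x3) + g[2](x3))

    # The two "models" produce the same fitted values, so the phi_j are not identifiable.
    print(np.max(np.abs(fit_a - fit_b)))   # essentially zero (floating-point precision)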
Additive principal components provide a method for detecting high dimensional structure
in multivariate data, and thus discovering the implied additive dependencies between
the variables. As a natural extension to linear principal component methodology, there
are many potential applications for this data analysis technique.
This dissertation considers first the theoretical properties of additive principal components,
followed by a study of the estimation problem.
We begin in the next chapter with a formal definition of the Additive Principal Component
(APC). An algorithm for finding the APCs leads naturally to consideration of their
theoretical properties. In the third chapter, continuing the theoretical development, we
find explicit APC solutions for a class of elliptically symmetric distributions.
The algorithm developed in Chapter 2 provides a method of estimation for the APCs.
Finite sample issues arising in the implementation of the algorithm are discussed in Chapter 4.
The fifth chapter draws on the known APC solutions derived in the third chapter
to study the finite sample algorithm via simulation. The sixth chapter addresses the use
of APCs in the data analysis context, discussing methods of interpretation involving
the use of dynamic graphics, and then demonstrating their use on two real data sets. A
dynamic graphical diagnostic for additive and ACE regression is also explained.
The seventh chapter reviews linear principal component techniques, and then discusses
some other nonlinear generalizations of principal components that have been developed.
The dissertation concludes with a brief summary of the research presented.
Chapter 2
Definition and Theory of the
Additive Principal Component
2.1 Introduction
We begin by defining the smallest additive principal component. Then a simple intuitive
idea suggests an algorithm for finding the APC, which we analyze for the linear case.
The insight gained from the linear solution is applied to the additive case, resulting in
the characterization of the APC as an eigenfunction. Due to this characterization, the
properties of the APC and the algorithm are more fully understood, and we deduce a
modification of the algorithm for which convergence, at least in the population case,
can be shown. Throughout this chapter, we concern ourselves only with the population
properties of the algorithm.
Before proceeding further, we remark that estimation of additive equations is not invariant
under rescaling of the variables. This is analogous to the scaling issue in linear
principal component analysis. Throughout this dissertation we assume the random variables
have been standardized, so X_1, X_2, ..., X_p have E(X_i) = 0 and var(X_i) = 1.
2.2 Definition of the Smallest Additive Principal Component

Our objective is to determine whether random variables X_1, X_2, ..., X_p come close to
satisfying an additive constraint Σ φ_i(X_i) = 0, for some set of transformations φ_1, φ_2, ..., φ_p.
First consider the classical version of this problem where the functions φ_i are restricted
to be linear, that is, Σ φ_i(X_i) = Σ a_i X_i. The aim then is to find the linear combination
of the variables that is closest to zero.
One possible criterion is to find the vector a minimizing the variance of the sum,
var(Σ a_i X_i) = var(X·a). To avoid the trivial solution, a = 0, the constraint Σ a_i² =
Σ var(a_i X_i) = 1 is imposed. The minimum occurs for a an eigenvector for the smallest
eigenvalue of cov(X) = Σ. The random variable Σ a_i X_i is called a smallest principal
component of X. The corresponding linear function l_1(x) = a·x defines a linear manifold
of co-dimension 1 in p-space through l_1(x) = 0. It can be shown that this manifold
minimizes the expected squared distance from the observations to any linear manifold of
co-dimension 1. Hence defining a using this geometric criterion, of minimizing distance to
a linear manifold, results in the same solution as finding a minimizing the variance.
For the linear case, the above three characterizations of the vector a are equivalent:
• Σ a_i X_i has minimal variance among all linear combinations of the variables with
Σ a_i² = 1.
• a·x = 0 defines the manifold of co-dimension 1 lying closest to the data.
• a is an eigenvector for the smallest eigenvalue of Σ.
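A small numerical check of the first and third characterizations (assuming Python with numpy; the simulated data are hypothetical): the eigenvector for the smallest eigenvalue of the covariance matrix gives the unit-norm linear combination of smallest variance.

    import numpy as np

    rng = np.random.default_rng(2)
    A = rng.standard_normal((4, 4))
    X = rng.standard_normal((2000, 4)) @ A
    X = (X - X.mean(0)) / X.std(0)               # standardized variables, as assumed throughout

    S = np.cov(X, rowvar=False)
    evals, evecs = np.linalg.eigh(S)
    a = evecs[:, 0]                              # eigenvector for the smallest eigenvalue

    print("smallest eigenvalue:      ", evals[0])
    print("variance of sum a_i X_i:  ", np.var(X @ a))   # ~ the smallest eigenvalue

    # No random unit vector should achieve a smaller variance.
    V = rng.standard_normal((200, 4))
    V /= np.linalg.norm(V, axis=1, keepdims=True)
    print("best of 200 random vectors:", min(np.var(X @ v) for v in V))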
We return to the problem posed for the more general additive case: find nontrivial
functions φ_i making the sum of the transformed variables, Σ φ_i(X_i), "closest" to zero.
We need to decide on a criterion that makes the notion of closeness exact and uniquely
defines the set of transformations Φ = (φ_1, φ_2, ..., φ_p).
A natural approach is to extend the definition of the smallest principal component
of X. Using the minimum variance characterization, we could define Φ as the vector of
transformations of the variables minimizing var Σ φ_i(X_i) subject to Σ var φ_i(X_i) = 1.
Alternatively, we could use the geometric characterization and determine the additive
manifold described by Σ φ_i(X_i) = 0 which minimizes the expected squared distance from
the observations to any additive manifold of co-dimension 1.
Any solution Φ(X) = Σ_i φ_i(X_i) to the minimum variance criterion will define through
Φ(x) = 0 an additive manifold which lies close to the data. However, unlike the linear
case, this additive manifold and the additive manifold closest to the data in the geometric
sense will not be the same.
We choose to use the minimum variance approach, which is both computationally and
theoretically more tractable.

Definition
The smallest additive principal component of X = (X_1, ..., X_p) is the random
variable Φ(X) = Σ_{i=1}^p φ_i(X_i) minimizing var Σ_{i=1}^p φ_i(X_i) subject to
Σ_{i=1}^p var φ_i(X_i) = 1.

Note that the constraint Σ var φ_i = 1 is indeed the natural analogue to the linear definition.
If φ_i(X_i) = a_i X_i, then Σ var φ_i(X_i) = Σ var(a_i X_i) = Σ a_i² var X_i = Σ a_i² = 1.
At this point we present the notation and terminology conventions we will use throughout
this dissertation.
• Φ(X) = Σ_i φ_i(X_i) denotes the Additive Principal Component, abbreviated APC.
• φ_i(X_i) is referred to as the APC-function for the i-th variable.
• Φ = (φ_1, ..., φ_p) denotes the vector of transformations defining the APC.
2.3 Finding the Additive Principal Component
2.3.1 A Naive Algorithm
Our intent is to find functions Φ = (φ_1, φ_2, ..., φ_p) minimizing var Σ φ_i(X_i) subject to
Σ var φ_i(X_i) = 1. Rewriting the variance of the sum,

    var Σ_i φ_i(X_i) = E[ φ_1(X_1) - ( - Σ_{j≠1} φ_j(X_j) ) ]²,

suggests a straightforward componentwise minimization scheme, in the spirit of ACE
[BF85]. Let us ignore the constraint Σ var φ_i(X_i) = 1 for the moment. If we assume
φ_2, ..., φ_p to be known, then the minimizing transformation of X_1 is given by:

    φ_1 ← E^{X_1}( - Σ_{j≠1} φ_j(X_j) ).

(Here, E^{X_1} ≡ E(· | X_1).) This is then done for each variable in turn, yielding a new set
of transformations. The constraint is reinstated by rescaling using Σ var φ_i of the new
functions. This suggests the following algorithm.
Naive algorithm

    Choose initial transformations φ_1^[0], φ_2^[0], ..., φ_p^[0]
    Repeat for N = 1, 2, ...                                           (Outer loop)
        Do for i = 1, ..., p                                           (Inner loop)
            φ_i^[N] ← E^{X_i}( - Σ_{j≠i} φ_j^[N-1](X_j) )
        (φ_1^[N], φ_2^[N], ..., φ_p^[N]) ← (c φ_1^[N], c φ_2^[N], ..., c φ_p^[N]),  c chosen so Σ var φ_i^[N] = 1
    Until var Σ φ_i^[N] converges.
Notice that the iteration scheme employed here in the inner loop is different from that used
in the ACE algorithm of Breiman and Friedman [BF85]. Breiman and Friedman replace
each φ_i by its new transformation as the inner loop proceeds, whereas we obtain the new
p-tuple using only the previous p-tuple throughout the entire inner loop. This provides
us with a natural way of restandardizing in the outer loop and will allow a transparent
analysis of the convergence of the algorithm.
2.3.2 Analysis of the Algorithm for Linear Transformations

The problem in the linear case is to find the vector a minimizing the variance of the
corresponding linear combination of X, that is, to minimize var Σ l_i(X_i) = var Σ a_i X_i
subject to Σ var(a_i X_i) = Σ a_i² = 1. The solution is an eigenvector for the smallest
eigenvalue of Σ, hence we expect to establish the convergence of the naive algorithm to
this vector.
Consider the first step of an inner loop. Following the previous section, we initially
ignore the side condition and assume a_2, ..., a_p to be fixed. The value of a_1 minimizing
var( (-Σ_{j≠1} a_j X_j) - a_1 X_1 ) is the coefficient of the simple linear regression of -Σ_{j≠1} a_j X_j
on X_1:

    a_1 = E[ (-Σ_{j≠1} a_j X_j) X_1 ] / E X_1².

Assuming that this step is part of the inner loop of an algorithm in which we compute
a_i^(new) from (a_1^(old), a_2^(old), ..., a_p^(old)), and also making use of the assumptions E X_i = 0 and
var X_i = 1, we can write:

    a_1^(new) = -Σ_{j≠1} a_j^(old) cov(X_j, X_1)
              = a_1^(old) - Σ_{j=1}^p a_j^(old) cov(X_j, X_1).

Using this equation, the inner loop iteration over all variables written in vector notation
is:

    a^(new) = a^(old) - Σ a^(old) = (I - Σ) a^(old).

We impose the constraint Σ a_i² = 1 by rescaling by the norm factor √( Σ (a_i^(new))² ) =
||(I - Σ) a^(old)|| after the inner loop is completed. Hence, after the k-th iteration we have:

    a^[k] = (I - Σ) a^[k-1] / ||(I - Σ) a^[k-1]|| = (I - Σ)^k a^[0] / ||(I - Σ)^k a^[0]||.

Each iteration applies the matrix I - Σ and then restandardizes, so the algorithm is
simply the power method for computing eigenvectors. It is easily shown to converge to
the eigenvector for the largest absolute eigenvalue of I - Σ. This may not be the vector
we are seeking, which is rather the eigenvector for the smallest eigenvalue of Σ. Suppose, for the sake
of illustration, that the largest eigenvalue of Σ is 3 and the smallest is 0.5; then the
eigenvalue of I - Σ with the largest absolute value is |1 - 3| > |1 - 0.5|, so the algorithm
will converge to the eigenvector for the largest eigenvalue of Σ instead of the smallest.
However, since the eigenvalues of Σ lie between 0 and p, the algorithm can be modified
to ensure convergence to the desired eigenvector.
Observe that if a p × p matrix A has an eigenvalue λ with corresponding eigenvector v,
then the matrix pI - A has the eigenvalue p - λ with the same eigenvector, v. So if A
has the increasing sequence of eigenvalues 0 ≤ λ_1 ≤ ... ≤ λ_p ≤ p, then pI - A has the
positive decreasing sequence of eigenvalues p ≥ p - λ_1 ≥ ... ≥ p - λ_p ≥ 0. It follows
that an eigenvector for the largest (absolute) eigenvalue of pI - Σ is an eigenvector for
the smallest eigenvalue of Σ.
We can ensure convergence to the smallest eigenvector of Σ by applying the matrix
pI - Σ at each step, instead of the naive I - Σ. Then the new a_i in the inner loop is:

    a_i^(new) ← p a_i^(old) - Σ_{j=1}^p a_j^(old) cov(X_j, X_i).

To summarize, the modified algorithm, which applies pI - Σ instead of I - Σ, will
converge to a correct solution of our problem in the linear case.
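A short sketch of this modified power iteration in the linear case (assuming Python with numpy; the correlation matrix is simulated): repeatedly applying pI - Σ and renormalizing drives a toward the eigenvector for the smallest eigenvalue of Σ.

    import numpy as np

    rng = np.random.default_rng(3)
    p = 4
    B = rng.standard_normal((p, p))
    Sigma = B @ B.T
    d = np.sqrt(np.diag(Sigma))
    Sigma = Sigma / np.outer(d, d)               # a correlation matrix, eigenvalues in [0, p]

    a = rng.standard_normal(p)
    a /= np.linalg.norm(a)
    for _ in range(500):
        a = (p * np.eye(p) - Sigma) @ a          # apply pI - Sigma ...
        a /= np.linalg.norm(a)                   # ... and restandardize (power method)

    evals, evecs = np.linalg.eigh(Sigma)
    print("variance of a'X:      ", a @ Sigma @ a)          # ~ smallest eigenvalue
    print("smallest eigenvalue:  ", evals[0])
    print("alignment with eigvec:", abs(a @ evecs[:, 0]))    # ~ 1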
2.3.3 The Hilbert Space of the Additive Principal Component

In the linear setting, the problem of minimizing the variance is easily solved analytically
by re-expressing the criterion as a minimization in R^p. That is,

    var Σ a_i X_i = a' E(X'X) a = a' Σ a = (a, Σa).    (2.1)

The minimization of var Σ a_i X_i subject to Σ a_i² = 1 is equivalent to minimizing (a, Σa)
subject to (a, a) = 1; the minimizing vector is then found by appealing to the Cauchy-Schwarz
inequality.
The minimization problem in the additive case can be re-expressed in an analogous
manner; however, we first need to establish an appropriate formal framework. The additive
principal component is defined by a vector of functions Φ = (φ_1, ..., φ_p), so the natural
analogue of R^p is a product space of L_2 functions.
For i = 1, ..., p, define the function spaces:

    H(X_i) = { φ_i : E φ_i(X_i) = 0, E φ_i²(X_i) < ∞ }.

Each of these is a Hilbert space with inner product (φ_i, φ_i') = E( φ_i(X_i) φ_i'(X_i) ) and
corresponding squared norm ||φ_i||² = E φ_i²(X_i) = var φ_i(X_i).
Define the cartesian product space H^p = H(X_1) × H(X_2) × ... × H(X_p). The natural
inner product on H^p is:

    (Φ, Φ')_* = Σ_i (φ_i, φ_i') = Σ_i E( φ_i φ_i' ),

with corresponding norm:

    ||Φ||_*² = Σ_i E( φ_i² ) = Σ_i var φ_i.

In Breiman and Friedman [BF85] it is established that H^p is a Hilbert space for which
the natural embeddings of H(X_1), H(X_2), ..., H(X_p) are all closed linear subspaces. Also,
the norm topology of H^p coincides with the product topology inherited from the factors.
The smallest APC belongs to the cartesian sum space:

    H_+ = H(X_1) ⊕ ... ⊕ H(X_p)
        = { f(X) = Σ_i f_i(X_i) : Σ var f_i < ∞, f_i ∈ H(X_i) }.

Now that the formal notions for the APC are established, we proceed to the eigenfunction
characterization.
2.3.4 The Eigenfunction Characterization
The eigen properties of the APC follow when we reformulate the definition of the APC as
a minimization problem in HP. We begin by characterizing the estimator resulting from
the naive algorithm of section 2.3.1.
Recall that in the inner loop we obtain a new estimate of each φ_i by:

    φ_i^(new) ← E^{X_i}( - Σ_{j≠i} φ_j^(old) ).    (2.2)

The conditional expectation operator E^{X_i}, denoted P_i hereafter, is a projection mapping
H_+ onto the subspace H(X_i).
Since conditional expectation is a linear operator, we can rewrite,

    φ_i^(new) = φ_i^(old)(X_i) - Σ_{j=1}^p P_i( φ_j^(old)(X_j) );   i = 1, ..., p.

This can be written in an "operator matrix" notation, illustrating the similarity to the
linear case:

    Φ^(new)(X) = (I - P) Φ^(old),

where

    P = [ I     P_1   ...   P_1 ]
        [ P_2   I     ...   P_2 ]
        [ ...               ... ]
        [ P_p   P_p   ...   I   ]

and I denotes the identity mapping.
There is a slight abuse of notation in this representation concerning the domain of the
operator P_i. P_i is defined on the domain H_+. In the above form, P_i maps from the
subspace domain H(X_j) to H(X_i). Strictly speaking, the dependence of this restricted
operator should be indicated in the matrix representation of P; however, if P_i is simply
considered as shorthand for E^{X_i}(·), no confusion will result.
After k iterations of the outer loop,

    Φ^[k] = (I - P) Φ^[k-1] / ||(I - P) Φ^[k-1]||_*  =  (I - P)^k Φ^[0] / ||(I - P)^k Φ^[0]||_*.

The naive algorithm is seen to be the power method applied to the operator I - P, so it
will converge to the eigenfunction for the largest absolute eigenvalue of I - P, if it exists.
The similarity between the above representation (2.2) and the linear analysis of section
2.3.2 suggests that the vector of APC-functions is an eigenfunction of the operator P on
H^p defined above. P maps the additive function formed from summing the elements of Φ onto
each of its conditional expectations, that is,

    [P Φ]_i = P_i Σ_j φ_j(X_j).
As a first step towards establishing this characterization, we have the following simple
identity.

Lemma 2.1
    (Φ, P Φ)_* = var Σ_i φ_i.

Proof:
    (Φ, P Φ)_* = Σ_i ( φ_i, P_i Σ_j φ_j )
               = Σ_i ( φ_i, Σ_j φ_j )
               = ( Σ_i φ_i, Σ_j φ_j )
               = var Σ_i φ_i.

The second equality follows from the self-adjoint property of projection operators:
(φ_i, P_i f) = (P_i φ_i, f) = (φ_i, f). ∎
This identity provides the crux of the argument. Noting that Σ var φ_i = ||Φ||_*², the
definition of the smallest additive principal component has an equivalent characterization
as the solution to an extremum problem in H^p.

Theorem 2.1 A function vector Φ ∈ H^p minimizes (Φ, P Φ)_* subject to the constraint
||Φ||_*² = 1 iff the set of transformations {φ_1, φ_2, ..., φ_p} minimizes var Σ φ_i(X_i) under
Σ var φ_i(X_i) = 1.

Proof: An immediate consequence of Lemma 2.1. ∎

Notice the analogy between these equivalent characterizations and the two characterizations
of the linear solution, equation (2.1).
It is a well known fact from the theory of self-adjoint operators [Jor70, Th 6.7 p.125] that
Φ ∈ H^p minimizing (Φ, P Φ)_* subject to ||Φ||_*² = 1, or equivalently minimizing the
Rayleigh quotient

    (Φ, P Φ)_* / ||Φ||_*²,

is an eigenfunction for the smallest eigenvalue of P (where it exists). Thus, once we have
shown P is self-adjoint, the following eigen characterization of the vector of APC-functions
is established.

Theorem 2.2 The smallest eigenfunction of the operator P, if it exists, is a vector of
APC-functions for the smallest additive principal component of X.

An immediate corollary to Theorems 2.1 and 2.2 is:

Corollary 2.1 Suppose Φ = (φ_1, φ_2, ..., φ_p) is a smallest eigenfunction of P belonging
to the eigenvalue λ_min, with ||Φ||_* = 1. Then:
1. The smallest APC of X is Φ(X) = Σ_i φ_i(X_i).
2. The variance of the smallest APC is λ_min.

Proof: The first is immediate; for the second,
var Σ_i φ_i = (Φ, P Φ)_* = λ_min (Φ, Φ)_* = λ_min. ∎
We now turn our attention to establishing the properties of the operator P.

Lemma 2.2 P is a bounded, self-adjoint, non-negative operator in H^p.

Proof: P is bounded:

    ||P Φ||_*² = Σ_i || P_i Σ_j φ_j ||²
              ≤ Σ_i || Σ_j φ_j ||²
              = p || Σ_j φ_j ||²
              ≤ p ( Σ_j ||φ_j|| )².

The maximum of Σ_j ||φ_j|| under the constraint Σ_j ||φ_j||² = 1 is attained at ||φ_j|| = p^{-1/2}.
Hence,

    ||P Φ||_*² ≤ p ( Σ_j ||φ_j|| )² ≤ p².

The inequality is sharp, with equality occurring when X_i = X_j ∀ i, j.

P is self-adjoint:

    (Φ, P Ψ)_* = Σ_i ( φ_i, P_i Σ_j ψ_j )
              = Σ_i ( φ_i, Σ_j ψ_j )
              = ( Σ_i φ_i, Σ_j ψ_j ).

From the symmetry of this expression it follows that (Φ, P Ψ)_* = (P Φ, Ψ)_*.

P is non-negative: by Lemma 2.1, (Φ, P Φ)_* = var Σ φ_i ≥ 0. ∎
Finally, we address the existence of the smallest eigenspace.
Finding linear principal components is a simple finite dimensional problem, with Σ
having at most p distinct eigenvalues. Finding additive principal components, where we
are solving for a set of L_2 functions, is generally not a finite dimensional problem: for
continuous variables the spectrum of the operator P will not necessarily be finite or even
discrete. We do know, however, since P is bounded, that the spectrum of P is contained
in the closed interval [0, p]. The theory of bounded self-adjoint operators reveals that
there are potential problems: P may have a non-trivial continuous spectrum or may have
spectral values that are not eigenvalues. We can rule out these undesirable possibilities
by adopting suitable compactness assumptions, following Breiman and Friedman [BF85].

Assumption: The restricted operators P_i|_{H(X_k)} : H(X_k) → H(X_i) are compact
for k ≠ i, i = 1, ..., p.

A sufficient condition for compactness to hold is given in Breiman and Friedman
[BF85], and the implications of assuming compactness are more fully discussed by Buja
[Buj85].
The assumption implies that the image of the unit ball is relatively compact. Even
under this assumption, P itself is not compact: suppose X_1 is independent of X_2, ..., X_p;
then the bounded set { Φ : Φ = (φ_1(X_1), 0, ..., 0)', ||Φ||_* ≤ 1 } is preserved under P but not
relatively compact in H^p. However, we can show:

Lemma 2.3 The operator P - I : H^p → H^p is compact.
Proof: Let B denote the unit ball in H^p, and B_i the unit ball in H(X_i).
P - I = Σ_i Q_i, where Q_i : H^p → H^p is defined by

    Q_i(Φ) = ( P_1 φ_i, ..., P_{i-1} φ_i, 0, P_{i+1} φ_i, ..., P_p φ_i )'.

It is enough to show that every Q_i is compact, by Jorgens [Jor70, Th 5.10 p.98]. Since
B ⊂ B_1 × ... × B_p, compactness of Q_i is established if Q_i(B_1 × ... × B_p) is shown to be
relatively compact.
By assumption, P_j(B_i) is relatively compact in H(X_j) ∀ j ≠ i, hence

    Q_i(B_1 × ... × B_p) ⊂ P_1(B_i) × ... × P_{i-1}(B_i) × {0} × P_{i+1}(B_i) × ... × P_p(B_i)

is relatively compact in H^p. ∎

The assumption of compactness implies the spectrum of P is essentially discrete, since the
continuous spectrum of a compact operator consists of at most one point. The spectrum of
a compact operator in an infinite dimensional Hilbert space has the following properties:
• There exists a sequence {l_k}_1^∞ of distinct nonzero eigenvalues with l_k → 0 as k → ∞.
• The eigenspaces for distinct eigenvalues are orthogonal and the sum of all the
eigenspaces is dense in the whole space.
• The nonzero eigenvalues have finite multiplicity.
The spectrum of P - I is thus a discrete, bounded set with 0 as the only possible
accumulation point. Since the eigenvalues {l_k} of P - I are related to the eigenvalues
{λ_k} of P through λ_k = l_k + 1, the eigenvalues and eigenspaces of P inherit all the above
properties; however, the accumulation point of the eigenvalues is 1.
In summary, under the assumption of compactness, the smallest eigenvalue of P exists,
and any eigenfunction corresponding to this eigenvalue is a smallest additive principal component
of X.
2.3.5 The Final Algorithm

Now that the correspondence between P and the smallest APC is established, it is clear
that the naive algorithm for the additive case has the same flaw as it had in the linear case.
The eigenvalues and eigenfunctions of P and I - P are in one-to-one correspondence,
exactly as for the linear case of section 2.3.2. The naive algorithm converges to the
eigenfunction of I - P for the eigenvalue with the largest absolute value. As in the linear
case, applying the modified operator pI - P will guarantee convergence to the correct
solution.

The final algorithm is:

    Choose initial transformations φ_1^[0], φ_2^[0], ..., φ_p^[0]
    Repeat for N = 1, 2, ...                                           (Outer loop)
        Do for i = 1, ..., p                                           (Inner loop)
            φ_i^[N] ← p φ_i^[N-1] - Σ_{j=1}^p P_i( φ_j^[N-1](X_j) )
        Standardize so that Σ var φ_i^[N] = 1
    Until var Σ φ_i^[N] converges.
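A rough finite-sample sketch of this algorithm (assuming Python with numpy; the crude binned-mean smoother standing in for the conditional expectation operators P_i, and the example data, are choices of this illustration rather than the implementation discussed in Chapter 4):

    import numpy as np

    def smooth(x, y, n_bins=20):
        # Crude stand-in for the conditional expectation E(y | x):
        # bin x by quantiles and average y within each bin.
        bins = np.quantile(x, np.linspace(0.0, 1.0, n_bins + 1))
        idx = np.clip(np.searchsorted(bins, x, side="right") - 1, 0, n_bins - 1)
        means = np.array([y[idx == b].mean() if np.any(idx == b) else 0.0
                          for b in range(n_bins)])
        return means[idx]

    def smallest_apc(X, n_iter=50):
        n, p = X.shape
        Phi = X.copy()                                    # initial transformations: identity
        Phi -= Phi.mean(axis=0)
        Phi /= np.sqrt(Phi.var(axis=0).sum())             # standardize: sum of variances = 1
        for _ in range(n_iter):
            total = Phi.sum(axis=1)
            new = np.empty_like(Phi)
            for i in range(p):
                # One application of (pI - P): p*phi_i - E(sum_j phi_j | X_i).
                new[:, i] = p * Phi[:, i] - smooth(X[:, i], total)
            new -= new.mean(axis=0)
            new /= np.sqrt(new.var(axis=0).sum())
            Phi = new
        return Phi

    # Example data lying near the additive constraint x1^2 + x2 + x3 = 0.
    rng = np.random.default_rng(7)
    x1 = rng.standard_normal(2000)
    x2 = rng.standard_normal(2000)
    x3 = -(x1**2 + x2) + 0.1 * rng.standard_normal(2000)
    X = np.column_stack([(v - v.mean()) / v.std() for v in (x1, x2, x3)])

    Phi = smallest_apc(X)
    print("estimated smallest APC variance:", Phi.sum(axis=1).var())   # well below 1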
2.4 Further Additive Principal Components
Up to this point we have considered only a single constraint; however, there may be
other additive dependencies of importance that can be captured with a second constraint.
In linear principal component analysis, searching for additional linear dependencies would
correspond to examining the principal components of other eigenvalues. The characterization
of the smallest additive principal component as a smallest eigenfunction of P suggests
exploring eigenfunctions associated with other eigenvalues of P.
We first review the properties of the second smallest linear principal component. In
section 2.2 we pointed out that the smallest principal component, a_1, minimizes var(X'a),
or equivalently defines through l_1(x) = a_1·x = 0 a linear manifold L^(1) minimizing the
expected squared distance from the observations to any linear manifold of co-dimension
1. The second smallest principal component, a_2, is defined as the unit vector minimizing
var(X·a) subject to cov( Σ_i a_{1i} X_i, Σ_i a_{2i} X_i ) = 0. An equivalent definition replaces the
covariance constraint by the requirement a_1 ⊥ a_2.
The vector a_2 defines a linear function l_2 and a corresponding manifold L^(2) of co-dimension
1 through l_2(x) = 0. Together the functions l_1 and l_2 define a linear manifold
L^(12) of co-dimension 2, which is the intersection of L^(1) and L^(2). This is the manifold
that has the smallest expected squared distance from the observations among all manifolds
of co-dimension 2.
The second smallest additive principal component is defined by extending the variance
criterion of the linear definition.

Definition
The second smallest additive principal component of X is the random variable
Φ^(2)(X) = Σ φ_i^(2)(X_i) minimizing

    var Σ_i φ_i'(X_i) = (Φ', P Φ')_*

subject to (Φ', Φ^(1))_* = Σ_i cov( φ_i', φ_i^(1) ) = 0
and ||Φ'||_*² = Σ var φ_i'(X_i) = 1.

The additional constraint above defining the second smallest APC is a natural condition
of orthogonality between APCs with respect to the inner product of the Hilbert space H^p.
The second smallest additive principal component is an eigenfunction corresponding to
the second smallest eigenvalue of P. It is easy to generalize this idea and define a sequence
of additive principal components, each one orthogonal to all the preceding ones. The k-th
additive principal component corresponds to an eigenfunction of P belonging to the k-th
smallest eigenvalue (where eigenvalues are repeated according to their multiplicity). The
k-th APC is denoted by adding a superscript to the usual notation, i.e., Φ^(k)(X).
As the operator P - I is compact, we can express its decomposition explicitly. Divide
the eigenvalues of P - I into an upper and a lower sequence according to whether they are
positive or negative. Denote the negative values by the increasing sequence {l_k - 1 : l_k ≤
1, k = 1, 2, ...} and the positive values by the decreasing sequence {u_k - 1 : u_k ≥ 1, k = 1, 2, ...}.
Both sequences, if they are infinite, converge to zero. Let U_k denote the operator that
projects onto the eigenspace of the eigenvalue u_k, and likewise L_k. Then P - I can be
written:

    P - I = Σ_k (l_k - 1) L_k + Σ_l (u_l - 1) U_l.

Thus,

    P = I + Σ_k (l_k - 1) L_k + Σ_l (u_l - 1) U_l.

The eigenvalues of P smaller than 1 are 1 + (l_k - 1) = l_k, hence the sequence of APCs
spans the union of the range spaces of {L_k : k = 1, 2, ...}. The sequence of operators
{L_1, L_2, ...} is an orthogonal decomposition of the contracting part of P. That is, if
L = Σ_{k=1}^∞ (l_k - 1) L_k, then for Φ ∈ H^p, ||L Φ||_* ≤ ||Φ||_* ≤ ||P Φ||_*.
Linear principal components are uncorrelated, and the vectors of variable loadings are
orthogonal eigenvectors of Σ; hence the dispersion matrix of the principal components,
y = Ax, is diagonal:

    var y = var(Ax) = A'ΣA = diag(λ_1, ..., λ_p),  where A'A = I.

For additive principal components the same result holds true: the additive principal components
simultaneously diagonalize the quadratic forms (Φ, P Φ)_* and ||Φ||_*², and additive
principal components belonging to two different eigenvalues are uncorrelated. Note that
we then have cov( Φ^(k)(X), Φ^(l)(X) ) = 0 for k ≠ l, in analogy to the corresponding
property of linear principal components.
The geometric structure induced by further additive principal components is analogous
to the linear case. If the operator P has two vanishing eigenvalues, the observations lie in
an additive manifold of co-dimension 2, described by the two constraints:

    Σ φ_i^(1)(X_i) = 0  and  Σ φ_i^(2)(X_i) = 0,  with  Σ_i ( φ_i^(1), φ_i^(2) ) = 0.

If there are two eigenvalues "close" to zero, then the observations lie close to the additive
manifold defined by the corresponding pair of implicit equations. The manifold described
by these two constraints is not the manifold lying closest to the data in the Euclidean
metric, unless the manifold is linear.
An obvious method for finding the k-th smallest component is to choose initial functions
for the algorithm that are orthogonal to all the previous principal components. That is,
choose Φ^[0] such that (Φ^[0], Φ^(1))_* = ... = (Φ^[0], Φ^(k-1))_* = 0.
The algorithm will then converge to the k-th largest eigenfunction of pI - P, or equivalently
the k-th smallest eigenfunction of P. Hence it finds the k-th smallest additive principal
component, associated with the eigenvalue λ_k = var Σ_i φ_i^(k).
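In the linear special case this strategy is easy to sketch (assuming Python with numpy; the matrix is hypothetical): after the smallest eigenvector is found, keeping the iterate orthogonal to it makes the same pI - Σ iteration converge to the second smallest.

    import numpy as np

    def smallest_vectors(Sigma, n_components=2, n_iter=1000, seed=0):
        # Power iteration on pI - Sigma, orthogonalizing against components found earlier.
        p = Sigma.shape[0]
        rng = np.random.default_rng(seed)
        found = []
        for _ in range(n_components):
            a = rng.standard_normal(p)
            for _ in range(n_iter):
                a = (p * np.eye(p) - Sigma) @ a
                for b in found:                  # keep orthogonal to previous solutions
                    a -= (a @ b) * b
                a /= np.linalg.norm(a)
            found.append(a)
        return np.column_stack(found)

    Sigma = np.array([[1.0, 0.6, 0.3],
                      [0.6, 1.0, 0.2],
                      [0.3, 0.2, 1.0]])
    A = smallest_vectors(Sigma)
    print(np.diag(A.T @ Sigma @ A))              # ~ the two smallest eigenvalues of Sigma
    print(np.sort(np.linalg.eigvalsh(Sigma))[:2])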
2.5 A Null Distribution for Additive Principal Components

The APCs are standardized so that Σ var φ_i = 1, hence only if var Σ φ_i < 1 does the
APC reveal a dependency between the variables. This is equivalent to restricting our
attention to eigenfunctions corresponding to eigenvalues of P smaller than one. It is
natural to ask when P has no eigenvalues less than one. The following theorem provides
a characterization of this null situation.
Theorem 2.3 The following are equivalent:
1. All the spectral values of P are greater than or equal to 1,
2. P - I is non-negative definite,
3. P = I, so the spectrum of P is the singleton {1},
4. The variables X_1, X_2, ..., X_p are pairwise independent,
5. The spaces H(X_1), H(X_2), ..., H(X_p) are orthogonal.
Proof: (1 ⇒ 2) If all the spectral values of P are at least 1, then all the spectral values
of P - I are non-negative, or equivalently P - I is non-negative.
(2 ⇒ 4) If P - I is non-negative, then for Φ^{ij} = (0, ..., 0, φ_i, 0, ..., 0, φ_j, 0, ..., 0) ∈ H^p
with ||Φ^{ij}||_*² = 1:

    (Φ^{ij}, (P - I) Φ^{ij})_* = (Φ^{ij}, P Φ^{ij})_* - ||Φ^{ij}||_*² ≥ 0
    ⇒ 0 ≤ var( φ_i + φ_j ) - 1                     by Lemma 2.1 and ||Φ^{ij}||_*² = 1
         = var φ_i + var φ_j + 2 cov( φ_i, φ_j ) - 1
         = 2 cov( φ_i, φ_j )                        since ||Φ^{ij}||_*² = var φ_i + var φ_j = 1.

Replacing φ_j by -φ_j in the above, we arrive at the conclusion that cov(φ_i, φ_j) = 0 ∀ φ_i, φ_j.
Under the assumption of compactness of P_i and P_j, it follows that X_i and X_j are independent
∀ i ≠ j.
(4 ⇒ 5) Clear, since cov(φ_i, φ_j) = (φ_i, φ_j).
(5 ⇒ 3) H(X_i) ⊥ H(X_j) ⇒ P_i|_{H(X_j)} = 0 ∀ i ≠ j. Hence P = I.
(3 ⇒ 1) Trivially. ∎
Note that orthogonality of the spaces H(X_1), H(X_2), ..., H(X_p) is equivalent to pairwise
independence only, and full independence of X_1, ..., X_p does not follow: from H(X_1) ⊥
H(X_2) and H(X_1) ⊥ H(X_3) we only have H(X_1) ⊥ H(X_2) ⊕ H(X_3), whereas independence
of X_1 from (X_2, X_3) is equivalent to H(X_1) ⊥ H(X_2, X_3), the latter denoting the space
of centered L_2 functions which depend only on X_2 and X_3.
We conclude with the following simple corollary:

Corollary 2.2 The APCs of X ~ N(0, I) all have the eigenvalue 1. Any normed sum of
transformed variables is an APC of X.
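A quick Monte Carlo check of this null situation (assuming Python with numpy; the transformations are arbitrary choices): for independent standard Gaussian variables, any centered transformations scaled so that Σ var φ_i = 1 give var Σ φ_i ≈ 1.

    import numpy as np

    rng = np.random.default_rng(4)
    n, p = 100_000, 3
    X = rng.standard_normal((n, p))

    # Arbitrary centered transformations of each (independent) variable.
    funcs = [np.sin, np.abs, lambda x: x**3]
    Phi = np.column_stack([f(X[:, i]) - f(X[:, i]).mean() for i, f in enumerate(funcs)])

    # Normalize so the component variances sum to one (the APC standardization).
    Phi /= np.sqrt(Phi.var(axis=0).sum())

    print("sum of component variances:", Phi.var(axis=0).sum())   # 1 by construction
    print("variance of the sum:       ", Phi.sum(axis=1).var())   # ~ 1 under independence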
2.6 A Linear Characterization

For the smallest additive principal component, there is a linear characterization of the
minimizing solution. Namely, the smallest additive principal component of X is exactly
the smallest linear principal component of the transformed and restandardized variables:

    Y_i = φ_i(X_i) / ||φ_i(X_i)||,   i = 1, ..., p.

The smallest eigenvector has the vector of variable loadings a = ( ||φ_1(X_1)||, ..., ||φ_p(X_p)|| ).
Recall that the linear principal component of Y achieves minimal variance among all
linear combinations of Y_1, Y_2, ..., Y_p. This minimum cannot be less than the minimum
over all additive functions, since it is itself an additive function, nor can it be greater than
the minimum, since the smallest APC is a linear combination of Y_1, Y_2, ..., Y_p. Hence the
two are identical.
For additive principal components for other eigenvalues, this duality no longer exists,
as the transformations of the variables are different for each additive principal component.
Moreover, the k-th additive principal component does not correspond to the smallest
linear principal component of its restandardized transformed variables, as it is subject to
orthogonality with respect to the previous APCs. This is true even in the case where the
smallest additive principal component is linear.
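To illustrate this characterization numerically (assuming Python with numpy; the data and candidate transformations are hypothetical): once transformations are fixed, the remaining problem is an ordinary smallest-principal-component computation on the restandardized transformed variables, and the loadings come out approximately proportional to the norms ||φ_i||.

    import numpy as np

    rng = np.random.default_rng(5)
    n = 5000
    x1 = rng.standard_normal(n)
    x2 = rng.standard_normal(n)
    x3 = -(x1**2 + np.sin(x2)) + 0.05 * rng.standard_normal(n)   # near-additive dependency

    # Candidate transformations phi_i, centered; restandardized variables Y_i.
    phis = [x1**2, np.sin(x2), x3]
    phis = [f - f.mean() for f in phis]
    norms = np.array([f.std() for f in phis])                    # ||phi_i||
    Y = np.column_stack([f / s for f, s in zip(phis, norms)])

    R = np.corrcoef(Y, rowvar=False)
    evals, evecs = np.linalg.eigh(R)
    a = evecs[:, 0]                                              # smallest linear PC of the Y_i

    print("smallest eigenvalue of R:", evals[0])                 # near zero: strong dependency
    print("loadings a (up to sign): ", a)
    print("normalized ||phi_i||:    ", norms / np.linalg.norm(norms))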
2.7 Alternating Conditional Expectation Regression and
Additive Principal Component Analysis
The stationary equations of the ACE regression model and the APC solutions have a
striking similarity, which suggests investigating more closely the differences between the
two solutions. This leads to an interpretation of the APC solutions as a possible alternative
to ACE regression, in the case where a response variable is not designated a priori.
For the purposes of comparison between the two stationary equations we will choose
a variable, say X_1, as the response variable for the ACE regression, and the remaining
variables, X_2, ..., X_p, as predictors. Let H_+^(1) denote the space of additive functions
of these predictor variables. The optimal ACE transformations φ_1*, φ_2*, ..., φ_p* satisfy, for
maximal λ* ∈ [0, 1], the two stationary equations:

    λ* φ_1* = P^{X_1} Σ_{j≠1} φ_j*,
    λ* Σ_{j≠1} φ_j* = P_{H_+^(1)} φ_1*.

For the smallest APC of X, the APC-functions satisfy, for minimal λ ∈ [0, 1], the p
equations:

    (1 - λ) φ_i = P^{X_i}( - Σ_{j≠i} φ_j ),   i = 1, ..., p.
Comparing the two solutions, it is clear that the APC and ACE solutions are identical
up to scalar multiples for p = 2, with the correspondence φ_1 = -φ_1*. For larger
p, the ACE equations are unsymmetric, since X_1 is singled out as a response variable, and
the predictor transformations are then found to give the best additive approximation to a
transformation of X_1. The APC equations, by contrast, treat the variables symmetrically.
The APC solution satisfies p restricted "regression" equations; that is, each φ_i minimizes

    || φ_i - ( - Σ_{j≠i} φ_j ) ||²   subject to   ||φ_i||² = 1 - Σ_{j≠i} ||φ_j||².

Loosely speaking, the APC-functions are the solution which is simultaneously best over
all p possible ACE regressions of one variable on all the other variables.
The linear analogy to using APC as an alternative to additive regression is using
linear principal components regression as an alternative to least squares regression. Principal
component regression is advocated when both response and predictor variables are
observed with (known) error. Then the principal component plane is an optimal fit to the
data in the sense that it minimizes the residual sum of squares over the joint distribution of
response and predictors. However, a regression interpretation of the APC solution cannot
be similarly justified, since we do not minimize over the joint distribution of the original
variables. In APC analysis, the variance of the transformed variables is minimized.
Finally, we show that the ACE regression model and the APC solution for more than
two variables coincide only when an exact additive singularity exists.

Theorem 2.4 For p ≥ 3, consider random variables X_1, ..., X_p. Suppose the ACE regression
of X_i on {X_j : j ≠ i} is

    φ_i*(X_i) ~ Σ_{j≠i} φ_j*(X_j),

and let φ_1, ..., φ_p denote the APC-functions of the smallest APC of X.
The two sets of transformations correspond for some constant c according to the rule

    φ_i(X_i) = -c φ_i*(X_i),
    φ_j(X_j) = c φ_j*(X_j)   for j ≠ i,                    (2.3)

if and only if there exist φ_1, ..., φ_p with φ_i ≠ 0 such that || Σ_j φ_j(X_j) || = 0.
Proof: (⇒) Suppose the ACE and APC solutions coincide. Without loss of generality,
take X_1 to be the response variable of the ACE regression.
For fixed φ_1, φ_2, ..., φ_p, ACE has a linear characterization in the standardized, transformed
variables

    Y_i = φ_i(X_i) / ||φ_i(X_i)||.

If 1 and a are the first canonical correlation vectors for Y_1 and (Y_2, ..., Y_p) respectively,
with canonical correlation ρ*, then

    ( ||φ_1*||, ||φ_2*||, ..., ||φ_p*|| ) = (1, ρ* a) = (1, a*), say.    (2.4)

This linear characterization follows from Theorem 5.1 in Breiman and Friedman [BF85]
and the minimization criterion defining the ACE regression.
In section 2.6 we gave a linear characterization of the smallest APC: the smallest
linear principal component direction of Y_1, ..., Y_p is ( ||φ_1||, ..., ||φ_p|| ) = l, say.
If R is the correlation matrix of Y_1, Y_2, ..., Y_p, then l is an eigenvector of R, hence
Rl = λl. Since the solutions coincide, from (2.3) the eigenvector l is, up to the scalar c,
equal to (-1, a*). It follows that:

    [ 1     r_12' ] [ -1 ]        [ -1 ]
    [ r_12  R_22  ] [ a* ]  =  λ  [ a* ].

The lower partition implies

    r_12 = R_22 a* - λ a*.    (2.5)

We will now show that equations (2.4) and (2.5) can hold simultaneously iff λ = 0,
which implies || Σ_i φ_i(X_i) || = 0, as claimed.
From the properties of the canonical correlation solution we know that a = R_22^{-1/2} α,
where α is the first singular vector of r_12' R_22^{-1/2}; that is, R_22^{-1} r_12 = ρ* a. Substituting
a* = ρ* a yields the relation

    r_12 = R_22 a*.    (2.6)

Comparison of (2.5) and (2.6) yields a contradiction, unless λ = 0.
(⇐) The converse direction of the theorem is trivial. ∎
Chapter 3
Additive Principal Component
Solutions for some Multivariate
Distributions
3.1 Introduction
For distributions with strong symmetry, it is possible to explicitly calculate the additive
principal components. From both an applied and theoretical viewpoint this exact
knowledge is very valuable.
First, it enables us to study the performance of the estimation procedure, since we can
assess the accuracy of our estimates by comparison with the known theoretical solution.
Second, the particular distributions for which the eigen solutions are tractable encompass
a limited class of null situations, the independent Gaussian and the uniform distribution on the
p-ball, for example. The APCs of these null distributions provide a standard of comparison
for assessing the significance of detected structure in real data.
The first three sections of this chapter establish conditions under which exact APC
solutions are easily characterized. Then APCs are enumerated for a number of specific
distributions. Finally, we discuss some non-trivial distributions which have APCs with
zero eigenvalues. These typically involve dependencies in the data that are not represented
by smooth transformations of the variables.
3.2 Distributions with Bivariate Symmetry
Calculation of the APCs is simplified when all bivariate marginals of the distribution are
symmetric. Symmetry, in the bivariate setting, refers specifically to the assumption that the
law of (X, Y) is the same as that of (Y, X). From this it follows that the ranges of X and
Y are the same, and that X and Y have the same marginal distributions.
Suppose X and Y are distributed according to Q_{X,Y}(dt_1, dt_2), with marginals Q_X(dt)
and Q_Y(dt) respectively. Let:

    H(X) = L_2(Q_X) = { φ(X) : E φ(X) = 0, var φ(X) < ∞ },
    H(Y) = L_2(Q_Y) = { θ(Y) : E θ(Y) = 0, var θ(Y) < ∞ }.

The conditional expectation operators

    P^X : H(Y) → H(X),   P^X( θ(Y) ) = E( θ(Y) | X ),
    P^Y : H(X) → H(Y),   P^Y( φ(X) ) = E( φ(X) | Y ),

are mappings between the two spaces. When the joint distribution of (X, Y) is symmetric,
however, X and Y have the same marginal distribution Q_X(dt) = Q_Y(dt) = Q(dt). We
can then consider P^X and P^Y as mappings of L_2(Q) onto itself, and in this sense the
conditional expectation operators are identical, P^X = P^Y = P. P can be defined as an
operator on H(X), say, according to P( g(X) ) = P^X g(Y). P thus defined is symmetric
and nonnegative definite, and all of its eigenfunctions are clearly also eigenfunctions of
the identical operator defined as a mapping of H(Y) onto itself.
When P is compact and self-adjoint, spectral theory grants the existence of a sequence
of eigenvalues which converge to zero, and of associated eigenspaces which are mutually
orthogonal, finite dimensional (for nonzero eigenvalues), and complete in the sense that
the closure of the span of the eigenspaces is the whole space.
P is self-adjoint since:

    ( φ(X), P θ(X) ) = ( φ(X), P^X θ(Y) )
                     = ( φ(X), θ(Y) )
                     = ( P^Y φ(X), θ(Y) )
                     = ( P^X φ(Y), θ(X) )
                     = ( P φ(X), θ(X) ),

where symmetry plays its part in the penultimate equality. Nonnegative definiteness is a
property of the inner product.
By definition, the eigenfunctions {φ_k(X)}_k and {φ_k(Y)}_k are both sequences of orthogonal
functions, but in addition, they are mutually orthogonal:

    ( φ_i(X), φ_j(Y) ) = ( φ_i(X), P^X φ_j(Y) )
                       = ( φ_i(X), P φ_j(X) )
                       = ( φ_i(X), λ_j φ_j(X) )
                       = 0,   i ≠ j.

A full discussion of these properties of the symmetric bivariate distribution, leading naturally
to a singular value decomposition of the distribution function, is given in Buja
[Buj85].
3.3 The Additive Principal Components of Distributions
with Bivariate Symmetry
The previous section established that symmetry of a bivariate distribution implies that
the two variables have a common sequence of eigenfunctions. To calculate the APCs of X,
we need to strengthen this condition: all bivariate distributions have to share the same
eigenfunction sequence. This implies symmetry of all pairwise bivariate distributions, but
it is considerably stronger.
Under this condition, we will show that an APC is defined by scalar multiples of a
single eigenfunction of the common sequence.
Denote the common family of eigenfunctions by L = { l_1, l_2, ... : E l_k = 0, var l_k = 1 }.
The operator P of the previous section, corresponding to the variables X_i and X_j, is
denoted P_ij. Let the eigenvalue of P_ij belonging to the k-th eigenfunction l_k be denoted
ρ_ij^(k). Thus we have:

    P_ij l_k(X_j) = ρ_ij^(k) l_k(X_i).    (3.1)

The eigenvalue ρ_ij^(k) is the correlation between l_k(X_i) and l_k(X_j):

    cor( l_k(X_i), l_k(X_j) ) = ( l_k(X_i), l_k(X_j) ) = ρ_ij^(k) ( l_k(X_i), l_k(X_i) ) = ρ_ij^(k).

There is a potential for confusion between the eigenfunctions and eigenvalues of each P_ij
and the eigenfunctions and eigenvalues of P. Since our interest is primarily in the APCs
of X, the terms "eigenfunction" and "eigenvalue" will be reserved for the eigen analysis
of P; the ρ_ij^(k) will be called correlations and the l_k, APC-basis functions.
The vector of APC-basis functions Λ_k(X) = ( l_k(X_1), ..., l_k(X_p) ) has correlation matrix

    T^(k) = [ 1          ρ_12^(k)   ...   ρ_1p^(k) ]
            [ ρ_21^(k)   1          ...   ρ_2p^(k) ]
            [ ...                          ...     ]
            [ ρ_p1^(k)   ρ_p2^(k)   ...   1        ].

Since every bivariate distribution is symmetric, ρ_ij^(k) = ρ_ji^(k); thus T^(k) is symmetric.
When every variable has the same sequence of APC-basis functions, the APCs have a
particularly simple structure, as can be seen from the following algebraic argument. We
denote elementwise multiplication by *, that is, a * Λ(X) = ( a_1 l(X_1), ..., a_p l(X_p) ), and
omit the superscript k. Also, in the following it is convenient to use P_i to denote E^{X_i},
instead of the more explicit restricted operator, allowing the domain space of the mapping to be inferred.

    [ P ( a * Λ(X) ) ]_i = P_i Σ_j a_j l(X_j)
                         = Σ_j a_j P_i l(X_j)
                         = ( Σ_j a_j ρ_ij ) l(X_i)        by (3.1)          (3.2)
                         = [ (T a) * Λ(X) ]_i.

It follows that a * Λ(X) is an eigenfunction of P if T a = λ a. This is satisfied exactly
when a is an eigenvector of the symmetric matrix T, and the corresponding eigenvalue of
a * Λ(X) will be λ.
Denote the sequence of eigenvalues of T^(k) by λ_1^(k) ≤ λ_2^(k) ≤ ... ≤ λ_p^(k), with corresponding
eigenvectors v_1^(k), v_2^(k), ..., v_p^(k).

Theorem 3.1 Suppose X has a p-variate distribution with all bivariate distributions sharing
the same set of APC-basis functions. Then the (unordered) eigenvalues for the operator
P are

    { λ_1^(1), λ_2^(1), ..., λ_p^(1), λ_1^(2), λ_2^(2), ..., λ_p^(2), ..., λ_i^(k), ... },

and the eigenfunctions belonging to these values are

    L* = { Φ_1^(1), Φ_2^(1), ..., Φ_p^(1), Φ_1^(2), Φ_2^(2), ..., Φ_p^(2), ..., Φ_i^(k), ... },

where Φ_i^(k) = v_i^(k) * Λ_k(X).
The APC with variance λ_i^(k) is

    Φ_i^(k)(X) = Σ_{j=1}^p [v_i^(k)]_j l_k(X_j).

Proof: It is sufficient to establish:
1. Φ_i^(k) is an eigenfunction belonging to the eigenvalue λ_i^(k).
2. L* is a complete orthonormal basis for H(X) = H(X_1) × H(X_2) × ... × H(X_p).

For the first,

    P Φ_i^(k) = P ( v_i^(k) * Λ_k(X) )
              = ( T^(k) v_i^(k) ) * Λ_k(X)            by (3.2)
              = λ_i^(k) v_i^(k) * Λ_k(X)              since v_i^(k) is an eigenvector of T^(k)
              = λ_i^(k) Φ_i^(k).

To show the Φ_i^(k) are orthonormal, first consider the case k ≠ k'. The APC-basis functions
are mutually orthogonal, so:

    ( Φ_i^(k), Φ_{i'}^(k') )_* = Σ_m ( [v_i^(k)]_m l_k(X_m), [v_{i'}^(k')]_m l_{k'}(X_m) )
                               = Σ_m [v_i^(k)]_m [v_{i'}^(k')]_m ( l_k(X_m), l_{k'}(X_m) )
                               = 0.
If k = k',

    ( Φ_i^(k), Φ_{i'}^(k) )_* = Σ_m [v_i^(k)]_m [v_{i'}^(k)]_m ( l_k(X_m), l_k(X_m) )
                              = Σ_m [v_i^(k)]_m [v_{i'}^(k)]_m
                              = 1 if i = i', and 0 else, since v_i^(k), v_{i'}^(k) are eigenvectors of T^(k).

L* is a complete basis for H(X) since L, the set of APC-basis functions, is a complete
orthonormal basis for H(X_i) for every i. Hence h ∈ H(X_i) can be written h(X_i) =
Σ_{k=1}^∞ a_{ki} l_k(X_i), and thus h(X) ∈ H_+ can be written h(X) = Σ_{k=1}^∞ a_k * Λ_k(X). Now the
p-vector a_k, since {v_i^(k)}_{i=1}^p span R^p, can be written a_k = Σ_{i=1}^p α_i^(k) v_i^(k). Hence,

    h(X) = Σ_{k=1}^∞ a_k * Λ_k(X)
         = Σ_{k=1}^∞ Σ_{i=1}^p α_i^(k) v_i^(k) * Λ_k(X)
         = Σ_{k=1}^∞ Σ_{i=1}^p α_i^(k) Φ_i^(k).

This is in the span of L* as required. Completeness is inherited from the completeness of
each of the subspaces in the direct product, H(X) = H(X_1) × H(X_2) × ... × H(X_p).
The construction of the APCs is an application of Corollary 2.1. ∎

Each APC is defined by a single member of L, provided all eigenvalues are distinct.
If an eigenvalue of T^(k) has multiplicity greater than 1, then although the APC-basis
function is uniquely determined, the scalings v_i^(k) of the APC-basis functions are not well
determined. If an eigenvalue of T^(k) coincides with an eigenvalue of T^(k'), then neither the
scalings nor the APC-basis functions are well-determined. In this case, there is a mixing
of APC-basis functions: any set of transformations of the form Φ = a Φ_i^(k) + (1 - a²)^{1/2} Φ_{i'}^(k'),
for a ∈ [0, 1], defines an APC.
It may seem at first sight that the symmetry condition alone will restrict consideration
to only a very small class of distributions. However, with the convention of standardizing
the distribution so the variables have equal variance, symmetry of the univariate marginals
will often be accompanied by symmetry of the bivariate distributions. Hence, the class of
elliptical distributions with common APC-basis functions is not as small as one might expect.
3.4 Polynomial Biorthogonality
This section serves only to introduce an auxiliary property of bivariate distributions, which
simplifies the calculation of the APC-basis functions for the distributions of the ensuing
sections.

Definition
A bivariate distribution has the property of polynomial biorthogonality if all eigenfunctions
with respect to projection onto marginals are polynomials.

The sets of eigenfunctions for such distributions necessarily correspond to two sets of
orthogonal polynomials with respect to the marginal distributions, since the eigen families
are complete. The following two propositions give conditions to establish that polynomials
are preserved under projection onto marginals, and a simple method of finding the related
eigenvalues.

Proposition 3.1 A bivariate distribution has the polynomial biorthogonality property iff
the conditional moment E[Y^m | X] is a polynomial of degree no greater than m in X for
all m, and the same for E[X^m | Y] as a polynomial in Y. If the distribution is symmetric,
only one of the two conditions need hold.

Proposition 3.2 If a bivariate distribution has the polynomial biorthogonality property,
the eigenvalue λ_m is the product of the leading coefficients of the polynomials given by the
conditional moments E[Y^m | X] and E[X^m | Y]. If the distribution is symmetric, λ_m is
the square of the leading coefficient of either polynomial.

Polynomial biorthogonality was first studied by Lancaster [Lan58]; proofs of the two
propositions above can be found in Buja [Buj85].
An easy sequence of steps leads to the eigenfunctions and eigenvalues. First check that
the distribution is polynomially biorthogonal by finding the conditional expectations and
computing conditional moments. Then, from the marginals, find the family of orthogonal
polynomials. Finally, read off the eigenvalues from the moments of the conditional
expectation.
3.5 Additive Principal Components of the Gaussian Distribution
Assume $X \sim N_p(0, R)$, with $R$ a correlation matrix. The condition of Theorem 3.1
requires common APC-basis functions (eigenfunctions) for all bivariate pairs, so we first
focus on the bivariate distributions.

As all bivariate distributions are Gaussian, symmetry is trivial. The conditional distribution of $X_i$ given $X_j$ is also Gaussian, and the conditional moments are:

$$E\left(X_i^m \mid X_j\right) = \sum_{r=0}^{\lfloor m/2 \rfloor} \binom{m}{2r} \frac{(2r)!}{2^r r!}\; \rho_{ij}^{\,m-2r} \left(1-\rho_{ij}^2\right)^r X_j^{\,m-2r}.$$
Bivariate Gaussian distributions are thus polynomially biorthogonal, and the system of
orthogonal polynomials generated by the Gaussian marginal are the Hermite polynomials.
Applying Proposition 3.2, the correlation between the $k$th degree Hermite polynomial pair
$(P_k(X_i), P_k(X_j))$ is $\rho_{ij}^{\,k}$, where as usual $P_k$ is centered and standardized.

Returning now to the APC construction, since each bivariate distribution is Gaussian,
with a standard Gaussian marginal, every pair has as APC-basis functions the family of
Hermite polynomials, independent of their correlation $\rho_{ij}$. Theorem 3.1 is applicable, with
the Hermite polynomials providing the common set of APC-basis functions.
The APC-functions are multiples of a Hermite polynomial. The correlation matrix
associated with the $k$th degree Hermite polynomial is the $k$th Schur product of $R$:

$$T^{(k)} \stackrel{\mathrm{def}}{=} \mathrm{var}\,(\Pi^{(k)}) = \mathrm{var}\,(P_k(X_1), P_k(X_2), \ldots, P_k(X_p)) = R^{k\bullet},$$

where $[R^{k\bullet}]_{ij} = \rho_{ij}^{\,k}$.

Thus for each $k$, the $p$ eigenvalues $(\lambda_1^{(k)} \le \lambda_2^{(k)} \le \cdots \le \lambda_p^{(k)})$ of $R^{k\bullet}$ are all eigenvalues of $P$.

The APCs are the sequence $\{\phi_i^{(k)} : i = 1, 2, \ldots, p,\ k = 1, 2, \ldots\}$, where

$$\phi_i^{(k)} = (v_i^{(k)})^t\, \Pi^{(k)}(X)$$

and $v_i^{(k)}$ is an eigenvector for the eigenvalue $\lambda_i^{(k)}$ of $R^{k\bullet}$.
Since each $R^{k\bullet}$ is a correlation matrix, we can obtain partial orderings of the eigenvalues
of $P$ for the Gaussian distribution.

Proposition 3.3 For the eigenvalues $\{\lambda_i^{(k)},\ i = 1, 2, \ldots, p\}_{k=1}^{\infty}$ of the sequence of Schur
products $R^{k\bullet}$ of a correlation matrix $R$:

1. $\lambda_1^{(1)} \le \lambda_1^{(2)} \le \cdots \le \lambda_1^{(k)} \le \cdots \le 1$
2. $\lambda_p^{(1)} \ge \lambda_p^{(2)} \ge \cdots \ge \lambda_p^{(k)} \ge \cdots \ge 1$
3. $\lambda_1^{(1)} \le \lambda_i^{(j)}\ \forall\, i, j$
4. $\lambda_p^{(1)} \ge \lambda_i^{(j)}\ \forall\, i, j$
This is a consequence of the following result of majorization theory (Bapat and Sunder
[BS83]). Let $A \cdot B$ denote the Schur product of $A$ and $B$, $[A \cdot B]_{ij} = a_{ij} b_{ij}$, $\lambda(A)$ the
ordered sequence of eigenvalues of $A$, and $\prec$ the majorization relation: $u \prec v$ if
$\sum_{i=1}^{j} u_i \le \sum_{i=1}^{j} v_i\ \forall\, j$.

Lemma 3.1 Let $A$ and $B$ be $p \times p$ matrices, with $A$ self-adjoint and $B$ a correlation
matrix. Then

$$\lambda(A \cdot B) \prec \lambda(A).$$
An elegant proof of the above is found in the given reference.
Proof of Proposition 3.3 :
Repeated application of Lemma 3.1 to $R$ yields the relationship $\lambda(R^{k\bullet}) \prec \lambda(R^{(k-1)\bullet})$,
establishing (1) and (2). Statements (3) and (4) follow as obvious consequences, since by
definition $\lambda_1^{(k)} \le \lambda_2^{(k)} \le \cdots \le \lambda_p^{(k)}$. $\blacksquare$
Proposition 3.3 has the following consequences for the Gaussian distribution:
1. The smallest additive principal component is the smallest linear principal component
of X.
2. The largest linear principal component achieves largest variance among all additive
functions.
3. The second smallest additive principal component is either the smallest component of
the quadratically transformed variables, $\phi_1^{(2)}$, or the second smallest linear principal
component, $\phi_2^{(1)}$.

4. In general, the $j$th smallest APC is among the eigenfunctions belonging to the "upper
triangular" subset of eigenvalues $\{\lambda_i^{(k)} : i + k \le j + 1\}$.
The first two points above simply verify the well known result that minimal (maximal)
correlation over all marginal transformations for the multivariate Gaussian is achieved by
the smallest (largest) linear principal component of X (see Lancaster [Lan58]).
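To make the Gaussian construction concrete, the following sketch (not part of the thesis; the function names are my own, and the example matrix is the correlation matrix used later in the GAU-S1 simulation of Chapter 5) computes the eigenvalues of the Schur powers $R^{k\bullet}$ with numpy. By Proposition 3.3, collecting these eigenvalues across degrees identifies the candidates for the smallest APCs.

```python
import numpy as np

def schur_power(R, k):
    """Elementwise (Schur) k-th power of a correlation matrix R."""
    return R ** k

def gaussian_apc_eigenvalues(R, max_degree=3):
    """Eigenvalues of R^{k.} for k = 1..max_degree; under Proposition 3.3 these
    are the APC variances attached to the degree-k Hermite transforms."""
    return {k: np.sort(np.linalg.eigvalsh(schur_power(R, k)))
            for k in range(1, max_degree + 1)}

# Illustrative correlation matrix (the GAU-S1 matrix of Chapter 5).
R = np.array([[ 1.0,  0.6,  0.4, -0.7],
              [ 0.6,  1.0,  0.5, -0.3],
              [ 0.4,  0.5,  1.0, -0.8],
              [-0.7, -0.3, -0.8,  1.0]])

for k, vals in gaussian_apc_eigenvalues(R).items():
    print(f"degree {k}: two smallest eigenvalues {vals[:2]}")
```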
3.6 Additive Principal Components of the
Gegenbauer Distribution
In this section we compute the APCs of variables distributed according to a class of
symmetric multivariate distributions, which include the Uniform distribution on the unit
$p$-ball, $B_p$. This is then used to find the APCs for the transformed distribution on an
ellipsoid.
The distribution we consider is a multivariate generalization of the symmetric beta
distribution, centered at zero and rescaled onto Bp • The p-variate Gegenbauer distribution
with parameter a on Bp has density :
$$\gamma_p(a) \ \stackrel{\mathrm{def}}{=}\ q(x_1, x_2, \ldots, x_p;\, a) = \frac{\Gamma(\frac{p}{2} + a)}{\pi^{p/2}\,\Gamma(a)} \left(1 - x_1^2 - x_2^2 - \cdots - x_p^2\right)^{a-1}.$$

The distribution can be derived from a transformation of the Dirichlet distribution:
if $Y_1, \ldots, Y_p$ are distributed Dirichlet$(a, \tfrac12, \ldots, \tfrac12)$, then $\sqrt{Y_1}, \ldots, \sqrt{Y_p}$ have the above
density.
In particular, note several special cases of this density:
1. For $a = 1$ we have the Uniform on the unit $p$-ball.

2. While the Uniform on the $(p-1)$-sphere does not have a density, it can be obtained
as a limit as $a \to 0$.
3. The independent $p$-variate Gaussian is included in this family by considering a suitably rescaled version as $a \to \infty$. Explicitly, for $X \sim \gamma_p(a)$, the distribution of $\sqrt{2a}\,X$ converges to the standard $p$-variate Gaussian as $a \to \infty$.
Comparing the three cases enumerated above, the Uniform distribution can be thought
of as intermediate between two extremes, the degenerate distribution on the surface of the
sphere on the one hand, and the limiting Gaussian distribution in the "center" of the
sphere on the other.
It is a simple matter to establish the following relationship between marginals of $\gamma_p(a)$: integrating out one coordinate of $\gamma_p(a)$ leaves the $(p-1)$-dimensional marginal $\gamma_{p-1}(a + \tfrac12)$.

Proof:

$$\gamma_{p-1}(a + \tfrac12) = \frac{\Gamma(\frac{p}{2} + a)}{\pi^{(p-1)/2}\,\Gamma(a + \frac12)} \left(1 - x_1^2 - x_2^2 - \cdots - x_{p-1}^2\right)^{a + \frac12 - 1},$$

where $\sum_{i=1}^{p-1} x_i^2 \le 1$. $\blacksquare$

It follows by induction that the bivariate distribution of $(X_i, X_j)$ is $\gamma_2(a + \frac{p}{2} - 1)$ and
the univariate marginal distribution of each $X_i$ is $\gamma_1(a + \frac{p-1}{2})$. The conditional distribution
of $X_i$ given $X_j$ is $\sqrt{1 - X_j^2}\;\gamma_1(a + \frac{p}{2} - 1)$.¹

¹ Note that the $\gamma_1(\alpha)$ density is simply the usual symmetric beta density $\beta(\alpha, \alpha)$ on $[0,1]$ linearly rescaled onto the interval $[-1, 1]$. It is also the density of the square root of a Beta$(\frac12, \alpha)$ variable, given a random sign.

The moments of $\gamma_1(\alpha)$ are easily derived: since the density is
symmetric about zero the odd moments vanish, and the even moments of order $2m$ are:

$$E\,X^{2m} = \frac{B(m + \frac12, \alpha)}{B(\frac12, \alpha)}.$$
The polynomial biorthogonality property holds, since

$$E\left(X_i^{2m} \mid X_j\right) = (1 - X_j^2)^m\, E\left(\gamma_1(a + \tfrac{p}{2} - 1)^{2m}\right) = (1 - X_j^2)^m\, \frac{B(m + \frac12,\; a + \frac{p}{2} - 1)}{B(\frac12,\; a + \frac{p}{2} - 1)}.$$

The system of orthogonal polynomials generated by $\gamma_1(a + \frac{p-1}{2})$ are the ultraspherical
or Gegenbauer polynomials of order $a + \frac{p}{2} - 1$, $g_k(\cdot\,; a, p)$ (hence the distribution name).
As usual, the $g_k(\cdot\,; a, p)$ are centered and standardized.
The coefficients of the leading polynomial terms in the conditional moment are independent of both $i$ and $j$, implying that for every variable pair, the correlations between
$(g_k(X_i), g_k(X_j))$ are 0 for $k$ odd, and for $k = 2m$,

$$\lambda_{ij}^{(2m)}(a, p) = \lambda^{(2m)}(a, p) = (-1)^m\, \frac{B(m + \frac12,\; a + \frac{p}{2} - 1)}{B(\frac12,\; a + \frac{p}{2} - 1)}.$$
We now apply these properties of the bivariate distribution to solve for the APCs.
All the bivariate distributions are identical, and the sequence of Gegenbauer polynomials provides the common set of APC-basis functions. The correlation matrix for
$G^{(2m)} = (g_{2m}(X_1), \ldots, g_{2m}(X_p))$ is

$$T^{(2m)} = \begin{pmatrix} 1 & \lambda^{(2m)} & \cdots & \lambda^{(2m)} \\ \lambda^{(2m)} & 1 & & \vdots \\ \vdots & & \ddots & \lambda^{(2m)} \\ \lambda^{(2m)} & \cdots & \lambda^{(2m)} & 1 \end{pmatrix},$$

where dependence on $a$ and $p$ has been suppressed.

This matrix has only two distinct eigenvalues: $1 + |\lambda^{(2m)}|$, which exceeds 1, and
$1 - (p-1)^{-1}|\lambda^{(2m)}|$, which corresponds to the $(p-1)$-dimensional eigenspace spanned by vectors
of contrasts, $\{c : \sum c_i = 0\}$. The results for the Gegenbauer distribution on the unit ball
are summarized below.
Proposition 3.4 Suppose $X_1, \ldots, X_p \sim \gamma_p(a)$ on $B_p$.

1. The APCs of $X$ with variance less than 1 are contrasts of even degree Gegenbauer
polynomials,
$$\phi^{(2m)}(X) = \sum_i c_i\, g_{2m}(X_i; a, p), \qquad \text{with } \sum_i c_i = 0.$$

2. The eigenvalue of the APCs formed from polynomials of degree $2m$ is
$1 - (p-1)^{-1}|\lambda^{(2m)}(a)|$, with multiplicity $p - 1$.
The $(p-1)$-dimensional eigenspace of each APC eigenvalue is spanned by the $(p-1)$-dimensional space of contrasts, so the APCs are not unique. The APC eigenvalues are
increasing as a function of $2m$ for fixed $a$, and increasing in $a$ for fixed $m$. Hence the first
$p-1$ smallest additive principal components are any $p-1$ linearly independent contrasts
of 2nd degree Gegenbauer polynomials, corresponding to the eigenvalue $1 - (p-1)^{-1}|\lambda^{(2)}(a)|$. The
second smallest eigenvalue, $1 - (p-1)^{-1}|\lambda^{(4)}(a)|$, defines the space spanned by contrasts of the 4th
degree Gegenbauer polynomials.
For a numerical example, we calculate the three smallest eigenvalues for $\gamma_p(1)$, the
Uniform distribution on the unit $p$-ball.

degree     2        4        6
p = 3     7/8     15/16   123/128
p = 4    14/15    34/35    62/63

Even for small $m$ and $p$ the eigenvalues are very close to 1. This result reflects the weak
dependencies of the Uniform distribution on the unit ball.
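As a check on these values, the short sketch below (mine, not the thesis code) evaluates $\lambda^{(2m)}(a,p)$ with scipy's Beta function and the contrast eigenvalue of Proposition 3.4 for the Uniform case $a = 1$.

```python
from scipy.special import beta

def gegenbauer_corr(m, a, p):
    """|lambda^(2m)(a, p)|: magnitude of the correlation between the degree-2m
    Gegenbauer transforms of a pair of coordinates of gamma_p(a)."""
    return beta(m + 0.5, a + p / 2 - 1) / beta(0.5, a + p / 2 - 1)

def contrast_eigenvalue(m, a, p):
    """Eigenvalue 1 - (p-1)^{-1} |lambda^(2m)(a, p)| from Proposition 3.4."""
    return 1 - gegenbauer_corr(m, a, p) / (p - 1)

for p in (3, 4):
    print(p, [round(contrast_eigenvalue(m, 1, p), 4) for m in (1, 2, 3)])
```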
We now derive the APCs of the more interesting class of Gegenbauer distributions on
a p-dimensional ellipse. This distribution will later be used for a simulation example, so
the solutions are fully enumerated.
The random variable $Y$ has a Gegenbauer$(a; p)$ distribution on a $p$-dimensional ellipse if
$Y = R^{1/2} X$, where $X$ is Gegenbauer$(a; p)$ on the unit ball and $R$ is any correlation matrix.
This is equivalent to choosing $p$ arbitrary directions in the unit sphere and generating
$Y_1, \ldots, Y_p$ as the projections of $X$ onto those directions. Each of the marginal distributions
is identical to the marginals of the spherical case, as a trivial consequence of the sphericity;
however, the bivariate pairs are now correlated.
We shall briefly digress here to discuss the bivariate APC-basis functions, or eigenfunctions, of the correlated case.

Every pair of variates $(Y_i, Y_j)$ can be written as a linear combination (3.3) of a pair
$(X_1, X_2)$ having a bivariate Gegenbauer distribution, $\gamma_2(a + \frac{p}{2} - 1)$, on the unit
disk, where $\rho$ is the correlation between $Y_i$ and $Y_j$. If $(X_1, X_2)$ are polynomially biorthogonal, then the proposition below establishes the eigenfunctions and eigenvalues for the
elliptically transformed variables $(Y_i, Y_j)$.

Proposition 3.5 If $(Y_1, Y_2)$ are generated from the polynomially biorthogonal variables
$(X_1, X_2)$ according to the transformation (3.3), then the eigenfunctions of $(Y_1, Y_2)$ are
exactly those of $(X_1, X_2)$, irrespective of the correlation $\rho$. The eigenvalues are functions of
$\rho$ and of $\lambda_{2m}(0)$, the eigenvalues of the circular case, $\rho = 0$.

A proof of Proposition 3.5 can be found in Buja [Buj85].
The eigenvalues are now functions of the correlation $\rho$, and are no longer monotonic in
$m$. For large $\rho$ the linear polynomial has larger correlation than the quadratic; for small
$\rho$, the quadratic polynomial dominates the linear. Hence, if there is a strong correlation
between any of the variables, the smallest APC-basis function will be linear; however, if
all the correlations are weak, the smallest APC-basis function will be quadratic.
We can now calculate the APCs for the Gegenbauer on an ellipsoid. Proposition 3.5
implies the common set of APC-basis functions consists of the Gegenbauer polynomials
of order $a + p/2 - 1$. Theorem 3.1 can be applied, with the matrix of correlations for the
transformations $G^{(m)} = (g_m(X_1), \ldots, g_m(X_p))$:

$$T^{(m)} = \begin{pmatrix} 1 & \lambda^{(m)}(\rho_{12}) & \cdots & \lambda^{(m)}(\rho_{1p}) \\ \lambda^{(m)}(\rho_{12}) & 1 & & \vdots \\ \vdots & & \ddots & \\ \lambda^{(m)}(\rho_{1p}) & \cdots & & 1 \end{pmatrix}.$$
For example, for the simulation of Chapter 5.4, we will take $a = 1$, hence we are considering the Uniform on the ellipsoid generated by the symmetric transformation matrix:

$$R = \begin{pmatrix} 1.0 & 0.55 & 0.33 \\ & 1.0 & 0.30 \\ & & 1.0 \end{pmatrix}.$$
The correlation matrix of the linear polynomials is $R$ itself, which has two eigenvalues
less than 1, of 0.4488 and 0.7526. The correlation matrices of the quadratic and cubic
polynomials are:

$$T^{(2)} = \begin{pmatrix} 1.0 & 0.128 & -0.114 \\ & 1.0 & -0.121 \\ & & 1.0 \end{pmatrix} \qquad\text{and}\qquad T^{(3)} = \begin{pmatrix} 1.0 & -0.138 & -0.185 \\ & 1.0 & -0.178 \\ & & 1.0 \end{pmatrix}.$$

The smallest eigenvalues of these symmetric matrices are 0.8596 and 0.6750, respectively.
Thus the smallest APC-basis functions are linear, the second smallest are cubic, and the
third are again linear.
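This ordering can be verified numerically from the matrices as reconstructed above; the sketch below (not from the thesis) simply compares the smallest eigenvalues of $R$, $T^{(2)}$ and $T^{(3)}$ with numpy.

```python
import numpy as np

R  = np.array([[1.0, 0.55, 0.33], [0.55, 1.0, 0.30], [0.33, 0.30, 1.0]])
T2 = np.array([[1.0, 0.128, -0.114], [0.128, 1.0, -0.121], [-0.114, -0.121, 1.0]])
T3 = np.array([[1.0, -0.138, -0.185], [-0.138, 1.0, -0.178], [-0.185, -0.178, 1.0]])

# The degree whose correlation matrix has the smaller minimum eigenvalue
# supplies the next smallest APC-basis function.
for name, M in [("linear", R), ("quadratic", T2), ("cubic", T3)]:
    print(name, round(np.linalg.eigvalsh(M).min(), 4))
```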
3.7 Zero Variance Additive Principal Components for
Clustered and Categorical Data
Consider the situation where two variables $X_1$ and $X_2$ of a $p$-variate set divide into two
natural clusters in diagonally opposite quadrants, as described by Buja and Kass [BK85].
More exactly, suppose there exist cut points $a, b$ such that $P(X_1 < a, X_2 < b)$ and $P(X_1 \ge a, X_2 \ge b)$ are both non-zero and sum to 1. Then there is an exact singularity in the data,
that is, an APC with an eigenvalue of 0. Defining $\phi_1$ to map the two sets $\{X_1 < a\}$ and
$\{X_1 \ge a\}$ onto different constants $k_1$ and $k_2$, and $\phi_2$ to map the corresponding sets $\{X_2 < b\}$
and $\{X_2 \ge b\}$ onto $-k_1$, $-k_2$, then with $\phi_3 = \cdots = \phi_p = 0$, $\mathrm{var}(\phi_1(X_1) + \phi_2(X_2)) = 0$.
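The construction is easy to reproduce on simulated data; the sketch below (my illustration, with artificial cluster data and hypothetical cut points) builds the two step functions and verifies that their sum has variance exactly zero.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200
cluster = rng.integers(0, 2, n)                    # which of the two quadrants
x1 = np.where(cluster == 0, rng.uniform(-2, -1, n), rng.uniform(1, 2, n))
x2 = np.where(cluster == 0, rng.uniform(-2, -1, n), rng.uniform(1, 2, n))

a, b, k1, k2 = 0.0, 0.0, 1.0, -1.0                 # cut points and scores
phi1 = np.where(x1 < a, k1, k2)                    # phi_1: {X1 < a} -> k1, {X1 >= a} -> k2
phi2 = np.where(x2 < b, -k1, -k2)                  # phi_2: {X2 < b} -> -k1, {X2 >= b} -> -k2

print(np.var(phi1 + phi2))                         # 0.0: an exact (discrete) APC
```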
With a pair of categorical variables a similar phenomenon can occur. Suppose the
categories $c_{i1}$ and $c_{j1}$ of variables $X_i$ and $X_j$ always occur together. Then $P(X_i \in c_{i1}, X_j \in c_{j1})$
and $P(X_i \notin c_{i1}, X_j \notin c_{j1})$ are both nonzero and sum to 1. As there are no ordering
restrictions on categorical variables, we can assign scores $k_1$, $k_2$ to $X_i$ according to whether
an observation is in $c_{i1}$ or not. Similarly, the scores $-k_1$, $-k_2$ can be assigned to $X_j$. The
resulting transformation of the variables is a zero variance APC, exactly as above.
In the same spirit, APCs with exact singularities can exist between continuous and
categorical variables. Suppose there is a group of categories $c_{11}$ of the categorical variable $X_1$
whose values on the continuous variable $X_2$ are distinct from the remaining
categories. Then there exist cut points $a_1, a_2$ such that $P(X_1 \in c_{11},\ a_1 < X_2 \le a_2)$ and
$P(X_1 \notin c_{11},\ X_2 \le a_1 \text{ or } X_2 > a_2)$ are nonzero and sum to 1. Again, defining functions
mapping the disjoint sets onto different constants results in a zero variance APC.
Any of these two variable dependencies can exist in higher dimensional generalizations.
APCs that are formed from step functions are referred to as discrete APCs, since the
transformed variables are discrete valued.
Chapter 4
Estimation of Additive Principal
Components
4.1 Introduction
The algorithm of Chapter 2 for finding the APC of X can be implemented as an estimation procedure in the finite sample setting, simply by using a data smooth to estimate
the conditional expectations. The resulting algorithm was implemented on a Symbolics
Lisp 3610, a computing environment well suited to developmental programming. Some
comments on the use of this machine are found in Appendix A.
We have discussed properties of the APC for the population case in the preceding
two chapters. Now, using simulation, we will look at the behavior of the algorithm as an
estimation technique.
We will not discuss asymptotic convergence and consistency properties of the finite
sample algorithm. These are dependent on properties of the data smooth used in the
implementation. Results for data smooths are fragmentary, and strong results are only
available for a restricted class of data smooths. A selection of relevant results is found in
Breiman and Friedman [BF85, Appendix].
In practice, convergence can be a delicate matter. Through using the APC algorithm
on a wide range of data sets, we have examined some factors affecting speed and accuracy
43
of the convergence to an optimal solution. Several refinements and improvements of the
basic algorithm based on our experience have been developed.
4.2 Algorithm Implementation Details
4.2.1 Data Smooth
The smoother used in our implementation is the variable span "supersmooth" developed
by Friedman and Stuetzle [FS81]. A full description of the procedure is found in the
given reference. An attractive facet of the supersmooth is that it encompasses variable
span, fixed span running linear, linear, monotonic, cyclic and categorical estimators of
conditional expectation.
4.2.2 Convergence Criterion
Convergence of the algorithm is assessed using convergence of the eigenvalue:
$$\hat\lambda = \frac{\mathrm{var}\left(\sum_i \phi_i(X_i)\right)}{\sum_i \mathrm{var}\,\phi_i(X_i)}.$$
Iteration of the outer loop continues until the eigenvalue estimate ceases to decrease.
Typically, we used a criterion for convergence of a change less than 0.005 in the eigenvalue
estimate over the last three iterations. For most applications, this seemed adequate,
although if the eigenvalue is very small (less than 0.01) further iteration may be required.
We used a straightforward estimator of the eigenvalue, simply calculating the squared
standard deviation of the vector estimates of $\sum_i \phi_i$ and of each $\phi_i$.
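A minimal sketch of this convergence check (my own restatement; the thesis implementation was written in Lisp): the eigenvalue estimate is the variance of the summed transforms divided by the sum of their variances, and the outer loop stops once the change over the last three iterations falls below the tolerance.

```python
import numpy as np

def eigenvalue_estimate(phi):
    """phi: (n, p) array whose columns are the current estimates phi_i(X_i)."""
    return np.var(phi.sum(axis=1)) / np.var(phi, axis=0).sum()

def has_converged(history, tol=0.005, window=3):
    """history: eigenvalue estimates from successive outer iterations."""
    return len(history) > window and abs(history[-1] - history[-1 - window]) < tol
```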
4.2.3 Initial Estimates
Occasionally, we find that the algorithm estimates an APC with smaller variance than
a previous APC. Suppose, for instance, that the eigenvalue of the third estimated APC
is smaller than the second. Then the algorithm has not located a global minimum for
the second APC. If we want to locate a correct solution for the second APC, a natural
procedure to consider is re-estimating the second and third APC, using the estimated
third APC as an initial estimate for the second APC. Sometimes this simply results in a
reversal of the two original estimates, indicating the algorithm initially converged to a local
minimum at the third APC.
This ability to become stuck in local minima has the unfortunate consequence that the
algorithm is sensitive to initial values. In addition, since the algorithm utilizes a power
iteration technique, good starting guesses will speed the convergence considerably.
Both of these factors have made it worthwhile to find a method for calculating good initial values. The basis of our starting estimates is Gnanadesikan's proposal for introducing
nonlinearity into ordinary principal components analysis [Gna77].

First the variable matrix is augmented by adding second and third degree orthonormal
polynomials in each variable,

$$X_{\mathrm{aug}} = (X_1, P_2(X_1), P_3(X_1), \ldots, X_p, P_2(X_p), P_3(X_p)), \tag{4.1}$$

where $P_k$ is a $k$th degree polynomial, $P_k \perp P_1$, $E\,P_k = 0$, $\mathrm{var}\,P_k = 1$. Then the smallest
linear principal component of this augmented matrix is formed. Finally, estimators for each
variable are constructed from the summed contribution of each variable:

$$\hat\phi_i(X_i) = a_{1i} X_i + a_{2i} P_2(X_i) + a_{3i} P_3(X_i), \qquad i = 1, \ldots, p. \tag{4.2}$$
Proposition 4.1 The estimator (4.2) is the smallest APC of $X$ when the APC-functions
are restricted to the class of third degree polynomials.
Proof: The smallest principal component of the augmented matrix (4.1) minimizes, for
$a \in \mathbb{R}^{3p}$,

$$\mathrm{var}\left(\sum_{i=1}^{p}\sum_{k=1}^{3} a_{ki} P_k(X_i)\right) \quad \text{subject to} \quad \sum_{i=1}^{p}\sum_{k=1}^{3} a_{ki}^2 = 1.$$

The APC solution minimizes over $\phi$:

$$\mathrm{var}\left(\sum_{i=1}^{p} \phi_i(X_i)\right) \quad \text{subject to} \quad \sum_{i=1}^{p} \mathrm{var}\,(\phi_i(X_i)) = 1. \tag{4.3}$$

The $\phi_i$ are restricted to be centered third degree polynomials, hence can be written
$\phi_i = \sum_{k=1}^{3} b_{ki} P_k$, since $P_1, P_2, P_3$ span the space of cubic polynomials. The minimization
criterion defining the APC then reduces to optimization of

$$\mathrm{var}\left(\sum_{i=1}^{p}\sum_{k=1}^{3} b_{ki} P_k(X_i)\right)$$

over $b = (b_{11}, b_{21}, b_{31}, \ldots, b_{1p}, b_{2p}, b_{3p})$ subject to:

$$\sum_{i=1}^{p} \mathrm{var}\left(\sum_{k=1}^{3} b_{ki} P_k(X_i)\right) = \sum_{i=1}^{p}\sum_{k=1}^{3} b_{ki}^2\, \mathrm{var}\,(P_k(X_i)) = \sum_{i=1}^{p}\sum_{k=1}^{3} b_{ki}^2 = 1.$$

This is exactly the criterion (4.3), hence the two solutions are identical. $\blacksquare$
The sample estimate for the smallest APC is simply the smallest sample linear principal
component of the augmented matrix of data vectors. Initial estimates for the $k$th APC
use the $k$th sample linear principal component of the augmented matrix, orthogonalized
with respect to the final estimate found for each of the previous $k-1$ APCs.
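The following sketch (function names are mine; it covers continuous variables only) carries out the construction just described: each variable is expanded into orthonormal linear, quadratic and cubic columns, the smallest linear principal component of the augmented matrix is extracted, and the per-variable contributions are summed into starting functions.

```python
import numpy as np

def orthonormal_polys(x, degree=3):
    """Columns x, P2(x), P3(x): centered, Gram-Schmidt orthogonalized against the
    lower-degree columns, and scaled to unit variance."""
    cols = []
    for d in range(1, degree + 1):
        v = x ** d - np.mean(x ** d)
        for c in cols:
            v = v - (v @ c) / (c @ c) * c
        cols.append(v / v.std())
    return np.column_stack(cols)

def apc_starting_values(X):
    """X: (n, p) data matrix.  Returns per-variable starting functions: the
    variable-wise contributions to the smallest PC of the augmented matrix (4.1)."""
    n, p = X.shape
    blocks = [orthonormal_polys(X[:, i]) for i in range(p)]
    aug = np.column_stack(blocks)
    w = np.linalg.eigh(np.cov(aug, rowvar=False))[1][:, 0]   # smallest PC loadings
    return [blocks[i] @ w[3 * i: 3 * i + 3] for i in range(p)]
```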
Borrowing from the principles of multiple correspondence analysis, a simple extension
of the above procedure yields starting values for categorical variables. If necessary, the
categorical variables are collapsed into a small number, say 4, of similar classes. Three
independent vectors are formed using indicator vectors for the first three classes. These
are centered and normed, then added to the augmented matrix (4.1). A linear principal
component analysis on this matrix yields starting estimates in the same fashion as before,
which assign scores to each of the similar categories.
4.2.4 Spectrum Shift
The outer loop of the APC algorithm of Chapter 2.3.5 applies the operator $pI - P$, where
$p$ is the number of variables in the analysis. The spectrum shift $p$ was chosen in order to
ensure convergence to the smallest eigen solution; however, applying the operator $\alpha I - P$
will still converge to the correct solution if

$$\alpha > \tfrac12\left(\lambda_{\max} + \lambda_{\min}\right). \tag{4.4}$$

Since the eigenvalues are bounded sharply by 0 and $p$, $p$ is the smallest shift that will
always work. However, clearly if $\alpha = \lambda_{\max}$, the inequality is satisfied. It is easily shown
that the rate of convergence of the power method is controlled by the ratio

$$\frac{|\alpha - \lambda^{(2)}|}{|\alpha - \lambda^{(1)}|}, \qquad \text{where $\lambda^{(1)}$ and $\lambda^{(2)}$ are the smallest and second smallest eigenvalues of $P$.} \tag{4.5}$$
For most cases, the value $p$ is a very conservative estimate of the largest eigenvalue;
this upper bound is only achieved in the extreme case of all variables being identical.
Hence the ratio (4.5) can be close to one, particularly when the number of variables is
large. Convergence to an optimal solution is then slow, hence it is difficult to assess when
the stationary solution has been reached and the solutions may be dependent on initial
values. The behavior of the algorithm is greatly improved if a good estimate of the largest
eigenvalue is available.
In our implementation, the initial estimates described in the preceding subsection
provide a good approximation to the largest eigenvalue. Explicitly, we use the largest
eigenvalue of the augmented matrix (4.1). This will tend to underestimate the true
maximal eigenvalue; however, for the condition (4.4) to be satisfied, it is sufficient that
$\hat\lambda_{\max} > (\lambda_{\max} + \lambda_{\min})/2$. This will almost always hold, so in practice, convergence to the
smallest eigen solution will still occur.
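A small numerical illustration (mine, with a hypothetical spectrum) of why the shift matters: the power-method ratio (4.5) is closer to one, and convergence correspondingly slower, for the conservative shift $\alpha = p$ than for a shift near the largest eigenvalue.

```python
def convergence_ratio(alpha, eigenvalues):
    """Ratio |alpha - lambda_(2)| / |alpha - lambda_(1)| of (4.5); values near 1
    mean slow convergence of the shifted power method."""
    lam = sorted(eigenvalues)
    return abs(alpha - lam[1]) / abs(alpha - lam[0])

spectrum = [0.02, 0.55, 0.77, 1.4, 2.1, 3.9]        # hypothetical eigenvalues of P, p = 6
print(convergence_ratio(6.0, spectrum))             # conservative shift alpha = p
print(convergence_ratio(max(spectrum), spectrum))   # shift at the (estimated) largest eigenvalue
```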
4.2.5 Maintaining Orthogonality
The algorithm proposed for the calculation of higher principal components in the population case simply orthogonalizes the initial estimates with respect to all previous components, thus in theory restricting the solution to the orthogonal space. In the
implementation of the algorithm for estimating the $k$th additive principal component, it
will not be sufficient to orthogonalize only the initial estimates. In the finite sample version, conditional expectations are estimated by data smooths. Typically the data smooth
is not a projection operator (it may not even satisfy linearity or symmetry), in which
case exact orthogonality is not preserved in the inner loop. Even if the smooth is a projection, rounding error will still reintroduce components in the orthogonal space. To ensure
convergence to an orthogonal solution, the new estimate needs to be re-orthogonalized
in every pass through the outer loop.
4.3 Algorithm Improvement : A Linear Principal Component Step
The basic algorithm entails a straightforward repetition of the iteration cycle over all
variables, using exactly the function estimates of the previous iteration cycle. We describe
an improvement which optimizes linearly over the fixed function estimates after each cycle
of iterations.

In this section, the estimation of the loadings of the variable functions is considered
separately from the transformations themselves, so for the sake of clarity we reparameterize
the APC, separating out the scale factor. Let $a_i = \|\phi_i\|$, and define

$$\psi_i = \frac{1}{a_i}\,\phi_i.$$

Under the constraints $\sum_i a_i^2 = 1$ and $\mathrm{var}\,\psi_i = 1$, the representation $\Phi(X) = \sum_i a_i \psi_i(X_i)$
is equivalent to the usual representation $\Phi(X) = \sum_i \phi_i(X_i)$ with $\sum_i \mathrm{var}\,\phi_i(X_i) = 1$. We
will use this alternative parameterization in the remainder of this section.
At the conclusion of each full iteration cycle, we have the $p$ new variable function
estimates $(a_1\psi_1, \ldots, a_p\psi_p)$. The idea is simply to compute new scalings $(a_1^*, \ldots, a_p^*)$ which
minimize $\mathrm{var}\sum_i a_i\psi_i$ over $a$, with $a'a = 1$, for fixed $(\psi_1, \ldots, \psi_p)$. This is achieved by
computing the smallest principal component direction $a^*$ for the $p$ variables $(\psi_1, \ldots, \psi_p)$.
Then we update the APC-function estimates, replacing $a_i\psi_i$ with $a_i^*\psi_i$. Since the smallest
principal component minimizes the variance of the sum, $\mathrm{var}\sum_i a_i^*\psi_i \le \mathrm{var}\sum_i a_i\psi_i$. Hence
addition of the linear principal component step must improve the rate of convergence.
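In code the rescaling step is a single eigen decomposition per outer cycle; the sketch below (my illustration, assuming the unit-variance transforms $\psi_i$ are held as the columns of an (n, p) array) replaces the loadings with the smallest principal component direction.

```python
import numpy as np

def rescale_loadings(psi):
    """psi: (n, p) array of unit-variance transforms psi_i.  Returns the loadings a*
    minimizing var(sum_i a_i psi_i) subject to a'a = 1."""
    vals, vecs = np.linalg.eigh(np.cov(psi, rowvar=False))
    return vecs[:, 0]                         # eigenvector of the smallest eigenvalue

# usage after a full iteration cycle:
#   a_star = rescale_loadings(psi)
#   phi_new = psi * a_star                    # updated estimates a_i* psi_i
```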
Implementing a linear principal component step for higher APCs employs the same
basic idea; however, the orthogonalization between components must be preserved. For
the current estimates of the $k$th APC, $(a_1^{(k)}\psi_1^{(k)}, \ldots, a_p^{(k)}\psi_p^{(k)})$, we want to find new scalings
$(a_1^{(k)*}, \ldots, a_p^{(k)*})$ which minimize $\sum_i \mathrm{var}\,a_i\psi_i^{(k)}$ over $a$ with $a'a = 1$. However, the $k$th
APC must also obey $k-1$ orthogonality constraints: for $j = 1, \ldots, k-1$,

$$0 = \sum_i \mathrm{cov}\left(a_i\psi_i^{(k)},\ a_i^{(j)}\psi_i^{(j)}\right) = \sum_i a_i\, \mathrm{cov}\left(\psi_i^{(k)},\ a_i^{(j)}\psi_i^{(j)}\right).$$

This boils down to $k-1$ linear constraints on $a$, which can be written in matrix form:

$$Ca = 0,$$

where $[C]_{ji} = a_i^{(j)}\,\mathrm{cov}\,(\psi_i^{(k)}, \psi_i^{(j)})$. Hence, finding the optimal scalings for fixed $\psi$ while
preserving the orthogonality conditions is a simple constrained linear optimization problem. The solution is easily computed by a change of basis, using the $k-1$ constraint vectors of
$C$ as the first $k-1$ basis vectors, then computing the smallest linear principal component
in the remaining $p-k+1$ dimensions. Translation back to the usual basis yields the
minimizing solution.

As for the smallest APC, by definition $\mathrm{var}\sum_i a_i^{(k)*}\psi_i^{(k)} \le \mathrm{var}\sum_i a_i^{(k)}\psi_i^{(k)}$, hence the
rate of convergence is improved.
Chapter 5
Simulations of Additive Principal
Component Estimation
5.1 Introduction
In this chapter we present a simulation study of APC estimation. The purpose of the
study is assessment of the performance characteristics of the finite sample algorithm using
distributions with known solutions.
We know the exact APC solutions for the Gaussian distribution and the Uniform
on an ellipsoid, so these are natural distributions to consider. The dependencies of the
elliptically symmetric distributions of Chapter 3 can be fully characterized using only
linear and quadratic functions. This implies that these distributions represent a class of
"null distributions" from the point of view of nonlinear additive methodology. In the
Gaussian case, for instance, the only significant structure is linear, and the higher order
polynomial APCs are an artifact of the elliptical distribution, hence are redundant once
the linear structure is known. This notion of redundancy will be discussed more fully in
section 6.3.1. From a simulation study we might expect to get some practical guidance as
to how to detect such uninformative nonlinearities.
We have discovered a technique for constructing data sets with high-dimensional additive structure. With this technique, it is possible to test a more interesting range of
situations, involving mixtures of different marginal distributions and functions of any
form. In these simulations, we can test the effectiveness of the algorithm in recovering
"real" structure.
Lastly, we use simulation to study APC estimation in the two extreme cases of additive
dependency: the cases of mutual independence and exact singularity.
Data sets with known APC solutions provide a valuable testing ground for the algo
rithm, allowing quantitative assessment of bias and stability, and qualitative assessment
of the implementation. However, this simulation study is not intended to be an exhaustive
study of factors affecting estimation or algorithm performance - a task which would be
daunting to attempt. Rather, the aim is to come to a fuller understanding of the inherent
properties of the estimation procedure.
In addition, through the accumulation of experimental evidence we can begin to develop an intuition for the behavior of the algorithm, and hence for interpretation of the
estimates. From this accumulated experience we can infer heuristic guidance for using
APC estimation in an applied setting.
5.2 Evaluation measures
Before presenting results of the simulations, we first describe the quantities we use to
assess the accuracy of the estimation.
Each simulation yields a sequence of estimated APCs: the APC-function for variable
$i$, APC $k$, from the $j$th sample is denoted $\hat\phi_{ij}^{(k)}$, where $i = 1, \ldots, p$, $j = 1, \ldots, N$, $k = 1, \ldots, K$.
Sample statistics calculated from the simulations are compared with the true values of the
quantities described below.
1. The eigenvalue, var I: 4>i(Xi), a number lying between 0 and 1.
For the eigenvalue, bias, standard deviance and RMS deviance of the simulation
estimates are reported.
2. The p standard deviations of the transformed variables,
a p-vector constrained to lie on the (p-1)-sphere.

The standard deviations of the variables are analogous to variable loadings of linear
principal component analysis. There, $\phi_i(X_i) = a_i X_i$, so $\sigma(\phi_i(X_i)) = |a_i|$; hence this
$p$-vector shall be referred to as the variable loading of the estimated component. The
bias, standard deviance and RMS deviance are reported for each variable loading.
However, since the estimates are constrained to lie on the unit $(p-1)$-sphere, a more
meaningful metric is the angular separation between the estimated and true loading
vectors, the loading metric, $d_l(\hat a, a) = \cos^{-1}|\langle \hat a, a\rangle|$. An estimate of the mean angle
is given by $d_l(\bar a, a)$, where $\bar a = \|\sum_j \hat a_j\|^{-1} \sum_j \hat a_j$. Two estimates of the variability
of the loadings are given, analogous to standard error and RMS error estimators:
the average angle between estimated and true loadings, $N^{-1}\sum_j d_l(\hat a_j, a)$, and the
average angle between estimated and mean loadings, $N^{-1}\sum_j d_l(\hat a_j, \bar a)$.
3. The APC, $\phi(X) = \sum_i \phi_i(X_i)$, a function of $X$.

If the estimation of the APC is accurate, the joint distribution of the true and estimated APC is concentrated along the diagonal. Hence an indicator of the accuracy
of estimation for a sample is the correlation between the estimated and true APC.
This correlation, averaged over all sample estimates, together with the standard deviation, are reported. A plot corresponding to this average correlation, which gives a
visual impression of the estimation accuracy, is $\bar{\hat\phi}^{(k)}(X_0)$ versus $\phi^{(k)}(X_0)$, for some
fixed sample $X_0$ of $X$, where

$$\bar{\hat\phi}^{(k)}(X_0) = \sum_i \bar{\hat\phi}_i^{(k)}(X_i^0).$$

The average APC-function estimate $\bar{\hat\phi}_i^{(k)}(X_i^0)$ is simply an average over the APC-function estimates of all samples of the simulation,

$$\bar{\hat\phi}_i^{(k)}(X_i^0) = N^{-1}\sum_{j=1}^{N} \hat\phi_{ij}^{(k)}(X_i^0).$$
4. The APC-function for each variable, $\phi_i(X_i)$.

The accuracy of the APC-function estimates is best assessed by comparing true and
estimated APC-functions graphically. The "average" accuracy of the $i$th function
estimate can be assessed by comparing $\bar{\hat\phi}_i^{(k)}(X_i^0)$ versus $X_i^0$ with the true $\phi_i^{(k)}(X_i^0)$.
An impression of the "variance" of the estimation is gained from superimposing the
$N$ plots $(\hat\phi_{ij}^{(k)} - \bar{\hat\phi}_i^{(k)})(X_i^0)$ versus $X_i^0$. To facilitate comparison, all the $k$th APC-function plots are plotted with the same Y-axis scale.
If the eigenvalue of the component has multiplicity greater than one, none of the
quantities 2-4 are uniquely defined. In the finite sample setting, if true eigenvalues of two
components are almost identical, the sample APCs will be unstable since the underlying
eigen space is ill-determined. This is a consequence of the discontinuity of eigenfunctions
as a function of their eigenvalue estimates. More explicitly, suppose two eigenvalues are
close, and we use as our finite sample estimate of the smallest eigenvalue the smallest
estimated eigenvalue. Due to finite sample variability, the eigenfunction corresponding to
this estimate could be either of the smallest or second smallest true eigenfunction. Between
different samples, then, the eigenfunction estimates for the smallest eigenvalue fluctuate
between the orthogonal sets of functions of the smallest and second smallest eigenvalues.
The eigenvalue estimates themselves will be close to their true values, however.
To assess accuracy of estimation in this scenario, measures of comparison that are
not affected by this behavior must be used. Although we cannot compare the individual
components, the eigenspace they span is uniquely determined. If, for instance, the 2nd
and 3rd smallest eigenvalues are the same, then while the 2nd and 3rd smallest APCs
are not unique, we wish at least $\mathrm{span}(\hat\phi^{(2)}, \hat\phi^{(3)})$ to be close to $\mathrm{span}(\phi^{(2)}, \phi^{(3)})$.
This suggests a canonical correlation analysis of these two spaces will give a measure of
the closeness of the true and estimated spaces in terms of the canonical correlations $\rho^{(1)}$ and $\rho^{(2)}$.
A natural metric between the spaces, which we will refer to as the canonical metric, is
$C(2) = \sqrt{\alpha^2 + \beta^2}$, where $\alpha = \cos^{-1}\rho^{(1)}$, $\beta = \cos^{-1}\rho^{(2)}$. This metric can be extended in
an obvious way to provide an overall measure of closeness for $k$ APCs: $C(k) = \sqrt{\sum_j \alpha_j^2}$,
where $\alpha_j = \cos^{-1}\rho^{(j)}$, $\rho^{(j)}$ the $j$th canonical correlation between the true and empirical spaces.
$C(k)$ is bounded above by $\cos^{-1}(0)\sqrt{k}$, which is increasing in $k$. Hence the scaled metric,
$C(k)/(\sqrt{k}\,\cos^{-1}(0))$, which is bounded above by 1 for all $k$, facilitates comparison between spaces
of different dimension.
While this metric is especially useful for ill-determined eigenspaces, it is also a useful
overall measure of APC estimation up to k dimensions, hence the mean and standard
deviance of this global measure of accuracy are also reported for all simulations.
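Both comparison measures are straightforward to compute; the sketch below (not part of the thesis; it follows numpy conventions) returns the loading metric and the scaled canonical metric between the spans of estimated and true APC score vectors.

```python
import numpy as np

def loading_metric(a_hat, a):
    """Angular separation d_l = arccos |<a_hat, a>| between unit loading vectors."""
    return np.arccos(abs(np.dot(a_hat, a)))

def scaled_canonical_metric(A_hat, A):
    """A_hat, A: (n, k) matrices whose columns span the estimated and true APC spaces.
    C(k) = sqrt(sum_j arccos(rho_j)^2), scaled by its upper bound sqrt(k) * arccos(0)."""
    Qh, _ = np.linalg.qr(A_hat)
    Q, _ = np.linalg.qr(A)
    rho = np.clip(np.linalg.svd(Qh.T @ Q, compute_uv=False), -1.0, 1.0)  # canonical correlations
    C = np.sqrt((np.arccos(rho) ** 2).sum())
    return C / (np.sqrt(A.shape[1]) * np.arccos(0.0))
```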
5.3 Simulations using the Gaussian Distribution
For the normal distribution, the APC-functions are scaled Hermite polynomials. Recall
that the appropriate variable loadings for an APC constructed from $k$th degree polynomials,
and its associated eigenvalue, are determined by the eigenvectors and eigenvalues of the
correlation matrix $R^{k\bullet}$. The $k$th smallest APC must belong to the subset of APCs formed
from Hermite polynomials of degree $k$ or less.
For the Gaussian distribution, two different scenarios were studied by simulation. In
the first scenario, GAU-S1, the variables are nearly collinear, that is, the smallest eigenvalue of their correlation matrix is close to zero, and all eigenvalues are distinct. Estimates
of the APCs, the variable loadings and the APC-functions should be well determined.
For the second simulation study, GAU-S2, the second and third smallest eigenvalues of
operator P are almost identical. The smallest APC estimates should be well determined,
however the second and third APCs are not unique, although they uniquely determine the
space of the second smallest eigenvalue.
5.3.1 GAU-S1: Gaussian Scenario 1
Random samples are generated from a normal distribution with zero mean and correlation
matrix:

$$R = \begin{pmatrix} 1.0 & .6 & .4 & -.7 \\ & 1.0 & .5 & -.3 \\ & & 1.0 & -.8 \\ & & & 1.0 \end{pmatrix}.$$
Table 5.1: GAU-S1 : Correlations between $\hat\phi^{(k)}(X_0)$ and $\phi^{(k)}(X_0)$

Component        First   Second  Third   Fourth
Correlation      0.988   0.876   0.681   0.748
Std dev (N=20)   0.008   0.088   0.190   0.188
Table 5.2: GAU-S1 : Loading metric

Component                          First   Second  Third   Fourth
$d_l(\bar a, a_{true})$ (deg)      0.038   0.185   1.111   0.895
ave $d_l(\hat a_j, \bar a)$        0.167   0.774   1.385   1.332
ave $d_l(\hat a_j, a_{true})$      0.172   0.778   1.665   1.531
The four smallest APCs are estimated on each of 20 data sets with 200 observations. The
eigenvalues of the smallest APC are given below.
Eigenvalues of $R$:            0.0197, 0.5455, 0.7732, ...
Eigenvalues of $R^{2\bullet}$: 0.2033, 0.7594, ...
Eigenvalues of $R^{3\bullet}$: 0.3817, 0.8556, ...
Thus, the smallest APC is the smallest linear principal component, the second smallest is
the smallest quadratic, the third the smallest cubic, but the fourth smallest is the second
smallest linear principal component.
Table 5.1 gives the average correlations between the true and estimated APCs and
Figure 5.1, the corresponding plots.
Table 5.2 gives the angle between the mean vector of loadings and the true loadings
( the loading metric ), and two measures of dispersion for this angle: the average angle
between estimated and true loadings, and the average angle between estimated and mean
loadings.
In Table 5.3 the true and estimated eigenvalues and individual loadings are compared.
[Figure 5.1: GAU-S1 : Correlation Plots. For the $k$th APC: $\hat\phi^{(k)}(X_0)$ versus $\phi^{(k)}(X_0)$.]
All the eigenvalue estimates have small negative bias; both bias and variance estimates
increase with the number of components estimated. Similarly, the bias and variance of
the variable loadings is larger for the higher components.

Finally, Table 5.4 contains the scaled and unscaled canonical metrics between the APC
spaces spanned by $\{\hat\phi^{(k)}(X_0)\}$ and $\{\phi^{(k)}(X_0)\}$.
It is apparent from the APC correlations, the angles of the variable loadings and the
canonical metric, that while the first and second APCs are estimated extremely accurately,
the third and fourth are less exactly determined. This is clearly seen in the increase in
scatter of Figure 5.1.
The scaled canonical metric decreases with the inclusion of the fourth APC. This
decrease implies the 4-dimensional space is more accurately determined than the 3-dimensional space, suggesting the estimates of the third and fourth APCs are "mixing", that is,
there is a slight lack of resolution within the space of the third and fourth components.
This phenomenon is discussed fully in the concluding section 5.8.
Figure 5.2 shows the true and average estimated APC-functions, and Figure 5.3 the variance
of these functions over the 20 replications. The function plots verify the accuracy in estimation of the two smallest components. The variance plots for the third and fourth components show the presence of a linear trend in some replicates of the third APC-functions,
and likewise a cubic trend in some fourth APC-functions. This behavior provides further
support for the explanation of mixing, or lack of resolution of these components, suggested
previously.
5.3.2 GAU-S2: Gaussian Scenario 2
Random samples are generated from a normal distribution with zero mean and correlation
matrix:

$$R = \begin{pmatrix} 1.0 & .9 & .8 & .2 \\ & 1.0 & .8 & .2 \\ & & 1.0 & .4 \\ & & & 1.0 \end{pmatrix}.$$
Table 5.3: GAU-S1 : Eigenvalue and Variable Loadings

First APC         Theoretical  Estimated  Bias     Std dev  RMS dev
Eigenvalue        0.0197       0.0186     -0.0011  0.0011   0.0015
Loadings : Var 1  0.460        0.462      0.002    0.015    0.016
Var 2             0.345        0.349      0.004    0.018    0.018
Var 3             0.510        0.511      0.001    0.012    0.012
Var 4             0.639        0.634      -0.005   0.012    0.013

Second APC        Theoretical  Estimated  Bias     Std dev  RMS dev
Eigenvalue        0.2033       0.1722     -0.0311  0.0501   0.0594
Loadings : Var 1  0.431        0.449      0.018    0.082    0.084
Var 2             0.288        0.297      0.009    0.081    0.081
Var 3             0.538        0.530      -0.008   0.069    0.070
Var 4             0.644        0.640      -0.004   0.045    0.051

Third APC         Theoretical  Estimated  Bias     Std dev  RMS dev
Eigenvalue        0.3817       0.3448     -0.0369  0.0643   0.0746
Loadings : Var 1  0.398        0.449      0.051    0.138    0.148
Var 2             0.224        0.340      0.116    0.127    0.174
Var 3             0.570        0.558      -0.012   0.081    0.082
Var 4             0.683        0.560      -0.123   0.135    0.184

Fourth APC        Theoretical  Estimated  Bias     Std dev  RMS dev
Eigenvalue        0.545        0.506      -0.039   0.052    0.066
Loadings : Var 1  0.663        0.581      -0.082   0.113    0.141
Var 2             0.419        0.455      0.036    0.138    0.142
Var 3             0.567        0.526      -0.041   0.078    0.088
Var 4             0.252        0.356      0.104    0.131    0.168
[Figure 5.2: GAU-S1 : APC-function Estimation. For the $k$th APC, $i$th variable: $\hat\phi_i^{(k)}(X_i^0)$ (points) and $\phi_i^{(k)}(X_i^0)$ (solid line) versus $X_i^0$.]
[Figure 5.3: GAU-S1 : Variance of APC-function Estimation. For the $k$th APC, $i$th variable: $(\hat\phi_{ij}^{(k)} - \bar{\hat\phi}_i^{(k)})(X_i^0)$ versus $X_i^0$ for $j = 1, \ldots, N$.]
Table 5.4: GAU-S1 : Canonical Metric between $\{\hat\phi^{(k)}(X_0)\}$ and $\{\phi^{(k)}(X_0)\}$

Components Included  1       1,2     1,2,3   1,2,3,4
Unscaled metric      0.951   2.842   5.131   5.717
Std dev (N=20)       0.277   1.071   1.489   1.559
Scaled metric        0.0106  0.0223  0.0329  0.0318
The three smallest APCs are estimated on each of 20 data sets, again of 200 observations.
The eigenvalues of the smallest APCs are given below.
Eigenvalues of $R$:            0.1000, 0.1914, 0.9224, ...
Eigenvalues of $R^{2\bullet}$: 0.1900, 0.3954, ...
Eigenvalues of $R^{3\bullet}$: 0.2710, 0.5486, ...
This shows that the smallest APC is the smallest linear principal component, but, due
to the closeness of the eigenvalues, the second and third span the space of the smallest
quadratic and the second smallest linear. The separation of the fourth smallest eigenvalue
is sufficient to make the space unique. Table 5.5 compares the true and estimated eigenvalues for the first three components, and the variable loadings of the smallest. The first
two eigenvalue estimates are again negatively biased, although not the third, and variability increases with component number. The estimated variable loadings of the smallest
component are considerably more biased for the small values.
The APC correlation for the smallest APC is 0.9459, with standard deviance 0.0017.
The angle between true and average estimated variable loadings for the smallest APC is
0.771°, with average angle between true and estimated of 0.615°, and between average and
estimated of 0.866°. Figure 5.4 gives the correlation plots for all three APCs, and the lack
of uniqueness in the second and third APC is apparent. Figure 5.5 shows the estimation
of the APC-functions for the smallest APC.
Evaluation of the accuracy of second and third estimated APC is possible only through
assessment of the canonical metric comparing the spans of the true and estimated APCs.
Table 5.6 contains canonical metrics of the relevant spaces.
Table 5.5: GAU-S2 : Eigenvalue and Variable Loadings

First APC         Theoretical  Estimated  Bias     Std dev  RMS dev
Eigenvalue        0.1000       0.0887     -0.0113  0.0116   0.0164
Loadings : Var 1  0.707        0.704      -0.003   0.058    0.058
Var 2             0.707        0.690      -0.017   0.053    0.056
Var 3             0.000        0.112      0.112    0.086    0.144
Var 4             0.000        0.046      0.046    0.022    0.052

Second APC        Theoretical  Estimated  Bias     Std dev  RMS dev
Eigenvalue        0.1900       0.1590     -0.0310  0.0311   0.0445

Third APC         Theoretical  Estimated  Bias     Std dev  RMS dev
Eigenvalue        0.1914       0.1946     0.0032   0.0259   0.0261
[Figure 5.4: GAU-S2 : Correlation Plots.]
[Figure 5.5: GAU-S2 : APC-function Estimation for Smallest APC. For the $i$th variable: $\hat\phi_i^{(k)}(X_i^0)$ (points) and $\phi_i^{(k)}(X_i^0)$ (solid line) versus $X_i^0$.]
Table 5.6: GAU-S2 : Canonical Metric between $\{\hat\phi^{(k)}(X_0)\}$ and $\{\phi^{(k)}(X_0)\}$

Components Included  1       2,3     1,2,3
Unscaled metric      1.935   4.042   3.718
Std dev (N=20)       0.785   1.414   1.220
Scaled metric        0.0215  0.0318  0.0238
The canonical metrics again give evidence of lack of resolution between the spaces of the smallest and second smallest eigenvalue
(the latter having dimension two), since the distance between the true and estimated 3-dimensional space of the first and second eigenvalues is smaller than that between the
2-dimensional spaces of the second eigenvalue only. This explains the behavior of the
estimates for Variables 3 and 4 in the smallest component. These APC-function estimates,
instead of being virtually zero in the first APC, foreshadow the APC-functions of these
variables in the second eigenspace.
5.4 Simulations using the Uniform Distribution on an
Ellipsoid
Data sets are simulated from the Uniform distribution on the ellipsoid described by the
correlation matrix :
$$R = \begin{pmatrix} 1.00 & 0.55 & 0.33 \\ & 1.00 & 0.30 \\ & & 1.00 \end{pmatrix}.$$
The theoretical solutions for this correlation matrix have been derived in section 3.6: the
APC-functions are scaled Gegenbauer polynomials, and the variable scalings and eigenvalues are computed from the eigen decomposition of their correlation matrix. The $k$th APC
is constructed from a polynomial of degree $2k$ or less.
For the distribution specified, the four smallest APCs and their eigenvalues are:
0.4488 : the smallest linear principal component
0.6750 : the smallest cubic principal component
0.7526: the second smallest linear principal component
0.8596 : the smallest quadratic principal component
These eigenvalues are all far larger than those of the Gaussian Scenarios. By their
proximity to one, it is clear the additive dependencies are weak.
Table 5.7 compares the true and estimated eigenvalues and variable loadings for each
of the four smallest APCs. Estimated eigenvalues are in good agreement, although all have
Table 5.7: UNI.S : Eigenvalue and Variable Loadings
First APC Theoretical Estimated Bias Std dey RMS dey
Eigenvalue 0.4488 0.4295 -0.0193 0.0307 0.0366
Loadings :Var 1 0.722 0.715 -0.007 0.025 0.026
Var 2 0.689 0.686 -0.003 0.028 0.028
Var 3 0.057 0.130 0.083 0.034 0.082
Second APC Theoretical Estimated Bias Std dey RMS dey
Eigenvalue 0.6750 0.6701 -0.0049 0.0428 0.0432
Loadings :Var 1 0.559 0.481 -0.078 0.109 0.131
Var 2 0.549 0.512 -0.037 0.067 0.078
Var 3 0.620 0.696 0.076 0.086 0.116
Third APC Theoretical Estimated Bias Std dey RMS dey
Eigenvalue 0.7526 0.7458 -0.0066 0.0343 0.0351
Loadings :Var 1 0.301 0.439 0.138 0.127 0.190
Var 2 0.387 0.498 0.111 0.115 0.161
Var 3 0.871 0.722 -0.149 0.103 0.185
Fourth APC Theoretical Estimated Bias Std dey RMS dey
Eigenvalue 0.8596 0.8383 -0.0213 0.0411 0.0465
Loadings :Var 1 0.273 0.489 0.216 0.142 0.264
Var 2 0.789 0.592 -0.197 0.149 0.251
I0.552 0.580 0.028I Var 3 I 0.185 0.188
Table 5.8: UNI-S : Correlation between $\hat\phi^{(k)}(X_0)$ and $\phi^{(k)}(X_0)$

Component        First   Second  Third   Fourth
Correlation      0.969   0.646   0.539   0.673
Std dev (N=20)   0.018   0.311   0.330   0.221

Table 5.9: UNI-S : Loading metric

Component                          First   Second  Third   Fourth
$d_l(\bar a, a_{true})$ (deg)      0.459   0.121   1.466   1.875
ave $d_l(\hat a_j, \bar a)$        0.273   0.834   1.159   1.604
ave $d_l(\hat a_j, a_{true})$      0.505   0.985   1.715   2.312
a slight negative bias, which increases with the size of the eigenvalue. The variability
in eigenvalue estimation remains constant, however. Both bias and variance of the variable
loadings increase with the number of components estimated.

Average correlations between APCs, in Table 5.8, show close agreement only for the
smallest APC. The other three correlations are highly variable, indicating true and estimated APCs are not close in many of the iterations, despite the accuracy of the eigenvalue
estimates.
The mean angular separations of the loading vectors for each component, Table 5.9,
reflect decreasing accuracy as more components are estimated, which is echoed in Figure
5.6. Nevertheless, the canonical metrics, Table 5.10, indicate very good agreement between
true and estimated APCs for the four smallest components. Notice there is little change
in the scaled metric from 2 to 4 dimensions.

Figures 5.7 and 5.8 show the APC-function estimates and their variance plots. From
these we see that the bias of the estimates is caused by behavior reminiscent of that
expected when eigenfunctions are ill-determined. As is apparent from the variance plots
of the second and third APCs, for some samples the functions of the second APC are
[Figure 5.6: UNI-S : Correlation Plots.]
Table 5.10: UNI-S : Canonical Metric between $\{\hat\phi^{(k)}(X_0)\}_1^4$ and $\{\phi^{(k)}(X_0)\}_1^4$

Components Included  1       1,2     1,2,3   1,2,3,4
Unscaled metric      1.511   5.184   7.143   8.134
Std dev (N=20)       0.485   2.492   3.294   3.741
Scaled metric        0.0168  0.0407  0.0458  0.0450
[Figure 5.7: UNI-S : APC-function Estimation. For the $k$th APC, $i$th variable: $\hat\phi_i^{(k)}(X_i^0)$ (points) and $\phi_i^{(k)}(X_i^0)$ (solid line) versus $X_i^0$.]
[Figure 5.8: UNI-S : Variance of APC-function Estimation. For the $k$th APC, $i$th variable: $(\hat\phi_{ij}^{(k)} - \bar{\hat\phi}_i^{(k)})(X_i^0)$ versus $X_i^0$ for $j = 1, \ldots, N$.]
cubic, while for others they are linear. Evidently, despite theoretical uniqueness of the
eigenfunctions, estimates are not well-determined. It appears that both the size and the
spacing of the eigenvalues are important for stability of the eigenfunctions. In view of the
weak dependencies of the APCs, it is not surprising that the estimates are unstable for
the higher APCs.
The canonical metric for the first four dimensions, however, will not be affected by
this phenomenon, provided the space of the four smallest estimated APCs corresponds to
the four smallest true eigenfunctions.
5.5 Simulations using Manifolds defined by Specified
Constraints
This simulation studies the algorithm applied to data sets with a larger number of variables
and more complex structure. We discovered a method for constructing data sets that
satisfy a set of orthogonal constraints, which are almost completely prespecified for an
arbitrary set of marginal distributions. Using this technique, data sets are simulated that
concentrate around a 4-dimensional manifold in the 6 variables. Two simulation studies
are presented, using the same set of constraints for each, but changing the eigenvalues of
the two smallest APCs.
5.5.1 A Method for Constructing Manifolds
Constructing realizations of additive manifolds using continuous functions is not a trivial
problem. The following method induces additive dependencies between variables described
by a constraint in which all except one of the variable functions are specified.
The idea underlying the method is a simple trick with permutations. Suppose we
begin with $p$ variables $X_1, \ldots, X_p$ and some set of specified functions $\phi_1(X_1), \ldots, \phi_p(X_p)$.
Let $Y \equiv \phi_1(X_1)$ and $Z \equiv \sum_{i=2}^{p} \phi_i(X_i)$. If we sort the values of $Y$ and $Z$, then the ordered
variables are likely to be similar. By permuting $X_1$ and $X_2, \ldots, X_p$ in parallel with $Y$ and
$Z$ respectively, we have that

$$\phi_1(X_1') \approx \sum_{i=2}^{p} \phi_i(X_i'),$$

where $X_i'$ are the permuted variables.
The exact procedure, first for a manifold of co-dimension 1:

1. Generate $p$ independent samples $x_1, x_2, \ldots, x_p$ of $X_1, X_2, \ldots, X_p$. Standardize so that
ave $x_i = 0$, ave $x_i^2 = 1$. Compute $\phi_1(x_1), \ldots, \phi_p(x_p)$ for the specified sequence of
functions $\phi_i$. Center the transformed variables, ave $\phi_i(x_i) = 0$.

2. Form $y = \phi_1(x_1)$ and $z = \sum_{i=2}^{p} \phi_i(x_i)$.

3. Find the ordering permutations $\pi_y$ and $\pi_z$ for $y$ and $z$, so $\pi_y(y)$ and $\pi_z(z)$ are both
increasing.

4. Form
$$x_1' = \pi_y(x_1), \qquad x_i' = \pi_z(x_i), \quad i = 2, \ldots, p.$$

The resulting data set $x_1', \ldots, x_p'$ satisfies the constraint

$$g(\phi_1(x_1')) - \sum_{i=2}^{p} \phi_i(x_i') = 0,$$

where $g(\cdot)$ is a monotonic function ($g$ is the mapping between the ordered variables $\pi_y(y)$
and $\pi_z(z)$). All transformations except $g \circ \phi_1$ are prespecified.
The above constraint is exact. To construct a data set for which the variance of the
constraint is nonzero, we modify step 2 above as follows.
Generate a vector $u$ independent of $X_2, \ldots, X_p$, with ave $u = 0$ and ave $u^2 = \epsilon^2$. Form
$z$ by $z = \sum_{i=2}^{p} \phi_i(x_i) + u$. Perform the sequence of steps above using this modified $z$. The
resulting data set has

$$g(\phi_1(x_1')) = \sum_{i=2}^{p} \phi_i(x_i') + u',$$

where $u' = \pi_z(u)$. Hence

$$g(\phi_1(x_1')) - \sum_{i=2}^{p} \phi_i(x_i') = u',$$

and so $\mathrm{var}\left(g(\phi_1(x_1')) - \sum_{i=2}^{p} \phi_i(x_i')\right) = \mathrm{var}(u') = \mathrm{var}(u) = \epsilon^2$.
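The single-constraint construction is only a few lines of numpy; the sketch below (mine, with illustrative transforms and marginals) follows the four steps above, with the added noise vector controlling the constraint variance.

```python
import numpy as np

rng = np.random.default_rng(1)
standardize = lambda v: (v - v.mean()) / v.std()

n, eps = 200, 0.2
x = [standardize(rng.normal(size=n)) for _ in range(4)]     # step 1: independent, standardized samples
transforms = [np.sin, np.square, np.tanh, np.abs]            # illustrative phi_i
f = [standardize(t(xi)) for t, xi in zip(transforms, x)]     # centered, scaled phi_i(x_i)

y = f[0]                                                     # step 2
z = sum(f[1:]) + eps * standardize(rng.normal(size=n))       # summed block plus noise u

pi_y, pi_z = np.argsort(y), np.argsort(z)                    # step 3: ordering permutations
x_new = [x[0][pi_y]] + [xi[pi_z] for xi in x[1:]]            # step 4: permute x_1 with y, the rest with z

# g(phi_1(x_1')) - sum_{i>=2} phi_i(x_i') now has variance eps^2, where g is the
# monotone map between the ordered values of y and z.
```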
71
The procedure can be generalized to generate data sets with lower dimensional additive
structure, that is, satisfying more than one orthogonal constraint. The same permutation
idea is used, however now a separate ordering is induced for each constraint. To produce
these multiple dependencies, the block of summed variables ($z$, above) is not permuted;
rather, the inverse ordering is applied to the single variable (y, above). Orthogonality can
be achieved by using a Gram-Schmidt step, or by specifying the functions to be orthogonal.
The following method produces a data set lying in a manifold of co-dimension 2.
1. Generate $p$ independent samples $x_1, x_2, \ldots, x_p$ with ave $x_i = 0$, ave $x_i^2 = 1$. Compute $\phi_1(x_1), \phi_3(x_3), \ldots, \phi_p(x_p)$ for some sequence of functions $\phi_i$, where ave $\phi_i(x_i) = 0$.
Compute $\psi_2(x_2), \psi_3(x_3), \ldots, \psi_p(x_p)$ for some sequence of functions $\psi_i$, where
ave $\psi_i(x_i) = 0$ and $\psi_i(x_i) \perp \phi_i(x_i)$ for $i = 3, \ldots, p$.

2. Form $y_1 = \phi_1(x_1)$, $z_1 = \sum_{i=3}^{p} \phi_i(x_i)$, and $y_2 = \psi_2(x_2)$, $z_2 = \sum_{i=3}^{p} \psi_i(x_i)$.

3. Find the ordering permutations $\pi_{y_1}$, $\pi_{z_1}$ of $y_1$, $z_1$, and $\pi_{y_2}$, $\pi_{z_2}$ of $y_2$, $z_2$.

4. Form
$$x_1' = \pi_{z_1}^{-1}\pi_{y_1}(x_1), \qquad x_2' = \pi_{z_2}^{-1}\pi_{y_2}(x_2).$$

The data set $x_1', x_2', x_3, \ldots, x_p$ satisfies the two orthogonal constraints

$$g_1(\phi_1(x_1')) - \phi_3(x_3) - \cdots - \phi_p(x_p) = 0, \qquad g_2(\psi_2(x_2')) - \psi_3(x_3) - \cdots - \psi_p(x_p) = 0,$$

with $\sum_{i=1}^{p} \mathrm{cov}(\phi_i(x_i), \psi_i(x_i)) = 0$.

All of the functions, except $g_1 \circ \phi_1$ and $g_2 \circ \psi_2$, can be specified. This can be altered in a
manner similar to the single constraint case, so that the constraints are not exact. Adding
noise components with different variances permits control over the eigenvalue separation.
5.5.2 SCM-S1: Specified Constraint Manifold Scenario 1
The simulated data sets we use for the two simulations that follow are generated using the
technique of the previous section. We use 6 variables and two constraints. The variables
$(X_1, \ldots, X_6)$ satisfy two equations of the form

$$Z_1 = \phi_1(X_1) + \phi_3(X_3) + \phi_4(X_4) + \phi_5(X_5),$$
$$Z_2 = \psi_2(X_2) + \psi_3(X_3) + \psi_4(X_4) + \psi_5(X_5),$$
$$\sum_i \mathrm{cov}(\phi_i, \psi_i) = 0,$$

where $\mathrm{var}\,Z_1 = 0.048$ and $\mathrm{var}\,Z_2 = 0.137$.
Variable 6 is not included in the manifold construction, and thus has a true zero function
for both equations. The marginal distributions of the variables before centering and scaling
are:
$X_1 \sim N(0,1)$
$X_2 \sim$ Chisq(6)
$X_3 \sim 0.6 \cdot N(0,1) + 0.4 \cdot N(2,1)$
$X_4 \sim$ Categorical
$X_5 \sim N(0,1)$
$X_6 \sim$ Uniform$[-1,1]$
The four dimensional manifold is defined by the orthogonal equations $\Phi(X) = 0$ and
$\Psi(X) = 0$. The variances of $Z_1$ and $Z_2$, 0.048 and 0.137 respectively, are the approximate
eigenvalues of the two smallest APCs. The APCs are defined by the APC-functions
$\Phi(X) = (\phi_1, \ldots, \phi_6)$ and $\Psi(X) = (\psi_1, \ldots, \psi_6)$.

The two smallest APCs are estimated for 20 data sets of 200 observations and 6 variables
simulated from the manifold described above.
The true and estimated eigenvalues and variable loadings, given in Table 5.11, show
accurate estimates, with the exception of the variable loadings of Variable 2 in Component
1 and Variable 1 in Component 2. The correlations between estimated and true APCs,
Table 5.12, show close agreement for the first component; however, the second component
estimation is less consistent. The angles between the true and average estimated variable
loadings, Table 5.13, while not significantly different from zero, are more variable than the
two smallest APC variable loadings of other simulations. The canonical metric, Table 5.14,
shows a greater accuracy in estimation of the two-dimensional space; this probably reflects a
Table 5.11: SCM-S1 : Eigenvalue and Variable Loadings
First APC          Theoretical  Estimated  Bias     Std dev  RMS dev
Eigenvalue         0.048        0.049       0.001   0.004    0.004
Loadings : Var 1   0.707        0.690      -0.017   0.022    0.028
           Var 2   0.000        0.116       0.116   0.075    0.141
           Var 3   0.408        0.435       0.027   0.020    0.034
           Var 4   0.408        0.375      -0.033   0.028    0.044
           Var 5   0.408        0.415       0.007   0.018    0.019
           Var 6   0.000        0.031       0.031   0.010    0.034
Second APC         Theoretical  Estimated  Bias     Std dev  RMS dev
Eigenvalue         0.137        0.113      -0.024   0.011    0.021
Loadings : Var 1   0.000        0.147       0.147   0.082    0.167
           Var 2   0.707        0.681      -0.026   0.024    0.036
           Var 3   0.408        0.397      -0.011   0.031    0.033
           Var 4   0.408        0.404      -0.004   0.041    0.041
           Var 5   0.408        0.425       0.017   0.048    0.051
           Var 6   0.000        0.051       0.051   0.020    0.056
Table 5.12: SCM-S1 : Correlation between φ^(k)(X_0) and φ̂^(k)(X_0)
Component         First    Second
Correlation       0.9018   0.7518
Std dev (N=20)    0.0188   0.0645
Table 5.13: SCM-S1 : Loading metric
Component                 First    Second
d_l(ā, a_true)            0.812    0.971
ave( d_l(a_i, ā) )        0.474    0.631
ave( d_l(a_i, a_true) )   0.876    1.113
Table 5.14: SCM-S1 : Canonical Metric between {φ̂^(k)(X_0)} and {φ^(k)(X_0)}
Components Included   1        1,2
Unscaled metric       4.99     2.79
Std dev (N=20)        0.549    0.270
Scaled metric         0.0372   0.0311
Figures 5.9 and
5.10 compare the estimated and true APC-functions of the first two components. Despite
the discrepancies in loading and eigenvalue estimates, the function estimates are close to
their true counterparts, particularly in a qualitative sense - they follow closely the global
characteristics of the true functions. The greatest inaccuracies arise in the estimation of
the zero function. For instance, in the APC-function for the second variable, smallest
APC, the estimate appears to match the shape of the true APC-function for
the second variable, second component. This is also apparent in the variance plots of the
estimation. A similar phenomenon is observed in the APC-function estimate for the first
variable, second smallest APC, where the estimate, instead of being close to zero, seems to
echo the function from the first component. In contrast, the estimate of the zero function
for Variable 6, which is independent, is negligible as expected.
Such mixing between the two components would cause the decrease in the scaled canonical
metric for two dimensions noted above.
Figure 5.9: SCM-S1 APC-function Estimation for Component 1
Figure 5.10: SCM-S1 APC-function Estimation for Component 2
Table 5.15: SCM-S2 : Correlation between φ^(k)(X_0) and φ̂^(k)(X_0)
Component         First    Second
Correlation       0.9151   0.8662
Std dev (N=20)    0.0241   0.0514
Table 5.16: SCM-S2 : Loading metric
Component                 First    Second
d_l(ā, a_true)            0.528    1.159
ave( d_l(a_i, ā) )        0.348    0.911
ave( d_l(a_i, a_true) )   0.626    1.446
5.5.3 SCM-S2: Specified Constraint Manifold Scenario 2
The variables for this second simulation obey exactly the constraints and specifications
of the first specified manifold. Again, 20 data sets of size 200 were generated for the
simulation. For this simulation, the order of the APCs is changed, since in the second
simulation var Z_1 ≈ 0.413 and var Z_2 ≈ 0.091. The smallest APC now corresponds to
the APC-functions Ψ(X), with eigenvalue 0.091, and the second smallest to Φ(X), with
eigenvalue 0.413.
The correlations between true and estimated APCs, Table 5.15, and the angles between
the variable loadings, Table 5.16, as usual show greater precision for the smallest APC,
although both are highly correlated. The true and average estimated eigenvalues and
loadings for the two smallest APC are given in Table 5.17. The estimate for the second
eigenvalue is unexpectedly low, although the variable loadings show very good agreement.
The canonical metric, Table 5.18, shows the expected increase for the two dimensional
space. Figures 5.11 and 5.12 depict the true and estimated APC-functions and their
variance. The structure of the data given by the two constraints has been accurately
Table 5.17: SCM-S2 : Eigenvalue and Variable Loadings
First APC          Theoretical  Estimated  Bias     Std dev  RMS dev
Eigenvalue         0.091        0.084      0.007    0.007    0.010
Loadings : Var 1   0.000        0.067      0.067    0.023    0.073
           Var 2   0.707        0.698      0.009    0.021    0.023
           Var 3   0.408        0.415      0.007    0.020    0.021
           Var 4   0.408        0.421      0.013    0.029    0.032
           Var 5   0.408        0.391      0.017    0.031    0.035
           Var 6   0.000        0.044      0.044    0.019    0.049
Second APC         Theoretical  Estimated  Bias     Std dev  RMS dev
Eigenvalue         0.413        0.291      0.122    0.023    0.126
Loadings : Var 1   0.707        0.76       0.031    0.032    0.045
           Var 2   0.000        0.130      0.130    0.075    0.153
           Var 3   0.408        0.418      0.010    0.087    0.088
           Var 4   0.408        0.388      0.020    0.070    0.073
           Var 5   0.408        0.400      0.008    0.069    0.069
           Var 6   0.000        0.124      0.124    0.050    0.137
Table 5.18: SCM-S2 : Canonical Metric between {φ̂^(k)(X_0)} and {φ^(k)(X_0)}
Components Included   1        1,2
Unscaled metric       2.571    4.09
Std dev (N=20)        0.396    0.647
Scaled metric         0.029    0.032
Figure 5.11: SCM-S2 APC-function Estimation for Component 1
Figure 5.12: SCM-S2 APC-function Estimation for Component 2
recovered.
5.6 APC Estimation for Uncorrelated Variables
Variables that are mutually independent have no additive structure, that is, they have
no APC with eigenvalues smaller than 1 (Theorem 2.3). Variables that are merely pairwise
uncorrelated would usually be expected to have only very weak additive structure. In
either case, it is valuable to know how APC estimation behaves in the absence of additive
structure, as this will provide guidance for judging the significance of structure detected
in data analysis.
Independent Gaussian
The three smallest APCs were estimated for five samples of 200 observations from N(0, I).
Quantiles for the three smallest eigenvalues of the initial estimates, calculated from an
empirical distribution function, are given for comparison.¹

              APC estimate   F̂_200(0.01)   F̂_200(0.05)   F̂_200(0.10)   F̂_200(0.50)
    λ̂(1)      0.676          0.566         0.715         0.763         0.867
    λ̂(2)      0.694          0.732         0.758         0.780         0.849
    λ̂(3)      0.721          0.714         0.758         0.772         0.846
Loadings of the APC-functions indicate the variables usually contribute equally in the
APC estimate.
The APC-function estimates, shown in figure 5.13, share a common feature - a steep
gradient in the extremes of the variable marginals. The effect of these transforms of the
variables is to exaggerate the kurtosis of the sample : the density of points near the
origin is increased while observations on the perimeter of the sphere are further separated.
The resulting transformed variables have a projection with smaller variance than the
original variables. Notice also, that a good polynomial approximation to the APC function
¹That is, the three smallest eigenvalues of the correlation matrix of (x_1, p_2(x_1), p_3(x_1), ..., x_p, p_2(x_p), p_3(x_p)), where x_i is a sample of 200 observations from N(0,1) and p_j is a j-th degree polynomial (see Chapter 4.2.3).
Figure 5.13: Independent Gaussian Estimates of the Three Smallest APCs
estimates would require polynomials of degree much higher than three. The quantiles given
above, then, are conservative.
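The reference quantiles of the footnote can be re-created, at least approximately, along the following lines (a sketch only; the thesis obtains its initial estimates from polynomial transforms as in Chapter 4.2.3, whereas raw powers are used here for brevity, and the function name is hypothetical).

    import numpy as np

    def null_eigenvalue_quantiles(p=6, n=200, reps=200, probs=(0.01, 0.05, 0.10, 0.50), seed=1):
        # Smallest eigenvalues of the correlation matrix of polynomially expanded,
        # independent Gaussian variables; their empirical quantiles serve as a
        # rough null reference for "no additive structure".
        rng = np.random.default_rng(seed)
        smallest = np.empty((reps, 3))
        for r in range(reps):
            x = rng.standard_normal((n, p))
            expanded = np.column_stack([x, x**2, x**3])   # degree-3 expansion per variable
            corr = np.corrcoef(expanded, rowvar=False)
            smallest[r] = np.sort(np.linalg.eigvalsh(corr))[:3]
        return np.quantile(smallest, probs, axis=0)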
Uniform on the Ball
Four variables, uniformly distributed on B_4, are uncorrelated but have weak additive
structure. The smallest eigenvalue has multiplicity 3; the corresponding APCs are
contrasts of the squared variables, that is, Σ_i c_i g_2(X_i) where Σ_i c_i = 0. Five sets of APC
estimates using 200 points sampled from the Uniform distribution on B_4 are pictured in
Figure 5.14. Again, quantiles of an empirical distribution function of the initial estimates
are given for comparison.
              APC estimate   F̂_200(0.01)   F̂_200(0.05)   F̂_200(0.10)   F̂_200(0.50)
    λ̂(1)      0.730          0.702         0.766         0.788         0.854
    λ̂(2)      0.730          0.696         0.735         0.779         0.846
    λ̂(3)      0.763          0.724         0.776         0.790         0.845
Qualitatively the APC-functions for the Uniform distribution show the same tendency to
be steep in the variable marginals. The gradients are not as extreme as the Gaussian
estimates, however, and are more wiggly in the body of the transform.
5.7 APC Estimation for Distributions with Exact Additive Dependencies
The previous section considered APC estimation in the absence of additive structure,
var Φ(X) ≈ 1; here, we consider the opposite extreme of exact additive singularity,
var Φ(X) ≈ 0. Recall the distributions described in Section 3.7, in which the data lie
exactly on an additive manifold described by step functions of the variables. Estimation
of such discrete dependencies among continuous variables - discrete in the sense that
the transformed variables are discrete valued - cannot be exact, since the APC-function
estimates are constrained to be smooth.
To indicate the behavior of finite sample estimates of these APCs, we generate data
from the following distribution.
Figure 5.14: Uniform on Ball Estimates of the Three Smallest APCs
X_1, X_2 and X_3 lie in diagonally opposite quadrants, defined by the cut points 0, 0, 0. That
is, P(X_1 < 0, X_2 < 0, X_3 < 0) and P(X_1 ≥ 0, X_2 ≥ 0, X_3 ≥ 0) are both non-zero,
and sum to 1. The exact singularities in the data, that is, the APCs with eigenvalues of
0, are defined by φ_i mapping the two sets {X_i < 0} and {X_i ≥ 0} onto different constants
k_i and l_i, so that ‖φ_i‖ = 1. Then for any c such that Σ_i c_i = 0, var Σ_i c_i φ_i(X_i) = 0. The
eigenvalue 0 has multiplicity 2.
A data set of 200 points is simulated from this distribution, where the marginals of the
variables are beta(0.5, 0.25), beta(0.75, 0.75) and beta(0.2, 0.8). Estimates of the two smallest
eigenvalues are 0.026 and 0.031, hence the estimated APCs closely capture the exact
singularities. The function estimates, shown in Figure 5.15, strongly suggest the true step
functions. The jump in the step function is smeared by the smoothing, resulting in a steep
segment just at the cut point between flat mesas on either side.
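The step-function singularity is easy to reproduce numerically. In the sketch below (hypothetical code; the conditional draws are placeholders and do not reproduce the beta marginals quoted above), the three coordinates share a common sign indicator, the standardized two-valued transforms φ_i are constructed, and any contrast with Σ c_i = 0 is seen to have variance essentially zero.

    import numpy as np

    rng = np.random.default_rng(2)
    n = 200
    b = rng.integers(0, 2, size=n)                  # common indicator: which of the two regions
    # each X_i is negative when b == 0 and non-negative when b == 1
    # (arbitrary continuous conditional distributions, for illustration only)
    x = np.where(b[:, None] == 1,
                 rng.uniform(0.0, 1.0, size=(n, 3)),
                 -rng.uniform(0.0, 1.0, size=(n, 3)))

    # two-valued transforms phi_i, standardized to mean 0 and variance 1
    phi = np.where(x < 0, -1.0, 1.0)
    phi = (phi - phi.mean(axis=0)) / phi.std(axis=0)

    c = np.array([1.0, -0.5, -0.5])                 # a contrast, sum(c) = 0
    print(np.var(phi @ c))                          # ~ 0: an exact additive singularity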
5.8 Conclusions
The simulations of this Chapter draw attention to two factors influencing the accuracy
of APC estimation. First, the separation between the eigenvalues, which affects the
performance of the algorithm; and second, the absolute size of the eigenvalue, which
determines the intrinsic variability of the estimation problem.
Separation between eigenvalues
When the eigenvalues are close, we observe slow convergence and lack of uniqueness or
resolution in the estimates.
The rate of convergence of the algorithm is controlled by the ratio

    |α − λ*| / |α − λ'|,

where λ' is the target eigenvalue, λ* is the next smallest eigenvalue to λ', and α is the
spectrum shift (Chapter 4.2.4).
When the ratio is close to one, convergence is slow, so the algorithm's convergence criterion
may be satisfied before convergence has occurred. This ratio is near one if the eigenvalues
are close, or if α is large relative to the difference between λ* and λ'. In the
Figure 5.15: Discrete APC : Estimates of the Two Smallest APCs
former case of close eigenvalues, lack of uniqueness is unavoidable. In the latter case, where
the parameter α is large relative to the eigenvalue separation, we have slow convergence
that is a troublesome artifact of the estimation technique itself.
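The effect of the ratio can be seen in the linear-algebra analogue of the algorithm, power iteration on the shifted operator αI − A (a sketch under that simplification; the thesis algorithm alternates smoothing steps rather than multiplying by an explicit matrix). The error contracts by roughly |α − λ*| / |α − λ'| per sweep, so an unnecessarily large shift slows convergence even when the target eigenvalue is well separated.

    import numpy as np

    def shifted_power_iteration(A, alpha, n_iter=50, seed=3):
        # Power iteration on (alpha*I - A): converges to the eigenvector of the
        # smallest eigenvalue of A, at a rate governed by |alpha - lam*| / |alpha - lam'|.
        rng = np.random.default_rng(seed)
        v = rng.standard_normal(A.shape[0])
        B = alpha * np.eye(A.shape[0]) - A
        for _ in range(n_iter):
            v = B @ v
            v /= np.linalg.norm(v)
        return v, v @ A @ v                     # eigenvector and Rayleigh quotient

    A = np.diag([0.10, 0.15, 0.8, 0.9, 1.0])    # eigenvalues of a toy operator
    for alpha in (1.1, 10.0):                   # a modest shift and an overly large one
        _, lam = shifted_power_iteration(A, alpha)
        print(alpha, lam)                       # the large shift has not yet resolved 0.10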
The phenomenon of "mixing", referred to in several of the simulations, is a consequence
of this artifact. Precisely, "mixing" means that a small fraction of another APC
remains in the current estimate although the convergence criterion has been fulfilled.
When iteration is stopped, the APC-function estimate is φ̂_i = (1 − δ) φ_i^(1) + δ φ_i^(2) instead of
the true φ_i^(1). If the variable loading of φ_i^(1) is large, the contribution from the second APC
will be noticeable only as a consistent bias in the estimation, since φ_i^(1) will dominate.
However, if the true loading is small or zero, the residue φ_i^(2) will contribute a nontrivial
function as the estimated APC-function.
The first specified constraint manifold (SCM-S1) demonstrates a situation where the
dependencies are strong and so accurate estimates are expected; however, the eigenvalues
are close enough that the estimates exhibit a bias due to mixing of the two smallest
APCs. The second specified constraint manifold (SCM-S2) does not suffer from such
inaccuracy: even though the eigenvalues are considerably larger, and thus the dependencies
are weaker, their separation is sufficiently large that mixing does not occur.
This mixing phenomenon explains the large bias sometimes observed in variable
loading estimates whose true values are near zero. In GAU-S2, the third and fourth variables have
zero variable loadings in the smallest APC, although nonzero values in later APCs. The
size of the estimated loadings in the smallest APC is explained by the presence of a small
proportion of a higher APC; further evidence is given by the observed decrease in the
canonical metric. This also explains the systematic bias in the APC-function estimates of
Variables 1 and 2 discussed in the SCM-S1 simulation.
The phenomenon of mixing, since it is caused in part by an inaccurate assessment of
convergence, might be removed by using a more exacting convergence criterion. Convergence
is determined by evaluating the change in the estimated eigenvalue, so the stringency
of the convergence requirement is constrained by the numerical accuracy of the eigenvalue
estimation. Hence, the mixing cannot always be circumvented simply by forcing longer
iteration. An alternative approach, which gives more resolution of the separate
estimations, is to improve the estimate of α. Good initial estimates of α can be obtained from
the starting estimates as described in Section 4.2.4.
Size of the eigenvalues
As eigenvalues approach 1, the variance of their estimates increases, hence APC estimation
is intrinsically less accurate as more components are estimated. This is well known to be
true for linear principal components, where the variance of the usual MLE estimates is
larger for middle eigenvalues than for extreme eigenvalues.
Also, for continuous distributions, as more components are estimated, it becomes
more likely that the APCs will be ill-determined. For estimates to be well-determined the
separation between eigenvalues has to increase as the eigenvalues increase, and yet the
sequence of eigenvalues accumulates at 1, so typically the distance between higher eigen
values decreases. These conflicting tendencies will mean that only the first few APCs of
a continuous distribution can be reliably estimated, and only then when the eigenvalues
of these components are small.
UNI-S illustrates the case where, despite separation between the eigenvalues, the
absolute size of the eigenvalues is such that the corresponding dependencies are not well
determined.
Numerical Accuracy of Estimation
Evaluation of the success of the estimation has two components, which could be loosely
termed the qualitative and the quantitative assessment: quantitative assessment refers
to accuracy of numerical estimates, whereas qualitative to the more subjective graphical
comparisons of function estimates.
Attempting to make generalizations about the accuracy of the numerical quantities -
the eigenvalues and variable loadings - is difficult, since there are no entirely consistent
trends in the bias, standard deviation or RMS deviation.
• Eigenvalues are consistently underestimated. This bias is an inherent property of
the estimation technique, attributable to the degrees of freedom used in the
APC-function estimation - in effect the data is overfitted by the smoother in the
iterative algorithm. Estimation of eigenvalues for the uniform distribution seems to
be more accurate than other distributions.
• Estimation of the variable loading vector, when assessed using the loading metric,
is accurate in all simulations. The variance of the estimates, however, is often large,
and increases with the eigenvalue.
• The canonical metric, except for cases explicitly commented upon, increases with
the number of components, indicating an increase in the variance of estimation as
more components are estimated.
The decrease in precision of estimation for the higher APCs is attributable in part to
the orthogonality constraints: to maintain orthogonality, a Gram-Schmidt step removes
previously estimated APCs, so estimation errors in the lower APCs are propagated
into the new APC.
Accuracy of Function Estimation
Assessment of the accuracy of function estimates is subjective; our main concern is whether
the estimated functions reproduce the global characteristics of the transforms, and hence
retain the sense of the dependency between the variables, rather than whether we have
local accuracy at each observation.
By this criterion, the estimation is impressively accurate for the smallest APC in all
simulations. If the eigenvalues are "well-separated", which depends on both the relative
separation between eigenvalues and the absolute size of the eigenvalue, APC-function
estimates are excellent for the second and third APCs also. APC-function estimates are
most reliable when the dependencies are strong.
The qualitative accuracy of the estimation implies that for data analysis the APC
function estimates can yield reliable information about the relationships between the
variables when the dependencies are strong. However, more care must be taken with
interpretation of the eigenvalues and loadings, whose numerical values may not be
accurate.
Independence and Singularity
The estimates of APCs for data from the independent Gaussian distribution are a cautionary
example: data which have no dependent structure can have an estimated eigenvalue as low
as 0.6! A hallmark of these spurious APCs, which exploit outlying points of the sample,
is APC-functions that are steep in the tails of the variable marginals.
On the other hand, APC estimates for data lying on an additive manifold recover
the additive structure very accurately, even when the manifold has co-dimension greater
than 1. Note, however, that for the example of Section 5.7 the accuracy of estimation is
somewhat implementation dependent. The supersmoother is able to approximate a step
function fairly readily, whereas a more rigid smoothing technique might not recover the
constraint as accurately.
Chapter 6
Applied Additive Principal
Component Analysis
6.1 Introduction
Prior to embarking on the APC analysis of a real data set, we present a detailed
consideration of interpretation techniques and issues. The need to examine methods of
interpretation is perhaps not immediately clear; however, in moving from a simple linear sum to the
more general additive function of the variables, the increase in flexibility is accompanied
by an increase in complexity. While linear relationships between variables have natural ge
ometric frames of reference, which make interpretation a relatively straightforward affair,
these are rarely useful for additive functions. Instead, it is necessary to develop frames
of reference that are meaningful for additive functions, often by simple analogy with the
linear case, and to develop new intuition, using the guidance of exact solutions.
The most important motive behind this discussion, though, is entirely pragmatic -
unless interpretation can be made both simple and comprehensible, we have gained nothing
from the additional flexibility of additive functions. We develop a technique for
interpretation of additive dependencies that utilizes dynamic graphics technology to provide an
elegant method for understanding the analysis.
In the first section we discuss each estimated quantity, its properties and interpretation.
The second section draws on insights gained from the exact solutions to explain some
expected anomalies in the behavior of the sequence of estimated APC. The final sections
illustrate the estimation and interpretation of the APC on some real data sets.
6.2 Interpretation Techniques for Data Analysis
From the sample APC computed by the algorithm we are interested in estimates of the
eigenvalue, variable loading vector, APC and APC-function.
The Eigenvalue
The eigenvalue, var Σ_i φ_i(X_i), is bounded between 0 and 1. Exact additive dependence
results in a zero eigenvalue, whereas an eigenvalue of 1 indicates the transformed variables
are uncorrelated. The eigenvalue measures the overall strength of the dependency. If the
equation is considered as a linear constraint in the transformed variables, the eigenvalue
gives the variance of the data around the linear manifold it defines.
Recall from Chapter 2.4, that the eigenvalues of an infinite sequence of APC tend to 1,
and every eigenvalue distinct from 1 has finite multiplicity. As the eigenvalues approach
one, the relative separation between the eigenvalues will usually decrease. This implies the
higher APC estimates of continuous variables are likely to become increasingly unstable.
Add to this the cumulative errors of estimation and it is unlikely that estimates of the
fourth and fifth APC will be reliable, particularly if they have large eigenvalues.
Possibly the most important purpose of the eigenvalue is for detecting when the eigen
values are not distinct. Since none of the following discussion on interpretation is relevant
unless the eigenfunctions are unique, it is essential to first verify that the eigenvalue of the
next APC is well separated from the present eigenvalue. If it is not, other than recognizing
there is additive dependence which involves those variables with nonzero transforms, no
meaningful interpretation can occur without additional information.
The APC
The APC, Φ(X) = Σ_i φ_i(X_i), is like a residual function, since it represents the departure
from the manifold defined by the constraint Σ_i φ_i(X_i) = 0. Hence the estimate, Φ̂(X) =
Σ_i φ̂_i(X_i), can be interpreted as a residual vector, implying that features that are of
interest in regression residual analysis are also informative here.
Outlying points in the residual (estimated APC) indicate observations that do not lie
near the manifold, and are thus unlike the rest of the data set.
The distribution of the residual is informative for eigenvalue interpretation; the vari
ance estimate of the APC could have been inflated by outliers, or be misleading due to
extreme skewness or kurtosis. The spread of points may also reveal something of the
type of dependency detected: when the structure is caused by clusters within the data,
typically the APC will have small groups of outlying values, whereas when continuously
dependent on the variables, more symmetric patterns will be apparent.
It is informative to examine scatterplots of residuals belonging to different eigenvalues,
that is, Σ_i φ̂_i^(k) versus Σ_i φ̂_i^(k'). This is analogous to the plot of the projection of the
data onto the principal component coordinate planes, commonly made in linear principal
component analysis. Usually plots corresponding to the largest eigenvalues are made -
as these projections have the largest variance they are often informative low-dimensional
plots. However, we are estimating the smallest APCs, hence plotting Σ_i φ̂_i^(1) versus Σ_i φ̂_i^(2)
is analogous to the least informative projection of the data, that is, the projection with the
smallest variance. The APC are linearly uncorrelated, as is also true for the linear case.
Outliers in the scatterplot will indicate points not lying near the (p - 2)-dimensional man
ifold defined by the corresponding constraints. This plot may detect points not unusual
in either of the individual residuals.
An important difference between the additive and linear plots, is that the residual plot
in the additive case cannot be considered to be a projection of the original data, or even of
the transformed data. Two different sets of transforms are involved, so there is no simple
relationship between the plot and the original data. Formally, the random variables Σ_i φ_i^(1)
and Σ_i φ_i^(2) minimize var(Σ_i φ_i) + var(Σ_i φ'_i) subject to ‖Φ‖ = ‖Φ'‖ = 1 and Φ ⊥ Φ'. With the
additional constraint that var Σ_i φ_i^(1) is minimized, this is a joint characterization of the
two smallest APC.
The Variable Loadings
By the variable loadings of an APC, we mean the standard deviations of the transformed
variables, s_i = σ(φ_i(X_i)). As explained in Chapter 5.2, these are analogous to the variable
loadings of linear principal component analysis. As Σ_i var φ_i(X_i) = 1, the loadings are
constrained to lie on the unit sphere, Σ_i s_i² = 1. The loadings quantify the relative contribution
from each variable, and hence the extent to which each variable is involved in the
dependency.
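In terms of estimated quantities, the loadings are simply the column standard deviations of the transformed data, rescaled so their squares sum to one. A small illustration (the array phi_hat is a hypothetical stand-in for estimated APC-functions):

    import numpy as np

    # phi_hat: estimated APC-functions phi_i(x_i), arranged column-wise (toy stand-in here)
    phi_hat = np.random.default_rng(0).standard_normal((200, 6))

    loadings = phi_hat.std(axis=0)
    loadings = loadings / np.sqrt(np.sum(loadings**2))   # squared loadings now sum to one
    scale = np.sqrt(np.sum(phi_hat.var(axis=0)))
    eigenvalue = np.var(phi_hat.sum(axis=1) / scale)      # var of the sum after normalization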
The Function Estimates
The estimate of each variable's APC-function, φ̂_i, no longer has an analogy in linear
principal components. For linear relationships the signs and magnitudes given by the
loadings are sufficient to describe the dependencies completely. For additive dependencies,
interpretation is often more complex and hence requires more sophisticated techniques.
We discuss several approaches to the interpretation of the function estimates:
• A Linear Technique : The smallest APC has a characterization as the smallest
linear principal component of the transformed data, as explained in Section 2.6.
Interpretation techniques of linear analysis can be validly applied to the standardized,
transformed variables. The obvious disadvantage of this approach is that the
transforms do not always define a meaningful scale. Another disadvantage is that
the extension to further APCs is awkward. In linear analysis, the orthogonality of
the components is a geometric constraint with a natural interpretation. For additive
analysis, orthogonality only implies the APCs are uncorrelated, which has no
geometric analogy. We can state that the second smallest APC is the smallest linear
principal component of ‖Φ^(2)‖^{-1} Φ^(2) that is uncorrelated with Φ^(1), but since
Φ^(1) and Φ^(2) are usually different sets of transforms, this statement is not terribly
enlightening.
• A Regression Technique : For APCs with small eigenvalues, the data come close
to satisfying the implicit equation Σ_i φ_i(X_i) ≈ 0. Usually the variable with
the largest loading, say X_m, is the most important variable in determining the
constraint. Even small changes in φ_m(X_m) will demand modification of all other
variable transforms, if the constraint is to hold. Conversely, a change in any of the
other variable transforms is likely to be reflected in φ_m(X_m). Informally, then, it
seems reasonable to solve the implicit equation for φ_m(X_m) and write

    φ_m(X_m) ≈ − Σ_{j≠m} φ_j(X_j).

Then, regarding this as an ACE regression model, we can use interpretive techniques
of regression analysis.
A more formal justification for choosing the variable with the largest loading for the
role of dependent variable in the regression equation is as follows. From the implicit
equation of the APC, there are p possible regression models, one for each variable.
The regression model − Σ_{j≠m} φ_j(X_j) for the dependent variable φ_m(X_m) has
R² = 1 − λ s_m^{-2}, where λ = var Σ_j φ_j(X_j) and s_m is the m-th
variable loading. This is maximized by the largest variable loading, hence the regression
with the transformed variable with the largest loading as the response is the best
in terms of R².
The appeal of this interpretation is the familiarity of the regression framework and its
extensive analysis tools - however it is only an approximation to the true symmetric
dependency of the APC. The relationship between the ACE regression solution and
the APC solution has been discussed in Chapter 2.7. Recall that the two are identical
if the eigenvalue is zero, hence a regression interpretation is most appropriate when
the eigenvalue is small.
When a regression interpretation of the APC is plausible, we could regard the APC
solution as an alternative to ACE regression, with the possible advantage that APC
treats the variables in a symmetric fashion. The APC equation allows the data to
suggest which variables show strong dependence and hence make reasonable candi-
dates for the role of a response when rewriting the implicit equation in a regression
form. This may be appropriate if no variable has been designated as the response.
• The Brushing Technique : The ensuing technique is the most accurate and
simplest way to interpret an APC estimate. Explanation is most effectively achieved
by way of illustration; a small programming sketch of the idea is given at the end of this section.
A two dimensional additive manifold in three dimensions is described by the constraint

    φ_1(X_1) + φ_2(X_2) + φ_3(X_3) = 0.                                        (6.1)
This constraint implies that when φ_1 is large and positive, φ_2 + φ_3 must be large and
negative. If φ_1, φ_2, φ_3 were linear we could deduce from the signs and magnitudes
of the loadings how the variables enter into the dependency. However, in general,
it cannot be inferred from the functions themselves how φ_2 and φ_3 individually
contribute to the constraint. To discover the interaction of the transformed variables
in APC analysis, a different approach is needed. One simple and effective approach
is to combine the graphical techniques of brushing and connecting plots.
Brushing is the interactive technique whereby points on a graphical display are
selected by means of an interaction device, such as a mouse or cursor. Selected
observations are marked or highlighted by immediate change of color or plotting
symbol, enabling interactive identification of observations of interest on the plot.
Connecting plots is a technique used where two or more plots display different variables
from the same data set, hence all plots depict the same observations. If the
plots are connected, brushing on any one of the plots causes the corresponding observations
to be highlighted in all the connected plots.
The data we are going to look at are a sample of 200 points from the manifold
described by an equation of the form (6.1). The APC algorithm is used to calculate
estimates φ̂_1, φ̂_2, φ̂_3 of the defining functions, for which var(φ̂_1 + φ̂_2 + φ̂_3) = 0.0038.
For each variable we plot φ̂_i(X_i) versus X_i, called the APC-function plots, and then
connect the three plots.
Figure 6.1: Interpretation Example : The APC-function plots, brushing on the APC-function of Variable 1
The estimated manifold is defined by φ̂_1(X_1) + φ̂_2(X_2) + φ̂_3(X_3) = 0. When the
observations with high values of the transformed variable φ̂_1 are brushed, corresponding
observations in the other plots are highlighted, and the values of φ̂_2 and φ̂_3 show exactly
how these transforms fulfill the constraint. In Figure 6.1, the three APC-function plots
are shown in the top row, with brushing occurring in the APC-function plot of Variable 1.
High values of φ̂_1 correspond with low values of both φ̂_2 and φ̂_3.
The real value of these plots, however, is that they enable direct interpretation of
the constraint in terms of the original, untransformed variables. In Figure 6.1, the
projection of the highlighted points onto the X-coordinate in each APC-function plot
Figure 6.2: Interpretation Example : The APC-function plots, brushing on the APC-function of Variable 2
shows that high X_1 values correspond with center values of X_2, and with the highest
and lower middle values of X_3. Highlighting the highest points of φ̂_2, in Figure 6.2,
shows that extreme values of X_2 (both high and low) are associated with low values of
X_1, and middle values of X_3.
This technique can be extended to make use of residual information from the APC,
r = φ̂_1 + φ̂_2 + φ̂_3. A fourth plot, a histogram of r, is connected to the variable plots,
as in Figures 6.1 and 6.2. The residual histogram is the leftmost plot of the bottom row;
the two remaining plots show a grid of the estimated manifold and the data cloud,
showing the data configuration in variable space.
Figure 6.3: Interpretation Example : The Added Variable plots and APC residual plot.
The connected residual plot can be used for two different diagnostic purposes.
First, the observations outlying in the residual can be identified in all APC-function
plots. The plots Σ_{j≠i} φ̂_j(X_j) versus X_i, which are akin to added variable plots for
each variable smooth, can also be connected to the residual plot, and from these
the discrepancy can be diagnosed. In Figure 6.3, the three added variable plots are
connected to the APC residual plot. The observation with the large positive residual is
very close to the lowest point of the marginal distribution of all three variables; however,
low values of X_1 and X_2 are not associated with low values of X_3 in the estimated manifold.
This observation does not lie close to the estimated manifold, hence appears as an outlier.
Figure 6.4: Interpretation Example: The APC-function plots, brushing on residual plot
Second, the residual plot is used to aid interpretation of the dependency relationship.
Observations which have large residuals do not lie near the manifold, hence when
brushing the APC-function plots, observations which have large residuals should not
be highlighted because they do not reflect the dependency detected by the constraint.
The residual plot can be used to "downlight" observations with large residuals before
using the APC-function plots to interpret the relationship. Compare Figure 6.4, in
which the observations with large residuals have been downlighted, with Figure 6.2.
The correspondence between the transformed variables is now sharper, since the
observations not close to the manifold are no longer highlighted.
The graphical devices used for this last brushing technique provide very powerful
interpretive tools, since we can then view the dependency simultaneously in the transformed
space, the variable space and the residual space. The multiple views of the APC
encourage a very detailed acquaintance with the structure implied by the APC-functions,
and the observations not obeying this structure. The transparency of the brushing
interpretation makes assimilation of the information in the APC estimates both feasible and
comprehensible.
The power of this technique increases the usefulness of APC analysis dramatically.
The translation of the estimated constraint back to relationships between the variables
themselves would be extremely complex with only classical methodology. Brushing on
connected APC plots provides an accurate and elegant method for interpretation.
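The sketch promised in the description of the brushing technique is given below (Python with matplotlib, which the thesis does not use; the arrays x and phi and the brushing rule are illustrative assumptions). A shared boolean mask plays the role of the brush, so the same observations are highlighted in every connected APC-function plot, and the mask can also downlight observations with large APC residuals.

    import numpy as np
    import matplotlib.pyplot as plt

    def connected_apc_plots(x, phi, brush):
        # x: (n, p) raw variables; phi: (n, p) estimated APC-functions phi_i(x_i);
        # brush: shared boolean mask of length n (the "connected" selection)
        n, p = x.shape
        fig, axes = plt.subplots(1, p, figsize=(3 * p, 3), sharey=True)
        for i, ax in enumerate(np.atleast_1d(axes)):
            ax.scatter(x[:, i], phi[:, i], s=8, color="lightgray")
            ax.scatter(x[brush, i], phi[brush, i], s=16, color="black")  # highlighted points
            ax.set_title("Variable %d" % (i + 1))
        return fig

    # example brush: the observations with the highest values of the first transform,
    # after downlighting observations with large residuals
    # residual = phi.sum(axis=1)
    # brush = (phi[:, 0] > np.quantile(phi[:, 0], 0.9)) & (np.abs(residual) < 2 * residual.std())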
6.3 Guidelines for Detecting Real Structure
The relationship between the orthogonal decomposition of the operator P and the APC
explained in Chapter 2.4 is an elegant theoretical result. Yet the practical implications of
this result for APC estimation are rather strange.
For continuous variables, there is an infinite sequence of transformations of the data,
each of which describes an independent relationship between the variables. From a data
analysis point of view, this is clearly nonsensical. At some stage there must either be
some redundancy in the representation, or component estimates where idiosyncrasies of
the sample are being exploited to create spurious relationships. Recognizing either case
would signal when the APC have ceased to provide practically relevant information, or
equivalently when the eigenvalue of the APC is "too large". There is some theoretical
precedent for both these possibilities that may help determine, in an applied setting,
when this may have occurred.
6.3.1 Judging Redundancy
Redundancy in representation is a phenomenon that occurs in both the Uniform and Gaussian
distribution cases of the theory and simulation chapters. By the term redundant
we mean any additive constraint that does not reveal new structure in the data, given the
smaller APCs. From an applied viewpoint, any redundant APC can be discarded.
For the Gaussian distribution, the correlation matrix completely describes the variable
dependencies. The linear principal components are sufficient for the correlation matrix,
hence the only relevant dependencies are linear. Thus the APC based on higher order
polynomials are redundant. In fact, for the Gaussian distribution, since the APCs of k-th
order polynomials are sufficient for R^k, we would hypothesize that for any non-linear APC
there is a linear relationship between the variables, based on the same dependencies, with
a smaller variance.
For the Uniform distribution on an ellipsoid, the variable dependencies are described
by the linear and quadratic dependencies (these completely specify the orientation and
shape of the distribution ellipsoid). The higher order polynomials are thus redundant.
Another important case of redundancy occurs when only two variables are involved in
a strong dependency, φ_1(X_1) + φ_2(X_2) ≈ 0. Then, for any f, f(φ_1(X_1)) + f(φ_2(X_2)) ≈ 0.
If (f(φ_1), f(φ_2)) ⊥ (φ_1, φ_2), the resulting APC is redundant.
The Gaussian and Uniform distributions can thus be used to characterize some forms
of redundancy. Empirically, it has been observed that the APC transformed variables
of a data set are usually more symmetric than the untransformed data. This is not
surprising, in view of the fact that if a transformation to multivariate Gaussianity, or to
any Gegenbauer distribution exists, a linear sum of those transformed variables is an APC
solution. Hence, at least when the transformed variables have continuous range, it might
be expected that the APC-functions will tend to symmetrize the variables.
Now, suppose the smallest APC transformed variables are φ_1(X_1), φ_2(X_2), ..., φ_p(X_p).
If there is no other independent constraint in the data, and if the "variables"
φ_1(X_1), φ_2(X_2), ..., φ_p(X_p) are approximately Gaussian, the second smallest APC of X is
either the second linear principal component, or the smallest quadratic principal component,
of these transformed variables. In either case, the second APC is redundant, since
it reiterates the dependencies of the smallest APC. The same behavior would usually be
true if the transformed variables were Gegenbauer. Thus when the transformed variables
are symmetric, it seems plausible that this polynomial redundancy may exist.
There are several indicators of redundant representation which may be valuable when
trying to discern redundancy in APC estimation. When an APC is redundant, it must
involve the same variables as a smaller APC. The new transforms will be derived from the
transforms of the smaller APC : either close to identical, or squared versions of the smaller
APC-functions. The latter case suggests, that if none of the transforms are monotonic,
the APC may be redundant. Finally, if the APC is redundant, by definition it cannot
reveal new insight into the dependencies between the variables, hence its interpretation
will reveal dependencies already discovered.
6.3.2 Assessing Spuriousness
The transformations that exploit idiosyncrasies of the sample are usually easy to detect.
They typically approximate an extreme case of the discrete APC, Section 3.7. Data
spatially separated in at least two variables lie exactly on an additive manifold described
by step functions. Suppose there are a small number of observations in the data which are
extreme in two variables. Then a "spurious" dependency can be formed by considering
these observations as a cluster, and using step function transforms to separate these points
from the body of the data. This is the type of behavior observed in the APC estimates of
the uncorrelated Gaussian and Uniform distributions in Chapter 5.6, where points on the
perimeter of the data cloud were exploited by the algorithm to create additive functions
with lower variance.
In a data analysis, care must be taken before describing a transformation as detecting
spurious structure. If there are clusters in the data distribution, these are dependencies
that may well be important to the data analyst. It may also be useful to detect outlying
points in the joint multivariate distribution for a data sample - it is just that one needs to
be careful not to interpret the outliers as a bona fide dependency existing in the population
at large. If an APC has transforms that are strongly influenced by a small number
of unusual observations and a large eigenvalue, relative to the uncorrelated Gaussian
estimates, then it is probably spurious.
6.4 The Infant Mortality Data
6.4.1 Introduction
In the late 1960's, Ernest J. Sternglass alleged that the radioactive fallout from nuclear
weapons testing significantly raised infant mortality rates [Ste69b], [Ste69a]. The scientific
community, on examination of his evidence, thought the claim inadequately supported:
We cannot support Dr. Sternglass's conclusion and ascribe changing patterns in infant
and fetal mortality to the cause and factors invoked by Dr. Sternglass - certainly not
with the conviction or certainty required by most epidemiologists and statisticians.
It is clear from a review of all of the data that certain gaps in the knowledge of how
environmental levels of Sr-90 may effect the genetic material of individuals still exist
and further studies in this direction are probably warranted. [Mill
However, outside of criticizing his conclusions, no other independent study of the effects
of levels of radioactivity on infant mortality was attempted. This data set was collected
by Fuchs [Fuc79], in an effort to reexamine the issue.
The data set consists of 528 observations taken over a period of 11 years (1960-1970)
on each of the 48 states (excluding Alaska, Hawaii, Washington D.C.). The response
variables considered are postnatal, prenatal and total infant mortality rates (the latter
being the sum of the two previous), measured as the number of deaths per 1000 births in
each state. Covariates are state categories (State), year of observation (Year), per capita
income calculated in 1970 dollars (Income), and percentage of non-white births in the
state, (%Nwbirth). Radioactivity is measured by the levels of Strontium-90 (8r-90) and
Cesium-137 (C8-181) levels in city milk supply. While radiation is measured by both 8r-90
and C8-187 levels; 8r-90, being biologically longer lived, was the variable more favored by
the investigators, and is the variable used here. Details of data sources and aggregation
can be found in Fuchs [Fuc79].
Table 6.1: TIM : The Additive Regression Models
X_i          σ(s_i(X_i)), for each of the five fits
%Nwbirth     0.764     0.978     1.018     1.044     0.754
Income       0.072     0.142     0.029     0.180     0.054
Sr-90        0.127     0.037     0.067     0.066     0.042
State        0.440     0.552     0.546     0.596     0.404
Year         0.431     0.483     0.487     0.487     0.515
R²           0.9301    0.9297    0.9289    0.9303    0.9288
Order        (12345)   (23451)   (34512)   (41253)   (51234)
6.4.2 Additive Regression Models
We fitted an additive regression model for the response variable total infant mortality, with
covariates Year, State, %Nwbirth, Income, Sr-90. The regression model was computed five
times, with the variables presented to the algorithm in different orders, which is simply
equivalent to using different initial values for the transforms.
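For intuition about this order dependence, the following is a minimal backfitting sketch (Python; the smoother and function names are illustrative assumptions and do not reproduce the fitting procedure used for Table 6.1). When there are strong additive dependencies among the predictors the backfitting sweeps converge slowly, and stopping after a fixed number of sweeps can leave order-dependent transform estimates.

    import numpy as np

    def poly_smoother(x, y, degree=3):
        # a crude stand-in for the scatterplot smoother used in additive regression
        return np.polyval(np.polyfit(x, y, degree), x)

    def backfit(y, X, order, n_iter=25, degree=3):
        # Backfitting for an additive model y ~ alpha + sum_j f_j(X_j);
        # `order` is the sequence in which the variable transforms are updated.
        n, p = X.shape
        f = np.zeros((n, p))
        alpha = y.mean()
        for _ in range(n_iter):
            for j in order:
                partial = y - alpha - f.sum(axis=1) + f[:, j]
                f[:, j] = poly_smoother(X[:, j], partial, degree)
                f[:, j] -= f[:, j].mean()        # keep each transform centered
        return alpha, f

Calling backfit(y, X, order=(0, 1, 2, 3, 4)) and again with a rotated order, then comparing the columns of f, mimics the five refits summarized in Table 6.1 (y and X being hypothetical centered data arrays).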
The estimated transforms are shown in Figure 6.5, and the standard deviations of these
transforms are given in Table 6.1, where the response variable has been scaled to variance
1. Each solution provides an almost identical model for Total infant mortality, as seen
by the small changes in R², and yet the models are by no means the same. The primary
predictive variables are State, %Nwbirth and Year; however, both the State quantifications
and the contribution from %Nwbirth vary considerably. The minor variables also show
fluctuation in the transforms.
The differences between these models are not important if considering predictive use
of the model. However, we are intent on examining the marginal effect of Sr-90, which is
unstable, but not necessarily insignificant.
Since we know that instabilities such as these, induced by merely altering the starting
values, can result from additive dependencies among the predictor variables, an additive
principal components analysis could explain why the models differ.
Figure 6.5: TIM: The Additive Regression Models
Table 6.2: TIM-5var : Eigenvalues and Variable Loadings
Components First Second Third
Eigenvalue 0.0067 0.0226 0.0350
Loadings: %Nwbirth 0.667 0.128 0.343
Income 0.166 0.491 0.572
Sr-90 0.010 0.749 0.210
State 0.715 0.279 0.337
Year 0.120 0.321 0.606
6.4.3 APC analyses
APC of the full data set
The several smallest APCs were computed for the five independent variables %Nwbirth,
Income, Sr-90, State and Year. Their eigenvalues and variable loadings are given in Table
6.2.
The smallest principal component detects a strong relationship between %Nwbirth
and the categorical variable State, with an eigenvalue of 0.0067. This simply reflects the
small within state variation of %Nwbirth - since the data cluster tightly within each
state, a low variance APC is obtained by setting state quantifications to the within state
mean of %Nwbirth. The Income variable contributes only where income is low (for
high incomes the transform is constant), and shows that, for those observations, lower income
accompanies lower %Nwbirth (Figure 6.6).
The second smallest APC, shown in Figure 6.7, is primarily a relationship between
Sr-90, Year and Income. State only makes an additive adjustment for the level of Florida
(State 29). The eigenvalue of 0.0226 shows this dependency is also strong.
The third APC has an eigenvalue of 0.0350, which is reasonably close to the second eigenvalue.
Figure 6.6: TIM-5var : APC-functions of the smallest APC
Figure 6.7: TIM-5var : APC-functions of the second APC
Figure 6.8: TIM-5var : APC-functions for the third APC.
All variables contribute significantly in this dependency, yet the variables
group into two distinct relationships, one between Year, Income and Sr-90, and a second
between State and %Nwbirth. This is discovered using the technique of brushing on con
nected APC-function plots: in Figure 6.8, cases with high values of the transformed Year,
have no systematic relationship with transformed values of either State or %Nwbirth, and
the same is true for the transforms of Income, Sr-90.
The last comment leads to consideration of the implication of the near degeneracy
between State and %Nwbirth revealed by the smallest APC. The within state variance of
%Nwbirth is smaller than the between state variance, hence transforming state scores to
correspond to state means of %Nwbirth results in a low variance APC, as observed. This
APC involves a strong dependency between only two variables, so is a likely candidate for
generating redundant APC. The variable State is categorical, so has enormous flexibility
in its transformation (47 df). In fact any sequence of orthogonal transformations of the
continuous variable %Nwbirth can be matched by appropriately orthogonal assignment of
state scores, producing a sequence of redundant APC with low variance. With both State
and %Nwbirth in the analysis, it will be difficult to assess dependencies between other
variables, because there is a high density of small eigenvalues from the redundant APC.
In the simulated examples small separation between eigenvalues was seen to cause mixing
in the estimated eigenfunctions. In this data set, the third APC shows evidence of mixing
from a redundant APC, since the transforms of State and %Nwbirth are not related to the
transforms of Year, Income and Sr-90.
The main factor causing these low eigenvalues and contributing to the observed be
havior is the large number of categories in State, so it will be informative to estimate the
APC of the four continuous variables only. This is a simple way to avoid the masking
effects of mixing of the numerous small APCs of the 5 variable data set, enabling us to
view cleanly the separate dependencies existing between the continuous variables.
APC without State categories
The smallest APC of the data set without the state categories involves Year and Sr-90,
and achieves a variance of 0.080. The transforms of Figure 6.9 reflect the change
in radioactivity levels over the 11 years of the data - low in the beginning and ending
periods, and peaking in 1965-66 during the period of heaviest nuclear testing. Eigenvalues
and variable loadings are given in Table 6.3.
The second APC involves all four variables, but most strongly Income and Year. The
transforms shown in Figure 6.10 can be interpreted as depicting the relative change in
wealth during these 11 years, where incomes were higher in the middle years than the
early and late periods. The variance achieved is 0.169.
The third APC has a much higher variance of 0.287. From the transforms in Figure
6.11 it is suspected that this APC is redundant, as none of the transforms are monotonic,
and the strong variables are Sr-90, Income and Year.
Table 6.3: TIM-4var : Eigenvalues and Variable Loadings
Components            First    Second   Third    Smallest Linear   Second Linear
Eigenvalue            0.0805   0.1689   0.2866    0.201             0.860
Loadings: %Nwbirth    0.094    0.277    0.030     0.198            -0.418
          Income      0.144    0.777    0.458    -0.700             0.206
          Sr-90       0.757    0.295    0.730     0.685             0.284
          Year        0.630    0.482    0.506     0.038             0.837
.. " • "Figure 6.9: APC-functions for the smallest APC
Figure 6.10: TIM-4var : APC-functions for the second APC
Figure 6.11: TIM-4var : APC-functions for the third APC
From the discussion of Section 6.3.1, we know that if the transformed variables
φ^(1)(Sr-90), φ^(1)(Year) and φ^(2)(Income) are approximately Gaussian, then scaled
Hermite polynomials of the transformed variables are APCs of X. (Note that
φ^(1)(Year) ≈ φ^(2)(Year), so the first two APCs are approximately linear in these
transformed variables.) A quadratic transformation of the APC-functions of
the two smallest APCs describes very closely the transforms observed for the third APC.
Furthermore, interpretation of the transforms does not add any further information to
the dependencies detected using the two smallest APC. We conclude the third APC is
redundant.
The residual plot of Σ_i φ̂_i^(1)(X_i) versus Σ_i φ̂_i^(2)(X_i), shown in Figure 6.12, shows a distinct
group of outlying points in the bottom right corner. These points, which correspond
to the eleven observations of Florida, do not obey the implicit relations of these APCs.
Notice that on neither of the marginal projections would these points be clearly unusual.
Recalling the separation of the transform value for this state in the second APC of the five
variable analysis, it is clearly unusual. Florida is distinguished by its Sr-90 levels, which
are distinctly higher than all other states. In the other three variables it is not unusual.
Recall that the transform of state categories in the five variable analysis adjusted for this
additive difference.
For the sake of comparison, the linear principal components are calculated for the four
variables Year, Sr-90, %Nwbirth, Income. These are given in Table 6.3. The smallest
shows a linear dependence between Sr-90 and Income. This is both difficult to explain
convincingly as a causal dependence and, as seen from the APC analysis, misleading. The
observed linear relationship between Sr-90 and Income is simply a consequence of their
mutual dependency on the same nonlinear transform of year.
Both of the relationships in the first two APC, when attention is restricted to the two
main variables, are simple and have sensible interpretations. Furthermore they will both
be fairly evident from careful scrutiny of simple scatterplots. With the APC analysis we
have the ability to detect much more complex relationships - in the APCs estimated here,
the contributions from variables with smaller loadings, while small, are not negligible, and
with careful study more subtle aspects of the structure can be discerned.
Figure 6.12: TIM-4var : The residuals of the smallest APC. The eleven highlighted observations are of Florida. [Panels: % Nonwhite Births, Income (1970 $), Strontium 90, Year, Residual of 2 vs. 1, Residual of 3 vs. 1.]
Variables of the Boston Housing data (see Section 6.5):

Median Value   Median value of owner occupied housing
Room           Average number of rooms in owner units
Age            Proportion of owner units built prior to 1940
Distance       Weighted distances to five employment centers in the Boston region
Highway        Index of accessibility to radial highways
Tax            Full property tax rate (per $10,000)
Ptratio        Pupil-teacher ratio by town school district
Black          Black proportion of the population
Lstat          Proportion of the population that is lower status
Crime          Crime rate by town
Zone           Proportion of the town's residential land zoned for lots greater than 25,000 square feet
Industry       Proportion of nonretail business acres per town
River          Charles River dummy = 1 if tract bounds the Charles River, 0 otherwise
Noxsq          Nitrogen oxide concentration in ppm
6.4.4 Conclusion
Using the APC analyses, it is possible to explain the instability of the additive regression
model of section 6.4.2. The changes in the relative contributions of State and %Nwbirth,
and the fluctuation in their transforms, are a consequence of the very close correspondence
between these variables, discovered by the smallest APC of all five variables.
The smaller changes in the model amongst the other three variables are explained by
the interdependency between Year, Income and Sr-90, indicated by the smaller APCs of
the four-variable set.
6.5 The Boston Housing Data
In this analysis, we examine the Boston Housing Market data of Harrison and Rubinfeld
[HR78J. These data were used to estimate marginal air pollution effects on the housing
market. A regression model relates the median value of owner occupied homes in each of
the 506 census tracts in the Boston Standard Metropolitan Statistical Area to air pollution
(as measured by the concentration of Nitrogen oxides) and to 12 other variables that are
thought to affect housing prices. The variables of the full data set are briefly described
in the list above; for a fuller description see [HR78].
The housing value equation of Harrison and Rubinfeld is developed using linear re
gression and experimenting with a number of possible variable transformations for all 12
predictor variables. Breiman and Friedman [BF85] present an alternative model, using
the technique of ACE regression to estimate optimal transformations of the data. The
model they report is built on a reduced set of 5 variables: four variables chosen using a
forward stepwise variable selection procedure and Noxsq, for estimation of the marginal
effect of pollution.
Using APC, we will examine the structure of both the reduced and full set of pre
dictors. Dependencies between the variables may have influenced the choice of variables
made by the forward stepwise selection procedure. Alternative variable groupings may be
suggested by a fuller understanding of the variable structure. In the reduced data set it
is naturally of interest to discover whether there are potential problems with the stability
of the transforms estimated by the ACE regression algorithm. In particular, we want to
determine whether instabilities affect the estimation of the marginal effect of Noxsq.
6.5.1 BH-small :The Smaller Boston Housing Dataset
The variables comprising the smaller dataset of this analysis are the 5 variables selected by
Breiman and Friedman: Noxsq, Room, Tax, Ptratio, Lstat. Table 6.4 displays the loadings
and eigenvalues for the three smallest APCs. The smallest APC reveals a dependency
between Noxsq, Tax and Ptratio, achieving a variance (eigenvalue) of 0.04. The Tax and
Ptratio transformations both strongly separate two points from the body of the data. (Bear
in mind that due to smoothness constraints imposed by the estimation, discontinuities can
only be approximated.) The case correspondence between the highest two Tax rates and
the separated Ptratio observations, shown in Figure 6.13, is almost exact. These two
points represent 137 cases (27% of the data); since both Tax and Ptratio are fixed within
each of the 50 census tracts they are somewhat categorical in nature.
The APC also implicates Noxsq: brushing reveals that these cases have high values of
the pollution index Noxsq. The first APC has detected a large cluster in the data set,
determined by the two highest tax rates and two high values of Ptratio. Since this cluster
recurs throughout the
Table 6.4: BH-small : Eigenvalues and Variable Loadings

Components          First    Second   Third
Eigenvalue          0.0430   0.1010   0.2177
Loadings: Noxsq     0.4143   0.7000   0.1560
          Tax       0.0282   0.0260   0.6297
          Ptratio   0.7601   0.2767   0.0857
          Lstat     0.4977   0.6275   0.2344
          Roomsq    0.0468   0.1974   0.7189
Figure 6.13: BH-small : The smallest APC-function plots. [Panels: Nitrogen Oxide, House Tax, Pupil-Teacher Ratio, % Lower Status Houses, (Ave # of Rooms)^2, and the residual of Component 1.]
Figure 6.14: BH-small : The second APC-function plots.
analysis, we shall refer to these 137 cases as the Tax-Ptratio cluster.
The second smallest APC, Figure 6.14, shows a dependency between the same three
variables as the smallest APC, with a variance of 0.101. Furthermore, the transform of
Ptratio suggests that the structure detected is again due to the Tax-Ptratio cluster. The
similarity between the interpretation of this APC, together with the strong resemblance
between the transforms for the variables, indicates this APC is redundant.
The third smallest APC has an eigenvalue of 0.203, indicating a weak relationship be
tween Roomsq and Lstat. The transforms, shown in Figure 6.15, show a smooth increasing
association, that is, houses with a larger number of rooms tend to be in neighborhoods
with a higher proportion of lower status households. The change in the lower values of the
Figure 6.15: BH-small : The third APC-function plots.
Roomsq transform reflects a different trend for houses with the smallest number of rooms;
these do not strongly correspond with the highest values of Lstat.
The residual histogram of the smallest APC has two distinct groups of outliers. The
group of large negative residuals are observations in the Tax-Ptratio cluster which are not
in the highest tax cluster; see Figure 6.16. One cluster of the large positive residuals are
observations in the Tax-Ptratio cluster which have low values of Noxsq. The remaining
large positive residuals reveal that the census tracts with the lowest value of Ptratio have
low pollution.
Figure 6.16: BH-small : The smallest APC outliers.
6.5.2 BH-full: The Full Boston Housing Data Set
The smallest APC of this dataset has a variance of 0.0077, showing a virtually exact
dependence between Tax, Ldistance and Industry. Examination of the estimated functions
shows a large separation of the highest Ldistance value, which is found to correspond
exactly to the separated second highest tax values and to high values of Industry. As with
the smaller data set of the previous section, this APC picks out a strong singularity caused
by observations spatially separated in three variables.
Unfortunately, the estimation of further APC cannot yield information about other
specific dependencies in the data, for the next three eigenvalues are all close to 0.05;
hence the APC are not unique. Table 6.5 presents the estimated loadings for the five
smallest APC. The additive manifold corresponding to the second smallest eigenvalue
has co-dimension 3 and involves the variables Noxsq, Crime, Ldistance, Industry, Tax,
and to a lesser degree, Zone, Age, LHighway, Ptratio, Lstat. The transformations of all
these APC are smooth, describing continuous dependencies between variables, rather than
differentiating a cluster.
Examining the smallest APC, there are two clusters of outlying values. Both these
clusters are found to contradict the overall trends which the smoother detects, i.e., the
highest tax cases are not high in Ldistance and some high Industry values have the lowest
Tax values.
6.5.3 Conclusions
The APC provide insight into the structure of the data used in estimating the housing
equation. The strongest singularities are caused by clusters of observations in the data.
The APC analysis calls attention to this important characteristic of the data which might
easily be overlooked, since clusters of tied values are concealed in simple scatterplots.
In all the APC detecting spatial separation of observations, notice that the residuals
typically have distinct clusters of outlying points. As the additive dependency is a common
correspondence of values of a large number of observations, any cases at odds with this
correspondence will appear as outliers in the residual plot. This supports the assertion,
Table 6.5: BH-full : Eigenvalues and Variable Loadings

Components            First    Second   Third    Fourth   Fifth
Eigenvalues           0.0077   0.0533   0.0527   0.0471   0.1009
Loadings: Crime       0.0782   0.3981   0.6622   0.1958   0.3383
          Zone        0.0363   0.1188   0.1319   0.0948   0.0676
          Indus       0.1507   0.1156   0.2176   0.6999   0.4560
          River       0.0056   0.0142   0.0118   0.0186   0.0304
          Noxsq       0.0509   0.6442   0.5566   0.1177   0.1417
          Roomsq      0.0076   0.0385   0.0964   0.0471   0.1608
          Age         0.0157   0.0185   0.1252   0.1823   0.4595
          Ldistance   0.0368   0.4927   0.1507   0.2684   0.5301
          LHighway    0.7639   0.1581   0.2056   0.2638   0.1392
          Tax         0.6164   0.3193   0.2366   0.4671   0.1361
          Ptratio     0.0430   0.1427   0.1396   0.1853   0.1520
          %Blacksq    0.0150   0.0201   0.0497   0.0154   0.0593
          Lstat       0.0048   0.0812   0.1374   0.1370   0.2636
made in Section 6.2, that residual structure from discrete dependencies is likely to have
distinct groups of points that do not lie close to the manifold.
We return to the housing equation estimated by Breiman and Friedman, involving
only the five variables of the smaller data set. Since the investigation hopes to determine
the effect of pollution on the prices people are prepared to pay for housing, the insight
provided by the smallest APC is valuable.
From this APC, we see that many of the high pollution index cases belong to the large
Tax-Ptratio cluster. This suggests that if there is a predictor or indicator of housing value
which has been excluded from the analysis that could specifically adjust for this group,
the marginal effect of Noxsq could change considerably.
Examining the 7 excluded variables, we found that the largest value of the Highway
index variable (LHighway) - indicating greatest accessibility to radial highways - corresponded
exactly to the Tax-Ptratio cluster. Adding an indicator variable for greatest
accessibility to radial highways (Highway), the R2 of the ACE regression model increased
by 0.003, and the standard deviation of the Noxsq transformation increased from 0.172 to
0.232. The transforms for the two regression models are displayed in Figure 6.17.
By adjusting for the presence of the cluster, there appears to be stronger evidence in
support of a decrease in housing prices for areas with high pollution.
6.6 A Diagnostic for Additive Regression Transform
Stability
APC estimates can provide a diagnostic for instability of variable transformations of ACE
and additive regression models, when the instability is due to additive dependencies among
the predictors. The idea for the diagnostic comes from observing that the two models
$Y \approx \sum_i \theta_i(X_i)$ and $Y \approx \sum_i (\theta_i + \phi_i)(X_i)$ will have almost identical fitted values when
$\sum_i \phi_i(X_i)$ has a very small variance. This leads us to consider alternative models obtained
by perturbing the optimal regression model by adding a fraction of the smallest APC,
which we know has the smallest variance among all additive functions of the data. The
changes in the regression transforms affect the residual sum of squares of the regression
model minimally.

Figure 6.17: BH-small : The ACE regression models; Model 1 = 5 variables, Model 2 = Model 1 plus the Highway variable. [Panels: Median Housing Value (Rsq(1) 0.832, Rsq(2) 0.835), Roomsq (sd(1) 0.283, sd(2) 0.272), Lstat (sd(1) 0.561, sd(2) 0.554), Ptratio (sd(1) 0.189, sd(2) 0.198), Tax (sd(1) 0.111, sd(2) 0.098), Highway access (sd(2) 0.131).]
6.6.1 Perturbing the Optimal Model
Suppose we have an optimal additive regression model for a response variable $Y$, using $p$
predictor variables $X_1, X_2, \ldots, X_p$:
$$ Y \approx \theta(X) = \sum_i \theta_i(X_i). \qquad (6.2) $$
The residual sum of squares of the optimal regression is
$$ \sigma^2 = E\,(Y - \theta(X))^2. $$
For the set of predictor variables, denote the smallest APC, as usual, by
$$ \phi(X) = \sum_i \phi_i(X_i) \quad \text{with} \quad \sum_i \mathrm{var}\,\phi_i(X_i) = 1. $$
Its variance is $\mathrm{var}\,\phi(X) = \zeta^2$.
The optimal model is perturbed using the smallest additive principal component, so
that for some fixed $\alpha$, we have the alternative model:
$$ Y \approx \sum_i (\theta_i + \alpha\phi_i)(X_i). \qquad (6.3) $$
This model increases the residual sum of squares of the fit only minimally, since the
residuals from the additive regression are orthogonal to $H^+(X)$ - the additive equivalent
of the familiar property of orthogonality between residuals and fitted values of linear
regression. This orthogonality follows from the projection characterization of the additive
model: since $P^X Y = \theta(X)$, for any $\psi(X) = \sum_i \psi_i(X_i) \in H^+(X)$,
$$ \mathrm{cov}\,(Y - \theta(X), \psi(X)) = \mathrm{cov}\,(P^X(Y - \theta(X)), \psi(X)) = \mathrm{cov}\,(\theta(X) - \theta(X), \psi(X)) = 0. $$
Hence the increase in residual sum of squares of any alternative model of the form
$$ Y \approx \sum_i (\theta_i + \alpha\psi_i)(X_i), \qquad (6.4) $$
for $\|\psi\| = 1$, is:
$$ \mathrm{RSS}(\psi, \alpha) = E\,(Y - \theta - \alpha\psi)^2
   = E\,(Y - \theta)^2 - 2\alpha E\,((Y - \theta)\psi) + \alpha^2 E\,(\psi)^2
   = E\,(Y - \theta)^2 + \alpha^2 E\,(\psi)^2
   = \sigma^2 + \alpha^2 E\,(\psi)^2. $$
The alternative model formed using the smallest APC is the least possible perturbation
of the additive model in the following sense.
Theorem 6.1 Among all alternative models to the additive regression model (6.2) of the
form (6.4), $\mathrm{RSS}(\psi, \alpha)$ is minimized by the smallest APC, that is, $\psi = \phi$, for any $\alpha \neq 0$.
Proof: The minimal change in RSS between the additive regression model and the alternative
model is
$$ \min_{\psi \in H^+,\, \|\psi\|=1} \mathrm{RSS}(\psi, \alpha) - \mathrm{RSS}
   = \min_{\psi \in H^+,\, \|\psi\|=1} \alpha^2 E\,(\psi(X))^2 = \alpha^2 \zeta^2, $$
attained at $\psi = \phi$. For any fixed, non-zero $\alpha$, the smallest APC therefore minimizes the
increase in RSS.
In the alternative models, the sign of $\alpha$ is indeterminate - both positive and negative
values produce the same increase in RSS. However, the models are not identical:
$\theta_i + \alpha\phi_i \neq \theta_i - \alpha\phi_i$.
A diagnostic allowing multi-dimensional alternatives to the additive model can be
constructed using the sequence of smallest APC, on the basis of a corollary to Theorem
6.1.
Corollary 6.1 Consider the k-dimensional alternative model specified by
$$ Y \approx \theta + \alpha_1\psi^{(1)} + \alpha_2\psi^{(2)} + \cdots + \alpha_k\psi^{(k)}, $$
where $\psi^{(i)} \perp \psi^{(j)}$ for $i \neq j$ and $\|\psi^{(i)}\| = 1$, with
$$ \mathrm{RSS}(\psi^{(1)}, \ldots, \psi^{(k)}; \alpha_1, \ldots, \alpha_k)
   = E\,\big(Y - (\theta + \alpha_1\psi^{(1)} + \alpha_2\psi^{(2)} + \cdots + \alpha_k\psi^{(k)})\big)^2. $$
For every $(\alpha_1, \ldots, \alpha_k)$ with $\alpha_k \neq 0$, $\mathrm{RSS}(\psi^{(1)}, \ldots, \psi^{(k)}; \alpha_1, \ldots, \alpha_k)$ is minimized,
subject to the orthogonality and norm constraints, by the k smallest APC.
The proof is a simple extension of the above argument.
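To make the perturbation concrete, the following is a minimal numerical sketch (not the implementation used in this thesis): the data, the fitted transforms theta_i, and the APC transforms phi_i are hypothetical stand-ins for quantities that would in practice come from the additive regression and APC estimation algorithms.

    import numpy as np

    rng = np.random.default_rng(0)
    n = 400
    x1 = rng.normal(size=n)
    x2 = x1 + 0.1 * rng.normal(size=n)           # strong additive dependency between predictors

    # Hypothetical fitted additive regression transforms theta_i(X_i) and response Y.
    theta1, theta2 = x1, 0.5 * x2
    fit = theta1 + theta2
    Y = fit + 0.3 * rng.normal(size=n)
    sigma2 = np.mean((Y - fit) ** 2)             # RSS of the optimal additive fit (per observation)

    # Hypothetical smallest-APC transforms phi_i(X_i): centered, with component variances
    # summing to about one; the variance zeta2 of their sum is small because x2 is nearly x1.
    phi1 = (x1 - x1.mean()) / np.sqrt(2.0)
    phi2 = -(x2 - x2.mean()) / np.sqrt(2.0)
    zeta2 = np.var(phi1 + phi2)

    # Perturbed model theta_i + alpha * phi_i: its RSS grows roughly as sigma2 + alpha^2 * zeta2.
    for alpha in (0.0, 0.5, 1.0):
        rss = np.mean((Y - (fit + alpha * (phi1 + phi2))) ** 2)
        print(f"alpha={alpha:.1f}  RSS={rss:.4f}  sigma2 + alpha^2*zeta2={sigma2 + alpha**2 * zeta2:.4f}")

With estimated (rather than exact) transforms the two printed columns agree only approximately, which is precisely the finite-sample caveat discussed in the next section.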
6.6.2 A Dynamic Diagnostic
In the previous section, the value of a in the alternative model (6.4) is fixed. If we treat a
as a continuous parameter we can construct a continuum of models which move from the
optimal additive regression in the direction of the smallest APC. This suggests a diagnostic
in which changes in the regression transformations occur dynamically as the parameter
$\alpha = \alpha(t)$ varies: for the current value of $\alpha(t)$ the alternative variable transformations,
$\theta_i + \alpha(t)\phi_i$, are displayed. This is easily implemented within the Symbolics Lisp environment,
by using the mouse to control the value of $\alpha$, and continuously updating the regression
transforms to correspond to the current $\alpha$ (see Appendix A). Since computing the new
transforms only involves addition of already computed functions, the updating computation
is easily fast enough that the variable transformations appear to change smoothly with
$\alpha$. Periodically, the new $\mathrm{RSS}(\phi, \alpha(t))$ and $\hat{Y}^{(\mathrm{new})} = P^Y(\theta + \alpha(t)\phi)$ are also recomputed.
These latter two quantities, the new RSS and the new regression model, could in theory
also be computed quickly, since $\mathrm{RSS}(\phi, \alpha(t)) = \sigma^2 + \alpha(t)^2\zeta^2$ and $\hat{Y}^{(\mathrm{new})} = \hat{Y}^{(\mathrm{add})} + \alpha P^Y(\phi)$.
However, in the finite sample version, with smoothers estimating the conditional expectation
operators, the orthogonality between the residuals and the APC will not hold exactly.
Hence the above relations are not necessarily accurate for the estimates, so RSS and $\hat{Y}^{(\mathrm{new})}$
must be calculated explicitly.
This dynamic diagnostic can also be implemented for higher dimensional models. The
two dimensional generalization is easily made, by using the mouse to input values of $\alpha_1$ and $\alpha_2$
from a two dimensional display. For higher dimensions, a more sophisticated "touring"
device, which can be guided interactively in k-space, would enable the k-dimensional
diagnostic to be implemented.
In the above discussion we have restricted attention to additive regression models. The
results are directly applicable to the ACE models, that is, models in which the response
variable is also transformed.
The above diagnostic for a single APC has been implemented on the Symbolics Lisp
Machine. It was applied to the additive regression model for the Infant Mortality data.
Perturbations in the model due to the smallest additive principal component are shown
in Figure 6.18 for $\alpha \in [-1, 1]$. As expected, the transforms of State and %NwBirth are
unstable. The maximal increase in RSS of the model, for this range of $\alpha$, was 0.09. The
diagnostic was then applied again, using the second smallest APC, and $\alpha$ in the range
$[-0.5, 0.5]$. In Figure 6.19, the transforms for Sr-90, Income and Year change considerably,
although the maximal increase in RSS is merely 0.08. From the range of functions shown
for the transform of Sr-90, it is clear that it is not possible, using this model, to determine
from this data set whether the marginal effect of Sr-90 is detrimental.
Figure 6.18: APC Diagnostic for TIM Regression: Smallest APC. [Panels: Total Infant Mortality, Strontium 90, % Nonwhite Births, State, Year.]

Figure 6.19: APC Diagnostic for TIM Regression: Second smallest APC. [Panels: Total Infant Mortality, Strontium 90, % Nonwhite Births, State, Year.]
Chapter 7
Literature Review
7.1 Linear Principal Component Analysis
The applications of analysis based on principal components are diverse, in part because
the principal components themselves have a multitude of different, yet equivalent inter
pretations. The first use of principal components is attributed to Pearson [Pea01], who
posed the problem of finding the best fitting line or plane to a set of points in a higher
dimensional space. The problem arises in a regression context, where there are errors in
the predictor variables. Pearson shows the best fitting hyperplane is the line or plane
minimizing the sum of squares of perpendicular distances to the subspace. For a plane in
three space, this corresponds to the plane orthogonal to the smallest principal component.
Principal component related methods have a long history in the social sciences. Spear
man [Spe04] first examined the structure of sets of correlated variables, such as scores made
by school children in tests of speed and skill in solving arithmetic problems. The question
posed is whether a single underlying factor exists, representing unmeasurable "general
intelligence", that determines the scores a child will achieve. In the 1930's, the prob
lem was extended to allow "general intelligence" to be dependent on several independent
underlying factors.
Thurstone [Thu31] derived the underlying factors by assuming the so-called factor
model. A $k$-factor model is appropriate if
$$ X = \Lambda f + z, $$
where
$X$ are the observed $p$-dimensional data,
$f$ are unobservable $k$-dimensional common factors,
$\Lambda$ is a $p \times k$ matrix of unknown parameters, the factor loadings,
$z$ are unobserved $p$-dimensional unique factors.
The premise underlying the model is that the p observed variables actually lie in a lower
k-dimensional space, as represented by f, the common factors. The unique factor z allows
both for individuals to perform differently on particular tests and for the tests being only
an approximate measure of the underlying factor. A unique factor Zi will be small if the
test is closely related to the factors. The common factors are standardized to variance
1 and all factors are assumed to be uncorrelated. Hence for this model, the covariance
structure of $X$ is
$$ \Sigma = \Lambda \Lambda' + \Psi_z, \qquad (7.1) $$
where $\Psi_z$ is diagonal. Note that as defined by this covariance structure, the factor loadings
$\Lambda$ are not uniquely determined, since any orthogonal rotation of the factors will give the
same model.
Estimating a model entails finding $\hat\Lambda$ and $\hat\Psi$ to closely approximate the ideal:
$$ S \approx \hat\Lambda \hat\Lambda' + \hat\Psi_z. $$
The principal factor solution, proposed by Thurstone, first estimates $\Psi$, then finds the
best rank-$k$ $\Lambda$ approximating $S - \Psi \approx \Lambda\Lambda'$. This is solved by finding the $k$ largest
principal component directions of the matrix $S - \Psi$. Note that if $z$ has zero variance, the
principal components are a solution for the factor model.
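A minimal numerical sketch of the principal factor step just described (synthetic data and a crude uniqueness estimate, both purely illustrative; this is not the estimation method used elsewhere in this thesis):

    import numpy as np

    rng = np.random.default_rng(0)
    X = rng.normal(size=(200, 5))
    X[:, 1] = X[:, 0] + 0.3 * rng.normal(size=200)     # induce some common structure
    S = np.corrcoef(X, rowvar=False)                    # sample correlation matrix
    k = 2

    # Crude initial uniquenesses: 1 - squared multiple correlation of each variable.
    psi = 1.0 / np.diag(np.linalg.inv(S))

    # Principal factor step: eigendecompose the reduced matrix S - diag(psi) and
    # keep the k largest principal component directions as factor loadings.
    vals, vecs = np.linalg.eigh(S - np.diag(psi))
    order = np.argsort(vals)[::-1][:k]
    loadings = vecs[:, order] * np.sqrt(np.clip(vals[order], 0.0, None))
    print(loadings)                                     # estimated p x k loading matrix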
A different criterion for finding underlying factors was used by Hotelling [Hot33], who
approached the problem as one of finding a linear transformation in which the observed data
can be written as a linear combination of a smaller number of independent components
$(v_1, \ldots, v_k)$, $x_i = \sum_j a_{ij} v_j$. His solution is to find the coefficients $a$ defining the best
univariate component representation, by minimizing the loss function
$$ \sum_i \| x_i - a_i v \|^2. \qquad (7.2) $$
The p-vector a gives loadings for each variable. This leads to the first principal component
as the optimal univariate summary. This will be an adequate summary of the data matrix
to the extent that the rows of X are homogeneous - hence this procedure is given the name
homogeneity analysis in the psychometric literature. Higher dimensional summaries of the
data matrix are constructed sequentially, each being the best 1-dimensional component,
constrained to be uncorrelated with all previous components. Hotelling notes that under the
assumption of multivariate normality, the principal component directions are the major
axes of the correlation ellipsoid.
Principal components analysis and factor analysis differ in that principal components
make no assumptions about the form of the covariance matrix from which the data come.
Factor analysis, on the other hand, is based on a well-defined model, and assumes the
covariance matrix of the observations to have the structure (7.1). If these assumptions are
invalid, the model may produce spurious results.
The discovery of the connection between principal components, finding the best k
dimensional linear subspace of the data, and the independently developed singular value
decomposition, led to the realization in Eckart and Young [EY36] for instance, that
for linear components, finding a k-dimensional representation by the sequential method
of Hotelling was equivalent to finding the best k~dimensional linear subspace. The k
dimensional linear manifold closest to the data in the least squares sense is exactly the
manifold defined by the first k 1-dimensional orthogonal components, which are in turn
the first k left singular vectors of the data matrix X. For linear components the solutions
are nested in k, that is, the span of the k-dimensional solution contains the
(k - 1)-dimensional solution.
Underlying the use of principal components in psychometry is the premise that the
data lie in a lower dimensional space. In a more general setting, principal component
analysis has gained wide acceptance as a technique of data summary, in the spirit in which
it was proposed by Hotelling. As already noted, no model or distributional assumptions
are made; the principal components are defined as optimizing some algebraic or geometric
property of the data.
Algebraically, the first $k$ principal components maximize the variance of any $k$-dimensional
projection of the data, or equivalently minimize the loss function
$$ \sum_i \| X_i - F a_i \|^2, \qquad (7.3) $$
where $F$ is $n \times k$ and $a_i$ is the $i$-th row of the $p \times k$ matrix $A$. $F$ then is the best $k$-dimensional
linear representation of the data minimizing over k dimensions simultaneously.
Geometrically interpreted, the k largest principal components define the k-dimensional
linear manifold lying closest to the data. An alternate geometric view is that projecting
the data onto this manifold gives the k dimensional representation that preserves the
configuration of points in the original space to the greatest possible extent. The data can
be exactly represented using all $p$ eigenvectors, and for any dimension $k$ the minimal loss
of information is incurred by using the first $k$ eigenvectors. The eigenvalue associated
with the $i$-th eigenvector, $\lambda_i$, gives the variance of the $i$-th principal component; hence
the ratio $\sum_{i=k+1}^{p} \lambda_i / \sum_{i=1}^{p} \lambda_i$ measures the proportion of total variance lost by using $k$ dimensions
instead of the full $p$. Often, the several smallest eigenvalues are close to zero, and little
information about the joint distribution of the variables is given by the corresponding
components. Thus principal component analysis can be used as an optimal dimension
reduction technique in which the minimal amount of information is lost.
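As a small illustration of this bookkeeping (synthetic data and a hypothetical near-dependency; only a standard eigendecomposition of the correlation matrix is used):

    import numpy as np

    rng = np.random.default_rng(0)
    X = rng.normal(size=(300, 6))
    X[:, 5] = X[:, 0] - X[:, 1] + 0.05 * rng.normal(size=300)     # a near-linear dependency

    lam = np.linalg.eigvalsh(np.corrcoef(X, rowvar=False))[::-1]  # eigenvalues, largest first
    k = 3
    lost = lam[k:].sum() / lam.sum()
    print(f"proportion of total variance lost using k={k} of p={len(lam)} components: {lost:.3f}")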
Applications of principal component analysis are also found in multidimensional scal
ing, errors in variables regression, cluster analysis, size and shape methods, among others.
All these techniques focus on the principal components of the larger eigenvalues. Explicit
use of the smallest linear principal component seems to be rare.
The smallest eigenvalues are relevant when principal components are used as a diagnostic
tool for studying collinearity in regression analysis. The relative variances of the
smallest principal components are examined to determine whether linear relationships
among the predictor variables are present.
The notion of using the smallest principal component as a technique for studying
the interdependency of data has rarely been explicitly utilized, although Gnanadesikan
comments [Gna77],
For purposes of interpretation - detection or specification of constraints on, or redundancy
of, the observed variables - it may often be the relations that define near
constancy (i.e., those specified by the smallest eigenvalues) that are of greatest importance.
Yet strangely enough, Pearson's first proposal was estimation of the constraint implied
by the smallest linear principal component. In this connection, there is a generalization of
Pearson's motivation for using the smallest principal component as an alternative to the
linear regression model, when all the variables are observed with error.
The regression solution is the hyperplane in the union of the $X$ and $Y$ space that
minimizes the distance to the data, measured in the $Y$ direction. Suppose we find the
smallest principal component direction, $a^*$, of the combined matrix $X^* = (Y, X)$. The
hyperplane defined by setting $X^* a^* = 0$, that is, projecting the data orthogonal to the
smallest principal component direction, minimizes the orthogonal distances to the data.
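A sketch of this construction for two variables (synthetic, centered data; the smallest eigenvector of the combined covariance matrix supplies the orthogonal-regression constraint):

    import numpy as np

    rng = np.random.default_rng(0)
    x = rng.normal(size=500)
    y = 2.0 * x + 0.2 * rng.normal(size=500)     # in general both x and y carry measurement error

    Xstar = np.column_stack([y, x])
    Xstar -= Xstar.mean(axis=0)                  # center the combined matrix X* = (Y, X)

    # a*: the principal component direction with the smallest eigenvalue.
    vals, vecs = np.linalg.eigh(np.cov(Xstar, rowvar=False))
    a_star = vecs[:, 0]

    # The fitted hyperplane is {z : z . a* = 0}; with two variables this is the
    # errors-in-variables (orthogonal regression) line y = slope * x.
    slope = -a_star[1] / a_star[0]
    print("orthogonal regression slope:", slope)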
7.2 Nonlinear Generalizations of Principal Component
Analysis
The common concern underlying all the techniques described in this section is that
assuming linear structure is often unrealistic. By allowing nonlinearity, the structure of
the data might be represented more closely. On the other hand, introducing nonlinear
transforms necessarily moves away from the ideal of model simplicity. If linear structure
is appropriate, we would still want a non-linear method to reflect simple linearity. Thus,
a natural requirement for any more general class of models is that they contain the class
of linear models.
The generalizations and extensions to linear principal components that have been
developed can be classified according to their treatment of three issues.
1. The definition of the "nonlinearity" of the manifold.
2. The generalization of one dimensional representations to higher dimensions.
3. Whether the manifold is represented parametrically or determined by a constraint.
7.2.1 The Nonlinearity of the Manifold
In linear principal components, lower dimensional representations of the data are defined
as minimizing over $F_{n \times k}$ and $A_{p \times k}$ the loss function
$$ \sum_i \| X_i - F a_i \|^2, $$
where $a_i$ is the $i$-th row of $A$. There are two ways to extend the linear definition to allow
nonlinear representations of the data.
One is to replace the $k$ linear factors $F$ by a nonlinear function of $k$ parameters. Thus
we define $f(a) = (f_1(a), f_2(a), \ldots, f_p(a))$ as minimizing
$$ \sum_i \| X_i - f_i(a) \|^2. $$
This is the approach taken by Hastie [Has83] in defining principal curves and surfaces. An
appealing property of this parameterization is that the norm is minimized in the original
variable scale. The resultant models thus have appealing geometric properties, but can
be difficult to interpret, particularly in higher dimensional generalizations.
The second approach is to transform the variables, replacing $X$ by $f(X)$, where $f$ is
chosen to have an optimal linear factor representation. We minimize
$$ \sum_i \| f_i(X_i) - F a_i \|^2. $$
A disadvantage of this approach is that the norm is minimized in the space of the trans
formed variables.
The second approach is the one used in APCs; thus we only review generalizations of
linear principal component analysis that introduce nonlinearity through transformation of the
variables.
7.2.2 Optimal Data Transformation Methods
Within the methods that model nonlinearity by transforming the variables, there are
several distinct approaches to the class of transforms, $f$, chosen. The most extensively
studied representations adhere to the class of additive functions, $f(X) = \sum_i f_i(X_i)$. Only
marginal transformations are considered; thus distinct variable spaces are retained in the
transformed representation.
The functions Ii can be restricted to be linear, to belong to some finite dimensional
class of functions, or, as in our case, simply required to have zero mean and finite variance.
Since the additive model plays an important role in the existing literature for nonlinear
modelling, it is discussed further in section 7.3.
Finite Dimensional Distributions
The most extensive treatment of nonlinear principal components with additive transforms
is found in the psychometric literature. Observations are considered to be exclusively
categorical, although the underlying components can be of nominal, ordinal or continuous
measurement type (in the somewhat radically stated view of Gifi [Gif81, p46], "all
data are categorical"!). All distributions are discrete; consequently the space of additive
transformations is finite dimensional.
For discrete distributions, then, the task of finding optimal transformations of the
data is greatly simplified, since it reduces to a finite dimensional problem. Estimating Ii
amounts to estimating an optimal scaling or quantification for a metric representation of
each variable under the restriction imposed by the measurement type.
The numerical representation for a vector of observations on a variable of $k$ categories
can be written using a matrix of dummy variables, $G$. $G$ is a matrix with $k$ columns of
0's and 1's; each row has a 1 in the column of the category observed for that individual.
Suppose the $k$ scalings for the variable categories are given by $q$; the corresponding numerical
representation of the vector of observations is $Gq$. By convention, the columns
of $G$ are scaled to norm one; thus the quantification of the $i$-th variable is
$$ f_i = f_i(X_i) = G_i D_i^{-1/2} q_i = H_i q_i, $$
where $D_i = \mathrm{diag}(n_1, n_2, \ldots, n_k)$, with $n_l$ the number of
occurrences of the $l$-th category. Thus, the estimation simply entails finding $q_i$ for given
$H_i$.
Algorithms for estimation of an optimal univariate representation of categorical data
are all based on a two step estimation procedure.
• "Model" estimation: Quantifications of all variables are presumed known, and opti
mal parameters for a linear component representation, using the usual loss function
are found. This amounts to finding the linear principal components of the trans
formed variables, or equivalently, projecting the fixed quantifications onto the model
space.
Explicitly, for fixed $(y_1, \ldots, y_p)$, find $a$ and $f$ minimizing
$$ \sigma(a, f) = \sum_i \| y_i - a_i f \|^2. $$
This is solved by the largest eigenvector of
$$ \Sigma_Y = Q' H' H Q, \quad \text{where } Q = q_1 \oplus \cdots \oplus q_p, \; H = H_1 \oplus \cdots \oplus H_p. $$
• Optimal scaling step : Assume the component representation fixed, hence every
variable is approximated by the common component f of n points. The optimal
quantification for each variable for this fixed f, using the same loss function, is
found by simple regression of the component vector onto the transform space for each
variable. To minimize Ei IlYi - aifll" for each i, simply minimize the i'" term IlY,
aifll" = Ila,f -HIllJllz. The usual least squares solution gives Y; = a,{HiHil-IHlf =
alHif. Measurement restrictions are imposed in the regression step, for instance,
ordinal measurements are preserved by using isotonic regression. The geometric
interpretation of this step is projecting the component onto the transform space for
each variable.
The algorithm alternates between these two steps, computing a restricted optimal solution
at each stage. A standardizing constraint is imposed in one of the steps to avoid collapse
to the trivial zero solution. The algorithm converges, as each of the steps of the
algorithm is a projection onto a closed convex set. Convergence to the globally optimal
solution is not guaranteed; a local minimum may occur. Detailed references and comments
are found in De Leeuw, Young and Takane [dLYT76].
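A schematic sketch of one such alternation for purely nominal variables (synthetic data; the indicator matrices and the regression step follow the description above, the component is centered at each pass to avoid the trivial constant solution, and no measurement restrictions are imposed):

    import numpy as np

    rng = np.random.default_rng(0)
    n, p, k_cats = 200, 3, 4
    X = rng.integers(0, k_cats, size=(n, p))      # n observations on p nominal variables

    # Column-normalized indicator matrices H_i = G_i D_i^{-1/2}.
    H = []
    for i in range(p):
        G = np.zeros((n, k_cats))
        G[np.arange(n), X[:, i]] = 1.0
        counts = np.maximum(G.sum(axis=0), 1.0)   # guard against empty categories
        H.append(G / np.sqrt(counts))

    q = [rng.normal(size=k_cats) for _ in range(p)]   # initial category quantifications
    for _ in range(50):
        # "Model" step: with quantifications fixed, the best single component f and
        # loadings a come from the leading singular triple of the (centered) quantified data.
        Yq = np.column_stack([H[i] @ q[i] for i in range(p)])
        Yq -= Yq.mean(axis=0)
        U, s, Vt = np.linalg.svd(Yq, full_matrices=False)
        f, a = U[:, 0], s[0] * Vt[0]
        # Optimal scaling step: regress the component back onto each variable's
        # category space; since the columns of H_i are orthonormal, q_i = a_i H_i' f.
        q = [a[i] * H[i].T @ f for i in range(p)]

    print("loadings:", a)

An ordinal variable would replace the unconstrained scaling step above with an isotonic regression, and higher solutions would add a Gram-Schmidt step, as described in the text.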
For this class of models there are two distinct methods to extend principal component
analysis to a k-dimensional representation. Multiple correspondence analysis, a method of
analysis for purely nominal variables, was introduced by Benzecri [Ben72]. The analogous
technique allowing other measurement types is known as homogeneity analysis, as in
Young, De Leeuw and Takane [YTdL78]. In both of these, a sequence of 1-dimensional
components is constructed. Each component is required to be optimal according to
the usual univariate loss function, subject to orthogonality with respect to all previous
solutions. Combining the first k solutions gives a k-dimensional nested model.
The $k$-th homogeneity solution is estimated using the same two-step scheme as above;
however, at the model step, the representation is restricted to be orthogonal to the $k-1$
previously found solutions. This is implemented by adding a Gram-Schmidt orthogonal
ization to the model estimation step.
The second approach, known as nonmetric principal components analysis, was first
proposed by Kruskal and Shepherd [KS74]. For any fixed dimension, k, a single transfor
mation of the data is sought. The first k linear principal components of the p transformed
variables are required to have maximal variance over all possible transformations of the
variables. These representations are not usually nested.
Again, nonmetric principal components can be estimated by modifying the two step
algorithm described previously. At the model step, the first k principal components of
the transformed variables are calculated, rather than just the largest. Then in the op
timal scaling step, the simple regression is replaced by a multiple regression of these k
components onto each variable.
Algorithms for these two methods of nonlinear principal component analysis are, respectively,
HOMALS (HOMogeneity analysis by Alternating Least Squares) and PRINCALS
(nonmetric PRINcipal Components by Alternating Least Squares) [Gif81].
One Transformation vs Multiple Transformations
A k-dimensional solution in homogeneity analysis has k different mutually orthogonal
quantifications for each variable, with each solution having smaller variance than all pre
vious solutions. A nonmetric principal component solution gives only one quantification
for any dimension k, and its first k linear principal components have maximal variance for
that dimension.
For linear transformations these two approaches yield the same solution, hence the
dichotomy in the generalization to higher dimensions is only present in the nonlinear case.
In several ways, the dimension definition used in homogeneity analysis (and also in APC,
for continuous random variables) is the more natural one to use.
First, the models are nested, hence the parameter k need not be known. In nonmetric
principal component estimation, the analyst has the unenviable task of trying to guess
the appropriate linear dimension of some unknown transformation of the data.
Second, there is a strong analytical structure underlying the multiple quantification
representation, that parallels the sufficiency of the linear principal component representa
tion.
In linear principal components the sequence of eigenvectors and eigenvalues gives the
unique orthogonal decomposition of the correlation matrix of the data. If the data are
Gaussian, this implies the principal components are sufficient for the correlation matrix.
The following two finite dimensional cases reveal similar analytical properties.
In the case of nominal variables, multiple correspondence analysis amounts to a weight
ed principal component analysis of all the bivariate marginals (the Burt table). Hence the
bivariate dependencies can be completely recovered by using all the components, and
taking the largest k preserves the configuration of the bivariate marginals to the greatest
possible extent.
In homogeneity analysis each set of quantifications is a function estimate in the sum
space
$$ H(X) = H(X_1) \oplus H(X_2) \oplus \cdots \oplus H(X_p) = \{ f(X) : f(X) = \sum_i f_i(X_i) \}. $$
The sequences of quantifications defined by the multiple quantifications approach of
homogeneity analysis provide a complete orthogonal decomposition of the space $H(X)$.
A major objection to the approach of homogeneity analysis is the aspect of "data pro
duction" - where we began with p variables, we now have pk "variables". The nonmetric
principal component approach yields only one set of transformed variables, which has ap
peal because of its apparent simplicity. However, this representation is optimally linear
on the transformed scale. In general, linear interpretation of variables in the transformed
scale may not be meaningful.
De Leeuw [dL82] investigates the similarities and differences between the two forms
of analysis for categorical data. He proves that if the bivariate frequency tables have the
same singular vectors, then the two methods yield identical solutions. This is exactly the
condition of Theorem 3.1 for discrete distributions; hence for the distributions discussed in
Chapter 3, the two different approaches will yield identical k dimensional representations.
Continuous Random Variables
An early suggestion for a method of introducing nonlinearity into principal component
analysis is found in Gnanadesikan [Gna77]. He proposes allowing polynomial transformation
of the data matrix up to degree k; hence the transform space is again finite
dimensional. This is easily implemented by conducting an ordinary principal component
analysis on the augmented data matrix of the original variables plus all their squares and
crossproducts (for degree 2, say). While this strays outside the class of additive models,
it can be restricted to purely additive transforms, in which no interactions between variables
occur, by excluding cross product terms. Furthermore, the technique is made more
useful if the transforms of a single variable are mutually orthonormal, so that the analysis
models the dependencies between different variables, rather than within transformations
of the same variable. With these two restrictions, this technique is exactly equivalent to a
restricted APC, where the APC class of transforms is reduced to degree-k polynomials,
as proven in Proposition 4.1.
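A sketch of this polynomial-restricted analysis (synthetic data with a quadratic dependency; degree-2 powers of each variable are orthonormalized within the variable, no cross products are included, and an ordinary principal component analysis is applied to the augmented matrix):

    import numpy as np

    rng = np.random.default_rng(0)
    n = 300
    x1 = rng.normal(size=n)
    x2 = x1 ** 2 + 0.05 * rng.normal(size=n)     # a quadratic additive dependency
    X = np.column_stack([x1, x2])

    # Augmented matrix: degree-1 and degree-2 terms per variable, orthonormalized
    # within each variable (via QR); no cross products between variables.
    blocks = []
    for j in range(X.shape[1]):
        P = np.column_stack([X[:, j], X[:, j] ** 2])
        Q, _ = np.linalg.qr(P - P.mean(axis=0))
        Q = Q * np.sqrt(n - 1)                   # scale columns to unit sample variance
        blocks.append(Q)
    Z = np.hstack(blocks)

    # The smallest principal component of the augmented matrix corresponds to an
    # additive polynomial combination of the variables with (near) zero variance.
    vals, vecs = np.linalg.eigh(np.cov(Z, rowvar=False))
    print("smallest eigenvalue:", vals[0])
    print("loadings on per-variable polynomial transforms:", vecs[:, 0])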
When nonlinear analysis is considered for continuous variables, we are confronted with
a possibly infinite dimensional space for the transform functions. Provided the spaces
are finite dimensional, the distinction between defining the manifold via a
constraint or parametrically, though it still exists, is not really important, since it simply
amounts to choosing the maximal or minimal eigenvalues of a finite dimensional
eigendecomposition. In infinite dimensions, since complete enumeration is impossible, one or
other approach must be taken, depending on the rationale behind the analysis. In our
estimation of the smallest APC, we use the constraint estimation approach because we are
interested in exploring the interdependencies of the data. Koyak [Koy85] proposes a k-factor
multivariate dimensionality reduction analysis. Clearly, since the intent is dimension
reduction, the appropriate method of finding the best low dimensional representation of the
data is to estimate the manifold directly. Koyak estimates a single transformation for any
fixed k dimensions, and hence generalizes nonmetric principal component analysis. Each
APC in our method estimates a different set of transforms, which follows the approach of
homogeneity analysis.
Another approach to finding singularities in data, that is, additive transformations
of the data with a small variance, is suggested by Fowlkes and Kettenring [FK85]. The
criterion they consider is minimizing the determinant of the correlation matrix of the
transformed data, that is, the product of the eigenvalues. This leads to one set of transformations.
One drawback to their approach is that all variable transforms are forced to
enter equally into the analysis.
There are a few nonlinear generalizations of principal components which have not restricted
the data transformation to be additive. Klein and Garayalde [KG85] propose
a projection pursuit principal component. Their approach is in the context of principal
components analysis as a dimension reduction technique, so they find the transform of the
data f(X) which comes closest to the data matrix X, where f(X) is restricted to be of
the form of a projection pursuit model: a sum of smooth functions of one-dimensional linear
projections of the data.
This definition only makes sense for direct estimation of the manifold - there appears
to be no natural way to generalize this method to a smallest principal component for a
projection pursuit model.
7.3 Additive Models
The additive model, as defined by Hastie and Tibshirani [HT86], has been the focus of
much attention in the recent efforts to move away from the restrictions of parametric
models and distributional assumptions. It is widely assumed that the additive model,
since more flexible than linear models, is therefore adequate. While this is clearly not
always true, the reasons for using the additive model are persuasive.
Stone [Sto85] writes:
Three fundamental aspects of statistical models are flexibility, dimensionality and
interpretability. Flexibility is the ability of the model to provide accurate fits in a wide
variety of situations; inaccuracy here leads to bias in estimation. Dimensionality can
be thought of in terms of the variance in estimation, the "curse of dimensionality"
being that the amount of the data required to avoid an unacceptably large variance
increases rapidly with increasing dimensionality. In practice there is an inevitable
trade off between flexibility and dimensionality or, as usually put, between bias and
variance. Interpretability lies in the potential for shedding light on the underlying
structure.
Classical linear and parametric models in general, are relatively easy to interpret and
to estimate. Historically this is the sole reason for their preeminence. The disadvantage
of the classical methods is an inability to adapt in situations where the assumed structure
is inappropriate - scenarios in which both the bias and the variance of estimation will be
large.
More general models, such as projection pursuit for instance, or models that include
simple interactions between variables allow far more complex representations. Conse
quently they can require large amounts of data for reliable estimation: sparseness is often
a problem when modelling interaction terms, or using multivariate smoothers. Sometimes
such flexible models are simply too complex for the intended application.
Additive models fall nicely in the middle ground between these two alternatives. Since
there are no interaction terms, we retain the desirable elegance of interpretation for additivity:
if $x_1$ is changed to $x_1'$, and all other variables remain constant, the effect on the fitted
value can be measured as a function of the difference $f_1(x_1') - f_1(x_1)$; so only the bivariate relations
need be considered. Dimensionality problems are avoided because the additive structure
permits successive, rather than simultaneous, estimation of the functions, as embodied
in the alternating conditional expectation algorithms. Finally the models are reasonably
flexible. Even if $f$ is not genuinely additive, an additive approximation to $f$ may capture
the structure sufficiently for a given application, and has the advantage of being easily
interpretable.
The additive model will reproduce linear structure where linearity holds, and can easily
be extended to include known interaction terms if desired, by adding "new" variables that
are formed from products of the original variables.
Chapter 8
Conclusion
Our primary aim is to present a viable data analysis method for understanding the additive
structure of multivariate data.
The additive structure is described by the additive principal component, defined as the
additive function of the data with smallest variance. The APCs are a natural generalization
of linear principal components, and have a characterization as a sequence of eigenfunctions
belonging to the smallest eigenvalues of P.
Estimates of the APC can be calculated using a simple iterative algorithm, which is
an implementation of the power method for estimating eigenfunctions. The estimates are
accurate when the eigenvalues are small and well separated, or equivalently, when the
observations lie near an additive manifold.
Interpretation of the dependencies implied by additive equations with small variance
is made practicable by the interactive technique of brushing on connected APC-function
plots. The dependencies embodied in the APC are then easily expressed in terms of the
original variables. Observations with large residuals can be located by using the residual
plot of the APC. Through the power and simplicity of this technique of interpretation,
the APC becomes a viable data analysis tool.
The theoretical precedents of the Gaussian and Gegenbauer distributions provide guidance
for detecting APC that are redundant or spurious. Recognition of either of these
cases is a step toward answering the fundamental question of whether the APC have
detected real structure in the data set.
APC estimates provide a diagnostic for instability of predictor transforms in additive
regression models. For a small decrease in R2 of the additive regression, we can examine
a range of alternative sets of transforms of the predictor variables, thereby detecting
variables whose transforms are unstable due to additive dependencies in the predictors.
Bibliography
[Ben72] J. P. Benzecri. Sur l'Analyse des Tableaux Binaires Associés à une Correspondance Multiple. Technical Report, Université Pierre et Marie Curie, Paris, 1972. Note Mimeo, Lab. Stat. Math.
[BF85] L. Breiman and J. H. Friedman. Estimating optimal transformations for multiple regression and correlation. Journal of the American Statistical Association, 80:580-598, 1985.
[BK85] A. Buja and R. Kass. Comment on [BF85]. Journal of the American Statistical Association, 80:602-607, 1985.
[BS83] R. B. Bapat and V. S. Sunder. On Majorization and Schur Products. Technical Report 8319, Indian Statistical Institute, New Delhi, 1983.
[Buj85] A. Buja. Theory of Bivariate ACE. Technical Report 74, Department of Statistics, University of Washington, Seattle, 1985.
[dL82] J. de Leeuw. Nonlinear principal components analysis. COMPSTAT, 77-86, 1982.
[dLYT76] J. de Leeuw, F. W. Young, and Y. Takane. Additive structure in qualitative data: an alternating least squares approach with optimal scaling features. Psychometrika, 41:471-503, 1976.
[EY36] C. Eckart and G. Young. The approximation of a matrix by another of lower rank. Psychometrika, 1:211-218, 1936.
[FK85] E. B. Fowlkes and J. R. Kettenring. Comment on [BF85]. Journal of the American Statistical Association, 80:607-613, 1985.
[FS81] J. H. Friedman and W. Stuetzle. Smoothing of Scatterplots. Technical Report ORION003, Department of Statistics, Stanford University, Stanford, California, 1981.
[Fuc79] V. R. Fuchs. Low Level Radiation and Infant Mortality. Technical Report unknown, National Bureau of Economic Research, Stanford, California, 1979.
[Gif81] A. Gifi. Non-linear Multivariate Analysis. Department of Data Theory, Leiden, 1981.
[Gna77] R. Gnanadesikan. Methods for Statistical Data Analysis of Multivariate Observations. Wiley, New York, 1977.
[Has83] T. Hastie. Principal Curves and Surfaces. PhD thesis, Department of Statistics, Stanford University, Stanford, California, 1983.
[Hot33] H. Hotelling. Analysis of a complex of statistical variables into principal components. Journal of Educational Psychology, 24:417-441, 498-520, 1933.
[HR78] D. Harrison and D. L. Rubinfeld. Hedonic housing prices and the demand for clean air. Journal of Environmental Economics and Management, 5:81-102, 1978.
[HT86] T. Hastie and R. Tibshirani. Generalized additive models. Statistical Science, 1:297-318, 1986.
[Jor70] K. Jorgens. Linear Integral Operators. Pitman, London, 1970.
[KG85] R. Klein and E. G. Garayalde. Nonlinear principal components by projection pursuit. Informes de Matemática, Série B-032, Rio de Janeiro, 1985.
[Koy85] R. Koyak. Optimal Transformations for Multivariate Linear Reduction Analysis. PhD thesis, Department of Statistics, University of California, Berkeley, California, 1985.
[KS74] J. B. Kruskal and R. N. Shepherd. A nonmetric variety of linear factor analysis. Psychometrika, 39:123-157, 1974.
[Lan58] H. O. Lancaster. The structure of bivariate distributions. Annals of Mathematical Statistics, 29:719-736, 1958.
[Mil] W. A. Mills. Preface to [TB69].
[MP85] J. A. McDonald and J. O. Pedersen. Computing environments for data analysis, parts I-III. SIAM J. Scientific and Statistical Computing, 6:1004-1021, 1985.
[Pea01] K. Pearson. On lines and planes of closest fit to points in space. Phil. Magazine, 2:559-572, 1901.
[Sil69] S. D. Silvey. Multicollinearity and imprecise estimation. JRSSB, 31:539-552, 1969.
[Spe04] C. Spearman. The proof and measurement of association between two things. American Journal of Psychology, 15:72 and 202, 1904.
[Ste69a] E. J. Sternglass. Evidence for low-level radiation effects on the human embryo and fetus. In Proceedings of Hanford Symposium: The Radiation Biology of the Fetal and Juvenile Mammal, pages 5-8, May 1969.
[Ste69b] E. J. Sternglass. Infant mortality and nuclear tests. Bul. Atomic Scientists, XXV:18-20, 1969.
[Sto85] C. J. Stone. Additive regression and other nonparametric models. Annals of Statistics, 13:689-705, 1985.
[Thu31] L. L. Thurstone. Multiple factor analysis. Psychological Review, 38:406-427, 1931.
[YdLT76] F. W. Young, J. de Leeuw, and Y. Takane. Regression with qualitative and quantitative variables: an alternating least squares approach with optimal scaling features. Psychometrika, 41:505-529, 1976.
[YTdL78] F. W. Young, Y. Takane, and J. de Leeuw. The principal components of mixed measurement level multivariate data: an alternating least squares approach with optimal scaling features. Psychometrika, 43:279-281, 1978.
Appendix A
Statistical Programming on the
Symbolics 36xx Lisp Machine
The algorithm for estimating the APC, the graphical interpretation techniques for
the APC, and the diagnostic for additive regression were implemented on the Symbolics
Lisp Machine 36xx series (SLM). These machines are currently nonstandard for statistical
research, and yet, as argued by McDonald and Pederson [MP85], they possess many
capabilities making them well suited for this use. I will discuss here my experience in
using these machines for the programming tasks of this dissertation.
The SLM is a single-user graphics workstation. It has computing power roughly
equivalent to a VAX 780, a high resolution bitmap display and a graphical input device
(mouse).
There are two aspects of the machine which I found significantly affected the path of
my research: the programming environment and the graphics capabilities.
Programming environment
The SLM has an integrated programming environment, which is distinguished from more
conventional operating systems (e.g., UNIX or VMS) by two features:
• Procedures and data remain resident in memory, so that programs can be developed and modified incrementally.
• It uses a single language for (almost) all programming tasks.
The first of these has a considerable impact on the programmer's willingness to ex
periment with an implementation. Since procedures and data remain in memory, small
changes can be made incrementally - that is, without the overhead of linking and reload
ing programs into memory. I found this resulted in a faster, less frustrating coding stage,
since the time between conception and execution of changes and corrections to the pro
gram was not significant. More importantly, however, I was encouraged to experiment
with the algorithm at all levels of program development: changes to input values, data,
function definitions and procedures are all simple to effect, and the time commitment in
doing so is not daunting. In addition, since any intermediate stage of the iterative algo
rithm could be examined, and modifications made interactively, it was easy to experiment
with factors affecting the implementation performance. This close acquaintance with the
inner workings of an implementation is a far cry from the black box paradigm of batch
processing.
The single language of the SLM is the integrated dialect Flavors, an object oriented
extension of LISP. Object oriented programming languages are general purpose program
ming languages, which I can attest, are particularly accessible to the naive user. This is
primarily because the language provides a natural mental model for programming, that
is, the abstractions of the language come close to the way we naturally think about a
problem. This again improves communication between user and computer, enhancing the
capabilities of both the machine and the user.
In fact the SLM is not strictly monolingual. It has a FORTRAN compiler and an
interface that allows procedures to be called from LISP programs. Hence I was able to use
existing, tested FORTRAN software for the Supersmooth, and the EISPACK subroutines
for eigen decompositions.
The Graphic Capabilities
The combination of a high resolution bitmap display and the mouse permits a natural
graphical language between user and computer. The multi-window system of the SLM
allows effective use of the bitmap display, and easy interaction between the multiple
functions of the software. Together, these three features provide strong support for graphical
interaction.
Single user machines, as opposed to time sharing machines, allow real time motion
in graphics. To give the illusion of smooth motion the graphics program must satisfy
exacting timing constraints. On a single user machine, with adequate computing power
on demand, and high speed data transfer between CPU and display, the required response
time is guaranteed.
The high speed graphical interaction of the SLM also permits real time constraint
satisfaction, which enabled the implementation of brushing on connected scatterplots. The
constraint that all points representing the same observation (in connected scatterplots)
be drawn with the same glyph is satisfied practically simultaneously with the observation
being brushed by the mouse. The speed of this interaction is a major factor in its utility
as an interpretational tool.
Vita
Deborah Donnell was born October 29, 1958 in Auckland, New Zealand.
She completed high school at Rangitoto College, Auckland, in 1976, having been accredited
the University Entrance Examination and winning a Junior Scholarship Award.
She graduated from Auckland University in 1980, with a Bachelor of Arts in Music
and Mathematics, gaining the senior mathematics prize in her final year. In 1982 she
completed a Master of Arts degree in Mathematics at Auckland University.