University of Washington
Abstract
Additive Principal Components : A Method for Estimating
Additive Equations with Small Variance From Multivariate Data
by Deborah J. Donnell
Chairperson of the Supervisory Committee: Professor Werner Stuetzle
Department of Statistics
Additive equations or additive principal components are a generalization of linear principal
components, with sums of arbitrary transformations replacing linear combinations of variables.
The presence of additive principal components with small variances indicates the concentration of
the observations around a possibly nonlinear manifold, implying strong dependencies between the
variables. Additive principal components thus have diagnostic applications: additive dependencies
among predictor variables of an additive regression model cause problems that are similar to
those caused by collinearity among predictors in a linear model.
Additive principal components are the solution of an eigenproblem in an appropriate function
space. An iterative algorithm is given for the sequential computation of the smallest additive
principal components, and convergence to the correct minimizing solution is shown. The estimation
technique is evaluated on data generated from certain symmetric distributions for which the
solution can be determined explicitly.
A transparent method for interpretation of the additive dependencies using dynamic graphics is
suggested. The effectiveness of this technique is illustrated on several data sets. An application of
the additive principal component as a diagnostic for instability of predictor transforms in additive
regression models is demonstrated.
Table of Contents
List of Figures  v
List of Tables  vii
Chapter 1: Introduction  1
Chapter 2: Definition and Theory of the Additive Principal Component  5
2.1 Introduction  5
2.2 Definition of the Smallest Additive Principal Component  6
2.3 Finding the Additive Principal Component  7
2.4 Further Additive Principal Components  17
2.5 A Null Distribution for Additive Principal Components  20
2.6 A Linear Characterization  21
2.7 Alternating Conditional Expectation Regression and Additive Principal Component Analysis  22
Chapter 3: Additive Principal Component Solutions for some Multivariate Distributions  26
3.1 Introduction  26
3.2 Distributions with Bivariate Symmetry  27
3.3 The Additive Principal Components of Distributions with Bivariate Symmetry  28
3.4 Polynomial Biorthogonality  32
3.5 Additive Principal Components of the Gaussian Distribution  33
3.6 Additive Principal Components of the Gegenbauer Distribution  35
3.7 Zero Variance Additive Principal Components for Clustered and Categorical Data  40
Chapter 4: Estimation of Additive Principal Components  42
4.1 Introduction  42
4.2 Algorithm Implementation Details  43
4.3 Algorithm Improvement: A Linear Principal Component Step  47
Chapter 5: Simulations of Additive Principal Component Estimation  49
5.1 Introduction  49
5.2 Evaluation Measures  50
5.3 Simulations using the Gaussian Distribution  53
5.4 Simulations using the Uniform Distribution on an Ellipsoid  63
5.5 Simulations using Manifolds defined by Specified Constraints  69
5.6 APC Estimation for Uncorrelated Variables  81
5.7 APC Estimation for Distributions with Exact Additive Dependencies  83
5.8 Conclusions  85
Chapter 6: Applied Additive Principal Component Analysis  91
6.1 Introduction  91
6.2 Interpretation Techniques for Data Analysis  92
6.3 Guidelines for Detecting Real Structure  101
6.4 The Infant Mortality Data  104
6.5 The Boston Housing Data  117
6.6 A Diagnostic for Additive Regression Transform Stability  126
Chapter 7: Literature Review  133
7.1 Linear Principal Component Analysis  133
7.2 Nonlinear Generalizations of Principal Component Analysis  137
7.3 Additive Models  141
Chapter 8: Conclusion  145
Bibliography  149
Appendix A: Statistical Programming on the Symbolics 36xx Lisp Machine  153
List of Figures
5.1 GAU-S1: Correlation Plots  55
5.2 GAU-S1: APC-function Estimation  58
5.3 GAU-S1: Variance of APC-function Estimation  59
5.4 GAU-S2: Correlation Plots  61
5.5 GAU-S2: APC-function Estimation for Smallest APC  62
5.6 UNI-S: Correlation Plots  66
5.7 UNI-S: APC-function Estimation  67
5.8 UNI-S: Variance of APC-function Estimation  68
5.9 SCM-S1: APC-function Estimation for Component 1  75
5.10 SCM-S1: APC-function Estimation for Component 2  76
5.11 SCM-S2: APC-function Estimation for Component 1  79
5.12 SCM-S2: APC-function Estimation for Component 2  80
5.13 Independent Gaussian: Estimates of the Three Smallest APCs  82
5.14 Uniform on Ball: Estimates of the Three Smallest APCs  84
5.15 Discrete APC: Estimates of the Two Smallest APCs  86
6.1 Interpretation Example: The APC-function plots  97
6.2 Interpretation Example: The APC-function plots  98
6.3 Interpretation Example: Added Variable Plots  99
6.4 Interpretation Example: The APC-function plots  100
6.5 TIM: The Additive Regression Models  106
6.6 TIM-5var: APC-functions of the smallest APC  108
6.7 TIM-5var: APC-functions of the second APC  109
6.8 TIM-5var: APC-functions for the third APC  110
6.9 TIM-4var: APC-functions for the smallest APC  112
6.10 TIM-4var: APC-functions for the second APC  113
6.11 TIM-4var: APC-functions for the third APC  114
6.12 TIM-4var: The residuals of the smallest APC  116
6.13 BH-small: The smallest APC-function plots  120
6.14 BH-small: The second APC-function plots  121
6.15 BH-small: The third APC-function plots  122
6.16 BH-small: The smallest APC outliers  123
6.17 BH-small: The ACE regression models  127
6.18 APC Diagnostic for TIM Regression: Smallest APC  131
6.19 APC Diagnostic for TIM Regression: Second smallest APC  132
List of Tables
5.1 GAU-S1: Correlations between True and Estimated APCs  54
5.2 GAU-S1: Loading Metric  54
5.3 GAU-S1: Eigenvalue and Variable Loadings  57
5.4 GAU-S1: Canonical Metric  60
5.5 GAU-S2: Eigenvalue and Variable Loadings  61
5.6 GAU-S2: Canonical Metric  62
5.7 UNI-S: Eigenvalue and Variable Loadings  64
5.8 UNI-S: Correlation between True and Estimated APCs  65
5.9 UNI-S: Loading Metric  65
5.10 UNI-S: Canonical Metric  66
5.11 SCM-S1: Eigenvalue and Variable Loadings  73
5.12 SCM-S1: Correlation between True and Estimated APCs  73
5.13 SCM-S1: Loading Metric  74
5.14 SCM-S1: Canonical Metric  74
5.15 SCM-S2: Correlation between True and Estimated APCs  77
5.16 SCM-S2: Loading Metric  77
5.17 SCM-S2: Eigenvalue and Variable Loadings  78
5.18 SCM-S2: Canonical Metric  78
6.1 TIM: The Additive Regression Models  105
6.2 TIM-5var: Eigenvalues and Variable Loadings  107
6.3 TIM-4var: Eigenvalues and Variable Loadings  112
6.4 BH-small: Eigenvalues and Variable Loadings  119
6.5 BH-full: Eigenvalues and Variable Loadings  124
Acknowledgements
I gratefully acknowledge the guidance, patience and assistance of Werner Stuetzle and
Andreas Buja, whose generosity in sharing their time and ideas provided an invaluable
resource for this dissertation.
I would like to express my appreciation to the entire staff and faculty of the Statistics
department, who provided funding throughout my program; with special thanks to Peter
Guttorp and Jon Wellner, from whom help was never sought in vain.
No thanks would be complete without mention of the warm and constant support of my
friends in Seattle: to my housemates Catherine, Anne, Scott, Kay, Anne, Stefan, Heather
and Robert for all the dinners, fun and companionship; to my fellow students, especially
Robert, Jeff, Nuala, Katrina, Keith, Gary and Russell for the never ending invitations to
coffee.
Finally, to Andrew, who was always prepared to endure the worst and celebrate the
best with me, my warmest and deepest thanks.
The research for this dissertation was completed while the author was supported by
DOE Grant number DE-FG06-85ER25006.
Gloria Deo - To God be the Glory
Chapter 1
Introduction
Computers have made it possible to collect and analyze ever larger data sets. This development
has created a need for new statistical methods. Small sample size necessarily
limits the complexity of models that can be fitted and of structure that can be reliably
detected; thus one is restricted to classical parametric methods like linear regression and
linear principal component analysis. On the other hand, large sample size allows the
detection of complicated structure, and the fitting of complex models. This means that
nonparametric methods for description and inference making fewer assumptions about the
underlying situation are called for.
These considerations are the reasons why in recent years there has been a surge of
interest in methods for nonparametric multiple regression, in particular the models of
additive regression and Alternating Least Squares (ALS) [YdLT76] or Alternating Conditional
Expectation (ACE) regression [BF85]. The former models the response as an
additive function of the predictors:

    Y ≈ Σ_{i=1}^p φ_i(X_i),

whereas ACE regression finds transformations φ_1, ..., φ_p of the predictors as well as a
transformation θ of the response:

    θ(Y) ≈ Σ_{i=1}^p φ_i(X_i).
This dissertation contributes to the development of methodology suitable for detecting
complex structure in data.
Our intent is to estimate additive equations from multivariate data which satisfy as
nearly as possible the constraint:

    Σ_{i=1}^p φ_i(X_i) = 0.

Such an additive constraint describes high-dimensional structure in the data. Recall the
linear structure implied by a linear constraint, l(x) = a'x = 0. If the data nearly satisfy
this constraint, they lie close to a linear manifold of co-dimension 1 (dimension p - 1). Analogously,
an additive constraint Σ φ_i(X_i) = 0 defines an additive manifold of co-dimension 1, and
data nearly satisfying this constraint lie near this additive manifold.
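As an illustrative sketch (assuming Python with numpy; the data and transformations are hypothetical), consider points scattered around the unit circle: the centered transformations φ_1(x) = x² and φ_2(y) = y² nearly satisfy an additive constraint, so their sum has variance close to zero once the component variances are scaled to sum to one.

    import numpy as np

    rng = np.random.default_rng(0)
    n = 1000
    theta = rng.uniform(0.0, 2.0 * np.pi, size=n)
    # Points near the unit circle: x^2 + y^2 is almost constant.
    x = np.cos(theta) + 0.02 * rng.standard_normal(n)
    y = np.sin(theta) + 0.02 * rng.standard_normal(n)

    # Candidate additive constraint: phi_1(x) + phi_2(y) ~ 0 after centering.
    phi1 = x**2 - np.mean(x**2)
    phi2 = y**2 - np.mean(y**2)

    # Rescale so that var(phi1) + var(phi2) = 1, as in the APC standardization.
    scale = np.sqrt(phi1.var() + phi2.var())
    phi1, phi2 = phi1 / scale, phi2 / scale

    print("sum of component variances:", phi1.var() + phi2.var())   # 1 by construction
    print("variance of the sum:       ", (phi1 + phi2).var())       # close to 0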
We present a method for estimating the transformations φ_1(X_1), φ_2(X_2), ..., φ_p(X_p)
describing an additive manifold close to the data. The additive equation is defined by
generalizing the definition of the linear principal component, resulting in the Additive
Principal Component.
Detecting high-dimensional structure in data is intrinsically a difficult task, even with
sophisticated graphical tools. The estimation of constraints will be an appropriate analysis
tool when the search for structure in the data is undirected, that is, no variables are
designated a priori as intrinsically more important, or dependent rather than independent.
Although the additive form of the equation places some restrictions on the surfaces that
can be modelled, they nevertheless are considerably more general than linear manifolds. If
they can be reliably estimated, and properly displayed and interpreted, additive principal
components have the potential to be an important tool for better understanding the
multivariate nature of data.
The importance of recognizing nonlinear dependencies among the predictor variables
when fitting additive regression models is analogous to the importance of detecting
collinearity patterns when fitting linear models [Sil69]. In a linear model, collinearity between
carriers results in inflated variance of the estimated regression coefficients. It is then
not possible to infer the separate influence of the collinear explanatory variables on the
response variable. In the additive case, similar difficulties arise. Suppose we fit an additive
model Y ≈ Σ_{j=1}^p φ_j(X_j) to the data. We often want to make both qualitative and
quantitative statements about the contributions of each X_j in the model, based on the
estimated φ_j. Consider the analogy to the extreme case of exact collinearity, where there
exist functions of the variables such that Σ g_j(X_j) = 0. In this situation, the alternative
fit:

    Y ≈ Σ_{j=1}^p (φ_j + g_j)(X_j),

is indistinguishable from the initial one. If the data come close to satisfying this constraint,
some or all of the estimated φ_j will not be stable. We are clearly in no position to interpret
the component functions of the fitted model when this is the case. A method which enables
us to examine how close the data come to satisfying an additive constraint would thus be
a diagnostic check for global stability of the transforms in additive or ACE regression.
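A minimal numeric sketch of this indistinguishability (assuming Python with numpy; the component functions are hypothetical): when the predictors satisfy an exact additive constraint Σ g_j(X_j) = 0, adding the g_j to any set of fitted component functions leaves the fitted values unchanged.

    import numpy as np

    rng = np.random.default_rng(1)
    n = 500
    x1 = rng.standard_normal(n)
    x2 = rng.standard_normal(n)
    x3 = -(x1 + x2)                       # exact additive constraint: x1 + x2 + x3 = 0

    # Hypothetical fitted component functions, and a g_j triple summing to zero on the data.
    phi = [np.sin, lambda x: 0.5 * x, lambda x: x**2 - 1.0]
    g   = [lambda x: 2.0 * x, lambda x: 2.0 * x, lambda x: 2.0 * x]

    fit_a = phi[0](x1) + phi[1](x2) + phi[2](x3)
    fit_b = (phi[0](x1) + g[0](x1)) + (phi[1](x2) + g[1](x2)) + (phi[2](x3) + g[2](x3))

    # The two "models" produce the same fitted values, so the phi_j are not identifiable.
    print(np.max(np.abs(fit_a - fit_b)))   # essentially zero (floating-point precision)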
Additive principal components provide a method for detecting high dimensional structure
in multivariate data, and thus discovering the implied additive dependencies between
the variables. As a natural extension to linear principal component methodology, there
are many potential applications for this data analysis technique.
This dissertation considers first the theoretical properties of additive principal components,
followed by a study of the estimation problem.
We begin in the next chapter with a formal definition of the Additive Principal Component
(APC). An algorithm for finding the APCs leads naturally to consideration of their
theoretical properties. In the third chapter, continuing the theoretical development, we
find explicit APC solutions for a class of elliptically symmetric distributions.
The algorithm developed in Chapter 2 provides a method of estimation for the APCs.
Finite sample issues arising in the implementation of the algorithm are discussed in Chapter 4.
The fifth chapter draws on the known APC solutions derived in the third chapter
to study the finite sample algorithm via simulation. The sixth chapter addresses the use
of APCs in the data analysis context, discussing methods of interpretation involving
the use of dynamic graphics, and then demonstrating their use on two real data sets. A
dynamic graphical diagnostic for additive and ACE regression is also explained.
The seventh chapter reviews linear principal component techniques, and then discusses
some other nonlinear generalizations of principal components that have been developed.
The dissertation concludes with a brief summary of the research presented.
Chapter 2
Definition and Theory of the
Additive Principal Component
2.1 Introduction
We begin by defining the smallest additive principal component. Then a simple intuitive
idea suggests an algorithm for finding the APC, which we analyze for the linear case.
The insight gained from the linear solution is applied to the additive case, resulting in
the characterization of the APC as an eigenfunction. Due to this characterization, the
properties of the APC and the algorithm are more fully understood, and we deduce a
modification of the algorithm for which convergence, at least in the population case,
can be shown. Throughout this chapter, we concern ourselves only with the population
properties of the algorithm.
Before proceeding further, we remark that estimation of additive equations is not invariant
under rescaling of the variables. This is analogous to the scaling issue in linear
principal component analysis. Throughout this dissertation we assume the random variables
have been standardized, so X_1, X_2, ..., X_p have E(X_i) = 0 and var(X_i) = 1.
2.2 Definition of the Smallest Additive Principal Component

Our objective is to determine whether random variables X_1, X_2, ..., X_p come close to
satisfying an additive constraint Σ φ_i(X_i) = 0, for some set of transformations φ_1, φ_2, ..., φ_p.
First consider the classical version of this problem where the functions φ_i are restricted
to be linear, that is, Σ φ_i(X_i) = Σ a_i X_i. The aim then is to find the linear combination
of the variables that is closest to zero.
One possible criterion is to find the vector a minimizing the variance of the sum,
var(Σ a_i X_i) = var(X·a). To avoid the trivial solution, a = 0, the constraint Σ a_i² =
Σ var(a_i X_i) = 1 is imposed. The minimum occurs for a an eigenvector for the smallest
eigenvalue of cov(X) = Σ. The random variable Σ a_i X_i is called a smallest principal
component of X. The corresponding linear function l_1(x) = a·x defines a linear manifold
of co-dimension 1 in p-space through l_1(x) = 0. It can be shown that this manifold
minimizes the expected squared distance from the observations to any linear manifold of
co-dimension 1. Hence defining a using this geometric criterion, of minimizing distance to
a linear manifold, results in the same solution as finding a minimizing the variance.
For the linear case, the above three characterizations of the vector a are equivalent:
• Σ a_i X_i has minimal variance among all linear combinations of the variables with
Σ a_i² = 1.
• a·x = 0 defines the manifold of co-dimension 1 lying closest to the data.
• a is an eigenvector for the smallest eigenvalue of Σ.
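A small numerical check of the first and third characterizations (assuming Python with numpy; the simulated data are hypothetical): the eigenvector for the smallest eigenvalue of the covariance matrix gives the unit-norm linear combination of smallest variance.

    import numpy as np

    rng = np.random.default_rng(2)
    A = rng.standard_normal((4, 4))
    X = rng.standard_normal((2000, 4)) @ A
    X = (X - X.mean(0)) / X.std(0)               # standardized variables, as assumed throughout

    S = np.cov(X, rowvar=False)
    evals, evecs = np.linalg.eigh(S)
    a = evecs[:, 0]                              # eigenvector for the smallest eigenvalue

    print("smallest eigenvalue:      ", evals[0])
    print("variance of sum a_i X_i:  ", np.var(X @ a))   # ~ the smallest eigenvalue

    # No random unit vector should achieve a smaller variance.
    V = rng.standard_normal((200, 4))
    V /= np.linalg.norm(V, axis=1, keepdims=True)
    print("best of 200 random vectors:", min(np.var(X @ v) for v in V))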
We return to the problem posed for the more general additive case: find nontrivial
functions φ_i making the sum of the transformed variables, Σ φ_i(X_i), "closest" to zero.
We need to decide on a criterion that makes the notion of closeness exact and uniquely
defines the set of transformations Φ = (φ_1, φ_2, ..., φ_p).
A natural approach is to extend the definition of the smallest principal component
of X. Using the minimum variance characterization, we could define Φ as the vector of
transformations of the variables minimizing var Σ φ_i(X_i) subject to Σ var φ_i(X_i) = 1.
Alternatively, we could use the geometric characterization and determine the additive
manifold described by Σ φ_i(X_i) = 0 which minimizes the expected squared distance from
the observations to any additive manifold of co-dimension 1.
Any solution Φ(X) = Σ_i φ_i(X_i) to the minimum variance criterion will define through
Φ(x) = 0 an additive manifold which lies close to the data. However, unlike the linear
case, this additive manifold and the additive manifold closest to the data in the geometric
sense will not be the same.
We choose to use the minimum variance approach, which is both computationally and
theoretically more tractable.

Definition
The smallest additive principal component of X = (X_1, ..., X_p) is the random
variable Φ(X) = Σ_{i=1}^p φ_i(X_i) minimizing var Σ_{i=1}^p φ_i(X_i) subject to
Σ_{i=1}^p var φ_i(X_i) = 1.

Note that the constraint Σ var φ_i = 1 is indeed the natural analogue to the linear definition.
If φ_i(X_i) = a_i X_i, then Σ var φ_i(X_i) = Σ var(a_i X_i) = Σ a_i² var X_i = Σ a_i² = 1.
At this point we present the notation and terminology conventions we will use throughout
this dissertation.
• Φ(X) = Σ_i φ_i(X_i) denotes the Additive Principal Component, abbreviated APC.
• φ_i(X_i) is referred to as the APC-function for the i-th variable.
• Φ = (φ_1, ..., φ_p) denotes the vector of transformations defining the APC.
2.3 Finding the Additive Principal Component
2.3.1 A Naive Algorithm
Our intent is to find functions Φ = (φ_1, φ_2, ..., φ_p) minimizing var Σ φ_i(X_i) subject to
Σ var φ_i(X_i) = 1. Rewriting the variance of the sum,

    var Σ_i φ_i(X_i) = E[ φ_1(X_1) - ( - Σ_{j≠1} φ_j(X_j) ) ]²,

suggests a straightforward componentwise minimization scheme, in the spirit of ACE
[BF85]. Let us ignore the constraint Σ var φ_i(X_i) = 1 for the moment. If we assume
φ_2, ..., φ_p to be known, then the minimizing transformation of X_1 is given by:

    φ_1 ← E^{X_1}( - Σ_{j≠1} φ_j(X_j) ).

(Here, E^{X_1} ≡ E(· | X_1).) This is then done for each variable in turn, yielding a new set
of transformations. The constraint is reinstated by rescaling using Σ var φ_i of the new
functions. This suggests the following algorithm.
Naive algorithm

    Choose initial transformations φ_1^[0], φ_2^[0], ..., φ_p^[0]
    Repeat for N = 1, 2, ...                                           (Outer loop)
        Do for i = 1, ..., p                                           (Inner loop)
            φ_i^[N] ← E^{X_i}( - Σ_{j≠i} φ_j^[N-1](X_j) )
        (φ_1^[N], φ_2^[N], ..., φ_p^[N]) ← (c φ_1^[N], c φ_2^[N], ..., c φ_p^[N]),  c chosen so Σ var φ_i^[N] = 1
    Until var Σ φ_i^[N] converges.
Notice that the iteration scheme employed here in the inner loop is different from that used
in the ACE algorithm of Breiman and Friedman [BF85]. Breiman and Friedman replace
each φ_i by its new transformation as the inner loop proceeds, whereas we obtain the new
p-tuple using only the previous p-tuple throughout the entire inner loop. This provides
us with a natural way of restandardizing in the outer loop and will allow a transparent
analysis of the convergence of the algorithm.
2.3.2 Analysis of the Algorithm for Linear Transformations

The problem in the linear case is to find the vector a minimizing the variance of the
corresponding linear combination of X, that is, to minimize var Σ l_i(X_i) = var Σ a_i X_i
subject to Σ var(a_i X_i) = Σ a_i² = 1. The solution is an eigenvector for the smallest
eigenvalue of Σ, hence we expect to establish the convergence of the naive algorithm to
this vector.
Consider the first step of an inner loop. Following the previous section, we initially
ignore the side condition and assume a_2, ..., a_p to be fixed. The value of a_1 minimizing
var( (-Σ_{j≠1} a_j X_j) - a_1 X_1 ) is the coefficient of the simple linear regression of -Σ_{j≠1} a_j X_j
on X_1:

    a_1 = E[ (-Σ_{j≠1} a_j X_j) X_1 ] / E X_1².

Assuming that this step is part of the inner loop of an algorithm in which we compute
a_i^(new) from (a_1^(old), a_2^(old), ..., a_p^(old)), and also making use of the assumptions E X_i = 0 and
var X_i = 1, we can write:

    a_1^(new) = -Σ_{j≠1} a_j^(old) cov(X_j, X_1)
              = a_1^(old) - Σ_{j=1}^p a_j^(old) cov(X_j, X_1).

Using this equation, the inner loop iteration over all variables written in vector notation
is:

    a^(new) = a^(old) - Σ a^(old) = (I - Σ) a^(old).

We impose the constraint Σ a_i² = 1 by rescaling by the norm factor √( Σ (a_i^(new))² ) =
||(I - Σ) a^(old)|| after the inner loop is completed. Hence, after the k-th iteration we have:

    a^[k] = (I - Σ) a^[k-1] / ||(I - Σ) a^[k-1]|| = (I - Σ)^k a^[0] / ||(I - Σ)^k a^[0]||.

Each iteration applies the matrix I - Σ and then restandardizes, so the algorithm is
simply the power method for computing eigenvectors. It is easily shown to converge to
the eigenvector for the largest absolute eigenvalue of I - Σ. This may not be the vector
we are seeking, which is rather the eigenvector for the smallest eigenvalue of Σ. Suppose, for the sake
of illustration, that the largest eigenvalue of Σ is 3 and the smallest is 0.5; then the
eigenvalue of I - Σ with the largest absolute value is |1 - 3| > |1 - 0.5|, so the algorithm
will converge to the eigenvector for the largest eigenvalue of Σ instead of the smallest.
However, since the eigenvalues of Σ lie between 0 and p, the algorithm can be modified
to ensure convergence to the desired eigenvector.
Observe that if a p × p matrix A has an eigenvalue λ with corresponding eigenvector v,
then the matrix pI - A has the eigenvalue p - λ with the same eigenvector, v. So if A
has the increasing sequence of eigenvalues 0 ≤ λ_1 ≤ ... ≤ λ_p ≤ p, then pI - A has the
positive decreasing sequence of eigenvalues p ≥ p - λ_1 ≥ ... ≥ p - λ_p ≥ 0. It follows
that an eigenvector for the largest (absolute) eigenvalue of pI - Σ is an eigenvector for
the smallest eigenvalue of Σ.
We can ensure convergence to the smallest eigenvector of Σ by applying the matrix
pI - Σ at each step, instead of the naive I - Σ. Then the new a_i in the inner loop is:

    a_i^(new) ← p a_i^(old) - Σ_{j=1}^p a_j^(old) cov(X_j, X_i).

To summarize, the modified algorithm, which applies pI - Σ instead of I - Σ, will
converge to a correct solution of our problem in the linear case.
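A short sketch of this modified power iteration in the linear case (assuming Python with numpy; the correlation matrix is simulated): repeatedly applying pI - Σ and renormalizing drives a toward the eigenvector for the smallest eigenvalue of Σ.

    import numpy as np

    rng = np.random.default_rng(3)
    p = 4
    B = rng.standard_normal((p, p))
    Sigma = B @ B.T
    d = np.sqrt(np.diag(Sigma))
    Sigma = Sigma / np.outer(d, d)               # a correlation matrix, eigenvalues in [0, p]

    a = rng.standard_normal(p)
    a /= np.linalg.norm(a)
    for _ in range(500):
        a = (p * np.eye(p) - Sigma) @ a          # apply pI - Sigma ...
        a /= np.linalg.norm(a)                   # ... and restandardize (power method)

    evals, evecs = np.linalg.eigh(Sigma)
    print("variance of a'X:      ", a @ Sigma @ a)          # ~ smallest eigenvalue
    print("smallest eigenvalue:  ", evals[0])
    print("alignment with eigvec:", abs(a @ evecs[:, 0]))    # ~ 1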
2.3.3 The Hilbert Space of the Additive Principal Component

In the linear setting, the problem of minimizing the variance is easily solved analytically
by re-expressing the criterion as a minimization in R^p. That is,

    var Σ a_i X_i = a' E(X'X) a = a' Σ a = (a, Σa).    (2.1)

The minimization of var Σ a_i X_i subject to Σ a_i² = 1 is equivalent to minimizing (a, Σa)
subject to (a, a) = 1; the minimizing vector is then found by appealing to the Cauchy-Schwarz
inequality.
The minimization problem in the additive case can be re-expressed in an analogous
manner; however, we first need to establish an appropriate formal framework. The additive
principal component is defined by a vector of functions Φ = (φ_1, ..., φ_p), so the natural
analogue of R^p is a product space of L_2 functions.
For i = 1, ..., p, define the function spaces:

    H(X_i) = { φ_i : E φ_i(X_i) = 0, E φ_i²(X_i) < ∞ }.

Each of these is a Hilbert space with inner product (φ_i, φ_i') = E( φ_i(X_i) φ_i'(X_i) ) and
corresponding squared norm ||φ_i||² = E φ_i²(X_i) = var φ_i(X_i).
Define the cartesian product space H^p = H(X_1) × H(X_2) × ... × H(X_p). The natural
inner product on H^p is:

    (Φ, Φ')_* = Σ_i (φ_i, φ_i') = Σ_i E( φ_i φ_i' ),

with corresponding norm:

    ||Φ||_*² = Σ_i E( φ_i² ) = Σ_i var φ_i.

In Breiman and Friedman [BF85] it is established that H^p is a Hilbert space for which
the natural embeddings of H(X_1), H(X_2), ..., H(X_p) are all closed linear subspaces. Also,
the norm topology of H^p coincides with the product topology inherited from the factors.
The smallest APC belongs to the cartesian sum space:

    H_+ = H(X_1) ⊕ ... ⊕ H(X_p)
        = { f(X) = Σ_i f_i(X_i) : Σ var f_i < ∞, f_i ∈ H(X_i) }.

Now that the formal notions for the APC are established, we proceed to the eigenfunction
characterization.
2.3.4 The Eigenfunction Characterization
The eigen properties of the APC follow when we reformulate the definition of the APC as
a minimization problem in HP. We begin by characterizing the estimator resulting from
the naive algorithm of section 2.3.1.
Recall that in the inner loop we obtain a new estimate of each φ_i by:

    φ_i^(new) ← E^{X_i}( - Σ_{j≠i} φ_j^(old) ).    (2.2)

The conditional expectation operator E^{X_i}, denoted P_i hereafter, is a projection mapping
H_+ onto the subspace H(X_i).
Since conditional expectation is a linear operator, we can rewrite,

    φ_i^(new) = φ_i^(old)(X_i) - Σ_{j=1}^p P_i( φ_j^(old)(X_j) );   i = 1, ..., p.

This can be written in an "operator matrix" notation, illustrating the similarity to the
linear case:

    Φ^(new)(X) = (I - P) Φ^(old),

where

    P = [ I     P_1   ...   P_1 ]
        [ P_2   I     ...   P_2 ]
        [ ...               ... ]
        [ P_p   P_p   ...   I   ]

and I denotes the identity mapping.
There is a slight abuse of notation in this representation concerning the domain of the
operator P_i. P_i is defined on the domain H_+. In the above form, P_i maps from the
subspace domain H(X_j) to H(X_i). Strictly speaking, the dependence of this restricted
operator should be indicated in the matrix representation of P; however, if P_i is simply
considered as shorthand for E^{X_i}(·), no confusion will result.
After k iterations of the outer loop,

    Φ^[k] = (I - P) Φ^[k-1] / ||(I - P) Φ^[k-1]||_*  =  (I - P)^k Φ^[0] / ||(I - P)^k Φ^[0]||_*.

The naive algorithm is seen to be the power method applied to the operator I - P, so it
will converge to the eigenfunction for the largest absolute eigenvalue of I - P, if it exists.
The similarity between the above representation (2.2) and the linear analysis of section
2.3.2 suggests that the vector of APC-functions is an eigenfunction of the operator P on
H^p defined above. P maps the additive function formed from summing the elements of Φ onto
each of its conditional expectations, that is,

    [P Φ]_i = P_i Σ_j φ_j(X_j).
As a first step towards establishing this characterization, we have the following simple
identity.

Lemma 2.1
    (Φ, P Φ)_* = var Σ_i φ_i.

Proof:
    (Φ, P Φ)_* = Σ_i ( φ_i, P_i Σ_j φ_j )
               = Σ_i ( φ_i, Σ_j φ_j )
               = ( Σ_i φ_i, Σ_j φ_j )
               = var Σ_i φ_i.

The second equality follows from the self-adjoint property of projection operators:
(φ_i, P_i f) = (P_i φ_i, f) = (φ_i, f). ∎
This identity provides the crux of the argument. Noting that Σ var φ_i = ||Φ||_*², the
definition of the smallest additive principal component has an equivalent characterization
as the solution to an extremum problem in H^p.

Theorem 2.1 A function vector Φ ∈ H^p minimizes (Φ, P Φ)_* subject to the constraint
||Φ||_*² = 1 iff the set of transformations {φ_1, φ_2, ..., φ_p} minimizes var Σ φ_i(X_i) under
Σ var φ_i(X_i) = 1.

Proof: An immediate consequence of Lemma 2.1. ∎

Notice the analogy between these equivalent characterizations and the two characterizations
of the linear solution, equation (2.1).
It is a well known fact from the theory of self-adjoint operators [Jor70, Th 6.7 p.125] that
Φ ∈ H^p minimizing (Φ, P Φ)_* subject to ||Φ||_*² = 1, or equivalently minimizing the
Rayleigh quotient

    (Φ, P Φ)_* / ||Φ||_*²,

is an eigenfunction for the smallest eigenvalue of P (where it exists). Thus, once we have
shown P is self-adjoint, the following eigen characterization of the vector of APC-functions
is established.

Theorem 2.2 The smallest eigenfunction of the operator P, if it exists, is a vector of
APC-functions for the smallest additive principal component of X.

An immediate corollary to Theorems 2.1 and 2.2 is:

Corollary 2.1 Suppose Φ = (φ_1, φ_2, ..., φ_p) is a smallest eigenfunction of P belonging
to the eigenvalue λ_min, with ||Φ||_* = 1. Then:
1. The smallest APC of X is Φ(X) = Σ_i φ_i(X_i).
2. The variance of the smallest APC is λ_min.

Proof: The first is immediate; for the second,
var Σ_i φ_i = (Φ, P Φ)_* = λ_min (Φ, Φ)_* = λ_min. ∎
We now turn our attention to establishing the properties of the operator P.

Lemma 2.2 P is a bounded, self-adjoint, non-negative operator in H^p.

Proof: P is bounded:

    ||P Φ||_*² = Σ_i || P_i Σ_j φ_j ||²
              ≤ Σ_i || Σ_j φ_j ||²
              = p || Σ_j φ_j ||²
              ≤ p ( Σ_j ||φ_j|| )².

The maximum of Σ_j ||φ_j|| under the constraint Σ_j ||φ_j||² = 1 is attained at ||φ_j|| = p^{-1/2}.
Hence,

    ||P Φ||_*² ≤ p ( Σ_j ||φ_j|| )² ≤ p².

The inequality is sharp, with equality occurring when X_i = X_j ∀ i, j.

P is self-adjoint:

    (Φ, P Ψ)_* = Σ_i ( φ_i, P_i Σ_j ψ_j )
              = Σ_i ( φ_i, Σ_j ψ_j )
              = ( Σ_i φ_i, Σ_j ψ_j ).

From the symmetry of this expression it follows that (Φ, P Ψ)_* = (P Φ, Ψ)_*.

P is non-negative: by Lemma 2.1, (Φ, P Φ)_* = var Σ φ_i ≥ 0. ∎
Finally, we address the existence of the smallest eigenspace.
Finding linear principal components is a simple finite dimensional problem, with Σ
having at most p distinct eigenvalues. Finding additive principal components, where we
are solving for a set of L_2 functions, is generally not a finite dimensional problem: for
continuous variables the spectrum of the operator P will not necessarily be finite or even
discrete. We do know, however, since P is bounded, that the spectrum of P is contained
in the closed interval [0, p]. The theory of bounded self-adjoint operators reveals that
there are potential problems: P may have a non-trivial continuous spectrum or may have
spectral values that are not eigenvalues. We can rule out these undesirable possibilities
by adopting suitable compactness assumptions, following Breiman and Friedman [BF85].

Assumption: The restricted operators P_i|_{H(X_k)} : H(X_k) → H(X_i) are compact
for k ≠ i, i = 1, ..., p.

A sufficient condition for compactness to hold is given in Breiman and Friedman
[BF85], and the implications of assuming compactness are more fully discussed by Buja
[Buj85].
The assumption implies that the image of the unit ball is relatively compact. Even
under this assumption, P itself is not compact: suppose X_1 is independent of X_2, ..., X_p;
then the bounded set { Φ : Φ = (φ_1(X_1), 0, ..., 0)', ||Φ||_* ≤ 1 } is preserved under P but not
relatively compact in H^p. However, we can show:

Lemma 2.3 The operator P - I : H^p → H^p is compact.
Proof: Let B denote the unit ball in H^p, and B_i the unit ball in H(X_i).
P - I = Σ_i Q_i, where Q_i : H^p → H^p is defined by

    Q_i(Φ) = ( P_1 φ_i, ..., P_{i-1} φ_i, 0, P_{i+1} φ_i, ..., P_p φ_i )'.

It is enough to show that every Q_i is compact, by Jorgens [Jor70, Th 5.10 p.98]. Since
B ⊂ B_1 × ... × B_p, compactness of Q_i is established if Q_i(B_1 × ... × B_p) is shown to be
relatively compact.
By assumption, P_j(B_i) is relatively compact in H(X_j) ∀ j ≠ i, hence

    Q_i(B_1 × ... × B_p) ⊂ P_1(B_i) × ... × P_{i-1}(B_i) × {0} × P_{i+1}(B_i) × ... × P_p(B_i)

is relatively compact in H^p. ∎

The assumption of compactness implies the spectrum of P is essentially discrete, since the
continuous spectrum of a compact operator consists of at most one point. The spectrum of
a compact operator in an infinite dimensional Hilbert space has the following properties:
• There exists a sequence {l_k}_1^∞ of distinct nonzero eigenvalues with l_k → 0 as k → ∞.
• The eigenspaces for distinct eigenvalues are orthogonal and the sum of all the
eigenspaces is dense in the whole space.
• The nonzero eigenvalues have finite multiplicity.
The spectrum of P - I is thus a discrete, bounded set with 0 as the only possible
accumulation point. Since the eigenvalues {l_k} of P - I are related to the eigenvalues
{λ_k} of P through λ_k = l_k + 1, the eigenvalues and eigenspaces of P inherit all the above
properties; however, the accumulation point of the eigenvalues is 1.
In summary, under the assumption of compactness, the smallest eigenvalue of P exists,
and any eigenfunction corresponding to this eigenvalue is a smallest additive principal component
of X.
2.3.5 The Final Algorithm

Now that the correspondence between P and the smallest APC is established, it is clear
that the naive algorithm for the additive case has the same flaw as it had in the linear case.
The eigenvalues and eigenfunctions of P and I - P are in one-to-one correspondence,
exactly as for the linear case of section 2.3.2. The naive algorithm converges to the
eigenfunction of I - P for the eigenvalue with the largest absolute value. As in the linear
case, applying the modified operator pI - P will guarantee convergence to the correct
solution.

The final algorithm is:

    Choose initial transformations φ_1^[0], φ_2^[0], ..., φ_p^[0]
    Repeat for N = 1, 2, ...                                           (Outer loop)
        Do for i = 1, ..., p                                           (Inner loop)
            φ_i^[N] ← p φ_i^[N-1] - Σ_{j=1}^p P_i( φ_j^[N-1](X_j) )
        Standardize so that Σ var φ_i^[N] = 1
    Until var Σ φ_i^[N] converges.
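A rough finite-sample sketch of this algorithm (assuming Python with numpy; the crude binned-mean smoother standing in for the conditional expectation operators P_i, and the example data, are choices of this illustration rather than the implementation discussed in Chapter 4):

    import numpy as np

    def smooth(x, y, n_bins=20):
        # Crude stand-in for the conditional expectation E(y | x):
        # bin x by quantiles and average y within each bin.
        bins = np.quantile(x, np.linspace(0.0, 1.0, n_bins + 1))
        idx = np.clip(np.searchsorted(bins, x, side="right") - 1, 0, n_bins - 1)
        means = np.array([y[idx == b].mean() if np.any(idx == b) else 0.0
                          for b in range(n_bins)])
        return means[idx]

    def smallest_apc(X, n_iter=50):
        n, p = X.shape
        Phi = X.copy()                                    # initial transformations: identity
        Phi -= Phi.mean(axis=0)
        Phi /= np.sqrt(Phi.var(axis=0).sum())             # standardize: sum of variances = 1
        for _ in range(n_iter):
            total = Phi.sum(axis=1)
            new = np.empty_like(Phi)
            for i in range(p):
                # One application of (pI - P): p*phi_i - E(sum_j phi_j | X_i).
                new[:, i] = p * Phi[:, i] - smooth(X[:, i], total)
            new -= new.mean(axis=0)
            new /= np.sqrt(new.var(axis=0).sum())
            Phi = new
        return Phi

    # Example data lying near the additive constraint x1^2 + x2 + x3 = 0.
    rng = np.random.default_rng(7)
    x1 = rng.standard_normal(2000)
    x2 = rng.standard_normal(2000)
    x3 = -(x1**2 + x2) + 0.1 * rng.standard_normal(2000)
    X = np.column_stack([(v - v.mean()) / v.std() for v in (x1, x2, x3)])

    Phi = smallest_apc(X)
    print("estimated smallest APC variance:", Phi.sum(axis=1).var())   # well below 1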
2.4 Further Additive Principal Components
Up to this point we have considered only a single constraint; however, there may be
other additive dependencies of importance that can be captured with a second constraint.
In linear principal component analysis, searching for additional linear dependencies would
correspond to examining the principal components of other eigenvalues. The characterization
of the smallest additive principal component as a smallest eigenfunction of P suggests
exploring eigenfunctions associated with other eigenvalues of P.
We first review the properties of the second smallest linear principal component. In
section 2.2 we pointed out that the smallest principal component, a_1, minimizes var(X'a),
or equivalently defines through l_1(x) = a_1·x = 0 a linear manifold L^(1) minimizing the
expected squared distance from the observations to any linear manifold of co-dimension
1. The second smallest principal component, a_2, is defined as the unit vector minimizing
var(X·a) subject to cov( Σ_i a_{1i} X_i, Σ_i a_{2i} X_i ) = 0. An equivalent definition replaces the
covariance constraint by the requirement a_1 ⊥ a_2.
The vector a_2 defines a linear function l_2 and a corresponding manifold L^(2) of co-dimension
1 through l_2(x) = 0. Together the functions l_1 and l_2 define a linear manifold
L^(12) of co-dimension 2, which is the intersection of L^(1) and L^(2). This is the manifold
that has the smallest expected squared distance from the observations among all manifolds
of co-dimension 2.
The second smallest additive principal component is defined by extending the variance
criterion of the linear definition.

Definition
The second smallest additive principal component of X is the random variable
Φ^(2)(X) = Σ φ_i^(2)(X_i) minimizing

    var Σ_i φ_i'(X_i) = (Φ', P Φ')_*

subject to (Φ', Φ^(1))_* = Σ_i cov( φ_i', φ_i^(1) ) = 0
and ||Φ'||_*² = Σ var φ_i'(X_i) = 1.

The additional constraint above defining the second smallest APC is a natural condition
of orthogonality between APCs with respect to the inner product of the Hilbert space H^p.
The second smallest additive principal component is an eigenfunction corresponding to
the second smallest eigenvalue of P. It is easy to generalize this idea and define a sequence
of additive principal components, each one orthogonal to all the preceding ones. The k-th
additive principal component corresponds to an eigenfunction of P belonging to the k-th
smallest eigenvalue (where eigenvalues are repeated according to their multiplicity). The
k-th APC is denoted by adding a superscript to the usual notation, i.e., Φ^(k)(X).
As the operator P - I is compact, we can express its decomposition explicitly. Divide
the eigenvalues of P - I into an upper and a lower sequence according to whether they are
positive or negative. Denote the negative values by the increasing sequence {l_k - 1 : l_k ≤
1, k = 1, 2, ...} and the positive values by the decreasing sequence {u_k - 1 : u_k ≥ 1, k = 1, 2, ...}.
Both sequences, if they are infinite, converge to zero. Let U_k denote the operator that
projects onto the eigenspace of the eigenvalue u_k, and likewise L_k. Then P - I can be
written:

    P - I = Σ_k (l_k - 1) L_k + Σ_l (u_l - 1) U_l.

Thus,

    P = I + Σ_k (l_k - 1) L_k + Σ_l (u_l - 1) U_l.

The eigenvalues of P smaller than 1 are 1 + (l_k - 1) = l_k, hence the sequence of APCs
spans the union of the range spaces of {L_k : k = 1, 2, ...}. The sequence of operators
{L_1, L_2, ...} is an orthogonal decomposition of the contracting part of P. That is, if
L = Σ_{k=1}^∞ (l_k - 1) L_k, then for Φ ∈ H^p, ||L Φ||_* ≤ ||Φ||_* ≤ ||P Φ||_*.
Linear principal components are uncorrelated, and the vectors of variable loadings are
orthogonal eigenvectors of Σ; hence the dispersion matrix of the principal components,
y = Ax, is diagonal:

    var y = var(Ax) = A'ΣA = diag(λ_1, ..., λ_p),  where A'A = I.

For additive principal components the same result holds true: the additive principal components
simultaneously diagonalize the quadratic forms (Φ, P Φ)_* and ||Φ||_*², and additive
principal components belonging to two different eigenvalues are uncorrelated. Note that
we then have cov( Φ^(k)(X), Φ^(l)(X) ) = 0 for k ≠ l, in analogy to the corresponding
property of linear principal components.
The geometric structure induced by further additive principal components is analogous
to the linear case. If the operator P has two vanishing eigenvalues, the observations lie in
an additive manifold of co-dimension 2, described by the two constraints:

    Σ φ_i^(1)(X_i) = 0  and  Σ φ_i^(2)(X_i) = 0,  with  Σ_i ( φ_i^(1), φ_i^(2) ) = 0.

If there are two eigenvalues "close" to zero, then the observations lie close to the additive
manifold defined by the corresponding pair of implicit equations. The manifold described
by these two constraints is not the manifold lying closest to the data in the Euclidean
metric, unless the manifold is linear.
An obvious method for finding the k-th smallest component is to choose initial functions
for the algorithm that are orthogonal to all the previous principal components. That is,
choose Φ^[0] such that (Φ^[0], Φ^(1))_* = ... = (Φ^[0], Φ^(k-1))_* = 0.
The algorithm will then converge to the k-th largest eigenfunction of pI - P, or equivalently
the k-th smallest eigenfunction of P. Hence it finds the k-th smallest additive principal
component, associated with the eigenvalue λ_k = var Σ_i φ_i^(k).
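In the linear special case this strategy is easy to sketch (assuming Python with numpy; the matrix is hypothetical): after the smallest eigenvector is found, keeping the iterate orthogonal to it makes the same pI - Σ iteration converge to the second smallest.

    import numpy as np

    def smallest_vectors(Sigma, n_components=2, n_iter=1000, seed=0):
        # Power iteration on pI - Sigma, orthogonalizing against components found earlier.
        p = Sigma.shape[0]
        rng = np.random.default_rng(seed)
        found = []
        for _ in range(n_components):
            a = rng.standard_normal(p)
            for _ in range(n_iter):
                a = (p * np.eye(p) - Sigma) @ a
                for b in found:                  # keep orthogonal to previous solutions
                    a -= (a @ b) * b
                a /= np.linalg.norm(a)
            found.append(a)
        return np.column_stack(found)

    Sigma = np.array([[1.0, 0.6, 0.3],
                      [0.6, 1.0, 0.2],
                      [0.3, 0.2, 1.0]])
    A = smallest_vectors(Sigma)
    print(np.diag(A.T @ Sigma @ A))              # ~ the two smallest eigenvalues of Sigma
    print(np.sort(np.linalg.eigvalsh(Sigma))[:2])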
2.5 A Null Distribution for Additive Principal Components

The APCs are standardized so that Σ var φ_i = 1, hence only if var Σ φ_i < 1 does the
APC reveal a dependency between the variables. This is equivalent to restricting our
attention to eigenfunctions corresponding to eigenvalues of P smaller than one. It is
natural to ask when P has no eigenvalues less than one. The following theorem provides
a characterization of this null situation.
Theorem 2.3 The following are equivalent:
1. All the spectral values of P are greater than or equal to 1,
2. P - I is non-negative definite,
3. P = I, so the spectrum of P is the singleton {1},
4. The variables X_1, X_2, ..., X_p are pairwise independent,
5. The spaces H(X_1), H(X_2), ..., H(X_p) are orthogonal.
Proof: (1 ⇒ 2) If all the spectral values of P are at least 1, then all the spectral values
of P - I are non-negative, or equivalently P - I is non-negative.
(2 ⇒ 4) If P - I is non-negative, then for Φ^{ij} = (0, ..., 0, φ_i, 0, ..., 0, φ_j, 0, ..., 0) ∈ H^p
with ||Φ^{ij}||_*² = 1:

    (Φ^{ij}, (P - I) Φ^{ij})_* = (Φ^{ij}, P Φ^{ij})_* - ||Φ^{ij}||_*² ≥ 0
    ⇒ 0 ≤ var( φ_i + φ_j ) - 1                     by Lemma 2.1 and ||Φ^{ij}||_*² = 1
         = var φ_i + var φ_j + 2 cov( φ_i, φ_j ) - 1
         = 2 cov( φ_i, φ_j )                        since ||Φ^{ij}||_*² = var φ_i + var φ_j = 1.

Replacing φ_j by -φ_j in the above, we arrive at the conclusion that cov(φ_i, φ_j) = 0 ∀ φ_i, φ_j.
Under the assumption of compactness of P_i and P_j, it follows that X_i and X_j are independent
∀ i ≠ j.
(4 ⇒ 5) Clear, since cov(φ_i, φ_j) = (φ_i, φ_j).
(5 ⇒ 3) H(X_i) ⊥ H(X_j) ⇒ P_i|_{H(X_j)} = 0 ∀ i ≠ j. Hence P = I.
(3 ⇒ 1) Trivially. ∎
Note that orthogonality of the spaces H(X_1), H(X_2), ..., H(X_p) is equivalent to pairwise
independence only, and full independence of X_1, ..., X_p does not follow: from H(X_1) ⊥
H(X_2) and H(X_1) ⊥ H(X_3) we only have H(X_1) ⊥ H(X_2) ⊕ H(X_3), whereas independence
of X_1 from (X_2, X_3) is equivalent to H(X_1) ⊥ H(X_2, X_3), the latter denoting the space
of centered L_2 functions which depend only on X_2 and X_3.
We conclude with the following simple corollary:

Corollary 2.2 The APCs of X ~ N(0, I) all have the eigenvalue 1. Any normed sum of
transformed variables is an APC of X.
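A quick Monte Carlo check of this null situation (assuming Python with numpy; the transformations are arbitrary choices): for independent standard Gaussian variables, any centered transformations scaled so that Σ var φ_i = 1 give var Σ φ_i ≈ 1.

    import numpy as np

    rng = np.random.default_rng(4)
    n, p = 100_000, 3
    X = rng.standard_normal((n, p))

    # Arbitrary centered transformations of each (independent) variable.
    funcs = [np.sin, np.abs, lambda x: x**3]
    Phi = np.column_stack([f(X[:, i]) - f(X[:, i]).mean() for i, f in enumerate(funcs)])

    # Normalize so the component variances sum to one (the APC standardization).
    Phi /= np.sqrt(Phi.var(axis=0).sum())

    print("sum of component variances:", Phi.var(axis=0).sum())   # 1 by construction
    print("variance of the sum:       ", Phi.sum(axis=1).var())   # ~ 1 under independence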
2.6 A Linear Characterization

For the smallest additive principal component, there is a linear characterization of the
minimizing solution. Namely, the smallest additive principal component of X is exactly
the smallest linear principal component of the transformed and restandardized variables:

    Y_i = φ_i(X_i) / ||φ_i(X_i)||,   i = 1, ..., p.

The smallest eigenvector has the vector of variable loadings a = ( ||φ_1(X_1)||, ..., ||φ_p(X_p)|| ).
Recall that the linear principal component of Y achieves minimal variance among all
linear combinations of Y_1, Y_2, ..., Y_p. This minimum cannot be less than the minimum
over all additive functions, since it is itself an additive function, nor can it be greater than
the minimum, since the smallest APC is a linear combination of Y_1, Y_2, ..., Y_p. Hence the
two are identical.
For additive principal components for other eigenvalues, this duality no longer exists,
as the transformations of the variables are different for each additive principal component.
Moreover, the k-th additive principal component does not correspond to the smallest
linear principal component of its restandardized transformed variables, as it is subject to
orthogonality with respect to the previous APCs. This is true even in the case where the
smallest additive principal component is linear.
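To illustrate this characterization numerically (assuming Python with numpy; the data and candidate transformations are hypothetical): once transformations are fixed, the remaining problem is an ordinary smallest-principal-component computation on the restandardized transformed variables, and the loadings come out approximately proportional to the norms ||φ_i||.

    import numpy as np

    rng = np.random.default_rng(5)
    n = 5000
    x1 = rng.standard_normal(n)
    x2 = rng.standard_normal(n)
    x3 = -(x1**2 + np.sin(x2)) + 0.05 * rng.standard_normal(n)   # near-additive dependency

    # Candidate transformations phi_i, centered; restandardized variables Y_i.
    phis = [x1**2, np.sin(x2), x3]
    phis = [f - f.mean() for f in phis]
    norms = np.array([f.std() for f in phis])                    # ||phi_i||
    Y = np.column_stack([f / s for f, s in zip(phis, norms)])

    R = np.corrcoef(Y, rowvar=False)
    evals, evecs = np.linalg.eigh(R)
    a = evecs[:, 0]                                              # smallest linear PC of the Y_i

    print("smallest eigenvalue of R:", evals[0])                 # near zero: strong dependency
    print("loadings a (up to sign): ", a)
    print("normalized ||phi_i||:    ", norms / np.linalg.norm(norms))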
2.7 Alternating Conditional Expectation Regression and
Additive Principal Component Analysis
The stationary equations of the ACE regression model and the APC solutions have a
striking similarity, which suggests investigating more closely the differences between the
two solutions. This leads to an interpretation of the APC solutions as a possible alternative
to ACE regression, in the case where a response variable is not designated a priori.
For the purposes of comparison between the two stationary equations we will choose
a variable, say X_1, as the response variable for the ACE regression, and the remaining
variables, X_2, ..., X_p, as predictors. Let H_+^(1) denote the space of additive functions
of these predictor variables. The optimal ACE transformations φ_1*, φ_2*, ..., φ_p* satisfy, for
maximal λ* ∈ [0, 1], the two stationary equations:

    λ* φ_1* = P^{X_1} Σ_{j≠1} φ_j*,
    λ* Σ_{j≠1} φ_j* = P_{H_+^(1)} φ_1*.

For the smallest APC of X, the APC-functions satisfy, for minimal λ ∈ [0, 1], the p
equations:

    (1 - λ) φ_i = P^{X_i}( - Σ_{j≠i} φ_j ),   i = 1, ..., p.
Comparing the two solutions, it is clear that the APC and ACE solutions are identical
up to scalar multiples for p = 2, with the correspondence φ_1 = -φ_1*. For larger
p, the ACE equations are unsymmetric, since X_1 is singled out as a response variable, and
the predictor transformations are then found to give the best additive approximation to a
transformation of X_1. The APC equations, by contrast, treat the variables symmetrically.
The APC solution satisfies p restricted "regression" equations; that is, each φ_i minimizes

    || φ_i - ( - Σ_{j≠i} φ_j ) ||²   subject to   ||φ_i||² = 1 - Σ_{j≠i} ||φ_j||².

Loosely speaking, the APC-functions are the solution which is simultaneously best over
all p possible ACE regressions of one variable on all the other variables.
The linear analogy to using APC as an alternative to additive regression is using
linear principal components regression as an alternative to least squares regression. Principal
component regression is advocated when both response and predictor variables are
observed with (known) error. Then the principal component plane is an optimal fit to the
data in the sense that it minimizes the residual sum of squares over the joint distribution of
response and predictors. However, a regression interpretation of the APC solution cannot
be similarly justified, since we do not minimize over the joint distribution of the original
variables. In APC analysis, the variance of the transformed variables is minimized.
Finally, we show that the ACE regression model and the APC solution for more than
two variables coincide only when an exact additive singularity exists.

Theorem 2.4 For p ≥ 3, consider random variables X_1, ..., X_p. Suppose the ACE regression
of X_i on {X_j : j ≠ i} is

    φ_i*(X_i) ~ Σ_{j≠i} φ_j*(X_j),

and let φ_1, ..., φ_p denote the APC-functions of the smallest APC of X.
The two sets of transformations correspond for some constant c according to the rule

    φ_i(X_i) = -c φ_i*(X_i),
    φ_j(X_j) = c φ_j*(X_j)   for j ≠ i,                    (2.3)

if and only if there exist φ_1, ..., φ_p with φ_i ≠ 0 such that || Σ_j φ_j(X_j) || = 0.
Proof: (⇒) Suppose the ACE and APC solutions coincide. Without loss of generality,
take X_1 to be the response variable of the ACE regression.
For fixed φ_1, φ_2, ..., φ_p, ACE has a linear characterization in the standardized, transformed
variables

    Y_i = φ_i(X_i) / ||φ_i(X_i)||.

If 1 and a are the first canonical correlation vectors for Y_1 and (Y_2, ..., Y_p) respectively,
with canonical correlation ρ*, then

    ( ||φ_1*||, ||φ_2*||, ..., ||φ_p*|| ) = (1, ρ* a) = (1, a*), say.    (2.4)

This linear characterization follows from Theorem 5.1 in Breiman and Friedman [BF85]
and the minimization criterion defining the ACE regression.
In section 2.6 we gave a linear characterization of the smallest APC: the smallest
linear principal component direction of Y_1, ..., Y_p is ( ||φ_1||, ..., ||φ_p|| ) = l, say.
If R is the correlation matrix of Y_1, Y_2, ..., Y_p, then l is an eigenvector of R, hence
Rl = λl. Since the solutions coincide, from (2.3) the eigenvector l is, up to the scalar c,
equal to (-1, a*). It follows that:

    [ 1     r_12' ] [ -1 ]        [ -1 ]
    [ r_12  R_22  ] [ a* ]  =  λ  [ a* ].

The lower partition implies

    r_12 = R_22 a* - λ a*.    (2.5)

We will now show that equations (2.4) and (2.5) can hold simultaneously iff λ = 0,
which implies || Σ_i φ_i(X_i) || = 0, as claimed.
From the properties of the canonical correlation solution we know that a = R_22^{-1/2} α,
where α is the first singular vector of r_12' R_22^{-1/2}; that is, R_22^{-1} r_12 = ρ* a. Substituting
a* = ρ* a yields the relation

    r_12 = R_22 a*.    (2.6)

Comparison of (2.5) and (2.6) yields a contradiction, unless λ = 0.
(⇐) The converse direction of the theorem is trivial. ∎
Chapter 3
Additive Principal Component
Solutions for some Multivariate
Distributions
3.1 Introduction
For distributions with strong symmetry, it is possible to explicitly calculate the additive
principal components. From both an applied and theoretical viewpoint this exact
knowledge is very valuable.
First, it enables us to study the performance of the estimation procedure, since we can
assess the accuracy of our estimates by comparison with the known theoretical solution.
Second, the particular distributions for which the eigen solutions are tractable encompass
a limited class of null situations, the independent Gaussian and the uniform distribution on the
p-ball, for example. The APCs of these null distributions provide a standard of comparison
for assessing the significance of detected structure in real data.
The first three sections of this chapter establish conditions under which exact APC
solutions are easily characterized. Then APCs are enumerated for a number of specific
distributions. Finally, we discuss some non-trivial distributions which have APCs with
zero eigenvalues. These typically involve dependencies in the data that are not represented
by smooth transformations of the variables.
3.2 Distributions with Bivariate Symmetry
Calculation of the APCs is simplified when all bivariate marginals of the distribution are
symmetric. Symmetry, in the bivariate setting, refers specifically to the assumption that the
law of (X, Y) is the same as that of (Y, X). From this it follows that the ranges of X and
Y are the same, and that X and Y have the same marginal distributions.
Suppose X and Y are distributed according to Q_{X,Y}(dt_1, dt_2), with marginals Q_X(dt)
and Q_Y(dt) respectively. Let:

    H(X) = L_2(Q_X) = { φ(X) : E φ(X) = 0, var φ(X) < ∞ },
    H(Y) = L_2(Q_Y) = { θ(Y) : E θ(Y) = 0, var θ(Y) < ∞ }.

The conditional expectation operators

    P^X : H(Y) → H(X),   P^X( θ(Y) ) = E( θ(Y) | X ),
    P^Y : H(X) → H(Y),   P^Y( φ(X) ) = E( φ(X) | Y ),

are mappings between the two spaces. When the joint distribution of (X, Y) is symmetric,
however, X and Y have the same marginal distribution Q_X(dt) = Q_Y(dt) = Q(dt). We
can then consider P^X and P^Y as mappings of L_2(Q) onto itself, and in this sense the
conditional expectation operators are identical, P^X = P^Y = P. P can be defined as an
operator on H(X), say, according to P( g(X) ) = P^X g(Y). P thus defined is symmetric
and nonnegative definite, and all of its eigenfunctions are clearly also eigenfunctions of
the identical operator defined as a mapping of H(Y) onto itself.
When P is compact and self-adjoint, spectral theory grants the existence of a sequence
of eigenvalues which converge to zero, and of associated eigenspaces which are mutually
orthogonal, finite dimensional (for nonzero eigenvalues), and complete in the sense that
the closure of the span of the eigenspaces is the whole space.
P is self-adjoint since:

    ( φ(X), P θ(X) ) = ( φ(X), P^X θ(Y) )
                     = ( φ(X), θ(Y) )
                     = ( P^Y φ(X), θ(Y) )
                     = ( P^X φ(Y), θ(X) )
                     = ( P φ(X), θ(X) ),

where symmetry plays its part in the penultimate equality. Nonnegative definiteness is a
property of the inner product.
By definition, the eigenfunctions {φ_k(X)}_k and {φ_k(Y)}_k are both sequences of orthogonal
functions, but in addition, they are mutually orthogonal:

    ( φ_i(X), φ_j(Y) ) = ( φ_i(X), P^X φ_j(Y) )
                       = ( φ_i(X), P φ_j(X) )
                       = ( φ_i(X), λ_j φ_j(X) )
                       = 0,   i ≠ j.

A full discussion of these properties of the symmetric bivariate distribution, leading naturally
to a singular value decomposition of the distribution function, is given in Buja
[Buj85].
3.3 The Additive Principal Components of Distributions
with Bivariate Symmetry
The previous section established that symmetry of a bivariate distribution implies that
the two variables have a common sequence of eigenfunctions. To calculate the APCs of X,
we need to strengthen this condition: all bivariate distributions have to share the same
eigenfunction sequence. This implies symmetry of all pairwise bivariate distributions, but
it is considerably stronger.
Under this condition, we will show that an APC is defined by scalar multiples of a
single eigenfunction of the common sequence.
Denote the common family of eigenfunctions by L = { l_1, l_2, ... : E l_k = 0, var l_k = 1 }.
The operator P of the previous section, corresponding to the variables X_i and X_j, is
denoted P_ij. Let the eigenvalue of P_ij belonging to the k-th eigenfunction l_k be denoted
ρ_ij^(k). Thus we have:

    P_ij l_k(X_j) = ρ_ij^(k) l_k(X_i).    (3.1)

The eigenvalue ρ_ij^(k) is the correlation between l_k(X_i) and l_k(X_j):

    cor( l_k(X_i), l_k(X_j) ) = ( l_k(X_i), l_k(X_j) ) = ρ_ij^(k) ( l_k(X_i), l_k(X_i) ) = ρ_ij^(k).

There is a potential for confusion between the eigenfunctions and eigenvalues of each P_ij
and the eigenfunctions and eigenvalues of P. Since our interest is primarily in the APCs
of X, the terms "eigenfunction" and "eigenvalue" will be reserved for the eigen analysis
of P; the ρ_ij^(k) will be called correlations and the l_k, APC-basis functions.
The vector of APC-basis functions Λ_k(X) = ( l_k(X_1), ..., l_k(X_p) ) has correlation matrix

    T^(k) = [ 1          ρ_12^(k)   ...   ρ_1p^(k) ]
            [ ρ_21^(k)   1          ...   ρ_2p^(k) ]
            [ ...                          ...     ]
            [ ρ_p1^(k)   ρ_p2^(k)   ...   1        ].

Since every bivariate distribution is symmetric, ρ_ij^(k) = ρ_ji^(k); thus T^(k) is symmetric.
When every variable has the same sequence of APC-basis functions, the APCs have a
particularly simple structure, as can be seen from the following algebraic argument. We
denote elementwise multiplication by *, that is, a * Λ(X) = ( a_1 l(X_1), ..., a_p l(X_p) ), and
omit the superscript k. Also, in the following it is convenient to use P_i to denote E^{X_i},
instead of the more explicit restricted operator, allowing the domain space of the mapping to be inferred.

    [ P ( a * Λ(X) ) ]_i = P_i Σ_j a_j l(X_j)
                         = Σ_j a_j P_i l(X_j)
                         = ( Σ_j a_j ρ_ij ) l(X_i)        by (3.1)          (3.2)
                         = [ (T a) * Λ(X) ]_i.

It follows that a * Λ(X) is an eigenfunction of P if T a = λ a. This is satisfied exactly
when a is an eigenvector of the symmetric matrix T, and the corresponding eigenvalue of
a * Λ(X) will be λ.
Denote the sequence of eigenvalues of T^(k) by λ_1^(k) ≤ λ_2^(k) ≤ ... ≤ λ_p^(k), with corresponding
eigenvectors v_1^(k), v_2^(k), ..., v_p^(k).

Theorem 3.1 Suppose X has a p-variate distribution with all bivariate distributions sharing
the same set of APC-basis functions. Then the (unordered) eigenvalues for the operator
P are

    { λ_1^(1), λ_2^(1), ..., λ_p^(1), λ_1^(2), λ_2^(2), ..., λ_p^(2), ..., λ_i^(k), ... },

and the eigenfunctions belonging to these values are

    L* = { Φ_1^(1), Φ_2^(1), ..., Φ_p^(1), Φ_1^(2), Φ_2^(2), ..., Φ_p^(2), ..., Φ_i^(k), ... },

where Φ_i^(k) = v_i^(k) * Λ_k(X).
The APC with variance λ_i^(k) is

    Φ_i^(k)(X) = Σ_{j=1}^p [v_i^(k)]_j l_k(X_j).

Proof: It is sufficient to establish:
1. Φ_i^(k) is an eigenfunction belonging to the eigenvalue λ_i^(k).
2. L* is a complete orthonormal basis for H(X) = H(X_1) × H(X_2) × ... × H(X_p).

For the first,

    P Φ_i^(k) = P ( v_i^(k) * Λ_k(X) )
              = ( T^(k) v_i^(k) ) * Λ_k(X)            by (3.2)
              = λ_i^(k) v_i^(k) * Λ_k(X)              since v_i^(k) is an eigenvector of T^(k)
              = λ_i^(k) Φ_i^(k).

To show the Φ_i^(k) are orthonormal, first consider the case k ≠ k'. The APC-basis functions
are mutually orthogonal, so:

    ( Φ_i^(k), Φ_{i'}^(k') )_* = Σ_m ( [v_i^(k)]_m l_k(X_m), [v_{i'}^(k')]_m l_{k'}(X_m) )
                               = Σ_m [v_i^(k)]_m [v_{i'}^(k')]_m ( l_k(X_m), l_{k'}(X_m) )
                               = 0.
If k = k',

    ( Φ_i^(k), Φ_{i'}^(k) )_* = Σ_m [v_i^(k)]_m [v_{i'}^(k)]_m ( l_k(X_m), l_k(X_m) )
                              = Σ_m [v_i^(k)]_m [v_{i'}^(k)]_m
                              = 1 if i = i', and 0 else, since v_i^(k), v_{i'}^(k) are eigenvectors of T^(k).

L* is a complete basis for H(X) since L, the set of APC-basis functions, is a complete
orthonormal basis for H(X_i) for every i. Hence h ∈ H(X_i) can be written h(X_i) =
Σ_{k=1}^∞ a_{ki} l_k(X_i), and thus h(X) ∈ H_+ can be written h(X) = Σ_{k=1}^∞ a_k * Λ_k(X). Now the
p-vector a_k, since {v_i^(k)}_{i=1}^p span R^p, can be written a_k = Σ_{i=1}^p α_i^(k) v_i^(k). Hence,

    h(X) = Σ_{k=1}^∞ a_k * Λ_k(X)
         = Σ_{k=1}^∞ Σ_{i=1}^p α_i^(k) v_i^(k) * Λ_k(X)
         = Σ_{k=1}^∞ Σ_{i=1}^p α_i^(k) Φ_i^(k).

This is in the span of L* as required. Completeness is inherited from the completeness of
each of the subspaces in the direct product, H(X) = H(X_1) × H(X_2) × ... × H(X_p).
The construction of the APCs is an application of Corollary 2.1. ∎

Each APC is defined by a single member of L, provided all eigenvalues are distinct.
If an eigenvalue of T^(k) has multiplicity greater than 1, then although the APC-basis
function is uniquely determined, the scalings v_i^(k) of the APC-basis functions are not well
determined. If an eigenvalue of T^(k) coincides with an eigenvalue of T^(k'), then neither the
scalings nor the APC-basis functions are well-determined. In this case, there is a mixing
of APC-basis functions: any set of transformations of the form Φ = a Φ_i^(k) + (1 - a²)^{1/2} Φ_{i'}^(k'),
for a ∈ [0, 1], defines an APC.
It may seem at first sight that the symmetry condition alone will restrict consideration
to only a very small class of distributions. However, with the convention of standardizing
the distribution so the variables have equal variance, symmetry of the univariate marginals
will often be accompanied by symmetry of the bivariate distributions. Hence, the class of
elliptical distributions with common APC-basis functions is not as small as one might expect.
3.4 Polynomial Biorthogonality
This section serves only to introduce an auxiliary property of bivariate distributions, which
simplifies the calculation of the APC-basis functions for the distributions of the ensuing
sections.

Definition
A bivariate distribution has the property of polynomial biorthogonality if all eigenfunctions
with respect to projection onto marginals are polynomials.

The sets of eigenfunctions for such distributions necessarily correspond to two sets of
orthogonal polynomials with respect to the marginal distributions, since the eigen families
are complete. The following two propositions give conditions to establish that polynomials
are preserved under projection onto marginals, and a simple method of finding the related
eigenvalues.

Proposition 3.1 A bivariate distribution has the polynomial biorthogonality property iff
the conditional moment E[Y^m | X] is a polynomial of degree no greater than m in X for
all m, and the same for E[X^m | Y] as a polynomial in Y. If the distribution is symmetric,
only one of the two conditions need hold.

Proposition 3.2 If a bivariate distribution has the polynomial biorthogonality property,
the eigenvalue λ_m is the product of the leading coefficients of the polynomials given by the
conditional moments E[Y^m | X] and E[X^m | Y]. If the distribution is symmetric, λ_m is
the square of the leading coefficient of either polynomial.

Polynomial biorthogonality was first studied by Lancaster [Lan58]; proofs of the two
propositions above can be found in Buja [Buj85].
An easy sequence of steps leads to the eigenfunctions and eigenvalues. First check that
the distribution is polynomially biorthogonal by finding the conditional expectations and
computing conditional moments. Then, from the marginals, find the family of orthogonal
polynomials. Finally, read off the eigenvalues from the moments of the conditional
expectation.
3.5 Additive Principal Components of the Gaussian Distribution
Assume $X \sim N_p(0, R)$, with $R$ a correlation matrix. The condition of Theorem 3.1
requires common APC-basis functions (eigenfunctions) for all bivariate pairs, so we first
focus on the bivariate distributions.

As all bivariate distributions are Gaussian, symmetry is trivial. The conditional distribution of $X_i$ given $X_j$ is also Gaussian, and the conditional moments are:

$$E\left(X_i^m \mid X_j\right) = \sum_{r=0}^{\lfloor m/2 \rfloor} \binom{m}{2r} \frac{(2r)!}{2^r r!}\; \rho_{ij}^{\,m-2r} \left(1-\rho_{ij}^2\right)^r X_j^{\,m-2r}.$$
Bivariate Gaussian distributions are thus polynomially biorthogonal, and the system of
orthogonal polynomials generated by the Gaussian marginal are the Hermite polynomials.
Applying Proposition 3.2, the correlation between the $k$th degree Hermite polynomial pair
$(P_k(X_i), P_k(X_j))$ is $\rho_{ij}^{\,k}$, where as usual $P_k$ is centered and standardized.

Returning now to the APC construction, since each bivariate distribution is Gaussian,
with a standard Gaussian marginal, every pair has as APC-basis functions the family of
Hermite polynomials, independent of their correlation $\rho_{ij}$. Theorem 3.1 is applicable, with
the Hermite polynomials providing the common set of APC-basis functions.
The APC-functions are multiples of a Hermite polynomial. The correlation matrix
associated with the $k$th degree Hermite polynomial is the $k$th Schur product of $R$:

$$T^{(k)} \stackrel{\mathrm{def}}{=} \mathrm{var}\,(\Pi^{(k)}) = \mathrm{var}\,(P_k(X_1), P_k(X_2), \ldots, P_k(X_p)) = R^{k\bullet},$$

where $[R^{k\bullet}]_{ij} = \rho_{ij}^{\,k}$.

Thus for each $k$, the $p$ eigenvalues $(\lambda_1^{(k)} \le \lambda_2^{(k)} \le \cdots \le \lambda_p^{(k)})$ of $R^{k\bullet}$ are all eigenvalues of $P$.

The APCs are the sequence $\{\phi_i^{(k)} : i = 1, 2, \ldots, p,\ k = 1, 2, \ldots\}$, where

$$\phi_i^{(k)} = (v_i^{(k)})^t\, \Pi^{(k)}(X)$$

and $v_i^{(k)}$ is an eigenvector for the eigenvalue $\lambda_i^{(k)}$ of $R^{k\bullet}$.
Since each $R^{k\bullet}$ is a correlation matrix, we can obtain partial orderings of the eigenvalues
of $P$ for the Gaussian distribution.

Proposition 3.3 For the eigenvalues $\{\lambda_i^{(k)},\ i = 1, 2, \ldots, p\}_{k=1}^{\infty}$ of the sequence of Schur
products $R^{k\bullet}$ of a correlation matrix $R$:

1. $\lambda_1^{(1)} \le \lambda_1^{(2)} \le \cdots \le \lambda_1^{(k)} \le \cdots \le 1$
2. $\lambda_p^{(1)} \ge \lambda_p^{(2)} \ge \cdots \ge \lambda_p^{(k)} \ge \cdots \ge 1$
3. $\lambda_1^{(1)} \le \lambda_i^{(j)}\ \forall\, i, j$
4. $\lambda_p^{(1)} \ge \lambda_i^{(j)}\ \forall\, i, j$
This is a consequence of the following result of majorization theory (Bapat and Sunder
[BS83]). Let $A \cdot B$ denote the Schur product of $A$ and $B$, $[A \cdot B]_{ij} = a_{ij} b_{ij}$, $\lambda(A)$ the
ordered sequence of eigenvalues of $A$, and $\prec$ the majorization relation: $u \prec v$ if
$\sum_{i=1}^{j} u_i \le \sum_{i=1}^{j} v_i\ \forall\, j$.

Lemma 3.1 Let $A$ and $B$ be $p \times p$ matrices, with $A$ self-adjoint and $B$ a correlation
matrix. Then

$$\lambda(A \cdot B) \prec \lambda(A).$$
An elegant proof of the above is found in the given reference.
Proof of Proposition 3.3 :
Repeated application of Lemma 3.1 to $R$ yields the relationship $\lambda(R^{k\bullet}) \prec \lambda(R^{(k-1)\bullet})$,
establishing (1) and (2). Statements (3) and (4) follow as obvious consequences, since by
definition $\lambda_1^{(k)} \le \lambda_2^{(k)} \le \cdots \le \lambda_p^{(k)}$. $\blacksquare$
Proposition 3.3 has the following consequences for the Gaussian distribution:
1. The smallest additive principal component is the smallest linear principal component
of X.
2. The largest linear principal component achieves largest variance among all additive
functions.
3. The second smallest additive principal component is either the smallest component of
the quadratically transformed variables, $\phi_1^{(2)}$, or the second smallest linear principal
component, $\phi_2^{(1)}$.

4. In general, the $j$th smallest APC is among the eigenfunctions belonging to the "upper
triangular" subset of eigenvalues $\{\lambda_i^{(k)} : i + k \le j + 1\}$.
The first two points above simply verify the well known result that minimal (maximal)
correlation over all marginal transformations for the multivariate Gaussian is achieved by
the smallest (largest) linear principal component of X (see Lancaster [Lan58]).
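To make the Gaussian construction concrete, the following sketch (not part of the thesis; the function names are my own, and the example matrix is the correlation matrix used later in the GAU-S1 simulation of Chapter 5) computes the eigenvalues of the Schur powers $R^{k\bullet}$ with numpy. By Proposition 3.3, collecting these eigenvalues across degrees identifies the candidates for the smallest APCs.

```python
import numpy as np

def schur_power(R, k):
    """Elementwise (Schur) k-th power of a correlation matrix R."""
    return R ** k

def gaussian_apc_eigenvalues(R, max_degree=3):
    """Eigenvalues of R^{k.} for k = 1..max_degree; under Proposition 3.3 these
    are the APC variances attached to the degree-k Hermite transforms."""
    return {k: np.sort(np.linalg.eigvalsh(schur_power(R, k)))
            for k in range(1, max_degree + 1)}

# Illustrative correlation matrix (the GAU-S1 matrix of Chapter 5).
R = np.array([[ 1.0,  0.6,  0.4, -0.7],
              [ 0.6,  1.0,  0.5, -0.3],
              [ 0.4,  0.5,  1.0, -0.8],
              [-0.7, -0.3, -0.8,  1.0]])

for k, vals in gaussian_apc_eigenvalues(R).items():
    print(f"degree {k}: two smallest eigenvalues {vals[:2]}")
```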
3.6 Additive Principal Components of the
Gegenbauer Distribution
In this section we compute the APCs of variables distributed according to a class of
symmetric multivariate distributions, which include the Uniform distribution on the unit
$p$-ball, $B_p$. This is then used to find the APCs for the transformed distribution on an
ellipsoid.
The distribution we consider is a multivariate generalization of the symmetric beta
distribution, centered at zero and rescaled onto Bp • The p-variate Gegenbauer distribution
with parameter a on Bp has density :
$$\gamma_p(a) \ \stackrel{\mathrm{def}}{=}\ q(x_1, x_2, \ldots, x_p;\, a) = \frac{\Gamma(\frac{p}{2} + a)}{\pi^{p/2}\,\Gamma(a)} \left(1 - x_1^2 - x_2^2 - \cdots - x_p^2\right)^{a-1}.$$

The distribution can be derived from a transformation of the Dirichlet distribution:
if $Y_1, \ldots, Y_p$ are distributed Dirichlet$(a, \tfrac12, \ldots, \tfrac12)$, then $\sqrt{Y_1}, \ldots, \sqrt{Y_p}$ have the above
density.
In particular, note several special cases of this density:
1. For $a = 1$ we have the Uniform on the unit $p$-ball.

2. While the Uniform on the $(p-1)$-sphere does not have a density, it can be obtained
as a limit as $a \to 0$.
3. The independent $p$-variate Gaussian is included in this family by considering a suitably rescaled version as $a \to \infty$. Explicitly, for $X \sim \gamma_p(a)$, the distribution of $\sqrt{2a}\,X$ converges to the standard $p$-variate Gaussian as $a \to \infty$.
Comparing the three cases enumerated above, the Uniform distribution can be thought
of as intermediate between two extremes, the degenerate distribution on the surface of the
sphere on the one hand, and the limiting Gaussian distribution in the "center" of the
sphere on the other.
It is a simple matter to establish the following relationship between marginals of $\gamma_p(a)$: integrating out one coordinate of $\gamma_p(a)$ leaves the $(p-1)$-dimensional marginal $\gamma_{p-1}(a + \tfrac12)$.

Proof:

$$\gamma_{p-1}(a + \tfrac12) = \frac{\Gamma(\frac{p}{2} + a)}{\pi^{(p-1)/2}\,\Gamma(a + \frac12)} \left(1 - x_1^2 - x_2^2 - \cdots - x_{p-1}^2\right)^{a + \frac12 - 1},$$

where $\sum_{i=1}^{p-1} x_i^2 \le 1$. $\blacksquare$

It follows by induction that the bivariate distribution of $(X_i, X_j)$ is $\gamma_2(a + \frac{p}{2} - 1)$ and
the univariate marginal distribution of each $X_i$ is $\gamma_1(a + \frac{p-1}{2})$. The conditional distribution
of $X_i$ given $X_j$ is $\sqrt{1 - X_j^2}\;\gamma_1(a + \frac{p}{2} - 1)$.¹

¹ Note that the $\gamma_1(\alpha)$ density is simply the usual symmetric beta density $\beta(\alpha, \alpha)$ on $[0,1]$ linearly rescaled onto the interval $[-1, 1]$. It is also the density of the square root of a Beta$(\frac12, \alpha)$ variable, given a random sign.

The moments of $\gamma_1(\alpha)$ are easily derived: since the density is
symmetric about zero the odd moments vanish, and the even moments of order $2m$ are:

$$E\,X^{2m} = \frac{B(m + \frac12, \alpha)}{B(\frac12, \alpha)}.$$
The polynomial biorthogonality property holds, since

$$E\left(X_i^{2m} \mid X_j\right) = (1 - X_j^2)^m\, E\left(\gamma_1(a + \tfrac{p}{2} - 1)^{2m}\right) = (1 - X_j^2)^m\, \frac{B(m + \frac12,\; a + \frac{p}{2} - 1)}{B(\frac12,\; a + \frac{p}{2} - 1)}.$$

The system of orthogonal polynomials generated by $\gamma_1(a + \frac{p-1}{2})$ are the ultraspherical
or Gegenbauer polynomials of order $a + \frac{p}{2} - 1$, $g_k(\cdot\,; a, p)$ (hence the distribution name).
As usual, the $g_k(\cdot\,; a, p)$ are centered and standardized.
The coefficients of the leading polynomial terms in the conditional moment are independent of both $i$ and $j$, implying that for every variable pair, the correlations between
$(g_k(X_i), g_k(X_j))$ are 0 for $k$ odd, and for $k = 2m$,

$$\lambda_{ij}^{(2m)}(a, p) = \lambda^{(2m)}(a, p) = (-1)^m\, \frac{B(m + \frac12,\; a + \frac{p}{2} - 1)}{B(\frac12,\; a + \frac{p}{2} - 1)}.$$
We now apply these properties of the bivariate distribution to solve for the APCs.
All the bivariate distributions are identical, and the sequence of Gegenbauer polynomials provides the common set of APC-basis functions. The correlation matrix for
$G^{(2m)} = (g_{2m}(X_1), \ldots, g_{2m}(X_p))$ is

$$T^{(2m)} = \begin{pmatrix} 1 & \lambda^{(2m)} & \cdots & \lambda^{(2m)} \\ \lambda^{(2m)} & 1 & & \vdots \\ \vdots & & \ddots & \lambda^{(2m)} \\ \lambda^{(2m)} & \cdots & \lambda^{(2m)} & 1 \end{pmatrix},$$

where dependence on $a$ and $p$ has been suppressed.

This matrix has only two distinct eigenvalues: $1 + |\lambda^{(2m)}|$, which exceeds 1, and
$1 - (p-1)^{-1}|\lambda^{(2m)}|$, which corresponds to the $(p-1)$-dimensional eigenspace spanned by vectors
of contrasts, $\{c : \sum c_i = 0\}$. The results for the Gegenbauer distribution on the unit ball
are summarized below.
Proposition 3.4 Suppose $X_1, \ldots, X_p \sim \gamma_p(a)$ on $B_p$.

1. The APCs of $X$ with variance less than 1 are contrasts of even degree Gegenbauer
polynomials,
$$\phi^{(2m)}(X) = \sum_i c_i\, g_{2m}(X_i; a, p), \qquad \text{with } \sum_i c_i = 0.$$

2. The eigenvalue of the APCs formed from polynomials of degree $2m$ is
$1 - (p-1)^{-1}|\lambda^{(2m)}(a)|$, with multiplicity $p - 1$.
The $(p-1)$-dimensional eigenspace of each APC eigenvalue is spanned by the $(p-1)$-dimensional space of contrasts, so the APCs are not unique. The APC eigenvalues are
increasing as a function of $2m$ for fixed $a$, and increasing in $a$ for fixed $m$. Hence the first
$p-1$ smallest additive principal components are any $p-1$ linearly independent contrasts
of 2nd degree Gegenbauer polynomials, corresponding to the eigenvalue $1 - (p-1)^{-1}|\lambda^{(2)}(a)|$. The
second smallest eigenvalue, $1 - (p-1)^{-1}|\lambda^{(4)}(a)|$, defines the space spanned by contrasts of the 4th
degree Gegenbauer polynomials.
For a numerical example, we calculate the three smallest eigenvalues for $\gamma_p(1)$, the
Uniform distribution on the unit $p$-ball.

degree     2        4        6
p = 3     7/8     15/16   123/128
p = 4    14/15    34/35    62/63

Even for small $m$ and $p$ the eigenvalues are very close to 1. This result reflects the weak
dependencies of the Uniform distribution on the unit ball.
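As a check on these values, the short sketch below (mine, not the thesis code) evaluates $\lambda^{(2m)}(a,p)$ with scipy's Beta function and the contrast eigenvalue of Proposition 3.4 for the Uniform case $a = 1$.

```python
from scipy.special import beta

def gegenbauer_corr(m, a, p):
    """|lambda^(2m)(a, p)|: magnitude of the correlation between the degree-2m
    Gegenbauer transforms of a pair of coordinates of gamma_p(a)."""
    return beta(m + 0.5, a + p / 2 - 1) / beta(0.5, a + p / 2 - 1)

def contrast_eigenvalue(m, a, p):
    """Eigenvalue 1 - (p-1)^{-1} |lambda^(2m)(a, p)| from Proposition 3.4."""
    return 1 - gegenbauer_corr(m, a, p) / (p - 1)

for p in (3, 4):
    print(p, [round(contrast_eigenvalue(m, 1, p), 4) for m in (1, 2, 3)])
```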
We now derive the APCs of the more interesting class of Gegenbauer distributions on
a p-dimensional ellipse. This distribution will later be used for a simulation example, so
the solutions are fully enumerated.
The random variable $Y$ has a Gegenbauer$(a; p)$ distribution on a $p$-dimensional ellipse if
$Y = R^{1/2} X$, where $X$ is Gegenbauer$(a; p)$ on the unit ball and $R$ is any correlation matrix.
This is equivalent to choosing $p$ arbitrary directions in the unit sphere and generating
$Y_1, \ldots, Y_p$ as the projections of $X$ onto those directions. Each of the marginal distributions
is identical to the marginals of the spherical case, as a trivial consequence of the sphericity;
however, the bivariate pairs are now correlated.
We shall briefly digress here to discuss the bivariate APC-basis functions, or eigenfunctions, of the correlated case.

Every pair of variates $(Y_i, Y_j)$ can be written as a linear combination (3.3) of a pair
$(X_1, X_2)$ having a bivariate Gegenbauer distribution, $\gamma_2(a + \frac{p}{2} - 1)$, on the unit
disk, where $\rho$ is the correlation between $Y_i$ and $Y_j$. If $(X_1, X_2)$ are polynomially biorthogonal, then the proposition below establishes the eigenfunctions and eigenvalues for the
elliptically transformed variables $(Y_i, Y_j)$.

Proposition 3.5 If $(Y_1, Y_2)$ are generated from the polynomially biorthogonal variables
$(X_1, X_2)$ according to the transformation (3.3), then the eigenfunctions of $(Y_1, Y_2)$ are
exactly those of $(X_1, X_2)$, irrespective of the correlation $\rho$. The eigenvalues are functions of
$\rho$ and of $\lambda_{2m}(0)$, the eigenvalues of the circular case, $\rho = 0$.

A proof of Proposition 3.5 can be found in Buja [Buj85].
The eigenvalues are now functions of the correlation $\rho$, and are no longer monotonic in
$m$. For large $\rho$ the linear polynomial has larger correlation than the quadratic; for small
$\rho$, the quadratic polynomial dominates the linear. Hence, if there is a strong correlation
between any of the variables, the smallest APC-basis function will be linear; however, if
all the correlations are weak, the smallest APC-basis function will be quadratic.
We can now calculate the APCs for the Gegenbauer on an ellipsoid. Proposition 3.5
implies the common set of APC-basis functions consists of the Gegenbauer polynomials
of order $a + p/2 - 1$. Theorem 3.1 can be applied, with the matrix of correlations for the
transformations $G^{(m)} = (g_m(X_1), \ldots, g_m(X_p))$:

$$T^{(m)} = \begin{pmatrix} 1 & \lambda^{(m)}(\rho_{12}) & \cdots & \lambda^{(m)}(\rho_{1p}) \\ \lambda^{(m)}(\rho_{12}) & 1 & & \vdots \\ \vdots & & \ddots & \\ \lambda^{(m)}(\rho_{1p}) & \cdots & & 1 \end{pmatrix}.$$
For example, for the simulation of Chapter 5.4, we will take $a = 1$, hence we are considering the Uniform on the ellipsoid generated by the symmetric transformation matrix:

$$R = \begin{pmatrix} 1.0 & 0.55 & 0.33 \\ & 1.0 & 0.30 \\ & & 1.0 \end{pmatrix}.$$
The correlation matrix of the linear polynomials is $R$ itself, which has two eigenvalues
less than 1, of 0.4488 and 0.7526. The correlation matrices of the quadratic and cubic
polynomials are:

$$T^{(2)} = \begin{pmatrix} 1.0 & 0.128 & -0.114 \\ & 1.0 & -0.121 \\ & & 1.0 \end{pmatrix} \qquad\text{and}\qquad T^{(3)} = \begin{pmatrix} 1.0 & -0.138 & -0.185 \\ & 1.0 & -0.178 \\ & & 1.0 \end{pmatrix}.$$

The smallest eigenvalues of these symmetric matrices are 0.8596 and 0.6750, respectively.
Thus the smallest APC-basis functions are linear, the second smallest are cubic, and the
third are again linear.
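This ordering can be verified numerically from the matrices as reconstructed above; the sketch below (not from the thesis) simply compares the smallest eigenvalues of $R$, $T^{(2)}$ and $T^{(3)}$ with numpy.

```python
import numpy as np

R  = np.array([[1.0, 0.55, 0.33], [0.55, 1.0, 0.30], [0.33, 0.30, 1.0]])
T2 = np.array([[1.0, 0.128, -0.114], [0.128, 1.0, -0.121], [-0.114, -0.121, 1.0]])
T3 = np.array([[1.0, -0.138, -0.185], [-0.138, 1.0, -0.178], [-0.185, -0.178, 1.0]])

# The degree whose correlation matrix has the smaller minimum eigenvalue
# supplies the next smallest APC-basis function.
for name, M in [("linear", R), ("quadratic", T2), ("cubic", T3)]:
    print(name, round(np.linalg.eigvalsh(M).min(), 4))
```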
3.7 Zero Variance Additive Principal Components for
Clustered and Categorical Data
Consider the situation where two variables $X_1$ and $X_2$ of a $p$-variate set divide into two
natural clusters in diagonally opposite quadrants, as described by Buja and Kass [BK85].
More exactly, suppose there exist cut points $a, b$ such that $P(X_1 < a, X_2 < b)$ and $P(X_1 \ge a, X_2 \ge b)$ are both non-zero and sum to 1. Then there is an exact singularity in the data,
that is, an APC with an eigenvalue of 0. Defining $\phi_1$ to map the two sets $\{X_1 < a\}$ and
$\{X_1 \ge a\}$ onto different constants $k_1$ and $k_2$, and $\phi_2$ to map the corresponding sets $\{X_2 < b\}$
and $\{X_2 \ge b\}$ onto $-k_1$, $-k_2$, then with $\phi_3 = \cdots = \phi_p = 0$, $\mathrm{var}(\phi_1(X_1) + \phi_2(X_2)) = 0$.
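The construction is easy to reproduce on simulated data; the sketch below (my illustration, with artificial cluster data and hypothetical cut points) builds the two step functions and verifies that their sum has variance exactly zero.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200
cluster = rng.integers(0, 2, n)                    # which of the two quadrants
x1 = np.where(cluster == 0, rng.uniform(-2, -1, n), rng.uniform(1, 2, n))
x2 = np.where(cluster == 0, rng.uniform(-2, -1, n), rng.uniform(1, 2, n))

a, b, k1, k2 = 0.0, 0.0, 1.0, -1.0                 # cut points and scores
phi1 = np.where(x1 < a, k1, k2)                    # phi_1: {X1 < a} -> k1, {X1 >= a} -> k2
phi2 = np.where(x2 < b, -k1, -k2)                  # phi_2: {X2 < b} -> -k1, {X2 >= b} -> -k2

print(np.var(phi1 + phi2))                         # 0.0: an exact (discrete) APC
```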
With a pair of categorical variables a similar phenomenon can occur. Suppose the
categories $c_{i1}$ and $c_{j1}$ of variables $X_i$ and $X_j$ always occur together. Then $P(X_i \in c_{i1}, X_j \in c_{j1})$
and $P(X_i \notin c_{i1}, X_j \notin c_{j1})$ are both nonzero and sum to 1. As there are no ordering
restrictions on categorical variables, we can assign scores $k_1$, $k_2$ to $X_i$ according to whether
an observation is in $c_{i1}$ or not. Similarly, the scores $-k_1$, $-k_2$ can be assigned to $X_j$. The
resulting transformation of the variables is a zero variance APC, exactly as above.
In the same spirit, APCs with exact singularities can exist between continuous and
categorical variables. Suppose there is a group of categories $c_{11}$ of the categorical variable $X_1$
whose values on the continuous variable $X_2$ are distinct from the remaining
categories. Then there exist cut points $a_1, a_2$ such that $P(X_1 \in c_{11},\ a_1 < X_2 \le a_2)$ and
$P(X_1 \notin c_{11},\ X_2 \le a_1 \text{ or } X_2 > a_2)$ are nonzero and sum to 1. Again, defining functions
mapping the disjoint sets onto different constants results in a zero variance APC.
Any of these two variable dependencies can exist in higher dimensional generalizations.
APCs that are formed from step functions are referred to as discrete APCs, since the
transformed variables are discrete valued.
Chapter 4
Estimation of Additive Principal
Components
4.1 Introduction
The algorithm of Chapter 2 for finding the APC of X can be implemented as an estimation procedure in the finite sample setting, simply by using a data smooth to estimate
the conditional expectations. The resulting algorithm was implemented on a Symbolics
Lisp 3610, a computing environment well suited to developmental programming. Some
comments on the use of this machine are found in Appendix A.
We have discussed properties of the APC for the population case in the preceding
two chapters. Now, using simulation, we will look at the behavior of the algorithm as an
estimation technique.
We will not discuss asymptotic convergence and consistency properties of the finite
sample algorithm. These are dependent on properties of the data smooth used in the
implementation. Results for data smooths are fragmentary, and strong results are only
available for a restricted class of data smooths. A selection of relevant results is found in
Breiman and Friedman [BF85, Appendix].
In practice, convergence can be a delicate matter. Through using the APC algorithm
on a wide range of data sets, we have examined some factors affecting speed and accuracy
43
of the convergence to an optimal solution. Several refinements and improvements of the
basic algorithm based on our experience have been developed.
4.2 Algorithm Implementation Details
4.2.1 Data Smooth
The smoother used in our implementation is the variable span "supersmooth" developed
by Friedman and Stuetzle [FS81]. A full description of the procedure is found in the
given reference. An attractive facet of the supersmooth is that it encompasses variable
span, fixed span running linear, linear, monotonic, cyclic and categorical estimators of
conditional expectation.
4.2.2 Convergence Criterion
Convergence of the algorithm is assessed using convergence of the eigenvalue:
$$\hat\lambda = \frac{\mathrm{var}\left(\sum_i \phi_i(X_i)\right)}{\sum_i \mathrm{var}\,\phi_i(X_i)}.$$
Iteration of the outer loop continues until the eigenvalue estimate ceases to decrease.
Typically, we used a criterion for convergence of a change less than 0.005 in the eigenvalue
estimate over the last three iterations. For most applications, this seemed adequate,
although if the eigenvalue is very small (less than 0.01) further iteration may be required.
We used a straightforward estimator of the eigenvalue, simply calculating the squared
standard deviation of the vector estimates of $\sum_i \phi_i$ and of each $\phi_i$.
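A minimal sketch of this convergence check (my own restatement; the thesis implementation was written in Lisp): the eigenvalue estimate is the variance of the summed transforms divided by the sum of their variances, and the outer loop stops once the change over the last three iterations falls below the tolerance.

```python
import numpy as np

def eigenvalue_estimate(phi):
    """phi: (n, p) array whose columns are the current estimates phi_i(X_i)."""
    return np.var(phi.sum(axis=1)) / np.var(phi, axis=0).sum()

def has_converged(history, tol=0.005, window=3):
    """history: eigenvalue estimates from successive outer iterations."""
    return len(history) > window and abs(history[-1] - history[-1 - window]) < tol
```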
4.2.3 Initial Estimates
Occasionally, we find that the algorithm estimates an APC with smaller variance than
a previous APC. Suppose, for instance, that the eigenvalue of the third estimated APC
is smaller than the second. Then the algorithm has not located a global minimum for
the second APC. If we want to locate a correct solution for the second APC, a natural
procedure to consider is re-estimating the second and third APC, using the estimated
third APC as an initial estimate for the second APC. Sometimes this simply results in a
reversal of the two original estimates, indicating the algorithm initially converged to a local
minimum at the third APC.
This ability to become stuck in local minima has the unfortunate consequence that the
algorithm is sensitive to initial values. In addition, since the algorithm utilizes a power
iteration technique, good starting guesses will speed the convergence considerably.
Both of these factors have made it worthwhile to find a method for calculating good initial values. The basis of our starting estimates is Gnanadesikan's proposal for introducing
nonlinearity into ordinary principal components analysis [Gna77].

First the variable matrix is augmented by adding second and third degree orthonormal
polynomials in each variable,

$$X_{\mathrm{aug}} = (X_1, P_2(X_1), P_3(X_1), \ldots, X_p, P_2(X_p), P_3(X_p)), \tag{4.1}$$

where $P_k$ is a $k$th degree polynomial, $P_k \perp P_1$, $E\,P_k = 0$, $\mathrm{var}\,P_k = 1$. Then the smallest
linear principal component of this augmented matrix is formed. Finally, estimators for each
variable are constructed from the summed contribution of each variable:

$$\hat\phi_i(X_i) = a_{1i} X_i + a_{2i} P_2(X_i) + a_{3i} P_3(X_i), \qquad i = 1, \ldots, p. \tag{4.2}$$
Proposition 4.1 The estimator (4.2) is the smallest APC of $X$ when the APC-functions
are restricted to the class of third degree polynomials.
Proof: The smallest principal component of the augmented matrix (4.1) minimizes, for
$a \in \mathbb{R}^{3p}$,

$$\mathrm{var}\left(\sum_{i=1}^{p}\sum_{k=1}^{3} a_{ki} P_k(X_i)\right) \quad \text{subject to} \quad \sum_{i=1}^{p}\sum_{k=1}^{3} a_{ki}^2 = 1.$$

The APC solution minimizes over $\phi$:

$$\mathrm{var}\left(\sum_{i=1}^{p} \phi_i(X_i)\right) \quad \text{subject to} \quad \sum_{i=1}^{p} \mathrm{var}\,(\phi_i(X_i)) = 1. \tag{4.3}$$

The $\phi_i$ are restricted to be centered third degree polynomials, hence can be written
$\phi_i = \sum_{k=1}^{3} b_{ki} P_k$, since $P_1, P_2, P_3$ span the space of cubic polynomials. The minimization
criterion defining the APC then reduces to optimization of

$$\mathrm{var}\left(\sum_{i=1}^{p}\sum_{k=1}^{3} b_{ki} P_k(X_i)\right)$$

over $b = (b_{11}, b_{21}, b_{31}, \ldots, b_{1p}, b_{2p}, b_{3p})$ subject to:

$$\sum_{i=1}^{p} \mathrm{var}\left(\sum_{k=1}^{3} b_{ki} P_k(X_i)\right) = \sum_{i=1}^{p}\sum_{k=1}^{3} b_{ki}^2\, \mathrm{var}\,(P_k(X_i)) = \sum_{i=1}^{p}\sum_{k=1}^{3} b_{ki}^2 = 1.$$

This is exactly the criterion (4.3), hence the two solutions are identical. $\blacksquare$
The sample estimate for the smallest APC is simply the smallest sample linear principal
component of the augmented matrix of data vectors. Initial estimates for the $k$th APC
use the $k$th sample linear principal component of the augmented matrix, orthogonalized
with respect to the final estimate found for each of the previous $k-1$ APCs.
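The following sketch (function names are mine; it covers continuous variables only) carries out the construction just described: each variable is expanded into orthonormal linear, quadratic and cubic columns, the smallest linear principal component of the augmented matrix is extracted, and the per-variable contributions are summed into starting functions.

```python
import numpy as np

def orthonormal_polys(x, degree=3):
    """Columns x, P2(x), P3(x): centered, Gram-Schmidt orthogonalized against the
    lower-degree columns, and scaled to unit variance."""
    cols = []
    for d in range(1, degree + 1):
        v = x ** d - np.mean(x ** d)
        for c in cols:
            v = v - (v @ c) / (c @ c) * c
        cols.append(v / v.std())
    return np.column_stack(cols)

def apc_starting_values(X):
    """X: (n, p) data matrix.  Returns per-variable starting functions: the
    variable-wise contributions to the smallest PC of the augmented matrix (4.1)."""
    n, p = X.shape
    blocks = [orthonormal_polys(X[:, i]) for i in range(p)]
    aug = np.column_stack(blocks)
    w = np.linalg.eigh(np.cov(aug, rowvar=False))[1][:, 0]   # smallest PC loadings
    return [blocks[i] @ w[3 * i: 3 * i + 3] for i in range(p)]
```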
Borrowing from the principles of multiple correspondence analysis, a simple extension
of the above procedure yields starting values for categorical variables. If necessary, the
categorical variables are collapsed into a small number, say 4, of similar classes. Three
independent vectors are formed using indicator vectors for the first three classes. These
are centered and normed, then added to the augmented matrix (4.1). A linear principal
component analysis on this matrix yields starting estimates in the same fashion as before,
which assign scores to each of the similar categories.
4.2.4 Spectrum Shift
The outer loop of the APC algorithm of Chapter 2.3.5 applies the operator $pI - P$, where
$p$ is the number of variables in the analysis. The spectrum shift $p$ was chosen in order to
ensure convergence to the smallest eigen solution; however, applying the operator $\alpha I - P$
will still converge to the correct solution if

$$\alpha > \tfrac12\left(\lambda_{\max} + \lambda_{\min}\right). \tag{4.4}$$

Since the eigenvalues are bounded sharply by 0 and $p$, $p$ is the smallest shift that will
always work. However, clearly if $\alpha = \lambda_{\max}$, the inequality is satisfied. It is easily shown
that the rate of convergence of the power method is controlled by the ratio

$$\frac{|\alpha - \lambda^{(2)}|}{|\alpha - \lambda^{(1)}|}, \qquad \text{where $\lambda^{(1)}$ and $\lambda^{(2)}$ are the smallest and second smallest eigenvalues of $P$.} \tag{4.5}$$
For most cases, the value $p$ is a very conservative estimate of the largest eigenvalue;
this upper bound is only achieved in the extreme case of all variables being identical.
Hence the ratio (4.5) can be close to one, particularly when the number of variables is
large. Convergence to an optimal solution is then slow, hence it is difficult to assess when
the stationary solution has been reached and the solutions may be dependent on initial
values. The behavior of the algorithm is greatly improved if a good estimate of the largest
eigenvalue is available.
In our implementation, the initial estimates described in the preceding subsection
provide a good approximation to the largest eigenvalue. Explicitly, we use the largest
eigenvalue of the augmented matrix (4.1). This will tend to underestimate the true
maximal eigenvalue; however, for the condition (4.4) to be satisfied, it is sufficient that
$\hat\lambda_{\max} > (\lambda_{\max} + \lambda_{\min})/2$. This will almost always hold, so in practice, convergence to the
smallest eigen solution will still occur.
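A small numerical illustration (mine, with a hypothetical spectrum) of why the shift matters: the power-method ratio (4.5) is closer to one, and convergence correspondingly slower, for the conservative shift $\alpha = p$ than for a shift near the largest eigenvalue.

```python
def convergence_ratio(alpha, eigenvalues):
    """Ratio |alpha - lambda_(2)| / |alpha - lambda_(1)| of (4.5); values near 1
    mean slow convergence of the shifted power method."""
    lam = sorted(eigenvalues)
    return abs(alpha - lam[1]) / abs(alpha - lam[0])

spectrum = [0.02, 0.55, 0.77, 1.4, 2.1, 3.9]        # hypothetical eigenvalues of P, p = 6
print(convergence_ratio(6.0, spectrum))             # conservative shift alpha = p
print(convergence_ratio(max(spectrum), spectrum))   # shift at the (estimated) largest eigenvalue
```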
4.2.5 Maintaining Orthogonality
The algorithm proposed for the calculation of higher principal components in the population case simply orthogonalizes the initial estimates with respect to all previous components, thus in theory restricting the solution to the orthogonal space. In the
implementation of the algorithm for estimating the $k$th additive principal component, it
will not be sufficient to orthogonalize only the initial estimates. In the finite sample version, conditional expectations are estimated by data smooths. Typically the data smooth
is not a projection operator (it may not even satisfy linearity or symmetry), in which
case exact orthogonality is not preserved in the inner loop. Even if the smooth is a projection, rounding error will still reintroduce components in the orthogonal space. To ensure
convergence to an orthogonal solution, the new estimate needs to be re-orthogonalized
in every pass through the outer loop.
4.3 Algorithm Improvement : A Linear Principal Component Step
The basic algorithm entails a straightforward repetition of the iteration cycle over all
variables, using exactly the function estimates of the previous iteration cycle. We describe
an improvement which optimizes linearly over the fixed function estimates after each cycle
of iterations.

In this section, the estimation of the loadings of the variable functions is considered
separately from the transformations themselves, so for the sake of clarity we reparameterize
the APC, separating out the scale factor. Let $a_i = \|\phi_i\|$, and define

$$\psi_i = \frac{1}{a_i}\,\phi_i.$$

Under the constraints $\sum_i a_i^2 = 1$ and $\mathrm{var}\,\psi_i = 1$, the representation $\Phi(X) = \sum_i a_i \psi_i(X_i)$
is equivalent to the usual representation $\Phi(X) = \sum_i \phi_i(X_i)$ with $\sum_i \mathrm{var}\,\phi_i(X_i) = 1$. We
will use this alternative parameterization in the remainder of this section.
At the conclusion of each full iteration cycle, we have the $p$ new variable function
estimates $(a_1\psi_1, \ldots, a_p\psi_p)$. The idea is simply to compute new scalings $(a_1^*, \ldots, a_p^*)$ which
minimize $\mathrm{var}\sum_i a_i\psi_i$ over $a$, with $a'a = 1$, for fixed $(\psi_1, \ldots, \psi_p)$. This is achieved by
computing the smallest principal component direction $a^*$ for the $p$ variables $(\psi_1, \ldots, \psi_p)$.
Then we update the APC-function estimates, replacing $a_i\psi_i$ with $a_i^*\psi_i$. Since the smallest
principal component minimizes the variance of the sum, $\mathrm{var}\sum_i a_i^*\psi_i \le \mathrm{var}\sum_i a_i\psi_i$. Hence
addition of the linear principal component step must improve the rate of convergence.
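In code the rescaling step is a single eigen decomposition per outer cycle; the sketch below (my illustration, assuming the unit-variance transforms $\psi_i$ are held as the columns of an (n, p) array) replaces the loadings with the smallest principal component direction.

```python
import numpy as np

def rescale_loadings(psi):
    """psi: (n, p) array of unit-variance transforms psi_i.  Returns the loadings a*
    minimizing var(sum_i a_i psi_i) subject to a'a = 1."""
    vals, vecs = np.linalg.eigh(np.cov(psi, rowvar=False))
    return vecs[:, 0]                         # eigenvector of the smallest eigenvalue

# usage after a full iteration cycle:
#   a_star = rescale_loadings(psi)
#   phi_new = psi * a_star                    # updated estimates a_i* psi_i
```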
Implementing a linear principal component step for higher APCs employs the same
basic idea; however, the orthogonalization between components must be preserved. For
the current estimates of the $k$th APC, $(a_1^{(k)}\psi_1^{(k)}, \ldots, a_p^{(k)}\psi_p^{(k)})$, we want to find new scalings
$(a_1^{(k)*}, \ldots, a_p^{(k)*})$ which minimize $\sum_i \mathrm{var}\,a_i\psi_i^{(k)}$ over $a$ with $a'a = 1$. However, the $k$th
APC must also obey $k-1$ orthogonality constraints: for $j = 1, \ldots, k-1$,

$$0 = \sum_i \mathrm{cov}\left(a_i\psi_i^{(k)},\ a_i^{(j)}\psi_i^{(j)}\right) = \sum_i a_i\, \mathrm{cov}\left(\psi_i^{(k)},\ a_i^{(j)}\psi_i^{(j)}\right).$$

This boils down to $k-1$ linear constraints on $a$, which can be written in matrix form:

$$Ca = 0,$$

where $[C]_{ji} = a_i^{(j)}\,\mathrm{cov}\,(\psi_i^{(k)}, \psi_i^{(j)})$. Hence, finding the optimal scalings for fixed $\psi$ while
preserving the orthogonality conditions is a simple constrained linear optimization problem. The solution is easily computed by a change of basis, using the $k-1$ constraint vectors of
$C$ as the first $k-1$ basis vectors, then computing the smallest linear principal component
in the remaining $p-k+1$ dimensions. Translation back to the usual basis yields the
minimizing solution.

As for the smallest APC, by definition $\mathrm{var}\sum_i a_i^{(k)*}\psi_i^{(k)} \le \mathrm{var}\sum_i a_i^{(k)}\psi_i^{(k)}$, hence the
rate of convergence is improved.
Chapter 5
Simulations of Additive Principal
Component Estimation
5.1 Introduction
In this chapter we present a simulation study of APC estimation. The purpose of the
study is assessment of the performance characteristics of the finite sample algorithm using
distributions with known solutions.
We know the exact APC solutions for the Gaussian distribution and the Uniform
on an ellipsoid, so these are natural distributions to consider. The dependencies of the
elliptically symmetric distributions of Chapter 3 can be fully characterized using only
linear and quadratic functions. This implies that these distributions represent a class of
"null distributions" from the point of view of nonlinear additive methodology. In the
Gaussian case, for instance, the only significant structure is linear, and the higher order
polynomial APCs are an artifact of the elliptical distribution, hence are redundant once
the linear structure is known. This notion of redundancy will be discussed more fully in
section 6.3.1. From a simulation study we might expect to get some practical guidance as
to how to detect such uninformative nonlinearities.
We have discovered a technique for constructing data sets with high-dimensional additive structure. With this technique, it is possible to test a more interesting range of
situations, involving mixtures of different marginal distributions and functions of any
form. In these simulations, we can test the effectiveness of the algorithm in recovering
"real" structure.
Lastly, we use simulation to study APC estimation in the two extreme cases of additive
dependency: the cases of mutual independence and exact singularity.
Data sets with known APC solutions provide a valuable testing ground for the algo
rithm, allowing quantitative assessment of bias and stability, and qualitative assessment
of the implementation. However, this simulation study is not intended to be an exhaustive
study of factors affecting estimation or algorithm performance - a task which would be
daunting to attempt. Rather, the aim is to come to a fuller understanding of the inherent
properties of the estimation procedure.
In addition, through the accumulation of experimental evidence we can begin to develop an intuition for the behavior of the algorithm, and hence for interpretation of the
estimates. From this accumulated experience we can infer heuristic guidance for using
APC estimation in an applied setting.
5.2 Evaluation measures
Before presenting results of the simulations, we first describe the quantities we use to
assess the accuracy of the estimation.
Each simulation yields a sequence of estimated APCs: the APC-function for variable
$i$, APC $k$, from the $j$th sample is denoted $\hat\phi_{ij}^{(k)}$, where $i = 1, \ldots, p$, $j = 1, \ldots, N$, $k = 1, \ldots, K$.
Sample statistics calculated from the simulations are compared with the true values of the
quantities described below.
1. The eigenvalue, var I: 4>i(Xi), a number lying between 0 and 1.
For the eigenvalue, bias, standard deviance and RMS deviance of the simulation
estimates are reported.
2. The p standard deviations of the transformed variables,
a p-vector constrained to lie on the (p-1)-sphere.

The standard deviations of the variables are analogous to variable loadings of linear
principal component analysis. There, $\phi_i(X_i) = a_i X_i$, so $\sigma(\phi_i(X_i)) = |a_i|$; hence this
$p$-vector shall be referred to as the variable loading of the estimated component. The
bias, standard deviance and RMS deviance are reported for each variable loading.
However, since the estimates are constrained to lie on the unit $(p-1)$-sphere, a more
meaningful metric is the angular separation between the estimated and true loading
vectors, the loading metric, $d_l(\hat a, a) = \cos^{-1}|\langle \hat a, a\rangle|$. An estimate of the mean angle
is given by $d_l(\bar a, a)$, where $\bar a = \|\sum_j \hat a_j\|^{-1} \sum_j \hat a_j$. Two estimates of the variability
of the loadings are given, analogous to standard error and RMS error estimators:
the average angle between estimated and true loadings, $N^{-1}\sum_j d_l(\hat a_j, a)$, and the
average angle between estimated and mean loadings, $N^{-1}\sum_j d_l(\hat a_j, \bar a)$.
3. The APC, $\phi(X) = \sum_i \phi_i(X_i)$, a function of $X$.

If the estimation of the APC is accurate, the joint distribution of the true and estimated APC is concentrated along the diagonal. Hence an indicator of the accuracy
of estimation for a sample is the correlation between the estimated and true APC.
This correlation, averaged over all sample estimates, together with the standard deviation, are reported. A plot corresponding to this average correlation, which gives a
visual impression of the estimation accuracy, is $\bar{\hat\phi}^{(k)}(X_0)$ versus $\phi^{(k)}(X_0)$, for some
fixed sample $X_0$ of $X$, where

$$\bar{\hat\phi}^{(k)}(X_0) = \sum_i \bar{\hat\phi}_i^{(k)}(X_i^0).$$

The average APC-function estimate $\bar{\hat\phi}_i^{(k)}(X_i^0)$ is simply an average over the APC-function estimates of all samples of the simulation,

$$\bar{\hat\phi}_i^{(k)}(X_i^0) = N^{-1}\sum_{j=1}^{N} \hat\phi_{ij}^{(k)}(X_i^0).$$
4. The APC-function for each variable, $\phi_i(X_i)$.

The accuracy of the APC-function estimates is best assessed by comparing true and
estimated APC-functions graphically. The "average" accuracy of the $i$th function
estimate can be assessed by comparing $\bar{\hat\phi}_i^{(k)}(X_i^0)$ versus $X_i^0$ with the true $\phi_i^{(k)}(X_i^0)$.
An impression of the "variance" of the estimation is gained from superimposing the
$N$ plots $(\hat\phi_{ij}^{(k)} - \bar{\hat\phi}_i^{(k)})(X_i^0)$ versus $X_i^0$. To facilitate comparison, all the $k$th APC-function plots are plotted with the same Y-axis scale.
If the eigenvalue of the component has multiplicity greater than one, none of the
quantities 2-4 are uniquely defined. In the finite sample setting, if true eigenvalues of two
components are almost identical, the sample APCs will be unstable since the underlying
eigen space is ill-determined. This is a consequence of the discontinuity of eigenfunctions
as a function of their eigenvalue estimates. More explicitly, suppose two eigenvalues are
close, and we use as our finite sample estimate of the smallest eigenvalue the smallest
estimated eigenvalue. Due to finite sample variability, the eigenfunction corresponding to
this estimate could be either of the smallest or second smallest true eigenfunction. Between
different samples, then, the eigenfunction estimates for the smallest eigenvalue fluctuate
between the orthogonal sets of functions of the smallest and second smallest eigenvalues.
The eigenvalue estimates themselves will be close to their true values, however.
To assess accuracy of estimation in this scenario, measures of comparison that are
not affected by this behavior must be used. Although we cannot compare the individual
components, the eigenspace they span is uniquely determined. If, for instance, the 2nd
and 3rd smallest eigenvalues are the same, then while the 2nd and 3rd smallest APCs
are not unique, we wish at least $\mathrm{span}(\hat\phi^{(2)}, \hat\phi^{(3)})$ to be close to $\mathrm{span}(\phi^{(2)}, \phi^{(3)})$.
This suggests a canonical correlation analysis of these two spaces will give a measure of
the closeness of the true and estimated spaces in terms of the canonical correlations $\rho^{(1)}$ and $\rho^{(2)}$.
A natural metric between the spaces, which we will refer to as the canonical metric, is
$C(2) = \sqrt{\alpha^2 + \beta^2}$, where $\alpha = \cos^{-1}\rho^{(1)}$, $\beta = \cos^{-1}\rho^{(2)}$. This metric can be extended in
an obvious way to provide an overall measure of closeness for $k$ APCs: $C(k) = \sqrt{\sum_j \alpha_j^2}$,
where $\alpha_j = \cos^{-1}\rho^{(j)}$, $\rho^{(j)}$ the $j$th canonical correlation between the true and empirical spaces.
$C(k)$ is bounded above by $\cos^{-1}(0)\sqrt{k}$, which is increasing in $k$. Hence the scaled metric,
$C(k)/(\sqrt{k}\,\cos^{-1}(0))$, which is bounded above by 1 for all $k$, facilitates comparison between spaces
of different dimension.
While this metric is especially useful for ill-determined eigenspaces, it is also a useful
overall measure of APC estimation up to k dimensions, hence the mean and standard
deviance of this global measure of accuracy are also reported for all simulations.
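Both comparison measures are straightforward to compute; the sketch below (not part of the thesis; it follows numpy conventions) returns the loading metric and the scaled canonical metric between the spans of estimated and true APC score vectors.

```python
import numpy as np

def loading_metric(a_hat, a):
    """Angular separation d_l = arccos |<a_hat, a>| between unit loading vectors."""
    return np.arccos(abs(np.dot(a_hat, a)))

def scaled_canonical_metric(A_hat, A):
    """A_hat, A: (n, k) matrices whose columns span the estimated and true APC spaces.
    C(k) = sqrt(sum_j arccos(rho_j)^2), scaled by its upper bound sqrt(k) * arccos(0)."""
    Qh, _ = np.linalg.qr(A_hat)
    Q, _ = np.linalg.qr(A)
    rho = np.clip(np.linalg.svd(Qh.T @ Q, compute_uv=False), -1.0, 1.0)  # canonical correlations
    C = np.sqrt((np.arccos(rho) ** 2).sum())
    return C / (np.sqrt(A.shape[1]) * np.arccos(0.0))
```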
5.3 Simulations using the Gaussian Distribution
For the normal distribution, the APC-functions are scaled Hermite polynomials. Recall
that the appropriate variable loadings for an APC constructed from $k$th degree polynomials,
and its associated eigenvalue, are determined by the eigenvectors and eigenvalues of the
correlation matrix $R^{k\bullet}$. The $k$th smallest APC must belong to the subset of APCs formed
from Hermite polynomials of degree $k$ or less.
For the Gaussian distribution, two different scenarios were studied by simulation. In
the first scenario, GAU-S1, the variables are nearly collinear, that is, the smallest eigenvalue of their correlation matrix is close to zero, and all eigenvalues are distinct. Estimates
of the APCs, the variable loadings and the APC-functions should be well determined.
For the second simulation study, GAU-S2, the second and third smallest eigenvalues of
operator P are almost identical. The smallest APC estimates should be well determined,
however the second and third APCs are not unique, although they uniquely determine the
space of the second smallest eigenvalue.
5.3.1 GAU-S1: Gaussian Scenario 1
Random samples are generated from a normal distribution with zero mean and correlation
matrix:

$$R = \begin{pmatrix} 1.0 & .6 & .4 & -.7 \\ & 1.0 & .5 & -.3 \\ & & 1.0 & -.8 \\ & & & 1.0 \end{pmatrix}.$$
Table 5.1: GAU-S1 : Correlations between $\hat\phi^{(k)}(X_0)$ and $\phi^{(k)}(X_0)$

Component        First   Second  Third   Fourth
Correlation      0.988   0.876   0.681   0.748
Std dev (N=20)   0.008   0.088   0.190   0.188
Table 5.2: GAU-S1 : Loading metric

Component                          First   Second  Third   Fourth
$d_l(\bar a, a_{true})$ (deg)      0.038   0.185   1.111   0.895
ave $d_l(\hat a_j, \bar a)$        0.167   0.774   1.385   1.332
ave $d_l(\hat a_j, a_{true})$      0.172   0.778   1.665   1.531
The four smallest APCs are estimated on each of 20 data sets with 200 observations. The
eigenvalues of the smallest APC are given below.
Eigenvalues of $R$:            0.0197, 0.5455, 0.7732, ...
Eigenvalues of $R^{2\bullet}$: 0.2033, 0.7594, ...
Eigenvalues of $R^{3\bullet}$: 0.3817, 0.8556, ...
Thus, the smallest APC is the smallest linear principal component, the second smallest is
the smallest quadratic, the third the smallest cubic, but the fourth smallest is the second
smallest linear principal component.
Table 5.1 gives the average correlations between the true and estimated APCs and
Figure 5.1, the corresponding plots.
Table 5.2 gives the angle between the mean vector of loadings and the true loadings
( the loading metric ), and two measures of dispersion for this angle: the average angle
between estimated and true loadings, and the average angle between estimated and mean
loadings.
In Table 5.3 the true and estimated eigenvalues and individual loadings are compared.
[Figure 5.1: GAU-S1 : Correlation Plots. For the $k$th APC: $\hat\phi^{(k)}(X_0)$ versus $\phi^{(k)}(X_0)$.]
All the eigenvalue estimates have small negative bias; both bias and variance estimates
increase with the number of components estimated. Similarly, the bias and variance of
the variable loadings is larger for the higher components.

Finally, Table 5.4 contains the scaled and unscaled canonical metrics between the APC
spaces spanned by $\{\hat\phi^{(k)}(X_0)\}$ and $\{\phi^{(k)}(X_0)\}$.
It is apparent from the APC correlations, the angles of the variable loadings and the
canonical metric, that while the first and second APCs are estimated extremely accurately,
the third and fourth are less exactly determined. This is clearly seen in the increase in
scatter of Figure 5.1.
The scaled canonical metric decreases with the inclusion of the fourth APC. This
decrease implies the 4-dimensional space is more accurately determined than the 3-dimensional space, suggesting the estimates of the third and fourth APCs are "mixing", that is,
there is a slight lack of resolution within the space of the third and fourth components.
This phenomenon is discussed fully in the concluding section 5.8.
Figure 5.2 shows the true and average estimated APC-functions, and Figure 5.3 the variance
of these functions over the 20 replications. The function plots verify the accuracy in estimation of the two smallest components. The variance plots for the third and fourth components show the presence of a linear trend in some replicates of the third APC-functions,
and likewise a cubic trend in some fourth APC-functions. This behavior provides further
support for the explanation of mixing, or lack of resolution of these components, suggested
previously.
5.3.2 GAU-S2: Gaussian Scenario 2
Random samples are generated from a normal distribution with zero mean and correlation
matrix:

$$R = \begin{pmatrix} 1.0 & .9 & .8 & .2 \\ & 1.0 & .8 & .2 \\ & & 1.0 & .4 \\ & & & 1.0 \end{pmatrix}.$$
Table 5.3: GAU-S1 : Eigenvalue and Variable Loadings

First APC         Theoretical  Estimated  Bias     Std dev  RMS dev
Eigenvalue        0.0197       0.0186     -0.0011  0.0011   0.0015
Loadings : Var 1  0.460        0.462      0.002    0.015    0.016
Var 2             0.345        0.349      0.004    0.018    0.018
Var 3             0.510        0.511      0.001    0.012    0.012
Var 4             0.639        0.634      -0.005   0.012    0.013

Second APC        Theoretical  Estimated  Bias     Std dev  RMS dev
Eigenvalue        0.2033       0.1722     -0.0311  0.0501   0.0594
Loadings : Var 1  0.431        0.449      0.018    0.082    0.084
Var 2             0.288        0.297      0.009    0.081    0.081
Var 3             0.538        0.530      -0.008   0.069    0.070
Var 4             0.644        0.640      -0.004   0.045    0.051

Third APC         Theoretical  Estimated  Bias     Std dev  RMS dev
Eigenvalue        0.3817       0.3448     -0.0369  0.0643   0.0746
Loadings : Var 1  0.398        0.449      0.051    0.138    0.148
Var 2             0.224        0.340      0.116    0.127    0.174
Var 3             0.570        0.558      -0.012   0.081    0.082
Var 4             0.683        0.560      -0.123   0.135    0.184

Fourth APC        Theoretical  Estimated  Bias     Std dev  RMS dev
Eigenvalue        0.545        0.506      -0.039   0.052    0.066
Loadings : Var 1  0.663        0.581      -0.082   0.113    0.141
Var 2             0.419        0.455      0.036    0.138    0.142
Var 3             0.567        0.526      -0.041   0.078    0.088
Var 4             0.252        0.356      0.104    0.131    0.168
[Figure 5.2: GAU-S1 : APC-function Estimation. For the $k$th APC, $i$th variable: $\hat\phi_i^{(k)}(X_i^0)$ (points) and $\phi_i^{(k)}(X_i^0)$ (solid line) versus $X_i^0$.]
[Figure 5.3: GAU-S1 : Variance of APC-function Estimation. For the $k$th APC, $i$th variable: $(\hat\phi_{ij}^{(k)} - \bar{\hat\phi}_i^{(k)})(X_i^0)$ versus $X_i^0$ for $j = 1, \ldots, N$.]
Table 5.4: GAU-S1 : Canonical Metric between $\{\hat\phi^{(k)}(X_0)\}$ and $\{\phi^{(k)}(X_0)\}$

Components Included  1       1,2     1,2,3   1,2,3,4
Unscaled metric      0.951   2.842   5.131   5.717
Std dev (N=20)       0.277   1.071   1.489   1.559
Scaled metric        0.0106  0.0223  0.0329  0.0318
The three smallest APCs are estimated on each of 20 data sets, again of 200 observations.
The eigenvalues of the smallest APCs are given below.
Eigenvalues of $R$:            0.1000, 0.1914, 0.9224, ...
Eigenvalues of $R^{2\bullet}$: 0.1900, 0.3954, ...
Eigenvalues of $R^{3\bullet}$: 0.2710, 0.5486, ...
This shows that the smallest APC is the smallest linear principal component, but, due
to the closeness of the eigenvalues, the second and third span the space of the smallest
quadratic and the second smallest linear. The separation of the fourth smallest eigenvalue
is sufficient to make the space unique. Table 5.5 compares the true and estimated eigenvalues for the first three components, and the variable loadings of the smallest. The first
two eigenvalue estimates are again negatively biased, although not the third, and variability increases with component number. The estimated variable loadings of the smallest
component are considerably more biased for the small values.
The APC correlation for the smallest APC is 0.9459, with standard deviance 0.0017.
The angle between true and average estimated variable loadings for the smallest APC is
0.771°, with average angle between true and estimated of 0.615°, and between average and
estimated of 0.866°. Figure 5.4 gives the correlation plots for all three APCs, and the lack
of uniqueness in the second and third APC is apparent. Figure 5.5 shows the estimation
of the APC-functions for the smallest APC.
Evaluation of the accuracy of second and third estimated APC is possible only through
assessment of the canonical metric comparing the spans of the true and estimated APCs.
Table 5.6 contains canonical metrics of the relevant spaces.
Table 5.5: GAU-S2 : Eigenvalue and Variable Loadings

First APC         Theoretical  Estimated  Bias     Std dev  RMS dev
Eigenvalue        0.1000       0.0887     -0.0113  0.0116   0.0164
Loadings : Var 1  0.707        0.704      -0.003   0.058    0.058
Var 2             0.707        0.690      -0.017   0.053    0.056
Var 3             0.000        0.112      0.112    0.086    0.144
Var 4             0.000        0.046      0.046    0.022    0.052

Second APC        Theoretical  Estimated  Bias     Std dev  RMS dev
Eigenvalue        0.1900       0.1590     -0.0310  0.0311   0.0445

Third APC         Theoretical  Estimated  Bias     Std dev  RMS dev
Eigenvalue        0.1914       0.1946     0.0032   0.0259   0.0261
[Figure 5.4: GAU-S2 : Correlation Plots.]
[Figure 5.5: GAU-S2 : APC-function Estimation for Smallest APC. For the $i$th variable: $\hat\phi_i^{(k)}(X_i^0)$ (points) and $\phi_i^{(k)}(X_i^0)$ (solid line) versus $X_i^0$.]
Table 5.6: GAU-S2 : Canonical Metric between $\{\hat\phi^{(k)}(X_0)\}$ and $\{\phi^{(k)}(X_0)\}$

Components Included  1       2,3     1,2,3
Unscaled metric      1.935   4.042   3.718
Std dev (N=20)       0.785   1.414   1.220
Scaled metric        0.0215  0.0318  0.0238
The canonical metrics again give evidence of lack of resolution between the spaces of the smallest and second smallest eigenvalue
(the latter having dimension two), since the distance between the true and estimated 3-dimensional space of the first and second eigenvalues is smaller than that between the
2-dimensional spaces of the second eigenvalue only. This explains the behavior of the
estimates for Variables 3 and 4 in the smallest component. These APC-function estimates,
instead of being virtually zero in the first APC, foreshadow the APC-functions of these
variables in the second eigenspace.
5.4 Simulations using the Uniform Distribution on an
Ellipsoid
Data sets are simulated from the Uniform distribution on the ellipsoid described by the
correlation matrix :
$$R = \begin{pmatrix} 1.00 & 0.55 & 0.33 \\ & 1.00 & 0.30 \\ & & 1.00 \end{pmatrix}.$$
The theoretical solutions for this correlation matrix have been derived in section 3.6: the
APC-functions are scaled Gegenbauer polynomials, and the variable scalings and eigenvalues are computed from the eigen decomposition of their correlation matrix. The $k$th APC
is constructed from a polynomial of degree $2k$ or less.
For the distribution specified, the four smallest APCs and their eigenvalues are:
0.4488 : the smallest linear principal component
0.6750 : the smallest cubic principal component
0.7526: the second smallest linear principal component
0.8596 : the smallest quadratic principal component
These eigenvalues are all far larger than those of the Gaussian Scenarios. By their
proximity to one, it is clear the additive dependencies are weak.
Table 5.7 compares the true and estimated eigenvalues and variable loadings for each
of the four smallest APCs. Estimated eigenvalues are in good agreement, although all have
Table 5.7: UNI.S : Eigenvalue and Variable Loadings
First APC Theoretical Estimated Bias Std dey RMS dey
Eigenvalue 0.4488 0.4295 -0.0193 0.0307 0.0366
Loadings :Var 1 0.722 0.715 -0.007 0.025 0.026
Var 2 0.689 0.686 -0.003 0.028 0.028
Var 3 0.057 0.130 0.083 0.034 0.082
Second APC Theoretical Estimated Bias Std dey RMS dey
Eigenvalue 0.6750 0.6701 -0.0049 0.0428 0.0432
Loadings :Var 1 0.559 0.481 -0.078 0.109 0.131
Var 2 0.549 0.512 -0.037 0.067 0.078
Var 3 0.620 0.696 0.076 0.086 0.116
Third APC Theoretical Estimated Bias Std dey RMS dey
Eigenvalue 0.7526 0.7458 -0.0066 0.0343 0.0351
Loadings :Var 1 0.301 0.439 0.138 0.127 0.190
Var 2 0.387 0.498 0.111 0.115 0.161
Var 3 0.871 0.722 -0.149 0.103 0.185
Fourth APC Theoretical Estimated Bias Std dey RMS dey
Eigenvalue 0.8596 0.8383 -0.0213 0.0411 0.0465
Loadings :Var 1 0.273 0.489 0.216 0.142 0.264
Var 2 0.789 0.592 -0.197 0.149 0.251
I0.552 0.580 0.028I Var 3 I 0.185 0.188
Table 5.8: UNI-S : Correlation between $\hat\phi^{(k)}(X_0)$ and $\phi^{(k)}(X_0)$

Component        First   Second  Third   Fourth
Correlation      0.969   0.646   0.539   0.673
Std dev (N=20)   0.018   0.311   0.330   0.221

Table 5.9: UNI-S : Loading metric

Component                          First   Second  Third   Fourth
$d_l(\bar a, a_{true})$ (deg)      0.459   0.121   1.466   1.875
ave $d_l(\hat a_j, \bar a)$        0.273   0.834   1.159   1.604
ave $d_l(\hat a_j, a_{true})$      0.505   0.985   1.715   2.312
a slight negative bias, which increases with the size of the eigenvalue. The variability
in eigenvalue estimation remains constant, however. Both bias and variance of the variable
loadings increase with the number of components estimated.

Average correlations between APCs, in Table 5.8, show close agreement only for the
smallest APC. The other three correlations are highly variable, indicating true and estimated APCs are not close in many of the iterations, despite the accuracy of the eigenvalue
estimates.
The mean angular separations of the loading vectors for each component, Table 5.9,
reflect decreasing accuracy as more components are estimated, which is echoed in Figure
5.6. Nevertheless, the canonical metrics, Table 5.10, indicate very good agreement between
true and estimated APCs for the four smallest components. Notice there is little change
in the scaled metric from 2 to 4 dimensions.

Figures 5.7 and 5.8 show the APC-function estimates and their variance plots. From
these we see that the bias of the estimates is caused by behavior reminiscent of that
expected when eigenfunctions are ill-determined. As is apparent from the variance plots
of the second and third APCs, for some samples the functions of the second APC are
[Figure 5.6: UNI-S : Correlation Plots.]
Table 5.10: UNI-S : Canonical Metric between $\{\hat\phi^{(k)}(X_0)\}_1^4$ and $\{\phi^{(k)}(X_0)\}_1^4$

Components Included  1       1,2     1,2,3   1,2,3,4
Unscaled metric      1.511   5.184   7.143   8.134
Std dev (N=20)       0.485   2.492   3.294   3.741
Scaled metric        0.0168  0.0407  0.0458  0.0450
[Figure 5.7: UNI-S : APC-function Estimation. For the $k$th APC, $i$th variable: $\hat\phi_i^{(k)}(X_i^0)$ (points) and $\phi_i^{(k)}(X_i^0)$ (solid line) versus $X_i^0$.]
[Figure 5.8: UNI-S : Variance of APC-function Estimation. For the $k$th APC, $i$th variable: $(\hat\phi_{ij}^{(k)} - \bar{\hat\phi}_i^{(k)})(X_i^0)$ versus $X_i^0$ for $j = 1, \ldots, N$.]
cubic, while for others they are linear. Evidently, despite theoretical uniqueness of the
eigenfunctions, estimates are not well-determined. It appears that both the size and the
spacing of the eigenvalues are important for stability of the eigenfunctions. In view of the
weak dependencies of the APCs, it is not surprising that the estimates are unstable for
the higher APCs.
The canonical metric for the first four dimensions, however, will not be affected by
this phenomenon, provided the space of the four smallest estimated APCs corresponds to
the four smallest true eigenfunctions.
5.5 Simulations using Manifolds defined by Specified
Constraints
This simulation studies the algorithm applied to data sets with a larger number of variables
and more complex structure. We discovered a method for constructing data sets that
satisfy a set of orthogonal constraints, which are almost completely prespecified for an
arbitrary set of marginal distributions. Using this technique, data sets are simulated that
concentrate around a 4-dimensional manifold in the 6 variables. Two simulation studies
are presented, using the same set of constraints for each, but changing the eigenvalues of
the two smallest APCs.
5.5.1 A Method for Constructing Manifolds
Constructing realizations of additive manifolds using continuous functions is not a trivial
problem. The following method induces additive dependencies between variables described
by a constraint in which all except one of the variable functions are specified.
The idea underlying the method is a simple trick with permutations. Suppose we
begin with $p$ variables $X_1, \ldots, X_p$ and some set of specified functions $\phi_1(X_1), \ldots, \phi_p(X_p)$.
Let $Y \equiv \phi_1(X_1)$ and $Z \equiv \sum_{i=2}^{p} \phi_i(X_i)$. If we sort the values of $Y$ and $Z$, then the ordered
variables are likely to be similar. By permuting $X_1$ and $X_2, \ldots, X_p$ in parallel with $Y$ and
$Z$ respectively, we have that

$$\phi_1(X_1') \approx \sum_{i=2}^{p} \phi_i(X_i'),$$

where $X_i'$ are the permuted variables.
The exact procedure, first for a manifold of co-dimension 1:

1. Generate $p$ independent samples $x_1, x_2, \ldots, x_p$ of $X_1, X_2, \ldots, X_p$. Standardize so that
ave $x_i = 0$, ave $x_i^2 = 1$. Compute $\phi_1(x_1), \ldots, \phi_p(x_p)$ for the specified sequence of
functions $\phi_i$. Center the transformed variables, ave $\phi_i(x_i) = 0$.

2. Form $y = \phi_1(x_1)$ and $z = \sum_{i=2}^{p} \phi_i(x_i)$.

3. Find the ordering permutations $\pi_y$ and $\pi_z$ for $y$ and $z$, so $\pi_y(y)$ and $\pi_z(z)$ are both
increasing.

4. Form
$$x_1' = \pi_y(x_1), \qquad x_i' = \pi_z(x_i), \quad i = 2, \ldots, p.$$

The resulting data set $x_1', \ldots, x_p'$ satisfies the constraint

$$g(\phi_1(x_1')) - \sum_{i=2}^{p} \phi_i(x_i') = 0,$$

where $g(\cdot)$ is a monotonic function ($g$ is the mapping between the ordered variables $\pi_y(y)$
and $\pi_z(z)$). All transformations except $g \circ \phi_1$ are prespecified.
The above constraint is exact. To construct a data set for which the variance of the
constraint is nonzero, we modify step 2 above as follows.
Generate a vector $u$ independent of $X_2, \ldots, X_p$, with ave $u = 0$ and ave $u^2 = \epsilon^2$. Form
$z$ by $z = \sum_{i=2}^{p} \phi_i(x_i) + u$. Perform the sequence of steps above using this modified $z$. The
resulting data set has

$$g(\phi_1(x_1')) = \sum_{i=2}^{p} \phi_i(x_i') + u',$$

where $u' = \pi_z(u)$. Hence

$$g(\phi_1(x_1')) - \sum_{i=2}^{p} \phi_i(x_i') = u',$$

and so $\mathrm{var}\left(g(\phi_1(x_1')) - \sum_{i=2}^{p} \phi_i(x_i')\right) = \mathrm{var}(u') = \mathrm{var}(u) = \epsilon^2$.
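The single-constraint construction is only a few lines of numpy; the sketch below (mine, with illustrative transforms and marginals) follows the four steps above, with the added noise vector controlling the constraint variance.

```python
import numpy as np

rng = np.random.default_rng(1)
standardize = lambda v: (v - v.mean()) / v.std()

n, eps = 200, 0.2
x = [standardize(rng.normal(size=n)) for _ in range(4)]     # step 1: independent, standardized samples
transforms = [np.sin, np.square, np.tanh, np.abs]            # illustrative phi_i
f = [standardize(t(xi)) for t, xi in zip(transforms, x)]     # centered, scaled phi_i(x_i)

y = f[0]                                                     # step 2
z = sum(f[1:]) + eps * standardize(rng.normal(size=n))       # summed block plus noise u

pi_y, pi_z = np.argsort(y), np.argsort(z)                    # step 3: ordering permutations
x_new = [x[0][pi_y]] + [xi[pi_z] for xi in x[1:]]            # step 4: permute x_1 with y, the rest with z

# g(phi_1(x_1')) - sum_{i>=2} phi_i(x_i') now has variance eps^2, where g is the
# monotone map between the ordered values of y and z.
```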
71
The procedure can be generalized to generate data sets with lower dimensional additive
structure, that is, satisfying more than one orthogonal constraint. The same permutation
idea is used, however now a separate ordering is induced for each constraint. To produce
these multiple dependencies, the block of summed variables ($z$, above) is not permuted;
rather, the inverse ordering is applied to the single variable (y, above). Orthogonality can
be achieved by using a Gram-Schmidt step, or by specifying the functions to be orthogonal.
The following method produces a data set lying in a manifold of co-dimension 2.
1. Generate $p$ independent samples $x_1, x_2, \ldots, x_p$ with ave $x_i = 0$, ave $x_i^2 = 1$. Compute $\phi_1(x_1), \phi_3(x_3), \ldots, \phi_p(x_p)$ for some sequence of functions $\phi_i$, where ave $\phi_i(x_i) = 0$.
Compute $\psi_2(x_2), \psi_3(x_3), \ldots, \psi_p(x_p)$ for some sequence of functions $\psi_i$, where
ave $\psi_i(x_i) = 0$ and $\psi_i(x_i) \perp \phi_i(x_i)$ for $i = 3, \ldots, p$.

2. Form $y_1 = \phi_1(x_1)$, $z_1 = \sum_{i=3}^{p} \phi_i(x_i)$, and $y_2 = \psi_2(x_2)$, $z_2 = \sum_{i=3}^{p} \psi_i(x_i)$.

3. Find the ordering permutations $\pi_{y_1}$, $\pi_{z_1}$ of $y_1$, $z_1$, and $\pi_{y_2}$, $\pi_{z_2}$ of $y_2$, $z_2$.

4. Form
$$x_1' = \pi_{z_1}^{-1}\pi_{y_1}(x_1), \qquad x_2' = \pi_{z_2}^{-1}\pi_{y_2}(x_2).$$

The data set $x_1', x_2', x_3, \ldots, x_p$ satisfies the two orthogonal constraints

$$g_1(\phi_1(x_1')) - \phi_3(x_3) - \cdots - \phi_p(x_p) = 0, \qquad g_2(\psi_2(x_2')) - \psi_3(x_3) - \cdots - \psi_p(x_p) = 0,$$

with $\sum_{i=1}^{p} \mathrm{cov}(\phi_i(x_i), \psi_i(x_i)) = 0$.

All of the functions, except $g_1 \circ \phi_1$ and $g_2 \circ \psi_2$, can be specified. This can be altered in a
manner similar to the single constraint case, so that the constraints are not exact. Adding
noise components with different variances permits control over the eigenvalue separation.
5.5.2 SCM-S1: Specified Constraint Manifold Scenario 1
The simulated data sets we use for the two simulations that follow are generated using the
technique of the previous section. We use 6 variables and two constraints. The variables
$(X_1, \ldots, X_6)$ satisfy two equations of the form

$$Z_1 = \phi_1(X_1) + \phi_3(X_3) + \phi_4(X_4) + \phi_5(X_5),$$
$$Z_2 = \psi_2(X_2) + \psi_3(X_3) + \psi_4(X_4) + \psi_5(X_5),$$
$$\sum_i \mathrm{cov}(\phi_i, \psi_i) = 0,$$

where $\mathrm{var}\,Z_1 = 0.048$ and $\mathrm{var}\,Z_2 = 0.137$.
Variable 6 is not included in the manifold construction, and thus has a true zero function
for both equations. The marginal distributions of the variables before centering and scaling
are:
$X_1 \sim N(0,1)$
$X_2 \sim$ Chisq(6)
$X_3 \sim 0.6 \cdot N(0,1) + 0.4 \cdot N(2,1)$
$X_4 \sim$ Categorical
$X_5 \sim N(0,1)$
$X_6 \sim$ Uniform$[-1,1]$
The four dimensional manifold is defined by the orthogonal equations $\Phi(X) = 0$ and
$\Psi(X) = 0$. The variances of $Z_1$ and $Z_2$, 0.048 and 0.137 respectively, are the approximate
eigenvalues of the two smallest APCs. The APCs are defined by the APC-functions
$\Phi(X) = (\phi_1, \ldots, \phi_6)$ and $\Psi(X) = (\psi_1, \ldots, \psi_6)$.

The two smallest APCs are estimated for 20 data sets of 200 observations and 6 variables
simulated from the manifold described above.
The true and estimated eigenvalues and variable loadings, given in Table 5.11, show
accurate estimates, with the exception of the variable loadings of Variable 2 in Component
1 and Variable 1 in Component 2. The correlations between estimated and true APCs,
Table 5.12, show close agreement for the first component; however, the second component
estimation is less consistent. The angles between the true and average estimated variable
loadings, Table 5.13, while not significantly different from zero, are more variable than the
two smallest APC variable loadings of other simulations. The canonical metric, Table 5.14,
shows a greater accuracy in estimation of the two-dimensional space; this probably reflects a
Table 5.11: SCM-S1 : Eigenvalue and Variable Loadings
First APC          Theoretical  Estimated  Bias     Std dev  RMS dev
Eigenvalue         0.048        0.049       0.001   0.004    0.004
Loadings : Var 1   0.707        0.690      -0.017   0.022    0.028
           Var 2   0.000        0.116       0.116   0.075    0.141
           Var 3   0.408        0.435       0.027   0.020    0.034
           Var 4   0.408        0.375      -0.033   0.028    0.044
           Var 5   0.408        0.415       0.007   0.018    0.019
           Var 6   0.000        0.031       0.031   0.010    0.034
Second APC         Theoretical  Estimated  Bias     Std dev  RMS dev
Eigenvalue         0.137        0.113      -0.024   0.011    0.021
Loadings : Var 1   0.000        0.147       0.147   0.082    0.167
           Var 2   0.707        0.681      -0.026   0.024    0.036
           Var 3   0.408        0.397      -0.011   0.031    0.033
           Var 4   0.408        0.404      -0.004   0.041    0.041
           Var 5   0.408        0.425       0.017   0.048    0.051
           Var 6   0.000        0.051       0.051   0.020    0.056
Table 5.12: SCM-S1 : Correlation between φ^(k)(X_0) and φ̂^(k)(X_0)
Component         First    Second
Correlation       0.9018   0.7518
Std dev (N=20)    0.0188   0.0645
Table 5.13: SCM-S1 : Loading metric
Component                 First    Second
d_l(ā, a_true)            0.812    0.971
ave( d_l(a_i, ā) )        0.474    0.631
ave( d_l(a_i, a_true) )   0.876    1.113
Table 5.14: SCM-S1 : Canonical Metric between {φ̂^(k)(X_0)} and {φ^(k)(X_0)}
Components Included   1        1,2
Unscaled metric       4.99     2.79
Std dev (N=20)        0.549    0.270
Scaled metric         0.0372   0.0311
Figures 5.9 and
5.10 compare the estimated and true APC-functions of the first two components. Despite
the discrepancies in loading and eigenvalue estimates, the function estimates are close to
their true counterparts, particularly in a qualitative sense - they follow closely the global
characteristics of the true functions. The greatest inaccuracies arise in the estimation of
the zero function. For instance, in the APC-function for the second variable, smallest
APC, the estimate appears to match the shape of the true APC-function for
the second variable, second component. This is also apparent in the variance plots of the
estimation. A similar phenomenon is observed in the APC-function estimate for the first
variable, second smallest APC, where the estimate, instead of being close to zero, seems to
echo the function from the first component. In contrast, the estimate of the zero function
for Variable 6, which is independent, is negligible as expected.
Such mixing between the two components would cause the decrease in the scaled canonical
metric for two dimensions noted above.
Figure 5.9: SCM-S1 APC-function Estimation for Component 1
Figure 5.10: SCM-S1 APC-function Estimation for Component 2
Table 5.15: SCM-S2 : Correlation between φ^(k)(X_0) and φ̂^(k)(X_0)
Component         First    Second
Correlation       0.9151   0.8662
Std dev (N=20)    0.0241   0.0514
Table 5.16: SCM-S2 : Loading metric
Component                 First    Second
d_l(ā, a_true)            0.528    1.159
ave( d_l(a_i, ā) )        0.348    0.911
ave( d_l(a_i, a_true) )   0.626    1.446
5.5.3 SCM-S2: Specified Constraint Manifold Scenario 2
The variables for this second simulation obey exactly the constraints and specifications
of the first specified manifold. Again, 20 data sets of size 200 were generated for the
simulation. For this simulation, the order of the APCs is changed, since in the second
simulation var Z_1 ≈ 0.413 and var Z_2 ≈ 0.091. The smallest APC now corresponds to
the APC-functions Ψ(X), with eigenvalue 0.091, and the second smallest to Φ(X), with
eigenvalue 0.413.
The correlations between true and estimated APCs, Table 5.15, and the angles between
the variable loadings, Table 5.16, as usual show greater precision for the smallest APC,
although both are highly correlated. The true and average estimated eigenvalues and
loadings for the two smallest APC are given in Table 5.17. The estimate for the second
eigenvalue is unexpectedly low, although the variable loadings show very good agreement.
The canonical metric, Table 5.18, shows the expected increase for the two dimensional
space. Figures 5.11 and 5.12 depict the true and estimated APC-functions and their
variance. The structure of the data given by the two constraints has been accurately
Table 5.17: SCM-S2 : Eigenvalue and Variable Loadings
First APC          Theoretical  Estimated  Bias     Std dev  RMS dev
Eigenvalue         0.091        0.084      0.007    0.007    0.010
Loadings : Var 1   0.000        0.067      0.067    0.023    0.073
           Var 2   0.707        0.698      0.009    0.021    0.023
           Var 3   0.408        0.415      0.007    0.020    0.021
           Var 4   0.408        0.421      0.013    0.029    0.032
           Var 5   0.408        0.391      0.017    0.031    0.035
           Var 6   0.000        0.044      0.044    0.019    0.049
Second APC         Theoretical  Estimated  Bias     Std dev  RMS dev
Eigenvalue         0.413        0.291      0.122    0.023    0.126
Loadings : Var 1   0.707        0.76       0.031    0.032    0.045
           Var 2   0.000        0.130      0.130    0.075    0.153
           Var 3   0.408        0.418      0.010    0.087    0.088
           Var 4   0.408        0.388      0.020    0.070    0.073
           Var 5   0.408        0.400      0.008    0.069    0.069
           Var 6   0.000        0.124      0.124    0.050    0.137
Table 5.18: SCM-S2 : Canonical Metric between {φ̂^(k)(X_0)} and {φ^(k)(X_0)}
Components Included   1        1,2
Unscaled metric       2.571    4.09
Std dev (N=20)        0.396    0.647
Scaled metric         0.029    0.032
Figure 5.11: SCM-S2 APC-function Estimation for Component 1
Figure 5.12: SCM-S2 APC-function Estimation for Component 2
recovered.
5.6 APC Estimation for Uncorrelated Variables
Variables that are mutually independent have no additive structure, that is, they have
no APC with eigenvalues smaller than 1 (Theorem 2.3). Variables that are merely pairwise
uncorrelated would usually be expected to have only very weak additive structure. In
either case, it is valuable to know how APC estimation behaves in the absence of additive
structure, as this will provide guidance for judging the significance of structure detected
in data analysis.
Independent Gaussian
The three smallest APCs were estimated for five samples of 200 observations from N(0, I).
Quantiles for the three smallest eigenvalues of the initial estimates, calculated from an
empirical distribution function, are given for comparison.¹

              APC estimate   F̂_200(0.01)   F̂_200(0.05)   F̂_200(0.10)   F̂_200(0.50)
    λ̂(1)      0.676          0.566         0.715         0.763         0.867
    λ̂(2)      0.694          0.732         0.758         0.780         0.849
    λ̂(3)      0.721          0.714         0.758         0.772         0.846
Loadings of the APC-functions indicate the variables usually contribute equally in the
APC estimate.
The APC-function estimates, shown in figure 5.13, share a common feature - a steep
gradient in the extremes of the variable marginals. The effect of these transforms of the
variables is to exaggerate the kurtosis of the sample : the density of points near the
origin is increased while observations on the perimeter of the sphere are further separated.
The resulting transformed variables have a projection with smaller variance than the
original variables. Notice also, that a good polynomial approximation to the APC function
¹That is, the three smallest eigenvalues of the correlation matrix of (x_1, p_2(x_1), p_3(x_1), ..., x_p, p_2(x_p), p_3(x_p)), where x_i is a sample of 200 observations from N(0,1) and p_j is a j-th degree polynomial (see Chapter 4.2.3).
Figure 5.13: Independent Gaussian Estimates of the Three Smallest APCs
estimates would require polynomials of degree much higher than three. The quantiles given
above, then, are conservative.
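The reference quantiles of the footnote can be re-created, at least approximately, along the following lines (a sketch only; the thesis obtains its initial estimates from polynomial transforms as in Chapter 4.2.3, whereas raw powers are used here for brevity, and the function name is hypothetical).

    import numpy as np

    def null_eigenvalue_quantiles(p=6, n=200, reps=200, probs=(0.01, 0.05, 0.10, 0.50), seed=1):
        # Smallest eigenvalues of the correlation matrix of polynomially expanded,
        # independent Gaussian variables; their empirical quantiles serve as a
        # rough null reference for "no additive structure".
        rng = np.random.default_rng(seed)
        smallest = np.empty((reps, 3))
        for r in range(reps):
            x = rng.standard_normal((n, p))
            expanded = np.column_stack([x, x**2, x**3])   # degree-3 expansion per variable
            corr = np.corrcoef(expanded, rowvar=False)
            smallest[r] = np.sort(np.linalg.eigvalsh(corr))[:3]
        return np.quantile(smallest, probs, axis=0)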
Uniform on the Ball
Four variables, uniformly distributed on B_4, are uncorrelated but have weak additive
structure. The smallest eigenvalue has multiplicity 3; the corresponding APCs are
contrasts of the squared variables, that is, Σ_i c_i g_2(X_i) where Σ_i c_i = 0. Five sets of APC
estimates using 200 points sampled from the Uniform distribution on B_4 are pictured in
Figure 5.14. Again, quantiles of an empirical distribution function of the initial estimates
are given for comparison.
              APC estimate   F̂_200(0.01)   F̂_200(0.05)   F̂_200(0.10)   F̂_200(0.50)
    λ̂(1)      0.730          0.702         0.766         0.788         0.854
    λ̂(2)      0.730          0.696         0.735         0.779         0.846
    λ̂(3)      0.763          0.724         0.776         0.790         0.845
Qualitatively the APC-functions for the Uniform distribution show the same tendency to
be steep in the variable marginals. The gradients are not as extreme as the Gaussian
estimates, however, and are more wiggly in the body of the transform.
5.7 APC Estimation for Distributions with Exact Additive Dependencies
The previous section considered APC estimation in the absence of additive structure,
var Φ(X) ≈ 1; here, we consider the opposite extreme of exact additive singularity,
var Φ(X) ≈ 0. Recall the distributions described in Section 3.7, in which the data lie
exactly on an additive manifold described by step functions of the variables. Estimation
of such discrete dependencies among continuous variables - discrete in the sense that
the transformed variables are discrete valued - cannot be exact, since the APC-function
estimates are constrained to be smooth.
To indicate the behavior of finite sample estimates of these APCs, we generate data
from the following distribution.
Figure 5.14: Uniform on Ball Estimates of the Three Smallest APCs
X_1, X_2 and X_3 lie in diagonally opposite quadrants, defined by the cut points 0, 0, 0. That
is, P(X_1 < 0, X_2 < 0, X_3 < 0) and P(X_1 ≥ 0, X_2 ≥ 0, X_3 ≥ 0) are both non-zero,
and sum to 1. The exact singularities in the data, that is, the APCs with eigenvalues of
0, are defined by φ_i mapping the two sets {X_i < 0} and {X_i ≥ 0} onto different constants
k_i and l_i, so that ‖φ_i‖ = 1. Then for any c such that Σ_i c_i = 0, var Σ_i c_i φ_i(X_i) = 0. The
eigenvalue 0 has multiplicity 2.
A data set of 200 points is simulated from this distribution, where the marginals of the
variables are beta(0.5, 0.25), beta(0.75, 0.75) and beta(0.2, 0.8). Estimates of the two smallest
eigenvalues are 0.026 and 0.031, hence the estimated APCs closely capture the exact
singularities. The function estimates, shown in Figure 5.15, strongly suggest the true step
functions. The jump in the step function is smeared by the smoothing, resulting in a steep
segment just at the cut point between flat mesas on either side.
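The step-function singularity is easy to reproduce numerically. In the sketch below (hypothetical code; the conditional draws are placeholders and do not reproduce the beta marginals quoted above), the three coordinates share a common sign indicator, the standardized two-valued transforms φ_i are constructed, and any contrast with Σ c_i = 0 is seen to have variance essentially zero.

    import numpy as np

    rng = np.random.default_rng(2)
    n = 200
    b = rng.integers(0, 2, size=n)                  # common indicator: which of the two regions
    # each X_i is negative when b == 0 and non-negative when b == 1
    # (arbitrary continuous conditional distributions, for illustration only)
    x = np.where(b[:, None] == 1,
                 rng.uniform(0.0, 1.0, size=(n, 3)),
                 -rng.uniform(0.0, 1.0, size=(n, 3)))

    # two-valued transforms phi_i, standardized to mean 0 and variance 1
    phi = np.where(x < 0, -1.0, 1.0)
    phi = (phi - phi.mean(axis=0)) / phi.std(axis=0)

    c = np.array([1.0, -0.5, -0.5])                 # a contrast, sum(c) = 0
    print(np.var(phi @ c))                          # ~ 0: an exact additive singularity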
5.8 Conclusions
The simulations of this Chapter draw attention to two factors influencing the accuracy
of APC estimation. First, the separation between the eigenvalues, which affects the
performance of the algorithm; and second, the absolute size of the eigenvalue, which
determines the intrinsic variability of the estimation problem.
Separation between eigenvalues
When the eigenvalues are close, we observe slow convergence and lack of uniqueness or
resolution in the estimates.
The rate of convergence of the algorithm is controlled by the ratio

    |α − λ*| / |α − λ'|,

where λ' is the target eigenvalue, λ* is the next smallest eigenvalue to λ', and α is the
spectrum shift (Chapter 4.2.4).
When the ratio is close to one, convergence is slow, so the algorithm's convergence criterion
may be satisfied before convergence has occurred. This ratio is near one if the eigenvalues
are close, or if α is large relative to the difference between λ* and λ'. In the
Figure 5.15: Discrete APC : Estimates of the Two Smallest APCs
former case of close eigenvalues, lack of uniqueness is unavoidable. In the latter case, where
the parameter α is large relative to the eigenvalue separation, we have slow convergence
that is a troublesome artifact of the estimation technique itself.
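The effect of the ratio can be seen in the linear-algebra analogue of the algorithm, power iteration on the shifted operator αI − A (a sketch under that simplification; the thesis algorithm alternates smoothing steps rather than multiplying by an explicit matrix). The error contracts by roughly |α − λ*| / |α − λ'| per sweep, so an unnecessarily large shift slows convergence even when the target eigenvalue is well separated.

    import numpy as np

    def shifted_power_iteration(A, alpha, n_iter=50, seed=3):
        # Power iteration on (alpha*I - A): converges to the eigenvector of the
        # smallest eigenvalue of A, at a rate governed by |alpha - lam*| / |alpha - lam'|.
        rng = np.random.default_rng(seed)
        v = rng.standard_normal(A.shape[0])
        B = alpha * np.eye(A.shape[0]) - A
        for _ in range(n_iter):
            v = B @ v
            v /= np.linalg.norm(v)
        return v, v @ A @ v                     # eigenvector and Rayleigh quotient

    A = np.diag([0.10, 0.15, 0.8, 0.9, 1.0])    # eigenvalues of a toy operator
    for alpha in (1.1, 10.0):                   # a modest shift and an overly large one
        _, lam = shifted_power_iteration(A, alpha)
        print(alpha, lam)                       # the large shift has not yet resolved 0.10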
The phenomenon of "mixing", referred to in several of the simulations, is a consequence
of this artifact. Precisely, "mixing" means that a small fraction of another APC
remains in the current estimate although the convergence criterion has been fulfilled.
When iteration is stopped, the APC-function estimate is φ̂_i = (1 − δ) φ_i^(1) + δ φ_i^(2) instead of
the true φ_i^(1). If the variable loading of φ_i^(1) is large, the contribution from the second APC
will be noticeable only as a consistent bias in the estimation, since φ_i^(1) will dominate.
However, if the true loading is small or zero, the residue φ_i^(2) will contribute a nontrivial
function as the estimated APC-function.
The first specified constraint manifold (SCM-S1) demonstrates a situation where the
dependencies are strong and so accurate estimates are expected; however, the eigenvalues
are close enough that the estimates exhibit a bias due to mixing of the two smallest
APCs. The second specified constraint manifold (SCM-S2) does not suffer from such
inaccuracy: even though the eigenvalues are considerably larger, and thus the dependencies
are weaker, their separation is sufficiently large that mixing does not occur.
This mixing phenomenon explains the large bias sometimes observed in variable
loading estimates whose true values are near zero. In GAU-S2, the third and fourth variables have
zero variable loadings in the smallest APC, although nonzero values in later APCs. The
size of the estimated loadings in the smallest APC is explained by the presence of a small
proportion of a higher APC; further evidence is given by the observed decrease in the
canonical metric. This also explains the systematic bias in the APC-function estimates of
Variables 1 and 2 discussed in the SCM-S1 simulation.
The phenomenon of mixing, since it is caused in part by an inaccurate assessment of
convergence, might be removed by using a more exacting convergence criterion. Convergence
is determined by evaluating the change in the estimated eigenvalue, so the stringency
of the convergence requirement is constrained by the numerical accuracy of the eigenvalue
estimation. Hence, the mixing cannot always be circumvented simply by forcing longer
iteration. An alternative approach, which gives more resolution of the separate
estimations, is to improve the estimate of α. Good initial estimates of α can be obtained from
the starting estimates as described in Section 4.2.4.
Size of the eigenvalues
As eigenvalues approach 1, the variance of their estimates increases, hence APC estimation
is intrinsically less accurate as more components are estimated. This is well known to be
true for linear principal components, where the variance of the usual MLE estimates is
larger for middle eigenvalues than for extreme eigenvalues.
Also, for continuous distributions, as more components are estimated, it becomes
more likely that the APCs will be ill-determined. For estimates to be well-determined the
separation between eigenvalues has to increase as the eigenvalues increase, and yet the
sequence of eigenvalues accumulates at 1, so typically the distance between higher eigen
values decreases. These conflicting tendencies will mean that only the first few APCs of
a continuous distribution can be reliably estimated, and only then when the eigenvalues
of these components are small.
UNI-S illustrates the case where, despite separation between the eigenvalues, the
absolute size of the eigenvalues is such that the corresponding dependencies are not well
determined.
Numerical Accuracy of Estimation
Evaluation of the success of the estimation has two components, which could be loosely
termed the qualitative and the quantitative assessment: quantitative assessment refers
to accuracy of numerical estimates, whereas qualitative to the more subjective graphical
comparisons of function estimates.
Attempting to make generalizations about the accuracy of the numerical quantities -
the eigenvalues and variable loadings - is difficult, since there are no entirely consistent
trends in the bias, standard deviation or RMS deviation.
• Eigenvalues are consistently underestimated. This bias is an inherent property of
the estimation technique, attributable to the degrees of freedom used in the
APC-function estimation - in effect the data is overfitted by the smoother in the
iterative algorithm. Estimation of eigenvalues for the uniform distribution seems to
be more accurate than other distributions.
• Estimation of the variable loading vector, when assessed using the loading metric,
is accurate in all simulations. The variance of the estimates, however, is often large,
and increases with the eigenvalue.
• The canonical metric, except for cases explicitly commented upon, increases with
the number of components, indicating an increase in the variance of estimation as
more components are estimated.
The decrease in precision of estimation for the higher APCs is attributable in part to
the orthogonality constraints: to maintain orthogonality, a Gram-Schmidt step removes
previously estimated APCs, so estimation errors in the lower APCs are propagated
into the new APC.
Accuracy of Function Estimation
Assessment of the accuracy of function estimates is subjective; our main concern is whether
the estimated functions reproduce the global characteristics of the transforms, and hence
retain the sense of the dependency between the variables, rather than whether we have
local accuracy at each observation.
By this criterion, the estimation is impressively accurate for the smallest APC in all
simulations. If the eigenvalues are "well-separated", which depends on both the relative
separation between eigenvalues and the absolute size of the eigenvalue, APC-function
estimates are excellent for the second and third APCs also. APC-function estimates are
most reliable when the dependencies are strong.
The qualitative accuracy of the estimation implies that for data analysis the APC
function estimates can yield reliable information about the relationships between the
variables when the dependencies are strong. However, more care must be taken with
interpretation of the eigenvalues and loadings, whose numerical values may not be
accurate.
Independence and Singularity
The estimates of APCs for data from the independent Gaussian distribution are a cautionary
example: data which have no dependent structure can have an estimated eigenvalue as low
as 0.6! A hallmark of these spurious APCs, which exploit outlying points of the sample,
is APC-functions that are steep in the tails of the variable marginals.
On the other hand, APC estimates for data lying on an additive manifold recover
the additive structure very accurately, even when the manifold has co-dimension greater
than 1. Note, however, that for the example of Section 5.7 the accuracy of estimation is
somewhat implementation dependent. The supersmoother is able to approximate a step
function fairly readily, whereas a more rigid smoothing technique might not recover the
constraint as accurately.
Chapter 6
Applied Additive Principal
Component Analysis
6.1 Introduction
Prior to embarking on the APC analysis of a real data set, we present a detailed
consideration of interpretation techniques and issues. The need to examine methods of
interpretation is perhaps not immediately clear; however, in moving from a simple linear sum to the
more general additive function of the variables, the increase in flexibility is accompanied
by an increase in complexity. While linear relationships between variables have natural ge
ometric frames of reference, which make interpretation a relatively straightforward affair,
these are rarely useful for additive functions. Instead, it is necessary to develop frames
of reference that are meaningful for additive functions, often by simple analogy with the
linear case, and to develop new intuition, using the guidance of exact solutions.
The most important motive behind this discussion, though, is entirely pragmatic -
unless interpretation can be made both simple and comprehensible, we have gained nothing
from the additional flexibility of additive functions. We develop a technique for
interpretation of additive dependencies that utilizes dynamic graphics technology to provide an
elegant method for understanding the analysis.
In the first section we discuss each estimated quantity, its properties and interpretation.
The second section draws on insights gained from the exact solutions to explain some
expected anomalies in the behavior of the sequence of estimated APC. The final sections
illustrate the estimation and interpretation of the APC on some real data sets.
6.2 Interpretation Techniques for Data Analysis
From the sample APC computed by the algorithm we are interested in estimates of the
eigenvalue, variable loading vector, APC and APC-function.
The Eigenvalue
The eigenvalue, var Σ_i φ_i(X_i), is bounded between 0 and 1. Exact additive dependence
results in a zero eigenvalue, whereas an eigenvalue of 1 indicates the transformed variables
are uncorrelated. The eigenvalue measures the overall strength of the dependency. If the
equation is considered as a linear constraint in the transformed variables, the eigenvalue
gives the variance of the data around the linear manifold it defines.
Recall from Chapter 2.4, that the eigenvalues of an infinite sequence of APC tend to 1,
and every eigenvalue distinct from 1 has finite multiplicity. As the eigenvalues approach
one, the relative separation between the eigenvalues will usually decrease. This implies the
higher APC estimates of continuous variables are likely to become increasingly unstable.
Add to this the cumulative errors of estimation and it is unlikely that estimates of the
fourth and fifth APC will be reliable, particularly if they have large eigenvalues.
Possibly the most important purpose of the eigenvalue is for detecting when the eigen
values are not distinct. Since none of the following discussion on interpretation is relevant
unless the eigenfunctions are unique, it is essential to first verify that the eigenvalue of the
next APC is well separated from the present eigenvalue. If it is not, other than recognizing
there is additive dependence which involves those variables with nonzero transforms, no
meaningful interpretation can occur without additional information.
The APC
The APC, Φ(X) = Σ_i φ_i(X_i), is like a residual function, since it represents the departure
from the manifold defined by the constraint Σ_i φ_i(X_i) = 0. Hence the estimate, Φ̂(X) =
Σ_i φ̂_i(X_i), can be interpreted as a residual vector, implying that features that are of
interest in regression residual analysis are also informative here.
Outlying points in the residual (estimated APC) indicate observations that do not lie
near the manifold, and are thus unlike the rest of the data set.
The distribution of the residual is informative for eigenvalue interpretation; the vari
ance estimate of the APC could have been inflated by outliers, or be misleading due to
extreme skewness or kurtosis. The spread of points may also reveal something of the
type of dependency detected: when the structure is caused by clusters within the data,
typically the APC will have small groups of outlying values, whereas when continuously
dependent on the variables, more symmetric patterns will be apparent.
It is informative to examine scatterplots of residuals belonging to different eigenvalues,
that is, Σ_i φ̂_i^(k) versus Σ_i φ̂_i^(k'). This is analogous to the plot of the projection of the
data onto the principal component coordinate planes, commonly made in linear principal
component analysis. Usually plots corresponding to the largest eigenvalues are made -
as these projections have the largest variance they are often informative low-dimensional
plots. However, we are estimating the smallest APCs, hence plotting Σ_i φ̂_i^(1) versus Σ_i φ̂_i^(2)
is analogous to the least informative projection of the data, that is, the projection with the
smallest variance. The APC are linearly uncorrelated, as is also true for the linear case.
Outliers in the scatterplot will indicate points not lying near the (p - 2)-dimensional man
ifold defined by the corresponding constraints. This plot may detect points not unusual
in either of the individual residuals.
An important difference between the additive and linear plots, is that the residual plot
in the additive case cannot be considered to be a projection of the original data, or even of
the transformed data. Two different sets of transforms are involved, so there is no simple
relationship between the plot and the original data. Formally, the random variables Σ_i φ_i^(1)
and Σ_i φ_i^(2) minimize var(Σ_i φ_i) + var(Σ_i φ'_i) subject to ‖Φ‖ = ‖Φ'‖ = 1 and Φ ⊥ Φ'. With the
additional constraint that var Σ_i φ_i^(1) is minimized, this is a joint characterization of the
two smallest APC.
The Variable Loadings
By the variable loadings of an APC, we mean the standard deviations of the transformed
variables, s_i = σ(φ_i(X_i)). As explained in Chapter 5.2, these are analogous to the variable
loadings of linear principal component analysis. As Σ_i var φ_i(X_i) = 1, the loadings are
constrained to lie on the unit sphere, Σ_i s_i² = 1. The loadings quantify the relative contribution
from each variable, and hence the extent to which each variable is involved in the
dependency.
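In terms of estimated quantities, the loadings are simply the column standard deviations of the transformed data, rescaled so their squares sum to one. A small illustration (the array phi_hat is a hypothetical stand-in for estimated APC-functions):

    import numpy as np

    # phi_hat: estimated APC-functions phi_i(x_i), arranged column-wise (toy stand-in here)
    phi_hat = np.random.default_rng(0).standard_normal((200, 6))

    loadings = phi_hat.std(axis=0)
    loadings = loadings / np.sqrt(np.sum(loadings**2))   # squared loadings now sum to one
    scale = np.sqrt(np.sum(phi_hat.var(axis=0)))
    eigenvalue = np.var(phi_hat.sum(axis=1) / scale)      # var of the sum after normalization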
The Function Estimates
The estimate of each variable's APC-function, φ̂_i, no longer has an analogy in linear
principal components. For linear relationships the signs and magnitudes given by the
loadings are sufficient to describe the dependencies completely. For additive dependencies,
interpretation is often more complex and hence requires more sophisticated techniques.
We discuss several approaches to the interpretation of the function estimates:
• A Linear Technique : The smallest APC has a characterization as the smallest
linear principal component of the transformed data, as explained in Section 2.6.
Interpretation techniques of linear analysis can be validly applied to the standardized,
transformed variables. The obvious disadvantage of this approach is that the
transforms do not always define a meaningful scale. Another disadvantage is that
the extension to further APCs is awkward. In linear analysis, the orthogonality of
the components is a geometric constraint with a natural interpretation. For additive
analysis, orthogonality only implies the APCs are uncorrelated, which has no
geometric analogy. We can state that the second smallest APC is the smallest linear
principal component of ‖Φ^(2)‖^{-1} Φ^(2) that is uncorrelated with Φ^(1), but since
Φ^(1) and Φ^(2) are usually different sets of transforms, this statement is not terribly
enlightening.
• A Regression Technique : For APCs with small eigenvalues, the data come close
to satisfying the implicit equation Σ_i φ_i(X_i) ≈ 0. Usually the variable with
the largest loading, say X_m, is the most important variable in determining the
constraint. Even small changes in φ_m(X_m) will demand modification of all other
variable transforms, if the constraint is to hold. Conversely, a change in any of the
other variable transforms is likely to be reflected in φ_m(X_m). Informally, then, it
seems reasonable to solve the implicit equation for φ_m(X_m) and write

    φ_m(X_m) ≈ − Σ_{j≠m} φ_j(X_j).

Then, regarding this as an ACE regression model, we can use interpretive techniques
of regression analysis.
A more formal justification for choosing the variable with the largest loading for the
role of dependent variable in the regression equation is as follows. From the implicit
equation of the APC, there are p possible regression models, one for each variable.
The regression model − Σ_{j≠m} φ_j(X_j) for the dependent variable φ_m(X_m) has
R² = 1 − λ s_m^{-2}, where λ = var Σ_j φ_j(X_j) and s_m is the m-th
variable loading. This is maximized by the largest variable loading, hence the regression
with the transformed variable with the largest loading as the response is the best
in terms of R².
The appeal of this interpretation is the familiarity of the regression framework and its
extensive analysis tools - however it is only an approximation to the true symmetric
dependency of the APC. The relationship between the ACE regression solution and
the APC solution has been discussed in Chapter 2.7. Recall that the two are identical
if the eigenvalue is zero, hence a regression interpretation is most appropriate when
the eigenvalue is small.
When a regression interpretation of the APC is plausible, we could regard the APC
solution as an alternative to ACE regression, with the possible advantage that APC
treats the variables in a symmetric fashion. The APC equation allows the data to
suggest which variables show strong dependence and hence make reasonable candi-
dates for the role of a response when rewriting the implicit equation in a regression
form. This may be appropriate if no variable has been designated as the response.
• The Brushing Technique : The ensuing technique is the most accurate and
simplest way to interpret an APC estimate. Explanation is most effectively achieved
by way of illustration; a small programming sketch of the idea is given at the end of this section.
A two dimensional additive manifold in three dimensions is described by the constraint

    φ_1(X_1) + φ_2(X_2) + φ_3(X_3) = 0.                                        (6.1)
This constraint implies that when φ_1 is large and positive, φ_2 + φ_3 must be large and
negative. If φ_1, φ_2, φ_3 were linear we could deduce from the signs and magnitudes
of the loadings how the variables enter into the dependency. However, in general,
it cannot be inferred from the functions themselves how φ_2 and φ_3 individually
contribute to the constraint. To discover the interaction of the transformed variables
in APC analysis, a different approach is needed. One simple and effective approach
is to combine the graphical techniques of brushing and connecting plots.
Brushing is the interactive technique whereby points on a graphical display are
selected by means of an interaction device, such as a mouse or cursor. Selected
observations are marked or highlighted by immediate change of color or plotting
symbol, enabling interactive identification of observations of interest on the plot.
Connecting plots is a technique used where two or more plots display different variables
from the same data set, hence all plots depict the same observations. If the
plots are connected, brushing on any one of the plots causes the corresponding observations
to be highlighted in all the connected plots.
The data we are going to look at are a sample of 200 points from the manifold
described by an equation of the form (6.1). The APC algorithm is used to calculate
estimates φ̂_1, φ̂_2, φ̂_3 of the defining functions, for which var(φ̂_1 + φ̂_2 + φ̂_3) = 0.0038.
For each variable we plot φ̂_i(X_i) versus X_i, called the APC-function plots, and then
connect the three plots.
Figure 6.1: Interpretation Example : The APC-function plots, brushing on the APC-function of Variable 1
The estimated manifold is defined by φ̂_1(X_1) + φ̂_2(X_2) + φ̂_3(X_3) = 0. When the
observations with high values of the transformed variable φ̂_1 are brushed, corresponding
observations in the other plots are highlighted, and the values of φ̂_2 and φ̂_3 show exactly
how these transforms fulfill the constraint. In Figure 6.1, the three APC-function plots
are shown in the top row, with brushing occurring in the APC-function plot of Variable 1.
High values of φ̂_1 correspond with low values of both φ̂_2 and φ̂_3.
The real value of these plots, however, is that they enable direct interpretation of
the constraint in terms of the original, untransformed variables. In Figure 6.1, the
projection of the highlighted points onto the X-coordinate in each APC-function plot
Figure 6.2: Interpretation Example : The APC-function plots, brushing on the APC-function of Variable 2
shows that high X_1 values correspond with center values of X_2, and with the highest
and lower middle values of X_3. Highlighting the highest points of φ̂_2, in Figure 6.2,
shows that extreme values of X_2 (both high and low) are associated with low values of
X_1, and middle values of X_3.
This technique can be extended to make use of residual information from the APC,
r = φ̂_1 + φ̂_2 + φ̂_3. A fourth plot, a histogram of r, is connected to the variable plots,
as in Figures 6.1 and 6.2. The residual histogram is the leftmost plot of the bottom row;
the two remaining plots show a grid of the estimated manifold and the data cloud,
showing the data configuration in variable space.
Figure 6.3: Interpretation Example : The Added Variable plots and APC residual plot.
The connected residual plot can be used for two different diagnostic purposes.
First, the observations outlying in the residual can be identified in all APC-function
plots. The plots Σ_{j≠i} φ̂_j(X_j) versus X_i, which are akin to added variable plots for
each variable smooth, can also be connected to the residual plot, and from these
the discrepancy can be diagnosed. In Figure 6.3, the three added variable plots are
connected to the APC residual plot. The observation with the large positive residual is
very close to the lowest point of the marginal distribution of all three variables; however,
low values of X_1 and X_2 are not associated with low values of X_3 in the estimated manifold.
This observation does not lie close to the estimated manifold, hence appears as an outlier.
Figure 6.4: Interpretation Example: The APC-function plots, brushing on residual plot
Second, the residual plot is used to aid interpretation of the dependency relationship.
Observations which have large residuals do not lie near the manifold, hence when
brushing the APC-function plots, observations which have large residuals should not
be highlighted because they do not reflect the dependency detected by the constraint.
The residual plot can be used to "downlight" observations with large residuals before
using the APC-function plots to interpret the relationship. Compare Figure 6.4, in
which the observations with large residuals have been downlighted, with Figure 6.2.
The correspondence between the transformed variables is now sharper, since the
observations not close to the manifold are no longer highlighted.
The graphical devices used for this last brushing technique provide very powerful
interpretive tools, since we can then view the dependency simultaneously in the transformed
space, the variable space and the residual space. The multiple views of the APC
encourage a very detailed acquaintance with the structure implied by the APC-functions,
and the observations not obeying this structure. The transparency of the brushing
interpretation makes assimilation of the information in the APC estimates both feasible and
comprehensible.
The power of this technique increases the usefulness of APC analysis dramatically.
The translation of the estimated constraint back to relationships between the variables
themselves would be extremely complex with only classical methodology. Brushing on
connected APC plots provides an accurate and elegant method for interpretation.
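The sketch promised in the description of the brushing technique is given below (Python with matplotlib, which the thesis does not use; the arrays x and phi and the brushing rule are illustrative assumptions). A shared boolean mask plays the role of the brush, so the same observations are highlighted in every connected APC-function plot, and the mask can also downlight observations with large APC residuals.

    import numpy as np
    import matplotlib.pyplot as plt

    def connected_apc_plots(x, phi, brush):
        # x: (n, p) raw variables; phi: (n, p) estimated APC-functions phi_i(x_i);
        # brush: shared boolean mask of length n (the "connected" selection)
        n, p = x.shape
        fig, axes = plt.subplots(1, p, figsize=(3 * p, 3), sharey=True)
        for i, ax in enumerate(np.atleast_1d(axes)):
            ax.scatter(x[:, i], phi[:, i], s=8, color="lightgray")
            ax.scatter(x[brush, i], phi[brush, i], s=16, color="black")  # highlighted points
            ax.set_title("Variable %d" % (i + 1))
        return fig

    # example brush: the observations with the highest values of the first transform,
    # after downlighting observations with large residuals
    # residual = phi.sum(axis=1)
    # brush = (phi[:, 0] > np.quantile(phi[:, 0], 0.9)) & (np.abs(residual) < 2 * residual.std())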
6.3 Guidelines for Detecting Real Structure
The relationship between the orthogonal decomposition of the operator P and the APC
explained in Chapter 2.4 is an elegant theoretical result. Yet the practical implications of
this result for APC estimation are rather strange.
For continuous variables, there is an infinite sequence of transformations of the data,
each of which describes an independent relationship between the variables. From a data
analysis point of view, this is clearly nonsensical. At some stage there must either be
some redundancy in the representation, or component estimates where idiosyncrasies of
the sample are being exploited to create spurious relationships. Recognizing either case
would signal when the APC have ceased to provide practically relevant information, or
equivalently when the eigenvalue of the APC is "too large". There is some theoretical
precedent for both these possibilities that may help determine, in an applied setting,
when this may have occurred.
6.3.1 Judging Redundancy
Redundancy in representation is a phenomenon that occurs in both the Uniform and Gaussian
distribution cases of the theory and simulation chapters. By the term redundant
we mean any additive constraint that does not reveal new structure in the data, given the
smaller APCs. From an applied viewpoint, any redundant APC can be discarded.
For the Gaussian distribution, the correlation matrix completely describes the variable
dependencies. The linear principal components are sufficient for the correlation matrix,
hence the only relevant dependencies are linear. Thus the APC based on higher order
polynomials are redundant. In fact, for the Gaussian distribution, since the APCs of k-th
order polynomials are sufficient for R^k, we would hypothesize that for any non-linear APC
there is a linear relationship between the variables, based on the same dependencies, with
a smaller variance.
For the Uniform distribution on an ellipsoid, the variable dependencies are described
by the linear and quadratic dependencies (these completely specify the orientation and
shape of the distribution ellipsoid). The higher order polynomials are thus redundant.
Another important case of redundancy occurs when only two variables are involved in
a strong dependency, φ_1(X_1) + φ_2(X_2) ≈ 0. Then, for any f, f(φ_1(X_1)) + f(φ_2(X_2)) ≈ 0.
If (f(φ_1), f(φ_2)) ⊥ (φ_1, φ_2), the resulting APC is redundant.
The Gaussian and Uniform distributions can thus be used to characterize some forms
of redundancy. Empirically, it has been observed that the APC transformed variables
of a data set are usually more symmetric than the untransformed data. This is not
surprising, in view of the fact that if a transformation to multivariate Gaussianity, or to
any Gegenbauer distribution exists, a linear sum of those transformed variables is an APC
solution. Hence, at least when the transformed variables have continuous range, it might
be expected that the APC-functions will tend to symmetrize the variables.
Now, suppose the smallest APC transformed variables are φ_1(X_1), φ_2(X_2), ..., φ_p(X_p).
If there is no other independent constraint in the data, and if the "variables"
φ_1(X_1), φ_2(X_2), ..., φ_p(X_p) are approximately Gaussian, the second smallest APC of X is
either the second linear principal component, or the smallest quadratic principal component,
of these transformed variables. In either case, the second APC is redundant, since
it reiterates the dependencies of the smallest APC. The same behavior would usually be
true if the transformed variables were Gegenbauer. Thus when the transformed variables
are symmetric, it seems plausible that this polynomial redundancy may exist.
There are several indicators of redundant representation which may be valuable when
trying to discern redundancy in APC estimation. When an APC is redundant, it must
involve the same variables as a smaller APC. The new transforms will be derived from the
transforms of the smaller APC : either close to identical, or squared versions of the smaller
APC-functions. The latter case suggests, that if none of the transforms are monotonic,
the APC may be redundant. Finally, if the APC is redundant, by definition it cannot
reveal new insight into the dependencies between the variables, hence its interpretation
will reveal dependencies already discovered.
6.3.2 Assessing Spuriousness
The transformations that exploit idiosyncrasies of the sample are usually easy to detect.
They typically approximate an extreme case of the discrete APC, Section 3.7. Data
spatially separated in at least two variables lie exactly on an additive manifold described
by step functions. Suppose there are a small number of observations in the data which are
extreme in two variables. Then a "spurious" dependency can be formed by considering
these observations as a cluster, and using step function transforms to separate these points
from the body of the data. This is the type of behavior observed in the APC estimates of
the uncorrelated Gaussian and Uniform distributions in Chapter 5.6, where points on the
perimeter of the data cloud were exploited by the algorithm to create additive functions
with lower variance.
In a data analysis, care must be taken before describing a transformation as detecting
spurious structure. If there are clusters in the data distribution, these are dependencies
that may well be important to the data analyst. It may also be useful to detect outlying
points in the joint multivariate distribution for a data sample - it is just that one needs to
be careful not to interpret the outliers as a bona fide dependency existing in the population
at large. If an APC has transforms that are strongly influenced by a small number
of unusual observations and a large eigenvalue, relative to the uncorrelated Gaussian
estimates, then it is probably spurious.
6.4 The Infant Mortality Data
6.4.1 Introduction
In the late 1960's, Ernest J. Sternglass alleged that the radioactive fallout from nuclear
weapons testing significantly raised infant mortality rates [Ste69b], [Ste69a]. The scientific
community, on examination of his evidence, thought the claim inadequately supported:
We cannot support Dr. Sternglass's conclusion and ascribe changing patterns in infant
and fetal mortality to the cause and factors invoked by Dr. Sternglass - certainly not
with the conviction or certainty required by most epidemiologists and statisticians.
It is clear from a review of all of the data that certain gaps in the knowledge of how
environmental levels of Sr-90 may effect the genetic material of individuals still exist
and further studies in this direction are probably warranted. [Mill
However, outside of criticizing his conclusions, no other independent study of the effects
of levels of radioactivity on infant mortality was attempted. This data set was collected
by Fuchs [Fuc79], in an effort to reexamine the issue.
The data set consists of 528 observations taken over a period of 11 years (1960-1970)
on each of the 48 states (excluding Alaska, Hawaii, Washington D.C.). The response
variables considered are postnatal, prenatal and total infant mortality rates (the latter
being the sum of the two previous), measured as the number of deaths per 1000 births in
each state. Covariates are state categories (State), year of observation (Year), per capita
income calculated in 1970 dollars (Income), and percentage of non-white births in the
state, (%Nwbirth). Radioactivity is measured by the levels of Strontium-90 (8r-90) and
Cesium-137 (C8-181) levels in city milk supply. While radiation is measured by both 8r-90
and C8-187 levels; 8r-90, being biologically longer lived, was the variable more favored by
the investigators, and is the variable used here. Details of data sources and aggregation
can be found in Fuchs [Fuc79].
Table 6.1: TIM : The Additive Regression Models
X_i          σ(s_i(X_i)), for each of the five fits
%Nwbirth     0.764     0.978     1.018     1.044     0.754
Income       0.072     0.142     0.029     0.180     0.054
Sr-90        0.127     0.037     0.067     0.066     0.042
State        0.440     0.552     0.546     0.596     0.404
Year         0.431     0.483     0.487     0.487     0.515
R²           0.9301    0.9297    0.9289    0.9303    0.9288
Order        (12345)   (23451)   (34512)   (41253)   (51234)
6.4.2 Additive Regression Models
We fitted an additive regression model for the response variable total infant mortality, with
covariates Year, State, %Nwbirth, Income, Sr-90. The regression model was computed five
times, with the variables presented to the algorithm in different orders, which is simply
equivalent to using different initial values for the transforms.
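For intuition about this order dependence, the following is a minimal backfitting sketch (Python; the smoother and function names are illustrative assumptions and do not reproduce the fitting procedure used for Table 6.1). When there are strong additive dependencies among the predictors the backfitting sweeps converge slowly, and stopping after a fixed number of sweeps can leave order-dependent transform estimates.

    import numpy as np

    def poly_smoother(x, y, degree=3):
        # a crude stand-in for the scatterplot smoother used in additive regression
        return np.polyval(np.polyfit(x, y, degree), x)

    def backfit(y, X, order, n_iter=25, degree=3):
        # Backfitting for an additive model y ~ alpha + sum_j f_j(X_j);
        # `order` is the sequence in which the variable transforms are updated.
        n, p = X.shape
        f = np.zeros((n, p))
        alpha = y.mean()
        for _ in range(n_iter):
            for j in order:
                partial = y - alpha - f.sum(axis=1) + f[:, j]
                f[:, j] = poly_smoother(X[:, j], partial, degree)
                f[:, j] -= f[:, j].mean()        # keep each transform centered
        return alpha, f

Calling backfit(y, X, order=(0, 1, 2, 3, 4)) and again with a rotated order, then comparing the columns of f, mimics the five refits summarized in Table 6.1 (y and X being hypothetical centered data arrays).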
The estimated transforms are shown in Figure 6.5, and the standard deviations of these
transforms are given in Table 6.1, where the response variable has been scaled to variance
1. Each solution provides an almost identical model for Total infant mortality, as seen
by the small changes in R², and yet the models are by no means the same. The primary
predictive variables are State, %Nwbirth and Year; however, both the State quantifications
and the contribution from %Nwbirth vary considerably. The minor variables also show
fluctuation in the transforms.
The differences between these models are not important if considering predictive use
of the model. However, we are intent on examining the marginal effect of Sr-90, which is
unstable, but not necessarily insignificant.
Since we know that instabilities such as these, induced by merely altering the starting
values, can result from additive dependencies among the predictor variables, an additive
principal components analysis could explain why the models differ.
Figure 6.5: TIM: The Additive Regression Models
Table 6.2: TIM-5var : Eigenvalues and Variable Loadings
Components First Second Third
Eigenvalue 0.0067 0.0226 0.0350
Loadings: %Nwbirth 0.667 0.128 0.343
Income 0.166 0.491 0.572
Sr-90 0.010 0.749 0.210
State 0.715 0.279 0.337
Year 0.120 0.321 0.606
6.4.3 APC analyses
APC of the full data set
The several smallest APCs were computed for the five independent variables %Nwbirth,
Income, Sr-90, State and Year. Their eigenvalues and variable loadings are given in Table
6.2.
The smallest principal component detects a strong relationship between %Nwbirth
and the categorical variable State, with an eigenvalue of 0.0067. This simply reflects the
small within state variation of %Nwbirth - since the data cluster tightly within each
state, a low variance APC is obtained by setting state quantifications to the within state
mean of %Nwbirth. The Income variable contributes only where income is low (for
high incomes the transform is constant), and shows that, for those observations, lower income
accompanies lower %Nwbirth (Figure 6.6).
The second smallest APC, shown in Figure 6.7, is primarily a relationship between
Sr-90, Year and Income. State only makes an additive adjustment for the level of Florida
(State 29). The eigenvalue of 0.0226 shows this dependency is also strong.
The third APC has an eigenvalue of 0.0350, which is reasonably close to the second eigenvalue.
Figure 6.6: TIM-5var : APC-functions of the smallest APC
Figure 6.7: TIM-5var : APC-functions of the second APC
Figure 6.8: TIM-5var : APC-functions for the third APC.
All variables contribute significantly in this dependency, yet the variables
group into two distinct relationships, one between Year, Income and Sr-90, and a second
between State and %Nwbirth. This is discovered using the technique of brushing on con
nected APC-function plots: in Figure 6.8, cases with high values of the transformed Year,
have no systematic relationship with transformed values of either State or %Nwbirth, and
the same is true for the transforms of Income, Sr-90.
The last comment leads to consideration of the implication of the near degeneracy
between State and %Nwbirth revealed by the smallest APC. The within state variance of
%Nwbirth is smaller than the between state variance, hence transforming state scores to
correspond to state means of %Nwbirth results in a low variance APC, as observed. This
APC involves a strong dependency between only two variables, so is a likely candidate for
generating redundant APC. The variable State is categorical, so has enormous flexibility
in its transformation (47 df). In fact any sequence of orthogonal transformations of the
continuous variable %Nwbirth can be matched by appropriately orthogonal assignment of
state scores, producing a sequence of redundant APC with low variance. With both State
and %Nwbirth in the analysis, it will be difficult to assess dependencies between other
variables, because there is a high density of small eigenvalues from the redundant APC.
In the simulated examples small separation between eigenvalues was seen to cause mixing
in the estimated eigenfunctions. In this data set, the third APC shows evidence of mixing
from a redundant APC, since the transforms of State and %Nwbirth are not related to the
transforms of Year, Income and Sr-90.
The main factor causing these low eigenvalues and contributing to the observed be
havior is the large number of categories in State, so it will be informative to estimate the
APC of the four continuous variables only. This is a simple way to avoid the masking
effects of mixing of the numerous small APCs of the 5 variable data set, enabling us to
view cleanly the separate dependencies existing between the continuous variables.
APC without State categories
The smallest APC of the data set without the state categories involves Year and Sr-90,
and achieves a variance of 0.080. The transforms of Figure 6.9 reflect the change
in radioactivity levels over the 11 years of the data - low in the beginning and ending
periods, and peaking in 1965-66 during the period of heaviest nuclear testing. Eigenvalues
and variable loadings are given in Table 6.3.
The second APC involves all four variables, but most strongly Income and Year. The
transforms shown in Figure 6.10 can be interpreted as depicting the relative change in
wealth during these 11 years, where incomes were higher in the middle years than the
early and late periods. The variance achieved is 0.169.
The third APC has a much higher variance of 0.287. From the transforms in Figure
6.11 it is suspected that this APC is redundant, as none of the transforms are monotonic,
and the strong variables are Sr-90, Income and Year.
Table 6.3: TIM-4var : Eigenvalues and Variable Loadings
Components            First    Second   Third    Smallest Linear   Second Linear
Eigenvalue            0.0805   0.1689   0.2866    0.201             0.860
Loadings: %Nwbirth    0.094    0.277    0.030     0.198            -0.418
          Income      0.144    0.777    0.458    -0.700             0.206
          Sr-90       0.757    0.295    0.730     0.685             0.284
          Year        0.630    0.482    0.506     0.038             0.837
.. " • "Figure 6.9: APC-functions for the smallest APC
Figure 6.10: TIM-4var : APC-functions for the second APC
Figure 6.11: TIM-4var : APC-functions for the third APC
From the discussion of Section 6.3.1, we know that if the transformed variables
φ^(1)(Sr-90), φ^(1)(Year) and φ^(2)(Income) are approximately Gaussian, then scaled
Hermite polynomials of the transformed variables are APCs of X. (Note that
φ^(1)(Year) ≈ φ^(2)(Year), so the first two APCs are approximately linear in these
transformed variables.) A quadratic transformation of the APC-functions of
the two smallest APCs describes very closely the transforms observed for the third APC.
Furthermore, interpretation of the transforms does not add any further information to
the dependencies detected using the two smallest APC. We conclude the third APC is
redundant.
The residual plot of Σ_i φ̂_i^(1)(X_i) versus Σ_i φ̂_i^(2)(X_i), shown in Figure 6.12, shows a distinct
group of outlying points in the bottom right corner. These points, which correspond
to the eleven observations of Florida, do not obey the implicit relations of these APCs.
Notice that on neither of the marginal projections would these points be clearly unusual.
Recalling the separation of the transform value for this state in the second APC of the five
variable analysis, it is clearly unusual. Florida is distinguished by its Sr-90 levels, which
are distinctly higher than all other states. In the other three variables it is not unusual.
Recall that the transform of state categories in the five variable analysis adjusted for this
additive difference.
For the sake of comparison, the linear principal components are calculated for the four
variables Year, Sr-90, %Nwbirth, Income. These are given in Table 6.3. The smallest
shows a linear dependence between Sr-90 and Income. This is both difficult to explain
convincingly as a causal dependence and, as seen from the APC analysis, misleading. The
observed linear relationship between Sr-90 and Income is simply a consequence of their
mutual dependency on the same nonlinear transform of year.
Both of the relationships in the first two APC, when attention is restricted to the two
main variables, are simple and have sensible interpretations. Furthermore they will both
be fairly evident from careful scrutiny of simple scatterplots. With the APC analysis we
have the ability to detect much more complex relationships - in the APCs estimated here,
the contributions from variables with smaller loadings, while small, are not negligible, and
with careful study more subtle aspects of the structure can be discerned.
Figure 6.12: TIM-4var : The residuals of the smallest APC. The eleven highlighted observations are of Florida. [Panels: % Nonwhite Births, Income (1970 $), Strontium 90, Year, Residual of 2 vs. 1, Residual of 3 vs. 1.]
Variables of the Boston Housing data (see Section 6.5):

Median Value   Median value of owner occupied housing
Room           Average number of rooms in owner units
Age            Proportion of owner units built prior to 1940
Distance       Weighted distances to five employment centers in the Boston region
Highway        Index of accessibility to radial highways
Tax            Full property tax rate (per $10,000)
Ptratio        Pupil-teacher ratio by town school district
Black          Black proportion of the population
Lstat          Proportion of the population that is lower status
Crime          Crime rate by town
Zone           Proportion of the town's residential land zoned for lots greater than 25,000 square feet
Industry       Proportion of nonretail business acres per town
River          Charles River dummy = 1 if tract bounds the Charles River, 0 otherwise
Noxsq          Nitrogen oxide concentration in ppm
6.4.4 Conclusion
Using the APC analyses, it is possible to explain the instability of the additive regression
model of section 6.4.2. The changes in the relative contributions of State and %Nwbirth,
and the fluctuation in their transforms, are a consequence of the very close correspondence
between these variables, discovered by the smallest APC of all five variables.
The smaller changes in the model amongst the other three variables are explained by
the interdependency between Year, Income and Sr-90, indicated by the smaller APCs of
the four-variable set.
6.5 The Boston Housing Data
In this analysis, we examine the Boston Housing Market data of Harrison and Rubinfeld
[HR78J. These data were used to estimate marginal air pollution effects on the housing
market. A regression model relates the median value of owner occupied homes in each of
the 506 census tracts in the Boston Standard Metropolitan Statistical Area to air pollution
(as measured by the concentration of Nitrogen oxides) and to 12 other variables that are
thought to affect housing prices. The variables of the full data set are briefly described
in the list above; for a fuller description see [HR78].
The housing value equation of Harrison and Rubinfeld is developed using linear re
gression and experimenting with a number of possible variable transformations for all 12
predictor variables. Breiman and Friedman [BF85] present an alternative model, using
the technique of ACE regression to estimate optimal transformations of the data. The
model they report is built on a reduced set of 5 variables: four variables chosen using a
forward stepwise variable selection procedure and Noxsq, for estimation of the marginal
effect of pollution.
Using APC, we will examine the structure of both the reduced and full set of pre
dictors. Dependencies between the variables may have influenced the choice of variables
made by the forward stepwise selection procedure. Alternative variable groupings may be
suggested by a fuller understanding of the variable structure. In the reduced data set it
is naturally of interest to discover whether there are potential problems with the stability
of the transforms estimated by the ACE regression algorithm. In particular, we want to
determine whether instabilities affect the estimation of the marginal effect of Noxsq.
6.5.1 BH-small :The Smaller Boston Housing Dataset
The variables comprising the smaller dataset of this analysis are the 5 variables selected by
Breiman and Friedman: Noxsq, Room, Tax, Ptratio, Lstat. Table 6.4 displays the loadings
and eigenvalues for the three smallest APCs. The smallest APC reveals a dependency
between Noxsq, Tax and Ptratio, achieving a variance (eigenvalue) of 0.04. The Tax and
Ptratio transformations both strongly separate two points from the body of the data. (Bear
in mind that due to smoothness constraints imposed by the estimation, discontinuities can
only be approximated.) The case correspondence between the highest two Tax rates and
the separated Ptratio observations, shown in Figure 6.13, is almost exact. These two
points represent 137 cases (27% of the data); since both Tax and Ptratio are fixed within
each of the 50 census tracts they are somewhat categorical in nature.
The APC also implicates Noxsq: brushing reveals that these cases have high values of
the pollution index Noxsq. The first APC has detected a large cluster in the data set,
determined by the two highest tax rates and two high values of Ptratio. Since this cluster
recurs throughout the
Table 6.4: BH-small : Eigenvalues and Variable Loadings

Components          First    Second   Third
Eigenvalue          0.0430   0.1010   0.2177
Loadings: Noxsq     0.4143   0.7000   0.1560
          Tax       0.0282   0.0260   0.6297
          Ptratio   0.7601   0.2767   0.0857
          Lstat     0.4977   0.6275   0.2344
          Roomsq    0.0468   0.1974   0.7189
Figure 6.13: BH-small : The smallest APC-function plots. [Panels: Nitrogen Oxide, House Tax, Pupil-Teacher Ratio, % Lower Status Houses, (Ave # of Rooms)^2, and the residual of Component 1.]
Figure 6.14: BH-small : The second APC-function plots.
analysis, we shall refer to these 137 cases as the Tax-Ptratio cluster.
The second smallest APC, Figure 6.14, shows a dependency between the same three
variables as the smallest APC, with a variance of 0.101. Furthermore, the transform of
Ptratio suggests that the structure detected is again due to the Tax-Ptratio cluster. The
similarity between the interpretation of this APC, together with the strong resemblance
between the transforms for the variables, indicates this APC is redundant.
The third smallest APC has an eigenvalue of 0.203, indicating a weak relationship be
tween Roomsq and Lstat. The transforms, shown in Figure 6.15, show a smooth increasing
association, that is, houses with a larger number of rooms tend to be in neighborhoods
with a higher proportion of lower status households. The change in the lower values of the
Figure 6.15: BH-small : The third APC-function plots.
Roomsq transform reflects a different trend for houses with the smallest number of rooms;
these do not strongly correspond with the highest values of Lstat.
The residual histogram of the smallest APC has two distinct groups of outliers. The
group of large negative residuals are observations in the Tax-Ptratio cluster which are not
in the highest tax cluster; see Figure 6.16. One cluster of the large positive residuals are
observations in the Tax-Ptratio cluster which have low values of Noxsq. The remaining
large positive residuals reveal that the census tracts with the lowest value of Ptratio have
low pollution.
Figure 6.16: BH-small : The smallest APC outliers.
6.5.2 BH-full: The Full Boston Housing Data Set
The smallest APC of this dataset has a variance of 0.0077, showing a virtually exact
dependence between Tax, Ldistance and Industry. Examination of the estimated functions
shows a large separation of the highest Ldistance value, which is found to correspond
exactly to the separated second highest tax values and to high values of Industry. As with
the smaller data set of the previous section, this APC picks out a strong singularity caused
by observations spatially separated in three variables.
Unfortunately, the estimation of further APC cannot yield information about other
specific dependencies in the data, for the next three eigenvalues are all close to 0.05;
hence the APC are not unique. Table 6.5 presents the estimated loadings for the five
smallest APC. The additive manifold corresponding to the second smallest eigenvalue
has co-dimension 3 and involves the variables Noxsq, Crime, Ldistance, Industry, Tax,
and to a lesser degree, Zone, Age, LHighway, Ptratio, Lstat. The transformations of all
these APC are smooth, describing continuous dependencies between variables, rather than
differentiating a cluster.
Examining the smallest APC, there are two clusters of outlying values. Both these
clusters are found to contradict the overall trends which the smoother detects, i.e., the
highest tax cases are not high in Ldistance and some high Industry values have the lowest
Tax values.
6.5.3 Conclusions
The APC provide insight into the structure of the data used in estimating the housing
equation. The strongest singularities are caused by clusters of observations in the data.
The APC analysis calls attention to this important characteristic of the data which might
easily be overlooked, since clusters of tied values are concealed in simple scatterplots.
In all the APC detecting spatial separation of observations, notice that the residuals
typically have distinct clusters of outlying points. As the additive dependency is a common
correspondence of values of a large number of observations, any cases at odds with this
correspondence will appear as outliers in the residual plot. This supports the assertion,
Table 6.5: BH-full : Eigenvalues and Variable Loadings

Components            First    Second   Third    Fourth   Fifth
Eigenvalues           0.0077   0.0533   0.0527   0.0471   0.1009
Loadings: Crime       0.0782   0.3981   0.6622   0.1958   0.3383
          Zone        0.0363   0.1188   0.1319   0.0948   0.0676
          Indus       0.1507   0.1156   0.2176   0.6999   0.4560
          River       0.0056   0.0142   0.0118   0.0186   0.0304
          Noxsq       0.0509   0.6442   0.5566   0.1177   0.1417
          Roomsq      0.0076   0.0385   0.0964   0.0471   0.1608
          Age         0.0157   0.0185   0.1252   0.1823   0.4595
          Ldistance   0.0368   0.4927   0.1507   0.2684   0.5301
          LHighway    0.7639   0.1581   0.2056   0.2638   0.1392
          Tax         0.6164   0.3193   0.2366   0.4671   0.1361
          Ptratio     0.0430   0.1427   0.1396   0.1853   0.1520
          %Blacksq    0.0150   0.0201   0.0497   0.0154   0.0593
          Lstat       0.0048   0.0812   0.1374   0.1370   0.2636
made in Section 6.2, that residual structure from discrete dependencies is likely to have
distinct groups of points that do not lie close to the manifold.
We return to the housing equation estimated by Breiman and Friedman, involving
only the five variables of the smaller data set. Since the investigation hopes to determine
the effect of pollution on the prices people are prepared to pay for housing, the insight
provided by the smallest APC is valuable.
From this APC, we see that many of the high pollution index cases belong to the large
Tax-Ptratio cluster. This suggests that if there is a predictor or indicator of housing value
which has been excluded from the analysis that could specifically adjust for this group,
the marginal effect of Noxsq could change considerably.
Examining the 7 excluded variables, we found that the largest value of the Highway
index variable (LHighway) - indicating greatest accessibility to radial highways - corresponded
exactly to the Tax-Ptratio cluster. Adding an indicator variable for greatest
accessibility to radial highways (Highway), the R2 of the ACE regression model increased
by 0.003, and the standard deviation of the Noxsq transformation increased from 0.172 to
0.232. The transforms for the two regression models are displayed in Figure 6.17.
By adjusting for the presence of the cluster, there appears to be stronger evidence in
support of a decrease in housing prices for areas with high pollution.
6.6 A Diagnostic for Additive Regression Transform
Stability
APC estimates can provide a diagnostic for instability of variable transformations of ACE
and additive regression models, when the instability is due to additive dependencies among
the predictors. The idea for the diagnostic comes from observing that the two models
$Y \approx \sum_i \theta_i(X_i)$ and $Y \approx \sum_i (\theta_i + \phi_i)(X_i)$ will have almost identical fitted values when
$\sum_i \phi_i(X_i)$ has a very small variance. This leads us to consider alternative models obtained
by perturbing the optimal regression model by adding a fraction of the smallest APC,
which we know has the smallest variance among all additive functions of the data. The
changes in the regression transforms affect the residual sum of squares of the regression
model minimally.

Figure 6.17: BH-small : The ACE regression models; Model 1 = 5 variables, Model 2 = Model 1 plus the Highway variable. [Panels: Median Housing Value (Rsq(1) 0.832, Rsq(2) 0.835), Roomsq (sd(1) 0.283, sd(2) 0.272), Lstat (sd(1) 0.561, sd(2) 0.554), Ptratio (sd(1) 0.189, sd(2) 0.198), Tax (sd(1) 0.111, sd(2) 0.098), Highway access (sd(2) 0.131).]
6.6.1 Perturbing the Optimal Model
Suppose we have an optimal additive regression model for a response variable $Y$, using $p$
predictor variables $X_1, X_2, \ldots, X_p$:
$$ Y \approx \theta(X) = \sum_i \theta_i(X_i). \qquad (6.2) $$
The residual sum of squares of the optimal regression is
$$ \sigma^2 = E\,(Y - \theta(X))^2. $$
For the set of predictor variables, denote the smallest APC, as usual, by
$$ \phi(X) = \sum_i \phi_i(X_i) \quad \text{with} \quad \sum_i \mathrm{var}\,\phi_i(X_i) = 1. $$
Its variance is $\mathrm{var}\,\phi(X) = \zeta^2$.
The optimal model is perturbed using the smallest additive principal component, so
that for some fixed $\alpha$, we have the alternative model:
$$ Y \approx \sum_i (\theta_i + \alpha\phi_i)(X_i). \qquad (6.3) $$
This model increases the residual sum of squares of the fit only minimally, since the
residuals from the additive regression are orthogonal to $H^+(X)$ - the additive equivalent
of the familiar property of orthogonality between residuals and fitted values of linear
regression. This orthogonality follows from the projection characterization of the additive
model: since $P^X Y = \theta(X)$, for any $\psi(X) = \sum_i \psi_i(X_i) \in H^+(X)$,
$$ \mathrm{cov}\,(Y - \theta(X), \psi(X)) = \mathrm{cov}\,(P^X(Y - \theta(X)), \psi(X)) = \mathrm{cov}\,(\theta(X) - \theta(X), \psi(X)) = 0. $$
Hence the increase in residual sum of squares of any alternative model of the form
$$ Y \approx \sum_i (\theta_i + \alpha\psi_i)(X_i), \qquad (6.4) $$
for $\|\psi\| = 1$, is:
$$ \mathrm{RSS}(\psi, \alpha) = E\,(Y - \theta - \alpha\psi)^2
   = E\,(Y - \theta)^2 - 2\alpha E\,((Y - \theta)\psi) + \alpha^2 E\,(\psi)^2
   = E\,(Y - \theta)^2 + \alpha^2 E\,(\psi)^2
   = \sigma^2 + \alpha^2 E\,(\psi)^2. $$
The alternative model formed using the smallest APC is the least possible perturbation
of the additive model in the following sense.
Theorem 6.1 Among all alternative models to the additive regression model (6.2) of the
form (6.4), $\mathrm{RSS}(\psi, \alpha)$ is minimized by the smallest APC, that is, $\psi = \phi$, for any $\alpha \neq 0$.
Proof: The minimal change in RSS between the additive regression model and the alternative
model is
$$ \min_{\psi \in H^+,\, \|\psi\|=1} \mathrm{RSS}(\psi, \alpha) - \mathrm{RSS}
   = \min_{\psi \in H^+,\, \|\psi\|=1} \alpha^2 E\,(\psi(X))^2 = \alpha^2 \zeta^2, $$
attained at $\psi = \phi$. For any fixed, non-zero $\alpha$, the smallest APC therefore minimizes the
increase in RSS.
In the alternative models, the sign of $\alpha$ is indeterminate - both positive and negative
values produce the same increase in RSS. However, the models are not identical:
$\theta_i + \alpha\phi_i \neq \theta_i - \alpha\phi_i$.
A diagnostic allowing multi-dimensional alternatives to the additive model can be
constructed using the sequence of smallest APC, on the basis of a corollary to Theorem
6.1.
Corollary 6.1 Consider the k-dimensional alternative model specified by
$$ Y \approx \theta + \alpha_1\psi^{(1)} + \alpha_2\psi^{(2)} + \cdots + \alpha_k\psi^{(k)}, $$
where $\psi^{(i)} \perp \psi^{(j)}$ for $i \neq j$ and $\|\psi^{(i)}\| = 1$, with
$$ \mathrm{RSS}(\psi^{(1)}, \ldots, \psi^{(k)}; \alpha_1, \ldots, \alpha_k)
   = E\,\big(Y - (\theta + \alpha_1\psi^{(1)} + \alpha_2\psi^{(2)} + \cdots + \alpha_k\psi^{(k)})\big)^2. $$
For every $(\alpha_1, \ldots, \alpha_k)$ with $\alpha_k \neq 0$, $\mathrm{RSS}(\psi^{(1)}, \ldots, \psi^{(k)}; \alpha_1, \ldots, \alpha_k)$ is minimized,
subject to the orthogonality and norm constraints, by the k smallest APC.
The proof is a simple extension of the above argument.
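To make the perturbation concrete, the following is a minimal numerical sketch (not the implementation used in this thesis): the data, the fitted transforms theta_i, and the APC transforms phi_i are hypothetical stand-ins for quantities that would in practice come from the additive regression and APC estimation algorithms.

    import numpy as np

    rng = np.random.default_rng(0)
    n = 400
    x1 = rng.normal(size=n)
    x2 = x1 + 0.1 * rng.normal(size=n)           # strong additive dependency between predictors

    # Hypothetical fitted additive regression transforms theta_i(X_i) and response Y.
    theta1, theta2 = x1, 0.5 * x2
    fit = theta1 + theta2
    Y = fit + 0.3 * rng.normal(size=n)
    sigma2 = np.mean((Y - fit) ** 2)             # RSS of the optimal additive fit (per observation)

    # Hypothetical smallest-APC transforms phi_i(X_i): centered, with component variances
    # summing to about one; the variance zeta2 of their sum is small because x2 is nearly x1.
    phi1 = (x1 - x1.mean()) / np.sqrt(2.0)
    phi2 = -(x2 - x2.mean()) / np.sqrt(2.0)
    zeta2 = np.var(phi1 + phi2)

    # Perturbed model theta_i + alpha * phi_i: its RSS grows roughly as sigma2 + alpha^2 * zeta2.
    for alpha in (0.0, 0.5, 1.0):
        rss = np.mean((Y - (fit + alpha * (phi1 + phi2))) ** 2)
        print(f"alpha={alpha:.1f}  RSS={rss:.4f}  sigma2 + alpha^2*zeta2={sigma2 + alpha**2 * zeta2:.4f}")

With estimated (rather than exact) transforms the two printed columns agree only approximately, which is precisely the finite-sample caveat discussed in the next section.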
6.6.2 A Dynamic Diagnostic
In the previous section, the value of a in the alternative model (6.4) is fixed. If we treat a
as a continuous parameter we can construct a continuum of models which move from the
optimal additive regression in the direction of the smallest APC. This suggests a diagnostic
in which changes in the regression transformations occur dynamically as the parameter
$\alpha = \alpha(t)$ varies: for the current value of $\alpha(t)$ the alternative variable transformations,
$\theta_i + \alpha(t)\phi_i$, are displayed. This is easily implemented within the Symbolics Lisp environment,
by using the mouse to control the value of $\alpha$, and continuously updating the regression
transforms to correspond to the current $\alpha$ (see Appendix A). Since computing the new
transforms only involves addition of already computed functions, the updating computation
is easily fast enough that the variable transformations appear to change smoothly with
$\alpha$. Periodically, the new $\mathrm{RSS}(\phi, \alpha(t))$ and $\hat{Y}^{(\mathrm{new})} = P^Y(\theta + \alpha(t)\phi)$ are also recomputed.
These latter two quantities, the new RSS and the new regression model, could in theory
also be computed quickly, since $\mathrm{RSS}(\phi, \alpha(t)) = \sigma^2 + \alpha(t)^2\zeta^2$ and $\hat{Y}^{(\mathrm{new})} = \hat{Y}^{(\mathrm{add})} + \alpha P^Y(\phi)$.
However, in the finite sample version, with smoothers estimating the conditional expectation
operators, the orthogonality between the residuals and the APC will not hold exactly.
Hence the above relations are not necessarily accurate for the estimates, so RSS and $\hat{Y}^{(\mathrm{new})}$
must be calculated explicitly.
This dynamic diagnostic can also be implemented for higher dimensional models. The
two dimensional generalization is easily made, by using the mouse to input values of $\alpha_1$ and $\alpha_2$
from a two dimensional display. For higher dimensions, a more sophisticated "touring"
device, which can be guided interactively in k-space, would enable the k-dimensional
diagnostic to be implemented.
In the above discussion we have restricted attention to additive regression models. The
results are directly applicable to the ACE models, that is, models in which the response
variable is also transformed.
The above diagnostic for a single APC has been implemented on the Symbolics Lisp
Machine. It was applied to the additive regression model for the Infant Mortality data.
Perturbations in the model due to the smallest additive principal component are shown
in Figure 6.18 for $\alpha \in [-1, 1]$. As expected, the transforms of State and %NwBirth are
unstable. The maximal increase in RSS of the model, for this range of $\alpha$, was 0.09. The
diagnostic was then applied again, using the second smallest APC, and $\alpha$ in the range
$[-0.5, 0.5]$. In Figure 6.19, the transforms for Sr-90, Income and Year change considerably,
although the maximal increase in RSS is merely 0.08. From the range of functions shown
for the transform of Sr-90, it is clear that it is not possible, using this model, to determine
from this data set whether the marginal effect of Sr-90 is detrimental.
Figure 6.18: APC Diagnostic for TIM Regression: Smallest APC. [Panels: Total Infant Mortality, Strontium 90, % Nonwhite Births, State, Year.]

Figure 6.19: APC Diagnostic for TIM Regression: Second smallest APC. [Panels: Total Infant Mortality, Strontium 90, % Nonwhite Births, State, Year.]
Chapter 7
Literature Review
7.1 Linear Principal Component Analysis
The applications of analysis based on principal components are diverse, in part because
the principal components themselves have a multitude of different, yet equivalent inter
pretations. The first use of principal components is attributed to Pearson [Pea01], who
posed the problem of finding the best fitting line or plane to a set of points in a higher
dimensional space. The problem arises in a regression context, where there are errors in
the predictor variables. Pearson shows the best fitting hyperplane is the line or plane
minimizing the sum of squares of perpendicular distances to the subspace. For a plane in
three space, this corresponds to the plane orthogonal to the smallest principal component.
Principal component related methods have a long history in the social sciences. Spear
man [Spe04] first examined the structure of sets of correlated variables, such as scores made
by school children in tests of speed and skill in solving arithmetic problems. The question
posed is whether a single underlying factor exists, representing unmeasurable "general
intelligence", that determines the scores a child will achieve. In the 1930's, the prob
lem was extended to allow "general intelligence" to be dependent on several independent
underlying factors.
Thurstone [Thu31] derived the underlying factors by assuming the so-called factor
model. A $k$-factor model is appropriate if
$$ X = \Lambda f + z, $$
where
$X$ are the observed $p$-dimensional data,
$f$ are unobservable $k$-dimensional common factors,
$\Lambda$ is a $p \times k$ matrix of unknown parameters, the factor loadings,
$z$ are unobserved $p$-dimensional unique factors.
The premise underlying the model is that the p observed variables actually lie in a lower
k-dimensional space, as represented by f, the common factors. The unique factor z allows
both for individuals to perform differently on particular tests and for the tests being only
an approximate measure of the underlying factor. A unique factor Zi will be small if the
test is closely related to the factors. The common factors are standardized to variance
1 and all factors are assumed to be uncorrelated. Hence for this model, the covariance
structure of $X$ is
$$ \Sigma = \Lambda \Lambda' + \Psi_z, \qquad (7.1) $$
where $\Psi_z$ is diagonal. Note that as defined by this covariance structure, the factor loadings
$\Lambda$ are not uniquely determined, since any orthogonal rotation of the factors will give the
same model.
Estimating a model entails finding $\hat\Lambda$ and $\hat\Psi$ to closely approximate the ideal:
$$ S \approx \hat\Lambda \hat\Lambda' + \hat\Psi_z. $$
The principal factor solution, proposed by Thurstone, first estimates $\Psi$, then finds the
best rank-$k$ $\Lambda$ approximating $S - \Psi \approx \Lambda\Lambda'$. This is solved by finding the $k$ largest
principal component directions of the matrix $S - \Psi$. Note that if $z$ has zero variance, the
principal components are a solution for the factor model.
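A minimal numerical sketch of the principal factor step just described (synthetic data and a crude uniqueness estimate, both purely illustrative; this is not the estimation method used elsewhere in this thesis):

    import numpy as np

    rng = np.random.default_rng(0)
    X = rng.normal(size=(200, 5))
    X[:, 1] = X[:, 0] + 0.3 * rng.normal(size=200)     # induce some common structure
    S = np.corrcoef(X, rowvar=False)                    # sample correlation matrix
    k = 2

    # Crude initial uniquenesses: 1 - squared multiple correlation of each variable.
    psi = 1.0 / np.diag(np.linalg.inv(S))

    # Principal factor step: eigendecompose the reduced matrix S - diag(psi) and
    # keep the k largest principal component directions as factor loadings.
    vals, vecs = np.linalg.eigh(S - np.diag(psi))
    order = np.argsort(vals)[::-1][:k]
    loadings = vecs[:, order] * np.sqrt(np.clip(vals[order], 0.0, None))
    print(loadings)                                     # estimated p x k loading matrix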
A different criterion for finding underlying factors was used by Hotelling [Hot33], who
approached the problem as one of finding a linear transformation in which the observed data
can be written as a linear combination of a smaller number of independent components
$(v_1, \ldots, v_k)$, $x_i = \sum_j a_{ij} v_j$. His solution is to find the coefficients $a$ defining the best
univariate component representation, by minimizing the loss function
$$ \sum_i \| x_i - a_i v \|^2. \qquad (7.2) $$
The p-vector a gives loadings for each variable. This leads to the first principal component
as the optimal univariate summary. This will be an adequate summary of the data matrix
to the extent that the rows of X are homogeneous - hence this procedure is given the name
homogeneity analysis in the psychometric literature. Higher dimensional summaries of the
data matrix are constructed sequentially, each being the best 1-dimensional component,
constrained to be uncorrelated with all previous components. Hotelling notes that under the
assumption of multivariate normality, the principal component directions are the major
axes of the correlation ellipsoid.
Principal components analysis and factor analysis differ in that principal components
make no assumptions about the form of the covariance matrix from which the data come.
Factor analysis, on the other hand, is based on a well-defined model, and assumes the
covariance matrix of the observations to have the structure (7.1). If these assumptions are
invalid, the model may produce spurious results.
The discovery of the connection between principal components, finding the best k
dimensional linear subspace of the data, and the independently developed singular value
decomposition, led to the realization in Eckart and Young [EY36] for instance, that
for linear components, finding a k-dimensional representation by the sequential method
of Hotelling was equivalent to finding the best k~dimensional linear subspace. The k
dimensional linear manifold closest to the data in the least squares sense is exactly the
manifold defined by the first k 1-dimensional orthogonal components, which are in turn
the first k left singular vectors of the data matrix X. For linear components the solutions
are nested in k, that is, the span of the k-dimensional solution contains the
(k - 1)-dimensional solution.
Underlying the use of principal components in psychometry is the premise that the
data lie in a lower dimensional space. In a more general setting, principal component
analysis has gained wide acceptance as a technique of data summary, in the spirit in which
it was proposed by Hotelling. As already noted, no model or distributional assumptions
are made; the principal components are defined as optimizing some algebraic or geometric
property of the data.
Algebraically, the first $k$ principal components maximize the variance of any $k$-dimensional
projection of the data, or equivalently minimize the loss function
$$ \sum_i \| X_i - F a_i \|^2, \qquad (7.3) $$
where $F$ is $n \times k$ and $a_i$ is the $i$-th row of the $p \times k$ matrix $A$. $F$ then is the best $k$-dimensional
linear representation of the data minimizing over k dimensions simultaneously.
Geometrically interpreted, the k largest principal components define the k-dimensional
linear manifold lying closest to the data. An alternate geometric view is that projecting
the data onto this manifold gives the k dimensional representation that preserves the
configuration of points in the original space to the greatest possible extent. The data can
be exactly represented using all $p$ eigenvectors, and for any dimension $k$ the minimal loss
of information is incurred by using the first $k$ eigenvectors. The eigenvalue associated
with the $i$-th eigenvector, $\lambda_i$, gives the variance of the $i$-th principal component; hence
the ratio $\sum_{i=k+1}^{p} \lambda_i / \sum_{i=1}^{p} \lambda_i$ measures the proportion of total variance lost by using $k$ dimensions
instead of the full $p$. Often, the several smallest eigenvalues are close to zero, and little
information about the joint distribution of the variables is given by the corresponding
components. Thus principal component analysis can be used as an optimal dimension
reduction technique in which the minimal amount of information is lost.
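As a small illustration of this bookkeeping (synthetic data and a hypothetical near-dependency; only a standard eigendecomposition of the correlation matrix is used):

    import numpy as np

    rng = np.random.default_rng(0)
    X = rng.normal(size=(300, 6))
    X[:, 5] = X[:, 0] - X[:, 1] + 0.05 * rng.normal(size=300)     # a near-linear dependency

    lam = np.linalg.eigvalsh(np.corrcoef(X, rowvar=False))[::-1]  # eigenvalues, largest first
    k = 3
    lost = lam[k:].sum() / lam.sum()
    print(f"proportion of total variance lost using k={k} of p={len(lam)} components: {lost:.3f}")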
Applications of principal component analysis are also found in multidimensional scal
ing, errors in variables regression, cluster analysis, size and shape methods, among others.
All these techniques focus on the principal components of the larger eigenvalues. Explicit
use of the smallest linear principal component seems to be rare.
The smallest eigenvalues are relevant when principal components are used as a diagnostic
tool for studying collinearity in regression analysis. The relative variances of the
smallest principal components are examined to determine whether linear relationships
among the predictor variables are present.
The notion of using the smallest principal component as a technique for studying
the interdependency of data has rarely been explicitly utilized, although Gnanadesikan
comments [Gna77],
For purposes of interpretation - detection or specification of constraints on, or redundancy
of, the observed variables - it may often be the relations that define near
constancy (i.e., those specified by the smallest eigenvalues) that are of greatest importance.
Yet strangely enough, Pearson's first proposal was estimation of the constraint implied
by the smallest linear principal component. In this connection, there is a generalization of
Pearson's motivation for using the smallest principal component as an alternative to the
linear regression model, when all the variables are observed with error.
The regression solution is the hyperplane in the union of the $X$ and $Y$ space that
minimizes the distance to the data, measured in the $Y$ direction. Suppose we find the
smallest principal component direction, $a^*$, of the combined matrix $X^* = (Y, X)$. The
hyperplane defined by setting $X^* a^* = 0$, that is, projecting the data orthogonal to the
smallest principal component direction, minimizes the orthogonal distances to the data.
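A sketch of this construction for two variables (synthetic, centered data; the smallest eigenvector of the combined covariance matrix supplies the orthogonal-regression constraint):

    import numpy as np

    rng = np.random.default_rng(0)
    x = rng.normal(size=500)
    y = 2.0 * x + 0.2 * rng.normal(size=500)     # in general both x and y carry measurement error

    Xstar = np.column_stack([y, x])
    Xstar -= Xstar.mean(axis=0)                  # center the combined matrix X* = (Y, X)

    # a*: the principal component direction with the smallest eigenvalue.
    vals, vecs = np.linalg.eigh(np.cov(Xstar, rowvar=False))
    a_star = vecs[:, 0]

    # The fitted hyperplane is {z : z . a* = 0}; with two variables this is the
    # errors-in-variables (orthogonal regression) line y = slope * x.
    slope = -a_star[1] / a_star[0]
    print("orthogonal regression slope:", slope)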
7.2 Nonlinear Generalizations of Principal Component
Analysis
The common concern underlying all the techniques described in this section is that
assuming linear structure is often unrealistic. By allowing nonlinearity, the structure of
the data might be represented more closely. On the other hand, introducing nonlinear
transforms necessarily moves away from the ideal of model simplicity. If linear structure
is appropriate, we would still want a non-linear method to reflect simple linearity. Thus,
a natural requirement for any more general class of models is that they contain the class
of linear models.
The generalizations and extensions to linear principal components that have been
developed can be classified according to their treatment of three issues.
1. The definition of the "nonlinearity" of the manifold.
2. The generalization of one dimensional representations to higher dimensions.
3. Whether the manifold is represented parametrically or determined by a constraint.
7.2.1 The Nonlinearity of the Manifold
In linear principal components, lower dimensional representations of the data are defined
as minimizing over $F_{n \times k}$ and $A_{p \times k}$ the loss function
$$ \sum_i \| X_i - F a_i \|^2, $$
where $a_i$ is the $i$-th row of $A$. There are two ways to extend the linear definition to allow
nonlinear representations of the data.
One is to replace the $k$ linear factors $F$ by a nonlinear function of $k$ parameters. Thus
we define $f(a) = (f_1(a), f_2(a), \ldots, f_p(a))$ as minimizing
$$ \sum_i \| X_i - f_i(a) \|^2. $$
This is the approach taken by Hastie [Has83] in defining principal curves and surfaces. An
appealing property of this parameterization is that the norm is minimized in the original
variable scale. The resultant models thus have appealing geometric properties, but can
be difficult to interpret, particularly in higher dimensional generalizations.
The second approach is to transform the variables, replacing $X$ by $f(X)$, where $f$ is
chosen to have an optimal linear factor representation. We minimize
$$ \sum_i \| f_i(X_i) - F a_i \|^2. $$
A disadvantage of this approach is that the norm is minimized in the space of the trans
formed variables.
The second approach is the one used in APCs; thus we only review generalizations of
linear principal component analysis that introduce nonlinearity through transformation of the
variables.
7.2.2 Optimal Data Transformation Methods
Within the methods that model nonlinearity by transforming the variables, there are
several distinct approaches to the class of transforms, $f$, chosen. The most extensively
studied representations adhere to the class of additive functions, $f(X) = \sum_i f_i(X_i)$. Only
marginal transformations are considered; thus distinct variable spaces are retained in the
transformed representation.
The functions Ii can be restricted to be linear, to belong to some finite dimensional
class of functions, or, as in our case, simply required to have zero mean and finite variance.
Since the additive model plays an important role in the existing literature for nonlinear
modelling, it is discussed further in section 7.3.
Finite Dimensional Distributions
The most extensive treatment of nonlinear principal components with additive transforms
is found in the psychometric literature. Observations are considered to be exclusively
categorical, although the underlying components can be of nominal, ordinal or continuous
measurement type (in the somewhat radically stated view of Gifi [Gif81, p46], "all
data are categorical"!). All distributions are discrete; consequently the space of additive
transformations is finite dimensional.
For discrete distributions, then, the task of finding optimal transformations of the
data is greatly simplified, since it reduces to a finite dimensional problem. Estimating Ii
amounts to estimating an optimal scaling or quantification for a metric representation of
each variable under the restriction imposed by the measurement type.
The numerical representation for a vector of observations on a variable of $k$ categories
can be written using a matrix of dummy variables, $G$. $G$ is a matrix with $k$ columns of
0's and 1's; each row has a 1 in the column of the category observed for that individual.
Suppose the $k$ scalings for the variable categories are given by $q$; the corresponding numerical
representation of the vector of observations is $Gq$. By convention, the columns
of $G$ are scaled to norm one; thus the quantification of the $i$-th variable is
$$ f_i = f_i(X_i) = G_i D_i^{-1/2} q_i = H_i q_i, $$
where $D_i = \mathrm{diag}(n_1, n_2, \ldots, n_k)$, with $n_l$ the number of
occurrences of the $l$-th category. Thus, the estimation simply entails finding $q_i$ for given
$H_i$.
Algorithms for estimation of an optimal univariate representation of categorical data
are all based on a two step estimation procedure.
• "Model" estimation: Quantifications of all variables are presumed known, and opti
mal parameters for a linear component representation, using the usual loss function
are found. This amounts to finding the linear principal components of the trans
formed variables, or equivalently, projecting the fixed quantifications onto the model
space.
Explicitly, for fixed $(y_1, \ldots, y_p)$, find $a$ and $f$ minimizing
$$ \sigma(a, f) = \sum_i \| y_i - a_i f \|^2. $$
This is solved by the largest eigenvector of
$$ \Sigma_Y = Q' H' H Q, \quad \text{where } Q = q_1 \oplus \cdots \oplus q_p, \; H = H_1 \oplus \cdots \oplus H_p. $$
• Optimal scaling step : Assume the component representation fixed, hence every
variable is approximated by the common component f of n points. The optimal
quantification for each variable for this fixed f, using the same loss function, is
found by simple regression of the component vector onto the transform space for each
variable. To minimize Ei IlYi - aifll" for each i, simply minimize the i'" term IlY,
aifll" = Ila,f -HIllJllz. The usual least squares solution gives Y; = a,{HiHil-IHlf =
alHif. Measurement restrictions are imposed in the regression step, for instance,
ordinal measurements are preserved by using isotonic regression. The geometric
interpretation of this step is projecting the component onto the transform space for
each variable.
The algorithm alternates between these two steps, computing a restricted optimal solution
at each stage. A standardizing constraint is imposed in one of the steps to avoid collapse
to the trivial zero solution. The algorithm converges, as each of the steps of the
algorithm is a projection onto a closed convex set. Convergence to the globally optimal
solution is not guaranteed; a local minimum may occur. Detailed references and comments
are found in De Leeuw, Young and Takane [dLYT76].
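A schematic sketch of one such alternation for purely nominal variables (synthetic data; the indicator matrices and the regression step follow the description above, the component is centered at each pass to avoid the trivial constant solution, and no measurement restrictions are imposed):

    import numpy as np

    rng = np.random.default_rng(0)
    n, p, k_cats = 200, 3, 4
    X = rng.integers(0, k_cats, size=(n, p))      # n observations on p nominal variables

    # Column-normalized indicator matrices H_i = G_i D_i^{-1/2}.
    H = []
    for i in range(p):
        G = np.zeros((n, k_cats))
        G[np.arange(n), X[:, i]] = 1.0
        counts = np.maximum(G.sum(axis=0), 1.0)   # guard against empty categories
        H.append(G / np.sqrt(counts))

    q = [rng.normal(size=k_cats) for _ in range(p)]   # initial category quantifications
    for _ in range(50):
        # "Model" step: with quantifications fixed, the best single component f and
        # loadings a come from the leading singular triple of the (centered) quantified data.
        Yq = np.column_stack([H[i] @ q[i] for i in range(p)])
        Yq -= Yq.mean(axis=0)
        U, s, Vt = np.linalg.svd(Yq, full_matrices=False)
        f, a = U[:, 0], s[0] * Vt[0]
        # Optimal scaling step: regress the component back onto each variable's
        # category space; since the columns of H_i are orthonormal, q_i = a_i H_i' f.
        q = [a[i] * H[i].T @ f for i in range(p)]

    print("loadings:", a)

An ordinal variable would replace the unconstrained scaling step above with an isotonic regression, and higher solutions would add a Gram-Schmidt step, as described in the text.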
For this class of models there are two distinct methods to extend principal component
analysis to a k-dimensional representation. Multiple correspondence analysis, a method of
analysis for purely nominal variables, was introduced by Benzecri [Ben72]. The analogous
technique allowing other measurement types is known as homogeneity analysis, as in
Young, De Leeuw and Takane [YTdL78]. In both of these, a sequence of 1-dimensional
components is constructed. Each component is required to be optimal according to
the usual univariate loss function, subject to orthogonality with respect to all previous
solutions. Combining the first k solutions gives a k-dimensional nested model.
The $k$-th homogeneity solution is estimated using the same two-step scheme as above;
however, at the model step, the representation is restricted to be orthogonal to the $k-1$
previously found solutions. This is implemented by adding a Gram-Schmidt orthogonal
ization to the model estimation step.
The second approach, known as nonmetric principal components analysis, was first
proposed by Kruskal and Shepherd [KS74]. For any fixed dimension, k, a single transfor
mation of the data is sought. The first k linear principal components of the p transformed
variables are required to have maximal variance over all possible transformations of the
variables. These representations are not usually nested.
Again, nonmetric principal components can be estimated by modifying the two step
algorithm described previously. At the model step, the first k principal components of
the transformed variables are calculated, rather than just the largest. Then in the op
timal scaling step, the simple regression is replaced by a multiple regression of these k
components onto each variable.
Algorithms for these two methods of nonlinear principal component analysis are, respectively,
HOMALS (HOMogeneity analysis by Alternating Least Squares) and PRINCALS
(nonmetric PRINcipal Components by Alternating Least Squares) [Gif81].
One Transformation vs Multiple Transformations
A k-dimensional solution in homogeneity analysis has k different mutually orthogonal
quantifications for each variable, with each solution having smaller variance than all pre
vious solutions. A nonmetric principal component solution gives only one quantification
for any dimension k, and its first k linear principal components have maximal variance for
that dimension.
For linear transformations these two approaches yield the same solution, hence the
dichotomy in the generalization to higher dimensions is only present in the nonlinear case.
In several ways, the dimension definition used in homogeneity analysis (and also in APC,
for continuous random variables) is the more natural one to use.
First, the models are nested, hence the parameter k need not be known. In nonmetric
principal component estimation, the analyst has the unenviable task of trying to guess
the appropriate linear dimension of some unknown transformation of the data.
Second, there is a strong analytical structure underlying the multiple quantification
representation, that parallels the sufficiency of the linear principal component representa
tion.
In linear principal components the sequence of eigenvectors and eigenvalues gives the
unique orthogonal decomposition of the correlation matrix of the data. If the data are
Gaussian, this implies the principal components are sufficient for the correlation matrix.
The following two finite dimensional cases reveal similar analytical properties.
In the case of nominal variables, multiple correspondence analysis amounts to a weight
ed principal component analysis of all the bivariate marginals (the Burt table). Hence the
bivariate dependencies can be completely recovered by using all the components, and
taking the largest k preserves the configuration of the bivariate marginals to the greatest
possible extent.
In homogeneity analysis each set of quantifications is a function estimate in the sum
space
$$ H(X) = H(X_1) \oplus H(X_2) \oplus \cdots \oplus H(X_p) = \{ f(X) : f(X) = \sum_i f_i(X_i) \}. $$
The sequences of quantifications defined by the multiple quantifications approach of
homogeneity analysis provide a complete orthogonal decomposition of the space $H(X)$.
A major objection to the approach of homogeneity analysis is the aspect of "data pro
duction" - where we began with p variables, we now have pk "variables". The nonmetric
principal component approach yields only one set of transformed variables, which has ap
peal because of its apparent simplicity. However, this representation is optimally linear
on the transformed scale. In general, linear interpretation of variables in the transformed
scale may not be meaningful.
De Leeuw [dL82] investigates the similarities and differences between the two forms
of analysis for categorical data. He proves that if the bivariate frequency tables have the
same singular vectors, then the two methods yield identical solutions. This is exactly the
condition of Theorem 3.1 for discrete distributions; hence for the distributions discussed in
Chapter 3, the two different approaches will yield identical k dimensional representations.
Continuous Random Variables
An early suggestion for a method of introducing nonlinearity into principal component
analysis is found in Gnanadesikan [Gna77]. He proposes allowing polynomial transformation
of the data matrix up to degree k; hence the transform space is again finite
dimensional. This is easily implemented by conducting an ordinary principal component
analysis on the augmented data matrix of the original variables plus all their squares and
crossproducts (for degree 2, say). While this strays outside the class of additive models,
it can be restricted to purely additive transforms, in which no interactions between variables
occur, by excluding cross product terms. Furthermore, the technique is made more
useful if the transforms of a single variable are mutually orthonormal, so that the analysis
models the dependencies between different variables, rather than within transformations
of the same variable. With these two restrictions, this technique is exactly equivalent to a
restricted APC, where the APC class of transforms is reduced to degree-k polynomials,
as proven in Proposition 4.1.
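A sketch of this polynomial-restricted analysis (synthetic data with a quadratic dependency; degree-2 powers of each variable are orthonormalized within the variable, no cross products are included, and an ordinary principal component analysis is applied to the augmented matrix):

    import numpy as np

    rng = np.random.default_rng(0)
    n = 300
    x1 = rng.normal(size=n)
    x2 = x1 ** 2 + 0.05 * rng.normal(size=n)     # a quadratic additive dependency
    X = np.column_stack([x1, x2])

    # Augmented matrix: degree-1 and degree-2 terms per variable, orthonormalized
    # within each variable (via QR); no cross products between variables.
    blocks = []
    for j in range(X.shape[1]):
        P = np.column_stack([X[:, j], X[:, j] ** 2])
        Q, _ = np.linalg.qr(P - P.mean(axis=0))
        Q = Q * np.sqrt(n - 1)                   # scale columns to unit sample variance
        blocks.append(Q)
    Z = np.hstack(blocks)

    # The smallest principal component of the augmented matrix corresponds to an
    # additive polynomial combination of the variables with (near) zero variance.
    vals, vecs = np.linalg.eigh(np.cov(Z, rowvar=False))
    print("smallest eigenvalue:", vals[0])
    print("loadings on per-variable polynomial transforms:", vecs[:, 0])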
When nonlinear analysis is considered for continuous variables, we are confronted with
a possibly infinite dimensional space for the transform functions. Provided the spaces
are finite dimensional, the distinction between defining the manifold via a
constraint or parametrically, though it still exists, is not really important, since it simply
amounts to choosing the maximal or minimal eigenvalues of a finite dimensional
eigendecomposition. In infinite dimensions, since complete enumeration is impossible, one or
other approach must be taken, depending on the rationale behind the analysis. In our
estimation of the smallest APC, we use the constraint estimation approach because we are
interested in exploring the interdependencies of the data. Koyak [Koy85] proposes a k-factor
multivariate dimensionality reduction analysis. Clearly, since the intent is dimension
reduction, the appropriate method of finding the best low dimensional representation of the
data is to estimate the manifold directly. Koyak estimates a single transformation for any
fixed k dimensions, and hence generalizes nonmetric principal component analysis. Each
APC in our method estimates a different set of transforms, which follows the approach of
homogeneity analysis.
Another approach to finding singularities in data, that is, additive transformations
of the data with a small variance, is suggested by Fowlkes and Kettenring [FK85]. The
criterion they consider is minimizing the determinant of the correlation matrix of the
transformed data, that is, the product of the eigenvalues. This leads to one set of transformations.
One drawback to their approach is that all variable transforms are forced to
enter equally into the analysis.
There are a few nonlinear generalizations of principal components which have not restricted
the data transformation to be additive. Klein and Garayalde [KG85] propose
a projection pursuit principal component. Their approach is in the context of principal
components analysis as a dimension reduction technique, so they find the transform of the
data f(X) which comes closest to the data matrix X, where f(X) is restricted to be of
the form of a projection pursuit model: a sum of smooth functions of one-dimensional linear
projections of the data.
This definition only makes sense for direct estimation of the manifold - there appears
to be no natural way to generalize this method to a smallest principal component for a
projection pursuit model.
7.3 Additive Models
The additive model, as defined by Hastie and Tibshirani [HT86], has been the focus of
much attention in the recent efforts to move away from the restrictions of parametric
models and distributional assumptions. It is widely assumed that the additive model,
since more flexible than linear models, is therefore adequate. While this is clearly not
always true, the reasons for using the additive model are persuasive.
Stone [Sto85] writes:
Three fundamental aspects of statistical models are flexibility, dimensionality and
interpretability. Flexibility is the ability of the model to provide accurate fits in a wide
variety of situations; inaccuracy here leads to bias in estimation. Dimensionality can
be thought of in terms of the variance in estimation, the "curse of dimensionality"
being that the amount of the data required to avoid an unacceptably large variance
increases rapidly with increasing dimensionality. In practice there is an inevitable
trade off between flexibility and dimensionality or, as usually put, between bias and
variance. Interpretability lies in the potential for shedding light on the underlying
structure.
Classical linear and parametric models in general, are relatively easy to interpret and
to estimate. Historically this is the sole reason for their preeminence. The disadvantage
of the classical methods is an inability to adapt in situations where the assumed structure
is inappropriate - scenarios in which both the bias and the variance of estimation will be
large.
More general models, such as projection pursuit for instance, or models that include
simple interactions between variables allow far more complex representations. Conse
quently they can require large amounts of data for reliable estimation: sparseness is often
a problem when modelling interaction terms, or using multivariate smoothers. Sometimes
such flexible models are simply too complex for the intended application.
Additive models fall nicely in the middle ground between these two alternatives. Since
there are no interaction terms, we retain the desirable elegance of interpretation for additivity:
if $x_1$ is changed to $x_1'$, and all other variables remain constant, the effect on the fitted
value can be measured as a function of the difference $f_1(x_1') - f_1(x_1)$; so only the bivariate relations
need be considered. Dimensionality problems are avoided because the additive structure
permits successive, rather than simultaneous, estimation of the functions, as embodied
in the alternating conditional expectation algorithms. Finally the models are reasonably
flexible. Even if $f$ is not genuinely additive, an additive approximation to $f$ may capture
the structure sufficiently for a given application, and has the advantage of being easily
interpretable.
The additive model will reproduce linear structure where linearity holds, and can easily
be extended to include known interaction terms if desired, by adding "new" variables that
are formed from products of the original variables.
Chapter 8
Conclusion
Our primary aim is to present a viable data analysis method for understanding the additive
structure of multivariate data.
The additive structure is described by the additive principal component, defined as the
additive function of the data with smallest variance. The APCs are a natural generalization
of linear principal components, and have a characterization as a sequence of eigenfunctions
belonging to the smallest eigenvalues of P.
Estimates of the APC can be calculated using a simple iterative algorithm, which is
an implementation of the power method for estimating eigenfunctions. The estimates are
accurate when the eigenvalues are small and well separated, or equivalently, when the
observations lie near an additive manifold.
Interpretation of the dependencies implied by additive equations with small variance
is made practicable by the interactive technique of brushing on connected APC-function
plots. The dependencies embodied in the APC are then easily expressed in terms of the
original variables. Observations with large residuals can be located by using the residual
plot of the APC. Through the power and simplicity of this technique of interpretation,
the APC becomes a viable data analysis tool.
The theoretical precedents of the Gaussian and Gegenbauer distributions provide guidance
for detecting APC that are redundant or spurious. Recognition of either of these
cases is a step toward answering the fundamental question of whether the APC have
detected real structure in the data set.
APC estimates provide a diagnostic for instability of predictor transforms in additive
regression models. For a small decrease in R2 of the additive regression, we can examine
a range of alternative sets of transforms of the predictor variables, thereby detecting
variables whose transforms are unstable due to additive dependencies in the predictors.
Bibliography
[Ben72] J. P. Benzecri. Sur l'Analyse des Tableaux Binaires Associés à une Correspondance Multiple. Technical Report, Université Pierre et Marie Curie, Paris, 1972. Note Mimeo, Lab. Stat. Math.
[BF85] L. Breiman and J. H. Friedman. Estimating optimal transformations for multiple regression and correlation. Journal of the American Statistical Association, 80:580-598, 1985.
[BK85] A. Buja and R. Kass. Comment on [BF85]. Journal of the American Statistical Association, 80:602-607, 1985.
[BS83] R. B. Bapat and V. S. Sunder. On Majorization and Schur Products. Technical Report 8319, Indian Statistical Institute, New Delhi, 1983.
[Buj85] A. Buja. Theory of Bivariate ACE. Technical Report 74, Department of Statistics, University of Washington, Seattle, 1985.
[dL82] J. de Leeuw. Nonlinear principal components analysis. COMPSTAT, 77-86, 1982.
[dLYT76] J. de Leeuw, F. W. Young, and Y. Takane. Additive structure in qualitative data: an alternating least squares approach with optimal scaling features. Psychometrika, 41:471-503, 1976.
[EY36] C. Eckart and G. Young. The approximation of a matrix by another of lower rank. Psychometrika, 1:211-218, 1936.
[FK85] E. B. Fowlkes and J. R. Kettenring. Comment on [BF85]. Journal of the American Statistical Association, 80:607-613, 1985.
[FS81] J. H. Friedman and W. Stuetzle. Smoothing of Scatterplots. Technical Report ORION003, Department of Statistics, Stanford University, Stanford, California, 1981.
[Fuc79] V. R. Fuchs. Low Level Radiation and Infant Mortality. Technical Report unknown, National Bureau of Economic Research, Stanford, California, 1979.
[Gif81] A. Gifi. Non-linear Multivariate Analysis. Department of Data Theory, Leiden, 1981.
[Gna77] R. Gnanadesikan. Methods for Statistical Data Analysis of Multivariate Observations. Wiley, New York, 1977.
[Has83] T. Hastie. Principal Curves and Surfaces. PhD thesis, Department of Statistics, Stanford University, Stanford, California, 1983.
[Hot33] H. Hotelling. Analysis of a complex of statistical variables into principal components. Journal of Educational Psychology, 24:417-441, 498-520, 1933.
[HR78] D. Harrison and D. L. Rubinfeld. Hedonic housing prices and the demand for clean air. Journal of Environmental Economics and Management, 5:81-102, 1978.
[HT86] T. Hastie and R. Tibshirani. Generalized additive models. Statistical Science, 1:297-318, 1986.
[Jor70] K. Jorgens. Linear Integral Operators. Pitman, London, 1970.
[KG85] R. Klein and E. G. Garayalde. Nonlinear principal components by projection pursuit. Informes de Matemática, Série B-032, Rio de Janeiro, 1985.
[Koy85] R. Koyak. Optimal Transformations for Multivariate Linear Reduction Analysis. PhD thesis, Department of Statistics, University of California, Berkeley, California, 1985.
[KS74] J. B. Kruskal and R. N. Shepherd. A nonmetric variety of linear factor analysis. Psychometrika, 39:123-157, 1974.
[Lan58] H. O. Lancaster. The structure of bivariate distributions. Annals of Mathematical Statistics, 29:719-736, 1958.
[Mil] W. A. Mills. Preface to [TB69].
[MP85] J. A. McDonald and J. O. Pedersen. Computing environments for data analysis, parts I-III. SIAM J. Scientific and Statistical Computing, 6:1004-1021, 1985.
[Pea01] K. Pearson. On lines and planes of closest fit to points in space. Phil. Magazine, 2:559-572, 1901.
[Sil69] S. D. Silvey. Multicollinearity and imprecise estimation. JRSSB, 31:539-552, 1969.
[Spe04] C. Spearman. The proof and measurement of association between two things. American Journal of Psychology, 15:72 and 202, 1904.
[Ste69a] E. J. Sternglass. Evidence for low-level radiation effects on the human embryo and fetus. In Proceedings of Hanford Symposium: The Radiation Biology of the Fetal and Juvenile Mammal, pages 5-8, May 1969.
[Ste69b] E. J. Sternglass. Infant mortality and nuclear tests. Bul. Atomic Scientists, XXV:18-20, 1969.
[Sto85] C. J. Stone. Additive regression and other nonparametric models. Annals of Statistics, 13:689-705, 1985.
[Thu31] L. L. Thurstone. Multiple factor analysis. Psychological Review, 38:406-427, 1931.
[YdLT76] F. W. Young, J. de Leeuw, and Y. Takane. Regression with qualitative and quantitative variables: an alternating least squares approach with optimal scaling features. Psychometrika, 41:505-529, 1976.
[YTdL78] F. W. Young, Y. Takane, and J. de Leeuw. The principal components of mixed measurement level multivariate data: an alternating least squares approach with optimal scaling features. Psychometrika, 43:279-281, 1978.
Appendix A
Statistical Programming on the
Symbolics 36xx Lisp Machine
The algorithm for estimating the APC, the graphical interpretation techniques for
the APC, and the diagnostic for additive regression were implemented on the Symbolics
Lisp Machine 36xx series (SLM). These machines are currently nonstandard for statistical
research, and yet, as argued by McDonald and Pederson [MP85], they possess many
capabilities making them well suited for this use. I will discuss here my experience in
using these machines for the programming tasks of this dissertation.
The SLM is a single-user graphics workstation. It has computing power roughly
equivalent to a VAX 780, a high resolution bitmap display and a graphical input device
(mouse).
There are two aspects of the machine which I found significantly affected the path of
my research: the programming environment and the graphics capabilities.
Programming environment
The SLM has an integrated programming environment, which is distinguished from more
conventional operating systems (e.g., UNIX or VMS) by two features:
• Procedures and data remain resident in memory, so that programs can be developed and modified incrementally.
• It uses a single language for (almost) all programming tasks.
The first of these has a considerable impact on the programmer's willingness to ex
periment with an implementation. Since procedures and data remain in memory, small
changes can be made incrementally - that is, without the overhead of linking and reload
ing programs into memory. I found this resulted in a faster, less frustrating coding stage,
since the time between conception and execution of changes and corrections to the pro
gram was not significant. More importantly, however, I was encouraged to experiment
with the algorithm at all levels of program development: changes to input values, data,
function definitions and procedures are all simple to effect, and the time commitment in
doing so is not daunting. In addition, since any intermediate stage of the iterative algo
rithm could be examined, and modifications made interactively, it was easy to experiment
with factors affecting the implementation performance. This close acquaintance with the
inner workings of an implementation is a far cry from the black box paradigm of batch
processing.
The single language of the SLM is the integrated dialect Flavors, an object oriented
extension of LISP. Object oriented programming languages are general purpose program
ming languages, which I can attest, are particularly accessible to the naive user. This is
primarily because the language provides a natural mental model for programming, that
is, the abstractions of the language come close to the way we naturally think about a
problem. This again improves communication between user and computer, enhancing the
capabilities of both the machine and the user.
In fact the SLM is not strictly monolingual. It has a FORTRAN compiler and an
interface that allows procedures to be called from LISP programs. Hence I was able to use
existing, tested FORTRAN software for the Supersmooth, and the EISPACK subroutines
for eigen decompositions.
The Graphic Capabilities
The combination of a high resolution bitmap display and the mouse permits a natural
graphical language between user and computer. The multi-window system of the SLM
allows effective use of the bitmap display, and easy interaction between the multiple
functions of the software. Together, these three features provide strong support for graphical
interaction.
Single user machines, as opposed to time sharing machines, allow real time motion
in graphics. To give the illusion of smooth motion the graphics program must satisfy
exacting timing constraints. On a single user machine, with adequate computing power
on demand, and high speed data transfer between CPU and display, the required response
time is guaranteed.
The high speed graphical interaction of the SLM also permits real time constraint
satisfaction, which enabled the implementation of brushing on connected scatterplots. The
constraint that all points representing the same observation (in connected scatterplots)
be drawn with the same glyph is satisfied practically simultaneously with the observation
being brushed by the mouse. The speed of this interaction is a major factor in its utility
as an interpretational tool.
Vita
Deborah Donnell was born October 29, 1958 in Auckland, New Zealand.
She completed high school at Rangitoto College, Auckland, in 1976, having been accredited
the University Entrance Examination and winning a Junior Scholarship Award.
She graduated from Auckland University in 1980, with a Bachelor of Arts in Music
and Mathematics, gaining the senior mathematics prize in her final year. In 1982 she
completed a Master of Arts degree in Mathematics at Auckland University.