Principal manifolds and graphs in practice: from … examples are taken from comparative political science, ... mate a ﬁnite set D in Rm for relatively large m by ... Principal Manifolds

June 3, 2010 15:58 00238

International Journal of Neural Systems, Vol. 20, No. 3 (2010) 219–232c© World Scientific Publishing Company

DOI: 10.1142/S0129065710002383

PRINCIPAL MANIFOLDS AND GRAPHS IN PRACTICE: FROMMOLECULAR BIOLOGY TO DYNAMICAL SYSTEMS

ALEXANDER N. GORBANDepartment of Mathematics, University of Leicester

University Road, Leicester LE1 7RH, [email protected]

ANDREI ZINOVYEV∗

Institut Curie, U900 INSERM/Curie/Mines Paritechrue d’Ulm 26, Paris, 75005, France

[email protected]

We present several applications of non-linear data modeling, using principal manifolds and principalgraphs constructed using the metaphor of elasticity (elastic principal graph approach). These approachesare generalizations of the Kohonen’s self-organizing maps, a class of artificial neural networks. On severalexamples we show advantages of using non-linear objects for data approximation in comparison to thelinear ones. We propose four numerical criteria for comparing linear and non-linear mappings of datasetsinto the spaces of lower dimension. The examples are taken from comparative political science, fromanalysis of high-throughput data in molecular biology, from analysis of dynamical systems.

Keywords: Kohonen neural networks; self-organizing maps; principal manifolds; principal graphs; datavisualization.

1. Introduction

In this paper, we provide several examplesof application of the principal manifold andgraphs methodology taken from applied projects:comparative political science, data analysis in molec-ular biology, theoretical methods of dynamical sys-tems analysis. As it will be shown, one of the mostcommon problems in these fields is how to approxi-mate a finite set D in Rm for relatively large m bya finite subset of a regular low-dimensional objectin Rm.

The first hypothesis we have to check is: whetherthe dataset D is situated near a low–dimensionalaffine manifold (plane) in Rm. If we look for a point,straight line, plane, . . . that minimizes the averagesquared distance to the datapoints, we immediatelycome to the Principal Component Analysis (PCA).

PCA is one of the most seminal inventions indata analysis. Now it is textbook material. Nonlinear

generalization of PCA is a great challenge, and manyattempts have been made to answer it. In 1982 Koho-nen introduced a type of neural networks called Self-Organizing Maps (SOM).1

With the SOM algorithm we take a finite metricspace Y with metric ρ (a space of “neurons”) and tryto map it into Rm with (a) the best preservation ofinitial structure in the image of Y and (b) the bestapproximation of the dataset D. The SOM algorithmhas several setup variables to regulate the compro-mise between these goals. We start from some ini-tial approximation of the map, ϕ1 : M → Rm. Oneach (k-th) step of the algorithm we have a datapointx ∈ D and a current approximation ϕk : M → Rm.For these x and ϕk we define an “owner neuron” of x

in Y : yx = argminy∈Y ‖x−ϕk(y)‖. The next approx-imation, ϕk+1, is

ϕk+1(y) = ϕk(y) + hk w(ρ(y, yx))(x − ϕk(y)). (1)

∗Corresponding author.

219

http://dx.doi.org/10.1142/S0129065710002383

June 3, 2010 15:58 00238

220 A. N. Gorban & A. Zinovyev

Here hk is a step size, 0 ≤ w(ρ(y, yx)) ≤ 1is a monotonically decreasing cutting function andρ(y, yx) is distance between y and yx. There are manyways to combine steps (1) in the whole algorithm.The idea of SOM is very flexible and seminal, hasplenty of applications and generalizations (for exam-ple, see Refs. 2–4), but, strictly speaking, we don’tknow what we are looking for: we have the algorithm,but no independent definition: SOM is a result ofthe algorithm work. The attempts to define SOM assolution of a minimization problem for some energyfunctional were not very successful.5

For a known probability distribution, principalmanifolds were introduced as lines or surfaces passingthrough “the middle” of the data distribution.6

This intuitive vision was transformed into themathematical notion of self-consistency: every pointx of the principal manifold M is a conditional expec-tation of all points z that are projected into x. Nei-ther manifold, nor projection should be linear: just adifferentiable projection π of the data space (usuallyit is Rm or a domain in Rm) onto the manifold M

with the self-consistency requirement for conditionalexpectations: x = E(z|π(z) = x). For a finite datasetD, only one or zero datapoints are typically pro-jected into a point of the principal manifold. In orderto avoid overfitting, we have to introduce smoothersthat become an essential part of the principal mani-fold construction algorithms.

SOMs give the most popular approximations forprincipal manifolds: we can take for Y a fragment ofa regular k-dimensional grid and consider the result-ing SOM as the approximation to the k-dimensionalprincipal manifold (see, for example, Refs. 7 and 8).Several original algorithms for construction of prin-cipal curves9 and surfaces for finite datasets weredeveloped during last decade, as well as many appli-cations of this idea. In 1996, in a discussion aboutSOM at the 5th Russian National Seminar in Neu-roinformatics, a method of multidimensional dataapproximation based on elastic energy minimizationwas proposed (see Refs. 10–17 and the bibliographythere). This method is based on the analogy betweenthe principal manifold and the elastic membrane(and plate). Following the metaphor of elasticity, weintroduce two quadratic smoothness penalty terms.This allows one to apply standard minimization ofquadratic functionals (i.e., solving a system of linearalgebraic equations with a sparse matrix).

2. Method of Principal Elastic Graphsand Principal Elastic Maps

2.1. Principal graphs and manifolds

Here we give a short description of the structure ofthe elastic functional used for construction of princi-pal graphs and manifolds. For more detailed descrip-tion, see other works (in particular, Ref. 17).

In a series of works (see Refs. 10–19), theauthors of this paper used metaphor of elastic mem-brane and plate to construct one-, two- and three-dimensional principal manifold approximations ofvarious topologies. Mean squared distance approxi-mation error combined with the elastic energy of themembrane serves as a functional to be optimised.The elastic map algorithm is extremely fast at theoptimisation step due to the simplest form of thesmoothness penalty. It is implemented in several pro-gramming languages as software libraries or front-end user graphical interfaces freely available from theweb-site http://bioinfo.curie.fr/projects/vidaexpert.The software found applications in microarray dataanalysis, visualization of genetic texts, visualiza-tion of economical and sociological data and otherfields.10–19

Let G be a simple undirected graph with set ofvertices V and set of edges E.

Definition. k-star in a graph G is a subgraph withk + 1 vertices v0,1,...,k ∈ V and k edges {(v0, vi)|i =1, . . . , k}.

Definition. Suppose that for each k ≥ 2, a familySk of k-stars in G has been selected. Then we definean elastic graph as a graph with selected families ofk-stars Sk and for which for all E(i) ∈ E and S

(j)k ∈

Sk, the corresponding elasticity moduli λi > 0 andµkj > 0 are defined.

Definition. Primitive elastic graph is an elasticgraph in which every non-terminal node (with thenumber of neighbours more than one) is associatedwith a k-star formed by all neighbours of the node.All k-stars in the primitive elastic graph are selected,i.e. the Sk sets are completely determined by thegraph structure.

Definition. Elastic net of dimension s is a particu-lar case of elastic graph which (1) contains only ribs(2-stars) (the family Sk are empty for all k > 2);

June 3, 2010 15:58 00238

Principal Manifolds and Graphs in Practice: From Molecular Biology to Dynamical Systems 221

and (2) the vertices of this graph form a regular s-dimensional grid. Let E(i)(0), E(i)(1) denote two ver-tices of the graph edge E(i) and S

(j)k (0), . . . , S(j)

k (k)denote vertices of a k-star S

(j)k (where S

(j)k (0) is the

central vertex, to which all other vertices are con-nected). Let us consider a map φ : V → Rm whichdescribes an embedding of the graph into a multi-dimensional space. The elastic energy of the graphembedding in the Euclidean space is defined as

Uφ(G) := UφE(G) + Uφ

R(G), (2)

UφE(G) :=

∑E(i)

λi‖φ(E(i)(0))− φ(E(i)(1))‖2, (3)

UφE(G) :=

∑

S(j)k

µkj

∥∥∥∥∥φ(S(j)k (0))− 1

k

k∑i=1

φ(S(j)k (i))

∥∥∥∥∥2

.

(4)

Definition. Elastic map is a continuous mani-fold Y ∈ Rm constructed from the elastic netas its grid approximation using some between-node interpolation procedure. This interpolationprocedure constructs a continuous mapping φc :{φ1, . . . , φdim(G)} → Rm from the discrete mapφ : V → Rm, used to embed the graph in Rm, andthe discrete values of node indices {λi

1, . . . , λidim(G)},

i = 1 . . . |V |. For example, the simplest piecewise lin-ear elastic map is build by piecewise linear map φc.

Definition. Elastic principal manifold of dimensions for a dataset X is an elastic map, constructed froman elastic net Y of dimension s embedded in Rm

using such a map φopt : Y → Rm, that correspondsto the minimal value of the functional

Uφ(X, Y ) = MSDW (X, Y ) + Uφ(G), (5)

where the weighted mean squared distance fromthe dataset X to the elastic net Y is calculatedas the distance to the finite set of vertices {y1 =φ(ν1), . . . , yk = φ(vk)}.

In the Euclidean space one can apply an EMalgorithm for estimating the elastic principal man-ifold for a finite dataset. It is based in turn on thegeneral algorithm for estimating the locally optimalembedding map φ or an arbitrary elastic graph G

(see Ref. 17).

2.2. Pluriharmonic graphs as idealapproximators

Approximating datasets by one dimensional princi-pal curves is not satisfactory in the case of datasets

that can be intuitively characterized as branched. Aprincipal object which naturally passes through the‘middle’ of such a data distribution should also havebranching points that are missing in the simple struc-ture of principal curves. Introducing such branchingpoints converts principal curves into principal graphs.

In Refs. 13, 15 and 17, it was proposed to usea universal form of non-linearity penalty for thebranching points. The form of this penalty is definedin the previous Section (4) for the elastic energyof graph embedment. It naturally generalizes thesimplest three-point second derivative approximationsquared: for a 2-star (or rib) the penalty equals‖φ(S(j)

2 (0)) − 12 (φ(S(j)

2 (1)) + φ(S(j)2 (2)))‖2, for a 3-

star it is ‖φ(S(j)3 (0)) − 1

3 (φ(S(j)3 (1)) + φ(S(j)

3 (2)) +φ(S(j)

3 (3)))‖2, etc.For a k-star this penalty equals to zero iff the

position of the central node coincides with the meanpoint of its neighbors. An embedment φ(G) is ‘ideal’if all such penalties equal to zero. For a primitiveelastic graph this means that this embedment is aharmonic function on graph: its value in each non-terminal vertex is a mean of the value in the closestneighbors of this vertex.

For non-primitive graphs we can consider starswhich include not all neighbors of their centers. Forexample, for a square lattice we create elastic graph(elastic net) using 2-stars (ribs): all vertical 2-starsand all horizontal 2-stars. For such elastic net, eachnon-boundary vertex belongs to two stars. For a gen-eral elastic graph G with sets of k-stars Sk we intro-duce the following notion of pluriharmonic function.

Definition. A map φ : V → Rm defined on verticesof G is pluriharmonic iff for any k-star S

(j)k ∈ Sk

with the central vertex S(j)k (0) and the neighbouring

vertices S(j)k (i), i = 1, . . . , k, the equality holds:

φ(S(j)k (0)) =

1k

k∑i=1

φ(S(j)k (i)). (6)

Pluriharmonic maps generalize the notion of lin-ear map and of harmonic map, simultaneously. Forexample:

(1) 1D harmonic functions are linear;(2) If we consider an n-dimensional cubic lattice

as a primitive graph (with 2n-stars for allnon-boundary vertices), then the correspondentpluriharmonic functions are just harmonic ones;

June 3, 2010 15:58 00238


(3) If we create from n-dimensional cubic lattice thestandard n-dimensional elastic net with 2-stars(each non-boundary vertex is a center of n 2-stars, one 2-star for each coordinate direction),then pluriharmonic functions are linear.

In the theory of principal curves and manifolds thepenalty functions were introduced to penalise devia-tion from linear manifolds (straight lines or planes).We proposed to use pluriharmonic embeddings as“ideal objects” instead of manifolds and to introducepenalty (2) for deviation from this ideal form.

2.3. Complexity of principal graphs

The principal graphs can be called data approxi-mators of controllable complexity. By complexity ofthe principal objects we mean the following threenotions:

(1) Geometric complexity: how far a principal objectdeviates from its ideal configuration; for the elas-tic principal graphs we explicitly measure devia-tion from the ‘ideal’ pluriharmonic graph by theelastic energy Uφ(G) (2) (this sort of complexityis often just a measure of non-linearity);

(2) Structural complexity measure: it is some non-decreasing function of the number of ver-tices, edges and k-stars of different ordersSC(G) = SC(|V |, |E|, |S2|, . . . , |Sm|); this func-tion penalises for number of structural elements;

(3) Construction complexity is defined with respectto a graph grammar as a minimum number ofapplications of elementary transformations nec-essary to construct given G from the simplestgraph (one vertex, zero edges).

The construction complexity is defined withrespect to a grammar of elementary transformation.The graph grammars20,21 provide a well-developedformalism for the description of elementary trans-formations. An elastic graph grammar is presentedas a set of production (or substitution) rules. Eachrule has a form A → B, where A and B are elasticgraphs. When this rule is applied to an elastic graph,a copy of A is removed from the graph together withall its incident edges and is replaced with a copy ofB with edges that connect B to the graph. For afull description of this language we need the notionof a labeled graph. Labels are necessary to providethe proper connection between B and the graph.21

An approach based on graph grammars to construct-ing effective approximations of elastic principal graphwas proposed recently.13,15,17

Let us define graph grammar O as a set of graphgrammar operations O = {o1, . . . , os}. All possibleapplications of a graph grammar operation oi to agraph G gives a set of transformations of the ini-tial graph oi(G) = {G1, G2, . . . , Gp}, where p is thenumber of all possible applications of oi to G. Let usalso define a sequence of r different graph grammars{O(1) = {o(1)

1 , . . . , o(1)s1 }, . . . , O(r) = {o(r)

1 , . . . , o(r)sr }}.

Let us choose a grammar of elementary trans-formations, predefined boundaries of structural com-plexity SCmax and construction complexity CCmax,and elasticity coefficients λi and µkj .

Definition. Elastic principal graph for a dataset X

is such an elastic graph G embedded in the Euclideanspace by the map φ : V → Rm that SC(G) ≤ SCmax,CC(G) ≤ CCmax, and Uφ(G)→ min over all possibleelastic graphs G embeddings in Rm.

For the simplest choice of grammar, this defini-tion gives us principal trees (see the Sec. 3.3).

3. Real-Life Examples of PrincipalGraph and Manifold Applications

3.1. Happiness (quality of life) isnon-linear

Let us take a simple real-life example where applica-tion of linear data approximation is not satisfactoryand non-linear one-dimensional principal manifolds(principal curves) are needed to adequately describethe data.

In the Political World Atlas project, launchedby the Moscow State Institute of InternationalRelations,22 systematic quantitative data on 192modern countries for the 1989–2005 period werecollected. One of the goals was to estimate the“objective” position of post-sovetic countries, inparticular, in terms of “objective” estimates ofthe quality of life. Similarly to other projects onquantification of country happiness or life quality,the following quantitative indicators were used forconstructing an integral index and ranking: grossproduct per capita in dollars, life expectancy inyears, infant mortality in cases per 10000 habitants,tuberculosis incidence in cases per 100000 habitants.A common practice in this field is to combine theseor similar features in one index (number) which

June 3, 2010 15:58 00238


usually represents a linear combination of featureswith weights determined by experts (see, for exam-ple, Ref. 23). This approach is inevitably biased bythe experts’ opinion. An unbiased approach wouldbe to use the best one-dimensional approximation ofdata to introduce the “objective” ranking. It makessense if a good approximation function exists.

Each of 192 countries is characterized by the4 values, thus, we can represent this dataset as acloud of 192 points in 4-dimensional space. Simplevisualization of this distribution allows to make aconclusion that no linear function exists that canequally well serve for reducing the dimension of thisspace from 4 to 1 (see Fig. 1). The distribution isintrinsically curved, hence, any linear mapping willinevitably give strong distortions in one or otherregion of dataspace.

There is a simple reason for this. ObservingFig. 1, it is easy to realize that all countries can be

Fig. 1. Approximation of data used for constructing the quality of life index. Each of 192 points represents a countryin 4-dimensional space formed by the values of 4 indicators (gross product, life expectancy, infant mortality, tuberculosisincidence). Different forms and colors correspond to various geographical locations. Red line represents the principal curve,approximating the dataset. Three arrows show the basis of principal components. The best linear index (first principalcomponent) explains 76% of variation, while the non-linear index (principal curve) explains 93% of variation.

roughly separated in two groups. First group con-sists of very wealthy countries, mostly from Westernand Nothern Europe, USA, Australia and some oth-ers (right branch of the distribution on the Fig. 1).It happens that the most of variation among thesecountries can be attributed to the gross product percapita feature while others are approximately equalfor them and do not contribute significantly to thevariance. The second group (left branch of the dis-tribution on the Fig.1) consists of very poor coun-tries (mostly, African), which are “equally poor” interms of the gross product per capita but can bevery different in terms of their problems (lower ofhigher life expectancy, level of infectious diseases,conditions of the state health system), for a num-ber of reasons (wars, difference in internal politics,etc.). However, this classification is not perfect: infact, there are many countries which are localized inthe non-linear junction between these two branches

June 3, 2010 15:58 00238


of the 4-dimensional distribution. This intermediategroup includes the most of the post-sovetic countrieswhich are the subject of the data analysis in thisstudy, hence, it is important to correctly positionthem in the global picture.

To do this, we have constructed a principal curveapproximating the curved data distribution (red lineon the Fig. 1). The advantage of using the principalcurve instead of principal line was significant in termof Mean-Squared Error (93% vs 76% of explainedvariance). Importantly, using the non-linear index fornon-linear one-dimensional dimension reduction cansignificantly change the absolute and relative rank-ings for many countries, including Russia. Thus, forthe index, constructed using the first principal com-ponent, Russia in 2005 is ranked the 86th (with themost prosperous Luxembourg at the 1st place), after,for example, Iran (74th) and Egypt (84th). For non-linear index, Russia is ranked the 71st before Iran(77th) and Egypt (85th).

3.2. Dimension reduction formicroarrays

Now let us consider application of two-dimensionalprincipal manifolds for data visualization. This istheir natural application since they permit to cre-ate a mapping from multidimensional space of datato two- or three- dimensional spaces, thus, providinga possibility to create visual images of multidimen-sional distributions.

DNA microarray data is a rich source of informa-tion for molecular biology (for a recent overview, seeRef. 24). This technology found numerous applica-tions in understanding various biological processesincluding cancer. It allows simultaneous screeningof the expression of all genes in a cell exposedto some specific conditions (for example, stress,cancer, treatment, normal conditions). Obtaining asufficient number of observations (chips), one canconstruct a table of “samples vs genes”, contain-ing logarithms of the expression levels of, typicallyseveral thousands (n) of genes, in typically sev-eral tens (m) of samples. Visualization of microar-ray data is highly demanded field of researchthat, for example, can help with breast cancerdiagnosis.25

For this study, we took three distinct microar-ray datasets, provided at http://www.math.le.ac.uk/

people / ag153 / homepage / PrincManLeicAug 2006.htm:

(1) Breast cancer dataset from Ref. 26, 286 tumorsamples and 17816 genes, several natural group-ings of samples accordingly to survival ofpatients, status of some cell receptors and molec-ular classification of tumors.

(2) Bladder cancer dataset from Ref. 27, 40 tumorsamples, several natural groupings of samplesaccordingly to the stage and grade of the tumor.

(3) Collection of microarrays for normal human tis-sues from Ref. 28, 103 samples of healthy tis-sues, grouped accordingly to the tissue of origin(brain, heart, pancreas, etc.).

In Fig. 2 we compare data visualization scattersafter projection of the normal tissues dataset, pro-vided in Ref. 28, onto the linear two-dimensional andnon-linear two-dimensional principal manifold. Thelater one is constructed by the elastic maps approach.Each point here represents a sample of tissue, labeledby the tissue name. Before dimension reduction it isrepresented as a vector in Rn, containing the expres-sion values for all n genes in the sample.

One of the conclusions that can be made fromthe Fig. 2 is that the non-linear mapping is capa-ble of resolving the structure of the data distribu-tion in more details. In particular, some groups ofpoints that are not separated on the linear scatterbecome separated on the non-linear one. We havedecided to quantify this effect, thus, comparing thequality of linear and non-linear mappings of the samedimension. Four criteria, each characterizing a pro-jection of dataset into the low-dimensional spacewere developed:

(1) Mean-Squared Error (MSE). This is simply themean squared distance from all of the points tothe manifold, where the projection is orthogonal(by the shortest distance). MSE is measured in %of total variance. Notice that non-linear approx-imations should give smaller MSE by construc-tion, since they are less rigid than the linear ones.

(2) Quality of distance mapping (QDM). This is acorrelation coefficient between the pair-wise dis-tances between data points before (dij) and after(dij) projection onto the manifold:

QDM = corr(dij , dij) (7)

June 3, 2010 15:58 00238


(a) (b)

Fig. 2. Comparing linear (b) and non-linear (a) two-dimensional projections of data on the expression of genes in normalhuman tissues. Each point represents a tissue sample (brain, heart, pancreas, etc., each denoted by a distinct color) bya vector of expression of several thousands of genes. The linear projection is done by the standard PCA approach. Thenon-linear projection is made by the elastic map method, described in this paper.

For estimating QDM measure we also proposedto calculate the correlation coefficient only on“the most representative” subset of pair-wise dis-tances, selected by a procedure which we havecalled “Natural PCA”. The procedure was intro-duced in Ref. 16. The first “natural” componentis defined by the pair of the most distant points(i1, j1). The second component is such a pairof points (i2, j2) that 1) i2 is the most distantpoint to the first component, where the distancefrom a point to a set of points S is defined as thedistance to the closest point in S; 2) j2 is thepoint from the first component which is the clos-est to i2. All the next components are introducedanalogously: the nth component (in, jn) is sucha pair of points that the in is the most distantfrom the union Sn−1 of all components from 1 ton − 1 and jn is the point from this union Sn−1

which is the closest to in. We use both Pearsonand Spearman correlation coefficients for calcu-lating QDM.

(3) Quality of point neighborhood preservation(QNP). For measuring it, for every data point i

we calculate the size of the intersection of the setof k neighbours in the multi-dimensional spaceS(i; k) and in the low-dimensional space S(i; k).

QNP is the average value of this intersectiondivided by k:

QNPk = 1/k∑

i=1...N

|S(i; k) ∩ S(i; k)|/N. (8)

(4) Quality of group compactness (QGC). In this testwe assume that there is a label C(i) associatedwith every point i. Then, for each label B, we cancalculate the average number of points with thesame label in the k-neighborhood of the pointsbefore and after projection. Let us define c(i; k)as the number of points in the k-neighbourhoodof the point i having the label C(i). Then, for alabel B,

QGCk(B) = 1/k∑

C(i)=B

c(i; k)/N(B), (9)

where N(B) is the number of points having thelabel B. Values of QGC close to 1 would indicatethat the points of the label B forms a very com-pact and separated from the other labels group.Comparing QGC(B) before and after projectioninto the low-dimensional space, one can concludeon what happened with the group of points of thelabel B.

June 3, 2010 15:58 00238


Fig. 3. Criterion 1 (Mean-Squared Error, MSE)for comparison of linear (of dimension 1-5, 10, columnsPC1-PC5, PC10) and non-linear projections of data (ofdimension 2, column ELMAP2D). The color mark onthe table shows the comparative performance of thenon-linear method compared to the linear ones. Heretwo-dimensional ELMAP approach performs as well as4-dimensional linear approximations.

Fig. 4. Criterion 2 (Quality of distance mapping,QDM). For the column titles see the legend to theFig. 3. The plot shows the comparison of pair-wise pointdistances in the initial mutli-dimensional space and onthe manifold after projection. The subset of distancesselected by NPCA approach is shown (see the text forexplanations).

It is clear from the Figs. 3–6 that the non-lineartwo-dimensional principal manifolds provide system-atically better results accordingly to all four criteria,achieving the performance of three- and four- dimen-sional linear principal manifolds.

Fig. 5. Criterion 3 (Quality of point neighborhoodpreservation,QNP). For the column titles see the leg-end to the Fig. 3. The RANDOM column shows the ran-dom permutation test when random k points are takeninstead of the real k-neighbourhood.

A special attention should be made to the per-formance of the non-linear principal manifolds withrespect to the QGC criterion. It works particu-larly well for the collection of normal tissues. Thereare cases when neither linear nor non-linear low-dimensional manifolds could put together points ofthe same class and there are a few examples whenlinear manifolds perform better. In the latter cases(Breast cancer’s A, B, lumA, lumB and “unclassi-fied” classes, bladder cancer T1 class), almost allclass compactness values are close to the estimatedrandom values which means that these classes havebig intra-class dispersions or are poorly separatedfrom the others. In this case the value of class com-pactness becomes unstable (look, for example, at theclasses A and B of the breast cancer dataset) anddepends on random factors which can not be takeninto account in this framework.

The closer class compactness is to unity, theeasier one can construct a decision function sepa-rating this class from the others. However, in thehigh-dimensional space, due to many degrees offreedom, the “class compactness” might be compro-mised and become better after appropriate dimen-sion reduction. In Fig. 6 one can find examples whendimension reduction gives better class compactnessin comparison with that calculated in the initialmulti-dimensional space (breast cancer basal sub-type, bladder cancer Grade 2 and T1, T2+ classes).It means that sample classifiers can be regularizedby dimension reduction using PCA-like methods.

There are several particular cases (breast cancerbasal subtype, bladder cancer T2+, Grade 2 sub-types) when non-linear manifolds give better class

June 3, 2010 15:58 00238


Fig. 6. Criterion 4 (Quality of group compactness,QGC). For the column titles see the legend to the Fig.3. The numbers on the right and the row colors havethe following meaning (see the graph above): number 3is a group which is compact in the original space andremains compact after the projection; number 1 group isnot compact before and after projection; number 2 groupbeing compact before projection, becomes dispersed afterthe projection; number 4 group being not compact before,becomes compact after the projection. ‘+’ on the leftmeans ‘better than linear’, ‘++’ means ‘much better’,‘+++’ means ‘incomparably better’, ‘-’ means ‘worse’.

compactness than both the multidimensional spaceand linear principal manifolds of the same dimension.In these cases we can conclude that the dataset in theregions of these classes is naturally “curved” and theapplication of non-linear techniques for classificationregularization is an appropriate solution.

We can conclude that non-linear principal mani-folds provide systematically better or equal resolu-tion of class separation in comparison with linearmanifolds of the same dimension. They performparticularly well when there are many small and

relatively compact heterogeneous classes (as in thecase of normal tissue collection).

3.3. Principal trees and Metro Maps

Let us demonstrate how the idea of graph gram-mars (Sec. 2) allows constructing the simplest non-trivial type of the principal graphs, called principaltrees.15,17 For this purpose let us introduce a sim-ple ‘Add a node, bisect an edge’ graph grammar(see Fig. 7) applied to the class of primitive elasticgraphs.

Definition. Principal tree is an acyclic primitiveelastic principal graph.

Definition. ‘Add a node, bisect an edge’ graph gram-mar O(grow) applicable for the class of primitiveelastic graphs consists of two operations: (1) Thetransformation “add a node” can be applied to anyvertex v of G: add a new node z and a new edge (v, z);(2) The transformation “bisect an edge” is applica-ble to any pair of graph vertices v, v′ connected byan edge (v, v′): delete edge (v, v′), add a vertex z

and two edges, (v, z) and (z, v′). The transformation

(a) (b)

(c)

Fig. 7. Illustration of the simple “add node to a node”or “bisect an edge” graph grammar. (a) We start with asimple 2-star from which one can generate three distinctgraphs shown. The “Op1” operation is adding a nodeto a node, operations “Op1” and “Op2” are edge bisec-tions (here they are topologically equivalent to adding anode to a terminal node of the initial 2-star). For illus-tration let us suppose that the “Op2” operation gives thebiggest elastic energy decrement, thus it is the “optimal”operation. (b) From the graph obtained one can gener-ate 5 distinct graphs and choose the optimal one. (c) Theprocess is continued until a definite number of nodes areinserted.

June 3, 2010 15:58 00238


of the elastic structure (change in the star list) isinduced by the change of topology, because the elas-tic graph is primitive. Consecutive application of theoperations from this grammar generates trees, i.e.graphs without cycles.

Definition. ‘Remove a leaf, remove an edge’ graphgrammar O(shrink) applicable for the class of primi-tive elastic graphs consists of two operations: (1) Thetransformation ‘remove a leaf ’ can be applied to anyvertex v of G with connectivity degree equal to 1:remove v and remove the edge (v, v′) connecting vto the tree; (2) The transformation ‘remove an edge’is applicable to any pair of graph vertices v, v′ con-nected by an edge (v, v′): delete edge (v, v′), deletevertex v′, merge the k-stars for which v and v′ arethe central nodes and make a new k-star for whichv is the central node with a set of neighbors whichis the union of the neighbors from the k-stars of v

and v′.Also we should define the structural complex-

ity measure SC(G) = SC(|V |, |E|, |S2|, . . . , |Sm|). Itsconcrete form depends on the application field. Hereare some simple examples:

(1) SC(G) = |V |: i.e., the graph is considered morecomplex if it has more vertices;

(2) SC(G) =

|S3|, if |S3| ≤ bmax

andm∑

k=4

|Sk| = 0,

∞, otherwise

i.e., only bmax simple branches (3-stars) are allowedin the principal tree.

To construct the principal tree, the following sim-ple algorithm is applied:

(1) Initialize the elastic graph G by 2 vertices v1

and v2 connected by an edge. The initial mapφ is chosen in such a way that φ(v1) and φ(v2)belong to the first principal line in such a waythat all the data points are projected onto theprincipal line segment defined by φ(v1), φ(v2);

(2) For a sequence of grammars O(j) = {O(grow),

O(grow), O(shrink)}, j = 1 . . . 3, repeat steps 3–6:(3) Apply all grammar operations from O(j) to G in

all possible ways; this gives a collection of can-didate graph transformations {G1, G2, . . . };

(4) Separate {G1, G2, . . . } into permissible and for-bidden transformations; permissible transforma-tion Gk is such that SC(Gk) ≤ SCmax, whereSCmax is some predefined structural complexityceiling;

(5) Optimize the embedding φ and calculate theelastic energy Uφ(G) of graph embedment forevery permissible candidate transformation, andchoose such a graph Gopt that gives the min-imal value of the elastic functional: Gopt =arg infGk∈permissible set Uφ(Gk);

(6) Substitute G← Gopt ;(7) Repeat steps 2–6 until the set of permissible

transformations is empty or the number of oper-ations exceeds a predefined number — the con-struction complexity.

Using the ‘tree trimming’ grammar O(shrink) allowsto produce principal trees closer to the global opti-mum, trimming excessive tree branching and fusingk-stars separated by small ‘bridges’.

Principal trees can have applications in data visu-alization. A principal tree is embedded into a mul-tidimensional data space. It approximates the dataso that one can project points from the multidimen-sional space into the closest node of the tree. Thetree by its construction is a one-dimensional object,so sthis projection performs dimension reduction ofthe multidimensional data. The question is how toproduce a planar tree layout? Of course, there aremany ways to layout a tree on a plane without edgeintersection. But it would be useful if both localtree properties and global distance relations wouldbe represented using the layout. We can requirethat

(1) In a two-dimensional layout, all k-stars should berepresented equiangular; this is the small penaltyconfiguration;

(2) The edge lengths should be proportional to theirlength in the multidimensional embedding; thusone can represent between-node distances.

This defines a tree layout up to global rotationand scaling and also up to changing the order ofleaves in every k-star. We can change this order toeliminate edge intersections, but the result cannotbe guaranteed. In order to represent the global dis-tance structure, it was found that a good approx-imation for the order of k-star leaves can be taken

June 3, 2010 15:58 00238


from the projection of every k-star on the linear prin-cipal plane15 calculated for all data point or on thelocal principal plane in the vicinity of the k-star, cal-culated only for the points close to this star. Theresulting layout can be further optimized using somegreedy optimization methods.

Note that the distance on the metro map is esti-mated by summing up the lengths of branches alongthe path (Fig. 8). Hence, any layout of the tree willnot distort this information. The k-stars are pro-jected onto the principal plane in order to find a goodordering of nodes inside the star, as a heuristics, toavoid excessive tree branch intersections. However,this ordering does not encode the distance along thetree branches but rather provide a way to generatea nice 2D layout.

Fig. 8. “Metro map” representation of normal humantissue samples (on the right) and the hierarchical den-drogram of the same data (on the left, shown onlyfor comparison). Both methods approximate the databy a tree-like structure, but using different metaphors(“genealogy tree” in the case of hierarchical dendrogramand “metro map” for the harmonic dendrite used by theprincipal tree method). In both representations the usercan estimate a distance from one sample to another bysumming up distances along the path in the tree. Thesize of circles on the metro map diagram is proportionalto the number of points projected into the correspondingnode, the pie diagram represents the composition of thecluster in terms of pre-existing sample groups (humantissues, in this case).

The point projections are then represented aspie diagrams, where the size of the diagram reflectsthe number of points projected into the correspond-ing tree node. The sectors of the diagram allow usto show proportions of points of different classesprojected into the node. An example of the metromap representation applied to the case of microarraydataset of normal tissues is shown in Fig. 8 (comparewith 2D representation of the same dataset shown inFig. 2).

3.4. Principal manifolds for dynamicalsystems analysis

Invariant manifold is a central concept in the theoryof dynamical systems and its construction provides aconsistent approach for model reduction.29 The gen-eral picture behind this concept is that starting froman initial condition, the dynamical system quicklyreaches vicinity of a curved low-dimensional mani-fold and during most of its dynamics remains closeto it.

In Ref. 30 we constructed a dynamical model ofNFkB biochemical signaling cascade. The model con-tains 35 dynamic variables representing concentra-tions of various biochemical species. This system iscapable for sustained oscillations and converges to alimit cycle in the phase space. When studying thedetails of the dynamics of this system, we found thatthe dynamics in the vicinity of the limit cycle canbe characterized by presence of a two-dimensionalinvariant manifold.31

Using the same methodology as described abovein the section 2 and a phase space sampling tech-nique, we have constructed the invariant manifoldapproximation. To do it, we first sample the trajec-tories of the dynamical system in the vicinity of itslimit cycle in the following way:

(1) Using the PCA approach we found a linear man-ifold in which the cycle trajectory is embedded.

(2) A system trajectory was computed in some inter-val of time [0; tm], started from a randomly cho-sen point of the limit cycle plus random shift inthe linear manifold calculated at Step 1. The sizeof this shift δ can be made up to the border ofthe phase space (when one of the concentrationsbecomes zero).

(3) The trajectory obtained at Step 2 was cut intotwo parts: for t ∈ [0; αtm] and t ∈ [αtm; tm]

June 3, 2010 15:58 00238


intervals, where α ∈ [0; 1]. For the sampling,only the second part of the trajectory was used,properly discretized. It was done to skip the first(fast) dynamics of the system towards a hypo-thetical invariant manifold.

(4) Steps 2 and 3 were repeated until sufficient(50000 in our experiments) number points weresampled.

The resulting distribution of points each rep-resenting a ‘snapshot’ along the trajectory ofthe dynamical system was approximated by anon-linear two-dimensional principal manifold con-structed using the elastic map approach. The result-ing manifold projected into the subspace of thefirst three (linear) principal components is shown onthe Fig. 9. The constructed approximation to theinvariant manifold can be further used for modelreduction by projection of the system dynamicsonto it.29,32

Principal component analysis applied to the tra-jectories of a dynamical system sometimes calledKarhunen-Loeve expansion33,34 and it is a useful toolfor reducing complexity of the dynamics. Hence, theapproach presented in this example can be called thenon-linear Karhunen-Loeve expansion.

Fig. 9. Two-dimensional invariant manifold (blue grid)approximation obtained by application of the principalmanifold methodology to the set of “snapshots” (blackpoints) along trajectories of a dynamical system. Themanifold is shown in the 3D projection of the first threeprincipal components. By red one particular trajectoryis shown, quickly approaching the manifold and furtherconverging along it to the limit cycle.

4. Conclusions

Principal component analysis was introduced intoapplied science by Karl Pearson more than one hun-dred years ago35 and since then it became one ofthe most used mathematical tool in many domainsof science (in every domain, where approximation ofa finite set of points is required).

Non-linear extensions of this method (Self-Organizing neural networks and their further gen-eralizations such as principal manifolds, principalgraphs and principal trees) also can serve as auniversal tool allowing to approximate complex dis-tributions of data points, when the linear approxima-tion happens to be insufficient. To prove superiorityand advantage of applying the non-linear approxima-tions, one can use the set of benchmarking criteriadescribed in this paper.

The elastic graph approach can be interpretedas an intermediate between absolutely flexible neu-ral gas36 and significantly more restrictive Self-Organizing Maps and elastic maps.

Using efficient implementation of principalgraphs and manifolds provided by the elasticgraph approach, applying these methods in prac-tice becomes relatively easy and computationally fastexercise. In this paper, we demonstrate this on sev-eral practical examples: from comparative politicalscience, data analysis in molecular biology and anal-ysis of dynamical systems for biochemical modeling.

References

1. T. Kohonen, Self-organized formation of topologi-cally correct feature maps, Biological Cybernetics 43(1982) 59–69.

2. Y. Y. Chen and K. Y. Young, An SOM-based algo-rithm for optimization with dynamic weight updat-ing, International Journal of Neural Systems 17(3)(2007) 171–181.

3. A. Fatehi and A. Kenichi, Flexible structure multiplemodeling using irregular Self-Organizing Maps neu-ral network, International Journal of Neural Systems18(3) (2008) 347–370.

4. C. Fyfe, W. Barbakh, W. C. Ooi and H. Ko, Topolog-ical mappings of video and audio data, InternationalJournal of Neural Systems 18(6) (2008) 481–489.

5. E. Erwin, K. Obermayer and K. Schulten, Self-organizing maps: Ordering, convergence propertiesand energy functions, Biological Cybernetics 67(1992) 47–55.

June 3, 2010 15:58 00238


6. T. Hastie and W. Stuetzle, Principal curves, Jour-nal of the American Statistical Association 84(406)(1989) 502–516.

7. F. Mulier and V. Cherkassky, Self-organization as aniterative kernel smoothing process, Neural Computa-tion 7 (1995) 1165–1177.

8. H. Ritter, T. Martinetz and K. Schulten, NeuralComputation and Self-Organizing Maps: An Intro-duction (Addison-Wesley Reading, Massachusetts,1992).

9. B. Kegl and A. Krzyzak, Piecewise linear skele-tonization using principal curves, IEEE Transactionson Pattern Analysis and Machine Intelligence 24(1)(2002) 59–74.

10. A. N. Gorban and A. A. Rossiev, Neural networkiterative method of principal curves for data withgaps, Journal of Computer and System SciencesInternational 38(5) (1999) 825–831.

11. A. Zinovyev, Visualization of Multidimensional Data(Krasnoyarsk State University Press Publ., 2000,168 p.)

12. A. Gorban and Y. Zinovyev, Elastic principal graphsand manifolds and their practical applications, Com-puting 75 (2005), 359–379.

13. A. Gorban, N. Sumner and A. Zinovyev, Topologicalgrammars for data approximation, Applied Mathe-matics Letters 20(4) (2007) 382–386.

14. A. Gorban, B. Kegl, D. Wunch and A. Zinovyev(eds.). Principal Manifolds for Data Visualiza-tion and Dimension Reduction. Lecture Notes inComputational Science and Engineering, Vol. 58(Berlin-Heidelberg, Springer, 2008).

15. A. Gorban, N. R. Sumner and A. Zinovyev,Beyond The Concept of Manifolds: Principal Trees,Metro Maps, and Elastic Cubic Complexes, inGorban A., Kegl B., Wunch D., Zinovyev A. (eds.)Principal Manifolds for Data Visualization andDimension Reduction, Lecture Notes in Computa-tional Science and Engineering, Vol. 58 (Springer,Berlin-Heidelberg, 2008), pp. 96–130.

16. A. Gorban and A. Zinovyev, Elastic maps andnets for approximating principal manifolds andtheir application to microarray data visualization,in Gorban A., Kegl B., Wunch D., Zinovyev A.(eds.) Principal Manifolds for Data Visualizationand Dimension Reduction, Lecture Notes in Compu-tational Science and Engineering, Vol. 58 (Springer,Berlin-Heidelberg, 2008) pp. 96–130.

17. A. N. Gorban and A. Y. Zinovyev, Principalgraphs and manifolds, in “Handbook of Research onMachine Learning Applications and Trends: Algo-rithms, Methods and Techniques” (eds. Olivas E. S.,Guererro J. D. M., Sober M. M., Benedito J. R.M., Lopes A. J. S.), (Information Science Reference,September 4, 2009), pp. 28–60.

18. A. N. Gorban, A. A. Pitenko, A. Y. Zinov’ev andD. C. Wunsch, Vizualization of any data using elasticmap method, Smart Engineering System Design 11(2001) 363–368.

19. A. N. Gorban, A. Yu. Zinovyev and D. C. Wunsch,Application of the method of elastic maps in analysisof genetic texts, in Proceedings of International JointConference on Neural Networks. (IJCNN Portland,Oregon, July 20–24, 2003).

20. M. Lowe, Algebraic approach to single–pushoutgraph transformation, Theor. Comp. Sci. 109 (1993)181–224.

21. M. Nagl, Formal languages of labelled graphs,Computing 16 (1976) 113–137.

22. A. Yu. Melville, M. V. Ilyin, Ye. Yu. Meleshkina,M. G. Mironyuk, Yu. A. Polunin and I. N. Timo-feyev, Political Atlas of modern time: Attempt ofCountries Classification, Polis 5 (2006) (in Russian).

23. http://hdr.undp.org/en/: Human DevelopmentIndex website.

24. Y. F. Leung and D. Cavalieri, Fundamentals ofcDNA microarray data analysis, Trends Genet19(11) (2003) 649–659.

25. A. Fornells, J. M. Martorell, E. Golobardes, J. M.Garrell and X. Vilasıs, Management of relationsbetween cases and patterns from SOM for help-ing experts in breast cancer diagnosis, InternationalJournal of Neural Systems 18(1) (2008) 33–43.

26. Y. Wang, J. G. Klijn and Y. Zhang et al., Geneexpression profiles to predict distant metastasis oflymph-node-negative primary breast cancer, Lancet365 (2005) 671–679.

27. L. Dyrskjot, T. Thykjaer and M. Kruhoffer et al.,Identifying distinct classes of bladder carcinomausing microarrays, Nat Genetics 33(1) (2003)90–96.

28. R. Shyamsundar, Y. H. Kim and J. P. Higgins et al.,A DNA microarray survey of gene expression in nor-mal human tissues. Genome Biology 6 (2005) R22.

29. A. N. Gorban and I. Karlin, Invariant manifoldsfor physical and chemical kinetics, Volume 660 ofLect. Notes Phys. (Berlin–Heidelberg–New York,Springer, 2005).

30. O. Radulescu, A. Gorban, A. Zinovyev andA. Lilienbaum, Robust simplifications of multiscalebiochemical networks, BMC Systems Biology 2(1)(2008) 86.

31. O. Radulescu, A. Zinovyev and A. Lilienbaum,Model reduction and model comparison for NFkBsignalling, in Proceedings of Foundations of SystemsBiology in Engineering (September 2007, Stuttgart,Germany).

32. A. Gorban, I. Karlin and A. Zinovyev, Invariantgrids for reaction kinetics, Physica A, 333 (2004)106–154.

June 3, 2010 15:58 00238


33. K. Karhunen, Uber lineare Methoden in derWahrscheinlichkeitsrechnung, Ann. Acad. Sci. Fen-nicae, Series A1, Math. Phys. 37 (1946) 3–79.

34. M. M. Loeve, Probability Theory (Princeton, N.J.:VanNostrand, 1955).

35. K. Pearson, On lines and planes of closest fit tosystems of points in space, Philosophical Magazine,Series 6(2) (1901) 559–572.

36. T. M. Martinetz, S. G. Berkovich and K. J.Schulten, Neural-gas network for vector quantiza-tion and its application to time-series prediction,IEEE Transactions on Neural Networks 4 (1993)558–569.

Documents

Principal manifolds and graphs in practice: from … examples are taken from comparative political science, ... mate a ﬁnite set D in Rm for relatively large m by ... Principal Manifolds