6
Singular value decomposition for genome-wide expression data processing and modeling Orly Alter* , Patrick O. Brown , and David Botstein* Departments of *Genetics and Biochemistry, Stanford University, Stanford, CA 94305 Contributed by David Botstein, June 15, 2000 We describe the use of singular value decomposition in transform- ing genome-wide expression data from genes 3 arrays space to reduced diagonalized ‘‘eigengenes’’ 3 ‘‘eigenarrays’’ space, where the eigengenes (or eigenarrays) are unique orthonormal superpo- sitions of the genes (or arrays). Normalizing the data by filtering out the eigengenes (and eigenarrays) that are inferred to represent noise or experimental artifacts enables meaningful comparison of the expression of different genes across different arrays in differ- ent experiments. Sorting the data according to the eigengenes and eigenarrays gives a global picture of the dynamics of gene expres- sion, in which individual genes and arrays appear to be classified into groups of similar regulation and function, or similar cellular state and biological phenotype, respectively. After normalization and sorting, the significant eigengenes and eigenarrays can be associated with observed genome-wide effects of regulators, or with measured samples, in which these regulators are overactive or underactive, respectively. D NA microarray technology (1, 2) and genome sequencing have advanced to the point that it is now possible to monitor gene expression levels on a genomic scale (3). These new data promise to enhance fundamental understanding of life on the molecular level, from regulation of gene expression and gene function to cellular mechanisms, and may prove useful in medical diagnosis, treatment, and drug design. Analysis of these new data requires mathematical tools that are adaptable to the large quantities of data, while reducing the complexity of the data to make them comprehensible. Analysis so far has been limited to identification of genes and arrays with similar expression pat- terns by using clustering methods (4–9). We describe the use of singular value decomposition (SVD) (10) in analyzing genome-wide expression data. SVD is also known as Karhunen–Loe `ve expansion in pattern recognition (11) and as principal-component analysis in statistics (12). SVD is a linear transformation of the expression data from the genes 3 arrays space to the reduced ‘‘eigengenes’’ 3 ‘‘eigen- arrays’’ space. In this space the data are diagonalized, such that each eigengene is expressed only in the corresponding eigen- array, with the corresponding ‘‘eigenexpression’’ level indicating their relative significance. The eigengenes and eigenarrays are unique, and therefore also data-driven, orthonormal superpo- sitions of the genes and arrays, respectively. We show that several significant eigengenes and the corre- sponding eigenarrays capture most of the expression informa- tion. Normalizing the data by filtering out the eigengenes (and the corresponding eigenarrays) that are inferred to represent noise or experimental artifacts enables meaningful comparison of the expression of different genes across different arrays in different experiments. Such normalization may improve any further analysis of the expression data. Sorting the data accord- ing to the correlations of the genes (and arrays) with eigengenes (and eigenarrays) gives a global picture of the dynamics of gene expression, in which individual genes and arrays appear to be classified into groups of similar regulation and function, or similar cellular state and biological phenotype, respectively. These groups of genes (or arrays) are not defined by overall similarity in expression, but only by similarity in the expression of any chosen subset of eigengenes (or eigenarrays). Upon comparing two or more similar experiments, with a regulator being overactive or underactive in one but normally expressed in the others, the expression pattern of one of the significant eigengenes may be correlated with the expression patterns of this regulator and its targets. This eigengene, therefore, can be associated with the observed genome-wide effect of the regu- lator. The expression pattern of the corresponding eigenarray is correlated with the expression patterns observed in samples in which the regulator is overactive or underactive. This eigenarray, therefore, can be associated with these samples. We conclude that SVD provides a useful mathematical frame- work for processing and modeling genome-wide expression data, in which both the mathematical variables and operations may be assigned biological meaning. Mathematical Framework: Singular Value Decomposition The relative expression levels of N genes of a model organism, which may constitute almost the entire genome of this organism, in a single sample, are probed simultaneously by a single microarray. A series of M arrays, which are almost identical physically, probe the genome-wide expression levels in M dif- ferent samples—i.e., under M different experimental conditions. Let the matrix e ˆ, of size N-genes 3 M-arrays, tabulate the full expression data. Each element of e ˆ satisfies ^nu e ˆu m&[ e nm for all 1 # n # N and 1 # m # M, where e nm is the relative expression level of the nth gene in the mth sample as measured by the mth array. § The vector in the nth row of the matrix e ˆ, ^g n u [^nu e ˆ, lists the relative expression of the nth gene across the different samples which correspond to the different arrays. The vector in the mth column of the matrix e ˆ, u a m &[ e ˆu m&, lists the genome- wide relative expression measured by the mth array. SVD (10) is then linear transformation of the expression data from the N-genes 3 M-arrays space to the reduced L- ‘‘eigenarrays’’ 3 L-‘‘eigengenes’’ space, where L 5 min{M, N} (see Fig. 7 in supplemental material at www.pnas.org), e ˆ 5 u ˆ « ˆv ˆ T . [1] In this space the data are represented by the diagonal nonnega- tive matrix e ˆ, of size L-eigengenes 3 L-eigenarrays, which satisfies ^ku « ˆu l &[ « l d kl $ 0 for all 1 # k, l # L, such that the l th eigengene is expressed only in the corresponding l th eigenarray, with the corresponding ‘‘eigenexpression’’ level « l . Therefore, the expression of each eigengene (or eigenarray) is decoupled Abbreviation: SVD, singular value decomposition. To whom reprint requests should be addressed. E-mail: [email protected]. § In this report, m ˆ denotes a matrix, uv& denotes a column vector, and ^uu denotes a row vector, such that m ˆ uv&, ^uum ˆ , and ^uuv& all denote inner products and uv&^uu denotes an outer product. The publication costs of this article were defrayed in part by page charge payment. This article must therefore be hereby marked “advertisement” in accordance with 18 U.S.C. §1734 solely to indicate this fact. PNAS u August 29, 2000 u vol. 97 u no. 18 u 10101–10106 GENETICS

Singular value decomposition for genome-wide expression data processing …genome- · 2009-08-13 · Singular value decomposition for genome-wide expression data processing and modeling

  • Upload
    others

  • View
    27

  • Download
    0

Embed Size (px)

Citation preview

Page 1: Singular value decomposition for genome-wide expression data processing …genome- · 2009-08-13 · Singular value decomposition for genome-wide expression data processing and modeling

Singular value decomposition for genome-wideexpression data processing and modelingOrly Alter*†, Patrick O. Brown‡, and David Botstein*

Departments of *Genetics and ‡Biochemistry, Stanford University, Stanford, CA 94305

Contributed by David Botstein, June 15, 2000

We describe the use of singular value decomposition in transform-ing genome-wide expression data from genes 3 arrays space toreduced diagonalized ‘‘eigengenes’’ 3 ‘‘eigenarrays’’ space, wherethe eigengenes (or eigenarrays) are unique orthonormal superpo-sitions of the genes (or arrays). Normalizing the data by filteringout the eigengenes (and eigenarrays) that are inferred to representnoise or experimental artifacts enables meaningful comparison ofthe expression of different genes across different arrays in differ-ent experiments. Sorting the data according to the eigengenes andeigenarrays gives a global picture of the dynamics of gene expres-sion, in which individual genes and arrays appear to be classifiedinto groups of similar regulation and function, or similar cellularstate and biological phenotype, respectively. After normalizationand sorting, the significant eigengenes and eigenarrays can beassociated with observed genome-wide effects of regulators, orwith measured samples, in which these regulators are overactiveor underactive, respectively.

DNA microarray technology (1, 2) and genome sequencinghave advanced to the point that it is now possible to monitor

gene expression levels on a genomic scale (3). These new datapromise to enhance fundamental understanding of life on themolecular level, from regulation of gene expression and genefunction to cellular mechanisms, and may prove useful in medicaldiagnosis, treatment, and drug design. Analysis of these new datarequires mathematical tools that are adaptable to the largequantities of data, while reducing the complexity of the data tomake them comprehensible. Analysis so far has been limited toidentification of genes and arrays with similar expression pat-terns by using clustering methods (4–9).

We describe the use of singular value decomposition (SVD)(10) in analyzing genome-wide expression data. SVD is alsoknown as Karhunen–Loeve expansion in pattern recognition(11) and as principal-component analysis in statistics (12). SVDis a linear transformation of the expression data from thegenes 3 arrays space to the reduced ‘‘eigengenes’’ 3 ‘‘eigen-arrays’’ space. In this space the data are diagonalized, such thateach eigengene is expressed only in the corresponding eigen-array, with the corresponding ‘‘eigenexpression’’ level indicatingtheir relative significance. The eigengenes and eigenarrays areunique, and therefore also data-driven, orthonormal superpo-sitions of the genes and arrays, respectively.

We show that several significant eigengenes and the corre-sponding eigenarrays capture most of the expression informa-tion. Normalizing the data by filtering out the eigengenes (andthe corresponding eigenarrays) that are inferred to representnoise or experimental artifacts enables meaningful comparisonof the expression of different genes across different arrays indifferent experiments. Such normalization may improve anyfurther analysis of the expression data. Sorting the data accord-ing to the correlations of the genes (and arrays) with eigengenes(and eigenarrays) gives a global picture of the dynamics of geneexpression, in which individual genes and arrays appear to beclassified into groups of similar regulation and function, orsimilar cellular state and biological phenotype, respectively.These groups of genes (or arrays) are not defined by overallsimilarity in expression, but only by similarity in the expression

of any chosen subset of eigengenes (or eigenarrays). Uponcomparing two or more similar experiments, with a regulatorbeing overactive or underactive in one but normally expressed inthe others, the expression pattern of one of the significanteigengenes may be correlated with the expression patterns of thisregulator and its targets. This eigengene, therefore, can beassociated with the observed genome-wide effect of the regu-lator. The expression pattern of the corresponding eigenarray iscorrelated with the expression patterns observed in samples inwhich the regulator is overactive or underactive. This eigenarray,therefore, can be associated with these samples.

We conclude that SVD provides a useful mathematical frame-work for processing and modeling genome-wide expression data,in which both the mathematical variables and operations may beassigned biological meaning.

Mathematical Framework: Singular Value DecompositionThe relative expression levels of N genes of a model organism,which may constitute almost the entire genome of this organism,in a single sample, are probed simultaneously by a singlemicroarray. A series of M arrays, which are almost identicalphysically, probe the genome-wide expression levels in M dif-ferent samples—i.e., under M different experimental conditions.Let the matrix e, of size N-genes 3 M-arrays, tabulate the fullexpression data. Each element of e satisfies ^nueum& [ enm for all1 # n # N and 1 # m # M, where enm is the relative expressionlevel of the nth gene in the mth sample as measured by the mtharray.§ The vector in the nth row of the matrix e, ^gnu [ ^nue, liststhe relative expression of the nth gene across the differentsamples which correspond to the different arrays. The vector inthe mth column of the matrix e, uam& [ eum&, lists the genome-wide relative expression measured by the mth array.

SVD (10) is then linear transformation of the expression datafrom the N-genes 3 M-arrays space to the reduced L-‘‘eigenarrays’’ 3 L-‘‘eigengenes’’ space, where L 5 min{M, N}(see Fig. 7 in supplemental material at www.pnas.org),

e 5 u« vT. [1]

In this space the data are represented by the diagonal nonnega-tive matrix e, of size L-eigengenes 3 L-eigenarrays, whichsatisfies ^ku«ul& [ «ldkl $ 0 for all 1 # k,l # L, such that the ltheigengene is expressed only in the corresponding lth eigenarray,with the corresponding ‘‘eigenexpression’’ level «l. Therefore,the expression of each eigengene (or eigenarray) is decoupled

Abbreviation: SVD, singular value decomposition.

†To whom reprint requests should be addressed. E-mail: [email protected].

§In this report, m denotes a matrix, uv& denotes a column vector, and ^uu denotes a rowvector, such that muv&, ^uum, and ^uuv& all denote inner products and uv&^uu denotes an outerproduct.

The publication costs of this article were defrayed in part by page charge payment. Thisarticle must therefore be hereby marked “advertisement” in accordance with 18 U.S.C.§1734 solely to indicate this fact.

PNAS u August 29, 2000 u vol. 97 u no. 18 u 10101–10106

GEN

ETIC

S

Page 2: Singular value decomposition for genome-wide expression data processing …genome- · 2009-08-13 · Singular value decomposition for genome-wide expression data processing and modeling

from that of all other eigengenes (or eigenarrays). The “fractionof eigenexpression,’’

pl 5 «l2y O

k 5 1

L

«k2, [2]

indicates the relative significance of the lth eigengene andeigenarray in terms of the fraction of the overall expression thatthey capture. Assume also that the eigenexpression levels arearranged in decreasing order of significance, such that «1 $ «2 $. . . $ «L $ 0. “Shannon entropy” of a dataset,

0 # d 521

log~L!O

k 5 1

L

pklog~pk! # 1, [3]

measures the complexity of the data from the distribution of theoverall expression between the different eigengenes (and eigen-arrays), where d 5 0 corresponds to an ordered and redundantdataset in which all expression is captured by a single eigengene(and eigenarray), and d 5 1 corresponds to a disordered andrandom dataset where all eigengenes (and eigenarrays) areequally expressed.

The transformation matrices u and vT define the N-genes 3L-eigenarrays and the L-eigengenes 3 M-arrays basis sets, respec-tively. The vector in the lth row of the matrix vT, ^glu [ ^luvT, lists theexpression of the lth eigengene across the different arrays. Thevector in the lth column of the matrix u, ual& [ uul&, lists thegenome-wide expression in the lth eigenarray. The eigengenes andeigenarrays are orthonormal superpositions of the genes and arrays,such that the transformation matrices u and v are both orthogonal

uTu 5 vTv 5 I , [4]

where I is the identity matrix. Therefore, the expression of eacheigengene (or eigenarray) is not only decoupled but also decorre-lated from that of all other eigengenes (or eigenarrays). Theeigengenes and eigenarrays are unique, except in degenerate sub-spaces, defined by subsets of equal eigenexpression levels, andexcept for a phase factor of 61, such that each eigengene (oreigenarray) captures both parallel and antiparallel gene (or array)expression patterns. Therefore, SVD is data-driven, except indegenerate subspaces.

SVD Calculation. According to Eqs. 1 and 4, the M-arrays 3M-arrays symmetric correlation matrix a 5 eTe 5 v«2vT isrepresented in the L-eigengenes 3 L-eigengenes space by thediagonal matrix «2. The N-genes 3 N-genes correlation matrixg 5 eeT 5 u«2uT is represented in the L-eigenarrays 3 L-eigenarrays space also by «2, where for L 5 min{M, N} 5 M, ghas a null subspace of at least N 2 M null eigenvalues. We,therefore, calculate the SVD of a dataset e, with M ,, N, bydiagonalizing a, and then projecting the resulting v and « ontoe to obtain u 5 ev«21.

Pattern Inference. The decorrelation of the eigengenes (andeigenarrays) suggests the possibility that some of the eigengenes(and the corresponding eigenarrays) represent independentregulatory programs or processes (and corresponding cellularstates). We infer that an eigengene ugl& represents a regulatoryprogram or process from its expression pattern across all arrays,when this pattern is biologically interpretable. This inferencemay be supported by a corresponding coherent biological themereflected in the functions of the genes, whose expression patternscorrelate or anticorrelate with the pattern of this eigengene.With this we assume that the corresponding eigenarray ual&(which lists the amplitude of this eigengene pattern in the

expression of each gene ugn& relative to all other genes ^nual& 5^gnugl&y«l) represents the cellular state which corresponds to thisprocess. We infer that the eigenarray ual& represents a cellularstate from the arrays whose expression patterns correlate oranticorrelate with the pattern of this eigenarray. Upon sorting ofthe genes, this inference may be supported by the expressionpattern of this eigenarray across all genes, when this pattern isbiologically interpretable.

Data Normalization. The decoupling of the eigengenes and eigen-arrays allows filtering the data without eliminating genes orarrays from the dataset. We filter any of the eigengenes ugl& (andthe corresponding eigenarray ual&) e 3 e 2 «lual& ^glu, bysubstituting zero for the eigenexpression level «l 5 0 in thediagonal matrix « and reconstructing the data according to Eq.1. We normalize the data by filtering out those eigengenes (andeigenarrays) that are inferred to represent noise or experimentalartifacts.

Degenerate Subspace Rotation. The uniqueness of the eigengenesand eigenarrays does not hold in a degenerate subspace, definedby equal eigenexpression levels. We approximate significantsimilar eigenexpression levels «l ' «l11 ' . . . ' «m with «l 5 . . .5 «m 5 =¥k51

m «k2y(m 2 l 1 1). Therefore, Eqs. 1–4 remain

valid upon rotation of the corresponding eigengenes {(ugl&, . . . ,ugm&) 3 R(ugl&, . . . , ugm&)}, and eigenarrays {(ual&, . . . , uam&)3 R(ual&, . . . , uam&)}, for all orthogonal R, RTR 5 I. We choosea unique rotation R by subjecting the rotated eigengenes to m 2l constraints, such that these constrained eigengenes may beadvantageous in interpreting and presenting the expression data.

Data Sorting. Inferring that eigengenes (and eigen-arrays) represent independent processes (and cellular states)allows sorting the data by similarity in the expression of anychosen subset of these eigengenes (and eigenarrays), rather thanby overall similarity in expression. Given two eigengenes ugk&and ugl& (or eigenarrays uak& and ual&), we plot the correlationof ugk& with each gene ugn&, ^gkugn&y^gnugn& (or uak& with eacharray uam&) along the y-axis, vs. that of ugl& (or ual&) along thex-axis. In this plot, the distance of each gene (or array)from the origin is its amplitude of expression in the sub-space spanned by ugk& and ugl& (or uak& and ual&), relative toits overall expression rn [ ^gnugn&21 =u^gkugn&u2 1 u^glugn&u2 (orrm [ ^amuam&21 =u^akuam&u2 1 u^aluam&u2). The angular distanceof each gene (or array) from the x-axis is its phase in thetransition from the expression pattern ugl& to ugk& and back to ugl&(or ual& to uak& and back to ual&) tan fn [ ^gkugn&y^glugn&, (or tanfm [ ^akuan&y^aluam&). We sort the genes (or arrays) accordingto fn (or fm).

Biological Data Analysis: Elutriation-Synchronized Cell CycleSpellman et al. (3) monitored genome-wide mRNA levels, for6,108 ORFs of the budding yeast Saccharomyces cerevisiaesimultaneously, over approximately one cell cycle period,T ' 390 min, in a yeast culture synchronized by elutriation,relative to a reference mRNA from an asynchronous yeastculture, at 30-min intervals. The elutriation dataset we analyze(see supplemental data and Mathematica notebook at www.pnas.org and at http:yygenome-www.stanford.eduySVDy) tab-ulates the measured ratios of gene expression levels for the N 55,981 genes, 784 of which were classified by Spellman et al. as cellcycle regulated, with no missing data in the M 5 14 arrays.

Pattern Inference. Consider the 14 eigengenes of the elutriationdataset. The first and most significant eigengene ug1&, whichdescribes time invariant relative expression during the cell cycle(Fig. 8a at www.pnas.org), captures more than 90% of the overallrelative expression in this experiment (Fig. 8b). The entropy of

10102 u www.pnas.org Alter et al.

Page 3: Singular value decomposition for genome-wide expression data processing …genome- · 2009-08-13 · Singular value decomposition for genome-wide expression data processing and modeling

the dataset, therefore, is low d 5 0.14 ,, 1. This suggests thatthe underlying processes are manifested by weak perturbationsof a steady state of expression. This also suggests that time-invariant additive constants due to uncontrolled experimentalvariables may be superimposed on the data. We infer that ug1&represents experimental additive constants superimposed on asteady gene expression state, and assume that ua1& represents thecorresponding steady cellular state. The second, third, andfourth eigengenes, which show oscillations during the cell cycle(Fig. 8c), capture about 3%, 1%, and 0.5% of the overall relativeexpression, respectively. The time variation of ug3& fits a normal-ized sine function of period T, =2yT sin(2ptyT). We infer thatug3& represents expression oscillation, which is consistent withgene expression oscillations during a cell cycle. The time vari-ations of the second and fourth eigengenes fit a cosine functionof period T with =1y2 the amplitude of a normalized cosine withthis period, =1yT cos 2ptyT. However, while ug2& shows de-creasing expression on transition from t 5 0 to 30 min, ug4& showsincreasing expression. We infer that ug2& and ug4& represent initialtransient increase and decrease in expression in response to theelutriation, respectively, superimposed on expression oscillationduring the cell cycle.

Data Normalization. We filter out the first eigengene and eigen-array of the elutriation dataset, e 3 eC 5 e 2 «1ua1& ^g1u,removing the steady state of expression. Each of the elements ofthe dataset eC, ^nueCum& [ eC,nm, is the difference of the

measured expression of the nth gene in the mth array from thesteady-state levels of expression for these gene and array ascalculated by SVD. Therefore, eC,nm

2 is the variance in themeasured expression of the nth gene in the mth array. Let eLVtabulate the natural logarithm of the variances in elutriationexpression, such that each element of eLV satisfies ^nueLVum& [log(eC,nm

2 ) for all 1 # n # N and 1 # m # M, and consider theeigengenes of eLV (Fig. 9a in supplemental material at www.pnas.org). The first eigengene ug1&LV, which captures more than80% of the overall information in this dataset (Fig. 9b), describesa weak initial transient increase superimposed on a time-invariant scale of expression variance. The initial transientincrease in the scale of expression variance may be a response tothe elutriation. The time-invariant scale of expression variancesuggests that a steady scale of experimental as well as biologicaluncertainty is associated with the expression data. This alsosuggests that time-invariant multiplicative constants due touncontrolled experimental variables may be superimposed onthe data. We filter out ug1&LV, removing the steady scale ofexpression variance, eLV 3 eCLV 5 eLV 2 «1,LVua1&LV LV^g1u.

The normalized elutriation dataset eN, where each of its elementssatisfies ^nueNum& [ sign(eC,nm)=exp(eCLV,nm), tabulates for eachgene and array expression patterns that are approximately centeredat the steady-state expression level (i.e., of approximately zeroarithmetic means), with variances which are approximately nor-malized by the steady scale of expression variance (i.e., of approx-imately unit geometric means). The first and second eigengenes,

Fig. 1. Normalized elutriation eigengenes. (a) Raster display of vNT , the expression of 14 eigengenes in 14 arrays. (b) Bar chart of the fractions of eigenexpression,

showing that ug1&N and ug2&N capture about 20% of the overall normalized expression each, and a high entropy d 5 0.88. (c) Line-joined graphs of the expressionlevels of ug1&N (red) and ug2&N (blue) in the 14 arrays fit dashed graphs of normalized sine (red) and cosine (blue) of period T 5 390 min and phase u 5 2py13,respectively.

Fig. 2. Normalized elutriation expression in the subspaceassociated with the cell cycle. (a) Array correlation with ua1&N

along the y-axis vs. that with ua2&N along the x-axis, color-coded according to the classification of the arrays into thefive cell cycle stages, MyG1 (yellow), G1 (green), S (blue), SyG2

(red), and G2yM (orange). The dashed unit and half-unitcircles outline 100% and 25% of overall normalized arrayexpression in the ua1&N and ua2&N subspace. (b) Correlation ofeach gene with ug1&N vs. that with ug2&N, for 784 cell cycleregulated genes, color-coded according to the classificationby Spellman et al. (3).

Alter et al. PNAS u August 29, 2000 u vol. 97 u no. 18 u 10103

GEN

ETIC

S

Page 4: Singular value decomposition for genome-wide expression data processing …genome- · 2009-08-13 · Singular value decomposition for genome-wide expression data processing and modeling

ug1&N and ug2&N, of eN (Fig. 1a), which are of similar significance,capture together more than 40% of the overall normalized expres-sion (Fig. 1b). The time variations of ug1&N and ug2&N fit normalizedsine and cosine functions of period T and initial phase u ' 2py13,=2yT sin(2p tyT 2 u) and =2yT cos(2p tyT 2 u), respectively (Fig.1c). We infer that ug1&N and ug2&N represent cell cycle expressionoscillations, and assume that the corresponding eigenarrays ua1&N

and ua2&N represent the corresponding cell cycle cellular states.Upon sorting of the genes (and arrays) according to ug1&N and ug2&N

(and ua1&N and ua2&N), the initial phase u ' 2py13 can be interpretedas a delay of 30 min between the start of the experiment and thatof the cell cycle stage G1. The decay to zero in the time variationof ug2&N at t 5 360 and 390 min can be interpreted as dephasing intime of the initially synchronized yeast culture.

Data Sorting. Consider the normalized expression of the 14elutriation arrays {uam&} in the subspace spanned by ua1&N andua2&N, which is assumed to approximately represent all cell cyclecellular states (Fig. 2a). All arrays have at least 25% of theirnormalized expression in this subspace, with their distances fromthe origin satisfying 0.5 # rm , 1, except for the eleventh arrayua11&. This suggests that ua1&N and ua2&N are sufficient to approx-imate the elutriation array expression. The sorting of the arraysaccording to their phases {fm}, which describes the transitionfrom the expression pattern ua2&N to ua1&N and back to ua2&N,gives an array order which is similar to that of the cell cycle timepoints measured by the arrays, an order that describes theprogress of the cell cycle expression from the MyG1 stagethrough G1, S, SyG2, and G2yM and back to MyG1.

Because ua1&N is correlated with the arrays ua4&, ua5&, ua6&, and ua7&and is anticorrelated with ua13& and ua14&, we associate ua1&N with thecell cycle cellular state of transition from G1 to S, and 2ua1&N withthe transition from G2yM to MyG1. Similarly, ua2&N is correlatedwith ua2& and ua3&, and therefore we associate ua2&N with the

transition from MyG1 to G1. Also, ua2&N is anticorrelated with ua8&and ua10&, and therefore we associate 2ua2&N with the transition fromS to SyG2. With these associations the phase of ua1&, f1 5 2u '22py13, corresponds to the 30-min delay between the start of theexperiment and that of the cell cycle stage G1, which is also presentin the inferred cell cycle expression oscillations ug1&N and ug2&N.

Consider also the expression of the 5,981 genes {ugn&} in thesubspace spanned by ug1&N and ug2&N, which is inferred to approx-imately represent all cell cycle expression oscillations (Fig. 10 insupplemental material at www.pnas.org). One may expect thatgenes that have almost all of their normalized expression in thissubspace with rn ' 1 are cell cycle regulated, and that genes thathave almost no expression in this subspace with rn ' 0, are notregulated by the cell cycle at all. Indeed, of the 784 genes that wereclassified by Spellman et al. (3) as cell cycle regulated, 641 havemore than 25% of their normalized expression in this subspace (Fig.2b). We sort all 5,981 genes according to their phases {fn}, todescribe the transition from the expression pattern ug2&N to that ofug1&N and back to ug2&N, starting at f1 ' 22py13. One may expectthis to order the genes according to the stages in the cell cycle inwhich their expression patterns peak. However, for the 784 cell cycleregulated genes this sorting gives a classification of the genes intothe five cell cycle stages, which is somewhat different than theclassification by Spellman et al. This may be due to the poor qualityof the elutriation expression data, as synchronization by elutriationwas not very effective in this experiment. For the a factor-synchronized cell cycle expression there is much better agreementbetween the two classifications (Fig. 5b).

With all 5,981 genes sorted, the gene variations of ua1&N andua2&N fit normalized sine and cosine functions of period Z [ N 21 5 5,980 and initial phase u ' 2py13, 2=2yZ sin(2pzyZ 2u) and =2yZ cos(2pzyZ 2 u), respectively, where z [ n 2 1(Fig. 3 b and c). The sorted and normalized elutriation expres-sion fit approximately a traveling wave of expression, varying

Fig. 3. Genes sorted by relative correlation with ug1&N and ug2&N of normalized elutriation. (a) Normalized elutriation expression of the sorted 5,981 genes inthe 14 arrays, showing traveling wave of expression. (b) Eigenarrays expression; the expression of ua1&N and ua2&N, the eigenarrays corresponding to ug1&N and ug2&N,displays the sorting. (c) Expression levels of ua1&N (red) and ua2&N (green) fit normalized sine and cosine functions of period Z ; N 2 1 5 5,980 and phaseu ' 2py13 (blue), respectively.

10104 u www.pnas.org Alter et al.

Page 5: Singular value decomposition for genome-wide expression data processing …genome- · 2009-08-13 · Singular value decomposition for genome-wide expression data processing and modeling

sinusoidally across both genes and arrays, such that the expres-sion of the nth gene in the mth array satisfies ^nueNum& } 22cos[2p(tyT 2 zyZ)]y=ZT (Fig. 3a).

Biological Data Analysis: a Factor-Synchronized CellCycle and CLB2 and CLN3 OveractivationsSpellman et al. (3) also monitored genome-wide mRNA levels,for 6,108 yeast ORFs simultaneously, over approximately twocell cycle periods, in a yeast culture synchronized by a factor,relative to a reference mRNA from an asynchronous yeastculture, at 7-min intervals for 119 min. They also measured, intwo independent experiments, mRNA levels of yeast straincultures with overactivated CLB2, which encodes a G2yM cyclin,both at t 5 40 min relative to their levels at the start ofoveractivation at t 5 0. Two additional independent experimentsmeasured mRNA levels of strain cultures with overactivatedCLN3, which encodes a G1yS cyclin, at t 5 30 and 40 min relativeto their levels at the start of overactivation at t 5 0. The datasetfor the a factor, CLB2, and CLN3 experiments we analyze (seesupplemental data and Mathematica notebook at www.pnas.org)tabulates the ratios of gene expression levels for the N 5 4,579genes, 638 of which were classified by Spellman et al. as cell cycleregulated, with no missing data in the M 5 22 arrays.

After data normalization and degenerate subspace rotation(see Appendix in supplemental material at www.pnas.org), the

time variations of ug1&RN and ug2&RN fit normalized sine andcosine functions of two 66-min periods during the cell cycle, fromt 5 7 to 119 min, and initial phase u ' py4, respectively (Fig. 4c).While ug2&RN describes steady-state expression in the CLB2- andCLN3-overactive arrays, ug1&RN describes underexpression in theCLB2-overactive arrays and overexpression in the CLN3-overactive arrays.

Upon sorting the 4,579 genes in the subspace spanned byug1&RN and ug2&RN (Fig. 5b), ug1&RN is correlated with genes thatpeak late in the cell cycle stage G1 and early in S, among themCLN3, and we associate ug1&RN with the cell cycle expressionoscillations that start at the transition from G1 to S and aredependent on CLN3, which encodes a G1yS cyclin. Also, ug1&RNis anticorrelated with genes that peak late in G2yM and early inMyG1, among them CLB2, and therefore we associate 2ug1&RNwith the oscillations that start at the transition from G2yM toMyG1 and are dependent on CLB2, which encodes a G2yMcyclin. Similarly, ug2&RN is correlated with genes that peak late inMyG1 and early in G1, anticorrelated with genes that peak latein S and early in SyG2, and uncorrelated with CLB2 and CLN3.We, therefore, associate ug2&RN with the oscillations that start atthe transition from MyG1 to G1 (and appear to be CLB2- andCLN3-independent), and 2ug2&RN with the oscillations that startat the transition from S to SyG2 (and appear to be CLB2- andCLN3-independent).

Fig. 4. Rotated normalized a factor, CLB2, and CLN3 eigengenes. (a) Raster display of vRNT , where ug1&RN 5 R2R1ug1&N, ug2&RN 5 R1ug2&N, and ug3&RN 5 R2ug3&N. (b)

ug1&RN, ug2&RN and ug3&RN capture 20% of the overall normalized expression each. (c) Expression levels of ug1&RN (red) and ug2&RN (blue) fit dashed graphs of normalizedsine (red) and cosine (blue) of period Ty2 5 66 min and phase py4, respectively, and ug3&RN (green) fits dashed graph of normalized sine of period T 5 112 minand phase 2py8, from t 5 7 to t 5 119 min during the cell cycle.

Fig. 5. Rotated normalized a factor, CLB2, and CLN3 ex-pression in the subspace associated with the cell cycle. (a)Array correlation with ua1&RN along the y-axis vs. that withua2&RN along the x-axis, color-coded according to the classi-fication of the arrays into the five cell cycle stages, MyG1

(yellow), G1 (green), S (blue), SyG2 (red), and G2yM (orange).The dashed unit and half-unit circles outline 100% and 25%of overall normalized array expression in the ua1&RN andua2&RN subspace. (b) Correlation of each gene with ug1&RN vs.that with ug2&RN, for 638 cell cycle regulated genes, color-coded according to the classification by Spellman et al. (3).

Alter et al. PNAS u August 29, 2000 u vol. 97 u no. 18 u 10105

GEN

ETIC

S

Page 6: Singular value decomposition for genome-wide expression data processing …genome- · 2009-08-13 · Singular value decomposition for genome-wide expression data processing and modeling

Upon sorting the 22 arrays in the subspace spanned by ua1&RN andua2&RN (Fig. 5a), ua1&RN is correlated with the arrays ua13& and ua14&,as well as with ua21& and ua22&, which measure the CLN3-overactivesamples. We therefore associate ua1&RN with the cell cycle cellularstate of transition from G1 to S, which is simulated by CLN3overactivation. Also, ua1&RN is anticorrelated with the arrays ua9& andua10&, as well as with ua19& and ua20&, which measure the CLB2-overactive samples. We associate 2ua1&RN with the cellular transi-tion from G2yM to MyG1, which is simulated by CLB2 overacti-vation. Similarly, ua2&RN appears to be correlated with ua2&, ua3&, ua11&,and ua12&, anticorrelated with ua6&, ua7&, ua16&, and ua17&, and uncor-related with ua19&, ua20&, ua21&, or ua22&. We therefore associate ua2&RNwith the cellular transition from MyG1 to G1 (which appears to beCLB2- and CLN3-independent), and 2ua2&RN with the cellulartransition from S to SyG2 (which also appears to be CLB2- andCLN3-independent).

With all 4,579 genes sorted the gene variations of ua1&RN andua2&RN fit normalized sine and cosine functions of period Z [ N 21 5 4,578 and initial phase py8, respectively (Fig. 6 b and c). Thenormalized and sorted cell cycle expression approximately fits atraveling wave, varying sinusoidally across both genes and arrays.

The normalized and sorted expression in the CLB2- and CLN3-overactive arrays approximately fits standing waves, constant acrossthe arrays and varying sinusoidally across genes only, which appearsimilar to 2ua1&RN and ua1&RN, respectively (Fig. 6a).

ConclusionsWe have shown that SVD provides a useful mathematicalframework for processing and modeling genome-wide expres-sion data, in which both the mathematical variables and opera-tions may be assigned biological meaning.

We thank S. Kim for insightful discussions, G. Sherlock for technicalassistance and careful reading, and J. Doyle and P. Green for thoughtfulreviews of this manuscript. This work was supported by a grant from theNational Cancer Institute (National Institutes of Health, CA77097).O.A. is an Alfred P. Sloan and U.S. Department of Energy PostdoctoralFellow in Computational Molecular Biology, and a National HumanGenome Research Institute Individual Mentored Research ScientistDevelopment Awardee in Genomic Research and Analysis (NationalInstitutes of Health, 1 K01 HG00038-01). P.O.B. is an Associate Inves-tigator of the Howard Hughes Medical Institute.

1. Fodor, S. P., Rava, R. P., Huang, X. C., Pease, A. C., Holmes, C. P. & Adams,C. L. (1993) Nature (London) 364, 555–556.

2. Schena, M., Shalon, D., Davis, R. W. & Brown, P. O. (1995) Science 270,467–470.

3. Spellman, P. T., Sherlock, G., Zhang, M. Q., Iyer, V. R., Anders, K., Eisen,M. B., Brown, P. O., Botstein, D. & Futcher, B. (1998) Mol. Biol. Cell 9,3273–3297.

4. Roth, F. P., Hughes, J. D., Estep, P. W. & Church, G. M. (1998) Nat. Biotechnol.16, 939–945.

5. Eisen, M. B., Spellman, P. T., Brown, P. O. & Botstein, D. (1998) Proc. Natl.Acad. Sci. USA 95, 14863–14868.

6. Alon, U., Barkai, N., Notterman, D. A., Gish, K., Ybarra, S., Mack, D. &Levine, A. J. (1999) Proc. Natl. Acad. Sci. USA 96, 6745–6750.

7. Tamayo, P., Slonim, D., Mesirov, J., Zhu, Q., Kitareewan, S., Dmitrovsky, E.,Lander, E. S. & Golub, T. R. (1999) Proc. Natl. Acad. Sci. USA 96, 2907–2912.

8. Tavazoie, S., Hughes, J. D., Campbell, M. J., Cho, R. J. & Church, G. M. (1999)Nat. Genet. 22, 281–285.

9. Brown, M. P. S., Grundy, W. N., Lin, D., Cristianini, N., Sugnet, C. W., Furey,T. S., Ares, M., Jr., & Haussler, D. (2000) Proc. Natl. Acad. Sci. USA 97,262–267.

10. Golub, G. H. & Van Loan, C. F. (1996) Matrix Computation (Johns HopkinsUniv. Press, Baltimore), 3rd Ed.

11. Mallat, S. G. (1999) A Wavelet Tour of Signal Processing (Academic, San Diego),2nd Ed.

12. Anderson, T. W. (1984) Introduction to Multivariate Statistical Analysis (Wiley,New York), 2nd Ed.

Fig. 6. Genes sorted by relative correlation with ug1&RN and ug2&RN of rotated normalized a factor, CLB2, and CLN3. (a) Normalized expression of the sorted 4,579genes in the 22 arrays, showing traveling wave of expression from t 5 0 to 119 min during the cell cycle and standing waves of expression in the CLB2- andCLN3-overactive arrays. (b) Eigenarrays expression; the expression of ua1&RN and ua2&RN, the eigenarrays corresponding to ug1&RN and ug2&RN, displays the sorting. (c)Expression levels of ua1&RN (red) and ua2&RN (green) fit normalized sine and cosine functions of period Z ; N 2 1 5 4,578 and phase py8 (blue), respectively.

10106 u www.pnas.org Alter et al.