10
Unraveling complex temporal associations in cellular systems across multiple time-series microarray datasets Wenyuan Li, Min Xu, Xianghong Jasmine Zhou * Molecular and Computational Biology, Department of Biological Sciences, University of Southern California, California, CA 90089, USA article info Article history: Received 7 June 2009 Available online xxxx Keywords: Time-series microarray data Complex temporal association abstract Unraveling the temporal complexity of cellular systems is a challenging task, as the subtle coordination of molecular activities cannot be adequately captured by simple mathematical concepts such as correlation. This paper addresses the challenge with a data-mining approach. We introduce the novel concept of a ‘‘frequent temporal association pattern” (FTAP): a set of genes simultaneously exhibit complex temporal expression patterns recurrently across multiple microarray datasets. Such temporal signals are hard to identify in individual microarray datasets, but become significant by their frequent occurrences across multiple datasets. We designed an efficient two-stage algorithm to identify FTAPs. First, for each gene we identify expression trends that occur frequently across multiple datasets. Second, we look for a set of genes that simultaneously exhibit their respective trends recurrently in multiple datasets. We applied this algorithm to 18 yeast time-series microarray datasets. The majority of FTAPs identified by the algorithm are associated with specific biological functions. Moreover, a significant number of patterns include genes that are functionally related but do not exhibit co-expression; such gene groups cannot be captured by clustering algorithms. Our approach offers advantages: (1) it can identify complex asso- ciations of temporal trends in gene expression, an important step towards understanding the complex mechanisms governing cellular systems; (2) it is capable of integrating time-series data with different time scales and intervals; and (3) it yields results that are robust against outliers. Ó 2010 Elsevier Inc. All rights reserved. 1. Introduction The cell is a dynamic system, where diverse biological processes are coordinated in precise and complex ways to maintain temporal coherence [1]. Understanding the rules governing such temporal complexity is an extremely challenging task. A possible first step is to extract frequent association rules, e.g., a set of genes which frequently exhibit the same activities. Recent progress in micro- array has made it possible to simultaneously measure time-depen- dent expression profiles of all genes in an entire genome. Clustering algorithms can be used to identify groups of genes that exhibit similar temporal expression patterns [2–4]; time-warping approaches can be used to identify shifted gene expression corre- lations [5–7]. Neither approach, however, is up to the challenge of identifying more complex temporal expression associations (i.e., those involving more than simple co-expression or shifted co-expression). An example of complex temporal association during a cell cycle is displayed in Fig. 1. The expression profiles of two yeast genes, CLN2 and SIC1, are plotted across two consecutive cell cycles based on a microarray study [8]. It is known that CLN2 is the major acti- vator of SIC1 phosphorylation, an essential step for the cell cycle to progress through the G1 phase to the S phase [9]. There is no direct or shifted correlation between the two genes in either cycle. None- theless, their expression data contain distinct patterns that remain consistent across the two cell cycles (the solid points and lines in Fig. 1). In particular, the solid-line pattern of CLN2 and that of SIC1 show coordinated change repeatedly across the two cell cy- cles. Such periodic coupling suggests the existence of a non-trivial interaction between the two genes, a process beyond simple acti- vation or inhibition at the expression level. Naturally, such complex patterns are difficult to extract from the genome-wide expression profiles unless they occur frequently within a single dataset, or across multiple datasets. As time-series microarray datasets are generally quite short (more than 80% of all time-series expression datasets contain fewer than 9 time points) [10], we focus on identifying patterns that are repeated across mul- tiple datasets. The rapid accumulation of publicly available time- series microarray data has made the task of identifying frequent temporal associations possible. Integrating multiple microarray time-series datasets is a non- trivial task for several reasons. Firstly, the gene expression 1532-0464/$ - see front matter Ó 2010 Elsevier Inc. All rights reserved. doi:10.1016/j.jbi.2009.12.006 * Corresponding author. E-mail addresses: [email protected] (W. Li), [email protected] (M. Xu), [email protected] (X.J. Zhou). Journal of Biomedical Informatics xxx (2010) xxx–xxx Contents lists available at ScienceDirect Journal of Biomedical Informatics journal homepage: www.elsevier.com/locate/yjbin ARTICLE IN PRESS Please cite this article in press as: Li W et al. Unraveling complex temporal associations in cellular systems across multiple time-series microarray datasets. J Biomed Inform (2010), doi:10.1016/j.jbi.2009.12.006

Journal of Biomedical Informatics - UCLA Healthpathology.ucla.edu/workfiles/Research Services/li2010unraveling.pdf · Wenyuan Li, Min Xu, Xianghong Jasmine Zhou* Molecular and Computational

  • Upload
    others

  • View
    2

  • Download
    0

Embed Size (px)

Citation preview

Page 1: Journal of Biomedical Informatics - UCLA Healthpathology.ucla.edu/workfiles/Research Services/li2010unraveling.pdf · Wenyuan Li, Min Xu, Xianghong Jasmine Zhou* Molecular and Computational

Journal of Biomedical Informatics xxx (2010) xxx–xxx

ARTICLE IN PRESS

Contents lists available at ScienceDirect

Journal of Biomedical Informatics

journal homepage: www.elsevier .com/locate /y jb in

Unraveling complex temporal associations in cellular systems across multipletime-series microarray datasets

Wenyuan Li, Min Xu, Xianghong Jasmine Zhou *

Molecular and Computational Biology, Department of Biological Sciences, University of Southern California, California, CA 90089, USA

a r t i c l e i n f o a b s t r a c t

Article history:Received 7 June 2009Available online xxxx

Keywords:Time-series microarray dataComplex temporal association

1532-0464/$ - see front matter � 2010 Elsevier Inc. Adoi:10.1016/j.jbi.2009.12.006

* Corresponding author.E-mail addresses: [email protected] (W. Li), mxu@usc.

(X.J. Zhou).

Please cite this article in press as: Li W et al. UnrJ Biomed Inform (2010), doi:10.1016/j.jbi.2009.

Unraveling the temporal complexity of cellular systems is a challenging task, as the subtle coordination ofmolecular activities cannot be adequately captured by simple mathematical concepts such as correlation.This paper addresses the challenge with a data-mining approach. We introduce the novel concept of a‘‘frequent temporal association pattern” (FTAP): a set of genes simultaneously exhibit complex temporalexpression patterns recurrently across multiple microarray datasets. Such temporal signals are hard toidentify in individual microarray datasets, but become significant by their frequent occurrences acrossmultiple datasets. We designed an efficient two-stage algorithm to identify FTAPs. First, for each genewe identify expression trends that occur frequently across multiple datasets. Second, we look for a setof genes that simultaneously exhibit their respective trends recurrently in multiple datasets. We appliedthis algorithm to 18 yeast time-series microarray datasets. The majority of FTAPs identified by thealgorithm are associated with specific biological functions. Moreover, a significant number of patternsinclude genes that are functionally related but do not exhibit co-expression; such gene groups cannotbe captured by clustering algorithms. Our approach offers advantages: (1) it can identify complex asso-ciations of temporal trends in gene expression, an important step towards understanding the complexmechanisms governing cellular systems; (2) it is capable of integrating time-series data with differenttime scales and intervals; and (3) it yields results that are robust against outliers.

� 2010 Elsevier Inc. All rights reserved.

1. Introduction

The cell is a dynamic system, where diverse biological processesare coordinated in precise and complex ways to maintain temporalcoherence [1]. Understanding the rules governing such temporalcomplexity is an extremely challenging task. A possible first stepis to extract frequent association rules, e.g., a set of genes whichfrequently exhibit the same activities. Recent progress in micro-array has made it possible to simultaneously measure time-depen-dent expression profiles of all genes in an entire genome.Clustering algorithms can be used to identify groups of genes thatexhibit similar temporal expression patterns [2–4]; time-warpingapproaches can be used to identify shifted gene expression corre-lations [5–7]. Neither approach, however, is up to the challengeof identifying more complex temporal expression associations(i.e., those involving more than simple co-expression or shiftedco-expression).

An example of complex temporal association during a cell cycleis displayed in Fig. 1. The expression profiles of two yeast genes,

ll rights reserved.

edu (M. Xu), [email protected]

aveling complex temporal assoc12.006

CLN2 and SIC1, are plotted across two consecutive cell cycles basedon a microarray study [8]. It is known that CLN2 is the major acti-vator of SIC1 phosphorylation, an essential step for the cell cycle toprogress through the G1 phase to the S phase [9]. There is no director shifted correlation between the two genes in either cycle. None-theless, their expression data contain distinct patterns that remainconsistent across the two cell cycles (the solid points and lines inFig. 1). In particular, the solid-line pattern of CLN2 and that ofSIC1 show coordinated change repeatedly across the two cell cy-cles. Such periodic coupling suggests the existence of a non-trivialinteraction between the two genes, a process beyond simple acti-vation or inhibition at the expression level.

Naturally, such complex patterns are difficult to extract fromthe genome-wide expression profiles unless they occur frequentlywithin a single dataset, or across multiple datasets. As time-seriesmicroarray datasets are generally quite short (more than 80% of alltime-series expression datasets contain fewer than 9 time points)[10], we focus on identifying patterns that are repeated across mul-tiple datasets. The rapid accumulation of publicly available time-series microarray data has made the task of identifying frequenttemporal associations possible.

Integrating multiple microarray time-series datasets is a non-trivial task for several reasons. Firstly, the gene expression

iations in cellular systems across multiple time-series microarray datasets.

Page 2: Journal of Biomedical Informatics - UCLA Healthpathology.ucla.edu/workfiles/Research Services/li2010unraveling.pdf · Wenyuan Li, Min Xu, Xianghong Jasmine Zhou* Molecular and Computational

2 W. Li et al. / Journal of Biomedical Informatics xxx (2010) xxx–xxx

ARTICLE IN PRESS

values generated by different platforms are not directly compa-rable. Even using the same platform, alternative experimentalparameters may result in systematic variations between data-sets. These effects often lie beyond the capability of statisticalnormalization to remove [11–13]. Secondly, the time scale ofthe time-series datasets may vary (e.g., from minutes to hours).Thirdly, biological processes may unfold at different rates in re-sponse to experimental conditions, the organisms studied andeven among individuals [5]. In other words, we can expect sys-tematic variations in the magnitudes of gene expression valuesacross datasets, as well as in the time scale of the variationand the units employed.

We propose a novel approach to overcoming the above prob-lems, based on the concept of a ‘‘frequent temporal association pat-tern (FTAP)”. A typical example is illustrated in Fig. 2, which plotsthe expression profiles of three genes observed in three differentdatasets. In each dataset, the Pearson correlations between theexpression profiles are very low (<0.3) in each dataset. However,subtle associations emerge when we investigate the three datasetssimultaneously. Simply put, the rising trend in Gene 1 is always

Exp

ress

ion

Gene 3

Gene 2

Gene 1

Data set 1 Data

Time

Fig. 2. A frequent temporal association pattern (solid points and lines) in three genes andataset.

TimeExp

ress

ion

Cell Cycle 1

CLN2

SIC1

Fig. 1. Complex expression coupling of the yeast genes, CLN2 and SIC1, across two conseduring the two cell cycles. The horizontal axis represents time, and the vertical axis reprerepeated patterns.

Please cite this article in press as: Li W et al. Unraveling complex temporal assocJ Biomed Inform (2010), doi:10.1016/j.jbi.2009.12.006

associated with a ‘‘down-up” trend in Gene 2 and a ‘‘up-down”trend in Gene 3. An FTAP exhibits associations in two dimensions:(1) each gene in the pattern exhibits the same trend in multipledatasets (‘‘frequent association”); (2) within each dataset all trendsin the pattern occur on the same subset of time points (‘‘temporalassociation”). In Fig. 2, each gene exhibits the same trend in allthree datasets, and all three genes go through their respectivetrends in each dataset at the same four selected time points. Sucha complex association of three genes cannot be extracted from anysingle dataset.

To identify the FTAPs present in multiple time-series datasets,we designed a two-stage algorithm: (1) for each gene, we identifytrends that occur frequently in the available datasets; and (2) weidentify sets of genes that simultaneously exhibit their respectivetrends in multiple datasets. This approach has several importantadvantages: (a) it identifies complex and subtle temporal expres-sion associations that cannot be captured by co-expression analy-sis; (b) it can integrate multiple time series with different timescales and intervals; and (c) patterns can be reliably identified evenin the presence of outliers, since we allow the trends to skip over a

set 2 Data set 3

d three datasets. Each row corresponds to a gene, and each column corresponds to a

Cell Cycle 2

cutive cell cycles. The solid points and lines represent patterns that repeatedly occursents expression value. The vertical lines highlight the data points that comprise the

iations in cellular systems across multiple time-series microarray datasets.

Page 3: Journal of Biomedical Informatics - UCLA Healthpathology.ucla.edu/workfiles/Research Services/li2010unraveling.pdf · Wenyuan Li, Min Xu, Xianghong Jasmine Zhou* Molecular and Computational

W. Li et al. / Journal of Biomedical Informatics xxx (2010) xxx–xxx 3

ARTICLE IN PRESS

limited number of points in the series. Our algorithm is determin-istic in the sense that it produces an exhaustive list of all FTAPs inthe data. It is also scalable, since many of its operations can be per-formed in parallel.

We applied this algorithm to 18 yeast time-series microarraydatasets. The method successfully identified FTAPs that are associ-ated with specific biological functions. A significant number ofthese FTAPs include genes that are functionally related in a mannerthat goes beyond co-expression, patterns which could not havebeen captured by any clustering algorithm. With the rapid accu-mulation of microarray data in public repositories (a threefold in-crease per year) [14], we believe that our method can provide anatlas of complex temporal expression associations that will facili-tate more advanced mechanistic studies.

2. Related work

Pattern extraction from time-series data are an active researcharea in data mining. Recently developed methods include motifdiscovery [15], discord identification [16], longest commonsubsequence matching, sequence averaging, segmentation [17]and indexing [18]. However, almost all these algorithms analyzetime series from a single experiment rather than the variedexperiments.

Clustering and comparing gene expression time series haveattracted a lot of attentions recently. Aligning time series is aconventional method of evaluating the similarity between twotime series, which is known as curve alignment in the biologyliterature, curve registration in statistics, and time warping inengineering. Two recent examples are the clustered alignmentmethod proposed by Smith et al. [19] and Bayesian hierarchicalcurve registration y Telesca and Inoue [20]. These alignmentmethods, however, are also limited to the analysis of time seriesfrom the same experimental source (with the same time scaleand unit). In contrast, the FTAP method proposed in this paperis specifically designed for investigating time series from differ-ent experimental sources with different time scales and inter-vals. A number of clustering algorithms have been applied totime course gene expression data, as surveyed in the reviewby Androulakis et al. [21]. However, most gene expression timeseries contain very few time points, posing a significant chal-lenge to clustering analysis. By integrating multiple datasets,the FTAP method to some degree overcomes the difficulties asso-ciated with limited number of time points in single expressionprofiles. More importantly, FTAP analysis reveals subtle patterns(e.g., that illustrated in Fig. 2) which distinguishes from whatany existing time-series methods identify: curves or shapes oftime series are locally similar to each other, not within the sin-gle dataset, but across multiple datasets. Therefore, such groupsof genes may not be discovered by time-series clustering algo-rithms [10,22].

Time Point

Value

4

3

21

Rank Position

(a) r = [1, 3, 4, 2]

[ 1 3 2 ] [ 4 (b)

(c)T1

Trend[ 4 2 1 3 ]

Fig. 3. (a) Rank vector of a trend; (b) trend

Please cite this article in press as: Li W et al. Unraveling complex temporal assocJ Biomed Inform (2010), doi:10.1016/j.jbi.2009.12.006

In recent years, data streams have received considerable atten-tion (a data stream is a particular kind of time series). The topic ofmining multiple data streams is closely related to our work. Papa-dimitriou et al. [23] developed an algorithm called SPIRIT that dis-covers correlations and hidden variables in multiple data streamswith the same time stamps and representing the same types ofmeasurements. However, SPIRIT is not applicable to heterogeneoustime series.

Some of the concepts used in this paper will sound familiar tothose versed in the literature, but there are important differencesto keep in mind. Firstly, out of a ‘‘trend” is different from that foundin recent papers [24,25], where trends must consist of consecutivetime points. As Fig. 2 illustrates, we allow a trend to pass over oneor few time points. Therefore, several common techniques such assliding time windows are not applicable to our problem. We usethe concept of rank vectors which can be easily expanded and enu-merated, to abstract and represent trends. We note that the longestincreasing subsequence (LIS) problem [26], a classic of algorithmproblems, also allows inconsecutive time points. In fact, the LIS isa specific example of the trends that our algorithm can identify.Secondly, the concept of ‘‘association” differs from that appearingin the literature on association rule mining [27], where an associ-ation rule has the form [gene A", gene B;] ? (7 min)[gene C"].Our association definition describes a correlation of genes acrossmultiple datasets on a subset of time points. The genes involvedin a FTAP may exhibit more complicated patterns than the simpleup-or-down trends of association rules.

3. The frequent temporal association model

In this section, we formally define the concept of a FTAP. Wethen decompose the problem of finding FTAPs into two subprob-lems for efficient mining.

3.1. Trend abstraction and representation

Our first task is to represent a temporal trend using its rank vec-tor. The formal definition is given below.

Definition 1. Rank vector of a trendLet x � ½x1; x2; . . . ; xk� 2 R1�k bean expression vector consisting of k time points. The rank vector ofx is defined as rx � ½r1; r2; . . . ; rk�, where ri is the rank of xi if theelements of x were sorted in ascending order.

An example of how to derive the rank vector of a trend is shown

in Fig. 3(a). The rank vector summarizes the essential structure of atrend, bypassing systematic variations in expression values gener-ated by different microarray platforms. Using rank vectors, all pos-sible trends of length-ðkþ 1Þ can easily be constructed from thetrends of length-k. This is accomplished simply by inserting thenumber ðkþ 1Þ into any position of the length-k rank vector. Anexample of this process is shown in Fig. 3(b), and its formal defini-

1 3 2 ] [ 1 4 3 2 ] [ 1 3 4 2 ] [ 1 3 2 4 ]

T3 T5 T6 T6 T7 T9 T10Occurrences

extension; and (c) trend occurrences.

iations in cellular systems across multiple time-series microarray datasets.

Page 4: Journal of Biomedical Informatics - UCLA Healthpathology.ucla.edu/workfiles/Research Services/li2010unraveling.pdf · Wenyuan Li, Min Xu, Xianghong Jasmine Zhou* Molecular and Computational

4 W. Li et al. / Journal of Biomedical Informatics xxx (2010) xxx–xxx

ARTICLE IN PRESS

tion is given in Property 1. Iterating this process automatically gen-erates all possible trends without duplicates, as presented in Prop-erty 2 and illustrated in Fig. 4. This property will prove veryimportant to the efficiency of our method. Throughout the paper,trends are represented solely by their rank vectors r.

Property 1 (Trend extension or super-trend). A trendr � ½r1; r2; . . . ; rk� can be extended to the trend r0 with length ðkþ 1Þby placing the number ðkþ 1Þ before or after any element of r. That is,the possible vectors r0 that can be obtained by extending r are½kþ1;r1;r2; . . . ;rk�; ½r1;kþ1;r2; . . . ;rk�; ½r1;r2;kþ1; . . . ;rk�; � � � ; ½r1;r2; . . . ;

kþ1;rk�; ½r1;r2; . . . ;rk;kþ1�. The vectors r0 are called super-trends of r.This relationship is denoted r0 � r.

Property 2 (Completeness of trend extension). All trends withlength greater than k can be obtained by extending all trends oflength-k, according to the process described in Property 1. No dupli-cate trends are generated during this process.

3.2. Trend occurrences/substantiation

Before defining trend occurrence, we introduce the concept of atime-series sub-vector.

Definition 2 (Time-series sub-vector). Let x � ½x1; x2; . . . ; xn� be atime-series vector. A sub-vector of x is defined as y � ½xi1

; xi2 ; . . . ;

xik�, where 1 6 i1 < i2 < . . . < ik 6 n.

For the rest of the paper, we use the symbol ‘‘Ti” to represent theith time point associated with xi in a time-series vector x.

Definition 3 (Occurrence of a trend). Let x � ½x1; x2; . . . ; xn� be atime-series vector and r be a trend. If 9y # x such that ry is equal tor, we say that r occurs in x, or x supports r. For eachy � ½xi1 ; xi2

; . . . ; xik � satisfying this condition, t ¼ ½Ti1; Ti2; . . . ; Tik� iscalled the occurrence vector of this trend.

An example of trend occurrence is shown in Fig. 3(c). Note thatour definition allows for inconsecutive time points, since micro-array data are noisy and biological processes may unfold at differ-ent rates under different conditions. Also, a given trend may occurmultiple times within a long time series.

3.3. Frequent trends

In the following, we formalize the concept of frequent trend anddefine a related property.

Definition 4 (Frequent trend). Given a set of time-series vectors(which can be of different lengths) and a single trend r, thesupport of r, denoted suppðrÞ, is defined as the number of time-series vectors in which r occurs. Given a positive integer b (the‘‘support threshold”), if suppðrÞP b, we say that r is a frequenttrend.

Fig. 4. Illustration of how all possible length-k trends can be obtained by expandingthe trend of length-1. The trend ‘‘[3 2 1]” can only be expanded from the trend ‘‘[21]”, since the element ‘‘3” was added last, and ‘‘[2 1]” can only come from ‘‘[1]”.Thus, no duplicate trends are generated during this process.

Please cite this article in press as: Li W et al. Unraveling complex temporal assocJ Biomed Inform (2010), doi:10.1016/j.jbi.2009.12.006

An important property of frequent trends is that their supportfunction is anti-monotonic. This fact is commonly known as theApriori property for other types of frequent patterns [28,29].

Property 3 (Apriori property). Given a set of time-series vectors, asingle trend r, its support function suppðrÞ and the super-trends r0 of r,we find that 8r0 suppðr0Þ 6 suppðrÞ.

Intuitively, if a trend r does not qualify as frequent, none of itssuper-trends will qualify as frequent either. This property can beused to eliminate a great many candidate trends from the analysis,vastly increasing the efficiency of the frequent pattern miningalgorithm. Section 4.1 describes a search algorithm based on theApriori property.

3.4. Definition of a frequent temporal association pattern

A frequent temporal association pattern has three components:genes, datasets, and time points. Let G ¼ G1; . . . ;GjGj

� �be a subset

of genes, D ¼ D1; . . . ;DjDj� �

be a subset of datasets, and T ¼tðD1Þ; . . . ; tðDjDjÞ� �

be a set of time point sequences selected fromeach dataset in D. That is, tðDkÞ is a vector containing the indexesof time points selected from the Dkth dataset. In the tripletðG;D;TÞ, each combination of elements ðGi;Dj; tðDjÞÞ identifies atime-series sub-vector taken from the Gith gene in the Djth dataset.The components of the sub-vector are associated with the timepoints tðDjÞ. We denote the sub-vector itself as yðGi;Dj; tðDjÞÞ, andits rank vector as ryðGi ;Dj ;tðDjÞÞ. For example, the association patternshown in Fig. 2 can be expressed as a triplet ðG;D;TÞ, whereG ¼ fG1;G2;G3g; D ¼ fD1;D2;D3g; T ¼ tðD1Þ; tðD2Þ; tðD3Þf g. Theindividual vectors in T are tðD1Þ ¼ ½T5; T6; T7; T8�; tðD2Þ¼ ½T1; T2; T3; T5�; and tðD3Þ ¼ ½T6; T7; T8; T10�. In these terms, aFTAP satisfies the following definition.

Definition 5 (FTAP). Let ðG;D;TÞ be a triplet of genes, datasetsand time points, and let ryðGi ;Dj ;tðDjÞÞ be the rank vectors of the time-series sub-vectors defined by T : yðGi;Dj; tðDjÞÞ. Then ðG;D;TÞ is aFTAP if it satisfies

8Gi 2 G; ryðGi ;D1 ;tðD1ÞÞ ¼ � � � ¼ ryðGi ;DjDj ;tðDjDj ÞÞ

Simply put, within a given dataset the sub-vectors of all se-lected genes must include the same time points (temporal associ-ation), and for a given gene all the time-series sub-vectors mustyield the same rank vector (frequent association).

3.5. Constraints on FTAPs

Often we need to take into account constraints when miningfrequent trends. Let r � ½r1; . . . ; rk� be a trend, and lety � ½xi1 ; xi2 ; . . . ; xik � be a specific occurrence of the trend r in thetime series x.

Time interval constraint HI: although we allow trends to skipover time points, it is often preferable to have few skipped pointsor none at all. Given a trend y � ½xi1 ; xi2 ; . . . ; xik �, this constraint isformalized as follows: 8j; ðijþ1 � ijÞ 6 hI , where 1 6 j 6 ðk� 1Þand hI is a positive threshold.

Variance constraint HV : as we often prefer trends associatedwith significant expression changes, we may require y to have largevariance. The variance constraint is formalized as rðyÞ=y P hV ,where hV is a positive real number. rðyÞ and y are the standarddeviation and mean of y, respectively.

Definition 6 (Anti-monotonic constraint). Let y be a sub-vector ofx, i.e., y � x, and let H be a constraint on y. If y does not complywith H, then no vector z satisfying y � z # x will comply with H.Such constraints are called ‘‘anti-monotonic constraints”.

iations in cellular systems across multiple time-series microarray datasets.

Page 5: Journal of Biomedical Informatics - UCLA Healthpathology.ucla.edu/workfiles/Research Services/li2010unraveling.pdf · Wenyuan Li, Min Xu, Xianghong Jasmine Zhou* Molecular and Computational

Fig. 5. Decomposition of the FTAP discovery problem.

W. Li et al. / Journal of Biomedical Informatics xxx (2010) xxx–xxx 5

ARTICLE IN PRESS

All constraints are useful for reducing the search space. How-ever, only anti-monotonic constraints can be used to reject all pos-sible super-trends after a given candidate trend has been ruled out.HI is anti-monotonic constraint, but HV is not.

3.6. Problem formulation and decomposition

Based on the above definitions, we formulate the problem ofFTAP discovery as follows.

SubProbleTemporal asso

...

...

...... ...

...

...

G1

G2

Gn

D2D1 Dm...

...

G2

G7

G9

D5D2 D10

T1 T2 T3 T4 T3 T5 T6 T7 T1 T3 T4 T9

G11

G23

G39

D7D1 D17

T1 T2 T3 T5 T2 T6 T8 T9 T1 T4 T5 T7

...

(a) Input: m time-series data sets of n genes.Each row in the table represents a gene , and each column

corresponds to a data set .

For each gene, find frequent trends that occur in many data sets

SubProblem-IFrequent association

Align time pofrequent trends

different ge

(c) Frequent Temporal Association Patterns Output : FTAP =( Genes, Datasets, Tim

(b) Trend collectioentry in the i-th row athe j-th dataset. The r

Fig. 6. Overview of our two-stage algorithm for m

Please cite this article in press as: Li W et al. Unraveling complex temporal assocJ Biomed Inform (2010), doi:10.1016/j.jbi.2009.12.006

3.6.1. Problem statementGiven a, the minimum length of a trend; b, the minimum num-

ber of datasets in which a frequent trend must occur; and n, theminimum number of genes, our goal is to identify allFTAP � ðG;D;TÞ satisfying Definition 5.

3.6.2. Problem decompositionAs discussed previously, there are two associations in a FTAP:

‘‘frequent association” and ‘‘temporal association”. Thus it is natu-ral to begin by discovering frequent trends in individual genesacross the available datasets (denoted Subproblem-I). We will thencompare the frequent trends of different genes to look for align-ments along the time axis (denoted Subproblem-II).

There are m datasets D1;D2; . . . ;Dmf g, each with n genes and li

time points Di 2 Rn�li� �

. The rows of Di correspond to genes, andthe columns correspond to time points. The inputs and outputsof the two subproblems are summarized in Fig. 5.

Fig. 6 illustrates this search process. The complete collection ofdatasets, genes, and time series is illustrated in Fig. 6(a). Subprob-lem-I discovers which trends occur frequently among the datasets,for one gene at a time. We employed standard techniques of fre-quent sequential pattern mining to design the search algorithmFrequentTrends that exploits the Apriori property to rule outsuper-trends. Since the procedure is considers each gene indepen-dently, it can be easily implemented in a parallel computing frame-work. Its output is a table of all frequent trends and theiroccurrences in the dataset, as shown in Fig. 6(b). Subproblem-IIlooks for alignments between the frequent trends of differentgenes along the time axis. The final result is a set of frequent pat-terns with identical temporal associations. We model Subproblem-II as a frequent itemset mining problem, and adopted an existingalgorithm for this purpose. This problem can also be parallelizedby dividing the search space (i.e., restricting the collection totrends of length k see Section 4.2).

m-IIciation :

D 1 D 3 D 4 D 5D 2

T6 T7 T8 T9 T6 T7 T8 T9 T20 T22 T24 T25

T1 T2 T3 T5 T7 T8 T9 T10 T7 T8 T9 T10

T3 T4 T5 T6 T14 T16 T17 T18

T1 T2 T3 T4 T18 T19 T20 T22

T8 T9 T10 T12

T8 T10 T12 T14

T2 T3 T4 T5 T3 T5 T7 T9 T6 T8 T10 T11 T7 T8 T9 T10

...

...

...

...

..................

...

G1

Gn

Trend 1

Trend 2

Trend 3

Trend p-1

Trend p

......

ints of across nes

epoints )

n: Each row corresponds to a trend ; each column corresponds to a data set ; the nd j-th column of the table records the occurrence information of the i-th trend in ed cross indicates that the trend does not occur in the corresponding data set .

...

ining frequent temporal association patterns.

iations in cellular systems across multiple time-series microarray datasets.

Page 6: Journal of Biomedical Informatics - UCLA Healthpathology.ucla.edu/workfiles/Research Services/li2010unraveling.pdf · Wenyuan Li, Min Xu, Xianghong Jasmine Zhou* Molecular and Computational

6 W. Li et al. / Journal of Biomedical I

ARTICLE IN PRESS

4. The algorithm

4.1. Subproblem-I: discovering frequent trends for each gene

Given a set of time-series vectors corresponding to the samegene, identifying all the frequent trends is not a trivial task. Thereare k! candidate trends for a time series of length k. As presented inProperty 3 of Section 3.3, the support (frequency) of a trend neverincreases with length (i.e., Apriori property). We can exploit thisproperty to efficiently discover all frequent trends, starting withthose of length a and working our way up to longer trends. A for-mal description of this process is given in Algorithm 1.

Algorithm 1 consists of three steps. (1) Initialization: set k to aand enumerate all candidate trends of length k. (2) Frequency cal-culation: identify all frequent trends among the length-k candi-dates. (3) Extension: extend all the frequent length-k trends tolength kþ 1, and repeat steps (1)–(3). By approaching longertrends in this manner, a great many candidates are pruned in eachiteration thanks to the Apriori property. To further increase effi-ciency, anti-monotonic constraints are also evaluated during thesearch for frequent trends. If the constraints invalidate a trend,all of its extensions will also fail. According to our experience,imposing even one anti-monotonic constraint such as HI greatlyspeeds up Algorithm 1. All other constraints (such as HV ) arechecked at the end (Line 20) of Algorithm 1.

The routine EnumOccurrences (Line 6 of Algorithm 1) discov-ers every sub-vector y of x whose rank vector is equal to r, i.e.,ry ¼ r. We developed an efficient, depth-first search algorithm thatrecursively matches each element of r with every possible elementof y. The algorithm progresses through the elements of r one byone, until the entire rank vector has been checked. The elementsof r and y are always matched from highest to lowest.

Algorithm 1 FrequentTrends

Input: (1) P ¼ x1; . . . ; xmf g, m times-series vectors(2) minimum trend length a, minimum dataset support b(3) H1, the thresholds of anti-monotonic constraints such

as hI

(4) H0, the threshold of other constraints such as hV

/* All constraint thresholds are optional. */Output: a list of couples ðD;TTÞ identifying all the frequent trends

in P.1: k a, then generate all length-k trends and store in Rc; /* trends

with length less than a are trivial. */2: repeat3: Rq ;; /* set of Qualified length-k trends. */4: for all r in the set of candidate length-k trends Rc do5: for all x in P do6: EnumOccurrences(r,x,H1); /* Find all occurrences of r in x

complying with the anti-monotonic constraints defined in H1 . */7: end for8: if suppðrÞP b then9: Rq Rq [ r;10: if k P a then11: Record all its occurrence information to the couple

ðD;TTÞ; /* This trend qualifies as frequent. */12: end if13: end if14: end for15: if Rq–; then16: Rc ;, extend all qualified length-k trends in Rq to generate

a list of candidate length-ðkþ 1Þ trends, and store them in Rc;17: k kþ 1;18: end if19: until Rq is empty20: Remove all occurrences ðD;TTÞ which do not comply with any

constraints in H0.

Please cite this article in press as: Li W et al. Unraveling complex temporal assocJ Biomed Inform (2010), doi:10.1016/j.jbi.2009.12.006

Fig. 7 illustrates the above process. For simplicity, we replace x

with its rank vector. First, the algorithm matches the highest value(3) of r with the highest value (5) of x (Node 1 at Level 1). Since thesecond-highest value (2) of r is two positions forward, there mustexist a lower element at least two positions forward in x. Thisrequirement invalidates the two other possible matches for r1

(Nodes 2 and 3 at Level 1). In the first node, two elements (1 and4) of x satisfy the second match. These possibilities are representedas Nodes 1 and 2 at Level 2, the children of the Level 1 node. Weproceed to investigate the third match element (1), whose positionmust be between the two previously matched values. In Node 1 atLevel 2, we identify two candidate matches for the third element ofr: the values 3 and 1. These trends become the two nodes of Level3. In Node 2 at Level 2, however, we fail to find a possible match.All three elements of r have now been matched to y, so all nodesof Level 3 are solutions. If there were more than one possiblematch in Level 1, we could return to Level 1 and build up anothersolution tree from the next node.

A formal description of the above process is given in Algorithm2. We use the Routine ‘‘match” (Line 2) to calculate whether the ithmatched element branches to possible matches for element ðiþ 1Þ,as detailed above. The Algorithm EnumOccurrences traverses thesearch tree in the depth-first order just described, from Level 1downward. The tree has k ¼ lengthðrÞ levels, and many branchesare terminated early, as shown in Fig. 7. The great advantage of thisalgorithm is that anti-monotonic constraints can prune largeswathes of the search space (i.e., search nodes as shown in Fig. 7).

Algorithm 2 EnumOccurrences

Input: (1) x ¼ ½x1; . . . ; xn�, a time-series vector(2) r ¼ ½r1; . . . ; rk�, the rank order of a trend(3) H1, the thresholds of anti-monotonic constraints such as

hI 2 H1

Output: Wm�k, occurrence matrix r in x.1: Sort both x and r in descending order to get their index vectors

Ix and Ir, respectively. Together with H1 and W; Ix and Ir areglobal variables.

2: for all beginIx ¼ 1 to ðn� kþ 1Þ do3: matchðbeginIx;1Þ;4: end for

nformatics xxx (2010) xxx–xxx

4.2. Subproblem-II: aligning trend occurrences across genes

In Subproblem-I, we obtained the frequent trends for each geneand recorded all occurrences of those trends. In Subproblem-II, wewill identify gene groupings with several simultaneous frequenttrends, and where this pattern is repeated across multiple datasets.As Subproblem-I generates the complete list of length-k frequenttrends, we can state the following important property:

Property 4. All FTAPs of length k can be identified by mining the list oflength-k frequent trends obtained in Subproblem-I.

This property allows us to divide the search space of Subprob-lem-II by trend length, greatly simplifying the computation. Theindividual searches are reduced in size, and can be farmed out ina parallel processing framework.

Let ðD;TTÞðgÞk (corresponding to a row in Fig. 8) denote a singlelength-k frequent trend of gene g. We collect the entire list oflength-k trends of all genes, and mine the list of length-k FTAPs.Fig. 8(a) illustrates the process for k ¼ 3. The frequent trend matrixis composed of length-3 vectors. The search for FTAPs can be mod-eled as a Frequent Itemset Mining (FIM) problem [28]. To explainour algorithm, we briefly digress to a discussion of FIM and relatedconcepts.

iations in cellular systems across multiple time-series microarray datasets.

Page 7: Journal of Biomedical Informatics - UCLA Healthpathology.ucla.edu/workfiles/Research Services/li2010unraveling.pdf · Wenyuan Li, Min Xu, Xianghong Jasmine Zhou* Molecular and Computational

T3

Rank vector:

3 1 2Rank vector:

Trend: r

Time Series: xThe searching space is

represented by a tree . We search this tree by the depth -

first order. 4352 1

Time

Value

Rank vector: 2 5 3 4Time Series: x

1 Rank vector: 2 5 3 4Time Series: x

1 Rank vector: 2 5 3 4Time Series: x

1

Node 1 at Level 1 Node 2 at Level 1 Node 3 at Level 1

Rank vector: 2 5 3 4Time Series: x

1Rank vector: 2 5 3 4Time Series: x

1

Node 1 at Level 2 Node 2 at Level 2

Rank vector:

2 5 3 4Time Series: x

1

Node 1 at Level 3

Rank vector: 2 3 4

Time Series: x

1

Node 2 at Level 3

T25

Lev

el

1L

evel

2

Lev

el

3

3 1 2Rank vector:

Trend: r

No jump

3 1 2Rank vector:

Trend: r

Jump backward between 2 positions

3 1 2Rank vector:

Trend: r

Go forward at least 2 positions

T1 T3 T4 T5 4T3T1T 2T T5

4T1T 2T T51T 2T T3 T4 T5

W =T2 T3 T5T2 T4 T5

Fig. 7. An example showing how the match routine (Line 3 of Algorithm 2 – EnumOccurrences) traverses the search space.

W. Li et al. / Journal of Biomedical Informatics xxx (2010) xxx–xxx 7

ARTICLE IN PRESS

FIM is rooted in basket analysis, where ‘‘basket” refers to agroup of simultaneously purchased items or a bundled set of finan-cial transactions. Given a historical dataset of baskets, market ana-lysts are interested in knowing which items are frequentlypurchased together. To put it another way, they with to identify‘‘frequent itemsets”. The frequency of an itemset is defined as thenumber of baskets in which all the items occur together. Thisknowledge helps store managers with business decisions such asdesigning coupons and shelving merchandise to maximize profit.

Let I ¼ i1; i2; . . . ; in be a set of items, and let D be a database of mtransactions. Each transaction T is a set of items such that T � I. LetX be another set of items such that X � I. A transaction T is said tocontain X if and only if X � T. X is called a frequent itemset if at leasts transactions in the database contain X. The output of an FIM algo-rithm is a list of all itemsets X that occur in at least s transactions.

In our case, a single ‘‘item” is the doublet ðD;TTÞ, basically, asubset of time points in a single dataset, e.g., D1T6;T7;T8. Each trend(row) of the table can be viewed as a transaction containing these‘‘items”. When we redraw the table of Fig. 8(a) in the form ofFig. 8(b), it is obvious that an FTAP corresponds to a frequent item-set of ðD;TTÞ over multiple trends and genes. Many efficient algo-rithms have already been developed for the frequent itemsetmining problem. We employ the winner [30] of the ‘‘best imple-mentation” award in the Workshop on Frequent Itemset MiningImplementations.1 As mentioned earlier, since the mining is doneindependently for each trend length k, this algorithm is suitablefor parallel computing.

1 http://fimi.cs.helsinki.fi/fimi03

Please cite this article in press as: Li W et al. Unraveling complex temporal assocJ Biomed Inform (2010), doi:10.1016/j.jbi.2009.12.006

5. Experimental study

5.1. Data description

Yeast DNA microarray time-series datasets were collected fromthe Gene Expression Omnibus (GEO) database.2 We obtained 18datasets, each containing more than 10 time points (detaileddescription of the datasets are provided on the supplementarywebsite: http://zhoulab.usc.edu/FTAP). These time-series datasetscontain 5869 yeast genes and are measured under various experi-mental conditions: cell cycle, hydrogen peroxide response, fila-mentous-form growth, etc. The log transform was applied to datagenerated with affymetrix platforms. Expression values of differentprobes measuring the same gene were averaged. Datasets fromreplicated experiments were also combined and averaged.

5.2. Experimental setting and processing

We set the minimum trend length a ¼ 5, as trends over 5 timepoints are more likely to represent specific patterns. We require aFTAP to occur in at least three datasets and three genesðb ¼ 3 and n ¼ 3Þ. Additionally, we do not allow trends to skipmore than one time point ðhI ¼ 2Þ and require a large varianceðhV ¼ 0:5Þ. We ran the algorithm on a Linux parallel computingcluster. The computation time of the algorithm in this setting is ta-ken to be the greatest time required by any one node, even if manynodes were active for much shorter time. The maximum time isabout 10 h for both Subproblem-I and Subproblem-II.

2 http://www.ncbi.nlm.nih.gov/geo/

iations in cellular systems across multiple time-series microarray datasets.

Page 8: Journal of Biomedical Informatics - UCLA Healthpathology.ucla.edu/workfiles/Research Services/li2010unraveling.pdf · Wenyuan Li, Min Xu, Xianghong Jasmine Zhou* Molecular and Computational

8 W. Li et al. / Journal of Biomedical Informatics xxx (2010) xxx–xxx

ARTICLE IN PRESS

Our data-mining procedure resulted in 184,116 patterns, eachof which corresponds to a distinct gene group. The patterns rangein size from 3 to 156 genes, and involve a total of 4847 genes. In thenext section, we analyze the results.

5.3. Analysis of frequent temporal association patterns

To assess whether the patterns obtained are biologically mean-ingful, we first determined the percentage of gene groups that arefunctionally homogeneous. Using the Gene Ontology (GO) biologi-cal process annotations, we consider a gene group to be function-ally homogeneous if the functional homogeneity modeled by thehypergeometric distribution [31] to be significant at thea ¼ 0:005 level. In addition, functional categories must be associ-ated with Gene Ontology nodes at least four levels below the root.By these criteria, 56.28% of the gene groups are functionallyhomogenous while randomization test (based on randomly as-signed gene groups) only yields 30.38%. Furthermore, amonggroups containing at least four genes, 67.26% are functionallyhomogeneous (versus 35.98% by randomization test); amonggroups of at least five genes, 72.82% are functionally homogeneous(versus 39.37% by randomization test); and among groups of atleast six genes, 75.36% are functionally homogeneous (versus41.23% by randomization test). The gene groups cover a wide rangeof functional categories, e.g., cell cycle, secretory pathway, re-

...

(a) Classify frequent trends according to their lengths; t

discover length-k FTAPs in all length-k frequent tre

Details of this strefer to Fig.6 (b

...

D 1 D 3D 2

T6 T7 T8 T6 T7 T9

T1 T5 T7 T7 T9 T10

T2 T4 T9 T8 T11 T13 T9 T10 T12

T3 T4 T5 T5 T7 T9

...

...

...

............

...

G1

Gn

Trend 1

Trend 2

Trend p-1

Trend p

......

...

G1

G9

G21

D3D1 D11

T6 T7 T8 T6 T7 T9 T1 T4 T5

G1G9

G21

D3D1 D11

T6 T7 T8 T6 T7 T9 T1 T4 T5

T6 T7 T8 T6 T7 T9 T1 T4 T5

T6 T7 T8 T6 T7 T9 T1 T4 T5

All length-3 FTAPs

All length-3 frequent trends

D7D1T7 T9 T10 T2 T4 T5 T

G1G6

G11

D7D2

T7 T9 T10 T2 T4 T5 T

T7 T9 T10 T2 T4 T5 T

T7 T9 T10 T2 T4 T5 T

G1

G11

G6

Fig. 8. Schematic of our solution to solve Subproblem-II. (a) The first stage consists of orgthe trend does not occur in the corresponding dataset. (b) The second stage, discovery ofproblem. We take length k ¼ 3 trends as an example. (For interpretation of the references

Please cite this article in press as: Li W et al. Unraveling complex temporal assocJ Biomed Inform (2010), doi:10.1016/j.jbi.2009.12.006

sponse to stress, regulation of transcription, and cellular catabolicprocess.

Interestingly, we obtained 131 groups containing 100 or moregenes. All are functionally homogeneous, and most are related torRNA processing and ribonucleoprotein complex biogenesis. Ourmethod’s ability to capture large homogenous patterns can beattributed to two factors: (1) it is designed specifically to identifycomplex temporal associations, and (2) it focuses on the mostimportant trends by allowing the patterns to skip time points. Thisflexibility greatly enhances the sensitivity of our method, evenwhen mining noisy data. In principle, allowing the algorithm toskip multiple time points would also allow us to look for associa-tions between time series with different time units. For example,time series with 10-min and 30-min intervals could be alignedby skipping two time points.

It is widely known that while co-expressed genes tend to befunctionally related, not all functionally related genes are co-ex-pressed [32]. Identifying functional related genes beyond co-expression has been a major challenge in microarray analysis.Our method identified a significant number of such functionally re-lated gene sets: 38.4% of the groups have average pairwise correla-tions less than 0.6. The power of our method attributes to itsintegrative analysis of multiple time-series datasets, since subtlepatterns may emerge that could hardly be identified in any individ-ual dataset. In a single dataset, it is difficult to separate complexpatterns from the background noise. Our method is able to detect

...

...

............

...

G1

Gn

Trend 1

Trend 2

Trend p-1

Trend p

......

...(T6,T7,T8)

D1(T6,T7,T9)

D3

(T1,T5,T7)D1(T7,T9,T10)D2

(T2,T4,T9)D1(T8,T11,T13)D2 (T9,T10,T12)D3

(T3,T4,T5)D1

(T5,T7,T9)D3

Each trend is a transaction and contains items, each of which corresponds to a set of time points in a dataset, e.g.

A frequent itemset corresponds to a FTAP, for example

Frequent Itemset Mining

(T6,T7,T8)D1

(T6,T7,T8)D1 (T6,T7,T9)D3(T1,T4,T5)D11( )

...

hen

nds.

(b) Mining FTAP can be modeled as

a frequent itemset mining problem

(T6,T7,T8)D1 (T6,T7,T9)D3

(T6,T7,T8)D1(T6,T7,T9)D3

(T1,T4,T5)D11

(T1,T4,T5)D11

(T6,T7,T8)D1(T6,T7,T9)D3 (T1,T4,T5)D11

G1

G2

G21

Trend collection (example from all length-3 trends)

ep )

FrequentItemset

FTAP :

:

D95 T6 T8

D9

5 T6 T8

5 T6 T8

5 T6 T8

anizing the frequent trends enumerated by Subproblem-I. A red cross indicates thatall length-k FTAPs among the length-k frequent trends, is a frequent itemset miningto color in this figure legend, the reader is referred to the web version of this paper.)

iations in cellular systems across multiple time-series microarray datasets.

Page 9: Journal of Biomedical Informatics - UCLA Healthpathology.ucla.edu/workfiles/Research Services/li2010unraveling.pdf · Wenyuan Li, Min Xu, Xianghong Jasmine Zhou* Molecular and Computational

EAF6

DIN7

YLL066C

GDS1752 GDS608 GDS2318

SEC2

SVP26

YPT31

GDS2318 GDS38 GDS2350

(a) cell cycle (b) vesicle-mediated transport

Fig. 9. The genes in these three association patterns are functionally related, but have low Pearson correlations.

W. Li et al. / Journal of Biomedical Informatics xxx (2010) xxx–xxx 9

ARTICLE IN PRESS

them because when a pattern occurs in multiple datasets, it ismore easily differentiated from the background.

Fig. 9(a) shows one example of a functionally related genegroup without co-expression. The genes DIN7, EAF6 and YLL066C,repeat the same coupled pattern on simultaneous time points inthree different datasets (GDS1752, GDS608, GDS2318), represent-ing three different growth conditions. Both EAF6 and DIN7 areknown to be involved in DNA repair. The function of gene YLL066Cis unknown, but evidences point to its role in DNA repair as well[33]. More specifically, DIN7 is a DNA damage-inducible mitochon-drial nuclease that modulates the stability of mitochondrial DNA(mtDNA) in Saccharomyces cerevisiae [34]. EAF6, a subunit of theacetyltransferase complex NuA4, is recruited to the chromatin sur-rounding the damage and, through a series of chromatin modifica-tions, it too contributes to DNA repair [35]. YLL066C, as a predictedY0 element helicase, may play an important role in maintaining andrepairing telomere [33]. Despite the genes’ possible involvement inthe same biological process, their average pairwise correlations inthe three datasets are only 0.24, 0.19, and 0.21, respectively. Thisassociation would not be captured by clustering algorithms.

To take another example, Fig. 9(b) shows a recurrent temporalcoordination of the genes SEC2, SVP26, and YPT31. All three partic-ipate in vesicle-mediated transport, where SEC2 is essential forpost-Golgi vesicle transport, SVP26 is a component of the earlyGolgi apparatus and is involved in COP II vesicle transport, andYPT31 mediates intra-Golgi traffic or the budding of post-Golgivesicles from the trans-Golgi. Their average pairwise correlationsin the three datasets are 0.31, 0.11, and 0.63.

The next interesting problem is to elucidate the mechanismunderlying the subtle temporal activity coupling. For example, inFig. 9(b), why does the expression of SEC2 increase initially whilethe expressions of SVP26 and YPT31 decrease? And why does thetrend then rapidly reverse? These are not simple questions, andtheir answers lie beyond the scope of this paper. However, theadvantage of our method is that we need no prior knowledge ofpatterns (their shapes or underlying mechanisms). We can simplysearch for all patterns with ‘‘frequent” occurrence, because highrecurrence itself implies biological significance. Extracting andcharacterizing these frequent patterns can serve as a first step to-wards understanding the complex mechanisms governing them.

6. Conclusion

In this paper, we have addressed the challenge of identifyingcomplex temporal couplings in cellular systems. We proposed

Please cite this article in press as: Li W et al. Unraveling complex temporal assocJ Biomed Inform (2010), doi:10.1016/j.jbi.2009.12.006

the novel concept of a ‘‘frequent temporal association pattern”(FTAP): multiple genes simultaneously exhibiting different expres-sion trends, where the same pattern can be observed in severalmicroarray datasets. By their very nature, such signals cannot berecognized within a single time-series dataset, but the availabilityof many such sets makes it possible to separate frequent associa-tions from the background of noisy data and competing processes.

The FTAP has several advantages over traditional methods. First,it can identify the subtle signals which exist only in multiple time-series datasets and thus cannot be captured by co-expression anal-ysis or time-series analysis within the single time-series dataset.Second, unlike existing time-series analysis methods, the FTAP def-inition does not require that the trends (or curves) of differentgenes are similar. This is important for FTAP to seize groups offunctionally related genes whose time series in the single datasetare not similar whatever globally or locally. To the best of ourknowledge, we are the first to propose looking for this type of asso-ciation pattern. Our experimental results demonstrate that theeffectiveness of such patterns, and the real-world examples inFig. 9 intuitively illustrate how three uncorrelated genes are con-tained in an FTAP.

To discover such complex patterns, we designed an efficientdata-mining approach that exploits the anti-monotonic propertyof frequent trends. We applied the method to 18 yeast microarraytime-series datasets, and discovered a large number of FTAPs. Asignificant fraction of these include functionally related genes be-yond co-expression, which could not have been captured by clus-tering algorithms. Our method is efficient, scalable, and easilycomputed in parallel. With the rapid accumulation of time-seriesmicroarray data, this method could soon provide an atlas of tempo-ral association patterns to facilitate more advanced and detailedstudies.

We conclude with suggestions for future work. As our algorithmdiscovered a great patterns, it is useful to devise a means of statis-tically evaluating their importance. Secondly, it would be useful topost-process the patterns, arriving at a reduced set of gene groupswith fewer overlaps. Finally, our method is equally applicable totime series from other domains such as stock market analysis, geo-physics, meteorology, and social sciences. We expect it to discovermany novel and complex patterns which have been overlooked byconventional methods.

References

[1] Kholodenko B. Cell-signalling dynamics in time and space. Nat Rev Mol CellBiol 2006;7(3):165–76.

iations in cellular systems across multiple time-series microarray datasets.

Page 10: Journal of Biomedical Informatics - UCLA Healthpathology.ucla.edu/workfiles/Research Services/li2010unraveling.pdf · Wenyuan Li, Min Xu, Xianghong Jasmine Zhou* Molecular and Computational

10 W. Li et al. / Journal of Biomedical Informatics xxx (2010) xxx–xxx

ARTICLE IN PRESS

[2] Eisen M, Spellman P, Brown P, Botstein D. Cluster analysis and display of genome-wide expression patterns. Proc Natl Acad Sci USA 1998;95(25):14863–8.

[3] Tavazoie S, Hughes J, Campbell M, Cho R, Church G. Systematic determinationof genetic network architecture. Nat Genet 1999;22:281–5.

[4] Golub T, Slonim D, Tamayo P, Huard C, Gaasenbeek M, Mesirov J, et al.Molecular classification of cancer: class discovery and class prediction by geneexpression monitoring. Science 1999;286(5439):531.

[5] Criel J, Tsiporkova E. Gene time expression warper: a tool for, alignment,template matching and visualization of gene expression time series.Bioinformatics 2006;22(2):251–2.

[6] Aach J, Church G. Aligning gene expression time series with time warpingalgorithms. Bioinformatics 2001;17(6):495–508.

[7] Qian J, Dolled-Filhart M, Lin J, Yu H, Gerstein M. Beyond synexpressionrelationships: local clustering of time-shifted and inverted gene expressionprofiles identifies new, biologically relevant interactions. J Mol Biol2001;314(5):1053–66.

[8] Fink G, Spellman P, Sherlock G, Zhang M, Iyer V, Anders K, et al. Comprehensiveidentification of cell cycle-regulated genes of the yeast saccharomycescerevisiae by microarray hybridization. Mol Biol Cell 1998;9(12):3273–97.

[9] Baumer M, Kunzler M, Steigemann P, Braus G, Irniger S. Yeast Ran-bindingprotein Yrb1p is required for efficient proteolysis of cell cycle regulatoryproteins Pds1p and Sic1p. J Biol Chem 2000;275(49):38929–37.

[10] Ernst J, Nau G, Bar-Joseph Z. Clustering short time series gene expression data.Bioinformatics 2005;21(Suppl 1):S159–68.

[11] Irizarry R, Warren D, Spencer F, Kim I, Biswal S, Frank B, et al. Multiple-laboratory comparison of microarray platforms. Nat Meth 2005;2(5):345–50.

[12] Larkin J, Frank B, Gavras H, Sultana R, Quackenbush J. Independence andreproducibility across microarray platforms. Nat Meth 2005;2(5):337–44.

[13] Shi L, Tong W, Fang H, Scherf U, Han J, Puri R, et al. Cross-platformcomparability of microarray technology: intra-platform consistency andappropriate data analysis procedures are essential, BMC Bioinform 6 (Suppl 2).

[14] Butte A, Chen R. Finding disease-related genomic experiments within aninternational repository: first steps in translational bioinformatics. In: AMIAannual symposium proceedings; 2006. p. 106–10.

[15] Tanaka Y, Iwamoto K, Uehara K. Discovery of time-series motif from multi-dimensional data based on MDL principle. Machine Learn 2005;58:269–300.

[16] Keogh E, Lin J, Lee S-H, Herle HV. Finding the most unusual time seriessubsequence: algorithms and applications. Knowl Inf Syst 2006;11(1):1–27.

[17] Gionis A, Mannila H. Finding recurrent sources in sequences. In: Proceedings ofthe RECOMB; 2003.

[18] Keogh E, Kasetty S. On the need for time series data mining benchmarks: asurvey and empirical demonstration. In: Proceedings of the SIGKDD; 2002.

Please cite this article in press as: Li W et al. Unraveling complex temporal assocJ Biomed Inform (2010), doi:10.1016/j.jbi.2009.12.006

[19] Smith AA, Vollrath A, Bradfield CA, Craven M. Clustered alignments of gene-expression time series data. Bioinformatics 2009;25(12):i119–27.

[20] Telesca D, Inoue L. Bayesian hierarchical curve registration. J Am Stat Assoc2008;103(481):328–39.

[21] Androulakis I, Yang E, Almon R. Analysis of time-series gene expression data:Methods, challenges, and opportunities. Annu Rev Biomed Eng2007;9:205–28.

[22] Chen X. Curve-based clustering of time course gene expression data using self-organizing maps. J Bioinform Comput Biol 2009;7(4):645–61.

[23] Papadimitriou S, Sun J, Faloutsos C. Streaming pattern discovery in multipletime-series. In: Proceedings of the VLDB; 2005.

[24] Indyk P, Koudas N, Muthukrishnan S. Identifying representative trends inmassive time series data sets using sketches. In: Proceedings of the VLDB;2000.

[25] Udechukwu A, Barker K, Alhajj R. Discovering all frequent trends in time series.In: ACM international conference on proceedings series; 2004.

[26] Hunt J, Szymanski T. A fast algorithm for computing longest commonsubsequences. Commun ACM 1977;20(5):350–3.

[27] Nam H, Lee K, Lee D. Identification of temporal association rules from time-series microarray data sets. BMC Bioinform 2009;10(Suppl 3):S6.

[28] Han J, Kamber M. In: Data mining: concepts and techniques. Los Altos(CA): Morgan Kaufman; 2006.

[29] Wang W, Yang J. Mining sequential patterns from large datasets. Berlin: Springer; 2005.

[30] Grahne G, Zhu J. Efficiently using prefix-trees in mining frequent itemsets. In:FIMI’03 Workshop on frequent itemset mining implementations; 2003.

[31] Wu L, Hughes T, Davierwala A, Robinson M, Stoughton R, Altschuler S, et al.Large-scale prediction of Saccharomyces cerevisiae gene function usingoverlapping transcriptional clusters. Nat Genet 2002;31(3):255–65.

[32] Zhou X, Kao M-CJ, Wong WH. Transitive functional annotation by shortestpath analysis of gene expression data. Proc Natl Acad Sci USA2002;99(20):12783–8.

[33] Yamada M, Hayatsu N, Matsuura A, Ishikawa F. Y0-Help1, a DNA helicaseencoded by the yeast subtelomeric Y0 element, is induced in survivorsdefective for telomerase. J Biol Chem 1998;273(50):33360–6.

[34] Yoshitania A, Yoshida M, Ling F. A novel cis-acting element required for DNAdamage-inducible expression of yeast DIN7. Biochem Biophys Res Commun2008;365(1):183–8.

[35] Mitchell L, Lambert J-P, Gerdes M, Al-Madhoun AS, Skerjanc IS, Figeys D, et al.Functional dissection of the NuA4 histone acetyltransferase reveals its role as agenetic hub and that Eaf1 is essential for complex integrity. Mol Cell Biol2008;28(7):2244–56.

iations in cellular systems across multiple time-series microarray datasets.