


arXiv:1104.2180v1 [stat.ME] 12 Apr 2011

Statistical Science

2010, Vol. 25, No. 4, 476–491. DOI: 10.1214/09-STS312. © Institute of Mathematical Statistics, 2010

The EM Algorithm and the Rise of Computational Biology

Xiaodan Fan, Yuan Yuan and Jun S. Liu

Abstract. In the past decade computational biology has grown from a cottage industry with a handful of researchers to an attractive interdisciplinary field, catching the attention and imagination of many quantitatively-minded scientists. Of interest to us is the key role played by the EM algorithm during this transformation. We survey the use of the EM algorithm in a few important computational biology problems surrounding the "central dogma" of molecular biology: from DNA to RNA and then to proteins. Topics of this article include sequence motif discovery, protein sequence alignment, population genetics, evolutionary models and mRNA expression microarray data analysis.

Key words and phrases: EM algorithm, computational biology, literature review.

1. INTRODUCTION

1.1 Computational Biology

Started by a few quantitatively minded biologists and biologically minded mathematicians in the 1970s, computational biology has been transformed in the past decades to an attractive interdisciplinary field drawing in many scientists. The use of formal statistical modeling and computational tools, the expectation–maximization (EM) algorithm in particular, contributed significantly to this dramatic transition in solving several key computational biology problems. Our goal here is to review some of the historical developments with technical details, illustrating how biology, traditionally regarded as an empirical science, has come to embrace rigorous statistical modeling and mathematical reasoning.

Xiaodan Fan is Assistant Professor in Statistics, Department of Statistics, the Chinese University of Hong Kong, Hong Kong, China (e-mail: [email protected]). Yuan Yuan is Quantitative Analyst, Google, Mountain View, California, USA (e-mail: [email protected]). Jun S. Liu is Professor of Statistics, Department of Statistics, Harvard University, 1 Oxford Street, Cambridge, Massachusetts 02138, USA (e-mail: [email protected]).

This is an electronic reprint of the original article published by the Institute of Mathematical Statistics in Statistical Science, 2010, Vol. 25, No. 4, 476–491. This reprint differs from the original in pagination and typographic detail.

Before getting into details of various applications of the EM algorithm in computational biology, we first explain some basic concepts of molecular biology. Three kinds of chain biopolymers are the central molecular building blocks of life: DNA, RNA and proteins. The DNA molecule is a double-stranded long sequence composed of four types of nucleotides (A, C, G and T). It has the famous double-helix structure, and stores the hereditary information. RNA molecules are very similar to DNAs, composed also of four nucleotides (A, C, G and U). Proteins are chains of 20 different basic units, called amino acids.

The genome of an organism generally refers to the collection of all its DNA molecules, called the chromosomes. Each chromosome contains both the protein (or RNA) coding regions, called genes, and noncoding regions. The percentage of the coding regions varies a lot among genomes of different species. For example, the coding regions of the genome of baker's yeast are more than 50%, whereas those of the human genome are less than 3%.

RNAs are classified into many types, and the three most basic types are as follows: messenger RNA (mRNA), transfer RNA (tRNA) and ribosomal RNA (rRNA). An mRNA can be viewed as an intermediate copy of its corresponding gene and is used as a template for constructing the target protein. tRNA is needed to recruit various amino acids and transport them to the template mRNA. mRNA, tRNA and amino acids work together with the construction machineries called ribosomes to make the final product, protein. One of the main components of ribosomes is the third kind of RNA, rRNA.

Proteins carry out almost all essential functions in a cell, such as catalysis, signal transduction, gene regulation, molecular modification, etc. These capabilities of the protein molecules are dependent on their 3-dimensional shapes, which, to a large extent, are uniquely determined by their one-dimensional sequence compositions. In order to make a protein, the corresponding gene has to be transcribed into mRNA, and then the mRNA is translated into the protein. The "central dogma" refers to this concerted effort of transcription and translation in the cell. The expression level of a gene refers to the amount of its mRNA in the cell.

Differences between two living organisms are mostly due to the differences in their genomes. Within a multicellular organism, however, different cells may differ greatly in both physiology and function even though they all carry identical genomic information. These differences are the result of differential gene expression. Since the mid-1990s, scientists have developed microarray techniques that can monitor simultaneously the expression levels of all the genes in a cell, making it possible to construct the molecular "signature" of different cell types. These techniques can be used to study how a cell responds to different interventions, and to decipher gene regulatory networks. A more detailed introduction to the basic biology for statisticians is given by Ji and Wong (2006).

With the help of the recent biotechnology revolution, biologists have generated an enormous amount of molecular data, such as billions of base pairs of DNA sequence data in GenBank, protein structure data in PDB, gene expression data, biological pathway data, biopolymer interaction data, etc. The explosive growth of various system-level molecular data calls for sophisticated statistical models for information integration and for efficient computational algorithms. Meanwhile, statisticians have acquired a diverse array of tools for developing such models and algorithms, such as the EM algorithm (Dempster, Laird and Rubin (1977)), data augmentation (Tanner and Wong (1987)), Gibbs sampling (Geman and Geman (1984)), the Metropolis–Hastings algorithm (Metropolis and Ulam (1949); Metropolis et al. (1953); Hastings (1970)), etc.

1.2 The Expectation–Maximization Algorithm

The expectation–maximization (EM) algorithm (Dempster, Laird and Rubin, 1977) is an iterative method for finding the mode of a marginal likelihood function (e.g., the MLE when there are missing data) or a marginal distribution (e.g., the maximum a posteriori estimator). Let Y denote the observed data, Θ the parameters of interest, and Γ the nuisance parameters or missing data. The goal is to maximize the function

p(Y|Θ) = ∫ p(Y, Γ|Θ) dΓ,

which cannot be solved analytically. A basic assumption underlying the effectiveness of the EM algorithm is that the complete-data likelihood or the posterior distribution, p(Y, Γ|Θ), is easy to deal with. Starting with a crude parameter estimate Θ^(0), the algorithm iterates between the following Expectation (E-step) and Maximization (M-step) steps until convergence:

• E-step: Compute the Q-function:

  Q(Θ|Θ^(t)) ≡ E_{Γ|Θ^(t),Y}[log p(Y, Γ|Θ)].

• M-step: Find the maximizer:

  Θ^(t+1) = argmax_Θ Q(Θ|Θ^(t)).

Unlike the Newton–Raphson and scoring algorithms, the EM algorithm does not require computing the second derivative or the Hessian matrix. The EM algorithm also has the nice properties of monotone nondecrease of the marginal likelihood and stable convergence to a local mode (or a saddle point) under weak conditions. More importantly, the EM algorithm is constructed based on the missing-data formulation and often conveys useful statistical insights regarding the underlying statistical model. A major drawback of the EM algorithm is that its convergence rate is only linear, proportional to the fraction of "missing information" about Θ (Dempster, Laird and Rubin (1977)). In cases with a large proportion of missing information, the convergence of the EM algorithm can be very slow. To monitor the convergence rate and guard against the local mode problem, a basic strategy is to start the EM algorithm with multiple initial values. More sophisticated methods are available for specific problems, such as the "backup-buffering" strategy in Qin, Niu and Liu (2002).
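To make the E- and M-steps concrete, the following minimal sketch (our illustration, not from the paper) runs EM on a two-component Gaussian mixture with unit variances, where the missing data Γ are the unobserved component labels:

```python
import math
import random

def em_gaussian_mixture(y, n_iter=200):
    """Minimal EM for a two-component Gaussian mixture with unit variances.

    E-step: compute each point's posterior membership probability.
    M-step: re-estimate the two means and the mixing weight.
    Illustrative sketch only.
    """
    mu1, mu2, pi = min(y), max(y), 0.5  # crude initial estimate Theta^(0)
    for _ in range(n_iter):
        # E-step: posterior probability that each point came from component 1
        w = []
        for x in y:
            d1 = pi * math.exp(-0.5 * (x - mu1) ** 2)
            d2 = (1 - pi) * math.exp(-0.5 * (x - mu2) ** 2)
            w.append(d1 / (d1 + d2))
        # M-step: weighted means and mixing proportion
        s = sum(w)
        mu1 = sum(wi * x for wi, x in zip(w, y)) / s
        mu2 = sum((1 - wi) * x for wi, x in zip(w, y)) / (len(y) - s)
        pi = s / len(y)
    return mu1, mu2, pi

random.seed(0)
data = [random.gauss(0.0, 1.0) for _ in range(200)] + \
       [random.gauss(5.0, 1.0) for _ in range(200)]
m1, m2, p = em_gaussian_mixture(data)
```

Each sweep fills in the labels softly and then maximizes the expected complete-data log-likelihood, so the marginal likelihood never decreases, matching the monotonicity property noted above.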


1.3 Uses of the EM Algorithm in Biology

The idea of iterating between filling in the missing data and estimating unknown parameters is so intuitive that some special forms of the EM algorithm appeared in the literature long before Dempster, Laird and Rubin (1977) defined it. The earliest example on record is by McKendrick (1926), who invented a special EM algorithm for fitting a Poisson model to a cholera infection data set. Other early forms of the EM algorithm appeared in numerous genetics studies involving allele frequency estimation, segregation analysis and pedigree data analysis (Ceppellini, Siniscalco and Smith, 1955; Smith, 1957; Ott, 1979). A precursor to the broad recognition of the EM algorithm by the computational biology community is Churchill (1989), who applied the EM algorithm to fit a hidden Markov model (HMM) for partitioning genomic sequences into regions with homogeneous base compositions. Lawrence and Reilly (1990) first introduced the EM algorithm for biological sequence motif discovery. Haussler et al. (1993) and Krogh et al. (1994) formulated an innovative HMM and used the EM algorithm for protein sequence alignment. Krogh, Mian and Haussler (1994) extended these algorithms to predict genes in E. coli DNA data. During the past two decades, probabilistic modeling and the EM algorithm have become a more and more common practice in computational biology, ranging from multiple sequence alignment for a single protein family (Do et al., 2005) to genome-wide predictions of protein–protein interactions (Deng et al., 2002), and to single-nucleotide polymorphism (SNP) haplotype estimation (Kang et al., 2004).

As noted in Meng and Pedlow (1992) and Meng (1997), there are too many EM-related papers to track. This is true even within the field of computational biology. In this paper we only examine a few key topics in computational biology and use typical examples to show how the EM algorithm has paved the road for these studies. The connection between the EM algorithm and statistical modeling of complex systems is essential in computational biology. It is our hope that this brief survey will stimulate further EM applications and provide insight for the development of new algorithms.

Discrete sequence data and continuous expression data are two of the most common data types in computational biology. We discuss sequence data analysis in Sections 2–5, and gene expression data analysis in Section 6. A main objective of computational biology research surrounding the "central dogma" is to study how gene sequences affect gene expression. In Section 2 we consider finding conserved patterns in functionally related gene sequences, in an effort to explain the similarity of their gene expression. In Section 3 we give an EM algorithm for multiple sequence alignment, where the goal is to establish "relatedness" of different sequences. Based on the alignment of evolutionarily related DNA sequences, another EM algorithm for detecting potentially expression-related regions is introduced in Section 4. An alternative way to deduce the relationship between gene sequence and gene expression is to check the effect of sequence variation within the population of a species. In Section 5 we provide an EM algorithm to deal with this type of small sequence variation. In Section 6 we review the clustering analysis of microarray gene-expression data, which is important for connecting the phenotype variation among individuals with the expression level variation. Finally, in Section 7 we discuss trends in computational biology research.
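The McKendrick example mentioned above can be sketched as follows (our reconstruction of the general idea, not his original computation): households with zero cases go unobserved, so the number of such households is the missing data, and the EM iteration alternates between imputing that count and re-estimating the Poisson rate:

```python
import math

def em_truncated_poisson(counts, n_iter=100):
    """EM for the Poisson rate when the zero class is unobserved,
    in the spirit of McKendrick's cholera example (our sketch).

    counts: observed per-household case counts, all >= 1.
    Missing data: the number of households with zero cases.
    """
    n_obs = len(counts)
    total = sum(counts)
    lam = total / n_obs  # crude start Theta^(0), ignoring the zero class
    for _ in range(n_iter):
        p0 = math.exp(-lam)
        # E-step: expected number of unobserved zero-case households
        n0 = n_obs * p0 / (1.0 - p0)
        # M-step: Poisson MLE on the completed data (zeros add no cases)
        lam = total / (n_obs + n0)
    return lam

# Hypothetical data: 47 affected households with 1-4 cases each
lam = em_truncated_poisson([1] * 30 + [2] * 12 + [3] * 4 + [4] * 1)
```

At convergence the estimate satisfies the zero-truncated-Poisson fixed point λ = m(1 − e^{−λ}), where m is the mean of the observed counts.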

2. SEQUENCE MOTIF DISCOVERY AND GENE REGULATION

In order for a gene to be transcribed, special proteins called transcription factors (TFs) are often required to bind to certain sequences, called transcription factor binding sites (TFBSs). These sites are usually 6–20 bp long and are mostly located upstream of the gene. One TF is usually involved in the regulation of many genes, and the TFBSs that the TF recognizes often exhibit strong sequence specificity and conservation (e.g., the first position of the TFBSs is likely T, etc.). This specific pattern is called a TF binding motif (TFBM). For example, Figure 1 shows a motif of length 6. The motif is represented by the position-specific frequency matrix (θ_1, ..., θ_6), which is derived from the alignment of 5 motif sites by calculating position-dependent frequencies of the four nucleotides.

In order to understand how genes' mRNA expression levels are regulated in the cell, it is crucial to identify TFBSs and to characterize TFBMs. Although much progress has been made in developing experimental techniques for identifying these TFBSs, these techniques are typically expensive and time-consuming. They are also limited by experimental conditions, and cannot pinpoint the binding sites exactly. In the past twenty years, computational biologists and statisticians have developed many successful in silico methods to aid biologists in finding TFBSs, and these efforts have contributed significantly to our understanding of transcription regulation.

Likewise, motif discovery for protein sequences is important for identifying structurally or functionally important regions (domains) and understanding proteins' functional components, or active sites. For example, using a Gibbs sampling-based motif finding algorithm, Lawrence et al. (1993) were able to predict the key helix-turn-helix motif among a family of transcription activators. Experimental approaches for determining protein motifs are even more expensive and slower than those for DNAs, whereas computational approaches are comparatively more effective than those for TFBS prediction.

The underlying logic of computational motif discovery is to find patterns that are "enriched" in a given set of sequence data. Common methods include word enumeration (Sinha and Tompa (2002); Hampson, Kibler and Baldi (2002); Pavesi et al. (2004)), position-specific frequency matrix updating (Stormo and Hartzell (1989); Lawrence and Reilly (1990); Lawrence et al. (1993)) or a combination of the two (Liu, Brutlag and Liu, 2002). The word enumeration approach uses a specific consensus word to represent a motif. In contrast, the position-specific frequency matrix approach formulates a motif as a weight matrix. Jensen et al. (2004) provide a review of these motif discovery methods. Tompa et al. (2005) compared the performance of various motif discovery tools. Traditionally, researchers have employed various heuristics, such as evaluating excessiveness of word counts or maximizing certain information criteria, to guide motif finding. The EM algorithm was introduced by Lawrence and Reilly (1990) to deal with the motif finding problem.

Fig. 1. Transcription factor binding sites and motifs. (A) Each of the five sequences contains a TFBS of length 6. The local alignment of these sites is shown in the gray box. (B) The frequency of the nucleotides outside of the gray box is shown as θ_0. The frequency of the nucleotides in the ith column of the gray box is shown as θ_i.

As shown in Figure 1, suppose we are given a set of K sequences Y ≡ (Y_1, ..., Y_K), where Y_k ≡ (Y_{k,1}, ..., Y_{k,L_k}) and Y_{k,l} takes values in an alphabet of d residues (d = 4 for DNA/RNA and 20 for proteins). The alphabet is denoted by R ≡ (r_1, ..., r_d). Motif sites in this paper refer to a set of contiguous segments of the same length w (e.g., the marked 6-mers in Figure 1). This concept can be further generalized via a hidden Markov model to allow gaps and position deletions (see Section 3 for HMM discussions). The weight matrix, or Product-Multinomial motif model, was first introduced by Stormo and Hartzell (1989) and later formulated rigorously in Liu, Neuwald and Lawrence (1995). It assumes that, if Y_{k,l} is the ith position of a motif site, it follows the multinomial distribution with the probability vector θ_i ≡ (θ_{i1}, ..., θ_{id}); we denote this model as PM(θ_1, ..., θ_w). If Y_{k,l} does not belong to any motif site, it is generated independently from the multinomial distribution with parameter θ_0 ≡ (θ_{01}, ..., θ_{0d}). Let Θ ≡ (θ_0, θ_1, ..., θ_w).

For sequence Y_k, there are L′_k = L_k − w + 1 possible positions at which a motif site of length w may start. To represent the motif locations, we introduce the unobserved indicators Γ ≡ {Γ_{k,l} | 1 ≤ k ≤ K, 1 ≤ l ≤ L′_k}, where Γ_{k,l} = 1 if a motif site starts at position l in sequence Y_k, and Γ_{k,l} = 0 otherwise. As shown in Figure 1, it is straightforward to estimate Θ if we know where the motif sites are. The motif location indicators Γ are the missing data that make the EM framework a natural choice for this problem. For illustration, we further assume that there is exactly one motif site within each sequence and that its location in the sequence is uniformly distributed. This means that ∑_l Γ_{k,l} = 1 for all k and P(Γ_{k,l} = 1) = 1/L′_k.

Given Γ_{k,l} = 1, the probability of each observed sequence Y_k is

(1)  P(Y_k | Γ_{k,l} = 1, Θ) = θ_0^{h(B_{k,l})} ∏_{j=1}^{w} θ_j^{h(Y_{k,l+j−1})}.

In this expression, B_{k,l} ≡ {Y_{k,j} : j < l or j ≥ l + w} is the set of letters of nonsite positions of Y_k. The counting function h(·) takes a set of letter symbols as input and outputs the column vector (n_1, ..., n_d)^T, where n_i is the number of base type r_i in the input set. We define the vector power function as θ_i^{h(·)} ≡ ∏_{j=1}^{d} θ_{ij}^{n_j} for i = 0, ..., w. Thus, the complete-data likelihood function is the product of equation (1) for k from 1 to K, that is,

P(Y, Γ | Θ) ∝ ∏_{k=1}^{K} ∏_{l=1}^{L′_k} P(Y_k | Γ_{k,l} = 1, Θ)^{Γ_{k,l}} = θ_0^{h(B_Γ)} ∏_{i=1}^{w} θ_i^{h(M^{(i)}_Γ)},

where B_Γ is the set of all nonsite bases, and M^{(i)}_Γ is the set of nucleotide bases at position i of the TFBSs given the indicators Γ.

The MLE of Θ from the complete-data likelihood can be determined by simple counting, that is,

θ̂_i = h(M^{(i)}_Γ)/K and θ̂_0 = h(B_Γ)/∑_{k=1}^{K}(L_k − w).

The EM algorithm for this problem is quite intuitive. In the E-step, one uses the current parameter values Θ^(t) to compute the expected values of h(M^{(i)}_Γ) and h(B_Γ). More precisely, for sequence Y_k, we compute its likelihood of being generated from Θ^(t) conditional on each possible motif location Γ_{k,l} = 1,

w_{k,l} ≡ P(Y_k | Γ_{k,l} = 1, Θ^(t)) = (θ_1/θ_0)^{h(Y_{k,l})} ⋯ (θ_w/θ_0)^{h(Y_{k,l+w−1})} θ_0^{h(Y_k)}.

Letting W_k ≡ ∑_{l=1}^{L′_k} w_{k,l}, we then compute the expected count vectors as

E_{Γ|Θ^(t),Y}[h(M^{(i)}_Γ)] = ∑_{k=1}^{K} ∑_{l=1}^{L′_k} (w_{k,l}/W_k) h(Y_{k,l+i−1}),

E_{Γ|Θ^(t),Y}[h(B_Γ)] = h({Y_{k,l} : 1 ≤ k ≤ K, 1 ≤ l ≤ L_k}) − ∑_{i=1}^{w} E_{Γ|Θ^(t),Y}[h(M^{(i)}_Γ)].

In the M-step, one simply computes

θ_i^(t+1) = E_{Γ|Θ^(t),Y}[h(M^{(i)}_Γ)]/K and θ_0^(t+1) = E_{Γ|Θ^(t),Y}[h(B_Γ)]/∑_{k=1}^{K}(L_k − w).

It is necessary to start with a nonzero initial weight matrix Θ^(0) so as to guarantee that P(Y_k | Γ_{k,l} = 1, Θ^(t)) > 0 for all l. At convergence the algorithm yields both the MLE Θ̂ and predictive probabilities for candidate TFBS locations, that is, P(Γ_{k,l} = 1 | Θ̂, Y).

Cardon and Stormo (1992) generalized the above simple model to accommodate insertions of variable lengths in the middle of a binding site. To overcome the restriction that each sequence contains exactly one motif site, Bailey and Elkan (1994, 1995a, 1995b) introduced a parameter p_0 describing the prior probability for each sequence position to be the start of a motif site, and designed a modified EM algorithm called the Multiple EM for Motif Elicitation (MEME). Independently, Liu, Neuwald and Lawrence (1995) presented a full Bayesian framework and Gibbs sampling algorithm for this problem. Compared with the EM approach, the Markov chain Monte Carlo (MCMC)-based approach has the advantages of making more flexible moves during the iteration and incorporating additional information such as motif location and orientation preference in the model.

The generalizations in Bailey and Elkan (1994) and Liu, Neuwald and Lawrence (1995) assume that all overlapping subsequences of length w in the sequence data set are from a finite mixture model. More precisely, each subsequence of length w is treated as an independent sample from a mixture of PM(θ_1, ..., θ_w) and PM(θ_0, ..., θ_0) [independent Multinomial(θ_0) in all w positions]. The EM solution of this mixture model formulation then leads to the MEME algorithm of Bailey and Elkan (1994). To deal with the situation that w may not be known precisely, MEME searches for motifs over a range of widths separately, and then performs model selection by optimizing a heuristic function based on the maximum likelihood ratio test. Since its release, MEME has been one of the most popular motif discovery tools cited in the literature. A Google Scholar search gives a count of 1397 citations as of August 30th, 2009. Although it is 15 years old, its performance is still comparable to that of many new algorithms (Tompa et al., 2005).
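The E- and M-steps of the one-site-per-sequence model translate almost line for line into code. Below is a self-contained sketch (our illustration; names like `wkl` mirror the w_{k,l} of the text) that also records the marginal log-likelihood at each sweep, so the EM monotonicity property can be checked directly:

```python
import math
import random

BASES = "ACGT"

def motif_em(seqs, w, n_iter=50):
    """Sketch of the one-site-per-sequence motif EM described above
    (Lawrence and Reilly, 1990). Returns the fitted weight matrix
    theta[0..w] (row 0 = background) and the log-likelihood trace."""
    rng = random.Random(1)
    # Theta^(0): uniform background, slightly perturbed motif columns
    theta = [[0.25] * 4 for _ in range(w + 1)]
    for i in range(1, w + 1):
        col = [rng.random() + 1.0 for _ in range(4)]
        s = sum(col)
        theta[i] = [c / s for c in col]
    lls = []
    for _ in range(n_iter):
        exp_motif = [[0.0] * 4 for _ in range(w)]  # E[h(M^(i)_Gamma)]
        total = [0.0] * 4                          # h(all letters)
        ll = 0.0
        for seq in seqs:
            Lp = len(seq) - w + 1  # number of candidate start positions
            # E-step: w_{k,l} proportional to P(Y_k | Gamma_{k,l}=1, Theta)
            wkl = []
            for l in range(Lp):
                p = 1.0
                for j in range(w):
                    b = BASES.index(seq[l + j])
                    p *= theta[j + 1][b] / theta[0][b]
                wkl.append(p)
            W = sum(wkl)
            log_bg = sum(math.log(theta[0][BASES.index(c)]) for c in seq)
            ll += math.log(W / Lp) + log_bg  # marginal log-likelihood term
            for l, wl in enumerate(wkl):
                for j in range(w):
                    exp_motif[j][BASES.index(seq[l + j])] += wl / W
            for c in seq:
                total[BASES.index(c)] += 1
        lls.append(ll)
        # E[h(B_Gamma)]: all letters minus the expected motif letters
        bg = [total[b] - sum(col[b] for col in exp_motif) for b in range(4)]
        # M-step: normalize the expected counts
        for j in range(w):
            s = sum(exp_motif[j])
            theta[j + 1] = [c / s for c in exp_motif[j]]
        theta[0] = [c / sum(bg) for c in bg]
    return theta, lls

# Toy data: the 6-mer "TATAAT" planted once in each random sequence
rng = random.Random(7)
seqs = []
for _ in range(20):
    s = [rng.choice(BASES) for _ in range(30)]
    pos = rng.randrange(30 - 6 + 1)
    s[pos:pos + 6] = "TATAAT"
    seqs.append("".join(s))
theta, lls = motif_em(seqs, 6)
```

Because the M-step here is the exact complete-data MLE, the recorded log-likelihood trace is nondecreasing, as guaranteed for EM.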

3. MULTIPLE SEQUENCE ALIGNMENT

Multiple sequence alignment (MSA) is an important tool for studying structures, functions and the evolution of proteins. Because different parts of a protein may have different functions, they are subject to different selection pressures during evolution. Regions of greater functional or structural importance are generally more conserved than other regions. Thus, a good alignment of protein sequences can yield important evidence about their functional and structural properties.

Many heuristic methods have been proposed to solve the MSA problem. A popular approach is the progressive alignment method (Feng and Doolittle, 1987), in which the MSA is built up by aligning the most closely related sequences first and then adding more distant sequences successively. Many alignment programs are based on this strategy, such as MULTALIGN (Barton and Sternberg, 1987), MULTAL (Taylor, 1988) and, the most influential one, ClustalW (Thompson, Higgins and Gibson, 1994). Usually, a guide tree based on pairwise similarities between the protein sequences is constructed prior to the multiple alignment to determine the order in which sequences enter the alignment. Recently, a few new progressive alignment algorithms with significantly improved alignment accuracy and speed have been proposed, including T-Coffee (Notredame, Higgins and Heringa (2000)), MAFFT (Katoh et al., 2005), PROBCONS (Do et al., 2005) and MUSCLE (Edgar, 2004a, 2004b). They differ from previous approaches and from each other mainly in the construction of the guide tree and in the objective function for judging the goodness of the alignment. Batzoglou (2005) and Wallace, Blackshields and Higgins (2005) reviewed these algorithms.

An important breakthrough in solving the MSA problem is the introduction of a probabilistic generative model, the profile hidden Markov model, by Krogh et al. (1994). The profile HMM postulates that the N observed sequences are generated as independent but indirect observations (emissions) from the Markov chain model illustrated in Figure 2. The underlying unobserved Markov chain consists of three types of states: match, insertion and deletion. Each match or insertion state emits a letter chosen from the alphabet R (size d = 20 for proteins) according to a multinomial distribution. The deletion state does not emit any letter, but makes the sequence generating process skip one or more match states. A multiple alignment of the N sequences is produced by aligning the letters that are emitted from the same match state.

Fig. 2. Profile hidden Markov model. A modified toy example is adopted from Eddy (1998). It shows the alignment of five sequences, each containing only three to five letters. The first position is enriched with Cysteine (C), the fourth position is enriched with Histidine (H), and the fifth position is enriched with Phenylalanine (F) and Tyrosine (Y). The third sequence has a deletion at the fourth position, and the fourth sequence has an insertion at the third position. This simplified model does not allow insertion and deletion states to follow each other.

Let Γ_i denote the unobserved state path through which the ith sequence is generated from the profile HMM, and S the set of all states. Let Θ denote the set of all global parameters of this model, including the emission probabilities e_{lr} (l ∈ S, r ∈ R) in match and insertion states, and the transition probabilities t_{ab} (a, b ∈ S) among all hidden states. The complete-data log-likelihood function can be written as

log P(Y, Γ | Θ) = ∑_{i=1}^{N} [log P(Y_i | Γ_i, Θ) + log P(Γ_i | Θ)]
= ∑_{i=1}^{N} [∑_{l∈S, r∈R} M_{lr}(Γ_i) log e_{lr} + ∑_{a,b∈S} N_{ab}(Γ_i) log t_{ab}],

where M_{lr}(Γ_i) is the count of letter r in sequence Y_i that is generated from state l according to Γ_i, and N_{ab}(Γ_i) is the count of state transitions from a to b in the path Γ_i for sequence Y_i.

The E-step involves calculating the expected counts of emissions and transitions, that is, E[M_{lr}(Γ_i)|Θ^(t)] and E[N_{ab}(Γ_i)|Θ^(t)], averaging over all possible generating paths Γ_i. The Q-function is

Q(Θ|Θ^(t)) = ∑_{i=1}^{N} ∑_{Γ_i} [P(Γ_i, Y_i | Θ^(t))/P(Y_i | Θ^(t))] · [∑_{l∈S, r∈R} log(e_{lr}) M_{lr}(Γ_i) + ∑_{a,b∈S} log(t_{ab}) N_{ab}(Γ_i)].

A brute-force enumeration of all paths is prohibitively expensive in computation. Fortunately, one can apply a forward–backward dynamic programming technique to compute the expectations for each sequence and then sum them all up.

In the M-step, the emission and transition probabilities are updated as the ratio of the expected event occurrences (sufficient statistics) divided by the total expected emission or transition events:

e_{lr}^(t+1) = [∑_i m_{lr}(Y_i)/P(Y_i|Θ^(t))] / [∑_i m_l(Y_i)/P(Y_i|Θ^(t))],

t_{ab}^(t+1) = [∑_i n_{ab}(Y_i)/P(Y_i|Θ^(t))] / [∑_i n_a(Y_i)/P(Y_i|Θ^(t))],

where

m_{lr}(Y_i) = ∑_{Γ_i} M_{lr}(Γ_i) P(Γ_i, Y_i|Θ^(t)),
n_{ab}(Y_i) = ∑_{Γ_i} N_{ab}(Γ_i) P(Γ_i, Y_i|Θ^(t)),
m_l(Y_i) = ∑_{r∈R} m_{lr}(Y_i),  n_a(Y_i) = ∑_{b∈S} n_{ab}(Y_i).
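The forward–backward recursion that makes this E-step tractable can be sketched for a plain discrete HMM (the profile HMM of the text additionally has silent deletion states, which require a modified recursion; all numbers below are hypothetical toy values):

```python
def forward_backward(obs, states, start, trans, emit):
    """Forward-backward recursion for a discrete HMM: computes the
    sequence likelihood P(obs) and the posterior state marginals that
    feed the expected counts of the E-step. Generic-HMM sketch only."""
    n = len(obs)
    # Forward pass: f[t][s] = P(obs[:t+1], state_t = s)
    f = [{s: start[s] * emit[s][obs[0]] for s in states}]
    for t in range(1, n):
        f.append({s: emit[s][obs[t]] * sum(f[t - 1][a] * trans[a][s]
                                           for a in states) for s in states})
    likelihood = sum(f[n - 1][s] for s in states)
    # Backward pass: b[t][s] = P(obs[t+1:] | state_t = s)
    b = [dict() for _ in range(n)]
    b[n - 1] = {s: 1.0 for s in states}
    for t in range(n - 2, -1, -1):
        b[t] = {s: sum(trans[s][a] * emit[a][obs[t + 1]] * b[t + 1][a]
                       for a in states) for s in states}
    # Posterior marginals P(state_t = s | obs): the E-step weights
    post = [{s: f[t][s] * b[t][s] / likelihood for s in states}
            for t in range(n)]
    return likelihood, post

# Toy two-state example with hypothetical parameters
states = ["M", "I"]
start = {"M": 0.8, "I": 0.2}
trans = {"M": {"M": 0.7, "I": 0.3}, "I": {"M": 0.4, "I": 0.6}}
emit = {"M": {"A": 0.9, "C": 0.1}, "I": {"A": 0.2, "C": 0.8}}
lik, post = forward_backward("ACA", states, start, trans, emit)
```

The recursion sums over all state paths in O(n|S|^2) time, replacing the brute-force enumeration mentioned above, whose cost grows exponentially in the sequence length.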

This method is called the Baum–Welch algorithm (Baum et al., 1970), and is mathematically equivalent to the EM algorithm. Conditional on the MLE of Θ, the best alignment path for each sequence can be found efficiently by the Viterbi algorithm (see Durbin et al., 1998, Chapter 5, for details).

The profile HMM provides a rigorous statistical modeling and inference framework for the MSA problem. It has also played a central role in advancing the understanding of protein families and domains. A protein family database, Pfam (Finn et al., 2006), has been built using profile HMMs and has served as an essential source of data in the field of protein structure and function research. Currently there are two popular software packages that use profile HMMs to detect remote protein homologies: HMMER (Eddy, 1998) and SAM (Hughey and Krogh, 1996; Karplus, Barrett and Hughey, 1999). Madera and Gough (2002) gave a comparison of these two packages.

There are several challenges in fitting the profile HMM. First, the size of the model (the number of match, insertion and deletion states) needs to be determined before model fitting. It is common to begin fitting a profile HMM by setting the number of match states equal to the average sequence length. Afterward, a strategy called "model surgery" (Krogh et al., 1994) can be applied to adjust the model size (by adding or removing a match state depending on whether an insertion or a deletion is used too often). Eddy (1998) used a maximum a posteriori (MAP) strategy to determine the model size in HMMER. In this method the number of match states is given a prior distribution, which is equivalent to adding a penalty term to the log-likelihood function.

Second, the number of sequences is sometimes too small for parameter estimation. When calculating the conditional expectation of the sufficient statistics, which are counts of residues at each state and of state transitions, there may not be enough data, resulting in zero counts that could make the estimation unstable. To avoid the occurrence of zero counts, pseudo-counts can be added. This is equivalent to using a Dirichlet prior for the multinomial parameters in a Bayesian formulation.

Third, the assumption of sequence independence is often violated. Due to the (unknown) underlying evolutionary relationships, some of the sequences may share much higher mutual similarities than others. Therefore, treating all sequences as i.i.d. samples may cause serious biases in parameter estimation. One possible solution is to give each sequence a weight according to its importance. For example, if two sequences are identical, it is reasonable to give each of them half the weight of the other sequences. The weights can be easily integrated into the M-step of the EM algorithm to update the model parameters: when a sequence has a weight of 0.5, all the emission and transition events contributed by this sequence are counted by half. Many methods have been proposed to assign weights to the sequences (Durbin et al., 1998), but it is not clear how to set the weights in a principled way to best account for sequence dependency.

Last, since the EM algorithm can only find local modes of the likelihood function, some stochastic perturbation can be introduced to help find better modes and improve the alignment. Starting from multiple random initial parameters is strongly recommended. Krogh et al. (1994) combined simulated annealing with Baum–Welch and showed some improvement. Baldi and Chauvin (1994) developed a generalized EM (GEM) algorithm using a gradient ascent calculation in an attempt to infer HMM parameters in a smoother way.

Despite many advantages of the profile HMM, it is no longer the mainstream MSA tool. A main reason is that the model has too many free parameters, which render the parameter estimation very unstable when there are not enough sequences (fewer than 50, say) in the alignment. In addition, the vanilla EM algorithm and its variations developed by early researchers for the MSA problem almost always converge to suboptimal alignments. Recently, Edlefsen (2009) developed an ECM algorithm for MSA that appears to have much improved convergence properties. It is also difficult for the profile HMM to incorporate other kinds of information, such as 3D protein structure and a guide tree. Some recent programs such as 3D-Coffee (O'Sullivan et al., 2004) and MAFFT are more flexible, as they can incorporate this information into the objective function and optimize it. We believe that Monte Carlo-based Bayesian approaches, which can impose more model constraints (e.g., to capitalize on the "motif" concept) and make more flexible MCMC moves, might be a promising route to rescue the profile HMM (see Liu, Neuwald and Lawrence, 1995; Neuwald and Liu, 2004).

4. COMPARATIVE GENOMICS

A main goal of comparative genomics is to identify and characterize functionally important regions in the genomes of multiple species. An assumption underlying such studies is that, because of functional constraints and evolutionary pressure, functional regions in the genome evolve much more slowly than most nonfunctional regions (Wolfe, Sharp and Li, 1989; Boffelli et al., 2003). Regions that evolve more slowly than the background are called evolutionarily conserved elements.

Conservation analysis (comparing genomes of related species) is a powerful tool for identifying functional elements such as protein/RNA coding regions and transcriptional regulatory elements. It begins with an alignment of multiple orthologous sequences (sequences evolved from the same common ancestral sequence) and a conservation score for each column of the alignment. The scores are calculated based on the likelihood that each column is located in a conserved element. The phylogenetic hidden Markov model (Phylo-HMM) was introduced to infer the conserved regions in the genome (Yang, 1995; Felsenstein and Churchill, 1996; Siepel et al., 2005). The statistical power of Phylo-HMM has been systematically studied by Fan et al. (2007). Siepel et al. (2005) used the EM algorithm for estimating parameters in Phylo-HMM. Their results, provided by the UCSC genome browser database (Karolchik et al., 2003), are very influential in the computational biology community. By August 2009, the paper of Siepel et al. (2005) had been cited 413 times according to the Web of Science database.

As shown in Figure 3, the alignment modeled by Phylo-HMM can be seen as generated in two steps. First, a sequence of L sites is generated from a two-state HMM, with the hidden states being conserved or nonconserved sites. Second, a nucleotide is generated for each site of the common ancestral sequence and evolved to the contemporary nucleotides along all branches of a phylogenetic tree independently, according to the corresponding phylogenetic model.

Let µ and ν be the transition probabilities between the two states, and let the phylogenetic models for the nonconserved and conserved states be ψn = (Q, π, τ, β) and ψc = (Q, π, τ, ρβ), respectively. Here π is the emission probability vector of the four nucleotides (A, C, G and T) in the common ancestral sequence x0; τ is the tree topology of the corresponding phylogeny; β is a vector of non-negative real numbers representing the branch lengths of the tree, which are measured in the expected number of substitutions per site. The difference between the two states is characterized by a scaling parameter ρ ∈ [0,1) applied to the branch lengths of only the conserved state, which means fewer substitutions. The nucleotide substitution model considers a descendant nucleotide to have evolved from its ancestor by a continuous-time, time-homogeneous Markov process with transition kernel Q, also called the substitution rate matrix (Tavare, 1986). The transition kernels for all branches are assumed to be the same. Many parametric forms are available for the 4-by-4 nucleotide substitution rate matrix Q, such as the Jukes–Cantor substitution matrix and the general time-reversible substitution matrix (Yang, 1997). The nucleotide transition probability matrix for a branch of length β_i is e^{β_i Q}.

Siepel et al. (2005) assumed that the tree topology τ and the emission probability vector π are known. In this case, the observed alignment Y = (y_{1·}, y_{2·}, y_{3·}, y_{4·}) is a matrix of nucleotides. The parameter of interest is Θ = (µ, ν, Q, ρ, β). The missing information Γ = (z, X) includes the state sequence z and the ancestral DNA sequences X. The complete-data likelihood is written as

P(\mathbf{Y}, \Gamma \mid \Theta) = b_{z_1} P(\mathbf{y}_{\cdot 1}, \mathbf{x}_{\cdot 1} \mid \psi_{z_1}) \prod_{i=2}^{L} a_{z_{i-1} z_i} P(\mathbf{y}_{\cdot i}, \mathbf{x}_{\cdot i} \mid \psi_{z_i}).

Here y_{·i} is the ith column of the alignment Y, z_i ∈ {c, n} is the hidden state of the ith column, (b_c, b_n) = (ν/(µ+ν), µ/(µ+ν)) is the initial state probability of the HMM if the chain is stationary, and a_{z_{i−1} z_i} is the transition probability (as illustrated in Figure 3).

Fig. 3. Two-state Phylo-HMM. (A) Phylogenetic tree: the tree shows the evolutionary relationship of four contemporary sequences (y_{1·}, y_{2·}, y_{3·}, y_{4·}). They evolved from the common ancestral sequence x_{0·}, with two additional internal nodes (ancestors), x_{1·} and x_{2·}. The branch lengths β = (β0, β1, β2, β3, β4, β5) indicate the evolutionary distances between nodes, measured in the expected number of substitutions per site. (B) HMM state-transition diagram: the system consists of a state for conserved sites and a state for nonconserved sites (c and n, respectively). The two states are associated with different phylogenetic models (ψc and ψn), which differ by a scaling parameter ρ. (C) An illustrative alignment generated by this model: a state sequence z is generated according to µ and ν. For each site in the state sequence, a nucleotide is generated for the root node of the phylogenetic tree and then for subsequent child nodes according to the phylogenetic model (ψc or ψn). The observed alignment Y = (y_{1·}, y_{2·}, y_{3·}, y_{4·}) is composed of all nucleotides in the leaf nodes. The state sequence z and all ancestral sequences X = (x_{0·}, x_{1·}, x_{2·}) are unobserved.

The EM algorithm is applied to obtain the MLE of Θ. In the E-step, we calculate the expectation of the complete-data log-likelihood under the distribution P(z, X | Θ^{(t)}, Y). The marginalization of X, conditional on z and other variables, can be accomplished efficiently site-by-site using the peeling or pruning algorithm for the phylogenetic tree (Felsenstein, 1981). The marginalization of z can be done efficiently by the forward–backward procedure for HMMs (Baum et al., 1970; Rabiner, 1989). For the M-step, we can use the Broyden–Fletcher–Goldfarb–Shanno (BFGS) quasi-Newton algorithm. After we obtain the MLE of Θ, a forward–backward dynamic programming method (Liu, 2001) can then be used to compute the posterior probability that a given hidden state is conserved, that is, P(z_i = c | Θ, Y), which is the desired conservation score.
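As an illustration of the substitution model used in Phylo-HMM, the branch transition matrix e^{βQ} can be computed numerically for any rate matrix; for the Jukes–Cantor matrix it also has a simple closed form, which the toy sketch below (an assumption, not code from the paper) uses as a consistency check. The rate matrix is scaled so that β is the expected number of substitutions per site.

```python
import math

def jc_rate_matrix():
    """Jukes-Cantor rate matrix Q, scaled so a branch of length beta
    corresponds to beta expected substitutions per site."""
    return [[-1.0 if i == j else 1.0 / 3.0 for j in range(4)] for i in range(4)]

def expm(Q, beta, terms=60):
    """e^{beta*Q} by a plain Taylor series (adequate for a small 4x4 matrix)."""
    n = len(Q)
    A = [[beta * Q[i][j] for j in range(n)] for i in range(n)]
    P = [[float(i == j) for j in range(n)] for i in range(n)]  # identity
    term = [row[:] for row in P]
    for k in range(1, terms):
        # term becomes A^k / k!
        term = [[sum(term[i][m] * A[m][j] for m in range(n)) / k
                 for j in range(n)] for i in range(n)]
        P = [[P[i][j] + term[i][j] for j in range(n)] for i in range(n)]
    return P

# For Jukes-Cantor the diagonal of e^{beta*Q} has the closed form
# 1/4 + (3/4) * exp(-4*beta/3), a handy consistency check.
beta = 0.2
P = expm(jc_rate_matrix(), beta)
```

In production phylogenetics software, the exponential is usually obtained from the eigendecomposition of Q rather than a Taylor series; the series is used here only to keep the sketch self-contained.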

As shown in the Phylo-HMM example, the phylogenetic tree model is key to integrating multiple sequences for evolutionary analysis. This model is also used for comparing protein or RNA sequences. Due to its intuitive and efficient handling of the missing evolutionary history, the EM algorithm has always been a main approach for estimating parameters of the tree. For example, Felsenstein (1981) used the EM algorithm to estimate the branch lengths β, Bruno (1996) and Holmes and Rubin (2002) used the EM algorithm to estimate the residue usage π and the substitution rate matrix Q, Friedman et al. (2002) used an extension of the EM algorithm to estimate the phylogenetic tree topology τ, and Holmes (2005) used the EM algorithm for estimating insertion and deletion rates. Yang (1997) implemented some of the above algorithms in the phylogenetic analysis software PAML. A limitation of the Phylo-HMM model is its assumption of a good multiple sequence alignment, which is often not available.

5. SNP HAPLOTYPE INFERENCE

A single nucleotide polymorphism (SNP) is a DNA sequence variation, occurring in at least 1% of the population, in which a single base is altered. For example, the DNA fragments CCTGAGGAG and CCTGTGGAG from two homologous chromosomes (the paired chromosomes of the same individual, one from each parent) differ at a single locus. This example is an actual SNP in the human β-globin gene, and it is associated with sickle-cell disease. The different forms (A and T in this example) of a SNP are called alleles. Most SNPs have only two alleles in the population. Diploid organisms, such as humans, have two homologous copies of each chromosome. Thus, the genotype (i.e., the specific allelic makeup) of an individual may be AA, TT or AT in this example. A phenotype is a morphological feature of the organism controlled or affected by a genotype. Different genotypes may produce the same phenotype. In this example, individuals with genotype TT have a very high risk of sickle-cell disease. A haplotype is a combination of alleles at multiple SNP loci that are transmitted together on the same chromosome. In other words, haplotypes are sets of phased genotypes. An example is given in Figure 4, which shows the genotypes of three individuals at four SNP loci. For the first individual, the arrangement of its alleles on the two chromosomes must be ACAC and ACGC, which are the haplotypes compatible with its observed genotype data.


One of the main tasks of genetic studies is to locate genetic variants (mainly SNPs) that are associated with inheritable diseases. If we know the haplotypes of all related individuals, it will be easier to rebuild the evolutionary history and locate the disease mutations. Unfortunately, the phase information needed to build haplotypes from genotype information is usually unavailable because laboratory haplotyping methods, unlike genotyping technologies, are expensive and low-throughput.

The use of the EM algorithm has a long history in population genetics, some of which predates Dempster, Laird and Rubin (1977). For example, Ceppellini, Siniscalco and Smith (1955) invented an EM algorithm to estimate allele frequencies when there is no one-to-one correspondence between phenotype and genotype; Smith (1957) used an EM algorithm to estimate the recombination frequency; and Ott (1979) used an EM algorithm to study genotype–phenotype relationships from pedigree data. Weeks and Lange (1989) reformulated these earlier applications in the modern EM framework of Dempster, Laird and Rubin (1977). Most early works were single-SNP association studies. Thompson (1984) and Lander and Green (1987) designed EM algorithms for joint linkage analysis of three or more SNPs. With the accumulation of SNP data, more and more researchers have come to realize the importance of haplotype analysis (Liu et al., 2001). Haplotype reconstruction based on genotype data has therefore become a very important intermediate step in disease association studies.

Fig. 4. Haplotype reconstruction. We observed the genotypes of three individuals at 4 SNP loci. The 1st and 3rd individuals each have a unique haplotype phase, whereas the 2nd individual has two compatible haplotype phases. We pool all possible haplotypes together and associate with them a haplotype frequency vector (θ1, . . . , θ6). Each individual's two haplotypes are then assumed to be random draws (with replacement) from this pool of weighted haplotypes.

The haplotype reconstruction problem is illustrated in Figure 4. Suppose we observed the genotype data Y = (Y_1, . . . , Y_n) for n individuals, and we wish to predict the corresponding haplotypes Γ = (Γ_1, . . . , Γ_n), where Γ_i = (Γ_i^+, Γ_i^−) is the haplotype pair of the ith individual. The haplotype pair Γ_i is said to be compatible with the genotype Y_i, which is expressed as Γ_i^+ ⊕ Γ_i^− = Y_i, if the genotype Y_i can be generated from the haplotype pair. Let H = (H_1, . . . , H_m) be the pool of all distinct haplotypes and let Θ = (θ_1, . . . , θ_m) be the corresponding frequencies in the population.

The first simple model considered in the literature assumes that each individual's genotype vector is generated by two haplotypes chosen independently from the pool with probability vector Θ. This is a very good model if the region spanned by the markers under consideration is sufficiently short that no recombination has occurred, and if mating in the population is random. Under this model, we have

P(\mathbf{Y} \mid \Theta) = \prod_{i=1}^{n} \Biggl(\sum_{(j,k):\, H_j \oplus H_k = Y_i} \theta_j \theta_k\Biggr).

If Γ is known, we can directly write down the MLE of Θ as \hat{\theta}_j = n_j/(2n), where the sufficient statistic n_j is the number of occurrences of haplotype H_j in Γ. Therefore, in the EM framework, we simply replace n_j by its expected value over the distribution of Γ when Γ is unobserved. More specifically, the EM algorithm is a simple iteration of

\theta_j^{(t+1)} = \frac{E_{\Gamma \mid \Theta^{(t)}, \mathbf{Y}}(n_j)}{2n},

where Θ^{(t)} is the current estimate of the haplotype frequencies, and n_j is the count of haplotypes H_j that exist in Y.

The use of the EM algorithm for haplotype analysis has been coupled with the large-scale generation of SNP data. Early attempts include Excoffier and Slatkin (1995), Long, Williams and Urbanek (1995), Hawley and Kidd (1995) and Chiano and Clayton (1998). One problem with these traditional EM approaches is that the computational complexity of the E-step grows exponentially as the number of SNPs in the haplotype increases. Qin, Niu and Liu (2002) incorporated a "partition–ligation" strategy into the EM algorithm in an effort to surpass this limitation. Lu, Niu and Liu (2003) used the EM for haplotype analysis in the scenario of case–control studies. Kang et al. (2004) extended the traditional EM haplotype inference algorithm by incorporating genotype uncertainty. Niu (2004) gave a review of general algorithms for haplotype reconstruction.
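A minimal sketch of this haplotype-frequency EM might look as follows (illustrative only; real implementations such as partition–ligation handle far larger marker sets). Each genotype entry is coded as the number of minor alleles (0, 1 or 2), and each haplotype as a binary tuple.

```python
from itertools import product

def compatible_pairs(genotype):
    """All ordered haplotype pairs (h1, h2) with h1 + h2 == genotype,
    where each genotype entry counts minor alleles (0, 1 or 2)."""
    pairs = []
    for h1 in product((0, 1), repeat=len(genotype)):
        h2 = tuple(g - a for g, a in zip(genotype, h1))
        if all(v in (0, 1) for v in h2):
            pairs.append((h1, h2))
    return pairs

def em_haplotypes(genotypes, iters=50):
    """EM for haplotype frequencies theta under random mating."""
    pool = sorted({h for g in genotypes
                     for pair in compatible_pairs(g) for h in pair})
    theta = {h: 1.0 / len(pool) for h in pool}
    n = len(genotypes)
    for _ in range(iters):
        counts = {h: 0.0 for h in pool}          # expected counts E(n_j)
        for g in genotypes:
            pairs = compatible_pairs(g)
            probs = [theta[h1] * theta[h2] for h1, h2 in pairs]
            total = sum(probs)
            for (h1, h2), p in zip(pairs, probs):
                counts[h1] += p / total          # E-step: posterior weights
                counts[h2] += p / total
        theta = {h: c / (2 * n) for h, c in counts.items()}  # M-step
    return theta
```

The E-step enumerates all compatible phases per individual, which is exactly the exponential blow-up that the partition–ligation strategy mentioned above was designed to avoid.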


6. FINITE MIXTURE CLUSTERING FOR MICROARRAY DATA

In cluster analysis one seeks to partition observed data into groups such that coherence within each group and separation between groups are maximized jointly. Although this goal is subjectively defined (depending on how one defines "coherence" and "separation"), clustering can serve as an initial exploratory analysis for high-dimensional data. One example in computational biology is microarray data analysis. Microarrays are used to measure the mRNA expression levels of thousands of genes at the same time. Microarray data are usually displayed as a matrix Y. The rows of Y represent the genes in a study and the columns are arrays obtained under different experimental conditions, in different stages of a biological system or from different biological samples. Cluster analysis of microarray data has been a hot research field because groups of genes that share similar expression patterns (clustering the rows of Y) are often involved in the same or related biological functions, and groups of samples having similar gene expression profiles (clustering the columns of Y) are often indicative of the relatedness of these samples (e.g., the same cancer type).

Finite mixture models have long been used in cluster analysis (see Fraley and Raftery, 2002, for a review). The observations are assumed to be generated from a finite mixture of distributions. The likelihood of a mixture model with K components can be written as

P(\mathbf{Y} \mid \theta_1, \ldots, \theta_K; \tau_1, \ldots, \tau_K) = \prod_{i=1}^{n} \sum_{k=1}^{K} \tau_k f_k(Y_i \mid \theta_k),

where f_k is the density function of the kth component in the mixture, θ_k are the corresponding parameters, and τ_k is the probability that an observed datum is generated from this component model (τ_k ≥ 0, ∑_k τ_k = 1). One of the most commonly used finite mixture models is the Gaussian mixture model, in which θ_k is composed of the mean µ_k and covariance matrix Σ_k. Outliers can be accommodated by a special component in the mixture that allows for a larger variance or extreme values.

A standard way to simplify the statistical computation with mixture models is to introduce a variable indicating which component an observation Y_i was generated from. Thus, the "complete data" can be expressed as X_i = (Y_i, Γ_i), where Γ_i = (γ_{i1}, . . . , γ_{iK}), and γ_{ik} = 1 if Y_i is generated by the kth component and γ_{ik} = 0 otherwise. The complete-data log-likelihood function is

\log P(\mathbf{Y}, \Gamma \mid \theta_1, \ldots, \theta_K; \tau_1, \ldots, \tau_K) = \sum_{i=1}^{n} \sum_{k=1}^{K} \gamma_{ik} \log[\tau_k f_k(Y_i \mid \theta_k)].

Since the complete-data log-likelihood function is linear in the γ_{ik}'s, in the E-step we only need to compute

\tilde{\gamma}_{ik} \equiv E(\gamma_{ik} \mid \Theta^{(t)}, \mathbf{Y}) = \frac{\tau_k^{(t)} f_k(Y_i \mid \theta_k^{(t)})}{\sum_{j=1}^{K} \tau_j^{(t)} f_j(Y_i \mid \theta_j^{(t)})}.

The Q-function can be calculated as

Q(\Theta \mid \Theta^{(t)}) = \sum_{i=1}^{n} \sum_{k=1}^{K} \tilde{\gamma}_{ik} \log[\tau_k f_k(Y_i \mid \theta_k)].  (2)

The M-step updates the component probability τ_k as

\tau_k^{(t+1)} = \frac{1}{n} \sum_{i=1}^{n} \tilde{\gamma}_{ik},

and the updating of θ_k would depend on the density function. In Gaussian mixture models, the Q-function is quadratic in the mean vector and can be maximized to achieve the M-step.

Yeung et al. (2001) are among the pioneers who applied the model-based clustering method in microarray data analysis. They adopted the Gaussian mixture model framework and represented the covariance matrix in terms of its eigenvalue decomposition

\Sigma_k = \lambda_k D_k A_k D_k^{T}.

In this way, the orientation, shape and volume of the multivariate normal distribution for each cluster can be modeled separately by the eigenvector matrix D_k, the eigenvalue matrix A_k and the scalar λ_k, respectively. Simplified models are straightforward under this general model setting, such as setting λ_k, D_k or A_k to be identical for all clusters or restricting the covariance matrices to take some special forms (e.g., Σ_k = λ_k I). Yeung and colleagues used the EM algorithm to estimate the model parameters. To improve convergence, the EM algorithm can be initialized with a model-based hierarchical clustering step (Dasgupta and Raftery, 1998).
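For a concrete instance of these E- and M-steps, here is a toy one-dimensional EM for a K-component Gaussian mixture (a sketch under simplifying assumptions, scalar data and quantile initialization, not the model-based clustering software discussed above):

```python
import math

def normal_pdf(y, mu, var):
    return math.exp(-(y - mu) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

def em_gaussian_mixture(ys, K=2, iters=100):
    """EM for a 1-D K-component Gaussian mixture."""
    n = len(ys)
    s = sorted(ys)
    mu = [s[(2 * k + 1) * n // (2 * K)] for k in range(K)]  # quantile init
    var = [1.0] * K
    tau = [1.0 / K] * K
    for _ in range(iters):
        # E-step: posterior membership probabilities gamma[i][k]
        gamma = []
        for y in ys:
            w = [tau[k] * normal_pdf(y, mu[k], var[k]) for k in range(K)]
            total = sum(w)
            gamma.append([wk / total for wk in w])
        # M-step: weighted updates of tau, mu and var
        for k in range(K):
            nk = sum(g[k] for g in gamma)
            tau[k] = nk / n
            mu[k] = sum(g[k] * y for g, y in zip(gamma, ys)) / nk
            var[k] = sum(g[k] * (y - mu[k]) ** 2 for g, y in zip(gamma, ys)) / nk
            var[k] = max(var[k], 1e-6)  # guard against degenerate clusters
    return tau, mu, var
```

The variance floor plays the role of the stabilizing devices (priors, constrained covariance structures) discussed in the surrounding text; without it a component can collapse onto a single point.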


When Y_i has some dimensions that are highly correlated, it can be helpful to project the data onto a lower-dimensional subspace. For example, McLachlan, Bean and Peel (2002) attempted to cluster tissue samples instead of genes. Each tissue sample is represented as a vector of length equal to the number of genes, which can be up to several thousand. Factor analysis (Ghahramani and Hinton, 1997) can be used to reduce the dimensionality, and can be seen as a Gaussian model with a special constraint on the covariance matrix. In their study, McLachlan, Bean and Peel used a mixture of factor analyzers, equivalent to a Gaussian mixture model but with fewer free parameters to estimate because of the constraints. A variant of the EM algorithm, the Alternating Expectation–Conditional Maximization (AECM) algorithm (Meng and van Dyk, 1997), was applied to fit this mixture model.

Many microarray data sets are composed of several arrays in a series of time points so as to study biological system dynamics and regulatory networks (e.g., cell cycle studies). It is advantageous to model the gene expression profile by taking into account the smoothness of these time series. Ji et al. (2004) clustered time course microarray data using a mixture of HMMs. Bar-Joseph et al. (2002) and Luan and Li (2003) implemented mixture models with spline components. The time-course expression data were treated as samples from a continuous smooth process. The coefficients of the spline bases can be either fixed effects, random effects or a mixture of both to accommodate different modeling needs. Ma et al. (2006) improved upon these methods by adding a gene-specific effect into the model:

y_{ij} = \mu_k(t_{ij}) + b_i + \varepsilon_{ij},

where µ_k(t) is the mean expression of cluster k at time t, composed of smoothing spline components; b_i ∼ N(0, σ²_{bk}) explains the gene-specific deviation from the cluster mean; and ε_{ij} ∼ N(0, σ²) is the measurement error. The Q-function in this case is a weighted version of the penalized log-likelihood:

-\sum_{k=1}^{K} \Biggl\{ \sum_{i=1}^{n} \tilde{\gamma}_{ik} \Biggl( \sum_{j=1}^{T} \frac{(y_{ij} - \mu_k(t_{ij}) - b_i)^2}{2\sigma^2} + \frac{b_i^2}{2\sigma_{bk}^2} \Biggr) - \lambda_k \int_0^T [\mu_k''(t)]^2 \, dt \Biggr\},  (3)

where the integral is the smoothness penalty term. A generalized cross-validation method was applied to choose the values of σ²_{bk} and λ_k.

An interesting variation on the EM algorithm, the rejection-controlled EM (RCEM), was introduced in Ma et al. (2006) to reduce the computational complexity of the EM algorithm for mixture models. In all mixture models, the E-step computes the membership probabilities (weights) for each gene to belong to each cluster, and the M-step maximizes a weighted sum function as in Luan and Li (2003). To reduce the computational burden of the M-step, we can "throw away" some terms with very small weights in an unbiased manner using the rejection control method (Liu, Chen and Wong, 1998). More precisely, a threshold c (e.g., c = 0.05) is chosen. Then, the new weights are computed as

\gamma^{*}_{ik} = \begin{cases} \max\{\tilde{\gamma}_{ik}, c\}, & \text{with probability } \min\{1, \tilde{\gamma}_{ik}/c\}, \\ 0, & \text{otherwise.} \end{cases}

The new weight γ*_{ik} then replaces the old weight γ_{ik} in the Q-function calculation in (2) in general, and in (3) more specifically. For cluster k, genes with a membership probability higher than c are not affected, while the membership probabilities of the other genes will be set to c or 0, with probabilities γ_{ik}/c and 1 − γ_{ik}/c, respectively. By giving a zero weight to many genes with low γ_{ik}/c, the number of terms to be summed in the Q-function is greatly reduced.

In many ways finite mixture models are similar to the K-means algorithm, and the two may produce very similar clustering results. However, finite mixture models are more flexible in the sense that the inferred clusters do not necessarily have a spherical shape, and the shapes of the clusters can be learned from the data. Researchers such as Suresh, Dinakaran and Valarmathie (2009) have tried to combine the two ways of thinking to build better clustering algorithms.

For cluster analysis, one intriguing question is how to set the total number of clusters. The Bayesian information criterion (BIC) is often used to determine the number of clusters (Yeung et al., 2001; Fraley and Raftery, 2002; Ma et al., 2006). A random subsampling approach was suggested by Dudoit, Fridlyand and Speed (2002) for the same purpose. When external information about genes or samples is available, cross-validation can be used to determine the number of clusters.
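The rejection-control step described above is easy to sketch; the helper below is a hypothetical illustration (not the authors' code) that prunes small membership weights unbiasedly at threshold c:

```python
import random

def rejection_control(weights, c=0.05, rng=random):
    """RCEM-style pruning: a weight w < c is kept (raised to c) with
    probability w/c and zeroed otherwise, so E[new weight] = w; weights
    at or above c are left unchanged."""
    out = []
    for w in weights:
        if w >= c:
            out.append(w)
        elif rng.random() < w / c:
            out.append(c)
        else:
            out.append(0.0)
    return out
```

Because most genes in a large cluster analysis carry near-zero membership in most clusters, zeroing them drastically shortens the weighted sums in the M-step while leaving the update unbiased in expectation.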

7. TRENDS TOWARD INTEGRATION

Biological systems are generally too complex to be fully characterized by a snapshot from a single viewpoint. Modern high-throughput experimental techniques have been used to collect massive amounts of data to interrogate biological systems from various angles and under diverse conditions. For instance, biologists have collected many types of genomic data, including microarray gene expression data, genomic sequence data, ChIP–chip binding data and protein–protein interaction data. Coupled with this trend, there is a growing interest in computational methods for integrating multiple sources of information in an effort to gain a deeper understanding of biological systems and to overcome the limitations of divided approaches. For example, the Phylo-HMM in Section 4 takes as input an alignment of multiple sequences, which, as shown in Section 3, is a hard problem by itself. On the other hand, the construction of the alignment can be improved considerably if we know the underlying phylogeny. It is therefore preferable to infer the multiple alignment and the phylogenetic tree jointly (Lunter et al., 2005).

Hierarchical modeling is a principled way of integrating multiple data sets or multiple analysis steps. Because of the complexity of the problems, the inclusion of nuisance parameters or missing data at some level of the hierarchical model is usually either structurally inevitable or conceptually preferable. The EM algorithm and Markov chain Monte Carlo algorithms are often the methods of choice for these models due to their close connection with the underlying statistical model and the missing-data structure.

For example, EM algorithms have been used to combine motif discovery with evolutionary information. The underlying logic is that motif sites such as TFBSs evolve more slowly than the surrounding genomic sequences (the background) because of functional constraints and natural selection. Moses, Chiang and Eisen (2004) developed EMnEM (Expectation–Maximization on Evolutionary Mixtures), which is a generalization of the mixture model formulation for motif discovery (Bailey and Elkan, 1994). More precisely, they treat an alignment of multiple orthologous sequences as a series of alignments of length w, each of which is a sample from the mixture of a motif model and a background model. All observed sequences are assumed to evolve from a common ancestor sequence according to an evolutionary process parameterized by a Jukes–Cantor substitution matrix. PhyME (Sinha, Blanchette and Tompa, 2004) is another EM approach for motif discovery in orthologous sequences. Instead of modeling the common ancestor, they modeled one designated "reference species" using a two-state HMM (motif state or background state). Only the well-aligned part of the reference sequence was assumed to share a common evolutionary origin with the other species. PhyME assumes a symmetric star topology instead of a binary phylogenetic tree for the evolutionary process. OrthoMEME (Prakash et al., 2004) deals with pairs of orthologous sequences and is a natural extension of the EM algorithm of Lawrence and Reilly (1990) described in Section 2.

Steps have also been taken to incorporate microarray gene expression data into motif discovery (Bussemaker, Li and Siggia, 2001; Conlon et al., 2003). Kundaje et al. (2005) used a graphical model and the EM algorithm to combine DNA sequence data with time-series expression data for gene clustering. The basic logic is that co-regulated genes should show both similar TFBS occurrences in their upstream sequences and similar gene-expression time-series curves. The graphical model assumes that TFBS occurrence and gene expression are independent, conditional on the co-regulation cluster assignment. Based on predicted TFBSs in promoter regions and cell-cycle time-series gene-expression data on budding yeast, this algorithm infers model parameters by integrating out the latent variables for cluster assignment. In a similar setting, Chen and Blanchette (2007) used a Bayesian network and an EM-like algorithm to integrate TFBS information, TF expression data and target gene expression data to identify the combinations of motifs that are responsible for tissue-specific expression. The relationships among the different data are modeled by the connections of different nodes in the Bayesian network. Wang et al. (2005) used a mixture model to describe the joint probability of TFBS and target gene expression data. Using the EM algorithm, they provide a refined representation of the TFBS and calculate the probability that each gene is a true target.

As we show in this review, the EM algorithm has enjoyed many applications in computational biology. This is partly driven by the need for complex statistical models to describe biological knowledge and data. The missing-data formulation of the EM algorithm addresses many computational biology problems naturally. The efficiency of a specific EM algorithm depends on how efficiently we can integrate out the unobserved variables (missing data/nuisance parameters) in the E-step and how complex the optimization problem is in the M-step. Special dependence structures can often be imposed on the unobserved variables to greatly ease the computational burden of the E-step. For example, the computation is simple if the latent variables are independent in the conditional posterior distribution, as in the mixture motif example in Section 2 and the haplotype example in Section 5. Efficient exact calculation may also be available for structured latent variables, such as the forward–backward procedure for HMMs (Baum et al., 1970), the pruning algorithm for phylogenetic trees (Felsenstein, 1981) and the inside–outside algorithm for the probabilistic context-free grammar used in predicting RNA secondary structures (Eddy and Durbin, 1994). As one of the drawbacks of the EM algorithm, the M-step can sometimes be too complicated to compute directly, as in the Phylo-HMM example in Section 4 and the smoothing spline mixture model in Section 6, in which cases innovative numerical tricks are called for.

ACKNOWLEDGMENTS

We thank Paul T. Edlefsen for helpful discussions about the profile hidden Markov model, as well as Yves Chretien for polishing the language. This research is supported in part by NIH Grant R01-HG02518-02 and NSF Grant DMS-07-06989. The first two authors should be regarded as joint first authors.

REFERENCES

Bailey, T. L. and Elkan, C. (1994). Fitting a mixture model by expectation maximization to discover motifs in biopolymers. Proc. Int. Conf. Intell. Syst. Mol. Biol. 2 28–36.

Bailey, T. L. and Elkan, C. (1995a). Unsupervised learning of multiple motifs in biopolymers using EM. Machine Learning 21 51–58.

Bailey, T. L. and Elkan, C. (1995b). The value of prior knowledge in discovering motifs with MEME. Proc. Int. Conf. Intell. Syst. Mol. Biol. 3 21–29.

Baldi, P. and Chauvin, Y. (1994). Smooth on-line learning algorithms for hidden Markov models. Neural Computation 6 305–316.

Bar-Joseph, Z., Gerber, G., Gifford, D., Jaakkola, T. and Simon, I. (2002). A new approach to analyzing gene expression time series data. In Proc. Sixth Ann. Inter. Conf. Comp. Biol. 39–48. ACM Press, New York.

Barton, G. and Sternberg, M. (1987). A strategy for the rapid multiple alignment of protein sequences. J. Mol. Biol. 198 327–337.

Batzoglou, S. (2005). The many faces of sequence alignment. Briefings in Bioinformatics 6 6–22.

Baum, L. E., Petrie, T., Soules, G. and Weiss, N. (1970). A maximization technique occurring in the statistical analysis of probabilistic functions of Markov chains. Ann. Math. Statist. 41 164–171. MR0287613

Boffelli, D., McAuliffe, J., Ovcharenko, D., Lewis, K. D., Ovcharenko, I., Pachter, L. and Rubin, E. M. (2003). Phylogenetic shadowing of primate sequences to find functional regions of the human genome. Science 299 1391–1394.

Bruno, W. (1996). Modeling residue usage in aligned protein sequences via maximum likelihood. Mol. Biol. Evol. 13 1368–1374.

Bussemaker, H. J., Li, H. and Siggia, E. D. (2001). Regulatory element detection using correlation with expression. Nature Genetics 27 167–171.

Cardon, L. R. and Stormo, G. D. (1992). Expectation maximization algorithm for identifying protein-binding sites with variable lengths from unaligned DNA fragments. J. Mol. Biol. 223 159–170.

Ceppellini, R., Siniscalco, M. and Smith, C. A. B. (1955). The estimation of gene frequencies in a random-mating population. Annals of Human Genetics 20 97–115. MR0075523

Chen, X. and Blanchette, M. (2007). Prediction of tissue-specific cis-regulatory modules using Bayesian networks and regression trees. BMC Bioinformatics 8 (Suppl 10) S2.

Chiano, M. N. and Clayton, D. G. (1998). Fine genetic mapping using haplotype analysis and the missing data problem. Annals of Human Genetics 62 55–60.

Churchill, G. A. (1989). Stochastic models for heterogeneous DNA sequences. Bull. Math. Biol. 51 79–94. MR0978904

Conlon, E. M., Liu, X. S., Lieb, J. D. and Liu, J. S. (2003). Integrating regulatory motif discovery and genome-wide expression analysis. Proc. Natl. Acad. Sci. USA 100 3339–3344.

Dasgupta, A. and Raftery, A. (1998). Detecting features in spatial point processes with clutter via model-based clustering. J. Amer. Statist. Assoc. 93 294–302.

Dempster, A., Laird, N. and Rubin, D. (1977). Maximum likelihood from incomplete data via the EM algorithm. J. Roy. Statist. Soc. Ser. B 39 1–38. MR0501537

Deng, M., Mehta, S., Sun, F. and Chen, T. (2002). Inferring domain–domain interactions from protein–protein interactions. Genome Res. 12 1540–1548.

Do, C. B., Mahabhashyam, M. S. P., Brudno, M. and Batzoglou, S. (2005). ProbCons: Probabilistic consistency-based multiple sequence alignment. Genome Res. 15 330–340.

Dudoit, S., Fridlyand, J. and Speed, T. (2002). Comparison of discrimination methods for the classification of tumors using gene expression data. J. Amer. Statist. Assoc. 97 77–87. MR1963389

Durbin, R., Eddy, S., Krogh, A. and Mitchison, G. (1998). Biological Sequence Analysis: Probabilistic Models of Proteins and Nucleic Acids. Cambridge Univ. Press, Cambridge.

Eddy, S. R. (1998). Profile hidden Markov models. Bioinformatics 14 755–763.

Eddy, S. R. and Durbin, R. (1994). RNA sequence analysis using covariance models. Nucleic Acids Res. 22 2079–2088.

Edgar, R. (2004a). MUSCLE: A multiple sequence alignment method with reduced time and space complexity. BMC Bioinformatics 5 113.


Edgar, R. (2004b). MUSCLE: Multiple sequence alignment with high accuracy and high throughput. Nucleic Acids Res. 32 1792–1797.

Edlefsen, P. T. (2009). Conditional Baum–Welch, dynamic model surgery, and the three Poisson Dempster–Shafer model. Ph.D. thesis, Dept. Statistics, Harvard Univ.

Excoffier, L. and Slatkin, M. (1995). Maximum-likelihood estimation of molecular haplotype frequencies in a diploid population. Mol. Biol. Evol. 12 921–927.

Fan, X., Zhu, J., Schadt, E. and Liu, J. (2007). Statistical power of phylo-HMM for evolutionarily conserved element detection. BMC Bioinformatics 8 374.

Felsenstein, J. (1981). Evolutionary trees from DNA sequences: A maximum likelihood approach. J. Mol. Evol. 17 368–376.

Felsenstein, J. and Churchill, G. A. (1996). A hidden Markov model approach to variation among sites in rate of evolution. Mol. Biol. Evol. 13 93–104.

Feng, D. and Doolittle, R. (1987). Progressive sequence alignment as a prerequisite to correct phylogenetic trees. J. Mol. Evol. 25 351–360.

Finn, R., Mistry, J., Schuster-Bockler, B., Griffiths-Jones, S., Hollich, V., Lassmann, T., Moxon, S., Marshall, M., Khanna, A., Durbin, R., Eddy, S., Sonnhammer, E. and Bateman, A. (2006). Pfam: Clans, web tools and services. Nucleic Acids Res. Database Issue 34 D247–D251.

Fraley, C. and Raftery, A. E. (2002). Model-based clustering, discriminant analysis, and density estimation. J. Amer. Statist. Assoc. 97 611–631. MR1951635

Friedman, N., Ninio, M., Pe’er, I. and Pupko, T. (2002). A structural EM algorithm for phylogenetic inference. J. Comput. Biol. 9 331–353.

Geman, S. and Geman, D. (1984). Stochastic relaxation, Gibbs distribution, and the Bayesian restoration of images. IEEE Trans. Pattern Anal. Mach. Intell. 6 721–741.

Ghahramani, Z. and Hinton, G. E. (1997). The EM algorithm for factor analyzers. Technical Report CRG-TR-96-1, Univ. Toronto, Toronto.

Hampson, S., Kibler, D. and Baldi, P. (2002). Distribution patterns of over-represented k-mers in non-coding yeast DNA. Bioinformatics 18 513–528.

Hastings, W. K. (1970). Monte Carlo sampling methods using Markov chains and their applications. Biometrika 57 97–109.

Haussler, D., Krogh, A., Mian, I. S. and Sjolander, K. (1993). Protein modeling using hidden Markov models: Analysis of globins. In Proc. Hawaii Inter. Conf. Sys. Sci. 792–802. IEEE Computer Society Press, Los Alamitos, CA.

Hawley, M. E. and Kidd, K. K. (1995). HAPLO: A program using the EM algorithm to estimate the frequencies of multi-site haplotypes. Journal of Heredity 86 409–411.

Holmes, I. (2005). Using evolutionary expectation maximization to estimate indel rates. Bioinformatics 21 2294–2300.

Holmes, I. and Rubin, G. M. (2002). An expectation maximization algorithm for training hidden substitution models. J. Mol. Biol. 317 753–764.

Hughey, R. and Krogh, A. (1996). Hidden Markov models for sequence analysis: Extension and analysis of the basic method. Comput. Appl. Biosci. 12 95–107.

Jensen, S. T., Liu, X. S., Zhou, Q. and Liu, J. S. (2004). Computational discovery of gene regulatory binding motifs: A Bayesian perspective. Statist. Sci. 19 188–204. MR2082154

Ji, H. and Wong, W. H. (2006). Computational biology: Toward deciphering gene regulatory information in mammalian genomes. Biometrics 62 645–663. MR2247187

Ji, X., Yuan, Y., Sun, Z. and Li, Y. (2004). HMMGEP: Clustering gene expression data using hidden Markov models. Bioinformatics 20 1799–1800.

Kang, H., Qin, Z. S., Niu, T. and Liu, J. S. (2004). Incorporating genotyping uncertainty in haplotype inference for single-nucleotide polymorphisms. American Journal of Human Genetics 74 495–510.

Karolchik, D., Baertsch, R., Diekhans, M., Furey, T. S., Hinrichs, A., Lu, Y. T., Roskin, K. M., Schwartz, M., Sugnet, C. W., Thomas, D. J., Weber, R. J., Haussler, D. and Kent, W. J. (2003). The UCSC genome browser database. Nucleic Acids Res. 31 51–54.

Karplus, K., Barrett, C. and Hughey, R. (1999). Hidden Markov models for detecting remote protein homologies. Bioinformatics 14 846–856.

Katoh, K., Kuma, K., Toh, H. and Miyata, T. (2005). MAFFT version 5: Improvement in accuracy of multiple sequence alignment. Nucleic Acids Res. 33 511–518.

Krogh, A., Brown, M., Mian, I. S., Sjolander, K. and Haussler, D. (1994). Hidden Markov models in computational biology: Applications to protein modeling. J. Mol. Biol. 235 1501–1531.

Krogh, A., Mian, I. S. and Haussler, D. (1994). A hidden Markov model that finds genes in E. coli DNA. Nucleic Acids Res. 22 4768–4778.

Kundaje, A., Middendorf, M., Gao, F., Wiggins, C. and Leslie, C. (2005). Combining sequence and time series expression data to learn transcriptional modules. IEEE/ACM Trans. Comp. Biol. Bioinfo. 2 194–202.

Lander, E. S. and Green, P. (1987). Construction of multilocus genetic linkage maps in humans. Proc. Natl. Acad. Sci. USA 84 2363–2367.

Lawrence, C. E. and Reilly, A. A. (1990). An expectation maximization (EM) algorithm for the identification and characterization of common sites in unaligned biopolymer sequences. Proteins 7 41–51.

Lawrence, C. E., Altschul, S. F., Boguski, M. S., Liu, J. S., Neuwald, A. F. and Wootton, J. C. (1993). Detecting subtle sequence signals: A Gibbs sampling strategy for multiple alignment. Science 262 208–214.

Liu, J. S. (2001). Monte Carlo Strategies in Scientific Computing. Springer, New York. MR1842342

Liu, J. S., Chen, R. and Wong, W. H. (1998). Rejection control and sequential importance sampling. J. Amer. Statist. Assoc. 93 1022–1031. MR1649197

Liu, J. S., Neuwald, A. F. and Lawrence, C. E. (1995). Bayesian models for multiple local sequence alignment and Gibbs sampling strategies. J. Amer. Statist. Assoc. 90 1156–1170.

Liu, J. S., Sabatti, C., Teng, J., Keats, B. J. and Risch, N. (2001). Bayesian analysis of haplotypes for linkage disequilibrium mapping. Genome Res. 11 1716–1724.


Liu, X. S., Brutlag, D. L. and Liu, J. S. (2002). An algorithm for finding protein-DNA binding sites with applications to chromatin-immunoprecipitation microarray experiments. Nature Biotechnology 20 835–839.

Long, J. C., Williams, R. C. and Urbanek, M. (1995). An E-M algorithm and testing strategy for multiple-locus haplotypes. American Journal of Human Genetics 56 799–810.

Lu, X., Niu, T. and Liu, J. S. (2003). Haplotype information and linkage disequilibrium mapping for single nucleotide polymorphisms. Genome Res. 13 2112–2117.

Luan, Y. and Li, H. (2003). Clustering of time-course gene expression data using a mixed-effects model with B-splines. Bioinformatics 19 474–482.

Lunter, G., Miklos, I., Drummond, A., Jensen, J. and Hein, J. (2005). Bayesian coestimation of phylogeny and sequence alignment. BMC Bioinformatics 6 83.

Ma, P., Castillo-Davis, C., Zhong, W. and Liu, J. (2006). A data-driven clustering method for time course gene expression data. Nucleic Acids Res. 34 1261–1269.

Madera, M. and Gough, J. (2002). A comparison of profile hidden Markov model procedures for remote homology detection. Nucleic Acids Res. 30 4321–4328.

McKendrick, A. G. (1926). Applications of mathematics to medical problems. Proceedings of the Edinburgh Mathematical Society 44 98–130.

McLachlan, G. J., Bean, R. W. and Peel, D. (2002). A mixture model-based approach to the clustering of microarray expression data. Bioinformatics 18 413–422.

Meng, X. and van Dyk, D. (1997). The EM algorithm—An old folk song sung to a fast new tune (with discussion). J. Roy. Statist. Soc. Ser. B 59 511–567. MR1452025

Meng, X.-L. (1997). The EM algorithm and medical studies: A historical link. Statistical Methods in Medical Research 6 3–23.

Meng, X.-L. and Pedlow, S. (1992). EM: A bibliographic review with missing articles. In Proc. Stat. Comp. Sec. 24–27. Amer. Statist. Assoc., Washington, DC.

Metropolis, N., Rosenbluth, A., Rosenbluth, M., Teller, A. and Teller, E. (1953). Equation of state calculations by fast computing machines. Journal of Chemical Physics 21 1087–1092.

Metropolis, N. and Ulam, S. (1949). The Monte Carlo method. J. Amer. Statist. Assoc. 44 335–341. MR0031341

Moses, A., Chiang, D. and Eisen, M. (2004). Phylogenetic motif detection by expectation–maximization on evolutionary mixtures. In Pacific Symposium on Biocomputing 324–335. World Scientific, Singapore.

Neuwald, A. and Liu, J. (2004). Gapped alignment of protein sequence motifs through Monte Carlo optimization of a hidden Markov model. BMC Bioinformatics 5 157.

Niu, T. (2004). Algorithms for inferring haplotypes. Genetic Epidemiology 27 334–347.

Notredame, C., Higgins, D. and Heringa, J. (2000). T-Coffee: A novel method for fast and accurate multiple sequence alignment. J. Mol. Biol. 302 205–217.

O’Sullivan, O., Suhre, K., Abergel, C., Higgins, D. G. and Notredame, C. (2004). 3DCoffee: Combining protein sequences and structures within multiple sequence alignments. J. Mol. Biol. 340 385–395.

Ott, J. (1979). Maximum likelihood estimation by counting methods under polygenic and mixed models in human pedigrees. American Journal of Human Genetics 31 161–175.

Pavesi, G., Mereghetti, P., Mauri, G. and Pesole, G. (2004). Weeder Web: Discovery of transcription factor binding sites in a set of sequences from co-regulated genes. Nucleic Acids Res. 32 W199–W203.

Prakash, A., Blanchette, M., Sinha, S. and Tompa, M. (2004). Motif discovery in heterogeneous sequence data. In Pacific Symposium on Biocomputing 348–359. World Scientific, Singapore.

Qin, Z. S., Niu, T. and Liu, J. S. (2002). Partition–ligation–expectation–maximization algorithm for haplotype inference with single-nucleotide polymorphisms. American Journal of Human Genetics 71 1242–1247.

Rabiner, L. R. (1989). A tutorial on hidden Markov models and selected applications in speech recognition. Proceedings of the IEEE 77 257–286.

Siepel, A., Bejerano, G., Pedersen, J. S., Hinrichs, A. S., Hou, M., Rosenbloom, K., Clawson, H., Spieth, J., Hillier, L. W., Richards, S., Weinstock, G. M., Wilson, R. K., Gibbs, R. A., Kent, W. J., Miller, W. and Haussler, D. (2005). Evolutionarily conserved elements in vertebrate, insect, worm, and yeast genomes. Genome Res. 15 1034–1050.

Sinha, S. and Tompa, M. (2002). Discovery of novel transcription factor binding sites by statistical overrepresentation. Nucleic Acids Res. 30 5549–5560.

Sinha, S., Blanchette, M. and Tompa, M. (2004). PhyME: A probabilistic algorithm for finding motifs in sets of orthologous sequences. BMC Bioinformatics 5 170.

Smith, C. A. B. (1957). Counting methods in genetical statistics. Annals of Human Genetics 35 254–276. MR0088408

Stormo, G. D. and Hartzell, G. W. I. (1989). Identifying protein-binding sites from unaligned DNA fragments. Proc. Natl. Acad. Sci. USA 86 1183–1187.

Suresh, R. M., Dinakaran, K. and Valarmathie, P. (2009). Model based modified K-means clustering for microarray data. In International Conference on Information Management and Engineering 271–273. IEEE Computer Society, Los Alamitos, CA.

Tanner, M. A. and Wong, W. H. (1987). The calculation of posterior distributions by data augmentation (with discussion). J. Amer. Statist. Assoc. 82 528–540. MR0898357

Tavare, S. (1986). Some probabilistic and statistical problems in the analysis of DNA sequences. In Some Mathematical Questions in Biology—DNA Sequence Analysis (New York, 1984). Lectures on Mathematics in the Life Sciences 17 57–86. Amer. Math. Soc., Providence, RI. MR0846877

Taylor, W. (1988). A flexible method to align large numbers of biological sequences. J. Mol. Evol. 28 161–169.

Thompson, E. A. (1984). Information gain in joint linkage analysis. Math. Med. Biol. 1 31–49.

Thompson, J., Higgins, D. and Gibson, T. (1994). CLUSTAL W: Improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice. Nucleic Acids Res. 22 4673–4680.


Tompa, M., Li, N., Bailey, T. L., Church, G. M., De Moor, B., Eskin, E., Favorov, A. V., Frith, M. C., Fu, Y., Kent, W. J., Makeev, V. J., Mironov, A. A., Noble, W. S., Pavesi, G., Pesole, G., Regnier, M., Simonis, N., Sinha, S., Thijs, G., van Helden, J., Vandenbogaert, M., Weng, Z., Workman, C., Ye, C. and Zhu, Z. (2005). Assessing computational tools for the discovery of transcription factor binding sites. Nature Biotechnology 23 137–144.

Wallace, I. M., Blackshields, G. and Higgins, D. G. (2005). Multiple sequence alignments. Current Opinion in Structural Biology 15 261–266.

Wang, W., Cherry, J. M., Nochomovitz, Y., Jolly, E., Botstein, D. and Li, H. (2005). Inference of combinatorial regulation in yeast transcriptional networks: A case study of sporulation. Proc. Natl. Acad. Sci. USA 102 1998–2003.

Weeks, D. E. and Lange, K. (1989). Trials, tribulations, and triumphs of the EM algorithm in pedigree analysis. Math. Med. Biol. 6 209–232. MR1052291

Wolfe, K. H., Sharp, P. M. and Li, W. H. (1989). Mutation rates differ among regions of the mammalian genome. Nature 337 283–285.

Yang, Z. (1995). A space–time process model for the evolution of DNA sequences. Genetics 139 993–1005.

Yang, Z. (1997). PAML: A program package for phylogenetic analysis by maximum likelihood. Comput. Appl. Biosci. 13 555–556.

Yeung, K. Y., Fraley, C., Murua, A., Raftery, A. E. and Ruzzo, W. L. (2001). Model-based clustering and data transformations for gene expression data. Bioinformatics 17 977–987.