Profile HMMs Tandy Warnow BioE/CS 598AGB


Profile HMMs

Tandy Warnow, BioE/CS 598AGB


Profile Hidden Markov Models

• Basic tool in sequence analysis
• Look more complicated than they really are
• Used to model a family of sequences
• Can be built from a multiple sequence alignment
• Algorithms using profile HMMs are based on dynamic programming (much like Needleman-Wunsch)


Profile

• Given a gap-less multiple sequence alignment, we can build a profile describing what we see

• S1 = A C T A G
• S2 = A C A A G
• S3 = A T T T G
• S4 = G T T C G

      1     2     3     4     5
A   0.75  0.00  0.25  0.50  0.00
C   0.00  0.50  0.00  0.25  0.00
T   0.00  0.50  0.75  0.25  0.00
G   0.25  0.00  0.00  0.00  1.00
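The table above can be reproduced with a few lines of code. This is a minimal sketch, assuming the alignment is given as a list of equal-length strings; the function name `build_profile` is made up for illustration.

```python
from collections import Counter

def build_profile(alignment, alphabet="ACTG"):
    """Return one {letter: frequency} dict per column of a gap-free alignment."""
    length = len(alignment[0])
    if any(len(row) != length for row in alignment):
        raise ValueError("all rows of the alignment must have the same length")
    profile = []
    for col in range(length):
        # Count how often each letter appears in this column.
        counts = Counter(row[col] for row in alignment)
        profile.append({a: counts[a] / len(alignment) for a in alphabet})
    return profile

alignment = ["ACTAG", "ACAAG", "ATTTG", "GTTCG"]
profile = build_profile(alignment)
for letter in "ACTG":
    print(letter, [f"{column[letter]:.2f}" for column in profile])
```

Each printed row corresponds to one row of the profile table.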


Using a profile

      1     2     3     4     5
A   0.75  0.00  0.25  0.50  0.00
C   0.00  0.50  0.00  0.25  0.00
T   0.00  0.50  0.75  0.25  0.00
G   0.25  0.00  0.00  0.00  1.00

[Figure: a chain of states Start → 1 → 2 → 3 → 4 → 5 → End. Every transition has probability 1.0, and state i emits A, C, T, G with the probabilities in column i of the profile.]

The profile yields a probability distribution of sequences – here, all of the same length.
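To make the "probability distribution of sequences" concrete, here is a small sketch that scores a sequence as the product of its per-column probabilities and checks that the probabilities of all length-5 sequences sum to 1. The names `PROFILE` and `sequence_probability` are illustrative, not from the slides.

```python
from itertools import product
from math import prod

# The profile from the slide: PROFILE[letter][i] is the probability of
# `letter` at position i + 1.
PROFILE = {
    "A": [0.75, 0.0, 0.25, 0.50, 0.0],
    "C": [0.00, 0.5, 0.00, 0.25, 0.0],
    "T": [0.00, 0.5, 0.75, 0.25, 0.0],
    "G": [0.25, 0.0, 0.00, 0.00, 1.0],
}
LENGTH = 5

def sequence_probability(seq):
    """Probability that the match-only profile generates seq."""
    if len(seq) != LENGTH:
        return 0.0  # this model only generates sequences of length 5
    return prod(PROFILE[letter][i] for i, letter in enumerate(seq))

p_actag = sequence_probability("ACTAG")  # 0.75 * 0.5 * 0.75 * 0.5 * 1.0 = 0.140625
total = sum(sequence_probability("".join(s)) for s in product("ACTG", repeat=LENGTH))
print(p_actag, total)
```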


Adding in insertions

• The profile shown in the previous slide only had *match* states (indicated by rectangles). It doesn’t allow any insertions or deletions.

• To model indels, we just have to add additional states to the graphical model.
– Insertion states: Diamonds (have non-zero emission probabilities)
– Deletion states: Circles (nothing emitted)


From http://www.cbs.dtu.dk/~kj/bioinfo_assign2.html


From http://codecereal.blogspot.com/2011/07/protein-profile-with-hmm.html


Building Profile HMMs

• Profile HMMs can be built from a given multiple sequence alignment – this is not too difficult.

• Profile HMMs can also be built from unaligned sequences. This is a bit complicated.


Using Profile HMMs

• Given a profile HMM computed for a multiple sequence alignment, you can use it to
– Recognize related sequences
– Add related sequences into the multiple sequence alignment
• See PFAM, http://pfam.xfam.org, for how profile HMMs are used to represent groups of functionally and structurally related proteins.


Using profile HMMs to align sequences

• Given an MSA for a set S of representative sequences for a gene and a set X of additional sequences (query sequences) for the gene:
– You build a profile HMM for the MSA on S.
– For each s in X, you find the path through the profile HMM that is most likely to generate s.
– The path specifies how to add the sequence into the MSA (only the match states count).
– Transitivity gives you the final MSA after you add in all the other sequences.
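The "most likely path" step is a Viterbi dynamic program. The sketch below is a toy version under stated assumptions: match emissions come from the earlier four-sequence example with a pseudocount of 1, insert emissions are uniform, and the transition probabilities are made-up values that do not depend on the position. Start is treated as match state 0 and the move into End as a match-type transition. None of these numbers come from the slides, and real tools such as HMMER estimate them from data.

```python
import math

ALPHABET = "ACTG"
ALIGNMENT = ["ACTAG", "ACAAG", "ATTTG", "GTTCG"]
K = len(ALIGNMENT[0])  # number of match columns

# Match emissions: column frequencies with a pseudocount of 1 (assumption,
# so that no emission probability is exactly zero). MATCH[0] is unused.
MATCH = [None] + [
    {a: (sum(row[k] == a for row in ALIGNMENT) + 1) / (len(ALIGNMENT) + len(ALPHABET))
     for a in ALPHABET}
    for k in range(K)
]
INSERT = {a: 0.25 for a in ALPHABET}  # uniform insert emissions (assumption)
# Toy, position-independent transition probabilities (assumption).
TRANS = {"M": {"M": 0.90, "I": 0.05, "D": 0.05},
         "I": {"M": 0.60, "I": 0.30, "D": 0.10},
         "D": {"M": 0.60, "I": 0.10, "D": 0.30}}

def viterbi(seq):
    """Return (log-probability, state path) of the most likely path for seq."""
    n = len(seq)
    # score[s][k][i]: best log-probability of a path ending in state s of
    # column k after emitting seq[:i]; back[s][k][i] is the previous state type.
    score = {s: [[-math.inf] * (n + 1) for _ in range(K + 1)] for s in "MID"}
    back = {s: [[None] * (n + 1) for _ in range(K + 1)] for s in "MID"}
    score["M"][0][0] = 0.0  # Start

    def best(k, i, to):
        # Best predecessor state at cell (k, i) for a transition into `to`.
        return max((score[s][k][i] + math.log(TRANS[s][to]), s) for s in "MID")

    for k in range(K + 1):
        for i in range(n + 1):
            if k > 0 and i > 0:  # match: consumes a column and a letter
                value, back["M"][k][i] = best(k - 1, i - 1, "M")
                score["M"][k][i] = value + math.log(MATCH[k][seq[i - 1]])
            if i > 0:            # insert: consumes a letter only
                value, back["I"][k][i] = best(k, i - 1, "I")
                score["I"][k][i] = value + math.log(INSERT[seq[i - 1]])
            if k > 0:            # delete: consumes a column only
                score["D"][k][i], back["D"][k][i] = best(k - 1, i, "D")

    logp, state = best(K, n, "M")  # final transition into End
    path, k, i = [], K, n
    while (state, k, i) != ("M", 0, 0):
        path.append(f"{state}{k}")
        prev = back[state][k][i]
        if state == "M":
            k, i = k - 1, i - 1
        elif state == "I":
            i -= 1
        else:
            k -= 1
        state = prev
    return logp, path[::-1]

def aligned_row(seq, path):
    """Row to add to the MSA: matched letters, '-' for deletions, insertions dropped."""
    row, i = [], 0
    for state in path:
        if state[0] == "M":
            row.append(seq[i])
            i += 1
        elif state[0] == "I":
            i += 1
        else:
            row.append("-")
    return "".join(row)

logp, path = viterbi("ACTAG")
print(path, aligned_row("ACTAG", path))
```

With these toy parameters the query ACTAG follows the five match states, so it is added to the alignment unchanged. A shorter query such as ACTG must use at least one deletion state, which shows up as a gap in its row.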


From http://codecereal.blogspot.com/2011/07/protein-profile-with-hmm.html


Other uses of profile HMMs

• Given two profile HMMs (H1 and H2), and a sequence s, you can determine which one is more likely to generate s (again, using dynamic programming).

• Note that computing the probability that a profile HMM generates a sequence requires calculating the probability for *every* path and adding up the probabilities. This can be calculated in polynomial time, using dynamic programming.
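Summing over every path is what the forward algorithm does: it is the same recurrence as the most-likely-path computation, with the maximum replaced by a sum. The sketch below compares two toy models on one sequence. The pseudocount, the uniform insert emissions, the transition values and the second alignment are all assumptions chosen for illustration.

```python
ALPHABET = "ACTG"
INSERT = {a: 0.25 for a in ALPHABET}  # uniform insert emissions (assumption)
# Toy, position-independent transition probabilities (assumption).
TRANS = {"M": {"M": 0.90, "I": 0.05, "D": 0.05},
         "I": {"M": 0.60, "I": 0.30, "D": 0.10},
         "D": {"M": 0.60, "I": 0.10, "D": 0.30}}

def match_emissions(alignment):
    """Column frequencies with a pseudocount of 1; index 0 (Start) is unused."""
    n_rows, n_cols = len(alignment), len(alignment[0])
    return [None] + [
        {a: (sum(row[k] == a for row in alignment) + 1) / (n_rows + len(ALPHABET))
         for a in ALPHABET}
        for k in range(n_cols)
    ]

def forward(seq, match):
    """Probability, summed over every path, that the model generates seq."""
    K, n = len(match) - 1, len(seq)
    # f[s][k][i]: total probability of all paths that end in state s of
    # column k after emitting seq[:i].
    f = {s: [[0.0] * (n + 1) for _ in range(K + 1)] for s in "MID"}
    f["M"][0][0] = 1.0  # Start is treated as match state 0

    def into(k, i, to):
        # Probability mass flowing from cell (k, i) into a state of type `to`.
        return sum(f[s][k][i] * TRANS[s][to] for s in "MID")

    for k in range(K + 1):
        for i in range(n + 1):
            if k > 0 and i > 0:
                f["M"][k][i] = match[k][seq[i - 1]] * into(k - 1, i - 1, "M")
            if i > 0:
                f["I"][k][i] = INSERT[seq[i - 1]] * into(k, i - 1, "I")
            if k > 0:
                f["D"][k][i] = into(k - 1, i, "D")
    return into(K, n, "M")  # final transition into End

H1 = match_emissions(["ACTAG", "ACAAG", "ATTTG", "GTTCG"])
H2 = match_emissions(["GGGCC", "GGGCC", "GGGCA", "GGGCC"])  # hypothetical second family
s = "ACTAG"
p1, p2 = forward(s, H1), forward(s, H2)
more_likely = "H1" if p1 > p2 else "H2"
print(p1, p2, more_likely)
```

The table has three entries per (column, prefix) pair and each entry is filled in constant time, so the running time is proportional to the number of columns times the sequence length. For long sequences a real implementation would work with logarithms or rescaling to avoid underflow.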


Applications of profile HMMs

• Recognizing membership in protein families (see PFAM http://pfam.xfam.org)

• Progressive multiple sequence alignment methods, such as SATCHMO (Edgar and Sjolander, Bioinformatics 2003)

• Phylogenetic placement (e.g., SEPP, Mirarab et al., PSB 2012)

• Taxonomic identification of metagenomic data (e.g., TIPP, Nguyen et al., Bioinformatics 2014)


Another application: UPP

• UPP (Nguyen et al., RECOMB 2015) is a method for ultra-large multiple sequence alignment:
– Up to 1,000,000 sequences
– Robust to fragmentary sequences

• Nam-phuong Nguyen (IGB postdoctoral fellow) will present this method on Tuesday.


HMMER

• http://hmmer.janelia.org
• One of the most popular collections of tools to perform analyses based on profile HMMs.
• HMMER web server: interactive sequence similarity searching, NAR 2011, http://nar.oxfordjournals.org/content/39/suppl_2/W29


Supertree Estimation

• Purposes:
– Divide-and-conquer tree estimation
– Combining analyses performed by other research groups
– Tree of Life


Supertree Estimation

Challenges:
• Tree compatibility is NP-complete (therefore, even if subtrees are correct, supertree estimation is hard)
• Estimated subtrees have error

Advantages:
• Estimating individual gene trees can be computationally feasible (compared to the combined analysis of many genes)

• Can use different types of data for each gene


Many Supertree Methods

• MRP
• weighted MRP
• MRF
• MRD
• Robinson-Foulds Supertrees
• Min-Cut
• Modified Min-Cut
• Semi-strict Supertree
• QMC
• Q-imputation
• SDM
• PhySIC
• Majority-Rule Supertrees
• Maximum Likelihood Supertrees
• and many more ...

MRP: Matrix Representation with Parsimony (most commonly used and most accurate)


MRP and MRL

• MRP (Matrix Representation with Parsimony)
• MRL (Matrix Representation with Likelihood)

– The input set of source trees is represented by a binary matrix with ?s for missing data

– Each edge in each source tree gives you one column in the matrix (based on the bipartition it defines on its leafset)

– Then you analyze the matrix using parsimony or CFN maximum likelihood
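The matrix encoding can be sketched as follows. The assumptions here are mine: each source tree is written as nested tuples of leaf names (a rooted drawing of an unrooted tree), only internal edges are encoded, and the side of each bipartition that does not contain the alphabetically first leaf is coded as 1. Which side gets the 1 is arbitrary for unrooted trees. The function names are hypothetical.

```python
def leaf_set(tree):
    """All leaf names below a node of a nested-tuple tree."""
    if isinstance(tree, str):
        return frozenset([tree])
    return frozenset().union(*(leaf_set(child) for child in tree))

def bipartitions(tree):
    """Nontrivial bipartitions of the tree, each stored as the side without the anchor leaf."""
    leaves = leaf_set(tree)
    anchor = min(leaves)
    found = set()

    def visit(node):
        if isinstance(node, str):
            return
        side = leaf_set(node)
        if anchor in side:
            side = leaves - side
        # Skip the root and edges that separate fewer than two leaves.
        if 2 <= len(side) <= len(leaves) - 2:
            found.add(side)
        for child in node:
            visit(child)

    visit(tree)
    return found

def mrp_matrix(trees):
    """Map each taxon to its row of the MRP matrix ('1', '0', or '?' per column)."""
    taxa = sorted(frozenset().union(*(leaf_set(t) for t in trees)))
    rows = {taxon: [] for taxon in taxa}
    for tree in trees:
        leaves = leaf_set(tree)
        for side in sorted(bipartitions(tree), key=sorted):
            for taxon in taxa:
                if taxon not in leaves:
                    rows[taxon].append("?")  # taxon missing from this source tree
                else:
                    rows[taxon].append("1" if taxon in side else "0")
    return {taxon: "".join(chars) for taxon, chars in rows.items()}

t1 = (("a", "b"), ("c", ("d", "e")))
t2 = (("a", "b"), ("c", "f"))
matrix = mrp_matrix([t1, t2])
for taxon, row in matrix.items():
    print(taxon, row)
```

The resulting rows would then be handed to a parsimony or CFN maximum likelihood program, which is not shown here.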


Single gene vs. multi-gene analyses

• Most methods analyze single genes (or other genomic regions). These produce estimated “gene trees”.

• But species trees are estimated using multiple genes.


Not all genes present in all species

gene 1
S1  TCTAATGGAA
S2  GCTAAGGGAA
S3  TCTAAGGGAA
S4  TCTAACGGAA
S7  TCTAATGGAC
S8  TATAACGGAA

gene 2
S4  GGTAACCCTC
S5  GCTAAACCTC
S6  GGTGACCATC
S7  GCTAAACCTC

gene 3
S1  TATTGATACA
S3  TCTTGATACC
S4  TAGTGATGCA
S7  CATTCATACC
S8  TAGTGATGCA


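Two competing approaches

[Figure: sequence data for gene 1, gene 2, ..., gene k across the species. One approach analyzes each gene separately and then applies a supertree method; the other is a combined analysis of all genes.]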


FN rate of MRP vs. combined analysis

[Figure: FN rate plotted against Scaffold Density (%).]


SuperFine-boosting: improves accuracy of MRP

[Figure: results plotted against Scaffold Density (%).]

(Swenson et al., Syst. Biol. 2012)


SuperFine

• First, construct a supertree with low false positives: the Strict Consensus Merger

• Then, refine the tree to reduce false negatives by resolving each polytomy using a “base” supertree method (e.g., MRP or Quartet Max Cut)


Obtaining a supertree with low FP

The Strict Consensus Merger (SCM)

SCM of two trees
• Computes the strict consensus on the common leaf set
• Then superimposes the two trees, contracting more edges in the presence of “collisions”
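The first of these two steps can be sketched in code. This is only the strict consensus on the shared leaves, not the full SCM: the superimposition step and the handling of collisions are left out. Trees are assumed to be nested tuples of leaf names, the two trees are assumed to share at least one leaf, and the function names are hypothetical.

```python
def leaf_set(tree):
    """All leaf names below a node of a nested-tuple tree."""
    if isinstance(tree, str):
        return frozenset([tree])
    return frozenset().union(*(leaf_set(child) for child in tree))

def restricted_bipartitions(tree, keep):
    """Nontrivial bipartitions of `tree` after restricting it to the leaves in `keep`."""
    anchor = min(keep)
    found = set()

    def visit(node):
        if isinstance(node, str):
            return
        side = leaf_set(node) & keep
        if anchor in side:
            side = keep - side  # always store the side without the anchor leaf
        if 2 <= len(side) <= len(keep) - 2:
            found.add(side)
        for child in node:
            visit(child)

    visit(tree)
    return found

def strict_consensus_on_common_leaves(t1, t2):
    """Return the common leaf set and the bipartitions both restricted trees share."""
    common = leaf_set(t1) & leaf_set(t2)
    shared = restricted_bipartitions(t1, common) & restricted_bipartitions(t2, common)
    return common, shared

t1 = ((("a", "b"), "c"), ("d", "e"))
t2 = ((("a", "b"), "f"), ("c", "d"))
t3 = (("a", "c"), ("b", "d"))
common, backbone = strict_consensus_on_common_leaves(t1, t2)
_, conflict = strict_consensus_on_common_leaves(t1, t3)
print(sorted(common), [sorted(side) for side in backbone], conflict)
```

In the first pair both trees agree on ab|cd once restricted to a, b, c, d, so that edge survives. In the second pair the trees disagree and the consensus is an unresolved star, which is why SCM trees tend to contain polytomies.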


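Strict Consensus Merger (SCM)

[Figure: two source trees, one on leaves a, b, c, d, e, f, g and one on leaves a, b, c, d, h, i, j; their strict consensus on the common leaves a, b, c, d; and the merged tree on leaves a through j.]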


Theoretical results for SCM

• SCM can be computed in polynomial time

• For certain types of inputs, the SCM method solves the NP-hard “Tree Compatibility” problem

• All splits in the SCM “appear” in at least one source tree (and are not contradicted by any source tree)


Performance of SCM

• Low false positive (FP) rate (estimated supertree has few false edges)

• High false negative (FN) rate (estimated supertree is missing many true edges)
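These two rates can be computed by comparing the internal edges (bipartitions) of the true tree with those of the estimated tree. The sketch below uses one common normalization, which is an assumption here: FN is divided by the number of internal edges in the true tree, FP by the number in the estimated tree. The example trees are invented.

```python
def error_rates(true_splits, estimated_splits):
    """Return (FN rate, FP rate) for two collections of bipartitions."""
    true_splits, estimated_splits = set(true_splits), set(estimated_splits)
    missing = true_splits - estimated_splits  # true edges the estimate lacks
    false = estimated_splits - true_splits    # estimated edges that are not true
    fn_rate = len(missing) / len(true_splits)
    fp_rate = len(false) / len(estimated_splits) if estimated_splits else 0.0
    return fn_rate, fp_rate

# Bipartitions of a tree on leaves a..g, each stored as the side without leaf a.
true_tree = {frozenset("cdefg"), frozenset("defg"), frozenset("efg"), frozenset("fg")}
# An unresolved, SCM-like estimate: one edge, and it is correct.
scm_like = {frozenset("fg")}
# A fully resolved estimate with one wrong edge (eg instead of fg).
resolved = {frozenset("cdefg"), frozenset("defg"), frozenset("efg"), frozenset("eg")}

print(error_rates(true_tree, scm_like))   # low FP, high FN
print(error_rates(true_tree, resolved))
```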


Part II of SuperFine

• Refine the tree to reduce false negatives by resolving each polytomy using a base supertree method (e.g., MRP)


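Part 1 of SuperFine

[Figure: the Strict Consensus Merger example again: the source tree on leaves a, b, c, d, e, f, g and the source tree on leaves a, b, c, d, h, i, j are merged into an SCM tree on leaves a through j.]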


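Part 2 of SuperFine

[Figure: the SCM tree on leaves a through j with one polytomy. The subtrees around the polytomy are numbered 1 to 6, and the leaves of each source tree are relabelled with the number of the subtree they belong to.]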


Theorem

Given
– a set of source trees,
– SCM tree T,
– and a polytomy in T,

after relabelling and reducing, each source tree has at most one leaf with each label.


Step 2: Apply MRP to the collection of reduced source trees

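[Figure: the reduced source trees, one on labels 1, 2, 3, 4 and one on labels 1, 4, 5, 6, are combined by MRP into a single tree on labels 1 to 6.]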


Replace polytomy using tree from MRP

[Figure: the MRP tree on labels 1 to 6 replaces the polytomy in the SCM tree, giving a more resolved supertree on leaves a through j.]


Resolving a single polytomy, v, using MRP

• Step 1: Reduce each source tree to a tree on the leaf set {1,2,...,d}, where d = degree(v)

• Step 2: Apply MRP to the collection of reduced source trees, to produce a tree t on {1,2,...,d}

• Step 3: Replace the star tree at v by tree t
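The relabelling half of Step 1 can be sketched as follows: deleting v from the SCM tree leaves d components, and every leaf is labelled with the number of its component. The sketch assumes the SCM tree is stored as an adjacency dictionary and source trees as nested tuples. The example tree and the function names are hypothetical, and the subsequent reduction of each relabelled source tree is not implemented here.

```python
from collections import deque

def relabel_map(adjacency, v):
    """Map each leaf of the SCM tree to the index (1..d) of its component of T - v."""
    labels = {}
    for index, start in enumerate(sorted(adjacency[v]), start=1):
        # Breadth-first search through the component that contains `start`.
        queue, seen = deque([start]), {v, start}
        while queue:
            node = queue.popleft()
            if len(adjacency[node]) == 1:  # degree 1 means a leaf of T
                labels[node] = index
            for neighbor in adjacency[node]:
                if neighbor not in seen:
                    seen.add(neighbor)
                    queue.append(neighbor)
    return labels

def relabel(tree, labels):
    """Replace every leaf name in a nested-tuple source tree by its label."""
    if isinstance(tree, str):
        return labels[tree]
    return tuple(relabel(child, labels) for child in tree)

# Hypothetical SCM tree: v is a polytomy of degree 4 with neighbors c, d, u, w.
scm_tree = {
    "a": ["u"], "b": ["u"], "c": ["v"], "d": ["v"], "e": ["w"], "f": ["w"],
    "u": ["a", "b", "v"], "v": ["u", "c", "d", "w"], "w": ["v", "e", "f"],
}
labels = relabel_map(scm_tree, "v")
source = (("a", "b"), ("c", "e"))
print(labels, relabel(source, labels))
```

Here the leaves a and b fall in the same component, so they receive the same label. The reduction step is what removes such repeated labels before MRP is applied.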


SuperFine-boosting: improves accuracy of MRP

[Figure: results plotted against Scaffold Density (%).]

(Swenson et al., Syst. Biol. 2012)


SuperFine is also much faster

MRP: 8-12 sec.
SuperFine: 2-3 sec.

[Figure: three panels, each plotted against Scaffold Density (%).]


Uses of Supertree Methods

• DACTAL: divide-and-conquer trees almost without alignments (Nelesen et al., 2012)

• DCM1-boosting

In these methods, the dataset is divided into subsets (using chordal graph theory), and then trees on the subsets are combined (using supertree methods). The guarantees for these algorithms are proved using chordal graph theory.


Chordal graph algorithms yield phylogeny estimation from polynomial length sequences

• Theorem (Warnow et al., SODA 2001): DCM1-NJ is correct with high probability given sequences of length $O(\ln n \cdot e^{O(\ln n)})$, which is polynomial in n since $e^{c \ln n} = n^c$

• Simulation study from Nakhleh et al. ISMB 2001

[Figure: error rate (0 to 0.8) plotted against number of taxa (0 to 1600) for NJ and DCM1-NJ.]


DACTAL: divide-and-conquer trees without alignment (submitted)

[Figure: the DACTAL pipeline. Unaligned sequences → overlapping subsets (BLAST-based) → a tree for each subset (existing method: RAxML(MAFFT)) → a tree for the entire dataset (new supertree method: SuperFine) → overlapping subsets again (pRecDCM3).]


DACTAL more accurate than all standard methods, and much faster than SATé

Average results on 3 large RNA datasets (6K to 28K)

CRW: Comparative RNA database, structural alignments

• 3 datasets with 6,323 to 27,643 sequences
• Reference trees: 75% RAxML bootstrap trees
• DACTAL (shown in red) run for 5 iterations starting from FT(Part)

SATé-1 fails on the largest dataset

SATé-2 runs but is not more accurate than DACTAL, and takes longer


More divide-and-conquer

• Recall that SATé and PASTA use divide-and-conquer (and also iteration) to improve alignment estimation.

• Alignments on different subsets are merged together using techniques like OPAL and Muscle.

• However, alignments can also be merged together using HMM-HMM alignment.

• You should think of your own algorithmic designs for improving scalability and accuracy, whether for MSA or for tree estimation!