
Profile Hidden Markov Models

PHMM

Mark Stamp


Hidden Markov Models

Here, we assume you know about HMMs
o If not, see “A revealing introduction to hidden Markov models”
Executive summary of HMMs
o HMM is a machine learning technique…
o …and a discrete hill climb technique
o Train model based on observation sequence
o Score any given sequence to determine how closely it matches the model
o Efficient algorithms, many, many useful apps


HMM Notation

Recall, HMM model denoted λ = (A, B, π)
Observation sequence is O

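To make the notation concrete, here is a tiny made-up model in Python; the sizes and numbers are illustrative only, not from the slides:

```python
# Hypothetical 2-state, 3-symbol HMM, lambda = (A, B, pi); numbers are made up
A  = [[0.7, 0.3],
      [0.4, 0.6]]              # A[i][j] = P(state j at time t+1 | state i at time t)
B  = [[0.1, 0.4, 0.5],
      [0.7, 0.2, 0.1]]         # B[i][k] = P(observing symbol k | state i)
pi = [0.6, 0.4]                # initial state distribution

# each row of A and B, and pi itself, is a probability distribution
for row in A + B + [pi]:
    assert abs(sum(row) - 1.0) < 1e-9
```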

Hidden Markov Models

Among the many uses for HMMs…
o Speech analysis
o Music search engine
o Malware detection
o Intrusion detection systems (IDS)
o And more all the time


Limitations of HMMs

Positional information not considered
o HMM has no “memory” beyond previous state
o Higher order models have more “memory”
o But no explicit use of positional information
With HMM, no insertions or deletions
These limitations are serious problems in some applications
o In bioinformatics string comparison, sequence alignment is critical
o Also, insertions and deletions can occur


Profile HMM

Profile HMM (PHMM) is designed to overcome the limitations on the previous slide
o In some ways, PHMM easier than HMM
o In some ways, PHMM more complex
The basic idea of PHMM?
o Define multiple B matrices
o Almost like having an HMM for each position in sequence


PHMM

In bioinformatics, begin by aligning multiple related sequences
o Multiple sequence alignment (MSA)
o Analogous to training phase for HMM
Generate PHMM based on given MSA
o This is easy, once MSA is known
o Again, hard part is generating MSA
Then can score sequences using PHMM
o Use forward algorithm, similar to HMM


Training: PHMM vs HMM

Training PHMM
o Determine MSA: nontrivial
o Determine PHMM matrices: trivial
Training HMM
o Append training sequences: trivial
o Determine HMM matrices: nontrivial
PHMM and HMM are, in this sense, opposites…


Generic View of PHMM

Have delete, insert, and match states
o Match states correspond to HMM states
Arrows are possible transitions
o Each transition has a probability
Transition probabilities are the A matrix
Emission probabilities are the B matrices
o In PHMM, observations are emissions
o Match and insert states have emissions


Generic View of PHMM

Circles are delete states, diamonds are insert states, squares are match states
Also, begin and end states


PHMM Notation


PHMM

Match state probabilities easily determined from MSA
o aMi,Mi+1: transition probability between match states Mi and Mi+1
o eMi(k): probability of emitting symbol k at match state Mi
Many other transition probabilities
o For example, aMi,Ii and aMi,Di+1
Emissions at all match & insert states
o Remember, emission == observation


Multiple Sequence Alignment

First we show MSA construction
o This is the difficult part
o Lots of ways to do this
o “Best” way depends on specific problem
Then construct PHMM from MSA
o This is the easy part
o Standard algorithm for this
How to score a sequence?
o Forward algorithm, similar to HMM


MSA

How to construct MSA?
o Construct pairwise alignments
o Combine pairwise alignments into MSA
Allow gaps to be inserted
o To make better matches
Gaps tend to weaken PHMM scoring
o So, tradeoff between number of gaps and strength of score


Global vs Local Alignment

In these pairwise alignment examples
o “-” is a gap
o “|” means elements aligned
o “*” for omitted beginning/ending symbols


Global vs Local Alignment

Global alignment is lossless
o But gaps tend to proliferate
o And gaps increase when we do MSA
o More gaps, more random sequences match…
o …and result is less useful for scoring
We usually only consider local alignment
o That is, omit ends for better alignment
For simplicity, assume global alignment in examples presented here


Pairwise Alignment

Allow gaps when aligning
How to score an alignment?
o Based on an n x n substitution matrix S
o Where n is number of symbols
What algorithm(s) to align sequences?
o Usually, dynamic programming
o Sometimes, HMM is used
o Other?
Local alignment? Additional issues arise…


Pairwise Alignment Example

Tradeoff: gaps vs misaligned elements
o Depends on matrix S and gap penalty


Substitution Matrix

For example, masquerade detection
o Detect imposter using computer account
Consider 4 different operations
o E == send email
o G == play games
o C == C programming
o J == Java programming
How similar are these to each other?


Substitution Matrix

Consider 4 different operations:
o E, G, C, J
Possible substitution matrix:
Diagonal matches
o High positive scores
Which others most similar?
o J and C, so substituting C for J is a high score
Game playing/programming, very different
o So substituting G for C is a negative score

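The substitution matrix itself is not reproduced above; a sketch with illustrative, made-up scores (the specific numbers are assumptions, not from the slides) might look like:

```python
# Illustrative substitution matrix S over the four operations E, G, C, J.
# High positive scores on the diagonal; C and J similar (both programming);
# game playing vs programming gets a negative score.
S = {
    ("E", "E"):  9, ("E", "G"): -2, ("E", "C"): -3, ("E", "J"): -3,
    ("G", "G"):  9, ("G", "C"): -4, ("G", "J"): -4,
    ("C", "C"):  9, ("C", "J"):  5,
    ("J", "J"):  9,
}

def score(a, b):
    """Symmetric lookup: S is stored as an upper triangle only."""
    return S.get((a, b), S.get((b, a)))

assert score("C", "J") == score("J", "C") == 5   # similar operations
assert score("G", "C") < 0                       # dissimilar operations
```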

Substitution Matrix

Depending on problem, might be easy or very difficult to find useful S matrix
Consider masquerade detection based on UNIX commands
o Sometimes difficult to say how “close” 2 commands are
Suppose instead, aligning DNA sequences
o Biological reasons for S matrix


Gap Penalty

Generally must allow gaps to be inserted
But gaps make alignment more generic
o Less useful for scoring, so we penalize gaps
How to penalize gaps?
Linear gap penalty function:
o g(x) = ax (constant penalty a for each of the x symbols in a gap)
Affine gap penalty function:
o g(x) = a + b(x – 1)
o Gap opening penalty a, and constant penalty b for each extension of existing gap

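The two penalty functions can be written down directly; the constants a and b below are arbitrary choices, and expressing penalties as negative scores (added rather than subtracted) is an assumption of this sketch:

```python
def linear_gap(x, a=-3):
    """Linear gap penalty g(x) = a*x: constant penalty a per gap symbol."""
    return a * x

def affine_gap(x, a=-4, b=-1):
    """Affine gap penalty g(x) = a + b*(x - 1):
    gap-opening penalty a, then b for each extension of an existing gap."""
    return a + b * (x - 1)

# A single gap costs about the same either way...
assert linear_gap(1) == -3 and affine_gap(1) == -4
# ...but the affine penalty is much gentler on one long gap
assert affine_gap(5) > linear_gap(5)
```

The affine form captures the intuition that opening a new gap should cost more than extending one that already exists.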

Pairwise Alignment Algorithm

We use dynamic programming
o Based on S matrix and gap penalty function


Pairwise Alignment DP

Initialization:

Recursion:

where

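The initialization and recursion equations from the slide are not reproduced above; as a substitute, here is a minimal sketch of the standard global-alignment dynamic program (Needleman-Wunsch style) with a linear gap penalty. The function name and the toy score function are assumptions for illustration:

```python
def global_align_score(x, y, score, gap=-2):
    """Global pairwise alignment score by dynamic programming.
    F[i][j] = best score aligning x[:i] with y[:j]; linear gap penalty."""
    n, m = len(x), len(y)
    F = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):          # initialization: align against all gaps
        F[i][0] = F[i - 1][0] + gap
    for j in range(1, m + 1):
        F[0][j] = F[0][j - 1] + gap
    for i in range(1, n + 1):          # recursion: substitute, or insert a gap
        for j in range(1, m + 1):
            F[i][j] = max(F[i - 1][j - 1] + score(x[i - 1], y[j - 1]),
                          F[i - 1][j] + gap,      # gap in y
                          F[i][j - 1] + gap)      # gap in x
    return F[n][m]

# Toy score function: +1 for a match, -1 for a mismatch
s = lambda a, b: 1 if a == b else -1
assert global_align_score("CJCE", "CJE", s) == 1   # best: C J C E / C J - E
```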

MSA from Pairwise Alignments

Given pairwise alignments…
How to construct MSA?
Generally use “progressive alignment”
o Select one pairwise alignment
o Select another and combine with first
o Continue to add more until all are combined
Relatively easy (good)
Gaps proliferate, and it’s unstable (bad)


MSA from Pairwise Alignments

Lots of ways to improve on generic progressive alignment
o Here, we mention one such approach
o Not necessarily “best” or most popular
Feng-Doolittle progressive alignment
o Compute scores for all pairs of n sequences
o Select n-1 alignments that a) “connect” all sequences and b) maximize pairwise scores
o Then generate a minimum spanning tree
o For MSA, add sequences in the order that they appear in the spanning tree


MSA Construction

Create pairwise alignments
o Generate substitution matrix S
o Dynamic program for pairwise alignments
Use pairwise alignments to make MSA
o Use pairwise alignments to construct spanning tree (e.g., Prim’s algorithm)
o Add sequences in spanning tree order (from high score, insert gaps as needed)
o Note: gap penalty is used here

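The spanning-tree step can be sketched with a Prim's-style construction, maximizing pairwise score instead of minimizing edge weight. The helper name prim_order and the toy scores are made up for illustration:

```python
def prim_order(scores, start):
    """Prim's algorithm over pairwise alignment scores, maximizing score.
    scores[(a, b)] = pairwise alignment score of sequences a and b.
    Returns tree edges in the order sequences are added to the MSA."""
    def s(a, b):
        return scores.get((a, b), scores.get((b, a), float("-inf")))
    nodes = {a for pair in scores for a in pair}
    in_tree, order = {start}, []
    while in_tree != nodes:
        # best-scoring edge from the tree to a sequence not yet included
        best = max(((u, v) for u in in_tree for v in nodes - in_tree),
                   key=lambda e: s(*e))
        in_tree.add(best[1])
        order.append(best)
    return order

# Toy example with four sequences; scores are made up
scores = {(1, 2): 9, (1, 3): 2, (2, 3): 7, (2, 4): 1, (3, 4): 8}
assert prim_order(scores, 1) == [(1, 2), (2, 3), (3, 4)]
```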

MSA Example

Suppose 10 sequences, with the following pairwise alignment scores


MSA Example: Spanning Tree

Spanning tree based on scores

So process pairs in following order: (5,4), (5,8), (8,3), (3,2), (2,7), (2,1), (1,6), (6,10), (10,9)


MSA Snapshot

Intermediate step and final
o Use “+” for neutral symbol
o Then “-” for gaps in MSA
Note increase in gaps


PHMM from MSA

In PHMM, determine match and insert states & probabilities from MSA
“Conservative” columns == match states
o Half or less of symbols are gaps
Other columns are insert states
o Majority of symbols are gaps
Delete states are a separate issue
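The match/insert rule above amounts to a one-pass column classifier. A minimal sketch, where the toy MSA and function name are assumptions for illustration:

```python
def classify_columns(msa):
    """Classify each MSA column: 'match' if half or fewer of its symbols
    are gaps ('-'), otherwise 'insert'. msa is a list of equal-length rows."""
    n_rows = len(msa)
    labels = []
    for col in zip(*msa):
        gaps = col.count("-")
        labels.append("match" if gaps <= n_rows / 2 else "insert")
    return labels

# Toy MSA: columns 3-5 are mostly gaps, so they become insert states
msa = ["AC--TG",
       "AG-A-G",
       "AC---G",
       "A----G"]

assert classify_columns(msa) == ["match", "match",
                                 "insert", "insert", "insert", "match"]
```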


PHMM States from MSA

Consider a simpler MSA…
Columns 1, 2, 6 are match states 1, 2, 3, respectively
o Since less than half gaps
Columns 3, 4, 5 are combined to form insert state 2
o Since more than half gaps
o Insert state falls between match states


Probabilities from MSA

Emission probabilities
o Based on symbol distribution in match and insert states
State transition probabilities
o Based on transitions in the MSA


Probabilities from MSA

Emission probabilities:
But 0 probabilities are bad
o Model overfits the data
o So, use “add one” rule
o Add one to each numerator, and add the number of distinct symbols to each denominator

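The add-one rule for one match-state column can be sketched as follows; the alphabet and the example column (gaps already removed) are made up for illustration:

```python
def emission_probs(column_symbols, alphabet):
    """Emission probabilities for one match state, with the add-one rule:
    add 1 to each count and |alphabet| to the denominator, so that no
    symbol ends up with probability zero. Gaps should be excluded first."""
    counts = {s: column_symbols.count(s) for s in alphabet}
    total = len(column_symbols) + len(alphabet)
    return {s: (counts[s] + 1) / total for s in alphabet}

# Hypothetical match-state column over the alphabet {E, G, C, J}
probs = emission_probs(["C", "C", "J", "C"], ["E", "G", "C", "J"])

assert abs(sum(probs.values()) - 1.0) < 1e-9
assert probs["C"] == 4 / 8      # (3 + 1) / (4 + 4)
assert probs["E"] == 1 / 8      # zero count, but nonzero probability
```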

Probabilities from MSA

More emission probabilities:
But 0 probabilities still bad
o Model overfits the data
o Again, use “add one” rule
o Add one to each numerator, and add the number of distinct symbols to each denominator


Probabilities from MSA

Transition probabilities:
We look at some examples
o Note that “-” is the delete state
First, consider the begin state:
Again, use add one rule


Probabilities from MSA

Transition probabilities
When no information in MSA, set probabilities to uniform
For example, I1 does not appear in the MSA, so set each of the three transition probabilities out of I1 (to M2, I1, and D2) to 1/3


Probabilities from MSA

Transition probabilities, another example
What about transitions from state D1?
o In the MSA, can only go to M2, so without smoothing aD1,M2 = 1
Again, use add one rule:
o aD1,M2 = 2/4, aD1,I1 = 1/4, aD1,D2 = 1/4


PHMM Emission Probabilities

Emission probabilities for the given MSA
o Using add-one rule


PHMM Transition Probabilities

Transition probabilities for the given MSA
o Using add-one rule


PHMM Summary

Construct pairwise alignments
o Usually, use dynamic programming
Use these to construct MSA
o Lots of ways to do this
Using MSA, determine probabilities
o Emission probabilities
o State transition probabilities
Then we have trained a PHMM
o Now what???


PHMM Scoring

Want to score sequences to see how closely they match the PHMM
How did we score using HMM?
o Forward algorithm
How to score sequences with PHMM?
o Forward algorithm (surprised?)
But, algorithm is a little more complex
o Due to more complex state transitions

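As a reminder, the plain-HMM forward algorithm computes P(O|λ) as follows; the PHMM version follows the same pattern but tracks match, insert, and delete states separately. The model values here are made up:

```python
def forward_score(A, B, pi, obs):
    """Plain-HMM forward algorithm: returns P(O | lambda).
    alpha[i] = P(O_0..O_t, state i at time t), updated in place over t."""
    n_states = len(pi)
    alpha = [pi[i] * B[i][obs[0]] for i in range(n_states)]
    for o in obs[1:]:
        alpha = [sum(alpha[i] * A[i][j] for i in range(n_states)) * B[j][o]
                 for j in range(n_states)]
    return sum(alpha)

# Tiny made-up model: 2 states, 2 observation symbols
A  = [[0.7, 0.3], [0.4, 0.6]]
B  = [[0.9, 0.1], [0.2, 0.8]]
pi = [0.5, 0.5]

# Sanity check: probabilities of all length-2 observation sequences sum to 1
total = sum(forward_score(A, B, pi, [a, b]) for a in (0, 1) for b in (0, 1))
assert abs(total - 1.0) < 1e-9
```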

Forward Algorithm

Notation
o Indices i and j are columns in MSA
o xi is the ith observation (emission) symbol
o qxi is the distribution of xi in the “random model”
o Base case initializes the recursion
o The forward variable is the score of x1,…,xi up to state j (note that in PHMM, i and j may not agree)
o Some states undefined
o Undefined states ignored in calculation


Forward Algorithm

Compute P(X|λ) recursively

Note that the forward variable for each state depends on the match, insert, and delete values at the previous position
o And corresponding state transition probabilities


PHMM

We will see examples of PHMM later

In particular,
o Malware detection based on opcodes
o Masquerade detection based on UNIX commands
