
© Ron Shamir, CG’08 1

Hidden Markov Models


• Dr Richard Durbin is a graduate in mathematics from Cambridge University and one of the founding members of the Sanger Institute. He has also carried out research at the Laboratory of Molecular Biology in Cambridge and at Harvard and Stanford Universities in the USA. He is currently head of the informatics division at the Sanger Centre.

Main source: Durbin et al.,

“Biological Sequence Analysis”

(Cambridge, ‘98)


The occasionally dishonest casino

Observed rolls: 13652656643662612564

• Die A (fair): P_A(1) = P_A(2) = … = P_A(6) = 1/6
• Die B (loaded): P_B(1) = … = P_B(5) = 0.1, P_B(6) = 0.5
• Switch probabilities between rolls: P_{A→B}, P_{B→A} (shown as 1/2 on the slide)

Can we tell when the loaded die is used?


Example - CpG islands

• CpG islands:
  – DNA stretches (100-1,000 bp) with frequent CG pairs (contiguous on the same strand).
  – Rare overall, yet appear in significant parts of the genome.
• Problem (1): Given a short genome sequence, decide if it comes from a CpG island.

Preliminaries: Markov Chains

A Markov chain is a triple (S, A, p):
• S: state set
• p: initial state probability vector {p(x_1 = s)}
• A: transition probability matrix, a_st = P(x_i = t | x_{i−1} = s)

Assumption: X = x_1…x_L is a random process with memory length 1, i.e. for all s_i ∈ S:
P(x_i = s_i | x_1 = s_1, …, x_{i−1} = s_{i−1}) = P(x_i = s_i | x_{i−1} = s_{i−1}) = a_{s_{i−1}, s_i}

• Sequence probability: P(X) = p(x_1) · ∏_{i=2…L} a_{x_{i−1}, x_i}

One can avoid p by adding a ‘begin’ state 0 and transition probabilities a_{0s}.


Sequence probability

Transition probabilities a_st (rows: from s, columns: to t):

  -     A      C      G      T
  A   0.300  0.205  0.285  0.210
  C   0.322  0.298  0.078  0.302
  G   0.248  0.246  0.298  0.208
  T   0.177  0.239  0.292  0.292

P(X) = p(x_1) · ∏_{i=2…L} a_{x_{i−1}, x_i}
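The sequence-probability formula can be evaluated directly from the table above. A minimal sketch (the function name and the uniform initial probability p(x_1) = 0.25 are my own assumptions, not from the slides):

```python
# Transition table from the slide: A_MINUS[s][t] = a_st (non-CpG model).
A_MINUS = {
    'A': {'A': 0.300, 'C': 0.205, 'G': 0.285, 'T': 0.210},
    'C': {'A': 0.322, 'C': 0.298, 'G': 0.078, 'T': 0.302},
    'G': {'A': 0.248, 'C': 0.246, 'G': 0.298, 'T': 0.208},
    'T': {'A': 0.177, 'C': 0.239, 'G': 0.292, 'T': 0.292},
}

def chain_prob(x, p1=0.25):
    """P(X) = p(x_1) * prod_{i=2..L} a_{x_{i-1}, x_i}.

    p1 is an assumed uniform initial probability; the slides instead
    add a 'begin' state with transition probs a_{0s}.
    """
    prob = p1
    for a, b in zip(x, x[1:]):  # consecutive pairs (x_{i-1}, x_i)
        prob *= A_MINUS[a][b]
    return prob

print(chain_prob("ACGT"))  # 0.25 * a_AC * a_CG * a_GT
```

Note that each row of the table sums to 1, as a transition matrix must.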


Markov model - Example

• Markov model over the states {A, C, G, T}
• Adding “begin” (B) and “end” (E) states

[Figure: fully connected chain over states A, C, G, T, with begin state B and end state E]


Andrei Andreyevich Markov

• Born: 14 June 1856 in Ryazan, Russia

• Died: 20 July 1922 in Petrograd (now St Petersburg), Russia

• Seminal contributions to: the central limit theorem, stochastic processes, random walks, …

http://www-groups.dcs.st-and.ac.uk/~history/


Markov Models

• “+”: transition probs inside CpG islands
• “−”: transition probs outside CpG islands

  +     A      C      G      T
  A   0.180  0.274  0.425  0.120
  C   0.171  0.368  0.274  0.188
  G   0.161  0.339  0.375  0.125
  T   0.079  0.355  0.384  0.182

  -     A      C      G      T
  A   0.300  0.205  0.285  0.210
  C   0.322  0.298  0.078  0.302
  G   0.248  0.246  0.298  0.208
  T   0.177  0.239  0.292  0.292


CpG islands: Fixed Window

• Problem (1): Given a short genome sequence X, decide if it comes from a CpG island.
• Solution: Model by a Markov chain. Let
  – a⁺_st: transition prob. inside CpG islands,
  – a⁻_st: transition prob. outside CpG islands.

Decide by the log-likelihood ratio score:

score(X) = log [ P(X | CpG island) / P(X | non-CpG island) ] = Σ_{i=1…n} log ( a⁺_{x_{i−1},x_i} / a⁻_{x_{i−1},x_i} )

bits_score(X) = (1/n) · Σ_{i=1…n} log₂ ( a⁺_{x_{i−1},x_i} / a⁻_{x_{i−1},x_i} )
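The length-normalized score can be computed directly from the two transition tables. A minimal sketch (the function name is my own; the tables are the ± tables from these slides):

```python
import math

# Transition probabilities from the lecture's tables (rows: previous base,
# columns: next base). "+" = inside CpG island, "-" = outside.
A_PLUS = {
    'A': {'A': 0.180, 'C': 0.274, 'G': 0.425, 'T': 0.120},
    'C': {'A': 0.171, 'C': 0.368, 'G': 0.274, 'T': 0.188},
    'G': {'A': 0.161, 'C': 0.339, 'G': 0.375, 'T': 0.125},
    'T': {'A': 0.079, 'C': 0.355, 'G': 0.384, 'T': 0.182},
}
A_MINUS = {
    'A': {'A': 0.300, 'C': 0.205, 'G': 0.285, 'T': 0.210},
    'C': {'A': 0.322, 'C': 0.298, 'G': 0.078, 'T': 0.302},
    'G': {'A': 0.248, 'C': 0.246, 'G': 0.298, 'T': 0.208},
    'T': {'A': 0.177, 'C': 0.239, 'G': 0.292, 'T': 0.292},
}

def score_bits(x):
    """Length-normalized log2 likelihood ratio of the + vs - model."""
    total = sum(math.log2(A_PLUS[a][b] / A_MINUS[a][b])
                for a, b in zip(x, x[1:]))
    return total / (len(x) - 1)

# A CG-rich sequence scores positive, an AT-rich one negative.
print(score_bits("CGCGCGCG"))  # > 0
print(score_bits("ATATATAT"))  # < 0
```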


Discrimination of sequences via Markov Chains

Durbin et al., Fig. 3.2: length-normalized score histograms for 48 CpG islands (total length ~60,000 nt) and a similar collection of non-CpG sequences.


CpG islands – the general case

• Problem (2): Detect CpG islands in a long DNA sequence.
• Naive solution - sliding windows: for each 1 ≤ k ≤ L−l,
  – window: X_k = (x_{k+1},…,x_{k+l})
  – score: score(X_k)
  – positive score ⇒ potential CpG island

Disadvantage: what is the length of the islands? How do we identify the transitions between island and non-island regions?

Idea: Use Markov chains as before, with additional (hidden) states.
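The naive sliding-window approach above can be sketched as follows. The `BETA` values are log2 ratios I computed from the slides' ± tables for four transitions only; defaulting the remaining transitions to 0.0 is a simplification for illustration, not part of the method:

```python
# Hypothetical per-transition log-odds weights: BETA[(a, b)] = log2(a+/a-),
# computed from the lecture's tables for a few transitions only.
BETA = {('C', 'G'): 1.813, ('G', 'C'): 0.462,
        ('A', 'T'): -0.807, ('T', 'A'): -1.164}

def window_scores(x, l):
    """Score every length-l window X_k = x[k:k+l]."""
    scores = []
    for k in range(len(x) - l + 1):
        w = x[k:k + l]
        # Missing transitions default to 0.0 (simplification).
        s = sum(BETA.get((a, b), 0.0) for a, b in zip(w, w[1:]))
        scores.append(s)
    return scores

# Windows with positive score are flagged as potential CpG islands.
x = "ATATCGCGCGATAT"
flags = [s > 0 for s in window_scores(x, 4)]
print(flags)
```

The CG-rich middle of `x` produces positive windows while the AT-rich ends do not, which illustrates both the idea and its weakness: the answer depends on the chosen window length `l`.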


Hidden Markov Model (HMM)

M = (Σ, Q, Θ):
• Σ: alphabet of symbols. Example: {A, C, G, T}
• Q: finite set of states, capable of emitting symbols. Example: Q = {A+, C+, G+, T+, A−, C−, G−, T−}
• Θ = (A, E):
  – A: transition probs a_kl = P(π_i = l | π_{i−1} = k), k, l ∈ Q
  – E: emission probs e_k(b) = P(x_i = b | π_i = k), k ∈ Q, b ∈ Σ

Path π = π_1,…,π_L: a sequence of states (a simple Markov chain).

Joint probability of observed sequence X = (x_1,…,x_L) and path π (convention: π_0 - begin, π_{L+1} - end):

P(X, π) = a_{0,π_1} · ∏_{i=1…L} e_{π_i}(x_i) · a_{π_i,π_{i+1}}

Goal: Find the path π* maximizing P(X, π).


Viterbi’s Decoding Algorithm (finding the most probable state path)

Want: path π maximizing P(X, π).

v_k(i) = prob. of the most probable path ending in state k at step i.

• Init: v_0(0) = 1; v_k(0) = 0 for k > 0
• Step: v_l(i+1) = e_l(x_{i+1}) · max_k {v_k(i) · a_kl}
• End: P(X, π*) = max_k {v_k(L) · a_k0}

Time complexity: O(Ln²) for n states, m symbols, L steps.
π* itself can be recovered using back pointers.
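The recursion above can be sketched for the casino HMM. The transition and initial probabilities below are illustrative assumptions (the slides do not fix them all); computing in log space avoids numerical underflow on long sequences:

```python
import math

# Toy casino HMM: state A = fair die, state B = loaded die.
# Switch/start probabilities are assumed values for illustration.
STATES = ['A', 'B']
TRANS = {'A': {'A': 0.95, 'B': 0.05}, 'B': {'A': 0.10, 'B': 0.90}}
START = {'A': 0.5, 'B': 0.5}                  # plays the role of a_{0,k}
EMIT = {'A': {r: 1 / 6 for r in '123456'},
        'B': {**{r: 0.1 for r in '12345'}, '6': 0.5}}

def viterbi(x):
    """Most probable state path, computed in log space with back pointers."""
    v = [{k: math.log(START[k]) + math.log(EMIT[k][x[0]]) for k in STATES}]
    back = []
    for sym in x[1:]:
        row, ptr = {}, {}
        for l in STATES:
            best = max(STATES, key=lambda k: v[-1][k] + math.log(TRANS[k][l]))
            row[l] = (v[-1][best] + math.log(TRANS[best][l])
                      + math.log(EMIT[l][sym]))
            ptr[l] = best
        v.append(row)
        back.append(ptr)
    # Trace the back pointers from the best final state.
    state = max(STATES, key=lambda k: v[-1][k])
    path = [state]
    for ptr in reversed(back):
        state = ptr[state]
        path.append(state)
    return ''.join(reversed(path))

# A long run of sixes should be decoded as the loaded die B,
# the fair stretches on either side as A.
print(viterbi("123123123" + "6" * 12 + "123123123"))
```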


The occasionally dishonest casino (2)

[Figure: the roll sequence 13652656643662612564 with the decoded die (A / B) beneath each roll, together with each die’s emission probabilities]


HMM for CpG Islands

• States: A+ C+ G+ T+ A− C− G− T−
• Symbols: A C G T (each emitted by both its + and its − state)
• Path π = π_1,…,π_L: sequence of states

Transition probs (http://www.cs.huji.ac.il/~cbio/handouts/class4.ppt):

  +     A      C      G      T
  A   0.180  0.274  0.425  0.120
  C   0.171  0.368  0.274  0.188
  G   0.161  0.339  0.375  0.125
  T   0.079  0.355  0.384  0.182

  -     A      C      G      T
  A   0.300  0.205  0.285  0.210
  C   0.322  0.298  0.078  0.302
  G   0.248  0.246  0.298  0.208
  T   0.177  0.239  0.292  0.292


HMM for CpG Islands

[Figure: state diagram with the four “+” states (A+, C+, G+, T+) and the four “−” states (A−, C−, G−, T−), fully interconnected]


Posterior State Probabilities

Goal: calculate P(π_i = k | X).

Our strategy:
• P(X, π_i = k) = P(x_1,…,x_i, π_i = k) · P(x_{i+1},…,x_L | x_1,…,x_i, π_i = k)
               = P(x_1,…,x_i, π_i = k) · P(x_{i+1},…,x_L | π_i = k)
• P(π_i = k | X) = P(π_i = k, X) / P(X)

Need to compute these two terms - and P(X).


Forward Algorithm

Goal: calculate P(X) = Σ_π P(X, π).
Approximation: take the max path π* from the Viterbi algorithm. Not justified when there are several near-maximal paths.
Exact algorithm (the “Forward Algorithm”): f_k(i) = P(x_1,…,x_i, π_i = k)
• Init: f_0(0) = 1; f_k(0) = 0 for k > 0
• Step: f_j(i+1) = e_j(x_{i+1}) · Σ_k f_k(i) · a_kj
• End: P(X) = Σ_k f_k(L) · a_k0
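The forward recursion can be sketched on the toy casino model. The model numbers are assumed for illustration, and the sketch omits the explicit end state (so P(X) = Σ_k f_k(L), i.e. a_k0 ≡ 1):

```python
# Toy two-state HMM; switch/start probabilities are assumed values.
STATES = ['A', 'B']
TRANS = {'A': {'A': 0.95, 'B': 0.05}, 'B': {'A': 0.10, 'B': 0.90}}
START = {'A': 0.5, 'B': 0.5}
EMIT = {'A': {r: 1 / 6 for r in '123456'},
        'B': {**{r: 0.1 for r in '12345'}, '6': 0.5}}

def forward(x):
    """f_k(i) = P(x_1..x_i, pi_i = k); returns P(X) = sum_k f_k(L)."""
    f = {k: START[k] * EMIT[k][x[0]] for k in STATES}
    for sym in x[1:]:
        # f_l(i+1) = e_l(sym) * sum_k f_k(i) * a_kl
        f = {l: EMIT[l][sym] * sum(f[k] * TRANS[k][l] for k in STATES)
             for l in STATES}
    return sum(f.values())

print(forward("266"))  # P(X) summed over all 2^3 state paths
```

A useful sanity check: the probabilities of all sequences of a fixed length sum to 1 (since there is no end state here).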


Backward Algorithm

• b_k(i) = P(x_{i+1},…,x_L | π_i = k)
• Init: b_k(L) = a_k0 for all k
• Step: b_k(i) = Σ_l a_kl · e_l(x_{i+1}) · b_l(i+1)
• End: P(X) = Σ_k a_0k · e_k(x_1) · b_k(1)
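Both recursions compute the same P(X), which makes a good cross-check. A sketch on the same assumed two-state model, again without an explicit end state (so b_k(L) = 1):

```python
# Toy two-state HMM; switch/start probabilities are assumed values.
STATES = ['F', 'L']  # fair / loaded
TRANS = {'F': {'F': 0.95, 'L': 0.05}, 'L': {'F': 0.10, 'L': 0.90}}
START = {'F': 0.5, 'L': 0.5}
EMIT = {'F': {r: 1 / 6 for r in '123456'},
        'L': {**{r: 0.1 for r in '12345'}, '6': 0.5}}

def forward(x):
    """P(X) via the forward recursion."""
    f = {k: START[k] * EMIT[k][x[0]] for k in STATES}
    for sym in x[1:]:
        f = {l: EMIT[l][sym] * sum(f[k] * TRANS[k][l] for k in STATES)
             for l in STATES}
    return sum(f.values())

def backward(x):
    """P(X) via the backward recursion (no end state, so b_k(L) = 1)."""
    b = {k: 1.0 for k in STATES}
    for sym in reversed(x[1:]):  # sym runs over x_L, ..., x_2
        # b_k(i) = sum_l a_kl * e_l(x_{i+1}) * b_l(i+1)
        b = {k: sum(TRANS[k][l] * EMIT[l][sym] * b[l] for l in STATES)
             for k in STATES}
    # End: P(X) = sum_k p(k) * e_k(x_1) * b_k(1)
    return sum(START[k] * EMIT[k][x[0]] * b[k] for k in STATES)

x = "1666662"
print(forward(x), backward(x))  # the two values agree
```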


Posterior State Probabilities (2)

Goal: calculate P(π_i = k | X).
• Recall:
  – f_k(i) = P(x_1,…,x_i, π_i = k)
  – b_k(i) = P(x_{i+1},…,x_L | π_i = k)
  – Each can be used to compute P(X).
• P(X, π_i = k) = P(x_1,…,x_i, π_i = k) · P(x_{i+1},…,x_L | x_1,…,x_i, π_i = k)
               = P(x_1,…,x_i, π_i = k) · P(x_{i+1},…,x_L | π_i = k) = f_k(i) · b_k(i)
• P(π_i = k | X) = P(π_i = k, X) / P(X)


Dishonest Casino (3)

[Figure: posterior probability of the fair die along the roll sequence; Durbin et al., p. 60]


Posterior Decoding

• Now we have P(π_i = k | X). How do we decode?

1. π̂_i = argmax_k P(π_i = k | X)
   – Good when interested in the state at a particular point.
   – The resulting path of states π̂_1,…,π̂_L may not be legal (it may use zero-probability transitions).

2. Define a function of interest g(k) on the states. Compute G(i|X) = Σ_k P(π_i = k | X) · g(k).
   – E.g., g(k) = 1 for states in S, 0 on the rest: G(i|X) is the posterior prob. of symbol i coming from S (e.g., for CpG islands, S = {A+, C+, G+, T+}).
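Posterior decoding combines the forward and backward tables position by position. A sketch on the assumed two-state casino model, with g(k) the indicator of the loaded state, so G(i|X) is the posterior probability that roll i came from the loaded die:

```python
# Toy two-state HMM; switch/start probabilities are assumed values.
STATES = ['F', 'L']  # fair / loaded
TRANS = {'F': {'F': 0.95, 'L': 0.05}, 'L': {'F': 0.10, 'L': 0.90}}
START = {'F': 0.5, 'L': 0.5}
EMIT = {'F': {r: 1 / 6 for r in '123456'},
        'L': {**{r: 0.1 for r in '12345'}, '6': 0.5}}

def posteriors(x):
    """P(pi_i = k | X) = f_k(i) * b_k(i) / P(X) for every position i."""
    n = len(x)
    # Forward pass: f[i][k] = P(x[0..i], state at position i = k).
    f = [{k: START[k] * EMIT[k][x[0]] for k in STATES}]
    for sym in x[1:]:
        f.append({l: EMIT[l][sym] * sum(f[-1][k] * TRANS[k][l]
                                        for k in STATES) for l in STATES})
    # Backward pass: b[i][k] = P(x[i+1..] | state at position i = k).
    b = [{k: 1.0 for k in STATES}]
    for sym in reversed(x[1:]):
        b.append({k: sum(TRANS[k][l] * EMIT[l][sym] * b[-1][l]
                         for l in STATES) for k in STATES})
    b.reverse()
    px = sum(f[-1][k] for k in STATES)  # P(X)
    return [{k: f[i][k] * b[i][k] / px for k in STATES} for i in range(n)]

# g(k) = 1 for the loaded state: G(i|X) is the posterior probability
# that roll i was generated by the loaded die.
G = [p['L'] for p in posteriors("12666666666621")]
print(G)
```

Inside the run of sixes G(i|X) should be markedly higher than at the fair-looking ends of the sequence.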


Andrew Viterbi

• Dr. Andrew J. Viterbi is a pioneer in the field of wireless communications. He received his Bachelor’s and Master’s degrees from MIT, and his Ph.D. in digital communications from the University of Southern California (USC). Immediately after obtaining his Ph.D. he taught at UCLA and consulted for the Jet Propulsion Laboratory (JPL). He was a co-founder of Linkabit, a small military contractor, in 1968, and co-founded QualComm with Irwin Jacobs in 1985. He created the Viterbi Algorithm for interference suppression and efficient decoding of digital transmission sequences, used by all four international standards for digital cellular telephony. QualComm is the recognized pioneer of Code Division Multiple Access (CDMA) digital wireless technology, which allows many users to share the same radio frequencies and thereby increases system capacity many times over analog system capacity. He is a Life Fellow of the IEEE, and was inducted into the National Academy of Engineering in 1978 and the National Academy of Sciences in 1996. http://www.ieee.org/organizations/history_center/comsoc/viterbi.html