CS5263 Bioinformatics Lecture 12-14: Markov Chain and Hidden Markov Models
Page 1: CS5263 Bioinformatics

CS5263 Bioinformatics

Lecture 12-14: Markov Chain and Hidden Markov Models

Page 2: CS5263 Bioinformatics

Outline

• Background on probability

• Hidden Markov models
  – Algorithms
  – Applications

Page 3: CS5263 Bioinformatics

Probability Basics

• Definition (informal)
  – Experiment: e.g. toss a coin 10 times or sequence a genome
  – Outcome: a possible result of an experiment, e.g. HHHTTTHTTH or ACGTTAAACGTA
  – The sample space S of a random experiment is the set of all possible outcomes, e.g. {H, T}^10
  – Event: any subset of the sample space, e.g. "> 4 heads"
  – Probabilities are numbers assigned to events that indicate how likely it is that the event will occur when a random experiment is performed
  – A probability law for a random experiment is a rule that assigns probabilities to the events in the experiment

Page 4: CS5263 Bioinformatics

Example

0 ≤ P(Ai) ≤ 1

P(S) = 1

Page 5: CS5263 Bioinformatics

Probabilistic Calculus

• P(A U B) = P(A) + P(B) – P(A ∩ B)

• If A, B are mutually exclusive: P(A ∩ B) = 0

  P(A U B) = P(A) + P(B)

• A and not(A) are mutually exclusive
  – Thus: P(not(A)) = P(A^c) = 1 - P(A)

[Venn diagram: A U B ("either A or B") vs. A ∩ B ("both A and B") within the sample space S]

Page 6: CS5263 Bioinformatics

Joint and conditional probability

• The joint probability of two events A and B, P(A∩B) or simply P(A, B), is the probability that events A and B occur at the same time.

• The conditional probability of P(A|B) is the probability that A occurs given B occurred.

P(A | B) = P(A ∩ B) / P(B)
P(A ∩ B) = P(A | B) * P(B)

Page 7: CS5263 Bioinformatics

Example

• Roll a die
  – If I tell you the number is less than 4
  – What is the probability that the number is even?

P(d = even | d < 4)

= P(d = even ∩ d < 4) / P(d < 4)

= P(d = 2) / P(d = 1, 2, or 3)

= (1/6) / (3/6) = 1/3

Page 8: CS5263 Bioinformatics

Independence

• P(A | B) = P(A ∩ B) / P(B)

=> P(A ∩ B) = P(B) * P(A | B)

• A, B are independent iff
  – P(A ∩ B) = P(A) * P(B)
  – That is, P(A) = P(A | B)
• Also implies that P(B) = P(B | A)
  – P(A ∩ B) = P(B) * P(A | B) = P(A) * P(B | A)

Page 9: CS5263 Bioinformatics

Examples

• Are P(d = even) and P(d < 4) independent?

  P(d = even and d < 4) = 1/6, but P(d = even) * P(d < 4) = 1/2 * 1/2 = 1/4

  Or: P(d = even) = 1/2, but P(d = even | d < 4) = 1/3

• If the die has 8 faces, will P(d = even) and P(d < 5) be independent?

Page 10: CS5263 Bioinformatics

Theorem of total probability

• Let B1, B2, …, BN be mutually exclusive events whose union equals the sample space S. We refer to these sets as a partition of S.

• An event A can be represented as: A = (A∩B1) U (A∩B2) U … U (A∩BN)

• Since B1, B2, …, BN are mutually exclusive, then

P(A) = P(A∩B1) + P(A∩B2) + … + P(A∩BN)

• And therefore

P(A) = P(A|B1)*P(B1) + P(A|B2)*P(B2) + … + P(A|BN)*P(BN)

= Σi P(A | Bi) * P(Bi)

Exhaustive conditionalization

Marginalization

Page 11: CS5263 Bioinformatics

Example

• A loaded die:
  – P(6) = 0.5
  – P(1) = … = P(5) = 0.1

• Prob of an even number?

  P(even) = P(even | d < 6) * P(d < 6) + P(even | d = 6) * P(d = 6)
          = 2/5 * 0.5 + 1 * 0.5
          = 0.7

Page 12: CS5263 Bioinformatics

Another example

• A box of dice:
  – 99% fair
  – 1% loaded
    • P(6) = 0.5
    • P(1) = … = P(5) = 0.1
  – Randomly pick a die and roll it; what is P(6)?

• P(6) = P(6 | F) * P(F) + P(6 | L) * P(L)
       = 1/6 * 0.99 + 0.5 * 0.01 ≈ 0.17

Page 13: CS5263 Bioinformatics

Chain rule

• P(x1, x2, x3) = P(x1, x2, x3) / P(x2, x3)

* P(x2, x3) / P(x3)

* P(x3)

= P(x1 | x2, x3) P(x2 | x3) P(x3)

[Dependency chain: x3 → x2 → x1]

Page 14: CS5263 Bioinformatics

Bayes theorem

• P(A ∩ B) = P(B) * P(A | B) = P(A) * P(B | A)

=> P(B | A) = P(A | B) * P(B) / P(A)

  – P(B | A): posterior probability
  – P(A | B): conditional probability (likelihood)
  – P(B): prior of B
  – P(A): prior of A (normalizing constant)

This is known as Bayes Theorem or Bayes Rule, and is (one of) the most useful relations in probability and statistics

Bayes Theorem is definitely the fundamental relation in Statistical Pattern Recognition

Page 15: CS5263 Bioinformatics

Bayes theorem (cont’d)

• Given B1, B2, …, BN, a partition of the sample space S. Suppose that event A occurs; what is the probability of event Bj?

• P(Bj | A) = P(A | Bj) * P(Bj) / P(A)

= P(A | Bj) * P(Bj) / Σk P(A | Bk) * P(Bk)

Bj: different models

Given the observation A, should you choose the model that maximizes P(Bj | A) or P(A | Bj)? It depends on how much you know about the prior of Bj!

  – P(Bj | A): posterior probability
  – P(A | Bj): likelihood
  – P(Bj): prior of Bj
  – Σk P(A | Bk) * P(Bk): normalizing constant (theorem of total probability)

Page 16: CS5263 Bioinformatics

Example

• Prosecutor's fallacy
  – Some crime happened
  – The criminal left no evidence except hair
  – The police got his DNA from the hair

• An expert matched the DNA with someone's DNA in a database
  – The expert said both the false-positive and false-negative rates are 10^-6

• Can this be used as evidence of guilt against the suspect?

Page 17: CS5263 Bioinformatics

Prosecutor’s fallacy

• False positive: P(match | innocent) = 10^-6

• False negative: P(no match | guilty) = 10^-6

• P(match | guilty) = 1 - 10^-6 ≈ 1

• P(no match | innocent) = 1 - 10^-6 ≈ 1

• P(guilty | match) = ?

Page 18: CS5263 Bioinformatics

Prosecutor’s fallacy

P(g | m) = P(m | g) * P(g) / P(m) ≈ P(g) / P(m)

– P(g): the prior probability for someone to be guilty with no DNA evidence

– P(m): the probability for a DNA match

• How to get these two numbers?
  – We don't really care about P(m)
  – We want to compare two models:

• P(g | m) and P(i | m)

Page 19: CS5263 Bioinformatics

Prosecutor’s fallacy

• P(i | m) = P(m | i) * P(i) / P(m)
• P(g | m) = P(m | g) * P(g) / P(m)
• Therefore

  P(i | m) / P(g | m) = P(m | i) / P(m | g) * P(i) / P(g)
                      = 10^-6 * P(i) / P(g)

• P(i) + P(g) = 1
• It is clear, therefore, that whether we can conclude the suspect is guilty depends on the prior probability P(g)

Page 20: CS5263 Bioinformatics

Prosecutor’s fallacy

• How do you get P(g)?
• It depends on what other information you have on the suspect
• Say the suspect has no other connection with the crime, and the overall crime rate is 10^-7
• That's a reasonable prior for P(g)
• P(g) = 10^-7, P(i) ≈ 1
• P(i | m) / P(g | m) = 10^-6 * P(i) / P(g) = 10^-6 / 10^-7 = 10
• Or: P(i | m) = 0.91 and P(g | m) = 0.09
• The suspect is more likely to be innocent than guilty, given only the DNA evidence

Page 21: CS5263 Bioinformatics

Another example

• A test for a rare disease claims that it will report positive for 99.5% of people with the disease, and negative 99.9% of the time for those without

• The disease is present in the population at 1 in 100,000

• What is P(disease | positive test)?
  – P(D|P) / P(H|P) ≈ 0.01

• What is P(disease | negative test)?
  – P(D|N) / P(H|N) ≈ 5e-8

Page 22: CS5263 Bioinformatics

Yet another example

• We've talked about the casino's box of dice: 99% fair, 1% loaded (50% chance of a six)

• We said that if we randomly pick a die and roll it, we have a 17% chance of getting a six

• If we get 3 sixes in a row, what's the chance that the die is loaded?

• How about 5 sixes in a row?

Page 23: CS5263 Bioinformatics

• P(loaded | 666) = P(666 | loaded) * P(loaded) / P(666) = 0.5^3 * 0.01 / (0.5^3 * 0.01 + (1/6)^3 * 0.99) = 0.21

• P(loaded | 66666) = P(66666 | loaded) * P(loaded) / P(66666) = 0.5^5 * 0.01 / (0.5^5 * 0.01 + (1/6)^5 * 0.99) = 0.71

Page 24: CS5263 Bioinformatics

Simple probabilistic models for DNA sequences

• Assume nature generates a type of DNA sequence as follows:

1. A box of dice, each with four faces: {A, C, G, T}
2. Select a die suitable for the type of DNA
3. Roll it, append the symbol to a string
4. Repeat 3, until all symbols have been generated

• Given a string say X=“GATTCCAA…” and two dice

  – M1 has the distribution pA = pC = pG = pT = 0.25
  – M2 has the distribution pA = pT = 0.20, pC = pG = 0.30

• What is the probability of the sequence being generated by M1 or M2?

Page 25: CS5263 Bioinformatics

Model selection by maximum likelihood criterion

• X = GATTCCAA

• P(X | M1) = P(x1, x2, …, xn | M1)
            = Πi=1..n P(xi | M1)
            = 0.25^8

• P(X | M2) = P(x1, x2, …, xn | M2)
            = Πi=1..n P(xi | M2)
            = 0.2^5 * 0.3^3

P(X | M1) / P(X | M2) = Πi P(xi | M1) / P(xi | M2) = (0.25/0.2)^5 * (0.25/0.3)^3

LLR = Σi log(P(xi | M1) / P(xi | M2))

    = nA·SA + nC·SC + nG·SG + nT·ST

    = 5 * log(1.25) + 3 * log(0.833) = 0.57

Si = log (P(i | M1) / P(i | M2)), i = A, C, G, T

Log likelihood ratio (LLR)
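As a quick illustration, a minimal Python sketch of this LLR computation (emission probabilities as in the example above; natural logs, so the result matches the 0.57 on the slide):

import math

# Log-likelihood ratio of a DNA string under two iid models, M1 vs. M2.
M1 = {'A': 0.25, 'C': 0.25, 'G': 0.25, 'T': 0.25}
M2 = {'A': 0.20, 'C': 0.30, 'G': 0.30, 'T': 0.20}

def llr(x, m1=M1, m2=M2):
    return sum(math.log(m1[b] / m2[b]) for b in x)

print(llr("GATTCCAA"))  # ~0.57; positive => M1 is the better fit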

Page 26: CS5263 Bioinformatics

Model selection by maximum a posterior probability criterion

• Take the prior probabilities of M1 and M2 into consideration if known

log(P(M1 | X) / P(M2 | X))

= LLR + log(P(M1)) - log(P(M2))

= nA·SA + nC·SC + nG·SG + nT·ST + log(P(M1)) - log(P(M2))

• If P(M1) ~ P(M2), results will be similar to LLR test

Page 27: CS5263 Bioinformatics

Markov models for DNA sequences

We have assumed independence of nucleotides in different positions - unrealistic in biology

Page 28: CS5263 Bioinformatics

Example: CpG islands

• CpG - 2 adjacent nucleotides, same strand (not base-pair; “p” stands for the phosphodiester bond of the DNA backbone)

• In mammalian promoter regions, CpG is more frequent than in other regions of the genome
  – often marks gene-rich regions

Page 29: CS5263 Bioinformatics

CpG islands

• CpG islands
  – More CpG than elsewhere
  – More C & G than elsewhere, too
  – Typical length: a few 100 to a few 1000 bp

• Questions
  – Is a short sequence (say, 200 bp) a CpG island or not?
  – Given a long sequence (say, 10-100 kb), find the CpG islands (if any)

Page 30: CS5263 Bioinformatics

Markov models

• A sequence of random variables is a k-th order Markov chain if, for all i, the i-th value is independent of all but the previous k values:

  P(xi | xi-1, xi-2, …, x1) = P(xi | xi-1, …, xi-k)

• First order (k=1): P(xi | xi-1, …, x1) = P(xi | xi-1)
• Second order (k=2): P(xi | xi-1, …, x1) = P(xi | xi-1, xi-2)

• 0th order (k=0): P(xi | xi-1, …, x1) = P(xi)  (independence)

Page 31: CS5263 Bioinformatics

First order Markov model

Page 32: CS5263 Bioinformatics

A 1st order Markov model for CpG islands

• Essentially a finite state automaton (FSA)
• Transitions are probabilistic (instead of deterministic)

• 4 states: A, C, G, T
• 16 transitions: ast = P(xi = t | xi-1 = s)
• Begin/End states

Page 33: CS5263 Bioinformatics

Probability of emitting sequence x

Page 34: CS5263 Bioinformatics

Probability of a sequence

• What’s the probability of ACGGCTA in this model?

P(A) * P(C|A) * P(G|C) … P(A|T)

= aBA aAC aCG …aTA

• Equivalent: follow the path in the automaton, and multiply the transition probabilities on the path
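A minimal sketch of this computation in Python; trans is a hypothetical nested dict of transition probabilities a[s][t] = P(xi = t | xi-1 = s), with 'B' standing for the begin state:

import math

def markov_chain_log_prob(x, trans):
    # log P(x) under a first-order Markov chain: begin -> x[0], then x[i-1] -> x[i]
    logp = math.log(trans['B'][x[0]])
    for prev, cur in zip(x, x[1:]):
        logp += math.log(trans[prev][cur])
    return logp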

Page 35: CS5263 Bioinformatics

Training

• Estimate the parameters of the model
  – CpG+ model: count the transition frequencies from known CpG islands
  – CpG- model: also count the transition frequencies from sequences without CpG islands

  – ast = #(s→t) / Σt' #(s→t')

[Two transition probability tables are estimated: a+st for the CpG+ model and a-st for the CpG- model]
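A minimal sketch of this counting step (the function name and the add-one pseudocount choice are our own illustration, not the exact procedure from the slides):

# Estimate first-order transition probabilities from labeled training sequences
# (e.g. known CpG islands for the a+ table, non-island sequences for the a- table).
def estimate_transitions(seqs, alphabet="ACGT", pseudocount=1.0):
    counts = {s: {t: pseudocount for t in alphabet} for s in alphabet}
    for seq in seqs:
        for prev, cur in zip(seq, seq[1:]):
            counts[prev][cur] += 1
    return {s: {t: counts[s][t] / sum(counts[s].values()) for t in alphabet}
            for s in alphabet}

a_plus = estimate_transitions(["CGCGGCGC", "GCGCGCCG"])   # toy CpG+ training data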

Page 36: CS5263 Bioinformatics

Discrimination / Classification

• Given a sequence, is it CpG island or not?

• Log likelihood ratio (LLR)

βCG = log2(a+CG / a-CG) = log2(0.274/0.078) = 1.812

βBA = log2(a+BA / a-BA) = log2(0.591/1.047) = -0.825

Page 37: CS5263 Bioinformatics

Example

• X = ACGGCGACGTCG

• S(X) = βBA + βAC + βCG + βGG + βGC + βCG + βGA + βAC + βCG + βGT + βTC + βCG

       = βBA + 2βAC + 4βCG + βGG + βGC + βGA + βGT + βTC

       = -0.825 + 2*0.419 + 4*1.812 + 0.313 + 0.461 - 0.624 - 0.730 + 0.573

       = 7.25

Page 38: CS5263 Bioinformatics

CpG island scores

Figure 3.2 (Durbin book) The histogram of length-normalized scores for all the sequences. CpG islands are shown with dark grey and non-CpG with light grey.

Page 39: CS5263 Bioinformatics

Questions

• Q1: given a short sequence, is it more likely from CpG+ model or CpG- model?

• Q2: Given a long sequence, where are the CpG islands (if any)?
  – Approach 1: score (e.g.) 100 bp windows
    • Pro: simple
    • Con: arbitrary, fixed length, inflexible
  – Approach 2: combine the +/- models

Page 40: CS5263 Bioinformatics

Combined model

• Given a long sequence, predict which state each position is in. (states are hidden: Hidden Markov model)

Page 41: CS5263 Bioinformatics

Hidden Markov Model (HMM)

• Introduced in the 70's for speech recognition
• Have been shown to be good models for biosequences
  – Alignment
  – Gene prediction
  – Protein domain analysis
  – …

• The observed sequence data can be modeled by a Markov chain
  – State path unknown
  – Model parameters known or unknown

• Observed data: emission sequences X = (x1x2…xn)

• Hidden data: state sequences Π = (π1π2…πn)

Page 42: CS5263 Bioinformatics

Hidden Markov model (HMM)

Definition: A hidden Markov model (HMM) is a five-tuple

• Alphabet = { b1, b2, …, bM }
• Set of states Q = { 1, ..., K }
• Transition probabilities between any two states
  aij = transition prob from state i to state j
  ai1 + … + aiK = 1, for all states i = 1…K

• Start probabilities a0i
  a01 + … + a0K = 1

• Emission probabilities within each state
  ek(b) = P(xi = b | πi = k)
  ek(b1) + … + ek(bM) = 1, for all states k = 1…K


Page 43: CS5263 Bioinformatics

HMM for the Dishonest Casino

A casino has two dice:
• Fair die
  P(1) = P(2) = P(3) = P(4) = P(5) = P(6) = 1/6
• Loaded die
  P(1) = P(2) = P(3) = P(4) = P(5) = 1/10
  P(6) = 1/2

Casino player switches back and forth between fair and loaded die once in a while

Page 44: CS5263 Bioinformatics

The dishonest casino model

[Two-state model: Fair and LOADED]

Transition probabilities: aFF = 0.95, aFL = 0.05, aLF = 0.05, aLL = 0.95

Emission probabilities:
  eF(1) = eF(2) = eF(3) = eF(4) = eF(5) = eF(6) = 1/6
  eL(1) = eL(2) = eL(3) = eL(4) = eL(5) = 1/10, eL(6) = 1/2

Page 45: CS5263 Bioinformatics

Simple scenario

• You don't know the probabilities
• The casino player lets you observe which die he/she uses every time
  – The "state" of each roll is known

• Training (parameter estimation)
  – How often does the casino player switch dice?
  – How "loaded" is the loaded die?
  – Simply count the frequency with which each face appeared and the frequency of die switching
  – May add pseudo-counts if the number of observations is small

Page 46: CS5263 Bioinformatics

More complex scenarios

• The "state" of each roll is unknown:
  – You are given the results of a series of rolls
  – You don't know which number is generated by which die

• You may or may not know the parameters
  – How "loaded" is the loaded die
  – How frequently the casino player switches dice

Page 47: CS5263 Bioinformatics

The three main questions on HMMs

1. Decoding

GIVEN an HMM M and a sequence x,
FIND the sequence of states π that maximizes P(x, π | M)

2. Evaluation

GIVEN an HMM M and a sequence x,
FIND P(x | M) [or P(x) for simplicity]

3. Learning

GIVEN an HMM M with unspecified transition/emission probabilities, and a sequence x,

FIND parameters θ = (ei(.), aij) that maximize P(x | θ)

Sometimes written as P(x, θ) for simplicity.

Page 48: CS5263 Bioinformatics

Question # 1 – Decoding

GIVEN

An HMM with parameters, and a sequence of rolls by the casino player

1245526462146146136136661664661636616366163616515615115146123562344

QUESTION

What portion of the sequence was generated with the fair die, and what portion with the loaded die?

This is the DECODING question in HMMs

Page 49: CS5263 Bioinformatics

A parse of a sequence

Given a sequence x = x1……xN, and an HMM with K states,

A parse of x is a sequence of states π = π1, ……, πN

[Trellis diagram: at each position x1, x2, x3, …, xN there is a column of states 1, 2, …, K; a parse picks one state per column]

Page 50: CS5263 Bioinformatics

Probability of a parse

Given a sequence x = x1……xN

and a parse π = π1, ……, πN

To find how likely the parse is (given our HMM):

P(x, π) = P(x1, …, xN, π1, ……, πN)

        = P(xN, πN | πN-1) P(xN-1, πN-1 | πN-2) …… P(x2, π2 | π1) P(x1, π1)

        = P(xN | πN) P(πN | πN-1) …… P(x2 | π2) P(π2 | π1) P(x1 | π1) P(π1)

        = a0π1 aπ1π2 …… aπN-1πN · eπ1(x1) …… eπN(xN)


Page 51: CS5263 Bioinformatics

Example

• What's the probability of
  π = Fair, Fair, Fair, Fair, Load, Load, Load, Load, Fair, Fair
  and x = 1, 2, 1, 5, 6, 2, 1, 6, 2, 4?

(Model as before: aFF = aLL = 0.95, aFL = aLF = 0.05; P(s|F) = 1/6 for s = 1…6; P(s|L) = 1/10 for s = 1…5, P(6|L) = 1/2)

Page 52: CS5263 Bioinformatics

Example

• What's the probability of
  π = Fair, Fair, Fair, Fair, Load, Load, Load, Load, Fair, Fair
  and x = 1, 2, 1, 5, 6, 2, 1, 6, 2, 4?

P(x, π) = 1/2 * P(1|F) * P(F→F) * … * P(5|F) * P(F→L) * P(6|L) * P(L→L) * … * P(4|F)

        = 1/2 * 0.95^7 * 0.05^2 * (1/6)^6 * (1/10)^2 * (1/2)^2

        ≈ 5 × 10^-11
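A small Python check of this joint probability (the 1/2 start probability is the slide's assumption of equal priors on the two states):

# Joint probability P(x, pi) of an emission sequence and a given state path,
# with the dishonest-casino parameters from the slides.
trans = {('F','F'): 0.95, ('F','L'): 0.05, ('L','L'): 0.95, ('L','F'): 0.05}
def emit(state, roll):
    return 1/6 if state == 'F' else (0.5 if roll == 6 else 0.1)

x  = [1, 2, 1, 5, 6, 2, 1, 6, 2, 4]
pi = ['F', 'F', 'F', 'F', 'L', 'L', 'L', 'L', 'F', 'F']

p = 0.5 * emit(pi[0], x[0])                    # equal start probability assumed
for i in range(1, len(x)):
    p *= trans[(pi[i-1], pi[i])] * emit(pi[i], x[i])
print(p)  # ~5e-11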

Page 53: CS5263 Bioinformatics

Decoding

• Parse (path) is unknown. What to do?
• Alternative algorithms:

– Most probable single path (Viterbi algorithm)

– Sequence of most probable states (Forward-backward algorithm)

Page 54: CS5263 Bioinformatics

The Viterbi algorithm

• Goal: to find π* = argmaxπ P(x, π)

• This is equivalent to finding π* = argmaxπ log P(x, π)

Page 55: CS5263 Bioinformatics

The Viterbi algorithm

• Find a path with the following objective:
  – Maximize the product of transition and emission probabilities
  – Equivalently: maximize the sum of log probabilities

[Trellis: a begin node B, then one F node and one L node per roll; edge weights are symbol-independent (log transition probabilities), node weights depend on the symbols in the sequence (log emission probabilities)]

P(s|F) = 1/6, for s in [1..6]
P(s|L) = 1/10, for s in [1..5]
P(6|L) = 1/2

Page 56: CS5263 Bioinformatics

The Viterbi algorithm

[Trellis with begin state B and end state E, and one F node and one L node at each position x1 x2 x3 … xi xi+1 … xn-1 xn]

VF(i+1) = rF(xi+1) + max{ VF(i) + wFF, VL(i) + wLF }
  (weight of the best parse of x1…xi+1, with xi+1 emitted by state F)

VL(i+1) = rL(xi+1) + max{ VF(i) + wFL, VL(i) + wLL }
  (weight of the best parse of x1…xi+1, with xi+1 emitted by state L)

where wkl = log(akl), e.g. wFF = log(aFF), and rF(xi+1) = log(eF(xi+1))

Page 57: CS5263 Bioinformatics

Recursion from FSA directly

In log space:

  wFF = wLL = log(0.95) = -0.05;  wFL = wLF = log(0.05) = -3.00
  rF(s) = -1.8, s = 1…6
  rL(6) = -0.7;  rL(s) = -2.3 (s = 1…5)

  VF(i+1) = rF(xi+1) + max{ VL(i) + wLF, VF(i) + wFF }
  VL(i+1) = rL(xi+1) + max{ VL(i) + wLL, VF(i) + wFL }

In probability space:

  aFF = aLL = 0.95;  aFL = aLF = 0.05
  P(s|F) = 1/6, s = 1…6
  P(6|L) = 1/2;  P(s|L) = 1/10 (s = 1…5)

  PF(i+1) = eF(xi+1) · max{ PL(i) aLF, PF(i) aFF }
  PL(i+1) = eL(xi+1) · max{ PL(i) aLL, PF(i) aFL }

Page 58: CS5263 Bioinformatics

In general: more states / symbols

• Alphabet = { b1, b2, …, bM }

• Set of states Q = { 1, ..., K }

• States are completely connected
  – K^2 transition probabilities (some may be 0)
  – Each state has M emission probabilities (some may be 0)

Vl(i+1) = rl(xi+1) + maxk=1..K ( Vk(i) + wkl )

[Trellis: every state 1…K at position xi connects to every state 1…K at position xi+1]

Page 59: CS5263 Bioinformatics

The Viterbi Algorithm

Similar to “aligning” a set of states to a sequence

Time: O(K^2 N)

Space: O(KN)

[DP matrix: rows are states 1…K, columns are positions x1 x2 x3 … xi+1 … xN; cell Vl(i+1) is filled from column i]

Vl(i+1) = rl(xi+1) + maxk=1..K ( Vk(i) + wkl )

Page 60: CS5263 Bioinformatics

The Viterbi Algorithm (in log space)

Input: x = x1……xN

Initialization:
  V0(0) = 0        (subscript 0 is the start state)
  Vl(0) = -inf, for all l > 0   (the 0 in parentheses is the imaginary position before x1)

Iteration:
  for each i
    for each l
      Vl(i) = rl(xi) + maxk (wkl + Vk(i-1))     // rl(xi) = log(el(xi)), wkl = log(akl)
      Ptrl(i) = argmaxk (wkl + Vk(i-1))
    end
  end

Termination:
  Prob(x, π*) = exp{ maxk Vk(N) }

Traceback:
  πN* = argmaxk Vk(N)
  πi-1* = Ptrπi*(i)
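A minimal runnable sketch of this log-space Viterbi for the dishonest casino (parameter values from the slides; the equal start probabilities are an assumption):

import math

states = ['F', 'L']
log_a = {('F','F'): math.log(0.95), ('F','L'): math.log(0.05),
         ('L','F'): math.log(0.05), ('L','L'): math.log(0.95)}
log_start = {'F': math.log(0.5), 'L': math.log(0.5)}   # assumed equal start probabilities

def log_e(k, x):
    # log emission probability of roll x in state k
    return math.log(1/6) if k == 'F' else math.log(0.5 if x == 6 else 0.1)

def viterbi(rolls):
    V = [{k: log_start[k] + log_e(k, rolls[0]) for k in states}]
    ptr = [{}]
    for x in rolls[1:]:
        V.append({}); ptr.append({})
        for l in states:
            best_k = max(states, key=lambda k: V[-2][k] + log_a[(k, l)])
            ptr[-1][l] = best_k
            V[-1][l] = log_e(l, x) + V[-2][best_k] + log_a[(best_k, l)]
    path = [max(states, key=lambda k: V[-1][k])]        # traceback from the best final state
    for i in range(len(rolls) - 1, 0, -1):
        path.append(ptr[i][path[-1]])
    return path[::-1]

print(''.join(viterbi([1, 2, 1, 5, 6, 6, 6, 6, 6, 6, 2, 3])))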

Page 61: CS5263 Bioinformatics

The Viterbi Algorithm (in prob space)

Input: x = x1……xN

Initialization:
  P0(0) = 1        (subscript 0 is the start state)
  Pl(0) = 0, for all l > 0   (the 0 in parentheses is the imaginary position before x1)

Iteration:
  for each i
    for each l
      Pl(i) = el(xi) · maxk (akl · Pk(i-1))
      Ptrl(i) = argmaxk (akl · Pk(i-1))
    end
  end

Termination:
  Prob(x, π*) = maxk Pk(N)

Traceback:
  πN* = argmaxk Pk(N)
  πi-1* = Ptrπi*(i)

Page 62: CS5263 Bioinformatics
Page 63: CS5263 Bioinformatics

CpG islands

• Data: 41 human sequences, including 48 CpG islands of about 1 kbp each

• Viterbi:
  – Found 46 of 48
  – plus 121 "false positives"

• Post-processing:
  – merge predictions within 500 bp of each other
  – discard predictions shorter than 500 bp
  – Found 46/48
  – 67 false positives

Page 64: CS5263 Bioinformatics

Problems with Viterbi decoding

• The most probable path is not necessarily the only interesting one
  – Single optimal vs. multiple sub-optimal paths
    • What if there are many sub-optimal paths with slightly lower probabilities?
  – Global optimum vs. local optimum
    • What's best globally may not be the best for each individual position

Page 65: CS5263 Bioinformatics

Example

• The dishonest casino
• Say x = 12341623162616364616234161221341
• Most probable path: π = FF……F
• However: some positions (marked in the original slide) are individually more likely to be L than others
• Another way to interpret the problem
  – With Viterbi, every position is assigned a single label
  – What is the confidence level for each assignment?

Page 66: CS5263 Bioinformatics

Posterior decoding

• Viterbi finds the path with the highest probability

• We want to know P(πi = k | x) for each position i and each state k

• In order to do posterior decoding, we need to know P(x) and P(πi = k, x), since

  P(πi = k | x) = P(πi = k, x) / P(x)

• Computing P(x) and P(x, πi = k) is called the evaluation problem

• The solution: Forward-backward algorithm

Page 67: CS5263 Bioinformatics

Probability of a sequence

• P(X | M): the probability that X can be generated by M
• Sometimes simply written as P(X)
• May be written as P(X | M, θ) or P(X | θ) to emphasize that we are looking for the θ that optimizes the likelihood (discussed later under learning)

• Not equal to the probability of a single path P(X, π)
  – Many possible paths can generate X, each with its own probability
  – P(X) = Σπ P(X, π) = Σπ P(X | π) P(π)
  – How to compute this without summing over all possible paths (exponentially many of them)?
    • Dynamic programming

Page 68: CS5263 Bioinformatics

The forward algorithm

• Define fk(i) = P(x1…xi, πi = k)
  – Implicitly: sum over all possible paths for x1…xi-1

fk(n) = P(x1…xn, πn = k) = P(x1…xn | πn = k) P(πn = k)

P(x) = Σk P(x, πn = k) = Σk fk(n)

Page 69: CS5263 Bioinformatics

fk(i) = P(x1…xi, πi = k)

      = P(x1…xi-1, πi = k) P(xi | πi = k)

      = Σj P(x1…xi-1, πi-1 = j, πi = k) P(xi | πi = k)

      = ek(xi) Σj P(x1…xi-1, πi-1 = j) P(πi = k | πi-1 = j)

      = ek(xi) Σj fj(i-1) ajk
Page 70: CS5263 Bioinformatics

The forward algorithm


Page 71: CS5263 Bioinformatics

The forward algorithm

We can compute fk(i) for all k, i, using dynamic programming!

Initialization:
  f0(0) = 1
  fk(0) = 0, for all k > 0

Iteration:

  fk(i) = ek(xi) Σj fj(i-1) ajk

Termination:

  Prob(x) = Σk fk(N)
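A minimal sketch of the forward algorithm for the dishonest casino in probability space (same assumed parameters and equal start probabilities as in the Viterbi sketch earlier):

states = ['F', 'L']
a = {('F','F'): 0.95, ('F','L'): 0.05, ('L','F'): 0.05, ('L','L'): 0.95}
start = {'F': 0.5, 'L': 0.5}   # assumed equal start probabilities

def e(k, x):
    # emission probability of roll x in state k
    return 1/6 if k == 'F' else (0.5 if x == 6 else 0.1)

def forward(rolls):
    f = [{k: start[k] * e(k, rolls[0]) for k in states}]
    for x in rolls[1:]:
        prev = f[-1]
        f.append({l: e(l, x) * sum(prev[k] * a[(k, l)] for k in states) for l in states})
    return f, sum(f[-1].values())        # forward table and P(x)

f, px = forward([1, 2, 1, 5, 6, 6, 6, 6, 2, 3])
print(px)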

Page 72: CS5263 Bioinformatics

Relation between Forward and Viterbi

VITERBI (in prob space)

Initialization:

P0(0) = 1

Pk(0) = 0, for all k > 0

Iteration:

Pk(i) = ek(xi) maxj Pj(i-1) ajk

Termination:

Prob(x, π*) = maxk Pk(N)

FORWARD

Initialization:

f0(0) = 1

fk(0) = 0, for all k > 0

Iteration:

fk(i) = ek(xi) Σj fj(i-1) ajk

Termination:

Prob(x) = Σk fk(N)

Page 73: CS5263 Bioinformatics

Posterior decoding

• Viterbi finds the path with the highest probability

• We want to know P(πi = k | x) for each position i and each state k

• In order to do posterior decoding, we need to know P(x) and P(πi = k, x), since

  P(πi = k | x) = P(πi = k, x) / P(x)

  – P(x): we have just shown how to compute this (the forward algorithm)
  – P(πi = k, x): we still need to know how to compute this

Page 74: CS5263 Bioinformatics


Page 75: CS5263 Bioinformatics

The backward algorithm

• Define bk(i) = P(xi+1…xn | πi = k)

  – Implicitly: sum over all possible state paths for xi+1…xn

Page 76: CS5263 Bioinformatics

Initialization: bk(N) = 1

Recursion: bk(i) = Σl akl el(xi+1) bl(i+1)

Note that bk(i) does not include the emission probability of xi

Page 77: CS5263 Bioinformatics

The forward-backward algorithm

• Compute fk(i) for each state k and position i

• Compute bk(i), for each state k and position i

• Compute P(x) = Σk fk(N)

• Compute P(πi = k | x) = fk(i) * bk(i) / P(x)
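A minimal sketch of posterior decoding, reusing the states, a, e, and forward() definitions from the forward-algorithm sketch above and adding the backward pass:

def backward(rolls):
    b = [{k: 1.0 for k in states}]                      # bk(N) = 1
    for x in reversed(rolls[1:]):
        nxt = b[0]
        b.insert(0, {k: sum(a[(k, l)] * e(l, x) * nxt[l] for l in states) for k in states})
    return b

def posterior(rolls):
    f, px = forward(rolls)
    b = backward(rolls)
    return [{k: f[i][k] * b[i][k] / px for k in states} for i in range(len(rolls))]

rolls = [1, 2, 1, 5, 6, 6, 6, 6, 2, 3]
for i, post in enumerate(posterior(rolls)):
    print(i, max(post, key=post.get), round(post['L'], 3))   # most likely state per position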

Page 78: CS5263 Bioinformatics

The prob of x, with the constraint that xi was generated by state k

P(πi = k | x) = fk(i) · bk(i) / P(x)
  (forward probability × backward probability, normalized by P(x))

Space: O(KN)
Time: O(K^2 N)

Page 79: CS5263 Bioinformatics

Relation to another F-B algorithm

• We've learned a forward-backward algorithm in linear-space sequence alignment
  – Hirschberg's algorithm
  – Also known as the forward-backward alignment algorithm

Page 80: CS5263 Bioinformatics

What's P(πi = k | x) good for?

• For each position, you can assign a probability (in [0, 1]) to the states that the system might be in at that point – confidence level

• Assign each symbol to the most-likely state according to this probability rather than the state on the most-probable path – posterior decoding

π̂i = argmaxk P(πi = k | x)

Page 81: CS5263 Bioinformatics

Posterior decoding for the dishonest casino

If P(fair) > 0.5, the roll is more likely to be generated by a fair die than a loaded die

Page 82: CS5263 Bioinformatics

Posterior decoding for another dishonest casino

In this example, Viterbi predicts that all rolls were from the fair die.

Page 83: CS5263 Bioinformatics

CpG islands again

• Data: 41 human sequences, including 48 CpG islands of about 1kbp each

• Viterbi:
  – Found 46 of 48 (after post-processing: 46/48)
  – plus 121 "false positives" (after post-processing: 67 false positives)

• Posterior decoding:
  – same 2 false negatives (after post-processing: 46/48)
  – plus 236 false positives (after post-processing: 83 false positives)

Post-processing: merge predictions within 500 bp; discard those shorter than 500 bp

Page 84: CS5263 Bioinformatics

What if a new genome comes? Say we just sequenced the porcupine genome

We know CpG islands play the same role in this genome

However, we do not have many known CpG islands for porcupines

We suspect the frequency and characteristics of CpG islands are quite different in porcupines

How do we adjust the parameters in our model?

- LEARNING

Page 85: CS5263 Bioinformatics

Learning

• When the state path is known
  – We've already done that
  – Estimate parameters from labeled data (known CpG and non-CpG)
  – "Supervised" learning

• When the state path is unknown
  – Estimate parameters without labeled data
  – "Unsupervised" learning

Page 86: CS5263 Bioinformatics

Basic idea

1. Estimate our “best guess” on the model parameters θ

2. Use θ to predict the unknown labels

3. Re-estimate a new set of θ

4. Repeat 2 & 3

Two ways

Page 87: CS5263 Bioinformatics

Viterbi training

1. Estimate our “best guess” on the model parameters θ

2. Find the Viterbi path using current θ

3. Re-estimate a new set of θ based on the Viterbi path

– Count transitions/emissions on those paths, getting new θ

4. Repeat 2 & 3 until convergence

Page 88: CS5263 Bioinformatics

Baum-Welch training

1. Estimate our “best guess” on the model parameters θ

2. Find P(πi = k | x, θ) using the forward-backward algorithm

3. Re-estimate a new set of θ based on all possible paths
   – For example, according to Viterbi, position i is in state k and position (i+1) is in state l
     • This contributes 1 count towards the frequency that transition k→l is used
   – In Baum-Welch, position i has some probability of being in state k and position (i+1) has some probability of being in state l
     • The transition is counted only partially, according to the probability of this transition

4. Repeat 2 & 3 until convergence
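A minimal sketch of the partial-count idea, reusing the forward()/backward() sketches and the casino parameters (states, a, e) defined earlier; the expected number of k→l transitions is Σi fk(i) · akl · el(xi+1) · bl(i+1) / P(x):

def expected_transition_counts(rolls):
    f, px = forward(rolls)
    b = backward(rolls)
    counts = {(k, l): 0.0 for k in states for l in states}
    for i in range(len(rolls) - 1):
        for k in states:
            for l in states:
                # P(pi_i = k, pi_{i+1} = l | x) contributes a partial count
                counts[(k, l)] += f[i][k] * a[(k, l)] * e(l, rolls[i + 1]) * b[i + 1][l] / px
    return counts

# Normalizing each row of 'counts' gives the re-estimated transition probabilities.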

Page 89: CS5263 Bioinformatics

Probability that a transition is used


Page 90: CS5263 Bioinformatics

Estimated # of k→l transitions

Page 91: CS5263 Bioinformatics

Viterbi vs Baum-Welch training

• Viterbi training
  – Returns a single path
  – Each position is labeled with a fixed state
  – Each transition counts as one
  – Each emission also counts as one

• Baum-Welch training
  – Does not return a single path
  – Considers the probability that each transition is used and the probability that a symbol is generated by a certain state
  – These only contribute partial counts

Page 92: CS5263 Bioinformatics

Viterbi vs Baum-Welch training

• Both are guaranteed to converge

• Baum-Welch improves the likelihood of the data in each iteration: P(X)
  – True EM (expectation-maximization)

• Viterbi training improves the probability of the most probable path in each iteration: P(X, π*)
  – EM-like

Page 93: CS5263 Bioinformatics

expectation-maximization (EM)

• Baum-Welch algorithm is a special case of the expectation-maximization (EM) algorithm, a widely used technique in statistics for learning parameters from unlabeled data

• E-step: compute the expectation (e.g. prob for each pos to be in a certain state)

• M-step: maximum-likelihood parameter estimation

• We’ll see EM and similar techniques again in motif finding

• k-means clustering has some similar flavor

Page 94: CS5263 Bioinformatics
Page 95: CS5263 Bioinformatics

Compute sum of probabilities in log space

• Two probabilities x and y, with x < y
• lx = log(x), ly = log(y)  (lx < ly)
• z = x + y = y (1 + x/y)

  lz = log(z) = log(x + y)
     = log(y) + log(1 + x/y)
     = ly + log(1 + exp(log(x) - log(y)))
     = ly + log(1 + exp(lx - ly))

Also see page 4 in this doc: http://cs.utsa.edu/~jruan/teaching/cs5263_fall_2007/proj1.pdf
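A minimal Python sketch of this log-space addition (the function name is our own; math.log1p is used for numerical accuracy):

import math

def log_add(lx, ly):
    # log(x + y) computed from lx = log(x) and ly = log(y) without leaving log space
    if lx > ly:
        lx, ly = ly, lx          # make sure lx <= ly so exp(lx - ly) <= 1
    return ly + math.log1p(math.exp(lx - ly))

print(log_add(math.log(1e-300), math.log(1e-300)))   # log(2e-300); summing exp() values directly would underflow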

Page 96: CS5263 Bioinformatics

HMM summary

• Viterbi – best single path

• Forward – sum over all paths

• Backward – similar

• Baum-Welch – training via EM and forward-backward

• Viterbi – another “EM”, but Viterbi-based

Page 97: CS5263 Bioinformatics

Silent states

• Silent states are states that do not emit symbols (e.g., the state 0 in our previous examples)

• Silent states can be introduced in HMMs to reduce the number of transitions

Page 98: CS5263 Bioinformatics

Silent states

• Suppose we want to model a sequence in which arbitrary deletions are allowed (later this lecture)

• In that case we need a completely forward-connected HMM (O(m^2) edges)

Page 99: CS5263 Bioinformatics

Silent states

• If we use silent states, we use only O(m) edges

• Nothing comes for free: suppose we want to assign high probability to 1→5 and 2→4; there is no way to also have low probability on 1→4 and 2→5

Algorithms can be modified easily to deal with silent states, as long as there are no silent-state loops

Page 100: CS5263 Bioinformatics

Modular design of HMM

• HMM can be designed modularly

• Each module has its own begin / end states (silent, i.e. no emission)

• Each module communicates with other modules only through begin/end states

Page 101: CS5263 Bioinformatics

[Combined CpG model: a "+" module with states B+, A+, C+, G+, T+, E+ and a "-" module with states B-, A-, C-, G-, T-, E-, connected through their begin/end states]

HMM modules and non-HMM modules can be mixed

Page 102: CS5263 Bioinformatics

HMM applications

• Pair-wise sequence alignment

• Multiple sequence alignment

• Gene finding

• Speech recognition: a good tutorial on course website

• Machine translation

• Many others

Page 103: CS5263 Bioinformatics

FSA for global alignment

[FSA with three states: "Xi and Yj aligned" (match), "Xi aligned to a gap", and "Yj aligned to a gap"; entering a gap state costs d (gap open), staying in a gap state costs e (gap extension)]

Page 104: CS5263 Bioinformatics

HMM for global alignment

[Pair HMM with three states:
  M: "Xi and Yj aligned", emits the pair (xi, yj): 16 emission probabilities P(xi, yj)
  X: "Xi aligned to a gap", emits xi: 4 emission probabilities q(xi)
  Y: "Yj aligned to a gap", emits yj: 4 emission probabilities q(yj)
 M stays in M with probability 1-2δ and moves to X or Y with probability δ each; a gap state continues with probability ε and returns to M with probability 1-ε]

Pair-wise HMM: emits two sequences simultaneously

The algorithm is similar to that of a regular HMM, but needs an additional dimension;
e.g. in Viterbi, we need Vk(i, j) instead of Vk(i)

Page 105: CS5263 Bioinformatics

HMM and FSA for alignment

• After a proper transformation between the probabilities and the substitution scores, the two are identical: the substitution score corresponds to s(a, b) = log [p(a, b) / (q(a) q(b))], and the gap-open and gap-extension penalties d and e correspond to log transition probabilities

• Details in the Durbin book, chapter 4

• Finding an optimal FSA alignment is equivalent to finding the most probable path with Viterbi

Page 106: CS5263 Bioinformatics

HMM for pair-wise alignment

• Theoretical advantages:
  – Full probabilistic interpretation of alignment scores
  – Probability of all alignments instead of only the best alignment (forward-backward alignment)
  – Posterior probability that Ai is aligned to Bj
  – Sampling of sub-optimal alignments

• Not commonly used in practice
  – Needleman-Wunsch and Smith-Waterman algorithms work pretty well, and are more intuitive to biologists

Page 107: CS5263 Bioinformatics

Other applications

• HMM for multiple alignment
  – Widely used

• HMM for gene finding
  – Foundation for most gene finders
  – Includes many knowledge-based fine-tunings and GHMM extensions
  – We'll only discuss the basic ideas

Page 108: CS5263 Bioinformatics

HMM for multiple seq alignment

• Proteins form families both across and within species
  – Ex: globins, zinc fingers
  – Descended from a common ancestor
  – Typically have similar three-dimensional structures, functions, and significant sequence similarity

• Identifying families is very useful: it suggests functions
• So: search and alignment are both useful
• Multiple alignment is hard
• One very useful approach: profile-HMM

Page 109: CS5263 Bioinformatics

Say we already have a database of reliable multiple alignments of protein families

When a new protein comes, how do we align it to the existing alignments and find the family that the protein may belong to?

Page 110: CS5263 Bioinformatics

Solution 1

• Use a regular expression to represent the consensus sequence
  – Implemented in the Prosite database
  – For example:

C - x(2) - P - F - x - [FYWIV] - x(7) - C - x(8,10) - W - C - x(4) - [DNSR] - [FYW] - x(3,5) - [FYW] - x - [FYWI] - C

Page 111: CS5263 Bioinformatics

Multi-alignments represented by consensus

• Consensus sequences are very intuitive
• Gaps can be represented by do-not-cares
• Aligning sequences with regular expressions can be done easily with DP
• Possible to allow mismatches in searching
• Problems
  – Limited power in expressing more divergent positions
    • E.g. among 100 seqs, 60 have alanine, 20 have glycine, 20 have others
  – Specifying the boundaries of indels is not so easy
    • Unaligned regions have more flexibility to evolve
  – May have to change the regular expression when a new member of a protein family is identified

Page 112: CS5263 Bioinformatics

Solution 2

• For a non-gapped alignment, we can create a position-specific weight matrix (PWM)

• Also called a profile
• May use pseudocounts

A 4 8 3 4 7 6 1 1 5 6

R 3 1 0 10 2 1 1 2 1 13

N 3 1 2 8 40 1 4 0 7 1

D 6 0 8 5 0 1 1 3 5 2

C 0 9 12 0 0 3 0 2 2 4

E 1 5 3 0 2 0 2 4 3 3

Q 3 0 1 2 3 3 2 0 2 11

G 3 6 5 3 5 4 2 1 0 6

H 2 4 4 2 4 4 32 7 6 7

I 7 2 25 13 2 2 1 50 6 4

L 4 4 6 8 0 1 1 3 10 8

K 33 5 1 2 4 1 1 9 31 2

M 7 1 2 10 4 2 1 4 1 2

F 6 7 8 3 2 4 2 1 7 10

P 1 27 2 7 9 1 3 3 1 1

S 2 4 1 9 2 2 1 0 1 4

T 5 0 8 8 2 60 37 1 2 4

W 2 7 1 3 7 2 3 1 3 6

Y 4 0 5 1 4 1 1 5 3 1

V 4 8 2 1 1 0 4 3 2 6

Page 113: CS5263 Bioinformatics

Scoring by PWMs

(Same PWM count table as on the previous slide)

x = KCIDNTHIKR

P(x | M) = Πi ei(xi)

Random model: each amino acid xi can be generated with probability q(xi)

P(x | random) = Πi q(xi)

Log-odds ratio:

S = log [P(X|M) / P(X|random)] = Σi log [ei(xi) / q(xi)]
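A minimal sketch of this PWM log-odds scoring in Python (the toy two-position nucleotide PWM and uniform background here are illustrative, not the amino-acid table above):

import math

def pwm_log_odds(x, pwm, background):
    # sum over positions of log [ e_i(x_i) / q(x_i) ]
    return sum(math.log(pwm[i][ch] / background[ch]) for i, ch in enumerate(x))

pwm = [{'A': 0.7, 'C': 0.1, 'G': 0.1, 'T': 0.1},
       {'A': 0.1, 'C': 0.1, 'G': 0.1, 'T': 0.7}]
background = {ch: 0.25 for ch in 'ACGT'}
print(pwm_log_odds("AT", pwm, background))   # positive => better fit to the PWM than to background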

Page 114: CS5263 Bioinformatics

PWMs

• Advantages:
  – Can be used to identify both strong and weak homologies
  – Easy to implement and use
  – Probabilistic interpretation

• PWMs are used in PSI-BLAST
  – Given a sequence, get k similar seqs by BLAST
  – Construct a PWM from these sequences
  – Search the database for seqs matching the PWM
  – Improved sensitivity

• Problem:
  – Not intuitive to deal with gaps

Page 115: CS5263 Bioinformatics

PWMs are HMMs

[Chain of match states: B → M1 → … → Mk → E]

• 20 emission probabilities for each state
• Transition probability = 1

• This can only be used to search for sequences without insertions / deletions (indels)

• We can add additional states for indels!

Page 116: CS5263 Bioinformatics

Dealing with insertions

• This would allow an arbitrary number of insertions after the j-th position
  – i.e. the sequence being compared can be longer than the PWM

[B → M1 → … → Mj → … → Mk → E, with a self-looping insertion state Ij attached at position j]

Page 117: CS5263 Bioinformatics

Dealing with insertions

• This allows insertions at any position

[B → M1 → … → Mj → … → Mk → E, with insertion states I1 … Ij … Ik at every position]

Page 118: CS5263 Bioinformatics

Dealing with deletions

• This would allow a subsequence [i, j] to be deleted
  – i.e. the sequence being compared can be shorter than the PWM

[B → … → Mi → … → Mj → … → E, with an extra transition that skips the states Mi … Mj]

Page 119: CS5263 Bioinformatics

Dealing with deletions

• This would allow an arbitrary length of deletion
  – Completely forward connected
  – Too many transitions

[B → M1 → … → Mk → E, with forward edges from every state to every later state]

Page 120: CS5263 Bioinformatics

Deletion with silent states

[B → M1 → … → Mj → … → E, with a parallel chain of silent deletion states Dj above the match states]

• Still allows any length of deletion

• Fewer parameters

• Less detailed control

Dj: silent state (no emission)

Page 121: CS5263 Bioinformatics

Full model

• Called profile-HMM

[Profile-HMM: a chain of match states Mj from B to E, with an insertion state Ij and a silent deletion state Dj at each position]

Algorithm: basically the same as for a regular HMM

M: matching state
I: insertion state
D: deletion state (silent)

Page 122: CS5263 Bioinformatics

Using profile HMM

• Alignment
  – Align a sequence to a profile HMM (Viterbi)

• Searching
  – Protein classification: given a sequence and HMMs for different protein families, which family may the sequence belong to?
  – New family member detection: given an HMM for a protein family and many proteins, which protein may belong to the family?
  – Viterbi or forward

• Training / Learning
  – Given a multiple alignment, construct an HMM (supervised)
  – Given an unaligned protein family, construct an HMM (unsupervised: Baum-Welch or Viterbi training)

Page 123: CS5263 Bioinformatics

Pfam

• A database of protein families
  – Developed by Sean Eddy and colleagues while working in Durbin's lab

• Hand-curated "seed" multiple alignments
• Train an HMM from each seed alignment
• Hand-chosen score thresholds
• Automatic classification of all other protein sequences
• 7973 families in Pfam 18.0, 8/2005 (covers ~75% of proteins)

Page 124: CS5263 Bioinformatics

Build models from aligned sequences

• Matching state for columns with no gaps

• When gaps exist, how to decide whether they are insertions or matches?
  – Heuristic rule: >50% gaps → insertion; otherwise → matching

• How to add pseudocounts?
  – Simply add one
  – According to the background distribution
  – Use a mixture of priors (Dirichlet mixtures)

• Sequence weighting
  – Very similar sequences should each get less weight

Page 125: CS5263 Bioinformatics

Build models from unaligned sequences

• Choose a model length and initial parameters
  – Commonly use the average sequence length as the model length

• Baum-Welch or Viterbi training
  – Usually necessary to use multiple starting points or other heuristics to escape from local optima

• Align all sequences to the final model using Viterbi

Page 126: CS5263 Bioinformatics

Searching

• Scoring
  – Log likelihood: log P(X | M)
  – Log odds: log [P(X | M) / P(X | random)]
  – Length normalization

• Is the match biologically interesting?
  – How does the score compare with those for sequences already in the family?
  – How does the score compare with those for random sequences?

[Diagram: score each protein in a protein database against the profile HMM]

Page 127: CS5263 Bioinformatics

Example: modeling and searching for globins

• 300 randomly picked globin sequences

• Build a profile HMM from scratch (without pre-alignment)

• Align 60,000 proteins to the HMM

Page 128: CS5263 Bioinformatics

• Even after length normalization, LL is still length-dependent

• Log-odds score provides better separation
  – Takes amino acid composition into account
  – Real globins could still have scores less than 0

Page 129: CS5263 Bioinformatics

• Estimate mean score and standard deviation for non-globin sequences for each length

• Z-score = (raw score - mean) / (standard deviation)
  – The Z-score is length-invariant
  – Real globins have positive scores

Page 130: CS5263 Bioinformatics

Gene prediction

Page 131: CS5263 Bioinformatics

Gene structure

[Gene structure diagram, 5' to 3': intergenic, exon1, intron1, exon2, intron2, exon3, intergenic; the gene is transcribed, introns are spliced out, and the mRNA is translated]

Exon: coding
Intron: non-coding
Intergenic: non-coding

Page 132: CS5263 Bioinformatics

Finding genes

[Illustration: the same amount of DNA sequence contains very different proportions of coding sequence in human, Fugu, worm, and E. coli]

As the coding/non-coding length ratio decreases, exon prediction becomes more complex

Page 133: CS5263 Bioinformatics

Gene prediction in prokaryote

• Finding long ORFs (open reading frames)
• An ORF may not contain stop codons
  – Average ORF length = 64/3 codons (3 of the 64 codons are stop codons)
  – Expect a 300 bp ORF per 36 kbp
  – Actual ORF length ~ 1000 bp
• Codon biases
  – Some triplets are used more frequently than others
  – Codon third-position biases

Page 134: CS5263 Bioinformatics

HMM for eukaryote gene finding

• The basic idea is the same: the distribution of nucleotides is different in exons and other regions
  – Alone, this won't work very well

• More signals are needed

• How to combine all the signals together?

[Gene structure with signals, 5' to 3': promoter, 5'-UTR, start codon ATG, exons and introns with splice donor GT and splice acceptor AG, stop codon, 3'-UTR, poly-A; intergenic regions on both sides]

Page 135: CS5263 Bioinformatics

Simplest model

• Exon length may not be an exact multiple of 3
• Basically have to triple the number of states to remember the excess number of bases in the previous state

[Three-state model: intergenic ↔ exon ↔ intron; the exon state has 64 triplet emission probabilities, the intergenic and intron states each have 4 emission probabilities]

Coding emission is actually more accurate at the di-amino-acid level, i.e. 2 codons. Many methods use a 5th-order Markov model for all regions.

Page 136: CS5263 Bioinformatics

More detailed model

[State diagram with states: intergenic, init exon, internal exon, term exon, single exon, and intron]

Page 137: CS5263 Bioinformatics

Sub-models

• START and STOP are PWMs
• They include the start and stop codons and surrounding bases

[Init exon sub-model: 5'-UTR → START → CDS.  Term exon sub-model: CDS → STOP → 3'-UTR]

CDS: coding sequence

Page 138: CS5263 Bioinformatics

Sub-model for intron

• Sequence logos: an informative display of PWMs
• Within each column, relative height represents probability
• The height of each column reflects its "information content"

[Intron sub-model: splice donor → intron body → splice acceptor]

Page 139: CS5263 Bioinformatics
Page 140: CS5263 Bioinformatics

Duration modeling

• For any sub-path, the probability consists of two components
  – The product of emission probabilities
    • Depends on the symbols and the state path
  – The product of transition probabilities
    • Depends on the state path

Page 141: CS5263 Bioinformatics

Duration modeling

• Model a stretch of DNA for which the distribution does not change for a certain length

• The simplest model (a single state s with self-loop probability p and exit probability 1-p) implies that

  P(length = L) = p^(L-1) (1-p)

• i.e., the length follows a geometric distribution
  – Not always appropriate

Duration: the number of times that a state is used consecutively without visiting other states

Page 142: CS5263 Bioinformatics

Duration models

[Duration models built from copies of state s:
 - a chain of states, each with self-loop probability p and exit probability 1-p, gives a negative binomial duration distribution
 - a chain of probability-1 transitions followed by a final self-looping state gives a minimum duration, then a geometric tail]

Page 143: CS5263 Bioinformatics

Explicit duration modeling

[Explicit duration modeling: the intron state has emission probabilities P(A|I) = 0.3, P(C|I) = 0.2, P(G|I) = 0.2, P(T|I) = 0.3 and an explicit (empirical) length distribution P(L); exon and intergenic states are handled similarly]

Generalized HMM (GHMM). Often used in gene finders.

Page 144: CS5263 Bioinformatics

Explicit duration modeling

• For each position j and each state i
  – Need to consider transitions from all previous positions

• Time: O(N^2 K^2)
• N can be 10^8

[DP matrix: states 1…K by positions x1 x2 x3 … xN, with cell Pk(i)]

Page 145: CS5263 Bioinformatics

Speedup GHMM

• Restrict the maximum duration length to be L
  – O(L N K^2)

• However, intergenic regions and introns can be quite long
  – L can be 10^5

• Compromise: explicit duration for exons only, geometric for all other states

• Pre-compute all possible starting points of ORFs
  – For init exons: ATG
  – For internal/terminal exons: splice donor signal (GT)

Page 146: CS5263 Bioinformatics

GeneScan model

Page 147: CS5263 Bioinformatics

Approaches to gene finding

• Homology
  – BLAST, BLAT, etc.

• Ab initio
  – Genscan, Glimmer, Fgenesh, GeneMark, etc.
  – Each one has been tuned towards certain organisms

• Hybrids
  – Twinscan, SLAM
  – Use a pair-HMM, or pre-compute scores for potential coding regions based on alignment

• None are perfect; never used alone in practice

Page 148: CS5263 Bioinformatics

Current status

• More accurate on internal exons

• Determining boundaries of init and term exons is hard

• Biased towards multiple-exon genes

• Alternative splicing is hard

• Non-coding RNA is hard

Page 149: CS5263 Bioinformatics

• State of the art:
  – predictions ~60% similar to real proteins
  – ~80% if database similarity is used
  – lab verification is still needed, and still expensive

Page 150: CS5263 Bioinformatics

HMM wrap up

• We've talked about
  – Probability, mainly Bayes theorem
  – Markov models
  – Hidden Markov models
  – HMM parameter estimation given the state path
  – Decoding given an HMM and its parameters
    • Viterbi
    • Forward-backward
  – Learning
    • Baum-Welch (expectation-maximization)
    • Viterbi training

Page 151: CS5263 Bioinformatics

HMM wrap up

• We've also talked about
  – Extension to gHMMs
  – HMM for multiple sequence alignment
  – gHMM for gene finding

• We did not talk about
  – Higher-order Markov models
  – How to escape from local optima in learning