Hidden Markov Models

What are they good for?

Morten Nielsen, CBS


Absolutely nothing!

Objectives

• Introduce hidden Markov models and understand that they are just weight matrices with gaps
• See the beauty of sequence profiles
• Position-specific scoring matrices (PSSMs)
• Understand which biological problems are best described using HMMs
  – And which are not!

Outline

• What is an HMM?
  – What are they good for?
• How to construct an HMM
• How to “score” a sequence to an HMM
  – Viterbi decoding
• HMMs that made a difference
  – Profile HMMs
  – TMHMM
• Links to HMM packages

Markov Models

• A model with no memory
  – What I decide depends only on the “state” now, not on what I have learned in the past
  – No dependence on i-1, i-2 …

$P_{i+1} = \alpha_{i+1,i} \cdot P_i$

A Markov model?

• No memory
• Model generates numbers
  – 312453666641

Emission probabilities:
Fair:   1: 1/6, 2: 1/6, 3: 1/6, 4: 1/6, 5: 1/6, 6: 1/6
Loaded: 1: 1/10, 2: 1/10, 3: 1/10, 4: 1/10, 5: 1/10, 6: 1/2

Transition probabilities:
Fair → Fair: 0.95, Fair → Loaded: 0.05, Loaded → Loaded: 0.9, Loaded → Fair: 0.10

The unfair casino: loaded die p(6) = 0.5; switch fair to loaded: 0.05; switch loaded to fair: 0.1
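The unfair casino can be simulated in a few lines. The sketch below is only an illustration: the function name `roll_casino`, the `seed` argument and the choice to start in the Fair state are assumptions, not part of the slides. It returns both the visible rolls and the state path that an observer would not see.

```python
import random

# Emission probabilities for faces 1..6, as given on the slide.
EMISSION = {
    "F": [1 / 6] * 6,          # fair die
    "L": [0.1] * 5 + [0.5],    # loaded die, p(6) = 0.5
}
# Probability of leaving the current state after a roll.
SWITCH = {"F": 0.05, "L": 0.10}


def roll_casino(n, seed=None, start="F"):
    """Generate n rolls; 'start' is an assumed initial state."""
    rng = random.Random(seed)
    state, rolls, states = start, [], []
    for _ in range(n):
        states.append(state)
        face = rng.choices(range(1, 7), weights=EMISSION[state])[0]
        rolls.append(str(face))
        if rng.random() < SWITCH[state]:
            state = "L" if state == "F" else "F"
    return "".join(rolls), "".join(states)


rolls, hidden = roll_casino(12, seed=1)
print(rolls)   # what the player sees
print(hidden)  # the state path that stays hidden
```

Only the first string is observed; recovering the second from it is the decoding problem.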

Why hidden?

• Model generates numbers
  – 312453666641
• Does not tell which die was used
• Alignment (decoding) can give the most probable solution/path (Viterbi)
  – FFFFFFLLLLLL
• Or most probable set of states
  – FFFFFFLLLLLL


HMM (a simple example)

ACA---ATG

TCAACTATC

ACAC--AGC

AGA---ATC

ACCG--ATC

• Example from A. Krogh
• Core region defines the number of states in the HMM (red)
• Insertion and deletion statistics are derived from the non-core part of the alignment (black)

Core of alignment: columns 1-3 and 7-9 (the columns without gaps)

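HMM construction

[Figure: HMM built from the alignment. Six match states, each emitting A, C, G or T, and one insert state between match states 3 and 4.]

Match emissions: 1: A 0.8, T 0.2; 2: C 0.8, G 0.2; 3: A 0.8, C 0.2; 4: A 1.0; 5: T 0.8, G 0.2; 6: C 0.8, G 0.2

Insert emissions: A 0.2, C 0.4, G 0.2, T 0.2

Transitions: match 3 → match 4: 0.4; match 3 → insert: 0.6; insert → insert: 0.4; insert → match 4: 0.6; all other transitions: 1.0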

ACA---ATG

TCAACTATC

ACAC--AGC

AGA---ATC

ACCG--ATC

• 5 residues emitted in the gap region: A, 2×C, T, G
• 5 transitions in gap region
  – C out, G out
  – A-C, C-T, T out
  – Out transition 3/5
  – Stay transition 2/5

ACA---ATG 0.8×1×0.8×1×0.8×0.4×1×1×0.8×1×0.2 = 3.3×10^-2

Align sequence to HMM

ACA---ATG 0.8×1×0.8×1×0.8×0.4×1×1×0.8×1×0.2 = 3.3×10^-2

TCAACTATC 0.2×1×0.8×1×0.8×0.6×0.2×0.4×0.4×0.4×0.2×0.6×1×1×0.8×1×0.8 = 0.0075×10^-2

ACAC--AGC = 1.2×10^-2

Consensus:

ACAC--ATC = 4.7×10^-2, ACA---ATC = 13.1×10^-2

Exceptional:

TGCT--AGG = 0.0023×10^-2
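The products above can be checked with a short script. This is a sketch under stated assumptions: the parameter values are the ones derived from the five-sequence alignment, while the names (`MATCH`, `INSERT`, `path_probability`) and the fixed 9-column input layout (columns 1-3 and 7-9 core, columns 4-6 insert region) are chosen here for illustration.

```python
# Emission probabilities of the six match states (core columns).
MATCH = [
    {"A": 0.8, "T": 0.2},
    {"C": 0.8, "G": 0.2},
    {"A": 0.8, "C": 0.2},
    {"A": 1.0},
    {"T": 0.8, "G": 0.2},
    {"C": 0.8, "G": 0.2},
]
# Emission probabilities of the single insert state.
INSERT = {"A": 0.2, "C": 0.4, "G": 0.2, "T": 0.2}
M3_TO_M4, M3_TO_INSERT = 0.4, 0.6   # leaving match state 3
INSERT_STAY, INSERT_OUT = 0.4, 0.6  # inside / leaving the insert state


def path_probability(aligned):
    """Probability of a 9-column aligned sequence such as 'ACAC--AGC'."""
    left, inserted, right = aligned[:3], aligned[3:6].replace("-", ""), aligned[6:]
    p = 1.0
    for state, base in zip(MATCH[:3], left):
        p *= state.get(base, 0.0)
    if inserted:
        p *= M3_TO_INSERT
        for k, base in enumerate(inserted):
            if k > 0:
                p *= INSERT_STAY   # one 'stay' per additional inserted residue
            p *= INSERT[base]
        p *= INSERT_OUT
    else:
        p *= M3_TO_M4
    for state, base in zip(MATCH[3:], right):
        p *= state.get(base, 0.0)
    return p


for seq in ["ACA---ATG", "TCAACTATC", "ACAC--AGC", "ACA---ATC", "TGCT--AGG"]:
    print(seq, f"{path_probability(seq):.2e}")
```

Transitions with probability 1 are left out of the product because they do not change it.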

Align sequence to HMM - Null model

• Score depends strongly on length
• Null model is a random model. For length L the score is $0.25^L$
• Log-odds score for sequence S
  – $\log\left(P(S) / 0.25^L\right)$
• Positive score means more likely than Null model

ACA---ATG = 4.9
TCAACTATC = 3.0
ACAC--AGC = 5.3
AGA---ATC = 4.9
ACCG--ATC = 4.6

Consensus:
ACAC--ATC = 6.7
ACA---ATC = 6.3

Exceptional:
TGCT--AGG = -0.97
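A minimal sketch of the log-odds calculation follows. The function name `log_odds` is hypothetical, and the use of the natural logarithm is an assumption: the slide does not state the base, but the natural log reproduces the listed values. L counts residues only, not gap characters.

```python
import math


def log_odds(p_model, length, background=0.25):
    """Natural-log odds of the HMM probability against a uniform null model."""
    return math.log(p_model / background ** length)


# HMM probabilities for three of the aligned sequences, with their residue counts.
examples = {
    "ACA---ATG": (0.032768, 6),
    "TCAACTATC": (7.549747e-5, 9),
    "TGCT--AGG": (2.304e-5, 7),
}
for seq, (p, length) in examples.items():
    print(seq, round(log_odds(p, length), 2))
```

Dividing by the null model removes most of the length dependence: the long sequence TCAACTATC has a tiny raw probability but still scores clearly above zero.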

Note!

Model decoding (Viterbi)

• Example: 1245666. What was the series of dice used to generate this output?

Log model (log10 of the probabilities):
Fair:   1: -0.78, 2: -0.78, 3: -0.78, 4: -0.78, 5: -0.78, 6: -0.78
Loaded: 1: -1, 2: -1, 3: -1, 4: -1, 5: -1, 6: -0.3
Transitions: Fair → Fair: -0.02, Fair → Loaded: -1.3, Loaded → Loaded: -0.05, Loaded → Fair: -1

Dynamic programming: computation of scores

[Figure: alignment matrix with T C G C A along the top and T C C A down the side; x marks the cell being scored.]

Any given point in the matrix can only be reached from three possible positions (you cannot “align backwards”).

=> The best scoring alignment ending in any given point in the matrix can be found by choosing the highest scoring of the three possibilities.

Each new score is found by choosing the maximum of three possibilities. For each square in the matrix: keep track of where the best score came from.

Fill in scores one row at a time, starting in the upper left corner of the matrix, ending in the lower right corner.

$$
\text{score}(x,y) = \max
\begin{cases}
\text{score}(x,y-1) - \text{gap-penalty} \\
\text{score}(x-1,y-1) + \text{substitution-score}(x,y) \\
\text{score}(x-1,y) - \text{gap-penalty}
\end{cases}
$$
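The recurrence can be sketched as a global alignment score. The scoring values here (match +1, mismatch -1, gap penalty 1) and the boundary initialisation are assumptions made for illustration; the slide does not specify them.

```python
def global_alignment_score(a, b, match=1, mismatch=-1, gap_penalty=1):
    """Fill the score matrix row by row and return the lower-right entry."""
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    # Assumed boundary condition: leading gaps are penalised linearly.
    for x in range(1, rows):
        score[x][0] = -x * gap_penalty
    for y in range(1, cols):
        score[0][y] = -y * gap_penalty
    for x in range(1, rows):
        for y in range(1, cols):
            substitution = match if a[x - 1] == b[y - 1] else mismatch
            score[x][y] = max(
                score[x][y - 1] - gap_penalty,       # gap in a
                score[x - 1][y - 1] + substitution,  # align a[x-1] with b[y-1]
                score[x - 1][y] - gap_penalty,       # gap in b
            )
    return score[-1][-1]


print(global_alignment_score("TCCA", "TCGCA"))  # 3: four matches and one gap
```

Storing, for each cell, which of the three terms gave the maximum would allow the alignment itself to be traced back from the lower right corner.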

Model decoding (Viterbi)

• Example: 1245666. What was the series of dice used to generate this output?

Roll   1      2      4   5   6   6   6
F     -0.78
L     Null   -3.08

Model decoding (Viterbi)

Roll   1      2      4      5   6   6   6
F     -0.78  -1.58
L     Null   -3.08  -3.88

$\log P_L(4) = -1 - 0.05 - 3.08 = -4.13$ or

$\log P_L(4) = -1 - 1.3 - 1.58 = -3.88$

The larger value, -3.88, is kept.


Model decoding (Viterbi)

Roll   1      2      4      5      6      6      6
F     -0.78  -1.58  -2.38  -3.18  -3.98  -4.78  -5.58
L     Null   -3.08  -3.88  -4.68  -4.78  -5.13  -5.48

Series of dice is FFFFLLL
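The table can be reproduced with a small Viterbi sketch. It uses the rounded log10 values of the log model and, as in the table, forces the path to start in the Fair state (Loaded is “Null” at the first roll). The names `EMIT`, `TRANS` and `viterbi` are chosen here for illustration.

```python
# Rounded log10 emission scores from the log model.
EMIT = {
    "F": {face: -0.78 for face in "123456"},
    "L": {**{face: -1.0 for face in "12345"}, "6": -0.3},
}
# Rounded log10 transition scores, keyed by (from_state, to_state).
TRANS = {("F", "F"): -0.02, ("F", "L"): -1.3, ("L", "L"): -0.05, ("L", "F"): -1.0}
STATES = "FL"


def viterbi(rolls, start="F"):
    """Return the best state path and the score table (one dict per roll)."""
    score = {s: EMIT[s][rolls[0]] if s == start else float("-inf") for s in STATES}
    table, pointers = [score], []
    for face in rolls[1:]:
        new_score, pointer = {}, {}
        for s in STATES:
            best_prev = max(STATES, key=lambda p: score[p] + TRANS[(p, s)])
            new_score[s] = score[best_prev] + TRANS[(best_prev, s)] + EMIT[s][face]
            pointer[s] = best_prev
        table.append(new_score)
        pointers.append(pointer)
        score = new_score
    # Trace back from the best final state.
    path = [max(STATES, key=lambda s: score[s])]
    for pointer in reversed(pointers):
        path.append(pointer[path[-1]])
    return "".join(reversed(path)), table


path, table = viterbi("1245666")
print(path)  # FFFFLLL
```

The switch to the loaded die is placed at the first 6 because entering Loaded there (-3.18 - 1.3 = -4.48) beats staying in Loaded from the previous roll (-4.68 - 0.05 = -4.73).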

HMMs and weight matrices

• In the case of ungapped alignments, HMMs become simple weight matrices

HMM construction

[Figure: the HMM from the alignment example, with the insert state crossed out (X).]

HMM construction

[Figure: the same HMM with the insert state removed. Six match states, each emitting A, C, G or T; all transitions have probability 1.]

ACA---ATG score = 0.8×1×0.8×1×0.8×1×1×1×0.8×1×0.2 = 8.2×10^-2 or

Log-score = log(0.8)+log(0.8)+log(0.8)+log(1)+log(0.8)+log(0.2)
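With all transitions equal to 1, scoring reduces to a weight-matrix lookup. The sketch below sums log emission probabilities over the six core positions; the natural logarithm and the names `WEIGHT_MATRIX` and `log_score` are assumptions made for illustration.

```python
import math

# Emission probabilities of the six core columns, used as a weight matrix.
WEIGHT_MATRIX = [
    {"A": 0.8, "T": 0.2},
    {"C": 0.8, "G": 0.2},
    {"A": 0.8, "C": 0.2},
    {"A": 1.0},
    {"T": 0.8, "G": 0.2},
    {"C": 0.8, "G": 0.2},
]


def log_score(core_sequence):
    """Sum of log emission probabilities; -inf if a residue was never observed."""
    total = 0.0
    for column, base in zip(WEIGHT_MATRIX, core_sequence):
        p = column.get(base, 0.0)
        if p == 0.0:
            return float("-inf")
        total += math.log(p)
    return total


print(log_score("ACAATG"))            # core residues of ACA---ATG
print(math.exp(log_score("ACAATG")))  # back to the product of probabilities
```

A residue that was never seen in a column gives a score of minus infinity, which is one reason pseudo counts are needed in practice.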

HMMs and weight matrices

• In the case of ungapped alignments, HMMs become simple weight matrices
• To achieve high performance, the emission frequencies are estimated using the techniques of
  – Sequence weighting
  – Pseudo counts
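To illustrate the idea of pseudo counts, the sketch below uses the simplest possible scheme: a constant added to every count (Laplace-style). This is an assumption for illustration only; the slides do not say which pseudo-count method is meant, and sequence weighting is not shown. The function name `column_frequencies` is hypothetical.

```python
def column_frequencies(column, alphabet="ACGT", pseudocount=1.0):
    """Emission frequencies for one alignment column with additive pseudo counts."""
    residues = [c for c in column if c != "-"]
    total = len(residues) + pseudocount * len(alphabet)
    return {a: (residues.count(a) + pseudocount) / total for a in alphabet}


# First column of the five-sequence alignment: A, T, A, A, A.
print(column_frequencies("ATAAA"))
# Without pseudo counts the unobserved C and G would get probability zero.
print(column_frequencies("ATAAA", pseudocount=0.0))
```

With a pseudo count of 1, A drops from 0.8 to 5/9 and the unobserved C and G each receive 1/9, so no residue is ruled out by a small sample.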

HMMs. What are they good for?

• Weight matrices do not deal with insertions and deletions

• In alignments, this is done in an ad hoc manner by optimization of the two gap penalties for first gap and gap extension

• An HMM is a natural framework where insertions/deletions are dealt with explicitly

Profile HMMs

• Alignments based on conventional scoring matrices (BLOSUM62) score all positions in a sequence in an equal manner

• Some positions are highly conserved, some are highly variable (more than what is described in the BLOSUM matrix)

• Profile HMMs are ideally suited to describe such position-specific variations

What goes wrong when Blast fails?

• Conventional sequence alignment uses a (Blosum) scoring matrix to identify amino acid matches in the two protein sequences

Alignment scoring matrices

• Blosum62 score matrix. Fg=1. Ng=0?

     L   A   G   D   S   D
F
I
G
D
S
L

Alignment scoring matrices

• Blosum62 score matrix. Fg=1. Ng=0?
• Score = 2+6+6+4-1 = 17

     L   A   G   D   S   D
F    0  -2  -3  -3  -2  -3
I    2  -1  -4  -3  -2  -3
G   -4   0   6  -1   0  -1
D   -4  -2  -1   6   0   6
S   -2   1   0   0   4   0
L    4  -1  -4  -4  -2  -4

LAGDS
I-GDS
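The score of 17 can be checked by summing the substitution scores along the alignment and subtracting the gap penalty. The sketch below is an illustration only: it includes just the matrix entries needed for this alignment, charges the first-gap penalty (Fg = 1) for every gap column, and does not model gap extension. The names are hypothetical.

```python
# Subset of the Blosum62 entries from the table, keyed by (top, bottom) residue.
BLOSUM62_SUBSET = {("L", "I"): 2, ("G", "G"): 6, ("D", "D"): 6, ("S", "S"): 4}


def score_alignment(top, bottom, first_gap=1):
    """Sum substitution scores; each gap column costs the first-gap penalty."""
    total = 0
    for a, b in zip(top, bottom):
        if "-" in (a, b):
            total -= first_gap
        else:
            total += BLOSUM62_SUBSET[(a, b)]
    return total


print(score_alignment("LAGDS", "I-GDS"))  # 2 - 1 + 6 + 6 + 4 = 17
```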

What goes wrong when Blast fails?

• Conventional sequence alignment uses a (Blosum) scoring matrix to identify amino acid matches in the two protein sequences
• This scoring matrix is identical at all positions in the protein sequence!

[Figure: the sequence EVVFIGDSLVQLMHQC with six positions marked X, shown with the fragments AGDS. and GGGDS]

When Blast works!

[Figure: 1PLC._ and 1PLB._]

When Blast fails!

[Figure: 1PLC._ and 1PMY._]

Sequence profiles

• In reality not all positions in a protein are equally likely to mutate

• Some amino acids (active sites) are highly conserved, and the penalty for a mismatch must be very high

• Other amino acids can mutate almost for free, and the penalty for a mismatch should be lower than the BLOSUM score

• Sequence profiles can capture these differences

Profile HMMs

ADDGSLAFVPSEF--SISPGEKIVFKNNAGFPHNIVFDEDSIPSGVDASKISMSEEDLLN
TVNGAI--PGPLIAERLKEGQNVRVTNTLDEDTSIHWHGLLVPFGMDGVPGVSFPG---I
-TSMAPAFGVQEFYRTVKQGDEVTVTIT-----NIDQIED-VSHGFVVVNHGVSME---I
IE--KMKYLTPEVFYTIKAGETVYWVNGEVMPHNVAFKKGIV--GEDAFRGEMMTKD---
-TSVAPSFSQPSF-LTVKEGDEVTVIVTNLDE------IDDLTHGFTMGNHGVAME---V
ASAETMVFEPDFLVLEIGPGDRVRFVPTHK-SHNAATIDGMVPEGVEGFKSRINDE----
TKAVVLTFNTSVEICLVMQGTSIV----AAESHPLHLHGFNFPSNFNLVDPMERNTAGVP
TVNGQ--FPGPRLAGVAREGDQVLVKVVNHVAENITIHWHGVQLGTGWADGPAYVTQCPI

Annotations on the alignment:
• Core: positions with < 2 gaps
• Conserved: must have a G
• Non-conserved: anything can match
• Insertion
• Deletion
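The core and conserved positions can be found directly from the alignment. The sketch below assumes the alignment is eight rows of 60 columns each, which is how the flattened text splits cleanly; the function names are chosen for illustration. “Core” follows the slide's definition of fewer than 2 gaps per column.

```python
ALIGNMENT = [
    "ADDGSLAFVPSEF--SISPGEKIVFKNNAGFPHNIVFDEDSIPSGVDASKISMSEEDLLN",
    "TVNGAI--PGPLIAERLKEGQNVRVTNTLDEDTSIHWHGLLVPFGMDGVPGVSFPG---I",
    "-TSMAPAFGVQEFYRTVKQGDEVTVTIT-----NIDQIED-VSHGFVVVNHGVSME---I",
    "IE--KMKYLTPEVFYTIKAGETVYWVNGEVMPHNVAFKKGIV--GEDAFRGEMMTKD---",
    "-TSVAPSFSQPSF-LTVKEGDEVTVIVTNLDE------IDDLTHGFTMGNHGVAME---V",
    "ASAETMVFEPDFLVLEIGPGDRVRFVPTHK-SHNAATIDGMVPEGVEGFKSRINDE----",
    "TKAVVLTFNTSVEICLVMQGTSIV----AAESHPLHLHGFNFPSNFNLVDPMERNTAGVP",
    "TVNGQ--FPGPRLAGVAREGDQVLVKVVNHVAENITIHWHGVQLGTGWADGPAYVTQCPI",
]


def columns(rows):
    """Yield (index, column string) for each alignment column."""
    for i in range(len(rows[0])):
        yield i, "".join(row[i] for row in rows)


def core_columns(rows, max_gaps=1):
    """Columns with fewer than 2 gaps (0-based indices)."""
    return [i for i, col in columns(rows) if col.count("-") <= max_gaps]


def conserved_columns(rows):
    """Columns in which every sequence has the same residue."""
    return [i for i, col in columns(rows) if "-" not in col and len(set(col)) == 1]


print(len(core_columns(ALIGNMENT)), "core columns")
print([(i + 1, ALIGNMENT[0][i]) for i in conserved_columns(ALIGNMENT)])
```

Column 20 (1-based) holds a G in all eight sequences, which matches the “must have a G” annotation.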

HMM vs. alignment

• Detailed description of core
  – Conserved/variable positions
• Price for insertions/deletions varies at different locations in sequence
• These features cannot be captured in conventional alignments

Profile-profile scoring matrix

[Figure: 1K7C.A and 1WAB._]

Profile HMMs

All M/D pairs must be visited once

Sequence:  L1  -   Y2  A3  V4  R5  -   I6
State:     P1  D2  P3  P4  I4  P5  D6  P7

Example. Sequence profiles

• Alignment of protein sequences 1PLC._ and 1GYC.A
• E-value > 1000
• Profile alignment
  – Align 1PLC._ against Swiss-prot
  – Make position-specific weight matrix from alignment
  – Use this matrix to align 1PLC._ against 1GYC.A
• E-value < 10^-22. Rmsd = 3.3 Å

Example continued

Score = 97.1 bits (241), Expect = 9e-22
Identities = 13/107 (12%), Positives = 27/107 (25%), Gaps = 17/107 (15%)

Query: 3  ADDGSLAFVPSEFSISPGEKI------VFKNNAGFPHNIVFDEDSIPSGVDASKIS 56
          F + G++ N+ + +G + +
Sbjct: 26 ------VFPSPLITGKKGDRFQLNVVDTLTNHTMLKSTSIHWHGFFQAGTNWADGP 79

Query: 57 MSEEDLLNAKGETFEVAL---SNKGEYSFYCSP--HQGAGMVGKVTV 98
          A G +F G + ++ G+ G V
Sbjct: 80 AFVNQCPIASGHSFLYDFHVPDQAGTFWYHSHLSTQYCDGLRGPFVV 126

Rmsd = 3.3 Å. Model: red. Structure: blue.

HMMs. What are they good for II

• Transmembrane helix proteins

TMHMM. A. Krogh, 2001


Gene Finding

HMM packages

• HMMER (http://hmmer.wustl.edu/)
  – S.R. Eddy, WashU St. Louis. Freely available.
• SAM (http://www.cse.ucsc.edu/research/compbio/sam.html)
  – R. Hughey, K. Karplus, A. Krogh, D. Haussler and others, UC Santa Cruz. Freely available to academia, nominal license fee for commercial users.
• META-MEME (http://metameme.sdsc.edu/)
  – William Noble Grundy, UC San Diego. Freely available. Combines features of PSSM search and profile HMM search.
• NET-ID, HMMpro (http://www.netid.com/html/hmmpro.html)
  – Freely available to academia, nominal license fee for commercial users.
  – Allows HMM architecture construction.
• EasyGibbs (http://www.cbs.dtu.dk/biotools/EasyGibbs/)
  – Webserver for Gibbs sampling of protein sequences