Profile Hidden Markov Models
PHMM
Mark Stamp
Hidden Markov Models
Here, we assume you know about HMMs
  o If not, see “A revealing introduction to hidden Markov models”
Executive summary of HMMs
  o HMM is a machine learning technique…
  o …and a discrete hill climb technique
  o Train model based on observation sequence
  o Score any given sequence to determine how closely it matches the model
  o Efficient algorithms, many, many useful apps
HMM Notation
Recall, HMM model denoted λ = (A,B,π)
Observation sequence is O
Notation:
Hidden Markov Models
Among the many uses for HMMs…
Speech analysis
Music search engine
Malware detection
Intrusion detection systems (IDS)
And more all the time
Limitations of HMMs
Positional information not considered
  o HMM has no “memory” beyond previous state
  o Higher order models have more “memory”
  o But no explicit use of positional information
With HMM, no insertions or deletions
These limitations are serious problems in some applications
  o In bioinformatics string comparison, sequence alignment is critical
  o Also, insertions and deletions can occur
Profile HMM
Profile HMM (PHMM) designed to overcome limitations on previous slide
  o In some ways, PHMM easier than HMM
  o In some ways, PHMM more complex
The basic idea of PHMM?
  o Define multiple B matrices
  o Almost like having an HMM for each position in sequence
PHMM
In bioinformatics, begin by aligning multiple related sequences
  o Multiple sequence alignment (MSA)
  o Analogous to training phase for HMM
Generate PHMM based on given MSA
  o This is easy, once MSA is known
  o Again, hard part is generating MSA
Then can score sequences using PHMM
  o Use forward algorithm, similar to HMM
Training: PHMM vs HMM
Training PHMM
  o Determine MSA: nontrivial
  o Determine PHMM matrices: trivial
Training HMM
  o Append training sequences: trivial
  o Determine HMM matrices: nontrivial
PHMM and HMM are, in this sense, opposites…
Generic View of PHMM
Have delete, insert, and match states
  o Match states correspond to HMM states
Arrows are possible transitions
  o Each transition has a probability
Transition probabilities are A matrix
Emission probabilities are B matrices
  o In PHMM, observations are emissions
  o Match and insert states have emissions
Generic View of PHMM
Circles are delete states, diamonds are insert states, squares are match states
Also, begin and end states
PHMM Notation
Notation
PHMM
Match state probabilities easily determined from MSA
  o a_{M_i,M_{i+1}}: transitions between match states
  o e_{M_i}(k): emission probability at match state
Many other transition probabilities
  o For example, a_{M_i,I_i} and a_{M_i,D_{i+1}}
Emissions at all match & insert states
  o Remember, emission == observation
Multiple Sequence Alignment
First we show MSA construction
  o This is the difficult part
  o Lots of ways to do this
  o “Best” way depends on specific problem
Then construct PHMM from MSA
  o This is the easy part
  o Standard algorithm for this
How to score a sequence?
  o Forward algorithm, similar to HMM
MSA
How to construct MSA?
  o Construct pairwise alignments
  o Combine pairwise alignments into MSA
Allow gaps to be inserted
  o To make better matches
Gaps tend to weaken PHMM scoring
  o So, tradeoff between number of gaps and strength of score
Global vs Local Alignment
In these pairwise alignment examples
  o “-” is gap
  o “|” means elements aligned
  o “*” for omitted beginning/ending symbols
Global vs Local Alignment
Global alignment is lossless
  o But gaps tend to proliferate
  o And gaps increase when we do MSA
  o More gaps, more random sequences match…
  o …and result is less useful for scoring
We usually only consider local alignment
  o That is, omit ends for better alignment
For simplicity, assume global alignment in examples presented here
Pairwise Alignment
Allow gaps when aligning
How to score an alignment?
  o Based on n x n substitution matrix S
  o Where n is number of symbols
What algorithm(s) to align sequences?
  o Usually, dynamic programming
  o Sometimes, HMM is used
  o Other?
Local alignment? Additional issues arise…
Pairwise Alignment Example
Tradeoff gaps vs misaligned elements
  o Depends on matrix S and gap penalty
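The tradeoff can be made concrete by scoring two candidate alignments of the same pair of sequences. The sketch below is illustrative only: the function names, the match/mismatch scores, and the per-gap penalty are assumptions, not values from the slides.

```python
def simple_sub(a, b):
    """Hypothetical substitution score: +2 for a match, -1 for a mismatch."""
    return 2 if a == b else -1


def score_alignment(ax, ay, sub, gap_penalty):
    """Score a gapped alignment column by column.

    ax, ay: equal-length strings that use "-" for a gap.
    gap_penalty: positive cost charged for every gap symbol (linear penalty).
    """
    if len(ax) != len(ay):
        raise ValueError("aligned sequences must have equal length")
    total = 0
    for a, b in zip(ax, ay):
        if a == "-" or b == "-":
            total -= gap_penalty
        else:
            total += sub(a, b)
    return total


# Two ways to align EGC with EC: a gap in the middle or a mismatch plus end gap.
with_gap = score_alignment("EGC", "E-C", simple_sub, gap_penalty=1)
with_mismatch = score_alignment("EGC", "EC-", simple_sub, gap_penalty=1)
```

Raising the gap penalty or softening the mismatch score changes which alignment wins, which is the dependence on S and the gap penalty noted above.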
Substitution Matrix
For example, masquerade detection
  o Detect imposter using computer account
Consider 4 different operations
  o E == send email
  o G == play games
  o C == C programming
  o J == Java programming
How similar are these to each other?
Substitution Matrix
Consider 4 different operations:
  o E, G, C, J
Possible substitution matrix:
Diagonal matches
  o High positive scores
Which others most similar?
  o J and C, so substituting C for J is a high score
Game playing/programming, very different
  o So substituting G for C is a negative score
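A substitution matrix with these properties might be represented as a symmetric lookup table. The numeric scores below are hypothetical and chosen only to follow the stated pattern (high diagonal, high C/J score, negative G/C score); they are not the matrix from the slide.

```python
# Hypothetical scores for the four operations E, G, C, J.
S = {
    ("E", "E"): 9, ("E", "G"): -4, ("E", "C"): 2, ("E", "J"): 2,
    ("G", "G"): 9, ("G", "C"): -5, ("G", "J"): -5,
    ("C", "C"): 9, ("C", "J"): 7,
    ("J", "J"): 9,
}


def score(a, b):
    """Look up the substitution score, treating the matrix as symmetric."""
    return S[(a, b)] if (a, b) in S else S[(b, a)]
```

Storing only the upper triangle keeps the table symmetric by construction, so `score("J", "C")` and `score("C", "J")` always agree.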
Substitution Matrix
Depending on problem, might be easy or very difficult to find useful S matrix
Consider masquerade detection based on UNIX commands
  o Sometimes difficult to say how “close” 2 commands are
Suppose instead, aligning DNA sequences
  o Biological reasons for S matrix
Gap Penalty
Generally must allow gaps to be inserted
But gaps make alignment more generic
  o Less useful for scoring, so we penalize gaps
How to penalize gaps?
Linear gap penalty function:
  g(x) = ax (constant penalty for every gap)
Affine gap penalty function:
  g(x) = a + b(x - 1)
  o Gap opening penalty a and constant penalty of b for each extension of existing gap
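The two penalty functions translate directly into code. In this minimal sketch, x is the gap length, and returning 0 for a gap of length 0 is an added assumption for convenience; the parameter values in the comparison are arbitrary.

```python
def linear_gap(x, a):
    """Linear penalty g(x) = a*x: every gap symbol costs the same."""
    return a * x


def affine_gap(x, a, b):
    """Affine penalty g(x) = a + b*(x - 1): a to open, b per extension.

    Returns 0 when there is no gap (x == 0); this case is an assumption,
    since the formula itself is only meaningful for x >= 1.
    """
    if x == 0:
        return 0
    return a + b * (x - 1)


# With a = 5 and b = 1, one long gap is cheaper than under a linear penalty of 5.
long_gap_linear = linear_gap(4, a=5)
long_gap_affine = affine_gap(4, a=5, b=1)
```

With b smaller than a, the affine form favors a few long gaps over many short ones.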
Pairwise Alignment Algorithm
We use dynamic programming
  o Based on S matrix, gap penalty function
Notation:
Pairwise Alignment DP
Initialization:
Recursion:
where
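As an illustration of the initialization and recursion steps, here is a minimal global alignment sketch in the Needleman-Wunsch style. It assumes a linear gap penalty and a hypothetical match/mismatch scoring function, so it is a simplified stand-in and not necessarily the exact recursion shown on the slide.

```python
def simple_sub(a, b):
    """Hypothetical substitution score: +2 for a match, -1 for a mismatch."""
    return 2 if a == b else -1


def global_align(x, y, sub, gap):
    """Return (score, aligned_x, aligned_y) using a linear gap penalty."""
    n, m = len(x), len(y)
    F = [[0] * (m + 1) for _ in range(n + 1)]
    # Initialization: aligning a prefix against nothing costs one gap per symbol.
    for i in range(1, n + 1):
        F[i][0] = -gap * i
    for j in range(1, m + 1):
        F[0][j] = -gap * j
    # Recursion: best of substitute, gap in y, or gap in x.
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            F[i][j] = max(
                F[i - 1][j - 1] + sub(x[i - 1], y[j - 1]),
                F[i - 1][j] - gap,
                F[i][j - 1] - gap,
            )
    # Traceback from the bottom-right cell to recover one optimal alignment.
    ax, ay = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and F[i][j] == F[i - 1][j - 1] + sub(x[i - 1], y[j - 1]):
            ax.append(x[i - 1]); ay.append(y[j - 1]); i -= 1; j -= 1
        elif i > 0 and F[i][j] == F[i - 1][j] - gap:
            ax.append(x[i - 1]); ay.append("-"); i -= 1
        else:
            ax.append("-"); ay.append(y[j - 1]); j -= 1
    return F[n][m], "".join(reversed(ax)), "".join(reversed(ay))


result = global_align("EGC", "EC", simple_sub, gap=1)
```

Supporting an affine gap penalty would require tracking whether a gap is already open, typically with additional tables; that extension is omitted here.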
MSA from Pairwise Alignments
Given pairwise alignments…
How to construct MSA?
Generally use “progressive alignment”
  o Select one pairwise alignment
  o Select another and combine with first
  o Continue to add more until all are combined
Relatively easy (good)
Gaps proliferate, and it’s unstable (bad)
MSA from Pairwise Alignments
Lots of ways to improve on generic progressive alignment
  o Here, we mention one such approach
  o Not necessarily “best” or most popular
Feng-Doolittle progressive alignment
  o Compute scores for all pairs of n sequences
  o Select n-1 alignments that a) “connect” all sequences and b) maximize pairwise scores
  o Then generate a minimum spanning tree
  o For MSA, add sequences in the order that they appear in the spanning tree
MSA Construction
Create pairwise alignments
  o Generate substitution matrix S
  o Dynamic program for pairwise alignments
Use pairwise alignments to make MSA
  o Use pairwise alignments to construct spanning tree (e.g., Prim’s Algorithm)
  o Add sequences in spanning tree order (from high score, insert gaps as needed)
  o Note: gap penalty is used here
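The spanning tree step can be sketched with a Prim-style greedy procedure that starts from the highest-scoring pair and repeatedly attaches the outside sequence with the best score to any sequence already in the tree. The scores below are made up for illustration, sequences are assumed to be numbered 1..n, and the function name is hypothetical.

```python
def spanning_tree_order(n, scores):
    """Return pairs (in_tree, new) in the order they join the spanning tree.

    scores maps (i, j) with i < j to the pairwise alignment score.
    Higher scores are treated as better, so this grows a maximum-score tree.
    """
    def s(i, j):
        return scores[(min(i, j), max(i, j))]

    first = max(scores, key=scores.get)  # start from the best pair
    in_tree = set(first)
    order = [first]
    while len(in_tree) < n:
        best = max(
            ((i, j) for i in in_tree for j in range(1, n + 1) if j not in in_tree),
            key=lambda edge: s(*edge),
        )
        order.append(best)
        in_tree.add(best[1])
    return order


# Hypothetical scores for 4 sequences (not the 10-sequence example in the slides).
example_scores = {
    (1, 2): 50, (1, 3): 20, (1, 4): 10,
    (2, 3): 40, (2, 4): 30, (3, 4): 60,
}
order = spanning_tree_order(4, example_scores)
```

Each returned pair names a sequence already in the tree followed by the sequence being added, which is the order in which sequences would be merged into the MSA.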
MSA Example
Suppose 10 sequences, with the following pairwise alignment scores
MSA Example: Spanning Tree
Spanning tree based on scores
So process pairs in following order: (5,4), (5,8), (8,3), (3,2), (2,7), (2,1), (1,6), (6,10), (10,9)
MSA Snapshot
Intermediate step and final
  o Use “+” for neutral symbol
  o Then “-” for gaps in MSA
Note increase in gaps
PHMM from MSA
In PHMM, determine match and insert states & probabilities from MSA
“Conservative” columns == match states
  o Half or less of symbols are gaps
Other columns are insert states
  o Majority of symbols are gaps
Delete states are a separate issue
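The column rule can be sketched as a short function that labels each MSA column as a match ("M") or insert ("I") column. The small MSA below is a made-up example and the function name is hypothetical.

```python
def classify_columns(msa):
    """Label each column 'M' if half or less of its symbols are gaps, else 'I'."""
    n_rows = len(msa)
    labels = []
    for column in zip(*msa):  # iterate over columns of the alignment
        gaps = column.count("-")
        labels.append("M" if 2 * gaps <= n_rows else "I")
    return labels


# Hypothetical MSA of five aligned sequences, "-" marks a gap.
msa = [
    "EC---G",
    "EG-J-G",
    "EC--CG",
    "-CE--G",
    "EJ---G",
]
labels = classify_columns(msa)
```

Consecutive "I" columns would then be merged into a single insert state sitting between the neighboring match states.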
PHMM States from MSA
Consider a simpler MSA…
Columns 1,2,6 are match states 1,2,3, respectively
  o Since less than half gaps
Columns 3,4,5 are combined to form insert state 2
  o Since more than half gaps
  o Match states between insert
Probabilities from MSA
Emission probabilities
  o Based on symbol distribution in match and insert states
State transition probs
  o Based on transitions in the MSA
Probabilities from MSA
Emission probabilities:
But 0 probabilities are bad
  o Model overfits the data
  o So, use “add one” rule
  o Add one to each numerator, add total to denominators
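The add-one rule can be sketched as follows: count the non-gap symbols in a column, add one to every symbol's count, and add the alphabet size to the denominator. The column and alphabet below are hypothetical and the function name is an assumption.

```python
from collections import Counter
from fractions import Fraction


def emission_probs(column_symbols, alphabet):
    """Add-one (Laplace) estimate of emission probabilities for one state.

    Gaps ("-") are ignored. Every symbol in the alphabet gets count + 1,
    and the denominator grows by the alphabet size so the result sums to 1.
    """
    counts = Counter(s for s in column_symbols if s != "-")
    total = sum(counts.values())
    return {a: Fraction(counts[a] + 1, total + len(alphabet)) for a in alphabet}


# Hypothetical match column containing E four times and one gap.
probs = emission_probs("EEE-E", alphabet="EGCJ")
```

Without smoothing, G, C, and J would get probability 0 here, and any sequence emitting one of them in this state would score 0.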
Probabilities from MSA
More emission probabilities:
But 0 probabilities still bad
  o Model overfits the data
  o Again, use “add one” rule
  o Add one to each numerator, add total to denominators
Probabilities from MSA
Transition probabilities:
We look at some examples
  o Note that “-” is delete state
First, consider begin state:
Again, use add one rule
Probabilities from MSA
Transition probabilities
When no information in MSA, set probs to uniform
For example I1 does not appear in MSA, so
Probabilities from MSA
Transition probabilities, another example
What about transitions from state D1?
Can only go to M2, so
Again, use add one rule:
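The same add-one idea applies to transition counts. In the sketch below, the counts are hypothetical (one observed D1 to M2 transition, and a state with no observed transitions at all), and the set of successor states is an assumption based on the usual PHMM layout.

```python
from fractions import Fraction


def transition_probs(counts):
    """Add-one estimate over the possible successor states of one state.

    counts maps each allowed successor state to how often the MSA shows
    that transition. A state never seen in the MSA has all-zero counts,
    and the add-one rule then yields a uniform distribution.
    """
    total = sum(counts.values())
    k = len(counts)
    return {state: Fraction(c + 1, total + k) for state, c in counts.items()}


# Hypothetical: D1 was seen once, going to M2.
from_d1 = transition_probs({"M2": 1, "I1": 0, "D2": 0})
# Hypothetical: I1 never appears, so no transitions out of it were observed.
from_i1 = transition_probs({"M2": 0, "I1": 0, "D2": 0})
```

Note that the uniform case for unseen states falls out of the same formula, so no special handling is needed.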
PHMM Emission Probabilities
Emission probabilities for the given MSA
  o Using add-one rule
PHMM Transition Probabilities
Transition probabilities for the given MSA
  o Using add-one rule
PHMM Summary
Construct pairwise alignments
  o Usually, use dynamic programming
Use these to construct MSA
  o Lots of ways to do this
Using MSA, determine probabilities
  o Emission probabilities
  o State transition probabilities
Then we have trained a PHMM
  o Now what???
PHMM Scoring
Want to score sequences to see how closely they match PHMM
How did we score using HMM?
  o Forward algorithm
How to score sequences with PHMM?
  o Forward algorithm (surprised?)
But, algorithm is a little more complex
  o Due to more complex state transitions
Forward Algorithm
Notation
  o Indices i and j are columns in MSA
  o x_i is ith observation (emission) symbol
  o q_{x_i} is distribution of x_i in “random model”
  o Base case is F^M_0(0) = 0
  o F^M_j(i) is score of x_1,…,x_i up to state j (note that in PHMM, i and j may not agree)
  o Some states undefined
  o Undefined states ignored in calculation
Forward Algorithm
Compute P(X|λ) recursively
Note that F^M_j(i) depends on F^M_{j-1}(i-1), F^I_{j-1}(i-1), and F^D_{j-1}(i-1)
  o And corresponding state transition probs
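For reference, the log-odds forward recursion in the form given by Durbin et al. can be written as below. It is assumed here that the slide follows this textbook form; a denotes transition probabilities, e emission probabilities, and q the random-model distribution.

```latex
\begin{align*}
F^M_j(i) &= \log\frac{e_{M_j}(x_i)}{q_{x_i}}
  + \log\Big[ a_{M_{j-1},M_j}\, e^{F^M_{j-1}(i-1)}
            + a_{I_{j-1},M_j}\, e^{F^I_{j-1}(i-1)}
            + a_{D_{j-1},M_j}\, e^{F^D_{j-1}(i-1)} \Big] \\
F^I_j(i) &= \log\frac{e_{I_j}(x_i)}{q_{x_i}}
  + \log\Big[ a_{M_j,I_j}\, e^{F^M_j(i-1)}
            + a_{I_j,I_j}\, e^{F^I_j(i-1)}
            + a_{D_j,I_j}\, e^{F^D_j(i-1)} \Big] \\
F^D_j(i) &= \log\Big[ a_{M_{j-1},D_j}\, e^{F^M_{j-1}(i)}
            + a_{I_{j-1},D_j}\, e^{F^I_{j-1}(i)}
            + a_{D_{j-1},D_j}\, e^{F^D_{j-1}(i)} \Big]
\end{align*}
```

Match and insert states consume a symbol, so they look back to position i-1, while delete states emit nothing and stay at position i.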
PHMM
We will see examples of PHMM later
In particular,
  o Malware detection based on opcodes
  o Masquerade detection based on UNIX commands
References
Durbin, et al., Biological Sequence Analysis: Probabilistic Models of Proteins and Nucleic Acids
L. Huang and M. Stamp, Masquerade detection using profile hidden Markov models, Computers & Security, 30(8):732-747, 2011
S. Attaluri, S. McGhee, and M. Stamp, Profile hidden Markov models for metamorphic virus detection, Journal in Computer Virology, 5(2):151-169, 2009