PHMM 1
Introduction to Profile Hidden Markov Models
Mark Stamp
PHMM 2
Hidden Markov Models
Here, we assume you know about HMMs
If not, see “A revealing introduction to hidden Markov models”
Executive summary of HMMs
HMM is a machine learning technique
Also, a discrete hill climb technique
Train model based on observation sequence
Score given sequence to see how closely it matches the model
Efficient algorithms, many useful applications
PHMM 3
HMM Notation
Recall, HMM model denoted λ = (A,B,π)
Observation sequence is O
Notation:
PHMM 4
Hidden Markov Models
Among the many uses for HMMs…
Speech analysis
Music search engine
Malware detection
Intrusion detection systems (IDS)
Many more, and more all the time
PHMM 5
Limitations of HMMs
Positional information not considered
HMM has no “memory”
Higher order models have some memory
But no explicit use of positional information
Does not handle insertions or deletions
These limitations are serious problems in some applications
In bioinformatics string comparison, sequence alignment is critical
Also, insertions and deletions occur
PHMM 6
Profile HMM
Profile HMM (PHMM) designed to overcome limitations on previous slide
In some ways, PHMM easier than HMM
In some ways, PHMM more complex
The basic idea of PHMM
Define multiple B matrices
Almost like having an HMM for each position in sequence
PHMM 7
PHMM
In bioinformatics, begin by aligning multiple related sequences
Multiple sequence alignment (MSA)
This is like training phase for HMM
Generate PHMM based on given MSA
Easy, once MSA is known
Hard part is generating MSA
Then can score sequences using PHMM
Use forward algorithm, like HMM
PHMM 8
Generic View of PHMM
Circles are Delete states
Diamonds are Insert states
Rectangles are Match states
Match states correspond to HMM states
Arrows are possible transitions
Each transition has associated probability
Transition probabilities are A matrix
Emission probabilities are B matrices
In PHMM, observations are emissions
Match and insert states have emissions
PHMM 9
Generic View of PHMM
Circles are Delete states, diamonds are Insert states, rectangles are Match states
Also, begin and end states
PHMM 10
PHMM Notation
Notation:
PHMM 11
PHMM
Match state probabilities easily determined from MSA, that is
a_{M_i,M_{i+1}}: transitions between match states
e_{M_i}(k): emission probability at match state
Note: other transition probabilities
For example, a_{M_i,I_i} and a_{M_i,D_{i+1}}
Emissions at all match & insert states
Remember, emission == observation
PHMM 12
MSA
First we show MSA construction
This is the difficult part
Lots of ways to do this
“Best” way depends on specific problem
Then construct PHMM from MSA
The easy part
Standard algorithm for this
How to score a sequence?
Forward algorithm, similar to HMM
PHMM 13
MSA
How to construct MSA?
Construct pairwise alignments
Combine pairwise alignments to obtain MSA
Allow gaps to be inserted
Makes better matches
But gaps tend to weaken scoring
So there is a tradeoff
PHMM 14
Global vs Local Alignment
In these pairwise alignment examples
“-” is gap
“|” are aligned
“*” omitted beginning and ending symbols
PHMM 15
Global vs Local Alignment
Global alignment is lossless
But gaps tend to proliferate
And gaps increase when we do MSA
More gaps implies more sequences match
So, result is less useful for scoring
We usually only consider local alignment
That is, omit ends for better alignment
For simplicity, we assume global alignment here
PHMM 16
Pairwise Alignment
We allow gaps when aligning
How to score an alignment?
Based on n x n substitution matrix S
Where n is number of symbols
What algorithm(s) to align sequences?
Usually, dynamic programming
Sometimes, HMM is used
Other?
Local alignment: more issues
PHMM 17
Pairwise Alignment
Example
Note gaps vs misaligned elements Depends on S and gap penalty
PHMM 18
Substitution Matrix
Masquerade detection
Detect imposter using an account
Consider 4 different operations
E == send email
G == play games
C == C programming
J == Java programming
How similar are these to each other?
PHMM 19
Substitution Matrix
Consider 4 different operations: E, G, C, J
Possible substitution matrix:
Diagonal: matches
High positive scores
Which others most similar?
J and C, so substituting C for J is a high score
Game playing/programming, very different
So substituting G for C is a negative score
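A minimal sketch of such a substitution matrix in Python. The slide's actual matrix values are not reproduced in this transcript, so every number below is a hypothetical placeholder chosen only to follow the stated pattern: high positive diagonal, C/J similar, G versus programming negative.

```python
# Hypothetical scores (NOT the values from the original slide).
# Only the upper triangle is listed; the matrix is made symmetric below.
_UPPER = {
    ("E", "E"): 9, ("E", "G"): -4, ("E", "C"): 2, ("E", "J"): 2,
    ("G", "G"): 9, ("G", "C"): -5, ("G", "J"): -5,
    ("C", "C"): 9, ("C", "J"): 7,
    ("J", "J"): 9,
}

# Full symmetric substitution matrix S, keyed by (symbol, symbol).
S = {}
for (a, b), value in _UPPER.items():
    S[(a, b)] = value
    S[(b, a)] = value


def substitution_score(a, b):
    """Score for aligning symbol a against symbol b."""
    return S[(a, b)]
```

With these placeholder values, `substitution_score("C", "J")` is high and `substitution_score("G", "C")` is negative, matching the reasoning above.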
PHMM 20
Substitution Matrix
Depending on problem, might be easy or very difficult to get useful S matrix
Consider masquerade detection based on UNIX commands
Sometimes difficult to say how “close” 2 commands are
Suppose aligning DNA sequences
Biological rationale for closeness of symbols
PHMM 21
Gap Penalty
Generally must allow gaps to be inserted
But gaps make alignment more generic
So, less useful for scoring
Therefore, we penalize gaps
How to penalize gaps?
Linear gap penalty function: f(g) = dg, where g is the gap length (i.e., constant penalty d per gap position)
Affine gap penalty function: f(g) = a + e(g – 1), with gap opening penalty a, then constant penalty e for each additional gap position
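The two penalty functions can be sketched directly in Python. This assumes g is the length of a single gap, and treats a gap of length zero as costing nothing (an assumption, since the slide only defines the formulas for actual gaps).

```python
def linear_gap_penalty(g, d):
    """Linear penalty f(g) = d * g for a gap of length g."""
    return d * g


def affine_gap_penalty(g, a, e):
    """Affine penalty f(g) = a + e * (g - 1) for a gap of length g >= 1.

    a is the gap opening penalty, e the extension penalty.
    A gap of length 0 is assumed to cost nothing.
    """
    if g < 1:
        return 0
    return a + e * (g - 1)
```

With a > e, the affine form charges more to open a gap than to extend it, so one long gap is cheaper than several short ones of the same total length.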
PHMM 22
Pairwise Alignment Algorithm
We use dynamic programming
Based on S matrix, gap penalty function
Notation:
PHMM 23
Pairwise Alignment DP
Initialization:
Recursion:
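The original initialization and recursion formulas are not reproduced in this transcript. As a stand-in, here is a minimal sketch of global pairwise alignment by dynamic programming (the standard Needleman-Wunsch scheme), assuming a linear gap penalty d rather than whatever penalty the slides use. The names `global_align` and `score` are illustrative, and integer scores are assumed so the traceback can compare values exactly.

```python
def global_align(x, y, score, d):
    """Global alignment of sequences x and y.

    score(a, b) is the substitution score; d is a linear per-position
    gap penalty (an assumption for this sketch). Returns the optimal
    score and the two aligned strings, with "-" marking gaps.
    """
    n, m = len(x), len(y)
    # F[i][j] = best score aligning x[:i] with y[:j]
    F = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):          # initialization: x against gaps
        F[i][0] = -d * i
    for j in range(1, m + 1):          # initialization: y against gaps
        F[0][j] = -d * j
    for i in range(1, n + 1):          # recursion
        for j in range(1, m + 1):
            F[i][j] = max(
                F[i - 1][j - 1] + score(x[i - 1], y[j - 1]),  # align symbols
                F[i - 1][j] - d,                              # gap in y
                F[i][j - 1] - d,                              # gap in x
            )
    # Traceback from the bottom-right corner to recover one alignment.
    ax, ay = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and F[i][j] == F[i - 1][j - 1] + score(x[i - 1], y[j - 1]):
            ax.append(x[i - 1]); ay.append(y[j - 1]); i -= 1; j -= 1
        elif i > 0 and F[i][j] == F[i - 1][j] - d:
            ax.append(x[i - 1]); ay.append("-"); i -= 1
        else:
            ax.append("-"); ay.append(y[j - 1]); j -= 1
    return F[n][m], "".join(reversed(ax)), "".join(reversed(ay))
```

For example, with a score of +1 for a match, -1 for a mismatch, and d = 1, aligning "ABC" with "AC" yields a single gap opposite "B".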
PHMM 24
MSA from Pairwise Alignments
Given pairwise alignments…
…how to construct MSA?
Generic approach is “progressive alignment”
Select one pairwise alignment
Select another and combine with first
Continue to add more until all are combined
Relatively easy (good)
Gaps may proliferate, unstable (bad)
PHMM 25
MSA from Pairwise Alignments
Lots of ways to improve on generic progressive alignment
Here, we mention one such approach
Not necessarily “best” or most popular
Feng-Doolittle progressive alignment
Compute scores for all pairs of n sequences
Select n-1 alignments that a) “connect” all sequences and b) maximize pairwise scores
Then generate a minimum spanning tree
For MSA, add sequences in the order that they appear in the spanning tree
PHMM 26
MSA Construction
Create pairwise alignments
Generate substitution matrix
Dynamic program for pairwise alignments
Use pairwise alignments to make MSA
Use pairwise alignments to construct spanning tree (e.g., Prim’s Algorithm)
Add sequences to MSA in spanning tree order (from highest score, insert gaps as needed)
Note: gap penalty is used
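The spanning tree step can be sketched with a Prim-style greedy procedure. This is one possible reading of the slide, not necessarily the exact procedure used there: start from the highest-scoring pair, then repeatedly add the highest-scoring pair that connects a new sequence to those already chosen. The function name, the `scores` layout (a dict keyed by `(i, j)` pairs), and the example scores are all hypothetical.

```python
def spanning_tree_order(scores):
    """Return pairs in the order they would be merged into the MSA.

    scores maps (i, j) sequence-index pairs to pairwise alignment
    scores. Assumes a score is given for enough pairs to connect
    every sequence (otherwise max() raises ValueError).
    """
    nodes = {v for pair in scores for v in pair}
    first = max(scores, key=scores.get)      # best pair overall
    in_tree = set(first)
    order = [first]
    while in_tree != nodes:
        # Pairs with exactly one endpoint already in the tree.
        candidates = [p for p in scores
                      if (p[0] in in_tree) != (p[1] in in_tree)]
        best = max(candidates, key=scores.get)
        order.append(best)
        in_tree.update(best)
    return order


# Hypothetical scores for 4 sequences.
example_scores = {
    (1, 2): 85, (1, 3): 63, (1, 4): 50,
    (2, 3): 79, (2, 4): 42, (3, 4): 90,
}
example_order = spanning_tree_order(example_scores)
```

Here the pair (3, 4) is processed first because it has the highest score, then sequence 2 joins via (2, 3), then sequence 1 via (1, 2).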
PHMM 27
MSA Example
Suppose 10 sequences, with the following pairwise alignment scores:
PHMM 28
MSA Example: Spanning Tree
Spanning tree based on scores
So process pairs in following order: (5,4), (5,8), (8,3), (3,2), (2,7), (2,1), (1,6), (6,10), (10,9)
PHMM 29
MSA Snapshot
Intermediate step and final
Use “+” for neutral symbol
Then “-” for gaps in MSA
Note increase in gaps
PHMM 30
PHMM from MSA
For PHMM, must determine match and insert states & probabilities from MSA
“Conservative” columns are match states
Half or less of symbols are gaps
Other columns are insert states
Majority of symbols are gaps
Delete states are a separate issue
PHMM 31
PHMM States from MSA
Consider a simpler MSA…
Columns 1,2,6 are match states 1,2,3, respectively
Since less than half gaps
Columns 3,4,5 are combined to form insert state 2
Since more than half gaps
Match states between insert
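The column rule can be sketched as follows. The MSA below is a hypothetical stand-in (the slide's MSA is not reproduced in this transcript), built so that columns 1, 2, 6 come out as match states and columns 3 to 5 collapse into insert state 2. The function names are illustrative.

```python
def classify_columns(msa, gap="-"):
    """Label each MSA column "M" (match) or "I" (insert).

    A column is a match column when half or fewer of its symbols
    are gaps, otherwise an insert column.
    """
    n_rows = len(msa)
    labels = []
    for column in zip(*msa):
        gaps = sum(1 for c in column if c == gap)
        labels.append("M" if 2 * gaps <= n_rows else "I")
    return labels


def assign_states(labels):
    """Number the states: consecutive insert columns share the insert
    state that follows the most recent match state."""
    states = []
    match_index = 0
    for label in labels:
        if label == "M":
            match_index += 1
            states.append(f"M{match_index}")
        else:
            states.append(f"I{match_index}")
    return states


# Hypothetical 4-sequence MSA with 6 columns.
example_msa = ["AC---G", "ACT--G", "A--G-G", "-C--AG"]
example_states = assign_states(classify_columns(example_msa))
```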
PHMM 32
PHMM Probabilities from MSA
Emission probabilities
Based on symbol distribution in match and insert states
State transition probs
Based on transitions in the MSA
PHMM 33
PHMM Probabilities from MSA
Emission probabilities:
But 0 probabilities are bad
Model “overfits” the data
So, use “add one” rule
Add one to each numerator, add total to denominators
PHMM 34
PHMM Probabilities from MSA
More emission probabilities:
But 0 probabilities are bad
Model “overfits” the data
Again, use “add one” rule
Add one to each numerator, add total to denominators
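A minimal sketch of the add-one rule for emission probabilities. It assumes that "total" means the number of symbols in the alphabet, so the probabilities still sum to 1. The column contents and the function name are hypothetical; exact fractions are used so the arithmetic is easy to check by hand.

```python
from collections import Counter
from fractions import Fraction


def emission_probs(column, alphabet, gap="-"):
    """Add-one emission probabilities for one match or insert state.

    column holds the MSA symbols assigned to the state (gaps ignored).
    Each count gets +1, and the denominator grows by the alphabet size.
    """
    symbols = [c for c in column if c != gap]
    counts = Counter(symbols)
    total = len(symbols) + len(alphabet)
    return {s: Fraction(counts[s] + 1, total) for s in alphabet}


# Hypothetical column: E appears three times, G once, one gap.
example_emissions = emission_probs(["E", "E", "-", "G", "E"], "EGCJ")
```

Without the add-one rule, C and J would get probability 0 here; with it, each gets 1/8, while E gets 4/8 and G gets 2/8.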
PHMM 35
PHMM Probabilities from MSA
Transition probabilities:
We look at some examples
Note that “-” is delete state
First, consider begin state:
Again, use add one rule
PHMM 36
PHMM Probabilities from MSA
Transition probabilities
When no information in MSA, set probs to uniform
For example I1 does not appear in MSA, so
PHMM 37
PHMM Probabilities from MSA
Transition probabilities, another example
What about transitions from state D1?
Can only go to M2, so
Again, use add one rule:
PHMM 38
PHMM Emission Probabilities
Emission probabilities for the given MSA
Using add-one rule
PHMM 39
PHMM Transition Probabilities
Transition probabilities for the given MSA
Using add-one rule
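The same add-one rule applies to transitions. In this sketch the counts are hypothetical, since the slide's tables are not reproduced in this transcript. Each state is assumed to have three possible successors, so the denominator grows by 3. A state that never appears in the MSA has all-zero counts, and the add-one rule then gives the uniform distribution automatically.

```python
from fractions import Fraction


def transition_probs(counts):
    """Add-one transition probabilities out of a single state.

    counts maps each possible next state to the number of times
    that transition is observed in the MSA.
    """
    total = sum(counts.values()) + len(counts)
    return {nxt: Fraction(c + 1, total) for nxt, c in counts.items()}


# Hypothetical: D1 is followed by M2 once, never by I1 or D2.
from_d1 = transition_probs({"M2": 1, "I1": 0, "D2": 0})

# Hypothetical: I1 never appears in the MSA, so all counts are zero.
from_i1 = transition_probs({"M2": 0, "I1": 0, "D2": 0})
```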
PHMM 40
PHMM Summary
Construct pairwise alignments
Usually, use dynamic programming
Use these to construct MSA
Lots of ways to do this
Using MSA, determine probabilities
Emission probabilities
State transition probabilities
In effect, we have trained a PHMM
Now what???
PHMM 41
PHMM Scoring
Want to score sequences to see how closely they match PHMM
How did we score sequences with HMM?
Forward algorithm
How to score sequences with PHMM?
Forward algorithm
But, algorithm is a little more complex
Due to complex state transitions
PHMM 42
Forward Algorithm
Notation
Indices i and j are columns in MSA
x_i is ith observation symbol
q_{x_i} is distribution of x_i in “random model”
Base case is F^M_0(0) = 0
F^M_j(i) is score of x_1,…,x_i up to state j (note that in PHMM, i and j may not agree)
Some states undefined
Undefined states ignored in calculation
PHMM 43
Forward Algorithm
Compute P(X|λ) recursively
Note that F^M_j(i) depends on F^M_{j-1}(i-1), F^I_{j-1}(i-1), and F^D_{j-1}(i-1)
And corresponding state transition probs
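The recursion itself is not reproduced in this transcript. For reference, the log-odds forward recursion for profile HMMs as given in Durbin et al. (listed in the references) is shown below, with F^M, F^I, F^D the scores ending in a match, insert, or delete state, a the transition probabilities, e the emission probabilities, and q the random-model distribution. It is offered as the likely form of the missing equations, not a verbatim copy of the slide.

```latex
\begin{align*}
F^{M}_{j}(i) &= \log\frac{e_{M_j}(x_i)}{q_{x_i}}
  + \log\Big[ a_{M_{j-1}M_j}\, e^{F^{M}_{j-1}(i-1)}
            + a_{I_{j-1}M_j}\, e^{F^{I}_{j-1}(i-1)}
            + a_{D_{j-1}M_j}\, e^{F^{D}_{j-1}(i-1)} \Big] \\
F^{I}_{j}(i) &= \log\frac{e_{I_j}(x_i)}{q_{x_i}}
  + \log\Big[ a_{M_{j}I_j}\, e^{F^{M}_{j}(i-1)}
            + a_{I_{j}I_j}\, e^{F^{I}_{j}(i-1)}
            + a_{D_{j}I_j}\, e^{F^{D}_{j}(i-1)} \Big] \\
F^{D}_{j}(i) &= \log\Big[ a_{M_{j-1}D_j}\, e^{F^{M}_{j-1}(i)}
            + a_{I_{j-1}D_j}\, e^{F^{I}_{j-1}(i)}
            + a_{D_{j-1}D_j}\, e^{F^{D}_{j-1}(i)} \Big]
\end{align*}
```

Delete states emit nothing, so F^D has no emission term and its index i does not advance, which is why i and j can disagree.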
PHMM 44
PHMM
We will see examples of PHMM later
In particular,
Malware detection based on opcodes
Masquerade detection based on UNIX commands
PHMM 45
References
Biological Sequence Analysis: Probabilistic Models of Proteins and Nucleic Acids, Durbin, et al
Masquerade detection using profile hidden Markov models, L. Huang and M. Stamp, to appear in Computers and Security
Profile hidden Markov models for metamorphic virus detection, S. Attaluri, S. McGhee and M. Stamp, Journal in Computer Virology, Vol. 5, No. 2, May 2009, pp. 151-169