February 3, 2012
Statistical Machine learning from HIV genomic data using HMMs
Jedidiah Francis Twitter: @jedidiahfrancis Email: [email protected] Blog: jedidiahfrancis.com Mobile: 07917184089
Talk outline
§ Primer on Hidden Markov Models (HMMs)
§ Inference in HIV genomic data
§ Conclusion
Practical uses
HMM uses include:
§ finance (time series modeling)
§ speech recognition
§ handwriting recognition
§ medicine (heart attack prediction)
§ genomics (sequence analysis & alignment)
§ robotics
§ meteorology (weather forecasting / modeling)
Introduction to HMMs
1st order Markov chain:
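[Figure: two-state Markov chain over states S and R (arrow labels 0.4, 0.8, 0.6, 0.2), unrolled over four time steps, then extended to an HMM in which each state emits T or W (emission labels 0.2, 0.8, 0.9, 0.1).]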
$\Pr(X_t \mid X_1, X_2, \ldots, X_{t-1}) = \Pr(X_t \mid X_{t-1})$
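The first-order property can be illustrated by sampling from a small two-state HMM. This is a minimal sketch, not the talk's own code: the assignment of the diagram's numbers to particular arrows, the uniform starting distribution `INIT`, and the helper names `draw` and `sample_hmm` are all assumptions made for illustration.

```python
import random

# Assumed parameters: the slide's diagram shows the values 0.8, 0.2, 0.6, 0.4
# (transitions) and 0.2, 0.8, 0.9, 0.1 (emissions), but which arrow carries
# which value is a guess made for this sketch.
STATES = ("S", "R")
TRANS = {"S": {"S": 0.8, "R": 0.2}, "R": {"S": 0.4, "R": 0.6}}
EMIT = {"S": {"T": 0.2, "W": 0.8}, "R": {"T": 0.9, "W": 0.1}}
INIT = {"S": 0.5, "R": 0.5}  # assumed uniform start, not given on the slide


def draw(dist, rng):
    """Draw one outcome from a {outcome: probability} dictionary."""
    outcomes = list(dist)
    weights = [dist[o] for o in outcomes]
    return rng.choices(outcomes, weights=weights, k=1)[0]


def sample_hmm(length, rng):
    """Sample a hidden state path and its observations."""
    states, observations = [], []
    state = draw(INIT, rng)
    for _ in range(length):
        states.append(state)
        observations.append(draw(EMIT[state], rng))
        # First-order Markov property: the next state depends only on `state`,
        # not on any earlier part of the path.
        state = draw(TRANS[state], rng)
    return states, observations


rng = random.Random(0)
path, obs = sample_hmm(20, rng)
print("Path:       ", " ".join(path))
print("Observation:", " ".join(obs))
```

In an HMM only the observation line would be visible; the path line is the hidden part that the later problems try to reason about.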
Problem 1
Given a model with parameters $\lambda = (A, B, \pi)$ and a sequence of observations D, compute $\Pr(D \mid \lambda)$.
Observation: W T T W T W W W T W T T T T W T T W T T
§ Naïve approach: sum over all possible paths ($2^{21} \approx 2.1$ million paths).
§ Luckily, dynamic programming (the forward algorithm) reduces the work from $m^n$ path evaluations to $mn$ table entries (42 for $m = 2$, $n = 21$), each a sum over the $m$ predecessor states.
§ A similar algorithm (the backward algorithm) does the same thing but in reverse order.
Solution 1
Algorithm: forward algorithm
[Figure: trellis with states S and R at two consecutive positions and emitted symbol T.]
Model: $\lambda = (A, B, \pi)$, with $\Pr(X_t \mid X_1, X_2, \ldots, X_{t-1}) = \Pr(X_t \mid X_{t-1})$
Emission probability: $\epsilon_S(X_i)$
Transition probability: $q_{ij}$
Initialisation ($i = 0$): $f_0(0) = 1,\; f_k(0) = 0 \;\; \forall\, k > 0$
Recursion ($i = 1, \ldots, L$): $f_S(i+1) = [f_S(i)\, q_{SS} + f_R(i)\, q_{RS}] \times \epsilon_S(X_{i+1})$
Termination: $\Pr(D \mid \lambda) = \sum_k f_k(L)$
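The recursion can be sketched in a few lines of Python and checked against the naïve sum over paths on a short sequence. This is an illustrative sketch under stated assumptions: the parameter values and their assignment to arrows are guesses, the silent begin state 0 is folded into an assumed starting distribution `INIT`, and the function names `forward`, `backward` and `brute_force` are hypothetical.

```python
from itertools import product

STATES = ("S", "R")
# Assumed parameter values; the mapping of the diagram's numbers is a guess.
INIT = {"S": 0.5, "R": 0.5}  # stands in for transitions out of begin state 0
TRANS = {"S": {"S": 0.8, "R": 0.2}, "R": {"S": 0.4, "R": 0.6}}
EMIT = {"S": {"T": 0.2, "W": 0.8}, "R": {"T": 0.9, "W": 0.1}}


def forward(obs):
    """Return Pr(D | lambda) using the forward recursion."""
    # f[k] is the probability of the observations so far, ending in state k.
    f = {k: INIT[k] * EMIT[k][obs[0]] for k in STATES}
    for x in obs[1:]:
        f = {k: sum(f[j] * TRANS[j][k] for j in STATES) * EMIT[k][x]
             for k in STATES}
    return sum(f.values())


def backward(obs):
    """Return Pr(D | lambda) using the same idea run in reverse order."""
    # b[j] is the probability of the remaining observations given state j now.
    b = {k: 1.0 for k in STATES}
    for x in reversed(obs[1:]):
        b = {j: sum(TRANS[j][k] * EMIT[k][x] * b[k] for k in STATES)
             for j in STATES}
    return sum(INIT[k] * EMIT[k][obs[0]] * b[k] for k in STATES)


def brute_force(obs):
    """Return Pr(D | lambda) by summing over every hidden path (m**n terms)."""
    total = 0.0
    for path in product(STATES, repeat=len(obs)):
        p = INIT[path[0]] * EMIT[path[0]][obs[0]]
        for prev, cur, x in zip(path, path[1:], obs[1:]):
            p *= TRANS[prev][cur] * EMIT[cur][x]
        total += p
    return total


obs = "W T T W T W W W".split()
print(forward(obs), backward(obs), brute_force(obs))
```

All three numbers should agree up to floating-point rounding; only `brute_force` grows exponentially with the sequence length.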
Problem 2
Given some model $\lambda = (A, B, \pi)$ and a sequence of observations D, find the most probable sequence of the underlying states.
Observation: W T T W T W W W T W T T T T W T T W T T
Path:        ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ?
§ Use the Viterbi algorithm:
$V_k(i+1) = \max_j [V_j(i)\, q_{jk}] \times \epsilon_k(X_{i+1})$
§ A traceback matrix keeps track of which is the most likely path:
$t_k(i+1) = \arg\max_j [V_j(i)\, q_{jk}]$
§ The most likely path can be found from:
$\max_k [V_k(L)]$
Solution 2
Observation: W T T W T W W W T W T T T T W T T W T T
Path:        S R R R R S S S S S R R R R R R S S S R
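[Figure: trellis showing states S and R at positions $X_{i-1}$, $X_i$ and $X_{i+1}$.]
$V_S(i+1) = \max_j [V_j(i)\, q_{jS}] \times \epsilon_S(X_{i+1})$
$t_S(i+1) = \arg\max_j [V_j(i)\, q_{jS}]$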
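The Viterbi recursion and traceback can be sketched as follows. The parameter values, their assignment to arrows and the starting distribution `INIT` are assumptions, so this sketch is not expected to reproduce the path shown on the slide; `viterbi` and `path_probability` are hypothetical helper names.

```python
STATES = ("S", "R")
# Assumed parameter values; the mapping of the diagram's numbers is a guess.
INIT = {"S": 0.5, "R": 0.5}
TRANS = {"S": {"S": 0.8, "R": 0.2}, "R": {"S": 0.4, "R": 0.6}}
EMIT = {"S": {"T": 0.2, "W": 0.8}, "R": {"T": 0.9, "W": 0.1}}


def viterbi(obs):
    """Return (most probable state path, joint probability of that path)."""
    v = {k: INIT[k] * EMIT[k][obs[0]] for k in STATES}
    traceback = []  # traceback[i][k] = best predecessor of state k at step i+1
    for x in obs[1:]:
        pointers, new_v = {}, {}
        for k in STATES:
            best_j = max(STATES, key=lambda j: v[j] * TRANS[j][k])
            pointers[k] = best_j
            new_v[k] = v[best_j] * TRANS[best_j][k] * EMIT[k][x]
        traceback.append(pointers)
        v = new_v
    # Start from the best final state and follow the pointers backwards.
    last = max(STATES, key=lambda k: v[k])
    best_prob = v[last]
    path = [last]
    for pointers in reversed(traceback):
        path.append(pointers[path[-1]])
    path.reverse()
    return path, best_prob


def path_probability(path, obs):
    """Joint probability of one hidden path and the observations."""
    p = INIT[path[0]] * EMIT[path[0]][obs[0]]
    for prev, cur, x in zip(path, path[1:], obs[1:]):
        p *= TRANS[prev][cur] * EMIT[cur][x]
    return p


obs = "W T T W T W W W T W T T T T W T T W T T".split()
best_path, best_prob = viterbi(obs)
print("Observation:", " ".join(obs))
print("Path:       ", " ".join(best_path))
```

For long sequences the products underflow, so a practical version would usually work with log probabilities; that refinement is left out here to keep the sketch close to the formulas.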
HIV recombination
[Figure: sequences compared to master (MASTER: 1003-102301p-8), one row per sequence. x-axis: base number (0 to 1000). Colour key: A: green, T: red, G: orange, C: light blue, IUPAC: dark blue, gaps: gray.]
Generating estimates for 𝜌
§ The model builds $h_{k+1}$ as an imperfect mosaic of $h_1, \ldots, h_k$ (an imperfect copying process).
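The idea of an imperfect mosaic can be made concrete with a small generative sketch. Everything specific here is an assumption for illustration: the function name `copy_mosaic`, the toy haplotypes, a per-site switch probability standing in for the effect of recombination, and a per-site miscopy probability standing in for imperfect copying. It is not the estimator for ρ used in the talk.

```python
import random

# Toy aligned haplotypes h_1..h_k (hypothetical data for illustration only).
HAPLOTYPES = [
    "AAAAAAAAAAAAAAAAAAAA",
    "CCCCCCCCCCCCCCCCCCCC",
    "GGGGGGGGGGGGGGGGGGGG",
]


def copy_mosaic(haplotypes, switch_prob, miscopy_prob, rng, alphabet="ACGT"):
    """Build a new haplotype by copying, site by site, from existing ones.

    At each site the copied source may switch to a randomly chosen haplotype
    (probability switch_prob), and the copied base may be replaced by a
    different one (probability miscopy_prob).
    """
    length = len(haplotypes[0])
    source = rng.randrange(len(haplotypes))
    new_hap, sources = [], []
    for site in range(length):
        if site > 0 and rng.random() < switch_prob:
            source = rng.randrange(len(haplotypes))  # may pick the same source
        base = haplotypes[source][site]
        if rng.random() < miscopy_prob:
            base = rng.choice([b for b in alphabet if b != base])
        new_hap.append(base)
        sources.append(source)
    return "".join(new_hap), sources


rng = random.Random(42)
h_new, copied_from = copy_mosaic(HAPLOTYPES, switch_prob=0.15,
                                 miscopy_prob=0.05, rng=rng)
print("h_k+1:      ", h_new)
print("copied from:", "".join(str(s + 1) for s in copied_from))
```

Read as an HMM, the "copied from" line is the hidden state path and the new haplotype is the observation, which is why the forward and Viterbi machinery applies to this kind of model.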
Modeling the copy process
[Figure: copy process between sequences K and K+1. Panels: Single time point; Two time points ($t_1$ and $t_2$, separated by $\Delta t$).]
Viterbi most likely path
Statistical inference
Closing remarks
Advantages of HMMs
§ Easy enough to implement, and they allow for tractable computation
§ Rich enough to model very complex biological processes
Disadvantages
§ Each state is assumed to depend only on the previous state, and each observation only on its own state; this is sometimes not true.
§ Local maxima
  § Parameter estimation may converge to a local maximum rather than the true global maximum
§ Speed
  § Almost everything one does with an HMM involves accounting for all possible paths through the model
  § This can be sped up in various ways but can still be relatively slow