On the Dirichlet Mixture Model for Mining Protein Sequence Data
Xugang Ye¹, Stephen Altschul¹
¹National Center for Biotechnology Information
ICML 2011, Bellevue (WA), USA

Background
Biologists need to find, from raw data like this [slide image: raw sequences], information like this [slide image: aligned sequences].
• The result is called a multiple sequence alignment (MSA)
• MSA is often used for assessing sequence conservation of protein domains, inference of biological functions, reconstruction of ancestries, and eventually drug discovery

Background
• Fundamental problem: identification of important regions of DNA or protein molecules via multiple sequence alignment (MSA).
• Challenges:
  - The MSA problem is in general intractable
  - Given a similarity measure, there is no known polynomial-time algorithm for solving the MSA problem
  - There is still no consensus about the best similarity measure
  - There is still no satisfactory MSA program that can significantly reduce the manual labor
• New perspective: a machine learning approach based on Bayesian analysis

Background
[Workflow diagram: Gold-standard MSA data → Training → Statistical model → Adjusting → Searching (on new sequence data) → New patterns → Refining (MSA programs, manual efforts) → New gold-standard MSA data]

Background
• Model the gold-standard MSA data
• Infer the model
• Adjust the model
• Use the model to search for patterns
• Refine the patterns using MSA programs
• Incorporate the new discoveries into the libraries

Dirichlet Mixture Model
Multiple sequence alignments are treated as independent columns, each summarized by a count vector:
• Independent columns: Col(1), Col(2), ..., Col(T)
• Count vectors: $\vec{n}^{(1)}, \vec{n}^{(2)}, \ldots, \vec{n}^{(T)}$
Hierarchical model:
• Hyperparameters: $m_i$, i = 1, 2, …, M; $\alpha_{ij}$, i = 1, 2, …, M, j = 1, 2, …, L
• Dirichlet mixture (DM) prior:
$$f(\vec{p}\mid\Theta) = \sum_{i=1}^{M} m_i\,\mathrm{Dir}(\vec{p};\vec{\alpha}_i) = \sum_{i=1}^{M} m_i \cdot \frac{\prod_{j=1}^{L} p_j^{\alpha_{ij}-1}}{\mathrm{B}(\alpha_{i1},\ldots,\alpha_{iL})}$$
• Multinomial parameters: $p_j$, j = 1, 2, …, L
• Count vector: $n_j$, j = 1, 2, …, L, with column likelihood
$$\mathrm{P}(\mathrm{Col}^{(t)}\mid\vec{p}) = \prod_{j=1}^{L} p_j^{\,n_j^{(t)}}$$
Why hierarchical? Because a cluster of columns is not governed by a fixed $\vec{p}$. It is governed by a random $\vec{p}$ whose distribution is parameterized by a set of higher-level parameters.
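
The DM prior density can be evaluated directly from its definition. The sketch below is a minimal illustration, not the authors' code; the function names and the 2-component, 3-letter prior are hypothetical, and only the Python standard library is assumed.

```python
import math

def log_dirichlet_pdf(p, alpha):
    """Log density of Dir(p; alpha) at a point p in the interior of the simplex."""
    # ln B(alpha) = sum_j ln Gamma(alpha_j) - ln Gamma(sum_j alpha_j)
    log_beta = sum(math.lgamma(a) for a in alpha) - math.lgamma(sum(alpha))
    return sum((a - 1.0) * math.log(pj) for a, pj in zip(alpha, p)) - log_beta

def log_dm_pdf(p, m, alphas):
    """Log density of the mixture f(p | Theta) = sum_i m_i Dir(p; alpha_i)."""
    terms = [math.log(mi) + log_dirichlet_pdf(p, a) for mi, a in zip(m, alphas)]
    top = max(terms)  # log-sum-exp for numerical stability
    return top + math.log(sum(math.exp(t - top) for t in terms))

# Hypothetical 2-component prior over a 3-letter alphabet.
m = [0.6, 0.4]
alphas = [[1.0, 1.0, 1.0], [2.0, 1.0, 1.0]]
density = math.exp(log_dm_pdf([0.5, 0.3, 0.2], m, alphas))
print(density)
```

Working in log space matters for real priors, where some $\alpha_{ij}$ are very small and the densities span many orders of magnitude.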

Theoretical Background
• What is a DM
$$f(\vec{p}\mid\Theta) = \sum_{i=1}^{M} m_i\,\frac{\Gamma(\alpha_i)}{\prod_{j=1}^{L}\Gamma(\alpha_{ij})}\prod_{j=1}^{L} p_j^{\alpha_{ij}-1}, \qquad \alpha_i = \sum_{j=1}^{L}\alpha_{ij},$$
$$p_j > 0,\ j = 1,\ldots,L; \qquad \sum_{j=1}^{L} p_j = 1,$$
$$\Theta = \Big\{m_i > 0 : i = 1,\ldots,M;\ \sum_{i=1}^{M} m_i = 1\Big\} \cup \big\{\alpha_{ij} > 0 : i = 1,\ldots,M;\ j = 1,\ldots,L\big\}$$
• Conjugate priors
$$f(\vec{p}\mid\vec{n};\Theta) = \sum_{i=1}^{M} m_i'\,\frac{\Gamma(\alpha_i+n)}{\prod_{j=1}^{L}\Gamma(\alpha_{ij}+n_j)}\prod_{j=1}^{L} p_j^{\alpha_{ij}+n_j-1}, \qquad n = \sum_{j=1}^{L} n_j,$$
$$m_i' = \frac{m_i\,\dfrac{\Gamma(\alpha_i)}{\Gamma(\alpha_i+n)}\displaystyle\prod_{j=1}^{L}\dfrac{\Gamma(\alpha_{ij}+n_j)}{\Gamma(\alpha_{ij})}}{\displaystyle\sum_{i'=1}^{M} m_{i'}\,\dfrac{\Gamma(\alpha_{i'})}{\Gamma(\alpha_{i'}+n)}\prod_{j=1}^{L}\dfrac{\Gamma(\alpha_{i'j}+n_j)}{\Gamma(\alpha_{i'j})}}, \qquad i = 1,\ldots,M,$$
where $\vec{n} = (n_1, n_2, \ldots, n_L)^{T}$ and $n_j$ = number of the j-th letter in a given column.
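
The conjugacy property follows in one step from Bayes' rule. The following is a sketch in the slide's notation, using the column likelihood $\prod_j p_j^{n_j}$:

$$
\begin{aligned}
f(\vec{p}\mid\vec{n};\Theta) &\propto \prod_{j=1}^{L} p_j^{n_j}\,\sum_{i=1}^{M} m_i\,\frac{\Gamma(\alpha_i)}{\prod_{j}\Gamma(\alpha_{ij})}\prod_{j=1}^{L} p_j^{\alpha_{ij}-1} \\
&= \sum_{i=1}^{M} m_i\,\frac{\Gamma(\alpha_i)}{\Gamma(\alpha_i+n)}\prod_{j=1}^{L}\frac{\Gamma(\alpha_{ij}+n_j)}{\Gamma(\alpha_{ij})}\;\mathrm{Dir}(\vec{p};\vec{\alpha}_i+\vec{n}).
\end{aligned}
$$

Each component is again a Dirichlet density, now with parameters $\vec{\alpha}_i+\vec{n}$, and normalizing the coefficients so that they sum to one gives the updated weights $m_i'$.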

Theoretical Background
• Bayesian classifier (for clustering independent columns)
A column Col is summarized by its count vector $\vec{n} = (n_1, n_2, \ldots, n_L)^{T}$, with $n = \sum_j n_j$.
Total probability:
$$\mathrm{P}(\mathrm{Col}\mid\Theta) = \sum_{i=1}^{M} m_i\cdot\mathrm{P}(\mathrm{Col}\mid i;\Theta)$$
Likelihood:
$$\mathrm{P}(\mathrm{Col}\mid i;\Theta) = \frac{\Gamma(\alpha_i)}{\Gamma(\alpha_i+n)}\prod_{j=1}^{L}\frac{\Gamma(\alpha_{ij}+n_j)}{\Gamma(\alpha_{ij})}$$
Posterior:
$$\mathrm{P}(i\mid\mathrm{Col};\Theta) = \frac{\mathrm{P}(\mathrm{Col}\mid i;\Theta)\cdot m_i}{\mathrm{P}(\mathrm{Col}\mid\Theta)}$$
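
The three quantities of the classifier (likelihood, total probability, posterior) can be computed in log space with `math.lgamma`. This is a minimal sketch rather than the authors' implementation; the function names and the 2-component prior are hypothetical.

```python
import math

def log_col_likelihood(n, alpha):
    """ln P(Col | i; Theta) for count vector n under one Dirichlet component."""
    a_sum, n_sum = sum(alpha), sum(n)
    out = math.lgamma(a_sum) - math.lgamma(a_sum + n_sum)
    for a, nj in zip(alpha, n):
        out += math.lgamma(a + nj) - math.lgamma(a)
    return out

def component_posteriors(n, m, alphas):
    """Return (ln P(Col | Theta), [P(i | Col; Theta) for each component i])."""
    joint = [math.log(mi) + log_col_likelihood(n, a) for mi, a in zip(m, alphas)]
    top = max(joint)  # log-sum-exp for numerical stability
    log_total = top + math.log(sum(math.exp(x - top) for x in joint))
    return log_total, [math.exp(x - log_total) for x in joint]

# Hypothetical prior: component 0 favours letter 0, component 1 favours letter 2.
m = [0.5, 0.5]
alphas = [[5.0, 1.0, 1.0], [1.0, 1.0, 5.0]]
log_total, post = component_posteriors([4, 0, 0], m, alphas)
best = max(range(len(post)), key=post.__getitem__)
print(best, post)
```

A column of four copies of letter 0 is assigned to component 0 with posterior 70/71, which can be checked by hand from the Gamma ratios.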

Theoretical Background
• Log-odds ratio (for measuring the similarity between two columns)
Col(1) has count vector $\vec{n}^{(1)} = (n_1^{(1)}, \ldots, n_L^{(1)})^{T}$ and Col(2) has count vector $\vec{n}^{(2)} = (n_1^{(2)}, \ldots, n_L^{(2)})^{T}$. Merging Col(1) and Col(2) gives Col(12), with count vector
$$\vec{n}^{(12)} = \big(n_1^{(1)}+n_1^{(2)},\; n_2^{(1)}+n_2^{(2)},\; \ldots,\; n_L^{(1)}+n_L^{(2)}\big)^{T}.$$
The log-odds ratio (BILD score) is
$$\ln\frac{\mathrm{P}(\mathrm{Col}^{(1)}\mid\mathrm{Col}^{(2)};\Theta)}{\mathrm{P}(\mathrm{Col}^{(1)}\mid\Theta)} = \ln\frac{\mathrm{P}(\mathrm{Col}^{(1)},\mathrm{Col}^{(2)}\mid\Theta)}{\mathrm{P}(\mathrm{Col}^{(1)}\mid\Theta)\,\mathrm{P}(\mathrm{Col}^{(2)}\mid\Theta)} = \ln\frac{\mathrm{P}(\mathrm{Col}^{(12)}\mid\Theta)}{\mathrm{P}(\mathrm{Col}^{(1)}\mid\Theta)\,\mathrm{P}(\mathrm{Col}^{(2)}\mid\Theta)}$$
*For details, see Altschul et al. 2010. The Construction and Use of Log-odds Substitution Scores for Multiple Sequence Alignment. PLoS Comput. Biol., 6(7), e1000852. Also see Ye et al. 2011. An Assessment of Substitution Scores for Protein Profile-Profile Comparison. Bioinformatics, in press.
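
The log-odds ratio only needs the marginal probability of a column under the DM, evaluated three times. The sketch below assumes a hypothetical 2-component, 3-letter prior and hypothetical function names; it illustrates the formula and is not the published BILD implementation.

```python
import math

def log_col_prob(n, m, alphas):
    """ln P(Col | Theta) = ln sum_i m_i P(Col | i; Theta) for count vector n."""
    terms = []
    for mi, alpha in zip(m, alphas):
        a_sum, n_sum = sum(alpha), sum(n)
        t = math.log(mi) + math.lgamma(a_sum) - math.lgamma(a_sum + n_sum)
        t += sum(math.lgamma(a + nj) - math.lgamma(a) for a, nj in zip(alpha, n))
        terms.append(t)
    top = max(terms)
    return top + math.log(sum(math.exp(t - top) for t in terms))

def bild_log_odds(n1, n2, m, alphas):
    """ln [ P(Col12) / (P(Col1) P(Col2)) ], where Col12 is the merged column."""
    merged = [a + b for a, b in zip(n1, n2)]
    return (log_col_prob(merged, m, alphas)
            - log_col_prob(n1, m, alphas) - log_col_prob(n2, m, alphas))

m = [0.5, 0.5]
alphas = [[5.0, 1.0, 1.0], [1.0, 1.0, 5.0]]
same = bild_log_odds([3, 0, 0], [2, 0, 0], m, alphas)  # both columns favour letter 0
diff = bild_log_odds([3, 0, 0], [0, 0, 2], m, alphas)  # columns favour different letters
print(same, diff)
```

With this prior, two columns dominated by the same letter score positive and two columns dominated by different letters score negative, which is the behaviour a similarity measure should have.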

Theoretical Background
• Pattern score (for measuring the pattern significance)
Pattern window: $X = (\vec{x}^{(1)}, \vec{x}^{(2)}, \ldots, \vec{x}^{(w)})$
Probability:
$$\mathrm{P}(X\mid\Theta) = \prod_{\tau=1}^{w}\mathrm{P}(\vec{x}^{(\tau)}\mid\Theta)$$
Odds-ratio:
$$R(X\mid\Theta) = \frac{\prod_{\tau=1}^{w}\mathrm{P}(\vec{x}^{(\tau)}\mid\Theta)}{\prod_{\tau=1}^{w}\prod_{j=1}^{L}\big(p_j^{b}\big)^{n_j^{(\tau)}}},$$
where $p_j^{b}$ is the background (marginal) probability of letter j.
Log odds-ratio (BILD score):
$$\ln R(X\mid\Theta) = \sum_{\tau=1}^{w}\ln\frac{\mathrm{P}(\vec{x}^{(\tau)}\mid\Theta)}{\prod_{j=1}^{L}\big(p_j^{b}\big)^{n_j^{(\tau)}}} = \sum_{\tau=1}^{w}\Big[\ln\mathrm{P}(\vec{x}^{(\tau)}\mid\Theta) - \sum_{j=1}^{L} n_j^{(\tau)}\ln p_j^{b}\Big]$$

Theoretical Background
• Position posteriors (for measuring the aligning likelihood of a position in a sequence, given a pattern)
Given pattern: $X = (\vec{x}^{(1)}, \vec{x}^{(2)}, \ldots, \vec{x}^{(w)})$
Position t in a sequence, with letters $y_1^{(t)}, y_2^{(t)}, \ldots, y_w^{(t)}$
$p_t$ = P("position t is aligned to the first column of the given pattern")
$$p_t = \frac{\mathrm{P}(t\mid X;\Theta)}{\sum_{t'}\mathrm{P}(t'\mid X;\Theta)} = \frac{\mathrm{P}(y_1^{(t)},\ldots,y_w^{(t)}\mid X;\Theta)}{\sum_{t'}\mathrm{P}(y_1^{(t')},\ldots,y_w^{(t')}\mid X;\Theta)},$$
where, under the independence assumption,
$$\mathrm{P}(y_1^{(t)},\ldots,y_w^{(t)}\mid X;\Theta) = \frac{\mathrm{P}(y_1^{(t)},\ldots,y_w^{(t)}, X\mid\Theta)}{\mathrm{P}(X\mid\Theta)} = \prod_{\tau=1}^{w}\frac{\mathrm{P}(y_\tau^{(t)},\vec{x}^{(\tau)}\mid\Theta)}{\mathrm{P}(\vec{x}^{(\tau)}\mid\Theta)}$$
* One can also normalize the ratio:
$$\frac{\mathrm{P}(y_1^{(t)},\ldots,y_w^{(t)}\mid X;\Theta)}{\mathrm{P}(y_1^{(t)},\ldots,y_w^{(t)}\mid\Theta)} = \prod_{\tau=1}^{w}\frac{\mathrm{P}(y_\tau^{(t)},\vec{x}^{(\tau)}\mid\Theta)}{\mathrm{P}(y_\tau^{(t)}\mid\Theta)\,\mathrm{P}(\vec{x}^{(\tau)}\mid\Theta)}$$
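
Each factor $\mathrm{P}(y,\vec{x}\mid\Theta)/\mathrm{P}(\vec{x}\mid\Theta)$ is the probability of the column with one more letter added, divided by the probability of the column itself. The sketch below computes the (unnormalized-ratio) position posteriors that way. It assumes letters are encoded as integers 0..L-1 and uses a hypothetical prior and hypothetical function names.

```python
import math

def log_col_prob(n, m, alphas):
    """ln P(Col | Theta) for count vector n under the Dirichlet mixture."""
    terms = []
    for mi, alpha in zip(m, alphas):
        t = math.log(mi) + math.lgamma(sum(alpha)) - math.lgamma(sum(alpha) + sum(n))
        t += sum(math.lgamma(a + nj) - math.lgamma(a) for a, nj in zip(alpha, n))
        terms.append(t)
    top = max(terms)
    return top + math.log(sum(math.exp(t - top) for t in terms))

def log_window_prob(letters, pattern, m, alphas):
    """ln P(y_1..y_w | X; Theta) = sum over tau of ln [P(y_tau, x_tau) / P(x_tau)]."""
    total = 0.0
    for y, counts in zip(letters, pattern):
        extended = list(counts)
        extended[y] += 1  # pattern column x_tau with the letter y_tau added
        total += log_col_prob(extended, m, alphas) - log_col_prob(counts, m, alphas)
    return total

def position_posteriors(sequence, pattern, m, alphas):
    """p_t for every start position t at which the window fits in the sequence."""
    w = len(pattern)
    logs = [log_window_prob(sequence[t:t + w], pattern, m, alphas)
            for t in range(len(sequence) - w + 1)]
    top = max(logs)
    weights = [math.exp(v - top) for v in logs]
    z = sum(weights)
    return [v / z for v in weights]

m = [0.5, 0.5]
alphas = [[5.0, 1.0, 1.0], [1.0, 1.0, 5.0]]
pattern = [[4, 0, 0], [0, 0, 4]]  # two columns: all letter 0, then all letter 2
sequence = [1, 0, 2, 1, 1]
p = position_posteriors(sequence, pattern, m, alphas)
print(p)
```

The position whose letters (0 then 2) match the pattern columns receives the largest posterior.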
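
Theoretical Background
• Shift posteriors (for measuring the shifting likelihood of a pattern)
Staying likelihood, for the current window $X^{0} = (\vec{x}^{(1)}, \vec{x}^{(2)}, \ldots, \vec{x}^{(w)})$:
$$\mathrm{P}(X^{0}\mid\Theta) = \prod_{\tau=1}^{w}\mathrm{P}(\vec{x}^{(\tau)}\mid\Theta)$$
Shifting likelihoods, for the shifted windows $X^{-1} = (\vec{x}^{(0)}, \vec{x}^{(1)}, \ldots, \vec{x}^{(w-1)})$, $X^{1} = (\vec{x}^{(2)}, \ldots, \vec{x}^{(w)}, \vec{x}^{(w+1)})$, and so on:
$$\mathrm{P}(X^{-(t+1)}\mid\Theta) = \mathrm{P}(X^{-t}\mid\Theta)\cdot\frac{\mathrm{P}(\vec{x}^{(-t)}\mid\Theta)}{\mathrm{P}(\vec{x}^{(-t+w)}\mid\Theta)}$$
$$\mathrm{P}(X^{t+1}\mid\Theta) = \mathrm{P}(X^{t}\mid\Theta)\cdot\frac{\mathrm{P}(\vec{x}^{(t+w+1)}\mid\Theta)}{\mathrm{P}(\vec{x}^{(t+1)}\mid\Theta)}$$
$p_t$ = P("shifting the current pattern window t columns")
$$p_t = \frac{\mathrm{P}(X^{t}\mid\Theta)}{\sum_{t'}\mathrm{P}(X^{t'}\mid\Theta)}$$
* One can also normalize the ratio:
$$\prod_{\tau=1}^{w}\frac{\mathrm{P}(\vec{x}^{(t+\tau)}\mid\Theta)}{\prod_{i}\mathrm{P}(x_i^{(t+\tau)}\mid\Theta)},$$
where $x_i^{(t+\tau)}$ denotes the i-th letter of column $\vec{x}^{(t+\tau)}$.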

Inference
• Goal: finding a maximum-likelihood prior for describing a set of independent columns
$$\Theta^{*} = \arg\max_{\Theta}\sum_{t=1}^{T}\ln\mathrm{P}(\mathrm{Col}^{(t)}\mid\Theta) = \arg\max_{\Theta}\sum_{t=1}^{T}\ln\sum_{i=1}^{M} m_i\cdot\frac{\Gamma(\alpha_i)}{\Gamma(\alpha_i+n^{(t)})}\prod_{j=1}^{L}\frac{\Gamma(\alpha_{ij}+n_j^{(t)})}{\Gamma(\alpha_{ij})}$$
• Challenge: this (constrained) optimization problem is in general intractable
  1) Nonconcave
  2) High-dimensional (ML + M−1)
  3) Large data size (T)
• Previous researchers have developed some good heuristic optimization procedures
• We have designed a better one

Inference
Gibbs sampling based optimization procedure (basic form):
• Step 0. Start from an M-component DM $\Theta^{(0)}$.
• Step 1. Given the current M-component DM $\Theta^{(k)}$, create M empty bins. For each t, compute the posterior $\mathrm{P}(i\mid\mathrm{Col}^{(t)};\Theta^{(k)})$ for each i, and assign $\mathrm{Col}^{(t)}$ to the i-th bin according to $\mathrm{P}(i\mid\mathrm{Col}^{(t)};\Theta^{(k)})$.
• Step 2. Update mixture parameters: $m_i' = T_i / T$, where $T_i$ is the size of bin i.
• Step 3. For each bin i, solve the optimization problem*
$$\vec{\alpha}_i^{(k+1)} = \arg\max_{\vec{\alpha}_i}\sum_{t=1}^{T_i}\ln\left[\frac{\Gamma(\alpha_i)}{\Gamma(\alpha_i+n^{(t)})}\prod_{j=1}^{L}\frac{\Gamma(\alpha_{ij}+n_j^{(t)})}{\Gamma(\alpha_{ij})}\right]$$
• Go to Step 1.
* Variational method:
$$\sum_{t=1}^{T}\ln\mathrm{P}(\mathrm{Col}^{(t)}\mid\Theta) = \sum_{t=1}^{T}\ln\sum_{i=1}^{M} m_i\,\mathrm{P}(\mathrm{Col}^{(t)}\mid i;\Theta) \ge \sum_{t=1}^{T}\sum_{i=1}^{M} m_i\ln\mathrm{P}(\mathrm{Col}^{(t)}\mid i;\Theta) = \sum_{i=1}^{M} m_i\sum_{t=1}^{T}\ln\mathrm{P}(\mathrm{Col}^{(t)}\mid i;\Theta)$$
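
Steps 1 and 2 of the procedure (sampling a bin for each column from its component posterior, then re-estimating the mixture weights from the bin sizes) can be sketched as follows. This is an illustration with hypothetical function names and toy data, not the authors' implementation, and it leaves out Step 3.

```python
import math
import random

def log_col_likelihood(n, alpha):
    """ln P(Col | i; Theta) for count vector n under one Dirichlet component."""
    out = math.lgamma(sum(alpha)) - math.lgamma(sum(alpha) + sum(n))
    for a, nj in zip(alpha, n):
        out += math.lgamma(a + nj) - math.lgamma(a)
    return out

def gibbs_assign(columns, m, alphas, rng):
    """Steps 1 and 2: sample a bin for every column, then set m_i' = T_i / T."""
    bins = [[] for _ in m]
    for n in columns:
        joint = [math.log(mi) + log_col_likelihood(n, a) for mi, a in zip(m, alphas)]
        top = max(joint)
        weights = [math.exp(x - top) for x in joint]  # proportional to P(i | Col; Theta)
        i = rng.choices(range(len(m)), weights=weights)[0]
        bins[i].append(n)
    new_m = [len(b) / len(columns) for b in bins]
    return bins, new_m

rng = random.Random(0)  # fixed seed so that the sketch is reproducible
m = [0.5, 0.5]
alphas = [[5.0, 1.0, 1.0], [1.0, 1.0, 5.0]]
columns = [[6, 1, 0], [5, 0, 1], [0, 1, 7], [1, 0, 6], [7, 0, 0]]
bins, new_m = gibbs_assign(columns, m, alphas, rng)
print(new_m)
```

A bin can come out empty, in which case $m_i' = 0$ falls outside the constraint $m_i > 0$; a practical implementation would need a rule for that case, which this sketch does not attempt.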

Inference
Dimension reduction via first moment estimation
• $n_j \mid n, p_j \sim \mathrm{Binomial}(n, p_j)$
• $p_j \mid \alpha_{i1}, \ldots, \alpha_{iL} \sim \mathrm{Beta}(\alpha_{ij}, \alpha_i - \alpha_{ij})$
• $E(n_j) = E(E(n_j \mid n, p_j)) = E(n p_j) = E(n)E(p_j) = E(n)\cdot\alpha_{ij}/\alpha_i \Rightarrow \alpha_{ij} = E(n_j)/E(n)\cdot\alpha_i$
• Replacing $q_{ij} = E(n_j)/E(n)$ with $\hat{q}_{ij} = \hat{E}(n_j)/\hat{E}(n)$ and plugging it into the objective function in Step 3 reduces the dimension from L to 1:
$$\alpha_i^{(k+1)} = \arg\max_{\alpha_i}\sum_{t=1}^{T_i}\ln\left[\frac{\Gamma(\alpha_i)}{\Gamma(\alpha_i+n^{(t)})}\prod_{j=1}^{L}\frac{\Gamma(\hat{q}_{ij}\alpha_i+n_j^{(t)})}{\Gamma(\hat{q}_{ij}\alpha_i)}\right]$$
Second moment estimate as the starting point for Newton's method:
$$\hat{\alpha}_i(j) = \left[\frac{\hat{q}_{ij}(1-\hat{q}_{ij})\,\hat{E}(n^2)}{\hat{E}^2(n)} - \hat{v}_{ij}\right]\Big/\left[\hat{v}_{ij} - \frac{\hat{q}_{ij}(1-\hat{q}_{ij})}{\hat{E}(n)}\right], \qquad \hat{\alpha}_i = \sum_{j=1}^{L}\hat{q}_{ij}\,\hat{\alpha}_i(j),$$
$$\hat{v}_{ij} = \frac{\widehat{\mathrm{Var}}(n_j)}{\hat{E}^2(n)} - \hat{q}_{ij}^{2}\,\frac{\widehat{\mathrm{Var}}(n)}{\hat{E}^2(n)}$$
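
The reduction to one dimension can be sketched as follows: estimate $\hat{q}_{ij}$ from the columns of one bin, substitute $\alpha_{ij} = \hat{q}_{ij}\alpha_i$, and maximize over the single remaining variable $\alpha_i$. The slide uses Newton's method from a second-moment starting point; the sketch below swaps in a coarse log-spaced grid search to stay short, and its function names, grid bounds and data are hypothetical.

```python
import math

def first_moment_q(columns):
    """q_hat_j = mean(n_j) / mean(n), computed over the columns of one bin."""
    L = len(columns[0])
    totals = [sum(col[j] for col in columns) for j in range(L)]
    grand = sum(totals)
    return [t / grand for t in totals]

def bin_objective(alpha_sum, q, columns):
    """Step 3 objective with alpha_ij = q_j * alpha_sum substituted in."""
    value = 0.0
    for col in columns:
        value += math.lgamma(alpha_sum) - math.lgamma(alpha_sum + sum(col))
        for qj, nj in zip(q, col):
            if nj > 0:  # terms with n_j = 0 cancel, and q_j may be 0 there
                value += math.lgamma(qj * alpha_sum + nj) - math.lgamma(qj * alpha_sum)
    return value

def grid_search_alpha(q, columns, lo=1e-2, hi=1e3, steps=200):
    """Coarse log-spaced grid search; a stand-in for Newton's method."""
    grid = [lo * (hi / lo) ** (k / (steps - 1)) for k in range(steps)]
    return max(grid, key=lambda a: bin_objective(a, q, columns))

columns = [[6, 1, 0], [5, 0, 1], [7, 0, 0], [4, 2, 0], [6, 0, 1]]
q_hat = first_moment_q(columns)
alpha_hat = grid_search_alpha(q_hat, columns)
print(q_hat, alpha_hat)
```

The grid search only locates the maximum to within the grid spacing, so in practice its result would serve as a starting point for a proper one-dimensional optimizer.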

Inference
Issues:
• Jumping out of local optima via complete bin assignment (stochastic search)
• Refinement via partial bin assignment (non-stochastic search)
• Initialization strategies
• Number of components

Model Complexity
Total cost function (description length) = -log(Data likelihood) + Model complexity
Model complexity* ≈ Complexity of M−1 mixture parameters (Comp1) + Complexity of M Dirichlet components (Comp2) − Overcounted complexity (Comp3), where
$$\mathrm{Comp}_1 = \frac{M-1}{2}\log\frac{T}{2} + \frac{1}{2}\log\pi - \log\Gamma(M/2) + o(1)$$
$$\mathrm{Comp}_2 = M\,\big[\,\cdots + \Delta_{L,n} + o(1)\,\big] \quad \text{(remaining terms as in Yu and Altschul 2011)}$$
$$\mathrm{Comp}_3 = \log(M!)$$
As the number of components increases, the value of -log(Data likelihood) decreases, but the model complexity increases.
*For details, see Yu and Altschul 2011. The complexity of the Dirichlet model for multiple alignment data. J. Comput. Biol., in press.

A (Toy) Inference Example
• Experiment:
Ground truth model $\Theta$ → artificial data $\vec{n}^{(1:T)}$ → inference algorithm → inferred model $\Theta'$
for t = 1 to T
  draw $\vec{p}^{(t)} \sim f(\vec{p}\mid\Theta) = \sum_{i=1}^{M} m_i\cdot\mathrm{Dir}(\vec{p};\vec{\alpha}_i)$
  draw $n^{(t)} \sim \mathrm{Poisson}(n;\lambda)$
  draw $\vec{n}^{(t)} \sim \mathrm{Multinomial}(\vec{n};\, n^{(t)}, \vec{p}^{(t)})$
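
The three draws of the experiment can be reproduced with the standard library alone. The sketch below is one possible way to generate such artificial data, not the authors' generator: the Dirichlet draw normalizes Gamma variates, the Poisson draw uses Knuth's multiplication method, and the prior, seed and function names are hypothetical.

```python
import math
import random

def sample_dirichlet(alpha, rng):
    """Draw p ~ Dir(alpha) by normalizing independent Gamma(alpha_j, 1) draws."""
    g = [rng.gammavariate(a, 1.0) for a in alpha]
    s = sum(g)
    return [x / s for x in g]

def sample_poisson(lam, rng):
    """Knuth's multiplication method; adequate for moderate lambda."""
    threshold, k, prod = math.exp(-lam), 0, rng.random()
    while prod > threshold:
        k += 1
        prod *= rng.random()
    return k

def sample_columns(m, alphas, lam, T, rng):
    """Generate T count vectors following the three draws of the experiment."""
    columns = []
    for _ in range(T):
        i = rng.choices(range(len(m)), weights=m)[0]  # pick a mixture component
        p = sample_dirichlet(alphas[i], rng)
        n = sample_poisson(lam, rng)
        counts = [0] * len(p)
        for letter in rng.choices(range(len(p)), weights=p, k=n):
            counts[letter] += 1  # multinomial draw, one letter at a time
        columns.append(counts)
    return columns

rng = random.Random(1)
m = [0.5, 0.5]
alphas = [[5.0, 1.0, 1.0], [1.0, 1.0, 5.0]]
cols = sample_columns(m, alphas, 20.0, 50, rng)
print(cols[:3])
```

For priors with extremely small $\alpha_{ij}$, such as some entries of the 9-component ground truth, the Gamma draws can underflow to zero, so a production generator would need extra care there.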

A (Toy) Inference Example
Ground truth ($m_i$ and $\alpha_{ij}$), a 9-component DM:
m 0.13817700 0.13904300 0.08104890 0.06581460 0.02579240 0.24203400 0.09586050 0.13965400 0.07257600
A 0.39908600 0.01582840 0.04759630 0.09969660 0.00000145 0.55526200 0.11428400 0.27425900 0.13620300
C 0.05420540 0.01311490 0.01793070 0.00941176 0.00222160 0.05038680 0.00418655 0.07987990 0.05896460
D 0.04941170 0.01416740 0.00579683 0.00008857 0.00000133 0.50257200 0.78424300 0.02031100 0.05251870
E 0.02845140 0.01232240 0.01302770 0.11627800 0.00000128 0.90223900 0.43477100 0.03894260 0.06140270
F 0.02591200 0.00102215 0.23249400 0.02026550 0.22089800 0.09011690 0.01252450 0.10836100 0.70405100
G 0.25260300 0.04945070 0.01069620 0.09818010 0.00000204 0.26072500 0.24440200 0.03979830 0.07961690
H 0.02367120 0.00745678 0.00841542 0.10135500 0.00871764 0.19098700 0.06603250 0.02067700 0.21530800
I 0.02425040 0.00249989 0.42645700 0.05080130 0.00000390 0.16955000 0.01017580 0.87321800 0.17152900
K 0.05788260 0.00525776 0.00000126 0.78829500 0.00000133 0.75006000 0.15614500 0.05859190 0.08389390
L 0.05685540 0.00645843 1.12570000 0.13809100 0.02457390 0.29736400 0.01321530 0.52930500 0.36405600
M 0.02530400 0.00126565 0.22821000 0.04358250 0.00015563 0.12071100 0.00301977 0.12966800 0.11294100
N 0.11182400 0.00971496 0.01214120 0.12342400 0.00000131 0.37148600 0.36926200 0.02440750 0.09727970
P 0.12844600 0.02412470 0.02611130 0.09111190 0.00000127 0.28072300 0.09127260 0.05222050 0.05702610
Q 0.02867630 0.00535386 0.02659940 0.22541700 0.00000132 0.56156800 0.11887700 0.02838930 0.06834880
R 0.03932920 0.01008840 0.01855830 0.71342700 0.00037950 0.46791400 0.05369930 0.04072360 0.09469440
S 0.44507700 0.00187662 0.01232570 0.09276430 0.00000140 0.55988900 0.21704900 0.06296060 0.12344600
T 0.28037800 0.00629699 0.00000180 0.09193250 0.00151323 0.50655300 0.09415750 0.22034700 0.11868400
V 0.09074450 0.00442848 0.22871500 0.05469920 0.00000412 0.27808200 0.00937226 1.12861000 0.16444900
W 0.00733549 0.00450365 0.02189680 0.01274710 0.07318310 0.03643850 0.00486890 0.01703660 0.15407100
Y 0.02651190 0.00224834 0.02679920 0.03594690 0.17363700 0.12555800 0.03150400 0.05824380 0.83395000
αi 2.155955 0.197480 2.489474 2.907515 0.505300 7.078185 2.833062 3.805951 3.752434
From http://compbio.soe.ucsc.edu/dirichlets/byst-4.5-0-3.9comp

A (Toy) Inference Example
[Figure]
Figure adapted from Ye et al. 2011. On the Inference of Dirichlet Mixture Priors for Protein Sequence Comparison. J. Comput. Biol., in press.

A (Toy) Inference Example
     m_i'/m_i              α_i'/α_i              JS(q^(i), q'^(i))*
i    T = 2k,   T = 100k,   T = 2k,   T = 100k,   T = 2k,   T = 100k,
     λ = 20    λ = 80      λ = 20    λ = 80      λ = 20    λ = 80
1    1.030     1.007       1.009     0.996       0.00162   0.00002
2    1.337     1.000       0.939     0.998       0.00176   0.00004
3    0.838     0.994       1.028     0.985       0.02608   0.00013
4    1.165     0.993       0.823     0.985       0.00923   0.00006
5    0.884     0.998       0.742     1.003       0.00553   0.00006
6    0.421     1.010       1.963     0.997       0.02997   0.00022
7    0.997     0.994       1.165     0.998       0.00618   0.00006
8    1.003     0.996       1.206     0.996       0.00573   0.00023
9    1.196     1.013       1.491     1.005       0.03329   0.00109
Table adapted from Ye et al. 2011. On the Inference of Dirichlet Mixture Priors for Protein Sequence Comparison. J. Comput. Biol., in press.
* $q_j^{(i)} = \alpha_{ij}/\alpha_i$, j = 1, …, L; i = 1, …, M. JS(·,·) denotes the Jensen-Shannon divergence.
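
The Jensen-Shannon divergence between the mean vectors of a true and an inferred component can be computed as below. The sketch assumes natural logarithms, since the slide does not state the base, and the function names and the two example components are hypothetical.

```python
import math

def mean_vector(alpha):
    """q_j = alpha_j / sum_j alpha_j, the mean of Dir(alpha)."""
    s = sum(alpha)
    return [a / s for a in alpha]

def kl_divergence(p, q):
    """Kullback-Leibler divergence sum_j p_j ln(p_j / q_j), skipping p_j = 0."""
    return sum(pj * math.log(pj / qj) for pj, qj in zip(p, q) if pj > 0)

def js_divergence(p, q):
    """Jensen-Shannon divergence: mean KL divergence to the midpoint of p and q."""
    mid = [(a + b) / 2.0 for a, b in zip(p, q)]
    return 0.5 * kl_divergence(p, mid) + 0.5 * kl_divergence(q, mid)

true_q = mean_vector([5.0, 1.0, 1.0])
inferred_q = mean_vector([4.8, 1.1, 0.9])
print(js_divergence(true_q, inferred_q))
```

With natural logarithms the divergence lies between 0 (identical vectors) and ln 2 (vectors with disjoint support).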

Inference for Real Data
M    -log2(Data)   Complexity   DL           y
0    9.9605E+07    2.0571E+02   9.9605E+07   0.0000
1    7.5622E+07    2.1196E+02   7.5622E+07   1.0033
2    7.5017E+07    4.0291E+02   7.5018E+07   1.0285
3    7.4768E+07    5.8573E+02   7.4768E+07   1.0390
4    7.4654E+07    7.6324E+02   7.4654E+07   1.0437
5    7.4601E+07    9.3678E+02   7.4602E+07   1.0459
6    7.4562E+07    1.1071E+03   7.4563E+07   1.0476
7    7.4518E+07    1.2749E+03   7.4520E+07   1.0494
8    7.4486E+07    1.4404E+03   7.4488E+07   1.0507
9    7.4438E+07    1.6038E+03   7.4439E+07   1.0527
10   7.4394E+07    1.7656E+03   7.4396E+07   1.0546
11   7.4365E+07    1.9257E+03   7.4367E+07   1.0558
12   7.4343E+07    2.0844E+03   7.4345E+07   1.0567
13   7.4332E+07    2.2418E+03   7.4334E+07   1.0571
14   7.4300E+07    2.3980E+03   7.4303E+07   1.0584
15   7.4288E+07    2.5531E+03   7.4290E+07   1.0590
16   7.4275E+07    2.7070E+03   7.4278E+07   1.0595
17   7.4265E+07    2.8600E+03   7.4268E+07   1.0599
18   7.4253E+07    3.0121E+03   7.4256E+07   1.0604
19   7.4243E+07    3.1633E+03   7.4246E+07   1.0608
20   7.4234E+07    3.3137E+03   7.4237E+07   1.0612
21   7.4221E+07    3.4632E+03   7.4225E+07   1.0617
22   7.4214E+07    3.6120E+03   7.4218E+07   1.0620
23   7.4198E+07    3.7601E+03   7.4202E+07   1.0627
24   7.4193E+07    3.9075E+03   7.4197E+07   1.0629
25   7.4184E+07    4.0543E+03   7.4188E+07   1.0632
26   7.4178E+07    4.2004E+03   7.4182E+07   1.0635
27   7.4172E+07    4.3459E+03   7.4176E+07   1.0637
28   7.4169E+07    4.4908E+03   7.4174E+07   1.0638
29   7.4153E+07    4.6351E+03   7.4158E+07   1.0645
30   7.4152E+07    4.7789E+03   7.4156E+07   1.0646
31   7.4146E+07    4.9222E+03   7.4150E+07   1.0648
32   7.4142E+07    5.0649E+03   7.4147E+07   1.0650
33   7.4136E+07    5.2072E+03   7.4141E+07   1.0652
34   7.4131E+07    5.3490E+03   7.4136E+07   1.0654
35   7.4131E+07    5.4903E+03   7.4136E+07   1.0654
36   7.4131E+07    5.6312E+03   7.4136E+07   1.0654
37   7.4131E+07    5.7716E+03   7.4136E+07   1.0654
38   7.4131E+07    5.9116E+03   7.4136E+07   1.0654
Figure adapted from Ye et al. 2011. On the Inference of Dirichlet Mixture Priors for Protein Sequence Comparison. J. Comput. Biol., in press. The data set can be downloaded from UCSC; it contains T = 314585 columns, with $\bar{n}$ = 75.99.

Adjustment
• A DM, constructed from a curated set of data with a specific letter (e.g. amino acid) composition, will imply this same set of background frequencies ($q_j = \sum_{i=1}^{M} m_i\,\alpha_{ij}/\alpha_i$, j = 1, …, L). One may, however, wish to apply Bayesian analysis to a set of sequences with markedly different letter composition
• It would be impractical to derive a new set of priors for such sequences, but non-optimal to use a prior that implies a different composition
• We have therefore sought to adjust a given set of DM priors to a non-standard composition. We define this as constructing a DM prior that implies the desired composition, but that is "close" to the original prior
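
The background frequencies implied by a DM are the mixture-weighted means of its components. A minimal sketch, with a hypothetical function name and a hypothetical 2-component, 3-letter prior:

```python
def implied_background(m, alphas):
    """q_j = sum_i m_i * alpha_ij / alpha_i, with alpha_i = sum_j alpha_ij."""
    L = len(alphas[0])
    return [sum(mi * a[j] / sum(a) for mi, a in zip(m, alphas)) for j in range(L)]

m = [0.5, 0.5]
alphas = [[5.0, 1.0, 1.0], [1.0, 1.0, 5.0]]
q = implied_background(m, alphas)
print(q)
```

Comparing this vector with the letter frequencies of a target data set shows how far the prior's composition is from the one it will be applied to.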
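
Adjustment Model
• Implied frequencies: $q_j = \sum_{i=1}^{M} m_i\,\alpha_{ij}/\alpha_i$, j = 1, …, L
• Target frequencies: $q_j' = (\text{\# of letter } j)/(\text{\# of all letters})$, j = 1, …, L
• Ideal formulation (difficult)
$$\text{Minimize over } \Theta':\quad G(\Theta';\Theta) = \int_{\vec{p}} f(\vec{p}\mid\Theta)\,\ln\frac{f(\vec{p}\mid\Theta)}{f(\vec{p}\mid\Theta')}\,d\vec{p}$$
$$\text{Subject to}\quad \sum_{i=1}^{M} m_i'\,\frac{\alpha_{ij}'}{\alpha_i'} = q_j', \quad j = 1,\ldots,L$$
• A practical formulation (easy)
$$\text{Minimize over } \Theta':\quad F(\Theta';\Theta) = \sum_{i=1}^{M} m_i\int_{\vec{p}}\mathrm{Dir}(\vec{p};\vec{\alpha}_i)\,\ln\frac{\mathrm{Dir}(\vec{p};\vec{\alpha}_i)}{\mathrm{Dir}(\vec{p};\vec{\alpha}_i')}\,d\vec{p}$$
$$\text{Subject to}\quad m_i' = m_i,\ i = 1,\ldots,M;\qquad \alpha_i' = \alpha_i,\ i = 1,\ldots,M;\qquad \sum_{i=1}^{M} m_i\,\frac{\alpha_{ij}'}{\alpha_i} = q_j',\ j = 1,\ldots,L$$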
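
Local Adjustment Model
• Objective function in practical formulation
$$F(\Theta';\Theta) = \sum_{i=1}^{M} m_i\int_{\vec{p}}\mathrm{Dir}(\vec{p};\vec{\alpha}_i)\,\ln\frac{\mathrm{Dir}(\vec{p};\vec{\alpha}_i)}{\mathrm{Dir}(\vec{p};\vec{\alpha}_i')}\,d\vec{p}$$
$$= \sum_{i=1}^{M} m_i\sum_{j=1}^{L}\Big(\ln\Gamma(\alpha_{ij}') - \ln\Gamma(\alpha_{ij}) - \psi(\alpha_{ij})(\alpha_{ij}'-\alpha_{ij})\Big)$$
$$= \sum_{i=1}^{M} m_i\sum_{j=1}^{L}\Big(\tfrac{1}{2}\,\psi'(\alpha_{ij})(\alpha_{ij}'-\alpha_{ij})^{2} + O\big((\alpha_{ij}'-\alpha_{ij})^{3}\big)\Big),$$
where $\psi$ is the digamma function and $\psi'$ its derivative.
• As $\alpha_{ij}'$ is close to $\alpha_{ij}$,
$$F(\Theta';\Theta) \approx \frac{1}{2}\sum_{i=1}^{M} m_i\sum_{j=1}^{L}\psi'(\alpha_{ij})(\alpha_{ij}'-\alpha_{ij})^{2} = \frac{1}{2}\sum_{i=1}^{M}\sum_{j=1}^{L} R_{i,j}\,(\alpha_{ij}'-\alpha_{ij})^{2} = \frac{1}{2}\sum_{i=1}^{M}(\vec{\alpha}_i'-\vec{\alpha}_i)^{T} R_i\,(\vec{\alpha}_i'-\vec{\alpha}_i),$$
with $R_{i,j} = m_i\,\psi'(\alpha_{ij})$ and $R_i = \mathrm{diag}\{R_{i,1},\ldots,R_{i,L}\}$.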

Successive Local Adjustment
• Intermediate targets
$$\vec{q}^{(t)} = \vec{q} + t\,(\vec{q}\,'-\vec{q})/N, \quad t = 1,\ldots,N$$
$$\Delta\vec{q} = \vec{q}^{(t)} - \vec{q}^{(t-1)} = (\vec{q}\,'-\vec{q})/N$$
The procedure moves from $\vec{\alpha}_i$, which satisfies $\sum_{i=1}^{M} m_i\,\alpha_{ij}/\alpha_i = q_j$, to $\vec{\alpha}_i^{(N)}$, which satisfies $\sum_{i=1}^{M} m_i\,\alpha_{ij}^{(N)}/\alpha_i = q_j'$.
• Lagrange-Newton method
$$\vec{\alpha}_i^{(t)} = \vec{\alpha}_i^{(t-1)} + m_i\,S_i^{(t-1)} B^{(t-1)}\,\Delta\vec{q},$$
$$S_i^{(t-1)} = \frac{1}{\alpha_i}\big(R_i^{(t-1)}\big)^{-1}\left(I - \frac{\vec{1}\,\vec{1}^{\,T}\big(R_i^{(t-1)}\big)^{-1}}{\vec{1}^{\,T}\big(R_i^{(t-1)}\big)^{-1}\vec{1}}\right),$$
$$B^{(t-1)} = \big(A^{(t-1)}\big)^{+}, \qquad A^{(t-1)} = \sum_{i=1}^{M}\frac{m_i^{2}}{\alpha_i}\,S_i^{(t-1)}$$
As N → ∞, $\vec{\alpha}_i^{(N)} \to \vec{\alpha}_i^{(\infty)}$, i = 1, …, M.
We have proved* that, if one performs infinitesimal background frequency changes along different paths, the same solution $\vec{\alpha}_i^{(\infty)}$ will be reached.
*For proof, see Ye et al. 2010. Compositional Adjustment of Dirichlet Mixture Priors. J. Comput. Biol., DOI: 10.1089/cmb.2010.0117

A Demo of Adjusting a 3-component 3-letter DM
[Animation over several slides: a triangle (the 3-letter simplex) with vertices $\vec{x}^{(1)}$, $\vec{x}^{(2)}$, $\vec{x}^{(3)}$. A composition $\vec{q} = (q_1, q_2, q_3)^{T}$ is drawn as the point $\vec{x} = (x_1, x_2)^{T} = q_1\vec{x}^{(1)} + q_2\vec{x}^{(2)} + q_3\vec{x}^{(3)}$.]

A Real Case of Adjusting a 20-component 20-letter DM
[Table]
Table from Ye et al. 2010. Compositional Adjustment of Dirichlet Mixture Priors. J. Comput. Biol., DOI: 10.1089/cmb.2010.0117

A Real Case of Adjusting a 20-component 20-letter DM
• The 20-component DM is "recode4" from http://compbio.soe.ucsc.edu/dirichlets/recode4.20comp
• The target frequencies are from a set of 53 Api-AP2 proteins from Toxoplasma gondii (Altschul et al., 2010)
• Adjustment with N = 1000 steps takes 0.28 sec!
[Figure: maximum relative error (%) versus N, for N from 0 to 1000]
$$r_m(N) = \max_{i,j}\frac{\big|\alpha_{ij}^{(N)} - \alpha_{ij}^{(10000)}\big|}{\alpha_{ij}^{(10000)}}, \qquad N^{*}_{0.01} = \min\{N : r_m(N) < 0.01\}, \qquad N^{*}_{0.01} = 146$$
Figure from Ye et al. 2010. Compositional Adjustment of Dirichlet Mixture Priors. J. Comput. Biol., DOI: 10.1089/cmb.2010.0117

How large N is
• Experiment
We take the baseline to be "recode4". We generate 1000 sets of "desired" background frequencies $\vec{q}\,'$ centered on the baseline frequencies $\vec{q}$ by sampling from a single Dirichlet distribution $\mathrm{Dir}(\vec{p}; 75\vec{q})$. For each $\vec{q}\,'$, we calculate $N^{*}_{0.01}$. We sort the $\vec{q}\,'$ into bins according to the quantity $A = \frac{1}{20}\sum_{j=1}^{20}\big|\log_2(q_j'/q_j)\big|$. For each bin, dots represent the observed mean value of $N^{*}_{0.01}$, with error bars showing one standard deviation for this estimate. Triangles represent the 90th percentile for values of $N^{*}_{0.01}$ within each bin. The particular case studied in Table 2 and the previous figure is shown by an "x".
[Figure: $N^{*}_{0.01}$ (0 to 1400) versus $A$ (0.4 to 1)]
Figure from Ye et al. 2010. Compositional Adjustment of Dirichlet Mixture Priors. J. Comput. Biol., DOI: 10.1089/cmb.2010.0117

Pattern Search
• Objective: given a set of protein sequences, find significant patterns using the Dirichlet mixture prior (trained from gold-standard data and adjusted according to the letter composition of the target data set)
• Search method: Gibbs sampling procedure based on the position posteriors and shift posteriors

Pattern Search
Gibbs sampling based optimization procedure (basic form):
• Step 0. Start from a position vector $\vec{t}^{(0)}$ to form a pattern window $X^{(0)}$ of length w.
• Step 1. Given the current pattern window represented by $X^{(k)} = (\vec{t}^{(k)}, w)$, pick a sequence i and compute the position posteriors of all possible $t_i$.
• Step 2. Draw $t_i^{(k+1)}$ according to the position posteriors. Set $t_{i'}^{(k+1)} = t_{i'}^{(k)}$ for i' ≠ i.
• Go to Step 1.
(Occasionally, after Step 1 and Step 2, one may shift the whole pattern window according to the shift posteriors)

Pattern Search
• Back to the sequence data given at the beginning
[Figures: pattern score versus iteration for window lengths w = 5, w = 8 and w = 10; each panel shows the score and the best score so far over roughly 3000 iterations]
Pattern Search
For example, three patterns found with window lengths 5, 8, and 10
Conclusion
• The Dirichlet mixture model is a powerful tool for mining protein sequence data
• The inference of a DM from data is challenging, but we have designed a better inference procedure
• A given DM prior might imply marginal letter probabilities that differ from those implied by the target sequence data, but we have designed an adjustment procedure
• A DM that is trained from gold-standard alignment data can be used to search for patterns in newly acquired sequence data. We have designed Gibbs-sampling based search procedures that work very well
• The impact of the work is that a considerable amount of manual labor could be saved for biologists and biomedical researchers

Questions
Thanks