On the Dirichlet Mixture Model for Mining Protein Sequence Data

Xugang Ye¹, Stephen Altschul¹

¹National Center for Biotechnology Information

ICML 2011, Bellevue (WA), USA

Background

Biologists need to find, from raw data like this,


Background

the information like this

• The result is called a multiple sequence alignment (MSA)
• MSA is often used for assessing sequence conservation of protein domains, inference of biological functions, reconstruction of ancestries, and eventually drug discoveries


Background

• Fundamental problem: identification of important regions of DNA or protein molecules via multiple sequence alignment (MSA).
• Challenges:
  - The MSA problem is in general intractable
  - Given a similarity measure, there is no known polynomial-time algorithm for solving the MSA problem
  - There is still no consensus about the best similarity measure
  - There is still no satisfactory MSA program that can significantly reduce the manual labor
• New perspective: a machine learning approach based on Bayesian analysis


Background

[Workflow diagram. Boxes: gold-standard MSA data; statistical model; new sequence data; new patterns; new gold-standard MSA data; MSA programs; manual efforts. Arrow labels: training, adjusting, searching, refining.]


Background

• Model the gold-standard MSA data
• Infer the model
• Adjust the model
• Use the model to search for patterns
• Refine the patterns using MSA programs
• Incorporate the new discoveries into the libraries


Dirichlet Mixture Model

Multiple sequence alignments are treated as independent columns: Col(1), Col(2), ..., Col(T)

Count vectors: $\vec{n}^{(1)}, \vec{n}^{(2)}, \ldots, \vec{n}^{(T)}$

$$P(\mathrm{Col}^{(t)} \mid \vec{p}) = \prod_{j=1}^{L} p_j^{\,n_j^{(t)}}$$

Dirichlet Mixture (DM) prior:

$$f(\vec{p} \mid \Theta) = \sum_{i=1}^{M} m_i \cdot \frac{\prod_{j=1}^{L} p_j^{\alpha_{ij}-1}}{B(\alpha_{i1},\ldots,\alpha_{iL})} = \sum_{i=1}^{M} m_i \cdot \mathrm{Dir}(\vec{p};\vec{\alpha}_i)$$

Hierarchical model:

Hyperparameters: mi, i = 1, 2, …, M; αij, i = 1, 2, …, M, j = 1, 2, …, L

Multinomial parameters: pj, j = 1, 2, …, L

Count vector: nj, j = 1, 2, …, L

Why hierarchical? Because a cluster of columns is not governed by a fixed $\vec{p}$. It is governed by a random $\vec{p}$ whose distribution is parameterized by a set of higher-level parameters.


Theoretical Background

• What is a DM

$$f(\vec{p} \mid \Theta) = \sum_{i=1}^{M} m_i \cdot \frac{\Gamma(\alpha_i)}{\prod_{j=1}^{L}\Gamma(\alpha_{ij})} \prod_{j=1}^{L} p_j^{\alpha_{ij}-1}, \quad \alpha_i = \sum_{j=1}^{L}\alpha_{ij},$$

$$p_j > 0,\ j = 1,\ldots,L; \quad \sum_{j=1}^{L} p_j = 1$$

$$\Theta = \Big\{ m_i > 0 : i = 1,\ldots,M;\ \sum_{i=1}^{M} m_i = 1 \Big\} \cup \big\{ \alpha_{ij} > 0 : i = 1,\ldots,M;\ j = 1,\ldots,L \big\}$$

• Conjugate priors

$$f(\vec{p} \mid \vec{n}; \Theta) = \sum_{i=1}^{M} m_i' \cdot \frac{\Gamma(\alpha_i + n)}{\prod_{j=1}^{L}\Gamma(\alpha_{ij} + n_j)} \prod_{j=1}^{L} p_j^{\alpha_{ij} + n_j - 1}, \quad n = \sum_{j=1}^{L} n_j$$

$$m_i' = \frac{m_i \dfrac{\Gamma(\alpha_i)}{\Gamma(\alpha_i + n)} \prod_{j=1}^{L} \dfrac{\Gamma(\alpha_{ij} + n_j)}{\Gamma(\alpha_{ij})}}{\sum_{i'=1}^{M} m_{i'} \dfrac{\Gamma(\alpha_{i'})}{\Gamma(\alpha_{i'} + n)} \prod_{j=1}^{L} \dfrac{\Gamma(\alpha_{i'j} + n_j)}{\Gamma(\alpha_{i'j})}}, \quad i = 1,\ldots,M$$

where $\vec{n} = [n_1, n_2, \ldots, n_L]^T$ and nj = number of the j-th letter in a given column


Theoretical Background

• Bayesian classifier (for clustering independent columns)

A column Col has count vector $\vec{n} = [n_1, n_2, \ldots, n_L]^T$.

Total probability:

$$P(\mathrm{Col} \mid \Theta) = \sum_{i=1}^{M} m_i \cdot P(\mathrm{Col} \mid i; \Theta)$$

Likelihood:

$$P(\mathrm{Col} \mid i; \Theta) = \frac{\Gamma(\alpha_i)}{\Gamma(\alpha_i + n)} \prod_{j=1}^{L} \frac{\Gamma(\alpha_{ij} + n_j)}{\Gamma(\alpha_{ij})}$$

Posterior:

$$P(i \mid \mathrm{Col}; \Theta) = \frac{m_i \cdot P(\mathrm{Col} \mid i; \Theta)}{P(\mathrm{Col} \mid \Theta)}$$
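The classifier can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names `log_component_likelihood` and `column_posterior` are hypothetical, the mixture is passed as plain lists (`m` for the weights, `alphas` for the rows of αij), and the computation is done in log space to avoid overflow of the Gamma function.

```python
import math

def log_component_likelihood(counts, alpha):
    """ln P(Col | i; Theta) for one Dirichlet component with parameters alpha."""
    n = sum(counts)
    a = sum(alpha)
    result = math.lgamma(a) - math.lgamma(a + n)
    for n_j, a_j in zip(counts, alpha):
        result += math.lgamma(a_j + n_j) - math.lgamma(a_j)
    return result

def column_posterior(counts, m, alphas):
    """Return (ln P(Col | Theta), [P(i | Col; Theta) for each component i])."""
    logs = [math.log(m_i) + log_component_likelihood(counts, a_i)
            for m_i, a_i in zip(m, alphas)]
    top = max(logs)  # log-sum-exp for numerical stability
    log_total = top + math.log(sum(math.exp(v - top) for v in logs))
    return log_total, [math.exp(v - log_total) for v in logs]

# A column holding one copy of letter 0, scored under a 2-component, 2-letter DM.
log_total, posterior = column_posterior([1, 0], [0.5, 0.5], [[9.0, 1.0], [1.0, 9.0]])
```

For this tiny case the posterior is (0.9, 0.1): the component that favors letter 0 takes most of the weight.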


Theoretical Background

• Log-odds ratio (for measuring the similarity between two columns)

Col(1) has count vector $\vec{n}^{(1)} = [n_1^{(1)}, n_2^{(1)}, \ldots, n_L^{(1)}]^T$ and Col(2) has count vector $\vec{n}^{(2)} = [n_1^{(2)}, n_2^{(2)}, \ldots, n_L^{(2)}]^T$.

Merging Col(1) and Col(2) gives Col(12) with count vector

$$\vec{n}^{(12)} = [n_1^{(1)} + n_1^{(2)},\ n_2^{(1)} + n_2^{(2)},\ \ldots,\ n_L^{(1)} + n_L^{(2)}]^T$$

$$\ln\frac{P(\mathrm{Col}^{(1)} \mid \mathrm{Col}^{(2)}; \Theta)}{P(\mathrm{Col}^{(1)} \mid \Theta)} = \ln\frac{P(\mathrm{Col}^{(1)}, \mathrm{Col}^{(2)} \mid \Theta)}{P(\mathrm{Col}^{(1)} \mid \Theta)\,P(\mathrm{Col}^{(2)} \mid \Theta)} = \ln\frac{P(\mathrm{Col}^{(12)} \mid \Theta)}{P(\mathrm{Col}^{(1)} \mid \Theta)\,P(\mathrm{Col}^{(2)} \mid \Theta)}$$

This is the BILD score.

*For details, see Altschul et al. 2010. The Construction and Use of Log-Odds Substitution Scores for Multiple Sequence Alignment. PLoS Comput. Biol., 6(7), e1000852. Also see Ye et al. 2011. An Assessment of Substitution Scores for Protein Profile-Profile Comparison. Bioinformatics, in press.
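As an illustration, the column-versus-column log-odds score can be computed directly from the total probability P(Col | Θ). The sketch below is not taken from the cited papers; the names `log_column_prob` and `bild_score` are hypothetical, and natural logarithms are assumed.

```python
import math

def log_column_prob(counts, m, alphas):
    """ln P(Col | Theta) under a Dirichlet mixture with weights m and rows alphas."""
    n = sum(counts)
    terms = []
    for m_i, alpha in zip(m, alphas):
        a = sum(alpha)
        v = math.log(m_i) + math.lgamma(a) - math.lgamma(a + n)
        v += sum(math.lgamma(a_j + n_j) - math.lgamma(a_j)
                 for a_j, n_j in zip(alpha, counts))
        terms.append(v)
    top = max(terms)
    return top + math.log(sum(math.exp(v - top) for v in terms))

def bild_score(counts1, counts2, m, alphas):
    """ln [ P(Col12) / (P(Col1) P(Col2)) ], where Col12 merges the two columns."""
    merged = [a + b for a, b in zip(counts1, counts2)]
    return (log_column_prob(merged, m, alphas)
            - log_column_prob(counts1, m, alphas)
            - log_column_prob(counts2, m, alphas))

# With a single uniform Dirichlet over two letters, agreeing columns score
# positive and disagreeing columns score negative.
same = bild_score([1, 0], [1, 0], [1.0], [[1.0, 1.0]])
diff = bild_score([1, 0], [0, 1], [1.0], [[1.0, 1.0]])
```

Here `same` equals ln(4/3) and `diff` equals ln(2/3).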


Theoretical Background

• Pattern score (for measuring the pattern significance)

Pattern window: $X = (\vec{x}^{(1)}, \vec{x}^{(2)}, \ldots, \vec{x}^{(w)})$

Probability:

$$P(X \mid \Theta) = \prod_{\tau=1}^{w} P(\vec{x}^{(\tau)} \mid \Theta)$$

Odds-ratio:

$$R(X \mid \Theta) = \frac{\prod_{\tau=1}^{w} P(\vec{x}^{(\tau)} \mid \Theta)}{\prod_{\tau=1}^{w}\prod_{j=1}^{L} (p_j^{b})^{n_j^{(\tau)}}}$$

where $p_j^{b}$ is the background (marginal) probability of letter j.

Log odds-ratio:

$$\ln R(X \mid \Theta) = \sum_{\tau=1}^{w} \ln\frac{P(\vec{x}^{(\tau)} \mid \Theta)}{\prod_{j=1}^{L} (p_j^{b})^{n_j^{(\tau)}}} = \sum_{\tau=1}^{w}\Big[\ln P(\vec{x}^{(\tau)} \mid \Theta) - \sum_{j=1}^{L} n_j^{(\tau)} \ln p_j^{b}\Big]$$

This is the BILD score.


Theoretical Background

• Position posteriors (for measuring the aligning likelihood of a position in a sequence, given a pattern)

Given pattern: $X = (\vec{x}^{(1)}, \vec{x}^{(2)}, \ldots, \vec{x}^{(w)})$

Position t in a sequence: $(y_1^{(t)}, y_2^{(t)}, \ldots, y_w^{(t)})$

pt = P("position t is aligned to the first column of the given pattern")

$$p_t = \frac{P(t \mid X; \Theta)}{\sum_{t'} P(t' \mid X; \Theta)} = \frac{P(y_1^{(t)},\ldots,y_w^{(t)} \mid X; \Theta)}{\sum_{t'} P(y_1^{(t')},\ldots,y_w^{(t')} \mid X; \Theta)},$$

where, under the independence assumption,

$$P(y_1^{(t)},\ldots,y_w^{(t)} \mid X; \Theta) = \frac{P(y_1^{(t)},\ldots,y_w^{(t)}, X \mid \Theta)}{P(X \mid \Theta)} = \prod_{\tau=1}^{w} \frac{P(y_\tau^{(t)}, \vec{x}^{(\tau)} \mid \Theta)}{P(\vec{x}^{(\tau)} \mid \Theta)}$$

* One can also normalize the ratio

$$\frac{P(y_1^{(t)},\ldots,y_w^{(t)} \mid X; \Theta)}{P(y_1^{(t)},\ldots,y_w^{(t)} \mid \Theta)} = \prod_{\tau=1}^{w} \frac{P(y_\tau^{(t)}, \vec{x}^{(\tau)} \mid \Theta)}{P(\vec{x}^{(\tau)} \mid \Theta)\,P(y_\tau^{(t)} \mid \Theta)}.$$
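A minimal sketch of the position posteriors follows. It assumes that each factor P(y, x | Θ)/P(x | Θ) is evaluated by adding the letter y to the count vector of column x, and that letters are encoded as integer indices 0..L-1; the helper names are hypothetical, and the pattern counts are taken as given.

```python
import math

def log_column_prob(counts, m, alphas):
    """ln P(Col | Theta) under a Dirichlet mixture with weights m and rows alphas."""
    n = sum(counts)
    terms = []
    for m_i, alpha in zip(m, alphas):
        a = sum(alpha)
        v = math.log(m_i) + math.lgamma(a) - math.lgamma(a + n)
        v += sum(math.lgamma(a_j + n_j) - math.lgamma(a_j)
                 for a_j, n_j in zip(alpha, counts))
        terms.append(v)
    top = max(terms)
    return top + math.log(sum(math.exp(v - top) for v in terms))

def position_posteriors(sequence, pattern, m, alphas):
    """p_t for every start position t of a length-w window in `sequence`.

    `pattern` is a list of w count vectors; `sequence` is a list of letter indices.
    """
    w = len(pattern)
    base = [log_column_prob(col, m, alphas) for col in pattern]
    log_scores = []
    for t in range(len(sequence) - w + 1):
        score = 0.0
        for tau in range(w):
            col = list(pattern[tau])
            col[sequence[t + tau]] += 1  # column x with letter y added
            score += log_column_prob(col, m, alphas) - base[tau]
        log_scores.append(score)
    top = max(log_scores)
    weights = [math.exp(s - top) for s in log_scores]
    total = sum(weights)
    return [v / total for v in weights]

# One-column pattern holding three copies of letter 0; uniform 2-letter Dirichlet.
p = position_posteriors([0, 1], [[3, 0]], [1.0], [[1.0, 1.0]])
```

In this toy case the position carrying letter 0 receives posterior 0.8 and the other position 0.2.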


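Theoretical Background

• Shift posteriors (for measuring the shifting likelihood of a pattern)

Staying likelihood:

$$X^{0} = (\vec{x}^{(1)}, \vec{x}^{(2)}, \ldots, \vec{x}^{(w)}), \quad P(X^{0} \mid \Theta) = \prod_{\tau=1}^{w} P(\vec{x}^{(\tau)} \mid \Theta)$$

Shifting likelihoods:

$$X^{-1} = (\vec{x}^{(0)}, \vec{x}^{(1)}, \ldots, \vec{x}^{(w-1)}), \quad P(X^{-t} \mid \Theta) = P(X^{-t+1} \mid \Theta) \cdot \frac{P(\vec{x}^{(-t+1)} \mid \Theta)}{P(\vec{x}^{(-t+w+1)} \mid \Theta)}$$

$$X^{+1} = (\vec{x}^{(2)}, \ldots, \vec{x}^{(w)}, \vec{x}^{(w+1)}), \quad P(X^{t+1} \mid \Theta) = P(X^{t} \mid \Theta) \cdot \frac{P(\vec{x}^{(t+w+1)} \mid \Theta)}{P(\vec{x}^{(t+1)} \mid \Theta)}$$

pt = P("shifting the current pattern window t columns")

$$p_t = \frac{P(X^{t} \mid \Theta)}{\sum_{t'} P(X^{t'} \mid \Theta)}$$

* One can also normalize the ratio

$$\frac{\prod_{\tau=1}^{w} P(\vec{x}^{(t+\tau)} \mid \Theta)}{\prod_{\tau=1}^{w}\prod_{i} P(x_i^{(t+\tau)} \mid \Theta)}$$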


Inference

• Goal: finding a maximum-likelihood prior for describing a set of independent columns

$$\Theta^{*} = \arg\max_{\Theta} \sum_{t=1}^{T} \ln P(\mathrm{Col}^{(t)} \mid \Theta) = \arg\max_{\Theta} \sum_{t=1}^{T} \ln \sum_{i=1}^{M} m_i \cdot \frac{\Gamma(\alpha_i)}{\Gamma(\alpha_i + n^{(t)})} \prod_{j=1}^{L} \frac{\Gamma(\alpha_{ij} + n_j^{(t)})}{\Gamma(\alpha_{ij})}$$

• Challenge: this (constrained) optimization problem is in general intractable
  1) Nonconcave
  2) High dimensional (ML + M−1)
  3) Large size of data (T)

• Previous researchers have developed some good heuristic optimization procedures

• We have designed a better one


Inference

Gibbs sampling based optimization procedure (basic form):
• Step 0. Start from an M-component DM Θ(0).
• Step 1. Given the current M-component DM Θ(k), create M empty bins. For each t, compute the posterior P(i | Col(t); Θ(k)) for each i. For each t, assign Col(t) to the i-th bin according to P(i | Col(t); Θ(k)).
• Step 2. Update mixture parameters: mi' = Ti / T, where Ti is the size of bin i.
• Step 3. For each bin i, solve the optimization problem*

$$\vec{\alpha}_i^{(k+1)} = \arg\max_{\vec{\alpha}_i} \sum_{t=1}^{T_i} \ln\left[\frac{\Gamma(\alpha_i)}{\Gamma(\alpha_i + n^{(t)})} \prod_{j=1}^{L} \frac{\Gamma(\alpha_{ij} + n_j^{(t)})}{\Gamma(\alpha_{ij})}\right]$$

• Go to Step 1.

* Variational method:

$$\sum_{t=1}^{T} \ln P(\mathrm{Col}^{(t)} \mid \Theta) = \sum_{t=1}^{T} \ln \sum_{i=1}^{M} m_i P(\mathrm{Col}^{(t)} \mid i; \Theta) \ge \sum_{t=1}^{T}\sum_{i=1}^{M} m_i \ln P(\mathrm{Col}^{(t)} \mid i; \Theta) = \sum_{i=1}^{M} m_i \sum_{t=1}^{T} \ln P(\mathrm{Col}^{(t)} \mid i; \Theta)$$
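Steps 1 and 2 of the procedure can be sketched as follows. This is an illustration only, not the authors' implementation: the function names are hypothetical, Step 3 (the per-bin optimization of the α vectors) is deliberately left out, and a component whose bin ends up empty would receive weight 0 and would need special handling before the next sweep.

```python
import math
import random

def log_component_likelihood(counts, alpha):
    """ln P(Col | i; Theta) for one Dirichlet component with parameters alpha."""
    n = sum(counts)
    a = sum(alpha)
    result = math.lgamma(a) - math.lgamma(a + n)
    for n_j, a_j in zip(counts, alpha):
        result += math.lgamma(a_j + n_j) - math.lgamma(a_j)
    return result

def gibbs_sweep(columns, m, alphas, rng):
    """Step 1: sample a bin for every column. Step 2: set m_i' = T_i / T."""
    M = len(m)
    bins = [[] for _ in range(M)]
    for counts in columns:
        logs = [math.log(m[i]) + log_component_likelihood(counts, alphas[i])
                for i in range(M)]
        top = max(logs)
        weights = [math.exp(v - top) for v in logs]  # proportional to the posterior
        i = rng.choices(range(M), weights=weights)[0]
        bins[i].append(counts)
    new_m = [len(b) / len(columns) for b in bins]
    return bins, new_m

rng = random.Random(0)
columns = [[20, 0]] * 5 + [[0, 20]] * 5
bins, new_m = gibbs_sweep(columns, [0.5, 0.5], [[9.0, 1.0], [1.0, 9.0]], rng)
```

Sampling the bin, rather than always taking the most probable component, is what makes the search stochastic.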


Inference

Dimension reduction via first moment estimation
• nj | n, pj ~ Binomial(n, pj)
• pj | αi1, …, αiL ~ Beta(αij, αi − αij)
• E(nj) = E(E(nj | n, pj)) = E(npj) = E(n)E(pj) = E(n)·αij/αi ⇒ αij = E(nj)/E(n)·αi
• Replacing qij = E(nj)/E(n) with the estimate $\hat{q}_{ij} = \hat{E}(n_j)/\hat{E}(n)$ and plugging it into the objective function in Step 3 reduces the dimension from L to 1:

$$\hat{\alpha}_i^{(k+1)} = \arg\max_{\alpha_i} \sum_{t=1}^{T_i} \ln\left[\frac{\Gamma(\alpha_i)}{\Gamma(\alpha_i + n^{(t)})} \prod_{j=1}^{L} \frac{\Gamma(\hat{q}_{ij}\alpha_i + n_j^{(t)})}{\Gamma(\hat{q}_{ij}\alpha_i)}\right]$$

Second moment estimate as the starting point for Newton's method

$$\hat{\alpha}_i(j) = \left[\hat{q}_{ij}(1-\hat{q}_{ij})\frac{\hat{E}(n^2)}{\hat{E}^2(n)} - \hat{v}_{ij}\right] \Big/ \left[\hat{v}_{ij} - \frac{\hat{q}_{ij}(1-\hat{q}_{ij})}{\hat{E}(n)}\right], \quad \hat{\alpha}_i = \sum_{j=1}^{L} \hat{q}_{ij}\,\hat{\alpha}_i(j),$$

$$\hat{v}_{ij} = \frac{\widehat{\mathrm{Var}}(n_j)}{\hat{E}^2(n)} - \hat{q}_{ij}^{2}\,\frac{\widehat{\mathrm{Var}}(n)}{\hat{E}^2(n)}$$
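The moment estimates can be illustrated with a short sketch. It assumes plain sample moments (dividing by the number of columns in the bin) for all the estimated quantities; the function names are hypothetical. The formula breaks down when a letter shows no overdispersion, because the denominator v − q(1 − q)/E(n) is then zero, so a practical version would need a guard for that case.

```python
def first_moment_q(columns):
    """q_j = E(n_j) / E(n), estimated from the count vectors of one bin."""
    L = len(columns[0])
    total = sum(sum(col) for col in columns)
    return [sum(col[j] for col in columns) / total for j in range(L)]

def second_moment_alpha(q, v, mean_n, mean_n2):
    """alpha_i(j) from q_ij, v_ij, E(n) and E(n^2)."""
    return (q * (1 - q) * mean_n2 / mean_n ** 2 - v) / (v - q * (1 - q) / mean_n)

def starting_alpha(columns):
    """Starting value alpha_i = sum_j q_ij * alpha_i(j) for Newton's method."""
    T = len(columns)
    L = len(columns[0])
    n = [sum(col) for col in columns]
    mean_n = sum(n) / T
    mean_n2 = sum(x * x for x in n) / T
    var_n = mean_n2 - mean_n ** 2
    q = first_moment_q(columns)
    alpha = 0.0
    for j in range(L):
        mean_nj = sum(col[j] for col in columns) / T
        var_nj = sum(col[j] ** 2 for col in columns) / T - mean_nj ** 2
        v = var_nj / mean_n ** 2 - q[j] ** 2 * var_n / mean_n ** 2
        alpha += q[j] * second_moment_alpha(q[j], v, mean_n, mean_n2)
    return alpha

# Four 2-letter columns of depth 4: q = (0.5, 0.5) and the estimate is alpha_i = 2.
alpha0 = starting_alpha([[4, 0], [0, 4], [2, 2], [2, 2]])
```

As a check on the formula: for a fixed depth n = 10, q = 0.3 and α = 2, the Dirichlet-multinomial variance gives v = 0.084, and `second_moment_alpha(0.3, 0.084, 10, 100)` recovers α = 2.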


Inference

Issues:
• Jumping out of local optima via complete bin assignment (stochastic search)
• Refinement via partial bin assignment (non-stochastic search)
• Initialization strategies
• Number of components


Model Complexity

Total cost function (description length) = -log(Data likelihood) + Model complexity

Model complexity* ≈ Complexity of M−1 mixture parameters (Comp1) + Complexity of M Dirichlet components (Comp2) − Overcounted complexity (Comp3),

where

$$\mathrm{Comp}_1 = \frac{M-1}{2}\log\frac{T}{2} + \frac{1}{2}\log\pi - \log\Gamma(M/2) + o(1)$$

$$\mathrm{Comp}_2 = M\left[\,\cdots - \log\Gamma(L/2) - \cdots + \Delta_{L,n} + o(1)\right]$$

(the terms marked ⋯ are illegible in this copy; the full expression is in the reference below)

$$\mathrm{Comp}_3 = \log(M!)$$

As the number of components increases, the value of -log(Data likelihood) decreases, but the model complexity increases

*For details, see Yu and Altschul 2011. The complexity of the Dirichlet model for multiple alignment data. J. Comput. Biol., in press.


A (Toy) Inference Example

• Experiment:

Ground truth model Θ → artificial data $\vec{n}^{(1:T)}$ → inference algorithm → inferred model Θ′

for t = 1 to T
  draw $\vec{p}^{(t)} \sim f(\vec{p} \mid \Theta) = \sum_{i=1}^{M} m_i \cdot \mathrm{Dir}(\vec{p}; \vec{\alpha}_i)$
  draw $n^{(t)} \sim \mathrm{Poisson}(n; \lambda)$
  draw $\vec{n}^{(t)} \sim \mathrm{Multinomial}(\vec{n}; n^{(t)}, \vec{p}^{(t)})$
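The data-generating loop can be sketched with the Python standard library alone. The helper names are hypothetical, the Dirichlet draw uses normalized Gamma variates, and the Poisson draw uses Knuth's multiplication method, which is adequate for the moderate λ values used here. With extremely small αij (as in some entries of the ground-truth table) the Gamma draws can underflow to zero, so a production sampler would need more care.

```python
import math
import random

def sample_dirichlet(alpha, rng):
    """Draw p ~ Dir(alpha) by normalizing independent Gamma(alpha_j, 1) variates."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]

def sample_poisson(lam, rng):
    """Draw n ~ Poisson(lam) with Knuth's multiplication method."""
    limit = math.exp(-lam)
    k, product = 0, rng.random()
    while product > limit:
        k += 1
        product *= rng.random()
    return k

def generate_columns(m, alphas, lam, T, seed=0):
    """Generate T count vectors from the DM (m, alphas) with Poisson(lam) depths."""
    rng = random.Random(seed)
    L = len(alphas[0])
    columns = []
    for _ in range(T):
        i = rng.choices(range(len(m)), weights=m)[0]  # pick a mixture component
        p = sample_dirichlet(alphas[i], rng)          # p ~ Dir(alpha_i)
        n = sample_poisson(lam, rng)                  # column depth
        counts = [0] * L
        for j in rng.choices(range(L), weights=p, k=n):  # multinomial draw
            counts[j] += 1
        columns.append(counts)
    return columns

columns = generate_columns([0.5, 0.5], [[5.0, 1.0, 1.0], [1.0, 1.0, 5.0]],
                           lam=20, T=50, seed=1)
```

Choosing a component with probability mi and then drawing from that Dirichlet is equivalent to drawing from the mixture density f(p | Θ).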


A (Toy) Inference Example

Ground truth (mi and αij)

A 9-component DM:

m 0.13817700 0.13904300 0.08104890 0.06581460 0.02579240 0.24203400 0.09586050 0.13965400 0.07257600

A 0.39908600 0.01582840 0.04759630 0.09969660 0.00000145 0.55526200 0.11428400 0.27425900 0.13620300

C 0.05420540 0.01311490 0.01793070 0.00941176 0.00222160 0.05038680 0.00418655 0.07987990 0.05896460

D 0.04941170 0.01416740 0.00579683 0.00008857 0.00000133 0.50257200 0.78424300 0.02031100 0.05251870

E 0.02845140 0.01232240 0.01302770 0.11627800 0.00000128 0.90223900 0.43477100 0.03894260 0.06140270

F 0.02591200 0.00102215 0.23249400 0.02026550 0.22089800 0.09011690 0.01252450 0.10836100 0.70405100

G 0.25260300 0.04945070 0.01069620 0.09818010 0.00000204 0.26072500 0.24440200 0.03979830 0.07961690

H 0.02367120 0.00745678 0.00841542 0.10135500 0.00871764 0.19098700 0.06603250 0.02067700 0.21530800

I 0.02425040 0.00249989 0.42645700 0.05080130 0.00000390 0.16955000 0.01017580 0.87321800 0.17152900

K 0.05788260 0.00525776 0.00000126 0.78829500 0.00000133 0.75006000 0.15614500 0.05859190 0.08389390

L 0.05685540 0.00645843 1.12570000 0.13809100 0.02457390 0.29736400 0.01321530 0.52930500 0.36405600

M 0.02530400 0.00126565 0.22821000 0.04358250 0.00015563 0.12071100 0.00301977 0.12966800 0.11294100

N 0.11182400 0.00971496 0.01214120 0.12342400 0.00000131 0.37148600 0.36926200 0.02440750 0.09727970

P 0.12844600 0.02412470 0.02611130 0.09111190 0.00000127 0.28072300 0.09127260 0.05222050 0.05702610

Q 0.02867630 0.00535386 0.02659940 0.22541700 0.00000132 0.56156800 0.11887700 0.02838930 0.06834880

R 0.03932920 0.01008840 0.01855830 0.71342700 0.00037950 0.46791400 0.05369930 0.04072360 0.09469440

S 0.44507700 0.00187662 0.01232570 0.09276430 0.00000140 0.55988900 0.21704900 0.06296060 0.12344600

T 0.28037800 0.00629699 0.00000180 0.09193250 0.00151323 0.50655300 0.09415750 0.22034700 0.11868400

V 0.09074450 0.00442848 0.22871500 0.05469920 0.00000412 0.27808200 0.00937226 1.12861000 0.16444900

W 0.00733549 0.00450365 0.02189680 0.01274710 0.07318310 0.03643850 0.00486890 0.01703660 0.15407100

Y 0.02651190 0.00224834 0.02679920 0.03594690 0.17363700 0.12555800 0.03150400 0.05824380 0.83395000

αi 2.155955 0.197480 2.489474 2.907515 0.505300 7.078185 2.833062 3.805951 3.752434

From http://compbio.soe.ucsc.edu/dirichlets/byst-4.5-0-3.9comp


A (Toy) Inference Example

Figure adapted from Ye et al. 2011. On the Inference of Dirichlet Mixture Priors for Protein Sequence Comparison. J. Comput. Biol., in press.


A (Toy) Inference Example

     mi'/mi                αi'/αi                JS(q(i), q'(i))*
i    T = 2k,    T = 100k,  T = 2k,    T = 100k,  T = 2k,    T = 100k,
     λ = 20     λ = 80     λ = 20     λ = 80     λ = 20     λ = 80
1    1.030      1.007      1.009      0.996      0.00162    0.00002
2    1.337      1.000      0.939      0.998      0.00176    0.00004
3    0.838      0.994      1.028      0.985      0.02608    0.00013
4    1.165      0.993      0.823      0.985      0.00923    0.00006
5    0.884      0.998      0.742      1.003      0.00553    0.00006
6    0.421      1.010      1.963      0.997      0.02997    0.00022
7    0.997      0.994      1.165      0.998      0.00618    0.00006
8    1.003      0.996      1.206      0.996      0.00573    0.00023
9    1.196      1.013      1.491      1.005      0.03329    0.00109

Table adapted from Ye et al. 2011. On the Inference of Dirichlet Mixture Priors for Protein Sequence Comparison. J. Comput. Biol., in press.

* qj(i) = αij/αi, j = 1, …, L; i = 1, …, M , JS(·,·) denotes the Jensen-Shannon divergence
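For reference, the Jensen-Shannon divergence used in the last two columns can be computed as below. The slide does not state the logarithm base, so the natural logarithm here is an assumption, and the function name is hypothetical.

```python
import math

def js_divergence(p, q):
    """Jensen-Shannon divergence between two discrete distributions (natural log)."""
    def kl(a, b):
        # Terms with a_j = 0 contribute nothing to the Kullback-Leibler sum.
        return sum(x * math.log(x / y) for x, y in zip(a, b) if x > 0)
    mid = [(x + y) / 2 for x, y in zip(p, q)]
    return 0.5 * kl(p, mid) + 0.5 * kl(q, mid)

identical = js_divergence([0.2, 0.8], [0.2, 0.8])
disjoint = js_divergence([1.0, 0.0], [0.0, 1.0])
```

The divergence is 0 for identical distributions and reaches its maximum, ln 2, for distributions with disjoint support, so the values near 0 in the T = 100k column indicate that the inferred q'(i) almost coincide with the ground truth.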


Inference for Real Data

M    -log2(Data)   Complexity    DL           y
0    9.9605E+07    2.0571E+02    9.9605E+07   0.0000
1    7.5622E+07    2.1196E+02    7.5622E+07   1.0033
2    7.5017E+07    4.0291E+02    7.5018E+07   1.0285
3    7.4768E+07    5.8573E+02    7.4768E+07   1.0390
4    7.4654E+07    7.6324E+02    7.4654E+07   1.0437
5    7.4601E+07    9.3678E+02    7.4602E+07   1.0459
6    7.4562E+07    1.1071E+03    7.4563E+07   1.0476
7    7.4518E+07    1.2749E+03    7.4520E+07   1.0494
8    7.4486E+07    1.4404E+03    7.4488E+07   1.0507
9    7.4438E+07    1.6038E+03    7.4439E+07   1.0527
10   7.4394E+07    1.7656E+03    7.4396E+07   1.0546
11   7.4365E+07    1.9257E+03    7.4367E+07   1.0558
12   7.4343E+07    2.0844E+03    7.4345E+07   1.0567
13   7.4332E+07    2.2418E+03    7.4334E+07   1.0571
14   7.4300E+07    2.3980E+03    7.4303E+07   1.0584
15   7.4288E+07    2.5531E+03    7.4290E+07   1.0590
16   7.4275E+07    2.7070E+03    7.4278E+07   1.0595
17   7.4265E+07    2.8600E+03    7.4268E+07   1.0599
18   7.4253E+07    3.0121E+03    7.4256E+07   1.0604
19   7.4243E+07    3.1633E+03    7.4246E+07   1.0608
20   7.4234E+07    3.3137E+03    7.4237E+07   1.0612
21   7.4221E+07    3.4632E+03    7.4225E+07   1.0617
22   7.4214E+07    3.6120E+03    7.4218E+07   1.0620
23   7.4198E+07    3.7601E+03    7.4202E+07   1.0627
24   7.4193E+07    3.9075E+03    7.4197E+07   1.0629
25   7.4184E+07    4.0543E+03    7.4188E+07   1.0632
26   7.4178E+07    4.2004E+03    7.4182E+07   1.0635
27   7.4172E+07    4.3459E+03    7.4176E+07   1.0637
28   7.4169E+07    4.4908E+03    7.4174E+07   1.0638
29   7.4153E+07    4.6351E+03    7.4158E+07   1.0645
30   7.4152E+07    4.7789E+03    7.4156E+07   1.0646
31   7.4146E+07    4.9222E+03    7.4150E+07   1.0648
32   7.4142E+07    5.0649E+03    7.4147E+07   1.0650
33   7.4136E+07    5.2072E+03    7.4141E+07   1.0652
34   7.4131E+07    5.3490E+03    7.4136E+07   1.0654
35   7.4131E+07    5.4903E+03    7.4136E+07   1.0654
36   7.4131E+07    5.6312E+03    7.4136E+07   1.0654
37   7.4131E+07    5.7716E+03    7.4136E+07   1.0654
38   7.4131E+07    5.9116E+03    7.4136E+07   1.0654

Figure adapted from Ye et al. 2011. On the Inference of Dirichlet Mixture Priors for Protein Sequence Comparison. J. Comput. Biol., in press. The data set can be downloaded from UCSC; it contains T = 314585 columns, with $\bar{n}$ = 75.99.


Adjustment

• A DM, constructed from a curated set of data with a specific letter (e.g. amino acid) composition, will imply this same set of background frequencies ($q_j = \sum_{i=1}^{M} m_i \alpha_{ij}/\alpha_i,\ j = 1,\ldots,L$). One may, however, wish to apply Bayesian analysis to a set of sequences with markedly different letter composition
• It would be impractical to derive a new set of priors for such sequences, but non-optimal to use a prior that implies a different composition
• We have therefore sought to adjust a given set of DM priors to a non-standard composition. We define this as constructing a DM prior that implies the desired composition, but that is "close" to the original prior
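The background frequencies implied by a DM follow directly from its parameters: each component contributes its mean αij/αi, weighted by mi. A minimal sketch, with a hypothetical function name and a made-up 3-component, 3-letter mixture:

```python
def implied_frequencies(m, alphas):
    """q_j = sum_i m_i * alpha_ij / alpha_i, where alpha_i = sum_j alpha_ij."""
    L = len(alphas[0])
    q = [0.0] * L
    for m_i, alpha in zip(m, alphas):
        a = sum(alpha)
        for j in range(L):
            q[j] += m_i * alpha[j] / a
    return q

# Illustrative values only; not taken from any published prior.
q = implied_frequencies([0.5, 0.3, 0.2],
                        [[2.0, 1.0, 1.0], [1.0, 3.0, 1.0], [1.0, 1.0, 4.0]])
```

Because the mi sum to 1 and each component mean sums to 1, the implied frequencies always sum to 1.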


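Adjustment Model

• Implied frequencies: $q_j = \sum_{i=1}^{M} m_i \dfrac{\alpha_{ij}}{\alpha_i},\ j = 1,\ldots,L$. Target frequencies: $q_j' = \dfrac{\#\text{ of letter } j}{\#\text{ of all letters}},\ j = 1,\ldots,L$

• Ideal formulation (difficult)

Minimize over Θ′:

$$G(\Theta'; \Theta) = \int_{\vec{p}} f(\vec{p} \mid \Theta) \ln\frac{f(\vec{p} \mid \Theta)}{f(\vec{p} \mid \Theta')}\, d\vec{p}$$

Subject to

$$\sum_{i=1}^{M} m_i' \frac{\alpha_{ij}'}{\alpha_i'} = q_j', \quad j = 1,\ldots,L$$

• A practical formulation (easy)

Minimize over Θ′:

$$F(\Theta'; \Theta) = \sum_{i=1}^{M} m_i \int_{\vec{p}} \mathrm{Dir}(\vec{p}; \vec{\alpha}_i) \ln\frac{\mathrm{Dir}(\vec{p}; \vec{\alpha}_i)}{\mathrm{Dir}(\vec{p}; \vec{\alpha}_i')}\, d\vec{p}$$

Subject to

$$m_i' = m_i,\ i = 1,\ldots,M; \qquad \alpha_i' = \alpha_i,\ i = 1,\ldots,M; \qquad \sum_{i=1}^{M} m_i \frac{\alpha_{ij}'}{\alpha_i} = q_j',\ j = 1,\ldots,L$$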


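Local Adjustment Model

• Objective function in practical formulation

$$F(\Theta'; \Theta) = \sum_{i=1}^{M} m_i \int_{\vec{p}} \mathrm{Dir}(\vec{p}; \vec{\alpha}_i) \ln\frac{\mathrm{Dir}(\vec{p}; \vec{\alpha}_i)}{\mathrm{Dir}(\vec{p}; \vec{\alpha}_i')}\, d\vec{p}$$

$$= \sum_{i=1}^{M} m_i \sum_{j=1}^{L} \Big( \ln\Gamma(\alpha_{ij}') - \ln\Gamma(\alpha_{ij}) - \psi(\alpha_{ij})(\alpha_{ij}' - \alpha_{ij}) \Big)$$

$$= \sum_{i=1}^{M} m_i \sum_{j=1}^{L} \Big( \tfrac{1}{2}\psi'(\alpha_{ij})(\alpha_{ij}' - \alpha_{ij})^2 + O\big((\alpha_{ij}' - \alpha_{ij})^3\big) \Big)$$

• As α′ij is close to αij

$$F(\Theta'; \Theta) \approx \frac{1}{2}\sum_{i=1}^{M} m_i \sum_{j=1}^{L} \psi'(\alpha_{ij})(\alpha_{ij}' - \alpha_{ij})^2 = \frac{1}{2}\sum_{i=1}^{M}\sum_{j=1}^{L} R_{ij}(\alpha_{ij}' - \alpha_{ij})^2 = \frac{1}{2}\sum_{i=1}^{M} (\vec{\alpha}_i' - \vec{\alpha}_i)^T R_i (\vec{\alpha}_i' - \vec{\alpha}_i),$$

where $R_{ij} = m_i\,\psi'(\alpha_{ij})$ and $R_i = \mathrm{diag}\{R_{i,1},\ldots,R_{i,L}\}$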


Successive Local Adjustment

• Intermediate targets

$$\vec{q}^{(t)} = t(\vec{q}' - \vec{q})/N + \vec{q}, \quad t = 1,\ldots,N$$

$$\Delta\vec{q} = \vec{q}^{(t)} - \vec{q}^{(t-1)} = (\vec{q}' - \vec{q})/N$$

The path starts at $\vec{\alpha}_i$, with $\sum_{i=1}^{M} m_i \alpha_{ij}/\alpha_i = q_j$, and ends at $\vec{\alpha}_i^{(N)}$, with $\sum_{i=1}^{M} m_i \alpha_{ij}^{(N)}/\alpha_i = q_j'$.

• Lagrange-Newton method

$$\vec{\alpha}_i^{(t)} = \vec{\alpha}_i^{(t-1)} + \frac{m_i}{\alpha_i} S_i^{(t-1)} B^{(t-1)} \Delta\vec{q},$$

$$S_i^{(t-1)} = (R_i^{(t-1)})^{-1}\left(I - \frac{\vec{1}\,\vec{1}^T (R_i^{(t-1)})^{-1}}{\vec{1}^T (R_i^{(t-1)})^{-1}\,\vec{1}}\right),$$

$$B^{(t-1)} = (A^{(t-1)})^{+}, \quad A^{(t-1)} = \sum_{i=1}^{M} \frac{m_i^2}{\alpha_i^2} S_i^{(t-1)}$$

As N → ∞, $\vec{\alpha}_i^{(N)} \to \vec{\alpha}_i^{(\infty)}$, i = 1, …, M

We have proved* that, if one performs infinitesimal background frequency changes along different paths, the same solution $\vec{\alpha}_i^{(\infty)}$ will be reached.

*For proof, see Ye et al. 2010. Compositional Adjustment of Dirichlet Mixture Priors. J. Comput. Biol., DOI: 10.1089/cmb.2010.0117
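The successive local adjustment can be sketched with NumPy. This is an illustration of the update equations above, not the authors' code: the function names are hypothetical, `trigamma` is a small hand-written approximation of ψ′ (recurrence plus asymptotic series), and the pseudo-inverse is taken with an explicit cutoff because A is singular along the all-ones direction.

```python
import numpy as np

def trigamma(x):
    """psi'(x) for x > 0, via the recurrence psi'(x) = psi'(x+1) + 1/x^2
    followed by an asymptotic series once x >= 6."""
    x = float(x)
    total = 0.0
    while x < 6.0:
        total += 1.0 / (x * x)
        x += 1.0
    inv = 1.0 / x
    inv2 = inv * inv
    total += inv + 0.5 * inv2
    total += inv * inv2 * (1 / 6 - inv2 * (1 / 30 - inv2 * (1 / 42 - inv2 / 30)))
    return total

def adjust(m, alpha, q_target, n_steps=1000):
    """Move a DM toward target frequencies in n_steps Lagrange-Newton steps,
    keeping every m_i and every alpha_i (row sum) fixed."""
    m = np.asarray(m, dtype=float)
    alpha = np.array(alpha, dtype=float)       # shape (M, L), copied
    c = m / alpha.sum(axis=1)                  # m_i / alpha_i
    dq = (np.asarray(q_target, dtype=float) - c @ alpha) / n_steps
    n_comp = alpha.shape[0]
    for _ in range(n_steps):
        if (alpha <= 0).any():
            raise ValueError("adjustment left the positive orthant")
        S = []
        for i in range(n_comp):
            r_inv = 1.0 / (m[i] * np.array([trigamma(a) for a in alpha[i]]))
            # S_i = R^-1 - R^-1 1 1^T R^-1 / (1^T R^-1 1), with R diagonal
            S.append(np.diag(r_inv) - np.outer(r_inv, r_inv) / r_inv.sum())
        A = sum(c[i] ** 2 * S[i] for i in range(n_comp))
        lam = np.linalg.pinv(A, rcond=1e-10) @ dq   # B dq
        for i in range(n_comp):
            alpha[i] += c[i] * (S[i] @ lam)
    return alpha

m = [0.5, 0.3, 0.2]
alpha = [[2.0, 1.0, 1.0], [1.0, 3.0, 1.0], [1.0, 1.0, 4.0]]
new_alpha = adjust(m, alpha, [0.4, 0.3, 0.3], n_steps=200)
```

Since the frequency constraint is linear in the αij once the αi are fixed, the adjusted prior reproduces the target frequencies for any N; the number of steps only controls how closely the path follows the locally optimal direction.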


A Demo of Adjusting a 3-component 3-letter DM

$$\vec{q} = [q_1, q_2, q_3]^T, \quad \vec{x} = [x_1, x_2]^T, \quad \vec{x} = q_1\vec{x}^{(1)} + q_2\vec{x}^{(2)} + q_3\vec{x}^{(3)}$$

[Figures: a series of plots, each labeled with the points $\vec{x}^{(1)}$, $\vec{x}^{(2)}$, and $\vec{x}^{(3)}$]


A Real Case of Adjusting a 20-component 20-letter DM

Table from Ye et al. 2010. Compositional Adjustment of Dirichlet Mixture Priors. J. Comput. Biol., DOI: 10.1089/cmb.2010.0117


A Real Case of Adjusting a 20-component 20-letter DM

• The 20-component DM is “recode4” from http://compbio.soe.ucsc.edu/dirichlets/recode4.20comp
• The target frequencies are from a set of 53 Api-AP2 proteins from Toxoplasma gondii (Altschul et al., 2010)
• Adjustment with N = 1000 steps takes 0.28 sec!

[Plot: max. relative error (%) versus N, for N from 0 to 1000]

$$r_m(N) = \max_{i,j} \frac{|\alpha_{ij}^{(N)} - \alpha_{ij}^{(10000)}|}{\alpha_{ij}^{(10000)}}, \qquad N^{*}_{0.01} = \min\{N : r_m(N) < 0.01\}, \qquad N^{*}_{0.01} = 146$$

Figure from Ye et al. 2010. Compositional Adjustment of Dirichlet Mixture Priors. J. Comput. Biol., DOI: 10.1089/cmb.2010.0117


How large N is

• Experiment

We take the baseline to be “recode4”. We generate 1000 sets of “desired” background frequencies $\vec{q}'$ centered on $\vec{q}$ by sampling from a single Dirichlet distribution $\mathrm{Dir}(\vec{p}; 75\vec{q})$. For each $\vec{q}'$, we calculate $N^{*}_{0.01}$. We sort the $\vec{q}'$ into bins according to the quantity $A = 0.05\sum_{j=1}^{20} |\log_2(q_j'/q_j)|$. For each bin, dots represent the observed mean value of $N^{*}_{0.01}$, with error bars showing one standard deviation for this estimate. Triangles represent the 90th percentile for values of $N^{*}_{0.01}$ within each bin. The particular case studied in Table 2 and the previous figure is shown by an “x”.

[Plot: $N^{*}_{0.01}$ (0 to 1400) versus $A = \frac{1}{20}\sum_{j=1}^{20} |\log_2(q_j'/q_j)|$ (0.4 to 1)]

Figure from Ye et al. 2010. Compositional Adjustment of Dirichlet Mixture Priors. J. Comput. Biol., DOI: 10.1089/cmb.2010.0117


Pattern Search

• Objective: given a set of protein sequences, find significant patterns using the Dirichlet mixture prior (trained from gold-standard data and adjusted according to the letter composition of the target data set)

• Search method: Gibbs sampling procedure based on the position posteriors and shift posteriors


Pattern Search

Gibbs sampling based optimization procedure (basic form):
• Step 0. Start from a position vector $\vec{t}^{(0)}$ to form a pattern window X(0) of length w.
• Step 1. Given the current pattern window represented by X(k) = ($\vec{t}^{(k)}$, w), pick a sequence i and compute the position posteriors of all possible ti.
• Step 2. Draw $t_i^{(k+1)}$ according to the position posteriors. Set $t_{i'}^{(k+1)} = t_{i'}^{(k)}$ for i' ≠ i.
• Go to Step 1.
(Occasionally, after Step 1 and Step 2, one may shift the whole pattern window according to the shift posteriors)


Pattern Search

• Back to the sequence data given at the beginning

[Plots for w = 5, w = 8, and w = 10: pattern score versus iteration (0 to about 3000), showing the score and the best score so far]


Pattern Search

For example, three patterns found with window lengths 5, 8, and 10


Conclusion

• The Dirichlet mixture model is a powerful tool for mining protein sequence data

• The inference of a DM from data is challenging, but we have designed a better inference procedure

• A given DM prior might imply marginal letter probabilities that are different from those implied by target sequence data, but we have designed an adjustment procedure

• A DM that is trained from gold-standard alignment data can be used to search for patterns in newly acquired sequence data. We have designed Gibbs-sampling based search procedures that work very well

• The impact of the work is that a considerable amount of manual labor could be saved for biologists and biomedical researchers


Questions

Thanks

