On the Dirichlet Mixture Model for Mining Protein Sequence Data

Xugang Ye¹, Stephen Altschul¹

¹National Center for Biotechnology Information

ICML 2011, Bellevue (WA), USA

Page 2: Background

Biologists need to find, from raw data like this (figure: raw sequence data)

Page 3: Background

information like this (figure: aligned sequences).

• The result is called a multiple sequence alignment (MSA).
• MSAs are often used for assessing sequence conservation of protein domains, inferring biological functions, reconstructing ancestries, and eventually drug discovery.

Page 4: Background

• Fundamental problem: identification of important regions of DNA or protein molecules via multiple sequence alignment (MSA).
• Challenges:
    - The MSA problem is in general intractable.
    - Given a similarity measure, no polynomial-time algorithm is known for solving the MSA problem.
    - There is still no consensus about the best similarity measure.
    - There is still no satisfactory MSA program that can significantly reduce the manual labor.
• New perspective: a machine learning approach based on Bayesian analysis.

Page 5: Background

(Flow diagram. Boxes: gold-standard MSA data, statistical model, new sequence data, new patterns, MSA programs, manual efforts, new gold-standard MSA data. Steps: training, adjusting, searching, refining.)

Page 6: Background

• Model the gold-standard MSA data
• Infer the model
• Adjust the model
• Use the model to search for patterns
• Refine the patterns using MSA programs
• Incorporate the new discoveries into the libraries

Page 7: Dirichlet Mixture Model

Multiple sequence alignments → independent columns $\mathrm{Col}^{(1)}, \mathrm{Col}^{(2)}, \ldots, \mathrm{Col}^{(T)}$ → count vectors $\vec{n}^{(1)}, \vec{n}^{(2)}, \ldots, \vec{n}^{(T)}$.

Hierarchical model:

• Hyperparameters: $m_i$, $i = 1, 2, \ldots, M$; $\alpha_{ij}$, $i = 1, 2, \ldots, M$, $j = 1, 2, \ldots, L$
• Dirichlet mixture (DM) prior:

$$f(\vec{p} \mid \Theta) = \sum_{i=1}^{M} m_i \,\mathrm{Dir}(\vec{p}; \vec{\alpha}_i) = \sum_{i=1}^{M} m_i \cdot \frac{1}{\mathrm{B}(\alpha_{i1}, \ldots, \alpha_{iL})} \prod_{j=1}^{L} p_j^{\alpha_{ij}-1}$$

• Multinomial parameters: $p_j$, $j = 1, 2, \ldots, L$
• Count vector: $n_j$, $j = 1, 2, \ldots, L$, with

$$\mathrm{P}(\mathrm{Col}^{(t)} \mid \vec{p}) = \prod_{j=1}^{L} p_j^{n_j^{(t)}}$$

Why hierarchical? Because a cluster of columns is not governed by a fixed $\vec{p}$. It is governed by a random $\vec{p}$ whose distribution is parameterized by a set of higher-level parameters.

Page 8: Theoretical Background

• What is a DM

$$f(\vec{p} \mid \Theta) = \sum_{i=1}^{M} m_i \frac{\Gamma(\alpha_i)}{\prod_{j=1}^{L} \Gamma(\alpha_{ij})} \prod_{j=1}^{L} p_j^{\alpha_{ij}-1}, \qquad \alpha_i = \sum_{j=1}^{L} \alpha_{ij},$$

$$p_j > 0,\ j = 1, \ldots, L; \qquad \sum_{j=1}^{L} p_j = 1,$$

$$\Theta = \Big\{ m_i > 0 : i = 1, \ldots, M;\ \sum_{i=1}^{M} m_i = 1 \Big\} \cup \big\{ \alpha_{ij} > 0 : i = 1, \ldots, M;\ j = 1, \ldots, L \big\}$$

• Conjugate priors

$$f(\vec{p} \mid \Theta; \vec{n}) = \sum_{i=1}^{M} m_i' \frac{\Gamma(\alpha_i + n)}{\prod_{j=1}^{L} \Gamma(\alpha_{ij} + n_j)} \prod_{j=1}^{L} p_j^{\alpha_{ij} + n_j - 1}, \qquad n = \sum_{j=1}^{L} n_j,$$

$$m_i' = \frac{m_i \dfrac{\Gamma(\alpha_i)}{\Gamma(\alpha_i + n)} \prod_{j=1}^{L} \dfrac{\Gamma(\alpha_{ij} + n_j)}{\Gamma(\alpha_{ij})}}{\sum_{i'=1}^{M} m_{i'} \dfrac{\Gamma(\alpha_{i'})}{\Gamma(\alpha_{i'} + n)} \prod_{j=1}^{L} \dfrac{\Gamma(\alpha_{i'j} + n_j)}{\Gamma(\alpha_{i'j})}}, \qquad i = 1, \ldots, M,$$

where $\vec{n} = (n_1, n_2, \ldots, n_L)^{\mathsf T}$ and $n_j$ is the number of occurrences of the $j$-th letter in a given column.

Page 9: Theoretical Background

• Bayesian classifier (for clustering independent columns)

A column Col has count vector $\vec{n} = (n_1, n_2, \ldots, n_L)^{\mathsf T}$.

Total probability:

$$\mathrm{P}(\mathrm{Col} \mid \Theta) = \sum_{i=1}^{M} m_i \cdot \mathrm{P}(\mathrm{Col} \mid i; \Theta)$$

Likelihood:

$$\mathrm{P}(\mathrm{Col} \mid i; \Theta) = \frac{\Gamma(\alpha_i)}{\Gamma(\alpha_i + n)} \prod_{j=1}^{L} \frac{\Gamma(\alpha_{ij} + n_j)}{\Gamma(\alpha_{ij})}$$

Posterior:

$$\mathrm{P}(i \mid \mathrm{Col}; \Theta) = \frac{m_i \cdot \mathrm{P}(\mathrm{Col} \mid i; \Theta)}{\mathrm{P}(\mathrm{Col} \mid \Theta)}$$
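The total probability, likelihood, and posterior of the Bayesian classifier can be sketched in a few lines. This is a minimal illustration, not the authors' code: the function names and the 2-component, 3-letter mixture are hypothetical, and the sums are evaluated in log space to avoid underflow for large counts. The posterior returned here is the same quantity as the updated mixture weight $m_i'$ in the conjugate-prior formula.

```python
import math

def log_col_given_component(n, alpha):
    """ln P(Col | i; Theta) for count vector n and Dirichlet parameters alpha_i."""
    a_sum, n_sum = sum(alpha), sum(n)
    out = math.lgamma(a_sum) - math.lgamma(a_sum + n_sum)
    for a, c in zip(alpha, n):
        out += math.lgamma(a + c) - math.lgamma(a)
    return out

def component_posteriors(n, m, alphas):
    """Return (ln P(Col | Theta), [P(i | Col; Theta) for each component i])."""
    logs = [math.log(mi) + log_col_given_component(n, a)
            for mi, a in zip(m, alphas)]
    top = max(logs)  # log-sum-exp for numerical stability
    log_total = top + math.log(sum(math.exp(v - top) for v in logs))
    return log_total, [math.exp(v - log_total) for v in logs]

# Hypothetical mixture (illustrative values only, not from the slides).
m = [0.6, 0.4]
alphas = [[5.0, 0.5, 0.5], [0.5, 0.5, 5.0]]
log_total, post = component_posteriors([7, 1, 0], m, alphas)
```

A column dominated by the first letter is assigned mostly to the first component here, because that component concentrates its mass on the first letter.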

Page 10: Theoretical Background

• Log-odds ratio (for measuring the similarity between two columns)

Columns $\mathrm{Col}^{(1)}$ and $\mathrm{Col}^{(2)}$ have count vectors $\vec{n}^{(1)} = (n_1^{(1)}, \ldots, n_L^{(1)})^{\mathsf T}$ and $\vec{n}^{(2)} = (n_1^{(2)}, \ldots, n_L^{(2)})^{\mathsf T}$. Merging $\mathrm{Col}^{(1)}$ and $\mathrm{Col}^{(2)}$ gives $\mathrm{Col}^{(12)}$ with count vector $\vec{n}^{(12)} = (n_1^{(1)} + n_1^{(2)}, \ldots, n_L^{(1)} + n_L^{(2)})^{\mathsf T}$.

$$\ln \frac{\mathrm{P}(\mathrm{Col}^{(1)} \mid \mathrm{Col}^{(2)}; \Theta)}{\mathrm{P}(\mathrm{Col}^{(1)} \mid \Theta)} = \ln \frac{\mathrm{P}(\mathrm{Col}^{(1)}, \mathrm{Col}^{(2)} \mid \Theta)}{\mathrm{P}(\mathrm{Col}^{(1)} \mid \Theta)\,\mathrm{P}(\mathrm{Col}^{(2)} \mid \Theta)} = \ln \frac{\mathrm{P}(\mathrm{Col}^{(12)} \mid \Theta)}{\mathrm{P}(\mathrm{Col}^{(1)} \mid \Theta)\,\mathrm{P}(\mathrm{Col}^{(2)} \mid \Theta)} \qquad \text{(BILD score)}$$

*For details, see Altschul et al. 2010. The Construction and Use of Log-Odds Substitution Scores for Multiple Sequence Alignment. PLoS Comput. Biol., 6(7), e1000852. Also see Ye et al. 2011. An Assessment of Substitution Scores for Protein Profile-Profile Comparison. Bioinformatics, in press.
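The log-odds ratio for merging two columns needs only $\mathrm{P}(\mathrm{Col} \mid \Theta)$ evaluated three times. The sketch below assumes a small hypothetical mixture and hypothetical function names; it is meant to show the shape of the computation rather than reproduce any published implementation.

```python
import math

def log_p_col(n, m, alphas):
    """ln P(Col | Theta) under a Dirichlet mixture, for count vector n."""
    logs = []
    for mi, alpha in zip(m, alphas):
        a_sum, n_sum = sum(alpha), sum(n)
        v = math.log(mi) + math.lgamma(a_sum) - math.lgamma(a_sum + n_sum)
        v += sum(math.lgamma(a + c) - math.lgamma(a) for a, c in zip(alpha, n))
        logs.append(v)
    top = max(logs)
    return top + math.log(sum(math.exp(v - top) for v in logs))

def merge_log_odds(n1, n2, m, alphas):
    """ln [ P(Col12 | Theta) / (P(Col1 | Theta) P(Col2 | Theta)) ]."""
    merged = [a + b for a, b in zip(n1, n2)]
    return (log_p_col(merged, m, alphas)
            - log_p_col(n1, m, alphas)
            - log_p_col(n2, m, alphas))

# Hypothetical 2-component, 3-letter mixture (illustrative values only).
m = [0.6, 0.4]
alphas = [[5.0, 0.5, 0.5], [0.5, 0.5, 5.0]]
similar = merge_log_odds([6, 0, 0], [5, 1, 0], m, alphas)     # like columns
dissimilar = merge_log_odds([6, 0, 0], [0, 0, 6], m, alphas)  # unlike columns
```

With these values the score is positive for the two columns that share a dominant letter and negative for the pair with different dominant letters, which is the behaviour a similarity measure should have.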

Page 11: Theoretical Background

• Pattern score (for measuring the pattern significance)

Pattern window:

$$X = (\vec{x}^{(1)}, \vec{x}^{(2)}, \ldots, \vec{x}^{(w)})$$

Probability:

$$\mathrm{P}(X \mid \Theta) = \prod_{\tau=1}^{w} \mathrm{P}(\vec{x}^{(\tau)} \mid \Theta)$$

Odds-ratio:

$$R(X \mid \Theta) = \prod_{\tau=1}^{w} \frac{\mathrm{P}(\vec{x}^{(\tau)} \mid \Theta)}{\prod_{j=1}^{L} (p_j^{b})^{n_j^{(\tau)}}},$$

where $p_j^{b}$ is the background (marginal) probability of letter $j$.

Log odds-ratio:

$$\ln R(X \mid \Theta) = \sum_{\tau=1}^{w} \ln \frac{\mathrm{P}(\vec{x}^{(\tau)} \mid \Theta)}{\prod_{j=1}^{L} (p_j^{b})^{n_j^{(\tau)}}} = \sum_{\tau=1}^{w} \Big[ \ln \mathrm{P}(\vec{x}^{(\tau)} \mid \Theta) - \sum_{j=1}^{L} n_j^{(\tau)} \ln p_j^{b} \Big] \qquad \text{(BILD score)}$$
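Given the per-column log-probabilities, the pattern log odds-ratio is a simple sum. In this sketch the column log-probabilities are passed in as plain numbers (how they are computed is left to the caller), and the background frequencies and counts are made-up illustrative values.

```python
import math

def pattern_log_odds(log_p_columns, counts, background):
    """ln R(X | Theta) = sum_tau [ ln P(x^(tau) | Theta) - sum_j n_j^(tau) ln p_j^b ]."""
    score = 0.0
    for log_p, n in zip(log_p_columns, counts):
        # Log-probability of the same column under the background model.
        log_bg = sum(c * math.log(b) for c, b in zip(n, background) if c > 0)
        score += log_p - log_bg
    return score

# Hypothetical inputs: 3-letter alphabet, window of w = 2 columns.
background = [0.5, 0.3, 0.2]
counts = [[4, 0, 0], [0, 3, 1]]
log_p_columns = [math.log(0.2), math.log(0.05)]  # assumed ln P(x^(tau) | Theta)
score = pattern_log_odds(log_p_columns, counts, background)
```

A column whose model probability equals its background probability contributes exactly zero, so the score measures only the evidence beyond background composition.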

Page 12: Theoretical Background

• Position posteriors (for measuring the aligning likelihood of a position in a sequence, given a pattern)

Given pattern: $X = (\vec{x}^{(1)}, \vec{x}^{(2)}, \ldots, \vec{x}^{(w)})$. Position $t$ in a sequence carries the letters $y_1^{(t)}, y_2^{(t)}, \ldots, y_w^{(t)}$.

$p_t$ = P("position $t$ is aligned to the first column of the given pattern")

$$p_t = \frac{\mathrm{P}(t \mid X; \Theta)}{\sum_{t'} \mathrm{P}(t' \mid X; \Theta)} = \frac{\mathrm{P}(y_1^{(t)}, \ldots, y_w^{(t)} \mid X; \Theta)}{\sum_{t'} \mathrm{P}(y_1^{(t')}, \ldots, y_w^{(t')} \mid X; \Theta)},$$

where, under the independence assumption,

$$\mathrm{P}(y_1^{(t)}, \ldots, y_w^{(t)} \mid X; \Theta) = \frac{\mathrm{P}(y_1^{(t)}, \ldots, y_w^{(t)}, X \mid \Theta)}{\mathrm{P}(X \mid \Theta)} = \prod_{\tau=1}^{w} \frac{\mathrm{P}(y_\tau^{(t)}, \vec{x}^{(\tau)} \mid \Theta)}{\mathrm{P}(\vec{x}^{(\tau)} \mid \Theta)}.$$

* One can also normalize the ratio

$$\frac{\mathrm{P}(y_1^{(t)}, \ldots, y_w^{(t)} \mid X; \Theta)}{\mathrm{P}(y_1^{(t)}, \ldots, y_w^{(t)} \mid \Theta)} = \prod_{\tau=1}^{w} \frac{\mathrm{P}(y_\tau^{(t)}, \vec{x}^{(\tau)} \mid \Theta)}{\mathrm{P}(y_\tau^{(t)} \mid \Theta)\,\mathrm{P}(\vec{x}^{(\tau)} \mid \Theta)}.$$

Page 13: Theoretical Background

• Shift posteriors (for measuring the shifting likelihood of a pattern)

Staying likelihood, with $X_0 = (\vec{x}^{(1)}, \vec{x}^{(2)}, \ldots, \vec{x}^{(w)})$:

$$\mathrm{P}(X_0 \mid \Theta) = \prod_{\tau=1}^{w} \mathrm{P}(\vec{x}^{(\tau)} \mid \Theta)$$

Shifting likelihoods, with $X_{-1} = (\vec{x}^{(0)}, \vec{x}^{(1)}, \ldots, \vec{x}^{(w-1)})$ and $X_{1} = (\vec{x}^{(2)}, \ldots, \vec{x}^{(w)}, \vec{x}^{(w+1)})$:

$$\mathrm{P}(X_{-t} \mid \Theta) = \mathrm{P}(X_{-t+1} \mid \Theta) \cdot \frac{\mathrm{P}(\vec{x}^{(-t+1)} \mid \Theta)}{\mathrm{P}(\vec{x}^{(-t+w+1)} \mid \Theta)}$$

$$\mathrm{P}(X_{t+1} \mid \Theta) = \mathrm{P}(X_{t} \mid \Theta) \cdot \frac{\mathrm{P}(\vec{x}^{(t+w+1)} \mid \Theta)}{\mathrm{P}(\vec{x}^{(t+1)} \mid \Theta)}$$

$p_t$ = P("shifting the current pattern window $t$ columns")

$$p_t = \frac{\mathrm{P}(X_t \mid \Theta)}{\sum_{t'} \mathrm{P}(X_{t'} \mid \Theta)}$$

* One can also normalize the ratio (the normalized expression on this slide is not recoverable from the extracted text).

Page 14: Inference

• Goal: finding a maximum-likelihood prior for describing a set of independent columns

$$\Theta^{*} = \arg\max_{\Theta} \sum_{t=1}^{T} \ln \mathrm{P}(\mathrm{Col}^{(t)} \mid \Theta) = \arg\max_{\Theta} \sum_{t=1}^{T} \ln \sum_{i=1}^{M} m_i \cdot \frac{\Gamma(\alpha_i)}{\Gamma(\alpha_i + n^{(t)})} \prod_{j=1}^{L} \frac{\Gamma(\alpha_{ij} + n_j^{(t)})}{\Gamma(\alpha_{ij})}$$

• Challenge: this (constrained) optimization problem is in general intractable
    1) Nonconcave
    2) High dimensional ($ML + M - 1$)
    3) Large size of data ($T$)
• Previous researchers have developed some good heuristic optimization procedures
• We have designed a better one

Page 15: Inference

Gibbs sampling based optimization procedure (basic form):

• Step 0. Start from an M-component DM $\Theta^{(0)}$.
• Step 1. Given the current M-component DM $\Theta^{(k)}$, create M empty bins. For each $t$, compute the posterior $\mathrm{P}(i \mid \mathrm{Col}^{(t)}; \Theta^{(k)})$ for each $i$. For each $t$, assign $\mathrm{Col}^{(t)}$ to the $i$-th bin according to $\mathrm{P}(i \mid \mathrm{Col}^{(t)}; \Theta^{(k)})$.
• Step 2. Update mixture parameters: $m_i' = T_i / T$, where $T_i$ is the size of bin $i$.
• Step 3. For each bin $i$, solve the optimization problem*

$$\vec{\alpha}_i^{(k+1)} = \arg\max_{\vec{\alpha}_i} \sum_{t=1}^{T_i} \ln \left[ \frac{\Gamma(\alpha_i)}{\Gamma(\alpha_i + n^{(t)})} \prod_{j=1}^{L} \frac{\Gamma(\alpha_{ij} + n_j^{(t)})}{\Gamma(\alpha_{ij})} \right]$$

• Go to Step 1.

* Variational method:

$$\sum_{t=1}^{T} \ln \mathrm{P}(\mathrm{Col}^{(t)} \mid \Theta) = \sum_{t=1}^{T} \ln \sum_{i=1}^{M} m_i \,\mathrm{P}(\mathrm{Col}^{(t)} \mid i; \Theta) \ \ge\ \sum_{t=1}^{T} \sum_{i=1}^{M} m_i \ln \mathrm{P}(\mathrm{Col}^{(t)} \mid i; \Theta) = \sum_{i=1}^{M} m_i \sum_{t=1}^{T} \ln \mathrm{P}(\mathrm{Col}^{(t)} \mid i; \Theta)$$
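Steps 1 and 2 of the procedure above (stochastic bin assignment followed by the mixture-weight update) can be sketched as follows. The posteriors are taken as given, one row per column; the function names and the tiny posterior matrix are hypothetical, and Step 3 (the per-bin optimization) is not shown.

```python
import random

def assign_bins(posteriors, rng):
    """Step 1: draw a bin index for each column from P(i | Col^(t); Theta^(k))."""
    return [rng.choices(range(len(p)), weights=p, k=1)[0] for p in posteriors]

def update_mixture(bins, M):
    """Step 2: m_i' = T_i / T, where T_i is the number of columns in bin i."""
    T = len(bins)
    return [bins.count(i) / T for i in range(M)]

# Hypothetical posteriors for T = 4 columns and M = 2 components.
rng = random.Random(0)
posteriors = [[0.9, 0.1], [0.2, 0.8], [1.0, 0.0], [0.5, 0.5]]
bins = assign_bins(posteriors, rng)
m_new = update_mixture(bins, 2)
```

Note that a bin can come out empty, in which case this sketch assigns it weight zero; how an empty component should be handled is a design decision the slides do not specify.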

Page 16: Inference

Dimension reduction via first moment estimation

• $n_j \mid n, p_j \sim \mathrm{Binomial}(n, p_j)$
• $p_j \mid \alpha_{i1}, \ldots, \alpha_{iL} \sim \mathrm{Beta}(\alpha_{ij}, \alpha_i - \alpha_{ij})$
• $\mathrm{E}(n_j) = \mathrm{E}(\mathrm{E}(n_j \mid n, p_j)) = \mathrm{E}(n p_j) = \mathrm{E}(n)\mathrm{E}(p_j) = \mathrm{E}(n) \cdot \alpha_{ij}/\alpha_i \ \Rightarrow\ \alpha_{ij} = \mathrm{E}(n_j)/\mathrm{E}(n) \cdot \alpha_i$
• Replacing $q_{ij} = \mathrm{E}(n_j)/\mathrm{E}(n)$ with its estimate $\hat{q}_{ij} = \hat{\mathrm{E}}(n_j)/\hat{\mathrm{E}}(n)$ and plugging $\alpha_{ij} = \hat{q}_{ij}\alpha_i$ into the objective function in Step 3 reduces the dimension from $L$ to 1:

$$\alpha_i^{(k+1)} = \arg\max_{\alpha_i} \sum_{t=1}^{T_i} \ln \left[ \frac{\Gamma(\alpha_i)}{\Gamma(\alpha_i + n^{(t)})} \prod_{j=1}^{L} \frac{\Gamma(\hat{q}_{ij}\alpha_i + n_j^{(t)})}{\Gamma(\hat{q}_{ij}\alpha_i)} \right]$$

Second moment estimate as the starting point for Newton's method:

$$\hat{\alpha}_i = \frac{1}{L} \sum_{j=1}^{L} \hat{\alpha}_i(j), \qquad \hat{\alpha}_i(j) = \left[ \hat{q}_{ij}(1 - \hat{q}_{ij}) \frac{\hat{\mathrm{E}}(n^2)}{\hat{\mathrm{E}}^2(n)} - \hat{v}_{ij} \right] \Big/ \left[ \hat{v}_{ij} - \frac{\hat{q}_{ij}(1 - \hat{q}_{ij})}{\hat{\mathrm{E}}(n)} \right],$$

$$\hat{v}_{ij} = \frac{\widehat{\mathrm{Var}}(n_j)}{\hat{\mathrm{E}}^2(n)} - \hat{q}_{ij}^{2} \frac{\widehat{\mathrm{Var}}(n)}{\hat{\mathrm{E}}^2(n)}$$
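The first- and second-moment estimates can be computed directly from the count vectors in one bin. The sketch below follows the moment formulas as reconstructed above, using population (divide-by-T) moments; skipping letters whose denominator is not positive is an added guard of this sketch, not something stated on the slide, and the function name is hypothetical.

```python
def moment_estimates(columns):
    """Return (q_hat, alpha_hat) for one bin of count vectors.

    q_hat[j] estimates E(n_j) / E(n); alpha_hat averages the per-letter
    second-moment estimates and is None if no letter yields a usable value.
    """
    T, L = len(columns), len(columns[0])
    totals = [sum(col) for col in columns]
    e_n = sum(totals) / T
    e_n2 = sum(n * n for n in totals) / T
    var_n = e_n2 - e_n ** 2
    q_hat, per_letter = [], []
    for j in range(L):
        e_nj = sum(col[j] for col in columns) / T
        var_nj = sum(col[j] ** 2 for col in columns) / T - e_nj ** 2
        q = e_nj / e_n
        v = var_nj / e_n ** 2 - q * q * var_n / e_n ** 2
        denom = v - q * (1 - q) / e_n
        q_hat.append(q)
        if denom > 0:  # guard added in this sketch
            per_letter.append((q * (1 - q) * e_n2 / e_n ** 2 - v) / denom)
    alpha_hat = sum(per_letter) / len(per_letter) if per_letter else None
    return q_hat, alpha_hat

# Two-letter columns of depth 2 whose first count is uniform on {0, 1, 2},
# which is what a Beta(1, 1) prior (alpha_i = 2) produces.
columns = [[2, 0], [0, 2], [2, 0], [0, 2], [1, 1], [1, 1]]
q_hat, alpha_hat = moment_estimates(columns)
```

For this hand-built bin the estimates come out as $\hat{q} = (0.5, 0.5)$ and $\hat{\alpha}_i = 2$, matching the uniform Beta(1, 1) case.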

Page 17: Inference

Issues:

• Jumping out of local optima via complete bin assignment (stochastic search)
• Refinement via partial bin assignment (non-stochastic search)
• Initialization strategies
• Number of components

Page 18: Model Complexity

Total cost function (description length) = −log(Data likelihood) + Model complexity*

Model complexity ≈ Complexity of M−1 mixture parameters (Comp1) + Complexity of M Dirichlet components (Comp2) − Overcounted complexity (Comp3),

where

$$\mathrm{Comp}_1 = \frac{M-1}{2} \log \frac{T}{2} + \frac{1}{2} \log \pi - \log \Gamma(M/2) + o(1),$$

$$\mathrm{Comp}_3 = \log(M!),$$

and Comp2 is M times a per-component term that includes a correction $\Delta_{L,\bar{n}}$ (the full expression is not recoverable from the extracted text; see the reference below).

As the number of components increases, the value of −log(Data likelihood) decreases, but the model complexity increases.

*For details, see Yu and Altschul 2011. The complexity of the Dirichlet model for multiple alignment data. J. Comput. Biol., in press.

Page 19: A (Toy) Inference Example

• Experiment: ground truth model $\Theta$ → artificial data $\vec{n}^{(1:T)}$ → inference algorithm → inferred model $\Theta'$

for t = 1 to T
    draw $\vec{p}^{(t)} \sim f(\vec{p} \mid \Theta) = \sum_{i=1}^{M} m_i \cdot \mathrm{Dir}(\vec{p}; \vec{\alpha}_i)$
    draw $n^{(t)} \sim \mathrm{Poisson}(n; \lambda)$
    draw $\vec{n}^{(t)} \sim \mathrm{Multinomial}(\vec{n}; n^{(t)}, \vec{p}^{(t)})$
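The data-generation loop of the toy experiment can be sketched with the standard library alone. The function names, the seed handling, and the small mixture in the usage lines are assumptions of this sketch; the Poisson draw uses Knuth's multiplication method, which is adequate for moderate λ such as the 20 and 80 used in the experiment.

```python
import math
import random

def sample_dirichlet(alpha, rng):
    """Draw p ~ Dir(alpha) by normalizing independent gamma variates."""
    g = [rng.gammavariate(a, 1.0) for a in alpha]
    s = sum(g)
    return [x / s for x in g]

def sample_poisson(lam, rng):
    """Draw n ~ Poisson(lam) with Knuth's multiplication method."""
    limit, k, prod = math.exp(-lam), 0, rng.random()
    while prod > limit:
        k += 1
        prod *= rng.random()
    return k

def generate_columns(m, alphas, lam, T, seed=0):
    """Generate T count vectors from a ground-truth Dirichlet mixture."""
    rng = random.Random(seed)
    L = len(alphas[0])
    columns = []
    for _ in range(T):
        i = rng.choices(range(len(m)), weights=m, k=1)[0]  # pick a component
        p = sample_dirichlet(alphas[i], rng)               # p ~ Dir(alpha_i)
        n = sample_poisson(lam, rng)                       # column depth
        counts = [0] * L
        for j in rng.choices(range(L), weights=p, k=n):    # multinomial draw
            counts[j] += 1
        columns.append(counts)
    return columns

# Hypothetical 2-component, 3-letter ground truth (not the 9-component DM).
cols = generate_columns([0.6, 0.4], [[5.0, 0.5, 0.5], [0.5, 0.5, 5.0]],
                        lam=20, T=50, seed=1)
```

Very small $\alpha_{ij}$ values (some entries of the real 9-component DM are of order $10^{-6}$) can make gamma variates underflow, so a production version would need to handle that case.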

Page 20: A (Toy) Inference Example

Ground truth ($m_i$ and $\alpha_{ij}$): a 9-component DM. Each row below lists one value per component, $i = 1, \ldots, 9$.

m 0.13817700 0.13904300 0.08104890 0.06581460 0.02579240 0.24203400 0.09586050 0.13965400 0.07257600

A 0.39908600 0.01582840 0.04759630 0.09969660 0.00000145 0.55526200 0.11428400 0.27425900 0.13620300

C 0.05420540 0.01311490 0.01793070 0.00941176 0.00222160 0.05038680 0.00418655 0.07987990 0.05896460

D 0.04941170 0.01416740 0.00579683 0.00008857 0.00000133 0.50257200 0.78424300 0.02031100 0.05251870

E 0.02845140 0.01232240 0.01302770 0.11627800 0.00000128 0.90223900 0.43477100 0.03894260 0.06140270

F 0.02591200 0.00102215 0.23249400 0.02026550 0.22089800 0.09011690 0.01252450 0.10836100 0.70405100

G 0.25260300 0.04945070 0.01069620 0.09818010 0.00000204 0.26072500 0.24440200 0.03979830 0.07961690

H 0.02367120 0.00745678 0.00841542 0.10135500 0.00871764 0.19098700 0.06603250 0.02067700 0.21530800

I 0.02425040 0.00249989 0.42645700 0.05080130 0.00000390 0.16955000 0.01017580 0.87321800 0.17152900

K 0.05788260 0.00525776 0.00000126 0.78829500 0.00000133 0.75006000 0.15614500 0.05859190 0.08389390

L 0.05685540 0.00645843 1.12570000 0.13809100 0.02457390 0.29736400 0.01321530 0.52930500 0.36405600

M 0.02530400 0.00126565 0.22821000 0.04358250 0.00015563 0.12071100 0.00301977 0.12966800 0.11294100

N 0.11182400 0.00971496 0.01214120 0.12342400 0.00000131 0.37148600 0.36926200 0.02440750 0.09727970

P 0.12844600 0.02412470 0.02611130 0.09111190 0.00000127 0.28072300 0.09127260 0.05222050 0.05702610

Q 0.02867630 0.00535386 0.02659940 0.22541700 0.00000132 0.56156800 0.11887700 0.02838930 0.06834880

R 0.03932920 0.01008840 0.01855830 0.71342700 0.00037950 0.46791400 0.05369930 0.04072360 0.09469440

S 0.44507700 0.00187662 0.01232570 0.09276430 0.00000140 0.55988900 0.21704900 0.06296060 0.12344600

T 0.28037800 0.00629699 0.00000180 0.09193250 0.00151323 0.50655300 0.09415750 0.22034700 0.11868400

V 0.09074450 0.00442848 0.22871500 0.05469920 0.00000412 0.27808200 0.00937226 1.12861000 0.16444900

W 0.00733549 0.00450365 0.02189680 0.01274710 0.07318310 0.03643850 0.00486890 0.01703660 0.15407100

Y 0.02651190 0.00224834 0.02679920 0.03594690 0.17363700 0.12555800 0.03150400 0.05824380 0.83395000

αi 2.155955 0.197480 2.489474 2.907515 0.505300 7.078185 2.833062 3.805951 3.752434

From http://compbio.soe.ucsc.edu/dirichlets/byst-4.5-0-3.9comp

Page 21: A (Toy) Inference Example

Figure adapted from Ye et al. 2011. On the Inference of Dirichlet Mixture Priors for Protein Sequence Comparison. J. Comput. Biol., in press.

Page 22: A (Toy) Inference Example

      m_i'/m_i              α_i'/α_i              JS(q^(i), q'^(i))*
 i    T = 2k    T = 100k    T = 2k    T = 100k    T = 2k     T = 100k
      λ = 20    λ = 80      λ = 20    λ = 80      λ = 20     λ = 80
 1    1.030     1.007       1.009     0.996       0.00162    0.00002
 2    1.337     1.000       0.939     0.998       0.00176    0.00004
 3    0.838     0.994       1.028     0.985       0.02608    0.00013
 4    1.165     0.993       0.823     0.985       0.00923    0.00006
 5    0.884     0.998       0.742     1.003       0.00553    0.00006
 6    0.421     1.010       1.963     0.997       0.02997    0.00022
 7    0.997     0.994       1.165     0.998       0.00618    0.00006
 8    1.003     0.996       1.206     0.996       0.00573    0.00023
 9    1.196     1.013       1.491     1.005       0.03329    0.00109

Table adapted from Ye et al. 2011. On the Inference of Dirichlet Mixture Priors for Protein Sequence Comparison. J. Comput. Biol., in press.

* $q_j^{(i)} = \alpha_{ij}/\alpha_i$, $j = 1, \ldots, L$; $i = 1, \ldots, M$. JS(·,·) denotes the Jensen-Shannon divergence.
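For reference, the Jensen-Shannon divergence used to compare the true and inferred component means can be computed as below. The slide does not state the logarithm base; natural logarithms are assumed here, and the function names are illustrative.

```python
import math

def kl(p, q):
    """Kullback-Leibler divergence KL(p || q) in nats; terms with p_j = 0 vanish."""
    return sum(a * math.log(a / b) for a, b in zip(p, q) if a > 0)

def js_divergence(p, q):
    """Jensen-Shannon divergence: mean KL of p and q to their midpoint."""
    mid = [(a + b) / 2 for a, b in zip(p, q)]
    return 0.5 * kl(p, mid) + 0.5 * kl(q, mid)

# Hypothetical frequency vectors over a 3-letter alphabet.
q_true = [0.5, 0.3, 0.2]
q_inferred = [0.45, 0.35, 0.2]
d = js_divergence(q_true, q_inferred)
```

The divergence is symmetric, zero for identical vectors, and bounded above by ln 2, so values like those in the last two table columns indicate near-identical component means.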

Page 23: Inference for Real Data

 M    -log2(Data)   Complexity    DL            y
 0    9.9605E+07    2.0571E+02    9.9605E+07    0.0000
 1    7.5622E+07    2.1196E+02    7.5622E+07    1.0033
 2    7.5017E+07    4.0291E+02    7.5018E+07    1.0285
 3    7.4768E+07    5.8573E+02    7.4768E+07    1.0390
 4    7.4654E+07    7.6324E+02    7.4654E+07    1.0437
 5    7.4601E+07    9.3678E+02    7.4602E+07    1.0459
 6    7.4562E+07    1.1071E+03    7.4563E+07    1.0476
 7    7.4518E+07    1.2749E+03    7.4520E+07    1.0494
 8    7.4486E+07    1.4404E+03    7.4488E+07    1.0507
 9    7.4438E+07    1.6038E+03    7.4439E+07    1.0527
10    7.4394E+07    1.7656E+03    7.4396E+07    1.0546
11    7.4365E+07    1.9257E+03    7.4367E+07    1.0558
12    7.4343E+07    2.0844E+03    7.4345E+07    1.0567
13    7.4332E+07    2.2418E+03    7.4334E+07    1.0571
14    7.4300E+07    2.3980E+03    7.4303E+07    1.0584
15    7.4288E+07    2.5531E+03    7.4290E+07    1.0590
16    7.4275E+07    2.7070E+03    7.4278E+07    1.0595
17    7.4265E+07    2.8600E+03    7.4268E+07    1.0599
18    7.4253E+07    3.0121E+03    7.4256E+07    1.0604
19    7.4243E+07    3.1633E+03    7.4246E+07    1.0608
20    7.4234E+07    3.3137E+03    7.4237E+07    1.0612
21    7.4221E+07    3.4632E+03    7.4225E+07    1.0617
22    7.4214E+07    3.6120E+03    7.4218E+07    1.0620
23    7.4198E+07    3.7601E+03    7.4202E+07    1.0627
24    7.4193E+07    3.9075E+03    7.4197E+07    1.0629
25    7.4184E+07    4.0543E+03    7.4188E+07    1.0632
26    7.4178E+07    4.2004E+03    7.4182E+07    1.0635
27    7.4172E+07    4.3459E+03    7.4176E+07    1.0637
28    7.4169E+07    4.4908E+03    7.4174E+07    1.0638
29    7.4153E+07    4.6351E+03    7.4158E+07    1.0645
30    7.4152E+07    4.7789E+03    7.4156E+07    1.0646
31    7.4146E+07    4.9222E+03    7.4150E+07    1.0648
32    7.4142E+07    5.0649E+03    7.4147E+07    1.0650
33    7.4136E+07    5.2072E+03    7.4141E+07    1.0652
34    7.4131E+07    5.3490E+03    7.4136E+07    1.0654
35    7.4131E+07    5.4903E+03    7.4136E+07    1.0654
36    7.4131E+07    5.6312E+03    7.4136E+07    1.0654
37    7.4131E+07    5.7716E+03    7.4136E+07    1.0654
38    7.4131E+07    5.9116E+03    7.4136E+07    1.0654

Figure adapted from Ye et al. 2011. On the Inference of Dirichlet Mixture Priors for Protein Sequence Comparison. J. Comput. Biol., in press. The data set can be downloaded from UCSC; it contains T = 314585 columns, with $\bar{n}$ = 75.99.

Page 24: Adjustment

• A DM, constructed from a curated set of data with a specific letter (e.g. amino acid) composition, will imply this same set of background frequencies ($q_j = \sum_{i=1}^{M} m_i \alpha_{ij}/\alpha_i$, $j = 1, \ldots, L$). One may, however, wish to apply Bayesian analysis to a set of sequences with markedly different letter composition.
• It would be impractical to derive a new set of priors for such sequences, but non-optimal to use a prior that implies a different composition.
• We have therefore sought to adjust a given set of DM priors to a non-standard composition. We define this as constructing a DM prior that implies the desired composition, but that is "close" to the original prior.
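The background frequencies implied by a DM follow directly from $q_j = \sum_i m_i \alpha_{ij}/\alpha_i$. A minimal sketch, using a hypothetical 3-component, 3-letter mixture:

```python
def implied_frequencies(m, alphas):
    """q_j = sum_i m_i * alpha_ij / alpha_i, with alpha_i = sum_j alpha_ij."""
    L = len(alphas[0])
    q = [0.0] * L
    for mi, alpha in zip(m, alphas):
        a_sum = sum(alpha)
        for j, a in enumerate(alpha):
            q[j] += mi * a / a_sum
    return q

# Hypothetical mixture (illustrative values only).
m = [0.5, 0.3, 0.2]
alphas = [[2.0, 1.0, 1.0], [0.5, 1.5, 1.0], [1.0, 1.0, 3.0]]
q = implied_frequencies(m, alphas)
```

Because the weights $m_i$ sum to one and each component mean sums to one, the implied frequencies always form a proper distribution.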

Page 25: Adjustment Model

• Implied frequencies and target frequencies:

$$q_j = \sum_{i=1}^{M} m_i \frac{\alpha_{ij}}{\alpha_i}, \qquad q_j' = \frac{\#\text{ of letter } j}{\#\text{ of all letters}}, \qquad j = 1, \ldots, L$$

• Ideal formulation (difficult)

$$\underset{\Theta'}{\text{Minimize}} \quad G(\Theta'; \Theta) = \int_{\vec{p}} f(\vec{p} \mid \Theta) \ln \frac{f(\vec{p} \mid \Theta)}{f(\vec{p} \mid \Theta')} \, d\vec{p}$$

$$\text{Subject to} \quad \sum_{i=1}^{M} m_i' \frac{\alpha_{ij}'}{\alpha_i'} = q_j', \qquad j = 1, \ldots, L$$

• A practical formulation (easy)

$$\underset{\Theta'}{\text{Minimize}} \quad F(\Theta'; \Theta) = \sum_{i=1}^{M} m_i \int_{\vec{p}} \mathrm{Dir}(\vec{p}; \vec{\alpha}_i) \ln \frac{\mathrm{Dir}(\vec{p}; \vec{\alpha}_i)}{\mathrm{Dir}(\vec{p}; \vec{\alpha}_i')} \, d\vec{p}$$

$$\text{Subject to} \quad m_i' = m_i,\ i = 1, \ldots, M; \qquad \alpha_i' = \alpha_i,\ i = 1, \ldots, M; \qquad \sum_{i=1}^{M} m_i \frac{\alpha_{ij}'}{\alpha_i} = q_j',\ j = 1, \ldots, L$$

Page 26: Local Adjustment Model

• Objective function in practical formulation

$$F(\Theta'; \Theta) = \sum_{i=1}^{M} m_i \int_{\vec{p}} \mathrm{Dir}(\vec{p}; \vec{\alpha}_i) \ln \frac{\mathrm{Dir}(\vec{p}; \vec{\alpha}_i)}{\mathrm{Dir}(\vec{p}; \vec{\alpha}_i')} \, d\vec{p}$$

$$= \sum_{i=1}^{M} m_i \sum_{j=1}^{L} \Big( \ln \Gamma(\alpha_{ij}') - \ln \Gamma(\alpha_{ij}) - \psi(\alpha_{ij})(\alpha_{ij}' - \alpha_{ij}) \Big)$$

$$= \sum_{i=1}^{M} m_i \sum_{j=1}^{L} \Big( \tfrac{1}{2} \psi'(\alpha_{ij})(\alpha_{ij}' - \alpha_{ij})^2 + O\big((\alpha_{ij}' - \alpha_{ij})^3\big) \Big)$$

• As $\alpha_{ij}'$ is close to $\alpha_{ij}$

$$F(\Theta'; \Theta) \approx \frac{1}{2} \sum_{i=1}^{M} m_i \sum_{j=1}^{L} \psi'(\alpha_{ij})(\alpha_{ij}' - \alpha_{ij})^2 = \frac{1}{2} \sum_{i=1}^{M} \sum_{j=1}^{L} R_{i,j} (\alpha_{ij}' - \alpha_{ij})^2 = \frac{1}{2} \sum_{i=1}^{M} (\vec{\alpha}_i' - \vec{\alpha}_i)^{\mathsf T} R_i (\vec{\alpha}_i' - \vec{\alpha}_i),$$

where $R_i = \mathrm{diag}\{R_{i,1}, \ldots, R_{i,L}\}$ with $R_{i,j} = m_i \psi'(\alpha_{ij})$.
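The quality of the quadratic approximation can be checked numerically. The sketch below evaluates the exact objective (log-gamma and digamma form) and the trigamma-weighted quadratic form for a small perturbation that keeps each $\alpha_i$ fixed. The standard library has no digamma or trigamma, so both are implemented here with the usual recurrence plus asymptotic series; the mixture and perturbation values are hypothetical.

```python
import math

def digamma(x):
    """psi(x) for x > 0: shift x upward, then use the asymptotic series."""
    acc = 0.0
    while x < 6.0:
        acc -= 1.0 / x
        x += 1.0
    x2 = 1.0 / (x * x)
    return acc + math.log(x) - 0.5 / x - x2 * (1.0 / 12 - x2 * (1.0 / 120 - x2 / 252))

def trigamma(x):
    """psi'(x) for x > 0: shift x upward, then use the asymptotic series."""
    acc = 0.0
    while x < 6.0:
        acc += 1.0 / (x * x)
        x += 1.0
    x2 = 1.0 / (x * x)
    return acc + 1.0 / x + x2 / 2 + (x2 / x) * (1.0 / 6 - x2 * (1.0 / 30 - x2 * (1.0 / 42 - x2 / 30)))

def f_exact(m, alphas, new_alphas):
    """Sum_i m_i Sum_j [ln G(a') - ln G(a) - psi(a)(a' - a)], valid when alpha_i' = alpha_i."""
    return sum(mi * sum(math.lgamma(b) - math.lgamma(a) - digamma(a) * (b - a)
                        for a, b in zip(al, nal))
               for mi, al, nal in zip(m, alphas, new_alphas))

def f_quadratic(m, alphas, new_alphas):
    """(1/2) Sum_i m_i Sum_j psi'(a) (a' - a)^2."""
    return 0.5 * sum(mi * sum(trigamma(a) * (b - a) ** 2 for a, b in zip(al, nal))
                     for mi, al, nal in zip(m, alphas, new_alphas))

# Hypothetical 2-component mixture and a zero-sum perturbation per component.
m = [0.6, 0.4]
alphas = [[0.5, 1.0, 2.0], [3.0, 1.0, 0.5]]
delta = [[0.01, -0.005, -0.005], [-0.02, 0.01, 0.01]]
new_alphas = [[a + d for a, d in zip(al, dl)] for al, dl in zip(alphas, delta)]
exact = f_exact(m, alphas, new_alphas)
approx = f_quadratic(m, alphas, new_alphas)
```

For perturbations of this size the two values agree to within a few percent, with the gap shrinking as the perturbation shrinks, which is why the adjustment is carried out in many small steps.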

Page 27: Successive Local Adjustment

• Intermediate targets

$$\vec{q}^{(t)} = \vec{q} + t(\vec{q}' - \vec{q})/N, \quad t = 1, \ldots, N; \qquad \Delta\vec{q} = \vec{q}^{(t)} - \vec{q}^{(t-1)} = (\vec{q}' - \vec{q})/N$$

The path runs from $\vec{q}$, where $\sum_{i=1}^{M} m_i \alpha_{ij}/\alpha_i = q_j$ with parameters $\vec{\alpha}_i$, to $\vec{q}'$, where $\sum_{i=1}^{M} m_i \alpha_{ij}^{(N)}/\alpha_i = q_j'$ with parameters $\vec{\alpha}_i^{(N)}$.

• Lagrange-Newton method

$$\vec{\alpha}_i^{(t)} = \vec{\alpha}_i^{(t-1)} + \frac{m_i}{\alpha_i} S_i^{(t-1)} B^{(t-1)} \Delta\vec{q},$$

$$S_i^{(t-1)} = (R_i^{(t-1)})^{-1} \left( I - \frac{\vec{1}\,\vec{1}^{\mathsf T} (R_i^{(t-1)})^{-1}}{\vec{1}^{\mathsf T} (R_i^{(t-1)})^{-1} \vec{1}} \right),$$

$$B^{(t-1)} = (A^{(t-1)})^{+}, \qquad A^{(t-1)} = \sum_{i=1}^{M} \frac{m_i^2}{\alpha_i^2} S_i^{(t-1)}$$

As $N \to \infty$, $\vec{\alpha}_i^{(N)} \to \vec{\alpha}_i^{(\infty)}$, $i = 1, \ldots, M$.

We have proved* that, if one performs infinitesimal background frequency changes along different paths, the same solution $\vec{\alpha}_i^{(\infty)}$ will be reached.

*For proof, see Ye et al. 2010. Compositional Adjustment of Dirichlet Mixture Priors. J. Comput. Biol., DOI: 10.1089/cmb.2010.0117
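The successive local adjustment can be sketched in pure Python for small alphabets. This is an illustration under stated assumptions, not the published implementation: $R_{i,j} = m_i\psi'(\alpha_{ij})$ is taken from the local quadratic model, trigamma is computed by recurrence plus an asymptotic series, and instead of forming the pseudo-inverse $A^{+}$ explicitly the code solves $(A + \vec{1}\vec{1}^{\mathsf T})\lambda = \Delta\vec{q}$, which gives the same $\lambda$ whenever $\Delta\vec{q}$ sums to zero. All names and numeric values are hypothetical.

```python
def trigamma(x):
    """psi'(x) for x > 0 via upward recurrence plus an asymptotic series."""
    acc = 0.0
    while x < 6.0:
        acc += 1.0 / (x * x)
        x += 1.0
    x2 = 1.0 / (x * x)
    return acc + 1.0 / x + x2 / 2 + (x2 / x) * (1.0 / 6 - x2 * (1.0 / 30 - x2 * (1.0 / 42 - x2 / 30)))

def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    aug = [row[:] + [b[k]] for k, row in enumerate(a)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[piv] = aug[piv], aug[col]
        for r in range(col + 1, n):
            f = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= f * aug[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (aug[r][n] - sum(aug[r][c] * x[c] for c in range(r + 1, n))) / aug[r][r]
    return x

def adjust(m, alpha, q_target, n_steps=100):
    """Move the implied frequencies to q_target in n_steps equal increments."""
    M, L = len(m), len(alpha[0])
    alpha = [row[:] for row in alpha]
    tot = [sum(row) for row in alpha]              # alpha_i, held fixed
    c = [m[i] / tot[i] for i in range(M)]          # m_i / alpha_i
    q = [sum(c[i] * alpha[i][j] for i in range(M)) for j in range(L)]
    dq = [(q_target[j] - q[j]) / n_steps for j in range(L)]
    for _ in range(n_steps):
        # r[i][j] = 1 / R_ij, with R_ij = m_i * trigamma(alpha_ij)
        r = [[1.0 / (m[i] * trigamma(a)) for a in alpha[i]] for i in range(M)]
        rs = [sum(row) for row in r]
        # A = sum_i c_i^2 S_i; start from 1 1^T to remove the null direction.
        A = [[1.0] * L for _ in range(L)]
        for i in range(M):
            for j in range(L):
                A[j][j] += c[i] ** 2 * r[i][j]
                for k in range(L):
                    A[j][k] -= c[i] ** 2 * r[i][j] * r[i][k] / rs[i]
        lam = solve(A, dq)
        for i in range(M):
            proj = sum(r[i][k] * lam[k] for k in range(L)) / rs[i]
            for j in range(L):
                alpha[i][j] += c[i] * r[i][j] * (lam[j] - proj)   # c_i * (S_i lam)_j
    return alpha

m = [0.5, 0.3, 0.2]
alpha = [[2.0, 1.0, 1.0], [0.5, 1.5, 1.0], [1.0, 1.0, 3.0]]
target = [0.30, 0.35, 0.35]
adjusted = adjust(m, alpha, target, n_steps=100)
```

Because the composition constraint is linear in the $\alpha_{ij}$ once each $\alpha_i$ is fixed, every step meets its intermediate target exactly (up to rounding), and the number of steps only affects how closely the path follows the local quadratic model.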

Page 28: A Demo of Adjusting a 3-component 3-letter DM

The composition $\vec{q} = (q_1, q_2, q_3)^{\mathsf T}$ is plotted as the point $\vec{x} = (x_1, x_2)^{\mathsf T}$ with

$$\vec{x} = q_1 \vec{x}^{(1)} + q_2 \vec{x}^{(2)} + q_3 \vec{x}^{(3)}$$

(Figure: triangle with vertices $\vec{x}^{(1)}$, $\vec{x}^{(2)}$, $\vec{x}^{(3)}$.)

Pages 29 to 34: (Figures only: further frames of the demo, each drawn on the triangle with vertices $\vec{x}^{(1)}$, $\vec{x}^{(2)}$, $\vec{x}^{(3)}$.)

Page 35: A Real Case of Adjusting a 20-component 20-letter DM

Table from Ye et al. 2010. Compositional Adjustment of Dirichlet Mixture Priors. J. Comput. Biol., DOI: 10.1089/cmb.2010.0117

Page 36: A Real Case of Adjusting a 20-component 20-letter DM

• The 20-component DM is "recode4" from http://compbio.soe.ucsc.edu/dirichlets/recode4.20comp
• The target frequencies are from a set of 53 Api-AP2 proteins from Toxoplasma gondii (Altschul et al., 2010)
• Adjustment with N = 1000 steps takes 0.28 sec!

(Plot: max. relative error (%) against N, for N from 0 to 1000.)

$$r_m(N) = \max_{i,j} \frac{|\alpha_{ij}^{(N)} - \alpha_{ij}^{(10000)}|}{\alpha_{ij}^{(10000)}}, \qquad N^{*}_{0.01} = \min\{N : r_m(N) < 0.01\}, \qquad N^{*}_{0.01} = 146$$

Figure from Ye et al. 2010. Compositional Adjustment of Dirichlet Mixture Priors. J. Comput. Biol., DOI: 10.1089/cmb.2010.0117

Page 37: How Large N Is

• Experiment

We take the baseline $\vec{q}$ to be "recode4". We generate 1000 sets of "desired" background frequencies $\vec{q}'$ centered on $\vec{q}$ by sampling from a single Dirichlet distribution $\mathrm{Dir}(\vec{p}; 75\vec{q})$. For each $\vec{q}'$, we calculate $N^{*}_{0.01}$. We sort the $\vec{q}'$ into bins according to the quantity

$$A = \frac{1}{20} \sum_{j=1}^{20} \left| \log_2 (q_j'/q_j) \right|.$$

For each bin, dots represent the observed mean value of $N^{*}_{0.01}$, with error bars showing one standard deviation for this estimate. Triangles represent the 90th percentile for values of $N^{*}_{0.01}$ within each bin. The particular case studied in Table 2 and the previous figure is shown by an "x".

(Plot: $N^{*}_{0.01}$ against $A$.)

Figure from Ye et al. 2010. Compositional Adjustment of Dirichlet Mixture Priors. J. Comput. Biol., DOI: 10.1089/cmb.2010.0117

Page 38: Pattern Search

• Objective: given a set of protein sequences, find significant patterns using the Dirichlet mixture prior (trained from gold-standard data and adjusted according to the letter composition of the target data set)
• Search method: Gibbs sampling procedure based on the position posteriors and shift posteriors

Page 39: Pattern Search

Gibbs sampling based optimization procedure (basic form):

• Step 0. Start from a position vector $\vec{t}^{(0)}$ to form a pattern window $X^{(0)}$ of length $w$.
• Step 1. Given the current pattern window represented by $X^{(k)} = (\vec{t}^{(k)}, w)$, pick a sequence $i$ and compute the position posteriors of all possible $t_i$.
• Step 2. Draw $t_i^{(k+1)}$ according to the position posteriors. Set $t_{i'}^{(k+1)} = t_{i'}^{(k)}$ for $i' \ne i$.
• Go to Step 1.

(Occasionally, after Step 1 and Step 2, one may shift the whole pattern window according to the shift posteriors.)
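The basic loop (Steps 0 to 2) can be sketched as follows. To keep the example short it swaps in a single symmetric Dirichlet prior in place of the full Dirichlet mixture, so the position weights are products of the predictive probabilities $(\alpha + n_y)/(L\alpha + N)$; the shift move is omitted. The function name, parameters, and toy sequences are all hypothetical.

```python
import random

def gibbs_pattern_search(seqs, w, alphabet, alpha=0.5, iters=200, seed=0):
    """Sample one window start per sequence (simplified: single symmetric Dirichlet)."""
    rng = random.Random(seed)
    idx = {ch: j for j, ch in enumerate(alphabet)}
    L = len(alphabet)
    # Step 0: random initial position vector.
    starts = [rng.randrange(len(s) - w + 1) for s in seqs]
    for _ in range(iters):
        # Step 1: pick a sequence and build the pattern from all the others.
        i = rng.randrange(len(seqs))
        counts = [[0] * L for _ in range(w)]
        for k, (s, t) in enumerate(zip(seqs, starts)):
            if k != i:
                for tau in range(w):
                    counts[tau][idx[s[t + tau]]] += 1
        denom = L * alpha + len(seqs) - 1
        weights = []
        for t in range(len(seqs[i]) - w + 1):
            wt = 1.0
            for tau in range(w):
                # Predictive probability of the letter given the pattern column.
                wt *= (alpha + counts[tau][idx[seqs[i][t + tau]]]) / denom
            weights.append(wt)
        # Step 2: draw the new start; rng.choices normalizes the weights,
        # and all other starts are left unchanged.
        starts[i] = rng.choices(range(len(weights)), weights=weights, k=1)[0]
    return starts

# Toy sequences over a 4-letter alphabet with a shared run of W's.
seqs = ["ACDWWWWWAC", "DDWWWWWCCA", "CAWWWWWDAD"]
starts = gibbs_pattern_search(seqs, w=5, alphabet="ACDW")
```

The sampler is stochastic, so the returned positions depend on the seed; tracking the best pattern score seen so far, as in the plots that follow, is a natural addition.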

Page 40: Pattern Search

• Back to the sequence data given at the beginning

(Three plots, for w = 5, w = 8, and w = 10: pattern score against iteration, each showing the current score and the best score so far.)

Page 41: Pattern Search

For example, three patterns found with window lengths 5, 8, and 10 (figure).

Page 42: Conclusion

• The Dirichlet mixture model is a powerful tool for mining protein sequence data.
• The inference of a DM from data is challenging, but we have designed a better inference procedure.
• A given DM prior might imply marginal letter probabilities that differ from those implied by the target sequence data, but we have designed an adjustment procedure.
• A DM that is trained from gold-standard alignment data can be used for searching for patterns in newly acquired sequence data. We have designed Gibbs-sampling based search procedures that work very well.
• The impact of the work is that a considerable amount of manual labor could be saved for biologists and biomedical researchers.

Page 43: Questions

Thanks