Building a dictionary for genomes By Harmen J. Bussemaker, Hao Li, and Eric D. Siggia Tal Frank

Page 1: Building a dictionary for genomes By Harmen J. Bussemaker, Hao Li, and Eric D. Siggia Tal Frank

Page 2

Topics that will be discussed

Biological background

Present the biological problem

Show an algorithm that treats this problem, using statistical mechanics methods

Try our algorithm on two well-known problems

Page 3

What we did so far

Human Genome Project (2001)

This article (published 2000): sequence is not everything. Let's do some theory.

Control over gene expression: when and how much

Control element = regulator = sequence motif

Genes working together = co-regulated genes

Page 4

The goals of this work

Identify the control elements

Where are they located?

Identify co-regulated genes

Page 5

Multiple control elements

Example: where are the control elements located?

Concepts: directionality, upstream, in the junk

TATACGACGAXXTTCGATTCGA

Example: co-regulated genes

Naïve approach: TATACGACGAXXTTTTTATAAAYYATGGCA ATGGCA

Experimentally: TATACGACGAXXTTTTCGCGAAYYATGGCAATGGCA

To activate a set of genes, multiple sequences are needed

Page 6

New terminology

DNA = string of letters

Control element = word

Multiple control elements = sentence

Genes and junk = background noise

Example: S: …GAGAGCGCXXTGTGGGYYGCTT……GCTT…

words = {GA,TG}

sentence = GA.TG

background = genes and junk.

Page 7

MobyDick algorithm

Decipher a ‘‘text’’ consisting of a long string of letters written in an unknown language.

Find the words in the text

Find the right spacing

Example: D={A,T,AT}, S=ATT

P1=A.T.T

P2=AT.T
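The two spacings of S=ATT can be checked by brute force. The sketch below is only an illustration: the function name `enumerate_parses` is made up here, and it simply lists every way to cut S into words taken from D.

```python
def enumerate_parses(s, dictionary):
    """Return every way to split s into dictionary words, as lists of words."""
    if s == "":
        return [[]]  # exactly one way to parse the empty string
    result = []
    for word in sorted(dictionary):
        if s.startswith(word):
            # Parse the remainder recursively and prepend the matched word.
            for rest in enumerate_parses(s[len(word):], dictionary):
                result.append([word] + rest)
    return result


parse_strings = [".".join(p) for p in enumerate_parses("ATT", {"A", "T", "AT"})]
print(parse_strings)  # ['A.T.T', 'AT.T']
```

The number of spacings grows quickly with the length of S, so this enumeration is only practical for toy strings.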

Page 8

How would you do it?

1. Look for repeated substrings in the string:

Tal went to Weizmann this morning. When he arrived he didn’t go to his office, he went to drink a cup of coffee ….

{went, to, he} → D (dictionary)

2. Space the text. Oops: spacing is not that simple.

e.g. D={A,T,AT}, S=ATT

P1=A.T.T (p1)

P2=AT.T (p2)

Page 9

MobyDick Blueprints

S=TAGATAT

1-letter words: D={T,A,G}, find pw = {pA, pT, ..}

2-letter words: D={A,TA,…}, find pw = {pA, pTA, ..}

No more optional words: stop!

Find spacing: S=TA.G.A.TA.T
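The slides do not say how the final spacing is picked. One common choice, assumed here, is the single most probable spacing, found with a Viterbi-style dynamic program. The function name `best_parse` and the probability values are hypothetical and chosen only for this illustration.

```python
import math


def best_parse(s, probs):
    """Most probable split of s into dictionary words (Viterbi-style DP)."""
    n = len(s)
    best = [-math.inf] * (n + 1)  # best[i]: best log-probability of parsing s[:i]
    back = [None] * (n + 1)       # back[i]: start index of the last word in that parse
    best[0] = 0.0
    for end in range(1, n + 1):
        for word, p in probs.items():
            start = end - len(word)
            if start >= 0 and p > 0 and s[start:end] == word:
                score = best[start] + math.log(p)
                if score > best[end]:
                    best[end] = score
                    back[end] = start
    if n > 0 and back[n] is None:
        raise ValueError("s cannot be parsed with this dictionary")
    words = []
    end = n
    while end > 0:  # walk the back-pointers from the end of the string
        start = back[end]
        words.append(s[start:end])
        end = start
    return ".".join(reversed(words))


# Hypothetical probabilities for the dictionary {T, A, G, TA}.
example_probs = {"T": 0.2, "A": 0.2, "G": 0.2, "TA": 0.4}
print(best_parse("TAGATAT", example_probs))  # TA.G.A.TA.T
```

With these values each TA is cheaper as one word (0.4) than as T followed by A (0.2 × 0.2), which gives the spacing shown on the slide.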

Page 10

Statistical mechanics: in order to?

1. How does MobyDick decide {pw}?

2. When does MobyDick add a new word?

3. Space (parse) the text.

Page 11

The likelihood function

$Z = \sum_k \prod_w p_w^{N_w(k)}$

k: a possible spacing

$N_w(k)$: number of times the word w appears in spacing k

Example: D=(T,AT,A), S=TATA

k1=T.A.T.A

k2=T.AT.A

$Z = p_A^2 p_T^2 + p_A p_T p_{AT}$
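Z does not have to be computed by listing every spacing. A minimal sketch, assuming a simple left-to-right dynamic program (the name `partition_function` and the numeric probabilities are illustrative, not taken from the slides):

```python
def partition_function(s, probs):
    """Sum over all spacings k of the product of p_w over the words used in k."""
    forward = [0.0] * (len(s) + 1)  # forward[i]: total weight of all parses of s[:i]
    forward[0] = 1.0
    for end in range(1, len(s) + 1):
        for word, p in probs.items():
            start = end - len(word)
            if start >= 0 and s[start:end] == word:
                forward[end] += forward[start] * p
    return forward[len(s)]


# Hypothetical probabilities for D = (T, AT, A).
probs = {"T": 0.5, "A": 0.3, "AT": 0.2}
z = partition_function("TATA", probs)
by_hand = probs["A"] ** 2 * probs["T"] ** 2 + probs["A"] * probs["T"] * probs["AT"]
print(z, by_hand)
```

For S=TATA the program sums exactly the two spacings T.A.T.A and T.AT.A, so `z` agrees with the hand-written formula up to rounding.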

Page 12

Likelihood function: intuition

Z(D,{pw}): partition function: <E>, <N>, <T>, …

Z(D,{pw}): the probability to obtain a sequence S.

Example: D=(T,AT,A), {pT, pA, pAT}

Question: what is the probability of S=TATA?

1st possibility: T.A.T.A, with weight pA*pA*pT*pT

2nd possibility: T.AT.A, with weight pT*pAT*pA

$Z = p_A^2 p_T^2 + p_A p_T p_{AT}$

Page 13

Finding {pw}

Given : D,S

Maximize Z({pw},D) with respect to {pw}

This {pw} gives the highest probability to get the given S

Page 14

Let's find the {pw}!

Definition: $\langle N_w \rangle$, the average number of occurrences of the word w over the different spacings.

Can prove: $\langle N_w \rangle = p_w \frac{\partial}{\partial p_w} \ln Z$

Maximize Z, i.e. solve: $p_w = \frac{\langle N_w \rangle}{\sum_{w'} \langle N_{w'} \rangle}$

Solving is done by iteration: $p_w \to \langle N_w \rangle \to p_w' \to \langle N_w' \rangle \to \dots$
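The iteration can be sketched as follows. This is an illustration under assumptions, not the authors' code: the average word counts are computed with a forward and a backward pass over S, the probabilities start out uniform, and the names `expected_counts` and `fit_probabilities` are made up here. It also assumes S can be parsed with the dictionary, so that Z > 0.

```python
def expected_counts(s, probs):
    """Return (<N_w> for every word w, Z) for string s and word probabilities probs."""
    n = len(s)
    forward = [0.0] * (n + 1)   # forward[i]: weight of all parses of s[:i]
    forward[0] = 1.0
    for end in range(1, n + 1):
        for word, p in probs.items():
            start = end - len(word)
            if start >= 0 and s[start:end] == word:
                forward[end] += forward[start] * p
    backward = [0.0] * (n + 1)  # backward[i]: weight of all parses of s[i:]
    backward[n] = 1.0
    for start in range(n - 1, -1, -1):
        for word, p in probs.items():
            end = start + len(word)
            if end <= n and s[start:end] == word:
                backward[start] += p * backward[end]
    z = forward[n]
    counts = {word: 0.0 for word in probs}
    for word, p in probs.items():
        for start in range(n - len(word) + 1):
            end = start + len(word)
            if s[start:end] == word:
                # Share of the total weight carried by parses using word at this position.
                counts[word] += forward[start] * p * backward[end] / z
    return counts, z


def fit_probabilities(s, dictionary, iterations=50):
    """Repeat p_w <- <N_w> / sum over w' of <N_w'>, starting from uniform p_w."""
    probs = {w: 1.0 / len(dictionary) for w in dictionary}
    for _ in range(iterations):
        counts, _ = expected_counts(s, probs)
        total = sum(counts.values())
        probs = {w: counts[w] / total for w in probs}
    return probs


fitted = fit_probabilities("TATA", ["T", "A", "AT"])
print(fitted)
```

For long strings the raw products underflow, so a practical version would rescale the forward and backward arrays or work with logarithms.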

Page 15

Enough is enough!!!

When is pw good enough?

When the new {pw} don’t give a higher Z

We say: this method converges!

Other methods don’t converge.

Page 16

Why find {pw} this way?

Monte Carlo methods don’t converge.

A slow method can be transformed into a fast method

Order of complexity: O(LDl)

L: the length of the string

D: the size of the dictionary

l: the length of the longest word in D

Page 17

Add new words?

Look at the dictionary: D={T,A,C,G}, S=TATTGA

Compose a new word ww’: D={T,A,C,G}, S=TATTGA, ww’=TA

Check occurrence: is $N_{ww'} > N \, p_w \, p_{w'}$ ? D={T,A,C,G}, S=TATTGA, ww’=TA

Yes: add to the dictionary. D={T,A,C,G,TA}, S=TATTGA
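A rough sketch of the occurrence check, under loud assumptions: the check is read as "does the pair ww’ occur more often than N·pw·pw’, with N the number of words in the text", the text is taken in one fixed spacing rather than averaged over all spacings, and no significance test is applied. The function name and the toy string are hypothetical.

```python
from collections import Counter


def overrepresented_pairs(words, probs):
    """Map each adjacent pair w+w' seen more often than N*p_w*p_w' to (observed, expected)."""
    n = len(words)
    pair_counts = Counter(zip(words, words[1:]))
    candidates = {}
    for (w1, w2), observed in pair_counts.items():
        expected = n * probs[w1] * probs[w2]
        if observed > expected:
            candidates[w1 + w2] = (observed, expected)
    return candidates


# Toy text with a one-letter dictionary, so the spacing is unique (assumed example).
text = list("TATATACG")
letter_probs = {w: c / len(text) for w, c in Counter(text).items()}
pairs = overrepresented_pairs(text, letter_probs)
print(pairs["TA"])  # (3, 1.125)
```

On a string this short almost every pair clears the bar, so a real run would need a stricter threshold before adding a word to D.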

Page 18

A problem and a bad solution

The algorithm finds only words that are composed of words already in the dictionary.

Example: S=AATATAAA

1st step: S=AATATAAA, D={A}

2nd step: S=AATATAAA, but AT is not a composition of words

Solution: look for repeated long strings, by considering the problem

Page 19

Spacing

Define: $\Xi_w$, the number of times the word w occurs in a given spacing.

Quality factor: $Q_w = \frac{\langle N_w \rangle}{\Xi_w}$

The required condition: $Q_w \approx 1$

Page 20

Checking the algorithm

Applying it to the English novel Moby Dick

Applying it to control elements in the yeast genome

Not always possible: the Voynich manuscript (1450)

Page 21

Preparing the book Moby Dick

Call me Ishmael. Some years ago- never mind how long precisely- having little or no money in my purse, and nothing particular tothought I would sail …

CallmeIshmaelSomeyearsagonevermindhowlongpreciselyhavinglittleornomoneyinmypurseandnothingparticulartothoughtIwouldsail..

CallabajameIshmaelbjklmbbSomeyearsagonevermindhowlon EciselyhavinglittlermsdrornomoneyinmypurseandnothingparticuartothoughtIwouldsail …
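The preparation shown here (strip spaces and punctuation, then scatter random letters between words as background) can be sketched as below. How the background strings were actually generated is not stated on the slide; the insertion probability, the noise length, and the function name `prepare_text` are assumptions made for this example.

```python
import random
import re
import string


def prepare_text(text, noise_probability=0.3, max_noise=7, seed=0):
    """Return (text without spaces/punctuation, same text with random letter runs added)."""
    rng = random.Random(seed)  # fixed seed so the example is repeatable
    words = re.findall(r"[A-Za-z]+", text)
    clean = "".join(words)
    pieces = []
    for word in words:
        pieces.append(word)
        # After some words, insert a short run of random lowercase letters.
        if rng.random() < noise_probability:
            length = rng.randint(1, max_noise)
            pieces.append("".join(rng.choice(string.ascii_lowercase) for _ in range(length)))
    return clean, "".join(pieces)


clean, noisy = prepare_text("Call me Ishmael. Some years ago- never mind how long precisely")
print(clean)
print(noisy)
```

The algorithm is then run on the noisy string, and the words it recovers can be compared with the original word list.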

Page 22

Results: Moby Dick

First 10 chapters

D={a,b,c….}

Text: 4,214 unique words, of which 2,630 occurred only once

Background: increases L by a factor of 3.

2,450 words found, 700 in English, 40 composite words.

Page 23

Results: yeast

D={T,A,C,G}

Text: 443 experimentally determined sites

Background: genes and junk

500 words found

114 match the experimentally determined sites

Not that good, but it is a beginning!

Page 24

The end