The Gibbs Motif Sampler By: Andrea Smith Introduction to Bioinformatics December 9, 2004

The Gibbs Motif SamplerThe Gibbs Motif Sampler

By: Andrea SmithBy: Andrea Smith

Introduction to BioinformaticsIntroduction to Bioinformatics

December 9 , 2004December 9 , 2004

•Designed by Eric C. Rouchka and Bill Designed by Eric C. Rouchka and Bill Thompson Thompson

•Identifies motifs or conserved regions Identifies motifs or conserved regions within sequences of DNA or protein. within sequences of DNA or protein.

•Used by several tools such as Used by several tools such as

AlignACE.AlignACE.

Background:Background:

• ““Motif” and “recursive” samplingMotif” and “recursive” sampling assume assume that a given nucleotide or amino acid sequence that a given nucleotide or amino acid sequence can contain zero or more.can contain zero or more.

•““Site” samplingSite” sampling assumes that a assumes that a given nucleotide or amino acid given nucleotide or amino acid sequence contains only one motif sequence contains only one motif element of a particular motif type.element of a particular motif type.

Background:Background:

•The Gibbs Sampler uses a stochastic process. Each time the program is run a different result may be displayed.

•To account for this the Gibbs sampler can be run several times to see which motifs are selected most often.

There are two steps to the Gibbs algorithm used to find the most probable motif pattern:

Initialization and Predictive Uptake step:Initialization and Predictive Uptake step:

Random locations are selected in each Random locations are selected in each sequence but one to create an initial sequence but one to create an initial “null” model. This initial alignment is “null” model. This initial alignment is used to calculate the residue frequencies used to calculate the residue frequencies for both the motif and the “background”for both the motif and the “background”

Sampling step: The background probabilities are given weights and the motif start position is chosen based on these weights.

Basically the sampler randomly selects locations that could be the motif based on the first model. This cycle is continued until a motif model that makes the most sense based onthe calculated probabilities.

• gene or cluster of genes submitted to be searched using FASTA format:

• results are processed and displayed via e-mail:

>PHO5 PHO5 upstream sequence, from -800 to -1, size 800 TTTTACACATCGGACTGATAAGTTACTACTGCACATTGGCATTAGCTAGGAGGGCATCCA AGTAATAATTGCGAGAAACGTGACCCAACTTTGTTGTAGGTCCGCTCCTTCTAATAATCG CTTGTATCTCTACATATGTTCTATTTACTGACCGAAAGTAGCTCGCTACAATAATAATGTTGACCTGATGTCAGTCCCCACGCTAATAGCGGCGTGTCGCACGCTCTCTTTACAGGACG CGGAGACCGGCATTACAAGGATCCGAAAGTTGTATTCAACAAGAATGCGCAAATATGTC ACGTATTTGGAAGTCATCTTATGTGCGCTGCTTTAATGTTTTCTCATGTAAGCGGACGTC GTCTATAAACTTCAAACGAAGGTAAAAGGTTCATAGCGCTTTTTCTTTGTCTGCACAAAG AAATATATATTAAATTAGCACGTTTTCGCATAGAACGCAACTGCACAATGCCAAAAAAAG TAAAAGTGATTAAAAGAGTTAATTGAATAGGCAATCTCTAAATGAATCGATACAACCTTG GCACTCACACGTGGGACTAGCACAGACTAAATTTATGATTCTGGTCCCTGTTTTCGAAGA ATCGCACATGCCAAATTATCAAATTGGTCACCTTACTTGGCAAGGCATATACCCATTTG GGATAAGGGTAAACATCTTTGAATTGTCGAAATGAAACGTATATAAGCGCTGATGTTTTG CTAAGTCGAGGTTAGTATGGCTTCATCTCTCATGAGAATAAGAACAACAACAAATAGAGC AAGCAAATTCGAGATTACCA

Example 1: Saccharomyces cervisiae PHO genes

First is a list of options specified for the run…

Gibbs 2.06.018 Sep 22 2004Data file: /tmp/gibbs2530/data.txtOutput file: /tmp/gibbs2530/outfile.txtBackground Composition Model file: /tmp/gibbs2530/data.txt_info-detCurrent directory: /tmp/gibbs2530The following options are set:Concentrated Region False Sequence type FalseCollapsed Alphabet False Pseudocount weight True Use Expectation/Maximization False Don't Xnu sequence FalseHelp flag False Near optimal cutoff True Number of iterations True Don't fragment FalseDon't use map maximization False Repeat regions FalseOutput file True Informed priors file FalsePlateau periods True palindromic sequence FalseDon't Reverse complement False Number of seeds True Seed Value False Pseudosite weight True Suboptimal sampler output False Overlap FalseAllow width to vary False Wilcoxon signed rank FalseSample along length False Output Scan File FalseOutput prior file False Modular Sampler FalseIgnore Spacing Model False Sample Background FalseBkgnd Comp Model True Init from prior FalseHomologous Seq pairs False Parallel Tempering FalseGroup Sampler False No progress info FalseFragment from middle False Verify Mode FalseAlternate sample on k False No freq. soln. True Calc. def. pseudo wt. False Motif/Recur smpl FalsePhylogenetic Sampling False Supress Near Opt. True site_samp = 1nMotifLen = 10nAlphaLen = 4nNumMotifs = 5dPseudoCntWt = 0.1dPseudoSiteWt = 0.8nMaxIterations = 500lSeedVal = 1101873584nPlateauPeriods = 50nSeeds = 20nNumMotifTypes = 1dCutoff = 0.5RevComplement = 1glOverlapParam = 0Rcutoff factor = 0.001Post Plateau Samples = 0Frag/Shft Per. = 5Frag width = 15

(Advanced options include varying motif (Advanced options include varying motif width, allowing fragmentation, width, allowing fragmentation, sampling reverse complements, etc.sampling reverse complements, etc.

For this example only the minimum amount of data required for sampling was provided- FASTA format and motif width.)

Next is a table indicating the frequency of each residue in each position of the motif throughout the alignment:

Motif model (residue frequency x 100)____________________________________________Pos. # a t c g Info_____________________________ 1 | 100 . . . 1.2

3 | 100 . . . 1.2 4 | . . 100 . 1.9 5 | . . 16 83 1.4 6 | . 100 . . 1.4 7 | 66 33 . . 0.6 8 | . 100 . . 1.4 9 | 100 . . . 1.2 10 | . 100 . . 1.4 11 | 100 . . . 1.2

nonsite 33 28 19 18site 46 33 11 8

From the frequency table and prior calculations a probability model (MAP) is calculated indicating the probability of each residue appearing in each position within the motif: Motif probability model

____________________________________________Pos. # a t c g ____________________________________________ 1 | 0.949 0.022 0.015 0.014

3 | 0.949 0.022 0.015 0.014 4 | 0.026 0.022 0.938 0.014 5 | 0.026 0.022 0.169 0.783 6 | 0.026 0.945 0.015 0.014 7 | 0.641 0.330 0.015 0.014 8 | 0.026 0.945 0.015 0.014 9 | 0.949 0.022 0.015 0.014 10 | 0.026 0.945 0.015 0.014 11 | 0.949 0.022 0.015 0.014

Background probability model 0.326 0.283 0.195 0.181

Next is the alignment of the motifs Next is the alignment of the motifs found:found:

10 columns10 columnsNum Motifs: 6Num Motifs: 6 1, 1 695 aaatg AAACGTATATA agcgc 705 1.00 F PHO5 PHO5 upstream 1, 1 695 aaatg AAACGTATATA agcgc 705 1.00 F PHO5 PHO5 upstream sequence, from -800 to -1, sizesequence, from -800 to -1, size 2, 1 80 tatct AAACGTTTATA ttttt 90 0.96 F PHO8 PHO8 upstream sequence, 2, 1 80 tatct AAACGTTTATA ttttt 90 0.96 F PHO8 PHO8 upstream sequence, from -180 to -1, sizefrom -180 to -1, size 3, 1 672 aagta AAACGTATATA agctc 682 1.00 F PHO11 PHO11 upstream 3, 1 672 aagta AAACGTATATA agctc 682 1.00 F PHO11 PHO11 upstream sequence, from -800 to -1, sisequence, from -800 to -1, si 4, 1 286 gccaa AGACGTATATA ctaag 276 1.00 R PHO81 PHO81 upstream 4, 1 286 gccaa AGACGTATATA ctaag 276 1.00 R PHO81 PHO81 upstream sequence, from -800 to -1, sisequence, from -800 to -1, si 4, 2 429 gctaa AGACCTTTATA atcaa 419 0.00 R PHO81 PHO81 upstream 4, 2 429 gctaa AGACCTTTATA atcaa 419 0.00 R PHO81 PHO81 upstream sequence, from -800 to -1, sisequence, from -800 to -1, si 5, 1 668 ctcgt ATACGTATATA tataa 678 1.00 F PHO84 PHO84 upstream 5, 1 668 ctcgt ATACGTATATA tataa 678 1.00 F PHO84 PHO84 upstream sequence, from -800 to -1, sisequence, from -800 to -1, si * ********* * *********

Finally, the values used and map probability are provided so that the results are reproducible:

Log Motif portion of MAP for motif a = -28.16387Log Fragmentation portion of MAP for motif a = -2.19722

Log Background portion of Map = -4518.82616Log Alignment portion of Map = -44.12383Log Site/seq portion of Map = 0.00000Log Null Map = -4600.15275Log Map = 6.84166

Example 2 uses the amino acid sequence of the enzyme triose phosphate from three different organisms to find motifs…

Gibbs 2.06.018 Sep 22 2004Data file: /tmp/gibbs8904/data.txtOutput file: /tmp/gibbs8904/outfile.txtCurrent directory: /tmp/gibbs8904The following options are set:Concentrated Region False Sequence type FalseCollapsed Alphabet False Pseudocount weight True Use Expectation/Maximization False Don't Xnu sequence FalseHelp flag False Near optimal cutoff True Number of iterations True Don't fragment FalseDon't use map maximization False Repeat regions FalseOutput file True Informed priors file FalsePlateau periods True palindromic sequence FalseDon't Reverse complement False Number of seeds True Seed Value False Pseudosite weight True Suboptimal sampler output False Overlap FalseAllow width to vary False Wilcoxon signed rank FalseSample along length False Output Scan File FalseOutput prior file False Modular Sampler FalseIgnore Spacing Model False Sample Background FalseBkgnd Comp Model False Init from prior FalseHomologous Seq pairs False Parallel Tempering FalseGroup Sampler False No progress info FalseFragment from middle False Verify Mode FalseAlternate sample on k False No freq. soln. True Calc. def. pseudo wt. False Motif/Recur smpl FalsePhylogenetic Sampling False Supress Near Opt. True site_samp = 1nMotifLen = 10nAlphaLen = 20nNumMotifs = 3dPseudoCntWt = 0.1dPseudoSiteWt = 0.8nMaxIterations = 500lSeedVal = 1101948109nPlateauPeriods = 50nSeeds = 20nNumMotifTypes = 1dCutoff = 0.5RevComplement = 0glOverlapParam = 0Rcutoff factor = 0.001Post Plateau Samples = 0Frag/Shft Per. = 5Frag width = 15

Sequences to be Searched:_________________________#1 gi|54042252|sp|P66940|TPIS_MYCTU Triosephosphate isomerase (TIM) (Triose-phosphate isomerase)#2 gi|32130219|sp|Q8CTD5|TPIS_STAEP Triosephosphate isomerase (TIM) (Triose-phosphate isomerase)#3 gi|20140628|sp|Q8ZJK9|TPIS_YERPE Triosephosphate isomerase (TIM) (Triose-phosphate isomerase)Processed Sequence Length: 769 Total sequence length: 769

The parameters provided and the sequences to be searched are listed first.

Again, two matrices are provided, one with the residue frequencies and the resulting probability matrix:

Motif model (residue frequency x 100)____________________________________________Pos. # a v c d e f g h i w k l m n y p q r s t Info_____________________________________________________________________________________________ 1 | . 100 . . . . . . . . . . . . . . . . . . 2.8 2 | . . . . . . . . 100 . . . . . . . . . . . 3.2 3 | 100 . . . . . . . . . . . . . . . . . . . 2.3 4 | . . . . . . . . . . . . . . 100 . . . . . 4.3 5 | . . . . 100 . . . . . . . . . . . . . . . 3.1 6 | . . . . . . . . . . . . . . . 100 . . . . 4.0 7 | . 33 . . . . . . 66 . . . . . . . . . . . 2.3 8 | . . . . . . . . . 100 . . . . . . . . . . 5.9 9 | 100 . . . . . . . . . . . . . . . . . . . 2.3 10 | . . . . . . . . 100 . . . . . . . . . . . 3.2

nonsite 13 8 1 5 6 1 9 2 6 . 6 7 1 3 2 3 3 3 5 4site 20 13 . . 10 . . . 26 10 . . . . 10 10 . . . .

Motif probability model____________________________________________Pos. # a v c d e f g h i w k l m n y p q r s t ____________________________________________ 1 | 0.013 0.917 0.001 0.005 0.006 0.002 0.008 0.002 0.006 0.001 0.006 0.007 0.001 0.003 0.002 0.003 0.003 0.003 0.005 0.004 2 | 0.013 0.008 0.001 0.005 0.006 0.002 0.008 0.002 0.915 0.001 0.006 0.007 0.001 0.003 0.002 0.003 0.003 0.003 0.005 0.004 3 | 0.922 0.008 0.001 0.005 0.006 0.002 0.008 0.002 0.006 0.001 0.006 0.007 0.001 0.003 0.002 0.003 0.003 0.003 0.005 0.004 4 | 0.013 0.008 0.001 0.005 0.006 0.002 0.008 0.002 0.006 0.001 0.006 0.007 0.001 0.003 0.912 0.003 0.003 0.003 0.005 0.004 5 | 0.013 0.008 0.001 0.005 0.915 0.002 0.008 0.002 0.006 0.001 0.006 0.007 0.001 0.003 0.002 0.003 0.003 0.003 0.005 0.004 6 | 0.013 0.008 0.001 0.005 0.006 0.002 0.008 0.002 0.006 0.001 0.006 0.007 0.001 0.003 0.002 0.912 0.003 0.003 0.005 0.004 7 | 0.013 0.311 0.001 0.005 0.006 0.002 0.008 0.002 0.612 0.001 0.006 0.007 0.001 0.003 0.002 0.003 0.003 0.003 0.005 0.004 8 | 0.013 0.008 0.001 0.005 0.006 0.002 0.008 0.002 0.006 0.910 0.006 0.007 0.001 0.003 0.002 0.003 0.003 0.003 0.005 0.004 9 | 0.922 0.008 0.001 0.005 0.006 0.002 0.008 0.002 0.006 0.001 0.006 0.007 0.001 0.003 0.002 0.003 0.003 0.003 0.005 0.004 10 | 0.013 0.008 0.001 0.005 0.006 0.002 0.008 0.002 0.915 0.001 0.006 0.007 0.001 0.003 0.002 0.003 0.003 0.003 0.005 0.004

Background probability model 0.130 0.083 0.012 0.059 0.065 0.020 0.092 0.027 0.056 0.004 0.064 0.079 0.014 0.038 0.023 0.030 0.038 0.031 0.053 0.043

Probability Matrix:

10 columnsNum Motifs: 3 1, 1 168 qigsv VIAYEPVWAI gtgrv 177 1.00 F gi|54042252|sp|P66940|TPIS_MYCTU Triosephosphate i 2, 1 165 qlkev VIAYEPIWAI gtgks 174 1.00 F gi|32130219|sp|Q8CTD5|TPIS_STAEP Triosephosphate i 3, 1 163 afega VIAYEPIWAI gtgks 172 1.00 F gi|20140628|sp|Q8ZJK9|TPIS_YERPE Triosephosphate i **********

Column 1 : Sequence Number, Site NumberColumn 2 : Left End LocationColumn 4 : Motif ElementColumn 5 : Right End LocationColumn 6 : Probability of ElementColumn 7 : Forward Motif (F) or Reverse Complement (R) Column 8 : Sequence Description from Fast A input

Motif Alignment:

Log Motif portion of MAP for motif a = -37.08646Log Fragmentation portion of MAP for motif a = -0.00000

Log Background portion of Map = -2165.36239Log Alignment portion of Map = -19.63910Log Site/seq portion of Map = 0.00000Log Null Map = -2254.63689Log Map = 32.54894

log MAP = sum of motif and fragmentation parts of MAP + background + alignment + sites/seq - Null

Frequency Map = 32.548944Nearopt Map = 32.548944Maximal Map = 32.548944Elapsed time: 2.650000 secs

Values Input:

•Things to watch out for: sequence not in FASTA format, motif length too long, motif length too short, results show repeats in low complexity regions

• For example, by changing the motif width parameter from 10 to 20, the results look significantly different:

Num Motifs: 3 1, 1 76 lsphd SGAYTGDVSGAFLAKLGCSYVVVGHSERR tyhne 104 1.00 F gi|54042252|sp|P66940|TPIS_MYCTU Triosephosphate i 2, 1 73 ayfee SGAYTGETSPVALSELGVKYVVIGHSERR dyfhe 101 1.00 F gi|32130219|sp|Q8CTD5|TPIS_STAEP Triosephosphate i 3, 1 71 vdvnl SGAFTGETSAEMLKDIGAQYIIIGHSERR tyhqe 99 1.00 F gi|20140628|sp|Q8ZJK9|TPIS_YERPE Triosephosphate i

•The Gibbs method is that it can predict biological motifs

in what was formerly called “junk DNA”.

•Only about 3% of the genome codes

for genes.

•Using these statistical methods

regulatory signals within the genomic junk can be

identified (potentially leading to treatment of genetic conditions

ranging from birth defects to cancer.)

Conclusion:

ReferencesReferences

http://http://bayesweb.wadswbayesweb.wadsworth.org/gibbs/giorth.org/gibbs/gibbs.htmlbbs.html

http://kbrin.a-bldhttp://kbrin.a-bldg.louisville.edu/~g.louisville.edu/~rouchka/CECS694rouchka/CECS694/Lecture4.ppt/Lecture4.ppt

http://rsat.ulb.ac.http://rsat.ulb.ac.be/rsat/tutorials/tbe/rsat/tutorials/tut_gibbs.htmlut_gibbs.html

Documents

The Gibbs Motif Sampler By: Andrea Smith Introduction to Bioinformatics December 9, 2004