Presentation at HPCS 2014, Halifax
(an example of)
Computing the Microbial World
Rob Beiko
June 25, 2014
Siddique et al. (2014) Front Microbiol
Lawley et al., PLoS Genet (2012)
The Breakfast Organisms
"Bacon Fields" Author: Michael DeForge
240M “pieces”, each 150 nucleotides long
3.6 x 10^10 nucleotides
~40 GB
Hundreds of “species”
Genomes between 1.5M – 6M nucleotides
150 nt x 150 nt
We know this And this
But not this
who is doing what?
Marker genes WHO
Environmental “Shotgun” WHAT
The challenge of METAGENOME CLASSIFICATION
Clues – Sequence similarity (homology)
150 nt x 150 nt
Reference genes
Take the WHOLE SEQUENCE
Best
Worst
Clues – composition
150 nt x 150 nt
Reference genome
k-mer profiles
Genome #1: 20% G & C, 30% A & T
Genome #2: 24% G & C, 26% A & T
Best
Worst
Take a K-MER FREQUENCY DECOMPOSITION
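As a rough illustration of what a k-mer frequency decomposition involves, the sketch below counts overlapping k-mers in a sequence and converts the counts to relative frequencies. The function names `kmer_profile` and `gc_content` are hypothetical, and skipping windows that contain `N` is an assumption made for this example rather than something stated on the slides.

```python
from collections import Counter

def kmer_profile(seq, k):
    """Relative frequency of each k-mer over all overlapping windows of seq."""
    seq = seq.upper()
    counts = Counter(
        seq[i:i + k]
        for i in range(len(seq) - k + 1)
        if "N" not in seq[i:i + k]  # assumption: ignore windows with unknown bases
    )
    total = sum(counts.values())
    return {kmer: n / total for kmer, n in counts.items()} if total else {}

def gc_content(seq):
    """Fraction of G and C among the unambiguous bases (the k = 1 summary)."""
    seq = seq.upper()
    acgt = sum(seq.count(base) for base in "ACGT")
    return (seq.count("G") + seq.count("C")) / acgt if acgt else 0.0

# Toy input: the 10-nt query sequence used elsewhere in the slides.
profile = kmer_profile("GGCTGGACCA", 2)
```

A 150-nt read yields only 150 - k + 1 windows, so profiles for larger k become sparse quickly; that is one reason composition alone gives a weaker signal than homology on short reads.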
Homology >> Composition
* GGCTGGACCA
1 GACTGGACCA
2 GGCCGGACTA
But homology evidence can mislead or be absent
Homology + Composition > Homology alone
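The three 10-nt sequences on this slide can be compared position by position. The sketch below counts mismatches between the starred query and the two other sequences and ranks them. It is a toy stand-in for homology scoring, since real searches use alignments with gaps and scoring matrices, and the name `mismatches` is hypothetical.

```python
def mismatches(a, b):
    """Number of positions at which two equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must have equal length")
    return sum(x != y for x, y in zip(a, b))

query = "GGCTGGACCA"
candidates = ["GGCCGGACTA", "GACTGGACCA"]

# Fewest mismatches first: the closest candidate is the best homology hit.
ranked = sorted(candidates, key=lambda seq: mismatches(query, seq))
```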
Query: GGCTGGACCA
Subject: GCCTGGTCCA, GCCAGGTGCA, GCCTGTCCA, each followed by a long run of N
Exact string search? NO
BLAST? OK, but SLOW!
A compromise: UBLAST
• BLAST seeks out very similar “anchor points” between a pair of sequences before doing a more thorough search
• Typically, a query is compared against all candidate DB sequences, but most will return no hits
UBLAST:
(1) Query, DB sequences
Query: GGCTGGACCA
DB: GCCTGTCCANNNN…, GCCAGGTGCANNNN…, GCCTGGTCCANNNN…
(2) k-mer table
(3) Rank DB based on k-mer matching
(4) Do detailed search until there is no more point in continuing
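The four steps can be sketched in a few lines. This is a simplified illustration under stated assumptions, not the UBLAST implementation: the "detailed search" here is an ungapped sliding comparison rather than a real alignment, and the parameters `min_identity`, `max_accepts` and `max_rejects`, along with all function names, are hypothetical choices for this example.

```python
from collections import defaultdict

def kmers(seq, k):
    """Set of distinct k-mers in seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

def build_kmer_table(db, k):
    """Step 2: map each k-mer to the names of DB sequences containing it."""
    table = defaultdict(set)
    for name, seq in db.items():
        for word in kmers(seq, k):
            table[word].add(name)
    return table

def rank_by_shared_kmers(query, table, k):
    """Step 3: order DB sequences by how many query k-mers they share."""
    shared = defaultdict(int)
    for word in kmers(query, k):
        for name in table.get(word, ()):
            shared[name] += 1
    return sorted(shared, key=lambda name: (-shared[name], name))

def best_ungapped_identity(query, target):
    """Toy 'detailed search': best fraction of matches over all offsets."""
    best = 0.0
    for offset in range(len(target) - len(query) + 1):
        window = target[offset:offset + len(query)]
        same = sum(a == b for a, b in zip(query, window))
        best = max(best, same / len(query))
    return best

def search(query, db, k=3, min_identity=0.8, max_accepts=1, max_rejects=2):
    """Step 4: examine candidates in rank order and stop early."""
    table = build_kmer_table(db, k)
    hits, rejects = [], 0
    for name in rank_by_shared_kmers(query, table, k):
        identity = best_ungapped_identity(query, db[name])
        if identity >= min_identity:
            hits.append((name, identity))
            if len(hits) >= max_accepts:
                break  # enough good hits found
        else:
            rejects += 1
            if rejects >= max_rejects:
                break  # lower-ranked candidates are unlikely to do better
    return hits

# Step 1: toy query and DB sequences modelled on the slide.
query = "GGCTGGACCA"
db = {
    "a": "GCCTGTCCA" + "N" * 20,
    "b": "GCCAGGTGCA" + "N" * 20,
    "c": "GCCTGGTCCA" + "N" * 20,
}
hits = search(query, db)
```

The saving comes from the ranking: DB sequences that share no k-mers with the query are never examined in detail, and the search stops once the remaining candidates look unpromising.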
Compositional models
• Interpolated Markov models: adaptively generate frequency models based on extending k-mers with sufficiently high frequencies
• One model per genome
• Evaluate probability of each k-mer in query sequence, given shorter k-mers in sequence
• Model construction can take a while
k = 4 k = 5 k = 6 k = 7
PhymmBL: Brady and Salzberg (2009) Nat Methods
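To make the idea of adaptively extending k-mers concrete, here is a deliberately simplified variable-order Markov model. It substitutes plain back-off for true interpolation: a full interpolated Markov model blends predictions from several context lengths with learned weights, whereas this sketch just uses the longest context seen at least `min_count` times. The class name, `max_order`, `min_count` and the add-one smoothing are all assumptions for illustration, not PhymmBL's method.

```python
import math
from collections import defaultdict

ALPHABET = "ACGT"

class BackoffMarkovModel:
    """One model per genome: P(base | preceding context), variable context length."""

    def __init__(self, max_order=3, min_count=5):
        self.max_order = max_order
        self.min_count = min_count
        # counts[context][base]; the empty context "" holds plain base counts.
        self.counts = defaultdict(lambda: defaultdict(int))

    def train(self, genome):
        for i, base in enumerate(genome):
            if base not in ALPHABET:
                continue
            for order in range(min(i, self.max_order) + 1):
                context = genome[i - order:i]
                if all(c in ALPHABET for c in context):
                    self.counts[context][base] += 1

    def base_probability(self, context, base):
        # Try the longest context first; back off while it is too rare.
        for order in range(min(len(context), self.max_order), -1, -1):
            ctx = context[len(context) - order:]
            seen = self.counts.get(ctx)
            total = sum(seen.values()) if seen else 0
            if total >= self.min_count or order == 0:
                count = seen.get(base, 0) if seen else 0
                return (count + 1) / (total + len(ALPHABET))  # add-one smoothing

    def log_likelihood(self, fragment):
        return sum(
            math.log(self.base_probability(fragment[max(0, i - self.max_order):i], base))
            for i, base in enumerate(fragment)
            if base in ALPHABET
        )

# Toy genomes: a fragment should score higher under the model it resembles.
model_a = BackoffMarkovModel()
model_a.train("AT" * 200)
model_b = BackoffMarkovModel()
model_b.train("GC" * 200)
fragment = "ATATATAT"
```

Even in this reduced form, training touches every position of every genome once per context length, which hints at why model construction for hundreds of genomes can take a while.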
An alternative: Naïve Bayes
• Just compute the frequency of each k-mer for a fixed length k
• Build one frequency model for each genome
• FAST
• Assumes conditional independence – may not matter
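$P(F \mid G_i) = \prod_{w \in F} P(w \mid G_i)$
• $P(F \mid G_i)$: probability of a query fragment $F$ originating from genome $G_i$
• $\prod_{w \in F}$: for all k-mers $w$ in the fragment…
• $P(w \mid G_i)$: the frequency of that k-mer in $G_i$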
Parks et al. (2011) BMC Bioinformatics
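A minimal sketch of a k-mer naïve Bayes classifier follows, assuming the score of a fragment under a genome is the product of that genome's k-mer frequencies over every k-mer in the fragment. Working in log space and adding a pseudocount are choices made for this example (to avoid underflow and zero frequencies); the function names and toy genomes are hypothetical and this is not the published implementation.

```python
import math
from collections import Counter

def kmer_counts(seq, k):
    """Counts of all overlapping k-mers in seq."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))

def train(genomes, k):
    """Build one frequency model per genome: name -> (k-mer counts, total)."""
    models = {}
    for name, seq in genomes.items():
        counts = kmer_counts(seq, k)
        models[name] = (counts, sum(counts.values()))
    return models

def log_score(fragment, model, k, pseudocount=1.0):
    """Sum of log k-mer frequencies over every k-mer in the fragment."""
    counts, total = model
    vocab = 4 ** k  # number of possible k-mers over A, C, G, T
    return sum(
        n * math.log((counts[word] + pseudocount) / (total + pseudocount * vocab))
        for word, n in kmer_counts(fragment, k).items()
    )

def classify(fragment, models, k):
    """Assign the fragment to the genome with the highest score."""
    return max(models, key=lambda name: log_score(fragment, models[name], k))

# Toy reference genomes with very different composition.
genomes = {"AT_rich": "ATTATAATTA" * 50, "GC_rich": "GCCGGCGGCC" * 50}
models = train(genomes, k=3)
```

Training is a single counting pass per genome and scoring is one table lookup per k-mer, which is where the speed comes from. Overlapping k-mers are clearly not independent, so the scores are not calibrated probabilities, but only their ranking across genomes is used.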
RITA: Rapid Identification of Taxonomic Assignments
UBLAST filter
MacDonald et al. (2012) Nucleic Acids Res
Evaluation set
• “Fake metagenome”: take sequences from known genomes, randomly sample fragments of 50, 100, 200 and 1000 nt in different trials
• Build reference models from other genomes – can leave close relatives out of reference model
  • Leave out other strains within the same species – not so hard
  • Leave out other classes in the same phylum – HARD
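The evaluation design can be sketched as two small helpers: one samples fixed-length fragments from known genomes, the other builds a reference set that excludes every genome sharing the query's taxon at a chosen rank. Picking the source genome uniformly at random, the `taxonomy` dictionary layout and all names here are assumptions for illustration; the sampling scheme actually used is not described on the slides.

```python
import random

def sample_fragments(genomes, length, n, seed=0):
    """Draw n (source_name, fragment) pairs of the given length."""
    rng = random.Random(seed)
    eligible = [name for name, seq in genomes.items() if len(seq) >= length]
    fragments = []
    for _ in range(n):
        name = rng.choice(eligible)  # assumption: uniform choice of genome
        start = rng.randrange(len(genomes[name]) - length + 1)
        fragments.append((name, genomes[name][start:start + length]))
    return fragments

def leave_out(genomes, taxonomy, query_name, rank):
    """Reference genomes that do not share the query's taxon at `rank`."""
    held_out = taxonomy[query_name][rank]
    return {
        name: seq
        for name, seq in genomes.items()
        if taxonomy[name][rank] != held_out
    }

# Hypothetical toy data: three genomes with made-up taxonomic labels.
genomes = {"g1": "ACGT" * 100, "g2": "GGCC" * 100, "g3": "ATAT" * 100}
taxonomy = {
    "g1": {"species": "s1", "class": "c1"},
    "g2": {"species": "s2", "class": "c1"},
    "g3": {"species": "s3", "class": "c2"},
}
fragments = sample_fragments(genomes, length=50, n=20)
reference = leave_out(genomes, taxonomy, "g1", "class")
```

Raising the excluded rank from species to class removes progressively more close relatives from the reference set, which is what makes the second scenario on the slide hard.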
But does it work?
[Figure: two panels, “Predicting genus from different species” and “Predicting phylum from different class”. x-axis: DNA sequence length. Labels include “Full RITA” and “Best class (homology and composition agree)”.]
Conclusions
• Careful attention needs to be paid to the choice of approach – simple is better
• RITA illustrates two key points in (microbial) bioinformatics:
1. Homology: How heuristic are you willing to go?
2. Naïve Bayes: Keep it simple until told otherwise
• Technological change means that many bioinformatics algorithms will be irrelevant in 5 years
FIN