(an example of) Computing the Microbial World Rob Beiko June 25, 2014

DESCRIPTION

Presentation at HPCS 2014, Halifax


Page 1: Beiko hpcs

(an example of)

Computing the Microbial World

Rob Beiko

June 25, 2014

Page 2: Beiko hpcs
Page 3: Beiko hpcs

Siddique et al. (2014) Front Microbiol

Page 4: Beiko hpcs

Lawley et al., PLoS Genet (2012)

Page 5: Beiko hpcs
Page 6: Beiko hpcs

The Breakfast Organisms

"Bacon Fields", Author: Michael DeForge

Page 7: Beiko hpcs

240M “pieces”, each 150 nucleotides long

3.6 x 10^10 nucleotides

~40 GB

Hundreds of “species”

Genomes between 1.5M – 6M nucleotides
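The numbers on this slide can be checked with a few lines of arithmetic. The sketch below assumes one byte per nucleotide, which is an assumption of this example and not something the slide states; real sequencing files also carry quality scores and read headers, so the "~40 GB" figure should be read as an order of magnitude.

```python
# Back-of-the-envelope check of the data volume quoted on the slide.
reads = 240_000_000          # 240M "pieces"
read_length = 150            # nucleotides per piece

total_nt = reads * read_length
print(f"total nucleotides: {total_nt:.1e}")

# Assumption (not from the slide): one byte per nucleotide, no quality
# scores or headers.
bytes_per_nt = 1
size_gb = total_nt * bytes_per_nt / 1e9
print(f"raw sequence size: {size_gb:.0f} GB")

# How many genome-lengths of sequence this is, for the genome sizes quoted.
for genome_size in (1_500_000, 6_000_000):
    print(f"{genome_size:>9} nt genome: {total_nt / genome_size:,.0f} genome-lengths")
```

Because the sample contains hundreds of species at unknown abundances, the genome-length figures are an upper bound on coverage for any single organism.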

Page 8: Beiko hpcs

150 nt x 150 nt

We know this And this

But not this

Page 9: Beiko hpcs

who is doing what?

Marker genes WHO

Environmental “Shotgun” WHAT

The challenge of METAGENOME CLASSIFICATION

Page 10: Beiko hpcs

Clues – Sequence similarity (homology)

150 nt x 150 nt

Reference genes

Take the WHOLE SEQUENCE

Best

Worst

Page 11: Beiko hpcs

Clues – composition

150 nt x 150 nt

Reference genome

k-mer profiles

Genome #1: 20% G & C, 30% A & T

Genome #2: 24% G & C, 26% A & T

Best

Worst

Take a K-MER FREQUENCY DECOMPOSITION
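A k-mer frequency decomposition can be sketched in a few lines. The function name `kmer_profile` is hypothetical and chosen for this example; with k = 1 the profile reduces to the base composition (the G & C versus A & T percentages above), and larger k gives the richer profiles used by compositional classifiers.

```python
from collections import Counter

def kmer_profile(seq, k):
    """Relative frequency of every overlapping k-mer in seq."""
    seq = seq.upper()
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    total = sum(counts.values())
    if total == 0:               # sequence shorter than k
        return {}
    return {kmer: n / total for kmer, n in counts.items()}

# Toy fragment reused from the homology slide.
fragment = "GGCTGGACCA"
bases = kmer_profile(fragment, 1)
gc_content = bases.get("G", 0.0) + bases.get("C", 0.0)
dimers = kmer_profile(fragment, 2)

print(f"GC content: {gc_content:.2f}")
print(sorted(dimers.items(), key=lambda item: -item[1])[:3])
```

A 150 nt read contains only 150 - k + 1 overlapping k-mers, so profiles from single reads are noisy; that is one reason composition alone is a weaker clue than homology.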

Page 12: Beiko hpcs

Homology >> Composition

* GGCTGGACCA

1 GACTGGACCA

2 GGCCGGACTA

But homology evidence can mislead or be absent

Homology + Composition > Homology alone
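The toy comparison on this slide can be reproduced by counting mismatches between the query (marked *) and the two reference sequences. This is a minimal ungapped sketch; real homology searches use alignment with gaps and scoring matrices, and the helper name `mismatches` is an assumption of this example.

```python
def mismatches(a, b):
    """Number of positions at which two equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must have the same length")
    return sum(x != y for x, y in zip(a, b))

query = "GGCTGGACCA"
references = {"1": "GACTGGACCA", "2": "GGCCGGACTA"}

scores = {name: mismatches(query, seq) for name, seq in references.items()}
best = min(scores, key=scores.get)

print(scores)
print("closest reference:", best)
```

Reference 1 differs from the query at one position and reference 2 at two, so a pure homology criterion picks reference 1. When the true source genome is missing from the reference set, such a closest hit can still be wrong, which is where compositional evidence helps.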

Page 13: Beiko hpcs

GGCTGGACCA

GCCTGGTCCAGCCAGGTGCAGCCTGTCCANNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN

Query:

Subject:

Exact string search? NO

BLAST? OK, but SLOW!

Page 14: Beiko hpcs

A compromise: UBLAST

• BLAST seeks out very similar “anchor points” between a pair of sequences before doing a more thorough search

• Typically, a query is compared against all candidate DB sequences, but most will return no hits

UBLAST:

(1) Query, DB sequences

Query: GGCTGGACCA

DB sequences (each followed by a run of Ns): GCCTGTCCA, GCCAGGTGCA, GCCTGGTCCA

(2) k-mer table

(3) Rank DB based on k-mer matching

DB after ranking: GCCTGGTCCA, GCCAGGTGCA, GCCTGTCCA

(4) Do detailed search until there is no more point (X)
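The four steps can be illustrated with a small sketch: build k-mer sets, rank the database by shared k-mers, then run a more expensive comparison down the ranked list and stop after too many failures. This is not the USEARCH/UBLAST implementation. The function names, the `min_similarity` and `max_rejects` parameters, and the use of `difflib.SequenceMatcher` as a stand-in for a detailed alignment are all assumptions of this example.

```python
from difflib import SequenceMatcher

def kmers(seq, k):
    """Set of overlapping k-mers in seq (the 'k-mer table' for one sequence)."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

def rank_by_shared_kmers(query, database, k=3):
    """Database names ordered by number of k-mers shared with the query."""
    query_kmers = kmers(query, k)
    return sorted(
        database,
        key=lambda name: len(query_kmers & kmers(database[name], k)),
        reverse=True,
    )

def search(query, database, k=3, min_similarity=0.7, max_rejects=1):
    """Detailed comparison in ranked order, stopping after max_rejects failures."""
    hits, rejects = [], 0
    for name in rank_by_shared_kmers(query, database, k):
        # Stand-in for a real alignment score.
        similarity = SequenceMatcher(None, query, database[name]).ratio()
        if similarity >= min_similarity:
            hits.append((name, similarity))
        else:
            rejects += 1
            if rejects >= max_rejects:
                break   # "no more point" in looking further down the list
    return hits

# Toy database from the slide, without the N padding.
database = {"a": "GCCTGTCCA", "b": "GCCAGGTGCA", "c": "GCCTGGTCCA"}
ranking = rank_by_shared_kmers("GGCTGGACCA", database)
hits = search("GGCTGGACCA", database)
print(ranking)
print(hits)
```

With k = 3 the top-ranked candidate is GCCTGGTCCA, as on the slide; the order of the other two depends on the choice of k. The saving comes from never running the detailed comparison on sequences far down the ranked list.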

Page 15: Beiko hpcs

Compositional models

• Interpolated Markov models: adaptively generate frequency models based on extending k-mers with sufficiently high frequencies

• One model per genome

• Evaluate probability of each k-mer in query sequence, given shorter k-mers in sequence

• Model construction can take a while

k = 4 k = 5 k = 6 k = 7

PhymmBL: Brady and Salzberg (2009) Nat Methods
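The idea of interpolating across context lengths can be sketched as follows. This is a deliberately simplified scheme, not PhymmBL's: the weight given to each context length is just count / (count + `trust`), where `trust` is a hypothetical smoothing constant introduced for this example, and the function names are likewise assumptions. Sequences are assumed to contain only A, C, G and T.

```python
import math
from collections import Counter, defaultdict

def train_contexts(genome, max_order):
    """For every context of length 0..max_order, count which base follows it."""
    counts = defaultdict(Counter)
    for order in range(max_order + 1):
        for i in range(order, len(genome)):
            counts[genome[i - order:i]][genome[i]] += 1
    return counts

def base_probability(counts, context, base, trust=10.0):
    """Blend estimates from short to long contexts; frequent contexts weigh more."""
    p = 0.25                                  # uniform starting point
    for order in range(len(context) + 1):
        seen = counts.get(context[len(context) - order:])
        if not seen:                          # context never observed: stop extending
            break
        total = sum(seen.values())
        weight = total / (total + trust)
        p = weight * seen[base] / total + (1 - weight) * p
    return p

def log_likelihood(counts, read, max_order):
    """Log-probability of a read under one genome's model."""
    return sum(
        math.log(base_probability(counts, read[max(0, i - max_order):i], read[i]))
        for i in range(len(read))
    )

# One model per genome, as on the slide (toy genomes, hypothetical).
models = {
    "genome_1": train_contexts("ACGT" * 50, max_order=3),
    "genome_2": train_contexts("AATT" * 50, max_order=3),
}
read = "ACGTACGT"
scores = {name: log_likelihood(m, read, 3) for name, m in models.items()}
print(max(scores, key=scores.get))
```

Training has to visit every position of every genome for every context length, which gives a feel for why model construction can take a while on real reference collections.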

Page 16: Beiko hpcs

An alternative: Naïve Bayes

• Just compute the frequency of each k-mer for a fixed length k

• Build one frequency model for each genome

• FAST

• Assumes conditional independence – may not matter

Probability of a query Fragment originating from genome Gi

For all k-mers in the fragment…

The frequency of that k-mer in Gi

Parks et al. (2011) BMC Bioinformatics
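Read together, the three annotations describe a product over k-mers: the probability of fragment F given genome Gi is the product, over all k-mers w in F, of the frequency of w in Gi. A minimal sketch follows, computed in log space to avoid underflow. The pseudocount, the uniform prior over genomes, and all function names are assumptions of this example and not details taken from Parks et al.

```python
import math
from collections import Counter

def train_model(genome, k, pseudocount=1.0):
    """Log-frequency of each k-mer in one genome, with add-one style smoothing."""
    counts = Counter(genome[i:i + k] for i in range(len(genome) - k + 1))
    denom = sum(counts.values()) + pseudocount * 4 ** k
    table = {w: math.log((n + pseudocount) / denom) for w, n in counts.items()}
    unseen = math.log(pseudocount / denom)    # score for k-mers absent from the genome
    return table, unseen

def log_score(fragment, model, k):
    """log P(fragment | genome): sum of log-frequencies of the fragment's k-mers."""
    table, unseen = model
    return sum(
        table.get(fragment[i:i + k], unseen)
        for i in range(len(fragment) - k + 1)
    )

def classify(fragment, models, k):
    """Genome with the highest score (uniform prior over genomes assumed)."""
    return max(models, key=lambda name: log_score(fragment, models[name], k))

# Toy reference genomes (hypothetical).
k = 3
models = {
    "gc_rich": train_model("GGCCGCGGCC" * 20, k),
    "at_rich": train_model("AATTATAATT" * 20, k),
}
print(classify("GCGGCC", models, k))
print(classify("TTATAA", models, k))
```

Overlapping k-mers in a fragment are clearly not independent, but treating them as if they were keeps training to a single counting pass per genome, which is where the speed comes from.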

Page 17: Beiko hpcs

RITA: Rapid Identification of Taxonomic Assignments

UBLAST filter

MacDonald et al. (2012) Nucleic Acids Res

Page 18: Beiko hpcs

Evaluation set

• “Fake metagenome”: take sequences from known genomes, randomly sample fragments of 50, 100, 200 and 1000 nt in different trials

• Build reference models from other genomes – can leave close relatives out of reference model

• Leave out other strains within the same species – not so hard

• Leave out other classes in the same phylum – HARD
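The evaluation design can be sketched as two helpers: one that samples random fragments of a given length from a known genome, and one that removes close relatives from the reference set. The taxonomy dictionaries, the function names, and the reading of "leave out" as "drop every reference sharing the query's taxon at the chosen rank" are assumptions of this example.

```python
import random

def sample_fragments(genome, length, n, rng):
    """Draw n random fragments of the given length from a genome string."""
    if length > len(genome):
        raise ValueError("fragment longer than genome")
    starts = [rng.randrange(len(genome) - length + 1) for _ in range(n)]
    return [genome[s:s + length] for s in starts]

def leave_out(taxonomy, query_name, rank):
    """Reference names that do NOT share the query's taxon at the given rank."""
    excluded = taxonomy[query_name][rank]
    return [
        name for name, lineage in taxonomy.items()
        if name != query_name and lineage[rank] != excluded
    ]

rng = random.Random(0)                        # fixed seed for repeatability
genome = "".join(rng.choice("ACGT") for _ in range(5000))   # toy "known genome"

# Fragment lengths from the slide, one trial per length.
trials = {length: sample_fragments(genome, length, 5, rng)
          for length in (50, 100, 200, 1000)}

# Hypothetical lineages for three reference genomes.
taxonomy = {
    "strain_A1": {"species": "sp_A", "class": "cl_X", "phylum": "ph_1"},
    "strain_A2": {"species": "sp_A", "class": "cl_X", "phylum": "ph_1"},
    "strain_B1": {"species": "sp_B", "class": "cl_Y", "phylum": "ph_1"},
}
easy_refs = leave_out(taxonomy, "strain_A1", "species")
print({length: len(frags) for length, frags in trials.items()})
print(easy_refs)
```

Raising the rank at which relatives are removed makes the task harder, because the remaining references are more distantly related to the source of each fragment.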

Page 19: Beiko hpcs
Page 20: Beiko hpcs

But does it work?

[Figure: Full RITA compared with Best class (homology and composition agree); x-axis: DNA sequence length. Panels: Predicting genus from different species; Predicting phylum from different class.]

Page 21: Beiko hpcs

Conclusions

• Careful attention needs to be paid to the choice of approach – simple is better

• RITA illustrates two key points in (microbial) bioinformatics:

1. Homology: How heuristic are you willing to go?

2. Naïve Bayes: Keep it simple until told otherwise

• Technological change means that many bioinformatics algorithms will be irrelevant in 5 years

Page 22: Beiko hpcs

FIN