CompostBin : A DNA composition based metagenomic binning algorithm

Sourav Chatterji*, Ichitaro Yamazaki, Zhaojun Bai and Jonathan Eisen
UC Davis

Overview of Talk
Metagenomics and the binning problem.
CompostBin

The Microbial World

Exploring the Microbial World
Culturing
  Majority of microbes currently unculturable.
  No ecological context.
Molecular Surveys (e.g. 16S rRNA)
  Who is out there?
  What are they doing?

Metagenomics

Interpreting Metagenomic Data
Nature of Metagenomic Data
  Mosaic
  Intraspecies polymorphism
  Fragmentary
New Sequencing Technologies
  Enormous amount of data
  Short Reads

Metagenomic Binning
Classification of sequences by taxa

Why Bin at all?

Binning in Action
Glassy Winged Sharpshooter (Homalodisca coagulata).
Feeds on plant xylem (poor in organic nutrients).
Microbial Endosymbionts

Current Binning Methods
Assembly
Align with Reference Genome
Database Search [MEGAN, BLAST]
Phylogenetic Analysis
DNA Composition [TETRA, PhyloPythia]

Current Binning Methods
Need closely related reference genomes.
Poor performance on short fragments.
  Sanger sequence reads are 500-1000 bp long.
  Current assembly methods are unreliable.
Complex communities are hard to bin.

Overview of Talk
Metagenomics and the binning problem.
CompostBin

Genome Signatures
Does genomic sequence from an organism have a unique signature that distinguishes it from genomic sequence of other organisms?
  Yes [Karlin et al. 1990s]
What is the minimum length sequence that is required to distinguish genomic sequence of one organism from the genomic sequence of another organism?

Imperfect World
Horizontal Gene Transfer
Recent Estimates [Ge et al. 2005]
  Varies between 0-6% of genes.
  Typically ~2%.
But Amelioration

DNA-composition metrics
The K-mer Frequency Metric
CompostBin uses hexamers
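A minimal sketch of the K-mer frequency metric for a single read. The function name `kmer_frequencies`, the choice to count on the given strand only, and the skipping of windows containing ambiguous bases are assumptions made for illustration; the slides only state that hexamers (K = 6) are used.

```python
from itertools import product


def kmer_frequencies(seq, k=6):
    """Return the normalized k-mer frequency vector (length 4**k) of a DNA sequence."""
    # Fixed lexicographic ordering of all k-mers over the alphabet A, C, G, T.
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    index = {kmer: i for i, kmer in enumerate(kmers)}
    counts = [0] * len(kmers)
    seq = seq.upper()
    for i in range(len(seq) - k + 1):
        # Assumption: windows containing N or other symbols are skipped.
        j = index.get(seq[i:i + k])
        if j is not None:
            counts[j] += 1
    total = sum(counts)
    if total == 0:
        return [0.0] * len(kmers)
    return [c / total for c in counts]


# Toy usage with dinucleotides to keep the vector small (16 entries).
freqs = kmer_frequencies("ACGTACGTACGT", k=2)
```

Each read then becomes one point in a 4^K-dimensional space, which is the input to the projection step.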

DNA-composition metrics
Working with K-mers for Binning
Curse of Dimensionality: O(4^K) independent dimensions.
Statistical noise increases with decreasing fragment lengths.
Project data into a lower dimensional space to decrease noise.
  Principal Component Analysis.
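A rough worked example of why short fragments are noisy. The read length of 1000 bp is taken from the upper end of the Sanger range quoted earlier; the symbols L (read length) and K (K-mer length) are introduced here for illustration only.

```latex
\[
4^{K}\Big|_{K=6} = 4096 \ \text{possible hexamers},
\qquad
L - K + 1 = 1000 - 6 + 1 = 995 \ \text{hexamer windows in a read of length } L = 1000 .
\]
```

So even a long Sanger read can populate at most 995 of the 4096 hexamer dimensions, and most entries of its frequency vector are zero.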

PCA separates species
Gluconobacter oxydans [61% GC] and Rhodospirillum rubrum [65% GC]

Effect of Skewed Relative Abundance
B. anthracis and L. monocytogenes
Abundance 1:1
Abundance 20:1

A Weighting Scheme
For each read, find overlap with other sequences.
Calculate the redundancy of each position.
Weight is inverse of average redundancy.
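The weighting steps can be sketched as follows. This is only an illustration under stated assumptions: the overlaps are taken as already computed (for example by pairwise alignment) and given as half-open intervals on the read, and the read is counted as covering itself once so that redundancy is never zero. The name `read_weight` is hypothetical.

```python
def read_weight(read_length, overlaps):
    """Weight of a read = 1 / (average per-position redundancy).

    overlaps: list of half-open (start, end) intervals on this read that
    are covered by other reads (assumed to be computed elsewhere).
    """
    # Assumption: every position is covered once by the read itself.
    redundancy = [1] * read_length
    for start, end in overlaps:
        for pos in range(max(start, 0), min(end, read_length)):
            redundancy[pos] += 1
    average = sum(redundancy) / read_length
    return 1.0 / average


# A read with no overlaps keeps full weight; a fully covered read is halved.
w_unique = read_weight(10, [])
w_covered = read_weight(10, [(0, 10)])
w_partial = read_weight(10, [(0, 5)])
```

Reads from an abundant species overlap many other reads, so their average redundancy is high and their weight is low.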

Weighted PCA
Calculate the weighted mean $\mu_w$:
\[ \mu_w = \frac{1}{N} \sum_{i=1}^{N} w_i X_i \]
Calculate the weighted covariance matrix $M_w$:
\[ M_w = \sum_{i=1}^{N} w_i (X_i - \mu_w)(X_i - \mu_w)^T \]
Principal Components (PCs) are eigenvectors of $M_w$.
Use first three PCs for further analysis.
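A minimal NumPy sketch of the weighted PCA step: weighted mean, weighted covariance matrix, and projection onto the leading eigenvectors. The function name `weighted_pca` is hypothetical, and rescaling the weights so that they sum to N (which makes the division by N a proper weighted average) is an assumption not stated on the slides.

```python
import numpy as np


def weighted_pca(X, w, n_components=3):
    """Project the rows of X onto the top eigenvectors of the weighted covariance."""
    X = np.asarray(X, dtype=float)
    w = np.asarray(w, dtype=float)
    N = X.shape[0]
    # Assumption: rescale weights to sum to N.
    w = w * N / w.sum()
    # Weighted mean: mu_w = (1/N) * sum_i w_i X_i
    mu_w = (w[:, None] * X).sum(axis=0) / N
    centered = X - mu_w
    # Weighted covariance: M_w = sum_i w_i (X_i - mu_w)(X_i - mu_w)^T
    M_w = (w[:, None] * centered).T @ centered
    # M_w is symmetric, so eigh applies; it returns eigenvalues in ascending order.
    eigvals, eigvecs = np.linalg.eigh(M_w)
    order = np.argsort(eigvals)[::-1][:n_components]
    components = eigvecs[:, order]
    return centered @ components, components, mu_w


# Toy usage: 100 random 16-dimensional points with random positive weights.
rng = np.random.default_rng(0)
X_demo = rng.random((100, 16))
w_demo = rng.random(100) + 0.1
projected, components, mu_w = weighted_pca(X_demo, w_demo)
```

With all weights equal, this reduces to ordinary PCA up to a constant scaling of the covariance matrix.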

Weighted PCA separates species
B. anthracis and L. monocytogenes, abundance 20:1
PCA
Weighted PCA

Un-supervised Classification ?

Semi-Supervised Classification
31 Marker Genes [courtesy Martin Wu]
  Omnipresent
  Relatively immune to Lateral Gene Transfer
Reads containing these marker genes can be classified with high reliability.

Semi-supervised Classification
Use a semi-supervised version of the normalized cut algorithm

The Semi-supervised Normalized Cut Algorithm
Calculate the K-nearest neighbor graph from the point set.
Update graph with marker information.
  If two nodes are from the same species, add an edge between them.
  If two nodes are from different species, remove any edge between them.
Bisect the graph using the normalized-cut algorithm.
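The three steps above can be sketched in NumPy as follows. This is an illustration, not the CompostBin implementation: the graph is unweighted and symmetrized, the bisection uses the standard spectral relaxation of normalized cut (sign of the second eigenvector of the normalized Laplacian), and the graph is assumed to remain connected after the marker update. All function names are hypothetical.

```python
import numpy as np


def knn_graph(points, k):
    """Symmetric, unweighted K-nearest-neighbor adjacency matrix."""
    points = np.asarray(points, dtype=float)
    d2 = ((points[:, None, :] - points[None, :, :]) ** 2).sum(axis=2)
    np.fill_diagonal(d2, np.inf)  # a point is not its own neighbor
    nearest = np.argsort(d2, axis=1)[:, :k]
    W = np.zeros(d2.shape)
    rows = np.repeat(np.arange(len(points)), k)
    W[rows, nearest.ravel()] = 1.0
    return np.maximum(W, W.T)  # keep an edge if either endpoint chose it


def apply_markers(W, labels):
    """labels: dict mapping node index -> species label for marker-gene reads."""
    W = W.copy()
    for a in labels:
        for b in labels:
            if a != b:
                # Same species: add an edge. Different species: remove any edge.
                W[a, b] = 1.0 if labels[a] == labels[b] else 0.0
    return W


def normalized_cut_bisect(W):
    """Boolean partition from the sign of the second generalized eigenvector."""
    degree = W.sum(axis=1)
    d_inv_sqrt = 1.0 / np.sqrt(np.maximum(degree, 1e-12))
    L_sym = np.eye(len(W)) - d_inv_sqrt[:, None] * W * d_inv_sqrt[None, :]
    eigvals, eigvecs = np.linalg.eigh(L_sym)  # ascending eigenvalues
    fiedler = d_inv_sqrt * eigvecs[:, 1]
    return fiedler > 0  # assumption: split at zero; assumes a connected graph


# Toy usage: eight points on a line forming two groups of four.
points = [[0.0], [1.0], [2.0], [3.0], [4.5], [5.5], [6.5], [7.5]]
W = knn_graph(points, k=2)
W_marked = apply_markers(W, {0: "species_a", 3: "species_a", 7: "species_b"})
partition = normalized_cut_bisect(W_marked)
```

Splitting at zero is only one of several common choices; searching for the threshold that minimizes the normalized-cut value is another.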

Generalization to multiple bins
Gluconobacter oxydans [0.61], Granulibacter bethesdensis [0.59] and Nitrobacter hamburgensis [0.62]

Testing
Simulate Metagenomic Sequencing
  Sanger Reads
Variables
  Number of species
  Relative abundance
  GC content
  Phylogenetic Diversity
Test on a real dataset where answer is well-established.
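A minimal sketch of how such a simulation could be set up. The details are assumptions, not the procedure used for the results: reads are error-free, start positions are uniform, read lengths are drawn uniformly from the 500-1000 bp Sanger range, and a species is picked with probability proportional to its abundance times its genome length. The name `simulate_reads` and the toy genomes are hypothetical.

```python
import random


def simulate_reads(genomes, abundance, n_reads, min_len=500, max_len=1000, seed=0):
    """Sample (species, read) pairs from a dict of genome strings."""
    rng = random.Random(seed)
    names = list(genomes)
    # Assumption: sampling probability ~ relative abundance * genome length.
    weights = [abundance[name] * len(genomes[name]) for name in names]
    reads = []
    for _ in range(n_reads):
        name = rng.choices(names, weights=weights)[0]
        genome = genomes[name]
        length = min(rng.randint(min_len, max_len), len(genome))
        start = rng.randint(0, len(genome) - length)
        reads.append((name, genome[start:start + length]))
    return reads


# Toy usage: two random "genomes" at a 20:1 relative abundance.
toy_rng = random.Random(1)
toy_genomes = {
    "species_a": "".join(toy_rng.choice("ACGT") for _ in range(5000)),
    "species_b": "".join(toy_rng.choice("ACGT") for _ in range(5000)),
}
toy_reads = simulate_reads(toy_genomes, {"species_a": 20, "species_b": 1}, n_reads=200)
```

Keeping the true species label with each read makes it straightforward to score a binning result afterwards.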

Results

Conclusions/Future Directions
Satisfactory performance
  No Training on Existing Genomes
  Sanger Reads
  Low number of Species
Future Work
  Holy Grail: Complex Communities
  Semi-supervised projection?
  Hybrid Assembly/Binning

Acknowledgements
UC Davis: Jonathan Eisen, Martin Wu, Dongying Wu, Ichitaro Yamazaki, Amber Hartman, Marcel Huntemann
UC Berkeley: Lior Pachter, Richard Karp, Ambuj Tewari, Narayanan Manikandan
Princeton University: Simon Levin, Josh Weitz, Jonathan Dushoff

We study microbial genomics in our lab. Well, microbes are small and invisible, and they can't speak for themselves. So I will do some marketing for them.
Microbes are also agents of disease. For example, this is an electron micrograph of Salmonella cells, in red, attacking human tissue. You might also remember the anthrax scare that was in the news just after the 9/11 attacks. These are just three among many reasons that understanding the biology of microbes is really important.
Appearance is not a reliable indicator of what one is looking at.
Cultivate microbes in the laboratory in artificial media.
The most apparent problem in metagenomic analysis is binning, the clustering of metagenomic sequences into taxon-specific bins.
Metagenomics and binning have applications in the real world too, like the production of wines. Don't worry, it has nothing to do with the storing of wine bottles.
Current methods: all dimensions treated equally.
The species in the example before were equally abundant. However,
The idea behind the weighting scheme is that sequences from more abundant species will have more overlap and thus lower weights.
We use these weights in a variation of the standard PCA algorithm. In the weighted algorithm, we first calculate the weighted mean mu_w. Each point is then normalized
To use this information in clustering, we use a semi-supervised version of the widely used normalized cut algorithm.
