CompostBin : A DNA composition based metagenomic binning algorithm

DESCRIPTION

CompostBin: A DNA composition based metagenomic binning algorithm. Sourav Chatterji*, Ichitaro Yamazaki, Zhaojun Bai and Jonathan Eisen, UC Davis, [email protected]. Overview of Talk: Metagenomics and the binning problem; CompostBin. The Microbial World.

Citation preview

  • CompostBin: A DNA composition based metagenomic binning algorithm
    Sourav Chatterji*, Ichitaro Yamazaki, Zhaojun Bai and Jonathan Eisen
    UC Davis
    [email protected]

  • Overview of Talk
    Metagenomics and the binning problem.
    CompostBin

  • The Microbial World

  • Exploring the Microbial World
    Culturing
      Majority of microbes currently unculturable.
      No ecological context.
    Molecular Surveys (e.g. 16S rRNA)
      Who is out there?
      What are they doing?

  • Metagenomics

  • Interpreting Metagenomic Data
    Nature of Metagenomic Data
      Mosaic
      Intraspecies polymorphism
      Fragmentary
    New Sequencing Technologies
      Enormous amount of data
      Short Reads

  • Metagenomic Binning
    Classification of sequences by taxa

  • Why Bin at all?

  • Binning in Action
    Glassy Winged Sharpshooter (Homalodisca coagulata).
    Feeds on plant xylem (poor in organic nutrients).
    Microbial Endosymbionts

  • Current Binning Methods
    Assembly
    Align with Reference Genome
    Database Search [MEGAN, BLAST]
    Phylogenetic Analysis
    DNA Composition [TETRA, PhyloPythia]

  • Current Binning Methods
    Need closely related reference genomes.
    Poor performance on short fragments.
      Sanger sequence reads 500-1000 bp long.
    Current assembly methods unreliable.
    Complex communities hard to bin.

  • Overview of Talk
    Metagenomics and the binning problem.
    CompostBin

  • Genome Signatures
    Does genomic sequence from an organism have a unique signature that distinguishes it from genomic sequence of other organisms?
      Yes [Karlin et al. 1990s]
    What is the minimum length sequence that is required to distinguish genomic sequence of one organism from the genomic sequence of another organism?

  • Imperfect World
    Horizontal Gene Transfer
      Recent Estimates [Ge et al. 2005]
      Varies between 0-6% of genes.
      Typically ~2%.
    But Amelioration

  • DNA-composition metrics
    The K-mer Frequency Metric
    CompostBin uses hexamers
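
    The K-mer frequency metric can be sketched as follows. This is a minimal illustration rather than the CompostBin implementation: the function names `kmer_index` and `kmer_frequencies` are hypothetical, windows containing characters other than A, C, G, T are skipped, and strand handling (e.g. merging reverse complements) is not addressed.

```python
from itertools import product

import numpy as np


def kmer_index(k):
    """Map every DNA k-mer over ACGT to a position in the frequency vector."""
    return {"".join(p): i for i, p in enumerate(product("ACGT", repeat=k))}


def kmer_frequencies(seq, k=6):
    """Return the normalized k-mer frequency vector (length 4**k) of a read.

    k=6 matches the hexamers mentioned on the slide; everything else here
    is an illustrative assumption.
    """
    index = kmer_index(k)
    counts = np.zeros(4 ** k)
    seq = seq.upper()
    for start in range(len(seq) - k + 1):
        pos = index.get(seq[start:start + k])  # None for windows with N etc.
        if pos is not None:
            counts[pos] += 1
    total = counts.sum()
    # Normalize so reads of different lengths are comparable.
    return counts / total if total > 0 else counts
```

    Each read then becomes a point in a 4096-dimensional space when k = 6.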

  • DNA-composition metrics: Working with K-mers for Binning
    Curse of Dimensionality: O(4^K) independent dimensions.
    Statistical noise increases with decreasing fragment lengths.
    Project data into a lower dimensional space to decrease noise.
      Principal Component Analysis.
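
    A minimal sketch of the projection step, assuming each read has already been turned into a row of K-mer frequencies. The function name `pca_project` is hypothetical, and the default of three components is borrowed from the weighted PCA slides later in the talk.

```python
import numpy as np


def pca_project(X, n_components=3):
    """Project rows of X (one read per row) onto the top principal components."""
    X = np.asarray(X, dtype=float)
    centered = X - X.mean(axis=0)
    # SVD of the centered data: rows of vt are the principal directions,
    # ordered by decreasing singular value.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:n_components].T
```

    The SVD route avoids forming the full 4^K by 4^K covariance matrix explicitly, which is a design choice of this sketch and not something the slides specify.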

  • PCA separates species
    Gluconobacter oxydans [65% GC] and Rhodospirillum rubrum [61% GC]

  • Effect of Skewed Relative Abundance
    B. anthracis and L. monocytogenes
    [Figure panels: Abundance 1:1; Abundance 20:1]

  • A Weighting Scheme
    For each read, find overlap with other sequences.

  • A Weighting Scheme
    Calculate the redundancy of each position.
    Weight is inverse of average redundancy.
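
    One way to read the weighting scheme, sketched under stated assumptions: overlaps with other reads are given as half-open (start, end) intervals on the read, and the redundancy of a position counts the read itself once, so a read with no overlaps gets weight 1. The function name `read_weight` and the interval representation are hypothetical; the slides do not say how overlaps are computed.

```python
import numpy as np


def read_weight(read_length, overlaps):
    """Weight of a read = 1 / (average redundancy over its positions).

    overlaps: list of (start, end) half-open intervals on this read that are
    covered by other reads (hypothetical input format).
    """
    # Assumption: each position is covered at least once, by the read itself.
    redundancy = np.ones(read_length)
    for start, end in overlaps:
        redundancy[max(start, 0):min(end, read_length)] += 1
    return 1.0 / redundancy.mean()


# A read fully covered by one other read has average redundancy 2.
print(read_weight(100, [(0, 100)]))  # 0.5
```

    Reads from an abundant species overlap more often, so their weights shrink and they contribute less to the later analysis.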

  • Weighted PCA
    Calculate the weighted mean $\mu_w$:

    $\mu_w = \frac{1}{N} \sum_{i=1}^{N} w_i X_i$

    Calculate the weighted covariance matrix $M_w$:

    $M_w = \sum_{i=1}^{N} w_i (X_i - \mu_w)(X_i - \mu_w)^T$

    Principal components (PCs) are eigenvectors of $M_w$.
    Use first three PCs for further analysis.
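
    A minimal sketch of weighted PCA, reading the slide's formulas as mu_w = (1/N) * sum_i w_i X_i and M_w = sum_i w_i (X_i - mu_w)(X_i - mu_w)^T. The function name `weighted_pca` and the array layout (one read per row of X, one weight per read in w) are assumptions made for illustration.

```python
import numpy as np


def weighted_pca(X, w, n_components=3):
    """Project reads onto the top eigenvectors of the weighted covariance M_w."""
    X = np.asarray(X, dtype=float)
    w = np.asarray(w, dtype=float)
    n = X.shape[0]
    # Weighted mean: mu_w = (1/N) * sum_i w_i * X_i
    mu_w = (w[:, None] * X).sum(axis=0) / n
    centered = X - mu_w
    # Weighted covariance: M_w = sum_i w_i * (X_i - mu_w)(X_i - mu_w)^T
    M_w = (w[:, None] * centered).T @ centered
    # M_w is symmetric, so eigh applies; it returns ascending eigenvalues.
    eigvals, eigvecs = np.linalg.eigh(M_w)
    order = np.argsort(eigvals)[::-1][:n_components]
    pcs = eigvecs[:, order]
    return centered @ pcs, pcs
```

    With all weights equal to 1 this reduces to ordinary PCA on the mean-centered data.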

  • Weighted PCA separates species
    B. anthracis and L. monocytogenes, abundance 20:1
    [Figure panels: PCA; Weighted PCA]

  • Un-supervised Classification ?

  • Semi-Supervised Classification
    31 Marker Genes [courtesy Martin Wu]
      Omnipresent
      Relatively immune to Lateral Gene Transfer
    Reads containing these marker genes can be classified with high reliability.

  • Semi-supervised Classification
    Use a semi-supervised version of the normalized cut algorithm

  • The Semi-supervised Normalized Cut Algorithm
    Calculate the K-nearest neighbor graph from the point set.
    Update graph with marker information.
      If two nodes are from the same species, add an edge between them.
      If two nodes are from different species, remove any edge between them.
    Bisect the graph using the normalized-cut algorithm.
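
    The three steps can be sketched as below. This is an illustration under assumptions, not the CompostBin code: the graph is unweighted, marker information arrives as a hypothetical dict from node index to species name, and the bisection uses the common spectral relaxation of normalized cut (sign of the second eigenvector of the normalized Laplacian) rather than a search over split points.

```python
import numpy as np


def knn_graph(points, k):
    """Symmetric, unweighted K-nearest-neighbor adjacency matrix."""
    points = np.asarray(points, dtype=float)
    n = len(points)
    d = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=2)
    np.fill_diagonal(d, np.inf)  # a node is not its own neighbor
    nearest = np.argsort(d, axis=1)[:, :k]
    A = np.zeros((n, n))
    A[np.repeat(np.arange(n), k), nearest.ravel()] = 1.0
    return np.maximum(A, A.T)


def apply_markers(A, labels):
    """Add edges between same-species marker reads, remove edges otherwise."""
    A = A.copy()
    for a in labels:
        for b in labels:
            if a != b:
                A[a, b] = 1.0 if labels[a] == labels[b] else 0.0
    return A


def normalized_cut_bisect(A):
    """Split nodes by the sign of the relaxed normalized-cut indicator."""
    deg = A.sum(axis=1)
    d_inv_sqrt = 1.0 / np.sqrt(np.maximum(deg, 1e-12))
    L_sym = np.eye(len(A)) - d_inv_sqrt[:, None] * A * d_inv_sqrt[None, :]
    _, vecs = np.linalg.eigh(L_sym)  # eigenvalues in ascending order
    indicator = d_inv_sqrt * vecs[:, 1]  # second-smallest eigenvector
    return indicator > 0
```

    If the marker update disconnects the graph, the second eigenvector is no longer unique, and the connected components would need to be handled separately; this sketch does not do that.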

  • Generalization to multiple bins
    Gluconobacter oxydans [0.61], Granulibacter bethesdensis [0.59] and Nitrobacter hamburgensis [0.62]

  • Testing
    Simulate Metagenomic Sequencing
      Sanger Reads
      Variables
        Number of species
        Relative abundance
        GC content
        Phylogenetic Diversity
    Test on a real dataset where answer is well-established.

  • Results

  • Conclusions/Future Directions
    Satisfactory performance
      No Training on Existing Genomes
      Sanger Reads
      Low number of Species
    Future Work
      Holy Grail: Complex Communities
      Semi-supervised projection?
      Hybrid Assembly/Binning

  • Acknowledgements
    UC Davis: Jonathan Eisen, Martin Wu, Dongying Wu, Ichitaro Yamazaki, Amber Hartman, Marcel Huntemann
    UC Berkeley: Lior Pachter, Richard Karp, Ambuj Tewari, Narayanan Manikandan
    Princeton University: Simon Levin, Josh Weitz, Jonathan Dushoff

  • We study microbial genomics in our lab. Microbes are small and invisible, and they can't speak for themselves, so I will do some marketing for them.

    Microbes are also agents of disease. For example, this is an electron micrograph of Salmonella cells (in red) attacking human tissue. You might also remember the anthrax scare that was in the news just after the 9/11 attacks. These are just three among many reasons that understanding the biology of microbes is really important.

    Appearance is not a reliable indicator of what one is looking at.

    Cultivate microbes in the laboratory in artificial media.

    The most apparent problem in metagenomic analysis is binning, the clustering of metagenomic sequences into taxon-specific bins.

    Metagenomics and binning have applications in the real world too, like the production of wines. Don't worry, it has nothing to do with the storing of wine bottles.

    Current methods: all dimensions treated equally.

    The species in the example before were equally abundant. However,

    The idea behind the weighting scheme is that sequences from more abundant species will have more overlap and thus lower weights.

    We use these weights in a variation of the standard PCA algorithm. In the weighted algorithm, we first calculate the weighted mean mu_w. Each point is then normalized

    To use this information in clustering, we use a semi-supervised version of the widely used normalized cut algorithm.