16S Classifier: a tool for fast and accurate classification of 16S rRNA sequences
Ashok K. Sharma
Research Scholar
Metagenomics and Systems Biology Laboratory
Indian Institute of Science Education and Research, Bhopal
Overview: Species Diversity
[Figure: a metagenome illustrating species richness, with the genera Arcobacter, Paludibacter, Shewanella, Pseudomonas, and unknown species]
Knowledge of the microbial diversity of soil and other extreme environments is still limited
Only 1-3% of soil microbes are culturable
An estimated 4000-5000 different bacterial genomic units are present in 1 g of soil
Bacteria and fungi play an important role in biogeochemical cycles, and especially in human health
Species diversity consists of: 1. Species richness, 2. Total number of species, and 3. Distribution of species
Methods of studying microbial diversity
Biochemical
Plate count
Community level physiological profiling (CLPP)
Fatty acid methyl ester (FAME) analysis, as fatty acids make up a constant proportion of cell biomass
Molecular
G+C content
Nucleic acid re-association and hybridization
DNA microarray
DNA cloning and sequencing-based methods
Plate count is fast and cost-effective, but it cannot detect unculturable microbes and is biased towards fast-growing organisms and fungal species.
CLPP is fast, highly reproducible, and inexpensive, and generates a large amount of data, but it represents only the culturable community and favours fast-growing organisms.
FAME requires no culturing because fatty acids are extracted directly from soil, but it is affected by external factors.
Metagenomic reads vs 16S rRNA for microbial diversity identification
Shared steps: Metagenome -> DNA isolation
Metagenomic reads route: Fragmentation of DNA -> Metagenomic reads -> Tools (Kraken, PhyloPythiaS, Phymm, PhymmBL, MetaBin) -> Microbial diversity
16S rRNA route: Amplification of 16S rRNA -> 16S rRNA from multiple species -> Microbial diversity
16S rRNA: a gold standard for microbial molecular identification
Universal
Highly conserved
Long enough (~1500 bp) to provide significant discrimination between many species
Structural information can guide alignment and phylogenetic reconstruction
Many species now represented in the database
16S rRNA gene sequencing
Earlier: by sequencing the whole gene
Now: by sequencing short variable regions
Limitations:
Insufficient and underestimated diversity
16S rRNA gene
16S rRNA: to understand microbial diversity
Community composition shifts over time as revealed by 16S data
Software and tools available for the analysis of 16S rRNA data
CloVR-16S
QIIME: a Python-based workflow package that allows sequence processing and phylogenetic analysis using different methods, including the phylogenetic distance metric UniFrac, UCLUST, PyNAST, and the RDP Bayesian classifier
Mothur: a C++-based software package for 16S analysis
Metastats and custom R scripts, used to generate additional statistical and graphical evaluations
Most recent: 16S Classifier, a Random Forest based standalone package, especially for short hypervariable regions
Material and methods
Greengenes database
Random Forest
EMBOSS
RDP Classifier
BLAST
Input data for training
In 16S Classifier, we built separate models for the different hypervariable regions (HVRs) of the 16S rRNA gene
Took the Greengenes 16S rRNA database
Extracted individual HVRs, as well as combinations of two or more commonly used HVRs, using commonly used universal primers with the help of in-house Perl scripts and the EMBOSS software suite
Discarded HVRs where primer coverage was less than 50% of all sequences
Clustered out highly similar sequences using CD-HIT at threshold 1
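The extraction and coverage filter described above can be sketched in Python. This is only an illustration and not the authors' pipeline, which used in-house Perl scripts and EMBOSS: it assumes exact primer matches with no degenerate bases, assumes the primers themselves are excluded from the extracted region, and uses made-up toy primers and sequences. The function names are hypothetical.

```python
def reverse_complement(seq):
    """Reverse-complement a DNA string made of A, C, G, T."""
    comp = {"A": "T", "C": "G", "G": "C", "T": "A"}
    return "".join(comp[b] for b in reversed(seq.upper()))


def extract_region(seq, fwd_primer, rev_primer):
    """Return the sequence between the forward primer and the reverse
    primer's reverse complement, or None if either primer is missing.
    Assumption: exact string matching only."""
    seq = seq.upper()
    start = seq.find(fwd_primer.upper())
    if start == -1:
        return None
    start += len(fwd_primer)
    end = seq.find(reverse_complement(rev_primer), start)
    if end == -1:
        return None
    return seq[start:end]


def extract_with_coverage(seqs, fwd_primer, rev_primer, min_coverage=0.5):
    """Extract the region from every sequence and report primer coverage.
    The region set is discarded (empty list) if coverage is below the cutoff."""
    regions = [extract_region(s, fwd_primer, rev_primer) for s in seqs]
    found = [r for r in regions if r is not None]
    coverage = len(found) / len(seqs) if seqs else 0.0
    return (found if coverage >= min_coverage else []), coverage


# Toy data: these primers and sequences are invented for illustration.
toy_seqs = ["GGACGTAAATTTCCAAGG", "ACGTCCCCCAAT", "GGGGGGGG"]
regions, coverage = extract_with_coverage(toy_seqs, "ACGT", "TTGG")
print(regions, round(coverage, 2))
```

Real universal primers contain degenerate bases, so a production version would need pattern matching that tolerates ambiguity codes and mismatches.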
Table 1. Summary of the number of HVR sequences which were used for the training and testing of RF*.
Parameter optimization
Labeled each sequence with its taxonomic information to the lowest known level, excluding species
Used the V3 region for parameter optimization
Calculated 2-mer, 3-mer, 4-mer, 5-mer, and 6-mer nucleotide frequencies and tried them as feature inputs
Tried various mtry values at each k-mer size to get the lowest out-of-bag (OOB) error
Got the best results at k = 4, so 4-mer nucleotide frequencies were used to build the models at ntree = 1000
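To make the feature input concrete, here is a minimal Python sketch of a k-mer frequency vector. It is not the authors' code: the function name, the normalization to relative frequencies, and the choice to skip windows containing ambiguous bases are all assumptions made for illustration.

```python
from itertools import product


def kmer_frequencies(seq, k=4):
    """Return relative k-mer frequencies over the ACGT alphabet.
    The vector has 4**k entries in a fixed lexicographic order, so every
    sequence maps to features in the same positions."""
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    counts = dict.fromkeys(kmers, 0)
    seq = seq.upper()
    total = 0
    for i in range(len(seq) - k + 1):
        word = seq[i:i + k]
        if word in counts:  # assumption: skip windows with ambiguous bases such as N
            counts[word] += 1
            total += 1
    if total == 0:
        return [0.0] * len(kmers)
    return [counts[w] / total for w in kmers]


# Toy sequence, invented for illustration.
features = kmer_frequencies("ACGTACGTAC", k=4)
print(len(features))  # 256 features for k = 4
```

With k = 4 each sequence becomes a 256-dimensional vector, which keeps the feature space small compared with k = 5 (1024) or k = 6 (4096).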
Figure 1. Optimization of parameters using hypervariable region V3
Variables selection
ntree optimization
OOB Error for Different HVRs
Input data for testing
The first test dataset was obtained by randomly extracting ~10% of the sequences that had been clustered out using CD-HIT earlier. Random mutations (1%) were inserted into these sequences to mimic real-life sequencing errors
The second dataset was obtained from real metagenomic sequences available from the NCBI SRA
Performance of 16S Classifier was compared with that of RDP Classifier in terms of accuracy as well as time taken for computation.
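The construction of the first test dataset can be sketched as follows. This is an illustrative Python version under stated assumptions: the slides do not say whether the 1% mutations were substitutions, insertions, or deletions, so this sketch assumes substitutions only, and the function names are hypothetical.

```python
import random


def sample_fraction(seqs, fraction=0.10, rng=None):
    """Randomly pick roughly `fraction` of the sequences without replacement."""
    rng = rng or random.Random()
    n = max(1, round(fraction * len(seqs))) if seqs else 0
    return rng.sample(seqs, n)


def mutate(seq, rate=0.01, rng=None):
    """Substitute each base with a different base with probability `rate`.
    Assumption: substitutions only, so the sequence length is unchanged."""
    rng = rng or random.Random()
    out = []
    for base in seq:
        if rng.random() < rate:
            # Choose among the bases that differ from the current one.
            choices = [b for b in "ACGT" if b != base.upper()]
            out.append(rng.choice(choices))
        else:
            out.append(base)
    return "".join(out)


# Toy input, invented for illustration; a fixed seed keeps the run repeatable.
rng = random.Random(42)
pool = ["ACGT" * 25 for _ in range(20)]
test_set = [mutate(s, rate=0.01, rng=rng) for s in sample_fraction(pool, 0.10, rng)]
print(len(test_set))
```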
Performance of different RF models on different HVRs and the complete 16S rRNA gene
Performance of RF models on the first test dataset
Comparison of 16S Classifier with RDP Classifier on real datasets
Advantages of 16S Classifier
Extremely fast
High sensitivity as well as specificity
Consistent across various HVRs
Easy availability
Easy to deploy and use
How to use
Download the zip file for a particular hypervariable region or for the complete 16S gene; it is freely available at http://metagenomics.iiserb.ac.in/16Sclassifier/download.html
Extract the zip file, which contains a model file (*.Rdata), a script file (*.sh), and an executable file (16Sclassifier.exe).
Other dependencies:
Install R from http://cran.r-project.org/
Install the randomForest package in R
## Command line usage ##
./16sclassifier.exe
The query file should be in FASTA format, and the model name can be v2, v3, v4, v5, v6, v7, v8, v23, v34, v35, v45, v56, v67, v78, or complete.
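Before running the tool, it can help to check the two inputs described above. The following Python sketch is not part of 16S Classifier: the function names are hypothetical, and it only verifies that the query text parses as FASTA and that the model name is one of those listed on this slide.

```python
# Model names exactly as listed on the slide.
MODEL_NAMES = {
    "v2", "v3", "v4", "v5", "v6", "v7", "v8",
    "v23", "v34", "v35", "v45", "v56", "v67", "v78", "complete",
}


def parse_fasta(text):
    """Parse FASTA text into a list of (header, sequence) tuples."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif header is None:
            raise ValueError("sequence data found before the first '>' header")
        else:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


def check_query(fasta_text, model):
    """Validate the model name and the FASTA text; return the parsed records."""
    if model not in MODEL_NAMES:
        raise ValueError(f"unknown model name: {model}")
    records = parse_fasta(fasta_text)
    if not records:
        raise ValueError("no FASTA records found")
    return records


# Toy query, invented for illustration.
print(check_query(">s1\nACGT\nAC\n", "v3"))
```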
Thank You