I AM NOT A METAGENOMIC EXPERT
I am merely the MESSENGER
Blaise T.F. Alako, PhD EBI Ambassador [email protected]
Hubert Denise, Alex Mitchell, Peter Sterk, Sarah Hunter
http://www.ebi.ac.uk/metagenomics
Where is the true cost of NGS?
[Figure: percentage breakdown of NGS project costs; sequencing throughput annotated as ~80 bp/$ and ~2m bp/$]
Sboner et al. Genome Biology (2011) 12:125
EBI Metagenomics pipeline
Data analysis using selected EBI and external software tools
§ Philosophy
§ Overview of data analysis
§ QC steps + tutorial
§ Overview of functional analysis
§ Overview of taxonomy analysis
§ Result outputs
§ Other public pipelines
Philosophy behind EBI Metagenomics pipeline
From chaos to structure:
§ archiving of data with metadata
§ performing stringent QC filtering prior to analysis
  - quality in, quality out
§ performing robust taxonomy and functional analysis
  - model-based rather than similarity-based approaches
  - assignment done on reads rather than assembly
§ intuitive navigation through website
§ constant drive for improvement
  - benchmarking and tool testing
Helping metagenomics researchers make sense of their data
EBI Metagenomics does not currently perform assembly
Why?
§ absence of reference genome
§ short reads make chimaeras inevitable
What are the consequences?
§ cannot link taxonomy information to functional annotations
§ cannot currently perform viral taxonomy analysis
Example: re-analysis of Hess et al., Science (2011) 331:463
Image credits:
(1) Christina Toft & Siv G. E. Andersson; (2) Dalebroux Z D et al. Microbiol. Mol. Biol. Rev. 2010;74:171-199
Metagenomics data analysis
§ Quality control
§ Diversity analysis
§ Functional analysis
Overview of EBI Metagenomics Pipeline
[Flow diagram]
raw reads → Quality control → processed reads (reads that fail QC are discarded)
processed reads → rRNASelector → reads with rRNA → Qiime → Taxonomic analysis
Amplicon-based data: processed reads → Qiime → Taxonomic analysis
processed reads → rRNASelector → reads without rRNA → FragGeneScan → predicted CDS (pCDS) → InterProScan → Function assignment or Unknown function
EBI Metagenomics: QC rationale
Why?
§ Garbage in, garbage out
§ Base call errors:
  - each base call has an associated quality score
  - platform-dependent errors
§ Read quality decreases with read length
§ NGS generates duplicate reads (false and real). Reducing duplication reduces analysis time and prevents analysis bias.
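The per-base quality score mentioned above is conventionally a Phred score Q, related to the probability P of a wrong base call by Q = -10 log10(P). The following is a minimal sketch of that conversion. The Phred+33 ASCII offset and both function names are assumptions for illustration; the slides do not say which encoding the pipeline expects.

```python
def phred_to_error_prob(q):
    """Convert a Phred quality score to the probability that the base call is wrong."""
    return 10 ** (-q / 10)


def decode_quality(qual_string, offset=33):
    """Decode a FASTQ quality string into Phred scores.

    The offset of 33 is an assumption (Phred+33); older files may use 64.
    """
    return [ord(ch) - offset for ch in qual_string]


# 'I' decodes to Q40, '5' to Q20 and '+' to Q10 under Phred+33.
scores = decode_quality("II5+")
probs = [phred_to_error_prob(q) for q in scores]
```

A Q20 base therefore has a 1 in 100 chance of being wrong, and a Q10 base a 1 in 10 chance, which is why low-quality read ends are trimmed before analysis.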
EBI Metagenomics: QC step by step
§ Clipping - low-quality ends trimmed and adapter/barcode sequences removed using the Biopython SeqIO package
§ Quality filtering - sequences with > 10% undetermined nucleotides removed
§ Read length filtering - short sequences removed, with a platform-dependent length cut-off
§ Duplicate sequence removal - sequences clustered at 99% identity (UCLUST v 1.1.579) and a representative sequence chosen
§ Repeat masking - RepeatMasker (open-3.2.2); reads with 50% or more nucleotides masked are removed
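Three of the steps above (quality filtering, read length filtering and duplicate removal) can be sketched in plain Python. This is an illustration under stated assumptions, not the pipeline's code: `min_length=100` is a hypothetical cut-off (the slides only say it is platform-dependent), and exact-match de-duplication stands in for the 99% identity UCLUST clustering. Clipping and repeat masking are not covered.

```python
def passes_n_filter(seq, max_n_fraction=0.10):
    """Keep a read unless more than 10% of its bases are undetermined (N)."""
    if not seq:
        return False
    return seq.upper().count("N") / len(seq) <= max_n_fraction


def qc_filter(reads, min_length=100):
    """Filter (read_id, sequence) pairs by N content, length and duplication.

    min_length is a hypothetical value. Exact-match de-duplication is a
    simplification of clustering at 99% identity.
    """
    seen = set()
    kept = []
    for read_id, seq in reads:
        seq = seq.upper()
        if not passes_n_filter(seq):
            continue  # too many undetermined nucleotides
        if len(seq) < min_length:
            continue  # too short for this (assumed) platform cut-off
        if seq in seen:
            continue  # exact duplicate of a read already kept
        seen.add(seq)
        kept.append((read_id, seq))
    return kept


reads = [
    ("r1", "ACGT" * 30),
    ("r2", "ACGT" * 30),             # exact duplicate of r1
    ("r3", "ACGT" * 10),             # 40 bp, shorter than min_length
    ("r4", "N" * 20 + "ACGT" * 25),  # 20 of 120 bases are N
]
kept = qc_filter(reads)
```

Only `r1` survives here: `r2` is a duplicate, `r3` is too short, and `r4` exceeds the 10% N limit.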
EBI Metagenomics: QC consequences
[Figure panels: Roche 454, Illumina, Ion Torrent]
EBI Metagenomics: overview of functional analysis
[Flow diagram] reads without rRNA → FragGeneScan → predicted CDS (pCDS) → InterProScan → Function assignment or Unknown function
EBI Metagenomics: identification of coding sequences
Prediction of coding sequences is a challenge:
§ read length
§ sequencing errors: frame-shifts
Two main types of approaches:
§ homology-based methods: identify only known coding sequences
§ feature-based approaches: predict the probability that ORFs are coding
EBI Metagenomics uses FragGeneScan:
§ hidden Markov models to correct frame-shifts using codon usage
§ probabilistic identification of start and stop codons
§ 60 bp minimum ORF
Rho et al. (2010) NAR 38(20)
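To make the 60 bp minimum concrete, here is a naive six-frame scan for stop-free stretches. It is deliberately not FragGeneScan's method: there is no hidden Markov model, no codon-usage scoring and no frame-shift correction, which is why this approach degrades on error-prone reads. All function names are hypothetical.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}
COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def reverse_complement(seq):
    """Return the reverse complement of an upper-case DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def open_frames(seq, min_length=60):
    """Yield (frame, start, end) for stop-free stretches of at least
    min_length bp in the three forward frames of seq."""
    for frame in range(3):
        start = frame
        for pos in range(frame, len(seq) - 2, 3):
            if seq[pos:pos + 3] in STOP_CODONS:
                if pos - start >= min_length:
                    yield frame, start, pos
                start = pos + 3
        # Reads are fragments, so an open stretch may run off the end.
        end = frame + ((len(seq) - frame) // 3) * 3
        if end - start >= min_length:
            yield frame, start, end


def six_frame_orfs(seq, min_length=60):
    """Collect stop-free stretches on both strands as (strand, frame, start, end)."""
    seq = seq.upper()
    orfs = [("+", f, s, e) for f, s, e in open_frames(seq, min_length)]
    orfs += [("-", f, s, e)
             for f, s, e in open_frames(reverse_complement(seq), min_length)]
    return orfs


read = "ATG" + "GCT" * 25 + "TAA" + "GC"
candidates = six_frame_orfs(read)
```

Coordinates on the "-" strand refer to the reverse-complemented read. A single-base insertion or deletion shifts the true gene into another frame partway through the read, which this scan cannot follow.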
EBI Metagenomics: annotation of coding sequences
The EBI Metagenomics pipeline does not use pairwise similarity-based methods to associate functions with predicted protein sequences; instead we use InterProScan to mine the InterPro database
Most available pipelines use homology-based methods (such as BLAST):
§ compare a query sequence with a database of sequences
§ identify database sequences that resemble the query sequence with a homology score above a certain threshold
However, sequences may appear to have a low homology score because:
§ proteins may share homology only in limited domains
§ proteins from different species can differ in length
EBI Metagenomics: Advantages of InterPro
InterPro database (HMM- and profile-based functional analysis)
§ based on presence of “signatures” (models) from several databases
§ Specificity: mapping is manually curated
  - BLAST vs. UniRef100 hits: C7VBM8, Predicted protein; C7VC62, Predicted protein
  - InterProScan hits: 5-formyltetrahydrofolate cyclo-ligase-like (IPR024185); Transcription regulator HTH, LysR (IPR000847)
§ Speed: test set of 40692 predicted protein sequences
  - BLAST vs. UniRef100 = 21.5 s/CDS
  - InterProScan (5 databases) = 3 s/CDS
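The per-sequence timings translate into total run times as follows. This is a worked calculation using only the figures quoted above, and it assumes the sequences are processed one after another on a single process; the slides do not state the hardware or degree of parallelism.

```python
N_SEQUENCES = 40692            # size of the test set quoted above
BLAST_S_PER_CDS = 21.5         # BLAST vs. UniRef100
INTERPROSCAN_S_PER_CDS = 3.0   # InterProScan, 5 databases


def total_hours(n_sequences, seconds_each):
    """Total serial run time in hours (assumes no parallelism)."""
    return n_sequences * seconds_each / 3600


blast_hours = total_hours(N_SEQUENCES, BLAST_S_PER_CDS)
ips_hours = total_hours(N_SEQUENCES, INTERPROSCAN_S_PER_CDS)
speedup = BLAST_S_PER_CDS / INTERPROSCAN_S_PER_CDS

print(f"BLAST: {blast_hours:.1f} h, InterProScan: {ips_hours:.1f} h, "
      f"ratio: {speedup:.1f}x")
# prints "BLAST: 243.0 h, InterProScan: 33.9 h, ratio: 7.2x"
```

Under that assumption the test set takes roughly ten days with BLAST against about a day and a half with InterProScan.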
EBI Metagenomics: overview of taxonomy analysis
[Flow diagram] rRNASelector → reads with rRNA → Qiime → Taxonomic analysis
Amplicon-based data: processed reads → Qiime → Taxonomic analysis
EBI Metagenomics: identification of suitable sequences
Taxonomy analysis is generally based on identification and classification of rRNA sequences
§ Prokaryotes (archaebacteria and eubacteria): 5S, 16S and 23S
§ Eukaryotes: 5S, 5.8S, 18S and 28S
§ there is no equivalent for viruses, so analysis depends on DNA polymerase or part of the 5’-UTR (internal ribosomal entry site [IRES]) sequences
EBI Metagenomics currently provides taxonomy analysis for Prokaryotes only. rRNA sequences are identified using rRNASelector:
§ hidden Markov models to identify rRNA sequences
§ 60 bp minimum overlap with well-curated HMM model
§ E-value < 10^-5
Lee et al. (2011) J Microbiol. 49(4)
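The two acceptance criteria above (at least 60 bp overlap with the model and an E-value below 10^-5) amount to a simple filter that splits reads into the "with rRNA" and "without rRNA" branches of the pipeline. The sketch below assumes a hypothetical `Hit` record; it is not rRNASelector's actual output format.

```python
from collections import namedtuple

# Hypothetical hit record for illustration only.
Hit = namedtuple("Hit", ["read_id", "model", "overlap_bp", "evalue"])


def is_rrna_hit(hit, min_overlap=60, max_evalue=1e-5):
    """Apply the two thresholds quoted on the slide."""
    return hit.overlap_bp >= min_overlap and hit.evalue < max_evalue


def split_reads(read_ids, hits):
    """Partition read IDs into (with_rrna, without_rrna), preserving order."""
    rrna_ids = {h.read_id for h in hits if is_rrna_hit(h)}
    with_rrna = [r for r in read_ids if r in rrna_ids]
    without_rrna = [r for r in read_ids if r not in rrna_ids]
    return with_rrna, without_rrna


hits = [
    Hit("r1", "16S", 120, 1e-30),  # passes both thresholds
    Hit("r2", "16S", 45, 1e-12),   # overlap too short
    Hit("r3", "23S", 80, 1e-3),    # E-value too high
]
with_rrna, without_rrna = split_reads(["r1", "r2", "r3", "r4"], hits)
```

Reads in `with_rrna` go on to taxonomic analysis, and those in `without_rrna` go on to coding sequence prediction.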
EBI Metagenomics: identification of suitable sequences
Once identified, rRNA sequences are clustered and classified using Qiime
“QIIME stands for Quantitative Insights Into Microbial Ecology. QIIME is an open source software package for comparison and analysis of microbial communities”
The main steps are:
§ clustering sequences into Operational Taxonomic Units (OTUs) using uclust
§ picking a representative sequence set (one sequence from each OTU)
§ aligning the representative sequence set using PyNAST
§ assigning taxonomy to the representative sequence set
§ generating output files:
  - filtering the alignment prior to tree building
  - building phylogenetic tree
  - creating OTU table
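The clustering, representative-picking and OTU-table steps can be illustrated with a toy greedy clusterer. This is a simplified stand-in for uclust, not its algorithm: identity is computed position by position without alignment, the 0.97 threshold is an assumed value (the slides do not give one), and the first sequence in each cluster serves as its representative.

```python
from collections import Counter


def identity(a, b):
    """Fraction of matching positions, compared over the shorter length.

    Ungapped comparison is a simplification; real tools align first.
    """
    n = min(len(a), len(b))
    if n == 0:
        return 0.0
    return sum(x == y for x, y in zip(a, b)) / n


def greedy_otus(seqs, threshold=0.97):
    """Cluster (sample, sequence) pairs greedily.

    Each sequence joins the first OTU whose representative it matches at
    or above threshold; otherwise it seeds a new OTU. Returns the list of
    representatives and a per-OTU Counter of sample -> read count.
    """
    representatives = []
    otu_table = []
    for sample, seq in seqs:
        for i, rep in enumerate(representatives):
            if identity(seq, rep) >= threshold:
                otu_table[i][sample] += 1
                break
        else:
            representatives.append(seq)
            otu_table.append(Counter({sample: 1}))
    return representatives, otu_table


seqs = [
    ("s1", "A" * 100),
    ("s2", "A" * 98 + "CC"),  # 98% identical to the first sequence
    ("s1", "C" * 100),        # unrelated, seeds a second OTU
]
representatives, otu_table = greedy_otus(seqs)
```

The representatives are what would be aligned and given a taxonomy, and `otu_table` holds the per-sample counts that make up the OTU table.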
EBI Metagenomics pipeline in a nutshell
§ QC:
  - trim adaptor sequences and low-quality sequence ends
  - remove duplicates and short sequences
  - remove low-complexity sequences
§ Diversity analysis:
  - identify prokaryotic rRNA sequences (5S, 16S and 23S)
  - cluster rRNA-containing reads
  - assign taxonomy classification using Qiime
§ Functional analysis:
  - predict ORFs
  - translate ORFs into peptides
  - submit to InterProScan for functional annotation
“Powerful and sophisticated alternative to BLAST-based functional metagenomic analysis”
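The "translate ORFs into peptides" step is a codon-by-codon lookup. The sketch below assumes the standard genetic code and an ORF that is already in frame; the slides do not say which translation table the pipeline uses. Codons containing ambiguous bases are written as "X", which is a choice made here for illustration.

```python
from itertools import product

# Standard genetic code, codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(orf):
    """Translate an in-frame DNA sequence into a peptide string.

    Stop codons become '*', codons with ambiguous bases become 'X',
    and a trailing partial codon is ignored.
    """
    orf = orf.upper()
    peptide = []
    for i in range(0, len(orf) - 2, 3):
        peptide.append(CODON_TABLE.get(orf[i:i + 3], "X"))
    return "".join(peptide)


peptide = translate("ATGGCTTAA")  # "MA*"
```

The resulting peptide strings are what a protein-level tool such as InterProScan takes as input.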
Current outputs of EBI Metagenomics pipeline
- QC and sequence statistics
- Diversity analysis
- Functional analysis
Visualisation
Download
EBI Metagenomics pipeline: taxonomy visualisation
§ switch to bar chart, column or Krona interactive views
§ Google charts dynamic representation
EBI Metagenomics pipeline: functional visualisation
§ Gene ontology
§ InterPro matches
§ links to InterPro website
EBI Metagenomics pipeline: download options
Large starting material
Small size output for post-processing
Some other Metagenomics tools
http://www.computationalbioenergy.org/software.html
http://ab.inf.uni-tuebingen.de/software/megan/
http://cbcb.umd.edu/software/metAMOS
Public Metagenomics portals
http://www.ebi.ac.uk/metagenomics/
http://metagenomics.anl.gov/
http://camera.calit2.net/
http://img.jgi.doe.gov/
http://www.ebi.ac.uk/metagenomics
Thanks to EMG Team, InterPro team and you for your attention