I AM NOT A METAGENOMIC EXPERT I am merely the MESSENGER Blaise T.F. Alako, PhD EBI Ambassador [email protected]

I AM NOT A METAGENOMIC EXPERT I am merely the MESSENGERhpc.ilri.cgiar.org/beca/training/AdvancedBFX2013_2/Oct_2013/ebi... · Overview data analysis ! QC steps + tutorial ! Overview


Page 1

I AM NOT A METAGENOMIC EXPERT

I am merely the MESSENGER

Blaise T.F. Alako, PhD EBI Ambassador [email protected]

Page 2

Hubert Denise, Alex Mitchell, Peter Sterk, Sarah Hunter

Page 3

http://www.ebi.ac.uk/metagenomics

Blaise T.F. Alako EBI Ambassador [email protected]

Page 4

Where is the true cost of NGS?

[Figure: breakdown of NGS costs. Values recovered from the chart, in extraction order: 70 % (~80 bp/$), 14.5 %, 28 % (~2m bp/$), 36.5 %, 14.5 %, 14.5 %, 55 %, 30 %, 4.5 %]

Sboner et al. Genome Biology (2011) 12:125

Page 5

EBI Metagenomics pipeline

§ Philosophy
§ Overview of data analysis
§ QC steps + tutorial
§ Overview of functional analysis
§ Result outputs
§ Other public pipelines

Data analysis using selected EBI and external software tools

Page 6

Philosophy behind the EBI Metagenomics pipeline

From chaos to structure:

§ archiving of data with metadata

§ performing stringent QC filtering prior to analysis
  - quality in, quality out

§ performing robust taxonomy and functional analysis
  - model-based rather than similarity-based approaches
  - assignment done on reads rather than on assemblies

§ intuitive navigation through the website

§ constant drive for improvement
  - benchmarking and tool testing

Helping metagenomics researchers make sense of their data

Page 7

EBI Metagenomics currently does not perform assembly

Why?

§ absence of a reference genome
§ short reads make chimaeras inevitable

What are the consequences?

§ cannot link taxonomy information to functional annotations
§ cannot currently perform viral taxonomy analysis

Ex: re-analysis of Hess et al., Science (2011) 331:463

Page 8

Metagenomics data analysis

§ Quality control
§ Diversity analysis
§ Functional analysis

Image credits: (1) Christina Toft & Siv G. E. Andersson; (2) Dalebroux Z D et al. Microbiol. Mol. Biol. Rev. 2010;74:171-199

Page 9

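Overview of EBI Metagenomics Pipeline

[Flow diagram]

raw reads -> Quality control -> processed reads (reads that fail QC are discarded)

processed reads -> rRNAselector -> reads with rRNA -> Qiime -> Taxonomic analysis

processed reads -> rRNAselector -> reads without rRNA -> FragGeneScan -> predicted CDS (pCDS) -> InterProScan -> Function assignment / Unknown function

Amplicon-based data: processed reads -> Qiime -> Taxonomic analysis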

Page 10

EBI Metagenomics pipeline

§ Philosophy
§ Overview of data analysis
§ QC steps + tutorial
§ Overview of functional analysis
§ Result outputs
§ Other public pipelines

Data analysis using selected EBI and external software tools

Page 11

EBI Metagenomics: QC rationale

Why?

§ Garbage in, garbage out

§ Base-call errors:
  - each base call has an associated quality score
  - errors are platform-dependent

§ Read quality decreases as read length increases

§ NGS generates duplicate reads (false and real). Reducing duplication reduces analysis time and prevents analysis bias.

Page 12

EBI Metagenomics: QC step by step

§ Clipping - low-quality ends are trimmed and adapter/barcode sequences removed using the Biopython SeqIO package

§ Quality filtering - sequences with > 10% undetermined nucleotides are removed

§ Read length filtering - short sequences are removed, with a platform-dependent cut-off

§ Duplicate sequence removal - sequences are clustered at 99% identity (UCLUST v 1.1.579) and a representative sequence is chosen

§ Repeat masking - RepeatMasker (open-3.2.2); reads with 50% or more nucleotides masked are removed
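
As a rough illustration of three of the QC steps listed above, here is a minimal Python sketch. It is not the pipeline's code: the function names are hypothetical, `min_length=100` is a placeholder for the platform-dependent cut-off, and exact-duplicate removal is swapped in for UCLUST clustering at 99% identity. Clipping and repeat masking are left out.

```python
from typing import Iterable, List, Tuple

Read = Tuple[str, str]  # (read_id, sequence)


def fraction_undetermined(seq: str) -> float:
    """Return the fraction of bases in seq that are 'N' (undetermined)."""
    if not seq:
        return 1.0
    return seq.upper().count("N") / len(seq)


def qc_filter(reads: Iterable[Read],
              min_length: int = 100,            # assumed value, platform-dependent
              max_n_fraction: float = 0.10) -> List[Read]:
    """Apply quality, length and duplicate filters, in that order."""
    seen = set()
    kept: List[Read] = []
    for read_id, seq in reads:
        seq = seq.upper()
        # Quality filtering: drop reads with more than 10% undetermined bases.
        if fraction_undetermined(seq) > max_n_fraction:
            continue
        # Read length filtering: drop short reads.
        if len(seq) < min_length:
            continue
        # Duplicate removal: exact matches only (a simplification of
        # identity-based clustering); the first copy seen is kept.
        if seq in seen:
            continue
        seen.add(seq)
        kept.append((read_id, seq))
    return kept
```

Calling `qc_filter` on a list of `(read_id, sequence)` pairs returns the surviving reads in their input order, so the first occurrence of a duplicated sequence acts as its representative.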

Page 13

EBI Metagenomics: QC consequences

[Figure panels: Roche 454, Illumina, Ion Torrent]

Page 14

EBI Metagenomics: overview of functional analysis

[Flow diagram] reads without rRNA -> FragGeneScan -> predicted CDS (pCDS) -> InterProScan -> Function assignment / Unknown function

Page 15

EBI Metagenomics: identification of coding sequences

Prediction of coding sequences is a challenge:

§ read length
§ sequencing errors: frame-shifts

Two main types of approaches:

§ homology-based methods: identify only known coding sequences
§ feature-based approaches: predict the probability that ORFs are coding

EBI Metagenomics uses FragGeneScan:

§ hidden Markov models to correct frame-shifts using codon usage
§ probabilistic identification of start and stop codons
§ 60 bp minimum ORF

Rho et al. (2010) NAR 38(20)
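
To make the 60 bp minimum concrete, the sketch below scans all six reading frames for stop-free stretches of at least that length. This is a naive scan written for illustration only: it is not FragGeneScan's hidden Markov model, it does no frame-shift correction, and the function names are hypothetical.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}
_COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of an upper-case DNA sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def open_frames(seq: str, min_length: int = 60):
    """List stop-free stretches of at least min_length bp in six frames.

    Each result is (strand, frame, start, end). Coordinates on the '-'
    strand refer to the reverse-complemented sequence. Stretches may be
    open-ended, because a short read often lacks a start or stop codon.
    """
    seq = seq.upper()
    results = []
    for strand, s in (("+", seq), ("-", reverse_complement(seq))):
        for frame in range(3):
            start = frame
            pos = frame
            while pos + 3 <= len(s):
                if s[pos:pos + 3] in STOP_CODONS:
                    # Close the current stretch at the stop codon.
                    if pos - start >= min_length:
                        results.append((strand, frame, start, pos))
                    start = pos + 3
                pos += 3
            # Stretch that runs to the end of the read without a stop.
            if pos - start >= min_length:
                results.append((strand, frame, start, pos))
    return results
```

A read shorter than `min_length` can never yield a result, which is one reason read length makes gene prediction on raw reads difficult.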

Page 16

EBI Metagenomics: annotation of coding sequences

The EBI Metagenomics pipeline does not use pairwise similarity-based methods to associate functions with predicted protein sequences; instead we use InterProScan to mine the InterPro database.

Most available pipelines use homology-based methods (such as BLAST):

§ compare a query sequence with a database of sequences
§ identify database sequences that resemble the query sequence with a similarity score above a certain threshold

However, sequences may appear to have a low similarity score because:

§ proteins may share homology only in limited domains
§ proteins from different species can differ in length

Page 17

EBI Metagenomics: Advantages of InterPro

InterPro database (HMM- and profile-based functional analysis)

§ based on the presence of “signatures” (models) from several databases

§ Specificity: mapping is manually curated

  BLAST vs. UniRef100 hit          InterProScan hit
  C7VBM8, Predicted protein        5-formyltetrahydrofolate cyclo-ligase-like (IPR024185)
  C7VC62, Predicted protein        Transcription regulator HTH, LysR (IPR000847)

§ Speed

  Test set of 40692 predicted protein sequences

  - BLAST vs UniRef100 = 21.5 s per CDS
  - InterProScan (5 databases) = 3 s per CDS
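
The per-sequence timings above translate into total run times as follows. This is plain arithmetic on the slide's numbers; it assumes, as a simplification, that every sequence is processed serially at the quoted average rate.

```python
n_cds = 40692              # test set size from the slide
blast_s_per_cds = 21.5     # BLAST vs UniRef100
interpro_s_per_cds = 3.0   # InterProScan (5 databases)


def total_hours(n: int, seconds_each: float) -> float:
    """Total serial run time in hours for n sequences."""
    return n * seconds_each / 3600.0


blast_hours = total_hours(n_cds, blast_s_per_cds)
interpro_hours = total_hours(n_cds, interpro_s_per_cds)
speedup = blast_s_per_cds / interpro_s_per_cds

print(f"BLAST: {blast_hours:.0f} h, "
      f"InterProScan: {interpro_hours:.0f} h, "
      f"ratio: {speedup:.1f}x")
# prints "BLAST: 243 h, InterProScan: 34 h, ratio: 7.2x"
```

In other words, roughly ten days of serial BLAST time against about a day and a half for InterProScan on this test set.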

Page 18

EBI Metagenomics: overview of taxonomy analysis

[Flow diagram]

processed reads -> rRNAselector -> reads with rRNA -> Qiime -> Taxonomic analysis

Amplicon-based data: processed reads -> Qiime -> Taxonomic analysis

Page 19

EBI Metagenomics: identification of suitable sequences

Taxonomy analysis is generally based on the identification and classification of rRNA sequences:

§ Prokaryotes (archaea and bacteria): 5S, 16S and 23S

§ Eukaryotes: 5S, 5.8S, 18S and 28S

§ there is no equivalent for viruses, so analysis depends on DNA polymerase or part of the 5’-UTR (internal ribosomal entry site [IRES]) sequences

EBI Metagenomics currently only provides taxonomy analysis for prokaryotes. rRNA sequences are identified using rRNASelector:

§ hidden Markov models to identify rRNA sequences

§ 60 bp minimum overlap with a well-curated HMM model

§ E-value < 10^-5

Lee et al. (2011) J Microbiol. 49(4)
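
The two thresholds quoted above can be read as a simple filter over HMM hits. The sketch below assumes a hypothetical `HmmHit` record; it does not reproduce rRNASelector's actual output format or interface.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass
class HmmHit:
    """Hypothetical record for one read matched against an rRNA HMM."""
    read_id: str
    overlap_bp: int   # length of the overlap with the model, in bp
    e_value: float


def passes_rrna_thresholds(hit: HmmHit,
                           min_overlap_bp: int = 60,
                           max_e_value: float = 1e-5) -> bool:
    """True if the hit meets both thresholds quoted on the slide."""
    return hit.overlap_bp >= min_overlap_bp and hit.e_value < max_e_value


def select_rrna_reads(hits: Iterable[HmmHit]) -> List[str]:
    """Return IDs of reads whose hit passes both thresholds."""
    return [h.read_id for h in hits if passes_rrna_thresholds(h)]
```

Reads that pass would go on to taxonomic analysis; reads with no passing hit would be treated as "reads without rRNA".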

Page 20

EBI Metagenomics: identification of suitable sequences

Once identified, rRNA sequences are clustered and classified using Qiime

“QIIME stands for Quantitative Insights Into Microbial Ecology. QIIME is an open source software package for comparison and analysis of microbial communities”

The main steps are:

§ clustering sequences into Operational Taxonomic Units (OTUs) using uclust

§ picking a representative sequence set (one sequence from each OTU)

§ aligning the representative sequence set using PyNAST

§ assigning taxonomy to the representative sequence set

§ generating output files:
  - filtering the alignment prior to tree building
  - building a phylogenetic tree
  - creating the OTU table
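
The first two steps above (OTU clustering and picking representatives) can be sketched as a greedy clustering loop. This is an illustration only, not uclust: the ungapped `identity` function is a crude stand-in for real alignment, the 0.97 threshold is an assumed value that the slide does not state, and the function names are hypothetical.

```python
from typing import Dict, List, Tuple


def identity(a: str, b: str) -> float:
    """Ungapped, position-by-position identity relative to the longer sequence."""
    longer = max(len(a), len(b))
    if longer == 0:
        return 1.0
    matches = sum(1 for x, y in zip(a, b) if x == y)
    return matches / longer


def greedy_otus(seqs: Dict[str, str],
                threshold: float = 0.97) -> List[Tuple[str, List[str]]]:
    """Assign each sequence to the first OTU whose representative it matches.

    Returns a list of (representative_id, member_ids). The first sequence
    placed in an OTU serves as its representative.
    """
    otus: List[Tuple[str, List[str]]] = []
    for seq_id, seq in seqs.items():
        for rep_id, members in otus:
            if identity(seq, seqs[rep_id]) >= threshold:
                members.append(seq_id)
                break
        else:
            # No existing representative is close enough: start a new OTU.
            otus.append((seq_id, [seq_id]))
    return otus


def otu_table(otus: List[Tuple[str, List[str]]]) -> Dict[str, int]:
    """Count the number of reads per OTU, keyed by representative ID."""
    return {rep_id: len(members) for rep_id, members in otus}
```

Because the loop is greedy, the result depends on input order; that is a known property of this style of clustering and one reason sequences are often sorted before clustering.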

Page 21

EBI Metagenomics pipeline in a nutshell

§ QC:
  - trim adaptor sequences and low-quality sequence ends
  - remove duplicates and short sequences
  - remove low-complexity sequences

§ Diversity analysis:
  - identify prokaryotic rRNA sequences (5S, 16S and 23S)
  - cluster rRNA-containing reads
  - assign taxonomy classification using Qiime

§ Functional analysis:
  - predict ORFs
  - translate ORFs into peptides
  - submit to InterProScan for functional annotation

“Powerful and sophisticated alternative to BLAST-based functional metagenomic analysis”

Page 22

EBI Metagenomics pipeline

§ Philosophy
§ Overview of data analysis
§ QC steps + tutorial
§ Overview of functional analysis
§ Overview of taxonomy analysis
§ Result outputs
§ Other public pipelines

Data analysis using selected EBI and external software tools

Page 23

Current outputs of EBI Metagenomics pipeline

- QC and sequence statistics

- Diversity analysis

- Functional analysis

Visualisation

Download

Page 24

EBI Metagenomics pipeline: taxonomy visualisation

Switch to bar chart, column or Krona interactive views

Google charts dynamic representation

Page 25

EBI Metagenomics pipeline: functional visualisation

Google charts dynamic representation

Gene ontology

InterPro matches

Links to InterPro website

Page 26

EBI Metagenomics pipeline : download options

Large starting material

Small size output for post-processing

Page 27

EBI Metagenomics pipeline

§ Philosophy
§ Overview of data analysis
§ QC steps + tutorial
§ Overview of functional analysis
§ Overview of taxonomy analysis
§ Result outputs
§ Other public pipelines

Data analysis using selected EBI and external software tools

Page 28

Some other metagenomics tools

http://www.computationalbioenergy.org/software.html

http://ab.inf.uni-tuebingen.de/software/megan/

http://cbcb.umd.edu/software/metAMOS

Page 29

Public metagenomics portals

http://www.ebi.ac.uk/metagenomics/

http://metagenomics.anl.gov/

http://camera.calit2.net/

http://img.jgi.doe.gov/

Page 30

http://www.ebi.ac.uk/metagenomics

Thanks to EMG Team, InterPro team and you for your attention