Genomic Epidemiology: How High Throughput Sequencing changed our view on bacterial strain...

João André Carriço

Microbiology Institute and Mramirez Lab, Instituto de Medicina Molecular, Faculty of Medicine, University of Lisbon

jcarrico@fm.ul.pt

twitter: @jacarrico

Bioinformatics Open Days 2017, Braga, 22 February 2017

John Snow

English physician 1813-1858

Total: 616 dead

31 Aug - 10 Sep: 500 dead

“the study of what is upon the people”

the branch of medicine which deals with the incidence, distribution, and possible control of diseases and other factors relating to health.

It is the cornerstone of public health, and shapes policy decisions and evidence-based practice by identifying risk factors for disease and targets for preventive healthcare

“shoe-leather epidemiology” and lots of statistics

EBOLA West African Ebola virus epidemic

2011 Germany E. coli O104:H4 outbreak :

bloody diarrhea accompanied by hemolytic-uremic syndrome (HUS)

Genomic sequencing by BGI Shenzhen confirmed a 2001 finding that the O104:H4 serotype has some enteroaggregative E. coli (EAEC or EAggEC) properties, presumably acquired by horizontal gene transfer

On 8 June, the EU's E. coli O104:H4 outbreak was estimated to have cost ~2,690,000,000 EUR in human losses (such as sick leave), regardless of material losses (such as dumped cucumbers: ~240M EUR in Spain alone)

Crowdsourcing event started at ABPHM2011

Smith, K. F. et al. Global rise in human infectious disease outbreaks. Journal of The Royal Society Interface 11, 20140950 (2014).

Flight paths across North America.

Outbreaks follow flight paths more closely than simple geographic distance.

Outbreaks have costs in mortality, morbidity and other direct economic impacts (e.g. product recalls)

Fast intervention can save lives and money

Slide credit: Fiona Brinkman

Didelot, X., Bowden, R., Wilson, D. J., Peto, T. E. A. & Crook, D. W. Transforming clinical microbiology with bacterial genome sequencing. Nat Rev Genet 13, 601–612 (2012).

“Not all bacteria are born equal”

discriminating strains within a species/subspecies

Gel based: Pulsed-Field Gel Electrophoresis (PFGE), RAPD, AFLP

Phenotypic based: Colony morphology/color, Antibiogram, Serotype

Sequence based: Multilocus Sequence Typing (MLST), emm typing, spa typing

ST aroE gdh gki recP spi xpt ddl

156 7 11 10 1 6 8 1

Bacterial Population Genetics

Pathogenesis and Natural History of Infection

Surveillance of Infectious Diseases

Outbreak Investigation and Control

S. pneumoniae housekeeping genes: aroE, gdh, gki, recP, spi, xpt, ddl

PCR → Sanger sequencing → 7 sequences

http://pubmlst.org/spneumoniae/

Retrieve alleles and ST

ST aroE gdh gki recP spi xpt ddl

156 7 11 10 1 6 8 1

ST 156
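The last step of the MLST workflow above (seven allele numbers in, one ST out) amounts to a table lookup. A minimal Python sketch, assuming an in-memory table that stands in for the PubMLST profile database; the function name `assign_st` is hypothetical and the only profile included is the ST 156 one shown on the slide:

```python
# Locus order of the S. pneumoniae MLST scheme shown above.
LOCI = ("aroE", "gdh", "gki", "recP", "spi", "xpt", "ddl")

# Stand-in for the PubMLST profile table: allele tuple -> ST.
# Only the ST 156 profile from the slide is included (assumption: toy table).
PROFILES = {
    (7, 11, 10, 1, 6, 8, 1): 156,
}


def assign_st(alleles):
    """Return the ST for a dict of locus -> allele number, or None if unknown."""
    key = tuple(alleles[locus] for locus in LOCI)
    return PROFILES.get(key)


isolate = {"aroE": 7, "gdh": 11, "gki": 10, "recP": 1,
           "spi": 6, "xpt": 8, "ddl": 1}
print(assign_st(isolate))  # 156
```

A profile that is not in the table returns `None`, which in practice would correspond to a novel ST that has to be submitted to the nomenclature database.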

Greatest advantages:
- Sequence reduced to allele ID
- Portable
- Easy to infer relationship

Nomenclature

Enterococcus faecium isolates by source: clinical, animal, NA, community, hospital (surveillance/outbreak)

More on this later…

The Internet

Sequence-based Information

But only 7 target loci ….

HiSeq 2000

MiSeq

PacBio

Oxford Nanopore MinION

https://nanoporetech.com/products/minion

Oxford Nanopore SmidgION

https://nanoporetech.com/products/smidgion

Alikhan, N.-F. et al., 2011. BLAST Ring Image Generator (BRIG): simple prokaryote genome comparisons. BMC genomics, 12, p.402.

Bacterial Draft Genomes:
- 1 circular chromosome
- 1.5-4 Mb (most of them)
- From a few to hundreds of contigs
- May contain plasmids
- Hundreds of thousands of bacterial read sets are already available on SRA/ENA
- Usually sequenced at 30-100x depth of coverage
- Cost: 70-150 EUR (Illumina) / 500-2000 EUR (PacBio/Nanopore)
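The depth-of-coverage figure quoted above is simply total sequenced bases divided by genome size. A small Python sketch of that arithmetic; the read count and read length in the example are made-up illustration values, not numbers from the talk:

```python
def depth_of_coverage(n_reads, read_length_bp, genome_size_bp):
    """Average depth = total sequenced bases / genome size."""
    return n_reads * read_length_bp / genome_size_bp


def reads_needed(target_depth, read_length_bp, genome_size_bp):
    """Number of reads required to reach a target average depth (rounded up)."""
    # Negating before and after floor division gives a ceiling division.
    return -(-target_depth * genome_size_bp // read_length_bp)


# Hypothetical run: 1,000,000 reads of 150 bp on a 2 Mb genome.
print(depth_of_coverage(1_000_000, 150, 2_000_000))  # 75.0
print(reads_needed(30, 150, 2_000_000))              # 400000
```

This is an average only: real coverage varies along the genome, so QA/QC normally looks at the distribution as well.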

Sequencing & Bioinformatics
• Sequencing, Assembly Pipeline Parameters
• QA/QC Metrics
• Tree Construction Details

Sample Information
• Isolation source
• Food, Clinical, Environment
• Food category, Body Product
• Dates, Location

Clinical and Epi Details
• Demographics
• Host disease, Symptoms
• Lab Test Results
• Exposures

Slide credit: Will Hsiao

investigations using integrated microbial genomic data & “metadata” (lab, epidemiological data)

… aiming to save lives, economies

Slide credit: Fiona Brinkman

Chronicle of a Death Foretold

http://en.wikipedia.org/wiki/File:ChronicleOfADeathForetold.JPG

Game Changer for Microbial Typing

From the reads much more information can be extracted:
- gene-by-gene approaches: wgMLST, cgMLST
- SNP comparison approaches: comparison with reference strains
- k-mer based distances
- Ability to recover most of the present sequence-based typing information in a single experimental procedure
- Greatly increased discriminatory power
- Unifies genomics and typing
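As an illustration of the k-mer based distances mentioned in the list above, here is a minimal Python sketch using the Jaccard distance between exact k-mer sets. This is one possible definition chosen for clarity, and the default `k=4` is only suitable for toy sequences; tools working on whole genomes typically use much longer k-mers and subsample them.

```python
def kmers(seq, k):
    """Set of all k-length substrings of a sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def jaccard_distance(seq_a, seq_b, k=4):
    """1 - |A intersect B| / |A union B| over the two k-mer sets."""
    a, b = kmers(seq_a, k), kmers(seq_b, k)
    union = a | b
    if not union:  # both sequences shorter than k
        return 0.0
    return 1.0 - len(a & b) / len(union)


print(jaccard_distance("ACGTACGT", "ACGTACGT"))  # 0.0
print(jaccard_distance("AAAAAA", "CCCCCC"))      # 1.0
```

No alignment and no reference genome are needed, which is what makes this family of distances attractive for a quick first comparison.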

The Ideal Scenario

Microbiological Sample → Magic Box of NGS Wonders for Microbiology → Completely characterized strain:
• Antibiotic resistance profile
• Multilocus Sequence Typing (MLST)
• Virulence factors present
• Other SBTM information, e.g.:
  • spa (S. aureus)
  • emm (Group A Streptococcus)

Desired end result: risk assessment of the strain and useful application of the data to clinical practice

Comparison between groups of strains

Didelot, X., Bowden, R., Wilson, D. J., Peto, T. E. A. & Crook, D. W. Transforming clinical microbiology with bacterial genome sequencing. Nat Rev Genet 13, 601–612 (2012).

sample → HTS reads → Read mapping software (against a Reference genome) → VCF/Fasta File with SNPs → Phylogenetic/Minimum spanning Tree

• Uses a reference strain:
  • Outbreak determination
  • Comparative studies
  • Monomorphic (clonal) species
• Recombination/horizontal gene transfer must be detected and removed from phylogenetic analysis
• Difficult to create a nomenclature (due to different references)
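Once every sample has been mapped to the same reference, comparing strains in the SNP approach comes down to counting differing positions. A minimal Python sketch, assuming the sequences are already aligned to the same length and that `N` or `-` marks an uncalled position; the function names are hypothetical, and real pipelines work from VCF files and filter recombinant regions first:

```python
def snp_positions(reference, sample):
    """0-based positions where an aligned sample differs from the reference.

    Positions with 'N' or '-' in either sequence are skipped as uncalled.
    """
    if len(reference) != len(sample):
        raise ValueError("sequences must be aligned to the same length")
    skip = {"N", "-"}
    return [i for i, (r, s) in enumerate(zip(reference, sample))
            if r != s and r not in skip and s not in skip]


def snp_distance(seq_a, seq_b):
    """Number of called positions at which two aligned sequences differ."""
    return len(snp_positions(seq_a, seq_b))


ref = "ACGTACGTAC"
s1 = "ACGTTCGTAC"
s2 = "ACNTTCGTAA"
print(snp_positions(ref, s1))  # [4]
print(snp_positions(ref, s2))  # [4, 9]
print(snp_distance(s1, s2))    # 1
```

Because positions are defined relative to one reference, results from analyses that used different references cannot be compared directly, which is the nomenclature problem noted above.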

61 Streptococcus dysgalactiae subspecies equisimilis isolates

Roary presence and absence matrix (10661 gene clusters)

Core (n or n-1 strains)

Soft-Core (n-2 or n-3 strains)

Shell (8(?) to n-3 strains)

Cloud (<8(?) strains)

Core genome: Core + Soft-Core

Accessory genome: Shell + Cloud
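The categories above can be applied to a gene presence/absence matrix with a few comparisons. A minimal Python sketch using the count-based thresholds from the slide; the cloud cut-off of 8 is marked as uncertain on the slide, so it is exposed as a parameter, and checking the categories in order resolves the overlap at n-3 in favour of soft-core (an assumption of this sketch). The function names are hypothetical and this is not Roary's own code.

```python
def classify_gene(n_present, n_strains, cloud_below=8):
    """Assign a gene to core / soft-core / shell / cloud by strain count."""
    if n_present >= n_strains - 1:      # n or n-1 strains
        return "core"
    if n_present >= n_strains - 3:      # n-2 or n-3 strains
        return "soft-core"
    if n_present >= cloud_below:        # cloud_below up to n-4 strains
        return "shell"
    return "cloud"


def genome_compartment(category):
    """Core genome = core + soft-core; accessory genome = shell + cloud."""
    return "core genome" if category in ("core", "soft-core") else "accessory genome"


def classify_matrix(presence):
    """presence: dict gene -> list of 0/1 flags, one flag per strain."""
    return {gene: classify_gene(sum(flags), len(flags))
            for gene, flags in presence.items()}


# Toy matrix with 12 strains (made-up data).
toy = {
    "geneA": [1] * 12,
    "geneB": [1] * 10 + [0] * 2,
    "geneC": [1] * 8 + [0] * 4,
    "geneD": [1] * 2 + [0] * 10,
}
print(classify_matrix(toy))
```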

Catarina Inês Mendes (as you already noticed, decorations were only for Xmas)

Virulome

Core genome

Accessory genome

Mobilome

sample → HTS reads → De novo assembly software → contigs → Output: Allelic Profile → Phylogenetic/Minimum spanning Tree

Central nomenclature server: Schemas, Allele/Profile IDs

• Expansion of the MLST concepts to core/pan genome
• Buffers recombination effect
• Simpler to create a nomenclature
• Population structure of non-monomorphic species
• Easy to compare thousands of samples using thousands of loci
• Handling missing data is still an open problem
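Comparing two allelic profiles in a gene-by-gene scheme means counting loci with different allele IDs. A minimal Python sketch that also shows one simple (and debatable) convention for missing data: loci missing in either profile are skipped and the number of loci actually compared is reported alongside the distance. Coding a missing locus as `0` is an assumption of this example, not a convention stated in the talk.

```python
def allelic_distance(profile_a, profile_b, missing=0):
    """Count loci with different allele IDs.

    Loci where either profile holds the `missing` marker are skipped.
    Returns (differences, loci_compared).
    """
    if len(profile_a) != len(profile_b):
        raise ValueError("profiles must cover the same loci in the same order")
    diffs = compared = 0
    for a, b in zip(profile_a, profile_b):
        if a == missing or b == missing:
            continue
        compared += 1
        if a != b:
            diffs += 1
    return diffs, compared


p1 = [1, 2, 3, 4, 5]
p2 = [1, 2, 7, 0, 5]   # fourth locus not called
p3 = [1, 9, 7, 4, 0]   # fifth locus not called
print(allelic_distance(p1, p2))  # (1, 4)
print(allelic_distance(p1, p3))  # (2, 4)
```

Reporting the number of compared loci matters: two distances are only comparable when they rest on a similar number of shared loci.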

This is Chewbacca ... He is chewBBACA’s cousin

Our approach to the problem:

Mickael Silva (He didn’t bring the glasses…)

https://pmcvariety.files.wordpress.com/2014/06/eli-wallach-dead-good-bad-ugly.jpg?w=670&h=377&crop=1

My Goals / Areas that I want to apply WGS to:
• Microbial population structure
• Microbial evolution
• Microbial genomics: gene structure, genome synteny, mobile genetic elements detection

My toolbox is chosen based on my questions and what I want to do!

Trying to avoid: “I suppose it is tempting, if the only tool you have is a hammer, to treat everything as if it were a nail.” - Abraham H. Maslow (1962), Toward a Psychology of Being

Sequence QA/QC: FastQC http://www.bioinformatics.babraham.ac.uk/projects/fastqc/

Adaptor and quality trimming: Trimmomatic http://www.usadellab.org/cms/?page=trimmomatic

Assembly: SPAdes http://bioinf.spbau.ru/spades

Velvet http://www.ebi.ac.uk/~zerbino/velvet/

Mapping: Bowtie2 http://bowtie-bio.sourceforge.net/bowtie2/index.shtml

Annotation: Prokka http://www.vicbioinformatics.com/software.prokka.shtml

Whole genome comparison: BRIG (BLAST Ring Image Generator) http://brig.sourceforge.net/

MAUVE http://darlinglab.org/mauve/mauve.html

http://rugbyea.com/wp-content/uploads/2013/05/blast.jpg

http://www.ecohealthypets.com/writable/pet_report_photos/photo/480x/ball_python_2.jpg

- Perform the same analysis over tens, hundreds or thousands of strains: your own and publicly available
- Integrate multiple analyses in a single pipeline
- Pipelines = reproducibility (if not, something is very wrong)

http://www.ebi.ac.uk/ena

http://www.ncbi.nlm.nih.gov/sra

A Standardized Pipeline for Bacterial Genome Assembly and Quality Control

Miguel Machado

(not wearing his wolf suit today…I think...)

Virulence Factor Databases

VFDB (http://www.mgc.ac.cn/VFs/main.htm)

Pathosystems Resource Integration Center (PATRIC) VF (https://www.patricbrc.org/)

Victors (http://www.phidias.us/victors/)

PHI-Base (http://www.phi-base.org/)

MvirDB (http://mvirdb.llnl.gov/)

To know more: presentation in the “Controversies in interpreting whole genome sequence data” session:

http://eccmidlive.org/#resources/how-can-we-design-actionable-virulome-databases

Comprehensive Antibiotic Resistance Database (CARD) (https://card.mcmaster.ca/)

Repository of Antibiotic resistance Cassettes (RAC) (http://rac.aihi.mq.edu.au/rac/)

INTEGRALL: The Integron Database (http://integrall.bio.ua.pt/)

(…)

“Formal representation of knowledge as a set of concepts within a domain, and the relationships between those concepts” – Wikipedia

Domain modeling: represents all the concepts involved in microbial typing by sequence-based methods

Provides a shared vocabulary, where the concepts should be unambiguous

Enables a machine-readable format that software and algorithms can use to interact automatically with multiple databases

Existing DBs reuse each other's datasets without true database interoperability: need for common ontologies (controlled vocabularies already exist but are not used by all)

Ontologies and computer-readable data formats (JSON-LD or RDF) can allow for true database interoperability, allowing bioinformaticians to extract the targeted information from a single query reaching multiple databases
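To make the JSON-LD idea concrete, here is a minimal Python sketch that builds a typing record whose field names are mapped to vocabulary IRIs through `@context`. The `http://example.org/...` IRIs and the field names are placeholders invented for this example; a real record would point at terms from a published ontology.

```python
import json

# Hypothetical vocabulary IRI (placeholder, not a real ontology).
VOCAB = "http://example.org/typing#"


def typing_record(isolate_id, species, scheme, sequence_type):
    """Build a JSON-LD document describing one typing result."""
    return {
        "@context": {
            "species": VOCAB + "species",
            "scheme": VOCAB + "typingScheme",
            "sequenceType": VOCAB + "sequenceType",
        },
        "@id": "http://example.org/isolate/" + isolate_id,
        "species": species,
        "scheme": scheme,
        "sequenceType": sequence_type,
    }


record = typing_record("iso-001", "Streptococcus pneumoniae", "MLST", 156)
print(json.dumps(record, indent=2))
```

Two databases that map their own field names to the same IRIs can be queried together even if their local column names differ, which is the interoperability argument made above.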

Trends Microbiol 17, 279–285 (2009).

GenEpiO: Combining Different Epi, Lab, Genomics and Clinical Data Fields.

Lab Analytics: Genomics, PFGE, Serotyping, Phage typing, MLST, AMR

Clinical Data: Patient demographics, Medical History, Comorbidities, Symptoms, Health Status

Reporting: Case/Investigation Status

GenEpiO (Genomic Epidemiology Application Ontology)

See draft version at https://github.com/Public-Health-Bioinformatics/IRIDA_ontology

Original slide from Emma Griffiths

Public Health Surveillance

Case Cluster Analysis

Result Reporting

Infectious Disease Epidemiology (from case to Intervention)

Lab Surveillance (from sample to strain typing results)

Evidence Collection & Outbreak Investigation

Sample Collection & Processing

Sequence Data Generation & Processing

Bioinformatics Analysis

Result Reporting

Whole Genome Sequencing (SO, ERO, OBI etc)

Quality Control (OBI, ERO)

Anatomy (FMA)

Environment (Envo)

Food (FoodOn)

Clinical Sampling (OBI)

Custom LIMS

Quality Control (OBI, ERO)

AMR (ARO)

Virulence (PATO)

Phylogenetic Clustering (EDAM)

Mobile Elements (MobiO)

Quality Control (OBI, ERO)

AMR (ARO) LOINC

Surveillance (SurvO)

Demographics (SIO)

Patient History (SIO)

Symptoms (SYMP)

Exposures (ExO)

Source Attribution (IDO)

Travel (IDO)

Transmission (TRANS)

Food (FoodOn)

Geography (OMRSE)

Outbreak Protocols

Surveillance (SurvO)

Food (FoodOn)

Surveillance (SurvO)

Mobile Elements (MobiO)

Infectious Disease (IDO)

Typing (TypON)

Nomenclature & Taxonomy (NCBItaxon)

Original slide from Emma Griffiths /IRIDA

http://foodontology.github.io/foodon/

(pipeline) NGSOnto

Available databases still lack interfaces for programmatic access: RESTful APIs would allow:
▪ easy automatic querying from scripts without the need of web interfaces or downloads
▪ database updates by authorized groups (distributed curation effort)

APIs : Application Programming Interfaces
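As a sketch of what "automatic querying from scripts" could look like, the Python below builds a GET request against a typing database. The base URL, the `isolates` resource and the filter names are all hypothetical (no real service is implied), and the request is only constructed, not sent.

```python
from urllib.parse import urlencode
from urllib.request import Request

# Hypothetical endpoint, used only for illustration.
BASE_URL = "https://typing-db.example.org/api"


def build_query(resource, **filters):
    """Build (but do not send) a GET request for a resource with filters."""
    query = urlencode(sorted(filters.items()))  # sorted for a stable URL
    url = f"{BASE_URL}/{resource}"
    if query:
        url += "?" + query
    return Request(url, headers={"Accept": "application/json"}, method="GET")


req = build_query("isolates", species="Streptococcus pneumoniae", st=156)
print(req.get_method(), req.full_url)
```

Sending it would be a call to `urllib.request.urlopen(req)`; the same pattern with `POST` or `PUT` and an authentication header would cover the distributed curation use case.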

Now we have thousands of targets for thousands of strains annotated with precious epidemiological data

Traditional phylogenetic analysis methods aren’t able to tackle the existing amount of information

Freely available / open source Java software

Calculates:
- goeBURST MST
- Hierarchical clustering
- Neighbour Joining

Can be easily applied to:
- MLST / cgMLST / wgMLST
- MLVA
- SNP data*
- Gene presence/absence
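To show the general idea behind a minimum spanning tree over allelic profiles, here is a minimal Python sketch using plain Kruskal's algorithm on pairwise Hamming distances. It is not goeBURST, which applies its own tie-break rules when several links have the same distance, and the four profiles are made-up data.

```python
from itertools import combinations


def hamming(a, b):
    """Number of loci at which two profiles differ."""
    return sum(x != y for x, y in zip(a, b))


def minimum_spanning_tree(profiles):
    """Kruskal's algorithm. profiles: dict name -> allele tuple.

    Returns a list of (name_a, name_b, distance) edges.
    """
    names = sorted(profiles)
    edges = sorted((hamming(profiles[a], profiles[b]), a, b)
                   for a, b in combinations(names, 2))
    parent = {n: n for n in names}

    def find(n):
        # Union-find root lookup with path halving.
        while parent[n] != n:
            parent[n] = parent[parent[n]]
            n = parent[n]
        return n

    tree = []
    for dist, a, b in edges:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:        # edge joins two separate components
            parent[root_a] = root_b
            tree.append((a, b, dist))
    return tree


profiles = {
    "ST1": (1, 1, 1, 1, 1, 1, 1),
    "ST2": (1, 1, 1, 1, 1, 1, 2),
    "ST3": (1, 1, 1, 1, 1, 3, 2),
    "ST4": (5, 5, 5, 5, 1, 1, 1),
}
print(minimum_spanning_tree(profiles))
# [('ST1', 'ST2', 1), ('ST2', 'ST3', 1), ('ST1', 'ST4', 4)]
```

Computing all pairwise distances is quadratic in the number of profiles, which is one reason scaling to thousands of strains and loci needs more careful implementations.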

https://online.phyloviz.net/

API:
* account creation
* profile + metadata upload
* running goeBURST
* retrieving a link

Private or Public data sharing

Scalable to thousands of nodes

Tree analysis tools:
- Interactive distance matrix
- NLV graph

Node.js / VivaGraph.js (WebGL)

Screenshot by @happy_khan, with Enterobase data https://enterobase.warwick.ac.uk/

Bruno Gonçalves (He also didn’t bring the glasses…)

• High Throughput Sequencing changed our views and ways to analyze bacterial populations and discriminate strains for outbreak investigation/surveillance purposes → Genomic Epidemiology

• Bioinformatics is the key item for global genomic epidemiology. Open-source and freely available tools provide the ability to build custom-made and verifiable pipelines.

• Real-time global data sharing can speed up outbreak investigations and save lives… however, some ethical/confidentiality issues need to be handled

• It is computationally challenging to analyze and query all the data produced. Most methods don’t scale well

• The future: isolation-free methods (metagenomics) are needed to speed up the analysis

Algorithms

Interfaces

Ontologies

UMMI Members: Bruno Gonçalves, Mickael Silva, Catarina Inês Mendes, Miguel Machado, Mário Ramirez, José Melo-Cristino

INESC-ID: Alexandre Francisco, Cátia Vaz, Marta Nascimento

EFSA INNUENDO Project (https://sites.google.com/site/innuendocon/): Mirko Rossi

BACGENTRACK project [FCT / Scientific and Technological Research Council of Turkey (Türkiye Bilimsel ve Teknolojik Araştırma Kurumu, TÜBİTAK), TUBITAK/0004/2014]

ONEIDA project, FCT Joint Activities Programme (PAC) - http://www.itqb.unl.pt/oneida

Genome Canada IRIDA project (www.irida.ca): Franklin Bristow, Thomas Matthews, Aaron Petkau, Morag Graham and Gary Van Domselaar (NML, PHAC); Ed Taboada and Peter Kruczkiewicz (Lab Foodborne Zoonoses, PHAC); Fiona Brinkman (SFU); William Hsiao (BCCDC)

INTEGRATED RAPID INFECTIOUS DISEASE ANALYSIS
