National Center for Emerging and Zoonotic Infectious Diseases

Under the Hood of WGS in PulseNet: Assessment of the Workflow

Heather Carleton, MPH, PhD
Bioinformatics/Metagenomics Team Lead, EDLB

PulseNet/OutbreakNet East Coast Regional Meeting
1/17/19


What’s under the hood: Calculation Engine

Read processing -> Assembly -> Allele calling

Assembly workflow assessment

• How do different trimming and assembly algorithms affect allele calling?
– Overview of BN assembly
– Comparison to different common assemblers and workflows (PulseNet International Technical Workgroup)
  • SKESA (NCBI wgMLST assembler)
  • SPAdes (multiple versions)
  • Velvet
• Why it’s important to use the standard BN assembly workflow – get the same answer from the BN database

BioNumerics Assembly

• Read trimming (quality trim)
• Assembly – SPAdes version 3.7
– Use the SPAdes --careful option
– BN CE maps reads back to the assembly and corrects or masks bases based on quality (annotated assembly)
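As a rough illustration of the assembly step only, the sketch below builds a paired-end SPAdes command line with the --careful option. It is not the BN calculation engine's internal command, and it does not reproduce the read-mapping step that corrects or masks bases. The function name, the file names, and the thread count are hypothetical.

```python
def build_spades_command(reads_1, reads_2, out_dir, threads=4):
    """Return an argument list for a paired-end SPAdes run with --careful.

    reads_1 / reads_2 are assumed to be already quality-trimmed FASTQ files.
    """
    return [
        "spades.py",
        "--careful",          # SPAdes option named on the slide
        "-1", reads_1,        # forward reads
        "-2", reads_2,        # reverse reads
        "-t", str(threads),   # number of threads
        "-o", out_dir,        # output directory
    ]


# Hypothetical file names, for illustration only.
cmd = build_spades_command(
    "sample_R1.trimmed.fastq.gz",
    "sample_R2.trimmed.fastq.gz",
    "sample_spades_out",
)
print(" ".join(cmd))
```

The list can be passed to `subprocess.run(cmd, check=True)` on a machine where SPAdes is installed. Pinning the SPAdes version matters here because the later slides compare SPAdes 3.7 with 3.11.1.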

Assembly Comparison: BN versus SKESA

SKESA vs. BN Assemblies: Comparing 250 Escherichia assemblies

[Figure: bar charts of BN average versus SKESA average in three panels. Assembly metrics: N50, NrContigs. Ambiguous bases: NrBasesN. Allele calls: NrConsensus, CorePercent.]
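The assembly metrics named on these charts (N50, NrContigs, NrBasesN) can be computed from a list of contig sequences. The sketch below is a minimal version under two assumptions: N50 is the length of the contig at which the cumulative length, counted from the longest contig down, first reaches half the assembly length, and ambiguous bases are counted as "N" only. BioNumerics may define its fields differently; the function name and toy contigs are hypothetical.

```python
def assembly_metrics(contigs):
    """Compute simple assembly metrics from a list of contig sequences."""
    lengths = sorted((len(c) for c in contigs), reverse=True)
    total = sum(lengths)

    # Walk from the longest contig down until half the assembly is covered.
    n50 = 0
    running = 0
    for length in lengths:
        running += length
        if running * 2 >= total:
            n50 = length
            break

    # Count ambiguous bases, here taken to mean the letter N only.
    n_bases = sum(c.upper().count("N") for c in contigs)

    return {
        "N50": n50,
        "NrContigs": len(lengths),
        "NrBasesN": n_bases,
        "Length": total,
    }


# Toy contigs of length 100, 50, 30 and 20 (not real data).
toy_contigs = ["ACGT" * 25, "ACGTN" * 10, "AC" * 15, "A" * 20]
metrics = assembly_metrics(toy_contigs)
print(metrics)
```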

Discrepant allele calls between assemblers

[Figure: bar chart of miscalls per locus for loci with >=2 miscalls, SKESA vs. SPAdes]

• 787 loci with discrepant allele calls
• 68 loci with 2 or more discrepant alleles
• 246 genomes have discrepant allele calls
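Counts like the three above can be derived by comparing two allele-call tables genome by genome. The sketch below is one hedged way to do it, assuming each table is a mapping of genome to locus to allele number, that a locus with no call in either assembly is skipped, and that "2 or more" means the locus is discrepant in at least two genomes. The slide does not specify these rules, and the names and toy values are hypothetical.

```python
from collections import Counter


def summarize_discrepancies(calls_a, calls_b):
    """Summarize allele-call disagreements between two assemblers."""
    per_locus = Counter()      # locus -> number of genomes with a mismatch
    genomes_affected = set()   # genomes with at least one mismatch

    for genome, profile_a in calls_a.items():
        profile_b = calls_b.get(genome, {})
        for locus, allele_a in profile_a.items():
            allele_b = profile_b.get(locus)
            if allele_a is None or allele_b is None:
                continue  # assumption: missing calls are not miscalls
            if allele_a != allele_b:
                per_locus[locus] += 1
                genomes_affected.add(genome)

    return {
        "discrepant_loci": len(per_locus),
        "loci_with_2_or_more": sum(1 for n in per_locus.values() if n >= 2),
        "genomes_with_discrepancies": len(genomes_affected),
    }


# Toy allele calls (not real data); None marks a locus with no call.
spades_calls = {
    "g1": {"locus_A": 2, "locus_B": 4, "locus_C": 1},
    "g2": {"locus_A": 16, "locus_B": 4, "locus_C": 1},
    "g3": {"locus_A": 1, "locus_B": 4, "locus_C": None},
}
skesa_calls = {
    "g1": {"locus_A": 1, "locus_B": 4, "locus_C": 1},
    "g2": {"locus_A": 1, "locus_B": 9, "locus_C": 1},
    "g3": {"locus_A": 1, "locus_B": 4, "locus_C": 7},
}
summary = summarize_discrepancies(spades_calls, skesa_calls)
print(summary)
```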

Discrepant calls – SPAdes versus SKESA

Locus     SPAdes -> SKESA
EC24359   2 -> 1; 16 -> 1, 2; 26 -> 1; 1 -> 36
EC09209   4 -> 92; 9 -> 1
EC09399   1 -> 6; 5 -> 6
EC27193   2 -> 1, 3; 1 -> 3
EC23264   79 -> 5; 9 -> 6
EC25998   30 -> 3; 2 -> 3; 5 -> 29; 29 -> 5
EC26376   4 -> 8, 1, 3; 5 -> 3; 8 -> 3; 16 -> 8, 3; 1 -> 25, 4
EC09110   46 -> 115; 63 -> 235; 26 -> 2; 2 -> 26; 12 -> 2, 13; 30 -> 83; 13 -> 2; 5 -> 23; 14 -> 30; 6 -> 12; 10 -> 14, 30
EC08855   2 -> 32, 110, 49, 20, 57, 27; 30 -> 2, 29; 8 -> 105, 18, 29
EC24168   1 -> 104, 4, 5; 104 -> 1; 4 -> 1, 6; 7 -> 10; 11 -> 10; 10 -> 11
EC09005   10 -> 1, 30; 17 -> 1; 15 -> 1; 11 -> 1; 49 -> 1
EC09118   12 -> 3; 3 -> 12; 38 -> 2, 5
EC08862   116 -> 23; 45 -> 1; 62 -> 23; 8 -> 261; 65 -> 8; 261 -> 8; 12 -> 25, 1; 25 -> 49, 1; 16 -> 25; 37 -> 34
EC09144   1 -> 9

Cluster comparison: 1704MAEXH-1 (O157:H7)

Cluster comparison: 1705NYEXK-1 (O121:H19)

SKESA vs. BN Assemblies: Comparing 113 Escherichia (Shigella) assemblies

• 150 bp vs 250 bp read sets
• 250 bp read set downsampled to match coverage of the 150 bp read set

N50 and Genome Length

[Figure: N50 and assembly length (Length) for SPAdes_150bp, SPAdes_250bp, SKESA_150bp, and SKESA_250bp]

Number of Allele Calls and Core Percentage

[Figure: CorePercent and NrConsensus for SPAdes_150bp, SPAdes_250bp, SKESA_150bp, and SKESA_250bp]

Genotyper Comparison - Pathotyper

[Figure: Pathotyper (percent) for SPAdes_150bp, SPAdes_250bp, SKESA_150bp, and SKESA_250bp]

• Shigella – IpaH or IpaD
• STEC – stx gene and aaiC, aggR, aatA, aap

Conclusions

• SKESA is a fast assembler, and relatively few allele-call discrepancies were seen between SKESA and BN assemblies

• SKESA does not perform well for genotyper-type analysis

Assembly Comparison: PulseNet International

Reproducibility studies

João André Carriço, Microbiology Institute and Instituto de Medicina Molecular, Faculty of Medicine, University of [email protected] twitter: @jacarrico

151 E. coli FASTQ read sets (outbreaks and population sampling)

• QA/QC and trimming (Trimmomatic)
• De novo assembly (SPAdes 3.7, SPAdes 3.11.1, SKESA 0.2)
• Assembly post-processing (Pilon)
• cgMLST allele calling (chewBBACA)
• Visualization (PHYLOViZ Online 2.0)

Each strain will have 12 assemblies (1812 total). These assemblies generate 66 pairwise comparisons of allelic profiles per strain.
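The counts follow from simple combinatorics. The sketch below assumes the 12 assemblies per strain are the combinations of three assemblers, trimming on or off, and Pilon correction on or off, which is how the legend later in the deck reads. The variable names are illustrative.

```python
from itertools import combinations, product
from math import comb

assemblers = ["SPAdes 3.7", "SPAdes 3.11.1", "SKESA 0.2"]
trimming = ["trimmed", "not trimmed"]
polishing = ["Pilon corrected", "not corrected"]

# Every assembler x trimming x polishing combination is one assembly.
conditions = list(product(assemblers, trimming, polishing))

# Every unordered pair of assemblies is one comparison.
pairs = list(combinations(conditions, 2))

n_strains = 151
total_assemblies = n_strains * len(conditions)

print(len(conditions))    # assemblies per strain
print(len(pairs))         # pairwise comparisons per strain
print(total_assemblies)   # assemblies across all strains
```

With 3 x 2 x 2 = 12 conditions, the number of unordered pairs is 12 x 11 / 2 = 66, and 151 strains give 151 x 12 = 1812 assemblies.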

Software used:

• Bowtie 2: http://bowtie-bio.sourceforge.net/bowtie2/index.shtml
• samtools: http://www.htslib.org/
• FastQC: http://www.bioinformatics.babraham.ac.uk/projects/fastqc/
• SPAdes: http://cab.spbu.ru/software/spades/
• Pilon: https://github.com/broadinstitute/pilon
• MLST 2: https://github.com/tseemann/mlst
• SKESA 0.2: https://ftp.ncbi.nlm.nih.gov/pub/agarwala/skesa/
• chewBBACA: https://github.com/B-UMMI/chewBBACA
• PHYLOViZ Online 2.0: http://online2.phyloviz.net

All packaged in Docker containers (https://www.docker.com/). All available at: https://hub.docker.com/u/ummidock/

https://github.com/assemblerflow/flowcraft/tree/flowcraft

Fully reproducible and scalable way to run data analysis pipelines: https://www.nextflow.io/

[Diagram: reads from SRA/ENA, QC trimmed or untrimmed, are assembled with SPAdes 3.11.1, SPAdes 3.7, and SKESA 0.2, yielding 12 FASTA assemblies per read set]

Using 200 CPUs: ~42 hours for a total of 151 × 12 = 1812 assemblies. Equivalent to ~43 seconds per assembly using 4 CPUs.

On a single computer with 4 CPUs, each assembly takes (on average) 40 minutes.

• Each X:
  • 1 fastq file – 1 strain
  • 66 comparisons between 12 assemblies
  • 2360 cgMLST loci pairwise comparison

[Figure: scale bar, 10 locus difference]

SRR5306590: 2325 (out of 2360) loci present in all assemblies
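A pairwise comparison of allelic profiles for one strain can be sketched as follows. This is not chewBBACA's own code. It assumes a profile is a mapping of locus to allele number, that the distance is the number of loci with different alleles, and that loci missing from either profile are ignored, which is one reading of restricting the comparison to loci present in all assemblies. The profile names and values are toy data.

```python
from itertools import combinations


def allelic_distance(profile_a, profile_b):
    """Count loci called in both profiles whose allele numbers differ."""
    distance = 0
    for locus, allele_a in profile_a.items():
        allele_b = profile_b.get(locus)
        if allele_a is None or allele_b is None:
            continue  # assumption: uncalled loci are not differences
        if allele_a != allele_b:
            distance += 1
    return distance


def pairwise_distances(profiles):
    """Return {(name_a, name_b): distance} for every pair of assemblies."""
    return {
        (a, b): allelic_distance(profiles[a], profiles[b])
        for a, b in combinations(sorted(profiles), 2)
    }


# Toy profiles for three assemblies of one strain (not real data).
profiles = {
    "spades3.7_trimmed_pilon": {"L1": 1, "L2": 5, "L3": 2, "L4": 7},
    "spades3.11.1_trimmed_pilon": {"L1": 1, "L2": 5, "L3": 3, "L4": 7},
    "skesa0.2_trimmed_pilon": {"L1": 1, "L2": 6, "L3": 3, "L4": None},
}
distances = pairwise_distances(profiles)
for pair, d in distances.items():
    print(pair, d)
print("max distance:", max(distances.values()))
```

With all 12 assemblies of a strain in `profiles`, the same call would return the 66 distances for that strain, and the maximum could be checked against a threshold such as 10 loci.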

[Figure legend. Assembler: SPAdes 3.11.1, SPAdes 3.7, SKESA 0.2. Trimming: trimmed, not trimmed. Post-processing: Pilon corrected, not corrected.]

305 genomes selected by Alison Cody

• 27371 comparisons
• Same as before + Velvet (thx Keith!)
• Only 505 with distance > 10

In this E. coli dataset, trimming, choice of assembler, and assembly post-processing have a huge influence on the allele calling process for a great number of strains

Strains with fewer allelic differences had better coverage (read depth), slightly higher N50, and fewer contigs, but no verified set of variables can explain why there is such variation

For reproducible comparisons, one has to share the raw data and use the same data analysis pipeline

For more information, contact CDC: 1-800-CDC-INFO (232-4636), TTY: 1-888-232-6348, www.cdc.gov

The findings and conclusions in this report are those of the authors and do not necessarily represent the official position of the Centers for Disease Control and Prevention.

Thank you!
