National Center for Emerging and Zoonotic Infectious Diseases
Under the Hood of WGS in PulseNet: Assessment of the Workflow
Heather Carleton, MPH, PhD
Bioinformatics/Metagenomics Team Lead, EDLB
PulseNet/OutbreakNet East Coast Regional Meeting
1/17/19
Assembly workflow assessment
• How do different trimming and assembly algorithms affect allele calling?
  – Overview of BN assembly
  – Comparison with other common assemblers and workflows (PulseNet International Technical Workgroup)
    • SKESA (NCBI wgMLST assembler)
    • SPAdes (multiple versions)
    • Velvet
• Why it's important to use the standard BN assembly workflow: get the same answer from the BN database
BioNumerics Assembly
• Read trimming (quality trim)
• Assembly: SPAdes version 3.7
  – Use the SPAdes --careful option
  – BN CE maps reads back to the assembly and corrects or masks bases based on quality (annotated assembly)
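As a rough illustration of the assembly step only, the sketch below builds a `spades.py` command line with the `--careful` option for a pair of already-trimmed read files. The file names, output directory and thread count are hypothetical, and the BN CE trimming and read-mapping correction steps are not reproduced here.

```python
def spades_careful_command(reads_1, reads_2, out_dir, threads=4):
    """Build a spades.py command line for paired-end reads with --careful enabled.

    File names, output directory and thread count are illustrative assumptions;
    this is not the BioNumerics implementation.
    """
    return [
        "spades.py",
        "--careful",          # mismatch/short-indel correction mode named on the slide
        "-1", str(reads_1),   # forward reads
        "-2", str(reads_2),   # reverse reads
        "-t", str(threads),   # number of threads
        "-o", str(out_dir),   # output directory
    ]


if __name__ == "__main__":
    cmd = spades_careful_command(
        "sample_R1.trimmed.fastq.gz",
        "sample_R2.trimmed.fastq.gz",
        "sample_spades",
    )
    print(" ".join(cmd))
    # To run it for real (requires SPAdes on PATH):
    #   import subprocess; subprocess.run(cmd, check=True)
```

Building the command as a list keeps it safe to pass to `subprocess.run` without shell quoting.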
Assembly metrics
[Figure: bar charts comparing BN AVERAGE and SKESA AVE for N50 (axis 108000 to 124000) and NrContigs (axis 220 to 280).]
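For readers unfamiliar with the two metrics, here is a minimal sketch of how N50 and the contig count can be computed from a list of contig lengths. It uses the usual definition of N50 (the length of the shortest contig in the smallest set of longest contigs that covers at least half of the assembled bases); the example lengths are made up.

```python
def n50(contig_lengths):
    """Return the N50 of an assembly given its contig lengths."""
    lengths = sorted(contig_lengths, reverse=True)
    if not lengths:
        raise ValueError("no contigs given")
    half_total = sum(lengths) / 2
    running = 0
    for length in lengths:
        running += length
        # The first contig that pushes the running sum to half the total is the N50 contig.
        if running >= half_total:
            return length


def nr_contigs(contig_lengths):
    """Number of contigs in the assembly."""
    return len(contig_lengths)


if __name__ == "__main__":
    toy = [100, 50, 30, 20]  # hypothetical contig lengths
    print(n50(toy), nr_contigs(toy))
```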
Allele calls
[Figure: bar charts comparing BN AVERAGE and SKESA AVE for NrConsensus (axis 4350 to 4650) and CorePercent (axis 93 to 99).]
Discrepant allele calls between assemblers
[Figure: miscalls per locus (axis 0 to 70) for loci with >=2 miscalls, SKESA vs. SPAdes.]
• 787 loci with discrepant allele calls
• 68 loci with 2 or more discrepant alleles
• 246 genomes have discrepant allele calls
Discrepant calls – SPAdes versus SKESA
Allele changes are written as SPAdes -> SKESA.
Locus     Allele changes
EC24359   2 -> 1; 16 -> 1, 2; 26 -> 1; 1 -> 36
EC09209   4 -> 92; 9 -> 1
EC09399   1 -> 6; 5 -> 6
EC27193   2 -> 1, 3; 1 -> 3
EC23264   79 -> 5; 9 -> 6
EC25998   30 -> 3; 2 -> 3; 5 -> 29; 29 -> 5
EC26376   4 -> 8, 1, 3; 5 -> 3; 8 -> 3; 16 -> 8, 3; 1 -> 25, 4
EC09110   46 -> 115; 63 -> 235; 26 -> 2; 2 -> 26; 12 -> 2, 13; 30 -> 83; 13 -> 2; 5 -> 23; 14 -> 30; 6 -> 12; 10 -> 14, 30
EC08855   2 -> 32, 110, 49, 20, 57, 27; 30 -> 2, 29; 8 -> 105, 18, 29
EC24168   1 -> 104, 4, 5; 104 -> 1; 4 -> 1, 6; 7 -> 10; 11 -> 10; 10 -> 11
EC09005   10 -> 1, 30; 17 -> 1; 15 -> 1; 11 -> 1; 49 -> 1
EC09118   12 -> 3; 3 -> 12; 38 -> 2, 5
EC08862   116 -> 23; 45 -> 1; 62 -> 23; 8 -> 261; 65 -> 8; 261 -> 8; 12 -> 25, 1; 25 -> 49, 1; 16 -> 25; 37 -> 34
EC09144   1 -> 9
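A tally like the one above can be produced by comparing, genome by genome, the allele called from each assembler's output. The sketch below is one possible way to do that, not the method used for the slides: the data layout (`{genome: {locus: allele}}`), the genome and locus names, and the choice to skip loci with a missing call are all assumptions.

```python
from collections import defaultdict


def discrepant_calls(spades_profiles, skesa_profiles):
    """Count allele changes per locus between two sets of allelic profiles.

    Each argument maps genome -> {locus: allele number or None}.
    Returns {locus: {(spades_allele, skesa_allele): number_of_genomes}}.
    Loci with a missing call (None or absent) in either profile are skipped (assumption).
    """
    changes = defaultdict(lambda: defaultdict(int))
    for genome, spades_calls in spades_profiles.items():
        skesa_calls = skesa_profiles.get(genome, {})
        for locus, spades_allele in spades_calls.items():
            skesa_allele = skesa_calls.get(locus)
            if spades_allele is None or skesa_allele is None:
                continue
            if spades_allele != skesa_allele:
                changes[locus][(spades_allele, skesa_allele)] += 1
    return {locus: dict(pairs) for locus, pairs in changes.items()}


if __name__ == "__main__":
    # Hypothetical toy profiles, not data from the study.
    spades = {"g1": {"locus_A": 2, "locus_B": 7}, "g2": {"locus_A": 2, "locus_B": 7}}
    skesa = {"g1": {"locus_A": 1, "locus_B": 7}, "g2": {"locus_A": 1, "locus_B": None}}
    for locus, pairs in discrepant_calls(spades, skesa).items():
        print(locus, "; ".join(f"{a} -> {b} (n={n})" for (a, b), n in pairs.items()))
```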
SKESA vs. BN Assemblies
• Comparing 113 Escherichia (Shigella) assemblies
• 150 bp vs. 250 bp read sets
• 250 bp read set downsampled to match coverage of the 150 bp read set
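The slides do not say how the downsampling was done. As a sketch of the underlying arithmetic, assuming coverage is estimated as (number of reads × read length) / genome size, the fraction of 250 bp reads to keep is the ratio of the target coverage to the current coverage. The read counts and genome size below are hypothetical.

```python
def coverage(n_reads, read_length, genome_size):
    """Estimated depth of coverage: total sequenced bases divided by genome size."""
    return n_reads * read_length / genome_size


def downsample_fraction(current_coverage, target_coverage):
    """Fraction of reads to keep so that coverage drops to the target (never above 1)."""
    if current_coverage <= 0:
        raise ValueError("current coverage must be positive")
    return min(1.0, target_coverage / current_coverage)


if __name__ == "__main__":
    genome_size = 5_000_000                           # hypothetical genome size (bp)
    cov_150 = coverage(1_000_000, 150, genome_size)   # hypothetical 150 bp read set
    cov_250 = coverage(1_000_000, 250, genome_size)   # hypothetical 250 bp read set
    keep = downsample_fraction(cov_250, cov_150)
    print(f"150 bp: {cov_150:.0f}x, 250 bp: {cov_250:.0f}x, keep {keep:.0%} of 250 bp reads")
```

The resulting fraction could then be passed to whatever read-subsampling tool is in use.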
N50 and Genome Length
[Figure: N50 (axis 31000 to 38000) and Length (axis 4200000 to 5400000) for SPAdes_150bp, SPAdes_250bp, SKESA_150bp and SKESA_250bp.]
Number of Allele Calls and Core Percentage
[Figure: CorePercent (axis 95 to 100) and NrConsensus (axis 4060 to 4120) for SPAdes_150bp, SPAdes_250bp, SKESA_150bp and SKESA_250bp.]
Genotyper Comparison - Pathotyper
[Figure: Pathotyper (percent, axis 0 to 100) for SPAdes_150bp, SPAdes_250bp, SKESA_150bp and SKESA_250bp.]
• Shigella: IpaH or IpaD
• STEC: stx gene and aaiC, aggR, aatA, aap
Conclusions
• SKESA is a fast assembler, and there are relatively few discrepancies between SKESA and BN assemblies
• SKESA does not perform well for genotyper-type analysis
Reproducibility studies
João André Carriço, Microbiology Institute and Instituto de Medicina Molecular, Faculty of Medicine, University of [email protected] twitter: @jacarrico
Workflow:
1. 151 E. coli FASTQ read sets (outbreaks and population sampling)
2. QA/QC and trimming (Trimmomatic)
3. De novo assembly (SPAdes 3.7, SPAdes 3.11.1, SKESA 0.2)
4. Assembly post-processing (Pilon)
5. cgMLST allele calling (chewBBACA)
6. Visualization (PHYLOViZ Online 2.0)
Each strain will have 12 assemblies (1812 total).
These assemblies generate 66 pairwise comparisons of allelic profiles per strain.
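The counts above can be checked with a short sketch. It assumes the 12 assemblies per strain come from crossing three assemblers with two trimming options and two Pilon options, which is how the figure legends in this deck read; the labels are taken from those legends and the crossing itself is an interpretation.

```python
from itertools import combinations, product

ASSEMBLERS = ["spades 3.11.1", "spades 3.7", "skesa 0.2"]
TRIMMING = ["trimmed", "not trimmed"]
POLISHING = ["Pilon corrected", "not corrected"]
N_STRAINS = 151

# Every assembler x trimming x post-processing combination is one assembly per strain.
configurations = list(product(ASSEMBLERS, TRIMMING, POLISHING))

# Every unordered pair of assemblies of the same strain is one pairwise comparison.
comparisons = list(combinations(configurations, 2))

if __name__ == "__main__":
    print(len(configurations), "assemblies per strain")
    print(len(comparisons), "pairwise comparisons per strain")
    print(N_STRAINS * len(configurations), "assemblies in total")
```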
Tools used:
• Bowtie2: http://bowtie-bio.sourceforge.net/bowtie2/index.shtml
• samtools: http://www.htslib.org/
• FastQC: http://www.bioinformatics.babraham.ac.uk/projects/fastqc/
• SPAdes: http://cab.spbu.ru/software/spades/
• Pilon: https://github.com/broadinstitute/pilon
• MLST 2: https://github.com/tseemann/mlst
• SKESA 0.2: https://ftp.ncbi.nlm.nih.gov/pub/agarwala/skesa/
• chewBBACA: https://github.com/B-UMMI/chewBBACA
• PHYLOViZ Online 2.0: http://online2.phyloviz.net
All packaged in Docker containers (https://www.docker.com/)
All available at: https://hub.docker.com/u/ummidock/
https://github.com/assemblerflow/flowcraft/tree/flowcraft
Fully reproducible and scalable way to run data analysis pipelines
https://www.nextflow.io/
[Diagram: reads from SRA/ENA, with and without QC trimming, are assembled with spades 3.11.1, spades 3.7 and skesa 0.2, producing FASTA assemblies.]
Using 200 CPUs, ~42 hours for a total of 151*12 = 1812 assemblies. Equivalent to taking ~43 seconds per assembly using 4 CPUs. On a single computer with 4 CPUs, each assembly takes (on average) 40 minutes.
• Each X:
  – 1 fastq file, 1 strain
  – 66 comparisons between 12 assemblies
  – 2360 cgMLST loci per pairwise comparison
[Figure: pairwise comparisons for SRR5306590, with a reference line at a 10 locus difference. Legend: spades 3.11.1, spades 3.7, skesa 0.2; trimmed, not trimmed; Pilon corrected, not corrected.]
SRR5306590: 2325 (out of 2360) loci present in all assemblies
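One way to compute the locus differences plotted here is sketched below. It is not the code used for the slides: it assumes each assembly's cgMLST result is a mapping from locus to allele number (None when no allele was called), restricts the comparison to loci called in every assembly of the strain, and flags pairs above the 10 locus threshold. The assembly labels and profiles are made up.

```python
from itertools import combinations


def shared_loci(profiles):
    """Loci that have an allele call (not None) in every profile."""
    all_loci = set.intersection(*(set(p) for p in profiles.values()))
    return {locus for locus in all_loci
            if all(p[locus] is not None for p in profiles.values())}


def pairwise_distances(profiles):
    """Number of shared loci with different alleles, for every pair of assemblies."""
    loci = shared_loci(profiles)
    return {
        (a, b): sum(1 for locus in loci if profiles[a][locus] != profiles[b][locus])
        for a, b in combinations(sorted(profiles), 2)
    }


def pairs_over_threshold(distances, threshold=10):
    """Pairs of assemblies whose allelic distance exceeds the threshold."""
    return {pair: d for pair, d in distances.items() if d > threshold}


if __name__ == "__main__":
    # Hypothetical profiles for three assemblies of one strain.
    profiles = {
        "a": {"l1": 1, "l2": 2, "l3": None},
        "b": {"l1": 1, "l2": 5, "l3": 4},
        "c": {"l1": 1, "l2": 2, "l3": 4},
    }
    print(len(shared_loci(profiles)), "loci present in all assemblies")
    print(pairwise_distances(profiles))
```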
305 genomes selected by Alison Cody
• 27371 comparisons
• Same as before + Velvet (thx Keith!)
• Only 505 with distance > 10
In this E. coli dataset, trimming, choice of assembler and assembly post-processing have a huge influence on the allele calling process for a great number of strains
Strains with fewer allelic differences had better coverage (read depth), slightly higher N50 and fewer contigs, but no verified set of variables can explain why there is such variation
For reproducible comparisons, one has to share the raw data and use the same data analysis pipeline