Snippy Torsten Seemann Balti & Bioinformatics - Birmingham, UK - Tue 5 May 2015 Rapid bacterial variant calling & core genome alignments


Page 1: Snippy - Rapid bacterial variant calling - UK - tue 5 may 2015

Snippy

Torsten Seemann

Balti & Bioinformatics - Birmingham, UK - Tue 5 May 2015

Rapid bacterial variant calling & core genome alignments

Page 2: Snippy - Rapid bacterial variant calling - UK - tue 5 may 2015

Background

Page 3: Snippy - Rapid bacterial variant calling - UK - tue 5 may 2015

(Far) south east England

Page 4: Snippy - Rapid bacterial variant calling - UK - tue 5 may 2015

Phyloflagomics

UK / Birmingham
Australia / Victoria
Canada / British Columbia

Page 5: Snippy - Rapid bacterial variant calling - UK - tue 5 may 2015

A new home

Centre for Applied Microbial Genomics

Page 6: Snippy - Rapid bacterial variant calling - UK - tue 5 may 2015

Microbiological Diagnostic Unit

∷ Oldest public health lab in Australia
    : established 1897 in Melbourne
    : large historical isolate collection back to 1950s

∷ National reference laboratory
    : Salmonella, Listeria, EHEC

∷ WHO regional reference lab
    : vaccine preventable invasive bacterial pathogens

Page 7: Snippy - Rapid bacterial variant calling - UK - tue 5 may 2015

New director

∷ Professor Ben Howden
    : clinician, microbiologist, pathologist
    : early adopter of genomics and bioinformatics
    : long-term collaborator on MRSA/VRE with Tim Stinear

∷ Mandate
    : modernise service delivery
    : enhance research output and collaboration
    : nationally lead the conversion to WGS

Page 8: Snippy - Rapid bacterial variant calling - UK - tue 5 may 2015

Hardware

∷ Sequencers
    : NextSeq 500
    : 3 x MiSeq
    : PacBio RS II (arriving 22 May)

∷ Robots
    : Perkin Elmer (does not have a Twitter account)
    : Colony picker

∷ Compute
    : 240 TB, 10 GigE, 3 x 72 core boxes

Page 9: Snippy - Rapid bacterial variant calling - UK - tue 5 may 2015

Variant calling

Page 10: Snippy - Rapid bacterial variant calling - UK - tue 5 may 2015

Variant calling

∷ Find DNA differences between genomes
    : variants to explain phenotype
    : validate your complemented mutant

∷ Two approaches
    : reference-based (read alignment)
    : reference-free (de novo assembly / k-mer based)

Page 11: Snippy - Rapid bacterial variant calling - UK - tue 5 may 2015

Types of variants

∷ Substitutions
    : single nucleotide polymorphism (snp) A➝C
    : multiple nucleotide polymorphism (mnp) AG➝TC

∷ Indels
    : insertion (ins) A➝AC
    : deletion (del) ACCG➝AG

∷ Complex
    : compound events AC➝T
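
The five variant types can be told apart from the REF and ALT strings alone. The sketch below is a simplified rule set chosen to reproduce the labels used on these slides; it is an assumption for illustration, not the logic freebayes or Snippy actually uses. The function names are made up.

```python
def _one_block_apart(longer: str, shorter: str) -> bool:
    """True if deleting one contiguous run from `longer` yields `shorter`."""
    prefix = 0
    while prefix < len(shorter) and longer[prefix] == shorter[prefix]:
        prefix += 1
    suffix = 0
    while suffix < len(shorter) and longer[-1 - suffix] == shorter[-1 - suffix]:
        suffix += 1
    # The shared prefix and shared suffix must cover the whole shorter allele.
    return prefix + suffix >= len(shorter)


def classify_variant(ref: str, alt: str) -> str:
    """Label a REF -> ALT change as snp, mnp, ins, del or complex."""
    ref, alt = ref.upper(), alt.upper()
    if ref == alt:
        raise ValueError("REF and ALT are identical, so there is no variant")
    if len(ref) == len(alt):
        # Same length: a plain substitution only if every position differs.
        if all(r != a for r, a in zip(ref, alt)):
            return "snp" if len(ref) == 1 else "mnp"
        return "complex"
    longer, shorter = (alt, ref) if len(alt) > len(ref) else (ref, alt)
    if _one_block_apart(longer, shorter):
        return "ins" if len(alt) > len(ref) else "del"
    return "complex"


for ref, alt in [("A", "C"), ("AG", "TC"), ("A", "AC"), ("ACCG", "AG"), ("AC", "T")]:
    print(ref, alt, classify_variant(ref, alt))
```

Under this rule a same-length change with some matching bases (for example GATC➝AATA) is reported as complex rather than mnp.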

Page 12: Snippy - Rapid bacterial variant calling - UK - tue 5 may 2015

My solution

Page 13: Snippy - Rapid bacterial variant calling - UK - tue 5 may 2015

Snippy

∷ Fast → snappy

∷ Finds variants → SNPs

∷ Australian → Skippy the bush kangaroo

Page 14: Snippy - Rapid bacterial variant calling - UK - tue 5 may 2015

Input

∷ FASTQ files: paired end, interleaved, or single-end

∷ Reference: FASTA or GenBank

∷ Output folder: self-contained bundle of results

Page 15: Snippy - Rapid bacterial variant calling - UK - tue 5 may 2015

Inside the black box

∷ bwa mem - no clipping needed

∷ samtools - sorted, filtered BAM

∷ freebayes - split / GNU parallel / merge

∷ vcflib/vcftools - VCF filtering

∷ perl - glue
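
The "split / GNU parallel / merge" step implies cutting the reference into regions so that each region can be called independently. The sketch below only shows the split; the contig names and lengths are made up, and the `name:start-end` region string (0-based, half-open) is an assumption about what the variant caller expects, not something stated on the slides.

```python
def split_regions(contig_lengths, chunk_size):
    """Cut each contig into consecutive 0-based, half-open chunks."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    regions = []
    for name, length in contig_lengths.items():
        for start in range(0, length, chunk_size):
            end = min(start + chunk_size, length)  # last chunk may be shorter
            regions.append(f"{name}:{start}-{end}")
    return regions


# Hypothetical reference with one chromosome and one plasmid.
print(split_regions({"chr": 250, "plas": 80}, chunk_size=100))
```

Each region string could then be handed to a separate variant-calling job, and the per-region results merged in region order.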

Page 16: Snippy - Rapid bacterial variant calling - UK - tue 5 may 2015

Outputs

∷ Read alignments: .bam / .bai

∷ Variants: .vcf / .vcf.gz / .vcf.gz.tbi / .gff .bed .tab .csv .html

∷ Consensus: reference with all variants applied to it

∷ Genome alignment: reference with “-” (missing) and “N” (low depth)

Page 17: Snippy - Rapid bacterial variant calling - UK - tue 5 may 2015

TAB output

CHROM  POS     TYPE     REF   ALT   EVIDENCE        FTYPE  STRAND  NT_POS  AA_POS  LOCUS_TAG  GENE  PRODUCT
chr    5958    snp      A     G     G:44 A:0        CDS    +       41/600  13/200  ECO_0001   dnaA  replication protein DnaA
chr    35524   snp      G     T     T:73 G:1 C:1    tRNA   -
chr    45722   ins      ATT   ATTT  ATTT:43 ATT:1   CDS    -                       ECO_0045   gyrA  DNA gyrase
chr    100541  del      CAAA  CAA   CAA:38 CAAA:1   CDS    +                       ECO_0179         hypothetical protein
plas   619     complex  GATC  AATA  AATA:28 GATC:0
plas   3221    mnp      GA    CT    CT:39 GA:0      CDS    +                       ECO_p012   rep   hypothetical protein
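
A table like this is easy to consume from a script. The sketch below assumes the file is tab-separated with the header row shown on the slide; the excerpt is shortened to the first six columns, and the helper names are made up.

```python
import csv
import io
from collections import Counter

# Hypothetical excerpt: column names follow the slide, fields are tab-separated.
EXAMPLE_TAB = (
    "CHROM\tPOS\tTYPE\tREF\tALT\tEVIDENCE\n"
    "chr\t5958\tsnp\tA\tG\tG:44 A:0\n"
    "chr\t35524\tsnp\tG\tT\tT:73 G:1 C:1\n"
    "chr\t45722\tins\tATT\tATTT\tATTT:43 ATT:1\n"
)


def read_variants(tab_text):
    """Parse tab-separated variant rows into dicts keyed by column name."""
    return list(csv.DictReader(io.StringIO(tab_text), delimiter="\t"))


def parse_evidence(evidence):
    """Turn 'T:73 G:1 C:1' into {'T': 73, 'G': 1, 'C': 1}."""
    counts = {}
    for item in evidence.split():
        allele, depth = item.rsplit(":", 1)
        counts[allele] = int(depth)
    return counts


variants = read_variants(EXAMPLE_TAB)
type_counts = Counter(row["TYPE"] for row in variants)
print(type_counts)
print(parse_evidence(variants[1]["EVIDENCE"]))
```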

Page 18: Snippy - Rapid bacterial variant calling - UK - tue 5 may 2015

Phylogenomics

Page 19: Snippy - Rapid bacterial variant calling - UK - tue 5 may 2015
Page 20: Snippy - Rapid bacterial variant calling - UK - tue 5 may 2015

Phylogenetics 101

∷ Choose some genes

∷ Sequence each gene from each isolate

∷ Align the protein sequences of each gene

∷ Back-align to nucleotide space

∷ Concatenate all the alignments

∷ Construct a distance matrix (many ways)

∷ Draw a tree (many ways)

∷ Make wild inferences from little data

Page 21: Snippy - Rapid bacterial variant calling - UK - tue 5 may 2015

Phylogenomics 101

∷ Assemble each genome

∷ Perform whole genome alignment
    : in nucleotide space, as we don’t know what is coding
    : very computationally expensive
    : can’t parallelize as with individual genes

∷ Continue as for phylogenetics

Page 22: Snippy - Rapid bacterial variant calling - UK - tue 5 may 2015

Whole genome alignment

bug1   GATTACCAGCATTAAGG-TTCTCCAATC
bug2   GAT---CTGCATTATGGATTCRNCATTC
bug3   G-TTACCAGCACTAA-------CCAGTC

∷ Ideally, feed this directly to a tree builder

∷ Properly model gaps, codons and ambiguity

∷ Hard!

Page 23: Snippy - Rapid bacterial variant calling - UK - tue 5 may 2015

Core genome SNPs

Page 24: Snippy - Rapid bacterial variant calling - UK - tue 5 may 2015

Core genome

bug1   GATTACCAGCATTAAGG-TTCTCCAATC
bug2   GAT---CTGCATTATGGATTCRNCATTC
bug3   G-TTACCAGCACTAA-------CCAGTC
core   | |   |||||||||       ||||||

Core sites are present in all genomes.
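
As a minimal sketch of this definition, the core can be computed by keeping every alignment column in which no genome has a gap. The function name is made up; treating only "-" as absent (so that "N" still counts as present) is an assumption that matches the marks on the slide.

```python
ALIGNMENT = {
    "bug1": "GATTACCAGCATTAAGG-TTCTCCAATC",
    "bug2": "GAT---CTGCATTATGGATTCRNCATTC",
    "bug3": "G-TTACCAGCACTAA-------CCAGTC",
}


def core_sites(alignment, gap="-"):
    """Return the 1-based columns where no sequence has a gap."""
    seqs = list(alignment.values())
    if len({len(s) for s in seqs}) != 1:
        raise ValueError("all sequences must have the same aligned length")
    return [i + 1 for i, column in enumerate(zip(*seqs)) if gap not in column]


core = core_sites(ALIGNMENT)
print(len(core), core)
```

For the three example genomes this yields 17 core columns out of 28.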

Page 25: Snippy - Rapid bacterial variant calling - UK - tue 5 may 2015

Core SNPs

bug1   GATTACCAGCATTAAGG-TTCTCCAATC
bug2   GAT---CTGCATTATGGATTCRNCATTC
bug3   G-TTACCAGCACTAA-------CCAGTC
core   | |   |||||||||       ||||||
SNPs          |   |  |       |  |

Core SNPs = polymorphic sites in core genome

Page 26: Snippy - Rapid bacterial variant calling - UK - tue 5 may 2015

Unambiguous core SNPs

bug1   GATTACCAGCATTAAGG-TTCTCCAATC
bug2   GAT---CTGCATTATGGATTCRNCATTC
bug3   G-TTACCAGCACTAA-------CCAGTC
core   | |   |||||||||       ||||||
SNPs          |   |  |       |  |
SNPs'         |   |  |          |
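
The two SNP rows can be reproduced with a small filter over the alignment columns. This is a sketch under two assumptions taken from the slide's marks rather than from any stated rule: a core column counts as polymorphic if it contains more than one distinct character (so C/N/C counts), and "unambiguous" means every character is one of A, C, G, T. The function name is made up.

```python
ALIGNMENT = {
    "bug1": "GATTACCAGCATTAAGG-TTCTCCAATC",
    "bug2": "GAT---CTGCATTATGGATTCRNCATTC",
    "bug3": "G-TTACCAGCACTAA-------CCAGTC",
}


def core_snps(alignment, unambiguous_only=False):
    """Return 1-based core columns that are polymorphic."""
    snps = []
    for i, column in enumerate(zip(*alignment.values())):
        bases = set(column)
        if "-" in bases:   # a gap means the site is not core
            continue
        if len(bases) < 2:  # identical in every genome, so not a SNP
            continue
        if unambiguous_only and not bases <= set("ACGT"):
            continue        # drop columns containing N, R, etc.
        snps.append(i + 1)
    return snps


print(core_snps(ALIGNMENT))
print(core_snps(ALIGNMENT, unambiguous_only=True))
```

Column 23 (C / N / C) is the one site that is dropped when ambiguous characters are excluded.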

Page 27: Snippy - Rapid bacterial variant calling - UK - tue 5 may 2015

Allele sites

bug1   GATTACCAGCATTAAGG-TTCTCCAATC
bug2   GAT---CTGCATTATGGATTCRNCATTC
bug3   G-TTACCAGCACTAA-------CCAGTC
SNPs'         |   |  |          |

site   alleles (bug1, bug2, bug3)
1      ata
2      ttc
3      ata
4      atg
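
Pulling out just the allele sites gives one short sequence per genome. A minimal sketch, assuming the four unambiguous core SNP columns are already known (columns 8, 12, 15 and 26 of the example alignment); the function names are made up.

```python
ALIGNMENT = {
    "bug1": "GATTACCAGCATTAAGG-TTCTCCAATC",
    "bug2": "GAT---CTGCATTATGGATTCRNCATTC",
    "bug3": "G-TTACCAGCACTAA-------CCAGTC",
}
SNP_SITES = [8, 12, 15, 26]  # 1-based columns of the unambiguous core SNPs


def snp_alignment(alignment, sites):
    """Keep only the given 1-based columns from each sequence."""
    return {
        name: "".join(seq[pos - 1] for pos in sites)
        for name, seq in alignment.items()
    }


def to_fasta(sequences):
    """Format a name -> sequence mapping as FASTA text."""
    return "".join(f">{name}\n{seq}\n" for name, seq in sequences.items())


print(to_fasta(snp_alignment(ALIGNMENT, SNP_SITES)), end="")
```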

Page 28: Snippy - Rapid bacterial variant calling - UK - tue 5 may 2015

Alignment ⇢ Tree

>bug1
ATAA
>bug2
TTTT
>bug3
ACAG

   +------ bug3
   |
---+--- bug1
   |
   +--------- bug2

--- 1 SNP
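
One simple input to a tree builder is a matrix of pairwise SNP distances. The sketch below counts mismatching positions (Hamming distance) between the SNP sequences; this is only one of the "many ways" to build a distance matrix, and it stops short of drawing the tree. The function name is made up.

```python
from itertools import combinations

SNP_ALIGNMENT = {"bug1": "ATAA", "bug2": "TTTT", "bug3": "ACAG"}


def snp_distances(snp_aln):
    """Count differing positions for every pair of equal-length sequences."""
    distances = {}
    for (name_a, seq_a), (name_b, seq_b) in combinations(snp_aln.items(), 2):
        distances[(name_a, name_b)] = sum(x != y for x, y in zip(seq_a, seq_b))
    return distances


for pair, dist in snp_distances(SNP_ALIGNMENT).items():
    print(pair, dist)
```

Here bug1 and bug3 are the closest pair (2 SNPs), bug2 and bug3 the most distant (4 SNPs).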

Page 29: Snippy - Rapid bacterial variant calling - UK - tue 5 may 2015

The N±1 problem

Page 30: Snippy - Rapid bacterial variant calling - UK - tue 5 may 2015

Aligning to reference

∷ Why is whole genome alignment not used?
    : involves genome (mis)assembly
    : computationally difficult
    : expensive to add or remove isolates

∷ Short-cut
    : choose a single reference
    : align each isolate’s reads to the reference
    : core, by definition, must include the reference

Page 31: Snippy - Rapid bacterial variant calling - UK - tue 5 may 2015

Read mapping considerations

∷ Choice of reference

∷ Too divergent?
    : reads may not align well
    : will get too many core genome SNPs

∷ One solution
    : Assemble one isolate and use as the reference

Page 32: Snippy - Rapid bacterial variant calling - UK - tue 5 may 2015

Remove taxon, different core (1)

SNPs          |   |  |       |  |
core   | |   |||||||||       ||||||
bug1   GATTACCAGCATTAAGG-TTCTCCAATC
bug2   GAT---CTGCATTATGGATTCRNCATTC
bug3   G-TTACCAGCACTAA-------CCAGTC
core1  |||   ||||||||||| ||||||||||
SNPs1         |      |      ||  |

(core1 / SNPs1: bug3 removed)

Page 33: Snippy - Rapid bacterial variant calling - UK - tue 5 may 2015

Remove taxon, different core (2)

SNPs          |   |  |       |  |
core   | |   |||||||||       ||||||
bug1   GATTACCAGCATTAAGG-TTCTCCAATC
bug2   GAT---CTGCATTATGGATTCRNCATTC
bug3   G-TTACCAGCACTAA-------CCAGTC
core2  | |   |||||||||       ||||||
SNPs2         |   |  |       |  |

(core2 / SNPs2: bug1 removed)

Page 34: Snippy - Rapid bacterial variant calling - UK - tue 5 may 2015

Remove taxon, different core (3)

SNPs          |   |  |       |  |
core   | |   |||||||||       ||||||
bug1   GATTACCAGCATTAAGG-TTCTCCAATC
bug2   GAT---CTGCATTATGGATTCRNCATTC
bug3   G-TTACCAGCACTAA-------CCAGTC
core3  | |||||||||||||       ||||||
SNPs3             |             |

(core3 / SNPs3: bug2 removed)

Page 35: Snippy - Rapid bacterial variant calling - UK - tue 5 may 2015

Core genome alignments

∷ Core SNP alignments
    : can shift dramatically with taxa content
    : we are only using globally conserved sites
    : remember variation still exists outside “core”

∷ Snippy will keep the full alignments
    : quickly derive subsets on the fly
    : adding isolates can be done quickly too
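
Keeping the full alignment means the core and its SNPs can be recomputed for any subset of isolates without realigning. The sketch below drops each example genome in turn and recounts; it illustrates the idea only and is not Snippy's implementation. The function names are made up, and a core column counts as a SNP if it contains more than one distinct character (ambiguity codes included).

```python
ALIGNMENT = {
    "bug1": "GATTACCAGCATTAAGG-TTCTCCAATC",
    "bug2": "GAT---CTGCATTATGGATTCRNCATTC",
    "bug3": "G-TTACCAGCACTAA-------CCAGTC",
}


def core_and_snps(alignment):
    """Return (core columns, polymorphic core columns), both 1-based."""
    core, snps = [], []
    for i, column in enumerate(zip(*alignment.values())):
        if "-" in column:
            continue  # missing in at least one genome
        core.append(i + 1)
        if len(set(column)) > 1:
            snps.append(i + 1)
    return core, snps


def leave_one_out(alignment):
    """Map each dropped genome to (core size, SNP count) of the remainder."""
    summary = {}
    for dropped in alignment:
        subset = {n: s for n, s in alignment.items() if n != dropped}
        core, snps = core_and_snps(subset)
        summary[dropped] = (len(core), len(snps))
    return summary


print(leave_one_out(ALIGNMENT))
```

With all three genomes the core has 17 sites and 5 SNPs; dropping bug2 grows the core to 20 sites but leaves only 2 SNPs, while dropping bug3 gives 24 sites and 5 SNPs.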

Page 36: Snippy - Rapid bacterial variant calling - UK - tue 5 may 2015

Conclusion

Page 37: Snippy - Rapid bacterial variant calling - UK - tue 5 may 2015

Snippy summary

∷ The good
    : Fast, scales to 100 cores
    : Simple, clean interface and output

∷ The bad
    : Doesn’t yet report full variant consequences using snpEff

∷ The ugly?
    : Written in Perl

Page 38: Snippy - Rapid bacterial variant calling - UK - tue 5 may 2015

Contact

∷ tseemann.github.io

∷ github.com/tseemann/snippy

∷ @torstenseemann