
Targeted PacBio sequencing of wild zebrafish immune gene families

Jaanus Suurväli University of Cologne

Institute for Genetics

Leiden, 12. June 2018

Cyprinidae

~3000 species of cyprinids

~9-10 % of all fish species

10 most harvested fish in the world


Source: Wikipedia

6 of the 10 most harvested fish belong to the family Cyprinidae!

Zebrafish

• Cyprinid fish, genus Danio

• Model vertebrate for research

• Also popular as pets to keep in an aquarium

• The common lab strains are inbred and often have unclear origins

• Reference genome: 1.4 Gb

• Average differences from the reference: 0.5%

Zebrafish immune genes

• Many unknowns even after decades of research

• Adaptive immune system

– MHC I and II unlinked (as in all teleosts)

– MHC loci scattered across the genome

• Innate immune system

– B30.2 domain attached to TRIMs and NLRs, both multiplied to hundreds of copies

– Hundreds of small Ig-based receptors

Closest neighbour graph of the NLRs.

Different colors mark different NLR subtypes.

Howe K, Schiffer P et al (2016). Open Biology 6:160009

What is an NLR?

• NOD-Like Receptor

• NACHT and Leucine-rich Repeats

• Nucleotide-binding domain and Leucine-rich Repeats (used mostly in plants)

Figure: NLR domain structure, including fish NLRs and the B30.2 domain. Adapted from the Invivogen website, https://www.invivogen.com/review-nlr

FISNA (220 bp) - NACHT (500 bp) - helices (1100 bp)

Hundreds of NLRs in fish

Tørresen OK et al (2018). BMC Genomics 19:240

Note: this table may still be an underestimate. NLR coding exons can be up to 100% identical to each other, so short reads cannot reliably distinguish them. Long reads (PacBio or Nanopore) are required here, either with WGS or with target enrichment.

Methods

The following protocol is based on Witek & Jones (2016). SMRT RenSeq protocol. Protocol Exchange, doi:10.1038/protex.2016.027

• Extract genomic DNA

• Perform target enrichment and multiplex

• Sequence on the PacBio Sequel

• Bait design:

– Custom 120 bp biotinylated baits with 2x coverage for all targeted regions

– Bait specificity was first tested in silico; nonspecific baits were excluded

– Targets:

• 400 × ~2 kb exons: FISNA-NACHT-helices

• 600 × ~0.6 kb exons: B30.2

• All exons of the Class I and II MHC genes, TLRs, IFNs and selected other genes*

– Final baitset: ~19,500 baits (~16k of these unique)
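The bait numbers above can be sanity-checked with a small tiling calculation. The following is a minimal sketch, assuming that "2x coverage" means overlapping 120 bp baits placed every 60 bp; the function names are hypothetical, and the actual bait design (done with the vendor's tools and followed by in silico filtering) may have differed.

```python
def tile_baits(target_len, bait_len=120, coverage=2):
    """Return (start, end) coordinates of baits tiling one target.

    Assumption: baits are spaced bait_len // coverage apart, so most
    positions are covered by `coverage` baits.
    """
    if target_len <= bait_len:
        return [(0, target_len)]
    step = bait_len // coverage
    starts = list(range(0, target_len - bait_len + 1, step))
    # Add one final bait flush with the target end if the tiling stops short.
    if starts[-1] + bait_len < target_len:
        starts.append(target_len - bait_len)
    return [(s, s + bait_len) for s in starts]


def estimate_bait_count(targets, bait_len=120, coverage=2):
    """targets: list of (number_of_targets, target_length_bp) pairs."""
    return sum(n * len(tile_baits(length, bait_len, coverage))
               for n, length in targets)


# Only the two largest target classes from the slide; MHC, TLR, IFN and
# the other genes are not included in this rough estimate.
nlr_and_b30_2 = [(400, 2000), (600, 600)]
print(estimate_bait_count(nlr_and_b30_2))
```

Under these assumptions the NLR and B30.2 exons alone account for most of the ~19,500 baits in the final set, with the remainder left for the other targeted genes.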

Workflow:

DNA extraction

Covaris shearing

DNA repair

Add amplification adapters and barcodes (NEBNext Ultra II)

Enrichment with hybridization baits (Arbor Biosciences)

Amplify the library to get 2-5 µg for Template Prep

5 kb SMRTbell Template Prep

PacBio Sequel

* All paralogs of: IRGs, GBPs, MX, NFKb, TLRs, RLRs, IFNs, PTGS

In addition, IL-1b, TNFa, DHX9, CTCF, IRF3, IRF5, IRF7, and a few others

Zebrafish samples

Collection of CHT samples has been previously described in: Whiteley et al (2011). Molecular Ecology 20:4259-76

The pie charts show population substructure of wild zebrafish, calculated from RADseq data with the R package LEA.

The populations in the red circle are targeted for PacBio.

Coalescent tree built from ~4500 independent loci using SNAPP

Enrichment efficiency

4 fish per SMRTcell

> 95% of the reads successfully demultiplexed with lima

~60% of the data is on target

• 1.2 Gb data on average per fish (2.3 Gb with LR)

• 60% = 0.7 Gb on target per fish (1.4 Gb with LR)

• Zebrafish is diploid: 0.35 Gb per haploid genome copy (0.7 Gb with LR)

• 1200 targets, assuming an average length of 5 kb, give 6 Mb of target space

• Coverage: ~60x per haploid genome copy (120x with LR)
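The coverage estimate above is a chain of simple multiplications and divisions. A minimal sketch of that arithmetic follows, using the numbers from the slide; the function name and parameter names are made up for illustration.

```python
def per_haplotype_coverage(yield_gb, on_target_fraction, n_targets,
                           mean_target_bp, ploidy=2):
    """Expected coverage of each haploid copy of the target space."""
    on_target_bp = yield_gb * 1e9 * on_target_fraction
    target_space_bp = n_targets * mean_target_bp
    return on_target_bp / ploidy / target_space_bp


# 1.2 Gb per fish, 60% on target, 1200 targets of ~5 kb each
print(round(per_haplotype_coverage(1.2, 0.6, 1200, 5000)))  # 60
# 2.3 Gb per fish with LR
print(round(per_haplotype_coverage(2.3, 0.6, 1200, 5000)))  # 115
```

The LR figure comes out at about 115x, which the slide rounds to 120x by first rounding the on-target yield up to 1.4 Gb.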

Methods 2 (data analysis)

• Lima (demultiplexing)

• CCS (getting the circular consensus)

• BLASR (mapping the subreads)

• Canu (de novo assembly)

• Arrow (polishing of the assembly, variant calling)

• Mapping/aligning the assembled reads: blastn, minimap2

• Predicting protein domains: EMBOSS transeq, followed by HMMER3

• Multiple alignments: clustalo, mafft

• Trees: MEGA, RAxML
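Domain prediction on assembled contigs needs protein sequences, which is why a translation step precedes the HMMER search. The sketch below illustrates the idea of a six-frame translation in plain Python with the standard genetic code. It is not the EMBOSS transeq implementation, and its frame labels ("+1" to "-3") are an assumption that does not necessarily match transeq's frame numbering.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order.
BASES = "TCAG"
AMINO_ACIDS = ("FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRR"
               "IIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG")
CODON_TABLE = {"".join(codon): aa
               for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    return seq.upper().translate(COMPLEMENT)[::-1]


def translate(seq):
    """Translate from position 0; unknown codons (e.g. with N) become X."""
    seq = seq.upper()
    return "".join(CODON_TABLE.get(seq[i:i + 3], "X")
                   for i in range(0, len(seq) - 2, 3))


def six_frame_translation(seq):
    """Return a dict of frame label -> protein string for all six frames."""
    frames = {}
    for strand, s in (("+", seq.upper()), ("-", reverse_complement(seq))):
        for offset in range(3):
            frames[f"{strand}{offset + 1}"] = translate(s[offset:])
    return frames


for frame, protein in six_frame_translation("ATGGCCTAA").items():
    print(frame, protein)
```

Each translated frame can then be written to a FASTA file and searched against domain profiles, for example for NACHT or B30.2.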

Genetic variation

• Many de novo assembled contigs have > 99.5% identity to the reference

• This matches the ~0.5% average difference from the reference that is expected for zebrafish.

• Getting information on haplotypes and heterozygous SNPs is a work in progress.

• CCS reads with >= 20 passes from a single wild fish were mapped to the reference with BLASR.

• The targeted-phasing-consensus approach described in the PB wiki was used to separate the haplotypes.

• This is the output for one of the NLRs on Chromosome 4, visualized in IGV (many others look similar).

• Many NLR-aligning reads get a superb MAPQ, yet still look like the above.

• 4 "haplotypes" were mapped to the same gene.

• Zebrafish is diploid, so four haplotypes of a single gene are not biologically possible.

• Previously undescribed NLR copies, possibly from recent duplications?

• Looking at the data, some genomic NLRs have a mapping coverage of up to 700x.

• Others have no primary alignments from the data at all.

• Indication of strain-specific copy number variation?
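The reasoning above (more haplotypes than the ploidy allows implies that several gene copies were collapsed onto one reference gene) can be expressed as a simple check. This is a minimal sketch with made-up gene and haplotype identifiers; it assumes that phased haplotypes per reference gene are already available from an upstream step and does not reproduce the phasing itself.

```python
from collections import defaultdict


def flag_collapsed_loci(haplotype_calls, ploidy=2):
    """Return {gene: n_haplotypes} for genes with more haplotypes than ploidy.

    haplotype_calls: iterable of (reference_gene_id, haplotype_id) pairs
    from a single individual.
    """
    per_gene = defaultdict(set)
    for gene, haplotype in haplotype_calls:
        per_gene[gene].add(haplotype)
    return {gene: len(haps) for gene, haps in per_gene.items()
            if len(haps) > ploidy}


# Hypothetical calls for one fish: one NLR with four haplotypes,
# one gene with the two haplotypes expected for a diploid.
calls = [
    ("nlr_example_chr4", "hapA"), ("nlr_example_chr4", "hapB"),
    ("nlr_example_chr4", "hapC"), ("nlr_example_chr4", "hapD"),
    ("other_gene_example", "hapA"), ("other_gene_example", "hapB"),
]
print(flag_collapsed_loci(calls))
```

Genes flagged this way are candidates for recent duplicates that are missing from the reference, rather than true allelic variation.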

Identities: 99%, 97%, 94%, 96%

MHC haplotypes

McConnell et al (2016). PNAS 113(34): E5014–E5023.

Strains shown: AB, TU, CG

Four Class I U MHC sequences were assembled from wild fish KG35. The closest matches in NCBI databases are shown with % identities:

UBA 83%, UKA 77%

UKA 87%, UBA 83%

UEA 92%, UIA 87%, UDA 82%

UIA 85%, UEA 81%, UDA 74%

Non-reference NLRs in lab strains

98%

Conclusions

• We have established a pipeline for targeted sequencing of zebrafish immune genes

• We can see variation from SNPs to new genes

• There are three types of possible "new genes" in our data:

– Genes that clearly differ from anything in the reference genome

– Cases of multiple genes mapping to a single gene in the reference with high confidence (recent duplicates)

– MHC haplotypes, which in zebrafish can sometimes mean distinct sets of genes

Plans and perspectives

– Sequence the (partial) immune repertoire of a total of 96 zebrafish

– Build a new reference for mapping the reads

– Remove PCR duplicates from the data

– Call all variation, including heterozygous sites

– Phase the data into haplotype blocks and use them for population genetics

Acknowledgements

University of Cologne, Cologne, Germany

Maria Leptin, Thomas Wiehe, Katja Palitzsch

Max Planck Genome Centre, Cologne, Germany

Bruno Hüttel

University of Montana, Missoula, MT, USA

Andrew Whiteley

University College London, London, UK

Philipp Schiffer

The Sainsbury Laboratory, Norwich, UK

Jonathan Jones, Kamil Witek, Oliver Furzer
