Targeted PacBio sequencing of wild
zebrafish immune gene families
Jaanus Suurväli
University of Cologne, Institute for Genetics
Leiden, 12 June 2018
Cyprinidae
~3000 species of cyprinids
~9-10% of all fish species
10 most harvested fish in the world
Source: Wikipedia
6/10 of the top fish all belong to the family Cyprinidae!
Zebrafish
• Cyprinid fish, genus Danio
• Model vertebrate for research
• Also popular as pets to keep in an aquarium
• The common lab strains are inbred and often have unclear
origins
• Reference genome: 1.4 Gb
• Average differences from the reference: 0.5%
Zebrafish immune genes
• Many unknowns even after decades of research
• Adaptive immune system
– MHC I and II unlinked (as in all teleosts)
– MHC loci scattered across the genome
• Innate immune system
– B30.2 domain attached to TRIMs and NLRs, both multiplied to hundreds of copies
– Hundreds of small Ig-based receptors
Closest neighbour graph of the NLRs.
Different colors mark different NLR subtypes.
Howe K, Schiffer P et al (2016). Open Biology 6:160009
What is an NLR?
• NOD-Like Receptor
• NACHT and Leucine-rich Repeats
• Nucleotide-binding domain and Leucine-rich Repeats (used mostly in plants)
B30.2
Fish NLRs
Figure adapted from the Invivogen website,
https://www.invivogen.com/review-nlr
FISNA (220 bp) - NACHT (500 bp) - helixes (1100 bp)
Hundreds of NLRs in fish
Tørresen, OK et al (2018) BMC Genomics 19: 1-240
Note: this table may still be an underestimate. NLR coding exons can be up to 100%
identical to each other, so short-read approaches cannot reliably distinguish them.
Long reads (PacBio or Nanopore) are required here, either with WGS or with target
enrichment.
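The note above can be illustrated with a toy calculation. The sketch below is not part of the original analysis; it assumes two pre-aligned, equal-length paralog sequences and asks what fraction of read-length windows are identical in both copies. The function name `ambiguous_window_fraction` and the toy sequences are hypothetical.

```python
def ambiguous_window_fraction(copy_a: str, copy_b: str, read_len: int) -> float:
    """Fraction of read-length windows that are identical in both copies.

    Assumption: the two copies are already aligned and have equal length.
    A read drawn from an identical window cannot be assigned to either copy.
    """
    if len(copy_a) != len(copy_b):
        raise ValueError("this sketch assumes equal-length, pre-aligned sequences")
    n_windows = len(copy_a) - read_len + 1
    if n_windows <= 0:
        raise ValueError("read_len exceeds sequence length")
    identical = sum(
        copy_a[i:i + read_len] == copy_b[i:i + read_len]
        for i in range(n_windows)
    )
    return identical / n_windows


# Toy data: two 2 kb "exons" that differ at a single position (index 1000).
exon_a = "ACGT" * 500
exon_b = exon_a[:1000] + "T" + exon_a[1001:]

short_read_fraction = ambiguous_window_fraction(exon_a, exon_b, 150)
full_length_fraction = ambiguous_window_fraction(exon_a, exon_b, 2000)
print(f"150 bp windows that cannot tell the copies apart: {short_read_fraction:.1%}")
print(f"2 kb windows that cannot tell the copies apart: {full_length_fraction:.1%}")
```

In this toy case most 150 bp windows are uninformative, while a read spanning the whole exon always covers the distinguishing site.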
Methods
The following protocol is based on Witek & Jones (2016). SMRT RenSeq protocol. Protocol Exchange. doi:10.1038/protex.2016.027
• Extract genomic DNA
• Perform target enrichment and multiplex
• Sequence on the PacBio Sequel
• Bait design:
– Custom 120 bp biotinylated baits with 2x coverage for all targeted regions
– Bait specificity was first tested in silico; nonspecific baits were excluded
– Targets:
• 400 exons of ~2 kb: FISNA-NACHT-helixes
• 600 exons of ~0.6 kb: B30.2
• All exons of the Class I and II MHC genes, TLRs, IFNs and selected other genes*
– Final baitset: ~19,500 baits (~16k of these unique)
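As a rough plausibility check on the bait numbers, here is a minimal tiling sketch. It assumes that "2x coverage" means 2x tiling (each base covered by two overlapping 120 bp baits, so a 60 bp step), which the slides do not state explicitly. The function `tile_baits` is hypothetical and is not the vendor's or the authors' design procedure.

```python
def tile_baits(target_len: int, bait_len: int = 120, tiling: int = 2) -> list:
    """Return (start, end) coordinates of baits tiled across one target.

    Assumption: "2x coverage" is interpreted as 2x tiling, i.e. a step of
    bait_len // tiling between consecutive bait starts.
    """
    if target_len <= bait_len:
        return [(0, target_len)]
    step = bait_len // tiling
    starts = list(range(0, target_len - bait_len + 1, step))
    # Add one final bait flush with the target end if the tiling stops short.
    if starts[-1] + bait_len < target_len:
        starts.append(target_len - bait_len)
    return [(s, s + bait_len) for s in starts]


# Target counts and sizes taken from the slide; MHC, TLR, IFN and other genes are not included.
baits_per_nlr_exon = len(tile_baits(2000))   # ~2 kb FISNA-NACHT-helixes exon
baits_per_b302_exon = len(tile_baits(600))   # ~0.6 kb B30.2 exon
estimated_baits = 400 * baits_per_nlr_exon + 600 * baits_per_b302_exon
print(baits_per_nlr_exon, baits_per_b302_exon, estimated_baits)
```

Under these assumptions the NLR and B30.2 exons alone account for most of the reported ~19,500 baits, before the remaining targets are added and nonspecific baits are removed.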
DNA extraction
Covaris shearing
DNA repair
Add amplification adapters and barcodes (NEBNext Ultra II)
Enrichment with hybridization baits (Arbor Biosciences)
Amplify the library to get 2-5 µg for Template Prep
5 kb SMRTbell Template Prep
PacBio Sequel
* All paralogs of: IRGs, GBPs, MX, NFKb, TLRs, RLRs, IFNs, PTGS
In addition, IL-1b, TNFa, DHX9, CTCF, IRF3, IRF5, IRF7, and a few others
Zebrafish samples
Collection of CHT samples has been previously described in: Whiteley et al (2011). Molecular Ecology 20:4259-76
The pie charts show population substructure of wild zebrafish,
calculated from RADseq data with the R package LEA.
The populations in the red circle are targeted for PacBio.
Coalescent tree built from
~4500 independent loci using SNAPP
Enrichment efficiency
4 fish per SMRTcell
> 95% of the reads successfully demultiplexed with lima
~60% of the data is on target
• 1.2 Gb data on average per fish (2.3 Gb with LR)
• 60% = 0.7 Gb on target per fish (1.4 Gb with LR)
• Zebrafish is diploid: 0.35 Gb per haploid genome copy (0.7 Gb with LR)
• 1200 targets, assuming an average length of 5 kb
• Coverage: ~60x per genome (120x with LR)
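The coverage estimate above is simple arithmetic and can be reproduced as follows. This is a sketch using only the numbers on the slide; the function name `expected_coverage` and its parameter names are hypothetical.

```python
def expected_coverage(yield_gb: float, on_target_fraction: float,
                      n_targets: int, mean_target_bp: int, ploidy: int = 2) -> float:
    """Expected per-haplotype coverage of the targeted regions."""
    on_target_bp = yield_gb * 1e9 * on_target_fraction   # bases that hit a target
    per_haplotype_bp = on_target_bp / ploidy             # split between the two genome copies
    target_space_bp = n_targets * mean_target_bp         # total size of the targeted space
    return per_haplotype_bp / target_space_bp


standard = expected_coverage(1.2, 0.6, 1200, 5000)   # 1.2 Gb per fish
with_lr = expected_coverage(2.3, 0.6, 1200, 5000)    # 2.3 Gb per fish ("with LR")
print(f"standard: ~{standard:.0f}x, with LR: ~{with_lr:.0f}x")
```

The unrounded inputs give about 60x and 115x. The slide's 120x follows from rounding the on-target yield up to 1.4 Gb first.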
Methods 2 (data analysis)
• Lima (demultiplexing)
• CCS (getting the circular consensus)
• BLASR (mapping the subreads)
• Canu (de novo assembly)
• Arrow (polishing of the assembly, variant calling)
• Mapping/aligning the assembled reads: blastn, minimap2
• Predicting protein domains: EMBOSS transeq, followed by hmmer3
• Multiple alignments: clustalo, mafft
• Trees: MEGA, RaxML
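The domain prediction step translates nucleotide contigs into protein before searching them with profile HMMs. The sketch below shows the idea of six-frame translation in plain Python. It is a stand-in for illustration only, not EMBOSS transeq, and it assumes the standard genetic code and unambiguous A/C/G/T input.

```python
from itertools import product

# Standard genetic code, with codons enumerated in TCAG order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def revcomp(seq: str) -> str:
    """Reverse complement of an uppercase DNA sequence."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def translate(seq: str) -> str:
    """Translate complete codons; unknown codons (e.g. containing N) become X."""
    return "".join(CODON_TABLE.get(seq[i:i + 3], "X")
                   for i in range(0, len(seq) - 2, 3))


def six_frames(seq: str) -> dict:
    """Return translations for frames +1..+3 and -1..-3."""
    seq = seq.upper()
    frames = {}
    for strand, s in (("+", seq), ("-", revcomp(seq))):
        for offset in range(3):
            frames[f"{strand}{offset + 1}"] = translate(s[offset:])
    return frames


frames = six_frames("ATGGCCTAA")
for name, protein in frames.items():
    print(name, protein)
```

Each of the six protein strings would then be passed to the HMM search, so that a domain is found regardless of the strand and frame in which the contig was assembled.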
Genetic variation
• Many de novo assembled contigs have > 99.5% identity to the reference
• This is exactly what would be expected from zebrafish.
• Getting information on haplotypes and heterozygous SNPs is a work in progress.
• CCS reads with >= 20 passes from a single wild fish were mapped to the reference with BLASR.
• The targeted-phasing-consensus approach described in the PB wiki was used to separate the haplotypes.
• This is the output for one of the NLRs on Chromosome 4, visualized in IGV (many others look similar).
• Many NLR-aligning reads get a superb MAPQ, yet still look like the above.
• 4 "haplotypes" were mapped to the same gene.
• Zebrafish is diploid, so this is not biologically possible for a single locus.
• Previously undescribed NLR copies, possibly from recent duplications?
• Looking at the data, some genomic NLRs have a mapping coverage of up to 700x.
• Others have no primary alignments from the data at all.
• Indication of strain-specific copy number variation?
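The reasoning in these bullets can be written down as a simple screen. The sketch below is not the analysis actually performed: it flags a reference locus when it carries more haplotypes than the ploidy allows or much more depth than expected, and reports loci with no reads at all. The function `classify_locus`, the labels, and the `excess_factor` threshold are hypothetical choices. The default expected depth of 60x is taken from the coverage estimate.

```python
def classify_locus(n_haplotypes: int, depth: float, expected_depth: float = 60.0,
                   ploidy: int = 2, excess_factor: float = 3.0) -> str:
    """Rough label for one reference gene based on phasing and depth.

    Assumption: excess_factor is an arbitrary illustrative threshold.
    """
    if depth == 0:
        return "absent"  # no primary alignments: gene may be missing in this fish
    if n_haplotypes > ploidy or depth > excess_factor * expected_depth:
        return "possible_extra_copies"  # reads from several paralogs piling up
    return "consistent_with_single_copy"


# Hypothetical loci as (number of phased haplotypes, mapping depth).
examples = {
    "nlr_with_four_haplotypes": (4, 120.0),
    "nlr_with_700x_depth": (2, 700.0),
    "nlr_without_reads": (0, 0.0),
    "ordinary_gene": (2, 55.0),
}
labels = {name: classify_locus(n, d) for name, (n, d) in examples.items()}
for name, label in labels.items():
    print(name, label)
```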
Identities: 99%, 97%, 94%, 96%
MHC haplotypes
McConnell et al (2016). PNAS 113(34): E5014–E5023.
AB strain
TU strain
CG strain
Four Class I U MHC sequences were assembled from wild fish KG35.
The closest matches in NCBI databases are shown with % identities:
UBA 83%, UKA 77%
UKA 87%, UBA 83%
UEA 92%, UIA 87%, UDA 82%
UIA 85%, UEA 81%, UDA 74%
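For reference, a percent identity such as those listed above can be computed from a pairwise alignment as sketched below. This is one common definition (matches over aligned, non-gap columns). It is an assumption that it matches the definition used by the search tool behind the slide's numbers, which may count gaps differently. The function name `percent_identity` is hypothetical.

```python
def percent_identity(aln_a: str, aln_b: str, gap: str = "-") -> float:
    """Percent identity over alignment columns where neither sequence has a gap."""
    if len(aln_a) != len(aln_b):
        raise ValueError("aligned sequences must have equal length")
    compared = 0
    matches = 0
    for a, b in zip(aln_a, aln_b):
        if a == gap or b == gap:
            continue  # gap columns are ignored under this definition
        compared += 1
        matches += (a == b)
    if compared == 0:
        raise ValueError("no aligned columns to compare")
    return 100.0 * matches / compared


# Toy alignment: 8 columns, 1 mismatch.
toy_identity = percent_identity("ACGTACGT", "ACGTTCGT")
print(f"{toy_identity:.1f}%")
```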
Non-reference NLRs in lab strains
98%
Conclusions
• We have established a pipeline for targeted sequencing of
zebrafish immune genes
• We can see variation from SNPs to new genes
• There are three types of possible "new genes" in our data:
– Genes that clearly differ from anything in the reference genome
– Cases of multiple genes mapping to a single gene in the reference with
high confidence (recent duplicates)
– MHC haplotypes, which in zebrafish can sometimes mean distinct sets
of genes
Plans and perspectives
– Sequence the (partial) immune repertoire of a total of 96 zebrafish
– Build a new reference for mapping the reads
– Get rid of PCR duplicates in the data
– Call all variation, including heterozygous
– Phase the data into haplotype blocks and use it for population genetics
Acknowledgements
University of Cologne, Cologne, Germany
Maria Leptin, Thomas Wiehe, Katja Palitzsch
Max Planck Genome Centre, Cologne, Germany
Bruno Hüttel
University of Montana, Missoula, MT, USA
Andrew Whiteley
University College London, London, UK
Philipp Schiffer
The Sainsbury Laboratory, Norwich, UK
Jonathan Jones, Kamil Witek, Oliver Furzer