Experience of using BWA-mem & GATK HaplotypeCaller for Variant Calling in Multiple Rat Strains

Wim Spee, Bio-informatics Engineer, Cuppen Group, Hubrecht Institute

Content

● Project overview: Euratrans (FP7)
● Pipeline and data overview
● NGS alignment and variant calling
● BAC alignment and variant calling
● NGS and BAC genotype concordance
● Heterozygosity in inbred species
● Conclusion & discussion

Euratrans (FP7)

● European consortium for large-scale functional genomics in the rat for translational research

● Rat is a popular model organism
● Multiple homozygous inbred disease model strains have been set up
  – Example: SHR = Spontaneously hypertensive rat
  – Set up by traditional breeding on phenotype, before NGS was established

Pipeline and Data Overview

Raw reads = WGS SOLiD fragment (50 bp) and paired-end (50 bp x 35 bp)

Mapping = BWA 0.5.9 (colorspace)

Duplicate marking = Picard MarkDuplicates

Local realignment = GATK IndelRealigner

BQSR = N/A on BWA-mapped SOLiD data

Call variants = GATK HaplotypeCaller (multi-sample)

VQSR = SNP array and top 33% of INDELs as truth variant call sets

Variant evaluation = precision and recall against 13 aligned Sanger-based contigs (2.1 Mb, aligned with BWA-MEM)

NGS Alignment and Variant Calling

● “Best practice” BWA-Picard-GATK pipeline
  – GATK HaplotypeCaller for variant calling
  – Variant Quality Score Recalibration (VQSR) using SNP array and top 33% of INDELs

HaplotypeCaller (HC) (Theory)

● Local de novo assembly based variant caller
  – Calls SNP, INDEL, MNP and small SV simultaneously
  – Removes mapping artifacts
  – More sensitive and accurate than the Unified Genotyper (UG)
  – Physical phasing of variants
  – Used to run on geological timescales
  – Now runs on practical timescales (v2.6.3 via Queue on SGE cluster)
    ● 2 days for multi-sample variant calling on 10 SOLiD WGS rat strains

Slides 7 to 10: taken from a Broad Institute presentation

Variant Quality Score Recalibration (VQSR)

● Use known true variants to dynamically set a cutoff between true positive and false positive calls
  – True positives will cluster together with the known variants and false positives will mainly be in a separate cluster
  – Alternative to setting manual hard cutoffs (e.g. coverage = 20, quality = 50)
● Known true variants for the rat species:
  – SNP: 500,000 high quality positions from a SNP array
  – INDEL: no external set available, used top 33% (by QUAL) of the call set
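
The INDEL truth set described above, the top 33% of calls ranked by QUAL, could be selected roughly as follows. This is a minimal sketch and not the script used in the project: the record layout (dicts with a `qual` key), the function name `top_fraction_by_qual` and the rounding-up rule are assumptions made for illustration.

```python
import math


def top_fraction_by_qual(records, fraction=0.33):
    """Return the highest-QUAL share of a call set, best call first."""
    if not 0 < fraction <= 1:
        raise ValueError("fraction must be in (0, 1]")
    ranked = sorted(records, key=lambda rec: rec["qual"], reverse=True)
    # Assumption: round up, so a small call set still yields at least one record.
    keep = math.ceil(len(ranked) * fraction)
    return ranked[:keep]


# Hypothetical INDEL calls; only the QUAL value matters for the selection.
indels = [
    {"pos": 101, "qual": 10.0},
    {"pos": 250, "qual": 500.0},
    {"pos": 377, "qual": 30.0},
    {"pos": 512, "qual": 250.0},
    {"pos": 640, "qual": 90.0},
    {"pos": 803, "qual": 40.0},
]

truth_candidates = top_fraction_by_qual(indels)
print([rec["pos"] for rec in truth_candidates])
```

Because the truth set is drawn from the same call set that VQSR is then asked to score, the selection is circular to some degree, which is one reason the INDEL results later in the talk are marked as preliminary.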

VQSR (SNP): Plots

VQSR (SNP): Truth Sensitivity Tranches (two slides)

VQSR (INDEL): Plots

VQSR (INDEL): Truth Sensitivity Tranches?

Which Tranche to Take?
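
The idea behind a truth sensitivity tranche can be sketched in a few lines: choose the score cutoff that still retains a target fraction of the calls found in the truth set, then apply that cutoff to every call. The code below is a simplified illustration under that assumption, not GATK's implementation; the names `tranche_cutoff` and `passing_calls` and the `score` field are hypothetical.

```python
import math


def tranche_cutoff(truth_scores, target_sensitivity):
    """Lowest score that still keeps the target fraction of truth-set calls."""
    if not truth_scores:
        raise ValueError("need at least one truth-set score")
    if not 0 < target_sensitivity <= 1:
        raise ValueError("target_sensitivity must be in (0, 1]")
    ranked = sorted(truth_scores, reverse=True)
    # Small epsilon guards against floating-point noise pushing ceil() up by one.
    keep = max(1, math.ceil(target_sensitivity * len(ranked) - 1e-9))
    return ranked[keep - 1]


def passing_calls(calls, cutoff):
    """Keep every call whose recalibrated score reaches the cutoff."""
    return [call for call in calls if call["score"] >= cutoff]


# Hypothetical recalibrated scores of calls that overlap the truth set.
truth_scores = [5.0, 3.0, 1.0, -2.0]
cutoff = tranche_cutoff(truth_scores, 0.75)

# Hypothetical full call set scored on the same scale.
calls = [
    {"id": "a", "score": 4.2},
    {"id": "b", "score": 0.5},
    {"id": "c", "score": 1.0},
    {"id": "d", "score": -3.1},
]
print(cutoff, [call["id"] for call in passing_calls(calls, cutoff)])
```

A higher target sensitivity lowers the cutoff and lets more false positives through, so the choice of tranche is a precision versus recall trade-off. When the truth set itself is uncertain, as with the INDEL set here, the sensitivity figure is less meaningful.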

BAC Contig Alignment and Variant Calling

● 13 BACs for rat strain LE, ca. 150 kb per BAC, 2.1 Mb in total
● BAC contig alignment
  – BWA-MEM
● BAC contig variant calling
  – GATK Unified Genotyper
● BAC & NGS alignment and variant calls in IGV

BAC Contig Alignment: BWA-MEM

● BWA-MEM*
  – New long read & contig aligner from Heng Li
    ● 70 bp to a few Mb
  – Can switch between end-to-end and local alignment
    ● Supports structural events detection from long reads and contigs
  – Outputs standard SAM, which converts directly to BAM
    ● Useful for downstream processing

* Aligning sequence reads, clone sequences and assembly contigs with BWA-MEM (arXiv:1303.3997v2)

BWA-MEM Settings

● Seed length
  – 400 bp
● Banded alignment (space to search for optimal alignment)
  – 5000 positions
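
As a sketch of how these two settings might be passed to the aligner: `bwa mem` has a `-k` option for the minimum seed length and a `-w` option for the band width. That the slide's values correspond to exactly these two options is an assumption, and the file names below are placeholders. The snippet only builds the command line and does not run it.

```python
import shlex


def bwa_mem_command(reference, contigs, seed_length=400, band_width=5000):
    """Build a `bwa mem` argument list for aligning long contigs."""
    return [
        "bwa", "mem",
        "-k", str(seed_length),  # minimum seed length
        "-w", str(band_width),   # band width for banded alignment
        reference,
        contigs,
    ]


# Placeholder file names; substitute the real reference and BAC contig FASTA.
cmd = bwa_mem_command("reference.fa", "bac_contigs.fa")
print(shlex.join(cmd))
```

To actually run the alignment, the list can be handed to `subprocess.run` with standard output redirected to a SAM file.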

GATK UG Settings for Variant Calling on Aligned BAC

● --genotype_likelihoods_model BOTH
● -stand_call_conf 0
● -stand_emit_conf 0
● -indelGapContinuationPenalty 30
● -indelGapOpenPenalty 60
● -minIndelCnt 1
● -L BACToBedMerged.bed
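
For orientation, the settings above could be assembled into a full Unified Genotyper call along these lines. The option names and values are copied from the slide as written; the jar path, reference, BAM and output names are placeholders, and the `java -jar ... -T UnifiedGenotyper -R ... -I ... -o ...` layout assumes the GATK 2.x command-line style. The snippet builds the argument list only.

```python
# Options exactly as listed on the slide.
UG_SETTINGS = [
    ("--genotype_likelihoods_model", "BOTH"),
    ("-stand_call_conf", "0"),
    ("-stand_emit_conf", "0"),
    ("-indelGapContinuationPenalty", "30"),
    ("-indelGapOpenPenalty", "60"),
    ("-minIndelCnt", "1"),
    ("-L", "BACToBedMerged.bed"),
]


def unified_genotyper_command(gatk_jar, reference, bam, out_vcf, settings=UG_SETTINGS):
    """Build a GATK 2.x style UnifiedGenotyper argument list."""
    cmd = [
        "java", "-jar", gatk_jar,
        "-T", "UnifiedGenotyper",
        "-R", reference,
        "-I", bam,
        "-o", out_vcf,
    ]
    for flag, value in settings:
        cmd.extend([flag, value])
    return cmd


# Placeholder paths for illustration only.
ug_cmd = unified_genotyper_command(
    "GenomeAnalysisTK.jar", "reference.fa", "bac_contigs.bam", "bac_calls.vcf"
)
print(" ".join(ug_cmd))
```

Setting both confidence thresholds to 0 and the minimum INDEL count to 1 makes sense for this input: each BAC contig is a single assembled sequence, so every difference from the reference is supported by exactly one "read".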

BAC & NGS Alignment and Calls

BAC & NGS Alignment and Calls (zoomed out)

BAC Multiple Local Alignment

BAC Deletion vs. Reference

Unknown Reference Sequence

Mismatch Between BAC and Reference

SOLiD Low Mapping Quality Regions

NGS and BAC Genotype Concordance

● Genotype concordance:
  – GATK module to compare 2 VCF files
● Input:
  – NGS call set restricted to BAC region
  – BAC call set
● Filters used:
  – VQSR on (NGS)
  – SNP cluster (3 SNPs in a 10 bp window) (NGS and BAC)
  – No known repeat regions (NGS and BAC)
  – No NGS LE low quality mapping regions (NGS and BAC)
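
The SNP cluster filter in the list above can be illustrated with a small sliding-window check. This is a sketch of the idea, not the GATK filter itself: it assumes that "3 SNPs in a 10 bp window" means three SNPs whose first and last positions are less than 10 bp apart, and the function names are hypothetical.

```python
from collections import defaultdict


def clustered_snps(snps, cluster_size=3, window=10):
    """Return the set of (chrom, pos) SNPs that belong to a cluster."""
    by_chrom = defaultdict(list)
    for chrom, pos in snps:
        by_chrom[chrom].append(pos)

    flagged = set()
    for chrom, positions in by_chrom.items():
        positions = sorted(set(positions))
        # Slide over every run of `cluster_size` neighbouring SNPs.
        for i in range(len(positions) - cluster_size + 1):
            group = positions[i:i + cluster_size]
            if group[-1] - group[0] < window:
                flagged.update((chrom, p) for p in group)
    return flagged


def remove_clusters(snps, cluster_size=3, window=10):
    """Drop every SNP that is part of a cluster, keeping input order."""
    flagged = clustered_snps(snps, cluster_size, window)
    return [snp for snp in snps if snp not in flagged]


# Hypothetical SNP positions.
snps = [
    ("chr1", 100), ("chr1", 104), ("chr1", 109),  # three SNPs within 10 bp
    ("chr1", 200), ("chr1", 205),                 # only two SNPs, kept
    ("chr2", 100), ("chr2", 101),                 # only two SNPs, kept
]
print(remove_clusters(snps))
```

In the comparison described here the same filter is applied to both the NGS and the BAC call set, so a clustered site is removed from both sides rather than counted as a miss.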

Precision and Recall (SNP)

Current: Rnor 5.0, GATK HaplotypeCaller, VQSR 99.5, BWA-MEM

| No clusters | No repeats | No low quality | Match | NGS only | BAC only | Precision | Recall |
|---|---|---|---|---|---|---|---|
|   |   |   | 2309 | 14 | 1747 | 99.40% | 56.93% |
| x |   |   | 2270 | 22 | 1157 | 99.04% | 66.24% |
| x | x |   | 2231 | 18 | 811 | 99.20% | 73.34% |
| x | x | x | 1944 | 10 | 192 | 99.49% | 91.01% |

Comparison (SNP):

| Call set | Precision | Recall |
|---|---|---|
| Rnor 3.4, modified samtools, BLAT, no cluster, repeat and low qual. Same LE SOLiD data set and BAC. | 97.30% | 82.80% |
| Rnor 3.4, GATK UG, simulated reads from BAC. Illumina LE data set and same BAC. Additional filters unknown. | 99.62% | 91.90% |
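
The precision and recall figures follow directly from the Match, NGS only and BAC only counts, with the BAC calls treated as truth: precision = Match / (Match + NGS only) and recall = Match / (Match + BAC only). The sketch below shows one way to derive the three counts from two call sets and reproduces the first row of the SNP table. Keying calls on `(chrom, pos, ref, alt)` is an assumption for illustration, not necessarily how the GATK concordance module matches records.

```python
def classify(ngs_calls, bac_calls):
    """Count shared and private calls between two call sets."""
    ngs, bac = set(ngs_calls), set(bac_calls)
    return {
        "match": len(ngs & bac),
        "ngs_only": len(ngs - bac),
        "bac_only": len(bac - ngs),
    }


def precision_recall(match, ngs_only, bac_only):
    """Precision and recall of the NGS calls, with the BAC calls as truth."""
    precision = match / (match + ngs_only)
    recall = match / (match + bac_only)
    return precision, recall


# Hypothetical calls keyed on (chrom, pos, ref, alt).
ngs = [("chr1", 10, "A", "G"), ("chr1", 50, "C", "T"), ("chr1", 90, "G", "A")]
bac = [("chr1", 10, "A", "G"), ("chr1", 50, "C", "T"), ("chr1", 70, "T", "C")]
counts = classify(ngs, bac)
print(counts)

# First row of the SNP table: Match 2309, NGS only 14, BAC only 1747.
precision, recall = precision_recall(2309, 14, 1747)
print(f"{precision:.2%} {recall:.2%}")
```

Each added filter removes far more BAC only sites than matches, which is why recall climbs from about 57% to 91% while precision stays near 99%.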

Precision and Recall (INDEL) Preliminary!

Current: Rnor 5.0, GATK HaplotypeCaller, VQSR 99.3, BWA-MEM

| No repeats | No low quality | GT mismatch | Match | NGS only | BAC only | Precision | Recall |
|---|---|---|---|---|---|---|---|
|   |   | 12 | 329 | 109 | 795 | 75.11% | 29.27% |
| x |   | 7 | 287 | 64 | 469 | 81.77% | 37.96% |
| x | x | 1 | 182 | 28 | 100 | 86.67% | 64.54% |

Comparison (INDEL):

| Call set | Precision | Recall |
|---|---|---|
| Rnor 3.4, modified samtools, BLAT, no cluster, repeat and low qual. Same LE SOLiD data set and BAC. | 97.80% | 58.60% |
| Rnor 3.4, GATK UG, simulated reads from BAC. Illumina LE data set and same BAC. Additional filters unknown. | 96.25% | 89.02% |

Precision and Recall (INDEL) Improvement

● Include 2 Illumina sequenced strains in HC variant calling
● And/or include INDEL calls based on 2 Illumina sequenced strains in VQSR
  – Intersection between SOLiD and Illumina INDEL calls as truth set?
● Better selection of truth INDELs from current call set?

Heterozygosity

● ~10% of LE true positive calls (vs. BAC) are heterozygous
  – Remaining heterozygosity?
  – Paralogous regions?
  – Other mapping artifacts?
  – Bias of GATK HC towards diploid heterozygous species?
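
A heterozygous share like the ~10% quoted above can be computed from the genotype (GT) strings of the matching calls. The sketch below assumes VCF-style diploid GT values such as `0/1` or `1|1` and leaves uncalled genotypes out of the denominator; the function names and the sample data are hypothetical.

```python
def is_heterozygous(gt):
    """True if a VCF-style GT string such as '0/1' carries two different alleles."""
    alleles = gt.replace("|", "/").split("/")
    return len(set(alleles)) > 1


def heterozygous_fraction(genotypes):
    """Share of called genotypes that are heterozygous (no-calls are skipped)."""
    called = [gt for gt in genotypes if "." not in gt]
    if not called:
        return 0.0
    hets = sum(1 for gt in called if is_heterozygous(gt))
    return hets / len(called)


# Hypothetical genotypes of true positive calls in one inbred strain.
genotypes = ["1/1"] * 9 + ["0/1"]
print(heterozygous_fraction(genotypes))
```

For a fully inbred strain the expected value is close to zero, so a fraction near 10% points towards the candidate explanations listed on the slide.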

Conclusions & Discussion

● External data sets are very useful (SNP array & BAC for VQSR and genotype concordance)
● GATK HaplotypeCaller works
  – Better than samtools based variant calling
  – To really compare with GATK UG, run on Illumina LE strain sample
    ● SOLiD reads too short to benefit from GATK HC?
  – How to improve INDEL VQSR with no external truth set?
  – How to handle heterozygous calls in inbred species?
● BWA-MEM works
  – GATK UG can call SNPs / INDELs on aligned BACs
  – Visualization in IGV
  – How to call SVs?

Acknowledgment

Cuppen Group at the Hubrecht Institute