Germline variant calling and joint genotyping: Applying the joint discovery workflow with HaplotypeCaller + GenotypeGVCFs


Page 1

Germline variant calling and joint genotyping

Applying the joint discovery workflow with HaplotypeCaller + GenotypeGVCFs

Page 2

You are here in the GATK Best Practices workflow for germline variant discovery

[Figure: GATK Best Practices workflow diagram in three phases: Data Pre-processing >> Variant Discovery >> Callset Refinement. Data Pre-processing: Raw Reads, Map to Reference (BWA mem, non-GATK), Mark Duplicates & Sort (Picard, non-GATK), Indel Realignment, Base Recalibration, Analysis-Ready Reads. Variant Discovery: Var. Calling (HC in ERC mode), Genotype Likelihoods, Joint Genotyping, Raw Variants (SNPs, Indels), Variant Recalibration (separately per variant type), Analysis-Ready Variants (SNPs & Indels). Callset Refinement: Genotype Refinement, Variant Annotation, Variant Evaluation (look good? use in project / troubleshoot).]

Page 3

A scalable workflow for joint variant discovery

New GVCF workflow solves both problems, yields same results

Incremental over time + scalable over sample size

Page 4

Tools involved in the workflow

•  Identify potential variants in each sample

➔ HaplotypeCaller

•  Perform joint genotyping on the cohort

➔ GenotypeGVCFs

Page 5

Key HaplotypeCaller features

What it does:
•  Calls SNP and indel variants simultaneously
•  Performs local re-assembly to identify haplotypes
•  Reference confidence model enables detection of low frequency variants
•  Joint-discovery workflow (reference confidence model, GVCFs)
•  Handles RNAseq natively
•  Handles non-diploid organisms and pooled samples

What it doesn't do:
•  Somatic variant calling (use MuTect2 instead!)

Page 6

How HaplotypeCaller works in 4 simple steps

Page 7

Step 1: Identify ActiveRegions

•  Sliding window along the reference
•  Count mismatches, indels and soft clips

Ø Measure of entropy

Over threshold: Trigger "ActiveRegion" to be processed
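The sliding-window idea on this slide can be sketched roughly as follows. This is a simplified illustration, not HaplotypeCaller's actual implementation: the per-position activity values, the window size and the threshold are all hypothetical, and a plain windowed average stands in for the entropy measure mentioned above.

```python
def find_active_regions(activity, window=5, threshold=0.2):
    """Return half-open (start, end) intervals whose smoothed activity exceeds threshold.

    activity: hypothetical per-position score, e.g. the fraction of reads
    showing a mismatch, indel or soft clip at that reference position.
    """
    n = len(activity)
    half = window // 2
    flags = []
    for pos in range(n):
        lo, hi = max(0, pos - half), min(n, pos + half + 1)
        score = sum(activity[lo:hi]) / (hi - lo)  # windowed average
        flags.append(score > threshold)

    # Merge consecutive flagged positions into regions.
    regions, start = [], None
    for pos, flag in enumerate(flags):
        if flag and start is None:
            start = pos
        elif not flag and start is not None:
            regions.append((start, pos))
            start = None
    if start is not None:
        regions.append((start, n))
    return regions


# Hypothetical activity profile: a quiet stretch with a noisy cluster in the middle.
profile = [0.0] * 10 + [0.9, 0.8, 0.9] + [0.0] * 10
print(find_active_regions(profile))
```

Only positions inside a flagged interval would go on to the assembly step; everything else is treated as inactive.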

Page 8

Step 2: Assemble plausible haplotypes

•  Local realignment via graph assembly

•  Traverse graph to collect most likely haplotypes

•  Align haplotypes to ref using Smith-Waterman

Likely haplotypes + candidate variant sites

Can make HC output the reassembled reads and selected haplotypes using the -bamOut parameter

Page 9

Example assembly graph produced by HaplotypeCaller

•  Previous alignments are ignored
•  K-mers consist of every possible sequence combination based on the reads
•  Most likely paths through the graph are scored
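To make the graph idea concrete, here is a minimal sketch of a k-mer (de Bruijn style) graph built from a reference and a few reads, followed by path enumeration. The sequences, the k value and the scoring (sum of edge counts along a path) are assumptions chosen for illustration; the real assembler is considerably more involved.

```python
from collections import defaultdict


def build_kmer_graph(sequences, k):
    """Edges connect overlapping k-mers; edge weight counts supporting sequences."""
    graph = defaultdict(lambda: defaultdict(int))
    for seq in sequences:
        for i in range(len(seq) - k):
            graph[seq[i:i + k]][seq[i + 1:i + k + 1]] += 1
    return graph


def enumerate_haplotypes(graph, source, sink):
    """Depth-first walk from source to sink k-mer, best-supported path first."""
    results = []

    def walk(node, seq, support, seen):
        if node == sink:
            results.append((seq, support))
            return
        for nxt, weight in graph.get(node, {}).items():
            if nxt not in seen:  # guard against cycles
                walk(nxt, seq + nxt[-1], support + weight, seen | {nxt})

    walk(source, source, 0, {source})
    return sorted(results, key=lambda item: -item[1])


# Hypothetical data: a reference plus two reads carrying an A->T change.
reference = "GATTACAGGC"
variant_read = "GATTTCAGGC"
k = 4
graph = build_kmer_graph([reference, variant_read, variant_read], k)
for haplotype, support in enumerate_haplotypes(graph, reference[:k], reference[-k:]):
    print(haplotype, support)
```

Each path through the graph spells out one candidate haplotype; the alignment of the original reads plays no role in building it.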

Page 10

Graph assembly recovers indels and removes artifacts

[Figure: two panels. NA12878 original read data: multiple caller artifacts that are hard to filter out, since they are well supported by read data. HaplotypeCaller (validated).]

Page 11

Graph assembly resolves complexity caused by mapper limitations

[Figure: original BWA alignments, with Reference, Consensus and Reads tracks.]

Can be represented by the mapper two different ways, at random:

[+A][T->C]

[T->A][+C]

HaplotypeCaller will settle on one representation -> cleaner output call
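A small sketch shows why the two representations are interchangeable: applied to the same reference base, both spell the same sequence. The flanking reference bases and the coordinate convention here are hypothetical, chosen only to make the example runnable.

```python
def apply_edits(reference, edits):
    """Apply (position, ref_allele, alt_allele) edits to a reference string.

    Edits are applied from right to left so earlier positions stay valid.
    An empty ref_allele denotes an insertion before that position.
    """
    seq = reference
    for pos, ref, alt in sorted(edits, reverse=True):
        if seq[pos:pos + len(ref)] != ref:
            raise ValueError(f"reference allele mismatch at {pos}")
        seq = seq[:pos] + alt + seq[pos + len(ref):]
    return seq


reference = "GGTCC"  # hypothetical context around the reference T (index 2)

# [+A][T->C]: insert A before the T, then substitute T with C.
first = apply_edits(reference, [(2, "", "A"), (2, "T", "C")])

# [T->A][+C]: substitute T with A, then insert C after it.
second = apply_edits(reference, [(2, "T", "A"), (3, "", "C")])

print(first, second, first == second)
```

Since both edit sets produce the same haplotype, a caller that works at the haplotype level has no reason to report them as different events.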

Page 12

Bonus perk of haplotype calling: free physical phasing

Two new sample-level annotations, PID (for phase identifier) and PGT (phased genotype)

Page 13

Step 3: Score haplotypes using PairHMM

•  Calculate haplotype likelihoods given the read
   –  PairHMM aligns each read to each haplotype

Likelihood of the haplotype given reads

Page 14

PairHMM uses base qualities to score alignments

PairHMM states:
(M) Match
(Ix) Insertion
(Iy) Deletion

Transition probabilities (derived from BQSR):
(ε) = Gap continuation
(δ) = Gap open penalty
(1 - ε) = Base precedes an insertion or a deletion
(1 - 2δ) = Base matches and continues

-> likelihoods of the haplotypes given the reads -> store in matrix

[Figure: matrix with Haplotypes on one axis and Reads on the other.]

Aij = probability of haplotype vs read
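As a rough illustration of the scoring step, the sketch below runs a forward algorithm over the three states listed on this slide (M, Ix, Iy) and returns the probability of a read given a haplotype, summed over all alignments. It is a teaching sketch under several assumptions: the gap-open (δ) and gap-continuation (ε) values are fixed constants rather than derived from BQSR, a mismatching base is charged one third of the base error probability, inserted read bases carry no emission cost, and the read may start anywhere on the haplotype.

```python
def pairhmm_likelihood(read, quals, haplotype, gap_open=0.001, gap_cont=0.1):
    """Forward-algorithm estimate of P(read | haplotype)."""
    n, m = len(read), len(haplotype)
    err = [10 ** (-q / 10) for q in quals]  # Phred quality -> error probability

    M = [[0.0] * (m + 1) for _ in range(n + 1)]   # match state
    Ix = [[0.0] * (m + 1) for _ in range(n + 1)]  # insertion in the read
    Iy = [[0.0] * (m + 1) for _ in range(n + 1)]  # deletion in the read

    # Let the read start at any haplotype position with equal weight.
    for j in range(m + 1):
        Iy[0][j] = 1.0 / m

    for i in range(1, n + 1):
        for j in range(1, m + 1):
            same = read[i - 1] == haplotype[j - 1]
            emit = 1.0 - err[i - 1] if same else err[i - 1] / 3.0
            M[i][j] = emit * (
                (1 - 2 * gap_open) * M[i - 1][j - 1]
                + (1 - gap_cont) * (Ix[i - 1][j - 1] + Iy[i - 1][j - 1])
            )
            Ix[i][j] = gap_open * M[i - 1][j] + gap_cont * Ix[i - 1][j]
            Iy[i][j] = gap_open * M[i][j - 1] + gap_cont * Iy[i][j - 1]

    # The read may end anywhere on the haplotype.
    return sum(M[n][j] + Ix[n][j] for j in range(1, m + 1))


# Hypothetical read scored against two candidate haplotypes.
read, quals = "ACGT", [30, 30, 30, 30]
haplotypes = ["TTACGTTT", "TTAGGTTT"]
row = [pairhmm_likelihood(read, quals, hap) for hap in haplotypes]
print(row)
```

Running this for every read against every haplotype fills in the read-by-haplotype matrix described above.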

Page 15

Step 4: Genotype each sample at each potential variant site

•  Determine most likely alleles for each sample
•  Based on support for haplotypes (from PairHMM)
•  Evaluated over reads from each sample

Genotype calls for each sample

Page 16

Transforming support for haplotypes into support for alleles

Reference:   ATCGATCATAGCTAGCTGCG
Haplotype 1: ATCGA-CATAGCTAGCTGCG
Haplotype 2: ATGGATCATAGCTTGCTGCG
Haplotype 3: ATCGA-CATAGCTTGCTGCG

Likelihoods of the haplotypes given the reads*:

         Haplotypes
Reads    R      1      2      3
1        0.01   0.02   0.03   0.04
2        0.09   0.06   0.07   0.08
3        0.10   0.11   0.01   0.02

Take highest probability of haplotypes given reads that contain the allele (for each variant position)

         Alleles
Reads    -      T
1        0.04   0.03
2        0.08   0.09
3        0.11   0.10

*These numbers are made up to give a sense of how the process works. In reality the numbers would be much smaller.
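The reduction from haplotypes to alleles can be reproduced in a few lines. The sketch below uses the slide's made-up numbers and assumes the haplotype columns are ordered R, 1, 2, 3; for the deletion site, haplotypes 1 and 3 carry the "-" allele while the reference and haplotype 2 carry "T". The function and variable names are illustrative only.

```python
# Read-by-haplotype likelihoods from the slide (made-up values).
read_by_haplotype = {
    1: {"R": 0.01, "1": 0.02, "2": 0.03, "3": 0.04},
    2: {"R": 0.09, "1": 0.06, "2": 0.07, "3": 0.08},
    3: {"R": 0.10, "1": 0.11, "2": 0.01, "3": 0.02},
}

# Which allele each haplotype carries at the deletion site.
allele_of_haplotype = {"R": "T", "1": "-", "2": "T", "3": "-"}


def allele_likelihoods(read_by_haplotype, allele_of_haplotype):
    """For each read, keep the best haplotype likelihood among those carrying each allele."""
    table = {}
    for read, row in read_by_haplotype.items():
        best = {}
        for haplotype, likelihood in row.items():
            allele = allele_of_haplotype[haplotype]
            best[allele] = max(best.get(allele, 0.0), likelihood)
        table[read] = best
    return table


for read, row in allele_likelihoods(read_by_haplotype, allele_of_haplotype).items():
    print(read, row["-"], row["T"])
```

The output matches the read-by-allele table on the slide: 0.04/0.03, 0.08/0.09 and 0.11/0.10.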

Page 17

And finally, a bit of Bayesian math

Determines the most likely genotype of the sample at each site where there is evidence of variation

Bayesian model

4 SNP calling

4.1 Simple genotype likelihoods for presentations

$$\Pr\{G \mid D\} = \frac{\Pr\{G\}\,\Pr\{D \mid G\}}{\sum_i \Pr\{G_i\}\,\Pr\{D \mid G_i\}} \quad \text{[Bayes' rule]}$$

$$\Pr\{D \mid G\} = \prod_j \left( \frac{\Pr\{D_j \mid H_1\}}{2} + \frac{\Pr\{D_j \mid H_2\}}{2} \right) \quad \text{where } G = H_1 H_2$$

$\Pr\{D \mid H\}$ is the haploid likelihood function

Slide callouts: $\Pr\{G\}$ is the prior of the genotype; $\Pr\{D \mid G\}$ is the likelihood of the genotype; the division by 2 reflects the diploid assumption.

4.1.1 SNP haploid likelihood

$$\Pr\{D_j \mid H\} = \Pr\{D_j \mid b\} \quad \text{[single base pileup]}$$

$$\Pr\{D_j \mid b\} = \begin{cases} 1 - \epsilon_j & D_j = b, \\ \epsilon_j & \text{otherwise.} \end{cases}$$

4.1.2 Indel haploid likelihood

$$\Pr\{D_j \mid H\} = \sum_{\text{alignments } \pi \text{ of } D_j \text{ to } H} \Pr\{D_j, \pi\}$$

4.2 Genotype likelihoods

$$\Pr\{D_i \mid GT_i\} = \prod_j \Pr\{D_{i,j} \mid GT_i\}$$

$$\Pr\{D_{i,j} \mid GT_i = AB\} = \left( \Pr\{D_{i,j} \mid A\} + \Pr\{D_{i,j} \mid B\} \right) / 2$$

$$\Pr\{D_{i,j} \mid B\} = \begin{cases} 1 - \epsilon_{i,j} & D_{i,j} = B, \\ \epsilon_{i,j} \cdot \Pr\{B \text{ is true} \mid D_{i,j} \text{ is miscalled}\} & \text{otherwise.} \end{cases}$$

Just plug in the numbers!

         Alleles
Reads    -      T
1        0.04   0.03
2        0.08   0.09
3        0.11   0.10
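Plugging the made-up read-by-allele numbers into the diploid likelihood and Bayes' rule can be sketched as below. The flat genotype prior is an assumption made purely for illustration (the slide does not give prior values), and the function name is hypothetical.

```python
from itertools import combinations_with_replacement
from math import prod

# Read-by-allele likelihoods from the slide (made-up values).
allele_likelihood = {
    1: {"-": 0.04, "T": 0.03},
    2: {"-": 0.08, "T": 0.09},
    3: {"-": 0.11, "T": 0.10},
}


def genotype_posteriors(allele_likelihood, alleles, priors=None):
    """Pr{G|D} for every diploid genotype, using Pr{D|G} = prod_j (Pr{Dj|H1} + Pr{Dj|H2}) / 2."""
    genotypes = list(combinations_with_replacement(alleles, 2))
    if priors is None:
        # Assumption: flat prior over genotypes.
        priors = {g: 1.0 / len(genotypes) for g in genotypes}
    likelihood = {
        g: prod((row[g[0]] + row[g[1]]) / 2 for row in allele_likelihood.values())
        for g in genotypes
    }
    evidence = sum(priors[g] * likelihood[g] for g in genotypes)
    return {g: priors[g] * likelihood[g] / evidence for g in genotypes}


posteriors = genotype_posteriors(allele_likelihood, ["-", "T"])
best = max(posteriors, key=posteriors.get)
print(best, posteriors)
```

With these toy numbers and a flat prior, the homozygous deletion genotype comes out slightly ahead of the heterozygous one; with a realistic prior and many more reads the separation would be much sharper.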

Page 18

HaplotypeCaller recap: reads in / variants out

BAM -> VCF

This is all you need for a single sample or traditional multi-sample analysis

Page 19

For joint discovery: emit GVCF + add joint genotyping step

•  Run HC in GVCF mode to emit GVCF

•  Run GenotypeGVCFs to re-genotype samples with multi-sample model

Page 20

GVCF includes <NON_REF> allele + genotype likelihoods for joint genotyping

Symbolic allele stands for all non-called but possible non-reference alleles

[Figure callout: end pos of hom-ref band]

Page 21

GVCFs are valid VCFs with extra information

Page 22

Multiple GVCFs combined form a squared-off matrix of genotypes

Page 23

The joint discovery workflow in practice

Analysis-ready BAM file (one per sample) -> HaplotypeCaller -> Raw gVCF* file (one per sample) -> GenotypeGVCFs -> Raw VCF file

HaplotypeCaller, run once per sample:

    java -jar GenomeAnalysisTK.jar \
        -T HaplotypeCaller \
        -R human.fasta \
        -I sample1.bam \
        -o sample1.g.vcf \
        [-L exome_targets.intervals \]
        -ERC GVCF

GenotypeGVCFs, run once on all samples together:

    java -jar GenomeAnalysisTK.jar \
        -T GenotypeGVCFs \
        -R human.fasta \
        -V sample1.g.vcf \
        -V sample2.g.vcf \
        -V sampleN.g.vcf \
        -o output.vcf

If > 200 samples, combine in batches first using CombineGVCFs
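For large cohorts, the batching step could be scripted along these lines. This is a sketch only: it builds GATK 3 style CombineGVCFs command lines in the same form as the commands above but does not run them, treats 200 as the batch size (the slide only gives 200 as the threshold above which to batch), and the output names batchN.g.vcf are hypothetical.

```python
def combine_gvcfs_commands(gvcfs, reference="human.fasta", batch_size=200):
    """Return one CombineGVCFs argument list per batch of per-sample GVCFs."""
    commands = []
    for number, start in enumerate(range(0, len(gvcfs), batch_size), 1):
        batch = gvcfs[start:start + batch_size]
        args = ["java", "-jar", "GenomeAnalysisTK.jar",
                "-T", "CombineGVCFs", "-R", reference]
        for gvcf in batch:
            args += ["-V", gvcf]
        args += ["-o", f"batch{number}.g.vcf"]  # hypothetical output name
        commands.append(args)
    return commands


# Hypothetical cohort of 450 per-sample GVCFs.
samples = [f"sample{i}.g.vcf" for i in range(1, 451)]
for args in combine_gvcfs_commands(samples):
    print(" ".join(args[:9]), "...", " ".join(args[-2:]))
```

The combined batch files would then be passed to GenotypeGVCFs with `-V`, in place of the individual per-sample GVCFs.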

Page 24

And that is how we can scale joint discovery to eleventy thousand samples

New GVCF workflow solves both problems, yields same results

Incremental over time + scalable over sample size

Page 25

You are here in the GATK Best Practices workflow for germline variant discovery

[Figure: GATK Best Practices workflow diagram, repeated from Page 2: Data Pre-processing >> Variant Discovery >> Callset Refinement.]

Page 26

Further reading

http://www.broadinstitute.org/gatk/guide/best-practices

http://www.broadinstitute.org/gatk/guide/article?id=1237

https://www.broadinstitute.org/gatk/gatkdocs/org_broadinstitute_gatk_tools_walkers_haplotypecaller_HaplotypeCaller.php

https://www.broadinstitute.org/gatk/gatkdocs/org_broadinstitute_gatk_tools_walkers_variantutils_GenotypeGVCFs.php