Germline variant calling and joint genotyping
Applying the joint discovery workflow with HaplotypeCaller + GenotypeGVCFs
You are here in the GATK Best Practices workflow for germline variant discovery
[Figure: GATK Best Practices workflow in three phases. Data Pre-processing: Raw Reads, Map to Reference (BWA mem), Mark Duplicates & Sort (Picard), Indel Realignment, Base Recalibration, Analysis-Ready Reads. >> Variant Discovery: Var. Calling (HC in ERC mode), Genotype Likelihoods, Joint Genotyping, Raw Variants (SNPs, Indels), Variant Recalibration (separately per variant type), Analysis-Ready Variants (SNPs & Indels). >> Callset Refinement: Genotype Refinement, Variant Annotation, Variant Evaluation (look good? use in project / troubleshoot). Non-GATK steps are marked.]
A scalable workflow for joint variant discovery
New GVCF workflow solves both problems, yields same results
+ Incremental over time
+ Scalable over sample size
Tools involved in the workflow
• Identify potential variants in each sample
➔ HaplotypeCaller
• Perform joint genotyping on the cohort
➔ GenotypeGVCFs
Key HaplotypeCaller features
What it does:
• Calls SNP and indel variants simultaneously
• Performs local re-assembly to identify haplotypes
• Reference confidence model enables detection of low frequency variants
• Joint-discovery workflow (reference confidence model, GVCFs)
• Handles RNAseq natively
• Handles non-diploid organisms and pooled samples
What it doesn't do:
• Somatic variant calling (use MuTect2 instead!)
How HaplotypeCaller works in 4 simple steps
Step 1: Identify ActiveRegions
• Sliding window along the reference
• Count mismatches, indels and soft clips
➔ Measure of entropy
Over threshold: Trigger "ActiveRegion" to be processed
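The sliding-window idea in Step 1 can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the input list of per-base event counts, the window size, the threshold, and the use of a plain mean in place of the entropy measure are all hypothetical choices, not GATK's actual activity profile.

```python
from typing import List, Tuple


def find_active_regions(events_per_base: List[int],
                        window: int = 5,
                        threshold: float = 1.0) -> List[Tuple[int, int]]:
    """Return half-open (start, end) intervals flagged as active.

    events_per_base[i] is a toy count of mismatches, indels and soft clips
    at reference position i (hypothetical input, not a GATK structure).
    A window is active when its mean event count exceeds `threshold`.
    """
    n = len(events_per_base)
    if n == 0:
        return []
    active = [False] * n
    # Slide the window along the reference and flag every covered base.
    for start in range(0, max(n - window + 1, 1)):
        chunk = events_per_base[start:start + window]
        if sum(chunk) / len(chunk) > threshold:
            for i in range(start, min(start + window, n)):
                active[i] = True
    # Merge runs of flagged bases into intervals.
    regions = []
    i = 0
    while i < n:
        if active[i]:
            j = i
            while j < n and active[j]:
                j += 1
            regions.append((i, j))
            i = j
        else:
            i += 1
    return regions


counts = [0, 0, 0, 0, 0, 0, 3, 4, 2, 0, 0, 0, 0, 0, 0, 0]
regions = find_active_regions(counts, window=3, threshold=1.0)
print(regions)  # [(5, 10)]
```

Only the flagged intervals would then be passed on to the assembly step; everything else is skipped.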
Step 2: Assemble plausible haplotypes
• Local realignment via graph assembly
• Traverse graph to collect most likely haplotypes
• Align haplotypes to ref using Smith-Waterman
Likely haplotypes + candidate variant sites
Can make HC output the reassembled reads and selected haplotypes using the -bamout parameter
Example assembly graph produced by HaplotypeCaller
• Previous alignments are ignored
• K-mers consist of every possible sequence combination based on the reads
• Most likely paths through the graph are scored
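A toy version of the k-mer graph helps make the assembly step concrete. The sketch below builds a De Bruijn-style graph from a reference and one read, then enumerates the paths between the first and last reference k-mers as candidate haplotypes. The function names, the sequences and the choice of k are hypothetical; path scoring, pruning and cycle handling in HaplotypeCaller itself are more involved and are not reproduced here.

```python
from collections import defaultdict
from typing import Dict, List


def build_kmer_graph(sequences: List[str], k: int) -> Dict[str, Dict[str, int]]:
    """Map each (k-1)-mer node to its successor nodes with edge multiplicity."""
    graph: Dict[str, Dict[str, int]] = defaultdict(lambda: defaultdict(int))
    for seq in sequences:
        for i in range(len(seq) - k + 1):
            kmer = seq[i:i + k]
            graph[kmer[:-1]][kmer[1:]] += 1
    return graph


def enumerate_paths(graph: Dict[str, Dict[str, int]],
                    source: str, sink: str, max_len: int = 100) -> List[str]:
    """Depth-first enumeration of source->sink paths, spelled out as sequences."""
    paths = []
    stack = [(source, source, {source})]
    while stack:
        node, spelled, seen = stack.pop()
        if node == sink:
            paths.append(spelled)
            continue
        if len(spelled) >= max_len:
            continue
        for nxt in graph.get(node, {}):
            if nxt not in seen:  # skip cycles in this toy version
                stack.append((nxt, spelled + nxt[-1], seen | {nxt}))
    return sorted(paths)


k = 4
ref = "ACGTTGCA"
read = "ACGTAGCA"  # differs from ref by one base
graph = build_kmer_graph([ref, read], k)
haplotypes = enumerate_paths(graph, ref[:k - 1], ref[-(k - 1):])
print(haplotypes)  # ['ACGTAGCA', 'ACGTTGCA']
```

Note that the read's original alignment is never consulted: the graph is built from sequence content alone, which is the point of the first bullet above.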
Graph assembly recovers indels and removes artifacts
[Figure panels: NA12878 original read data; HaplotypeCaller (validated)]
Multiple caller artifacts that are hard to filter out, since they are well supported by read data
Graph assembly resolves complexity caused by mapper limitations
[Figure: original BWA alignments, with Reference, Consensus and Reads tracks]
Can be represented by the mapper two different ways, at random:
[+A][T->C]
[T->A][+C]
HaplotypeCaller will settle on one representation -> cleaner output call
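To see why the two representations are interchangeable, the sketch below applies each pair of edits to a single reference base and checks that both spell the same read sequence. The one-base reference context and the helper names are assumptions made for illustration only.

```python
reference = "T"  # single reference base at the site (assumed context)


def insert_then_substitute(ref: str) -> str:
    """[+A][T->C]: insert A before the reference base, then change T to C."""
    return "A" + ref.replace("T", "C")


def substitute_then_insert(ref: str) -> str:
    """[T->A][+C]: change T to A, then insert C after it."""
    return ref.replace("T", "A") + "C"


first = insert_then_substitute(reference)
second = substitute_then_insert(reference)
print(first, second, first == second)  # AC AC True
```

Because both edit paths describe the same haplotype, a caller that works at the haplotype level only has to pick one of them consistently.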
Bonus perk of haplotype calling: free physical phasing
Two new sample-level annotations, PID (for phase identifier) and PGT (phased genotype)
Step 3: Score haplotypes using PairHMM
• Calculate haplotype likelihoods given the read
  – PairHMM aligns each read to each haplotype
Likelihood of the haplotype given reads
PairHMM
State:
(M) Match
(Ix) Insertion
(Iy) Deletion
Transition probabilities (derived from BQSR):
(ε) = Gap continuation
(δ) = Gap open penalty
(1 - ε) = Base precedes an insertion or a deletion
(1 - 2δ) = Base matches and continues
PairHMM uses base qualities to score alignments
-> likelihoods of the haplotypes given the reads -> store in matrix
[Figure: matrix with Haplotypes as columns and Reads as rows; Aij = probability of haplotype vs read]
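The following is a simplified forward-algorithm sketch of a pair HMM with the three states named above (M, Ix, Iy). It is not GATK's implementation: the fixed values for δ and ε, the uniform start over haplotype positions, the err/3 mismatch emission, and the emission of 1 for inserted bases are all assumptions chosen to keep the example short. In GATK the gap penalties come from BQSR rather than from constants.

```python
from typing import List


def phred_to_error(q: int) -> float:
    """Convert a Phred base quality to an error probability."""
    return 10 ** (-q / 10)


def pairhmm_likelihood(read: str, quals: List[int], hap: str,
                       delta: float = 0.001, eps: float = 0.1) -> float:
    """Forward likelihood of `read` given `hap` (simplified sketch).

    delta = gap open penalty, eps = gap continuation (both assumed constants).
    """
    n, m = len(read), len(hap)
    if n == 0 or m == 0:
        raise ValueError("read and haplotype must be non-empty")
    M = [[0.0] * (m + 1) for _ in range(n + 1)]  # match state
    I = [[0.0] * (m + 1) for _ in range(n + 1)]  # insertion (Ix)
    D = [[0.0] * (m + 1) for _ in range(n + 1)]  # deletion (Iy)
    # Let the read start at any haplotype position with equal weight.
    for j in range(m + 1):
        D[0][j] = 1.0 / m
    for i in range(1, n + 1):
        err = phred_to_error(quals[i - 1])
        for j in range(1, m + 1):
            emit = 1.0 - err if read[i - 1] == hap[j - 1] else err / 3.0
            M[i][j] = emit * ((1 - 2 * delta) * M[i - 1][j - 1]
                              + (1 - eps) * (I[i - 1][j - 1] + D[i - 1][j - 1]))
            I[i][j] = delta * M[i - 1][j] + eps * I[i - 1][j]
            D[i][j] = delta * M[i][j - 1] + eps * D[i][j - 1]
    # The read may end anywhere along the haplotype.
    return sum(M[n][j] + I[n][j] for j in range(1, m + 1))


haplotypes = ["TTACGTTT", "TTACCTTT"]
reads = [("ACGT", [30, 30, 30, 30])]
# A[i][j] = likelihood of read i under haplotype j
A = [[pairhmm_likelihood(r, q, h) for h in haplotypes] for r, q in reads]
print(A[0][0] > A[0][1])  # True: the read matches the first haplotype exactly
```

Filling `A` for every read and every candidate haplotype gives the reads-by-haplotypes matrix that Step 4 consumes.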
Step 4: Genotype each sample at each potential variant site
• Determine most likely alleles for each sample
• Based on support for haplotypes (from PairHMM)
• Evaluated over reads from each sample
Genotype calls for each sample
Transforming support for haplotypes into support for alleles
Reference:   ATCGATCATAGCTAGCTGCG
Haplotype 1: ATCGA-CATAGCTAGCTGCG
Haplotype 2: ATGGATCATAGCTTGCTGCG
Haplotype 3: ATCGA-CATAGCTTGCTGCG
                  * (site being genotyped)

Likelihoods per haplotype*
         Haplotypes
Reads    R      1      2      3
1        0.01   0.02   0.03   0.04
2        0.09   0.06   0.07   0.08
3        0.10   0.11   0.01   0.02

Likelihoods per allele*
         Alleles
Reads    -      T
1        0.04   0.03
2        0.08   0.09
3        0.11   0.10

Take highest probability of haplotypes given reads that contain the allele (for each variant position)
*These numbers are made up to give a sense of how the process works. In reality the numbers would be much smaller.
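The haplotype-to-allele step can be reproduced with a few lines of Python. The sketch assumes the haplotype columns on the slide are ordered R, 1, 2, 3, and that at the marked site the reference and Haplotype 2 carry "T" while Haplotypes 1 and 3 carry the deletion "-"; with that reading, taking the maximum over the carrying haplotypes gives exactly the slide's allele numbers. The dictionary layout and names are hypothetical.

```python
# Made-up per-read haplotype likelihoods from the slide (rows = reads).
haplotype_likelihoods = {
    1: {"R": 0.01, "H1": 0.02, "H2": 0.03, "H3": 0.04},
    2: {"R": 0.09, "H1": 0.06, "H2": 0.07, "H3": 0.08},
    3: {"R": 0.10, "H1": 0.11, "H2": 0.01, "H3": 0.02},
}

# Which haplotypes carry which allele at the site under consideration.
allele_carriers = {"-": ["H1", "H3"], "T": ["R", "H2"]}


def allele_likelihoods(hap_table, carriers):
    """For each read and allele, keep the best likelihood among carrying haplotypes."""
    return {
        read: {allele: max(row[h] for h in haps) for allele, haps in carriers.items()}
        for read, row in hap_table.items()
    }


per_allele = allele_likelihoods(haplotype_likelihoods, allele_carriers)
for read, row in per_allele.items():
    print(read, row)
# 1 {'-': 0.04, 'T': 0.03}
# 2 {'-': 0.08, 'T': 0.09}
# 3 {'-': 0.11, 'T': 0.1}
```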
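And finally, a bit of Bayesian math
Determines the most likely genotype of the sample at each site where there is evidence of variation

Bayesian model

4 SNP calling
4.1 Simple genotype likelihoods for presentations
\[ \Pr\{G \mid D\} = \frac{\Pr\{G\}\,\Pr\{D \mid G\}}{\sum_i \Pr\{G_i\}\,\Pr\{D \mid G_i\}} \quad \text{[Bayes' rule]} \]
\[ \Pr\{D \mid G\} = \prod_j \left( \frac{\Pr\{D_j \mid H_1\}}{2} + \frac{\Pr\{D_j \mid H_2\}}{2} \right) \quad \text{where } G = H_1 H_2 \]
Pr{D|H} is the haploid likelihood function

4.1.1 SNP haploid likelihood
\[ \Pr\{D_j \mid H\} = \Pr\{D_j \mid b\} \quad \text{[single base pileup]} \]
\[ \Pr\{D_j \mid b\} = \begin{cases} 1 - \epsilon_j & D_j = b, \\ \epsilon_j & \text{otherwise.} \end{cases} \]

4.1.2 Indel haploid likelihood
\[ \Pr\{D_j \mid H\} = \sum_{\text{alignments } \pi \text{ of } D_j \text{ to } H} \Pr\{D_j, \pi\} \]

4.2 Genotype likelihoods
\[ \Pr\{D_i \mid GT_i\} = \prod_j \Pr\{D_{i,j} \mid GT_i\} \]
\[ \Pr\{D_{i,j} \mid GT_i = AB\} = \left( \Pr\{D_{i,j} \mid A\} + \Pr\{D_{i,j} \mid B\} \right) / 2 \]
\[ \Pr\{D_{i,j} \mid B\} = \begin{cases} 1 - \epsilon_{i,j} & D_{i,j} = B, \\ \epsilon_{i,j} \cdot \Pr\{B \text{ is true} \mid D_{i,j} \text{ is miscalled}\} & \text{otherwise.} \end{cases} \]

Annotations on the slide: Pr{G} is the prior of the genotype; Pr{D|G} is the likelihood of the genotype; the division by 2 is the diploid assumption.

Just plug in the numbers!
         Alleles
Reads    -      T
1        0.04   0.03
2        0.08   0.09
3        0.11   0.10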
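As a worked example of plugging the numbers in, the sketch below applies the diploid likelihood Pr{D|G} = Π_j (Pr{D_j|H1}/2 + Pr{D_j|H2}/2) and Bayes' rule to the made-up allele table from the slides. The flat genotype prior is an assumption made for illustration; the slides do not state which prior is used.

```python
from itertools import combinations_with_replacement
from math import prod

# Made-up per-read allele likelihoods from the slide (reads 1, 2, 3).
allele_table = {"-": [0.04, 0.08, 0.11], "T": [0.03, 0.09, 0.10]}


def genotype_likelihood(a1, a2, table):
    """Pr{D|G} for diploid genotype G = a1/a2: product over reads of the allele mean."""
    return prod((p1 + p2) / 2 for p1, p2 in zip(table[a1], table[a2]))


def genotype_posteriors(table, priors=None):
    """Bayes' rule over all unordered diploid genotypes."""
    genotypes = list(combinations_with_replacement(sorted(table), 2))
    if priors is None:
        # Flat prior: an assumption for this sketch only.
        priors = {g: 1.0 / len(genotypes) for g in genotypes}
    unnorm = {g: priors[g] * genotype_likelihood(*g, table) for g in genotypes}
    total = sum(unnorm.values())
    return {g: v / total for g, v in unnorm.items()}


posteriors = genotype_posteriors(allele_table)
best = max(posteriors, key=posteriors.get)
for g, p in posteriors.items():
    print("/".join(g), round(p, 4))
print("most likely genotype:", "/".join(best))  # -/-
```

With these toy numbers the three likelihoods are 0.000352 for -/-, 0.000312375 for -/T and 0.00027 for T/T, so the homozygous deletion wins under a flat prior. A different prior could change the call, which is why the prior term matters.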
HaplotypeCaller recap: reads in / variants out
BAM -> VCF
This is all you need for a single sample or traditional multi-sample analysis
For joint discovery: emit GVCF + add joint genotyping step
• Run HC in GVCF mode to emit GVCF
• Run GenotypeGVCFs to re-genotype samples with multi-sample model
GVCF includes <NON_REF> allele + genotype likelihoods for joint genotyping
Symbolic allele stands for all non-called but possible non-reference alleles
[Figure annotation: end pos of hom-ref band]
GVCFs are valid VCFs with extra information
Multiple GVCFs combined form a squared-off matrix of genotypes
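The "squared-off matrix" can be pictured with a toy lookup: every site that is variant in at least one sample gets an entry for every sample, with hom-ref bands supplying the value where a sample has no variant record. The record layout (start, end, genotype), the sample names and the positions are hypothetical, and the sketch simply copies genotypes. GenotypeGVCFs itself re-genotypes from the stored likelihoods, which is not modelled here.

```python
from typing import Dict, List, Tuple

Record = Tuple[int, int, str]  # (start, end, genotype), 1-based inclusive (toy layout)


def genotype_at(records: List[Record], pos: int) -> str:
    """Look up the genotype covering `pos`; hom-ref bands span start..end."""
    for start, end, gt in records:
        if start <= pos <= end:
            return gt
    return "./."  # position not covered by any record


def squared_off_matrix(gvcfs: Dict[str, List[Record]]) -> Dict[int, Dict[str, str]]:
    """One row per site that is variant in any sample, one column per sample."""
    variant_sites = sorted({start for recs in gvcfs.values()
                            for start, end, gt in recs if gt != "0/0"})
    return {pos: {sample: genotype_at(recs, pos) for sample, recs in gvcfs.items()}
            for pos in variant_sites}


gvcfs = {
    "sample1": [(1, 99, "0/0"), (100, 100, "0/1"), (101, 200, "0/0")],
    "sample2": [(1, 149, "0/0"), (150, 150, "1/1"), (151, 200, "0/0")],
    "sample3": [(1, 120, "0/0")],
}
matrix = squared_off_matrix(gvcfs)
for pos, row in matrix.items():
    print(pos, row)
# 100 {'sample1': '0/1', 'sample2': '0/0', 'sample3': '0/0'}
# 150 {'sample1': '0/0', 'sample2': '1/1', 'sample3': './.'}
```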
The joint discovery workflow in practice
[Figure: Analysis-ready BAM file (one per sample) -> HaplotypeCaller -> Raw gVCF file (one per sample) -> GenotypeGVCFs -> Raw VCF file]
java -jar GenomeAnalysisTK.jar \
    -T HaplotypeCaller \
    -R human.fasta \
    -I sample1.bam \
    -o sample1.g.vcf \
    [-L exome_targets.intervals \]
    -ERC GVCF
java -jar GenomeAnalysisTK.jar \
    -T GenotypeGVCFs \
    -R human.fasta \
    -V sample1.g.vcf \
    -V sample2.g.vcf \
    -V sampleN.g.vcf \
    -o output.vcf
If >200 samples, combine in batches first using CombineGVCFs
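The batching rule can be sketched as a small planner that only builds the command lines (it does not run them). The batch size of 200 follows the threshold on the slide; the intermediate file names `batchN.g.vcf`, the helper names, and the choice to use the same `-V`/`-o` arguments for CombineGVCFs as for GenotypeGVCFs are assumptions for illustration.

```python
from typing import List

GATK_JAR = "GenomeAnalysisTK.jar"


def batches(items: List[str], size: int) -> List[List[str]]:
    """Split a list into consecutive chunks of at most `size` items."""
    return [items[i:i + size] for i in range(0, len(items), size)]


def gatk_command(tool: str, reference: str, gvcfs: List[str], output: str) -> List[str]:
    """Build an argument list in the style of the commands shown on the slides."""
    cmd = ["java", "-jar", GATK_JAR, "-T", tool, "-R", reference]
    for g in gvcfs:
        cmd += ["-V", g]
    return cmd + ["-o", output]


def plan_joint_genotyping(gvcfs: List[str], reference: str = "human.fasta",
                          batch_size: int = 200) -> List[List[str]]:
    """Return the commands to run, combining in batches when there are many samples."""
    if len(gvcfs) <= batch_size:
        return [gatk_command("GenotypeGVCFs", reference, gvcfs, "output.vcf")]
    commands, combined = [], []
    for n, batch in enumerate(batches(gvcfs, batch_size), start=1):
        out = f"batch{n}.g.vcf"  # hypothetical intermediate file name
        commands.append(gatk_command("CombineGVCFs", reference, batch, out))
        combined.append(out)
    commands.append(gatk_command("GenotypeGVCFs", reference, combined, "output.vcf"))
    return commands


samples = [f"sample{i}.g.vcf" for i in range(1, 451)]
plan = plan_joint_genotyping(samples)
print(len(plan))            # 4: three CombineGVCFs runs, then GenotypeGVCFs
print(" ".join(plan[-1]))
```

Returning argument lists rather than shell strings keeps the sketch safe to pass to a process launcher later without extra quoting.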
And that is how we can scale joint discovery to eleventy thousand samples
New GVCF workflow solves both problems, yields same results
+ Incremental over time
+ Scalable over sample size
Further reading
http://www.broadinstitute.org/gatk/guide/best-practices
http://www.broadinstitute.org/gatk/guide/article?id=1237
https://www.broadinstitute.org/gatk/gatkdocs/org_broadinstitute_gatk_tools_walkers_haplotypecaller_HaplotypeCaller.php
https://www.broadinstitute.org/gatk/gatkdocs/org_broadinstitute_gatk_tools_walkers_variantutils_GenotypeGVCFs.php