Upload
others
View
0
Download
0
Embed Size (px)
Citation preview
VariantQualityScoreRecalibra2on
Assigningaccurateconfidencescorestoeachputa2vemuta2oncall
talks
YouarehereintheGATKBestPrac2cesworkflowforgermlinevariantdiscovery
Analysis-Ready Variants
111Raw Reads
Raw Variants IndelsSNPs
Analysis-ReadyReads
Indel Realignment
Base Recalibration
SNPs & Indels
Variants
IndelsSNPs
VariantAnnotation
Variant Evaluation
look good?
use in projecttroubleshoot
111Analysis-ReadyReads
Genotype Likelihoods
Joint Genotyping
Analysis-Ready
No
n-G
AT
K
Mark Duplicates& Sort (Picard)
Var. Calling HC in ERC mode
separately per variant type
Variant Recalibration
Map to Reference
BWA mem GenotypeRefinement
Data Pre-processing Variant Discovery>> >> Callset Refinement
Raw,high-sensi2vitycallsetscontainmanyfalseposi2ves
• Muta2oncallingalgorithmsareverypermissivebydesign
• Howtofilter?– Hand-tunedhard-filteringrequires2meandexper2se– BeMertolearnwhatthefiltersshouldbefromthedataitself
• Mustenableanalyststotradeoffsensi2vityandspecificitydependingonprojectgoals
þ Buildingamodelofwhattruegene6cvaria6onlookslikewillallowustorank-ordervariantsbasedontheirlikelihoodofbeingreal
Fromannota2onstomixturemodels
• Eachvarianthasadiversesetofsta2s2csassociatedwithit.
• Theseannota2onstendtoformGaussianclusters
• Wecanfita“Gaussianmixturemodel”totheannota2onsknownvariantsinourdataset.
• Anynewvariantcanbescoredbyevalua2ngtheassociatedannota2onsinthismodel.
Variantannota2onsarethe“features”ofthemodel
22 49582364 . A G 198.96 . AC=3; AF=0.50; AN=6; DP=87; MLEAC=3;MLEAF=0.50;MQ=71.31; MQ0=22; QD=2.29; SB=-31.76 GT:DP:GQ 0/1:12:99 0/1:11:89 0/1:28:37
VCFrecordforanA/GSNPat22:49582364
AC No.chromosomescarryingaltallele
MLEAF MaxlikelihoodAF
AN Totalno.ofchromosomes MQ RMSMAPQofallreads
AF Allelefrequency MQ0 No.ofMAPQ0readsatlocus
DP Depthofcoverage QD QUALscoreoverdepth
MLEAC MaxlikelihoodACINFO
field
NotethatVQSRwillonlylookatINFOannota2ons;sample-levelannota2ons(genotype,ADetc)arenotused.
Two steps: (1) train a model then (2) apply to callset
ModifiedfromDePristoetal.NatureGene2cs.2011
(1)TrainmodelusingHapMap (2)Applymodeltocallset
Basicidea:trainingonhigh-confidenceknownsitestodeterminetheprobabilitythatothersitesaretrue
(1) Training the model
(1)Trainmodelusinge.g.HapMap
• Thistellsuswhatgoodvariantslooklike• Asimilarmodelforthevariantsinourcallsetthatleastlooklikegoodvariantsisalsocreated
(badmodel,nobiscuit!)• Allvariantscannowberankedbasedonthera2obetweentheirscoresinthegoodmodeland
thebadmodel(=VQSLOD)
• Wechooseatrainingset
• Variantsthatarebothinthetrainingsetandinourcallsetareselected.
• Wetrainthemodelusingtheannota2onsoftheselectedvariants
(2) Applying the model to our callset
(2)Applymodeltocallset• Usingtherankingproducedbythemodel,filteringvariantsisaseasyasseengasinglethresholdvalue
• Anyvariantswhosescorefallsbelowthethresholdisfilteredout
Buthowdowesetthatthreshold?
There are in fact two components to the model
• Anega6vemodelisalsobuiltduringtraining• Itrepresentstheprobabilityofvariantstobe
falseposi6ves
Nega2veModel
Posi2veModel
The VQSLOD threshold is a tradeoff between TP and FP
Nega2veModel
Posi2veModel
ThresholdConsideredPosi2ve
ConsideredNega2ve
TruePos.
FalsePos.
VQSLOD(x)=Log(p(x)/q(x))
q
p
(VQSLODisdis2nctfromQUAL!)
Roleoftrainingandtruthresources
Calibra2ngSensi2vity
Truthsetisusedfortransla6ngVQSLODvaluesintosensi6vity“tranches”
Trainingsetisusedforbuildingthemodel
Wesetthethresholdbasedonsensi6vitytotruthdata
Calibra2ngSensi2vity
Densityofsitesintruthset
WhatthresholddoweneedtosettocaptureX%ofthesitesinthetruthset?
VariantRecalibra2onsteps&tools
• BuildandApplythemodels(fromresourcesandcallset)➔ VariantRecalibrator
• UseVQSLODtofiltervariantsandwriteanewannotatedVCF➔ ApplyRecalibra6on
Number of Novel Variants (1000s)0 500 1000 1500 2000
Ti/Tv FDR
10
5
1
0.1
1.91
1.99
2.06
2.07
0 100 200 300 400
Cumulative TPsTranch−specific TPsTranch−specific FPsCumulative FPs
Ti/Tv FDR
10
2
1
0.1
1.92
2.04
2.05
2.07
0.0 0.2 0.4 0.6 0.8
Ti/Tv FDR
10
2
1
0.1
2.79
2.96
2.98
3.01
Evaluating novel variants
More Bias
Less Bias
HiSeq: training on HapMap
More Bias
Less Bias
Likely dbSNP errors
Gaussian mixture model fits
A HM3 1KG Trio
99.5
99.5
99.5
99.5
98.2
98.5
98.4
98.6
NR sensitivity (%):
HiSeq C
HiSeq
88.0
89.7
88.0
93.8
HM3
96.0
96.8
96.0
98.3
D
65.1
66.2
65.3
66.9 86.5 With imputation
82.3
82.8
82.4
96.7 NGS only 83.0
B
Heterozygous variants Homozygous
variants
Exome
Low-pass E
Analysis tranche
Analysis tranche
HiSeq: evaluating novel variants
HiSeq HM3
Analysis tranche
Number of Novel Variants (1000s)0 500 1000 1500 2000
Ti/Tv FDR
10
5
1
0.1
1.91
1.99
2.06
2.07
0 100 200 300 400
Cumulative TPsTranch−specific TPsTranch−specific FPsCumulative FPs
Ti/Tv FDR
10
2
1
0.1
1.92
2.04
2.05
2.07
0.0 0.2 0.4 0.6 0.8
Ti/Tv FDR
10
2
1
0.1
2.79
2.96
2.98
3.01
Evaluating novel variants
More Bias
Less Bias
HiSeq: training on HapMap
More Bias
Less Bias
Likely dbSNP errors
Gaussian mixture model fits
A HM3 1KG Trio
99.5
99.5
99.5
99.5
98.2
98.5
98.4
98.6
NR sensitivity (%):
HiSeq C
HiSeq
88.0
89.7
88.0
93.8
HM3
96.0
96.8
96.0
98.3
D
65.1
66.2
65.3
66.9 86.5 With imputation
82.3
82.8
82.4
96.7 NGS only 83.0
B
Heterozygous variants Homozygous
variants
Exome
Low-pass E
Analysis tranche
Analysis tranche
HiSeq: evaluating novel variants
HiSeq HM3
Analysis tranche
NOTE:SNPsandIndelsmustberecalibratedseparately!
OriginalSNPs+originalIndels
RecalSNPs+RecalIndels
Pro-2p:RunVQSRtwiceinsuccessionaccordingtothisworkflow.Thatwayyouavoidhavingtosplitthem,recalibrateandcombinethemagain.
RecalSNPs+originalIndels
VariantRecalibrator
ApplyRecalibra6on
FirstpassinSNPmode,Indelswillbeleiuntouched
VariantRecalibrator
ApplyRecalibra6on
SecondpassinINDELmode,SNPswillbeleiuntouched
VariantRecalibra2onworkflow
OriginalVCFfile
Recalibra2onfileTranchesfile
+recalibra2onplots
RecalibratedVCFfile
VariantRecalibrator
ApplyRecalibra6on
+Resources
OriginalVCFfile
Recalibra2onfileTranchesfile
+recalibra2onplots
RecalibratedVCFfile
VariantRecalibrator
ApplyRecalibra6on
+Resources
VariantRecalibra2onworkflow
VariantRecalibrator
• BuildtheGaussianmixturemodelusingthevariantsintheinputcallsetwhichoverlapthetrainingdata
TOOLTIPS
java–jarGenomeAnalysisTK.jar–TVariantRecalibrator\
–Rhuman.fasta\–inputraw.SNPs.vcf\–resource:{seenextslide}\–anDP–anQD–anFS–anMQRankSum{…}\–modeSNP\–recalFileraw.SNPs.recal\–tranchesFileraw.SNPs.tranches\–rscriptFilerecal.plots.R
SNPexample–seedocumenta2onforindelrecommenda2ons
VQSRresources
-resource:hapmap,known=false,training=true,truth=true,prior=15.0hapmap_3.3.b37.sites.vcf-resource:omni,known=false,training=true,truth=false,prior=12.0omni2.5.b37.sites.vcf-resource:1000G,known=false,training=true,truth=false,prior=10.01000G.b37.sites.vcf-resource:dbsnp,known=true,training=false,truth=false,prior=2.0dbsnp_137.b37.vcf
TOOLTIPS
SNPexample–seedocumenta2onforindelrecommenda2ons
Understanding“resources”TOOLTIPS
• Prior–Phred-scaledes2mateofdataaccuracy
• Training–useinputvariantsthatoverlapwiththesetrainingsitestobuildthemodel
• Truth-usethesetruthsitestodeterminewheretosetthecutoffinVQSLODsensi2vity.
• Known–onlyforrepor2ngpurposes,notusedinanycalcula2ons
OriginalVCFfile
Recalibra2onfileTranchesfile
+recalibra2onplots
RecalibratedVCFfile
VariantRecalibrator
ApplyRecalibra6on
+Resources
VariantRecalibra2onworkflow
Recalibra2onplotsshowaspectsofthemodel(foreachpossiblepairofannota2ons)
21
Tranchesareslicesofthedatacorrespondingtopre-setsensi2vitythresholdvalues
Calibra2ngSensi2vity
90 99 99.9
Truthsensi2vity(%)
100
OriginalVCFfile
Recalibra2onfileTranchesfile
+recalibra2onplots
RecalibratedVCFfile
VariantRecalibrator
ApplyRecalibra6on
+Resources
VariantRecalibra2onworkflow
ApplyRecalibra2on
• Executesthedesiredsensi2vity/specificitytradeoffbyapplyingfilterstotheinputcallset
• Createsanew,filtered,analysis-worthyVCFfile.
• Addi2onallyeveryvariantisnowannotatedwithitsVQSLODscore.
TOOLTIPS
java–jarGenomeAnalysisTK.jar–TApplyRecalibra6on\
–Rhuman.fasta\–inputraw.vcf\–modeSNP\–recalFileraw.SNPs.recal\–tranchesFileraw.SNPs.tranches\–orecal.SNPs.vcf\–ts_filter_level99.0
SNPexample–seedocumenta2onforindelrecommenda2ons
OriginalVCFfile
Recalibra2onfileTranchesfile
+recalibra2onplots
RecalibratedVCFfile
VariantRecalibrator
ApplyRecalibra6on
+Resources
VariantRecalibra2onworkflow
OriginalSNPs+originalIndels
RecalSNPs+RecalIndels
Pro-2p:RunVQSRtwiceinsuccessionaccordingtothisworkflow.Thatwayyouavoidhavingtosplitthem,recalibrateandcombinethemagain.
RecalSNPs+originalIndels
VariantRecalibrator
ApplyRecalibra6on
FirstpassinSNPmode,Indelswillbeleiuntouched
VariantRecalibrator
ApplyRecalibra6on
SecondpassinINDELmode,SNPswillbeleiuntouched
NOTE:SNPsandIndelsmustberecalibratedseparately!
• BeforeVQSR(inputvcf):
• AierVQSR(outputvcf):
• Hardfilteringvcf:
VQSRoutputVCF(vs.HardFilter)
Didtherecalibra2onworkproperly?
• Commonerrormodes:– “Nodatafound”
->toofewvariantsincallset,seedocsforworkarounds
– “Annota2onXnotfoundforanyvariant”->annota6onnotpresentinfile,useVariantAnnotator
->ifrelatedtoInbreedingCoefficientprobablylessthan10samples,cannotusethisanno6on
– Screwed-uptrancheplots,lownovelTiTv->wrongdbSNPversion,usetheoneinthebundle
NOTES
• SmallernumberofvariantspersamplecomparedtoWGS->typicallyinsufficienttobuildarobustrecalibra6onmodel
ifrunningononlyafewsamples
Ø Analyzesamplesjointlyincohortsofatleast30Ø Ifnecessary,addexomesfrom1000GProject
• Whattolookforinsamplesforpaddingacohort:– Similartechnicalgenera2on(technology,capture,readlength,depth)– Similarethnicbackground
VariantRecalibra2on(VQSR)onWExdata
• ALWAYSdothis: • NEVERdothis:
+
per-sampleGVCFs
BAMs 1000GBAMs
+BAMs 1000GVCF
GenotypeGVCFs
RawVCF
RawVCF
FilteredVCF
VQSR VQSR
FilteredVCFü X
HC
GenotypeGVCFs
HC
Howtoaddexomesfrom1000Gtoyouranalysis
WhenshouldyouNOTrunVQSR?
• Non-humanorganismswhereknownresourcesareunavailableorinsufficientlycurated
• RNAseqdataàseeRNAseq-specificfiltering
• Cohortistoosmallandnoothersamplesareavailablefor“padding”thecohort
àUsemanualfilteringrecommenda2onsinstead
YouarehereintheGATKBestPrac2cesworkflowforgermlinevariantdiscovery
Analysis-Ready Variants
111Raw Reads
Raw Variants IndelsSNPs
Analysis-ReadyReads
Indel Realignment
Base Recalibration
SNPs & Indels
Variants
IndelsSNPs
VariantAnnotation
Variant Evaluation
look good?
use in projecttroubleshoot
111Analysis-ReadyReads
Genotype Likelihoods
Joint Genotyping
Analysis-Ready
No
n-G
AT
K
Mark Duplicates& Sort (Picard)
Var. Calling HC in ERC mode
separately per variant type
Variant Recalibration
Map to Reference
BWA mem GenotypeRefinement
Data Pre-processing Variant Discovery>> >> Callset Refinement
FurtherreadinghMp://www.broadins2tute.org/gatk/guide/best-prac2ces
hMp://www.broadins2tute.org/gatk/guide/ar2cle?id=39
hMp://www.broadins2tute.org/gatk/gatkdocs/
org_broadins2tute_s2ng_gatk_walkers_variantrecalibra2on_VariantRecalibrator.html
hMp://www.broadins2tute.org/gatk/gatkdocs/org_broadins2tute_s2ng_gatk_walkers_variantrecalibra2on_ApplyRecalibra2on.html
talks