Variant Quality Score Recalibraon - UCLAqcb.ucla.edu/wp-content/uploads/sites/14/2016/03/GATKwr... · 2017. 2. 24. · MLEAC Max likelihood AC Note that VQSR will only look at INFO

VariantQualityScoreRecalibra2on

Assigningaccurateconfidencescorestoeachputa2vemuta2oncall

talks

YouarehereintheGATKBestPrac2cesworkflowforgermlinevariantdiscovery

Analysis-Ready Variants

111Raw Reads

Raw Variants IndelsSNPs

Analysis-ReadyReads

Indel Realignment

Base Recalibration

SNPs & Indels

Variants

IndelsSNPs

VariantAnnotation

Variant Evaluation

look good?

use in projecttroubleshoot

111Analysis-ReadyReads

Genotype Likelihoods

Joint Genotyping

Analysis-Ready

No

n-G

AT

K

Mark Duplicates& Sort (Picard)

Var. Calling HC in ERC mode

separately per variant type

Variant Recalibration

Map to Reference

BWA mem GenotypeRefinement

Data Pre-processing Variant Discovery>> >> Callset Refinement

Raw,high-sensi2vitycallsetscontainmanyfalseposi2ves

•  Muta2oncallingalgorithmsareverypermissivebydesign

•  Howtofilter?–  Hand-tunedhard-filteringrequires2meandexper2se–  BeMertolearnwhatthefiltersshouldbefromthedataitself

•  Mustenableanalyststotradeoffsensi2vityandspecificitydependingonprojectgoals

þ  Buildingamodelofwhattruegene6cvaria6onlookslikewillallowustorank-ordervariantsbasedontheirlikelihoodofbeingreal

Fromannota2onstomixturemodels

•  Eachvarianthasadiversesetofsta2s2csassociatedwithit.

•  Theseannota2onstendtoformGaussianclusters

•  Wecanfita“Gaussianmixturemodel”totheannota2onsknownvariantsinourdataset.

•  Anynewvariantcanbescoredbyevalua2ngtheassociatedannota2onsinthismodel.

Variantannota2onsarethe“features”ofthemodel

22  49582364 . A G 198.96 . AC=3; AF=0.50; AN=6; DP=87; MLEAC=3;MLEAF=0.50;MQ=71.31; MQ0=22; QD=2.29; SB=-31.76 GT:DP:GQ 0/1:12:99 0/1:11:89 0/1:28:37

VCFrecordforanA/GSNPat22:49582364

AC No.chromosomescarryingaltallele

MLEAF MaxlikelihoodAF

AN Totalno.ofchromosomes MQ RMSMAPQofallreads

AF Allelefrequency MQ0 No.ofMAPQ0readsatlocus

DP Depthofcoverage QD QUALscoreoverdepth

MLEAC MaxlikelihoodACINFO

field

NotethatVQSRwillonlylookatINFOannota2ons;sample-levelannota2ons(genotype,ADetc)arenotused.

Two steps: (1) train a model then (2) apply to callset

ModifiedfromDePristoetal.NatureGene2cs.2011

(1)TrainmodelusingHapMap (2)Applymodeltocallset

Basicidea:trainingonhigh-confidenceknownsitestodeterminetheprobabilitythatothersitesaretrue

(1) Training the model

(1)Trainmodelusinge.g.HapMap

•  Thistellsuswhatgoodvariantslooklike•  Asimilarmodelforthevariantsinourcallsetthatleastlooklikegoodvariantsisalsocreated

(badmodel,nobiscuit!)•  Allvariantscannowberankedbasedonthera2obetweentheirscoresinthegoodmodeland

thebadmodel(=VQSLOD)

•  Wechooseatrainingset

•  Variantsthatarebothinthetrainingsetandinourcallsetareselected.

•  Wetrainthemodelusingtheannota2onsoftheselectedvariants

(2) Applying the model to our callset

(2)Applymodeltocallset•  Usingtherankingproducedbythemodel,filteringvariantsisaseasyasseengasinglethresholdvalue

•  Anyvariantswhosescorefallsbelowthethresholdisfilteredout

Buthowdowesetthatthreshold?

There are in fact two components to the model

•  Anega6vemodelisalsobuiltduringtraining•  Itrepresentstheprobabilityofvariantstobe

falseposi6ves

Nega2veModel

Posi2veModel

The VQSLOD threshold is a tradeoff between TP and FP

Nega2veModel

Posi2veModel

ThresholdConsideredPosi2ve

ConsideredNega2ve

TruePos.

FalsePos.

VQSLOD(x)=Log(p(x)/q(x))

q

p

(VQSLODisdis2nctfromQUAL!)

Roleoftrainingandtruthresources

Calibra2ngSensi2vity

Truthsetisusedfortransla6ngVQSLODvaluesintosensi6vity“tranches”

Trainingsetisusedforbuildingthemodel

Wesetthethresholdbasedonsensi6vitytotruthdata


Densityofsitesintruthset

WhatthresholddoweneedtosettocaptureX%ofthesitesinthetruthset?

VariantRecalibra2onsteps&tools

•  BuildandApplythemodels(fromresourcesandcallset)➔  VariantRecalibrator

•  UseVQSLODtofiltervariantsandwriteanewannotatedVCF➔  ApplyRecalibra6on

Number of Novel Variants (1000s)0 500 1000 1500 2000

Ti/Tv FDR

10

5

1

0.1

1.91

1.99

2.06

2.07

0 100 200 300 400

Cumulative TPsTranch−specific TPsTranch−specific FPsCumulative FPs

Ti/Tv FDR

10

2

1

0.1

1.92

2.04

2.05

2.07

0.0 0.2 0.4 0.6 0.8

Ti/Tv FDR

10

2

1

0.1

2.79

2.96

2.98

3.01

Evaluating novel variants

More Bias

Less Bias

HiSeq: training on HapMap

More Bias

Less Bias

Likely dbSNP errors

Gaussian mixture model fits

A HM3 1KG Trio

99.5

99.5

99.5

99.5

98.2

98.5

98.4

98.6

NR sensitivity (%):

HiSeq C

HiSeq

88.0

89.7

88.0

93.8

HM3

96.0

96.8

96.0

98.3

D

65.1

66.2

65.3

66.9 86.5 With imputation

82.3

82.8

82.4

96.7 NGS only 83.0

B

Heterozygous variants Homozygous

variants

Exome

Low-pass E

Analysis tranche

Analysis tranche

HiSeq: evaluating novel variants

HiSeq HM3

Analysis tranche

Number of Novel Variants (1000s)0 500 1000 1500 2000

Ti/Tv FDR

10

5

1

0.1

1.91

1.99

2.06

2.07

0 100 200 300 400

Cumulative TPsTranch−specific TPsTranch−specific FPsCumulative FPs

Ti/Tv FDR

10

2

1

0.1

1.92

2.04

2.05

2.07

0.0 0.2 0.4 0.6 0.8

Ti/Tv FDR

10

2

1

0.1

2.79

2.96

2.98

3.01

Evaluating novel variants

More Bias

Less Bias

HiSeq: training on HapMap

More Bias

Less Bias

Likely dbSNP errors

Gaussian mixture model fits

A HM3 1KG Trio

99.5

99.5

99.5

99.5

98.2

98.5

98.4

98.6

NR sensitivity (%):

HiSeq C

HiSeq

88.0

89.7

88.0

93.8

HM3

96.0

96.8

96.0

98.3

D

65.1

66.2

65.3

66.9 86.5 With imputation

82.3

82.8

82.4

96.7 NGS only 83.0

B

Heterozygous variants Homozygous

variants

Exome

Low-pass E

Analysis tranche

Analysis tranche

HiSeq: evaluating novel variants

HiSeq HM3

Analysis tranche

NOTE:SNPsandIndelsmustberecalibratedseparately!

OriginalSNPs+originalIndels

RecalSNPs+RecalIndels

Pro-2p:RunVQSRtwiceinsuccessionaccordingtothisworkflow.Thatwayyouavoidhavingtosplitthem,recalibrateandcombinethemagain.

RecalSNPs+originalIndels

VariantRecalibrator

ApplyRecalibra6on

FirstpassinSNPmode,Indelswillbeleiuntouched

VariantRecalibrator

ApplyRecalibra6on

SecondpassinINDELmode,SNPswillbeleiuntouched

VariantRecalibra2onworkflow

OriginalVCFfile

Recalibra2onfileTranchesfile

+recalibra2onplots

RecalibratedVCFfile

VariantRecalibrator

ApplyRecalibra6on

+Resources

OriginalVCFfile


+recalibra2onplots

RecalibratedVCFfile

VariantRecalibrator

ApplyRecalibra6on

+Resources


VariantRecalibrator

•  BuildtheGaussianmixturemodelusingthevariantsintheinputcallsetwhichoverlapthetrainingdata

TOOLTIPS

java–jarGenomeAnalysisTK.jar–TVariantRecalibrator\

–Rhuman.fasta\–inputraw.SNPs.vcf\–resource:{seenextslide}\–anDP–anQD–anFS–anMQRankSum{…}\–modeSNP\–recalFileraw.SNPs.recal\–tranchesFileraw.SNPs.tranches\–rscriptFilerecal.plots.R

SNPexample–seedocumenta2onforindelrecommenda2ons

VQSRresources

-resource:hapmap,known=false,training=true,truth=true,prior=15.0hapmap_3.3.b37.sites.vcf-resource:omni,known=false,training=true,truth=false,prior=12.0omni2.5.b37.sites.vcf-resource:1000G,known=false,training=true,truth=false,prior=10.01000G.b37.sites.vcf-resource:dbsnp,known=true,training=false,truth=false,prior=2.0dbsnp_137.b37.vcf

TOOLTIPS


Understanding“resources”TOOLTIPS

•  Prior–Phred-scaledes2mateofdataaccuracy

•  Training–useinputvariantsthatoverlapwiththesetrainingsitestobuildthemodel

•  Truth-usethesetruthsitestodeterminewheretosetthecutoffinVQSLODsensi2vity.

•  Known–onlyforrepor2ngpurposes,notusedinanycalcula2ons

OriginalVCFfile


+recalibra2onplots

RecalibratedVCFfile

VariantRecalibrator

ApplyRecalibra6on

+Resources


Recalibra2onplotsshowaspectsofthemodel(foreachpossiblepairofannota2ons)

21

Tranchesareslicesofthedatacorrespondingtopre-setsensi2vitythresholdvalues


90 99 99.9

Truthsensi2vity(%)

100

OriginalVCFfile


+recalibra2onplots

RecalibratedVCFfile

VariantRecalibrator

ApplyRecalibra6on

+Resources


ApplyRecalibra2on

•  Executesthedesiredsensi2vity/specificitytradeoffbyapplyingfilterstotheinputcallset

•  Createsanew,filtered,analysis-worthyVCFfile.

•  Addi2onallyeveryvariantisnowannotatedwithitsVQSLODscore.

TOOLTIPS

java–jarGenomeAnalysisTK.jar–TApplyRecalibra6on\

–Rhuman.fasta\–inputraw.vcf\–modeSNP\–recalFileraw.SNPs.recal\–tranchesFileraw.SNPs.tranches\–orecal.SNPs.vcf\–ts_filter_level99.0


OriginalVCFfile


+recalibra2onplots

RecalibratedVCFfile

VariantRecalibrator

ApplyRecalibra6on

+Resources


OriginalSNPs+originalIndels

RecalSNPs+RecalIndels

Pro-2p:RunVQSRtwiceinsuccessionaccordingtothisworkflow.Thatwayyouavoidhavingtosplitthem,recalibrateandcombinethemagain.

RecalSNPs+originalIndels

VariantRecalibrator

ApplyRecalibra6on

FirstpassinSNPmode,Indelswillbeleiuntouched

VariantRecalibrator

ApplyRecalibra6on

SecondpassinINDELmode,SNPswillbeleiuntouched

NOTE:SNPsandIndelsmustberecalibratedseparately!

•  BeforeVQSR(inputvcf):

•  AierVQSR(outputvcf):

•  Hardfilteringvcf:

VQSRoutputVCF(vs.HardFilter)

Didtherecalibra2onworkproperly?

•  Commonerrormodes:–  “Nodatafound”

->toofewvariantsincallset,seedocsforworkarounds

–  “Annota2onXnotfoundforanyvariant”->annota6onnotpresentinfile,useVariantAnnotator

->ifrelatedtoInbreedingCoefficientprobablylessthan10samples,cannotusethisanno6on

–  Screwed-uptrancheplots,lownovelTiTv->wrongdbSNPversion,usetheoneinthebundle

NOTES

•  SmallernumberofvariantspersamplecomparedtoWGS->typicallyinsufficienttobuildarobustrecalibra6onmodel

ifrunningononlyafewsamples

Ø  Analyzesamplesjointlyincohortsofatleast30Ø  Ifnecessary,addexomesfrom1000GProject

•  Whattolookforinsamplesforpaddingacohort:–  Similartechnicalgenera2on(technology,capture,readlength,depth)–  Similarethnicbackground

VariantRecalibra2on(VQSR)onWExdata

•  ALWAYSdothis: •  NEVERdothis:

+

per-sampleGVCFs

BAMs 1000GBAMs

+BAMs 1000GVCF

GenotypeGVCFs

RawVCF

RawVCF

FilteredVCF

VQSR VQSR

FilteredVCFü  X

HC

GenotypeGVCFs

HC

Howtoaddexomesfrom1000Gtoyouranalysis

WhenshouldyouNOTrunVQSR?

•  Non-humanorganismswhereknownresourcesareunavailableorinsufficientlycurated

•  RNAseqdataàseeRNAseq-specificfiltering

•  Cohortistoosmallandnoothersamplesareavailablefor“padding”thecohort

àUsemanualfilteringrecommenda2onsinstead

YouarehereintheGATKBestPrac2cesworkflowforgermlinevariantdiscovery

Analysis-Ready Variants

111Raw Reads

Raw Variants IndelsSNPs

Analysis-ReadyReads

Indel Realignment

Base Recalibration

SNPs & Indels

Variants

IndelsSNPs

VariantAnnotation

Variant Evaluation

look good?

use in projecttroubleshoot

111Analysis-ReadyReads

Genotype Likelihoods

Joint Genotyping

Analysis-Ready

No

n-G

AT

K

Mark Duplicates& Sort (Picard)

Var. Calling HC in ERC mode

separately per variant type

Variant Recalibration

Map to Reference

BWA mem GenotypeRefinement

Data Pre-processing Variant Discovery>> >> Callset Refinement

FurtherreadinghMp://www.broadins2tute.org/gatk/guide/best-prac2ces

hMp://www.broadins2tute.org/gatk/guide/ar2cle?id=39

hMp://www.broadins2tute.org/gatk/gatkdocs/

org_broadins2tute_s2ng_gatk_walkers_variantrecalibra2on_VariantRecalibrator.html

hMp://www.broadins2tute.org/gatk/gatkdocs/org_broadins2tute_s2ng_gatk_walkers_variantrecalibra2on_ApplyRecalibra2on.html

talks

Documents

Variant Quality Score Recalibraon - UCLAqcb.ucla.edu/wp-content/uploads/sites/14/2016/03/GATKwr... · 2017. 2. 24. · MLEAC Max likelihood AC Note that VQSR will only look at INFO