Marking duplicates
Removing non-independent observations
talks
[Figure: GATK Best Practices workflow. Data Pre-processing (Map to Reference with BWA mem, Mark Duplicates & Sort with Picard, Indel Realignment, Base Recalibration) >> Variant Discovery (Var. Calling with HC in ERC mode, Joint Genotyping, Variant Recalibration separately per variant type) >> Callset Refinement (Genotype Refinement, Variant Annotation, Variant Evaluation).]
You are here in the GATK Best Practices workflow for germline variant discovery
Why mark duplicates?
[Figure: mapped reads aligned to the reference; the legend symbol marks a sequencing error propagated in duplicates.]
• Duplicates are sets of read pairs that have the same unclipped alignment start and unclipped alignment end
• They’re suspected to be non-independent measurements of a sequence
  – Sampled from the exact same template of DNA
  – Violates assumptions of variant calling
• What’s more, errors in sample/library prep will get propagated to all the duplicates
  – Just pick the “best” copy: this mitigates the effects of errors
Mark duplicates
How do duplication events arise?
PCR duplicates
Optical duplicates
Read names have the following form: @identifier:lane:tile:x:y
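Because the read name carries the lane, tile, and x/y coordinates, reads that are already known to be duplicates of each other can be checked for physical proximity on the flowcell. The following is a minimal sketch of that idea, not a description of Picard's implementation: the function names and the 100-pixel threshold are illustrative assumptions.

```python
from typing import NamedTuple


class Location(NamedTuple):
    lane: int
    tile: int
    x: int
    y: int


def parse_location(read_name: str) -> Location:
    """Parse the physical location from a name shaped like @identifier:lane:tile:x:y."""
    fields = read_name.lstrip("@").split(":")
    # The last four colon-separated fields are lane, tile, x, y.
    lane, tile, x, y = (int(field) for field in fields[-4:])
    return Location(lane, tile, x, y)


def looks_optical(name_a: str, name_b: str, max_pixel_distance: int = 100) -> bool:
    """Return True if two duplicate reads sit close together on the same tile.

    max_pixel_distance is a hypothetical threshold chosen for illustration.
    """
    a, b = parse_location(name_a), parse_location(name_b)
    return (
        a.lane == b.lane
        and a.tile == b.tile
        and abs(a.x - b.x) <= max_pixel_distance
        and abs(a.y - b.y) <= max_pixel_distance
    )
```

Duplicates that fail this proximity check would, under this sketch, be treated as PCR duplicates rather than optical ones.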
http://www.slideshare.net/jandot/next-generation-sequencing-course-part-2-sequence-mapping
http://www.slideshare.net/cosentia/illumina-gaiix-for-high-throughput-sequencing
Optical and PCR duplication events arise at different rates as a sequencing experiment proceeds
[Figure: rates of PCR duplicates and optical duplicates over the course of a sequencing experiment.]
How do we identify duplicate reads?
• Duplicates may come from the same input DNA template, so we assume that they have the same start position on the reference
  – “Where was the first base that was sequenced?”
  – For paired-end (PE) reads, same start for both ends
• Identify duplicate sets, then choose a representative read based on base quality scores and other criteria
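Choosing a representative read from a duplicate set can be sketched as follows. The scoring rule used here (sum of base qualities at or above a cutoff) is one common choice and the cutoff of 15 is an assumption for illustration; the "other criteria" mentioned above are not modelled.

```python
def quality_score(base_qualities, min_quality=15):
    """Sum the Phred base qualities that reach min_quality (hypothetical cutoff)."""
    return sum(q for q in base_qualities if q >= min_quality)


def choose_representative(duplicate_set):
    """Pick the best-scoring read from a duplicate set.

    duplicate_set maps read name -> list of Phred base qualities.
    Returns (representative_name, names_to_mark_as_duplicates).
    """
    best = max(duplicate_set, key=lambda name: quality_score(duplicate_set[name]))
    duplicates = sorted(name for name in duplicate_set if name != best)
    return best, duplicates
```

For example, `choose_representative({"r1": [30, 30, 10], "r3": [35, 35, 35]})` keeps `r3` and returns `r1` as the read to mark.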
But there’s a catch (or two)…
• BWA sometimes “clips” bases from the ends of the alignment (when the alignment there is poor)
• Need to use SAM flags + CIGAR string to determine the unclipped 5’ end
• Fragments mapped to the reverse strand are specified by their 3’ position (the leftmost mapped base) instead of their 5’ position
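Both catches can be handled by a small helper that combines the SAM flag, the 1-based leftmost position, and the CIGAR string. This is a sketch under stated assumptions: the function name is hypothetical, and the only facts relied on are that flag bit 0x10 marks a reverse-strand alignment and that S and H are the soft- and hard-clip CIGAR operations.

```python
import re

CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")
REVERSE_FLAG = 0x10  # SAM flag bit: read mapped to the reverse strand


def unclipped_five_prime(pos, cigar, flag):
    """Return the unclipped 5' reference position of an alignment.

    pos is the 1-based leftmost aligned position (the SAM POS field).
    """
    ops = [(int(length), op) for length, op in CIGAR_OP.findall(cigar)]

    if not flag & REVERSE_FLAG:
        # Forward strand: the 5' end is on the left, so undo leading clips.
        leading_clip = 0
        for length, op in ops:
            if op not in "SH":
                break
            leading_clip += length
        return pos - leading_clip

    # Reverse strand: the 5' end is on the right, so walk to the alignment
    # end (operations that consume reference) and add back trailing clips.
    reference_length = sum(length for length, op in ops if op in "MDN=X")
    trailing_clip = 0
    for length, op in reversed(ops):
        if op not in "SH":
            break
        trailing_clip += length
    return pos + reference_length - 1 + trailing_clip
```

A forward read at position 3 with CIGAR `2S5M` therefore has an unclipped 5’ position of 1, and a reverse read at position 1 with CIGAR `5M2H` has an unclipped 5’ position of 7.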
Identify duplicates using orientation + “unclipped” 5’ position

Pos     1  2  3  4  5  6  7  8  9
Ref     T  A  G  C  C  G  A  T  C
r1 (+)  T  A  G  C  C  G  A
r2 (-)  T  A  G  C  C  G  A
r3 (+)  T  A  –  C  CAG   A
r4 (-)  T  A  G  C  C  H  H
r5 (+)  T  A  G  C  C  G  A  T  C
r6 (+)  S  S  G  C  C  G  A
r7 (+)        G  C  C  G  A

(+) maps to the forward strand, (-) maps to the reverse strand (colour-coded on the slide; the strand shown here follows from the duplicate sets below)
S and H bases are clipped (grey on the slide)
Underlined on the slide is the expected 5’ start of the read, given the mapping

So… what are the duplicate sets?
☞ r1, r3, r5, r6 (start at position 1)
☞ r2, r4 (start at position 7)
☞ r7 (starts at position 3)
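The grouping in this example can be reproduced by keying each read on its orientation and unclipped 5’ position. In the sketch below the strand assignments and unclipped spans are assumptions read off the slide's answer (r2 and r4 on the reverse strand, the rest forward), and the function name is hypothetical.

```python
from collections import defaultdict

# (name, is_reverse, unclipped_start, unclipped_end) for the seven example reads.
READS = [
    ("r1", False, 1, 7),
    ("r2", True, 1, 7),
    ("r3", False, 1, 7),
    ("r4", True, 1, 7),   # positions 6-7 are hard-clipped but still counted
    ("r5", False, 1, 9),
    ("r6", False, 1, 7),  # positions 1-2 are soft-clipped but still counted
    ("r7", False, 3, 7),
]


def duplicate_sets(reads):
    """Group read names by (strand, unclipped 5' position)."""
    groups = defaultdict(list)
    for name, is_reverse, start, end in reads:
        # The 5' end is the right-hand end for reverse-strand reads.
        five_prime = end if is_reverse else start
        strand = "-" if is_reverse else "+"
        groups[(strand, five_prime)].append(name)
    return dict(groups)
```

Calling `duplicate_sets(READS)` yields three groups keyed by `("+", 1)`, `("-", 7)`, and `("+", 3)`, matching the three sets listed above.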
So now we have mapped, sorted, and deduped reads
[Figure: two views of the same region, one showing duplicate reads and one hiding duplicate reads.]
What this means for downstream analysis
• Duplicate status is indicated in the SAM flag
• Duplicates are not removed, just tagged (unless you request removal)
• Downstream tools can read the tag and choose to ignore those reads
• Most GATK tools ignore duplicates by default
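In the SAM format the duplicate status is bit 0x400 (decimal 1024) of the FLAG field, so a downstream tool only needs a bitwise test to skip tagged reads. A minimal sketch follows; the helper names and the (name, flag) record layout are assumptions for illustration.

```python
DUPLICATE_FLAG = 0x400  # SAM FLAG bit for "PCR or optical duplicate" (1024)


def is_duplicate(flag):
    """Return True if the duplicate bit is set in a SAM FLAG value."""
    return bool(flag & DUPLICATE_FLAG)


def usable_reads(records):
    """Keep the names of reads whose duplicate bit is not set.

    records is an iterable of (read_name, flag) pairs.
    """
    return [name for name, flag in records if not is_duplicate(flag)]
```

Because the reads are only tagged, the same file can still be analysed with duplicates included by skipping this filter.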
Use cases where you may NOT want to mark duplicates
• Amplicon sequencing -> all reads start at same position by design
• RNAseq allele-specific expression analysis (ASEReadCounter can disable DuplicateFilter)
Add-on: Predicting the complexity of a sequencing experiment
Complexity analysis depends on:
• Estimated library size
• Return on Investment (ROI) calculations
Estimation of library size and duplication in Picard
Estimated fraction of duplicates
Assumptions
● all reads are drawn from the same Poisson distribution Po(λ)
● the occurrence of duplication events depends on underlying concentration of inserts in the library
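Under the Poisson assumption, a library of X distinct molecules sampled with N read pairs is expected to yield C = X(1 − exp(−N/X)) unique read pairs. The sketch below inverts that relation numerically to estimate X from observed N and C, and reuses it for a simple return-on-investment projection. The relation is assumed here as the model behind the estimate; the function names and the bisection settings are illustrative, not Picard's code.

```python
import math


def expected_unique(library_size, read_pairs):
    """Expected number of unique read pairs when sampling read_pairs from the library."""
    return library_size * (1.0 - math.exp(-read_pairs / library_size))


def estimate_library_size(read_pairs, unique_read_pairs, iterations=60):
    """Solve C/X = 1 - exp(-N/X) for X by bisection.

    N = read_pairs examined, C = unique_read_pairs observed.
    """
    n, c = float(read_pairs), float(unique_read_pairs)
    if not 0 < c < n:
        raise ValueError("need 0 < unique_read_pairs < read_pairs")

    def f(x):
        # Positive when x is below the true library size, negative above it.
        return c / x - 1.0 + math.exp(-n / x)

    lo, hi = c, 100.0 * c
    while f(hi) > 0:  # widen the bracket until it contains the root
        hi *= 10.0
    for _ in range(iterations):
        mid = (lo + hi) / 2.0
        if f(mid) > 0:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0
```

As a usage example, once X is estimated, `expected_unique(X, 2 * N) / expected_unique(X, N)` gives the projected gain in unique read pairs from doubling the sequencing, which is the kind of ROI calculation mentioned above.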
Active research to improve library size estimation
• Rate of duplication varies with insert size
• Duplication rates also likely vary with GC content
Further reading
http://www.broadinstitute.org/gatk/guide/best-practices
http://broadinstitute.github.io/picard/