Marking duplicates: Removing non-independent observations

Marking duplicates - University of California, Los Angeles · 2017-02-24 · on-GATK Mark Duplicates & Sort (Picard) Var. Calling HC in ERC mode separately per variant type Variant


Page 1: Marking duplicates

Removing non-independent observations

Page 2: You are here in the GATK Best Practices workflow for germline variant discovery

[Figure: GATK Best Practices workflow diagram in three phases, Data Pre-processing >> Variant Discovery >> Callset Refinement. Boxes shown: Raw Reads; Map to Reference (BWA mem); Mark Duplicates & Sort (Picard); Indel Realignment; Base Recalibration; Analysis-Ready Reads; Var. Calling (HC in ERC mode); Genotype Likelihoods; Joint Genotyping; Raw Variants (SNPs, Indels); Variant Recalibration (separately per variant type); Analysis-Ready Variants (SNPs & Indels); Genotype Refinement; Variant Annotation; Variant Evaluation (look good? use in project / troubleshoot). Some steps carry a "Non-GATK" label.]

Page 3: Why mark duplicates?

[Figure: mapped reads against the reference, with a sequencing error propagated in duplicates; labeled “Mark duplicates”]

•  Duplicates are sets of read pairs that have the same unclipped alignment start and unclipped alignment end
•  They’re suspected to be non-independent measurements of a sequence
   –  Sampled from the exact same template of DNA
   –  Violates assumptions of variant calling
•  What’s more, errors in sample/library prep will get propagated to all the duplicates
   –  Just pick the “best” copy: this mitigates the effects of errors

Page 4: How do duplication events arise?

PCR duplicates

Optical duplicates
Read names have the following form: @identifier:lane:tile:x:y

http://www.slideshare.net/jandot/next-generation-sequencing-course-part-2-sequence-mapping
http://www.slideshare.net/cosentia/illumina-gaiix-for-high-throughput-sequencing
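Because the read name encodes where a cluster sat on the flowcell, two reads that are already in the same duplicate set can be checked for physical proximity. The following is a minimal sketch, assuming the five-field name format shown above; the function names and the 100-pixel default threshold are illustrative choices, not taken from the slides.

```python
from typing import NamedTuple


class ClusterLocation(NamedTuple):
    identifier: str
    lane: int
    tile: int
    x: int
    y: int


def parse_read_name(name: str) -> ClusterLocation:
    """Split a read name of the form @identifier:lane:tile:x:y."""
    fields = name.lstrip("@").split(":")
    if len(fields) != 5:
        raise ValueError(f"unexpected read name format: {name!r}")
    identifier, lane, tile, x, y = fields
    return ClusterLocation(identifier, int(lane), int(tile), int(x), int(y))


def is_optical_pair(name_a: str, name_b: str, max_pixel_distance: int = 100) -> bool:
    """Return True if two duplicate reads sit close together on the same tile.

    max_pixel_distance is an assumed, tunable threshold.
    """
    a, b = parse_read_name(name_a), parse_read_name(name_b)
    same_tile = (a.identifier, a.lane, a.tile) == (b.identifier, b.lane, b.tile)
    close = (abs(a.x - b.x) <= max_pixel_distance
             and abs(a.y - b.y) <= max_pixel_distance)
    return same_tile and close
```

Duplicates that fail this check (different tile, or far apart on the same tile) would be attributed to PCR rather than to the imaging step.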

Page 5: Optical and PCR duplication events arise at different rates as a sequencing experiment proceeds

[Figure panels: PCR duplicates; Optical duplicates]

Page 6: How do we identify duplicate reads?

•  Dupes might come from the same input DNA template, so we will assume that reads will have the same start position on the reference
   –  “Where was the first base that was sequenced?”
   –  For paired-end (PE) reads, same start for both ends
•  Identify duplicate sets, then choose a representative read based on base quality scores and other criteria
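Choosing the representative read can be sketched as a scoring step over each duplicate set. This is a minimal sketch, assuming the score is the sum of base qualities at or above a cutoff; the cutoff of 15, the function names, and the example qualities are illustrative assumptions, and the "other criteria" mentioned above are not modeled.

```python
from typing import Dict, List


def quality_score(base_qualities: List[int], min_quality: int = 15) -> int:
    """Sum the base qualities that reach the (assumed) minimum quality."""
    return sum(q for q in base_qualities if q >= min_quality)


def choose_representative(duplicate_set: Dict[str, List[int]]) -> str:
    """Return the name of the highest-scoring read in one duplicate set.

    Ties go to the read that appears first in the dictionary.
    """
    return max(duplicate_set, key=lambda name: quality_score(duplicate_set[name]))


# Hypothetical base qualities for three reads in one duplicate set.
dupes = {
    "r1": [30, 30, 12, 30],
    "r3": [35, 35, 35, 20],
    "r5": [10, 10, 40, 40],
}
representative = choose_representative(dupes)
```

Every other read in the set would then be flagged as a duplicate, leaving the representative unflagged.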

Page 7: But there’s a catch (or two)…

•  BWA sometimes “clips” bases from the ends of the alignment (when the alignment there is poor)
•  Need to use SAM flags + CIGAR string to determine the unclipped 5’ end
•  Fragments mapped to the reverse strand are specified by their 3’ position, instead of 5’
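The two catches above can be handled together with a small calculation. This is a sketch, assuming the standard SAM conventions that POS is the 1-based leftmost aligned base and that flag bit 0x10 marks a reverse-strand alignment; the function name is illustrative.

```python
import re

REVERSE_STRAND_FLAG = 0x10                 # SAM flag bit: read mapped to reverse strand
CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")
CLIP_OPS = {"S", "H"}                      # soft and hard clips
REFERENCE_OPS = {"M", "D", "N", "=", "X"}  # operations that consume reference bases


def unclipped_five_prime(pos: int, cigar: str, flag: int) -> int:
    """Return the reference position of the read's 5' end, clips included."""
    ops = [(int(length), op) for length, op in CIGAR_OP.findall(cigar)]
    if not flag & REVERSE_STRAND_FLAG:
        # Forward strand: the 5' end is on the left, so undo leading clips.
        leading_clip = 0
        for length, op in ops:
            if op not in CLIP_OPS:
                break
            leading_clip += length
        return pos - leading_clip
    # Reverse strand: POS is the 3' end, so walk to the aligned end
    # and then add back any trailing clips.
    aligned_end = pos + sum(length for length, op in ops if op in REFERENCE_OPS) - 1
    trailing_clip = 0
    for length, op in reversed(ops):
        if op not in CLIP_OPS:
            break
        trailing_clip += length
    return aligned_end + trailing_clip
```

For example, a forward read with POS 3 and CIGAR 2S5M has its unclipped 5’ end at position 1, while a reverse read with POS 1 and CIGAR 5M2H has it at position 7.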

Page 8: Identify duplicates using orientation + “unclipped” 5’ position

Pos  1  2  3  4  5  6  7  8  9   Strand
Ref  T  A  G  C  C  G  A  T  C
r1   T  A  G  C  C  G  A         fwd
r2   T  A  G  C  C  G  A         rev
r3   T  A  –  C  C AG  A         fwd
r4   T  A  G  C  C  H  H         rev
r5   T  A  G  C  C  G  A  T  C   fwd
r6   S  S  G  C  C  G  A         fwd
r7         G  C  C  G  A         fwd

fwd (blue on the slide) maps to forward strand
rev (red on the slide) maps to reverse strand
S and H (grey on the slide) are clipped bases
Underlined on the slide is the expected 5’ start of the read, given the mapping

What are the duplicate sets?

Page 11: Identify duplicates using orientation + “unclipped” 5’ position

Pos  1  2  3  4  5  6  7  8  9   Strand
Ref  T  A  G  C  C  G  A  T  C
r1   T  A  G  C  C  G  A         fwd
r2   T  A  G  C  C  G  A         rev
r3   T  A  –  C  C AG  A         fwd
r4   T  A  G  C  C  H  H         rev
r5   T  A  G  C  C  G  A  T  C   fwd
r6   S  S  G  C  C  G  A         fwd
r7         G  C  C  G  A         fwd

fwd (blue on the slide) maps to forward strand
rev (orange on the slide) maps to reverse strand
S and H (grey on the slide) are clipped bases
Underlined on the slide is the expected 5’ start of the read, given the mapping

So… what are the duplicate sets?
☞  r1, r3, r5, r6 (start at position 1)
☞  r2, r4 (start at position 7)
☞  r7 (starts at position 3)
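The grouping in this example can be reproduced in a few lines. This is a sketch only: the tuples below transcribe the seven reads from the slide, with strand assignments read off from the slide's answers, and the function names are illustrative.

```python
from collections import defaultdict

# (name, strand, aligned_start, aligned_end, clipped_before, clipped_after)
READS = [
    ("r1", "+", 1, 7, 0, 0),
    ("r2", "-", 1, 7, 0, 0),
    ("r3", "+", 1, 7, 0, 0),  # contains a deletion and an insertion
    ("r4", "-", 1, 5, 0, 2),  # two hard-clipped bases (H H)
    ("r5", "+", 1, 9, 0, 0),
    ("r6", "+", 3, 7, 2, 0),  # two soft-clipped bases (S S)
    ("r7", "+", 3, 7, 0, 0),
]


def duplicate_key(strand, start, end, clipped_before, clipped_after):
    """Orientation plus unclipped 5' position identifies a duplicate set."""
    if strand == "+":
        return strand, start - clipped_before
    return strand, end + clipped_after


def duplicate_sets(reads):
    """Group read names by their duplicate key, preserving input order."""
    groups = defaultdict(list)
    for name, *coords in reads:
        groups[duplicate_key(*coords)].append(name)
    return dict(groups)


sets = duplicate_sets(READS)
```

Note that r6 and r7 share the same aligned start (position 3) but land in different sets, because only r6 has clipped bases that move its unclipped 5’ end back to position 1.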

Page 12: So now we have mapped, sorted, and deduped reads

[Figure panels: Showing duplicate reads; Hiding duplicate reads]

Page 13: What this means for downstream analysis

•  Duplicate status is indicated in the SAM flag
•  Duplicates are not removed, just tagged (unless you request removal)
•  Downstream tools can read the tag and choose to ignore those reads
•  Most GATK tools ignore duplicates by default
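How a downstream tool might read the tag can be sketched with plain bit operations. In the SAM format the duplicate status is bit 0x400 of the FLAG field; the helper names and the (read_name, flag) record shape below are illustrative assumptions.

```python
DUPLICATE_FLAG = 0x400  # SAM flag bit: PCR or optical duplicate


def is_duplicate(flag: int) -> bool:
    """Return True if the duplicate bit is set in a SAM flag."""
    return bool(flag & DUPLICATE_FLAG)


def mark_duplicate(flag: int) -> int:
    """Return the flag with the duplicate bit set; the read itself is kept."""
    return flag | DUPLICATE_FLAG


def without_duplicates(records):
    """Filter (read_name, flag) records the way a duplicate-aware tool would."""
    return [(name, flag) for name, flag in records if not is_duplicate(flag)]
```

Because marking only sets a bit, the decision to ignore or keep duplicates stays with each downstream tool.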

Page 14: Use cases where you may NOT want to mark duplicates

•  Amplicon sequencing -> all reads start at the same position by design
•  RNAseq allele-specific expression analysis (ASEReadCounter can disable DuplicateFilter)

Page 15: Add-on: Predicting the complexity of a sequencing experiment

Complexity analysis depends on:
•  Estimated library size
•  Return on Investment (ROI) calculations

Page 16: Estimation of library size and duplication in Picard

[Figure: estimated fraction of duplicates]

Assumptions
●  all reads are drawn from the same Poisson distribution Po(λ)
●  the occurrence of duplication events depends on the underlying concentration of inserts in the library
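One way to turn these assumptions into numbers is the Lander-Waterman relation C/X = 1 - exp(-N/X), which Picard's metrics documentation cites for its library size estimate (X = distinct molecules in the library, N = read pairs sequenced, C = distinct read pairs observed). The sketch below solves it by bisection; the function names and the solver are illustrative and are not Picard's implementation.

```python
import math


def expected_distinct(library_size: float, read_pairs: float) -> float:
    """Distinct pairs expected after sequencing read_pairs from the library."""
    return library_size * (1.0 - math.exp(-read_pairs / library_size))


def estimate_library_size(read_pairs: float, distinct_pairs: float) -> float:
    """Solve C/X = 1 - exp(-N/X) for X by bisection."""
    if not 0 < distinct_pairs < read_pairs:
        raise ValueError("need 0 < distinct_pairs < read_pairs")
    # expected_distinct grows with library size, so bracket the root and bisect.
    lo, hi = float(distinct_pairs), 2.0 * distinct_pairs
    while expected_distinct(hi, read_pairs) < distinct_pairs:
        hi *= 2.0
    for _ in range(200):
        mid = (lo + hi) / 2.0
        if expected_distinct(mid, read_pairs) < distinct_pairs:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0


def expected_duplicate_fraction(library_size: float, read_pairs: float) -> float:
    """Fraction of duplicates expected at a given sequencing depth."""
    return 1.0 - expected_distinct(library_size, read_pairs) / read_pairs
```

With an estimated library size in hand, expected_duplicate_fraction can be evaluated at deeper sequencing depths to judge whether additional sequencing would mostly return new molecules or mostly duplicates, which is the ROI question from the previous slide.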

Page 17: Active research to improve library size estimation

•  Rate of duplication varies with insert size
•  Duplication rates also likely vary with GC content

Page 18: You are here in the GATK Best Practices workflow for germline variant discovery

[Figure: GATK Best Practices workflow diagram (Data Pre-processing >> Variant Discovery >> Callset Refinement), repeated from Page 2]

Page 19: Further reading

http://www.broadinstitute.org/gatk/guide/best-practices

http://broadinstitute.github.io/picard/