ECCB10 talk - Next-generation sequencing and structural variation

Next-generation sequencing and structural variation

Jan Aerts
Wellcome Trust Sanger Institute


principles & pitfalls
vs
list of commands

What is structural variation?

• “variation that changes the structure of a chromosome”

• Mechanisms: NAHR, NHEJ, FoSTeS
• This presentation: focus on discovery (not: genotyping)

“experiment 4” from last slide Thomas

Types of structural variation

Approaches for discovery

Combination of:
• Read pairs
• Read depth
• Split reads
• Fine-mapping breakpoints: local assembly

=> Identify signatures

A. Read Pairs

RP - General principle

• Paired-end library => insert size
• Orientation/distance

RP - Signatures

Medvedev et al, 2009
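As a rough illustration of how orientation and distance translate into signatures, here is a minimal Python sketch. It assumes a forward/reverse paired-end library and both reads on the same chromosome. The function name, the `lo`/`hi` cutoffs and the labels are illustrative choices, not taken from Medvedev et al.

```python
def classify_pair(strand1, strand2, pos1, pos2, lo, hi):
    """Return a coarse signature label for one read pair.

    strand1/strand2: '+' or '-'; pos1/pos2: leftmost mapped positions;
    lo/hi: accepted range for the mapped distance (hypothetical cutoffs).
    """
    # Order the two reads by position so "left" is the upstream read.
    (lpos, lstrand), (rpos, rstrand) = sorted([(pos1, strand1), (pos2, strand2)])
    distance = rpos - lpos
    if lstrand == rstrand:
        return "inversion"            # both reads map to the same strand
    if lstrand == "-" and rstrand == "+":
        return "tandem_duplication"   # everted pair: reads point away from each other
    if distance > hi:
        return "deletion"             # reads map further apart than expected
    if distance < lo:
        return "insertion"            # reads map closer together than expected
    return "concordant"
```

A single pair is only a hint; the later steps cluster pairs before calling anything.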

RP - Real world

RP - Workflow overview

• Mapping
• Identify discordant readpairs
• Cluster on location
• Filter on nr RPs/cluster
• Filter on RD
• Filter: mappingQ x #readpairs
• Identify signatures
• Alternative reference
• Validate

RP - Mapping

• Provides raw data => crucial
• MAQ/bwa
  – only report one hit (mappingQ = 0)
  – MAQ might prefer mismatches to aberrant distance!
• Insert size = distribution instead of exact

RP - Discordant readpairs

• Orientation
• Distance
  – Plot insert size distribution for chromosome
  – Very long tail! => difficult to set cutoff:
    • 4 mad or 0.01%?
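The two cutoff options can be compared on the same insert size list. The sketch below is a minimal Python version under stated assumptions: "mad" is read as the median absolute deviation, "0.01%" as the fraction of pairs left in the upper tail, and all function names are illustrative.

```python
from statistics import median

def mad_cutoffs(insert_sizes, n_mad=4):
    """Lower/upper cutoff at median +/- n_mad * median absolute deviation."""
    sizes = list(insert_sizes)
    med = median(sizes)
    mad = median([abs(s - med) for s in sizes])
    return med - n_mad * mad, med + n_mad * mad

def tail_cutoff(insert_sizes, tail_fraction=0.0001):
    """Upper cutoff leaving roughly tail_fraction (0.01%) of pairs above it."""
    sizes = sorted(insert_sizes)
    idx = min(len(sizes) - 1, int(len(sizes) * (1 - tail_fraction)))
    return sizes[idx]

def is_discordant_distance(size, cutoffs):
    """True if the insert size falls outside the (lo, hi) cutoffs."""
    lo, hi = cutoffs
    return size < lo or size > hi
```

The median and MAD are barely moved by the long tail, which is the usual argument for them over mean and standard deviation here.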

RP - Clustering

“standard clustering strategy”
– Only consider mate pairs that do not have concordant mappings
– Ignore read pairs that have more than one good mapping

Clustering: use insert size distribution (e.g. 2x4 mad)
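A minimal Python sketch of clustering on location, assuming the discordant pairs of one chromosome and one signature type are given as `(left_pos, right_pos)` tuples. The greedy single pass and the `max_gap` parameter (e.g. 2 x 4 mad of the insert size) are illustrative simplifications, not the method of any particular tool.

```python
def cluster_pairs(pairs, max_gap):
    """Group discordant pairs whose left and right ends both lie close together.

    pairs: list of (left_pos, right_pos); max_gap: maximum distance between
    neighbouring pairs in a cluster (hypothetical, derived from the insert
    size distribution).
    """
    clusters = []
    for left, right in sorted(pairs):
        last = clusters[-1] if clusters else None
        # Compare with the most recently added pair of the current cluster only.
        if (last and left - last[-1][0] <= max_gap
                and abs(right - last[-1][1]) <= max_gap):
            last.append((left, right))
        else:
            clusters.append([(left, right)])
    return clusters
```

Because each pair is compared with the latest cluster only, interleaved clusters would be split; a real implementation needs something more careful.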

RP - Clustering: issues

• Ignores pairs that have >1 good mapping => no detection within repetitive regions (segmental duplications)

• What cutoff for what is considered abnormal distance? (4 mad? 0.01%? 2 stdev?)

• Low library quality or mix of libraries => multiple peaks in size distribution

RP - Filtering

• On nr RPs/cluster
  – Normally: n=2
  – For high coverage (e.g. pilot 2: 80X): n=5
• On drop in RD & SR
• On (mappingQ x nrRP)
  – If published data available: ROC for different cutoffs mQ x nrRP
  – If not: very difficult
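The first and third filters can be sketched in a few lines of Python. This is a sketch under assumptions: each cluster is represented only by the mapping qualities of its read pairs, and "mappingQ x nrRP" is read as mean mapping quality times the number of pairs, which is one possible interpretation. The function name and default thresholds are illustrative.

```python
def filter_clusters(clusters, min_pairs=2, min_score=0):
    """Keep clusters with enough read pairs and a high enough mapQ-weighted score.

    clusters: list of clusters, each a list of per-pair mapping qualities.
    Returns (n_pairs, score) for every cluster that passes both filters.
    """
    kept = []
    for mapqs in clusters:
        n_pairs = len(mapqs)
        if n_pairs < min_pairs:
            continue
        mean_mapq = sum(mapqs) / n_pairs
        score = mean_mapq * n_pairs  # assumed reading of "mappingQ x nrRP"
        if score >= min_score:
            kept.append((n_pairs, score))
    return kept
```

With a published call set available, `min_score` could be swept over a range of values to draw the ROC curve mentioned above.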

RP - Issues

• Difficult => different groups = different results
• “consensus set”
  – RP & SP: many sets agree
  – RD: totally different

• CEU (80X): sometimes drop in RD in all 3, but RP spanning only in 2 => why??

• Mapper = critical; MAQ/bwa: only 1 mapping (=> many false negatives); mosaik, mrFAST: return more results

RP - Issues (2)

• Large insert size: low resolution for detecting breakpoints

• Small insert size: low resolution for detecting complex regions

B. Read Depth

RD - General principle

• Similar to aCGH: using reference RD file (e.g. based on 1kG)
• In theory: higher resolution, but noisier than aCGH
  – Algorithms not mature yet
  – More complex steps

=> Data binned
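Binning and the comparison against a reference RD can be sketched as follows. This is a minimal Python sketch that assumes read start positions as input, fixed-size bins, and a log2 ratio of library-size-normalised counts (as in aCGH). The pseudocount and all names are illustrative.

```python
import math

def bin_depth(read_starts, bin_size, n_bins):
    """Count reads per fixed-size bin along one chromosome."""
    counts = [0] * n_bins
    for pos in read_starts:
        idx = pos // bin_size
        if 0 <= idx < n_bins:
            counts[idx] += 1
    return counts

def log2_ratio(sample, reference, pseudocount=0.5):
    """Per-bin log2(sample/reference) after scaling by total read count.

    Assumes both totals are non-zero; the pseudocount avoids log(0).
    """
    s_total, r_total = sum(sample), sum(reference)
    return [
        math.log2(((s + pseudocount) / s_total) / ((r + pseudocount) / r_total))
        for s, r in zip(sample, reference)
    ]
```

Larger bins reduce noise at the cost of resolution, which is the trade-off behind the "higher resolution, but noisier" point.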

RD - Exome

here: using exome data

RD - Example

RD - Workflow overview

• Mapping
• Read filtering
• GC correction
• Spike identification
• Validation

RD - mapping

Critical…(see RP)

RD - Filtering

• mapQ
  – mapQ >= 0 (noisy; few FN, many FP)
  – mapQ >= 10
  – mapQ >= 30 (many FN, few FP)
• Mean depth exon (often: e.g. +/- 0.01)
  – Mean depth > 1
  – Mean depth > 5
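Both filters can be combined in one pass per exon. The sketch below is a minimal Python version, assuming reads as `(start, end, mapq)` tuples and exons as `(start, end)` with half-open coordinates. The function name and the representation are illustrative.

```python
def mean_exon_depth(exon, reads, min_mapq=10):
    """Mean per-base depth over an exon, counting only reads with mapq >= min_mapq."""
    ex_start, ex_end = exon
    covered_bases = 0
    for start, end, mapq in reads:
        if mapq < min_mapq:
            continue  # read filtered out on mapping quality
        overlap = min(end, ex_end) - max(start, ex_start)
        if overlap > 0:
            covered_bases += overlap
    return covered_bases / (ex_end - ex_start)
```

An exon would then be kept if, for example, `mean_exon_depth(exon, reads, 10) > 1`.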

RD - Filtering: what’s left

                    mapQ >= 30   mapQ >= 10   mapQ >= 0
mean DP exon > 5    152,000      153,000      160,000
mean DP exon > 1    162,000      163,000      169,000
all                 207,000      207,000      207,000

RD - correction

• Mainly: GC
  – Other: repeat-rich regions, mapping Q, …
• Fit linear model between GC content of exon and RD of exon => noise decreases
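A minimal Python sketch of the GC correction, assuming one GC fraction and one depth value per exon and an ordinary least-squares line. Subtracting the fitted value and adding back the overall mean is one common way to apply such a fit, not necessarily the one used for the data shown here. All names are illustrative.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit y = slope * x + intercept.

    Assumes at least two distinct x values.
    """
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

def gc_correct(gc, depth):
    """Remove the linear GC trend from per-exon depth, keeping the overall mean."""
    slope, intercept = fit_line(gc, depth)
    mean_depth = sum(depth) / len(depth)
    return [d - (slope * g + intercept) + mean_depth for g, d in zip(gc, depth)]
```

If depth depended on GC content alone, all corrected values would collapse onto the mean depth; what remains after correction is the signal of interest plus residual noise.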

RD - segmentation

• Identify spikes
• Many segmentation algorithms, e.g. GADA
• Issues: setting parameters: when to cut off peaks?
  – Combine outputs from different runs with different parameters
  – Compare to known CNVs
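To make the parameter problem concrete, here is a deliberately naive threshold-based caller in Python. It is not GADA or any published segmentation algorithm. It assumes per-bin (or per-exon) log2 ratios as input, and the `threshold` and `min_bins` parameters are exactly the kind of cutoffs that are hard to set.

```python
def call_segments(ratios, threshold=0.5, min_bins=2):
    """Return (start, end, state) runs where the log2 ratio stays beyond +/- threshold.

    start/end are bin indices (end exclusive); state is 'gain' or 'loss'.
    Assumes threshold > 0.
    """
    segments = []
    start, state = None, None
    for i, r in enumerate(list(ratios) + [0.0]):  # sentinel closes a trailing run
        cur = "gain" if r > threshold else "loss" if r < -threshold else None
        if cur != state:
            if state is not None and i - start >= min_bins:
                segments.append((start, i, state))
            start, state = i, cur
    return segments
```

Running it with several `threshold`/`min_bins` settings and intersecting the calls is one simple way to combine outputs from different runs.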

RD - Combine algorithms

RD - Issues

• How to assess TP/FP/FN? => compare with known CNVs

• Breakpoints: unknown
  – 1 datapoint/exon
  – Can be outside of exon

• Different parameters for rare vs common CNVs => which?

C. Split Reads

SR - Principle

SR - Mapping

Short subsequences => many possible mappings

Solution: “anchored split mapping” (e.g. Pindel)
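The idea behind anchoring is that the mapped mate restricts where the two halves of the split read may be searched. The Python sketch below illustrates this for a deletion using exact string matching only. It is a toy under stated assumptions, not Pindel's algorithm; the function name, `window` and `min_part` are illustrative.

```python
def split_map(read, ref, anchor_pos, window=1000, min_part=5):
    """Try to place `read` as prefix + suffix around a deletion near an anchor.

    Searches only ref[anchor_pos : anchor_pos + window], i.e. close to the
    mapped mate. Returns the deleted interval (start, end), end exclusive,
    or None if the read is not split in this window.
    """
    region = ref[anchor_pos:min(len(ref), anchor_pos + window)]
    # Try the longest prefix first, then progressively shorter ones.
    for cut in range(len(read) - min_part, min_part - 1, -1):
        prefix, suffix = read[:cut], read[cut:]
        p = region.find(prefix)
        if p == -1 or region.find(prefix, p + 1) != -1:
            continue  # prefix absent, or not unique within the window
        s = region.find(suffix, p + cut)
        if s > p + cut:  # a gap between the two parts suggests a deletion
            return anchor_pos + p + cut, anchor_pos + s
    return None
```

Restricting the search to the window keeps the number of candidate placements for the short parts small.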

D. Local reassembly

Aim: to determine breakpoints

Which reads?
– for deletions: local reads
– for insertions: hanging reads for read pairs with only one read mapped
– (rather not: unmapped reads)

For large region: split up

Assemblers

Velvet
ABySS
TIGRA
…

Conclusions

• Available algorithms: more to demonstrate technique rather than complete solution

• Different algorithms => different results

Chris Yoon

Genotyping

• Create alternative reference => remap reads
  – All reads vs reads covering variant loci
  – Whole-genome vs concatenation of variant loci
• Homozygous insertions/deletions: should disappear
• Heterozygous insertions/deletions: should have different signatures
• Bayesian approach: see what’s the most likely: do the reads support wild-type/het/homnonref?
• Not exact mapping => local reassembly
  – Microhomologies & non-template sequence => “breakpoint” = region of 2-10 bp
• Convention: left-most position reported (but not always)
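The Bayesian step can be sketched with a simple binomial model. This is a minimal Python sketch under assumptions: reads at the locus are counted as supporting either the reference or the variant allele, the expected variant fraction is `error`, 0.5 and `1 - error` for the three genotypes, and the prior is flat unless given. The function name and the default values are illustrative.

```python
from math import comb

def genotype_posteriors(n_alt, n_total, error=0.01, priors=None):
    """Posterior probability of wild-type / het / hom-nonref given read counts.

    n_alt: reads supporting the variant; n_total: all informative reads.
    """
    alt_fraction = {"wild-type": error, "het": 0.5, "hom-nonref": 1 - error}
    priors = priors or {g: 1 / 3 for g in alt_fraction}
    unnormalised = {
        g: priors[g] * comb(n_total, n_alt) * p ** n_alt * (1 - p) ** (n_total - n_alt)
        for g, p in alt_fraction.items()
    }
    total = sum(unnormalised.values())
    return {g: v / total for g, v in unnormalised.items()}
```

For example, `genotype_posteriors(5, 10)` puts most of the probability on `"het"`, while `genotype_posteriors(0, 10)` favours `"wild-type"`.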

References and software

• Medvedev P et al. Nat Methods 6(11):S13-S20 (2009)
• Lee S et al. Bioinformatics 24:i59-i67 (2008)
• Hormozdiari F et al. Genome Res 19:1270-1278 (2009)
• Campbell P et al. Nat Genet 40:722-729 (2008)
• Ye K et al. Bioinformatics 25(21):2865-2871 (2009)
• Chen K et al. Genome Res 19:1527-1541 (2009)
• Yoon S et al. Genome Res 19:1586-1592 (2009)
• Du J et al. PLoS Comp Biol 5(7):e1000432 (2009)
• Aerts J & Tyler-Smith C. In: Encyclopedia of Life Sciences (2009)

Questions?