Next-generation sequencing and structural variation
Jan Aerts
Wellcome Trust Sanger Institute
principles & pitfalls vs list of commands
What is structural variation?
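• “variation that changes the structure of a chromosome”
• Mechanisms: NAHR (non-allelic homologous recombination), NHEJ (non-homologous end joining), FoSTeS (fork stalling and template switching)
• This presentation: focus on discovery (not: genotyping)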
“experiment 4” from last slide Thomas
Types of structural variation
Approaches for discovery
Combination of:
• Read pairs
• Read depth
• Split reads
• Fine-mapping breakpoints: local assembly
=> Identify signatures
A. Read Pairs
RP - General principle
• Paired-end library => insert size
• Orientation/distance
RP - Signatures
Medvedev et al, 2009
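A minimal sketch of how orientation and distance translate into signatures. It assumes a standard paired-end library where a concordant pair has the left read on the + strand and the right read on the - strand; the function name `classify_pair` and the cutoffs are hypothetical, and real callers handle many more cases.

```python
def classify_pair(left_strand, right_strand, distance, lower, upper):
    """Classify one mapped read pair by orientation and mapped distance.

    Assumes the expected orientation is '+' (left read) / '-' (right read).
    lower/upper are the accepted bounds of the insert size distribution.
    """
    orientation = left_strand + right_strand
    if orientation in ("++", "--"):
        return "inversion"            # one read flipped
    if orientation == "-+":
        return "tandem duplication"   # everted pair
    if distance > upper:
        return "deletion"             # reads map further apart than expected
    if distance < lower:
        return "insertion"            # reads map closer together than expected
    return "concordant"


example = classify_pair("+", "-", 5000, lower=180, upper=220)
```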
RP - Real world
RP - Workflow overview
• Mapping
• Identify discordant read pairs
• Cluster on location
• Filter on nr RPs/cluster
• Filter on RD
• Filter: mappingQ x #readpairs
• Identify signatures
• Alternative reference
• Validate
RP - Mapping
• Provides raw data => crucial
• MAQ/bwa
– report only one hit for reads with several equally good mappings (mappingQ = 0)
– MAQ might prefer mismatches to aberrant distance!
• Insert size = a distribution, not an exact value
RP - Discordant readpairs
• Orientation
• Distance
– Plot insert size distribution for chromosome
– Very long tail! => difficult to set cutoff: 4 x MAD (median absolute deviation) or 0.01%?
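One way to set a distance cutoff from the insert size distribution, sketched with the median and MAD because both are robust to the long tail. The function names and the example sizes are made up for illustration; `n_mad=4` is just one of the candidate cutoffs.

```python
import statistics


def insert_size_cutoffs(insert_sizes, n_mad=4):
    """Return (lower, upper) bounds as median +/- n_mad * MAD."""
    med = statistics.median(insert_sizes)
    mad = statistics.median([abs(x - med) for x in insert_sizes])
    return med - n_mad * mad, med + n_mad * mad


def has_discordant_distance(size, bounds):
    """True if the mapped distance falls outside the accepted bounds."""
    lower, upper = bounds
    return size < lower or size > upper


# Toy insert sizes (hypothetical); the last one sits in the long tail.
sizes = [190, 195, 200, 200, 205, 210, 198, 202, 5000]
bounds = insert_size_cutoffs(sizes)
flagged = [s for s in sizes if has_discordant_distance(s, bounds)]
```

A mean/stdev cutoff computed on the same data would be pulled upwards by the single 5000 bp pair, which is why the robust statistics are used here.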
RP - Clustering
“standard clustering strategy”
– Only consider mate pairs that do not have concordant mappings
– Ignore read pairs that have more than one good mapping
Clustering: use insert size distribution (e.g. 2 x 4 MAD)
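A rough sketch of the clustering step, assuming the discordant pairs have already been selected. The `ReadPair` record and the single-linkage rule are illustrative assumptions, not the method of any particular tool; `max_gap` would be derived from the insert size distribution.

```python
from collections import namedtuple

# Hypothetical record: positions of both reads plus the number of good mappings.
ReadPair = namedtuple("ReadPair", "chrom left right n_good_mappings")


def cluster_discordant_pairs(pairs, max_gap):
    """Group discordant pairs whose left AND right reads lie close together.

    Pairs with more than one good mapping are ignored.
    """
    unique = sorted((p for p in pairs if p.n_good_mappings == 1),
                    key=lambda p: (p.chrom, p.left))
    clusters = []
    for pair in unique:
        last = clusters[-1][-1] if clusters else None
        if (last is not None
                and pair.chrom == last.chrom
                and pair.left - last.left <= max_gap
                and abs(pair.right - last.right) <= max_gap):
            clusters[-1].append(pair)
        else:
            clusters.append([pair])
    return clusters


pairs = [
    ReadPair("chr1", 1000, 6000, 1),
    ReadPair("chr1", 1020, 6030, 1),
    ReadPair("chr1", 1010, 6010, 3),    # multiple good mappings: ignored
    ReadPair("chr1", 50000, 90000, 1),
]
clusters = cluster_discordant_pairs(pairs, max_gap=40)
cluster_sizes = [len(c) for c in clusters]
```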
RP - Clustering: issues
• Ignores pairs that have >1 good mapping => no detection within repetitive regions (segmental duplications)
• What cutoff for what is considered abnormal distance? (4 MAD? 0.01%? 2 stdev?)
• Low library quality or mix of libraries => multiple peaks in size distribution
RP - Filtering
• On nr RPs/cluster
– Normally: n=2
– For high coverage (e.g. pilot 2: 80X): n=5
• On drop in RD & SR
• On (mappingQ x nrRP)
– If published data available: ROC for different cutoffs mQ x nrRP
– If not: very difficult
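The two count-based filters could look roughly like this. The score is assumed here to be the mean mapping quality of a cluster multiplied by its number of read pairs; that definition, the input layout and the `min_score` value are assumptions for illustration only.

```python
def filter_clusters(cluster_mapqs, min_pairs=2, min_score=0.0):
    """Keep clusters with enough read pairs and a high enough mQ x nrRP score.

    cluster_mapqs maps a cluster id to the mapping qualities of its read pairs.
    """
    kept = []
    for cluster_id, mapqs in cluster_mapqs.items():
        n_pairs = len(mapqs)
        if n_pairs < min_pairs:
            continue
        score = (sum(mapqs) / n_pairs) * n_pairs  # mean mapQ x nr of read pairs
        if score >= min_score:
            kept.append(cluster_id)
    return kept


# Hypothetical clusters: id -> mapping qualities of supporting pairs.
cluster_mapqs = {"c1": [60, 60, 55], "c2": [30], "c3": [5, 3]}
kept = filter_clusters(cluster_mapqs, min_pairs=2, min_score=50)
```

With published calls available, `min_score` would be chosen from a ROC curve over candidate cutoffs rather than fixed by hand.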
RP - Issues
• Difficult => different groups = different results
• “consensus set”
– RP & SR: many sets agree
– RD: totally different
• CEU (80X): sometimes drop in RD in all 3, but RP spanning only in 2 => why??
• Mapper = critical; MAQ/bwa: only 1 mapping (=> many false negatives); MOSAIK, mrFAST: return more results
RP - Issues (2)
• Large insert size: low resolution for detecting breakpoints
• Small insert size: low resolution for detecting complex regions
B. Read Depth
RD - General principle
• Similar to aCGH: using reference RD file (e.g. based on 1kG)
• In theory: higher resolution, but noisier than aCGH
– Algorithms not mature yet
– More complex steps
=> Data binned
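A small sketch of binning and comparing against a reference, analogous to an aCGH log ratio. It assumes sample and reference have already been scaled to the same total coverage; the bin size, pseudocount and read positions are hypothetical.

```python
import math
from collections import Counter


def bin_counts(read_starts, bin_size, n_bins):
    """Count reads per fixed-size bin from 0-based read start positions."""
    counts = Counter(pos // bin_size for pos in read_starts)
    return [counts.get(i, 0) for i in range(n_bins)]


def log2_ratios(sample, reference, pseudocount=0.5):
    """Per-bin log2(sample / reference); the pseudocount avoids division by zero."""
    return [math.log2((s + pseudocount) / (r + pseudocount))
            for s, r in zip(sample, reference)]


sample_bins = bin_counts([5, 40, 60, 90, 120, 180], bin_size=100, n_bins=2)
reference_bins = bin_counts([10, 30, 50, 70, 110, 130, 150, 170],
                            bin_size=100, n_bins=2)
ratios = log2_ratios(sample_bins, reference_bins)
```

In this toy data the second bin has half the reference depth, so its log2 ratio is negative, the kind of drop that a deletion would produce.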
RD - Exome
here: using exome data
RD - Example
RD - Workflow overview
• Mapping
• Read filtering
• GC correction
• Spike identification
• Validation
RD - mapping
Critical…(see RP)
RD - Filtering
• mapQ
– mapQ >= 0 (noisy; few FN, many FP)
– mapQ >= 10
– mapQ >= 30 (many FN, few FP)
• Mean depth exon (often: e.g. +/- 0.01)
– Mean depth > 1
– Mean depth > 5
RD - Filtering: what’s left
                    mapQ >= 30   mapQ >= 10   mapQ >= 0
mean DP exon > 5    152,000      153,000      160,000
mean DP exon > 1    162,000      163,000      169,000
all                 207,000      207,000      207,000
RD - correction
• Mainly: GC
– Other: repeat-rich regions, mapping Q, …
• Fit linear model of exon RD against exon GC content => noise decreases
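A minimal sketch of the GC correction, assuming one GC fraction and one read depth value per exon. It fits an ordinary least-squares line and keeps the residuals, re-centred on the mean depth; the function name and toy values are illustrative only.

```python
def gc_correct(gc, depth):
    """Remove the linear GC trend from per-exon read depth.

    Returns (corrected_depths, slope, intercept).
    """
    n = len(gc)
    mean_gc = sum(gc) / n
    mean_rd = sum(depth) / n
    sxx = sum((g - mean_gc) ** 2 for g in gc)
    sxy = sum((g - mean_gc) * (d - mean_rd) for g, d in zip(gc, depth))
    slope = sxy / sxx
    intercept = mean_rd - slope * mean_gc
    # Residual from the fitted line, shifted back to the overall mean depth.
    corrected = [d - (slope * g + intercept) + mean_rd
                 for g, d in zip(gc, depth)]
    return corrected, slope, intercept


# Toy exons whose depth depends only on GC content (hypothetical values).
gc = [0.3, 0.4, 0.5, 0.6]
depth = [20, 30, 40, 50]
corrected, slope, intercept = gc_correct(gc, depth)
```

Because the toy depths are fully explained by GC, all corrected values collapse to the mean depth; with real data the remaining spread is what gets passed on to segmentation.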
RD - segmentation
• Identify spikes
• Many segmentation algorithms, e.g. GADA
• Issues: setting parameters: when to cut off peaks?
– Combine outputs from different runs with different parameters
– Compare to known CNVs
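To illustrate the parameter problem, here is a deliberately naive threshold-based caller (this is not GADA) run with two parameter settings, followed by a simple vote across runs. Function names, thresholds and values are hypothetical; `threshold` is assumed to be positive.

```python
def call_segments(values, threshold, min_len=1):
    """Return (start, end, state) runs, end exclusive, where consecutive
    values are all >= threshold ("gain") or all <= -threshold ("loss")."""
    segments = []
    start, state = 0, 0
    for i, v in enumerate(list(values) + [0.0]):  # sentinel closes last run
        s = 1 if v >= threshold else -1 if v <= -threshold else 0
        if s != state:
            if state != 0 and i - start >= min_len:
                segments.append((start, i, "gain" if state > 0 else "loss"))
            start, state = i, s
    return segments


def consensus_points(runs, n_points, min_runs):
    """Indices of data points called in at least min_runs of the runs."""
    votes = [0] * n_points
    for segments in runs:
        for start, end, _state in segments:
            for i in range(start, end):
                votes[i] += 1
    return [i for i, v in enumerate(votes) if v >= min_runs]


ratios = [0.0, 0.1, -1.0, -0.9, -1.1, 0.0, 0.8, 0.0]
strict = call_segments(ratios, threshold=0.5, min_len=2)
loose = call_segments(ratios, threshold=0.5, min_len=1)
agreed = consensus_points([strict, loose], len(ratios), min_runs=2)
```

The single-point gain is only reported by the loose run, so it drops out of the consensus.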
RD - Combine algorithms
RD - Issues
• How to assess TP/FP/FN? => compare with known CNVs
• Breakpoints: unknown
– 1 datapoint/exon
– Can be outside of exon
• Different parameters for rare vs common CNVs => which?
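A sketch of comparing calls with known CNVs on one chromosome. Matching by 50% reciprocal overlap is an assumption made here for illustration, and "known CNVs" is not a true gold standard, so the FP and FN counts are only indicative.

```python
def reciprocal_overlap(a, b):
    """Smallest fraction of either interval covered by their overlap."""
    (a_start, a_end), (b_start, b_end) = a, b
    overlap = min(a_end, b_end) - max(a_start, b_start)
    if overlap <= 0:
        return 0.0
    return min(overlap / (a_end - a_start), overlap / (b_end - b_start))


def assess_calls(calls, known, min_ro=0.5):
    """Split calls into TP/FP and list known CNVs that were missed (FN)."""
    tp = [c for c in calls
          if any(reciprocal_overlap(c, k) >= min_ro for k in known)]
    fp = [c for c in calls if c not in tp]
    fn = [k for k in known
          if not any(reciprocal_overlap(c, k) >= min_ro for c in calls)]
    return tp, fp, fn


# Hypothetical (start, end) intervals on a single chromosome.
calls = [(100, 200), (500, 600)]
known = [(110, 210), (1000, 1100)]
tp, fp, fn = assess_calls(calls, known)
```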
C. Split Reads
SR - Principle
SR - Mapping
Short subsequences => many possible mappings
Solution: “anchored split mapping” (e.g. Pindel)
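The idea behind anchoring can be sketched as follows: the mapped mate restricts the search to a small window, and the unmapped read is split into a prefix and a suffix that are searched only there. This toy version uses exact string matching and the first prefix hit; it is not Pindel's actual pattern-growth algorithm, and all names and sequences are made up.

```python
def anchored_split_map(read, reference, window_start, window_end, min_part=5):
    """Look for a deletion that splits `read` inside the anchored window.

    Returns (deletion_start, deletion_end) in 0-based reference coordinates
    (end exclusive), or None if no split placement is found.
    """
    window = reference[window_start:window_end]
    for cut in range(min_part, len(read) - min_part + 1):
        prefix, suffix = read[:cut], read[cut:]
        p = window.find(prefix)
        if p == -1:
            break  # a longer prefix cannot match either
        s = window.find(suffix, p + cut)
        if s > p + cut:  # suffix maps downstream with a gap: deletion
            return window_start + p + cut, window_start + s
    return None


left, deleted, right = "ACGTACCTGA", "TTTTTTTT", "GCCATGGACT"
reference = left + deleted + right
read = left[-6:] + right[:6]          # read spanning the deletion
deletion = anchored_split_map(read, reference, 0, len(reference))
```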
D. Local reassembly
Aim: to determine breakpoints
Which reads?
– for deletions: local reads
– for insertions: hanging reads for read pairs with only one read mapped
– (rather not: unmapped reads)
For large region: split up
Assemblers
• Velvet
• ABySS
• TIGRA
• …
Conclusions
• Available algorithms: serve more to demonstrate a technique than to provide a complete solution
• Different algorithms => different results
Chris Yoon
Genotyping
• Create alternative reference => remap reads
– All reads vs reads covering variant loci
– Whole-genome vs concatenation of variant loci
• Homozygous insertions/deletions: should disappear
• Heterozygous insertions/deletions: should have different signatures
• Bayesian approach: see what’s the most likely: do the reads support wild-type/het/homnonref?
• Not exact mapping => local reassembly
– Microhomologies & non-template sequence => “breakpoint” = region of 2-10 bp
• Convention: left-most position reported (but not always)
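The Bayesian step can be sketched with a binomial model on the number of reads supporting the reference and the alternative allele after remapping. The error rate, the flat prior and the function name are assumptions for illustration; real genotypers also use mapping and base qualities.

```python
from math import comb


def genotype_posteriors(n_ref, n_alt, error=0.01, priors=None):
    """Posterior probability of wild-type / het / hom-nonref.

    n_ref and n_alt are counts of reads supporting each allele.
    """
    # Expected fraction of alt-supporting reads under each genotype.
    alt_fraction = {"wild-type": error, "het": 0.5, "hom-nonref": 1 - error}
    if priors is None:
        priors = {g: 1 / 3 for g in alt_fraction}  # flat prior (assumption)
    n = n_ref + n_alt
    unnormalised = {
        g: priors[g] * comb(n, n_alt) * f ** n_alt * (1 - f) ** n_ref
        for g, f in alt_fraction.items()
    }
    total = sum(unnormalised.values())
    return {g: v / total for g, v in unnormalised.items()}


balanced = genotype_posteriors(n_ref=10, n_alt=9)
all_alt = genotype_posteriors(n_ref=0, n_alt=20)
best_balanced = max(balanced, key=balanced.get)
best_all_alt = max(all_alt, key=all_alt.get)
```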
References and software
• Medvedev P et al. Nat Methods 6(11):S13-S20 (2009)
• Lee S et al. Bioinformatics 24:i59-i67 (2008)
• Hormozdiari F et al. Genome Res 19:1270-1278 (2009)
• Campbell P et al. Nat Genet 40:722-729 (2008)
• Ye K et al. Bioinformatics 25(21):2865-2871 (2009)
• Chen K et al. Genome Res 19:1527-1541 (2009)
• Yoon S et al. Genome Res 19:1586-1592 (2009)
• Du J et al. PLoS Comp Biol 5(7):e1000432 (2009)
• Aerts J & Tyler-Smith C. In: Encyclopedia of Life Sciences (2009)
Questions?