Reference Assisted Nucleic Acid Sequence Reconstruction from
Mass Spectrometry Data
Gabriel Ilie (1), Alex Zelikovsky (2) and Ion Măndoiu (1)
(1) CSE Department, University of Connecticut; (2) CS Department, Georgia State University
MassCLEAVE assay for MS-based nucleic acid sequence analysis
Error model
• Signed relative errors are assumed to follow a normal distribution with mean 0 and standard deviation σ for masses and σ’ for intensities
• Two types of error are incurred when matching compomer c to a peak of mass m and intensity i(m):
  – Relative mass error
  – Relative intensity error
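The two error terms can be sketched as follows. The formulas for the relative errors are not present in the text, so this sketch assumes the usual definition: the signed difference between the observed and the predicted value, divided by the predicted value. The function name and the numbers are hypothetical.

```python
def relative_error(observed: float, expected: float) -> float:
    """Signed relative error of an observed value against its expected value.

    Assumption: the error is normalized by the expected (compomer-derived)
    value; the source does not spell out the normalization.
    """
    if expected == 0:
        raise ValueError("expected value must be non-zero")
    return (observed - expected) / expected


# Hypothetical example: a compomer with predicted mass 1000.0 matched to a
# peak observed at 1000.1, and a predicted intensity of 2.0 observed as 1.5.
mass_error = relative_error(1000.1, 1000.0)
intensity_error = relative_error(1.5, 2.0)
print(mass_error, intensity_error)
```

Under the error model above, `mass_error` would be treated as a draw from a normal distribution with standard deviation σ and `intensity_error` as a draw with standard deviation σ’.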
Problem formulation
Given:
• Mass spectra MS
• Reference sequence r, including the positions of the PCR primers
• Maximum edit distance D
• Standard deviations σ and σ’, tolerance parameter τ
Find:
• Target sequence t flanked by PCR primers that
  a. is within edit distance D of r, and
  b. yields a matching of compomers of CS(t) to masses of MS with minimum total relative error
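Condition (a) can be checked with a standard dynamic program. The sketch below assumes that "edit distance" means the Levenshtein distance (unit-cost substitutions, insertions, and deletions); the function name is hypothetical.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance between a and b using a rolling DP row."""
    # prev[j] = distance between the processed prefix of a and b[:j]
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        cur = [i]  # distance from a[:i] to the empty string
        for j, cb in enumerate(b, start=1):
            cur.append(min(
                prev[j] + 1,               # delete ca
                cur[j - 1] + 1,            # insert cb
                prev[j - 1] + (ca != cb),  # match or substitute
            ))
        prev = cur
    return prev[-1]


def within_budget(reference: str, target: str, max_distance: int) -> bool:
    """True if target is within edit distance max_distance of reference."""
    return edit_distance(reference, target) <= max_distance


print(within_budget("ACGTACGT", "ACGAACGT", 1))
```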
Naïve Algorithm
• Exhaustive search
  – Generate all sequences within edit distance D of the reference, and
  – Compute the minimum total relative error for matching the compomers of each of these sequences to the masses in MS.
• The number of candidate sequences grows exponentially with D
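The candidate generation step of the exhaustive search can be sketched as a breadth-first expansion over single edits. This is an illustration only: the function names are hypothetical, the alphabet is assumed to be A, C, G, T, and the requirement that primers flank the target is ignored.

```python
from typing import Set

ALPHABET = "ACGT"


def neighbors(seq: str) -> Set[str]:
    """All sequences reachable from seq by exactly one edit operation."""
    out = set()
    for i in range(len(seq)):
        out.add(seq[:i] + seq[i + 1:])                # deletion
        for base in ALPHABET:
            if base != seq[i]:
                out.add(seq[:i] + base + seq[i + 1:])  # substitution
    for i in range(len(seq) + 1):
        for base in ALPHABET:
            out.add(seq[:i] + base + seq[i:])          # insertion
    return out


def within_distance(reference: str, max_distance: int) -> Set[str]:
    """All sequences within edit distance max_distance of reference."""
    seen = {reference}
    frontier = {reference}
    for _ in range(max_distance):
        frontier = {n for s in frontier for n in neighbors(s)} - seen
        seen |= frontier
    return seen


# Candidate counts for a short hypothetical reference as D grows.
for d in range(4):
    print(d, len(within_distance("ACGTAC", d)))
```

Even for this six-base reference the printed counts rise quickly with D, which is why the exhaustive search becomes impractical for references of hundreds of bases.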
3-Stage Algorithm
1. Identify regions of the reference sequence that are unambiguously supported by MS data
  – High probability of being present in the unknown target sequence
2. Branch-and-bound approach to fill in the remaining gaps
  – Generates a set of candidate sequences with compomers supported by MS data
3. Compute candidate sequences with minimum total relative error
  – Min-cost flow problem, currently solved as a linear program
  – With or without intensities
First stage: finding strongly supported regions of the reference
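• Chebyshev’s inequality: for a random variable X with mean μ and standard deviation σ, P(|X − μ| ≥ ε) ≤ σ² / ε²
• A detectable compomer c ϵ CSσ(s) is strongly matched to mass m ϵ MSσ(s) if the relative mass error of matching c to m is within ±ε,
where ε = σ / τ^0.5 is set based on a user-specified tolerance τ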
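A minimal sketch of the strong-match test, under two assumptions that the text does not state explicitly: the tolerance τ is the bound on the tail probability in Chebyshev's inequality, so that σ²/ε² = τ gives ε = σ/√τ, and a match counts as strong when its absolute relative mass error does not exceed ε. Function names are hypothetical.

```python
import math


def chebyshev_epsilon(sigma: float, tau: float) -> float:
    """Error bound eps such that sigma**2 / eps**2 == tau (assumed reading)."""
    if tau <= 0:
        raise ValueError("tau must be positive")
    return sigma / math.sqrt(tau)


def is_strong_match(relative_mass_error: float, sigma: float, tau: float) -> bool:
    """True if the signed relative mass error lies within +/- eps."""
    return abs(relative_mass_error) <= chebyshev_epsilon(sigma, tau)


# With sigma = 0.0001 and tau = 0.01 the bound is ten standard deviations.
print(chebyshev_epsilon(0.0001, 0.01))
print(is_strong_match(0.0005, 0.0001, 0.01), is_strong_match(0.002, 0.0001, 0.01))
```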
First stage: finding strongly supported regions of the reference
• A strong match between compomer c and mass m is unambiguous if:
  – c has multiplicity 1 in the reference
  – c can be strongly matched only to m
  – m can be strongly matched only to c
• The set M of unambiguous matches can be found efficiently by binary search
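One way the binary search could look is sketched below with Python's `bisect` module. It assumes a strong match means the peak mass lies within a relative window ±ε around the compomer's predicted mass; the compomer labels, data layout, and function names are hypothetical.

```python
from bisect import bisect_left, bisect_right
from typing import Dict, List, Tuple


def peaks_in_window(sorted_peaks: List[float], mass: float, eps: float) -> range:
    """Indices of peaks p with |p - mass| <= eps * mass, found by binary search."""
    lo = bisect_left(sorted_peaks, mass * (1 - eps))
    hi = bisect_right(sorted_peaks, mass * (1 + eps))
    return range(lo, hi)


def unambiguous_matches(
    compomer_mass: Dict[str, float],
    multiplicity: Dict[str, int],
    peak_masses: List[float],
    eps: float,
) -> List[Tuple[str, float]]:
    """Pairs (compomer, peak mass) that satisfy the three conditions above."""
    peaks = sorted(peak_masses)
    # Peaks each compomer can be strongly matched to.
    by_compomer = {c: list(peaks_in_window(peaks, m, eps))
                   for c, m in compomer_mass.items()}
    # Compomers each peak can be strongly matched to.
    by_peak: Dict[int, List[str]] = {}
    for c, idxs in by_compomer.items():
        for j in idxs:
            by_peak.setdefault(j, []).append(c)
    result = []
    for c, idxs in by_compomer.items():
        if (multiplicity.get(c, 1) == 1        # c occurs once in the reference
                and len(idxs) == 1             # c matches only one peak
                and len(by_peak[idxs[0]]) == 1):  # that peak matches only c
            result.append((c, peaks[idxs[0]]))
    return result


masses = {"A2C1": 1000.0, "G3": 1200.0, "C2T1": 1200.5, "A1G1": 1500.0}
counts = {"A2C1": 1, "G3": 1, "C2T1": 1, "A1G1": 2}
print(unambiguous_matches(masses, counts, [1000.2, 1200.3, 1500.1], 0.001))
```

In this made-up example only A2C1 survives: G3 and C2T1 compete for the same peak, and A1G1 has multiplicity 2.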
First stage: finding strongly supported regions of the reference
• (c1, m1), . . . , (cn, mn) = unambiguous matches for cut base σ, indexed in non-decreasing order of relative errors
• We iteratively apply Chebyshev’s inequality with tolerance τ to the running means of signed relative errors, which are normally distributed with mean 0 and standard deviation σ / i^0.5
• If Chebyshev’s inequality fails for index i, match (ci, mi) is removed from M
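The running-mean filter on this slide can be sketched as below. Several details are assumptions, since the text leaves them open: matches are ordered by absolute relative error, the bound for the mean of i errors is (σ/√i)/√τ, and a removed match is excluded from later running means. Names are hypothetical.

```python
import math
from typing import List, Tuple


def chebyshev_filter(
    matches: List[Tuple[str, float]], sigma: float, tau: float
) -> List[Tuple[str, float]]:
    """Keep matches whose running mean of signed errors passes the test.

    matches holds (label, signed relative error) pairs.
    """
    if tau <= 0:
        raise ValueError("tau must be positive")
    ordered = sorted(matches, key=lambda m: abs(m[1]))
    kept: List[Tuple[str, float]] = []
    total = 0.0  # sum of signed errors of the kept matches
    for label, err in ordered:
        n = len(kept) + 1
        running_mean = (total + err) / n
        # Std. dev. of a mean of n errors is sigma / sqrt(n); Chebyshev with
        # tail bound tau widens it by a factor 1 / sqrt(tau).
        bound = sigma / math.sqrt(n) / math.sqrt(tau)
        if abs(running_mean) <= bound:
            kept.append((label, err))
            total += err
        # Otherwise the match is dropped and does not affect later means.
    return kept


example = [("c", 5e-3), ("a", 1e-4), ("b", -2e-4)]
print(chebyshev_filter(example, sigma=1e-4, tau=0.01))
```

In the example the two small errors pass, while the third match pushes the running mean above its bound and is dropped.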
First stage: finding strongly supported regions of the reference
• A position in the reference sequence has strong support if
  – All detectable compomers overlapping it can be strongly matched, and
  – At least one of these matches is in M (unambiguous + not removed)
• Positions in PCR primers are automatically marked as having strong support
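The support rule can be written down directly once each detectable compomer is tied to the reference interval it covers. The sketch below assumes such a representation; the `Fragment` type, its fields, and the example data are hypothetical.

```python
from typing import Iterable, List, NamedTuple, Set


class Fragment(NamedTuple):
    start: int    # first reference position covered (inclusive)
    end: int      # last reference position covered (inclusive)
    strong: bool  # the fragment's compomer can be strongly matched
    in_m: bool    # its match is in M (unambiguous and not removed)


def supported_positions(
    length: int, fragments: List[Fragment], primer_positions: Iterable[int]
) -> Set[int]:
    """Reference positions with strong support under the rule above."""
    supported = set(primer_positions)  # primers are supported by definition
    for pos in range(length):
        covering = [f for f in fragments if f.start <= pos <= f.end]
        if (covering
                and all(f.strong for f in covering)
                and any(f.in_m for f in covering)):
            supported.add(pos)
    return supported


frags = [
    Fragment(0, 2, strong=True, in_m=True),
    Fragment(2, 4, strong=True, in_m=False),
    Fragment(4, 5, strong=False, in_m=False),
]
print(sorted(supported_positions(6, frags, primer_positions=[5])))
```

Position 3 is left unsupported because its only covering fragment is not in M, and position 4 because one covering fragment has no strong match; position 5 is supported only through the primer rule.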
Second stage: generating candidate targets by branch-and-bound
• Reference regions with strong support are assumed to be present in the target
• Gaps are filled one base at a time, in left-to-right order, using branch-and-bound
  – Choice order: reference base, substitutions, deletion, insertions
  – Chebyshev test with tolerance τ applied to running means of signed relative errors of closest matches
• Search is pruned when the test fails or more than D mutations are used
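The search over a single gap can be sketched as a recursive enumeration that follows the stated choice order and prunes on the mutation budget. The Chebyshev test on running means is replaced here by a caller-supplied predicate `is_supported`, which is a stand-in and not the test described above; all names are hypothetical.

```python
from typing import Callable, List

ALPHABET = "ACGT"


def fill_gap(
    ref_gap: str, max_mutations: int, is_supported: Callable[[str], bool]
) -> List[str]:
    """Candidate fillings of a gap, in the order the search discovers them."""
    results: List[str] = []
    seen = set()

    def extend(i: int, built: str, used: int) -> None:
        # Prune: too many mutations, or the partial sequence fails the test.
        if used > max_mutations or not is_supported(built):
            return
        if i == len(ref_gap):
            if built not in seen:  # different edit scripts can coincide
                seen.add(built)
                results.append(built)
        else:
            base = ref_gap[i]
            extend(i + 1, built + base, used)           # 1. reference base
            for b in ALPHABET:                          # 2. substitutions
                if b != base:
                    extend(i + 1, built + b, used + 1)
            extend(i + 1, built, used + 1)              # 3. deletion
        for b in ALPHABET:                              # 4. insertions
            extend(i, built + b, used + 1)

    extend(0, "", 0)
    return results


# Stand-in predicate: reject any partial sequence containing "AA".
print(fill_gap("ACG", 1, lambda prefix: "AA" not in prefix))
```

Because the reference base is always tried first, the unmodified gap is the first candidate reported whenever it passes the predicate.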
Third stage: scoring candidates by linear programming
• Objective:
  – Minimize total relative error
• Variables:
  – For each c ϵ CSσ and m ϵ MSσ, x_{c,m} is set to 1 if c is matched to m, 0 otherwise (integrality follows from total unimodularity)
• Constraints:
  – No missing peaks: each detectable compomer c ϵ CSσ(t) must be matched to one mass in MSσ
  – No extraneous peaks: each mass m ϵ MSσ must be matched to at least one detectable compomer c ϵ CSσ(t)
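To make the objective and the two constraints concrete, the sketch below solves a tiny instance by brute-force enumeration. This is a stand-in for the linear program, usable only on toy inputs, and it assumes the total relative error is the sum of absolute relative mass errors. Names and numbers are hypothetical.

```python
from itertools import product
from typing import List, Optional, Tuple


def best_matching(
    compomer_masses: List[float], peak_masses: List[float]
) -> Optional[Tuple[float, Tuple[int, ...]]]:
    """Minimum-cost assignment of every compomer to exactly one peak such
    that every peak receives at least one compomer.

    Returns (cost, assignment), where assignment[k] is the peak index chosen
    for compomer k, or None if no feasible matching exists.
    """
    all_peaks = set(range(len(peak_masses)))
    best = None
    for assignment in product(range(len(peak_masses)), repeat=len(compomer_masses)):
        if set(assignment) != all_peaks:  # "no extraneous peaks" violated
            continue
        cost = sum(abs(peak_masses[j] - c) / c
                   for c, j in zip(compomer_masses, assignment))
        if best is None or cost < best[0]:
            best = (cost, assignment)
    return best


# Two compomers of nearly equal mass share the first peak.
print(best_matching([1000.0, 1000.5, 1500.0], [1000.2, 1500.3]))
```

The enumeration visits every assignment of compomers to peaks, so its cost grows exponentially with the number of compomers; the LP formulation avoids this.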
LP w/o intensities
LP with intensities
Simulation setup
• Reference length: 100-500 bp
• Reference sequences/targets
  – D=1: 10 random references, all sequences at edit distance 1 used as targets
  – D=2,3: 100 random reference-target pairs
• Error-free MS data: σ = σ’ = 0
• Noisy MS data: σ = 0.0001, σ’ = 0-1
• Tolerance parameter: τ = 0.01
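Noisy data of this kind could be simulated as sketched below, assuming each observed value equals the true value multiplied by (1 + e) with e drawn from a normal distribution with mean 0 and the given standard deviation. The function name and the example masses are hypothetical; the slides do not describe the simulator itself.

```python
import random
from typing import List


def add_relative_noise(values: List[float], sigma: float, rng: random.Random) -> List[float]:
    """Perturb each value by a normally distributed signed relative error."""
    return [v * (1.0 + rng.gauss(0.0, sigma)) for v in values]


rng = random.Random(42)  # fixed seed so the run is repeatable
true_masses = [1000.0, 1250.5, 2100.25]
error_free = add_relative_noise(true_masses, 0.0, rng)   # sigma = 0
noisy = add_relative_noise(true_masses, 0.0001, rng)     # sigma = 0.0001
print(error_free)
print(noisy)
```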
Precision and Recall
Outcomes when comparing the actual target with the predicted target(s):
• tp (true positive): prediction is unique & correct
• fp (false positive): prediction is unique & incorrect
• fn (false negative): prediction is not unique
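From these counts, precision, recall, and the F-measure follow from the standard definitions, with the F-measure taken as the harmonic mean of precision and recall (assumed here, since the slides do not define it). The function name and the counts in the example are hypothetical.

```python
from typing import Tuple


def precision_recall_f(tp: int, fp: int, fn: int) -> Tuple[float, float, float]:
    """Precision tp/(tp+fp), recall tp/(tp+fn), and their harmonic mean."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    total = precision + recall
    f_measure = 2 * precision * recall / total if total else 0.0
    return precision, recall, f_measure


# Hypothetical run: 90 unique correct, 4 unique incorrect, 6 non-unique.
print(precision_recall_f(tp=90, fp=4, fn=6))
```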
Branch-and-bound vs. Naïve (F-measure for D=1, error free data, w/o intensities)
[Figure: F-measure (65%-100%) vs. reference length (100-500). Series: 1 substitution, 1 deletion, and 1 insertion, each for Branch-and-Bound and Naïve.]
Branch-and-bound speed-up (D=1, error free data, w/o intensities)
Length                 100     150     200     250     300     350     400     450     500
Naïve                18.66   34.95   49.65   65.72   82.60  100.25  120.19  139.90  161.71
Branch-and-Bound      0.06    0.08    0.12    0.16    0.19    0.25    0.33    0.50    0.52
Speed-up              307X    429X    418X    418X    430X    404X    368X    278X    314X
Results on noisy data (F-measure, D=1, σ = 0.0001, w/o intensities)
[Figure: F-measure (70%-100%) vs. reference length (100-500). Series: 1 substitution, 1 deletion, and 1 insertion, each with σ=0, τ=0 and with σ=0.0001, τ=0.01.]
Effect of the number of mutations (F-measure, σ = 0.0001, w/o intensities)
[Figure: F-measure (20%-100%) vs. reference length (100-500). Series: 1, 2, and 3 substitutions; 1, 2, and 3 deletions; 1, 2, and 3 insertions; all with σ=0.0001, τ=0.01.]
Do intensities help? (F-measure, σ = 0.0001, 1 substitution)
[Figure: F-measure (84%-98%) vs. reference length (100-500). Series: σ'=0, σ'=0.15, σ'=0.25, σ'=0.35, σ'=0.5, σ'=1, and without intensities.]
Do intensities help? (F-measure, σ = 0.0001)
[Figure: F-measure (50%-100%) vs. reference length (100-500). Series: 1, 2, and 3 substitutions, each with σ'=0.35 and w/o intensities.]
Ongoing Work
• Experiments on EPLD clone data
  – Branch-and-bound relaxation + penalty in LP objective to handle missing/extraneous peaks
  – Intensity data normalization: correct for mass and base composition effects