23
Reference Assisted Nucleic Acid Sequence Reconstruction from Mass Spectrometry Data Gabriel Ilie 1 , Alex Zelikovsky 2 and Ion Măndoiu 1 1 CSE Department, University of Connecticut 2 CS Department, Georgia State University

Reference Assisted Nucleic Acid Sequence Reconstruction from Mass Spectrometry Data

Embed Size (px)

DESCRIPTION

Reference Assisted Nucleic Acid Sequence Reconstruction from Mass Spectrometry Data. Gabriel Ilie 1 , Alex Zelikovsky 2 and Ion M ă ndoiu 1 1 CSE Department, University of Connecticut 2 CS Department, Georgia State University. MassCLEAVE assay for MS-based nucleic acid sequence analysis. - PowerPoint PPT Presentation

Citation preview

Page 1: Reference Assisted Nucleic Acid Sequence Reconstruction from Mass Spectrometry Data

Reference Assisted Nucleic Acid Sequence Reconstruction from

Mass Spectrometry Data

Gabriel Ilie1, Alex Zelikovsky2 and Ion Măndoiu1

1CSE Department, University of Connecticut2CS Department, Georgia State University

Page 2: Reference Assisted Nucleic Acid Sequence Reconstruction from Mass Spectrometry Data

MassCLEAVE assay for MS-based nucleic acid sequence analysis

Page 3: Reference Assisted Nucleic Acid Sequence Reconstruction from Mass Spectrometry Data

• Signed relative errors assumed to follow a normal distribution with mean 0, standard deviation σ for masses and σ’ for intensities

• Two types of error incurred when matching compomer c to peak of mass m and intensity i(m):– Relative mass error

– Relative intensity error:

Error model

Page 4: Reference Assisted Nucleic Acid Sequence Reconstruction from Mass Spectrometry Data

Problem formulation

Given:• Mass spectra MS• Reference sequence r including position of PCR primers• Maximum edit distance D• Standard deviations σ and σ’, tolerance parameter Find: • Target sequence t flanked by PCR primers that

a. is within edit distance D of r, and b. yields a matching of compomers of CS(t) to masses of MS with

minimum total relative error

Page 5: Reference Assisted Nucleic Acid Sequence Reconstruction from Mass Spectrometry Data

Naïve Algorithm

• Exhaustive search – Generate all sequences within an edit distance of

D of the reference, and – Compute the minimum total relative error for

matching the compomers of each of these sequences to the masses in MS.

• The number of candidate sequences grows exponentially with D

Page 6: Reference Assisted Nucleic Acid Sequence Reconstruction from Mass Spectrometry Data

3-Stage Algorithm

1. Identify regions of the reference sequence that are unambiguously supported by MS data– High probability to be present in the unknown target

sequence

2. Branch-and-bound approach to fill in remaining gaps– Generates set of candidate sequences with compomers

supported by MS data

3. Compute candidate sequences with minimum total relative error – Min-cost flow problem currently solved as linear program– With or without intensities

Page 7: Reference Assisted Nucleic Acid Sequence Reconstruction from Mass Spectrometry Data

First stage: finding strongly supported regions of the reference

• Chebyshev’s inequality:

• A detectable compomer c ϵ CSσ(s) is strongly matched to mass m ϵ MSσ(s) if:

where ε = σ / 0.5 is set based on a user specified tolerance

Page 8: Reference Assisted Nucleic Acid Sequence Reconstruction from Mass Spectrometry Data

First stage: finding strongly supported regions of the reference

• A strong match between compomer c and mass m is unambiguous if:– c has multiplicity of 1 in reference– c can be strongly matched only to m– m can be strongly matched only to c

• The set M of unambiguous matches can be found efficiently by binary search

Page 9: Reference Assisted Nucleic Acid Sequence Reconstruction from Mass Spectrometry Data

First stage: finding strongly supported regions of the reference

which are normally distributed with mean 0 and standard deviation σ /i0.5

• If Chebyshev’s inequality fails for index i, match(ci, mi) is removed from M

• (c1, m1), . . . , (cn, mn) = unambiguous matches for cut base σ, indexed in non-decreasing order of relative errors

• We iteratively apply Chebyshev’s inequality with tolerance to the running means of signed relative errors,

Page 10: Reference Assisted Nucleic Acid Sequence Reconstruction from Mass Spectrometry Data

First stage: finding strongly supported regions of the reference

• A position in the reference sequence has strong support if – All detectable compomers overlapping it can be

strongly matched, and – At least one of these matches is in M

(unambiguous + not removed)• Positions in PCR primers automatically marked

as having strong support

Page 11: Reference Assisted Nucleic Acid Sequence Reconstruction from Mass Spectrometry Data

Second stage: generating candidate targets by branch-and-bound

• Reference regions with strong support assumed to be present in target

• Gaps filled one base at a time, in left-to-right order, using branch-and-bound– Choice order: reference base, substitutions,

deletion, insertions– Chebyshev test with tolerance applied to running

means of signed relative errors of closest matches• Search pruned when test fails or more than D mutations

Page 12: Reference Assisted Nucleic Acid Sequence Reconstruction from Mass Spectrometry Data

Third stage: scoring candidates by linear programming

• Objective:– Minimize total relative error

• Variables:– For each c ϵ CSσ and m ϵ MSσ, xc,m is set to 1 if c is matched to

m, 0 otherwise (integrality follows from total unimodularity)• Constraints:

– No missing peaks: each detectable compomer c ϵ CSσ(t) must be matched to one mass in MSσ

– No extraneous peaks: each mass m ϵ MSσ must be matched to at least one detectable compomer c ϵ CSσ(t)

Page 13: Reference Assisted Nucleic Acid Sequence Reconstruction from Mass Spectrometry Data

LP w/o intensities

Page 14: Reference Assisted Nucleic Acid Sequence Reconstruction from Mass Spectrometry Data

LP with intensities

Page 15: Reference Assisted Nucleic Acid Sequence Reconstruction from Mass Spectrometry Data

Simulation setup

• Reference length: 100-500 bp• Reference sequences/targets– D=1: 10 random references, all sequences at edit

distance 1 used as targets – D=2,3: 100 random reference-target pairs

• Error free MS data: σ = σ’ = 0• Noisy MS data: σ = 0.0001, σ’ =0-1• Tolerance parameter: = 0.01

Page 16: Reference Assisted Nucleic Acid Sequence Reconstruction from Mass Spectrometry Data

Precision and Recall

actual target

predicted target(s)

tp(true positive)Prediction is

unique & correct

fp(false positive)

Prediction is unique & incorrect

fn(false negative)

Prediction is not unique

Page 17: Reference Assisted Nucleic Acid Sequence Reconstruction from Mass Spectrometry Data

Branch-and-bound vs. Naïve(F-measure for D=1, error free data, w/o intensities)

100 150 200 250 300 350 400 450 50065%

70%

75%

80%

85%

90%

95%

100%

1 substitution Branch-and-Bound1 deletion Branch-and-Bound1 substitution Naïve1 deletion Naïve1 insertion Branch-and-Bound1 insertion Naïve

Page 18: Reference Assisted Nucleic Acid Sequence Reconstruction from Mass Spectrometry Data

Branch-and-bound speed-up(D=1, error free data, w/o intensities)

Length 100 150 200 250 300 350 400 450 500

Naïve 18.66 34.95 49.65 65.72 82.60 100.25 120.19 139.90 161.71

Branch-and-Bound 0.06 0.08 0.12 0.16 0.19 0.25 0.33 0.50 0.52

Speed-up 307X 429X 418X 418X 430X 404X 368X 278X 314X

Page 19: Reference Assisted Nucleic Acid Sequence Reconstruction from Mass Spectrometry Data

Results on noisy data (F-measure, D=1, σ = 0.0001, w/o intensities)

100 150 200 250 300 350 400 450 50070%

75%

80%

85%

90%

95%

100%

1 substitution, σ=0, τ=01 substitution, σ=0.0001, τ=0.011 deletion, σ=0, τ=01 deletion, σ=0.0001, τ=0.011 insertion, σ=0, τ=01 insertion, σ=0.0001, τ=0.01

Page 20: Reference Assisted Nucleic Acid Sequence Reconstruction from Mass Spectrometry Data

Effect of the number of mutations (F-measure, σ = 0.0001, w/o intensities)

100 150 200 250 300 350 400 450 50020%

30%

40%

50%

60%

70%

80%

90%

100%

1 substitution, σ=0.0001, τ=0.011 deletion, σ=0.0001, τ=0.012 substitutions, σ=0.0001, τ=0.011 insertion, σ=0.0001, τ=0.012 deletions, σ=0.0001, τ=0.013 substitutions, σ=0.0001, τ=0.013 deletions, σ=0.0001, τ=0.012 insertions, σ=0.0001, τ=0.013 insertions, σ=0.0001, τ=0.01

Page 21: Reference Assisted Nucleic Acid Sequence Reconstruction from Mass Spectrometry Data

Do intensities help?(F-measure, σ = 0.0001, 1 substitution)

100 150 200 250 300 350 400 450 50084%

86%

88%

90%

92%

94%

96%

98%

σ'=0σ'=0.15σ'=0.25σ'=0.35σ'=0.5σ'=1without intensities

Page 22: Reference Assisted Nucleic Acid Sequence Reconstruction from Mass Spectrometry Data

Do intensities help?(F-measure, σ = 0.0001)

100 150 200 250 300 350 400 450 50050%

55%

60%

65%

70%

75%

80%

85%

90%

95%

100%

1 substitution σ'=0.352 substitutions σ'=0.351 substitution w/o intensities3 substitutions σ'=0.352 substitutions w/o intensities3 substitutions w/o intensities

Page 23: Reference Assisted Nucleic Acid Sequence Reconstruction from Mass Spectrometry Data

Ongoing Work

• Experiments on EPLD clone data– Branch-and-bound relaxation + penalty in LP

objective to handle missing/extraneous peaks– Intensity data normalization: correct for mass and

base composition effects