
    General comments, TODOs, etc.

    - Think up a catchy title.
    - Decide which references to anonymize to maintain double-blindness.
    - Write the abstract.
    - Fill in remaining [CITE], which are cases where it is unclear to me what paper should be cited. The NIPS style guide allows for only one page of citations; the font size has been reduced as much as permissible.
    - Given the space constraints, I'm inclined to think that we should not include a future work section in the paper.


  • 054055056057058059060061062063064065066067068069070071072073074075076077078079080081082083084085086087088089090091092093094095096097098099100101102103104105106107

    Identifying Tandem Mass Spectra using Dynamic Bayesian Networks

    Anonymous Author(s)
    Affiliation
    Address
    email

    Abstract

    Lorem ipsum dolor sit amet, consectetur adipisicing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum.

    1 Introduction

    Tandem mass spectrometry, a.k.a. shotgun proteomics, is an increasingly accurate and efficient technology for identifying and quantifying proteins in a complex biological sample, such as a drop of blood. This technology has been used to identify biomarkers associated with disease [CITE] and to quantify changes in protein expression across different cell types [CITE]. Most applications of tandem mass spectrometry require the ability to accurately map a fragmentation spectrum generated by the device to a peptide, the protein subsequence that generated the spectrum. The task of mapping spectra to peptides is known as spectrum identification, a pattern recognition task akin to speech recognition. In speech recognition, the input is an utterance, which must be mapped to a sentence in natural language, an enormous structured class of labels. A spectrum is akin to an acoustic utterance; a peptide is akin to a sentence, a sequence of amino acids instead of words. Unlike speech recognition, (i) accurate labelled data, i.e., ground truth peptide-spectrum matches, cannot be acquired; (ii) the scoring function for peptide-spectrum matches has traditionally been a non-probabilistic function, while probabilistic approaches have become dominant in speech; and (iii) the optimization used to identify the best peptide match requires enumerating and scoring all candidate peptides against a spectrum.

    In this work, we introduce a dynamic Bayesian network (DBN) that generalizes one of the most popular scoring functions for peptide identification (Section 4). Our probabilistic formulation, which we call Didea, provides new insight into a technique that has been used in computational biology for over 17 years. Didea provides a new function for scoring peptide-spectrum matches that significantly outperforms existing scoring functions, including those used in expensive commercial tools for peptide identification. We further show that additional qualitative knowledge about peptide fragmentation can be easily incorporated into the model, leading to further improvements in identification accuracy.

    A fundamental computational constraint in current approaches to spectrum identification is the dependence on peptide database search. The best peptide match is found by exhaustively scoring a large list of candidate peptides against the spectrum. In speech recognition, database search would be analogous to decoding an utterance by scoring every common sentence in the English language against the utterance and picking the highest scoring match. In Section 5, we extend Didea with lattices, a compressed representation of sequences, common in speech and language processing [CITE]. Lattices find novel use here: allowing us to replace an exhaustive enumeration with dynamic programming over peptide sequences.


    Figure 1: (A) Schematic of a typical shotgun proteomics experiment. The three steps—(1) cleaving proteins into peptides, (2) separation of peptides using liquid chromatography, and (3) tandem mass spectrometry analysis—are described in the text. (B) A sample fragmentation spectrum, along with the peptide (PTPVSHNDDLYG) responsible for generating the spectrum. Peaks corresponding to prefixes and suffixes of the peptide are colored red and blue, respectively. By convention, prefixes are referred to as b-ions and suffixes as y-ions.

    Jeff: [Note, while lattices are common in the speech world, outside of speech they might be confusible with, say, Birkhoff lattices. We might want to add a bit of text in the above saying that lattices, in this context, are a linear sized representation of an exponential number of sequences, and can be seen as a sequential analogue of, say, binary decision diagrams. BTW, also, one option for the extended version is to, say, define a macro that is a comment for the main version but includes the text for the extended version, so then we can have one .tex file for both submission and supplement.]

    Jeff: [One other comment here. I think this reads well, but we then immediately go on to describe shotgun proteomics. Perhaps in the intro offer up a few more details of the model and what enables it to achieve such good performance. The reason is that, otherwise, people might be left wondering.]

    2 Tandem Mass Spectrometry

    Experimental framework A typical shotgun proteomics experiment proceeds in three steps, as illustrated in Figure 1. The input to the experiment is a collection of proteins, which have been isolated from a complex mixture. Each protein can be represented as a string of amino acids, where the alphabet is size 20 and the proteins range in length from 50–1500 amino acids. A typical complex mixture may contain a few thousand proteins, ranging in abundance from tens to hundreds of thousands of copies.

    In the first experimental step, the proteins are digested into shorter sequences (peptides) using a molecular agent called trypsin. To a first approximation, trypsin cleaves each protein deterministically at all occurrences of “K” or “R” unless they are followed by a “P”. This digestion is necessary because whole proteins are too massive to be subject to direct mass spectrometry analysis without using very expensive equipment. Second, the peptides are subjected to a process called liquid chromatography, in which the peptides pass through a thin glass column that separates the peptides based on a particular chemical property (e.g., hydrophobicity). This separation step reduces the complexity of the mixtures of peptides going into the mass spectrometer. The third step, which occurs inside the mass spectrometer, involves two rounds of mass spectrometry. Approximately every second, the device analyzes the population of approximately 20,000 intact peptides that most recently exited from the liquid chromatography column. Then, based on this initial analysis, the machine selects five distinct peptide species for fragmentation. Each of these fragmented species is subjected to a second round


    of mass spectrometry analysis. The resulting “fragmentation spectra” are the primary output of the experiment.

    A sample fragmentation spectrum is shown in Figure 1B. During the fragmentation process, each amino acid sequence is typically cleaved once, so cleavage of the population results in a variety of observed prefix and suffix sequences. Each of these subpeptides is characterized by its mass-to-charge ratio (m/z, shown on the horizontal axis) and a corresponding intensity (unitless, but roughly proportional to abundance, shown on the vertical axis). The input to the spectrum identification problem is one such fragmentation spectrum, along with the observed (approximate) mass of the intact peptide. The goal is to identify the peptide sequence that was responsible for generating the spectrum.

    Solving the spectrum identification problem In practice, the spectrum identification problem can be solved in two different ways: either de novo, in which the universe of all possible peptides is considered as candidate solutions, or by restricting the search space to a given peptide database. Because high throughput DNA sequencing can provide a very good estimate of the set of possible peptide sequences for most commonly studied organisms, and because database search typically provides more accurate results than de novo approaches, we focus on the database search version of the problem in this paper.

    The first computer program to use a database search procedure to identify fragmentation spectra was SEQUEST [7], and SEQUEST’s basic algorithm is still used by essentially all database search tools available today. John: [cite: Sadygov2004] Bill: [Do we really need a cite for the previous sentence? If so, then we have to use something more recent than 2004. I vote to delete this cite.] The approach is as follows. We are given a spectrum S, a peptide database P, a precursor mass m (i.e., the measured mass of the intact peptide), and a precursor mass tolerance δ. The algorithm extracts from the database all peptides whose mass lies within the range [m − δ, m + δ]. These comprise the set of candidate peptides C(m, P, δ) = {p : p ∈ P; |m(p) − m| < δ}, where m(p) is the calculated mass of peptide p. In practice, depending on the size of the peptide database and the precursor mass tolerance, the number of candidate peptides ranges from hundreds to hundreds of thousands. Each candidate peptide p is used to generate a theoretical spectrum s(p), and the theoretical spectrum is compared to the observed spectrum using a score function K(·, ·). The program reports the candidate peptide whose theoretical spectrum scores most highly with respect to the observed spectrum: arg max_{p∈C(m,P,δ)} K(S, s(p)).
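    A minimal sketch of this database-search loop, in Python; the score function, the peptide mass function, and the data structures are placeholders standing in for whatever score function K and mass calculation a particular tool uses, not the paper's implementation.

```python
from typing import Callable, List, Tuple

Spectrum = List[Tuple[float, float]]  # (m/z, intensity) pairs


def candidate_peptides(peptides: List[str], mass: Callable[[str], float],
                       precursor_mass: float, tolerance: float) -> List[str]:
    """C(m, P, delta): peptides whose calculated mass falls inside the precursor window."""
    return [p for p in peptides if abs(mass(p) - precursor_mass) < tolerance]


def database_search(spectrum: Spectrum, peptides: List[str],
                    mass: Callable[[str], float],
                    score: Callable[[Spectrum, str], float],
                    precursor_mass: float, tolerance: float = 3.0) -> Tuple[str, float]:
    """Return the candidate peptide maximizing K(S, s(p)) over C(m, P, delta)."""
    best_peptide, best_score = None, float("-inf")
    for p in candidate_peptides(peptides, mass, precursor_mass, tolerance):
        s = score(spectrum, p)  # K(S, s(p)); generation of s(p) is folded into `score` here
        if s > best_score:
            best_peptide, best_score = p, s
    return best_peptide, best_score
```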

    In this work, we compare the performance of Didea to two widely used search programs, SEQUEST and Mascot [11], as well as to a less commonly used but methodologically related method, PepHMM [15]. These three methods differ primarily in their choice of score function K(·, ·). Describing the details of SEQUEST’s score function, XCorr, is beyond the scope of this paper, but the basic idea is to compute a scalar product of the observed and theoretical spectrum and then subtract out an “average” scalar product term that is produced by shifting the observed and theoretical spectrum relative to one another:

    $$\mathrm{XCorr}(S, s(p)) = \langle S, s(p) \rangle - \frac{1}{150} \sum_{\tau=-75}^{75} \sum_{i=1}^{N} S_i \, s(p)_{i-\tau}.$$

    Mascot is a commercial product that uses a probabilistic scoring function to rank candidate peptides, the details of which have not been published. PepHMM first generates a theoretical spectrum, akin to SEQUEST’s. The probability that the peaks in the theoretical spectrum occurred in the observed spectrum is then calculated using a hidden Markov model (HMM), and the candidate peptide is assigned a score based on the confidence of this probability, which is measured using an estimated normal distribution over the peptide masses within ±δ of the precursor mass.

    The spectrum identification problem is difficult to solve primarily because of noise in the observed spectrum. In general, the x-axis of the observed spectrum is known with relatively high precision. However, in any given spectrum, many expected fragment ions will fail to be observed, and the spectrum is also likely to contain a variety of additional, unexplained peaks. These unexplained peaks may result from unusual fragmentation events, in which small molecular groups are shed from the peptide during fragmentation, or from contaminating molecules (peptides or other small molecules) that are present in the mass spectrometer along with the target peptide species.
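    A sketch of the XCorr computation as written above, assuming both spectra have already been binned onto a common m/z grid; the 1/150 normalization and the ±75-bin shift range follow the formula in the text, and whether the τ = 0 term belongs in the background average varies across implementations.

```python
import numpy as np


def xcorr(observed: np.ndarray, theoretical: np.ndarray, max_shift: int = 75) -> float:
    """XCorr(S, s(p)) = <S, s(p)> minus the average shifted scalar product (shifts of -75..75 bins)."""
    foreground = float(np.dot(observed, theoretical))
    background = 0.0
    for tau in range(-max_shift, max_shift + 1):
        shifted = np.roll(theoretical, tau)   # shifted[i] = theoretical[i - tau]
        if tau > 0:                           # zero the bins that wrapped around instead of shifting
            shifted[:tau] = 0.0
        elif tau < 0:
            shifted[tau:] = 0.0
        background += float(np.dot(observed, shifted))
    return foreground - background / (2 * max_shift)   # the 1/150 factor from the formula above
```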

    3 Evaluation Metrics


    Jeff: [Do we want this here? Most ML papers put the evaluation methodology just before the results section, and the first thing that is done is the intro, motivation, background literature and alternative approaches, and then the new approach (i.e., the new probabilistic method). Then the results (and methodology) go at the end. Putting the evaluation section here might surprise the reviewer.]

    Labelled data for spectrum identification would consist of a set of ground truth peptide-spectrum matches: spectra where the mapping to a peptide is known. Unfortunately, accurate labelled data does not exist in this domain, which complicates evaluation. To estimate the probability that a spectrum identification is false, we therefore make use of the standard target-decoy approach [6, 9]. For each spectrum, two searches are performed: one to find the best peptide in the target database C(m, P, δ). Then, a second search is performed to find the best peptide in a decoy database C(m, P̃, δ): a set of plausible peptides that is extremely unlikely to contain the correct peptide. In our experiments, the target P and decoy P̃ databases are the same size, with decoys being generated by randomly permuting peptides in the target database, under the requirement that P ∩ P̃ = ∅.

    A single tandem mass spectrometry experiment generates m = O(10^5) spectra. We expect a certain fraction of the identifications to be spurious, and so only the top-k scoring identifications are retained as quality matches; the rest are ignored. False discovery rate (FDR) [14] (essentially one minus precision) provides a rule for determining what k should be, given a bound on the expected fraction of spurious identifications among the top k. To make use of FDR, we first pose the question of whether or not to accept a single spectrum identification as a hypothesis test.

    Consider a single spectrum s, searched against the target database C(m(s), P, δ). Denote the peptide scoring function θ : s × p → Θ ⊆ R. When only one spectrum is under consideration, the dependence of θ on s is not shown. Now, θ(p) is itself a random variable. To sample from the distribution of θ(p), we score each peptide in the target database: θ(C) = {θ(p) : p ∈ C(m(s), P, δ)}. Choosing the highest scoring peptide as the proposed match corresponds to the test statistic T(θ(C)) = max(θ(p) : p ∈ C(m(s), P, δ)). Colloquially, the hypothesis test can be expressed in terms of the test statistic. The null hypothesis, H0, is that a peptide matches the spectrum by chance; the alternate hypothesis, H1, is that the peptide generated the spectrum. Formally, the hypothesis test is

    H0 : θ(p) ≤ θ0,    H1 : θ(p) > θ0,

    where θ0 is a user-determined threshold on the score which determines the stringency of the test. As a decision rule, the null hypothesis is rejected if the test statistic T(θ(C)) exceeds the critical value c. Equivalently, the highest scoring peptide match for a spectrum is deemed correct if its score is greater than c.

    A single tandem MS experiment leads to m hypotheses. Let V(c) be the number of hypotheses where H0 is incorrectly rejected at critical value c; let R(c) be the total number of hypotheses where H0 is rejected. For sufficiently large m, we estimate FDR using

    F̂DR(c) = E[V(c)] / E[R(c)].

    An estimate of E[V(c)] is the number of spectra where the best decoy match has a score higher than c; an estimate of E[R(c)] is the value of R(c) itself, the number of spectra where the best target match has a score higher than c. The decoy database is only used to estimate the error rate. The above estimate of FDR has an intuitive interpretation: it is one minus precision. Since F̂DR(c) is not necessarily monotone in c, we instead report the estimated q-value [12]:

    $$\hat{q}(c) = \min_{t \le c} \widehat{\mathrm{FDR}}(t).$$

    At a score threshold c, we have q-value q(c) ∈ [0, 1], which is the expected fraction of spurious identifications among those whose score is at least c.
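    A minimal sketch of this target-decoy estimate in Python, assuming one best target score and one best decoy score per spectrum and equal-size target and decoy databases as above; the running minimum implements the q-value conversion, and counting the spectra with q-value below a chosen cutoff gives one point on the absolute ranking curves described next.

```python
import numpy as np


def target_decoy_qvalues(target_scores, decoy_scores):
    """Estimate a q-value for each spectrum's best target match.

    target_scores[i] / decoy_scores[i]: score of the best target / best decoy match for spectrum i.
    Returns an array of q-values aligned with the input spectra.
    """
    target_scores = np.asarray(target_scores, dtype=float)
    decoy_scores = np.sort(np.asarray(decoy_scores, dtype=float))
    order = np.argsort(-target_scores)          # thresholds at each target score, most stringent first
    thresholds = target_scores[order]
    # E[V(c)]: best decoy matches scoring above c;  E[R(c)]: accepted target matches at threshold c
    v = len(decoy_scores) - np.searchsorted(decoy_scores, thresholds, side="right")
    r = np.arange(1, len(thresholds) + 1)       # the match sitting at the threshold counts as accepted
    fdr = v / r
    # q(c) = min over t <= c of FDR(t): cumulative minimum from the least stringent threshold upward
    qvals = np.minimum.accumulate(fdr[::-1])[::-1]
    q_by_spectrum = np.empty_like(qvals)
    q_by_spectrum[order] = qvals
    return q_by_spectrum
```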

    Jeff: [I think most of the equations above are not long and should be inlined, to save space.]

    The tradeoff between the number of identifications that are accepted and the stringency of the acceptance criterion is represented as an absolute ranking curve. Each point on the x-axis is a q-value in [0, 1]; the corresponding value on the y-axis is the number of top-scoring spectra whose identification is accepted at that q-value. At q = 1, all m identifications are accepted; at q = 0, no identifications are accepted. In real-world usage, the concern is with maximizing performance at


    small q-values, so we plot only q ∈ [0, 0.1]. One method dominates another if its absolute ranking curve is strictly above the absolute ranking curve for the other method.

    Ajit: [Include a pointer to an absolute ranking plot.]

    Ajit: [It’s natural for a machine learning audience to want to view hypothesis testing as a 0/1 classification problem: i.e., assign label 1 to the identifications we want to accept. If we switch from FDR to the positive false discovery rate [13], we can draw a connection to Bayes error rates on such a classification problem. However, using the Bayes error rate would require the user to control the stringency of the test by setting a parameter that corresponds to the relative importance of a false negative to a false positive, and that is harder to understand than a q-value.]

    4 Scoring identifications as inference in a Dynamic Bayesian Network

    In this section, we show that Equation ?? can be generalized Ajit: [this is not strictly what is going on, as we’re not generalizing it; rather we are inspired by it to create a proper probabilistic model] as inference in a DBN (Figure 2). The DBN is based on the mobile proton hypothesis of peptide fragmentation [4], which we describe mathematically below. We provide empirical evidence that our probabilistic scoring function is significantly better than the scoring functions used in commercially developed packages.

    Peptide Fragmentation We start at the second phase in tandem mass spectrometry: the protein sequence has been digested, and a peptide has been isolated in the first mass spectrometry step. A peptide is represented as a string p = a_1 a_2 . . . a_n, since our only concern is in decoding a peptide’s sequence. Each letter a_t is drawn from an alphabet of 20 standard amino acids, whose masses are known. The mass function m(·) refers both to the mass of a residue, m(a_t), and to the mass of a sequence of residues, $m(p) = \sum_{t=1}^{n} m(a_t)$.

    Peptides are ionized in the second phase of mass spectrometry, so each peptide has a positive charge due to carrying one, two, or three extra protons: c(p) ∈ {1, 2, 3}. Peptides predominantly fragment into a prefix and suffix: b = a_1 . . . a_t, y = a_{t+1} . . . a_n. The extra protons are divided between the prefix and suffix: c(b) + c(y) = c(p). If either b or y has zero charge, it cannot be detected, and its corresponding peak will not show up in the spectrum. Charge distributions are not equally probable: e.g., when c(p) = 2, fragment ions of charge 2 are exceedingly rare. When the peptide fragments at position t, the prefix fragment ion is referred to as the b_t-ion; the suffix fragment ion, the y_t-ion. The set {b_t}_t is referred to as the b-ion series, with the y-ion series defined analogously.

    Each peak in an idealized spectrum corresponds to a fragment ion in the b-ion or y-ion series: the position of the peak for a fragment ion b is a deterministic function of m(b) and c(b), and likewise for y. A fragmentation spectrum measures how often particular peaks with a specific mass-to-charge (m/z) ratio are detected, so there is no sequence information in a peak.

    A spectrum s is a collection of peaks, intensities at given m/z positions: s = {(x_j, h_j)}, where x_j is a point on the m/z axis (x-axis), and h_j is the corresponding intensity (see Figure 1B). In practice, there is substantial discrepancy between an idealized spectrum and a real one due to measurement noise, secondary fragmentation of the b or y ions, non-protein contaminants, or other imperfections in the isolation of the peptide. Even barring noise in the spectra, there is substantial variation across spectra which must be controlled. There can be order-of-magnitude differences in both total intensity $\sum_j h_j$ and maximum intensity max{h_j} across spectra. To control for intensity variation, we rank-normalize each spectrum: peaks are sorted in order of increasing intensity, and the ith peak is assigned intensity i/|s|, so max{h_j} = 1.0.

    From the settings used to collect the spectra, we know that x_j ∈ [0, 2000] m/z units. We quantize the m/z scanning range into B = 2000 uniformly sized bins. The bins correspond to a vector of random variables S = (S_i : i = 1 . . . B). A spectrum is an instantiation of S, s = (s_1, . . . , s_B), where the most intense rank-normalized peak is retained in each bin. If no peak is present in a bin, then S_i = 0.
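    A small sketch of this preprocessing step, assuming B = 2000 bins spanning the stated 0–2000 m/z range; the exact bin-boundary convention is a guess, not taken from the paper.

```python
import numpy as np


def rank_normalize_and_bin(peaks, n_bins=2000, max_mz=2000.0):
    """Rank-normalize a spectrum and keep the most intense peak per m/z bin.

    peaks: iterable of (m/z, intensity) pairs.  Peaks are sorted by increasing intensity and
    the i-th peak (1-based) gets intensity i/|s|; bins without a peak stay at 0.
    """
    peaks = list(peaks)
    s = np.zeros(n_bins)
    if not peaks:
        return s
    ranked = sorted(peaks, key=lambda peak: peak[1])            # ascending raw intensity
    for rank, (mz, _raw) in enumerate(ranked, start=1):
        if not (0.0 <= mz < max_mz):
            continue                                            # outside the scanning range
        b = int(mz * n_bins / max_mz)                           # uniformly sized bins over [0, max_mz)
        s[b] = max(s[b], rank / len(ranked))                    # most intense (highest rank) peak wins
    return s
```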

    A Generative Model of Peptide Fragmentation

    DBNs are commonly used to model discrete-time phenomena, but they can be applied to any sequential data. In Figure 2, each non-prologue frame t = 1 . . . n corresponds to the fragmentation of peptide p into the b_t and y_t ions. The peptide is represented as a vector of random variables A = (A_t : t = 1 . . . n). Since we are given the peptide-spectrum match to score, A is observed, with A_t = a_t. The spectrum variables S are fixed across all frames, and observed, since the spectrum is given.

    The masses of the prefix and suffix are denoted n_t = m(a_1 . . . a_t) and c_t = m(a_{t+1} . . . a_n).¹ The masses can be defined recursively: n_0 = 0, n_t = n_{t−1} + m(a_t), and c_n = 0, c_t = c_{t+1} + m(a_{t+1}). The variables p = {n_t, c_t}_{t=1}^{n−1} identify the peptide.

    The random variables b_t, y_t ∈ {1 . . . B} are indices that select which bins are expected to contain the b_t-ion and the y_t-ion, respectively. Recall that there is a deterministic relationship between the mass and charge of a fragment and its location on the m/z axis: i.e., b_t = round((n_t + 1)/z_t).
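    A small sketch of these recursions and the resulting bin indices; the residue-mass table is left as a parameter, the +1 offset follows the b_t formula above, and the analogous offset for y-ions is not specified in the text, so it is deliberately left out rather than guessed.

```python
from itertools import accumulate


def prefix_suffix_masses(peptide, residue_mass):
    """Return the prefix masses n_t and suffix masses c_t of a peptide string.

    residue_mass: dict mapping each amino acid letter to its mass (supplied by the caller).
    Uses n_t = n_{t-1} + m(a_t) and c_t = m(p) - n_t, equivalent to the recursions above.
    """
    n = list(accumulate(residue_mass[a] for a in peptide))
    total = n[-1]
    c = [total - n_t for n_t in n]
    return n, c


def b_ion_bin(n_t, charge, n_bins=2000):
    """Bin index of the b_t ion: round((n_t + 1) / z_t); out-of-range indices map to no bin."""
    b = int(round((n_t + 1.0) / charge))       # +1 for the extra proton, as in the formula above
    return b if 1 <= b <= n_bins else None     # None plays the role of the peak-free dummy bin
```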

    To generalize Equation ?? to a posterior probability, we need a background score which measures the average fit of the spectrum to a shifted version of the theoretical spectrum. The shift variable τ_0 allows us to shift the theoretical spectrum: τ_0 ∈ [−M . . . +M], for a choice of M ∈ {1 . . . B}. Instead of predicting the b_t-ion at bin b_t, we predict it at bin b_t + τ. If the shifted bin location is outside the range {1, . . . , B}, we map those positions to a special bin that contains no peak. To shift the entire theoretical spectrum, τ_t = τ_{t−1}, t = 1 . . . n. The distribution over τ_0 is uniform.

    Most of the conditional probability distributions in Figure 2 are deterministic, which leads to a simple form for the joint distribution:

    $$p(\tau_0, s, p) = p(\tau_0) \prod_{t=1}^{n-1} \prod_{i=1}^{B} \left[ P(S_i \mid b_t, y_t, \tau_t) \right]^{\delta(i = b_t + \tau_t \,\vee\, i = y_t + \tau_t)}. \qquad (1)$$

    The inference which connects this model to Equation ?? is the log-posterior of τ_n:

    $$\theta(s, p) \triangleq \log p(\tau_n = 0 \mid p, s) = \log p(\tau_0 = 0, p, s) - \log |\tau_0|^{-1} \sum_{\tau_0} p(p, s \mid \tau_0). \qquad (2)$$

    The $\log p(\tau_0 = 0, p, s)$ term is the probabilistic analogue of ⟨S, s(p)⟩ in Equation ??, a term which measures the similarity between the theoretical and observed spectra. The $\log |\tau_0|^{-1} \sum_{\tau_0} p(p, s \mid \tau_0)$ term is a generalized version of the cross-correlation between the real and theoretical spectra: the average similarity between the spectrum and shifted versions of the theoretical spectra.

    Computing the scoring function θ(s, p) is somewhat simpler than computing the evidence p(p, s). Algorithms for DBN inference are typically forward-backward schemes (cf. [2]), and θ(·) can be computed using only a forward pass.
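    A sketch of how θ(s, p) in Equation (2) could be assembled once the per-shift likelihoods are available; the helper log_lik_given_shift (standing in for the sum of log virtual-evidence factors from Equation (1) under a fixed shift) and its name are illustrative, not part of the paper's code, and the uniform prior over shifts follows the text.

```python
import numpy as np
from scipy.special import logsumexp


def didea_score(log_lik_given_shift, shifts):
    """theta(s, p) = log p(tau = 0, p, s) - log( |tau_0|^{-1} * sum_tau p(p, s | tau) ).

    log_lik_given_shift(tau): log p(p, s | tau_0 = tau) for a single shift value.
    shifts: the allowed shift values [-M, ..., +M]; must contain 0.
    """
    shifts = list(shifts)
    log_liks = np.array([log_lik_given_shift(tau) for tau in shifts])
    log_prior = -np.log(len(shifts))                    # uniform p(tau_0)
    log_joint_at_zero = log_prior + log_liks[shifts.index(0)]
    log_background = log_prior + logsumexp(log_liks)    # log of the average shifted likelihood
    return float(log_joint_at_zero - log_background)    # equals log p(tau = 0 | p, s)
```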

    Virtual Evidence

    An advantage of our probabilistic approach to scoring is that we have substantial flexibility in representing the contribution of peaks towards the score, P(S_i | b_t, y_t, τ_t). Using virtual evidence [10], we are free to choose an arbitrary non-negative function f_i(S) to model each bin.

    One way to mimic the observation S_i = s_i is to introduce a virtual binary variable C_i, whose sole parent is S_i. The virtual child is fixed to C_i = 1. If P(C_i = 1 | S_i) ≜ δ(S_i = s_i), then

    $$P(S_i = s_i \mid b_t, y_t, \tau_t) = \sum_{S_i} P(S_i \mid b_t, y_t, \tau_t)\, P(C_i = 1 \mid S_i).$$

    Virtual evidence changes the definition of the virtual child’s conditional probability distribution to P(C_i = 1 | S_i) = f_i(S_i), for a user-defined non-negative function f_i. One could define a separate f_i for each bin i, but for simplicity we choose a single function f for all bins.

    Following Equation ?? we impose additional constraints on the form of f. The score of a peptide-spectrum match should depend only on the peaks: f(0) = 1. If a peak is found in an activated bin, its contribution to the score must be higher than that of an activated bin with no peak: ∀S > 0, f(S) > f(0). Finally, matching high intensity peaks should be worth more than matching low intensity peaks; the b- and y-series should be more prominent than noise in the spectrum: i.e., f is monotone increasing. Based on our experiments, a class of f that works particularly well is

    $$f_\lambda(S) = \frac{e^{\lambda} - \lambda + \lambda e^{\lambda S}}{2e^{\lambda} - 1 - \lambda}. \qquad (3)$$


    Figure 2: Didea as a graphical model: the “prologue” occurs once at the beginning, the “epilogue” occurs once at the end, and the “chunk” is unrolled as necessary to any desired length. At each chunk, the bottom plate is expanded to have B copies. (The figure shows the variables n_t, c_t, b_t, y_t, τ_t, a_t, and the spectrum bins S_i for i = 1 . . . B, which are shared across all frames.)

    The parameter λ > 0 dictates the relative value placed upon peak intensity in the scoring function.
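    A direct transcription of Equation (3) as reconstructed above (the grouping of the denominator is taken from the garbled typesetting, so treat it as an assumption), useful for checking that f_λ is positive and monotone increasing on rank-normalized intensities in [0, 1].

```python
import numpy as np


def f_lambda(intensity, lam):
    """Virtual-evidence potential of Equation (3): (e^lam - lam + lam*e^(lam*S)) / (2*e^lam - 1 - lam)."""
    s = np.asarray(intensity, dtype=float)            # rank-normalized intensities in [0, 1]
    return (np.exp(lam) - lam + lam * np.exp(lam * s)) / (2.0 * np.exp(lam) - 1.0 - lam)


# e.g. f_lambda(np.linspace(0.0, 1.0, 5), lam=2.0) is increasing in the intensity, as required.
```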

    Experiments

    We compare the performance of spectrum identification algorithms against three tandem MS experiments on proteins from two different organisms:

    • 60cm: A tryptic digest of S. cerevisiae lysate containing 18,149 spectra, each with a precursor ion charge of 2. That is, c(p) = 2 for all candidate peptides.

    • Yeast-01: A tryptic digest of S. cerevisiae lysate, containing 34,499 spectra, each with a precursor charge of 2 or 3. That is, c(p) ∈ {2, 3} for all candidate peptides. We compare all algorithms under the assumption that each candidate peptide has c(p) = 2.

    • Worm-01: A tryptic digest of C. elegans proteins, containing 22,436 spectra, each with precursor charge 2 or 3. Again, we compare all algorithms under the assumption that c(p) = 2.

    The peptide database P for the yeast data sets is generated by an in silico trypsin digest of the soluble yeast proteome [16].

    We compare the performance of four spectrum identification algorithms on these three data sets²: Didea, Crux/XCorr, Mascot, and PepHMM. Jeff: [Say again that these other methods include at least one that is quite standard, and that Mascot is commercial.] The search parameters are controlled across the four methods. Candidate peptides are selected using a mass window of δ = 3.0 Da, save PepHMM, which uses a hard-coded window of δ = 2.0 Da. The entire b- and y-ion series are assumed to be present. A fixed modification to cysteine is included to account for carbamidomethylation of protein disulfide bonds. In all cases, the decoys are generated by randomly permuting target peptides.

    Figure 3 presents the absolute ranking comparison of the four methods. In all cases, there is a significant improvement in the number of spectra that are confidently identified, with Didea strictly dominating over q ∈ [0, 1], save for the Worm-01 experiment. Ajit: [I’m betting that the poor performance in Worm-01 is due largely to failures on spectra with heavy precursor masses where the charge is probably +3. If we assume the charge is +2, a large chunk of the b/y-series would fall outside the scanning range of the device.]

    ¹ The prefix and suffix of a peptide are more commonly referred to as the N- and C-terminal fragments.
    ² Comparisons on additional data sets are included in D.


    Figure 3: Absolute ranking comparison. Panels: (a) 60cm, (b) Yeast-01, (c) Yeast-02, (d) Yeast-03, (e) Yeast-04, (f) Worm-01, (g) Worm-02, (h) Worm-03. Each panel plots q-value (x-axis, 0.02–0.10) against the number of spectra identified (in thousands, y-axis) for Didea, Crux, Mascot, and PepHMM; Mascot is absent from the Worm-02 and Worm-03 panels.

    Jeff: [I think we should include some reasons for this performance increase, rather than just giving it. First, did you end up using the MLE for λ? If so, say so. This should also include the benefit of the f function, that this “distribution” was not pulled out of a hat, but is a family of distributions that are particularly suited to this problem. Should also say that this function is novel, not before been used for this (or any) problem, as far as we know. Now also, a key benefit of our approach is that it is probabilistic, and thus automatically normalized appropriately, unlike the Crux approach mentioned above where there can be unwanted miscalculations between foreground and background model (at least this should be our hypothesis).]

    5 Lattice Decoding

    Lattice representation for peptide database

    The drawback of representing each peptide as an individual observation sequence is that the same computations need to be carried out multiple times for peptides with identical substrings. A more efficient way of representing a peptide database is in the form of a subpeptide lattice. Lattice representations are widely used for other sequence modeling problems outside computational biology, such as speech and language processing (e.g., [1, 3, 5]). They provide a way of representing a finite but possibly very large set of strings in a compact, compressed form by the sharing of common substrings. Given an alphabet A of amino acids, a peptide p can be defined as a string over A. A subpeptide s is a substring of p whose length, |s|, is typically less than the length of p. We denote the total inventory of subpeptides with S. A subpeptide lattice is a directed acyclic graph G = (V, E) with a set of vertices V and a set of edges E, each of which is labelled with a subpeptide s ∈ S


    and, optionally, additional information such as frequencies or probabilities. The concatenation of subpeptides along a path through the lattice corresponds to a complete peptide in the database.

    Figure 4: Compressed lattice for the three peptides AAAANWLR, AAADEWDER, AAADLISR.

    From Jeff: I don’t think we’ll have space for this figure, unfortunately, at least in the main version. We could use it in the extended version (which could be a strict superset of this paper).

    Figure 5: Graphical model structure for a peptide lattice.

    Using a lattice representation, common subpeptides can be shared among peptides and the peptide database can be represented much more compactly. The computations needed to evaluate the observation model for specific amino acids are only performed once per edge; thus, depending on the degree of sharing inherent in the lattice (relative to the uncompressed database), significant speedups can be achieved. The question is how to define S such that the resulting lattice is as compact as possible. To address this problem we exploit the fact that, formally, a lattice is a (weighted) finite-state automaton (FSA) Jeff: [Add cite]. Our initial starting point is a “naive” lattice representation where every peptide is represented as a separate path consisting of edges labelled with individual amino acids only. We then apply a series of well-known operations on finite-state machines that transform the lattice into the corresponding minimal lattice that has the smallest possible number of states. The alphabet S results as a by-product of this procedure.

    The first step is to convert the peptide database (i.e., a simple set of strings) to a finite-state automaton F. Next, F is determinized. Determinization converts F into an equivalent FSA, F_det, such that for any given state q and alphabet symbol a ∈ A, there is only a single outgoing edge from q labeled with a. Third, F_det is minimized. Minimization creates an FSA, F_min, that is equivalent to F_det but has the minimal number of states. Algorithms for determinization and minimization have been studied in depth (e.g., [8]); we use the implementations provided in the OpenFst toolkit (www.openfst.org). Finally, deterministic subpaths in the lattice (sequences of states with only one outgoing edge) are collapsed into a single edge, further limiting the number of states and edges and thus reducing memory requirements. Figure 4 shows an example of a compressed lattice for three peptides. At the end of this procedure, S is defined by the list of unique edge labels in the final collapsed lattice.
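    A sketch of the final chain-collapsing step on an already determinized and minimized acyclic automaton; the adjacency-list representation and the exact criterion for an interior chain state are assumptions made for this illustration, and the determinization/minimization steps themselves are delegated to OpenFst as described above.

```python
from collections import defaultdict


def collapse_chains(edges, start, finals):
    """Collapse deterministic subpaths into single subpeptide-labelled edges.

    edges: list of (src, dst, amino_acid) triples of a minimized acyclic automaton.
    start: the start state; finals: set of final states.
    Returns (collapsed_edges, subpeptide_alphabet), i.e. the compressed lattice and S.
    """
    out_edges = defaultdict(list)
    in_degree = defaultdict(int)
    for src, dst, label in edges:
        out_edges[src].append((dst, label))
        in_degree[dst] += 1

    def interior(state):
        # a state that can be absorbed into a chain: one way in, one way out, not start/final
        return (state != start and state not in finals
                and in_degree[state] == 1 and len(out_edges[state]) == 1)

    collapsed, alphabet = [], set()
    for src, dst, label in edges:
        if interior(src):
            continue                          # this edge belongs to a chain started at a kept state
        labels, cur = [label], dst
        while interior(cur):                  # walk the deterministic subpath
            cur, lab = out_edges[cur][0]
            labels.append(lab)
        subpeptide = "".join(labels)
        collapsed.append((src, cur, subpeptide))
        alphabet.add(subpeptide)
    return collapsed, alphabet
```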

    One problem is that the lattice incorporates peptides of different lengths, which complicates scoring with the observation model described in Section 4. In order to be able to score all strings simultaneously, they need to be warped to a common length. We achieve this by appending a “dummy amino acid” symbol to peptides shorter than the longest peptide in the database, such that all strings have the same length.

    Graphical model representation of lattices

    In order to use a lattice representation within our graphical modeling framework, the lattice needs to be represented as a graphical model structure, visualized in Figure 5.

    Valid paths through the lattice are specified by the NODE variable and associated parameters: the probability of node j given node i is nonzero whenever an edge exists between them in the original lattice. The SP (subpeptide) variable with cardinality |S| encodes the identity of the edge label and is dependent on a start node i and end node j. SPPOS specifies the position in the subpeptide; whenever the final position is reached, the binary transition variable TRANS is switched to 1. The TRANS variable is in turn a switching parent for NODE and SP; if it is 1, NODE and SP take on new values (i.e., a transition in the lattice occurs), otherwise the values from the previous frame are copied. Finally, SPPOS and SP jointly determine the amino acid (AA) variable, which is connected to an observation (or a more complicated observation model as described above). The validity of strings is ensured by dedicated end node and end transition variables which ensure that the end of the observation sequence coincides with the end of a subpeptide.

    When using a peptide lattice to search an entire database, precursor filtering can be done as part of the search. To this end a pruning variable is included that assigns zero probability to a path if the current accumulated mass exceeds an upper mass limit, or if the lower mass limit exceeds the current mass



    plus the maximum possible mass value that can still be added before the end of the peptide is reached (the maximum remaining mass is equal to the remaining number of peptide positions multiplied by the largest mass value of any amino acid). The pruning variable is checked whenever a new edge in the lattice is being entered.

    database   # peptides   "naive" lattice   compressed lattice   |S|
    worm       523,190      27M/27M           295k/804k            251k
    yeast      162,916      8.5M              95k/256k             85k

    Table 1: Sizes of naive and compressed lattices (given as number of nodes/number of edges), and the size of the subpeptide alphabet, for the worm and yeast databases.

    Experiment   "naive"   compressed
    A
    B
    C

    Table 2: CPU time of inference for database search vs. search through lattice.
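    A minimal sketch of the pruning test described just before the tables; the variable names are illustrative, and max_residue_mass is the largest amino-acid mass in whatever residue-mass table is in use.

```python
def prune_path(accumulated_mass, remaining_positions, lower_mass, upper_mass, max_residue_mass):
    """Return True if a partial lattice path can receive zero probability.

    The path is dead if its accumulated mass already exceeds the upper precursor-mass limit, or
    if even adding the heaviest residue at every remaining position cannot reach the lower limit.
    """
    if accumulated_mass > upper_mass:
        return True
    if accumulated_mass + remaining_positions * max_residue_mass < lower_mass:
        return True
    return False
```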

    Experiments

    Table 1 compares the sizes of the original “naive” lattice representation, where each peptide is represented as an individual string, and the corresponding compressed lattice representation.

    With respect to computational efficiency, speedups can be achieved by evaluating the observation model only once for each edge in the lattice.

    Three different timing experiments were conducted to evaluate the lattice representation. In Experiment A we use the lattice as a compact representation for sets of peptides that have been prefiltered according to their precursor mass values. In Experiment B, the entire peptide database is represented as a lattice and the search is conducted against the entire database. Precursor filtering is performed as part of the search, through the pruning variable in the graphical model lattice structure, as described above. In Experiment C we also conduct a search over the entire database but (additionally?) use pruning options provided by the graphical model inference code.

    Timing experiments were conducted on a (MACHINE SPECS?). Each number is the average of 20 runs and reports the inference time only, excluding startup cost.

    Acknowledgements

    Use unnumbered third level headings for the acknowledgements title. All acknowledgements go at the end of the paper.


    References

    [1] X. Aubert, C. Dugast, H. Ney, and V. Steinbiss. Large vocabulary continuous speech recognition of Wall Street Journal data. In Proceedings of ICASSP, pages 129–132.

    [2] J. Bilmes. Dynamic graphical models. IEEE Signal Processing Magazine, 27(6):29–42, Nov 2010.

    [3] C. Chelba and A. Acero. Position-specific posterior lattices for indexing speech. In Proceedings of ACL, 2005.

    [4] A. R. Dongre, J. L. Jones, A. Somogyi, and V. H. Wysocki. Influence of peptide composition, gas-phase basicity, and chemical modification on fragmentation efficiency: evidence for the mobile proton model. Journal of the American Chemical Society, 118:8365–8374, 1996.

    [5] C. Dyer, S. Muresan, and P. Resnik. Generalizing word lattice translation. In Proceedings of ACL/HLT, pages 1012–1020, 2008.

    [6] J. E. Elias and S. P. Gygi. Target-decoy search strategy for increased confidence in large-scale protein identifications by mass spectrometry. Nature Methods, 4(3):207–214, 2007.

    [7] J. K. Eng, A. L. McCormack, and J. R. Yates, III. An approach to correlate tandem mass spectral data of peptides with amino acid sequences in a protein database. Journal of the American Society for Mass Spectrometry, 5:976–989, 1994.

    [8] J. E. Hopcroft and J. Ullman. Introduction to Automata Theory, Languages and Computation. Addison-Wesley, Reading, Mass., 1979.

    [9] L. Käll, J. D. Storey, M. J. MacCoss, and W. S. Noble. Assigning significance to peptides identified by tandem mass spectrometry using decoy databases. Journal of Proteome Research, 7(1):29–34, 2008.

    [10] J. Pearl. Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference. Morgan Kaufmann, 1988.

    [11] D. N. Perkins, D. J. C. Pappin, D. M. Creasy, and J. S. Cottrell. Probability-based protein identification by searching sequence databases using mass spectrometry data. Electrophoresis, 20(18):3551–3567, 1999.

    [12] J. D. Storey. A direct approach to false discovery rates. Journal of the Royal Statistical Society, Series B, 64:479–498, 2002.

    [13] J. D. Storey. The positive false discovery rate: a Bayesian interpretation and the q-value. The Annals of Statistics, 31(6):2013–2035, 2003.

    [14] J. D. Storey and R. Tibshirani. Statistical significance for genome-wide studies. Proceedings of the National Academy of Sciences of the United States of America, 100:9440–9445, 2003.

    [15] Y. Wan, A. Yang, and T. Chen. PepHMM: A hidden Markov model based scoring function for mass spectrometry database search. Analytical Chemistry, 78(2):432–437, 2006.

    [16] M. P. Washburn, D. Wolters, and J. R. Yates, III. Large-scale analysis of the yeast proteome by multidimensional protein identification technology. Nature Biotechnology, 19:242–247, 2001.


    SUPPLEMENTARY MATERIAL

    A Tandem Mass Spectrometry

    B Description of the Testing Datasets

    Elided. The data sets have been used in previously published work.

    C Evaluation Metrics

    Add form of the equation when n_targets ≠ n_decoys. Include results on qvality: i.e., that we do better even under alternate estimators of the FDR.

    Ajit: [Questions that we may want to put in a supplement.]

    Q: Why are ground truth peptide-spectrum matches not available in any significant quantities?

    A: Theoretically, one could create a purified sample of a peptide which could be used to generate a spectrum where the peptide is known. However, the resolution of tandem mass spectrometry is so high that creating sufficiently pure samples is impractical.

    One could attempt to label spectra by hand, but such labellings are known not to be especially accurate [CITE].

    D Scoring Identifications as Inference in a Dynamic Bayesian Network

    Explain where the virtual evidence function comes from, the MLE, and why it does not work well.

    Additional Experiments

    Scatter plots relating our scoring function against Crux. Break down the comparison of methods based on filtering returned PSMs by length, by spectrum length, by precursor mass. Sum-product vs. max-product. Ablative: replace the VECPT function with intensity.

    E Lattice Decoding

    Additional Experiments
