Analysis of tandem mass spectra - II Prof. William Stafford Noble GENOME 541 Intro to Computational Molecular Biology


Re-ranking identified spectra

(Keller Analytical Chemistry 2002)

(Anderson J Proteome Research 2003)

(Käll Nature Methods 2007)

[Figure: repeated peptide assignments labeled EAMPK, ending with one in question: EAMPK?]

This is the problem we set out to solve

Modified problem: Is this peptide assignment correct?

[Figure: spectrum (intensity vs. m/z) matched to peptide VVVTGLGMLSPVGNTVESTWK, charge +2; labels 1304.4 (+1) and 888.14 (+1)]

Peptide-spectrum match features

• Total peptide mass
• Charge (+1, +2 or +3)
• Total ion current
• Peak count
• Preliminary SEQUEST score (Sp)
• Sp rank
• Cross-correlation score (XCorr)
• Change in XCorr (delta Cn)
• Mass difference
• Percent of theoretical peaks matched
• Percent of observed peaks matched
• Percent of peptide fragment ion current matched
• Percent sequence identity between top and second-ranked peptides

• Uses linear discriminant analysis rather than SVM.
• Uses a four-dimensional feature space (XCorr, delta Cn, ln SpRank, delta mass).
• Uses EM to fit distributions to the discriminants of the two classes, yielding a probability.
• Learns a simple, independent probability model of the number of tryptic termini.
• Publicly available software, PeptideProphet, is widely used.
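The step "use EM to fit distributions to the discriminants of the two classes, yielding a probability" can be sketched as follows. This is a minimal illustration, not PeptideProphet's implementation: it assumes both classes are Gaussian (a simplification chosen here), and the names `fit_mixture` and `posterior_correct`, as well as the simulated scores, are hypothetical.

```python
import math
import random

def normal_pdf(x, mean, sd):
    """Density of a normal distribution at x."""
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))

def posterior_correct(x, pi, incorrect, correct):
    """Pr(correct | discriminant score x) by Bayes' rule for a two-class mixture."""
    a = pi * normal_pdf(x, *correct)
    b = (1.0 - pi) * normal_pdf(x, *incorrect)
    return a / (a + b) if a + b > 0.0 else 0.5

def fit_mixture(scores, n_iter=100):
    """Fit (pi, (mean0, sd0), (mean1, sd1)) by EM; class 1 is 'correct'.
    Assumes the scores are not all identical."""
    lo, hi = min(scores), max(scores)
    incorrect = (lo + 0.25 * (hi - lo), (hi - lo) / 4.0)
    correct = (lo + 0.75 * (hi - lo), (hi - lo) / 4.0)
    pi, n = 0.5, len(scores)
    for _ in range(n_iter):
        # E-step: probability that each score came from the correct class
        resp = [posterior_correct(x, pi, incorrect, correct) for x in scores]
        # M-step: re-estimate the class parameters from the weighted scores
        n1 = min(max(sum(resp), 1e-9), n - 1e-9)
        n0 = n - n1
        m1 = sum(r * x for r, x in zip(resp, scores)) / n1
        m0 = sum((1.0 - r) * x for r, x in zip(resp, scores)) / n0
        s1 = math.sqrt(sum(r * (x - m1) ** 2 for r, x in zip(resp, scores)) / n1)
        s0 = math.sqrt(sum((1.0 - r) * (x - m0) ** 2 for r, x in zip(resp, scores)) / n0)
        incorrect, correct = (m0, max(s0, 1e-6)), (m1, max(s1, 1e-6))
        pi = n1 / n
    return pi, incorrect, correct

# Simulated discriminant scores (hypothetical data, for illustration only)
rng = random.Random(0)
scores = [rng.gauss(-2.0, 1.0) for _ in range(200)]
scores += [rng.gauss(3.0, 1.0) for _ in range(100)]
pi, incorrect, correct = fit_mixture(scores)
p_high = posterior_correct(4.0, pi, incorrect, correct)
p_low = posterior_correct(-3.0, pi, incorrect, correct)
```

A high discriminant score ends up with a posterior near 1 and a low score with a posterior near 0, which is the sense in which the fitted mixture turns a score into a probability.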

[Figure: peptide-spectrum matches against the real database and against the shuffled database, with features and a threshold of q = 0.01]

[Figure: PSM counts (2780, 13706, 8050, 12691, and 10863 PSMs), with a 1% FDR label]
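One common way to arrive at a threshold such as q = 0.01 from searches against a real and a shuffled database is sketched below. It assumes the simple estimate FDR = (decoys above threshold) / (targets above threshold); the function name `target_decoy_qvalues` and the toy scores are hypothetical, and the estimator behind the slides may differ in detail.

```python
def target_decoy_qvalues(target_scores, decoy_scores):
    """Return one q-value per target PSM, in input order (higher score = better)."""
    labeled = [(s, True, i) for i, s in enumerate(target_scores)]
    labeled += [(s, False, -1) for s in decoy_scores]
    # Best scores first; on ties, decoys come first (the conservative choice)
    labeled.sort(key=lambda t: (-t[0], t[1]))

    # Estimated FDR at each score threshold, walking down the ranked list
    fdrs, n_targets, n_decoys = [], 0, 0
    for _score, is_target, _idx in labeled:
        if is_target:
            n_targets += 1
        else:
            n_decoys += 1
        fdrs.append(min(1.0, n_decoys / max(n_targets, 1)))

    # q-value: the smallest FDR at this threshold or any more permissive one
    qvalues = [0.0] * len(target_scores)
    running_min = 1.0
    for (_score, is_target, idx), fdr in zip(reversed(labeled), reversed(fdrs)):
        running_min = min(running_min, fdr)
        if is_target:
            qvalues[idx] = running_min
    return qvalues

# Toy example (hypothetical scores)
qvals = target_decoy_qvalues([9.0, 8.0, 7.0, 5.0, 3.0], [6.0, 4.0, 2.0])
n_accepted = sum(1 for q in qvals if q <= 0.01)
```

In this toy example the three best-scoring targets outrank every decoy, so they are the only ones accepted at q ≤ 0.01.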

Cleaving with elastase

Variation by data set

Black lines are q = 0.01
Yellow line is y = x
Red line = equal q-value thresholds

[Figure: panels for the elastase data set and the chymotrypsin data set; axes: Percolator best match and SEQUEST best match]

Protein identification

The protein ID problem

[Figure: three-layer graph of proteins, peptides, and spectra. Peptides: EEAMPFK, CYCYGGLGK, CYCLLIGK, FTEILYCDLNR, VNILLGLPK. Scores: 1.0, 0.95, 0.98, 0.87, 0.74]

The peptide-to-protein mapping is many-to-many

[Figure: proteins (X), peptides (Y), and spectra (D); peptide probabilities 0.03, 0.10, 0.91, 0.99, 0.97; threshold ≥ 0.90]

One- and two-peptide rules use a simple threshold

[Figure: peptide probabilities 0.03, 0.10, 0.91, 0.99, 0.97; threshold ≥ 0.90]
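A one- or two-peptide rule can be sketched as a count against a threshold. The peptide probabilities and the 0.90 threshold come from the figure; the protein-to-peptide mapping, the identifiers (`X1`, `pepA`, ...), and the function name are hypothetical, since the graph topology is not recoverable from the text.

```python
def proteins_passing_rule(protein_to_peptides, peptide_prob,
                          threshold=0.90, min_peptides=1):
    """Accept a protein if at least `min_peptides` of its peptides
    have probability >= threshold."""
    accepted = []
    for protein, peptides in protein_to_peptides.items():
        n_confident = sum(1 for p in peptides if peptide_prob[p] >= threshold)
        if n_confident >= min_peptides:
            accepted.append(protein)
    return accepted

# Peptide probabilities from the figure; the mapping below is an assumption
peptide_prob = {"pepA": 0.91, "pepB": 0.03, "pepC": 0.10,
                "pepD": 0.99, "pepE": 0.97}
protein_to_peptides = {
    "X1": ["pepA", "pepB"],
    "X2": ["pepA", "pepB", "pepC"],
    "X3": ["pepD", "pepE"],
    "X4": ["pepE"],
}
one_peptide = proteins_passing_rule(protein_to_peptides, peptide_prob, min_peptides=1)
two_peptide = proteins_passing_rule(protein_to_peptides, peptide_prob, min_peptides=2)
```

Under this assumed mapping the one-peptide rule accepts all four proteins, while the two-peptide rule accepts only X3. Neither rule accounts for peptides shared between proteins.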


Select the minimum number of proteins to explain the peptides

IDPicker
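Selecting a minimum set of proteins that explains the peptides is a set-cover problem. A greedy heuristic is one standard way to approximate it, sketched below. This is an illustration of the idea, not IDPicker's code, and the protein and peptide identifiers are hypothetical.

```python
def greedy_minimal_protein_set(protein_to_peptides):
    """Repeatedly pick the protein that explains the most still-unexplained
    peptides, until every peptide is explained."""
    uncovered = set()
    for peptides in protein_to_peptides.values():
        uncovered |= set(peptides)
    chosen = []
    while uncovered:
        # Ties are broken by protein name, because max() keeps the first maximum
        best = max(sorted(protein_to_peptides),
                   key=lambda pr: len(uncovered & set(protein_to_peptides[pr])))
        gained = uncovered & set(protein_to_peptides[best])
        if not gained:
            break
        chosen.append(best)
        uncovered -= gained
    return chosen

# Hypothetical example: protein B is fully explained by A, and D by C
example = {
    "A": {"p1", "p2", "p3"},
    "B": {"p1", "p2"},
    "C": {"p3", "p4"},
    "D": {"p4"},
}
minimal_set = greedy_minimal_protein_set(example)
```

The greedy choice is not guaranteed to find the true minimum in general, and when several proteins explain the same peptides (C and D here) the tie-break is arbitrary.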

ProteinProphet

[Figure: peptide probabilities 0.03, 0.10, 0.91, 0.99, 0.97; peptide-to-protein link weights 0.3, 0.7, 0.8, 0.2, 1, 1, 0.55, 0.45]


Use an EM-like procedure…

[Figure: the same graph with link weights 0.3, 0.7, 0.8, 0.2, 1, 1, 0.45, 0.55 attached to peptide probabilities 0.03, 0.10, 0.91, 0.99, 0.97]


ProteinProphet

[Figure: proteins (X), peptides (Y), and spectra (D), with weighted links 0.8 x 0.03, 0.10, 0.3 x 0.91, 0.7 x 0.91, 0.2 x 0.03, 0.45 x 0.97, 0.99, 0.55 x 0.97]

\[ \Pr(X_2 \mid D) = 1 - (1 - 0.7 \times 0.91)(1 - 0.8 \times 0.03)(1 - 0.10) \]
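The arithmetic for Pr(X2 | D) can be checked directly. In this sketch the helper name `protein_probability` is hypothetical. The three (weight, peptide probability) pairs are taken from the expression above, with an implicit weight of 1 on the 0.10 peptide.

```python
def protein_probability(weighted_peptides):
    """1 minus the probability that every weighted peptide link is wrong."""
    prob_all_wrong = 1.0
    for weight, peptide_prob in weighted_peptides:
        prob_all_wrong *= 1.0 - weight * peptide_prob
    return 1.0 - prob_all_wrong

# (link weight, peptide probability) pairs for protein X2
p_x2 = protein_probability([(0.7, 0.91), (0.8, 0.03), (1.0, 0.10)])
```

The product of the three terms is 0.363 × 0.976 × 0.90 ≈ 0.319, so Pr(X2 | D) ≈ 0.681.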

ProteinProphet

EM-like algorithm

E-step

M-step

All proteins containing peptide i

Probability of protein n

Weight of link from peptide i to protein n

Maximum probability assigned to peptide i
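The update equations themselves did not survive in this text, so the following is a sketch of one reading of the labels above, under stated assumptions: protein probabilities are computed from weighted peptide probabilities, and each peptide's link weights are then re-apportioned in proportion to the probabilities of all proteins containing it. The function name, the even initial split, the fixed iteration count, and the graph are all hypothetical.

```python
def em_like_protein_inference(protein_to_peptides, peptide_prob, n_iter=50):
    """Alternate between protein probabilities and peptide-to-protein weights."""
    proteins = sorted(protein_to_peptides)
    # For each peptide i, all proteins containing peptide i
    containing = {}
    for pr in proteins:
        for pep in protein_to_peptides[pr]:
            containing.setdefault(pep, []).append(pr)
    # Start by splitting each peptide evenly among its proteins (assumption)
    weight = {(pep, pr): 1.0 / len(prs)
              for pep, prs in containing.items() for pr in prs}
    prob = {}
    for _ in range(n_iter):
        # Probability of protein n from its weighted peptide probabilities
        for pr in proteins:
            all_wrong = 1.0
            for pep in protein_to_peptides[pr]:
                all_wrong *= 1.0 - weight[(pep, pr)] * peptide_prob[pep]
            prob[pr] = 1.0 - all_wrong
        # Weight of the link from peptide i to protein n
        for pep, prs in containing.items():
            total = sum(prob[pr] for pr in prs)
            for pr in prs:
                weight[(pep, pr)] = prob[pr] / total if total > 0 else 1.0 / len(prs)
    return prob, weight

# Peptide probabilities as in the figures; the topology is an assumption
peptide_prob = {"pepA": 0.91, "pepB": 0.03, "pepC": 0.10,
                "pepD": 0.99, "pepE": 0.97}
graph = {"X1": ["pepA", "pepB"], "X2": ["pepA", "pepB", "pepC"],
         "X3": ["pepD", "pepE"], "X4": ["pepE"]}
prob, weight = em_like_protein_inference(graph, peptide_prob)
```

In this toy graph the shared peptide pepE is pulled toward X3, which also has a unique high-probability peptide, so X3 ends up more probable than X4.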

Nested Mixture Model

[Figure: graph with peptide probabilities 0.03, 0.10, 0.91, 0.91, 0.03, 0.97, 0.99, 0.97]


Modeled as mixture of present and absent

Model number of matches conditional on protein states

Model distribution of scores conditional on peptide states

(Li Ann Applied Statistics 2010)

Shen et al. 2008

Li et al. 2008

Model the MS/MS process generatively (forward) using free parameters. Sum over all possible protein and peptide states to get posterior probabilities. Use expectation-maximization to get parameter estimates.

Model the MS/MS process generatively using an existing static peptide detectability model. Use Markov chain Monte Carlo to estimate posterior probabilities.

Generatively model: Y | X and D | Y

Perform inference to get Pr(X | D)

The emergence of graphical Bayesian methods

Fido

Fido performs exact calculations on a Bayesian network model
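What "exact calculation" of Pr(X | D) means can be shown by brute-force enumeration on a tiny network. This is a sketch under stated assumptions and not Fido's algorithm or code: Y | X is a noisy-OR with hypothetical parameters `alpha` (a present protein emits the peptide), `beta` (the peptide appears from noise) and `gamma` (protein prior). The evidence term treats each peptide probability p as proportional to Pr(D | Y = 1), and 1 - p as proportional to Pr(D | Y = 0).

```python
from itertools import product

def protein_posteriors(protein_to_peptides, peptide_prob,
                       alpha=0.1, beta=0.01, gamma=0.5):
    """Exact Pr(X_n = 1 | D), summing over every protein and peptide state."""
    proteins = sorted(protein_to_peptides)
    marginal = {pr: 0.0 for pr in proteins}
    total = 0.0
    for state in product([0, 1], repeat=len(proteins)):
        present = {pr for pr, x in zip(proteins, state) if x}
        # Prior Pr(X): independent proteins, each present with probability gamma
        joint = 1.0
        for x in state:
            joint *= gamma if x else 1.0 - gamma
        for pep, p in peptide_prob.items():
            n_parents = sum(1 for pr in present if pep in protein_to_peptides[pr])
            # Noisy-OR: Pr(Y = 1 | X)
            pr_y1 = 1.0 - (1.0 - beta) * (1.0 - alpha) ** n_parents
            # Sum over the peptide state Y, weighting by the evidence for D
            joint *= p * pr_y1 + (1.0 - p) * (1.0 - pr_y1)
        total += joint
        for pr in present:
            marginal[pr] += joint
    return {pr: marginal[pr] / total for pr in proteins}

# Peptide probabilities as in the figures; the topology is an assumption
peptide_prob = {"pepA": 0.91, "pepB": 0.03, "pepC": 0.10,
                "pepD": 0.99, "pepE": 0.97}
graph = {"X1": ["pepA", "pepB"], "X2": ["pepA", "pepB", "pepC"],
         "X3": ["pepD", "pepE"], "X4": ["pepE"]}
posterior = protein_posteriors(graph, peptide_prob)
```

Enumeration costs 2^n evaluations for n proteins, so it is only feasible on small graphs. That cost is why exact methods need to exploit the structure of the graph.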

Barista uses a neural network to score PSMs

Input units: 17 PSM features

Hidden units

Output unit

PSM feature vector
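The layout above (17 input features, a layer of hidden units, one output unit) corresponds to a forward pass like the following sketch. The number of hidden units, the tanh activation, and the random initialization are assumptions made here for illustration; the slides do not specify them.

```python
import math
import random

def init_network(n_inputs=17, n_hidden=5, seed=0):
    """Small random weights for a one-hidden-layer network (sizes are assumptions)."""
    rng = random.Random(seed)
    return {
        "W1": [[rng.uniform(-0.1, 0.1) for _ in range(n_inputs)]
               for _ in range(n_hidden)],
        "b1": [0.0] * n_hidden,
        "w2": [rng.uniform(-0.1, 0.1) for _ in range(n_hidden)],
        "b2": 0.0,
    }

def score_psm(net, features):
    """Map one PSM feature vector to a single real-valued score."""
    hidden = [math.tanh(sum(w * x for w, x in zip(row, features)) + b)
              for row, b in zip(net["W1"], net["b1"])]
    return sum(w * h for w, h in zip(net["w2"], hidden)) + net["b2"]

net = init_network()
zero_score = score_psm(net, [0.0] * 17)   # all-zero features, zero biases
some_score = score_psm(net, [1.0] * 17)
```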

The Barista model includes spectra, peptides and proteins

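\[ F(R) = \frac{1}{|N(R)|} \sum_{E \in N(R)} g(E) \]

\[ g(E) = \max_{s \,:\, s \text{ matched to } E} f(E, s) \]

Here f(E, s) is the neural network score function and |N(R)| is the number of peptides in protein R.

[Figure: three-layer graph of proteins (R1, R2, R3), peptides (E1, E2, E3, E4), and spectra (S1 to S7)]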
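The protein score described here (each peptide takes the best neural network score among its spectra, and the protein averages over its peptides) can be sketched as below. The graph connecting R1 to R3, E1 to E4, and S1 to S7, and the PSM scores, are hypothetical. The function names are illustrative only.

```python
def peptide_score(peptide, peptide_to_spectra, psm_score):
    """g(E): the best PSM score f(E, s) over spectra s matched to peptide E."""
    return max(psm_score[(peptide, s)] for s in peptide_to_spectra[peptide])

def protein_score(protein, protein_to_peptides, peptide_to_spectra, psm_score):
    """F(R): the sum of g(E) over peptides of R, divided by their number."""
    peptides = protein_to_peptides[protein]
    total = sum(peptide_score(e, peptide_to_spectra, psm_score) for e in peptides)
    return total / len(peptides)

# Hypothetical graph and PSM scores f(E, s)
protein_to_peptides = {"R1": ["E1", "E2"], "R2": ["E2", "E3"], "R3": ["E4"]}
peptide_to_spectra = {"E1": ["S1", "S2"], "E2": ["S3"],
                      "E3": ["S4", "S5"], "E4": ["S6", "S7"]}
psm_score = {("E1", "S1"): 0.5, ("E1", "S2"): 1.5, ("E2", "S3"): -0.5,
             ("E3", "S4"): 2.0, ("E3", "S5"): 1.0,
             ("E4", "S6"): -1.0, ("E4", "S7"): -2.0}
scores = {r: protein_score(r, protein_to_peptides, peptide_to_spectra, psm_score)
          for r in protein_to_peptides}
```

With these numbers, F(R1) = (1.5 + (-0.5)) / 2 = 0.5, F(R2) = (-0.5 + 2.0) / 2 = 0.75, and F(R3) = -1.0.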

Model Training

• Search against a database containing real (target) and shuffled (decoy) proteins.

• For each protein, the label y ∈ {+1, -1} indicates whether it is a target or decoy.

• Hinge loss function: L(F(R), y) = max(0, 1 - yF(R))
• Goal: Choose parameters W such that F(R) > 0 if y = +1 and F(R) < 0 if y = -1.

repeat
    Pick a random protein (Ri, yi)
    Compute F(Ri)
    if (1 - yi F(Ri)) > 0 then
        Make a gradient step to optimize L(F(Ri), yi)
    end if
until convergence
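The training loop can be sketched in runnable form under simplifying assumptions that are not part of the slides: the PSM score function f is linear in the features (instead of a neural network), the loop runs for a fixed number of steps rather than testing convergence, and the toy proteins are invented. All names are hypothetical.

```python
import random

def dot(w, x):
    return sum(wi * xi for wi, xi in zip(w, x))

def protein_score_and_grad(w, protein):
    """protein: list of peptides, each a list of PSM feature vectors.
    Returns F(R) and its (sub)gradient with respect to w for linear f."""
    total, grad = 0.0, [0.0] * len(w)
    for psms in protein:
        best = max(psms, key=lambda x: dot(w, x))   # best PSM for this peptide
        total += dot(w, best)
        grad = [g + xj for g, xj in zip(grad, best)]
    n = len(protein)
    return total / n, [g / n for g in grad]

def train(proteins, labels, n_features, lr=0.1, n_steps=2000, seed=0):
    rng = random.Random(seed)
    w = [0.0] * n_features
    for _ in range(n_steps):
        i = rng.randrange(len(proteins))            # pick a random protein
        score, grad = protein_score_and_grad(w, proteins[i])
        if 1.0 - labels[i] * score > 0:             # hinge loss is active
            # The loss gradient is -y * dF/dw, so step in the direction +y * dF/dw
            w = [wj + lr * labels[i] * gj for wj, gj in zip(w, grad)]
    return w

# Toy data: features are [bias, quality]; targets (+1) have higher quality
proteins = [
    [[[1.0, 2.0], [1.0, 0.5]], [[1.0, 1.5]]],
    [[[1.0, 1.8]]],
    [[[1.0, -1.0], [1.0, -2.0]]],
    [[[1.0, -1.5]], [[1.0, -0.5]]],
]
labels = [1, 1, -1, -1]
w = train(proteins, labels, n_features=2)
final_scores = [protein_score_and_grad(w, p)[0] for p in proteins]
```

Updates happen only while a protein violates the margin, so once every target scores at least +1 and every decoy at most -1, the weights stop changing.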

Barista performs well in target/decoy evaluation

Why does Barista work well?

Sources of information loss during two-stage analysis:
• Spectra that are not confidently assigned to a peptide during the initial search are lost.
• Also lost are lower-ranked peptides that match a given spectrum, corresponding to
  – the correct peptide when the top-ranked peptide is incorrect, or
  – a second correct peptide when the spectrum is chimeric.
• A single score is less informative than a rich feature vector describing the PSM.
