24
Sensitive and accurate peptide identification with Mascot Percolator Markus Brosch [email protected] Wednesday, 10 June 2009

Sensitive and accurate peptide identification with Mascot … · 2009-07-03 · PSMs, then we expect five of those matches to be incorrect. The simplest way to calculate the FDR

  • Upload
    others

  • View
    1

  • Download
    0

Embed Size (px)

Citation preview

Page 1: Sensitive and accurate peptide identification with Mascot … · 2009-07-03 · PSMs, then we expect five of those matches to be incorrect. The simplest way to calculate the FDR

Sensitive and accurate peptide identification with

Mascot Percolator

Markus [email protected]

Wednesday, 10 June 2009

Page 2: Sensitive and accurate peptide identification with Mascot … · 2009-07-03 · PSMs, then we expect five of those matches to be incorrect. The simplest way to calculate the FDR

L. Käll, J. D. Storey, M. J. MacCoss, W. S. Noble, J Proteome Res 7, 29 (2008).L. Käll, J. D. Storey, M. J. MacCoss, W. S. Noble, J Proteome Res 7, 40 (2008).M. Brosch, J. Choudhary, in Scoring and validation of tandem MS peptide identification methods, Eds. (Humana Press, 2009).

0 20 40 60

0e+0

04e

+05

8e+0

5

Density estimate of data

−100

−100

Total PSMsIncorrect PSMsCorrect PSMs

Terminology: FPR, FDR, PEPFr

eque

ncy

Mascot score

Wednesday, 10 June 2009

Page 3: Sensitive and accurate peptide identification with Mascot … · 2009-07-03 · PSMs, then we expect five of those matches to be incorrect. The simplest way to calculate the FDR

0 20 40 60

0e

+0

04

e+

05

8e

+0

5

Density estimate of data

!100

!1

00

Total PSMs

Incorrect PSMs

Correct PSMs

B' B AScore

Fre

qu

en

cy

b

a

FPR = B / (B'+B) FDR = B /A = (!PEP) / Ai=1

A

iPEP = b / a

Terminology: FDR & PEP

L. Käll, J. D. Storey, M. J. MacCoss, W. S. Noble, J Proteome Res 7, 29 (2008).L. Käll, J. D. Storey, M. J. MacCoss, W. S. Noble, J Proteome Res 7, 40 (2008).M. Brosch, J. Choudhary, in Scoring and validation of tandem MS peptide identification methods, Eds. (Humana Press, 2009).Wednesday, 10 June 2009

Page 4: Sensitive and accurate peptide identification with Mascot … · 2009-07-03 · PSMs, then we expect five of those matches to be incorrect. The simplest way to calculate the FDR

Terminology: FDR vs q-value

J. D. Storey, R. Tibshirani, Proc Natl Acad Sci U S A 100, 9440 (2003).Figure: L. Käll, J. D. Storey, M. J. MacCoss, W. S. Noble, J Proteome Res 7, 29 (2008).

(i.e, many of the PSMs are correct), the accepted method formultiple testing correction is to estimate the false discovery rate(FDR).10,11 Storey and Tibshirani12 provide a description of FDRmethods that is accessible to nonstatisticians and that includesmore recent developments. In our case, the FDR associatedwith a particular score threshold is defined as the expectedpercentage of accepted PSMs that are incorrect, where an“accepted PSM” is one that scores above the threshold (Manyproteomics papers incorrectly refer to this quantity as the “falsepositive rate.”) However, other scientific fields define the falsepositive rate as the fraction of true null tests that are calledsignificant,13–17 whereas the false discovery rate is defined asthe fraction of true null tests among all of those that are calledsignificant). For example, at an FDR of 1%, if we accept 500PSMs, then we expect five of those matches to be incorrect.

The simplest way to calculate the FDR is analogous to thecalculation of p-values, above. For a given score threshold, wecount the number of decoy PSMs above the threshold and thenumber of target PSMs above the threshold. We can nowestimate the FDR by simply computing the ratio of these twovalues. For example, at a score threshold of 3.0, we observe3849 accepted target PSMs and 219 accepted decoy PSMs,yielding an estimated FDR of 5.7%. Figure 4 plots the numberof accepted PSMs as a function of the estimated FDR, and theseries labeled “Simple FDR” was computed using the ratio ofaccepted decoys versus accepted targets.

Estimating the Percentage of Incorrect Target PSMsA slightly more sophisticated method for calculating the FDR

takes into account the observation that, whereas all decoy PSMsare incorrect by construction, not all target PSMs are correct.Ideally, the presence of these incorrect target PSMs should befactored into the FDR calculation. For example, suppose thatamong 10 000 target PSMs, 8000 are incorrect and 2000 arecorrect. We would like to know the 8000 quantity so that wecan adjust our FDR estimates.

Figure 2 shows that the distributions of scores assigned totarget and decoy PSMs are similar, except that the target PSMscore distribution has a heavier tail to the right. This tail arisesbecause the set of target PSMs is comprised of a mixture ofcorrect and incorrect PSMs. Figure 5 shows simulated distribu-

tions that illustrate the underlying phenomenon. For thissimulation, we assume that our PSM score function follows anormal distribution, and we set the standard deviation to 0.7(The assumption of normality is for the purposes of illustrationonly; the methods we describe here do not require anyparticular form of distribution, nor do we assume that XCorris normally distributed). For incorrect PSMs, we set the meanof the distribution to 1.0, and for correct PSMs, we change themean to 3.0. Our simulated data set contains 10 000 decoyPSMs, 8000 incorrect target PSMs, and 2000 correct targetPSMs. The figure shows the resulting decoy score distribution(black line), the target score distribution (blue line), and its twocomponent distributions (dotted and dashed blue lines). In thissimulated data set, the percentage of incorrect targets (PIT) is80%. This PIT is equivalent to the ratio of the area under thedotted blue line (the incorrect target PSMs) to the area underthe solid black line (the decoy PSMs).

The PIT is important because it allows us to reduce theestimated FDR associated with a given set of accepted targetPSMs. In our simulation, if we accept X decoy PSMs with scoresabove a certain threshold, then we expect to find 0.8X incorrecttarget PSMs above the same theshold. A more accurate estimateof the FDR, therefore, is to multiply the previous estimate—the

Figure 4. Mapping from the number of identified PSMs to the estimated false discovery rate. (A) The figure plots the number of PSMsabove the threshold as a function of the estimated false discovery rate. Two different methods for computing the FDR are plotted, withand without an estimate of the percentage of incorrect target PSMs (PIT). The vertical line corresponds to an XCorr of 3.0. (B) A zoomed-in version of panel A, with the estimated FDR shown as a dotted line and the q-value shown as a solid line.

Figure 5. Simulated target and decoy PSM score distributions.

Assigning Significance to Peptides perspectives

Journal of Proteome Research • Vol. xxx, No. xx, XXXX C

decreas

ing score

threshold

new TPs

new FPs

Wednesday, 10 June 2009

Page 5: Sensitive and accurate peptide identification with Mascot … · 2009-07-03 · PSMs, then we expect five of those matches to be incorrect. The simplest way to calculate the FDR

20 40 60 80

−500

050

010

00

Mascot score distributions

Mascot score

Freq

uenc

y

Scor

e cu

toff

TargetDecoyTarget − Decoy

Target / Decoy database searching

Dens

ity

0

Target/Decoy concept: R. E. Moore, M. K. Young, T. D. Lee, J Am Soc Mass Spectrom 13, 378 (2002).

Correct PSMs

Incorrect PSMs

Wednesday, 10 June 2009

Page 6: Sensitive and accurate peptide identification with Mascot … · 2009-07-03 · PSMs, then we expect five of those matches to be incorrect. The simplest way to calculate the FDR

20 40 60 80

−500

050

010

00

Mascot score distributions

Mascot score

Freq

uenc

y

Scor

e cu

toff

TargetDecoyDecoy x pi0Target − DecoyTarget − Decoy x pi0

Target / Decoy database searching

Dens

ity

0

L. Käll, J. D. Storey, M. J. MacCoss, W. S. Noble, J Proteome Res 7, 29 (2008).Wednesday, 10 June 2009

Page 7: Sensitive and accurate peptide identification with Mascot … · 2009-07-03 · PSMs, then we expect five of those matches to be incorrect. The simplest way to calculate the FDR

Accurate q-values (FDR) and PEPs

0 10 20 30 40 50

0.0

0.2

0.4

0.6

0.8

1.0

Score

q−va

lue

or P

EP

q−valuePEP

qvality software package: L. Käll, J. D. Storey, W. S. Noble, Bioinformatics 24, i42 (2008).

Mascot Score

Think FDR!

Wednesday, 10 June 2009

Page 8: Sensitive and accurate peptide identification with Mascot … · 2009-07-03 · PSMs, then we expect five of those matches to be incorrect. The simplest way to calculate the FDR

MIT MHT Mascot Identity Threshold Mascot Homology Threshold

num

ber

of e

ntri

es

score

Mascot score = -10log10(P)P = 10-(Mascot score / 10)

Mascot score

MIT = -10log10(P)

Example:A 1% probability that the peptide spectrum match is a

random event would translate into a Mascot score of 20.

‘Empirical’‘Theoretical’

http://www.matrixscience.com/pdf/optimization.pdfhttp://www.matrixscience.com/pdf/2005WKSHP4.pdf

Example:If there are 5000 precursor matches, a 1 in a 20 chance of getting a false

positive match is a probability ofP = 1 / (20 x 5000 x n)

Probability that match is random

Wednesday, 10 June 2009

Page 9: Sensitive and accurate peptide identification with Mascot … · 2009-07-03 · PSMs, then we expect five of those matches to be incorrect. The simplest way to calculate the FDR

• Good news: convincing method

• Bad news: only available for Sequest

• Good news: command line interface with generic input and output formats, so we can extend it for use with Mascot

Wednesday, 10 June 2009

Page 10: Sensitive and accurate peptide identification with Mascot … · 2009-07-03 · PSMs, then we expect five of those matches to be incorrect. The simplest way to calculate the FDR

Figure: http://en.wikipedia.org/wiki/Support_vector_machine

Support Vector Machine

Wednesday, 10 June 2009

Page 11: Sensitive and accurate peptide identification with Mascot … · 2009-07-03 · PSMs, then we expect five of those matches to be incorrect. The simplest way to calculate the FDR

SVM Trainer

FDR

FilterTarget

Features

Initial scoring

functionClassifier

Decoy

Features

Unlabelled data

Labelled data Percolator

Käll, L., Canterbury, J. D., Weston, J., Noble, W. S., & MacCoss, M. J. (2007). Nature Methods

Scored Target/Decoy PSMs

Best scoring Target PSMs

Wednesday, 10 June 2009

Page 12: Sensitive and accurate peptide identification with Mascot … · 2009-07-03 · PSMs, then we expect five of those matches to be incorrect. The simplest way to calculate the FDR

Scored

PSMsClassifier

FDR

Filter

SVM

Target

Features

Decoy

Features

Feature

Calculator

Target

PSMs

Decoy

PSMsDecoy

Protein

Database

Target

Protein

Database

Mascot

Percolator v1.07

Initial scoring function: Mascot

score

Percolator

Mascot Auto-Decoy

function

M. Brosch, L. Yu, T. Hubbard, J. Choudhary, J Proteome Res (2009).

Mascot Percolator

MascotParser

Wednesday, 10 June 2009

Page 13: Sensitive and accurate peptide identification with Mascot … · 2009-07-03 · PSMs, then we expect five of those matches to be incorrect. The simplest way to calculate the FDR

Mascot Percolator Features

(a) Peptide Matching Scores (Mascot score, Peptide score [1])

(b) Peptide properties (mass, charge, mc, var mods)

(c) Delta mass, absolute delta mass, delta mass accounting for incorrect peak detection (13C)

(d) Fragment delta mass, absolute fragment delta mass

(e) Total intensity, matched intensity, relative matched intensity

(f) Fraction of ions matched (per ion series)

(g) Sequence coverage (per ion series)

(h) Intensity matched (per ion series)

(i) Retention time where available (see talk by Lukas Käll, WOE 3:10 pm)

[1] S. A. Beausoleil, J. Villen, S. A. Gerber, J. Rush, S. P. Gygi, Nat Biotechnol 24, 1285 (2006).

Wednesday, 10 June 2009

Page 14: Sensitive and accurate peptide identification with Mascot … · 2009-07-03 · PSMs, then we expect five of those matches to be incorrect. The simplest way to calculate the FDR

Scored

PSMsClassifier

FDR

Filter

SVM

Target

Features

Decoy

Features

Feature

Calculator

Target

PSMs

Decoy

PSMsDecoy

Protein

Database

Target

Protein

Database

Mascot

Percolator v1.07

Initial scoring function: Mascot

score

Percolator

Mascot Auto-Decoy

function

M. Brosch, L. Yu, T. Hubbard, J. Choudhary, J Proteome Res (2009).

Mascot Percolator

MascotParser

Wednesday, 10 June 2009

Page 15: Sensitive and accurate peptide identification with Mascot … · 2009-07-03 · PSMs, then we expect five of those matches to be incorrect. The simplest way to calculate the FDR

0.00

00.

010

0.02

00.

030

Mascot vs Percolator score density

Arbitrary score

Freq

uenc

y

0 20 40 60 80 100 120

MascotPercolator

Mascot score vs Percolator PEPs

0 50 100 150

050

100

150

Mascot score vs PEP

Mascot score

−10l

og10

(PEP

)1% FDR

1% F

DR

Wednesday, 10 June 2009

Page 16: Sensitive and accurate peptide identification with Mascot … · 2009-07-03 · PSMs, then we expect five of those matches to be incorrect. The simplest way to calculate the FDR

0.00 0.01 0.02 0.03 0.04 0.05

050

0010

000

1500

0

Comparison of MIT, MHT and MP (v1.08)

q−value

Num

ber o

f est

imat

ed c

orre

ct P

SMs

MPMHTMIT

M. Brosch, L. Yu, T. Hubbard, J. Choudhary, J Proteome Res (2009).

ROC comparison

Think FDR

Wednesday, 10 June 2009

Page 17: Sensitive and accurate peptide identification with Mascot … · 2009-07-03 · PSMs, then we expect five of those matches to be incorrect. The simplest way to calculate the FDR

Mascot vs Sequest Percolator

0.00 0.01 0.02 0.03 0.04 0.05

020

0040

0060

0080

0010

000

1200

014

000

Sequest Percolator vs Mascot Percolator

q−value

Num

ber o

f est

imat

ed c

orre

ct P

SMs

Mascot PercolatorSequest Percolator

Wednesday, 10 June 2009

Page 18: Sensitive and accurate peptide identification with Mascot … · 2009-07-03 · PSMs, then we expect five of those matches to be incorrect. The simplest way to calculate the FDR

0.00 0.02 0.04 0.06 0.08 0.10

050

0010

000

1500

020

000

2500

0

Search log ID: 12782!12783

q!value

num

ber o

f est

imat

ed c

orre

ct P

SM

s

MP

MHT

MIT

MScore

#ids @ 1% q!value: MP= 24304 , MHT= 21815 , MIT= 20169 , MScore= 20388

All of Peptide Atlas Mouse

MHT MIT MScore

510

2050

100

200

500

% in

crea

se o

f PSM

s wi

th M

P ov

er M

HT, M

IT, M

Scor

e at

1%

q−v

alue

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

Wednesday, 10 June 2009

Page 19: Sensitive and accurate peptide identification with Mascot … · 2009-07-03 · PSMs, then we expect five of those matches to be incorrect. The simplest way to calculate the FDR

Evaluate q-value accuracy

Standard errornegligible

0.00 0.05 0.10 0.15 0.20

0.00

0.05

0.10

0.15

0.20

FDR evaluation (MP v1.08 + P v1.09)

estimated q−value

actu

al F

DR

LTQLTQ−FT 200 ppmLTQ−FT 50 ppmLTQ−FT 20 ppmLTQ−FT Ultra

Wednesday, 10 June 2009

Page 20: Sensitive and accurate peptide identification with Mascot … · 2009-07-03 · PSMs, then we expect five of those matches to be incorrect. The simplest way to calculate the FDR

Mascot Percolator software package

java -cp MascotPercolator.jar cli.MascotPercolator

-target 11083

-decoy 11084

-out 11083-11084

-rankdelta 1

-newDat

• Simple command line interface:

• Tested with Mascot 2.2; limited experience with versions <= 2.1

Mascot log IDs

• Runtime: 10-150 spectra per second• Memory requirements typically 1-2 GB for up to 100k spectra.

Largest Mascot search processed was 350k spectra which required 6GB of memory.

Wednesday, 10 June 2009

Page 21: Sensitive and accurate peptide identification with Mascot … · 2009-07-03 · PSMs, then we expect five of those matches to be incorrect. The simplest way to calculate the FDR

Score:-10log10(PEP)

FDR

p = PEP = 0.05; MIT 13p = PEP = 0.01; MIT 20

Expect = PEP

Warning

Wednesday, 10 June 2009

Page 22: Sensitive and accurate peptide identification with Mascot … · 2009-07-03 · PSMs, then we expect five of those matches to be incorrect. The simplest way to calculate the FDR

Registered nodes

file info, parameters info,

statistics

DB server

Status page

MP server Node 1Node 2Node 3Node n

Sumit JobSubmit Job

Distributing Mascot Percolator Jobs

Wednesday, 10 June 2009

Page 23: Sensitive and accurate peptide identification with Mascot … · 2009-07-03 · PSMs, then we expect five of those matches to be incorrect. The simplest way to calculate the FDR

• Lukas Käll (Stockholm University) WOE 3:10 pm

• Erik Deutsch (Institute of Systems Biology)

• Jyoti Choudhary & Tim Hubbard (Sanger)

• Members of team 17 and 119, in particular Lu Yu (Sanger)

• Matrix Science

• Wellcome Trust for funding

Acknowledgements

Wednesday, 10 June 2009

Page 24: Sensitive and accurate peptide identification with Mascot … · 2009-07-03 · PSMs, then we expect five of those matches to be incorrect. The simplest way to calculate the FDR

http://www.sanger.ac.uk/Software/analysis/MascotPercolator/

http://tinyurl.com/mascotpercolator/

Mascot Percolator Website:

Lukas Käll - WOE 3:10 pm

Markus Brosch - MPB 063 - [email protected]

Wednesday, 10 June 2009