Sensitive and accurate peptide identification with
Mascot Percolator
Markus Brosch
Wednesday, 10 June 2009
L. Käll, J. D. Storey, M. J. MacCoss, W. S. Noble, J Proteome Res 7, 29 (2008).
L. Käll, J. D. Storey, M. J. MacCoss, W. S. Noble, J Proteome Res 7, 40 (2008).
M. Brosch, J. Choudhary, in Scoring and validation of tandem MS peptide identification methods, Eds. (Humana Press, 2009).
Terminology: FPR, FDR, PEP
[Figure: Density estimate of data. Frequency vs. Mascot score for total PSMs, incorrect PSMs, and correct PSMs.]
Terminology: FDR & PEP
[Figure: Density estimate of data. Frequency vs. score for total PSMs, incorrect PSMs, and correct PSMs, with areas labeled A, B, B' and curve heights labeled a, b.]
Labels (as implied by the formulas): A = all PSMs above the score threshold; B = incorrect PSMs above the threshold; B' = incorrect PSMs below the threshold; a = height of the total PSM curve at a given score; b = height of the incorrect PSM curve at the same score.
$\mathrm{FPR} = B / (B' + B)$
$\mathrm{FDR} = B / A = \frac{1}{A} \sum_{i=1}^{A} \mathrm{PEP}_i$
$\mathrm{PEP} = b / a$
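The relation between FDR and PEP on this slide (the FDR of the accepted PSMs equals the average of their PEPs) can be checked numerically. The following Python sketch uses made-up PEP values, which are an assumption for illustration only and not data from the talk; the function name `fdr_from_peps` is likewise hypothetical.

```python
def fdr_from_peps(peps):
    """Estimate the FDR of a set of accepted PSMs as the mean of their PEPs."""
    if not peps:
        raise ValueError("need at least one accepted PSM")
    return sum(peps) / len(peps)


# Hypothetical PEPs of five accepted PSMs (illustrative values only).
accepted_peps = [0.001, 0.004, 0.01, 0.05, 0.2]

expected_incorrect = sum(accepted_peps)  # expected number of incorrect PSMs (B)
fdr = fdr_from_peps(accepted_peps)       # B / A with A = 5 accepted PSMs

print(f"expected incorrect PSMs: {expected_incorrect:.3f}")
print(f"FDR of the accepted set: {fdr:.3f}")
```

Note how a single PSM with a high PEP (0.2 here) dominates the FDR of the whole set, which is why the PEP describes an individual PSM while the FDR describes a collection.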
Terminology: FDR vs q-value
J. D. Storey, R. Tibshirani, Proc Natl Acad Sci U S A 100, 9440 (2003).
Figure: L. Käll, J. D. Storey, M. J. MacCoss, W. S. Noble, J Proteome Res 7, 29 (2008).
(i.e., many of the PSMs are correct), the accepted method for multiple testing correction is to estimate the false discovery rate (FDR) [10,11]. Storey and Tibshirani [12] provide a description of FDR methods that is accessible to nonstatisticians and that includes more recent developments. In our case, the FDR associated with a particular score threshold is defined as the expected percentage of accepted PSMs that are incorrect, where an “accepted PSM” is one that scores above the threshold. (Many proteomics papers incorrectly refer to this quantity as the “false positive rate.” However, other scientific fields define the false positive rate as the fraction of true null tests that are called significant [13–17], whereas the false discovery rate is defined as the fraction of true null tests among all of those that are called significant.) For example, at an FDR of 1%, if we accept 500 PSMs, then we expect five of those matches to be incorrect.
The simplest way to calculate the FDR is analogous to the calculation of p-values, above. For a given score threshold, we count the number of decoy PSMs above the threshold and the number of target PSMs above the threshold. We can now estimate the FDR by simply computing the ratio of these two values. For example, at a score threshold of 3.0, we observe 3849 accepted target PSMs and 219 accepted decoy PSMs, yielding an estimated FDR of 5.7%. Figure 4 plots the number of accepted PSMs as a function of the estimated FDR, and the series labeled “Simple FDR” was computed using the ratio of accepted decoys versus accepted targets.
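As a minimal sketch of the decoy/target ratio described in this paragraph, the Python snippet below reproduces the quoted numbers (219 accepted decoys, 3849 accepted targets). The function names are hypothetical and chosen for illustration; only the counts come from the excerpt.

```python
def count_above(scores, threshold):
    """Number of PSM scores strictly above the score threshold."""
    return sum(1 for s in scores if s > threshold)


def simple_fdr(n_decoy_accepted, n_target_accepted):
    """Simple FDR estimate: accepted decoy PSMs / accepted target PSMs."""
    if n_target_accepted <= 0:
        raise ValueError("no accepted target PSMs")
    return n_decoy_accepted / n_target_accepted


# Counts quoted in the excerpt for a score threshold of 3.0.
fdr = simple_fdr(219, 3849)
print(f"estimated FDR: {fdr:.1%}")  # estimated FDR: 5.7%
```

With real data, `count_above` would be applied separately to the target and decoy score lists at the same threshold to obtain the two counts.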
Estimating the Percentage of Incorrect Target PSMs
A slightly more sophisticated method for calculating the FDR takes into account the observation that, whereas all decoy PSMs are incorrect by construction, not all target PSMs are correct. Ideally, the presence of these incorrect target PSMs should be factored into the FDR calculation. For example, suppose that among 10 000 target PSMs, 8000 are incorrect and 2000 are correct. We would like to know the 8000 quantity so that we can adjust our FDR estimates.
Figure 2 shows that the distributions of scores assigned to target and decoy PSMs are similar, except that the target PSM score distribution has a heavier tail to the right. This tail arises because the set of target PSMs is comprised of a mixture of correct and incorrect PSMs. Figure 5 shows simulated distributions that illustrate the underlying phenomenon. For this simulation, we assume that our PSM score function follows a normal distribution, and we set the standard deviation to 0.7 (The assumption of normality is for the purposes of illustration only; the methods we describe here do not require any particular form of distribution, nor do we assume that XCorr is normally distributed). For incorrect PSMs, we set the mean of the distribution to 1.0, and for correct PSMs, we change the mean to 3.0. Our simulated data set contains 10 000 decoy PSMs, 8000 incorrect target PSMs, and 2000 correct target PSMs. The figure shows the resulting decoy score distribution (black line), the target score distribution (blue line), and its two component distributions (dotted and dashed blue lines). In this simulated data set, the percentage of incorrect targets (PIT) is 80%. This PIT is equivalent to the ratio of the area under the dotted blue line (the incorrect target PSMs) to the area under the solid black line (the decoy PSMs).
The PIT is important because it allows us to reduce the estimated FDR associated with a given set of accepted target PSMs. In our simulation, if we accept X decoy PSMs with scores above a certain threshold, then we expect to find 0.8X incorrect target PSMs above the same threshold. A more accurate estimate of the FDR, therefore, is to multiply the previous estimate, the …
Figure 4. Mapping from the number of identified PSMs to the estimated false discovery rate. (A) The figure plots the number of PSMs above the threshold as a function of the estimated false discovery rate. Two different methods for computing the FDR are plotted, with and without an estimate of the percentage of incorrect target PSMs (PIT). The vertical line corresponds to an XCorr of 3.0. (B) A zoomed-in version of panel A, with the estimated FDR shown as a dotted line and the q-value shown as a solid line.
Figure 5. Simulated target and decoy PSM score distributions.
Source: “Assigning Significance to Peptides” (Perspectives), Journal of Proteome Research.
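Following the excerpt's reasoning that X accepted decoy PSMs imply roughly PIT × X incorrect target PSMs above the same threshold, a PIT-corrected FDR estimate can be sketched in Python as below. The PIT of 0.8 is taken from the simulated data set (8000 incorrect targets, 10 000 decoys); applying it to the counts 219 and 3849 from the score-threshold example is an assumption made purely for illustration, since those counts come from a different data set. The function names are hypothetical.

```python
def percentage_incorrect_targets(n_incorrect_targets, n_decoys):
    """PIT as the ratio of incorrect target PSMs to decoy PSMs."""
    if n_decoys <= 0:
        raise ValueError("need at least one decoy PSM")
    return n_incorrect_targets / n_decoys


def pit_adjusted_fdr(n_decoy_accepted, n_target_accepted, pit):
    """Scale the simple decoy/target FDR estimate by the PIT."""
    if not 0.0 <= pit <= 1.0:
        raise ValueError("PIT must lie between 0 and 1")
    if n_target_accepted <= 0:
        raise ValueError("no accepted target PSMs")
    return pit * n_decoy_accepted / n_target_accepted


# PIT of the simulated data set described in the excerpt.
pit = percentage_incorrect_targets(8000, 10000)

# Illustrative only: counts from the XCorr 3.0 example combined with that PIT.
simple = 219 / 3849
adjusted = pit_adjusted_fdr(219, 3849, pit)
print(f"PIT: {pit:.0%}, simple FDR: {simple:.1%}, PIT-adjusted FDR: {adjusted:.1%}")
```

In practice the PIT is unknown and has to be estimated from the data; the sketch only shows how an estimate, once available, lowers the FDR figure.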
[Figure annotations: decreasing score threshold; new TPs; new FPs.]
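As the score threshold decreases, each step admits either new true positives or new false positives, so the estimated FDR can move up and down. The q-value of a PSM is commonly defined (following Storey and Tibshirani) as the minimum FDR over all thresholds at which that PSM is still accepted, which makes it monotone in the score. The Python sketch below illustrates this with the simple decoy/target ratio; the toy scores and labels are invented for illustration, ties between scores are not treated specially, and the function name `q_values` is hypothetical.

```python
def q_values(scores, is_decoy):
    """Return one q-value per PSM, in input order.

    The FDR at each rank is estimated as accepted decoys / accepted targets
    (capped at 1); the q-value is the minimum FDR at this or any lower
    score threshold.
    """
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    fdrs = []
    targets = decoys = 0
    for i in order:  # walk down the ranked list, lowering the threshold
        if is_decoy[i]:
            decoys += 1
        else:
            targets += 1
        fdrs.append(min(1.0, decoys / max(targets, 1)))

    q = [0.0] * len(scores)
    running_min = float("inf")
    for rank in range(len(order) - 1, -1, -1):  # from lowest score upwards
        running_min = min(running_min, fdrs[rank])
        q[order[rank]] = running_min
    return q


# Toy data (invented): six PSMs, two of them decoys.
toy_scores = [10, 9, 8, 7, 6, 5]
toy_is_decoy = [False, False, True, False, False, True]
print(q_values(toy_scores, toy_is_decoy))
# [0.0, 0.0, 0.25, 0.25, 0.25, 0.5]
```

In the toy data the raw FDR estimate rises to 0.5 at the first decoy and then falls to 0.25 as further targets are accepted; the q-value keeps only the lower figure.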
Target / Decoy database searching
[Figure: Mascot score distributions. Frequency (density) vs. Mascot score for Target, Decoy, and Target − Decoy, with a score cutoff marked and regions labeled correct PSMs and incorrect PSMs.]
Target/Decoy concept: R. E. Moore, M. K. Young, T. D. Lee, J Am Soc Mass Spectrom 13, 378 (2002).
Target / Decoy database searching
[Figure: Mascot score distributions. Frequency (density) vs. Mascot score for Target, Decoy, Decoy x pi0, Target − Decoy, and Target − Decoy x pi0, with a score cutoff marked.]
L. Käll, J. D. Storey, M. J. MacCoss, W. S. Noble, J Proteome Res 7, 29 (2008).
Accurate q-values (FDR) and PEPs
[Figure: q-value and PEP (0.0 to 1.0) as a function of Mascot score (0 to 50).]
qvality software package: L. Käll, J. D. Storey, W. S. Noble, Bioinformatics 24, i42 (2008).
Think FDR!
MIT: Mascot Identity Threshold
MHT: Mascot Homology Threshold
[Figure: number of entries vs. score, with annotations 'Empirical' and 'Theoretical'.]
Mascot score = $-10\log_{10}(P)$, i.e. $P = 10^{-(\text{Mascot score}/10)}$, where P is the probability that the match is random.
Example: A 1% probability that the peptide spectrum match is a random event would translate into a Mascot score of 20.
MIT = $-10\log_{10}(P)$
Example: If there are 5000 precursor matches, a 1 in 20 chance of getting a false positive match is a probability of P = 1 / (20 x 5000 x n).
http://www.matrixscience.com/pdf/optimization.pdf
http://www.matrixscience.com/pdf/2005WKSHP4.pdf
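The score formula on this slide, Mascot score = -10log10(P), is easy to check numerically. The short Python sketch below converts between score and probability and reproduces the slide's example (P = 1% gives a score of 20). The function names are hypothetical, and the identity-threshold example with the factor n is not modeled because n is not defined on the slide.

```python
import math


def mascot_score(p):
    """Convert the probability that a match is random into a Mascot score."""
    if not 0.0 < p <= 1.0:
        raise ValueError("p must be in (0, 1]")
    return -10.0 * math.log10(p)


def match_probability(score):
    """Inverse conversion: P = 10^-(score / 10)."""
    return 10.0 ** (-score / 10.0)


print(f"P = 0.01 -> score {mascot_score(0.01):.1f}")    # P = 0.01 -> score 20.0
print(f"P = 0.05 -> score {mascot_score(0.05):.1f}")    # P = 0.05 -> score 13.0
print(f"score 20 -> P = {match_probability(20):.3f}")   # score 20 -> P = 0.010
```

Every 10 score units correspond to one order of magnitude in P, so a score of 30 means a random-match probability of 0.001.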
• Good news: convincing method
• Bad news: only available for Sequest
• Good news: command line interface with generic input and output formats, so we can extend it for use with Mascot
Figure: http://en.wikipedia.org/wiki/Support_vector_machine
Support Vector Machine
Percolator
[Diagram: Percolator workflow. Elements: Scored Target/Decoy PSMs; Target Features and Decoy Features; Initial scoring function; FDR Filter; Best scoring Target PSMs; Labelled data; Unlabelled data; SVM Trainer; Classifier.]
Käll, L., Canterbury, J. D., Weston, J., Noble, W. S., & MacCoss, M. J. (2007). Nature Methods
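The loop in the diagram (rank PSMs with an initial score, keep the best-scoring targets that pass an FDR filter as positive examples, use decoys as negative examples, train a classifier, re-score everything, repeat) can be sketched in plain Python. This is a simplified illustration and not Percolator's implementation: a small logistic regression trained by gradient descent stands in for the SVM, there is no cross-validation, the data are synthetic, and all function names and parameter values are assumptions.

```python
import math
import random


def accepted_targets(scores, is_decoy, max_fdr):
    """Indices of targets in the largest top-ranked set whose
    decoy/target ratio (simple FDR estimate) is at most max_fdr."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    targets, decoys, seen, best_n = 0, 0, [], 0
    for i in order:
        if is_decoy[i]:
            decoys += 1
        else:
            targets += 1
            seen.append(i)
        if targets and decoys / targets <= max_fdr:
            best_n = len(seen)
    return seen[:best_n]


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def train_linear(rows, labels, epochs=200, rate=0.5):
    """Logistic regression by batch gradient descent (stand-in for the SVM)."""
    n_feat = len(rows[0])
    w, b = [0.0] * n_feat, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * n_feat, 0.0
        for x, y in zip(rows, labels):
            err = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b) - y
            for j in range(n_feat):
                grad_w[j] += err * x[j]
            grad_b += err
        w = [wi - rate * g / len(rows) for wi, g in zip(w, grad_w)]
        b -= rate * grad_b / len(rows)
    return w, b


def percolate(features, is_decoy, initial_scores, rounds=3, max_fdr=0.05):
    """Iteratively re-score all PSMs with a classifier trained on
    confident targets (positives) versus decoys (negatives)."""
    scores = list(initial_scores)
    negatives = [i for i, d in enumerate(is_decoy) if d]
    for _ in range(rounds):
        positives = accepted_targets(scores, is_decoy, max_fdr)
        if not positives or not negatives:
            break  # nothing to learn from; keep the current scores
        rows = [features[i] for i in positives + negatives]
        labels = [1.0] * len(positives) + [0.0] * len(negatives)
        w, b = train_linear(rows, labels)
        scores = [sum(wi * xi for wi, xi in zip(w, x)) + b for x in features]
    return scores


# Synthetic PSMs with two features; feature 0 doubles as the initial score.
random.seed(0)
features, is_decoy = [], []
for mean, decoy, n in ((2.0, False, 100), (0.0, False, 100), (0.0, True, 200)):
    for _ in range(n):
        features.append([random.gauss(mean, 1.0), random.gauss(mean, 1.0)])
        is_decoy.append(decoy)

initial = [x[0] for x in features]
final = percolate(features, is_decoy, initial)
print("targets accepted before:", len(accepted_targets(initial, is_decoy, 0.05)))
print("targets accepted after: ", len(accepted_targets(final, is_decoy, 0.05)))
```

The point of the design is that the decoys supply reliable negative labels for free, so the classifier can learn how to combine several features into a better score without any manually curated training set.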
Mascot Percolator
[Diagram: Mascot Percolator workflow. Elements: Target Protein Database and Decoy Protein Database (Mascot Auto-Decoy function); Mascot; MascotParser; Target PSMs and Decoy PSMs; Feature Calculator; Target Features and Decoy Features; Percolator v1.07 (SVM, FDR Filter, Classifier; initial scoring function: Mascot score); Scored PSMs.]
M. Brosch, L. Yu, T. Hubbard, J. Choudhary, J Proteome Res (2009).
Mascot Percolator Features
(a) Peptide Matching Scores (Mascot score, Peptide score [1])
(b) Peptide properties (mass, charge, missed cleavages, variable modifications)
(c) Delta mass, absolute delta mass, delta mass accounting for incorrect peak detection (13C)
(d) Fragment delta mass, absolute fragment delta mass
(e) Total intensity, matched intensity, relative matched intensity
(f) Fraction of ions matched (per ion series)
(g) Sequence coverage (per ion series)
(h) Intensity matched (per ion series)
(i) Retention time where available (see talk by Lukas Käll, WOE 3:10 pm)
[1] S. A. Beausoleil, J. Villen, S. A. Gerber, J. Rush, S. P. Gygi, Nat Biotechnol 24, 1285 (2006).
Mascot vs Percolator score density
[Figure: frequency vs. arbitrary score (0 to 120) for Mascot and Percolator.]
Mascot score vs Percolator PEPs
[Figure: Mascot score vs PEP. −10log10(PEP) vs. Mascot score (both axes 0 to 150), with the 1% FDR thresholds marked.]
ROC comparison
[Figure: Comparison of MIT, MHT and MP (v1.08). Number of estimated correct PSMs (0 to 15000) vs. q-value (0.00 to 0.05) for MP, MHT, and MIT.]
M. Brosch, L. Yu, T. Hubbard, J. Choudhary, J Proteome Res (2009).
Think FDR
Mascot vs Sequest Percolator
[Figure: Sequest Percolator vs Mascot Percolator. Number of estimated correct PSMs (0 to 14000) vs. q-value (0.00 to 0.05) for Mascot Percolator and Sequest Percolator.]
[Figure: Search log ID: 12782−12783. Number of estimated correct PSMs (0 to 25000) vs. q-value (0.00 to 0.10) for MP, MHT, MIT, and MScore.]
#ids @ 1% q-value: MP = 24304, MHT = 21815, MIT = 20169, MScore = 20388
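The identification counts at 1% q-value quoted on this slide translate into relative gains for Mascot Percolator (MP) as follows. This is a small Python sketch of the arithmetic; only the four counts are taken from the slide, and the function name is hypothetical.

```python
# Identifications at 1% q-value, as quoted on the slide.
ids_at_1pct = {"MP": 24304, "MHT": 21815, "MIT": 20169, "MScore": 20388}


def percent_increase(new, old):
    """Relative gain of `new` over `old`, in percent."""
    return 100.0 * (new - old) / old


for method in ("MHT", "MIT", "MScore"):
    gain = percent_increase(ids_at_1pct["MP"], ids_at_1pct[method])
    print(f"MP vs {method}: +{gain:.1f}%")
# MP vs MHT: +11.4%
# MP vs MIT: +20.5%
# MP vs MScore: +19.2%
```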
All of Peptide Atlas Mouse
[Figure: % increase of PSMs with MP over MHT, MIT, and MScore at 1% q-value, one distribution per method (log-scale axis, ticks from 5 to 500).]
Evaluate q-value accuracy
[Figure: FDR evaluation (MP v1.08 + P v1.09). Actual FDR vs. estimated q-value (both axes 0.00 to 0.20) for LTQ, LTQ-FT 200 ppm, LTQ-FT 50 ppm, LTQ-FT 20 ppm, and LTQ-FT Ultra. Annotation: standard error negligible.]
Mascot Percolator software package
• Simple command line interface:
    java -cp MascotPercolator.jar cli.MascotPercolator \
      -target 11083 \
      -decoy 11084 \
      -out 11083-11084 \
      -rankdelta 1 \
      -newDat
  (11083 and 11084 are Mascot log IDs.)
• Tested with Mascot 2.2; limited experience with versions <= 2.1
• Runtime: 10-150 spectra per second
• Memory requirements: typically 1-2 GB for up to 100k spectra. The largest Mascot search processed was 350k spectra, which required 6 GB of memory.
Score: -10log10(PEP)
FDR
p = PEP = 0.05; MIT 13
p = PEP = 0.01; MIT 20
Expect = PEP
Warning
Distributing Mascot Percolator Jobs
[Diagram: elements are Submit Job; MP server; registered nodes (Node 1, Node 2, Node 3, ..., Node n); DB server (file info, parameters info, statistics); status page.]
Acknowledgements
• Lukas Käll (Stockholm University) WOE 3:10 pm
• Eric Deutsch (Institute for Systems Biology)
• Jyoti Choudhary & Tim Hubbard (Sanger)
• Members of team 17 and 119, in particular Lu Yu (Sanger)
• Matrix Science
• Wellcome Trust for funding
Mascot Percolator Website:
http://www.sanger.ac.uk/Software/analysis/MascotPercolator/
http://tinyurl.com/mascotpercolator/
Lukas Käll - WOE 3:10 pm
Markus Brosch - MPB 063 - [email protected]