DESCRIPTION
This is the third presentation of the BITS training on 'Mass spec data processing'. It reviews methods for matching mass spectrometry data with protein sequences, together with a review of useful tools. Thanks to the Compomics Lab of the VIB for their contribution.
http://www.bits.vib.be/training
BITS MS Data Processing – Search Engines UGent, Gent, Belgium – 19 September 2011
Lennart Martens lennart.martens@UGent.be
Lennart MARTENS lennart.martens@ebi.ac.uk
Proteomics Services Group European Bioinformatics Institute
Hinxton, Cambridge United Kingdom www.ebi.ac.uk
search engines
lennart martens
lennart.martens@ugent.be
Computational Omics and Systems Biology Group
Department of Medical Protein Research, VIB Department of Biochemistry, Ghent University
Ghent, Belgium
THREE TYPICAL PRE-PROCESSING STEPS
Global thresholding
Local thresholding
precursor
precursor
Noise thresholding
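The slide only names the two thresholding modes, so the following is a minimal sketch under assumptions: global thresholding is taken to mean "drop peaks below a fraction of the base peak", and local thresholding to mean "keep the most intense peaks per m/z window". The function names, the `fraction`, `window` and `top_n` parameters and their defaults are illustrative, not taken from any particular tool.

```python
def global_threshold(peaks, fraction=0.05):
    """Keep peaks with intensity >= fraction of the most intense (base) peak.

    peaks: list of (mz, intensity) tuples.
    """
    if not peaks:
        return []
    base = max(intensity for _, intensity in peaks)
    return [(mz, i) for mz, i in peaks if i >= fraction * base]


def local_threshold(peaks, window=100.0, top_n=2):
    """Keep the top_n most intense peaks inside each m/z window."""
    bins = {}
    for mz, intensity in peaks:
        bins.setdefault(int(mz // window), []).append((mz, intensity))
    kept = []
    for key in sorted(bins):
        best = sorted(bins[key], key=lambda p: p[1], reverse=True)[:top_n]
        kept.extend(sorted(best))  # restore m/z order inside the window
    return kept
```

A local threshold keeps weak but informative peaks in sparse regions of the spectrum, which a single global cut-off would discard.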
From: http://www.purdue.edu/dp/bioscience/images/spectrum.jpg
Charge deconvolution (peptides)
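As a rough illustration of peptide charge deconvolution: the charge can be estimated from the spacing between neighbouring isotope peaks (about 1.003 Da divided by the charge), and the neutral mass then follows from the m/z value. This is a sketch that assumes protonated ions in positive mode; the constants are standard values and the function names are illustrative.

```python
PROTON_MASS = 1.007276      # Da
ISOTOPE_SPACING = 1.003355  # Da, mass difference between 13C and 12C


def charge_from_isotope_spacing(mz_first, mz_second):
    """Estimate the charge from two adjacent isotope peaks of one ion."""
    return round(ISOTOPE_SPACING / abs(mz_second - mz_first))


def neutral_mass(mz, charge):
    """Neutral monoisotopic mass of an ion observed as [M + z H]z+."""
    return charge * (mz - PROTON_MASS)
```

For example, isotope peaks at m/z 500.0 and 500.5 are 0.5 apart, which points to charge 2 and a neutral mass close to 998 Da.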
From: Gill et al, EMBO Journal, 2000
Charge deconvolution (proteins)
Monoisotopic mass Average mass
Centroiding (peak picking)
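Centroiding replaces each profile-mode peak by a single (m/z, intensity) pair. The sketch below assumes the simplest possible scheme: consecutive data points above a noise floor form one peak, whose centroid is the intensity-weighted mean m/z. Real peak pickers are more elaborate; the `floor` parameter and function names are illustrative.

```python
def _collapse(points):
    """Intensity-weighted mean m/z and summed intensity of one peak."""
    total = sum(intensity for _, intensity in points)
    centroid_mz = sum(mz * intensity for mz, intensity in points) / total
    return centroid_mz, total


def centroid(profile, floor=0.0):
    """Convert profile data into centroided peaks.

    profile: list of (mz, intensity) tuples sorted by m/z.
    Consecutive points with intensity > floor are merged into one peak.
    """
    peaks, group = [], []
    for mz, intensity in profile:
        if intensity > floor:
            group.append((mz, intensity))
        elif group:
            peaks.append(_collapse(group))
            group = []
    if group:
        peaks.append(_collapse(group))
    return peaks
```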
From: Last et al, Nature Rev. Mol. Cell Bio., 2007
A total ion current chromatogram, corrected by typical pre-processing steps.
Combined results
Figure: bar chart of file size (MB, axis 0 to 60) by data type (RAW, RAW GZIPped, peak lists, peak lists GZIPped) for two instruments, Q-TOF I and Esquire HCT. Bar labels: 51.4, 25.8, 0.7, 0.3 and 24.5, 23.7, 0.2, 0.1.
See: Martens et al., Proteomics, 2005
Data size reduction
MS/MS IDENTIFICATION
PEPTIDE FRAGMENTATION FINGERPRINTING
Peptide: L E N N A R T
N-terminal fragments: L, LE, LEN, LENN, LENNA, LENNAR, LENNART
C-terminal fragments: T, RT, ART, NART, NNART, ENNART, LENNART
Spectrum axes: m/z (horizontal), intensity (vertical)
Peptide sequences and MS/MS spectra
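The fragment ladders for the example peptide LENNART can be computed as follows. This is a sketch assuming singly charged b-type (N-terminal) and y-type (C-terminal) ions and standard monoisotopic residue masses; it lists every prefix and suffix up to the full sequence to mirror the slide, although the full-length fragments correspond to the intact peptide.

```python
# Monoisotopic residue masses (Da) for the residues in LENNART
MONO = {"L": 113.08406, "E": 129.04259, "N": 114.04293,
        "A": 71.03711, "R": 156.10111, "T": 101.04768}
PROTON = 1.007276   # Da
WATER = 18.010565   # Da


def fragment_ladders(peptide):
    """Return (b_ladder, y_ladder) as lists of (fragment, m/z at charge 1+)."""
    b_ladder, y_ladder = [], []
    for k in range(1, len(peptide) + 1):
        prefix, suffix = peptide[:k], peptide[-k:]
        b_ladder.append((prefix, sum(MONO[a] for a in prefix) + PROTON))
        y_ladder.append((suffix, sum(MONO[a] for a in suffix) + WATER + PROTON))
    return b_ladder, y_ladder


if __name__ == "__main__":
    b, y = fragment_ladders("LENNART")
    for (bf, bmz), (yf, ymz) in zip(b, y):
        print(f"{bf:<8}{bmz:10.4f}    {yf:<8}{ymz:10.4f}")
```

The mass difference between consecutive members of either ladder equals one residue mass, which is what makes the sequence readable from the spectrum.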
protein sequence database
→ in silico digest → peptide sequences: YSFVATAER, HETSINGK, MILQEESTVYYR, SEFASTPINK, …
→ in silico MS/MS → theoretical MS/MS spectra (intensity versus m/z)
→ in silico matching against the experimental MS/MS spectrum
→ peptide scores:
1) YSFVATAER 34
2) YSFVSAIR 12
3) FFLIGGGGK 12
Peptide fragment fingerprinting (PFF)
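The PFF workflow (digest, theoretical spectra, matching, scoring) can be sketched end to end in a few lines. The assumptions are substantial: trypsin is modelled as cleaving after K or R unless followed by P, only singly charged b and y ions are generated, and the score is a plain shared peak count, which is a stand-in and not the scoring function of any of the search engines discussed here. Precursor mass filtering, which real engines apply first, is omitted.

```python
MONO = {"G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
        "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
        "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
        "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
        "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931}
PROTON, WATER = 1.007276, 18.010565


def tryptic_digest(protein):
    """Cleave after K or R, except when the next residue is P."""
    peptides, start = [], 0
    for i, aa in enumerate(protein):
        is_last = i == len(protein) - 1
        if is_last or (aa in "KR" and protein[i + 1] != "P"):
            peptides.append(protein[start:i + 1])
            start = i + 1
    return peptides


def theoretical_fragments(peptide):
    """Sorted m/z values of singly charged b and y ions."""
    masses = [MONO[a] for a in peptide]
    fragments = []
    for k in range(1, len(peptide)):
        fragments.append(sum(masses[:k]) + PROTON)          # b ion
        fragments.append(sum(masses[k:]) + WATER + PROTON)  # y ion
    return sorted(fragments)


def shared_peak_count(experimental_mz, peptide, tol=0.5):
    """Number of experimental peaks within tol of a theoretical fragment."""
    theoretical = theoretical_fragments(peptide)
    return sum(any(abs(mz - t) <= tol for t in theoretical)
               for mz in experimental_mz)


def rank_candidates(experimental_mz, peptides, tol=0.5):
    """Return (score, peptide) pairs, best score first."""
    scored = [(shared_peak_count(experimental_mz, p, tol), p) for p in peptides]
    return sorted(scored, reverse=True)
```

Feeding the function a spectrum simulated from YSFVATAER and the digest of a toy protein containing the peptides on the slide puts YSFVATAER at rank 1, as in the example score list.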
Spectral comparison: database sequence → theoretical spectrum, compared with the experimental spectrum
Sequential comparison: experimental spectrum → de novo sequence, compared with the database sequence
Threading comparison: database sequence threaded onto the experimental spectrum
From: Eidhammer, Flikka, Martens, Mikalsen – Wiley 2007
Three types of PFF identification
• MASCOT (Matrix Science) http://www.matrixscience.com
• SEQUEST (Scripps, Thermo Fisher Scientific) http://fields.scripps.edu/sequest
• X!Tandem (The Global Proteome Machine Organization) http://www.thegpm.org/TANDEM
• OMSSA (NCBI) http://pubchem.ncbi.nlm.nih.gov/omssa/
The most popular algorithms
Incorrect identifications
Correct identifications
False positives False negatives
Threshold score
Adapted from: www.proteomesoftware.com – Wiki pages
Overall concept of scores and cut-offs
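The trade-off behind a threshold score can be made concrete with a toy example: every identification at or above the threshold is accepted, so incorrect ones above it become false positives and correct ones below it become false negatives. The sketch assumes the ground truth is known, which in practice it is not; the data and function name are illustrative.

```python
def outcomes_at_threshold(scored, threshold):
    """Count outcomes for a list of (score, is_correct) identifications."""
    counts = {"true_pos": 0, "false_pos": 0, "false_neg": 0, "true_neg": 0}
    for score, is_correct in scored:
        accepted = score >= threshold
        if accepted and is_correct:
            counts["true_pos"] += 1
        elif accepted and not is_correct:
            counts["false_pos"] += 1
        elif not accepted and is_correct:
            counts["false_neg"] += 1
        else:
            counts["true_neg"] += 1
    return counts
```

Raising the threshold removes false positives at the price of more false negatives, and lowering it does the opposite.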
Figure: false positives (axis 0% to 6%) and identifications (axis 0% to 100%) at cut-offs p=0.05, p=0.01, p=0.005 and p=0.0005, ordered towards higher stringency.
Playing with probabilistic cut-off scores
• Very well established search engine
• Can be used for MS/MS (PFF) identifications
• Based on a cross-correlation score (includes peak height)
• Published core algorithm (patented, licensed to Thermo), Eng, JASMS 1994
• Provides preliminary (Sp) score, rank, cross-correlation score (XCorr),
and score difference between the top two ranks (deltaCn, ∆Cn)
• Thresholding is up to the user, and is commonly done per charge state
• Many extensions exist to perform a more automatic validation of results
SEQUEST
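$$\mathrm{XCorr} = R_0 - \frac{1}{151} \sum_{i=-75}^{+75} R_i, \qquad R_i = \sum_{j=1}^{n} x_j \cdot y_{(j+i)}$$
$$\mathrm{deltaCn} = \frac{\mathrm{XCorr}_1 - \mathrm{XCorr}_2}{\mathrm{XCorr}_1}$$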
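A direct, unoptimised transcription of the XCorr and deltaCn formulas, as a sketch: x and y are assumed to be the binned experimental and theoretical spectra as equal-length intensity lists, and bins shifted outside the spectrum are assumed to contribute zero. SEQUEST's spectrum pre-processing and its fast implementation are not reproduced here.

```python
def correlation(x, y, shift):
    """R_i = sum over j of x[j] * y[j + i]; out-of-range bins count as zero."""
    total = 0.0
    for j in range(len(x)):
        k = j + shift
        if 0 <= k < len(y):
            total += x[j] * y[k]
    return total


def xcorr(x, y, max_shift=75):
    """R_0 minus the mean of R_i over shifts -max_shift..+max_shift.

    With max_shift=75 the mean runs over 151 shifts, as in the formula.
    """
    background = sum(correlation(x, y, i)
                     for i in range(-max_shift, max_shift + 1))
    return correlation(x, y, 0) - background / (2 * max_shift + 1)


def delta_cn(xcorr_best, xcorr_second):
    """Relative XCorr difference between the top two ranked candidates."""
    return (xcorr_best - xcorr_second) / xcorr_best
```

Subtracting the average correlation at shifted offsets penalises candidates that match the spectrum equally well when misaligned, i.e. by chance.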
From: MacCoss et al., Anal. Chem. 2002
From: Peng et al., J. Prot. Res., 2002
SEQUEST: some additional pictures
• Very well established search engine, Perkins, Electrophoresis 1999
• Can do MS (PMF) and MS/MS (PFF) identifications
• Based on the MOWSE score
• Unpublished core algorithm (trade secret)
• Predicts an a priori threshold score that identifications need to pass
• From version 2.2, Mascot allows integrated decoy searches
• Provides rank, score, threshold and expectation value per identification
• Customizable confidence level for the threshold score
Mascot
Figure: average identity threshold (axis 0 to 40) versus log10(number of AA) (axis 6.50 to 8.50), with linear fit y = 8.3761x - 34.089, R² = 0.9985.
Figure: false positives (axis 0% to 6%) and identifications (axis 0% to 100%) at cut-offs p=0.05, p=0.01, p=0.005 and p=0.0005.
Mascot: some additional pictures
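The linear fit reported for the Mascot threshold plot (y = 8.3761x - 34.089) can be evaluated directly. This sketch assumes, from the axis labels, that x is log10 of the number of amino acids in the searched database and y is the average identity threshold; it only reproduces the fitted line and says nothing about how Mascot derives the threshold internally.

```python
import math


def average_identity_threshold(n_amino_acids):
    """Evaluate the fitted line from the plot for a database of given size."""
    return 8.3761 * math.log10(n_amino_acids) - 34.089
```

A database ten times larger raises the average threshold by about 8.4 score units under this fit.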
• A successful open source search engine, Craig and Beavis, RCMS 2003
• Can be used for MS/MS (PFF) identifications
• Based on a hyperscore (Pi is either 0 or 1):
• Relies on a hypergeometric distribution (hence hyperscore)
• Published core algorithm, and is freely available
• Provides hyperscore and expectancy score (the discriminating one)
• X!Tandem is fast and can handle modifications in an iterative fashion
• Has rapidly gained popularity as (auxiliary) search engine
X!Tandem
$$\mathrm{HyperScore} = \left( \sum_{i=0}^{n} I_i \cdot P_i \right) \cdot N_b! \cdot N_y!$$
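A small numeric sketch of the hyperscore, assuming I_i is the intensity of peak i, P_i is 1 when the peak matches a predicted fragment and 0 otherwise (as stated on the slide), and N_b and N_y are the numbers of matched b and y ions. The last interpretation is an assumption here, since the slide does not define N_b and N_y.

```python
from math import factorial


def hyperscore(peaks, n_b, n_y):
    """Hyperscore from (intensity, matched) pairs and matched ion counts.

    peaks: list of (I_i, P_i) with P_i equal to 0 or 1.
    n_b, n_y: number of matched b ions and y ions.
    """
    dot_product = sum(intensity * matched for intensity, matched in peaks)
    return dot_product * factorial(n_b) * factorial(n_y)
```

The factorial terms make the score grow very quickly with the number of matched ions, so a few extra matches separate a good candidate from the bulk of random ones.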
Figure (three panels): # results versus hyperscore (0 to 100); log(# results) versus hyperscore (0 to 100); log(# results) versus hyperscore (20 to 50). Annotations: significance threshold, E-value=e-8.2.
Adapted from: Brian Searle, ProteomeSoftware, http://www.proteomesoftware.com/XTandem_edited.pdf
X!Tandem: some additional pictures
A note on how the scores differ
            Accuracy Score    Relative Score
X!Tandem    HyperScore        E-Value
SEQUEST     XCorr             DeltaCn
Adapted from: Brian Searle, ProteomeSoftware
• A successful open source search engine, Geer, JPR 2004
• Can be used for MS/MS (PFF) identifications
• Relies on a Poisson distribution
• Published core algorithm, and is freely available
• Provides an expectancy score, similar to the BLAST E-value
• OMSSA was recently upgraded to take peak intensity into account
• Received very good marks in a recently published comparative study
OMSSA
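The idea behind a Poisson-based expectation value can be sketched as follows: if the number of fragment peaks matched by chance follows a Poisson distribution with mean mu, the probability of seeing k or more matches is the upper tail, and multiplying by the number of candidate peptides gives an expected number of random hits. This is a simplified illustration of the principle and not OMSSA's actual implementation; `mu` and `n_candidates` are assumed inputs.

```python
import math


def poisson_tail(k, mu):
    """P(X >= k) for X ~ Poisson(mu), computed as 1 - P(X < k).

    Note: for very small tails this subtraction loses precision.
    """
    below = sum(math.exp(-mu) * mu ** x / math.factorial(x) for x in range(k))
    return 1.0 - below


def e_value(k, mu, n_candidates):
    """Expected number of candidates reaching k matches by chance."""
    return n_candidates * poisson_tail(k, mu)
```

As with a BLAST E-value, smaller is better: a value well below 1 means that a match this good is unlikely to occur by chance in a search of that size.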
Yeast lysate spectrum: number of fragment peak m/z matches against all sequences in the NCBI nr library, with a fitted Poisson distribution.
Validation of the Poisson distribution model: mean number of modelled and measured
matching peaks (against the NCBI nr database) for two mass tolerances.
Adapted from: Geer et al., J. Prot. Res., 2004
OMSSA: some additional pictures
COMPARATIVE STUDIES
Kapp et al., Proteomics, 2005
1.6x more?!
Balgley et al., Mol. Cell. Proteomics, 2007
Figure: four-way Venn diagram of the identifications obtained with Mascot, SEQUEST, Phenyx and ProteinSolver, showing the overlap between engines and the additional identifications contributed by each.
Figure courtesy of Dr. Christian Stephan, Medizinisches Proteom-Center, Ruhr-Universität Bochum; Human Brain Proteome Project
Combining the output of search algorithms
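Combining search engine output boils down to set operations on the identifications. The sketch below uses made-up identifiers and assumes each engine's accepted results are already available as a set; how results are thresholded per engine before combining is left out.

```python
def combine(results):
    """Combine identifications from several search engines.

    results: dict mapping engine name -> set of identified peptides.
    Returns (union, consensus, unique_per_engine).
    """
    all_sets = list(results.values())
    union = set().union(*all_sets)
    consensus = set.intersection(*all_sets) if all_sets else set()
    unique = {}
    for engine, ids in results.items():
        others = set().union(*(v for name, v in results.items()
                               if name != engine))
        unique[engine] = ids - others
    return union, consensus, unique
```

The union maximises coverage, the consensus maximises confidence, and the per-engine unique sets correspond to the outer regions of a Venn diagram.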
SEQUENTIAL COMPARISON
ALGORITHMS
Image from: Matthias Wilm, EMBL Heidelberg, Germany http://www.narrador.embl-heidelberg.de/GroupPages/PageLink/activities/SeqTag.html
sequence tag
The concept of sequence tags was introduced by Mann and Wilm (Mann and Wilm, Anal. Chem. 1994, 66: 4390-4399).
Sequence tags
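A sequence tag consists of a short stretch of sequence read from the spectrum plus the masses flanking it. The sketch below reads residues from the mass differences between consecutive peaks of one singly charged ion series; it assumes the peaks have already been selected and sorted, and reports L for the isobaric pair I/L. The tolerance and function name are illustrative.

```python
# Monoisotopic residue masses (Da); I is omitted because it is isobaric with L
RESIDUES = {"G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
            "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
            "N": 114.04293, "D": 115.02694, "Q": 128.05858, "K": 128.09496,
            "E": 129.04259, "M": 131.04049, "H": 137.05891, "F": 147.06841,
            "R": 156.10111, "Y": 163.06333, "W": 186.07931}


def extract_tag(peaks_mz, tol=0.02):
    """Return (start m/z, tag, end m/z) read from consecutive peaks.

    Reading stops at the first gap that matches no residue mass.
    """
    tag, end = "", peaks_mz[0]
    for low, high in zip(peaks_mz, peaks_mz[1:]):
        gap = high - low
        residue = next((aa for aa, mass in RESIDUES.items()
                        if abs(gap - mass) <= tol), None)
        if residue is None:
            break
        tag += residue
        end = high
    return peaks_mz[0], tag, end
```

The tag together with its two flanking masses is then used to filter the database, and only the surviving candidates need to be matched against the full spectrum.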
• Tabb, Anal. Chem. 2003, Tabb, JPR 2008, Dasari, JPR 2010
• Recent implementations of the sequence tag approach
• Refine hits by peak mapping in a second stage to resolve ambiguities
• Rely on an empirical fragmentation model
• Published core algorithms, DirecTag and TagRecon freely available
• Most useful to retrieve unexpected peptides (modifications, variations)
• Entire workflows exist (e.g., combination with IDPicker)
GutenTag, DirecTag, TagRecon
From: Tabb et al., Anal. Chem., 2003
GutenTag: some additional pictures
Example of a manual de novo interpretation of an MS/MS spectrum
No database is necessary to extract a sequence!
Algorithms: Lutefisk, Sherenga, PEAKS, PepNovo, …
References: Dancik 1999, Taylor 2000, Fernandez-de-Cossio 2000, Ma 2003, Zhang 2004, Frank 2005, Grossmann 2005, …
De novo compared to sequence tags
Thank you!
Questions?