35
http://www.bits.vib.be/training

BITS - Search engines for mass spec data

  • Upload
    bits

  • View
    1.207

  • Download
    2

Embed Size (px)

DESCRIPTION

This is the third presentation of the BITS training on 'Mass spec data processing'. It reviews the methods for matching mass spectrometry data with protein sequences, with review of useful tools.Thanks to the Compomics Lab of the VIB for contribution.

Citation preview

Page 2: BITS - Search engines for mass spec data

BITS MS Data Processing – Search Engines UGent, Gent, Belgium – 19 September 2011

Lennart Martens [email protected]

Lennart MARTENS [email protected]

Proteomics Services Group European Bioinformatics Institute

Hinxton, Cambridge United Kingdom www.ebi.ac.uk

search engines

lennart martens

[email protected]

Computational Omics and Systems Biology Group

Department of Medical Protein Research, VIB Department of Biochemistry, Ghent University

Ghent, Belgium

Page 3: BITS - Search engines for mass spec data

BITS MS Data Processing – Search Engines UGent, Gent, Belgium – 19 September 2011

Lennart Martens [email protected]

THREE TYPICAL PRE-PROCESSING STEPS

Page 4: BITS - Search engines for mass spec data

BITS MS Data Processing – Search Engines UGent, Gent, Belgium – 19 September 2011

Lennart Martens [email protected]

Global thresholding

Local thresholding

precursor

precursor

Noise thresholding

Page 5: BITS - Search engines for mass spec data

BITS MS Data Processing – Search Engines UGent, Gent, Belgium – 19 September 2011

Lennart Martens [email protected]

From: http://www.purdue.edu/dp/bioscience/images/spectrum.jpg

Charge deconvolution (peptides)

Page 6: BITS - Search engines for mass spec data

BITS MS Data Processing – Search Engines UGent, Gent, Belgium – 19 September 2011

Lennart Martens [email protected]

From: Gill et al, EMBO Journal, 2000

Charge deconvolution (proteins)

Page 7: BITS - Search engines for mass spec data

BITS MS Data Processing – Search Engines UGent, Gent, Belgium – 19 September 2011

Lennart Martens [email protected]

x x

Monoisotopic mass Average mass

Centroiding (peak picking)

Page 8: BITS - Search engines for mass spec data

BITS MS Data Processing – Search Engines UGent, Gent, Belgium – 19 September 2011

Lennart Martens [email protected]

From: Last et al, Nature Rev. Mol. Cell Bio., 2007

A total ion current chromatogram, corrected by typical pre-processing steps.

Combined results

Page 9: BITS - Search engines for mass spec data

BITS MS Data Processing – Search Engines UGent, Gent, Belgium – 19 September 2011

Lennart Martens [email protected]

51.4

25.8

0.7 0.3

24.5 23.7

0.2 0.10

10

20

30

40

50

60

RAW RAW GZIPped Peak lists Peak lists GZIPpedData type

File

siz

e (M

B)

Q-TOF I Esquire HCT

Data type

File size (MB)

Q-TOF I Esquire HCT

See: Martens et al., Proteomics, 2005

Data size reduction

Page 10: BITS - Search engines for mass spec data

BITS MS Data Processing – Search Engines UGent, Gent, Belgium – 19 September 2011

Lennart Martens [email protected]

MS/MS IDENTIFICATION

PEPTIDE FRAGMENTATION FINGERPRINTING

Page 11: BITS - Search engines for mass spec data

BITS MS Data Processing – Search Engines UGent, Gent, Belgium – 19 September 2011

Lennart Martens [email protected]

L E N N A R T

L LE

LEN

LENN

LENNA

LENNAR

LENNART

E N N A R T L

T

RT

ART

NART NNART

ENNART

LENNART

m/z

intensity

Peptide sequences and MS/MS spectra

Page 12: BITS - Search engines for mass spec data

BITS MS Data Processing – Search Engines UGent, Gent, Belgium – 19 September 2011

Lennart Martens [email protected]

protein sequence database

in silico

digest

YSFVATAER

HETSINGK

MILQEESTVYYR

SEFASTPINK

peptide sequences

m/z

Int

m/z

Int

m/z

Int m/z

Int in silico

MS/MS

theoretical MS/MS spectra

experimental MS/MS spectrum

in silico

matching

1) YSFVATAER 34 2) YSFVSAIR 12 3) FFLIGGGGK 12

peptide scores

Peptide fragment fingerprinting (PFF)

Page 13: BITS - Search engines for mass spec data

BITS MS Data Processing – Search Engines UGent, Gent, Belgium – 19 September 2011

Lennart Martens [email protected]

Spectral comparison

Sequencial comparison

Threading comparison

database sequence theoretical spectrum

experimental spectrum

compare

database sequence experimental spectrum

compare de novo sequence

database sequence experimental spectrum

thread

From: Eidhammer, Flikka, Martens, Mikalsen – Wiley 2007

Three types of PFF identification

Page 14: BITS - Search engines for mass spec data

BITS MS Data Processing – Search Engines UGent, Gent, Belgium – 19 September 2011

Lennart Martens [email protected]

• MASCOT (Matrix Science) http://www.matrixscience.com • SEQUEST (Scripps, Thermo Fisher Scientific) http://fields.scripps.edu/sequest • X!Tandem (The Global Proteome Machine Organization) http://www.thegpm.org/TANDEM • OMSSA (NCBI) http://pubchem.ncbi.nlm.nih.gov/omssa/

The most popular algorithms

Page 15: BITS - Search engines for mass spec data

BITS MS Data Processing – Search Engines UGent, Gent, Belgium – 19 September 2011

Lennart Martens [email protected]

Incorrect identifications

Correct identifications

False positives False negatives

Threshold score

Adapted from: www.proteomesoftware.com – Wiki pages

Overall concept of scores and cut-offs

Page 16: BITS - Search engines for mass spec data

BITS MS Data Processing – Search Engines UGent, Gent, Belgium – 19 September 2011

Lennart Martens [email protected]

0%

1%

2%

3%

4%

5%

6%

p=0.05 p=0.01 p=0.005 p=0.00050%

10%

20%

30%

40%

50%

60%

70%

80%

90%

100%

false positives

identifications

higher stringency

Playing with probabilistic cut-off scores

Page 17: BITS - Search engines for mass spec data

BITS MS Data Processing – Search Engines UGent, Gent, Belgium – 19 September 2011

Lennart Martens [email protected]

• Very well established search engine

• Can be used for MS/MS (PFF) identifications

• Based on a cross-correlation score (includes peak height)

• Published core algorithm (patented, licensed to Thermo), Eng, JASMS 1994

• Provides preliminary (Sp) score, rank, cross-correlation score (XCorr),

and score difference between the top tow ranks (deltaCn, ∆Cn)

• Thresholding is up to the user, and is commonly done per charge state

• Many extensions exist to perform a more automatic validation of results

SEQUEST

XCorr = deltaCn= XCorr1− XCorr 2

XCorr1𝑅0 −

1151

� 𝑅𝑅+75

𝑖=−75

𝑅𝑖 = �𝑥𝑗 ∙ 𝑦(𝑗+𝑖)

𝑛

𝑗=1

Page 18: BITS - Search engines for mass spec data

BITS MS Data Processing – Search Engines UGent, Gent, Belgium – 19 September 2011

Lennart Martens [email protected]

From: MacCoss et al., Anal. Chem. 2002

From: Peng et al., J. Prot. Res.. 2002

SEQUEST: some additional pictures

Page 19: BITS - Search engines for mass spec data

BITS MS Data Processing – Search Engines UGent, Gent, Belgium – 19 September 2011

Lennart Martens [email protected]

• Very well established search engine, Perkins, Electrophoresis 1999

• Can do MS (PMF) and MS/MS (PFF) identifications

• Based on the MOWSE score,

• Unpublished core algorithm (trade secret)

• Predicts an a priori threshold score that identifications need to pass

• From version 2.2, Mascot allows integrated decoy searches

• Provides rank, score, threshold and expectation value per identification

• Customizable confidence level for the threshold score

Mascot

Page 20: BITS - Search engines for mass spec data

BITS MS Data Processing – Search Engines UGent, Gent, Belgium – 19 September 2011

Lennart Martens [email protected]

y = 8.3761x - 34.089R2 = 0.9985

0

5

10

15

20

25

30

35

40

6.50 7.00 7.50 8.00 8.50log10(number of AA)

Ave

rage

iden

tity

thre

shol

dA

vera

ge id

enti

ty t

hres

hold

Mascot: some additional pictures

0%

1%

2%

3%

4%

5%

6%

p=0.05 p=0.01 p=0.005 p=0.00050%

10%

20%

30%

40%

50%

60%

70%

80%

90%

100%

false positives

identifications

Page 21: BITS - Search engines for mass spec data

BITS MS Data Processing – Search Engines UGent, Gent, Belgium – 19 September 2011

Lennart Martens [email protected]

• A successful open source search engine, Craig and Beavis, RCMS 2003

• Can be used for MS/MS (PFF) identifications

• Based on a hyperscore (Pi is either 0 or 1):

• Relies on a hypergeometric distribution (hence hyperscore)

• Published core algorithm, and is freely available

• Provides hyperscore and expectancy score (the discriminating one)

• X!Tandem is fast and can handle modifications in an iterative fashion

• Has rapidly gained popularity as (auxiliary) search engine

X!Tandem

*0

* !* !n

i i b yi

HyperScore I P N N=

= ∑

Page 22: BITS - Search engines for mass spec data

BITS MS Data Processing – Search Engines UGent, Gent, Belgium – 19 September 2011

Lennart Martens [email protected]

-10

-8

-6

-4

-2

0

2

4

6

0 20 40 60 80 100

hyperscore

log(

# re

sults

)

log(

# re

sults

)

0

0.5

1

1.5

2

2.5

3

3.5

4

20 25 30 35 40 45 50

hyperscore 0

10

20

30

40

50

60

0 20 40 60 80 100

hyperscore

# re

sults

Adapted from: Brian Searle, ProteomeSoftware, http://www.proteomesoftware.com/XTandem_edited.pdf

significance threshold

E-value=e-8.2

X!Tandem: some additional pictures

Page 23: BITS - Search engines for mass spec data

BITS MS Data Processing – Search Engines UGent, Gent, Belgium – 19 September 2011

Lennart Martens [email protected]

A note on how the scores differ

X! T

ande

m

SEQ

UES

T

XCorr

HyperScore

DeltaCn

E-Value

Accuracy Score Relative Score

Adapted from: Brian Searle, ProteomeSoftware

Page 24: BITS - Search engines for mass spec data

BITS MS Data Processing – Search Engines UGent, Gent, Belgium – 19 September 2011

Lennart Martens [email protected]

• A successful open source search engine, Geer, JPR 2004

• Can be used for MS/MS (PFF) identifications

• Relies on a Poisson distribution

• Published core algorithm, and is freely available

• Provides an expectancy score, similar to the BLAST E-value

• OMSSA was recently upgraded to take peak intensity into account

• Good really good marks in a recently published comparative study

OMSSA

Page 25: BITS - Search engines for mass spec data

BITS MS Data Processing – Search Engines UGent, Gent, Belgium – 19 September 2011

Lennart Martens [email protected]

Yeast lysate spectrum, m/z matches of fragment peak matches versus all NCBI nr sequence library. Poisson distribution fitted.

Validation of the Poisson distribution model: mean number of modelled and measured

matching peaks (against the NCBI nr database) for two mass tolerances.

Adapted from: Geer et al., J. Prot. Res., 2004

OMSSA: some additional pictures

Page 26: BITS - Search engines for mass spec data

BITS MS Data Processing – Search Engines UGent, Gent, Belgium – 19 September 2011

Lennart Martens [email protected]

COMPARATIVE STUDIES

Page 27: BITS - Search engines for mass spec data

BITS MS Data Processing – Search Engines UGent, Gent, Belgium – 19 September 2011

Lennart Martens [email protected]

Kapp et al., Proteomics, 2005

Page 28: BITS - Search engines for mass spec data

BITS MS Data Processing – Search Engines UGent, Gent, Belgium – 19 September 2011

Lennart Martens [email protected]

1.6x more?!

Balgley et al., Mol. Cell. Proteomics, 2007

Page 29: BITS - Search engines for mass spec data

BITS MS Data Processing – Search Engines UGent, Gent, Belgium – 19 September 2011

Lennart Martens [email protected]

1776

Mascot SEQUEST

Phenyx

ProteinSolver

501

40

212 (+4,2%)

486 (+9,6%)

329 (+6,5%)

380 (+7,5%)

3203

3229 3792

3186 168

348

179

96

146

139 77 195

Figure courtesy of Dr. Christian Stephan, Medizinisches Proteom-Center, Ruhr-Universität Bochum; Human Brain Proteome Project

Combining the output of search algorithms

Page 30: BITS - Search engines for mass spec data

BITS MS Data Processing – Search Engines UGent, Gent, Belgium – 19 September 2011

Lennart Martens [email protected]

SEQUENCIAL COMPARISON

ALGORITHMS

Page 31: BITS - Search engines for mass spec data

BITS MS Data Processing – Search Engines UGent, Gent, Belgium – 19 September 2011

Lennart Martens [email protected]

Image from: Matthias Wilm, EMBL Heidelberg, Germany http://www.narrador.embl-heidelberg.de/GroupPages/PageLink/activities/SeqTag.html

sequence tag

The concept of sequence tags was introduced by Mann and Wilm (Mann,and Wilm, Anal. Chem. 1994, 66: 4390-4399).

Sequence tags

Page 32: BITS - Search engines for mass spec data

BITS MS Data Processing – Search Engines UGent, Gent, Belgium – 19 September 2011

Lennart Martens [email protected]

• Tabb, Anal. Chem. 2003, Tabb, JPR 2008, Dasari, JPR 2010

• Recent implementations of the sequence tag approach

• Refine hits by peak mapping in a second stage to resolve ambiguities

• Rely on a empirical fragmentation model

• Published core algorithms, DirecTag and TagRecon freely available

• Most useful to retrieve unexpected peptides (modifications, variations)

• Entire workflows exist (e.g., combination with IDPicker)

GutenTag, DirecTag, TagRecon

Page 33: BITS - Search engines for mass spec data

BITS MS Data Processing – Search Engines UGent, Gent, Belgium – 19 September 2011

Lennart Martens [email protected]

From: Tabb et al., Anal. Chem., 2003

GutenTag: some additional pictures

Page 34: BITS - Search engines for mass spec data

BITS MS Data Processing – Search Engines UGent, Gent, Belgium – 19 September 2011

Lennart Martens [email protected]

Example of a manual de novo of an MS/MS spectrum No more database necessary to extract a sequence!

Algorithms

Lutefisk Sherenga

PEAKS PepNovo

References

Dancik 1999, Taylor 2000 Fernandez-de-Cossio 2000

Ma 2003, Zhang 2004 Frank 2005, Grossmann 2005

De novo compared to sequence tags

Page 35: BITS - Search engines for mass spec data

BITS MS Data Processing – Search Engines UGent, Gent, Belgium – 19 September 2011

Lennart Martens [email protected]

Thank you!

Questions?