bioinformatics for proteomics - GitHub Pages

bioinformatics for proteomics

lennart martens

[email protected] omics and systems biology groupVIB / Ghent University, Ghent, Belgium

www.compomics.com@compomics

Sequencial search algorithms

A key issue is to choose the right database

Decoys and false discovery rate calculation

Database search algorithms

Introduction: MS/MS spectra and identification

Protein inference: bad, ugly, and not so good







NH2 CH

CO COOHNH

R1 R2

CH

CO NH

CH

CO NH

CH

R3 R4

x3 y3 z3 x2 y2 z2 x1 y1 z1

a1 b1 c1 a2 b2 c2 a3 b3 c3

There are several other ion types that can be annotated, as well as‘internal fragments’. The latter are fragments that no longer contain an intactterminus. These are harder to use for ‘ladder sequencing’, but can still be interpreted.

This nomenclature was coined by Roepstorff and Fohlmann (Biomed. Mass Spec., 1984) and Klaus Biemann (Biomed.Environ. Mass Spec., 1988) and is commonly referred to as ‘Biemann nomenclature’. Note the link with the Roman alphabet.

Peptides subjected to fragmentation analysis can yield several types of fragment ions

L E N N A R T

L LE

LEN

LENN

LENNA

LENNAR

LENNART

E N N A R TL

T

RT

ART

NARTNNART

ENNART

LENNART

m/z

intensity

In an ideal world, the peptide sequence will produce directly interpretable ion ladders

Real spectra usually look quite a bit worse

LE

LENLENNA LENNART

m/z

intensity

[EL][EI]

[E[IL]][QN][KN]

TART

NARTLENNART

LE/EL N NA / AN[AN][QG][KG]

N

Spectral comparison

Sequencial comparison

Threading comparison

database sequence theoreticalspectrum

experimentalspectrum

compare

database sequence experimentalspectrum

compare de novosequence

database sequence experimentalspectrum

thread

We can distinguish three typesof M/MS identification algorithms

Eidhammer, Wiley, 2007







in silicoMS/MS

in silicodigest

protein sequence database

YSFVATAER

HETSINGK

MILQEESTVYYR

SEFASTPINK

…

peptide sequences

m/z

Int

m/z

Int

m/z

Intm/z

Int

theoretical MS/MSspectra

experimentalMS/MS spectrum

in silicomatching

1) YSFVATAER 342) YSFVSAIR 123) FFLIGGGGK 12

peptide scores

Database search engines match experimental spectra to known peptide sequences

CC BY-SA 4.0

• SEQUEST (UWashington, Thermo Fisher Scientific)http://fields.scripps.edu/sequest

• MASCOT (Matrix Science)http://www.matrixscience.com

• X!Tandem (The Global Proteome Machine Organization)http://www.thegpm.org/TANDEM

Three popular algorithms can serve as templates for the large variety of tools

• Can be used for MS/MS (PFF) identifications

• Based on a cross-correlation score (includes peak height)

• Published core algorithm (patented, licensed to Thermo), Eng, JASMS 1994

• Provides preliminary (Sp) score, rank, cross-correlation score (XCorr),

and score difference between the top tow ranks (deltaCn, ∆Cn)

• Thresholding is up to the user, and is commonly done per charge state

• Many extensions exist to perform a more automatic validation of results

SEQUEST is the original search engine, but not that much used anymore these days

XCorr = deltaCn= XCorr1− XCorr 2

XCorr1𝑅𝑅0 −

1151 �

𝑖𝑖=−75

+75

𝑅𝑅𝑅𝑅

𝑅𝑅𝑖𝑖 = �𝑗𝑗=1

𝑛𝑛

𝑥𝑥𝑗𝑗 � 𝑦𝑦(𝑗𝑗+𝑖𝑖)

From: MacCoss et al., Anal. Chem. 2002

From: Peng et al., J. Prot. Res.. 2002

SEQUEST reveals the problems with scoring different charges, and using different scores

• Very well established search engine, Perkins, Electrophoresis 1999

• Can do MS (PMF) and MS/MS (PFF) identifications

• Based on the MOWSE score,

• Unpublished core algorithm (trade secret)

• Predicts an a priori threshold score that identifications need to pass

• From version 2.2, Mascot allows integrated decoy searches

• Provides rank, score, threshold and expectation value per identification

• Customizable confidence level for the threshold score

Mascot is probably the most recognized search engine, despite its secret algorithm

• A successful open source search engine, Craig and Beavis, RCMS 2003

• Can be used for MS/MS (PFF) identifications

• Based on a hyperscore (Pi is either 0 or 1):

• Relies on a hypergeometric distribution (hence hyperscore)

• Published core algorithm, and is freely available

• Provides hyperscore and expectancy score (the discriminating one)

• X!Tandem is fast and can handle modifications in an iterative fashion

• Has rapidly gained popularity as (auxiliary) search engine

X!Tandem is a clear front-runneramong open source search engines

*0

* !* !n

i i b yi

HyperScore I P N N=

= ∑

-10

-8

-6

-4

-2

0

2

4

6

0 20 40 60 80 100

hyperscore

log(

# re

sults

)

log(

# re

sults

)

0

0.5

1

1.5

2

2.5

3

3.5

4

20 25 30 35 40 45 50

hyperscore0

10

20

30

40

50

60

0 20 40 60 80 100

hyperscore

# re

sults

Adapted from: Brian Searle, ProteomeSoftware,http://www.proteomesoftware.com/XTandem_edited.pdf

significancethreshold

E-value=e-8.2

X!Tandem’s significance calculation for scores can be seen as a general template

The influence of various parameter changes on database size is clearly visible

Verheggen, Mass Spec Reviews, 2017

And the effect on identification rateis correspondingly obvious


The main search engines in use are Mascot, Andromeda, SEQUEST and X!Tandem


Among the up-and-coming engines, Comet, MS-GF+ and MS-Amanda are most notable


In metaproteomics or proteogenomics combining search algorithms can be useful

Muth and Kolmeder, Proteomics, 2015

SearchGUI makes it very easy for youto run multiple free search engines

Vaudel, Proteomics, 2011

PeptideShaker is your gateway to the results

Vaudel, Nature Biotechnology, 2015

Our brand-new ionbot engine allows youto search for all possible modifications!

https://ionbot.cloud

ionbot searches for:- all 1490 UniMod mods- all possible SAPs

Known modifications from Terman and Kashina, Curr Opin Cell Biol, 2013Source data presented to ionbot from Kim et al., Nature, 2014 and mapping on PDB 3j82

human beta actin, front human beta actin, rear

ionbot recaptiulates a few decades of work on beta actin, and actually expands upon it

An example match with UniProt annotations (for Elongation factor 1-alpha 1) is very good

Peptide Residue UniProt # Mods Modification listTHINIVVIGHVDSGK K20 NO 1 carbamidomethylSTTTGHLIYK K30 NO 1 carbamidomethylCGGIDKR K36 YES 2 dimethyl,methylTIEKFEK K44 NO 1 carbamidomethylGSFKYAWVLDK K55 YES 3 carbamidomethyl,dimethyl,methylGITIDISLWKFETSK K79 YES 2 carbamidomethyl,trimethylYYVTIIDAPGHRDFIK K100 NO 1 carbamidomethylEHALLAYTLGVKQLIVGVNK K146 NO 1 carbamidomethylQLIVGVNK K154 NO 1 carbamidomethylMDSTEPPYSQK K165 YES 4 carbamidomethyl,dimethyl,methyl,trimethylYEEIVKEVSTYIK K172 acetyl* 2 carbamidomethyl,carboxymethylDGNASGTTLLEALDCILPPTRPTDK K244 NO 1 carbamidomethylLPLQDVYKIGGIGTVPVGR K255 NO 2 carbamidomethyl,trimethylVETGVLKPGMVVTFAPVNVTTEVK K273 acetyl* 4 carbamidomethyl,dicarbamidomethyl,dimethyl,trimethylVETGVLKPGMVVTFAPVNVTTEVK K290 NO 2 carbamidomethyl,dicarbamidomethylNVSVKDVR K318 YES 2 dimethyl,trimethylKLEDGPK K392 acetyl* 1 carbamidomethylSGDAAIVDMVPGKPMCVESFSDYPPLGR K408 NO 3 carbamidomethyl,carboxymethyl,methylolQTVAVGVIK K439 acetyl* 1 carbamidomethyl

missing according to UniProt: K2, which is a very short peptide (4 residues)

Known modifications from UniProt entry P68104, https://www.uniprot.org/uniprot/P68104Source data presented to ionbot from Kim et al., Nature, 2014







sequence tag

The concept of sequence tags was introduced by Mann and Wilm

1079.61 - SD[IL] - 303.20

Sequence tags are as old as SEQUEST, and these still have a role to play today

Mann, Analytical Chemistry, 1994

• Tabb, Anal. Chem. 2003, Tabb, JPR 2008, Dasari, JPR 2010

• Recent implementations of the sequence tag approach

• Refine hits by peak mapping in a second stage to resolve ambiguities

• Rely on a empirical fragmentation model

• Published core algorithms, DirecTag and TagRecon freely available

• GutenTag and DirecTag extracts tags,

• TagRecon matches these to the database

• Very useful to retrieve unexpected peptides (modifications, variations)

• Entire workflows exist (e.g., combination with IDPicker)

GutenTag, DirecTag, TagRecon

GutenTag: two stage, hybrid tag searching

Tabb, Analytical Chemistry, 2003

Example of a manual de novo of an MS/MS spectrumNo more database necessary to extract a sequence!

Algorithms

LutefiskSherenga

PEAKSPepNovo

…

References

Dancik 1999, Taylor 2000Fernandez-de-Cossio 2000

Ma 2003, Zhang 2004Frank 2005, Grossmann 2005

…

De novo sequencing tries to read the entire peptide sequence from the spectrum







Colony colapse disorder, soldiers,and forcing the issue (or rather: the solution)

Knudsen, PLoS ONE, 2011

The identification seems reasonable,but is limited in an unreasonable way!

Knudsen, PLoS ONE, 2011

The end result may be that you are takento task for mistakes in your research

Beware of common contaminants

Tyrosine nitrosylation

Ghesquière, Proteomics, 2010







All hits, good and bad together,form a distribution of scores

Nesvizhskii, J Proteomics, 2010

If we know how scores for bad hits distribute, we can distinguish good from bad by score

The separation is not perfect, which leads to the calculation of a local false discovery rate

local false discovery rate(posterior error probability; PEP)

- Reversed databases (easy)

LENNARTMARTENS Æ SNETRAMTRANNEL

- Shuffled databases (slightly more difficult)

LENNARTMARTENS Æ NMERLANATERTTN (for instance)

- Randomized databases (as difficult as you want it to be)

LENNARTMARTENS Æ GFVLAEPHSEAITK (for instance)

Three main types of decoy DB’s are used:

The concept is that each peptide identified from the decoy database is an incorrect identification. By counting the number of decoy hits, we can estimate the number of false positives in the original database, provided that the decoys have similar properties as the forward sequences.

Decoy databases are false positive factories, assumed to deliver representative bad hits

With the help of the scores of decoy hits,we can assess the score distribution of bad hits

local false discovery rate(posterior error probability; PEP)

Käll, Journal of Proteome Research, 2008

score

Setting a threshold classifies all hits as either bad or good, which inevitably leads to errors

True Positive

False Positive

False Negative True Negative







peptides a b c d

proteinsprot X x xprot Y xprot Z x x x

Minimal setOccam {

peptides a b c d

proteinsprot X x xprot Y xprot Z x x x

Maximal setanti-Occam {

peptides a b c d

proteinsprot X (-) x xprot Y (+) xprot Z (0) x x x

Minimal set withmaximal annotation {

true Occam?

Protein inference is a question of conviction

Martens, Molecular Biosystems, 2007

In real life, protein inference issues will bemainly bad, often ugly, and occasionally good

Protein inference is linked to quantification (i)

Nice and easy, 1/1, only unique peptides (blue) and narrow distribution

Colaert, Proteomics, 2010

Nice and easy, down-regulated

Protein inference is linked to quantification (ii)


Protein inference is linked to quantification (iii)

A little less easy, up-regulated


Protein inference is linked to quantification (iv)

A nice example of the mess of degenerate peptides


Protein inference is linked to quantification (v)

A bit of chaos, but a defined core distributionColaert, Proteomics, 2010

Documents

bioinformatics for proteomics - GitHub Pages