Statistical Significance for Peptide Identification by Tandem Mass Spectrometry

Nathan Edwards
Center for Bioinformatics and Computational Biology
University of Maryland, College Park

Mass Spectrometry for Proteomics

• Measure mass of many (bio)molecules simultaneously
  • High bandwidth

• Mass is an intrinsic property of all (bio)molecules
  • No prior knowledge required

• Measure mass of many molecules simultaneously
  • ...but not too many, abundance bias

• Mass is an intrinsic property of all (bio)molecules
  • ...but need a reference to compare to

High Bandwidth

[Figure: example mass spectrum on an axis from 0 to 1000]

Mass is fundamental!

• Mass spectrometry has been around since the turn of the 20th century...
  • ...why is MS-based proteomics so new?

• Ionization methods
  • MALDI, Electrospray

• Protein chemistry & automation
  • Chromatography, Gels, Computers

• Protein sequence databases
  • A reference for comparison

Sample Preparation for Peptide Identification

Enzymatic Digest and Fractionation

Single Stage MS

Tandem Mass Spectrometry (MS/MS)

Precursor selection

Tandem Mass Spectrometry (MS/MS)

Precursor selection + collision induced dissociation

Peptide Fragmentation

N-terminus                   C-terminus
H…-HN-CH-CO-NH-CH-CO-NH-CH-CO-…OH
      |        |        |
      R(i-1)   R(i)     R(i+1)

Side chains R(i-1), R(i), R(i+1) belong to AA residue i-1, AA residue i, and AA residue i+1.

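Peptides consist of amino acids arranged in a linear backbone.

[Figure: backbone segment -HN-CH-CO-NH-CH-CO-NH- with side chains Ri and R’, marking the cleavage sites that yield fragment ions y(n-i) and y(n-i-1)]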

Peptide: S-G-F-L-E-E-D-E-L-K

b ion   Prefix      Suffix      y ion   y m/z
b1      S           GFLEEDELK   y9      1080
b2      SG          FLEEDELK    y8      1022
b3      SGF         LEEDELK     y7      875
b4      SGFL        EEDELK      y6      762
b5      SGFLE       EDELK       y5      633
b6      SGFLEE      DELK        y4      504
b7      SGFLEED     ELK         y3      389
b8      SGFLEEDE    LK          y2      260
b9      SGFLEEDEL   K           y1      147

[Figure: MS/MS spectrum of SGFLEEDELK on an m/z axis from 0 to 1000. The b ion series starts at 88 (S); the y ion series is labeled 147, 260, 389, 504, 633, 762, 875, 1022, 1080, 1166. Annotated peaks: y2, y3, y4, b5, b6, b7, b8, b9.]
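The b and y ion ladders for SGFLEEDELK can be recomputed from residue masses. The following is a minimal sketch, assuming singly charged fragments and monoisotopic masses; the water and proton constants and the function name `fragment_ladder` are assumptions introduced here, not taken from the slides. The computed y values agree with the integer labels on the slides to within 1 Da.

```python
# Monoisotopic residue masses (Da) for the residues in SGFLEEDELK,
# taken from the amino-acid table in this deck.
RESIDUE_MASS = {
    "S": 87.03203, "G": 57.02147, "F": 147.06842, "L": 113.08407,
    "E": 129.04260, "D": 115.02695, "K": 128.09497,
}
WATER = 18.01056   # H2O, added to y ions (assumed constant)
PROTON = 1.00728   # charge carrier for a singly charged ion (assumed constant)


def fragment_ladder(peptide):
    """Return (b_ions, y_ions) as m/z lists for singly charged fragments."""
    masses = [RESIDUE_MASS[aa] for aa in peptide]
    n = len(masses)
    # b_i: the first i residues plus a proton.
    b_ions = [sum(masses[:i]) + PROTON for i in range(1, n)]
    # y_i: the last i residues plus water plus a proton.
    y_ions = [sum(masses[n - i:]) + WATER + PROTON for i in range(1, n)]
    return b_ions, y_ions


b_ions, y_ions = fragment_ladder("SGFLEEDELK")
for i, (b, y) in enumerate(zip(b_ions, y_ions), start=1):
    print(f"b{i} {b:9.3f}    y{i} {y:9.3f}")
```

The b ion masses, which did not survive on the slide, fall out of the same calculation.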

Peptide Identification

• For each (likely) peptide sequence
  1. Compute fragment masses
  2. Compare with spectrum
  3. Retain those that match well

• Peptide sequences from protein sequence databases
  • Swiss-Prot, IPI, NCBI’s nr, ...

• Automated, high-throughput peptide identification in complex mixtures
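The three steps above can be sketched as a small search loop. This is an illustration under stated assumptions, not how any particular search engine works: the peak list, the 0.5 Da tolerance, the candidate peptides, and the shared-peak-count score are all hypothetical, and only singly charged b and y ions are considered.

```python
# Monoisotopic residue masses (Da), copied from the amino-acid table in this deck.
RESIDUE_MASS = {
    "A": 71.03712, "C": 103.00919, "D": 115.02695, "E": 129.04260,
    "F": 147.06842, "G": 57.02147, "H": 137.05891, "I": 113.08407,
    "K": 128.09497, "L": 113.08407, "M": 131.04049, "N": 114.04293,
    "P": 97.05277, "Q": 128.05858, "R": 156.10112, "S": 87.03203,
    "T": 101.04768, "V": 99.06842, "W": 186.07932, "Y": 163.06333,
}
WATER = 18.01056   # assumed constant
PROTON = 1.00728   # assumed constant


def fragment_mz(peptide):
    """Step 1: singly charged b and y ion m/z values for a peptide."""
    masses = [RESIDUE_MASS[aa] for aa in peptide]
    ions = []
    for i in range(1, len(masses)):
        ions.append(sum(masses[:i]) + PROTON)           # b ion (prefix)
        ions.append(sum(masses[i:]) + WATER + PROTON)   # y ion (suffix)
    return ions


def match_score(peptide, peaks, tol):
    """Step 2: count theoretical fragments that land within tol of a peak."""
    return sum(any(abs(mz - p) <= tol for p in peaks) for mz in fragment_mz(peptide))


def identify(candidates, peaks, tol=0.5, min_score=1):
    """Step 3: keep candidates that match well, best score first."""
    scored = [(match_score(c, peaks, tol), c) for c in candidates]
    return sorted((s, c) for s, c in scored if s >= min_score)[::-1]


# Hypothetical observed peaks (m/z); 300.00 stands in for a noise peak.
peaks = [147.11, 260.20, 300.00, 389.24, 504.27, 633.31, 762.35, 875.44]
# Hypothetical candidates with the same composition, hence the same precursor mass.
candidates = ["SGFLEEDELK", "SGFLDEEELK", "KLEDEELFGS"]
ranking = identify(candidates, peaks)
print(ranking)
```

A raw count like this says nothing yet about whether the top candidate is better than a lucky guess, which is the question the rest of the deck addresses.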

High quality peptide identification: E-value < 10^-8

Moderate quality peptide identification: E-value < 10^-3

Amino-Acid Molecular Weights

Code  Amino acid      Residue MW    Code  Amino acid     Residue MW
A     Alanine          71.03712     M     Methionine     131.04049
C     Cysteine        103.00919     N     Asparagine     114.04293
D     Aspartic acid   115.02695     P     Proline         97.05277
E     Glutamic acid   129.04260     Q     Glutamine      128.05858
F     Phenylalanine   147.06842     R     Arginine       156.10112
G     Glycine          57.02147     S     Serine          87.03203
H     Histidine       137.05891     T     Threonine      101.04768
I     Isoleucine      113.08407     V     Valine          99.06842
K     Lysine          128.09497     W     Tryptophan     186.07932
L     Leucine         113.08407     Y     Tyrosine       163.06333

• Peptide fragmentation by CID is poorly understood

• MS/MS spectra represent incomplete information about amino-acid sequence
  • I/L, K/Q, GG/N, …

• Correct identifications don’t come with a certificate!
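The ambiguities listed above (I/L, K/Q, GG/N) follow directly from the residue masses. A small sketch, using the masses from the table in this deck; the helper names and the two tolerances are assumptions chosen for illustration.

```python
# Residue masses (Da) from the amino-acid table in this deck.
RESIDUE_MASS = {
    "I": 113.08407, "L": 113.08407,
    "K": 128.09497, "Q": 128.05858,
    "G": 57.02147, "N": 114.04293,
}


def mass(sequence):
    """Total residue mass of a short sequence."""
    return sum(RESIDUE_MASS[aa] for aa in sequence)


def distinguishable(a, b, tol):
    """True if the two sequences differ by more than the mass tolerance."""
    return abs(mass(a) - mass(b)) > tol


for a, b in [("I", "L"), ("K", "Q"), ("GG", "N")]:
    diff = abs(mass(a) - mass(b))
    print(f"{a} vs {b}: difference {diff:.5f} Da, "
          f"separable at 0.5 Da: {distinguishable(a, b, 0.5)}, "
          f"at 0.01 Da: {distinguishable(a, b, 0.01)}")
```

I and L are isomers, and GG and N share the same elemental composition (the tiny GG/N difference in the output is table rounding), so no mass tolerance separates them. K and Q differ by about 0.036 Da, which only a sufficiently accurate instrument can resolve.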

• High-throughput workflows demand we analyze all spectra, all the time.

• Spectra may not contain enough information to be interpreted correctly
  • …bad static on a cell phone

• Peptides may not match our assumptions
  • …it’s all Greek to me

• “Don’t know” is an acceptable answer!

• Rank the best peptide identifications

• Is the top ranked peptide correct?

• Incorrect peptide has best score
  • Correct peptide is missing?
  • Potential for incorrect conclusion
  • What score ensures no incorrect peptides?

• Correct peptide has weak score
  • Insufficient fragmentation, poor score
  • Potential for weakened conclusion
  • What score ensures we find all correct peptides?

Statistical Significance

• Can’t prove particular identifications are right or wrong...
  • ...need to know fragmentation in advance!

• A minimal standard for identification scores...
  • ...better than guessing.
  • p-value, E-value, statistical significance

Pin the tail on the donkey…

Probability Concepts

Throwing darts
  • One at a time
  • Blindfolded

Uniform distribution? Independent? Identically distributed?

Pr [ Dart hits 20 ] = 0.05

Throwing darts
  • One at a time
  • Blindfolded
  • Three darts

Pr [Hitting 20 3 times] = 0.05 * 0.05 * 0.05

Pr [Hit 20 at least twice] = 0.007125 + 0.000125

Hits (of 3 darts)   Probability
0                   0.857375
1                   0.135375
2                   0.007125
3                   0.000125

Throwing darts
  • One at a time
  • Blindfolded
  • 100 darts

Pr [Hitting 20 3 times] = 0.139575

Pr [Hit 20 at least twice] = 0.9629188

Hits (of 100 darts)   Probability
0                     0.005920
1                     0.031160
2                     0.081181
3                     0.139575
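Both dart tables are binomial probabilities with hit probability 0.05, and they can be reproduced in a few lines. The function names below are introduced here for illustration; the model assumes independent, identically distributed throws, as the slides do.

```python
from math import comb


def binom_pmf(k, n, p):
    """Probability of exactly k hits in n independent throws."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


def binom_tail(s, n, p):
    """Probability of at least s hits in n independent throws."""
    return 1.0 - sum(binom_pmf(k, n, p) for k in range(s))


P_HIT = 0.05  # Pr[dart hits 20], from the slide

for n in (3, 100):
    print(f"{n} darts")
    for k in range(4):
        print(f"  {k} hits: {binom_pmf(k, n, P_HIT):.6f}")
    print(f"  at least 2 hits: {binom_tail(2, n, P_HIT):.7f}")
```

With 3 darts, two or more hits is a rare event; with 100 darts it is almost certain. The same score means something very different depending on how many throws were allowed.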

Probability Concepts

[Figure: histogram of rbinom(10000, 100, 0.05); x-axis from 0 to 15]

Match Score

• Dartboard represents the mass range of the spectrum

• Peaks of a spectrum are “slices”
  • Width of slice corresponds to mass tolerance

• Darts represent
  • random masses
  • masses of fragments of a random peptide
  • masses of peptides of a random protein
  • masses of biomarkers from a random class

• How many darts do we get to throw?

Match Score

[Figure: example spectrum on an m/z axis from 0 to 1000; labeled values 755 and 580]

• What is the probability that we match at least 5 peaks?

Match Score

• Number of matched peaks ~ Binomial( n , p ) ≈ Poisson( p n ), for small p and large n
• Pr [ Match ≥ s peaks ] is the upper tail of this distribution

p is the probability of a random mass / peak match; n is the number of darts (fragments in our answer)
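A short sketch comparing the exact binomial tail with its Poisson approximation. The values n = 18 and p = 0.02 are hypothetical: n corresponds to the 18 b and y fragments of a 10-residue peptide, and p to roughly 20 peaks of width 2 × 0.5 Da spread over a 1000 Da range. The function names are introduced here for illustration.

```python
from math import comb, exp, factorial


def binomial_tail(s, n, p):
    """Pr[at least s matches] when each of n fragments matches with probability p."""
    return sum(comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(s, n + 1))


def poisson_tail(s, lam):
    """Pr[at least s matches] under a Poisson model with mean lam = p * n."""
    return 1.0 - sum(exp(-lam) * lam**k / factorial(k) for k in range(s))


n, p = 18, 0.02  # hypothetical: 18 fragments, 2% chance of a random match each
for s in range(1, 7):
    print(f"s = {s}: binomial {binomial_tail(s, n, p):.3e}, "
          f"Poisson {poisson_tail(s, p * n):.3e}")
```

The two tails have the same order of magnitude here, which is what makes the Poisson form convenient; the agreement degrades as p grows or n shrinks.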

Match Score

Theoretical distribution
  • Used by OMSSA
  • Proposed, in various forms, by many.

• Probability of random mass / peak match
  • IID (independent, identically distributed)
  • Based on match tolerance

Match Score

Theoretical distribution assumptions
• Each dart is independent
  • Peaks are not “related”
• Each dart is identically distributed
  • Chance of random mass / peak match is the same for all peaks

Tournament Size

[Figure: four panels: 100 people, 1000 people, 10000 people, 100000 people]

Tournament Size

[Figure: the same four panels (100, 1000, 10000, 100000 people); x-axis from 10 to 18]

Number of Trials

• Tournament size == number of trials
  • Number of peptides tried
  • Related to sequence database size

• Probability that the best random match score is ≥ s
  • 1 – Pr [ all match scores < s ]
  • 1 – Pr [ match score < s ]^Trials (*)
  • Assumes IID!

• Expect value (E-value)
  • E = Trials * Pr [ match ≥ s ]
  • Corresponds to Bonferroni bound on (*)
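The relationship between (*) and the E-value can be checked numerically. In this sketch the single-trial probability of 1e-5 is hypothetical, and the trial counts simply reuse the tournament sizes; the function names are introduced here for illustration.

```python
def prob_any(p_single, trials):
    """(*): Pr[at least one of `trials` IID random matches scores >= s]."""
    return 1.0 - (1.0 - p_single) ** trials


def e_value(p_single, trials):
    """Expected number of random matches scoring >= s (Bonferroni bound on (*))."""
    return trials * p_single


P_SINGLE = 1e-5  # hypothetical Pr[match >= s] for one random peptide
for trials in (100, 1000, 10000, 100000):
    print(f"{trials:>6} trials: (*) = {prob_any(P_SINGLE, trials):.6f}, "
          f"E = {e_value(P_SINGLE, trials):.6f}")
```

The E-value is never smaller than (*), and the two are nearly equal while E is small. Once E approaches 1 the bound becomes loose, but by then the match is not significant anyway.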

Better Dart Throwers

Better Random Models

• Comparison with completely random model isn’t really fair

• Match scores for real spectra with real peptides obey rules

• Even incorrect peptides match with non-random structure!

• Want to generate random fragment masses (darts) that behave more like the real thing:
  • Some fragments are more likely than others
  • Some fragments depend on others

• Theoretical models can only incorporate this structure to a limited extent.

• Generate random peptides
  • Real looking fragment masses
  • No theoretical model!
  • Must use empirical distribution
  • Usually require they have the correct precursor mass

• Score function can model anything we like!
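One simple way to get random peptides with the correct precursor mass is to shuffle the residues of a candidate, since a shuffle keeps the composition. The sketch below builds an empirical score distribution that way. Everything specific in it is an assumption for illustration: the peak list, the 0.5 Da tolerance, the shared-peak-count score, the sample size of 1000, and the add-one correction in the p-value (which keeps the estimate from being exactly zero).

```python
import random

# Residue masses (Da) for the residues used here, from the table in this deck.
RESIDUE_MASS = {
    "S": 87.03203, "G": 57.02147, "F": 147.06842, "L": 113.08407,
    "E": 129.04260, "D": 115.02695, "K": 128.09497,
}
WATER, PROTON = 18.01056, 1.00728   # assumed constants
PEAKS = [147.11, 260.20, 300.00, 389.24, 504.27, 633.31, 762.35, 875.44]  # hypothetical
TOLERANCE = 0.5                     # hypothetical, in Da


def match_score(peptide):
    """Count singly charged b and y fragments that fall within TOLERANCE of a peak."""
    masses = [RESIDUE_MASS[aa] for aa in peptide]
    score = 0
    for i in range(1, len(masses)):
        b_ion = sum(masses[:i]) + PROTON
        y_ion = sum(masses[i:]) + WATER + PROTON
        for mz in (b_ion, y_ion):
            if any(abs(mz - peak) <= TOLERANCE for peak in PEAKS):
                score += 1
    return score


def shuffled_peptides(peptide, count, seed=0):
    """Random peptides with the same composition, hence the same precursor mass."""
    rng = random.Random(seed)
    residues = list(peptide)
    result = []
    for _ in range(count):
        rng.shuffle(residues)
        result.append("".join(residues))
    return result


def empirical_p_value(observed, random_scores):
    """Fraction of random scores at least as good as the observed one (add-one corrected)."""
    hits = sum(1 for s in random_scores if s >= observed)
    return (hits + 1) / (len(random_scores) + 1)


observed = match_score("SGFLEEDELK")
random_scores = [match_score(p) for p in shuffled_peptides("SGFLEEDELK", 1000)]
p_value = empirical_p_value(observed, random_scores)
print("observed score:", observed, "empirical p-value:", p_value)
```

The resolution of an empirical p-value is limited by the number of random peptides scored, which is one reason to extrapolate from a fitted shape when very small p-values are needed.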

Fenyo & Beavis, Anal. Chem., 2003

• Truly random peptides don’t look much like real peptides

• Just use peptides from the sequence database!

• Caveats:
  • Correct peptide (non-random) may be included
  • Peptides are not independent

• Reverse sequence avoids only the first problem
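Building a reversed database takes one line per protein. In this sketch the protein names, the sequences, and the `REV_` prefix are hypothetical.

```python
def reversed_decoys(proteins, prefix="REV_"):
    """Return a decoy database in which every protein sequence is reversed."""
    return {prefix + name: sequence[::-1] for name, sequence in proteins.items()}


# Hypothetical two-protein database.
proteins = {
    "P1": "MSGFLEEDELKRAVLK",
    "P2": "MKTAYIAKQRQISFVK",
}
decoys = reversed_decoys(proteins)
for name, sequence in decoys.items():
    print(name, sequence)
```

Reversal preserves each protein's length and amino-acid composition, so the decoy database has the same size as the real one. It does nothing about the dependence between peptides, which is the second caveat.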

Extrapolating from the Empirical Distribution

• Often, the empirical shape is consistent with a theoretical model

Geer et al., J. Proteome Research, 2004

Fenyo & Beavis, Anal. Chem., 2003

False Positive Rate Estimation

• Each spectrum is a chance to be right, wrong, or inconclusive.
  • How many decisions are wrong?

• Given identification criteria:
  • SEQUEST Xcorr, E-value, Score, etc., plus...
  • ...threshold

• Use “decoy” sequences
  • random, reverse, cross-species
  • Identifications must be incorrect!

• # FP in real search = # hits in decoy search
  • Need same size database, or rate conversion

• FP Rate: # decoy hits / # real hits

• FP Rate: 2 x # decoy hits / (# real hits + # decoy hits)
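Both FP rate formulas reduce to counting hits above a threshold. The sketch below uses hypothetical scores and assumes that higher scores are better (for an E-value the comparison would flip). Reading the second formula as the one for a single search against real and decoy sequences together is also an assumption; the slide does not say.

```python
def count_hits(scores, threshold):
    """Number of identifications whose score passes the threshold."""
    return sum(1 for s in scores if s >= threshold)


def fp_rate_separate(real_hits, decoy_hits):
    """# decoy hits / # real hits."""
    return decoy_hits / real_hits if real_hits else 0.0


def fp_rate_combined(real_hits, decoy_hits):
    """2 x # decoy hits / (# real hits + # decoy hits)."""
    total = real_hits + decoy_hits
    return 2 * decoy_hits / total if total else 0.0


# Hypothetical best-match scores, one per spectrum, from each search.
real_scores = [5.1, 4.8, 4.2, 3.9, 3.5, 3.1, 2.8, 2.2, 1.9, 1.5]
decoy_scores = [3.2, 2.6, 2.1, 1.8, 1.6, 1.4, 1.2, 1.1, 0.9, 0.7]

threshold = 3.0
real_hits = count_hits(real_scores, threshold)
decoy_hits = count_hits(decoy_scores, threshold)
print("real hits:", real_hits, "decoy hits:", decoy_hits)
print("FP rate (separate):", fp_rate_separate(real_hits, decoy_hits))
print("FP rate (combined):", fp_rate_combined(real_hits, decoy_hits))
```

Sweeping the threshold and recomputing the rate gives the trade-off between the number of identifications kept and the fraction expected to be wrong.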

• A form of statistical significance
  • In “theory”, E-value and a FP rate are the same.

• Search engine independent

• Easy to implement

• Assumes a single threshold for all spectra

• Spectrum/Peptide Identification scores are not iid!...
  • ...but E-values, in principle, are.

Peptide Prophet

• From the Institute for Systems Biology
  • Keller et al., Anal. Chem. 2002

• Re-analysis of SEQUEST results

• Spectra are trials
  • Assumes that many of the spectra are not correctly identified

Peptide Prophet

Distribution of spectral scores in the results

Keller et al., Anal. Chem. 2002

Peptide Prophet

• Assumes a bimodal distribution of scores, with a particular shape

• Ignores database size• …but it is included implicitly

• Like empirical distribution for peptide sampling, can be applied to any score function
  • Can be applied to any search engines’ results
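The idea of fitting a bimodal score distribution and reading off a probability of being correct can be illustrated with a two-component mixture fitted by expectation-maximization. This is a sketch of the general technique only: using two Gaussians is a simplifying assumption made here, not a claim about the shapes Peptide Prophet uses, and the simulated scores (700 incorrect around 0, 300 correct around 4) are hypothetical.

```python
import math
import random
import statistics


def normal_pdf(x, mu, sigma):
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))


def prob_correct(x, weight, mu, sigma):
    """Posterior probability that score x came from the upper ("correct") component."""
    low = weight[0] * normal_pdf(x, mu[0], sigma[0])
    high = weight[1] * normal_pdf(x, mu[1], sigma[1])
    return high / (low + high)


def fit_two_gaussians(scores, iterations=200):
    """EM for a 1-D two-Gaussian mixture; returns (weight, mu, sigma) lists."""
    ordered = sorted(scores)
    half = len(ordered) // 2
    # Crude start: lower half of the scores vs. upper half.
    mu = [statistics.fmean(ordered[:half]), statistics.fmean(ordered[half:])]
    spread = statistics.pstdev(scores)
    sigma = [spread, spread]
    weight = [0.5, 0.5]
    for _ in range(iterations):
        # E-step: responsibility of the upper component for every score.
        resp = [prob_correct(x, weight, mu, sigma) for x in scores]
        # M-step: re-estimate weight, mean, and spread of each component.
        for k in (0, 1):
            r = resp if k == 1 else [1.0 - v for v in resp]
            total = sum(r)
            weight[k] = total / len(scores)
            mu[k] = sum(ri * x for ri, x in zip(r, scores)) / total
            var = sum(ri * (x - mu[k]) ** 2 for ri, x in zip(r, scores)) / total
            sigma[k] = max(math.sqrt(var), 1e-6)
    return weight, mu, sigma


rng = random.Random(1)
scores = [rng.gauss(0.0, 1.0) for _ in range(700)] + [rng.gauss(4.0, 1.0) for _ in range(300)]
weight, mu, sigma = fit_two_gaussians(scores)
print("weights:", weight, "means:", mu)
print("Pr[correct | score = 5]:", prob_correct(5.0, weight, mu, sigma))
```

The fit depends on there being a visible second peak; with very few correct identifications the upper component is poorly determined.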

Peptide Prophet

• Caveats
  • Are spectra scores sampled from the same distribution?
  • Are there enough correct identifications for the second peak?
  • Are spectra independent observations?
  • Are distributions appropriately shaped?

• Huge improvement over raw SEQUEST results

Peptides to Proteins

Nesvizhskii et al., Anal. Chem. 2003

Peptides to Proteins

• A peptide sequence may occur in many different protein sequences
  • Variants, paralogues, protein families

• Separation, digestion and ionization are not well understood

• Proteins in sequence database are extremely non-random, and very dependent

Publication Guidelines

1. Computational parameters
  • Spectral processing
  • Sequence database
  • Search program
  • Statistical analysis

2. Number of peptides per protein
  • Each peptide sequence counts once!
  • Multiple forms of the same peptide count once!

3. Single-peptide proteins must be explicitly justified by
  • Peptide sequence
  • N and C terminal amino-acids
  • Precursor mass and charge
  • Peptide Scores
  • Multiple forms of the peptide counted once!

4. Biological conclusions based on single-peptide proteins must show the spectrum

5. More stringent requirements for PMF data analysis
  • Similar to that for tandem mass spectra

6. Management of protein redundancy
  • Peptides identified from a different species?

7. Spectra submission encouraged
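Guidelines 2 and 3 above amount to counting distinct peptide sequences per protein and flagging proteins supported by only one. A minimal sketch; the record layout (protein, sequence, modification, charge) and the example identifications are hypothetical.

```python
from collections import defaultdict


def distinct_peptides_per_protein(identifications):
    """Count peptide sequences per protein; different forms of a sequence count once."""
    peptides = defaultdict(set)
    for protein, sequence, _modification, _charge in identifications:
        peptides[protein].add(sequence)
    return {protein: len(sequences) for protein, sequences in peptides.items()}


# Hypothetical identifications: (protein, peptide sequence, modification, charge).
identifications = [
    ("P1", "SGFLEEDELK", None, 2),
    ("P1", "SGFLEEDELK", None, 3),          # same sequence, different charge
    ("P1", "AVLMK", "oxidation (M)", 2),
    ("P1", "AVLMK", None, 2),               # same sequence, unmodified form
    ("P2", "TAYIAK", None, 2),
]

counts = distinct_peptides_per_protein(identifications)
single_peptide_proteins = sorted(p for p, n in counts.items() if n == 1)
print(counts)
print("needs explicit justification:", single_peptide_proteins)
```

Keying on the bare sequence is what makes charge states and modified forms collapse into one count.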

Summary

• Could guessing be as effective as a search?

• More guesses improves the best guess

• Better guessers help us be more discriminating

• Peptide to proteins is not as simple as it seems

• Publication guidelines reflect sound statistical principles.
