
© 2009 Regents of the University of Minnesota. All rights reserved.

Protein Grouping, FDR Analysis and Databases.

The Minnesota Supercomputing Institute for Advanced Computational Research

March 15th 2012

Pratik Jagtap

http://www.mass.msi.umn.edu/


Protein Grouping, FDR Analysis and Databases…

• Overview.

• Protein Grouping : Concept & Methods.

• FDR Analysis : Concept & Methods.

• Peptide and Protein Identification

• Databases

• Publication Guidelines

REFERENCES : "Reporting Protein Identifications from MS/MS Results" by Brian Searle (Proteome Software Inc.) and "Databases" by Akhilesh Pandey (Johns Hopkins University) at the "Bioinformatics for Protein Identification" workshop in Baltimore (2009).


Proteomics workflow

Protein -> (Trypsin) -> Peptides -> RP HPLC MS/MS -> Mass Spectrum and Tandem Mass Spectrum -> Search Against Database

[Figure: chromatograms for BSATEST2080708_080807191252 (8/7/2008 7:12:52 PM), RT 24.95 - 60.48 min, Relative Abundance vs. Time (min). First trace: Base Peak, MS (NL 2.98E8). Second trace: Base Peak Ion Chromatogram, m/z 653.00-654.00, MS (NL 2.34E8).]

Protein Identification




Proteomics workflow

Orbitrap

Mass spectral data. (.raw)

Search Algorithm

Statistical validation of


Visualization Descriptive Statistics Pathway Analysis

Processing





Probability that each identification is correct:

AEPTIR: 85%
IDVCIVLLQHK: 65%
NTGDR: 25%
Protein: ??%

Probability that each identification is incorrect (in parentheses):

AEPTIR: (15%)
IDVCIVLLQHK: (35%)
NTGDR: (75%)
Protein: (??%)

Feng J, Naiman DQ, Cooper B. Anal Chem. 2007 May 15;79(10):3901-11.

AEPTIR: (15%)
IDVCIVLLQHK: (35%)
NTGDR: (75%)
Protein: (4%)

0.15 * 0.35 * 0.75 ≈ 0.04

AEPTIR: 85%
IDVCIVLLQHK: 65%
NTGDR: 25%
Protein: 96%

0.15 * 0.35 * 0.75 ≈ 0.04, so 1 - 0.04 = 0.96
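The arithmetic on this slide can be written as a short calculation. This is a minimal sketch in Python that assumes the peptide identifications are independent, which is the simplification the slide uses; the function name `protein_probability` is made up for illustration.

```python
from math import prod

def protein_probability(peptide_probs):
    """Probability that at least one peptide identification is correct,
    assuming the peptide identifications are independent."""
    # Probability that every peptide identification is wrong.
    all_wrong = prod(1.0 - p for p in peptide_probs)
    return 1.0 - all_wrong

# Peptide probabilities from the slide: AEPTIR, IDVCIVLLQHK, NTGDR.
probs = [0.85, 0.65, 0.25]
print(f"{protein_probability(probs):.2f}")  # prints 0.96
```

The independence assumption is what makes the product valid; the following slides show why it is too optimistic in practice.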





If only it were so easy…



Nesvizhskii, A. I.; Aebersold, R. Mol. Cell. Proteomics 4(10), 1419-1440, 2005

YMACCLLYR (shared peptide): 85%
Tubulin alpha 6: ??%
Tubulin alpha 3: ??%
Tubulin alpha 4: ??%

YMACCLLYR (shared peptide): 85%
Tubulin alpha 6: 85% / 3
Tubulin alpha 3: 85% / 3
Tubulin alpha 4: 85% / 3

Nesvizhskii, A. I.; Keller, A. et al. Anal. Chem. 75, 4646-4658
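The 85% / 3 split can be sketched as follows. This only illustrates dividing one shared peptide's probability equally among the proteins that contain it; real tools may refine such weights further, and the function name `split_shared_peptide` is hypothetical.

```python
def split_shared_peptide(peptide_prob, proteins):
    """Divide a shared peptide's probability equally among its proteins."""
    share = peptide_prob / len(proteins)
    return {protein: share for protein in proteins}

shares = split_shared_peptide(
    0.85, ["Tubulin alpha 6", "Tubulin alpha 3", "Tubulin alpha 4"]
)
for protein, share in shares.items():
    print(f"{protein}: {share:.3f}")  # each protein gets 0.283
```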




YMACCLLYR
SIQFVDWCPTGFK
Tubulin alpha 6: ??%
Tubulin alpha 3: ??%
Tubulin alpha 4: ??%

Copyright ©2005 American Society for Biochemistry and Molecular Biology

Nesvizhskii, A. I. (2005) Mol. Cell. Proteomics 4: 1419-1440

Basic peptide grouping scenarios

Occam's Razor : "When you have two competing theories that make exactly the same predictions, the simpler one is the better."

William of Ockham
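Applied to protein inference, Occam's Razor is usually taken to mean reporting the smallest set of proteins that explains all observed peptides. The sketch below uses a greedy set-cover heuristic, which is an assumption made here for illustration and not necessarily what any particular tool does. The protein-to-peptide map is a toy example that reuses names from these slides; the assignment of peptides to proteins in it is invented.

```python
def parsimonious_proteins(protein_to_peptides):
    """Greedy set cover: repeatedly pick the protein that explains the
    most still-unexplained peptides until every peptide is explained."""
    unexplained = set().union(*protein_to_peptides.values())
    chosen = []
    while unexplained:
        # Ties are broken alphabetically so the result is deterministic.
        best = max(
            sorted(protein_to_peptides),
            key=lambda prot: len(protein_to_peptides[prot] & unexplained),
        )
        chosen.append(best)
        unexplained -= protein_to_peptides[best]
    return chosen

# Toy data (invented peptide-to-protein assignment).
toy = {
    "Tubulin alpha 3": {"YMACCLLYR"},
    "Tubulin alpha 4": {"YMACCLLYR"},
    "Tubulin alpha 6": {"YMACCLLYR", "SIQFVDWCPTGFK"},
}
print(parsimonious_proteins(toy))  # prints ['Tubulin alpha 6']
```

Greedy selection is not guaranteed to find the true minimum in every case, but it is simple and behaves sensibly on small examples like this one.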




Distinct Proteins (Protein A, Protein B)

Peptide 1  Peptide 2
100%       100%

Peptide 3  Peptide 4
100%       100%

Indistinguishable Proteins (Protein A, Protein B)

Peptide 1  Peptide 2  Peptide 3  Peptide 4
50%        50%        50%        50%
50%        50%        50%        50%

Differentiable Proteins (Protein A, Protein B)

Peptide 1 Peptide 2 Peptide 3

100% 50% 50%
50% 50% 100%



Subset Proteins (Protein A, Protein B)

100% 100% 100% 100%
0% 0% 0%
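The four scenarios above differ only in how the two proteins' peptide sets overlap, so they can be told apart with set comparisons. This is a minimal sketch; the function name `grouping_scenario` and the example peptide assignments are assumptions for illustration and are not read off the slides.

```python
def grouping_scenario(peptides_a, peptides_b):
    """Classify two proteins by how their identified peptide sets overlap."""
    a, b = set(peptides_a), set(peptides_b)
    if not a & b:
        return "distinct"            # no shared peptides
    if a == b:
        return "indistinguishable"   # identical peptide evidence
    if a < b or b < a:
        return "subset"              # one protein's peptides all belong to the other
    return "differentiable"          # shared peptides plus unique ones on each side

print(grouping_scenario({"P1", "P2"}, {"P3", "P4"}))                    # distinct
print(grouping_scenario({"P1", "P2"}, {"P1", "P2"}))                    # indistinguishable
print(grouping_scenario({"P1", "P2", "P3"}, {"P2", "P3", "P4"}))        # differentiable
print(grouping_scenario({"P1", "P2", "P3", "P4"}, {"P1", "P2", "P3"}))  # subset
```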



The Quantitative Subset Complication (Protein A, Protein B)

The Similar Peptide Complication

AVGNLR

Scan Number: 2435

TLR9_HUMAN

GLGNLR

TRFE_HUMAN

LRFN1_HUMAN



The Similar Peptide Complication

AVGNLR

Scan Number: 2435

TLR9_HUMAN

TRFE_HUMAN

LRFN1_HUMAN




Search against database.

Mass spectrum

>IPI:IPI00205563.1|Gene_Symbol=Tmsbl1 thymosin beta-like protein

MSDKPDLSEVETFDKSKLKKTNTEEKNTLPSKETIQQEKEYNQRS

>IPI:REV_IPI00205563.1|Gene_Symbol=Tmsbl1 thymosin beta-like protein

SRQNYKEEQQITKESPLTKNEETNKKTKLKSDFTEVESLDKPDSM

Reverse Database Search
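A reversed decoy database can be built by appending a reversed copy of every target sequence. The sketch below is a minimal illustration with hypothetical function names and a toy entry; it prefixes the whole header with `REV_`, whereas the slide places the tag after `IPI:`. Note also that the REV_ sequence shown above is close to, but not exactly, a character-by-character reversal of the forward sequence (several K residues sit one position earlier), so this sketch will not reproduce it exactly.

```python
def read_fasta(text):
    """Parse FASTA text into a list of (header, sequence) pairs."""
    entries, header, chunks = [], None, []
    for line in text.splitlines():
        line = line.strip()
        if line.startswith(">"):
            if header is not None:
                entries.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif line:
            chunks.append(line)
    if header is not None:
        entries.append((header, "".join(chunks)))
    return entries

def with_reversed_decoys(entries, tag="REV_"):
    """Return target entries followed by tagged, reversed decoy entries."""
    decoys = [(tag + header, seq[::-1]) for header, seq in entries]
    return entries + decoys

# Toy entry (invented) split over two lines, as FASTA files often are.
targets = read_fasta(">toy_protein example\nPEPT\nIDEK\n")
for header, seq in with_reversed_decoys(targets):
    print(f">{header}\n{seq}")
```

Searching targets and decoys together lets the number of decoy matches stand in for the number of incorrect target matches.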



False Discovery Rate Analysis

Nature Methods 4, 787-797 (2007); Sennels et al., BMC Bioinformatics 2009, 10:179



[Figure: # of Matches (0 to 800) vs. Discriminant Score (-40 to 60); "Correct" distribution labeled]

Slide Courtesy : Brian Searle (Proteome Software)

Choi H, Ghosh D, Nesvizhskii A. J. Proteome Res. 2008;7(1):286

False Discovery Rate Analysis



Global / Cumulative FDR

[Figure: # of Matches vs. Discriminant Score]

Threshold 1 : 99% correct; 1% incorrect.

5% incorrect.

10% incorrect.



Local / Instantaneous FDR

[Figure: # of Matches vs. Discriminant Score]

5% incorrect.

40% incorrect.

80% incorrect.



FDR

[Figure: # of Matches vs. Discriminant Score]

Global: 1% incorrect; 5% incorrect; 10% incorrect.
Local: 5% incorrect; 40% incorrect; 80% incorrect.
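With a reverse (decoy) database search, both kinds of FDR can be estimated by counting decoy matches. The sketch below is one common simplification and an assumption here: it takes the number of decoy matches as the estimate of incorrect target matches, uses the ratio decoys / targets, and caps it at 1. The global value counts everything scoring at or above a threshold; the local value counts only matches inside a score window starting at the threshold. Function names, the window width and the toy scores are all made up.

```python
def global_fdr(matches, threshold):
    """Estimated FDR among all target matches scoring >= threshold.

    matches: list of (score, is_decoy) pairs.
    """
    targets = sum(1 for s, d in matches if s >= threshold and not d)
    decoys = sum(1 for s, d in matches if s >= threshold and d)
    # With no target matches there is nothing to report, so return 0.
    return min(1.0, decoys / targets) if targets else 0.0

def local_fdr(matches, threshold, width):
    """Estimated FDR among matches with threshold <= score < threshold + width."""
    window = [(s, d) for s, d in matches if threshold <= s < threshold + width]
    return global_fdr(window, threshold)

# Toy matches (invented): (discriminant score, is_decoy)
matches = [
    (50, False), (45, False), (40, False), (35, False), (30, False),
    (22, False), (21, True), (20, False),
    (12, True), (11, False), (10, True), (5, True), (4, False),
]
print(f"global: {global_fdr(matches, 20):.2f}")     # prints global: 0.14
print(f"local:  {local_fdr(matches, 20, 10):.2f}")  # prints local:  0.50
```

In this toy data the matches just above the threshold are much less reliable than the accepted list as a whole, which is the same pattern as the Global and Local columns above.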




False Discovery Rate Analysis and PSPEP Report SINGLE RESULTS TABLE

Sennels et al BMC Bioinformatics 2009, 10:179




Search Algorithms: Matching Spectra To Protein Sequences.

Andromeda

PEAKS

search algorithm

Nature Methods - 4, 787 - 797 (2007)




Protip / TINT

https://tropix.msi.umn.edu/

John Chilton Mark Nelson




Protip

Raw Data from Orbitrap
msconvert
Formats: mzXML format, dta format, MGF format
Searches: X!TANDEM search, MASCOT search, SEQUEST search, OMSSA search
Scaffold Analysis
Scaffold Viewer

Powered by 14 node SEQUEST cluster

Combining results from multiple search algorithms increases the confidence and number of peptide and protein identifications.

[Figure: bar chart, HUMAN DATASET; y-axis 0 to 8400. Categories in order: Sequest, X! tandem, Mascot, All Together, Sequest + Mascot, Sequest + X! tandem, X! tandem + Mascot. Data labels in order: 5522, 5137, 5486, 8162, 6554, 6962, 7443 and 401, 370, 411, 491, 441, 441, 462.]

One hit wonders

Option 1 : Throw Out One-Hit-Wonders
Advantages: Easy, works!
Disadvantages: Loss of sensitivity!

Option 2 : Use Multiple Filters
Filter 1 - Protein Mode
• ≥2 peptides/protein
• moderate spectrum threshold
Filter 2 - Peptide Mode
• 1 peptide/protein
• high spectrum threshold
Advantages: More sensitive!
Disadvantages: Pretty arbitrary!
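Option 2 amounts to accepting a protein if it passes either of two rules. A minimal sketch follows, in which the numeric thresholds (0.80 and 0.95) and the function name are placeholders chosen for illustration, not values from the slides.

```python
MODERATE_THRESHOLD = 0.80  # assumed value for the "moderate" spectrum threshold
HIGH_THRESHOLD = 0.95      # assumed value for the "high" spectrum threshold

def passes_filters(peptide_scores):
    """Accept a protein via Filter 1 (protein mode) or Filter 2 (peptide mode).

    peptide_scores: best spectrum score for each distinct peptide of the protein.
    """
    moderate = sum(1 for s in peptide_scores if s >= MODERATE_THRESHOLD)
    high = sum(1 for s in peptide_scores if s >= HIGH_THRESHOLD)
    protein_mode = moderate >= 2   # Filter 1: >= 2 peptides at the moderate threshold
    peptide_mode = high >= 1       # Filter 2: 1 peptide at the high threshold
    return protein_mode or peptide_mode

print(passes_filters([0.85, 0.82]))  # True  (two moderate peptides)
print(passes_filters([0.97]))        # True  (one-hit wonder with a high score)
print(passes_filters([0.85]))        # False (one-hit wonder, moderate score only)
```

The choice of the two cutoffs is exactly the "pretty arbitrary" part the slide warns about.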



Option 3 : Protein-level FDR.

Global Protein FDRs only accurate with >100 Proteins

[Figure: Uncertainty in Protein FDR (0% to 12%) vs. Number of Confidently IDed Proteins (0 to 600); line marked "1% Error In FDR Estimation"]
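One way to see why small protein lists give unreliable global FDRs: with N accepted proteins, a single false positive more or less shifts the estimated FDR by 1/N. This back-of-the-envelope sketch is an interpretation offered here and not necessarily how the curve on the slide was derived; the function name is hypothetical.

```python
def fdr_step(n_proteins):
    """Change in estimated FDR caused by one extra false-positive protein."""
    return 1.0 / n_proteins

for n in (20, 50, 100, 500):
    print(f"{n:>4} proteins: one false positive = {fdr_step(n):.1%} FDR")
```

At 100 proteins the step is 1%, which is in line with the >100 proteins rule of thumb.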



Local Protein FDRs…

• Estimate the likelihood that a single protein of interest is present
• Are troublesome at best due to stochastic sampling
• Shouldn’t be used with <500 likely proteins


Genomic and proteomic databases

http://www.genomesonline.org/cgi-bin/GOLD/bin/gold.cgi




Search against database.

Mass spectrum

Database Search



Choose your database according to your experimental setup

• GenBank accepts submissions of DNA sequences, not protein sequences. Databases are not always correct. Sequence errors can be corrected only by the submitter.
• Familiarize yourself with the details of your database (e.g. UniProt / IPI / RefSeq).
• Transatlantic Divide :
  GenBank / EMBL
  NCBI / EBI
  Tranche / PRIDE
  NCBInr / TrEMBL
• Searching with different databases and search algorithms affects IDs. The search algorithm also affects Protein grouping.
• Database size affects ranking of proteins, scoring threshold, etc. Consider using combined databases / Contaminant sequences.




Publication Standards

• In 2006 MCP published guidelines for reporting peptide and protein identifications
• Other proteomics journals have adopted similar standards
• Revised “Paris 2” guidelines (1/1/2010).

Guidelines remind you to…

• Present a complete methods/results section
  I. Search Parameters and Acceptance Criteria
  VI. Raw Data Submission
• Follow smart criteria for choosing results to publish
  II. Protein and Peptide Identification
  IV. Protein Inference from Peptide Assignments
  V. Quantification
• Not over-report your results
  III. Post-Translational Modifications

Software Can Make Guideline Fulfillment Easier

Where are the Guidelines?

Molecular & Cellular Proteomics: Bradshaw, R. A., Burlingame, A. L., Carr, S., Aebersold, R., Reporting Protein Identification Data: The next Generation of Guidelines. Mol. Cell. Proteomics, 5:787-788, 2006.

Journal of Proteome Research: Beavis, R., Editorial: The Paris Consensus. J. Proteome Res., 2005, 4 (5), p 1475

Proteomics: Wilkins, M. R., Appel, R. D., Van Eyk, J. E., Maxey, C. M., et al., Guidelines for the next 10 years of proteomics. Proteomics. 2006, 6, 1, 4-8.

http://www.mcponline.org/site/misc/MSDataResources.xhtml

A few take home points…

• We identify Proteins (not Peptides)!
  – Can’t stop at Peptide FDRs and Probabilities
• One-Hit-Wonders can be wrong and need to be seriously investigated (manually or mathematically)
• You can compute Protein level FDRs
  – But take them with a grain of salt!
• Occam’s Razor can simplify Shared Peptides
• Publication Standards exist to help you

Questions ?


http://sitemaker.umich.edu/iwsmoi