Supercomputing Institute for Advanced Computational Research © 2009 Regents of the University of Minnesota. All rights reserved. Protein Grouping, FDR Analysis and Databases. The Minnesota Supercomputing Institute for Advanced Computational Research March 15th 2012 Pratik Jagtap http://www.mass.msi.umn.edu/


Supercomputing Institute for Advanced Computational Research

© 2009 Regents of the University of Minnesota. All rights reserved.

Protein Grouping, FDR Analysis and Databases.

The Minnesota Supercomputing Institute for Advanced Computational Research

March 15th 2012

Pratik Jagtap

http://www.mass.msi.umn.edu/


Protein Grouping, FDR Analysis and Databases…

• Overview.

• Protein Grouping : Concept & Methods.

• FDR Analysis : Concept & Methods.

• Peptide and Protein identification

• Databases

• Publication Guidelines

REFERENCES : “Reporting Protein Identifications from MS/MS Results” by Brian Searle (Proteome Software Inc.) and “Databases” by Akhilesh Pandey (Johns Hopkins University) at the “Bioinformatics for Protein Identification” workshop at Baltimore (2009).


Proteomics workflow

Protein → Trypsin → Peptides → RP HPLC MS/MS → Mass Spectrum and Tandem Mass Spectrum → Search Against Database → Protein Identification

[Figure: base peak chromatogram (NL 2.98E8) and ion chromatogram for m/z 653.00-654.00 (NL 2.34E8) of BSATEST2080708_080807191252, 8/7/2008 7:12:52 PM; x-axis: Time (min), RT 24.95 - 60.48; y-axis: Relative Abundance.]


Proteomics workflow

Orbitrap

Mass spectral data (.raw)

Search Algorithm

Statistical validation of Protein Identification

Visualization, Descriptive Statistics, Pathway Analysis

Processing


Peptide probabilities and the resulting protein probability:

Peptide        Correct   Incorrect
AEPTIR         85%       (15%)
IDVCIVLLQHK    65%       (35%)
NTGDR          25%       (75%)
Protein        96%       (4%)

Probability that all three peptide identifications are incorrect: 0.15 * 0.35 * 0.75 = 0.04, so the probability that the protein is present is 1 - 0.04 = 96%.

Feng J, Naiman DQ, Cooper B. Anal Chem. 2007 May 15;79(10):3901-11.
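
The arithmetic on this slide can be sketched in a few lines of Python. This is only an illustration, assuming the peptide identifications are independent of each other; the function name `protein_probability` is hypothetical and does not come from the cited paper.

```python
from math import prod

def protein_probability(peptide_probs):
    """Probability that at least one peptide identification is correct.

    Assumes the peptide identifications are independent, so the chance
    that all of them are wrong is the product of the (1 - p) terms.
    """
    all_wrong = prod(1.0 - p for p in peptide_probs)
    return 1.0 - all_wrong

# Peptide probabilities from the slide: AEPTIR, IDVCIVLLQHK, NTGDR
probs = [0.85, 0.65, 0.25]
print(f"{protein_probability(probs):.2f}")  # 0.96
```

The independence assumption is what makes the product valid, and it is also why the following slides say "If only it were so easy": shared peptides break it.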


If only it were so easy…


Nesvizhskii, A. I.; Aebersold, R. Mol. Cell. Proteom. 4.10, 1419-1440, 2005


A shared peptide: YMACCLLYR (85%) matches Tubulin alpha 3, Tubulin alpha 4 and Tubulin alpha 6. What probability (??%) should each protein receive?

Tubulin alpha 6: 85% / 3
Tubulin alpha 3: 85% / 3
Tubulin alpha 4: 85% / 3

Nesvizhskii, A. I.; Keller, A. et al Anal. Chem. 75, 4646-4658
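
The equal split shown on this slide (85% divided among three proteins) can be written as a short Python sketch. The function name `apportion_shared_peptide` is hypothetical, and only the equal split is shown; any later re-weighting of the shares is outside what the slide states.

```python
def apportion_shared_peptide(peptide_prob, proteins):
    """Split one peptide probability equally among the proteins it matches."""
    if not proteins:
        raise ValueError("peptide must match at least one protein")
    share = peptide_prob / len(proteins)
    return {protein: share for protein in proteins}

shares = apportion_shared_peptide(
    0.85, ["Tubulin alpha 3", "Tubulin alpha 4", "Tubulin alpha 6"]
)
for protein, share in shares.items():
    print(f"{protein}: {share:.3f}")  # each protein receives 0.283
```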


Tubulin alpha 6

Tubulin alpha 3

YMACCLLYR

SIQFVDWCPTGFK

Tubulin alpha 4

??%

??%

??%


Basic peptide grouping scenarios

Nesvizhskii, A. I. (2005) Mol. Cell. Proteomics 4: 1419-1440
Copyright ©2005 American Society for Biochemistry and Molecular Biology

Occam's Razor : “When you have two competing theories that make exactly the same predictions, the simpler one is the better.”

William of Ockham


Distinct Proteins

Protein A: Peptide 1 (100%), Peptide 2 (100%)
Protein B: Peptide 3 (100%), Peptide 4 (100%)

Indistinguishable Proteins

Protein A: Peptide 1 (50%), Peptide 2 (50%), Peptide 3 (50%), Peptide 4 (50%)
Protein B: Peptide 1 (50%), Peptide 2 (50%), Peptide 3 (50%), Peptide 4 (50%)

Differentiable Proteins

Protein A: Peptide 1 (100%), Peptide 2 (50%), Peptide 3 (50%)
Protein B: Peptide 2 (50%), Peptide 3 (50%), Peptide 4 (100%)

Subset Proteins

Protein A: Peptide 1 (100%), Peptide 2 (100%), Peptide 3 (100%), Peptide 4 (100%)
Protein B: Peptide 2 (0%), Peptide 3 (0%), Peptide 4 (0%)

The Quantitative Subset Complication

Protein A: Peptide 1, Peptide 2, Peptide 3, Peptide 4
Protein B: Peptide 2, Peptide 3, Peptide 4
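
The four grouping scenarios above can be told apart by comparing the peptide sets of two proteins. The sketch below is one possible way to do that with Python sets; the function name `classify` is hypothetical, and real grouping software handles many proteins at once rather than a single pair.

```python
def classify(peptides_a, peptides_b):
    """Name the grouping scenario for two proteins from their peptide sets."""
    a, b = set(peptides_a), set(peptides_b)
    if a == b:
        return "indistinguishable"  # identical peptide evidence
    if a.isdisjoint(b):
        return "distinct"           # no shared peptides
    if a < b or b < a:
        return "subset"             # one protein explains all peptides of the other
    return "differentiable"         # shared peptides plus unique ones on both sides

print(classify({1, 2}, {3, 4}))              # distinct
print(classify({1, 2, 3, 4}, {1, 2, 3, 4}))  # indistinguishable
print(classify({1, 2, 3}, {2, 3, 4}))        # differentiable
print(classify({1, 2, 3, 4}, {2, 3, 4}))     # subset
```

In the subset case, Occam's Razor keeps only the protein with the larger peptide set, which is why the subset protein receives 0% on the slide.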


The Similar Peptide Complication

AVGNLR

Scan Number: 2435

TLR9_HUMAN

GLGNLR

TRFE_HUMAN

LRFN1_HUMAN


The Similar Peptide Complication

AVGNLR

Scan Number: 2435

TLR9_HUMAN

TRFE_HUMAN

LRFN1_HUMAN


Search against database.

Mass spectrum

>IPI:IPI00205563.1|Gene_Symbol=Tmsbl1 thymosin beta-like protein

MSDKPDLSEVETFDKSKLKKTNTEEKNTLPSKETIQQEKEYNQRS

>IPI:REV_IPI00205563.1|Gene_Symbol=Tmsbl1 thymosin beta-like protein

SRQNYKEEQQITKESPLTKNEETNKKTKLKSDFTEVESLDKPDSM

Reverse Database Search
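
A decoy database for a reverse database search can be sketched as follows. This is a minimal example that assumes plain string reversal of every sequence and a `REV_` prefix placed at the start of the header; both are illustrative choices, other decoy schemes exist, and the decoy entry shown on the slide need not be a character-for-character reversal. The function name `reversed_decoys` and the demo entry are hypothetical.

```python
def reversed_decoys(fasta_text, prefix="REV_"):
    """Return FASTA text with one reversed decoy entry per target entry."""
    entries = []
    header, seq = None, []
    for line in fasta_text.strip().splitlines():
        if line.startswith(">"):
            # A new header closes the previous entry, if there was one.
            if header is not None:
                entries.append((header, "".join(seq)))
            header, seq = line[1:], []
        else:
            seq.append(line.strip())
    if header is not None:
        entries.append((header, "".join(seq)))
    # Plain reversal of each sequence; the header gets the decoy prefix.
    return "\n".join(f">{prefix}{h}\n{s[::-1]}" for h, s in entries)

target = ">demo_protein example entry\nMSDKPDLR\n"
print(reversed_decoys(target))
# >REV_demo_protein example entry
# RLDPKDSM
```

Searching the target and decoy entries together gives a count of decoy matches, which is the basis for the false discovery rate estimates discussed next.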


False Discovery Rate Analysis

Nature Methods - 4, 787 - 797 (2007)

Sennels et al BMC Bioinformatics 2009, 10:179


False Discovery Rate Analysis

[Figure: histogram of # of Matches (0 to 800) against Discriminant Score (-40 to 60), with the “Correct” matches marked.]

Slide Courtesy : Brian Searle (Proteome Software)

Choi H, Ghosh D, Nesvizhskii A. J. Proteome Res. 2008 :7(1):286

Global / Cumulative FDR

[Figure: # of Matches against Discriminant Score.]

Threshold 1 : 99 % correct; 1% incorrect.
Threshold 2 : 95 % correct; 5% incorrect.
Threshold 3 : 90 % correct; 10% incorrect.

Local / Instantaneous FDR

[Figure: # of Matches against Discriminant Score.]

Threshold 1 : 95 % correct; 5% incorrect.
Threshold 2 : 60 % correct; 40% incorrect.
Threshold 3 : 20 % correct; 80% incorrect.

FDR

              Global                        Local
Threshold 1   99 % correct; 1% incorrect    95 % correct; 5% incorrect
Threshold 2   95 % correct; 5% incorrect    60 % correct; 40% incorrect
Threshold 3   90 % correct; 10% incorrect   20 % correct; 80% incorrect
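
The difference between the global and the local FDR can be illustrated with a small target-decoy sketch. It assumes the common decoy-based estimate (decoy matches divided by target matches); the function names and the score lists are made up for illustration and are not data from the slides.

```python
def global_fdr(target_scores, decoy_scores, threshold):
    """Cumulative FDR estimate: decoy hits / target hits at or above the threshold."""
    n_targets = sum(s >= threshold for s in target_scores)
    n_decoys = sum(s >= threshold for s in decoy_scores)
    return n_decoys / n_targets if n_targets else 0.0

def local_fdr(target_scores, decoy_scores, low, high):
    """Instantaneous FDR estimate: decoy hits / target hits inside one score bin."""
    n_targets = sum(low <= s < high for s in target_scores)
    n_decoys = sum(low <= s < high for s in decoy_scores)
    return n_decoys / n_targets if n_targets else 0.0

# Made-up discriminant scores, for illustration only.
targets = [55, 48, 42, 37, 31, 26, 24, 22, 18, 12]
decoys = [21, 13, 9, 4]

print(global_fdr(targets, decoys, threshold=10))    # 0.2
print(local_fdr(targets, decoys, low=10, high=20))  # 0.5
```

With these numbers, everything scoring at least 10 has an estimated 20% error overall, while the matches sitting right at the threshold (scores 10 to 20) have an estimated 50% error. That is the same pattern as in the Global versus Local comparison above: the local FDR at a threshold is worse than the global FDR for everything above it.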


False Discovery Rate Analysis and PSPEP Report SINGLE RESULTS TABLE

Sennels et al BMC Bioinformatics 2009, 10:179



Search Algorithms: Matching Spectra To Protein Sequences.

Andromeda

PEAKS


search algorithm

Nature Methods - 4, 787 - 797 (2007)


Protip / TINT

https://tropix.msi.umn.edu/

John Chilton Mark Nelson


Protip

[Diagram: Raw Data from Orbitrap; msconvert; mzxml format; dta format; Mgf format; SEQUEST search; MASCOT search; X!TANDEM search; OMSSA search; Scaffold Analysis; Scaffold Viewer.]

Powered by 14 node SEQUEST cluster


Combining results from multiple search algorithms increases the confidence and number of peptide and protein identifications.

[Figure: bar chart for the HUMAN DATASET with bars for Sequest, X! tandem, Mascot, All Together, Sequest + Mascot, Sequest + X! tandem and X! tandem + Mascot; y-axis from 0 to 8400; values shown on the chart: 5522, 5137, 5486, 8162, 6554, 6962, 7443.]


One hit wonders

Option 1 : Throw Out One-Hit-Wonders
Advantages: Easy, works!
Disadvantages: Loss of sensitivity!

Option 2 : Use Multiple Filters
Filter 1 - Protein Mode
• ≥2 peptides/protein
• moderate spectrum threshold
Filter 2 - Peptide Mode
• 1 peptide/protein
• high spectrum threshold
Advantages: More sensitive!
Disadvantages: Pretty arbitrary!
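
Option 2 can be sketched as a single acceptance test per protein. The threshold values 0.80 and 0.95 and the function name `passes_filters` are hypothetical; the slide only says "moderate" and "high", which is exactly the arbitrariness it warns about.

```python
def passes_filters(peptide_scores, moderate=0.80, high=0.95):
    """Apply the two-filter scheme to one protein's peptide scores.

    moderate and high are hypothetical spectrum thresholds.
    """
    # Filter 1 - Protein Mode: at least 2 peptides above the moderate threshold.
    if sum(score >= moderate for score in peptide_scores) >= 2:
        return True
    # Filter 2 - Peptide Mode: a single peptide above the high threshold.
    return any(score >= high for score in peptide_scores)

print(passes_filters([0.85, 0.90]))  # True  (two moderate peptides)
print(passes_filters([0.97]))        # True  (one high-scoring peptide)
print(passes_filters([0.85]))        # False (one-hit wonder, moderate score)
```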


Option 3 : Protein-level FDR. Global Protein FDRs only accurate with >100 Proteins

[Figure: Uncertainty in Protein FDR (0% to 12%) against Number of Confidently IDed Proteins (0 to 600); annotation: 1% Error In FDR Estimation.]
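
One way to see why small protein lists give unreliable FDRs is to treat each identification as an independent trial and look at the binomial standard error of the estimated rate. This binomial model is an assumption made for illustration; the slide does not say how its curve was computed, and the function name `fdr_standard_error` is hypothetical.

```python
from math import sqrt

def fdr_standard_error(fdr, n_proteins):
    """Binomial standard error of an FDR estimated from n_proteins identifications."""
    return sqrt(fdr * (1.0 - fdr) / n_proteins)

# Uncertainty of a 1% FDR estimate for lists of different sizes.
for n in (25, 100, 400):
    print(n, f"{fdr_standard_error(0.01, n):.3%}")
```

Under this assumption, the standard error of a 1% FDR estimate is itself close to 1% when the list holds about 100 proteins, and it shrinks only with the square root of the list size.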


Local Protein FDRs…

• Estimate the likelihood that a single protein of interest is present

• Are trouble at best due to stochastic sampling

• Shouldn’t be used with <500 likely proteins


Genomic and proteomic databases

http://www.genomesonline.org/cgi-bin/GOLD/bin/gold.cgi


Search against database.

Mass spectrum

Database Search


Proteomic databases

Choose your database according to your experimental setup

• GenBank allows you to submit DNA sequences and not protein sequences. Databases are not always correct. Sequence errors can be corrected only by the submitter.

• Familiarize yourself with details about your database (e.g. UniProt / IPI / RefSeq).

• Transatlantic Divide :
  GenBank / EMBL
  NCBI / EBI
  Tranche / PRIDE
  NCBInr / TrEMBL

• Searching with different databases and search algorithms and their effect on IDs. Effect of search algorithm on Protein grouping.

• Database size affects ranking of proteins, scoring threshold, etc. Using combined databases / Contaminant sequences.


Publication Standards

• In 2006 MCP published guidelines for reporting peptide and protein identifications

• Other proteomics journals have adopted similar standards

• Revised “Paris 2” guidelines (1/1/2010).


Guidelines remind you to…

• Present a complete methods/results section
  I. Search Parameters and Acceptance Criteria
  VI. Raw Data Submission

• Follow smart criteria for choosing results to publish
  II. Protein and Peptide Identification
  IV. Protein Inference from Peptide Assignments
  V. Quantification

• Not over-report your results
  III. Post-Translational Modifications


Software Can Make Guideline Fulfillment Easier


Where are the Guidelines?

Molecular & Cellular Proteomics: Bradshaw, R. A., Burlingame, A. L., Carr, S., Aebersold, R., Reporting Protein Identification Data: The next Generation of Guidelines. Mol. Cell. Proteomics, 5:787-788, 2006.

Journal of Proteome Research: Beavis, R., Editorial: The Paris Consensus. J. Proteome Res., 2005, 4 (5), p 1475

Proteomics: Wilkins, M. R., Appel, R. D., Van Eyk, J. E., Maxey, C. M., et al., Guidelines for the next 10 years of proteomics. Proteomics. 2006, 6, 1, 4-8.

http://www.mcponline.org/site/misc/MSDataResources.xhtml


A few take home points…

• We identify Proteins (not Peptides)!
  – Can’t stop at Peptide FDRs and Probabilities

• One-Hit-Wonders can be wrong and need to be seriously investigated (manually or mathematically)

• You can compute Protein level FDRs
  – But take them with a grain of salt!

• Occam’s Razor can simplify Shared Peptides

• Publication Standards exist to help you


Questions ?

[email protected]

http://sitemaker.umich.edu/iwsmoi