J. Proteome Res., 2011, 10 (1), pp 153–160. DOI: 10.1021/pr100677g

Paulo Costa Carvalho
Laboratory for Proteomics and Protein Engineering
Fiocruz - PR

Analyzing shotgun proteomic data

pcarvalho.com

Outline

• Shotgun proteomics
  • Motivation for studying proteomics
  • What is shotgun proteomics
• Data analysis
  • Protein identification
  • Label-free quantitation
  • PatternLab for proteomics
• Final considerations


Motivations


Editorial

“There has been an unprecedented improvement in the quality and quantity of commercial proteomics data generation technologies, making data generation more accessible to many researchers. However, more and more discoveries will be led by researchers in command of the skills necessary to mine and extensively interpret the volumes of data. Already the ability to generate data vastly outpaces our ability to interpret it, and the lack of expertise in interpreting data is the current gating factor in the advancement of proteomics sciences. Proteomics scientists with training solely in data generation techniques will be shut out of more and more research opportunities.”

Nuno Bandeira, July 2011

Computational Proteomics

Too many roads not taken

Edwards AM, Nature, Feb 2011





Outline

Proteomics has revolutionized biochemical research


LC / MS shotgun proteomic data

[Figure: LC/MS data plotted as mass/charge versus time, followed by a series of MS/MS spectra (m/z axis) of the doubly charged precursor of the peptide A-F-Y-L-K (NH2 to COOH), showing the successive B- and Y-type fragment ions]




Outline

Strategies for protein identification by mass spectrometry

• Peptide sequence match
  • Advantage: most sensitive (when the protein is in the DB)
  • Disadvantage: sequence must be in the DB; needs to specify PTMs a priori
• De novo sequencing
  • Advantage: does not require a database
  • Disadvantage: most error prone
• Sequence Tag Search
  • Advantages: no need to specify PTMs a priori; tolerant to small changes in the sequence
  • Disadvantages: not as sensitive as PSM when the protein is in the DB

• De novo sequencing
  • Advantage: does not require a database
  • Disadvantage: most error prone

[Figure: MS/MS spectrum (intensity versus m/z) with peaks annotated by the amino acid residues inferred by de novo sequencing]

• Sequence Tag Search
  • Advantages: no need to specify PTMs a priori; tolerant to small sequence changes
  • Disadvantages: not as sensitive as PSM when the protein is in the DB

Na S et al., MCP, 2008


• Peptide sequence match
  • Advantage: most sensitive (when the protein is in the DB)
  • Disadvantage: sequence must be in the DB; needs to specify PTMs a priori

Protein Identification using a database

• ProLuCID
• X!Tandem
• OMSSA
• Andromeda
• SEQUEST
• Mascot
• …

Interpreting MS/MS Proteomics Results

Brian C. Searle
Proteome Software Inc., Portland, Oregon, USA

NPC Progress Meeting (February 2nd, 2006)

Illustrated by Toni Boudreault

[Figure: series of MS/MS spectra (intensity versus m/z); the first is annotated with the residues R, I, T, P, E, A and an H2O loss]

B-type, A-type, Y-type ions

All these peaks are seen together simultaneously, and we don’t even know what type of ion they are, making the mass differences approach even more difficult.

Finally, as with all analytical techniques, there’s noise, producing a final spectrum that looks like this, on a good day. And so it’s actually fairly difficult to…

XCalibur :: Show experimental data

Known Ion Types

• B-type ions
• A-type ions
• Y-type ions

We knew a couple of things about peptide fragmentation. Not only do we know to expect B, A, and Y ions, but…

Known Ion Types

• B-type ions: 100%
• A-type ions: 20%
• Y-type ions: 100%
• B- or Y-type +2H ions: 50%
• B- or Y-type -NH3 ions: 20%
• B- or Y-type -H2O ions: 20%

… likelihood of seeing each type of ion, where generally B and Y ions are most prominent.

If we know the amino acid sequence of a peptide, we can guess what the spectra should look like!

So it’s actually pretty easy to guess what a spectrum should look like if we know what the peptide sequence is.

ELVISLIVESK

Model Spectrum

*Courtesy of Dr. Richard Johnson, http://www.hairyfatguy.com/

So as an example, consider the peptide ELVIS LIVES K that was synthesized by Rich Johnson in Seattle.

Model Spectrum

We can create a hypothetical spectrum based on our rules:

• B/Y type ions (100%)
• B/Y +2H type ions (50%)
• A type ions, B/Y -NH3/-H2O (20%)

Where B and Y ions are estimated at 100%, plus 2 ions are estimated at 50%, and other stragglers are at 20%.
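The rules above can be turned into a small sketch that builds a model spectrum for ELVISLIVESK. This is only an illustration: the function name `model_spectrum` is hypothetical, the "+2H" ions are assumed to mean doubly charged B/Y ions, and the table holds standard monoisotopic residue masses for just the residues in this peptide.

```python
# Monoisotopic residue masses (Da), only for the residues of the example peptide.
RESIDUE_MASS = {
    "E": 129.04259, "L": 113.08406, "V": 99.06841,
    "I": 113.08406, "S": 87.03203, "K": 128.09496,
}
PROTON, WATER, AMMONIA, CO = 1.007276, 18.010565, 17.026549, 27.994915


def model_spectrum(peptide):
    """Return sorted (m/z, relative intensity, label) tuples for a peptide."""
    peaks = []
    n = len(peptide)
    for i in range(1, n):
        # Singly charged b ion (N-terminal part) and y ion (C-terminal part).
        b = sum(RESIDUE_MASS[aa] for aa in peptide[:i]) + PROTON
        y = sum(RESIDUE_MASS[aa] for aa in peptide[i:]) + WATER + PROTON
        for name, mz in ((f"b{i}", b), (f"y{n - i}", y)):
            peaks.append((mz, 100, name))                         # B/Y ions: 100%
            peaks.append(((mz + PROTON) / 2, 50, name + "(2+)"))  # assumed meaning of +2H: 50%
            peaks.append((mz - AMMONIA, 20, name + "-NH3"))       # neutral losses: 20%
            peaks.append((mz - WATER, 20, name + "-H2O"))
        peaks.append((b - CO, 20, f"a{i}"))                       # A ions: 20%
    return sorted(peaks)


spectrum = model_spectrum("ELVISLIVESK")
for mz, intensity, label in spectrum[:5]:
    print(f"{label:10s} {mz:10.4f} {intensity:4d}%")
```

Each of the ten cleavage sites contributes a b ion, a y ion, their variants, and one a ion, so the model spectrum has far more peaks than the peptide has residues.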

Model Spectrum

So if we consider the spectrum that was derived from the ELVIS LIVES K peptide…

Model Spectrum

We can find where the overlap is between the hypothetical and the actual spectra…

Model Spectrum

And say conclusively based on the evidence that the spectrum does belong to the ELVIS LIVES K peptide.
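Finding the overlap between the hypothetical and the actual spectra can be sketched as a count of model peaks that have an observed peak nearby. The function name `shared_peaks`, the 0.5 m/z tolerance, and the peak lists are all illustrative assumptions, not values from the slides.

```python
def shared_peaks(model_mz, observed_mz, tolerance=0.5):
    """Count model peaks that have an observed peak within `tolerance` m/z."""
    matched = 0
    for mz in model_mz:
        if any(abs(mz - obs) <= tolerance for obs in observed_mz):
            matched += 1
    return matched


# Made-up peak lists, for illustration only.
model = [130.05, 243.13, 342.20, 147.11]
observed = [130.10, 200.00, 342.60, 147.05]
overlap = shared_peaks(model, observed)
print(overlap, "of", len(model), "model peaks matched")
```

A raw count like this ignores intensities, which is one reason search engines use a correlation-based score instead.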

• 1977 Shotgun sequencing invented, bacteriophage φX174 sequenced
• 1989 Yeast Genome project announced
• 1990 Human Genome project announced
• 1992 First chromosome (Yeast) sequenced
• 1995 H. influenzae sequenced
• 1996 Yeast Genome sequenced
• 2000 Human Genome draft

Sequencing Explosion

…

Eng, J. K.; McCormack, A. L.; Yates, J. R. III J. Am. Soc. Mass Spectrom. 1994, 5, 976-989.

In 1994 Jimmy Eng and John Yates published a technique to exploit genome sequencing for use in tandem mass spectrometry. And the idea was…

SEQUEST

…instead of searching all possible peptide sequences, search only those in genome databases.

Now, in the post-genomic world this seems like a pretty trivial idea, but back then there was a lot of assumption placed on the idea that we’d actually have a complete Human genome in a reasonable amount of time.

SEQUEST Model Spectrum

For a scoring function they decided to use Cross-Correlation, which basically sums the peaks that overlap between the hypothetical and the actual spectra, like so.

And then they shifted the spectra back and forth so that the peaks shouldn’t align.

They used this number, also called the Auto-Correlation, as their background.

SEQUEST XCorr

[Figure: correlation score versus offset (AMU), showing the cross correlation (direct comparison) and the auto correlation (background). Gentzel M. et al., Proteomics 3 (2003) 1597-1610]

This is another representation of the Cross Correlation and the Auto Correlation.

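SEQUEST XCorr

$$\mathrm{XCorr} = \frac{\mathrm{CrossCorr}}{\operatorname{avg}_{\text{offset} = -75\ \text{to}\ 75}\left(\mathrm{AutoCorr}\right)}$$

[Figure: correlation score versus offset (AMU), showing the cross correlation (direct comparison) and the auto correlation (background). Gentzel M. et al., Proteomics 3 (2003) 1597-1610]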

The XCorr score is the Cross Correlation divided by the average of the auto correlation over a 150 AMU range.

The XCorr is high if the direct comparison is significantly greater than the background, which is obviously good for peptide identification.
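A minimal sketch of the score as described in these slides: the direct comparison at offset 0 divided by the average correlation over offsets from -75 to 75. The helper names, the 1 AMU binning, and the choice to leave offset 0 out of the background are assumptions made for illustration. Published descriptions of SEQUEST often subtract the mean background rather than divide by it, so treat this as a picture of the idea, not of the program.

```python
def correlation(theoretical, observed, offset):
    """Sum of products of two binned spectra, with `observed` shifted by `offset` bins."""
    total = 0.0
    for i, value in enumerate(theoretical):
        j = i + offset
        if 0 <= j < len(observed):
            total += value * observed[j]
    return total


def xcorr_ratio(theoretical, observed, max_offset=75):
    """Direct comparison divided by the mean background over the offset window."""
    direct = correlation(theoretical, observed, 0)
    offsets = [t for t in range(-max_offset, max_offset + 1) if t != 0]
    background = sum(correlation(theoretical, observed, t) for t in offsets) / len(offsets)
    return direct / background if background > 0 else float("inf")


def binned(peak_bins, size=400):
    """Build a unit-intensity spectrum with one bin per AMU (hypothetical helper)."""
    spectrum = [0.0] * size
    for mz in peak_bins:
        spectrum[mz] = 1.0
    return spectrum


theoretical = binned([100, 150, 200])
observed = binned([100, 150, 200])
score = xcorr_ratio(theoretical, observed)
print(round(score, 1))
```

In this toy case all three peaks align at offset 0, while only the shifts of ±50 bins produce accidental overlaps, so the direct comparison stands well above the background.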

SEQUEST DeltaCn

$$\Delta C_n = \frac{\mathrm{XCorr}_1 - \mathrm{XCorr}_2}{\mathrm{XCorr}_1}$$

And this XCorr is actually a pretty robust method for estimating how accurate the match is, and so far, there really haven’t been any significant improvements on it.

The DeltaCn is another score that scientists often use. It measures how good the XCorr is relative to the next best match.

As you can see, this is actually a pretty crude calculation.
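As a small sketch, assuming the usual definition of DeltaCn as (XCorr1 − XCorr2) / XCorr1, where XCorr1 is the best match and XCorr2 the next best. The function name and the scores are made up for illustration.

```python
def delta_cn(xcorrs):
    """Relative gap between the best and second-best XCorr for one spectrum."""
    ranked = sorted(xcorrs, reverse=True)
    best, second = ranked[0], ranked[1]
    return (best - second) / best


# Hypothetical XCorr values of the candidate peptides for a single spectrum.
value = delta_cn([2.4, 3.2, 1.1])
print(round(value, 2))
```

A value near 0 means the top two candidates are nearly indistinguishable; a value near 1 means the best match stands alone.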


Raw Xtractor / Pause for search

* Show an MS2 file


ProLuCID

ProLuCID is a fast and sensitive tandem mass spectra-based protein identification program recently developed in the Yates laboratory at The Scripps Research Institute.

Show ProLuCID Runner

Carvalho PC et al.; unpublished

ProLuCID runner

Workflow

[Diagram: MS data and a sequence Database enter the Search Engine (e.g. ProLuCID, SEQUEST, etc.), which outputs PSMs used for Protein Identification]

The Challenge: How to pinpoint trustworthy identifications

1 spectrum = 1 identification!


Filtering data


In the beginning…

[Figure: table with columns spectrum, scores, protein, peptide, sorted by match score]

• SEQUEST: XCorr > 2.5, dCn > 0.1
• Mascot: Score > 45
• X!Tandem: Score < 0.01

Spectra were sorted according to some score and then a threshold value was set. Different programs have different scoring schemes, so SEQUEST, Mascot, and X!Tandem use different thresholds. Different thresholds may also be needed for different charge states, sample complexity, and database size.
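The threshold model can be sketched as one fixed rule per search engine, using the cutoffs quoted on this slide. The record layout (`xcorr`, `dcn`, `score` keys) and the example PSMs are hypothetical.

```python
# One acceptance rule per search engine, using the cutoffs from the slide.
THRESHOLDS = {
    "SEQUEST": lambda psm: psm["xcorr"] > 2.5 and psm["dcn"] > 0.1,
    "Mascot": lambda psm: psm["score"] > 45,
    "X!Tandem": lambda psm: psm["score"] < 0.01,
}


def accept(psms, engine):
    """Keep only the peptide-spectrum matches that pass the engine's fixed rule."""
    rule = THRESHOLDS[engine]
    return [psm for psm in psms if rule(psm)]


# Made-up SEQUEST results for illustration.
psms = [
    {"peptide": "ELVISLIVESK", "xcorr": 3.1, "dcn": 0.25},
    {"peptide": "AFYLK", "xcorr": 2.7, "dcn": 0.05},
    {"peptide": "RITPEA", "xcorr": 1.9, "dcn": 0.30},
]
kept = accept(psms, "SEQUEST")
print([psm["peptide"] for psm in kept])
```

Note that the second match is rejected on dCn alone and the third on XCorr alone; a fixed rule cannot trade one score against the other.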


There has to be a better way

The threshold model has these problems, which PeptideProphet, DTASelect and others try to solve:

• Poor sensitivity/specificity trade-off, unless you consider multiple scores simultaneously.
• No way to choose an error rate (p=0.05).
• Need to have different thresholds for:
  – different instruments (QTOF, TOF-TOF, IonTrap)
  – ionization sources (electrospray vs MALDI)
  – sample complexities (2D gel spot vs MudPIT)
  – different databases (SwissProt vs NR)
• Impossible to compare results from different search algorithms, multiple instruments, and so on.


Creating a discriminant score

[Figure: table with columns spectrum, scores, protein, peptide, sorted by match score]

PeptideProphet starts with a discriminant score. If an application uses several scores (SEQUEST uses XCorr, DCn, and Sp scores; Mascot uses ion scores plus identity and homology thresholds), these are first converted to a single discriminant score.
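Collapsing several scores into one discriminant score can be illustrated with a weighted sum. The weights and intercept below are invented for this sketch and are not PeptideProphet's coefficients; the function name and the example score values are hypothetical as well.

```python
# Hypothetical weights, chosen only to illustrate the idea of a linear discriminant.
WEIGHTS = {"xcorr": 2.0, "dcn": 5.0, "sp": 0.001}
INTERCEPT = -4.0


def discriminant_score(xcorr, dcn, sp):
    """Combine SEQUEST-style scores into a single number D."""
    return (WEIGHTS["xcorr"] * xcorr
            + WEIGHTS["dcn"] * dcn
            + WEIGHTS["sp"] * sp
            + INTERCEPT)


# Made-up score values for a convincing and an unconvincing match.
good = discriminant_score(xcorr=3.5, dcn=0.30, sp=900)
poor = discriminant_score(xcorr=1.5, dcn=0.02, sp=150)
print(round(good, 2), round(poor, 2))
```

Once every match has a single D, the spectra can be sorted and modeled along one axis instead of being filtered by several independent cutoffs.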


Scaffold:: Proteome Software

[Figure: sensitivity versus error; the ideal corner is annotated "correctly identifies everything, with no error". Keller et al., Anal Chem 2002]

This graph shows the trade-offs between the errors (false identifications) and the sensitivity (the percentage of possible peptides identified).

The ideal is zero error and everything identified (sensitivity = 100%).

PeptideProphet corresponds to the curved line. Squares 1–5 are thresholds chosen by other authors.

[Figure: histogram of the number of spectra in each bin versus discriminant score (D), with overlapping "incorrect" and "correct" distributions]

This histogram shows the distributions of correct and incorrect matches.

PeptideProphet assumes that these distributions are standard statistical distributions.

Using curve-fitting, PeptideProphet draws the correct and incorrect distributions.

Mixture of distributions
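Once the two distributions have been fitted, Bayes' rule gives the probability that a match with discriminant score D is correct. The sketch below assumes, purely for simplicity, that both distributions are normal; the means, standard deviations, and the prior fraction of correct matches are made-up values, not fitted ones.

```python
import math


def normal_pdf(x, mean, sd):
    """Density of a normal distribution at x."""
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2 * math.pi))


def probability_correct(d, prior_correct=0.3,
                        correct=(4.0, 1.5), incorrect=(-1.0, 1.2)):
    """Posterior probability that a match with score d is correct.

    `correct` and `incorrect` are hypothetical (mean, sd) pairs.
    """
    p_correct = prior_correct * normal_pdf(d, *correct)
    p_incorrect = (1 - prior_correct) * normal_pdf(d, *incorrect)
    return p_correct / (p_correct + p_incorrect)


for d in (-1.0, 1.5, 5.0):
    print(d, round(probability_correct(d), 4))
```

Scores deep inside the "incorrect" hump get a probability near 0, scores deep inside the "correct" hump get a probability near 1, and the region where the humps overlap is where the model earns its keep.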

[Diagram: target sequences and labeled decoys are searched together (decoy strategy for FDR), producing the search result]

Labeled decoy – False Discovery Rate

Elias and Gygi, Nature Methods, 2007
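A minimal sketch of how labeled decoys yield an FDR estimate: at a given score cutoff, the decoy hits that pass stand in for the false target hits that pass. The estimator used here (decoys divided by targets, assuming target and decoy databases of equal size) is one common choice among several; the function names and example scores are hypothetical.

```python
def fdr_at_threshold(psms, threshold):
    """Estimate the FDR among target matches scoring at or above `threshold`.

    `psms` is a list of (score, is_decoy) pairs.
    """
    targets = sum(1 for score, is_decoy in psms if score >= threshold and not is_decoy)
    decoys = sum(1 for score, is_decoy in psms if score >= threshold and is_decoy)
    return decoys / targets if targets else 0.0


def lowest_threshold(psms, max_fdr=0.01):
    """Lowest score cutoff whose estimated FDR does not exceed `max_fdr`."""
    passing = [score for score, _ in psms if fdr_at_threshold(psms, score) <= max_fdr]
    return min(passing) if passing else None


# Made-up search results: (discriminant score, matched a decoy sequence?).
psms = [(5.0, False), (4.5, False), (4.0, False),
        (3.0, True), (2.5, False), (2.0, True)]
print(lowest_threshold(psms, max_fdr=0.01))
print(fdr_at_threshold(psms, 2.5))
```

This lets the user pick the error rate first and derive the cutoff from the data, rather than fixing a cutoff and hoping the error rate is acceptable.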


Search Engine Processor

SVM - example


Summary: “The use of iProphet in the TPP increases the number of correctly identified peptides at a constant false discovery rate (FDR) as compared to both PeptideProphet and another state-of-the-art tool Percolator.”


Maximizing proteins under a given FDR


[Diagram: target sequences, labeled decoys, and unlabeled decoys (U-Decoy) are searched together (new FDR strategy); the search result is shown as a histogram of spectra versus discriminant score]

Unlabeled Decoys – False Discovery Rate

         Total Identified Spectra   LD (spectra)   UD (spectra)
WNN      115248                     1152           4656
Bayes    108376                     1083           1064

Unlabeled Decoys – False Discovery Rate

            Spectra    Peptides   Proteins (FDR)   UL FDR
SEPro       104,654    17,840     1,283 (0.9%)     1%
Scaffold    88,970     15,406     1,160 (2.3%)     2%

Table I. Scaffold A refers to a 99% confidence level for proteins, 95% for peptides. Scaffold B refers to 95 and 80%, respectively for proteins and peptides.


Generating the SEPro Report





Outline

Relative quantitation

Thermo

Picture from Strassberger et al, JOP, 2010

Label free quantitation

* Search for examples in xcalibur

Scan 12048

How to deal with different charge states?

Subject to random sampling; what are its implications?

Differential Analysis is performed in two steps

Differential Analysis

Marginal Cases (found in only 1 condition)

Differential (found in both)

Venn Diagrams of the proteins identified by shotgun proteomics from a cell lysate in biological states B1 and B2. Panels A, B, and C consider only proteins that appeared in one or more, two or more, or in all three replicates, respectively.
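The three panels can be reproduced with plain set operations: keep the proteins seen in at least k replicates of each state, then split them into the Venn regions. The function names and protein identifiers below are hypothetical.

```python
from collections import Counter


def proteins_in_at_least(replicates, k):
    """Proteins identified in at least k of the given replicates."""
    counts = Counter(p for replicate in replicates for p in set(replicate))
    return {p for p, n in counts.items() if n >= k}


def venn(b1_replicates, b2_replicates, k):
    """Venn regions for two biological states at replicate cutoff k."""
    b1 = proteins_in_at_least(b1_replicates, k)
    b2 = proteins_in_at_least(b2_replicates, k)
    return {"only_B1": b1 - b2, "both": b1 & b2, "only_B2": b2 - b1}


# Made-up identifications for three replicates of each state.
b1 = [{"P1", "P2", "P3"}, {"P1", "P2"}, {"P1", "P4"}]
b2 = [{"P2", "P5"}, {"P2", "P5"}, {"P2", "P3"}]

for k in (1, 2, 3):
    regions = venn(b1, b2, k)
    print(k, {name: sorted(members) for name, members in regions.items()})
```

Raising k shrinks every region, and a protein can move between regions as k changes (P2 is shared at k = 1 but exclusive to B2 at k = 3), which is why the marginal cases need a statistical treatment rather than a simple cutoff.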

Venn Diagrams of the proteins identified by shotgun proteomics from a cell lysate in biological states B1 (A) and B2 (B). R1, R2, and R3 refer to the replicates from each state.

What proteins can be considered as statistically different for marginal cases?

Group          Num. Rep. (t)   Num. Proteins   Fraction ()   p-value
Low ()         1               613             0.637         0.180
               2               283             0.294         0.056
               3               66              0.069         0.019
Medium ()      1               297             0.310         0.141
               2               417             0.435         0.042
               3               245             0.255         0.015
High ()        1               168             0.176         0.112
               2               185             0.193         0.033
               3               604             0.631         0.011
Very High ()   1               59              0.070         0.083
               2               62              0.073         0.024
               3               725             0.857         0.008

Venn Diagram of the proteins identified by shotgun proteomics from a cell lysate in biological states B1 and B2. Proteins that could not be statistically claimed to be differentially expressed in one of the two states according to the proposed Bayesian approach (those for which p-value > 0.05) were automatically filtered out during the generation of the Venn Diagram.

Carvalho PC et al; Bioinformatics 2011


Differential Analysis is performed in two steps

Differential Analysis

Marginal Cases (found in only 1 condition)

Differential (found in both)

Traditional strategy: Data Dependent Analysis (DDA)

New strategy: Extended Data Independent Analysis (XDIA)

Results

• Number of identified spectra increased by 250% (improves label-free quantitation).

• Number of unique peptides increased by 35%.

Multiplexed spectrum identification

Confidence when integrating extracted ion chromatograms

[Figure: Data Dependent Analysis (DDA) versus Extended Data Independent Analysis (XDIA) for co-eluting peptide ions A and B of similar m/z, plotted as peptide mass versus time]

Spectral deconvolution and monoisotopic peak reassignment to aid in identification and XIC quantitation




Outline


Show SEProQ here

• Pinpoint differentially expressed proteins
• Venn Diagrams
• Gene Ontology Analysis
• Find trends in time-course experiments

PatternLab for proteomics: a one stop shop for data analysis
Carvalho PC et al., Current Protocols in Bioinformatics, 2010

Computational workflow

• Finding Statistically Differentially Expressed Proteins / Data Analysis: PatternLab for proteomics (Trends, Venn Diagrams, Differential Statistics, Gene Ontology Analysis, etc.)
• Protein Quantitation: Search Engine Processor / SEProQ
• Protein Identification / Quality control: ProLuCID => Search Engine Processor
• Search Engine Preprocessing: YADA, XDIA Processor, CPM
• Experimental: Data acquisition using the mass spectrometer (DDA, XDIA)
