Paulo Costa Carvalho
Laboratory for Proteomics and Protein Engineering
Fiocruz - PR
Analyzing shotgun proteomic data
pcarvalho.com
Outline
• Shotgun proteomics
  • Motivation for studying proteomics
  • What is shotgun proteomics
• Data analysis
  • Protein identification
  • Label-free quantitation
  • PatternLab for proteomics
• Final Considerations
J. Proteome Res., 2011, 10 (1), pp 153–160. DOI: 10.1021/pr100677g
Motivations
Editorial
“There has been an unprecedented improvement in the quality and quantity of commercial proteomics data generation technologies, making data generation more accessible to many researchers. However, more and more discoveries will be led by researchers in command of the skills necessary to mine and extensively interpret the volumes of data. Already the ability to generate data vastly outpaces our ability to interpret it, and the lack of expertise in interpreting data is the current gating factor in the advancement of proteomics sciences. Proteomics scientists with training solely in data generation techniques will be shut out of more and more research opportunities.”
Nuno Bandeira, July 2011
Computational Proteomics
Too many roads not taken
Edwards AM, Nature, Feb 2011
Proteomics has revolutionized biochemical research
LC / MS shotgun proteomic data
[Figure: LC/MS data plotted as mass/charge (m/z) versus time. MS/MS fragmentation of the doubly charged precursor peptide AFYLK (NH2-A-F-Y-L-K-COOH) into B (N-terminal) and Y (C-terminal) ions, built up one cleavage at a time: A | FYLK, AF | YLK, AFY | LK, AFYL | K.]
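The fragmentation of the example peptide AFYLK can be reproduced numerically. The following is a minimal sketch, assuming standard monoisotopic residue masses and singly charged fragments; the function names (`b_ions`, `y_ions`, `precursor_mz`) are illustrative and do not come from any tool mentioned in these slides.

```python
# Monoisotopic residue masses (Da) for the amino acids in the example peptide.
RESIDUE_MASS = {
    "A": 71.03711,
    "F": 147.06841,
    "Y": 163.06333,
    "L": 113.08406,
    "K": 128.09496,
}
WATER = 18.010565   # H2O, carried by the C-terminal (Y) fragment and the intact peptide
PROTON = 1.007276   # mass of one proton (the charge carrier)


def b_ions(peptide):
    """Singly charged B-ion m/z values: N-terminal prefixes of the peptide."""
    masses, total = [], 0.0
    for residue in peptide[:-1]:
        total += RESIDUE_MASS[residue]
        masses.append(total + PROTON)
    return masses


def y_ions(peptide):
    """Singly charged Y-ion m/z values: C-terminal suffixes of the peptide."""
    masses, total = [], 0.0
    for residue in reversed(peptide[1:]):
        total += RESIDUE_MASS[residue]
        masses.append(total + WATER + PROTON)
    return masses


def precursor_mz(peptide, charge=2):
    """m/z of the intact peptide carrying `charge` protons."""
    neutral = sum(RESIDUE_MASS[r] for r in peptide) + WATER
    return (neutral + charge * PROTON) / charge


if __name__ == "__main__":
    peptide = "AFYLK"
    for i, mz in enumerate(b_ions(peptide), start=1):
        print(f"b{i} {peptide[:i]:<5} {mz:9.4f}")
    for i, mz in enumerate(y_ions(peptide), start=1):
        print(f"y{i} {peptide[-i:]:<5} {mz:9.4f}")
    print(f"precursor 2+ {precursor_mz(peptide):9.4f}")
```

Each cleavage site yields one B ion and one complementary Y ion, which is why a five-residue peptide gives four ions of each type.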
Strategies for protein identification by mass spectrometry
• Peptide sequence match
  • Advantage: most sensitive (when the protein is in the DB)
  • Disadvantage: sequence must be in the DB; needs to specify PTMs a priori
• De novo sequencing
  • Advantage: does not require a database
  • Disadvantage: most error prone
• Sequence Tag Search
  • Advantages: no need to specify PTMs a priori; tolerant to small changes in the sequence
  • Disadvantages: not as sensitive as PSM when the protein is in the DB
• De novo sequencing
  • Advantage: does not require a database
  • Disadvantage: most error prone
[Figure: MS/MS spectrum (intensity vs. m/z) with peaks annotated by amino acid letters.]
• Sequence Tag Search
  • Advantages: no need to specify PTMs a priori; tolerant to small sequence changes
  • Disadvantages: not as sensitive as PSM when the protein is in the DB
Na S et al., MCP, 2008
• Peptide sequence match
  • Advantage: most sensitive (when the protein is in the DB)
  • Disadvantage: sequence must be in the DB; needs to specify PTMs a priori
Protein Identification using a database
ProLuCID, X!Tandem, OMSSA, Andromeda, SEQUEST, Mascot, …
Interpreting MS/MS Proteomics Results
Brian C. Searle
Proteome Software Inc., Portland, Oregon, USA
NPC Progress Meeting (February 2nd, 2006)
Illustrated by Toni Boudreault
[Figure: MS/MS spectrum (intensity vs. m/z) with peak-to-peak mass differences labeled R, I, T, P, E, A and H2O.]
B-type, A-type, Y-type ions
All these peaks are seen together simultaneously, and we don’t even know what type of ion they are, making the mass differences approach even more difficult.
Finally, as with all analytical techniques, there’s noise, producing a final spectrum that looks like… this, on a good day. And so it’s actually fairly difficult to…
XCalibur :: Show experimental data
Known Ion Types
• B-type ions
• A-type ions
• Y-type ions
We knew a couple of things about peptide fragmentation. Not only do we know to expect B, A, and Y ions, but…

Known Ion Types
• B-type ions: 100%
• A-type ions: 20%
• Y-type ions: 100%
• B- or Y-type +2H ions: 50%
• B- or Y-type -NH3 ions: 20%
• B- or Y-type -H2O ions: 20%
… likelihood of seeing each type of ion, where generally B and Y ions are most prominent.
If we know the amino acid sequence of a peptide, we can guess what the spectra should look like!
So it’s actually pretty easy to guess what a spectrum should look like if we know what the peptide sequence is.
ELVISLIVESK
Model Spectrum
*Courtesy of Dr. Richard Johnson, http://www.hairyfatguy.com/
So as an example, consider the peptide ELVIS LIVES K that was synthesized by Rich Johnson in Seattle.
Model Spectrum
We can create a hypothetical spectrum based on our rules:
• B/Y type ions (100%)
• B/Y +2H type ions (50%)
• A type ions, B/Y -NH3/-H2O (20%)
Where B and Y ions are estimated at 100%, plus 2 ions are estimated at 50%, and other stragglers are at 20%.
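The rules above are enough to generate a hypothetical spectrum for any peptide. The sketch below is one possible reading of them, not the code behind the slides: "+2H" ions are interpreted here as doubly charged B/Y fragments (an assumption), A ions are computed as B ions minus CO, and the function name `model_spectrum` is illustrative.

```python
# Monoisotopic residue masses (Da) for the residues of ELVISLIVESK.
RESIDUE_MASS = {"E": 129.04259, "L": 113.08406, "V": 99.06841,
                "I": 113.08406, "S": 87.03203, "K": 128.09496}
WATER, AMMONIA, CO, PROTON = 18.010565, 17.026549, 27.994915, 1.007276

# Relative intensities taken from the rules on the slide.
MAIN, DOUBLY, STRAGGLER = 100.0, 50.0, 20.0


def model_spectrum(peptide):
    """Return a sorted list of (m/z, relative intensity) peaks for `peptide`."""
    peaks = []
    for cut in range(1, len(peptide)):
        b = sum(RESIDUE_MASS[r] for r in peptide[:cut]) + PROTON
        y = sum(RESIDUE_MASS[r] for r in peptide[cut:]) + WATER + PROTON
        peaks.append((b - CO, STRAGGLER))                # A-type ion (B minus CO)
        for mz in (b, y):
            peaks.append((mz, MAIN))                     # B or Y ion
            peaks.append(((mz + PROTON) / 2, DOUBLY))    # same fragment with two protons
            peaks.append((mz - AMMONIA, STRAGGLER))      # neutral loss of NH3
            peaks.append((mz - WATER, STRAGGLER))        # neutral loss of H2O
    return sorted(peaks)


if __name__ == "__main__":
    for mz, intensity in model_spectrum("ELVISLIVESK"):
        print(f"{mz:10.4f} {intensity:6.1f}")
```

For the 11-residue example peptide this produces 9 peaks at each of the 10 cleavage sites.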
Model Spectrum
So if we consider the spectrum that was derived from the ELVIS LIVES K peptide…
Model Spectrum
We can find where the overlap is between the hypothetical and the actual spectra…
Model Spectrum
And say conclusively based on the evidence that the spectrum does belong to the ELVIS LIVES K peptide.
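Finding the overlap between a hypothetical and an observed spectrum can be sketched as a simple peak-matching step. This is an illustrative simplification, assuming peaks are given as (m/z, intensity) pairs and that two peaks match when they lie within a fixed m/z tolerance; the name `shared_peaks` and the 0.5 tolerance are hypothetical choices.

```python
def shared_peaks(model_peaks, observed_peaks, tolerance=0.5):
    """Count model peaks that have an observed peak within `tolerance` m/z units.

    Returns (number of matched model peaks, summed model intensity of the matches).
    """
    observed_mz = [mz for mz, _ in observed_peaks]
    matched, total = 0, 0.0
    for mz, intensity in model_peaks:
        if any(abs(mz - obs) <= tolerance for obs in observed_mz):
            matched += 1
            total += intensity
    return matched, total


if __name__ == "__main__":
    model = [(100.0, 100.0), (200.0, 50.0), (300.0, 20.0)]
    observed = [(100.2, 5.0), (250.0, 3.0), (300.4, 1.0)]
    print(shared_peaks(model, observed))
```

A real search engine scores the overlap more carefully, but the principle is the same: the more predicted peaks that are actually observed, the stronger the evidence for the peptide.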
Sequencing Explosion
• 1977 Shotgun sequencing invented, bacteriophage φX174 sequenced
• 1989 Yeast Genome project announced
• 1990 Human Genome project announced
• 1992 First chromosome (Yeast) sequenced
• 1995 H. influenzae sequenced
• 1996 Yeast Genome sequenced
• 2000 Human Genome draft
…
Eng, J. K.; McCormack, A. L.; Yates, J. R. III J. Am. Soc. Mass Spectrom. 1994, 5, 976-989.
In 1994 Jimmy Eng and John Yates published a technique to exploit genome sequencing for use in tandem mass spectrometry. And the idea was…
SEQUEST
…instead of searching all possible peptide sequences, search only those in genome databases.
Now, in the post-genomic world this seems like a pretty trivial idea, but back then there was a lot of assumption placed on the idea that we’d actually have a complete Human genome in a reasonable amount of time.
SEQUEST Model Spectrum
For a scoring function they decided to use Cross-Correlation, which basically sums the peaks that overlap between hypothetical and the actual spectra, like so.
And then they shifted the spectra back and forth so that the peaks shouldn’t align.
They used this number, also called the Auto-Correlation, as their background.
SEQUEST XCorr
[Figure: correlation score vs. offset (AMU), showing the Cross Correlation (direct comparison) and the Auto Correlation (background). Gentzel M. et al Proteomics 3 (2003) 1597-1610]
This is another representation of the Cross Correlation and the Auto Correlation.

SEQUEST XCorr
$$\mathrm{XCorr} = \frac{\mathrm{CrossCorr}}{\mathrm{avg}\left(\mathrm{AutoCorr},\ \mathrm{offset} = -75\ \mathrm{to}\ 75\right)}$$
The XCorr score is the Cross Correlation divided by the average of the auto correlation over a 150 AMU range.
The XCorr is high if the direct comparison is significantly greater than the background, which is obviously good for peptide identification.
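The direct comparison and the shifted background can be sketched on binned spectra. This is a toy illustration under stated assumptions, not SEQUEST's code: spectra are plain lists with one intensity per 1 AMU bin, the background averages all offsets from -75 to 75 except 0, and the function names are hypothetical. Note that the original SEQUEST publication is usually described as subtracting the averaged background from the direct comparison rather than dividing by it, so both forms are included for comparison.

```python
def correlation(model, observed, offset):
    """Dot product of two binned spectra with `observed` shifted by `offset` bins."""
    total = 0.0
    for i, value in enumerate(model):
        j = i + offset
        if 0 <= j < len(observed):
            total += value * observed[j]
    return total


def direct_and_background(model, observed, max_offset=75):
    """Return (correlation at offset 0, mean correlation over the nonzero offsets)."""
    direct = correlation(model, observed, 0)
    shifted = [correlation(model, observed, t)
               for t in range(-max_offset, max_offset + 1) if t != 0]
    return direct, sum(shifted) / len(shifted)


def xcorr_ratio(model, observed, max_offset=75):
    """Score as worded on the slide: direct comparison divided by the background."""
    direct, background = direct_and_background(model, observed, max_offset)
    return direct / background if background > 0 else float("inf")


def xcorr_difference(model, observed, max_offset=75):
    """Alternative form: direct comparison minus the background."""
    direct, background = direct_and_background(model, observed, max_offset)
    return direct - background


if __name__ == "__main__":
    model = [0.0] * 200
    observed = [0.0] * 200
    for bin_index in (10, 50, 120):
        model[bin_index] = 1.0
        observed[bin_index] = 1.0
    observed[60] = 1.0  # one extra peak that the model does not predict
    print(xcorr_ratio(model, observed), xcorr_difference(model, observed))
```

In both forms the score grows when the aligned spectra agree much better than the misaligned ones.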
SEQUEST DeltaCn
$$\mathrm{DeltaCn} = \frac{\mathrm{XCorr}_1 - \mathrm{XCorr}_2}{\mathrm{XCorr}_1}$$
And this XCorr is actually a pretty robust method for estimating how accurate the match is, and so far, there really haven’t been any significant improvements on it.
The DeltaCn is another score that scientists often use. It measures how good the XCorr is relative to the next best match.
As you can see, this is actually a pretty crude calculation.
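The DeltaCn calculation is small enough to write out directly. A minimal sketch, assuming the candidate XCorr values for one spectrum are available as a list; the function name `delta_cn` is illustrative.

```python
def delta_cn(xcorrs):
    """(XCorr1 - XCorr2) / XCorr1 for the two best-scoring candidate peptides."""
    ranked = sorted(xcorrs, reverse=True)
    if len(ranked) < 2 or ranked[0] <= 0:
        raise ValueError("need at least two candidates and a positive best XCorr")
    best, second = ranked[0], ranked[1]
    return (best - second) / best


if __name__ == "__main__":
    # The best match (4.0) is compared against the runner-up (3.0).
    print(delta_cn([3.0, 4.0, 1.0]))
```

A value near 0 means the runner-up scored almost as well as the top hit; a value near 1 means the top hit stands alone.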
Raw Xtractor / Pause for search
* Show an MS2 file
ProLuCID
ProLuCID is a fast and sensitive tandem mass spectra-based protein identification program recently developed in the Yates laboratory at The Scripps Research Institute.
Show ProLuCID Runner
Carvalho PC et al; unpublished
ProLuCID runner
Workflow
[Diagram: MS data and a sequence database are passed to the search engine (e.g. ProLuCID, SEQUEST, etc.), which produces PSMs that are then used for protein identification.]
The Challenge: How to pinpoint trustworthy identifications
1 spectrum = 1 identification!
Filtering data
In the beginning…
[Figure: list of spectra with columns spectrum, scores, protein, peptide, sorted by match score.]
• SEQUEST: XCorr > 2.5, dCn > 0.1
• Mascot: Score > 45
• X!Tandem: Score < 0.01
Spectra were sorted according to some score and then a threshold value was set. Different programs have different scoring schemes, so SEQUEST, Mascot, and X!Tandem use different thresholds. Different thresholds may also be needed for different charge states, sample complexity, and database size.
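The threshold model described here amounts to a sort followed by a fixed cutoff. A minimal sketch using the SEQUEST cutoffs quoted on the slide (XCorr > 2.5, dCn > 0.1); the `Psm` record and the function names are hypothetical and do not reflect any search engine's output format.

```python
from dataclasses import dataclass


@dataclass
class Psm:
    """One peptide-spectrum match with its SEQUEST-style scores (illustrative)."""
    spectrum: str
    peptide: str
    xcorr: float
    delta_cn: float


def passes_thresholds(psm, min_xcorr=2.5, min_delta_cn=0.1):
    """Apply the fixed cutoffs: both scores must exceed their thresholds."""
    return psm.xcorr > min_xcorr and psm.delta_cn > min_delta_cn


def filter_psms(psms, min_xcorr=2.5, min_delta_cn=0.1):
    """Sort by XCorr (best first) and keep only the matches above the cutoffs."""
    ranked = sorted(psms, key=lambda p: p.xcorr, reverse=True)
    return [p for p in ranked if passes_thresholds(p, min_xcorr, min_delta_cn)]


if __name__ == "__main__":
    psms = [
        Psm("scan_1", "ELVISLIVESK", 3.1, 0.25),
        Psm("scan_2", "AFYLK", 2.7, 0.05),   # fails the dCn cutoff
        Psm("scan_3", "LIVESK", 1.9, 0.30),  # fails the XCorr cutoff
    ]
    for psm in filter_psms(psms):
        print(psm.spectrum, psm.peptide)
```

Because the cutoffs are hard-coded per score, every change of instrument, charge state, or database would need its own set of numbers.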
There has to be a better way
The threshold model has these problems, which PeptideProphet, DTASelect and others try to solve:
• Poor sensitivity/specificity trade-off, unless you consider multiple scores simultaneously.
• No way to choose an error rate (p=0.05).
• Need to have different thresholds for:
  – different instruments (QTOF, TOF-TOF, IonTrap)
  – ionization sources (electrospray vs MALDI)
  – sample complexities (2D gel spot vs MudPIT)
  – different databases (SwissProt vs NR)
• Impossible to compare results from different search algorithms, multiple instruments, and so on.
Creating a discriminant score
[Figure: list of spectra with columns spectrum, scores, protein, peptide, sorted by match score.]
PeptideProphet starts with a discriminant score. If an application uses several scores (SEQUEST uses XCorr, DCn, and Sp scores; Mascot uses ion scores plus identity and homology thresholds), these are first converted to a single discriminant score.
Scaffold:: Proteome Software
[Figure: sensitivity vs. error plot; the ideal corner is annotated "correctly identifies everything, with no error". Keller et al, Anal Chem 2002]
This graph shows the trade-offs between the errors (false identifications) and the sensitivity (the percentage of possible peptides identified).
The ideal is zero error and everything identified (sensitivity = 100%).
PeptideProphet corresponds to the curved line. Squares 1–5 are thresholds chosen by other authors.
Mixture of distributions
[Figure: histogram of the number of spectra in each bin (0 to 200) vs. discriminant score (D) (-3.9 to 7.3), with an “incorrect” and a “correct” distribution.]
This histogram shows the distributions of correct and incorrect matches.
PeptideProphet assumes that these distributions are standard statistical distributions.
Using curve-fitting, PeptideProphet draws the correct and incorrect distributions.
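The curve-fitting idea can be illustrated with a two-component mixture model fitted by expectation-maximization. This sketch is a deliberate simplification and not PeptideProphet's implementation: both the "incorrect" and "correct" distributions are assumed to be Gaussian, the scores are simulated, and all function names are hypothetical.

```python
import math
import random


def normal_pdf(x, mean, sd):
    """Density of a normal distribution with the given mean and standard deviation."""
    return math.exp(-0.5 * ((x - mean) / sd) ** 2) / (sd * math.sqrt(2 * math.pi))


def fit_two_gaussians(scores, iterations=200):
    """EM fit; returns [[weight, mean, sd] for 'incorrect', [weight, mean, sd] for 'correct']."""
    ordered = sorted(scores)
    half = len(ordered) // 2
    params = []
    # Initialize the two components from the lower and upper halves of the scores.
    for chunk in (ordered[:half], ordered[half:]):
        mean = sum(chunk) / len(chunk)
        sd = math.sqrt(sum((x - mean) ** 2 for x in chunk) / len(chunk))
        params.append([0.5, mean, max(sd, 1e-3)])
    for _ in range(iterations):
        # E-step: probability that each score belongs to the 'correct' component.
        resp = [probability_correct(x, params) for x in scores]
        # M-step: re-estimate weight, mean and sd of each component.
        for k, weights in enumerate(([1.0 - r for r in resp], resp)):
            total = sum(weights)
            if total == 0:
                continue
            mean = sum(w * x for w, x in zip(weights, scores)) / total
            var = sum(w * (x - mean) ** 2 for w, x in zip(weights, scores)) / total
            params[k] = [total / len(scores), mean, max(math.sqrt(var), 1e-3)]
    return params


def probability_correct(score, params):
    """Posterior probability that a match with this discriminant score is correct."""
    incorrect = params[0][0] * normal_pdf(score, params[0][1], params[0][2])
    correct = params[1][0] * normal_pdf(score, params[1][1], params[1][2])
    if incorrect + correct == 0:
        return 0.5
    return correct / (incorrect + correct)


if __name__ == "__main__":
    random.seed(0)
    # Simulated discriminant scores: 600 incorrect and 400 correct matches.
    scores = [random.gauss(-1.0, 1.0) for _ in range(600)]
    scores += [random.gauss(4.0, 1.0) for _ in range(400)]
    fitted = fit_two_gaussians(scores)
    print(fitted)
    print(probability_correct(2.0, fitted))
```

Once the two curves are fitted, every match gets a probability of being correct instead of a pass/fail decision against a fixed threshold.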
Labeled decoy – False Discovery Rate
[Figure: decoy strategy for FDR. Target sequences and labeled decoys are searched together; the search result is shown as a histogram over the discriminant score (-3.9 to 7.3).]
Elias and Gygi, Nature Methods, 2007
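The labeled-decoy strategy can be sketched as counting decoy hits above a score threshold. This is a minimal illustration under stated assumptions: each match is a (score, is_decoy) pair, and the FDR is estimated as decoys divided by targets, which is one common convention among several; the function names are hypothetical and no specific tool's procedure is implied.

```python
def fdr_at_threshold(psms, threshold):
    """Estimated FDR among matches scoring at or above `threshold`.

    `psms` is a list of (score, is_decoy) pairs; the estimate is decoys / targets.
    """
    targets = sum(1 for score, is_decoy in psms if score >= threshold and not is_decoy)
    decoys = sum(1 for score, is_decoy in psms if score >= threshold and is_decoy)
    return decoys / targets if targets else 0.0


def threshold_for_fdr(psms, max_fdr=0.01):
    """Lowest score threshold whose estimated FDR does not exceed `max_fdr`."""
    best = None
    for threshold in sorted({score for score, _ in psms}, reverse=True):
        if fdr_at_threshold(psms, threshold) <= max_fdr:
            best = threshold
    return best


if __name__ == "__main__":
    psms = [(9, False), (8, False), (7, False), (6, True), (5, False), (4, True)]
    cutoff = threshold_for_fdr(psms, max_fdr=0.01)
    print(cutoff, fdr_at_threshold(psms, cutoff))
```

Unlike a fixed score cutoff, the threshold here is chosen from the data so that the accepted list meets a stated error rate.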
Search Engine Processor
SVM - example
Summary: “The use of iProphet in the TPP increases the number of correctly identified peptides at a constant false discovery rate (FDR) as compared to both PeptideProphet and another state-of-the-art tool Percolator.”
Maximizing proteins under a given FDR
Unlabeled Decoys – False Discovery Rate
[Figure: new FDR strategy. Target sequences, labeled decoys and unlabeled decoys are searched together; the search result is shown as a histogram over the discriminant score (-3.9 to 7.3).]
U-Decoy
         Total Identified Spectra   LD (spectra)   UD (spectra)
WNN      115248                     1152           4656
Bayes    108376                     1083           1064
Unlabeled Decoys – False Discovery Rate
           Spectra    Peptides   Proteins (FDR)   UL FDR
SEPro      104,654    17,840     1,283 (0.9%)     1%
Scaffold   88,970     15,406     1,160 (2.3%)     2%
Table I. Scaffold A refers to a 99% confidence level for proteins, 95% for peptides. Scaffold B refers to 95 and 80%, respectively for proteins and peptides.
Generating the SEPro Report
Relative quantitation
Thermo
Picture from Strassberger et al, JOP, 2010
Label free quantitation
* Search for examples in XCalibur
Scan 12048
How to deal with different charge states?
Subject to random sampling; what are its implications?
Differential Analysis is performed in two steps
Differential Analysis
Marginal Cases (found in only 1 condition)
Differential (found in both)
Venn Diagrams of the proteins identified by shotgun proteomics from a cell lysate in biological states B1 and B2. Panels A, B, and C consider only proteins that appeared in one or more, two or more, or in all three replicates, respectively.
Venn Diagrams of the proteins identified by shotgun proteomics from a cell lysate in biological states B1 (A) and B2 (B). R1, R2, and R3 refer to the replicates from each state.
What proteins can be considered as statistically different for marginal cases?
Num. Rep. (t)   Num. Proteins   Fraction ()   p-value
Low ()
1               613             0.637         0.180
2               283             0.294         0.056
3               66              0.069         0.019
Medium ()
1               297             0.310         0.141
2               417             0.435         0.042
3               245             0.255         0.015
High ()
1               168             0.176         0.112
2               185             0.193         0.033
3               604             0.631         0.011
Very High ()
1               59              0.070         0.083
2               62              0.073         0.024
3               725             0.857         0.008
Venn Diagram of the proteins identified by shotgun proteomics from a cell lysate in biological states B1 and B2. Proteins that could not be statistically claimed to be differentially expressed in one of the two states according to the proposed Bayesian approach (those for which p-value > 0.05) were automatically filtered out during the generation of the Venn Diagram.
Carvalho PC et al; Bioinformatics 2011
Traditional strategy - Data Dependent Analysis (DDA)
New strategy – Extended Data Independent Analysis (XDIA)
Results
• Number of identified spectra increased by 250% (improves label-free quantitation).
• Number of unique peptides increased by 35%.
Multiplexed spectrum identification
Confidence when integrating extracted ion chromatograms
[Figure: Data Dependent Analysis (DDA) vs. Extended Data Independent Analysis (XDIA) over time for two co-eluting peptide ions, A and B, of similar m/z; panel label: Peptide Mass.]
Spectral deconvolution and monoisotopic peak reassignment to aid in identification and XIC quantitation
Show SEProQ here
• Pinpoint differentially expressed proteins
• Venn Diagrams
• Gene Ontology Analysis
• Find trends in time-course experiments
PatternLab for proteomics: a one stop shop for data analysis
Carvalho PC et al., Current Protocols in Bioinformatics, 2010
Computational workflow
• Finding Statistically Differentially Expressed Proteins / Data Analysis: PatternLab for proteomics (Trends, Venn Diagrams, Differential Statistics, Gene Ontology Analysis, etc.)
• Protein Quantitation: Search Engine Processor / SEProQ
• Protein Identification / Quality control: ProLuCID => Search Engine Processor
• Search Engine Preprocessing: YADA, XDIA Processor, CPM
• Experimental: Data acquisition using the mass spectrometer (DDA, XDIA)