Mass Spectrometry-Based Methods for Protein Identification
Joseph A. Loo
Department of Biological Chemistry, David Geffen School of Medicine
Department of Chemistry and Biochemistry
University of California, Los Angeles, CA, USA
Genomics and Proteomics
Characterizing many genes and gene products simultaneously
Proteomics Aids Biological Research
[Figure: diagram linking biology, a complex protein mixture, protein separation, and mass spectrometry, which yields protein identification, protein modification, and protein abundance]
Proteomics - What is it?
An assay to systematically analyze the diverse properties of proteins
Biological processes are dynamic A quantitative comparison of states is required
The study of protein expression and function on a genome scale
Purpose: Examine altered gene expression pathways in disease states and under different environmental conditions
The completion of the human genome has provided researchers with the blueprint for life, and proteomics offers scientists the means for analyzing the expressed genome.
Genome to Proteome
[Figure: dsDNA (gene), transcription, mRNA, translation, protein (H2N ... COOH), mass spectrometry. Example protein sequence: MTDLKASSLRALKLMDLTTLNDDDTDEKVIALCHQAKTPVGNTAAICIYP 51 RFIPIARKTLKEQGTPEIRIATVTNFPHGNDDIDIALAETRAAIAYGADE 101 VDVVFPYRALMAGNEQVGFDLVKACKEACAAANVLLKVIIETGELKDEAL 151 IRKASEISI]
Approaches for Protein Identification
What is this protein?
Molecular weight
Isoelectric point
Amino acid composition
Other physical/chemical characteristics
Partial or complete amino acid sequence
Edman (N-terminal sequence), if the N-terminus is not blocked
C-terminal sequence (not commonly performed)
Mass spectrometry-measured information
Protein Identification by Mass Spectrometry
[Figure: workflow. 2-D gel electrophoresis (pI 6.5 to 4.5; MW 10 to 150 x 10^3), excise separated protein “spots”, in-gel trypsin digest, recover tryptic peptides, mass spectrum (m/z 500 to 3000; peaks at 717, 1089, 1272, 1401, 1547, 1700, 1857, 2384, 2791)]
Peptide mass fingerprint by MALDI-TOF or LC-ESI-MS. Additional sequence information can be obtained by MS/MS.
Protein identification by searching proteomic or genomic databases
Mass Spectrometry: A method to “weigh” molecules
A simple measurement of mass is used to confirm the identity of a molecule, but it can be used for much more…
Other information can be inferred from a weight measurement:
Post-translational modifications
Molecular interactions
Shape
Sequence
Physical dimensions
etc.
Mass Spectrometer for Proteomics
[Figure: instrument block diagram]
Pre-separation: liquid chromatography
Ion source
Mass analyzer: Time-of-Flight (TOF), Quadrupole TOF (QTOF), Ion Trap (IT), Fourier Transform-Ion Cyclotron Resonance (FT-ICR)
Ion detector
The Nobel Prize in Chemistry 2002
"for the development of methods for identification and structure analyses of biological macromolecules"
"for their development of soft desorption ionisation methods for mass spectrometric analyses of biological macromolecules"
John Fenn, Koichi Tanaka
Electrospray: Generation of aerosols and droplets
Electrospray Ionization (ESI)
Multiple charging: more charges for larger molecules
MW range > 150 kDa
Liquid introduction of analyte
Interface with liquid separation methods, e.g. liquid chromatography
Tandem mass spectrometry (MS/MS) for protein sequencing
[Figure: ESI-MS. Highly charged droplets give a mass spectrum (m/z 500 to 1100) with charge states 14+ through 22+]
ESI-MS of Large Proteins
[Figure: ESI-MS (Q-TOF), pH 7.5. Left: m/z spectrum (2750 to 4000) with labeled peaks at 3102, 3323, and 3543; charge-state labels (M+13H)13+ through (M+16H)16+. Right: mass-transformed spectrum (45000 to 48000) with peaks at 46048 and 46512]
Distribution of multiply charged molecules
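The charge of a peak in an ESI charge-state envelope can be worked out from two neighboring peaks, which is how a spectrum like the one above is converted to a mass. The sketch below assumes the peaks at m/z 3323 and 3102 are adjacent charge states of the same protein (an assumption, since the figure labels cannot be recovered exactly) and that every charge is a proton; the function names are illustrative.

```python
PROTON = 1.00728  # mass of a proton in Da

def charge_from_adjacent_peaks(mz_high, mz_low):
    """Charge z of the peak at mz_high, assuming the peak at mz_low
    is the same molecule carrying z + 1 protons."""
    # From z*(mz_high - p) = (z + 1)*(mz_low - p), solve for z.
    return round((mz_low - PROTON) / (mz_high - mz_low))

def neutral_mass(mz, z):
    """Neutral molecular mass M from m/z = (M + z*p) / z."""
    return z * (mz - PROTON)

z = charge_from_adjacent_peaks(3323.0, 3102.0)
masses = [neutral_mass(3323.0, z), neutral_mass(3102.0, z + 1)]
mean_mass = sum(masses) / len(masses)
print(f"z = {z}, M = {mean_mass:.0f} Da")
```

With these rounded peak positions, the estimate falls within a few Da of the 46512 peak in the mass-transformed spectrum. The same `neutral_mass` relation applies to peptides, e.g. a doubly charged ion.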
History of Electrospray Ionization
Malcolm Dole demonstrated the production of intact oligomers of polystyrene up to MW 500,000; mass analysis of large ions was problematic
John Fenn (Yale University)
Chemical engineer, expert in supersonic molecular beams
Began work on electrospray in 1981
Adapted ESI to operate on a more “conventional” mass spectrometer
Recognized that multiply charged ions were produced by ESI
Reduced the m/z range required
[Figure: schematic of the electrospray source showing positive and negative charges in the liquid, positively charged droplets, the cathode, and the power supply]
Electrospray process
Analyte dissolved in a suitable solvent flows through a small diameter capillary tube
Liquid in the presence of a high electric field generates a fine “mist” or aerosol spray of highly charged droplets
10^6 charges for a 30 micron droplet
Matrix-assisted Laser Desorption/Ionization (MALDI)
[Figure: MALDI with a Time-of-Flight (TOF) analyzer. A laser strikes the sample, which is held at high voltage; ions of mass m1, m2, m3 travel with velocities v1, v2, v3 through the drift region to the detector]
MALDI Mass Spectrometry of Large Proteins
[Figure: MALDI-MS of rat MVP. % intensity vs. m/z (about 30811 to 129797); labeled peaks at 36446, 48658, 58309, 97430, and 98563; (M+H)+ and (M+2H)2+ ions are marked]
MALDI
[Figure: pulsed laser light strikes the sample and matrix on the sample stage (target) held at 20 kV; peptide/protein ions are desorbed from the matrix]
Developed by Tanaka (Japan) and Hillenkamp/Karas (Germany)
Peptide/protein analyte of interest is co-crystallized on the MALDI target plate with an appropriate matrix: small, highly conjugated organic molecules which strongly absorb energy at a particular wavelength
Energy is transferred to the analyte indirectly, inducing desorption from the target surface
Analyte is ionized by gas-phase proton transfer (perhaps from ionized matrix molecules)
MALDI matrices (matrices for 337 nm irradiation)
[Figure: chemical structures of the three matrices]
4-hydroxy-α-cyanocinnamic acid (“alpha-cyano” or 4-HCCA): peptides
3,5-dimethoxy-4-hydroxycinnamic acid (sinapinic acid): proteins
2,5-dihydroxybenzoic acid (DHB): peptides and proteins
MALDI
337 nm irradiation is provided by a nitrogen (N2) laser
The target plate is inserted into the high vacuum region of the source and the sample is irradiated with a laser pulse. The matrix absorbs the laser energy and transfers energy to the analyte molecule. The molecules are desorbed and ionized during this stage of the process.
MALDI is most commonly interfaced to a time-of-flight (TOF) mass spectrometer.
R. Aebersold and M. Mann, Nature (2003), 422, 198-207.
Time-of-Flight Mass Spectrometer
[Figure: linear TOF. Ions of mass m1, m2, m3 leave the high-voltage source with velocities v1, v2, v3 and cross the drift region (length L) to the detector]
Principle of Operation of Linear TOF
A time-of-flight mass spectrometer measures the mass-dependent time it takes ions of different masses to move from the ion source to the detector. This requires that the starting time (the time at which the ions leave the ion source) is well-defined. Recall that the kinetic energy of an ion is:
$$KE = \frac{1}{2} m v^2 = eV$$
where “v” is ion velocity, “m” is mass, “e” is the charge on an electron (a singly charged ion is assumed), and “V” is the accelerating potential.
The ion velocity, v, is also the length of the flight path, L, divided by the flight time, t:
$$v = \frac{L}{t}$$
Substituting this expression for v into the kinetic energy relation, we can derive the working equation for the time-of-flight mass spectrometer:
$$m = \frac{2 e V}{L^2}\, t^2$$
Mass is proportional to (time)^2.
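As a rough numerical illustration of the linear TOF relation (kinetic energy equals charge times accelerating potential, velocity equals L/t), the sketch below converts between mass and flight time. The 20 kV potential echoes the MALDI source figure; the 1 m flight path and the function names are assumptions chosen only for illustration.

```python
import math

E_CHARGE = 1.602176634e-19  # elementary charge, C
DALTON = 1.66053906660e-27  # 1 Da in kg

def flight_time(mass_da, voltage, length_m, z=1):
    """Flight time in seconds for an ion of mass_da and charge z."""
    mass_kg = mass_da * DALTON
    return length_m * math.sqrt(mass_kg / (2 * z * E_CHARGE * voltage))

def mass_from_time(t, voltage, length_m, z=1):
    """Invert the relation: mass in Da from a measured flight time."""
    return 2 * z * E_CHARGE * voltage * t ** 2 / length_m ** 2 / DALTON

t1 = flight_time(1000.0, 20_000, 1.0)  # a 1000 Da peptide ion
t2 = flight_time(4000.0, 20_000, 1.0)  # four times heavier
print(f"{t1 * 1e6:.1f} us, {t2 * 1e6:.1f} us, ratio {t2 / t1:.2f}")
```

Quadrupling the mass doubles the flight time, which is the "mass is proportional to (time)^2" statement read in reverse.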
Approaches for Protein Sequencing and Identification
[Figure: an intact protein sequence is analyzed either directly by MS/MS (“Top Down”) or by enzymatic or chemical degradation followed by MS/MS (“Bottom Up”). Sequence shown:
MIRERICACVLALGMLTGFTHAFGSKDAAADGKPLVVTTIGMIADAVKNIAQGDVHLKGLMGP
GVDPHLYTATAGDVEWLGNADLILYNGLHLETKMGEVFSKLRGSRLVVAVSETIPVSQRLSLE
EAEFDPHVWFDVKLWSYSVKAVYESLCKLLPGKTREFTQRYQAYQQQLDKLDAYVRRKAQS
LPAERRVLVTAHDAFGYFSRAYGFEVKGLQGVSTASEASAHDMQELAAFIAQRKLPAIFIESSI
PHKNVEALRDAVQARGHVVQIGGELFSDAMGDAGTSEGTYVGMVTHNIDTIVAALAR]
Identification of proteins from gels
Proteins are separated first by high resolution two-dimensional polyacrylamide gel electrophoresis and then stained. At this point, to identify an individual or set of protein spots, several options can be considered by the researcher, depending on availability of techniques. For protein spots that appear to be relatively abundant (e.g., more than 1 pmol), traditional protein characterization methods may be employed.
Methods such as amino acid analysis and Edman sequencing can be used to provide necessary protein identification information. With 2-DE, approximate molecular weight and isoelectric point characteristics are provided. Augmented with information on amino acid composition and/or amino-terminal sequence, a confident identification can be obtained.
The sensitivity gains of using MS allow for the identification of proteins below the 1 pmol level and, in many cases, in the femtomole regime.
Protein Identification by Mass Spectrometry
[Figure: workflow. 2-D gel electrophoresis (pI 6.5 to 4.5; MW 10 to 150 x 10^3), excise separated protein “spots”, in-gel trypsin digest, recover tryptic peptides, mass spectrum (m/z 500 to 3000; peaks at 717, 1089, 1272, 1401, 1547, 1700, 1857, 2384, 2791)]
Peptide mass fingerprint by MALDI-TOF or LC-ESI-MS. Additional sequence information can be obtained by MS/MS.
Protein identification by searching proteomic or genomic databases
Protein Cleavage
For the application of mass spectrometry for protein identification, the protein bands/spots from a 2-D gel are excised and are exposed to a highly specific enzymatic cleavage reagent (e.g., trypsin cleaves on the C-terminal side of arginine and lysine residues). The resulting tryptic fragments are extracted from the gel slice and are then subjected to MS-methods. One of the major barriers to high throughput in the proteomic approach to protein identification is the “in-gel” proteolytic digestion and subsequent extraction of the proteolytic peptides from the gel. Common protocols for this process are often long and labor intensive.
[Figure: protein digestion robot]
Protein cleavage - proteolysis and chemical methods
Enzyme and chemical cleavage reagents: cleavage sites, comments
trypsin: C-terminal to R and K (except R-P, K-P bonds); R-K, K-K, R-R, K-R cleave slower
endoproteinase Lys-C: C-terminal to K (rarely K-P bond)
endoproteinase Arg-C: C-terminal to R (except R-P bond)
endoproteinase Glu-C (V8 protease): C-terminal to E and D (except E-P, D-P bonds)
chymotrypsin: C-terminal to F, Y, W, L, I, V, M (except X-P bond)
elastase: C-terminal to G, A, S, V, L, I (not very specific)
pepsin: C-terminal to F, L, E (pH 2-4 active range)
Asp-N: N-terminal to D (sometimes E)
thermolysin: N-terminal to L, I, V, F (and others to a lesser extent)
carboxypeptidase A/Y: cleaves C-terminal residues
mild acid: cleaves D-P bond
cyanogen bromide (CNBr): C-terminus of M; M-T and M-S cleave slower; oxidized M does not cleave
BNPS-skatole: cleaves W-X bond
PNGase F: cleaves N-linked (Asn-glyco) glycoproteins; leaves entire carbohydrate portion intact and converts N to D
alkaline phosphatase: dephosphorylation of phosphoproteins
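The trypsin rule in the table (cleave C-terminal to R and K, except before P) is simple enough to apply to a sequence in software, which is what database search programs do when they "digest" an entry. The following is a minimal sketch, not any particular program's implementation; the function name and the `missed_cleavages` parameter are illustrative, and the slower-cleaving sites noted in the table are treated like ordinary sites.

```python
def tryptic_digest(sequence, missed_cleavages=0):
    """Return tryptic peptides: cleave after K or R unless followed by P."""
    # Cut positions (index just after a cleavable residue).
    sites = [i + 1 for i, aa in enumerate(sequence[:-1])
             if aa in "KR" and sequence[i + 1] != "P"]
    bounds = [0] + sites + [len(sequence)]
    peptides = []
    for i in range(len(bounds) - 1):
        # Join up to `missed_cleavages` extra adjacent fragments.
        last = min(i + 2 + missed_cleavages, len(bounds))
        for j in range(i + 1, last):
            peptides.append(sequence[bounds[i]:bounds[j]])
    return peptides

# Example sequence taken from the "Protein identification from peptide
# fragments" figure in this deck.
seq = "SEMHIKHYTTKILGFREEGDSCPLKQWDDSKILVAVADKLLEYEEKILLFNSAKYLLDESSTYKLMHDDSV"
for peptide in tryptic_digest(seq):
    print(peptide)
```

Allowing one missed cleavage (`missed_cleavages=1`) also returns each pair of neighboring peptides joined together, which matters in practice because digestion is rarely complete.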
Mass spectrometry-based protein identification
A mass spectrum of the resulting digest products produces a “peptide map” or a “peptide fingerprint”. The measured masses can be compared to theoretical peptide maps derived from database sequences for identification. There are a few choices of mass analysis that can be selected from this point, depending on available instrumentation and other factors. The resulting peptide fragments can be subjected to MALDI-MS or ESI-MS analysis.
A small aliquot of the digest solution can be directly analyzed by MALDI-MS to obtain a peptide map. The resulting sequence coverage (relative to the entire protein sequence) displayed from the total number of tryptic peptides observed in the MALDI mass spectrum can be quite high, i.e., greater than 80% of the sequence, although it can vary considerably depending on the protein, sample amount, etc. The measured molecular weights of the peptide fragments along with the specificity of the enzyme employed can be searched and compared against protein sequence databases using a number of computer searching routines available on the Internet.
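The comparison of measured and theoretical peptide masses can be sketched as follows. The code computes monoisotopic (M+H)+ values from standard residue masses and reports which candidate peptides fall within a tolerance of a measured peak. The peak values are the first few from the MALDI spectrum shown later in this deck; the candidate peptides are read from the sequence in that figure. The 0.1 Da tolerance and all names are assumptions, and real search engines score matches rather than simply listing them.

```python
# Monoisotopic residue masses (Da)
MONO = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276, "V": 99.06841,
    "T": 101.04768, "C": 103.00919, "L": 113.08406, "I": 113.08406, "N": 114.04293,
    "D": 115.02694, "Q": 128.05858, "K": 128.09496, "E": 129.04259, "M": 131.04049,
    "H": 137.05891, "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.01056
PROTON = 1.00728

def peptide_mh(peptide):
    """Monoisotopic (M+H)+ of an unmodified peptide."""
    return sum(MONO[aa] for aa in peptide) + WATER + PROTON

def match_peaks(peaks, peptides, tol_da=0.1):
    """List (peptide, theoretical m/z, measured m/z) for every match."""
    matches = []
    for pep in peptides:
        theo = peptide_mh(pep)
        for peak in peaks:
            if abs(peak - theo) <= tol_da:
                matches.append((pep, round(theo, 2), peak))
    return matches

peaks = [1116.67, 1247.70, 1287.73, 1375.76, 1424.85, 1505.77]
candidates = ["DSDRILGILASK", "KTVVIDFDIGLR", "TVVIDFDIGLR", "ILR"]
hits = match_peaks(peaks, candidates)
for hit in hits:
    print(hit)
```

Three of the four candidates match a peak here; the short peptide ILR falls below the measured mass range, which is typical for very small tryptic peptides.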
Protein identification from peptide fragments
[Figure: the experimental route (protein, tryptic peptides, mass spectrum) is compared with the database route (protein sequence, theoretical tryptic peptides, theoretical mass spectrum). Example sequence: SEMHIKHYTTKILGFREEGDSCPLKQWDDSKILVAVADKLLEYEEKILLFNSAKYLLDESSTYKLMHDDSV]
MALDI-MS of tryptic peptides
[Figure: MALDI mass spectrum, m/z 1000 to 3000. Labeled peaks: 1116.67, 1247.70, 1287.73, 1375.76, 1424.85, 1505.77, 1574.20, 1665.89, 1811.85, 1849.12, 2005.07, 2476.21, 2550.52, 2719.48. Protein sequence with the matched tryptic peptides highlighted:
ARIIVVTSGK GGVGKTTSSA AIATGLAQKG KKTVVIDFDI GLRNLDLIMG CERRVVYDFV NVIQGDATLN QALIKDKRTE NLYILPASQT RDKDALTREG VAKVLDDLKA MDFEFIVCDS PAGIETGALM ALYFADEAII TTNPEVSSVR DSDRILGILA SKSRRAENGE EPIKEHLLLT RYNPGRVSRG DMLSMEDVLE ILRIKLVGVI PEDQSVLRAS NQGEPVILDI NADAGKAYAD TVERLLGER PFRFIEEEKK GFLKRLFGG]
* trypsin autolysis
All peaks are (M+H)+
ESI-MS and LC-MS for protein identification
An approach for peptide mapping similar to MALDI-MS uses ESI-MS. A peptide map can be obtained by analysis of the peptide mixture by ESI-MS. An advantage of ESI is its ease of coupling to separation methodologies such as HPLC. Thus, alternatively, to reduce the complexity of the mixture, the peptides can be separated by HPLC with subsequent mass measurement by on-line ESI-MS. The measured masses can be compared to sequence databases.
[Figure: LC-MS with ESI. Chromatogram (relative abundance vs. time, 5 to 12 min; peaks at 6.2, 6.8, 7.7, 8.4, 8.9, 9.4, and 9.8 min) and a mass spectrum (m/z 400 to 2000) with peaks at 629.0 and 965.3; (M+2H)2+, MW ~ 1928.6 Da]
LC-MS/MS for protein identification
An improvement in throughput of the overall method can be obtained by performing LC-MS/MS in the data-dependent mode. As full scan mass spectra are acquired continuously in LC-MS mode, any ion detected with a signal intensity above a pre-defined threshold will trigger the mass spectrometer to switch over to MS/MS mode. Thus, the mass spectrometer switches back and forth between MS mode (molecular mass information) and MS/MS mode (sequence information) in a single LC run. The data-dependent scanning capability can dramatically increase the capacity and throughput for protein identification.
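The threshold-triggered switching described above can be pictured with a small selection routine. This is only a sketch of the decision logic, not instrument control software: the survey scan is a list of (m/z, intensity) pairs, the intensities are made up for illustration, and the `top_n` limit is an assumption that goes beyond the description.

```python
def select_precursors(survey_scan, threshold, top_n=3):
    """Pick ions from a full-scan spectrum that should trigger MS/MS.

    survey_scan: list of (mz, intensity) tuples from one MS scan.
    Returns the m/z values above threshold, most intense first.
    """
    above = [(mz, inten) for mz, inten in survey_scan if inten >= threshold]
    above.sort(key=lambda peak: peak[1], reverse=True)
    return [mz for mz, _ in above[:top_n]]

# Hypothetical survey scan; only the m/z values 629.0 and 965.3
# come from the LC-MS figure, the intensities are invented.
scan = [(629.0, 4.0e5), (965.3, 9.0e5), (701.2, 2.0e4)]
to_fragment = select_precursors(scan, threshold=1.0e5)
print(to_fragment)
```

Each selected m/z would then be isolated and fragmented before the instrument returns to full-scan mode.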
[Figure: LC-MS/MS. From the LC-MS run (chromatogram, 5 to 12 min; mass spectrum with peaks at 629.0 and 965.3), the (M+2H)2+ ion is selected for MS/MS. The product ion spectrum (m/z 400 to 1800) shows b ions (b3, b5, b6, b8) and y ions (y4, y8 to y14); labeled peaks include 668.4, 838.5, 1261.4, 1374.5, and 1474.4]
Peptide sequencing by mass spectrometry
Peptide molecules are fragmented by collisionally activated dissociation (CAD): collisions with neutral background gas molecules (nitrogen, argon, etc.)
Peptides typically dissociate by cleavage of the -CO-NH- bond
[Figure: for a peptide A-B-C-D-E (N-term. to C-term.), the N-terminal product ions are A, A-B, A-B-C, A-B-C-D and the C-terminal product ions are E, D-E, C-D-E, B-C-D-E; each series forms a ladder along the m/z axis]
Ideally, one can measure the spacings between product ion peaks to deduce the sequence:
if each amide bond dissociates with equal probability
if only a single amide bond fragments for each molecule
if only C-terminal or N-terminal product ions are formed
In reality, this is not the case…
Peptide sequencing by mass spectrometry
[Figure: peptide backbone H2N-CHR1-CO-NH-CHR2-CO-NH-CHR3-CO-NH-CHR4-COOH, with the N-terminal fragments b1, b2, b3 and the complementary C-terminal fragments y3, y2, y1 marked at the amide bonds]
Nomenclature for MS Sequencing of Peptides
Klaus Biemann, MIT
The subscript denotes the number of residues contained in the product ion
Nomenclature for MS Sequencing of Peptides
Low-energy collisions promote fragmentation of a peptide primarily along the peptide backbone
Peptide fragmentation which maintains the charge on the C terminus is designated a y-ion
Fragmentation which maintains the charge on the N terminus is designated a b-ion
Low energy collisions: ion trap, QQQ, QTOF, FT-ICR
High energy collisions: TOF-TOF
Cleavage of amino acid side chain bonds (d-ion and w-ion) differentiates Leu vs. Ile
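The b- and y-ion masses follow directly from the residue masses: a singly charged b-ion is the sum of its N-terminal residues plus a proton, and a singly charged y-ion is the sum of its C-terminal residues plus water plus a proton. The sketch below assumes monoisotopic masses and unmodified residues, and uses the peptide LPNLIYHR from the database-searching figure later in this deck; the function name is illustrative.

```python
# Monoisotopic residue masses (Da)
MONO = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276, "V": 99.06841,
    "T": 101.04768, "C": 103.00919, "L": 113.08406, "I": 113.08406, "N": 114.04293,
    "D": 115.02694, "Q": 128.05858, "K": 128.09496, "E": 129.04259, "M": 131.04049,
    "H": 137.05891, "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.01056
PROTON = 1.00728

def fragment_ions(peptide):
    """Return (b, y) lists of singly charged fragment m/z values.

    b[i-1] contains the first i residues; y[i-1] contains the last i.
    """
    b, y = [], []
    for i in range(1, len(peptide) + 1):
        b.append(sum(MONO[aa] for aa in peptide[:i]) + PROTON)
        y.append(sum(MONO[aa] for aa in peptide[-i:]) + WATER + PROTON)
    return b, y

b_ions, y_ions = fragment_ions("LPNLIYHR")
for i, (b, y) in enumerate(zip(b_ions, y_ions), start=1):
    print(f"b{i} = {b:7.1f}   y{i} = {y:7.1f}")
```

Because Leu and Ile have identical residue masses in this table, the b and y series cannot tell them apart, which is why the side-chain fragments (d- and w-ions) from high energy collisions are needed.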
Peptide Sequencing by Mass Spectrometry
[Figure: MS/MS spectrum (relative abundance vs. m/z, 400 to 1800) of the tryptic peptide LVDKVIGITNEEAISTAR, residues 242-259 of cysteine synthase A, showing product ions b3 to b17 and y4 to y14; labeled peaks include m/z 555.4, 668.4, 838.5, 990.5, 1091.5, 1261.4, 1374.5, and 1474.4]
MS/MS of 2+ charged tryptic peptides (often) yields 1+ charged product ions (but 2+ charged products can be observed as well)
A mixture of b-ions and y-ions is present
Computer-based Sequence Searching Strategies
A list of experimentally determined masses is compared to lists of computer-generated theoretical masses prepared from a database of protein primary sequences. With the current exponential growth in the generation of genomic data, these databases are expanding every day.
There are typically three types of search strategies employed:
searching with peptide fingerprint data
searching with sequence data
searching with raw MS/MS data
One limiting factor that must be considered for all of the approaches is that they can only identify proteins that have already been identified and reside within an available database, or that are very homologous to one that resides in the database.
Searching with Peptide Fingerprints
The majority of the available search engines allow one to define certain experimental parameters to optimize a particular search:
Minimum number of peptides to be matched
Allowable mass error
Monoisotopic versus average mass data
Mass range of starting protein
Type of protease used for digestion
Information about potential protein modification, such as N- and C-terminal modification, carboxymethylation, oxidized methionines, etc.
Searching with Peptide Fingerprints
Most protein databases contain primary sequence information only
Any shift in mass incorporated into the primary sequence as a result of post-translational modification will result in an experimental mass that is in disagreement with the theoretical mass.
Modifications such as glycation and phosphorylation can result in missed identifications.
A single amino acid substitution can shift the mass of a peptide to such a degree that even a protein with a great deal of homology with another in the database can not be identified.
Searching with Peptide Fingerprints
A number of factors affect the utility of peptide fingerprinting.
The greater the experimental mass accuracy, the narrower you can set your search tolerances, thereby increasing your confidence in the match and decreasing the number of "false positive" responses.
A common practice used to increase mass accuracy in peptide fingerprinting is to employ an autolysis fragment from the proteolytic enzyme as an internal standard to calibrate a MALDI mass spectrum.
Peptide fingerprinting is also amenable to the identification of proteins in complex mixtures.
Peptides generated from the digest of a protein mixture will simply return two or more results that are a "good" fit.
Peptides that are "left over" in a peptide fingerprint after the identification of one component can be resubmitted for the possible identification of another component.
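Search tolerances are often expressed in parts per million (ppm), which makes the link between mass accuracy and tolerance concrete. The short sketch below is illustrative only: the function names are assumptions, and the theoretical value 1375.794 is a calculated (M+H)+ used as an example, not a number taken from the slides.

```python
def ppm_error(measured, theoretical):
    """Signed mass error in parts per million."""
    return (measured - theoretical) / theoretical * 1e6

def within_tolerance(measured, theoretical, tol_ppm):
    """True if the measured mass matches within +/- tol_ppm."""
    return abs(ppm_error(measured, theoretical)) <= tol_ppm

measured, theoretical = 1375.76, 1375.794
print(f"error = {ppm_error(measured, theoretical):.1f} ppm")
print(within_tolerance(measured, theoretical, tol_ppm=50))  # looser search
print(within_tolerance(measured, theoretical, tol_ppm=10))  # tighter search
```

A peak that is accepted at 50 ppm but rejected at 10 ppm shows why better calibration, for example against an autolysis peak, lets the tolerance be narrowed without losing true matches.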
Web addresses of some representative internet resources for protein identification from mass spectrometry data
Program: Web Address
BLAST: http://www.ebi.ac.uk/blastall/
Mascot: http://www.matrixscience.com/cgi/index.pl?page=/home.html
MassSearch: http://cbrg.inf.ethz.ch/Server/ServerBooklet/MassSearchEx.html
MOWSE: http://srs.hgmp.mrc.ac.uk/cgi-bin/mowse
PeptideSearch: http://www.narrador.embl-heidelberg.de/GroupPages/PageLink/peptidesearchpage.html
Protein Prospector: http://prospector.ucsf.edu/
Prowl: http://prowl.rockefeller.edu/
SEQUEST: http://fields.scripps.edu/sequest/
Mascot
MOWSE, among the first programs for identifying proteins by peptide mass fingerprinting, was developed out of a collaboration between the Imperial Cancer Research Fund (ICRF) and SERC Daresbury Laboratory, UK.
The name chosen was an acronym of Molecular Weight Search. The MOWSE databases were fully indexed so as to allow very rapid searching and retrieval of sequence data. Subsequently, the software was further developed and renamed Mascot.
Licensed and distributed by Matrix Science Ltd.
Specialized tools include Peptide Mass Fingerprint, Sequence Query, and MS/MS Ion Search.
Search output is Web-based.
Good visual representation of search quality (graphical probability chart).
Simple graphical user interface.
Reports MOWSE scores as a quantitative measure of search quality.
Mowse Scoring
Rather than just counting the number of matching peptides, Mowse uses empirically determined factors to assign a statistical weight to each individual peptide match. (Rapid identification of proteins by peptide-mass fingerprinting. Curr. Biol. 3:327, 1993.)
Scoring scheme assigns more weight to matches of higher molecular weight peptides (more discriminating).
Compensates for the non-random distribution of fragment molecular weights in proteins of different sizes.
Was the first protein identification program to recognize that the relative abundance of peptides of a given length in a proteolytic digest depends on the lengths of both peptide and protein.
Developed for MALDI peptide mass fingerprinting.
Probability-Based Mowse
Mascot incorporates a probability-based enhanced Mowse algorithm, described in Perkins et al. (Probability-based protein identification by searching sequence databases using mass spectrometry data. Electrophoresis 20:3551-3567, 1999).
A simple rule can be used to judge whether a result is significant or not.
Different types of matching (peptide masses and fragment ions) can be combined in a single search.
Databases
Three components are required for database searching support of proteomics: MALDI or MS/MS data, the algorithms used to search protein databases with the MALDI or MS/MS data, and the protein databases themselves.
The protein databases can be as small as one protein, can be large, public domain databases of all known and predicted proteins, or may be predicted open reading frames based on genomic sequence.
A major challenge for database searching is that these protein databases are constantly changing, making database search results potentially obsolete as new entries are added that better fit the MALDI or MS data.
Even as genomes are completed there is still flux as new coding regions are identified and novel mechanisms of increased translational complexity are better understood, such as alternative splice products, RNA editing, and ribosome slippage leading to novel, unexpected translation products.
Databases
NCBI non-redundant (NCBInr)
Non-redundant database from the National Center for Biotechnology Information for use with their search tools BLAST and Entrez; composed of translated sequences from the GenBank/EMBL/DDBJ consortium, SwissProt, Protein Information Resource (PIR), and Brookhaven Protein Data Bank (PDB).
New releases are published bimonthly while updates occur daily.
OWL
OWL is composed of Swiss-Prot, PIR, translated GenBank, and NRL-3D (PDB). All sequences are compared to Swiss-Prot to remove identical and “trivially different” sequences. Has not been updated since May 1999.
SWISSPROT
While SwissProt contains only a subset of proteins, the proteins in this database are much better annotated and the sequences are much more reliable than those available in any other database.
MSDB
Comprehensive, non-identical protein sequence database maintained by the Proteomics Department at the Hammersmith Campus of Imperial College London. Designed specifically for MS applications.
Databases
EST Clusters (dbEST)
Division of GenBank that contains "single-pass" cDNA sequences, or Expressed Sequence Tags (ESTs), from a number of organisms.
ESTs are relatively short, usually 3' end sequences from isolated mRNA.
ESTs tend to be highly redundant, and the sequence is of much lower quality than that from other sources.
An advantage to using these ESTs is that they represent only expressed sequences (no introns) and include alternative splice variants; their length, redundancy, and low quality are much improved by using clustered ESTs, such as the Compugen clusters.
The EST database has some redundancy because it contains all possible combinations of alternative splice products, and so it can be very large (and slow to search).
During a Mascot search, the nucleic acid sequences are translated in all six reading frames. dbEST is a very large database, and is divided into three sections: EST_human, EST_mouse, and EST_others. Even so, searches of these databases take far longer than a search of one of the non-redundant protein databases. You should only search an EST database if a search of a protein database has failed to find a match.
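Translating a nucleic acid entry in all six reading frames means reading codons from three offsets on the given strand and three on its reverse complement. The sketch below shows the idea with the standard genetic code; it is a generic illustration and not Mascot's code, and the frame labels (+1 to -3), the "X" placeholder for unrecognized codons, and the example DNA string are assumptions.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ... GGG
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}

def reverse_complement(dna):
    """Reverse complement of an upper-case DNA string."""
    return dna.translate(str.maketrans("ACGT", "TGCA"))[::-1]

def translate(dna):
    """Translate complete codons; unknown codons become 'X', stops '*'."""
    return "".join(CODON_TABLE.get(dna[i:i + 3], "X")
                   for i in range(0, len(dna) - 2, 3))

def six_frame_translation(dna):
    """Return a dict mapping frame label to protein string."""
    dna = dna.upper()
    frames = {}
    for strand, seq in (("+", dna), ("-", reverse_complement(dna))):
        for offset in range(3):
            frames[f"{strand}{offset + 1}"] = translate(seq[offset:])
    return frames

frames = six_frame_translation("ATGGCCAAG")  # hypothetical short EST fragment
for label, protein in frames.items():
    print(label, protein)
```

Every entry yields six candidate protein strings to digest and compare, which is one reason EST searches take so much longer than protein database searches.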
MALDI-MS peptide fingerprint (tryptic digest of a single protein)
[Figure: MALDI mass spectrum, m/z 1000 to 3000. Labeled peaks: 1116.67, 1247.70, 1287.73, 1375.76, 1424.85, 1505.77, 1574.20, 1665.89, 1811.85, 1849.12, 2005.07, 2476.21, 2550.52, 2719.48]
* trypsin autolysis
All peaks are (M+H)+
Mascot (Matrix Science) for peptide mass fingerprints
[Screenshots: successive steps of a Mascot peptide mass fingerprint search, annotated as follows]
enter peak list
possible identification
list of all possible matches
get more info on probable proteins
tryptic peptides that matched
peptides that did not match
tryptic peptides in protein sequence
better mass accuracy improves identification process
LC-MS/MS for protein identification
To provide further confirmation of the identification, if a tandem mass spectrometer (MS/MS) is available, peptide ions can be dissociated in the mass spectrometer to provide direct sequence information. Product ions from an MS/MS spectrum can be compared to available sequences using powerful software tools.
For a single sample, LC-MS/MS analysis included two discrete steps: (a) LC-MS peptide mapping to identify peptide ions from the digestion mixture and to deduce their molecular weights, and (b) LC-MS/MS of the previously detected peptides to obtain sequence information for protein identification.
Automated LC-MS/MS and database searching
Current mass spectral technology permits the generation of MS/MS data at an unprecedented rate. Prior to the generation of powerful computer-based database searching strategies, the largest bottleneck in protein identification was the manual interpretation of this MS/MS data to extract the sequence information. Today, many computer-based search strategies that employ MS/MS data require no operator interpretation at all.
Analogous to the approach described for peptide fingerprinting, these programs take the individual protein entries in a database and electronically "digest" them to generate a list of theoretical peptides for each protein. However, in the use of MS/MS data, these theoretical peptides are further manipulated to generate a second level of lists which contain theoretical fragment ion masses that would be generated in the MS/MS experiment for each theoretical peptide.
Automated LC-MS/MS and database searching
These programs simply compare the list of experimentally determined fragment ion masses from the MS/MS experiment of the peptide of interest with the theoretical fragment ion masses generated by the computer program.
The recent advent of data-dependent scanning functions has permitted the unattended acquisition of MS/MS data. An example of a raw MS/MS data searching program that takes particular advantage of this ability is SEQUEST.
SEQUEST will input the data from a data-dependent LC/MS chromatogram, automatically strip out all of the MS/MS information for each individual peak, and submit it for database searching using the strategy discussed above.
Each peak is treated as a separate data file, making it especially useful for the on-line separation and identification of individual components in a protein mixture.
SEQUEST cross-correlates uninterpreted MS/MS mass spectra of peptides with sequences from protein/nucleotide databases. The software can analyze a single spectrum or an entire LC-MS/MS peptide map.
No user interpretation of MS/MS spectra is involved.
[Figure: automated LC-MS/MS and database searching. A proteolytic digest is analyzed by HPLC-MS (chromatogram, 25 to 60 min; peaks at 29.02, 33.59, 38.27, 41.63, 46.75, 49.20, 54.28, 60.16, and 62.59 min). The MS/MS spectrum of one peptide (m/z 200 to 1000) gives the experimental fragment masses 212.1, 325.2, 410.3, 456.7, 475.3, 588.3, 701.3, 815.4, 851.5, and 912.5, which are compared ("Match?") with the theoretical fragment masses of a theoretical peptide]
Theoretical Peptide: LPNLIYHR
Theoretical Fragment Masses
Seq  #  B       Y       #
L    1  114.1   1025.6  8
P    2  211.1   912.5   7
N    3  325.2   815.4   6
L    4  438.3   701.4   5
I    5  551.3   588.3   4
Y    6  714.4   475.2   3
H    7  851.5   312.2   2
R    8  1007.6  175.1   1
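The "Match?" step in the figure can be illustrated with a simple shared-peak count between the experimental fragment masses and the theoretical B and Y values listed for LPNLIYHR. This is a deliberately simplified stand-in: SEQUEST itself uses a cross-correlation score, which is not reproduced here. The 0.5 Da tolerance and the function name are assumptions.

```python
def count_matches(experimental, theoretical, tol=0.5):
    """Number of experimental peaks within tol (Da) of any theoretical ion."""
    return sum(any(abs(e - t) <= tol for t in theoretical)
               for e in experimental)

# Experimental fragment masses read from the MS/MS spectrum in the figure
experimental = [212.1, 325.2, 410.3, 456.7, 475.3,
                588.3, 701.3, 815.4, 851.5, 912.5]

# Theoretical b- and y-ions listed for LPNLIYHR
b_ions = [114.1, 211.1, 325.2, 438.3, 551.3, 714.4, 851.5, 1007.6]
y_ions = [1025.6, 912.5, 815.4, 701.4, 588.3, 475.2, 312.2, 175.1]

matched = count_matches(experimental, b_ions + y_ions)
print(f"{matched} of {len(experimental)} experimental peaks matched")
```

In a real search the same comparison is repeated for every candidate peptide of the right precursor mass, and the candidates are ranked by score.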
Direct identification of proteins using mass spectrometry
Link et al. Nature Biotechnology 17, 676 (1999)
Removes the requirement to separate proteins by electrophoresis, etc.
MudPIT: multidimensional protein identification technology, or “Shotgun” approach
Protein lysate is digested with trypsin
The peptide mixture is loaded onto a strong cation exchange (SCX) column (to separate on the basis of charge).
A discrete fraction of peptides is displaced from the SCX column using a salt step gradient to a reversed-phase (RP) column (to separate on the basis of hydrophobicity).
This fraction is eluted from the RP column into the MS.
This iterative process is repeated, obtaining the fragmentation patterns of peptides in the original peptide mixture.
MS/MS spectra are used to identify the proteins in the original protein complex.
Large-scale analysis of the yeast proteome by MudPIT
Yates and coworkers, Nature Biotech. (2001) 19, 242-247
Assigned 5,540 peptides to MS spectra leading to the identification of 1,484 proteins from the S. cerevisiae proteome
Of 6,216 ORFs in yeast genome, 83% have CAI values between 0 and 0.20 (i.e., predicted to be present at low levels) (Fig. A)
MudPIT data: 791 or 53.3% of the proteins identified have a CAI of <0.2 (1.7 peptides per protein) (Fig. B)
Number of peptides per protein increases with increasing CAI (Fig. C)
MIRERICACVLALGMLTGFTHAFGSKDAAADGKPLVVTTIGMIADAVKNIAQGDVHLKGLMGP
GVDPHLYTATAGDVEWLGNADLILYNGLHLETKMGEVFSKLRGSRLVVAVSETIPVSQRLSLE
EAEFDPHVWFDVKLWSYSVKAVYESLCKLLPGKTREFTQRYQAYQQQLDKLDAYVRRKAQS
LPAERRVLVTAHDAFGYFSRAYGFEVKGLQGVSTASEASAHDMQELAAFIAQRKLPAIFIESSI
PHKNVEALRDAVQARGHVVQIGGELFSDAMGDAGTSEGTYVGMVTHNIDTIVAALAR
Approaches for Protein Sequencing and Identification
MS/MS
“Top Down”
The molecular mass of an intact protein defines the native covalent state of a gene's product, including the effects of post-transcriptional/translational modifications and the associated heterogeneity, which are modulated by the actions of other gene products. Moreover, the fragmentation pattern from large proteins can generate sufficient information for identification from sequence databases, particularly when combined with accurate mass measurements of both the intact molecule and its product ions.
In-Source Decay (ISD) for Protein Sequencing
[Figure: schematic of in-source decay. A chain labeled "PEPTIDE FRAGMENT" is cut at successive positions to give a ladder of fragments (P-E-P, P-E-P-T, P-E-P-T-I, P-E-P-T-I-D, ...); the spacing between ladder members reads out the sequence ...TIDE...]
Peptides and large proteins can be fragmented by ISD
Fragmentation occurs in the MALDI ion source and is not generally well controlled
Reflectron TOF not necessary (linear TOF sufficient to measure product ions)
Complete sequence information not present, but extensive stretches of sequence from the N- and/or C-termini observed
MALDI-ISD-TOF Mass Spectrometry of Proteins
[Figure: MALDI-ISD-TOF spectra of Protein 1 and Protein 2, m/z 5000-25000. The intact-protein peaks give molecular weight information; the in-source decay fragments give sequence information.]
[Figure: expanded region, m/z 2500-4000, showing sequence information from the N-terminus. Residue labels read from the two fragment ladders: A L Y G G E G F E D H R L K/Q E D and A L F G D E G F E D H R V A F D L/N; additional labels: T, PW?, WP?]
ISD generally yields c-ions or z-ions
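Reading a sequence from an ISD fragment ladder amounts to matching the spacing between consecutive peaks to amino acid residue masses. The following is a minimal sketch under stated assumptions: the function name `read_ladder` and the 0.05 Da tolerance are hypothetical, the peaks are assumed to belong to a single ion series with the same charge, and monoisotopic residue masses are used. Residues that cannot be told apart at the chosen tolerance are reported together, which is how ambiguous calls such as K/Q arise.

```python
# Monoisotopic residue masses (Da) for the 20 common amino acids.
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}


def read_ladder(peaks, tol=0.05):
    """Label each gap between consecutive ladder peaks with the matching residue(s).

    Returns one string per gap: a single letter, several letters joined by "/"
    when more than one residue fits within tol, or "?" when none fits.
    """
    ordered = sorted(peaks)
    calls = []
    for low, high in zip(ordered, ordered[1:]):
        gap = high - low
        hits = sorted(
            aa for aa, mass in RESIDUE_MASS.items() if abs(gap - mass) <= tol
        )
        calls.append("/".join(hits) if hits else "?")
    return calls


if __name__ == "__main__":
    # Hypothetical ladder: an arbitrary starting peak extended by G, E, D, K.
    ladder = [2500.0]
    for aa in "GEDK":
        ladder.append(ladder[-1] + RESIDUE_MASS[aa])
    print(read_ladder(ladder))
```

For an N-terminal series such as c-ions, reading the gaps from low to high m/z gives the sequence in the N-to-C direction. Leucine and isoleucine have identical masses and are always reported together.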
(top) ESI-QqTOF-MS of transferrin and (bottom) ESI-MS/MS of the 36+ charged molecule. Sequence-specific bn-products from residues 56-69 are generated. In addition, a series of larger products, ranging in size from 69 to 73 kDa, appears at m/z values above the precursor; these are consistent with the yn-complements of the bn-products observed in the lower m/z region of the spectrum. (Thevis et al., J. Am. Soc. Mass Spectrom. 2003, 14, 635).
Top-down sequencing of transferrin by ESI-MS/MS
Masses and compositions of commonly occurring amino acid residues