Shotgun Proteomic Analysis - National Institutes of Health · Protein ID’s 157 304 (1.9x) (*2...

Preview:

Citation preview

Shotgun Proteomic Analysis

Department of Cell BiologyThe Scripps Research Institute

Biological/Functional Resolution of Experiments

Organelle

Cells/Tissues

Multiprotein Complex

Function

InteractionAnalysis

ExpressionAnalysis

InformationContent

Abu

ndan

ce

Time

Abu

ndan

ce

Time

µLC

“Shotgun Proteomics”

…SEQ

UES

T80 node Beowulf Computer

Cluster

Protein Mixture

proteolysis

Peptide Mixture

Output Filtering and Re-Assembly

DTA

Sele

ct

Abu

ndan

ce

m/z

Abu

ndan

ce

m/z

MS

3

Abu

ndan

ce

m/z

Abu

ndan

ce

m/z

32

Abu

ndan

ce

m/z

Abu

ndan

ce

m/z

2

MS/MS

1

Abu

ndan

ce

m/z

Abu

ndan

ce

m/z

1

Data Processing Issues with Shotgun Proteomics

1:1 Mixture of Unlabeled/15N-Labeled YeastSoluble Proteins Analyzed Using a Single 12h Analysis

304 (1.9x)157Protein ID’s(*2 peptides confirmed w/ RelEx)

891 (1.6x)559Protein ID’s (*1 peptide confirmed w/ RelEx)

86,950 (4.5 x)18,970MS/MS SpectraLTQLCQ

*RelEx was used to evaluate the presence of labeled isotopomer

Processing Tandem Mass SpectraProcessing Tandem Mass Spectra

Spectral Quality

Database Analysis

De Novo: Unusual,Unanticipated Features

Filter

SEQUEST™PepProb

GutenTag

Quantification RelX

Subtractive Analysis LibQUESTDTASelect/Contrast

Filter and Assembly DTASelectProtProb

Data Issues

• Data quality• How is a match determined?

– Protease issues?– Validation issues?

• Posttranslational Modifications?• Quantification?• Sampling Issues?

Spectral Filtering with Hand Crafted FeaturesSpectral Filtering with Hand Crafted Features

Called Good Called Bad %Correct+1 GOOD 671 75 89.9%+1 BAD 5585 11475 67.3%+2/+3 GOOD 3166 348 90.1%+2/+3 BAD 11611 26684 69.7%All GOOD 3837 423 90.1%All BAD 17196 38159 68.9%

Bern, Goldberg, MacDonald, Yates Bioinformatics (in press)

•• Identification of Identification of PTMsPTMs

•• Sample is split into 3 aliquotsSample is split into 3 aliquots

•• Digest using 3 different proteasesDigest using 3 different proteases

•• Mix and analyze by LC/LCMix and analyze by LC/LC--MS/MSMS/MS

•• Interpret spectra using SEQUESTInterpret spectra using SEQUEST

MudPIT

SEQUEST

trypsin elastase subtilisin

MultiMulti--Enzyme Digestion ProceduresEnzyme Digestion Procedures

High pH

Proteinase K

High pH/Proteinase K Method(hpPK Method)

Wu et al., Nat. Biotech. 21:532-538 (2003)Howell and Palade, J. Cell Biol. 92:822-832 (1982)

Overlapping Peptide Coverage(TM7) gi|13794265|ref|NP_056312.1|DKFZP564G2022 protein MAAAAWLQVL PVILLLLGAH PSPLSFFSAG PATVAAADRS KWHIPIPSGK NYFSFGKILF RNTTIFLKFD GEPCDLSLNI TWYLKSADCY NEIYNFKAEE VELYLEKLKE KRGLSGNIQT SSKLFQNCSE LFKTQTFSGD FMHRLPLLGE KQEAKENGTN LTFIGDKTAM HEPLQTWQDA PYIFIVHIGI SSSKESSKEN SLSNLFTMTV EVKGPYEYLT LEDYPLMIFF MVMCIVYVLF GVLWLAWSAC YWRDLLRIQF WIGAVIFLGM LEKAVFYAEF QNIRYKGESV QGALILAELL SAVKRSLART LVSIVSLGYG IVKPRLGVTL HKVVVAGALY LLFSGMEGVL RVTGAQTDLA SLAFIPLAFL DTALCWWIFI SLTQTMKLLK LRRNIVKLSL YRHFTNTLIL AVAASIVFII WTTMKFRIVT CQSDWRELWV DDAIWRLLFS MILFVIMVLW RPSANNQRFA FSPLSEEEEEDEQKEPMLKE SFEGMKMRST KQEPNGNSKV NKAQEDDLKW VEENVPSSVT DVALPALLDS DEERMITHFE RSKME

WVEENVPSSVTDVALPALLDS*DEERVEENVPSSVTDVALPALLDS*DEER

EENVPSSVTDVALPALLDS*DEERENVPSSVTDVALPALLDS*DEER

VPSSVTDVALPALLDS*DEERPSSVTDVALPALLDS*DEER

LPALLDS*DEERPALLDS*DEER

(TM6) gi|14249524|ref|NP_116213.1| hypothetical protein FLJ14681MVAACRSVAG LLPRRRRCFP ARAPLLRVAL CLLCWTPAAV RAVPELGLWL ETVNDKSGPL IFRKTMFNST DIKLSVKSFH CSGPVKFTIV WHLKYHTCHN EHSNLEELFQ KHKLSVDEDF CHYLKNDNCW TTKNENLDCN SDSQVFPSLN NKELINIRNV SNQERSMDVV ARTQKDGFHI FIVSIKTENT DASWNLNVSL SMIGPHGYIS ASDWPLMIFY MVMCIVYILY GILWLTWSAC YWKDILRIQF WIAAVIFLGM LEKAVFYSEY QNISNTGLST QGLLIFAELI SAIKRTLARL LVIIVSLGYG IVKPRLGTVM HRVIGLGLLY LIFAAVEGVM RVIGGSNHLA VVLDDIILAV IDSIFVWFIF ISLAQTMKTL RLRKNTVKFS LYRHFKNTLIFAVLASIVFM GWTTKTFRIA KCQSDWMERW VDDAFWSFLF SLILIVIMFL WRPSANNQRY AFMPLIDDSD DEIEEFMVTS ENLTEGIKLR ASKSVSNGTA KPATSENFDE DLKWVEENIP SSFTDVALPV LVDSDEEIMT RSEMAEKMFS SEKIM

WVEENIPSSFTDVALPVLVDS*DEEIMTRIPSSFTDVALPVLVDS*DEEIMTRPSSFTDVALPVLVDS*DEEIMTR

SFTDVALPVLVDS*DEEIMTRTDVALPVLVDS*DEEIMTRTDVALPVLVDS*DEEIMTRS

DVALPVLVDS*DEEIMTRVALPVLVDS*DEEIMTR

ALPVLVDS*DEEIMTRALPVLVDS*DEEIMTRS

LPVLVDS*DEEIMTRPVLVDS*DEEIMTRPVLVDS*DEEIMTRS

gi|27229118|ref|NP_082129| RIKEN cDNA 0610006F02; S-adenosylmethionine-dependentmethyltransferase activity [Mus musculus]

MDALVLFLQL LVLLLTLPLH LLALLGCWQP ICKTYFPYFM AMLTARSYKK MESKKRELFS QIKDLKGTSG NVALLELGCG

TGANFQFYPQ GCKVTCVDPN PNFEKFLTKS MAENRHLQYE RFIVAYGENM KQLADSSMDV VVCTLVLCSV QSPRKVLQEVCVDPN PNFEKF

VTCVDPN PNFEKVTCVDPN PNFEKFLTK

QRVLRPGGLL FFWEHVAEPQ GSRAFLWQRV LEPTWKHIGD GCHLTRETWK DIERAQFSEV QLEWQPPPFR WLPVGPHIMGQFSEV QLEWQPPPFR WLPVGPHIM

EV QLEWQPPPFR WLPVGPHEV QLEWQPPPFR WLPVGPHEV QLEWQPPPFR WLPVGPHIMEV QLEWQPPPFR WLPVGPHIM**LEWQPPPFR WLPVGPHLEWQPPPFR WLPVGPHLEWQPPPFR WLPVGPHIMLEWQPPPFR WLPVGPHIMLEWQPPPFR WLPVGPHIMGEWQPPPFR WLPVGPHIMWQPPPFR WLPVGPH

KAVK

Golgi Peptide Synthetic Peptide

400 600 800 1000 1200 1400 1600 18000

20

40

60

80

100

y8b4

y14++

b9

b10b14

b13

b12

b11

y12

y11

y10y7y6

y5

y3

y12++

m/zR

elat

ive

Abu

ndan

ce

400 600 800 1000 1200 1400 1600 18000

20

40

60

80

100

y14++

b4b3 b9b10 b14

b13

b11

y12

y11

y10y8

y7y6

y5

y3

y12++

m/z

Rel

ativ

e A

bund

ance

Dimethyl Arginine Containing Peptide

Shotgun Proteomic Experiments Shotgun Proteomic Experiments and Sampling Issuesand Sampling Issues

•• Based on prior studies in yeast, we know Based on prior studies in yeast, we know not every protein present is not every protein present is id’did’d

•• Reproducibility is good for high abundance Reproducibility is good for high abundance proteins 70proteins 70--80%80%

•• Reproducibility is not as good for low Reproducibility is not as good for low abundance proteins. 20abundance proteins. 20--30%30%

•• Is this predictable?Is this predictable?

2 4 6 8 10 12 14 161000

1500

2000

2500

3000

3500

4000

Num

ber o

f Ide

ntifi

ed P

rote

ins

Number of MudPIT Runs

Experimental Semi-empirical Model Random Model

Random Sampling Model for Data Dependent Acquisition

))/1(1(* SL NLnK −−=

nL= # of protein species at particular levelL = abundance levelN = total number of proteinsS = experiments

2 4 6 8 10

100

200

300

400

500

600

700

Num

ber o

f Pro

tein

s

Num ber of M udPIT's

Distribution of Protein Identifications AfterRepeating Analysis 9 times

# of proteins identifiedin all 9 experiments

# of proteins identifiedin 1 experiment

RelX software

DTASelect OutputPeptide SequenceLVNHFIQEFK

1270 1275 1280 1285 1290 12950

20

40

60

80

100

Rel

ativ

e Ab

unda

nce

m/z

1) Predict m/z Range

1270 1275 1280 1285 1290 12950

20

40

60

80

100

Rel

ativ

e Ab

unda

nce

m/z

2) Sum Signal in Range

32 33 34 35 36 370

20

40

60

80

100

Rel

ativ

e Ab

unda

nce

Time (min)

3) Store Mass Chromatograms

32 33 34 35 36 370

20

40

60

80

100

Rel

ativ

e A

bund

ance

Time (min)

4) Peak Detection

Sam

ple

Inte

nsity

Reference Intensity

5) Correlation

1 2 3 4 5

1

2

3

4

5

6

7

Ave

rage

Pep

tide

Rat

io

Protein Mixture Ratio

TSA1 = 1.335±0.015

SSA1 = 1.004±0.018

ADH1 = 0.661±0.045

Systematic Errors are Present in Samples

Spectral Sampling for Relative QuantificationSpectral Sampling for Relative QuantificationCombined Data for 6 proteins added to Yeast SolubleCombined Data for 6 proteins added to Yeast Soluble

Cell Lysate at 4 different levelsCell Lysate at 4 different levels

y = 1.0613x + 0.2608R2 = 0.9997

0

5

10

15

20

25

30

0 5 10 15 20 25 30

% Protein Markers Added

% S

pect

rum

Obs

erve

d

Linear dynamicrange 2-orders

Measuring small changesis not as reliable

Synthesis of Ribosomal Proteins in Plasmodium falciparum

• Striking trend: almost all ribosomal proteins increase over ring to trophtransition

Ribosomal Protein Abundance

0

500

1000

1500

2000

2500

3000

3500

4000

Ring Troph Schiz Mero

Stage

Sp

ectr

al C

ou

nt

Standards

• 1. Data formats: McDonald et al MS1, MS2, and SQT - three unified, compact, and easily parsed file formats for the storage of shotgun proteomic spectra and identifications, RCM 2004, 18, 2162-8.

• Database search standard based on: Washburn et al, Large Scale analysis of the yeast proteome via

multidimensional protein identification technology Nature Biotechnology 19, 242-247 (2001)

MacCoss et al , Probability Based Validation of Protein Identifications Using a Modified SEQUEST Algorithm, Analytical Chemistry 74, 5593-5599 (2002). Normalized Scores

Standards

• 4. Standards should not prevent innovation– Data formats should be practical e.g. storage space

• 7. Data processing tools should be transparent and validated, e.g. published

• 8. Data for publication: information to support biological conclusions- sequences of peptides id’d.

• 9. Archiving data: biological conclusions should be the most important part of the experiment

0 5 10 15 200

2000

4000

6000

8000

10000

12000

Charge State +1Charge State +2Charge State +3

Num

ber o

f Spe

ctra

Probability Scores

Probability Distributions

Single Spectral Matches are Problematic: How Single Spectral Matches are Problematic: How to tell if they are correctto tell if they are correct

•• Searches determine closeness of fit based Searches determine closeness of fit based on some measure: Compare matches with on some measure: Compare matches with different programsdifferent programs

•• Probability scoringProbability scoring: P = random match : P = random match based on frequency of fragment ions in based on frequency of fragment ions in databasedatabase

•• SEQUESTSEQUEST: XCorr measures how close the : XCorr measures how close the spectrum fits to ideal spectrumspectrum fits to ideal spectrum

•• Manual validation, experimental validationManual validation, experimental validation•• de novode novo interpretationinterpretation

•• Sample is split into 3 aliquotsSample is split into 3 aliquots

•• Digest using 3 different proteasesDigest using 3 different proteases

trypsin elastase subtilisin

MultiMulti--Enzyme DigestEnzyme Digest

•• Sample is split into 3 aliquotsSample is split into 3 aliquots

•• Digest using 3 different proteasesDigest using 3 different proteases

•• Mix and analyze by LC/LCMix and analyze by LC/LC--MS/MSMS/MS

•• Interpret spectra using SEQUESTInterpret spectra using SEQUESTMudPIT

SEQUEST

trypsin elastase subtilisin

MultiMulti--Enzyme DigestEnzyme Digest

0

10

20

30

40

50

60

70

80

90

100

% of Prote insIdentified

% of UniquePeptides

% of SpectrumCopies

P1P2P3P4

Properties of Data Dependent Data Acquisition

Most invariant property is spectral copy number

GutenTagGutenTag: Partial de novo Sequencing : Partial de novo Sequencing of Tandem Mass Spectraof Tandem Mass Spectra

•• Database searching assumes minimal errors in the Database searching assumes minimal errors in the database and sequence variations between strains, database and sequence variations between strains, individuals and speciesindividuals and species

•• Modifications need to be specified in database Modifications need to be specified in database searches, so unanticipated modifications will be searches, so unanticipated modifications will be missed.missed.

•• Partial Partial De novoDe novo analysis of tandem mass spectra in analysis of tandem mass spectra in largelarge--scalescale can identify peptides containing can identify peptides containing sequence variations and unanticipated sequence variations and unanticipated modifications modifications

GutenTagGutenTag

•• 6170 tandem mass spectra: LC/LC/MS/MS 6170 tandem mass spectra: LC/LC/MS/MS analysis of simple digested protein mixtureanalysis of simple digested protein mixture

•• 1987 spectra matched by SEQUEST1987 spectra matched by SEQUEST•• 1328 spectra matched by GutenTag1328 spectra matched by GutenTag•• 766 partial matches suggesting modifications and 766 partial matches suggesting modifications and

sequence variationssequence variations•• Total matching spectra by Total matching spectra by GutenTagGutenTag: 2,094: 2,094•• Partial de novo will extend identificationsPartial de novo will extend identifications•• Software is Software is AutomatedAutomated and and LargeLarge--ScaleScale

Improved Spectral Quality Effects Peptide IdentificationInfusion of 1 pmol/µl Angiotensin I

LCQ-Classic761 of 1000 MS/MS Spectra Matched the Correct Sequence

LTQ970 of 1000 MS/MS Spectra Matched the Correct Sequence

0.5 1.0 1.5 2.0 2.5 3.0 3.50

25

50

75

100

125

150

175

200

Freq

uenc

y

XCorr

0.5 1.0 1.5 2.0 2.5 3.0 3.50

50

100

150

200

250

300

Freq

uenc

y

XCorr

Database Searching with Tandem Database Searching with Tandem Mass SpectraMass Spectra

•• The goal is to identify peptides using MS/MS The goal is to identify peptides using MS/MS spectra and amino acid sequence databases.spectra and amino acid sequence databases.

•• Develop a probabilistic model that establishes a Develop a probabilistic model that establishes a relationship between the database sequences and relationship between the database sequences and the spectrum to complement quantitative measures the spectrum to complement quantitative measures of closenessof closeness--ofof--fit fit

•• Develop nonDevelop non--empirical probabilistic measures empirical probabilistic measures using crossusing cross--correlation measurementscorrelation measurements

Probability Model

• Null hypothesis: All fragment matches to MS/MS spectrum are by random.

• N, K – number of all fragments and all fragment matches, respectively. N1 is the number of fragments of a particular peptide which has K1 matches.

• We seek an amino acid sequence that has the smallest probability of being a random match.

1

111 *),( 11, NN

KNKN

KK

NK CCCNKP

−−=

Model Hypergeometric Distributions

0 5 10 15 20 25 300.00

0.05

0.10

0.15

0.20

0.25

K=300

K=200

K=100N=1000N1=30

Prob

abili

ty

Number of Successes

N, K – number of all fragments and all fragment matches, respectively. N1 is the number of fragments of a particular peptide which has K1 matches.

Significance of Peptide IdentificationSignificance of Peptide Identification

•• Not all identifications are significant: poor quality Not all identifications are significant: poor quality spectra of peptides , incomplete peptide spectra of peptides , incomplete peptide fragmentation, inaccuracies in database, fragmentation, inaccuracies in database, posttranslational modifications, MS/MS of posttranslational modifications, MS/MS of chemical noise and nonchemical noise and non--peptide molecules.peptide molecules.

•• Significance of a match (P_value) is also obtained Significance of a match (P_value) is also obtained from the hypergeometric distribution.from the hypergeometric distribution.

Significance of a Match

0 5 10 15 20 25 300.00

0.04

0.08

0.12

0.16

Significance of K1=14

H0

N=1000K=300N1=30

Prob

abili

ty

Number of Successes

Yeast Database

0 10 20 30 400.00

0.05

0.10

0.15

0.20

0.25Predicted DistributionObserved Frequency

N=5284150K=569160N1=40

Prob

abili

ty,F

requ

ency

Number of Fragment Matches

Cumulative Distribution

0 10 200.0

0.2

0.4

0.6

0.8

1.0C

umul

ativ

e D

istri

butio

nFu

nctio

n

Number of Fragment Matches

Mass/Charge State Dependence

• Scores that use closeness of fit measures can artificially inflate with weight/mass.

• This complicates use of uniform criteria for identification.

• Probabilities generated by hypergeometric distribution are charge/weight independent.

Cross-Correlation Score Distribution

0 1 2 3 4 5 6 7 80

500

1000

1500

2000

2500

3000

3500

Num

ber o

f Spe

ctra

Cross-Correlation Scores

XCorr+1 XCorr+2 XCorr+3

Database Dependence

• The probability inferred from the hypergeometric distribution is in principle database dependent.

• However, the dependency is very weak.

NRP and Yeast Databases

0 10 20 300.00

0.05

0.10

0.15

0.20

0.25 NRP Database Yeast Database

Prob

abili

ty

Number of Fragment Ion Matches

Pep_Probe SummaryPep_Probe Summary

•• Implements 4 scoring schemes: hypergeometric, Implements 4 scoring schemes: hypergeometric, poissonpoisson, , maximum likelihood and crossmaximum likelihood and cross--correlation. Sorts results correlation. Sorts results either by hypergeometric or crosseither by hypergeometric or cross--correlation scores.correlation scores.

•• No enzyme specificity is assumed.No enzyme specificity is assumed.•• Reports significance of each match.Reports significance of each match.•• Can search for posttranslational modifications to three Can search for posttranslational modifications to three

different amino acids.different amino acids.•• Has been implemented to run on a standalone or compute Has been implemented to run on a standalone or compute

clusters.clusters.•• Runs on heterogeneous cluster of computers, in WINDOWS Runs on heterogeneous cluster of computers, in WINDOWS

and LINUX platforms.and LINUX platforms.

Processing Tandem Mass SpectraSpectral Quality

Database Analysis

Analyze Unusual orUnanticipated Features

Eliminate Poor Quality Spectra, Score Quality of

Other spectra

Multiple Methods With Different

Selectivity's

De novo or partialDe novo analysis

Assemble and Annotate Data Biological Significance

Increased Data Production Requires Automated Data AnalysisIncreased Data Production Requires Automated Data Analysis

MS/MS MS/MS ComparisonComparison

Library SearchingLibrary SearchingComparative AnalysisComparative AnalysisSubtractive AnalysisSubtractive Analysis

LIBQUEST LIBQUEST

Yates et al. Yates et al. Anal.Chem. Anal.Chem. 70, 3557 (1998)70, 3557 (1998)

Rel

ativ

e A

bund

ance

m/z

De NovoDe NovoSequencingSequencing

Alternate SplicingAlternate SplicingUnanticUnantic. Mod.. Mod.Unknown ORFsUnknown ORFs

GutenTagGutenTag

Tabb et al Anal. Tabb et al Anal. ChemChem

Existing SequenceExisting SequencePTM PTM

Database Search

Database Database SearchSearch

SEQUEST & Pep_ProbSEQUEST & Pep_ProbYates et al.,Yates et al., JASMS JASMS 5, 976 (1994)5, 976 (1994)Yates et al. Yates et al. Anal. Anal. ChemChem 67, 1426 (1995)67, 1426 (1995)Yates et al. Yates et al. Anal. ChemAnal. Chem. 67, 3205 (1995). 67, 3205 (1995)MacCoss et al. MacCoss et al. Anal. Anal. ChemChem (2002)(2002)Sadygov and Yates, Sadygov and Yates, Anal. Anal. ChemChem (in press)(in press)

DTASelectDTASelectPost Analysis Post Analysis

Review SoftwareReview Software

Link et al. Link et al. Nature Biotech. Nature Biotech. 1717, 676, 676--682 (1999)682 (1999)Tabb et. al. Tabb et. al. J. Proteome Res.J. Proteome Res. 1, 26, (2002) 1, 26, (2002)

Probability forProbability forProtein Protein

IdentificationIdentification

Link et al. Link et al. Nature Biotech. Nature Biotech. 1717, 676, 676--682 (1999)682 (1999)Tabb et. al. Tabb et. al. J. Proteome Res.J. Proteome Res. 1, 26, (2002) 1, 26, (2002)

Peptide IdPeptide IdProbabilityProbabilityPP--Val, S.C.Val, S.C.

Related SequenceRelated SequenceSNP AnalysisSNP AnalysisMutation AnalysisMutation Analysis

VariantVariantSearchSearch

SEQUESTSEQUEST--SNPSNP

Gatlin et al.Gatlin et al. Anal.Chem.Anal.Chem. 72, 757 (2000). 72, 757 (2000).

SpectralQuality

Bioinformatics (in press)Bioinformatics (in press)

Data Considerations

• Different types of experiments• Different types of data analysis

Integrated Multi-Dimensional Liquid Chromatography

hV

100 micron FSC

RPSCXRP

5 µM100-300 nL/min

WasteWaste

Comprehensive Analysis of Complex Protein Mixtures

Total ProteinCharacterization

• Protein Identification: What’s there• Post Translational Modifications: Regulation• Quantification: Dynamics• Proteomic Data to Knowledge: Genetics, RNAi, siRNA

Multiprotein Complex/OrganelleCells/Tissues

Translation of technology development into biological discoveryTranslation of technology development into biological discovery

Recommended