
Using cross-correlation normalized for peptide length to optimize peptide identification in shotgun proteomics


RAPID COMMUNICATIONS IN MASS SPECTROMETRY

Rapid Commun. Mass Spectrom. 2005; 19: 2983–2985

Published online in Wiley InterScience (www.interscience.wiley.com). DOI: 10.1002/rcm.2137

To the Editor-in-Chief

Sir,

Using cross-correlation normalized for peptide length to optimize peptide identification in shotgun proteomics

Over the past few years, the technology of multidimensional peptide separation coupled with tandem mass spectrometric (MS/MS) identification has seen remarkable development and has been used extensively in high-throughput protein identification for proteomics research. The SEQUEST database search engine¹ has been routinely used in this strategy for more than 10 years; however, it is well known that there is considerable overlap between positive and negative peptide identifications. Removing false positive results and reducing false negative identifications is a key problem in proteomics research.

Manual validation was first used to validate peptide identifications in shotgun proteomics,² but it is time-consuming, not feasible for high-throughput analysis of large datasets, and dependent on the experience of 'experts', which reduces the reproducibility and comparability of data among laboratories. Several statistical tools have been used to evaluate SEQUEST results.³⁻⁷ These statistical models were normally trained or tested using datasets from mixtures of known proteins, so their reliability in the analysis of real samples still needs to be investigated.

Random database searching is a simple approach that can provide effective criteria to minimize the false positive rates (FPRs), and can also be used to evaluate the effect of different search parameters on peptide identifications. In this approach, FPRs are calculated by searching against a conventional protein database together with a random (sequence-reversed) protein database, or a non-homologous database different from the conventional database; the FPRs can then be kept within a reasonable range by adjusting the search parameters.

Moore et al.⁷ first used a sequence-reversed protein database to estimate random assignment between an MS/MS spectrum and a peptide in the database. Peng et al.⁸ used a reversed yeast protein database to analyze FPRs in a yeast proteome study; they decreased FPRs to less than 1% and reduced the need for manual interpretation while identifying more proteins. Qian et al.⁹ used the reversed database strategy to evaluate the FPRs for peptide identifications from three human proteome samples, and suggested that FPRs depend significantly on sample characteristics.

Although the random database search strategy can reduce the FPRs of peptide identifications, it still needs in-depth research. Setting the same cutoff for shorter and longer peptides is inappropriate because the Xcorr value depends on the size of the assigned peptide.³ Yu et al.¹⁰ calculated the FPRs of peptide identifications by searching an Archaean protein database; they set an Xcorr cutoff of 2.2 for doubly charged peptides with molecular mass <1200 Da, and an Xcorr cutoff of 2.5 for doubly charged peptides with molecular mass ≥1200 Da. This approach did improve peptide identification, but we considered it insufficient because Xcorr and molecular mass are continuous variables.
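To make this objection concrete, the two-level cutoff quoted above can be written as a step function. The following is a minimal illustrative sketch in Python; the function name is hypothetical, and only the two cutoff values and the 1200 Da boundary are taken from the text.

```python
def stepwise_xcorr_cutoff(mass_da: float) -> float:
    """Two-level Xcorr cutoff for doubly charged peptides, as quoted in the text.

    The function name is hypothetical; only the values 2.2, 2.5 and the
    1200 Da boundary come from the letter.
    """
    return 2.2 if mass_da < 1200.0 else 2.5


# Two peptides that differ by only 0.1 Da face cutoffs that differ by 0.3,
# although both Xcorr and molecular mass vary continuously.
for mass in (1199.9, 1200.0):
    print(mass, stepwise_xcorr_cutoff(mass))
```

A normalization that varies smoothly with peptide size avoids this kind of jump at an arbitrary boundary.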

In this work, Xcorr was normalized in a fashion designed to reduce its dependence on peptide size, and we investigated the effect of this normalization on the rank of peptides. An experimental dataset was obtained from the work of Keller et al.¹¹ All tandem mass spectra were generated from 22 liquid chromatography/tandem mass spectrometry (LC/MS/MS) runs on two mixtures of 18 purified proteins at a variety of concentrations. The two mixtures were digested with trypsin and analyzed by electrospray ionization ion trap mass spectrometry (ESI-ITMS) (ThermoFinnigan, San Jose, CA, USA).

The protein database ipi.Human.3.05 was downloaded from the European Bioinformatics Institute (EBI);¹² it contained 49 161 protein entries, and in this work the sequence of each entry was reversed using an in-house program (reverse.pl). The sequences of the 18 control proteins used in this study differed slightly from those in the experiment of Keller et al.:¹¹ we replaced Q04977 by P06278 for B. licheniformis α-amylase, and did not include the rabbit myosin heavy and light chains. A new protein database was constructed by appending the sequences of the 18 known proteins to the sequence-reversed human IPI database.
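The database construction described above can be sketched as follows. This is not the authors' reverse.pl (which is not shown in the letter) but a minimal Python illustration under stated assumptions: the inputs are FASTA-formatted text, and the function names and the "REV_" header prefix are hypothetical conventions introduced here.

```python
from typing import Iterator, List, Tuple


def parse_fasta(text: str) -> Iterator[Tuple[str, str]]:
    """Yield (header, sequence) pairs from FASTA-formatted text."""
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def build_search_database(human_fasta: str, control_fasta: str) -> List[Tuple[str, str]]:
    """Reverse every human entry, then append the control proteins unchanged."""
    # The "REV_" prefix is a hypothetical naming convention, not taken from the letter.
    entries = [("REV_" + header, seq[::-1]) for header, seq in parse_fasta(human_fasta)]
    entries.extend(parse_fasta(control_fasta))
    return entries


# Toy input: one made-up "human" entry and one made-up control entry.
human = ">IPI001\nMKTAY\nIAK\n"
control = ">P06278\nACDE\n"
database = build_search_database(human, control)
print(database)
```

With such a database, any accepted hit to a reversed entry is known to be a random match, while hits to the appended control proteins are candidates for true identifications.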

Tandem mass spectra were searched against this protein database using SEQUEST v2.7 (ThermoFinnigan). The peptide mass tolerance was set to 1.5 Da, cysteine carbamidomethylation and methionine oxidation were considered, and the enzymatic constraint was trypsin with a maximum of two internal missed cleavage sites. In total, this analysis produced a dataset of 24 489 tandem mass spectra with 24 486 peptide identifications.

A peptide that passed the strict Xcorr filter and belonged to one of the 18 known proteins was counted as a positive peptide; any other peptide was counted as a negative peptide. Common contaminants were not considered, as the human proteins in the database were sequence-reversed.

In the random database searching strategy, researchers have normally set criteria to ensure that the confidence of peptide identifications is more than 95%, e.g., Xcorr ≥1.9 for singly charged peptides, Xcorr ≥2.2 for doubly charged peptides, Xcorr ≥3.75 for triply charged peptides, and DeltaCn ≥0.1 for all peptides. However, the Xcorr value of a peptide depends on peptide size; if a single Xcorr cutoff is chosen for all peptides with the same charge state but different lengths, it will be a strict criterion for shorter peptides but a loose one for longer peptides. Figure 1 shows results for doubly charged peptides with Xcorr >1.5 containing 8, 13, 18, or 22 amino acids; the total number of peptides considered was 361. Each bar in the histogram represents the number of peptides within an Xcorr range of 0.3 for a specified peptide length. Clearly, the Xcorr values of short peptides are small and distributed over a narrow range, whereas those of long peptides are large and distributed over a wide range. Thus, normalizing Xcorr to make it independent of peptide length is highly desirable.

Keller et al.³ used Eqn. (1) to reduce the length dependence of Xcorr; here we investigated another approach (Eqn. (2)) for the same purpose. In Eqn. (2), N_i is the number of possible fragment ions for each peptide. An in-house program, normalize.pl, was written to apply Eqn. (2) to normalize Xcorr and recalculate the DeltaCn values of peptides with Xcorr >1.0; the other peptides were discarded. This resulted in normalization of the Xcorr values of 21 437 peptides and recalculation of their DeltaCn values. Only the first-ranked peptide hit was accepted as the correct peptide for protein identification, but some peptides were no longer ranked first after their Xcorr values were normalized. These peptides were ranked again according to Xcorr' and their DeltaCn values were recalculated using the Xcorr' values of the new first- and second-ranked peptides.

$$\mathrm{Xcorr}' = \frac{\ln(\mathrm{Xcorr})}{\ln(N_L)} \tag{1}$$

$$\mathrm{Xcorr}' = \frac{\ln(\mathrm{Xcorr})}{\ln(N_i)} \tag{2}$$
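The normalization and re-ranking steps can be sketched in Python as follows. This is not the authors' normalize.pl but an illustration under several assumptions that the letter does not confirm: N_i is taken as 2(L − 1), the number of b and y ions of a peptide of length L; the Xcorr >1.0 filter is applied to each candidate; and DeltaCn is recomputed with the usual SEQUEST form, (first − second)/first, using Xcorr' in place of Xcorr. The names `Hit`, `n_fragment_ions`, `normalized_xcorr` and `rerank` are hypothetical, and the example sequences and scores are made up.

```python
import math
from typing import List, NamedTuple, Optional, Tuple


class Hit(NamedTuple):
    peptide: str   # candidate peptide sequence
    xcorr: float   # raw SEQUEST Xcorr


def n_fragment_ions(peptide: str) -> int:
    # ASSUMPTION: N_i counts the b and y ions of a peptide of length L,
    # i.e. 2 * (L - 1). Requires peptides of at least two residues.
    return 2 * (len(peptide) - 1)


def normalized_xcorr(hit: Hit) -> float:
    # Eqn. (2): Xcorr' = ln(Xcorr) / ln(N_i)
    return math.log(hit.xcorr) / math.log(n_fragment_ions(hit.peptide))


def rerank(hits: List[Hit], min_xcorr: float = 1.0
           ) -> Optional[Tuple[Hit, float, Optional[float]]]:
    """Re-rank the candidates of one spectrum by Xcorr' and recompute DeltaCn.

    Returns (best hit, its Xcorr', DeltaCn), or None if no candidate passes
    the Xcorr > min_xcorr filter. DeltaCn is None with a single candidate.
    """
    scored = sorted(
        ((normalized_xcorr(h), h) for h in hits if h.xcorr > min_xcorr),
        key=lambda pair: pair[0],
        reverse=True,
    )
    if not scored:
        return None
    best_score, best_hit = scored[0]
    delta_cn = None
    if len(scored) > 1:
        # ASSUMPTION: usual DeltaCn form, applied to the normalized scores.
        delta_cn = (best_score - scored[1][0]) / best_score
    return best_hit, best_score, delta_cn


# Made-up example: the longer peptide has the higher raw Xcorr,
# but the shorter one ranks first after normalization.
hits = [Hit("SYELPDGQVITIGNER", 2.2), Hit("AGFAGDDAPR", 2.0)]
best, score, delta_cn = rerank(hits)
print(best.peptide, round(score, 3), round(delta_cn, 3))
```

The example shows how a rank change of the kind described above can arise: dividing by ln(N_i) penalizes the longer candidate more strongly than the shorter one.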

After the Xcorr values of the peptides had been normalized, their length dependence was reduced, as shown in Fig. 2. Doubly charged peptides with Xcorr' >0.12 and lengths of 8, 13, 18, or 22 amino acids were selected as the example in this histogram (the total number of peptides was 316). The Xcorr' values for short and long peptides are now better distributed, and a distinct improvement in Xcorr' for long peptides is found.

All peptide identifications were extracted from the SEQUEST search result files by an in-house program, extract.pl. Under the same condition of FPR ≤5%, the number of true positive peptides (NTPP) identified was improved for doubly and triply charged peptides by using the Xcorr' threshold, but the NTPP for singly charged peptides decreased compared with that obtained using the Xcorr threshold, as shown in Table 1. The NTPP values for doubly and triply charged peptides increased by 16.7% and 5.2%, respectively; as the lengths of the doubly and triply charged peptides span a wide range, the peptide identifications were clearly optimized. The NTPP for singly charged peptides decreased when using Xcorr' because some false positive peptides with low DeltaCn values disturbed the peptide identifications. The total NTPP over all charge states increased by 8.5% after Xcorr was normalized.
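Choosing a score threshold under an FPR constraint, as done for Table 1, can be sketched as below. The letter does not spell out its FPR formula or the procedure used by extract.pl, so this Python sketch rests on assumptions: FPR is taken as the fraction of accepted hits that do not belong to the control proteins, a hit is accepted when its score is at or above the threshold, and the function names are hypothetical. The scores in the example are made up.

```python
from typing import List, Optional, Tuple

# Each hit is (score, is_control): score is Xcorr or Xcorr', and is_control
# is True when the peptide belongs to one of the known control proteins.
ScoredHit = Tuple[float, bool]


def count_at(hits: List[ScoredHit], threshold: float) -> Tuple[int, float]:
    """Return (NTPP, FPR) for hits with score >= threshold.

    ASSUMPTION: FPR = accepted non-control hits / all accepted hits.
    """
    accepted = [is_control for score, is_control in hits if score >= threshold]
    if not accepted:
        return 0, 0.0
    ntpp = sum(accepted)
    return ntpp, (len(accepted) - ntpp) / len(accepted)


def lowest_threshold(hits: List[ScoredHit], max_fpr: float = 0.05
                     ) -> Optional[Tuple[float, int, float]]:
    """Lowest observed score usable as a threshold with FPR <= max_fpr.

    NTPP cannot grow as the threshold rises, so the lowest qualifying
    threshold also gives the largest NTPP under the constraint.
    """
    for threshold in sorted({score for score, _ in hits}):
        ntpp, fpr = count_at(hits, threshold)
        if fpr <= max_fpr:
            return threshold, ntpp, fpr
    return None


# Made-up scores for illustration only.
toy_hits = [(3.0, True), (2.8, True), (2.5, True),
            (2.2, False), (2.0, True), (1.5, False)]
print(lowest_threshold(toy_hits))
print(lowest_threshold(toy_hits, max_fpr=0.2))
```

Running the same selection once on Xcorr and once on Xcorr' for each charge state would give a comparison of the kind reported in Table 1.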

Figure 1. Histogram showing effect of peptide length on Xcorr.

Figure 2. Histogram showing reduction of peptide length dependence of Xcorr by application of normalized Xcorr.


When the DeltaCn threshold was also considered, a distinct improvement in NTPP was obtained for all peptides, as shown in Table 2. (Note that the DeltaCn threshold was applied in Table 2 but not in Table 1, so the non-normalized Xcorr thresholds in Table 2 are lower than those in Table 1.) NTPP increased by 13.5%, 17.5% and 13.6% for singly, doubly and triply charged peptides, respectively, under the condition FPR ≤5%. Applying the Xcorr' and DeltaCn thresholds as criteria yielded 2068 true positive peptides over all charge states, an increase of 15.8% over the 1786 true positive peptides identified with the Xcorr and DeltaCn thresholds. A total of 1764 peptides were found in both datasets produced by the two different criteria. The increase of 282 in NTPP resulted from removal of 22 peptides from the list of true positive peptides obtained using the Xcorr criterion, and reassignment of 304 false negative peptides to the true positive list by using Xcorr' and DeltaCn as the criteria.
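The counts in this paragraph are mutually consistent, as the following arithmetic (using only the numbers quoted above) shows:

$$1786 - 22 = 1764, \qquad 1764 + 304 = 2068$$

$$2068 - 1786 = 304 - 22 = 282, \qquad \frac{282}{1786} \approx 0.158 = 15.8\%$$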

Of the 21 437 peptides with Xcorr >1.0, 549 were no longer ranked first after normalization of Xcorr; in these cases, the peptide ranked first by Xcorr' was selected. None of the 2068 true positive peptides was among the 549 peptides whose rank changed, so normalization of Xcorr appears to have little effect on the rank of true positive peptides.

In this report, we applied a simple formula, slightly different from that used previously,³ to normalize Xcorr and reduce its dependence on peptide size. Previously,³ the normalized Xcorr was used in a statistical model; here, we used the normalized Xcorr to overcome some shortcomings of the random database searching methodology. Peptide identifications were markedly improved, with a decrease in the false negative rate, by using normalized Xcorr and DeltaCn with the random database strategy. The false positive rate would also decrease if stricter criteria were applied so as to keep the number of peptide identifications fixed.

Peptide identifications can be optimized by using the normalized Xcorr. Because shotgun proteomics assembles peptide identifications into protein identifications, protein identification can in turn be optimized when peptides are identified with more confidence.

Acknowledgements

We would like to thank Dr. Andrew Keller for providing the test dataset. We thank Songfeng Wu and Jiyang Zhang for valuable comments and discussions. We acknowledge the financial support for the work by China Technology R&D Project (No. 2002BA711A11, 2004BA711A18); National Key Program for Basic Research (No. 2001CB510201, 2004CB520802); and Beijing Municipal Program for Science & Technology (H03023028190).

Bing Yang¹,²†, Wantao Ying¹†, Yan Gong¹, Yangjun Zhang¹, Yun Cai¹, Hongye Dong² and Xiaohong Qian¹*

¹Beijing Institute of Radiation Medicine, 27 Taiping Road, Beijing 100850, China

²Shengyang Pharmaceutical University, Shengyang 110016, China

*Correspondence to: X. Qian, Beijing Institute of Radiation Medicine, 27 Taiping Road, Beijing 100850, China. E-mail: [email protected]

†These authors contributed equally to this work.

Contract/grant sponsor: China Technology R&D Project; Contract/grant number: 2002BA711A11, 2004BA711A18.

Contract/grant sponsor: National Key Program for Basic Research; Contract/grant number: 2001CB510201, 2004CB520802.

Contract/grant sponsor: Beijing Municipal Program for Science & Technology; Contract/grant number: H03023028190.

REFERENCES

1. Eng JK, McCormack AL, Yates JR III. J. Am. Soc. Mass Spectrom. 1994; 5: 976.

2. Link AJ, Eng J, Schieltz DM, Carmack E, Mize GJ, Morris DR, Garvik BM, Yates JR III. Nat. Biotechnol. 1999; 17: 676.

3. Keller A, Nesvizhskii AI, Kolker E, Aebersold R. Anal. Chem. 2002; 74: 5383.

4. Anderson DC, Li WQ, Payan DG, Noble WF. J. Proteome Res. 2003; 2: 137.

5. Fenyo D, Beavis RC. Anal. Chem. 2003; 75: 768.

6. Sadygov RG, Liu HB, Yates JR III. Anal. Chem. 2004; 76: 1664.

7. Moore RE, Young MK, Lee TD. J. Am. Soc. Mass Spectrom. 2002; 13: 378.

8. Peng J, Elias JE, Thoreen CC, Licklider LJ, Gygi SP. J. Proteome Res. 2003; 2: 43.

9. Qian WJ, Liu T, Monroe ME, Strittmatter EF, Jacobs JM, Kangas LJ, Petritis K, Camp DG II, Smith RD. J. Proteome Res. 2005; 4: 53.

10. Yu LR, Conrads TP, Uo T, Kinoshita Y, Morrison RS, Lucas DA, Chan KC, Blonder J, Issaq HJ, Veenstra TD. Mol. Cell. Proteomics 2004; 3: 896.

11. Keller A, Purvine S, Nesvizhskii AI, Stolyar S, Goodlett DR, Kolker E. Omics 2002; 6: 207.

12. Available: www.ebi.ac.uk/proteome/index.html.

Received 5 August 2005

Revised 12 August 2005

Accepted 12 August 2005

Table 1. Estimate of the effect of the Xcorr and Xcorr' thresholds on peptide identifications with FPR controlled at ≤5% (without considering DeltaCn)

| Charge state | Xcorr' (normalized) | NTPP (normalized) | FPR% (normalized) | Xcorr (non-normalized) | NTPP (non-normalized) | FPR% (non-normalized) |
|---|---|---|---|---|---|---|
| 1+ | 0.27 | 19 | 5 | 1.8 | 66 | 4.3 |
| 2+ | 0.27 | 1075 | 3.7 | 2.4 | 921 | 4.8 |
| 3+ | 0.25 | 711 | 3.8 | 2.9 | 676 | 3.8 |

Table 2. Estimate of the effect of the Xcorr and Xcorr' thresholds on peptide identifications with FPR controlled at ≤5% (DeltaCn >0.1)

| Charge state | Xcorr' (normalized) | NTPP (normalized) | FPR% (normalized) | Xcorr (non-normalized) | NTPP (non-normalized) | FPR% (non-normalized) |
|---|---|---|---|---|---|---|
| 1+ | 0.21 | 84 | 1.2 | 1.7 | 74 | 4.0 |
| 2+ | 0.25 | 1166 | 4.8 | 2.3 | 992 | 4.2 |
| 3+ | 0.23 | 818 | 3.9 | 2.8 | 720 | 3.5 |

Copyright © 2005 John Wiley & Sons, Ltd.