REVIEW

Proteome informatics I: Bioinformatics tools for processing experimental data

Patricia M. Palagi 1, Patricia Hernandez 1, Daniel Walther 1 and Ron D. Appel 1, 2, 3

1 Proteome Informatics Group, Swiss Institute of Bioinformatics, Geneva, Switzerland
2 Computer Science Department, Geneva University, Geneva, Switzerland
3 Geneva University Hospitals, Geneva, Switzerland

Bioinformatics tools for proteomics, also called proteome informatics tools, today span a large panel of very diverse applications, ranging from simple tools to compare protein amino acid compositions to sophisticated software for large-scale protein structure determination. This review considers the available and ready-to-use tools that can help end-users to interpret, validate and generate biological information from their experimental data. It concentrates on bioinformatics tools for 2-DE analysis, for LC followed by MS analysis, for protein identification by PMF, by peptide fragmentation fingerprinting and by de novo sequencing, and for data quantitation with MS data. It also presents initiatives that propose to automate the processes of MS analysis and enhance the quality of the obtained results.

Received: April 13, 2006
Revised: June 2, 2006

Accepted: July 10, 2006

Keywords:

Analysis / Bioinformatics / Platforms / Software / Tools

Proteomics 2006, 6, 5435–5444 5435

1 Introduction

The word Proteome was introduced in 1994 to picture the PROTEin complement of a genOME [1]. It describes the ensemble of protein forms expressed in a biological sample at a given point in time and in a given situation. Two years later, the word Proteomics was first used to define the study of proteomes, a rather simplistic definition. A broader definition states that proteomics deals with the large-scale analysis of proteins, and this includes their identification, the measurement of their level of expression and their partial characterisation by the analysis of their pre-, co-, and post-translational modifications, their structures, their functions and their interactions. Whatever the adopted definition, proteomics has four main objectives: (i) to identify all proteins from a proteome, creating a catalogue of information; (ii) to analyse differential protein expression associated with a disease, different cell states, sample treatments and drug targets; (iii) to characterise proteins by discovering their function, cellular localisation, PTMs, etc. and (iv) to describe and understand protein interaction networks.

Proteomics is in constant evolution and relies on efficient protein separation techniques, MS, bioinformatics, as well as gene and protein databases. Technical advances and growing interest in the field have given rise to a great number of specialised tools and software to help end-users analyse their data and discover new biological knowledge. In a survey undertaken by the HUPO association in 2005 and summarised in its newsletter of January 2006, indexes of available and required resources in different proteomics expertise areas were measured. This survey shows that the indexes with the highest values are related to requirements in data management and analysis and in bioinformatics. We can conclude that (a) obviously, proteomics experimental technologies evolve faster than their informatics and bioinformatics applications and (b) not obviously, the production of well-matured bioinformatics solutions cannot keep up with that of current technologies. This review intends to discuss the second conclusion. We will focus on tools that can help end-users make the most of their proteomics data.

Correspondence: Dr. Patricia M. Palagi, Swiss Institute of Bioinformatics CMU, 1 Michel-Servet CH-1211, Geneva 4, Switzerland
E-mail: [email protected]
Fax: +41-22-379-5858

Abbreviation: PFF, peptide fragmentation fingerprint

DOI 10.1002/pmic.200600273

© 2006 WILEY-VCH Verlag GmbH & Co. KGaA, Weinheim www.proteomics-journal.com


It is worth noting that the expression ‘proteome informatics tools’ is used here as a synonym for ‘bioinformatics tools’. The bioinformatics tools in the scope of this review are related to the following proteomics approaches: sample separation by 2-DE gel, sample separation by LC followed by MS analysis, identification of proteins by PMF, by peptide fragmentation fingerprinting (PFF) and by de novo sequencing, data quantitation with MS data, and platforms to automate and improve the quality of the identification process. The tools used for other proteomic approaches, such as protein microarrays, yeast two-hybrid projects, high-throughput determination of protein structures, protein–protein and protein–biomolecule interaction, pathways and cell signalling networks, elucidation of protein function from sequence and structure analysis, etc. are outside the scope of this article. Nevertheless, many of them are discussed in the companion article of this special issue [2].

2 Bioinformatics for postseparation analysis

It is well known that protein separation techniques exploit the diversity of proteins, including their size, shape, electrical charge, molecular weight, hydrophobicity and their predispositions to interact with other proteins. A number of technologies capable of performing large-, medium- or small-scale separation of complex protein mixtures have been investigated, such as capillary and gel electrophoresis, microchannel, protein chips, LC and HPLC. Usually, the technologies that have a specific apparatus monitored by a computer, such as LC and HPLC, already come with integrated specific software to render the results in a computerised manner. Among the separation technologies mentioned above, gel electrophoresis is the only technique that has dedicated bioinformatics tools; the others are served solely by informatics software that accompanies the vendors' equipment.

2.1 2-DE gel image analysis

2-DE gels are very useful for simultaneously resolving thousands of proteins separated by their molecular weight and pI. The 2-DE gel patterns provide an important research tool for quantitative analysis and comparative proteomics. The possibility of detecting protein expression changes associated with diseases and treatments, or of finding therapeutic molecular targets, has been, among many other applications, a major incentive for the development of specialised software systems for 2-DE gel image analysis since 1975. In the early 1980s, the first packages started to be delivered to the public and some of them have survived the computational evolution of the last two decades. Among these were PDQuest™ (based on Quest) [3] and ImageMaster™ 2D Platinum (based on Melanie) [4]. Currently, several other dedicated software packages are also commercialised, but only a few of them have steady distributions or reliable support and documents describing their methodologies, other than the information given on their marketing websites. Table 1 gives the list of the major available packages (cited in the scientific literature, source PubMed, at least five times in the last 10 years for having been used to support a Proteomics study).

Table 1. Major commercialised 2-DE image analysis software a)

Software                                       Company               Source website
DeCyder                                        GE Healthcare         www.gehealthcare.com
Delta2D                                        Decodon               www.decodon.com
ImageMaster 2D Platinum (powered by Melanie)   GE Healthcare         www.gehealthcare.com
PDQuest                                        BioRad                www.bio-rad.com
Progenesis (formerly Phoretix)                 Nonlinear Dynamics    www.nonlinear.com
Proteomweaver                                  Definiens/BioRad      www.definiens.com/www.bio-rad.com

a) Listed in PubMed at least five times in the last 10 years for having been used to support a Proteomics study.

Every end-user of 2-DE gel analysis software has been confronted with the issue of selecting the best and most appropriate tool for his/her needs. In general, 2-DE gel image analysis packages have the same basic operations and functionalities necessary to carry out a complete gel study, which should end with highlighting differentially expressed spots in populations of gels. Dowsey et al. [5] give a comprehensive review of computational techniques and algorithm details. Besides basic visualisation properties, the major functions of software systems for 2-DE gel image analysis are (i) the detection and quantification of protein spots on the gels, (ii) the matching of corresponding spots across the gels and (iii) the localisation of significant protein expression changes. Any other additional feature, such as data management and database integration, may or may not be included. Functions (i) and (ii) have to be successfully executed prior to carrying out function (iii). The optimal and reproducible definition of the spot borders, and as a consequence spot quantitation, depends mostly on gel experimental details, as well as on the quality of focusing and polymerisation. Often, proteins are resolved as overlapping spots, in particular in regions with high spot density. Weak spots can be missed because they are confused with the background, while others may be wrongly detected, such as streaks located in the basic regions. In order to eliminate or reduce the impact of these issues, detection algorithms included in the packages generally comprise filtering steps to automatically remove streak artefacts and noise spikes [6] or a segmentation process based on the analysis of the grey-levels [7].

Spot detection algorithms also produce quantitative information on the protein spots, such as the spots' area, OD (maximum intensity value in the area), volume (integration of all intensity values over the area) and relative measures of these values that partially compensate for variations in sample load or staining, and as a consequence provide better reproducibility of data analysis and results.
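The spot measures just described can be illustrated with a minimal sketch. It assumes a background-corrected image held as a nested list and a boolean mask produced by some spot detection step; the function names and the toy data are hypothetical and are not taken from any of the packages discussed.

```python
def quantify_spot(image, mask):
    """Return area, OD and volume of one spot.

    image: 2-D list of background-corrected pixel intensities (assumed input)
    mask:  2-D list of booleans, True where a pixel belongs to the spot
    """
    pixels = [image[r][c]
              for r in range(len(image))
              for c in range(len(image[r]))
              if mask[r][c]]
    return {
        "area": len(pixels),    # number of pixels inside the spot border
        "od": max(pixels),      # maximum intensity value in the area
        "volume": sum(pixels),  # integration of all intensities over the area
    }


def relative_volumes(volumes):
    """Express each spot volume as a percentage of the total volume on the gel."""
    total = sum(volumes)
    return [100.0 * v / total for v in volumes]


# Toy 3 x 3 image containing a single cross-shaped spot (invented data).
image = [[0, 1, 0],
         [2, 8, 3],
         [0, 2, 0]]
mask = [[False, True, False],
        [True, True, True],
        [False, True, False]]

spot = quantify_spot(image, mask)
percentages = relative_volumes([16, 48])
```

Relative volumes of this kind are one simple way to compensate for differences in sample load or staining between gels; the packages themselves may use other normalisations.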

Matching gel images is also a critical process, whether based on the previous detection of spots [8] or on the intensities of the regions before the detection of spots [9]. It depends on the similarity of the spatial distribution of spots between a gel image taken as a reference image (e.g., control) and another gel image (e.g., disease), which may vary according to experimental gel running conditions and gel scanning. Such variations should be compensated for so that only the changes in protein expression are retained. The subsequent steps of gel image analysis are erroneous when spots representing the same protein are not correctly matched or when spots representing different proteins are mistakenly matched together. Some tools propose the initialisation of a few pairs of spots representing the same proteins in different gels: a landmarking step. These landmarks are then used to warp the gel images and correct possible distortions, and consequently improve the matching quality. However, this additional step has the inconvenience of being time-consuming and labour-intensive for the end-users.
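The landmarking idea can be sketched as follows. This is a deliberately simplified illustration: it assumes that the distortion between two gels is a pure translation estimated from the landmark pairs, whereas real packages warp images with more flexible transformations. All names, coordinates and the distance threshold are hypothetical.

```python
import math


def estimate_shift(landmarks):
    """Mean (dx, dy) that maps the second gel onto the reference gel.

    landmarks: list of ((x_ref, y_ref), (x_other, y_other)) pairs chosen by the user.
    """
    n = len(landmarks)
    dx = sum(ref[0] - other[0] for ref, other in landmarks) / n
    dy = sum(ref[1] - other[1] for ref, other in landmarks) / n
    return dx, dy


def match_spots(ref_spots, other_spots, shift, max_dist=3.0):
    """Greedy nearest-neighbour matching after correcting the second gel by `shift`."""
    dx, dy = shift
    matches = {}
    used = set()
    for i, (xr, yr) in enumerate(ref_spots):
        best, best_d = None, max_dist
        for j, (xo, yo) in enumerate(other_spots):
            if j in used:
                continue
            d = math.hypot(xr - (xo + dx), yr - (yo + dy))
            if d <= best_d:
                best, best_d = j, d
        if best is not None:
            matches[i] = best
            used.add(best)
    return matches


# Invented spot centres (x, y) on a reference gel and on a second gel.
ref_spots = [(10, 10), (50, 40), (80, 90)]
other_spots = [(14, 8), (54, 38), (120, 20)]

# Two landmark pairs picked by hand.
landmarks = [((10, 10), (14, 8)), ((50, 40), (54, 38))]

shift = estimate_shift(landmarks)
matches = match_spots(ref_spots, other_spots, shift)
```

Spots left without a partner (here the third spot of each gel) are the candidates for presence/absence changes, which is why a wrong match at this stage propagates to all later steps.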

Some of the software listed in Table 1 has been compared regarding the ability to correctly detect and quantify spots and to match gels. These are very difficult comparisons to make: on the one hand the set of parameters differs greatly between the various software packages, and on the other hand gels have a very complex nature. Rogers et al. [10] compared the software PDQuest, Melanie 3 (called ImageMaster 2D Platinum since version 5), Phoretix, Progenesis and Z3 (no longer supported), and concluded that all packages perform well in all evaluations. Other comparative studies showed equivalent results when comparing PDQuest and Progenesis [11], PDQuest and Phoretix [12] and Melanie and Z3 [13]. These results are not very conclusive in terms of selecting the best software. All packages have important features that will be taken into consideration by the end-user, such as the management of mass spectrometric, 2-DE or any other related data in an integrated way, powerful comparative analysis with multiple statistical values to increase reliability in the results, etc. In short, each tool has its own strengths and weaknesses that vary depending on the gel type and experimental conditions. They all propose very similar attractive features that may improve data reliability and reduce analysis time, making the end-user's choice even more bewildering.

All packages in Table 1 are also able to compare 2-D difference gel electrophoresis (2-D DIGE) experiments, also called multiplex experiments. In this fluorescent technique for protein labelling [14], each sample is labelled with a fluorescent dye (Cy2, Cy3 or Cy5) prior to electrophoresis and the samples are coseparated on the same 2-D gel. Scanning the gel at wavelengths specific for each dye reveals the different proteomes. These images are then overlaid using the above-mentioned software and the differences in abundance of specific protein spots can be detected. One advantage of this technique is that variation in spot location due to gel-specific experimental factors is the same for each sample within a single DIGE gel. Consequently, the relative amount of a protein in a gel in one sample compared to another is unaffected. Besides, quantitation analysis is more accurate since an internal common standard can be included during migration. The internal standard is created, for example, by pooling aliquots of all biological samples in the experiment, labelling the pool with one of the dyes and running it together with the individual samples. It is thus an average image of all samples, with all proteins of the experiment represented on the same support. From the bioinformatics point of view, only the spot detection function is adapted in the software that deals with DIGE gels. Since the same proteins will be localised at the same x and y coordinates on the gels, the spot detection procedure is the same for the codetected gels. The matching step is thus straightforward and subject to far fewer errors.
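A small sketch may help to show why the internal standard improves quantitation. It assumes that each spot volume is simply divided by the volume of the same spot in the pooled standard on the same gel and that the ratio is expressed on a log2 scale; the function name and the volumes are hypothetical, and commercial packages may normalise differently.

```python
import math


def standardised_log_ratio(sample_volume, standard_volume):
    """log2 abundance of a spot relative to the pooled internal standard on the same gel."""
    return math.log2(sample_volume / standard_volume)


# Invented volumes of one protein spot on two DIGE gels.
# Gel B received twice the protein load of gel A, so every raw volume is doubled.
gel_a = {"standard": 1000.0, "control": 500.0, "treated": 2000.0}
gel_b = {"standard": 2000.0, "control": 1000.0, "treated": 4000.0}

ratios = {
    name: {
        "control": standardised_log_ratio(gel["control"], gel["standard"]),
        "treated": standardised_log_ratio(gel["treated"], gel["standard"]),
    }
    for name, gel in (("gel_a", gel_a), ("gel_b", gel_b))
}
```

Although the raw volumes differ twofold between the two gels, the standardised ratios are identical, which is the gel-to-gel compensation the internal standard is meant to provide.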

Free tools available through the Internet are an alternative to commercial systems, even though they are usually limited to simple operations on a small number of gels (Table 2). The free tool Flicker was created to visually compare 2-DE gels [15, 16]. It has a Java applet and a stand-alone version to be installed on the user's computer. Both versions allow two gels to be compared visually, either side-by-side or superimposed. When gels are in the overlay mode, ‘flickering’ both images makes changes in protein intensities and expression more easily discernible. GelScape [17] is a web-based tool to display gel images and at the same time a database to house gels and their annotations. The gels uploaded in GelScape can also be compared with the other available gels in this database. The free Viewer of ImageMaster 2D Platinum/Melanie has the usual visualisation operations of the full version and most of the analysis procedures as well. However, the analysis is restricted to a small number of proteins and only from gels that have already been analysed by a full version. The viewer version of PDQuest also offers elementary visualisation of gel images.

Table 2. Open-source or free software for 2-DE analysis

Software                                    Source website
Flicker                                     www.lecb.ncifcrf.gov/flicker/wgFlkPair.html
GelScape                                    www.gelscape.ualberta.ca
ImageMaster 2D Platinum & Melanie Viewer    www.expasy.org
PDQuest                                     www.bio-rad.com


A totally different approach is to send one's gels to be analysed by experts (for example, the company Ludesi; www.ludesi.com), which incurs additional costs.

2.2 LC/MS image analysis

An alternative, though common, proteomics workflow combines the separation of proteins and peptides by LC followed by direct analysis by MS. This workflow has initiated the design of new bioinformatics tools for proteomics that are complementary to 2-DE gel image analysis. In the proteomics imaging of LC/MS studies, data are also represented in two dimensions, i.e., the elution time and m/z, and they can be visualised and analysed as images. LC/MS image analysis, although a recent proteomics field (the first articles were published in 2002 [18, 19]), shows promising applications in differential proteome analysis, by comparing several proteome sets and detecting significant quantitative differences, and in discovering specific proteins. These expectations certainly explain the sudden emergence of a number of packages in a short period of time. The list of available tools is given in Table 3.

Certain basic operations and functionalities are necessary to carry out a complete LC/MS study, and filtering is one of them. Usually, this operation removes the peaks with the weakest intensities (e.g., background noise) or high spikes that are constant in time (e.g., chemical noise such as column contaminants), and thus reduces the complexity of spectra and facilitates peak detection. Peak detection involves looking for the monoisotopic peaks (also called the deisotoping procedure) and determining the charge states of the ions. Finally, isotopic peaks corresponding to the same mass value are clustered into one single peak signal. In effect, this procedure selects the peaks of interest from the enormous quantity of data.
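The deisotoping and charge-state determination described above can be sketched in a few lines. The sketch relies on the standard observation that isotopic peaks of an ion with charge z are spaced by roughly 1.00335/z on the m/z axis. Everything else is an assumption made for illustration: the function names, the tolerance, the greedy strategy of testing the highest charge first and the decision to drop peaks without an isotopic partner.

```python
ISOTOPE_SPACING = 1.00335  # approximate 13C - 12C mass difference in Da
PROTON = 1.00728           # approximate mass of a proton in Da


def deisotope(mzs, max_charge=4, tol=0.01):
    """Cluster isotopic peaks and return (monoisotopic m/z, charge) pairs.

    Peaks with no isotopic partner are dropped because their charge
    cannot be inferred from the spacing alone.
    """
    mzs = sorted(mzs)
    consumed = set()
    clusters = []
    for i, mz in enumerate(mzs):
        if i in consumed:
            continue
        # Test high charges first: a doubly charged envelope also contains
        # peaks spaced by ~1.0 and would otherwise be mistaken for charge 1.
        for z in range(max_charge, 0, -1):
            members = [i]
            k = 1
            while True:
                target = mz + k * ISOTOPE_SPACING / z
                hit = next((j for j, m in enumerate(mzs)
                            if j not in consumed and abs(m - target) <= tol),
                           None)
                if hit is None:
                    break
                members.append(hit)
                k += 1
            if len(members) >= 2:
                consumed.update(members)
                clusters.append((mz, z))
                break
    return clusters


def neutral_mass(mz, charge):
    """Monoisotopic neutral mass from a monoisotopic m/z and a charge state."""
    return (mz - PROTON) * charge


# Invented peak list: one doubly charged envelope, one singly charged
# envelope and one isolated peak.
peaks = [500.0000, 500.5017, 501.0034, 800.4000, 801.4034, 650.3000]
clusters = deisotope(peaks)
```

Each isotopic envelope is reduced to a single (m/z, charge) signal, which is the data reduction the paragraph refers to; production tools additionally check that the intensity pattern of the envelope is plausible.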

Table 3. Available LC/MS image analysis software

Software         Source website
Decyder MS a)    www.gehealthcare.com
MapQuant b)      arep.med.harvard.edu/MapQuant
Msight c)        www.expasy.org/MSight
MsInspect b)     proteomics.fhcrc.org/CPAS
Mzmine b)        mzmine.sourceforge.net
OpenMS b)        open-ms.sourceforge.net
SpecArray c)     tools.proteomecenter.org/SpecArray.php
XCMS b)          metlin.scripps.edu/download

a) Commercialised product.
b) Open-source package.
c) Free software.

Ideally, the same molecules analysed on the same LC/MS platform should have the same retention time, molecular weight and signal intensity. However, due to experimental variations, this is not always the case. While m/z values depend on the mass accuracy and resolution of the mass spectrometer, the retention times largely depend on the analytical method used. Peaks from the same compound or peptide match fairly closely in m/z values, but the retention times between the runs can vary significantly. The peak alignment operation corrects these variations and finds corresponding peaks across different LC/MS runs. Once runs are aligned, they can be compared and statistically analysed in order to find differentially expressed proteins and peptides, and to quantify these differences.
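The peak alignment operation can be sketched as a two-pass procedure. The sketch assumes that the retention time drift between two runs is a single constant offset, estimated as the median difference over peaks that agree in m/z. The tools listed in Table 3 use more elaborate, often non-linear, corrections, and all names, tolerances and peak values here are hypothetical.

```python
from statistics import median


def align_runs(run_a, run_b, mz_tol=0.01, rt_tol=0.5, coarse_rt_tol=5.0):
    """Pair peaks of two LC/MS runs; each peak is (m/z, retention time, intensity).

    Returns the estimated retention time shift of run_b and the matched pairs.
    """
    # Pass 1: coarse matches on m/z give an estimate of the systematic RT shift.
    deltas = [a[1] - b[1]
              for a in run_a for b in run_b
              if abs(a[0] - b[0]) <= mz_tol and abs(a[1] - b[1]) <= coarse_rt_tol]
    shift = median(deltas) if deltas else 0.0

    # Pass 2: match again with a tight RT window on the corrected times.
    pairs = []
    for a in run_a:
        candidates = [b for b in run_b
                      if abs(a[0] - b[0]) <= mz_tol
                      and abs(a[1] - (b[1] + shift)) <= rt_tol]
        if candidates:
            best = min(candidates, key=lambda b: abs(a[1] - (b[1] + shift)))
            pairs.append((a, best))
    return shift, pairs


# Invented peak lists; run_b elutes about two minutes later than run_a.
run_a = [(500.250, 20.0, 1000.0), (612.300, 35.0, 400.0), (745.410, 50.0, 250.0)]
run_b = [(500.251, 22.0, 2100.0), (612.299, 37.1, 390.0), (900.100, 41.0, 80.0)]

shift, pairs = align_runs(run_a, run_b)
fold_changes = [b[2] / a[2] for a, b in pairs]
```

Once peaks are paired, intensity ratios such as `fold_changes` can be passed to a statistical test to find differentially expressed peptides.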

As with 2-DE gel analysis software, and due to the large amount of data generated by LC/MS experiments, the available software does not usually run remotely through website interfaces. All tools are available as stand-alone versions that have to be installed on a user's computer and run locally. From the list given in Table 3, Decyder MS (GE Healthcare) is the only currently commercialised tool; the others are either free or open-source software.

MSight® [20] is a free downloadable software package available through the ExPASy [21] server. Its interface and functionalities are based on the Melanie gel image analysis system (mentioned in Section 2.1). It runs on the latest Windows™ operating systems and accepts data generated by the majority of mass spectrometers supplied by Bruker, Waters or ABI-SCIEX, for example. It also supports the mzXML [22] and mzData [23] formats. It is worth noting that the mzXML team (http://sashimi.sourceforge.net/) provides various converter tools to generate mzXML files from mass spectrometer native acquisition files. MSight has the advantage of a user-friendly interface, which makes navigation through the large volumes of data very easy. Several visualisation tools allow peptide or protein signals to be discriminated from noise, or differential analysis to be performed. Peak detection and peak alignment are in development, as well as semiautomatic analysis of LC-MS datasets, including quantitative differential proteome analysis.

Mzmine [24] is an open-source software package for LC/MS analysis written in Java, making it suitable for any computer platform. It accepts input in NetCDF (Network Common Data Form) and mzXML formats. Several spectral filters are implemented to correct the raw data files, such as smoothing to filter noise in the mass spectra. Other methods are also implemented, such as peak detection, peak alignment and normalisation of multiple data files. Mzmine has a user-friendly interface as well; however, the tool does not provide statistical analysis procedures to find differences in a comparative study; these have to be performed with other packages such as MATLAB® (http://www.mathworks.com/) or R (http://www.r-project.org/). Options are available for visualising each action, and a batch-processing mode is also accessible for the analysis of large sample sets. Another tool, msInspect, is also an open-source application written in Java, with connections to the R package to align and match MS data.

MapQuant [25] is also open-source software for LC/MS analysis, but it is written in ANSI C and has a command-line interface. It accepts a single type of data format generated by a program (OpenRaw) that interfaces the MS equipment and MapQuant. It also has procedures to smooth data and filter the background noise, and to detect, refine and resolve isotopic peaks; however, it does not automatically align experiments. It matches LC/MS images based on already identified peptides.

The SpecArray software suite [26] contains various closely related programs with specific functionalities that can be used in sequential mode: the Pep3D program generates 2-D gel-type images from LC/MS data; the mzxml2dat program extracts high-quality data by cleaning the spurious noise and creating centroid MS spectra; the PepList program extracts a list of peptide features from LC/MS data, such as the monoisotopic masses, the charges, and the retention times of the MS spectra; the PepMatch program aligns peptide features of multiple samples; and finally, the PepArray program generates an array of peptide information. For each selected peptide present in each sample, this program generates its normalised abundance value and its retention time. These final arrays can be exported to a clustering tool and then be further analysed, e.g., to find quantitative differences in LC/MS samples. XCMS [27] and OpenMS [28] are also open-source suites of scripts, but written in the R statistical programming and C languages (C++ for OpenMS). Compiled versions of XCMS for Mac OS X and Windows™ are available for download from the website listed in Table 3.

Since LC/MS image analysis is quite a recent field, the Proteomics applications of the above software reported in the scientific literature are so far scarce. SpecArray has been used to analyse serum samples of male and female mice [26], and Mzmine has been applied to describe the stress-induced response of the medicinal plant Catharanthus roseus and the production of metabolic compounds [29]. Moreover, none of the available packages yet provides all the necessary functions for differential analysis in one single application, obliging the user to jump from one application/script to another. Also, note that a performance comparison of the available packages is for the moment impossible.

3 Protein identification and quantitation with mass spectral analysis

One of the main objectives of Proteomics is to separate and identify proteins of interest. For the unambiguous identification of any protein, information that is unique to that particular molecule or its family is required. 2-DE gels can provide information about pI and molecular weight; LC alone can give the retention time, but these values are not precise and are thus very poor predictors. Sequencing the protein (or parts of it) gives much better predictors; however, it is time-consuming. These issues are overcome with MS. MS is nowadays a central technique in Proteomics. It is part of almost all workflows for proteomic analysis and allows a fine and precise study of specific characteristics of proteins and peptides. Typically, one protein, or a mixture of proteins, is enzymatically digested, and the resulting peptide masses are measured, producing an MS spectrum, also called a PMF. Peptides can also be isolated and fragmented within the mass spectrometer, leading to MS/MS spectra, also called PFF. Various approaches for the identification of proteins or peptides present in biological samples have appeared in the last 20 years. At the beginning, in the 1980s, strategies were developed for the sequencing of peptides from MS/MS spectra without the help of known sequences (de novo sequencing); then, the growth of protein and genomic databases brought forth the so-called PMF identification method in the 1990s, and soon after that, the PFF identification method, named as such by analogy with the PMF approach. These three MS-based identification approaches are now routinely used, and the next bewildering step, from the end-user's point of view, is to establish his/her confidence in the obtained results. Recent developments have been made to help end-users validate their results with tools such as ProteinProphet™ [56] and PeptideProphet™ [30]. A quite new domain of bioinformatics for proteomics is the pipelines that automate all processes involved in MS identification. In addition to automation, these pipelines combine different identification strategies to increase the fraction of spectra that may be identified and to improve the quality of, and confidence in, the identification and characterisation results.

3.1 PMF and PFF analysis

In PMF analysis, the experimental spectrum is compared with theoretical ones computed from protein sequences stored in databases and digested in silico using the same cleavage specificity as the protease employed in the experiment. In a simplistic view, the procedure roughly counts overlapping masses between the experimental and theoretical spectra, leading to 'similarity scores' for each candidate protein. The candidate proteins are then sorted according to their scores. The top-ranked protein (or proteins, in case there are homologues in the searched database or several proteins in the spectrum) is considered as the identification of the spectrum. The key step of the procedure lies in the scoring function. Many factors must be taken into account to produce a robust score, such as discrepancies in peak positions due to internal or calibration errors or to modified amino acids, expected peak intensities, noise, contaminant or missing peaks and so on. A variety of scoring schemes have been implemented in various algorithms, and some of them are integrated into available software. Table 4 gives the names and URLs of a number of PMF tools. PeptideSearch [31] and PepFrag [32] use a simple score based on the number of common masses between the experimental and theoretical spectra. Pappin et al. [33] designed a scoring function for the algorithm called MOWSE, which accounts for the nonuniform distribution of protein and peptide molecular weights in databases. Similar scoring schemes are exploited in MS-Fit [34], MASCOT [35] and ProFound [36]. Aldente [37], on the other hand, exploits the Hough transform to determine the mass spectrometer deviation, to realign the experimental masses and to exclude outliers. Complete reviews on these and other related scoring functions are given in [38, 39]. All the tools listed in Table 4 have web interfaces for online submissions, and MASCOT also has a commercial solution for in-house submissions.

Table 4. Available PMF tools

Software         Query website
Aldente          www.expasy.org/tools/aldente
MASCOT           www.matrixscience.com
MS-Fit           prospector.ucsf.edu
PepFrag          prowl.rockefeller.edu/prowl/pepfragch.html
PepMAPPER        wolf.bms.umist.ac.uk/mapper
PeptideSearch    www.mann.embl-heidelberg.de/GroupPages/PageLink/peptidesearchpage.html
ProFound         prowl.rockefeller.edu/profound_bin/WebProFound.exe
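The simplistic view of PMF scoring described above, counting overlapping masses and ranking the candidate proteins, can be illustrated with a minimal sketch. It is not the scoring function of any tool listed in Table 4; the function names, the 0.2 Da tolerance and the toy masses are all assumptions made for illustration.

```python
def shared_peak_count(experimental, theoretical, tol=0.2):
    """Count experimental masses lying within `tol` Da of any theoretical mass."""
    return sum(
        1 for e in experimental
        if any(abs(e - t) <= tol for t in theoretical)
    )


def rank_candidates(experimental, candidates, tol=0.2):
    """Score every candidate protein and sort by decreasing score."""
    scores = {
        name: shared_peak_count(experimental, masses, tol)
        for name, masses in candidates.items()
    }
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Toy data: all masses are invented for illustration only.
experimental = [500.30, 842.50, 1045.60, 1300.70]
candidates = {
    "protA": [500.25, 842.51, 1045.55, 2000.00],
    "protB": [500.90, 900.00, 1300.65],
}
ranking = rank_candidates(experimental, candidates)
print(ranking)
```

A plain count of this kind ignores peak intensities, calibration drift and the uneven distribution of peptide masses in databases, which is precisely why the more elaborate scoring schemes cited above were developed.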

The PFF approach is very similar to the PMF approach, but is applied to MS/MS spectra, and hence correlates peptide spectra with theoretical peptides from a database. Apart from this fundamental difference, the PFF approach is modelled on PMF: theoretical MS/MS spectra are computed from the theoretical peptide sequences and correlated with the experimental MS/MS spectrum in order to find the most similar (highest-scoring) candidate peptide. MS/MS-based identification presents several advantages over PMF. It is possible to work with complex peptide mixtures or to search homologous databases. Moreover, provided there is significant peptide coverage, detailed information about the peptide sequence and about possible modifications and mutations can be obtained. Last but not least, MS/MS identification does not require all the peptides of a given protein to be confirmed to achieve confident identification. However, MS/MS identification also faces difficulties. In practice, a great majority of the MS/MS spectra collected during an experiment cannot be confidently matched to theoretical peptides. Possible reasons for these nonmatches are: (i) presence of contaminants and coeluting peptides; (ii) bad quality spectra with noise or unusual fragmentation; (iii) incorrect or imprecise precursor mass; (iv) spectra derived from proteins not present in the database or with an alternative splicing not annotated in the database; (v) missed or exotic cleavage sites; (vi) transpeptidation [40]; (vii) unexpected PTMs and mutations, as well as nonannotated polymorphisms (allelic variations) in the protein sequences; (viii) sequencing errors propagated in the databases (especially when obtained by automatic translation of genomic data); or (ix) any other unforeseeable event where a peptide spectrum does not exactly correspond to any candidate peptide from the database. Fortunately, algorithms have been developed to deal with most of these issues (Table 5), although none of them is currently able to handle all problems at once. Some are specialised in reducing the number and complexity of MS/MS spectra while increasing their quality (such as NoDupe [41]); others have been specifically designed to handle unexpected modifications or mutations, such as Popitam [39, 42], GutenTag [43] and InsPecT [44]; and some can deal with the extended list of modifications using various strategies. GutenTag and InsPecT use a strong filtering based on de novo tag extraction to limit the number of candidate peptides so as to take into account a large number of potential modifications. Others, like Phenyx (based on the Olav scoring system [45]), MASCOT, X!Tandem [46] and VEMS, split the identification process into several stages. The first stage is aimed at building a list of candidate proteins from confidently identified spectra, and the second one at matching unidentified spectra against this list with more combinatorial parameters (e.g., taking into account a larger number of modification types).
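To make the core PFF step concrete, the sketch below computes a very simple theoretical MS/MS spectrum (singly charged b- and y-ions only) for a candidate peptide and counts how many of these fragments are found in an experimental peak list. It is an illustrative assumption, not the model of SEQUEST, MASCOT, Phenyx or any other tool in Table 5, which use far richer fragmentation models and scores; the function names and the 0.5 Da tolerance are hypothetical. The residue masses are standard monoisotopic values.

```python
# Monoisotopic residue masses in Da (standard values).
MONO = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.01056
PROTON = 1.00728


def fragment_ions(peptide):
    """Return the singly charged b- and y-ion m/z values of a peptide."""
    residues = [MONO[aa] for aa in peptide]
    b_ions, y_ions = [], []
    total = 0.0
    for mass in residues[:-1]:            # N-terminal prefixes
        total += mass
        b_ions.append(total + PROTON)
    total = 0.0
    for mass in reversed(residues[1:]):   # C-terminal suffixes
        total += mass
        y_ions.append(total + WATER + PROTON)
    return b_ions, y_ions


def matched_fragments(spectrum, peptide, tol=0.5):
    """Count theoretical fragments found in the experimental peak list."""
    b_ions, y_ions = fragment_ions(peptide)
    return sum(
        1 for t in b_ions + y_ions
        if any(abs(t - peak) <= tol for peak in spectrum)
    )


# Toy spectrum (invented peaks) scored against two candidate peptides.
spectrum = [58.03, 147.11, 218.15]
scores = {pep: matched_fragments(spectrum, pep) for pep in ("GAK", "AGK")}
print(scores)
```

Ranking the candidate peptides by such a count is the MS/MS analogue of the PMF overlap count; unexpected modifications shift fragment masses and break the match, which is one of the reasons for nonmatches listed above.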

Among the tools in Table 5, SEQUEST [47] and Spectrum Mill are commercial software. Phenyx and MASCOT are commercial as well, but both have a web interface for low-throughput submissions. NoDupe, GutenTag, X!Tandem and ProID [48] are downloadable tools to be installed on the user's computer. The others offer web interfaces for free online submissions.

Apart from their scoring schemes, the tools listed in Tables 4 and 5 function in a very similar way. In fact, all follow the same list of routines: (i) database digestion, (ii) filtering of candidate proteins and peptides, (iii) similarity scoring and usually (iv) results validation. Typically, they allow searching in various databases, such as the UniProt Knowledgebase (Swiss-Prot and TrEMBL) [49] or NCBI [50]. Some, like Sonar ms/ms™ [51], have the peculiarity of performing searches against on-the-fly translated genomic data (as well as protein data). This is particularly interesting when dealing with organisms for which only genomic data are known (and whose proteomic data are incomplete). Cleavage rules depend on the type of enzyme used for proteolysis; trypsin is usually the preferred protease, even though others are also offered. In most software interfaces, the user may allow for the skipping of one or two cleavage sites (missed cleavages), or account for peptides with one unspecific end (half cleavages). Completely unspecific cleavage (no enzyme) can also be taken into account, although this option results in a significant increase in computing time [52]. In general, these tools let the user apply filtering criteria before or after the digestion to reduce the analysis to a small portion of the database that contains the correct peptide with high probability. Reducing the number of candidate peptides reduces the computing time and diminishes the risk of getting a high score by chance. Various filters may be used, such as the sample species (thus avoiding parsing the whole taxonomic range of the database), the protein mass range or protein pI for PMF, and the measured precursor mass of the spectrum or de novo extracted tags for PFF tools [39].

Table 5. Available PFF tools

Software             Source website
GutenTag             fields.scripps.edu/GutenTag/index.html
InsPecT              peptide.ucsd.edu/inspect.py
MASCOT               www.matrixscience.com/search_form_select.html
MS-Tag and MS-Seq    Prospector.ucsf.edu
NoDupe               fields.scripps.edu/nodupe/index.html
OMSSA                Pubchem.ncbi.nlm.nih.gov/omssa
PepFrag              prowl.rockefeller.edu/prowl/pepfragch.html
PepProbe             bart.scripps.edu/public/search/pep_probe/search.jsp
Phenyx               www.phenyx-ms.com
Popitam              www.expasy.org/tools/popitam
ProID                Sashimi.sourceforge.net/software_mi.html
SEQUEST              fields.scripps.edu/sequest/index.html
Sonar ms/ms          65.219.84.5/service/prowl/sonar.html
Spectrum Mill        www.home.agilent.com
VEMS                 www.bio.aau.dk/en/biotechnology/vems.htm
X!Tandem             human.thegpm.org/tandem/thegpm_tandem.html
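The digestion and filtering routines shared by these tools can be sketched as follows. The example assumes the commonly used trypsin rule (cleavage after K or R, except before P), lets the user allow a number of missed cleavages, and then keeps only the peptides whose mass is close to a measured precursor mass. The function names, the toy sequence and the 0.5 Da tolerance are assumptions for illustration; actual tools implement their own rules and options.

```python
# Monoisotopic residue masses in Da (standard values).
MONO = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.01056


def cleave_trypsin(protein):
    """Split after K or R, except when the next residue is P."""
    fragments, start = [], 0
    for i, aa in enumerate(protein):
        nxt = protein[i + 1] if i + 1 < len(protein) else ""
        if aa in "KR" and nxt != "P":
            fragments.append(protein[start:i + 1])
            start = i + 1
    if start < len(protein):
        fragments.append(protein[start:])
    return fragments


def digest(protein, missed_cleavages=1):
    """Return all peptides with at most `missed_cleavages` skipped sites."""
    fragments = cleave_trypsin(protein)
    peptides = []
    for start in range(len(fragments)):
        last = min(start + missed_cleavages, len(fragments) - 1)
        for end in range(start, last + 1):
            peptides.append("".join(fragments[start:end + 1]))
    return peptides


def peptide_mass(peptide):
    """Neutral monoisotopic mass of an unmodified peptide."""
    return sum(MONO[aa] for aa in peptide) + WATER


def filter_by_precursor(peptides, precursor_mass, tol=0.5):
    """Keep candidate peptides whose mass is within `tol` Da of the precursor."""
    return [p for p in peptides if abs(peptide_mass(p) - precursor_mass) <= tol]


peptides = digest("MKAGRPLKDE", missed_cleavages=1)  # toy sequence
candidates = filter_by_precursor(peptides, 640.40)
print(peptides, candidates)
```

Each additional missed cleavage enlarges the candidate list, which illustrates why relaxing the cleavage rules, up to fully unspecific cleavage, increases computing time and the risk of chance matches.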

Some of the above-mentioned tools have been compared with respect to their sensitivity and specificity. The sensitivity of a tool indicates its ability to make a correct identification regardless of the quality of the data, while its specificity indicates its ability to assign low-ranking scores to random (or incorrect) matches. For example, the PMF tool ProFound was considered more sensitive and specific than MS-Fit [53]. Among the PFF tools compared, SEQUEST and Spectrum Mill had good sensitivity values, while MASCOT, Sonar and X!Tandem had good specificity [54]. Another study [55] compared Phenyx and SEQUEST in a shotgun approach to identify proteins from a cellular extract of Drosophila Kc167 cells and concluded that a great number of confident identifications overlap between these two programs. To the best of our knowledge, these few articles are the only published benchmarks of identification tools. A significant barrier that remains is the lack of appropriate comparable environments, because the tools not only differ in the score calculation, but also do not necessarily use standardised methodologies when generating the candidate peptides (e.g., use of variants or PTM information as described in UniProtKB/Swiss-Prot [56]) or in their filtering procedures (e.g., filtering by tags). To overcome these issues, PepProbe [57] offers a single web interface (a single environment) allowing the comparison of various MS/MS spectra preprocessing methods and various scoring functions. It is particularly interesting for experienced end-users who would like to gain a better understanding of different scoring mechanisms.
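For readers who want to quantify these two notions on their own benchmark data, the short sketch below uses the standard statistical definitions of sensitivity and specificity computed from counts of true/false positives and negatives. This is an assumption about how the terms could be operationalised; the cited benchmark studies may define and measure them differently, and the counts shown are invented.

```python
def sensitivity(true_pos, false_neg):
    """Fraction of correct peptides or proteins that the tool identified."""
    return true_pos / (true_pos + false_neg)


def specificity(true_neg, false_pos):
    """Fraction of incorrect (random) matches that the tool rejected."""
    return true_neg / (true_neg + false_pos)


# Invented counts for a hypothetical benchmark with known answers.
sens = sensitivity(true_pos=80, false_neg=20)
spec = specificity(true_neg=90, false_pos=10)
print(sens, spec)
```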

The plethora of apparently similar MS-based identification tools generates quite variable results. Judging the correctness of the assigned proteins and peptides has become a laborious problem for the end-user. ProteinProphet [56] and PeptideProphet [30], which work at the protein and peptide levels, respectively, and DTASelect [58] help to validate the identifications performed by SEQUEST and MASCOT, for instance. In addition, there are several rules to observe when identifying proteins with mass spectrometric data [59]. Notably, accurate mass measurements, restriction of the database search to a specific taxonomy, large numbers of peptides matched for one protein and large percentages of the protein sequence covered are essential for PMF and PFF identification. These elementary rules help to reduce false positive matches and give better confidence in the results obtained. Similar rules have been transcribed into guidelines for publication purposes [60, 61]. Users who are not familiar with the various search engines should submit multiple queries using different parameter values for the same dataset, ideally using different search engines. Higher confidence can then be placed on a protein that is the first hit or has the highest score in several queries. While validation tools and guidelines are essential to guarantee good analysis results, one should never accept results produced by computer programs as absolute truths, and one should thus also use one's own biological knowledge, experience and judgement.
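The advice to submit multiple queries and to trust proteins that repeatedly come out on top can be sketched as a simple vote over the first hits of each run. This is only an illustration of the idea, not the method of ProteinProphet, PeptideProphet or DTASelect; the run labels and protein identifiers are hypothetical.

```python
from collections import Counter


def consensus_top_hits(results):
    """Count how often each protein is the first hit across several queries.

    `results` maps a run label (engine and parameter set) to a ranked
    list of protein identifiers, best hit first.
    """
    votes = Counter(ranked[0] for ranked in results.values() if ranked)
    return votes.most_common()


# Hypothetical ranked outputs from three queries on the same dataset.
results = {
    "engineA, tolerance 0.5 Da": ["P1", "P2"],
    "engineA, tolerance 1.0 Da": ["P1", "P3"],
    "engineB, default settings": ["P2", "P1"],
}
consensus = consensus_top_hits(results)
print(consensus)
```

A protein that is the top hit in most runs deserves more confidence than one that appears first only under a single parameter setting, although such a vote does not replace statistical validation or the user's own judgement.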

3.2 MS/MS de novo sequencing

De novo sequencing consists in inferring knowledge about the peptide sequence independently of any information extracted from a pre-existing protein or DNA database. The peptide sequence, or part(s) of it, is read directly from the MS/MS spectrum and usually includes amino acids as well as partially interpreted (or degenerate) masses representing combinations of several amino acids. The complete or partial de novo sequences are then compared to theoretical sequences using specifically developed string similarity search algorithms. Since they do not use database information during spectrum interpretation, de novo sequencing algorithms work in a search space composed of the set of all possible sequences that can be represented by the spectrum, with no restriction other than the arrangement of the peaks. Due to the size of this search space, de novo sequencing methods are at a disadvantage compared to PFF methods. They require spectra of higher quality with smaller fragment errors and a more or less continuous signal, or at least a high-quality signal for several adjacent amino acids in the case of partial spectrum interpretation. Spectra with unusual fragmentation will be very hard or even impossible to analyse. Despite these disadvantages, de novo methods may outperform PFF methods, notably when searching genomic databases subject to sequencing errors, when searching databases composed of homologous sequences in the case of cross-species identification, and when analysing a spectrum that originates from a mutated protein or a variant. In effect, de novo algorithms naturally extract sequences from the spectrum that include the amino acid replacements. The similarity search algorithm then handles replacements by allowing mismatches between the de novo and the database sequences.
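The principle of reading a sequence directly from the spectrum can be illustrated with a deliberately naive sketch: the mass differences between consecutive peaks of a single ion series are compared with amino acid residue masses, and any difference that matches no single residue is kept as a partially interpreted mass. This is an assumption-laden toy, not the algorithm of any tool in Table 6; real de novo algorithms must cope with mixed ion series, noise and missing peaks. The function name and the 0.02 Da tolerance are hypothetical.

```python
# Monoisotopic residue masses in Da (standard values).
MONO = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}


def read_ladder(peaks, tol=0.02):
    """Interpret consecutive peak differences of one ion series as residues.

    Isobaric residues are reported together (e.g. "L/I"); a difference
    that matches no single residue is kept as a bracketed mass gap.
    """
    peaks = sorted(peaks)
    tag = []
    for low, high in zip(peaks, peaks[1:]):
        delta = high - low
        matches = [aa for aa, mass in MONO.items() if abs(delta - mass) <= tol]
        tag.append("/".join(matches) if matches else f"[{delta:.2f}]")
    return tag


# Toy b-ion ladder (computed from residue masses; the last gap spans two residues).
ladder = [88.03931, 159.07642, 290.11691, 387.16967, 500.25373, 670.35925]
tag = read_ladder(ladder)
print(tag)
```

The resulting partial sequence, including the uninterpreted mass, is the kind of input that the similarity search algorithms mentioned above compare with database sequences.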

Table 6 presents a number of algorithms and tools dedicated to de novo sequencing. DeNovoX and Spectrum Mill are proprietary software included in the vendors' equipment. PEAKS is a stand-alone commercial software, but it also has a web interface for free submissions. Lutefisk and AUDENS [62] are downloadable tools, while PepNovo and Sequit! have website interfaces for free online submissions.

Table 6. Available MS/MS de novo sequencing tools

Software         Source website
AUDENS           www.ti.inf.ethz.ch/pw/software/audens/
DeNovoX          www.thermo.com
Lutefisk         www.hairyfatguy.com/Lutefisk
PEAKS            www.bioinformaticssolutions.com
PepNovo          peptide.ucsd.edu/pepnovo.py
Sequit!          sequit.proteomefactory.com
Spectrum Mill    www.home.agilent.com

Currently, PFF is the most common strategy for identifying proteins in complex mixtures; it is more robust than PMF or de novo sequencing. De novo sequencing is mainly used for cross-species identification [63] or in the PFF approach to generate partial sequence information in order to filter candidate peptides prior to identification [64].

3.3 Identification platforms

Too often, laboratory scientists multiply manual procedures in order to analyse proteomics data efficiently. Software tools are run several times in order to empirically discover the best parameter settings [65]. When various strategies of MS analysis are used, the results are manually selected and combined. In most situations though, only one single tool is utilised for protein identification, along with a unique parameter setting. Many spectra are thus missed due to inappropriate parameter values, to inadequate filtering or merely to the under-performance of certain scoring schemes for the quality of the spectra at hand. The flexible automation of computer tasks, as well as a combination of different workflow strategies, is thus necessary to enhance data analysis, to reduce human interaction and to achieve high-throughput analysis.

Up to now, very few platforms dedicated to proteomic data processing have been implemented. These platforms aim to automate the identification process so as to reduce data analysis time and to enhance the quality of identification as well as the coverage of matched spectra. The Trans-Proteomic Pipeline (TPP) [66] from the Institute for Systems Biology is an open-source platform comprising the suite of tools for MS/MS analysis mentioned in the LC/MS analysis section. This pipeline can import output files from SEQUEST and comprises various modules, mainly for postprocessing, including result validation, quantification of isotopically labelled samples, as well as the Pep3D tool for viewing raw LC/MS data and results at the peptide and protein levels.

Scaffold, a commercial platform from Proteome Software, analyses MASCOT and SEQUEST results, validates the hits by cross-correlation with the X!Tandem tool, filters out uninteresting spectra and exports high-quality unidentified ones for future analysis.

ProteinScape [67], another commercial platform from Bruker Daltonics, covers many steps of a proteome study. It manages data storage, archiving and retrieval and handles identification or quantification of the data. Results from various identification tools (MASCOT, Phenyx, SEQUEST and X!Tandem for MS/MS) are combined into a unique score (meta-score), and high-quality unidentified spectra may be interpreted with an algorithm that takes into account unknown PTMs (a so-called open-search algorithm). De novo sequencing can be triggered for unidentified spectra (with the RapiDeNovo tool), and if necessary, visualisation tools help in manually validating hits (BioTools). ProteinLynx Global SERVER (from Waters/Micromass) is also a commercial platform; it combines multiple processing runs and integrates protein identification results from the vendor's proprietary software (ProteinLynx) and from MASCOT.

Ideally, these architectures should, on the one hand, run several tools in serial mode in order to cover the full analysis, i.e., from the preprocessing of MS data to the validation of the results. On the other hand, they should analyse the same data in parallel with several similar tools (with various identification strategies, parameters and filtering modes) to increase the confidence in the results obtained and the number of identified spectra. In addition, these platforms should be totally automated and very flexible. For example, they should avoid being completely linear and allow feedback loops that would improve the identification process. SwissPIT, an identification platform implementing such a multilevel pipeline, which may be accessed through a web interface, is currently under development at the Swiss Institute of Bioinformatics (Hernandez et al., manuscript in preparation).

Table 7. Pipeline tools

Software                      Company                         Source website
ProteinLynx Global SERVER     Waters/Micromass                www.waters.com
ProteinScape™                 Bruker/Protagen AG              www.proteinscape.com
Scaffold                      Proteome Software               www.proteomesoftware.com
Trans-Proteomic Pipeline      Institute for Systems Biology   tools.proteomecenter.org/TPP.php

Table 8. Various tools for MS data analysis

Software          Usage                                    Source website
ASAPRatio         Quantitative analysis                    tools.proteomecenter.org/ASAPRatio.php
DTASelect         Validation of protein identifications    fields.scripps.edu/DTASelect
MSQuant           Quantitative analysis                    Msquant.sourceforge.net
PeptideProphet    Validation of protein identifications    tools.proteomecenter.org/PeptideProphet.php
ProteinProphet    Validation of protein identifications    tools.proteomecenter.org/ProteinProphet.php
RelEx             Quantitative analysis                    fields.scripps.edu/relex
XPRESS            Quantitative analysis                    tools.proteomecenter.org/XPRESS.php
ZoomQuant         Quantitative analysis                    Proteomics.mcw.edu

The development of unified data storage and exchange formats is undoubtedly a necessity for pipelines that integrate various algorithms. Advances in this field have been made through the HUPO Proteomic Standards Initiative (PSI) [23]. Subgroups of this initiative are currently working on defining formats for capturing information related to peak lists (mzData) as well as the parameters and results of search engines (analysisXML). Numerous tools mentioned in this manuscript have already integrated mzData reading and writing capabilities (a list of them is given at http://psidev.sourceforge.net/ms/mzdata_implementers.html). Beyond MS, other PSI working groups are dealing with standardisation for gel electrophoresis and protein-protein interaction data (http://psidev.sourceforge.net/). Other issues in proteomics data standardisation are further discussed in the companion article of this special issue [2].

3.4 Isotope labelling quantitation tools

Several tools are available to help quantify and interpret data generated through specific applications of MS (summary in Table 8). MS-based quantitative approaches include tagging or chemical modification methods, such as isotope-coded affinity tags (ICAT™), isobaric tag for relative and absolute quantitation (iTRAQ™) or stable isotope labelling by amino acids in cell culture (SILAC™), to name just a few. Some bioinformatics tools were designed to calculate relative ratios of proteins and peptide pairs labelled with these tagging methods. ASAPRatio [68], XPRESS [69] and RelEx [70] calculate the relative abundances of ICAT-labelled peptides after their analysis with LC-MS/MS and identification with SEQUEST or MASCOT. The MSQuant tool quantifies SILAC-labelled peptides as well. ZoomQuant [71] is specialised in the quantitation of 18O-labelled peptides, after their identification with SEQUEST.
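The basic quantity behind these tools, the relative ratio of a light/heavy labelled peptide pair and its summary at the protein level, can be sketched as follows. This is a minimal illustration under stated assumptions: intensities are taken as already extracted, and the protein ratio is simply the median of its peptide ratios. The cited tools implement their own, more elaborate procedures, which are not reproduced here; the function names and intensity values are hypothetical.

```python
from statistics import median


def peptide_ratio(light_intensity, heavy_intensity):
    """Relative abundance of the light form versus the heavy form."""
    return light_intensity / heavy_intensity


def protein_ratio(pairs):
    """Summarise a protein by the median ratio of its labelled peptide pairs.

    `pairs` is a list of (light_intensity, heavy_intensity) tuples;
    pairs with a zero heavy signal are skipped.
    """
    return median(peptide_ratio(l, h) for l, h in pairs if h > 0)


# Invented intensities for three peptide pairs of one protein.
pairs = [(200.0, 100.0), (150.0, 100.0), (300.0, 100.0)]
ratio = protein_ratio(pairs)
print(ratio)
```

The median is used here only because it is less sensitive than the mean to a single outlying peptide pair; other summaries are equally possible.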

4 Conclusions

In the last 20 years, bioinformatics for proteomics has evolved from protein sequence alignment, through the analysis of 2-DE gel images, to comprehensive pipelines that integrate data from various separation methods as well as protein identification by MS. Several tools have been developed that reduce the end-user's manual analysis. Certainly, improvements directed towards increasing performance, reducing operation time and making automation a more realistic goal are still needed. One should keep in mind, though, that whatever the quality of the bioinformatics tools, the quality of the results they produce depends directly on the input data they are fed with. As a matter of fact, proteomics techniques and bioinformatics tools have a symbiotic relationship: new experimental methods require newly adapted tools, while the new tools need well-established techniques.

5 References

[1] Wilkins, M. R., Williams, K. L., Appel, R. D., Hochstrasser, D., Proteome Research: New Frontiers in Functional Genomics, Springer-Verlag, Berlin, Heidelberg, New York 1997.

[2] Lisacek, F., Cohen-Boulakia, S., Appel, R. D., Proteomics 2006, DOI 10.1002/pmic.200600275.

[3] Garrels, J. I., J. Biol. Chem. 1989, 264, 5269–5282.

[4] Appel, R. D., Hochstrasser, D. F., Funk, M., Vargas, J. R. et al., Electrophoresis 1991, 12, 722–735.

[5] Dowsey, A. W., Dunn, M. J., Yang, G. Z., Proteomics 2003, 3, 1567–1596.

[6] Appel, R. D., Vargas, J. R., Palagi, P. M., Walther, D. et al., Electrophoresis 1997, 18, 2735–2748.

[7] Cutler, P., Heald, G., White, I. R., Ruan, J., Proteomics 2003, 3, 392–401.

[8] Pleissner, K. P., Hoffmann, F., Kriegel, K., Wenk, C. et al., Electrophoresis 1999, 20, 755–765.

[9] Smilansky, Z., Electrophoresis 2001, 22, 1616–1626.

[10] Rogers, M., Graham, J., Tonge, R. P., Proteomics 2003, 3, 879–886.

[11] Rosengren, A. T., Salmi, J. M., Aittokallio, T., Westerholm, J. et al., Proteomics 2003, 3, 1936–1946.


[12] Wheelock, A. M., Buckpitt, A. R., Electrophoresis 2005, 26, 4508–4520.

[13] Raman, B., Cheung, A., Marten, M. R., Electrophoresis 2002, 23, 2194–2202.

[14] Ünlü, M., Morgan, M. E., Minden, J. S., Electrophoresis 1997, 18, 2071–2077.

[15] Lemkin, P. F., Electrophoresis 1997, 18, 461–470.

[16] Lemkin, P. F., Thornwall, G., Evans, J., in: Walker, J. (Ed.), The Protein Protocols Handbook, Humana Press Inc., Totowa, NJ 2005, pp. 279–305.

[17] Young, N., Chang, Z., Wishart, D. S., Bioinformatics 2004, 20, 976–978.

[18] Berger, S. J., Lee, S. W., Anderson, G. A., Pasa-Tolic, L. et al., Anal. Chem. 2002, 74, 4994–5000.

[19] Palmblad, M., Ramstrom, M., Markides, K. E., Hakansson, P. et al., Anal. Chem. 2002, 74, 5826–5830.

[20] Palagi, P. M., Walther, D., Quadroni, M., Catherinet, S. et al., Proteomics 2005, 5, 2381–2384.

[21] Gasteiger, E., Gattiker, A., Hoogland, C., Ivanyi, I. et al., Nucleic Acids Res. 2003, 31, 3784–3788.

[22] Pedrioli, P. G., Eng, J. K., Hubley, R., Vogelzang, M. et al., Nat. Biotechnol. 2004, 22, 1459–1466.

[23] Orchard, S., Hermjakob, H., Binz, P. A., Hoogland, C. et al., Proteomics 2005, 5, 337–339.

[24] Katajamaa, M., Miettinen, J., Oresic, M., Bioinformatics 2006, 22, 634–636.

[25] Leptos, K. C., Sarracino, D. A., Jaffe, J. D., Krastins, B. et al., Proteomics 2006, 6, 1770–1782.

[26] Li, X. J., Yi, E. C., Kemp, C. J., Zhang, H. et al., Mol. Cell. Proteomics 2005, 4, 1328–1340.

[27] Smith, C. A., Want, E. J., O’maille, G., Abagyan, R. et al., Anal. Chem. 2006, 78, 779–787.

[28] Gröpl, C., Lange, E., Reinert, K., Kohlbacher, O. et al., in: Berthold, M., Glen, R., Diederichs, K., Kohlbacher, O. F. I. (Eds.), Lecture Notes in Bioinformatics, Springer, Heidelberg, Germany 2005, pp. 151–163.

[29] Katajamaa, M., Oresic, M., BMC Bioinformatics 2005, 6, 179.

[30] Keller, A., Nesvizhskii, A. I., Kolker, E., Aebersold, R., Anal. Chem. 2002, 74, 5383–5392.

[31] Mann, M., Wilm, M., Anal. Chem. 1994, 66, 4390–4399.

[32] Fenyo, D., Qin, J., Chait, B. T., Electrophoresis 1998, 19, 998–1005.

[33] Pappin, D. J., Hojrup, P., Bleasby, A. J., Curr. Biol. 1993, 3, 327–332.

[34] Clauser, K. R., Baker, P., Burlingame, A. L., Anal. Chem. 1999, 71, 2871–2882.

[35] Perkins, D. N., Pappin, D. D. J., Creasy, D. M., Cottrell, J. S., Electrophoresis 1999, 20, 3551–3567.

[36] Zhang, W., Chait, B. T., Anal. Chem. 2000, 72, 2482–2489.

[37] Gasteiger, E., Hoogland, C., Gattiker, A., Duvaud, S. et al., in: Walker, J. M. (Ed.), The Proteomics Protocols Handbook, Humana Press, New Jersey 2005, pp. 571–607.

[38] Gras, R., Muller, M., Curr. Opin. Mol. Ther. 2001, 3, 526–532.

[39] Hernandez, P., Müller, M., Appel, R. D., Mass Spectrom. Rev. 2006, 25, 235–254.

[40] Schaefer, H., Chamrad, D. C., Marcus, K., Reidegeld, K. A. et al., Proteomics 2005, 5, 846–852.

[41] Tabb, D. L., MacCoss, M. J., Wu, C. C., Anderson, S. D. et al., Anal. Chem. 2003, 75, 2470–2477.

[42] Hernandez, P., Gras, R., Frey, J., Appel, R. D., Proteomics 2003, 3, 870–878.

[43] Tabb, D. L., Saraf, A., Yates, J. R. III, Anal. Chem. 2003, 75, 6415–6421.

[44] Tanner, S., Shu, H., Frank, A., Wang, L. C. et al., Anal. Chem. 2005, 77, 4626–4639.

[45] Colinge, J., Masselot, A., Giron, M., Dessingy, T. et al., Proteomics 2003, 3, 1454–1463.

[46] Craig, R., Beavis, R. C., Bioinformatics 2004, 20, 1466–1467.

[47] Eng, J. K., McCormack, A. L., Yates, J. R. III, J. Am. Soc. Mass Spectrom. 1994, 5, 976–989.

[48] Zhang, N., Aebersold, R., Schwikowski, B., Proteomics 2002, 2, 1406–1412.

[49] Wu, C. H., Apweiler, R., Bairoch, A., Natale, D. A. et al., Nucleic Acids Res. 2006, 34, D187–D191.

[50] Wheeler, D. L., Church, D. M., Edgar, R., Federhen, S. et al., Nucleic Acids Res. 2004, 32, D35–D40.

[51] Field, H. I., Fenyo, D., Beavis, R. C., Proteomics 2002, 2, 36–47.

[52] Craig, R., Beavis, R. C., Rapid Commun. Mass Spectrom. 2003, 17, 2310–2316.

[53] Chamrad, D. C., Korting, G., Stuhler, K., Meyer, H. E. et al., Proteomics 2004, 4, 619–628.

[54] Kapp, E. A., Schutz, F., Connolly, L. M., Chakel, J. A. et al., Proteomics 2005, 5, 3475–3490.

[55] Heller, M., Ye, M., Michel, P. E., Morier, P. et al., J. Proteome Res. 2005, 4, 2273–2282.

[56] Boeckmann, B., Bairoch, A., Apweiler, R., Blatter, M. C. et al., Nucleic Acids Res. 2003, 31, 365–370.

[57] Sadygov, R., Wohlschlegel, J., Park, S. K., Xu, T. et al., Anal. Chem. 2006, 78, 89–95.

[58] Tabb, D. L., McDonald, W. H., Yates, J. R. III, J. Proteome Res. 2002, 1, 21–26.

[59] Baldwin, M. A., Mol. Cell. Proteomics 2004, 3, 1–9.

[60] Carr, S., Aebersold, R., Baldwin, M., Burlingame, A. et al., Mol. Cell. Proteomics 2004, 3, 531–533.

[61] Wilkins, M. R., Appel, R. D., Van Eyk, J. E., Chung, M. C. et al., Proteomics 2006, 6, 4–8.

[62] Grossmann, J., Roos, F. F., Cieliebak, M., Liptak, Z. et al., J. Proteome Res. 2005, 4, 1768–1774.

[63] Liska, A. J., Shevchenko, A., Proteomics 2003, 3, 19–28.

[64] Frank, A., Tanner, S., Bafna, V., Pevzner, P., J. Proteome Res. 2005, 4, 1287–1295.

[65] Ossipova, E., Fenyo, D., Eriksson, J., Proteomics 2006, 6, 2079–2085.

[66] Keller, A., Eng, J. K., Zhang, N., Li, X., Aebersold, R., Mol. Syst. Biol. 2005, 7, 2005.0017.

[67] Chamrad, D. C., Koerting, G., Gobom, J., Thiele, H. et al., Anal. Bioanal. Chem. 2003, 376, 1014–1022.

[68] Li, X. J., Zhang, H., Ranish, J. A., Aebersold, R., Anal. Chem. 2003, 75, 6648–6657.

[69] Han, D. K., Eng, J., Zhou, H., Aebersold, R., Nat. Biotechnol. 2001, 19, 946–951.

[70] MacCoss, M. J., Wu, C. C., Liu, H., Sadygov, R. et al., Anal. Chem. 2003, 75, 6912–6921.

[71] Halligan, B. D., Slyper, R. Y., Twigger, S. N., Hicks, W. et al., J. Am. Soc. Mass Spectrom. 2005, 16, 302–306.
