
Differential Protein Expression Analysis via Liquid-Chromatography/Mass-Spectrometry Data Visualization

Lars Linsen∗ Julia Löcherbach∗† Matthias Berth† Jörg Bernhardt‡ Dörte Becher‡

∗ Department of Mathematics and Computer Science, Ernst-Moritz-Arndt-Universität Greifswald, Greifswald, Germany

† DECODON GmbH, Greifswald, Germany

‡ Department of Microbiology, Ernst-Moritz-Arndt-Universität Greifswald, Greifswald, Germany

Figure 1: Interactive visualization of liquid-chromatography/mass-spectrometry data: (a) Global view of entire data set. (b) Close-up view of region of interest.

ABSTRACT

Differential protein expression analysis is one of the main challenges in proteomics. It denotes the search for proteins whose encoding genes are differentially expressed under a given experimental setup. An important task in this context is to identify the differentially expressed proteins or, more generally, all proteins present in the sample. One of the most promising and recently widely used approaches for protein identification is to cleave proteins into peptides, separate the peptides using liquid chromatography, and determine the masses of the separated peptides using mass spectrometry. The resulting data needs to be analyzed and matched against protein sequence databases. The analysis step is typically done by searching for intensity peaks in a large number of 2D graphs. We present an interactive visualization tool for the exploration of liquid-chromatography/mass-spectrometry data in a 3D space, which allows for the understanding of the data in its entirety and a detailed analysis of regions of interest. We compute differential expression over the liquid-chromatography/mass-spectrometry domain and embed it visually in our system. Our exploration tool can treat single liquid-chromatography/mass-spectrometry data sets as well as data acquired using multi-dimensional protein identification technology. For efficiency purposes we perform a peak-preserving data resampling and multiresolution hierarchy generation prior to visualization.

∗ [email protected] † {loeba,berth}@decodon.com ‡ {Joerg.Bernhardt,dbecher}@uni-greifswald.de

CR Categories: I.3.7 [Computer Graphics]: Three-dimensional Graphics and Realism; I.3.8 [Computer Graphics]: Applications

Keywords: interactive visual exploration, hierarchical data representation, visualization in bioinformatics, proteomics

1 INTRODUCTION

While the last decade was dominated by the study of genomes (genomics), the past few years have seen tremendous progress in the study of gene expression: transcriptomics and especially proteomics. The term proteome was first used to describe the set of proteins encoded by the genome [22]. Nowadays, proteomics refers to the study of the function of all expressed proteins [20].

One of the main challenges in proteomics is the analysis of differential protein expression or, more precisely, the identification of proteins whose encoding genes are differentially expressed. Its significance in biology and medicine is evident. For example, in order to understand how diseases affect organisms, one can differentially compare the expression in healthy and diseased cells.

For protein identification, a protein consisting of a long chain of amino acids can be digested into many smaller peptides, which are separated using liquid chromatography and measured using mass spectrometry. The measured data is analyzed and matched against existing protein sequence databases. When using multi-dimensional protein identification technology (MudPIT) [21], the liquid chromatography method is applied under multiple different conditions in order to obtain data even from proteins whose properties are hard to capture. The multiple results need to be merged during the analysis step.

For differential protein expression analysis, data from different cell populations (e. g., diseased vs. healthy) or from one cell population acquired under different conditions need to be compared. Protein identification is used to register the data sets and to identify differentially expressed proteins. More information on the relevant biological background in the field of proteomics is given in Section 2.

We present an interactive visualization tool whose goal is to aid protein identification and differential protein expression analysis. Our tool displays data from liquid chromatography and mass spectrometry in a three-dimensional space. Traditionally, a two-dimensional plot of the liquid-chromatography result and a large number of two-dimensional plots of the mass-spectrometry results are examined independently. The integration of data from both techniques allows for

• a better understanding of the data set in its entirety,
• quantitative depiction of expression ratios on a global scale,
• and easy visual detection of data acquisition errors.

Our tool also supports the visualization of data acquired using multi-dimensional protein identification technology. Each fraction can be displayed individually or in a merged setup. In the merged setup, each peptide (or its fragments) is represented by its maximum intensity peak, while color coding is used to indicate the condition under which the maximum intensity has been observed. The merged setup requires a registration of the individual fractions prior to visualization, as liquid chromatography-mass spectrometry experiments are not precisely reproducible.

Registration is also required prior to differential expression computation. Differential expression of a test vs. a control data set is computed and displayed over the liquid chromatography-mass spectrometry domain. Thus, differential expression visualization can naturally be integrated into our visual exploration tool. Our tool is described in detail in Section 5. In Section 6, we describe scenarios in which our visualization tool is used for error detection, protein identification, and differential protein expression analysis.

Depending on the experimental setup, data sets may be quite large. Moreover, the data is unstructured in one of the dimensions and has non-equidistant spacing in the second dimension. In order to obtain an efficient visualization at interactive frame rates, we apply a resampling of the data prior to visualization. Our resampling method ensures that maximum intensity peaks are maintained. After resampling, we build a hierarchical data scheme, which allows for multiresolution visualization of the data set. The data preparation is described in Section 4.

2 BACKGROUND

2.1 Protein Identification

Proteins consist of long chains of amino acids, while shorter chains of amino acids form peptides. A protein is defined by its sequence of amino acids. A common technique to identify a protein is to cut the protein into peptides or even fragments of peptides, identify the fragments, and determine the amino acid sequence of the protein from the fragments. There are several ways of performing the individual steps. For our experiments, we have decided to use liquid chromatography (LC) followed by mass spectrometry (MS) or even multi-dimensional protein identification technology (MudPIT) [21]. It has been demonstrated that LC/MS-based methods are very powerful and in certain aspects superior to other approaches such as 2D electrophoresis, see for example [12]. In particular, LC/MS-based methods are capable of capturing both cell proteins and membrane proteins and seem to perform especially well for the latter.

Figure 2 illustrates the individual processing steps leading to protein identification. In the first step of our processing pipeline, proteins are digested by the enzyme trypsin. One can even predict where a protein will be cut: for example, trypsin cuts after lysine (Lys) and arginine (Arg), provided no proline is present. The result is the set of peptides of a protein.
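The predictability of the trypsin cut can be illustrated with a minimal sketch. It assumes the commonly stated rule that trypsin cleaves after lysine (K) or arginine (R) unless the next residue is proline (P); the function name `tryptic_digest` and the one-letter sequence encoding are illustrative choices and not part of the described system.

```python
def tryptic_digest(sequence):
    """Split a protein sequence (one-letter amino acid codes) into peptides.

    Assumed rule: cut after K or R, except when the following residue is P.
    """
    if not sequence:
        return []
    peptides = []
    start = 0
    last = len(sequence) - 1
    for i, residue in enumerate(sequence):
        if i == last:
            break  # nothing to cut after the final residue
        if residue in "KR" and sequence[i + 1] != "P":
            peptides.append(sequence[start:i + 1])
            start = i + 1
    peptides.append(sequence[start:])  # remaining C-terminal peptide
    return peptides


if __name__ == "__main__":
    print(tryptic_digest("AKRPGKLR"))
```

Real digests also exhibit missed cleavages, which this sketch deliberately ignores.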

In order to examine the peptides individually, we need to separate them. Peptide separation is done by liquid chromatography (LC). A liquid containing the peptides is forced through a column (loading). The column contains a substrate that binds the peptides. Afterwards, the peptides are washed out of the column using a water-acetonitrile mixture (eluting). The weaker a peptide is bound, the faster it gets washed out. Thus, peptides can be separated by their binding properties (hydrophobicity).

The masses of the separated peptides can be determined individually using mass spectrometry (MS). Mass spectrometry is a technique that separates ions by their mass-to-charge ratios (m/z-ratios). Thus, we need to ionize the peptides. Different approaches for ionization exist. Among the most popular are electrospray ionization (ESI) [5] and matrix-assisted laser desorption ionization (MALDI) [8]. When using electrospray ionization, one may have to normalize the results due to decreasing ionization. After ionization, the molecules are accelerated and handed to the mass analyzer. The mass analyzer uses electric or magnetic fields (quadrupole) to deflect the charged particles, while the kinetic energy imparted by motion gives the particles inertia dependent on their mass. The mass analyzer steers the particles to a detector based on their m/z-ratio. The detector measures intensity in counts per second. The result is typically displayed in a two-dimensional graph, where intensity is shown over mass or m/z-ratio, respectively, see Figure 3.

Figure 3: Mass spectrometry outputs intensity values over m/z-ratios. Intensity peaks represent the amount of most frequently present peptides.

For the final identification step, the fragment pattern of each peptide is matched against patterns in a database (e. g., Mascot [13]) to determine the amino acid sequence.

One problem of the described LC-MS approach is that the simultaneous digest of the protein mixture results in a highly complex collection of thousands of peptides. Thus, a single LC step may not be capable of separating all of them. This problem is typically solved by dividing the peptides into several fractions using multiple LC iterations. Each fraction is then analyzed using mass spectrometry. This is referred to as multi-dimensional protein identification technology (MudPIT) [21].

Figure 2: Protein identification pipeline: (i) Proteins are cleaved to form peptides. (ii) Peptides are separated using liquid chromatography: reversed-phase trap (RP) column is loaded with sample peptides and eluted using acetonitrile gradient. (iii) Masses of separated peptides are determined using mass spectrometry: ionize peptides, deflect ions using quadrupole, and detect ions. (iv) Analyze data using 3D visual exploration and quantification tool. (v) Match analysis results with database.

2.2 Structure of the Data

The data coming out of a liquid chromatograph is a function over time. When using a detector to measure the peptides leaving the column, it detects peptides at discrete points in time $t_1, \dots, t_n \in \mathbb{R}$. The points in time $t_i$, $i = 1, \dots, n$, are not distributed equidistantly. The number $n$ can be in the range of many thousands.

The data coming out of the mass spectrometer is a function over the m/z-ratio. The intensity is measured in counts per time and stored as an intensity list for discrete m/z-ratios $m_1, \dots, m_p \in \mathbb{R}$. The m/z-ratios $m_j$, $j = 1, \dots, p$, are not distributed equidistantly either. The number $p$ depends on the experimental setup and can be in the range of tens of thousands. Figure 3 shows a plot of such an intensity list. The plot exhibits several high peaks.

Instead of generating such two-dimensional graphs for each point in time $t_i$, $i = 1, \dots, n$, we use a three-dimensional setup, where the intensity is shown as a heightfield over the dimensions time and m/z-ratio. Unfortunately, the m/z-ratios $m_1, \dots, m_p$ vary from one point in time $t_i$ to the subsequent point in time $t_{i+1}$, and even the number $p$ of values in the m/z-ratio dimension varies significantly. Thus, when looking at a two-dimensional domain with the dimensions being m/z-ratio and time, data positions are scattered in one dimension and non-equidistant in the other dimension. Figure 4 sketches the discrete data positions for LC-MS in the two-dimensional domain.

In terms of acquired data, MudPIT can be regarded as multiple LC-MS. A MudPIT data set consists of several data sets of the structure shown in Figure 4, each of which represents one fraction. Thus, MudPIT adds another dimension.
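One possible in-memory layout for the data structure just described is sketched here. The class and field names (`Scan`, `LCMSRun`, `MudPITDataSet`) are hypothetical and chosen only for illustration; the point is that each time step carries its own m/z list of varying length, and a MudPIT data set is a list of such runs.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Scan:
    """One mass spectrum recorded at a single point in time t_i."""
    time: float
    mz: List[float]          # m/z positions, differ from scan to scan
    intensity: List[float]   # one intensity per m/z position

    def __post_init__(self):
        if len(self.mz) != len(self.intensity):
            raise ValueError("mz and intensity must have the same length")


@dataclass
class LCMSRun:
    """One LC-MS data set: scans at non-equidistant points in time."""
    scans: List[Scan] = field(default_factory=list)

    def times(self):
        return [scan.time for scan in self.scans]


@dataclass
class MudPITDataSet:
    """Several LC-MS runs, one per fraction (the additional dimension)."""
    fractions: List[LCMSRun] = field(default_factory=list)


if __name__ == "__main__":
    run = LCMSRun([
        Scan(0.0, [400.1, 512.7], [10.0, 250.0]),
        Scan(1.3, [400.2, 455.0, 512.6], [12.0, 3.0, 240.0]),
    ])
    print(run.times(), [len(s.mz) for s in run.scans])
```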

Figure 4: Structure of LC-MS data (axes: time t with samples t_1, t_2, ..., t_n; m/z): Data value positions are scattered in m/z-dimension and non-equidistant in t-dimension.

2.3 Driving Biological Questions

In the following, we formulate the driving biological questions that we intend to answer using our visual exploration tool.

Since liquid chromatography operations are rather tricky to execute and many things can go wrong, the first thing that biologists need to know before they proceed with an extensive data analysis is whether the data is actually correct.

Question 1: Can visualization provide an immediate and obvious check for correctness of the data?

Typically, LC-MS and MudPIT data are explored by looking at intensity plots over m/z-ratio for many points in time (many peptides) and various fractions (cf. Figure 3). Looking at many or even all these mass spectrometry plots individually is a very tiring process. It also does not provide an intuitive understanding of the entire data on a global scale. When looking at a selected set of mass spectrometry plots, it is unknown whether the shown peaks are significant, i. e., among the ones with highest intensity, or not.

Question 2: Can visualization provide a global understanding of the data while still displaying full (quantitative) information about position and size of intensity peaks? Does such a visualization allow for interactive data exploration even for large data sets without losing detail information?

Most important for a visualization tool in terms of biological insight is not to display one data set, but to compare different data sets, where each data set represents an experiment taken under certain conditions.

Question 3: Can visualization display information that allows for differential protein expression analysis?

3 RELATED WORK

Visualization of LC-MS data is commonly restricted to a simple plotting of mass spectrometry functions for each peptide (cf. Figure 3). To allow for a global understanding of the data, a visualization method over a two-dimensional domain (like the one sketched in Figure 4) has been introduced by Li et al. [9]. The output of the visualization method, however, restricts itself to two-dimensional images, where intensity is displayed using different shades of gray in a linear or logarithmic mapping. The images exhibit clearly the positions of intensity peaks, but the different shades of gray are hard to distinguish, which limits the perception of the actual intensity values.

In the work by Li et al. [9], the resolution of the resulting images can be chosen by the user. To deal with the challenges of not operating on a Cartesian grid (cf. Section 2.2), the authors averaged the intensities collected within each pixel. Due to the averaging step, intensity peaks may not have been displayed with their maximum intensity but with a much lower value. Significant intensity peaks may have been reduced to smaller peaks that do not seem to be noteworthy. From a biological point of view, one would rather maintain the actual maximum intensity of a peak, even if it gets slightly shifted. The shift is bounded by $1/\sqrt{2}$ of the pixel resolution. Such maximum-preserving operations have been used in other contexts, e. g. [17].

To deal with large data sets, multiresolution methods are commonly applied in visualization. Many different approaches exist for two-dimensional and even higher-dimensional domains [4, 7, 10, 14]. A common technique to build multiresolution hierarchies is the use of wavelets [18]. By storing data values explicitly for the lowest resolution only and computing higher-resolution representations by successively adding details, wavelet-based hierarchies do not require any additional storage space (compared to the storage space of the original data at highest resolution). In the work by Andersson et al. [1], one-dimensional wavelets have been applied to LC data for data reduction and de-noising.

Differential expression data has been visualized in the context of functional genomics. Gene expression determined using microarray technology is displayed using two-dimensional [6, 19] or three-dimensional scatter plots [16]. In proteomics, differential expression can be visualized by (differential) intensities over the two dimensions time (from LC) and m/z-ratio (from MS).

When using 2D electrophoresis, proteins of a sample are separated and identified by 2D orthogonal displacement. Visualizations using 2D images are common and have been used for differential expression display in [2]. Registration of such 2D images [3] allows for a subsequent fusion of various images [11].

4 DATA PREPARATION

4.1 Resampling

Since LC-MS data is scattered in the m/z-dimension, a direct data visualization method would have to apply scattered data interpolation techniques or domain triangulation (e. g., Delaunay triangulation). Neither approach (in its general form) accounts for the non-scattered structure in the time-dimension. Moreover, we observed that they are too computationally expensive for practical purposes when applied to large data sets.

To allow for an efficient visualization with an acceptable amount of preprocessing, we decided to perform a resampling step. Since the time-dimension is already structured, we only apply a one-dimensional resampling in the m/z-direction. We generate a structured domain with non-equidistant samples in the t-direction and equidistant samples in the m/z-direction.

Resampling should be done such that all intensity peaks are preserved (with their maximum intensity). The only way to fulfill this condition is to use a sufficiently high resampling rate. If the resolution res of the mass spectrograph is known, it is best to resample with rate 1/(2 · res). If the resolution of the mass spectrograph is unknown, it can be estimated by determining the minimum distance between any two measurements.
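The resampling idea can be sketched as follows, under the assumption that each equidistant m/z bin simply keeps the maximum intensity of the measurements falling into it and that empty bins are set to zero. The function names and the choice of bin width are illustrative; the text does not prescribe this exact procedure.

```python
import math


def estimate_resolution(mz_values):
    """Estimate the resolution as the minimum distance between two measurements."""
    ordered = sorted(mz_values)
    return min(b - a for a, b in zip(ordered, ordered[1:]))


def resample_max(mz_values, intensities, mz_min, step, n_bins):
    """Resample one spectrum onto an equidistant m/z grid.

    Each bin keeps the maximum intensity observed inside it, so a peak is
    never lowered by the resampling (it may only shift within its bin).
    """
    grid = [0.0] * n_bins  # assumption: bins without any measurement are zero
    for mz, value in zip(mz_values, intensities):
        k = math.floor((mz - mz_min) / step)
        if 0 <= k < n_bins:
            grid[k] = max(grid[k], value)
    return grid


if __name__ == "__main__":
    mz = [100.02, 100.04, 100.31]
    inten = [5.0, 9.0, 2.0]
    print(resample_max(mz, inten, mz_min=100.0, step=0.1, n_bins=4))
```

Applying `resample_max` to every time step with the same `mz_min`, `step`, and `n_bins` yields a grid that is equidistant in the m/z-direction while the time samples stay where they were measured.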

Obviously, we generate a lot of redundant information. For displaying data visually on a computer screen, there is no need to go beyond the screen's resolution. We reduce the generated data by merging adjacent data values. However, we still want to be able to retrieve the highest resolution data for display when zooming into regions of interest and when outputting data to peak quantification tools. We generate a hierarchical data representation that allows for multiresolution visualization and data access.

4.2 Hierarchical Data Representation

A hierarchical data representation scheme stores a data set at various resolutions. For downsampling, the sample positions of resolution $L_n$ are split into two groups: the ones that belong to the next coarser resolution $L_{n-1}$ (even vertices) and the ones that belong to $L_n \setminus L_{n-1}$ (odd vertices). The values at the even vertices are computed from the values at the sample positions $\in L_n$. When using wavelet-based techniques, the values at the odd vertices store the "difference" between the values at the even vertices and the values at the sample positions $\in L_n$. Thus, resolution $L_n$ can be reconstructed from resolution $L_{n-1}$ at any time by adding the differences. Only resolution $L_{n-1}$ and the difference set $L_n \setminus L_{n-1}$ need to be stored, which is the same amount of data storage as needed to store $L_n$. Thus, setting up a multiresolution hierarchy using a wavelet scheme does not require additional data storage.

The simplest and, thus, most widely used wavelets are Haar wavelets, see [18]. One-dimensional Haar wavelets compute the values $f^{n-1}_i \in L_{n-1}$ at the even vertices by averaging the values $f^n_i$ and $f^n_{i+1}$ at the respective sample point pairs $\in L_n$. The values $f^{n-1}_{i+1}$ at the odd vertices store the differences $f^n_i - f^{n-1}_i$.

We adopt the ideas from wavelet-based multiresolution hierarchy generation. However, averaging data values would cause intensity peaks to lose their maximum intensity or even to vanish. To maintain the maximum intensity of all peaks, we set the values at the even vertices to

$$f^{n-1}_i = \max(f^n_i, f^n_{i+1}).$$

The values at the odd vertices store the differences

$$|f^{n-1}_{i+1}| = f^{n-1}_i - \min(f^n_i, f^n_{i+1}).$$

The sign of $f^{n-1}_{i+1}$ is used to indicate whether $f^n_i$ or $f^n_{i+1}$ was the larger value. Figure 5 shows an example for our peak-preserving multiresolution hierarchy generation.
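The peak-preserving scheme can be sketched in a few lines. The sign convention used here (non-negative detail when the first value of a pair is the larger one, negative otherwise) is an assumption, since the text only states that the sign encodes which value was larger; the function names are illustrative, and the input length is assumed to be a power of two.

```python
def decompose(values):
    """One downsampling step: keep the pair maximum, store a signed difference."""
    if len(values) % 2:
        raise ValueError("number of samples must be even")
    coarse, details = [], []
    for a, b in zip(values[0::2], values[1::2]):
        hi, lo = max(a, b), min(a, b)
        coarse.append(hi)
        # Assumed convention: detail >= 0 means the first value was the larger.
        details.append(hi - lo if a >= b else -(hi - lo))
    return coarse, details


def reconstruct(coarse, details):
    """Invert one step (exact for integers, up to rounding for floats)."""
    fine = []
    for c, d in zip(coarse, details):
        fine.extend([c, c - d] if d >= 0 else [c + d, c])
    return fine


def build_hierarchy(values):
    """Repeat decompose until one value remains; returns (coarsest, detail levels)."""
    current, levels = list(values), []
    while len(current) > 1:
        current, details = decompose(current)
        levels.append(details)
    return current, levels


if __name__ == "__main__":
    top, levels = build_hierarchy([3, 7, 5, 1])
    print(top, levels)
```

For four input samples the hierarchy stores one coarsest value plus three details, i.e. the same number of values as the input, and the coarsest value equals the maximum of all four samples.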

Figure 5: Hierarchical data representation: Multiresolution scheme is peak-preserving ($f^{n-2}_0 = \max\{f^n_0, f^n_1, f^n_2, f^n_3\}$) and does not require additional storage space (stores only $[\{f^{n-2}_0\}; \{f^{n-2}_2\}; \{f^{n-1}_1, f^{n-1}_3\}]$).

5 VISUAL DATA EXPLORATION

5.1 Interactive Visualization System

We have developed a three-dimensional visualization system for interactive exploration of LC-MS data. We render heightfields over the two-dimensional resampled domain with dimensions time and m/z-ratio. The system allows for visualization of the entire data set on a global scale, see Figure 1(a), and zoomed views into regions of interest, see Figure 1(b). To fulfill the real-time requirements, the appropriate resolution is selected from the multiresolution data hierarchy described in the previous section.

The color scheme for the visualization can be changed interactively. A one-dimensional transfer function is used to map intensity values to RGB color values. Thus, in the general setting, the color of a peak indicates its intensity. Also, the material properties and the shading method can be chosen by the user. In terms of viewing, the system supports both parallel and perspective projection. To filter, i. e., select and display only the most significant peaks, a thresholding method is provided that culls all peaks beneath a chosen threshold. The system also allows for peak labeling. The labels show the peaks' properties (e. g., intensity, m/z-value, time-value) or, if known, the name of the respective peptide. Figures 1 and 6 illustrate certain aspects of the system's functionality.
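A one-dimensional transfer function and the thresholding step could look roughly like the following sketch. The piecewise-linear interpolation between user-defined control points and the choice to flatten culled samples to zero are assumptions made for illustration; the names `transfer_function` and `cull_below` are hypothetical.

```python
def transfer_function(intensity, control_points):
    """Map an intensity to an RGB triple.

    control_points: list of (intensity, (r, g, b)) sorted by intensity.
    Values between control points are interpolated linearly; values outside
    the covered range are clamped to the first or last color.
    """
    if intensity <= control_points[0][0]:
        return control_points[0][1]
    for (x0, c0), (x1, c1) in zip(control_points, control_points[1:]):
        if intensity <= x1:
            w = (intensity - x0) / (x1 - x0)
            return tuple((1 - w) * a + w * b for a, b in zip(c0, c1))
    return control_points[-1][1]


def cull_below(heights, threshold):
    """Suppress all heightfield samples beneath the threshold (set to zero)."""
    return [[h if h >= threshold else 0.0 for h in row] for row in heights]


if __name__ == "__main__":
    blue_to_red = [(0.0, (0.0, 0.0, 1.0)), (100.0, (1.0, 0.0, 0.0))]
    print(transfer_function(50.0, blue_to_red))
    print(cull_below([[1.0, 5.0], [10.0, 2.0]], threshold=5.0))
```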

5.2 Peak Quantification

While typically smooth shading, diffuse and specular material, and perspective projection are used to render the heightfield in an easy-to-perceive volumetric fashion, the other options are offered to serve specific tasks. If we look at the heightfield from a bird's eye view using a parallel projection, a two-dimensional image is rendered, where color encodes intensity with respect to the chosen transfer function. We use flat shading without diffuse or specular lighting so as not to distort the color. Figure 7(a) shows such a bird's eye view.

The rendered image can be written to a file using common image formats. The resolution for the output can be chosen arbitrarily from the multiresolution data hierarchy. To maximize the accuracy of further processing steps, one would typically choose the highest resolution. The output can cover the entire data set or a selected region.

Since most image formats use a rather limited number of bits for storing the data, the high-precision representation used for visualization must be converted to a lower-precision representation for the output images. As the intensity peaks tend to exhibit very high intensities, it is often beneficial to convert the data using a logarithmic rather than a linear mapping. Both conversion options are supported, though. Figure 7(b) shows an output image using a logarithmic scale of gray values.
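The effect of the two conversion options can be seen in a small sketch. The 8-bit target range and the use of `log1p` (which keeps zero intensity at gray value zero) are assumptions; the exact mapping used by the system is not specified in the text.

```python
import math


def to_gray(intensity, max_intensity, logarithmic=True, levels=256):
    """Convert a high-precision intensity to an integer gray value.

    Assumes an image format with `levels` gray values (256 for 8 bits).
    """
    if max_intensity <= 0:
        return 0
    x = min(max(intensity, 0.0), max_intensity)  # clamp to the valid range
    if logarithmic:
        fraction = math.log1p(x) / math.log1p(max_intensity)
    else:
        fraction = x / max_intensity
    return round(fraction * (levels - 1))


if __name__ == "__main__":
    for value in (0.0, 10.0, 1000.0, 100000.0):
        print(value,
              to_gray(value, 100000.0, logarithmic=False),
              to_gray(value, 100000.0, logarithmic=True))
```

With a linear mapping, peaks that are orders of magnitude below the largest one collapse to nearly the same gray value, whereas the logarithmic mapping spreads them over the available range.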

Existing quantification tools can be used to measure the spots in the two-dimensional images. The quantified values are used to match with existing databases, which allows us to identify peptides and, finally, proteins. We have used Delta2D¹ for quantification and Mascot [13] for database look-ups.

5.3 Differential Protein Expression

For differential protein expression analysis, data is acquired by experiments under changing conditions or of different cell populations. Of interest is the change in expression from one experimental setup to another. Typically, one experimental setup serves as control data, while the other setups provide the test data. Each experiment can be visualized and analyzed individually using our interactive visualization tool described above. To visualize the differential expression of test vs. control data, we use the same three-dimensional setup as before and display differential expression in terms of heightfields over dimensions time (LC) and m/z-ratio (MS).

To compute the differential expression, we have to determine differences in peak intensity. Unfortunately, LC-MS measurements are reproducible only within a certain tolerance. We observed that this tolerance is rather high. Intensity peaks for one and the same peptide may shift significantly when executing an experiment multiple times. Thus, prior to differential expression computation, we need to register (or align) test and control data. The registration step warps the domain of the test data set onto the domain of the control data set. Figure 8 illustrates the shift by visualizing test and control data as two heightfields over the same domain. The positions of the most dominant peaks of the two heightfields should coincide but, instead, exhibit a severe shift.

Figure 8: Visualization of test (green) and control (red) data as two heightfields over same domain without registration. Locations of most dominant peaks are supposed to coincide but clearly exhibit a shift.

The warping transformation is computed using a landmark approach. In both data sets we identify the most significant intensity peaks. A one-to-one mapping of the test intensity peaks to the control intensity peaks is done by hand for a few intensity peaks. These are the so-called landmarks. The user intervention is moderate, since only a few landmarks are required. (In our experiments, we used between five and 30 landmarks.) Our visual representation of the data makes it very easy to match the peaks even for non-professionals. We triangulate the domains using the landmarks. The warping transformation is linear within each triangle.
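A warp that is linear within each triangle can be expressed with barycentric coordinates, as in the following sketch. The triangulation itself is taken as given (the text does not state which triangulation is used), points are (time, m/z) pairs, and all function names are illustrative.

```python
def barycentric(p, a, b, c):
    """Barycentric coordinates of point p with respect to triangle (a, b, c)."""
    (px, py), (ax, ay), (bx, by), (cx, cy) = p, a, b, c
    det = (by - cy) * (ax - cx) + (cx - bx) * (ay - cy)
    wa = ((by - cy) * (px - cx) + (cx - bx) * (py - cy)) / det
    wb = ((cy - ay) * (px - cx) + (ax - cx) * (py - cy)) / det
    return wa, wb, 1.0 - wa - wb


def warp_in_triangle(p, src_tri, dst_tri):
    """Map p from the test-domain triangle to the matching control-domain triangle."""
    wa, wb, wc = barycentric(p, *src_tri)
    (ax, ay), (bx, by), (cx, cy) = dst_tri
    return (wa * ax + wb * bx + wc * cx, wa * ay + wb * by + wc * cy)


def warp(p, src_triangles, dst_triangles, eps=1e-12):
    """Find the landmark triangle containing p and apply its linear map."""
    for src_tri, dst_tri in zip(src_triangles, dst_triangles):
        if all(w >= -eps for w in barycentric(p, *src_tri)):
            return warp_in_triangle(p, src_tri, dst_tri)
    return p  # assumption: points outside the triangulated region stay put


if __name__ == "__main__":
    test_tri = ((0.0, 0.0), (1.0, 0.0), (0.0, 1.0))
    control_tri = ((0.0, 0.0), (2.0, 0.0), (0.0, 2.0))
    print(warp((0.25, 0.25), [test_tri], [control_tri]))
```

By construction, each landmark of the test data is mapped exactly onto its matched landmark in the control data, and the map is continuous across shared triangle edges.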

¹ http://www.decodon.com

Figure 6: Options for interactive visualization system (cf. Figure 1): (a) Changing color, material properties, and shading method. (b) Thresholding for peak selection combined with peptide labeling of significant peaks.

Figure 7: Peak quantification: Bird's eye view of heightfield (a) is output to 2D image (b) using logarithmic scale. Existing software is used for peak quantification on spots in 2D image.

After the warping step, we can compute the differential expression by mere subtraction of the intensity values. Figure 9 shows an example. We merge the two heightfields (test vs. control) by taking the maximum intensities and color the resulting heightfield green in case of down-regulation and red in case of up-regulation. (The resulting heightfield here represents the maximum of the two intensities, not the difference.)
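The merge of registered test and control intensities can be sketched as below. It assumes that up-regulation means the test intensity exceeds the control intensity; the `"neutral"` label for equal intensities and the function name are illustrative additions.

```python
def differential_expression(test, control):
    """Combine registered test and control intensities sample by sample.

    Returns one (height, difference, color) triple per sample, where the
    height is the maximum of the two intensities and the color encodes the
    sign of the difference test - control.
    """
    merged = []
    for t, c in zip(test, control):
        difference = t - c
        if difference > 0:
            color = "red"      # assumed: test above control = up-regulation
        elif difference < 0:
            color = "green"    # test below control = down-regulation
        else:
            color = "neutral"  # hypothetical label for unchanged intensity
        merged.append((max(t, c), difference, color))
    return merged


if __name__ == "__main__":
    print(differential_expression([5, 2, 3], [1, 4, 3]))
```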

5.4 Multi-dimensional Protein Identification Technology

To visualize MudPIT data, we generate a heightfield for each fraction. We provide a slider for the user to switch between the heightfield renderings of the individual fractions. Figure 10 shows the visualization of three fractions exhibiting changes in intensity.

MudPIT data also suffers from limited reproducibility. Thus, the locations of intensity peaks representing peptides present in subsequent fractions do not coincide. When trying to integrate MudPIT data from various fractions into one setup, we again need to perform a registration step. We proceed as described in the previous section. Figure 11 shows the integrated visualization of various fractions after registration. Each fraction is assigned one color. Thus, colors indicate in which fraction intensity peaks are highest.
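The integration of several registered fractions into one colored heightfield can be sketched as a per-sample maximum that also records which fraction attained it. This is only an illustration: it assumes all fractions share one grid after registration, the function name is hypothetical, and ties are resolved in favor of the earlier fraction, which is an arbitrary choice.

```python
def integrate_fractions(fractions):
    """Combine registered fraction heightfields (equally sized 2D lists).

    Returns a grid of (height, fraction_index) tuples, where height is the
    maximum intensity over all fractions and fraction_index selects the
    color of the fraction in which that maximum occurs.
    """
    if not fractions:
        raise ValueError("at least one fraction is required")
    rows, cols = len(fractions[0]), len(fractions[0][0])
    merged = []
    for i in range(rows):
        merged_row = []
        for j in range(cols):
            # max() returns the first index on ties, i.e. the earlier fraction
            best = max(range(len(fractions)), key=lambda k: fractions[k][i][j])
            merged_row.append((fractions[best][i][j], best))
        merged.append(merged_row)
    return merged


integrated = integrate_fractions([[[1, 5]], [[3, 2]], [[3, 7]]])
```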

Using the registered fractions, we can also interpolate between the intensities of subsequent fractions. Instead of using a slider to switch between visualizations of different fractions as in Figure 10, we generate an animation over the fraction dimension, where intensity values are animated over fractions in the order of increasing ammonium chloride concentration.2 For smooth transitions we use linear interpolation of the heightfields. The animation allows for an even better perception of intensity changes with changing fractions.
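A single animation frame obtained by linear interpolation between neighboring fractions could look as follows. The sketch assumes registered fractions on a common grid and a continuous animation parameter s that runs from 0 to the number of fractions minus one; both function names and the parameterization are hypothetical.

```python
def interpolate_heightfields(h0, h1, t):
    """Linear blend of two equally sized heightfields for t in [0, 1]."""
    return [[(1 - t) * a + t * b for a, b in zip(row0, row1)]
            for row0, row1 in zip(h0, h1)]


def animation_frame(fractions, s):
    """Heightfield at animation position s, where s = k shows fraction k."""
    n = len(fractions)
    if n < 2 or not 0 <= s <= n - 1:
        raise ValueError("need at least two fractions and 0 <= s <= n - 1")
    i = min(int(s), n - 2)  # index of the left neighbor fraction
    return interpolate_heightfields(fractions[i], fractions[i + 1], s - i)


fractions = [[[0, 2]], [[4, 6]], [[8, 6]]]
halfway = animation_frame(fractions, 0.5)  # midway between fractions 0 and 1
```

Stepping s in small increments yields the frame sequence of the animation; at integer values of s, the frame reproduces the corresponding fraction exactly.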

2 A movie of an animation accompanies the paper.


Figure 10: Visualization of three fractions obtained by using MudPIT. A slider allows the user to switch interactively between fractions.

Figure 9: Visualization of differential protein expression over the dimensions time (LC) and m/z-ratio (MS): Green indicates downregulation, red upregulation. Differential expression computation requires prior data registration by domain warping.

6 APPLICATION SCENARIO, RESULTS, AND DISCUSSION

To test our system, we have applied our methods to data acquired from the human cell line SiHa, which is used to model cervical cancer [15]. The cell line was grown under normal conditions and showed no perturbation. The MudPIT experiment was done using liquid chromatography (LC) with a reversed-phase (RP) trap column and mass spectrometry (MS) with electrospray ionization (ESI). Eleven fractions with 0 mM, 20 mM, 40 mM, 60 mM, 80 mM, 100 mM, 150 mM, 200 mM, 300 mM, 500 mM, and 900 mM ammonium chloride were used. The retention times during liquid chromatography were in the range of 0 to 85 minutes. The m/z-ratios measured by mass spectrometry were in the range of 300 to 1500. The measured intensities were as high as 10^9 counts per second.

In a first step, we visualized the data as heightfields over the time-dimension measured by liquid chromatography and the mass-over-charge-dimension measured by mass spectrometry. Various visualizations are shown in Figures 1 and 6. By looking at the data from a bird’s eye view as in Figure 7(a), we detected some artifacts. Some of the intensity peaks are not separated but form streaks in

Figure 11: Integrated visualization of various fractions from MudPIT: Colors indicate in which fraction each peak has maximum intensity. The color scheme assigns the colors cyan, white, magenta, yellow, red, green, and blue to fractions in the order of increasing ammonium chloride concentration.

the time-dimension. The streaks indicate problems with the liquid chromatograph, which did not separate the peptides properly. Using our visualization tool, the error becomes obvious and is detected immediately in an intuitive way. Consequently, the experiments would need to be repeated to gain any significant biological insight from further analysis.

Despite the data acquisition error, the bird’s eye view of our heightfield has been output to a two-dimensional image, shown in Figure 7(b). We used the highest resolution and a logarithmic mapping. Since the resolution of the data set in the time-dimension is in the range of the resolution of modern computer screens, the multiresolution hierarchy has only been generated in the m/z-dimension using a one-dimensional scheme. The two-dimensional image can be used to quantify the intensity spots and to match the results with the values obtained from an existing database. This procedure allows for identification of the peptides with highest occurrence. The identified peptides can be labeled with their names.
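The logarithmic mapping of intensities to image values can be sketched as below. The exact mapping used for Figure 7(b) is not specified here, so this is an assumption: intensities are mapped with log(1 + v) and scaled so that the highest peak reaches the maximum gray value of an 8-bit image. The function name and the choice of 8 bits are hypothetical.

```python
import math


def log_gray_image(heightfield, max_gray=255):
    """Map non-negative intensities to integer gray values on a log scale."""
    peak = max(max(row) for row in heightfield)
    if peak <= 0:
        # No signal at all: return an all-black image of the same shape.
        return [[0 for _ in row] for row in heightfield]
    scale = max_gray / math.log1p(peak)
    # log1p(v) = log(1 + v) keeps zero intensity at gray value 0.
    return [[round(scale * math.log1p(max(v, 0.0))) for v in row]
            for row in heightfield]


image = log_gray_image([[0, 9], [99, 999]])
```

The rounding to integer gray values also illustrates why such a conversion loses precision compared with the original intensities, which span several orders of magnitude.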

For the MudPIT data, we can interactively explore the various fractions, as shown in Figure 10 and the video. After registration of the fractions, we integrate the various fractions into one heightfield. The result is shown in Figure 11. From there, we could proceed with the quantification step as done for the LC-MS data.


After the identification step, we looked into differential protein expression to analyze changes under varying conditions. We defined the control and test data, registered them, and explored the differential expression using, once again, the 3D visualization setup. The resulting heightfield is shown in Figure 9. In an intuitive way, we could visually detect significant up- and downregulation of certain peptides, which had been identified beforehand. This new insight can help to formulate biological hypotheses.

It remains to discuss whether the driving biological questions formulated in Section 2.3 have been answered.

• Question 1: We have demonstrated that our visualization tool is, indeed, capable of immediately exposing data acquisition errors in an intuitive and obvious way.

• Question 2: Our examples also show that our visual system allows for a global understanding of the entire data set as well as exploration of regions of interest. We were able to achieve interactive frame rates when displaying the entire data set by generating a multiresolution hierarchy. The hierarchical data representation provides all data information including all details without requiring additional storage space.

• Question 3: The scenario presented in this section describes in detail the individual steps toward differential protein expression analysis. It documents how our visualization system helps to accomplish several important tasks.

7 CONCLUSIONS AND FUTURE WORK

We have presented an interactive visualization system for the analysis of differential protein expression. The system renders heightfields of liquid-chromatography/mass-spectrometry data. It uses data resampling and hierarchical data representation to meet the real-time requirements even for large data sets. The visualization tool also allows for integration of data obtained by multi-dimensional protein identification technology. Differential expression for control vs. test data is computed and visualized after registration via domain warping.

Our system provides an intuitive understanding of the data on a global scale and allows for detailed data exploration. Data acquisition errors become easy to detect. Moreover, the visualization tool supports protein/peptide identification. It also provides an intuitive means to match intensity peaks of different data sets, which can be used for data registration purposes.

In terms of future work, we plan on integrating a quantification step based on the highest-resolution three-dimensional representation into our system, as the current conversion into a standard image format loses precision. Moreover, we would like to incorporate the simultaneous visualization of multiple test data sets for differential protein expression analysis.

REFERENCES

[1] F. O. Andersson, R. Kaiser, and S. P. Jakobsson. Data preprocessing by wavelets and genetic algorithms for enhanced multivariate analysis of LC peptide mapping. Journal of Pharmaceutical and Biomedical Analysis, 34:531–541, 2004.

[2] J. Bernhardt, K. Buttner, C. Scharf, and M. Hecker. Dual channel imaging of two-dimensional electropherograms in Bacillus subtilis. Electrophoresis, 20(11):2225–2240, 1999.

[3] J. Bernhardt, J. Weibezahn, C. Scharf, and M. Hecker. Bacillus subtilis during feast and famine: visualization of the overall regulation of protein synthesis during glucose starvation by proteome analysis. Genome Res., 13(2):224–237, 2003.

[4] P. Cignoni, C. Montani, E. Puppo, and R. Scopigno. Multiresolution modeling and visualization of volume data. IEEE Transactions on Visualization and Computer Graphics, 3(4):352–369, 1997.

[5] J. B. Fenn, M. Mann, C. K. Meng, S. F. Wong, and C. M. Whitehouse. Electrospray ionization for mass spectrometry of large biomolecules. Science, 246(64), 1989.

[6] D. R. Gilbert, M. Schroeder, and J. van Helden. Interactive visualization and exploration of relationships between biological objects. Trends Biotechnol., 18(12):487–494, 2000.

[7] R. Grosso, C. Lurig, and T. Ertl. The multilevel finite element method for adaptive mesh optimization and visualization of volume data. In R. Yagel and H. Hagen, editors, Proceedings of IEEE Conference on Visualization 1997, pages 135–142. IEEE, IEEE Computer Society Press, 1997.

[8] M. Karas and F. Hillenkamp. Laser desorption ionization of proteins with molecular masses exceeding 10 000 daltons. Anal Chem, 60:2299–2301, 1988.

[9] X.-J. Li, P. G. A. Pedrioli, J. E. J, D. Martin, E. C. Yi, H. Lee, and R. Aebersold. A tool to visualize and evaluate data obtained by liquid chromatography/electrospray ionization/mass spectrometry. Anal Chem, 76:3856–3860, 2004.

[10] L. Linsen, V. Pascucci, M. A. Duchaineau, B. Hamann, and K. I. Joy. Wavelet-based multiresolution with ⁿ√2 subdivision. Journal on Computing, 71(1+2), 2004.

[11] S. Luhn, M. Berth, M. Hecker, and J. Bernhardt. Using standard positions and image fusion to create proteome maps from collections of two-dimensional gel electrophoresis images. Proteomics, 3(7):1117–1127, 2003.

[12] D. M. Maynard, J. Masuda, X. Yang, J. A. Kowalak, and S. P. Markey. Characterizing complex peptide mixtures using a multi-dimensional liquid chromatography-mass spectrometry system: Saccharomyces cerevisiae as a model system. Journal of Chromatography B, 810(1):69–76, 2004.

[13] D. N. Perkins, D. J. Pappin, D. M. Creasy, and J. S. Cottrell. Probability-based protein identification by searching sequence databases using mass spectrometry data. Electrophoresis, 20(18):3551–3567, 1999.

[14] D. Pinskiy, E. Brugger, H. R. Childs, and B. Hamann. An octree-based multiresolution approach supporting interactive rendering of very large volume data sets. In H. Arabnia, R. Erbacher, X. He, C. Knight, B. Kovalerchuk, M. Lee, Y. Mun, M. Sarfraz, J. Schwing, and H. Tabrizi, editors, Proceedings of the 2001 International Conference on Imaging Science, Systems, and Technology (CISST 2001), Volume 1, pages 16–22. Computer Science Research, Education, and Applications Press (CSREA), Athens, Georgia, 2001.

[15] J. T. Prince, M. W. Carlson, R. Wang, P. Lu, and E. M. Marcotte. The need for a public proteomics repository (commentary). Nature Biotechnology, 22:471–472, 2004.

[16] N. Shah, V. Filkov, B. Hamann, and K. I. Joy. GeneBox: interactive visualization of microarray data sets. In F. Valafar and H. Valafar, editors, International Conference on Mathematics and Engineering Techniques in Medicine and Biological Sciences (METMBS ’03), pages 10–16. Computer Science Research, Education, and Applications Press (CSREA), Athens, Georgia, 2003.

[17] Y. Shinagawa and T. L. Kunii. Unconstrained automatic image matching using multiresolutional critical-point filters. IEEE Trans. Pattern Anal. Mach. Intell., 20(9):994–1010, 1998.

[18] E. J. Stollnitz, T. D. DeRose, and D. H. Salesin. Wavelets for Computer Graphics: Theory and Applications. The Morgan Kaufmann Series in Computer Graphics and Geometric Modeling, Brian A. Barsky (series editor), Morgan Kaufmann Publishers, San Francisco, U.S.A., 1996.

[19] C. Tang, L. Zhang, and A. Zhang. Interactive visualization and analysis for gene expression data. In Hawaii International Conference on System Sciences, 2002.

[20] M. Tyers and M. Mann. From genomics to proteomics. Nature, 422:193–197, 2003.

[21] M. P. Washburn, D. Wolters, and J. R. Yates III. Large-scale analysis of the yeast proteome by multidimensional protein identification technology. Nature Biotechnology, 19:242–247, 2001.

[22] M. R. Wilkins, C. Pasquali, R. D. Appel, K. Ou, O. Golaz, J. C. Sanchez, J. X. Yan, A. A. Gooley, G. Hughes, I. Humphery-Smith, K. L. Williams, and D. F. Hochstrasser. From proteins to proteomes: large scale protein identification by two-dimensional electrophoresis and amino acid analysis. Biotechnology, 14:61–65, 1996.