Transcriptomics Analysis Methods: Microarray Data Processing, Analysis and Visualization Using the Affymetrix Genechip ® Vitis Vinifera Genome Array

Karen A. Schlauch, Jerome Grimplet, John Cushman, and Grant R. Cramer

Abstract The study of transcriptomics is a powerful method of studying the responses of organisms to their environment. The transcriptome consists of the entire set of transcripts that are expressed within a cell or organism at a particular developmental stage or under various environmental conditions. There are various technologies for assaying the transcriptome including hybridization-based microarrays and RNA sequencing. Microarrays have been used extensively to quantify the transcript abundance of grape cells, organs and tissues. Here we provide a practical guide on how to analyze microarray data using a study based on the Affymetrix GeneChip ® Vitis vinifera genome array. Microarray studies have proven to be very powerful for the elucidation of molecular response networks and physiological processes. In this chapter, we outline the steps required to process and analyze mRNA expression data. The first step is to check both microarray and data quality. The second step is to remove array and data outliers and reduce the variability of the data with cleansing and normalization techniques. The third step is to perform statistical tests to identify sets of transcripts that are differentially expressed among conditions at a chosen level of statistical significance. The fourth step is to evaluate these sets of significant transcripts using functional categorization and molecular maps. Such datasets can be compared or integrated with proteomic and metabolomic datasets using a systems biology approach to increase the robustness of the conclusions. From data analysis performed in these ways, hypotheses can be generated for further experimentation and validation, which constitutes the fifth, final, and likely most important step.

22.1 Introduction

The study of global gene expression (transcriptomics) is part of a repertoire of methods used in functional genomics approaches today and is providing a powerful way to understand and compare the “holistic” responses of organisms to their environment (Trewavas 2006, Wang et al. 2009). The transcriptome consists of the entire set of transcripts that are expressed within a cell or organism at a particular developmental stage or under various environmental conditions. There are various technologies for assaying the transcriptome including hybridization-based microarrays and RNA sequencing (Trewavas 2006, Wang et al. 2009). Microarrays have been used extensively to quantify the transcript abundance of grape cells, organs and tissues (Terrier et al. 2005, Waters et al. 2005, Espinoza et al. 2006, Waters et al. 2006, Cramer et al. 2007, Deluc et al. 2007, Fernandez et al. 2007, Grimplet et al. 2007, Pilati et al. 2007, Tattersall et al. 2007, Chervin et al. 2008, Figueiredo et al. 2008, Gatto et al. 2008, Lund et al. 2008, Deluc et al. 2009, Mathiason et al. 2009). Here we provide a practical guide on how to analyze microarray data using a study based on the Affymetrix GeneChip ® Vitis vinifera genome array.

22.2 A Long-Term Growth Experiment as an Example for Microarray Data Analysis

A long-term experiment was designed and conducted to determine the impact of water deficit and salt stress on grapevine physiology and molecular profiles (Cramer et al. 2007). Salt stress, water-deficit stress, and no stress (control) were applied to randomly selected vines for a period of 16 days. Cabernet Sauvignon shoots with and without stress were harvested every 4 days (Day 4, 8, 12 and 16), and shoots without stress were harvested at Day 0. The shoot length and midday stem water potential were recorded every two days as a measure of water-deficit stress. The water potential of plants of both the water-deficit and salt-stressed populations was kept nearly identical. This was accomplished by monitoring slight differences in the stem water potential between the two stressed populations, using data from a previous experiment, and adjusting on a daily basis the electrical conductivity of the salt application to attain water potentials similar to those measured in water-deficit-treated vines. Microarray and quantitative RT-PCR transcript profiling were used to define genes and metabolic pathways in Vitis vinifera cv. Cabernet Sauvignon with common or divergent responses to long-term (16 days) water-deficit stress and isoosmotic salinity stress.

22.3 Long-Term Growth Experimental Design

The experiment followed a completely randomized factorial design consisting of 3 × 4 (treatment × time) conditions, with six to eight individually potted plants per time point and treatment. For each of the experimental conditions, two shoot tips were pooled together to form one of three biological replicates. As an additional quality control measure, six technical microarray replicates were run on one experimental condition (control condition of Day 16) to assess microarray data quality and data reproducibility.

22.4 Vitis GeneChip ® Design

The Affymetrix GeneChip ® Vitis genome array was developed by Affymetrix and the UNR Vitis research group in 2004. The array included 14,700 Unigenes, for which sequences were selected from GenBank ®, dbEST, and RefSeq. Sequence clusters were generated from the UniGene database, Build 7, October 2003. Each sequence was represented by sixteen oligonucleotide probe pairs, each containing a Perfect Match and a Mismatch. Detailed descriptions of probesets can be found elsewhere (Affymetrix 2002a). A number of standard control features were included on the array. A supplemental set of six control features was selected as an additional quality control measure. The extra six controls were placed in a tile that was spotted on the GeneChip ® in eight separate instances (Fig. 22.1a). One copy of each control feature's probeset was also spotted randomly on the array. This provided an additional mechanism to verify quality control with respect to spatial variation and technical reproducibility. Three of these six controls were positive controls (actin, GAPDH, and elongation factor 1 alpha); the other three were negative controls (aphA (kanamycin resistance gene), beta-lactamase gene, and beta-glucuronidase). Actin, GAPDH, and elongation factor 1 alpha are typically considered housekeeping genes in that they generally do not exhibit large variations in mRNA abundance under many experimental conditions. Additionally, these genes are known to be expressed in many conditions, and thus also act as positive controls. AphA, the beta-lactamase gene, and beta-glucuronidase are eubacterial genes that are not expected to be expressed in the genomes of vascular plant species, and thus were chosen as negative controls.

22.5 Microarray Quality Assessment

To assess the quality and reproducibility of the microarray data prior to performing any analysis procedures, a number of steps should be performed. The following steps are those recommended by Affymetrix and are cited directly from the Affymetrix GeneChip ® Data Analysis Fundamentals Manual (Affymetrix 2002a).

(1) 260/280 Absorbance Readings. To ensure that the highest quality RNA is hybridized to the gene expression arrays, users should run the initial total RNA on an agarose gel or an Agilent Technologies 2100 Bioanalyzer to examine the integrity of ribosomal RNA bands. Non-distinct ribosomal RNA bands suggest possible RNA degradation. The 260/280 absorbance readings should be measured for total RNA, and should be within the range between 1.8 and 2.1. Ratios below 1.8 might be a sign of possible protein contamination, whereas ratios above 2.1 might suggest degraded RNA, truncated cRNA transcripts, or an excess of free nucleotides.

(2) Array Background Levels. The average background levels of each array are provided by the Affymetrix GeneChip ® Operating Software (GCOS) in the Expression Report files. When arrays are run on 10% PMT scanner settings, background levels should lie between 20 and 100, and should be consistent across all arrays in the study.

(3) Noise Levels. The noise levels of each array are provided in the Expression Report files and are labeled as “RawQ”. RawQ values measure the pixel-to-pixel variation on each array. These levels should be consistent across all arrays, and any array deviating grossly from the mean noise level should be examined closely.

Fig. 22.1 Image examples of the Affymetrix GeneChip ® Vitis vinifera genome array. a (left) shows the placement of the eight replicated tilings of control probesets (green), and the placement of the tile consisting of the hybridization controls (orange) on the array. Remaining control probesets are scattered about the array (single cells in red and green). b (right) is the image of a Vitis microarray with spatial variation likely caused by a fiber

(4) Global Normalization and Scaling Factors. These factors are provided in the Expression Report files, and should be consistent across all arrays. More than three-fold differences between array scaling factors might indicate notable array variability or sample degradation.

(5) Present Call Rates. The Present Call rates should be uniform across all arrays, with even more stringent uniformity across replicated samples. Arrays with a notably low percentage of Present Calls might have poor sample quality, and should be examined closely with respect to all other quality control metrics.

(6) Microarray Images. All microarray images should be checked for scratches, smeared regions stemming from fibers or air bubbles, exceptionally high or low overall expression levels, and other general spatial variation. Images are provided by the Affymetrix GCOS as jpeg files.

(7) Hybridization Controls. Affymetrix places four hybridization controls on every GeneChip ®. The transcripts bioB, bioC, and bioD represent genes in the biotin synthesis pathway of E. coli, and cre is the recombinase gene of bacteriophage P1; the four are added to the hybridization cocktail at increasing concentrations of 1.5, 5, 25, and 100 pM, respectively. BioB should have Present Calls in at least 50% of the arrays in the experiment; bioC, bioD, and cre should have Present Calls in 100% of the arrays. The signal values for bioC, bioD and cre should be increasing, respectively. Absent calls or low signal values of these control probes might indicate a problem with the hybridization reaction or with the washing and staining steps. In this case, it should be verified that the hybridization cocktail was made properly, that the recommended temperature for hybridization and the correct fluidics protocol were used, and that the SAPE staining solution did not deteriorate. See http://www.affymetrix.com/support/help/faqs/ge_assays/faq_19.jsp for more details.

(8) Poly-A Controls. Each Affymetrix GeneChip ® includes four poly-A controls used to examine the target labeling process. The controls lys, phe, thr, and dap are B. subtilis genes, which should be present across all arrays, and have increasing signal values, respectively.

(9) Internal Controls. The two internal control genes GAPDH and actin are included on the array as housekeeping genes. The 3'/5' ratio of each measures RNA sample and assay quality. More details can be found at http://www.affymetrix.com/support/help/faqs/ge_assays/faq_17.jsp. Affymetrix guidelines suggest that the 3'/5' ratio be less than three in general. Larger ratios might indicate RNA degradation. More details can also be found in the Affymetrix GeneChip ® Data Analysis Fundamentals Manual (Affymetrix 2002a). Any grossly outlying ratios across the group of arrays studied should be inspected more closely.

Step 1 must be verified by the personnel in your microarray facility. Steps 2–5 are verified by a quick inspection of the Expression Report files (.rpt) generated by the Affymetrix processing software (GCOS) in your microarray facility. Some quality control metrics can be generated using the simpleaffy package from Bioconductor. Step 6 is confirmed by viewing the image files generated by GCOS or by generating the images using the Bioconductor package affy. Steps 7–9 are validated by examining the expression levels of the control probes after normalization (see the section below for normalization methods). Additional details regarding these and other quality control protocols can be found in the Affymetrix Expression Analysis Data Analysis Fundamentals Manual (Affymetrix 2002a). Several functions in the R programming language are available to perform some of these Affymetrix quality control checks (e.g. simpleaffy, computeRawQ, yaqcaffy, AffyExpress), and are freely available from the Bioconductor site [http://www.bioconductor.org/]. Please see the website http://bioinformatics.unr.edu/vitis to download some simple R scripts that perform these functions.
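The numeric thresholds quoted in the steps above can be gathered into a simple screening routine. The following is a minimal sketch in Python rather than the R scripts referred to in the text; the record layout, field names, and example values are hypothetical, and only the cut-offs (1.8 to 2.1, 20 to 100, three-fold, and a 3'/5' ratio below three) are taken from the guidelines above.

```python
from dataclasses import dataclass


@dataclass
class ArrayQC:
    """Hypothetical per-array QC record; the field names are illustrative only."""
    name: str
    a260_280: float        # absorbance ratio of the total RNA
    background: float      # average background from the Expression Report
    scaling_factor: float  # global scaling factor from the Expression Report
    ratio_3_5: float       # 3'/5' ratio of an internal control such as GAPDH


def flag_array(qc):
    """Return a list of warnings for one array, using the thresholds quoted above."""
    warnings = []
    if not 1.8 <= qc.a260_280 <= 2.1:
        warnings.append("A260/280 outside 1.8-2.1")
    if not 20 <= qc.background <= 100:
        warnings.append("background outside 20-100")
    if qc.ratio_3_5 >= 3:
        warnings.append("3'/5' ratio not below 3")
    return warnings


def scaling_factor_spread(arrays):
    """Fold difference between the largest and smallest scaling factor."""
    factors = [a.scaling_factor for a in arrays]
    return max(factors) / min(factors)


# Made-up example values, not data from the study.
arrays = [
    ArrayQC("chip_A", 2.0, 84.0, 0.18, 1.1),
    ArrayQC("chip_B", 1.7, 140.0, 0.21, 1.2),
]
report = {a.name: flag_array(a) for a in arrays}
spread = scaling_factor_spread(arrays)  # values above 3 would warrant a closer look
```

A flagged array is a prompt for closer inspection against the other metrics, not an automatic exclusion.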

22.6 Microarray Quality Assessment of the Long-Term Growth Experiment

As this was the first study using the GeneChip ® Vitis Genome Array, several quality control measures were instated within the experimental design. For each experimental treatment (water-deficit, salt, and control) at each of four time points (Day 4, 8, 12, and 16), a set of biological triplicates was used. Additionally, six technical microarray replicates were run at Day 16 under the control treatment. Expression data were subjected to the quality control steps as described above:

(1) After extraction, RNA quality was assessed by A260/280 absorbance ratios and by the Agilent Bioanalyzer. Samples were found to be of high quality, and identical to results reported by Tattersall et al. (2005).

(2) Average array background levels ranged from 53 to 140, with a mean of 84.4 and a standard deviation of 18.4. A Grubbs' test for outliers (Grubbs 1969) was performed on the 44 background levels, and one array with a background level of 140 proved to be a statistically significant outlier (p < 0.05). This array was excluded from further study. No other array was found to have a significant outlying background value. For experiments with less than 30 arrays, a Grubbs' test can be performed using the R function grubbs.test from the package outliers. For larger experiments, formulas and tables are available for the Grubbs' test (Grubbs 1969).

(3) RawQ noise levels fell between 1.7 and 4.4 with a mean value of 2.6 and standard deviation of 0.5. A Grubbs' test for outliers was performed on the 44 noise levels, and only the microarray exhibiting the highest noise level (4.4) was shown to be a significant outlier, and excluded from further analyses. This array was the same array excluded for high background levels in Step 2 above.

(4) Scaling factors were consistent across all arrays, with a 2-fold difference between the minimum and the maximum value. The mean scaling factor was 0.18, with a standard deviation of 0.032. Scaling factors were computed using the R function qc in the simpleaffy package.

(5) Present Call rates ranged from 74 to 79%. The overall average Present Call rate was 76%, with a standard deviation of 1.12%. Present Call rates across technical and biological replicates exhibited notably lower standard deviations of 0.74 and 0.69%, respectively.

(6) Upon close inspection of all microarray images, one microarray showed spatial variation likely caused by a fiber (Fig. 22.1b), and was excluded from further study.

(7) All four hybridization controls were present on 100% of the arrays, and the normalized signals were increasing with respect to the order bioC, bioD, and cre.

(8) The behavior of all poly-A controls conformed to the desired recommendations, with the exception of one non-present call on one array for one of the controls, a negligible event.

(9) The internal controls also performed as recommended: 3'/5' actin ratios were less than 1.01; GAPDH ratios were consistently below 1.2.
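The Grubbs' test used in Steps 2 and 3 is built on the statistic G = max|x_i − mean| / s, where s is the sample standard deviation. The sketch below computes only this statistic, in Python with the standard library and on made-up background values; deciding significance still requires a critical value from the published tables or from a function such as grubbs.test in R. If the mean and standard deviation reported in Step 2 include all 44 arrays, the array with a background of 140 would give roughly G = (140 − 84.4) / 18.4 ≈ 3.02.

```python
import statistics


def grubbs_statistic(values):
    """Return (G, index): G = max |x_i - mean| / s, and the position of that value."""
    mean = statistics.mean(values)
    sd = statistics.stdev(values)  # sample standard deviation (n - 1 denominator)
    deviations = [abs(v - mean) for v in values]
    idx = max(range(len(values)), key=deviations.__getitem__)
    return deviations[idx] / sd, idx


# Hypothetical background levels for ten arrays (not the study data).
background = [78.0, 81.0, 85.0, 79.0, 90.0, 83.0, 140.0, 76.0, 88.0, 82.0]
g, idx = grubbs_statistic(background)
# g is then compared with the tabulated critical value for n = 10.
```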

In addition to the control steps recommended by Affymetrix as outlined above, the trends in RNA degradation between the 5' end and the 3' end in each probeset were examined. Probes in each probeset are numbered from 1 to 16 from the 5' end to the 3' end for each feature on the array. The trends were computed for each microarray by computing the mean raw Perfect Match expression value for each probe number across all array features. Degradation curves should prove very consistent across all arrays in the experiment, which was the case for the long-term stress study (Fig. 22.2a).
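The degradation trend described here is a per-position mean over probesets. As a minimal illustration in Python, assuming the raw Perfect Match intensities of one array are already arranged as one list per probeset ordered from the 5' end to the 3' end (the toy data below are invented and use four positions rather than sixteen):

```python
from statistics import mean


def degradation_curve(pm_by_probeset):
    """Mean raw PM intensity at each probe position, averaged over all probesets.

    pm_by_probeset: list of equal-length lists, one per probeset, ordered 5' -> 3'.
    """
    n_positions = len(pm_by_probeset[0])
    return [mean(ps[pos] for ps in pm_by_probeset) for pos in range(n_positions)]


# Toy array: 3 probesets x 4 probe positions (the Vitis array has 16 per probeset).
toy_array = [
    [100.0, 120.0, 140.0, 160.0],
    [200.0, 210.0, 220.0, 230.0],
    [60.0, 90.0, 120.0, 150.0],
]
curve = degradation_curve(toy_array)
```

Computing one such curve per array and overlaying them gives a plot of the kind shown in Fig. 22.2a; arrays whose curve has a markedly different slope deserve a closer look.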

22.7 Normalization of Microarray Data

Data in the long-term experiment were processed using the R programming language (version 2.9.0). The R language is publicly available at no cost for all operating platforms [http://cran.r-project.org/]. Bioconductor is an Open Source software development project that features many useful, freely available R programs for the analysis of microarray (and other) data [http://www.bioconductor.org/].

Fig. 22.2 Methods for quality assessment of arrays. a (left) shows the RNA degradation curves of each array. b (right) presents the distributions of the normalized expression values (log base 2) of all microarrays

The long-term microarray data were normalized using the Robust Multi-Array Average (RMA) method (Irizarry et al. 2003a). Other popular normalization methods for Affymetrix data include the Affymetrix MAS 5.0 algorithm (Affymetrix 2002a), and dChip (Li and Hung Wong 2001). The RMA method has been shown to perform better than both MAS 5.0 and dChip in terms of precision, consistency in fold-change estimates, and detection of differential expression (Irizarry et al. 2003b). More detailed discussions regarding normalization methods can be found in a number of references (Bolstad et al. 2003, Irizarry et al. 2003a). Raw intensity values of the 42 arrays passing quality control standards were processed and normalized by RMA using the Bioconductor R package affy (Gautier et al. 2004). Upon application of RMA, normalized and log-transformed (base 2) expression values of all 42 arrays exhibited very similar distributions (Fig. 22.2b).
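RMA combines background correction, quantile normalization across arrays, and a robust probeset summarization. The sketch below illustrates only the quantile normalization idea, which is why the per-array distributions end up so similar; it is written in plain Python on invented numbers, breaks ties by original order for simplicity, and is not a substitute for the affy implementation used in the study.

```python
def quantile_normalize(columns):
    """Give every array (column) the same empirical distribution.

    columns: list of equal-length lists, one list of intensities per array.
    Ties are broken by original order, a simplification of production code.
    """
    n = len(columns[0])
    sorted_cols = [sorted(col) for col in columns]
    # Reference distribution: mean of the k-th smallest value across arrays.
    reference = [sum(col[k] for col in sorted_cols) / len(columns) for k in range(n)]
    normalized = []
    for col in columns:
        order = sorted(range(n), key=col.__getitem__)  # indices, smallest value first
        out = [0.0] * n
        for rank, i in enumerate(order):
            out[i] = reference[rank]  # replace each value by the reference quantile
        normalized.append(out)
    return normalized


# Three toy arrays with four probes each (made-up intensities).
arrays = [
    [5.0, 2.0, 3.0, 4.0],
    [4.0, 1.0, 4.5, 2.0],
    [3.0, 4.0, 6.0, 8.0],
]
normalized = quantile_normalize(arrays)
```

After this step each array contains exactly the same set of values, assigned according to the rank of each probe within its own array.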

Normalized expression behavior of the six replicated control features was examined first. The coefficient of variation (CV) of expression levels across the nine separate tiles was measured for each of these controls, for each array. The average CV across tiles for the 42 arrays of actin, GAPDH, elongation factor 1 alpha, aphA (kanamycin resistance gene), beta-lactamase gene, and beta-glucuronidase was 1.0, 1.0, 0.8, 10, 6.1, and 2.8%, respectively. This indicated that on average, the variation across the nine separate placements of each control feature was minimal, consistent with the lack of spatial variation portrayed in the array images. Profiles of the control features are graphically presented in Fig. 22.3, which shows the small variation among the nine instances within each probeset, with the aphA gene exhibiting the greatest variation. Expression values of the positive controls are of notably greater magnitude than those of the negative controls. Also of interest is the variation of GAPDH across all experimental conditions, which is not consistent with housekeeping behavior, but is consistent with the other Vitis microarray studies (VV1, VV2, VV3, VV4, VV10) at www.plexdb.org. In these experiments, GAPDH expression levels range from 6 to 12.5 on the normalized log2 scale.

Data of the 16,436 non-control probesets on the GeneChip ® Vitis Genome Array were examined with respect to Present Call rates, as computed by the Affymetrix MAS 5.0 presence/absence detection algorithm (Affymetrix 2002b). The relative intensities between the Perfect Match and Mismatch probes in each of the 16 pairs were combined and used to generate a detection call for each probeset: P (Present), M (Marginal) or A (Absent), where Present means that the Perfect Match probes show stronger intensity than the Mismatch probes. Another way to think about this is that a Present Call indicates that the signal strength is greater than the background strength. If a probeset has no signal strength, it is difficult to test its relative signal across different conditions. Therefore, the 1,981 probesets (12%) with no Present Calls across the 42 arrays were excluded from any further analysis. This left 14,455 non-control probesets.
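The filter described here keeps a probeset if it receives at least one Present call on any array, which accounts for the arithmetic 16,436 − 1,981 = 14,455. A minimal Python sketch, assuming the detection calls have already been exported as one list of 'P', 'M', or 'A' per probeset (the identifiers and calls below are invented):

```python
def filter_absent_probesets(calls):
    """Keep probesets with at least one Present ('P') call on any array.

    calls: dict mapping probeset ID -> list of detection calls ('P', 'M', 'A').
    """
    return {pid: c for pid, c in calls.items() if "P" in c}


# Hypothetical probeset IDs with calls on three arrays.
calls = {
    "probeset_1": ["P", "P", "M"],
    "probeset_2": ["A", "A", "A"],
    "probeset_3": ["A", "M", "A"],
    "probeset_4": ["A", "P", "A"],
}
kept = filter_absent_probesets(calls)
```

Note that under this rule a probeset with only Marginal and Absent calls, such as probeset_3 above, is dropped along with those that are Absent everywhere.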

Fig. 22.3 Normalized expression profiles of the six repeated control features across all 42 arrays. Arrays are represented by the horizontal axis, log-transformed (base 2) expression values are represented by the vertical axis

The normalized data were evaluated for reproducibility. Pair-wise Pearson correlation coefficients were measured across the six technical replicate arrays, all of which were greater than 0.999, and higher than those computed by Baugh et al. (2001) using the Affymetrix C. elegans array and those of Redman et al. (2004) using the Affymetrix Arabidopsis array. Correlation coefficients across biological replicates were also computed and found to be very high: Spearman coefficients ranged from 0.967 to 0.998, and Pearson coefficients ranged between 0.970 and 0.999. For each feature on the array, the coefficient of variation was computed across the six technical replicate arrays: the average technical CV across all probesets was 3.9%, denoting very small amounts of technical variation. Expression values of the six technical replicates were combined into the average for all downstream analyses.
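The pair-wise reproducibility check can be illustrated with a small, self-contained Python sketch. The Pearson coefficient is written out explicitly to keep the example dependency-free, and the three short "replicate" vectors are invented log2 values, not data from the study.

```python
import math
from itertools import combinations


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def pairwise_correlations(arrays):
    """Correlation for every unordered pair of arrays, keyed by (i, j)."""
    return {
        (i, j): pearson(arrays[i], arrays[j])
        for i, j in combinations(range(len(arrays)), 2)
    }


# Three hypothetical replicate arrays, five probesets each (log2 scale).
replicates = [
    [6.1, 8.4, 10.2, 12.0, 7.3],
    [6.0, 8.5, 10.1, 12.1, 7.4],
    [6.2, 8.3, 10.3, 11.9, 7.2],
]
correlations = pairwise_correlations(replicates)
lowest = min(correlations.values())  # the weakest pair is the one to inspect
```

Six replicate arrays give 15 pairs; reporting the minimum, as the text does with "all greater than 0.999", summarizes the worst case.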

To ensure strict reproducibility standards, all sets of triplicate expression measures were examined for outliers and high levels of variation. The CV of the 187,915 sets of triplicates (187,915 = 14,455 probesets × 13 conditions) was computed on the antilog of normalized expression values; the average triplicate CV was 8.6%. The distribution of the triplicate CV measures can be seen in Fig. 22.4a. Figure 22.4b shows that larger coefficients of variation are generally associated with lower expression measures. These CV measures are exceptionally low compared to within-group CV computations of 25% using the HGU95A GeneChip ® and the MAS 5.0 normalization method (Welle et al. 2002, Affymetrix 2002b); 22 and 27% using the arrays U95Av2 and U133A, respectively, and the MAS 5.1 normalization method (Affymetrix 2002c, Daly et al. 2005); and 5–15% for technical replicates and greater than 20% for biological and technical replicates of the Murine Genome U74Av2 array using RMA (Han et al. 2004). Any set of triplicates in which one of the measures lay more than 1.14 standard deviations from the triplicate mean (the maximum possible for three measures is 2/√3 ≈ 1.1547; 10% of all expression measures exhibited one outlying element with this great a deviation), and in which the CV of the triplicate set was greater than 0.25 (3% of all triplicates exhibited this large a CV), was scrutinized. These thresholds were chosen to represent the 1% most variant triplicate sets having one deviating outlier. If one single measure in a triplicated set was near 1.1547, this indicated that the remaining two measures were nearly identical, and that the third measure was at its maximum outlying capacity; this one measure was therefore removed. This procedure left two replicates within the set, of which the mean was used for subsequent analyses. Only 1% of all measurements were excluded using this rule (2,066 single measures). Additionally, any remaining triplicate set having a CV greater than 0.5 was removed. This included only 414 triplicate sets (0.26% of all triplicates), and reduced the mean coefficient of variation of all triplicates to a very low level of 7.3%. We found that these thresholds allowed us to identify gross outlying individual measurements within triplicates.

Fig. 22.4 Estimation of the variation of expression measures on the arrays. a (left) shows the distribution of the CV values for all 187,915 triplicate sets of expression measures. b (right) shows the relationship between mean antilog triplicate expression values on the horizontal axis, and the associated CV on the vertical axis
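The single-outlier rule for triplicates can be sketched as follows. This is one reading of the procedure described above, written in Python on invented values: the function name and argument names are hypothetical, the thresholds 1.14 and 0.25 are those quoted in the text, and the second rule (discarding remaining sets with a CV above 0.5) is not implemented. The bound of 1.1547 follows from the general fact that, with a sample standard deviation, no value in a sample of n can lie more than (n − 1)/√n standard deviations from the mean, which is 2/√3 for n = 3.

```python
import math
import statistics


def screen_triplicate(values, z_cut=1.14, cv_cut=0.25):
    """Return the measures to average after applying the single-outlier rule.

    values: three expression measures on the original (antilog) scale.
    """
    mean = statistics.mean(values)
    sd = statistics.stdev(values)  # sample standard deviation
    if sd == 0:
        return list(values)
    cv = sd / mean
    z = [abs(v - mean) / sd for v in values]  # distance from the mean in SD units
    worst = max(range(len(values)), key=z.__getitem__)
    if z[worst] > z_cut and cv > cv_cut:
        return [v for i, v in enumerate(values) if i != worst]
    return list(values)


max_z = (3 - 1) / math.sqrt(3)  # largest possible |z| in a sample of three

tight = screen_triplicate([100.0, 104.0, 98.0])     # low CV: all three kept
outlier = screen_triplicate([100.0, 101.0, 300.0])  # one deviating measure removed
```

Requiring both conditions matters: any triplicate with two nearly equal values has one measure close to the 1.1547 bound, so the CV threshold is what separates a genuine outlier from ordinary small scatter.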

22.8 Data Organization

Principal component analysis (PCA) was used to simplify and define associations between stress conditions and temporal stages. The first two principal components represented 97.9 and 1.6% of the overall variation in the data, respectively, and presented a distinct separation between Day 16 stress conditions and all other conditions, as shown in Fig. 22.5. To examine genes differentially expressed among stress conditions at any of the four time points, an ANOVA was performed on the RMA expression values. We first fit a model with 12 coefficients corresponding to all conditions excluding the Day 0 measurements and extracted comparisons of interest by using contrasts. Day 0 measurements consisted of only the non-stress condition, and created an unbalanced design, making an ANOVA application difficult to assess properly. The Bioconductor package limma was used for ANOVA methods (Smyth 2005). A multiple testing adjustment (Benjamini and Hochberg 1995) was applied to the p-values of the t-statistics of each contrast to control the false discovery rate. Differentially expressed genes with adjusted p-value < 0.05 were extracted for further inspection. Most of the differences in expression were exhibited between Day 16 stress and control conditions, consistent with the PCA results. Lists of significantly differentially expressed probesets among stress and control conditions at all days can be found at http://bioinformatics.unr.edu/vitis.
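The Benjamini and Hochberg adjustment multiplies the p-value of rank k (out of m, in ascending order) by m/k and then enforces monotonicity from the largest p-value downward. The following Python sketch shows the step-up procedure on five invented p-values; in the study itself the adjustment was carried out within limma, so this is only an illustration of the arithmetic.

```python
def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values in the original order."""
    m = len(pvalues)
    order = sorted(range(m), key=pvalues.__getitem__)  # indices, smallest p first
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values never decrease with rank.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Hypothetical raw p-values for five probesets in one contrast.
pvals = [0.001, 0.06, 0.02, 0.20, 0.008]
adj = benjamini_hochberg(pvals)
significant = [i for i, q in enumerate(adj) if q < 0.05]
```

Here the probeset with a raw p-value of 0.02 remains below 0.05 after adjustment (0.02 × 5/3 ≈ 0.033), whereas the one at 0.06 does not.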

Fig. 22.5 Principal component analysis of the normalized and averaged microarray data. The horizontal axis represents the first principal component, which explains 98% of total variation of the data; the second component is represented by the vertical axis, and represents 1.6% of the total data variation

22.9 Functional Annotation and Categorization

Data were then organized by assigning functions in order to understand the biological significance of the observed changes in mRNA abundance. Functional annotation and categorization can be complicated as transcripts might have multiple functions at many different levels of organization. This problem is analogous to the question of which airport (hub) has the most traffic. What kind of traffic? Is it people, helicopters, planes or jets? Is it in New York, Paris or Beijing, and which geographic category level should be used? Is it at the city, state, country or continent level? To extend this analysis to plants, is RUBISCO categorized as an enzyme or in photosynthesis or energy or carbon or sugar metabolism and in what organelle, cell, tissue or organ (i.e. stroma, chloroplast, mesophyll, leaf or shoot)? Such gene ontology (GO) assignments (Ashburner et al. 2000) have been made and can be adapted to grapevine. However, with multiple assignments how does one quantify categories of transcript abundance? The assignment of one probeset to multiple categories is problematic, as this results in the associated transcript being counted more than once in different functional categories. For this study, annotation from the Munich Information Center for Protein Sequences (MIPS, ver. 2.1) catalog of top Arabidopsis BLAST hits (Ruepp et al. 2004, Schoof et al. 2004) was used, as we thought it gave the best and most complete description of metabolic pathways for plants at the time we used it. Each probeset on the Affymetrix Vitis array was manually assigned to exactly one MIPS category (Deluc et al. 2009). Associations were arbitrarily assigned based on the perspective of the analyst, and can be changed upon a different perspective. Our functional category assignments are available at the grape annotation page of PLEXdb, which is a public resource for gene expression for plants and plant pathogens. The database can be found at http://www.plexdb.org and the Vitis annotation at http://www.plexdb.org/modules/PD_probeset/annotation.php?genechip=Grape. Please note that these assignments were not meant to be permanent and should be updated periodically as the annotation status of the Vitis genome improves. Others have also recognized the multiple-category annotation problem and have attempted to simplify GO assignments. An example is the BiNGO plugin (Maere et al. 2005) with GOSlim ontologies [www.geneontology.org/GO.slims.shtml] for the Cytoscape visualization software platform (Maere et al. 2005), which can be used to discover over-representation of functional categories in a dataset. It might be wise to use both approaches in order to get different perspectives of your dataset.

22.10 Data Visualization and Integration

An important challenge of microarray studies is to provide biological meaning to the plethora of data produced by annotating genes and integrating them within their biological context. Several visualization tools for molecular pathways are available including AraCyc (Maere et al. 2005), MapMan (Thimm et al. 2004) and VitisNet (Grimplet et al. 2009). In contrast to the general models for Arabidopsis, at this time only VitisNet [http://vitis-dormancy.sdstate.org] has been manually annotated extensively and specifically for visualizing microarray expression data on grapevine molecular pathways. Recently, a MapMan ontology has been developed for Vitis including Vitis-specific pathways for carotenoids, terpenoids and phenylpropanoids (Rotter et al. 2009). The sequences from the Vitis genome sequencing projects (Jaillon et al. 2007, Velasco et al. 2007) and ESTs [www.ncbi.nlm.nih.gov] from the Vitis genus have been combined and the resulting 39,424 unique sequences have been manually annotated and mapped to molecular networks. To date, 13,145 genes have been assigned to 219 networks, including networks for metabolic, hormone, transport, and transcriptional pathways. More will be added in the future. Only 4,755 unique Affymetrix probesets (and an additional 2,110 redundant probesets) can be matched to the grapevine genome. This will be greatly improved when a whole genome array for grapevine is produced in the near future. Proteins and metabolites can also be visualized in VitisNet. The VitisNet tool incorporates Cytoscape, a versatile and customizable visualization software platform for molecular and interaction networks (Maere et al. 2005).

We have used the VV2 data at www.plexdb.org here to demonstrate network visualization using VitisNet. Networks can be downloaded from http://vitis-dormancy.sdstate.org and uploaded to Cytoscape. To begin to understand and identify ABA regulation in grapevine (Cramer 2010), a selection of probesets was created by taking Arabidopsis genes identified to be regulated by ABA (Huang et al. 2007, Matsui et al. 2008) and matching them with grapevine genes that were significantly induced by water deficit (Cramer et al. 2007, Tattersall et al. 2007, Cramer 2010). The genes have been assigned functional categories and were initially visualized and quantitatively analyzed with bar graphs (Fig. 22.6). This type of analysis allowed a broad visualization of the data. One change in functional categories was notable: water deficit substantially decreased the number of transcripts involved with protein synthesis. The large decrease in protein synthesis transcripts was a rather striking result and consistent with an earlier proteomic study (Vincent et al. 2007), in which a decrease in proteins involved in protein synthesis was highly correlated with the inhibition of growth.
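The bar-graph summary described above amounts to counting transcripts per functional category and expressing each count as a percent of the transcript set. The following Python sketch illustrates that calculation under stated assumptions: the category names, the direction labels and the toy data are hypothetical and are not taken from the VV2 dataset, and the function name `percent_by_category` is ours.

```python
from collections import Counter


def percent_by_category(categories):
    """Return {category: percent of the transcript set}.

    Categories are ordered from the most to the least frequent.
    An empty input yields an empty dictionary.
    """
    counts = Counter(categories)
    total = sum(counts.values())
    return {cat: 100.0 * n / total for cat, n in counts.most_common()}


# Hypothetical toy data: (functional category, direction of change
# in transcript abundance under water deficit).
transcripts = [
    ("protein synthesis", "decreased"),
    ("protein synthesis", "decreased"),
    ("transport", "increased"),
    ("metabolism", "increased"),
    ("metabolism", "decreased"),
    ("metabolism", "increased"),
]

# Whole set, then the increased and decreased subsets separately.
all_pct = percent_by_category(cat for cat, _ in transcripts)
increased_pct = percent_by_category(
    cat for cat, direction in transcripts if direction == "increased"
)
decreased_pct = percent_by_category(
    cat for cat, direction in transcripts if direction == "decreased"
)

for name, table in [("all", all_pct),
                    ("increased", increased_pct),
                    ("decreased", decreased_pct)]:
    print(name, {cat: round(pct, 1) for cat, pct in table.items()})
```

Each percent is relative to its own subset, so the three tables are not directly additive; this mirrors the way each set of bars in a categorization plot refers to its own transcript set.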

To extend the analysis with more detail and meaning, probesets were mapped in VitisNet to the transcript box items corresponding to the grapevine genes using the Cytoscape software. A mapping text file was created. Its first column contained a grapevine gene obtained from the genome sequencing project (Jaillon et al. 2007), and the following columns contained the corresponding Affymetrix probesets. The probeset with the highest average expression across all grapevine microarray experiments in PLEXdb [www.plexdb.org] was listed as the first probeset in the text file. In the case of multiple probesets per gene, additional probesets were listed in the following columns. The mapping file was loaded through the “ImportAnnotationFile” window in Cytoscape, which created a probeset attribute for every transcript associated with a probeset. The expression data were then assigned to the transcript box item by linking them to the probeset attribute through the “ImportAnnotationFile” window in Cytoscape. The probesets differentially expressed at Day 16 under water stress were loaded on the networks. Among them, 4,018 can be mapped on transcripts present on the networks (3,717 unique and 1,301 redundant transcripts). Once the expression data were loaded, nodes corresponding to transcripts with differential abundance were colored according to their respective abundance with the VizMapper tool in Cytoscape. A subset of the genes induced by water deficit and ABA in the ABA signaling pathway can be visualized in Fig. 22.7.
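The layout of the mapping file described above (gene in the first column, probesets in the following columns, the most highly expressed probeset first) can be sketched in a few lines of Python. This is only an illustration under assumptions: the gene and probeset identifiers and the expression values are hypothetical, the function names are ours, and the tab-delimited format is an assumption, since the exact file format used for VitisNet is not specified here.

```python
import csv
import io


def build_mapping_rows(gene_to_probesets, mean_expression):
    """Build one row per gene: [gene, probeset_1, probeset_2, ...].

    Probesets are ordered by decreasing average expression, so the
    most highly expressed probeset is listed first. Probesets with
    no expression value are placed last.
    """
    rows = []
    for gene in sorted(gene_to_probesets):
        probesets = sorted(
            gene_to_probesets[gene],
            key=lambda p: mean_expression.get(p, float("-inf")),
            reverse=True,
        )
        rows.append([gene] + probesets)
    return rows


def write_mapping(rows, handle):
    """Write rows as tab-delimited text (assumed format)."""
    writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
    writer.writerows(rows)


# Hypothetical identifiers and average expression values.
gene_to_probesets = {
    "GENE_B": ["probe_3"],
    "GENE_A": ["probe_1", "probe_2"],
}
mean_expression = {"probe_1": 5.2, "probe_2": 9.1, "probe_3": 7.0}

rows = build_mapping_rows(gene_to_probesets, mean_expression)

buffer = io.StringIO()
write_mapping(rows, buffer)
mapping_text = buffer.getvalue()
print(mapping_text, end="")
```

In practice the output would be written to a file on disk rather than to an in-memory buffer; the buffer is used here only to keep the sketch self-contained.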

A simple analysis of the ABA signaling pathway indicates that many, but not all, water deficit-induced genes are also induced by ABA. Other factors besides ABA might have a role (e.g. water potential or the level of stress-induced reactive oxygen species). This mapping effort serves to generate new hypotheses that need to be tested in carefully designed subsequent experiments. For example, one could separate water-deficit effects from ABA effects by conducting water deficit experiments with ABA mutants or applications of ABA to non-stressed vines. Although mapping of the molecular pathways is incomplete, mapping efforts will continue with more research and annotation progress.

Fig. 22.6 Functional categorization of a subset of grape transcripts responsive to water deficit and also linked with Arabidopsis genes that are affected by ABA. This subset of grape transcripts is referred to as WD-ABA transcripts. ABA-responsive genes in Arabidopsis were identified using the Arabidopsis tiling array dataset from Matsui et al. (2008) and the Arabidopsis genome array dataset from Huang et al. (2007). The initial subset of genes in grapes responsive to water deficit and which were also identified as ABA-responsive genes in Arabidopsis were functionally categorized (black bars) and then separated into those genes with increased (grey bars) or decreased (white bars) transcript abundance during water deficit. Each subset is displayed separately, with the subset of 2095 grape transcripts based on the tiling array data on the left and the subset of 644 grape transcripts based upon the genome array data on the right. Functional categories are organized along the horizontal axis from the highest to the lowest number of transcripts in a functional category (bottom to top). The percent of transcript set refers to the percent of total transcripts in that particular set of transcripts that were placed into that functional category. The tiling array data are from Matsui et al. (2008) and the genome array data are from Huang et al. (2007)

Fig. 22.7 The effect of water deficit on Cabernet Sauvignon shoot transcripts as visualized on the ABA signaling pathway in VitisNet. Changes in transcript abundance (decreased abundance to increased transcript abundance; a gradient of dark green to dark red, respectively) are displayed by colored boxes. White boxes indicate no changes or the transcripts were not detected. Circles denote ABA-induced transcripts in Arabidopsis. Arrows are colored to indicate the kind of reaction: (red) metabolic; (blue) enzymatic catalysis; (green) trigger; (yellow) inhibition; (dark green) unknown interaction

References

Affymetrix (2002a) GeneChip® Data Analysis Fundamentals Manual. Affymetrix, Santa Clara

Affymetrix (2002b) Affymetrix Microarray Suite 5.0 User’s Guide. Affymetrix, Santa Clara

Affymetrix (2002c) Affymetrix Microarray Suite 5.1 User’s Guide. Affymetrix, Santa Clara

Ashburner M, Ball CA, Blake JA, Botstein D, Butler H, Cherry JM, Davis AP, Dolinski K, Dwight SS, Eppig JT, Harris MA, Hill DP, Issel-Tarver L, Kasarskis A, Lewis S, Matese JC, Richardson JE, Ringwald M, Rubin GM, Sherlock G (2000) Gene ontology: tool for the unification of biology. The gene ontology consortium. Nat Genet 25:25–29

Baugh LR, Hill AA, Brown EL, Hunter CP (2001) Quantitative analysis of mRNA amplification by in vitro transcription. Nucleic Acids Res 29:E29

Benjamini Y, Hochberg Y (1995) Controlling the false discovery rate: a practical and powerful approach to multiple testing. J R Stat Soc Ser B 57:289–300

Bolstad BM, Irizarry RA, Astrand M, Speed TP (2003) A comparison of normalization methods for high density oligonucleotide array data based on variance and bias. Bioinformatics 19:185–193

Chervin C, Tira-Umphon A, Terrier N, Zouine M, Severac D, Roustan JP (2008) Stimulation of the grape berry expansion by ethylene and effects on related gene transcripts, over the ripening phase. Physiol Plant 134:534–546

Cramer GR (2010) Abiotic stress & plant responses from the whole vine to the genes. Aus J Grape Wine Res 16:86–93

Cramer GR, Ergul A, Grimplet J, Tillett RL, Tattersall EA, Bohlman MC, Vincent D, Sonderegger J, Evans J, Osborne C, Quilici D, Schlauch KA, Schooley DA, Cushman JC (2007) Water and salinity stress in grapevines: early and late changes in transcript and metabolite profiles. Funct Integr Genomics 7:111–134

Daly TM, Dumaual CM, Dotson CA, Farmen MW, Kadam SK, Hockett RD (2005) Precision profiling and components of variability analysis for Affymetrix microarray assays run in a clinical context. J Mol Diagn 7:404–412

Deluc LG, Grimplet J, Wheatley MD, Tillett RL, Quilici DR, Osborne C, Schooley DA, Schlauch KA, Cushman JC, Cramer GR (2007) Transcriptomic and metabolite analyses of Cabernet Sauvignon grape berry development. BMC Genomics 8:429

Deluc LG, Quilici DR, Decendit A, Grimplet J, Wheatley MD, Schlauch KA, Mérillon JM, Cushman JC, Cramer GR (2009) Water deficit alters differentially metabolic pathways affecting important flavor and quality traits in grape berries of Cabernet Sauvignon and Chardonnay. BMC Genomics 10:212

Espinoza C, Vega A, Medina C, Schlauch K, Cramer G, Arce-Johnson P (2006) Gene expression associated with compatible viral diseases in grapevine cultivars. Funct Integr Genomics 7:95–110

Fernandez L, Torregrosa L, Terrier N, Sreekantan L, Grimplet J, Davies C, Thomas MR, Romieu C, Ageorges A (2007) Identification of genes associated with flesh morphogenesis during grapevine fruit development. Plant Mol Biol 63:307–323

Figueiredo A, Fortes AM, Ferreira S, Sebastiana M, Choi YH, Sousa L, Acioli-Santos B, Pessoa F, Verpoorte R, Pais MS (2008) Transcriptional and metabolic profiling of grape (Vitis vinifera L.) leaves unravel possible innate resistance against pathogenic fungi. J Exp Bot 59:3371–3381

Gatto P, Vrhovsek U, Muth J, Segala C, Romualdi C, Fontana P, Pruefer D, Stefanini M, Moser C, Mattivi F, Velasco R (2008) Ripening and genotype control stilbene accumulation in healthy grapes. J Agric Food Chem 56:11773–11785

Gautier L, Cope L, Bolstad BM, Irizarry RA (2004) affy – analysis of Affymetrix GeneChip data at the probe level. Bioinformatics 20:307–315

Grimplet J, Cramer GR, Dickerson JA, Mathiason K, Fennell AY (2009) VitisNet: “Omics” integration through grapevine molecular networks. PLoS ONE (submitted)

Grimplet J, Deluc LG, Tillett RL, Wheatley MD, Schlauch KA, Cramer GR, Cushman JC (2007) Tissue-specific mRNA expression profiling in grape berry tissues. BMC Genomics 8:187

Grubbs F (1969) Procedures for detecting outlying observations in samples. Technometrics 11:1–21

Han ES, Wu Y, McCarter R, Nelson JF, Richardson A, Hilsenbeck SG (2004) Reproducibility, sources of variability, pooling, and sample size: important considerations for the design of high-density oligonucleotide array experiments. J Gerontol A Biol Sci Med Sci 59:306–315

Huang D, Jaradat MR, Wu W, Ambrose SJ, Ross AR, Abrams SR, Cutler AJ (2007) Structural analogs of ABA reveal novel features of ABA perception and signaling in Arabidopsis. Plant J 50:414–428

Irizarry RA, Bolstad BM, Collin F, Cope LM, Hobbs B, Speed TP (2003a) Summaries of Affymetrix GeneChip probe level data. Nucleic Acids Res 31:e15

Irizarry RA, Hobbs B, Collin F, Beazer-Barclay YD, Antonellis KJ, Scherf U, Speed TP (2003b) Exploration, normalization, and summaries of high density oligonucleotide array probe level data. Biostatistics 4:249–264

Jaillon O, Aury JM, Noel B, Policriti A, Clepet C, Casagrande A, Choisne N, Aubourg S, Vitulo N, Jubin C, Vezzi A, Legeai F, Hugueney P, Dasilva C, Horner D, Mica E, Jublot D, Poulain J, Bruyere C, Billault A, Segurens B, Gouyvenoux M, Ugarte E, Cattonaro F, Anthouard V, Vico V, Del Fabbro C, Alaux M, Di Gaspero G, Dumas V, Felice N, Paillard S, Juman I, Moroldo M, Scalabrin S, Canaguier A, Le Clainche I, Malacrida G, Durand E, Pesole G, Laucou V, Chatelet P, Merdinoglu D, Delledonne M, Pezzotti M, Lecharny A, Scarpelli C, Artiguenave F, Pe ME, Valle G, Morgante M, Caboche M, Adam-Blondon AF, Weissenbach J, Quetier F, Wincker P (2007) The grapevine genome sequence suggests ancestral hexaploidization in major angiosperm phyla. Nature 449:463–467

Li C, Hung Wong W (2001) Model-based analysis of oligonucleotide arrays: model validation, design issues and standard error application. Genome Biol 2:RESEARCH0032

Lund ST, Peng FY, Nayar T, Reid KE, Schlosser J (2008) Gene expression analyses in individual grape (Vitis vinifera L.) berries during ripening initiation reveal that pigmentation intensity is a valid indicator of developmental staging within the cluster. Plant Mol Biol 68:301–315

Maere S, Heymans K, Kuiper M (2005) BiNGO: a Cytoscape plugin to assess overrepresentation of gene ontology categories in biological networks. Bioinformatics 21:3448–3449

Mathiason K, He D, Grimplet J, Venkateswari J, Galbraith DW, Or E, Fennell A (2009) Transcript profiling in Vitis riparia during chilling requirement fulfillment reveals coordination of gene expression patterns with optimized bud break. Funct Integr Genomics 9:81–96

Matsui A, Ishida J, Morosawa T, Mochizuki Y, Kaminuma E, Endo TA, Okamoto M, Nambara E, Nakajima M, Kawashima M, Satou M, Kim JM, Kobayashi N, Toyoda T, Shinozaki K, Seki M (2008) Arabidopsis transcriptome analysis under drought, cold, high-salinity and ABA treatment conditions using a tiling array. Plant Cell Physiol 49:1135–1149

Pilati S, Perazzolli M, Malossini A, Cestaro A, Dematte L, Fontana P, Dal Ri A, Viola R, Velasco R, Moser C (2007) Genome-wide transcriptional analysis of grapevine berry ripening reveals a set of genes similarly modulated during three seasons and the occurrence of an oxidative burst at veraison. BMC Genomics 8:428

Redman JC, Haas BJ, Tanimoto G, Town CD (2004) Development and evaluation of an Arabidopsis whole genome Affymetrix probe array. Plant J 38:545–561

Rotter A, Camps C, Lohse, Kappel C, Pilati S, Hren M, Stitt M, Coutos-Thévenot P, Moser C, Usadel B, Delrot S, Gruden K (2009) Gene expression profiling in susceptible interaction of grapevine with its fungal pathogen Eutypa lata: extending MapMan ontology for grapevine. BMC Plant Biology 9:104

Ruepp A, Zollner A, Maier D, Albermann K, Hani J, Mokrejs M, Tetko I, Guldener U, Mannhaupt G, Munsterkotter M, Mewes HW (2004) The FunCat, a functional annotation scheme for systematic classification of proteins from whole genomes. Nucleic Acids Res 32:5539–5545

Schoof H, Ernst R, Nazarov V, Pfeifer L, Mewes HW, Mayer KF (2004) MIPS Arabidopsis thaliana Database (MAtDB): an integrated biological knowledge resource for plant genomics. Nucleic Acids Res 32:D373–376

Smyth G (2005) Limma: linear models for microarray data. In: Gentleman R, Carey V, Dudoit S, Irizarry R, Huber W (eds) Bioinformatics and computational biology solutions using R and bioconductor. Springer, New York, pp 397–420

Tattersall E, Ergul A, Al-Kayal F, Cushman JC, Cramer GR (2005) A comparison of methods for isolating RNA from leaves of grapevine (Vitis vinifera). Am J Enol Vitic 56:400–406

Tattersall EA, Grimplet J, Deluc L, Wheatley MD, Vincent D, Osborne C, Ergul A, Lomen E, Blank RR, Schlauch KA, Cushman JC, Cramer GR (2007) Transcript abundance profiles reveal larger and more complex responses of grapevine to chilling compared to osmotic and salinity stress. Funct Integr Genomics 7:317–333

Terrier N, Glissant D, Grimplet J, Barrieu F, Abbal P, Couture C, Ageorges A, Atanassova R, Leon C, Renaudin JP, Dedaldechamp F, Romieu C, Delrot S, Hamdi S (2005) Isogene specific oligo arrays reveal multifaceted changes in gene expression during grape berry (Vitis vinifera L.) development. Planta 222:832–847

Thimm O, Blasing O, Gibon Y, Nagel A, Meyer S, Kruger P, Selbig J, Muller LA, Rhee SY, Stitt M (2004) MAPMAN: a user-driven tool to display genomics data sets onto diagrams of metabolic pathways and other biological processes. Plant J 37:914–939

Trewavas A (2006) A Brief History of Systems Biology: "Every object that biology studies is a system of systems." Francois Jacob (1974). Plant Cell 18:2420–2430

Velasco R, Zharkikh A, Troggio M, Cartwright DA, Cestaro A, Pruss D, Pindo M, Fitzgerald LM, Vezzulli S, Reid J, Malacarne G, Iliev D, Coppola G, Wardell B, Micheletti D, Macalma T, Facci M, Mitchell JT, Perazzolli M, Eldredge G, Gatto P, Oyzerski R, Moretto M, Gutin N, Stefanini M, Chen Y, Segala C, Davenport C, Dematte L, Mraz A, Battilana J, Stormo K, Costa F, Tao Q, Si-Ammour A, Harkins T, Lackey A, Perbost C, Taillon B, Stella A, Solovyev V, Fawcett JA, Sterck L, Vandepoele K, Grando SM, Toppo S, Moser C, Lanchbury J, Bogden R, Skolnick M, Sgaramella V, Bhatnagar SK, Fontana P, Gutin A, Van de Peer Y, Salamini F, Viola R (2007) A high quality draft consensus sequence of the genome of a heterozygous grapevine variety. PLoS One 2:e1326

Vincent D, Ergul A, Bohlman MC, Tattersall EA, Tillett RL, Wheatley MD, Woolsey R, Quilici DR, Joets J, Schlauch K, Schooley DA, Cushman JC, Cramer GR (2007) Proteomic analysis reveals differences between Vitis vinifera L. cv. Chardonnay and cv. Cabernet Sauvignon and their responses to water deficit and salinity. J Exp Bot 58:1873–1892

Wang Z, Gerstein M, Snyder M (2009) RNA-Seq: a revolutionary tool for transcriptomics. Nat Rev Genet 10:57–63

Waters DL, Holton TA, Ablett EM, Lee LS, Henry RJ (2005) cDNA microarray analysis of developing grape (Vitis vinifera cv. Shiraz) berry skin. Funct Integr Genomics 5:40–58

Waters DLE, Holton TA, Ablett EM, Slade Lee L, Henry RJ (2006) The ripening wine grape berry skin transcriptome. Plant Sci 171:132–138

Welle S, Brooks AI, Thornton CA (2002) Computational method for reducing variance with Affymetrix microarrays. BMC Bioinformatics 3:23