
Computational Statistics and Data Analysis 54 (2010) 1179–1189


journal homepage: www.elsevier.com/locate/csda

Modeling epigenetic modifications under multiple treatment conditions

Dong Wang

Department of Statistics, University of Nebraska, Lincoln, NE 68583, USA

Article info

Article history: Received 13 March 2009; received in revised form 28 September 2009; accepted 28 September 2009; available online 6 October 2009.

Abstract

ChIP–chip is a powerful tool for epigenetic research. However, current statistical methods are developed primarily for detecting transcription factor binding sites, and there is currently no satisfactory method for incorporating covariates such as time, hormone levels, and genotypes. In this study, we develop a varying coefficient model for epigenetic modifications such as histone acetylation and DNA methylation. By taking into account the special features of ChIP–chip data, a plug-in type method is derived for bandwidth selection in the local linear fitting of the varying coefficient model. Our results show that analyses using the proposed varying coefficient model can effectively detect diverse characteristics of epigenetic modifications over genomic regions as well as across different treatment conditions.

© 2009 Elsevier B.V. All rights reserved.

E-mail address: [email protected].

0167-9473/$ – see front matter. doi:10.1016/j.csda.2009.09.035

1. Introduction

The study of epigenetics concerns mechanisms that regulate changes in phenotype or gene expression without alterations in the DNA sequence. Post-translational histone modifications and DNA methylation are among the most intensively studied epigenetic modifications. Changes in epigenetic modification patterns can alter chromatin structures and the accessibility of DNA to transcription factors (Schones and Zhao, 2008) and thus have a profound impact on gene expression profiles.

In the nucleus, the fundamental unit of chromatin is the nucleosome, consisting of 147 base pairs of DNA wound around a core of histone proteins (Fig. 1). While the core of the nucleosome is always composed of histones H2A, H2B, H3, and H4, the tail regions of these histone proteins can be modified by chemical processes such as methylation, acetylation, phosphorylation, and other modifications of amino acids. It is now widely appreciated that different patterns of histone modification constitute a “histone code” that is distinct from the primary DNA sequence. Because of their importance for the regulation of gene expression, the distribution patterns of various histone modifications are of critical interest to the research community.

Like that of transcription factor binding sites, the distribution of modified histones can be studied with chromatin immunoprecipitation experiments followed by hybridization on tiling microarrays (ChIP on chip, reviewed in Wu et al., 2006). In ChIP–chip experiments, proteins are crosslinked chemically with the DNA molecule. The chromatin is then broken into pieces of several hundred base pairs and immunoprecipitated with antibodies against specific proteins. The DNA is subsequently amplified and hybridized to genomic tiling microarrays to measure the enrichment of DNA sequences in the immunoprecipitated fraction. Since the probes on tiling microarrays are designed to cover the genome (or portions of the genome) consecutively, each hybridization produces a data set consisting of the hybridization signals of numerous oligonucleotide probes together with the corresponding genomic locations. Since the amount of DNA recovered depends on the association of DNA with the targeted proteins in the immunoprecipitation, high enrichment levels of certain DNA sequences suggest elevated binding of the protein of interest at the corresponding genomic regions.

Several methods have been developed for the analysis of ChIP–chip data. Most efforts have focused on detecting transcription factor binding sites, though examples have also been reported in the study of histone modifications. Early attempts include methods based on sliding windows (Cawley et al., 2004; Keles et al., 2006), as well as hidden Markov models (Li et al., 2005). The Mpeak method (Zheng et al., 2007) fits a triangle-shaped regression function over a moving window of fixed distance using double regression. Ji and Wong’s (2005) TileMap method uses a hierarchical empirical Bayes model to compute probe specific statistics, and then combines them with a moving average or a hidden Markov model. Keles (2007) has proposed a hierarchical gamma mixture model of binding intensities, incorporating a distribution for the peak size derived from the experimental design and model parameters. A Bayesian hierarchical model for ChIP–chip data has been suggested by Gottardo et al. (2008).

Though some of the aforementioned methods have been applied to detect enrichment regions for various histone modifications, there remain some important issues in the data analysis for epigenetic research. First, most current methods emphasize finding transcription factor binding sites, especially in the form of “peak” detection. This emphasis is perfectly reasonable for studying transcription factors, since the binding sites of transcription factors are usually narrow and show a more or less triangular “peak” shape. Thus the main goal of this type of study is to accurately determine the number and location of these peaks. Data on histone modifications, on the other hand, have different characteristics from those of transcription factors. Histone modifications can often stretch over regions of several kilobases in length, routinely with complex shapes that do not resemble discrete peaks. Second, although studies to date have mostly focused on the presence or absence of a certain protein in some genomic regions, using data from a control sample and a sample obtained by chromatin immunoprecipitation, ChIP–chip experiments involving multiple treatment conditions, time points, or cell types have begun to appear and are expected to increase in popularity. For example, in time course experiments, one would be interested in discerning the change in histone modification profiles over different time points as well as across genomic regions. Statistical methods dealing with these experiments with multiple treatment conditions have not been well developed.

In this article we introduce a varying coefficient model for ChIP–chip data that can naturally incorporate multiple experimental conditions. The varying coefficient model, proposed in Cleveland et al. (1991) and Hastie and Tibshirani (1993), has greatly extended the applications of traditional regression models. It has since been successfully applied to longitudinal studies, environmental studies, time series, and many other settings (e.g. Cai et al., 2000a; Cai and Tiwari, 2000; Lee and Ullah, 2003; Qu and Li, 2006). The estimation and inference for varying coefficient models based on local polynomial modeling have been discussed in detail in Fan and Zhang (1999) and Cai et al. (2000b). As the varying coefficient model has been shown to be readily expandable and adaptable, it can serve as a starting point for developing models that will be able to integrate disparate sources of information in epigenetic research. Though in this paper we use an example of acetylated histones based on Affymetrix tiling microarrays, the method proposed here should be readily applicable to other platforms such as NimbleGen arrays. Data for DNA methylation studies obtained through methylated DNA immunoprecipitation (MeDIP; Weber et al., 2005) also provide potential applications for the varying coefficient model.

The rest of this article is organized as follows. The data set that motivates this research is described in Section 2. The varying coefficient model for ChIP–chip data is defined in Section 3, where a bandwidth selection procedure and permutation based tests are also discussed. Section 4 describes the results of the analysis of the HL60 time course data using the proposed method. Section 5 reports the results from simulation studies. Section 6 gives general conclusions and discussion, while some technical details are provided in the Appendix.

2. The HL60 cell time course data

To motivate the proposed varying coefficient model for epigenetic modifications, we consider the time course experiment on the HL60 cell-line described by Ghosh et al. (2006). This experiment is performed as part of the Encyclopedia of DNA Elements (ENCODE) consortium project (The ENCODE Project Consortium, 2004, 2007), which seeks to identify all functional elements in one per cent of the human genome sequence during the pilot phase of the project.

For this particular experiment, HL60 cells are stimulated by all-trans retinoic acid (RA, a metabolite of vitamin A) for time periods of 0, 2, 8 and 32 h to induce differentiation along the granulocytic lineage (a category of white blood cells). At each time point, chromatin immunoprecipitation experiments are carried out, and both control and treatment samples are hybridized on tiling microarrays. The control sample is the amplified genomic DNA, and the treatment sample is from DNA recovered in chromatin immunoprecipitation. Antibodies against various proteins are used for immunoprecipitation, including a trio of modified histones: tetra-acetylated histone H4, di-acetylated histone H3, and tri-methylated histone H3. In this article, we consider the data relating to tetra-acetylated histone H4 (TetAc-H4), which is believed to be primarily associated with actively expressed genes.

At each time point of the TetAc-H4 immunoprecipitation experiment, five biological samples are processed, and the genomic DNA and the immunoprecipitated DNA (after amplification) of each sample are each hybridized to two Affymetrix ENCODE I tiling microarrays. Thus there are 20 tiling arrays for each time point, 10 for controls (five biological replicates with two technical replicates each) and 10 for immunoprecipitated samples. The Affymetrix ENCODE I tiling arrays employ short oligonucleotide probe pairs of length 25 bases to interrogate the one per cent of the human genome sequence selected by the ENCODE consortium. Each probe pair includes a perfect match and a mismatch. The mismatch probe, whose sequence is identical to its corresponding perfect match sequence except for the central base, is intended to measure the degree of cross-hybridization. In our analysis, only measurements on perfect match probes are utilized.

Fig. 1. A diagram of the basic structure of chromatin. The upper left section of the figure shows the structure of a nucleosome. The lower section illustrates the “beads on a string” structure of chromatin. Chemical modifications of histone tails modulate the binding of transcription factors and the recruitment of RNA polymerase.

3. A varying coefficient model for epigenetic modifications

3.1. Model specification

We are interested in modeling the change in epigenetic modification profiles over certain predefined genomic regions with characteristics of interest; common examples include promoters (sequences critical for RNA polymerase attachment upstream of the transcription start site) and enhancers (sequences with special protein binding properties that can elevate gene expression levels). We also suppose that the ChIP–chip data have been suitably normalized across arrays. For each biological sample, the log ratio of the probe intensity for the immunoprecipitated DNA versus that for the control DNA is taken as the enrichment score at the genomic locus of that probe (if multiple technical replicates are used, the average log ratio is used for that biological sample).

Now we consider a specific genomic region of interest with fixed length. Let $u_i$, $i = 1, \ldots, n$, represent the genomic position of the center of the probe for the $i$th observation, for which we use the distance from the transcription start site in kilobases in the HL60 example. Here $i$ indexes all possible combinations of biological samples and probes. The probe positions are considered to be fixed and roughly equally spaced along the chromosome. In this experiment, it is of special interest to detect the change in histone modification in each region between different time points of RA stimulation, which may itself vary along the genomic sequence of the region. This leads us to use a varying coefficient model to simultaneously account for the variation across both time and genomic positions. The enrichment score $Y_i$ is modeled as

$$Y_i = \sum_{j=1}^{q} \alpha_j(u_i) X_{ij} + \varepsilon_i, \tag{1}$$

where the $X_{ij}$'s are covariates, $\alpha_j(u)$ is the $j$th coefficient function, and $q$ is the number of different treatment conditions. The $\varepsilon_i$'s are independent noise terms with zero mean and variance $\sigma^2$. In addition, we denote by $X_i = (X_{i1}, \ldots, X_{iq})^T$ the covariate vector of the $i$th observation and by $r(u_i, X_i) = \sum_{j=1}^{q} \alpha_j(u_i) X_{ij}$ the mean enrichment score at locus $u_i$ given covariate $X_i$.

For illustration, consider the time course experiment for TetAc-H4. Suppose we define $X_{i1}$ to be 1 for all $i$; $X_{i2} = 1$ if the observation is taken 2 h after RA stimulation and zero otherwise; in a similar fashion, $X_{i3}$ and $X_{i4}$ indicate whether the observation is taken 8 h or 32 h after RA stimulation, respectively. In this setting, $\alpha_1(u)$ is the mean enrichment score at genomic locus $u$ before RA stimulation, $\alpha_2(u)$ is the change in enrichment score at this position after 2 h of RA stimulation, $\alpha_3(u)$ is the change in enrichment after 8 h, and so on. Accordingly, the mean enrichment level $r(u, \cdot)$ at the four time points is represented by $\alpha_1(u)$, $\alpha_1(u) + \alpha_2(u)$, $\alpha_1(u) + \alpha_3(u)$, and $\alpha_1(u) + \alpha_4(u)$ respectively. Besides testing whether $\alpha_j(u)$ is zero, which we shall discuss later, it is also important to examine the shapes of $\alpha_j(u)$, $j = 2, 3, 4$, which show specific up-regulated or down-regulated histone modification patterns. Potentially, we can cluster genes according to different shapes of $\alpha_j(u)$. One may also use the $\alpha_j(u)$ functions of known genes to train classifiers in order to predict the properties of unknown genes (see Heintzman et al., 2007, for an example regarding identifying enhancers).

Note that though we define the covariates to be indicator variables with values of 0 or 1 in the analysis of the HL60 time course experiment, there are situations where it is reasonable to consider continuous covariates, e.g., when $X$ represents short time intervals or a gradient of a certain chemical over a narrow range. For continuous covariates, the methods proposed in this paper can be readily adapted with straightforward modifications.
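The indicator coding used for the HL60 time course can be sketched in a few lines of Python. This is only an illustration: the function names and the coefficient values are hypothetical, and only the 0/1 coding and the sum in Eq. (1) are taken from the text.

```python
# Hypothetical helper names; only the 0/1 coding itself comes from the text.
TIME_POINTS = (0, 2, 8, 32)  # hours of RA stimulation


def covariate_vector(hours):
    """Return X_i = (X_i1, ..., X_i4) for an observation taken at `hours`."""
    if hours not in TIME_POINTS:
        raise ValueError(f"unknown time point: {hours}")
    # X_i1 is the intercept; X_ij (j >= 2) flags the j-th time point.
    return [1] + [1 if hours == t else 0 for t in TIME_POINTS[1:]]


def mean_enrichment(alpha_at_u, hours):
    """r(u, X) = sum_j alpha_j(u) * X_j, given coefficient values at one locus u."""
    x = covariate_vector(hours)
    return sum(a * xj for a, xj in zip(alpha_at_u, x))


# Made-up coefficient values at a single locus, for illustration only.
alpha = [1.0, -0.5, 0.25, 0.75]
profile = {t: mean_enrichment(alpha, t) for t in TIME_POINTS}
```

With this coding, the baseline profile is the first coefficient alone, and each later time point adds exactly one change coefficient to it, which is why the later coefficients can be read directly as changes relative to the zero hour.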


3.2. Parameter estimation

Several methods have been proposed for parameter estimation in varying coefficient models. In this article, we use a local linear regression approach, i.e., for each given point $u_0$, we approximate each coefficient function locally as

$$\alpha_j(u) \approx a_j + b_j (u - u_0).$$

A general treatment of local linear regression can be found in Wand and Jones (1995). Defining $K_h(z) = h^{-1} K(z/h)$, the estimation can be carried out by solving the least squares problem of minimizing

$$Q = \sum_{i=1}^{n} \Big[ Y_i - \sum_{j=1}^{q} \{ a_j + b_j (u_i - u_0) \} X_{ij} \Big]^2 K_h(u_i - u_0)$$

for a given kernel function $K(\cdot)$ and bandwidth $h$. Define $Y = (Y_1, \ldots, Y_n)^T$,

$$X_0 = \begin{pmatrix} X_{11} & X_{11}(u_1 - u_0) & \cdots & X_{1q} & X_{1q}(u_1 - u_0) \\ \vdots & \vdots & \ddots & \vdots & \vdots \\ X_{n1} & X_{n1}(u_n - u_0) & \cdots & X_{nq} & X_{nq}(u_n - u_0) \end{pmatrix}$$

and $W = \mathrm{diag}(K_h(u_1 - u_0), \ldots, K_h(u_n - u_0))$. By standard least squares theory, we can estimate the varying coefficient functions as

$$\hat{\alpha}_j(u_0) = e_{2j-1,2q}^T (X_0^T W X_0)^{-1} X_0^T W Y, \quad j = 1, \ldots, q.$$

Here $e_{i,m}$ denotes the unit vector of length $m$ with 1 at the $i$th position. Recalling $X_i = (X_{i1}, \ldots, X_{iq})^T$, we can estimate the enrichment score at position $u_0$ and with covariates $X_i$ as

$$\hat{r}(X_i, u_0) = (X_i \otimes e_{1,2})^T (X_0^T W X_0)^{-1} X_0^T W Y,$$

where $\otimes$ denotes the Kronecker product. Here we consider only “internal” points, i.e., the distance from $u_0$ to either end of the genomic region is at least $h$. Usually in the tiling microarray setting, we can include enrichment scores within the distance $h$ from the boundary outside the region of interest in the estimation process and then discard these extra segments. In the case that boundary regions have to be considered, theories on the boundary condition for local polynomial regression can be applied (e.g. Fan and Gijbels, 1996).

For simplicity, suppose there are $k$ biological samples under each of the $q$ treatment conditions. Then $n = q k n_c$, where $n_c$ is the number of probes on the tiling array for this region of length $c$. Under some smoothness conditions given in the Appendix, both the coefficient functions and the mean enrichment scores can be estimated consistently under $h \to 0$ and $k n_c h \to \infty$.
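The weighted least squares step of this subsection can be sketched as follows. This is a minimal standard-library illustration, not the author's R implementation: the function names, the small Gaussian elimination solver, and the noise-free toy data are all assumptions made here for the example. It builds the rows of the local design matrix, accumulates the kernel-weighted normal equations, and returns the intercept entries, which are the estimated coefficient values at the chosen point.

```python
def epanechnikov(z):
    """Epanechnikov kernel K(z) = 3/4 (1 - z^2) on |z| < 1."""
    return 0.75 * (1.0 - z * z) if abs(z) < 1.0 else 0.0


def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [bi] for row, bi in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for r in reversed(range(n)):
        s = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - s) / m[r][r]
    return x


def local_linear_coefficients(u, x, y, u0, h):
    """Estimate (alpha_1(u0), ..., alpha_q(u0)) by kernel-weighted least squares."""
    p = 2 * len(x[0])
    xtwx = [[0.0] * p for _ in range(p)]
    xtwy = [0.0] * p
    for ui, xi, yi in zip(u, x, y):
        w = epanechnikov((ui - u0) / h) / h  # K_h(u_i - u0)
        if w == 0.0:
            continue
        # Row of X0: (X_i1, X_i1 (u_i - u0), ..., X_iq, X_iq (u_i - u0)).
        row = []
        for xij in xi:
            row.extend([xij, xij * (ui - u0)])
        for r in range(p):
            xtwy[r] += w * row[r] * yi
            for c in range(p):
                xtwx[r][c] += w * row[r] * row[c]
    theta = solve(xtwx, xtwy)
    return theta[0::2]  # intercepts a_j, i.e. positions 2j - 1 of the solution


# Noise-free toy data with linear coefficient functions (illustration only):
# alpha_1(u) = 1 + 2u and alpha_2(u) = 0.3 - 0.5u, two conditions per probe.
grid = [i / 20.0 for i in range(-20, 21)]
u, x, y = [], [], []
for ui in grid:
    for x2 in (0, 1):
        u.append(ui)
        x.append([1, x2])
        y.append((1.0 + 2.0 * ui) + (0.3 - 0.5 * ui) * x2)
alpha_hat = local_linear_coefficients(u, x, y, u0=0.2, h=0.4)
```

Because the toy coefficient functions are exactly linear and noise-free, the local linear fit should reproduce their values at the chosen point up to rounding error; with real enrichment scores the result depends on the bandwidth, which is the subject of the next subsection.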

3.3. Bandwidth selection

As for all smoothing based methods, bandwidth selection is of great importance for the performance of the model. Here we use a plug-in method similar to that of Ruppert et al. (1995) for local polynomial regression. First we can show that given the covariate vector $X$ and the genomic position $u_0$, the bias of the estimator $\hat{r}(X, u_0)$ is

$$E\{\hat{r}(X, u_0) - r(X, u_0)\} = \frac{1}{2} h^2 \mu_2(K) X^T \alpha''(u_0) + o(h^2) + O\{(k n_c)^{-1}\}, \tag{2}$$

where $\mu_2(K) = \int z^2 K(z)\,dz$ and $\alpha''(u_0)$ is the vector of the second derivatives of the coefficient functions. Accordingly, the variance is

$$\mathrm{Var}\{\hat{r}(X, u_0)\} = \frac{c R(K) \sigma^2}{k n_c h} + o\{(k n_c h)^{-1}\}, \tag{3}$$

where $R(K) = \int K^2(z)\,dz$, and $c$ is the length of the genomic region (in kilobases for our numerical results). By the mean squared error criterion, the optimal bandwidth for given $X$ and $u_0$ is then

$$h_{op} = \left[ \frac{c R(K) \sigma^2}{k n_c \mu_2^2(K) \{X^T \alpha''(u_0)\}^2} \right]^{1/5}. \tag{4}$$

As expected for the local linear regression method, the optimal bandwidth $h_{op}$ is of the order $(k n_c)^{-1/5}$. Results in Eqs. (2) and (3) also suggest that the performance of the estimator will improve with more biological samples and higher probe densities. Obviously the variance $\sigma^2$ and the second derivative of the coefficient functions will have to be estimated. In addition, since we mainly consider genomic regions with the length of a few kilobases, it is convenient to use a single bandwidth for all probe positions in the same region, with the squared second derivative term averaged over observations. Specifically, we will use the following to determine the bandwidth:

$$h_{op} = \left[ \frac{c R(K) \hat{\sigma}^2}{k n_c \mu_2^2(K) \frac{1}{n} \sum_{i=1}^{n} \{X_i^T \hat{\alpha}''(u_i)\}^2} \right]^{1/5}.$$

We use the method of Härdle and Marron (1995) to obtain the estimates of $\sigma^2$ and $\alpha''(u)$ in $h_{op}$ with blocked polynomial functions as a preliminary fit, though more sophisticated methods of Ruppert et al. (1995) can also be used with straightforward modifications.

In addition to the plug-in method, one can also use the generalized cross-validation criterion for bandwidth selection, which tends to give smaller bandwidths than the plug-in method. The plug-in method is easy to implement and achieves good performance in the analysis of actual data and in simulation studies. Here we use a separate bandwidth for each genomic segment of four to five kilobases in the analysis of the HL60 time course data and in simulation studies. Alternatively, one may choose to apply different bandwidths to smaller intervals if the region is of particular interest, or one may use a single bandwidth for a large region if only exploratory analysis is intended.

On the other hand, the choice of the kernel function $K(\cdot)$ does not seem to have a major effect on the results of analysis. Throughout this paper, we use the Epanechnikov kernel, $K(z) = \frac{3}{4}(1 - z^2) I(|z| < 1)$, for the numerical results.
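As a rough numerical companion to the plug-in rule, the sketch below evaluates the region-wide bandwidth formula for the Epanechnikov kernel, for which the two kernel constants are 1/5 (second moment) and 3/5 (integral of the squared kernel). The function name and the input values are hypothetical, and the sketch assumes that the variance and the average squared curvature term have already been estimated by a preliminary fit, which is not shown.

```python
MU2_EPANECHNIKOV = 1.0 / 5.0  # integral of z^2 K(z) dz for the Epanechnikov kernel
R_EPANECHNIKOV = 3.0 / 5.0    # integral of K(z)^2 dz for the Epanechnikov kernel


def plug_in_bandwidth(c, k, n_c, sigma2, curvature_sq_mean):
    """Region-wide plug-in bandwidth.

    c: region length (kb); k: biological samples per condition;
    n_c: probes in the region; sigma2: estimated noise variance;
    curvature_sq_mean: average of {X_i^T alpha''(u_i)}^2 from a preliminary fit.
    """
    if curvature_sq_mean <= 0:
        raise ValueError("average squared curvature must be positive")
    numerator = c * R_EPANECHNIKOV * sigma2
    denominator = k * n_c * MU2_EPANECHNIKOV ** 2 * curvature_sq_mean
    return (numerator / denominator) ** 0.2


# Made-up inputs, for illustration only.
h_five_samples = plug_in_bandwidth(c=4.0, k=5, n_c=100, sigma2=0.25, curvature_sq_mean=1.0)
h_ten_samples = plug_in_bandwidth(c=4.0, k=10, n_c=100, sigma2=0.25, curvature_sq_mean=1.0)
```

Doubling the number of biological samples shrinks the bandwidth by a factor of 2 to the power of -1/5, about 13 per cent, which is the rate stated for the optimal bandwidth above.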

3.4. Inference

In addition to estimating the mean enrichment levels and the coefficient functions representing the treatment effect, it is sometimes of interest to test whether a coefficient function is zero. In the HL60 time course example, this is equivalent to testing whether the epigenetic profile changes after RA stimulation when compared to that of the zero hour.

For this purpose, we propose to use a version of the multiresponse permutation procedure (MRPP) described by Mielke and Berry (2001). The test statistic is given by

$$D = \frac{1}{n_c q} \sum_{u} \sum_{i=1}^{q} D_i(u),$$

where $n_c$ is the number of probes in the region of interest, $q$ is the number of treatments, and $D_i(u)$ is the average of all pairwise Euclidean distances between residuals of the same treatment $i$ at position $u$ after fitting the reduced model. The advantages of using the Euclidean distance in the test statistic are extensively discussed in Mielke and Berry (2001). The proposed permutation test proves to be very robust under various settings, as demonstrated in the simulation studies.

The null distribution of the test statistic is obtained by permuting the residuals of blocks of probes. For a genomic region on which the test will be carried out, the ordered probes are divided into equal sized blocks (except the two blocks at the ends). The position of the first dividing point is randomly chosen. For example, for a region of 160 probes and a block size of 20, the second block can begin anywhere from probe number 2 to number 21. If two or more treatment conditions are expected to result in the same coefficient function under the null hypothesis, the samples associated with these treatment conditions are randomly permuted for each of the blocks. After all blocks have been permuted for the $m$th time, the statistic $D_m$ is calculated using the permuted residuals. This process is repeated $M$ times for a sufficiently large $M$, and the permutation p-value is given by

$$\frac{1}{M} \sum_{m=1}^{M} I(D \ge D_m),$$

where $I(\cdot)$ is the indicator function. This method is similar in spirit to the moving block bootstrap (Liu and Singh, 1992). As noted in the next section, we use a wavelet denoising method in the normalization step to remove high frequency features. If intensity measurements do not completely satisfy the white noise assumption after normalization, the permutation test provides additional robustness and is still likely to retain the correct type I error rate. The power of the test increases with more biological samples and higher probe density.

The R code implementing methods described in this section will be available from the author upon request.
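The block permutation scheme can be illustrated with a short Python sketch. It is not the author's R code, and it makes simplifying assumptions that should be read as such: residuals from the reduced model are taken as given, every sample passed in belongs to a condition that shares its coefficient function with the others under the null hypothesis, and each condition has at least two samples. The function names and the toy residuals are hypothetical.

```python
import random
from itertools import combinations


def mrpp_statistic(residuals, labels):
    """Average within-treatment pairwise distance over probes and treatments.

    residuals[s][p] is the residual of sample s at probe p; in one dimension
    the Euclidean distance is the absolute difference.
    """
    n_probes = len(residuals[0])
    groups = {}
    for s, lab in enumerate(labels):
        groups.setdefault(lab, []).append(s)
    total = 0.0
    for members in groups.values():
        pairs = list(combinations(members, 2))  # needs >= 2 samples per group
        for p in range(n_probes):
            total += sum(abs(residuals[a][p] - residuals[b][p]) for a, b in pairs) / len(pairs)
    return total / (n_probes * len(groups))


def block_bounds(n_probes, block_size, rng):
    """Blocks of equal size except at the two ends; the first cut is random."""
    cuts = list(range(rng.randint(1, block_size), n_probes, block_size))
    edges = [0] + cuts + [n_probes]
    return list(zip(edges[:-1], edges[1:]))


def permutation_p_value(residuals, labels, block_size, n_perm, seed=0):
    """Proportion of permutations with I(D >= D_m), as in the text."""
    rng = random.Random(seed)
    observed = mrpp_statistic(residuals, labels)
    n_samples, n_probes = len(residuals), len(residuals[0])
    hits = 0
    for _ in range(n_perm):
        permuted = [row[:] for row in residuals]
        for lo, hi in block_bounds(n_probes, block_size, rng):
            order = list(range(n_samples))
            rng.shuffle(order)  # reassign samples to labels within this block
            for s, src in enumerate(order):
                permuted[s][lo:hi] = residuals[src][lo:hi]
        if observed >= mrpp_statistic(permuted, labels):
            hits += 1
    return hits / n_perm


# Toy residuals: 4 samples per condition, 40 probes (made-up numbers with a
# clear difference between the two conditions).
labels = ["0h"] * 4 + ["2h"] * 4
residuals = [
    [(1.0 if lab == "0h" else -1.0) + 0.01 * ((3 * s + p) % 5) for p in range(40)]
    for s, lab in enumerate(labels)
]
p_value = permutation_p_value(residuals, labels, block_size=10, n_perm=200, seed=1)
```

A small observed statistic means that samples of the same condition have similar residuals, so mixing the conditions within blocks inflates the permuted statistics and the proportion of permutations satisfying the indicator becomes small. Permuting whole blocks rather than single probes keeps neighbouring probes together, which is the motivation given above for robustness to residual correlation along the genome.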

4. Application to the HL60 time course data

Due to the myriad factors affecting the intensity measurements in ChIP–chip experiments, normalization is crucial fordetecting variations caused by truly biological sources. In our analysis, we use the MAT (Model based Analysis of Tilingarrays) algorithm proposed by Johnson et al. (2006), which standardize the intensity reading of each probe using sequenceand copy number information. We also use the wavelet denoising method of Ozsolak et al. (2007) to remove high frequencyfeatures from each array. After the normalization, the readings from the two tiling arrays hybridized to the same DNA

1184 D. Wang / Computational Statistics and Data Analysis 54 (2010) 1179–1189

Fig. 2. The TetAc-H4 enrichment profiles for genes KIAA0999 and SERPINB2 as estimated by the varying coefficient model (left panels) and the estimatedcoefficient functions representing changes in enrichment at 2 h and 32 h after RA stimulation (right panels).

sample (technical replicates) are averaged. For each probe of a certain biological sample under a specific time point, thelog transformed intensity from genomic DNA is subtracted from that of immunoprecipitated DNA to obtain the enrichmentscore.The model for the enrichment score is thus

Yi =4∑j=1

αj(ui)Xij + εi

where i indexes all combinations of probes and biological samples. The coding of Xij is explained in Section 3. For thisparticular example, we focus on gene promoter regions since the promoter regions of annotated genes are well defined andthe change in epigenetic profile at the promoter region is of critical interest for understanding gene regulationmechanisms.Among the genes located inside the ENCODE region, we selected promoters that are at least 2.4 kilobases from other genes.Though ourmethod can be applied to any genomic region, the presence of multiple genes will make the interpretationmorecomplicated. Also since the probes are not completely equally spaced due to considerations of the DNA sequence in the arraydesigning process, we also excluded promoter regions with very few probes. As a result, our analysis focuses on 101 genepromoter regions covered by the ENCODE tiling arrays. The genomic position ui represents the distance from the center of aprobe to the transcription start site in kilobases with negative values denoting positions upstream of the transcription startsite.For each promoter region, the TetAc-H4 enrichment profile is estimated using the varying coefficient model described

in Section 3. The estimated enrichment profiles of two genes (KIAA0999 and SERPINB2) are shown in Fig. 2. KIAA0999 isa protein kinase, while SERPINB2 is a serpin peptidase inhibitor that has also been implied in the differentiation of whiteblood cells. Here the profiles at 8 h after RA stimulation are omitted to reduce cluttering. Note that at zero hour, the TetAc-His4 enrichment profiles of both genes show a ‘‘dip’’ near the transcription start site, which is likely due to nucleosomedepletion at these loci. This characteristic is also observed by Heintzman et al. (2007) and others. It should be noted thatthe observed TetAc-H4 enrichment profile reflects the composite effect of the proportion of tetra-acetylated histone H4 andthe nucleosome density. At 2 h after retinoic acid stimulation, there are changes in the TetAc-His4 profiles for both genes,most prominently the decrease in enrichment levels for both genes on the two sides of the transcription start site, implyingincreased depletion of nucleosomes related to gene expression activities.At the 32 h time point, the TetAc-H4 enrichment level of the SERPINB2 promoter appears to return to near the original

level before RA stimulation. For the KIAA099 promoter, however, the pattern is different; the change in the TetAc-H4 profileis more prominent at 32 h than that at 2 h after RA stimulation. This suggests that the promoters of different genes respondto RA stimulation in heterogeneous manners. Using the proposed varying coefficient model, one can capture distinctive

D. Wang / Computational Statistics and Data Analysis 54 (2010) 1179–1189 1185

Table 1Genes with promoter regions showing significant changes in the TetAc-H4 enrichment profile on at least one of the three time points (2, 8 or 32 h) afterRA stimulation.

Locus Chromosome Gene Time points

NM_002575 18 SERPINB2 2, 8NM_198833 18 SERPINB8 2, 8, 32NM_0005118 11 HBB 2BC030525 5 Transcribed gene 2NM_000463 2 UGT1A1 2NM_007214 6 SEC63 homolog 2, 8, 32NM_000522 7 HOXA13 2NM_001001916 11 OR52J3 8AB209327 11 KIAA0999 32NM_006863 19 LILRA1 32AK126690 19 TTYH1 32NM_024523 7 GCC1 2

patterns of response, and pinpoint genomic regions where the most significant changes occur. We also notice that the "peak" or rather "mount" patterns in Fig. 2 are reminiscent of those observed in nucleosome placement studies. Thus it is plausible that the varying coefficient model picks up information on nucleosome placement even though this experiment was not designed to detect nucleosome positioning.

To determine which promoter regions are affected by RA stimulation, we carried out tests on whether the TetAc-H4 enrichment scores from the transcription start site to two kilobases upstream at 2, 8 and 32 h after RA stimulation are the same as those at the zero hour, using the permutation test described in Section 3. Each test is based on 1000 permutations. We focused on the region from the transcription start site to two kilobases upstream because this region is considered to be critical for the initiation of gene transcription. We then used the method of Benjamini and Hochberg (1995) to control the false discovery rate (FDR) at the specified level for the three time points (2, 8, and 32 h) respectively.

Table 1 shows genes with at least one time point at which the TetAc-H4 profile is significantly different from that before RA stimulation with FDR = .10. The genes in Table 1 include genes known to be involved in pathways regulated by retinoic acid, such as OR52J3, UGT1A1, and HOXA13. Some genes are also known to be related to the differentiation of blood cells, e.g., SERPINB2, SERPINB8, LILRA1, and GCC1, which is not surprising as the HL60 cell line originates from white blood cells. At FDR = .10, eight genes show significant changes in TetAc-H4 enrichment profiles at 2 h, while four and five genes show changes at 8 h and 32 h respectively. Some genes like SERPINB8 and the SEC63 homolog show persistent changes from 2 to 32 h after RA stimulation, while others only show significant changes at one time point. Notably, the two serpin peptidase inhibitors, SERPINB2 and SERPINB8, both show changes at 2 h and 8 h, though unlike that of SERPINB8, the profile of the SERPINB2 promoter seems to revert to that before RA stimulation at the 32 h time point.

Thus from the analysis of the TetAc-H4 profiles at different time points after retinoic acid stimulation, epigenetic modifications show diverse patterns across genes and time points. Different patterns and their changes in response to biological stimuli will be of significant interest for future research. The proposed varying coefficient model can effectively capture and quantify these patterns for genomic regions of interest.
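The FDR step used above can be illustrated with a minimal sketch of the Benjamini and Hochberg (1995) step-up rule. The function name `benjamini_hochberg` and the p-values in `p_demo` are hypothetical and are not taken from the paper; in the analysis above the p-values would come from the permutation tests, one per promoter region, computed separately for each time point.

```python
import numpy as np

def benjamini_hochberg(p_values, fdr=0.10):
    """Return a boolean array marking hypotheses rejected by the BH step-up rule."""
    p = np.asarray(p_values, dtype=float)
    m = p.size
    order = np.argsort(p)                          # indices of p-values, smallest first
    thresholds = fdr * np.arange(1, m + 1) / m     # i * fdr / m for rank i
    below = p[order] <= thresholds
    reject = np.zeros(m, dtype=bool)
    if below.any():
        k = np.nonzero(below)[0].max()             # largest rank meeting its threshold
        reject[order[:k + 1]] = True               # reject every hypothesis up to that rank
    return reject

# Hypothetical p-values for six promoter regions at one time point (illustration only).
p_demo = [0.001, 0.004, 0.030, 0.200, 0.450, 0.800]
rejected = benjamini_hochberg(p_demo, fdr=0.10)
print(rejected)
```

Because the rule is step-up, a p-value that exceeds its own threshold can still be rejected when a larger p-value meets a later threshold.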

5. Simulation studies

In this section we report results from simulation studies for evaluating the performance and robustness of the proposed varying coefficient method. We simulated enrichment scores from −2 to 2 kilobases around the transcription start site under three treatment conditions. Condition C1 is considered the baseline condition. The enrichment profile under treatment condition C2 is always the same as that under C1, while mean enrichment scores under C3 are different from those under the other two conditions. The actual mean enrichment profiles are shown in Fig. 3. These are piecewise polynomial functions designed to mimic actual enrichment profiles. Details are available in the Online Support Information for this paper. Here we use three different coefficient functions (corresponding to C3a, C3b and C3c) for the difference in enrichment scores between C3 and C1. Among the three coefficient functions for C3, the coefficient function for C3a represents the smallest departure from the baseline enrichment pattern while the coefficient function for C3c is designed to make the mean enrichment profile either higher or lower than that under C1/C2 at different locations.

In Simulation I, we assume that probes are equally spaced with the distance between centers of two adjacent probes being 25 bases (0.025 kilobases). Also the error term in (1) is assumed to have a normal distribution with mean zero and standard deviation $\sigma = 0.24$. In Simulations II, III and IV, we evaluate the performance of the proposed varying coefficient method under different assumptions that are less than ideal. In the second simulation, we assume that the noise term has a scaled $t_3$ distribution, and the standard deviation is set to be $\sqrt{2}\sigma = 0.34$. In Simulation III, we assume the error variance to be varying across genomic positions. Specifically, we assume that the standard deviation takes the value $0.2\cos(2u) + 0.3$, which results in the values of the error standard deviation being between 0.1 (at $u = \pm 1.57$) and 0.5 (at $u = 0$) with the average value 0.3. For Simulation IV, the probes are assumed to be unequally spaced. This is often the case as one has to consider the effect of DNA sequences on the physical property of probes in designing the tiling arrays. For this simulation,


Fig. 3. The actual mean enrichment profiles under conditions C1/C2 and C3 in simulation studies.

Table 2. Type I error rate (C2/C1 columns) and power (C3/C1 columns) of the permutation test for Simulations I–IV. Three coefficient functions are used for condition C3: C3a, C3b, and C3c. For each, the first row gives results at nominal size 0.05 and the second row at nominal size 0.10.

             Simulation I     Simulation II    Simulation III   Simulation IV
C3    Size   C2/C1   C3/C1    C2/C1   C3/C1    C2/C1   C3/C1    C2/C1   C3/C1
C3a   0.05   .051    .248     .045    .217     .049    .202     .045    .185
      0.10   .094    .352     .094    .349     .103    .312     .087    .288
C3b   0.05   .043    .721     .054    .696     .045    .665     .052    .579
      0.10   .091    .839     .102    .808     .095    .789     .105    .699
C3c   0.05   .050    .717     .046    .666     .050    .409     .059    .553
      0.10   .112    .825     .089    .793     .098    .552     .105    .687

we assume that the distances between centers of adjacent probes are 0.025 kilobases on 60% of occasions, 0.05 kilobases on 30% of occasions, and 0.075 kilobases on 10% of occasions.

Fig. 4 shows the median estimated curves for the enrichment profile as well as curves for the 5th and 95th percentiles under condition C1 for the four simulations. In general, the median estimated curve (dotted) is very close to the actual profile (solid), with the best results obtained under normal errors with constant variance and probe spacing, which is the ideal condition for this type of experiment. Examination of the 5th and 95th percentile curves shows that the estimates are more variable (a wider band between these two curves) in Simulations II and IV than in Simulation I. Compared with Simulation I, the estimates in Simulation III are more variable in regions with larger error variance and less variable in regions with smaller error variance (towards the two ends). The results for C2 and C3 are similar and are thus omitted.

We also carried out the proposed permutation test on the simulated data, with results reported in Table 2. The results show that for tests involving conditions C2 and C1, which have no difference in the enrichment score profile, the empirical type I error rate is close to the nominal size of 0.05 or 0.10 for all four simulations. For the difference in enrichment profiles between conditions C3 and C1, the power of the test is the lowest for C3a. This is expected, as the coefficient function for C3a represents the smallest deviation from condition C1. The best power is obtained in Simulation I, also as expected. Loss of power is present in the other three simulations, notably in Simulation IV with unequally spaced and less dense probes (the probe density is a third less in Simulation IV than in Simulation I). Of course, if the gaps are too large (larger than one kilobase, say), we have to accept that information is completely missing from these segments, and we should avoid making statements about epigenetic modifications regarding the corresponding genomic sequence. For C3c, the power in Simulation III is affected the most, as the largest difference between enrichment profiles is concentrated in the regions with larger error variance in Simulation III. But in general, the loss of power is not too severe when compared with the ideal condition of Simulation I.

In summary, even though we made some simplifying assumptions in the varying coefficient model for epigenetic modifications, the proposed model proved to be sufficiently robust in the less than ideal conditions commonly encountered in real experiments.
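The probe layouts and error models of Simulations I–IV can be sketched as follows. This is a hedged illustration, not the code used for the paper: all function and variable names are hypothetical, the mean enrichment profiles (given in the Online Support Information) are not reproduced, and Simulation IV is assumed to keep the normal errors of Simulation I.

```python
import numpy as np

SIGMA = 0.24                                 # error sd in Simulation I
GAPS = np.array([0.025, 0.05, 0.075])        # probe spacings (kb) in Simulation IV
GAP_PROBS = np.array([0.6, 0.3, 0.1])        # their stated frequencies

def probe_positions(rng, unequal=False, start=-2.0, stop=2.0, spacing=0.025):
    """Probe centers (kb) on [start, stop], equally spaced or with random gaps."""
    if not unequal:
        n = int(round((stop - start) / spacing)) + 1
        return np.linspace(start, stop, n)
    n_max = int(round((stop - start) / GAPS.min())) + 1   # enough gaps to cover the region
    gaps = rng.choice(GAPS, size=n_max, p=GAP_PROBS)
    u = start + np.concatenate(([0.0], np.cumsum(gaps)))
    return u[u <= stop]

def sd_simulation_iii(u):
    """Position-dependent error sd of Simulation III."""
    return 0.2 * np.cos(2.0 * u) + 0.3

def simulate_errors(u, scenario, rng):
    """Draw one error per probe position for scenario 'I', 'II', 'III' or 'IV'."""
    u = np.asarray(u, dtype=float)
    if scenario in ("I", "IV"):              # assumption: IV keeps the errors of I
        return rng.normal(0.0, SIGMA, size=u.size)
    if scenario == "II":
        # A t3 variable has variance 3, so rescale it to sd sqrt(2) * SIGMA.
        target_sd = np.sqrt(2.0) * SIGMA
        return rng.standard_t(3, size=u.size) * target_sd / np.sqrt(3.0)
    if scenario == "III":
        return rng.normal(size=u.size) * sd_simulation_iii(u)
    raise ValueError(f"unknown scenario: {scenario!r}")

rng = np.random.default_rng(2010)            # arbitrary seed
u_equal = probe_positions(rng)
u_unequal = probe_positions(rng, unequal=True)
errors = {s: simulate_errors(u_unequal if s == "IV" else u_equal, s, rng)
          for s in ("I", "II", "III", "IV")}

mean_gap = float(GAPS @ GAP_PROBS)           # expected spacing in Simulation IV
density_ratio = 0.025 / mean_gap             # probe density relative to Simulation I
```

The expected spacing in Simulation IV is 0.6 × 0.025 + 0.3 × 0.05 + 0.1 × 0.075 = 0.0375 kilobases, so the probe density is two thirds of that in Simulation I, which agrees with the "a third less" statement above.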

6. Discussion

ChIP–chip has become an important tool for epigenetic research. Most statistical methods for ChIP–chip data have been originally developed for locating transcription factor binding sites, and the modeling of the diverse pattern of histone modifications has not been well studied. Furthermore, if experiments using tiling microarrays follow the same trajectory as


Fig. 4. The actual mean enrichment profiles (solid lines), the median estimated mean enrichment profiles (dotted lines), and curves for the 5th percentile and the 95th percentile of the estimated mean enrichment at each point (dashed and dot–dash lines respectively) for Simulations I, II, III and IV under condition C1.

those using expression microarrays, more and more researchers will seek to incorporate a variety of experimental factors such as genotypes, time, and chemical concentration into the experimental design. Thus it will be vital to devise methods that can accommodate a diverse array of settings. In this paper we show that the proposed varying coefficient model can meet the needs of modeling epigenetic profiles under multiple treatment conditions. The coefficient function can naturally be used to model the profile of enrichment without enforcing any specific shape, while the inherent flexibility of the varying coefficient model can accommodate a variety of experimental factors. Since the change in the histone modification profile in response to various biological factors has become a topic of increasing importance, using the coefficient function to represent the variation both along the chromosome and with regard to the biological factor of interest is very attractive. The coefficient function can then be used for further analysis, such as clustering genes or building predictive algorithms for identifying unknown genomic features. We plan to pursue these applications in future research.

In this paper we focus on promoter regions since these regions are relatively well studied. But the methods proposed here should be applicable to genomic segments of other types. With an eye to screening a significant number of regions, we intentionally make some simplifying assumptions for easy implementation. More complex methods can be incorporated if deemed necessary. For example, adaptive bandwidth selection could be useful if the genomic region of interest is long. A two-step method, such as that in Fan and Zhang (1999), can also be used. For cases where the error variance is some bounded function of genomic positions, such as in Simulation III, discussions on variance function estimation in the nonparametric regression setting can be found in Ruppert et al. (1997) and Fan and Yao (1998) among others.

Though the local least squares method used in this paper does not require specific error distributions other than having zero mean and finite second moment, it can be less efficient when the error distribution is not normal. For example, the local least absolute deviation (LAD) polynomial regression (Fan et al., 1994; Welsh, 1996) will be more efficient when the error distribution is Laplacian. The problem, of course, is that the error distribution is seldom known with any certainty. Some attempts have been made to take into account more information on error distributions (Linton and Xiao, 2007; Kai et al., in press), which might be extended to varying coefficient models in the future.

One aspect of future research is to incorporate multiple types of information in the model by taking advantage of the expandability of the varying coefficient model. For example, to separate the effect of nucleosome density from that due to the proportion of modified histones, one can include a coefficient function for the overall level (modified and not modified) of core histones when experimental data are available. Gene expression levels, chromatin configurations, and sequence motif information can also be incorporated. Though this paper focuses on ChIP–chip experiments using tiling microarrays, it will be of interest to extend the model to data obtained through massively parallel signature sequencing (e.g. Barski et al., 2007).


Acknowledgments

The author thanks the associate editor, three referees, and Dr. K.M. Eskridge for insightful comments. The author acknowledges the financial support of the National Science Foundation (Award Number 0701892) and the University of Nebraska Research Council.

Appendix A. Derivation of the plug-in method

We denote by $n_c$ the number of probes on the tiling array within the genomic region of interest. Suppose there are $q$ different treatment conditions and $k$ is the number of biological samples under each treatment; then $n = q k n_c$. We assume the following conditions:

(i) The genomic region under consideration is of fixed length $c$ and the distance between centers of adjacent probes is $c/n_c$, i.e., this is a fixed equally spaced design.

(ii) The second derivative of the coefficient functions, $\alpha_j''(u)$, $1 \le j \le q$, is continuous and bounded over the genomic region under consideration.

(iii) The kernel function $K$ is symmetric about zero and is supported on $[-1, 1]$.

(iv) The bandwidth is a sequence satisfying the conditions that $h \to 0$ and $k n_c h \to \infty$ as $k n_c \to \infty$.

(v) The point $u$ at which the estimation is taking place is located a distance of at least $h$ from the boundary of the region.

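By Taylor expansion, we can show that the bias of the estimates of mean enrichment levels is

\[
E\{\hat{r}(X_i, u_0) - r(X_i, u_0)\} = \tfrac{1}{2}(X_i \otimes e_{1,2})^T (X_0^T W X_0)^{-1} X_0^T W X_2\, \alpha''(u_0) + o(h^2),
\]

where

\[
X_2 = \begin{pmatrix}
X_{11}(u_1 - u_0)^2 & \cdots & X_{1q}(u_1 - u_0)^2 \\
\vdots & \ddots & \vdots \\
X_{n1}(u_n - u_0)^2 & \cdots & X_{nq}(u_n - u_0)^2
\end{pmatrix}.
\]

By an argument similar to that used in Chapter 5 of Wand and Jones (1995), we can show that

\[
\frac{c}{n} X_0^T W X_0 = \begin{pmatrix}
1 & 0 & \cdots & \frac{1}{q} & 0 \\
0 & h^2\mu_2(K) & \cdots & 0 & \frac{1}{q}h^2\mu_2(K) \\
\vdots & \vdots & \ddots & \vdots & \vdots \\
\frac{1}{q} & 0 & \cdots & \frac{1}{q} & 0 \\
0 & \frac{1}{q}h^2\mu_2(K) & \cdots & 0 & \frac{1}{q}h^2\mu_2(K)
\end{pmatrix} + O\{(k n_c)^{-1}\}
\]

and

\[
\frac{c}{n} X_0^T W X_2 = \begin{pmatrix}
h^2\mu_2(K) & \frac{1}{q}h^2\mu_2(K) & \cdots & \frac{1}{q}h^2\mu_2(K) & \frac{1}{q}h^2\mu_2(K) \\
0 & 0 & \cdots & 0 & 0 \\
\frac{1}{q}h^2\mu_2(K) & \frac{1}{q}h^2\mu_2(K) & \cdots & 0 & 0 \\
0 & 0 & \cdots & 0 & 0 \\
\vdots & \vdots & \ddots & \vdots & \vdots \\
\frac{1}{q}h^2\mu_2(K) & 0 & \cdots & 0 & \frac{1}{q}h^2\mu_2(K) \\
0 & 0 & \cdots & 0 & 0
\end{pmatrix} + O\{(k n_c)^{-1}\}.
\]

By some matrix manipulations, we have

\[
E\{\hat{r}(X, u_0) - r(X, u_0)\} = \tfrac{1}{2} h^2 \mu_2(K)\, X^T \alpha''(u_0) + o(h^2) + O\{(k n_c)^{-1}\}.
\]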

For the variance, let $V = \sigma^2 I_n$, where $I_n$ is the $n \times n$ identity matrix; then we have

\[
\mathrm{Var}\{\hat{r}(X_i, u_0)\} = (X_i \otimes e_{1,2})^T (X_0^T W X_0)^{-1} X_0^T W V W X_0 (X_0^T W X_0)^{-1} (X_i \otimes e_{1,2}).
\]
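The estimator and the sandwich variance above can be checked numerically with a small sketch. The following is an illustration under stated assumptions, not the author's implementation: the names `epanechnikov` and `local_linear_fit`, the choice of the Epanechnikov kernel (for which $R(K) = 3/5$), the two-condition design with a baseline column and one indicator column, and the linear coefficient functions are all hypothetical. Row $i$ of $X_0$ is taken to be $X_i \otimes (1, u_i - u_0)$, and $W$ is the diagonal matrix of kernel weights $K\{(u_i - u_0)/h\}/h$.

```python
import numpy as np

def epanechnikov(t):
    """Epanechnikov kernel, symmetric about zero and supported on [-1, 1]."""
    return np.where(np.abs(t) <= 1.0, 0.75 * (1.0 - t**2), 0.0)

def local_linear_fit(y, X, u, u0, h, x_new, sigma2=None):
    """Local linear estimate of x_new' alpha(u0) and, if sigma2 is given, its variance."""
    n, q = X.shape
    # Row i of X0 is X_i kron (1, u_i - u0): intercept and slope columns interleaved.
    Z = np.column_stack([np.ones(n), u - u0])
    X0 = np.einsum("ij,ik->ijk", X, Z).reshape(n, 2 * q)
    w = epanechnikov((u - u0) / h) / h        # diagonal of W
    XtW = X0.T * w                            # X0' W
    A = XtW @ X0                              # X0' W X0
    theta = np.linalg.solve(A, XtW @ y)       # local intercepts and slopes
    sel = np.kron(x_new, [1.0, 0.0])          # x_new kron e_{1,2}
    est = sel @ theta
    if sigma2 is None:
        return est, None
    B = sigma2 * (XtW * w) @ X0               # X0' W V W X0 with V = sigma2 * I
    a = np.linalg.solve(A, sel)               # (X0' W X0)^{-1} (x_new kron e_{1,2})
    return est, a @ B @ a

# Hypothetical design: q = 2 conditions, k = 3 samples each, n_c = 161 probes on [-2, 2] kb.
q, k, n_c, c = 2, 3, 161, 4.0
grid = np.linspace(-2.0, 2.0, n_c)
u = np.tile(grid, q * k)
cond = np.repeat(np.arange(q), k * n_c)
X = np.column_stack([np.ones(u.size), (cond == 1).astype(float)])
alpha = np.column_stack([1.0 + 2.0 * u, 0.5 - u])    # made-up linear coefficient functions
y = np.sum(X * alpha, axis=1)                         # noise-free responses

h, sigma2 = 0.5, 0.24**2
est, var = local_linear_fit(y, X, u, u0=0.3, h=h,
                            x_new=np.array([1.0, 1.0]), sigma2=sigma2)
asymptotic_var = c * 0.6 * sigma2 / (k * n_c * h)     # c R(K) sigma^2 / (k n_c h)
```

With noise-free linear coefficient functions the local linear fit reproduces the true value, here 1.6 + 0.2 = 1.8 at $u_0 = 0.3$, and the exact sandwich variance can be compared with the leading term $cR(K)\sigma^2/(k n_c h)$ obtained at the end of this appendix.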


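Using approximations analogous to those used for the derivation of the bias,

\[
\frac{c}{n} X_0^T W V W X_0 = \sigma^2 \begin{pmatrix}
V_1 & 0 & q^{-1}V_1 & 0 & \cdots & q^{-1}V_1 & 0 \\
0 & V_2 & 0 & q^{-1}V_2 & \cdots & 0 & q^{-1}V_2 \\
q^{-1}V_1 & 0 & q^{-1}V_1 & 0 & \cdots & 0 & 0 \\
0 & q^{-1}V_2 & 0 & q^{-1}V_2 & \cdots & 0 & 0 \\
\vdots & \vdots & \vdots & \vdots & \ddots & \vdots & \vdots \\
q^{-1}V_1 & 0 & 0 & 0 & \cdots & q^{-1}V_1 & 0 \\
0 & q^{-1}V_2 & 0 & 0 & \cdots & 0 & q^{-1}V_2
\end{pmatrix},
\]

where $V_1 = R(K)h^{-1} + o(h^{-1})$ and $V_2 = h\mu_2(K^2) + O\{(k n_c)^{-1}\}$. Then by matrix manipulations,

\[
\mathrm{Var}\{\hat{r}(X, u_0)\} = \frac{c R(K) \sigma^2}{k n_c h} + o\{(k n_c h)^{-1}\}.
\]

The optimal bandwidth in (4) can in turn be derived by minimizing the mean squared error.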
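As a hedged sketch of that last step, the leading bias and variance terms derived in this appendix can be combined into a pointwise asymptotic mean squared error and minimized over $h$. The symbols $\mathrm{AMSE}$ and $h_{\mathrm{opt}}$ are introduced here for illustration only, and $X^T\alpha''(u_0) \neq 0$ is assumed; the bandwidth in (4) may additionally aggregate over genomic positions or treatment conditions, which this pointwise version does not attempt to reproduce.

```latex
\begin{align*}
\mathrm{AMSE}(h) &= \tfrac{1}{4} h^4 \mu_2(K)^2 \{X^T \alpha''(u_0)\}^2
                    + \frac{c R(K) \sigma^2}{k n_c h}, \\
\frac{d}{dh}\mathrm{AMSE}(h) &= h^3 \mu_2(K)^2 \{X^T \alpha''(u_0)\}^2
                    - \frac{c R(K) \sigma^2}{k n_c h^2} = 0, \\
h_{\mathrm{opt}}(u_0) &= \left[ \frac{c R(K) \sigma^2}
                    {k n_c\, \mu_2(K)^2 \{X^T \alpha''(u_0)\}^2} \right]^{1/5}.
\end{align*}
```

The resulting bandwidth is of order $(k n_c)^{-1/5}$, which satisfies condition (iv), since $h \to 0$ and $k n_c h \to \infty$ as $k n_c \to \infty$.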

Appendix B. Supplementary data

Supplementary data associated with this article can be found, in the online version, at doi:10.1016/j.csda.2009.09.035.

References

Barski, A., Cuddapah, S., Cui, K., et al., 2007. High-resolution profiling of histone methylations in the human genome. Cell 129, 823–837.
Benjamini, Y., Hochberg, Y., 1995. Controlling the false discovery rate: A practical and powerful approach to multiple testing. Journal of the Royal Statistical Society, Series B 57, 289–300.
Cai, Z., Fan, J., Yao, Q., 2000a. Functional-coefficient regression models for nonlinear time series. Journal of the American Statistical Association 95, 941–956.
Cai, Z., Fan, J., Yao, Q., 2000b. Efficient estimation and inferences for varying-coefficient models. Journal of the American Statistical Association 95, 888–902.
Cai, Z., Tiwari, R.C., 2000. Application of a local linear autoregressive model to BOD time series. Environmetrics 11, 341–350.
Cawley, S., Bekiranov, S., Ng, H.H., et al., 2004. Unbiased mapping of transcription factor binding sites along human chromosomes 21 and 22 points to widespread regulation of noncoding RNAs. Cell 116, 499–511.
Cleveland, W.S., Grosse, E., Shyu, W.M., 1991. Local regression models. In: Chambers, J.M., Hastie, T. (Eds.), Statistical Models in S. Wadsworth and Brooks/Cole, Pacific Grove.
The ENCODE Project Consortium, 2004. The ENCODE project. Science 306, 636–640.
The ENCODE Project Consortium, 2007. The ENCODE pilot project: Identification and analysis of functional elements in 1 percent of the human genome. Nature 447, 799–816.
Fan, J., Gijbels, I., 1996. Local Polynomial Modeling and Its Applications. Chapman & Hall, London.
Fan, J., Hu, T.-C., Truong, Y.K., 1994. Robust non-parametric function estimation. Scandinavian Journal of Statistics 21, 433–446.
Fan, J., Yao, Q., 1998. Efficient estimation of conditional variance functions in stochastic regression. Biometrika 85, 645–660.
Fan, J., Zhang, W., 1999. Statistical estimation in varying coefficient models. The Annals of Statistics 27, 1491–1518.
Ghosh, S., Hirsch, H.A., Sekinger, E., et al., 2006. Rank-statistics based enrichment-site prediction on chip experiments. BMC Bioinformatics 7, 434.
Gottardo, R., Li, W., Johnson, W.E., et al., 2008. A flexible and powerful Bayesian hierarchical model for ChIP–chip experiments. Biometrics 64, 468–478.
Härdle, W., Marron, J.S., 1995. Fast and simple scatterplot smoothing. Computational Statistics & Data Analysis 20, 1–17.
Hastie, T., Tibshirani, R., 1993. Varying-coefficient models. Journal of the Royal Statistical Society, Series B 55, 757–796.
Heintzman, N.D., Stuart, R.K., Hon, G., et al., 2007. Distinct and predictive chromatin signatures of transcriptional promoters and enhancers in the human genome. Nature Genetics 39, 311–318.
Ji, H., Wong, W., 2005. Tilemap: Create chromosomal map of tiling array hybridization. Bioinformatics 21 (18), 3629–3636.
Johnson, W.E., Li, W., Meyer, C.A., et al., 2006. Model-based analysis of tiling-arrays for ChIP–chip. Proceedings of the National Academy of Sciences USA 103, 12457–12462.
Kai, B., Li, R., Zou, H., 2008. Local CQR smoothing: An efficient and safe alternative to local polynomial regression. Journal of the Royal Statistical Society, Series B (in press).
Keles, S., van der Laan, M.J., Dudoit, S., et al., 2006. Multiple testing methods for ChIP–chip high density oligonucleotide array data. Journal of Computational Biology 13, 579–613.
Keles, S., 2007. Mixture modeling for genome-wide localization of transcription factors. Biometrics 63, 10–21.
Lee, T.H., Ullah, A., 2003. In: Giles, D. (Ed.), Computer Aided Econometrics. Marcel Dekker, New York.
Li, W., Meyer, C.A., Liu, X.S., 2005. A hidden Markov model for analyzing ChIP–chip experiments on genome tiling arrays and its application to p53 binding sequences. Bioinformatics 21, i275–i282.
Linton, O., Xiao, Z., 2007. A nonparametric regression estimator that adapts to error distribution of unknown form. Econometric Theory 23, 371–413.
Liu, R.Y., Singh, K., 1992. Moving blocks jackknife and bootstrap capture weak dependence. In: Lepage, R., Billard, L. (Eds.), Exploring the Limits of the Bootstrap. Wiley, New York.
Mielke, P.W., Berry, K.J., 2001. Permutation Methods, a Distance Function Approach. Springer, New York.
Ozsolak, F., Song, J.S., Liu, X.S., et al., 2007. High-throughput mapping of the chromatin structure of human promoters. Nature Biotechnology 25, 244–248.
Qu, A., Li, R., 2006. Quadratic inference functions for varying-coefficient models with longitudinal data. Biometrics 62, 379–391.
Ruppert, D., Sheather, S.J., Wand, M.P., 1995. An effective bandwidth selector for local least squares regression. Journal of the American Statistical Association 90, 1257–1270.
Ruppert, D., Wand, M.P., Holst, U., Hössjer, O., 1997. Local polynomial variance function estimation. Technometrics 39, 262–273.
Schones, D.E., Zhao, K., 2008. Genome-wide approaches to studying chromatin modifications. Nature Reviews Genetics 9, 179–191.
Wand, M.P., Jones, M.C., 1995. Kernel Smoothing. CRC, Boca Raton.
Weber, M., Davies, J.J., Wittig, D., et al., 2005. Chromosome-wide and promoter-specific analyses identify sites of differential DNA methylation in normal and transformed human cells. Nature Genetics 37, 853–862.
Welsh, A.H., 1996. Robust estimation of smooth regression and spread functions and their derivatives. Statistica Sinica 6, 347–366.
Wu, J., Smith, L.T., Plass, C., et al., 2006. ChIP–chip comes of age for genome-wide functional analysis. Cancer Research 66, 6899–6902.
Zheng, M., Barrera, L.O., Ren, B., et al., 2007. ChIP–chip: Data, model, and analysis. Biometrics 63, 787–796.