Hoeschele I.-[Article] a Note on Joint Versus Gene-specific Mixed Model Analysis of Microarray Gene Expression Data (2005)

Biostatistics (2005), 6, 2, pp. 183–186. doi:10.1093/biostatistics/kxi001

A note on joint versus gene-specific mixed model analysis of microarray gene expression data

INA HOESCHELE, HUA LI

Virginia Bioinformatics Institute and Department of Statistics, Virginia Tech, Blacksburg, VA 24061-0477, [email protected]

SUMMARY

Currently, linear mixed model analyses of expression microarray experiments are performed either in a gene-specific or global mode. The joint analysis provides more flexibility in terms of how parameters are fitted and estimated and tends to be more powerful than the gene-specific analysis. Here we show how to implement the gene-specific linear mixed model analysis as an exact algorithm for the joint linear mixed model analysis. The gene-specific algorithm is exact when the mixed model equations can be partitioned into unrelated components: one for all global fixed and random effects and the others for the gene-specific fixed and random effects for each gene separately. This unrelatedness holds under three conditions: (1) any gene must have the same number of replicates or probes on all arrays, but these numbers can differ among genes; (2) the residual variance of the (transformed) expression data must be homogeneous or constant across genes (other variance components need not be homogeneous) and (3) the number of genes in the experiment is large. When these conditions are violated, the gene-specific algorithm is expected to be nearly exact.

Keywords: Differential gene expression; Microarrays; Mixed model analysis.

1. INTRODUCTION

Microarray gene expression experiments produce expression profiles of thousands or tens of thousands of genes simultaneously. The focus of this paper is on the application of linear mixed model methodology to the detection and estimation of differential gene expression or, more generally, to the identification of which factors of interest influence the expression of which of the arrayed genes and to the estimation of their effects. Linear mixed model analysis (LMMA) of microarray data is either gene-specific (Wolfinger et al., 2001) or global (e.g. Kerr and Churchill, 2001), i.e. the analysis is performed separately for each gene or jointly for all genes, respectively. Gene-specific analyses can have low power, as noticed, e.g., by Wu et al. (2003) and Pfister-Genskow et al. (2004). Reasons for the lower power of gene-specific analyses, relative to the joint analysis, include the difference in degrees of freedom, joint versus gene-specific estimation of the error variance and other variance components, and the use of different contrasts.

Conditions for equivalence between gene-specific and joint analyses have been stated in previous contributions for the case of fixed linear models (e.g. Kerr, 2003; Wu et al., 2003). Here we discuss the use of the gene-specific analysis as an algorithm for the joint analysis under a mixed linear model.

To whom correspondence should be addressed.

© The Author 2005. Published by Oxford University Press. All rights reserved. For permissions, please e-mail: [email protected].


2. METHODS

We write a general linear mixed model for microarray data as follows:

$$y = X_1\beta_1 + Z_1u_1 + X_2\beta_2 + Z_2u_2 + e, \tag{2.1}$$

where $\beta_1$ and $u_1$ contain the effects of the fixed (e.g. treatment) and random global factors, respectively, $\beta_2$ and $u_2$ contain the effects of the fixed and random gene-specific factors, respectively, and $X_1$, $Z_1$ and $X_2$, $Z_2$ are design matrices for the global and gene-specific factors, respectively. Gene-specific effects include the gene main effects (fixed) and interactions between the global fixed and random factors with the gene factor. It is usually appropriate to treat array effects as random, so e.g. let $u_1 = A$ be the vector of random global array effects and let $u_2 = A \times G$ be the vector of random gene-specific array effects. For the mixed model analysis, we need to specify the variance-covariance structure of the random factors. We typically specify that $E(A) = 0$, $\mathrm{Var}(A) = G_{11} = I\sigma^2_A$, $E(A \times G) = 0$, $E(e) = 0$, $\mathrm{Var}(A \times G) = G_{22} = I\sigma^2_{AG}$ and $\mathrm{Var}(e) = R = I\sigma^2_e$, under the assumption that gene-specific variances are constant across genes, or $\mathrm{Var}(A \times G) = G_{22} = \bigoplus_{i=1}^{g} I_{n_a}\sigma^2_{AG(i)}$ and $\mathrm{Var}(e) = R = \bigoplus_{i=1}^{g} I_{n_i}\sigma^2_{e(i)}$, under the assumption that gene-specific variances differ among genes, where $n_a$ is the number of arrays, $n_i$ is the number of observations for gene $i$, $\sigma^2_A$ is the global array variance, and $\sigma^2_{AG}$ and $\sigma^2_e$ are the gene-specific array variance (or variance due to array-by-gene interaction) and the residual variance, respectively. Inferences about the unknown parameters (fixed effects and variance components) are obtained by utilizing the mixed model equations (MME) (Goldberger, 1962; Henderson, 1963). The MME for the uncentered mixed model in (2.1) are

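$$
\begin{bmatrix}
X_1'R^{-1}X_1 & X_1'R^{-1}X_2 & X_1'R^{-1}Z_1 & X_1'R^{-1}Z_2 \\
X_2'R^{-1}X_1 & X_2'R^{-1}X_2 & X_2'R^{-1}Z_1 & X_2'R^{-1}Z_2 \\
Z_1'R^{-1}X_1 & Z_1'R^{-1}X_2 & Z_1'R^{-1}Z_1 + G^{11} & Z_1'R^{-1}Z_2 + G^{12} \\
Z_2'R^{-1}X_1 & Z_2'R^{-1}X_2 & Z_2'R^{-1}Z_1 + G^{21} & Z_2'R^{-1}Z_2 + G^{22}
\end{bmatrix}
\begin{bmatrix} \beta_1 \\ \beta_2 \\ u_1 \\ u_2 \end{bmatrix}
=
\begin{bmatrix} X_1'R^{-1}y \\ X_2'R^{-1}y \\ Z_1'R^{-1}y \\ Z_2'R^{-1}y \end{bmatrix},
\tag{2.2}
$$

where

$$
G^{-1} =
\begin{bmatrix} G_{11} & G_{12} \\ G_{21} & G_{22} \end{bmatrix}^{-1}
=
\begin{bmatrix} G^{11} & G^{12} \\ G^{21} & G^{22} \end{bmatrix},
$$

and $G_{12} = \mathrm{Cov}(A, A \times G) = 0$. Given known values of the variance components, $\beta$-solutions to the MME are best linear unbiased estimates and $u$-solutions are best linear unbiased predictions.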
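As an illustration of (2.2), the following minimal NumPy sketch builds the MME for a small artificial layout and checks that their solutions agree with the generalized least squares estimate of the fixed effects and the best linear unbiased predictions computed from the marginal covariance matrix V = ZGZ' + R. The layout (4 arrays, 3 genes, 2 spots per gene on each array, treatment assigned to whole arrays), the variance component values and all variable names are assumptions made for this example and are not taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical layout: 4 arrays x 3 genes x 2 replicate spots.
n_a, n_g, n_rep = 4, 3, 2
array, gene, _ = np.meshgrid(np.arange(n_a), np.arange(n_g),
                             np.arange(n_rep), indexing="ij")
array, gene = array.ravel(), gene.ravel()
trt = (array >= n_a // 2).astype(float)       # arrays 2 and 3 are "treated"

def dummies(codes, n_levels):
    """Indicator (0/1) matrix with one column per level."""
    return np.eye(n_levels)[codes]

# Global part: intercept and treatment (fixed), array (random).
X1 = np.column_stack([np.ones_like(trt), trt])
Z1 = dummies(array, n_a)
# Gene-specific part: gene and gene-by-treatment (fixed, first gene as
# reference level), array-by-gene (random).
Gd = dummies(gene, n_g)[:, 1:]
X2 = np.column_stack([Gd, Gd * trt[:, None]])
Z2 = dummies(array * n_g + gene, n_a * n_g)

X, Z = np.hstack([X1, X2]), np.hstack([Z1, Z2])

# Assumed (known) variance components; G12 = 0, so G is diagonal here.
s2_A, s2_AG, s2_e = 0.5, 0.3, 1.0
G = np.diag(np.r_[np.full(n_a, s2_A), np.full(n_a * n_g, s2_AG)])
R = s2_e * np.eye(len(trt))
y = rng.normal(size=len(trt))                 # arbitrary data vector

# Mixed model equations, ordered as (beta_1, beta_2, u_1, u_2).
Rinv, Ginv = np.linalg.inv(R), np.linalg.inv(G)
C = np.block([[X.T @ Rinv @ X, X.T @ Rinv @ Z],
              [Z.T @ Rinv @ X, Z.T @ Rinv @ Z + Ginv]])
rhs = np.concatenate([X.T @ Rinv @ y, Z.T @ Rinv @ y])
sol = np.linalg.solve(C, rhs)
beta_mme, u_mme = sol[:X.shape[1]], sol[X.shape[1]:]

# Reference solution via the marginal covariance V = Z G Z' + R.
Vinv = np.linalg.inv(Z @ G @ Z.T + R)
beta_gls = np.linalg.solve(X.T @ Vinv @ X, X.T @ Vinv @ y)
u_blup = G @ Z.T @ Vinv @ (y - X @ beta_gls)

print(np.allclose(beta_mme, beta_gls), np.allclose(u_mme, u_blup))
```

The MME route avoids inverting the full n-by-n matrix V, which is what makes it attractive when the number of observations is large.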

Wolfinger et al. (2001) proposed to implement the gene-specific analysis with a two-step procedure. First, the data $y$ are analyzed with a sub-model (normalization model) containing only global effects ($\beta_1$ and $u_1$), and then the predicted residuals from this model are analyzed with the second sub-model (gene-model) containing only gene-specific effects ($\beta_2$ and $u_2$). Fitting model (2.1), with centering of the columns of $X_2$ and $Z_2$ (Rencher, 2000) or without $\beta_2$ and $u_2$, implies a reparameterization from $\beta_1$ to $\tilde\beta_1$ and $u_1$ to $\tilde u_1$, where an element in $\tilde\beta_1$ or $\tilde u_1$ now contains the average of the interactions of this factor with the gene factor ($G$). We denote the centered matrices by $\tilde X_2$ and $\tilde Z_2$. After centering, or equivalently, after fitting a mixed (normalization) model containing only global factors, the vector of global array effects becomes $\tilde A$ with elements $\tilde A_k = A_k + \overline{(A \times G)}_{k\cdot}$ and with $\mathrm{Var}(\tilde A) = \tilde G_{11} = I\sigma^2_{\tilde A}$ and $\mathrm{Cov}(\tilde A, A \times G) = \tilde G_{12}$. The array variance component now is $\sigma^2_{\tilde A} = \mathrm{Var}(\tilde A_k) = \mathrm{Var}(A_k + \overline{(A \times G)}_{k\cdot})$, which in the simplest case of equal replicate or probe numbers across all genes simplifies to $\sigma^2_{\tilde A} = \sigma^2_A + \frac{1}{n_G}\sigma^2_{AG}$, where $n_G$ is the number of genes. Matrix $\tilde G_{12}$ has non-zero elements between any $\tilde A_k$ and $(A \times G)_{kg}$ for all $g$ and a given $k$, which in the simplest case are equal to $\frac{1}{n_G}\sigma^2_{AG}$. The MME for the centered mixed model are obtained by replacing $X_2$ and $Z_2$ by $\tilde X_2$ and $\tilde Z_2$ and replacing $G^{-1}$ with

$$
\tilde G^{-1} =
\begin{bmatrix} \tilde G_{11} & \tilde G_{12} \\ \tilde G_{21} & \tilde G_{22} \end{bmatrix}^{-1}
=
\begin{bmatrix} \tilde G^{11} & \tilde G^{12} \\ \tilde G^{21} & \tilde G^{22} \end{bmatrix}.
$$
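The two simplest-case expressions above (the variance of a reparameterized array effect, and its covariance with an array-by-gene interaction on the same array) can be checked numerically. The sketch below is only a sanity check under stated assumptions: equal replicate numbers, homogeneous variance components with arbitrarily chosen values, interactions stored in array-major order, and the interaction effects themselves left uncentered. The names `M`, `T` and `G_tilde` are introduced here for illustration.

```python
import numpy as np

n_a, n_G = 3, 5                    # hypothetical numbers of arrays and genes
s2_A, s2_AG = 0.5, 0.3             # assumed variance components

# Covariance of the stacked vector (A, AxG) in the uncentered model: G12 = 0.
G = np.diag(np.r_[np.full(n_a, s2_A), np.full(n_a * n_G, s2_AG)])

# Row k of M averages the n_G interactions (AxG)_{kg} belonging to array k.
M = np.kron(np.eye(n_a), np.full((1, n_G), 1.0 / n_G))

# Linear map (A, AxG) -> (A + M (AxG), AxG), i.e. the reparameterized array
# effects stacked on top of the unchanged interaction effects.
T = np.block([[np.eye(n_a), M],
              [np.zeros((n_a * n_G, n_a)), np.eye(n_a * n_G)]])
G_tilde = T @ G @ T.T

G11_tilde = G_tilde[:n_a, :n_a]    # Var of reparameterized array effects
G12_tilde = G_tilde[:n_a, n_a:]    # Cov with the interaction effects

print(G11_tilde[0, 0], s2_A + s2_AG / n_G)
print(G12_tilde[0, :n_G], s2_AG / n_G)
```

Both the additional variance term and the covariance are equal to the interaction variance divided by the number of genes, so they shrink as the number of genes grows. With the 2640 cDNAs of the real data set described in Section 3, for example, the divisor would be 2640.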


It can be shown that, in the centered model, there are non-zero elements in matrix $\tilde G_{12}$ between $\tilde A_k$ and $(A \times G)_{kg}$ for all $g$, and that there are non-zero elements in matrix $\tilde G_{22}$ between $(A \times G)_{kg}$ and $(A \times G)_{kg'}$ for all $g, g'$ with $g \neq g'$. Therefore, the off-diagonal block in the coefficient matrix of the centered MME between $u_1 = \tilde A$ and $u_2 = A \times G$ is no longer strictly zero, and the same holds for off-diagonal blocks between any sub-vectors of $u_2$ corresponding to two different genes. As a consequence, the global and the gene-specific MME are not exactly unrelated, so that the gene-specific analysis is no longer an exact algorithm to perform the joint analysis. The degree of approximation approaches exactness when the number of genes becomes large, so that $\tilde G$ approaches $G$. For the centered MME, all off-diagonal sub-matrices of the coefficient matrix in (2.2) involving global and gene-specific design matrices (e.g. $X_1'R^{-1}\tilde X_2$) are zero when $R = I\sigma^2_e$, but are not (exactly) zero for other $R$. Separating the MME (2.2) into $n_G + 1$ (approximately) unrelated components greatly improves the efficiency of LMMA with joint estimation of global and gene-specific variances.
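The statement about the cross-product blocks between global and gene-specific design matrices can be illustrated with a small numerical experiment. The sketch below is a hedged illustration, not the authors' implementation. It assumes one particular way of centering, namely subtracting from each gene indicator column its overall mean and multiplying by the treatment indicator for the interaction columns. The helper names `design` and `max_cross`, the replicate layouts and the variance values are all hypothetical.

```python
import numpy as np

def design(reps):
    """reps[k][g] = number of spots of gene g on array k (assumed layout)."""
    reps = np.asarray(reps)
    n_a, n_G = reps.shape
    counts = reps.ravel()                                  # array-major order
    array = np.repeat(np.repeat(np.arange(n_a), n_G), counts)
    gene = np.repeat(np.tile(np.arange(n_G), n_a), counts)
    trt = (array >= n_a // 2).astype(float)                # second half treated
    # Global design: intercept, treatment, and array indicators.
    W = np.column_stack([np.ones_like(trt), trt, np.eye(n_a)[array]])
    # Gene-specific design, centered on the overall gene frequencies.
    D = np.eye(n_G)[gene]
    Dc = D - D.mean(axis=0)
    X2c = np.column_stack([Dc, Dc * trt[:, None]])
    return W, X2c, gene

def max_cross(reps, s2_e):
    """Largest absolute entry of W' R^{-1} X2c, with R diagonal by gene."""
    W, X2c, gene = design(reps)
    Rinv = np.diag(1.0 / np.asarray(s2_e, dtype=float)[gene])
    return np.abs(W.T @ Rinv @ X2c).max()

balanced = [[1, 2, 3]] * 4            # same replicates per gene on all arrays
unbalanced = [[1, 2, 3]] * 3 + [[2, 2, 3]]

ok = max_cross(balanced, [1.0, 1.0, 1.0])       # conditions (1) and (2) hold
het = max_cross(balanced, [1.0, 2.0, 4.0])      # heterogeneous residual variance
unb = max_cross(unbalanced, [1.0, 1.0, 1.0])    # unequal replicates across arrays
print(ok, het, unb)
```

In this toy layout the cross-product block vanishes (up to rounding) only in the first case. Replicate numbers that differ among genes, as in `balanced`, do not break the separation as long as they are the same on every array.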

3. RESULTS

The gene-specific algorithm for the joint analysis is illustrated with a worked example using a small, artificial data set and fixed and mixed models, which is presented as supplementary data available at Biostatistics online. We also performed the joint analysis on a larger real data set and on data sets simulated with structure and parameters similar to those of the real data. The real data included 2640 cDNAs printed in duplicate on 100 arrays, two cell populations and 10 biological replicates from each cell population [the analysis of this data set is described in detail by Pfister-Genskow et al. (2004)]. We analyzed the real and simulated data with the ASReml software (Gilmour et al., 2002) and with our gene-specific algorithm for the joint analysis and confirmed that the estimates of the variance components and the test statistics were identical up to numerical accuracy [differences in the results between the joint analysis and the original gene-specific analysis of Wolfinger et al. (2001) were reported in Pfister-Genskow et al. (2004)].

4. DISCUSSION

Joint analysis of the expression data on all genes in a microarray experiment has multiple advantages over the common practice of analyzing the data on individual genes separately: (1) There is great flexibility in how the variance-covariance structure of the data is modeled, and different formulations can be compared (homogeneous within-gene variance components, heterogeneous within-gene variance components estimated by shrinkage methods or by grouping of genes, etc.). (2) It allows us to evaluate the significance of global treatment and technical factors and to evaluate gene-specific treatment contrasts that include main effects (Black and Doerge, 2002). (3) As pointed out by Kerr (2003), joint analysis also permits model evaluation and residual analysis. We note, however, that for linear regression models there is well-established theory on residuals and influence, and outlier detection statistics or deletion diagnostics are commonly available in regression software packages, whereas residual analysis and deletion diagnostics are not yet performed routinely in mixed model analysis and are still underdeveloped for mixed models despite recent progress (e.g. Haslett and Dillane, 2004).

We do not consider the joint and gene-specific analyses as alternative methods, but we view the gene-specific analysis merely as an efficient algorithm for performing the joint analysis. We have shown that the gene-specific algorithm provides (essentially) exact inferences for the joint LMMA under three conditions: (1) equal number of replicate spots or probes for any gene across all arrays (these numbers can differ among genes), (2) homogeneous residual variance across genes (other variance components may be heterogeneous) and (3) sufficiently large numbers of genes in the experiment (this condition is needed for mixed but not for fixed models).


ACKNOWLEDGMENTS

This work was supported by NSF Plant Genome Cooperative Agreement DBI-0211863.

REFERENCES

BLACK, M. A. AND DOERGE, R. W. (2002). Calculation of the minimum number of replicate spots required for detection of significant gene expression fold change in microarray experiments. Bioinformatics 18, 1609–1616.

GILMOUR, A. R., CULLIS, B. R., WELHAM, S. J. AND THOMPSON, R. (2002). ASREML Reference Manual. NSW, Australia: NSW Agriculture, Orange Agricultural Institute.

GOLDBERGER, A. S. (1962). Best linear unbiased prediction in the generalized linear regression model. Journal of the American Statistical Association 57, 369–375.

HASLETT, J. AND DILLANE, D. (2004). Application of delete = replace to deletion diagnostics for variance component estimation in the linear mixed model. Journal of the Royal Statistical Society, Series B 66, 131–143.

HENDERSON, C. R. (1963). Selection index and expected genetic advance. In Statistical Genetics and Plant Breeding (NRC publication 982). Washington, DC: National Academy of Sciences, pp. 141–163.

KERR, M. K. (2003). Linear models for microarray data analysis: hidden similarities and differences. Journal of Computational Biology 10, 891–901.

KERR, M. K. AND CHURCHILL, G. (2001). Experimental design for gene expression microarrays. Biostatistics 2, 183–201.

PFISTER-GENSKOW, M., CHILDS, L., MYERS, C., LACSON, J., BETTHAUSER, J., GOULEKE, P., ZHENG, Y., LENO, G., FORSBERG, E., YANG, X., HOESCHELE, I. AND EILERTSEN, K. J. (2004). Analysis of individual pre-implantation cattle embryos using cDNA microarrays: comparison of NT and IVF produced embryos using an interwoven loop design and linear mixed model analysis. Biology of Reproduction 72. In press.

RENCHER, A. C. (2000). Linear Models in Statistics. New York: John Wiley & Sons.

WOLFINGER, R. D., GIBSON, G., WOLFINGER, E. D., BENNETT, L., HAMADEH, H., BUSHEL, P., AFSHARI, C. AND PAULES, R. S. (2001). Assessing gene significance from cDNA microarray expression data via mixed models. Journal of Computational Biology 8, 625–637.

WU, H., KERR, M. K., CUI, X. Q. AND CHURCHILL, G. A. (2003). MAANOVA: A Software Package for the Analysis of Spotted cDNA Microarray Experiments. The Analysis of Gene Expression Data: Methods and Software. New York: Springer. http://www.jax.org/research/churchill/.

[Received February 18, 2004; first revision June 17, 2004; second revision September 14, 2004; accepted for publication October 11, 2004]
