SCALABLE ESTIMATION AND INFERENCE FOR MASSIVE LINEAR MIXED
MODELS WITH CROSSED RANDOM EFFECTS
A DISSERTATION
SUBMITTED TO THE DEPARTMENT OF STATISTICS
AND THE COMMITTEE ON GRADUATE STUDIES
OF STANFORD UNIVERSITY
IN PARTIAL FULFILLMENT OF THE REQUIREMENTS
FOR THE DEGREE OF
DOCTOR OF PHILOSOPHY
Katelyn Gao
June 2017
Abstract
With modern electronic activity, large crossed data sets are increasingly common, with factors such
as users and items. It is often appropriate to model them with crossed random effects, since specific
levels are temporary. Their size provides challenges for statistical analysis. For such large data sets,
the computational costs of estimation and inference (time, space, and communication) should grow
at most linearly with the sample size and the algorithms should be parallelizable.
Both traditional maximum likelihood estimation and numerous Markov chain Monte Carlo
Bayesian algorithms take superlinear time in order to obtain good parameter estimates in the sim-
ple two-factor crossed random effects model. We propose moment based, parallelizable algorithms
that, with at most linear cost, estimate variance components and measure the uncertainties of those
estimates. These estimates are consistent and, proven using a martingale central limit theorem de-
composition of U -statistics, asymptotically Gaussian. When run on simulated normally distributed
data, our algorithm performs competitively with maximum likelihood methods.
Maximum likelihood and Bayesian approaches run into similar challenges for the linear mixed
model with crossed random effects. We propose a method of moments based algorithm that scales
linearly and can be easily parallelized at the expense of some loss of statistical efficiency. This
algorithm is proven to give estimates that are consistent and asymptotically Gaussian. We apply
the algorithm to some data from Stitch Fix where the crossed random effects correspond to clients
and items. The random effects analysis is able to account for the increased variance due to intra-
client and intra-item correlations in the data. Ignoring the correlation structure can lead to standard
error underestimates of over 10-fold for that data. As in the crossed random effects model, when
run on simulated normally distributed data, our algorithm performs nearly as well as maximum
likelihood.
Preface and Acknowledgements
Since the beginning of my PhD, I have been interested in statistical methods for massive data sets.
I was excited when Brad Klingenberg from Stitch Fix, an online personal styling service, suggested
this project to my advisor Art Owen and me. He said that the data analytics team at Stitch Fix
wanted to fit mixed models to their data sets, but existing methodology and software could not scale
to data sets with millions of observations. Because such large data sets are common in e-commerce, he wanted to see whether there was an efficient and scalable way to fit mixed models and perform inference for the resulting estimates.
My deepest gratitude goes to my advisor, Prof. Art Owen. He introduced this project to me, and
provided invaluable guidance and support the past four years. Moreover, he taught me how to think
about statistical problems, and being part of his research group exposed me to many fascinating
problems and areas of study. He was unfailingly kind, encouraging, and generous with his time.
Many thanks go to the members of my reading committee, Prof. Robert Tibshirani and Prof.
Lester Mackey, and the other members of my oral committee, Prof. Jerome Friedman and Prof.
Sharad Goel, who have provided valuable suggestions for the completion of this thesis. I would like
to thank Prof. Norm Matloff from UC Davis and Prof. Trevor Hastie for fruitful discussions during
the course of this project. Lastly, I would also like to thank Brad Klingenberg and Stitch Fix for
providing me with the motivation and data for this project.
I would like to express my appreciation to the faculty, staff, and other students of the Stanford
statistics department, who have made the past five years very enjoyable. I am especially grateful
that I have had the opportunity to learn from some of the best statisticians in the world. Finally, this
work was supported under the National Science Foundation grants DGE-114747 and DMS-1407397.
I dedicate this thesis to my parents, who have made many sacrifices so that I could have the best
education possible.
Contents
Abstract iv
Preface and Acknowledgements v
1 Introduction 1
1.1 Motivation and Problem Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.2 Data Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
1.2.1 Goals of Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
1.3 Outline . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
2 Crossed Random Effects Model 6
2.1 Model Description and Our Contributions . . . . . . . . . . . . . . . . . . . . . . . . 6
2.1.1 Notation and Asymptotic Regime . . . . . . . . . . . . . . . . . . . . . . . . . 7
2.2 Previous Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
2.2.1 Moment-based Algorithms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
2.3 Bayesian Analysis via MCMC . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
2.3.1 Implications from Probability Theory . . . . . . . . . . . . . . . . . . . . . . 12
2.3.2 Simulation Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
2.4 Method of Moments Estimation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
2.4.1 U -statistics for variance components . . . . . . . . . . . . . . . . . . . . . . . 15
2.4.2 Estimator Variances . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
2.4.3 Asymptotic Normality of Estimates . . . . . . . . . . . . . . . . . . . . . . . 25
2.5 Experimental Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
2.5.1 Simulations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
2.5.2 Examples from Social Media . . . . . . . . . . . . . . . . . . . . . . . . . . . 31
3 Linear Mixed Model with Crossed Random Effects 33
3.1 Additional Notation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34
3.2 Previous Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34
3.2.1 Moment-based Approaches . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34
3.2.2 Likelihood-based Approaches . . . . . . . . . . . . . . . . . . . . . . . . . . . 37
3.2.3 Challenges of Likelihood-based Approaches . . . . . . . . . . . . . . . . . . . 38
3.3 Alternating Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39
3.3.1 Estimating Variance Components . . . . . . . . . . . . . . . . . . . . . . . . . 40
3.3.2 Scalable and More Efficient Regression Coefficient Estimators . . . . . . . . . 41
3.4 Theoretical Properties of Estimates . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44
3.4.1 Consistency . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44
3.4.2 Asymptotic Normality . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46
3.5 Experimental Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49
3.5.1 Simulations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49
3.5.2 An Example from E-commerce . . . . . . . . . . . . . . . . . . . . . . . . . . 51
4 Conclusions 59
4.1 Practical Considerations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 59
4.2 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 60
4.3 Future Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61
4.3.1 Heteroscedasticity . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61
4.3.2 Factor Interactions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61
4.3.3 Generalized Linear Models . . . . . . . . . . . . . . . . . . . . . . . . . . . . 62
A Crossed Random Effects Model 63
A.1 Bayesian Analysis via MCMC . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 63
A.1.1 Proof of Theorem 2.3.1 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 63
A.1.2 Simulation results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 64
A.2 Expectation and Variance of Moment Estimates . . . . . . . . . . . . . . . . . . . . . 64
A.2.1 Weighted U -statistics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 70
A.2.2 Expected Values . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 71
A.2.3 Variances . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 72
A.2.4 Variance of Ua . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 75
A.2.5 Variance of Ue . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 77
A.2.6 Covariance of Ua and Ub . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 81
A.2.7 Covariance of Ua and Ue . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 83
A.2.8 Estimating Kurtoses . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 86
A.3 Asymptotic Normality of Moment Estimates . . . . . . . . . . . . . . . . . . . . . . 87
A.3.1 Proof of Theorem 2.4.3 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 87
A.3.2 Asymptotic approximation: proof of Theorem 2.4.4 . . . . . . . . . . . . . . . 89
A.3.3 Proof of Lemma 2.4.2 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 92
A.3.4 Proof of Theorem 2.4.6 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 94
B Linear Mixed Model with Crossed Random Effects 111
B.1 Estimator Efficiencies . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 111
B.2 A Useful Lemma . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 112
B.3 Consistency of Estimates . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 113
B.3.1 Proof of Theorem 3.4.1 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 113
B.3.2 Proof of Theorem 3.4.2 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 113
B.3.3 Proof of Theorem 3.4.3 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 116
B.4 Asymptotic Normality of Estimates . . . . . . . . . . . . . . . . . . . . . . . . . . . . 118
B.4.1 Proof of Theorem 3.4.4 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 118
B.4.2 Proof of Lemma 3.4.1 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 121
B.4.3 Proof of Theorem 3.4.5 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 122
List of Tables
2.1 Summary of simulation results for cases with R = C = 1000. The first row gives CPU
time in seconds. The next four rows give median estimates of the 4 parameters. The
next four rows give the number of lags required to get an autocorrelation below 0.5. 14
3.1 Stitch Fix Regression Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 58
A.1 Median CPU time in seconds. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 65
A.2 Median estimates of µ and number of lags after which ACF(µ) ≤ 0.5. . . . . . . . . . 66
A.3 Median estimates of σ²A and number of lags after which ACF(σ²A) ≤ 0.5. . . . . . . 67
A.4 Median estimates of σ²B and number of lags after which ACF(σ²B) ≤ 0.5. . . . . . . 68
A.5 Median estimates of σ²E and number of lags after which ACF(σ²E) ≤ 0.5. . . . . . . 69
List of Figures
2.1 Schematic of our algorithm. The expressions in the smallest boxes are the values
computed at each step. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23
2.2 Simulation results: log-log plots of the five recorded measurements against the number
of observations. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30
3.1 For p = 5, log-log plot of the computation time per iteration for maximum likelihood. 51
3.2 For each value of p, log-log plots of the computation times of the two algorithms
against N . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52
3.3 For each value of p, log-log plots of the (scaled by p) mean squared error of β against
N . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53
3.4 For each value of p, log-log plots of the mean squared error of σ2A against N . . . . . 54
3.5 For each value of p, log-log plots of the mean squared error of σ2B against N . . . . . 55
3.6 For each value of p, log-log plots of the mean squared error of σ2E against N . . . . . 56
3.7 For each subset of data, estimates and confidence intervals for Bohemian-related vari-
ables of interest. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57
Chapter 1
Introduction
Driven by the growth of the Internet in the last twenty years, the amount of data available in the
world has grown exponentially. According to a BBC article from 2014 [48], 2.5 billion gigabytes of
data were being generated every day in 2012; that number could only have increased since then.
Numerous hardware platforms, software frameworks, and algorithms have been devised to deal with these massive data
sets. Graphics processing units (GPUs) have been very successful at processing image data in parallel
and have become the default for many deep learning applications. Frameworks such as Hadoop [2]
and Spark [3] facilitate large-scale computing over clusters of machines. For data analysis, streaming
algorithms, algorithms that only require a few passes over the data set at hand, have been invented.
In the field of statistics, we are primarily concerned with the algorithmic (analytic) aspect. That
is, how can we construct statistical procedures that scale to such massive data sets and that still
have theoretical guarantees, enabling us to efficiently carry out the traditional statistical tasks of
estimation, inference, and prediction?
Because the amount of available data often outstrips the available compute power, as was the case over a hundred years ago when most computations had to be done by hand, many old statistical techniques have regained popularity. In this thesis, we apply two such techniques, the method of
moments and least squares, to the analysis of massive data sets from e-commerce. Details of our
motivation and statistical problem are given next.
1.1 Motivation and Problem Overview
The Internet has enabled transactions that once required face-to-face interaction to be made in
front of a computer screen. People can now purchase anything, including food, with a mouse click,
and have access to all kinds of information. Every time they enter into an online transaction or
visit a website, data about them is collected, such as their IP address, queries, pages visited, items
purchased, and total time spent.
The data sets are massive, easily having millions of observations. More importantly for us, they
have an unbalanced crossed factor structure. The factors are users, queries, URLs, items, etc. At
the intersection of levels for each factor, data of interest and auxiliary data are collected. Yet, these
electronic data sets are sparse in the sense that we only have observations at a small fraction of the
intersections of levels for each factor.
The inspiration for our project originated from e-commerce, in particular, Stitch Fix [1], a per-
sonal styling service for clothing and accessories primarily geared towards women. Clients answer
questions about their style preferences, stylists pick out items to send to the clients, and the clients
choose to keep or return the items. Stitch Fix collects data on clients’ opinions of items they re-
ceived, as well as information about the clients and items themselves. The factors are clients and
items, some examples of data of interest include the client’s rating of the item, and some examples of
auxiliary data are item material and the client’s style preferences. The analysis goal is to understand
which characteristics of items or users influence customer satisfaction. This would help to inform
future selections by stylists, as they would only send items for which they believe the client
has a fairly high probability of keeping. For instance, if certain materials are significantly negatively
associated with the rating, stylists should be more selective about sending clothes made with those
materials.
As in the above example, in this thesis we focus on understanding how the auxiliary data affect
the data of interest. For convenience, we also assume that the data of interest is continuous. We
choose to model the factors as random and not fixed, even though fixed factors are in some sense
easier to deal with, for two reasons. First, the goal is to discover some behavior patterns of the
entire population and not of a specific person. Second, in these electronic data sets, the factor levels
are evanescent, making random factors more realistic. Clothing items rise and fade in popularity,
clients turn over, and URLs are created and destroyed.
Therefore, we describe the relationship between the data of interest and the auxiliary data using
a linear mixed model, where each random effect corresponds to a factor.
1.2 Data Model
For simplicity, we assume, as in the Stitch Fix example, that there are only two factors. We model
the relationship between the data of interest and the auxiliary data as follows:
Model 1. Two-factor linear mixed model

$$Y_{ij} = x_{ij}^{\mathsf{T}}\beta + a_i + b_j + e_{ij}, \qquad x_{ij} \in \mathbb{R}^p,\ i, j \in \mathbb{N},$$
where, independently,
$$a_i \stackrel{\mathrm{iid}}{\sim} (0, \sigma_A^2), \qquad b_j \stackrel{\mathrm{iid}}{\sim} (0, \sigma_B^2), \qquad e_{ij} \stackrel{\mathrm{iid}}{\sim} (0, \sigma_E^2),$$
and
$$\mathrm{E}(a_i^4) < \infty, \qquad \mathrm{E}(b_j^4) < \infty, \qquad \mathrm{E}(e_{ij}^4) < \infty, \tag{1.1}$$
where i indicates the level of the first factor and j the level of the second factor.
For convenience, let us call the i’s rows and the j’s columns. Yij is the data of interest, and xij
is the auxiliary data, which contains p predictors. Note that we can have at most one observation
in each cell (i, j). The random effect at row i is ai, the random effect at column j is bj , and noise
is eij . For example, for Stitch Fix, ai could be the effect of a particular client and bj the effect of a
specific item.
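To make the model concrete, here is a minimal simulation sketch of model (1.1) on a given set of observed cells. The function and variable names are our own illustration, and the effects are drawn from Gaussian distributions purely for convenience; the model itself places no parametric assumption on them.

```python
import numpy as np

def simulate_model1(cells, p, beta, sig2_a, sig2_b, sig2_e, seed=None):
    """Draw data from Y_ij = x_ij' beta + a_i + b_j + e_ij over the
    observed cells (i, j). Gaussian draws are used for convenience only;
    model (1.1) makes no parametric assumption on the effects."""
    rng = np.random.default_rng(seed)
    a = {i: rng.normal(0.0, np.sqrt(sig2_a)) for i in {i for i, _ in cells}}
    b = {j: rng.normal(0.0, np.sqrt(sig2_b)) for j in {j for _, j in cells}}
    data = []
    for i, j in cells:
        x = rng.normal(size=p)                 # auxiliary data x_ij
        e = rng.normal(0.0, np.sqrt(sig2_e))   # noise e_ij
        data.append((i, j, float(x @ beta) + a[i] + b[j] + e, x))
    return data

demo = simulate_model1([(0, 0), (0, 1), (1, 0)], p=2,
                       beta=np.array([1.0, -1.0]),
                       sig2_a=1.0, sig2_b=1.0, sig2_e=0.5, seed=0)
```

Note that, as in the model, each level of each factor draws its effect exactly once, however many cells it appears in.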
The variance of the random effect of the first factor is σ²A, that of the random effect of the second factor is σ²B, and that of the noise is σ²E. Because the random effects and noise have finite fourth moments, we can also define their excess kurtoses κA, κB, and κE. Most
importantly, we do not make any parametric assumptions about the random effects or noise. This
is relevant for Internet applications, in which non-normal data distributions often appear.
In practice, we only observe Yij and xij in N cells (i, j), where 1 ≤ N < ∞. However, as we
expect future data to bring hitherto unseen levels, a countable index set is appropriate. We consider
the regime where p is much smaller than and does not increase with N . Let R denote the number
of distinct i for which we observe data and C denote the number of distinct j for which we observe
data. The unseen (Yij, xij) are taken to be missing at random, and our results are conditional on the observed responses Yij and predictors xij, that is, the cells with Zij = 1 (the observation indicators Zij are defined in Section 2.1.1).
In this thesis, we assume that the missingness pattern is not informative. But in many appli-
cations the observed values are likely to differ in some way from the missing values. For instance,
in movie ratings data people may be more likely to watch and rate movies they believe they will
like, and so missing ratings could be lower on average than observed ones. In general, the observed
ratings may have both high and low values oversampled relative to middling values.
From observed values alone we cannot tell how different the missing values would be. To do
so requires making untestable assumptions about the missingness mechanism. Even in cases where
followup sampling can be made, e.g., giving some users incentives to make additional ratings, there
will still be difficulties such as users refusing to make those ratings, or if forced, making inaccurate
ratings. Methods to adjust for missingness have to be designed on a case by case basis, using
whatever additional data and assumptions can be brought to bear.
We end with a brief discussion on how such massive data sets with crossed factor structure are
handled. There are various possible data storage models. We consider the log-file model with a
collection of (i, j, Yij , xij) tuples, which for the purposes of this thesis we assume are stored at the same location. A pass over the data proceeds via an iteration over all (i, j, Yij , xij) tuples in the
data set. Such a pass may generate intermediate values that we assume can be retained for further
computations.
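As a small illustration of this access pattern, the sketch below (our own naming, not code from the thesis) makes one pass over log-file records and retains per-row and per-column tallies as intermediate values:

```python
from collections import defaultdict

def one_pass(records):
    """One pass over log-file records (i, j, Y_ij, x_ij), retaining
    per-row and per-column counts and response sums as the intermediate
    values mentioned in the text."""
    n_row, n_col = defaultdict(int), defaultdict(int)
    sum_row, sum_col = defaultdict(float), defaultdict(float)
    n = 0
    for i, j, y, _x in records:
        n += 1
        n_row[i] += 1      # N_i. : observations in row i
        n_col[j] += 1      # N_.j : observations in column j
        sum_row[i] += y
        sum_col[j] += y
    return n, n_row, n_col, sum_row, sum_col
```

Each accumulator takes O(R + C) space, so the pass itself costs O(N) time regardless of how the cells are ordered in the log file.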
1.2.1 Goals of Analysis
As hinted in Section 1.1, in this thesis we focus on the problems of parameter estimation and infer-
ence. Because we are dealing with massive data sets (at least on the order of a million observations),
we would like to design algorithms that are computationally efficient and scale to big data.
More concretely, with reference to model (1), we would like to estimate β, σ²A, σ²B, and σ²E and
construct confidence intervals for those estimates using easily parallelizable algorithms that require
at most O(N) computation and O(N) space. Ideally, these algorithms should only require a few
passes over the data and should also be simple to analyze.
To guarantee good statistical behavior, the estimates should be consistent. To construct con-
fidence intervals, we need to find the variance of the estimates and prove that the estimates are
asymptotically normal. Unfortunately, the exact variances of our estimates cannot be computed
in O(N) time. Therefore, we may either compute the asymptotic variances of the estimates or a
somewhat conservative approximation to them. Note that because of the crossed factor structure,
the usual definition of asymptotics as being N →∞ is not enough; we also need to control how the
observed data are spread over the levels of each factor. A detailed explanation of the asymptotic
regime used in this thesis is given in Section 2.4.3.
1.3 Outline
The remainder of this thesis is organized as follows. We consider two models of increasing complexity.
In Chapter 2, we study the simplest linear model with crossed random effects, the crossed random
effects model with two random factors. In this case, there is only an intercept. We concentrate on
the tasks of estimation and inference for the variance components. First, we review some classical
estimators. Next, we show through theory and simulations that a Bayesian approach is not scalable.
By modifying one of the classical estimators, we obtain our proposed method of moments estimator,
which can be computed using linear computation time and space and whose exact variance can be
found. Indeed, in linear computation time and space, we can also find a conservative approximation
to the variance of the estimates. In addition, we prove that the estimates are asymptotically normal.
Lastly, we illustrate the behavior of the estimates on simulated data and several large data sets from
social media.
In Chapter 3, we extend the crossed random effects model to allow for arbitrary predictors,
obtaining model (1). We start by reviewing previous literature, which has primarily focused on
likelihood-based methods that are expensive for massive linear mixed models with crossed random
effects. Then, we propose an algorithm that alternates between estimating the regression coeffi-
cients and the variance components, using only linear computation time and space. We show that
the resulting estimates are consistent and asymptotically normal. Moreover, the variance of the
estimated regression coefficients can be found using linear computation time and space, enabling us
to efficiently perform inference. Using simulated data, we compare the quality and computational
cost of our estimates to those obtained by maximum likelihood. Then, we apply the alternating
algorithm to a large data set from e-commerce, containing ratings of clothing by customers, along
with information about the particular customer or piece of clothing.
Finally, in Chapter 4, we discuss some practical recommendations for applying our methods to
real-world data. We then summarize the thesis and briefly discuss several possible directions for
future work. All proofs are included in the appendices.
The bulk of Chapters 2 and 3 and their accompanying proofs are taken from two papers [13, 14]
written by me and my advisor Art Owen. We collaborated on the algorithm design and analysis
and I performed the experiments. In this thesis, I have included some additional literature review
and analysis.
Chapter 2
Crossed Random Effects Model
In this chapter, we start with the simplest version of model (1) where there is only an intercept
instead of arbitrary auxiliary data. It is still able to provide us with challenges and insights into the
more general model.
2.1 Model Description and Our Contributions
We consider the following model for the data of interest:
Model 2. Two-factor crossed random effects model

$$Y_{ij} = \mu + a_i + b_j + e_{ij}, \qquad i, j \in \mathbb{N},$$
where, independently,
$$a_i \stackrel{\mathrm{iid}}{\sim} (0, \sigma_A^2), \qquad b_j \stackrel{\mathrm{iid}}{\sim} (0, \sigma_B^2), \qquad e_{ij} \stackrel{\mathrm{iid}}{\sim} (0, \sigma_E^2),$$
and
$$\kappa_A < \infty, \qquad \kappa_B < \infty, \qquad \kappa_E < \infty, \tag{2.1}$$
where µ is the intercept (global mean) and the other quantities are defined as in Section 1.2.
The assumptions about the patterns of observation discussed in that section also hold.
For this model, the parameters of interest are µ, σ²A, σ²B, and σ²E. However, in this chapter we restrict our attention to the variance components σ²A, σ²B, and σ²E, as estimating µ has been
well-studied in the literature. For instance, a simple estimator is the sample mean, whose value and
exact variance can be computed in O(N) time [37]. In addition, when we return to model (1) in
Chapter 3, our discussion and results about the estimation of β will also apply to the estimation
of µ in this model. Before presenting an algorithm that has the properties desired in Section 1.2.1,
we argue that a Bayesian approach, where we assume that the random effects have some known
distribution and sample from the posterior distribution of the variance components, does not scale
to big data. To do so, we show both simulation results from Markov Chain Monte Carlo (MCMC)
algorithms and theory describing the convergence rate of those algorithms.
Let θ = (σ²A, σ²B, σ²E)ᵀ be the vector of variance components. Following Section 1.2.1, we first obtain an unbiased estimate θ̂ of θ using a parallelizable algorithm requiring one pass over the data, with computational cost O(N) and O(R + C) additional space. Second, we find the variance of θ̂, Var(θ̂ | θ, κ), which depends on both θ and the kurtoses κ = (κA, κB, κE)ᵀ. The exact variance is too expensive to compute, but we develop (mildly) overestimating approximations V(θ, κ) that can be computed in a second pass over the data, requiring O(N) time and O(R + C) space, once given values for θ and κ. By estimating κ in a similar manner to θ, we then estimate the variances of the estimates of θ by V̂ar(θ̂) = V(θ̂, κ̂). Third, we show that θ̂ is asymptotically normally distributed under a suitable asymptotic regime. Combining these three results, we have the tools we need to find estimates and confidence intervals for the variance components efficiently and scalably.
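Once point estimates, variance estimates, and asymptotic normality are in hand, assembling the intervals themselves is routine. A minimal sketch, with a helper name of our own choosing:

```python
import math

def wald_intervals(theta_hat, var_hat, z=1.96):
    """Wald confidence intervals theta_k +/- z * sqrt(Var_k), justified by
    the asymptotic normality of the estimates (z = 1.96 for 95% coverage)."""
    return [(t - z * math.sqrt(v), t + z * math.sqrt(v))
            for t, v in zip(theta_hat, var_hat)]
```

Because the variance approximations mildly overestimate the true variances, intervals built this way are slightly conservative.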
For large data sets we might suppose that Var(θ̂) is necessarily very small and getting exact values is not important. While this may be true, it is wise to check. The effective sample size (as defined in Lavrakas (2008) [30]) in model (2) might be as small as R or C if the row or column effects dominate. Moreover, if the sampling frequencies of rows or columns are very unequal, then the effective sample size can be much smaller than R or C. For example, the Netflix data set [8] has N ≐ 10⁸. But there are only about 18,000 movies, and so for statistics dominated by the movie effect the effective sample size might be closer to 18,000. That the movies do not appear equally often would further reduce the effective sample size. Indeed, Owen (2007) [36] shows that for some linear statistics the variance could be as much as 50,000 times larger than a formula based on IID sampling would yield. That factor is perhaps extreme, but it would translate a nominal sample size of 10⁸ into an effective sample size closer to 2,000.
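One way to make this concrete for the grand mean under model (2) uses the variance formula Var(Ȳ) = σ²A ∑i N²i•/N² + σ²B ∑j N²•j/N² + σ²E/N, with the effective sample size defined as the ratio of the total variance to Var(Ȳ). The helper below is our own sketch under that definition, not code from the thesis:

```python
def effective_sample_size(n_row, n_col, sig2_a, sig2_b, sig2_e):
    """Effective sample size of the grand mean under model (2), taking
    Var(Ybar) = sig2_a * sum_i N_i.^2 / N^2
              + sig2_b * sum_j N_.j^2 / N^2 + sig2_e / N
    and ESS = (sig2_a + sig2_b + sig2_e) / Var(Ybar)."""
    N = sum(n_row.values())
    var_mean = (sig2_a * sum(v * v for v in n_row.values()) / N ** 2
                + sig2_b * sum(v * v for v in n_col.values()) / N ** 2
                + sig2_e / N)
    return (sig2_a + sig2_b + sig2_e) / var_mean

# Balanced 10 x 10 grid: when the row effect dominates, the ESS
# collapses from N = 100 down to R = 10.
balanced = {i: 10 for i in range(10)}
ess = effective_sample_size(balanced, balanced,
                            sig2_a=1.0, sig2_b=0.0, sig2_e=0.0)
```

In the IID limit (σ²A = σ²B = 0) the same formula returns N, as expected.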
2.1.1 Notation and Asymptotic Regime
In this section, we review some additional notation we need for this chapter and the next.
To index rows, we use i′, r, and r′ in addition to i. To index columns, we use j′, s, and s′ in
addition to j. Let Zij = 1 if Yij is observed, and Zij = 0 otherwise.
The number of observations in row i is Ni• = ∑j Zij and the number of observations in column j is N•j = ∑i Zij . In the following, all of our sums over rows are only over rows i with Ni• > 0, and
similarly for sums over columns. We state this because there are a small number of expressions where
omitting rows without data changes their values. This convention corresponds to what happens when
one makes a pass through the whole data set.
Let Z be the R × C matrix containing Zij . Of interest are (ZZᵀ)ii′ = ∑j Zij Zi′j , the number of columns for which we have data in both rows i and i′, and (ZᵀZ)jj′ . Note that (ZZᵀ)ii′ ≤ Ni• and furthermore
$$\sum_{i,r} (ZZ^{\mathsf{T}})_{ir} = \sum_{j}\sum_{i}\sum_{r} Z_{ij} Z_{rj} = \sum_{j} N_{\bullet j}^2, \quad\text{and}\quad \sum_{j,s} (Z^{\mathsf{T}}Z)_{js} = \sum_{i} N_{i\bullet}^2.$$
Two other useful idioms are
$$T_{i\bullet} = \sum_{j} Z_{ij} N_{\bullet j} \quad\text{and}\quad T_{\bullet j} = \sum_{i} Z_{ij} N_{i\bullet}. \tag{2.2}$$
Here Ti• is the total number of observations in all of the columns j that are represented in row i.
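These quantities are all computable in two passes over the observed cells, as in the following sketch (the function name is ours):

```python
from collections import defaultdict

def row_col_tallies(cells):
    """Two passes over the observed cells (i, j): the first accumulates
    N_i. and N_.j, the second forms T_i. = sum_j Z_ij N_.j and
    T_.j = sum_i Z_ij N_i. from equation (2.2)."""
    n_row, n_col = defaultdict(int), defaultdict(int)
    for i, j in cells:
        n_row[i] += 1
        n_col[j] += 1
    t_row, t_col = defaultdict(int), defaultdict(int)
    for i, j in cells:
        t_row[i] += n_col[j]   # adds N_.j for every column j seen in row i
        t_col[j] += n_row[i]   # adds N_i. for every row i seen in column j
    return n_row, n_col, t_row, t_col
```

Both passes cost O(N) time and O(R + C) space, matching the budget set out in Section 1.2.1.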
Our notation allows for an arbitrary pattern of observations. Some special cases are as follows.
A balanced crossed design can be described via Zij = 1{i ≤ R}1{j ≤ C}. If maxi Ni• = 1 but maxj N•j > 1 then the data have a nested structure with rows nested in columns. If maxi Ni• = maxj N•j = 1, then
the observed Yij are IID. Some patterns are difficult to handle. For example, if all the observations
are in the same row or column, some of the variance components are not identifiable. We assume
that we do not meet such worst cases.
We also need to introduce some notation useful for asymptotic analysis. The quantities
$$\varepsilon_R = \max_i N_{i\bullet}/N \quad\text{and}\quad \varepsilon_C = \max_j N_{\bullet j}/N \tag{2.3}$$
measure the extent to which a single row or column dominates the data set. For the remainder of this section, we discuss the asymptotic regime used in this thesis; simply N → ∞ is not enough.
First, we expect that no row or column contains a large portion of the observed data. Thus, as N → ∞, we assume that
$$\max(\varepsilon_R, \varepsilon_C) \to 0. \tag{2.4}$$
It is also often reasonable to suppose that maxi Ti•/N and maxj T•j/N are both small.
In many data sets, the average row and column sizes are large, but much smaller than N. One way to measure the average row size is N/R. Another way to measure it is to randomly choose an observation and inspect its row size, obtaining an expected value of (1/N)∑i N²i•. Similar formulas hold for the average column size. Therefore, we assume that as N → ∞
$$\max(R/N, C/N) \to 0 \tag{2.5}$$
and
$$\min\Bigl(\frac{1}{N}\sum_i N_{i\bullet}^2,\ \frac{1}{N}\sum_j N_{\bullet j}^2\Bigr) \to \infty \quad\text{and}\quad \max\Bigl(\frac{1}{N^2}\sum_i N_{i\bullet}^2,\ \frac{1}{N^2}\sum_j N_{\bullet j}^2\Bigr) \to 0. \tag{2.6}$$
Notice that
$$\frac{1}{N^2}\sum_i N_{i\bullet}^2 \le \frac{1}{N^2}\sum_i N_{i\bullet}(\varepsilon_R N) \le \varepsilon_R \quad\text{and}\quad \frac{1}{N^2}\sum_j N_{\bullet j}^2 \le \varepsilon_C, \tag{2.7}$$
and so the second part of (2.6) follows from (2.3) and (2.4).
While the average row count may be large, many of the rows corresponding to newly seen
entities can have Ni• = 1. In our algorithms and analysis, it is not necessary to assume that all of
the rows and columns contain at least some minimum number of observations. Thus, we avoid the information loss incurred by the common practice of iteratively removing all rows and columns with few observations.
To illustrate the appropriateness of our assumptions for data sets of practical interest, we consider
the Netflix data [8], which has $N = 100{,}480{,}507$ ratings on $R = 17{,}770$ movies by $C = 480{,}189$
customers. Therefore $R/N \doteq 0.00018$ and $C/N \doteq 0.0047$, and the observation pattern is sparse
with $N/(RC) \doteq 0.012$. It is not dominated by a single row or column because $\epsilon_R \doteq 0.0023$ and
$\epsilon_C \doteq 0.00018$, even though one customer has rated an astonishing 17,653 movies. Similarly
\[
\frac{N}{\sum_i N_{i\bullet}^2} \doteq 1.78\times10^{-5},\quad
\frac{\sum_i N_{i\bullet}^2}{N^2} \doteq 0.00056,\quad
\frac{N}{\sum_j N_{\bullet j}^2} \doteq 0.0015,\quad\text{and}\quad
\frac{\sum_j N_{\bullet j}^2}{N^2} \doteq 6.43\times10^{-6},
\]
so that the average row or column has size $\gg 1$ and $\ll N$.
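These summaries are cheap to compute from the row and column counts alone. A minimal Python sketch on a tiny made-up observation pattern (the counts below are illustrative, not the Netflix data):

```python
import numpy as np

# Row and column counts for a small hypothetical data set (not Netflix):
# N_{i.} for R = 4 rows and N_{.j} for C = 3 columns.
Ni = np.array([3, 1, 2, 4])
Nj = np.array([5, 2, 3])
N = int(Ni.sum())
assert N == Nj.sum()           # every observation lies in one row and one column

eps_R = Ni.max() / N           # largest row share, as in (2.3)
eps_C = Nj.max() / N           # largest column share
avg_row = (Ni ** 2).sum() / N  # expected row size of a randomly chosen observation
avg_col = (Nj ** 2).sum() / N
sparsity = N / (len(Ni) * len(Nj))  # N / (RC)
print(eps_R, eps_C, avg_row, avg_col, sparsity)
```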
2.2 Previous Work
Before presenting our results, we review algorithms from the literature for estimating the variance
components of model (2).
They fall into two main categories. The first, which our proposal would fall under, uses moments
of the data. These have the advantage that they do not require any parametric assumptions and
moments can be computed easily by passing over the data, which also enables parallelization. The
second assumes some distribution for the random effects and noise and maximizes the data likelihood
(or some version thereof). These have the advantage that if the data actually follow the chosen
distribution, the resulting estimates often have smaller mean squared error, i.e. are more efficient
[44].
Here, we discuss the first category of algorithms. We defer the discussion of the second category
to Section 3.2.2, since such algorithms for crossed random effects models do not differ much from
those for linear mixed models. We note that likelihood-based methods require computation
time superlinear in the number of observations, which makes them unscalable; see Section 3.2.2 and
also Raudenbush (1993) [40].
Before doing so, we mention Clayton & Rasbash (1999) [11], who propose an alternating imputation
posterior (AIP) algorithm for crossed random effects models with missing data. They show that
it has good performance on fairly large data sets. It may be termed a 'pseudo-MCMC' method since
it alternates between sampling the missing data from its distribution given the parameter estimates
and sampling the parameters from a distribution centered on the maximum likelihood estimates.
Because of this last step, we do not consider AIP to be scalable to Internet-size problems.
2.2.1 Moment-based Algorithms
Note that model (2) can be written as
\[
Y = \mu 1_N + Z_a a + Z_b b + e, \quad\text{with}\quad
a \sim (0, \sigma_A^2 I_R),\ b \sim (0, \sigma_B^2 I_C),\ e \sim (0, \sigma_E^2 I_N)\ \text{(independently)}, \tag{2.8}
\]
where $Y$ is the vector of $Y_{ij}$, $1_N$ is the $N$-dimensional vector of ones, $Z_a$ is an $N\times R$ binary matrix
with entry $(k,i)$ equal to 1 if the $k$th observation is from level $i$ of the first factor, $a$ is the vector of
$a_i$, $Z_b$ is an $N\times C$ binary matrix with entry $(k,j)$ equal to 1 if the $k$th observation is from level $j$ of
the second factor, $b$ is the vector of $b_j$, and $I_N$ is the identity matrix of dimension $N$.
Suppose that we choose three matrices $A_1$, $A_2$, and $A_3$ for which $1_N^T A_1 1_N = 1_N^T A_2 1_N = 1_N^T A_3 1_N = 0$. Then applying the method of moments to $S_i = Y^T A_i Y$ for $i = 1, 2, 3$, where
\[
E(S_i) = \sigma_A^2 \operatorname{Tr}(Z_a^T A_i Z_a) + \sigma_B^2 \operatorname{Tr}(Z_b^T A_i Z_b) + \sigma_E^2 \operatorname{Tr}(A_i),
\]
gives unbiased estimates of the variance components [44]. When the random effects are normally
distributed, the matrices $A_i$ can be chosen optimally, yielding minimum variance unbiased estimates
[33]. We now describe two special cases of this technique.
Henderson I   The most common choice, called Henderson I [23], has estimating equations
\[
\begin{aligned}
SSA &= \sum_{i=1}^R N_{i\bullet}(\bar Y_{i\bullet} - \bar Y_{\bullet\bullet})^2, \\
SSB &= \sum_{j=1}^C N_{\bullet j}(\bar Y_{\bullet j} - \bar Y_{\bullet\bullet})^2, \quad\text{and} \\
SSE &= \sum_{ij} Z_{ij}(Y_{ij} - \bar Y_{\bullet\bullet})^2 - SSA - SSB,
\end{aligned}
\]
which are based on the ANOVA sums of squares for balanced data. We propose an alternative to
Henderson I based on $U$-statistics that is more amenable to analysis, and we develop a method to
approximate its sampling variance in linear time. A disadvantage of Henderson I is that it only
applies to random effects models. Henderson II extends Henderson I to linear mixed models and we
will discuss it in Section 3.2.1.
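On a toy data set these estimating equations reduce to a few grouped sums. A minimal sketch (the observation values below are made up for illustration, not from the text) computes $SSA$, $SSB$, and $SSE$ in one pass over the observations:

```python
from collections import defaultdict

# Toy observations (i, j, y); values are illustrative only.
data = [(0, 0, 1.0), (0, 1, 2.2), (1, 0, 2.1), (1, 1, 2.7)]

row_sum, row_n = defaultdict(float), defaultdict(int)
col_sum, col_n = defaultdict(float), defaultdict(int)
for i, j, y in data:
    row_sum[i] += y; row_n[i] += 1
    col_sum[j] += y; col_n[j] += 1
N = len(data)
ybar = sum(y for _, _, y in data) / N  # grand mean

# Henderson I sums of squares for (possibly unbalanced) data.
SSA = sum(row_n[i] * (row_sum[i] / row_n[i] - ybar) ** 2 for i in row_n)
SSB = sum(col_n[j] * (col_sum[j] / col_n[j] - ybar) ** 2 for j in col_n)
SSE = sum((y - ybar) ** 2 for _, _, y in data) - SSA - SSB
```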
CHAPTER 2. CROSSED RANDOM EFFECTS MODEL 11
Henderson III   Henderson III [44] comes from a regression perspective rather than an ANOVA
perspective. To construct estimating equations, it pretends that the random effects are fixed effects.
Let $R(a, b, \mu)$ be the reduction in sum of squares due to fitting the model with $\mu$, $a$, and $b$ as fixed
effects. Similarly, let $R(a, \mu)$ be the same for fitting the model with only $\mu$ and $a$, and let $R(\mu)$ be the
same for fitting the model with only $\mu$. Then we apply the method of moments to the estimating
equations
\[
\begin{aligned}
R(a, b \mid \mu) &= R(a, b, \mu) - R(\mu), \\
R(b \mid a, \mu) &= R(a, b, \mu) - R(a, \mu), \quad\text{and} \\
SSE &= Y^T Y - R(a, b, \mu),
\end{aligned}
\]
with expectations under model (2.8)
\[
\begin{aligned}
E(R(a, b \mid \mu)) &= \Bigl(N - \frac{\sum_i N_{i\bullet}^2}{N}\Bigr)\sigma_A^2 + \Bigl(N - \frac{\sum_j N_{\bullet j}^2}{N}\Bigr)\sigma_B^2 + (R + C - 2)\sigma_E^2, \\
E(R(b \mid a, \mu)) &= (N - R)\sigma_B^2 + (C - 1)\sigma_E^2, \quad\text{and} \\
E(SSE) &= (N - R - C + 1)\sigma_E^2.
\end{aligned}
\]
2.3 Bayesian Analysis via MCMC

In addition to the moment and likelihood based algorithms discussed in the previous section, we
may also take a Bayesian approach. As in likelihood based algorithms, we assume that the random
effects and noise follow some known distribution; we consider the Gaussian distribution for this
section only. To estimate $\sigma_A^2$, $\sigma_B^2$, and $\sigma_E^2$, we then utilize MCMC algorithms to sample from the
posterior distribution given the data, $\pi = p(\mu, a, b, \sigma_A^2, \sigma_B^2, \sigma_E^2 \mid Y)$, where $a$ is the vector of $a_i$, $b$ is
the vector of $b_j$, and $Y$ is the vector of $Y_{ij}$. Let
\[
S^{(t)} = \bigl(\mu^{(t)},\ a^{(t)T},\ b^{(t)T},\ \sigma_A^{2(t)},\ \sigma_B^{2(t)},\ \sigma_E^{2(t)}\bigr)^T, \quad t \ge 1,
\]
denote the resulting chain.
We show that numerous MCMC methods scale badly for crossed random effects models, even
though they have been shown to be effective for hierarchical random effects models [15, 46, 51].
Indeed, in limits where $R, C \to \infty$, the dimension of our chain $S^{(t)}$ approaches infinity. Convergence
rates of many MCMC methods slow down as the dimension of the chain increases, making them
ineffective for high dimensional parameter spaces.
In our arguments, we make two restrictions. First, balanced data is a fully sampled $R\times C$ matrix
with $Y_{ij}$ for rows $i = 1, \dots, R$ and columns $j = 1, \dots, C$. We present some analyses for balanced
data, with interspersed remarks on how the general unbalanced case behaves. The balanced case
allows sharp formulas that we find useful, and it is the case that we simulate. In particular, we can
obtain convergence rates for some MCMC algorithms. Second, the MCMC algorithms we consider
go over the entire data set at each iteration. There are alternative samplers that save computation
time by only looking at subsets of the data at each iteration. However, so far those approaches have
been developed for IID data only.
2.3.1 Implications from Probability Theory
We start by exploring the theoretical properties of some common MCMC algorithms for our problem.
2.3.1.1 Gibbs Sampling

In each iteration of Gibbs sampling [16], we draw from the conditional posteriors of $\mu$, $a$, $b$, $\sigma_A^2$,
$\sigma_B^2$, and $\sigma_E^2$ in turn. To analyze the behavior of this sampler, let us consider the problem of Gibbs
sampling from the 'smaller' distribution $\phi = p(a, b \mid \mu, \sigma_A^2, \sigma_B^2, \sigma_E^2, Y)$. At iteration $t+1$, we sample
$a^{(t+1)} \sim p(a \mid b^{(t)}, \mu, \sigma_A^2, \sigma_B^2, \sigma_E^2, Y)$ and $b^{(t+1)} \sim p(b \mid a^{(t+1)}, \mu, \sigma_A^2, \sigma_B^2, \sigma_E^2, Y)$, which are normal
distributions with diagonal covariance matrices. Let $X^{(t)}$ be the resulting chain.
Roberts & Sahu (1997) [42] give the following definition.

Definition 2.3.1. Let $\theta^{(t)}$, for integer $t \ge 0$, be a Markov chain with stationary distribution $h$. Its
convergence rate is the minimum number $\rho$ such that
\[
\lim_{t\to\infty} E_h\Bigl[\bigl(E_h[f(\theta^{(t)}) \mid \theta^{(0)}] - E_h[f(\theta)]\bigr)^2\Bigr] r^{-t} = 0
\]
holds for all measurable functions $f$ such that $E_h(f(\theta)^2) < \infty$ and all $r > \rho$.
Theorem 2.3.1. Let $\rho$ be the convergence rate of $X^{(t)}$ to $\phi$, as in Definition 2.3.1. Then
\[
\rho = \frac{\sigma_B^2}{\sigma_B^2 + \sigma_E^2/R} \times \frac{\sigma_A^2}{\sigma_A^2 + \sigma_E^2/C}.
\]

Proof. See Section A.1.1.
We see that $\rho \to 1$ as $R, C \to \infty$, outside of trivial cases with $\sigma_A^2$ or $\sigma_B^2$ equal to zero. If $R$ and
$C$ grow proportionately, then $\rho = 1 - \alpha/\sqrt N + O(1/N)$ for some $\alpha > 0$. We can therefore expect
the Gibbs sampler to require at least some constant multiple of $\sqrt N$ iterations to approximate the
target distribution sufficiently. When the data are not perfectly balanced, numerical computation
of $\rho$ shows that Gibbs still mixes increasingly slowly as $N \to \infty$, while the sampler requires $O(N)$
computation per iteration. In sum, Gibbs takes $O(N^{3/2})$ work to sample from $\phi$, which is not
scalable.
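The rate in Theorem 2.3.1 is easy to evaluate numerically. The sketch below uses the variance components of the later simulations ($\sigma_A^2 = 2$, $\sigma_B^2 = 0.5$, $\sigma_E^2 = 1$) and shows $\rho$ climbing toward 1 as $R = C$ grows:

```python
# Convergence rate of the two-block Gibbs sampler from Theorem 2.3.1.
def gibbs_rate(R, C, sA2=2.0, sB2=0.5, sE2=1.0):
    return (sB2 / (sB2 + sE2 / R)) * (sA2 / (sA2 + sE2 / C))

for n in (10, 100, 1000, 10000):
    print(n, gibbs_rate(n, n))  # rho increases toward 1 with R = C = n
```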
Because sampling from $\phi$ can be viewed as a subproblem of sampling from $\pi$, we believe that
the Gibbs sampler that draws from $\pi$, which also requires $O(N)$ time per iteration, will exhibit the
same slow convergence and hence require superlinear computation time. Our simulations, described
in Section 2.3.2, confirm this.
2.3.1.2 Other MCMC algorithms
The Gibbs sampler is widely used for problems like this, where it is tractable to sample from the
full conditional distributions. But there are other MCMC algorithms that one could use. Now we
consider random walk Metropolis (RWM), Langevin diffusion, and Metropolis adjusted Langevin
(MALA). They also have difficulties scaling to large data sets.
At iteration $t+1$ of RWM, a Gaussian random walk proposal $S^{(t+1)} \sim \mathcal N(S^{(t)}, \sigma^2 I)$ for $\sigma^2 > 0$
is made, and the step is taken with the Metropolis–Hastings acceptance probability. If the target
distribution is a product distribution of dimension $d$, the chain $\bar S^{(t)} \equiv S^{(dt)}$ (i.e., the chain formed by
every $d$th state of the chain $S^{(t)}$) converges to a diffusion whose stationary distribution is the target
distribution. We may interpret this as a convergence time for the algorithm that grows as $O(d)$ [41].
For our problem, evaluating the acceptance probability requires time at least $O(N)$, so the overall
algorithm then takes $O(N(R+C))$ time. This is at best $O(N^{3/2})$, as we found for Gibbs sampling,
and could be worse for sparse data where $N \ll RC$. Our target distribution is not of product
form, but we have no reason to expect that RWM mixes orders of magnitude faster here than for a
distribution of product form. Indeed, it seems more likely that mixing would be faster for product
distributions than for distributions with more complicated dependence patterns such as ours.
At iteration $t+1$, Langevin diffusion steps $S^{(t+1)} \sim \mathcal N\bigl(S^{(t)} + (h/2)\nabla\log\pi(S^{(t)}),\ hI\bigr)$ for $h > 0$.
As $h \to 0$, the stationary distribution for this process converges to $\pi$, as shown for general target
distributions in Liu (2004) [32]. Because $h \ne 0$ in practice, the Langevin algorithm is biased.
To correct this, the MALA algorithm uses the Metropolis–Hastings algorithm with the Langevin
proposal $S^{(t+1)}$. When the target distribution is a product distribution of dimension $d$, the chain
$\bar S^{(t)} \equiv S^{(d^{1/3}t)}$ converges to a diffusion with stationary distribution $\pi$; the convergence time grows as $O(d^{1/3})$ [41].
Computing the gradients and the acceptance probability requires $O(N)$ time, so with similar reasoning
as for RWM, the computation time of MALA is $O(N(R+C)^{1/3})$, which is at best $O(N^{1+1/6})$.
2.3.2 Simulation Results
We carried out simulations of the four algorithms described above, as well as five others: the block
Gibbs sampler (‘Block’), the reparameterized Gibbs sampler (‘Reparam.’), the independence sampler
(‘Indp.’), RWM with subsampling (‘RWM Sub.’), and the pCN algorithm of Hairer et al. (2014)
[20]. Descriptions of these five algorithms are given below with discussions of their simulation results.
Every algorithm was implemented in MATLAB and run on a cluster using 4GB memory.
For each algorithm and a range of values of $R$ and $C$, we generated balanced data from model (2)
with $\mu = 1$, $\sigma_A^2 = 2$, $\sigma_B^2 = 0.5$, and $\sigma_E^2 = 1$. We ran 20,000 iterations of the algorithm, retaining the
last 10,000 for analysis. We record the CPU time required, the median values of $\hat\mu$, $\hat\sigma_A^2$, $\hat\sigma_B^2$, and $\hat\sigma_E^2$,
and the number of lags needed for their sample autocorrelation functions (ACF) to go below 0.5.
The entire process is repeated in 10 independent runs. Table 2.1 presents median values of the
recorded statistics over the 10 runs for the case $R = C = 1000$. Section A.1.2 includes corresponding
results at a range of $(R, C)$ sizes.

Table 2.1: Summary of simulation results for cases with $R = C = 1000$. The first row gives CPU
time in seconds. The next four rows give median estimates of the four parameters. The last four rows
give the number of lags required to get an autocorrelation below 0.5.

Method          Gibbs   Block  Reparam.   Lang.   MALA   Indp.    RWM  RWM Sub.    pCN
CPU sec.         3432   15046      4099    2302   4760    2513   2141      2635   1966
med mu           0.97    1.02      1.04    0.99   0.96    2.39   1.55      1.07   1.53
med sigma_A^2    1.96    1.99      2.02    1.90   1.95    1.78   2.01      1.96   1.99
med sigma_B^2    0.51    0.50      0.50    0.40   0.50    2.94   0.51      0.50   0.49
med sigma_E^2    1.00    1.00      1.00   65.22   2.66    0.15   0         0.93   0
ACF(mu)           801     790       694       1   2501   5000+   1133      1656   1008
ACF(sigma_A^2)      1       1         1     122   2656   5000+   1133       989    912
ACF(sigma_B^2)      1       1         1     477   2514   5000+   1133       855    556
ACF(sigma_E^2)      1       1         1     385   3062   5000+   1518      1724    621
Block Gibbs, which updates $a$ and $b$ together to try to improve mixing, has computation time
per iteration superlinear in the number of observations. To improve mixing, reparameterized Gibbs
scales the random effects to have equal variance. This gives an algorithm equivalent to the conditional
augmentation of Van Dyk & Meng (2001) [47]. For all three Gibbs-type algorithms, the parameter
estimates are good, but $\mu$ mixes more slowly as $R$ and $C$ increase, while the variance components do not
exhibit this behavior.

The computation times of Langevin diffusion ('Lang.') and MALA are approximately linear in
the number of observations. However, $\hat\sigma_E^2$ tends to explode for large data sets under Langevin diffusion,
while the chain does not mix well under MALA.
The independence sampler is a Metropolis–Hastings algorithm where the proposal distribution is
fixed. We propose $\mu \sim \mathcal N(1, 1)$, $a \sim \mathcal N(0, I_R)$, $b \sim \mathcal N(0, I_C)$, and $\sigma_A^2, \sigma_B^2, \sigma_E^2 \sim \mathrm{InvGamma}(1, 1)$.
The computation time grows linearly with the data size. The parameters do not mix well, and their
estimates are poor. It is possible that better results would be obtained from a different proposal
distribution, but it is not clear how best to choose one in practice.
RWM and RWM with subsampling, the latter of which updates a subset of parameters at each
iteration, both have computation time linear in the number of observations. Neither algorithm
mixed well, and for RWM $\hat\sigma_E^2$ tended to go to zero in large data sets.
The pCN algorithm is a Metropolis–Hastings algorithm where the proposals are Gaussian random
walk steps shrunken towards zero: $S^{(t+1)} \sim \mathcal N\bigl(\sqrt{1-\sigma^2}\,S^{(t)}, \sigma^2 I\bigr)$ for $\sigma^2 \le 1$. Hairer et al. (2014)
[20] show that under certain conditions on the target distribution, the convergence rate of this
algorithm does not slow with the dimension of the distribution. We include it here even though
our $\pi$ does not satisfy those conditions. The computation time grows linearly with the data size.
However, the estimates for $\mu$ and $\sigma_E^2$ are not good, and those for $\sigma_E^2$ even get worse as the data size
increases. None of the parameters seems to mix well.
In summary, for large data sets each algorithm mixes increasingly slowly or returns flawed estimates
of $\mu$ and the variance components. We have also simulated some unbalanced data sets; slow
mixing is once again the norm, with worse performance as $R$ and $C$ grow.
2.4 Method of Moments Estimation

We propose a method of moments estimate $\hat\theta$ for $\theta = (\sigma_A^2, \sigma_B^2, \sigma_E^2)^T$ that requires one pass over the
data. We find an expression for $\mathrm{Var}(\hat\theta \mid \theta, \kappa)$ and describe how to obtain an approximation of it after
a second pass over the data. We also show that the estimates are asymptotically Gaussian under
the regime described in Section 2.1.1, which enables the efficient construction of reliable confidence
intervals.
2.4.1 U-statistics for variance components

The usual unbiased sample variance estimate can be formulated as a $U$-statistic, which is more
convenient to analyze. Thus, we use the following $U$-statistics as our method of moments estimators:
\[
\begin{aligned}
U_a &= \frac12 \sum_{ijj'} N_{i\bullet}^{-1} Z_{ij} Z_{ij'} (Y_{ij} - Y_{ij'})^2, \\
U_b &= \frac12 \sum_{jii'} N_{\bullet j}^{-1} Z_{ij} Z_{i'j} (Y_{ij} - Y_{i'j})^2, \quad\text{and} \\
U_e &= \frac12 \sum_{iji'j'} Z_{ij} Z_{i'j'} (Y_{ij} - Y_{i'j'})^2.
\end{aligned} \tag{2.9}
\]
We explain our choice of these estimators in Section A.2.1.

To understand $U_a$, note that for each row $i$, the quantities $Y_{ij} - \mu - a_i$ are IID with variance
$\sigma_B^2 + \sigma_E^2$. Thus, $U_a$ is a weighted sum of within-row unbiased estimates of $\sigma_B^2 + \sigma_E^2$. The explanation
for $U_b$ is similar, while $U_e$ is proportional to the sample variance estimate of all $N$ observations.
Lemma 2.4.1. Let $Y_{ij}$ follow the two-factor crossed random effects model (2) with the observation
pattern $Z_{ij}$ as described in Section 2.1.1. Then the $U$-statistics defined at (2.9) satisfy
\[
\begin{aligned}
E(U_a) &= (\sigma_B^2 + \sigma_E^2)(N - R), \\
E(U_b) &= (\sigma_A^2 + \sigma_E^2)(N - C), \quad\text{and} \\
E(U_e) &= \sigma_A^2\Bigl(N^2 - \sum_i N_{i\bullet}^2\Bigr) + \sigma_B^2\Bigl(N^2 - \sum_j N_{\bullet j}^2\Bigr) + \sigma_E^2(N^2 - N).
\end{aligned}
\]

Proof. See Section A.2.2.
To obtain unbiased estimates $\hat\sigma_A^2$, $\hat\sigma_B^2$, and $\hat\sigma_E^2$ given values of the $U$-statistics, we solve the $3\times3$
system of equations
\[
M \begin{pmatrix} \hat\sigma_A^2 \\ \hat\sigma_B^2 \\ \hat\sigma_E^2 \end{pmatrix}
= \begin{pmatrix} U_a \\ U_b \\ U_e \end{pmatrix}
\quad\text{for}\quad
M = \begin{pmatrix}
0 & N - R & N - R \\
N - C & 0 & N - C \\
N^2 - \sum_i N_{i\bullet}^2 & N^2 - \sum_j N_{\bullet j}^2 & N^2 - N
\end{pmatrix}. \tag{2.10}
\]
For our method to return unique and meaningful estimates, the determinant of $M$,
\[
\det M = (N - R)(N - C)\Bigl(N^2 - \sum_i N_{i\bullet}^2 - \sum_j N_{\bullet j}^2 + N\Bigr)
\ge (N - R)(N - C)\bigl(N^2(1 - \epsilon_R - \epsilon_C) + N\bigr),
\]
must be nonzero. This is true when no row or column has more than half of the data and at least
one row and at least one column have more than one observation.
To compute the $U$-statistics, notice that $U_a = \sum_i S_{i\bullet}$, where $S_{i\bullet} = \sum_j Z_{ij}(Y_{ij} - \bar Y_{i\bullet})^2$ and
$\bar Y_{i\bullet} = (1/N_{i\bullet})\sum_j Z_{ij} Y_{ij}$. In one pass over the data and time $O(N)$, we compute $N_{i\bullet}$, $\bar Y_{i\bullet}$, and $S_{i\bullet}$
for all $R$ observed levels of $i$ using the incremental algorithm described in the next paragraph. We
can also compute $N$, $R$, and $C$ in such a pass if they are not known beforehand.
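The identity $U_a = \sum_i S_{i\bullet}$ is what makes the $O(N^2)$-looking sum in (2.9) computable in one pass. A sketch verifying it numerically on a made-up data set:

```python
from collections import defaultdict

# Toy observations (i, j, y); values are illustrative only.
data = [(0, 0, 1.0), (0, 1, 2.5), (0, 2, 2.0), (1, 0, 4.0), (1, 2, 3.0)]
rows = defaultdict(list)
for i, _, y in data:
    rows[i].append(y)

# Brute force: (1/2) sum over same-row ordered pairs of (Y - Y')^2 / N_i.
Ua_brute = sum(
    0.5 * sum((y - yp) ** 2 for y in ys for yp in ys) / len(ys)
    for ys in rows.values()
)

# One-pass form: U_a = sum_i S_i, with S_i the within-row centered sum of squares.
Ua_fast = sum(
    sum((y - sum(ys) / len(ys)) ** 2 for y in ys) for ys in rows.values()
)
assert abs(Ua_brute - Ua_fast) < 1e-12
```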
Chan et al. (1983) [10] show how to compute both the running sum $\dot Y_{i\bullet} = N_{i\bullet}\bar Y_{i\bullet}$ and $S_{i\bullet}$ in a
numerically stable one-pass algorithm. At the initial appearance of an observation in row $i$, with
corresponding column $j = j(1)$, set $N_{i\bullet} = 1$, $\dot Y_{i\bullet} = Y_{ij}$, and $S_{i\bullet} = 0$. After that, at the $k$th
appearance of an observation in row $i$, with corresponding column $j(k)$ and value $Y_{ij(k)}$, update
\[
N_{i\bullet} \leftarrow N_{i\bullet} + 1, \quad
\dot Y_{i\bullet} \leftarrow \dot Y_{i\bullet} + Y_{ij(k)}, \quad\text{and}\quad
S_{i\bullet} \leftarrow S_{i\bullet} + \frac{\bigl(N_{i\bullet} Y_{ij(k)} - \dot Y_{i\bullet}\bigr)^2}{N_{i\bullet}(N_{i\bullet} - 1)}. \tag{2.11}
\]
Note that here $k = N_{i\bullet}$. Chan et al. (1983) [10] give a detailed analysis of roundoff error for
update (2.11), as well as generalizations that update higher moments from groups of data values.
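A minimal Python version of update (2.11), maintaining the count, the running sum, and the centered sum of squares for a single row (the observation values are made up):

```python
# Streaming update (2.11): state holds (count, running sum, centered SS).
def update(state, y):
    n, total, s = state
    n += 1
    total += y
    if n > 1:
        s += (n * y - total) ** 2 / (n * (n - 1))
    return n, total, s

state = (0, 0.0, 0.0)
for y in [1.0, 2.5, 2.0]:   # one row's observations
    state = update(state, y)
n, total, s = state
# total / n is the row mean, and s matches the two-pass sum of squares.
assert abs(total / n - 5.5 / 3) < 1e-12 and abs(s - 7 / 6) < 1e-9
```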
In that same pass over the data, $U_e$ and the analogous quantities needed to compute $U_b$ ($S_{\bullet j}$,
$\bar Y_{\bullet j}$, $N_{\bullet j}$) are also computed using the incremental algorithm. Finally, in additional time $O(R + C)$,
we calculate $\sum_i S_{i\bullet}$, $\sum_j S_{\bullet j}$, $\sum_i N_{i\bullet}^2$, and $\sum_j N_{\bullet j}^2$. Now we have $U_a$, $U_b$, $U_e$, and all the entries
of $M$.
Given $U_a$, $U_b$, $U_e$, and $M$, we can calculate $\hat\sigma_A^2$, $\hat\sigma_B^2$, and $\hat\sigma_E^2$ in constant time. Therefore, finding
our method of moments estimators takes $O(N)$ time overall. Note that in practice we may get
negative estimates of the variance components, since the method of moments does not take into
account any constraints. There is no consensus on how to deal with this situation; in this
thesis, we automatically set any negative variance component estimates to zero.
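Putting the pieces together, here is a sketch of the whole estimate on a toy data set (made-up values): one pass forms the aggregates, then the 3×3 system (2.10) is solved and negative components are truncated at zero.

```python
import numpy as np
from collections import defaultdict

data = [(0, 0, 1.0), (0, 1, 2.5), (0, 2, 2.0), (1, 0, 4.0), (1, 2, 3.0),
        (2, 1, 0.5), (2, 2, 1.5)]   # toy (i, j, y) triples

row, col = defaultdict(list), defaultdict(list)
for i, j, y in data:
    row[i].append(y); col[j].append(y)
N, R, C = len(data), len(row), len(col)

def css(groups):  # total within-group centered sum of squares
    return sum(sum((y - sum(g) / len(g)) ** 2 for y in g) for g in groups)

ys = [y for _, _, y in data]
Ua, Ub, Ue = css(row.values()), css(col.values()), N * css([ys])

sumNi2 = sum(len(g) ** 2 for g in row.values())
sumNj2 = sum(len(g) ** 2 for g in col.values())
M = np.array([[0, N - R, N - R],
              [N - C, 0, N - C],
              [N ** 2 - sumNi2, N ** 2 - sumNj2, N ** 2 - N]], dtype=float)
sA2, sB2, sE2 = np.linalg.solve(M, [Ua, Ub, Ue])
# Truncate negative method of moments solutions at zero, as described above.
sA2, sB2, sE2 = (max(v, 0.0) for v in (sA2, sB2, sE2))
```

On this particular toy set the raw solution for the $B$ component comes out negative and is truncated to zero, illustrating the final remark above.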
2.4.2 Estimator Variances

Now we show how to estimate the covariance matrix of $\hat\theta = (\hat\sigma_A^2, \hat\sigma_B^2, \hat\sigma_E^2)^T$.

2.4.2.1 True variance of $\hat\theta$

This section discusses the finite sample covariance matrix of $\hat\theta$. Theorem 2.4.1 below gives the exact
variances and covariances of our $U$-statistics.
Theorem 2.4.1. Let $Y_{ij}$ follow the random effects model (2) with the observation pattern $Z_{ij}$ as
described in Section 2.1.1. Then the $U$-statistics defined at (2.9) have variances
\[
\begin{aligned}
\mathrm{Var}(U_a) ={}& \sigma_B^4(\kappa_B + 2) \sum_{ir} (ZZ^T)_{ir}(1 - N_{i\bullet}^{-1})(1 - N_{r\bullet}^{-1}) \\
&+ 2\sigma_B^4 \sum_{ir} N_{i\bullet}^{-1} N_{r\bullet}^{-1} (ZZ^T)_{ir}\bigl[(ZZ^T)_{ir} - 1\bigr] + 4\sigma_B^2\sigma_E^2(N - R) \\
&+ \sigma_E^4(\kappa_E + 2) \sum_i N_{i\bullet}(1 - N_{i\bullet}^{-1})^2 + 2\sigma_E^4 \sum_i (1 - N_{i\bullet}^{-1}),
\end{aligned} \tag{2.12}
\]
\[
\begin{aligned}
\mathrm{Var}(U_b) ={}& \sigma_A^4(\kappa_A + 2) \sum_{js} (Z^TZ)_{js}(1 - N_{\bullet j}^{-1})(1 - N_{\bullet s}^{-1}) \\
&+ 2\sigma_A^4 \sum_{js} N_{\bullet j}^{-1} N_{\bullet s}^{-1} (Z^TZ)_{js}\bigl[(Z^TZ)_{js} - 1\bigr] + 4\sigma_A^2\sigma_E^2(N - C) \\
&+ \sigma_E^4(\kappa_E + 2) \sum_j N_{\bullet j}(1 - N_{\bullet j}^{-1})^2 + 2\sigma_E^4 \sum_j (1 - N_{\bullet j}^{-1}),
\end{aligned} \tag{2.13}
\]
and
\[
\begin{aligned}
\mathrm{Var}(U_e) ={}& 2\sigma_A^4\Bigl[\Bigl(\sum_i N_{i\bullet}^2\Bigr)^2 - \sum_i N_{i\bullet}^4\Bigr]
+ \sigma_A^4(\kappa_A + 2)\Bigl(N^2 \sum_i N_{i\bullet}^2 - 2N \sum_i N_{i\bullet}^3 + \sum_i N_{i\bullet}^4\Bigr) \\
&+ 2\sigma_B^4\Bigl[\Bigl(\sum_j N_{\bullet j}^2\Bigr)^2 - \sum_j N_{\bullet j}^4\Bigr]
+ 4\sigma_A^2\sigma_B^2\Bigl(N^3 - 2N \sum_{ij} Z_{ij} N_{i\bullet} N_{\bullet j} + \sum_{ij} N_{i\bullet}^2 N_{\bullet j}^2\Bigr) \\
&+ \sigma_B^4(\kappa_B + 2)\Bigl(N^2 \sum_j N_{\bullet j}^2 - 2N \sum_j N_{\bullet j}^3 + \sum_j N_{\bullet j}^4\Bigr)
+ \sigma_E^4(\kappa_E + 2) N(N - 1)^2 \\
&+ 2\sigma_E^4 N(N - 1)
+ 4\sigma_A^2\sigma_E^2\Bigl(N^3 - N \sum_i N_{i\bullet}^2\Bigr)
+ 4\sigma_B^2\sigma_E^2\Bigl(N^3 - N \sum_j N_{\bullet j}^2\Bigr).
\end{aligned} \tag{2.14}
\]
Their covariances are
\[
\mathrm{Cov}(U_a, U_b) = \sigma_E^4(\kappa_E + 2) \sum_{ij} Z_{ij}(1 - N_{i\bullet}^{-1})(1 - N_{\bullet j}^{-1}), \tag{2.15}
\]
\[
\begin{aligned}
\mathrm{Cov}(U_a, U_e) ={}& 2\sigma_B^4\Bigl(\sum_i N_{i\bullet}^{-1} T_{i\bullet}^2 - \sum_{ij} Z_{ij} N_{i\bullet}^{-1} N_{\bullet j}^2\Bigr) \\
&+ \sigma_B^4(\kappa_B + 2) \sum_{ij} Z_{ij}(N - N_{\bullet j}) N_{\bullet j}(1 - N_{i\bullet}^{-1}) + 2\sigma_E^4(N - R) \\
&+ \sigma_E^4(\kappa_E + 2)(N - R)(N - 1) + 4\sigma_B^2\sigma_E^2 N(N - R), \quad\text{and}
\end{aligned} \tag{2.16}
\]
\[
\begin{aligned}
\mathrm{Cov}(U_b, U_e) ={}& 2\sigma_A^4\Bigl(\sum_j N_{\bullet j}^{-1} T_{\bullet j}^2 - \sum_{ij} Z_{ij} N_{\bullet j}^{-1} N_{i\bullet}^2\Bigr) \\
&+ \sigma_A^4(\kappa_A + 2) \sum_{ij} Z_{ij}(N - N_{i\bullet}) N_{i\bullet}(1 - N_{\bullet j}^{-1}) + 2\sigma_E^4(N - C) \\
&+ \sigma_E^4(\kappa_E + 2)(N - C)(N - 1) + 4\sigma_A^2\sigma_E^2 N(N - C).
\end{aligned} \tag{2.17}
\]

Proof. Equation (2.12) is proved in Section A.2.4, and then equation (2.13) follows by exchanging
indices. Equation (2.14) is proved in Section A.2.5. Equation (2.15) is proved in Section A.2.6.
Equation (2.16) is proved in Section A.2.7, and then equation (2.17) follows by exchanging indices.
Now we consider $\mathrm{Var}(\hat\theta)$. From (2.10),
\[
\mathrm{Var}(\hat\theta) = M^{-1}\, \mathrm{Var}\begin{pmatrix} U_a \\ U_b \\ U_e \end{pmatrix} (M^{-1})^T. \tag{2.18}
\]
We show in Section 2.4.2.2 that while $\mathrm{Var}(U_e)$ and the covariances of the $U$-statistics may be
exactly computed in time $O(N)$, $\mathrm{Var}(U_a)$ and $\mathrm{Var}(U_b)$ cannot. Therefore, we approximate $\mathrm{Var}(U_a)$
and $\mathrm{Var}(U_b)$ such that when we apply formula (2.18) we get conservative estimates of $\mathrm{Var}(\hat\sigma_A^2)$,
$\mathrm{Var}(\hat\sigma_B^2)$, and $\mathrm{Var}(\hat\sigma_E^2)$ (the values of primary interest).
For intuition on what sort of approximation is needed, we give a linear expansion of $\mathrm{Var}(\hat\theta)$ in
terms of the variances and covariances of the $U$-statistics. Letting $\epsilon = \max(\epsilon_R, \epsilon_C, R/N, C/N)$, we
have as $\epsilon \to 0$
\[
M = \begin{pmatrix} N & & \\ & N & \\ & & N^2 \end{pmatrix}
\begin{pmatrix} 0 & 1 & 1 \\ 1 & 0 & 1 \\ 1 & 1 & 1 \end{pmatrix} (1 + O(\epsilon))
\]
and so
\[
M^{-1} = \begin{pmatrix} -1 & 0 & 1 \\ 0 & -1 & 1 \\ 1 & 1 & -1 \end{pmatrix}
\begin{pmatrix} N^{-1} & & \\ & N^{-1} & \\ & & N^{-2} \end{pmatrix} (1 + O(\epsilon)).
\]
It follows that
\[
\begin{aligned}
\hat\sigma_A^2 &= (U_e/N^2 - U_a/N)(1 + O(\epsilon)), \\
\hat\sigma_B^2 &= (U_e/N^2 - U_b/N)(1 + O(\epsilon)), \quad\text{and} \\
\hat\sigma_E^2 &= (U_a/N + U_b/N - U_e/N^2)(1 + O(\epsilon)).
\end{aligned} \tag{2.19}
\]
Disregarding the $O(\epsilon)$ terms,
\[
\begin{aligned}
\mathrm{Var}(\hat\sigma_A^2) &\doteq \mathrm{Var}(U_e)/N^4 + \mathrm{Var}(U_a)/N^2 - 2\,\mathrm{Cov}(U_a, U_e)/N^3, \\
\mathrm{Var}(\hat\sigma_B^2) &\doteq \mathrm{Var}(U_e)/N^4 + \mathrm{Var}(U_b)/N^2 - 2\,\mathrm{Cov}(U_b, U_e)/N^3, \quad\text{and} \\
\mathrm{Var}(\hat\sigma_E^2) &\doteq \mathrm{Var}(U_a)/N^2 + \mathrm{Var}(U_b)/N^2 + \mathrm{Var}(U_e)/N^4 \\
&\quad - 2\,\mathrm{Cov}(U_a, U_e)/N^3 - 2\,\mathrm{Cov}(U_b, U_e)/N^3 + 2\,\mathrm{Cov}(U_a, U_b)/N^2.
\end{aligned} \tag{2.20}
\]
In light of equation (2.20), to find computationally attractive but conservative approximations
of $\mathrm{Var}(\hat\theta)$ in finite samples, we use (slight) over-estimates of $\mathrm{Var}(U_a)$ and $\mathrm{Var}(U_b)$. We discuss how
to do so in Section 2.4.2.2.

In practice, when obtaining $\widehat{\mathrm{Var}}(\hat\theta)$, we plug $\hat\sigma_A^2$, $\hat\sigma_B^2$, $\hat\sigma_E^2$, and estimates of the kurtoses into
the covariance matrix of the $U$-statistics, where $\mathrm{Var}(U_a)$ and $\mathrm{Var}(U_b)$ have been replaced by their
over-estimates. Then we apply equation (2.18). We discuss estimating the kurtoses in Section 2.4.2.3.
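The plug-in step is just the sandwich product (2.18). A sketch with made-up numbers for $M$ and for a symmetric, positive semidefinite estimate $\Sigma$ of the covariance of $(U_a, U_b, U_e)$:

```python
import numpy as np

# M as in (2.10) for some small design, and a made-up covariance estimate
# Sigma of (U_a, U_b, U_e); both are illustrative numbers only.
M = np.array([[0.0, 4.0, 4.0],
              [4.0, 0.0, 4.0],
              [32.0, 32.0, 42.0]])
Sigma = np.array([[2.0, 0.3, 1.0],
                  [0.3, 3.0, 1.2],
                  [1.0, 1.2, 9.0]])

Minv = np.linalg.inv(M)
var_theta = Minv @ Sigma @ Minv.T   # sandwich formula (2.18)
# var_theta is symmetric; its diagonal estimates the variances of the
# three variance component estimates.
```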
2.4.2.2 Computable approximations of Var(U)

First, we show how to obtain over-estimates of $\mathrm{Var}(U_a)$ in time $O(N)$; the case of $\mathrm{Var}(U_b)$ is similar.
In addition to $N - R$, $\mathrm{Var}(U_a)$ contains the quantities
\[
\sum_{ir} (ZZ^T)_{ir}(1 - N_{i\bullet}^{-1})(1 - N_{r\bullet}^{-1}), \quad
\sum_{ir} N_{i\bullet}^{-1} N_{r\bullet}^{-1} (ZZ^T)_{ir}\bigl((ZZ^T)_{ir} - 1\bigr), \quad
\sum_i N_{i\bullet}(1 - N_{i\bullet}^{-1})^2, \quad\text{and}\quad
\sum_i (1 - N_{i\bullet}^{-1}).
\]
The third and fourth quantities above can be computed in $O(R)$ work after the first pass over the
data.
The first quantity is a sum over $i$ and $r$, and cannot be simplified any further. Computing it
takes more than $O(N)$ work. Since its coefficient $\sigma_B^4(\kappa_B + 2)$ is nonnegative, we must use an upper
bound to obtain an over-estimate of $\mathrm{Var}(U_a)$. We have the bound
\[
\sum_{ir} (ZZ^T)_{ir}(1 - N_{i\bullet}^{-1})(1 - N_{r\bullet}^{-1})
\le \sum_{ij}\sum_r Z_{ij} Z_{rj}(1 - N_{i\bullet}^{-1})
= \sum_j N_{\bullet j}^2 - \sum_{ij} Z_{ij} N_{\bullet j} N_{i\bullet}^{-1},
\]
which can be computed in $O(N)$ work in a second pass over the data. Other, weaker bounds may be
obtained without the second pass. An example is
\[
\sum_{ir} (ZZ^T)_{ir}(1 - N_{i\bullet}^{-1})(1 - N_{r\bullet}^{-1})
\le \sum_{ir} (ZZ^T)_{ir} = \sum_j N_{\bullet j}^2,
\]
which can be computed in $O(C)$ work.
For our motivating problems this over-estimation of variance is negligible. The true term is a
weighted sum of $(ZZ^T)_{ir}$, and we use a weight of 1 instead of $1 - N_{r\bullet}^{-1}$, while a typical $N_{r\bullet}$ will be
large. Consider first a small row $r$, with $N_{r\bullet} = 1$. Let $j(r)$ be the unique column with $Z_{rj(r)} = 1$.
Our bound replaces that row's contribution of 0 by
\[
\sum_{ij} Z_{ij} Z_{rj}(1 - N_{i\bullet}^{-1}) = \sum_i Z_{ij(r)}(1 - N_{i\bullet}^{-1}) \le 1,
\]
thereby adding at most 1 to the sum. The total from such terms is then at most the number of
singleton rows, which is in turn below $R \ll N \ll \sum_j N_{\bullet j}^2$. The latter quantity dominates that
coefficient. When $N_{r\bullet} \ge 2$, our approximation at most doubles the contribution. A near doubling of
one term, under the extreme setting where most rows have only two observations, is acceptable.
For the same reason as the first quantity, the second quantity cannot be computed in time $O(N)$,
and we upper bound it via $(ZZ^T)_{ir} \le N_{r\bullet}$, getting
\[
\begin{aligned}
\sum_{ir} N_{i\bullet}^{-1} N_{r\bullet}^{-1} (ZZ^T)_{ir}\bigl((ZZ^T)_{ir} - 1\bigr)
&\le \sum_{ir} N_{i\bullet}^{-1} N_{r\bullet}^{-1} (ZZ^T)_{ir}(N_{r\bullet} - 1) \\
&= \sum_{ij} Z_{ij} N_{i\bullet}^{-1} N_{\bullet j} - \sum_{ir} N_{i\bullet}^{-1} N_{r\bullet}^{-1} (ZZ^T)_{ir} \\
&\le \sum_{ij} Z_{ij} N_{i\bullet}^{-1} N_{\bullet j},
\end{aligned}
\]
which can be computed in $O(N)$ work on a second pass. Even this upper bound is a small part of
the variance; it is a sum of column sizes divided by row sizes. Another term in the variance is of
order $\sum_{ij} Z_{ij} N_{\bullet j}$. When typical row sizes are large, then $\sum_{ij} Z_{ij} N_{\bullet j} N_{i\bullet}^{-1} \ll \sum_{ij} Z_{ij} N_{\bullet j}$.
All but one expression in $\mathrm{Var}(U_e)$ (see (2.14)) can be computed in $O(R + C)$ time after the first
pass over the data. That one expression is
\[
\sum_{ij} Z_{ij} N_{i\bullet} N_{\bullet j}. \tag{2.21}
\]
Equation (2.21) requires a second pass over the data in time $O(N)$, because it is the sum over $i$ and
$j$ of a polynomial in $Z_{ij}$, $N_{i\bullet}$, and $N_{\bullet j}$. Hence computing $\mathrm{Var}(U_e)$ takes $O(N)$ time in total.
With the same reasoning as for (2.21), we see that $\mathrm{Cov}(U_a, U_b)$ can be computed in a second
pass over the data in time $O(N)$. This reasoning also shows that we can compute nearly every term
in $\mathrm{Cov}(U_a, U_e)$ in a second pass over the data; the exception is
\[
\sum_i N_{i\bullet}^{-1} T_{i\bullet}^2. \tag{2.22}
\]
We compute $T_{i\bullet}$ for each $i$ in a second pass over the data, but must use additional time $O(R)$ to
get (2.22). Nevertheless, the total computation time is still $O(N)$. Symmetrically, $\mathrm{Cov}(U_b, U_e)$ can
be computed in time $O(N)$ as well.
2.4.2.3 Estimating kurtoses

Under a Gaussian assumption, $\kappa_A = \kappa_B = \kappa_E = 0$. If, however, the data have heavier tails than
this, a Gaussian assumption will lead to underestimates of $\mathrm{Var}(\hat\theta)$. Therefore, we estimate the
kurtoses using the method of moments with $U$-statistics as well.

Let $\mu_{A,4} = E(a_i^4) = (\kappa_A + 3)\sigma_A^4$, $\mu_{B,4} = E(b_j^4) = (\kappa_B + 3)\sigma_B^4$, and $\mu_{E,4} = E(e_{ij}^4) = (\kappa_E + 3)\sigma_E^4$.
The fourth moment $U$-statistics we use are
\[
\begin{aligned}
W_a &= \frac12 \sum_{ijj'} N_{i\bullet}^{-1} Z_{ij} Z_{ij'} (Y_{ij} - Y_{ij'})^4, \\
W_b &= \frac12 \sum_{iji'} N_{\bullet j}^{-1} Z_{ij} Z_{i'j} (Y_{ij} - Y_{i'j})^4, \quad\text{and} \\
W_e &= \frac12 \sum_{iji'j'} Z_{ij} Z_{i'j'} (Y_{ij} - Y_{i'j'})^4.
\end{aligned} \tag{2.23}
\]
Theorem 2.4.2. Let $Y_{ij}$ follow the random effects model (2) with the observation pattern $Z_{ij}$ as
described in Section 2.1.1. Then the statistics defined at (2.23) have means
\[
\begin{aligned}
E(W_a) &= (\mu_{B,4} + 3\sigma_B^4 + 12\sigma_B^2\sigma_E^2 + \mu_{E,4} + 3\sigma_E^4)(N - R), \\
E(W_b) &= (\mu_{A,4} + 3\sigma_A^4 + 12\sigma_A^2\sigma_E^2 + \mu_{E,4} + 3\sigma_E^4)(N - C), \quad\text{and} \\
E(W_e) &= (\mu_{A,4} + 3\sigma_A^4 + 12\sigma_A^2\sigma_E^2)\Bigl(N^2 - \sum_i N_{i\bullet}^2\Bigr) \\
&\quad + (\mu_{B,4} + 3\sigma_B^4 + 12\sigma_B^2\sigma_E^2)\Bigl(N^2 - \sum_j N_{\bullet j}^2\Bigr) \\
&\quad + (\mu_{E,4} + 3\sigma_E^4)(N^2 - N) + 12\sigma_A^2\sigma_B^2\Bigl(N^2 - \sum_i N_{i\bullet}^2 - \sum_j N_{\bullet j}^2 + N\Bigr).
\end{aligned}
\]

Proof. See Section A.2.8.
Using Theorem 2.4.2, we compute estimates $\hat\mu_{A,4}$, $\hat\mu_{B,4}$, and $\hat\mu_{E,4}$ by solving the $3\times3$ system of
equations
\[
M \begin{pmatrix} \hat\mu_{A,4} \\ \hat\mu_{B,4} \\ \hat\mu_{E,4} \end{pmatrix}
= \begin{pmatrix} W_a - \hat m_a \\ W_b - \hat m_b \\ W_e - \hat m_e \end{pmatrix}, \tag{2.24}
\]
where $M$ is the same matrix that we used for the $U$-statistics in equation (2.10), with
\[
\begin{aligned}
\hat m_a &= (3\hat\sigma_B^4 + 12\hat\sigma_B^2\hat\sigma_E^2 + 3\hat\sigma_E^4)(N - R), \\
\hat m_b &= (3\hat\sigma_A^4 + 12\hat\sigma_A^2\hat\sigma_E^2 + 3\hat\sigma_E^4)(N - C), \quad\text{and} \\
\hat m_e &= (3\hat\sigma_A^4 + 12\hat\sigma_A^2\hat\sigma_E^2)\Bigl(N^2 - \sum_i N_{i\bullet}^2\Bigr) + (3\hat\sigma_B^4 + 12\hat\sigma_B^2\hat\sigma_E^2)\Bigl(N^2 - \sum_j N_{\bullet j}^2\Bigr) \\
&\quad + 3\hat\sigma_E^4(N^2 - N) + 12\hat\sigma_A^2\hat\sigma_B^2\Bigl(N^2 - \sum_i N_{i\bullet}^2 - \sum_j N_{\bullet j}^2 + N\Bigr).
\end{aligned}
\]
Then $\hat\kappa_A = \hat\mu_{A,4}/\hat\sigma_A^4 - 3$, $\hat\kappa_B = \hat\mu_{B,4}/\hat\sigma_B^4 - 3$, and $\hat\kappa_E = \hat\mu_{E,4}/\hat\sigma_E^4 - 3$.
We compute the statistics (2.23) via
\[
\begin{aligned}
W_a &= \sum_i \Bigl(\sum_j Z_{ij}(Y_{ij} - \bar Y_{i\bullet})^4 + 3 N_{i\bullet}^{-1} S_{i\bullet}^2\Bigr), \\
W_b &= \sum_j \Bigl(\sum_i Z_{ij}(Y_{ij} - \bar Y_{\bullet j})^4 + 3 N_{\bullet j}^{-1} S_{\bullet j}^2\Bigr), \quad\text{and} \\
W_e &= N \sum_{ij} Z_{ij}(Y_{ij} - \bar Y_{\bullet\bullet})^4 + 3 S_{\bullet\bullet}^2,
\end{aligned} \tag{2.25}
\]
where $\bar Y_{\bullet\bullet} = N^{-1}\sum_{ij} Z_{ij} Y_{ij}$ and $S_{\bullet\bullet} = \sum_{ij} Z_{ij}(Y_{ij} - \bar Y_{\bullet\bullet})^2$.
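The one-row version of identity (2.25) can be checked directly: the pairwise fourth-moment statistic equals the centered fourth moment plus $3S^2/n$. A sketch on made-up values:

```python
# One row's observations (made up) and its aggregates.
ys = [1.0, 2.5, 2.0, 4.0]
n = len(ys)
ybar = sum(ys) / n
S = sum((y - ybar) ** 2 for y in ys)          # centered sum of squares

# Pairwise form, as in definition (2.23) restricted to one row.
Wa_brute = 0.5 * sum((y - yp) ** 4 for y in ys for yp in ys) / n
# Aggregate form, as in (2.25).
Wa_fast = sum((y - ybar) ** 4 for y in ys) + 3 * S ** 2 / n
assert abs(Wa_brute - Wa_fast) < 1e-9
```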
Therefore, the kurtosis estimates $\hat\kappa$ require $R + C + 1$ new quantities,
\[
\sum_j Z_{ij}(Y_{ij} - \bar Y_{i\bullet})^4, \quad
\sum_i Z_{ij}(Y_{ij} - \bar Y_{\bullet j})^4, \quad\text{and}\quad
\sum_{ij} Z_{ij}(Y_{ij} - \bar Y_{\bullet\bullet})^4, \tag{2.26}
\]
beyond those used to compute $\hat\theta$. These can be computed in a second pass over the data after $\bar Y_{i\bullet}$,
$\bar Y_{\bullet j}$, and $\bar Y_{\bullet\bullet}$ have been computed in the first pass. They can also be computed in the first pass using
update formulas analogous to the second moment formulas (2.11). Such formulas are given by Pebay
(2008) [38], citing an unpublished paper by Terriberry.

Because the kurtosis estimates are used in formulas for $\widehat{\mathrm{Var}}(\hat\theta)$, and those formulas already require
a second pass over the data, it is more convenient to compute (2.26) and the sample fourth moments
in the second pass. By a similar argument as in Section 2.4.1, obtaining $\hat\kappa_A$, $\hat\kappa_B$, and $\hat\kappa_E$ has space
complexity $O(R + C)$ and time complexity $O(N)$, and is therefore scalable. As with the variance
component estimates, in practice we sometimes get kurtosis estimates less than $-2$, outside the
parameter space. In this thesis, we simply threshold the kurtoses at $-2$, in line with the common
practice of raising negative variance estimates to zero.
2.4.2.4 Algorithm summary

For clarity of exposition, here we gather all of the steps in our algorithm for estimating $\sigma_A^2$, $\sigma_B^2$, and
$\sigma_E^2$ and the variances of those estimators. An outline is shown in Figure 2.1. We assume that all
of the computations below can be done with large enough variable storage that overflow does not
occur. This may require an extended precision representation beyond 64-bit floating point, such as
that in the Python package mpmath [25].

Figure 2.1: Schematic of our algorithm. The expressions in the smallest boxes are the values
computed at each step.

The first task is to compute $\hat\theta$. In a first pass over the data, compute the counts $N$, $R$, $C$, the row
values $N_{i\bullet}$, $\bar Y_{i\bullet}$, $S_{i\bullet}$ for all unique rows $i$ in the data set, and the column values $N_{\bullet j}$, $\bar Y_{\bullet j}$, $S_{\bullet j}$ for all
unique columns $j$ in the data set, as well as $\bar Y_{\bullet\bullet}$ and $S_{\bullet\bullet}$. Incremental updates are used as described
in (2.11). Then compute
\[
U_a = \sum_i S_{i\bullet}, \quad U_b = \sum_j S_{\bullet j}, \quad U_e = N S_{\bullet\bullet},
\]
the matrix $M$ from (2.10), and $\hat\theta = (\hat\sigma_A^2, \hat\sigma_B^2, \hat\sigma_E^2)^T = M^{-1}(U_a, U_b, U_e)^T$ in time $O(R + C)$.
The second task is to compute, approximately, the variance of $\hat\theta$. First, we estimate the kurtoses.
A second pass over the data computes the centered fourth moments in (2.26). Then one calculates
the fourth order $U$-statistics of equation (2.25), solves (2.24) for the centered fourth moments, and
converts them to kurtosis estimates, all in time $O(R + C)$.

In that second pass over the data, we also compute
\[
ZN_{p,q} \equiv \sum_{ij} Z_{ij} N_{i\bullet}^p N_{\bullet j}^q \tag{2.27}
\]
for
\[
(p, q) \in \{(-1, -1),\ (1, -1),\ (-1, 1),\ (1, 1),\ (-1, 2),\ (2, -1)\},
\]
as well as $T_{i\bullet}$ and $T_{\bullet j}$ of equation (2.2) for all $i$ and $j$ in the data.
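Each $ZN_{p,q}$ is a single sum over the observations once the row and column counts are known, so the whole family costs one extra pass. A sketch on a toy data set (made-up triples):

```python
from collections import Counter

data = [(0, 0, 1.0), (0, 1, 2.5), (0, 2, 2.0), (1, 0, 4.0), (1, 2, 3.0)]
Ni = Counter(i for i, _, _ in data)   # first pass: row counts N_{i.}
Nj = Counter(j for _, j, _ in data)   # first pass: column counts N_{.j}

def ZN(p, q):
    # Second pass: ZN_{p,q} = sum over observed (i, j) of N_{i.}^p N_{.j}^q.
    return sum(Ni[i] ** p * Nj[j] ** q for i, j, _ in data)

needed = [(-1, -1), (1, -1), (-1, 1), (1, 1), (-1, 2), (2, -1)]
table = {pq: ZN(*pq) for pq in needed}
```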
Then, we estimate the variances of the $U$-statistics. Some of these next computations require
even more bits per variable than are needed to avoid overflow, because they involve subtraction in a
way that could lose precision. To estimate the variances of $U_a$ and $U_b$, we apply the upper bounds
discussed in Section 2.4.2.2 to (2.12) and (2.13) and plug in $\hat\sigma_A^2$, $\hat\sigma_B^2$, $\hat\sigma_E^2$, $\hat\kappa_A$, $\hat\kappa_B$, and $\hat\kappa_E$, calculating
in time and space $O(R + C)$
\[
\begin{aligned}
\widehat{\mathrm{Var}}(U_a) ={}& \hat\sigma_B^4(\hat\kappa_B + 2)\Bigl(\sum_j N_{\bullet j}^2 - ZN_{-1,1}\Bigr)
+ 2\hat\sigma_B^4\Bigl(ZN_{-1,1} - R\sum_i N_{i\bullet}^{-1}\Bigr) \\
&+ 4\hat\sigma_B^2\hat\sigma_E^2(N - R) + \hat\sigma_E^4(\hat\kappa_E + 2)\sum_i N_{i\bullet}(1 - N_{i\bullet}^{-1})^2 + 2\hat\sigma_E^4\sum_i (1 - N_{i\bullet}^{-1})
\end{aligned}
\]
and
\[
\begin{aligned}
\widehat{\mathrm{Var}}(U_b) ={}& \hat\sigma_A^4(\hat\kappa_A + 2)\Bigl(\sum_i N_{i\bullet}^2 - ZN_{1,-1}\Bigr)
+ 2\hat\sigma_A^4\Bigl(ZN_{1,-1} - C\sum_j N_{\bullet j}^{-1}\Bigr) \\
&+ 4\hat\sigma_A^2\hat\sigma_E^2(N - C) + \hat\sigma_E^4(\hat\kappa_E + 2)\sum_j N_{\bullet j}(1 - N_{\bullet j}^{-1})^2 + 2\hat\sigma_E^4\sum_j (1 - N_{\bullet j}^{-1}).
\end{aligned}
\]
To estimate $\mathrm{Var}(U_e)$ and the covariances of the U-statistics, we again plug the variance component and kurtosis estimates into Theorem 2.4.1 without approximation. We get $\widehat{\mathrm{Var}}(U_e)$ from (2.14), using $ZN_{1,1}$ from the second pass over the data. We get $\widehat{\mathrm{Cov}}(U_a, U_e)$ from (2.16) using $ZN_{-1,1}$, $ZN_{-1,2}$, and $T_{i\bullet}$, and $\widehat{\mathrm{Cov}}(U_b, U_e)$ from (2.17) using $ZN_{1,-1}$, $ZN_{2,-1}$, and $T_{\bullet j}$. We get $\widehat{\mathrm{Cov}}(U_a, U_b)$ from (2.15) using $ZN_{-1,-1}$. It is easily verified that these calculations also take time and space $O(R+C)$.
The final plug-in estimator of variance is
\[ \widehat{\mathrm{Var}}\begin{pmatrix}\hat\sigma_A^2\\ \hat\sigma_B^2\\ \hat\sigma_E^2\end{pmatrix} = M^{-1}\begin{pmatrix}\widehat{\mathrm{Var}}(U_a) & \widehat{\mathrm{Cov}}(U_a, U_b) & \widehat{\mathrm{Cov}}(U_a, U_e)\\ \widehat{\mathrm{Cov}}(U_b, U_a) & \widehat{\mathrm{Var}}(U_b) & \widehat{\mathrm{Cov}}(U_b, U_e)\\ \widehat{\mathrm{Cov}}(U_e, U_a) & \widehat{\mathrm{Cov}}(U_e, U_b) & \widehat{\mathrm{Var}}(U_e)\end{pmatrix}(M^{-1})^{\mathsf T} \tag{2.28} \]
where $M$ is the matrix in (2.10).
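Once the six variance and covariance estimates are in hand, the sandwich product in (2.28) is a constant-time $3\times 3$ computation. An illustrative sketch (function and argument names are assumptions, not the dissertation's code):

```python
import numpy as np

def plugin_variance(m, var_u, cov_ab, cov_ae, cov_be):
    """Plug-in sandwich estimate (2.28): M^{-1} V (M^{-1})^T, where V
    collects the estimated variances/covariances of (U_a, U_b, U_e).
    var_u is the triple (Var(Ua), Var(Ub), Var(Ue))."""
    v = np.array([
        [var_u[0], cov_ab, cov_ae],
        [cov_ab, var_u[1], cov_be],
        [cov_ae, cov_be, var_u[2]],
    ])
    m_inv = np.linalg.inv(m)
    # Symmetric by construction, since V is symmetric.
    return m_inv @ v @ m_inv.T
```

The result is the estimated covariance matrix of $(\hat\sigma_A^2, \hat\sigma_B^2, \hat\sigma_E^2)$.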
Aggregating the computation times and counting the number of intermediate values we must
calculate, we see that our algorithm takes time O(N) and space O(R+ C).
2.4.3 Asymptotic Normality of Estimates
We can bound the bias and variance of our estimates, only requiring that R/N and C/N be bounded
away from one.
Theorem 2.4.3. Suppose that $\max(R/N, C/N) \le \theta$ for some $\theta < 1$ and let $\varepsilon = \max(\varepsilon_R, \varepsilon_C)$. Then
\begin{align*}
\mathbb{E}(\hat\sigma_A^2) &= (\sigma_A^2 + \Upsilon)(1 + O(\varepsilon)), \\
\mathbb{E}(\hat\sigma_B^2) &= (\sigma_B^2 + \Upsilon)(1 + O(\varepsilon)), \quad\text{and} \\
\mathbb{E}(\hat\sigma_E^2) &= (\sigma_E^2 + \Upsilon)(1 + O(\varepsilon)),
\end{align*}
where
\[ \Upsilon \equiv \sigma_A^2\,\frac{\sum_i N_{i\bullet}^2}{N^2} + \sigma_B^2\,\frac{\sum_j N_{\bullet j}^2}{N^2} + \frac{\sigma_E^2}{N} = O(\varepsilon). \]
Furthermore,
\[ \max\bigl(\mathrm{Var}(\hat\sigma_A^2),\, \mathrm{Var}(\hat\sigma_B^2),\, \mathrm{Var}(\hat\sigma_E^2)\bigr) = O\Bigl(\frac{\sum_i N_{i\bullet}^2}{N^2} + \frac{\sum_j N_{\bullet j}^2}{N^2}\Bigr) = O(\varepsilon).
\]
Proof. See Section A.3.1.
Theorem 2.4.3 has the same variance rate for all variance components. Both bias and variance are $O(\varepsilon)$, so a (conservative) effective sample size is then $O(1/\varepsilon)$. The quantity $\Upsilon$ appearing in Theorem 2.4.3 is $\mathrm{Var}(Y_{\bullet\bullet})$, where $Y_{\bullet\bullet} = (1/N)\sum_{ij} Z_{ij} Y_{ij}$. The variances of the variance components contain quantities similar to $\Upsilon$, although kurtoses and other quantities appear in their implied constants.
Furthermore, under asymptotic conditions, we may obtain simple expressions for the covariance
matrix of our method of moments estimators.
Theorem 2.4.4. As described in Section 2.1.1, suppose that
\[ N_{i\bullet} \le \varepsilon N, \quad N_{\bullet j} \le \varepsilon N, \quad R \le \varepsilon N, \quad C \le \varepsilon N, \quad N \le \varepsilon\sum_i N_{i\bullet}^2, \quad\text{and}\quad N \le \varepsilon\sum_j N_{\bullet j}^2 \]
hold for the same small $\varepsilon > 0$ and that
\[ 0 < \kappa_A + 2,\; \kappa_B + 2,\; \kappa_E + 2,\; \sigma_A^4,\; \sigma_B^4,\; \sigma_E^4 < \infty. \]
Suppose additionally that
\[ \sum_{ij} Z_{ij} N_{i\bullet}^{-1} N_{\bullet j} \le \varepsilon\sum_i N_{i\bullet}^2 \quad\text{and}\quad \sum_{ij} Z_{ij} N_{i\bullet} N_{\bullet j}^{-1} \le \varepsilon\sum_j N_{\bullet j}^2 \tag{2.29} \]
hold. Then
\begin{align*}
\mathrm{Var}(U_a) &= \sigma_B^4(\kappa_B + 2)\sum_j N_{\bullet j}^2\,(1 + O(\varepsilon)), \\
\mathrm{Var}(U_b) &= \sigma_A^4(\kappa_A + 2)\sum_i N_{i\bullet}^2\,(1 + O(\varepsilon)), \quad\text{and} \\
\mathrm{Var}(U_e) &= \Bigl(\sigma_A^4(\kappa_A + 2)N^2\sum_i N_{i\bullet}^2 + \sigma_B^4(\kappa_B + 2)N^2\sum_j N_{\bullet j}^2\Bigr)(1 + O(\varepsilon)).
\end{align*}
Similarly,
\begin{align*}
\mathrm{Cov}(U_a, U_b) &= \sigma_E^4(\kappa_E + 2)N\,(1 + O(\varepsilon)), \\
\mathrm{Cov}(U_a, U_e) &= \sigma_B^4(\kappa_B + 2)N\sum_j N_{\bullet j}^2\,(1 + O(\varepsilon)), \quad\text{and} \\
\mathrm{Cov}(U_b, U_e) &= \sigma_A^4(\kappa_A + 2)N\sum_i N_{i\bullet}^2\,(1 + O(\varepsilon)).
\end{align*}
Finally, $\hat\sigma_A^2$, $\hat\sigma_B^2$, and $\hat\sigma_E^2$ are asymptotically uncorrelated as $\varepsilon \to 0$, with
\begin{align*}
\mathrm{Var}(\hat\sigma_A^2) &= \sigma_A^4(\kappa_A + 2)\frac{1}{N^2}\sum_i N_{i\bullet}^2\,(1 + O(\varepsilon)), \\
\mathrm{Var}(\hat\sigma_B^2) &= \sigma_B^4(\kappa_B + 2)\frac{1}{N^2}\sum_j N_{\bullet j}^2\,(1 + O(\varepsilon)), \quad\text{and} \\
\mathrm{Var}(\hat\sigma_E^2) &= \sigma_E^4(\kappa_E + 2)\frac{1}{N}\,(1 + O(\varepsilon)).
\end{align*}
Proof. See Section A.3.2.
Equation (2.29) is a new condition. The first part can be written as $\sum_{ij} Z_{ij} N_{i\bullet}^{-1} N_{\bullet j} \le \varepsilon \sum_{ij} Z_{ij} N_{i\bullet}$. It means that the sum (over observed entries) of row sizes is much larger than the corresponding sum of column sizes divided by row sizes. These two conditions impose a mild 'squareness' on the observation pattern $Z$.
In an asymptotic setting with ε → 0 the three variance estimates become uncorrelated. Also
each of them has the same variance it would have had if the other variance components had truly
been zero.
In addition, under the asymptotic regime in Section 2.1.1, the estimates are normally distributed,
which can be proven via a martingale central limit theorem [21].
From (2.19), as $N \to \infty$,
\[ \begin{pmatrix}\hat\sigma_A^2\\ \hat\sigma_B^2\\ \hat\sigma_E^2\end{pmatrix} = \begin{pmatrix}-N^{-1} & 0 & N^{-2}\\ 0 & -N^{-1} & N^{-2}\\ N^{-1} & N^{-1} & -N^{-2}\end{pmatrix}\begin{pmatrix}U_a\\ U_b\\ U_e\end{pmatrix}. \tag{2.30} \]
Therefore, to show that $\hat\sigma_A^2$, $\hat\sigma_B^2$, and $\hat\sigma_E^2$ are asymptotically normal, we show that $U_a$, $U_b$, and $U_e$ are asymptotically jointly normal. More precisely, we show that $\tilde U_a = U_a/\sqrt{\sum_j N_{\bullet j}^2}$, $\tilde U_b = U_b/\sqrt{\sum_i N_{i\bullet}^2}$, and $\tilde U_e = U_e/\bigl(N\sqrt{\sum_i N_{i\bullet}^2 + \sum_j N_{\bullet j}^2}\bigr)$ are asymptotically jointly normal. We prove that their joint distribution is Gaussian, with a martingale construction found by combining ideas from Serfling (2009) [45] and Gu (2008) [18].
Our martingales are constructed as follows. Let $a = (a_1, a_2, \ldots, a_R)$ and $b = (b_1, b_2, \ldots, b_C)$. Next define the increasing sequence of $\sigma$-algebras
\begin{align*}
\mathcal{F}_{N,k,0} &= \sigma(b_j,\, 1 \le j \le k), \quad k = 1, \ldots, C, \\
\mathcal{F}_{N,q,1} &= \sigma(b,\, a_i,\, 1 \le i \le q), \quad q = 1, \ldots, R, \quad\text{and} \\
\mathcal{F}_{N,\ell,m,2} &= \sigma(a,\, b,\, e_{ij},\, 1 \le i \le \ell - 1,\, 1 \le j \le C,\, e_{\ell t},\, 1 \le t \le m), \quad \ell = 1, \ldots, R,\; m = 1, \ldots, C.
\end{align*}
Then define random variables
\begin{align*}
S^A_{N,k,0} &= \sum_i N_{i\bullet}^{-1} Z_{ik}\Bigl(\sum_{j:\,j \ne k} Z_{ij}\bigl(b_k^2 - \sigma_B^2\bigr) - 2\sum_{j:\,j < k} Z_{ij}\, b_j b_k\Bigr), \\
S^A_{N,q,1} &= 0, \\
S^A_{N,\ell,m,2} &= N_{\ell\bullet}^{-1} Z_{\ell m}\Bigl(\sum_{j:\,j \ne m} Z_{\ell j}\bigl(2(b_m - b_j)e_{\ell m} + e_{\ell m}^2 - \sigma_E^2\bigr) - 2\sum_{j:\,j < m} Z_{\ell j}\, e_{\ell j} e_{\ell m}\Bigr), \\
S^B_{N,k,0} &= 0, \\
S^B_{N,q,1} &= \sum_j N_{\bullet j}^{-1} Z_{qj}\Bigl(\sum_{i:\,i \ne q} Z_{ij}\bigl(a_q^2 - \sigma_A^2\bigr) - 2\sum_{i:\,i < q} Z_{ij}\, a_i a_q\Bigr), \\
S^B_{N,\ell,m,2} &= N_{\bullet m}^{-1} Z_{\ell m}\Bigl(\sum_{i:\,i \ne \ell} Z_{im}\bigl(2(a_\ell - a_i)e_{\ell m} + e_{\ell m}^2 - \sigma_E^2\bigr) - 2\sum_{i:\,i < \ell} Z_{im}\, e_{im} e_{\ell m}\Bigr), \\
S^E_{N,k,0} &= \sum_{ii'} Z_{ik}\Bigl(\sum_{j:\,j \ne k} Z_{i'j}\bigl(b_k^2 - \sigma_B^2\bigr) - 2\sum_{j:\,j < k} Z_{i'j}\, b_j b_k\Bigr), \\
S^E_{N,q,1} &= \sum_{jj'} Z_{qj}\Bigl(\sum_{i:\,i \ne q} Z_{ij'}\bigl(a_q^2 - \sigma_A^2 + 2(b_j - b_{j'})a_q\bigr) - 2\sum_{i:\,i < q} Z_{ij'}\, a_i a_q\Bigr), \quad\text{and} \\
S^E_{N,\ell,m,2} &= Z_{\ell m}\Bigl(\sum_{ij:\,ij \ne \ell m} Z_{ij}\bigl(e_{\ell m}^2 - \sigma_E^2 + 2(a_\ell - a_i)e_{\ell m} + 2(b_m - b_j)e_{\ell m}\bigr) - 2\sum_{\substack{ij:\; i < \ell \text{ or}\\ i = \ell,\, j < m}} Z_{ij}\, e_{ij} e_{\ell m}\Bigr).
\end{align*}
Lemma 2.4.2. With the definitions above, $\{S^A_{N,k,0}, S^A_{N,q,1}, S^A_{N,\ell,m,2}\}$, $\{S^B_{N,k,0}, S^B_{N,q,1}, S^B_{N,\ell,m,2}\}$, and $\{S^E_{N,k,0}, S^E_{N,q,1}, S^E_{N,\ell,m,2}\}$ constitute martingale difference sequences with
\begin{align*}
U_a - \mathbb{E}(U_a) &= \sum_k S^A_{N,k,0} + \sum_q S^A_{N,q,1} + \sum_{\ell m} S^A_{N,\ell,m,2}, \\
U_b - \mathbb{E}(U_b) &= \sum_k S^B_{N,k,0} + \sum_q S^B_{N,q,1} + \sum_{\ell m} S^B_{N,\ell,m,2}, \quad\text{and} \\
U_e - \mathbb{E}(U_e) &= \sum_k S^E_{N,k,0} + \sum_q S^E_{N,q,1} + \sum_{\ell m} S^E_{N,\ell,m,2}.
\end{align*}
Proof. See Section A.3.3.
We use the following martingale central limit theorem from Hall & Heyde (1980) [21] to obtain
Theorem 2.4.6 below.
Theorem 2.4.5. For each $n$, let $M_{ns}$ be a martingale difference sequence with filtration $\mathcal{F}_{ns}$. Define $\sigma_{ns}^2 = \mathbb{E}(M_{ns}^2 \mid \mathcal{F}_{n,s-1})$ and suppose that $\sum_s \sigma_{ns}^2 \xrightarrow{p} \sigma^2$ for some constant $\sigma^2 > 0$. If for each $\varepsilon > 0$,
\[ \lim_{n\to\infty} \sum_s \mathbb{E}\bigl(M_{ns}^2\,\mathbb{1}\{|M_{ns}| \ge \varepsilon\} \mid \mathcal{F}_{n,s-1}\bigr) = 0, \]
then $\sum_s M_{ns}$ converges in distribution to $\sigma\,\mathcal{N}(0, 1)$ as $n \to \infty$.
We will need a growth condition on Ni• and N•j that is mild but stricter than we had from
Section 2.1.1. Specifically we require
\[ \frac{\max_i N_{i\bullet} + \max_j N_{\bullet j}}{\min\Bigl(\sqrt{\sum_i N_{i\bullet}^2},\; \sqrt{\sum_j N_{\bullet j}^2}\Bigr)} \to 0. \tag{2.31} \]
Under (2.31) any row or column size squared is ultimately small compared to sums of squared row
or column sizes.
Theorem 2.4.6. Suppose that $a_i$, $b_j$, and $e_{ij}$ are bounded and the assumptions in Theorem 2.4.4 hold. In addition, suppose that
\[ \frac{\sum_i N_{i\bullet}^2}{\sum_j N_{\bullet j}^2} \to c, \qquad \sum_{ir} (ZZ^{\mathsf T})_{ir}\, N_{i\bullet}^{-1} N_{r\bullet}^{-1} \le \varepsilon \sum_j N_{\bullet j}^2, \quad\text{and}\quad \sum_{js} (Z^{\mathsf T}Z)_{js}\, N_{\bullet j}^{-1} N_{\bullet s}^{-1} \le \varepsilon \sum_i N_{i\bullet}^2 \]
for some constant $c > 0$. Then, if (2.31) holds, the normalized statistics $\tilde U_a$, $\tilde U_b$, and $\tilde U_e$ are asymptotically normal with mean
\[ \begin{pmatrix} \dfrac{(\sigma_B^2 + \sigma_E^2)(N - R)}{\sqrt{\sum_j N_{\bullet j}^2}} \\[2.5ex] \dfrac{(\sigma_A^2 + \sigma_E^2)(N - C)}{\sqrt{\sum_i N_{i\bullet}^2}} \\[2.5ex] \dfrac{\sigma_A^2\bigl(N^2 - \sum_i N_{i\bullet}^2\bigr) + \sigma_B^2\bigl(N^2 - \sum_j N_{\bullet j}^2\bigr) + \sigma_E^2\bigl(N^2 - N\bigr)}{N\sqrt{\sum_i N_{i\bullet}^2 + \sum_j N_{\bullet j}^2}} \end{pmatrix} \]
and covariance
\[ \begin{pmatrix} \sigma_B^4(\kappa_B + 2) & \dfrac{\sigma_E^4(\kappa_E + 2)\,N}{\sqrt{\sum_i N_{i\bullet}^2 \sum_j N_{\bullet j}^2}} & \dfrac{\sigma_B^4(\kappa_B + 2)\sum_j N_{\bullet j}^2}{\sqrt{\sum_j N_{\bullet j}^2\bigl(\sum_i N_{i\bullet}^2 + \sum_j N_{\bullet j}^2\bigr)}} \\[3ex] * & \sigma_A^4(\kappa_A + 2) & \dfrac{\sigma_A^4(\kappa_A + 2)\sum_i N_{i\bullet}^2}{\sqrt{\sum_i N_{i\bullet}^2\bigl(\sum_i N_{i\bullet}^2 + \sum_j N_{\bullet j}^2\bigr)}} \\[3ex] * & * & \dfrac{\sigma_A^4(\kappa_A + 2)\sum_i N_{i\bullet}^2 + \sigma_B^4(\kappa_B + 2)\sum_j N_{\bullet j}^2 + \sigma_E^4(\kappa_E + 2)\,N}{\sum_i N_{i\bullet}^2 + \sum_j N_{\bullet j}^2} \end{pmatrix}. \]
Proof. See Section A.3.4.
From Theorem 2.4.6 and (2.30), it is clear that the estimated variance components are also asymptotically normal. We can then construct asymptotic confidence intervals for the estimated variance components using Theorem 2.4.4 if so desired. Of course, such confidence intervals should not include negative values. For the data sizes of interest here this is unlikely to happen, but in practice we could truncate the intervals at zero if necessary. However, we have found that authors in the literature rarely give confidence intervals for estimated variance components.
2.5 Experimental Results
2.5.1 Simulations
First, we compare the performance of our method of moments algorithm (MoM), described in Sec-
tion 2.4.2.4 and implemented in Python, to maximum likelihood estimation as implemented in the
commonly used R package for mixed models, lme4 [7]. The package maximizes the profiled log-
likelihood with respect to the variance components and then estimates the regression coefficients via
least squares [5]. Computing the profiled log-likelihood requires finding the Cholesky decomposition
of a square matrix with dimension R+ C, which uses O((R+ C)3), or O(N3/2), time.
We assume that the random effects are normally distributed. We do so because it is the as-
sumption implicitly made by lme4, and so we expect that these are the conditions under which lme4
performs best. These experiments were carried out on a Windows machine with 8 GB of memory and an Intel i5 processor.
For our algorithm, we consider a range of data sizes, with $R = C$ ranging from 10 to 500. At each fixed value of $R = C$, for 100 iterations, we generate data according to model (2) with $\sigma_A^2 = 2$, $\sigma_B^2 = 0.5$, and $\sigma_E^2 = 1$. Exactly 25 percent of the cells were randomly chosen to be observed. We measure the CPU time needed to obtain the variance component estimates $\hat\sigma_A^2$, $\hat\sigma_B^2$, and $\hat\sigma_E^2$ (labeled short) and the CPU time needed to obtain the variance component estimates as well as conservative estimates of the variances of those estimates (labeled long). In addition, we measure the mean squared errors of the variance component estimates. At the end, those five measurements were averaged over the 100 iterations.
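The data-generating step above can be sketched as follows. This is a hypothetical helper illustrating the stated set-up (Gaussian random effects, exactly 25% of cells observed, grand mean taken as zero for simplicity), not the author's actual simulation script.

```python
import numpy as np

def simulate_crossed(r, c, sa2=2.0, sb2=0.5, se2=1.0, rng=None):
    """Draw one simulated data set: Y_ij = a_i + b_j + e_ij with
    variances (sa2, sb2, se2), observing exactly 25% of the r*c cells."""
    rng = np.random.default_rng() if rng is None else rng
    a = rng.normal(0.0, np.sqrt(sa2), r)
    b = rng.normal(0.0, np.sqrt(sb2), c)
    # Choose exactly (r*c)//4 distinct cells, matching the 25% design.
    idx = rng.choice(r * c, size=(r * c) // 4, replace=False)
    rows, cols = idx // c, idx % c
    e = rng.normal(0.0, np.sqrt(se2), rows.size)
    return rows, cols, a[rows] + b[cols] + e
```

Each of the 100 iterations then feeds one such triple `(rows, cols, y)` to the estimator and to lme4's input format.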
With regard to lme4, our simulation steps are nearly the same, with the following differences.
Due to the slowness of lme4, we only consider data sizes with R = C up to 300. In addition, because
lme4 finds the maximum likelihood variance component estimates, the variances of those estimates
were computed asymptotically using the inverse expected Fisher information matrix. The simulation
results are shown in Figure 2.2.
Figure 2.2: Simulation results: log-log plots of the five recorded measurements against the number of observations. Panels: (a) CPU time, (b) MSE of $\hat\sigma_A^2$, (c) MSE of $\hat\sigma_B^2$, (d) MSE of $\hat\sigma_E^2$.
Note that lme4 always takes more time than our algorithm. From Figure 2.2(a), we see that
our method of moments algorithm takes time at most linear in the data size to compute both the
variance component estimates and conservative estimates of the variances of those estimates. For
lme4 the computation time is always superlinear in the data size, for data sets large enough that
the startup cost of the package is no longer dominant.
The MSEs of $\hat\sigma_A^2$ for our algorithm and lme4 are comparable. Moreover, both decrease sublinearly with the data size. The same is true for the MSEs of $\hat\sigma_B^2$. However, the MSE of $\hat\sigma_E^2$ in lme4 is noticeably smaller than that of our algorithm; this appears to be the price we pay for the decreased computation time. In both cases, though, the MSE of $\hat\sigma_E^2$ decreases approximately linearly with the data size. In sum, our algorithm provides estimates that are nearly as efficient as MLE and is scalable to huge data sets, unlike MLE.
We also considered data with other aspect ratios, e.g. R = 2C and C = 2R. The CPU time
required and the MSEs of the variance component estimates behave similarly to the case R = C.
For the sake of brevity, we do not explicitly show the results of those simulations here.
Remark: In our simulations we investigated the conservativeness of the estimated variances of $\hat\sigma_A^2$, $\hat\sigma_B^2$, and $\hat\sigma_E^2$ due to the over-estimates described in Section 2.4.2.2. When $R = C$, it appears that they are conservative by a factor of at most four. The factors corresponding to $\hat\sigma_A^2$ and $\hat\sigma_B^2$ approached 1 as the data size increased. Therefore, the upper bounds appear to be a reasonable approximation of the true variances of the variance component estimates.
2.5.2 Examples from Social Media
We illustrate our algorithm on three real world data sets that are too large for lme4 to handle in a
timely manner.
The first, from Yahoo Webscope [49], contains a random sample of ratings of movies by users,
which are grades from A+ to F converted into a numeric scale. There are 211,231 ratings by 7,642
users on 11,916 movies, filtered with the condition that each user rates at least ten movies. Only
0.23 percent of the user-movie matrix is observed.
The estimated variances of the user random effect, the movie random effect, and the error are 2.57, 2.86, and 7.68. The estimated kurtoses are $-2$, $-2$, and 6.56. Conservative estimates of the variances of the estimated variance components are 0.0030, 0.0018, and 0.0060, with corresponding standard errors 0.055, 0.042, and 0.077. Therefore, most of the variation in a movie rating comes from error or the interaction between movies and users; this is not surprising, since different people have different tastes.
The second data set, from Yahoo Webscope as well [50], contains ratings of 1000 songs by 15,400
users, on a scale of 1 to 5. The first group of 10,000 users were randomly selected on the condition
that they had rated at least 10 of the 1000 songs. The rest of the users were randomly selected
from responders on a survey that asked them to rate a random subset of 10 of the 1000 songs. The
songs were selected to have at least 500 ratings. Here, about 2 percent of the user-song pairs were
observed.
The estimated variances of the user random effect, the song random effect, and the error are 0.97, 0.24, and 1.30. The estimated kurtoses are $-2$, $-2$, and 3.31. Conservative estimates of the variances of the estimated variance components are $4.5\times 10^{-5}$, $10^{-5}$, and $5.8\times 10^{-5}$, yielding conservative standard errors of about 0.007, 0.003, and 0.008. In this case there is negligible sampling variability in the variance component estimates. For determining the rating, the user effect is dominant over the song effect, but as in the previous example, the greatest variation comes from error or interactions between song and user.
The third data set, from Last.fm [29], contains the numbers of times artists' songs are played by about 360,000 users. Only the counts for the top $k$ (for some $k$) artists for each user are recorded. The users are randomly selected. This data set is extremely sparse; only about 0.03 percent of user-artist pairs are observed.
The estimated variances of the user random effect, the artist random effect, and the error are 1.65, 0.22, and 0.27. The estimated kurtoses are 0.019, $-2$, and 23.14. Conservative estimates of the variances of the estimated variance components are $1.68\times 10^{-5}$, $4.06\times 10^{-7}$, and $1.37\times 10^{-6}$, yielding small standard errors of about 0.004, 0.0006, and 0.001. The biggest source of variation in the number of plays is the user. The kurtosis of the row effect is nearly zero, indicating possible normality.
In all three data sets, the standard errors of the estimated variance components are at least one
order of magnitude smaller than the estimates themselves. This lends credence to the argument that
we do not need to truncate confidence intervals at zero in practice.
Moreover, in the data sets at least one of the estimated kurtoses was −2, which would be
unexpected if the model is correctly specified. However, if model (2) does not fit the data well, such
behavior may occur. We suspect that more realistic models incorporating fixed effects and SVD-like
interactions would reduce the prevalence of such kurtosis estimates.
Chapter 3

Linear Mixed Model with Crossed Random Effects
In this chapter we return to the linear mixed model with crossed random effects, model (1).
Our first task is to obtain consistent estimates of $\beta$ and the variance components $\sigma_A^2$, $\sigma_B^2$, and $\sigma_E^2$
with computational cost O(N) and space cost O(R + C). Our second and more challenging task is
to find a reliable estimate of the variance of those estimates, also in O(N) time and O(R+C) space,
in order to perform inference. We could compute finite-sample variances of the estimates, but they
are complex and not efficiently computable. Hence, we consider asymptotic variances in the regime
described in Section 2.1.1, which are much simpler. These constraints would allow our algorithm
to scale to data sets with millions of observations. Our view is that algorithms with costs growing
faster than N are, in this setting, almost like asking for an oracle. In the case of Stitch Fix, our
algorithms will give confidence intervals on quantities such as the effects of certain styles on ratings.
To obtain a scalable algorithm and avoid parametric assumptions, we impose a major simplification. In forming our estimate $\hat\beta$, we ignore the correlations arising from one of the two factors. This allows us to easily invert the corresponding covariance matrix of the data. Our estimate $\hat\beta$ sacrifices some efficiency. When computing $\widehat{\mathrm{Var}}(\hat\beta)$ we do take account of both sources of correlation. As a result, we get an estimator that is inefficient compared to an oracle not constrained to $O(N)$ cost, but we get a reliable variance estimate for that estimator.
A second simplification is to use the method of moments to estimate the variance components,
as in Chapter 2. The method of moments has many advantages in the big data world. It is easily
parallelizable, makes no parametric assumptions, and has no tuning parameters to select. The
method of moments does come with a drawback. It can lead to estimators that do not obey known
constraints. The best known of these are negative variance estimates. It can also lead to kurtosis
estimates below −2.
CHAPTER 3. LINEAR MIXED MODEL WITH CROSSED RANDOM EFFECTS 34
3.1 Additional Notation
The same notation and asymptotic regime described in Section 2.1.1 apply here, with two additional definitions. Let $v_{1,i}$ be the $N$-dimensional vector with ones in entries $\sum_{r=1}^{i-1} N_{r\bullet} + 1$ through $\sum_{r=1}^{i} N_{r\bullet}$ and zeros elsewhere, and let $w_{ik}$ be a set of $N_{i\bullet} - 1$ unit vectors orthogonal to $v_{1,i}$, each supported on those same entries. Similarly, let $v_{2,j}$ be the $N$-dimensional vector with ones in entries $\sum_{s=1}^{j-1} N_{\bullet s} + 1$ through $\sum_{s=1}^{j} N_{\bullet s}$ and zeros elsewhere, and let $w_{jk}$ be defined like $w_{ik}$.
Moreover, when there are covariates xij we need to rule out degenerate settings where the sample
variance of xij does not grow with N or where it is dominated by a handful of observations. We
add some such conditions when we discuss central limit theorems in Section 3.4.2.
3.2 Previous Work
Much of the previous literature for parameter estimation and inference for linear mixed models
is similar to and builds upon the literature for crossed random effects models. Again, there are
two categories of algorithms, moment-based and likelihood-based, with the same advantages and
disadvantages as described in Section 2.2. The moment-based algorithms are mostly generalizations
of the moment-based algorithms for crossed random effects models, described in Section 2.2.1. The
likelihood-based algorithms are the same for both crossed random effects models and linear mixed
models with crossed random effects.
Note that model (1) can be written as
\[ Y = X\beta + Z_a a + Z_b b + e, \quad\text{with}\quad a \sim (0, \sigma_A^2 I_R),\; b \sim (0, \sigma_B^2 I_C),\; e \sim (0, \sigma_E^2 I_N) \text{ (independently)}, \tag{3.1} \]
where $Y$ is the vector of $Y_{ij}$, $X$ is the $N \times p$ matrix of predictors, $1_N$ is the $N$-dimensional vector of ones, $Z_a$ is an $N \times R$ binary matrix with entry $(k, i)$ equal to 1 if the $k$th observation is from level $i$ of the first factor, $a$ is the vector of $a_i$, $Z_b$ is an $N \times C$ binary matrix with entry $(k, j)$ equal to 1 if the $k$th observation is from level $j$ of the second factor, $b$ is the vector of $b_j$, and $I_N$ is the identity matrix of dimension $N$. We utilize this reformulation for the rest of this section.
3.2.1 Moment-based Approaches
Henderson II Henderson I, described in Section 2.2.1, can be generalized to linear mixed models.
The generalization is called Henderson II, and we give a description of it following Searle et al.
(2006) [44]. The intuition of the method is to somehow transform the linear mixed model with
crossed random effects to a simple crossed random effects model, and then apply a modified version
of Henderson I on the resulting data.
First, estimate $\beta$ via $\hat\beta = LY$ for some $p \times N$ matrix $L$; the choice of $L$ is discussed next. Then, the residuals are
\[ Y - X\hat\beta = (I - XL)Y = (I - XL)X\beta + (I - XL)Z_a a + (I - XL)Z_b b + (I - XL)e. \]
From here, there are two directions to go with the choice of L. First, we can choose L so that
β does not appear in the residuals, i.e. such that X = XLX [43]. That requirement is satisfied
by the OLS projection L = (XTX)−1XT. Then, Y −Xβ = (I −H)Zaa + (I −H)Zbb + (I −H)e
for H = X(XTX)−1XT. A modified version of Henderson I accounting for the fact that there is
a projection matrix multiplying Zaa and Zbb is used to estimate the variance components. The
regression coefficient β is estimated at the end using some sort of least squares estimator.
The disadvantage of this approach is that the expectations of the estimating equations become much more complex to compute due to the presence of the projection matrix $H$. Moreover, no analysis of the variances of the estimates is available. Our proposed algorithm, described in Section 3.3, is similar in that it also alternates between estimating $\beta$ and the variance components and starts by estimating $\beta$ via OLS. However, we ignore the presence of the projection matrix $H$ and directly estimate the variance components using our alternative to Henderson I described in Section 2.4.1. We prove that in the asymptotic setting, ignoring $H$ does not matter and we still obtain consistent and asymptotically Gaussian estimates (with known variances) of $\beta$ and the variance components. In addition, unlike Henderson II, after estimating $\beta$, we adjust our estimates of the variance components as well.
Because a more difficult version of Henderson I must be used with the above choice of L, Searle
et al. (2006) [44] describe another way to choose $L$, with the following criteria:
• $XLZ_a = XLZ_b = 0$,
• $XL1_N = \delta 1_N$ for some scalar $\delta$, and
• $X - XLX = 1_N\tau^{\mathsf T}$ for some vector $\tau$.
Then, $Y - X\hat\beta = (\tau^{\mathsf T}\beta - \delta)1_N + Z_a a + Z_b b + (I - XL)e$, on which the original version of Henderson I can be applied to estimate the variance components.
The main disadvantage of this approach is that the construction of such a matrix L requires
inverting a square matrix with dimension approximately the total number of levels of the two factors,
which requires time O((R+ C)3) = O(N3/2). Moreover, there is no analysis of the variances of the
final estimates.
Henderson III Henderson III can also be applied to the linear mixed model with crossed random effects. The estimating equations are now, with $R(\cdot)$ defined as in Section 2.2.1,
\begin{align*}
R(a, b \mid \beta) &= R(\beta, a, b) - R(\beta), \\
R(b \mid \beta, a) &= R(\beta, a, b) - R(\beta, a), \quad\text{and} \\
\mathrm{SSE} &= Y^{\mathsf T}Y - R(\beta, a, b),
\end{align*}
where $\mathbb{E}(R(a, b \mid \beta))$ equals
\[ \sigma_A^2\,\mathrm{tr}\bigl(Z_a^{\mathsf T}(I_N - H)Z_a\bigr) + \sigma_B^2\,\mathrm{tr}\bigl(Z_b^{\mathsf T}(I_N - H)Z_b\bigr) + \sigma_E^2\bigl(\mathrm{rank}([X\; Z_a\; Z_b]) - \mathrm{rank}(X)\bigr) \]
and $H = X(X^{\mathsf T}X)^{-1}X^{\mathsf T}$. The disadvantage of this method is that it requires inverting a square matrix with dimension the total number of levels of the two factors, which requires time $O((R + C)^3) = O(N^{3/2})$. Moreover, the variances of the resulting estimates are very complex to analyze.
Sufficient statistics We could also use an algorithm developed for generalized linear mixed models [24]. Suppose that the random effects are normally distributed. Then, a set of estimating equations based on the sufficient statistics of the data distribution are
\[ \sum_{ij} Z_{ij} x_{ij} Y_{ij} = X^{\mathsf T}Y, \qquad \sum_{ijj'} Z_{ij}Z_{ij'}Y_{ij}Y_{ij'}, \qquad \sum_{ii'j} Z_{ij}Z_{i'j}Y_{ij}Y_{i'j}, \quad\text{and}\quad \sum_{iji'j'} Z_{ij}Z_{i'j'}Y_{ij}Y_{i'j'}, \]
with expectations
\begin{align*}
X^{\mathsf T}\mathbb{E}(Y) &= X^{\mathsf T}X\beta, \\
\mathbb{E}\Bigl(\sum_{ijj'} Z_{ij}Z_{ij'}Y_{ij}Y_{ij'}\Bigr) &= \beta^{\mathsf T}\Bigl(\sum_{ijj'} Z_{ij}Z_{ij'}x_{ij}x_{ij'}^{\mathsf T}\Bigr)\beta + \sigma_A^2\sum_i N_{i\bullet}^2 + \sigma_B^2 N + \sigma_E^2 N, \\
\mathbb{E}\Bigl(\sum_{ii'j} Z_{ij}Z_{i'j}Y_{ij}Y_{i'j}\Bigr) &= \beta^{\mathsf T}\Bigl(\sum_{ii'j} Z_{ij}Z_{i'j}x_{ij}x_{i'j}^{\mathsf T}\Bigr)\beta + \sigma_B^2\sum_j N_{\bullet j}^2 + \sigma_A^2 N + \sigma_E^2 N, \quad\text{and} \\
\mathbb{E}\Bigl(\sum_{iji'j'} Z_{ij}Z_{i'j'}Y_{ij}Y_{i'j'}\Bigr) &= \beta^{\mathsf T}\Bigl(\sum_{iji'j'} Z_{ij}Z_{i'j'}x_{ij}x_{i'j'}^{\mathsf T}\Bigr)\beta + \sigma_A^2\sum_i N_{i\bullet}^2 + \sigma_B^2\sum_j N_{\bullet j}^2 + \sigma_E^2 N.
\end{align*}
To solve for $\beta$, $\sigma_A^2$, $\sigma_B^2$, and $\sigma_E^2$, we would need to use a nonlinear root finder such as the Newton-Raphson method. This seems unnecessarily complex for the linear mixed effects model, and no analytic formulas for the variances of the estimates are given.
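As a computational aside, the three quadratic statistics above collapse to squared row sums, squared column sums, and the squared grand sum, so they can be accumulated in one $O(N)$ pass. A sketch with hypothetical helper names:

```python
import numpy as np

def quadratic_stats(rows, cols, y, r, c):
    """Compute sum_{ijj'} ZZ Y_ij Y_ij' = sum_i Y_i.^2,
               sum_{ii'j} ZZ Y_ij Y_i'j = sum_j Y_.j^2, and
               sum_{iji'j'} ZZ Y_ij Y_i'j' = Y_..^2
    from per-row/column sums in a single pass over the data."""
    s_row = np.bincount(rows, weights=y, minlength=r)   # Y_i.
    s_col = np.bincount(cols, weights=y, minlength=c)   # Y_.j
    return (float(np.sum(s_row ** 2)),
            float(np.sum(s_col ** 2)),
            float(y.sum() ** 2))
```

So the objection to this approach is not the cost of evaluating the statistics but the nonlinear root-finding they require.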
3.2.2 Likelihood-based Approaches
Likelihood-based approaches assume some distribution for the random effects and noise and then
estimate the parameters β and the variance components by maximizing the data likelihood or some
version thereof. The most common approach is to assume that ai, bj , and eij are normally dis-
tributed, which we believe is too stringent for many e-commerce applications.
Using the notation from model (3.1), under that assumption $Y \sim \mathcal{N}(X\beta, V)$, where $V = \sigma_A^2 Z_a Z_a^{\mathsf T} + \sigma_B^2 Z_b Z_b^{\mathsf T} + \sigma_E^2 I_N$. Maximum likelihood maximizes the log likelihood of the data $Y$ with respect to $\beta$, $\sigma_A^2$, $\sigma_B^2$, and $\sigma_E^2$. Hartley & Rao (1967) [22] utilize a different parameterization, letting $V = H\sigma_E^2$ and maximizing the log likelihood with respect to $\beta$, $\sigma_E^2$, $\sigma_A^2/\sigma_E^2$, and $\sigma_B^2/\sigma_E^2$. The asymptotic variances of the estimates are found by inverting the expected Fisher information matrix.
Analytic formulas for the maximum likelihood estimates exist for balanced data and other special
observation patterns, but in general we need nonlinear optimization procedures to obtain the (biased)
estimates. Moreover, as shown in Bates (2014) [5] and Section 3.2.3, even computing the value of
the log likelihood at a given set of values for the parameters has O(N3/2) computation cost. We
describe some of these nonlinear optimization algorithms later.
To adjust for the bias of the maximum likelihood estimates, restricted maximum likelihood, better known as REML, has been developed. Indeed, for a balanced crossed random effects model, REML turns out to be equivalent to Henderson I [44]. An intuitive explanation of the approach is that it estimates variance components by maximum likelihood based on residuals from fitting only the fixed effects $\beta$ by least squares. More specifically, let $K$ be a matrix such that $K^{\mathsf T}X = 0$. That requirement is satisfied by the residual projection of any generalized least squares fit, e.g. $I - X(X^{\mathsf T}X)^{-1}X^{\mathsf T}$. Then $K^{\mathsf T}Y \sim \mathcal{N}(0, K^{\mathsf T}VK)$, and we apply the maximum likelihood procedure to the log likelihood of $K^{\mathsf T}Y$ to estimate the variance components. The regression coefficients can be estimated using generalized least squares and the estimated variance components. Again, the asymptotic variance of the REML estimates can be found using the expected Fisher information matrix.
There have been a number of optimization algorithms used in the literature for linear mixed models. Graser et al. (1987) [17] utilize a derivative-free line search algorithm when there is only one random effect, and the MixedModels package in Julia [6] implements a derivative-free quadratic approximation algorithm, BOBYQA [39]. The Expectation-Maximization algorithm [12] has also been used, with the random effects as the hidden data. Raudenbush (1993) [40] applies it to a linear mixed model with crossed random effects and Laird et al. (1987) [28] to a linear mixed model with nested (and possibly crossed within nested) random effects.
Gradient-based optimization algorithms have also been popular, because the formulas for the updates are usually simple. However, computing the gradients of the log likelihood requires superlinear time, just as computing the log likelihood itself does. The Gauss-Newton algorithm is
implemented in lme4 [5, 7], and Lindstrom & Bates (1988) [31] use the Newton-Raphson algorithm for linear mixed models with nested random effects. Other possible algorithms are the scoring algorithm, a variation of Newton-Raphson that replaces the negative Hessian of the log likelihood with the Fisher information matrix, and quasi-Newton algorithms, which approximate the Hessian with outer products of computed gradients.
3.2.3 Challenges of Likelihood-based Approaches
As in the previous subsection, let us assume that the random effects and noise are normally dis-
tributed. Then, when the observations are ordered by rows, the negative log likelihood is proportional
to
\[ \ell(\beta, \sigma_A^2, \sigma_B^2, \sigma_E^2) = (Y - X\beta)^{\mathsf T}\bigl(\sigma_A^2 A_R + \sigma_B^2 B_R + \sigma_E^2 I_N\bigr)^{-1}(Y - X\beta) + \ln\bigl|\sigma_A^2 A_R + \sigma_B^2 B_R + \sigma_E^2 I_N\bigr|, \]
where $A_R$ and $B_R$ are defined in Section 3.3.2. We argue that it requires $O(N^{3/2})$ time to compute the negative log likelihood at one set of values of the parameters.
Let $\sigma_B^2 B_R = P^{\mathsf T}UDU^{\mathsf T}P$, where $P$ is a permutation matrix and $UDU^{\mathsf T}$ is the eigendecomposition of $B_C$. The columns of $U$ are the $v_{2,j}$ divided by $\sqrt{N_{\bullet j}}$, and $D$ is a diagonal matrix with entries $\sigma_B^2 N_{\bullet j}$. Then by the Woodbury formula,
\begin{align*}
\bigl(\sigma_A^2 A_R + \sigma_B^2 B_R + \sigma_E^2 I_N\bigr)^{-1} &= \bigl(\sigma_A^2 A_R + \sigma_E^2 I_N\bigr)^{-1} \\
&\quad - \bigl(\sigma_A^2 A_R + \sigma_E^2 I_N\bigr)^{-1}P^{\mathsf T}U\bigl(D^{-1} + U^{\mathsf T}P(\sigma_A^2 A_R + \sigma_E^2 I_N)^{-1}P^{\mathsf T}U\bigr)^{-1}U^{\mathsf T}P\bigl(\sigma_A^2 A_R + \sigma_E^2 I_N\bigr)^{-1}.
\end{align*}
Similarly,
\[ \bigl|\sigma_A^2 A_R + \sigma_B^2 B_R + \sigma_E^2 I_N\bigr| = \bigl|\sigma_A^2 A_R + \sigma_E^2 I_N\bigr|\;|D|\;\bigl|D^{-1} + U^{\mathsf T}P(\sigma_A^2 A_R + \sigma_E^2 I_N)^{-1}P^{\mathsf T}U\bigr|, \]
of which the first two factors can be computed analytically in $O(R)$ time.
We can find a sparse representation of $P^{\mathsf T}U$ by storing the columns seen for each row. Note that
\[ \bigl(\sigma_A^2 A_R + \sigma_E^2 I_N\bigr)^{-1} = \sum_{i=1}^{R}\Biggl[\frac{v_{1,i}v_{1,i}^{\mathsf T}}{N_{i\bullet}\bigl(\sigma_E^2 + \sigma_A^2 N_{i\bullet}\bigr)} + \sum_{k=1}^{N_{i\bullet}-1}\frac{w_{ik}w_{ik}^{\mathsf T}}{\sigma_E^2}\Biggr]. \]
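This analytic inverse is easy to verify numerically block by block. For a single row block of size $N_{i\bullet}$, the sum $\sum_k w_{ik}w_{ik}^{\mathsf T}$ equals the projection $I - v_{1,i}v_{1,i}^{\mathsf T}/N_{i\bullet}$, which the sketch below uses (one hypothetical block of size 5):

```python
import numpy as np

m, sa2, se2 = 5, 2.0, 1.0
ones = np.ones((m, 1))                              # v_{1,i} restricted to the block
block = sa2 * ones @ ones.T + se2 * np.eye(m)       # one row's block of the covariance
# v v^T / (N_i.(sigma_E^2 + sigma_A^2 N_i.)) + (I - v v^T / N_i.) / sigma_E^2
analytic = (ones @ ones.T) / (m * (se2 + sa2 * m)) \
         + (np.eye(m) - ones @ ones.T / m) / se2
assert np.allclose(analytic, np.linalg.inv(block))
```

Because each block's inverse is available in closed form, no numerical factorization of $\sigma_A^2 A_R + \sigma_E^2 I_N$ is ever needed.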
Fixing $i$, we see that we can compute $v_{1,i}^{\mathsf T}P^{\mathsf T}U$ in $O(N_{i\bullet})$ time, and similarly for the $w_{ik}$. Thus, we can compute $U^{\mathsf T}P(\sigma_A^2 A_R + \sigma_E^2 I_N)^{-1}P^{\mathsf T}U$ in $O(\sum_i N_{i\bullet}^2)$ time. Therefore, we conclude that
• Computing the determinant term in the log likelihood and the inverse of $D^{-1} + U^{\mathsf T}P(\sigma_A^2 A_R + \sigma_E^2 I_N)^{-1}P^{\mathsf T}U$ both require $O(R^3 + \sum_i N_{i\bullet}^2)$ time.
• Computing $(Y - X\beta)^{\mathsf T}(\sigma_A^2 A_R + \sigma_E^2 I_N)^{-1}(Y - X\beta)$ requires $O(\sum_i N_{i\bullet}^2)$ time.
• Computing $(Y - X\beta)^{\mathsf T}(\sigma_A^2 A_R + \sigma_E^2 I_N)^{-1}P^{\mathsf T}U$ requires $O(\sum_i N_{i\bullet}^2)$ time.
• Computing the rest of the first term of the log likelihood requires $O(R^2)$ time.
In sum, we have shown explicitly that computing the log likelihood requires $O(R^3 + \sum_i N_{i\bullet}^2)$ time. Note that we could have taken a symmetric approach to computing the log likelihood, factoring out the row random effects instead of the column random effects in the Woodbury formula. Doing that requires time $O(C^3 + \sum_j N_{\bullet j}^2)$. Thus, we can choose whichever approach requires less time.
Nevertheless, we see that
\[ \sum_i N_{i\bullet}^2 = R\Bigl(\sum_i N_{i\bullet}^2/R\Bigr) \ge R\Bigl(\sum_i N_{i\bullet}/R\Bigr)^2 = N^2/R. \]
Write $R = N^{\alpha}$ for $0 < \alpha < 1$. Then $O(R^3 + \sum_i N_{i\bullet}^2)$ is at least $O(N^{3\alpha} + N^{2-\alpha})$. This is minimized when $\alpha = 1/2$, in which case the order is at least $O(N^{3/2})$. The same argument applies to $O(C^3 + \sum_j N_{\bullet j}^2)$, so either approach takes at least $O(N^{3/2})$ time.
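A quick numeric check of this trade-off, scanning the cost exponent $\max(3\alpha, 2-\alpha)$ over a grid of $\alpha$ values:

```python
import numpy as np

# Cost exponent for R = N^alpha: N^{3 alpha} + N^{2 - alpha} grows like
# N^{max(3 alpha, 2 - alpha)}.  The best exponent over alpha in (0, 1)
# should be 3/2, attained at alpha = 1/2.
alphas = np.linspace(0.01, 0.99, 9801)
exponents = np.maximum(3 * alphas, 2 - alphas)
best = alphas[np.argmin(exponents)]
assert abs(best - 0.5) < 1e-3
assert abs(exponents.min() - 1.5) < 1e-3
```

The increasing branch $3\alpha$ and the decreasing branch $2-\alpha$ cross exactly at $\alpha = 1/2$, confirming the $O(N^{3/2})$ floor.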
It is clear that the lack of scalability arises from the fact that the likelihood involves the inversion
of a large matrix, the covariance matrix. Therefore, if the distribution of the random effects were
chosen so that the likelihood does not involve such an inversion (or multi-dimensional integrals), we
may be able to scale maximum likelihood procedures to massive data. However, it is unclear what
distribution besides the Gaussian would be appropriate, as it must realistically model the data.
3.3 Alternating Algorithm
We summarize our parameter estimation procedure for model (1) in Algorithm 1. We alternate twice
between finding β and the variance component estimates σ2A, σ2
B , and σ2E . Further details of steps
1-4, including the reasoning for which generalized least squares (GLS) estimator to use in step 3, are
given in the next two subsections. Step 5 is described in Section 3.4.2.1 where we derive Var(βRLS)
and Var(βCLS).
We only briefly discuss step 1, which computes the OLS estimate of $\beta$. Let $X \in \mathbb{R}^{N\times p}$ have rows $x_{ij}$ in some order and let $Y \in \mathbb{R}^{N}$ have elements $Y_{ij}$ in the same order. Then,
\[ \hat\beta_{\mathrm{OLS}} = (X^{\mathsf T}X)^{-1}X^{\mathsf T}Y = \Bigl(\sum_{ij} Z_{ij}x_{ij}x_{ij}^{\mathsf T}\Bigr)^{-1}\sum_{ij} Z_{ij}x_{ij}Y_{ij}. \tag{3.2} \]
In one pass over the data, we can compute XTX and XTY incrementally and then solve for β.
Solving the normal equations this way is easy to parallelize but more prone to roundoff error than
the usual alternative based on computing the SVD of X. One can compensate by computing in
extended precision. It costs O(p3) to compute βOLS and so the cost of step 1 is O(Np2 + p3).
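A minimal sketch of this one-pass accumulation, chunked to suggest how it parallelizes (names and the chunk interface are illustrative assumptions):

```python
import numpy as np

def ols_streaming(chunks):
    """beta_OLS via the normal equations (3.2), accumulating X^T X and
    X^T y one chunk at a time.  Each chunk is a pair (x, y) with x of
    shape (m, p).  Space O(p^2); time O(N p^2 + p^3)."""
    xtx, xty = None, None
    for x, y in chunks:
        if xtx is None:                       # initialize on the first chunk
            p = x.shape[1]
            xtx, xty = np.zeros((p, p)), np.zeros(p)
        xtx += x.T @ x                        # running X^T X
        xty += x.T @ y                        # running X^T y
    return np.linalg.solve(xtx, xty)
```

Partial sums from different workers add together, which is what makes the step embarrassingly parallel; as noted above, extended precision can compensate for the extra roundoff of the normal equations.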
Algorithm 1: Alternating Algorithm
1. Estimate $\beta$ via ordinary least squares (OLS): $\hat\beta = \hat\beta_{\mathrm{OLS}}$.
2. Let $\hat\sigma_A^2$, $\hat\sigma_B^2$, and $\hat\sigma_E^2$ be the method of moments estimates from Section 2.4.1 defined on the data $(i, j, \eta_{ij})$, where $\eta_{ij} = Y_{ij} - x_{ij}^{\mathsf T}\hat\beta_{\mathrm{OLS}}$.
3. Compute a more efficient $\hat\beta$ using $\hat\sigma_A^2$, $\hat\sigma_B^2$, and $\hat\sigma_E^2$. If $\hat\sigma_A^2\max_i N_{i\bullet} \ge \hat\sigma_B^2\max_j N_{\bullet j}$, estimate $\beta$ via GLS accounting for row correlations: $\hat\beta = \hat\beta_{\mathrm{RLS}}$. Otherwise, estimate it via GLS accounting for column correlations: $\hat\beta = \hat\beta_{\mathrm{CLS}}$.
4. Repeat step 2 using $\eta_{ij} = Y_{ij} - x_{ij}^{\mathsf T}\hat\beta$ with $\hat\beta$ from step 3.
5. Compute an estimate $\widehat{\mathrm{Var}}(\hat\beta)$ for $\hat\beta$ from step 3 using $\hat\sigma_A^2$, $\hat\sigma_B^2$, and $\hat\sigma_E^2$ from step 4.
3.3.1 Estimating Variance Components
In this subsection, we discuss steps 2 and 4 of Algorithm 1 in detail. The errors Y_ij − x_ij^T β follow a
two-factor crossed random effects model, model (2). If our estimate of β is accurate, then the residuals
η_ij = Y_ij − x_ij^T β should approximately follow a two-factor crossed random effects model with μ = 0
and variance components σ²_A, σ²_B, and σ²_E.
We estimate σ²_A, σ²_B, and σ²_E using the algorithm described in Section 2.4.1 with data (i, j, η_ij).
That algorithm gives unbiased estimates of the variance components in a two-factor crossed random
effects model by applying the method of moments to three U-statistics: the weighted sum of within-
row variances, the weighted sum of within-column variances, and the overall variance.
The (modified) versions of those statistics used in Algorithm 1 are

    U_a(β) = (1/2) Σ_{i,j,j′} N_i•^{-1} Z_ij Z_ij′ (η_ij − η_ij′)²,
    U_b(β) = (1/2) Σ_{j,i,i′} N_•j^{-1} Z_ij Z_i′j (η_ij − η_i′j)², and
    U_e(β) = (1/2) Σ_{i,j,i′,j′} Z_ij Z_i′j′ (η_ij − η_i′j′)².    (3.3)
Then, the variance component estimates are obtained by solving the system

    M ( σ²_A, σ²_B, σ²_E )^T = ( U_a(β), U_b(β), U_e(β) )^T,    (3.4)

where

    M = [ 0                  N − R              N − R  ]
        [ N − C              0                  N − C  ]
        [ N² − Σ_i N_i•²     N² − Σ_j N_•j²     N² − N ].
As described in Section 2.4.1, the U-statistics can be computed in one pass over the data. The
procedure differs slightly in this case: in the pass over the data, at each point, compute η_ij and
use it instead of Y_ij in the U-statistics. Therefore, the computational cost is still O(N)
(actually O(Np)) and the space cost is still O(R + C). Solving (3.4) takes constant time. Thus,
steps 2 and 4 each have computational complexity O(N) and space complexity O(R + C).
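The one-pass claim is easy to check numerically: each U-statistic in (3.3) collapses to row, column, and overall sums of η and η². A small sketch with our own variable names; the naive O(N²) double loop below is only there to verify the collapsed formulas.

```python
import numpy as np

rng = np.random.default_rng(1)
# tiny data set: observed (i, j) pairs with residuals eta
i_idx = np.array([0, 0, 0, 1, 1, 2, 2, 2])
j_idx = np.array([0, 1, 2, 0, 2, 1, 2, 3])
eta = rng.normal(size=i_idx.size)
N = eta.size

# naive O(N^2) evaluation of (3.3): sum over ordered pairs of observations
Ni = np.bincount(i_idx).astype(float)
Nj = np.bincount(j_idx).astype(float)
Ua = Ub = Ue = 0.0
for k in range(N):
    for l in range(N):
        d2 = (eta[k] - eta[l]) ** 2
        if i_idx[k] == i_idx[l]:
            Ua += 0.5 * d2 / Ni[i_idx[k]]
        if j_idx[k] == j_idx[l]:
            Ub += 0.5 * d2 / Nj[j_idx[k]]
        Ue += 0.5 * d2

# one-pass evaluation via sufficient statistics: per-row sums, per-column sums,
# the overall sum, and the sum of squares
si = np.bincount(i_idx, weights=eta)
sj = np.bincount(j_idx, weights=eta)
ss = np.sum(eta ** 2)
Ua1 = ss - np.sum(si ** 2 / Ni)
Ub1 = ss - np.sum(sj ** 2 / Nj)
Ue1 = N * ss - eta.sum() ** 2
assert np.allclose([Ua, Ub, Ue], [Ua1, Ub1, Ue1])

# solve (3.4) for the variance component estimates
R, C = Ni.size, Nj.size
M = np.array([[0.0, N - R, N - R],
              [N - C, 0.0, N - C],
              [N**2 - np.sum(Ni**2), N**2 - np.sum(Nj**2), N**2 - N]])
s2A, s2B, s2E = np.linalg.solve(M, [Ua1, Ub1, Ue1])
```

On such tiny data the moment estimates can of course be negative; the point of the sketch is only the equality of the naive and one-pass evaluations.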
3.3.2 Scalable and More Efficient Regression Coefficient Estimators
Two estimators: We first define the GLS estimators of β accounting for row or column correla-
tions. We consider two orderings of the data: in one, the data are ordered by rows, and in the other
by columns. Our algorithms do not require us to sort the data into either of these
orders; they are merely used to describe our estimators clearly. Let P be the permutation matrix
corresponding to the transformation of the column ordering to the row ordering. Let A_R be the
block diagonal matrix with blocks 1_{N_i•} 1_{N_i•}^T and B_C the block diagonal matrix with blocks 1_{N_•j} 1_{N_•j}^T.
In the row ordering

    Cov(Y) = V_R ≡ σ²_E I_N + σ²_A A_R + σ²_B B_R,   B_R = P B_C P^T,    (3.5)

while in the column ordering

    Cov(Y) = V_C ≡ σ²_E I_N + σ²_A A_C + σ²_B B_C,   A_C = P^T A_R P.    (3.6)
Solving a system of equations defined by V_R or V_C generally takes time that grows like N^{3/2} by the
argument in Section 3.2.3, so the usual GLS estimator is not feasible. Our estimate β takes account
of either the row or the column correlations but not both. To account for the row correlations, assuming
that the data are ordered by rows, we use the working covariance matrix V_A = σ²_E I_N + σ²_A A_R.
Similarly, to account for the column correlations, assuming that the data are ordered by columns,
we use the working covariance matrix V_B = σ²_E I_N + σ²_B B_C.
We define the GLS estimator of β accounting for the row correlations as

    β_RLS = (X^T V_A^{-1} X)^{-1} X^T V_A^{-1} Y,   V_A ≡ σ²_E I_N + σ²_A A_R,    (3.7)

where the rows of X and Y are x_ij and Y_ij ordered by i. We use estimates of σ²_E and σ²_A. Similarly,
the GLS estimator of β accounting for the column correlations is

    β_CLS = (X^T V_B^{-1} X)^{-1} X^T V_B^{-1} Y,   V_B ≡ σ²_E I_N + σ²_B B_C,    (3.8)

where now the rows of X and Y are x_ij and Y_ij ordered by j.
GLS Computations: Here we show how to compute β_RLS in O(N) time and using O(R) space.
By symmetry, similar results hold for β_CLS.
From the Woodbury formula [19] and defining D as the N × R matrix with ith column v_{1,i} (from
Section 3.1), we have

    X^T V_A^{-1} X = X^T (σ²_E I_N + σ²_A D D^T)^{-1} X
                   = X^T X / σ²_E − (σ²_A/σ²_E) X^T D diag( 1/(σ²_E + σ²_A N_i•) ) D^T X
                   = (1/σ²_E) Σ_ij Z_ij x_ij x_ij^T − (σ²_A/σ²_E) Σ_i [ 1/(σ²_E + σ²_A N_i•) ] ( Σ_j Z_ij x_ij )( Σ_j Z_ij x_ij )^T.
Likewise, X^T V_A^{-1} Y equals

    (1/σ²_E) Σ_ij Z_ij x_ij Y_ij − (σ²_A/σ²_E) Σ_i [ 1/(σ²_E + σ²_A N_i•) ] ( Σ_j Z_ij x_ij )( Σ_j Z_ij Y_ij ).
One pass over the data allows us to compute Σ_ij Z_ij x_ij x_ij^T and Σ_ij Z_ij x_ij Y_ij, as well as N_i•,
Σ_j Z_ij x_ij, and Σ_j Z_ij Y_ij for i = 1, …, R. This has computational cost O(Np²) and space O(Rp).
Using these quantities, we calculate X^T V_A^{-1} X and X^T V_A^{-1} Y in time O(Rp²). Then, β_RLS is
computed in O(p³). Hence, β_RLS can be found with O(Rp) space and computational cost O(Np² + p³),
which is linear in N.
Efficiencies: In step 3 of Algorithm 1, we choose between β_RLS and β_CLS. The choice could be
made dependent on X, but in many applications one considers numerous different X matrices, and we
prefer to have a single choice for all regressions. Accordingly, we find a lower bound on the efficiency
of RLS when X is a single nonzero vector x ∈ R^{N×1}. We choose RLS if that lower bound is higher
than the corresponding bound for CLS in this p = 1 setting. That rule translates to choosing RLS
if the variance component associated with rows is dominant, and CLS otherwise, as shown below.
The full GLS estimator is

    β_GLS = (X^T V_R^{-1} X)^{-1} X^T V_R^{-1} Y    (3.9)

when the data are ordered by rows, and (X^T V_C^{-1} X)^{-1} X^T V_C^{-1} Y when the data are ordered by
columns. For data ordered by rows, the efficiency of β_RLS in this p = 1 setting is

    eff_RLS = Var(β_GLS)/Var(β_RLS) = (x^T V_A^{-1} x)² / [ (x^T V_A^{-1} V_R V_A^{-1} x)(x^T V_R^{-1} x) ].    (3.10)

For data ordered by columns, the corresponding efficiency of β_CLS is

    eff_CLS = Var(β_GLS)/Var(β_CLS) = (x^T V_B^{-1} x)² / [ (x^T V_B^{-1} V_C V_B^{-1} x)(x^T V_C^{-1} x) ].    (3.11)
The next two theorems establish lower bounds on these efficiencies.
Theorem 3.3.1. Let A be a positive definite Hermitian matrix and u a unit vector. If the
eigenvalues of A are bounded below by m > 0 and above by M < ∞, then

    (u^T A u)(u^T A^{-1} u) ≤ (m + M)² / (4mM).

Equality may hold, for example when u^T A u = (M + m)/2 and the only eigenvalues of A are m and M.
Proof. This is Kantorovich's inequality [34].
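A quick numerical sanity check of the bound, including the stated equality case (our own construction):

```python
import numpy as np

rng = np.random.default_rng(0)
for _ in range(200):
    evals = rng.uniform(0.5, 4.0, size=5)            # eigenvalues, so m, M in [0.5, 4]
    Q, _ = np.linalg.qr(rng.normal(size=(5, 5)))     # random orthogonal matrix
    A = Q @ np.diag(evals) @ Q.T                     # random symmetric positive definite A
    u = rng.normal(size=5)
    u /= np.linalg.norm(u)                           # unit vector
    m, M = evals.min(), evals.max()
    lhs = (u @ A @ u) * (u @ np.linalg.inv(A) @ u)
    assert lhs <= (m + M) ** 2 / (4 * m * M) + 1e-10

# equality case: eigenvalues {m, M} only, and u^T A u = (m + M)/2
m, M = 1.0, 3.0
A = np.diag([m, M])
u = np.array([1.0, 1.0]) / np.sqrt(2.0)
lhs = (u @ A @ u) * (u @ np.linalg.inv(A) @ u)
assert np.isclose(lhs, (m + M) ** 2 / (4 * m * M))
```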
By two applications of Theorem 3.3.1 to (3.10) and (3.11) we prove:
Theorem 3.3.2. When p = 1 and eff_RLS and eff_CLS are defined as in (3.10) and (3.11),

    eff_RLS ≥ 4σ²_E (σ²_E + σ²_B max_j N_•j) / (2σ²_E + σ²_B max_j N_•j)²   and
    eff_CLS ≥ 4σ²_E (σ²_E + σ²_A max_i N_i•) / (2σ²_E + σ²_A max_i N_i•)²,

where each inequality is tight.
Proof. See Section B.1.
After some algebra, we see that the worst case efficiency of β_RLS is higher than that of β_CLS
when σ²_A max_i N_i• > σ²_B max_j N_•j. We set β to be β_RLS when σ²_A max_i N_i• ≥ σ²_B max_j N_•j, and β_CLS
otherwise. As shown previously, either estimate requires O(N) computation time and O(R + C)
space. Our importance measure for the row random effects is σ²_A max_i N_i•. We believe that other
reasonable measures could be derived from other ways of quantifying efficiency.
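The selection rule can be coded directly from the bounds in Theorem 3.3.2. Because the bound is decreasing in σ² · (max count), comparing the two bounds is equivalent to comparing σ²_A max_i N_i• with σ²_B max_j N_•j; the function names below are ours.

```python
def worst_case_eff(s2E, s2_other, max_count_other):
    """Lower bound from Theorem 3.3.2. For RLS the 'other' factor is the column one:
    worst_case_eff(s2E, s2B, max_j N_.j); for CLS it is the row one."""
    t = s2_other * max_count_other
    return 4 * s2E * (s2E + t) / (2 * s2E + t) ** 2

def choose_estimator(s2A, s2B, s2E, max_Ni, max_Nj):
    """Step 3 rule: pick the estimator with the larger worst-case efficiency bound."""
    eff_rls = worst_case_eff(s2E, s2B, max_Nj)
    eff_cls = worst_case_eff(s2E, s2A, max_Ni)
    return "RLS" if eff_rls >= eff_cls else "CLS"
```

For instance, with σ²_A = 2, σ²_B = 0.5, max_i N_i• = 20, and max_j N_•j = 30, the row measure 2 · 20 = 40 exceeds the column measure 0.5 · 30 = 15, so the rule selects RLS.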
Remark: We also considered GLS estimators of β with working covariance matrices of the form
V_A = (σ²_E + λσ²_B) I_N + σ²_A A_R, in order to partially account for the variance induced by the column
random effects. The value of λ was chosen to maximize the worst case efficiency of the estimator.
However, doing so did not appear to decrease the MSE of the estimate of β in our simulations, so
we kept the simpler estimators.
Alternatively, we can use an iterative procedure such as backfitting [9] or linear algebra solvers
to compute βGLS; we are guaranteed to approach the true solution as the number of iterations goes
to infinity. The hope is that we get a good approximation in only O(1) iterations. The disadvantage
of this method for our tasks is that the variance of the resulting estimate after a finite number of
iterations is difficult to compute analytically. We could use the bootstrap, but doing so is expensive
for such large, correlated data sets. Indeed, it would appear that no bootstrap procedure can be
exact [35].
3.4 Theoretical Properties of Estimates
Here we show that the parameter estimates β, σ²_A, σ²_B, and σ²_E obtained from Algorithm 1 are con-
sistent. We also give a central limit theorem for β and show how to compute its asymptotic variance
in O(N) time and O(R + C) space. A martingale CLT for the variance component estimates ap-
pears, based on the corresponding theorem for the crossed random effects model, Theorem 2.4.6. We
use the asymptotic assumptions described in Section 3.1 and some additional regularity conditions
on x_ij.
As in ordinary IID error regression problems, our central limit theorem requires the information
in the observed x_ij to grow quickly in every projection while also imposing a limit on the largest
x_ij. For each i with N_i• > 0, let x̄_i• be the average of those x_ij with Z_ij = 1, and similarly define
column averages x̄_•j.
For a symmetric positive semi-definite matrix V, let I(V) be the smallest eigenvalue of V. We
will need lower bounds on I(V) for various V to rule out singular or nearly singular designs. Some
of those V involve centered variables. In most applications x_ij will include an intercept term, and so
we assume that the first component of every x_ij equals 1. That term raises some technical difficulties,
as centering that component always yields zero. We will treat that term specially in some of our
proofs. For a symmetric matrix V ∈ R^{p×p}, we let

    I_0(V) = I( (V_ij)_{2 ≤ i,j ≤ p} )

be the smallest eigenvalue of the lower (p − 1) × (p − 1) submatrix of V.
In our motivating applications, it is reasonable to assume that the ‖x_ij‖ are uniformly bounded. We
let

    M_∞ ≡ max_ij Z_ij ‖x_ij‖₂    (3.12)

quantify the largest x_ij in the data so far.
3.4.1 Consistency
First, we give conditions under which βOLS from step 1 is consistent.
Theorem 3.4.1. Let max(ε_R, ε_C) → 0 and I(X^T X) ≥ cN for some c > 0, as N → ∞. Then
E(‖β_OLS − β‖²) = O( (ε_R + ε_C)/I(X^T X/N) ) → 0.
Proof. See Section B.3.1.
Second, we show that the variance component estimates computed in step 2 are consistent.
Recall that we compute the U-statistics (3.3) on the data (i, j, η_ij) with η_ij = Y_ij − x_ij^T β, and use
them to obtain estimates σ²_A, σ²_B, and σ²_E via (3.4).
Theorem 3.4.2. Suppose that, as N → ∞, max(ε_R, ε_C) → 0, max(R, C)/N ≤ θ for some θ ∈ (0, 1),
the estimate of β converges in probability to β, and M_∞ is bounded. Then the estimates σ²_A, σ²_B,
and σ²_E converge in probability to their true values.
Proof. See Section B.3.2.
From Theorem 3.4.1, the estimate of β obtained in step 1 of Algorithm 1 is consistent. Therefore,
from Theorem 3.4.2, the variance component estimates obtained in step 2 are consistent, given the
combined assumptions of those two theorems. The proof of Theorem 3.4.1 shows that the estimated
variance components differ by O(‖β̂ − β‖² + ε‖β̂ − β‖) from what we would get by replacing the
estimate β̂ with an oracle value β and computing variance components of Y_ij − x_ij^T β. Such an
estimate would have mean squared error O(ε) by Theorem 2.4.3. As a result, the mean squared error
for all parameters of interest is O(ε).
Our third result shows that the estimate of β obtained in step 3 is consistent. We do so by
showing that the estimators β_RLS and β_CLS are consistent when constructed using consistent variance
component estimates. We give the version for β_RLS.
Theorem 3.4.3. Let β_RLS be computed with estimates σ²_A and σ²_E that converge in probability to
their true values as N → ∞, where σ²_E > 0. If max(ε_R, ε_C) → 0,

    I_0( Σ_ij Z_ij (x_ij − x̄_i•)(x_ij − x̄_i•)^T / N ) ≥ c > 0    (3.13)

and

    (1/R²) Σ_{i,r} (ZZ^T)_{ir} N_i•^{-1} N_r•^{-1} → 0,    (3.14)

then β_RLS converges in probability to β.
Proof. See Section B.3.3.
From Theorem 3.4.2, the variance component estimates obtained in step 4 are consistent. There-
fore, by Theorem 3.4.3, the final estimates returned by Algorithm 1 are consistent.
The most complicated part of the proof of Theorem 3.4.3 involves handling the contribution of
b_j to β_RLS. In row weighted GLS it is quite standard to have random errors a_i and e_ij, but here
we must also contend with errors b_j that do not appear in the model for which β_RLS is the MLE.
Condition (3.14) is used to control the variance contribution of the column random effects to the
intercept in β_RLS. For balanced data it reduces to 1/C → 0, and so it has an effective number of
columns interpretation. Recalling that (ZZ^T)_{ir} is the number of columns sampled in both rows i
and r, we have (ZZ^T)_{ir} ≤ N_r•, and so a sufficient condition for (3.14) is that (1/R) Σ_i N_i•^{-1} → 0. For
sparsely observed data we expect (ZZ^T)_{ir} ≪ max(N_i•, N_r•) to be typical, and then these bounds
are conservative.
Any realistic setting will have σ²_E > 0, and we need σ²_E > 0 for β_RLS to be well defined. So that
condition in Theorem 3.4.3 is not restrictive.
It remains to show that the variance component estimates from step 4 are consistent. We can
just apply Theorem 3.4.2 again. Therefore the final estimates returned by Algorithm 1 are consistent
given only weak conditions on the behavior of Zij and xij .
3.4.2 Asymptotic Normality
Here we show that the estimator β_RLS constructed using consistent estimates of σ²_A, σ²_B, and σ²_E is
asymptotically Gaussian. We need stronger conditions than we needed for consistency.
These conditions are expressed in terms of some weighted means of the predictors. First, let

    x̃_•j = (1/N_•j) Σ_i Z_ij [ σ²_A / (σ²_A + σ²_E/N_i•) ] x̄_i•.    (3.15)

This is a ‘second order’ average of x for column j: it is the average, over rows i that intersect
column j, of the averages x̄_i• shrunken towards zero. For a balanced design with Z_ij = 1_{i≤R} 1_{j≤C} we
would have x̃_•j = x̄_•• σ²_A/(σ²_A + σ²_E/C), so then the second order means would all be very close to
x̄_•• for large C. Apart from the shrinkage, we can think of x̃_•j as a local version of x̄_•• appropriate
to column j.
Next let

    k = Σ_j N_•j² (x̄_•j − x̃_•j) / Σ_j N_•j²  ∈ R^p.    (3.16)

This is a weighted average of adjusted column means, weighted by the squared column sizes. The
intercept component of k will not be used.
Theorem 3.4.4. Let β_RLS be computed with estimates σ²_A, σ²_B, and σ²_E that converge in probability
to their true values, where σ²_E > 0, as N → ∞. Suppose that

    I( Σ_i x̄_i• x̄_i•^T ),   I_0( Σ_ij Z_ij (x_ij − x̄_i•)(x_ij − x̄_i•)^T ),   and
    I_0( Σ_j N_•j² (x̄_•j − x̃_•j − k)(x̄_•j − x̃_•j − k)^T ) / max_j N_•j²

all tend to infinity, where x̃_•j is given by (3.15) and k is given by (3.16). For c_j = Σ_i Z_ij σ²_E/(σ²_E +
σ²_A N_i•) and c_ij = Z_ij σ²_E/(σ²_E + σ²_A N_i•), assume that both max_j c_j² / Σ_j c_j² and max_ij c_ij² / Σ_ij c_ij²
converge to zero. Then β_RLS is asymptotically distributed as

    N( β, (X^T V_A^{-1} X)^{-1} X^T V_A^{-1} V_R V_A^{-1} X (X^T V_A^{-1} X)^{-1} ).    (3.17)
Proof. See Section B.4.1.
The statement that β_RLS has asymptotic distribution N(β, V) is shorthand for V^{-1/2}(β_RLS − β)
converging in distribution to N(0, I_p). Theorem 3.4.4 imposes three information criteria. First, the
R rows i with N_i• > 0 must have sample average vectors x̄_i• with information tending to infinity. It
would be reasonable to expect that information to be proportional to R, and also reasonable to require
R → ∞ for a CLT. Next, the sum of within-row sums of squares and cross products of the row-centered
x_ij must have growing information, apart from the intercept term. Finally, thinking of x̄_•j − x̃_•j as
a locally centered mean for column j, those quantities centered at the vector k must have a weighted
sum of squares that is not dominated by any single column when weights proportional to N_•j² are
applied.
The conditions on c_j and c_ij are used to show that the CLT applies to the intercept in the
regression. The condition on max_j c_j² / Σ_j c_j² will fail if, for example, column j = 1 has half of the N
observations, all in rows of size N_i• = 1. In the case of an R × C grid, max_j c_j² / Σ_j c_j² = 1/C, and
so we can interpret this condition as requiring a large enough effective number Σ_j c_j² / max_j c_j² of
columns in the data.
The condition on max_ij c_ij² / Σ_ij c_ij² will fail if, for example, the data contain a full R × C grid of
values plus a single observation with i = R + 1 and j = C + 1. The problem is that in a row based
regression, a single small row can get outsized leverage. It can be controlled by dropping relatively
small rows. This pruning of rows is only used for the CLT to apply to the intercept term. It is not
needed for other components of β, nor is it needed for consistency. We do not know if it is necessary
for the CLT.
Next we show that the variance component estimates are asymptotically normal as well, building
on the corresponding result from Section 2.4.3.
Define U_a, U_b, and U_e to be the same as U_a(β), U_b(β), and U_e(β) but with the true errors
η_ij = a_i + b_j + e_ij in place of the residuals η_ij = Y_ij − x_ij^T β.
As stated in Section 3.4.1, in step 3 of Algorithm 1 we have obtained a consistent estimate β of
β. From (2.19), the variance component estimates obtained in step 4 are, as N → ∞,

    [σ²_A]   [ −N⁻¹    0      N⁻²  ] [U_a(β)]
    [σ²_B] = [  0     −N⁻¹    N⁻²  ] [U_b(β)] .    (3.18)
    [σ²_E]   [  N⁻¹    N⁻¹   −N⁻²  ] [U_e(β)]
Therefore, to show that σ²_A, σ²_B, and σ²_E are asymptotically normal, we show that U_a(β), U_b(β),
and U_e(β) are asymptotically jointly normal. More precisely, we show that the normalized statistics

    U_a(β)/√(Σ_j N_•j²),   U_b(β)/√(Σ_i N_i•²),   and   U_e(β)/( N √(Σ_i N_i•² + Σ_j N_•j²) )

are asymptotically jointly normal.
Here is an outline of our proof. The asymptotic joint distribution of U_a/√(Σ_j N_•j²), U_b/√(Σ_i N_i•²),
and U_e/( N √(Σ_i N_i•² + Σ_j N_•j²) ) is Gaussian, from Theorem 2.4.6. We show that
U_a(β) − U_a, U_b(β) − U_b, and U_e(β) − U_e are asymptotically zero and then cite Slutsky's theorem.
Lemma 3.4.1. Under the assumptions of Section 3.1, if E(‖β̂ − β‖²) = o(ε_R + ε_C), as well as (2.31)
and boundedness of M_∞, then U_a(β) − U_a, U_b(β) − U_b, and U_e(β) − U_e all converge to 0 in probability.
Proof. See Section B.4.2.
The condition on β in Lemma 3.4.1 holds for βOLS by Theorem 3.4.1. Because βRLS and βCLS
use a better model for Cov(Y ), they should ordinarily satisfy it too.
Theorem 3.4.5. Under the conditions of Theorem 2.4.6 and Lemma 3.4.1, the variance component
estimates output by Algorithm 1 are asymptotically distributed as

    N( (σ²_A, σ²_B, σ²_E)^T , Σ ),   where

    Σ = [ σ⁴_A(κ_A + 2) Σ_i N_i•² / N²    σ⁴_E(κ_E + 2)/N                  −σ⁴_E(κ_E + 2)/N ]
        [ ∗                               σ⁴_B(κ_B + 2) Σ_j N_•j² / N²     −σ⁴_E(κ_E + 2)/N ]
        [ ∗                               ∗                                 σ⁴_E(κ_E + 2)/N  ].
Proof. See Section B.4.3.
As with the crossed random effects model, the variance component estimates are asymptotically
uncorrelated. However, they could be correlated with β. Indeed, we expect that they are correlated,
since we do not assume Gaussian or even symmetric distributions for ai, bj , and eij .
In practice, to compute the variances of σ²_A, σ²_B, and σ²_E, we need to estimate κ_A, κ_B, and κ_E in
step 4 of Algorithm 1 using the method of moments technique described in Section 2.4.2.3. Then,
we plug σ²_A, σ²_B, σ²_E, κ_A, κ_B, and κ_E into the result of Theorem 3.4.5.
3.4.2.1 Computing the Variance of β
Here we show how to compute the estimate of the asymptotic variance of β from Theorem 3.4.4.
Expanding its expression, we see that

    (X^T V_A^{-1} X)^{-1} X^T V_A^{-1} V_R V_A^{-1} X (X^T V_A^{-1} X)^{-1}
      = (X^T V_A^{-1} X)^{-1} X^T V_A^{-1} (V_A + σ²_B B_R) V_A^{-1} X (X^T V_A^{-1} X)^{-1}
      = (X^T V_A^{-1} X)^{-1} + (X^T V_A^{-1} X)^{-1} X^T V_A^{-1} σ²_B B_R V_A^{-1} X (X^T V_A^{-1} X)^{-1}
      = (X^T V_A^{-1} X)^{-1} + (X^T V_A^{-1} X)^{-1} Var(X^T V_A^{-1} b) (X^T V_A^{-1} X)^{-1},    (3.19)

where b is the length-N vector of column random effects for each observation.
Using the Woodbury formula [19], Var(X^T V_A^{-1} b) equals

    (σ²_B/σ⁴_E) Σ_j ( X_•j − σ²_A Σ_i Z_ij X_i• / (σ²_E + σ²_A N_i•) ) ( X_•j − σ²_A Σ_i Z_ij X_i• / (σ²_E + σ²_A N_i•) )^T.    (3.20)
In practice, we plug the estimates of σ²_A, σ²_B, and σ²_E into (3.19) and (3.20). We already have
(X^T V_A^{-1} X)^{-1} as well as N_i• and X_i• for i = 1, …, R available from computing β_RLS. In a new pass
over the data, we compute X_•j and Σ_i Z_ij X_i• / (σ²_E + σ²_A N_i•) for j = 1, …, C, incurring O(Np)
computational and O(Cp) storage cost. Then, (3.20) can be found in O(Cp²) time; a final step finds
(3.19) in O(p³) time. Overall, estimating the variance of β requires O(Np + Cp² + p³) additional
computation time and O(Cp) additional space.
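These passes can be checked against the dense sandwich formula on a small data set. The sketch below uses our own names and assumes one observation per observed (i, j) pair, as in model (1).

```python
import numpy as np

def var_beta_rls(i_idx, j_idx, X, s2A, s2B, s2E):
    """Sandwich variance (3.19), with the middle term computed via (3.20)."""
    N, p = X.shape
    R = i_idx.max() + 1
    C = j_idx.max() + 1
    Ni = np.bincount(i_idx, minlength=R).astype(float)
    Xi = np.zeros((R, p)); np.add.at(Xi, i_idx, X)   # row sums X_i.
    Xj = np.zeros((C, p)); np.add.at(Xj, j_idx, X)   # column sums X_.j
    w = 1.0 / (s2E + s2A * Ni)
    XtViX = X.T @ X / s2E - (s2A / s2E) * (Xi.T @ (w[:, None] * Xi))
    H = np.linalg.inv(XtViX)
    # T[j] = sum_i Z_ij X_i. / (s2E + s2A N_i.), accumulated in one pass over the data
    T = np.zeros((C, p))
    np.add.at(T, j_idx, w[i_idx][:, None] * Xi[i_idx])
    Mj = Xj - s2A * T
    mid = (s2B / s2E**2) * (Mj.T @ Mj)               # (3.20)
    return H + H @ mid @ H                            # (3.19)

# dense check: Var(beta_RLS) = (X'Vi X)^-1 X'Vi V_R Vi X (X'Vi X)^-1
rng = np.random.default_rng(3)
R0, C0, p = 6, 5, 2
pairs = rng.choice(R0 * C0, size=18, replace=False)   # distinct (i, j) pairs
i_idx, j_idx = pairs // C0, pairs % C0
X = rng.normal(size=(18, p))
s2A, s2B, s2E = 2.0, 0.5, 1.0
VA = s2E * np.eye(18) + s2A * (i_idx[:, None] == i_idx[None, :])
VR = VA + s2B * (j_idx[:, None] == j_idx[None, :])
Vi = np.linalg.inv(VA)
Hd = np.linalg.inv(X.T @ Vi @ X)
dense = Hd @ (X.T @ Vi @ VR @ Vi @ X) @ Hd
assert np.allclose(var_beta_rls(i_idx, j_idx, X, s2A, s2B, s2E), dense)
```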
3.5 Experimental Results
Here we simulate Gaussian data to compare both the computational and statistical efficiency of our
algorithm with a state of the art code for linear mixed models, MixedModels [6]. In this section,
all our computations were carried out in Julia. MixedModels maximizes the log-likelihood of the
data under the assumption that ai, bj , and eij are normally distributed. It uses the derivative-free
optimization algorithm BOBYQA [39], iteratively computing and maximizing an approximation to
the objective. More specifically, at each iteration, it computes the objective at various values of β,
σ2A, σ2
B , and σ2E . Each computation of the objective can be shown to require O(N3/2) computation
(see Section 3.2.3), but the number of iterations needed for such a non-concave function does not
seem to be known.
3.5.1 Simulations
We used a range of data sizes: p = 1, 5, 10, 20 and N = 100, 400, 1600, 6400, 25,600, 102,400, 409,600,
1,638,400. We let β be the vector of ones, and each entry of x_ij was sampled from N(0, 1). For each
choice of R = C = 2√N and p, we sampled 100 observation patterns, randomly selecting exactly 25
percent of the (i, j) pairs to observe each time. With the x_ij and observation pattern fixed, we
sampled Y_ij according to model (1) for the observed (i, j), with σ²_A = 2, σ²_B = 0.5, and σ²_E = 1. Then
we ran our algorithm (including calculating the variance of β) and maximum likelihood, recording
their computation times and the mean squared errors of their parameter estimates. At the end, for
each pair of values of R = C and p, we recorded the average computation times and the four mean
squared errors of both algorithms. For comparison purposes, we divide the mean squared errors of
β by p.
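One replicate of this design can be generated as follows. This is our own sketch (the chapter's experiments were run in Julia; we use NumPy here), with the design parameters hard-coded to the values above.

```python
import numpy as np

def simulate(N, p, s2A=2.0, s2B=0.5, s2E=1.0, seed=0):
    """One replicate of the Section 3.5.1 design: R = C = 2*sqrt(N) and
    exactly 25 percent of the (i, j) pairs observed, so N observations."""
    rng = np.random.default_rng(seed)
    R = C = int(round(2 * np.sqrt(N)))
    pairs = rng.choice(R * C, size=R * C // 4, replace=False)  # exactly 25% of cells
    i_idx, j_idx = pairs // C, pairs % C
    n = pairs.size
    X = rng.normal(size=(n, p))                   # x_ij entries from N(0, 1)
    beta = np.ones(p)                             # beta is the vector of ones
    a = rng.normal(0, np.sqrt(s2A), R)            # row random effects
    b = rng.normal(0, np.sqrt(s2B), C)            # column random effects
    e = rng.normal(0, np.sqrt(s2E), n)            # noise
    Y = X @ beta + a[i_idx] + b[j_idx] + e
    return i_idx, j_idx, X, Y
```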
We also carried out some auxiliary simulations. When p = 1, we also considered N = 6,553,600
and 14,745,600, but with 10 observation patterns instead of 100. For our algorithm, we additionally
used N = 26,214,400 and 58,982,400; the maximum likelihood algorithm threw an error for more than
21 million observations. When p = 5, we also considered N = 6,553,600 and 10,240,000, but with
10 observation patterns instead of 100. For our algorithm, we additionally used N = 14,745,600,
26,214,400, and 36,966,400; the maximum likelihood algorithm threw an error for more than 13
million observations. When p = 10, we also considered N = 6,553,600, but with 10 observation
patterns instead of 100. For our algorithm, we additionally used N = 14,745,600 and 23,040,000;
the maximum likelihood algorithm threw an error for more than 9 million observations. When
p = 20, we also considered N = 3,686,400, but with 10 observation patterns instead of 100. For our
algorithm, we additionally used N = 6,553,600 and 14,745,600; the maximum likelihood algorithm
threw an error for more than 5 million observations.
3.5.1.1 Computational efficiency
Figure 3.2 shows log-log plots of the average computation time required in seconds versus N for each
value of p. The black dotted lines correspond to a computational complexity of O(N) and the green
dotted lines to a computational complexity of O(N3/2). We see that for large N , the computational
cost of maximum likelihood is indeed O(N3/2), while the computational cost of our algorithm is
O(N) throughout.
Notice that the computation time savings of our algorithm compared to maximum likelihood are
larger for smaller p. This is because, as seen in Section 3.2.3, as p increases for constant N, the
fraction of the computation time taken to compute Xβ increases.
Because the maximum likelihood cost is superlinear, we expect that it should become infeasible
at some N for which the method of moments still works. Indeed, this is what we observed in the
previous section. Because our algorithm only computes simple summary statistics on each pass over
the data, it is easily parallelizable and we do not have to store any large matrices in memory on one
machine.
For p = 5, we also investigated the computation time per iteration for maximum likelihood.
We repeated the same simulation as above, but with 10 runs instead of 100 and N = 25 and
N = 6,553,600 being the smallest and largest data sizes, respectively. Figure 3.1 shows a log-log
scatter plot of the computation time per iteration versus N . The black line goes through the mean
computation time per iteration for each data size. For larger values of N , the computation time per
iteration indeed scales as O(N3/2), as shown by the dotted line.
3.5.1.2 Statistical efficiency
For both MLE and our algorithm, the scaled MSE of β appears to decrease linearly with the number
of observations. From the Gauss-Markov theorem we know that βRLS or βCLS cannot be statistically
efficient. In our simulated example, the MSE of the estimate output by our algorithm is larger by
at most a factor of two.
For σ²_A and σ²_B, the MSE of the estimates output by both algorithms decreases sublinearly with
the number of observations. Initially, the MSE of our algorithm may be larger, but as N increases
Figure 3.1: For p = 5, log-log plot of the computation time per iteration for maximum likelihood.
the two MSEs are nearly identical. Occasionally, the MSE of our algorithm is smaller than
that of the MLE. This may be due to the MLE search being caught in local maxima of the likelihood.
For σ²_E, the MSE of the estimates output by both algorithms decreases approximately linearly with
the number of observations. However, the MSE of our estimate is usually larger by a factor of two
or three.
In this example, we take a modest loss in statistical efficiency for β and σ²_E, while attaining a better
convergence rate in computational cost. For σ²_A and σ²_B the two algorithms have practically
equivalent accuracy. These comparisons were run on data simulated from the Gaussian model that
the MLE assumes, but our moments method does not require that assumption. Likelihood based
variance estimates for variance components, e.g., Var(σ²_A), can fail to be even asymptotically correct
when the underlying parametric model is violated.
3.5.2 An Example from E-commerce
Stitch Fix has provided us some of their client ratings data. This data is fully anonymized and void
of any personally identifying information. Further, the data provided by Stitch Fix is a sample of
their data, and consequently does not reflect their actual numbers of clients, items or their ratios,
for example. Nonetheless this is an interesting data set on which to fit a linear mixed effects model.
We received data on clients’ ratings of items they received, as well as the following information
about the clients and items. For each client and item pair, we have a composite rating Yij on a
scale from 1 to 10, the item material, whether the item style is edgy, whether the client likes edgy
styles, and a match score, which is the probability that the client keeps the item, predicted before it
is actually sent. The match score is a prediction from a baseline model and is not representative of
all algorithms used at Stitch Fix. There is some circularity in the data, because whether items are
labeled as edgy/Bohemian is based on whether clients who like edgy/Bohemian styles like it. We
ignore such complications for the sake of simplicity. Note that we treat the rating as a continuous
(a) p = 1 (b) p = 5
(c) p = 10 (d) p = 20
Figure 3.2: For each value of p, log-log plots of the computation times of the two algorithms against N.
variable, even though it is actually ordinal.
We first investigated the observation pattern of the data. We received N = 5,000,000 ratings on
C = 6,318 items by R = 762,752 clients. Thus, in the sample of data provided by Stitch Fix we have
C/N ≈ 0.00126 and R/N ≈ 0.153. The latter ratio indicates that only a relatively small number of
ratings from each client are included in the data (their full shipment history is not included in the
sampled data). The data are not dominated by a single row or column because ε_R ≈ 9 × 10⁻⁶ and
ε_C ≈ 0.0143. Similarly,

    N / Σ_i N_i•² ≈ 0.103,   Σ_i N_i•² / N² ≈ 1.95 × 10⁻⁶,   N / Σ_j N_•j² ≈ 1.22 × 10⁻⁴,   and   Σ_j N_•j² / N² ≈ 0.00164,

so that the average row or column has size ≫ 1 and ≪ N.
(a) p = 1 (b) p = 5
(c) p = 10 (d) p = 20
Figure 3.3: For each value of p, log-log plots of the (scaled by p) mean squared error of β against N .
We built a linear mixed effects model for the ratings as a function of the other information about
the client and item, with client and item random effects, as follows:
Model 3. For client i and item j,

    rating_ij = β₀ + β₁ match_ij + β₂ I{client edgy}_i + β₃ I{item edgy}_j
              + β₄ I{client edgy}_i ∗ I{item edgy}_j + β₅ I{client boho}_i + β₆ I{item boho}_j
              + β₇ I{client boho}_i ∗ I{item boho}_j + β₈ material_ij + a_i + b_j + e_ij,

where I{client edgy}_i ∈ {0, 1} indicates whether client i likes edgy styles and I{item edgy}_j indicates
whether item j is edgy. Similar notation holds for the Bohemian variables. Note that since material_ij
is a categorical variable, in practice it is replaced by indicator variables for each type of material.
We chose ‘Polyester’, the most common material, as the baseline.
Suppose that we erroneously believed that the data were independent and identically distributed
(a) p = 1 (b) p = 5
(c) p = 10 (d) p = 20
Figure 3.4: For each value of p, log-log plots of the mean squared error of σ²_A against N.
by ignoring both client and item random effects. We might then run OLS. The resulting reported
regression coefficients and standard errors are shown in the first two columns of Table 3.1. Estimated
coefficients are starred if they are reported as being significant at the 0.05 level.
The third column contains more realistic standard errors of the OLS regression coefficients, ac-
counting for both the client and item random effects. These standard errors were computed using
the variance component estimates from our algorithm. As expected, they can be much larger, often
by a factor of ten, than the OLS reported standard errors. Ten of the variables, ‘Acrylic’, ‘Cash-
mere’, ‘Cupro’, ‘Fur’, ‘Linen’, ‘PVC’, ‘Rayon’, ‘Silk’, ‘Viscose’, and ‘Wool’, that appear significantly
different from polyester by OLS are not really significant when one accounts for client and item
correlations.
The final two columns contain the regression coefficients estimated by our algorithm and their
standard errors as described in Section 3.4.2.1. Again, estimated coefficients are starred if they are
significant at the 0.05 level. The estimated variance components are σ²_A = 1.133, σ²_B = 0.1463, and
(a) p = 1 (b) p = 5
(c) p = 10 (d) p = 20
Figure 3.5: For each value of p, log-log plots of the mean squared error of σ²_B against N.
σ²_E = 4.474. Their standard errors are approximately 0.0046, 0.00089, and 0.0050 respectively, so
these components are well determined. The error variance component is largest, and the client effect
dominates the item effect by almost a factor of eight.
The ‘Match’ variable is significantly positively associated with rating, indicating that the baseline
prediction provided by Stitch Fix is a useful predictor in this data set. However, the random effects
model reduces its coefficient from about 5 to about 3.5. We believe that this is because some clients
tend to give higher ratings on average than others; it can be verified that there are pairs of clients
(each with at least 40 ratings) for which the averages of the ‘Match’ variable are nearly equal but
which have significantly different average ratings. OLS does not model this client effect and tries to
overfit using ‘Match’. The random effects model allows for a client effect, and thus the magnitude of
the estimated coefficient for ‘Match’ diminishes.
Shipping an edgy item to a client who does not like edgy styles is associated with a rating decrease
of about a third, but shipping such an item to a client who does like edgy styles is associated with a
(a) p = 1 (b) p = 5
(c) p = 10 (d) p = 20
Figure 3.6: For each value of p, log-log plots of the mean squared error of σ²_E against N.
small increase in rating. An interesting pattern emerges for Bohemian styles. For both clients who
like Bohemian styles and clients who don’t, shipping an item that is Bohemian is associated with a
decrease in rating. This suggests that it is very difficult to choose a Bohemian clothing item that
pleases a client, and so Stitch Fix may want to send a Bohemian item only when they are sure that
the client will like it based on a strong signal from some other variables.
To investigate this possibility, we consider subsets of the data based on the value of ‘Match’.
For t = 0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, we consider only the observations for which ‘Match’≥ t,
and fit model (3) using our algorithm. We compute the estimated β5, β6 (the effect of sending a
Bohemian item to a non-Bohemian client), β6 + β7 (the effect of sending a Bohemian item to a
Bohemian client), and β5 + β6 + β7, and their standard errors. Finally, we plot those four estimates
and their 95 percent confidence intervals as a function of t in Figure 3.7.
As we see from Figure 3.7(b), for clients who don’t like Bohemian styles, the optimal choice is
not to send them a Bohemian item, even if ‘Match’ is as high as 0.7. Similarly, from Figure 3.7(c),
Figure 3.7: For each subset of data, estimates and confidence intervals for the Bohemian-related variables of interest (panels: (a) $\beta_5$, (b) $\beta_6$, (c) $\beta_6 + \beta_7$, (d) $\beta_5 + \beta_6 + \beta_7$).
even when ‘Match’ is as high as 0.6, for clients who like Bohemian styles, the optimal choice is still
not to send them a Bohemian item. The Boho flag is not as useful as the Edgy one.
Of the materials, 'Cotton', 'Faux Fur', 'Leather', 'Modal', 'Pleather', 'PU', 'PVC', 'Silk', 'Spandex', and 'Tencel' are significantly different from the baseline, 'Polyester', in our crossed random effects
model. ‘PU’ and ‘PVC’ are associated with an increase in rating of at least half a point. Those
materials are often used to make shoes and specialty clothing, which may explain their association
with high ratings.
The computations in this section were done in Python.
Table 3.1: Stitch Fix Regression Results
Variable                         βOLS        seOLS(βOLS)  se(βOLS)   β           se(β)
Intercept                        4.635∗      0.005397     0.05808    5.110∗      0.01250
Match                            5.048∗      0.01174      0.1464     3.529∗      0.02153
I{client edgy}                   0.001020    0.002443     0.004593   0.001860    0.003831
I{item edgy}                    −0.3358∗     0.004253     0.03730   −0.3328∗     0.01542
I{client edgy}∗I{item edgy}      0.3925∗     0.006229     0.01352    0.3864∗     0.006432
I{client boho}                   0.1386∗     0.002264     0.004354   0.1334∗     0.003622
I{item boho}                    −0.5499∗     0.005981     0.03049   −0.6261∗     0.01661
I{client boho}∗I{item boho}      0.3822∗     0.007566     0.01057    0.3837∗     0.007697
Acrylic                         −0.06482∗    0.003778     0.03804   −0.01627     0.02149
Angora                          −0.01262     0.007848     0.09631    0.07271     0.05837
Bamboo                          −0.04593     0.06215      0.2437     0.05420     0.1716
Cashmere                        −0.1955∗     0.02484      0.1593     0.01354     0.1176
Cotton                           0.1752∗     0.003172     0.04766    0.09743∗    0.01811
Cupro                            0.5979∗     0.3016       0.4857     0.5603      0.4852
Faux Fur                         0.2759∗     0.02008      0.08631    0.3649∗     0.07524
Fur                             −0.2021∗     0.03121      0.1560    −0.03478     0.1331
Leather                          0.2677∗     0.02482      0.08671    0.2798∗     0.07335
Linen                           −0.3844∗     0.05632      0.2729     0.006269    0.1660
Modal                            0.002587    0.009775     0.2052     0.1417∗     0.06498
Nylon                            0.03349∗    0.01552      0.1000     0.1186      0.06436
Patent Leather                  −0.2359      0.1800       0.4235    −0.2473      0.4222
Pleather                         0.4163∗     0.008916     0.09905    0.3344∗     0.05023
PU                               0.4160∗     0.008225     0.09019    0.4951∗     0.04196
PVC                              0.6574∗     0.06545      0.3898     0.8713∗     0.3883
Rayon                           −0.01109∗    0.002951     0.04602    0.01029     0.01493
Silk                            −0.1422∗     0.01317      0.1004    −0.1656∗     0.05471
Spandex                          0.3916∗     0.01729      0.1549     0.3631∗     0.1284
Tencel                           0.4966∗     0.009313     0.1935     0.1548∗     0.06718
Viscose                          0.04066∗    0.006953     0.09620   −0.01389     0.03527
Wool                            −0.06021∗    0.006611     0.08141   −0.006051    0.03737
Chapter 4
Conclusions
We conclude by discussing some practical lessons we have gathered, summarizing, and outlining
some potential directions for future work.
4.1 Practical Considerations
In this section, we briefly describe some issues to keep in mind when working with large data sets
with crossed random effects structure.
Data Preprocessing In the real-world examples we considered, the responses Yij were not actu-
ally continuous, but ordinal. For e-commerce, in many cases the responses are ordinal or counts. We
did not transform the responses in any way, but there are several options for doing so. First, we can
add random perturbations to the responses and account for this additional variance in our estimate of $\sigma^2_E$. Second, we can standardize the responses and adjust the estimated regression coefficients and
variance components accordingly. Third, we can apply some variance stabilizing transform, such as
taking logarithms. However, this approach has the disadvantage of not leading to interpretable re-
sults when the transformation is reversed. Alternatively, we could use ordinal or Poisson regression,
as described in Section 4.3.3.
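The first option above can be sketched in a few lines. This is our own minimal illustration, not code from the analysis: it assumes uniform jitter on $[-h, h]$, whose variance $h^2/3$ would later be subtracted from the estimate of $\sigma^2_E$.

```python
import random

def jitter_responses(Y, half_width=0.5, seed=0):
    """Add uniform noise on [-half_width, half_width] to ordinal responses.

    The added noise has variance half_width**2 / 3, which should be
    subtracted from the estimated error variance afterwards.
    """
    rng = random.Random(seed)
    jitter_var = half_width ** 2 / 3.0
    Y_jittered = [y + rng.uniform(-half_width, half_width) for y in Y]
    return Y_jittered, jitter_var
```

After running the moment estimator on the jittered responses, one would report the estimated error variance minus `jitter_var`.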
Our algorithms have assumed that the observations are stored in log format, i.e. in tuples $(i, j, Y_{ij}, x_{ij})$. We believe that this is the most efficient way to store the data since we expect that
the majority of the (i, j) tuples are not observed. For the Stitch Fix data set used in Section 3.5.2,
approximately 0.1 percent of the client-item pairs were observed. If the data are not originally in this sparse format, it is simple to convert them.
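As an illustration of working in this log format, the sketch below (our code; the names are illustrative, and the covariates are carried along but unused here) makes a single pass over the tuples, accumulating $N$, the row counts $N_{i\bullet}$, the column counts $N_{\bullet j}$, and the per-row response sums that the estimators need.

```python
from collections import defaultdict

def one_pass_counts(observations):
    """Single pass over sparse (i, j, y, x) tuples, accumulating N, the row
    counts Ni., the column counts N.j, and the per-row response sums."""
    Ni = defaultdict(int)
    Nj = defaultdict(int)
    row_sum = defaultdict(float)
    N = 0
    for i, j, y, x in observations:
        Ni[i] += 1
        Nj[j] += 1
        row_sum[i] += y
        N += 1
    return N, dict(Ni), dict(Nj), dict(row_sum)
```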
Computational Requirements Throughout this thesis, we have assumed that the N observations and O(R + C) intermediate quantities can fit on one machine. But this may not always be possible.
In that case, we believe that the best setup is to separate the observations onto different machines
and store the intermediate quantities on one machine, which we call Machine 0. Then, as we pass
over the entire data set, the intermediate quantities that are sums of the observations, e.g. $N_{i\bullet}$ and $X_{i\bullet}$, are computed and then communicated to Machine 0. At the end, Machine 0 collects
all the intermediate quantities, (possibly summing over all the machines, since the observations
corresponding to the same level of a factor will be on separate machines) and then finishes computing
the estimates.
To obtain standard errors, we need to do a second pass over the data as stated previously. In
that pass, we would communicate some intermediate quantities back to the individual machines, and
the individual machines would then communicate O(1) new intermediate quantities back to Machine
0. Finally, Machine 0, as before, collects all the intermediate quantities and computes the standard
errors. In all, this procedure would incur O(R+ C) communication cost as well.
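The setup above can be mimicked in a few lines. In this sketch (ours, with Machine 0 played by the combining function), each machine emits its local row counts and response sums, which Machine 0 then adds, since a factor level may be split across machines.

```python
from collections import defaultdict

def local_stats(shard):
    """Per-machine pass: row-level counts and response sums for its observations."""
    stats = defaultdict(lambda: [0, 0.0])   # level i -> [Ni., sum of Yij]
    for i, j, y in shard:
        s = stats[i]
        s[0] += 1
        s[1] += y
    return stats

def machine0_combine(all_stats):
    """Machine 0 sums the per-machine contributions, since one factor level
    may appear on several machines."""
    total = defaultdict(lambda: [0, 0.0])
    for stats in all_stats:
        for i, (n, s) in stats.items():
            total[i][0] += n
            total[i][1] += s
    return total
```

The communication per machine is O(number of levels it holds), so the overall cost stays O(R + C).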
4.2 Summary
In this thesis, we studied large linear mixed models with two crossed random effects with unknown
densities, which appear in many modern e-commerce applications. The inspiration for our project
came from Stitch Fix, an e-commerce company that provides personal styling services for women
and men. Their goal was to determine, in general, what characteristics of clients and clothes lead to
greater client satisfaction. Mixed models are ideal for such tasks because clients and clothes should
be treated as random effects in order to obtain conclusions about the population.
We started with a simplification of the linear mixed model with two crossed random effects,
the crossed random effects model; the regression coefficients are replaced by a global mean. This
model still illustrates the difficulties of estimation and inference for large linear mixed models with
crossed random effects, since they are primarily caused by the correlations among observations. After
showing that Bayesian analysis does not scale to large crossed random effects models, we propose a
new method of moments estimator of the variance components based on U -statistics. The estimator
requires linear computation time and sublinear memory, which makes it scalable to the data sets of
our interest. It admits easier analysis than previous estimators in the literature, and thus we are
able to compute its variance exactly and prove a central limit theorem. The variance can also be
approximated in linear computation time and sublinear memory.
Going back to the linear mixed model with two crossed random effects, we propose a scalable
algorithm that alternates between estimating the regression coefficients and the variance components.
To estimate the variance components, we use the estimators proposed for the crossed random effects
model, which requires linear computation time and sublinear memory. To estimate the regression
coefficients, we use the generalized least squares estimator ignoring one of the random effects, which
can be computed using linear time and sublinear space. The final estimates are shown to be consistent
and asymptotically normal.
For both the crossed random effects model and the linear mixed model with crossed random
effects, we carried out simulations and experiments on real-world data. The simulations compared
the computation time and mean squared errors of our proposed estimators and maximum likelihood
estimators, when data is generated using Gaussian random effects. We found that the mean squared
errors of our estimates of $\sigma^2_A$ and $\sigma^2_B$ were comparable to those of the maximum likelihood estimates, while the mean squared errors of our estimates of $\beta$ and $\sigma^2_E$ were at most a few factors larger than
those of the maximum likelihood estimates. However, the computation time of our estimates is much
smaller, by as much as a factor of 10.
4.3 Future Work
There are numerous avenues for future work. Here we will discuss several that are based on gener-
alizations of model (1) and on which there has been some previous work in the literature.
4.3.1 Heteroscedasticity
In many real-world situations, the variances of the random effects at different levels of the same factor
may be different. For example, suppose that we have a data set of how many times a user bought
particular items during some time period. For common household items, each brand is expected to
have a fairly equitable and sizable market share, since many have been around for a long time; thus,
we expect that the random effects for those items have a small variance. For consumer discretionary
and luxury items, some brands are simply much more popular than others, e.g. some movies only
have limited release, while others are released in multiple countries; then, we expect that the random
effects for these items have a large variance.
There are several possible ways to implement this generalization in our model (1). We could
assign each level its own variance, such as in Owen (2007) [36] and Owen & Eckles (2012) [37].
However, this model may be too expressive to be practical. Alternatively, we can assume that
for each random effect, the variances are generated from a mixture model with a set number of
parametric distributions.
4.3.2 Factor Interactions
In many real-world data sets, we would also expect there to be interactions between the levels of
different factors. In the Stitch Fix data set, due to differences in personal preferences, certain clients
may tend to rate certain items higher. For instance, younger clients might like edgier styles while
older clients might prefer more expensive materials.
Therefore, in model (1), we could add a term $ab_{ij}$ to account for such interactions. However,
some restrictions must be made on its distribution in order to make it identifiable from the noise
term eij . A popular technique from computer science is to assume that its covariance matrix has
some low rank structure.
4.3.3 Generalized Linear Models
In this thesis, we have assumed that the responses Yij are continuous. However, in practice in many
cases the responses are discrete. In the data set from Last.fm [29], the responses are actually counts,
while Stitch Fix not only collects data on the ratings of items by clients but also whether the clients
keep the items.
Therefore, a natural next step is to extend this work to link functions other than the identity,
e.g. logistic and Poisson regression. The extension of maximum likelihood procedures to generalized
linear mixed models has been well-studied and implemented in software including lme4 and MixedModels. As shown in Bates (2014) [5], such procedures, in addition to requiring strong parametric assumptions, require computation time $O(N^{3/2})$ to compute the likelihood at a given set of values
for the parameters.
Because we have utilized the method of moments to good effect for linear mixed models, we can also
consider using it for generalized linear mixed models. Jiang (1998) [24] has done so, using sufficient
statistics as the estimating equations. Due to the nonlinearity of generalized linear mixed models,
the Newton-Raphson procedure is used to solve the system of equations.
Appendix A
Crossed Random Effects Model
A.1 Bayesian Analysis via MCMC
A.1.1 Proof of Theorem 2.3.1
We may assume that $i \in \{1, 2, \ldots, R\}$ and $j \in \{1, 2, \ldots, C\}$. The posterior distribution of the parameters is given by
\[
\begin{aligned}
p(\mu, a, b, \sigma^2_A, \sigma^2_B, \sigma^2_E \mid Y)
&\propto \prod_{i=1}^{R} \frac{1}{\sqrt{2\pi\sigma^2_A}} \exp\Bigl(-\frac{a_i^2}{2\sigma^2_A}\Bigr)
  \prod_{j=1}^{C} \frac{1}{\sqrt{2\pi\sigma^2_B}} \exp\Bigl(-\frac{b_j^2}{2\sigma^2_B}\Bigr)
  \times \prod_{i=1}^{R} \prod_{j=1}^{C} \frac{1}{\sqrt{2\pi\sigma^2_E}} \exp\Bigl(-\frac{(Y_{ij} - \mu - a_i - b_j)^2}{2\sigma^2_E}\Bigr) \\
&\propto \sigma_A^{-R} \sigma_B^{-C} \sigma_E^{-RC}
  \exp\Bigl(-\frac{\sum_i a_i^2}{2\sigma^2_A} - \frac{\sum_j b_j^2}{2\sigma^2_B}
  - \frac{\sum_{ij} (Y_{ij} - \mu - a_i - b_j)^2}{2\sigma^2_E}\Bigr).
\end{aligned}
\]
Then, $\phi \equiv p(a, b \mid \mu, \sigma^2_A, \sigma^2_B, \sigma^2_E, Y)$ is proportional to
\[
\exp\Bigl(-\sum_i \frac{a_i^2}{2}\Bigl(\frac{1}{\sigma^2_A} + \frac{C}{\sigma^2_E}\Bigr)
- \sum_j \frac{b_j^2}{2}\Bigl(\frac{1}{\sigma^2_B} + \frac{R}{\sigma^2_E}\Bigr)
- \frac{\sum_{ij} a_i b_j}{\sigma^2_E}\Bigr),
\]
up to terms linear in $a$ and $b$, which do not affect the precision matrix.
Therefore, the posterior distribution of $a$ and $b$ is jointly normal with precision matrix
\[
Q = \begin{pmatrix}
\dfrac{\sigma^2_E + C\sigma^2_A}{\sigma^2_A \sigma^2_E} I_R & \dfrac{1}{\sigma^2_E} 1_R 1_C^{\mathsf{T}} \\[2ex]
\dfrac{1}{\sigma^2_E} 1_C 1_R^{\mathsf{T}} & \dfrac{\sigma^2_E + R\sigma^2_B}{\sigma^2_B \sigma^2_E} I_C
\end{pmatrix}.
\]
From Theorem 1 of [42], for the Gibbs sampler described in Section 2.3.1.1, we have the following result. Let $A = I - \mathrm{diag}(Q_{11}^{-1}, Q_{22}^{-1})Q$, where $Q_{11}$ denotes the upper left block of $Q$ and $Q_{22}$ denotes the lower right block. Let $L$ be the block lower triangular part of $A$, and $U = A - L$. Then, the convergence rate $\rho$ is given by the spectral radius of the matrix $B = (I - L)^{-1}U$. Now, we compute $\rho$. First,
\[
A = I - \begin{pmatrix}
\dfrac{\sigma^2_A \sigma^2_E}{\sigma^2_E + C\sigma^2_A} I_R & 0 \\[2ex]
0 & \dfrac{\sigma^2_B \sigma^2_E}{\sigma^2_E + R\sigma^2_B} I_C
\end{pmatrix} Q
= \begin{pmatrix}
0 & -\dfrac{\sigma^2_A}{\sigma^2_E + C\sigma^2_A} 1_R 1_C^{\mathsf{T}} \\[2ex]
-\dfrac{\sigma^2_B}{\sigma^2_E + R\sigma^2_B} 1_C 1_R^{\mathsf{T}} & 0
\end{pmatrix}.
\]
Next,
\[
L = \begin{pmatrix}
0 & 0 \\
-\dfrac{\sigma^2_B}{\sigma^2_E + R\sigma^2_B} 1_C 1_R^{\mathsf{T}} & 0
\end{pmatrix}
\qquad \text{and} \qquad
U = \begin{pmatrix}
0 & -\dfrac{\sigma^2_A}{\sigma^2_E + C\sigma^2_A} 1_R 1_C^{\mathsf{T}} \\
0 & 0
\end{pmatrix},
\]
from which
\[
\begin{aligned}
B &= \begin{pmatrix}
I_R & 0 \\
\dfrac{\sigma^2_B}{\sigma^2_E + R\sigma^2_B} 1_C 1_R^{\mathsf{T}} & I_C
\end{pmatrix}^{-1} U
= \begin{pmatrix}
I_R & 0 \\
-\dfrac{\sigma^2_B}{\sigma^2_E + R\sigma^2_B} 1_C 1_R^{\mathsf{T}} & I_C
\end{pmatrix} U \\
&= \begin{pmatrix}
0 & -\dfrac{\sigma^2_A}{\sigma^2_E + C\sigma^2_A} 1_R 1_C^{\mathsf{T}} \\[1ex]
0 & \dfrac{R\sigma^2_A \sigma^2_B}{(\sigma^2_E + C\sigma^2_A)(\sigma^2_E + R\sigma^2_B)} 1_C 1_C^{\mathsf{T}}
\end{pmatrix}.
\end{aligned}
\]
Clearly, $B$ has rank one. Then, its spectral radius must be equal to its nonzero eigenvalue, which is also the trace of $B$. Hence,
\[
\rho = \frac{RC\sigma^2_A \sigma^2_B}{(\sigma^2_E + C\sigma^2_A)(\sigma^2_E + R\sigma^2_B)}.
\]
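The rate formula is easy to check numerically. The sketch below is our own verification code, not part of the proof: it builds $Q$ for small $R$ and $C$, forms $B = (I - L)^{-1}U$ directly, and compares the spectral radius with the closed form.

```python
import numpy as np

def gibbs_rate(R, C, sA2, sB2, sE2):
    """Build Q, form A = I - diag(Q11^{-1}, Q22^{-1}) Q, split A into its
    block lower part L and U = A - L, and return the spectral radius of
    B = (I - L)^{-1} U alongside the closed form rho."""
    ones = np.ones((R, C))
    Q = np.block([
        [(sE2 + C * sA2) / (sA2 * sE2) * np.eye(R), ones / sE2],
        [ones.T / sE2, (sE2 + R * sB2) / (sB2 * sE2) * np.eye(C)],
    ])
    d = np.r_[np.full(R, sA2 * sE2 / (sE2 + C * sA2)),
              np.full(C, sB2 * sE2 / (sE2 + R * sB2))]
    A = np.eye(R + C) - d[:, None] * Q
    # the diagonal blocks of A vanish, so the elementwise triangular parts
    # coincide with the block triangular parts
    L = np.tril(A, -1)
    U = np.triu(A, 1)
    B = np.linalg.solve(np.eye(R + C) - L, U)
    rho = max(abs(np.linalg.eigvals(B)))
    closed = R * C * sA2 * sB2 / ((sE2 + C * sA2) * (sE2 + R * sB2))
    return rho, closed
```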
A.1.2 Simulation results
The results of our simulations described in Section 2.3.2 are presented here in Tables A.1 through A.5.
A.2 Expectation and Variance of Moment Estimates
In this section we provide the details and proofs for Sections 2.4.1 and 2.4.2. Before diving in, we
note an additional identity:
\[
\sum_{ir} N_{i\bullet}^{-1} (ZZ^{\mathsf{T}})_{ir}
= \sum_{ijr} N_{i\bullet}^{-1} Z_{ij} Z_{rj}
= \sum_{ij} Z_{ij} N_{i\bullet}^{-1} N_{\bullet j}. \tag{A.1}
\]
We also need some notation for equality among index sets. The notation $1_{ij=rs}$ means $1_{i=r}1_{j=s}$. It is different from $1_{\{i,j\}=\{r,s\}}$, which we use as well. Additionally, $1_{ij \neq rs}$ means $1 - 1_{ij=rs}$.
Table A.1: Median CPU time in seconds.

                  Gibbs   Block   Reparam.   Lang.   MALA    Indp.   RWM    RWM Sub.   pCN
R=10,   C=10         20       9        23      20      27      21     19        21      21
R=20,   C=20         33      10        37      35      45      34     32        33      33
R=50,   C=50         71      17        80      79     101      71     68        75      70
R=100,  C=100       143     361       159     156     199     139    133       141     136
R=200,  C=200       326     984       351     323     462     300    279       303     280
R=500,  C=500      1157    2356      1205     955    1786     952    851      1019     817
R=1000, C=1000     3432   15046      4099    2302    4760    2513   2141      2635    1966
R=2000, C=2000    10348   88756     11434    6991   15836    7815   5712      9274    6006
R=50,   C=100       105     287       121     112     151     103    101       107     102
R=10,   C=200       138     316       167     139     200     138    137       142     138
R=100,  C=1000      898    5148       964     807    1179     795    748       822     760
Table A.2: Median estimates of µ (upper row of each pair) and number of lags after which ACF(µ) ≤ 0.5 (lower row).

                       Gibbs  Block  Reparam.  Lang.  MALA   Indp.  RWM    RWM Sub.  pCN
R=10, C=10      est.    0.72   0.94    1.27    1.07   1.18    2.40   0.76    0.74    1.51
                lags      26     29      24     178    689    1604   1252    1522    1392
R=20, C=20      est.    0.81   1.02    1.01    1.07   0.94    2.89   1.69    1.08    1.47
                lags      34     43      26      75    841    1019   1674    1720    1765
R=50, C=50      est.    1.09   0.91    0.98    0.98   1.04    2.97   1.66    1.70    1.58
                lags      83     84      75       8    610   5000+   1158    1681    1104
R=100, C=100    est.    0.98   1.02    1.13    0.99   0.85    2.73   1.57    1.61    1.49
                lags     123    185     144       2    398   5000+   1145    1713    1522
R=200, C=200    est.    1.01   1.02    1.03    1.01   0.95    3.22   1.60    1.31    1.52
                lags     257    346     272       1      1    1278   1508    1692     807
R=500, C=500    est.    0.99   1.01    0.99    0.99   1.00    2.26   1.58    1.15    1.55
                lags     536    617     576       9      4    1572    924    1687    1613
R=1000, C=1000  est.    0.97   1.02    1.04    0.99   0.96    2.39   1.55    1.07    1.53
                lags     801    790     694       1   2501   5000+   1133    1656    1008
R=2000, C=2000  est.    0.98   1.01    1.00    1.01   1.00    2.57   1.55    1.03    1.55
                lags     672    721     771       1  5000+    1086   1176    1716     799
R=50, C=100     est.    0.89   1.03    0.95    1.01   1.06    2.70   1.57    1.61    1.45
                lags     144    155     118       7   1095   5000+   1219    1725    1371
R=10, C=200     est.    0.86   1.08    0.84    0.94   0.80    2.40   1.41    1.36    1.23
                lags     329    244     299     120    944    3339   1518    1657    1437
R=100, C=1000   est.    1.06   1.06    1.02    1.01   1.03    2.73   1.57    1.11    1.55
                lags     573    536     672       1      1    3330   1161    1681    3333
Table A.3: Median estimates of $\sigma^2_A$ (upper row of each pair) and number of lags after which ACF($\sigma^2_A$) ≤ 0.5 (lower row).

                       Gibbs  Block  Reparam.  Lang.  MALA   Indp.  RWM    RWM Sub.  pCN
R=10, C=10      est.    2.76   2.49    2.05    2.07   2.45    2.39   1.88    2.05    1.38
                lags       1      1       1     898    768    1604    759     606    1232
R=20, C=20      est.    2.00   2.06    1.65    1.89   2.32    1.48   1.96    1.76    2.00
                lags       1      1       1     930    829     850    873     822    1083
R=50, C=50      est.    1.94   1.96    2.17    1.77   2.21    1.44   2.06    2.03    1.95
                lags       1      1       1     797    720   5000+   1035    1032    1079
R=100, C=100    est.    2.21   2.14    2.23    1.88   1.87    1.11   2.19    1.92    1.95
                lags       1      1       1     649    398   5000+    994     917    1522
R=200, C=200    est.    2.09   2.09    2.10    2.08   1.99    1.16   2.02    2.12    2.01
                lags       1      1       1     410    437    1281   1598     673    1135
R=500, C=500    est.    1.97   2.12    1.99    1.64   1.96    1.07   2.02    2.01    1.97
                lags       1      1       1     407    197    1572    895     826    1599
R=1000, C=1000  est.    1.96   1.99    2.02    1.90   1.95    1.78   2.01    1.96    1.99
                lags       1      1       1     122   2656   5000+   1133     989     912
R=2000, C=2000  est.    1.97   2.00    2.03    1.94   1.99    1.04   2.01    2.00    1.99
                lags       1      1       1      69  5000+    1086   1181    1262    1161
R=50, C=100     est.    2.22   2.29    2.05    2.24   1.98    1.10   2.00    1.96    2.09
                lags       1      1       1     948    672   5000+   1103     787    1005
R=10, C=200     est.    2.34   1.74    3.05    2.70   2.72    0.88   1.89    1.43    1.16
                lags       1      1       1     891   1023    3309   1492     724     988
R=100, C=1000   est.    2.04   2.03    2.14    1.98   1.98    1.46   1.90    1.87    2.05
                lags       1      1       1     512    450    3329    985    1086    3333
Table A.4: Median estimates of $\sigma^2_B$ (upper row of each pair) and number of lags after which ACF($\sigma^2_B$) ≤ 0.5 (lower row).

                       Gibbs  Block  Reparam.  Lang.  MALA   Indp.  RWM    RWM Sub.  pCN
R=10, C=10      est.    0.66   0.81    0.88    0.46   0.89    1.47   0.45    0.43    0.45
                lags       1      1       1     382    638    1604   1214     956    1297
R=20, C=20      est.    0.54   0.45    0.44    0.43   0.44    1.55   0.49    0.46    0.57
                lags       1      1       1     261    410     978    937    1217     704
R=50, C=50      est.    0.49   0.49    0.49    0.49   0.53    1.35   0.49    0.43    0.48
                lags       1      1       1     123    138   5000+   1308     786    1463
R=100, C=100    est.    0.51   0.54    0.49    0.46   0.48    0.84   0.52    0.47    0.49
                lags       1      1       1      65     66   5000+    691    1169    1522
R=200, C=200    est.    0.49   0.51    0.51    0.47   0.50    1.67   0.51    0.49    0.50
                lags       1      1       1      36     37    1266   1497    1241     831
R=500, C=500    est.    0.51   0.49    0.50    0.28   0.47    1.56   0.50    0.48    0.47
                lags       1      1       1     770     16    1572    696     993    1619
R=1000, C=1000  est.    0.51   0.50    0.50    0.40   0.50    2.94   0.51    0.50    0.49
                lags       1      1       1     477   2514   5000+   1133     855     556
R=2000, C=2000  est.    0.50   0.50    0.49    0.39   0.50    1.65   0.48    0.49    0.50
                lags       1      1       1     224  5000+    1086   1220     830    1253
R=50, C=100     est.    0.50   0.51    0.53    0.48   0.54    1.93   0.53    0.49    0.49
                lags       1      1       1      69     85   5000+   1378     910    1419
R=10, C=200     est.    0.47   0.51    0.51    0.40   0.52    1.65   0.61    0.59    0.55
                lags       1      1       1      23     52    3332   1289    1004    1408
R=100, C=1000   est.    0.50   0.49    0.50    0.47   0.49    2.95   0.50    0.49    0.50
                lags       1      1       1       6      8    3328   1345     962    3333
Table A.5: Median estimates of $\sigma^2_E$ (upper row of each pair) and number of lags after which ACF($\sigma^2_E$) ≤ 0.5 (lower row).

                       Gibbs  Block  Reparam.  Lang.   MALA   Indp.  RWM    RWM Sub.  pCN
R=10, C=10      est.    1.02   0.99    0.96     0.91   1.17    0.17   0.76    0.80    0.75
                lags       1      1       1      196    334    1604   1354    1329    1504
R=20, C=20      est.    0.97   0.98    1.00     0.91   1.00    0.17   0.48    0.45    0.37
                lags       1      1       1       61     75    1218   1649    1614    1827
R=50, C=50      est.    1.00   1.01    0.98     0.96   0.99    0.17   0      0.01    0
                lags       1      1       1       10     12   5000+   1107    1616    1466
R=100, C=100    est.    1.00   1.00    1.00     0.98   1.00    0.16   0      0.38    0
                lags       1      1       1        3      3   5000+   1199    1714    1532
R=200, C=200    est.    1.00   1.00    1.00     1.01   1.01    0.21   0      0.66    0
                lags       1      1       1        1      1    1266   1626    1691     636
R=500, C=500    est.    1.00   1.00    1.00   118.45  52.70    0.14   0      0.87    0
                lags       1      1       1      545    138    1572    834    1702    1616
R=1000, C=1000  est.    1.00   1.00    1.00    65.22   2.66    0.15   0      0.93    0
                lags       1      1       1      385   3062   5000+   1518    1724     621
R=2000, C=2000  est.    1.00   1.00    1.00   115.59   1.05    0.18   0      0.97    0
                lags       1      1       1       10  5000+    1021   1194    1702    1014
R=50, C=100     est.    1.01   0.99    1.00     0.98   1.01    0.15   0      0.19    0
                lags       1      1       1        5      6   5000+   1676    1774    1442
R=10, C=200     est.    0.99   0.99    1.01     0.92   0.99    0.17   0      0.55    0
                lags       1      1       1       12     15    3309   1570    1678    1279
R=100, C=1000   est.    1.00   1.00    1.00     3.50   3.46    0.19   0      0.87    0
                lags       1      1       1        3      3    3330   1454    1699    3333
A.2.1 Weighted U-statistics
We will work with weighted U-statistics
\[
\begin{aligned}
U_a &= \frac{1}{2} \sum_{ijj'} u_i Z_{ij} Z_{ij'} (Y_{ij} - Y_{ij'})^2, \\
U_b &= \frac{1}{2} \sum_{iji'} v_j Z_{ij} Z_{i'j} (Y_{ij} - Y_{i'j})^2, \quad \text{and} \\
U_e &= \frac{1}{2} \sum_{iji'j'} w_{ij} Z_{ij} Z_{i'j'} (Y_{ij} - Y_{i'j'})^2,
\end{aligned}
\]
for weights $u_i$, $v_j$ and $w_{ij}$ chosen below.
We can write $U_a = \sum_i u_i N_{i\bullet}(N_{i\bullet} - 1) s^2_{i\bullet}$ where $s^2_{i\bullet}$ is an unbiased estimate of $\sigma^2_B + \sigma^2_E$ from within any row $i$ with $N_{i\bullet} \geq 2$. Under our model the values in row $i$ are IID with mean $\mu + a_i$ and variance $\sigma^2_B + \sigma^2_E$, and so
\[
\mathrm{Var}(s^2_{i\bullet}) = (\sigma^2_B + \sigma^2_E)^2 \Bigl( \frac{2}{N_{i\bullet} - 1} + \frac{\kappa(b_j + e_{ij})}{N_{i\bullet}} \Bigr),
\]
where $\kappa(b_j + e_{ij}) = (\kappa_B \sigma^4_B + \kappa_E \sigma^4_E)/(\sigma^2_B + \sigma^2_E)^2$ is the kurtosis of $Y_{ij}$ for the given $i$ and any $j$. Thus
\[
\mathrm{Var}(s^2_{i\bullet}) = \frac{2(\sigma^2_B + \sigma^2_E)^2}{N_{i\bullet} - 1} + \frac{\kappa_B \sigma^4_B}{N_{i\bullet}} + \frac{\kappa_E \sigma^4_E}{N_{i\bullet}}. \tag{A.2}
\]
Inverse variance weighting then suggests that we weight $s^2_{i\bullet}$ proportionally to a value between $N_{i\bullet}$ and $N_{i\bullet} - 1$. Weighting proportional to $N_{i\bullet} - 1$ has the advantage of zeroing out rows with $N_{i\bullet} = 1$. This consideration motivates us to take $u_i = 1/N_{i\bullet}$, and similarly $v_j = 1/N_{\bullet j}$.
If $U_e$ is dominated by contributions from $e_{ij}$ then the observations enter symmetrically and there is no reason not to take $w_{ij} = 1$. Even if the $e_{ij}$ do not dominate, the statistic $U_e$ compares more data pairs than the others. It is unlikely to be the information limiting statistic. So $w_{ij} = 1$ is a reasonable default.
If the data are IID then only $U_e$ above is nonzero. This is appropriate as only the sum $\sigma^2_A + \sigma^2_B + \sigma^2_E$ can be identified in that case. For data that are nested but not IID, only two of the U-statistics above are nonzero and in that case only one of $\sigma^2_A$ and $\sigma^2_B$ can be identified separately from $\sigma^2_E$.
The U-statistics we use are then
\[
\begin{aligned}
U_a &= \frac{1}{2} \sum_{ijj'} N_{i\bullet}^{-1} Z_{ij} Z_{ij'} (Y_{ij} - Y_{ij'})^2, \\
U_b &= \frac{1}{2} \sum_{iji'} N_{\bullet j}^{-1} Z_{ij} Z_{i'j} (Y_{ij} - Y_{i'j})^2, \quad \text{and} \\
U_e &= \frac{1}{2} \sum_{iji'j'} Z_{ij} Z_{i'j'} (Y_{ij} - Y_{i'j'})^2.
\end{aligned} \tag{A.3}
\]
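The statistics in (A.3) can be evaluated in one pass over the data. The sketch below is our own illustration (names are ours; covariates are omitted from the tuples): it uses the identity $\sum_{jj'} Z_{ij} Z_{ij'} (Y_{ij} - Y_{ij'})^2 = 2 N_{i\bullet} \sum_j Z_{ij} Y_{ij}^2 - 2 (\sum_j Z_{ij} Y_{ij})^2$ within each row, with column and global analogues, and the test compares it against the direct $O(N^2)$ pairwise sums.

```python
from collections import defaultdict

def u_statistics(observations):
    """Linear-time evaluation of Ua, Ub, Ue from (A.3) over (i, j, y) tuples,
    via per-row, per-column, and global counts, sums, and sums of squares."""
    rows = defaultdict(lambda: [0, 0.0, 0.0])   # i -> [Ni., sum Y, sum Y^2]
    cols = defaultdict(lambda: [0, 0.0, 0.0])   # j -> [N.j, sum Y, sum Y^2]
    N, S, SS = 0, 0.0, 0.0
    for i, j, y in observations:
        for acc in (rows[i], cols[j]):
            acc[0] += 1
            acc[1] += y
            acc[2] += y * y
        N += 1
        S += y
        SS += y * y
    Ua = sum(ss - s * s / n for n, s, ss in rows.values())
    Ub = sum(ss - s * s / n for n, s, ss in cols.values())
    Ue = N * SS - S * S
    return Ua, Ub, Ue
```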
A.2.2 Expected Values
Here we find the expected values for our three U -statistics.
Lemma A.2.1. Under the random effects model (2), the U-statistics in (A.3) satisfy
\[
\begin{pmatrix} E(U_a) \\ E(U_b) \\ E(U_e) \end{pmatrix}
= \begin{pmatrix}
0 & N - R & N - R \\
N - C & 0 & N - C \\
N^2 - \sum_i N_{i\bullet}^2 & N^2 - \sum_j N_{\bullet j}^2 & N^2 - N
\end{pmatrix}
\begin{pmatrix} \sigma^2_A \\ \sigma^2_B \\ \sigma^2_E \end{pmatrix}. \tag{A.4}
\]
Proof. First we note that
\[
\begin{aligned}
E((a_i - a_{i'})^2) &= 2\sigma^2_A (1 - 1_{i=i'}), \\
E((b_j - b_{j'})^2) &= 2\sigma^2_B (1 - 1_{j=j'}), \quad \text{and} \\
E((e_{ij} - e_{i'j'})^2) &= 2\sigma^2_E (1 - 1_{i=i'} 1_{j=j'}).
\end{aligned}
\]
Now $Y_{ij} - Y_{ij'} = b_j - b_{j'} + e_{ij} - e_{ij'}$, and so
\[
\begin{aligned}
E(U_a) &= \frac{1}{2} \sum_{ijj'} N_{i\bullet}^{-1} Z_{ij} Z_{ij'} \bigl( 2\sigma^2_B (1 - 1_{j=j'}) + 2\sigma^2_E (1 - 1_{j=j'}) \bigr) \\
&= (\sigma^2_B + \sigma^2_E) \sum_{ijj'} N_{i\bullet}^{-1} Z_{ij} Z_{ij'} (1 - 1_{j=j'}) \\
&= (\sigma^2_B + \sigma^2_E) \sum_i (N_{i\bullet} - 1) \\
&= (\sigma^2_B + \sigma^2_E)(N - R).
\end{aligned}
\]
The same argument gives $E(U_b) = (\sigma^2_A + \sigma^2_E)(N - C)$, and expanding $Y_{ij} - Y_{i'j'}$ in the same way gives the stated expression for $E(U_e)$.
The matrix in (A.4) is
\[
M \equiv \begin{pmatrix}
0 & N - R & N - R \\
N - C & 0 & N - C \\
N^2 - \sum_i N_{i\bullet}^2 & N^2 - \sum_j N_{\bullet j}^2 & N^2 - N
\end{pmatrix}. \tag{A.5}
\]
Our moment based estimates are
\[
\begin{pmatrix} \hat\sigma^2_A \\ \hat\sigma^2_B \\ \hat\sigma^2_E \end{pmatrix}
= M^{-1} \begin{pmatrix} U_a \\ U_b \\ U_e \end{pmatrix}. \tag{A.6}
\]
They are only well defined when $M$ is nonsingular. The determinant of $M$ is
\[
\begin{aligned}
&(N - R)\Bigl[(N - C)\Bigl(N^2 - \sum_j N_{\bullet j}^2\Bigr)\Bigr]
- (N - R)\Bigl[(N - C)(N^2 - N) - (N - C)\Bigl(N^2 - \sum_i N_{i\bullet}^2\Bigr)\Bigr] \\
&\quad = (N - R)(N - C)\Bigl[N^2 - \sum_i N_{i\bullet}^2 - \sum_j N_{\bullet j}^2 + N\Bigr].
\end{aligned}
\]
The first factor is positive so long as $\max_i N_{i\bullet} > 1$, and the second factor requires $\max_j N_{\bullet j} > 1$. We already knew that we needed these conditions in order to have all three U-statistics depend on the $Y_{ij}$. It is still of interest to know when the third factor is positive. It is sufficient that no row or column has over half of the data.
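The determinant identity is easy to spot-check. The snippet below is our own verification code with made-up example counts: it compares a direct 3×3 determinant of $M$ with the factored form.

```python
def det3(m):
    """Determinant of a 3x3 matrix given as nested lists."""
    (a, b, c), (d, e, f), (g, h, i) = m
    return a * (e * i - f * h) - b * (d * i - f * g) + c * (d * h - e * g)

def check_detM(Ni, Nj):
    """Compare det(M) with (N-R)(N-C)(N^2 - sum Ni.^2 - sum N.j^2 + N)
    for given row counts Ni and column counts Nj."""
    N, R, C = sum(Ni), len(Ni), len(Nj)
    assert sum(Nj) == N
    sNi2 = sum(n * n for n in Ni)
    sNj2 = sum(n * n for n in Nj)
    M = [[0, N - R, N - R],
         [N - C, 0, N - C],
         [N * N - sNi2, N * N - sNj2, N * N - N]]
    closed = (N - R) * (N - C) * (N * N - sNi2 - sNj2 + N)
    return det3(M), closed
```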
A.2.3 Variances
From equation (A.6) we get
\[
\mathrm{Var}\begin{pmatrix} \hat\sigma^2_A \\ \hat\sigma^2_B \\ \hat\sigma^2_E \end{pmatrix}
= M^{-1} \,\mathrm{Var}\begin{pmatrix} U_a \\ U_b \\ U_e \end{pmatrix} (M^{-1})^{\mathsf{T}},
\]
where $M$ is given at (A.5). So we need the variances and covariances of the three U-statistics.
To find variances, we will work out $E(U^2)$ for our U-statistics. Those involve
\[
\begin{aligned}
&E\bigl((Y_{ij} - Y_{i'j'})^2 (Y_{rs} - Y_{r's'})^2\bigr) \\
&\quad = E\bigl((a_i - a_{i'} + b_j - b_{j'} + e_{ij} - e_{i'j'})^2 (a_r - a_{r'} + b_s - b_{s'} + e_{rs} - e_{r's'})^2\bigr) \\
&\quad = E\Bigl[\bigl((a_i - a_{i'})^2 + (b_j - b_{j'})^2 + (e_{ij} - e_{i'j'})^2 \\
&\qquad\qquad + 2(a_i - a_{i'})(b_j - b_{j'}) + 2(a_i - a_{i'})(e_{ij} - e_{i'j'}) + 2(b_j - b_{j'})(e_{ij} - e_{i'j'})\bigr) \\
&\qquad \times \bigl((a_r - a_{r'})^2 + (b_s - b_{s'})^2 + (e_{rs} - e_{r's'})^2 \\
&\qquad\qquad + 2(a_r - a_{r'})(b_s - b_{s'}) + 2(a_r - a_{r'})(e_{rs} - e_{r's'}) + 2(b_s - b_{s'})(e_{rs} - e_{r's'})\bigr)\Bigr].
\end{aligned}
\]
This expression involves 8 indices and it has 36 terms. Some of those terms simplify due to independence and some vanish due to zero means. To shorten some expressions we use
\[
\begin{aligned}
B_{A,ii',rr'} &\equiv E\bigl((a_i - a_{i'})(a_r - a_{r'})\bigr), \\
D_{A,ii'} &\equiv E\bigl((a_i - a_{i'})^2\bigr), \quad \text{and} \\
Q_{A,ii',rr'} &\equiv E\bigl((a_i - a_{i'})^2 (a_r - a_{r'})^2\bigr)
\end{aligned}
\]
with mnemonics bilinear, diagonal and quartic. There are similarly defined terms for component $B$. For the error term we have
\[
\begin{aligned}
B_{E,iji'j',rsr's'} &\equiv E\bigl((e_{ij} - e_{i'j'})(e_{rs} - e_{r's'})\bigr), \\
D_{E,ij,i'j'} &\equiv E\bigl((e_{ij} - e_{i'j'})^2\bigr), \quad \text{and} \\
Q_{E,iji'j',rsr's'} &\equiv E\bigl((e_{ij} - e_{i'j'})^2 (e_{rs} - e_{r's'})^2\bigr).
\end{aligned}
\]
The generic contribution $E((Y_{ij} - Y_{i'j'})^2 (Y_{rs} - Y_{r's'})^2)$ to the mean square of a U-statistic equals
\[
\begin{aligned}
&Q_{A,ii',rr'} + Q_{B,jj',ss'} + Q_{E,iji'j',rsr's'} + D_{A,ii'} D_{B,ss'} + D_{A,ii'} D_{E,rs,r's'} \\
&\quad + D_{B,jj'} D_{A,rr'} + D_{B,jj'} D_{E,rs,r's'} + D_{E,ij,i'j'} D_{A,rr'} + D_{E,ij,i'j'} D_{B,ss'} \\
&\quad + 4 B_{A,ii',rr'} B_{B,jj',ss'} + 4 B_{A,ii',rr'} B_{E,iji'j',rsr's'} + 4 B_{B,jj',ss'} B_{E,iji'j',rsr's'}.
\end{aligned} \tag{A.7}
\]
The other 24 terms are zero.
Next we collect expressions for the quantities appearing in the generic term of our squared U-statistics.

Lemma A.2.2. In the random effects model (2),
\[
\begin{aligned}
B_{A,ii',rr'} &= \sigma^2_A \bigl( 1_{i=r} - 1_{i=r'} - 1_{i'=r} + 1_{i'=r'} \bigr), \\
B_{B,jj',ss'} &= \sigma^2_B \bigl( 1_{j=s} - 1_{j=s'} - 1_{j'=s} + 1_{j'=s'} \bigr), \quad \text{and} \\
B_{E,iji'j',rsr's'} &= \sigma^2_E \bigl( 1_{ij=rs} - 1_{ij=r's'} - 1_{i'j'=rs} + 1_{i'j'=r's'} \bigr).
\end{aligned}
\]
Proof. The first one follows by expanding and using $E(a_i a_r) = \sigma^2_A 1_{i=r}$, et cetera. The other two use the same argument.
Lemma A.2.3. In the random effects model (2),
\[
\begin{aligned}
D_{A,ii'} &= 2\sigma^2_A (1 - 1_{i=i'}), \\
D_{B,jj'} &= 2\sigma^2_B (1 - 1_{j=j'}), \quad \text{and} \\
D_{E,ij,i'j'} &= 2\sigma^2_E (1 - 1_{ij=i'j'}).
\end{aligned}
\]
Proof. Take $i = r$ and $i' = r'$ in Lemma A.2.2.
Lemma A.2.4. In the random effects model (2),
\[
\begin{aligned}
Q_{A,ii',rr'} &= 1_{i \neq i'} 1_{r \neq r'} \sigma^4_A \bigl( 4 + (\kappa_A + 2)(1_{i \in \{r,r'\}} + 1_{i' \in \{r,r'\}}) + 4 \times 1_{\{i,i'\}=\{r,r'\}} \bigr), \\
Q_{B,jj',ss'} &= 1_{j \neq j'} 1_{s \neq s'} \sigma^4_B \bigl( 4 + (\kappa_B + 2)(1_{j \in \{s,s'\}} + 1_{j' \in \{s,s'\}}) + 4 \times 1_{\{j,j'\}=\{s,s'\}} \bigr), \quad \text{and} \\
Q_{E,iji'j',rsr's'} &= 1_{ij \neq i'j'} 1_{rs \neq r's'} \sigma^4_E \bigl( 4 + (\kappa_E + 2)(1_{ij \in \{rs,r's'\}} + 1_{i'j' \in \{rs,r's'\}}) + 4 \times 1_{\{ij,i'j'\}=\{rs,r's'\}} \bigr).
\end{aligned}
\]
Proof. We prove the first one; the others are similar. This quantity is 0 if $i = i'$ or $r = r'$. When $i \neq i'$ and $r \neq r'$, there are 3 cases to consider: $|\{i,i'\} \cap \{r,r'\}| = 0$, $|\{i,i'\} \cap \{r,r'\}| = 1$ and $|\{i,i'\} \cap \{r,r'\}| = 2$. The kurtosis is defined via $\kappa_A = E(a^4)/\sigma^4_A - 3$, so $E(a^4) = (\kappa_A + 3)\sigma^4_A$.

For no overlap, we find
\[
E((a_1 - a_2)^2 (a_3 - a_4)^2) = E((a_1 - a_2)^2)^2 = 4\sigma^4_A.
\]
For a single overlap,
\[
E((a_1 - a_2)^2 (a_1 - a_3)^2) = E\bigl((a_1^2 - 2a_1 a_2 + a_2^2)(a_1^2 - 2a_1 a_3 + a_3^2)\bigr) = E(a_1^4) + 3\sigma^4_A = \sigma^4_A (\kappa_A + 6).
\]
For a double overlap,
\[
E((a_1 - a_2)^4) = E(a_1^4 - 4a_1^3 a_2 + 6a_1^2 a_2^2 - 4a_1 a_2^3 + a_2^4) = 2E(a_1^4) + 6\sigma^4_A = \sigma^4_A (2\kappa_A + 12).
\]
As a result,
\[
E((a_i - a_{i'})^2 (a_r - a_{r'})^2) =
\begin{cases}
4\sigma^4_A, & |\{i,i'\} \cap \{r,r'\}| = 0, \\
\sigma^4_A (\kappa_A + 6), & |\{i,i'\} \cap \{r,r'\}| = 1, \\
\sigma^4_A (2\kappa_A + 12), & |\{i,i'\} \cap \{r,r'\}| = 2,
\end{cases}
\]
and so $E((a_i - a_{i'})^2 (a_r - a_{r'})^2)$ equals
\[
1_{i \neq i'} 1_{r \neq r'} \sigma^4_A \bigl( 4 + (\kappa_A + 2)(1_{i \in \{r,r'\}} + 1_{i' \in \{r,r'\}}) + 4 \times 1_{\{i,i'\}=\{r,r'\}} \bigr).
\]
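The overlap cases can be verified exactly for any mean-zero distribution with finite support. The snippet below is our own check (the two-point distribution in the test is a hypothetical example): it enumerates independent draws and compares with the single- and double-overlap formulas.

```python
from itertools import product

def exact_check(support):
    """Exactly verify the overlap formulas in Lemma A.2.4 for a mean-zero
    discrete distribution given as {value: probability}."""
    mean = sum(v * p for v, p in support.items())
    assert abs(mean) < 1e-12
    s2 = sum(v * v * p for v, p in support.items())
    m4 = sum(v ** 4 * p for v, p in support.items())
    kappa = m4 / s2 ** 2 - 3
    # single overlap: E((a1 - a2)^2 (a1 - a3)^2) over independent draws
    single = sum(p1 * p2 * p3 * (a1 - a2) ** 2 * (a1 - a3) ** 2
                 for (a1, p1), (a2, p2), (a3, p3)
                 in product(support.items(), repeat=3))
    # double overlap: E((a1 - a2)^4)
    double = sum(p1 * p2 * (a1 - a2) ** 4
                 for (a1, p1), (a2, p2) in product(support.items(), repeat=2))
    return single, s2 ** 2 * (kappa + 6), double, s2 ** 2 * (2 * kappa + 12)
```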
A.2.4 Variance of $U_a$

We will work out $E(U_a^2)$ and then subtract $E(U_a)^2$. First we write
\[
U_a^2 = \frac{1}{4} \sum_{ijj'} \sum_{rss'} N_{i\bullet}^{-1} N_{r\bullet}^{-1} Z_{ij} Z_{ij'} Z_{rs} Z_{rs'} (Y_{ij} - Y_{ij'})^2 (Y_{rs} - Y_{rs'})^2.
\]
For $E(U_a^2)$ we use the special case $i = i'$ and $r = r'$ of (A.7), and
\[
\begin{aligned}
E(U_a^2) &= \frac{1}{4} \sum_{ijj'} \sum_{rss'} N_{i\bullet}^{-1} N_{r\bullet}^{-1} Z_{ij} Z_{ij'} Z_{rs} Z_{rs'}
\bigl[ Q_{A,ii,rr} + Q_{B,jj',ss'} + Q_{E,ijij',rsrs'} + D_{A,ii} D_{B,ss'} \\
&\qquad + D_{A,ii} D_{E,rs,rs'} + D_{B,jj'} D_{A,rr} + D_{B,jj'} D_{E,rs,rs'} + D_{E,ij,ij'} D_{A,rr} + D_{E,ij,ij'} D_{B,ss'} \\
&\qquad + 4 B_{A,ii,rr} B_{B,jj',ss'} + 4 B_{A,ii,rr} B_{E,ijij',rsrs'} + 4 B_{B,jj',ss'} B_{E,ijij',rsrs'} \bigr] \\
&= \frac{1}{4} \sum_{ijj'} \sum_{rss'} N_{i\bullet}^{-1} N_{r\bullet}^{-1} Z_{ij} Z_{ij'} Z_{rs} Z_{rs'}
\bigl[ \underbrace{Q_{B,jj',ss'}}_{\text{Term 1}} + \underbrace{Q_{E,ijij',rsrs'}}_{\text{Term 2}}
+ \underbrace{D_{B,jj'} D_{E,rs,rs'}}_{\text{Term 3}} \\
&\qquad + \underbrace{D_{E,ij,ij'} D_{B,ss'}}_{\text{Term 4}}
+ \underbrace{4 B_{B,jj',ss'} B_{E,ijij',rsrs'}}_{\text{Term 5}} \bigr]
\end{aligned}
\]
after eliminating terms that are always 0. We handle these five sums in the next paragraphs.
Term 1
\[
\begin{aligned}
&\frac{1}{4} \sum_{ijj'} \sum_{rss'} N_{i\bullet}^{-1} N_{r\bullet}^{-1} Z_{ij} Z_{ij'} Z_{rs} Z_{rs'} Q_{B,jj',ss'} \\
&\quad = \frac{\sigma^4_B}{4} \sum_{ijj'} \sum_{rss'} N_{i\bullet}^{-1} N_{r\bullet}^{-1} Z_{ij} Z_{ij'} Z_{rs} Z_{rs'} (1 - 1_{j=j'})(1 - 1_{s=s'}) \\
&\qquad \times \bigl( 4 + (\kappa_B + 2)(1_{j \in \{s,s'\}} + 1_{j' \in \{s,s'\}}) + 4 \times 1_{\{j,j'\}=\{s,s'\}} \bigr) \\
&\quad = \sigma^4_B \Bigl( (N - R)^2 + (\kappa_B + 2) \sum_{ir} (ZZ^{\mathsf{T}})_{ir} (1 - N_{i\bullet}^{-1})(1 - N_{r\bullet}^{-1})
+ 2 \sum_{ir} N_{i\bullet}^{-1} N_{r\bullet}^{-1} (ZZ^{\mathsf{T}})_{ir} \bigl( (ZZ^{\mathsf{T}})_{ir} - 1 \bigr) \Bigr).
\end{aligned}
\]
Term 2
\[
\begin{aligned}
&\frac{1}{4} \sum_{ijj'} \sum_{rss'} N_{i\bullet}^{-1} N_{r\bullet}^{-1} Z_{ij} Z_{ij'} Z_{rs} Z_{rs'} Q_{E,ijij',rsrs'} \\
&\quad = \frac{\sigma^4_E}{4} \sum_{ijj'} \sum_{rss'} N_{i\bullet}^{-1} N_{r\bullet}^{-1} Z_{ij} Z_{ij'} Z_{rs} Z_{rs'} 1_{j \neq j'} 1_{s \neq s'} \\
&\qquad \times \bigl( 4 + (\kappa_E + 2) 1_{i=r} (1_{j \in \{s,s'\}} + 1_{j' \in \{s,s'\}}) + 4 \cdot 1_{i=r} 1_{\{j,j'\}=\{s,s'\}} \bigr) \\
&\quad = \sigma^4_E \Bigl( (N - R)^2 + (\kappa_E + 2) \sum_i N_{i\bullet} (1 - N_{i\bullet}^{-1})^2 + 2 \sum_i (1 - N_{i\bullet}^{-1}) \Bigr).
\end{aligned}
\]
Terms 3 and 4 These terms are equal by symmetry. We evaluate term 3:
\[
\frac{1}{4} \sum_{ijj'} \sum_{rss'} N_{i\bullet}^{-1} N_{r\bullet}^{-1} Z_{ij} Z_{ij'} Z_{rs} Z_{rs'} D_{B,jj'} D_{E,rs,rs'}
= \frac{1}{4} \Bigl( \sum_{ijj'} N_{i\bullet}^{-1} Z_{ij} Z_{ij'} D_{B,jj'} \Bigr) \Bigl( \sum_{rss'} N_{r\bullet}^{-1} Z_{rs} Z_{rs'} D_{E,rs,rs'} \Bigr).
\]
Now
\[
\sum_{ijj'} N_{i\bullet}^{-1} Z_{ij} Z_{ij'} D_{B,jj'} = 2\sigma^2_B \sum_{ijj'} N_{i\bullet}^{-1} Z_{ij} Z_{ij'} (1 - 1_{j=j'}) = 2\sigma^2_B (N - R)
\]
and
\[
\sum_{rss'} N_{r\bullet}^{-1} Z_{rs} Z_{rs'} D_{E,rs,rs'} = 2\sigma^2_E \sum_{rss'} N_{r\bullet}^{-1} Z_{rs} Z_{rs'} (1 - 1_{s=s'}) = 2\sigma^2_E (N - R)
\]
by the same steps. Therefore term 3 of $E(U_a^2)$ equals $\sigma^2_B \sigma^2_E (N - R)^2$ and the sum of terms 3 and 4 is $2\sigma^2_B \sigma^2_E (N - R)^2$.
Term 5 The term equals
\[
\begin{aligned}
&\sum_{ijj'} \sum_{rss'} N_{i\bullet}^{-1} N_{r\bullet}^{-1} Z_{ij} Z_{ij'} Z_{rs} Z_{rs'} B_{B,jj',ss'} B_{E,ijij',rsrs'} \\
&\quad = \sum_{ijj'} \sum_{ss'} N_{i\bullet}^{-1} Z_{ij} Z_{ij'} B_{B,jj',ss'} \sum_r N_{r\bullet}^{-1} Z_{rs} Z_{rs'} B_{E,ijij',rsrs'}.
\end{aligned}
\]
Now
\[
\sum_r N_{r\bullet}^{-1} Z_{rs} Z_{rs'} B_{E,ijij',rsrs'} = \sigma^2_E N_{i\bullet}^{-1} Z_{is} Z_{is'} \bigl( 1_{j=s} - 1_{j=s'} - 1_{j'=s} + 1_{j'=s'} \bigr).
\]
Term 5 is then
\[
\begin{aligned}
&\sigma^2_E \sum_{ijj'} \sum_{ss'} N_{i\bullet}^{-2} Z_{ij} Z_{ij'} Z_{is} Z_{is'} \bigl( 1_{j=s} - 1_{j=s'} - 1_{j'=s} + 1_{j'=s'} \bigr) B_{B,jj',ss'} \\
&\quad = \sigma^2_E \sigma^2_B \sum_{ijj'} \sum_{ss'} N_{i\bullet}^{-2} Z_{ij} Z_{ij'} Z_{is} Z_{is'} 1_{j=s} \bigl( 1_{j=s} - 1_{j=s'} - 1_{j'=s} + 1_{j'=s'} \bigr) \\
&\qquad - \sigma^2_E \sigma^2_B \sum_{ijj'} \sum_{ss'} N_{i\bullet}^{-2} Z_{ij} Z_{ij'} Z_{is} Z_{is'} 1_{j=s'} \bigl( 1_{j=s} - 1_{j=s'} - 1_{j'=s} + 1_{j'=s'} \bigr) \\
&\qquad - \sigma^2_E \sigma^2_B \sum_{ijj'} \sum_{ss'} N_{i\bullet}^{-2} Z_{ij} Z_{ij'} Z_{is} Z_{is'} 1_{j'=s} \bigl( 1_{j=s} - 1_{j=s'} - 1_{j'=s} + 1_{j'=s'} \bigr) \\
&\qquad + \sigma^2_E \sigma^2_B \sum_{ijj'} \sum_{ss'} N_{i\bullet}^{-2} Z_{ij} Z_{ij'} Z_{is} Z_{is'} 1_{j'=s'} \bigl( 1_{j=s} - 1_{j=s'} - 1_{j'=s} + 1_{j'=s'} \bigr) \\
&\quad = 4\sigma^2_B \sigma^2_E (N - R).
\end{aligned}
\]
Combination Combining the results of the previous sections, we have
\[
\begin{aligned}
E(U_a^2) &= \sigma^4_B \Bigl( (N - R)^2 + (\kappa_B + 2) \sum_{ir} (ZZ^{\mathsf{T}})_{ir} (1 - N_{i\bullet}^{-1})(1 - N_{r\bullet}^{-1})
+ 2 \sum_{ir} N_{i\bullet}^{-1} N_{r\bullet}^{-1} (ZZ^{\mathsf{T}})_{ir} \bigl( (ZZ^{\mathsf{T}})_{ir} - 1 \bigr) \Bigr) \\
&\quad + 2\sigma^2_B \sigma^2_E (N - R)^2 + 4\sigma^2_B \sigma^2_E (N - R) \\
&\quad + \sigma^4_E \Bigl( (N - R)^2 + (\kappa_E + 2) \sum_i N_{i\bullet} (1 - N_{i\bullet}^{-1})^2 + 2 \sum_i (1 - N_{i\bullet}^{-1}) \Bigr).
\end{aligned}
\]
Subtracting $E(U_a)^2 = (N - R)^2 (\sigma^2_B + \sigma^2_E)^2$ we find that $\mathrm{Var}(U_a)$ equals
\[
\begin{aligned}
&4\sigma^2_B \sigma^2_E (N - R) + \sigma^4_E \Bigl( (\kappa_E + 2) \sum_i N_{i\bullet} (1 - N_{i\bullet}^{-1})^2 + 2 \sum_i (1 - N_{i\bullet}^{-1}) \Bigr) \\
&\quad + \sigma^4_B \Bigl( (\kappa_B + 2) \sum_{ir} (ZZ^{\mathsf{T}})_{ir} (1 - N_{i\bullet}^{-1})(1 - N_{r\bullet}^{-1})
+ 2 \sum_{ir} N_{i\bullet}^{-1} N_{r\bullet}^{-1} (ZZ^{\mathsf{T}})_{ir} \bigl( (ZZ^{\mathsf{T}})_{ir} - 1 \bigr) \Bigr).
\end{aligned} \tag{A.8}
\]
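As a sanity check on (A.8), the function below (our illustration; the dense matrix products stand in for the sparse aggregations one would use at scale) plugs example variance components and kurtoses into the formula; the test checks a small balanced design that can be evaluated by hand.

```python
import numpy as np

def var_Ua(Zpairs, R, C, sB2, sE2, kB, kE):
    """Plug-in evaluation of Var(Ua) in (A.8).  Zpairs is the set of observed
    (i, j) cells, with rows labeled 0..R-1 and columns 0..C-1."""
    Z = np.zeros((R, C))
    for i, j in Zpairs:
        Z[i, j] = 1.0
    Ni = Z.sum(axis=1)                       # row counts Ni.
    N = Z.sum()
    ZZt = Z @ Z.T
    w = 1.0 - 1.0 / Ni                       # (1 - Ni.^{-1}) per row
    t1 = w @ ZZt @ w                         # sum_{ir} (ZZ^T)_{ir} w_i w_r
    inv = 1.0 / Ni
    t2 = inv @ (ZZt * (ZZt - 1.0)) @ inv     # sum_{ir} (ZZ^T)_{ir}((ZZ^T)_{ir}-1)/(Ni. Nr.)
    eterm = (kE + 2) * np.sum(Ni * w ** 2) + 2 * np.sum(w)
    return (4 * sB2 * sE2 * (N - R)
            + sE2 ** 2 * eterm
            + sB2 ** 2 * ((kB + 2) * t1 + 2 * t2))
```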
Variance of $U_b$ This case is exactly symmetric to the one above with $\mathrm{Var}(U_a)$ given by (A.8). Therefore $\mathrm{Var}(U_b)$ equals
\[
\begin{aligned}
&4\sigma^2_A \sigma^2_E (N - C) + \sigma^4_E \Bigl( (\kappa_E + 2) \sum_j N_{\bullet j} (1 - N_{\bullet j}^{-1})^2 + 2 \sum_j (1 - N_{\bullet j}^{-1}) \Bigr) \\
&\quad + \sigma^4_A \Bigl( (\kappa_A + 2) \sum_{js} (Z^{\mathsf{T}}Z)_{js} (1 - N_{\bullet j}^{-1})(1 - N_{\bullet s}^{-1})
+ 2 \sum_{js} N_{\bullet j}^{-1} N_{\bullet s}^{-1} (Z^{\mathsf{T}}Z)_{js} \bigl( (Z^{\mathsf{T}}Z)_{js} - 1 \bigr) \Bigr).
\end{aligned} \tag{A.9}
\]
A.2.5 Variance of $U_e$

As before, we find $E(U_e^2)$ and then subtract $E(U_e)^2$. Now
\[
U_e^2 = \frac{1}{4} \sum_{ii'jj'} \sum_{rr'ss'} Z_{ij} Z_{i'j'} Z_{rs} Z_{r's'} (Y_{ij} - Y_{i'j'})^2 (Y_{rs} - Y_{r's'})^2.
\]
From (A.7),
\[
\begin{aligned}
E(U_e^2) &= \frac{1}{4} \sum_{ii'jj'} \sum_{rr'ss'} Z_{ij} Z_{i'j'} Z_{rs} Z_{r's'}
\bigl[ \underbrace{Q_{A,ii',rr'}}_{\text{Term 1}} + \underbrace{Q_{B,jj',ss'}}_{\text{Term 2}}
+ \underbrace{Q_{E,iji'j',rsr's'}}_{\text{Term 3}} + \underbrace{D_{A,ii'} D_{B,ss'}}_{\text{Term 4}} \\
&\quad + \underbrace{D_{A,ii'} D_{E,rs,r's'}}_{\text{Term 5}} + \underbrace{D_{B,jj'} D_{A,rr'}}_{\text{Term 6}}
+ \underbrace{D_{B,jj'} D_{E,rs,r's'}}_{\text{Term 7}} + \underbrace{D_{E,ij,i'j'} D_{A,rr'}}_{\text{Term 8}}
+ \underbrace{D_{E,ij,i'j'} D_{B,ss'}}_{\text{Term 9}} \\
&\quad + \underbrace{4 B_{A,ii',rr'} B_{B,jj',ss'}}_{\text{Term 10}}
+ \underbrace{4 B_{A,ii',rr'} B_{E,iji'j',rsr's'}}_{\text{Term 11}}
+ \underbrace{4 B_{B,jj',ss'} B_{E,iji'j',rsr's'}}_{\text{Term 12}} \bigr].
\end{aligned}
\]
We handle the twelve sums in the next paragraphs.
Terms 1 and 2 Term 1 is
\[
\begin{aligned}
&\frac{1}{4} \sum_{ii'jj'} \sum_{rr'ss'} Z_{ij} Z_{i'j'} Z_{rs} Z_{r's'} Q_{A,ii',rr'} \\
&\quad = \frac{\sigma^4_A}{4} \sum_{ii'jj'} \sum_{rr'ss'} Z_{ij} Z_{i'j'} Z_{rs} Z_{r's'} (1 - 1_{i=i'})(1 - 1_{r=r'})
\bigl( 4 + (\kappa_A + 2)(1_{i \in \{r,r'\}} + 1_{i' \in \{r,r'\}}) + 4 \times 1_{\{i,i'\}=\{r,r'\}} \bigr) \\
&\quad = \sigma^4_A \Bigl( N^4 - 2N^2 \sum_i N_{i\bullet}^2 + 3 \Bigl( \sum_i N_{i\bullet}^2 \Bigr)^2 - 2 \sum_i N_{i\bullet}^4 \Bigr)
+ \sigma^4_A (\kappa_A + 2) \Bigl( N^2 \sum_i N_{i\bullet}^2 - 2N \sum_i N_{i\bullet}^3 + \sum_i N_{i\bullet}^4 \Bigr).
\end{aligned}
\]
We can use the symmetry of the roles of $A$ and $B$ and their indices. Therefore, term 2 is equal to
\[
\sigma^4_B \Bigl( N^4 - 2N^2 \sum_j N_{\bullet j}^2 + 3 \Bigl( \sum_j N_{\bullet j}^2 \Bigr)^2 - 2 \sum_j N_{\bullet j}^4 \Bigr)
+ \sigma^4_B (\kappa_B + 2) \Bigl( N^2 \sum_j N_{\bullet j}^2 - 2N \sum_j N_{\bullet j}^3 + \sum_j N_{\bullet j}^4 \Bigr).
\]
Term 3
\[
\begin{aligned}
&\frac{1}{4} \sum_{ii'jj'} \sum_{rr'ss'} Z_{ij} Z_{i'j'} Z_{rs} Z_{r's'} Q_{E,iji'j',rsr's'} \\
&\quad = \frac{\sigma^4_E}{4} \sum_{ii'jj'} \sum_{rr'ss'} Z_{ij} Z_{i'j'} Z_{rs} Z_{r's'} (1 - 1_{ij=i'j'})(1 - 1_{rs=r's'})
\bigl( 4 + (\kappa_E + 2)(1_{ij \in \{rs,r's'\}} + 1_{i'j' \in \{rs,r's'\}}) + 4 \times 1_{\{ij,i'j'\}=\{rs,r's'\}} \bigr) \\
&\quad = \sigma^4_E N(N-1)\bigl[ N(N-1) + 2 \bigr] + \sigma^4_E (\kappa_E + 2) N (N-1)^2.
\end{aligned}
\]
Term 4
\[
\frac{1}{4} \sum_{ii'jj'} \sum_{rr'ss'} Z_{ij} Z_{i'j'} Z_{rs} Z_{r's'} D_{A,ii'} D_{B,ss'}
= \frac{1}{4} \Bigl( \sum_{ii'jj'} Z_{ij} Z_{i'j'} D_{A,ii'} \Bigr) \Bigl( \sum_{rr'ss'} Z_{rs} Z_{r's'} D_{B,ss'} \Bigr).
\]
The first factor is
\[
\sum_{ii'jj'} Z_{ij} Z_{i'j'} D_{A,ii'} = 2\sigma^2_A \sum_{ii'jj'} Z_{ij} Z_{i'j'} (1 - 1_{i=i'}) = 2\sigma^2_A \Bigl( N^2 - \sum_i N_{i\bullet}^2 \Bigr).
\]
By the same argument, the second factor is
\[
\sum_{rr'ss'} Z_{rs} Z_{r's'} D_{B,ss'} = 2\sigma^2_B \Bigl( N^2 - \sum_s N_{\bullet s}^2 \Bigr),
\]
and so term 4 is
\[
\sigma^2_A \sigma^2_B \Bigl( N^2 - \sum_i N_{i\bullet}^2 \Bigr) \Bigl( N^2 - \sum_j N_{\bullet j}^2 \Bigr).
\]
Term 5.
\[
\frac14\sum_{ii'jj'}\sum_{rr'ss'}Z_{ij}Z_{i'j'}Z_{rs}Z_{r's'}D_{A,ii'}D_{E,rs,r's'}
=\frac14\Bigl(\sum_{ii'jj'}Z_{ij}Z_{i'j'}D_{A,ii'}\Bigr)\Bigl(\sum_{rr'ss'}Z_{rs}Z_{r's'}D_{E,rs,r's'}\Bigr).
\]
The first factor is computed in the previous section. The second factor is
\[
\sum_{rr'ss'}Z_{rs}Z_{r's'}D_{E,rs,r's'}=2\sigma^2_E\sum_{rr'ss'}Z_{rs}Z_{r's'}(1-1_{rs=r's'})=2\sigma^2_E N(N-1).
\]
Thus, term 5 is
\[
\sigma^2_A\sigma^2_E N(N-1)\Bigl(N^2-\sum_i N_{i\bullet}^2\Bigr).
\]
Terms 6–9. By symmetry of indices, term 6 is the same as term 4:
\[
\sigma^2_A\sigma^2_B\Bigl(N^2-\sum_i N_{i\bullet}^2\Bigr)\Bigl(N^2-\sum_j N_{\bullet j}^2\Bigr).
\]
Term 7 is like term 5 with factors $A$ and $B$ interchanged. Thus, term 7 is equal to
\[
\sigma^2_B\sigma^2_E N(N-1)\Bigl(N^2-\sum_j N_{\bullet j}^2\Bigr).
\]
By symmetry of indices, term 8 is the same as term 5:
\[
\sigma^2_A\sigma^2_E N(N-1)\Bigl(N^2-\sum_i N_{i\bullet}^2\Bigr).
\]
By symmetry of indices, term 9 is the same as term 7:
\[
\sigma^2_B\sigma^2_E N(N-1)\Bigl(N^2-\sum_j N_{\bullet j}^2\Bigr).
\]
Term 10.
\[
\sum_{ii'jj'}\sum_{rr'ss'}Z_{ij}Z_{i'j'}Z_{rs}Z_{r's'}B_{A,ii',rr'}B_{B,jj',ss'}
=\sigma^2_A\sigma^2_B\sum_{ii'jj'}\sum_{rr'ss'}Z_{ij}Z_{i'j'}Z_{rs}Z_{r's'}\bigl(1_{i=r}-1_{i=r'}-1_{i'=r}+1_{i'=r'}\bigr)\bigl(1_{j=s}-1_{j=s'}-1_{j'=s}+1_{j'=s'}\bigr)
\]
\[
=4\sigma^2_A\sigma^2_B\Bigl(N^3-2N\sum_{ij}Z_{ij}N_{i\bullet}N_{\bullet j}+\sum_{ij}N_{i\bullet}^2N_{\bullet j}^2\Bigr).
\]
Terms 11 and 12.
\[
\sum_{ii'jj'}\sum_{rr'ss'}Z_{ij}Z_{i'j'}Z_{rs}Z_{r's'}B_{A,ii',rr'}B_{E,iji'j',rsr's'}
=\sigma^2_A\sigma^2_E\sum_{ii'jj'}\sum_{rr'ss'}Z_{ij}Z_{i'j'}Z_{rs}Z_{r's'}\bigl(1_{i=r}-1_{i=r'}-1_{i'=r}+1_{i'=r'}\bigr)\bigl(1_{ij=rs}-1_{ij=r's'}-1_{i'j'=rs}+1_{i'j'=r's'}\bigr)
=\sigma^2_A\sigma^2_E\Bigl(4N^3-4N\sum_i N_{i\bullet}^2\Bigr).
\]
For term 12, we can use the symmetry with term 11, interchanging rows and columns. Thus, term 12 is
\[
\sigma^2_B\sigma^2_E\Bigl(4N^3-4N\sum_j N_{\bullet j}^2\Bigr).
\]
Combination. Summing up the results of the previous twelve sections, $E(U_e^2)$ equals
\[
\sigma^4_A N^4-2\sigma^4_A N^2\sum_i N_{i\bullet}^2+3\sigma^4_A\Bigl(\sum_i N_{i\bullet}^2\Bigr)^2-2\sigma^4_A\sum_i N_{i\bullet}^4
+\sigma^4_A(\kappa_A+2)\Bigl(N^2\sum_i N_{i\bullet}^2-2N\sum_i N_{i\bullet}^3+\sum_i N_{i\bullet}^4\Bigr)
\]
\[
+\sigma^4_B N^4-2\sigma^4_B N^2\sum_j N_{\bullet j}^2+3\sigma^4_B\Bigl(\sum_j N_{\bullet j}^2\Bigr)^2-2\sigma^4_B\sum_j N_{\bullet j}^4
+\sigma^4_B(\kappa_B+2)\Bigl(N^2\sum_j N_{\bullet j}^2-2N\sum_j N_{\bullet j}^3+\sum_j N_{\bullet j}^4\Bigr)
\]
\[
+\sigma^4_E\bigl(N^4-2N^3+3N^2-2N\bigr)+\sigma^4_E(\kappa_E+2)N(N-1)^2
+2\sigma^2_A\sigma^2_B\Bigl(N^2-\sum_i N_{i\bullet}^2\Bigr)\Bigl(N^2-\sum_j N_{\bullet j}^2\Bigr)
\]
\[
+2\sigma^2_A\sigma^2_E N(N-1)\Bigl(N^2-\sum_i N_{i\bullet}^2\Bigr)
+2\sigma^2_B\sigma^2_E N(N-1)\Bigl(N^2-\sum_j N_{\bullet j}^2\Bigr)
\]
\[
+4\sigma^2_A\sigma^2_B\Bigl(N^3-2N\sum_{ij}Z_{ij}N_{i\bullet}N_{\bullet j}+\sum_{ij}N_{i\bullet}^2N_{\bullet j}^2\Bigr)
+4\sigma^2_A\sigma^2_E\Bigl(N^3-N\sum_i N_{i\bullet}^2\Bigr)
+4\sigma^2_B\sigma^2_E\Bigl(N^3-N\sum_j N_{\bullet j}^2\Bigr).
\]
Then, applying some simplifications, we have
\[
\mathrm{Var}(U_e)=E(U_e^2)-E(U_e)^2
=2\sigma^4_A\Bigl(\Bigl(\sum_i N_{i\bullet}^2\Bigr)^2-\sum_i N_{i\bullet}^4\Bigr)
+2\sigma^4_B\Bigl(\Bigl(\sum_j N_{\bullet j}^2\Bigr)^2-\sum_j N_{\bullet j}^4\Bigr)
+2\sigma^4_E N(N-1)
\]
\[
+(\kappa_A+2)\sigma^4_A\sum_i N_{i\bullet}^2(N-N_{i\bullet})^2
+(\kappa_B+2)\sigma^4_B\sum_j N_{\bullet j}^2(N-N_{\bullet j})^2
+(\kappa_E+2)\sigma^4_E N(N-1)^2
+4\sigma^2_A\sigma^2_B\sum_{ij}(N_{i\bullet}N_{\bullet j}-NZ_{ij})^2
\]
\[
+4\sigma^2_A\sigma^2_E N\Bigl(N^2-\sum_i N_{i\bullet}^2\Bigr)
+4\sigma^2_B\sigma^2_E N\Bigl(N^2-\sum_j N_{\bullet j}^2\Bigr).
\tag{A.10}
\]
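Formula (A.10) depends on the data only through the observation pattern $Z$ and the model parameters, so it can be evaluated directly in linear algebra. Below is a minimal numpy sketch; the function name and argument convention are ours, not the dissertation's.

```python
import numpy as np

def var_Ue(Z, s2A, s2B, s2E, kA, kB, kE):
    """Evaluate Var(U_e) via (A.10) from the 0/1 observation matrix Z.

    s2A, s2B, s2E are the variance components; kA, kB, kE the kurtoses.
    """
    Z = np.asarray(Z, dtype=float)
    N = Z.sum()
    Ni = Z.sum(axis=1)           # row counts N_{i.}
    Nj = Z.sum(axis=0)           # column counts N_{.j}
    t = 2*s2A**2*((Ni**2).sum()**2 - (Ni**4).sum())
    t += 2*s2B**2*((Nj**2).sum()**2 - (Nj**4).sum())
    t += 2*s2E**2*N*(N - 1)
    t += (kA + 2)*s2A**2*(Ni**2*(N - Ni)**2).sum()
    t += (kB + 2)*s2B**2*(Nj**2*(N - Nj)**2).sum()
    t += (kE + 2)*s2E**2*N*(N - 1)**2
    t += 4*s2A*s2B*((np.outer(Ni, Nj) - N*Z)**2).sum()
    t += 4*s2A*s2E*N*(N**2 - (Ni**2).sum())
    t += 4*s2B*s2E*N*(N**2 - (Nj**2).sum())
    return t
```

A quick sanity check of (A.10) is its row/column symmetry: transposing $Z$ while swapping the $A$ and $B$ parameters leaves the value unchanged.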
A.2.6 Covariance of $U_a$ and $U_b$

We use the formula $\mathrm{Cov}(U_a,U_b)=E(U_aU_b)-E(U_a)E(U_b)$, so we just need to compute $E(U_aU_b)$. Using our preferred normalization,
\[
U_aU_b=\frac14\Bigl(\sum_{ijj'}N_{i\bullet}^{-1}Z_{ij}Z_{ij'}(Y_{ij}-Y_{ij'})^2\Bigr)\Bigl(\sum_{rr's}N_{\bullet s}^{-1}Z_{rs}Z_{r's}(Y_{rs}-Y_{r's})^2\Bigr)
=\frac14\sum_{ijj'}\sum_{rr's}N_{i\bullet}^{-1}N_{\bullet s}^{-1}Z_{ij}Z_{ij'}Z_{rs}Z_{r's}(Y_{ij}-Y_{ij'})^2(Y_{rs}-Y_{r's})^2.
\]
Then,
\[
E(U_aU_b)=\frac14\sum_{ijj'}\sum_{rr's}N_{i\bullet}^{-1}N_{\bullet s}^{-1}Z_{ij}Z_{ij'}Z_{rs}Z_{r's}\Bigl(\underbrace{Q_{E,ijij',rsr's}}_{\text{Term 1}}+\underbrace{D_{B,jj'}D_{A,rr'}}_{\text{Term 2}}+\underbrace{D_{B,jj'}D_{E,rs,r's}}_{\text{Term 3}}+\underbrace{D_{E,ij,ij'}D_{A,rr'}}_{\text{Term 4}}\Bigr).
\]
We consider each term separately.
Term 1.
\[
\frac14\sum_{ijj'}\sum_{rr's}N_{i\bullet}^{-1}N_{\bullet s}^{-1}Z_{ij}Z_{ij'}Z_{rs}Z_{r's}Q_{E,ijij',rsr's}
=\frac{\sigma^4_E}4\sum_{ijj'}\sum_{rr's}N_{i\bullet}^{-1}N_{\bullet s}^{-1}Z_{ij}Z_{ij'}Z_{rs}Z_{r's}1_{j\ne j'}1_{r\ne r'}\bigl(4+(\kappa_E+2)(1_{ij\in\{rs,r's\}}+1_{ij'\in\{rs,r's\}})+4\times1_{\{ij,ij'\}=\{rs,r's\}}\bigr)
\]
\[
=\sigma^4_E(N-R)(N-C)+\sigma^4_E(\kappa_E+2)\sum_{ij}Z_{ij}(1-N_{i\bullet}^{-1})(1-N_{\bullet j}^{-1}).
\]
Term 2.
\[
\frac14\sum_{ijj'}\sum_{rr's}N_{i\bullet}^{-1}N_{\bullet s}^{-1}Z_{ij}Z_{ij'}Z_{rs}Z_{r's}D_{B,jj'}D_{A,rr'}
=\frac14\sum_{ijj'}\sum_{rr's}N_{i\bullet}^{-1}N_{\bullet s}^{-1}Z_{ij}Z_{ij'}Z_{rs}Z_{r's}\,2\sigma^2_B(1-1_{j=j'})\,2\sigma^2_A(1-1_{r=r'})
=\sigma^2_A\sigma^2_B(N-R)(N-C).
\]
Term 3.
\[
\frac14\sum_{ijj'}\sum_{rr's}N_{i\bullet}^{-1}N_{\bullet s}^{-1}Z_{ij}Z_{ij'}Z_{rs}Z_{r's}D_{B,jj'}D_{E,rs,r's}
=\frac14\sum_{ijj'}\sum_{rr's}N_{i\bullet}^{-1}N_{\bullet s}^{-1}Z_{ij}Z_{ij'}Z_{rs}Z_{r's}\,2\sigma^2_B(1-1_{j=j'})\,2\sigma^2_E(1-1_{r=r'})
=\sigma^2_B\sigma^2_E(N-R)(N-C).
\]
Term 4.
\[
\frac14\sum_{ijj'}\sum_{rr's}N_{i\bullet}^{-1}N_{\bullet s}^{-1}Z_{ij}Z_{ij'}Z_{rs}Z_{r's}D_{E,ij,ij'}D_{A,rr'}
=\frac14\sum_{ijj'}\sum_{rr's}N_{i\bullet}^{-1}N_{\bullet s}^{-1}Z_{ij}Z_{ij'}Z_{rs}Z_{r's}\,2\sigma^2_E(1-1_{j=j'})\,2\sigma^2_A(1-1_{r=r'})
=\sigma^2_A\sigma^2_E(N-R)(N-C).
\]
Combination. Adding up the four terms, we have
\[
E(U_aU_b)=\sigma^4_E(N-R)(N-C)+\sigma^4_E(\kappa_E+2)\sum_{ij}Z_{ij}(1-N_{i\bullet}^{-1})(1-N_{\bullet j}^{-1})
+\bigl(\sigma^2_A\sigma^2_B+\sigma^2_B\sigma^2_E+\sigma^2_A\sigma^2_E\bigr)(N-R)(N-C),
\]
and so
\[
\mathrm{Cov}(U_a,U_b)=E(U_aU_b)-E(U_a)E(U_b)
=E(U_aU_b)-(\sigma^2_B+\sigma^2_E)(\sigma^2_A+\sigma^2_E)(N-R)(N-C)
=\sigma^4_E(\kappa_E+2)\sum_{ij}Z_{ij}(1-N_{i\bullet}^{-1})(1-N_{\bullet j}^{-1}).
\]
Notice that $\mathrm{Cov}(U_a,U_b)=0$ when $\sigma^2_E=0$. This can be verified by noting that the row effects $a_i$ cancel from the within-row differences defining $U_a$, and the column effects $b_j$ cancel from the within-column differences defining $U_b$; hence when $\sigma^2_E=0$, $U_a$ is a function only of the $b_j$ while $U_b$ is a function only of the $a_i$. Therefore $U_a$ and $U_b$ are independent when $\sigma^2_E=0$.
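Because $\mathrm{Cov}(U_a,U_b)$ depends only on $Z$, $\sigma^2_E$ and $\kappa_E$, it is very cheap to compute. A small numpy sketch (the helper name is ours):

```python
import numpy as np

def cov_Ua_Ub(Z, s2E, kE):
    """Cov(U_a, U_b) = sigma_E^4 (kappa_E + 2) sum_ij Z_ij (1 - 1/N_i.)(1 - 1/N_.j)."""
    Z = np.asarray(Z, dtype=float)
    Ni = Z.sum(axis=1, keepdims=True)   # row counts, shape (R, 1)
    Nj = Z.sum(axis=0, keepdims=True)   # column counts, shape (1, C)
    return s2E**2*(kE + 2)*(Z*(1 - 1/Ni)*(1 - 1/Nj)).sum()
```

For a fully observed $3\times3$ layout the sum is $9\cdot(2/3)^2=4$, so with $\sigma^2_E=1$ and $\kappa_E=0$ the covariance is $8$, and it vanishes whenever $\sigma^2_E=0$, matching the independence remark above.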
A.2.7 Covariance of $U_a$ and $U_e$

We use the formula $\mathrm{Cov}(U_a,U_e)=E(U_aU_e)-E(U_a)E(U_e)$, so we just need to compute $E(U_aU_e)$. First,
\[
U_aU_e=\frac14\Bigl(\sum_{ijj'}N_{i\bullet}^{-1}Z_{ij}Z_{ij'}(Y_{ij}-Y_{ij'})^2\Bigr)\Bigl(\sum_{rr'ss'}Z_{rs}Z_{r's'}(Y_{rs}-Y_{r's'})^2\Bigr)
=\frac14\sum_{ijj'}\sum_{rr'ss'}N_{i\bullet}^{-1}Z_{ij}Z_{ij'}Z_{rs}Z_{r's'}(Y_{ij}-Y_{ij'})^2(Y_{rs}-Y_{r's'})^2.
\]
Then,
\[
E(U_aU_e)=\frac14\sum_{ijj'}\sum_{rr'ss'}N_{i\bullet}^{-1}Z_{ij}Z_{ij'}Z_{rs}Z_{r's'}\Bigl(\underbrace{Q_{B,jj',ss'}}_{\text{Term 1}}+\underbrace{Q_{E,ijij',rsr's'}}_{\text{Term 2}}+\underbrace{D_{B,jj'}D_{A,rr'}}_{\text{Term 3}}+\underbrace{D_{B,jj'}D_{E,rs,r's'}}_{\text{Term 4}}+\underbrace{D_{E,ij,ij'}D_{A,rr'}}_{\text{Term 5}}+\underbrace{D_{E,ij,ij'}D_{B,ss'}}_{\text{Term 6}}+\underbrace{4B_{B,jj',ss'}B_{E,ijij',rsr's'}}_{\text{Term 7}}\Bigr).
\]
We consider each term separately.
Term 1.
\[
\frac14\sum_{ijj'}\sum_{rr'ss'}N_{i\bullet}^{-1}Z_{ij}Z_{ij'}Z_{rs}Z_{r's'}Q_{B,jj',ss'}
=\frac14\sum_{ijj'}\sum_{rr'ss'}N_{i\bullet}^{-1}Z_{ij}Z_{ij'}Z_{rs}Z_{r's'}1_{j\ne j'}1_{s\ne s'}\sigma^4_B\bigl(4+(\kappa_B+2)(1_{j\in\{s,s'\}}+1_{j'\in\{s,s'\}})+4\times1_{\{j,j'\}=\{s,s'\}}\bigr)
\]
\[
=2\sigma^4_B\Bigl(\sum_i N_{i\bullet}^{-1}\Bigl(\sum_j Z_{ij}N_{\bullet j}\Bigr)^2-\sum_{ij}N_{i\bullet}^{-1}Z_{ij}N_{\bullet j}^2\Bigr)
+\sigma^4_B(N-R)\Bigl(N^2-\sum_j N_{\bullet j}^2\Bigr)
+\sigma^4_B(\kappa_B+2)\sum_{ij}Z_{ij}(N-N_{\bullet j})N_{\bullet j}(1-N_{i\bullet}^{-1}).
\]
Term 2.
\[
\frac14\sum_{ijj'}\sum_{rr'ss'}N_{i\bullet}^{-1}Z_{ij}Z_{ij'}Z_{rs}Z_{r's'}Q_{E,ijij',rsr's'}
=\frac14\sum_{ijj'}\sum_{rr'ss'}N_{i\bullet}^{-1}Z_{ij}Z_{ij'}Z_{rs}Z_{r's'}1_{j\ne j'}1_{rs\ne r's'}\sigma^4_E\bigl(4+(\kappa_E+2)(1_{ij\in\{rs,r's'\}}+1_{ij'\in\{rs,r's'\}})+4\times1_{\{ij,ij'\}=\{rs,r's'\}}\bigr)
\]
\[
=\sigma^4_E N(N-1)(N-R)+2\sigma^4_E(N-R)+\sigma^4_E(\kappa_E+2)(N-R)(N-1).
\]
Term 3.
\[
\frac14\sum_{ijj'}\sum_{rr'ss'}N_{i\bullet}^{-1}Z_{ij}Z_{ij'}Z_{rs}Z_{r's'}D_{B,jj'}D_{A,rr'}
=\frac14\sum_{ijj'}\sum_{rr'ss'}N_{i\bullet}^{-1}Z_{ij}Z_{ij'}Z_{rs}Z_{r's'}\,2\sigma^2_B(1-1_{j=j'})\,2\sigma^2_A(1-1_{r=r'})
=\sigma^2_A\sigma^2_B(N-R)\Bigl(N^2-\sum_r N_{r\bullet}^2\Bigr).
\]
Term 4.
\[
\frac14\sum_{ijj'}\sum_{rr'ss'}N_{i\bullet}^{-1}Z_{ij}Z_{ij'}Z_{rs}Z_{r's'}D_{B,jj'}D_{E,rs,r's'}
=\frac14\sum_{ijj'}\sum_{rr'ss'}N_{i\bullet}^{-1}Z_{ij}Z_{ij'}Z_{rs}Z_{r's'}\,2\sigma^2_B(1-1_{j=j'})\,2\sigma^2_E(1-1_{r=r'}1_{s=s'})
=\sigma^2_B\sigma^2_E(N-R)(N^2-N).
\]
Term 5.
\[
\frac14\sum_{ijj'}\sum_{rr'ss'}N_{i\bullet}^{-1}Z_{ij}Z_{ij'}Z_{rs}Z_{r's'}D_{E,ij,ij'}D_{A,rr'}
=\frac14\sum_{ijj'}\sum_{rr'ss'}N_{i\bullet}^{-1}Z_{ij}Z_{ij'}Z_{rs}Z_{r's'}\,2\sigma^2_E(1-1_{j=j'})\,2\sigma^2_A(1-1_{r=r'})
=\sigma^2_A\sigma^2_E(N-R)\Bigl(N^2-\sum_r N_{r\bullet}^2\Bigr)
\]
using the result for term 3.
Term 6.
\[
\frac14\sum_{ijj'}\sum_{rr'ss'}N_{i\bullet}^{-1}Z_{ij}Z_{ij'}Z_{rs}Z_{r's'}D_{E,ij,ij'}D_{B,ss'}
=\frac14\sum_{ijj'}\sum_{rr'ss'}N_{i\bullet}^{-1}Z_{ij}Z_{ij'}Z_{rs}Z_{r's'}\,2\sigma^2_E(1-1_{j=j'})\,2\sigma^2_B(1-1_{s=s'})
=\sigma^2_B\sigma^2_E(N-R)\Bigl(N^2-\sum_s N_{\bullet s}^2\Bigr).
\]
Term 7.
\[
\sum_{ijj'}\sum_{rr'ss'}N_{i\bullet}^{-1}Z_{ij}Z_{ij'}Z_{rs}Z_{r's'}B_{B,jj',ss'}B_{E,ijij',rsr's'}
=\sum_{ijj'}\sum_{rr'ss'}N_{i\bullet}^{-1}Z_{ij}Z_{ij'}Z_{rs}Z_{r's'}\sigma^2_B\bigl(1_{j=s}-1_{j=s'}-1_{j'=s}+1_{j'=s'}\bigr)\times\sigma^2_E\bigl(1_{ij=rs}-1_{ij=r's'}-1_{ij'=rs}+1_{ij'=r's'}\bigr)
=4\sigma^2_B\sigma^2_E N(N-R).
\]
Combination. We add up the seven terms, replacing some $N_{r\bullet}$ and $N_{\bullet s}$ expressions by equivalents using $N_{i\bullet}$ and $N_{\bullet j}$, getting
\[
E(U_aU_e)=\sigma^4_B(N-R)\Bigl(N^2-\sum_j N_{\bullet j}^2\Bigr)
+2\sigma^4_B\Bigl(\sum_i N_{i\bullet}^{-1}\Bigl(\sum_j Z_{ij}N_{\bullet j}\Bigr)^2-\sum_{ij}N_{i\bullet}^{-1}Z_{ij}N_{\bullet j}^2\Bigr)
+\sigma^4_B(\kappa_B+2)\sum_{ij}Z_{ij}(N-N_{\bullet j})N_{\bullet j}(1-N_{i\bullet}^{-1})
\]
\[
+2\sigma^4_E(N-R)+\sigma^4_E N(N-1)(N-R)+\sigma^4_E(\kappa_E+2)(N-R)(N-1)
+\sigma^2_A\sigma^2_B(N-R)\Bigl(N^2-\sum_i N_{i\bullet}^2\Bigr)
+\sigma^2_B\sigma^2_E(N-R)(N^2-N)
\]
\[
+\sigma^2_A\sigma^2_E(N-R)\Bigl(N^2-\sum_i N_{i\bullet}^2\Bigr)
+\sigma^2_B\sigma^2_E(N-R)\Bigl(N^2-\sum_j N_{\bullet j}^2\Bigr)
+4\sigma^2_B\sigma^2_E N(N-R).
\]
Now $E(U_a)E(U_e)$ equals
\[
(N-R)(\sigma^2_B+\sigma^2_E)\Bigl(\sigma^2_A\Bigl(N^2-\sum_i N_{i\bullet}^2\Bigr)+\sigma^2_B\Bigl(N^2-\sum_j N_{\bullet j}^2\Bigr)+\sigma^2_E(N^2-N)\Bigr),
\]
which contains terms equaling several of those in $E(U_aU_e)$ above. Subtracting those terms from $E(U_aU_e)$ yields
\[
\mathrm{Cov}(U_a,U_e)=2\sigma^4_B\Bigl(\sum_i N_{i\bullet}^{-1}\Bigl(\sum_j Z_{ij}N_{\bullet j}\Bigr)^2-\sum_{ij}N_{i\bullet}^{-1}Z_{ij}N_{\bullet j}^2\Bigr)
+\sigma^4_B(\kappa_B+2)\sum_{ij}Z_{ij}(N-N_{\bullet j})N_{\bullet j}(1-N_{i\bullet}^{-1})
+2\sigma^4_E(N-R)
+\sigma^4_E(\kappa_E+2)(N-R)(N-1)
+4\sigma^2_B\sigma^2_E N(N-R).
\]
Covariance of $U_b$ and $U_e$. By interchanging the roles of the rows and columns in $\mathrm{Cov}(U_a,U_e)$, we find that
\[
\mathrm{Cov}(U_b,U_e)=2\sigma^4_A\Bigl(\sum_j N_{\bullet j}^{-1}\Bigl(\sum_i Z_{ij}N_{i\bullet}\Bigr)^2-\sum_{ij}N_{\bullet j}^{-1}Z_{ij}N_{i\bullet}^2\Bigr)
+\sigma^4_A(\kappa_A+2)\sum_{ij}Z_{ij}(N-N_{i\bullet})N_{i\bullet}(1-N_{\bullet j}^{-1})
+2\sigma^4_E(N-C)
+\sigma^4_E(\kappa_E+2)(N-C)(N-1)
+4\sigma^2_A\sigma^2_E N(N-C).
\]
A.2.8 Estimating Kurtoses

To estimate the kurtoses $\kappa_A$, $\kappa_B$ and $\kappa_E$ in our variance formulas, it suffices to estimate fourth central moments such as $\mu_{A,4}=\sigma^4_A(\kappa_A+3)$ and similarly defined $\mu_{B,4}$ and $\mu_{E,4}$. Given $\sigma^2_A$, $\sigma^2_B$, and $\sigma^2_E$, we can do this via the method of moments. Consider the following estimating equations and their expectations,
\[
W_a=\frac12\sum_{ijj'}\frac1{N_{i\bullet}}Z_{ij}Z_{ij'}(Y_{ij}-Y_{ij'})^4,\quad
W_b=\frac12\sum_{ii'j}\frac1{N_{\bullet j}}Z_{ij}Z_{i'j}(Y_{ij}-Y_{i'j})^4,\quad\text{and}\quad
W_e=\frac12\sum_{ii'jj'}Z_{ij}Z_{i'j'}(Y_{ij}-Y_{i'j'})^4.
\]
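Although $W_a$ is written as a double sum over pairs of columns within each row, it collapses to within-row power sums, so it can be accumulated in time linear in $N$, in the same spirit as the $U$-statistics. A sketch (helper names are ours), with a naive quadratic version kept only as a check:

```python
import numpy as np

def Wa_linear(rows):
    """W_a = (1/2) sum_i N_i.^{-1} sum_{j,j'} (Y_ij - Y_ij')^4, one pass per row.

    `rows` holds the observed values of each row as a sequence of arrays.
    Uses (1/2) sum_{jj'} (y_j - y_j')^4 = n*S4 - 4*S3*S1 + 3*S2^2,
    where Sk is the k-th power sum of the row.
    """
    total = 0.0
    for y in rows:
        y = np.asarray(y, dtype=float)
        n = len(y)
        S1, S2, S3, S4 = y.sum(), (y**2).sum(), (y**3).sum(), (y**4).sum()
        total += (n*S4 - 4*S3*S1 + 3*S2**2) / n
    return total

def Wa_naive(rows):
    """O(sum_i N_i.^2) brute force version, for verification only."""
    return sum(0.5*sum((a - b)**4 for a in y for b in y)/len(y) for y in rows)
```

The same power-sum trick applies to $W_b$ column-wise and to $W_e$ globally, keeping the whole kurtosis-estimation step linear in the sample size.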
Using previous results,
\[
E(W_a)=\frac12\sum_{ijj'}\frac1{N_{i\bullet}}Z_{ij}Z_{ij'}E\bigl((b_j-b_{j'}+e_{ij}-e_{ij'})^4\bigr)
=\frac12\sum_{ijj'}\frac{Z_{ij}Z_{ij'}}{N_{i\bullet}}E\bigl((b_j-b_{j'})^4+6(b_j-b_{j'})^2(e_{ij}-e_{ij'})^2+(e_{ij}-e_{ij'})^4\bigr)
\]
\[
=(N-R)\bigl(\mu_{B,4}+3\sigma^4_B+12\sigma^2_B\sigma^2_E+\mu_{E,4}+3\sigma^4_E\bigr).
\]
By symmetry,
\[
E(W_b)=(N-C)\bigl(\mu_{A,4}+3\sigma^4_A+12\sigma^2_A\sigma^2_E+\mu_{E,4}+3\sigma^4_E\bigr).
\]
Next,
\[
E(W_e)=\frac12\sum_{ii'jj'}Z_{ij}Z_{i'j'}E\bigl((Y_{ij}-Y_{i'j'})^4\bigr)
=\frac12\sum_{ii'jj'}Z_{ij}Z_{i'j'}E\bigl((a_i-a_{i'}+b_j-b_{j'}+e_{ij}-e_{i'j'})^4\bigr)
\]
\[
=\frac12\sum_{ii'jj'}Z_{ij}Z_{i'j'}E\bigl((a_i-a_{i'})^4+6(a_i-a_{i'})^2(b_j-b_{j'})^2+(b_j-b_{j'})^4+6(a_i-a_{i'})^2(e_{ij}-e_{i'j'})^2+6(b_j-b_{j'})^2(e_{ij}-e_{i'j'})^2+(e_{ij}-e_{i'j'})^4\bigr)
\]
\[
=(\mu_{A,4}+3\sigma^4_A)\Bigl(N^2-\sum_i N_{i\bullet}^2\Bigr)+(\mu_{B,4}+3\sigma^4_B)\Bigl(N^2-\sum_j N_{\bullet j}^2\Bigr)
+12\sigma^2_A\sigma^2_E\Bigl(N^2-\sum_i N_{i\bullet}^2\Bigr)+12\sigma^2_B\sigma^2_E\Bigl(N^2-\sum_j N_{\bullet j}^2\Bigr)
+(\mu_{E,4}+3\sigma^4_E)N(N-1)+12\sigma^2_A\sigma^2_B\Bigl(N^2-\sum_i N_{i\bullet}^2-\sum_j N_{\bullet j}^2+N\Bigr).
\]
These expectations are all linear in the fourth moments. Therefore, given estimates of $\sigma^2_A$, $\sigma^2_B$, and $\sigma^2_E$, we can solve another three-by-three system of equations to get estimates of the fourth moments. Letting $M$ be the matrix in equation (A.5), we find that
\[
\begin{pmatrix}E(W_a)\\E(W_b)\\E(W_e)\end{pmatrix}
=M\begin{pmatrix}\mu_{A,4}\\\mu_{B,4}\\\mu_{E,4}\end{pmatrix}
+\begin{pmatrix}3(N-R)\sigma^4_B+12(N-R)\sigma^2_B\sigma^2_E+3(N-R)\sigma^4_E\\
3(N-C)\sigma^4_A+12(N-C)\sigma^2_A\sigma^2_E+3(N-C)\sigma^4_E\\
H\end{pmatrix}
\]
where
\[
H=(3\sigma^4_A+12\sigma^2_A\sigma^2_E)\Bigl(N^2-\sum_i N_{i\bullet}^2\Bigr)+(3\sigma^4_B+12\sigma^2_B\sigma^2_E)\Bigl(N^2-\sum_j N_{\bullet j}^2\Bigr)
+3\sigma^4_E N(N-1)+12\sigma^2_A\sigma^2_B\Bigl(N^2-\sum_i N_{i\bullet}^2-\sum_j N_{\bullet j}^2+N\Bigr).
\]
For plug-in method of moments estimators we replace the expected $W$-statistics by their sample quantities, replace the variance components by their estimates, and solve the matrix equation, getting $\hat\mu_{A,4}$ et cetera. Then $\hat\kappa_A=\hat\mu_{A,4}/\hat\sigma^4_A-3$ and so on.
A.3 Asymptotic Normality of Moment Estimates

In this section we provide the proofs for Section 2.4.3.

A.3.1 Proof of Theorem 2.4.3

Letting $\varepsilon=\max(\varepsilon_R,\varepsilon_C)$, we have
\[
M=\begin{pmatrix}N&&\\&N&\\&&N^2\end{pmatrix}
\begin{pmatrix}0&1-R/N&1-R/N\\1-C/N&0&1-C/N\\1&1&1\end{pmatrix}(1+O(\varepsilon))
\]
and so if $\max(R,C)/N\le\theta$ for some $\theta<1$, then
\[
M^{-1}=\begin{pmatrix}-\frac N{N-R}&0&1\\[2pt]0&-\frac N{N-C}&1\\[2pt]\frac N{N-R}&\frac N{N-C}&-1\end{pmatrix}
\begin{pmatrix}N^{-1}&&\\&N^{-1}&\\&&N^{-2}\end{pmatrix}(1+O(\varepsilon)).
\]
It follows that
\[
\hat\sigma^2_A=\Bigl(\frac{U_e}{N^2}-\frac{U_a}{N-R}\Bigr)(1+O(\varepsilon)),\quad
\hat\sigma^2_B=\Bigl(\frac{U_e}{N^2}-\frac{U_b}{N-C}\Bigr)(1+O(\varepsilon)),\quad\text{and}\quad
\hat\sigma^2_E=\Bigl(\frac{U_a}{N-R}+\frac{U_b}{N-C}-\frac{U_e}{N^2}\Bigr)(1+O(\varepsilon)).
\tag{A.11}
\]
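The displayed factorization of $M$ and its claimed inverse can be checked numerically, and (A.11) then gives the estimates in closed form. A sketch under illustrative values of $N$, $R$, $C$ (variable and function names are ours; in practice one would solve with the exact $M$ rather than its leading-order factorization):

```python
import numpy as np

# Numerical check of the displayed factorization of M and its inverse
# (the O(eps) factors are dropped).
N, R, C = 1000.0, 40.0, 25.0
B = np.array([[0., 1 - R/N, 1 - R/N],
              [1 - C/N, 0., 1 - C/N],
              [1., 1., 1.]])
M = np.diag([N, N, N**2]) @ B
A = np.array([[-N/(N - R), 0., 1.],
              [0., -N/(N - C), 1.],
              [N/(N - R), N/(N - C), -1.]])
Minv = A @ np.diag([1/N, 1/N, 1/N**2])
assert np.allclose(Minv @ M, np.eye(3))     # the two displays are mutual inverses

def moment_estimates(Ua, Ub, Ue, N, R, C):
    """Variance component estimates from (A.11)."""
    s2A = Ue/N**2 - Ua/(N - R)
    s2B = Ue/N**2 - Ub/(N - C)
    s2E = Ua/(N - R) + Ub/(N - C) - Ue/N**2
    return s2A, s2B, s2E
```

Feeding in $U$-statistics equal to $M\theta$ for $\theta=(\sigma^2_A,\sigma^2_B,\sigma^2_E)$ returns $\theta$ exactly with this leading-order $M$.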
From Lemma 2.4.1, $E(U_a)=(\sigma^2_B+\sigma^2_E)(N-R)$, $E(U_b)=(\sigma^2_A+\sigma^2_E)(N-C)$, and
\[
E(U_e)=\sigma^2_A\Bigl(N^2-\sum_i N_{i\bullet}^2\Bigr)+\sigma^2_B\Bigl(N^2-\sum_j N_{\bullet j}^2\Bigr)+\sigma^2_E(N^2-N),\quad\text{so}
\]
\[
\frac{E(U_e)}{N^2}=\sigma^2_A+\sigma^2_B+\sigma^2_E-\Upsilon,\quad\text{where}\quad
\Upsilon=\sigma^2_A\frac{\sum_i N_{i\bullet}^2}{N^2}+\sigma^2_B\frac{\sum_j N_{\bullet j}^2}{N^2}+\frac{\sigma^2_E}N=O(\varepsilon).
\]
By substitution in (A.11) we find that all of the variance component biases are $\Upsilon\times(1+O(\varepsilon))=O(\varepsilon)$.
Turning now to variances,
\[
\mathrm{Var}(\hat\sigma^2_A)=O\Bigl(\frac{\mathrm{Var}(U_e)}{N^4}+\frac{\mathrm{Var}(U_a)}{N^2}\Bigr),\quad
\mathrm{Var}(\hat\sigma^2_B)=O\Bigl(\frac{\mathrm{Var}(U_e)}{N^4}+\frac{\mathrm{Var}(U_b)}{N^2}\Bigr),\quad\text{and}\quad
\mathrm{Var}(\hat\sigma^2_E)=O\Bigl(\frac{\mathrm{Var}(U_a)}{N^2}+\frac{\mathrm{Var}(U_b)}{N^2}+\frac{\mathrm{Var}(U_e)}{N^4}\Bigr).
\tag{A.12}
\]
We work from the exact finite sample formulas in Theorem 2.4.1.
Using $(ZZ^\mathsf{T})_{ir}\le N_{r\bullet}$ and $\sum_{ir}(ZZ^\mathsf{T})_{ir}=\sum_j N_{\bullet j}^2$, we find from (2.12) that
\[
\mathrm{Var}(U_a)\le\sigma^4_B(\kappa_B+2)\sum_{ir}(ZZ^\mathsf{T})_{ir}+2\sigma^4_B\sum_{ir}N_{i\bullet}^{-1}(ZZ^\mathsf{T})_{ir}+4\sigma^2_B\sigma^2_E N+\sigma^4_E(\kappa_E+2)\sum_i N_{i\bullet}+2\sigma^4_E\sum_i 1
\]
\[
\le\sigma^4_B(\kappa_B+4)\sum_j N_{\bullet j}^2+\bigl(4\sigma^2_B\sigma^2_E+\sigma^4_E(\kappa_E+2)\bigr)N+2R\sigma^4_E
=O\Bigl(\sum_j N_{\bullet j}^2\Bigr).
\tag{A.13}
\]
The same logic yields $\mathrm{Var}(U_b)=O(\sum_i N_{i\bullet}^2)$. The second term in $\mathrm{Var}(U_a)$, which was lumped in with the first, might ordinarily be much smaller than the first, and then a lead coefficient of $\sigma^4_B(\kappa_B+2)$ would be more accurate than $\sigma^4_B(\kappa_B+4)$.
For $\mathrm{Var}(U_e)$ the nonnegative terms in (2.14) have magnitudes proportional to
\[
\Bigl(\sum_i N_{i\bullet}^2\Bigr)^2,\ \Bigl(\sum_j N_{\bullet j}^2\Bigr)^2,\ N^2\sum_i N_{i\bullet}^2,\ N^2\sum_j N_{\bullet j}^2,\ \sum_i N_{i\bullet}^4,\ \sum_j N_{\bullet j}^4,\ \sum_i N_{i\bullet}^2\sum_j N_{\bullet j}^2,\ N^3
\]
or smaller. These are all $O(N^2(\sum_i N_{i\bullet}^2+\sum_j N_{\bullet j}^2))$, and so
\[
\mathrm{Var}(U_e)=O\Bigl(N^2\Bigl(\sum_i N_{i\bullet}^2+\sum_j N_{\bullet j}^2\Bigr)\Bigr).
\tag{A.14}
\]
Combining (A.13) and (A.14) into (A.12) yields
\[
\mathrm{Var}(\hat\sigma^2_A)=O\Bigl(\frac{\sum_i N_{i\bullet}^2}{N^2}+\frac{\sum_j N_{\bullet j}^2}{N^2}\Bigr)(1+O(\varepsilon)),
\]
and the same follows for $\hat\sigma^2_B$ by symmetry. Precisely the same terms appear in $\mathrm{Var}(\hat\sigma^2_E)$ so it also has that rate.
A.3.2 Asymptotic approximation: proof of Theorem 2.4.4

We suppose that the following inequalities all hold:
\[
N_{i\bullet}\le\varepsilon N,\quad N_{\bullet j}\le\varepsilon N,\quad R\le\varepsilon N,\quad C\le\varepsilon N,\quad
N\le\varepsilon\sum_i N_{i\bullet}^2,\quad N\le\varepsilon\sum_j N_{\bullet j}^2,\quad
\sum_i N_{i\bullet}^2\le\varepsilon N^2,\quad\text{and}\quad\sum_j N_{\bullet j}^2\le\varepsilon N^2
\]
for the same small $\varepsilon>0$. The first six inequalities are assumed in the theorem statement. The last two follow from the first two. We also assume that
\[
0<\kappa_A+2,\ \kappa_B+2,\ \kappa_E+2,\ \sigma^4_A,\ \sigma^4_B,\ \sigma^4_E<\infty.
\]
Note that we can then bound $\sigma^2_A\sigma^2_B$, $\sigma^2_A\sigma^2_E$, and $\sigma^2_B\sigma^2_E$ away from $0$ and $\infty$ uniformly with those other quantities. We also suppose that
\[
\sum_{ij}Z_{ij}N_{i\bullet}^{-1}N_{\bullet j}\le\varepsilon\sum_j N_{\bullet j}^2\quad\text{and}\quad
\sum_{ij}Z_{ij}N_{i\bullet}N_{\bullet j}^{-1}\le\varepsilon\sum_i N_{i\bullet}^2.
\tag{A.15}
\]
The bounds in (A.15) seem reasonable but it appears that they cannot be derived from the first eight bounds above.
We begin with the coefficient of $\sigma^4_B(\kappa_B+2)$ in $\mathrm{Var}(U_a)$ from equation (2.12). It is
\[
\sum_{ir}(ZZ^\mathsf{T})_{ir}(1-N_{i\bullet}^{-1}-N_{r\bullet}^{-1}+N_{i\bullet}^{-1}N_{r\bullet}^{-1})
=\sum_j N_{\bullet j}^2-2\sum_{ij}Z_{ij}N_{i\bullet}^{-1}N_{\bullet j}+\sum_{jir}Z_{ij}N_{i\bullet}^{-1}Z_{rj}N_{r\bullet}^{-1}
=\sum_j N_{\bullet j}^2(1+O(\varepsilon)).
\]
The third, fourth and fifth terms in $\mathrm{Var}(U_a)$ are all $\sum_j N_{\bullet j}^2\,O(\varepsilon)$. The second term contains
\[
\sum_{ir}N_{i\bullet}^{-1}N_{r\bullet}^{-1}(ZZ^\mathsf{T})_{ir}\bigl((ZZ^\mathsf{T})_{ir}-1\bigr)
\le\sum_{ir}N_{i\bullet}^{-1}(ZZ^\mathsf{T})_{ir}
=\sum_{irj}N_{i\bullet}^{-1}Z_{ij}Z_{rj}
=\sum_{ij}Z_{ij}N_{i\bullet}^{-1}N_{\bullet j}
=\sum_j N_{\bullet j}^2\,O(\varepsilon).
\]
It follows that $\mathrm{Var}(U_a)=\sigma^4_B(\kappa_B+2)\sum_j N_{\bullet j}^2(1+O(\varepsilon))$. Similarly $\mathrm{Var}(U_b)=\sigma^4_A(\kappa_A+2)\sum_i N_{i\bullet}^2(1+O(\varepsilon))$.
The expression for $\mathrm{Var}(U_e)$ contains the terms $\sigma^4_A(\kappa_A+2)N^2\sum_i N_{i\bullet}^2+\sigma^4_B(\kappa_B+2)N^2\sum_j N_{\bullet j}^2$. All other terms are $O(\varepsilon)$ times these two, mostly through $N\ll\sum_i N_{i\bullet}^2,\sum_j N_{\bullet j}^2\ll N^2$. The coefficient of $\sigma^2_A\sigma^2_B$ contains
\[
N\sum_{ij}Z_{ij}N_{i\bullet}N_{\bullet j}\le\varepsilon N^2\sum_{ij}Z_{ij}N_{i\bullet}=\varepsilon N^2\sum_i N_{i\bullet}^2,
\]
so it is of smaller order than the lead term, as well as
\[
\sum_i N_{i\bullet}^2\sum_j N_{\bullet j}^2\le\varepsilon N^2\sum_i N_{i\bullet}^2.
\]
As a result,
\[
\mathrm{Var}(U_e)=\Bigl(\sigma^4_A(\kappa_A+2)N^2\sum_i N_{i\bullet}^2+\sigma^4_B(\kappa_B+2)N^2\sum_j N_{\bullet j}^2\Bigr)(1+O(\varepsilon)).
\]
Turning to the covariances,
\[
\mathrm{Cov}(U_a,U_b)=\sigma^4_E(\kappa_E+2)\sum_{ij}Z_{ij}(1-N_{i\bullet}^{-1}-N_{\bullet j}^{-1}+N_{i\bullet}^{-1}N_{\bullet j}^{-1})
=\sigma^4_E(\kappa_E+2)(N-R-C+O(R))=\sigma^4_E(\kappa_E+2)N(1+O(\varepsilon)).
\]
Next, $\mathrm{Cov}(U_a,U_e)$ contains the term $\sigma^4_B(\kappa_B+2)N\sum_{ij}Z_{ij}N_{\bullet j}=\sigma^4_B(\kappa_B+2)N\sum_j N_{\bullet j}^2$. The terms appearing after that one are $O(N^2)=O(\varepsilon N\sum_j N_{\bullet j}^2)$. The largest term preceding it is dominated by
\[
\sum_i N_{i\bullet}^{-1}\Bigl(\sum_j Z_{ij}N_{\bullet j}\Bigr)^2
\le\varepsilon N\sum_i N_{i\bullet}^{-1}\Bigl(\sum_j Z_{ij}N_{\bullet j}\Bigr)\Bigl(\sum_j Z_{ij}\Bigr)
=\varepsilon N\sum_j N_{\bullet j}^2.
\]
It follows that $\mathrm{Cov}(U_a,U_e)=\sigma^4_B(\kappa_B+2)N\sum_j N_{\bullet j}^2(1+O(\varepsilon))$ and similarly $\mathrm{Cov}(U_b,U_e)=\sigma^4_A(\kappa_A+2)N\sum_i N_{i\bullet}^2(1+O(\varepsilon))$.
Next, using (2.19),
\[
\mathrm{Var}(\hat\sigma^2_A)=\Bigl(\frac{\mathrm{Var}(U_e)}{N^4}+\frac{\mathrm{Var}(U_a)}{N^2}-2\frac{\mathrm{Cov}(U_a,U_e)}{N^3}\Bigr)(1+O(\varepsilon))
=\sigma^4_A(\kappa_A+2)\frac1{N^2}\sum_i N_{i\bullet}^2\,(1+O(\varepsilon)),
\]
and similarly
\[
\mathrm{Var}(\hat\sigma^2_B)=\sigma^4_B(\kappa_B+2)\frac1{N^2}\sum_j N_{\bullet j}^2\,(1+O(\varepsilon)).
\]
The last variance is
\[
\mathrm{Var}(\hat\sigma^2_E)=\Bigl(\frac{\mathrm{Var}(U_a)}{N^2}+\frac{\mathrm{Var}(U_b)}{N^2}+\frac{\mathrm{Var}(U_e)}{N^4}-\frac2{N^3}\mathrm{Cov}(U_a,U_e)-\frac2{N^3}\mathrm{Cov}(U_b,U_e)+\frac2{N^2}\mathrm{Cov}(U_a,U_b)\Bigr)(1+O(\varepsilon))
=\sigma^4_E(\kappa_E+2)\frac1N(1+O(\varepsilon)).
\]
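In practice one plugs estimates of the variance components and kurtoses into these leading-order formulas to get standard errors. A numpy sketch (function name ours; the $1+O(\varepsilon)$ factors are dropped):

```python
import numpy as np

def plugin_variances(Z, s2A, s2B, s2E, kA, kB, kE):
    """Leading-order variances of the three moment estimates (Theorem 2.4.4)."""
    Z = np.asarray(Z, dtype=float)
    N = Z.sum()
    SA = (Z.sum(axis=1)**2).sum()    # sum_i N_i.^2
    SB = (Z.sum(axis=0)**2).sum()    # sum_j N_.j^2
    vA = s2A**2*(kA + 2)*SA/N**2
    vB = s2B**2*(kB + 2)*SB/N**2
    vE = s2E**2*(kE + 2)/N
    return vA, vB, vE
```

For a fully observed $2\times3$ layout with unit variances and zero kurtoses this gives $(2\cdot18/36,\ 2\cdot12/36,\ 2/6)=(1,\ 2/3,\ 1/3)$.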
Next we verify that these variance estimates are asymptotically uncorrelated. Ignoring the $1+O(\varepsilon)$ factors we have
\[
\mathrm{Cov}(\hat\sigma^2_A,\hat\sigma^2_B)
\doteq\frac{\mathrm{Var}(U_e)}{N^4}-\frac{\mathrm{Cov}(U_b,U_e)}{N^3}-\frac{\mathrm{Cov}(U_a,U_e)}{N^3}+\frac{\mathrm{Cov}(U_a,U_b)}{N^2}
\]
\[
\doteq\frac1{N^2}\Bigl(\sigma^4_A(\kappa_A+2)\sum_i N_{i\bullet}^2+\sigma^4_B(\kappa_B+2)\sum_j N_{\bullet j}^2\Bigr)
-\frac1{N^2}\sigma^4_A(\kappa_A+2)\sum_i N_{i\bullet}^2
-\frac1{N^2}\sigma^4_B(\kappa_B+2)\sum_j N_{\bullet j}^2
+\frac1N\sigma^4_E(\kappa_E+2)
=\frac1N\sigma^4_E(\kappa_E+2),
\]
which is $O(\varepsilon)$ times $\mathrm{Var}(\hat\sigma^2_A)$ and $\mathrm{Var}(\hat\sigma^2_B)$. Likewise,
\[
\mathrm{Cov}(\hat\sigma^2_A,\hat\sigma^2_E)
\doteq\frac1{N^3}\mathrm{Cov}(U_a,U_e)+\frac1{N^3}\mathrm{Cov}(U_b,U_e)-\frac1{N^4}\mathrm{Var}(U_e)-\frac1{N^2}\mathrm{Var}(U_a)-\frac1{N^2}\mathrm{Cov}(U_a,U_b)+\frac1{N^3}\mathrm{Cov}(U_a,U_e)
\]
\[
\doteq\sigma^4_B(\kappa_B+2)\frac2{N^2}\sum_j N_{\bullet j}^2+\sigma^4_A(\kappa_A+2)\frac1{N^2}\sum_i N_{i\bullet}^2
-\Bigl(\sigma^4_A(\kappa_A+2)\sum_i N_{i\bullet}^2+\sigma^4_B(\kappa_B+2)\sum_j N_{\bullet j}^2\Bigr)\frac1{N^2}
-\sigma^4_B(\kappa_B+2)\frac1{N^2}\sum_j N_{\bullet j}^2-\sigma^4_E(\kappa_E+2)\frac1N
=-\sigma^4_E(\kappa_E+2)\frac1N,
\]
which is much smaller than $\mathrm{Var}(\hat\sigma^2_A)$. Similarly $\mathrm{Cov}(\hat\sigma^2_B,\hat\sigma^2_E)\doteq-\sigma^4_E(\kappa_E+2)/N$, which is much smaller than $\mathrm{Var}(\hat\sigma^2_B)$.
A.3.3 Proof of Lemma 2.4.2

We first consider the relevant results for $U_a-E(U_a)$. It is clear that $S^A_{N,k,0}$ is measurable with respect to $\mathcal F_{N,k,0}$, $S^A_{N,q,1}$ is measurable with respect to $\mathcal F_{N,q,1}$, and $S^A_{N,\ell,m,2}$ is measurable with respect to $\mathcal F_{N,\ell,m,2}$.

First,
\[
E(S^A_{N,k,0}\mid\mathcal F_{N,k-1,0})=\sum_i N_{i\bullet}^{-1}Z_{ik}\Bigl(\sum_{j:j\ne k}Z_{ij}E(b_k^2-\sigma^2_B)-2\sum_{j:j<k}Z_{ij}E(b_j)E(b_k)\Bigr)=0,
\]
\[
E(S^A_{N,q,1}\mid\mathcal F_{N,q-1,1})=0,\quad\text{and}
\]
\[
E(S^A_{N,\ell,m,2}\mid\mathcal F_{N,\ell,m-1,2})=N_{\ell\bullet}^{-1}Z_{\ell m}\Bigl(\sum_{j:j\ne m}Z_{\ell j}\bigl(2(b_m-b_j)E(e_{\ell m})+E(e_{\ell m}^2-\sigma^2_E)\bigr)-2\sum_{j:j<m}Z_{\ell j}E(e_{\ell j})E(e_{\ell m})\Bigr)=0.
\]
Next we sum the martingale differences,
\[
\sum_k S^A_{N,k,0}+\sum_{\ell m}S^A_{N,\ell,m,2}
=\sum_k\sum_i N_{i\bullet}^{-1}Z_{ik}\Bigl(\sum_{j:j\ne k}Z_{ij}(b_k^2-\sigma^2_B)-2\sum_{j:j<k}Z_{ij}b_jb_k\Bigr)
+\sum_{\ell m}N_{\ell\bullet}^{-1}Z_{\ell m}\Bigl(\sum_{j:j\ne m}Z_{\ell j}\bigl(2(b_m-b_j)e_{\ell m}+e_{\ell m}^2-\sigma^2_E\bigr)-2\sum_{j:j<m}Z_{\ell j}e_{\ell j}e_{\ell m}\Bigr)
\]
\[
=\sum_i N_{i\bullet}^{-1}\Bigl(\sum_{j\ne k}Z_{ik}Z_{ij}(b_k^2-\sigma^2_B)-2\sum_{j<k}Z_{ik}Z_{ij}b_jb_k\Bigr)
+\sum_\ell N_{\ell\bullet}^{-1}\Bigl(\sum_{j\ne m}Z_{\ell j}Z_{\ell m}\bigl(e_{\ell m}^2-\sigma^2_E+2(b_m-b_j)e_{\ell m}\bigr)-2\sum_{j<m}Z_{\ell j}Z_{\ell m}e_{\ell j}e_{\ell m}\Bigr)
\]
\[
=\sum_i N_{i\bullet}^{-1}\Bigl(\sum_{j\ne k}Z_{ik}Z_{ij}b_k^2-\sigma^2_BN_{i\bullet}(N_{i\bullet}-1)-\sum_{j\ne k}Z_{ik}Z_{ij}b_jb_k\Bigr)
+\sum_\ell N_{\ell\bullet}^{-1}\Bigl(\sum_{j\ne m}Z_{\ell j}Z_{\ell m}e_{\ell m}^2-\sigma^2_EN_{\ell\bullet}(N_{\ell\bullet}-1)-\sum_{j\ne m}Z_{\ell j}Z_{\ell m}e_{\ell j}e_{\ell m}+2\sum_{j\ne m}Z_{\ell j}Z_{\ell m}(b_m-b_j)e_{\ell m}\Bigr)
\]
\[
=\frac12\sum_{ijk}N_{i\bullet}^{-1}Z_{ij}Z_{ik}(b_j-b_k)^2-\sigma^2_B(N-R)
+\frac12\sum_{\ell jm}N_{\ell\bullet}^{-1}Z_{\ell j}Z_{\ell m}(e_{\ell j}-e_{\ell m})^2-\sigma^2_E(N-R)
+\sum_{\ell jm}N_{\ell\bullet}^{-1}Z_{\ell j}Z_{\ell m}(b_j-b_m)(e_{\ell j}-e_{\ell m})
\]
\[
=\frac12\sum_{ijk}N_{i\bullet}^{-1}Z_{ij}Z_{ik}(b_j+e_{ij}-b_k-e_{ik})^2-(\sigma^2_B+\sigma^2_E)(N-R)
=U_a-E(U_a).
\]
The results for $U_b-E(U_b)$ follow by symmetry.
Now, we consider the results for $U_e-E(U_e)$. It is clear that $S^E_{N,k,0}$ is measurable with respect to $\mathcal F_{N,k,0}$, $S^E_{N,q,1}$ is measurable with respect to $\mathcal F_{N,q,1}$, and $S^E_{N,\ell,m,2}$ is measurable with respect to $\mathcal F_{N,\ell,m,2}$. Similarly, the martingale terms all have expectation zero:
\[
E(S^E_{N,k,0}\mid\mathcal F_{N,k-1,0})=\sum_{ii'}Z_{ik}\Bigl(\sum_{j:j\ne k}Z_{i'j}E(b_k^2-\sigma^2_B)-2\sum_{j:j<k}Z_{i'j}E(b_j)E(b_k)\Bigr)=0,
\]
\[
E(S^E_{N,q,1}\mid\mathcal F_{N,q-1,1})=\sum_{jj'}Z_{qj}\Bigl(\sum_{i:i\ne q}Z_{ij'}\bigl(E(a_q^2-\sigma^2_A)+2(b_j-b_{j'})E(a_q)\bigr)-2\sum_{i:i<q}Z_{ij'}E(a_i)E(a_q)\Bigr)=0,\quad\text{and}
\]
\[
E(S^E_{N,\ell,m,2}\mid\mathcal F_{N,\ell,m-1,2})=Z_{\ell m}\Bigl(\sum_{ij:ij\ne\ell m}Z_{ij}\bigl(E(e_{\ell m}^2-\sigma^2_E)+2(a_\ell-a_i)E(e_{\ell m})+2(b_m-b_j)E(e_{\ell m})\bigr)-2\sum_{\substack{ij:\,i<\ell\ \text{or}\\ i=\ell,\,j<m}}Z_{ij}E(e_{ij})E(e_{\ell m})\Bigr)=0,
\]
and their sum is
\[
\sum_k S^E_{N,k,0}+\sum_q S^E_{N,q,1}+\sum_{\ell m}S^E_{N,\ell,m,2}
=\sum_k\sum_{ii'}Z_{ik}\Bigl(\sum_{j:j\ne k}Z_{i'j}(b_k^2-\sigma^2_B)-2\sum_{j:j<k}Z_{i'j}b_jb_k\Bigr)
+\sum_q\sum_{jj'}Z_{qj}\Bigl(\sum_{i:i\ne q}Z_{ij'}\bigl(a_q^2-\sigma^2_A+2(b_j-b_{j'})a_q\bigr)-2\sum_{i:i<q}Z_{ij'}a_ia_q\Bigr)
\]
\[
+\sum_{\ell m}Z_{\ell m}\Bigl(\sum_{ij:ij\ne\ell m}Z_{ij}\bigl(e_{\ell m}^2-\sigma^2_E+2(a_\ell-a_i)e_{\ell m}+2(b_m-b_j)e_{\ell m}\bigr)-2\sum_{\substack{ij:\,i<\ell\ \text{or}\\ i=\ell,\,j<m}}Z_{ij}e_{ij}e_{\ell m}\Bigr)
\]
\[
=\sum_{ii'}\Bigl(\sum_{j\ne k}Z_{ik}Z_{i'j}(b_k^2-\sigma^2_B)-2\sum_{j<k}Z_{ik}Z_{i'j}b_jb_k\Bigr)
+\sum_{jj'}\Bigl(\sum_{i\ne q}Z_{qj}Z_{ij'}\bigl(a_q^2-\sigma^2_A+2(b_j-b_{j'})a_q\bigr)-2\sum_{i<q}Z_{qj}Z_{ij'}a_ia_q\Bigr)
\]
\[
+\sum_{ij\ne\ell m}Z_{ij}Z_{\ell m}\bigl(e_{\ell m}^2-\sigma^2_E+2(a_\ell-a_i)e_{\ell m}+2(b_m-b_j)e_{\ell m}\bigr)-2\sum_{\substack{i<\ell\ \text{or}\\ i=\ell,\,j<m}}Z_{ij}Z_{\ell m}e_{ij}e_{\ell m}
\]
\[
=\sum_{ii'}\Bigl(\sum_{j\ne k}Z_{ik}Z_{i'j}b_k^2-\sigma^2_B\bigl(N_{i\bullet}N_{i'\bullet}-(ZZ^\mathsf{T})_{ii'}\bigr)-\sum_{j\ne k}Z_{ik}Z_{i'j}b_jb_k\Bigr)
+\sum_{jj'}\Bigl(\sum_{i\ne q}Z_{qj}Z_{ij'}a_q^2-\sigma^2_A\bigl(N_{\bullet j}N_{\bullet j'}-(Z^\mathsf{T}Z)_{jj'}\bigr)+2\sum_{i\ne q}Z_{qj}Z_{ij'}(b_j-b_{j'})a_q-\sum_{i\ne q}Z_{qj}Z_{ij'}a_ia_q\Bigr)
\]
\[
+\sum_{ij\ne\ell m}Z_{ij}Z_{\ell m}e_{\ell m}^2-\sigma^2_E(N^2-N)+2\sum_{ij\ne\ell m}Z_{ij}Z_{\ell m}(a_\ell-a_i+b_m-b_j)e_{\ell m}-\sum_{ij\ne\ell m}Z_{ij}Z_{\ell m}e_{ij}e_{\ell m}
\]
\[
=\frac12\sum_{ii'jk}Z_{ik}Z_{i'j}(b_j-b_k)^2-\sigma^2_B\Bigl(N^2-\sum_j N_{\bullet j}^2\Bigr)
+\frac12\sum_{iqjj'}Z_{qj}Z_{ij'}(a_q-a_i)^2-\sigma^2_A\Bigl(N^2-\sum_i N_{i\bullet}^2\Bigr)+\sum_{iqjj'}Z_{qj}Z_{ij'}(a_q-a_i)(b_j-b_{j'})
\]
\[
+\frac12\sum_{i\ell jm}Z_{ij}Z_{\ell m}(e_{\ell m}-e_{ij})^2-\sigma^2_E(N^2-N)+\sum_{i\ell jm}Z_{ij}Z_{\ell m}(a_\ell-a_i+b_m-b_j)(e_{\ell m}-e_{ij})
=U_e-E(U_e).
\]
A.3.4 Proof of Theorem 2.4.6

We utilize the Cramér–Wold theorem and show that any linear combination of the scaled martingale difference sequences is asymptotically Gaussian:
\[
w_1\frac{S^A_N}{\sqrt{\sum_j N_{\bullet j}^2}}+w_2\frac{S^B_N}{\sqrt{\sum_i N_{i\bullet}^2}}+w_3\frac{S^E_N}{N\sqrt{\sum_i N_{i\bullet}^2+\sum_j N_{\bullet j}^2}}.
\tag{A.16}
\]
Without loss of generality, assume that $w_1^2+w_2^2+w_3^2=1$.

Consider the filtration indexed by $k$, $\mathcal F_{N,k,0}$. Then, (A.16) becomes the martingale difference $\eta_k(b_k^2-\sigma^2_B)-2\lambda_kb_k$ where
\[
\eta_k=\frac{w_1}{\sqrt{\sum_j N_{\bullet j}^2}}\Bigl(N_{\bullet k}-\sum_i N_{i\bullet}^{-1}Z_{ik}\Bigr)+\frac{w_3}{N\sqrt{\sum_i N_{i\bullet}^2+\sum_j N_{\bullet j}^2}}\bigl(NN_{\bullet k}-N_{\bullet k}^2\bigr),
\]
\[
\lambda_k=\frac{w_1}{\sqrt{\sum_j N_{\bullet j}^2}}\sum_i N_{i\bullet}^{-1}Z_{ik}\sum_{j:j<k}Z_{ij}b_j+\frac{w_3}{N\sqrt{\sum_i N_{i\bullet}^2+\sum_j N_{\bullet j}^2}}\sum_{j:j<k}N_{\bullet k}N_{\bullet j}b_j,
\]
and
\[
E\bigl((\eta_k(b_k^2-\sigma^2_B)-2\lambda_kb_k)^2\mid\mathcal F_{N,k-1,0}\bigr)=\eta_k^2\sigma^4_B(\kappa_B+2)+4\lambda_k\sigma^2_B(\lambda_k-\eta_k\gamma_B\sigma_B)
\]
where $\gamma_B$ is the skewness of the column random effect.
Next, for the filtration indexed by $q$, $\mathcal F_{N,q,1}$, the martingale difference is $\eta_q(a_q^2-\sigma^2_A)-2\lambda_qa_q$ where
\[
\eta_q=\frac{w_2}{\sqrt{\sum_i N_{i\bullet}^2}}\Bigl(N_{q\bullet}-\sum_j N_{\bullet j}^{-1}Z_{qj}\Bigr)+\frac{w_3}{N\sqrt{\sum_i N_{i\bullet}^2+\sum_j N_{\bullet j}^2}}\bigl(NN_{q\bullet}-N_{q\bullet}^2\bigr),
\]
\[
\lambda_q=\frac{w_2}{\sqrt{\sum_i N_{i\bullet}^2}}\sum_j N_{\bullet j}^{-1}Z_{qj}\sum_{i:i<q}Z_{ij}a_i+\frac{w_3}{N\sqrt{\sum_i N_{i\bullet}^2+\sum_j N_{\bullet j}^2}}\sum_{i:i<q}N_{q\bullet}N_{i\bullet}a_i
-\frac{w_3}{N\sqrt{\sum_i N_{i\bullet}^2+\sum_j N_{\bullet j}^2}}\Bigl(N\sum_j Z_{qj}b_j-N_{q\bullet}\sum_{j'}N_{\bullet j'}b_{j'}\Bigr),
\]
and
\[
E\bigl((\eta_q(a_q^2-\sigma^2_A)-2\lambda_qa_q)^2\mid\mathcal F_{N,q-1,1}\bigr)=\eta_q^2\sigma^4_A(\kappa_A+2)+4\lambda_q\sigma^2_A(\lambda_q-\eta_q\gamma_A\sigma_A).
\]
Finally, for the filtration indexed by $\ell$ and $m$, $\mathcal F_{N,\ell,m,2}$, the martingale difference is $Z_{\ell m}(\eta_{\ell m}(e_{\ell m}^2-\sigma^2_E)-2\lambda_{\ell m}e_{\ell m})$ where
\[
\eta_{\ell m}=\frac{w_1}{\sqrt{\sum_j N_{\bullet j}^2}}(1-N_{\ell\bullet}^{-1})+\frac{w_2}{\sqrt{\sum_i N_{i\bullet}^2}}(1-N_{\bullet m}^{-1})+\frac{w_3}{N\sqrt{\sum_i N_{i\bullet}^2+\sum_j N_{\bullet j}^2}}(N-Z_{\ell m}),
\]
\[
\lambda_{\ell m}=\frac{w_1}{\sqrt{\sum_j N_{\bullet j}^2}}N_{\ell\bullet}^{-1}\sum_{j:j<m}Z_{\ell j}e_{\ell j}-\frac{w_1}{\sqrt{\sum_j N_{\bullet j}^2}}N_{\ell\bullet}^{-1}\sum_{j:j\ne m}Z_{\ell j}(b_m-b_j)
+\frac{w_2}{\sqrt{\sum_i N_{i\bullet}^2}}N_{\bullet m}^{-1}\sum_{i:i<\ell}Z_{im}e_{im}-\frac{w_2}{\sqrt{\sum_i N_{i\bullet}^2}}N_{\bullet m}^{-1}\sum_{i:i\ne\ell}Z_{im}(a_\ell-a_i)
\]
\[
+\frac{w_3}{N\sqrt{\sum_i N_{i\bullet}^2+\sum_j N_{\bullet j}^2}}\sum_{\substack{ij:\,i<\ell\ \text{or}\\ i=\ell,\,j<m}}Z_{ij}e_{ij}
-\frac{w_3}{N\sqrt{\sum_i N_{i\bullet}^2+\sum_j N_{\bullet j}^2}}\sum_{ij:ij\ne\ell m}Z_{ij}(a_\ell-a_i+b_m-b_j),
\]
and
\[
E\bigl(Z_{\ell m}(\eta_{\ell m}(e_{\ell m}^2-\sigma^2_E)-2\lambda_{\ell m}e_{\ell m})^2\mid\mathcal F_{N,\ell,m-1,2}\bigr)=Z_{\ell m}\bigl(\eta_{\ell m}^2\sigma^4_E(\kappa_E+2)+4\lambda_{\ell m}\sigma^2_E(\lambda_{\ell m}-\eta_{\ell m}\gamma_E\sigma_E)\bigr).
\]
For the proof, we merely need to check two conditions. First, the Lindeberg condition, which requires that for all $\delta>0$ the following has limit $0$:
\[
\sum_k E\bigl((\eta_k(b_k^2-\sigma^2_B)-2\lambda_kb_k)^2I\{|\eta_k(b_k^2-\sigma^2_B)-2\lambda_kb_k|\ge\delta\}\mid\mathcal F_{N,k-1,0}\bigr)
+\sum_q E\bigl((\eta_q(a_q^2-\sigma^2_A)-2\lambda_qa_q)^2I\{|\eta_q(a_q^2-\sigma^2_A)-2\lambda_qa_q|\ge\delta\}\mid\mathcal F_{N,q-1,1}\bigr)
\]
\[
+\sum_{\ell m}Z_{\ell m}E\bigl((\eta_{\ell m}(e_{\ell m}^2-\sigma^2_E)-2\lambda_{\ell m}e_{\ell m})^2I\{|\eta_{\ell m}(e_{\ell m}^2-\sigma^2_E)-2\lambda_{\ell m}e_{\ell m}|\ge\delta\}\mid\mathcal F_{N,\ell,m-1,2}\bigr).
\]
Second, the condition that the sum of the conditional variances of the martingale differences, shown below, converges in probability to a constant:
\[
\sigma^4_B(\kappa_B+2)\sum_k\eta_k^2+4\sigma^2_B\sum_k\lambda_k^2-4\gamma_B\sigma^3_B\sum_k\lambda_k\eta_k
+\sigma^4_A(\kappa_A+2)\sum_q\eta_q^2+4\sigma^2_A\sum_q\lambda_q^2-4\gamma_A\sigma^3_A\sum_q\lambda_q\eta_q
\]
\[
+\sigma^4_E(\kappa_E+2)\sum_{\ell m}Z_{\ell m}\eta_{\ell m}^2+4\sigma^2_E\sum_{\ell m}Z_{\ell m}\lambda_{\ell m}^2-4\gamma_E\sigma^3_E\sum_{\ell m}Z_{\ell m}\lambda_{\ell m}\eta_{\ell m}.
\]
A.3.4.1 Lindeberg condition

Here, we assume that $|a_i|\le c_A$, $|b_j|\le c_B$, and $|e_{ij}|\le c_E$. We show that $|\eta_k(b_k^2-\sigma^2_B)-2\lambda_kb_k|$, $|\eta_q(a_q^2-\sigma^2_A)-2\lambda_qa_q|$, and $|\eta_{\ell m}(e_{\ell m}^2-\sigma^2_E)-2\lambda_{\ell m}e_{\ell m}|$ are each bounded above by a quantity that converges to $0$. Then, for any $\delta>0$ and large enough $N$, the indicators $I\{|\eta_k(b_k^2-\sigma^2_B)-2\lambda_kb_k|\ge\delta\}$, $I\{|\eta_q(a_q^2-\sigma^2_A)-2\lambda_qa_q|\ge\delta\}$, and $I\{|\eta_{\ell m}(e_{\ell m}^2-\sigma^2_E)-2\lambda_{\ell m}e_{\ell m}|\ge\delta\}$ are always zero. Therefore, the Lindeberg condition is satisfied.

Let
\[
\frac{\max_i N_{i\bullet}+\max_j N_{\bullet j}}{\min\Bigl(\sqrt{\sum_i N_{i\bullet}^2},\sqrt{\sum_j N_{\bullet j}^2}\Bigr)}=\xi\to0.
\]
Then,
\[
|\eta_k|\le\frac{|w_1|}{\sqrt{\sum_j N_{\bullet j}^2}}N_{\bullet k}+\frac{|w_3|}{\sqrt{\sum_i N_{i\bullet}^2+\sum_j N_{\bullet j}^2}}N_{\bullet k}\le\xi(|w_1|+|w_3|),
\]
\[
|\lambda_k|\le\frac{|w_1|}{\sqrt{\sum_j N_{\bullet j}^2}}\sum_i N_{i\bullet}^{-1}Z_{ik}\sum_{j:j<k}Z_{ij}c_B+\frac{|w_3|}{N\sqrt{\sum_i N_{i\bullet}^2+\sum_j N_{\bullet j}^2}}\sum_{j:j<k}N_{\bullet k}N_{\bullet j}c_B
\le\frac{|w_1|}{\sqrt{\sum_j N_{\bullet j}^2}}c_BN_{\bullet k}+\frac{|w_3|}{\sqrt{\sum_i N_{i\bullet}^2+\sum_j N_{\bullet j}^2}}c_BN_{\bullet k}\le\xi c_B(|w_1|+|w_3|),\quad\text{and}
\]
\[
|\eta_k(b_k^2-\sigma^2_B)-2\lambda_kb_k|\le|\eta_k||b_k^2-\sigma^2_B|+2|\lambda_k||b_k|\le|\eta_k|(\sigma^2_B+c_B^2)+2c_B|\lambda_k|\le\xi(|w_1|+|w_3|)(\sigma^2_B+3c_B^2)\to0.
\]
By symmetry, $|\eta_q(a_q^2-\sigma^2_A)-2\lambda_qa_q|\le\xi(|w_2|+|w_3|)(\sigma^2_A+3c_A^2)\to0$. Similarly,
\[
|\eta_{\ell m}|\le\frac{|w_1|}{\sqrt{\sum_j N_{\bullet j}^2}}+\frac{|w_2|}{\sqrt{\sum_i N_{i\bullet}^2}}+\frac{|w_3|}{\sqrt{\sum_i N_{i\bullet}^2+\sum_j N_{\bullet j}^2}}\le\xi(|w_1|+|w_2|+|w_3|),
\]
\[
|\lambda_{\ell m}|\le\frac{|w_1|}{\sqrt{\sum_j N_{\bullet j}^2}}N_{\ell\bullet}^{-1}c_E\sum_{j:j<m}Z_{\ell j}+\frac{|w_1|}{\sqrt{\sum_j N_{\bullet j}^2}}N_{\ell\bullet}^{-1}2c_B\sum_{j:j\ne m}Z_{\ell j}
+\frac{|w_2|}{\sqrt{\sum_i N_{i\bullet}^2}}N_{\bullet m}^{-1}c_E\sum_{i:i<\ell}Z_{im}+\frac{|w_2|}{\sqrt{\sum_i N_{i\bullet}^2}}N_{\bullet m}^{-1}2c_A\sum_{i:i\ne\ell}Z_{im}
\]
\[
+\frac{|w_3|}{N\sqrt{\sum_i N_{i\bullet}^2+\sum_j N_{\bullet j}^2}}c_E\sum_{\substack{ij:\,i<\ell\ \text{or}\\ i=\ell,\,j<m}}Z_{ij}
+\frac{|w_3|}{N\sqrt{\sum_i N_{i\bullet}^2+\sum_j N_{\bullet j}^2}}2(c_A+c_B)\sum_{ij:ij\ne\ell m}Z_{ij}
\]
\[
\le\frac{|w_1|}{\sqrt{\sum_j N_{\bullet j}^2}}(c_E+2c_B)+\frac{|w_2|}{\sqrt{\sum_i N_{i\bullet}^2}}(c_E+2c_A)+\frac{|w_3|}{\sqrt{\sum_i N_{i\bullet}^2+\sum_j N_{\bullet j}^2}}\bigl(c_E+2(c_A+c_B)\bigr)
\le\xi c_E(|w_1|+|w_2|+|w_3|)+2\xi\bigl(|w_1|c_B+|w_2|c_A+|w_3|(c_A+c_B)\bigr),\quad\text{and}
\]
\[
|\eta_{\ell m}(e_{\ell m}^2-\sigma^2_E)-2\lambda_{\ell m}e_{\ell m}|\le|\eta_{\ell m}|(\sigma^2_E+c_E^2)+2c_E|\lambda_{\ell m}|
\le\xi(|w_1|+|w_2|+|w_3|)(\sigma^2_E+3c_E^2)+4\xi c_E\bigl(|w_1|c_B+|w_2|c_A+|w_3|(c_A+c_B)\bigr)\to0.
\]
We now have the desired results.
A.3.4.2 Sum of conditional variances

To show that the sum of the conditional variances of the martingale differences converges in probability to a constant, we take the following steps.

1. Show that $\sum_k\eta_k^2$, $\sum_q\eta_q^2$, and $\sum_{\ell m}Z_{\ell m}\eta_{\ell m}^2$ converge in probability to constants.

2. Show that $\sum_k\lambda_k^2$, $\sum_q\lambda_q^2$, and $\sum_{\ell m}Z_{\ell m}\lambda_{\ell m}^2$ converge in probability to $0$.

3. Show that $\sum_k\eta_k\lambda_k$, $\sum_q\eta_q\lambda_q$, and $\sum_{\ell m}\eta_{\ell m}\lambda_{\ell m}$ converge in probability to $0$.

Notice that step 3 actually follows from the first two steps, by the Cauchy–Schwarz inequality. We explain this explicitly for $\sum_k\eta_k\lambda_k$. Letting $\sum_k\eta_k^2\xrightarrow{p}c$,
\[
\Bigl|\sum_k\eta_k\lambda_k\Bigr|\le\Bigl(\sum_k\eta_k^2\Bigr)^{1/2}\Bigl(\sum_k\lambda_k^2\Bigr)^{1/2}\xrightarrow{p}\sqrt c\cdot0=0.
\]
Then, $\sum_k\eta_k\lambda_k$ must converge in probability to $0$. Therefore, we need to show steps 1 and 2.
\[
\sum_k\eta_k^2=\frac{w_1^2}{\sum_j N_{\bullet j}^2}\sum_k\Bigl(N_{\bullet k}-\sum_i N_{i\bullet}^{-1}Z_{ik}\Bigr)^2+\frac{w_3^2}{N^2\bigl(\sum_i N_{i\bullet}^2+\sum_j N_{\bullet j}^2\bigr)}\sum_k\bigl(NN_{\bullet k}-N_{\bullet k}^2\bigr)^2
+\frac{2w_1w_3}{N\sqrt{\sum_j N_{\bullet j}^2}\sqrt{\sum_i N_{i\bullet}^2+\sum_j N_{\bullet j}^2}}\sum_k\Bigl(N_{\bullet k}-\sum_i N_{i\bullet}^{-1}Z_{ik}\Bigr)\bigl(NN_{\bullet k}-N_{\bullet k}^2\bigr)
\]
\[
=\frac{w_1^2}{\sum_j N_{\bullet j}^2}\Bigl(\sum_k N_{\bullet k}^2-2\sum_{ik}N_{\bullet k}N_{i\bullet}^{-1}Z_{ik}+\sum_{kir}Z_{ik}Z_{rk}N_{i\bullet}^{-1}N_{r\bullet}^{-1}\Bigr)
+\frac{w_3^2}{N^2\bigl(\sum_i N_{i\bullet}^2+\sum_j N_{\bullet j}^2\bigr)}\Bigl(N^2\sum_k N_{\bullet k}^2-2N\sum_k N_{\bullet k}^3+\sum_k N_{\bullet k}^4\Bigr)
\]
\[
+\frac{2w_1w_3}{N\sqrt{\sum_j N_{\bullet j}^2}\sqrt{\sum_i N_{i\bullet}^2+\sum_j N_{\bullet j}^2}}\Bigl(N\sum_k N_{\bullet k}^2-\sum_k N_{\bullet k}^3-N\sum_{ik}Z_{ik}N_{i\bullet}^{-1}N_{\bullet k}+\sum_{ik}Z_{ik}N_{i\bullet}^{-1}N_{\bullet k}^2\Bigr)
\]
\[
=w_1^2(1+O(\varepsilon))+w_3^2\frac{\sum_j N_{\bullet j}^2}{\sum_i N_{i\bullet}^2+\sum_j N_{\bullet j}^2}(1+O(\varepsilon))
+2w_1w_3\sqrt{\frac{\sum_j N_{\bullet j}^2}{\sum_i N_{i\bullet}^2+\sum_j N_{\bullet j}^2}}(1+O(\varepsilon))
\]
\[
\to w_1^2+w_3^2\frac{\sum_j N_{\bullet j}^2}{\sum_i N_{i\bullet}^2+\sum_j N_{\bullet j}^2}+2w_1w_3\sqrt{\frac{\sum_j N_{\bullet j}^2}{\sum_i N_{i\bullet}^2+\sum_j N_{\bullet j}^2}}.
\]
The fact that $\sum_q\eta_q^2$ converges in probability to a constant follows by symmetry from the result for $\sum_k\eta_k^2$.
\[
\sum_{\ell m}Z_{\ell m}\eta_{\ell m}^2=\frac{w_1^2}{\sum_j N_{\bullet j}^2}\sum_{\ell m}Z_{\ell m}(1-N_{\ell\bullet}^{-1})^2+\frac{w_2^2}{\sum_i N_{i\bullet}^2}\sum_{\ell m}Z_{\ell m}(1-N_{\bullet m}^{-1})^2
+\frac{w_3^2}{N^2\bigl(\sum_i N_{i\bullet}^2+\sum_j N_{\bullet j}^2\bigr)}\sum_{\ell m}Z_{\ell m}(N-Z_{\ell m})^2
\]
\[
+\frac{2w_1w_2}{\sqrt{\sum_i N_{i\bullet}^2}\sqrt{\sum_j N_{\bullet j}^2}}\sum_{\ell m}Z_{\ell m}(1-N_{\ell\bullet}^{-1})(1-N_{\bullet m}^{-1})
+\frac{2w_1w_3}{N\sqrt{\sum_j N_{\bullet j}^2}\sqrt{\sum_i N_{i\bullet}^2+\sum_j N_{\bullet j}^2}}\sum_{\ell m}Z_{\ell m}(1-N_{\ell\bullet}^{-1})(N-Z_{\ell m})
\]
\[
+\frac{2w_2w_3}{N\sqrt{\sum_i N_{i\bullet}^2}\sqrt{\sum_i N_{i\bullet}^2+\sum_j N_{\bullet j}^2}}\sum_{\ell m}Z_{\ell m}(1-N_{\bullet m}^{-1})(N-Z_{\ell m})
\]
\[
\le w_1^2O(\varepsilon)+w_2^2O(\varepsilon)+w_3^2O(\varepsilon)+|w_1w_2|O(\varepsilon^2)+|w_1w_3|O(\varepsilon^2)+|w_2w_3|O(\varepsilon^2)\to0.
\]
Step 1 is concluded.
Now consider $\sum_k\lambda_k^2$. It is a quadratic form in the $b_j$. The coefficient of $b_j$ in $\lambda_k$ is
\[
M_{jk}=I\{k>j\}\Bigl(\frac{w_1}{\sqrt{\sum_j N_{\bullet j}^2}}\sum_i N_{i\bullet}^{-1}Z_{ik}Z_{ij}+\frac{w_3}{N\sqrt{\sum_i N_{i\bullet}^2+\sum_j N_{\bullet j}^2}}N_{\bullet k}N_{\bullet j}\Bigr)
\]
and so we can write
\[
\lambda_k=\begin{pmatrix}M_{1k}&\dots&M_{Ck}\end{pmatrix}\begin{pmatrix}b_1\\\vdots\\b_C\end{pmatrix}.
\]
Then, $\sum_k\lambda_k^2$ is a quadratic form in $(b_1,\dots,b_C)^\mathsf{T}$, with middle matrix $M$ having $(j,s)$ entry $\sum_kM_{jk}M_{sk}$. Now, we apply formulas for the expectation and variance of quadratic forms, given in Khoshnevisan (2014) [27]:
\[
E\Bigl(\sum_k\lambda_k^2\Bigr)=\sigma^2_B\operatorname{Tr}(M)=\sigma^2_B\sum_{jk}M_{jk}^2,\quad\text{and}
\]
\[
M_{jk}^2=I\{k>j\}\Bigl[\frac{w_1^2}{\sum_j N_{\bullet j}^2}\sum_{ir}N_{i\bullet}^{-1}N_{r\bullet}^{-1}Z_{ik}Z_{rk}Z_{ij}Z_{rj}+\frac{w_3^2}{N^2\bigl(\sum_i N_{i\bullet}^2+\sum_j N_{\bullet j}^2\bigr)}N_{\bullet k}^2N_{\bullet j}^2
+\frac{2w_1w_3}{N\sqrt{\sum_j N_{\bullet j}^2}\sqrt{\sum_i N_{i\bullet}^2+\sum_j N_{\bullet j}^2}}N_{\bullet j}\sum_i Z_{ik}N_{i\bullet}^{-1}N_{\bullet k}Z_{ij}\Bigr].
\]
Thus, $\sum_{jk}M_{jk}^2$ has three terms.
Term 1:
\[
\frac{w_1^2}{\sum_j N_{\bullet j}^2}\sum_{jk}I\{k>j\}\sum_{ir}N_{i\bullet}^{-1}N_{r\bullet}^{-1}Z_{ik}Z_{rk}Z_{ij}Z_{rj}
\le\frac{w_1^2}{\sum_j N_{\bullet j}^2}\sum_j\sum_{kir}Z_{ik}Z_{rk}N_{i\bullet}^{-1}N_{r\bullet}^{-1}Z_{ij}Z_{rj}
\le\frac{w_1^2}{\sum_j N_{\bullet j}^2}\sum_{kir}Z_{ik}Z_{rk}N_{i\bullet}^{-1}
=\frac{w_1^2}{\sum_j N_{\bullet j}^2}\sum_{ik}Z_{ik}N_{i\bullet}^{-1}N_{\bullet k}=w_1^2O(\varepsilon)\to0.
\]
Term 2:
\[
\frac{w_3^2}{N^2\bigl(\sum_i N_{i\bullet}^2+\sum_j N_{\bullet j}^2\bigr)}\sum_{jk}I\{k>j\}N_{\bullet k}^2N_{\bullet j}^2
\le w_3^2\frac{\sum_j N_{\bullet j}^2}{N^2}\cdot\frac{\sum_j N_{\bullet j}^2}{\sum_i N_{i\bullet}^2+\sum_j N_{\bullet j}^2}=w_3^2O(\varepsilon)\to0.
\]
Term 3: Taking absolute values,
\[
\frac{2|w_1w_3|}{N\sqrt{\sum_j N_{\bullet j}^2}\sqrt{\sum_i N_{i\bullet}^2+\sum_j N_{\bullet j}^2}}\sum_{jk}I\{k>j\}N_{\bullet j}N_{\bullet k}\sum_i Z_{ik}N_{i\bullet}^{-1}Z_{ij}
\le\frac{2|w_1w_3|}{N\sqrt{\sum_j N_{\bullet j}^2}\sqrt{\sum_i N_{i\bullet}^2+\sum_j N_{\bullet j}^2}}\sum_{ik}Z_{ik}N_{i\bullet}^{-1}N_{\bullet k}\sum_j N_{\bullet j}
\le|w_1w_3|O(\varepsilon)\to0.
\]
Thus, $E(\sum_k\lambda_k^2)\to0$.
\[
\mathrm{Var}\Bigl(\sum_k\lambda_k^2\Bigr)=(\mu_{4,B}-3\sigma^4_B)\sum_j\Bigl(\sum_kM_{jk}^2\Bigr)^2+(\sigma^4_B-1)\Bigl(\sum_{jk}M_{jk}^2\Bigr)^2+2\sigma^4_B\sum_{js}\Bigl(\sum_kM_{jk}M_{sk}\Bigr)^2.
\]
We have proved that $\sum_{jk}M_{jk}^2=O(\varepsilon)$, so the second term in the above expression is $O(\varepsilon^2)\to0$. Since $\sum_j(\sum_kM_{jk}^2)^2\le(\sum_{jk}M_{jk}^2)^2=O(\varepsilon^2)\to0$, the first term also converges to $0$. It remains to consider the third term, which we again upper bound.
By ignoring the indicator variables, we have
\[
\sum_k|M_{jk}M_{sk}|\le\frac{w_3^2}{N^2\bigl(\sum_i N_{i\bullet}^2+\sum_j N_{\bullet j}^2\bigr)}N_{\bullet j}N_{\bullet s}\sum_kN_{\bullet k}^2
+\frac{|w_1w_3|}{N\sqrt{\sum_j N_{\bullet j}^2}\sqrt{\sum_i N_{i\bullet}^2+\sum_j N_{\bullet j}^2}}N_{\bullet j}\sum_{ik}Z_{ik}N_{i\bullet}^{-1}N_{\bullet k}Z_{is}
\]
\[
+\frac{|w_1w_3|}{N\sqrt{\sum_j N_{\bullet j}^2}\sqrt{\sum_i N_{i\bullet}^2+\sum_j N_{\bullet j}^2}}N_{\bullet s}\sum_{ik}Z_{ik}N_{i\bullet}^{-1}N_{\bullet k}Z_{ij}
+\frac{w_1^2}{\sum_j N_{\bullet j}^2}\sum_{kir}Z_{ik}Z_{rk}N_{i\bullet}^{-1}N_{r\bullet}^{-1}Z_{ij}Z_{rs}.
\]
By squaring this and summing over $j$ and $s$, we get an upper bound on $\sum_{js}(\sum_kM_{jk}M_{sk})^2$. There are 10 terms in that sum, which are as follows.
Term 1:
\[
\frac{w_3^4}{N^4\bigl(\sum_i N_{i\bullet}^2+\sum_j N_{\bullet j}^2\bigr)^2}\sum_{js}N_{\bullet j}^2N_{\bullet s}^2\Bigl(\sum_kN_{\bullet k}^2\Bigr)^2
=w_3^4\Bigl(\frac{\sum_j N_{\bullet j}^2}{N^2}\Bigr)^2\frac{\bigl(\sum_j N_{\bullet j}^2\bigr)^2}{\bigl(\sum_i N_{i\bullet}^2+\sum_j N_{\bullet j}^2\bigr)^2}=w_3^4O(\varepsilon^2)\to0.
\]
Term 2:
\[
\frac{w_1^2w_3^2}{N^2\sum_j N_{\bullet j}^2\bigl(\sum_i N_{i\bullet}^2+\sum_j N_{\bullet j}^2\bigr)}\sum_{js}N_{\bullet j}^2\Bigl(\sum_{ik}Z_{ik}N_{i\bullet}^{-1}N_{\bullet k}Z_{is}\Bigr)^2
=\frac{w_1^2w_3^2}{N^2\bigl(\sum_i N_{i\bullet}^2+\sum_j N_{\bullet j}^2\bigr)}\sum_s\sum_{iki'k'}Z_{ik}Z_{i'k'}N_{i\bullet}^{-1}N_{i'\bullet}^{-1}N_{\bullet k}N_{\bullet k'}Z_{is}Z_{i's}
\]
\[
\le\frac{w_1^2w_3^2}{N^2\bigl(\sum_i N_{i\bullet}^2+\sum_j N_{\bullet j}^2\bigr)}\sum_{iki'k'}Z_{ik}Z_{i'k'}N_{i\bullet}^{-1}N_{\bullet k}N_{\bullet k'}
=w_1^2w_3^2\frac{\sum_kN_{\bullet k}^2}{N^2}\cdot\frac{\sum_{ik}Z_{ik}N_{i\bullet}^{-1}N_{\bullet k}}{\sum_i N_{i\bullet}^2+\sum_j N_{\bullet j}^2}=w_1^2w_3^2O(\varepsilon)\to0.
\]
Term 3 is the same as term 2 by symmetry.
Term 4:
\[
\frac{w_1^4}{\bigl(\sum_j N_{\bullet j}^2\bigr)^2}\sum_{js}\sum_{kk'ii'rr'}Z_{ik}Z_{rk}N_{i\bullet}^{-1}N_{r\bullet}^{-1}Z_{ij}Z_{rs}Z_{i'k'}Z_{r'k'}N_{i'\bullet}^{-1}N_{r'\bullet}^{-1}Z_{i'j}Z_{r's}
\le\frac{w_1^4}{\bigl(\sum_j N_{\bullet j}^2\bigr)^2}\sum_{kk'ii'rr'}Z_{ik}Z_{rk}N_{i\bullet}^{-1}N_{r\bullet}^{-1}Z_{i'k'}Z_{r'k'}
\]
\[
=\frac{w_1^4}{\bigl(\sum_j N_{\bullet j}^2\bigr)^2}\sum_{ii'rr'}(ZZ^\mathsf{T})_{ir}(ZZ^\mathsf{T})_{i'r'}N_{i\bullet}^{-1}N_{r\bullet}^{-1}
=\frac{w_1^4}{\sum_j N_{\bullet j}^2}\sum_{ir}(ZZ^\mathsf{T})_{ir}N_{i\bullet}^{-1}N_{r\bullet}^{-1}=w_1^4O(\varepsilon)\to0.
\]
Term 5: Taking absolute values,
\[
\frac{|w_1w_3^3|}{N^3\bigl(\sum_i N_{i\bullet}^2+\sum_j N_{\bullet j}^2\bigr)\sqrt{\sum_j N_{\bullet j}^2}\sqrt{\sum_i N_{i\bullet}^2+\sum_j N_{\bullet j}^2}}\sum_{js}N_{\bullet j}^2N_{\bullet s}\sum_kN_{\bullet k}^2\sum_{ik}Z_{ik}N_{i\bullet}^{-1}N_{\bullet k}Z_{is}
\]
\[
\le|w_1w_3^3|\cdot\frac{\sum_j N_{\bullet j}^2}{N^2}\cdot\frac{\sum_kN_{\bullet k}^2}{\sum_i N_{i\bullet}^2+\sum_j N_{\bullet j}^2}\cdot\frac{\sum_sN_{\bullet s}}{N}\cdot\frac{\sum_{ik}Z_{ik}N_{i\bullet}^{-1}N_{\bullet k}}{\sqrt{\sum_j N_{\bullet j}^2}\sqrt{\sum_i N_{i\bullet}^2+\sum_j N_{\bullet j}^2}}=|w_1w_3^3|O(\varepsilon^2)\to0.
\]
Term 6 is the same as term 5 by symmetry.
Term 7:
\[
\frac{w_1^2w_3^2}{N^2\bigl(\sum_i N_{i\bullet}^2+\sum_j N_{\bullet j}^2\bigr)}\sum_{js}N_{\bullet j}N_{\bullet s}\sum_{kir}Z_{ik}Z_{rk}N_{i\bullet}^{-1}N_{r\bullet}^{-1}Z_{ij}Z_{rs}
\le\frac{w_1^2w_3^2}{N^2\bigl(\sum_i N_{i\bullet}^2+\sum_j N_{\bullet j}^2\bigr)}\sum_{js}N_{\bullet j}N_{\bullet s}\sum_{ir}N_{i\bullet}^{-1}Z_{ij}Z_{rs}
\]
\[
=w_1^2w_3^2\frac{\sum_j N_{\bullet j}^2}{\sum_i N_{i\bullet}^2+\sum_j N_{\bullet j}^2}\cdot\frac{\sum_{ij}Z_{ij}N_{i\bullet}^{-1}N_{\bullet j}}{N^2}=w_1^2w_3^2O(\varepsilon^2)\to0.
\]
Term 8:
\[
\frac{w_1^2w_3^2}{N^2\sum_j N_{\bullet j}^2\bigl(\sum_i N_{i\bullet}^2+\sum_j N_{\bullet j}^2\bigr)}\sum_{js}N_{\bullet j}N_{\bullet s}\sum_{ii'kk'}Z_{ik}Z_{i'k'}N_{i\bullet}^{-1}N_{\bullet k}N_{i'\bullet}^{-1}N_{\bullet k'}Z_{is}Z_{i'j}
\le\frac{w_1^2w_3^2}{\sum_j N_{\bullet j}^2\bigl(\sum_i N_{i\bullet}^2+\sum_j N_{\bullet j}^2\bigr)}\sum_{ii'kk'}Z_{ik}Z_{i'k'}N_{i\bullet}^{-1}N_{\bullet k}N_{i'\bullet}^{-1}N_{\bullet k'}=w_1^2w_3^2O(\varepsilon^2)\to0.
\]
Term 9: Taking absolute values,
\[
\frac{|w_1^3w_3|}{N\sum_j N_{\bullet j}^2\sqrt{\sum_j N_{\bullet j}^2}\sqrt{\sum_i N_{i\bullet}^2+\sum_j N_{\bullet j}^2}}\sum_{js}\sum_{ii'kk'r}N_{\bullet j}Z_{ik}N_{i\bullet}^{-1}N_{\bullet k}Z_{is}Z_{i'k'}Z_{rk'}N_{i'\bullet}^{-1}N_{r\bullet}^{-1}Z_{i'j}Z_{rs}
\]
\[
\le\frac{|w_1^3w_3|}{N\sum_j N_{\bullet j}^2\sqrt{\sum_j N_{\bullet j}^2}\sqrt{\sum_i N_{i\bullet}^2+\sum_j N_{\bullet j}^2}}\sum_j\sum_{ii'kk'r}N_{\bullet j}Z_{ik}N_{i\bullet}^{-1}N_{\bullet k}Z_{i'k'}Z_{rk'}N_{i'\bullet}^{-1}Z_{i'j}
\]
\[
\le\frac{|w_1^3w_3|}{\sum_j N_{\bullet j}^2\sqrt{\sum_j N_{\bullet j}^2}\sqrt{\sum_i N_{i\bullet}^2+\sum_j N_{\bullet j}^2}}\sum_{ii'kk'}Z_{ik}N_{i\bullet}^{-1}N_{\bullet k}Z_{i'k'}N_{i'\bullet}^{-1}N_{\bullet k'}=|w_1^3w_3|O(\varepsilon^2)\to0.
\]
Term 10 is the same as term 9 by symmetry.

Adding up the 10 terms, we see that the variance of $\sum_k\lambda_k^2$ approaches $0$, and $\sum_k\lambda_k^2$ converges in probability to $0$.
Consider $\sum_q\lambda_q^2$. Note that $\lambda_q$ is a linear combination of the $a_i$ and $b_j$. Therefore, $\sum_q\lambda_q^2$ is the sum of a quadratic form in $a$, a quadratic form in $b$, and a weighted inner product of $a$ and $b$. The quadratic form in $b$ and the weighted inner product of $a$ and $b$ are
\[
\sum_q\alpha_q^2=\frac{w_3^2}{N^2\bigl(\sum_i N_{i\bullet}^2+\sum_j N_{\bullet j}^2\bigr)}\sum_q\Bigl(\sum_j(NZ_{qj}-N_{q\bullet}N_{\bullet j})b_j\Bigr)^2\quad\text{and}
\]
\[
\sum_q\alpha_q\nu_q=\frac{w_2w_3}{N\sqrt{\sum_i N_{i\bullet}^2}\sqrt{\sum_i N_{i\bullet}^2+\sum_j N_{\bullet j}^2}}\sum_q\Bigl(\sum_j(N_{q\bullet}N_{\bullet j}-NZ_{qj})b_j\Bigr)\Bigl(\sum_j\sum_{i:i<q}Z_{ij}Z_{qj}N_{\bullet j}^{-1}a_i\Bigr)
+\frac{w_3^2}{N^2\bigl(\sum_i N_{i\bullet}^2+\sum_j N_{\bullet j}^2\bigr)}\sum_q\Bigl(\sum_j(N_{q\bullet}N_{\bullet j}-NZ_{qj})b_j\Bigr)\Bigl(N_{q\bullet}\sum_{i:i<q}N_{i\bullet}a_i\Bigr),
\]
while $\sum_q\nu_q^2$ is the quadratic form in $a$. The quadratic form in $a$ converges in probability to $0$ by the same argument that showed $\sum_k\lambda_k^2$ converges in probability to $0$.
More explicitly, $\alpha_q$ is a linear combination of the $b_j$, with coefficients
\[
M_{qj}=\frac{w_3}{N\sqrt{\sum_i N_{i\bullet}^2+\sum_j N_{\bullet j}^2}}(NZ_{qj}-N_{q\bullet}N_{\bullet j}).
\]
Then, $\sum_q\alpha_q^2$ is a quadratic form in the $b_j$, with inner matrix entry $(j,s)$ equal to $\sum_qM_{qj}M_{qs}$. We have
\[
E\Bigl(\sum_q\alpha_q^2\Bigr)=\sigma^2_B\sum_{jq}M_{qj}^2\quad\text{and}\quad
\mathrm{Var}\Bigl(\sum_q\alpha_q^2\Bigr)=(\mu_{B,4}-3\sigma^4_B)\sum_j\Bigl(\sum_qM_{qj}^2\Bigr)^2+(\sigma^4_B-1)\Bigl(\sum_{jq}M_{qj}^2\Bigr)^2+2\sigma^4_B\sum_{js}\Bigl(\sum_qM_{qj}M_{qs}\Bigr)^2.
\]
Then,
\[
\sum_{qj}M_{qj}^2=\frac{w_3^2}{N^2\bigl(\sum_i N_{i\bullet}^2+\sum_j N_{\bullet j}^2\bigr)}\Bigl(N^2\sum_{qj}Z_{qj}+\sum_{qj}N_{q\bullet}^2N_{\bullet j}^2-2N\sum_{qj}Z_{qj}N_{q\bullet}N_{\bullet j}\Bigr)\le w_3^2O(\varepsilon)\to0.
\]
Thus, the first two terms of $\mathrm{Var}(\sum_q\alpha_q^2)$ also converge to $0$, using the same reasoning as before. Next,
\[
\sum_qM_{qj}M_{qs}=\frac{w_3^2}{N^2\bigl(\sum_i N_{i\bullet}^2+\sum_j N_{\bullet j}^2\bigr)}\Bigl[N^2(Z^\mathsf{T}Z)_{js}+N_{\bullet j}N_{\bullet s}\sum_qN_{q\bullet}^2-NN_{\bullet s}\sum_qZ_{qj}N_{q\bullet}-NN_{\bullet j}\sum_qZ_{qs}N_{q\bullet}\Bigr].
\]
We square this and sum over $j$ and $s$. There are 10 terms in the sum.
Term 1:
\[
\frac{w_3^4}{\bigl(\sum_i N_{i\bullet}^2+\sum_j N_{\bullet j}^2\bigr)^2}\sum_{js}(Z^\mathsf{T}Z)_{js}^2\le\frac{w_3^4}{\bigl(\sum_i N_{i\bullet}^2+\sum_j N_{\bullet j}^2\bigr)^2}\sum_jN_{\bullet j}^2\,C=w_3^4O(\varepsilon)\to0.
\]
Term 2:
\[
\frac{w_3^4}{N^4\bigl(\sum_i N_{i\bullet}^2+\sum_j N_{\bullet j}^2\bigr)^2}\Bigl(\sum_qN_{q\bullet}^2\Bigr)^2\sum_jN_{\bullet j}^2\sum_sN_{\bullet s}^2=w_3^4O(\varepsilon)\to0.
\]
Term 3:
\[
\frac{w_3^4}{N^2\bigl(\sum_i N_{i\bullet}^2+\sum_j N_{\bullet j}^2\bigr)^2}\sum_sN_{\bullet s}^2\sum_j\Bigl(\sum_qZ_{qj}N_{q\bullet}\Bigr)^2=w_3^4O(\varepsilon)\to0.
\]
Term 4 is the same as term 3 by symmetry.
Term 5:
\[
\frac{w_3^4}{N^2\bigl(\sum_i N_{i\bullet}^2+\sum_j N_{\bullet j}^2\bigr)^2}\sum_qN_{q\bullet}^2\sum_{js}N_{\bullet j}N_{\bullet s}(Z^\mathsf{T}Z)_{js}=w_3^4O(\varepsilon)\to0.
\]
Term 6: Taking absolute values, we have
\[
\frac{w_3^4}{N\bigl(\sum_i N_{i\bullet}^2+\sum_j N_{\bullet j}^2\bigr)^2}\sum_{js}(Z^\mathsf{T}Z)_{js}N_{\bullet s}\sum_qZ_{qj}N_{q\bullet}=w_3^4O(\varepsilon)\to0.
\]
Term 7 is the same as term 6 by symmetry.
Term 8: Taking absolute values, we have
\[
\frac{w_3^4}{N^3\bigl(\sum_i N_{i\bullet}^2+\sum_j N_{\bullet j}^2\bigr)^2}\sum_qN_{q\bullet}^2\sum_sN_{\bullet s}^2\sum_{jq}Z_{qj}N_{q\bullet}N_{\bullet j}=w_3^4O(\varepsilon)\to0.
\]
Term 9 is the same as term 8 by symmetry.
Term 10:
\[
\frac{w_3^4}{N^2\bigl(\sum_i N_{i\bullet}^2+\sum_j N_{\bullet j}^2\bigr)^2}\sum_{jq}Z_{qj}N_{q\bullet}N_{\bullet j}\sum_{sq}Z_{qs}N_{q\bullet}N_{\bullet s}=w_3^4O(\varepsilon)\to0.
\]
Thus, $\sum_{js}(\sum_qM_{qj}M_{qs})^2$ is bounded above by something that converges to $0$, and $\mathrm{Var}(\sum_q\alpha_q^2)$ converges to $0$. Then, $\sum_q\alpha_q^2$ converges in probability to $0$.
Because $\sum_q\alpha_q^2$ and $\sum_q\nu_q^2$ converge in probability to $0$, again by the Cauchy–Schwarz inequality $\sum_q\alpha_q\nu_q$ converges in probability to $0$. Therefore, $\sum_q\lambda_q^2$ is the sum of three sequences that converge in probability to $0$ and thus converges in probability to $0$.

Finally, we consider $\sum_{\ell m}Z_{\ell m}\lambda_{\ell m}^2$. Notice that $\lambda_{\ell m}$ is a linear combination of the $a_i$, $b_j$, and $e_{ij}$. Therefore, $\sum_{\ell m}Z_{\ell m}\lambda_{\ell m}^2$ is the sum of quadratic forms in $a$, $b$, and $e$, and weighted inner products of them. Using the same reasoning as for $\sum_q\lambda_q^2$, it suffices to show that the three quadratic forms converge in probability to $0$.
The coefficient of $e_{ij}$ in $\lambda_{\ell m}$ is
\[
M_{ij,\ell m}=\frac{w_1}{\sqrt{\sum_j N_{\bullet j}^2}}N_{i\bullet}^{-1}Z_{ij}I\{i=\ell,j<m\}+\frac{w_2}{\sqrt{\sum_i N_{i\bullet}^2}}N_{\bullet j}^{-1}Z_{ij}I\{i<\ell,j=m\}
+\frac{w_3}{N\sqrt{\sum_i N_{i\bullet}^2+\sum_j N_{\bullet j}^2}}Z_{ij}I\{i<\ell\text{ or }i=\ell,j<m\}.
\]
Then, let $QF(e)$ be the quadratic form in $e$ in $\sum_{\ell m}Z_{\ell m}\lambda_{\ell m}^2$, with inner matrix entry $(ij,rs)$ equal to $\sum_{\ell m}Z_{\ell m}M_{ij,\ell m}M_{rs,\ell m}$. We have
\[
E(QF(e))=\sigma^2_E\sum_{\ell m,ij}Z_{\ell m}M_{ij,\ell m}^2\quad\text{and}
\]
\[
\mathrm{Var}(QF(e))=(\mu_{E,4}-3\sigma^4_E)\sum_{ij}\Bigl(\sum_{\ell m}Z_{\ell m}M_{ij,\ell m}^2\Bigr)^2+(\sigma^4_E-1)\Bigl(\sum_{ij,\ell m}Z_{\ell m}M_{ij,\ell m}^2\Bigr)^2
+2\sigma^4_E\sum_{ij,rs}\Bigl(\sum_{\ell m}Z_{\ell m}M_{ij,\ell m}M_{rs,\ell m}\Bigr)^2.
\]
Then,
\[
\sum_{\ell m,ij}Z_{\ell m}M_{ij,\ell m}^2\le\frac{w_1^2}{\sum_j N_{\bullet j}^2}\sum_{ijm}Z_{im}Z_{ij}N_{i\bullet}^{-2}+\frac{w_2^2}{\sum_i N_{i\bullet}^2}\sum_{ij\ell}Z_{\ell j}Z_{ij}N_{\bullet j}^{-2}
+\frac{w_3^2}{N^2\bigl(\sum_i N_{i\bullet}^2+\sum_j N_{\bullet j}^2\bigr)}\sum_{ij\ell m}Z_{ij}Z_{\ell m}
\]
\[
+\frac{2|w_1w_3|}{\sqrt{\sum_j N_{\bullet j}^2}\sqrt{\sum_i N_{i\bullet}^2+\sum_j N_{\bullet j}^2}}+\frac{2|w_2w_3|}{\sqrt{\sum_i N_{i\bullet}^2}\sqrt{\sum_i N_{i\bullet}^2+\sum_j N_{\bullet j}^2}}\to0.
\]
Thus, $E(QF(e))$ and the first two terms of $\mathrm{Var}(QF(e))$ converge to $0$. To bound the third term of $\mathrm{Var}(QF(e))$,
∑`m
Z`m|Mij,`mMrs,`m| ≤w2
1∑j N
2•j
N−1r• ZijZrsI{i = r}+w2
2∑iN
2i•
N−1•s ZijZrsI{j = s}
+|w1w2|√∑iN
2i•
∑j N
2•j
ZisZijZrsN−1i• N
−1•s + ZijZrs[
w23
N(∑iN
2i• +
∑j N
2•j)
+|w1w3|
N√∑
j N2•j
√∑iN
2i• +
∑j N
2•j
+|w2w3|
N√∑
iN2i•
√∑iN
2i• +
∑j N
2•j
].
We square the above and sum over ij and rs. There are 10 terms in the sum.
Term 1:
\[
\frac{w_1^4}{\bigl(\sum_j N_{\bullet j}^2\bigr)^2}\sum_{ij,rs} N_{r\bullet}^{-2} Z_{ij}Z_{rs}\, I\{i=r\}
= \frac{w_1^4}{\bigl(\sum_j N_{\bullet j}^2\bigr)^2}\sum_{ijs} N_{i\bullet}^{-2} Z_{ij}Z_{is}
= w_1^4\, O(\epsilon) \to 0.
\]
Term 2 is the same as term 1 by symmetry.
Term 3:
\[
\frac{w_1^2w_2^2}{\sum_i N_{i\bullet}^2 \sum_j N_{\bullet j}^2}\sum_{ij,rs} Z_{is}Z_{ij}Z_{rs}N_{i\bullet}^{-1}N_{\bullet s}^{-1}
= \frac{w_1^2w_2^2}{\sum_i N_{i\bullet}^2 \sum_j N_{\bullet j}^2}\sum_{is} Z_{is}
= w_1^2w_2^2\, O(\epsilon) \to 0.
\]
Term 4:
\[
\Bigl[\frac{w_3^2}{N\bigl(\sum_i N_{i\bullet}^2 + \sum_j N_{\bullet j}^2\bigr)}
+ \frac{|w_1w_3|}{N\sqrt{\sum_j N_{\bullet j}^2}\sqrt{\sum_i N_{i\bullet}^2 + \sum_j N_{\bullet j}^2}}
+ \frac{|w_2w_3|}{N\sqrt{\sum_i N_{i\bullet}^2}\sqrt{\sum_i N_{i\bullet}^2 + \sum_j N_{\bullet j}^2}}\Bigr]^2 N^2
= O(\epsilon) \to 0.
\]
Term 5:
\[
\frac{w_1^2w_2^2}{\sum_i N_{i\bullet}^2 \sum_j N_{\bullet j}^2}\sum_{ij,rs} N_{r\bullet}^{-1}N_{\bullet s}^{-1} Z_{ij}Z_{rs}\, I\{i=r,\, j=s\}
= \frac{w_1^2w_2^2}{\sum_i N_{i\bullet}^2 \sum_j N_{\bullet j}^2}\sum_{ij} N_{i\bullet}^{-1}N_{\bullet j}^{-1} Z_{ij}
= w_1^2w_2^2\, O(\epsilon) \to 0.
\]
Term 6:
\[
\frac{|w_1^3w_2|}{\sum_j N_{\bullet j}^2\sqrt{\sum_i N_{i\bullet}^2 \sum_j N_{\bullet j}^2}}\sum_{ijs} N_{i\bullet}^{-2} Z_{ij}Z_{is}N_{\bullet s}^{-1}
= \frac{|w_1^3w_2|}{\sum_j N_{\bullet j}^2\sqrt{\sum_i N_{i\bullet}^2 \sum_j N_{\bullet j}^2}}\sum_{is} Z_{is}N_{i\bullet}^{-1}N_{\bullet s}^{-1}
= |w_1^3w_2|\, O(\epsilon) \to 0.
\]
Term 8 is the same as term 6 by symmetry.
Term 7:
\[
\Bigl[\frac{w_3^2}{N\bigl(\sum_i N_{i\bullet}^2 + \sum_j N_{\bullet j}^2\bigr)}
+ \frac{|w_1w_3|}{N\sqrt{\sum_j N_{\bullet j}^2}\sqrt{\sum_i N_{i\bullet}^2 + \sum_j N_{\bullet j}^2}}
+ \frac{|w_2w_3|}{N\sqrt{\sum_i N_{i\bullet}^2}\sqrt{\sum_i N_{i\bullet}^2 + \sum_j N_{\bullet j}^2}}\Bigr]
\frac{w_1^2}{\sum_j N_{\bullet j}^2}\sum_{ijs} N_{i\bullet}^{-1} Z_{ij}Z_{is}
= O(\epsilon) \to 0.
\]
Term 9 is the same as term 7 by symmetry.
Term 10:
\[
\Bigl[\frac{w_3^2}{N\bigl(\sum_i N_{i\bullet}^2 + \sum_j N_{\bullet j}^2\bigr)}
+ \frac{|w_1w_3|}{N\sqrt{\sum_j N_{\bullet j}^2}\sqrt{\sum_i N_{i\bullet}^2 + \sum_j N_{\bullet j}^2}}
+ \frac{|w_2w_3|}{N\sqrt{\sum_i N_{i\bullet}^2}\sqrt{\sum_i N_{i\bullet}^2 + \sum_j N_{\bullet j}^2}}\Bigr]
\frac{|w_1w_2|}{\sqrt{\sum_i N_{i\bullet}^2 \sum_j N_{\bullet j}^2}}\sum_{ij,rs} Z_{ij}Z_{rs}Z_{is}N_{i\bullet}^{-1}N_{\bullet s}^{-1}
= O(\epsilon) \to 0.
\]
Thus, $QF(e)$ converges in probability to 0. The coefficient of $b_j$ in $\lambda_{\ell m}$ is
\[
M_{j,\ell m} = \frac{w_1}{\sqrt{\sum_j N_{\bullet j}^2}}\bigl(N_{\ell\bullet}^{-1} Z_{\ell j} - I\{j=m\}\bigr)
+ \frac{w_3}{N\sqrt{\sum_i N_{i\bullet}^2 + \sum_j N_{\bullet j}^2}}\bigl(N_{\bullet j} - N\, I\{j=m\}\bigr).
\]
Then, let $QF(b)$ be the quadratic form in $b$ in $\sum_{\ell m} Z_{\ell m}\lambda_{\ell m}^2$, with inner matrix having entry $(j, s)$ equal to $\sum_{\ell m} Z_{\ell m} M_{j,\ell m}M_{s,\ell m}$. Then
\[
\mathrm{E}(QF(b)) = \sigma_B^2 \sum_{\ell m,\, j} Z_{\ell m} M_{j,\ell m}^2 \quad\text{and}
\]
\[
\mathrm{Var}(QF(b)) = (\mu_{B,4} - 3\sigma_B^4)\sum_{j}\Bigl(\sum_{\ell m} Z_{\ell m} M_{j,\ell m}^2\Bigr)^2
+ (\sigma_B^4 - 1)\Bigl(\sum_{j,\ell m} Z_{\ell m} M_{j,\ell m}^2\Bigr)^2
+ 2\sigma_B^4 \sum_{j,s}\Bigl(\sum_{\ell m} Z_{\ell m} M_{j,\ell m}M_{s,\ell m}\Bigr)^2,
\]
where the moments are now those of the $b_j$, since $QF(b)$ is a quadratic form in $b$.
Then,
\[
\sum_{\ell m,\, j} Z_{\ell m} M_{j,\ell m}^2
= \frac{w_1^2}{\sum_j N_{\bullet j}^2}\sum_{\ell m,\, j} Z_{\ell m}\bigl(N_{\ell\bullet}^{-2}Z_{\ell j} + I\{j=m\} - 2 I\{j=m\} N_{\ell\bullet}^{-1}Z_{\ell j}\bigr)
\]
\[
\qquad + \frac{w_3^2}{N^2\bigl(\sum_i N_{i\bullet}^2 + \sum_j N_{\bullet j}^2\bigr)}\sum_{\ell m,\, j} Z_{\ell m}\bigl(N_{\bullet j}^2 + N^2 I\{j=m\} - 2N N_{\bullet j}\, I\{j=m\}\bigr)
\]
\[
\qquad + \frac{2w_1w_3}{N\sqrt{\sum_j N_{\bullet j}^2}\sqrt{\sum_i N_{i\bullet}^2 + \sum_j N_{\bullet j}^2}}\sum_{\ell m,\, j} Z_{\ell m}\bigl(N_{\ell\bullet}^{-1}N_{\bullet j}Z_{\ell j} - N N_{\ell\bullet}^{-1}Z_{\ell j}\, I\{j=m\} - N_{\bullet j}\, I\{j=m\} + N\, I\{j=m\}\bigr)
\]
\[
\le \frac{w_1^2}{\sum_j N_{\bullet j}^2}\Bigl(3\sum_{\ell m} Z_{\ell m}N_{\ell\bullet}^{-1} + \sum_{\ell m} Z_{\ell m}\Bigr)
+ \frac{w_3^2}{N^2\bigl(\sum_i N_{i\bullet}^2 + \sum_j N_{\bullet j}^2\bigr)}\Bigl(3N\sum_j N_{\bullet j}^2 + N^3\Bigr)
\]
\[
\qquad + \frac{2|w_1w_3|}{N\sqrt{\sum_j N_{\bullet j}^2}\sqrt{\sum_i N_{i\bullet}^2 + \sum_j N_{\bullet j}^2}}\Bigl(2\sum_j N_{\bullet j}^2 + NR + N^2\Bigr)
= w_1^2\, O(\epsilon) + w_3^2\, O(\epsilon) + 2|w_1w_3|\, O(\epsilon) \to 0.
\]
Thus, $\mathrm{E}(QF(b))$ and the first two terms of $\mathrm{Var}(QF(b))$ converge to 0. To bound the third term of $\mathrm{Var}(QF(b))$,
\[
\sum_{\ell m} Z_{\ell m} M_{j,\ell m}M_{s,\ell m}
= \frac{w_1^2}{\sum_j N_{\bullet j}^2}\sum_{\ell m} Z_{\ell m}\bigl(N_{\ell\bullet}^{-2}Z_{\ell j}Z_{\ell s} - N_{\ell\bullet}^{-1}Z_{\ell s}\, I\{j=m\} - N_{\ell\bullet}^{-1}Z_{\ell j}\, I\{s=m\} + I\{j=s=m\}\bigr)
\]
\[
\quad + \frac{w_1w_3}{N\sqrt{\sum_j N_{\bullet j}^2}\sqrt{\sum_i N_{i\bullet}^2 + \sum_j N_{\bullet j}^2}}\sum_{\ell m} Z_{\ell m}\bigl(N_{\ell\bullet}^{-1}N_{\bullet s}Z_{\ell j} - N N_{\ell\bullet}^{-1}Z_{\ell j}\, I\{s=m\} - N_{\bullet s}\, I\{j=m\} + N\, I\{j=s=m\}\bigr)
\]
\[
\quad + \frac{w_1w_3}{N\sqrt{\sum_j N_{\bullet j}^2}\sqrt{\sum_i N_{i\bullet}^2 + \sum_j N_{\bullet j}^2}}\sum_{\ell m} Z_{\ell m}\bigl(N_{\ell\bullet}^{-1}N_{\bullet j}Z_{\ell s} - N N_{\ell\bullet}^{-1}Z_{\ell s}\, I\{j=m\} - N_{\bullet j}\, I\{s=m\} + N\, I\{j=s=m\}\bigr)
\]
\[
\quad + \frac{w_3^2}{N^2\bigl(\sum_i N_{i\bullet}^2 + \sum_j N_{\bullet j}^2\bigr)}\sum_{\ell m} Z_{\ell m}\bigl(N_{\bullet j}N_{\bullet s} - N N_{\bullet j}\, I\{s=m\} - N N_{\bullet s}\, I\{j=m\} + N^2 I\{j=s=m\}\bigr)
\]
\[
= \frac{w_1^2}{\sum_j N_{\bullet j}^2}\Bigl(N_{\bullet j}\, I\{j=s\} - \sum_{\ell} N_{\ell\bullet}^{-1}Z_{\ell j}Z_{\ell s}\Bigr)
+ \frac{w_3^2}{N^2\bigl(\sum_i N_{i\bullet}^2 + \sum_j N_{\bullet j}^2\bigr)}\bigl(N^2 N_{\bullet j}\, I\{j=s\} - N N_{\bullet j}N_{\bullet s}\bigr)
\]
\[
\quad + \frac{w_1w_3}{N\sqrt{\sum_j N_{\bullet j}^2}\sqrt{\sum_i N_{i\bullet}^2 + \sum_j N_{\bullet j}^2}}\Bigl(N N_{\bullet j}\, I\{j=s\} - N\sum_{\ell} N_{\ell\bullet}^{-1}Z_{\ell j}Z_{\ell s}\Bigr)
+ \frac{w_1w_3}{N\sqrt{\sum_j N_{\bullet j}^2}\sqrt{\sum_i N_{i\bullet}^2 + \sum_j N_{\bullet j}^2}}\Bigl(N N_{\bullet s}\, I\{j=s\} - N\sum_{\ell} N_{\ell\bullet}^{-1}Z_{\ell j}Z_{\ell s}\Bigr).
\]
We square the above and sum over j and s; there are 10 terms in the sum.
Term 1:
\[
\frac{w_1^4}{\bigl(\sum_j N_{\bullet j}^2\bigr)^2}\sum_{js}\Bigl(N_{\bullet j}^2\, I\{j=s\} - 2N_{\bullet j}\, I\{j=s\}\sum_{\ell} N_{\ell\bullet}^{-1}Z_{\ell j}Z_{\ell s} + \sum_{i\ell} N_{i\bullet}^{-1}N_{\ell\bullet}^{-1}Z_{ij}Z_{is}Z_{\ell j}Z_{\ell s}\Bigr)
\]
\[
\le \frac{w_1^4}{\bigl(\sum_j N_{\bullet j}^2\bigr)^2}\Bigl(\sum_j N_{\bullet j}^2 - 2\sum_{\ell j} Z_{\ell j}N_{\ell\bullet}^{-1}N_{\bullet j} + R^2\Bigr)
= w_1^4\, O(\epsilon) \to 0.
\]
Term 2:
\[
\frac{w_3^4}{N^2\bigl(\sum_i N_{i\bullet}^2 + \sum_j N_{\bullet j}^2\bigr)^2}\sum_{js}\bigl(N^2 N_{\bullet j}^2\, I\{j=s\} + N_{\bullet j}^2 N_{\bullet s}^2 - 2N N_{\bullet j}^2 N_{\bullet s}\, I\{j=s\}\bigr)
\]
\[
= \frac{w_3^4}{N^2\bigl(\sum_i N_{i\bullet}^2 + \sum_j N_{\bullet j}^2\bigr)^2}\Bigl(N^2\sum_j N_{\bullet j}^2 + \sum_j N_{\bullet j}^2\sum_s N_{\bullet s}^2 - 2N\sum_j N_{\bullet j}^3\Bigr)
= w_3^4\, O(\epsilon) \to 0.
\]
Term 3:
\[
\frac{w_1^2w_3^2}{N^2\sum_j N_{\bullet j}^2\bigl(\sum_i N_{i\bullet}^2 + \sum_j N_{\bullet j}^2\bigr)}\sum_{js}\Bigl(N^2 N_{\bullet j}^2\, I\{j=s\} - 2N^2 N_{\bullet j}\, I\{j=s\}\sum_{\ell} N_{\ell\bullet}^{-1}Z_{\ell j}Z_{\ell s} + N^2\sum_{i\ell} N_{i\bullet}^{-1}N_{\ell\bullet}^{-1}Z_{ij}Z_{is}Z_{\ell j}Z_{\ell s}\Bigr)
\]
\[
\le \frac{4w_1^2w_3^2}{N^2\sum_j N_{\bullet j}^2\bigl(\sum_i N_{i\bullet}^2 + \sum_j N_{\bullet j}^2\bigr)}\Bigl(N^2\sum_j N_{\bullet j}^2 - 2N^2\sum_j N_{\bullet j}\sum_{\ell} N_{\ell\bullet}^{-1}Z_{\ell j} + N^2 R^2\Bigr)
= w_1^2w_3^2\, O(\epsilon) \to 0.
\]
Term 4 is the same as term 3 by symmetry.
Term 5:
\[
\frac{w_1^2w_3^2}{N\sum_j N_{\bullet j}^2\bigl(\sum_i N_{i\bullet}^2 + \sum_j N_{\bullet j}^2\bigr)}\sum_{js}\Bigl(N_{\bullet j}^2\, I\{j=s\} - N_{\bullet j}^2 N_{\bullet s}\, I\{j=s\} - N N_{\bullet j}\, I\{j=s\}\sum_{\ell} N_{\ell\bullet}^{-1}Z_{\ell j}Z_{\ell s} + N_{\bullet j}N_{\bullet s}\sum_{\ell} N_{\ell\bullet}^{-1}Z_{\ell j}Z_{\ell s}\Bigr)
\]
\[
= \frac{w_1^2w_3^2}{N\sum_j N_{\bullet j}^2\bigl(\sum_i N_{i\bullet}^2 + \sum_j N_{\bullet j}^2\bigr)}\Bigl(\sum_j N_{\bullet j}^2 - \sum_j N_{\bullet j}^3 - N\sum_{\ell j} N_{\ell\bullet}^{-1}N_{\bullet j}Z_{\ell j} + \sum_{\ell js} Z_{\ell j}N_{\ell\bullet}^{-1}N_{\bullet j}Z_{\ell s}N_{\bullet s}\Bigr)
= w_1^2w_3^2\, O(\epsilon) \to 0.
\]
Term 6:
\[
\frac{w_1^3w_3}{\sum_j N_{\bullet j}^2\sqrt{\sum_j N_{\bullet j}^2}\sqrt{\sum_i N_{i\bullet}^2 + \sum_j N_{\bullet j}^2}}\sum_{js}\Bigl(N_{\bullet j}^2\, I\{j=s\} - 2N_{\bullet j}\, I\{j=s\}\sum_{\ell} N_{\ell\bullet}^{-1}Z_{\ell j}Z_{\ell s} + \sum_{i\ell} N_{i\bullet}^{-1}N_{\ell\bullet}^{-1}Z_{ij}Z_{is}Z_{\ell j}Z_{\ell s}\Bigr)
\]
\[
\le \frac{|w_1^3w_3|}{\sum_j N_{\bullet j}^2\sqrt{\sum_j N_{\bullet j}^2}\sqrt{\sum_i N_{i\bullet}^2 + \sum_j N_{\bullet j}^2}}\Bigl(\sum_j N_{\bullet j}^2 - 2\sum_{\ell j} N_{\bullet j}N_{\ell\bullet}^{-1}Z_{\ell j} + R^2\Bigr)
= |w_1^3w_3|\, O(\epsilon) \to 0.
\]
Term 7 is the same as term 6 by symmetry.
Term 8:
\[
\frac{w_1w_3^3}{\sqrt{\sum_j N_{\bullet j}^2}\sqrt{\sum_i N_{i\bullet}^2 + \sum_j N_{\bullet j}^2}\, N\bigl(\sum_i N_{i\bullet}^2 + \sum_j N_{\bullet j}^2\bigr)}\sum_{js}\Bigl(N N_{\bullet j}^2\, I\{j=s\} - N_{\bullet j}^2 N_{\bullet s}\, I\{j=s\} - N N_{\bullet j}\, I\{j=s\}\sum_{\ell} N_{\ell\bullet}^{-1}Z_{\ell j}Z_{\ell s} + N_{\bullet j}N_{\bullet s}\sum_{\ell} N_{\ell\bullet}^{-1}Z_{\ell j}Z_{\ell s}\Bigr)
\]
\[
\le \frac{|w_1w_3^3|}{\sqrt{\sum_j N_{\bullet j}^2}\sqrt{\sum_i N_{i\bullet}^2 + \sum_j N_{\bullet j}^2}\, N\bigl(\sum_i N_{i\bullet}^2 + \sum_j N_{\bullet j}^2\bigr)}\Bigl(N\sum_j N_{\bullet j}^2 - \sum_j N_{\bullet j}^3 - N\sum_{\ell j} N_{\ell\bullet}^{-1}N_{\bullet j}Z_{\ell j} + \sum_{\ell js} Z_{\ell j}N_{\ell\bullet}^{-1}N_{\bullet j}Z_{\ell s}N_{\bullet s}\Bigr)
= |w_1w_3^3|\, O(\epsilon) \to 0.
\]
Term 9 is the same as term 8 by symmetry.
APPENDIX A. CROSSED RANDOM EFFECTS MODEL 110
Term 10:
w21w
23
N2∑j N
2•j(∑iN
2i• +
∑j N
2•j)
∑js
(N•jN•sI{j = s} −N2N•jI{j = s}∑`
N−1`• Z`jZ`s
−N2N•sI{j = s}∑`
N−1`• Z`jZ`s +N2∑i`
N−1i• N−1`• ZijZisZ`jZ`s)
≤ w21w
23
N2∑j N
2•j(∑iN
2i• +
∑j N
2•j)
(∑j
N2•j −N2
∑`j
Z`jN−1`• N•j +N2R2) = w2
1w23O(ε)→ 0.
Thus, $\mathrm{Var}(QF(b))$ converges to 0, and $QF(b)$ converges in probability to 0. By symmetry, the same result holds for the quadratic form in $a$ in $\sum_{\ell m} Z_{\ell m}\lambda_{\ell m}^2$. Thus $\sum_{\ell m} Z_{\ell m}\lambda_{\ell m}^2$ converges in probability to 0. The results follow.
Appendix B
Linear Mixed Model with Crossed Random Effects
B.1 Estimator Efficiencies
To compute a lower bound for $\mathrm{eff}_{\mathrm{RLS}}$, we first transform $x$ into $z = V_A^{-1/2}x$. Then, from (3.10),
\[
\mathrm{eff}_{\mathrm{RLS}} = \frac{(z^\mathsf{T}z)^2}{\bigl(z^\mathsf{T}V_A^{-1/2}V_R V_A^{-1/2}z\bigr)\bigl(z^\mathsf{T}V_A^{1/2}V_R^{-1}V_A^{1/2}z\bigr)}.
\]
Scaling $z$ by a nonzero constant does not change $\mathrm{eff}_{\mathrm{RLS}}$. Hence, it suffices to consider unit vectors $u = z/\|z\|$, and rewrite
\[
1/\mathrm{eff}_{\mathrm{RLS}} = \bigl(u^\mathsf{T}V_A^{-1/2}V_R V_A^{-1/2}u\bigr)\bigl(u^\mathsf{T}V_A^{1/2}V_R^{-1}V_A^{1/2}u\bigr) \equiv \bigl(u^\mathsf{T}Au\bigr)\bigl(u^\mathsf{T}A^{-1}u\bigr)
\]
for $A = V_A^{-1/2}V_R V_A^{-1/2}$. We get an upper bound for $1/\mathrm{eff}_{\mathrm{RLS}}$ from the Kantorovich inequality after getting upper and lower bounds on the eigenvalues of
\[
A = V_A^{-1/2}\bigl(V_A + \sigma_B^2 B_R\bigr)V_A^{-1/2} = I_N + \sigma_B^2 V_A^{-1/2}B_R V_A^{-1/2}.
\]
The eigenvalues of $A$ are the eigenvalues of $\sigma_B^2 V_A^{-1/2}B_R V_A^{-1/2}$ plus one.
The matrix $B_C$, and by extension $B_R$, is singular and positive semidefinite, with nonzero eigenvalues $N_{\bullet j}$ for $j = 1, \dots, C$. Also, $V_A$ is symmetric and nonsingular with eigenvalues $\sigma_E^2$ and $\sigma_E^2 + \sigma_A^2 N_{i\bullet}$ for $i = 1, \dots, R$. Then $V_A^{-1/2}$ is symmetric and nonsingular with eigenvalues $1/\sigma_E$ and $1/\sqrt{\sigma_E^2 + \sigma_A^2 N_{i\bullet}}$ for $i = 1, \dots, R$.

Therefore, $\sigma_B^2 V_A^{-1/2}B_R V_A^{-1/2}$ is singular and positive semidefinite. Its smallest eigenvalue is zero,
APPENDIX B. LINEAR MIXED MODEL WITH CROSSED RANDOM EFFECTS 112
and its largest eigenvalue is bounded above by
\[
\|\sigma_B^2 V_A^{-1/2}B_R V_A^{-1/2}\|_2 \le \sigma_B^2\|V_A^{-1/2}\|_2^2\|B_R\|_2 = \frac{\sigma_B^2}{\sigma_E^2}\max_j N_{\bullet j}.
\]
Hence, the smallest eigenvalue of $A$ is 1, and the largest eigenvalue is bounded above by $1 + \sigma_B^2\max_j N_{\bullet j}/\sigma_E^2$. By the Kantorovich inequality (Theorem 3.3.1), we have
\[
1/\mathrm{eff}_{\mathrm{RLS}} = \bigl(u^\mathsf{T}V_A^{-1/2}V_R V_A^{-1/2}u\bigr)\bigl(u^\mathsf{T}V_A^{1/2}V_R^{-1}V_A^{1/2}u\bigr)
\le \frac{\bigl(2 + \sigma_B^2\max_j N_{\bullet j}/\sigma_E^2\bigr)^2}{4\bigl(1 + \sigma_B^2\max_j N_{\bullet j}/\sigma_E^2\bigr)}
= \frac{\bigl(2\sigma_E^2 + \sigma_B^2\max_j N_{\bullet j}\bigr)^2}{4\sigma_E^2\bigl(\sigma_E^2 + \sigma_B^2\max_j N_{\bullet j}\bigr)}.
\]
Taking reciprocals gives the desired result. The result for $\mathrm{eff}_{\mathrm{CLS}}$ follows by symmetry.
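The Kantorovich step above can be checked numerically. The following sketch (our own illustration, not part of the dissertation) draws a random symmetric matrix $A$ with spectrum in $[m, M]$, mimicking $A = I_N + \sigma_B^2 V_A^{-1/2}B_R V_A^{-1/2}$, and confirms that $(u^\mathsf{T}Au)(u^\mathsf{T}A^{-1}u)$ never exceeds $(m+M)^2/(4mM)$ over many random unit vectors $u$:

```python
# Numerical check of the Kantorovich inequality used in the efficiency bound:
# for symmetric A with eigenvalues in [m, M] and any unit vector u,
# (u'Au)(u'A^{-1}u) <= (m+M)^2 / (4mM).  Illustrative sketch only.
import numpy as np

rng = np.random.default_rng(0)
p = 8
# Random symmetric A with eigenvalues in [1, 5]; m = 1 mirrors the smallest
# eigenvalue of A = I + sigma_B^2 V_A^{-1/2} B_R V_A^{-1/2} in the proof.
Q, _ = np.linalg.qr(rng.standard_normal((p, p)))
eigs = rng.uniform(1.0, 5.0, size=p)
eigs[0], eigs[-1] = 1.0, 5.0          # make the extreme eigenvalues attained
A = Q @ np.diag(eigs) @ Q.T
A_inv = np.linalg.inv(A)

m, M = 1.0, 5.0
bound = (m + M) ** 2 / (4 * m * M)
worst = 0.0
for _ in range(2000):
    u = rng.standard_normal(p)
    u /= np.linalg.norm(u)
    worst = max(worst, (u @ A @ u) * (u @ A_inv @ u))
print(worst, bound)
```

By the Cauchy--Schwarz inequality the product is also at least 1, so the observed values lie between 1 and the Kantorovich bound.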
B.2 A Useful Lemma
Several of the proofs for Section 3.4 utilize the following lemma, which is not given in the main text for brevity's sake. This lemma rewrites $U_a(\hat\beta)$, $U_b(\hat\beta)$, and $U_e(\hat\beta)$ in a useful form.
Lemma B.2.1.
\[
U_a(\hat\beta) = U_a + \frac{1}{2}(\hat\beta - \beta)^\mathsf{T}\Bigl(\sum_{ijj'} N_{i\bullet}^{-1}Z_{ij}Z_{ij'}(x_{ij}-x_{ij'})(x_{ij}-x_{ij'})^\mathsf{T}\Bigr)(\hat\beta - \beta)
\]
\[
\qquad + (\hat\beta - \beta)^\mathsf{T}\sum_{ijj'} N_{i\bullet}^{-1}Z_{ij}Z_{ij'}(x_{ij}-x_{ij'})(b_j - b_{j'})
+ (\hat\beta - \beta)^\mathsf{T}\sum_{ijj'} N_{i\bullet}^{-1}Z_{ij}Z_{ij'}(x_{ij}-x_{ij'})(e_{ij} - e_{ij'}),
\]
\[
U_b(\hat\beta) = U_b + \frac{1}{2}(\hat\beta - \beta)^\mathsf{T}\Bigl(\sum_{jii'} N_{\bullet j}^{-1}Z_{ij}Z_{i'j}(x_{ij}-x_{i'j})(x_{ij}-x_{i'j})^\mathsf{T}\Bigr)(\hat\beta - \beta)
\]
\[
\qquad + (\hat\beta - \beta)^\mathsf{T}\sum_{jii'} N_{\bullet j}^{-1}Z_{ij}Z_{i'j}(x_{ij}-x_{i'j})(a_i - a_{i'})
+ (\hat\beta - \beta)^\mathsf{T}\sum_{jii'} N_{\bullet j}^{-1}Z_{ij}Z_{i'j}(x_{ij}-x_{i'j})(e_{ij} - e_{i'j}), \text{ and}
\]
\[
U_e(\hat\beta) = U_e + \frac{1}{2}(\hat\beta - \beta)^\mathsf{T}\Bigl(\sum_{iji'j'} Z_{ij}Z_{i'j'}(x_{ij}-x_{i'j'})(x_{ij}-x_{i'j'})^\mathsf{T}\Bigr)(\hat\beta - \beta)
\]
\[
\qquad + (\hat\beta - \beta)^\mathsf{T}\sum_{iji'j'} Z_{ij}Z_{i'j'}(x_{ij}-x_{i'j'})(a_i - a_{i'})
+ (\hat\beta - \beta)^\mathsf{T}\sum_{iji'j'} Z_{ij}Z_{i'j'}(x_{ij}-x_{i'j'})(b_j - b_{j'})
+ (\hat\beta - \beta)^\mathsf{T}\sum_{iji'j'} Z_{ij}Z_{i'j'}(x_{ij}-x_{i'j'})(e_{ij} - e_{i'j'}),
\]
where for $\eta_{ij} = a_i + b_j + e_{ij}$,
\[
U_a = \sum_{ijj'} N_{i\bullet}^{-1}Z_{ij}Z_{ij'}(\eta_{ij}-\eta_{ij'})^2, \quad
U_b = \sum_{jii'} N_{\bullet j}^{-1}Z_{ij}Z_{i'j}(\eta_{ij}-\eta_{i'j})^2, \quad\text{and}\quad
U_e = \sum_{iji'j'} Z_{ij}Z_{i'j'}(\eta_{ij}-\eta_{i'j'})^2.
\]
Proof. Straightforward algebra.
Note that the $\eta_{ij}$ exactly follow a two-factor crossed random effects model. Thus, Lemma B.2.1 shows that we can leverage results about $U_a$, $U_b$, and $U_e$ from Section 2.4 to analyze $U_a(\hat\beta)$, $U_b(\hat\beta)$, and $U_e(\hat\beta)$.
B.3 Consistency of Estimates
B.3.1 Proof of Theorem 3.4.1
Let the data be ordered by row and write $Y = X\beta + \varepsilon$, where $\varepsilon$ has mean zero and variance $\sigma_A^2 A_R + \sigma_B^2 B_R + \sigma_E^2 I_N$. Then $\hat\beta_{\mathrm{OLS}} = \beta + (X^\mathsf{T}X)^{-1}X^\mathsf{T}\varepsilon$. Clearly $\mathrm{E}\bigl((X^\mathsf{T}X)^{-1}X^\mathsf{T}\varepsilon\bigr) = 0$. Now let $w \in \mathbb{R}^p$ be any unit vector. Then
\[
\mathrm{Var}\bigl(w^\mathsf{T}(X^\mathsf{T}X)^{-1}X^\mathsf{T}\varepsilon\bigr)
= w^\mathsf{T}(X^\mathsf{T}X)^{-1}X^\mathsf{T}\bigl(\sigma_A^2 A_R + \sigma_B^2 B_R + \sigma_E^2 I_N\bigr)X(X^\mathsf{T}X)^{-1}w
\]
\[
\le \bigl(\sigma_E^2 + \sigma_A^2\max_i N_{i\bullet} + \sigma_B^2\max_j N_{\bullet j}\bigr)\, w^\mathsf{T}(X^\mathsf{T}X)^{-1}X^\mathsf{T}X(X^\mathsf{T}X)^{-1}w
\]
\[
= \frac{1}{N}\bigl(\sigma_E^2 + \sigma_A^2\max_i N_{i\bullet} + \sigma_B^2\max_j N_{\bullet j}\bigr)\, w^\mathsf{T}\Bigl(\frac{1}{N}X^\mathsf{T}X\Bigr)^{-1}w
\le \Bigl(\frac{\sigma_E^2}{N} + \epsilon_R\sigma_A^2 + \epsilon_C\sigma_B^2\Bigr)\Big/\, \mathcal{I}\Bigl(\frac{X^\mathsf{T}X}{N}\Bigr) \to 0.
\]
The first inequality follows from the facts that the maximum eigenvalue of $\sigma_E^2 I_N$ is $\sigma_E^2$, the maximum eigenvalue of $\sigma_A^2 A_R$ is $\sigma_A^2\max_i N_{i\bullet}$, and the maximum eigenvalue of $\sigma_B^2 B_R$ is $\sigma_B^2\max_j N_{\bullet j}$. The conclusion now follows from our asymptotic assumptions.
B.3.2 Proof of Theorem 3.4.2
In light of Theorem 2.4.3 it suffices to show that $(U_a(\hat\beta)-U_a)/(N-R)$, $(U_b(\hat\beta)-U_b)/(N-C)$, and $(U_e(\hat\beta)-U_e)/N^2$ all converge to zero in probability. Because $\max(R, C)/N < \theta \in (0, 1)$ we can replace the denominators $N-R$ and $N-C$ by $N$. Using the expansion in Lemma B.2.1,
\[
\frac{U_a(\hat\beta) - U_a}{N}
= \underbrace{\frac{1}{2}(\hat\beta - \beta)^\mathsf{T}\Bigl(\frac{1}{N}\sum_{ijj'} N_{i\bullet}^{-1}Z_{ij}Z_{ij'}(x_{ij}-x_{ij'})(x_{ij}-x_{ij'})^\mathsf{T}\Bigr)(\hat\beta - \beta)}_{A1}
\]
\[
\quad + \underbrace{(\hat\beta - \beta)^\mathsf{T}\frac{1}{N}\sum_{ijj'} N_{i\bullet}^{-1}Z_{ij}Z_{ij'}(x_{ij}-x_{ij'})(b_j - b_{j'})}_{A2}
+ \underbrace{(\hat\beta - \beta)^\mathsf{T}\frac{1}{N}\sum_{ijj'} N_{i\bullet}^{-1}Z_{ij}Z_{ij'}(x_{ij}-x_{ij'})(e_{ij} - e_{ij'})}_{A3}.
\]
We consider Terms A1 through A3 in turn.
A1: The middle factor in Term A1 is no larger than $(4M_\infty/N)\sum_{ijj'} N_{i\bullet}^{-1}Z_{ij}Z_{ij'} = 4M_\infty = O(1)$. Therefore Term A1 is $O(\|\hat\beta - \beta\|^2)$.
A2: The coefficient of $\hat\beta - \beta$ has mean zero and second moment
\[
\mathrm{E}\Bigl(\frac{1}{N^2}\sum_{ijj'}\sum_{rss'} N_{i\bullet}^{-1}N_{r\bullet}^{-1}Z_{ij}Z_{ij'}Z_{rs}Z_{rs'}(x_{ij}-x_{ij'})(x_{rs}-x_{rs'})^\mathsf{T}(b_j - b_{j'})(b_s - b_{s'})\Bigr)
\]
\[
= \frac{1}{N^2}\sum_{ijj'}\sum_{rss'} N_{i\bullet}^{-1}N_{r\bullet}^{-1}Z_{ij}Z_{ij'}Z_{rs}Z_{rs'}(x_{ij}-x_{ij'})(x_{rs}-x_{rs'})^\mathsf{T}\,\mathrm{E}\bigl((b_j - b_{j'})(b_s - b_{s'})\bigr)
\]
\[
= \frac{\sigma_B^2}{N^2}\sum_{ijj'}\sum_{rss'} N_{i\bullet}^{-1}N_{r\bullet}^{-1}Z_{ij}Z_{ij'}Z_{rs}Z_{rs'}(x_{ij}-x_{ij'})(x_{rs}-x_{rs'})^\mathsf{T}\bigl(1_{j=s} - 1_{j=s'} - 1_{j'=s} + 1_{j'=s'}\bigr)
\]
\[
= \frac{4\sigma_B^2}{N^2}\sum_{ijj'}\sum_{rs'} N_{i\bullet}^{-1}N_{r\bullet}^{-1}Z_{ij}Z_{ij'}Z_{rj}Z_{rs'}(x_{ij}-x_{ij'})(x_{rj}-x_{rs'})^\mathsf{T}.
\]
No component in this matrix is larger than
\[
\frac{16 M_\infty\sigma_B^2}{N^2}\sum_{ijj'}\sum_{rs'} N_{i\bullet}^{-1}N_{r\bullet}^{-1}Z_{ij}Z_{ij'}Z_{rj}Z_{rs'}
= \frac{16 M_\infty\sigma_B^2}{N^2}\sum_{ij}\sum_{r} Z_{ij}Z_{rj}
= \frac{16 M_\infty\sigma_B^2}{N^2}\sum_j N_{\bullet j}^2 = O(\epsilon),
\]
and so Term A2 is $O(\|\hat\beta - \beta\|\epsilon)$.
A3: As in A2 we find that the coefficient of $\hat\beta - \beta$ has mean zero and second moment
\[
\frac{4\sigma_E^2}{N^2}\sum_{ijj'}\sum_{s'} N_{i\bullet}^{-2}Z_{ij}Z_{ij'}Z_{is'}(x_{ij}-x_{ij'})(x_{ij}-x_{is'})^\mathsf{T}
\]
in which no component is larger than
\[
\frac{16\sigma_E^2 M_\infty}{N^2}\sum_{ijj'}\sum_{s'} N_{i\bullet}^{-2}Z_{ij}Z_{ij'}Z_{is'}
= \frac{16\sigma_E^2 M_\infty}{N^2}\sum_{ij} Z_{ij}
= \frac{16\sigma_E^2 M_\infty}{N}.
\]
Therefore Term A3 is $O(\|\hat\beta - \beta\|/N)$.
Combining these results, $(U_a(\hat\beta)-U_a)/N = O\bigl(\|\hat\beta-\beta\|(\epsilon + \|\hat\beta-\beta\|)\bigr)$. The same argument applies to $(U_b(\hat\beta)-U_b)/N$. Now we turn to $(U_e(\hat\beta)-U_e)/N^2$. Using the expansion in Lemma B.2.1,
\[
\frac{U_e(\hat\beta) - U_e}{N^2}
= \underbrace{\frac{1}{2}(\hat\beta - \beta)^\mathsf{T}\Bigl(\frac{1}{N^2}\sum_{iji'j'} Z_{ij}Z_{i'j'}(x_{ij}-x_{i'j'})(x_{ij}-x_{i'j'})^\mathsf{T}\Bigr)(\hat\beta - \beta)}_{E1}
\]
\[
\quad + \underbrace{(\hat\beta - \beta)^\mathsf{T}\frac{1}{N^2}\sum_{iji'j'} Z_{ij}Z_{i'j'}(x_{ij}-x_{i'j'})(a_i - a_{i'})}_{E2}
+ \underbrace{(\hat\beta - \beta)^\mathsf{T}\frac{1}{N^2}\sum_{iji'j'} Z_{ij}Z_{i'j'}(x_{ij}-x_{i'j'})(b_j - b_{j'})}_{E3}
\]
\[
\quad + \underbrace{(\hat\beta - \beta)^\mathsf{T}\frac{1}{N^2}\sum_{iji'j'} Z_{ij}Z_{i'j'}(x_{ij}-x_{i'j'})(e_{ij} - e_{i'j'})}_{E4}.
\]
E1: By arguments like the one for A1, we find that Term E1 is also $O(\|\hat\beta - \beta\|^2)$.
E2: Similarly to A2, the coefficient of $\hat\beta - \beta$ has mean zero and second moment
\[
\mathrm{E}\Bigl(\frac{1}{N^4}\sum_{iji'j'}\sum_{rsr's'} Z_{ij}Z_{i'j'}Z_{rs}Z_{r's'}(x_{ij}-x_{i'j'})(x_{rs}-x_{r's'})^\mathsf{T}(a_i - a_{i'})(a_r - a_{r'})\Bigr)
= \frac{4\sigma_A^2}{N^4}\sum_{iji'j'}\sum_{sr's'} Z_{ij}Z_{i'j'}Z_{is}Z_{r's'}(x_{ij}-x_{i'j'})(x_{is}-x_{r's'})^\mathsf{T}
\]
with components no larger than
\[
\frac{16 M_\infty\sigma_A^2}{N^4}\sum_{iji'j'}\sum_{sr's'} Z_{ij}Z_{i'j'}Z_{is}Z_{r's'}
= \frac{16 M_\infty\sigma_A^2}{N^2}\sum_{ij}\sum_{s} Z_{ij}Z_{is}
= \frac{16 M_\infty\sigma_A^2}{N^2}\sum_i N_{i\bullet}^2.
\]
Therefore Term E2 is $O_p(\|\hat\beta - \beta\|\epsilon)$.
E3: Term E3 is also $O_p(\|\hat\beta - \beta\|\epsilon)$, by the argument used for Term E2.
E4: Following arguments similar to the preceding ones, the coefficient of $\hat\beta - \beta$ has mean zero and second moment
\[
\frac{4\sigma_E^2}{N^4}\sum_{iji'j'}\sum_{r's'} Z_{ij}Z_{i'j'}Z_{r's'}(x_{ij}-x_{i'j'})(x_{ij}-x_{r's'})^\mathsf{T} = O\Bigl(\frac{16\sigma_E^2 M_\infty}{N}\Bigr),
\]
and so Term E4 is $O(\|\hat\beta - \beta\|/N)$. Combining these results we have consistency for the variance components. The error in replacing $\beta$ by $\hat\beta$ changes the variance component estimates by $O\bigl(\|\hat\beta - \beta\|(\|\hat\beta - \beta\| + \epsilon)\bigr)$.
B.3.3 Proof of Theorem 3.4.3
Suppose that the data are ordered by rows. Then we may write $Y = X\beta + \eta$, where $\eta$ has mean zero and variance $V_R = \sigma_E^2 I_N + \sigma_A^2 A_R + \sigma_B^2 B_R$. Now
\[
\hat\beta_{\mathrm{RLS}} = \beta + (X^\mathsf{T}V_A^{-1}X)^{-1}X^\mathsf{T}V_A^{-1}\eta
\]
where $V_A = \sigma_A^2 A_R + \sigma_E^2 I_N$. The matrix $X$ is not random and both $\hat\sigma_A^2 \overset{p}{\to} \sigma_A^2$ and $\hat\sigma_E^2 \overset{p}{\to} \sigma_E^2$, so it suffices to show that $\hat\epsilon \equiv (X^\mathsf{T}V_A^{-1}X)^{-1}X^\mathsf{T}V_A^{-1}\eta \overset{p}{\to} 0$. Write $\eta = a + b + e$, where these are the random effects in the row order. We can easily handle the effect of $a + e$, via
\[
\mathrm{Var}\bigl((X^\mathsf{T}V_A^{-1}X)^{-1}X^\mathsf{T}V_A^{-1}(a+e)\bigr)
= (X^\mathsf{T}V_A^{-1}X)^{-1}X^\mathsf{T}V_A^{-1}X(X^\mathsf{T}V_A^{-1}X)^{-1}
= (X^\mathsf{T}V_A^{-1}X)^{-1}.
\]
The largest eigenvalue of $V_A$ is $O(\max_i N_{i\bullet})$ and so this quantity is $O(\epsilon_R) \to 0$.
We will need a sharper analysis of $(X^\mathsf{T}V_A^{-1}X)^{-1}$ to control the contribution of the column random effects $b$ to the row-weighted GLS estimate $\hat\beta_{\mathrm{RLS}}$. Furthermore, their contribution to the intercept term in $\beta$ motivates centering the $x_{ij}$. For a nonrandom invertible matrix $K \in \mathbb{R}^{p\times p}$, we may replace $X$ by $X^* = XK$ and $\beta$ by $\beta^* = K^{-1}\beta$. Now $\hat\beta_{\mathrm{RLS}} = K\hat\beta^*_{\mathrm{RLS}}$ and so $\mathrm{Var}(\hat\beta_{\mathrm{RLS}}) = K\,\mathrm{Var}(\hat\beta^*_{\mathrm{RLS}})K^\mathsf{T}$. Our matrix $K$ will be uniformly bounded as $N \to \infty$ and independent of $\eta$. Then $\mathrm{Var}(\hat\beta^*_{\mathrm{RLS}}) \to 0$ implies $\mathrm{Var}(\hat\beta_{\mathrm{RLS}}) \to 0$. The matrix we choose is
\[
K = \begin{pmatrix}
1 & -k_2 & \cdots & -k_p \\
0 & 1 & \cdots & 0 \\
\vdots & \vdots & \ddots & \vdots \\
0 & 0 & \cdots & 1
\end{pmatrix}
\]
with values $k_t$ for $t = 2, \dots, p$ given below. We have $x^*_{ij} = (1, x_{ij,2} - k_2, x_{ij,3} - k_3, \dots, x_{ij,p} - k_p)$.
We begin by noting that in the row ordering,
\[
V_A^{-1} = \frac{1}{\sigma_E^2}\,\mathrm{diag}\Bigl(I_{N_{i\bullet}} - N_{i\bullet}^{-1}\gamma_i 1_{N_{i\bullet}}1_{N_{i\bullet}}^\mathsf{T}\Bigr)
\]
where there are $R$ diagonal blocks of size $N_{i\bullet} \times N_{i\bullet}$ and $\gamma_i = N_{i\bullet}\sigma_A^2/(\sigma_E^2 + N_{i\bullet}\sigma_A^2)$. Then
\[
\sigma_E^2 X^\mathsf{T}V_A^{-1}X = X^\mathsf{T}\Bigl(X - \mathrm{col}\bigl(\gamma_i 1_{N_{i\bullet}}\bar{x}_{i\bullet}^\mathsf{T}\bigr)\Bigr)
\]
where $\mathrm{col}(\cdot) \in \mathbb{R}^{N\times p}$ is a column of $R$ blocks of sizes $N_{i\bullet} \times p$. Continuing,
\[
\sigma_E^2 X^\mathsf{T}V_A^{-1}X = \sum_i\sum_j Z_{ij}\bigl(x_{ij}x_{ij}^\mathsf{T} - x_{ij}\bar{x}_{i\bullet}^\mathsf{T}\gamma_i\bigr)
= X^\mathsf{T}X - \sum_i \gamma_i\sum_j Z_{ij}x_{ij}\bar{x}_{i\bullet}^\mathsf{T}
= X^\mathsf{T}X - \sum_i N_{i\bullet}\gamma_i\bar{x}_{i\bullet}\bar{x}_{i\bullet}^\mathsf{T}
\]
\[
= \sum_{ij} Z_{ij}(x_{ij}-\bar{x}_{i\bullet})(x_{ij}-\bar{x}_{i\bullet})^\mathsf{T} + \sum_i N_{i\bullet}(1-\gamma_i)\bar{x}_{i\bullet}\bar{x}_{i\bullet}^\mathsf{T}. \tag{B.1}
\]
The lower right $(p-1)\times(p-1)$ submatrix of the first term in (B.1) grows proportionally to $N$. We will see that the upper left $1\times 1$ submatrix of the second term grows at least as fast as $R$. We choose our matrix $K$ to zero out all of the top row and leftmost column of $X^{*\mathsf{T}}V_A^{-1}X^*$ except the upper left entry. To this end, define
\[
k_t = \frac{\sum_i N_{i\bullet}(1-\gamma_i)\bar{x}_{i\bullet,t}}{\sum_i N_{i\bullet}(1-\gamma_i)}
= \frac{\sum_i \bar{x}_{i\bullet,t}\, N_{i\bullet}\sigma_E^2/(N_{i\bullet}\sigma_A^2 + \sigma_E^2)}{\sum_i N_{i\bullet}\sigma_E^2/(N_{i\bullet}\sigma_A^2 + \sigma_E^2)}, \quad t = 2, \dots, p.
\]
Now from (B.1),
\[
\sigma_E^2 X^{*\mathsf{T}}V_A^{-1}X^* = \begin{pmatrix} \sum_i N_{i\bullet}(1-\gamma_i) & 0_{p-1}^\mathsf{T} \\ 0_{p-1} & V \end{pmatrix}
\]
where $V$ is the lower right $(p-1)\times(p-1)$ submatrix of $\sum_{ij} Z_{ij}(x_{ij}-\bar{x}_{i\bullet})(x_{ij}-\bar{x}_{i\bullet})^\mathsf{T}$ plus a positive semidefinite matrix. Therefore
\[
\bigl(X^{*\mathsf{T}}V_A^{-1}X^*\bigr)^{-1} = \sigma_E^2\begin{pmatrix} 1/\sum_i N_{i\bullet}(1-\gamma_i) & 0_{p-1}^\mathsf{T} \\ 0_{p-1} & V^{-1} \end{pmatrix}.
\]
Continuing the derivation,
\[
\mathrm{Var}\bigl((X^{*\mathsf{T}}V_A^{-1}X^*)^{-1}X^{*\mathsf{T}}V_A^{-1}b\bigr)
= (X^{*\mathsf{T}}V_A^{-1}X^*)^{-1}\bigl(X^{*\mathsf{T}}V_A^{-1}B_R V_A^{-1}X^*\bigr)(X^{*\mathsf{T}}V_A^{-1}X^*)^{-1}\sigma_B^2.
\]
The eigenvalues of $V_A^{-1}$ are all smaller than 1, so in the ordering of positive semidefinite matrices,
\[
X^{*\mathsf{T}}V_A^{-1}B_R V_A^{-1}X^* \le X^{*\mathsf{T}}B_R X^* = \sum_j N_{\bullet j}^2\,\bar{x}^*_{\bullet j}\bar{x}^{*\mathsf{T}}_{\bullet j}.
\]
Now for a unit vector $w \in \mathbb{R}^p$ with $w_1 = 0$ we have $\|(X^{*\mathsf{T}}V_A^{-1}X^*)^{-1}w\| \le cN^{-1}\sigma_E^2$ because the sample covariance of the non-intercept $x$'s grows (at least) proportionally to $N$. The $x_{ij}$ are bounded and so therefore the $\bar{x}^*_{\bullet j}$ are also bounded. So now
\[
\mathrm{Var}\bigl(w^\mathsf{T}(X^{*\mathsf{T}}V_A^{-1}X^*)^{-1}X^{*\mathsf{T}}V_A^{-1}b\bigr) = O\Bigl(N^{-2}\sum_j N_{\bullet j}^2\Bigr) \to 0.
\]
Next we consider $w = (1, 0, \dots, 0)$. For this $w$,
\[
(X^{*\mathsf{T}}V_A^{-1}X^*)^{-1}w
= \frac{\sigma_E^2}{\sum_i N_{i\bullet}(1-\gamma_i)}
= \frac{\sigma_E^2}{\sum_i N_{i\bullet}\sigma_E^2/(N_{i\bullet}\sigma_A^2 + \sigma_E^2)}
\le \frac{1}{R\sigma_A^2}.
\]
The matrix $B_R$ has $C$ blocks of the form $1_{N_{\bullet j}}1_{N_{\bullet j}}^\mathsf{T}$ permuted into the row ordering. We may write $B_R = Z_b Z_b^\mathsf{T}$ where $Z_b \in \{0,1\}^{N\times C}$. The row of $Z_b$ corresponding to observation $ij$ has only one 1 in it, at position $j$. Now
\[
V_A^{-1}X^* = \begin{pmatrix}
(1-\gamma_1)1_{N_{1\bullet}} & 0 & 0 & \cdots & 0 \\
(1-\gamma_2)1_{N_{2\bullet}} & 0 & 0 & \cdots & 0 \\
\vdots & \vdots & \vdots & \ddots & \vdots \\
(1-\gamma_R)1_{N_{R\bullet}} & 0 & 0 & \cdots & 0
\end{pmatrix} \in \mathbb{R}^{N\times p}
\]
and the $j$'th row of $Z_b^\mathsf{T}V_A^{-1}X^* \in \mathbb{R}^{C\times p}$ is $\bigl(\sum_i Z_{ij}(1-\gamma_i), 0, \dots, 0\bigr) \in \mathbb{R}^p$. Then the only nonzero element of $X^{*\mathsf{T}}V_A^{-1}B_R V_A^{-1}X^*$ is the upper left one, and it equals $\sum_{ijr} Z_{ij}Z_{rj}(1-\gamma_i)(1-\gamma_r)$.
Therefore, for $w = (1, 0, \dots, 0)$,
\[
\mathrm{Var}\bigl(w^\mathsf{T}(X^{*\mathsf{T}}V_A^{-1}X^*)^{-1}X^{*\mathsf{T}}V_A^{-1}b\bigr)
\le \frac{1}{R^2\sigma_A^4}\sum_{ijr} Z_{ij}Z_{rj}(1-\gamma_i)(1-\gamma_r)
= \frac{1}{R^2\sigma_A^4}\sum_{ir}(ZZ^\mathsf{T})_{ir}\,\frac{\sigma_E^2}{\sigma_E^2 + N_{i\bullet}\sigma_A^2}\cdot\frac{\sigma_E^2}{\sigma_E^2 + N_{r\bullet}\sigma_A^2}
\le \frac{\sigma_E^4}{R^2\sigma_A^8}\sum_{ir}(ZZ^\mathsf{T})_{ir}N_{i\bullet}^{-1}N_{r\bullet}^{-1},
\]
which vanishes by condition (3.14). A general unit vector $w$ can be written as a linear combination of unit vectors with $w_1 = 0$ and $w_1 = 1$, and so $\mathrm{Var}\bigl(w^\mathsf{T}(X^{*\mathsf{T}}V_A^{-1}X^*)^{-1}X^{*\mathsf{T}}V_A^{-1}b\bigr) \to 0$. Because $K$ is bounded, $\mathrm{Var}\bigl(w^\mathsf{T}(X^\mathsf{T}V_A^{-1}X)^{-1}X^\mathsf{T}V_A^{-1}b\bigr) \to 0$ as well. This completes the proof.
B.4 Asymptotic Normality of Estimates
B.4.1 Proof of Theorem 3.4.4
We will use the following central limit theorem for a triangular array of weighted sums of IID random
variables.
Theorem B.4.1. For integers $i$ and $n$ with $1 \le i \le n$, let $\epsilon_{n,i}$ be a triangular array of random variables that are IID within each row with mean $\mu_n$ and variance $\sigma_n^2 \in (0, \infty)$. Let $c_{n,i}$ be a triangular array of finite constants, not all zero within each row. Define
\[
T_n = \frac{1}{B_n}\sum_{i=1}^n c_{ni}(\epsilon_{n,i} - \mu_n), \quad\text{where}\quad B_n^2 = \sigma_n^2\sum_{i=1}^n c_{ni}^2.
\]
If $\max_{1\le i\le n} c_{ni}^2 \big/ \sum_{i=1}^n c_{ni}^2 \to 0$ as $n \to \infty$, then $T_n \overset{d}{\to} \mathcal{N}(0, 1)$.

Proof. This is from Theorem 2.2 of [4].

Our use case is for $\mu_n = 0$ and $\sigma_n$ constant in $n$. That case was also handled by Theorem 1 of [26], which also gives a converse.
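Theorem B.4.1 can be illustrated by simulation. In the sketch below (our own; the skewed centered Exponential draws and the weight distribution are arbitrary choices satisfying the negligibility condition), the standardized weighted sums $T_n$ have mean near 0 and variance near 1, consistent with the $\mathcal{N}(0,1)$ limit:

```python
# Monte Carlo illustration of the weighted triangular-array CLT: a weighted
# sum of centered IID variables, standardized by B_n, behaves like N(0, 1)
# when no single weight dominates.  Illustrative sketch only.
import numpy as np

rng = np.random.default_rng(3)
n, reps = 1000, 2000
c = 1.0 + rng.uniform(size=n)        # weights with max c^2 / sum c^2 -> 0
Bn = np.sqrt((c ** 2).sum())         # B_n with sigma_n = 1
# Exponential(1) - 1 is skewed with mean 0 and variance 1.
eps = rng.exponential(size=(reps, n)) - 1.0
Tn = eps @ c / Bn                    # one standardized T_n per replication
print(Tn.mean(), Tn.var())
```

The skewness of the summands washes out because the weights spread the sum over many independent terms, exactly the role of the negligibility condition $\max_i c_{ni}^2/\sum_i c_{ni}^2 \to 0$.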
From Section B.3.3, $\hat\beta_{\mathrm{RLS}} - \beta = (X^\mathsf{T}V_A^{-1}X)^{-1}X^\mathsf{T}V_A^{-1}\eta$, where $\eta_{ij} = a_i + b_j + e_{ij}$. We will make use of the sums $\eta_{i\bullet} = \sum_j Z_{ij}\eta_{ij}$ and $X_{i\bullet} = \sum_j Z_{ij}x_{ij} \in \mathbb{R}^p$, as well as the corresponding column sums. The matrix $(X^\mathsf{T}V_A^{-1}X)^{-1}$ is not random. We will establish a central limit theorem for $X^\mathsf{T}V_A^{-1}\eta$.
Consider $w^\mathsf{T}X^\mathsf{T}V_A^{-1}\eta$ for a unit vector $w \in \mathbb{R}^p$. By the Woodbury formula,
\[
w^\mathsf{T}X^\mathsf{T}V_A^{-1}\eta = \frac{w^\mathsf{T}X^\mathsf{T}\eta}{\sigma_E^2} - \frac{\sigma_A^2}{\sigma_E^2}\sum_i \frac{w^\mathsf{T}X_{i\bullet}\eta_{i\bullet}}{\sigma_E^2 + \sigma_A^2 N_{i\bullet}}
\]
\[
= \frac{1}{\sigma_E^2}\Bigl[\sum_i a_i w^\mathsf{T}X_{i\bullet} + \sum_j b_j w^\mathsf{T}X_{\bullet j} + \sum_{ij} Z_{ij}e_{ij}w^\mathsf{T}x_{ij}\Bigr]
- \frac{\sigma_A^2}{\sigma_E^2}\sum_i \frac{w^\mathsf{T}X_{i\bullet}}{\sigma_E^2 + \sigma_A^2 N_{i\bullet}}\Bigl(N_{i\bullet}a_i + \sum_j Z_{ij}b_j + \sum_j Z_{ij}e_{ij}\Bigr)
\]
\[
= \underbrace{\sum_i \frac{a_i w^\mathsf{T}X_{i\bullet}}{\sigma_E^2 + \sigma_A^2 N_{i\bullet}}}_{\text{Term R1}}
+ \underbrace{\sum_j \frac{b_j}{\sigma_E^2}\Bigl(w^\mathsf{T}X_{\bullet j} - \sigma_A^2\sum_i \frac{Z_{ij}w^\mathsf{T}X_{i\bullet}}{\sigma_E^2 + \sigma_A^2 N_{i\bullet}}\Bigr)}_{\text{Term R2}}
+ \underbrace{\sum_{ij} \frac{Z_{ij}e_{ij}}{\sigma_E^2}\Bigl(w^\mathsf{T}x_{ij} - \sigma_A^2\frac{w^\mathsf{T}X_{i\bullet}}{\sigma_E^2 + \sigma_A^2 N_{i\bullet}}\Bigr)}_{\text{Term R3}}.
\]
Terms R1, R2, and R3 are independent. We will show CLTs for each of them individually.
R1: We use Theorem B.4.1 with random variables $a_i$ and weights $c_i = w^\mathsf{T}X_{i\bullet}/(\sigma_E^2 + \sigma_A^2 N_{i\bullet})$. Now $\max_i c_i^2 \le M_\infty^2$ and
\[
\sum_i c_i^2 = \sum_i \Bigl(\frac{w^\mathsf{T}X_{i\bullet}}{\sigma_E^2 + \sigma_A^2 N_{i\bullet}}\Bigr)^2
\ge \sum_i \Bigl(\frac{w^\mathsf{T}\bar{x}_{i\bullet}}{\sigma_E^2 + \sigma_A^2}\Bigr)^2
\ge (\sigma_E^2 + \sigma_A^2)^{-2}\, \mathcal{I}\Bigl(\sum_i \bar{x}_{i\bullet}\bar{x}_{i\bullet}^\mathsf{T}\Bigr) \to \infty.
\]
Therefore Term R1 is asymptotically normally distributed.
R2: This term is a weighted sum of independent random variables $b_j/\sigma_E^2$ with weights $c_j = w^\mathsf{T}\sum_i Z_{ij}(x_{ij} - \gamma_i\bar{x}_{i\bullet})$, where $\gamma_i = \sigma_A^2/(\sigma_A^2 + \sigma_E^2/N_{i\bullet})$. Therefore $c_j = N_{\bullet j}w^\mathsf{T}(\bar{x}_{\bullet j} - \tilde{x}_{\bullet j})$ for the second order averages $\tilde{x}_{\bullet j}$ given by (3.15).

As in the proof of Theorem 3.4.4 from Section B.4.1, we employ a bounded invertible centering matrix $K = \bigl(\begin{smallmatrix} 1 & -k \\ 0 & I_{p-1}\end{smallmatrix}\bigr)$, not necessarily the same one as there. We will show that $K\sum_j N_{\bullet j}b_j(\bar{x}_{\bullet j} - \tilde{x}_{\bullet j})$ is asymptotically Gaussian, and then so is $\sum_j N_{\bullet j}b_j(\bar{x}_{\bullet j} - \tilde{x}_{\bullet j})$. Let $c^*_j = w^\mathsf{T}K N_{\bullet j}(\bar{x}_{\bullet j} - \tilde{x}_{\bullet j})$. Then
\[
\sum_j c^{*2}_j = w^\mathsf{T}\sum_j N_{\bullet j}^2\, K(\bar{x}_{\bullet j} - \tilde{x}_{\bullet j})(\bar{x}_{\bullet j} - \tilde{x}_{\bullet j})^\mathsf{T}K^\mathsf{T}w.
\]
For $2 \le t \le p$ let
\[
k_t = \sum_j N_{\bullet j}^2(\bar{x}_{\bullet j,t} - \tilde{x}_{\bullet j,t})\Big/\sum_j N_{\bullet j}^2
\]
and define $k^* = (0, k_2, \dots, k_p)^\mathsf{T}$. Then $\sum_j N_{\bullet j}^2\, K(\bar{x}_{\bullet j} - \tilde{x}_{\bullet j} - k^*)(\bar{x}_{\bullet j} - \tilde{x}_{\bullet j} - k^*)^\mathsf{T}K^\mathsf{T}$ is block diagonal with an upper $1\times 1$ block and a lower $(p-1)\times(p-1)$ block.

First suppose that $w = (w_1, w_2, \dots, w_p)$ is a unit vector with $|w_1| \ne 1$, and let $\|w\|^2_{-1}$ be the squared norm of $w$ excluding $w_1$. Then
\[
\sum_j c^{*2}_j \ge \|w\|^2_{-1}\, \mathcal{I}_0\Bigl(\sum_j N_{\bullet j}^2(\bar{x}_{\bullet j} - \tilde{x}_{\bullet j} - k^*)(\bar{x}_{\bullet j} - \tilde{x}_{\bullet j} - k^*)^\mathsf{T}\Bigr)
\]
which diverges faster than $\max_j N_{\bullet j}^2$ by hypothesis. It remains to consider $w^\mathsf{T} = (\pm 1, 0, \dots, 0)$. For this vector $c^*_j = c_j = \sum_i Z_{ij}(1-\gamma_i) = \sum_i Z_{ij}\sigma_E^2/(\sigma_E^2 + N_{i\bullet}\sigma_A^2)$ and $\max_j c_j^2/\sum_j c_j^2 \to 0$ by hypothesis. Therefore Term R2 is asymptotically normally distributed.
R3: This term is a weighted sum of IID random variables $e_{ij}/\sigma_E^2$ with weights $c_{ij} = Z_{ij}w^\mathsf{T}(x_{ij} - \gamma_i\bar{x}_{i\bullet})$. As in paragraph R2, we employ a bounded invertible centering matrix. Then for a unit vector $w \ne (\pm 1, 0, \dots, 0)^\mathsf{T}$,
\[
\sum_{ij} c^{*2}_{ij} \ge \|w\|^2_{-1}\, \mathcal{I}_0\Bigl(\sum_{ij} Z_{ij}(x_{ij} - \gamma_i\bar{x}_{i\bullet})(x_{ij} - \gamma_i\bar{x}_{i\bullet})^\mathsf{T}\Bigr)
\ge \|w\|^2_{-1}\, \mathcal{I}_0\Bigl(\sum_{ij} Z_{ij}(x_{ij} - \bar{x}_{i\bullet})(x_{ij} - \bar{x}_{i\bullet})^\mathsf{T}\Bigr)
\]
which, by hypothesis, diverges to infinity, while $\max_{ij} c^{*2}_{ij} = O(1)$. The case $w = (\pm 1, 0, \dots, 0)^\mathsf{T}$ is handled by one of the assumptions in the theorem.

All three terms have asymptotically normal distributions with mean zero, and they are independent. Therefore, $(X^\mathsf{T}V_A^{-1}X)^{-1}X^\mathsf{T}V_A^{-1}\eta$ is asymptotically Gaussian with mean zero and variance $(X^\mathsf{T}V_A^{-1}X)^{-1}X^\mathsf{T}V_A^{-1}V_R V_A^{-1}X(X^\mathsf{T}V_A^{-1}X)^{-1}$.
B.4.2 Proof of Lemma 3.4.1
We mimic the argument in Section B.3.2. First, we use Lemma B.2.1:
\[
\frac{U_a(\hat\beta) - U_a}{\sqrt{\sum_j N_{\bullet j}^2}}
= \underbrace{\frac{1}{2}(\hat\beta - \beta)^\mathsf{T}\Bigl(\frac{1}{\sqrt{\sum_j N_{\bullet j}^2}}\sum_{ijj'} N_{i\bullet}^{-1}Z_{ij}Z_{ij'}(x_{ij}-x_{ij'})(x_{ij}-x_{ij'})^\mathsf{T}\Bigr)(\hat\beta - \beta)}_{A2}
\]
\[
\quad + \underbrace{(\hat\beta - \beta)^\mathsf{T}\frac{1}{\sqrt{\sum_j N_{\bullet j}^2}}\sum_{ijj'} N_{i\bullet}^{-1}Z_{ij}Z_{ij'}(x_{ij}-x_{ij'})(b_j - b_{j'})}_{A3}
+ \underbrace{(\hat\beta - \beta)^\mathsf{T}\frac{1}{\sqrt{\sum_j N_{\bullet j}^2}}\sum_{ijj'} N_{i\bullet}^{-1}Z_{ij}Z_{ij'}(x_{ij}-x_{ij'})(e_{ij} - e_{ij'})}_{A4}.
\]
These terms are the same ones we considered in Section B.3.2, except that they are now normalized by $\sqrt{\sum_j N_{\bullet j}^2}$ instead of by $N$. The quadratic term was of the largest magnitude there. It is also largest here, and so we show that Term A2 is $o_p(1)$.

The middle factor in Term A2 has entries no larger than
\[
\frac{4M_\infty}{\sqrt{\sum_j N_{\bullet j}^2}}\sum_{ijj'} N_{i\bullet}^{-1}Z_{ij}Z_{ij'} = \frac{4M_\infty N}{\sqrt{\sum_j N_{\bullet j}^2}}.
\]
Also $\|\hat\beta - \beta\|^2 = O(\epsilon_R + \epsilon_C)$, so this combined with the growth condition (2.31) makes this term $o_p(1)$.
It follows that $(U_a(\hat\beta) - U_a)\big/\sqrt{\sum_j N_{\bullet j}^2} \overset{p}{\to} 0$, and similarly $(U_b(\hat\beta) - U_b)\big/\sqrt{\sum_i N_{i\bullet}^2} \overset{p}{\to} 0$. Now we expand
\[
\frac{U_e(\hat\beta) - U_e}{N\sqrt{\sum_i N_{i\bullet}^2 + \sum_j N_{\bullet j}^2}}
\]
into terms just like those in Section B.3.2, but with a new denominator. As before, Term E2 is of larger order than all the others. The magnitude of this term with its new normalization is
\[
\frac{M_\infty\|\hat\beta - \beta\|^2 N}{\sqrt{\sum_i N_{i\bullet}^2 + \sum_j N_{\bullet j}^2}}
= \frac{O_p\bigl(\max_i N_{i\bullet} + \max_j N_{\bullet j}\bigr)}{\sqrt{\sum_i N_{i\bullet}^2 + \sum_j N_{\bullet j}^2}} = o_p(1).
\]
Therefore $(U_e(\hat\beta) - U_e)\big/\bigl[N\sqrt{\sum_i N_{i\bullet}^2 + \sum_j N_{\bullet j}^2}\bigr] \overset{p}{\to} 0$.
B.4.3 Proof of Theorem 3.4.5
By Slutsky’s Theorem, Theorem 2.4.6, and Lemma 3.4.1, Ua(β), Ub(β), and Ue(β) are asymptotically
normally distributed with mean
(σ2B + σ2
E)(N −R)√∑j N
2•j
(σ2A + σ2
E)(N − C)√∑iN
2i•
σ2A(N2 −
∑iN
2i•) + σ2
B(N2 −∑j N
2•j) + σ2
E(N2 −N)
N√∑
iN2i• +
∑j N
2•j
and approximate covariance
σ4B(κB + 2)
σ4E(κE + 2)N√∑iN
2i•
∑j N
2•j
σ4B(κB + 2)
∑j N
2•j√∑
j N2•j(∑iN
2i• +
∑j N
2•j)
∗ σ4A(κA + 2)
σ4A(κA + 2)
∑iN
2i•√∑
iN2i•(∑iN
2i• +
∑j N
2•j)
∗ ∗σ4A(κA + 2)
∑iN
2i• + σ4
B(κB + 2)∑j N
2•j + σ4
E(κE + 2)N∑iN
2i• +
∑j N
2•j
.
Using (3.18), we see that asymptotically $\hat\sigma_A^2$, $\hat\sigma_B^2$, and $\hat\sigma_E^2$ are jointly normal. With some algebra, we see that the mean and approximate covariance are
\[
\begin{pmatrix}\sigma_A^2 \\ \sigma_B^2 \\ \sigma_E^2\end{pmatrix}
\quad\text{and}\quad
\begin{pmatrix}
\sigma_A^4(\kappa_A + 2)\dfrac{\sum_i N_{i\bullet}^2}{N^2} & \sigma_E^4(\kappa_E + 2)\dfrac{1}{N} & -\sigma_E^4(\kappa_E + 2)\dfrac{1}{N} \\[8pt]
* & \sigma_B^4(\kappa_B + 2)\dfrac{\sum_j N_{\bullet j}^2}{N^2} & -\sigma_E^4(\kappa_E + 2)\dfrac{1}{N} \\[8pt]
* & * & \sigma_E^4(\kappa_E + 2)\dfrac{1}{N}
\end{pmatrix}.
\]
Bibliography
[1] Stitch Fix, Inc. https://www.stitchfix.com/. Accessed: 2017-04-08.
[2] Apache Software Foundation. Hadoop. https://hadoop.apache.org, March 2017. Version
2.8.0.
[3] Apache Software Foundation. Apache Spark. https://spark.apache.org/, May 2017. Version
2.1.1.
[4] Subhash C. Bagui, Dulal K. Bhaumik, and K. L. Mehra. A few counter examples useful in
teaching central limit theorems. The American Statistician, 67(1):49–56, 2013.
[5] Douglas Bates. Computational methods for mixed models. Technical report, Department of
Statistics, University of Wisconsin–Madison, November 2014. https://cran.r-project.org/
web/packages/lme4/vignettes/Theory.pdf.
[6] Douglas Bates. Linear mixed-effects models in Julia. https://github.com/dmbates/
MixedModels.jl, 2016.
[7] Douglas Bates, Martin Mächler, Ben Bolker, and Steve Walker. Fitting linear mixed-effects
models using lme4. Journal of Statistical Software, 67(1):1–48, 2015.
[8] James Bennett and Stan Lanning. The Netflix prize. In Proceedings of KDD Cup and Workshop
2007, 2007.
[9] Andreas Buja, Trevor Hastie, and Robert Tibshirani. Linear smoothers and additive models.
The Annals of Statistics, 17(2):453–510, 1989.
[10] Tony F. Chan, Gene H. Golub, and Randall J. LeVeque. Algorithms for computing the sample
variance: Analysis and recommendations. The American Statistician, 37(3):242–247, 1983.
[11] D. Clayton and J. Rasbash. Estimation in large cross random-effect models by data augmenta-
tion. Journal of the Royal Statistical Society: Series A (Statistics in Society), 162(3):425–436,
1999.
[12] A. P. Dempster, N. M. Laird, and D. B. Rubin. Maximum likelihood from incomplete data
via the EM algorithm. Journal of the Royal Statistical Society. Series B (Methodological),
39(1):1–38, 1977.
[13] Katelyn Gao and Art Owen. Efficient moment calculations for variance components in large
unbalanced crossed random effects models. Electronic Journal of Statistics, 11(1):1235–1296,
2017.
[14] Katelyn Gao and Art B. Owen. Estimation and inference for very large linear mixed effects
models. Arxiv e-prints, 2016. http://arxiv.org/abs/1610.08088v2.
[15] Andrew Gelman, David A. Van Dyk, Zaiying Huang, and John W. Boscardin. Using redundant
parameterizations to fit hierarchical models. Journal of Computational and Graphical Statistics,
17(1), 2012.
[16] Stuart Geman and Donald Geman. Stochastic relaxation, Gibbs distributions, and the Bayesian
restoration of images. IEEE Transactions on Pattern Analysis and Machine Intelligence, 6:721–
741, 1984.
[17] H.-U. Graser, S.P. Smith, and B. Tier. A derivative-free approach for estimating variance
components in animal models by restricted maximum likelihood. Journal of Animal Science,
64(5):1362–1370, 1987.
[18] Zhonghua Gu. Model diagnostics for generalized linear mixed models. PhD thesis, University
of California, Davis, 2008.
[19] William W. Hager. Updating the inverse of a matrix. SIAM Review, 31(2):221–239, 1989.
[20] Martin Hairer, Andrew M. Stuart, and Sebastian J. Vollmer. Spectral gaps for a Metropolis
Hastings algorithm in infinite dimensions. The Annals of Applied Probability, 24(6):2455–2490,
2014.
[21] Peter Hall and Christopher C Heyde. Martingale limit theory and its application. Academic
press, 1980.
[22] Herman O. Hartley and Jon N.K. Rao. Maximum-likelihood estimation for the mixed analysis
of variance model. Biometrika, 54(1-2):93–108, 1967.
[23] Charles R. Henderson. Estimation of variance and covariance components. Biometrics,
9(2):226–252, 1953.
[24] Jiming Jiang. Consistent estimators in generalized linear mixed models. Journal of the American
Statistical Association, 93(442):720–729, 1998.
[25] Fredrik Johansson. mpmath: a Python library for arbitrary-precision floating-point arithmetic
(version 0.14), 2015. http://mpmath.org.
[26] Peter Kevei. A note on asymptotics of linear combinations of iid random variables. Periodica
Mathematica Hungarica, 60(1):25–36, 2010.
[27] Davar Khoshnevisan. Quadratic forms, lecture notes for Math 6010-1. http://www.math.utah.
edu/~davar/math6010/2014/QuadraticForms.pdf, 2014. Accessed: 2017-05-30.
[28] Nan Laird, Nicholas Lange, and Daniel Stram. Maximum likelihood computations with repeated
measures: Application of the EM algorithm. Journal of the American Statistical Association,
82(397):97–105, 1987.
[29] Last.fm. Last.fm dataset – 360k users. http://ocelma.net/MusicRecommendationDataset/
lastfm-360K.html, 2010. http://www.last.fm/.
[30] Paul J. Lavrakas. Encyclopedia of Survey Research Methods: A-M., volume 1. Sage, 2008.
[31] Mary J. Lindstrom and Douglas M. Bates. Newton-Raphson and EM algorithms for linear
mixed-effects models for repeated-measures data. Journal of the American Statistical Associa-
tion, 83(404):1014–1022, 1988.
[32] Jun S. Liu. Monte Carlo Strategies in Scientific Computing. Springer New York, 2004.
[33] James D. Malley. Optimal unbiased estimation of variance components, volume 39 of Lecture
Notes in Statistics. Springer Verlag, Berlin, 1986.
[34] Albert W. Marshall and Ingram Olkin. Matrix versions of the Cauchy and Kantorovich inequal-
ities. Aequationes Mathematicae, 40(1):89–93, 1990.
[35] Peter McCullagh. Resampling and exchangeable arrays. Bernoulli, 6(2):285–301, 2000.
[36] Art B. Owen. The pigeonhole bootstrap. The Annals of Applied Statistics, 1(2):386–411, 2007.
[37] Art B. Owen and Dean Eckles. Bootstrapping data arrays of arbitrary order. The Annals of
Applied Statistics, 6(3):895–927, 2012.
[38] Philippe Pebay. Formulas for robust, one-pass parallel computation of covariances and arbitrary-
order statistical moments. Technical Report SAND2008-6212, Sandia National Laboratories,
2008.
[39] Michael J.D. Powell. The BOBYQA algorithm for bound constrained optimization without
derivatives. Technical Report NA06, DAMTP, University of Cambridge, 2009.
[40] Stephen W. Raudenbush. A crossed random effects model for unbalanced data with applications
in cross-sectional and longitudinal research. Journal of Educational and Behavioral Statistics,
18(4):321–349, 1993.
[41] Gareth O. Roberts and Jeffrey S. Rosenthal. Optimal scaling for various Metropolis Hastings
algorithms. Statistical Science, 16(4):351–367, 2001.
[42] Gareth O. Roberts and Sujit K. Sahu. Updating schemes, correlation structure, blocking and
parameterization for the Gibbs sampler. Journal of the Royal Statistical Society. Series B
(Methodological), 59(2):291–317, 1997.
[43] Shayle R. Searle. Another look at Henderson’s methods of estimating variance components.
Biometrics, 24(4):749–787, 1968.
[44] Shayle R. Searle, George Casella, and Charles E. McCulloch. Variance components. John Wiley
& Sons, 2006.
[45] Robert J. Serfling. Approximation Theorems of Mathematical Statistics. Wiley Series in Prob-
ability and Statistics. Wiley, 2009.
[46] Tom A.B. Snijders. Multilevel analysis. In Miodrag Lovric, editor, International Encyclopedia
of Statistical Science, pages 879–882. Springer Berlin Heidelberg, 2011.
[47] David A. Van Dyk and Xiao-Li Meng. The art of data augmentation. Journal of Computational
and Graphical Statistics, 10(1):1–50, 2001.
[48] Matthew Wall. Big Data: Are you ready for blast-off? http://www.bbc.com/news/
business-26383058, March 2014. Accessed: 2017-04-08.
[49] Yahoo!-Webscope. Dataset ydata-ymovies-user-movie-ratings-train-v1 0, 2015. http://
research.yahoo.com/Academic_Relations.
[50] Yahoo!-Webscope. Dataset ydata-ymusic-rating-study-v1 0-train, 2015. http://research.
yahoo.com/Academic_Relations.
[51] Yaming Yu and Xiao-Li Meng. To center or not to center: That is not the question – an
ancillarity–sufficiency interweaving strategy (ASIS) for boosting MCMC efficiency. Journal of
Computational and Graphical Statistics, 20(3):531–570, 2011.