Using stochastic approximation techniques to efficiently construct

confidence intervals for heritability

November 1, 2016

Regev Schweiger1, Eyal Fisher2, Elior Rahmani1, Liat Shenhav2, Saharon Rosset2 and Eran Halperin3,4

1 Blavatnik School of Computer Science, Tel Aviv University, Tel Aviv, Israel
2 School of Mathematical Sciences, Department of Statistics, Tel Aviv University, Tel Aviv, Israel
3 Department of Computer Science, University of California, Los Angeles, CA, USA
4 Department of Anesthesiology and Perioperative Medicine, University of California, Los Angeles, CA, USA

Abstract

Estimation of heritability is an important task in genetics. The use of linear mixed models (LMMs) to determine narrow-sense SNP-heritability and related quantities has received much recent attention, due to its ability to account for variants with small effect sizes. Typically, heritability estimation under LMMs uses the restricted maximum likelihood (REML) approach. The common way to report the uncertainty in REML estimation uses standard errors (SEs), which rely on asymptotic properties. However, the underlying assumptions are often violated because of the bounded parameter space, statistical dependencies, and limited sample size, leading to biased estimates and inflated or deflated confidence intervals. In addition, for larger datasets (e.g., tens of thousands of individuals), the construction of SEs itself may require considerable time, as it requires expensive matrix inversions and multiplications.

Here, we present FIESTA (Fast confidence IntErvals using STochastic Approximation), a method for constructing accurate confidence intervals (CIs). FIESTA is based on parametric bootstrap sampling, and therefore avoids unjustified assumptions on the distribution of the heritability estimator. FIESTA uses stochastic approximation techniques, which accelerate the construction of CIs by several orders of magnitude, compared to previous approaches as well as to the analytical approximation used by SEs. FIESTA builds accurate CIs rapidly, e.g., requiring only several seconds for datasets of tens of thousands of individuals, making FIESTA a very fast solution to the problem of building accurate CIs for heritability for all dataset sizes.

Introduction

Heritability, or the proportion of phenotypic variation that is explained by genetic variation, is an important population parameter in human genetics, in evolution, in plant and animal breeding, and more. Estimating the heritability has been traditionally performed using related individuals such as in twin studies or pedigree designs [1–3]. More recently, genetic variation has been estimated using genetic marker information, and in particular in genome-wide association studies (GWAS) [4, 5], which have identified thousands of genetic variants that are associated with dozens of common diseases. However, genome-wide significant associations were generally found to explain only a small proportion of the heritability of complex diseases.

To cope with this challenge, linear mixed model (LMM) approaches [6–13] have been applied to estimate the heritability explained by common SNPs (the narrow-sense SNP-heritability, to which we refer as heritability, and denote by $h^2$) from cohorts of unrelated individuals, such as those found in GWAS [14]. Estimation under the LMM is usually performed using restricted maximum likelihood (REML) estimation, and is implemented in some widely used tools, like the GCTA software package [15]. LMMs utilize all variants from a GWAS, and not just the variants that are statistically significant, and are therefore able to account for variants with small effect sizes.

As in any statistical analysis, the process of estimating the heritability suffers from statistical uncertainty. Typically, confidence intervals (CIs) are reported alongside point estimates to quantify this uncertainty. Usually, such CIs are constructed from standard errors (SEs), which make the assumption that the estimators asymptotically follow a normal distribution. However, it has been shown [13, 16–20] that such CIs can be highly inaccurate. This is because estimators do not necessarily obey the conditions required for them to asymptotically follow the normal distribution. Additionally, these CIs may spread beyond the natural boundaries of their parameters, e.g., including negative values for heritability. As a result, these CIs are often inaccurate, difficult to interpret, or lead to erroneous conclusions.

To handle these issues, previous approaches have taken several directions. Non-standard asymptotic theory for boundary and near-boundary maximum likelihood estimates has been developed (e.g., [21–23]), and it has been suggested to replace the asymptotic normality assumption with the asymptotics developed for the non-standard boundary case [24]. Visscher et al. [25] derived an analytical expression for the asymptotic variance of the heritability estimator in a range of pedigree- and marker-based experimental designs. Unfortunately, these conditions typically do not hold for genomic datasets, mainly due to the limited sample size, making either of these approximations ineffective. Other approaches include hierarchical bootstrapping schemes, e.g., [26]; extending the REML estimation method with Bayesian priors, e.g., [27, 28]; using alternative statistics as a basis for building CIs [17, 29, 30]; or using the Bayesian posterior distribution of the heritability value [31].

An alternative approach is the parametric bootstrap test inversion technique, which constructs CIs via sampling phenotypes, performing heritability estimation on the sampled phenotypes, estimating the distribution of the heritability estimator and using these estimates as a basis for CI construction [32]. The main advantage of using a parametric bootstrap approach is that it does not require any assumptions on the distribution of the heritability estimator or of Bayesian priors. As a naive implementation of this approach would be computationally prohibitive, the ALBI method [20] utilizes a highly accurate approximation that allows an efficient construction of accurate CIs. However, ALBI still requires a preprocessing step. Newer datasets (e.g., the UK Biobank [33]) may contain tens or hundreds of thousands of individuals, for which this step may require hours of computation time. In addition, the need for a preprocessing step can be an obstacle in the adoption of a better CI construction method.

In this paper, we introduce FIESTA (Fast confidence IntErvals using STochastic Approximation), which reduces the running time of CI construction by several orders of magnitude compared to previous approaches, e.g., to mere seconds for datasets with tens of thousands of individuals. The key ingredient of our approach is a CI construction algorithm from the field of stochastic approximation (for a review, see [34]). Originating in the work of Robbins and Monro [35], stochastic approximation algorithms are recursive update rules that can be used, among other things, to solve optimization problems or function inversion problems when the collected data is subject to noise. It has been shown [36] that stochastic approximation can be used to construct CIs for general families of parametric distributions, given the ability to randomly sample from them, and this is the approach we employ here. We validate FIESTA on two real datasets, the Northern Finland Birth Cohort (NFBC) dataset [37] and the Wellcome Trust Case Control Consortium 2 (WTCCC2) dataset [38].

In addition to the significant speedup, FIESTA requires no preprocessing step beyond calculating the eigendecomposition of the kinship matrix, which is usually already performed as a part of heritability estimation. Finally, we show that FIESTA is even significantly faster than the analytical SE formulation. In summary, FIESTA can be used easily to rapidly generate accurate CIs for REML heritability estimates. FIESTA will be available as part of the ALBI toolkit at https://github.com/cozygene/albi at the time of publication.

Results

A faster method for calculating CIs for heritability

CIs constructed from standard errors, which are based on the assumption of a normal distribution for the heritability estimators, were previously shown to be inaccurate [13, 16–20]. In this paper, we introduce FIESTA, a method that generates accurate CIs for $h^2$, the true heritability value, given $\hat{h}^2$, the restricted maximum likelihood (REML) estimator for $h^2$ (see Methods). FIESTA uses the principle of test inversion to construct accurate CIs, using a stochastic approximation method that directly estimates the CI boundaries. We review FIESTA below; for a full description, see Methods.

The methodology of test inversion can be described as follows. The estimator $\hat{h}^2$ is a function of the phenotype, which is a random variable whose distribution depends on $h^2$, assuming a fixed kinship matrix. Therefore, $\hat{h}^2$ is distributed differently for every value of $h^2$. For each true value of $h^2$, we select a subset of possible $\hat{h}^2$ values that has a sampling probability of $1 - \alpha$, where $\hat{h}^2$ is distributed under the assumption of a true heritability value $h^2$. We define this subset to be the acceptance region for that value of $h^2$. The CI accompanying an estimate $\hat{h}^2$ is the interval containing all values of $h^2$ whose acceptance region includes $\hat{h}^2$, namely, those for which $\hat{h}^2$ does not imply the rejection of the null hypothesis that the true heritability value is $h^2$, with a significance level of $\alpha$.

It remains to define suitable acceptance regions; our scheme for doing so is reviewed in the Methods section. A basic ingredient of our construction of acceptance regions is inverting certain quantile functions of the distribution of $\hat{h}^2$, as a function of $h^2$. For example, finding the inverse of a value $H^2$ of the 95%-quantile function is finding a heritability value $h^2$ for which $\Pr_{h^2}(\hat{h}^2 \le H^2) = 0.95$, i.e., the probability of getting a heritability estimate of $H^2$ or below is precisely 95%, when $\hat{h}^2$ is distributed with the true heritability value $h^2$.

Instead of carrying out this task by a full parametric bootstrap estimate of the distribution of the estimator, we employ a technique from the field of stochastic approximation to achieve the same results at a fraction of the computational cost. The modified Robbins-Monro procedure [39], described in the Methods section, is an iterative method that finds the inverse of the quantile function of a one-parameter distribution. It operates by iteratively (1) drawing a sample with a true heritability value equal to our current guess for the required inverse value; (2) comparing its estimated heritability to $H^2$; and (3) updating our current guess accordingly, by moving in the right direction, with a step size that decreases with the number of iterations. An additional speedup is acquired by using a fast method to calculate the derivative of the likelihood of the sample, and using the derivative to compare its estimated heritability to $H^2$, instead of performing the full likelihood maximization.

[Figure 1: two panels, NFBC (left) and WTCCC2 (right). x axis: estimated value $\hat{h}^2$ (0 to 1); y axis: true value of $h^2$ (0 to 1).]

Figure 1 95% CIs for the NFBC and WTCCC2 datasets. Accurate 95% CIs constructed for the NFBC dataset [37] (left) and the WTCCC2 dataset [38] (right) by FIESTA. For each $\hat{h}^2$ on a fine grid of 1000 values (x axis), we constructed a CI, whose boundaries are shown (y axis). For example, for $\hat{h}^2 = 0.5$ (denoted by a dashed line), the CI for NFBC is [0.282, 0.705] (denoted by a full line).

We applied FIESTA to construct 95% CIs for the NFBC dataset [37] and the WTCCC2 dataset [38], as seen in Figure 1. We then turned to verify the accuracy of these CIs, which can be measured as follows. Draw multiple phenotype vectors from the distribution assumed by the LMM with parameters that correspond to a true heritability value $h^2$. From each such phenotype, construct a CI for its estimated heritability with a confidence level of, e.g., 95%. Then, check the percentage of times in which the CI covered $h^2$, as a function of $h^2$. If the constructed CIs are accurate, they should cover the true underlying $h^2$ 95% of the time. We measured the accuracy of FIESTA, with CIs designed to have a coverage of 95%. The results are shown in Figure 2, demonstrating that FIESTA accurately achieves the desired confidence levels.

Benchmarks

We compared the speed of the stochastic approximation approach, implemented in FIESTA, with that of using the parametric bootstrap for estimating the distribution of the heritability estimator. The latter was tested either as implemented naively, using GCTA [15] or pylmm [40], or by using ALBI [20]. Both approaches require the calculation of the eigendecomposition of the kinship matrix. As this is often already a part of the heritability estimation algorithm, its calculation time is excluded from the benchmarks. In the Discussion, we discuss how this step could be avoided altogether.

[Figure 2: two panels, NFBC (left) and WTCCC2 (right). x axis: true value of $h^2$ (0 to 1); y axis: coverage probability of the 95% CI (90% to 100%).]

Figure 2 Accuracy of CIs for the NFBC and WTCCC2 datasets. The coverage probabilities of the FIESTA CIs. The coverage probabilities are shown for CIs designed to have coverage probabilities of 95%. The CIs achieve accurate coverage.

One difference between the approaches is that the bootstrap approach performs a lengthy preprocessing step that estimates many distributions. Once these distributions are estimated, constructing a CI is very rapid. In contrast, the stochastic approximation approach does not perform a preprocessing step, but performs a non-trivial calculation per CI.

The construction of a single CI with FIESTA consists of calculating six to eight values using the modified Robbins-Monro procedure (see Methods). The first four values ($s$, $t$, $s^*$ and $t^*$ in the Methods) depend only on the kinship matrix, and not on the heritability estimate for which we construct a CI, so they need to be calculated only once per kinship matrix, and can then be shared between several CIs. Each modified Robbins-Monro run has a complexity of $O(nT)$, where $n$ is the number of individuals in the sample and $T$ is the number of iterations (on the order of 1,000; see Methods). Therefore, in total, the time complexity of constructing $K$ CIs with FIESTA grows linearly with $K$, $T$ and $n$.

We also compared FIESTA to the performance of the analytical SE approach. While often inaccurate, analytical SEs are the go-to method for many practitioners: first, their calculation is conceptually easy to understand, since a closed-form formula exists for the SEs (see Appendix A1); second, using a closed-form expression is often perceived as faster than more involved algorithmic procedures. However, this is not the case for heritability estimation, as SEs are calculated using variants of the Fisher information matrix (e.g., the AI matrix, as in GCTA [15]), whose calculation requires matrix-by-vector multiplications, which are $O(n^2)$. In contrast, FIESTA is linear in $n$, giving it an advantage for larger datasets in particular.

We performed a benchmark to evaluate FIESTA, using the NFBC and WTCCC2 datasets. We estimated the distributions of $\hat{h}^2$ for $h^2 = 0, 0.01, \ldots, 1$, with GCTA [15] and pylmm [40], both of which perform full estimation, using 1,000 random bootstrap samples. For the same task, we also used ALBI [20], at a grid resolution of 0.001. As explained above, the time of construction of CIs given these distributions is negligible relative to the time required for their estimation. We also constructed analytical SEs for both datasets using the AI method (Appendix A1). These times are reported in Table 1.

As a comparison, we used FIESTA to construct varying numbers of CIs, using 1,000 iterations in the modified Robbins-Monro procedure (see Methods). In Table 1, it can be seen that FIESTA is significantly faster, particularly when few CIs are needed. We also note that FIESTA is currently implemented in the Python language, using the numpy package; a significant additional speedup can be obtained by migrating to a compiled language, e.g., C++.

We then investigated the stability of CI construction and its dependency on the number of iterations. We ran FIESTA 100 times to construct CIs for the NFBC and WTCCC2 datasets using 200, 500, 1,000 or 2,000 iterations. We measured the variance in the constructed CI endpoints (Table 2). As expected, the variance decreases with the number of iterations. In addition, we measured the mean and variance of the coverage of CIs under a grid of true heritability values. Here, also, we observed that the variance of coverage decreases with the number of iterations. We note that 500 iterations are sufficient for reasonably accurate CIs for these datasets, and that the coverage with even 200 iterations is only slightly biased downwards.

| Algorithm | Time for NFBC | Time for WTCCC2 |
|---|---|---|
| GCTA | > 30 days | > 30 days |
| pylmm | 3.8 hours | > 8 days |
| ALBI | 5.35 minutes | 2.5 hours |
| Analytical SEs | ∼3.1 sec × # of CIs | ∼6.2 min × # of CIs |
| Analytical SEs, 1 CI | ∼3 seconds | ∼6 minutes |
| Analytical SEs, 5 CIs | ∼15 seconds | ∼31 minutes |
| Analytical SEs, 10 CIs | ∼31 seconds | ∼1 hour |
| Analytical SEs, 50 CIs | ∼2.6 minutes | ∼5 hours |
| FIESTA | ∼1.8 sec + 0.6 sec × # of CIs | ∼6 sec + 2.8 sec × # of CIs |
| FIESTA, 1 CI | ∼3 seconds | ∼9 seconds |
| FIESTA, 5 CIs | ∼6 seconds | ∼20 seconds |
| FIESTA, 10 CIs | ∼8 seconds | ∼34 seconds |
| FIESTA, 50 CIs | ∼33 seconds | ∼2.4 minutes |

Table 1 Benchmarks. Running times of FIESTA, compared with previous methods (see Results for more details). Running times are reported for the NFBC (2,520 individuals) and WTCCC2 (13,950 individuals) datasets.
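As a rough illustration of how the running times in Table 1 scale with the number of CIs, the following sketch evaluates the approximate linear forms reported in the table. The function and variable names are ours and serve only as an illustration; the coefficients are the rounded values from Table 1, so the outputs are estimates rather than measurements, and reading the constant term as a cost shared across CIs is our interpretation.

```python
# Approximate running-time models read off Table 1 (in seconds).
# Names and structure are illustrative and not part of FIESTA.
FIESTA_MODEL = {"NFBC": (1.8, 0.6), "WTCCC2": (6.0, 2.8)}  # (constant, per CI)
SE_MODEL = {"NFBC": 3.1, "WTCCC2": 6.2 * 60.0}             # per CI


def fiesta_seconds(dataset, num_cis):
    """Constant term plus a per-CI term, as in the FIESTA rows of Table 1."""
    constant, per_ci = FIESTA_MODEL[dataset]
    return constant + per_ci * num_cis


def analytical_se_seconds(dataset, num_cis):
    """Analytical SEs have no constant term in Table 1: time is per CI."""
    return SE_MODEL[dataset] * num_cis


for dataset in ("NFBC", "WTCCC2"):
    for k in (1, 5, 10, 50):
        print(dataset, k,
              round(fiesta_seconds(dataset, k), 1),
              round(analytical_se_seconds(dataset, k), 1))
```

Under these approximate models, the per-CI cost of FIESTA is lower than that of the analytical SEs for both datasets, so the gap widens as more CIs are constructed.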

| Dataset | NFBC | NFBC | NFBC | NFBC | WTCCC2 | WTCCC2 | WTCCC2 | WTCCC2 |
|---|---|---|---|---|---|---|---|---|
| No. of iterations | 200 | 500 | 1,000 | 2,000 | 200 | 500 | 1,000 | 2,000 |
| CI lower point SE | 0.0201 | 0.0132 | 0.0094 | 0.0067 | 0.0050 | 0.0032 | 0.0023 | 0.0016 |
| CI upper point SE | 0.0206 | 0.0133 | 0.0096 | 0.0070 | 0.0050 | 0.0031 | 0.0023 | 0.0016 |
| Mean coverage | 94.20% | 94.71% | 94.87% | 94.95% | 94.720% | 95.217% | 95.323% | 95.373% |
| SE of coverage | 0.45% | 0.34% | 0.30% | 0.28% | 0.781% | 0.575% | 0.486% | 0.442% |

Table 2 Stability of CI construction. 95% CIs for the NFBC and WTCCC2 datasets were constructed 100 times, with either 200, 500, 1,000 or 2,000 iterations. CIs were constructed for $\hat{h}^2 = 0, 0.001, \ldots, 1$. In order to assess the variance of the construction process, the mean empirical standard error (SE) of the lower and upper endpoints is reported, where the mean was calculated over all non-constant endpoints, across all $\hat{h}^2$ values. In addition, the CI coverage for $h^2 = 0, 0.01, \ldots, 1$ was calculated as in Figure 2. The average mean and SE across all 100 runs, calculated across all $h^2$, is reported.

Methods

For clarity of presentation, we begin by defining the heritability under the LMM, and briefly reviewing stochastic approximation and its relevance to finding CIs. Finally, we introduce FIESTA, our improved method for faster construction of CIs for heritability.

The linear mixed model and REML

We consider the following standard linear mixed model (see [41] for a detailed review). Let $n$ be the number of individuals and $m$ the number of SNPs. Let $y$ be an $n \times 1$ vector of phenotype measurements for each individual. Let $X$ be an $n \times p$ matrix of $p$ covariates (possibly including an intercept vector $1_n$ as a first column, as well as other covariates such as sex, age, etc.). Let $Z$ be the $n \times m$ standardized genotype matrix, i.e., columns have zero mean and unit variance. Let $\beta$ be a $p \times 1$ vector of fixed effects, $s$ an $m \times 1$ vector of random effects, and $e$ an $n \times 1$ vector of errors. Then, $y = X\beta + Zs + e$. We assume $s$ and $e$ are statistically independent and are distributed normally as $s \sim N\left(0_m, \frac{1}{m}\sigma_g^2 I_m\right)$, $e \sim N\left(0_n, \sigma_e^2 I_n\right)$. The fixed effects $\beta$ and the coefficients $\sigma_g^2$ and $\sigma_e^2$ are the parameters of the model.

Define $K = \frac{1}{m} Z Z^T$. $K$ is commonly called the kinship matrix, or the genetic relationship matrix. Under these conditions, it follows [14] that:

$$
y \sim N\left(X\beta,\ \sigma_g^2 K + \sigma_e^2 I_n\right). \tag{1}
$$
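To make the model concrete, the following minimal sketch simulates a phenotype from the LMM defined above. The toy sizes, the allele-frequency range, the variance components and the intercept-only design are all assumptions chosen for illustration; they are not taken from the datasets analyzed here.

```python
import numpy as np

rng = np.random.default_rng(0)
n, m = 200, 1000                # individuals and SNPs (toy sizes, assumed)
sigma2_g, sigma2_e = 0.6, 0.4   # variance components (assumed)

# Toy genotypes: allele counts drawn with hypothetical allele frequencies.
freqs = rng.uniform(0.1, 0.9, size=m)
G = rng.binomial(2, freqs, size=(n, m)).astype(float)

# Standardize each column to zero mean and unit variance.
Z = (G - G.mean(axis=0)) / G.std(axis=0)

# Kinship matrix K = Z Z^T / m.
K = Z @ Z.T / m

# y = X beta + Z s + e, with an intercept-only X.
X = np.ones((n, 1))
beta = np.array([1.0])
s = rng.normal(0.0, np.sqrt(sigma2_g / m), size=m)
e = rng.normal(0.0, np.sqrt(sigma2_e), size=n)
y = X @ beta + Z @ s + e

# The heritability implied by the chosen variance components.
h2 = sigma2_g / (sigma2_g + sigma2_e)
print(K.shape, h2)
```

Because the columns of $Z$ are standardized, the diagonal of $K$ averages to 1 and $K 1_n = 0$, which is why an intercept removes one eigenvalue of $K$ from the analysis.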

The narrow-sense heritability due to genotyped common SNPs is defined as the proportion of total variance explained by genetic factors [42]:

$$
h^2 = \frac{\sigma_g^2}{\sigma_g^2 + \sigma_e^2}.
$$

Defining $\sigma_p^2 = \sigma_g^2 + \sigma_e^2$, Equation (1) becomes $y \sim N\left(X\beta, \sigma_p^2 V_{h^2}\right)$, where $V_{h^2} = h^2 K + (1 - h^2) I_n$.

The most common way of estimating $h^2$ is restricted maximum likelihood (REML) estimation. REML consists of maximizing the likelihood function associated with the projection of the phenotype onto the subspace orthogonal to that of the fixed effects of the model [43]. In [20], it is shown that the distribution of $\hat{h}^2$ depends only on $h^2$, and is invariant under changes to $\sigma_p^2$ and $\beta$. We may therefore limit our study to the $\hat{h}^2$ estimator alone, in the special case of fixed $\sigma_p^2 = 1$ and $\beta = 0_p$, which substantially simplifies the problem; namely, we may focus on properties of the distribution $N(0_n, V_{h^2})$ instead of the more general $N\left(X\beta, \sigma_p^2 V_{h^2}\right)$.

Confidence intervals for h2

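We wish to build confidence intervals with a coverage probability of $1 - \alpha$ (e.g., 95%). The full derivation is developed in [20], and is reviewed in Appendix A2; we cite the results here.

Let $c_\beta(h^2)$ be the $\beta$-th quantile function of $\hat{h}^2$, when the true heritability is $h^2$; i.e., $\Pr_{h^2}\left(\hat{h}^2 \le c_\beta(h^2)\right) = \beta$.

Define $s$ and $t$ to be the values for which $\Pr_{h^2=s}(\hat{h}^2 = 0) = \alpha/2$ and $\Pr_{h^2=t}(\hat{h}^2 = 1) = \alpha/2$. In addition, let $s^* = c_{1-\alpha}(0)$, $t^* = c_{\alpha}(1)$. Then the lower and upper CI boundaries for an estimate $H^2$ are given, respectively, by

$$
l_{H^2} =
\begin{cases}
0 & \text{if } H^2 \le s^* \\
c^{-1}_{1-\alpha}(H^2) & \text{if } c^{-1}_{1-\alpha}(H^2) < s \\
s & \text{if } s \in \left[c^{-1}_{1-\alpha/2}(H^2),\ c^{-1}_{1-\alpha}(H^2)\right] \\
c^{-1}_{1-\alpha/2}(H^2) & \text{if } s < c^{-1}_{1-\alpha/2}(H^2)
\end{cases}
\tag{2}
$$

and

$$
u_{H^2} =
\begin{cases}
c^{-1}_{\alpha/2}(H^2) & \text{if } c^{-1}_{\alpha/2}(H^2) < t \\
t & \text{if } t \in \left[c^{-1}_{\alpha}(H^2),\ c^{-1}_{\alpha/2}(H^2)\right] \\
c^{-1}_{\alpha}(H^2) & \text{if } t < c^{-1}_{\alpha}(H^2) \\
1 & \text{if } t^* \le H^2 .
\end{cases}
\tag{3}
$$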

Using stochastic approximation to calculate CIs

Robbins-Monro. Stochastic approximation methods are a family of iterative stochastic optimization algorithms that attempt to find zeroes, inverses or extrema of functions which cannot be computed directly, but only estimated via noisy observations. The classical Robbins-Monro algorithm presents a methodology for solving a function inversion problem, where the function is the expected value of a parametrized family of distributions. Namely, a function $g(\theta)$ is given, for which we want to find an inverse, i.e., a value $\theta$ for which $g(\theta) = C$, for some constant $C$. However, the function $g$ is not directly available to us; rather, we are only able to obtain noisy observations from it. The Robbins-Monro procedure is a modification of Newton's method, where the step sizes are instead an appropriately decreasing sequence. Starting with an initial guess, $\theta_0$, at iteration $n$ we obtain a noisy sample $y_n$ from a distribution whose mean is $g(\theta_n)$, and update our estimate with

$$
\theta_{n+1} = \theta_n - \gamma_n \cdot (y_n - C),
$$

where $\gamma_n = 1/n$. The Robbins-Monro procedure is shown to converge to the correct solution when: (i) the random variables defining our sampling process at each $g(\theta)$ are uniformly bounded; (ii) $g(\theta)$ is nondecreasing; and (iii) $g'(\theta)$ exists and is positive [35].
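The update rule can be illustrated with a minimal sketch on a toy problem. The function name `robbins_monro`, the choice $g(\theta) = \theta$ and the bounded uniform noise are assumptions made only for this example; the iteration counter starts at 1 so that $\gamma_n = 1/n$ is defined.

```python
import random


def robbins_monro(noisy_g, target, theta0, num_iters):
    """Iterate theta_{n+1} = theta_n - (1/n) * (y_n - target)."""
    theta = theta0
    for n in range(1, num_iters + 1):
        y_n = noisy_g(theta)         # noisy observation with mean g(theta)
        theta -= (y_n - target) / n  # step size gamma_n = 1/n
    return theta


# Toy problem (assumed): g(theta) = theta, observed with bounded uniform
# noise, so the solution of g(theta) = 2 is theta = 2.
rng = random.Random(0)
solution = robbins_monro(lambda th: th + rng.uniform(-1.0, 1.0),
                         target=2.0, theta0=0.0, num_iters=10_000)
print(solution)
```

In this particular toy case $g'(\theta) = 1$, so the iterate reduces to the target minus the running mean of the noise, which makes the role of the $1/n$ step size easy to see.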

Using Robbins-Monro to calculate CIs. Garthwaite and Buckland [36] have used the Robbins-Monro process for finding the endpoints of CIs, as we will now describe. We discuss the case of one-sided CIs, but the application to two-sided CIs is immediate.

Suppose that $[0, u_\theta)$ is the one-sided $1 - \alpha$ CI for $\theta$, when data $y$ has been observed, with an estimate $\hat{\theta} = \hat{\theta}(y)$. Then, the correct endpoint satisfies

$$
\Pr_{\theta = u_\theta}\left(\hat{\theta} \le \hat{\theta}(y)\right) = \alpha .
$$

If we define $g(\theta) = \Pr_{\theta}\left(\hat{\theta} \ge \hat{\theta}(y)\right)$ (to make it nondecreasing), then finding $u_\theta$ is equivalent to finding the inverse of $g$ at $1 - \alpha$. However, under these settings, we do not have direct access to $g$. Rather, we sample a binary random variable $Y_\theta$, indicating that a sample $y_\theta$ randomly drawn from the distribution with parameter $\theta$ has an estimate $\hat{\theta}(y_\theta)$ at least as large as $\hat{\theta}(y)$. By definition, $\Pr_\theta(Y_\theta = 1) = \Pr_\theta\left(\hat{\theta}(y_\theta) \ge \hat{\theta}(y)\right) = g(\theta)$, so the random sample $Y_\theta$ has a mean of $g(\theta)$. Effectively, this formulation allows us to use the Robbins-Monro procedure to invert the quantile function as a function of $\theta$. Full asymptotic efficiency can be achieved by multiplying the step size $\gamma_n$ by some constant $c$.

In detail, denote by $y_n$ a random sample from the random variable $Y_{\theta_n}$. The update rule is $\theta_{n+1} = \theta_n - c\gamma_n \cdot (y_n - (1 - \alpha))$, or explicitly:

$$
\theta_{n+1} =
\begin{cases}
\theta_n - \dfrac{c\alpha}{n} & \text{if } y_n = 1 \\[2ex]
\theta_n + \dfrac{c(1-\alpha)}{n} & \text{if } y_n = 0 .
\end{cases}
$$

The procedure is shown to be fully asymptotically efficient if $c = 1/g'(u_\theta)$. However, as neither $g$ nor $u_\theta$ is known in advance, $c$ is estimated adaptively, using the current estimate $\theta_n$ in place of $u_\theta$, and assuming a parametric form for $g$ [36].
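The endpoint search can be sketched on a toy model in which the exact answer is known. Everything specific here is an assumption for illustration: the model is a single observation from $N(\theta, 1)$ estimated by itself, the constant $c = 1/g'(u_\theta)$ is computed from the known normal density (which is only possible in such a toy case, so no adaptive estimation is shown), and the step counter starts at a hypothetical offset `n0` to damp the first few jumps.

```python
import random
from statistics import NormalDist


def upper_endpoint(estimate, alpha, sample_estimate, c, num_iters, start, n0=20):
    """Search for u with Pr_{theta=u}(theta_hat <= estimate) = alpha.

    sample_estimate(theta) draws data under theta and returns its estimate.
    n0 offsets the step counter (our choice, not taken from the text).
    """
    theta = start
    for n in range(n0, n0 + num_iters):
        y_n = 1 if sample_estimate(theta) >= estimate else 0
        if y_n == 1:
            theta -= c * alpha / n          # estimate too high: move down
        else:
            theta += c * (1.0 - alpha) / n  # estimate too low: move up
    return theta


# Toy model (assumed): the exact endpoint is estimate + z_{1 - alpha}.
rng = random.Random(1)
alpha, estimate = 0.05, 0.3
z = NormalDist().inv_cdf(1.0 - alpha)
c = 1.0 / NormalDist().pdf(z)  # 1 / g'(u), known only in this toy case
u = upper_endpoint(estimate, alpha, lambda th: rng.gauss(th, 1.0),
                   c=c, num_iters=20_000, start=estimate + 1.0)
print(u, estimate + z)
```

The asymmetric steps ($c\alpha/n$ down, $c(1-\alpha)/n$ up) balance exactly when the indicator equals 1 with probability $1 - \alpha$, which is what pins the iterate to the required quantile.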

The modified Robbins-Monro procedure. As mentioned above, if the optimal step size constant is known, this procedure is fully asymptotically efficient. However, it was empirically shown to work poorly for extreme quantiles. Joseph [39] suggested a modification of this procedure, which is tuned to obtain optimal convergence speed. It uses the following update form:

$$
\theta_{n+1} = \theta_n - a_n (y_n - C_n).
$$

Joseph allows the use of a different target value, $C_n$, in each iteration, instead of the required constant, $C$. The step sizes $a_n$ and target values $C_n$ are derived explicitly in [39] to be optimal under a Bayesian analysis framework. As in [36], the optimal step size also uses $g'(u_\theta)$, which is unknown, and a suitable approximation scheme is used. The modified Robbins-Monro procedure achieves significantly faster convergence rates in the case of the estimation of extreme quantiles.

Using the modified Robbins-Monro procedure to obtain CIs for heritability

We now describe how to rapidly construct CIs for heritability. As described above, the first step is to find $s$, $t$, $s^*$ and $t^*$. To find $s$, we employ the modified Robbins-Monro procedure [39], where the parameter of interest is $\theta := h^2$, the function is $g(\theta) := \Pr_{h^2=\theta}(\hat{h}^2 = 0)$ and the inverse value we wish to find corresponds to $C = \alpha/2$. We note that we chose $g$ here to be nonincreasing for the sake of clarity of presentation; to conform with the Robbins-Monro formulation, we would need to redefine $g \to 1 - g$ and $C \to 1 - C$. At a single iteration of the modified Robbins-Monro procedure, we have an estimate $h^2_n$ for $s$, and we need to sample from a distribution whose mean is $\Pr_{h^2_n}(\hat{h}^2 = 0)$. To achieve that, we draw a sample from the distribution corresponding to $h^2_n$, $N(0_n, V_{h^2_n})$, and check if the maximum likelihood estimate for it is 0 (or above). This procedure can be done quickly in $O(n)$, as we now describe, circumventing the need to perform a full likelihood maximization for the sample.

As detailed above, we make repeated use of the following procedure: (1) Draw a random sample $y$ from the distribution corresponding to a given heritability value $h^2$, $N(0_n, V_{h^2})$; (2) Decide whether its heritability estimate, $\hat{h}^2(y)$, is larger than a given value, $H^2$. In [20], it is shown that when $X = 1_n$, these two steps may equivalently be performed by drawing a vector $u$ of i.i.d. standard normal variables, $u \sim N(0_n, I_n)$, and checking if

$$
\sum_{i=1}^{n} \xi_i^{h^2,H^2} u_i^2 > 0 , \tag{4}
$$

where

$$
\xi_i^{h^2,H^2} = \frac{h^2(d_i - 1) + 1}{H^2(d_i - 1) + 1} \left( \frac{d_i - 1}{H^2(d_i - 1) + 1} - \frac{1}{n-1} \sum_{j=1}^{n-1} \frac{d_j - 1}{H^2(d_j - 1) + 1} \right),
$$

for $i = 1, \ldots, n-1$, and $\xi_n^{h^2,H^2} = 0$, with $d_i$ being the eigenvalues of $K$. The sign of the expression in Equation (4) is equal to the sign of $\frac{\partial \ell_{REML}}{\partial h^2}(H^2)$, the derivative of $\ell_{REML}$ at the point $H^2$. Therefore, assuming the restricted likelihood function is well behaved, a positive derivative indicates that the REML heritability estimate is larger than $H^2$. Similar expressions are defined for a general $X$ in [20]. Once the eigendecomposition of $K$ is obtained, this procedure may be performed in a time complexity linear in $n$.
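The check in Equation (4) can be sketched in a few lines of numpy. This is a minimal illustration for the case $X = 1_n$ and is not the FIESTA implementation: the function names are ours, and the eigenvalues are drawn from a hypothetical gamma distribution with mean 1 rather than computed from a real kinship matrix.

```python
import numpy as np


def estimate_exceeds(h2, H2, d, rng):
    """Draw one phenotype under heritability h2 and report whether its REML
    estimate exceeds H2, using the sign of the sum in Equation (4).

    d holds the n-1 eigenvalues of K that enter the sum (case X = 1_n);
    the n-th term is omitted because xi_n = 0.
    """
    w = H2 * (d - 1.0) + 1.0
    r = (d - 1.0) / w
    xi = (h2 * (d - 1.0) + 1.0) / w * (r - r.mean())
    u = rng.standard_normal(d.shape[0])
    return float(np.dot(xi, u * u)) > 0.0


rng = np.random.default_rng(0)
# Hypothetical eigenvalues; a real analysis would take them from the
# eigendecomposition of the kinship matrix.
d = rng.gamma(shape=2.0, scale=0.5, size=499)


def exceed_rate(h2, H2, reps=200):
    """Monte Carlo estimate of Pr_{h2}(h2_hat > H2)."""
    return sum(estimate_exceeds(h2, H2, d, rng) for _ in range(reps)) / reps


rate_high = exceed_rate(0.9, 0.5)
rate_low = exceed_rate(0.1, 0.5)
print(rate_high, rate_low)
```

Each call costs a single pass over the eigenvalues, which is the source of the linear dependence on $n$; the resulting indicator is the binary observation fed to the Robbins-Monro update.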

Similarly, for finding $s^*$, we define the function $g(\theta) := \Pr_{h^2=0}(\hat{h}^2 \le \theta)$, for which we want to find the inverse of $C = 1 - \alpha$. The procedures for finding $t$ and $t^*$ are similar.

The second step involves calculating the quantities $c^{-1}_{\alpha/2}(H^2)$, $c^{-1}_{\alpha}(H^2)$, $c^{-1}_{1-\alpha}(H^2)$ and $c^{-1}_{1-\alpha/2}(H^2)$ as required. This can again be done by the modified Robbins-Monro procedure, by setting $\theta := h^2$, $g(\theta) := \Pr_{h^2=\theta}(\hat{h}^2 \le H^2)$, and $C = \alpha/2$, $\alpha$, $1 - \alpha/2$ or $1 - \alpha$. To sample from a distribution whose mean is $\Pr_{h^2_n}(\hat{h}^2 \le H^2)$, we draw a sample from the distribution corresponding to $h^2_n$, and check if the maximum likelihood estimate for it is above $H^2$. Again, this procedure can be done quickly in $O(n)$. Once these quantities have been calculated, the CI can be calculated as detailed in Equations (2) and (3).
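Once the required values are available, assembling the CI from Equations (2) and (3) is a matter of case selection. The sketch below assumes the first case of Equation (3) returns $c^{-1}_{\alpha/2}(H^2)$, by symmetry with Equation (2); the function names, the argument names and the numerical inputs are hypothetical, and checking the boundary cases ($H^2 \le s^*$, $t^* \le H^2$) first is our choice of evaluation order.

```python
def ci_lower(H2, s, s_star, inv_1ma, inv_1ma2):
    """Lower CI boundary, Equation (2).

    inv_1ma  stands for c^{-1}_{1-alpha}(H2);
    inv_1ma2 stands for c^{-1}_{1-alpha/2}(H2).
    """
    if H2 <= s_star:
        return 0.0
    if inv_1ma < s:
        return inv_1ma
    if inv_1ma2 <= s <= inv_1ma:
        return s
    return inv_1ma2  # remaining case: s < c^{-1}_{1-alpha/2}(H2)


def ci_upper(H2, t, t_star, inv_a2, inv_a):
    """Upper CI boundary, Equation (3).

    inv_a2 stands for c^{-1}_{alpha/2}(H2);
    inv_a  stands for c^{-1}_{alpha}(H2).
    """
    if t_star <= H2:
        return 1.0
    if inv_a2 < t:
        return inv_a2
    if inv_a <= t <= inv_a2:
        return t
    return inv_a  # remaining case: t < c^{-1}_{alpha}(H2)


# Hypothetical inputs for an estimate H2 = 0.5.
lower = ci_lower(0.5, s=0.1, s_star=0.2, inv_1ma=0.30, inv_1ma2=0.28)
upper = ci_upper(0.5, t=0.9, t_star=0.8, inv_a2=0.70, inv_a=0.68)
print(lower, upper)
```

Only the inverse-quantile arguments depend on $H^2$; $s$, $t$, $s^*$ and $t^*$ can be reused across estimates for the same kinship matrix.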

In practice, we used the following choices in the modified Robbins-Monro procedure: (i) we used $T = 1000$ iterations; (ii) we set the prior standard deviation to $\tau = 0.4$, used to derive $a_n$ and $C_n$ via the Bayesian analysis (see [39]); (iii) we used the midpoint between the estimate and the relevant boundary (0 or 1, depending on the quantile required) as a starting point; (iv) we adaptively changed the step size constant, following the suggestion of Garthwaite and Buckland, by approximating the reciprocal of the derivative with an expression proportional to the distance between the current estimate and $H^2$:

$$
\frac{1}{g'(u_\theta)} \approx k\,(h^2_n - H^2), \qquad k = \frac{2}{z_\beta \cdot (2\pi)^{-1/2} \cdot e^{-z_\beta^2/2}},
$$

where $z$ is the quantile function of the normal distribution, and $\beta$ is the required quantile.

The NFBC dataset

We analyzed 5,236 individuals from the Northern Finland Birth Cohort (NFBC) dataset, which consists of genotypes at 331,476 genotyped SNPs and 10 phenotypes [37]. From each pair of individuals with relatedness of more than 0.025, one was reserved, resulting in 2,520 individuals.

The WTCCC2 dataset

We analyzed the Wellcome Trust Case Control Consortium 2 dataset [38]. In the multiple sclerosis (MS) and ulcerative colitis (UC) datasets, we used the same data processing described in [44] to ensure consistency. Briefly, UK controls and cases from both UK and non-UK were used. SNPs were removed if they had > 0.5% missing data, p < 0.01 for allele frequency difference between two control groups, p < 0.05 for deviation from Hardy-Weinberg equilibrium, p < 0.05 for differential missingness between cases and controls, or minor allele frequency < 1%. In all analyses, SNPs within 5M base pairs of the human leukocyte antigen (HLA) region were excluded, because they have large effect sizes and highly unusual linkage disequilibrium patterns, which can bias or exaggerate the results. Finally, from each pair of individuals with relatedness of more than 0.025, one was reserved, resulting in 13,950 individuals.

Discussion

We have presented FIESTA, an efficient method for constructing accurate CIs using stochastic approximation. We have shown that FIESTA is very fast, while achieving exact coverage due to the fact that it does not rely on any assumptions on the distribution of the estimator. FIESTA is also faster than the analytical approximation used by SEs. Due to its speed, FIESTA can be easily used for datasets with tens or hundreds of thousands of individuals.

FIESTA requires the eigendecomposition of the kinship matrix, whose computational complexity is cubic in the number of individuals. While this is often a preliminary step in heritability estimation, it may be computationally prohibitive for larger datasets. Recent methods for heritability estimation (see [45]) utilize conjugate gradient methods to avoid cubic steps altogether. One direction of extension for FIESTA is devising a procedure to calculate the derivative of the restricted likelihood function using conjugate gradient methods, which are quadratic, but do not require the eigendecomposition.

We note that the confidence intervals constructed by FIESTA are estimated under a set of assumptions, particularly that the data is generated from the linear mixed model as described in the Methods. Deviations from these assumptions could result in inaccurate confidence intervals. Specifically, we observed that when the genotype matrix is of low rank (e.g., in the case where duplicates are introduced), the confidence intervals calculated by FIESTA may be inaccurate. We therefore recommend removing duplicates and closely related individuals from the data prior to the application of FIESTA.

A common extension of the LMM is that of multiple variance components, where the genome is divided into distinct partitions (e.g., according to functional annotations, or by chromosomes), and the relative genetic contribution of each partition is estimated instead. Another extension is that of multiple traits, where several phenotypes are estimated concurrently, allowing dependencies between them. In principle, the methodology behind FIESTA can be applied to the multiparametric case as well. However, there are several computational and conceptual hurdles that make this application highly nontrivial. First, a major difficulty arises from the fact that it is no longer necessarily possible to jointly diagonalize several kinship matrices. Thus, the computation of the derivatives of the logarithm of the restricted likelihood functions can no longer utilize the eigendecomposition. Second, the inversion of acceptance regions of multiple parameters results in confidence regions of more than one dimension. While these have the required coverage probability, their shape may be difficult to report or to interpret easily (e.g., an ellipsoid). For example, hyper-rectangular confidence regions are often desirable [46], as the marginal CI of each parameter has the same coverage probability as the confidence region. Therefore, multiparametric extensions remain a future direction of research.

Acknowledgements

The authors would like to thank David Steinberg. R.S. is supported by the Colton Family Foundation. This study was supported in part by a fellowship from the Edmond J. Safra Center for Bioinformatics at Tel Aviv University to R.S. The Northern Finland Birth Cohort data were obtained from dbGaP: phs000276.v2.p1. This study makes use of data generated by the Wellcome Trust Case Control Consortium. A full list of the investigators who contributed to the generation of the data is available from www.wtccc.org.uk. Funding for the project was provided by the Wellcome Trust under award 076113.


References

1. Fisher, R. A. The correlation between relatives on the supposition of mendelian inheritance. Transactions of the Royal Society of Edinburgh 52, 399–433 (1918).

2. Silventoinen, K. et al. Heritability of adult body height: a comparative study of twin cohorts in eight countries. Twin Research 6, 399–408 (2003).

3. Macgregor, S., Cornes, B. K., Martin, N. G. & Visscher, P. M. Bias, precision and heritability of self-reported and clinically measured height in Australian twins. Human Genetics 120, 571–580 (2006).

4. Manolio, T. A., Brooks, L. D. & Collins, F. S. A HapMap harvest of insights into the genetics of common disease. The Journal of Clinical Investigation 118, 1590 (2008).

5. Welter, D. et al. The NHGRI GWAS Catalog, a curated resource of SNP-trait associations. Nucleic Acids Research 42, D1001–6 (2014).

6. Visscher, P. M., Hill, W. G. & Wray, N. R. Heritability in the genomics era: concepts and misconceptions. Nature Reviews Genetics 9, 255–266 (2008).

7. Kang, H. M. et al. Efficient control of population structure in model organism association mapping. Genetics 178, 1709–23 (2008).

8. Kang, H. M. et al. Variance component model to account for sample structure in genome-wide association studies. Nature Genetics 42, 348–54 (2010).

9. Lippert, C. et al. FaST linear mixed models for genome-wide association studies. Nature Methods 8, 833–5 (2011).

10. Zhou, X. & Stephens, M. Genome-wide efficient mixed-model analysis for association studies. Nature Genetics 44, 821–4 (2012).

11. Vattikuti, S., Guo, J. & Chow, C. C. Heritability and genetic correlations explained by common SNPs for metabolic syndrome traits. PLoS Genetics 8, e1002637 (2012).

12. Wright, F. A. et al. Heritability and genomics of gene expression in peripheral blood. Nature Genetics 46, 430–437 (2014).

13. Kruijer, W. et al. Marker-based estimation of heritability in immortal populations. Genetics 199, 379–398 (2015).

14. Yang, J. et al. Common SNPs explain a large proportion of the heritability for human height. Nature Genetics 42, 565–9 (2010).

15. Yang, J., Lee, S. H., Goddard, M. E. & Visscher, P. M. GCTA: A tool for genome-wide complex trait analysis. The American Journal of Human Genetics 88, 76–82 (2011).

16. Lohr, S. L. & Divan, M. Comparison of confidence intervals for variance components with unbalanced data. Journal of Statistical Computation and Simulation 58, 83–97 (1997).

17. Burch, B. D. Comparing pivotal and REML-based confidence intervals for heritability. Journal of Agricultural, Biological, and Environmental Statistics 12, 470–484 (2007).

18. Burch, B. D. Assessing the performance of normal-based and REML-based confidence intervals for the intraclass correlation coefficient. Computational Statistics & Data Analysis 55, 1018–1028 (2011).

19. Kraemer, K. Confidence intervals for variance components and functions of variance components in the random effects model under non-normality. PhD thesis (Iowa State University, 2012).


20. Schweiger, R. et al. Fast and accurate construction of confidence intervals for heritability. The American Journal of Human Genetics 98, 1181–1192 (2016).

21. Chernoff, H. On the distribution of the likelihood ratio. The Annals of Mathematical Statistics, 573–578 (1954).

22. Moran, P. A. Maximum-likelihood estimation in non-standard conditions in Mathematical Proceedings of the Cambridge Philosophical Society 70 (1971), 441–450.

23. Self, S. G. & Liang, K.-Y. Asymptotic properties of maximum likelihood estimators and likelihood ratio tests under nonstandard conditions. Journal of the American Statistical Association 82, 605–610 (1987).

24. Stern, S. & Welsh, A. Likelihood inference for small variance components. The Canadian Journal of Statistics 28, 517–532 (2000).

25. Visscher, P. M. & Goddard, M. E. A general unified framework to assess the sampling variance of heritability estimates using pedigree or marker-based relationships. Genetics 199, 223–232 (2015).

26. Thai, H. T., Mentré, F., Holford, N. H. G., Veyrat-Follet, C. & Comets, E. A comparison of bootstrap approaches for estimating uncertainty of parameters in linear mixed-effects models. Pharmaceutical Statistics 12, 129–140 (2013).

27. Wolfinger, R. D. & Kass, R. E. Nonconjugate Bayesian analysis of variance component models. Biometrics 56, 768–774 (2000).

28. Chung, Y., Rabe-Hesketh, S., Gelman, A., Dorie, V. & Liu, J. Avoiding boundary estimates in linear mixed models through weakly informative priors. Berkeley Preprints, 1–30 (2011).

29. Harville, D. A. & Fenech, A. P. Confidence intervals for a variance ratio, or for heritability, in an unbalanced mixed linear model. Biometrics, 137–152 (1985).

30. Burch, B. D. & Iyer, H. K. Exact confidence intervals for a variance ratio (or heritability) in a mixed linear model. Biometrics, 1318–1333 (1997).

31. Furlotte, N. A., Heckerman, D. & Lippert, C. Quantifying the uncertainty in heritability. Journal of Human Genetics 59, 269–275 (2014).

32. Carpenter, J. & Bithell, J. Bootstrap confidence intervals: when, which, what? A practical guide for medical statisticians. Statistics in Medicine 19, 1141–1164 (2000).

33. Sudlow, C. et al. UK biobank: an open access resource for identifying the causes of a wide range of complex diseases of middle and old age. PLoS Med 12, e1001779 (2015).

34. Kushner, H. & Yin, G. G. Stochastic approximation and recursive algorithms and applications (Springer Science & Business Media, 2003).

35. Robbins, H. & Monro, S. A stochastic approximation method. The Annals of Mathematical Statistics, 400–407 (1951).

36. Garthwaite, P. H. & Buckland, S. T. Generating Monte Carlo confidence intervals by the Robbins-Monro process. Applied Statistics, 159–171 (1992).

37. Sabatti, C. et al. Genome-wide association analysis of metabolic traits in a birth cohort from a founder population. Nature Genetics 41, 35–46 (2009).

38. Sawcer, S. et al. Genetic risk and a primary role for cell-mediated immune mechanisms in multiple sclerosis. Nature 476, 214 (2011).


39. Joseph, V. R. Efficient Robbins–Monro procedure for binary data. Biometrika 91, 461–470 (2004).

40. Furlotte, N. A. & Eskin, E. Efficient multiple trait association and estimation of genetic correlation using the matrix-variate linear mixed-model. Genetics 200, 59–68 (2015).

41. Searle, S. R., Casella, G. & McCulloch, C. E. Variance components (John Wiley & Sons, 2009).

42. Visscher, P. M., Hill, W. G. & Wray, N. R. Heritability in the genomics era: concepts and misconceptions. Nature Reviews Genetics 9, 255–266 (2008).

43. Patterson, H. D. & Thompson, R. Recovery of inter-block information when block sizes are unequal. Biometrika 58, 545–554 (1971).

44. Yang, J., Zaitlen, N. A., Goddard, M. E., Visscher, P. M. & Price, A. L. Advantages and pitfalls in the application of mixed-model association methods. Nature Genetics 46, 100–106 (2014).

45. Loh, P.-R. et al. Contrasting genetic architectures of schizophrenia and other complex diseases using fast variance-components analysis. Nature Genetics 47, 1385–1392 (2015).

46. Sidak, Z. Rectangular confidence regions for the means of multivariate normal distributions. Journal of the American Statistical Association 62, 626–633 (1967).


Appendix

A1 Variance of estimators

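The main method of calculating the variance of the estimator, applied by all widely used LMM methods, employs the Fisher information matrix, or a variant thereof, possibly applying the delta method in addition [1]. The observed information matrix $\mathcal{J}(\theta)$ of parameters $\theta$ is the negative of the Hessian of the log-likelihood of the data $y$. Namely, $\mathcal{J}(\theta)_{i,j} = -\frac{\partial^2}{\partial\theta_i \partial\theta_j} \ell(\theta; y)$. The Fisher information matrix $\mathcal{I}(\theta)$ is the expectation of the observed information matrix. Namely, $\mathcal{I}(\theta)_{i,j} = E\left[-\frac{\partial^2}{\partial\theta_i \partial\theta_j} \ell(\theta; y)\right]$. Asymptotically, under certain regularity conditions,

$$\sqrt{n}(\hat{\theta} - \theta) \xrightarrow{d} \mathcal{N}(0, \mathcal{I}(\theta)^{-1}).$$

According to the delta method, the asymptotic distribution of a function $f(\theta)$ satisfies

$$\sqrt{n}(f(\hat{\theta}) - f(\theta)) \xrightarrow{d} \mathcal{N}(0, \nabla f(\theta)^T \mathcal{I}(\theta)^{-1} \nabla f(\theta)).$$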

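GCTA uses the Average Information (AI) matrix $A$ [2] to calculate the variance of $\hat{\sigma}_g^2$ and $\hat{\sigma}_e^2$, where $A = \frac{1}{2}(\mathcal{I} + \mathcal{J})$. For the REML method, this is the matrix

$$A = \frac{1}{2} \begin{pmatrix} y^T Q K Q K Q y & y^T Q K Q Q y \\ y^T Q Q K Q y & y^T Q Q Q y \end{pmatrix},$$

where $Q = \Sigma^{-1} - \Sigma^{-1} X (X^T \Sigma^{-1} X)^{-1} X^T \Sigma^{-1}$, with $\Sigma = \sigma_g^2 K + \sigma_e^2 I$. Then, the delta method is used to calculate the variance of $\hat{h}^2$:

$$\mathrm{Var}(\hat{h}^2) = (\sigma_g^2 + \sigma_e^2)^{-4} \begin{pmatrix} \sigma_e^2 & -\sigma_g^2 \end{pmatrix} A^{-1}\Big|_{\sigma_g^2 = \hat{\sigma}_g^2,\ \sigma_e^2 = \hat{\sigma}_e^2} \begin{pmatrix} \sigma_e^2 \\ -\sigma_g^2 \end{pmatrix}.$$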

Given the eigendecomposition of $K$, $\Sigma^{-1}$ (and thus $Q$) can be calculated in $O(n)$ (where $n$ is the number of individuals), avoiding an expensive matrix inversion. Several other computational improvements may be carried out, depending on software implementation. However, we note that $O(n^2)$ matrix-by-vector multiplications cannot be avoided.
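The delta-method step can be illustrated numerically. The sketch below is not GCTA's implementation: the function name is hypothetical, and `ai_matrix` is assumed to be the 2×2 AI matrix already evaluated at the estimated variance components. It uses the fact that the gradient of $h^2 = \sigma_g^2 / (\sigma_g^2 + \sigma_e^2)$ with respect to $(\sigma_g^2, \sigma_e^2)$ is $(\sigma_g^2 + \sigma_e^2)^{-2} (\sigma_e^2, -\sigma_g^2)$.

```python
import numpy as np

def heritability_variance(sigma2_g, sigma2_e, ai_matrix):
    """Delta-method variance of h2 = sigma2_g / (sigma2_g + sigma2_e).

    ai_matrix: assumed 2x2 AI matrix evaluated at the estimates, ordered
    as (sigma2_g, sigma2_e). Its inverse plays the role of the covariance
    of the variance-component estimates.
    """
    total = sigma2_g + sigma2_e
    # Gradient of h2, up to the common factor total**(-2).
    direction = np.array([sigma2_e, -sigma2_g], dtype=float)
    # Solve A x = direction instead of forming the inverse explicitly.
    x = np.linalg.solve(np.asarray(ai_matrix, dtype=float), direction)
    return float(direction @ x) / total**4

# Illustrative values only (not taken from any dataset).
var_h2 = heritability_variance(0.5, 0.5, np.eye(2))
se_h2 = var_h2 ** 0.5
```

The square root of the returned value is the standard error that would typically be reported alongside the heritability estimate.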

A2 Confidence intervals for heritability

Our approach is based on the duality between hypothesis testing and confidence intervals. As the distribution of $\hat{h}^2$ depends solely on $h^2$, we may assume without loss of generality that $\sigma_p^2 = 1$ and $\beta = 0_p$. For a fixed value $h^2$, an acceptance region $A_{h^2}$ is defined as the subset of values $\hat{h}^2$ for which a test does not reject the null hypothesis that the phenotype vector is drawn from $\mathcal{N}(0_n, V_{h^2})$. The probability of the event $\hat{h}^2 \in A_{h^2}$ under $\mathcal{N}(0_n, V_{h^2})$ should be $\geq 1 - \alpha$. This region may be indirectly derived from an actual test (e.g., a generalized likelihood ratio test) or constructed explicitly. The corresponding confidence interval for an estimate $\hat{h}^2 = H^2$, denoted $C_{H^2}$, comprises the set of parameter values $h^2$ for which $\hat{h}^2 = H^2$ does not imply the rejection of the null hypothesis that the true heritability value is $h^2$:

$$C_{H^2} = \left\{ h^2 \mid H^2 \in A_{h^2} \right\}.$$
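The inversion of acceptance regions can be illustrated on a grid of candidate values. This is a brute-force sketch for intuition only, not the procedure used by FIESTA: `acceptance_region` is a hypothetical callable returning the endpoints of $A_{h^2}$, and the toy band used below is an arbitrary assumption rather than a region derived from the LMM.

```python
import numpy as np

def invert_acceptance_regions(H2, grid, acceptance_region):
    """Grid approximation of C_{H2} = {h2 : H2 in A_{h2}}.

    acceptance_region(h2) is a hypothetical callable returning the
    endpoints (low, high) of the acceptance region A_{h2}.
    """
    accepted = []
    for h2 in grid:
        low, high = acceptance_region(h2)
        if low <= H2 <= high:
            accepted.append(h2)
    return np.array(accepted)

# Toy acceptance regions (illustration only): a band of half-width 0.1
# around h2, clipped to the parameter space [0, 1].
def toy_region(h2):
    return max(0.0, h2 - 0.1), min(1.0, h2 + 0.1)

ci_points = invert_acceptance_regions(0.5, np.linspace(0.0, 1.0, 1001), toy_region)
ci = (ci_points.min(), ci_points.max())  # close to (0.4, 0.6) for this toy band
```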

Since the distribution of $\hat{h}^2$ is bounded and generally asymmetric, the choice of $A_{h^2}$ is not unique. It remains to determine $A_{h^2}$ for every $h^2$. We give here a general description of the construction; in [3], we give a full description of the method, along with proofs.

Let $c_\beta(h^2)$ be the $\beta$-th quantile function of $\hat{h}^2$ when the true heritability is $h^2$; i.e., $\Pr_{h^2}(\hat{h}^2 \leq c_\beta(h^2)) = \beta$. A natural choice for $A_{h^2}$ would be the interval obtained by removing an $\alpha/2$-tail from both sides of the distribution of $\hat{h}^2$ given $h^2$, i.e., choosing the two-sided $A_{h^2} = [c_{\alpha/2}(h^2), c_{1-\alpha/2}(h^2)]$. If this were always possible, a succinct way of describing the $1 - \alpha$ CI, $C_{H^2} = [l_{H^2}, u_{H^2}]$, would be using the fact that its endpoints are exactly those satisfying

$$c_{1-\alpha/2}(l_{H^2}) = H^2 \Rightarrow l_{H^2} = c_{1-\alpha/2}^{-1}(H^2)$$

$$c_{\alpha/2}(u_{H^2}) = H^2 \Rightarrow u_{H^2} = c_{\alpha/2}^{-1}(H^2),$$

as described in Figure S1.

Figure S1 An illustration of acceptance regions and CIs. The diagonal lines are the $\alpha/2$ and $1 - \alpha/2$ quantile functions, shown for values in the mid-range of heritability values. Several example acceptance regions are denoted as horizontal lines, in parameter regions where simple two-sided acceptance regions can be defined. The CI for $\hat{h}^2 = 0.5$ is shown as a vertical line. (Horizontal axis: estimated value of $\hat{h}^2$; vertical axis: true value of $h^2$.)

However, since the distribution is of a mixed type with discontinuity points, the probability of the interval $[c_{\alpha/2}(h^2), c_{1-\alpha/2}(h^2)]$ might be greater than $(1 - \alpha/2) - \alpha/2 = 1 - \alpha$. For example, if $\Pr_{h^2}(\hat{h}^2 = 0) > \alpha/2$, then $c_{\alpha/2}(h^2) = 0$, and $\Pr_{h^2}(\hat{h}^2 \in [0, c_{\alpha/2}(h^2)]) > \alpha/2$. In this case, we instead choose the one-sided interval $A_{h^2} = [0, c_{1-\alpha}(h^2)]$. Similarly, if $\Pr_{h^2}(\hat{h}^2 = 1) > \alpha/2$, then $c_{1-\alpha/2}(h^2) = 1$, and $\Pr_{h^2}(\hat{h}^2 \in [c_{1-\alpha/2}(h^2), 1]) > \alpha/2$. In this case, we similarly choose the one-sided interval $[c_\alpha(h^2), 1]$ instead. We are therefore interested in the maximal value $s$ for which $\Pr_{s}(\hat{h}^2 = 0) \geq \alpha/2$, and the minimal value $t$ for which $\Pr_{t}(\hat{h}^2 = 1) \geq \alpha/2$, because in the range of values $h^2 \in [s, t]$, it holds that $\Pr_{h^2}(\hat{h}^2 \in [c_{\alpha/2}(h^2), c_{1-\alpha/2}(h^2)]) = 1 - \alpha$. Equivalently, assuming $\Pr_{h^2}(\hat{h}^2 = 0)$ (resp., $\Pr_{h^2}(\hat{h}^2 = 1)$) is decreasing (resp., increasing) in $h^2$, we may simply define $s$ and $t$ to be the values for which $\Pr_{h^2 = s}(\hat{h}^2 = 0) = \alpha/2$ and $\Pr_{h^2 = t}(\hat{h}^2 = 1) = \alpha/2$.
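Under the monotonicity assumption just stated, the thresholds s and t are roots of one-dimensional equations and can be located by bisection. The sketch below is only an illustration: `bisect_monotone` is a hypothetical helper, and `prob_zero` and `prob_one` are toy stand-ins for the boundary probabilities, which in practice have to be computed from the model (see [3]).

```python
import math

def bisect_monotone(prob, target, increasing, lo=0.0, hi=1.0, tol=1e-10):
    """Find h2 in [lo, hi] with prob(h2) = target, assuming prob is monotone.

    Assumes the target is bracketed, i.e. it lies between prob(lo) and prob(hi).
    """
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        # The root lies to the right of mid when prob(mid) is below the
        # target for an increasing function, or at or above it for a
        # decreasing one.
        if (prob(mid) < target) == increasing:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

# Toy boundary probabilities (assumptions for illustration only).
def prob_zero(h2):   # stand-in for Pr_{h2}(estimate = 0), decreasing in h2
    return 0.5 * math.exp(-10.0 * h2)

def prob_one(h2):    # stand-in for Pr_{h2}(estimate = 1), increasing in h2
    return 0.5 * math.exp(-10.0 * (1.0 - h2))

alpha = 0.05
s = bisect_monotone(prob_zero, alpha / 2, increasing=False)
t = bisect_monotone(prob_one, alpha / 2, increasing=True)
```

For these toy functions the roots are available in closed form, s = ln(20)/10 and t = 1 − ln(20)/10, which gives a simple check of the bisection.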

The following assumes $s$ and $t$ exist, and that $s < t$; for the general case, see [3]. We divide our construction into distinct cases, by setting

$$A_{h^2} = \begin{cases} [0, c_{1-\alpha}(h^2)] & \text{if } h^2 \in [0, s) \\ [c_{\alpha/2}(h^2), c_{1-\alpha/2}(h^2)] & \text{if } h^2 \in [s, t] \\ [c_{\alpha}(h^2), 1] & \text{if } h^2 \in (t, 1]. \end{cases}$$

The three region types are illustrated by Figure S2.

Figure S2 An illustration of the three acceptance region types. The diagonal lines, from left to right, indicate the quantile functions for $\alpha/2$, $\alpha$, $1 - \alpha$ and $1 - \alpha/2$. The three region types are indicated as horizontal lines. The points $s$ and $t$, where the region type used changes, are indicated as horizontal dashed lines; the points $s^*$ and $t^*$ are also marked. See Methods for a full description. (Horizontal axis: estimated value $\hat{h}^2$; vertical axis: true value of $h^2$.)

Inverting the acceptance regions, we get the following definition for $C_{H^2} = [l_{H^2}, u_{H^2}]$. For the lower endpoint, we have

lH2 =

0 if H2 ∈ [0, c1−α(0))

c−11−α(H

2) if H2 ∈ [c1−α(0), c1−α(s))

s if H2 ∈ [c1−α(s), c1−α/2(s))

c−11−α/2(H

2) if H2 ∈ [c1−α/2(s), 1]

For the higher endpoint, we have

uH2 =

c−1α/2(H

2) if H2 ∈ [0, cα/2(t))

t if H2 ∈ [cα/2(t), cα(t))

c−1α (H2) if H2 ∈ [cα(t), cα(1))

1 if H2 ∈ [cα(1), 1]

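These conditions, phrased in terms of the quantile functions $c_\beta$, e.g., $H^2 \leq c_\alpha(t)$, can be equivalently written in terms of the value of the inverse quantile functions at the estimate $H^2$, e.g., $c_\alpha^{-1}(H^2) \leq t$. In addition, let $s^* = c_{1-\alpha}(0)$ and $t^* = c_\alpha(1)$. Explicitly,

$$l_{H^2} = \begin{cases} 0 & \text{if } H^2 \leq s^* \\ c_{1-\alpha}^{-1}(H^2) & \text{if } c_{1-\alpha}^{-1}(H^2) < s \\ s & \text{if } s \in [c_{1-\alpha/2}^{-1}(H^2), c_{1-\alpha}^{-1}(H^2)] \\ c_{1-\alpha/2}^{-1}(H^2) & \text{if } s < c_{1-\alpha/2}^{-1}(H^2) \end{cases}$$

and

$$u_{H^2} = \begin{cases} c_{\alpha/2}^{-1}(H^2) & \text{if } c_{\alpha/2}^{-1}(H^2) < t \\ t & \text{if } t \in [c_{\alpha}^{-1}(H^2), c_{\alpha/2}^{-1}(H^2)] \\ c_{\alpha}^{-1}(H^2) & \text{if } t < c_{\alpha}^{-1}(H^2) \\ 1 & \text{if } t^* \leq H^2 \end{cases}$$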

It follows from the discussion above that, in order to construct a CI for a heritability estimate $H^2$, we need to first find $s$ and $t$ as above, as well as $s^* = c_{1-\alpha}(0)$ and $t^* = c_\alpha(1)$, and then we need only calculate $c_\beta^{-1}(H^2)$ for $\beta = \alpha/2$, $\alpha$, $1 - \alpha$ and $1 - \alpha/2$. Therefore, the entire construction relies on inverting certain quantile functions.
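The case analysis can be written as a short function once the inverse quantile functions are available. The sketch below is an illustration under stated assumptions, not the FIESTA implementation: `inv_c(beta, H2)` is a hypothetical callable returning $c_\beta^{-1}(H^2)$, and the toy inverse quantile function and the values of s and t used in the example are arbitrary rather than derived from an LMM. The boundary cases ($H^2 \leq s^*$ and $t^* \leq H^2$) are checked first.

```python
def confidence_interval(H2, inv_c, s, t, s_star, t_star, alpha=0.05):
    """CI [lower, upper] for an estimate H2, following the cases above.

    inv_c(beta, H2) is a hypothetical callable returning c_beta^{-1}(H2).
    """
    # Lower endpoint.
    if H2 <= s_star:
        lower = 0.0
    elif inv_c(1 - alpha, H2) < s:
        lower = inv_c(1 - alpha, H2)
    elif inv_c(1 - alpha / 2, H2) <= s:
        lower = s
    else:
        lower = inv_c(1 - alpha / 2, H2)

    # Upper endpoint.
    if H2 >= t_star:
        upper = 1.0
    elif inv_c(alpha, H2) > t:
        upper = inv_c(alpha, H2)
    elif inv_c(alpha / 2, H2) >= t:
        upper = t
    else:
        upper = inv_c(alpha / 2, H2)
    return lower, upper

# Toy quantile functions (illustration only):
# c_beta(h2) = h2 + 0.4 * (beta - 0.5), so c_beta^{-1}(H2) = H2 - 0.4 * (beta - 0.5).
def toy_inv_c(beta, H2):
    return H2 - 0.4 * (beta - 0.5)

alpha = 0.05
s_star = 0.4 * ((1 - alpha) - 0.5)   # c_{1-alpha}(0) for the toy model
t_star = 1.0 + 0.4 * (alpha - 0.5)   # c_alpha(1) for the toy model
lower, upper = confidence_interval(0.5, toy_inv_c, s=0.1, t=0.9,
                                   s_star=s_star, t_star=t_star, alpha=alpha)
```

In practice, each call to `inv_c` is the expensive step, since it requires inverting a quantile function of the estimator; at most four such inversions are needed per interval.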

References

1. Wasserman, L. All of statistics: a concise course in statistical inference (Springer Science & Business Media, 2013).

2. Gilmour, A. R., Thompson, R. & Cullis, B. R. Average information REML: an efficient algorithm for variance parameter estimation in linear mixed models. Biometrics, 1440–1450 (1995).

3. Schweiger, R. et al. Fast and accurate construction of confidence intervals for heritability. The American Journal of Human Genetics 98, 1181–1192 (2016).
