

Journal of Machine Learning Research 22 (2021) 1-82 Submitted 10/20; Revised 2/21; Published 4/21

What Causes the Test Error? Going Beyond Bias-Variance via ANOVA

Licong Lin [email protected]
School of Mathematical Sciences
Peking University
5 Yiheyuan Road, Beijing, China

Edgar Dobriban [email protected]

Departments of Statistics & Computer and Information Science

University of Pennsylvania

Philadelphia, PA, 19104-6340, USA

Editor: Ambuj Tewari

Abstract

Modern machine learning methods are often overparametrized, allowing adaptation to the data at a fine level. This can seem puzzling; in the worst case, such models do not need to generalize. This puzzle inspired a great amount of work, arguing when overparametrization reduces test error, in a phenomenon called “double descent”. Recent work aimed to understand in greater depth why overparametrization is helpful for generalization. This led to discovering the unimodality of variance as a function of the level of parametrization, and to decomposing the variance into that arising from label noise, initialization, and randomness in the training data to understand the sources of the error.

In this work we develop a deeper understanding of this area. Specifically, we propose using the analysis of variance (ANOVA) to decompose the variance in the test error in a symmetric way, for studying the generalization performance of certain two-layer linear and non-linear networks. The advantage of the analysis of variance is that it reveals the effects of initialization, label noise, and training data more clearly than prior approaches. Moreover, we also study the monotonicity and unimodality of the variance components. While prior work studied the unimodality of the overall variance, we study the properties of each term in the variance decomposition.

One of our key insights is that often, the interaction between training samples and initialization can dominate the variance; surprisingly being larger than their marginal effect. Also, we characterize “phase transitions” where the variance changes from unimodal to monotone. On a technical level, we leverage advanced deterministic equivalent techniques for Haar random matrices, that—to our knowledge—have not yet been used in the area. We verify our results in numerical simulations and on empirical data examples.

Keywords: Test Error, ANOVA, Double Descent, Ridge Regression, Random Matrix Theory

1. Introduction

Modern machine learning methods are often overparametrized, allowing adaptation to the data at a fine level. For instance, competitive methods for image classification—such as WideResNet (Zagoruyko and Komodakis, 2016)—and for text processing—such as GPT-3 (Brown et al., 2020)—have from millions to billions of explicit optimizable parameters, comparable to the number of datapoints.

©2021 Licong Lin and Edgar Dobriban.

License: CC-BY 4.0, see https://creativecommons.org/licenses/by/4.0/. Attribution requirements are provided at http://jmlr.org/papers/v22/20-1211.html.


Figure 1: ANOVA decomposition of the variance. The plots show the components of the variance (as well as the bias) in certain two-layer linear networks studied in the paper, as a function of the data aspect ratio δ = lim d/n, where d is the dimension of the features and n is the number of samples. The variance can be decomposed into its contributions from randomness in label noise (l), training data/samples (s), and initialization (i). Namely, the variance is decomposed into the main effects Va and the interaction effects Vab, Vabc, where a, b, c ∈ {l, s, i}. We omit Vl, Vli in the figures since they equal zero. The key observation is that the interaction effects (especially Vsi) dominate the variance at the interpolation limit where lim p/n = 1 (this turns out to correspond to δ = 1.25 in the figure), and p is the number of features in the hidden layer. Left: cumulative figure of the bias and variance components. Right: variance components in numerical simulations (?: theory, n?: numerical, averaged over 5 runs, for ? = Vs, etc.). Parameters: signal strength α = 1, noise level σ = 0.3, regularization parameter λ = 0.01, parametrization level π = 0.8. See Sections 2.2, 4.2 for details.

From a theoretical point of view, this can seem puzzling and perhaps even paradoxical: in the worst case, models with lots of parameters do not need to generalize (i.e., perform similarly on test data as on training data from the same distribution).

This puzzle has inspired a great amount of work. Without being exhaustive, some of the main approaches argue the following. (1) Overparametrization beyond the “interpolation threshold” (the number of parameters required to fit the data) can eventually reduce test error (in a phenomenon called “double descent”). (2) The specific algorithms used in the training process have beneficial “implicit regularization” effects which effectively reduce model complexity and help with generalization. These two ideas are naturally connected, as the implicit regularization helps achieve decreasing test error with overparametrization. This area has registered a great deal of progress recently, but its roots can be traced back many years. We discuss some of these works in the related work section.

One particular line of work aims to understand in greater depth why overparametrization is helpful for generalization. In this line of work, Yang et al. (2020) has studied the bias-variance decomposition of the mean squared error (and for other losses), and proposed that a key phenomenon is that the variance is unimodal as a function of the level of parametrization.


This was verified empirically for a wide range of models including modern neural networks, as well as theoretically for certain two-layer linear networks with only the second layer trained. Moreover, d’Ascoli et al. (2020) proposed to decompose the variance in a two-layer non-linear network with only the second layer trained (i.e., a random features model) into that arising from label noise, initialization, and randomness in the features of the training data (in this specific order), arguing that—in their particular model—the label and initialization noise dominates the variance.

In this work we develop a set of techniques aiming to improve our understanding of this area, and more broadly of generalization in machine learning. Specifically, we propose to use the analysis of variance (ANOVA), a classical tool from statistics and uncertainty quantification (e.g., Box et al., 2005; Owen, 2013), to decompose the variance in the generalization mean squared error into its components stemming from the initialization, label noise, and training data (see Figure 1 for a brief example). The advantage of the analysis of variance is that it reveals the effects of the components in a more clearly interpretable, and perhaps “unequivocal”, way than the approach in d’Ascoli et al. (2020). The prior decomposition depends on the specific order in which the conditional expectations are evaluated, while ours does not. We carry out this program in detail in certain two-layer linear and non-linear network models (more specifically, random feature models), which have been the subject of intense recent study, and are effectively at the frontier of our theoretical understanding.

As is well known in the literature on ANOVA, the variance components form a hierarchy whose first level, the main effects, can be interpreted as the effects of varying each variable (here: random initialization, features, label noise) separately, while the higher levels can be interpreted as the interaction effects between them. These are symmetric, which is both elegant and interpretable, and thus provide advantages over the prior approaches. See Figure 2 for an example.

Moreover, we study the monotonicity and unimodality of the MSE, bias, variance, and the various variance components in a specific variance decomposition. While Yang et al. (2020) studied the unimodality of the overall variance, we study the properties of the components individually. On a technical level, our work is quite involved, and leverages some advanced techniques from random matrix theory that—to our knowledge—have not yet been used in the area. In particular, we discovered that we can leverage the deterministic equivalent results for Haar random matrices from Couillet et al. (2012). These have been developed for different purposes, for analyzing random beamforming in wireless communications.

After the initial posting of our work, we became aware of the highly related paper Adlam and Pennington (2020b). This was publicly posted on the arxiv.org preprint server later than our work, but had been submitted for publication earlier. The conclusions in the two works are similar, but the techniques and setting are different. We discuss this at the end of the next section.


1.1 Related Works

There is an extraordinary amount of related work, as this topic is one of the most exciting and popular ones recently in the theory of machine learning. Due to space limitations, we can only review the most closely related work.

The phenomenon of “double descent”, coined in Belkin et al. (2019), states that the limiting test error first increases, then decreases as a function of the parametrization level, having a “double descent”, or “w”-shaped, behavior. This phenomenon has been studied, in one form or another, in a great number of recent works, see e.g., Advani et al. (2020); Bartlett et al. (2020); Belkin et al. (2019, 2018, 2020b); Derezinski et al. (2019); Geiger et al. (2020); Ghorbani et al. (2021); Hastie et al. (2019); Liang and Rakhlin (2018); Li et al. (2020); Mei and Montanari (2019); Muthukumar et al. (2020); Xie et al. (2020), etc.

Various forms have also appeared in earlier works, see e.g., the discussion on “A brief prehistory of double descent” (Loog et al., 2020) and the reply in Belkin et al. (2020a). This points to the related works Opper (2001); Kramer (2009). The online machine learning community has engaged in a detailed historical reference search, which unearthed the related early works1 Hertz et al. (1989); Opper et al. (1990); Hansen (1993); Barber et al. (1995); Duin (1995); Opper (1995); Opper and Kinzel (1996); Raudys and Duin (1998). The observations on the “peaking phenomenon” are consistent with empirical results on training neural networks dating back to the 1990s. There it has been suggested that the difficulties captured by the peak in double descent stem from optimization, such as the ill-conditioning of the Hessian (LeCun et al., 1991; Le Cun et al., 1991).

Some works that are especially relevant to us are the following. Hastie et al. (2019) showed that the limiting MSE of ridgeless interpolation in linear regression, as a function of the overparametrization ratio for fixed SNR, has a double descent behavior. Nakkiran et al. (2021) rigorously proved that optimally regularized ridge regression can eliminate double descent in finite samples in a linear regression model. Nakkiran (2019) clearly explained that “more data can hurt”, because algorithms do not always adapt well to the additional data. In comparison, the special case of our results pertaining to linear nets allows for certain non-Gaussian data, although it is only proved asymptotically. Nakkiran et al. (2020) empirically showed a double descent shape of the test risk for various neural network architectures as a function of model complexity, number of samples (“sample-wise” double descent), and training epochs.

d’Ascoli et al. (2020) used the (not fully rigorous) replica method to obtain the bias-variance decomposition for two-layer neural networks in the lazy training regime. They further also decomposed the variance in a specific order into that stemming from label noise, initialization, and training features. Compared to this, our work is fully rigorous, and proposes to use the analysis of variance, from which we show that sequential decompositions like the ones proposed in d’Ascoli et al. (2020) can be recovered. Moreover, we are concerned with a slightly different model (with orthogonal initialization), and some of our results are only proved for linear orthogonal networks (e.g., the forms of the variance components). However, going beyond d’Ascoli et al. (2020), we also obtain rigorous results for the monotonicity and unimodality of the various elements of the variance decomposition.

1. The reader can see the Twitter thread by Dmitry Kobak: https://twitter.com/hippopedoid/status/1243229021921579010.


Ba et al. (2020) obtained the generalization error of two-layer neural networks when only training the first or the second layer, and compared the effects of various algorithmic choices involved. Compared with our work, d’Ascoli et al. (2020); Ba et al. (2020) studied more general settings and provided results that involve more complex expressions; the advantage of our simpler expressions is that we can find the variance components and study properties such as their monotonicity. Our results are simpler mainly because we consider orthogonal initialization and, in several results, consider a linear network. We believe that our results are complementary.

Wu and Xu (2020) calculated the prediction risk of optimally regularized ridge regression under a general covariance assumption on the data. Jacot et al. (2020) argued that random feature models can be close to kernel ridge regression with additional regularization. This is related to the “calculus of deterministic equivalents” for random matrices (Dobriban and Sheng, 2018). Liang et al. (2020) argued that in certain kernel regression problems one may obtain generalization curves with multiple descent points. Chen et al. (2020) studied certain models with provable multiple descent curves.

When the data has general covariance, Kobak et al. (2020) showed that the optimal ridge parameter could be negative. Thus any positive ridge penalty would be sub-optimal if the true parameter vector lies on a direction with high predictor variance. Understanding the implications of this work in our context is a subject of interesting future research.

More broadly viewed, a great deal of effort has been focused on connecting “classical” statistical theory (focusing on low-dimensional models) with “modern” machine learning (focusing on overparametrized models).2 From this perspective, there are strong analogies with nonparametric statistics (Ibragimov and Has′minskii, 2013). Non-parametric estimators such as kernel smoothing have, in effect, infinitely many parameters, yet they can perform well in practice and have strong theoretical guarantees. Nonparametric statistics already has the same components of the “overparametrize then regularize” principle as in modern machine learning. The same principle also arises in high-dimensional statistics, such as with basis pursuit and Lasso (Chen and Donoho, 1994). Namely, one can get good performance if one considers a large set of potential predictors (overparametrize), and then selects a small, highly-regularized subset.

Even more broadly, our work is connected to the emerging theme in modern statistics and machine learning of studying high-dimensional asymptotic limits, where both the sample size and the dimension of the data tend to infinity. This is a powerful framework that allows us to develop new methods, and to uncover phenomena not detectable using classical fixed-dimension asymptotics (see e.g., Couillet and Debbah, 2011; Paul and Aue, 2014; Yao et al., 2015). It also dates back to the 1970s, see e.g., the literature review in Dobriban and Wager (2018), which points to works by Raudys (1967); Deev (1970); Serdobolskii (1980), etc. Some other recent related works include Pennington and Worah (2017); Louart et al. (2018); Liao and Couillet (2018, 2019); Benigni and Peche (2019); Goldt et al. (2019); Fan and Wang (2020); Deng et al. (2019); Gerace et al. (2020); Liao et al. (2020); Adlam et al. (2019); Adlam and Pennington (2020a). See also Geman et al. (1992); Bos and Opper (1997); Neal et al. (2018) for various classical and modern discussions of bias-variance tradeoffs and dynamics of training.

2. The reader can see e.g., the talks titled “From classical statistics to modern machine learning” by M. Belkin at the Simons Institute (https://simons.berkeley.edu/talks/tbd-65), at the Institute of Advanced Studies (https://video.ias.edu/theorydeeplearning/2019/1016-MikhailBelkin), and at other venues.



The most closely related work to ours is Adlam and Pennington (2020b), publicly posted later, but submitted for publication earlier. Both works study the generalization error via an ANOVA decomposition, and show that the interaction effect can dominate the variance. The conclusions in the two works are similar. For instance, the Vsi term (the interaction between samples and initialization, defined later) dominates the total variance; Vsi and Vsli (the interaction between samples, label noise and initialization) diverge as the ridge regularization parameter λ → 0. On the other hand, there are many differences between the two works. (1) The mathematical settings are different: their work studies a two-layer nonlinear network with Gaussian initialization, while we study both linear and nonlinear networks with orthogonal initialization. (2) The mathematical tools employed in the two papers are different: they use Gaussian equivalents and the linear pencil representation, while we exploit orthogonal deterministic equivalents. (3) The results are different: beyond the ANOVA decomposition, they also study the effect of ensemble learning. On our end, we study optimally tuned ridge regression and prove properties of the bias, variance and MSE.

Another related paper, also publicly posted after our work, is by Rocks and Mehta (2020). They study the generalization error in linear regression and two-layer networks by deriving formulas for the bias and variance. The main technique they use is the cavity method originating from statistical physics. Similarly, they also show that the generalization error diverges at the interpolation threshold due to the large variance. We provide a more detailed comparison later, after stating our main results.

As already mentioned, our work is related to the one by Yang et al. (2020). The model we consider is related to theirs, with several key differences. One is the orthogonal initialization, in contrast to their Gaussian initialization. Also, they assume that the ratio d/n → 0, while we study the proportional regime where d/n → δ > 0 (which can be arbitrarily small, so our setting is in a sense effectively more general). As for the results, they prove the unimodality of the variance and the monotonicity of the bias under their setting. They also make some conjectures on the variance unimodality that we prove (keeping in mind the different settings); see the results section for more details.

1.2 Our Contributions

Our contributions can be summarized as follows:

1. We study a two-layer linear network where the first layer is a fixed partial orthogonal embedding (which determines the latent features) and the second layer is trained with ridge regularization. While the expressive power of this model only captures certain linear functions, training only the second layer already exhibits certain intriguing statistical and generalization phenomena. We study the prediction error of this learning method in a noisy linear model. We consider three sources of randomness that contribute to the error: the random initialization (a random partial orthogonal embedding), the label noise, and the randomness over the training data. We propose to use the analysis of variance (ANOVA), a classical tool from statistics and uncertainty quantification (e.g., Box et al., 2005; Owen, 2013), to decompose and understand their contribution.


We study an asymptotic regime where the data dimension, sample size, and number of latent features tend to infinity together, proportionally to each other. In this model, we calculate the limits of the variance components (Theorem 2), in terms of moments of the Marchenko-Pastur distribution (Marchenko and Pastur, 1967). We then show how to recover various sequential variance decompositions, such as the one from d’Ascoli et al. (2020) (albeit only for linear rather than nonlinear networks). We also show that the order in the sequence of decompositions matters. Our work leverages deterministic equivalent results for Haar random matrices from Couillet et al. (2012) that, to our knowledge, have not yet been used in the area. We also leverage recent technical developments such as the calculus of deterministic equivalents for random matrices of the sample covariance type (Dobriban and Sheng, 2018, 2020). Proofs are in Appendix B.

2. We then study the bias-variance decomposition in greater detail. As a corollary of the ANOVA results, we study the decomposition of the variance in the order label-sample-initialization, which has some special properties (Theorem 3). When using an optimal ridge regularization, we study the monotonicity and unimodality properties of these components (Theorem 5 and Table 1). With this, we shed further light on phenomena discovered by Yang et al. (2020), who wrote that “The main unexplained mystery is the unimodality of the variance”. Specifically, we are able to show that the variance is indeed unimodal in a broad range of settings. This analysis goes beyond prior work, e.g., Yang et al. (2020) (who studied a setting where the number of inner neurons is much larger than the number of datapoints), or “double descent mitigation” as in Nakkiran et al. (2021), because it studies bias and variance separately.

We uncover several intriguing properties: for instance, for a fixed parametrization level π, as a function of the aspect ratio or “dimensions-per-sample”, the variance is monotonically decreasing when π < 0.5, and unimodal when π ≥ 0.5. We discuss and offer possible explanations.

We also discuss the special case of linear models, which has received a great deal of prior attention (Proposition 6). We view the results on standard linear models as valuable, as they are both simpler to state and to prove, and moreover they also directly connect to some prior work.

3. We develop some further special properties of the bias, variance, and MSE. We report a seemingly surprising simple relation between the MSE and bias at the optimum (Section 2.3.1). We study the properties of the bias and variance for a fixed (as opposed to optimally tuned) ridge regularization parameter (Theorem 7). In particular, we show that the bias decreases as a function of the parametrization, and increases as a function of the data aspect ratio. In contrast to choosing λ optimally, we see that double descent is not mitigated, and may occur in our setting when we use a small regularization parameter λ that is fixed across problem sizes (going beyond the models where this was known from prior work). This corroborates that the lack of proper regularization plays a crucial role for the emergence of double descent.


We also give an added noise interpretation of the initial random initialization step (Section 2.3.3). Further, we provide some detailed analysis and intuition of these phenomena, aided by numerical plots of the variance components (Section 2.3.4).

4. The above results are about ridge regression as a heuristic for regularized empirical risk minimization. In some settings, ridge regularization is known to have limitations (Derezinski and Warmuth, 2014), thus it is an important question to understand its fundamental limitations here. In fact, we can show that ridge regression is an asymptotically optimal estimator, in the sense that it converges to the Bayes optimal estimator in our model (Theorem 8). This provides some justification for studying ridge regression in a two-layer network, which is not covered by standard results.

5. We extend some of our results to two-layer networks with a non-linear activation function and orthogonal initialization. In particular, we provide the limits of the MSE, bias, and variance in the same asymptotic regime (Theorem 9). Furthermore, we provide the monotonicity and unimodality properties of these quantities as a function of the parametrization and aspect ratio (Table 2).

6. We provide numerical simulations to check the validity of our theoretical results (Section 4), including the MSE, the bias-variance decomposition, and the variance components. We also show some experiments on empirical data, specifically on the superconductivity data set (Hamidieh, 2018), where we test our predictions for two-layer orthogonal nets. Code associated with the paper is available at https://github.com/licong-lin/VarianceDecomposition.

1.3 Highlights and Implications

We discuss some of the highlights and implications of our results.

Beyond bias-variance. Much of the prior work in this area has focused on the fundamental bias-variance decomposition. In this work, we demonstrate that it is possible to go significantly beyond this via the ANOVA decomposition. Specifically, using this methodology, one can understand how the random training data, initialization, and label noise contribute to the test error in more detailed and comprehensive ways than what was previously possible. We carry this out in certain two-layer linear and non-linear networks with only the second layer trained (i.e., random features models), but our approach may be more broadly relevant.

Non-additive test error. A key finding of our work is that in the specific neural net models considered here, the random training data, initialization, and label noise contribute highly non-additively to the test error. Thus, when discussing “the effects of initialization”, some care ought to be taken, i.e., to clarify which interaction effects (e.g., with label noise or training data) this includes. The interaction term between the initialization and the training data can be large in our setting.

Beyond double descent: Prevalence of unimodality. While initial work on the asymptotic generalization error of one and two-layer neural nets focused on the “double descent” or peaking shape of the test error, our work gives further evidence that the unimodal shape of the variance is a prevalent phenomenon.


Moreover, our work also suggests that the unimodality holds not just for the overall variance, but also for specific variance components, which was not known in prior work. We show that unimodality with respect to both the overparametrization level and the data aspect ratio holds in specific parameter settings for the variance and certain other decompositions, for the optimal setting of the regularization parameter. In other parameter settings, we obtain monotonicity results for these components. This also underscores that regularization and the associated bias-variance tradeoff play a key role in determining monotonicity and unimodality.

2. ANOVA for a Two-layer Linear Network

2.1 Setup

In this section, we study the bias-variance tradeoff and ANOVA decomposition for a two-layer linear network model. Suppose that we have a training data set T containing n data points (xi, yi) ∈ R^d × R, with features xi and outcomes yi. We assume the data is drawn independently from a distribution such that xi = (xi1, xi2, ..., xid), where the xij are i.i.d. random variables satisfying

$$\mathbb{E}\, x_{ij} = 0, \qquad \mathbb{E}\, x_{ij}^2 = 1, \qquad \mathbb{E}\, |x_{ij}|^{8+\eta} < \infty,$$

where η > 0 is an arbitrary constant. Also, each (xi, yi) is drawn from the model

$$y = f^*(x) + \varepsilon = x^\top \theta + \varepsilon, \qquad \theta \in \mathbb{R}^d,$$

where ε ∼ N(0, σ²) is the label noise independent of x and σ ≥ 0 is the noise standard deviation. In matrix form, Y = Xθ + E, where X = (x1, x2, ..., xn)⊤ ∈ R^{n×d} has the input vectors xi, i = 1, 2, ..., n as its rows and Y = (y1, y2, ..., yn)⊤ ∈ R^{n×1} has the output values yi, i = 1, 2, ..., n as its entries. Our task is to learn the true regression function f*(x) = x⊤θ by using a two-layer linear neural network with weights W ∈ R^{p×d}, β ∈ R^{p×1}, which computes, for an input x ∈ R^d,

$$f(x) = (Wx)^\top \beta. \tag{1}$$

Later, in Section 3, we will also study two-layer nonlinear networks. For analytical tractability, we assume that the true parameters θ are random: θ ∼ N(0, α²Id/d). Here α² can be viewed as a signal strength parameter. This assumption corresponds to performing an “average-case” analysis of the difficulty of the problem over random problem instances given by various θ.

We also consider a random orthogonal initialization W independent of T, so W is a p × d matrix uniformly distributed over the set of matrices satisfying WW⊤ = Ip, also known as the Stiefel manifold. This requires that p ≤ d, so the dimension of the inner representation of the neural net is not larger than the number of input features. To an extent, this can be seen as a random projection model, where a lower-dimensional representation of the high-dimensional input features is obtained by randomly projecting the input features into a subspace. Both training and prediction are based on the lower dimensional representation.


In some works studying the orthogonal initialization of neural networks (e.g., Hu et al., 2020), the first layer weights W1 satisfy W1⊤W1 = I, while the last layer weights WL satisfy WLWL⊤ = I, so the dimension of the hidden representation is larger than the dimension of the input features. Similarly, in several recent works on wide neural networks, the number of inner neurons is large. However, we think that in many applications, the number of “higher level features” should indeed not be larger than the number of input features. For instance, the number of features in facial image data, such as eyes and hair, is expected to be not more than the number of pixels.

The model we consider here is related to the one from Yang et al. (2020), with several key differences. The orthogonal initialization is a key difference, as Yang et al. (2020) assume that W is a random Gaussian matrix. The expressive power of the two models is the same, but orthogonal initialization has some benefits (see Appendix A for more information).

During training, we fix W and estimate β by performing ridge regression:

$$\beta_{\lambda,\mathcal{T},W} = \arg\min_{\beta \in \mathbb{R}^p} \frac{1}{2n}\left\|Y - (WX^\top)^\top \beta\right\|_2^2 + \frac{\lambda}{2}\|\beta\|_2^2, \tag{2}$$

where λ > 0 is the regularization parameter. This has a closed-form solution

$$\beta_{\lambda,\mathcal{T},W} = \left(\frac{WX^\top X W^\top}{n} + \lambda I_p\right)^{-1} \frac{WX^\top Y}{n}. \tag{3}$$

We will often use the notation R = (WX⊤XW⊤/n + λIp)^{-1} for the so-called resolvent matrix of WX⊤XW⊤. By plugging this into (1), we obtain our estimated prediction function, for a new datapoint x, projected first via W and thus accessed via Wx:

$$f(x) = (Wx)^\top \beta_{\lambda,\mathcal{T},W} = x^\top W^\top R \,\frac{W X^\top Y}{n}. \tag{4}$$

Ridge regression is equivalent to ℓ2 weight decay, a popular heuristic. We will later show that ridge has some asymptotic optimality properties in our model, which thus justify its choice. In contrast, if we follow the approach from Hu et al. (2020) and take p > d with W⊤W = Id, then it is readily verified that we would obtain

$$f(x) = x^\top\left(X^\top X/n + \lambda I_d\right)^{-1} X^\top Y/n.$$

This means that the prediction function reduces to standard ridge regression. Thus we assume instead that WW⊤ = Ip, and this makes our model resemble the “feature extraction” layers of a neural network.
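To make the setup concrete, the following minimal numpy sketch generates data from the model above, draws a uniform partial orthogonal W (via the QR decomposition of a Gaussian matrix), and evaluates the ridge solution (3) and the prediction function (4). The function names and the specific parameter values are ours, for illustration only; this is not the code released with the paper.

# Minimal sketch of the model in (1)-(4): sample a training set from the linear
# model y = x^T theta + eps, draw a uniform partial orthogonal W (W W^T = I_p),
# then compute the ridge solution (3) and the prediction function (4).
# Function names and parameter choices are ours, for illustration only.
import numpy as np

def sample_partial_orthogonal(p, d, rng):
    """Uniform p x d matrix with W W^T = I_p (first p rows of a Haar orthogonal matrix)."""
    Z = rng.standard_normal((d, d))
    Q, _ = np.linalg.qr(Z)          # Q is Haar-distributed up to sign conventions
    return Q[:p, :]

def fit_two_layer_ridge(X, Y, W, lam):
    """Closed-form ridge solution (3): (W X^T X W^T / n + lam I_p)^{-1} W X^T Y / n."""
    n = X.shape[0]
    p = W.shape[0]
    G = W @ X.T @ X @ W.T / n + lam * np.eye(p)
    return np.linalg.solve(G, W @ X.T @ Y / n)

def predict(x, W, beta):
    """Prediction function (4): f(x) = (W x)^T beta."""
    return (W @ x) @ beta

rng = np.random.default_rng(0)
n, d, p = 500, 250, 200                                # delta = d/n = 0.5, pi = p/d = 0.8
alpha, sigma, lam = 1.0, 0.3, 0.01
theta = rng.normal(0.0, alpha / np.sqrt(d), size=d)    # theta ~ N(0, alpha^2 I_d / d)
X = rng.standard_normal((n, d))
Y = X @ theta + sigma * rng.standard_normal(n)
W = sample_partial_orthogonal(p, d, rng)
beta = fit_two_layer_ridge(X, Y, W, lam)
x_test = rng.standard_normal(d)
print(predict(x_test, W, beta), x_test @ theta)        # fitted prediction vs. noiseless target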

We will consider the following asymptotic setting. Let {pd, d, nd}, d = 1, 2, ..., be a sequence such that pd ≤ d and pd, d, nd → ∞ proportionally, i.e.,

$$\lim_{d\to\infty} \frac{p_d}{d} = \pi, \qquad \lim_{d\to\infty} \frac{d}{n_d} = \delta,$$

where π ∈ (0, 1] and δ ∈ (0, ∞). Here π ∈ (0, 1] denotes the parametrization factor, i.e., the number of parameters in β relative to the input dimension. Also, δ > 0 is the data aspect ratio. We will also use γ = δπ = lim p/n, the ratio of the number of learned parameters to the number of samples. In Yang et al. (2020), the assumption limd→∞ d/nd = 0 implies that the number of samples is much larger than the number of parameters, which is limiting in high dimensional problems. Thus, we study a broader setting, in which the sample size is proportional to the model size, with an arbitrary ratio.


2.2 Bias-Variance Decompositions and ANOVA

2.2.1 Introduction and the Main Result

Now we analyze the mean squared prediction error in our model,

$$\mathrm{MSE}(\lambda) = \mathbb{E}_{\theta,x,\varepsilon,X,\mathcal{E},W}\,(f_{\lambda,\mathcal{T},W}(x) - y)^2 = \mathbb{E}_{\theta,x,X,\mathcal{E},W}\,(f_{\lambda,\mathcal{T},W}(x) - x^\top\theta)^2 + \sigma^2.$$

The expectation is over a random test datapoint (x, y) from the same distribution as the training data, i.e., x ∈ R^d has i.i.d. zero mean, unit variance entries with finite 8 + η moment, y = θ⊤x + ε, and ε ∼ N(0, σ²). It is also over the random training input X, the random training label noise E, the random initialization W, and the random true parameter θ. Thus, this MSE corresponds to an average-case error over various training data sets, initializations, and true parameters. In this work, we will always average over the random test data point x, as is usual in classical statistical learning. We will also average over the random parameter θ, which corresponds to a Bayesian average-case analysis over various generative models. This is partly for technical reasons; we can also show almost sure convergence over the random θ, but the analysis for general θ is beyond our scope. Thus, we write the MSE as E_{θ,x}E_{X,E,W}(f_{λ,T,W}(x) − x⊤θ)² + σ², and the outer expectation E_{θ,x} is always present in our formulas.
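As an illustration of the expectations involved, the following schematic Monte Carlo estimate of MSE(λ) averages the squared prediction error over independent draws of θ, the training data (X, E), the initialization W, and a test point (x, y). The sample sizes and the helper name are ours; this is a crude sketch rather than the simulation protocol used in Section 4.

# Schematic Monte Carlo estimate of MSE(lambda): average the squared prediction
# error over independent draws of theta, the training set, the initialization W,
# and a test point. Purely illustrative; small sizes, names are ours.
import numpy as np

def mc_mse(lam, n=300, d=150, p=120, alpha=1.0, sigma=0.3, reps=50, seed=0):
    rng = np.random.default_rng(seed)
    errs = []
    for _ in range(reps):
        theta = rng.normal(0.0, alpha / np.sqrt(d), size=d)
        X = rng.standard_normal((n, d))
        Y = X @ theta + sigma * rng.standard_normal(n)
        Q, _ = np.linalg.qr(rng.standard_normal((d, d)))
        W = Q[:p, :]                                    # uniform partial orthogonal init
        beta = np.linalg.solve(W @ X.T @ X @ W.T / n + lam * np.eye(p), W @ X.T @ Y / n)
        x = rng.standard_normal(d)
        y = x @ theta + sigma * rng.standard_normal()
        errs.append(((W @ x) @ beta - y) ** 2)
    return float(np.mean(errs))

print(mc_mse(lam=0.01))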

We can study the mean squared error via the standard bias-variance decomposition corresponding to the average prediction function E_{X,E,W} f_{λ,T,W}(x) over the random training set T = (X, E) and initialization W: MSE(λ) = Bias²(λ) + Var(λ) + σ², where

$$\mathrm{Bias}^2(\lambda) = \mathbb{E}_{\theta,x}\big(\mathbb{E}_{X,\mathcal{E},W} f_{\lambda,\mathcal{T},W}(x) - x^\top\theta\big)^2,$$
$$\mathrm{Var}(\lambda) = \mathrm{Var}[f(x)] = \mathbb{E}_{\theta,x}\mathbb{E}_{X,\mathcal{E},W}\big(f_{\lambda,\mathcal{T},W}(x) - \mathbb{E}_{X,\mathcal{E},W} f_{\lambda,\mathcal{T},W}(x)\big)^2.$$

A key point is that the variance can be further decomposed into the components due to the randomness in the training data X, the label noise E, and the initialization W.

We use s, l, i to represent the samples X, the label noise E, and the initialization W, respectively. Their impact on the variance can be decomposed in a symmetric way into their main and interaction effects via the analysis of variance (ANOVA) decomposition (e.g., Box et al., 2005; Owen, 2013) as follows:

$$\mathrm{Var}[f(x)] = V_s + V_l + V_i + V_{sl} + V_{si} + V_{li} + V_{sli},$$

where

$$V_a = \mathbb{E}_{\theta,x}\mathrm{Var}_a[\mathbb{E}_{-a}(f(x)\mid a)], \qquad a \in \{s, l, i\},$$
$$V_{ab} = \mathbb{E}_{\theta,x}\mathrm{Var}_{ab}[\mathbb{E}_{-ab}(f(x)\mid a, b)] - V_a - V_b, \qquad a, b \in \{s, l, i\},\ a \neq b,$$
$$V_{abc} = \mathbb{E}_{\theta,x}\mathrm{Var}_{abc}[\mathbb{E}_{-abc}(f(x)\mid a, b, c)] - V_a - V_b - V_c - V_{ab} - V_{ac} - V_{bc} = \mathrm{Var}[f(x)] - V_s - V_l - V_i - V_{sl} - V_{si} - V_{li}, \qquad \{a, b, c\} = \{s, l, i\}.$$

Here, E_{−a} means taking the expectation with respect to all components except for a. Then Va ≥ 0 can be interpreted as the effect of varying a alone, also referred to as the main effect of a.

11

Page 12: What Causes the Test Error? Going Beyond Bias-Variance via

Lin and Dobriban

Also, Vab = Vba ≥ 0 can be interpreted as the second-order interaction effect between a and b beyond their main effects, and Vabc ≥ 0 can be seen as the interaction effect among a, b, c, beyond their pairwise interactions. The ANOVA decomposition is symmetric in a, b, c. We will show how to recover some sequential variance decompositions from the ANOVA decomposition later. We mention that this decomposition is sometimes referred to as functional ANOVA (Owen, 2013).

For intuition, consider the noiseless case when the label noise equals zero, so E = 0 and the index l does not contribute to the variance. Then the main effect of the samples/training data is Vs = E_{θ,x}Var_s[E_{−s}(f(x)|s)] = E_{θ,x}Var_X[E_W(f(x)|X)]. This can be interpreted as the variance of the ensemble estimator f̄(x) := E_W f(x), the average of models parametrized by various W’s on the same data set X. Therefore, Vs can be regarded as the expected variance with respect to the training data of the ensemble estimator f̄, where the expectation is over the test datapoint x and the true parameter θ. Furthermore, Vsi + Vi is the variance that can be eliminated by ensembling. This is because Vsi + Vs + Vi is the total variance of the estimator f(x), and Vs is the variance of the ensemble; thus Vsi + Vi is the remaining variance that can be eliminated.
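The ANOVA components can be estimated by a simple nested Monte Carlo scheme. The sketch below does this in the noiseless case (σ = 0), where only the samples (s) and the initialization (i) contribute: evaluating the fitted predictor on a grid of independent draws (Xj, Wk), the row/column means give the ensemble estimators, and the grid variances give Vs, Vi, and Vsi. This is a purely illustrative scheme of ours, not the estimator used in the paper, and it is biased for small grids.

# Schematic Monte Carlo estimate of the main and interaction effects V_s, V_i, V_si
# in the noiseless case (sigma = 0), where the only randomness is in the training
# samples X (index s) and the initialization W (index i). We evaluate the fitted
# prediction F[j, k] on a grid of independent draws X_j, W_k and use
# V_s = Var_j(mean_k F), V_i = Var_k(mean_j F), V_si = Var_{jk}(F) - V_s - V_i.
# Illustrative sketch only (single fixed theta and test point x).
import numpy as np

rng = np.random.default_rng(1)
n, d, p, alpha, lam = 300, 150, 120, 1.0, 0.01
reps = 30                                           # grid size per factor

theta = rng.normal(0.0, alpha / np.sqrt(d), size=d)
x = rng.standard_normal(d)                          # one fixed test point

Xs = [rng.standard_normal((n, d)) for _ in range(reps)]
Ws = []
for _ in range(reps):
    Q, _ = np.linalg.qr(rng.standard_normal((d, d)))
    Ws.append(Q[:p, :])

F = np.zeros((reps, reps))                          # F[j, k]: prediction with X_j, W_k
for j, X in enumerate(Xs):
    Y = X @ theta                                   # noiseless labels
    S = X.T @ X / n
    for k, W in enumerate(Ws):
        G = W @ S @ W.T + lam * np.eye(p)
        beta = np.linalg.solve(G, W @ (X.T @ Y) / n)
        F[j, k] = (W @ x) @ beta

V_total = F.var()
V_s = F.mean(axis=1).var()                          # variance over X of the W-ensemble
V_i = F.mean(axis=0).var()                          # variance over W of the X-average
V_si = V_total - V_s - V_i
print(V_s, V_i, V_si)

For σ > 0, the same grid idea extends to a third axis of label-noise draws, at a correspondingly higher computational cost.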

Our bias-variance decomposition is slightly different from the standard one in statistical learning theory (e.g., Hastie et al., 2009, p. 24), where the variability is introduced only by the random training set T. Here we also consider the variability due to the random initialization W, and decompose the variance due to T into that due to the samples X and the label noise E. The motivation is that randomization in the algorithms, such as random initialization, as well as randomness in stochastic gradient descent, are very common in modern machine learning. Our decomposition helps understand such scenarios.

Now, we define the following quantities, which are frequently used throughout our paper. These are “resolvent moments” of the well-known Marchenko-Pastur (MP) distribution Fγ (Marchenko and Pastur, 1967). The MP distribution is the limit of the distribution of eigenvalues of sample covariance matrices n^{-1}Z⊤Z of n × p data matrices Z with i.i.d. zero-mean, unit-variance entries when n, p → ∞ with p/n → γ > 0 (Bai and Silverstein, 2010; Couillet and Debbah, 2011; Anderson et al., 2010; Yao et al., 2015).

Definition 1 (Resolvent moments) In this paper, we use the first and second resolvent moments

$$\theta_1(\gamma, \lambda) := \int \frac{1}{x+\lambda}\, dF_\gamma(x), \qquad \theta_2(\gamma, \lambda) := \int \frac{1}{(x+\lambda)^2}\, dF_\gamma(x), \tag{5}$$

where Fγ(x) is the Marchenko-Pastur distribution with parameter γ. Recall that for us, γ = δπ. Then θ1 := θ1(γ, λ) and θ2 := θ2(γ, λ) have explicit expressions (see e.g., Bai and Silverstein, 2010, for the first one; our proofs also contain the derivations):

$$\theta_1 = \frac{(-\lambda+\gamma-1) + \sqrt{(-\lambda+\gamma-1)^2 + 4\lambda\gamma}}{2\lambda\gamma}, \tag{6}$$

$$\theta_2 = -\frac{d\theta_1}{d\lambda} = \frac{\gamma-1}{2\gamma\lambda^2} + \frac{(\gamma+1)\lambda + (\gamma-1)^2}{2\gamma\lambda^2\sqrt{(-\lambda+\gamma-1)^2 + 4\lambda\gamma}}. \tag{7}$$

We further define

$$\bar\lambda := \lambda + \frac{1-\pi}{2\pi}\left[\lambda + 1 - \gamma + \sqrt{(\lambda+\gamma-1)^2 + 4\lambda}\right],$$

and denote θ̄1 := θ1(δ, λ̄), θ̄2 := θ2(δ, λ̄).

Remark. From (6), it is readily verified that λγθ1² + (λ − γ + 1)θ1 − 1 = 0. Taking derivatives and noting that θ2 = −dθ1/dλ, we get 1 + (γ − 1)θ1 − 2λ²γθ1θ2 − λ(λ − γ + 1)θ2 = 0. These two equations will be useful when simplifying certain formulas. Then we have the following fundamental result on the asymptotic behavior of the variance components. This is our first main result.
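For concreteness, the following sketch evaluates θ1, θ2 from the explicit formulas (6)-(7) and the adjusted λ̄ from the definition above, checks the quadratic identity from the Remark, and compares θ1 with the empirical average of 1/(eigenvalue + λ) for a large sample covariance matrix. Variable names and parameter values are ours.

# Numerical evaluation of the resolvent moments (6)-(7) and the adjusted
# regularization lambda_bar, with two sanity checks: the quadratic identity from
# the Remark, and a Monte Carlo comparison of theta_1 against the eigenvalues of
# a large sample covariance matrix Z^T Z / n with p/n = gamma. Names are ours.
import numpy as np

def theta1(gamma, lam):
    return (-lam + gamma - 1 + np.sqrt((-lam + gamma - 1) ** 2 + 4 * lam * gamma)) / (2 * lam * gamma)

def theta2(gamma, lam):
    disc = np.sqrt((-lam + gamma - 1) ** 2 + 4 * lam * gamma)
    return (gamma - 1) / (2 * gamma * lam ** 2) + ((gamma + 1) * lam + (gamma - 1) ** 2) / (2 * gamma * lam ** 2 * disc)

def lam_bar(lam, pi, delta):
    """Adjusted regularization following the definition of lambda_bar above."""
    gamma = pi * delta
    return lam + (1 - pi) / (2 * pi) * (lam + 1 - gamma + np.sqrt((lam + gamma - 1) ** 2 + 4 * lam))

pi, delta, lam = 0.8, 1.25, 0.01
gamma = pi * delta
t1, t2 = theta1(gamma, lam), theta2(gamma, lam)

# Quadratic identity from the Remark: lam*gamma*t1^2 + (lam - gamma + 1)*t1 - 1 = 0.
print(lam * gamma * t1 ** 2 + (lam - gamma + 1) * t1 - 1)

# Monte Carlo check of theta_1 via the eigenvalues of Z^T Z / n (Marchenko-Pastur limit).
rng = np.random.default_rng(2)
n = 2000
p = int(n * gamma)
Z = rng.standard_normal((n, p))
evals = np.linalg.eigvalsh(Z.T @ Z / n)
print(t1, np.mean(1.0 / (evals + lam)), lam_bar(lam, pi, delta))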

Theorem 2 (Variance components) Under the previous assumptions, consider an n × d feature matrix X with i.i.d. entries of zero mean, unit variance and finite 8 + η-th moment. Take a two-layer linear neural network f(x) = (Wx)⊤β, with p ≤ d intermediate activations, and a p × d matrix W of first-layer weights chosen uniformly subject to the orthogonality constraint WW⊤ = Ip. Then, the variance components have the following limits as n, p, d → ∞, with p/d → π ∈ (0, 1] (parametrization level), d/n → δ > 0 (data aspect ratio). Here α² is the signal strength, σ² is the noise level, λ is the regularization parameter, the θi are the resolvent moments, and λ̄, θ̄i are their adjusted versions.

$$\lim_{d\to\infty} V_s = \alpha^2\left[1 - 2\bar\lambda\bar\theta_1 + \bar\lambda^2\bar\theta_2 - \pi^2(1-\lambda\theta_1)^2\right]$$
$$\lim_{d\to\infty} V_l = 0$$
$$\lim_{d\to\infty} V_i = \alpha^2\pi(1-\pi)(1-\lambda\theta_1)^2$$
$$\lim_{d\to\infty} V_{sl} = \sigma^2\delta(\bar\theta_1 - \bar\lambda\bar\theta_2)$$
$$\lim_{d\to\infty} V_{li} = 0$$
$$\lim_{d\to\infty} V_{si} = \alpha^2\left[\pi\left(1 - 2\lambda\theta_1 + \lambda^2\theta_2 + (1-\pi)\delta(\theta_1 - \lambda\theta_2)\right) - \pi(1-\pi)(1-\lambda\theta_1)^2 - 1 + 2\bar\lambda\bar\theta_1 - \bar\lambda^2\bar\theta_2\right]$$
$$\lim_{d\to\infty} V_{sli} = \sigma^2\delta\left[\pi(\theta_1 - \lambda\theta_2) - (\bar\theta_1 - \bar\lambda\bar\theta_2)\right].$$

Remark. Theorem 2 shows that the label noise does not contribute to the variance via a main effect (because limd→∞ Vl = 0), but instead through its interaction effects with the samples and the initialization (the terms Vsl, Vsli). These can be arbitrarily large if we let σ → ∞. In our simulations, we assume a reasonable amount of label noise, e.g., σ = 0.3α.
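The limiting formulas of Theorem 2 are straightforward to evaluate numerically. The sketch below tabulates the nonzero variance components over a grid of aspect ratios δ at fixed parametrization π = 0.8 (the setting of Figure 1, where the interaction term Vsi is expected to dominate near δ = 1.25, i.e., lim p/n = 1). The helper functions restate Definition 1 and the definition of λ̄ above; all names are ours.

# Sketch: tabulate the limiting variance components of Theorem 2 over a grid of
# aspect ratios delta, at fixed parametrization pi (cf. Figure 1). The helpers
# restate Definition 1 and the adjusted lambda_bar; names are ours.
import numpy as np

def theta1(g, lam):
    return (-lam + g - 1 + np.sqrt((-lam + g - 1) ** 2 + 4 * lam * g)) / (2 * lam * g)

def theta2(g, lam):
    disc = np.sqrt((-lam + g - 1) ** 2 + 4 * lam * g)
    return (g - 1) / (2 * g * lam ** 2) + ((g + 1) * lam + (g - 1) ** 2) / (2 * g * lam ** 2 * disc)

def variance_components(pi, delta, lam, alpha2=1.0, sigma2=0.09):
    g = pi * delta
    t1, t2 = theta1(g, lam), theta2(g, lam)
    lb = lam + (1 - pi) / (2 * pi) * (lam + 1 - g + np.sqrt((lam + g - 1) ** 2 + 4 * lam))
    tb1, tb2 = theta1(delta, lb), theta2(delta, lb)           # adjusted (barred) moments
    Vs = alpha2 * (1 - 2 * lb * tb1 + lb ** 2 * tb2 - pi ** 2 * (1 - lam * t1) ** 2)
    Vi = alpha2 * pi * (1 - pi) * (1 - lam * t1) ** 2
    Vsl = sigma2 * delta * (tb1 - lb * tb2)
    Vsi = alpha2 * (pi * (1 - 2 * lam * t1 + lam ** 2 * t2 + (1 - pi) * delta * (t1 - lam * t2))
                    - pi * (1 - pi) * (1 - lam * t1) ** 2 - 1 + 2 * lb * tb1 - lb ** 2 * tb2)
    Vsli = sigma2 * delta * (pi * (t1 - lam * t2) - (tb1 - lb * tb2))
    return dict(Vs=Vs, Vi=Vi, Vsl=Vsl, Vsi=Vsi, Vsli=Vsli)

for delta in (0.5, 1.0, 1.25, 2.0, 4.0):
    comps = variance_components(pi=0.8, delta=delta, lam=0.01)
    print(delta, {k: round(v, 4) for k, v in comps.items()})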

2.2.2 Ordered Variance Decompositions

Using Theorem 2, we can calculate all six variance decompositions corresponding to the orderings of the sources of randomness. Namely, suppose that we decompose the variance in the order (a, b, c), where {a, b, c} = {s, l, i}, i.e., we calculate the following three terms:

$$\Sigma^a_{abc} := \mathbb{E}_{\theta,x}\mathbb{E}_{a,b,c}\left[f(x) - \mathbb{E}_a f(x)\right]^2$$
$$\Sigma^b_{abc} := \mathbb{E}_{\theta,x}\mathbb{E}_{b,c}\left[\mathbb{E}_a f(x) - \mathbb{E}_{a,b} f(x)\right]^2$$
$$\Sigma^c_{abc} := \mathbb{E}_{\theta,x}\mathbb{E}_{c}\left[\mathbb{E}_{a,b} f(x) - \mathbb{E}_{a,b,c} f(x)\right]^2.$$

We can interpret these as follows: (1) Σ^a_{abc} is all the variance related to a. (2) Σ^b_{abc} is all the variance related to b after subtracting all the variance related to a in the total variance. (3) Σ^c_{abc} is the part of the variance that depends only on c. Then, simple calculations show

$$\Sigma^a_{abc} = V_a + V_{ab} + V_{ac} + V_{abc}$$
$$\Sigma^b_{abc} = V_b + V_{bc}$$
$$\Sigma^c_{abc} = V_c.$$
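Since the ordered decompositions are just sums of ANOVA components, they can be computed mechanically from the identities above. The small helper below (ours, purely illustrative) does this for an arbitrary ordering of {s, l, i}; the numerical values in the example are arbitrary placeholders with the structure of Theorem 2 (Vl = Vli = 0).

# Tiny helper implementing the identities above for any ordering (a, b, c) of
# {s, l, i}. The dictionary of ANOVA components could come, e.g., from the limits
# in Theorem 2; the numbers below are placeholders only.
def ordered_decomposition(V, order):
    """V maps frozensets of factor labels to ANOVA components; order is e.g. ('l','s','i')."""
    a, b, c = order
    key = frozenset
    sigma_a = V[key({a})] + V[key({a, b})] + V[key({a, c})] + V[key({a, b, c})]
    sigma_b = V[key({b})] + V[key({b, c})]
    sigma_c = V[key({c})]
    return sigma_a, sigma_b, sigma_c

V = {frozenset(s): v for s, v in [("s", 0.2), ("l", 0.0), ("i", 0.1),
                                  ("sl", 0.05), ("si", 0.4), ("li", 0.0), ("sli", 0.02)]}
print(ordered_decomposition(V, ("l", "s", "i")))   # (Sigma_label, Sigma_sample, Sigma_init)
print(ordered_decomposition(V, ("l", "i", "s")))   # the ordering used by d'Ascoli et al. (2020)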

In previous work, d’Ascoli et al. (2020) considered the decomposition in the order label - initialization - sample (l, i, s), which becomes in our case (canceling the terms that vanish)

$$\Sigma^l_{lis} = V_{ls} + V_{lsi} + V_l + V_{li} = V_{ls} + V_{lsi}$$
$$\Sigma^i_{lis} = V_i + V_{si}$$
$$\Sigma^s_{lis} = V_s.$$

d’Ascoli et al. (2020) argued that the label and initialization noise dominate the variance. However, different decomposition orders can lead to qualitatively different results. We take the decomposition order (l, s, i) as an example. Roughly speaking, in this decomposition, Σ^l_{lsi} and Σ^s_{lsi} can be interpreted as the variance introduced by the data set given a fixed initialization (and model), and Σ^i_{lsi} is the variance of the initialization alone. Figure 2 (left, middle) shows the results under these two different decomposition orders.

Figure 2: Bias-variance decompositions in three orders. Left: decomposition order label, sample, initialization (l, s, i); dominated by Σ^s_{lsi} (“samples”). Middle: decomposition order label, initialization, sample (l, i, s); dominated by Σ^i_{lis} (“initialization”). Right: decomposition order sample, label, initialization (s, l, i); dominated by Σ^s_{sli} (“samples”). Parameters: signal strength α = 1, noise level σ = 0.3, regularization parameter λ = 0.01, parametrization level π = 0.8.

Comparing the left (lsi) and middle (lis) panels of Figure 2, we can see that different decomposition orders indeed lead to qualitatively different results. When δ < 2, in lsi, the variance with respect to samples dominates the total variance, while in lis the variance with respect to initialization dominates. Therefore, to have a better understanding of the limiting MSE of ridge models, it is preferable to decompose the variance in a symmetric and more systematic way using the variance components. In fact, the discrepancy between these two decompositions is due to the term Vsi, which dominates the variance (as discussed later) and is contained in both Σ^s_{lsi} and Σ^i_{lis}. By identifying this key term Vsi, which has not appeared in prior work, we are able to pinpoint the specific reason why the variance is large, namely the interaction between the variation in samples and initialization. Moreover, the variance with respect to samples is even larger in Figure 2 (right), for the (sli) order, since Σ^s_{sli} also contains the interaction effect between the randomness of samples and labels.

2.2.3 A Special Ordered Variance Decomposition

Next, we consider a special case of the variance decomposition, in the order of label-sample-initialization. This is advantageous because it leads to particularly simple formulas, whose monotonicity properties are particularly tractable, as explained below. The variance decomposes as Var(λ) = Σlabel + Σsample + Σinit, where

$$\Sigma_{\mathrm{label}} := \Sigma^l_{lsi} = \mathbb{E}_{\theta,x}\mathbb{E}_{X,W,\mathcal{E}}\big(f_{\lambda,\mathcal{T},W}(x) - \mathbb{E}_{\mathcal{E}} f_{\lambda,\mathcal{T},W}(x)\big)^2$$
$$\Sigma_{\mathrm{sample}} := \Sigma^s_{lsi} = \mathbb{E}_{\theta,x}\mathbb{E}_{X,W}\big(\mathbb{E}_{\mathcal{E}} f_{\lambda,\mathcal{T},W}(x) - \mathbb{E}_{X,\mathcal{E}} f_{\lambda,\mathcal{T},W}(x)\big)^2$$
$$\Sigma_{\mathrm{init}} := \Sigma^i_{lsi} = \mathbb{E}_{\theta,x}\mathbb{E}_{W}\big(\mathbb{E}_{X,\mathcal{E}} f_{\lambda,\mathcal{T},W}(x) - \mathbb{E}_{W,X,\mathcal{E}} f_{\lambda,\mathcal{T},W}(x)\big)^2.$$

As above, intuitively Σlabel is all the variance related to the label noise, Σsample is the variance related to the samples after subtracting the variance related to the label noise, and Σinit is the variance due only to the initialization. The decomposition “label-init-sample” was studied in d’Ascoli et al. (2020). Going beyond what was previously known, we can get explicit expressions not only for the variances (using the ANOVA decomposition), but also for the bias, and moreover prove some monotonicity and unimodality properties of these quantities when the ridge parameter λ is optimal.

Yang et al. (2020) observed empirically that the variance when fitting certain neural networks can often be unimodal, and proved this for a 2-layer net similar to our setting, with Gaussian initialization W and assuming n/d → ∞. However, they left open the question of understanding this phenomenon more broadly, writing that “The main unexplained mystery is the unimodality of the variance”. Our result sheds further light on this problem.

Corollary 3 (Bias & variance in Two-Layer Orthogonal Net—special ordering) Under the assumptions from Theorem 2, we have the limits

$$\lim_{d\to\infty} \mathrm{Bias}^2(\lambda) = \alpha^2(1 - \pi + \lambda\pi\theta_1)^2, \tag{8}$$
$$\lim_{d\to\infty} \mathrm{Var}(\lambda) = \alpha^2\pi\left[1 - \pi + (\pi-1)(2\lambda-\delta)\theta_1 - \pi\lambda^2\theta_1^2 + \lambda(\lambda-\delta+\pi\delta)\theta_2\right] + \sigma^2\pi\delta(\theta_1 - \lambda\theta_2). \tag{9}$$

More specifically,

$$\lim_{d\to\infty} \Sigma_{\mathrm{label}}(\lambda) = \sigma^2\pi\delta(\theta_1 - \lambda\theta_2), \tag{10}$$
$$\lim_{d\to\infty} \Sigma_{\mathrm{sample}}(\lambda) = \alpha^2\pi\left[-\lambda^2\theta_1^2 + \lambda^2\theta_2 + (1-\pi)\delta(\theta_1 - \lambda\theta_2)\right], \tag{11}$$
$$\lim_{d\to\infty} \Sigma_{\mathrm{init}}(\lambda) = \alpha^2\pi(1-\pi)(1-\lambda\theta_1)^2, \tag{12}$$

where θi := θi(πδ, λ), i = 1, 2. Therefore

$$\lim_{d\to\infty} \mathrm{MSE}(\lambda) = \alpha^2\left\{1 - \pi + \pi\delta\left(1 - \pi + \sigma^2/\alpha^2\right)\theta_1 + \left[\lambda - \delta\left(1 - \pi + \sigma^2/\alpha^2\right)\right]\lambda\pi\theta_2\right\} + \sigma^2. \tag{13}$$

For any fixed δ, π, the asymptotic MSE has a unique minimum at λ∗ := δ(1 − π + σ²/α²) (except when π = 1, σ = 0).

Remark. Except for the simple formula for the bias, Theorem 3 is a direct corollary of Theorem 2, since Σlabel, Σsample, Σinit and Var are all sums of several variance components.
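The claim that λ∗ = δ(1 − π + σ²/α²) minimizes the limiting MSE in (13) is easy to check numerically, as in the following sketch (function and variable names are ours; σ² = 0.09 corresponds to σ = 0.3α with α = 1).

# Sketch: verify numerically that the limiting MSE in (13) is minimized at
# lambda* = delta * (1 - pi + sigma^2 / alpha^2), by evaluating (13) on a grid.
import numpy as np

def theta1(g, lam):
    return (-lam + g - 1 + np.sqrt((-lam + g - 1) ** 2 + 4 * lam * g)) / (2 * lam * g)

def theta2(g, lam):
    disc = np.sqrt((-lam + g - 1) ** 2 + 4 * lam * g)
    return (g - 1) / (2 * g * lam ** 2) + ((g + 1) * lam + (g - 1) ** 2) / (2 * g * lam ** 2 * disc)

def limiting_mse(lam, pi, delta, alpha2=1.0, sigma2=0.09):
    g = pi * delta
    r = 1 - pi + sigma2 / alpha2
    return alpha2 * (1 - pi + pi * delta * r * theta1(g, lam)
                     + (lam - delta * r) * lam * pi * theta2(g, lam)) + sigma2

pi, delta, alpha2, sigma2 = 0.8, 1.25, 1.0, 0.09
lam_star = delta * (1 - pi + sigma2 / alpha2)
grid = np.linspace(0.01, 2.0, 400)
mses = [limiting_mse(l, pi, delta, alpha2, sigma2) for l in grid]
print(lam_star, grid[int(np.argmin(mses))])     # the two values should be close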

Almost sure results over the random true parameter. Above, we provide average-case results over the true parameters θ ∼ N(0, α²Id/d). With additional work, we show below a corresponding almost sure result. For the next result, we assume that X has i.i.d. Gaussian entries.

Theorem 4 (Almost sure result over the true parameter θ) For each triple (pd, d, nd), suppose that the true parameter θ is a sample drawn from N(0, α²Id/d). Suppose in addition that each entry of X is i.i.d. standard normal. Then as d → ∞, Theorems 2 and 3 still hold almost surely over the selection of θ.

2.2.4 Optimal Regularization Parameter; Monotonicity and Unimodality

In this section, we present some theoretical results about the risks when using an optimal regularization parameter. Moreover, we study the monotonicity and unimodality of certain variance components in that setting.

We can find explicit formulas for the asymptotic bias and variance at the optimal λ∗, by plugging the expressions of λ∗ and θ1 into equations (8), (9):

$$\lim_{d\to\infty} \mathrm{Bias}^2(\lambda^*) = \alpha^2(1 - \pi + \lambda^*\pi\theta_1)^2 = \frac{\alpha^2}{4\delta^2}\left(\delta(1 - \sigma^2/\alpha^2) - 1 + \sqrt{(\delta(1 + \sigma^2/\alpha^2) + 1)^2 - 4\gamma}\right)^2. \tag{14}$$

$$\lim_{d\to\infty} \mathrm{Var}(\lambda^*) = -\sigma^2\pi + (\alpha^2 + \sigma^2\delta)\,\frac{\delta(2\pi - 1 - \sigma^2/\alpha^2) - 1 + \sqrt{(\delta(1 + \sigma^2/\alpha^2) + 1)^2 - 4\gamma}}{2\delta^2}.$$

$$\lim_{d\to\infty} \mathrm{MSE}(\lambda^*) = \alpha^2\left[1 - \pi + \lambda^*\pi\theta_1\right] + \sigma^2 = \frac{\alpha^2}{2\delta}\left(\delta(1 - \sigma^2/\alpha^2) - 1 + \sqrt{(\delta(1 + \sigma^2/\alpha^2) + 1)^2 - 4\gamma}\right) + \sigma^2. \tag{15}$$

From Theorem 3, we know that the optimal ridge penalty is λ∗ = δ(1 − π + σ²/α²). Thus, by plugging the expression of λ∗ into (8)–(13), we are able to study the properties of the MSE, bias and variance components at the optimal λ∗ as functions of π, δ. Our results are summarized in Table 1. See Figure 3 for an illustration.

As for the monotonicity and unimodality properties, we have the following statement, where the properties are summarized in Table 1 for clarity.

Theorem 5 (Monotonicity and unimodality) Under the assumptions above, the MSE, Bias, and components of the sequential variance decomposition in the l-s-i order have the monotonicity and unimodality properties summarized in Table 1. For instance, the MSE is non-increasing as a function of the parameterization level π, while holding δ fixed.

16

Page 17: What Causes the Test Error? Going Beyond Bias-Variance via

What Causes the Test Error?

Function  | parametrization π = lim p/d                                | aspect ratio δ = lim d/n
MSE       | ↘                                                           | ↗
Bias²     | ↘                                                           | ↗
Var       | δ < 2α²/(α² + 2σ²): ∧, max at [2 + δ(1 + 2σ²/α²)]/4;        | π ≤ 0.5: ↘;
          | δ ≥ 2α²/(α² + 2σ²): ↗                                       | π > 0.5: ∧, max at 2(2π − 1)/[1 + 2σ²/α²]
Σlabel    | ↗                                                           | ∧, max at α²/(α² + σ²)
Σinit     | ∧                                                           | ↘
Σsample   | conjecture: ↗ or ∧                                          | conjecture: ∧

Table 1: Monotonicity properties of various components of the risk at the optimal λ∗, as a function of π and δ, while holding all other parameters fixed. ↗: non-decreasing. ↘: non-increasing. ∧: unimodal. Thus, e.g., the MSE is non-increasing as a function of the parameterization level π, while holding δ fixed.
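The maximizer locations reported in Table 1 for the variance can be checked numerically from the closed-form expression for Var(λ∗) above; the sketch below does this in both directions. The grid resolutions and parameter values are ours.

# Sketch: locate the maximizer of the limiting variance at the optimal lambda*
# numerically and compare with the closed-form locations in Table 1. Uses the
# Var(lambda*) expression above; names are ours.
import numpy as np

def var_opt(pi, delta, alpha2=1.0, sigma2=0.09):
    s = sigma2 / alpha2
    root = np.sqrt((delta * (1 + s) + 1) ** 2 - 4 * pi * delta)
    return -sigma2 * pi + (alpha2 + sigma2 * delta) * (delta * (2 * pi - 1 - s) - 1 + root) / (2 * delta ** 2)

s = 0.09  # sigma^2 / alpha^2

# Var as a function of pi, at fixed delta < 2 alpha^2 / (alpha^2 + 2 sigma^2):
delta = 1.0
pis = np.linspace(0.01, 1.0, 2000)
print(pis[np.argmax([var_opt(p, delta) for p in pis])], (2 + delta * (1 + 2 * s)) / 4)

# Var as a function of delta, at fixed pi > 0.5:
pi = 0.8
deltas = np.linspace(0.01, 5.0, 2000)
print(deltas[np.argmax([var_opt(pi, d) for d in deltas])], 2 * (2 * pi - 1) / (1 + 2 * s))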

We provide some observations below.

Consistency with prior work. The MSE result is consistent with optimal regularization mitigating double descent, which was shown in finite samples in a certain two-layer Gaussian model in Nakkiran et al. (2021). However, our result holds for more general distributions of data matrices with arbitrary i.i.d. entries, although it is only proven asymptotically.

Variance as a function of δ. For a fixed parametrization level π = lim p/d, as a function of the “dimensions-per-sample” parameter δ = lim d/n, the variance is monotonically decreasing when π < 0.5, and unimodal with a peak at 2(2π − 1)/[1 + 2σ²/α²] when π ≥ 0.5. This prompts the question: why is π = 1/2 special? The special role of this value was also noted in Yang et al. (2020). Recall that d is the original dimension, while p is the number of features in the intermediate layer, and π = lim p/d. While the role of π = 1/2 does not seem straightforward to understand, qualitatively, for large π we keep a lot of features in the inner layer. This is close to a “well-specified” model. Thus, when we increase the size of the data set (and thus decrease δ = lim d/n), it is reasonable that the variance decreases. In contrast, regardless of π, when we severely decrease the size of the data set (and thus increase δ = lim d/n), the optimal ridge estimator will regularize more strongly, and thus it is possible that its variance may decrease (which is what we indeed observe).

Variance as a function of π. For small δ, the variance is unimodal with respect to π. A possible heuristic is as follows. Recall that π = lim p/d (d is the data dimension, p is the number of features in the inner layer) denotes the amount of “parametrization” we allow. When π ≈ 0, the number of features in the inner layer is very small, which effectively corresponds to a “low signal strength” problem (see also our added noise interpretation below). The optimal ridge estimator thus employs strong regularization, and acts like a constant estimator, so the variance is almost zero. When π = 1, we are using the correct number of features to estimate θ, so the variance is also small. The variance is zero when σ = 0, and we can plot it when σ > 0 (see Figure 3). The above reasoning also suggests that the variance may be larger for intermediate values of π. Thus, the unimodality of the variance with respect to π is perhaps reasonable.


Figure 3: Perspective plots of the performance characteristics. Top row: Bias² (left), MSE (right). Bottom row: variance, from two perspectives. All are shown as functions of π, δ, at the optimal λ∗ = δ(1 − π + σ²/α²), with α = 1 and σ = 0.5.


Beyond our results on Σlabel and Σinit, we conjecture based on numerical experiments that Σsample is unimodal as a function of δ, and can be either unimodal or monotone as a function of π. However, this appears more challenging to establish.

Comparison with Rocks and Mehta (2020). In their paper, they suggest that the training process W should be separated from the sampling of the training data X, ε when studying the variance. Thus, they calculate the variance by fixing θ, W, computing conditional variances (due to X, ε), and then taking the expectation over θ, W. This can, in principle, still be recovered from our general framework, if we look at the variance components conditioned on θ, W. In contrast, we consider the randomness arising from all components together. Our approach allows us to study some problems that do not easily fall within the scope of the conditional approach. For instance, we can provide intuition for why ensembling works, namely that it can reduce the interaction effect Vsi.

Multiple descent. It has been argued that other possible shapes of the test error, such as multiple descents, can arise.


Liang et al. (2020) study kernel regression under the limiting regimes d ∼ n^c, c ∈ (0, 1). They provide an upper bound for the MSE and show its multiple descent behavior as c increases. Adlam and Pennington (2020a) study the neural tangent kernel under the limiting regimes p ∼ n^c, c = 1, 2, and observe that the MSE has a triple descent shape as a function of p. To conclude, the MSE may exhibit multiple descent when considering different asymptotic regimes. However, since we only consider the proportional limit where p/d → π, d/n → δ, we have not found evidence of multiple descent in our setting.

Remark: Fully linear regression. The monotonicity of the MSE and bias, and the unimodality of the variance at the optimal λ∗, also appear in the simpler (one-layer) linear setting. Namely, we consider the usual linear model Y = Xθ + E, where X ∈ R^{n×d} is the data matrix and Y ∈ R^n is the response. We fit a linear regression of Y on X, which can be seen as a special case of our two-layer setting with W = Id (and d = p). We use the same assumptions (except the assumptions on W) and notations as in the two-layer setting. In particular, we assume n, p → ∞ with p/n → γ > 0, where the aspect ratio γ is now a measure of the parametrization level. We have the following result.

Proposition 6 (Properties of the limiting MSE, bias & variance in linear setting)Under the same assumptions as in the two-layer setting, the limiting characteristics of opti-mally tuned ridge regression (λ∗ = γ/α2) have the following properties as a function of thedegree of parameterization γ:

1. MSE(γ) is increasing as a function of γ.

2. Bias2(γ) is increasing as a function of γ.

3. Var(γ) is unimodal as a function of γ, with maximum at γ = α2/(α2 + 1).

4. At the maximum, the bias equals the variance: Bias2[α2/(α2+1)] = Var[α2/(α2+1)].

Figure 6 in Liu and Dobriban (2020) shows the MSE, variance and bias at optimal λ∗.However, that work only studied it visually, and not theoretically. The result on the MSEhas appeared before as Proposition 6 of Dobriban and Sheng (2020), in a different context.However, the results on the bias and variance have not been considered in that work.

2.3 Further Properties of the Bias, Variance and MSE

In this section, we report some further properties of the bias, variances and MSE, includingbut not limited to the optimal ridge parameter setting.

2.3.1 Relation Between MSE and Bias at Optimum

We present a somewhat surprising relation between MSE and bias at the optimal λ∗ =λ∗(δ, π, α2, σ2): Let Bias2 := limd→∞Bias2(λ∗), Bias = |

√Bias2|, and denote MSE:=

limd→∞MSE(λ∗). In general for all λ, we have that MSE(λ) = Bias2(λ) + Var(λ) + σ2.Also, in prominent problems such as in non-parametric statistics, optimal rates are achievedby balancing bias and variance (e.g., Ibragimov and Has′ Minskii, 2013, etc). Thus we areinterested to see if the bias and variance are also balanced at the optimal λ in our case.

19

Page 20: What Causes the Test Error? Going Beyond Bias-Variance via

Lin and Dobriban

However, based on the explicit expressions above, the bias and variance are balanced viathe signal strength α2 via

Var = Bias · (α−Bias).

This holds for any π, δ and α. Thus, the MSE and bias are linked in a nontrivial way atthe optimal λ. We see that the optimal squared bias and variance are in general not equalat the optimum. Instead, we have the above relation, which also balances the bias with thesignal strength α2. We think that this explicit relation is remarkable.

2.3.2 Fixed Regularization Parameter

Figure 4: Left: Asymptotic MSE of ridge models when λ = 0.01, σ = 0.3, α = 1. Right:Asymptotic MSE as function of 1/δ when λ = 0.01, σ = 0.3, α = 1. (Note: this figure isplotted as a function of 1/δ, instead of δ as before. Increasing 1/δ is equivalent to increasingthe number of samples n.)

From Figure 3 and Theorem 5 above, we can see that the MSE is monotone decreasingwith respect to the parametrization level π = lim p/d if we choose the optimal λ∗. This isconsistent with “double descent being mitigated”, as in the results of Nakkiran et al. (2021)for a different problem.

Here we provide additional results for a suboptimal choice of λ. Specifically, we considerthe simplest case when λ is fixed across problem sizes. In contrast, we find that doubledescent is not mitigated, and may occur when we use a small regularization parameter (alsoreferred to as the ridgeless limit). In Figure 4 (left), we fix λ = 0.01 and plot a heatmapof the asymptotic MSE as function of the two variables π = lim p/d and δ = lim d/n.Clearly, the MSE is in general not monotone with respect to π or δ. Note the peak inthe MSE around the curve γ = δπ = 1, or equivalently δ = 1/π. This corresponds to the“interpolation threshold” where lim p/n = 1, and the number of learned parameters p isclose to the number of samples n. Thus, we fit just enough parameters to interpolate thedata.

Besides, we see in Figure 4 (right) that double descent (which we interpret as a changeof monotonicity, or a peak in the risk curve) with respect to 1/δ = limn/d occurs when π is

20

Page 21: What Causes the Test Error? Going Beyond Bias-Variance via

What Causes the Test Error?

suitably large, e.g., π = 0.9, while the MSE is unimodal when π is small, e.g., π = 0.5. Theintuition is, as in many previous works on double descent (e.g., LeCun et al., 1991; Hastieet al., 2019), that the suboptimal regularization can lead to a somewhat ill-conditionedproblem, which increases the error (see also section 2.3.4 for more explanation). Thus, wesee that here as in prior works, using a suboptimal penalty λ may lead to non-monotoneMSE.

Moreover, we can obtain some quantitative results about the bias and variance withfixed values of the regularization parameter λ.

Theorem 7 (Bias and variance of ridge models given a fixed λ) Under the assump-tions in our two layer setting, we have

1. For any fixed λ > 0, limd→∞Bias2(λ) is monotonically decreasing as a function of πand is monotonically increasing as a function of δ.

2. limλ→0 limd→∞Var(λ) = ∞ on the curve δ = 1/π (the interpolation threshold wherelim p/d = 1). More specifically, when λ → 0, Vsi, Vsli goes to infinity while othervariance components converge to some finite limits on the curve δ = 1/π.

The first part implies that more samples or a larger degree of parameterization can alwaysreduce the prediction bias, which is consistent with our intuition that larger models can, inprinciple, approximate any function better.

For the variance components, it is natural to expect that some interaction exists, be-cause even the expressions W,X in the prediction function f(X) = (WX)>β interact non-additively. But we do not fully understand why the interaction terms Vsi, Vsli are large.This can be viewed as a surprising discovery of our paper. One somewhat tautologicalperspective is that the interaction terms are the part of variance that are most affectedby “under-regularization”. For instance, the main effect Vi comes from the randomness ofinitializations. Thus, one has to average—or ensemble—over the choices of initializationW to reduce Vi. However, for Vsi and Vsli, both ensembling and optimally tuned ridgeregularization can reduce their values significantly. Thus, these components seem to bemore affected by the “under-regularization” due to using a sub-optimal ridge parameter.However, this is still a somewhat circular explanation, because the entire reason that theydiverge is that they are sensitive to under-regularization.

For any fixed λ > 0, we conjecture based on numerical results that limd→∞Var(λ) isunimodal as both a function of π and a function of δ. This appears to be more challengingto show. Here the unimodality would be mainly due to being close to the interpolationthreshold p/n ≈ 1 for δ ≈ 1/π, which leads to ill-conditioned feature matrices and a largerisk.

2.3.3 Added Noise Interpretation

The random projection step in the initialization can be interpreted as creating additionalnoise. Thus, we can find a ridge model without the projection step (i.e. without the firstlayer) with larger training set label noise σ′2 and the same test point label noise σ2 that hasthe same asymptotic bias, variance, and MSE as the model with random projection step(i.e. with the first layer). To obtain the “effective noise” level σ′2, in equation (14), (15),

21

Page 22: What Causes the Test Error? Going Beyond Bias-Variance via

Lin and Dobriban

let us equate the formula determining limd→∞MSE(λ∗) = α2[1− π + λ∗πθ1] + σ2 for twosets of parameters α2, σ2, π, δ and α2, σ′2, π = 1, δ. This leads to the equation

1− π + λ∗πθ1 = λ′∗θ′1.

After simple calculation, we obtain

σ′2 = σ2 + ∆ := σ2 + α2(1− π)δ(1 + σ2/α2) + 1 +

√(δ(1 + σ2/α2) + 1)2 − 4γ

2γ.

Note that Variance = MSE− Bias2 − σ2, hence the random projection model has the sameasymptotic bias, variance and MSE as the ordinary ridge model with additional trainingset label noise ∆. However, the variance components are specific to the two-layer case, anddo not carry over to the one-layer case.

2.3.4 Understanding the Effect of the Optimal Ridge Penalty

In this section, we provide some intuitions for why unimodality and the double descentshape appears in the MSE of ridge models when using a fixed small penalty λ (close to theridgeless limit), and how the optimal λ∗ helps eliminate the non-monotonicity of the MSE.We illustrate this with numerical results.

To qualitatively understand the effect of the optimal penalty λ∗, we plot the variancecomponents, variance, bias and the MSE under two different scenarios. In the first scenario,we use the optimal penaly λ∗ for all ridge models (see Figure 5). It is readily verified thatVs and Vi contribute to a large portion of the variance, while the contributions of Vsi andVsli are relatively small.

In the second scenario, we choose λ = 0.01 for all ridge models. From Figure 6, we seethat, perhaps surprisingly, it is the interaction term Vsi between sample and initializationthat dominates the variance. In particular, we think that it is surprising that this interactionterm can be larger than the main effects Vs and Vi of sample and initialization. Also, Vsi andVsli lead to the modes of the variance on the curve δ = 1/π (the interpolation threshold).

Comparing Figure 5 and 6, we can see that Vs, Vi and Vsl are almost on the same scalein the two scenarios. However, Vsi and Vsli are much larger when λ = 0.01 than whenλ = λ∗. These two terms are the main reason why the variance is significantly larger whenλ = 0.01. Moreover, Figure 5g and 6g show that the bias is even relatively smaller whenwe use λ = 0.01 instead of the optimal λ∗. Intuitively, the reason is that the optimalregularization parameter is large, to achieve a better bias-variance tradeoff, and thus makesthe bias slightly larger while decreasing the variance a great amount.

Therefore, we may conclude that, under a reasonable assumption on the label noise(e.g., σ = 0.3α here),

1. Using a fixed small penalty λ for all ridge models can lead to unimodality/doubledescent shape in the MSE. The modes of the MSE as a function of δ are close to theinterpolation limit curve δ = 1/π.

2. The unimodality/double descent shape of the MSE given a fixed small penalty λ isdue to the variance. The bias is typically smaller when using a fixed small penalty λ

22

Page 23: What Causes the Test Error? Going Beyond Bias-Variance via

What Causes the Test Error?

(a) Vs (b) Vi (c) Vsl

(d) Vsi (e) Vsli (f) Var

(g) Bias2 (h) MSE

Figure 5: Heatmaps of the performance characteristics for the optimal regularization pa-rameter λ = λ∗. Variance components, variance, bias and the MSE as functions of π and δwhen α = 1, σ = 0.3. (Var = Vs + Vi + Vsl + Vsi + Vsli. MSE = Bias2 + Var + σ2.)

instead of the optimal penalty λ∗. As mentioned, this is because the bias and varianceare balanced out for the optimal λ∗, and thus we can increase the bias a bit, whilesignificantly decreasing the variance.

3. Compared with choosing the same small ridge penalty for all models, through usingthe optimal penalty λ∗, one can reduce the variance significantly, especially along theinterpolation threshold curve. The unimodality/double decent shape of the MSE willvanish as a result; but the variance itself may still be unimodal.

4. Using the optimal penalty for all ridge models reduces the variance mainly by reducingthe interaction component Vsi. This component is large in an absolute sense, and thusa reduction has a significant effect. The component Vsli is also reduced in a relative

23

Page 24: What Causes the Test Error? Going Beyond Bias-Variance via

Lin and Dobriban

(a) Vs (b) Vi (c) Vsl

(d) Vsi (e) Vsli (f) Var

(g) Bias2 (h) MSE

Figure 6: Heatmaps of the performance characteristics for a fixed parameter λ = 0.01.Variance components, variance, bias and the MSE as functions of π and δ when α = 1, σ =0.3. (Var = Vs + Vi + Vsl + Vsi + Vsli. MSE = Bias2 + Var + σ2.)

sense; however, because it is of a smaller magnitude, this reduction has a more limitedeffect.

5. There is a special region where the bias and variance (for the optimal λ∗) change inthe same direction, in the sense that increasing the parametrization π = lim p/d ordecreasing the aspect ratio δ = lim d/n decrease both the bias and the variance. SeeFigures 5f and 5g.

This special region is characterized by the “triangle” 0 < π 6 1, δ > 0, with δ 62(2π − 1)/[1 + 2σ2/α2]. In finite samples, this is approximated by the inequalityd/n 6 2(2p/d − 1)/[1 + 2σ2/α2] between the sample size n, data dimension d andthe number of parameters p. This can be interpreted as saying—for instance—that

24

Page 25: What Causes the Test Error? Going Beyond Bias-Variance via

What Causes the Test Error?

the parameter dimension p should be large enough. Thus, in that setting, with moreparametrization we can get simultaneously better bias and better variance. In a sense,this can indeed be viewed as a blessing of overparametrization.

6. There is a “hotspot” around π = 1/2, where in Vi, and Vsi both take large values (seeFigures 5b and 5d). The variance due to initialization is large for intermediate valuesof the projection dimension p. Roughly speaking, one can consider an analogy withBernoulli random variables, which have large variance for intermediate values of thesuccess probability.

7. For fixed λ, when δ < 1 (d < n), numerical experiments show that the MSE isdecreasing as π increases, which means that more parametrization can always give ussmall MSE when we have enough samples. See Figure 6h. It appears that there maybe no double descent for fixed λ when δ is sufficiently small; however investigatingthis is beyond our current scope.

Recall that, in the noiseless case, Vs can be interpreted as the variance of an ensembleestimator, and Vsi + Vi is the variance that can be reduced through ensembling. Therefore,the unimodality/double descent shape in the MSE is not intrinsic, and can be removedthrough regularization techniques such as ensembling, (consistent with d’Ascoli et al. 2020)or optimal ridge penalization.

2.4 Ridge is Optimal

We have obtained precise asymptotic results for optimally tuned ridge regression. However,is ridge regression optimal, or are there other methods that outperform it? In fact, we canprove that the ridge estimator is asymptotically optimal.

Theorem 8 (Ridge is optimal) Suppose that the samples are drawn from the standardnormal distribution, i.e., x and X both have i.i.d. N (0, 1) entries. Given the projectionW , projected matrix XW> and response Y , we define the optimal regression parameterβopt as the one minimizing the MSE over the posterior distribution p(θ|XW>,W, Y ) of theparameter θ,

βopt : = argminβ Ep(θ|XW>,W,Y )Ex,ε[(Wx)>β − (x>θ + ε)]2, (16)

where x ∼ N (0, Id), ε ∼ N (0, σ2) and x, ε are independent. We will check that this can beexpressed in terms of the posterior of θ as

βopt = W · Ep(θ|XW>,W,Y )θ. (17)

The optimal ridge estimator β = (n−1WX>XW>+λ∗Ip)−1WX>Y/n (Theorem 3) satisfies

the almost sure convergence in the mean squared error

limd→∞

EXW>,W,Y ‖β − βopt‖22 = 0, (18)

and is thus asymptotically optimal. Here d → ∞ means p, d, n → ∞ proportionally as inTheorem 3.

25

Page 26: What Causes the Test Error? Going Beyond Bias-Variance via

Lin and Dobriban

Remark. In Theorem 8, the optimal parameter βopt minimizes the mean squared errorover the posterior of θ given the projection W , projected matrix XW> and response Y .From the proof of Lemmas 12, 13, we know that E‖β‖2 converges to some positive constantas d→∞. Thus, β has a constant scale as d→∞, and the result that ‖β − βopt‖2 → 0 ofTheorem 8 shows that βopt is indeed non-trivially well approximated. This result impliesthe asymptotic optimality of ridge regression.

In addition, if we are given the original data matrix X instead of XW>, then from theoptimality of ridge regression in ordinary linear regression with Gaussian prior and noise,we have βopt = W (n−1X>X + dσ2Ip/[nα

2])−1X>Y/n and ridge regression over projecteddata is not asymptotically optimal. However, in our two-layer model, we only exploit theinformation of X through XW>, thus it is reasonable to consider the situation above, inwhich we are only given XW>.

3. Nonlinear Activation

It is also possible to consider the bias-variance decomposition for a two-layer neural networkwith certain scalar nonlinear activation functions σ(x) after the first layer. Namely, supposethat the data are generated through the same process, but instead of using a two-layer linearnetwork, we use

f(x) = σ(Wx)>β (19)

as the predictor. Here σ : R → R is an activation function applied to Wx entrywise. Asbefore, we assume W ∈ Rp×d has orthonormal rows, so p 6 d, and we only train β. This canbe viewed as a random features model. We apply ridge regression to estimate β, thereforeour prediction function is

f(x) := σ(Wx)>β = σ(x>W>)

(σ(WX>)σ(XW>)

n+ λIp

)−1σ(WX>)Y

n. (20)

For simplicity, we further assume that Eσ(Z) = 0, where Z ∼ N (0, 1) is a standard normalrandom variable. The results for activation functions with arbitrary mean can be obtainedthrough similar techniques, but are much more cumbersome. This assumption does notcapture the ReLU activation function σ+(x) = max(x, 0), but it can handle the functionσ+(x) − Eσ+(Z), which only differs from the ReLU by a constant. In particular, themean of our prediction function f(x) with the current restriction is always zero, i.e., theprediction function does not have an intercept term. In our model, the true regressionfunction f∗(x) = θ>x does not have an intercept term either; thus we think that the zero-mean restriction may not be significant in the current setting.

Moreover, we suppose that there are constants c1, c2 > 0 such that σ, σ′ grow at mostexponentially, i.e., |σ(x)|, |σ′(x)| ≤ c1e

c2|x|. Define the moments

µ := EZσ(Z), v := Eσ2(Z), (21)

where Z ∼ N (0, 1). Also, we suppose that the samples are drawn from the standard normaldistribution N (0, Id), i.e., X and x both have i.i.d. N (0, 1) entries. As before, we can write

26

Page 27: What Causes the Test Error? Going Beyond Bias-Variance via

What Causes the Test Error?

down the MSE, bias and variance:

MSE(λ) := Eθ,x,W,X,E(f(x)− x>θ)2 + σ2

Bias2(λ) := Eθ,x(EW,X,E f(x)− x>θ)2

Var(λ) := Eθ,x,W,X,E(f(x)− EX,W,E f(x))2.

Our main result in this section gives asymptotic formulas for their limits.

Theorem 9 (Bias-Variance Decomposition for two-layer nonlinear NN) Underthe previous assumptions (i.e., in the setting of Theorem 2), with the further assumptionthat the samples are drawn from N (0, Id), i.e., x and X have i.i.d. N (0, 1) entries, we havethe following limits for the bias, variance, and mean squared error. Recall that we havean n × d feature matrix X and a two-layer nonlinear neural network f(x) = σ(Wx)>β,with p intermediate activations, and p × d orthogonal matrix W of first-layer weights withWW> = Ip. Here n, p, d → ∞ and p/d → π ∈ (0, 1] (parametrization level), d/n → δ > 0(data aspect ratio), with α2 the signal strength, σ2 the noise level, λ the regularizationparameter, θi the resolvent moments, and µ, v the Gaussian moments of the activationfunction σ from (21). Then

limd→∞

MSE(λ) = α2π

[1

π− 1 + δ(1− π)θ1 +

λ

v

(λµ2

v2− δ(1− π)

)θ2

+(v − µ2)

vθ1 +

1

v− λγ

v2θ2

)]+ σ2γ

(θ1 −

λ

vθ2

)+ σ2, (22)

limd→∞

Bias2(λ) = α2

[πµ2

v

(1− λ

vθ1

)− 1

]2

, (23)

limd→∞

Var(λ) = α2π

[2µ2

v− 1 +

(−2λµ2

v2+ δ(1− π)

)θ1 +

λ

v

(λµ2

v2− δ(1− π)

)θ2

−πµ4

v2

(1− λ

vθ1

)2

+ (v − µ2)

vθ1 +

1

v− λγ

v2θ2

)]+ σ2γ

(θ1 −

λ

vθ2

),

(24)

where θ1 := θ1(γ, λ/v), θ2 := θ2(γ, λ/v), γ = πδ. Similar to the linear case, the limiting

MSE has a unique minimum at λ∗ := v2

µ2

[δ(1− π + σ2/α2) + (v−µ2)γ

v

].

Remarks. (1). When expanding the function σ(x) in the Hermite polynomial basis, µ isthe coefficient of the second basis function x, and

√v is σ(x)’s norm in the Hilbert space.

Thus v ≥ µ2 and the equality holds iff σ(x) = kx. (2). When σ(x) = kx (i.e. v = µ2), theresults in theorem 9 reduce to those in theorem 3.

Also, we have monotonicity properties similar to in the linear case (Table 2):

Theorem 10 (Monotonicity and unimodality for non-linear net at optimal λ∗) Underthe assumptions from Theorem 9, for the optimal λ = λ∗, the MSE, Bias, and variance havethe monotonicity and unimodality properties summarized in Table 2. The MSE and bias aredecreasing as a function of the parametrization level π, and increasing as a function of thedata aspect ratio δ. The variance is either monotone or unimodal depending on the setting.

27

Page 28: What Causes the Test Error? Going Beyond Bias-Variance via

Lin and Dobriban

Function

Variableparametrization π = lim p/d aspect ratio δ = lim d/n

MSE ↘ ↗Bias2 ↘ ↗

Var

δ < 2µ2

v

(2µ2

v− 1

)/(1 + 2σ2/α2

): ∧, max

atv

µ2

[2 +

δv

µ2

(1 +

2σ2

α2

)]/4.

δ ≥ 2µ2

v

(2µ2

v− 1

)/(1 + 2σ2/α2

): ↗ .

π ≤ v

2µ2: ↘ .

π >v

2µ2: ∧, max at

2µ2(2πµ2/v − 1)

v(1 + 2σ2/α2).

Table 2: Monotonicity properties of various components of the risk for a two-layer networkwith nonlinear activation at the optimal λ∗, as a function of π and δ, while holding all otherparameters fixed. ↗: non-decreasing. ↘: non-increasing. ∧: unimodal. Thus, e.g., theMSE is non-increasing as a function of the parameterization level π, while holding δ fixed.

Thus, comparing Tables 2 and 1, we see that with stronger Gaussian assumptions onthe data distribution, optimal ridge regularization has similar effects in the nonlinear andlinear cases. For instance, λ∗ can eliminate the “peaking” shape of the MSE.

4. Numerical Simulations

In this section, we perform several numerical experiments, to check the correctness of ourtheoretical results.

4.1 Verifying the Theoretical Results for the MSE

To check the correctness of the MSE formula, we estimate the MSE from its definitiondirectly. For simplicity, we subtract the test point label noise σ2 from the MSE formula.We randomly generate k = 400 i.i.d. tuples of random variables (xi, θi, εi, Xi,Wi), 1 ≤ i ≤ k,from their assumed distributions (we assume X and x have i.i.d. N (0, 1) entries in numericalsimulations), and estimate the MSE by calculating:

MSE =1

k

k∑i=1

(fi(xi)− x>i θi)2,

where k = 400 and fi(xi) = σ(x>i W>i )(n−1σ(WiX

>i )σ(XiW

>i ) + λIp

)−1n−1σ(WiX

>i )(Xiθi+

Ei), for σ(x) both linear and nonlinear. We repeat this process 20 times and plot the meanand standard error in Figure 7. The regularization parameter λ is set optimally or fixed.We also plot our theoretical MSE formula from (13). The parameters in the experimentare shown in the captions of the figure. We can see from Figure 7 that our theoreticalprediction of the MSE is quite accurate.

28

Page 29: What Causes the Test Error? Going Beyond Bias-Variance via

What Causes the Test Error?

Figure 7: Numerical verification of the theoretical results for MSE. We display, as a functionof δ = lim d/n, the theoretical formula and the numerical mean and standard deviation over20 repetitions. Parameters: α = 1, σ = 0.3, π = 0.8, n = 150, d = bnδc, p = bdπc. Left:linear, σ(x) = x, λ = λ∗(optimal). Right: nonlinear, σ(x) = σ+(x) − Ex∼N (0,1)σ+(x),σ+(x) = max(x, 0), λ = 0.01.

Figure 8: Left: numerical simulation verifying the accuracy of the bias, variance and MSEformulas. Right: simulations with variance components. For each ANOVA component,symbolized by ?, we show two curves: ?: theory, n?: numerical (averaged over 5 runs).Parameters: α = 1, σ = 0.3, π = 0.8, n = 150, d = bnδc, p = bdπc.

4.2 Bias-variance Decomposition and the Variance Components

We next study the accuracy of the formulas for the bias, variance and the variance compo-nents in the linear case. Estimating them directly requires many samples. For example, toestimate the bias based on the defining formula from Section 2.2, we may need to generate,say, 100 pairs of (x, θ), and for each (x, θ) generate 500 triples of i.i.d. (X,W, E). Thus, wemay need to simulate 50, 000 samples in total to obtain accurate results. This is beyondour current scope. Therefore, for simplicity, we check instead the formulas that we havederived in Appendix B in equations (25)—(27), (32)—(38). We omit the results for Vl andVli since they converge to 0. In all experiments, we choose n = 150. For the bias, variance,

29

Page 30: What Causes the Test Error? Going Beyond Bias-Variance via

Lin and Dobriban

Functional Estimator

E tr(MM>)1

k

k∑i=1

tr(MiM>i )

‖EM‖2F

∥∥∥∥1

k

k∑i=1Mi

∥∥∥∥2

F

EW ‖EXM‖2F1

k

∑kj=1

∥∥∥∥1

k

k∑i=1Mij

∥∥∥∥2

F

EX‖EWM‖2F1

k

∑ki=1

∥∥∥∥∥1

k

k∑j=1Mij

∥∥∥∥∥2

F

Table 3: Empirical estimators of functionals of interest. Here M is a generic matrix thatcan be M or M . For the bias, variance and the MSE, we take k = 100 andMi denotes theappropriate matrix M obtained from the i-th pair of (X,W ). For variance components,k = 20, 50 and Mij denotes the appropriate matrix M obtained from the i-th X and thej-th W . Estimators of the quantities in (25)—(27), (32)—(38) are obtained by combiningthe above.

and MSE, we generate 100 i.i.d. copies of X,W of certain dimensions (we assume X hasi.i.d. N (0, 1) entries in numerical simulations) and estimate the expectations from the proofof Theorem 3 in Appendix B (25—27) using the Monte Carlo mean. As for the variancecomponents, we randomly generate k i.i.d. Xi-s and Wj-s, and use them to form k2 pairs of(Xi,Wj), where k = 20 in Figure 8 (right) and k = 50 in Figure 1 (left). Similarly, we alsoestimate the expectations from the proof of Theorem 2 in (32)—(38) via the Monte Carlomean (see the details in Table 3).

Figure 8 shows the results averaged over five runs. We can see that the numericalresults are quite close to the theoretical predictions. Moreover, the standard deviationsover 5 runs are uniformly less than 0.001 in all settings we considered, which implies thatthe variance due to the randomness of (X,W ) is negligible. The slight discrepancies betweenthe theoretical predictions and experiments are mostly owing to the bias in our estimators(e.g., the second, third and fourth estimators in Table 3 are biased, because they are of theform g(EM), which is estimated by g(k−1

∑ki=1Mi), and g is nonlinear), and they can be

reduced if we have more samples (X,W ).

One may wonder: Why is the standard deviation of different runs of the simulation sosmall (e.g., less than 0.001)? The reason is that the Marchenko-Pastur theorem, on whichour theoretical predictions depend, has fast convergence rates (e.g., Bai, 1993; Gotze et al.,2004). Since the terms we estimated are expectations of various functionals of the eigenvaluespectrum, we can expect the simulation results to be quite precise.

4.3 General Covariance

Although our theoretical results are proved under an assuming the data distribution isisotropic, we also study the model under general covariance assumptions numerically. Namely,we assume the samples are drawn i.i.d. from N (0,Σ(r)), where Σ(r)ij = r|i−j| is an AR-1covariance matrix. We numerically study the bias-variance decomposition and the variance

30

Page 31: What Causes the Test Error? Going Beyond Bias-Variance via

What Causes the Test Error?

Figure 9: Bias-variance decomposition and the variance components under an AR-1 co-variance assumption. Left: numerical simulation of the bias, variance and MSE formu-las. Right: simulations with variance components. Upper: r = 0.5, λ = 0.01. Down:r = 0.9, λ = 0.001. For each term, symbolized by ?, we show n?: numerical (averaged over5 runs). Parameters: α = 1, σ = 0.3, π = 0.8, n = 150, d = bnδc, p = bdπc.

components the same way as in Section 4.2. The only distinction is that the formulas weestimate are slightly different due to the non-identity covariance. More specifically, one canshow that the formulas for the general covariance case are the same as their counterpartsin equations (25)—(27), (32)—(38) with ‖ · ‖2F replaced by tr(· ·> Σ(r)).

In the experiment, we choose fixed small penalties λ = 0.01, 0.001 to mimic the ridge-less limit. Figure 9 shows the numerical results when r = 0.5, 0.9. We see that manyobservations in the isotropic case (e.g. monotonic bias, non-monotonicity of the MSE andvariance, the interaction terms dominating the variance at the interpolation threshold) stillhold in the general covariance case. However, the terms contributing the most to the totalvariance are Vsl and Vsli when r = 0.9, while they are Vsi when r = 0.5 or in the isotropiccase. We conjecture that this is because the covariance matrix can implicitly change theratio between the noise σ and signal α. However, more work is needed in the future forunderstanding the generalization error under a general covariance assumption.

31

Page 32: What Causes the Test Error? Going Beyond Bias-Variance via

Lin and Dobriban

4.4 Experiments on Empirical Data

In this section, we present an empirical data example to study several phenomena observedin our theoretical analysis. We use the Superconductivity Data Set (Hamidieh, 2018) re-trieved from the UC Irvine Machine Learning Repository in our data analysis.

The original data set contains N = 21, 263 superconductors as samples and d = 81features for each sample. Our goal is to predict the critical temperatures of the supercon-ductors based on their features. Before doing regression, we preprocess the data set in thefollowing way. We first randomly shuffle the samples. We then separate the data set intoa training set containing the first 90% of the samples, and a test set containing the rest.Finally, we normalize the features and responses so that they all have zero mean and unitvariance. Since N is quite a bit larger than d, we can estimate the variances of the featuresquite well; and thus we can standardize new test datapoints from this distribution usingthe estimated variances. After these steps, we are ready to start our experiments.

Similar to our theoretical setting, for each experiment setting (p, n), we randomly selectn samples from the training set to form a data matrix X ∈ Rn×d, map it into a randomp-dimensional subspace multiplying it by a projection matrix W with orthogonal rows andthen perform ridge regression on the p-dimensional subspace. For each setting, we generate50 i.i.d. sample matrices Xi, 1 ≤ i ≤ 50, and 50 i.i.d. random projections Wj , 1 ≤ j ≤ 50,and combine them to form 2500 sample-initialization pairs (Xi,Wj), 1 ≤ i, j ≤ 50. Denoteby yi the response vector of Xi. Then for each (Xi,Wj), we have the ridge estimator

fij(x) = x>

(WjX

>i XiW

>j

n+ λIp

)−1WjX

>i yin

, 1 ≤ i, j ≤ 50.

We use these estimators to make predictions on the test set and estimate the MSE, bias,and some of the variance components as MSE := L−1

∑Lk=1 E(fij(xk)− yk)2, and:

Var :=1

L

L∑k=1

E(fij(xk)− Efij(xk))2, Bias2

:=1

L

L∑k=1

(Efij(xk)− yk)2,

Vs :=1

Lns

L∑k=1

ns∑i=1

(Ej fij(xk)− Efij(xk))2, Vi :=1

Lni

L∑k=1

ni∑j=1

(Eifij(xk)− Efij(xk))2,

where E denotes the Monte Carlo mean, ni = ns = 50, L is the test set size, and xk, yk aretest features and responses.

Since we have no information about the true noise of the responses, we only study thevariance introduced by the choice of the data matrix X and initialization W . Figure 10shows the empirically estimated bias, variance and MSE as functions of the number ofsamples n given a fixed amount of parameterization (fixed π = lim p/d). From this figure,we observe the following:

1. The MSE is unimodal as a function of number of samples n, which corroborates thatincreasing the number of training samples can sometimes lower the model’s perfor-mance when we do not have enough samples and do not regularize well (e.g. use asmall λ = 0.01). This unimodality is quite similar to the unimodality we observed in

32

Page 33: What Causes the Test Error? Going Beyond Bias-Variance via

What Causes the Test Error?

Figure 10: Empirically estimated MSE, variance and bias as functions of number of sam-ples n. We display the mean and one standard deviation of the numerical results over 10repetitions. Left: π = 0.2, λ = 0.01. Right: π = 0.9, λ = 0.01. (Both panels are from thesame simulation.)

Figure 11: Numerically estimated variance components as a function of the sample size n.Left: three components of variance (Vs, Vi, and “Rest”: the variance due to interactionand response noise, Rest := Var − Vs − Vi). Middle and right: two orders of variancedecomposition. Middle: sample, initialization. Right: initialization, sample. Σa

ab :=Var − Vb,Σ

bab := Vb, {a, b} = {s, i}. Parameters: π = 0.2, λ = 0.01. We display the

mean and one standard deviation over 10 repetitions. (All three panels are from the samesimulation.)

Figure 4 in our theoretical setting. It is also consistent with the general phenomenonof sample-wise double descent (Nakkiran, 2019).

2. The bias is decreasing as a function of n, when n is small, and stays roughly constantwhen n is larger. The reason is that the data does not truly come from a linear model,and thus the linear model that we use has a nonzero approximation bias.

3. The bias is also decreasing as a function of 1/δ = n/d. This suggests that moresamples can reduce the bias of the ridge estimator. Furthermore, the variance isthe main contributor to the unimodality of the MSE. These two observations are alsoconsistent with our theoretical results from Theorem 7, which suggests that the bias is

33

Page 34: What Causes the Test Error? Going Beyond Bias-Variance via

Lin and Dobriban

Figure 12: Empirically estimated MSE, variance and bias as functions of degree of parame-terization πd = p/d. We show the the mean and one standard deviation over 10 repetitions.

Figure 13: Empirically estimated MSE, variance and bias as functions of number of samplesn using the optimal λ∗. We display the mean and one standard deviation of the numericalresults over 10 repetitions. Left: π = 0.2. Right: π = 0.9. (Both panels are from the samesimulation.)

increasing as a function of δ and the variance can be very large along the interpolationthreshold δπ = 1 when λ is small.

Figure 11 (left) shows estimates of three components of variance in our data example.In this low parameterization setting (π = 0.2), when n is small (< 100), the variance Vsdue purely to sampling is large, the variance Vi due purely to initialization is small, and thevariance Vis due to their interaction and also the response noise is large. Thus, combiningthese variances together, we can see from Figure 11 (middle and right) that different ordersof decomposition can indeed lead to different interpretations of the variances introduced bysampling and initialization.

Figure 12 exhibits empirical estimates of the bias, variance, and MSE as functions ofthe degree of parameterization π = lim p/d when given enough samples, here n = 1000, so

34

Page 35: What Causes the Test Error? Going Beyond Bias-Variance via

What Causes the Test Error?

that δd = d/n is small. We see that all three terms decrease as π increases, which meansmore parameters can improve the estimator’s performance when we have enough samples(δ is small). This is also close to what we have observed in Figure 6h, i.e., that the MSE isdecreasing as π increases when δ < 1.

We also study the effect of optimal ridge penalty on the empirical data. For simplicity,we select the ridge parameter from the set {i × 10−j |i = 1, 2, 5; j = 0, 1, 2, 3} such that itminimizes the empirical MSE. Figure 13 shows the MSE, variance and bias obtained usingthe optimal ridge penalty. Compared with Figure 10, we see that the optimal ridge penaltycan mitigate the non-monotonicity of MSE and keep the bias decreasing as the numberof sample n increases. These observations are consistent with what we have shown in ourtheoretical setting.

To conclude, although our theoretical results are based on quite strong assumptions onthe data distribution, many conclusions and insights still carry over to certain problemsinvolving empirical data.

Acknowledgments

We thank the associate editor for handling our paper. We are very grateful for the reviewersfor detailed and thorough feedback, which has lead to numerous important improvements.We thank Yi Ma, Song Mei, Zitong Yang, Chong You, Yaodong Yu for helpful discussions.This work was partially supported by a Peking University Summer Research award, andby the NSF-Simons Collaboration on the Mathematical and Scientific Foundations of DeepLearning THEORINET (NSF 2031985). This work was performed when LL was a studentat Peking University.

35

Page 36: What Causes the Test Error? Going Beyond Bias-Variance via

Lin and Dobriban

Appendix A. Comparison of Orthogonal and Gaussian InitializationModels

Here we provide a comparison of the orthogonal and Gaussian initialization models forlinear networks f(x) = (Wx)>β. The expressive power of the two models is the same,as with probability one we can write a p × d matrix W with iid Gaussian entries, wherep 6 d, via its SVD as W = UDV , where U is p× p orthonormal, D is p× p diagonal withnonzero entries with probability one, and V is p × d partial orthonormal with V V > = Ip.Then (Wx)>β = x>W>β = x>V >DU>β = x>W>o βo, where Wo = UV is a random partialorthonormal matrix, and βo = UDU>β is a new regression coefficient. Thus, the two modelshave the same expressive power.

The orthogonal model we consider has some advantages over the Gaussian model. In-deed, considering the case when p = d, in the orthogonal model, we first rotate x orthog-onally, then take a linear combination of the coefficients. In contrast, in the Gaussianmodel we not only rotate x, but also scale it by the singular values of x, which due to theMarchenko-Pastur law (Marchenko and Pastur, 1967) spread out from zero to two. Thus,we induce a significant distortion of the input in the first layer. Then, we can expect thatlearning may be more challenging due to this additional scaling. Indeed, the regressioncoefficients corresponding to the directions with near-zero singular values must be scaledup asymptotically by unboundedly large values for accurate prediction. On the other hand,the Gaussian model more closely mimics practical initialization schemes, which can indeedinvolve iid weights. We also note that recently, some empirical work has argued aboutthe benefits of orthogonal initialization (Hu et al., 2020; Qi et al., 2020). For instance, Qiet al. (2020) argues that orthogonality (or isometry) alone enables training practical >100layer CNNs on ImageNet without shortcut and BatchNorm, and therefore provides somejustification for orthogonality in our theoretical analysis.

Appendix B. Proofs

B.1 Proof of Theorem 3

Different from the order of the theorems, here we first give the proof of theorem 3 and thenthe proof of theorem 2. This is because the proof of theorem 2 is more complicated anddepends on some lemmas in the proof of theorem 3.

In the proofs, we will often refer to the spectral distribution (or measure) a symmetricmatrix M , which is simply the discrete distribution placing uniform point masses on eachof the (real) eigenvalues of M . When the matrix size grows, we will consider settings wherethe spectral distribution converges in distribution to a fixed probability distribution.

Let us define

MX,W (λ) := W>(n−1WX>XW> + λIp)−1WX>/n

(a d× n matrix), and

MX,W (λ) := MX,W (λ)X

(a d×d matrix) and omit their dependence on λ,X,W for simplicity. As we will clearly seebelow, M can be viewed as a “regularized pseudo-inverse”. Also, EM − I directly controls

36

Page 37: What Causes the Test Error? Going Beyond Bias-Variance via

What Causes the Test Error?

the bias, and M − EM controls part of the variance. The calculations of bias and variancein terms of M follow those of Yang et al. (2020), and we include them here for the reader’sconvenience. The calculations following them are more novel.

To start, we have fλ,T ,W (x) = x>MY , and similar to the proof of theorem 1 in Yanget al. (2020)

Bias2(λ) = Eθ,x[EX,W,E

(x>Mθ + x>ME

)− x>θ

]2

= Eθ,x[x>(EX,W,EM − I)θ + ET ,W (x>ME)

]2.

Since E has zero mean and is independent of x and M , we know EX,W,E(x>ME) = 0. Inwhat follows, sometimes we omit the subscript when we take expectation over all randomvariables. Thus, using that for any two vectors and a matrix of conformable sizes, (a>Nb)2 =a>Nbb>N>a = trNbb>N>aa>, the above equals

Eθ,x[x>(EM − I)θθ>(EM − I)>x

]= Eθ,x tr

[x>(EM − I)θθ>(EM − I)>x

]= tr

[(EM − I)E

(θθ>

)(EM − I)>E

(xx>

)]=α2

d‖EM − I‖2F . (25)

This shows that the average bias is determined by how well the random matrix M (whichdepends both on the random data X and the random initialization W ) approximates theidentity matrix.

Similarly, by grouping terms appropriately, and using again that EX,W,E(x>ME) = 0,

Var(λ) = Eθ,x,X,W,E[x>Mθ + x>ME − EX,W,E(x>Mθ + x>ME)

]2

= Eθ,x,X,W,E[x>(M − EM)θ + x>ME

]2

= Eθ,x,X,W,E[x>(M − EM)θ

]2+ (x>ME)2,

where the interaction term is zero because of the independence between E and other vari-ables. Then, using that trA>A = ‖A‖2F ,

Var(λ) = Eθ,x,X,W,E{[x>(M − EM)θθ>(M − EM)>x

]+ x>MEE>M>x

}= Eθ,x,X,W

{tr[(M − EM)θθ>(M − EM)>xx>

]+ σ2 tr[x>MM>x]

}= EM tr

[(M − EM)E(θθ>)(M − EM)>E(xx>)

]+ σ2E tr

[MM>E(xx>)

]=α2

dE‖M − EM‖2F + σ2E‖M‖2F . (26)

Thus, the variance is determined by how much M varies around its mean, and by howlarge M is. Combining results for variance and bias, and using that E‖M − I‖2F = E‖M −

37

Page 38: What Causes the Test Error? Going Beyond Bias-Variance via

Lin and Dobriban

EM‖2F + E‖EM − I‖2F , we obtain

MSE(λ) = Var(λ) + Bias2(λ) + σ2 =α2

dE‖M − I‖2F + σ2E‖M‖2F + σ2. (27)

Similarly, for Σlabel,Σsample,Σinit, we have

Σlabel = Eθ,xEW,X,E [fλ,T ,W (x)− EEfλ,T ,W (x)]2

= Eθ,x,W,X,E(x>ME)2 = σ2E‖M‖2F .

Thus, the variance due to label noise is determined by the magnitude of M . Also,

Σsample = Eθ,xEW,X [EEfλ,T ,W (x)− EX,Efλ,T ,W (x)]2

= Eθ,x,W,X [x>Mθ − EX(x>Mθ)]2

=α2

dEW,X ‖M − EXM‖2F .

Finally,

Σinit = Eθ,xEW (EX,Efλ,T ,W (x)− EW,X,Efλ,T ,W (x))2

= Eθ,x,W [EX(x>Mθ)− EW,X(x>Mθ)]2

=α2

dEW ‖EXM − EW,XM‖2F .

This shows that in the specific decomposition order: label, samples, initialization, thevariance due to the randomness in the sample X is determined by the Frobenius variabilityof M around its mean with respect to X. The respective statement is also true for thevariance due to initialization.

Therefore, to prove theorem 3, it suffices to study the limiting behaviours of EM , EXM ,M and M . Under the assumptions in Theorem 3, we have characterize their behavior inthe following lemmas.

Lemma 11 (Behavior of EM)

limd→∞

1

dE tr(M) = π(1− λθ1), ∀i ≥ 1. lim

d→∞

1

d‖EM‖2F = π2(1− λθ1)2. (28)

Lemma 12 (Behavior of the Frobenius norm of M)

limd→∞

1

dE‖M‖2F = π

[1− 2λθ1 + λ2θ2 + (1− π)δ(θ1 − λθ2)

]. (29)

Lemma 13 (Behavior of the Frobenius norm of M)

limd→∞

E‖M‖2F = πδ(θ1 − λθ2). (30)

Lemma 14 (Behavior of the Frobenius norm of EXM)

limd→∞

1

dEW ‖EXM‖2F = π(1− λθ1)2. (31)

38

Page 39: What Causes the Test Error? Going Beyond Bias-Variance via

What Causes the Test Error?

We put the proof of these lemmas after the proof of theorem 3 and 2 for clarity (see AppendixB.3). By using Lemmas 11—14, we are able to complete our proof.Proof [Proof of Theorem 3] Plugging equation (28) into (25), we have

Bias2(λ) =α2

d‖EM − I‖2F → α2(π(1− λθ1)− 1)2 = α2(1− π + λπθ1)2.

Plugging equations (28), (29), (30) into (26), we have

Var(λ) =α2

dE‖M − EM‖2F + σ2E‖M‖2F

=α2

d

[E‖M‖2F − ‖EM‖2F

]+ σ2E‖M‖2F

→ α2π

[1− 2λθ1 + λ2θ2 + (1− π)δ(θ1 − λθ2)− π(1− λθ1)2

]+ σ2πδ(θ1 − λθ2)

= α2π

[1− π + (π − 1)(2λ− δ)θ1 − πλ2θ2

1 + λ(λ− δ + πδ)θ2

]+σ2πδ(θ1 − λθ2).

Similarly, by Lemmas 12, 13, 14

Σlabel = σ2E‖M‖2F → σ2πδ(θ1 − λθ2).

Σsample =α2

dEW,X ‖M − EXM‖2F =

α2

d[E‖M‖2F − EW ‖EXM‖2F ]

→ α2π[−λ2θ2

1 + λ2θ2 + (1− π)δ(θ1 − λθ2)].

Finally,

Σinit = Var(λ)− Σsample − Σlabel → α2π(1− π)(1− λθ1)2.

MSE(λ) = Var(λ) + Bias2(λ) + σ2

→ α2{

1− π + πδ(1− π + σ2/α2

)θ1 +

[λ− δ

(1− π + σ2/α2

)]λπθ2

}+ σ2.

As for the choice of optimal λ∗, denote δ(1− π + σ2/α2) by c and calculate

d

dλlimd→∞

MSE(λ) =d

dλα2 [1− π + πcθ1 + (λ− c)λπθ2] + σ2

= α2 d

(∫πc

x+ λdFγ(x) +

∫(λ− c)λπ(x+ λ)2

dFγ(x)

)= α2π

d

(∫λ2 + cx

(x+ λ)2dFγ(x)

)= 2α2π

∫(λ− c)x(x+ λ)3

dFγ(x).

If π = 1 and σ = 0, then c = 0 and the asymptotic MSE is monotonically increasing sinceFγ(x) is supported on [0,+∞). Therefore the optimal ridge λ∗ = 0, which is outside ofthe range (0,∞) that we considered here. Otherwise , the derivative is less than zero whenλ < c and larger than zero when λ > c. Therefore, the asymptotic MSE as a function of λhas a unique minimum at c = δ(1− π + σ2/α2).

39

Page 40: What Causes the Test Error? Going Beyond Bias-Variance via

Lin and Dobriban

B.2 Proof of Theorem 2

In this proof, we will use the same notations as in the proof of theorem 3. Since the mainidea of this proof is quite similar to the proof of theorem 3, we will omit some details in thederivation for simplicity.Proof [Proof of Theorem 2] By definition, for Vs,

Vs = Eθ,xVarX(EE,W (f(x)|X)) = Eθ,x,X [x>(EWM − EM)θ]2

=α2

dEX‖EWM − EM‖2F . (32)

For Vl,

Vl = Eθ,xVarE(EX,W (f(x)|E)) = σ2‖EM‖2F . (33)

For Vi,

Vi = Eθ,xVarW (EE,X(f(x)|W )) = Eθ,x,W [x>(EXM − EM)θ]2

=α2

dEW ‖EXM − EM‖2F . (34)

Similarly,

Vsl = Eθ,xVarE,X(EW (f(x)|E , X))− Vs − Vl= Eθ,x,X,E [x>(EWM − EM)θ + x>EW ME ]2 − Vs − Vl= σ2EX‖EW M − EM‖2F . (35)

Vli = Eθ,xVarE,W (EX(f(x)|E ,W ))− Vi − Vl= Eθ,x,E,W [x>(EXM − EM)θ + x>EXME ]2 − Vi − Vl= σ2EW ‖EXM − EM‖2F (36)

Vsi = Eθ,xVarX,W (Eτ (f(x)|X,W ))− Vs − Vi= Eθ,x,X,W [x>(M − EM)θ]2 − Vs − Vi

=α2

d

(E‖M‖2F − EX‖EWM‖2F − EW ‖EXM‖2F + ‖EM‖2F

). (37)

And

Vsli = Var(f(x))− (Vs + Vl + Vi + Vsl + Vsi + Vli)

= σ2(E‖M‖2F − EW ‖EXM‖2F − EX‖EW M‖2F + ‖EM‖2). (38)

After obtaining the expressions of the variance components, Theorem 2 follows directly byplugging Lemmas 11—17 into equation (32)—(38).

40

Page 41: What Causes the Test Error? Going Beyond Bias-Variance via

What Causes the Test Error?

Lemma 15 (Behaviour of the Frobenius norm of M)

limd→∞

‖EM‖2F = limd→∞

EW ‖EXM‖2F = 0.

When X is symmetric, by switching the sign of X, clearly EXM = 0. This lemma showsthat the same result still holds asymptotically when X is not symmetric, and simply haszero-mean entries with finite sixth moment.

Lemma 16 (Behavior of the Frobenius norm of EW M)

limd→∞

EX‖EW M‖2F = δ(θ1 − λθ2).

Lemma 17 (Behavior of the Frobenius norm of EWM)

limd→∞

1

dEX‖EWM‖2F = (1− 2λθ1 + λ2θ2).

See Appendix B.4 for the proof of the above lemmas.

B.3 Proof of Lemmas 11—14

We first prove these four lemmas under the assumption that X has i.i.d. standard Gaussianentries in Appendix B.3.1—B.3.4 to obtain some heuristics for the formulas. Then inAppendix B.3.5 we generalize the proof to the non-Gaussian case which only requires X tohave i.i.d. zero mean unit variance and finite 8 + η moment entries for any η > 0.

Next, we denote R =(WX>XW>/n+ λIp

)−1for simplicity. In the Gaussian case,

the proof proceeds by moving to the SVD decomposition, and carefully exploiting serveralproperties of the normal distribution and the Marchenko-Pastur law. In the general case,we use deterministic equivalents properties for covariance matrices to show that all termswe are concerned with converge to the same limits as in the Gaussian case.

B.3.1 Proof of Lemma 11 (Under Gaussian Assumption)

Proof By definition,

EM = EMX,W (λ) = EW>RWX>X

n.

Let V = [W>,W>⊥ ] be an orthonormal matrix containing an arbitrary orthogonal comple-ment of W>. It will be convenient to write this as W = DV >, where the p × d matrixD =

(Ip×p, 0p×(d−p)

)selects the appropriate rows of V >. Denote X := XV , a matrix of the

same size n× d as the original matrix X. Then,

EM = EV,XV D>(DV >X>XVD>

n+ λIp

)−1DV >X>XV V >

n

= EV,X

V D>

(DX>XD>

n+ λIp

)−1DX>XV >

n.

41

Page 42: What Causes the Test Error? Going Beyond Bias-Variance via

Lin and Dobriban

By assumption, W is uniformly sampled from the Stiefel manifold, i.e., the manifold ofpartial orthogonal p × d (p 6 d) matrices with orthonormal rows. Thus, we can assumethat V is also uniformly distributed over orthogonal matrices (i.e. the Haar measure).Furthermore, since V is orthogonal, we know that X has independent standard Gaussianentries and is independent of V because XV has the same distribution for any orthogonalmatrix V . Noting that EV V AV > = trA · Id/d when V ∈ Rd×d follows the Haar measure,we get

EM = EXEV V D>

(DX>XD>

n+ λIp

)−1DX>XV >

n

=1

d· E

Xtr

D>(DX>XD>n

+ λIp

)−1DX>X

n

· Id=

1

d· E

Xtr

(DX>XD>n

+ λIp

)−1DX>XD>

n

· Id.Further defining X := XD> = XW> which is now of size n× p (while the original size

was n × d), then X also has independent standard Gaussian entries. Letting X = UΓV >

be the SVD decomposition of X, we have

EM =1

d· EX tr

(X>Xn

+ λIp

)−1X>X

n

· Id =p

d

(1− λ

pEΓ tr

(Γ>Γ

n+ λIp

)−1)· Id.

This is determined by the spectral measure of n−1X>X. Since X has independentstandard normal entries and limd→∞ p/n = πδ, applying the Marchenko-Pastur theorem(Marchenko and Pastur, 1967; Silverstein, 1995; Bai and Silverstein, 2010), we get

1

dE tr(M)→ π

(1− λ

∫1

(x+ λ)dFπδ(x)

)= π(1− λθ1(πδ, λ)),

1

d‖EM‖2F → π2(1− λθ1)2.

This finishes the proof.

B.3.2 Proof of Lemma 12 (Under Gaussian Assumption)

Proof By definition of M ,

E‖M‖2F = E∥∥∥∥W>RWX>X

n

∥∥∥∥2

F

= E tr

(W>R

WX>X

n

X>XW>

nR>W

)= E tr

(RWX>XX>XW>

n2R>).

42

Page 43: What Causes the Test Error? Going Beyond Bias-Variance via

What Causes the Test Error?

Let W⊥ = f(W ) ∈ R(d−p)×d be an orthogonal complement of W and define X1 := XW>,X2 := XW>⊥ . Since X has Gaussian entries, X1 and X2 both have Gaussian entries. Since

EX>1 X2 = EWX>XW>⊥ = n · EWW>⊥ = 0,

it follows that X1 and X2 are independent. Noting that XX> = X1X>1 +X2X

>2 , we have

E‖M‖2F = E tr

(RWX>X1X

>1 XW

>

n2R>)

+ E tr

(RWX>X2X

>2 XW

>

n2R>).

=: C1 + C2.

For the first term, we have

C1 = E tr

(RWX>X1X

>1 XW

>

n2R>)

= E tr

(RX>1 X1X

>1 X1

n2R>).

Let X1 = UΓ1V> be the singular value decomposition of X1. By plugging in the definition

of R, we obtain

C1 = E tr

[(Γ>1 Γ1

n+ λIp

)−2(Γ>1 Γ1

n

)2].

Thus, according to the Marchenko-Pastur theorem,

α2

dC1 → α2π

∫x2

(x+ λ)2dFπδ(x) = α2π(1− 2λθ1 + λ2θ2).

For the second term, since X1 and X2 are indepedent and noting that

EX2X>2 X2 = EX,W (XW>⊥W⊥X

>) = EW (tr(Id −W>W ))Id = (d− p)Id,

we have

C2 = E tr

(RWX>X2X

>2 XW

>

n2R>)

= EX1EX2 tr

(RX>1 X2X

>2 X1

n2R>)

= EX1 tr

(RX>1 EX2(X2X

>2 )X1

n2R>)

=d− pn

EX1 tr

(RX>1 X1

nR>).

Since

EX1 tr

(RX>1 X1

nR>)

= E tr

[(Γ>1 Γ1

n+ λIp

)−2Γ>1 Γ1

n

],

by the Marchenko-Pastur theorem, α2

d C2 → α2(1−π)πδ∫

x(x+λ)2

dFπδ(x) = α2(1−π)πδ(θ1−λθ2). Finally, combining the results for C1 and C2 gives

α2

dE‖M‖2F =

α2

d(C1 + C2)→ α2π

[1− 2λθ1 + λ2θ2 + (1− π)δ(θ1 − λθ2)

],

and this finishes the proof.

43

Page 44: What Causes the Test Error? Going Beyond Bias-Variance via

Lin and Dobriban

B.3.3 Proof of Lemma 13 (Under Gaussian Assumption)

Proof By definition, we have

E‖M‖2F = E tr

(W>R

WX>

n

)(W>R

WX>

n

)>= E tr

(W>R

WX>

n

XW>

nR>W

)= E tr

(RWX>XW>

n2R>).

Denote XW> by X1 and write X1 = UΓV > for the SVD of X1. Then,

E‖M‖2F =1

nE tr

(RX>1 X1

nR>)

=1

nE tr

(R− λR2

)=

1

nE tr

[(Γ>Γ

n+ λ)−1 − λ(

Γ>Γ

n+ λ)−2

]→ πδ(θ1 − λθ2),

where the last line follows from the Marchenko-Pastur theorem and that p/n→ πδ.

B.3.4 Proof of Lemma 14 (Under Gaussian Assumption)

Proof Similarly, let W⊥ = f(W ) ∈ R(d−p)×d be an orthogonal complement of W . DenotingX1 = XW>, X2 = XW>⊥ and combining the fact that X1 and X2 are independent, withEX2 = 0, we have

EXM = EXW>RWX>X

n

= W>EX(X>1 X1

n+ λIp

)−1X>1 (X1W +X2W⊥)

n

= W>EX1

[(X>1 X1

n+ λIp

)−1X>1 X1

n

]W (39)

Write the SVD of X1 as X1 = UΓV > and note that V ∈ Rp×p is uniformly distributed overthe set of orthogonal matrices. Then the above equals

W>

[Ip − λEΓ,V V

(Γ>Γ

n+ λIp

)−1

V >

]W = W>W

[1− λ

pEΓ tr

(Γ>Γ

n+ λIp

)−1].

Thus,

limd→∞

α2

dEW ‖EXM‖2F = lim

d→∞

α2

d

[1− λ

pEΓ tr

(Γ>Γ

n+ λIp

)−1]2

EW tr(W>W )

= limd→∞

α2p

d

[1− λ

pEΓ tr

(Γ>Γ

n+ λIp

)−1]2

→ α2π(1− λθ1)2,

where the last line follows directly from the Marchenko-Pastur theorem.

44

Page 45: What Causes the Test Error? Going Beyond Bias-Variance via

What Causes the Test Error?

B.3.5 Proof of Lemma 11—14 (General Case)

In Appendix B.3.1—B.3.4, we have proved Lemma 11—14 under the assumption that theentries of X are i.i.d. standard Gaussian. In this part, we will generalize previous proofs tothe non-Gaussian case, i.e., X has i.i.d. zero mean, unit variance entries with finite 8 + ηmoment, and hence complete the proof of Lemma 11—14.

For simplicity, we only present the proof of Lemmas 12, 13 in non-Gaussian case. Lem-mas 11, 14 can be proved using very similar arguments as Lemma 12.

We recall the calculus of deterministic equivalents from random matrix theory, whichwill be used in our proof (Dobriban and Sheng, 2018, 2020). One of the best ways tounderstand the Marchenko-Pastur law is that resolvents are asymptotically deterministic.Let Σ = n−1X>X, where X = ZΣ1/2 and Z is an n × p random matrix with iid entriesof zero mean and unit variance, and Σ1/2 is any sequence of p × p positive semi-definitematrices. We take n, p, q →∞ proportionally.

We say that the (deterministic or random) not necessarily symmetric matrix sequencesAn, Bn of growing dimensions are equivalent, and write

An � Bn

if

limn→∞

|tr [Cn(An −Bn)]| = 0 (40)

almost surely, for any sequence Cn of not necessarily symmetric matrices with boundedtrace norm, i.e., such that

lim sup ‖Cn‖tr <∞.

We call such a sequence Cn a standard sequence. Recall here that the trace norm (ornuclear norm) is defined by ‖M‖tr = tr((M>M)1/2) =

∑i σi, where σi are the singular

values of M .Moreover, if (40) only holds almost surely for any sequence Cn ∈ Rdn×dn of positive

semidefinite matrices with O(1/dn) spectral norm, An and Bn are said to be weak de-terministic equivalents and denoted by An

w� Bn. It is readily verified that deterministicequivalence implies weak deterministic equivalence.

By the general Marchenko-Pastur (MP) theorem of Rubio and Mestre (Rubio andMestre, 2011), we have that for any λ > 0

(Σ + λI)−1 � (qpΣ + λI)−1,

where qp is the unique positive solution of the fixed point equation

1− qp =qpn

tr[Σ(qpΣ + λI)−1

].

When n, p → ∞ and the sepctral distribution of Σ converges to H, qp → q and q satisfiesthe equation

1− q = γ

[1− λ

∫ ∞0

dH(t)

qt+ λ

].

We now proceed with the proof.

45

Page 46: What Causes the Test Error? Going Beyond Bias-Variance via

Lin and Dobriban

Proof [Proof of Lemma 12 (general case)]By definition, recalling that R = (WX>XW>

n +λIp)

−1,

EM = EW>RWX>X

n= EW>WX>R

X

n. (41)

Therefore, letting R = (XW>WX>

n + λIn)−1 be the resolvent obtained in the other order,

E tr(MM>) = E tr

[W>WX>R

XX>

n2RXW>W

].

Define the regularized resolvent Rτ =(X(W>W+τ)X>

n + λIn

)−1and

Mτ := W>WX>RτX

n.

Since we have already proved Lemma 12 under the Gaussian assumption, to generalize theresults into the non-Gaussian case, we only need to prove the following two steps:

(1). limd→∞

E tr(MM>)/d = limτ→0

limd→∞

E trMτM>τ /d (assuming the limits exist) (42)

(2). limτ→0

limd→∞

E trMτM>τ /d exists and is a constant independent of the distribution of X.

(43)

(1). Note that

∆ := limτ→0

limd→∞

1

d|E tr(MM>)− E trMτM

>τ |

≤ limτ→0

limd→∞

1

d|E tr(M(M> −M>τ ))|+ 1

d

∣∣∣E tr[(M −Mτ )M>τ ]∣∣∣

≤ limτ→0

limd→∞

1

d|E tr(M(M> −M>τ ))|+ 1

d

∣∣∣E tr[Mτ (M> −M>τ )]∣∣∣

=: ∆1 + ∆2.

Therefore it suffices to prove ∆1,2 → 0. For ∆1, we have the following argument. DenoteX>RX/n by A0, X>RτX/n by Aτ and W>W by P . Note that M = PA0, and

∆1 = limτ→0

limd→∞

d|tr (MA0AτP )| (44)

≤ limτ→0

limd→∞

2d[‖M‖2F + ‖A0AτP‖2F ]

≤ limτ→0

limd→∞

2d[tr(A2

0) + tr(A0A2τA0)]

≤ limτ→0

limd→∞

τE1

2d

[1

λ2tr

(X>X

n

)2

+1

λ4tr

(X>X

n

)4]

≤ limτ→0

O(τ) = 0,

46

Page 47: What Causes the Test Error? Going Beyond Bias-Variance via

What Causes the Test Error?

where the second line follows from the properties of the Frobenius norm. In the third andfourth line we use A0, Aτ � X>X/nλ and the fact that

tr(M1M2M1) � tr(M1M3M1) ∀M1,M2,M3 positive semi-definite and M2 �M3.

Finally, the last line is due to the finite 8 + η moment assumption and some direct caclu-lations. Using the same techinique, it is not hard to show that ∆2 also converges to zero,and hence we conclude the proof of (1).

(2). We first give an alternative expression for Mτ . Denote (W>W + τ) by S and letV = U> = X/

√n, C = In/λ and A = (W>W + τ)−1 in the Woodbury identity:

(A+ UCV )−1 = A−1 −A−1U(C−1 + V A−1U)−1V A−1

⇐⇒ U(C−1 + V A−1U)−1V = A−A1/2(I +A−1/2UCV A−1/2)−1A1/2. (45)

We have by left multiplying W>W in (45) that, with RS := (λId + S1/2X>XS1/2

n )−1

Mτ = W>WS−1/2 [Id − λRS ]S−1/2. (46)

Fix τ and suppose that the spectral distribution of R converges in distribution to Hτ asd→∞. Since W>W has p eigenvalues 1 and d− p eigenvalues 0 and S = W>W + τ , it isclear that Hτ = πδ1+τ + (1−π)δτ . Also, we have by theorem 1 in Rubio and Mestre (2011)and some simple calculations, that

RS � (xdS + λId)−1, (47)

where xd is the unique positive solution of the fixed point equation (where we omit xd’sdependence on τ for notational simplicity)

1− xd =xdn

tr[S(xdS + λId)−1].

When n, p, d → ∞ proportionally, xd → x and x satisfies the equation (again omitting x’sdependence on τ for notational simplicity)

1− x = δ

(1− λ

∫ ∞0

dHτ (t)

xt+ λ

).

Now, we start to calculate E tr(MτM>τ ). By definition,

E tr(MτM>τ ) = E tr

[W>WS−1 (Id − λRS)S−1 (Id − λRS)

]=: ∆1 + ∆2 + ∆3,

where

∆1 := E tr[W>WS−2

]∆2 := −2λE tr

[W>WS−2RS

]∆3 := λ2E tr

[W>WS−1RSS

−1RS

].

47

Page 48: What Causes the Test Error? Going Beyond Bias-Variance via

Lin and Dobriban

Since W>W has p eigenvalues equal to 1 and d− p eigenvalues equal to 0,

limτ→0

limd→∞

∆1/d = limτ→0

limd→∞

p

d(1 + τ)2= π. (48)

Since ‖W>WS−2‖2 ≤ 1, using the deterministic equivalent property (47) and the boundedconvergence theorem, we get

limτ→0

limd→∞

∆2/d = −2λ limτ→0

limd→∞

E tr[W>WS−2RS

]= −2λ lim

τ→0limd→∞

E tr[W>WS−2 (xdS + λId)

−1]

= −2λ limτ→0

limd→∞

p

d

1

[(1 + τ)2[xd(1 + τ) + λ]

= −2λ limτ→0

π

[(1 + τ)2[xτ (1 + τ) + λ]= −2λ lim

τ→0

π

xτ + λ, (49)

where the third line follows from the fact that W and S = (W>W + τ) are simultaneouslydiagonizable and W>W has p eigenvalues 1 and d− p eigenvalues 0. It can be verified thatthe limit in (49) exists and is a constant independent of the distribution of X. Now, itremains to prove that limτ→0 limd→∞∆3/d converges to a constant limit. Note that

limτ→0

limd→∞

∆3/d = limτ→0

limd→∞

λ2

dE tr

[W>WS−1RSS

−1RS

]= lim

τ→0limd→∞

λ2

dE tr

[W>WS−1/2RSS

−1RSS−1/2

]= lim

τ→0limd→∞

λ2

dE tr

[W>W

(λτ + λW>W +

SX>XS

n

)−1(λτ + λW>W +

SX>XS

n

)−1].

(50)

For any z ∈ E := C\R+, we have by Lemma 18 that(λW>W +

SX>XS

n− zId

)−2

� −(λW>W + xdS2 − zId)−2(x′d(z)S

2 − Id), (51)

where for any z ∈ E, xd(z) is the unique solution of certain fixed point equation independentof X, and x′d(z) := dxd(z)/dz. Furthermore, there exists x(z) such that xd(z) → x(z),x′d(z) → x′(z). Now, letting z = −λτ and replacing (λW>W + SX>XS/n − zId)−2 byits deterministic equivalent (51), we have from (50) and the bounded convergence theoremthat

limτ→0

limd→∞

∆3/d = limτ→0

limd→∞

λ2

dE tr

[W>W

(λτId + λW>W +

SX>XS

n

)−2]

= limτ→0

limd→∞

−λ2

dE tr

[W>W (λW>W + xdS

2 + λτId)−2(x′dS

2 − Id)]

= limτ→0

limd→∞

−πλ2 x′d(1 + τ)2 − 1

(λ+ xd(1 + τ)2 + λτ)2

= limτ→0

πλ2 1− x′(1 + τ)2

(λ+ x(1 + τ)2 + λτ)2= lim

τ→0πλ2 1− x′

(λ+ x)2, (52)

48

Page 49: What Causes the Test Error? Going Beyond Bias-Variance via

What Causes the Test Error?

where x := x(−λτ), x′ := x′(−λτ). Since in the Gaussian case the limit of the L.H.S. of(42) exists, by the proof of (42), we know that the limit in (52) also exists, and does notdepend of X. This concludes the proof of (2).

Lemma 18 (Second order deterministic equivalent) Suppose that X ∈ Rn×d has i.i.d.zero mean, unit variance entries with finite 8 + η moment. Then for any z ∈ C\R+,(

λW>W +SX>XS

n− zId

)−2

� −(λW>W + xdS2 − zId)−2(x′d(z)S

2 − Id), (53)

where xd(z) is the unique solutions of a certain fixed point equation independent of X, andx′d(z) := dxd(z)/dz. Furthermore, there exists x(z) such that xd(z)→ x(z), x′d(z)→ x′(z).

Proof [Sketch of the proof] Since this lemma can be proved following the same steps asthe proof of theorem 3.1 (b) in Dobriban and Sheng (2020), here we only provide a sketchof the proof.Step 1. (First order deterministic equivalent) Denote

f(z,W ) := (λW>W + xdS2 − zId)−1

g(z,W,X) :=

(λW>W +

SX>XS

n− zId

)−1

,

where xd(z) is the unique solution of the fixed point equation

1− xd =xdn

tr[S2(λW>W + xdS2 − zId)−1].

Then we have from theorem 1 in Rubio and Mestre (2011) that (here we need the finite8 + η moment assumption)

f(z,W ) � g(z,W,X).

Furthermore, for a sequence of W and fixed τ, z, when n, p, d → ∞ proportionally, it canbe verified that xd(z)→ x(z) and x(z) satisfies the fixed point equation

1− x = δ

(1−

∫ ∞0

[λ(t− τ)− z]dHτ (t)

xt2 + λ(t− τ)− z

).

Step 2. (Second order deterministic equivalent) Using the same technique as in the proofof theorem 3.1 (b) in Dobriban and Sheng (2020), it can be proved that

f ′(z,W ) � g′(z,W,X).

for all z ∈ E.Step 3. (Explicit expressions for the derivatives) For invertible A(z), we have

dA

dz= −A−1dA

dzA−1.

49

Page 50: What Causes the Test Error? Going Beyond Bias-Variance via

Lin and Dobriban

Therefore

f ′(z,W ) = −(λW>W + xdS2 − zId)−1(x′d(z)S

2 − Id)(λW>W + xdS2 − zId)−1

= −(λW>W + xdS2 − zId)−2(x′d(z)S

2 − Id)

and

g′(z,W,X) =

(λW>W +

SX>XS

n− zId

)−2

.

Also, it can be shown that x′d(z) → x′(z) as in the proof of theorem 3.1 (b) in Dobribanand Sheng (2020).

Similarly, we can prove Lemma 11 and 14 in the general case via the two steps in (42), (43).Since the proofs are almost the same as Lemma 12 (and in fact even simpler), we omit themfor simplicity. Also, here we only mention one difference. When bounding ∆1 from (44),we need to first replace M by EM (or EXM) and move the expectation outside the traceoperator, e.g. in the proof of 11,

∆1 := limτ→0

limd→∞

1

d

∣∣∣tr[EM(EM> − EM>τ )]∣∣∣

≤ limτ→0

limd→∞

1

dE∣∣∣tr[EM(M> −M>τ )]

∣∣∣ .Then all results follow the same argument as in (1) of the proof of Lemma 12.Proof [Proof of Lemma 13 (general case)] By definition, we have

E‖M‖2F = E tr

∥∥∥∥W>RWX>

n

∥∥∥∥2

F

= E tr

[RWX>XW>

nR

]= E

1

ntr[R− λR2

]. (54)

Denote WX>XW>/n by Q1 and (W>W )1/2X>X(W>W )1/2/n by Q2. Since Q1 and Q2

have the same non-zero eigenvalues, their Limiting Spectral Distributions (LSD) (if one ofthem exists) only differ from a constant scale and a mass at 0, i.e.,

LSD(Q2) = πLSD(Q1) + (1− π)δ0.

Since the LSD of W>W converges to πδ1 + (1− π)δ0 almost surely, from Silverstein (1995)theorem 1.1, we know that the LSD of Q2 almost surely weakly converges to a nonrandomdistribution. Therefore, the LSD of Q1 also almost surely weakly converges to a nonrandomdistribution and 1

n tr[WX>XW>/n + λIp]−i for i = 1, 2 almost surely converge to some

nonrandom limits. Thus, by the bounded convergence theorem, we know (54) convergesto a nonrandom limit independent of the exact distribution of X (which only requires Xto have i.i.d. zero mean, unit variance entries). Since we have proved Lemma 13 in theGaussian case, the general case follows directly.

50

Page 51: What Causes the Test Error? Going Beyond Bias-Variance via

What Causes the Test Error?

B.4 Proof of Lemmas 15—17

B.4.1 Proof of Lemma 15

Proof It is clear that ‖EM‖2F 6 EW ‖EXM‖2F . Thus we only need to prove the secondconvergence. By definition,

EXM = EXW>RWX>

n.

Since X has i.i.d. rows, by switching the i-th and j-th row of X, it is readily verifiedthat EXM = EXW>(n−1WX>XW>+λIp)

−1WX>/n has identically distributed columns.Note that EXM is a d× n matrix, it is thus enough to prove:

n‖EXM·1‖22 = n

∥∥∥∥EXW>RW (X1·)>

n

∥∥∥∥2

2

u−→ 0, (55)

whereu−→ denotes convergence uniformly in W . For notational simplicity, we denote the

column vector W (X1·)> by x (formed by taking the first row X1· of X), X−1·W

> by X(formed by taking the complement of the first row X1· of X) and (X>X/n + λIp) by C.Let also F = C−1xx>C−1/n. Then

(55) =1

n

∥∥∥∥∥∥EXW>(λIp +

X>X

n+xx>

n

)−1

x

∥∥∥∥∥∥2

2

=1

n

∥∥∥∥EXW>(C−1 − F

1 + x>C−1x/n

)x

∥∥∥∥2

2

=1

n

∥∥∥∥EXW>( F

1 + f

)x

∥∥∥∥2

2

, (56)

where in the second line we used the Sherman-Morrison formula, and the last line followsfrom the fact that C−1 and x are independent and Ex = 0. Let us denote f = x>C−1x/n.To prove that (56)

u−→ 0, we only need to prove the following:

(1).1

n

∥∥∥∥EXW>( F

Ex(1 + f)x

)∥∥∥∥2

2

u−→ 0 (57)

(2).1

n

∥∥∥∥EXW>( F

1 + fx

)− EXW>

(F

Ex(1 + f)x

)∥∥∥∥2

2

u−→ 0. (58)

(1). For any fixed W , since x and C−1 are independent,

Exx>C−1x

n= Ex

x>(W>C−1W )x

n=

tr(W>C−1W )

n=

tr(C−1)

n. (59)

51

Page 52: What Causes the Test Error? Going Beyond Bias-Variance via

Lin and Dobriban

Denote 1 + tr(C−1)/n by c0, and W>C−1W by C. Then

1

n

∣∣∣∣EX (W> F

Ex(1 + f)x

)i

∣∣∣∣2 =1

n3

(EX

1

c0eiCxx

>Cx

)2

=1

n3

EXd∑

j,k,l=1

1

c0eiCijxjxkCklxl

2

=1

n3

EX

d∑j=1

1

c0CijCjjEx3

j

2

≤ (EX311)2

n3

EX

d∑j=1

C2ij ·

d∑j=1

C2jj

,

where the last line follows from the Jensen inequality, Cauchy-Schwartz inequality and thefact that c0 ≥ 1. Summing up all coordinates and noting that 0 � C � Id/λ, we get

L.H.S. of (57) ≤ (EX311)2

n3EX

‖C‖2F · d∑j=1

C2j,j

≤ (EX311)2

n3· dλ2· dλ2≤ (EX3

11)2d2

λ4n3

u−→ 0.

(2). By definition, theL.H.S. of (58) equals

1

n

∥∥∥∥EXW> [( F

1 + f

)−(

F

Ex(1 + f)

)]x

∥∥∥∥2

2

=1

n3

∥∥∥∥EXW> [C−1xx>C−1(f − Exf)

(1 + f)(Ex1 + f)

]x

∥∥∥∥2

2

≤ 1

n3EX∥∥∥W>C−1xx>C−1x

∥∥∥2

2· EX

[(f − Exf)

(1 + f)(Ex1 + f)

]2

≤ 1

n3EX∥∥∥Cxx>Cx∥∥∥2

2· E

XVarx(x>Cx/n)

For the first term in the last line, note that 0 � C � Id/λ,

EX‖Cxx>Cx‖22 = EX(x>Cxx>C2xx>Cx) ≤ Ex(x>x)3/λ4 = O(n3),

where the last equality is due to the fact that there are O(n3) terms of the form x2ix

2jx

2k,

with 1 ≤ i, j, k ≤ d in (x>x)3. Now, it is enough to prove EX

Varx(x>Cx/n)u−→ 0.

Lemma 19 Suppose that x = (x1, ..., xd) has i.i.d. entries satisfying Exi = 0, Ex2i = 1.

Let A ∈ Rd×d. Then we have (see e.g. Bai and Silverstein, 2010; Couillet and Debbah,2011 and Mei and Montanari, 2019 Lemma B.6.)

Var(x>Ax) =

d∑i=1

A2ii(Ex4

1 − 3) + ‖A‖2F + tr(A2).

Using Lemma 19 above and recalling that 0 � C � Id/λ, we have

EX

Varx(x>Cx) = EX

d∑i=1

C2ii(Ex4

1 − 3) + ‖C‖2F + tr(C2)

= EX

d∑i=1

C2ii(Ex4

1 − 3) + 2‖C‖2F ≤d|Ex4

1 − 3|λ2

+2d

λ2= O(n).

52

Page 53: What Causes the Test Error? Going Beyond Bias-Variance via

What Causes the Test Error?

Therefore EX

Varx(x>Cx/n) = O(1/n)u−→ 0 and we finished the proof.

B.4.2 Proof of Lemmas 16, 17

The results in these two lemmas follow from using matrix identities (e.g., the Woodburyidentity) for the target matrices and deterministic equivalent results for orthogonal projec-tion (Haar) matrices.

Proof By definition,

EW M = EWW>(WX>XW>

n+ λ

)−1WX>

n.

Let A = X>X/n + λ1, where 0 < λ1 < λ is an arbitrary value. The Woodbury matrixidentity states that (

A−1 + UCV)−1

= A−AU(C−1 + V AU

)−1V A.

Define λ2 := λ− λ1 and take V = U> = W , C = I/λ2, to get(A−1 +W>W/λ2

)−1= A−AW>

(λ2 +WAW>

)−1WA.

Therefore

C :=EWW>(WX>XW>

n+ λ

)−1

W = EWW>(λ2 +WAW>

)−1W

= A−1 −A−1EW(A−1 +W>W/λ2

)−1A−1

= A−1/2

Id − EW

(Id +

A1/2W>WA1/2

λ2

)−1A−1/2. (60)

Also,

EX‖EW M‖2F = EX tr

(CX>X

n2C>)

= EX1

ntrC(A− λ1)C>. (61)

1

dEX‖EWM‖2F =

1

dEX tr

(CX>XX>X

n2C>)

=1

dEX trC(A− λ1)2C>. (62)

Now, we define

C1 : = A−1 −A−1/2

(Ip +

edA

λ2

)−1

A−1/2

= A−1

[Id −

(Ip +

edA

λ2

)−1]

=

(A+

λ2

edId

)−1

,

53

Page 54: What Causes the Test Error? Going Beyond Bias-Variance via

Lin and Dobriban

where ed is defined in Lemma 20. Then from Lemmas 20, 21, we know that C can bereplaced by C1 when calculating the limits of (61) and (62). Hence

limd→∞

EX‖EW M‖2F = limd→∞

1

nEX tr

(C1(A− λ1)C>1

)= lim

d→∞

1

nEX tr

[(A− λ1)

(A+

λ2

edId

)−2]

→ δ[θ1(δ, λ)− λθ2(δ, λ)] = δ(θ1 − λθ2),

where λ := λ1 +λ2/e0 is a constant independent of the choice of λ1, λ2. The last line followsfrom Lemma 23, the definition that A = X>X/n+ λ1 and the Marchenko-Pastur theorem.Similarly,

limd→∞

1

dEX‖EWM‖2F = lim

d→∞EX

1

dtr(C1(A− λ1)2C>1

)= lim

d→∞EX

1

dtr

[(A− λ1)2

(A+

λ2

edId

)−2]

→ [1− 2λθ1(δ, λ) + λ2θ2(δ, λ)] = 1− 2λθ1 + λθ2.

This finishes the proof.

Lemma 20 (Weak deterministic equivalent for Haar matrices) Under the above as-sumptions, we have (

Id +A1/2W>WA1/2

λ2

)−1w�(Id +

edA

λ2

)−1

, (63)

where (ed, ed) is the unique solution of the system of equations

ed =p

d(ed + 1− eded)−1

ed =1

dtrA (edA+ λ2Id)

−1.

Proof [Proof of Lemma 20] From properties of sample covariance matrices, we know thatthe largest eigenvalue of Ad := X>X/n + λ1 ∈ Rd×d converges to λ1 + (1 +

√δ)2 almost

surely as d → ∞. Therefore, the sequence of values ‖Ad‖2 is bounded almost surely. Fora fixed sequence of non-negative symmetric Ad with bounded 2-norm, (63) follows fromthe proof of theorem 7 (see the sketch of the proof) in Couillet et al. (2012). Since A isindependent of W and the sequence ‖Ad‖2 is bounded almost surely, (63) holds generally.

Lemma 21 (Replacing C by C1) Under the previous assumptions, we have

limd→∞

EX1

ntr(C(A− λ1)C>) = lim

d→∞EX

1

ntr(C1(A− λ1)C>1 ) (64)

limd→∞

EX1

dtr(C(A− λ1)2C>) = lim

d→∞EX

1

dtr(C1(A− λ1)2C>1 ). (65)

54

Page 55: What Causes the Test Error? Going Beyond Bias-Variance via

What Causes the Test Error?

Proof [Proof of Lemma 21] Since the proof of two claims is almost the same, for simplicity,we only present the proof of (65).

For (65), we define (here ∆, ∆1, ∆2 are different from those in Appendix B.3.5)

∆ := limd→∞

1

dEX trC(A− λ1)2C> − 1

dEX trC1(A− λ1)2C>1

= limd→∞

1

dEX{

tr[(C − C1)(A− λ1)2C>

]+ tr

[C1(A− λ1)2(C − C1)>

]}=: ∆1 + ∆2.

Also, denote

G0 := (Id +A1/2W>WA1/2/λ2)−1

G1 := (Id + edA/λ2)−1

G := EWG0.

Therefore, from Lemma 20, we know that G0w� G1. Substituting the definition of C, C1

into ∆1, we have

∆1 = limd→∞

1

dEX tr

[A−1/2(G−G1)A−1/2(A− λ1)2A−1/2(Id −G)>A−1/2

]= lim

d→∞

1

dEX,W tr

[A−1/2(A− λ1)2A−1/2(Id −G)>A−1(G0 −G1)

].

By Lemma 22, we know A and G are simultaneously diagonalizable. Moreover, it is readilyverified that A−1/2(A− λ1)2A−1/2(Id −G)>A−1 is symmetric and non-negative. Since

‖A−1/2(A− λ1)2A−1/2(Id −G)>A−1‖2 = ‖(Id −G)>A−3/2(A− λ1)2A−1/2‖2≤ ‖A−2(A− λ)2‖2≤ λ2‖A−1‖22 + 2λ‖A−1‖2 + 1 ≤ 4,

we have by Lemma 20 and the bounded convergence theorem that ∆1 → 0. Similarly, wecan prove that ∆2 → 0 and these conclude the proof of (65).

Lemma 22 (Commutativity of A and G) Under the previous definitions, A and G aresimultaneously diagonalizable, and therefore there are commutative.

Proof [Proof of Lemma 22] Let A = UΓU> be the spectral decomposition. By definitionand equation (60),

G = EW (Id +A1/2W>WA1/2/λ2)−1

= Id −A1/2EWW>(λ2 +WAW>

)−1WA1/2

= Id − UΓ1/2EW (WU)>[λ2 +WUΓ(WU)>

]−1WUD1/2U>

= Id − UΓ1/2

[EWW>

(λ2 +WΓ>W>

)−1W

]Γ1/2U>,

55

Page 56: What Causes the Test Error? Going Beyond Bias-Variance via

Lin and Dobriban

where the last line is due toWd= WU . Now it suffices to show that EWW>

(λ2 +WΓ>W>

)−1W

is a diagonal matrix.Write W = (w1, w2, .., wd), where wi, 1 ≤ i ≤ d are the columns of W . Denote the i-th

(1 ≤ i ≤ d) diagonal entry of Γ by γi and define W−i := (w1, ..., wi−1, wi+1..., wd) to be thematrix obtained by removing the i-th column from W . Then, for any 1 ≤ i 6= j ≤ d

EW (W>(λ2 +WΓW>)−1W )ij = EWw>i

(λ2 +

d∑k=1

γkwkw>k

)−1

wj

= EW−jEwj |W−jw>i

(λ2 +

d∑k=1

γkwkw>k

)−1

wj = 0, (66)

where the last line follows from the symmetry of wj . Therefore, EW (W>(λ2+WΓW>)−1W )is a diagonal matrix and hence A and G are simultaneously diagonalizable.

Lemma 23 (Convergence of ed) Under the previous assumptions and notations, sup-pose that (ed(A), ed(A)) is the unique solution of

ed =p

d(ed + 1− eded)−1 (67)

ed =1

dtrA (edA+ λ2Id)

−1. (68)

Then (ed(A), ed(A))→ (e0, e0) almost surely, where (e0, e0) is the unique solution of

e0 = π (e0 + 1− e0e0)−1 (69)

e0 =1

e0

[1− λ2

e0θ1

(δ, λ1 +

λ2

e0

)]. (70)

Furthermore, for any decomposition λ = λ1 + λ2, we have

λ1 + λ2/e0 = λ+1− π

[λ+ 1− γ +

√(λ+ γ − 1)2 + 4λ

]. (71)

We introduce λ1, λ2 only to ensure the invertibility of A and the uniform boundedness of‖A−1‖2. Equation (71) shows that different decompositions of λ do not affect the results.Proof Plugging (68) into (67), we obtain

1

dtr

(A

A+ λ2/ed

)=p/d− ed1− ed

. (72)

The uniqueness of the solution is guaranteed by theorem 7 in Couillet et al. (2012). Now,define for x ∈ R,

gd(x) :=1

dtr

(A

A+ λ2/x

)− p/d− x

1− x.

g(x) :=

[1− λ2

xθ1

(δ, λ1 +

λ2

x

)]− π − x

1− x.

56

Page 57: What Causes the Test Error? Going Beyond Bias-Variance via

What Causes the Test Error?

First, we consider the case when 0 < π < 1. Noting that gd(0+) = −p/d, gd(1−) > 0 andgd(x) is increasing on (0, 1), it follows that gd(x) has a unique zero ed on (0, 1). Similarly,we can conclude that g(x) has a unique zero e0 on (0, 1).

Since A = X>X/n+ λ1 and p/d→ π, by applying the Marchenko-Pastur theorem, weknow that gd(x) → g(x) for any x ∈ (0, 1), almost surely. Since gd(x) are increasing func-tions on (0, 1), we further have, on any closed interval in (0, 1), gd(x) converges uniformlyto g(x), almost surely.

Due to the uniformly convergence of gd(x) on any closed interval and the fact that thezeros of gd(x), g(x) are in (0, 1), we conclude that the zeros of gd(x) converge to the zero ofg(x) almost surely, i.e., ed → e0 almost surely, and hence (ed, ed)→ (e0, e0) almost surely.

Plugging (69) into (70), we obtain[1− λ2

e0θ1

(δ, λ1 +

λ2

e0

)]=π − e0

1− e0,

Thus,

θ1

(δ, λ1 +

λ2

e0

)=

1− πλ2(1/e0 − 1)

. (73)

Plugging the explicit expression of θ1(δ, λ1 + λ2/e0) into (73) and reorganizing the result,we obtain

π

(λ1 +

λ2

e0

)2

− [(1 + π)λ+ (1− π)(1− γ)]

(λ1 +

λ2

e0

)+ λ[λ+ (1− δ)(1− π)] = 0.

Thus, defining

h(x) := πx2 − [(1 + π)λ+ (1− π)(1− γ)]x+ λ[λ+ (1− δ)(1− π)],

we get that λ1 + λ2/e0 is a solution of h(x) = 0. Since e0 ∈ (0, 1), λ1 + λ2/e0 > λ. Notethat h(λ) = −δ(1−π)2λ ≤ 0. From the properties of quadratic functions, we conclude thatλ1 + λ2/e0 is the larger solution of h(x) = 0. Therefore,

λ1 + λ2/e0 = λ+1− π

[λ+ 1− γ +

√(λ+ γ − 1)2 + 4λ

]=: λ.

If π = 1, it can be readily verified that the unique solution of (69), (70) is

(e0, e0) = (1, 1− λ2θ1(δ, λ1 + λ2)).

Thus, equation (71) follows directly. As for the convergence of (ed,Ed), since gd(x) < 0for x < p/d, the zero of gd(x) (i.e. ed) is larger than p/d and hence converges to 1 asp/d → π = 1. Therefore, we have shown that ed → e0, and hence ed → e0 follows fromequation (68) and the Marchenko-Pastur theorem. This finishes the proof.

57

Page 58: What Causes the Test Error? Going Beyond Bias-Variance via

Lin and Dobriban

B.5 Proof of Theorem 5

Proof In the following proof, we will calculate the derivatives of variances and bias sepa-rately and obtain the results in Table 1 based on them. Throughout the proof, we denotec := δ(1 + σ2/α2) + 1 for simplicity.

(1) MSE. Plugging the explicit expressions of θ1, θ2, λ∗ into equation (8) and taking

derivatives with respect to π and δ separately

d

dπlimd→∞

MSE(λ∗) =d

dπ[α2(1− π + λ∗πθ1) + σ2]

= α2 d

(2δ − c+

√c2 − 4γ

)= − α2√

c2 − 4γ≤ 0, (74)

Thus, limd→∞MSE(λ∗) is monotonically decreasing as a function of π. Similarly,

d

dδlimd→∞

MSE(λ∗) =d

dδ[α2(1− π + λ∗πθ1) + σ2] =

α2

2δ2

(1 +

2γ − c√c2 − 4γ

). (75)

Since c2 − 4γ ≥ c2 − 4γ(c − γ) = (c − 2γ)2, the derivative is larger than 0. Therefore,limd→∞MSE(λ∗) is increasing as a function of δ.

(2) Bias2. Since (limd→∞MSE(λ∗)− σ2)2 = α2 limd→∞Bias2(λ∗), the monotonicity ofthe MSE implies the monotonicity of Bias2.

(3) Var. Note that Var(λ∗) = MSE(λ∗)−Bias2(λ∗)− σ2

d

dπlimd→∞

Var(λ∗) =d

(limd→∞

MSE(λ∗)− limd→∞

Bias2(λ∗)− σ2

)=

d

[limd→∞

(MSE(λ∗)− σ2)

(1− 1

α2limd→∞

(MSE(λ∗)− σ2)

)]=

(d

dπlimd→∞

MSE(λ∗)

)(1− 2

α2limd→∞

(MSE(λ∗)− σ2)

)= − α2√

c2 − 4γ·

(δσ2/α2 + 1−

√c2 − 4γ

δ

).

Since√c2 − 4γ is decreasing as π increases, simple calculations reveal that, as a function

of π, limd→∞Var(λ∗) is monotonically increasing when δ ≥ 2α2/(α2 + 2σ2), while it isincreasing on (0, [2 + δ(1 + 2σ2/α2)]/4] and decreasing on ([2 + δ(1 + 2σ2/α2)]/4, 1] whenδ < 2α2/(α2 + 2σ2). Similarly,

d

dδlimd→∞

Var(λ∗) =

(d

dδlimd→∞

MSE(λ∗)

)(1− 2

α2limd→∞

(MSE(λ∗)− σ2)

)=

α2

2δ2

(1 +

2γ − c√c2 − 4γ

(δσ2/α2 + 1−

√c2 − 4γ

δ

).

From (75), we know the first term in the last line is non-negative. Plugging in the expressionof c, it follows from some simple calculations that, as a function of δ, limd→∞Var(λ∗) is

58

Page 59: What Causes the Test Error? Going Beyond Bias-Variance via

What Causes the Test Error?

monotonically decreasing when π < 0.5, while it is increasing on (0, 2(2π− 1)/[1 + 2σ2/α2]]and decreasing on (2(2π − 1)/[1 + 2σ2/α2],+∞) when π > 0.5.

(4) Σlabel. Plugging the definition of θ1, θ2 and λ∗ into Σlabel(λ∗), we obtain

limd→∞

Σlabel(λ∗) = σ2πδ(θ1 − λ∗θ2)

=σ2

2λ∗

(√c2 − 4γ − (γ + 1)λ∗ + (γ − 1)2√

c2 − 4γ

)=

σ2c√c2 − 4γ

.

Therefore, as a function of π, limd→∞Σlabel(λ∗) is monotonically increasing on (0, 1]. Since

d

dδlimd→∞

Σlabel(λ∗) =

2πσ2(1− δ(1 + σ2/α2))√c2 − 4γ

,

it follows that limd→∞Σlabel(λ∗) is increasing on (0, α2/(σ2 + α2)] and decreasing on

(α2/[σ2 + α2],+∞) as a function of δ.

(5) Σinit. We have

d

dπlimd→∞

Σinit(λ∗) =

d

dπα2π(1− π)(1− λ∗θ1)2

= α2(1− λ∗θ1)[(1− 2π)(1− λ∗θ1)− 2π(1− π)(λ∗θ1)′]

= 2α2(1− λ∗θ1)

(4δπ3 − 2c2π + c2

(c2 − 4γπ + c√c2 − 4γ)

√c2 − 4γ

).

Since f(t) := 4δt3 − 2c2t + c2 satisfies the following properties (a). f(0) = c2 > 0, (b).f(1) = 4δ − c2 ≤ −(δ(1 + σ2/α2) − 1)2 ≤ 0, (c). f(t) has a unique extremum on (0,+∞);we know that f is decreasing and has a unique zero on (0, 1]. Therefore, limd→∞Σinit(λ

∗)is unimodal as a function of λ1. Now, taking derivatives with respect to δ,

d

dδlimd→∞

Σinit(λ∗) =

d

dπα2π(1− π)(1− λ∗θ1)2

= −2α2π(1− π)(1− λ∗θ1)(λ∗θ1)′

= −α2

δ2(1− π)(1− λ∗θ1)

(2γ − c√c2 − 4γ

+ 1

)≤ 0,

where the last inequality follows from the fact that c2−4γ ≥ (2γ−c)2. Thus, limd→∞Σinit(λ∗)

is monotonically decreasing as a function of δ.

B.6 Proof of Proposition 6

Proof [Proof of Proposition 6] It is known that if the design matrix X has independententries of zero mean and unit variance, then as n, p→∞ proportionally, i.e., p/n→ γ > 0,

59

Page 60: What Causes the Test Error? Going Beyond Bias-Variance via

Lin and Dobriban

the MSE converges almost surely to the expression (Tulino and Verdu, 2004; Couillet andDebbah, 2011; Dobriban and Wager, 2018)

MSE(γ) = γmγ(−λ∗) = γθ1(γ, λ∗) + σ2, (76)

where λ∗ = γ/α2 is the limit of the optimal regularization parameters, and mγ is the

Stieltjes transform of the limiting eigenvalue distribution Fγ of Σ = n−1X>X, i.e., theStieltjes transform of the standard Marchenko-Pastur distribution with aspect ratio γ.

Furthermore, as shown in the proof of theorem 2.2 in Liu and Dobriban (2020), thespecific forms of the bias and variance are: with θi := θi(γ, λ), i = 1, 2

(a).Bias2 = α2

∫λ2

(x+ λ)2dFγ(x) = α2λ2θ2, (b).Var = γ

∫x

(x+ λ)2dFγ(x) = γ(θ1 − λθ2).

(77)

Therefore, we can obtain the explicit formulas of the bias, variance and MSE by pluggingλ∗ = λ/α2, equation (6), (7) into equation (76), (77). All results in Proposition 6 can bederived by calculating the derivatives as we have done in the proof of theorem 5. However,since the proofs are simpler in this special case, we present them here for the convenienceof readers. Throughout this proof, we denote 1/α2 by c.

MSE: Let τ := (1/α2 − 1)γ − 1, substitute (6) into (76) and take derivatives:

d

dγMSE(γ) =

d

dγ[γθ1(γ, λ∗) + σ2] =

d

τ +√τ2 + 4cγ2

2cγ

=τ + 1 + [τ(τ + 1) + 4cγ2]/

√τ2 + 4cγ2 − (τ +

√τ2 + 4cγ2)

2cγ2

=τ +

√τ2 + 4cγ2

2cγ2√τ2 + 4cγ2

≥ 0.

Thus, the MSE is strictly increasing as γ increases.

Bias: Plugging equation (7) into (77)(a) and denoting 1/γ by x , we have

Bias2(γ) = α2λ∗2θ2(γ, λ∗) =α2

2

(1− 1

γ+

cγ(γ + 1) + (γ − 1)2

γ√

(c+ 1)2γ2 + 2(c− 1)γ + 1

)

=α2

2

(1− x+

c(x+ 1) + (x− 1)2√(c+ 1)2 + 2(c− 1)x+ x2

)=:

α2

2(1 + f(x, c)).

Thus it is enough to show that f(x, c) is decreasing for x > 0. Taking derivatives withrespect to x, we have after some calculations that

d2

dx2f(x, c) =

6(c+ 1)3 + 12c2x

((c+ 1)2 + 2(c− 1)x+ x2)52

> 0, ∀x > 0.

60

Page 61: What Causes the Test Error? Going Beyond Bias-Variance via

What Causes the Test Error?

Thus, for any fixed c > 0, f(x, c) is a strictly convex function with respect to x on (0,+∞).Furthermore, fixing c > 0 and letting x→∞, we get

limx→∞

f(x, c) = limx→∞

−x+c(x+ 1) + (x− 1)2√

(c+ 1)2 + 2(c− 1)x+ x2

= limx→∞

c(x+ 1) + (x− 1)2 − x√

(c+ 1)2 + 2(c− 1)x+ x2√(c+ 1)2 + 2(c− 1)x+ x2

= limx→∞

(x2 + (c− 2)x+ c+ 1)− x(x+ c− 1)√

1 + 4c(x+c−1)2

x

= limx→∞

(x2 + (c− 2)x+ c+ 1)− x(x+ c− 1)(1 +O( 1x2

))

x

= limx→∞

−x+O(1)

x= −1.

Combining the results above and using the fact that a strictly convex function with a finitelimit is strictly decreasing, it follows that f(x, c) is both strictly decreasing and convex.Therefore, we have proved that Bias2(γ) is strictly increasing on (0,+∞).Variance: Similarly, plugging (6), (7) into (77) (b), we get

Var(γ) = −1

2+

(c+ 1)γ + 1

2√

(γ(1− c)− 1)2 + 4γ2c

= −1

2+

(c+ 1)γ + 1

2√

(c+ 1)2γ2 + 2(c− 1)γ + 1.

Let x = (c+ 1)γ and t = c−1c+1 . Then it suffices to prove the unimodality of

gt(x) : =x+ 1√

x2 + 2tx+ 1, x ∈ (0,+∞), t ∈ (−1, 1).

Differentiating gt(x) with respect to x gives

d

dxgt(x) =

(1− t)(1− x)

(x2 + 2tx+ 1)32

.

Since g′t(x) > 0 when x < 1, and g′t(x) < 0 when x > 1, we see that gt(x) is strictly increasingon (0, 1) and strictly decreasing on (1,+∞). Correspondingly, by changing x back to γ,c back to 1/α2, we have shown that Var(γ) is strictly increasing on (0, α2/(α2 + 1)) andstrictly decreasing on [α2/(α2 + 1),+∞). Moreover, noting that

limγ→+∞

Var(γ) = limγ→+∞

−1

2+

(c+ 1)γ + 1

2√

(c+ 1)2γ2 + 2(c− 1)γ + 1

= limγ→+∞

−1

2+

(c+ 1)γ

2(c+ 1)γ= 0.

Therefore, we conclude that Var(γ) is a unimodal function converges to zero at infinitywith a unique maximum point at α2/(α2 + 1). Finally, proposition 6(4) can be obtained byevaluating Var(γ) and Bias2(γ) at α2/(α2 + 1).

61

Page 62: What Causes the Test Error? Going Beyond Bias-Variance via

Lin and Dobriban

B.7 Proof of Theorem 7

Proof (1). By plugging the expression of θ1 into equation (8) and taking derivatives withrespect to δ in equation (8), we obtain

d

dπlimd→∞

Bias2(λ) =d

dπα2(1− π + λπθ1)2

= α2(1− π + λπθ1)

(λ+ γ − 1√

(−λ+ γ − 1)2 + 4λγ− 1

)< 0,

where the inequality follows from the facts that 1−π+λπθ1 =∫

[λ+ (1− π)x]/(x+ λ)dFγ(x) ≥0 and (λ+ γ − 1)2 − [(−λ+ γ − 1)2 + 4λγ] = −4λ < 0. Similarly,

d

dδlimd→∞

Bias2(λ) =d

dδα2(1− π + λπθ1)2

=α2

δ2(1− π + λπθ1)

(λ+ 1− λ2 + (γ + 2)λ+ (1− γ)√

(−λ+ γ − 1)2 + 4λγ

)> 0.

Therefore, the limiting Bias2 is monotonically increasing as a function of δ and mono-tonically decreasing as a function of π.

(2). Denote λ∗ = δ(1− π + σ2/α2) as before. From (9), we have

limd→∞

Var(λ) = α2π

{1− π +

λ

δ+

[(π − 1)(2λ− δ) +

λ(λ− γ + 1)

δ+δσ2

α2

]θ1+

λ[λ− δ

(1− π + σ2/α2

)]θ2

}= α2π

{1− π +

λ

δ+

[(2π − 2 +

λ− γ + 1

δ

)λ+ λ∗

]θ1(1, λ) + λ(λ− λ∗)θ2

}Now suppose δ = 1/π and let λ→ 0+,

limλ→0

limd→∞

Var(λ) = limλ→0

α2π

{1− π +

λ

δ+

[(2π − 2 +

λ− γ + 1

δ

)λ+ λ∗

]θ1 + λ(λ− λ∗)θ2

}= lim

λ→0α2π[1− π + λ∗(θ1(1, λ)− λθ2(1, λ))]

= limλ→0

α2π

[1− π + λ∗

(4λ

(√λ2 + 4λ+ λ)2

√λ2 + 4λ

)]= lim

λ→0O(

1

λ1/2) =∞.

Finally, letting λ→ 0 in Theorem 2, we obtain after some similar calculations that Vsi andVsli go to infinity while Vs, Vi and Vsl converge to some finite limits as d→∞, λ→ 0.

B.8 Proof of Theorem 8

Proof [Proof of Theorem 8] For notational simplicity, we sometimes denote the 2-norm ofvectors and the Frobenius norm of matrices by ‖ · ‖ in this proof. From definition (16), we

62

Page 63: What Causes the Test Error? Going Beyond Bias-Variance via

What Causes the Test Error?

have

βopt : = argminβ Ep(θ|XW>,W,Y )Ex,ε[(Wx)>β − (x>θ + ε)]2

= argminβ Ep(θ|XW>,W,Y )Ex[(Wx)>β − x>θ]2

= argminβ Ep(θ|XW>,W,Y )Ex[(W>β − θ)>xx>(W>β − θ)]2

= argminβ Ep(θ|XW>,W,Y )‖W>β − θ‖22= WEp(θ|XW>,W,Y )θ.

Thus, we have proved equation (17).Now, to prove (18), it suffices to do the following:

(1). Calculate the posterior p(θ|XW>,W, Y ).

(2). Bound the difference between βopt and β.

(1). Let W⊥ = f(W ) ∈ R(d−p)×d be a deterministic orthogonal complement of W , suchthat W>⊥W⊥ +W>W = Id. Then we have

p(θ|XW>,W, Y ) ∝ p(θ)p(XW>,W, Y |θ) = p(θ)p(XW>,W |θ)p(Y |XW>,W, θ)

∝ exp

(− ‖θ‖

22

2α2/d−‖XW>‖2F

2

)· p(Y |XW>,W, θ) (78)

∝ exp

(− ‖θ‖

22

2α2/d

)·∫p(Y |XW>⊥ , XW>,W, θ) · p(XW>⊥ |XW>,W, θ)d(XW>⊥ )

∝ exp

(− ‖θ‖

22

2α2/d

)·∫

exp

(−‖Y −Xθ‖

22

2σ2

)· p(XW>⊥ |XW>,W, θ)d(XW>⊥ ),

where in the second line we used the facts that XW> and W are independent conditionedon θ, XW> has i.i.d. N (0, 1) entries, and W is uniformly distributed over partial orthogonalmatrices. Denote XW> by A and XW>⊥ by A1. Then using the fact that A,A1 have i.i.d.N (0, 1) entries and A, A1 are independent conditioned on W and θ, we obtain

p(XW>⊥ |XW>,W, θ) ∝ exp

(−‖XW>⊥ ‖2F

2

)= exp

(−‖A1‖2F

2

).

Therefore,

p(Y |XW>,W, θ) =

∫p(Y |X, θ) · p(XW>⊥ |XW>,W, θ)d(XW>⊥ )

=

∫exp

(−‖(Y −AWθ)−A1W⊥θ‖22 + σ2‖A1‖2F

2σ2

)dA1

= exp

(−‖Y −AWθ‖22

2σ2

)∫exp

{−σ2‖A1‖2F + ‖A1W⊥θ‖2F − 2 tr[A1W⊥θ(Y −AWθ)>]

2σ2

}dA1

Further denote Wθ by θ , W⊥θ by θ1 and (Y − Aθ)θ>1 (θ1θ>1 + σ2)−1 by B. For a fixed Y ,

conditioned on A,W, θ, by separating A1 into n rows, applying Fubini’s theorem and using

63

Page 64: What Causes the Test Error? Going Beyond Bias-Variance via

Lin and Dobriban

properties of the p.d.f. of a normal distribution, we get

p(Y |XW>,W, θ) ∝ exp

(−‖Y −Aθ‖

22

2σ2

∫exp

{−tr[(A1 −B)(θ1θ

>1 + σ2)(A1 −B)>]

2σ2+

tr[B(θ1θ>1 + σ2)B>]

2σ2

}dA1 (79)

∝ exp

(−‖Y −Aθ‖

2

2σ2+

tr[B(θ1θ>1 + σ2)B>]

2σ2

)· det(θ1θ

>1 + σ2)−

n2

= exp

(−‖Y −Aθ‖

2

2σ2+

tr[(Y −Aθ)θ>1 (θ1θ>1 + σ2)−1θ1(Y −Aθ)>]

2σ2

)· det(θ1θ

>1 + σ2)−

n2 .

Let θ1 = UDV > be the SVD of θ1. Then

det(θ1θ>1 + σ2) = det(DD> + σ2) = (‖θ1‖2 + σ2)

d−p∏i=2

σ2 ∝ ‖θ1‖2 + σ2.

θ>1 (θ1θ>1 + σ2)−1θ1 = D>(DD> + σ2)−1D = ‖θ1‖2(‖θ1‖2 + σ2)−1.

Thus,

p(Y |XW>,W, θ) ∝ exp

{[−1 + ‖θ1‖2(‖θ1‖2 + σ2)−1]

‖Y −Aθ‖2

2σ2

}· (‖θ1‖2 + σ2)−

n2 . (80)

Finally, by substituting (80) into (78), we obtain the posterior:

p(θ|XW>,W, Y ) ∝ exp

(− ‖θ‖

22

2α2/d− ‖Y −Aθ‖2

2(‖θ1‖2 + σ2)

)· (‖θ1‖2 + σ2)−

n2

= exp

(− ‖θ‖

2

2α2/d− ‖Y −Aθ‖2

2(‖θ1‖2 + σ2)

)· exp(− ‖θ1‖2

2α2/d) · (‖θ1‖2 + σ2)−

n2 . (81)

(2). Since βopt = WEp(θ|XW>,W,Y )θ = Ep(θ|XW>,W,Y )θ, it suffices to calculate the pos-

terior mean of θ. In equation (81), we can see that conditioned on θ1, θ follows a normaldistribution. Moreover, using the same technique as in equation (79), it is not hard to verifythat the expectation of θ conditioned on θ1 is(

d

α2+

A>A

‖θ1‖2 + σ2

)−1A>Y

‖θ1‖2 + σ2. (82)

Now changing A, θ back to X,W , we obtain

Ep(θ|θ1,XW>,W,Y )θ =

[WX>XW>

n+

(‖θ1‖2 + σ2)d

nα2

]−1WX>Y

n. (83)

64

Page 65: What Causes the Test Error? Going Beyond Bias-Variance via

What Causes the Test Error?

Therefore, the conditional expectation of θ is the ridge estimator with λ = λ(θ1) := (‖θ1‖2+σ2)d/[nα2]. Thus βopt is in fact a weighted ridge estimator. Besides, note that W⊥θ hasi.i.d. N (0, α2/d) entries. Hence letting χ2(k) be a chi squared random variable with kdegrees of freedom,

‖θ1‖2 = ‖W⊥θ‖2 =α2

d

d−p∑i=1

(√dW⊥θ

α

)2

i

d=α2

dχ2(d− p) w−→ α2(1− π). (84)

Denote R∗ = (WX>XW>

n +λ∗)−1. Thus, we may guess that the posterior of ‖θ1‖ is close toα2(1− π) with high probability and

βopt ≈[WX>XW>

n+ δ(1− π + σ2/α2)

]−1WX>Y

n= R∗

WX>Y

n= β,

which is the optimal ridge estimator.We formalize this idea by bounding the difference between βopt and β. Denote the

conditional mean (83) of θ given ‖θ1‖2 = c by R(c). Let also R1 = (WX>XW>

n + λ(θ1))−1.

Then the optimal ridge estimator is β = R(α2(1− π)) and we have

EXW>,W,Y ‖β − βopt‖22 = EXW>,W,Y

∥∥∥∥∫ [R(‖θ1‖2)−R(α2(1− π))] · p(θ1|XW>,W, Y )dθ1

∥∥∥∥2

≤ EXW>,W,Y,θ1∥∥∥[R(‖θ1‖2)−R(α2(1− π))

∥∥∥2= EXW>,W,Y,θ1

∥∥∥∥[R∗ − R1

]WX>Y

n

∥∥∥∥2

,

where we used the Jensen inequality in the second line. Note that A−1 − B−1 = A−1(B −A)B−1 and omit the subscripts of the expectation. The above equals

E∥∥∥∥[R∗ (λ∗ − λ(θ1)

)R1

]WX>Y

n

∥∥∥∥2

. (85)

Furthermore, by Cauchy-Schwartz inequality and the fact that ‖AB‖F ≤ ‖A‖2‖B‖F , thesquare of this quantity is upper bounded by

E(λ∗ − λ(θ1)

)4· E∥∥∥∥[R∗ · R1

]WX>Y

n

∥∥∥∥4

.

≤ E(λ∗ − λ(θ1)

)4· E∥∥∥∥ 1

λ(θ1)λ∗WX>Y

n

∥∥∥∥4

≤ 1

λ∗4E(λ∗ − λ(θ1)

)4·(E

1

λ(θ1)8

)1/2

·

(E∥∥∥∥WX>Y

n

∥∥∥∥8)1/2

. (86)

Recall that Y = Xθ+E and note that E‖WX>X/n‖kF , E‖WX>/√n‖kF , E‖θ‖k2, E‖E/

√n‖k2

are all uniformly bounded (as d→∞) for any non-negative integer k because of the Gaussianasusmption and the boundeness of moments of Wishart matrices (Muirhead, 2009; Bai andSilverstein, 2010) etc. It follows directly by several applications of the Cauchy-Schwartz

65

Page 66: What Causes the Test Error? Going Beyond Bias-Variance via

Lin and Dobriban

inequality that E‖WX>Y/n‖8 is bounded as d → ∞, i.e., the third term in the R.H.S. of(86) is bounded.

If σ > 0, then λ(θ1) ≥ λ(0) = σ2d/[nα2] > 0, and hence the second term in (86) isbounded. If σ = 0 and π < 1, by the definition of λ(θ1) and integration in the polar system,it is readily verified that

E1

λ(θ1)8= E

n8

(x21 + · · ·+ x2

d−p)8

=n8∏8

k=1(d− p− 2k)

d→∞−→ C1 <∞.

Therefore, the second term in the R.H.S of (86) is also bounded. Also

(86) ≤ C · limd→∞

EXW>,W,Y,θ(λ∗ − λ(θ1)

)4

≤ C2 · limd→∞

([λ∗ − d

n(1− p

d+ σ2/α2)

]4

+ E[λ(θ1)− Eλ(θ1)]4

)= 0,

where C,C2 are some finite constants and the last equality follows from equation (84) andproperties of the chi-square distribution.

Therefore, we have proved that E‖β − βopt‖2 → 0 and the optimal ridge estimator β isasymptotically optimal.

B.9 Proof of Theorem 4

Proof [Proof of Theorem 4] We only need to make small changes in the proof of theorems2 and 3 to prove theorem 4. We first take limd→∞Bias2(λ) as an example. Similar to (25)

Bias2(λ) = Ex[EX,W,E

(x>Mθ + x>ME

)− x>θ

]2

= Ex[x>(EX,WM − I)θ

]2= Ex tr((EM> − I)xx>(EM − I)θθ>)

= tr((EM> − I)(EM − I)θθ>). (87)

Note that we have shown EM is a multiple of identity in Lemma 11 (under Gaussianassumption), therefore

(87) =[(EM> − I)(EM − I)

]· tr(θθ>) =

1

dtr[(EM> − I)(EM − I)

]· tr(θθ>). (88)

From Lemma 11, we know the first term in the R.H.S. of (88) converges to π2(1−λθ1(πδ, λ))2.

Note that tr(θθ>)d= α2(

∑di=1 x

2i )/d, where xi ∼ N (0, 1). By the Borel-Cantelli lemma and

the concentration inequality for χ2-variables, we have tr(θθ>)a.s.→ α2.

Thus, (88) almost surely converges to α2π2(1 − θ1(πδ, λ))2 and the same asymptoticresult for bias as in theorem 3 holds almost surely over the randomness in θ.

From this example, we can see that the results in theorem 2 and 3 will automaticallyhold in the non-random setting if we can separate θ from other variables (e.g. EM) in (25)—(27), (32)—(38) by showing that the matrices which are multiplied by θθ> are in fact a

66

Page 67: What Causes the Test Error? Going Beyond Bias-Variance via

What Causes the Test Error?

multiple of identity. For instance, in (25), since θθ> is multiplied by a multiple of identity(EM> − I)(EM − I), the same result follows.

To generalize other results in theorem 2 and 3 to the almost sure setting, from (25)—(27), (32)—(38), we can see that it is enough to show the following matrices are all multiplesof the identity:

(a). EM>M, (b). EW (EXM>EXM), (c). EX(EWM>EWM).

(a). EM>M . We will denote R =(WX>XW>/n+ λIp

)−1in what follows. By

the definition of M and the fact that XW> and X(I −W>W ) (denoted by X2) are twoindependent matrices with Gaussian entries for any fixed W with orthogonal rows

EM>M = EX>XW>

nR2WX>X

n

= EW>WX>XW>

nR2WX>XW>W

n(89)

+ EX>2 XW

>

nR2WX>X2

n. (90)

For (89), note that W and XW> are independent, so

(89) = EWW>[EXW>

WX>XW>

nR2WX>XW>

n

]W.

Since EWW>AW = tr(A) · Id/d for any constant matrix A, it follows directly that (89)is a multiple of identity. For (90), note that XW> and X(I − W>W ) are independentconditioned on W , thus

(90) = EWEX2|WX>2

[EXW>

XW>

nR2WX>

n

]X2. (91)

Let XW> = UDV > be the SVD. Since XW> has i.i.d. N(0, 1) entries, we can assume Ufollows the Haar measure and is independent of DV >. Therefore, with X2 = X(I−W>W )

(91) = EWEX2|WX>2

[EU,D,V U

DV >

n

(V D>DV >

n+ λIp

)−2V D>

nU>

]X2

= EWEX2|WX>2

{1

ntr

[ED,V

DV >

n

(V D>DV >

n+ λIp

)−2V D>

n

]}X2

= c0 · EWEX2|WX>2 X2 = c1 · EW (I −W>W ) = c2 · Id,

where c0, c1, c2 are some constants. Combining (89) and (90), we have proved EM>M is amultiple of identity.

(b). EW (EXM>EXM). Since W and XW> are independent, similarly

EXM = EXW>RWX>XW>

nW = W>

[EXW>|WR

WX>XW>

n

]W

= W>EV V ED(D>D

n+ λIp

)−1D>D

nV >W = c0W

>W,

67

Page 68: What Causes the Test Error? Going Beyond Bias-Variance via

Lin and Dobriban

where c0 is a constant (different from previous constants) and the last line follows from thefact that EV V AV > = tr(A)/p · Ip for any constant matrix A. Therefore

EW (EXM>EXM) = EW c20W>WW>W = c2

0 · EWW>W = c1 · Id,

where c1 is a constant.

(c). EX(EWM>EWM). Let X>X = UΓU> be the spectral decomposition of X>X. Bythe definition of M , we have

EWM = EW[W>RW

] X>Xn

= EWU

[(WU)>

((WU)Γ(WU)>

n+ λIp

)−1

WU

]ΓU>

n

= EW

[W>

(WΓW>

n+ λIp

)−1

W

]ΓU>

n= c0(Γ) · Id ·

UΓU>

n,

where c0(Γ) is a constant depending on Γ, the second line is due to Wd= WU and the last

line follows from the proof of Lemma 22 (see (66)). Finally, note that U follows the Haardistribution and is independent of Γ. Thus,

EX(EWM>EWM) = EΓ,Uc0(Γ)2 · UΓ2U>

n= EUU

(EΓc0(Γ)2 · Γ2

n

)U> = c1 · Id,

where c1 is a constant and the last equality follows again from the fact that EUUAU> =tr(A)/d · Id for any constant matrix A.

B.10 Proof of Theorem 9

Proof By definition, we have

MSE(λ) := Eθ,x,W,X,E(f(x)− x>θ)2 + σ2

= Eθ,x,W,X,E f(x)2 − 2Eθ,x,W,X,E f(x) · x>θ + Eθ,x(x>θ)2 + σ2.

Bias2(λ) := Eθ,x|EW,X,E f(x)− x>θ|2

= Eθ,x(EW,X,E f(x))2 − 2Eθ,x,W,X,E f(x) · x>θ + Eθ,x(x>θ)2

Var(λ) := Eθ,x,W,X,E |f(x)− EX,W,E f(x)|2

= Eθ,x,W,X,E f(x)2 − Eθ,x(EW,X,E f(x))2.

To prove equation (22), (23) and (24), it is thus enough to calculate Eθ,x,W,X,E f(x)2,

Eθ,x,W,X,E f(x) · x>θ and Eθ,x(EW,X,E f(x))2. In Lemma 24, 25 and 26, we calculate thesethree terms separately. Equation (22), (23) and (24) follow directly from these three lemmasand the fact that Eθ,x(x>θ)2 = α2.

68

Page 69: What Causes the Test Error? Going Beyond Bias-Variance via

What Causes the Test Error?

Using the same technique as in the proof of theorem 3, it can be shown that the limiting

MSE as a function of λ has a unique minimum at λ∗ := v2

µ2

[δ(1− π + σ2

α2 ) + (v−µ2)γv

].

Furthermore, results in Table 2 can be proved in the same way as results in Table 1. Forsimplicity, we omit the proof of optimal λ∗ and Table 2 here.

Lemma 24 (Asymptotic limit of Eθ,x(EW,X,E f(x))2) Under assumptions in theorem 9

limd→∞

Eθ,x(EW,X,E f(x))2 = α2π2µ4

v2

(1− λ

vθ1

)2

, (92)

where θ1 := θ1(γ, λ/v).

Lemma 25 (Asymptotic limit of Eθ,x,W,X,E f(x) · x>θ) Under assumptions in theorem9

limd→∞

Eθ,x,W,X,E f(x) · x>θ = α2πµ2

v

(1− λ

vθ1

), (93)

where θ1 := θ1(γ, λ/v).

Lemma 26 (Asymptotic limit of Eθ,x,W,X,E f(x)2) Under assumptions in theorem 9 wehave

limd→∞

Eθ,x,W,X,E f(x)2 = α2π

[1− 2(v − µ2)

v+

(ρ(1− π)− 2λµ2

v2

)θ1 +

λ

v

(λµ2

v2− ρ(1− π)

)θ2

+v − µ2

v

(1 + γθ1 −

λγ

vθ2

)]+ σ2γ

(θ1 −

λ

vθ2

), (94)

where θ1 := θ1(γ, λ/v) and θ2 := θ2(γ, λ/v).

The proofs of these lemmas proceed by applying the leave-one-out technique and theMarchenko-Pastur law (refer to Lemmas 27—29), and by leveraging properties of the or-thogonal projection matrix and the normal distribution.

Proof [Proof of Lemma 24] Different from previous sections, we will denoteR =(σ(WX>)σ(XW>)

n + λ)−1

in the proof of Lemmas 24—26. By definition,

Eθ,x(EW,X,E f(x))2 = Eθ,x[EW,X,Eσ(x>W>)R

σ(WX>)Y

n

]2

= Eθ,x[EW,Xσ(x>W>)R

σ(WX>)Xθ

n

]2

=α2

dEx∥∥∥∥EW,Xσ(x>W>)R

σ(WX>)X

n

∥∥∥∥2

F

=:α2

dEx‖T1‖2F , (95)

69

Page 70: What Causes the Test Error? Going Beyond Bias-Variance via

Lin and Dobriban

For T1, we continue

T1 = EW,Xσ(x>W>)Rσ(WX>)

n[XW>W +X(I −W>W )]

= EW,Xσ(x>W>)Rσ(WX>)

nXW>W

= EWσ(x>W>)EX[Rσ(WX>)

nXW>

]W, (96)

where the second line follows from the fact that XW> and X(I −W>W ) are independentand EXX(I −W>W ). Denote

D1 := EX[Rσ(WX>)

nXW>

],

then D1 is a constant matrix independent of W since WX has i.i.d. N (0, 1) entries forany W with orthonormal rows. We can write the vector x as x = U(

√d1, 0, ..., 0)> U is a

random orthogonal matrix following the Haar distribution. Denoting the i-th column of Wby W·i and substituting x into (96), we get

T1 = EWσ(x>W>)D1W = EWσ((√d1, 0, ..., 0)U>W>)D1WUU>

= EWσ((√d1, 0, ..., 0)W>)D1WU>

where the last line follows from Wd= WU . We can further write this as

EWσ(√d1W

>·1 )D1(W·1, ...,W·d)U

>

=(EWσ(

√d1W

>·1 )D1W·1, 0, ..., 0

)U>

=(EW·1 tr[W·1σ(

√d1W

>·1 )D1], 0, ..., 0

)U>

=(EW11 [W11σ(

√d1W11)] tr(D1), 0, ..., 0

)U>, (97)

where the second line follows from the symmetry of W.i(i ≥ 2) conditional on W·1 and thelast line is due to the fact that EW·1W·1σ(

√d1W

>·1 ) is a multiple of identity since Wi,1 is

symmetric conditional on Wj,1, j 6= i. Therefore,

Eθ,x(EW,X,E f(x))2 =α2

dEx‖T1‖2F

=α2

dEd1

[EW11W11σ(

√d1W11)

]2tr(D1)2. (98)

Denote√d1W11 by w. Noting that w

w−→ N (0, 1) and d1/d = ‖x‖2/d a.s.−→ 1, we obtain

limd→∞

Eθ,x(EW,X,E f(x))2 = limd→∞

α2

d2tr(D1)2Ed1

[Ewwσ

(√d1

dw

)]2

= limd→∞

α2π2µ2

v2

(1− λ

vθ1

)2

Ed1/d

[Ewwσ

(√d1

dw

)]2

= α2π2µ2

v2

(1− λ

vθ1

)2

[Ea∼N (0,1)σ(a)a]2 = α2π2µ4

v2

(1− λ

vθ1

)2

,

70

Page 71: What Causes the Test Error? Going Beyond Bias-Variance via

What Causes the Test Error?

where the third line can be rigorously justified using the concentration inequality for w andd1/d:

P (|w| > t) ≤ 2e−(d−2)t2/d (Levy’s lemma), P (|d1/d− 1| > t) ≤ 2e−dt2/8 (concentration for χ2).

the fact that |σ(x)|, |σ′(x)| ≤ c1ec2|x|, integration by parts and the bounded convergence

theorem. Here Levy’s lemma refers to usual concentration of the Haar measure (Boucheronet al., 2013).

Proof [Proof of Lemma 25] By definition,

Eθ,x(EX,W,E f(x)θ>x) = Eθ,x[EW,X,Eσ(x>W>)R

σ(WX>)Y

nθ>x

]= Eθ,x

[EW,Xσ(x>W>)R

σ(WX>)Xθ

nθ>x

]=α2

dEx[EW,Xσ(x>W>)R

σ(WX>)Xx

n

]=α2

dEx tr(T1x),

where T1 is defined in equation (95) in the proof of Lemma 24. Let us again write x =U(√d1, 0, ..., 0)> for an orthogonal U , and denote w =

√dW11. Then from equation (97)

and the fact that ww−→ N (0, 1), d1/d

a.s.−→ 1 we have

limd→∞

Eθ,x(EX,W,E f(x)θ>x) = limd→∞

α2

dEd1

(EW11 [W11σ(

√d1W11)] tr(D1), 0, ..., 0

)U>U(

√d1, 0, ..., 0)>

= limd→∞

α2

dtr(D1)Ed1EW11 [

√d1W11σ(

√d1W11)]

= limd→∞

α2πµ

v

(1− λ

vθ1

)Ed1/dEW11

[√d1

dwσ

(√d1

dw

)]

= α2πµ

v

(1− λ

vθ1

)Ea∼N (0,1)[aσ(a)] = α2π

µ2

v

(1− λ

vθ1

),

where the third line follows from similar arguments as in the proof of Lemma 24.

Proof [Proof of Lemma 26] By definition,

Eθ,x,W,X,E f(x)2 = Eθ,x,W,X,E∣∣∣∣σ(x>W>)R

σ(WX>)(Xθ + E)

n

∣∣∣∣2=α2

dE∥∥∥∥σ(x>W>)R

σ(WX>)X

n

∥∥∥∥2

2

+ σ2E∥∥∥∥σ(x>W>)R

σ(WX>)

n

∥∥∥∥2

2

=: T2 + T3.

71

Page 72: What Causes the Test Error? Going Beyond Bias-Variance via

Lin and Dobriban

For T2, we further have that it equals

α2

dEW,X,x

∥∥∥∥σ(x>W>)Rσ(WX>)XW>

n

∥∥∥∥2

2

+α2

dEW,X,x

∥∥∥∥σ(x>W>)Rσ(WX>)X(I −W>W )

n

∥∥∥∥2

2

=vα2

dEW,X

∥∥∥∥Rσ(WX>)XW>

n

∥∥∥∥2

F

+vα2

dEW,X

∥∥∥∥Rσ(WX>)(Id −W>W )

n

∥∥∥∥2

F

=vα2

dEW,X

∥∥∥∥Rσ(WX>)XW>

n

∥∥∥∥2

F

+vα2(d− p)

dEW,X

∥∥∥∥Rσ(WX>)

n

∥∥∥∥2

F

=: T4 + T5,

where the first and third equations follow from the fact that XW> and X(I −W>W ) areindependent conditional on W and EXX(Id −W>W )(Id −W>W )X> = tr(Id −W>W ) =d − p. Also, we used that Wx ∼ N(0, Ip) given any orthgonal W , Ea∼N (0,1)σ(a) = 0,

definition (21) and the independence of Wx, XW> and X(I −W>W ) conditional on W .

Also, due to the independence of Wx, XW> conditional on W , we have

T3 = σ2vEW,X∥∥∥∥Rσ(WX>)

n

∥∥∥∥2

F

=σ2d

α2(d− p)T5.

Finally, substituting Lemma 28, 29 into T4, T5 , we get

Eθ,x,W,X,E f(x)2 = T3 + T4 + T5 = T4 +

(1 +

σ2d

α2(d− p)

)T5

=vα2

dEW,X

∥∥∥∥Rσ(WX>)XW>

n

∥∥∥∥2

F

+

(vα2(d− p)

d+ vσ2

)EW,X

∥∥∥∥Rσ(WX>)

n

∥∥∥∥2

F

→ α2π

[1− 2(v − µ2)

v+

(ρ(1− π)− 2λµ2

v2

)θ1 +

λ

v

(λµ2

v2− ρ(1− π)

)θ2

+v − µ2

v

(1 + γθ1 −

λγ

vθ2

)]+ σ2γ

(θ1 −

λ

vθ2

)

Lemma 27 (Asymptotic behavior of D1) Under assumptions in theorem 9, we have

limd→∞

1

dEX tr

[Rσ(WX>)

nXW>

]= π

µ

v

(1− λ

vθ1

), (99)

where θ1 := θ1(γ, λ/v).

Proof [Proof of Lemma 27] Since WW> = Ip, WX> has i.i.d. N (0, 1) entries and the

L.H.S. of (99) (if exists) is a constant independent of W . Denote X := XW> for notationalsimplicity. Also, let X·i be the i-th column of X, X·−i be the matrix obtained by deletingthe i-th column of X. By symmetry of X, it suffices to compute the first diagonal entry of

72

Page 73: What Causes the Test Error? Going Beyond Bias-Variance via

What Causes the Test Error?

the matrix in the L.H.S. of (99). Namely, we have

1

dEX tr

[Rσ(WX>)

nXW>

]=p

dEX

[Rσ(X>)

nX

]11

=p

dEX

σ(X>)

n

(σ(X)σ(X>)

n+ λIn

)−1

X

11

=p

dEX

σ(X·1)>

n

(σ(X·1)σ(X>·1 )

n+σ(X·−1)σ(X>·−1)

n+ λIn

)−1

X·1

. (100)

Define C := [σ(X·−1)σ(X>·−1)/n+λIn], u := X·1 and u := σ(X·1). By the Sherman-Morrisonformula, the above equals

p

dEu>

n

(C−1 − C−1uu>C−1/n

1 + u>C−1u/n

)u =

p

d

[Eu>C−1u

n− E

(u>C−1uuC−1u/n

n+ u>C−1u

)]. (101)

Since u = σ(X·1) has i.i.d. zero mean v variance entries, by theorem 1 in Rubio andMestre (2011), the proof of theorem 2.1 in Liu and Dobriban (2020) , we know that C−i aredeterminstically equivalent to certain multiples of the identity matrix. Also, the multipleswill converge to certain limits, which can be determined by the Marchenko-Pastur law.Thus, we have after some calculations that

C−1 =

(σ(X·−1)σ(X>·−1)

n+ λIn

)−1

� 1

γvθ1

(1

γ,λ

γv

)· In =

γ

vθ1

(γ,λ

v

)+

1− γλ

. (102)

C−2 =

(σ(X·−1)σ(X>·−1)

n+ λIn

)−2

� 1

γ2v2θ2

(1

γ,λ

γv

)· In =

γ

v2θ2

(γ,λ

v

)+

1− γλ2

.

(103)

Therefore, it remains to calculate the limit of (101).

Since u ∼ N (0, In), we have by the strong law of large numbers that

lim sup

∥∥∥∥uu>n∥∥∥∥

tr

= lim supu>u/n <a.s. ∞.

Note that C is independent of u, thus we have by (102) that almost surely for a sequenceof uk, k = 1, 2, ...

limd→∞

u>C−1u

n= lim

d→∞tr

((uu>)

nC−1

)= lim

d→∞

vθ1

(γ,λ

v

)+

1− γλ

]tr

((uu>)

n

)=

vθ1

(γ,λ

v

)+

1− γλ

], a.s.

73

Page 74: What Causes the Test Error? Going Beyond Bias-Variance via

Lin and Dobriban

Similar results also hold for other terms: (i = 1, 2)

u>C−iu

n

a.s.−→[γ

viθi

(γ,λ

v

)+

1− γλi

]. (104)

u>C−iu

n

a.s.−→ µ

viθi

(γ,λ

v

)+

1− γλi

]. (105)

u>C−iu

n

a.s.−→ v

viθi

(γ,λ

v

)+

1− γλi

]. (106)

Therefore, simple calculations give[u>C−1u

n−(u>C−1uuC−1u/n2

1 + u>C−1u/n

)]→ µ

v

(1− λ

vθ1

), a.s. (107)

Now, we only need to show that the expectation of the L.H.S. of (107) also converges to itspointwise limit.To prove this, we start with bounding the mean squared error of (104).

E∣∣∣∣u>C−1u

n−[γ

vθ1 +

1− γλ

]∣∣∣∣2 ≤ E∣∣∣∣u>C−1u

n− Eu

u>C−1u

n

∣∣∣∣2 + E∣∣∣∣Euu>C−1u

n−[γ

vθ1 +

1− γλ

]∣∣∣∣2=

1

n2EC

∑i=j=k=l

+∑

i=j 6=k=l

+∑

i=l 6=j=k+

∑i=k 6=j=k

C−1ij C

−1kl Euiujukul − tr(C−1)2

+ E

∣∣∣∣tr(C−1)

n−[γ

vθ1 +

1− γλ

]∣∣∣∣2 .This can be further bounded by

1

n2K1‖C−1‖2F + E

∣∣∣∣tr(C−1)

n−[γ

vθ1 +

1− γλ

]∣∣∣∣2 → 0,

where K1 is a constant independent of n. This follows from some simple calculations, andthe convergence is due to C−1 � λ−1 and the bounded convergence theorem.

Similarly, we can prove the same results for the other five terms corresponding to equa-tion (104), (105) and (106). With the mean squared error converging to zero, we are nowable to show that the expectation of the L.H.S. of (107) also converges to the pointwiseconstant limit.

For notational simplicity, we further denote a := u>C−1u/n, b := u>C−1u/n, c :=u>C−1u/n and define constants A := limd→∞ a,B := limd→∞ b. Now, for the second termin the L.H.S. of (107), we have

limd→∞E∣∣∣∣ a2

1 + b− A2

1 +B

∣∣∣∣ ≤ limd→∞

E∣∣∣∣aA− a1 + b

∣∣∣∣+

∣∣∣∣AA− a1 + b

∣∣∣∣+

∣∣∣∣ A2(B − b)(1 +B)(1 + b)

∣∣∣∣≤

√E(A− a)2E

∣∣∣∣ a

1 + b

∣∣∣∣2 +

√E(A− a)2E

∣∣∣∣ A

1 + b

∣∣∣∣2 +

√E(B − b)2E

∣∣∣∣ A2

(1 + b)(1 +B)

∣∣∣∣2≤√K2E(A− a)2 +

√K3E(A− a)2 +

√K4E(B − b)2 → 0, (108)

74

Page 75: What Causes the Test Error? Going Beyond Bias-Variance via

What Causes the Test Error?

where K2,K3,K4 are some constants independent of n. The existence of K2,K3,K4 is clear,and here we only take K2 as an example. Since

E∣∣∣∣ a

1 + b

∣∣∣∣2 ≤ Ea2 ≤ Ebc ≤ 1

λ2n2Eu>uu>u→ 2µ+ v

λ2,

we can choose K2 := supn Eu>uu>u/n2λ2 < ∞. (The second inequality can be proved bycomparing each term in the expression of a2 and bc and noting that C−1

ii C−1jj ≥ (C−1

ij )2

holds for any i, j.)Finally, noting limd→∞ Ea = limd→∞ Eµ tr(C−1)/n → A and using (108), we obtain

that the expectation of L.H.S. of (107) satisfies

limd→∞

p

dE(a− a2

1 + b

)= π

(A− A2

1 +B

)= π

µ

v

[1− λ

vθ1

(γ,λ

v

)].

This finishes the proof.

Lemma 28 (Asymptotic behavior of T4) Under assumptions in theorem 9, we have

limd→∞

1

dEW,X

∥∥∥∥Rσ(WX>)XW>

n

∥∥∥∥2

F

v

[1− 2(v − µ2)

v− 2λµ2

v2θ1+

λ2µ2

v3θ2 +

v − µ2

v

(1 + γθ1 −

λγ

vθ2

)], (109)

where θ1 := θ1(γ, λ/v), θ2 := θ2(γ, λ/v).

Proof [Proof of Lemma 28] Denote R =(σ(XW>)σ(WX>)

n + λ)−1

. By definition,

1

dEW,X

∥∥∥∥Rσ(WX>)XW>

n

∥∥∥∥2

F

=1

dEW,X

∥∥∥∥σ(WX>)√n

RXW>√

n

∥∥∥∥2

F

=1

ndEW,X tr

[WX>RXW>

]− λ

ndEW,X tr

[WX>R2XW>

]=: M1 − λM2.

Using the same notations and techniques as in the proof of Lemma 27, after some similarcalculations, we get

limd→∞

Mi = limd→∞

π

nEu>

(C−1 − C−1uu>C−1

n+ u>C−1u

)iu, i = 1, 2.

More specifically,

limd→∞

M1 = π limd→∞

E(u>C−1u

n− u>C−1uu>C−1u/n2

1 + u>C−1u/n

). (110)

limd→∞

M2 = π limd→∞

E(u>C−2u

n− 2

u>C−2uu>C−1u/n2

1 + u>C−1u/n+u>C−1uu>C−2uu>C−1u/n3

(1 + u>C−1u/n)2

).

(111)

75

Page 76: What Causes the Test Error? Going Beyond Bias-Variance via

Lin and Dobriban

Substituting equation (104), (105) and (106) into (110), (111), we can see that the randomvariables on the R.H.S. of (110), (111) almost surely converge to some constant. Using asimilar argument as in the proof of Lemma 27, it can be shown that the expectations onthe R.H.S. of equation (110), (111) both converge to their corresponding pointwise constantlimits.

Therefore, denoting the R.H.S. of (104) by ki(i = 1, 2) and replacing the R.H.S of (110),(111) by their pointwise constant limits, we obtain

limd→∞

M1 − λM2 = π

(k1 −

µ2k21

1 + vk1

)− λπ

[k2 −

2µ2k1k2

1 + vk1+

µ2vk21k2

(1 + vk1)2

]. (112)

From the remark after definition 1, it is readily verified that 1/(1 + vk1) = λθ1/v. Thus(112) is a polynomial function of θ1,2. Finally, canceling the high order (≥ 2) terms in (112)using the equations in the remark after definition 1, we have after some calculations that

limd→∞

M1 − λM2 =π

v

[1− 2(v − µ2)

v− 2λµ2

v2θ1+

λ2µ2

v3θ2 +

v − µ2

v

(1 + γθ1 −

λγ

vθ2

)].

Lemma 29 (Asymptotic behavior of T5) Under assumptions in theorem 9, we have

limd→∞

EW,X∥∥∥∥Rσ(WX>)

n

∥∥∥∥2

F

v

(θ1 −

λ

vθ2

), (113)

where θ1 := θ1(γ, λ/v), θ2 := θ2(γ, λ/v).

Proof [Proof of Lemma 29] By definition,

L.H.S. of (113) = EW,X tr

[R2σ(WX>)σ(XW>)

n2

]= EW,X tr [R]− λEW,X tr

[R2].

Note that σ(XW>) has i.i.d. zero mean v variance entries, by the Marchenko-Pasturtheorem and similar methods in the proof of theorem 3, we get

L.H.S. of (113) =γ

vθ1

(γ,λ

v

)− λγ

v2θ2

(γ,λ

v

).

References

Ben Adlam and Jeffrey Pennington. The neural tangent kernel in high dimensions: Tripledescent and a multi-scale theory of generalization. In International Conference on Ma-chine Learning, pages 74–84, 2020a.

76

Page 77: What Causes the Test Error? Going Beyond Bias-Variance via

What Causes the Test Error?

Ben Adlam and Jeffrey Pennington. Understanding double descent requires a fine-grainedbias-variance decomposition. In Advances in Neural Information Processing Systems,pages 11022–11032, 2020b.

Ben Adlam, Jake Levinson, and Jeffrey Pennington. A random matrix perspective onmixtures of nonlinearities for deep learning. arXiv preprint arXiv:1912.00827, 2019.

Madhu S Advani, Andrew M Saxe, and Haim Sompolinsky. High-dimensional dynamics ofgeneralization error in neural networks. Neural Networks, 132:428–446, 2020.

Greg W Anderson, Alice Guionnet, and Ofer Zeitouni. An Introduction to Random Matrices.Number 118. Cambridge University Press, 2010.

Jimmy Ba, Murat A. Erdogdu, Taiji Suzuki, Denny Wu, and Tianzong Zhang. Gener-alization of two-layer neural networks: An asymptotic viewpoint. In 8th InternationalConference on Learning Representations, ICLR, 2020.

Zhidong Bai. Convergence rate of expected spectral distributions of large random matrices.part ii. sample covariance matrices. The Annals of Probability, 21(2):649–672, 1993.

Zhidong Bai and Jack W Silverstein. Spectral analysis of large dimensional random matrices.Springer Series in Statistics. Springer, New York, 2nd edition, 2010.

David Barber, David Saad, and Peter Sollich. Finite-size effects and optimal test set size inlinear perceptrons. Journal of Physics A: Mathematical and General, 28(5):1325, 1995.

Peter L Bartlett, Philip M Long, Gabor Lugosi, and Alexander Tsigler. Benign overfitting inlinear regression. Proceedings of the National Academy of Sciences, 117(48):30063–30070,2020.

Mikhail Belkin, Daniel J Hsu, and Partha Mitra. Overfitting or perfect fitting? risk boundsfor classification and regression rules that interpolate. In Advances in Neural InformationProcessing Systems, pages 2300–2311, 2018.

Mikhail Belkin, Daniel Hsu, Siyuan Ma, and Soumik Mandal. Reconciling modern machine-learning practice and the classical bias–variance trade-off. Proceedings of the NationalAcademy of Sciences, 116(32):15849–15854, 2019.

Mikhail Belkin, Daniel Hsu, Siyuan Ma, and Soumik Mandal. Reply to loog et al.: Lookingbeyond the peaking phenomenon. Proceedings of the National Academy of Sciences, 117(20):10627–10627, 2020a.

Mikhail Belkin, Daniel Hsu, and Ji Xu. Two models of double descent for weak features.SIAM Journal on Mathematics of Data Science, 2(4):1167–1180, 2020b.

Lucas Benigni and Sandrine Peche. Eigenvalue distribution of nonlinear models of randommatrices. arXiv preprint arXiv:1904.03090, 2019.

Siegfried Bos and Manfred Opper. Dynamics of training. In Advances in Neural InformationProcessing Systems, pages 141–147, 1997.

77

Page 78: What Causes the Test Error? Going Beyond Bias-Variance via

Lin and Dobriban

Stephane Boucheron, Gabor Lugosi, and Pascal Massart. Concentration inequalities: Anonasymptotic theory of independence. Oxford University Press, 2013.

EP George Box, J Stuart Hunter, William Gordon Hunter, Roma Bins, Kay Kirlin IV, andDestiny Carroll. Statistics for experimenters: design, innovation, and discovery. WileyNew York, 2005.

Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, PrafullaDhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al.Language models are few-shot learners. In Advances in Neural Information ProcessingSystems, pages 1877–1901, 2020.

Lin Chen, Y. Min, M. Belkin, and Amin Karbasi. Multiple descent: Design your owngeneralization curve. arXiv preprint arXiv:2008.01036, 2020.

Shaobing Chen and David Donoho. Basis pursuit. In Proceedings of 1994 28th AsilomarConference on Signals, Systems and Computers, volume 1, pages 41–44. IEEE, 1994.

Romain Couillet and Merouane Debbah. Random Matrix Methods for Wireless Communi-cations. Cambridge University Press, 2011.

Romain Couillet, Jakob Hoydis, and Merouane Debbah. Random beamforming over quasi-static and fading channels: A deterministic equivalent approach. IEEE Transactions onInformation Theory, 58(10):6392–6425, 2012.

Stephane d’Ascoli, Maria Refinetti, Giulio Biroli, and Florent Krzakala. Double trouble indouble descent: Bias and variance (s) in the lazy regime. In International Conference onMachine Learning, pages 2280–2290, 2020.

AD Deev. Representation of statistics of discriminant analysis and asymptotic expansionwhen space dimensions are comparable with sample size. In Sov. Math. Dokl., volume 11,pages 1547–1550, 1970.

Zeyu Deng, Abla Kammoun, and Christos Thrampoulidis. A model of double descent forhigh-dimensional binary linear classification. arXiv preprint arXiv:1911.05822, 2019.

Michal Derezinski and Manfred KK Warmuth. The limits of squared euclidean distanceregularization. In Advances in Neural Information Processing Systems, pages 2807–2815,2014.

Micha l Derezinski, Feynman Liang, and Michael W Mahoney. Exact expressions for dou-ble descent and implicit regularization via surrogate random design. arXiv preprintarXiv:1912.04533, 2019.

Edgar Dobriban and Yue Sheng. Distributed linear regression by averaging. arXiv preprintarxiv:1810.00412, to appear in the Annals of Statistics, 2018.

Edgar Dobriban and Yue Sheng. Wonder: Weighted one-shot distributed ridge regressionin high dimensions. Journal of Machine Learning Research, 21(66):1–52, 2020.

Edgar Dobriban and Stefan Wager. High-dimensional asymptotics of prediction: Ridge regression and classification. The Annals of Statistics, 46(1):247–279, 2018.

Robert PW Duin. Small sample size generalization. In Proceedings of the Scandinavian Conference on Image Analysis, volume 2, pages 957–964, 1995.

Zhou Fan and Zhichao Wang. Spectra of the conjugate kernel and neural tangent kernel for linear-width neural networks. In Advances in Neural Information Processing Systems, pages 7710–7721, 2020.

Mario Geiger, Arthur Jacot, Stefano Spigler, Franck Gabriel, Levent Sagun, Stephane d'Ascoli, Giulio Biroli, Clement Hongler, and Matthieu Wyart. Scaling description of generalization with number of parameters in deep learning. Journal of Statistical Mechanics: Theory and Experiment, 2020(2):023401, 2020.

Stuart Geman, Elie Bienenstock, and Rene Doursat. Neural networks and the bias/variance dilemma. Neural Computation, 4(1):1–58, 1992.

Federica Gerace, Bruno Loureiro, Florent Krzakala, Marc Mezard, and Lenka Zdeborova. Generalisation error in learning with random features and the hidden manifold model. In International Conference on Machine Learning, pages 3452–3462, 2020.

Behrooz Ghorbani, Song Mei, Theodor Misiakiewicz, and Andrea Montanari. Linearized two-layers neural networks in high dimension. The Annals of Statistics, 49(2):1029–1054, 2021.

Sebastian Goldt, Marc Mezard, Florent Krzakala, and Lenka Zdeborova. Modelling the influence of data structure on learning in neural networks: the hidden manifold model. arXiv preprint arXiv:1909.11500, 2019.

Friedrich Gotze, Alexander Tikhomirov, et al. Rate of convergence in probability to the Marchenko-Pastur law. Bernoulli, 10(3):503–548, 2004.

Kam Hamidieh. A data-driven statistical model for predicting the critical temperature of a superconductor. Computational Materials Science, 154:346–354, 2018.

Lars Kai Hansen. Stochastic linear learning: Exact test and training error averages. Neural Networks, 6(3):393–396, 1993.

Trevor Hastie, Robert Tibshirani, and Jerome Friedman. The Elements of Statistical Learning. Springer series in statistics, 2009.

Trevor Hastie, Andrea Montanari, Saharon Rosset, and Ryan J Tibshirani. Surprises in high-dimensional ridgeless least squares interpolation. arXiv preprint arXiv:1903.08560, 2019.

JA Hertz, A Krogh, and GI Thorbergsson. Phase transitions in simple learning. Journal of Physics A: Mathematical and General, 22(12):2133, 1989.

Wei Hu, Lechao Xiao, and Jeffrey Pennington. Provable benefit of orthogonal initialization in optimizing deep linear networks. In International Conference on Learning Representations, 2020.

Il'dar Abdulovich Ibragimov and Rafail Zalmanovich Has'minskii. Statistical estimation: asymptotic theory, volume 16. Springer Science & Business Media, 2013.

Arthur Jacot, Berfin Simsek, Francesco Spadaro, Clement Hongler, and Franck Gabriel. Implicit regularization of random feature models. In International Conference on Machine Learning, pages 4631–4640, 2020.

Dmitry Kobak, Jonathan Lomond, and Benoit Sanchez. The optimal ridge penalty for real-world high-dimensional data can be zero or negative due to the implicit ridge regularization. Journal of Machine Learning Research, 21(169):1–16, 2020.

Nicole Kramer. On the peaking phenomenon of the lasso in model selection. arXiv preprint arXiv:0904.4416, 2009.

Yann Le Cun, Ido Kanter, and Sara A Solla. Eigenvalues of covariance matrices: Application to neural-network learning. Physical Review Letters, 66(18):2396, 1991.

Yann LeCun, Ido Kanter, and Sara A Solla. Second order properties of error surfaces: Learning time and generalization. In Advances in Neural Information Processing Systems, pages 918–924, 1991.

Z. Li, Chuanlong Xie, and Qinwen Wang. Provable more data hurt in high dimensional least squares estimator. arXiv preprint arXiv:2008.06296, 2020.

Tengyuan Liang and Alexander Rakhlin. Just interpolate: Kernel ridgeless regression can generalize. arXiv preprint arXiv:1808.00387, to appear in The Annals of Statistics, 2018.

Tengyuan Liang, Alexander Rakhlin, and Xiyu Zhai. On the multiple descent of minimum-norm interpolants and restricted lower isometry of kernels. In Conference on Learning Theory, pages 2683–2711, 2020.

Zhenyu Liao and Romain Couillet. On the spectrum of random features maps of high dimensional data. In International Conference on Machine Learning, 2018.

Zhenyu Liao and Romain Couillet. A large dimensional analysis of least squares support vector machines. IEEE Transactions on Signal Processing, 67(4):1065–1074, 2019.

Zhenyu Liao, Romain Couillet, and Michael W Mahoney. A random matrix analysis of random Fourier features: beyond the Gaussian kernel, a precise phase transition, and the corresponding double descent. In Advances in Neural Information Processing Systems, pages 13939–13950, 2020.

Sifan Liu and Edgar Dobriban. Ridge regression: Structure, cross-validation, and sketching. In International Conference on Learning Representations, 2020.

Marco Loog, Tom Viering, Alexander Mey, Jesse H Krijthe, and David MJ Tax. A brief prehistory of double descent. Proceedings of the National Academy of Sciences, 117(20):10625–10626, 2020.

Cosme Louart, Zhenyu Liao, Romain Couillet, et al. A random matrix approach to neural networks. The Annals of Applied Probability, 28(2):1190–1248, 2018.

Vladimir A Marchenko and Leonid A Pastur. Distribution of eigenvalues for some sets of random matrices. Mat. Sb., 114(4):507–536, 1967.

Song Mei and Andrea Montanari. The generalization error of random features regression: Precise asymptotics and double descent curve. arXiv preprint arXiv:1908.05355, 2019.

Robb J Muirhead. Aspects of Multivariate Statistical Theory, volume 197. John Wiley & Sons, 2009.

Vidya Muthukumar, Kailas Vodrahalli, Vignesh Subramanian, and Anant Sahai. Harmless interpolation of noisy data in regression. IEEE Journal on Selected Areas in Information Theory, 2020.

Preetum Nakkiran. More data can hurt for linear regression: Sample-wise double descent. arXiv preprint arXiv:1912.07242, 2019.

Preetum Nakkiran, Gal Kaplun, Yamini Bansal, Tristan Yang, Boaz Barak, and Ilya Sutskever. Deep double descent: Where bigger models and more data hurt. In International Conference on Learning Representations, 2020.

Preetum Nakkiran, Prayaag Venkat, Sham M. Kakade, and Tengyu Ma. Optimal regularization can mitigate double descent. In International Conference on Learning Representations, 2021.

Brady Neal, Sarthak Mittal, Aristide Baratin, Vinayak Tantia, Matthew Scicluna, Simon Lacoste-Julien, and Ioannis Mitliagkas. A modern take on the bias-variance tradeoff in neural networks. arXiv preprint arXiv:1810.08591, 2018.

M Opper, W Kinzel, J Kleinz, and R Nehl. On the ability of the optimal perceptron to generalise. Journal of Physics A: Mathematical and General, 23(11):L581, 1990.

Manfred Opper. Statistical mechanics of learning: Generalization. The Handbook of Brain Theory and Neural Networks, pages 922–925, 1995.

Manfred Opper. Learning to generalize. Frontiers of Life, 3(part 2):763–775, 2001.

Manfred Opper and Wolfgang Kinzel. Statistical mechanics of generalization. In Models of neural networks III, pages 151–209. Springer, 1996.

Art B. Owen. Monte Carlo theory, methods and examples. 2013.

Debashis Paul and Alexander Aue. Random matrix theory in statistics: A review. Journal of Statistical Planning and Inference, 150:1–29, 2014.

Jeffrey Pennington and Pratik Worah. Nonlinear random matrix theory for deep learning. In Advances in Neural Information Processing Systems, pages 2637–2646, 2017.

Haozhi Qi, Chong You, Xiaolong Wang, Yi Ma, and Jitendra Malik. Deep isometric learning for visual recognition. In International Conference on Machine Learning, 2020.

Sarunas Raudys. On determining training sample size of linear classifier. Comput. Systems (in Russian), 28:79–87, 1967.

Sarunas Raudys and Robert PW Duin. Expected classification error of the Fisher linear classifier with pseudo-inverse covariance matrix. Pattern Recognition Letters, 19(5-6):385–392, 1998.

Jason W Rocks and Pankaj Mehta. Memorizing without overfitting: Bias, variance, and interpolation in over-parameterized models. arXiv preprint arXiv:2010.13933, 2020.

Francisco Rubio and Xavier Mestre. Spectral convergence for a general class of random matrices. Statistics & Probability Letters, 81(5):592–602, 2011.

Vadim Ivanovich Serdobolskii. Discriminant analysis for a large number of variables. In Dokl. Akad. Nauk SSSR, volume 22, pages 314–319, 1980.

Jack W Silverstein. Strong convergence of the empirical distribution of eigenvalues of large dimensional random matrices. J. Multivariate Anal., 55(2):331–339, 1995.

Antonio M Tulino and Sergio Verdu. Random matrix theory and wireless communications. Communications and Information Theory, 1(1):1–182, 2004.

Denny Wu and Ji Xu. On the optimal weighted ℓ2 regularization in overparameterized linear regression. In Advances in Neural Information Processing Systems, pages 10112–10123, 2020.

Yuege Xie, Rachel Ward, Holger Rauhut, and Hung-Hsu Chou. Weighted optimization: better generalization by smoother interpolation. arXiv preprint arXiv:2006.08495, 2020.

Zitong Yang, Yaodong Yu, Chong You, Jacob Steinhardt, and Yi Ma. Rethinking bias-variance trade-off for generalization of neural networks. In International Conference on Machine Learning, 2020.

Jianfeng Yao, Zhidong Bai, and Shurong Zheng. Large Sample Covariance Matrices and High-Dimensional Data Analysis. Cambridge University Press, New York, 2015.

Sergey Zagoruyko and Nikos Komodakis. Wide residual networks. arXiv preprint arXiv:1605.07146, 2016.
