
Fractal Structure and Generalization Properties of Stochastic Optimization Algorithms

Alexander Camuto [email protected]
Department of Statistics, University of Oxford & Alan Turing Institute

George Deligiannidis [email protected]
Department of Statistics, University of Oxford & Alan Turing Institute

Murat A. Erdogdu [email protected]
Department of Computer Science and Department of Statistical Sciences at University of Toronto & Vector Institute

Mert Gürbüzbalaban* [email protected]
Department of Management Science and Information Systems, Rutgers Business School

Umut Şimşekli* [email protected]
INRIA - Département d'Informatique de l'École Normale Supérieure - PSL Research University

Lingjiong Zhu [email protected]
Department of Mathematics, Florida State University

The authors are listed in alphabetical order. * Corresponding authors.

Abstract

Understanding generalization in deep learning has been one of the major challenges in statistical learning theory over the last decade. While recent work has illustrated that the dataset and the training algorithm must be taken into account in order to obtain meaningful generalization bounds, it is still theoretically not clear which properties of the data and the algorithm determine the generalization performance. In this study, we approach this problem from a dynamical systems theory perspective and represent stochastic optimization algorithms as random iterated function systems (IFS). Well studied in the dynamical systems literature, under mild assumptions, such IFSs can be shown to be ergodic with an invariant measure that is often supported on sets with a fractal structure. As our main contribution, we prove that the generalization error of a stochastic optimization algorithm can be bounded based on the 'complexity' of the fractal structure that underlies its invariant measure. Leveraging results from dynamical systems theory, we show that the generalization error can be explicitly linked to the choice of the algorithm (e.g., stochastic gradient descent – SGD), algorithm hyperparameters (e.g., step-size, batch-size), and the geometry of the problem (e.g., Hessian of the loss). We further specialize our results to specific problems (e.g., linear/logistic regression, one hidden-layered neural networks) and algorithms (e.g., SGD and preconditioned variants), and obtain analytical estimates for our bound. For modern neural networks, we develop an efficient algorithm to compute the developed bound and support our theory with various experiments on neural networks.


1. Introduction

In statistical learning, many problems can be naturally formulated as a risk minimization problem

\min_{w \in \mathbb{R}^d} R(w) := \mathbb{E}_{z \sim \pi}[\ell(w, z)],    (1)

where z ∈ Z denotes a data sample coming from an unknown distribution π, and ℓ : R^d × Z → R_+ is the composition of a loss and a function from the hypothesis class parameterized by w ∈ R^d. Since the distribution π is unknown, one needs to rely on empirical risk minimization as a surrogate to (1),

\min_{w \in \mathbb{R}^d} R(w, S_n) := \frac{1}{n} \sum_{i=1}^{n} \ell(w, z_i),    (2)

where S_n := {z_1, . . . , z_n} denotes a training set of n points that are independently and identically distributed (i.i.d.) and sampled from π, and model training often amounts to using an optimization algorithm to solve the above problem.

Statistical learning theory is mainly interested in understanding the behavior of the generalization error, i.e., R(w, S_n) − R(w). While classical results suggest that models with a large number of parameters should suffer from poor generalization [SSBD14, AB09], modern neural networks challenge this classical wisdom: they can fit the training data perfectly, yet manage to generalize well [ZBH+17, NBMS17]. Considering that the generalization error is influenced by many factors involved in the training process, the conventional algorithm- and data-agnostic uniform bounds are typically overly pessimistic in a deep learning setting. In order to obtain meaningful, non-vacuous bounds, the underlying data distribution and the choice of the optimization algorithm need to be incorporated in the generalization bounds [ZBH+17, DR17].

Our goal in this study is to develop novel generalization bounds that explicitly incorporate the data and the optimization dynamics, through the lens of dynamical systems theory. To motivate our approach, let us consider stochastic gradient descent (SGD), which has been one of the most popular optimization algorithms for training neural networks. It is defined by the following recursion:

w_k = w_{k-1} - \eta \nabla R_k(w_{k-1}), \quad \text{where} \quad \nabla R_k(w) := \nabla R_{\Omega_k}(w) := \frac{1}{b} \sum_{i \in \Omega_k} \nabla \ell(w, z_i).

Here, k represents the iteration counter, η > 0 is the step-size (also called the learning-rate), ∇R_k is the stochastic gradient, b is the batch-size, and Ω_k ⊂ {1, 2, . . . , n} is a random subset drawn with or without replacement with cardinality |Ω_k| = b for all k.

Constant step-size SGD forms a Markov chain with a stationary distribution w_∞ ∼ µ, which exists and is unique under mild conditions [DDB20, YBVE20], and intuitively we can expect the generalization performance of the trained model to be intimately related to the behavior of the risk R(w) over this limit distribution µ. In particular, the Markov chain defined by the SGD recursion can be written by using random functions h_{Ω_k} at each SGD iteration k, i.e.,

w_k = h_{\Omega_k}(w_{k-1}), \quad \text{with} \quad h_{\Omega_k}(w) = w - \eta \nabla R_k(w).    (3)

Here, the randomness in h_{Ω_k} is due to the selection of the subset Ω_k. In fact, such a formulation is not specific to SGD; it can cover many other stochastic optimization algorithms if the random function h_{Ω_k} is selected accordingly, including second-order algorithms such as preconditioned SGD [Li17]. Such random recursions (3) and the characteristics of their stationary distributions have been studied extensively under the names of iterated random functions [DF99] and iterated function systems (IFS) [Fal04]. In this paper, from a high level, we relate the 'complexity of the stationary distribution' of a particular IFS to the generalization performance of the trained model.

We illustrate this setting in two toy examples. In the first one, we consider a 1-dimensional quadratic problem with n = 2, ℓ(w, z_1) = w²/2, and ℓ(w, z_2) = w²/2 − w. We run SGD with constant step-size η to minimize the resulting empirical risk.


Figure 1: Middle-third Cantor set as the support of the stationary distribution of constant step-size SGD for ℓ(w, z_1) = w²/2 and ℓ(w, z_2) = w²/2 − w; the panels show the stationary distribution for step-sizes η = 1/100, η = 1/3, and η = 2/3.

We simply choose Ω_k ⊂ {1, 2} uniformly at random with batch-size b = 1, and we plot the histograms of the stationary distributions for different step-size choices η ∈ {1/100, 1/3, 2/3} in Figure 1. We observe that the support of the stationary distribution of SGD depends on the step-size: as the step-size increases, the support becomes less dense and a fractal structure in the stationary distribution can be clearly observed. This behavior is not surprising, at least for this toy example. It is well-known that the set of points that is invariant under the resulting IFS (termed the attractor of the IFS) for the specific choice of η = 2/3 is the famous 'middle-third Cantor set' [FW09], which coincides with the support of the stationary distribution of SGD.
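As a concrete illustration, the following minimal Python sketch (an illustration assuming only NumPy, not the authors' code) simulates the two random SGD maps induced by these losses, h_1(w) = (1 − η)w and h_2(w) = (1 − η)w + η; for η = 2/3 these are exactly the contractions w ↦ w/3 and w ↦ w/3 + 2/3 that generate the middle-third Cantor set.

```python
import numpy as np

def sgd_ifs_samples(eta, n_iters=100_000, burn_in=1_000, seed=0):
    """Iterate the toy SGD recursion for l(w, z1) = w^2/2, l(w, z2) = w^2/2 - w, with b = 1."""
    rng = np.random.default_rng(seed)
    w, samples = 0.0, []
    for k in range(n_iters):
        if rng.integers(2) == 0:        # Omega_k = {1}: the gradient is w
            w = (1.0 - eta) * w
        else:                           # Omega_k = {2}: the gradient is w - 1
            w = (1.0 - eta) * w + eta
        if k >= burn_in:
            samples.append(w)
    return np.asarray(samples)

for eta in (1 / 100, 1 / 3, 2 / 3):
    s = sgd_ifs_samples(eta)
    occupied = np.unique(np.floor(s * 729).astype(int)).size / 729
    print(f"eta = {eta:.3f}: fraction of occupied 1/729-bins = {occupied:.2f}")
```

As the step-size grows, the fraction of occupied bins drops, mirroring the increasingly sparse, fractal support visible in Figure 1.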

As another example, we run SGD with constant step-size in order to train an ordinary linear regression model for a dataset of n = 5 samples and d = 2 dimensions, i.e., a_i^T w ≈ y_i, where for i = 1, . . . , 5, y_i and each coordinate of a_i are drawn uniformly at random from the interval [−1, 1]. Figure 2 shows the heatmap of the resulting stationary distributions for different step-size choices η ranging from 0.1 to 0.9 (bright colors represent higher density). We observe that for small step-size choices, the stationary distribution is dense, whereas a fractal structure can be clearly observed as the step-size gets larger.

Fractals are complex patterns and the level of this complexity is typically measured by the Hausdorff dimension of the fractal, which is a notion of dimension that can take fractional values (for instance, the Hausdorff dimension of the middle-third Cantor set in Figure 1 is log_3(2) ≈ 0.63, whereas the ambient dimension is 1 [Fal04, Example 2.3]) and can be much smaller than the ambient dimension d. Recently, assuming that SGD trajectories can be well-approximated by a certain type of stochastic differential equation (SDE), it was shown that the generalization error can be controlled by the Hausdorff dimension of the trajectories of the SDE, instead of their ambient dimension d [ŞSDE20]. That is, the ambient dimension that appears in classical learning theory bounds is replaced with the Hausdorff dimension. The fractal geometric approach presented in [ŞSDE20] can capture the 'low-dimensional structure' of fractal sets and provides an alternative perspective to the compression-based approaches that aim to understand why overparametrized networks do not overfit [AGNZ18, SAN20, SAM+20, HJTW21].

However, SDE approximations for SGD often serve as mere heuristics, and guaranteeing a good approximation typically requires unrealistically small step-sizes [LTE19]. For more realistic step-sizes, theoretical concerns have been raised about the validity of conventional SDE approximations for SGD [LMA21, GŞZ20, Yai19]. Another drawback of the SDE approximation is that the bounds in [ŞSDE20] are implicit, in the sense that they cannot be related to algorithm hyperparameters, problem geometry, or data. We address these issues and present a direct, discrete-time analysis by exploiting the connections between IFSs and stochastic optimization algorithms.

Figure 2: The stationary distribution of constant step-size SGD for linear regression, where we have n = 5 data points and w ∈ R², shown for step-sizes (a) η = 0.3, (b) η = 0.5, (c) η = 0.7, (d) η = 0.9.

Our contributions are summarized as follows:

• We extend [ŞSDE20] and show that the generalization error can be linked to the Hausdorff dimension of invariant measures (rather than the Hausdorff dimension of sets as in [ŞSDE20]). More precisely, under appropriate conditions, we establish a generalization bound for the stationary distribution of the IFS, w_∞ ∼ µ. That is, with probability at least 1 − 2ζ,

|R(w_\infty, S_n) - R(w_\infty)| \lesssim \sqrt{\frac{\dim_H \mu \, \log^2(n)}{n}} + \frac{\log(1/\zeta)}{n},    (4)

for n large enough, where dim_H µ is the (upper) Hausdorff dimension of the measure µ.

• By leveraging results from IFS theory, we further link dim_H µ to (i) the form of the recursion (e.g., h_{Ω_k} in (3)), (ii) algorithm hyperparameters (e.g., η, b), and (iii) problem geometry (e.g., the Hessian of R_k), through a single term, which encapsulates all these components and their interaction.

• We establish bounds on dim_H µ for SGD and preconditioned SGD algorithms, when they are used to solve various empirical risk minimization problems such as least squares, logistic regression, and support vector machines. In all cases, we explicitly link the generalization performance of the model to the hyperparameters of the underlying training algorithm.

• Finally, we numerically compute key quantities that appear in our generalization bounds, and show empirically that they have a statistically significant correlation with the generalization error.

Notation and preliminaries. Bd(x, r) ⊂ Rd denotes the closed ball centered around x ∈ Rd with radius r.

A function f : R^{d_1} → R^{d_2} is said to be (Fréchet) differentiable at x ∈ R^{d_1} if there exists a d_2 × d_1 matrix J_f(x) : R^{d_1} → R^{d_2} such that lim_{‖h‖→0} ‖f(x + h) − f(x) − J_f(x)h‖/‖h‖ = 0. The matrix J_f(x) is called the differential of f, also known as the Jacobian matrix at x, and the determinant of J_f(x) is called the Jacobian determinant [HS74]. For real-valued functions f, g, we define f(n) = ω(g(n)) if lim_{n→∞} |f(n)|/g(n) = ∞. For a set A, |A| denotes its cardinality.

2. Technical Background on Fractal Geometry

Fractal sets emerge in virtually all branches of science, and fractal-based techniques have been used in machine learning [SHTY13, MSS19, DSD20, ŞSDE20, AGZ21]. The inherent 'complexity' of a fractal set often plays an important role and it is typically measured by its fractal dimension, where several notions of dimension have been proposed [Fal04]. In this section, we briefly mention two important notions of fractal dimension, which will be used in our theoretical development.

Minkowski dimension of a set. The Minkowski dimension (also known as the box-counting dimension [Fal04]) is defined as follows. Let F ⊂ R^d be a set and, for δ > 0, let N_δ(F) denote a collection of closed balls of diameter at most δ that covers F using the smallest possible number of balls. Then the upper-Minkowski

4

Page 5: Fractal Structure and Generalization Properties of

dimension of F is defined as follows:

\dim_M F := \limsup_{\delta \to 0} \frac{\log |N_\delta(F)|}{\log(1/\delta)}.    (5)

To visualize the upper-Minkowski dimension of a set F, consider the set F lying on an evenly spaced grid and count how many grid boxes are required to cover the set. The upper-Minkowski dimension measures how this number changes as the grid is made finer, which is precisely what box-counting algorithms compute.
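To make this box-counting picture concrete, the following sketch (an illustration, not the authors' implementation) estimates the quantity in (5) for a finite point cloud by counting occupied grid boxes at several scales and fitting the slope of log |N_δ(F)| against log(1/δ); applied to a truncated middle-third Cantor set it returns a value close to log 2 / log 3 ≈ 0.63.

```python
import numpy as np
from itertools import product

def box_counting_dimension(points, deltas=(1/27, 1/81, 1/243, 1/729)):
    """Estimate the upper-Minkowski dimension of a point cloud in [0, 1]^d via box counting."""
    points = np.atleast_2d(points)
    log_counts, log_inv_deltas = [], []
    for delta in deltas:
        boxes = np.unique(np.floor(points / delta).astype(int), axis=0)   # occupied boxes
        log_counts.append(np.log(len(boxes)))
        log_inv_deltas.append(np.log(1.0 / delta))
    slope, _ = np.polyfit(log_inv_deltas, log_counts, 1)   # slope approximates dim_M F
    return slope

# Approximate the middle-third Cantor set by all level-8 ternary expansions with digits {0, 2}.
digits = np.array(list(product([0, 2], repeat=8)))
cantor = digits @ (1.0 / 3.0 ** np.arange(1, 9))
print(box_counting_dimension(cantor[:, None]))   # approximately log(2)/log(3) = 0.63
```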

Hausdorff dimension of a set. An alternative to the purely geometric Minkowski dimension, the Hausdorff dimension [Hau18] is a measure-theoretic notion of fractal dimension. It is based on the Hausdorff measure, which generalizes the traditional notions of area and volume to non-integer dimensions [Rog98]. More precisely, for s ≥ 0, F ⊂ R^d, and δ > 0, denote H^s_δ(F) := inf Σ_{i=1}^∞ diam(A_i)^s, where the infimum is taken over all the δ-coverings {A_i}_i of F, that is, F ⊂ ∪_i A_i with diam(A_i) < δ for every i. The s-dimensional Hausdorff measure of F is defined as the monotone limit H^s(F) := lim_{δ→0} H^s_δ(F). When s ∈ N, H^s is the s-dimensional Lebesgue measure up to a constant factor; hence the generalization of 'volume' to fractional orders.

Based on the Hausdorff measure, the Hausdorff dimension of a set F ⊂ R^d is then defined as follows:

\dim_H F := \sup \{ s > 0 : H^s(F) > 0 \} = \inf \{ s > 0 : H^s(F) < \infty \}.

In other words, the Hausdorff dimension of F is the 'moment' s at which H^s(F) drops from ∞ to 0, that is, H^r(F) = 0 for all r > dim_H F and H^r(F) = ∞ for all r < dim_H F.

We always have 0 ≤ dim_H F ≤ d, and when F is bounded, we always have 0 ≤ dim_H F ≤ dim_M F ≤ d [Fal04]. Furthermore, the Hausdorff dimension of R^d equals d, and the Hausdorff dimension of a smooth Riemannian manifold corresponds to its intrinsic dimension, e.g., dim_H S^{d−1} = d − 1, where S^{d−1} is the unit sphere in R^d.

Hausdorff dimension of a probability measure. IFSs generate invariant measures as the number of iterates goes to infinity, and random fractals arise from such invariant measures. There is a growing literature that studies the structure of such random fractals [Saz00, NSB02, MS02, Ram06, FST06, JR08], where the notion of fractal dimension can be extended to measures, and our theory will rely on the Hausdorff dimension of invariant measures associated with stochastic optimization algorithms. In particular, we will mainly use the upper Hausdorff dimension dim_H µ of a Borel probability measure µ on R^d, which is defined as follows: dim_H µ := inf{dim_H A : µ(A) = 1}. In other words, dim_H µ is the smallest Hausdorff dimension of all measurable sets with full measure.

3. Generalization Bounds for Stochastic Optimization Algorithms as IFSs

In this section, we will present our main theoretical results, which relate the generalization error to the upper Hausdorff dimension of the invariant measure associated with a stochastic optimization algorithm. We consider a standard supervised learning setting, where Z = X × Y, with X the space of features and Y the space of labels, and π is the unknown data distribution on Z.

For mathematical convenience, in order to construct the training set with n elements, we first consider an infinite sequence of i.i.d. data samples from the data distribution π, then we take the first n elements from this infinite sequence. More precisely, we consider the (countable) product measure π^∞ = (π ⊗ π ⊗ . . . ) defined on the cylindrical sigma-algebra. Then, we consider the infinite i.i.d. data sequence S ∼ π^∞, i.e., S = (z_j)_{j≥1} with z_j ∼ π i.i.d. for all j = 1, 2, . . . . Finally, we define the training set as S_n := (z_1, . . . , z_n), i.e., we take the first n elements of S. To avoid technical complications, throughout the paper we will assume that all the encountered functions and sets are measurable. All the proofs are given in the supplement.

Given a dataset S_n, we represent the training algorithm as an IFS, which is based on the following recursion: w_k = h_{Ω_k}(w_{k−1}; S_n), where the mini-batch Ω_k is sampled i.i.d. according to some distribution (e.g., sampling


without replacement uniformly among all possible mini-batches). This compact representation enables us to cover a broad range of optimization algorithms with a unified notation, including SGD (see (3)), as well as preconditioned SGD and stochastic Newton methods. For example, if we take h_{Ω_k}(w; S_n) = w − η H_k(w)^{−1} ∇R_k(w), where H_k(w) is an estimate of the Hessian of R_k, we cover stochastic Newton methods [EM15]. Similar constructions can be made for other popular algorithms, such as SGD-momentum [Qia99], RMSProp [HSS12], or Adam [KB15].
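The sketch below (a hypothetical example, not the authors' code) illustrates this unified notation for a synthetic ridge-regression problem: it defines the mini-batch gradient, the SGD map from (3), and a preconditioned variant with a fixed matrix, and then iterates the random recursion w_k = h_{Ω_k}(w_{k−1}; S_n).

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, b, eta, lam = 32, 2, 4, 0.3, 0.1
A, y = rng.uniform(-1, 1, (n, d)), rng.uniform(-1, 1, n)
batches = np.split(rng.permutation(n), n // b)        # m_b = n/b disjoint mini-batches

def grad_R(w, S):                                     # mini-batch gradient of a ridge loss
    return A[S].T @ (A[S] @ w - y[S]) / len(S) + lam * w

def h_sgd(w, S):                                      # SGD map, as in (3)
    return w - eta * grad_R(w, S)

def h_precond(w, S, H_inv=np.eye(d)):                 # preconditioned variant (identity placeholder)
    return w - eta * H_inv @ grad_R(w, S)

w = np.zeros(d)
for _ in range(2000):                                 # w_k = h_{Omega_k}(w_{k-1}; S_n)
    w = h_sgd(w, batches[rng.integers(len(batches))])
print(w)
```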

Notice that there are only finitely many values that Ω_k can take. For example, in the case of without-replacement mini-batch sampling with batch-size b, there are in total m_b = \binom{n}{b} subsets of {1, 2, . . . , n} with cardinality b. Alternatively, a more realistic setup would be to divide the dataset into m_b = n/b batches, each having b elements, and at each iteration k to randomly choose one of the batches. In both examples we can enumerate the mini-batches as S_1, S_2, . . . , S_{m_b}. If the probability of sampling the mini-batch Ω_k = S_i is p_i for every i, then, with a slight abuse of notation, we can rewrite the IFS recursion as:

w_k = h_{U_k}(w_{k-1}; S_n),    (6)

where U_k is a random variable taking values in {1, 2, . . . , m_b} and p_i := P(U_k = i). If the mini-batch sampling is uniform (i.e., the default option in practice), we have p_i = 1/m_b; however, we are not restricted to this option, and the sampling scheme is allowed to be more general. We finally call the triple (R^d, {h_i(·; S_n)}_{i=1}^{m_b}, {p_i}_{i=1}^{m_b}) an iterated function system (IFS).

Given a dataset S_n, we are interested in the limiting behavior of the training algorithm (6). We characterize this behavior by considering the invariant measure µ_{W|S_n} of the IFS (also called the stationary distribution), that is, a Borel probability measure on R^d such that w_∞ ∼ µ_{W|S_n}. To be able to work in this context, we first need to ensure that the recursion (6) admits an invariant measure, i.e., that µ_{W|S_n} exists. Accordingly, we require the following mild conditions on the IFS (6). Let U be a random variable with the same distribution as U_k. If the recursion (6) is Lipschitz on average, i.e.,

\mathbb{E}[L_U \mid S_n] < \infty, \quad \text{with} \quad L_U := \sup_{x \neq y \in \mathbb{R}^d} \frac{\|h_U(x; S_n) - h_U(y; S_n)\|}{\|x - y\|},    (7)

and is contractive on average, i.e., if

\mathbb{E}[\log(L_U) \mid S_n] = \sum_{i=1}^{m_b} p_i \log(L_i) < 0, \quad \text{with } p_i > 0 \text{ for any } i = 1, \dots, m_b,    (8)

then it can be shown that the process is ergodic and admits a unique invariant measure, where the limit

\rho := \lim_{k \to \infty} \frac{1}{k} \log \left\| h_{U_k} h_{U_{k-1}} \cdots h_{U_1} \right\|

exists almost surely and is a constant [Elt90]; ρ is called the Lyapunov exponent. Furthermore, under further technical assumptions, it can be shown that (6) is geometrically ergodic [DF99].
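As a quick worked instance of (7)–(8), consider the toy quadratic example of Figure 1: the two SGD maps are affine, h_1(w) = (1 − η)w and h_2(w) = (1 − η)w + η, so that L_1 = L_2 = |1 − η| and, with p_1 = p_2 = 1/2,

\mathbb{E}[\log(L_U) \mid S_n] = \tfrac{1}{2}\log|1-\eta| + \tfrac{1}{2}\log|1-\eta| = \log|1-\eta| < 0 \quad \text{for all } \eta \in (0, 2),

so the recursion is contractive on average and admits a unique invariant measure; for η = 2/3 this measure is supported on the middle-third Cantor set, as observed in Figure 1.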

Our goal will be to relate the generalization error to dim_H µ_{W|S_n}. To achieve this goal, at first sight, it might seem tempting to extract a full-measure set A by using the definition of µ_{W|S_n}, such that µ_{W|S_n}(A) = 1 and dim_H A ≈ dim_H µ_{W|S_n}, and then directly invoke the results from [ŞSDE20], which would link the generalization error to dim_H A and hence also to dim_H µ_{W|S_n}. However, since [ŞSDE20] does not consider an IFS framework, the conditions they require (e.g., boundedness of A, dim_M A = dim_H A) are not suited to IFSs, and hence prevent us from directly using their results.

As a remedy, we make a detour and show that we can find almost-full-measure sets A such that µ_{W|S_n}(A) ≈ 1 and dim_M A ≈ dim_H µ_{W|S_n} (notice that in this case we directly use the Minkowski dimension of A, as opposed to its Hausdorff dimension). To achieve this goal, we require the following geometric regularity condition on the invariant measure.

H 1. For π^∞-almost all S and all n ∈ N_+, the recursion (6) satisfies (7) and (8), and the limit

\lim_{r \to 0} \frac{\log \mu_{W|S_n}(B^d(w, r))}{\log r}

exists for µ_{W|S_n}-almost every w.

This is a common condition [Pes08] and is satisfied for a large class of measures. For instance, 'sufficiently regular' measures with the property that C_1 r^s ≤ µ(B(x, r)) ≤ C_2 r^s for some constant s > 0 and positive constants C_1, C_2 will satisfy this assumption. Such measures are called Ahlfors-regular (cf. [ŞSDE20, Assumption H4] for a related condition), and it is known that IFSs that satisfy certain 'open set conditions' lead to Ahlfors-regular invariant measures (see [MT10, Section 8.3]). Yet, our assumption is more general and does not immediately require Ahlfors-regularity. Under H 1, we now formalize our key observation, which serves as the basis for our bounds.

Proposition 1. Assume that H 1 holds. Then for every ε > 0, n ∈ N_+, and π^∞-almost every S, there exist δ := δ(ε, S_n) ∈ (0, 1] and a bounded measurable set A_{S_n,δ} ⊂ R^d, such that

\mu_{W|S_n}(A_{S_n,\delta}) \geq 1 - \delta, \quad \text{and} \quad \dim_M A_{S_n,\delta} \leq \dim_H \mu_{W|S_n} + \varepsilon,    (9)

and δ(ε, S_n) → 0 as ε → 0.

Thanks to this result, we can now leverage the proof technique presented in [ŞSDE20, Theorem 2] and link the generalization error to dim_H µ_{W|S_n} through dim_M A_{S_n,δ}. We emphasize that, mainly due to the sets A_{S_n,δ} not being of full measure, our framework introduces additional non-trivial technical difficulties that we need to tackle in our proof.

We now introduce our second assumption, which roughly corresponds to a 'topological stability' condition and is adapted from [ŞSDE20, Assumption H5]. Formally, consider the (countably infinite) collection of closed balls of radius β whose centers are on the fixed grid

N_\beta := \left\{ \left( \frac{(2j_1 + 1)\beta}{2\sqrt{d}}, \dots, \frac{(2j_d + 1)\beta}{2\sqrt{d}} \right) : j_i \in \mathbb{Z},\ i = 1, \dots, d \right\},

and for a set A ⊂ R^d, define N_β(A) := {x ∈ N_β : B^d(x, β) ∩ A ≠ ∅}, which is the collection of the centers of the balls that intersect A.

H 2. Let Z^∞ := (Z × Z × · · · ) denote the countable product endowed with the product topology and let B be the Borel σ-algebra generated by Z^∞. For a Borel set A ⊂ R^d, let F, G be the sub-σ-algebras of B generated by the collections of random variables given by {R(w, S_n) : w ∈ R^d, n ≥ 1} and {1{w ∈ N_β(A)} : β ∈ Q_{>0}, w ∈ N_β, n ≥ 1}, respectively. There exists a constant M ≥ 1 such that for any F ∈ F, G ∈ G, we have P(F ∩ G) ≤ M P(F) P(G) for all Borel sets A.

H 2 simply ensures that the dependence between the training error and the topological properties of the support of µ_{W|S_n} can be controlled via M. Hence, it can be seen as a form of algorithmic stability [BE02], where M measures the level of stability of the topology of µ_{W|S_n}: a small M indicates that the geometrical structure of µ_{W|S_n} does not heavily depend on the particular value of S_n. The constant M is also related to the mutual information [XR17, AAV18], but may be better behaved than the mutual information as it relies on very specific functions of the random variables.

We require one final assumption, which states that the loss ℓ is Lipschitz and sub-exponential.

H 3. ℓ is L-Lipschitz continuous in w, and when z ∼ π, for all w, ℓ(w, z) is (ν, κ)-sub-exponential, that is, for all |λ| < 1/κ, we have log E_{z∼π}[exp(λ(ℓ(w, z) − R(w)))] ≤ ν²λ²/κ.

Armed with these assumptions, we can now present our main result.

Theorem 1. Assume that H 1 to H 3 hold and dim_H µ_{W|S_n} = ω(log log(n)/log(n)), π^∞-almost surely. Then, the following bound holds for sufficiently large n:

|R(W, S_n) - R(W)| \leq 8\nu \sqrt{\frac{\dim_H \mu_{W|S_n} \log^2(nL^2)}{n}} + \frac{\log(13M/\zeta)}{n},    (10)

with probability at least 1 − 2ζ over the joint distribution of S ∼ π^∞, W ∼ µ_{W|S_n}.


This theorem shows that the Hausdorff dimension of the invariant measure acts as a 'capacity metric' and the generalization error is therefore directly linked to this metric, i.e., the complexity of the underlying fractal structure has close links to the generalization performance. On the other hand, the condition dim_H µ_{W|S_n} = ω(log log(n)/log(n)) is very mild and makes sure that the dimension of the IFS does not decrease very rapidly with an increasing number of data points n. Theorem 1 enables us to access the rich theory of IFSs, where bounds on the Hausdorff dimension are readily available, and connect them to statistical learning theory. The following result is a direct corollary of Theorem 1 and [Ram06, Theorem 2.1] (see Theorem 2 in the Appendix).

Corollary 1. Assume that the conditions of Theorem 1 hold. Furthermore, consider the recursion (6) and assume that the h_i are continuously differentiable with derivatives h'_i that are α-Hölder continuous for some α > 0. Then, there exists a constant M > 1 such that for sufficiently large n:

|R(W, S_n) - R(W)| \leq 8\nu \sqrt{\frac{\mathrm{Ent} \, \log^2(nL^2)}{\left[ \sum_{i=1}^{m_b} p_i \int_{\mathbb{R}^d} \log\left(\|J h_i(w)\|\right) \, d\mu_{W|S_n}(w) \right] n}} + \frac{\log(13M/\zeta)}{n},    (11)

with probability 1 − 2ζ over S ∼ π^∞, W ∼ µ_{W|S_n}, where Ent := Σ_{i=1}^{m_b} p_i log(p_i) denotes the negative entropy of the mini-batch sampling scheme, ‖·‖ denotes the operator norm, and J h_i is the Jacobian of h_i defined in the notation section.

By this result, we discover an interesting quantity, Σ_{i=1}^{m_b} p_i ∫_{R^d} log(‖J h_i(w)‖) dµ_{W|S_n}(w), which simultaneously captures the effects of the data and the algorithm. To see it more clearly, let us consider the SGD recursion (3), where ‖J h_i(w)‖ = ‖I − η∇²R_{S_i}(w)‖ and {S_i}_{i=1}^{m_b} denotes the enumeration of the mini-batches. Then, the overall quantity becomes

\mathbb{E}_{U,W}\left[ \log \|I - \eta \nabla^2 R_{S_U}(W)\| \right],    (12)

where the expectation is taken over the mini-batch index U ∈ {1, . . . , m_b} with P(U = i) = p_i, and W ∼ µ_{W|S_n}. We clearly observe that this term depends on (i) the algorithm choice through the form of h_i, (ii) the step-size η, (iii) the batch-size through m_b, (iv) the problem geometry through ∇²R, and (v) the data distribution through µ_{W|S_n}. We believe that such a compact representation of all these constituents and their interaction is novel and will open up interesting future directions.
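For quadratic objectives the quantity (12) can be computed directly, since the mini-batch Hessians do not depend on w. The sketch below (illustrative, not the authors' code) does this for SGD on a synthetic ridge-regression problem, where ∇²R_{S_i}(w) = (1/b) Σ_{j∈S_i} a_j a_j^T + λI, by averaging log‖I − η∇²R_{S_i}‖ over the mini-batches.

```python
import numpy as np

rng = np.random.default_rng(1)
n, d, b, eta, lam = 64, 5, 8, 0.05, 0.1
A = rng.uniform(-1, 1, (n, d))
batches = np.split(rng.permutation(n), n // b)

def log_jacobian_norm(S):
    H = A[S].T @ A[S] / len(S) + lam * np.eye(d)      # mini-batch Hessian of the ridge loss
    J = np.eye(d) - eta * H                           # Jacobian of h_i(w) = w - eta * grad R_i(w)
    return np.log(np.linalg.norm(J, 2))               # log of the operator norm

print("estimate of (12):", np.mean([log_jacobian_norm(S) for S in batches]))   # negative here
```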

4. Analytical Estimates for the Hausdorff Dimension

The generalization bound presented in Theorem 1 applies to a number of stochastic optimization algorithms that can be represented with an IFS and to a large class of losses that can be convex or non-convex. It is controlled by the Hausdorff dimension of the invariant measure µ_{W|S_n}, which needs to be estimated. In the numerical experiments section, we will discuss how this quantity can be estimated from the dataset S_n and the iterates of the underlying algorithm.

Corollary 1 shows that for smooth losses, the Hausdorff dimension can be controlled with the expectation of the logarithm of the norm of the Jacobian, log(‖J h_i(w)‖), with respect to the invariant measure µ_{W|S_n}. In general, an explicit characterization of the invariant measure is not known. Nevertheless, under additional appropriate assumptions that can hold in practice, such as boundedness of the data or of the loss, we next show that it is possible to obtain uniform lower and upper bounds on the quantity ‖J h_i(w)‖, which lead to analytical upper bounds on dim_H µ_{W|S_n}.

As illustrative examples, in the following we consider the setting where we divide S_n into m_b = n/b batches with each batch having b elements, and we discuss how analytical estimates on the (upper) Hausdorff dimension dim_H µ_{W|S_n} can be obtained for some particular problems, including least squares, regularized logistic regression, and one hidden-layer networks. In the Appendix, we also discuss how similar bounds can be obtained for support vector machines and for other algorithms such as preconditioned SGD and stochastic Newton methods.


Least squares. We consider the least squares problem, with data points zi = (ai, yi) and loss

\ell(w, z_i) := \left( a_i^T w - y_i \right)^2 / 2 + \lambda \|w\|^2 / 2,    (13)

where λ > 0 is a regularization parameter.

Proposition 2 (Least squares). Consider the least squares problem (13). Assume the step-size η ∈ (0, 1/(R² + λ)), where R := max_i ‖a_i‖ is finite. Then, we have the following upper bound:

\dim_H \mu_{W|S_n} \leq \frac{\log(n/b)}{\log(1/(1 - \eta\lambda))}.    (14)

Note that here ℓ is only pseudo-Lipschitz, i.e., |ℓ(w, z) − ℓ(w', z)| ≤ L(1 + ‖w‖ + ‖w'‖)‖w − w'‖, rather than globally Lipschitz. However, the conditions in Proposition 2 ensure that w will stay in a bounded region, in which case ℓ becomes Lipschitz. Also note that only the logarithm of the Lipschitz constant directly enters the bound.

We observe that, for fixed n, the upper bound for dim_H µ_{W|S_n} is decreasing both in η and in b. This behavior is not surprising: a large η results in chaotic behavior (cf. Figures 1 and 2), and in the extreme case where b = n, the algorithm becomes deterministic and hence converges to a single point, in which case the Hausdorff dimension becomes 0. However, the decrease due to b does not automatically grant good generalization performance: since the algorithm becomes deterministic, the stability constant M in H 2 can get arbitrarily large, hence the bound in Theorem 1 could become vacuous. This outcome reveals an interesting tradeoff between the Hausdorff dimension and the constant M, through the batch-size b, and investigating this tradeoff is an interesting future direction.

We further notice that the numerator in (14) is increasing with n, which suggests that the batch-size should be taken in proportion to n (i.e., setting m_b to a constant value) in order to retain control over dim_H µ_{W|S_n}. Finally, regarding the remaining bounds in this section, even though their forms might differ from (14), similar remarks also apply; hence, we omit repeating the discussion.
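To illustrate this monotonicity numerically, the short sketch below evaluates the right-hand side of (14) on a grid of step-sizes and batch-sizes; the constants n and λ are hypothetical and only meant to be illustrative.

```python
import numpy as np

def dim_bound(n, b, eta, lam):
    """Right-hand side of (14): log(n/b) / log(1/(1 - eta*lam))."""
    return np.log(n / b) / np.log(1.0 / (1.0 - eta * lam))

n, lam = 10_000, 0.1
for eta in (0.01, 0.05, 0.1):
    for b in (1, 10, 100):
        print(f"eta={eta:.2f}  b={b:4d}  bound={dim_bound(n, b, eta, lam):10.1f}")
# The printed values decrease in both eta and b, matching the discussion above.
```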

Regularized logistic regression. Given the data points z_i = (a_i, y_i), consider regularized logistic regression with the loss:

\ell(w, z_i) := \log\left( 1 + \exp\left( -y_i a_i^T w \right) \right) + \lambda \|w\|^2 / 2,    (15)

where λ > 0 is a regularization parameter. We have the following result.

Proposition 3 (Regularized logistic regression). Consider the regularized logistic regression problem (15). Assume the step-size η < 1/λ and that the input data is bounded, i.e., R := max_i ‖a_i‖ < 2√λ. We have:

\dim_H \mu_{W|S_n} \leq \frac{\log(n/b)}{\log\left(1/\left(1 - \eta\lambda + \tfrac{1}{4}\eta R^2\right)\right)}.    (16)

Next, we consider a non-convex formulation for logistic regression, with zi = (ai, yi) and the loss

\ell(w, z_i) := \rho\left( y_i - \langle w, a_i \rangle \right) + \lambda_r \|w\|^2 / 2,    (17)

where λ_r > 0 is a regularization parameter and ρ is a non-convex function; standard choices are Tukey's bisquare loss, defined as ρ_Tukey(t) = 1 − (1 − (t/t_0)²)³ for |t| ≤ t_0 and ρ_Tukey(t) = 1 for |t| ≥ t_0, and the exponential squared loss ρ_exp(t) = 1 − e^{−|t|²/t_0}, where t_0 > 0 is a tuning parameter.

Proposition 4 (Non-convex formulation for logistic regression). Consider the non-convex formulation for logistic regression (17). Assume the step-size η < 1/(λ_r + 2R²/t_0), where R = max_i ‖a_i‖ < √(λ_r t_0 / 2). Then, we have the following upper bound for the Hausdorff dimension:

\dim_H \mu_{W|S_n} \leq \frac{\log(n/b)}{\log\left(1/\left(1 - \eta\lambda_r + 2\eta R^2 / t_0\right)\right)}.    (18)


Figure 3: Estimates of R plotted against the generalization error (|training loss − test loss|) for VGG11 and FCNs trained on CIFAR10, SVHN, and BHP with varying η, b. The linear regression line of best fit is plotted in red, where the shading corresponds to the 95% confidence interval. For all plots, the one-sided p-value, testing the null hypothesis that the slope of the line is negative rather than positive, is significantly less than 0.001, indicating that it is highly likely that R and the generalization error are positively correlated.

One hidden-layer neural network. Given the data points z_i = (a_i, y_i), let a_i ∈ R^d be the input and y_i ∈ R be the corresponding output, let w_r ∈ R^d be the weights of the r-th hidden neuron of a one hidden-layer network, and let b_r ∈ R be the output weight of hidden unit r. For simplicity of the presentation, following [ZMG19, DZPS19], we only optimize the weights of the hidden layer, i.e., w = [w_1^T w_2^T . . . w_m^T]^T is the decision variable, with the regularized squared loss:

\ell(w, z_i) := \|y_i - \hat{y}_i\|^2 + \lambda \|w\|^2 / 2, \qquad \hat{y}_i := \sum_{r=1}^{m} b_r \sigma\left(w_r^T a_i\right),    (19)

where the non-linearity σ : R→ R is smooth and λ > 0 is a regularization parameter.

Proposition 5 (One hidden-layer network). Consider the one hidden-layer network (19). Assume the step-size η < 1/(2λ). Then, we have the following upper bound for the Hausdorff dimension:

\dim_H \mu_{W|S_n} \leq \frac{\log(n/b)}{\log(1/(1 - \eta(\lambda - C)))},    (20)

where C := M_y ‖b‖_∞ ‖σ''‖ R² + (max_j ‖v_j‖_∞)² < λ, with R := max_i ‖a_i‖, M_y := max_i ‖y_i − ŷ_i‖, and v_i := [b_1 σ'(w_1^T a_i) a_i, b_2 σ'(w_2^T a_i) a_i, · · · , b_m σ'(w_m^T a_i) a_i]^T.

5. Experiments

Our aim now is to empirically demonstrate that the bound in Corollary 1 is informative, in that it is predictive of a neural network's generalization error. As the second term of this bound cannot be evaluated, we focus our efforts on the first term. Further, because the denominator of the first term is the only part that depends on the invariant measure, we want to establish that the inverse of this denominator is predictive of a neural network's generalization error. Note, however, that for complex models, such as modern neural networks, analytically bounding ‖J h_i(w)‖ becomes highly non-trivial.

In our experiments, we fix the algorithm to SGD and we develop a numerical method for computing the term (12). Noting that J h_i(w) = I − η∇²R_i(w), for simplicity we denote the inverse of (12) as the 'complexity':

R = 1 \Big/ \left[ \sum_{i=1}^{m_b} p_i \int_{\mathbb{R}^d} \log\left( \|J h_i(w)\| \right) d\mu_{W|S_n}(w) \right].


Figure 4: Estimated R plotted against η for a 2-layer FCN trained on BHP.

To approximate the expectations, we propose the following simple Monte Carlo strategy:

R^{-1} \approx \frac{1}{N_W N_U} \sum_{i=1}^{N_W} \sum_{j=1}^{N_U} \log\left( \|J h_{U_j}(W_i)\| \right),    (21)

where the U_j denote i.i.d. random mini-batch indices that are drawn without replacement from {1, . . . , n} and W_i ∼ µ_{W|S_n} i.i.d. Assuming ergodicity, which holds under (7) and (8) [DF99], we treat the iterates w_k as i.i.d. samples from µ_{W|S_n} for large k; hence, log(‖J h_{U_j}(W_i)‖) can be computed on these iterates, and (21) can be computed accordingly. Our implementation for computing ‖J h_{U_j}(W_i)‖ for neural networks with millions of parameters is detailed in the Appendix. Though the size of J h_i is very large in our experiments (≈ 20M × 20M on average), our algorithm can efficiently compute the norms without constructing J h_{U_j}(W_i), by extending the approach presented in [YGKM20].
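A possible organization of this matrix-free computation is sketched below (an illustration under the stated assumptions, not the authors' exact implementation): given a PyTorch loss evaluated on a mini-batch and the corresponding parameters, it estimates ‖Jh_i(w)‖ = ‖I − η∇²R_i(w)‖ with Hessian-vector products and power iteration, in the spirit of [YGKM20]; averaging the logarithm of such estimates over mini-batches U_j and late iterates W_i then yields (21).

```python
import torch

def operator_norm_estimate(loss, params, eta, n_power_iters=20):
    """Estimate ||I - eta * Hessian(loss)|| without ever forming the Hessian."""
    params = [p for p in params if p.requires_grad]
    grads = torch.autograd.grad(loss, params, create_graph=True)
    v = [torch.randn_like(p) for p in params]
    for _ in range(n_power_iters):
        gv = sum((g * vi).sum() for g, vi in zip(grads, v))          # <grad, v>
        Hv = torch.autograd.grad(gv, params, retain_graph=True)      # Hessian-vector product
        Jv = [vi - eta * hvi for vi, hvi in zip(v, Hv)]              # apply J = I - eta * H
        norm = torch.sqrt(sum((x ** 2).sum() for x in Jv))
        v = [x / (norm + 1e-12) for x in Jv]                         # renormalize
    return norm.item()   # after convergence, ||J v|| with ||v|| = 1 approximates ||J||

# Tiny usage example on a toy model (for illustration only).
model = torch.nn.Linear(3, 1)
x, t = torch.randn(8, 3), torch.randn(8, 1)
loss = torch.nn.functional.mse_loss(model(x), t)
print(operator_norm_estimate(loss, list(model.parameters()), eta=0.1))
```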

In Figure 3 we plot the estimates of R for a variety of convolutional (CONV) and fully connected network (FCN) architectures trained on CIFAR10, SVHN, and Boston House Prices (BHP). For the full details of the models, the hardware used, their run-time, and the convergence criterion used, see Section F in the supplement. The plot demonstrates that R and the generalization error are positively correlated and that this correlation is significant (p-value well below 0.001) for all model architectures. This provides evidence that the bound on the generalization error in Corollary 1 is informative.

To support our findings in Section 4 that the bound for the Hausdorff dimension dim_H µ_{W|S_n} is monotonically decreasing in the step-size η, we plot R against η in Figure 4 for the networks trained on BHP in Figure 3. R decreases with increasing η, clearly backing our findings. We note that these results were inconclusive for classification models trained with a cross-entropy loss, in that we could not clearly observe a negative or positive correlation. Future work will further study this lack of correlation, which is particular to classification models.

6. Conclusion

In this work, we investigated stochastic optimization algorithms through the lens of IFSs and studied their generalization properties. Under some assumptions, we showed that the generalization error can be controlled based on the Hausdorff dimension of the invariant measure determined by the iterations, which can lead to tighter bounds than the standard bounds based on the ambient dimension. We proposed an efficient methodology to estimate the Hausdorff dimension in deep learning settings and supported our theory with several experiments on neural networks. We also illustrated our bounds on specific problems and algorithms such as SGD and its preconditioned variants, which unveil new links between generalization, algorithm parameters, and the Hessian of the loss.


Our study does not have a direct societal impact as it is largely theoretical. The main limitation of our study is its asymptotic nature, due to operating on invariant measures. Future work will address obtaining non-asymptotic bounds in terms of the number of iterations k.

Acknowledgments

M.G.'s research is supported in part by the grants Office of Naval Research Award Number N00014-21-1-2244, National Science Foundation (NSF) CCF-1814888, NSF DMS-2053485, and NSF DMS-1723085. U.Ş.'s research is supported by the French government under the management of Agence Nationale de la Recherche as part of the "Investissements d'avenir" program, reference ANR-19-P3IA-0001 (PRAIRIE 3IA Institute). L.Z. is grateful for the support from a Simons Foundation Collaboration Grant and the grant NSF DMS-2053454 from the National Science Foundation. A.C. was supported by an EPSRC Studentship.

References

[AAV18] Amir Asadi, Emmanuel Abbe, and Sergio Verdú. Chaining mutual information and tightening generalization bounds. In Advances in Neural Information Processing Systems (NeurIPS), pages 7234–7243, 2018.

[AB09] Martin Anthony and Peter L. Bartlett. Neural Network Learning: Theoretical Foundations. Cambridge University Press, 2009.

[AGNZ18] Sanjeev Arora, Rong Ge, Behnam Neyshabur, and Yi Zhang. Stronger generalization bounds for deep nets via a compression approach. In Proceedings of the 35th International Conference on Machine Learning, volume 80, pages 254–263. PMLR, 10–15 Jul 2018.

[AGZ21] Naman Agarwal, Surbhi Goel, and Cyril Zhang. Acceleration via fractal learning rate schedules. arXiv preprint arXiv:2103.01338, 2021.

[Anc16] Andreas Anckar. Dimension bounds for invariant measures of bi-Lipschitz iterated function systems. Journal of Mathematical Analysis and Applications, 440(2):853–864, 2016.

[BE02] Olivier Bousquet and André Elisseeff. Stability and generalization. Journal of Machine Learning Research, 2(Mar):499–526, 2002.

[Bog07] Vladimir I. Bogachev. Measure Theory, volume 1. Springer, 2007.

[DDB20] Aymeric Dieuleveut, Alain Durmus, and Francis Bach. Bridging the gap between constant step size stochastic gradient descent and Markov chains. Annals of Statistics, 48(3):1348–1382, 2020.

[DF99] Persi Diaconis and David Freedman. Iterated random functions. SIAM Review, 41(1):45–76, 1999.

[DR17] Gintare Karolina Dziugaite and Daniel M. Roy. Computing nonvacuous generalization bounds for deep (stochastic) neural networks with many more parameters than training data. In Proceedings of the Thirty-Third Conference on Uncertainty in Artificial Intelligence (UAI), 2017.

[DSD20] Nadav Dym, Barak Sober, and Ingrid Daubechies. Expression of fractals through neural network functions. IEEE Journal on Selected Areas in Information Theory, 1(1):57–66, 2020.

[DZPS19] Simon S. Du, Xiyu Zhai, Barnabas Poczos, and Aarti Singh. Gradient descent provably optimizes over-parameterized neural networks. In International Conference on Learning Representations, 2019.


[Elt90] John H. Elton. A multiplicative ergodic theorem for Lipschitz maps. Stochastic Processes and their Applications, 34(1):39–47, 1990.

[EM15] Murat A. Erdogdu and Andrea Montanari. Convergence rates of sub-sampled Newton methods. Advances in Neural Information Processing Systems, 28:3052–3060, 2015.

[Fal97] Kenneth J. Falconer. Techniques in Fractal Geometry. Wiley, 1997.

[Fal04] Kenneth Falconer. Fractal Geometry: Mathematical Foundations and Applications. Wiley, 2004.

[FST06] Ai Hua Fan, Károly Simon, and Hajnal R. Tóth. Contracting on average random IFS with repelling fixed point. Journal of Statistical Physics, 122:169–193, 2006.

[FW09] De-Jun Feng and Yang Wang. On the structures of generating iterated function systems of Cantor sets. Advances in Mathematics, 222(6):1964–1981, 2009.

[GŞZ20] Mert Gürbüzbalaban, Umut Şimşekli, and Lingjiong Zhu. The heavy-tail phenomenon in SGD. arXiv e-prints, page arXiv:2006.04740, June 2020.

[Hau18] Felix Hausdorff. Dimension und äusseres Mass. Mathematische Annalen, 79(1-2):157–179, 1918.

[HJTW21] Daniel Hsu, Ziwei Ji, Matus Telgarsky, and Lan Wang. Generalization bounds via distillation. In International Conference on Learning Representations, 2021.

[HS74] Morris W. Hirsch and Stephen Smale. Differential Equations, Dynamical Systems, and Linear Algebra. Academic Press, 1974.

[HSS12] Geoffrey Hinton, Nitish Srivastava, and Kevin Swersky. Overview of mini-batch gradient descent. Neural Networks for Machine Learning, 575, 2012.

[JR08] Joanna Jaroszewska and Michał Rams. On the Hausdorff dimension of invariant measures of weakly contracting on average measurable IFS. Journal of Statistical Physics, 132:907, 2008.

[KB15] Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. In International Conference on Learning Representations, 2015.

[Li17] Xi-Lin Li. Preconditioned stochastic gradient descent. IEEE Transactions on Neural Networks and Learning Systems, 29(5):1454–1466, 2017.

[LMA21] Zhiyuan Li, Sadhika Malladi, and Sanjeev Arora. On the validity of modeling SGD with stochastic differential equations (SDEs). arXiv preprint arXiv:2102.12470, 2021.

[LTE19] Qianxiao Li, Cheng Tai, and Weinan E. Stochastic modified equations and dynamics of stochastic gradient algorithms I: Mathematical foundations. Journal of Machine Learning Research, 20(40):1–47, 2019.

[MS02] Józef Myjak and Tomasz Szarek. On Hausdorff dimension of invariant measures arising from non-contractive iterated function systems. Annali di Matematica Pura ed Applicata, 181:223–237, 2002.

[MSS19] Eran Malach and Shai Shalev-Shwartz. Is deeper better only when shallow is good? In Advances in Neural Information Processing Systems (NeurIPS), volume 32, 2019.

[MT10] John M. Mackay and Jeremy T. Tyson. Conformal Dimension: Theory and Application, volume 54. American Mathematical Society, 2010.

[NBMS17] Behnam Neyshabur, Srinadh Bhojanapalli, David McAllester, and Nati Srebro. Exploring generalization in deep learning. In Advances in Neural Information Processing Systems (NIPS), pages 5947–5956, 2017.


[NSB02] Matthew Nicol, Nikita Sidorov, and David Broomhead. On the fine structure of stationary measures in systems which contract-on-average. Journal of Theoretical Probability, 15:715–730, 2002.

[Pes08] Yakov B. Pesin. Dimension Theory in Dynamical Systems: Contemporary Views and Applications. University of Chicago Press, 2008.

[Qia99] Ning Qian. On the momentum term in gradient descent learning algorithms. Neural Networks, 12(1):145–151, 1999.

[Ram06] Michał Rams. Dimension estimates for invariant measures of contracting-on-average iterated function systems. arXiv preprint math/0606420, 2006.

[RKM16] Farbod Roosta-Khorasani and Michael W. Mahoney. Sub-sampled Newton methods II: Local convergence rates. arXiv preprint arXiv:1601.04738, 2016.

[Rog98] Claude Ambrose Rogers. Hausdorff Measures. Cambridge University Press, 1998.

[SAM+20] Taiji Suzuki, Hiroshi Abe, Tomoya Murata, Shingo Horiuchi, Kotaro Ito, Tokuma Wachi, So Hirai, Masatoshi Yukishima, and Tomoaki Nishimura. Spectral pruning: Compressing deep neural networks via spectral analysis and its generalization error. In International Joint Conference on Artificial Intelligence, pages 2839–2846, 2020.

[SAN20] Taiji Suzuki, Hiroshi Abe, and Tomoaki Nishimura. Compression based bound for non-compressed network: Unified generalization error analysis of large compressible deep neural network. In International Conference on Learning Representations, 2020.

[Saz00] Tomasz Sazarek. The dimension of self-similar measures. Bulletin of the Polish Academy of Sciences. Mathematics, 48:293–202, 2000.

[SHTY13] Mahito Sugiyama, Eiju Hirowatari, Hideki Tsuiki, and Akihiro Yamamoto. Learning figures with the Hausdorff metric by fractals—towards computable binary classification. Machine Learning, 90(1):91–126, 2013.

[SSBD14] Shai Shalev-Shwartz and Shai Ben-David. Understanding Machine Learning: From Theory to Algorithms. Cambridge University Press, 2014.

[ŞSDE20] Umut Şimşekli, Ozan Sener, George Deligiannidis, and Murat A. Erdogdu. Hausdorff dimension, heavy tails, and generalization in neural networks. In Advances in Neural Information Processing Systems, volume 33, 2020.

[Wai19] Martin J. Wainwright. High-Dimensional Statistics: A Non-Asymptotic Viewpoint, volume 48. Cambridge University Press, 2019.

[XR17] Aolin Xu and Maxim Raginsky. Information-theoretic analysis of generalization capability of learning algorithms. In Advances in Neural Information Processing Systems (NeurIPS), pages 2524–2533, 2017.

[Yai19] Sho Yaida. Fluctuation-dissipation relations for stochastic gradient descent. In International Conference on Learning Representations, 2019.

[YBVE20] Lu Yu, Krishnakumar Balasubramanian, Stanislav Volgushev, and Murat A. Erdogdu. An analysis of constant step size SGD in the non-convex regime: Asymptotic normality and bias. arXiv preprint arXiv:2006.07904, 2020.

[YGKM20] Zhewei Yao, Amir Gholami, Kurt Keutzer, and Michael W. Mahoney. PyHessian: Neural networks through the lens of the Hessian. In 2020 IEEE International Conference on Big Data (Big Data), pages 581–590, 2020.


[ZBH+17] Chiyuan Zhang, Samy Bengio, Moritz Hardt, Benjamin Recht, and Oriol Vinyals. Understanding deep learning requires rethinking generalization. In International Conference on Learning Representations, 2017.

[ZMG19] Guodong Zhang, James Martens, and Roger Grosse. Fast convergence of natural gradient descent for overparameterized neural networks. In Advances in Neural Information Processing Systems, volume 32, 2019.


Appendix

The appendix is organized as follows:

• Technical background for the proofs.

  – In Section A, we provide additional background on dimension theory. In particular, we define the Minkowski dimension and the local dimension for a measure. Then, we provide three existing theoretical results that will be used in our proofs.

• Additional theoretical results.

  – In Section B, we provide an upper bound on the Hausdorff dimension of the invariant measure of SGD when applied to support vector machines. This result is a continuation of the results given in Section 4.

  – In Section C, we provide upper bounds on the Hausdorff dimension of the invariant measure of preconditioned SGD on different problems.

  – In Section D, we provide an upper bound on the Hausdorff dimension of the invariant measure of the stochastic Newton algorithm applied to linear regression.

• Details of the experimental results.

  – In Section E, we provide the details of the algorithm that we developed for computing the operator norm ‖I − η∇²R_k(w)‖ for neural networks.

  – In Section F, we provide the details of the SGD hyperparameters, the network architectures, and information regarding hardware and run-time.

• Proofs.

  – In Section G, we provide the proofs of all the theoretical results presented in the main document and the Appendix.

Appendix A. Further Background on Dimension Theory

A.1 Minkowski dimension of a measure

Based on the definition of the Minkowski dimension of sets given in Section 2, we can define the Minkowski dimension of a finite Borel measure µ as follows [Pes08]:

\dim_M \mu := \lim_{\delta \to 0} \inf \left\{ \dim_M Z : \mu(Z) \geq 1 - \delta \right\}.    (22)

Note that in general, we have

\dim_M \mu \leq \inf \left\{ \dim_M Z : \mu(Z) = 1 \right\},

where the inequality can be strict; see [Pes08, Chapter 7].

A.2 Local dimensions of a measure

It is sometimes more convenient to consider a dimension notion that is defined in a pointwise manner. Let µ be a finite Borel regular measure on R^d. The lower and upper local (or pointwise) dimensions of µ at x ∈ R^d are respectively defined as follows:

\underline{\dim}_{\mathrm{loc}} \mu(x) := \liminf_{r \to 0} \frac{\log \mu(B(x, r))}{\log r},    (23)

\overline{\dim}_{\mathrm{loc}} \mu(x) := \limsup_{r \to 0} \frac{\log \mu(B(x, r))}{\log r},    (24)

where B(x, r) denotes the ball of radius r about x. When the values of these dimensions agree, the common value is called the local (or pointwise) dimension of µ at x and is denoted by dim_loc µ(x). The local dimensions describe the power-law behavior of µ(B(x, r)) for small r [Fal97]. These dimensions are closely linked to the Hausdorff dimension.

A.3 Existing Results

The following result upper-bounds the Hausdorff dimension of the invariant measure of an IFS in terms of the constituents of the IFS. We translate the result to our notation.

Theorem 2. [Ram06, Theorem 2.1] Consider the IFS (6) and assume that conditions (7) and (8) are satisfied and that the h_i are continuously differentiable with derivatives h'_i that are α-Hölder continuous for some α > 0. The invariant measure µ_{W|S_n} of the IFS satisfies

\dim_H \mu_{W|S_n} \leq s, \quad \text{where} \quad s := \frac{\mathrm{Ent}}{\sum_{i=1}^{m_b} p_i \int_{\mathbb{R}^d} \log(\|J h_i(w)\|) \, d\mu_{W|S_n}(w)},

where Ent := Σ_{i=1}^{m_b} p_i log(p_i) is the (negative) entropy. Furthermore, if the h_i are conformal and either SOSC or ROSC is satisfied, then we have dim_H(µ_{W|S_n}) = s, where the Hausdorff dimension and the upper Hausdorff dimension of µ_{W|S_n} coincide.

The next two results link the Hausdorff and Minkowski dimensions of a measure to its local dimension.

Proposition 6. [Fal97, Propositions 10.3] For a finite Borel measure µ, the following identity holds:

dimHµ = inf s : dimlocµ(x) ≤ s for µ-almost all x . (25)

Theorem 3. [Pes08, Theorem 7.1] Let µ be a finite Borel measure on Rd. If dimlocµ(x) ≤ α for µ-almost everyx, then dimMµ ≤ α.

The next theorem, called Egoroff’s theorem, will be used in our proofs repeatedly. It provides a condition formeasurable functions to be uniformly continuous in an almost full-measure set.

Theorem 4 (Egoroff's Theorem). [Bog07, Theorem 2.2.1] Let $(X, \mathcal{A}, \mu)$ be a space with a finite nonnegative measure $\mu$ and let $\mu$-measurable functions $f_n$ be such that $\mu$-almost everywhere there is a finite limit $f(x) := \lim_{n\to\infty} f_n(x)$. Then, for every $\varepsilon > 0$, there exists a set $X_\varepsilon \in \mathcal{A}$ such that $\mu(X\setminus X_\varepsilon) < \varepsilon$ and the functions $f_n$ converge to $f$ uniformly on $X_\varepsilon$.

Appendix B. Additional Analytical Estimates for SGD

Support vector machines. Given the data points $z_i = (a_i, y_i)$ with the input data $a_i$ and the output $y_i \in \{-1, 1\}$, consider support vector machines with the smooth hinge loss:
\[ \ell(w, z_i) := \ell_\sigma\bigl(y_i a_i^T w\bigr) + \lambda\|w\|^2/2, \tag{26} \]


where $\sigma > 0$ is a smoothing parameter, $\lambda > 0$ is the regularization parameter, and $\ell_\sigma(z) := 1 - z + \sigma\log\bigl(1 + e^{-\frac{1-z}{\sigma}}\bigr)$.

This loss function is a smooth version of the hinge loss that can be easier to optimize in some settings. In fact, it can be shown that as $\sigma \to 0$, this loss converges to the (non-smooth) hinge loss pointwise.
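Indeed, for $z < 1$ the exponent $-(1-z)/\sigma \to -\infty$ so the logarithmic term vanishes, while for $z > 1$ we can write $\sigma\log\bigl(1 + e^{(z-1)/\sigma}\bigr) = (z-1) + \sigma\log\bigl(1 + e^{-(z-1)/\sigma}\bigr) \to z-1$, so that
\[ \lim_{\sigma\to 0}\ell_\sigma(z) = \begin{cases} 1-z, & z \le 1,\\ 0, & z > 1, \end{cases} \quad = \max(1-z,\,0). \]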

Proposition 7 (Support vector machines). Consider the support vector machines (26). Assume that the step-size satisfies $\eta < \frac{1}{\lambda + R^2/(4\sigma)}$, where $R := \max_i\|a_i\|$ is finite. Then, we have:
\[ \dim_H\mu_{W|S_n} \le \frac{\log(n/b)}{\log\bigl(1/(1 - \eta\lambda)\bigr)}. \tag{27} \]

Appendix C. Analytical Estimates for Preconditioned SGD

We consider the pre-conditioned SGD method
\[ w_k = w_{k-1} - \eta H^{-1}\nabla R_k(w_{k-1}), \tag{28} \]
for a fixed square matrix $H$. Some choices of $H$ include a diagonal matrix, a block-diagonal matrix, or the Fisher information matrix (see e.g. [ZMG19]). We assume that $H$ is a positive definite matrix, so that by the Cholesky decomposition we can write $H = SS^T$, where $S$ is a real lower triangular matrix with positive diagonal entries. If $H = JJ^T$, where $J$ is the Jacobian, then the corresponding method for least-squares problems is the Gauss–Newton method. We assume that there exist some $m, M > 0$ such that:
\[ 0 \prec mI \preceq H \preceq MI. \tag{29} \]

As illustrative examples, in the following we consider the setting where we divide $S_n$ into $m_b = n/b$ batches, each having $b$ elements, and discuss how analytical estimates on the (upper) Hausdorff dimension $\dim_H\mu_{W|S_n}$ can be obtained for particular problems, including least squares, regularized logistic regression, support vector machines, and a one-hidden-layer network.

Least squares. We consider the least squares problem with data points $z_i = (a_i, y_i)$ and the loss
\[ \ell(w, z_i) := \frac{1}{2}\bigl(a_i^T w - y_i\bigr)^2 + \frac{\lambda}{2}\|w\|^2, \tag{30} \]

where $\lambda > 0$ is a regularization parameter. If we apply preconditioned SGD, this results in the recursion (6) with
\[ h_i(w) = M_i w + q_i \quad \text{with} \quad M_i := I - \eta\lambda H^{-1} - \eta H^{-1} H_i, \tag{31} \]
\[ H_i := \frac{1}{b}\sum_{j\in S_i} a_j a_j^T, \qquad q_i := \frac{\eta}{b} H^{-1}\sum_{j\in S_i} a_j y_j, \]
where $a_j \in \mathbb{R}^d$ are the input vectors, $y_j$ are the output variables, and $\{S_i\}_{i=1}^{m_b}$ is a partition of $\{1, 2, \ldots, n\}$ with $|S_i| = b$, where $i = 1, 2, \ldots, m_b$ with $m_b = n/b$. We have the following result.

Proposition 8 (Least squares). Consider the pre-conditioned SGD method (28) for the least squares problem (30). Assume that the step-size satisfies $\eta < \frac{m}{R^2 + \lambda}$, where $R := \max_i\|a_i\|$ is finite. Then, we have the following upper bound for the Hausdorff dimension:
\[ \dim_H\mu_{W|S_n} \le \frac{\log(n/b)}{\log\bigl(1/(1 - \eta M^{-1}\lambda)\bigr)}. \tag{32} \]

Regularized logistic regression. We consider the regularized logistic regression problem with the data points $z_i = (a_i, y_i)$ and the loss:
\[ \ell(w, z_i) := \log\bigl(1 + \exp(-y_i a_i^T w)\bigr) + \frac{\lambda}{2}\|w\|^2, \tag{33} \]
where $\lambda > 0$ is the regularization parameter.


Proposition 9 (Regularized logistic regression). Consider the pre-conditioned SGD method (28) for regularized logistic regression (33). Assume that the step-size satisfies $\eta < m/\lambda$ and $R := \max_i\|a_i\| < 2\sqrt{m\lambda/M}$. Then, we have the following upper bound for the Hausdorff dimension:
\[ \dim_H\mu_{W|S_n} \le \frac{\log(n/b)}{\log\bigl(1/(1 - \eta M^{-1}\lambda + \frac{1}{4}\eta m^{-1}R^2)\bigr)}. \tag{34} \]

Next, we consider a non-convex formulation for logistic regression. Consider the data points $z_i = (a_i, y_i)$ and the loss:
\[ \ell(w, z_i) := \rho\bigl(y_i - \langle w, a_i\rangle\bigr) + \frac{\lambda_r}{2}\|w\|^2, \tag{35} \]
where $\lambda_r > 0$ is a regularization parameter and $\rho$ is a non-convex function. We have the following result.

Proposition 10 (Non-convex formulation for logistic regression). Consider the pre-conditioned SGD method (28) in the non-convex logistic regression setting (35). Assume that the step-size satisfies $\eta < \frac{m}{\lambda_r + R^2(2/t_0)}$ and $R := \max_i\|a_i\| < \sqrt{\lambda_r t_0 m/(2M)}$. Then, we have the following upper bound for the Hausdorff dimension:
\[ \dim_H\mu_{W|S_n} \le \frac{\log(n/b)}{\log\bigl(1/(1 - \eta M^{-1}\lambda_r + \eta m^{-1}R^2\frac{2}{t_0})\bigr)}. \tag{36} \]

Support vector machines. We have the following result for pre-conditioned SGD when applied to the support vector machines problem (26).

Proposition 11 (Support vector machines). Consider pre-conditioned SGD (28) for support vector machines (26). Assume that the step-size satisfies $\eta < \frac{m}{\lambda + R^2/(4\sigma)}$, where $R := \max_i\|a_i\|$ is finite. Then, we have the following upper bound for the Hausdorff dimension:
\[ \dim_H\mu_{W|S_n} \le \frac{\log(n/b)}{\log\bigl(1/(1 - \eta M^{-1}\lambda)\bigr)}. \tag{37} \]

One hidden-layer neural network. Consider the one-hidden-layer neural network setting as in Proposition 5, where the objective is to minimize the regularized squared loss with the loss function:
\[ \ell(w, z_i) := \|y_i - \hat{y}_i\|^2 + \frac{\lambda}{2}\|w\|^2, \qquad \hat{y}_i := \sum_{r=1}^{m} b_r\sigma\bigl(w_r^T a_i\bigr), \tag{38} \]
where the non-linearity $\sigma: \mathbb{R}\to\mathbb{R}$ is smooth and $\lambda > 0$ is the regularization parameter.

Proposition 12 (One hidden-layer network). Consider the one-hidden-layer network (38). Assume that $\eta < \frac{m}{C + \lambda}$ and $\lambda > \frac{M}{m}C$, where $C$ is defined in Corollary 5. Then, we have the following upper bound for the Hausdorff dimension:
\[ \dim_H\mu_{W|S_n} \le \frac{\log(n/b)}{\log\bigl(1/(1 - \eta(M^{-1}\lambda - m^{-1}C))\bigr)}. \tag{39} \]

Appendix D. Analytical Estimates for Stochastic Newton

We consider the stochastic Newton method
\[ w_k = w_{k-1} - \eta\,[H_k(w_{k-1})]^{-1}\nabla R_k(w_{k-1}), \qquad \text{where} \quad H_k(w) := \frac{1}{b}\sum_{i\in\Omega_k}\nabla^2\ell(w, z_i), \]
see e.g. [RKM16], where $\Omega_k = S_i$ with probability $p_i$ for $i = 1, 2, \ldots, m_b$, with $m_b = n/b$.


For simplicity, we focus on the least squares problem, with the data points $z_i = (a_i, y_i)$ and the loss:
\[ \ell(w, z_i) := \frac{1}{2}\bigl(a_i^T w - y_i\bigr)^2 + \frac{\lambda}{2}\|w\|^2, \tag{40} \]

where $\lambda > 0$ is a regularization parameter. If we apply stochastic Newton, this results in the recursion (6) with
\[ h_i(w) = M_i w + q_i \quad \text{with} \quad M_i := (1 - \eta)I, \tag{41} \]
\[ H_i := \frac{1}{b}\sum_{j\in S_i} a_j a_j^T + \lambda I, \qquad q_i := \frac{\eta}{b} H_i^{-1}\sum_{j\in S_i} a_j y_j, \]

where $a_j \in \mathbb{R}^d$ are the input vectors, $y_j$ are the output variables, and $\{S_i\}_{i=1}^{m_b}$ is a partition of $\{1, 2, \ldots, n\}$ with $|S_i| = b$, where $i = 1, 2, \ldots, m_b$ with $m_b = n/b$. Therefore, $Jh_i(w) = (1 - \eta)I$. By following a similar argument as in the proof of Proposition 8, we conclude that for any $\eta \in (0, 1)$,
\[ \dim_H\mu_{W|S_n} \le \frac{\log(n/b)}{\log\bigl(1/(1 - \eta)\bigr)}, \tag{42} \]
where the upper bound is decreasing in the step-size $\eta$ and the batch-size $b$.

Appendix E. Estimating the Complexity R for SGD

Estimating R, as detailed in Equation (21), requires drawing $N_W$ samples from the invariant measure and $N_U$ batches from the training data. As mentioned in the main text, to approximate the summation over the $N_W$ samples from the invariant measure, assuming (8) is ergodic [DF99], we treat the iterates $w_k$ as i.i.d. samples from $\mu_{W|S}$ for large $k$; hence, the norm of the Jacobian, $\log(\|Jh_{I_j}(W_i)\|)$, can be efficiently computed on these iterates. Thus, we first train a neural network to convergence, where convergence is defined as the model reaching some accuracy level (if the dataset is a classification task) and achieving a loss below some threshold on the training data. We assume that after convergence the SGD iterates are drawn from the invariant measure. As such, we run the training algorithm for another 200 iterations, saving a snapshot of the model parameters at each step, so that $N_W = 200$ in Equation (21). For each of these snapshots we estimate the spectral norm $\|Jh_I(W)\|$ using a simple modification of the power iteration algorithm of [YGKM20], detailed in Section E.1 below. This modified algorithm is scalable to neural networks with millions of parameters, and we apply it to 50 of the batches used during training, so that $N_U = 50$ in (21).

E.1 Power Iteration Algorithm for $\|Jh_i(w)\|$

We re-purpose the power iteration algorithm of [YGKM20], adding a small modification that allows for the estimation of the spectral norm $\|Jh_i(w)\|$. We first note that
\[ Jh_i(w) = I - \eta\nabla^2 R_i(w), \tag{43} \]
where $\nabla^2 R_i(w)$ is the Hessian for the $i$-th batch. As such, our power iteration algorithm needs to estimate the operator norm of the matrix $I - \eta\nabla^2 R_i(w)$, and not just that of the Hessian of the network. To do this we only need to change the ‘vector-product’ step of the power-iteration algorithm of [YGKM20]. Our modified method has the same convergence guarantees, namely that the method will converge to the ‘true’ top eigenvalue if this eigenvalue is ‘dominant’, in that it dominates all other eigenvalues in absolute value, i.e., if $\lambda_1$ is the top eigenvalue then we must have
\[ |\lambda_1| > |\lambda_2| \ge \cdots \ge |\lambda_n| \]
to guarantee convergence.


Algorithm 1: Power Iteration for Top Eigenvalue Computation of $Jh_i(w)$

Input: Network parameters $w$, loss function $f$, learning rate $\eta$.
1: Compute the gradient of $f$ with respect to $w$ by backpropagation, i.e., compute $g_w = \frac{df}{dw}$.
2: Draw a random vector $v$ from $N(0, 1)$ (same dimension as $w$).
3: Normalize $v$: $v = \frac{v}{\|v\|_2}$.
4: for $i = 1, 2, \ldots$ do  // Power iteration
5:    Compute $g_v = g_w^T v$.  // Inner product
6:    Compute $Hv$ by backpropagation: $Hv = \frac{d(g_v)}{dw}$.  // Hessian-vector product
7:    Compute $Jh_i v = (I - \eta H)v = v - \eta Hv$.  // $Jh_i$-vector product
8:    Normalize and reset $v$: $v = \frac{Jh_i v}{\|Jh_i v\|_2}$.
9: end for
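For concreteness, the following is a minimal PyTorch sketch of Algorithm 1, using Hessian-vector products computed by double backpropagation; the toy model, data, and iteration count are illustrative assumptions and not the experimental setup used in the paper.

import torch

def top_norm_of_Jh(params, loss, eta, n_iters=100):
    # Estimate ||Jh_i(w)|| = ||I - eta * H|| by power iteration, where H is the
    # Hessian of `loss` with respect to `params` (a list of parameter tensors).
    grads = torch.autograd.grad(loss, params, create_graph=True)  # keep graph for HVPs

    v = [torch.randn_like(p) for p in params]                     # random start vector
    norm = torch.sqrt(sum((vi ** 2).sum() for vi in v))
    v = [vi / norm for vi in v]

    sigma = torch.tensor(0.0)
    for _ in range(n_iters):
        gv = sum((g * vi).sum() for g, vi in zip(grads, v))        # inner product <g_w, v>
        Hv = torch.autograd.grad(gv, params, retain_graph=True)    # Hessian-vector product
        Jv = [vi - eta * hvi for vi, hvi in zip(v, Hv)]            # (I - eta H) v
        sigma = torch.sqrt(sum((ji ** 2).sum() for ji in Jv))      # ||Jh_i v|| with ||v|| = 1
        v = [ji / sigma for ji in Jv]                              # normalize and reset
    return sigma.item()

# Toy usage on a small network and a random batch (illustrative only).
model = torch.nn.Sequential(torch.nn.Linear(10, 32), torch.nn.Tanh(), torch.nn.Linear(32, 1))
x, y = torch.randn(50, 10), torch.randn(50, 1)
batch_loss = torch.nn.functional.mse_loss(model(x), y)
print(top_norm_of_Jh(list(model.parameters()), batch_loss, eta=0.1))

Since $I - \eta\nabla^2 R_i(w)$ is symmetric, the norm of the un-normalized iterate, $\|Jh_i v\|$, converges to the spectral norm $\|Jh_i(w)\|$ under the dominance condition stated above.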

Appendix F. Experiment Hyperparameters

Training Parameters: All models in Figures 3 and 4 were trained using SGD with batch sizes of 50 or 100, and were considered to have converged for CIFAR10 and SVHN if they reached 100% accuracy and less than 0.0005 loss on the training set. For BHP, convergence was considered to have been achieved after 100000 training steps. For all models except VGG16 in Figures 3 and 4 we use learning rates in

{0.0075, 0.02, 0.025, 0.03, 0.04, 0.06, 0.07, 0.075, 0.08, 0.09, 0.1, 0.11, 0.12, 0.13, 0.14, 0.15, 0.16, 0.17, 0.18, 0.19, 0.194, 0.2, 0.22, 0.24, 0.25, 0.26}.

VGG16 models were trained with learning rates in {0.0075, 0.02, 0.03, 0.06, 0.07, 0.08}.

Network Architectures: The BHP FCNs had 2 hidden layers and were 10 neurons wide. Similarly, the CIFAR10 FCNs were 5 and 7 layers deep with 2048 neurons per layer. The 9-layer CONV networks were VGG11 networks with the final 2 layers removed. The 16-layer CONV networks were simply the standard implementation of VGG16 networks.

Run-time: The full battery of fully connected models, split over two GeForce GTX 1080 GPUs, took two days to train to convergence, and the subsequent power iterations took less than a day. Similarly, the full gamut of VGG11 models took a day to train to convergence over four GeForce GTX 1080 GPUs, and the subsequent power iterations took less than a day to converge. The VGG16 models took a day to train over four GeForce GTX 1080 GPUs, but the power iterations for each model took roughly 24 hours on a single GeForce GTX 1080 GPU.

Appendix G. Postponed Proofs

G.1 Proof of Proposition 1

Proof. Denote $\alpha := \dim_H\mu_{W|S_n}$. By Assumption H 1, we have
\[ \underline{\dim}_{\mathrm{loc}}\mu_{W|S_n}(w) = \overline{\dim}_{\mathrm{loc}}\mu_{W|S_n}(w), \]
for $\mu_{W|S_n}$-a.e. $w$, and by Proposition 6 we have
\[ \dim_{\mathrm{loc}}\mu_{W|S_n}(w) \le \alpha + \varepsilon, \tag{44} \]
for all $\varepsilon > 0$ and for $\mu_{W|S_n}$-a.e. $w$. By invoking Theorem 3, we obtain
\[ \dim_M\mu_{W|S_n} \le \alpha + \varepsilon. \tag{45} \]
Since this holds for any $\varepsilon$, $\dim_M\mu_{W|S_n} \le \alpha$.

By definition, we have for almost all $S_n$:
\[ \dim_M\mu_{W|S_n} = \lim_{\delta\to 0}\inf\bigl\{\dim_M A : \mu_{W|S_n}(A) \ge 1 - \delta\bigr\}. \tag{46} \]

Hence, given a sequence $(\delta_k)_{k\ge 1}$ such that $\delta_k \downarrow 0$, and $S_n$, and any $\varepsilon > 0$, there is a $k_0 = k_0(\varepsilon)$ such that $k \ge k_0$ implies
\[
\begin{aligned}
\inf\bigl\{\dim_M A : \mu_{W|S_n}(A) \ge 1 - \delta_k\bigr\} &\le \dim_M\mu_{W|S_n} + \varepsilon \qquad &(47)\\
&\le \alpha + \varepsilon. \qquad &(48)
\end{aligned}
\]
Hence, for any $\varepsilon_1 > 0$ and $k \ge k_0$, we can find a bounded Borel set $A_{S_n,k}$ such that $\mu_{W|S_n}(A_{S_n,k}) \ge 1 - \delta_k$, and
\[ \dim_M A_{S_n,k} \le \alpha + \varepsilon + \varepsilon_1. \tag{49} \]

Note that the boundedness of $A_{S_n,k}$ follows from the fact that its upper Minkowski dimension is finite. Choosing both $\varepsilon$ and $\varepsilon_1$ to be $\varepsilon/2$ yields the desired result. This completes the proof.

G.2 Proof of Theorem 1

Proof. We begin similarly to the proof of Proposition 1.

Denote
\[ \alpha(S, n) := \dim_H\mu_{W|S_n}. \]
By Assumption H 1, we have $\underline{\dim}_{\mathrm{loc}}\mu_{W|S_n}(w) = \overline{\dim}_{\mathrm{loc}}\mu_{W|S_n}(w)$ for $\mu_{W|S_n}$-almost every $w$, and by Proposition 6 we have
\[ \dim_{\mathrm{loc}}\mu_{W|S_n}(w) \le \alpha(S, n) + \varepsilon, \tag{50} \]
for all $\varepsilon > 0$ and for $\mu_{W|S_n}$-a.e. $w$. By invoking Theorem 3, we obtain
\[ \dim_M\mu_{W|S_n} \le \alpha(S, n) + \varepsilon. \tag{51} \]
Since this holds for any $\varepsilon > 0$, $\dim_M\mu_{W|S_n} \le \alpha(S, n)$.

By definition, we have for all $S$ and $n$:
\[ \dim_M\mu_{W|S_n} = \lim_{\delta\to 0}\inf\bigl\{\dim_M A : \mu_{W|S_n}(A) \ge 1 - \delta\bigr\}. \tag{52} \]

Hence, for each $n$, there exists a set $\Omega_n$ of full measure such that
\[ f^n_\delta(S) := \inf\bigl\{\dim_M A : \mu_{W|S_n}(A) \ge 1 - \delta\bigr\} \to \dim_M\mu_{W|S_n}, \tag{53} \]
for all $S \in \Omega_n$. Let $\Omega^* := \cap_n\Omega_n$. Then for $S \in \Omega^*$ we have that for all $n$
\[ f^n_\delta(S) \to \dim_M\mu_{W|S_n}, \tag{54} \]
and therefore, on this set we also have
\[ \sup_n \frac{1}{\xi_n}\min\Bigl\{1, \bigl|f^n_\delta(S) - \dim_M\mu_{W|S_n}\bigr|\Bigr\} \to 0 \quad \text{as } \delta\to 0, \]
where $\xi_n$ is a monotone increasing sequence such that $\xi_n \ge 1$ and $\xi_n \to \infty$.

By applying Theorem 4 to the collection of random variables
\[ F_\delta(S) := \sup_n \frac{1}{\xi_n}\min\Bigl\{1, \bigl|f^n_\delta(S) - \dim_M\mu_{W|S_n}\bigr|\Bigr\}, \tag{55} \]
for any $\zeta > 0$, we can find a subset $Z \subset \mathcal{Z}^\infty$, with probability at least $1 - \zeta$ under $\pi^\infty$, such that on $Z$ the convergence is uniform, that is,
\[ \sup_{S\in Z}\sup_n \frac{1}{\xi_n}\min\Bigl\{1, \bigl|f^n_\delta(S) - \dim_M\mu_{W|S_n}\bigr|\Bigr\} \le c(\delta), \tag{56} \]
where for any $\zeta$, $c(\delta) := c(\delta; \zeta) \to 0$ as $\delta \to 0$.

Hence, for any $\delta$, $S \in Z$, and $n$, we have
\[
\begin{aligned}
f^n_\delta(S) &\le \dim_M\mu_{W|S_n} + \xi_n c(\delta) \qquad &(57)\\
&\le \alpha(S, n) + \xi_n c(\delta). \qquad &(58)
\end{aligned}
\]
Consider a sequence $(\delta_k)_{k\ge 1}$ such that $\delta_k \downarrow 0$. Then, for any $S \in Z$ and $\varepsilon > 0$, we can find a bounded Borel set $A_{S_n,k}$ such that $\mu_{W|S_n}(A_{S_n,k}) \ge 1 - \delta_k$, and
\[ \dim_M A_{S_n,k} \le \alpha(S, n) + \xi_n c(\delta_k) + \varepsilon. \tag{59} \]
Define the set
\[ W_{n,\delta_k} := \bigcup_{S\in\mathcal{Z}^\infty} A_{S_n,k}. \tag{60} \]

By using $G(w) := |R(w) - R(w, S_n)|$, under the joint distribution of $(W, S_n)$, such that $S \sim \pi^\infty$ and $W \sim \mu_{W|S_n}$, we have:
\[
\begin{aligned}
P(G(W) > \varepsilon) &\le \zeta + P\bigl(\{G(W) > \varepsilon\} \cap \{S \in Z\}\bigr) \qquad &(61)\\
&\le \zeta + \delta_k + P\bigl(\{G(W) > \varepsilon\} \cap \{W \in A_{S_n,k}\} \cap \{S \in Z\}\bigr) \qquad &(62)\\
&\le \zeta + \delta_k + P\Bigl(\Bigl\{\sup_{w\in A_{S_n,k}} G(w) > \varepsilon\Bigr\} \cap \{S \in Z\}\Bigr). \qquad &(63)
\end{aligned}
\]

Now, let us focus on the last term of the above equation.

First we observe that, as $\ell$ is $L$-Lipschitz, so are $R$ and $R(\cdot, S_n)$. Hence, by considering the particular forms of the $\beta$-covers in H 2, for any $w' \in \mathbb{R}^d$ we have:
\[ G(w) \le G(w') + 2L\|w - w'\|, \tag{64} \]
which implies
\[ \sup_{w\in A_{S_n,k}} G(w) \le \max_{w\in N_{\beta_n}(A_{S_n,k})} G(w) + 2L\beta_n. \tag{65} \]

Now, notice that the $\beta$-covers of H 2 still yield the same Minkowski dimension in (5) [ŞSDE20]. Then, by definition, we have for all $S$ and $n$:
\[ \limsup_{\beta\to 0}\frac{\log|N_\beta(A_{S_n,k})|}{\log(1/\beta)} = \lim_{\beta\to 0}\sup_{r<\beta}\frac{\log|N_r(A_{S_n,k})|}{\log(1/r)} = \dim_M A_{S_n,k} =: d_M(S, n, k). \tag{66} \]
Hence, for each $n$,
\[ g^{n,k}_\beta(S) := \sup_{\mathbb{Q}\ni r<\beta}\frac{\log|N_r(A_{S_n,k})|}{\log(1/r)} \to d_M(S, n, k), \tag{67} \]
almost surely. By using the same reasoning as in (53), we have, for each $n$, there exists a set $\Omega'_n$ of full measure such that
\[ g^{n,k}_\beta(S) = \sup_{\mathbb{Q}\ni r<\beta}\frac{\log|N_r(A_{S_n,k})|}{\log(1/r)} \to d_M(S, n, k), \tag{68} \]

for all $S \in \Omega'_n$. Define $\Omega^{**} := (\cap_n\Omega'_n) \cap \Omega^*$. Hence, on $\Omega^{**}$ we have:
\[ G^k_\beta(S) := \sup_n \frac{1}{\xi_n}\min\Bigl\{1, \bigl|g^{n,k}_\beta(S) - d_M(S, n, k)\bigr|\Bigr\} \to 0 \quad \text{as } \beta\to 0. \tag{69} \]
By applying Theorem 4 to the collection $\{G^k_\beta(S)\}_\beta$, for any $\zeta_1 > 0$ we can find a subset $Z_1 \subset \mathcal{Z}^\infty$, with probability at least $1 - \zeta_1$ under $\pi^\infty$, such that on $Z_1$ the convergence is uniform, that is,
\[ \sup_{S\in Z_1}\sup_n \frac{1}{\xi_n}\min\Bigl\{1, \bigl|g^{n,k}_\beta(S) - d_M(S, n, k)\bigr|\Bigr\} \le c'(\beta), \tag{70} \]
where for any $\zeta_1$, $c'(\beta) := c'(\beta; \zeta_1, \delta_k) \to 0$ as $\beta \to 0$.

Hence, denoting $Z^* := Z \cap Z_1$, by using (59) we have:
\[ \{S \in Z^*\} \subseteq \bigcap_n \Bigl\{ |N_\beta(A_{S_n,k})| \le \Bigl(\frac{1}{\beta}\Bigr)^{\alpha(S,n) + \xi_n c(\delta_k) + \xi_n c'(\beta) + \varepsilon} \Bigr\}. \]

Let $(\beta_n)_{n\ge 0}$ be a decreasing sequence such that $\beta_n \in \mathbb{Q}$ for all $n$ and $\beta_n \to 0$. We then have
\[ P\Bigl(\{S \in Z\} \cap \Bigl\{\max_{w\in N_{\beta_n}(A_{S_n,k})} G_n(w) \ge \varepsilon\Bigr\}\Bigr) \le P\Bigl(\{S \in Z^*\} \cap \Bigl\{\max_{w\in N_{\beta_n}(A_{S_n,k})} G_n(w) \ge \varepsilon\Bigr\}\Bigr) + \zeta_2. \]

For $\rho > 0$ and $m \in \mathbb{N}_+$ let us define the interval $J_m(\rho) := (m\rho, (m+1)\rho]$. Furthermore, for any $t > 0$ define
\[ \varepsilon(t) := \sqrt{\frac{2\nu^2}{n}\Bigl[\log(1/\beta_n)\bigl(t + \xi_n c(\delta_k) + \xi_n c'(\beta_n) + \varepsilon\bigr) + \log(M/\zeta_2)\Bigr]}. \tag{71} \]
For notational simplicity, denote $N_{\beta_n,k} := N_{\beta_n}(W_{n,\delta_k})$ and
\[ \alpha(S, n, k, \varepsilon) := \alpha(S, n) + \xi_n c(\delta_k) + \xi_n c'(\beta_n) + \varepsilon. \tag{72} \]
Let $d^*$ be the smallest real number such that $\alpha(S, n) \le d^*$ almost surely (notice that we trivially have $d^* \le d$; yet, $d^*$ can be much smaller than $d$); we therefore have:


\[
\begin{aligned}
&P\Bigl(\{S \in Z\} \cap \Bigl\{\max_{w\in N_{\beta_n}(A_{S_n,k})} G_n(w) \ge \varepsilon(\alpha(S,n))\Bigr\}\Bigr)\\
&\quad\le \zeta_2 + P\Bigl(\Bigl\{|N_{\beta_n}(A_{S_n,k})| \le (1/\beta_n)^{\alpha(S,n,k,\varepsilon)}\Bigr\} \cap \Bigl\{\max_{w\in N_{\beta_n}(A_{S_n,k})} |R(w,S_n) - R(w)| \ge \varepsilon(\alpha(S,n))\Bigr\}\Bigr)\\
&\quad= \zeta_2 + \sum_{m=0}^{\lceil d^*/\rho\rceil} P\Bigl(\Bigl\{|N_{\beta_n}(A_{S_n,k})| \le (1/\beta_n)^{\alpha(S,n,k,\varepsilon)}\Bigr\} \cap \Bigl\{\max_{w\in N_{\beta_n}(A_{S_n,k})} G_n(w) \ge \varepsilon(\alpha(S,n))\Bigr\} \cap \bigl\{\alpha(S,n) \in J_m(\rho)\bigr\}\Bigr)\\
&\quad= \zeta_2 + \sum_{m=0}^{\lceil d^*/\rho\rceil} P\Bigl(\Bigl\{|N_{\beta_n}(A_{S_n,k})| \le (1/\beta_n)^{\alpha(S,n,k,\varepsilon)}\Bigr\} \cap \bigl\{\alpha(S,n) \in J_m(\rho)\bigr\} \cap \bigcup_{w\in N_{\beta_n,k}}\Bigl(\bigl\{w \in N_{\beta_n}(A_{S_n,k})\bigr\} \cap \bigl\{G_n(w) \ge \varepsilon(\alpha(S,n))\bigr\}\Bigr)\Bigr)\\
&\quad\le \zeta_2 + \sum_{m=0}^{\lceil d^*/\rho\rceil} \sum_{w\in N_{\beta_n,k}} P\Bigl(\bigl\{G_n(w) \ge \varepsilon(m\rho)\bigr\} \cap \bigl\{w \in N_{\beta_n}(A_{S_n,k})\bigr\} \cap \Bigl\{|N_{\beta_n}(A_{S_n,k})| \le (1/\beta_n)^{\alpha(S,n,k,\varepsilon)}\Bigr\} \cap \bigl\{\alpha(S,n) \in J_m(\rho)\bigr\}\Bigr),
\end{aligned}
\]
where we used the fact that on the event $\{\alpha(S, n) \in J_m(\rho)\}$ we have $\varepsilon(\alpha(S, n)) \ge \varepsilon(m\rho)$.

Notice that the events $\{w \in N_{\beta_n}(A_{S_n,k})\}$, $\{|N_{\beta_n}(A_{S_n,k})| \le (1/\beta_n)^{\alpha(S,n,k,\varepsilon)}\}$, and $\{\alpha(S, n) \in J_m(\rho)\}$ are in $\mathcal{G}$.

On the other hand, the event $\{G_n(w) \ge \varepsilon(m\rho)\}$ is clearly in $\mathcal{F}$ (see H 2 for definitions).

Therefore, we have
\[
\begin{aligned}
&P\Bigl(\{S \in Z\} \cap \Bigl\{\max_{w\in N_{\beta_n}(A_{S_n,k})} G_n(w) \ge \varepsilon(\alpha(S,n))\Bigr\}\Bigr)\\
&\quad\le \zeta_2 + M\sum_{m=0}^{\lceil d^*/\rho\rceil}\sum_{w\in N_{\beta_n,k}} P\bigl(G_n(w) \ge \varepsilon(m\rho)\bigr)\times P\Bigl(\bigl\{w \in N_{\beta_n}(A_{S_n,k})\bigr\} \cap \Bigl\{|N_{\beta_n}(A_{S_n,k})| \le (1/\beta_n)^{\alpha(S,n,k,\varepsilon)}\Bigr\} \cap \bigl\{\alpha(S,n) \in J_m(\rho)\bigr\}\Bigr).
\end{aligned}
\]

Recall that $G_n(w) = \frac{1}{n}\sum_{i=1}^n\bigl[\ell(w, z_i) - \mathbb{E}_{z\sim\pi}\ell(w, z)\bigr]$. Since the $(z_i)_i$ are i.i.d., by Assumption 3 it follows that $G_n(w)$ is $(\nu/\sqrt{n}, \kappa/n)$-sub-exponential, and from [Wai19, Proposition 2.9] we have that
\[ P\bigl(G_n(w) \ge \varepsilon(m\rho)\bigr) \le 2\exp\Bigl(\frac{-n\varepsilon(m\rho)^2}{2\nu^2}\Bigr), \]
as long as $\varepsilon(m\rho) \le \nu^2/\kappa$.

For $n$ large enough we may assume that $\varepsilon(d^*) \le \nu^2/\kappa$, and thus
\[
\begin{aligned}
&P\Bigl(\{S \in Z\} \cap \Bigl\{\max_{w\in N_{\beta_n}(A_{S_n,k})} G_n(w) \ge \varepsilon(\alpha(S,n))\Bigr\}\Bigr)\\
&\quad\le \zeta_2 + 2M\sum_{m=0}^{\lceil d/\rho\rceil} e^{-\frac{n\varepsilon^2(m\rho)}{2\nu^2}}\sum_{w\in N_{\beta_n,k}} P\Bigl(\bigl\{w \in N_{\beta_n}(A_{S_n,k})\bigr\} \cap \Bigl\{|N_{\beta_n}(A_{S_n,k})| \le (1/\beta_n)^{\alpha(S,n,k,\varepsilon)}\Bigr\} \cap \bigl\{\alpha(S,n) \in J_m(\rho)\bigr\}\Bigr)\\
&\quad\le \zeta_2 + 2M\sum_{m=0}^{\lceil d/\rho\rceil} e^{-\frac{n\varepsilon^2(m\rho)}{2\nu^2}}\sum_{w\in N_{\beta_n,k}} \mathbb{E}\Bigl[\mathbf{1}\bigl\{w \in N_{\beta_n}(A_{S_n,k})\bigr\}\,\mathbf{1}\Bigl\{|N_{\beta_n}(A_{S_n,k})| \le (1/\beta_n)^{\alpha(S,n,k,\varepsilon)}\Bigr\}\,\mathbf{1}\bigl\{\alpha(S,n) \in J_m(\rho)\bigr\}\Bigr]\\
&\quad\le \zeta_2 + 2M\sum_{m=0}^{\lceil d/\rho\rceil} e^{-\frac{n\varepsilon^2(m\rho)}{2\nu^2}}\,\mathbb{E}\Bigl[\sum_{w\in N_{\beta_n,k}}\mathbf{1}\bigl\{w \in N_{\beta_n}(A_{S_n,k})\bigr\}\,\mathbf{1}\Bigl\{|N_{\beta_n}(A_{S_n,k})| \le (1/\beta_n)^{\alpha(S,n,k,\varepsilon)}\Bigr\}\,\mathbf{1}\bigl\{\alpha(S,n) \in J_m(\rho)\bigr\}\Bigr] \qquad (73)\\
&\quad\le \zeta_2 + 2M\sum_{m=0}^{\lceil d/\rho\rceil} e^{-\frac{n\varepsilon^2(m\rho)}{2\nu^2}}\,\mathbb{E}\Bigl[|N_{\beta_n}(A_{S_n,k})|\,\mathbf{1}\Bigl\{|N_{\beta_n}(A_{S_n,k})| \le (1/\beta_n)^{\alpha(S,n,k,\varepsilon)}\Bigr\}\,\mathbf{1}\bigl\{\alpha(S,n) \in J_m(\rho)\bigr\}\Bigr]\\
&\quad= \zeta_2 + 2M\sum_{m=0}^{\lceil d/\rho\rceil} e^{-\frac{n\varepsilon^2(m\rho)}{2\nu^2}}\,\mathbb{E}\Bigl[\Bigl(\frac{1}{\beta_n}\Bigr)^{\alpha(S,n,k,\varepsilon)}\mathbf{1}\bigl\{\alpha(S,n) \in J_m(\rho)\bigr\}\Bigr],
\end{aligned}
\]

where (73) follows from Fubini’s theorem.

Now, notice that the mapping $t \mapsto \varepsilon^2(t)$ is linear with derivative bounded by
\[ \frac{2\nu^2}{n}\log(1/\beta_n). \]
Therefore, on the event $\{\alpha(S, n) \in J_m(\rho)\}$ we have
\[
\begin{aligned}
\varepsilon^2(\alpha(S, n)) - \varepsilon^2(m\rho) &\le (\alpha(S, n) - m\rho)\frac{2\nu^2}{n}\log(1/\beta_n) \qquad &(74)\\
&\le \rho\frac{2\nu^2}{n}\log(1/\beta_n). \qquad &(75)
\end{aligned}
\]


By choosing $\rho = \rho_n = 1/\log(1/\beta_n)$, we have
\[ \varepsilon^2(m\rho_n) \ge \varepsilon^2(\alpha(S, n)) - \frac{2\nu^2}{n}. \]

Therefore, we have
\[
\begin{aligned}
&P\Bigl(\{S \in Z\} \cap \Bigl\{\max_{w\in N_{\beta_n}(A_{S_n,k})} G_n(w) \ge \varepsilon(\alpha(S,n))\Bigr\}\Bigr)\\
&\quad\le \zeta_2 + 2M\,\mathbb{E}\Biggl[\sum_{m=0}^{\lceil d/\rho_n\rceil} e^{-\frac{n\varepsilon^2(m\rho_n)}{2\nu^2}}\Bigl(\frac{1}{\beta_n}\Bigr)^{\alpha(S,n,k,\varepsilon)}\mathbf{1}\bigl\{\alpha(S,n) \in J_m(\rho_n)\bigr\}\Biggr]\\
&\quad\le \zeta_2 + 2M\,\mathbb{E}\Biggl[\sum_{m=0}^{\lceil d/\rho_n\rceil} e^{-\frac{n\varepsilon^2(\alpha(S,n))}{2\nu^2}+1}\Bigl(\frac{1}{\beta_n}\Bigr)^{\alpha(S,n,k,\varepsilon)}\mathbf{1}\bigl\{\alpha(S,n) \in J_m(\rho_n)\bigr\}\Biggr]\\
&\quad= \zeta_2 + 2M\,\mathbb{E}\Biggl[e^{-\frac{n\varepsilon^2(\alpha(S,n))}{2\nu^2}+1}\Bigl(\frac{1}{\beta_n}\Bigr)^{\alpha(S,n,k,\varepsilon)}\Biggr].
\end{aligned}
\]
By the definition of $\varepsilon(t)$, for any $S$ and $n$ we have that:
\[ 2M e^{-\frac{n\varepsilon^2(\alpha(S,n))}{2\nu^2}+1}\Bigl(\frac{1}{\beta_n}\Bigr)^{\alpha(S,n,k,\varepsilon)} = 2e\zeta_2. \]
Therefore,
\[ P\Bigl(\{S \in Z\} \cap \Bigl\{\max_{w\in N_{\beta_n}(A_{S_n,k})} G_n(w) \ge \varepsilon(\alpha(S,n))\Bigr\}\Bigr) \le (1 + 2e)\zeta_2. \]

Therefore, by using the definition of $\varepsilon(t)$, (63), and (65), with probability at least $1 - \zeta - \delta_k - (1 + 2e)\zeta_2$, we have
\[ |R(W, S_n) - R(W)| \le 2\sqrt{\frac{2\nu^2}{n}\Bigl[\log\Bigl(\frac{1}{\beta_n}\Bigr)\bigl(\alpha(S, n) + \xi_n c(\delta_k) + \xi_n c'(\beta_n) + \varepsilon\bigr) + \log\Bigl(\frac{M}{\zeta_2}\Bigr)\Bigr]} + 2L\beta_n. \]

Choose $k$ such that $\delta_k \le \zeta/2$, $\zeta_2 = \zeta/(2 + 4e)$, $\xi_n = \log\log(n)$, $\varepsilon = \alpha(S, n)$, and $\beta_n = \sqrt{2\nu^2/(L^2 n)}$. Then, with probability at least $1 - 2\zeta$, we have
\[ |R(W, S_n) - R(W)| \le 4\sqrt{\frac{4\nu^2}{n}\Bigl[\frac{1}{2}\log(nL^2)\bigl(2\alpha(S, n) + c(\delta_k)\log\log(n) + o(\log\log(n))\bigr) + \log\Bigl(\frac{13M}{\zeta}\Bigr)\Bigr]}. \tag{76--77} \]

Finally, as we have $\alpha(S, n)\log(n) = \omega(\log\log(n))$, for $n$ large enough, we obtain
\[ |R(W, S_n) - R(W)| \le 8\nu\sqrt{\frac{\alpha(S, n)\log^2(nL^2)}{n} + \frac{\log(13M/\zeta)}{n}}. \tag{78} \]

This completes the proof.


G.3 Proof of Proposition 2

Proof. If we apply SGD, this results in the recursion (6) with
\[ h_i(w) = M_i w + q_i \quad \text{with} \quad M_i := (1 - \eta\lambda)I - \eta H_i, \tag{79} \]
\[ H_i := \frac{1}{b}\sum_{j\in S_i} a_j a_j^T, \qquad q_i := \frac{\eta}{b}\sum_{j\in S_i} a_j y_j, \]
where $a_j \in \mathbb{R}^d$ are the input vectors, $y_j$ are the output variables, and $\{S_i\}_{i=1}^{m_b}$ is a partition of $\{1, 2, \ldots, n\}$ with $|S_i| = b$, with $i = 1, 2, \ldots, m_b$ and $m_b = n/b$. It can be seen that $\nabla\ell(w, z_i)$ is Lipschitz with constant $L_i = R_i^2 + \lambda$, where $R_i = \max_{j\in S_i}\|a_j\|$. We assume $\eta < 2/L = 2/(R^2 + \lambda)$, where $R = \max_i R_i$, otherwise the expectation of the iterates can diverge from some initializations and for some choices of the batch-size. We have

\[ h_i(u) - h_i(v) = M_i(u - v), \]
where
\[ 0 \preceq \bigl(1 - \eta\lambda - \eta R_i^2\bigr)I \preceq M_i \preceq (1 - \eta\lambda)I. \]
Hence, $h_i$ is bi-Lipschitz in the sense of [Anc16], i.e.,
\[ \gamma_i\|u - v\| \le \|h_i(u) - h_i(v)\| \le \Gamma_i\|u - v\|, \]
with
\[
\begin{aligned}
\gamma_i &= \min\bigl(\bigl|1 - \eta\lambda - \eta R_i^2\bigr|,\,|1 - \eta\lambda|\bigr), \qquad &(80)\\
\Gamma_i &= \max\bigl(\bigl|1 - \eta\lambda - \eta R_i^2\bigr|,\,|1 - \eta\lambda|\bigr) < 1, \qquad &(81)
\end{aligned}
\]
as long as $\gamma_i > 0$. For simplicity of the presentation, we assume $\eta < \frac{1}{R^2 + \lambda}$, in which case the expressions for $\gamma_i$ and $\Gamma_i$ simplify to:
\[ \gamma_i = 1 - \eta\lambda - \eta R_i^2, \qquad \Gamma_i = 1 - \eta\lambda. \]

In this case, it is easy to see that
\[ 0 < \gamma_i \le \|Jh_i(w)\| \le \Gamma_i < 1, \]
and it follows from Theorem 2 that
\[ \dim_H\mu_{W|S_n} \le \frac{\mathrm{Ent}}{\sum_{i=1}^{m_b} p_i\log(\Gamma_i)} = \frac{-\mathrm{Ent}}{\sum_{i=1}^{m_b} p_i\log(1/\Gamma_i)}. \tag{82} \]

By Jensen's inequality, we have
\[ -\mathrm{Ent} = \sum_{i=1}^{m_b} p_i\log\Bigl(\frac{1}{p_i}\Bigr) \le \log\Bigl(\sum_{i=1}^{m_b} p_i\cdot\frac{1}{p_i}\Bigr) = \log(m_b), \tag{83} \]
where $m_b = \frac{n}{b}$.

When $\eta < \frac{1}{R^2 + \lambda}$, we recall that $\gamma_i = 1 - \eta\lambda - \eta R_i^2$ and $\Gamma_i = 1 - \eta\lambda$. Therefore,
\[ \dim_H\mu_{W|S_n} \le \frac{-\mathrm{Ent}}{\sum_{i=1}^{m_b} p_i\log(1/\Gamma_i)} \le \frac{\log(m_b)}{\log\bigl(1/(1 - \eta\lambda)\bigr)} = \frac{\log(n/b)}{\log\bigl(1/(1 - \eta\lambda)\bigr)}. \tag{84} \]

The proof is complete.
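For concreteness, the bounds $0 < \gamma_i \le \|Jh_i(w)\| \le \Gamma_i < 1$ used above can be checked numerically on a synthetic batch; the data and constants in the following sketch are purely illustrative.

import numpy as np

rng = np.random.default_rng(1)
d, b, lam = 5, 20, 0.5
A_i = rng.normal(size=(b, d))              # one batch of input vectors a_j (rows)
R_i = np.linalg.norm(A_i, axis=1).max()    # R_i = max_{j in S_i} ||a_j||
eta = 0.5 / (R_i**2 + lam)                 # step-size satisfying eta < 1/(R^2 + lambda)

H_i = A_i.T @ A_i / b
M_i = (1 - eta * lam) * np.eye(d) - eta * H_i   # Jacobian Jh_i(w) = M_i for least squares
gamma_i = 1 - eta * lam - eta * R_i**2
Gamma_i = 1 - eta * lam

spec_norm = np.linalg.norm(M_i, 2)              # operator norm of the batch Jacobian
assert gamma_i - 1e-12 <= spec_norm <= Gamma_i + 1e-12
print(spec_norm, gamma_i, Gamma_i)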


G.4 Proof of Proposition 3

Proof. When the batch-size is equal to $b$, we can compute that the Jacobian is given by
\[ Jh_i(w) = \frac{1}{b}\sum_{j\in S_i}\Bigl((1 - \eta\lambda)I + \eta y_j^2\Bigl[\frac{e^{-y_j a_j^T w}}{(1 + e^{-y_j a_j^T w})^2}\Bigr]a_j a_j^T\Bigr), \tag{85} \]
where $\{S_i\}_{i=1}^{m_b}$ is a partition of $\{1, 2, \ldots, n\}$ with $|S_i| = b$, where $i = 1, 2, \ldots, m_b$ and $m_b = n/b$. Note that the input data is bounded, i.e. $R_i := \max_{j\in S_i}\|a_j\| < \infty$, and $R := \max_i R_i < 2\sqrt{\lambda}$. Recall that the step-size is sufficiently small, i.e. $\eta < 1/\lambda$. One can provide the following upper bound on $Jh_i(w)$:
\[ \|Jh_i(w)\| \le \Gamma_i := 1 - \eta\lambda + \tfrac{1}{4}\eta R_i^2 \le 1 - \eta\lambda + \tfrac{1}{4}\eta R^2, \tag{86} \]

so that
\[
\begin{aligned}
\dim_H\mu_{W|S_n} \le \frac{\mathrm{Ent}}{\sum_{i=1}^{m_b} p_i\log(\Gamma_i)} = \frac{-\mathrm{Ent}}{\sum_{i=1}^{m_b} p_i\log\bigl(1/(1 - \eta\lambda + \frac{1}{4}\eta R_i^2)\bigr)} &\le \frac{\log m_b}{\log\bigl(1/(1 - \eta\lambda + \frac{1}{4}\eta R^2)\bigr)} \qquad &(87)\\
&= \frac{\log(n/b)}{\log\bigl(1/(1 - \eta\lambda + \frac{1}{4}\eta R^2)\bigr)}, \qquad &(88)
\end{aligned}
\]

where we used (86) and (83) in (87).

The proof is complete.

G.5 Proof of Proposition 4

Proof. We can compute that
\[
\begin{aligned}
\nabla\ell(w, z_i) &= -a_i\rho'\bigl(y_i - \langle w, a_i\rangle\bigr) + \lambda_r w, \qquad &(89)\\
h_i(w) &= \frac{1}{b}\sum_{j\in S_i}\Bigl[(1 - \eta\lambda_r)w + \eta a_j\rho'\bigl(y_j - \langle w, a_j\rangle\bigr)\Bigr], \qquad &(90)\\
Jh_i(w) &= \frac{1}{b}\sum_{j\in S_i}\Bigl[(1 - \eta\lambda_r)I - \eta a_j a_j^T\rho''\bigl(y_j - \langle w, a_j\rangle\bigr)\Bigr], \qquad &(91)
\end{aligned}
\]
where $\{S_i\}_{i=1}^{m_b}$ is a partition of $\{1, 2, \ldots, n\}$ with $|S_i| = b$, where $i = 1, 2, \ldots, m_b$ with $m_b = n/b$. Furthermore, $\|\rho''_{\exp}\|_\infty = \rho''_{\exp}(0) = \frac{2}{t_0}$. Therefore, for $\eta \in \bigl(0, \frac{1}{\lambda_r + R^2(2/t_0)}\bigr)$,
\[ 0 < (1 - \eta\lambda_r) - \eta R^2\frac{2}{t_0} \le \|Jh_i(w)\| \le (1 - \eta\lambda_r) + \eta R^2\frac{2}{t_0}, \tag{92} \]

where $R = \max_i\|a_i\| < \sqrt{\lambda_r t_0/2}$. We have
\[ \dim_H\mu_{W|S_n} \le \frac{\log m_b}{\log\bigl(1/(1 - \eta\lambda_r + \eta R^2\frac{2}{t_0})\bigr)} = \frac{\log(n/b)}{\log\bigl(1/(1 - \eta\lambda_r + \eta R^2\frac{2}{t_0})\bigr)}, \tag{93} \]

where we used (92) and (83). The proof is complete.


G.6 Proof of Proposition 7

Proof. We can compute that
\[
\begin{aligned}
\nabla\ell(w, z_i) &= y_i\ell'_\sigma\bigl(y_i a_i^T w\bigr)a_i + \lambda w,\\
\nabla^2\ell(w, z_i) &= y_i^2\ell''_\sigma\bigl(y_i a_i^T w\bigr)a_i a_i^T + \lambda I,\\
h_i(w) &= w - \frac{\eta}{b}\sum_{j\in S_i}\nabla\ell(w, z_j),
\end{aligned}
\]
where $\{S_i\}_{i=1}^{m_b}$ is a partition of $\{1, 2, \ldots, n\}$ with $|S_i| = b$, where $i = 1, 2, \ldots, m_b$ with $m_b = n/b$, so that
\[ Jh_i(w) = I - \frac{\eta}{b}\sum_{j\in S_i}\nabla^2\ell(w, z_j) = (1 - \eta\lambda)I - \frac{\eta}{b}\sum_{j\in S_i} y_j^2\ell''_\sigma\bigl(y_j a_j^T w\bigr)a_j a_j^T, \]
with
\[ \ell''_\sigma(z) = \frac{1}{\sigma}\,\frac{e^{-(1-z)/\sigma}}{\bigl(1 + e^{-(1-z)/\sigma}\bigr)^2} \ge 0, \qquad \|\ell''_\sigma\|_\infty = \ell''_\sigma(1) = \frac{1}{4\sigma}. \]

Therefore, if $\eta \in \bigl(0, \frac{1}{\lambda + R^2/(4\sigma)}\bigr)$ and $R := \max_i\|a_i\|$, then
\[ 1 - \eta\lambda - \frac{\eta}{4\sigma}R^2 \le \|Jh_i(w)\| \le 1 - \eta\lambda. \]
This implies that
\[ \dim_H\mu_{W|S_n} \le \frac{\log m_b}{\log\bigl(1/(1 - \eta\lambda)\bigr)} = \frac{\log(n/b)}{\log\bigl(1/(1 - \eta\lambda)\bigr)}, \tag{94} \]

where we used (83). The proof is complete.

G.7 Proof of Proposition 5

Proof. We recall that the loss is given by:
\[ \ell(w, z_i) := \|y_i - \hat{y}_i\|^2 + \frac{\lambda}{2}\|w\|^2, \qquad \hat{y}_i := \sum_{r=1}^{m} b_r\sigma\bigl(w_r^T a_i\bigr), \tag{95} \]
where the non-linearity $\sigma: \mathbb{R}\to\mathbb{R}$ is smooth and $\lambda > 0$ is a regularization parameter. Note that we can re-write (95) as $\ell(w, z_i) = \bigl\|y_i - b^T\sigma(w_r^T a_i)\bigr\|^2 + \lambda\|w\|^2/2$. We can compute that
\[ \frac{\partial\ell(w, z_i)}{\partial w_r} = -(y_i - \hat{y}_i)\frac{\partial\hat{y}_i}{\partial w_r} + \lambda w_r = -(y_i - \hat{y}_i)b_r\sigma'(w_r^T a_i)a_i + \lambda w_r. \tag{96} \]

Therefore,
\[ \nabla\ell(w, z_i) = -(y_i - \hat{y}_i)v_i + \lambda w, \qquad \text{where} \quad v_i := \begin{bmatrix} b_1\sigma'(w_1^T a_i)a_i\\ b_2\sigma'(w_2^T a_i)a_i\\ \vdots\\ b_m\sigma'(w_m^T a_i)a_i \end{bmatrix}, \]
with
\[ h_i(w) = w - \frac{\eta}{b}\sum_{j\in S_i}\nabla\ell(w, z_j), \]


and
\[
\begin{aligned}
Jh_i(w) &= (1 - \eta\lambda)I - \frac{\eta}{b}\sum_{j\in S_i} v_j\otimes v_j^T - \frac{\eta}{b}\sum_{j\in S_i}(y_j - \hat{y}_j)
\begin{bmatrix}
b_1\sigma''(w_1^T a_j)a_j a_j^T & 0_d & \ldots & 0_d\\
0_d & b_2\sigma''(w_2^T a_j)a_j a_j^T & \ldots & 0_d\\
\vdots & \vdots & \ddots & \vdots\\
0_d & 0_d & \ldots & b_m\sigma''(w_m^T a_j)a_j a_j^T
\end{bmatrix}\\
&= (1 - \eta\lambda)I - \frac{\eta}{b}\sum_{j\in S_i}\mathrm{diag}\bigl(\{B^{(j)}_r\}_{r=1}^m\bigr),
\end{aligned} \tag{97}
\]

where $\{S_i\}_{i=1}^{m_b}$ is a partition of $\{1, 2, \ldots, n\}$ with $|S_i| = b$, where $i = 1, 2, \ldots, m_b$ with $m_b = n/b$, and
\[ B^{(i)}_r := b_r\Bigl[-(y_i - \hat{y}_i)\sigma''(w_r^T a_i) + \bigl(\sigma'(w_r^T a_i)\bigr)^2\Bigr]a_i a_i^T, \tag{98} \]
and $0_d$ is a $d\times d$ zero matrix and $\mathrm{diag}(\{B^{(i)}_r\}_{r=1}^m)$ denotes a block-diagonal matrix with the matrices $B^{(i)}_r$ on the diagonal. We assume that the outputs $y_i$, the activation function $\sigma$, and its second derivative $\sigma''$ are bounded. This would, for instance, clearly hold for classification problems, where $y_i$ can take integer values in a compact set, with a sigmoid or hyperbolic tangent activation function. Then, under this assumption, there exists a constant $M_y > 0$ such that $\max_i\|y_i - \hat{y}_i\| \le M_y$. Then for $\eta \in \bigl(0, \frac{1}{2\lambda}\bigr)$ and $\lambda > C$, where $C := M_y\|b\|_\infty\|\sigma''\|_\infty R^2 + (\max_j\|v_j\|_\infty)^2$, we get

This implies that

dimHµW |Sn ≤logmb

log(1/(1− η(λ− C)))=

log (n/b)

log(1/(1− η(λ− C))), (99)

where we used (83). The proof is complete.

G.8 Proof of Proposition 8

Proof. Recall that $H$ is positive-definite and there exist some $m, M > 0$ such that:
\[ 0 \prec mI \preceq H \preceq MI. \tag{100} \]
We have
\[ h_i(u) - h_i(v) = M_i(u - v), \]
where
\[ 0 \preceq \bigl(1 - \eta\lambda m^{-1} - \eta m^{-1}R_i^2\bigr)I \preceq M_i \preceq \bigl(1 - \eta\lambda M^{-1}\bigr)I, \tag{101} \]
where $R_i := \max_{j\in S_i}\|a_j\|$, and we recall the assumption that $\eta < \frac{m}{R^2 + \lambda}$, with $R := \max_i R_i$. Hence, $h_i$ is bi-Lipschitz in the sense of [Anc16], i.e.,
\[ \gamma_i\|u - v\| \le \|h_i(u) - h_i(v)\| \le \Gamma_i\|u - v\|, \]
with
\[
\begin{aligned}
\gamma_i &= \min\bigl(\bigl|1 - \eta\lambda m^{-1} - \eta m^{-1}R_i^2\bigr|,\,\bigl|1 - \eta M^{-1}\lambda\bigr|\bigr), \qquad &(102)\\
\Gamma_i &= \max\bigl(\bigl|1 - \eta\lambda m^{-1} - \eta m^{-1}R_i^2\bigr|,\,\bigl|1 - \eta M^{-1}\lambda\bigr|\bigr) < 1, \qquad &(103)
\end{aligned}
\]
as long as $\gamma_i > 0$. We recall the assumption $\eta < \frac{m}{R^2 + \lambda}$, where $R := \max_i R_i$, in which case the expressions for $\gamma_i$ and $\Gamma_i$ simplify to:
\[ \gamma_i = 1 - \eta m^{-1}\lambda - \eta m^{-1}R_i^2, \qquad \Gamma_i = 1 - \eta M^{-1}\lambda. \]


In this case, it is easy to see that
\[ 0 < \gamma_i \le \|Jh_i(w)\| \le \Gamma_i < 1, \]
and it follows from Theorem 2 that
\[ \dim_H\mu_{W|S_n} \le \frac{\mathrm{Ent}}{\sum_{i=1}^{m_b} p_i\log(\Gamma_i)} \le \frac{\log m_b}{\log\bigl(1/(1 - \eta M^{-1}\lambda)\bigr)} = \frac{\log(n/b)}{\log\bigl(1/(1 - \eta M^{-1}\lambda)\bigr)}, \tag{104} \]

where we used (83). The proof is complete.

G.9 Proof of Proposition 9

Proof. Similarly to the proof of Proposition 3, we can compute that the Jacobian is given by
\[ Jh_i(w) = \frac{1}{b}\sum_{j\in S_i}\Bigl(I - \eta\lambda H^{-1} + \eta H^{-1} y_j^2\Bigl[\frac{e^{-y_j a_j^T w}}{(1 + e^{-y_j a_j^T w})^2}\Bigr]a_j a_j^T\Bigr), \tag{105} \]
where $\{S_i\}_{i=1}^{m_b}$ is a partition of $\{1, 2, \ldots, n\}$ with $|S_i| = b$, where $i = 1, 2, \ldots, m_b$ with $m_b = n/b$, and $H$ is a positive-definite matrix with $0 \prec mI \preceq H \preceq MI$. Recall that the input data is bounded, i.e. $\max_{j\in S_i}\|a_j\| \le R_i$ for some $R_i$, and that $R := \max_i R_i$ satisfies $R < 2\sqrt{m\lambda/M}$. Also recall that the step-size is sufficiently small, i.e. $\eta < m/\lambda$. One can provide upper and lower bounds on $Jh_i(w)$:
\[
\begin{aligned}
\|Jh_i(w)\| &\le \Gamma_i := 1 - \eta M^{-1}\lambda + \tfrac{1}{4}\eta m^{-1}R_i^2, \qquad &(106)\\
\|Jh_i(w)\| &\ge \gamma_i := 1 - \eta m^{-1}\lambda, \qquad &(107)
\end{aligned}
\]

so that
\[
\begin{aligned}
\dim_H\mu_{W|S_n} \le \frac{-\mathrm{Ent}}{\sum_{i=1}^{m_b} p_i\log\bigl(1/(1 - \eta M^{-1}\lambda + \frac{1}{4}\eta m^{-1}R_i^2)\bigr)} &\le \frac{\log m_b}{\log\bigl(1/(1 - \eta M^{-1}\lambda + \frac{1}{4}\eta m^{-1}R^2)\bigr)} \qquad &(108)\\
&= \frac{\log(n/b)}{\log\bigl(1/(1 - \eta M^{-1}\lambda + \frac{1}{4}\eta m^{-1}R^2)\bigr)}, \qquad &(109)
\end{aligned}
\]

where we used (83) in (108). The proof is complete.

G.10 Proof of Proposition 10

Proof. Similarly to the proof of Proposition 4, we can compute that
\[ Jh_i(w) = \frac{1}{b}\sum_{j\in S_i}\Bigl(\bigl(I - \eta\lambda_r H^{-1}\bigr) - \eta H^{-1} a_j a_j^T\rho''\bigl(y_j - \langle w, a_j\rangle\bigr)\Bigr), \tag{110} \]
where $\{S_i\}_{i=1}^{m_b}$ is a partition of $\{1, 2, \ldots, n\}$ with $|S_i| = b$, where $i = 1, 2, \ldots, m_b$ with $m_b = n/b$, and $H$ is a positive-definite matrix with $0 \prec mI \preceq H \preceq MI$. For the function $\rho$, a standard choice is the exponential squared loss: $\rho_{\exp}(t) = 1 - e^{-|t|^2/t_0}$, where $t_0 > 0$ is a tuning parameter. We can compute that $\|\rho''_{\exp}\|_\infty = \rho''_{\exp}(0) = \frac{2}{t_0}$. Therefore, for $\eta \in \bigl(0, \frac{m}{\lambda_r + R^2(2/t_0)}\bigr)$,
\[ 0 < 1 - \eta m^{-1}\lambda_r - \eta m^{-1}R^2\frac{2}{t_0} \le \|Jh_i(w)\| \le 1 - \eta M^{-1}\lambda_r + \eta m^{-1}R^2\frac{2}{t_0}, \tag{111} \]

where $R = \max_i\|a_i\|$ and we recall that $R < \sqrt{\lambda_r t_0 m/(2M)}$. We have
\[ \dim_H\mu_{W|S_n} \le \frac{\log m_b}{\log\bigl(1/(1 - \eta M^{-1}\lambda_r + \eta m^{-1}R^2\frac{2}{t_0})\bigr)} = \frac{\log(n/b)}{\log\bigl(1/(1 - \eta M^{-1}\lambda_r + \eta m^{-1}R^2\frac{2}{t_0})\bigr)}, \tag{112} \]

where we used (83). The proof is complete.

G.11 Proof of Proposition 11

Proof. Similarly to the proof of Proposition 7, we can compute that
\[ Jh_i(w) = I - \frac{\eta}{b}H^{-1}\sum_{j\in S_i}\nabla^2\ell(w, z_j) = I - \eta\lambda H^{-1} - \frac{\eta}{b}H^{-1}\sum_{j\in S_i} y_j^2\ell''_\sigma\bigl(y_j a_j^T w\bigr)a_j a_j^T, \]
where $\{S_i\}_{i=1}^{m_b}$ is a partition of $\{1, 2, \ldots, n\}$ with $|S_i| = b$, where $i = 1, 2, \ldots, m_b$ with $m_b = n/b$, and $H$ is a positive-definite matrix with $0 \prec mI \preceq H \preceq MI$, and
\[ \ell''_\sigma(z) = \frac{1}{\sigma}\,\frac{e^{-(1-z)/\sigma}}{\bigl(1 + e^{-(1-z)/\sigma}\bigr)^2} \ge 0, \qquad \|\ell''_\sigma\|_\infty = \ell''_\sigma(1) = \frac{1}{4\sigma}. \]
Therefore, if $\eta \in \bigl(0, \frac{m}{\lambda + R^2/(4\sigma)}\bigr)$, where $R := \max_i\|a_i\|$, then
\[ 1 - \eta m^{-1}\lambda - \frac{\eta}{4\sigma}m^{-1}R^2 \le \|Jh_i(w)\| \le 1 - \eta M^{-1}\lambda. \]

This implies that
\[ \dim_H\mu_{W|S_n} \le \frac{\log m_b}{\log\bigl(1/(1 - \eta M^{-1}\lambda)\bigr)} = \frac{\log(n/b)}{\log\bigl(1/(1 - \eta M^{-1}\lambda)\bigr)}, \tag{113} \]
where we used (83). The proof is complete.

G.12 Proof of Proposition 12

Proof. By following similar derivations as in Proposition 5, we obtain
\[ Jh_i(w) = I - \eta\lambda H^{-1} - \frac{\eta}{b}H^{-1}\sum_{j\in S_i}\mathrm{diag}\bigl(\{B^{(j)}_r\}_{r=1}^m\bigr), \tag{114} \]
where $\{S_i\}_{i=1}^{m_b}$ is a partition of $\{1, 2, \ldots, n\}$ with $|S_i| = b$, where $i = 1, 2, \ldots, m_b$ with $m_b = n/b$, $H$ is a positive-definite matrix with $0 \prec mI \preceq H \preceq MI$ for some $m, M > 0$, and $\mathrm{diag}(\{B^{(i)}_r\}_{r=1}^m)$ denotes the block-diagonal matrix with the matrices $B^{(i)}_r$ defined in Proposition 5 on the diagonal. As in Proposition 5, there exists a constant $M_y > 0$ such that $\max_i\|y_i - \hat{y}_i\| \le M_y$. Then for $\eta \in \bigl(0, \frac{m}{C + \lambda}\bigr)$ and $\lambda > \frac{M}{m}C$, where $C := M_y\|b\|_\infty\|\sigma''\|_\infty R^2 + (\max_j\|v_j\|_\infty)^2$, we get
\[ 1 - \eta m^{-1}(C + \lambda) \le \|Jh_i(w)\| \le 1 - \eta\bigl(M^{-1}\lambda - m^{-1}C\bigr). \]


This implies that
\[ \dim_H\mu_{W|S_n} \le \frac{\log m_b}{\log\bigl(1/(1 - \eta(M^{-1}\lambda - m^{-1}C))\bigr)} = \frac{\log(n/b)}{\log\bigl(1/(1 - \eta(M^{-1}\lambda - m^{-1}C))\bigr)}, \tag{115} \]
where we used (83). The proof is complete.
