arXiv:1510.03925v1 [math.PR] 13 Oct 2015

On Equivalence of Martingale Tail Bounds and Deterministic Regret Inequalities

Alexander Rakhlin

University of Pennsylvania

Karthik Sridharan

Cornell University

October 15, 2015

Abstract

We study an equivalence of (i) deterministic pathwise statements appearing in the online learning literature (termed regret bounds), (ii) high-probability tail bounds for the supremum of a collection of martingales (of a specific form arising from uniform laws of large numbers for martingales), and (iii) in-expectation bounds for the supremum. By virtue of the equivalence, we prove exponential tail bounds for norms of Banach space valued martingales via deterministic regret bounds for the online mirror descent algorithm with an adaptive step size. We extend these results beyond the linear structure of the Banach space: we define a notion of martingale type for general classes of real-valued functions and show its equivalence (up to a logarithmic factor) to various sequential complexities of the class (in particular, the sequential Rademacher complexity and its offset version). For classes with the general martingale type 2, we exhibit a finer notion of variation that allows partial adaptation to the function indexing the martingale. Our proof technique rests on sequential symmetrization and on certifying the existence of regret minimization strategies for certain online prediction problems.

1 Introduction

Let $Z_1, \ldots, Z_n$ be a martingale difference sequence taking values in a separable $(2,D)$-smooth Banach space $(\mathcal{B}, \|\cdot\|)$. A result due to Pinelis [17] asserts that for any $u > 0$

$$P\left(\sup_{n\ge 1}\left\|\sum_{t=1}^{n} Z_t\right\| \ge \sigma u\right) \le 2\exp\left\{-\frac{u^2}{2D^2}\right\}, \tag{1}$$

where $\sigma$ is a constant satisfying $\sum_{t=1}^{\infty} \|Z_t\|_\infty^2 \le \sigma^2$. Writing the norm $\|x\| = \sup_{\|y\|_*\le 1} \langle y, x\rangle$ as the supremum over the dual ball, we may re-interpret (1) as a one-sided tail control for the supremum of a stochastic process $\left\{y \mapsto \sum_{t=1}^{n} \langle y, Z_t\rangle : \|y\|_* \le 1\right\}$. In this paper, we consider several extensions of (1), motivated by the following questions:

(a) Can (1) be strengthened by replacing σ with a “path-dependent” version of variation?

(b) Does a version of (1) hold when we move away from the linear structure of the Banach space?

Positive answers to these questions constitute the first contribution of our paper. The second contribution involves the actual technique. The cornerstone of our analysis is a certain equivalence of martingale inequalities and deterministic pathwise statements. The latter inequalities are studied in the field of online learning (or, sequential prediction), and are referred to as regret bounds. We show that the existence (which can be certified via the minimax theorem) of prediction strategies that minimize regret yields predictable processes that help in answering (a) and (b). The equivalence is exploited in both directions, whereby stronger regret bounds are derived from the corresponding probabilistic bounds, and vice versa. To obtain one of the main results in the paper, we sharpen the bound by passing several times between the deterministic statements and probabilistic tail bounds. The equivalence asserts a strong connection between probabilistic inequalities for martingales and online learning algorithms.

In the remainder of this section, we present a simple example of the equivalence based on the gradient descent method, arguably the most popular convex optimization procedure. The example captures, loosely speaking, a correspondence between deterministic optimization methods and probabilistic bounds. Consider the unit Euclidean ball $B$ in $\mathbb{R}^d$. Let $z_1, \ldots, z_n \in B$ and define, recursively, the Euclidean projections

$$y_{t+1} = y_{t+1}(z_1, \ldots, z_t) = \mathrm{Proj}_B\left(y_t - n^{-1/2} z_t\right) \tag{2}$$

for each $t = 1, \ldots, n$, with the initial value $y_1 = 0$. Elementary algebra (see footnote 1) shows that for any $f \in B$, the regret inequality $\sum_{t=1}^{n} \langle y_t - f, z_t\rangle \le \sqrt{n}$ holds deterministically for any sequence $z_1, \ldots, z_n \in B$. We re-write this statement as

$$\left\|\sum_{t=1}^{n} z_t\right\| - \sqrt{n} \le \sum_{t=1}^{n} \langle y_t, -z_t\rangle . \tag{3}$$
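The deterministic nature of (3) can be illustrated numerically. The following is a minimal sketch, not taken from the paper: it runs update (2) on an arbitrary sequence in the unit ball and returns both sides of (3). The helper names and the random test sequence are assumptions made for illustration only.

```python
import math
import random

def project_to_ball(v):
    """Euclidean projection of v onto the unit ball B."""
    norm = math.sqrt(sum(c * c for c in v))
    return list(v) if norm <= 1.0 else [c / norm for c in v]

def gd_iterates(zs):
    """Iterates y_1, ..., y_n of update (2): step size n^{-1/2}, y_1 = 0."""
    n, d = len(zs), len(zs[0])
    step = 1.0 / math.sqrt(n)
    y, ys = [0.0] * d, []
    for z in zs:
        ys.append(y)  # y_t is fixed before z_t is revealed
        y = project_to_ball([yi - step * zi for yi, zi in zip(y, z)])
    return ys

def sides_of_inequality_3(zs):
    """Return (lhs, rhs) of (3): ||sum_t z_t|| - sqrt(n) and sum_t <y_t, -z_t>."""
    n, d = len(zs), len(zs[0])
    ys = gd_iterates(zs)
    total = [sum(z[i] for z in zs) for i in range(d)]
    lhs = math.sqrt(sum(c * c for c in total)) - math.sqrt(n)
    rhs = -sum(yi * zi for y, z in zip(ys, zs) for yi, zi in zip(y, z))
    return lhs, rhs

def random_sequence_in_ball(n, d, seed=0):
    """Hypothetical test input: Gaussian vectors rescaled to lie in B."""
    rng = random.Random(seed)
    zs = []
    for _ in range(n):
        v = [rng.gauss(0.0, 1.0) for _ in range(d)]
        norm = math.sqrt(sum(c * c for c in v))
        zs.append([c / max(1.0, norm) for c in v])
    return zs

lhs, rhs = sides_of_inequality_3(random_sequence_in_ball(200, 5))
print(lhs <= rhs + 1e-9)
```

Because (3) is a pathwise statement, the same check should pass for any sequence in the ball, including adversarial ones such as a constant sequence.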

Applying the deterministic inequality to a $B$-valued martingale difference sequence $-Z_1, \ldots, -Z_n$,

$$P\left(\left\|\sum_{t=1}^{n} Z_t\right\| - \sqrt{n} > u\right) \le P\left(\sum_{t=1}^{n} \langle y_t, Z_t\rangle > u\right) \le \exp\left\{-\frac{u^2}{2n}\right\} . \tag{4}$$

The latter upper bound is an application of the Azuma-Hoeffding inequality. Indeed, the process $(y_t)$ is predictable with respect to $\sigma(Z_1, \ldots, Z_t)$, and thus $(\langle y_t, Z_t\rangle)$ is a $[-1,1]$-valued martingale difference sequence. It is worth emphasizing the conclusion: one-sided deviation tail bounds for a norm of a vector-valued martingale can be deduced from tail bounds for real-valued martingales with the help of a deterministic inequality. Next, integrating the tail bound in (4) yields a seemingly weaker in-expectation statement

$$\mathbb{E}\left\|\sum_{t=1}^{n} Z_t\right\| \le c\sqrt{n} \tag{5}$$

for an appropriate constant $c$. The twist in this uncomplicated story comes next: with the help of the minimax theorem, [23] established existence of strategies $(y_t)$ such that

$$\forall z_1, \ldots, z_n, f \in B, \qquad \sum_{t=1}^{n} \langle y_t - f, z_t\rangle \le \sup \mathbb{E}\left\|\sum_{t=1}^{n} Z_t\right\| , \tag{6}$$

with the supremum taken over all martingale difference sequences with respect to a dyadic filtration. In view of (5), this bound is $c\sqrt{n}$.
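The chain of inequalities in (4) can be checked exactly on a small example. The sketch below is an illustration under assumptions that are not in the paper: the martingale differences are taken to be $Z_t = \epsilon_t v_t$ for fixed unit vectors $v_t$ and independent Rademacher signs $\epsilon_t$, so that all $2^n$ sample paths can be enumerated. The function name `tail_chain` is hypothetical.

```python
import itertools
import math

def project_to_ball(v):
    """Euclidean projection of v onto the unit ball B."""
    norm = math.sqrt(sum(c * c for c in v))
    return list(v) if norm <= 1.0 else [c / norm for c in v]

def tail_chain(vs, u):
    """Exact values of the three quantities in (4) for Z_t = eps_t * v_t."""
    n, d = len(vs), len(vs[0])
    step = 1.0 / math.sqrt(n)
    norm_hits = mart_hits = 0
    for eps in itertools.product((-1, 1), repeat=n):
        y, total, mart = [0.0] * d, [0.0] * d, 0.0
        for e, v in zip(eps, vs):
            z_cap = [e * c for c in v]                    # Z_t
            mart += sum(a * b for a, b in zip(y, z_cap))  # <y_t, Z_t>, y_t predictable
            total = [a + b for a, b in zip(total, z_cap)]
            # update (2) applied to the sequence -Z_t: y - step * (-Z_t)
            y = project_to_ball([a + step * b for a, b in zip(y, z_cap)])
        norm_hits += math.sqrt(sum(c * c for c in total)) - math.sqrt(n) > u
        mart_hits += mart > u
    return norm_hits / 2 ** n, mart_hits / 2 ** n, math.exp(-u * u / (2 * n))

# Assumed example: ten unit vectors in the plane.
vs = [[math.cos(t), math.sin(t)] for t in range(1, 11)]
for u in (0.5, 1.5, 3.0):
    print(tail_chain(vs, u))
```

Each returned triple lists the probability on the left of (4), the probability in the middle, and the Azuma-Hoeffding bound on the right; they should appear in nondecreasing order.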

What have we achieved? Let us summarize. The deterministic inequality (3), which holds for all sequences, implies a tail bound (4). The latter, in turn, implies an in-expectation bound (5), which implies (3) (with a worse constant) through a minimax argument, thus closing the loop. The equivalence, studied in depth in this paper, is informally stated below:

Footnote 1: See the two-line proof in the Appendix, Lemma 19.

Informal: The following bounds imply each other: (a) an inequality that holds for all sequences; (b) a deviation tail probability for the size of a martingale; (c) an in-expectation bound on the size of a martingale.

The equivalence, in particular, allows us to amplify the in-expectation bounds to appropriate high-probability tail bounds.

As already mentioned, the pathwise inequalities, such as (3), are extensively studied in the field of online learning. In this paper, we employ some of the recently developed data-dependent (adaptive) regret inequalities to prove tail bounds for martingales. In turn, in view of the above equivalence, martingale inequalities shall give rise to novel deterministic regret bounds.

While writing the paper, we learned of the trajectorial approach, extensively studied in recent years. In particular, it has been shown that Doob's maximal inequalities and Burkholder-Davis-Gundy inequalities have deterministic counterparts [2, 3, 13, 4]. The online learning literature contains a trove of pathwise inequalities, and further synthesis with the trajectorial approach (and the applications in mathematical finance) appears to be a promising direction.

This paper is organized as follows. In the next section, we extend the Euclidean result to martingales with values in Banach spaces, and also improve it by replacing $\sqrt{n}$ with the square root of variation. We define a notion of martingale type for general classes of functions in Section 3, and exhibit a tight connection to the growth of sequential Rademacher complexity. Section 4 presents sequential symmetrization; here we prove that statements for the dyadic filtration automatically yield corresponding tail bounds for general discrete-time stochastic processes. In Section 5, we introduce the machinery for obtaining regret inequalities, and show how these inequalities allow one to amplify certain in-expectation bounds into high-probability statements (Section 6). The last two sections contain some of the main results: in Section 7 we prove a high probability bound for the notion of martingale type, and in Section 8 we present a finer analysis of adaptivity of the variation term.

2 Results in Banach spaces

For the case of the Euclidean (or Hilbertian) norm, it is easy to see that the $\sqrt{n}$ bound of (5) can be improved to a distribution-dependent quantity $\left(\sum_{t=1}^{n} \mathbb{E}\|Z_t\|^2\right)^{1/2}$. Given the equivalence sketched earlier, one may wonder whether this implies existence of a gradient-descent-like method with a sequence-dependent variation governing the rate of convergence of this optimization procedure. Below, we indeed present such a method for 2-smooth Banach spaces.

Let $(\mathcal{B}, \|\cdot\|)$ be a separable Banach space, and let $(\mathcal{B}^*, \|\cdot\|_*)$ denote its dual. The space $(\mathcal{B}, \|\cdot\|)$ is of martingale type $p$ (for $p \in [1,2]$) if there exists a constant $C$ such that

$$\mathbb{E}\left\|\sum_{t=1}^{n} Z_t\right\|^p \le C^p \sum_{t=1}^{n} \mathbb{E}\|Z_t\|^p \tag{7}$$

for any $\mathcal{B}$-valued martingale difference sequence. The best possible constant $C$ in this inequality (as well as its finiteness) is known to depend on the geometry of the Banach space. For instance, for a Hilbert space (7) holds for $p = 2$ with constant $C = 1$. On the other hand, the triangle inequality implies that any space has the trivial type $p = 1$.
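The Hilbert-space case $p = 2$, $C = 1$ of (7) can be verified by exact enumeration: cross terms vanish because each sign is independent of the past, so (7) holds with equality. The sketch below is an illustration only; the dyadic form $Z_t = \epsilon_t x_t(\epsilon_{1:t-1})$ and the particular `example_tree` are assumptions, not constructions from the paper.

```python
import itertools

def second_moment_identity(tree, n):
    """Return (E||sum_t Z_t||^2, sum_t E||Z_t||^2) for Z_t = eps_t * tree(eps_{1:t-1})."""
    d = len(tree(()))
    lhs = rhs = 0.0
    for eps in itertools.product((-1, 1), repeat=n):
        total = [0.0] * d
        for t in range(n):
            x = tree(eps[:t])  # predictable: depends on eps_1..eps_{t-1} only
            total = [a + eps[t] * b for a, b in zip(total, x)]
            rhs += sum(b * b for b in x)
        lhs += sum(a * a for a in total)
    return lhs / 2 ** n, rhs / 2 ** n

def example_tree(prefix):
    """Hypothetical R^2-valued predictable process built from the past signs."""
    return [1.0 + sum(prefix), float(len(prefix)) - 2.0 * prefix.count(-1)]

print(second_moment_identity(example_tree, 6))
```

The two returned numbers should coincide, which is the statement that a Hilbert space has martingale type 2 with constant 1 for this family of martingales.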


An equivalent way to define martingale type p is to ask that there exist a constant C such that

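$$\mathbb{E}\sup_{\|y\|_*\le 1} \sum_{t=1}^{n} \langle y, Z_t\rangle = \mathbb{E}\left\|\sum_{t=1}^{n} Z_t\right\| \le C\left(\sum_{t=1}^{n} \mathbb{E}\|Z_t\|^p\right)^{1/p} . \tag{8}$$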

We now show that the strengthening to a sequence-dependent variation holds for any 2-smooth Banach space. Based on the equivalence mentioned earlier, we immediately obtain tail bounds.

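Assume $\|\cdot\|$ is 2-smooth. Let $D_R : \mathcal{B}^* \times \mathcal{B}^* \to \mathbb{R}$ be the Bregman divergence with respect to a convex function $R$, which is assumed to be 1-strongly convex on the unit ball of $\mathcal{B}^*$. Denote $R_{\max}^2 \triangleq \sup_{f,g} D_R(f, g)$, where the supremum is over $f, g$ in this unit ball. We extend and improve (4) as follows.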

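Theorem 1. Let $Z_1, \ldots, Z_n$ be a $\mathcal{B}$-valued martingale difference sequence, and let $\mathbb{E}_t$ stand for the conditional expectation given $Z_1, \ldots, Z_t$. For any $u > 0$, it holds that

$$P\left(\frac{\left\|\sum_{t=1}^{n} Z_t\right\| - 2.5 R_{\max}\left(\sqrt{V_n} + 1\right)}{\sqrt{V_n + W_n + \left(\mathbb{E}\sqrt{V_n + W_n}\right)^2}} > u\right) \le \sqrt{2}\exp\left\{-u^2/16\right\} , \tag{9}$$

where

$$V_n = \sum_{t=1}^{n} \|Z_t\|^2 \quad\text{and}\quad W_n = \sum_{t=1}^{n} \mathbb{E}_{t-1}\|Z_t\|^2 . \tag{10}$$

Furthermore, the bound holds with $W_n \equiv 0$ if the martingale differences are conditionally symmetric.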

In addition to extending the Euclidean result of the previous section to Banach spaces, (9) offers several advantages. First, it is $n$-independent. Second, deviations are self-normalized (that is, scaled by root-variation terms). We refer to Lemma 11 for other forms of probabilistic bounds.

To prove the theorem, we start with a deterministic inequality from [21, Corollary 2]. For completeness, the proof is provided in the Appendix.

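Lemma 2. Let $\mathcal{F} \subset \mathcal{B}^*$ be a convex set. Define, recursively,

$$y_{t+1} = y_{t+1}(z_1, \ldots, z_t) = \operatorname*{argmin}_{f\in\mathcal{F}} \left\{\eta_t \langle f, z_t\rangle + D_R(f, y_t)\right\} \tag{11}$$

with $y_0 = 0$, $\eta_t \triangleq R_{\max}\min\left\{1, \left(\sqrt{\sum_{s=1}^{t} \|z_s\|^2} + \sqrt{\sum_{s=1}^{t-1} \|z_s\|^2}\right)^{-1}\right\}$, and with $R_{\max}^2 \triangleq \sup_{f,g\in\mathcal{F}} D_R(f, g)$. Then for any $f \in \mathcal{F}$ and any $z_1, \ldots, z_n \in \mathcal{B}$,

$$\sum_{t=1}^{n} \langle y_t - f, z_t\rangle \le 2.5 R_{\max}\left(\sqrt{\sum_{t=1}^{n} \|z_t\|^2} + 1\right) .$$

Proof of Theorem 1. We take $\mathcal{F}$ to be the unit ball in $\mathcal{B}^*$, ensuring $\|y_t\|_* \le 1$. For any martingale difference sequence $(Z_t)$ with values in $\mathcal{B}$, the above lemma implies, by definition of the norm,

$$\left\|\sum_{t=1}^{n} Z_t\right\| - 2.5 R_{\max}\left(\sqrt{V_n} + 1\right) \le \sum_{t=1}^{n} \langle y_t, Z_t\rangle \tag{12}$$

for all sample paths. Dividing both sides by $\sqrt{V_n + W_n + \left(\mathbb{E}\sqrt{V_n + W_n}\right)^2}$, we conclude that the left-hand side in (9) is upper bounded by

$$P\left(\frac{\sum_{t=1}^{n} \langle y_t, Z_t\rangle}{\sqrt{V_n + W_n + \left(\mathbb{E}\sqrt{V_n + W_n}\right)^2}} > u\right) . \tag{13}$$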

To control this probability, we recall the following result [8, Theorem 2.7]:

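Theorem 3 ([8]). For a pair of random variables $A, B$, with $B > 0$, such that

$$\mathbb{E}\exp\left\{\lambda A - \lambda^2 B^2/2\right\} \le 1 \quad \forall \lambda \in \mathbb{R}, \tag{14}$$

it holds that

$$P\left(\frac{|A|}{\sqrt{B^2 + (\mathbb{E}B)^2}} > u\right) \le \sqrt{2}\exp\left\{-u^2/4\right\} .$$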

To apply this theorem, we verify assumption (14):

Lemma 4. The random variables $A = \sum_{t=1}^{n} \langle y_t, Z_t\rangle$ and $B^2 = 4\sum_{t=1}^{n} \left(\|Z_t\|^2 + \mathbb{E}_{t-1}\|Z_t\|^2\right)$ satisfy (14).

The proof of the Lemma, as well as most of the proofs in this paper, is postponed to the Appendix. This concludes the proof of Theorem 1.
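The adaptive step size of Lemma 2 can be illustrated in the Euclidean special case. The sketch below makes several assumptions that are not in the paper: it takes $R(f) = \frac{1}{2}\|f\|_2^2$, so that $D_R(f,g) = \frac{1}{2}\|f-g\|_2^2$, update (11) reduces to a projected gradient step, and $R_{\max} = \sqrt{2}$ on the unit ball; it also starts the recursion from a first iterate equal to zero. The function name `adaptive_regret` is hypothetical.

```python
import math
import random

def project_to_ball(v):
    """Euclidean projection of v onto the unit ball."""
    norm = math.sqrt(sum(c * c for c in v))
    return list(v) if norm <= 1.0 else [c / norm for c in v]

def adaptive_regret(zs):
    """Worst-case regret of the adaptive method and the bound of Lemma 2."""
    r_max = math.sqrt(2.0)  # sup of ||f - g||^2 / 2 over the unit ball equals 2
    d = len(zs[0])
    y, total = [0.0] * d, [0.0] * d
    inner, s_prev = 0.0, 0.0  # s_prev = sum_{s < t} ||z_s||^2
    for z in zs:
        inner += sum(a * b for a, b in zip(y, z))
        total = [a + b for a, b in zip(total, z)]
        s_curr = s_prev + sum(c * c for c in z)
        denom = math.sqrt(s_curr) + math.sqrt(s_prev)
        eta = r_max * (1.0 if denom <= 1.0 else 1.0 / denom)  # R_max * min{1, 1/denom}
        y = project_to_ball([a - eta * b for a, b in zip(y, z)])
        s_prev = s_curr
    # sup over f in the unit ball of sum_t <y_t - f, z_t>
    regret = inner + math.sqrt(sum(c * c for c in total))
    bound = 2.5 * r_max * (math.sqrt(s_prev) + 1.0)
    return regret, bound

rng = random.Random(0)
zs = [[rng.uniform(-3.0, 3.0) for _ in range(4)] for _ in range(500)]
print(adaptive_regret(zs))
```

Note that the sequence need not be bounded by one in norm and the horizon $n$ is never used by the method, which matches the $n$-independence emphasized after Theorem 1.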

Let us make several remarks. First, [21, Corollary 2] proves a more general deterministic inequality: for any collection of functions $M_t = M_t(z_1, \ldots, z_{t-1})$, there exists a strategy $(y_t)$ such that

$$\forall z_1, \ldots, z_n \in \mathcal{B}, \qquad \sum_{t=1}^{n} \langle y_t - f, z_t\rangle \le 4.5 R_{\max}\left(\sqrt{\sum_{t=1}^{n} \|z_t - M_t\|^2} + 1\right) .$$

Second, the reader will notice that the pathwise inequality (12) does not depend on $n$ and the construction of $y_t$ is also oblivious to this value. A simple argument (Lemma 20 in the Appendix) then allows us to lift the real-valued Burkholder-Davis-Gundy inequality (with the constant from [6]) to the Banach space valued martingales:

$$\mathbb{E}\max_{s=1,\ldots,n} \left\|\sum_{t=1}^{s} Z_t\right\| \le \left(2.5 R_{\max} + \sqrt{3}\right)\mathbb{E}\sqrt{V_n} + 2.5 R_{\max} .$$

Notably, the constant in the resulting BDG inequality is proportional to $R_{\max}$.

We also remark that Theorem 1 can be naturally extended to $p$-smooth Banach spaces $\mathcal{B}$. This is accomplished in a straightforward manner by extending Lemma 2.

In conclusion, we were able to replace the distribution-independent $\sqrt{n}$ bound with a sequence-dependent quantity $V_n$. One may ask whether this phenomenon is general; that is, whether a sequence-dependent variation bound necessarily holds whenever the corresponding distribution-independent bound does. We prove in Theorem 5 below that this is indeed the case (up to a logarithmic factor), a result that holds for general classes of functions.

3 Martingale Type for a General Class of Functions

We now define the analogue of a martingale type for a class $\mathcal{G}$ of real-valued measurable functions on some abstract measurable space $\mathcal{Z}$. To this end, we assume that $(Z_1, \ldots, Z_n)$ is a discrete time process on a probability space $(\Omega, \mathcal{A}, P)$. Let $\mathbb{E}$ denote the expectation on this probability space, and let $\mathbb{E}_{t-1}$ denote the conditional (given $Z_1, \ldots, Z_{t-1}$) expectation with respect to $Z_t$. For any $g : \mathcal{Z} \to \mathbb{R}$,

$$\sum_{t=1}^{n} \left(g(Z_t) - \mathbb{E}_{t-1}[g(Z_t)]\right) \tag{15}$$

is a sum of martingale differences $g(Z_t) - \mathbb{E}_{t-1}[g(Z_t)]$. We let $Z'_1, \ldots, Z'_n$ be a tangent sequence; that is, $Z'_t$ and $Z_t$ are independent and identically distributed conditionally on $Z_1, \ldots, Z_{t-1}$. Let $\mathbb{E}'_{t-1}$ denote the conditional (given $Z_1, \ldots, Z_{t-1}$) expectation with respect to $Z'_t$.

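Definition 1. A class $\mathcal{G} \subset \mathbb{R}^{\mathcal{Z}}$ has martingale type $p$ if there exists a constant $C$ such that

$$\mathbb{E}\left[\sup_{g\in\mathcal{G}} \sum_{t=1}^{n} \left(g(Z_t) - \mathbb{E}_{t-1}[g(Z_t)]\right)\right] \le C\, \mathbb{E}\left(\sum_{t=1}^{n} \mathbb{E}'_{t-1} \sup_{g\in\mathcal{G}} \left|g(Z_t) - g(Z'_t)\right|^p\right)^{1/p} . \tag{16}$$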

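Remark 3.1. We conjecture that the statements below also hold for the definition of martingale type where $\mathbb{E}'_{t-1} \sup_{g\in\mathcal{G}} |g(Z_t) - g(Z'_t)|^p$ on the right-hand side of (16) is replaced with a smaller and more natural quantity $\sup_{g\in\mathcal{G}} |g(Z_t) - \mathbb{E}'_{t-1} g(Z'_t)|^p$.

In proving (16), we shall work with a dyadic filtration. Let $(\mathcal{A}_t = \sigma(\epsilon_1, \ldots, \epsilon_t))_{t=1}^{n}$ be the filtration generated by independent Rademacher (symmetric $\pm 1$-valued) random variables $\epsilon_1, \ldots, \epsilon_n$. Let $\mathbf{x} = (\mathbf{x}_1, \ldots, \mathbf{x}_n)$ be a predictable process with respect to this filtration (that is, $\mathbf{x}_t$ is $\mathcal{A}_{t-1}$-measurable) with values in some set $\mathcal{X}$. Sequential Rademacher complexity (see footnote 2) of an abstract class $\mathcal{F} \subseteq \mathbb{R}^{\mathcal{X}}$ on $\mathbf{x}$ is defined as

$$\mathcal{R}_n(\mathcal{F}; \mathbf{x}) = \mathbb{E}\left|\sup_{f\in\mathcal{F}} \sum_{t=1}^{n} \epsilon_t f(\mathbf{x}_t)\right| . \tag{17}$$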

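Definition 2. Let $r \in (1,2]$. We say that sequential Rademacher complexity of $\mathcal{F}$ exhibits an $n^{1/r}$ growth with constant $C$ if

$$\forall n \ge 1,\ \forall \mathbf{x}, \qquad \mathcal{R}_n(\mathcal{F}; \mathbf{x}) \le C n^{1/r} \cdot \sup_{f\in\mathcal{F},\, \epsilon\in\{\pm 1\}^n,\, t\le n} \left|f(\mathbf{x}_t(\epsilon))\right| . \tag{18}$$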
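For small $n$ and a finite class, the sequential Rademacher complexity (17) can be computed by enumerating all sign paths. The following is a minimal sketch, assuming a finite list of functions and a predictable process represented as a Python function of the sign prefix $\epsilon_{1:t-1}$; the name `seq_rademacher` and this representation are illustrative choices, not notation from the paper.

```python
import itertools

def seq_rademacher(functions, tree, n):
    """E | sup_f sum_t eps_t f(x_t(eps_{1:t-1})) | by enumerating all 2^n sign paths."""
    total = 0.0
    for eps in itertools.product((-1, 1), repeat=n):
        # x_t is predictable: it sees only the first t-1 signs of the path
        xs = [tree(eps[:t]) for t in range(n)]
        sup = max(sum(e * f(x) for e, x in zip(eps, xs)) for f in functions)
        total += abs(sup)
    return total / 2 ** n

# Assumed toy class: the two constant functions +1 and -1 on a one-point domain.
constants = [lambda x: 1.0, lambda x: -1.0]
print(seq_rademacher(constants, lambda prefix: 0, 2))
```

For this toy class the supremum equals $|\epsilon_1 + \cdots + \epsilon_n|$ on every path, so the computed value is the expected absolute value of a simple random walk.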

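We will work with a particular class of functions $\mathcal{F} = \{f_g(z, z') = g(z) - g(z') : g \in \mathcal{G}\}$ defined on $\mathcal{X} \triangleq \mathcal{Z} \times \mathcal{Z}$. It is immediate that $\mathcal{F}$ exhibits $n^{1/r}$ growth whenever $\mathcal{G}$ does, and vice versa, with at most doubling of the constant $C$.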

Using a sequential symmetrization technique, it holds (see [25]) that

$$\mathbb{E}\left[\sup_{g\in\mathcal{G}} \sum_{t=1}^{n} \left(g(Z_t) - \mathbb{E}_{t-1}[g(Z_t)]\right)\right] \le 2 \sup_{\mathbf{z}} \mathcal{R}_n(\mathcal{G}; \mathbf{z}) . \tag{19}$$

Therefore, the statement “$\mathcal{G}$ has martingale type $r$ whenever $\mathcal{G}$ exhibits an $n^{1/r}$ growth” corresponds to the phenomenon that, loosely speaking, “one may replace the distribution-independent $n^{1/r}$ bound with a sequence-dependent variation.”

The next theorem shows a tight connection between the complexity growth $n^{1/r}$ and martingale type.

Footnote 2: This complexity is defined in [25] without the absolute values; this difference is minor (and disappears if $0 \in \mathcal{F}$).

Theorem 5. For any function class $\mathcal{G} \subseteq \mathbb{R}^{\mathcal{Z}}$, the following statements hold:

1. If for some $r \in (1,2]$ sequential Rademacher complexity exhibits $n^{1/r}$ growth, then $\mathcal{G}$ has martingale type $p$ for every $p < r$.

2. If $\mathcal{G}$ has martingale type $p$, then sequential complexity exhibits an $n^{1/p}$ growth.

The proof relies on the development in the next few sections, and especially on Lemma 15. The technique is partly inspired by the work of Burkholder [7] and Pisier [18]. In particular, a key tool is the reverse Hölder principle [19, Prop. 8.53].

In addition to Theorem 5, let us state informal versions of Theorems 17 and 18 which appear, respectively, in Sections 7 and 8. Define the random variables

$$\mathrm{Var}_p = \mathbb{E}' \sum_{t=1}^{n} \sup_{g\in\mathcal{G}} \left|g(Z_t) - g(Z'_t)\right|^p , \qquad \mathrm{Var}_p(g) = \mathbb{E}' \sum_{t=1}^{n} \left|g(Z_t) - g(Z'_t)\right|^p ,$$

where $\mathbb{E}'$ is expectation with respect to the tangent sequence, conditionally on $Z_{1:n}$. Then Theorem 17 states that with high probability controlled by $u > 0$,

$$\sup_{g\in\mathcal{G}} \sum_{t=1}^{n} \left(g(Z_t) - \mathbb{E}_{t-1}[g(Z_t)]\right) \lesssim \log(n)\, \mathrm{Var}_r^{1/r} + u\, \mathrm{Var}_2^{1/2}$$

whenever $\mathcal{G}$ exhibits $n^{1/r}$ growth of sequential Rademacher complexity. Theorem 18 addresses the case of martingale type 2 and states that with high probability controlled by $u > 0$,

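$$\sup_{g\in\mathcal{G}} \left\{\sum_{t=1}^{n} \left(g(Z_t) - \mathbb{E}_{t-1}[g(Z_t)]\right) - n^{\frac{q}{4}} \left(\mathrm{Var}_2^{1/2}(g)\right)^{\frac{2-q}{4}} - u\, \mathrm{Var}_2^{1/2}(g)\right\} \lesssim 0$$

whenever sequential entropy (defined below) at scale $\alpha$ behaves as $\alpha^{-q}$.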

3.1 Other complexity measures

We see that the martingale type of $\mathcal{G}$ is described by the behavior of sequential Rademacher complexity. The latter behavior can, in turn, be quantified in terms of geometric quantities, such as sequential covering numbers and the sequential scale-sensitive dimension. We present the following two definitions from [25], both stated in terms of a predictable process $\mathbf{x} = (\mathbf{x}_1, \ldots, \mathbf{x}_n)$ with respect to the dyadic filtration. It may be beneficial (at least it was for the authors of [25]) to think of $\mathbf{x}$ as a complete binary tree of depth $n$, decorated by elements of $\mathcal{X}$, and $\epsilon \in \{\pm 1\}^n$ specifying a path in this tree.

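Definition 3 (Sequential covering number). Let $\mathbf{x} = (\mathbf{x}_1, \ldots, \mathbf{x}_n)$ be an $\mathcal{X}$-valued predictable process with respect to the dyadic filtration, and let $\mathcal{F} \subseteq \mathbb{R}^{\mathcal{X}}$. A collection $V$ of $\mathbb{R}$-valued predictable processes is called an $\alpha$-cover (with respect to $\ell_p$) of $\mathcal{F}$ on $\mathbf{x}$ if

$$\forall f \in \mathcal{F},\ \forall \epsilon \in \{\pm 1\}^n,\ \exists \mathbf{v} \in V \ \text{s.t.} \quad \left(\frac{1}{n} \sum_{t=1}^{n} \left|f(\mathbf{x}_t(\epsilon)) - \mathbf{v}_t(\epsilon)\right|^p\right)^{1/p} \le \alpha . \tag{20}$$

The cardinality of the smallest $\alpha$-cover is denoted by $\mathcal{N}_p(\mathcal{F}, \alpha, \mathbf{x})$ and $\mathcal{N}_p(\mathcal{F}, \alpha, n) = \sup_{\mathbf{x}} \mathcal{N}_p(\mathcal{F}, \alpha, \mathbf{x})$, and both are referred to as sequential covering numbers. Sequential entropy is defined as $\log \mathcal{N}_p$.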
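The order of quantifiers in (20), where the covering element may be chosen after the path $\epsilon$ is fixed, is what allows a sequential cover to be smaller than the class itself. The brute-force sketch below checks condition (20) on a small instance; the three-point domain, the functions `f1`, `f2`, `f3`, and the processes `v`, `w` are hypothetical examples constructed for this illustration.

```python
import itertools

def is_alpha_cover(functions, tree, cover, n, alpha, p=2):
    """Check condition (20): every (f, path) pair is alpha-close to some v on that path."""
    for f in functions:
        for eps in itertools.product((-1, 1), repeat=n):
            f_path = [f(tree(eps[:t])) for t in range(n)]

            def dist(v):
                v_path = [v(eps[:t]) for t in range(n)]
                gaps = sum(abs(a - b) ** p for a, b in zip(f_path, v_path))
                return (gaps / n) ** (1.0 / p)

            if not any(dist(v) <= alpha for v in cover):
                return False
    return True

# Depth-2 tree on the domain {a, b, c}: root a, then b after eps_1 = -1, c after eps_1 = +1.
tree = lambda prefix: "a" if not prefix else ("b" if prefix[0] == -1 else "c")
f1 = lambda x: {"a": 0, "b": 1, "c": 0}[x]
f2 = lambda x: {"a": 0, "b": 0, "c": 1}[x]
f3 = lambda x: 0
# Two real-valued predictable processes forming the candidate cover.
v = lambda prefix: 0 if not prefix else (1 if prefix[0] == -1 else 0)
w = lambda prefix: 0 if not prefix else (0 if prefix[0] == -1 else 1)

print(is_alpha_cover([f1, f2, f3], tree, [v, w], n=2, alpha=0.0))
```

In this instance two processes cover three distinct functions at scale zero, because `f3` is matched by `w` on one branch and by `v` on the other.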


Definition 4 (Sequential fat-shattering dimension). We say that $\mathcal{F} \subseteq \mathbb{R}^{\mathcal{X}}$ shatters the predictable process $\mathbf{x} = (\mathbf{x}_1, \ldots, \mathbf{x}_n)$ at scale $\alpha > 0$ if there exists a real-valued predictable process $\mathbf{s}$ such that

$$\forall \epsilon \in \{\pm 1\}^n,\ \exists f \in \mathcal{F} \ \text{s.t.} \ \forall t \le n, \quad \epsilon_t\left(f(\mathbf{x}_t(\epsilon)) - \mathbf{s}_t(\epsilon)\right) \ge \alpha/2 .$$

The largest length $n$ of a shattered predictable process $\mathbf{x}$ is called the sequential fat-shattering dimension at scale $\alpha$ and denoted $\mathrm{fat}_\alpha(\mathcal{F})$.

The sequential covering numbers and the fat-shattering dimension are natural extensions of the classical notions, as shown in [25]. In particular, a Dudley-type entropy integral upper bound in terms of sequential covering numbers holds for sequential Rademacher complexity. The sequential covering numbers, in turn, are upper bounded in terms of the fat-shattering dimensions, in a parallel to the way classical empirical covering numbers are controlled by the scale-sensitive version of the Vapnik-Chervonenkis dimension. We summarize the implications of these relationships in the following corollary:

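Corollary 6. For any function class $\mathcal{F} \subseteq \mathbb{R}^{\mathcal{X}}$,

1. If for some $q > 0$ either $\forall \alpha$, $\log \mathcal{N}_2(\mathcal{F}, \alpha, n) \le C\alpha^{-q}$ or $\forall \alpha$, $\mathrm{fat}_\alpha(\mathcal{F}) \le C\alpha^{-q}$, then $\mathcal{F}$ has martingale type $p$ for any $p < \frac{\max\{q,2\}}{\max\{q,2\}-1}$.

2. If $\mathcal{F}$ has martingale type $r \in (1,2]$ then, for every $p < r$, there exists $C$ such that $\log \mathcal{N}_2(\mathcal{F}, \alpha, n) \le C\alpha^{-\frac{p}{p-1}}$ and $\mathrm{fat}_\alpha(\mathcal{F}) \le C\alpha^{-\frac{p}{p-1}}$, for all $\alpha$.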

We have established a relation between the martingale type of a function class $\mathcal{F}$ and several sequential complexities of the class. However, unlike our starting point (1) and Theorem 1, our results so far do not quantify the tail behavior for the difference between the supremum of the martingale process and the corresponding variation. A natural idea is to mimic the “equivalence” argument used in Section 2 to conclude the exponential tail bounds. Unfortunately, the deviation inequalities of the previous section rest on pathwise regret bounds that, in turn, rely on the linear structure of the associated Banach space, as well as on properties such as smoothness and uniform convexity. Without the linear structure, it is not clear whether the analogous pathwise statements hold. The goal of the rest of the paper is to bring forth some of the tools recently developed within the online learning literature, and to apply these pathwise regret bounds to conclude high probability tail bounds associated to martingale type. In addition to this goal, we will seek a version of Theorem 5(i) for bounded functions, where the $n^{1/r}$ growth of sequential Rademacher complexity implies martingale type $r$ (rather than any $p < r$), but with an additional $\log(n)$ factor. Our third goal will be to establish per-function variation bounds (similar to the notion of a weak variance [5]). We show that this latter bound is a finer version of the variation term, possible for classes that are “not too large”.

Our plan is as follows. First, we reduce the problem to one based on the dyadic filtration. After that, we shall introduce certain deterministic inequalities from the online learning literature that are already stated for the dyadic filtration.

4 Symmetrization: dyadic filtration is enough

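The purpose of this section is to prove that statements for the dyadic filtration can be lifted to general processes via sequential symmetrization. Consider the martingale

$$M_g = \sum_{t=1}^{n} \left(g(Z_t) - \mathbb{E}[g(Z_t) \mid Z_1, \ldots, Z_{t-1}]\right)$$

indexed by $g \in \mathcal{G}$. If $(Z_t)$ is adapted to a dyadic filtration $\mathcal{A}_t = \sigma(\epsilon_1, \ldots, \epsilon_t)$, each increment $g(Z_t) - \mathbb{E}[g(Z_t) \mid Z_1, \ldots, Z_{t-1}]$ takes on the value

$$f_g(\mathbf{x}_t(\epsilon_{1:t-1})) \triangleq \left(g(Z_t(\epsilon_{1:t-1}, +1)) - g(Z_t(\epsilon_{1:t-1}, -1))\right)/2$$

or its negation, where $\mathbf{x}_t$ is a predictable process with values in $\mathcal{Z} \times \mathcal{Z}$ and $f_g \in \mathcal{F}$ defined by $(z, z') \mapsto g(z) - g(z')$. In the rest of the paper, we work directly with martingales of the form $M_f = \sum_{t=1}^{n} \epsilon_t f(\mathbf{x}_t(\epsilon))$, indexed by an abstract class $\mathcal{F} \subseteq \mathbb{R}^{\mathcal{X}}$ and an abstract $\mathcal{X}$-valued predictable process $\mathbf{x}$.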

We extend the symmetrization approach of Panchenko [15] to sequential symmetrization for the case of martingales. In contrast to the more frequently-used Giné-Zinn symmetrization proof (via Chebyshev's inequality) [12, 26] that allows a direct tail comparison of the symmetrized and the original processes, Panchenko's approach allows for an “indirect” comparison. The following immediate extension of [15, Lemma 1] will imply that any $\exp\{-\mu(u)\}$ type tail behavior of the symmetrized process yields the same behavior for the original process.

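Lemma 7. Suppose $\xi$ and $\nu$ are random variables and for some $\Gamma \ge 1$ and for all $u \ge 0$

$$P(\nu \ge u) \le \Gamma \exp\{-\mu(u)\} .$$

Let $\mu : \mathbb{R}_+ \to \mathbb{R}_+$ be an increasing differentiable function with $\mu(0) = 0$ and $\mu(\infty) = \infty$. Suppose for all $a \in \mathbb{R}$ and $\phi(x) \triangleq \mu([x - a]_+)$ it holds that $\mathbb{E}\phi(\xi) \le \mathbb{E}\phi(\nu)$. Then for any $u \ge 0$,

$$P(\xi \ge u) \le \Gamma \exp\{-\mu(u - \mu^{-1}(1))\} .$$

In particular, if $\mu(b) = cb$, we have $P(\xi \ge u) \le \Gamma \exp\{1 - cu\}$; if $\mu(b) = cb^2$, then $P(\xi \ge u) \le \Gamma \exp\{1 - cu^2/4\}$.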
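The two special cases at the end of Lemma 7 follow by substituting $\mu^{-1}(1)$ into the general bound. The short derivation below is a worked check added for convenience and is not part of the original proof; the substitution $v = \sqrt{c}\,u$ is introduced here only to simplify the algebra.

```latex
% Linear case: \mu(b) = cb, so \mu^{-1}(1) = 1/c and
\mu\bigl(u - \mu^{-1}(1)\bigr) = c\,(u - 1/c) = cu - 1
\;\Longrightarrow\;
P(\xi \ge u) \le \Gamma \exp\{1 - cu\}.

% Quadratic case: \mu(b) = cb^2, so \mu^{-1}(1) = c^{-1/2}. With v = \sqrt{c}\,u,
\mu\bigl(u - \mu^{-1}(1)\bigr) - \Bigl(\tfrac{cu^2}{4} - 1\Bigr)
  = (v - 1)^2 - \tfrac{v^2}{4} + 1
  = \tfrac{3}{4}\Bigl(v - \tfrac{4}{3}\Bigr)^2 + \tfrac{2}{3} \;>\; 0,
% hence
P(\xi \ge u) \le \Gamma \exp\bigl\{-c\,(u - c^{-1/2})^2\bigr\}
             \le \Gamma \exp\{1 - cu^2/4\}.
```

For $u < \mu^{-1}(1)$ both stated right-hand sides exceed 1 (recall $\Gamma \ge 1$), so the bounds hold trivially in that range.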

As in [15], the lemma will be used with $\xi$ and $\nu$ as functions of a single sample and the double sample, respectively. The expression for the double sample will be symmetrized in order to pass to the dyadic filtration. However, unlike [15], we are dealing with a dependent sequence $Z_1, \ldots, Z_n$, and the meaning ascribed to the “second sample” $Z'_1, \ldots, Z'_n$ is that of a tangent sequence. That is, $Z_t, Z'_t$ are independent and have the same distribution conditionally on $Z_1, \ldots, Z_{t-1}$. Let $\mathbb{E}_{t-1}$ stand for the conditional expectation given $Z_1, \ldots, Z_{t-1}$.

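Corollary 8. Let $B : \mathcal{G} \times \mathcal{Z}^{2n} \to \mathbb{R}$ be a function that is symmetric with respect to the swap of the $i$-th pair $z_i, z'_i$, for any $i \in [n]$:

$$B(g; z_1, z'_1, \ldots, z_i, z'_i, \ldots, z_n, z'_n) = B(g; z_1, z'_1, \ldots, z'_i, z_i, \ldots, z_n, z'_n) \tag{21}$$

for all $g \in \mathcal{G}$. Then, under the assumptions of Lemma 7 on $\mu$, a tail behavior

$$\forall (\mathbf{z}, \mathbf{z}'), \qquad P\left(\sup_{g\in\mathcal{G}} \left\{\sum_{t=1}^{n} \epsilon_t\left(g(\mathbf{z}_t) - g(\mathbf{z}'_t)\right) - B(g; (\mathbf{z}_1, \mathbf{z}'_1), \ldots, (\mathbf{z}_n, \mathbf{z}'_n))\right\} > u\right) \le \Gamma \exp\{-\mu(u)\}$$

for all $u > 0$ implies the tail bound

$$P\left(\sup_{g\in\mathcal{G}} \left\{\sum_{t=1}^{n} \left(g(Z_t) - \mathbb{E}_{t-1} g(Z_t)\right) - \mathbb{E}_{Z'_{1:n}} B(g; Z_1, Z'_1, \ldots, Z_n, Z'_n)\right\} > u\right) \le \Gamma \exp\{-\mu(u - \mu^{-1}(1))\}$$

for any sequence of random variables $Z_1, \ldots, Z_n$ and the corresponding tangent sequence $Z'_1, \ldots, Z'_n$. Here $(\mathbf{z}, \mathbf{z}')$ ranges over pairs of predictable processes with respect to the dyadic filtration. A direct comparison of the expected suprema also holds:

$$\mathbb{E}\sup_{g\in\mathcal{G}} \left\{\sum_{t=1}^{n} \left(g(Z_t) - \mathbb{E}_{t-1} g(Z_t)\right) - \mathbb{E}_{Z'_{1:n}} B(g; Z_1, Z'_1, \ldots, Z_n, Z'_n)\right\} \le \sup_{\mathbf{z}, \mathbf{z}'} \mathbb{E}\sup_{g\in\mathcal{G}} \left\{\sum_{t=1}^{n} \epsilon_t\left(g(\mathbf{z}_t) - g(\mathbf{z}'_t)\right) - B(g; (\mathbf{z}_1, \mathbf{z}'_1), \ldots, (\mathbf{z}_n, \mathbf{z}'_n))\right\} . \tag{22}$$

We conclude that it is enough to prove tail bounds for a supremum

$$\sup_{f\in\mathcal{F}} \left\{\sum_{t=1}^{n} \epsilon_t f(\mathbf{x}_t) - B(f; \mathbf{x}_1, \ldots, \mathbf{x}_n)\right\}$$

of a martingale with respect to the dyadic filtration, offset by a function $B(f; \mathbf{x}_1, \ldots, \mathbf{x}_n)$. This will be achieved with the help of deterministic regret inequalities.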

5 Deterministic regret inequalities

5.1 Sequential prediction

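We let $y_1, \ldots, y_n \in \{\pm 1\}$ and $x_1, \ldots, x_n \in \mathcal{X}$ for some abstract measurable set $\mathcal{X}$. Let $\mathcal{F}$ be a class of $[-1,1]$-valued functions on $\mathcal{X}$. Fix a cost function $\ell : \mathbb{R} \times \mathbb{R} \to \mathbb{R}$, convex in the first argument. For a given function $B : \mathcal{F} \times \mathcal{X}^n \to \mathbb{R}$, we aim to construct $\hat{y}_t = \hat{y}_t(x_1, \ldots, x_t, y_1, \ldots, y_{t-1}) \in [-1,1]$ such that

$$\forall (x_t, y_t)_{t=1}^{n}, \qquad \sum_{t=1}^{n} \ell(\hat{y}_t, y_t) \le \inf_{f\in\mathcal{F}} \left\{\sum_{t=1}^{n} \ell(f(x_t), y_t) + B(f; x_1, \ldots, x_n)\right\} . \tag{23}$$

We may view $\hat{y}_t$ as a prediction of the next value $y_t$ having observed $x_t$ and all the data thus far. In this paper, we focus on the linear loss $\ell(a, b) = -ab/2$ (equivalently for the purposes of (23), the absolute loss, since $\frac{1}{2}|a - b| = (1 - ab)/2$ when $b \in \{\pm 1\}$ and $a \in [-1,1]$) and the square loss $\ell(a, b) = (a - b)^2$. We equivalently write (23) for the linear cost function as

$$\sup_{f\in\mathcal{F}} \left\{\sum_{t=1}^{n} y_t f(x_t) - 2B(f; x_1, \ldots, x_n)\right\} \le \sum_{t=1}^{n} \hat{y}_t y_t \tag{24}$$

while for the square loss it becomes

$$\sup_{f\in\mathcal{F}} \left\{\sum_{t=1}^{n} \left(2 y_t f(x_t) - f(x_t)^2\right) - B(f; x_1, \ldots, x_n)\right\} \le \sum_{t=1}^{n} \left(2 \hat{y}_t y_t - \hat{y}_t^2\right) . \tag{25}$$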

Given a function $B$ and a class $\mathcal{F}$, there are two goals we may consider: (a) certify the existence of $(\hat{y}_t) \triangleq (\hat{y}_1, \ldots, \hat{y}_n)$ satisfying the pathwise inequality (23) for all sequences $(x_t, y_t)_{t=1}^{n}$; or (b) give an explicit construction of $(\hat{y}_t)$. Both questions have been studied in the online learning literature, but the non-constructive approach will play an especially important role. Indeed, explicit constructions, such as the simple gradient descent update (2), might not be available in more complex situations, yet it is the existence of $(\hat{y}_t)$ that yields the sought-after tail bounds.

5.2 Existence of strategies

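To certify the existence of a strategy $(\hat{y}_t)$, consider the following object:

$$\mathcal{A}(\mathcal{F}, B) = \left\langle\!\!\left\langle \sup_{x_t} \inf_{\hat{y}_t} \max_{y_t} \right\rangle\!\!\right\rangle_{t=1}^{n} \left\{\sum_{t=1}^{n} \ell(\hat{y}_t, y_t) - \inf_{f\in\mathcal{F}} \left\{\sum_{t=1}^{n} \ell(f(x_t), y_t) + B(f; x_1, \ldots, x_n)\right\}\right\} \tag{26}$$

where the notation $\langle\!\langle \cdots \rangle\!\rangle_{t=1}^{n}$ stands for the repeated application of the operators (the outer operators corresponding to $t = 1$). The variable $x_t$ ranges over $\mathcal{X}$, $y_t$ is in the set $\{\pm 1\}$, and $\hat{y}_t$ ranges in $[-1,1]$. It follows that

$\mathcal{A}(\mathcal{F}, B) \le 0$ is a necessary and sufficient condition for the existence of $(\hat{y}_t)$ such that (23) holds.

Indeed, the optimal choice for $\hat{y}_1$ is made given $x_1$; the optimal choice for $\hat{y}_2$ is made given $x_1, y_1, x_2$, and so on. This choice defines the optimal strategy $(\hat{y}_t)$ (see footnote 3). The other direction is immediate.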

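Suppose we can find an upper bound on $\mathcal{A}(\mathcal{F}, B)$ and then prove that this upper bound is non-positive. This would serve as a sufficient condition for the existence of $(\hat{y}_t)$. Next, we present such an upper bound for the case when the cost function is linear. More general results for convex Lipschitz cost functions can be found in [9].

As before, let $\epsilon = (\epsilon_1, \ldots, \epsilon_n)$ be a sequence of independent Rademacher random variables. Let $\mathbf{x} = (\mathbf{x}_1, \ldots, \mathbf{x}_n)$ and $\mathbf{y} = (\mathbf{y}_1, \ldots, \mathbf{y}_n)$ be predictable processes with respect to the dyadic filtration $\sigma(\epsilon_1, \ldots, \epsilon_t)$, with values in $\mathcal{X}$ and $\{\pm 1\}$, respectively. In other words, $\mathbf{x}_t = \mathbf{x}_t(\epsilon_1, \ldots, \epsilon_{t-1}) \in \mathcal{X}$ and $\mathbf{y}_t = \mathbf{y}_t(\epsilon_1, \ldots, \epsilon_{t-1}) \in \{\pm 1\}$ for each $t = 1, \ldots, n$.

Lemma 9. For the case of the linear cost function,

$$\mathcal{A}(\mathcal{F}, B) \le \sup_{\mathbf{x}} \mathbb{E}\left[\sup_{f\in\mathcal{F}} \left\{\sum_{t=1}^{n} \frac{1}{2} \epsilon_t f(\mathbf{x}_t) - B(f; \mathbf{x}_1, \ldots, \mathbf{x}_n)\right\}\right] . \tag{27}$$

Therefore, whenever it holds that for any predictable process $\mathbf{x} = (\mathbf{x}_1, \ldots, \mathbf{x}_n)$

$$\mathbb{E}\left[\sup_{f\in\mathcal{F}} \left\{\sum_{t=1}^{n} \epsilon_t f(\mathbf{x}_t) - 2B(f; \mathbf{x}_1, \ldots, \mathbf{x}_n)\right\}\right] \le 0 , \tag{28}$$

there exists a strategy $(\hat{y}_t)$ with values

$$|\hat{y}_t| \le \sup_{f\in\mathcal{F}} |f(x_t)| \tag{29}$$

such that the pathwise inequality (24) holds.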
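For a finite domain, a finite class, and small $n$, the minimax value (26) with the linear loss can be computed by backward recursion, which makes the existence criterion concrete. The sketch below is an illustration under stated assumptions: $B$ is taken to be a constant `b_const`, the inner infimum over $\hat{y}_t \in [-1,1]$ is solved in closed form as the clipped difference of the two continuation values (the minimizer of a maximum of two lines), and the function name `minimax_value` is hypothetical.

```python
import math

def minimax_value(functions, xs, n, b_const):
    """Value (26) for the linear loss l(a, b) = -ab/2 and a constant offset B."""
    def value(history):
        if len(history) == n:
            # -inf_f { sum_t l(f(x_t), y_t) + B } = sup_f sum_t f(x_t) y_t / 2 - B
            best = max(sum(0.5 * f(x) * y for x, y in history) for f in functions)
            return best - b_const
        worst = -math.inf
        for x in xs:  # sup over x_t
            v_plus = value(history + ((x, 1),))
            v_minus = value(history + ((x, -1),))
            # inf over y_hat of max(-y_hat/2 + v_plus, y_hat/2 + v_minus)
            y_hat = max(-1.0, min(1.0, v_plus - v_minus))
            val = max(-0.5 * y_hat + v_plus, 0.5 * y_hat + v_minus)  # max over y_t
            worst = max(worst, val)
        return worst
    return value(())

# Assumed toy class: constant functions +1 and -1 on a one-point domain.
constants = [lambda x: 1.0, lambda x: -1.0]
print(minimax_value(constants, [0], 2, 0.5))
```

For this toy class with $n = 2$, the right-hand side of (27) equals $\frac{1}{2}\mathbb{E}|\epsilon_1 + \epsilon_2| - B = \frac{1}{2} - B$, so the computed value is non-positive exactly when $B \ge 1/2$.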

Footnote 3: If the infima are not achieved, a limiting argument can be employed.

Condition (28) in the previous lemma implies the existence of a strategy for (24). However, there might be situations when (28) can be verified for a function $B(f; \mathbf{x})$ of the predictable process that does not have a corresponding representation in the sense of (24). The next lemma provides a variant of Lemma 9.

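Lemma 10. Let $\mathbf{x}$ be an $\mathcal{X}$-valued predictable process with respect to the dyadic filtration. Let the function $B$ map the predictable process $\mathbf{x}$ and a function $f \in \mathcal{F}$ to a real value, with the property

$$\sup_{\mathbf{y}} B(f; \mathbf{x} \circ \mathbf{y}) \le B(f; \mathbf{x}) \tag{30}$$

where $\mathbf{y} = (\mathbf{y}_1, \ldots, \mathbf{y}_n)$ is a $\{\pm 1\}$-valued predictable process, and $(\mathbf{x} \circ \mathbf{y})_t = \mathbf{x}_t(\mathbf{y}_{2:t}(\epsilon))$. If

$$\mathbb{E}\left[\sup_{f\in\mathcal{F}} \left\{\sum_{t=1}^{n} \epsilon_t f(\mathbf{x}_t) - 2B(f; \mathbf{x})\right\}\right] \le 0, \tag{31}$$

then there is a strategy $(\hat{y}_t)$ with $\hat{y}_t = \hat{y}_t(y_1, \ldots, y_{t-1})$ and $|\hat{y}_t| \le \sup_{f\in\mathcal{F}} |f(\mathbf{x}_t)|$ such that

$$\forall y_1, \ldots, y_n \in \{\pm 1\}, \qquad \sup_{f\in\mathcal{F}} \left\{\sum_{t=1}^{n} y_t f(\mathbf{x}_t(y_1, \ldots, y_{t-1})) - 2B(f; \mathbf{x})\right\} \le \sum_{t=1}^{n} y_t \hat{y}_t . \tag{32}$$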

6 Amplification and equivalence

We now describe an interesting amplification phenomenon, already presented in the Introduction for the simple Euclidean case. Whenever (28) holds, the deterministic inequality (24) holds, and, therefore, we may apply it to a particular martingale difference sequence to obtain high-probability bounds. Below, we detail this amplification for both linear and square loss functions.

6.1 Linear loss

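Take any $\mathcal{X}$-valued predictable process $\mathbf{x} = (\mathbf{x}_1, \ldots, \mathbf{x}_n)$ with respect to the dyadic filtration. The deterministic inequality (24) applied to $x_t = \mathbf{x}_t(\epsilon_1, \ldots, \epsilon_{t-1})$ and $y_t = \epsilon_t$ becomes

$$\sup_{f\in\mathcal{F}} \left\{\sum_{t=1}^{n} \epsilon_t f(\mathbf{x}_t) - 2B(f; \mathbf{x}_1, \ldots, \mathbf{x}_n)\right\} \le \sum_{t=1}^{n} \epsilon_t \hat{y}_t \tag{33}$$

for any $\epsilon$, and thus we have the comparison of tails

$$P\left(\sup_{f\in\mathcal{F}} \left\{\sum_{t=1}^{n} \epsilon_t f(\mathbf{x}_t) - 2B(f; \mathbf{x}_1, \ldots, \mathbf{x}_n)\right\} > u\right) \le P\left(\sum_{t=1}^{n} \epsilon_t \hat{y}_t > u\right) . \tag{34}$$

Given the boundedness of the increments $\epsilon_t \hat{y}_t$, the tail bounds follow immediately from the Azuma-Hoeffding inequality or from Freedman's inequality [10]. More precisely, we use the fact that the martingale differences are bounded by $|\hat{y}_t| \le \sup_{f\in\mathcal{F}} |f(\mathbf{x}_t)|$, and conclude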

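Lemma 11. If there exists a prediction strategy $(\hat{y}_t)$ that satisfies (24) and (29), then for any predictable process $\mathbf{x}$ the Azuma-Hoeffding inequality implies that

$$P\left(\sup_{f\in\mathcal{F}} \left\{\sum_{t=1}^{n} \epsilon_t f(\mathbf{x}_t) - 2B(f; \mathbf{x}_1, \ldots, \mathbf{x}_n)\right\} > u\right) \le \exp\left(-\frac{u^2}{4 \max_{\epsilon} \sum_{t=1}^{n} \sup_{f\in\mathcal{F}} f(\mathbf{x}_t(\epsilon))^2}\right) , \tag{35}$$

Freedman's inequality implies

$$P\left(\sup_{f\in\mathcal{F}} \left\{\sum_{t=1}^{n} \epsilon_t f(\mathbf{x}_t) - 2B(f; \mathbf{x}_1, \ldots, \mathbf{x}_n)\right\} > u,\ \sum_{t=1}^{n} \sup_{f\in\mathcal{F}} f(\mathbf{x}_t)^2 \le \sigma^2\right) \le \exp\left(-\frac{u^2}{2\sigma^2 + 2uM/3}\right) , \tag{36}$$

where $M = n \cdot \sup_{f\in\mathcal{F},\, \epsilon\in\{\pm 1\}^n,\, t\le n} |f(\mathbf{x}_t)|$, and we also have that for any $\alpha > 0$,

$$P\left(\sup_{f\in\mathcal{F}} \left\{\sum_{t=1}^{n} \epsilon_t f(\mathbf{x}_t) - 2B(f; \mathbf{x}_1, \ldots, \mathbf{x}_n)\right\} - \alpha \sum_{t=1}^{n} \sup_{f\in\mathcal{F}} f(\mathbf{x}_t)^2 > u\right) \le \exp(-2\alpha u) . \tag{37}$$

In view of Lemma 9, a sufficient condition for these inequalities is that (28) holds for all $\mathbf{x}$. The same inequalities hold with $B(f; \mathbf{x})$ if conditions of Lemma 10 are verified for the given $\mathbf{x}$.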

Let us emphasize the conclusion of the above lemma: the non-positivity of the expected supremum of a collection of martingales, offset by a function $2B$, implies existence of a regret-minimization strategy, which implies a high-probability tail bound. To close the loop, we integrate out the tails, obtaining an in-expectation bound of the form (28), but possibly with a larger $B$ function. This is a more general form of the equivalence promised in the introduction.

The next goal is to find nontrivial functions $B$ such that (28) holds. The most basic $B$ is a constant that depends on the complexity of $\mathcal{F}$, but not on $f$ or the data. Define the worst-case sequential Rademacher averages as

$$\mathcal{R}_n(\mathcal{F}) \triangleq \sup_{\mathbf{x}} \mathbb{E}\sup_{f\in\mathcal{F}} \sum_{t=1}^{n} \epsilon_t f(\mathbf{x}_t) . \tag{38}$$

Clearly, $B = \mathcal{R}_n(\mathcal{F})/2$ satisfies (28). The following is immediate.

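Corollary 12. For any $\mathcal{F} \subseteq \mathbb{R}^{\mathcal{X}}$ and an $\mathcal{X}$-valued predictable process $\mathbf{x}$ with respect to the dyadic filtration,

$$P\left(\sup_{f\in\mathcal{F}} \sum_{t=1}^n \epsilon_t f(\mathbf{x}_t) > \mathcal{R}_n(\mathcal{F}) + u\right) \le \exp\left(-\frac{u^2}{4\max_{\epsilon}\sum_{t=1}^n \sup_{f\in\mathcal{F}} f(\mathbf{x}_t(\epsilon))^2}\right). \tag{39}$$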

Superficially, (39) looks like a one-sided version of the concentration bound for classical (i.i.d.) Rademacher averages [5]. However, sequential Rademacher averages are not Lipschitz with respect to a flip of a sign, as the whole remaining path may change after a flip.
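For a finite class and a short horizon, the inner expectation in (38) can be computed by brute force for a fixed tree. The following is a minimal sketch under assumptions that are not in the paper: the tree is represented as a hypothetical Python function mapping a sign prefix to a point, and the class is a short list of functions. It computes the value for one given tree only; the outer supremum over trees in (38) is not attempted.

```python
import itertools

def seq_rademacher(functions, tree, n):
    """E_eps sup_f sum_t eps_t * f(x_t(eps_1..eps_{t-1})) by exact enumeration."""
    total = 0.0
    for eps in itertools.product((-1, 1), repeat=n):
        # The point played at time t depends only on the earlier signs.
        path = [tree(eps[:t]) for t in range(n)]
        total += max(sum(e * f(x) for e, x in zip(eps, path)) for f in functions)
    return total / 2 ** n

# Hypothetical two-function class; with a constant tree the value is E|eps_1 + eps_2|.
functions = [lambda x: 1.0, lambda x: -1.0]
constant_tree = lambda prefix: 0
print(seq_rademacher(functions, constant_tree, 2))  # 1.0

# A tree whose later nodes depend on earlier signs: flipping eps_1 changes the path.
branching_tree = lambda prefix: sum(prefix)
print(seq_rademacher([lambda x: x / 3.0, lambda x: -x / 3.0], branching_tree, 3))
```

The second call illustrates the dependence of the path on earlier signs that the preceding remark refers to.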

6.2 Square loss

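As for the case of the linear loss function, take any $\mathcal{X}$-valued predictable process $\mathbf{x} = (\mathbf{x}_1,\ldots,\mathbf{x}_n)$ with respect to the dyadic filtration. Fix $\alpha > 0$. The deterministic inequality (25) for $x_t = \mathbf{x}_t(\epsilon_1,\ldots,\epsilon_{t-1})$ and $y_t = \frac{1}{\alpha}\epsilon_t$ becomes

$$\sup_{f\in\mathcal{F}} \sum_{t=1}^n \left(\frac{2}{\alpha}\epsilon_t f(\mathbf{x}_t) - f^2(\mathbf{x}_t)\right) - B(f;\mathbf{x}_1,\ldots,\mathbf{x}_n) \le \sum_{t=1}^n \left(\frac{2}{\alpha}\epsilon_t \hat{y}_t - \hat{y}_t^2\right). \tag{40}$$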

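As in the proof of (37), we obtain a tail comparison

$$P\left(\sup_{f\in\mathcal{F}} \sum_{t=1}^n \left(\frac{2}{\alpha}\epsilon_t f(\mathbf{x}_t) - f^2(\mathbf{x}_t)\right) - B(f;\mathbf{x}_1,\ldots,\mathbf{x}_n) > \frac{u}{\alpha}\right) \le P\left(\sum_{t=1}^n \left(\frac{2}{\alpha}\epsilon_t \hat{y}_t - \hat{y}_t^2\right) > \frac{u}{\alpha}\right) \le \exp\left\{-\frac{\alpha u}{2}\right\}.$$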

Once again, the most basic choice for $B$ is the constant that depends on the complexity of the class. We recall the following result from [22].


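Lemma 13 ([22]). Let $\kappa > 0$. For any class $\mathcal{F} \subseteq \mathbb{R}^{\mathcal{X}}$, there exists a prediction strategy $(\hat{y}_t)$ with values in $[-\kappa,\kappa]$ such that

$$\forall (x_1,y_1),\ldots,(x_n,y_n) \in \mathcal{X}\times[-\kappa,\kappa], \qquad \sum_{t=1}^n (\hat{y}_t - y_t)^2 - \inf_{f\in\mathcal{F}} \sum_{t=1}^n (f(x_t) - y_t)^2 \le \mathcal{R}^{\mathrm{off}}_n(\mathcal{F},\kappa,1),$$

where, analogously to (38), we define offset Rademacher complexity

$$\mathcal{R}^{\mathrm{off}}_n(\mathcal{F},c_1,c_2) \triangleq \sup_{\mathbf{x},\boldsymbol{\mu}} \mathbb{E} \sup_{f\in\mathcal{F}} \sum_{t=1}^n 4c_1\epsilon_t\big(f(\mathbf{x}_t) - \boldsymbol{\mu}_t\big) - c_2\big(f(\mathbf{x}_t) - \boldsymbol{\mu}_t\big)^2. \tag{41}$$

Here, the supremum is taken over $\mathcal{X}$-valued predictable processes $\mathbf{x} = (\mathbf{x}_1,\ldots,\mathbf{x}_n)$ and $[-\kappa,\kappa]$-valued predictable processes $\boldsymbol{\mu}$, both with respect to the dyadic filtration.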
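To see the effect of the negative quadratic term in (41), the expectation can again be enumerated for fixed trees. This is a hedged sketch: `x_tree` and `mu_tree` are hypothetical functions from sign prefixes to values, the class is a finite list, and the supremum over the two trees is left out.

```python
import itertools

def offset_rademacher(functions, x_tree, mu_tree, n, c1, c2):
    """E_eps sup_f sum_t [4*c1*eps_t*d_t - c2*d_t**2] with d_t = f(x_t) - mu_t."""
    total = 0.0
    for eps in itertools.product((-1, 1), repeat=n):
        best = -float("inf")
        for f in functions:
            value = 0.0
            for t in range(n):
                d = f(x_tree(eps[:t])) - mu_tree(eps[:t])
                value += 4 * c1 * eps[t] * d - c2 * d * d
            best = max(best, value)
        total += best
    return total / 2 ** n

# Hypothetical example: the class {+1, -1}, mu identically zero, one round.
two_constants = [lambda x: 1.0, lambda x: -1.0]
zero_tree = lambda prefix: 0.0
print(offset_rademacher(two_constants, zero_tree, zero_tree, 1, 1.0, 1.0))  # 3.0
```

Raising `c2` lowers the value, which is the mechanism by which the offset complexity can be much smaller than the sequential Rademacher complexity.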

We conclude that (40) is satisfied with the data-independent constant $B = \mathcal{R}^{\mathrm{off}}_n(\mathcal{F},1/\alpha,1)$.

Hence, the following analogue of Corollary 12 holds:

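Corollary 14. Let $\mathcal{F} \subseteq [-1,1]^{\mathcal{X}}$. For any $\mathcal{X}$-valued predictable process $\mathbf{x}$ with respect to the dyadic filtration and for any $\alpha > 0$, it holds that

$$P\left(\sup_{f\in\mathcal{F}} \sum_{t=1}^n \big(2\epsilon_t f(\mathbf{x}_t) - \alpha f^2(\mathbf{x}_t)\big) - \mathcal{R}^{\mathrm{off}}_n(\mathcal{F},1,\alpha) > u\right) \le \exp\left\{-\frac{\alpha u}{2}\right\}.$$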

To summarize, in Section 5 we presented the machinery of regret inequalities, as well as sufficient conditions for existence of strategies. In the present section we used the pathwise statements, along with real-valued deviation inequalities, to conclude tail bounds, which, in turn, certify existence of regret-minimization strategies. In the next two sections we put these techniques to use.

7 Uniform variation and tail bounds for general martingale type

We now make extensive use of the amplification technique to prove in-probability versions of the “martingale type” definition. We start by working with dyadic martingales of the form $f \mapsto \sum_{t=1}^n \epsilon_t f(\mathbf{x}_t)$, where $\mathbf{x} = (\mathbf{x}_1,\ldots,\mathbf{x}_n)$ is a predictable process (with respect to the dyadic filtration) with values in $\mathcal{X}$. Once the results for these objects are established, we conclude the corresponding statements for general processes of the form (15) via the sequential symmetrization technique summarized in Corollary 8.

As in Section 3, we assume a growth condition $n^{1/r}$ on sequential Rademacher complexity.

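Lemma 15. Let $\mathcal{F} \subseteq \mathbb{R}^{\mathcal{X}}$ and $r \in (1,2]$. Under the growth assumption (18), for any $p < r$ there exists $K_{r,p} < \infty$ such that

$$\mathbb{E}\left|\sup_{f\in\mathcal{F}} \sum_{t=1}^n \epsilon_t f(\mathbf{x}_t)\right| \le K_{r,p}\, \mathbb{E}\left[\left(\sum_{t=1}^n \sup_{f\in\mathcal{F}} |f(\mathbf{x}_t)|^p\right)^{1/p}\right]. \tag{42}$$

Further, if $\mathcal{F} \subseteq [-1,1]^{\mathcal{X}}$ and (18) holds with constant $D/2$, then

$$\mathbb{E}\left|\sup_{f\in\mathcal{F}} \sum_{t=1}^n \epsilon_t f(\mathbf{x}_t)\right| \le 32 D \log n\ \mathbb{E}\left[\left(\sum_{t=1}^n \sup_{f\in\mathcal{F}} |f(\mathbf{x}_t)|^r\right)^{1/r}\right] + \varphi_n \tag{43}$$

where $\varphi_n \triangleq \dfrac{64 D \sqrt{n}\, \log n}{n^{D^2 \log n}}$ is a negligible term.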


The second part of the proof of Lemma 15 uses the amplification idea of the previous section. Using Lemma 9, we can now conclude existence of prediction strategies whose regret is controlled by sequence-dependent variance. This greatly extends the scope of available variance-type bounds in the online learning literature, where results in this direction have been obtained for either finite or linear classes.

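Corollary 16. Let $\mathcal{F} \subseteq [-1,1]^{\mathcal{X}}$ and $r \in (1,2]$. If (18) holds with constant $D/2$, then there exists a prediction strategy $(\hat{y}_t)$ such that

$$\sum_{t=1}^n (-\hat{y}_t \cdot y_t) - \inf_{f\in\mathcal{F}} \sum_{t=1}^n (-f(x_t)\cdot y_t) \le 32 \cdot D \log_2(n) \left(\sum_{t=1}^n \sup_{f\in\mathcal{F}} |f(x_t)|^r\right)^{1/r} + \varphi_n$$

for any sequence of $(x_t,y_t)_{t=1}^n$ (equivalently, (24) holds).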

In addition to being a novel result in the online learning domain, the above corollary serves as an amplification step to boost the in-expectation bound of Lemma 15 to a high-probability statement. We then invoke Corollary 8 and Lemma 21 to prove the following theorem.

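Theorem 17. Let $Z_1,\ldots,Z_n$ be a stochastic process with values in $\mathcal{Z}$ and let $Z'_1,\ldots,Z'_n$ be a tangent sequence. Let $\mathcal{G} \subseteq [-1,1]^{\mathcal{Z}}$, $r \in (1,2]$, and define the $r$-variation as

$$\mathrm{Var}_r = \mathbb{E}_{Z'_{1:n}} \sum_{t=1}^n \sup_{g\in\mathcal{G}} \big(g(Z_t) - g(Z'_t)\big)^r. \tag{44}$$

If (18) holds for $\mathcal{G}$ with constant $D/2$, then with probability at least $1 - e\log(n)\exp(-2u^2)$,

$$\sup_{g\in\mathcal{G}} \sum_{t=1}^n \big(g(Z_t) - \mathbb{E}_{t-1} g(Z_t)\big) \le 256 D \log_2(n)\, \mathrm{Var}_r^{1/r} + u \cdot 8\sqrt{\mathrm{Var}_2 + 1} + 8\varphi_n.$$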

We remark that the tail bound can be viewed as a ratio inequality (see [16, 11]) of the form (9), where the deviations are scaled by the square root of the variance.

8 Finer control via per-function variation

From the point of view of the previous section (and Theorem 5), all classes with sequential Rademacher complexity growth $n^{1/2}$ are treated equally. However, classes with such a growth can be as simple as a set consisting of two functions, or as complex as a set of linear functions indexed by a ball in the infinite-dimensional Hilbert space. In this section, a different complexity measure will be used for the regime when the $n^{1/2}$ growth hides the difference in complexity. This measure will be given by sequential covering numbers (and, as a consequence, by the offset Rademacher complexity). In the regime $\alpha^{-q}$, $q \in [0,2]$, for the growth of sequential entropy, we exhibit a finer analysis of the variation term that allows part of the variance to be adapted to the function.

Let $q \in (0,2]$. We say that a class $\mathcal{F} \subseteq [-1,1]^{\mathcal{X}}$ has the $\gamma^{-q}$ growth (as $\gamma$ decreases) of sequential entropy if there is a constant $C$ such that for all $\gamma \in (0,1]$,

$$\log \mathcal{N}_2(\mathcal{F},\gamma,n) \le C\gamma^{-q}.$$


As for sequential Rademacher complexity, it is easy to check that the class $\mathcal{G}$ and the derived class of functions $(z,z') \mapsto f(z,z') = g(z) - g(z')$ have the same growth of sequential entropy. Moreover, this growth controls the rate of growth of the offset Rademacher complexity, as shown in [22]. In particular, for the finite function class,

$$\mathcal{R}^{\mathrm{off}}_n(\mathcal{F},1,\alpha) \le \frac{8\log|\mathcal{F}|}{\alpha},$$

while for a parametric class of “dimension” $d$ (such that $\mathcal{N}(\mathcal{F},\gamma,n) \le (C'/\gamma)^d$ for some $C' > 0$),

$$\mathcal{R}^{\mathrm{off}}_n(\mathcal{F},1,\alpha) \le \frac{C d \log(n)}{\alpha},$$

and for a class with sequential entropy growth $q \in (0,2)$,

$$\mathcal{R}^{\mathrm{off}}_n(\mathcal{F},1,\alpha) \le C\alpha^{-\frac{2-q}{2+q}}\, n^{\frac{q}{2+q}}$$

for some absolute constant $C$ (the bound gains an extra logarithmic factor at $q = 2$). In this last nonparametric regime, Corollary 14 implies that for any $u > 0$,

$$P\left(\sup_{f\in\mathcal{F}} \sum_{t=1}^n \left(\epsilon_t f(\mathbf{x}_t) - \frac{\alpha}{2} f(\mathbf{x}_t)^2\right) - C\alpha^{-\frac{2-q}{2+q}}\, n^{\frac{q}{2+q}} > u\right) \le \exp\{-\alpha u\},$$

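and the analogous statements hold for the finite and parametric cases. As the next Theorem shows, the offset Rademacher complexity $\mathcal{R}^{\mathrm{off}}_n$ brings out (for smaller classes) the finer complexity control obscured by the sequential Rademacher complexity, which only provides $\Omega(n^{1/2})$ bounds.

Theorem 18. Let $Z_1,\ldots,Z_n$ be a discrete-time process with values in $\mathcal{Z}$ and let $Z'_1,\ldots,Z'_n$ be a tangent sequence. Let $\mathcal{G} \subseteq [-1,1]^{\mathcal{Z}}$ and define function-dependent variance as

$$\mathrm{Var}_2(g) = \mathbb{E}_{Z'_{1:n}} \sum_{t=1}^n \big(g(Z_t) - g(Z'_t)\big)^2. \tag{45}$$

If $\mathcal{G}$ exhibits a $\gamma^{-q}$ growth of sequential entropy, then there exists a constant $C$ such that for any $u > 0$, with probability at least $1 - e\log(n)\exp\{-u^2\}$,

$$\sup_{g\in\mathcal{G}} \sum_{t=1}^n \big(g(Z_t) - \mathbb{E}_{t-1} g(Z_t)\big) \le C n^{\frac{q}{4}} \big(\mathrm{Var}_2(g) + 2\big)^{\frac{2-q}{4}} + u \cdot 2\sqrt{2}\sqrt{\mathrm{Var}_2(g) + 2}. \tag{46}$$

If $\mathcal{G}$ is finite, with the same probability it holds that

$$\sup_{g\in\mathcal{G}} \sum_{t=1}^n \big(g(Z_t) - \mathbb{E}_{t-1} g(Z_t)\big) \le C\sqrt{\log|\mathcal{G}|}\sqrt{\mathrm{Var}_2(g) + 2} + u \cdot 2\sqrt{2}\sqrt{\mathrm{Var}_2(g) + 2}, \tag{47}$$

while for the parametric case,

$$\sup_{g\in\mathcal{G}} \sum_{t=1}^n \big(g(Z_t) - \mathbb{E}_{t-1} g(Z_t)\big) \le C\sqrt{d\log n}\sqrt{\mathrm{Var}_2(g) + 2} + u \cdot 2\sqrt{2}\sqrt{\mathrm{Var}_2(g) + 2}. \tag{48}$$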


The finite and parametric cases can be thought of as a “$q = 0$” regime. Here, we have a bound that depends on $n$ at most logarithmically. On the other hand, for $q \ge 2$ the term $n^{\frac{q}{4}}(\mathrm{Var}_2 + 1)^{\frac{2-q}{4}}$ is replaced with $n^{1-1/q}$, without any per-function adaptivity (as studied in the previous section). Between these two regimes, we obtain an interpolation, whereby the $1/2$ power is split into a non-adaptive part $n^{\frac{q}{4}}$ and the adaptive part $(\mathrm{Var}_2 + 1)^{\frac{2-q}{4}}$. This constitutes a finer analysis of classes with martingale type 2.

We may compare the bound of Theorem 18 in the finite case to the in-expectation bound of [14] in terms of “weak variance” for i.i.d. zero mean random variables $Z_1,\ldots,Z_n \in \mathbb{R}^d$:

$$\mathbb{E}\left[\max_{j\le d}\left|\sum_{t=1}^n \epsilon_t Z_{t,j}\right|\right] \le \sqrt{2\ln(2d)\, \mathbb{E}\max_{j\le d}\sum_{t=1}^n Z_{t,j}^2}.$$

In contrast to this bound, Theorem 18 matches the coordinate $j$ on the left-hand side to the variance of the $j$th coordinate on the right-hand side. Further, our bound holds for martingale difference sequences rather than i.i.d. random vectors. Finally, Theorem 18 holds well beyond the finite case.

9 Some Open Questions

The following are a few open-ended questions raised by this work:

1. In the definition of martingale type, can we replace $\mathbb{E}\left(\sum_{t=1}^n \mathbb{E}'_{t-1}\sup_{g\in\mathcal{G}} |g(Z_t) - g(Z'_t)|^p\right)^{1/p}$ with $\mathbb{E}\left(\sum_{t=1}^n \sup_{g\in\mathcal{G}} |g(Z_t) - \mathbb{E}_{t-1}[g]|^p\right)^{1/p}$ and reach the same conclusions? The latter version of variation is closer to the generalization of the martingale type for Banach spaces.

2. If for some $r \in (1,2]$, sequential Rademacher complexity exhibits $n^{1/r}$ growth rate, then does $\mathcal{G}$ have martingale type $r$? Currently, we only prove martingale type $p$ for any $p < r$. For the case of Banach spaces (linear $g$), the above question is answered in the positive in the work of Pisier [18]. However, the result of [18] relies on the notions of uniform convexity or uniform smoothness, which are specific to linear functionals and Banach spaces.

3. Is it possible to get a mix of uniform and per-function variance for general function classes with martingale type 2? In Section 8, for martingale type 2 we prove a finer control through per-function variance. A natural question is whether one can replace the $n$-dependent part by uniform variance terms, thus giving a mix of per-function and uniform variance in the same bound.

A Proofs

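Lemma 19. The update in (2) satisfies

$$\forall z_1,\ldots,z_n \in B, \qquad \sum_{t=1}^n \langle \hat{y}_t - f, z_t\rangle \le \sqrt{n}.$$

Proof of Lemma 19. The following two-line proof is standard. By the property of a projection,

$$\|\hat{y}_{t+1} - f\|^2 = \|\mathrm{Proj}_B(\hat{y}_t - n^{-1/2}z_t) - f\|^2 \le \|(\hat{y}_t - n^{-1/2}z_t) - f\|^2 = \|\hat{y}_t - f\|^2 + \frac{1}{n}\|z_t\|^2 - 2n^{-1/2}\langle \hat{y}_t - f, z_t\rangle.$$

Rearranging,

$$2n^{-1/2}\langle \hat{y}_t - f, z_t\rangle \le \|\hat{y}_t - f\|^2 - \|\hat{y}_{t+1} - f\|^2 + \frac{1}{n}\|z_t\|^2.$$

Summing over $t = 1,\ldots,n$ yields the desired statement.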
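The telescoping argument above can be sanity-checked numerically. The sketch below assumes a concrete setting that the proof does not fix: $B$ is taken to be the Euclidean unit ball in $\mathbb{R}^d$, the first prediction is $\hat{y}_1 = 0$, and the helper names are hypothetical. It runs the projected update with step $n^{-1/2}$ and evaluates the regret against the best comparator $f$ in the ball.

```python
import math
import random

def project_to_unit_ball(v):
    norm = math.sqrt(sum(c * c for c in v))
    return v if norm <= 1.0 else [c / norm for c in v]

def regret_against_best(zs):
    """sup over f in the unit ball of sum_t <y_t - f, z_t> for the projected update."""
    n, d = len(zs), len(zs[0])
    y = [0.0] * d                      # assumed starting point y_1 = 0
    step = 1.0 / math.sqrt(n)
    learner_total = 0.0
    z_sum = [0.0] * d
    for z in zs:
        learner_total += sum(a * b for a, b in zip(y, z))
        z_sum = [a + b for a, b in zip(z_sum, z)]
        y = project_to_unit_ball([a - step * b for a, b in zip(y, z)])
    # sup over ||f|| <= 1 of -<f, sum_t z_t> equals ||sum_t z_t||
    return learner_total + math.sqrt(sum(c * c for c in z_sum))

def random_point_in_unit_ball(rng, d):
    v = [rng.gauss(0.0, 1.0) for _ in range(d)]
    norm = math.sqrt(sum(c * c for c in v))
    scale = rng.random() / norm
    return [scale * c for c in v]

rng = random.Random(0)
zs = [random_point_in_unit_ball(rng, 3) for _ in range(200)]
print(regret_against_best(zs), math.sqrt(len(zs)))
```

The comparison holds for any sequence in the ball, including an adversarial one such as a repeated unit vector.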

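Lemma 20. With the notation of Lemma 1,

$$\mathbb{E}\max_{s=1,\ldots,n}\left\|\sum_{t=1}^s Z_t\right\| \le (2.5R_{\max} + \sqrt{3})\,\mathbb{E}\sqrt{V_n} + 2.5R_{\max}.$$

Proof of Lemma 20. Because of the “anytime” property of the regret bound and the strategy definition, we can write (12) as

$$\max_{s=1,\ldots,n}\left\{\left\|\sum_{t=1}^s Z_t\right\| - \sum_{t=1}^s \langle \hat{y}_t, Z_t\rangle\right\} \le 2.5R_{\max}\left(\sqrt{V_n} + 1\right) \tag{49}$$

simply because the right-hand side is largest for $s = n$. Sub-additivity of max implies

$$\max_{s=1,\ldots,n}\left\|\sum_{t=1}^s Z_t\right\| - 2.5R_{\max}\left(\sqrt{V_n} + 1\right) \le \max_{s=1,\ldots,n}\sum_{t=1}^s \langle \hat{y}_t, Z_t\rangle. \tag{50}$$

By the Burkholder-Davis-Gundy inequality (with the constant from [6]),

$$\mathbb{E}\max_{s=1,\ldots,n}\sum_{t=1}^s \langle \hat{y}_t, Z_t\rangle \le \sqrt{3}\,\mathbb{E}\left(\sum_{t=1}^n \langle \hat{y}_t, Z_t\rangle^2\right)^{1/2} \le \sqrt{3}\,\mathbb{E}\sqrt{V_n}. \tag{51}$$

In view of (49), we conclude the statement.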

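Proof of Lemma 2. Because of the update form,

$$\forall f \in \mathcal{F}, \qquad \langle \hat{y}_{t+1} - f, z_t\rangle \le \frac{1}{\eta_t}\big(D_{\mathcal{R}}(f,\hat{y}_t) - D_{\mathcal{R}}(f,\hat{y}_{t+1}) - D_{\mathcal{R}}(\hat{y}_{t+1},\hat{y}_t)\big).$$

Summing over $t = 1,\ldots,n$,

$$\begin{aligned}
\sum_{t=1}^n \langle \hat{y}_{t+1} - f, z_t\rangle &\le \eta_1^{-1} D_{\mathcal{R}}(f,\hat{y}_1) + \sum_{t=2}^n (\eta_t^{-1} - \eta_{t-1}^{-1}) D_{\mathcal{R}}(f,\hat{y}_t) - \sum_{t=1}^n \eta_t^{-1} D_{\mathcal{R}}(\hat{y}_{t+1},\hat{y}_t) \\
&\le \eta_1^{-1} R_{\max}^2 + \sum_{t=2}^n (\eta_t^{-1} - \eta_{t-1}^{-1}) R_{\max}^2 - \sum_{t=1}^n \frac{\eta_t^{-1}}{2}\|\hat{y}_{t+1} - \hat{y}_t\|_*^2 \\
&\le R_{\max}^2(\eta_1^{-1} + \eta_n^{-1}) - \sum_{t=1}^n \frac{\eta_t^{-1}}{2}\|\hat{y}_{t+1} - \hat{y}_t\|_*^2,
\end{aligned}$$

where we used strong convexity of $\mathcal{R}$ and the fact that $\eta_t$ is nonincreasing. Next, we write

$$\sum_{t=1}^n \langle \hat{y}_t - f, z_t\rangle = \sum_{t=1}^n \langle \hat{y}_{t+1} - f, z_t\rangle + \sum_{t=1}^n \langle \hat{y}_t - \hat{y}_{t+1}, z_t\rangle$$

and upper bound the second term by noting that

$$\langle \hat{y}_t - \hat{y}_{t+1}, z_t\rangle \le \|\hat{y}_t - \hat{y}_{t+1}\|_* \cdot \|z_t\| \le \frac{\eta_t^{-1}}{2}\|\hat{y}_t - \hat{y}_{t+1}\|_*^2 + \frac{\eta_t}{2}\|z_t\|^2.$$

Combining the bounds,

$$\sum_{t=1}^n \langle \hat{y}_t - f, z_t\rangle \le R_{\max}^2(\eta_1^{-1} + \eta_n^{-1}) + \sum_{t=1}^n \frac{\eta_t}{2}\|z_t\|^2. \tag{52}$$

Now observe that

$$\eta_t = R_{\max}\min\left\{1,\ \frac{\sqrt{\sum_{s=1}^t \|z_s\|^2} - \sqrt{\sum_{s=1}^{t-1}\|z_s\|^2}}{\|z_t\|^2}\right\} \tag{53}$$

and thus the second term in (52) is upper bounded as

$$\sum_{t=1}^n \frac{\eta_t}{2}\|z_t\|^2 \le \frac{R_{\max}}{2}\sqrt{\sum_{s=1}^n \|z_s\|^2}.$$

For the first term, we use $\eta_1^{-1} = R_{\max}^{-1}$ and

$$\eta_n^{-1} \le R_{\max}^{-1}\max\left\{1,\ 2\sqrt{\sum_{s=1}^n \|z_s\|^2}\right\}.$$

Concluding,

$$\sum_{t=1}^n \langle \hat{y}_t - f, z_t\rangle \le R_{\max}\left(2 + 2.5\sqrt{\sum_{s=1}^n \|z_s\|^2}\right). \tag{54}$$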
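The two facts about the adaptive step size used in this proof (it is nonincreasing, and the weighted sum of squared norms telescopes) are easy to verify numerically. The sketch below is an illustration only; the function name and the input format (a list of the norms $\|z_t\|$) are assumptions. It uses the algebraically equivalent form $(\sqrt{S_t} - \sqrt{S_{t-1}})/\|z_t\|^2 = 1/(\sqrt{S_t} + \sqrt{S_{t-1}})$, with $S_t$ the running sum of squared norms, so that rounds with $z_t = 0$ need no special case.

```python
import math

def adaptive_steps(z_norms, r_max):
    """Step sizes of the form (53) for a sequence of norms ||z_1||, ..., ||z_n||."""
    steps, prev = [], 0.0
    for z in z_norms:
        cur = prev + z * z
        denom = math.sqrt(cur) + math.sqrt(prev)
        # Same value as (sqrt(cur) - sqrt(prev)) / z**2 whenever z > 0.
        ratio = 1.0 / denom if denom > 0 else float("inf")
        steps.append(r_max * min(1.0, ratio))
        prev = cur
    return steps

z_norms = [0.3, 1.0, 0.0, 2.0, 0.5]
r_max = 2.0
steps = adaptive_steps(z_norms, r_max)
weighted = sum(eta * z * z / 2 for eta, z in zip(steps, z_norms))
print(steps)
print(weighted, r_max / 2 * math.sqrt(sum(z * z for z in z_norms)))
```

The second printed line compares the left and right sides of the bound on the second term of (52).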

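Proof of Lemma 4. We have

$$\begin{aligned}
\mathbb{E}_{t-1}\exp\left\{\lambda A - \lambda^2 B^2/2\right\} &= \mathbb{E}_{t-1}\exp\left\{\lambda\sum_{t=1}^n \langle \hat{y}_t, Z_t - \mathbb{E}_{t-1}Z'_t\rangle - 2\lambda^2\sum_{t=1}^n \big(\|Z_t\|^2 + \mathbb{E}_{t-1}\|Z'_t\|^2\big)\right\} \\
&\le \mathbb{E}_{t-1}\exp\left\{\lambda\sum_{t=1}^n \langle \hat{y}_t, Z_t - Z'_t\rangle - 2\lambda^2\sum_{t=1}^n \big(\|Z_t\|^2 + \|Z'_t\|^2\big)\right\} \\
&\le \mathbb{E}_{t-1}\mathbb{E}_{\epsilon}\exp\left\{\lambda\sum_{t=1}^n \epsilon_t\langle \hat{y}_t, Z_t - Z'_t\rangle - 2\lambda^2\sum_{t=1}^n \big(\|Z_t\|^2 + \|Z'_t\|^2\big)\right\}.
\end{aligned}$$

Since $\exp$ is a convex function,

$$\begin{aligned}
&\mathbb{E}_{t-1}\mathbb{E}_{\epsilon}\exp\left\{\frac{1}{2}\left(2\lambda\sum_{t=1}^n \epsilon_t\langle \hat{y}_t, Z_t\rangle - 4\lambda^2\sum_{t=1}^n \|Z_t\|^2\right) + \frac{1}{2}\left(2\lambda\sum_{t=1}^n \epsilon_t\langle \hat{y}_t, -Z'_t\rangle - 4\lambda^2\sum_{t=1}^n \|Z'_t\|^2\right)\right\} \\
&\quad\le \frac{1}{2}\mathbb{E}_{t-1}\mathbb{E}_{\epsilon}\exp\left\{2\lambda\sum_{t=1}^n \epsilon_t\langle \hat{y}_t, Z_t\rangle - 4\lambda^2\sum_{t=1}^n \|Z_t\|^2\right\} + \frac{1}{2}\mathbb{E}_{t-1}\mathbb{E}_{\epsilon}\exp\left\{2\lambda\sum_{t=1}^n \epsilon_t\langle \hat{y}_t, -Z'_t\rangle - 4\lambda^2\sum_{t=1}^n \|Z'_t\|^2\right\} \\
&\quad= \mathbb{E}_{t-1}\mathbb{E}_{\epsilon}\exp\left\{2\lambda\sum_{t=1}^n \epsilon_t\langle \hat{y}_t, Z_t\rangle - 4\lambda^2\sum_{t=1}^n \|Z_t\|^2\right\} \\
&\quad\le \mathbb{E}_{t-1}\exp\left\{4\lambda^2\sum_{t=1}^n |\langle \hat{y}_t, Z_t\rangle|^2 - 4\lambda^2\sum_{t=1}^n \|Z_t\|^2\right\} \\
&\quad\le 1
\end{aligned}$$

since $\|\hat{y}_t\|_* \le 1$.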


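Proof of Theorem 5. Let $X_1,\ldots,X_n$ be a discrete time process. We have

$$\begin{aligned}
&\mathbb{E}\left[\left|\sup_{f\in\mathcal{F}}\sum_{t=1}^n \big(f(X_t) - \mathbb{E}_{t-1}[f(X_t)]\big)\right| - C\left(\sum_{t=1}^n \mathbb{E}'_{t-1}\sup_{f\in\mathcal{F}}|f(X_t) - f(X'_t)|^p\right)^{1/p}\right] \\
&\quad\le \left\langle\!\left\langle \sup_{p_t}\mathbb{E}_{X_t\sim p_t}\right\rangle\!\right\rangle_{t=1}^n \left[\left|\sup_{f\in\mathcal{F}}\sum_{t=1}^n \big(f(X_t) - \mathbb{E}_{X'_t\sim p_t}[f(X'_t)]\big)\right| - C\left(\sum_{t=1}^n \mathbb{E}_{X'_t\sim p_t}\sup_{f\in\mathcal{F}}|f(X_t) - f(X'_t)|^p\right)^{1/p}\right]
\end{aligned}$$

where $\left\langle\!\left\langle \sup_{p_t}\mathbb{E}_{X_t\sim p_t}\right\rangle\!\right\rangle_{t=1}^n$ stands for repeated application of the operators: $\sup_{p_1}\mathbb{E}_{X_1}\ldots\sup_{p_n}\mathbb{E}_{X_n}$. By Jensen’s inequality, we upper bound the above expression by

$$\left\langle\!\left\langle \sup_{p_t}\mathbb{E}_{X_t,X'_t\sim p_t}\right\rangle\!\right\rangle_{t=1}^n \left[\left|\sup_{f\in\mathcal{F}}\sum_{t=1}^n \big(f(X_t) - f(X'_t)\big)\right| - C\left(\sum_{t=1}^n \sup_{f\in\mathcal{F}}|f(X_t) - f(X'_t)|^p\right)^{1/p}\right].$$

Introducing independent Rademacher random variables $\epsilon_1,\ldots,\epsilon_n$, the preceding expression is equal to

$$\begin{aligned}
&\left\langle\!\left\langle \sup_{p_t}\mathbb{E}_{X_t,X'_t\sim p_t}\mathbb{E}_{\epsilon_t}\right\rangle\!\right\rangle_{t=1}^n \left[\left|\sup_{f\in\mathcal{F}}\sum_{t=1}^n \epsilon_t\big(f(X_t) - f(X'_t)\big)\right| - C\left(\sum_{t=1}^n \sup_{f\in\mathcal{F}}|f(X_t) - f(X'_t)|^p\right)^{1/p}\right] \\
&\quad\le \left\langle\!\left\langle \sup_{x_t,x'_t}\mathbb{E}_{\epsilon_t}\right\rangle\!\right\rangle_{t=1}^n \left[\left|\sup_{f\in\mathcal{F}}\sum_{t=1}^n \epsilon_t\big(f(x_t) - f(x'_t)\big)\right| - C\left(\sum_{t=1}^n \sup_{f\in\mathcal{F}}|f(x_t) - f(x'_t)|^p\right)^{1/p}\right].
\end{aligned}$$

The latter expression may be written as

$$\sup_{\mathbf{x},\mathbf{x}'}\mathbb{E}\left[\left|\sup_{f\in\mathcal{F}}\sum_{t=1}^n \epsilon_t\big(f(\mathbf{x}_t) - f(\mathbf{x}'_t)\big)\right| - C\left(\sum_{t=1}^n \sup_{f\in\mathcal{F}}|f(\mathbf{x}_t) - f(\mathbf{x}'_t)|^p\right)^{1/p}\right] \tag{55}$$

with a supremum ranging over predictable processes $\mathbf{x} = (\mathbf{x}_1,\ldots,\mathbf{x}_n)$ and $\mathbf{x}' = (\mathbf{x}'_1,\ldots,\mathbf{x}'_n)$, each $\mathbf{x}_t,\mathbf{x}'_t : \{\pm 1\}^{t-1} \to \mathcal{X}$. Now define the function class $\mathcal{G} \subset \mathbb{R}^{\mathcal{X}\times\mathcal{X}}$ as follows:

$$\mathcal{G} = \{(x,x') \mapsto f(x) - f(x') : f \in \mathcal{F}\}.$$

Trivially, (55) can be written with this notation as

$$\sup_{\mathbf{x},\mathbf{x}'}\mathbb{E}\left[\left|\sup_{g\in\mathcal{G}}\sum_{t=1}^n \epsilon_t g(\mathbf{x}_t,\mathbf{x}'_t)\right| - C\left(\sum_{t=1}^n \sup_{g\in\mathcal{G}}|g(\mathbf{x}_t,\mathbf{x}'_t)|^p\right)^{1/p}\right].$$

However, the complexity of $\mathcal{G}$ is not much larger than that of $\mathcal{F}$:

$$\mathcal{R}_n(\mathcal{G};(\mathbf{x},\mathbf{x}')) = \mathbb{E}\left|\sup_{g\in\mathcal{G}}\sum_{t=1}^n \epsilon_t g(\mathbf{x}_t,\mathbf{x}'_t)\right| = \mathbb{E}\left|\sup_{f\in\mathcal{F}}\sum_{t=1}^n \epsilon_t\big(f(\mathbf{x}_t) - f(\mathbf{x}'_t)\big)\right| \le \mathcal{R}_n(\mathcal{F};\mathbf{x}) + \mathcal{R}_n(\mathcal{F};\mathbf{x}').$$

The first part of the theorem is concluded by applying Lemma 15 to the class $\mathcal{G}$.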

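To prove the second part, we modify the lower bound construction in [25, Theorem 2]. Assume that we are given a predictable process $\mathbf{x}$ of length $n$ and that $x_0$ is any one of the $2^n - 1$ values in the image of $\mathbf{x}$. Since $\mathbb{E}_{\epsilon}\left[\left|\sum_{t=1}^n \epsilon_t\right|\right] \le \sqrt{n}$, we have that

$$\begin{aligned}
\mathcal{R}_n(\mathcal{F};\mathbf{x}) &= \mathbb{E}_{\epsilon}\left[\left|\sup_{f\in\mathcal{F}}\sum_{t=1}^n \epsilon_t f(\mathbf{x}_t)\right|\right] \\
&\le \mathbb{E}_{\epsilon}\left[\left|\sup_{f\in\mathcal{F}}\sum_{t=1}^n \epsilon_t f(\mathbf{x}_t)\right|\right] - \sup_{f\in\mathcal{F}}|f(x_0)|\ \mathbb{E}_{\epsilon}\left[\left|\sum_{t=1}^n \epsilon_t\right|\right] + \sup_{f\in\mathcal{F}}|f(x_0)|\sqrt{n} \\
&\le \mathbb{E}_{\epsilon}\left[\left|\sup_{f\in\mathcal{F}}\sum_{t=1}^n \epsilon_t f(\mathbf{x}_t)\right|\right] - \mathbb{E}_{\epsilon}\left[\left|\sup_{f\in\mathcal{F}}\sum_{t=1}^n \epsilon_t f(x_0)\right|\right] + \sup_{f\in\mathcal{F}}|f(x_0)|\sqrt{n} \\
&\le \mathbb{E}_{\epsilon}\left[\left|\sup_{f\in\mathcal{F}}\sum_{t=1}^n \epsilon_t f(\mathbf{x}_t) - \sup_{f\in\mathcal{F}}\sum_{t=1}^n \epsilon_t f(x_0)\right|\right] + \sup_{f\in\mathcal{F}}|f(x_0)|\sqrt{n} \\
&\le \mathbb{E}_{\epsilon}\left[\left|\sup_{f\in\mathcal{F}}\sum_{t=1}^n \epsilon_t\big(f(\mathbf{x}_t) - f(x_0)\big)\right|\right] + \sup_{f\in\mathcal{F}}|f(x_0)|\sqrt{n} \\
&\le 2\mathbb{E}_{\epsilon}\left[\left|\sup_{f\in\mathcal{F}}\sum_{t=1}^n \left(\frac{f(\mathbf{x}_t) + f(x_0)}{2} - \frac{1-\epsilon_t}{2}f(\mathbf{x}_t) - \frac{1+\epsilon_t}{2}f(x_0)\right)\right|\right] + \sup_{f\in\mathcal{F}}|f(x_0)|\sqrt{n}.
\end{aligned}$$

Now consider the joint distribution over $X_1,\ldots,X_n$ such that, for every $t \in [n]$, $P(X_t = x_0 \mid \epsilon_{1:t-1}, \epsilon_t = 1) = 1$ and $P(X_t = \mathbf{x}_t(\epsilon_{1:t-1}) \mid \epsilon_{1:t-1}, \epsilon_t = -1) = 1$. Under this distribution, we can rewrite the above inequality as

$$\mathcal{R}_n(\mathcal{F};\mathbf{x}) \le 2\mathbb{E}\left[\left|\sup_{f\in\mathcal{F}}\sum_{t=1}^n \big(\mathbb{E}_{t-1}[f(X_t)] - f(X_t)\big)\right|\right] + \sup_{f\in\mathcal{F}}|f(x_0)|\sqrt{n}.$$

Since $\mathcal{F}$ is of type $r$, taking $\epsilon'_1,\ldots,\epsilon'_n$ to be an independent Rademacher sequence, we further bound the above term as

$$\begin{aligned}
&2C\,\mathbb{E}\left[\left(\sum_{t=1}^n \mathbb{E}'_{t-1}\sup_{f\in\mathcal{F}}|f(X_t) - f(X'_t)|^r\right)^{1/r}\right] + \sup_{f\in\mathcal{F}}|f(x_0)|\sqrt{n} \\
&\quad\le 2C\,\mathbb{E}\left[\left(\sum_{t=1}^n \sup_{f\in\mathcal{F}}\left|\frac{1-\epsilon_t}{2}f(\mathbf{x}_t) + \frac{1+\epsilon_t}{2}f(x_0) - \frac{1-\epsilon'_t}{2}f(\mathbf{x}_t) - \frac{1+\epsilon'_t}{2}f(x_0)\right|^r\right)^{1/r}\right] + \sup_{f\in\mathcal{F}}|f(x_0)|\sqrt{n} \\
&\quad\le 4C\left(n\left(\sup_{f\in\mathcal{F},\,t\le n,\,\epsilon\in\{\pm 1\}^n}|f(\mathbf{x}_t)|^r + \sup_{f\in\mathcal{F}}|f(x_0)|^r\right)\right)^{1/r} + \sup_{f\in\mathcal{F}}|f(x_0)|\sqrt{n}.
\end{aligned}$$

Since $x_0$ is one of the elements of the tree $\mathbf{x}$, we further upper bound the expression by

$$8C\,n^{1/r}\left(\sup_{f\in\mathcal{F},\,t\le n,\,\epsilon\in\{\pm 1\}^n}|f(\mathbf{x}_t)|\right) + \sup_{f\in\mathcal{F}}|f(x_0)|\sqrt{n} \le 16C\,n^{1/r}\left(\sup_{f\in\mathcal{F},\,t\le n,\,\epsilon\in\{\pm 1\}^n}|f(\mathbf{x}_t)|\right).$$

In the last step we used the fact that $r \le 2$ and so $\sqrt{n} \le n^{1/r}$.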

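Proof of Lemma 7. We have

$$P(\xi \ge u) \le \frac{\mathbb{E}\varphi(\xi)}{\varphi(u)} \le \frac{\mathbb{E}\varphi(\nu)}{\varphi(u)} \le \frac{1}{\varphi(u)}\left(\varphi(0) + \int_0^{\infty}\varphi'(x)P(\nu \ge x)\,dx\right).$$

Choose $a = u - \mu^{-1}(1)$, where $\mu^{-1}$ is the inverse function. If $a < 0$, the conclusion of the lemma is true since $\Gamma \ge 1$. In the case of $a \ge 0$, we have $\varphi(0) = 0$. The above upper bound becomes

$$\begin{aligned}
P(\xi \ge u) &\le \frac{\Gamma}{\varphi(u)}\int_0^{\infty}\varphi'(x)\exp\{-\mu(x)\}\,dx = \frac{\Gamma}{\varphi(u)}\int_a^{\infty}\mu'(x)\exp\{-\mu(x)\}\,dx \\
&= \frac{\Gamma}{\mu(u-a)}\Big[-\exp\{-\mu(x)\}\Big]_a^{\infty} = \Gamma\exp\{-\mu(a)\} = \Gamma\exp\{-\mu(u - \mu^{-1}(1))\}.
\end{aligned}$$

If $\mu(b) = cb$, we have

$$P(\xi \ge u) \le \Gamma\exp\{-c(u - 1/c)\} = \Gamma\exp\{1 - cu\}.$$

If $\mu(b) = cb^2$, we have

$$P(\xi \ge u) \le \Gamma\exp\{-c(u - 1/\sqrt{c})^2\} \le \Gamma\exp\{-cu^2/4\}$$

whenever $u \ge 2/\sqrt{c}$. If $u \le 2/\sqrt{c}$, the conclusion is valid since $\Gamma \ge 1$.

Proof of Corollary 8. Let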

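$$\xi(Z_1,\ldots,Z_n,Z'_1,\ldots,Z'_n) = \sup_{g}\sum_{t=1}^n \big(g(Z_t) - g(Z'_t)\big) - B(g;Z_1,Z'_1,\ldots,Z_n,Z'_n)$$

and

$$\nu(Z_1,\ldots,Z_n) = \sup_{g}\sum_{t=1}^n \big(g(Z_t) - \mathbb{E}_{t-1}g(Z'_t)\big) - \mathbb{E}_{Z'_{1:n}}B(g;Z_1,Z'_1,\ldots,Z_n,Z'_n).$$

Then for any convex $\varphi : \mathbb{R} \to \mathbb{R}$,

$$\mathbb{E}\varphi(\nu) \le \mathbb{E}\varphi(\xi)$$

using convexity of the supremum. The problem is now reduced to obtaining tail bounds for

$$P\left(\sup_{g}\sum_{t=1}^n \big(g(Z_t) - g(Z'_t)\big) - B(g;Z_1,Z'_1,\ldots,Z_n,Z'_n) > u\right).$$

Write the probability as

$$\mathbb{E}\,\mathbf{I}\{\xi(Z_1,\ldots,Z_n,Z'_1,\ldots,Z'_n) > u\}.$$

We now proceed to replace the random variables from $n$ backwards with a dyadic filtration. Let us start with the last index. Renaming $Z_n$ and $Z'_n$ we see that

$$\begin{aligned}
&\mathbb{E}\,\mathbf{I}\left\{\sup_{g}\sum_{t=1}^n \big(g(Z_t) - g(Z'_t)\big) - B(g;Z_1,Z'_1,\ldots,Z_n,Z'_n) > u\right\} \\
&\quad= \mathbb{E}\,\mathbf{I}\left\{\sup_{g}\sum_{t=1}^{n-1}\big(g(Z_t) - g(Z'_t)\big) + \big(g(Z'_n) - g(Z_n)\big) - B(g;Z_1,Z'_1,\ldots,Z_n,Z'_n) > u\right\} \\
&\quad= \mathbb{E}\,\mathbb{E}_{\epsilon_n}\mathbf{I}\left\{\sup_{g}\sum_{t=1}^{n-1}\big(g(Z_t) - g(Z'_t)\big) + \epsilon_n\big(g(Z_n) - g(Z'_n)\big) - B(g;Z_1,Z'_1,\ldots,Z_n,Z'_n) > u\right\} \\
&\quad\le \mathbb{E}\sup_{z_n,z'_n}\mathbb{E}_{\epsilon_n}\mathbf{I}\left\{\sup_{g}\sum_{t=1}^{n-1}\big(g(Z_t) - g(Z'_t)\big) + \epsilon_n\big(g(z_n) - g(z'_n)\big) - B(g;Z_1,Z'_1,\ldots,Z_{n-1},Z'_{n-1},z_n,z'_n) > u\right\}.
\end{aligned}$$

Proceeding in this manner for step $n-1$ and back to $t = 1$, we obtain an upper bound of

$$\begin{aligned}
&\sup_{z_1,z'_1}\mathbb{E}_{\epsilon_1}\ldots\sup_{z_n,z'_n}\mathbb{E}_{\epsilon_n}\mathbf{I}\left\{\sup_{g}\sum_{t=1}^n \epsilon_t\big(g(z_t) - g(z'_t)\big) - B(g;z_1,z'_1,\ldots,z_n,z'_n) > u\right\} \\
&\quad= \sup_{\mathbf{x}}\mathbb{E}\,\mathbf{I}\left\{\sup_{g}\sum_{t=1}^n \epsilon_t f_g(\mathbf{x}_t) - B(g;\mathbf{x}_1,\ldots,\mathbf{x}_n) > u\right\}.
\end{aligned}$$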

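Proof of Lemma 10. To check the desired statement (32) for the given predictable process $\mathbf{x}$, we verify that

$$\left\langle\!\left\langle \inf_{\hat{y}_t}\max_{y_t\in\{\pm 1\}}\right\rangle\!\right\rangle_{t=1}^n \left[\sum_{t=1}^n |\hat{y}_t - y_t| - \inf_{f\in\mathcal{F}}\left\{\sum_{t=1}^n |f(\mathbf{x}_t(y_{1:t-1})) - y_t| + B(f;\mathbf{x})\right\}\right] \le 0 \tag{56}$$

where each infimum is taken over the set $\{\hat{y}_t : |\hat{y}_t| \le \sup_{f\in\mathcal{F}}|f(\mathbf{x}_t(y_{1:t-1}))|\}$. To this end,

$$\begin{aligned}
&2\times\left\langle\!\left\langle \inf_{\hat{y}_t}\max_{y_t\in\{\pm 1\}}\right\rangle\!\right\rangle_{t=1}^n \left[\sum_{t=1}^n |\hat{y}_t - y_t| - \inf_{f\in\mathcal{F}}\left\{\sum_{t=1}^n |f(\mathbf{x}_t(y_{1:t-1})) - y_t| + B(f;\mathbf{x})\right\}\right] \\
&\quad= \left\langle\!\left\langle \inf_{\hat{y}_t}\sup_{y_t\in\{\pm 1\}}\right\rangle\!\right\rangle_{t=1}^n \left[\sum_{t=1}^n (-\hat{y}_t\cdot y_t) - \inf_{f\in\mathcal{F}}\left\{\sum_{t=1}^n \big(-f(\mathbf{x}_t(y_{1:t-1}))\cdot y_t\big) + 2B(f;\mathbf{x})\right\}\right] \\
&\quad= \left\langle\!\left\langle \sup_{p_t}\mathbb{E}_{y_t\sim p_t}\right\rangle\!\right\rangle_{t=1}^n \left[\sum_{t=1}^n \inf_{\hat{y}_t}\big(-\hat{y}_t\,\mathbb{E}[y_t]\big) - \inf_{f\in\mathcal{F}}\left\{\sum_{t=1}^n -f(\mathbf{x}_t(y_{1:t-1}))\cdot y_t + 2B(f;\mathbf{x})\right\}\right]
\end{aligned}$$

where $p_t$ ranges over distributions on $\{\pm 1\}$. In the last step, we have used the minimax theorem, and then the technique that can be found, for instance, in [1, 24]. Next, we replace the infima by (sub)optimal choices corresponding to the value of $f$. This yields

$$\begin{aligned}
&\left\langle\!\left\langle \sup_{p_t}\mathbb{E}_{y_t\sim p_t}\right\rangle\!\right\rangle_{t=1}^n \left[\sup_{f\in\mathcal{F}}\sum_{t=1}^n \left(f(\mathbf{x}_t(y_{1:t-1}))\cdot y_t - \sup_{\hat{y}_t}\hat{y}_t\,\mathbb{E}[y_t]\right) - 2B(f;\mathbf{x})\right] \\
&\quad\le \left\langle\!\left\langle \sup_{p_t}\mathbb{E}_{y_t\sim p_t}\right\rangle\!\right\rangle_{t=1}^n \left[\sup_{f\in\mathcal{F}}\sum_{t=1}^n f(\mathbf{x}_t(y_{1:t-1}))\cdot\big(y_t - \mathbb{E}[y_t]\big) - 2B(f;\mathbf{x})\right] \\
&\quad\le \left\langle\!\left\langle \sup_{p_t}\mathbb{E}_{y_t,y'_t\sim p_t}\right\rangle\!\right\rangle_{t=1}^n \left[\sup_{f\in\mathcal{F}}\sum_{t=1}^n f(\mathbf{x}_t(y_{1:t-1}))\cdot(y_t - y'_t) - 2B(f;\mathbf{x})\right]
\end{aligned}$$

where the last step is by Jensen’s inequality. We further upper bound the above expression by

$$\left\langle\!\left\langle \sup_{p_t}\mathbb{E}_{y_t,y'_t\sim p_t}\max_{y''_t}\right\rangle\!\right\rangle_{t=1}^n \left[\sup_{f\in\mathcal{F}}\sum_{t=1}^n f(\mathbf{x}_t(y''_{1:t-1}))\cdot(y_t - y'_t) - 2B(f;\mathbf{x})\right]$$

where $y''_t$ ranges over $\{\pm 1\}$. Since $y_t, y'_t$ can be renamed, we introduce the random signs

$$\begin{aligned}
&\left\langle\!\left\langle \sup_{p_t}\mathbb{E}_{y_t,y'_t\sim p_t}\mathbb{E}_{\epsilon_t}\max_{y''_t}\right\rangle\!\right\rangle_{t=1}^n \left[\sup_{f\in\mathcal{F}}\sum_{t=1}^n \epsilon_t(y_t - y'_t)f(\mathbf{x}_t(y''_{1:t-1})) - 2B(f;\mathbf{x})\right] \\
&\quad\le \left\langle\!\left\langle \max_{y_t,y'_t}\mathbb{E}_{\epsilon_t}\max_{y''_t}\right\rangle\!\right\rangle_{t=1}^n \left[\sup_{f\in\mathcal{F}}\sum_{t=1}^n \epsilon_t(y_t - y'_t)f(\mathbf{x}_t(y''_{1:t-1})) - 2B(f;\mathbf{x})\right] \\
&\quad\le \left\langle\!\left\langle \max_{b_t\in\{\pm 1\}}\mathbb{E}_{\epsilon_t}\max_{y_t}\right\rangle\!\right\rangle_{t=1}^n \left[\sup_{f\in\mathcal{F}}\sum_{t=1}^n 2\epsilon_t b_t f(\mathbf{x}_t(y_{1:t-1})) - 2B(f;\mathbf{x})\right].
\end{aligned}$$

Since $b_t\epsilon_t$ has the same distribution as $\epsilon_t$ for any $b_t \in \{\pm 1\}$, we write the above expression as

$$\left\langle\!\left\langle \mathbb{E}_{\epsilon_t}\max_{y_t}\right\rangle\!\right\rangle_{t=1}^n \left[\sup_{f\in\mathcal{F}}\sum_{t=1}^n 2\epsilon_t f(\mathbf{x}_t(y_{1:t-1})) - 2B(f;\mathbf{x})\right].$$

To be consistent with the notation of predictable processes, we shift the numbering on $y$ by one:

$$\begin{aligned}
\left\langle\!\left\langle \mathbb{E}_{\epsilon_t}\max_{y_{t+1}}\right\rangle\!\right\rangle_{t=1}^n \left[\sup_{f\in\mathcal{F}}\sum_{t=1}^n 2\epsilon_t f(\mathbf{x}_t(y_{2:t})) - 2B(f;\mathbf{x})\right] &= \sup_{\mathbf{y}}\mathbb{E}_{\epsilon}\left[\sup_{f\in\mathcal{F}}\sum_{t=1}^n 2\epsilon_t f(\mathbf{x}_t(\mathbf{y}_{2:t}(\epsilon))) - 2B(f;\mathbf{x})\right] \\
&\le \sup_{\mathbf{y}}\mathbb{E}_{\epsilon}\left[\sup_{f\in\mathcal{F}}\sum_{t=1}^n 2\epsilon_t f(\mathbf{x}_t(\mathbf{y}_{2:t}(\epsilon))) - 2B(f;\mathbf{x}\circ\mathbf{y})\right]
\end{aligned}$$

by (30). The last quantity is nonpositive by (31).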

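Proof of Lemma 11. The first two statements are immediate from the discussion preceding the Lemma. For the third statement, we have that

$$P\left(\sup_{f\in\mathcal{F}}\sum_{t=1}^n \epsilon_t f(\mathbf{x}_t) - 2B(f;\mathbf{x}_1,\ldots,\mathbf{x}_n) - \alpha\sum_{t=1}^n \sup_{f\in\mathcal{F}}f(\mathbf{x}_t)^2 > u\right) \le P\left(\sum_{t=1}^n \hat{y}_t\epsilon_t - \alpha\sum_{t=1}^n \sup_{f\in\mathcal{F}}f(\mathbf{x}_t)^2 > u\right)$$

and the latter probability is further upper bounded by

$$\begin{aligned}
P\left(\sum_{t=1}^n \big(\epsilon_t\hat{y}_t - \alpha\hat{y}_t^2\big) > u\right) &\le \inf_{\lambda>0}\mathbb{E}_{\epsilon}\left[\exp\left\{\lambda\sum_{t=1}^n \epsilon_t\hat{y}_t - \sum_{t=1}^n \alpha\lambda\hat{y}_t^2 - \lambda u\right\}\right] \\
&\le \inf_{\lambda>0}\max_{\hat{y}_{1:n}\in[-1,1]^n}\exp\left\{\sum_{t=1}^n \big(\lambda^2\hat{y}_t^2/2 - \alpha\lambda\hat{y}_t^2\big) - \lambda u\right\} \le \exp\{-2\alpha u\}.
\end{aligned}$$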

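Lemma 21. Suppose we have a collection of random variables $(X(g),Y(g))_{g\in\mathcal{G}}$, with $0 \le Y(g) \le b$ almost surely for any $g \in \mathcal{G}$. Suppose for all $\alpha > 0$, $c > 0$, and some $0 \le a \le 1$, and $K \ge 0$ it holds that

$$P\left(\sup_{g\in\mathcal{G}} X(g) - \alpha^{-a}K - \alpha Y(g) > u\right) \le \Gamma\exp\{-c\alpha u\}.$$

Then

$$P\left(\sup_{g\in\mathcal{G}} X(g) - 4K^{\frac{1}{1+a}}(Y(g)+1)^{\frac{a}{a+1}} - 4u\sqrt{Y(g)+1} > 0\right) \le \log(b)\,\Gamma\exp\{-cu^2\}.$$

Proof of Lemma 21. Fix $u > 0$ and consider a discretization over two regions $[d'_\ell,d'_u]$, $[d''_\ell,d''_u]$, given by $\alpha_i = d'_\ell 2^{i-1}$, $i \in \{1,\ldots,N' = \lceil\log(d'_u/d'_\ell)\rceil\}$ and $\alpha_j = d''_\ell 2^{j-1}$, $j \in \{1,\ldots,N'' = \lceil\log(d''_u/d''_\ell)\rceil\}$. Let $N = N' + N''$ be the total cardinality of the discretization, and let $I$ denote the discretized set. From our premise, we have that for every index $i$ and $t > 0$,

$$P\left(\sup_{g\in\mathcal{G}} X(g) - \alpha_i^{-a}K - \alpha_i Y(g) > t\right) \le \exp\{-c\alpha_i t\}.$$

Substituting $t = u\alpha_i^{-1} + \alpha_i$,

$$P\left(\sup_{g\in\mathcal{G}} X(g) - \alpha_i^{-a}K - \alpha_i Y(g) - \alpha_i > u\alpha_i^{-1}\right) \le \exp\{-cu - c\alpha_i^2\}.$$

By union bound we conclude that,

$$P\left(\max_{\alpha_i\in I}\sup_{g\in\mathcal{G}} X(g) - \alpha_i^{-a}K - \alpha_i Y(g) - \alpha_i - u\alpha_i^{-1} > 0\right) \le \sum_{\alpha_i\in I}\exp\{-cu - c\alpha_i^2\} \le N\times\exp\{-cu\}.$$

Therefore,

$$P\left(\sup_{g\in\mathcal{G},\,\alpha} X(g) - 2\alpha(Y(g)+1) - \alpha^{-a}K - u\alpha^{-1} > 0\right) \le N\times\exp\{-cu\}$$

with $\alpha$ taking values in $[d'_\ell,d'_u]\cup[d''_\ell,d''_u]$. However,

$$\inf_{\alpha}\ 2\alpha(Y(g)+1) + \alpha^{-a}K + u\alpha^{-1} \le \inf_{\alpha}\ 2\alpha(Y(g)+1) + 2\max\{\alpha^{-a}K,\ u\alpha^{-1}\}.$$

Passing to two balancing choices

$$\alpha' = \sqrt{\frac{u}{Y(g)+1}}, \qquad \alpha'' = \left(\frac{K}{Y(g)+1}\right)^{1/(a+1)}$$

we obtain

$$P\left(\sup_{g\in\mathcal{G}} X(g) - 4\max\left\{\sqrt{(Y(g)+1)u},\ K^{1/(a+1)}(Y(g)+1)^{a/(a+1)}\right\} > 0\right) \le N\exp\{-cu\}.$$

It remains to quantify $N$ such that, for any $g \in \mathcal{G}$, the choices of $\alpha', \alpha''$ are captured by the two corresponding regions of discretization. It is immediate that $N$ does not depend on $u$ or $K$, and depends logarithmically on $b$. This concludes the proof.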
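The last remark, that a dyadic grid of logarithmic size captures the balancing choice, can be illustrated for the first region. The sketch below is an assumption-laden illustration rather than the construction of the proof: it keeps only the terms $2\alpha(Y+1) + u/\alpha$, takes base-2 logarithms, adds one extra grid point so that the upper endpoint is covered, and uses hypothetical helper names. Since $0 \le Y \le b$, the choice $\alpha' = \sqrt{u/(Y+1)}$ lies in $[\sqrt{u/(b+1)}, \sqrt{u}]$, so the ratio of the endpoints, and hence the grid size, does not depend on $u$.

```python
import math

def dyadic_grid(lo, hi):
    """Points lo * 2**i covering [lo, hi], with one extra point so hi is covered."""
    count = math.ceil(math.log2(hi / lo)) + 1
    return [lo * 2 ** i for i in range(count)]

def objective(alpha, y, u):
    return 2 * alpha * (y + 1) + u / alpha

def best_on_grid(grid, y, u):
    return min(objective(a, y, u) for a in grid)

u, b = 3.0, 100.0
grid = dyadic_grid(math.sqrt(u / (b + 1)), math.sqrt(u))
print(len(grid))  # grows like log(b), independent of u
for y in (0.0, 0.5, 7.0, 42.0, 100.0):
    alpha_prime = math.sqrt(u / (y + 1))
    print(y, best_on_grid(grid, y, u), objective(alpha_prime, y, u))
```

Some grid point lies within a factor 2 above $\alpha'$, so each of the two terms at most doubles and the grid value stays within a factor 2 of the value at $\alpha'$.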

In the proofs, it is useful to work with the growth assumption (57), defined below, which is equivalent to (18).

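Lemma 22. Suppose sequential Rademacher complexity exhibits an $n^{1/r}$ growth with constant $D/2 > 0$, in the sense of (18). Then the following holds for any $\{0,1\}$-valued predictable process $\mathbf{b}$ and any $\mathcal{X}$-valued predictable process $\mathbf{x}$ (both with respect to the dyadic filtration):

$$\mathbb{E}_{\epsilon}\left[\sup_{f\in\mathcal{F}}\sum_{t=1}^n \epsilon_t\mathbf{b}_t f(\mathbf{x}_t)\right] \le D\left(\max_{\epsilon}\sum_t \mathbf{b}_t\right)^{1/r}\left(\sup_{f\in\mathcal{F},\,t\in[n],\,\epsilon\in\{\pm 1\}^n}|\mathbf{b}_t f(\mathbf{x}_t)|\right). \tag{57}$$

Proof of Lemma 22. For any $f \in \mathcal{F}$,

$$\sum_{t=1}^n \epsilon_t\mathbf{b}_t f(\mathbf{x}_t) = \sum_{i=1}^N \epsilon_{\tau_i}f(\mathbf{x}_{\tau_i}) \tag{58}$$

where $N = \max_{\epsilon}\sum\mathbf{b}_t$ and $\tau_i = \min\{s : \sum_{k=1}^s \mathbf{b}_k \ge i\}$. For simplicity, assume $\sum\mathbf{b}_t = N$ for all $\epsilon$ uniformly (the argument can be modified appropriately if not). Since $\mathbf{b}_k$ is $\mathcal{A}_{k-1}$-measurable, the event $\{\tau_i \le t\}$ is $\mathcal{A}_{t-1}$-measurable. Define $N$ random variables $X_i = \mathbf{x}_{\tau_i}$ and $\tilde{\epsilon}_i = \epsilon_{\tau_i}$, as well as the filtration $\tilde{\mathcal{A}}_i = \mathcal{A}_{\tau_i}$. We have that for any $f$ and $i$

$$\mathbb{E}\left[\tilde{\epsilon}_i f(X_i) \mid \tilde{\mathcal{A}}_{i-1}\right] = 0$$

and therefore $\sum_{i=1}^N \tilde{\epsilon}_i f(X_i)$ is a sum of martingale differences, indexed by $f$. By the result of [25], for any process $X_1,\ldots,X_N$ with values in $\mathrm{img}(\mathbf{x})$,

$$\mathbb{E}\sup_{f\in\mathcal{F}}\sum_{i=1}^N \tilde{\epsilon}_i f(X_i) \le 2\sup_{\mathbf{y},\mathbf{x}'}\mathbb{E}_{\gamma}\sup_{f\in\mathcal{F}}\sum_{i=1}^N \gamma_i\mathbf{y}_i f(\mathbf{x}'_i) = 2\sup_{\mathbf{x}'}\mathbb{E}_{\gamma}\sup_{f\in\mathcal{F}}\sum_{i=1}^N \gamma_i f(\mathbf{x}'_i)$$

where $\mathbf{y},\mathbf{x}'$ range, respectively, over $\pm 1$-valued and $\mathrm{img}(\mathbf{x})$-valued trees. The last equality follows from the rotation lemma (see [20]).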

B Proof of Lemma 15

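Lemma 23. Let $\mathcal{F} \subseteq \mathbb{R}^{\mathcal{X}}$ and $r \in (1,2]$. If (18) holds with constant $D/2$, then

$$\mathbb{E}_{\epsilon}\left[\sup_{f\in\mathcal{F}}\sum_{t=1}^n \epsilon_t f(\mathbf{x}_t)\right] \le C_{r,p}\max_{\epsilon}\left(\sum_{t=1}^n \sup_{f\in\mathcal{F}}|f(\mathbf{x}_t)|^p\right)^{1/p}$$

for any $1 \le p < r$ and $C_{r,p} \triangleq D\left(1 - 2^{-(r-p)/rp}\right)^{-1}$.

Proof of Lemma 23. Given a predictable process $\mathbf{x}$, define for each $k = 0,1,\ldots$ a predictable process $\mathbf{b}^{(k)}$ by

$$\mathbf{b}^{(k)}_t = \begin{cases} 1 & \text{if } 2^{-(k+1)/p}A < \sup_{f\in\mathcal{F}}|f(\mathbf{x}_t)| \le 2^{-k/p}A \\ 0 & \text{otherwise,} \end{cases}$$

where

$$A = \max_{\epsilon}\left(\sum_{t=1}^n \sup_{f\in\mathcal{F}}|f(\mathbf{x}_t)|^p\right)^{1/p}.$$

Since $\mathbf{x}_t$ is $\mathcal{A}_{t-1}$-measurable, so is $\mathbf{b}^{(k)}_t$. From the definition, $\sum_{k\ge 0}\mathbf{b}^{(k)}_t \equiv 1$. Hence

$$\sup_{f\in\mathcal{F}}\sum_{t=1}^n \epsilon_t f(\mathbf{x}_t) \le \sum_{k\ge 0}\sup_{f\in\mathcal{F}}\sum_{t=1}^n \epsilon_t\mathbf{b}^{(k)}_t f(\mathbf{x}_t).$$

Denoting $N_k(\epsilon) = \{t : \mathbf{b}^{(k)}_t = 1\}$,

$$\begin{aligned}
\mathbb{E}_{\epsilon}\left[\sup_{f\in\mathcal{F}}\sum_{t=1}^n \epsilon_t f(\mathbf{x}_t)\right] &\le \sum_{k\ge 0}\mathbb{E}_{\epsilon}\left[\sup_{f\in\mathcal{F}}\sum_{t=1}^n \epsilon_t\mathbf{b}^{(k)}_t f(\mathbf{x}_t)\right] \\
&\le D\sum_{k\ge 0}\left(\max_{\epsilon}|N_k(\epsilon)|\right)^{1/r}\sup_{f,\epsilon,t}|\mathbf{b}^{(k)}_t f(\mathbf{x}_t)| \\
&\le DA\sum_{k\ge 0}\left(\max_{\epsilon}|N_k(\epsilon)|\right)^{1/r}2^{-k/p}.
\end{aligned} \tag{59}$$

On the other hand note that for any $k = 0,1,\ldots$, it holds that

$$\begin{aligned}
A = \max_{\epsilon}\left(\sum_{t=1}^n \sup_{f\in\mathcal{F}}|f(\mathbf{x}_t)|^p\right)^{1/p} &\ge \max_{\epsilon}\left(\sum_{t=1}^n \sup_{f\in\mathcal{F}}|\mathbf{b}^{(k)}_t f(\mathbf{x}_t)|^p\right)^{1/p} \\
&\ge \max_{\epsilon}\left(|N_k(\epsilon)|\min_{t\in N_k(\epsilon)}\sup_{f\in\mathcal{F}}|f(\mathbf{x}_t)|^p\right)^{1/p} \\
&\ge \left(\max_{\epsilon}|N_k(\epsilon)|\right)^{1/p}2^{-(k+1)/p}A.
\end{aligned}$$

Hence, $\max_{\epsilon}|N_k(\epsilon)| \le 2^{k+1}$. Using this in Eq. (59) we conclude that:

$$\begin{aligned}
\mathbb{E}_{\epsilon}\left[\sup_{f\in\mathcal{F}}\sum_{t=1}^n \epsilon_t f(\mathbf{x}_t)\right] &\le DA\sum_{k\ge 0}2^{(k+1)/r - k/p} \le 2DA\sum_{k\ge 0}2^{k/r - k/p} \le 2DA\sum_{k\ge 0}2^{-k(r-p)/rp} \\
&\le \frac{2DA}{1 - 2^{-(r-p)/rp}} = \frac{2D}{1 - 2^{-(r-p)/rp}}\max_{\epsilon}\left(\sum_{t=1}^n \sup_{f\in\mathcal{F}}|f(\mathbf{x}_t)|^p\right)^{1/p}.
\end{aligned}$$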

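Corollary 24. Let $\mathcal{F} \subseteq \mathbb{R}^{\mathcal{X}}$ and $r \in (1,2]$. If (18) holds with constant $D/4$, then

$$P\left(\left|\sup_{f\in\mathcal{F}}\sum_{t=1}^n \epsilon_t f(\mathbf{x}_t)\right| > 2C_{r,p}\max_{\epsilon}\left(\sum_{t=1}^n \sup_{f\in\mathcal{F}}|f(\mathbf{x}_t)|^p\right)^{1/p} + u\right) \le \exp\left\{-\frac{u^2}{4\max_{\epsilon}\sum_{t=1}^n \sup_{f\in\mathcal{F}}f(\mathbf{x}_t(\epsilon))^2}\right\} \tag{60}$$

for any $1 \le p < r$ and $C_{r,p} \triangleq D\left(1 - 2^{-(r-p)/rp}\right)^{-1}$.

Proof of Corollary 24. First, if (18) holds for $\mathcal{F}$ with a constant $D/4$, then it holds for the class $\mathcal{F}\cup-\mathcal{F}$ with a constant $D/2$. It is immediate that

$$B(\mathbf{x}) = C_{r,p}\max_{\epsilon}\left(\sum_{t=1}^n \sup_{f\in\mathcal{F}}|f(\mathbf{x}_t)|^p\right)^{1/p}$$

satisfies (30). We now apply Lemma 23 to the class $\mathcal{F}\cup-\mathcal{F}$, which yields that (31) of Lemma 10 is satisfied for $\mathcal{F}\cup-\mathcal{F}$:

$$\mathbb{E}\left[\sup_{f\in\mathcal{F}\cup-\mathcal{F}}\sum_{t=1}^n \epsilon_t f(\mathbf{x}_t) - 2B(f;\mathbf{x})\right] \le 0. \tag{61}$$

The amplification argument of Lemma 11 yields the tail bound for the supremum over $\mathcal{F}\cup-\mathcal{F}$, and the statement is concluded by noting that, pointwise,

$$\left|\sup_{f\in\mathcal{F}}\sum_{t=1}^n \epsilon_t f(\mathbf{x}_t)\right| \le \sup_{f\in\mathcal{F}}\left|\sum_{t=1}^n \epsilon_t f(\mathbf{x}_t)\right| = \sup_{f\in\mathcal{F}\cup-\mathcal{F}}\sum_{t=1}^n \epsilon_t f(\mathbf{x}_t).$$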


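Proof of Lemma 15, Part 2. Given a predictable process $\mathbf{x}$ (with respect to the dyadic filtration), define a predictable process $\mathbf{b}$ by

$$\mathbf{b}_t = \begin{cases} 1 & \text{if } \left(\sum_{s=1}^t \sup_{f\in\mathcal{F}}|f(\mathbf{x}_s)|^p\right)^{1/p} \le a \\ 0 & \text{otherwise,} \end{cases}$$

for some $a > 0$, to be specified later. By definition,

$$\max_{\epsilon\in\{\pm 1\}^n}\left(\sum_{t=1}^n \sup_{f\in\mathcal{F}}|\mathbf{b}_t f(\mathbf{x}_t)|^p\right)^{1/p} \le a. \tag{62}$$

Define the event $E = \left\{\epsilon : \left(\sum_{t=1}^n \sup_{f\in\mathcal{F}}|f(\mathbf{x}_t(\epsilon))|^p\right)^{1/p} > a\right\}$ and note that for any $u > 0$,

$$P\left(\left|\sup_{f\in\mathcal{F}}\sum_{t=1}^n \epsilon_t f(\mathbf{x}_t)\right| > u\right) \le P(E) + P\left(E^c,\ \left|\sup_{f\in\mathcal{F}}\sum_{t=1}^n \epsilon_t f(\mathbf{x}_t)\right| > u\right). \tag{63}$$

Consider the second term above:

$$P\left(E^c,\ \left|\sup_{f\in\mathcal{F}}\sum_{t=1}^n \epsilon_t f(\mathbf{x}_t)\right| > u\right) = P\left(E^c,\ \left|\sup_{f\in\mathcal{F}}\sum_{t=1}^n \epsilon_t\mathbf{b}_t f(\mathbf{x}_t)\right| > u\right) \le P\left(\left|\sup_{f\in\mathcal{F}}\sum_{t=1}^n \epsilon_t\mathbf{b}_t f(\mathbf{x}_t)\right| > u\right). \tag{64}$$

Let $C_{r,p} = \frac{D}{1 - 2^{-(r-p)/rp}}$ and choose $a = \frac{u}{4C_{r,p}}$. Given the choice of $a$, the last quantity can be written as

$$P\left(\left|\sup_{f\in\mathcal{F}}\sum_{t=1}^n \epsilon_t\mathbf{b}_t f(\mathbf{x}_t)\right| - \frac{2D}{1 - 2^{-(r-p)/rp}}\,a > u/2\right). \tag{65}$$

We would like to apply the tail bound of (60). Observe that functions in $\mathcal{F}$ only appear in (60) through their values on the predictable process $\mathbf{x}$. Hence, we may apply (60) to the collection $\{(\mathbf{b}_t f(\mathbf{x}_t))_{t=1}^n : f \in \mathcal{F}\}$. In view of (62), the tail bound on (65) is

$$\exp\left\{-\frac{u^2}{16\max_{\epsilon}\sum_{t=1}^n \sup_{f\in\mathcal{F}}\mathbf{b}_t f^2(\mathbf{x}_t(\epsilon))}\right\}. \tag{66}$$

Now note that for any $p \le 2$ and $\mathcal{F} \subseteq [-1,1]^{\mathcal{X}}$,

$$\begin{aligned}
\max_{\epsilon}\left(\sum_{t=1}^n \sup_{f\in\mathcal{F}}\mathbf{b}_t f^2(\mathbf{x}_t)\right)^{1/2} &\le \min\left\{\max_{\epsilon}\left(\sum_{t=1}^n \sup_{f\in\mathcal{F}}|\mathbf{b}_t f(\mathbf{x}_t)|^p\right)^{1/p},\ \max_{\epsilon}\left(\sum_{t=1}^n \sup_{f\in\mathcal{F}}f^2(\mathbf{x}_t)\right)^{1/2}\right\} \\
&\le \min\left\{a,\ \max_{\epsilon}\left(\sum_{t=1}^n \sup_{f\in\mathcal{F}}f^2(\mathbf{x}_t)\right)^{1/2}\right\}.
\end{aligned}$$

Then the tail bound in (66) is

$$\exp\left\{-\frac{u^2}{16\min\left\{a^2,\ \max_{\epsilon}\left(\sum_{t=1}^n \sup_{f\in\mathcal{F}}f^2(\mathbf{x}_t)\right)\right\}}\right\} = \min\left\{\exp\{-C_{r,p}^2\},\ \exp\left\{-\frac{u^2}{16\max_{\epsilon}\left(\sum_{t=1}^n \sup_{f\in\mathcal{F}}f^2(\mathbf{x}_t)\right)}\right\}\right\}. \tag{67}$$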


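Combining with (63),

$$P\left(\left|\sup_{f\in\mathcal{F}}\sum_{t=1}^n \epsilon_t f(\mathbf{x}_t)\right| > u\right) \le P\left(\left(\sum_{t=1}^n \sup_{f\in\mathcal{F}}|f(\mathbf{x}_t(\epsilon))|^p\right)^{1/p} > \frac{u}{4C_{r,p}}\right) + \min\left\{\exp\{-C_{r,p}^2\},\ \exp\left\{-\frac{u^2}{16\max_{\epsilon}\left(\sum_{t=1}^n \sup_{f\in\mathcal{F}}f^2(\mathbf{x}_t)\right)}\right\}\right\}.$$

Now we have,

$$\mathbb{E}\left|\sup_{f\in\mathcal{F}}\sum_{t=1}^n \epsilon_t f(\mathbf{x}_t)\right| \le \int_0^{\infty}P\left(\sup_{f\in\mathcal{F}}\left|\sum_{t=1}^n \epsilon_t f(\mathbf{x}_t)\right| > u\right)du. \tag{68}$$

The integral is then controlled above by

$$4C_{r,p}\int_0^{\infty}P\left(\left(\sum_{t=1}^n \sup_{f\in\mathcal{F}}|f(\mathbf{x}_t(\epsilon))|^p\right)^{1/p} > x\right)dx + \int_0^{\infty}\min\left\{\exp\{-C_{r,p}^2\},\ \exp\left\{-\frac{u^2}{16\max_{\epsilon}\left(\sum_{t=1}^n \sup_{f\in\mathcal{F}}f^2(\mathbf{x}_t)\right)}\right\}\right\}du \tag{69}$$

$$\begin{aligned}
&= 4C_{r,p}\,\mathbb{E}_{\epsilon}\left[\left(\sum_{t=1}^n \sup_{f\in\mathcal{F}}|f(\mathbf{x}_t(\epsilon))|^p\right)^{1/p}\right] + \int_0^{4C_{r,p}\max_{\epsilon}\left(\sum_{t=1}^n \sup_{f\in\mathcal{F}}f^2(\mathbf{x}_t)\right)^{1/2}}e^{-C_{r,p}^2}\,du \\
&\qquad+ \int_{4C_{r,p}\max_{\epsilon}\left(\sum_{t=1}^n \sup_{f\in\mathcal{F}}f^2(\mathbf{x}_t)\right)^{1/2}}^{\infty}\exp\left\{-\frac{u^2}{16\max_{\epsilon}\left(\sum_{t=1}^n \sup_{f\in\mathcal{F}}f^2(\mathbf{x}_t)\right)}\right\}du
\end{aligned} \tag{70}$$

$$\le 4C_{r,p}\,\mathbb{E}_{\epsilon}\left[\left(\sum_{t=1}^n \sup_{f\in\mathcal{F}}|f(\mathbf{x}_t)|^p\right)^{1/p}\right] + 8C_{r,p}\max_{\epsilon}\left(\sum_{t=1}^n \sup_{f\in\mathcal{F}}f^2(\mathbf{x}_t)\right)^{1/2}\exp\{-C_{r,p}^2\}. \tag{71}$$

Trivially, $x/2 \le 2^x - 1 \le x$ for $x \in [0,1]$. Hence, we can upper and lower bound $C_{r,p}$ as

$$\frac{D\,rp\,2^{\frac{r-p}{rp}}}{r-p} \le C_{r,p} \le \frac{2D\,rp\,2^{\frac{r-p}{rp}}}{r-p}. \tag{72}$$

Since $1 < p < r \le 2$, we have

$$\frac{D}{r-p} \le C_{r,p} \le \frac{8D}{r-p}. \tag{73}$$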
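The two-sided bound (73) on $C_{r,p}$ can be checked on a grid of admissible exponents. The following is a small sketch with a hypothetical function name; it evaluates the closed form of $C_{r,p}$ directly and compares it with $D/(r-p)$ and $8D/(r-p)$ for a few pairs $1 < p < r \le 2$.

```python
def c_rp(d, r, p):
    """C_{r,p} = D / (1 - 2^{-(r-p)/(r*p)}) for 1 <= p < r."""
    return d / (1.0 - 2.0 ** (-(r - p) / (r * p)))

D = 1.0
for r in (1.2, 1.5, 2.0):
    for p in (1.05, 1.1, 1.4, 1.9):
        if p < r:
            print(r, p, D / (r - p), c_rp(D, r, p), 8 * D / (r - p))
```

In each printed row the middle value should lie between the outer two, so $C_{r,p}$ grows like $1/(r-p)$ as $p$ approaches $r$.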

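Hence, we conclude that

$$\begin{aligned}
\mathbb{E}\left|\sup_{f\in\mathcal{F}}\sum_{t=1}^n \epsilon_t f(\mathbf{x}_t)\right| &\le \frac{32D}{r-p}\,\mathbb{E}_{\epsilon}\left[\left(\sum_{t=1}^n \sup_{f\in\mathcal{F}}|f(\mathbf{x}_t)|^p\right)^{1/p}\right] + \frac{64D}{r-p}\max_{\epsilon}\left(\sum_{t=1}^n \sup_{f\in\mathcal{F}}f^2(\mathbf{x}_t)\right)^{1/2}\exp\left\{-\frac{D^2}{(r-p)^2}\right\} \\
&\le \frac{32D}{r-p}\,\mathbb{E}_{\epsilon}\left[\left(\sum_{t=1}^n \sup_{f\in\mathcal{F}}|f(\mathbf{x}_t)|^p\right)^{1/p}\right] + \frac{64D}{r-p}\sqrt{n}\exp\left\{-\frac{D^2}{(r-p)^2}\right\}.
\end{aligned}$$

Now say we set $\frac{1}{p} = \frac{1}{r} + \frac{1}{\log n}$, then in this case, $\frac{4}{\log n} \ge r - p > \frac{1}{\log n}$ and,

$$\mathbb{E}_{\epsilon}\left[\left(\sum_{t=1}^n \sup_{f\in\mathcal{F}}|f(\mathbf{x}_t)|^p\right)^{1/p}\right] \le n^{\frac{1}{p}-\frac{1}{r}}\,\mathbb{E}_{\epsilon}\left[\left(\sum_{t=1}^n \sup_{f\in\mathcal{F}}|f(\mathbf{x}_t)|^r\right)^{1/r}\right] \le 2\,\mathbb{E}_{\epsilon}\left[\left(\sum_{t=1}^n \sup_{f\in\mathcal{F}}|f(\mathbf{x}_t)|^r\right)^{1/r}\right]. \tag{74}$$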


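We conclude that

$$\mathbb{E}\left|\sup_{f\in\mathcal{F}}\sum_{t=1}^n \epsilon_t f(\mathbf{x}_t)\right| \le 32D\log n\ \mathbb{E}_{\epsilon}\left[\left(\sum_{t=1}^n \sup_{f\in\mathcal{F}}|f(\mathbf{x}_t)|^r\right)^{1/r}\right] + \frac{64D\sqrt{n}\,\log n}{n^{D^2\log n}}.$$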

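Lemma 25. Let $\mathcal{F} \subseteq \mathbb{R}^{\mathcal{X}}$ and $r \in (1,2]$. Under the growth assumption (18) with constant $D/2$, for any $p < r$, $u > 0$, and $C_{r,p} \triangleq D\left(1 - 2^{-(r-p)/rp}\right)^{-1}$,

$$P\left(\max_{s\le n}\left|\sup_{f\in\mathcal{F}}\sum_{t=1}^s \epsilon_t f(\mathbf{x}_t)\right| > u\right) \le 2\left(\frac{C_{r,p}^p\,\mathbb{E}\left[\sum_{t=1}^n \sup_{f\in\mathcal{F}}|f(\mathbf{x}_t)|^p\right]}{u^p}\right)^{\frac{1}{p+1}}. \tag{75}$$

Proof of Lemma 25. As in the proof of Lemma 15, given a predictable process $\mathbf{x}$, define a predictable process $\mathbf{b}$ by $\mathbf{b}_t = \mathbf{I}\left\{\left(\sum_{s=1}^t \sup_{f\in\mathcal{F}}|f(\mathbf{x}_s)|^p\right)^{1/p} \le a\right\}$ for some $a > 0$, to be specified later. Define $E = \left\{\epsilon : \left(\sum_{t=1}^n \sup_{f\in\mathcal{F}}|f(\mathbf{x}_t(\epsilon))|^p\right)^{1/p} > a\right\}$. For any $u > 0$,

$$P\left(\max_{s\le n}\left|\sup_{f\in\mathcal{F}}\sum_{t=1}^s \epsilon_t f(\mathbf{x}_t)\right| > u\right) \le P(E) + P\left(E^c,\ \max_{s\le n}\left|\sup_{f\in\mathcal{F}}\sum_{t=1}^s \epsilon_t f(\mathbf{x}_t)\right| > u\right). \tag{76}$$

The second term can be written as

$$P\left(E^c,\ \max_{s\le n}\left|\sup_{f\in\mathcal{F}}\sum_{t=1}^s \epsilon_t f(\mathbf{x}_t)\right| > u\right) = P\left(E^c,\ \max_{s\le n}\left|\sup_{f\in\mathcal{F}}\sum_{t=1}^s \epsilon_t\mathbf{b}_t f(\mathbf{x}_t)\right| > u\right) \le P\left(\max_{s\le n}\left|\sup_{f\in\mathcal{F}}\sum_{t=1}^s \epsilon_t\mathbf{b}_t f(\mathbf{x}_t)\right| > u\right)$$

and further upper bounded, using Lemma 23, by

$$P\left(\max_{s\le n}\left|\sup_{f\in\mathcal{F}}\sum_{t=1}^s \epsilon_t\mathbf{b}_t f(\mathbf{x}_t)\right| > u\right) \le \frac{\mathbb{E}\left[\left|\sup_{f\in\mathcal{F}}\sum_{t=1}^n \epsilon_t\mathbf{b}_t f(\mathbf{x}_t)\right|\right]}{u} \le \frac{C_{r,p}\,a}{u}. \tag{77}$$

The first inequality is implied by Doob’s maximal inequality since $s \mapsto \sup_{f\in\mathcal{F}}\sum_{t=1}^s \epsilon_t\mathbf{b}_t f(\mathbf{x}_t(\epsilon))$ is a discrete-time submartingale. Next,

$$P(E) = P\left(\sum_{t=1}^n \sup_{f\in\mathcal{F}}|f(\mathbf{x}_t)|^p > a^p\right) \le \frac{\mathbb{E}\left[\sum_{t=1}^n \sup_{f\in\mathcal{F}}|f(\mathbf{x}_t)|^p\right]}{a^p}.$$

Combining with (76), we have

$$P\left(\max_{s\le n}\left|\sup_{f\in\mathcal{F}}\sum_{t=1}^s \epsilon_t f(\mathbf{x}_t)\right| > u\right) \le \frac{\mathbb{E}\left[\sum_{t=1}^n \sup_{f\in\mathcal{F}}|f(\mathbf{x}_t)|^p\right]}{a^p} + \frac{C_{r,p}\,a}{u}. \tag{78}$$

Setting $a = \left(\frac{u\,\mathbb{E}\left[\sum_{t=1}^n \sup_{f\in\mathcal{F}}|f(\mathbf{x}_t)|^p\right]}{C_{r,p}}\right)^{1/(p+1)}$ yields the statement of the lemma.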
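The final step, plugging the balancing value of $a$ into (78), is a short computation that can be confirmed numerically. In the sketch below the function names and the numerical inputs are hypothetical; `e_sum` stands for $\mathbb{E}\left[\sum_t \sup_f |f(\mathbf{x}_t)|^p\right]$ and `c` for $C_{r,p}$.

```python
import math

def two_term_bound(a, e_sum, c, u, p):
    """Right-hand side of (78) as a function of the free parameter a."""
    return e_sum / a ** p + c * a / u

def balanced_a(e_sum, c, u, p):
    """The choice of a made at the end of the proof."""
    return (u * e_sum / c) ** (1.0 / (p + 1))

def lemma25_rhs(e_sum, c, u, p):
    """Right-hand side of (75)."""
    return 2 * (c ** p * e_sum / u ** p) ** (1.0 / (p + 1))

e_sum, c, u, p = 3.0, 2.0, 5.0, 1.5
a = balanced_a(e_sum, c, u, p)
print(two_term_bound(a, e_sum, c, u, p), lemma25_rhs(e_sum, c, u, p))
```

At this value of $a$ the two terms of (78) are equal, which is where the factor 2 in (75) comes from.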

Lemma 26. Let F ⊆ RX and r ∈ (1,2]. Under the growth assumption (18), for any p < r there

exists a constant Br,p <∞ such that for any u > 0,

P (∣supf∈F

n

∑t=1

ǫtf(xt)∣ > u) ≤ Br,pE [∑nt=1 supf∈F ∣f(xt)∣p]

up. (79)

30

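Proof of Lemma 26. Given a predictable $\mathcal{X}$-valued process $x = (x_1, \ldots, x_n)$ of length $n$, consider a predictable process $\tilde{x}$ of length $nN$ constructed by concatenating $N$ copies of $x$. Let $\epsilon^1, \ldots, \epsilon^N \in \{\pm 1\}^n$ be $N$ independent vectors with i.i.d. Rademacher coordinates, and let $\epsilon = (\epsilon^1, \ldots, \epsilon^N)$. Denote $Z(\epsilon^j) \triangleq \left|\sup_{f\in\mathcal{F}} \sum_{t=1}^{n} \epsilon^j_t f(x_t(\epsilon^j))\right|$. Now note that

$$\begin{aligned}
\max_{j\le N} Z(\epsilon^j)
&= \max_{j\le N} \left|\sup_{f\in\mathcal{F}} \left(\sum_{t=1}^{nj} \epsilon_t f(\tilde{x}_t(\epsilon)) - \sum_{t=1}^{n(j-1)} \epsilon_t f(\tilde{x}_t(\epsilon))\right)\right| \\
&\le \max_{j\le N} \left(\left|\sup_{f\in\mathcal{F}} \sum_{t=1}^{nj} \epsilon_t f(\tilde{x}_t)\right| + \left|\sup_{f\in\mathcal{F}} \sum_{t=1}^{n(j-1)} -\epsilon_t f(\tilde{x}_t)\right|\right) \\
&\le \max_{s\le nN} \left|\sup_{f\in\mathcal{F}} \sum_{t=1}^{s} \epsilon_t f(\tilde{x}_t)\right| + \max_{s\le nN} \left|\sup_{f\in\mathcal{F}} \sum_{t=1}^{s} -\epsilon_t f(\tilde{x}_t)\right| .
\end{aligned}$$

The two terms in the last bound have the same distribution, and so

$$\begin{aligned}
P\left(N^{-1/p} \max_{j\in[N]} Z(\epsilon^j) > u\right)
&\le 2\, P\left(\max_{k\in[nN]} \left|\sup_{f\in\mathcal{F}} \sum_{t=1}^{k} \epsilon_t f(\tilde{x}_t)\right| > u N^{1/p}/2\right) \\
&\le 4 \left(\frac{C_{r,p}^{p}\, \mathbb{E}_\epsilon\left[\sum_{t=1}^{nN} \sup_{f\in\mathcal{F}} |f(\tilde{x}_t(\epsilon))|^p\right]}{N u^p}\right)^{\frac{1}{p+1}} .
\end{aligned}$$

Since $\tilde{x}$ is a concatenation of $N$ copies of $x$, convexity of sup implies

$$\mathbb{E}_\epsilon\left[\sum_{t=1}^{nN} \sup_{f\in\mathcal{F}} |f(\tilde{x}_t)|^p\right] \le N\, \mathbb{E}_\epsilon\left[\sum_{t=1}^{n} \sup_{f\in\mathcal{F}} |f(x_t)|^p\right] .$$

By homogeneity of the lemma statement, we may assume $\mathbb{E}_\epsilon\left[\sum_{t=1}^{n} \sup_{f\in\mathcal{F}} |f(x_t)|^p\right] = 1$. With this scaling, we have proved

$$\sup_{N\ge 1} P\left(N^{-1/p} \max_{j\le N} Z(\epsilon^j) > u\right) \le 4 \left(\frac{C_{r,p}^{p}}{u^p}\right)^{\frac{1}{p+1}} .$$

Define $u_{r,p}$ to be the value of $u$ that makes the right-hand side equal to $1/2$. The reverse Hölder principle (see Proposition 8.53 in [19], originating in the work of D. L. Burkholder [7]) implies that there exists a constant $B_{r,p} < \infty$ such that

$$P\left(\left|\sup_{f\in\mathcal{F}} \sum_{t=1}^{n} \epsilon_t f(x_t)\right| > u\right) \le B_{r,p}\, u^{-p} .$$

The statement follows by homogeneity.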
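For concreteness, the threshold $u_{r,p}$ defined in the proof can be solved for in closed form. The following is only algebra on the bound $4\,(C_{r,p}^{p}/u^{p})^{1/(p+1)}$ under the normalization used in the proof; it does not reproduce the reverse Hölder principle itself.

```latex
% Solve 4 (C_{r,p}^p / u^p)^{1/(p+1)} = 1/2 for u.
\[
4\,\Big(\frac{C_{r,p}}{u}\Big)^{\frac{p}{p+1}} = \frac{1}{2}
\;\Longleftrightarrow\;
\Big(\frac{C_{r,p}}{u}\Big)^{\frac{p}{p+1}} = \frac{1}{8}
\;\Longleftrightarrow\;
u_{r,p} = 8^{\frac{p+1}{p}}\, C_{r,p} .
\]
```

In particular $u_{r,p}$ depends only on $r$ and $p$ (through $C_{r,p}$) and not on $N$, which is what allows the bound to be used uniformly over $N \ge 1$.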

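Proof of Lemma 15, Part 1. Define the predictable process $b$ as $b_t = \mathbb{I}\left\{\sum_{s=1}^{t} \sup_{f\in\mathcal{F}} |f(x_s)|^p \le u^p\right\}$ and the event $E = \left\{\epsilon : \left(\sum_{t=1}^{n} \sup_{f\in\mathcal{F}} |f(x_t(\epsilon))|^p\right)^{1/p} > u\right\}$. As in the proof of the other part of the lemma,

$$P\left(\left|\sup_{f\in\mathcal{F}} \sum_{t=1}^{n} \epsilon_t f(x_t)\right| > u\right) \le P(E) + P\left(\left|\sup_{f\in\mathcal{F}} \sum_{t=1}^{n} \epsilon_t b_t f(x_t)\right| > u\right) .$$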


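Under the growth assumption (18), applying Lemma 26, there exists a finite constant $B_{r,p}$ such that

$$P\left(\left|\sup_{f\in\mathcal{F}} \sum_{t=1}^{n} \epsilon_t b_t f(x_t)\right| > u\right) \le B_{r,p}\, \frac{\mathbb{E}\left[\sum_{t=1}^{n} \sup_{f\in\mathcal{F}} b_t |f(x_t)|^p\right]}{u^p} \le B_{r,p}\, \frac{\mathbb{E}\left[\min\left\{\sum_{t=1}^{n} \sup_{f\in\mathcal{F}} |f(x_t)|^p,\; u^p\right\}\right]}{u^p} .$$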

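The last inequality follows from the definition of $b_t$. We then have

$$\begin{aligned}
\mathbb{E}\left[\left|\sup_{f\in\mathcal{F}} \sum_{t=1}^{n} \epsilon_t f(x_t)\right|\right]
&= \int_0^\infty P\left(\left|\sup_{f\in\mathcal{F}} \sum_{t=1}^{n} \epsilon_t f(x_t)\right| > u\right) du \\
&\le \int_0^\infty P(E)\, du + B_{r,p} \int_0^\infty u^{-p}\, \mathbb{E}_\epsilon\left[\min\left(\sum_{t=1}^{n} \sup_{f\in\mathcal{F}} |f(x_t)|^p,\; u^p\right)\right] du .
\end{aligned}$$

Let us focus on the second integral. Exchanging the order of integration and splitting the integral into two parts,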

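$$\begin{aligned}
&\int_0^\infty u^{-p}\, \mathbb{E}_\epsilon\left[\min\left(\sum_{t=1}^{n} \sup_{f\in\mathcal{F}} |f(x_t)|^p,\; u^p\right)\right] du \\
&\quad= \mathbb{E}_\epsilon\left[\int_0^{\left(\sum_{t=1}^{n} \sup_{f\in\mathcal{F}} |f(x_t)|^p\right)^{1/p}} 1\, du\right] + \mathbb{E}_\epsilon\left[\int_{\left(\sum_{t=1}^{n} \sup_{f\in\mathcal{F}} |f(x_t)|^p\right)^{1/p}}^{\infty} \left(\sum_{t=1}^{n} \sup_{f\in\mathcal{F}} |f(x_t)|^p\right) u^{-p}\, du\right] \\
&\quad= \mathbb{E}_\epsilon\left[\left(\sum_{t=1}^{n} \sup_{f\in\mathcal{F}} |f(x_t)|^p\right)^{1/p}\right] + \frac{1}{p-1}\, \mathbb{E}_\epsilon\left[\left(\sum_{t=1}^{n} \sup_{f\in\mathcal{F}} |f(x_t)|^p\right) \left(\sum_{t=1}^{n} \sup_{f\in\mathcal{F}} |f(x_t)|^p\right)^{(1-p)/p}\right] \\
&\quad= \left(1 + \frac{1}{p-1}\right) \mathbb{E}_\epsilon\left[\left(\sum_{t=1}^{n} \sup_{f\in\mathcal{F}} |f(x_t)|^p\right)^{1/p}\right] .
\end{aligned}$$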
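The pointwise computation behind this identity may be easier to follow with a scalar in place of the random sum. In the sketch below, $A \ge 0$ is a hypothetical stand-in for $\sum_{t=1}^{n} \sup_{f\in\mathcal{F}} |f(x_t)|^p$ at a fixed $\epsilon$; it is not notation from the text.

```latex
% A >= 0 is a placeholder (assumption of this sketch) for the random sum at fixed epsilon.
% min(A, u^p) = u^p for u <= A^{1/p} and = A for u > A^{1/p}.
\[
\int_0^\infty u^{-p} \min(A, u^{p})\, du
 = \int_0^{A^{1/p}} 1\, du + A \int_{A^{1/p}}^{\infty} u^{-p}\, du
 = A^{1/p} + A \cdot \frac{\big(A^{1/p}\big)^{1-p}}{p-1}
 = \Big(1 + \frac{1}{p-1}\Big) A^{1/p}
 = \frac{p}{p-1}\, A^{1/p} .
\]
% The tail integral \int_{A^{1/p}}^\infty u^{-p} du is finite only when p > 1.
```

The tail integral converges only for $p > 1$, which the factor $1/(p-1)$ already presupposes. Taking $\mathbb{E}_\epsilon$ of both sides recovers the display.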

By the definition of $E$ we also have

$$\int_0^\infty P(E)\, du = \mathbb{E}\left[\left(\sum_{t=1}^{n} \sup_{f\in\mathcal{F}} |f(x_t)|^p\right)^{1/p}\right] .$$

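Combining,

$$\mathbb{E}\left[\left|\sup_{f\in\mathcal{F}} \sum_{t=1}^{n} \epsilon_t f(x_t)\right|\right] \le \left(1 + B_{r,p}\left(1 + \frac{1}{p-1}\right)\right) \mathbb{E}\left[\left(\sum_{t=1}^{n} \sup_{f\in\mathcal{F}} |f(x_t)|^p\right)^{1/p}\right] .$$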
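The final constant can be traced term by term. The sketch below uses the shorthand $V = \left(\sum_{t=1}^{n} \sup_{f\in\mathcal{F}} |f(x_t)|^p\right)^{1/p}$, which is introduced here for readability only and is not notation from the text.

```latex
% V is shorthand (assumption of this sketch) for ( \sum_t \sup_f |f(x_t)|^p )^{1/p}.
% First term: E = { V > u }, so the tail-integral formula for a nonnegative variable gives
\[
\int_0^\infty P(E)\, du = \int_0^\infty P(V > u)\, du = \mathbb{E}[V] .
\]
% Second term: B_{r,p} times the integral evaluated in the proof, (1 + 1/(p-1)) E[V].
\[
\mathbb{E}[V] + B_{r,p}\Big(1 + \frac{1}{p-1}\Big)\mathbb{E}[V]
 = \Big(1 + B_{r,p}\,\frac{p}{p-1}\Big)\,\mathbb{E}[V] .
\]
```

The constant $1 + B_{r,p}\,p/(p-1)$ is the same as the one stated, written more compactly; it grows without bound as $p \downarrow 1$.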

C Proofs of Theorem 17 and Theorem 18

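Proof of Theorem 17. We first prove the theorem for a class $\mathcal{G} \subseteq [-1/2, 1/2]^{\mathcal{Z}}$ and then rescale the final result, using the homogeneity of the bound. For the derived class $\mathcal{F} = \{f_g(z, z') = g(z) - g(z') : g \in \mathcal{G}\} \subseteq [-1,1]^{\mathcal{X}}$, where $\mathcal{X} = \mathcal{Z} \times \mathcal{Z}$, (18) holds with a constant $D/2$. Invoking (37) of Lemma 11, for any $\mathcal{X}$-valued predictable process $x$,

$$P\left(\sup_{f\in\mathcal{F}} \sum_{t=1}^{n} \epsilon_t f(x_t) - B(f; x_1, \ldots, x_n) > u\right) \le \exp(-2\alpha u)$$

with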

$$B(f; x_1, \ldots, x_n) = 32 D \log^2(n) \left(\sum_{t=1}^{n} \sup_{f\in\mathcal{F}} |f(x_t)|^r\right)^{1/r} + \alpha \sum_{t=1}^{n} \sup_{f\in\mathcal{F}} f(x_t)^2 + \phi_n$$

and φn = 64D√n logn

nD2 logn. Define

$$B(g; z_1, z'_1, \ldots, z_n, z'_n) = 32 D \log^2(n) \left(\sum_{t=1}^{n} \sup_{g\in\mathcal{G}} |g(z_t) - g(z'_t)|^r\right)^{1/r} + \alpha \sum_{t=1}^{n} \sup_{g\in\mathcal{G}} (g(z_t) - g(z'_t))^2 + \phi_n .$$

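Condition (21) of Corollary 8 is clearly satisfied for this function. Hence,

$$P\left(\sup_{g\in\mathcal{G}} \sum_{t=1}^{n} \left(g(Z_t) - \mathbb{E}_{t-1} g(Z_t)\right) - \mathbb{E}_{Z'_{1:n}} B(g; Z_1, Z'_1, \ldots, Z_n, Z'_n) > u\right) \le \exp\{1 - 2\alpha u\} .$$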

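We now apply Lemma 21 with $K = 0$, $a = 0$,

$$X(g) = \sum_{t=1}^{n} \left(g(Z_t) - \mathbb{E}_{t-1} g(Z_t)\right) - 32 D \log^2(n)\, \mathbb{E}_{Z'_{1:n}} \left(\sum_{t=1}^{n} \sup_{g\in\mathcal{G}} |g(Z_t) - g(Z'_t)|^r\right)^{1/r} - \phi_n ,$$

$$Y(g) = \mathbb{E}_{Z'_{1:n}} \sum_{t=1}^{n} \sup_{g\in\mathcal{G}} (g(Z_t) - g(Z'_t))^2 ,$$

yielding

$$\begin{aligned}
P\Bigg(\sup_{g\in\mathcal{G}} \sum_{t=1}^{n} \left(g(Z_t) - \mathbb{E}_{t-1} g(Z_t)\right) &- 128 D \log^2(n)\, \mathbb{E}_{Z'_{1:n}} \left(\sum_{t=1}^{n} \sup_{g\in\mathcal{G}} |g(Z_t) - g(Z'_t)|^r\right)^{1/r} - 4\phi_n \\
&- 4u \sqrt{\mathbb{E}_{Z'_{1:n}} \sum_{t=1}^{n} \sup_{g\in\mathcal{G}} (g(Z_t) - g(Z'_t))^2 + 1} > 0\Bigg) \le \log(n) \exp(1 - 2u^2) .
\end{aligned}$$

Using Jensen's inequality, we push the expectation inside the $1/r$ power, completing the proof.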
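The Jensen step can be spelled out as follows. This is a sketch of the standard argument, with $W$ a shorthand introduced here for the random sum, not notation from the text.

```latex
% W is shorthand (assumption of this sketch) for \sum_t \sup_g |g(Z_t) - g(Z'_t)|^r >= 0.
% For r >= 1 the map w -> w^{1/r} is concave on [0, infinity), so Jensen's inequality gives
\[
\mathbb{E}_{Z'_{1:n}}\big[ W^{1/r} \big] \;\le\; \big( \mathbb{E}_{Z'_{1:n}}[W] \big)^{1/r} .
\]
% This quantity enters the event with a negative sign, so replacing it by the larger
% right-hand side can only shrink the event:
\[
\Big\{ \cdots - 128 D \log^2(n) \big(\mathbb{E}_{Z'_{1:n}}[W]\big)^{1/r} - \cdots > 0 \Big\}
\;\subseteq\;
\Big\{ \cdots - 128 D \log^2(n)\, \mathbb{E}_{Z'_{1:n}}\big[W^{1/r}\big] - \cdots > 0 \Big\} .
\]
```

Because the event only gets smaller, the probability bound $\log(n)\exp(1 - 2u^2)$ continues to hold with the expectation moved inside the power.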

Proof of Theorem 18. Define

$$B(g; z_1, z'_1, \ldots, z_n, z'_n) = \frac{\alpha}{2} \sum_{t=1}^{n} (g(z_t) - g(z'_t))^2 + C \alpha^{-\frac{2-q}{2+q}}\, n^{\frac{q}{2+q}} .$$

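Condition (21) of Corollary 8 is satisfied and, therefore,

$$P\left(\sup_{g\in\mathcal{G}} \sum_{t=1}^{n} \left(g(Z_t) - \mathbb{E}_{t-1} g(Z_t)\right) - \mathbb{E}_{Z'_{1:n}} B(g; Z_1, Z'_1, \ldots, Z_n, Z'_n) > u\right) \le \exp\{1 - \alpha u\} .$$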


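To invoke Lemma 21, we choose $a = \frac{2-q}{2+q}$, $K = C n^{\frac{q}{2+q}}$,

$$X(g) = \sum_{t=1}^{n} \left(g(Z_t) - \mathbb{E}_{t-1} g(Z_t)\right), \qquad Y(g) = \frac{1}{2}\, \mathbb{E}_{Z'_{1:n}} \sum_{t=1}^{n} (g(Z_t) - g(Z'_t))^2 .$$

Writing $\frac{1}{1+a} = \frac{2+q}{4}$, we conclude that with probability at least $1 - e \log(n) \exp\{-u^2\}$,

$$\begin{aligned}
\sup_{g\in\mathcal{G}} \sum_{t=1}^{n} \left(g(Z_t) - \mathbb{E}_{t-1} g(Z_t)\right) \le{}& C' n^{\frac{q}{4}} \left(\frac{1}{2}\, \mathbb{E}_{Z'_{1:n}} \sum_{t=1}^{n} (g(Z_t) - g(Z'_t))^2 + 1\right)^{\frac{2-q}{4}} && (80) \\
&+ 4u \sqrt{\frac{1}{2}\, \mathbb{E}_{Z'_{1:n}} \sum_{t=1}^{n} (g(Z_t) - g(Z'_t))^2 + 1} . && (81)
\end{aligned}$$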
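The exponents in (80) can be checked directly from the choice of $a$ and $K$. The sketch below verifies only the arithmetic; how Lemma 21 combines $K$, $a$, $X$ and $Y$ is taken from its statement, which is not restated here.

```latex
% Arithmetic check for a = (2-q)/(2+q) and K = C n^{q/(2+q)}.
\[
1 + a = \frac{(2+q) + (2-q)}{2+q} = \frac{4}{2+q},
\qquad
\frac{1}{1+a} = \frac{2+q}{4},
\qquad
\frac{a}{1+a} = \frac{2-q}{2+q}\cdot\frac{2+q}{4} = \frac{2-q}{4} .
\]
\[
K^{\frac{1}{1+a}} = \Big(C\, n^{\frac{q}{2+q}}\Big)^{\frac{2+q}{4}} = C^{\frac{2+q}{4}}\, n^{\frac{q}{4}} .
\]
```

These agree with the power $n^{q/4}$ and the exponent $\frac{2-q}{4}$ in (80), with the factor $C^{(2+q)/4}$ presumably absorbed into $C'$.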

References

[1] J. Abernethy, A. Agarwal, P. L. Bartlett, and A. Rakhlin. A stochastic view of optimal regret through minimax duality. In COLT '09, 2009.

[2] B. Acciaio, M. Beiglböck, F. Penkner, W. Schachermayer, and J. Temme. A trajectorial interpretation of Doob's martingale inequalities. Ann. Appl. Probab., 23(4):1494–1505, August 2013.

[3] M. Beiglböck and M. Nutz. Martingale inequalities and deterministic counterparts. Electron. J. Probab., 19(95):1–15, 2014.

[4] M. Beiglböck and P. Siorpaes. Pathwise versions of the Burkholder–Davis–Gundy inequality. Bernoulli, 21(1):360–373, 2015.

[5] S. Boucheron, G. Lugosi, and P. Massart. Concentration inequalities: A nonasymptotic theory of independence. Oxford University Press, 2013.

[6] D. Burkholder. The best constant in the Davis inequality for the expectation of the martingale square function. Transactions of the American Mathematical Society, 354(1):91–105, 2002.

[7] D. L. Burkholder. Maximal inequalities as necessary conditions for almost everywhere convergence. Probability Theory and Related Fields, 3(1):75–88, 1964.

[8] V. H. de la Peña, M. J. Klass, and T. L. Lai. Pseudo-maximization and self-normalized processes. Probability Surveys, 4:172–192, 2007.

[9] D. Foster, A. Rakhlin, and K. Sridharan. Adaptive online learning, 2015. In submission.

[10] D. A. Freedman. On tail probabilities for martingales. The Annals of Probability, pages 100–118, 1975.

[11] E. Giné and V. Koltchinskii. Concentration inequalities and asymptotic results for ratio type empirical processes. The Annals of Probability, pages 1143–1216, 2006.


[12] E. Giné and J. Zinn. Some limit theorems for empirical processes. Annals of Probability, 12(4):929–989, 1984.

[13] A. Gushchin. On pathwise counterparts of Doob's maximal inequalities. Proceedings of the Steklov Institute of Mathematics, 1(287):118–121, 2014.

[14] P. Massart and R. Rossignol. Around Nemirovski's inequality. In From Probability to Statistics and Back: High-Dimensional Models and Processes–A Festschrift in Honor of Jon A. Wellner, pages 254–265. Institute of Mathematical Statistics, 2013.

[15] D. Panchenko. Symmetrization approach to concentration inequalities for empirical processes. Annals of Probability, 31(4):2068–2081, 2003.

[16] V. H. de la Peña, T. L. Lai, and Q.-M. Shao. Self-normalized processes: Limit theory and statistical applications. Springer, 2008.

[17] I. Pinelis. Optimum bounds for the distributions of martingales in Banach spaces. The Annals of Probability, 22(4):1679–1706, 1994.

[18] G. Pisier. Martingales with values in uniformly convex spaces. Israel Journal of Mathematics, 20:326–350, 1975.

[19] G. Pisier. Martingales in Banach spaces (in connection with type and cotype). Course Notes (IHP), February 2011.

[20] A. Rakhlin and K. Sridharan. Statistical learning theory and sequential prediction, 2012. Available at http://stat.wharton.upenn.edu/~rakhlin/courses/stat928/stat928_notes.pdf.

[21] A. Rakhlin and K. Sridharan. Optimization, learning, and games with predictable sequences. In Advances in Neural Information Processing Systems, 2013.

[22] A. Rakhlin and K. Sridharan. Online nonparametric regression. In The 27th Annual Conference on Learning Theory (COLT), 2014.

[23] A. Rakhlin, K. Sridharan, and A. Tewari. Online learning: Random averages, combinatorial parameters, and learnability. Advances in Neural Information Processing Systems 23, pages 1984–1992, 2010.

[24] A. Rakhlin, K. Sridharan, and A. Tewari. Online learning via sequential complexities. Journal of Machine Learning Research, 2014.

[25] A. Rakhlin, K. Sridharan, and A. Tewari. Sequential complexities and uniform martingale laws of large numbers. Probability Theory and Related Fields, February 2014.

[26] A. W. Van Der Vaart and J. A. Wellner. Weak Convergence and Empirical Processes: With Applications to Statistics. Springer Series, March 1996.
