
Fundamental Limits of Online and Distributed Algorithms for Statistical Learning and Estimation

Ohad Shamir
Weizmann Institute of Science
[email protected]

arXiv:1311.3494v3 [cs.LG] 6 Feb 2014

Abstract

Many machine learning approaches are characterized by information constraints on how they interact with the training data. These include memory and sequential access constraints (e.g. fast first-order methods to solve stochastic optimization problems); communication constraints (e.g. distributed learning); partial access to the underlying data (e.g. missing features and multi-armed bandits); and more. However, we currently have little understanding of how such information constraints fundamentally affect performance, independently of the semantics of the learning problem. For example, are there learning problems where any algorithm which has a small memory footprint (or can use only a bounded number of bits from each example, or has certain communication constraints) will perform worse than what is possible without such constraints? In this paper, we describe how a single set of results implies positive answers to the above, for a variety of settings.

1 Introduction

Information constraints play a key role in machine learning. Of course, the main constraint is the availability of only a finite data set, from which the learner is expected to generalize. However, many problems currently researched in machine learning can be characterized as learning with additional information constraints, arising from the manner in which the learner may interact with the data. Some examples include:

• Communication constraints in distributed learning: There has been much work in recent years on learning when the training data is distributed among several machines (with [13, 2, 27, 42, 27, 30, 8, 16, 34] being just a few examples). Since the machines may work in parallel, this potentially allows significant computational speed-ups and the ability to cope with large datasets. On the flip side, communication rates between machines are typically much slower than their processing speeds, and a major challenge is to perform these learning tasks with minimal communication.


• Memory constraints: The standard implementation of many common learning tasks requires memory which is super-linear in the data dimension. For example, principal component analysis (PCA) requires us to estimate eigenvectors of the data covariance matrix, whose size is quadratic in the data dimension and can be prohibitive for high-dimensional data. Another example is kernel learning, which requires manipulation of the Gram matrix, whose size is quadratic in the number of data points. There has been considerable effort in developing and analyzing algorithms for such problems with a reduced memory footprint (e.g. [38, 4, 9, 48, 44]).

• Online learning constraints: The need for fast and scalable learning algorithms has popularised the use of online algorithms, which work by sequentially going over the training data and incrementally updating a (usually small) state vector. Well-known special cases include gradient descent and mirror descent algorithms (see e.g. [45, 46]). The requirement of sequentially passing over the data can be seen as a type of information constraint, whereas the small state these algorithms often maintain can be seen as another type of memory constraint.

• Partial-information constraints: A common situation in machine learning is when the available data is corrupted, sanitized (e.g. due to privacy constraints), has missing features, or is otherwise partially accessible. There has also been considerable interest in online learning with partial information, where the learner only gets partial feedback on its performance. This has been used to model various problems in web advertising, routing and multiclass learning. Perhaps the most well-known case is the multi-armed bandits problem [17, 7, 6], with many other variants being developed, such as contextual bandits [35, 36], combinatorial bandits [20], and more general models such as partial monitoring [17, 12].

Although these examples come from very different domains, they all share the common feature of information constraints on how the learning algorithm can interact with the training data. In some specific cases (most notably multi-armed bandits, and also in the context of certain distributed protocols [8, 50]) we can even formalize the price we pay for these constraints, in terms of degraded sample complexity or regret guarantees. However, we currently lack a general information-theoretic framework, which directly quantifies how such constraints can impact performance. For example, are there cases where any online algorithm, which goes over the data one-by-one, must have a worse sample complexity than (say) empirical risk minimization? Are there situations where a small memory footprint provably degrades the learning performance? Can one quantify how a constraint of getting only a few bits from each example affects our ability to learn? To the best of our knowledge, there are currently no generic tools which allow us to answer such questions.

In this paper, we make a first step in developing such a framework. We consider a general class of learning processes, characterized only by information-theoretic constraints on how they may interact with the data (and independent of any specific learning problem semantics). As special cases, these include online algorithms with memory constraints, certain types of distributed algorithms, as well as learning with partial information. We identify cases where any such algorithm must be worse than what can be attained without such information constraints.

Armed with these generic tools, we establish several new results for specific learning problems:

• We prove that for some learning and estimation problems - in particular, sparse PCA and sparse covariance estimation in R^d - no online algorithm can attain statistically optimal performance (in terms of sample complexity) with less than Ω(d^2) memory. To the best of our knowledge, this is the first formal example of a memory/sample complexity trade-off.

• We show that for similar types of problems, there are cases where no distributed algorithm (which is based on a non-interactive or serial protocol on i.i.d. data) can attain optimal performance with less than Ω(d^2) communication per machine. To the best of our knowledge, this is the first formal example of a communication/sample complexity trade-off, in the regime where the communication budget is larger than the data dimension, and the examples at each machine come from the same underlying distribution.

• We prove a ‘Heisenberg-like’ uncertainty principle, which generalizes existing lower bounds for online learning with partial information. In the context of learning with expert advice, it implies that the number of rounds T, times the number of bits b extracted from each loss vector, must be larger than the dimension d in order to get non-trivial regret. This holds no matter what these b bits are – a single coordinate (as in multi-armed bandits), some information on several coordinates (as in semi-bandit feedback), a linear projection (as in bandit optimization), some feedback signal from a restricted set (as in partial monitoring), etc. Remarkably, it holds even if the online learner is allowed to adaptively choose which bits of the loss vector it can retain at each round.

• We demonstrate the existence of simple (toy) stochastic optimization problems where any algorithm which uses memory linear in the dimension (e.g. stochastic gradient descent or mirror descent) cannot be statistically optimal.

Related Work

In stochastic optimization, there has been much work on lower bounds for sequential algorithms, starting from the seminal work of [40], and including more recent works such as [1]. [43] also consider such lower bounds from a more general information-theoretic perspective. However, these results all hold in an oracle model, where data is assumed to be made available in a specific form (such as a stochastic gradient estimate). As already pointed out in [40], this does not directly translate to the more common setting, where we are given a dataset and wish to run a simple sequential optimization procedure. Indeed, recent works exploited this gap to get improved algorithms using more sophisticated oracles, such as the availability of prox-mappings [41]. Moreover, we are not aware of cases where these lower bounds indicate a gap between the attainable performance of any sequential algorithm and batch learning methods (such as empirical risk minimization).

In the context of distributed learning and statistical estimation, information-theoretic lower bounds have been recently shown in the pioneering work [50]. Assuming communication budget constraints on different machines, the paper identifies cases where these constraints affect statistical performance. Our results (in the context of distributed learning) are very similar in spirit, but there are two important differences. First, their lower bounds pertain to parametric estimation in R^d, and are non-trivial when the budget size per machine is much smaller than d. However, for most natural applications, sending O(d) bits is rather cheap, since one needs O(d) bits anyway just to communicate the algorithm's output. In contrast, our basic results pertain to simpler detection problems, and lead to non-trivial lower bounds in the natural regime where the budget size is the same or even much larger than the data dimension (e.g. as much as quadratic). The second difference is that their work focuses mostly on non-interactive distributed protocols, while we address a more general class, which allows some interaction and also includes information-constrained settings beyond distributed learning. Strong lower bounds in the context of distributed learning have also been shown in [8], but they do not apply to a regime where examples across machines come from the same distribution, and where the communication budget is much larger than what is needed to specify the output.

As mentioned earlier, there are well-known lower bounds for multi-armed bandit problems and other online learning settings with partial information. However, they are specific to the partial-information feedback considered. For example, the standard multi-armed bandit lower bound [7] pertains to a setting where we can view a single coordinate of the loss vector, but doesn't apply as-is when we can view more than one coordinate (as in semi-bandit feedback [33, 5], or bandits with side-information [37]), receive a linear projection (as in bandit linear optimization), or receive a different type of partial feedback (such as in partial monitoring [19]). In contrast, our results are generic and can directly apply to any such setting.

The inherent limitations of streaming and distributed algorithms, including memory and communication constraints, have been extensively studied within theoretical computer science (e.g. [3, 10, 22, 39, 11]). This research provides quite powerful and general results, which also apply to algorithms beyond those considered here (such as fully interactive distributed algorithms). Unfortunately, almost all these results consider tasks unrelated to learning, and/or adversarially generated data, and thus do not apply to statistical learning tasks, where the data is assumed to be drawn i.i.d. from some underlying distribution. [49, 26] do consider i.i.d. data, but they focus on problems such as detecting graph connectivity and counting distinct elements, and not learning problems such as those considered here. On the flip side, there are works on memory-efficient algorithms with formal guarantees for statistical problems (e.g. [38, 9, 32, 23]), but these do not consider lower bounds or provable trade-offs.


2 Information-Constrained Protocols

We begin with a few words about notation. We use bold-face letters (e.g. x) to denote vectors, with x_j denoting the j-th coordinate. In particular, we let e_j ∈ R^d denote the j-th standard basis vector. We use the standard asymptotic notation O(·), Ω(·) to hide constants, and $\tilde{O}(\cdot)$, $\tilde{\Omega}(\cdot)$ to hide logarithmic factors. log(·) refers to the natural logarithm, and log_2(·) to the base-2 logarithm.

We begin by defining the following class of algorithms, corresponding to online learning algorithms with bounded memory:

Definition 1 (b-Memory Online Protocols). A given algorithm is a b-memory online protocol on m i.i.d. instances if it has the following form, for arbitrary functions $\{f_t\}_{t=1}^{m}, f$ returning an output of at most b bits:

• For t = 1, …, m:

  – Let X^t be the t-th instance

  – Compute message W^t = f_t(X^t, W^{t−1})

• Return W = f(W^m)

Note that the functions $\{f_t\}_{t=1}^{m}, f$ are completely arbitrary, may depend on m, and can also be randomized. The crucial assumption is that their outputs W^t, W are constrained to be only b bits. Intuitively, b-memory online protocols maintain a state vector of size b, which is incrementally updated after each instance is received. We note that a majority of online learning and stochastic optimization algorithms have bounded memory. For example, for linear predictors, most single-pass gradient-based algorithms maintain a state whose size is proportional to the size of the parameter vector that is being optimized.
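To make the protocol template in Definition 1 concrete, the following is a minimal Python sketch of the interface (not taken from the paper): the functions `update` and `finalize`, playing the roles of f_t and f, are hypothetical placeholders, and the only structural requirement enforced is that the state passed between rounds fits in b bits.

```python
# Illustrative sketch of a b-memory online protocol (Definition 1).
# `update` (the role of f_t) and `finalize` (the role of f) are hypothetical
# placeholders; the structural constraint is that the state W^t fits in b bits.

def run_b_memory_protocol(instances, b, update, finalize):
    """Run an online protocol whose state is constrained to b bits."""
    state = 0  # W^0: an integer in {0, ..., 2^b - 1}
    for t, x in enumerate(instances, start=1):
        state = update(t, x, state)
        assert 0 <= state < 2 ** b, "state must be representable with b bits"
    return finalize(state)  # W = f(W^m)


# Example instantiation: keep a (crudely capped) running count of +1's in the
# first coordinate, using b = 8 bits of state.
def update(t, x, state):
    return min(state + (x[0] == 1), 2 ** 8 - 1)

def finalize(state):
    return state

if __name__ == "__main__":
    data = [(1, -1), (1, 1), (-1, 1)]  # three toy instances in {-1,+1}^2
    print(run_b_memory_protocol(data, b=8, update=update, finalize=finalize))
```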

We will also define and prove results for the following, much more general, class of information-constrained algorithms:

Definition 2 ((b, n, m) Protocol). Given access to a sequence of mn i.i.d. instances, an algorithm is a (b, n, m) protocol if it has the following form, for arbitrary functions f_t returning an output of at most b bits, and an arbitrary function f:

• For t = 1, …, m:

  – Let X^t be a fresh batch of n i.i.d. instances

  – Compute message W^t = f_t(X^t, W^1, W^2, …, W^{t−1})

• Return W = f(W^1, …, W^m)

As in b-memory online protocols, we constrain each W^t to consist of at most b bits. However, unlike b-memory online protocols, here each W^t may depend not just on W^{t−1}, but on the entire sequence of previous messages W^1, …, W^{t−1}. b-memory online protocols are a special case when n = 1.

Examples of (b, n, m) protocols, beyond b-memory online protocols, include:


• Non-interactive and serial distributed algorithms: There are m machines and each machine receives an independent sample X^t of size n. It then sends a message W^t = f_t(X^t) (which here depends only on X^t). A centralized server then combines the messages to compute an output f(W^1, …, W^m). This includes, for instance, divide-and-conquer style algorithms proposed for distributed stochastic optimization (e.g. [51, 52]). A serial variant of the above is when there are m machines, and one-by-one, each machine t broadcasts some information W^t to the other machines, which depends on X^t as well as previous messages sent by machines 1, 2, …, (t − 1).

• Online learning with partial information: We sequentially receive d-dimensional examples/loss vectors, and from each of these we can extract and use only b bits of information, where b ≪ d. For example, this includes most types of bandit problems.

• Mini-batch online learning algorithms: The data is streamed one-by-one or in mini-batches of size n, with mn instances overall. An algorithm sequentially updates its state based on a b-dimensional vector extracted from each example/batch (such as a gradient or gradient average), and returns a final result after all data is processed. This includes most gradient-based algorithms we are aware of, but also distributed versions of these algorithms (such as parallelizing a mini-batch processing step as in [28, 24]).

We note that our results can be generalized to allow the size of the messages W^t to vary across t, and even to be chosen in a data-dependent manner.
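As an illustration of the first item above, here is a small Python sketch (our own, not from the paper) of a non-interactive distributed algorithm cast as a (b, n, m) protocol: each of m machines compresses its local batch of n instances into a b-bit message, and a server combines the messages. The particular compression rule (one sign bit per tracked coordinate) is just a placeholder.

```python
# Sketch of a (b, n, m) protocol realized as a non-interactive distributed
# algorithm: machine t sees only its batch X^t of n instances and sends a
# b-bit message W^t; a server computes the output from W^1..W^m. The
# compression below (one sign bit per coordinate) is a hypothetical example.

from typing import List, Sequence


def machine_message(batch: Sequence[Sequence[float]], b: int) -> int:
    """Compress a batch into b bits: the signs of the first b coordinate-wise
    averages, packed into an integer."""
    bits = 0
    for j in range(b):
        avg = sum(x[j] for x in batch) / len(batch)
        bits |= (1 if avg > 0 else 0) << j
    return bits


def server_combine(messages: List[int], b: int) -> int:
    """Output the coordinate whose sign bit was set most often."""
    votes = [sum((w >> j) & 1 for w in messages) for j in range(b)]
    return max(range(b), key=lambda j: votes[j])


# Usage: m = 3 machines, each with n = 2 instances in R^4, budget b = 4 bits.
batches = [
    [[0.9, -0.1, -0.2, 0.0], [0.8, 0.1, -0.3, -0.1]],
    [[0.7, -0.2, 0.1, -0.4], [1.0, 0.0, -0.1, 0.2]],
    [[0.6, 0.3, -0.5, -0.2], [0.9, -0.4, 0.2, -0.3]],
]
messages = [machine_message(batch, b=4) for batch in batches]
print(server_combine(messages, b=4))
```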

In our work, we contrast the performance attainable by any algorithm corresponding to such protocols with that of constraint-free protocols, which are allowed to interact with the sampled instances in any manner.

3 Basic Results

All our results are based on simple ‘hide-and-seek’ statistical estimation problems, for which we show a strong gap between the attainable performance of information-constrained protocols and constraint-free protocols.

We will consider two variants of this problem, with different applications. Our first problem, parameterized by a dimension d, bias ρ, and sample size mn, is defined as follows:

Definition 3 (Hide-and-seek Problem 1). Consider the set of distributions $\{\Pr_j(\cdot)\}_{j=1}^{d}$ over $\{-1,1\}^d$, defined as
\[
\Pr\nolimits_j(\mathbf{x}) \;=\; 2^{-d+1}\left(\tfrac{1}{2} + \rho x_j\right).
\]
Given an i.i.d. sample of mn instances generated from $\Pr_j(\cdot)$, where j is unknown, detect j.

In words, Pr_j(·) corresponds to picking all coordinates other than j to be ±1 uniformly at random, and independently picking coordinate j to be +1 with probability (1/2 + ρ), and −1 with probability (1/2 − ρ). It is easily verified that this creates instances with zero-mean coordinates, except coordinate j whose expectation is 2ρ. The goal is to detect j based on an empirical sample.
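For concreteness, the following short Python sketch (ours, with arbitrary toy parameter values) samples instances from Pr_j of hide-and-seek problem 1 and runs the unconstrained detector used in the theorem below: return the coordinate with the highest empirical average.

```python
# Sampling from Pr_j of hide-and-seek problem 1, and the unconstrained
# detector (coordinate with the highest empirical average). The parameter
# values below are arbitrary choices for illustration.

import random


def sample_instance(d: int, j: int, rho: float) -> list:
    """One instance in {-1,+1}^d: coordinate j is +1 w.p. 1/2 + rho,
    all other coordinates are +-1 uniformly and independently."""
    x = [1 if random.random() < 0.5 else -1 for _ in range(d)]
    x[j] = 1 if random.random() < 0.5 + rho else -1
    return x


def detect_by_empirical_average(sample: list) -> int:
    """Return the index of the coordinate with the largest empirical mean."""
    d = len(sample[0])
    means = [sum(x[i] for x in sample) / len(sample) for i in range(d)]
    return max(range(d), key=lambda i: means[i])


random.seed(0)
d, j, rho, n_samples = 50, 7, 0.25, 200
sample = [sample_instance(d, j, rho) for _ in range(n_samples)]
print(detect_by_empirical_average(sample))  # recovers j = 7 with high probability
```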


We now present our first main result, which shows that for this hide-and-seek problem, there is a large regime where detecting j is information-theoretically possible, but any (b, n, m) protocol will fail to do so with high probability.

Theorem 1. Consider hide-and-seek problem 1 on d > 1 coordinates, with some bias ρ ≤ 1/4n. Then for any estimate $\tilde{J}$ of the biased coordinate returned by any (b, n, m) protocol, there exists some coordinate j such that
\[
\Pr\nolimits_j(\tilde{J}=j) \;\leq\; \frac{3}{d} + 8\sqrt{\frac{mb}{d}}.
\]
However, given mn samples, if $\tilde{J}$ is the coordinate with the highest empirical average, then
\[
\Pr\nolimits_j(\tilde{J}=j) \;\geq\; 1 - 2d\exp\left(-\frac{1}{2}mn\rho^2\right). \tag{1}
\]

To give a concrete example, let us see how this theorem applies in the special case of b-memory online protocols, in which case we can take n = 1. In particular, suppose we choose the bias ρ of coordinate j to be 1/4. This makes the hide-and-seek problem extremely easy without information constraints: one coordinate j will be +1 with probability 3/4, while all the other coordinates will be +1 with probability only 1/2. Thus, given only O(log(d)) i.i.d. instances, we can easily determine j with arbitrarily high probability, just by finding the coordinate with the highest empirical average. However, the theorem implies that any b-memory online protocol, where b ≪ d, will fail with high probability to detect j – at least as long as the sample size m is much smaller than d/b. When m ≥ d/b, our lower bound becomes vacuous. However, this is not just an artifact of the analysis: if (for instance) m ≥ Ω(d log(d)), it is possible to detect j with bounded memory¹.

¹The idea is to split the instance sequence into d segments of length O(log(d)) each, and use each segment i to determine if coordinate i is the biased coordinate with high probability.
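The footnote's segment-splitting idea can be sketched in Python as follows (a rough illustration under our own parameter choices, not an optimized or analyzed procedure): the stream is split into d segments, and segment i is used only to test whether coordinate i looks biased, so the algorithm never tracks more than a couple of counters at a time.

```python
# Sketch of the footnote's idea: with m on the order of d*log(d) instances,
# split the stream into d segments and use segment i only to test coordinate i.
# Thresholds and segment lengths below are illustrative choices, not tuned.

import math
import random


def detect_with_small_memory(stream, d, segment_len):
    """stream yields instances in {-1,+1}^d; returns the detected coordinate."""
    best_coord, best_mean = 0, float("-inf")
    for i in range(d):  # segment i tests coordinate i only
        total = 0
        for _ in range(segment_len):
            x = next(stream)
            total += x[i]
        mean_i = total / segment_len
        if mean_i > best_mean:  # keep only the current best (small state)
            best_coord, best_mean = i, mean_i
    return best_coord


def biased_stream(d, j, rho):
    while True:
        x = [1 if random.random() < 0.5 else -1 for _ in range(d)]
        x[j] = 1 if random.random() < 0.5 + rho else -1
        yield x


random.seed(1)
d, j, rho = 30, 11, 0.25
segment_len = int(10 * math.log(d))  # O(log d) instances per segment
print(detect_with_small_memory(biased_stream(d, j, rho), d, segment_len))
```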

Thus, the result implies a trade-off between sample complexity and memory complexity, which is exponential in d: without memory constraints, one can detect j after O(log(d)) i.i.d. instances, whereas with memory constraints, the number of instances needed scales linearly with d.

Since Thm. 1 also applies to (b, n, m) protocols in general, a similar reasoning implies trade-offs between information and sample complexity in other situations, such as distributed learning.

The proof of the theorem appears in Appendix A. However, the technical details may obfuscate the proof's simple intuition, which we now turn to explain.

First, the bound in Eq. (1) is just a straightforward application of Hoeffding's inequality and a union bound, ensuring that we can estimate the biased coordinate when mn ≳ log(d)/ρ². However, in a setting where the instances are given to us one-by-one or in batches, this requires us to maintain an estimate for all d coordinates simultaneously. Unfortunately, when we can only output a small number b of bits based on each instance/batch, we cannot accurately update the estimate for all d coordinates. For example, we can provide some information on all d coordinates, but then the information we will provide per coordinate (and in particular, on the biased coordinate j) will be very small. Alternatively, we can provide accurate information on O(b) coordinates, but we don't know which coordinate is the important coordinate j, hence we are likely to ‘miss’ it. In fact, the proof relies on showing that no matter what, a (b, n, m) protocol cannot provide more than b/d bits of information (in expectation) on coordinate j, unless it already ‘knows’ j. As there are only m rounds overall, the amount of information conveyed on coordinate j is at most mb/d. If this is much smaller than 1, there is insufficient information to accurately detect j.

[Figure 1: Illustration of the relationship between j, the coordinates 1, 2, …, j, …, d of the sample X^t, and the message W^t. The coordinates are independent of each other, and most of them just output ±1 uniformly at random. Only coordinate j has a different distribution and hence contains some information on j.]

From a more information-theoretic viewpoint, one can view this as a result on the information transmission rate of a channel, as illustrated in Figure 1. In this channel, the message j is sent at random through one of d independent binary symmetric channels (corresponding to the j-th coordinate in the data instances). Even though W constitutes b bits, the information on j it can convey is no larger than b/d in expectation, since it doesn't ‘know’ which of the channels to decode.

For some of the applications we consider, hide-and-seek problem 1 isn't appropriate, and we will need the following alternative problem, which is again parameterized by a dimension d, bias ρ, and sample size mn:

Definition 4 (Hide-and-seek Problem 2). Consider the set of distributions $\{\Pr_j(\cdot)\}_{j=1}^{d}$ over $\{-\mathbf{e}_i, +\mathbf{e}_i\}_{i=1}^{d}$, defined as
\[
\Pr\nolimits_j(\mathbf{e}_i) = \begin{cases} \frac{1}{2d} & i \neq j \\[2pt] \frac{1}{2d} + \frac{\rho}{d} & i = j \end{cases}
\qquad
\Pr\nolimits_j(-\mathbf{e}_i) = \begin{cases} \frac{1}{2d} & i \neq j \\[2pt] \frac{1}{2d} - \frac{\rho}{d} & i = j \end{cases}.
\]
Given an i.i.d. sample of mn instances generated from $\Pr_j(\cdot)$, where j is unknown, detect j.

In words, Pr_j(·) corresponds to picking ±e_i where i is chosen uniformly at random, and the sign is chosen uniformly if i ≠ j, and to be positive (resp. negative) with probability 1/2 + ρ (resp. 1/2 − ρ) if i = j. It is easily verified that this creates sparse instances with zero-mean coordinates, except coordinate j whose expectation is 2ρ/d.

We now present a result analogous to Thm. 1 for this new hide-and-seek problem.

Theorem 2. Consider hide-and-seek problem 2 on d > 1 coordinates, with some bias $\rho \leq \min\left\{\frac{1}{27}, \frac{1}{9\log(d)}, \frac{d}{14n}\right\}$. Then for any estimate $\tilde{J}$ of the biased coordinate returned by any (b, n, m) protocol, there exists some coordinate j such that
\[
\Pr\nolimits_j(\tilde{J}=j) \;\leq\; \frac{3}{d} + 11\sqrt{\frac{mb}{d}}.
\]
However, given mn samples, if $\tilde{J}$ is the coordinate with the highest empirical average, then $\Pr_j(\tilde{J}=j) \geq 1 - d\exp\left(-(mn/d)\rho^2\right)$.

The proof appears in Appendix A, and uses an intuition similar to the proof of Thm. 1. As in the previous theorem, it implies a regime with performance gaps between (b, n, m) protocols and unconstrained algorithms.

The theorems above hold for any (b, n, m) protocol, and in particular for b-memory online protocols (since they are a special case of (b, 1, m) protocols). However, for b-memory online protocols, the following simple observation will allow us to further strengthen our results:

Theorem 3. Any b-memory online protocol over m instances is also a (b, κ, ⌈m/κ⌉) protocol for any positive integer κ ≤ m.

The proof is immediate: given a batch of κ instances, we can always feed the instances one by one to our b-memory online protocol, and output the final message after ⌊m/κ⌋ such batches are processed, ignoring any remaining instances. This makes the algorithm a type of (b, κ, ⌈m/κ⌉) protocol.

As a result, when discussing b-memory online protocols for some particular value of m, we can actually apply Thm. 1 and Thm. 2 with n, m replaced by κ, ⌈m/κ⌉, where κ is a free parameter we can tune to attain the most convenient bound.
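The reduction behind Thm. 3 is mechanical, as the following Python sketch illustrates (the wrapper and its names are ours): each batch of κ instances is simply fed one at a time to the b-memory protocol, so the per-round message is still the protocol's b-bit state.

```python
# Sketch of the reduction in Thm. 3: a b-memory online protocol, fed the
# instances of each batch one at a time, behaves as a (b, kappa, m/kappa)
# protocol. `update`/`finalize` stand for the protocol's arbitrary b-bit maps.

def as_batch_protocol(update, finalize, b, kappa):
    """Wrap a b-memory online protocol so it consumes batches of kappa instances."""

    def process_batches(batches):
        state, t = 0, 0
        for batch in batches:          # each batch plays the role of X^t
            assert len(batch) == kappa
            for x in batch:            # feed instances one by one
                t += 1
                state = update(t, x, state)
                assert 0 <= state < 2 ** b
        return finalize(state)         # the final b-bit message gives the output

    return process_batches


# Usage with a toy update/finalize (count +1's in the first coordinate):
update = lambda t, x, state: min(state + (x[0] == 1), 2 ** 8 - 1)
finalize = lambda state: state
protocol = as_batch_protocol(update, finalize, b=8, kappa=2)
print(protocol([[(1, 1), (-1, 1)], [(1, -1), (1, 1)]]))  # two batches of size 2
```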

4 Applications

4.1 Online Learning with Partial Information

Consider the standard setting of learning with expert advice, defined as a game over T rounds, where in each round t a loss vector ℓ_t ∈ [0, 1]^d is chosen, and the learner (without knowing ℓ_t) needs to pick an action i_t from a fixed set {1, …, d}, after which the learner suffers loss ℓ_{t,i_t}. The goal of the learner is to minimize the regret in hindsight with respect to the best fixed action, $\sum_{t=1}^{T}\ell_{t,i_t} - \min_i \sum_{t=1}^{T}\ell_{t,i}$. We are interested in partial-information variants, where the learner doesn't get to see and use ℓ_t, but only some partial information on it. For example, in standard multi-armed bandits, the learner can only view ℓ_t(i_t). The following theorem is a simple corollary of Thm. 1, and we provide a proof in Appendix B.1.
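The following Python sketch (our own illustration, not the paper's construction) makes the interaction explicit: at each round the learner picks an action, and afterwards may retain only b bits extracted from the full loss vector. The bandit-style extraction rule and the naive learner below are placeholders.

```python
# Sketch of the expert-advice game with b-bit partial information: the learner
# never stores the loss vector itself, only a b-bit summary of it. The
# extraction rule and the (non-learning) uniform learner are placeholders.

import random


def play(T, d, b, draw_loss_vector):
    picks, feedback_log = [], []
    for t in range(T):
        # Learner picks an action (here: uniformly at random, for illustration).
        i_t = random.randrange(d)
        picks.append(i_t)

        loss = draw_loss_vector()          # loss vector in [0,1]^d, hidden
        # b-bit feedback: quantize the chosen action's loss to b bits
        # (multi-armed-bandit feedback is a coarse special case of this).
        feedback = min(int(loss[i_t] * 2 ** b), 2 ** b - 1)
        feedback_log.append(feedback)      # all the learner ever keeps
    return picks, feedback_log


random.seed(2)
d = 16
losses = lambda: [random.random() for _ in range(d)]
picks, feedback = play(T=5, d=d, b=3, draw_loss_vector=losses)
print(picks, feedback)
```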


Theorem 4. Suppose d ≥ 30. For any algorithm which can view only b bits of each loss vector, there is a distribution over loss vectors ℓ_t ∈ [0, 1]^d with the following property: if the loss vectors are sampled i.i.d. from this distribution, then
\[
\min_j \mathbb{E}\left[\sum_{t=1}^{T}\ell_t(j_t) - \sum_{t=1}^{T}\ell_t(j)\right] \;\geq\; 0.01\,T
\]
as long as Tb ≤ 0.01 d.

As a result, we get that for any algorithm with any partial-information feedback model (where b bits are extracted from each d-dimensional loss vector), it is impossible to get sub-linear regret guarantees in less than Ω(d/b) rounds. Remarkably, this holds even if the algorithm is allowed to examine each loss vector ℓ_t and choose which b bits of information it wishes to retain. In contrast, full-information algorithms (e.g. Hedge [31]) can get sublinear regret in O(log(d)) rounds. Another interpretation of this result is as a ‘Heisenberg-like’ uncertainty principle: given T rounds and b bits extracted per round, we must have Tb ≥ Ω(d) in order for the regret to be sublinear.

When b = O(1), the bound in the theorem matches a standard lower bound for multi-armed bandits, in terms of the relation² of d and T. However, since the b bits extracted are arbitrary, it also implies lower bounds for other partial-information settings. For example, we immediately get an Ω(d/k) lower bound when we are allowed to view k coordinates instead of 1, corresponding to (say) the semi-bandit feedback model ([20]), or the side-observation model of [37] with a fixed upper bound k on the number of side-observations. In partial monitoring ([19]), we get an Ω(d/k) lower bound where k is the logarithm of the feedback matrix width. In attribute efficient learning (e.g. [21]), a simple reduction implies an Ω(d/k) sample complexity lower bound when we are constrained to view at most k features of each example. In general, for any current or future feedback model which corresponds to getting a limited number of bits from the loss vector, our results imply a non-trivial lower bound.

²Based on a conjectured improvement of our results, it should also be possible to prove an $\Omega(\sqrt{(d/b)T})$ regret lower bound, which holds for any d, T and any algorithm extracting b bits from the loss vector. See Sec. 5 for more details.

4.2 Stochastic Optimization

We now turn to consider an example from stochastic optimization, where our goal is to approximately minimize F(h) = E_Z[f(h; Z)] given access to m i.i.d. instantiations of Z, whose distribution is unknown. This setting has received much attention in recent years, and can be used to model many statistical learning problems. In this section, we show a stochastic optimization problem where information-constrained protocols provably pay a performance price compared to non-constrained algorithms. We emphasize that it is going to be a very simple toy problem, and is not meant to represent anything realistic. We present it for two reasons: first, it illustrates another type of situation where information-constrained protocols may fail (in particular, problems involving matrices). Second, the intuition of the construction is also used in the more realistic problem of sparse PCA and covariance estimation, considered in the next section.

The construction is as follows: suppose we wish to solve $\min_{(\mathbf{w},\mathbf{v})} F(\mathbf{w},\mathbf{v}) = \mathbb{E}_Z[f((\mathbf{w},\mathbf{v});Z)]$, where
\[
f((\mathbf{w},\mathbf{v});Z) \;=\; \mathbf{w}^{\top} Z \mathbf{v}\,,\qquad Z \in [-1,+1]^{d\times d},
\]
and w, v range over all vectors in the simplex (i.e. $w_i, v_i \geq 0$ and $\sum_{i=1}^{d} w_i = \sum_{i=1}^{d} v_i = 1$). Then we have, for instance, the following:

Theorem 5. Suppose d ≥ 10. Given m i.i.d. instantiations $Z^1,\ldots,Z^m$ from any distribution over Z, let $(\hat{I},\hat{J}) = \arg\min_{i,j}\frac{1}{m}\sum_{t=1}^{m} Z^t_{i,j}$. Then the solution $(\mathbf{e}_{\hat{I}}, \mathbf{e}_{\hat{J}})$ satisfies, with probability at least 1 − δ,
\[
F(\mathbf{e}_{\hat{I}}, \mathbf{e}_{\hat{J}}) - \min_{\mathbf{w},\mathbf{v}} F(\mathbf{w},\mathbf{v}) \;\leq\; 2\sqrt{\frac{2\log(d^2/\delta)}{m}}.
\]
However, for any (b, 1, m) protocol, where $mb < d^2/290$, and any solution $\hat{\mathbf{w}}, \hat{\mathbf{v}}$ returned by the protocol given m i.i.d. instances, there exists a distribution over Z such that with probability > 1/2,
\[
F(\hat{\mathbf{w}}, \hat{\mathbf{v}}) - \min_{\mathbf{w},\mathbf{v}} F(\mathbf{w},\mathbf{v}) \;\geq\; \frac{1}{8}.
\]

The upper bound in the theorem follows from a simple concentration of measure argument, noting that if E[Z] has a minimal element at location (i*, j*), then F(w, v) = E[f((w, v); Z)] is minimized at w = e_{i*}, v = e_{j*}, and attains a value equal to that minimal element. The lower bound follows by considering distributions where Z ∈ {−1,+1}^{d×d} with probability 1, each of the d^2 entries is chosen independently, and E[Z] is zero except at some coordinate (i*, j*) where it equals −1/4. For such distributions, getting optimization error smaller than a constant reduces to detecting (i*, j*), and this in turn reduces to the hide-and-seek problem defined in Definition 3, over d^2 coordinates and with a bias ρ = 1/4. However, we showed that no information-constrained protocol will succeed if mb is much smaller than the number of coordinates, which is d^2 in this case. We provide a full proof in Appendix B.2.
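A minimal NumPy sketch of the unconstrained plug-in procedure from the theorem (written by us, with a toy data-generating distribution of our own choosing): average the observed matrices, find the entry-wise minimizer, and return the corresponding pair of standard basis vectors.

```python
# Plug-in estimator from Thm. 5: average the observed matrices Z^1..Z^m,
# take the argmin entry (I, J), and return (e_I, e_J). The data below
# (a single entry with mean -1/4) is a toy illustration only.

import numpy as np

rng = np.random.default_rng(0)
d, m = 10, 4000
i_star, j_star = 3, 7

# Each entry of Z is +-1; entry (i*, j*) has mean -1/4, all others mean 0.
P = np.full((d, d), 0.5)
P[i_star, j_star] = 0.5 - 1 / 8      # Pr(entry = +1), so the mean is -1/4
Z = np.where(rng.random((m, d, d)) < P, 1.0, -1.0)

Z_bar = Z.mean(axis=0)                        # empirical average matrix
I, J = np.unravel_index(np.argmin(Z_bar), Z_bar.shape)

e_I, e_J = np.eye(d)[I], np.eye(d)[J]         # the returned simplex vertices
print((I, J), float(e_I @ Z_bar @ e_J))       # should recover (3, 7), value near -0.25
```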

According to the theorem, there is a sample size regime m where we can get a small optimization error without information constraints, using a very simple procedure, yet any (say) stochastic gradient-based optimization algorithm (whose memory is linear in the dimension d), or any non-interactive distributed algorithm with a per-machine communication budget linear in d, will fail to get sub-constant error.

4.3 Sparse PCA and Sparse Covariance Estimation

The sparse PCA problem ([53]) is a standard and well-known statistical estimation problem, defined as follows: we are given an i.i.d. sample of vectors x ∈ R^d, and we assume that there is some direction, corresponding to some sparse vector v (of cardinality at most k), such that the variance E[(v^⊤x)^2] along that direction is larger than along any other direction. Our goal is to find that direction.

We will focus here on the simplest possible form of this problem, where the maximizing direction v is assumed to be 2-sparse, i.e. there are only 2 non-zero coordinates $v_i, v_j$. In that case, $\mathbb{E}[(\mathbf{v}^{\top}\mathbf{x})^2] = v_i^2\,\mathbb{E}[x_i^2] + v_j^2\,\mathbb{E}[x_j^2] + 2 v_i v_j\,\mathbb{E}[x_i x_j]$. Following previous work (e.g. [14]), we even assume that $\mathbb{E}[x_i^2] = 1$ for all i, in which case the sparse PCA problem reduces to detecting a coordinate pair (i*, j*), i* < j*, for which $x_{i^*}, x_{j^*}$ are maximally correlated. A special case is a simple and natural sparse covariance estimation problem ([15, 18]), where we assume that all covariates are uncorrelated ($\mathbb{E}[x_i x_j] = 0$) except for a unique correlated pair of covariates (i*, j*) which we need to detect.

This setting bears a resemblance to the example seen in the context of stochastic optimization in Section 4.2: we have a d × d stochastic matrix xx^⊤, and we need to detect a single biased entry at location (i*, j*). Unfortunately, these stochastic matrices are rank-1, and do not have independent entries as in the example considered in Section 4.2. Thus, we need to use a somewhat different construction, relying on a distribution supported on sparse vectors. The intuition is that then each xx^⊤ would be sparse, and the situation can be reduced to the hide-and-seek problem considered in Definition 4 and Thm. 2.
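As a point of comparison for the theorem below, here is a small NumPy sketch (our own toy Gaussian setup, not the paper's hard distribution) of the unconstrained plug-in detector: estimate all pairwise products from the sample and return the pair with the largest empirical average of x_i x_j.

```python
# Unconstrained plug-in detector for the 2-sparse covariance problem: return
# argmax over i < j of the empirical average of x_i * x_j. The Gaussian data
# below is only a convenient toy distribution, not the hard instance.

import numpy as np

rng = np.random.default_rng(1)
d, m, tau = 20, 5000, 0.3
i_star, j_star = 2, 9

# Unit-variance coordinates, all uncorrelated except (i*, j*) with correlation tau.
cov = np.eye(d)
cov[i_star, j_star] = cov[j_star, i_star] = tau
X = rng.multivariate_normal(np.zeros(d), cov, size=m)

pairwise = (X.T @ X) / m                       # empirical averages of x_i x_j
pairwise[np.tril_indices(d)] = -np.inf         # keep only pairs with i < j
I, J = np.unravel_index(np.argmax(pairwise), pairwise.shape)
print((I, J))                                  # should recover (2, 9)
```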

Formally, for 2-sparse PCA (or covariance estimation) problems, we show the following gap between the attainable performance of any b-memory online protocol, or even any (b, n, m) protocol for a particular choice of n, and a simple plug-in estimator without information constraints.

Theorem 6. Consider the class of 2-sparse PCA (or covariance estimation) problems in d ≥ 9 dimensions as described above, and all distributions such that:

1. $\mathbb{E}[x_i^2] = 1$ for all i.

2. For a unique pair of distinct coordinates (i*, j*), it holds that $\mathbb{E}[x_{i^*}x_{j^*}] = \tau > 0$, whereas $\mathbb{E}[x_i x_j] = 0$ for all distinct coordinate pairs (i, j) ≠ (i*, j*).

3. For any i < j, if $\overline{x_i x_j}$ is the empirical average of $x_i x_j$ over m i.i.d. instances, then $\Pr\left(\left|\overline{x_i x_j} - \mathbb{E}[x_i x_j]\right| \geq \frac{\tau}{2}\right) \leq 2\exp\left(-m\tau^2/6\right)$.

Then the following holds:

• Let $(\hat{I},\hat{J}) = \arg\max_{i<j} \overline{x_i x_j}$. Then for any distribution as above, $\Pr((\hat{I},\hat{J}) = (i^*,j^*)) \geq 1 - d^2\exp(-m\tau^2/6)$. In particular, when the bias τ equals Θ(1/d log(d)),
\[
\Pr\left((\hat{I},\hat{J}) = (i^*,j^*)\right) \;\geq\; 1 - d^2\exp\left(-\Omega\left(\frac{m}{d^2\log^2(d)}\right)\right).
\]

• For any estimate $(\hat{I},\hat{J})$ of (i*, j*) returned by any b-memory online protocol using m instances, or any $\left(b,\, d(d-1),\, \left\lceil \frac{m}{d(d-1)} \right\rceil\right)$ protocol, there exists a distribution with bias τ = Θ(1/d log(d)) as above such that
\[
\Pr\left((\hat{I},\hat{J}) = (i^*,j^*)\right) \;\leq\; O\left(\frac{1}{d^2} + \sqrt{\frac{m}{d^4/b}}\right).
\]


The proof appears in Appendix B.3. The theorem implies that in the regime where $b \ll d^2/\log^2(d)$, we can choose any m such that $\frac{d^4}{b} \gg m \gg d^2\log^2(d)$, and get that the chances of the protocol detecting (i*, j*) are arbitrarily small, even though the empirical average reveals (i*, j*) with arbitrarily high probability. Thus, in this sparse PCA / covariance estimation setting, any online algorithm with sub-quadratic memory cannot be statistically optimal for all sample sizes. The same holds for any (b, n, m) protocol in an appropriate regime of (n, m), such as distributed algorithms as discussed earlier.

To the best of our knowledge, this is the first result which explicitly shows that memory constraints can incur a statistical cost for a standard estimation problem. It is interesting that sparse PCA was also shown recently to be affected by computational constraints on the algorithm's runtime ([14]).

The theorem shows a performance gap in the regime where the number of instances m is much larger than d^2 and much smaller than d^4/b. However, this is not the only possible regime, and it is possible to establish such gaps for similar problems already when m is linear in b (up to log-factors) - see the proof in Appendix B.3 for more details. Also, the distribution on which information-constrained protocols are shown to fail satisfies the theorem's conditions, but is ‘spiky’ and rather unnatural. Proving a similar result for a ‘natural’ data distribution (e.g. Gaussian) remains an interesting open problem, as further discussed in Sec. 5.

5 Discussion and Open Questions

In this paper, we investigated cases where a generic type of information-constrained algorithm has strictly inferior statistical performance compared to constraint-free algorithms. As special cases, we demonstrated such gaps for memory-constrained and communication-constrained algorithms (e.g. in the context of sparse PCA and covariance estimation), as well as for online learning with partial information and stochastic optimization. These results are based on explicitly considering the information-theoretic structure of the problem, and depend only on the number of bits extracted from each data point. We believe these results form a first step in a fuller understanding of how information constraints affect learning ability in a statistical setting.

There are several immediate questions left open. One question is whether our bounds for (b, n, m) protocols in Thm. 1 and Thm. 2 can be improved. We conjecture this is true, and that the bound should actually depend on mnρ²b/d rather than mb/d, and without constraints on the bias ρ³. This would allow us, for instance, to show performance gaps for much wider sample size regimes, as well as recover a stronger version of Thm. 4 for online learning with partial information, implying a regret lower bound of $\Omega(\sqrt{(d/b)/T})$ for any number of rounds T.

³The looseness potentially comes from the application of Lemma 3 and Jensen's inequality in the proof of Thm. 1, which causes the bound to lose the property of becoming trivial if Pr_j(·) = Pr_0(·).

A second open question is whether there are convex stochastic optimization problems for which online or distributed algorithms are provably inferior to constraint-free algorithms (the example discussed in Section 4.2 refers to an easily-solvable yet non-convex problem). Due to the large current effort in developing scalable algorithms for convex learning problems, this would establish that one must pay a statistical price for using such memory- and time-efficient algorithms.

A third open question is whether the results for non-interactive (or serial) distributed algorithms can be extended to more interactive algorithms, where the different machines can communicate over several rounds. As mentioned earlier, there is a rich literature on the communication complexity of interactive distributed algorithms within theoretical computer science, but we don't know how to ‘import’ these results to a statistical setting based on i.i.d. data.

A fourth open question relates to the sparse PCA / covariance estimation result. The hardness result we gave for information-constrained protocols uses a tailored distribution, which has a sufficiently controlled tail behavior but is ‘spiky’ and not sub-Gaussian uniformly in the dimension. Thus, it would be interesting to establish similar hardness results for ‘natural’ distributions (e.g. Gaussian).

More generally, there is much work remaining in extending the results here to other learning problems and other information constraints.

References

[1] A. Agarwal, P. Bartlett, P. Ravikumar, and M. Wainwright. Information-theoretic lower bounds on the oracle complexity of stochastic convex optimization. IEEE Transactions on Information Theory, 58(5):3235–3249, 2012.

[2] A. Agarwal, O. Chapelle, M. Dudík, and J. Langford. A reliable effective terascale linear learning system. arXiv preprint arXiv:1110.4198, 2011.

[3] N. Alon, Y. Matias, and M. Szegedy. The space complexity of approximating the frequency moments. In STOC, 1996.

[4] R. Arora, A. Cotter, and N. Srebro. Stochastic optimization of PCA with capped MSG. In NIPS, 2013.

[5] J.-Y. Audibert, S. Bubeck, and G. Lugosi. Minimax policies for combinatorial prediction games. In COLT, 2011.

[6] P. Auer, N. Cesa-Bianchi, and P. Fischer. Finite-time analysis of the multiarmed bandit problem. Machine Learning, 47(2-3):235–256, 2002.

[7] P. Auer, N. Cesa-Bianchi, Y. Freund, and R. Schapire. The nonstochastic multiarmed bandit problem. SIAM Journal on Computing, 32(1):48–77, 2002.

[8] M. Balcan, A. Blum, S. Fine, and Y. Mansour. Distributed learning, communication complexity and privacy. In COLT, 2012.

[9] A. Balsubramani, S. Dasgupta, and Y. Freund. The fast convergence of incremental PCA. In NIPS, 2013.

[10] Z. Bar-Yossef, T. Jayram, R. Kumar, and D. Sivakumar. An information statistics approach to data stream and communication complexity. In FOCS, 2002.

[11] B. Barak, M. Braverman, X. Chen, and A. Rao. How to compress interactive communication. In STOC, 2010.

[12] G. Bartok, D. Foster, D. Pal, A. Rakhlin, and C. Szepesvari. Partial monitoring – classification, regret bounds, and algorithms. 2013.

[13] R. Bekkerman, M. Bilenko, and J. Langford. Scaling up machine learning: Parallel and distributed approaches. Cambridge University Press, 2011.

[14] Q. Berthet and P. Rigollet. Complexity theoretic lower bounds for sparse principal component detection. In COLT, 2013.

[15] J. Bien and R. Tibshirani. Sparse estimation of a covariance matrix. Biometrika, 98(4):807–820, 2011.

[16] S. Boyd, N. Parikh, E. Chu, B. Peleato, and J. Eckstein. Distributed optimization and statistical learning via the alternating direction method of multipliers. Foundations and Trends in Machine Learning, 3(1):1–122, 2011.

[17] S. Bubeck and N. Cesa-Bianchi. Regret analysis of stochastic and nonstochastic multi-armed bandit problems. Foundations and Trends in Machine Learning, 5(1):1–122, 2012.

[18] T. Cai and W. Liu. Adaptive thresholding for sparse covariance matrix estimation. Journal of the American Statistical Association, 106(494), 2011.

[19] N. Cesa-Bianchi and G. Lugosi. Prediction, learning, and games. Cambridge University Press, 2006.

[20] N. Cesa-Bianchi and G. Lugosi. Combinatorial bandits. Journal of Computer and System Sciences, 78(5):1404–1422, 2012.

[21] N. Cesa-Bianchi, S. Shalev-Shwartz, and O. Shamir. Efficient learning with partially observed attributes. The Journal of Machine Learning Research, 12:2857–2878, 2011.

[22] A. Chakrabarti, S. Khot, and X. Sun. Near-optimal lower bounds on the multi-party communication complexity of set disjointness. In CCC, 2003.

[23] S. Chien, K. Ligett, and A. McGregor. Space-efficient estimation of robust statistics and distribution testing. In ICS, 2010.

[24] A. Cotter, O. Shamir, N. Srebro, and K. Sridharan. Better mini-batch algorithms via accelerated gradient methods. In NIPS, 2011.

[25] T. Cover and J. Thomas. Elements of information theory. John Wiley & Sons, 2006.

[26] M. Crouch, A. McGregor, and D. Woodruff. Stochastic streams: Sample complexity vs. space complexity. In MASSIVE, 2013.

[27] O. Dekel, R. Gilad-Bachrach, O. Shamir, and L. Xiao. Optimal distributed online prediction. In ICML, 2011.

[28] O. Dekel, R. Gilad-Bachrach, O. Shamir, and L. Xiao. Optimal distributed online prediction using mini-batches. The Journal of Machine Learning Research, 13:165–202, 2012.

[29] S. S. Dragomir. Upper and lower bounds for Csiszar's f-divergence in terms of the Kullback-Leibler distance and applications. In Inequalities for Csiszar f-Divergence in Information Theory. RGMIA Monographs, 2000.

[30] J. Duchi, A. Agarwal, and M. Wainwright. Dual averaging for distributed optimization: convergence analysis and network scaling. IEEE Transactions on Automatic Control, 57(3):592–606, 2012.

[31] Y. Freund and R. Schapire. A decision-theoretic generalization of on-line learning and an application to boosting. Journal of Computer and System Sciences, 55(1):119–139, 1997.

[32] S. Guha and A. McGregor. Space-efficient sampling. In AISTATS, 2007.

[33] A. Gyorgy, T. Linder, G. Lugosi, and G. Ottucsak. The on-line shortest path problem under partial monitoring. Journal of Machine Learning Research, 8:2369–2403, 2007.

[34] A. Kyrola, D. Bickson, C. Guestrin, and J. Bradley. Parallel coordinate descent for l1-regularized loss minimization. In ICML, 2011.

[35] J. Langford and T. Zhang. The epoch-greedy algorithm for multi-armed bandits with side information. In NIPS, 2007.

[36] L. Li, W. Chu, J. Langford, and R. Schapire. A contextual-bandit approach to personalized news article recommendation. In WWW, 2010.

[37] S. Mannor and O. Shamir. From bandits to experts: On the value of side-observations. In NIPS, 2011.

[38] I. Mitliagkas, C. Caramanis, and P. Jain. Memory limited, streaming PCA. arXiv preprint arXiv:1307.0032, 2013.

[39] S. Muthukrishnan. Data streams: Algorithms and applications. Now Publishers Inc, 2005.

[40] A. Nemirovsky and D. Yudin. Problem Complexity and Method Efficiency in Optimization. Wiley-Interscience, 1983.

[41] Y. Nesterov. Smooth minimization of non-smooth functions. Mathematical Programming, 103(1):127–152, 2005.

[42] F. Niu, B. Recht, C. Re, and S. Wright. Hogwild: A lock-free approach to parallelizing stochastic gradient descent. In NIPS, 2011.

[43] M. Raginsky and A. Rakhlin. Information-based complexity, feedback and dynamics in convex programming. IEEE Transactions on Information Theory, 57(10):7036–7056, 2011.

[44] A. Rahimi and B. Recht. Random features for large-scale kernel machines. In NIPS, 2007.

[45] S. Shalev-Shwartz. Online learning and online convex optimization. Foundations and Trends in Machine Learning, 4(2):107–194, 2011.

[46] S. Sra, S. Nowozin, and S. Wright. Optimization for Machine Learning. MIT Press, 2011.

[47] I. Taneja and P. Kumar. Relative information of type s, Csiszar's f-divergence, and information inequalities. Inf. Sci., 166(1-4):105–125, 2004.

[48] C. Williams and M. Seeger. Using the Nyström method to speed up kernel machines. In NIPS, 2001.

[49] D. Woodruff. The average-case complexity of counting distinct elements. In ICDT, 2009.

[50] Y. Zhang, J. Duchi, M. Jordan, and M. Wainwright. Information-theoretic lower bounds for distributed statistical estimation with communication constraints. In NIPS, 2013.

[51] Y. Zhang, J. Duchi, and M. Wainwright. Communication-efficient algorithms for statistical optimization. In NIPS, 2012.

[52] Y. Zhang, J. Duchi, and M. Wainwright. Divide and conquer kernel ridge regression. In COLT, 2013.

[53] H. Zou, T. Hastie, and R. Tibshirani. Sparse principal component analysis. Journal of Computational and Graphical Statistics, 15(2):265–286, 2006.


A Proofs from Sec. 3

The proofs use several standard quantities and results from information theory – see Appendix C for more details. They also make use of a few technical lemmas (presented in Subsection A.1), and a key lemma (in Subsection A.2) which quantifies how information-constrained protocols cannot provide information on all coordinates simultaneously. The proofs themselves appear in Subsection A.3 and Subsection A.4.

A.1 Technical Lemmas

Lemma 1. Suppose that d > 1, and for some fixed distribution $\Pr_0(\cdot)$ over the messages $w^1,\ldots,w^m$ computed by an information-constrained protocol, it holds that
\[
\sqrt{\frac{2}{d}\sum_{j=1}^{d} D_{kl}\left(\Pr\nolimits_0(w^1\ldots w^m)\,||\,\Pr\nolimits_j(w^1\ldots w^m)\right)} \;\leq\; B.
\]
Then there exists some j such that
\[
\Pr\nolimits_j(\tilde{J}=j) \;\leq\; \frac{3}{d}+2B.
\]

Proof. By concavity of the square root, we have
\[
\sqrt{\frac{2}{d}\sum_{j=1}^{d} D_{kl}\left(\Pr\nolimits_0(w^1\ldots w^m)\,||\,\Pr\nolimits_j(w^1\ldots w^m)\right)}
\;\geq\; \frac{1}{d}\sum_{j=1}^{d}\sqrt{2\,D_{kl}\left(\Pr\nolimits_0(w^1\ldots w^m)\,||\,\Pr\nolimits_j(w^1\ldots w^m)\right)}.
\]
Using Pinsker's inequality and the fact that $\tilde{J}$ is some function of the messages $w^1,\ldots,w^m$ (independent of the data distribution), this is at least
\begin{align*}
\frac{1}{d}\sum_{j=1}^{d}\sum_{w^1\ldots w^m}\left|\Pr\nolimits_0(w^1\ldots w^m)-\Pr\nolimits_j(w^1\ldots w^m)\right|
&\;\geq\; \frac{1}{d}\sum_{j=1}^{d}\left|\sum_{w^1\ldots w^m}\left(\Pr\nolimits_0(w^1\ldots w^m)-\Pr\nolimits_j(w^1\ldots w^m)\right)\Pr\!\left(\tilde{J}=j\,|\,w^1\ldots w^m\right)\right| \\
&\;\geq\; \frac{1}{d}\sum_{j=1}^{d}\left|\Pr\nolimits_0(\tilde{J}=j)-\Pr\nolimits_j(\tilde{J}=j)\right|.
\end{align*}
Thus, we may assume that
\[
\frac{1}{d}\sum_{j=1}^{d}\left|\Pr\nolimits_0(\tilde{J}=j)-\Pr\nolimits_j(\tilde{J}=j)\right| \;\leq\; B.
\]
The argument now uses a basic variant of the probabilistic method. If the expression above is at most B, then for at least d/2 values of j, it holds that $|\Pr_0(\tilde{J}=j)-\Pr_j(\tilde{J}=j)|\leq 2B$. Also, since $\sum_{j=1}^{d}\Pr_0(\tilde{J}=j)=1$, then for at least 2d/3 values of j, it holds that $\Pr_0(\tilde{J}=j)\leq 3/d$. Combining the two observations, and assuming that d > 1, it means there must exist some value of j such that $|\Pr_0(\tilde{J}=j)-\Pr_j(\tilde{J}=j)|\leq 2B$, as well as $\Pr_0(\tilde{J}=j)\leq 3/d$, hence $\Pr_j(\tilde{J}=j)\leq \frac{3}{d}+2B$ as required.

Lemma 2. Let p, q be distributions over a product domain $A_1\times A_2\times\ldots\times A_d$, where each $A_i$ is a finite set. Suppose that for some $j\in\{1,\ldots,d\}$, the following equality holds for all $\mathbf{z}=(z_1,\ldots,z_d)\in A_1\times\ldots\times A_d$:
\[
p(\{z_i\}_{i\neq j}\,|\,z_j) \;=\; q(\{z_i\}_{i\neq j}\,|\,z_j).
\]
Also, let E be an event such that $p(E|\mathbf{z})=q(E|\mathbf{z})$ for all $\mathbf{z}$. Then
\[
p(E) \;=\; \sum_{z_j} p(z_j)\,q(E|z_j).
\]

Proof.
\begin{align*}
p(E) \;=\; \sum_{\mathbf{z}} p(\mathbf{z})\,p(E|\mathbf{z}) \;=\; \sum_{\mathbf{z}} p(\mathbf{z})\,q(E|\mathbf{z})
&\;=\; \sum_{z_j} p(z_j)\sum_{\{z_i\}_{i\neq j}} p(\{z_i\}_{i\neq j}|z_j)\,q(E|z_j,\{z_i\}_{i\neq j}) \\
&\;=\; \sum_{z_j} p(z_j)\sum_{\{z_i\}_{i\neq j}} q(\{z_i\}_{i\neq j}|z_j)\,q(E|z_j,\{z_i\}_{i\neq j}) \\
&\;=\; \sum_{z_j} p(z_j)\,q(E|z_j).
\end{align*}

Lemma 3 ([29], Proposition 1; [47], Corollary 4.3). Let p, q be two distributions on a discrete set, such that $\max_x \frac{p(x)}{q(x)}\leq c$. Then
\[
D_{kl}\left(p(\cdot)\,||\,q(\cdot)\right) \;\leq\; c\, D_{kl}\left(q(\cdot)\,||\,p(\cdot)\right).
\]

Lemma 4. Suppose we throw n balls independently and uniformly at random into d > 1 bins, and let $K_1,\ldots,K_d$ denote the number of balls in each of the d bins. Then for any ε ≥ 0 such that $\varepsilon\leq\min\left\{\frac{1}{6},\frac{1}{2\log(d)},\frac{d}{3n}\right\}$, it holds that
\[
\mathbb{E}\left[\exp\left(\varepsilon\max_j K_j\right)\right] \;<\; 13.
\]

Proof. Each $K_j$ can be written as $\sum_{i=1}^{n}\mathbf{1}(\text{ball } i \text{ fell into bin } j)$, and has expectation n/d. Therefore, by a standard multiplicative Chernoff bound, for any γ ≥ 0,
\[
\Pr\left(K_j>(1+\gamma)\frac{n}{d}\right) \;\leq\; \exp\left(-\frac{\gamma^2}{2(1+\gamma)}\,\frac{n}{d}\right).
\]
By a union bound, this implies that
\[
\Pr\left(\max_j K_j>(1+\gamma)\frac{n}{d}\right) \;\leq\; \sum_{j=1}^{d}\Pr\left(K_j>(1+\gamma)\frac{n}{d}\right) \;\leq\; d\exp\left(-\frac{\gamma^2}{2(1+\gamma)}\,\frac{n}{d}\right).
\]
In particular, if γ + 1 ≥ 6, we can upper bound the above by the simpler expression $d\exp\left(-(1+\gamma)n/3d\right)$. Letting τ = γ + 1, we get that for any τ ≥ 6,
\[
\Pr\left(\max_j K_j>\tau\frac{n}{d}\right) \;\leq\; d\exp\left(-\frac{\tau n}{3d}\right). \tag{2}
\]
Define $c=\max\{8,d^{3\varepsilon}\}$. Using the inequality above and the non-negativity of $\exp(\varepsilon\max_j K_j)$, we have
\begin{align*}
\mathbb{E}\left[\exp\left(\varepsilon\max_j K_j\right)\right]
&\;=\; \int_{t=0}^{\infty}\Pr\left(\exp\left(\varepsilon\max_j K_j\right)\geq t\right)dt
\;\leq\; c+\int_{t=c}^{\infty}\Pr\left(\exp\left(\varepsilon\max_j K_j\right)\geq t\right)dt \\
&\;=\; c+\int_{t=c}^{\infty}\Pr\left(\max_j K_j\geq\frac{\log(t)}{\varepsilon}\right)dt
\;=\; c+\int_{t=c}^{\infty}\Pr\left(\max_j K_j\geq\frac{\log(t)d}{\varepsilon n}\,\frac{n}{d}\right)dt.
\end{align*}
Since we assume ε ≤ d/3n and c ≥ 8, it holds that $\exp(6\varepsilon n/d)\leq\exp(2)<8\leq c$, which implies $\log(c)d/\varepsilon n\geq 6$. Therefore, for any t ≥ c, it holds that $\log(t)d/\varepsilon n\geq 6$. This allows us to use Eq. (2) to upper bound the expression above by
\[
c+d\int_{t=c}^{\infty}\exp\left(-\frac{\log(t)d}{3\varepsilon n}\,\frac{n}{d}\right)dt \;=\; c+d\int_{t=c}^{\infty}t^{-1/3\varepsilon}\,dt.
\]
Since we assume ε ≤ 1/6, we have 1/(3ε) ≥ 2, and therefore we can solve the integral to get
\[
c+\frac{d}{\frac{1}{3\varepsilon}-1}\,c^{1-\frac{1}{3\varepsilon}} \;\leq\; c+d\,c^{1-\frac{1}{3\varepsilon}}.
\]
Using the value of c, and since $1-\frac{1}{3\varepsilon}\leq -1$, this is at most
\[
\max\{8,d^{3\varepsilon}\}+d\cdot\left(d^{3\varepsilon}\right)^{1-\frac{1}{3\varepsilon}} \;=\; \max\{8,d^{3\varepsilon}\}+d^{3\varepsilon}.
\]
Since ε ≤ 1/(2 log(d)), this is at most
\[
\max\{8,\exp(3/2)\}+\exp(3/2) \;<\; 13
\]
as required.


A.2 A Key Lemma

The following key lemma quantifies how a bounded-size message W is incapable of simultaneously capturing much information on many almost-independent random variables.

Lemma 5. Let $Z_1,\ldots,Z_d$ be independent random variables, and let W be a random variable which can take at most $2^b$ values. Then
\[
\frac{1}{d}\sum_{j=1}^{d} I(W;Z_j) \;\leq\; \frac{b}{d}.
\]

Proof. We have
\[
\frac{1}{d}\sum_{j=1}^{d} I(W;Z_j) \;=\; \frac{1}{d}\sum_{j=1}^{d}\left(H(Z_j)-H(Z_j|W)\right).
\]
Using the fact that $\sum_{j=1}^{d}H(Z_j|W)\geq H(Z_1\ldots Z_d|W)$, this is at most
\begin{align*}
\frac{1}{d}\sum_{j=1}^{d}H(Z_j)-\frac{1}{d}H(Z_1\ldots Z_d|W)
&\;=\; \frac{1}{d}\sum_{j=1}^{d}H(Z_j)-\frac{1}{d}\left(H(Z_1\ldots Z_d)-I(Z_1\ldots Z_d;W)\right) \\
&\;=\; \frac{1}{d}I(Z_1\ldots Z_d;W)+\frac{1}{d}\left(\sum_{j=1}^{d}H(Z_j)-H(Z_1\ldots Z_d)\right). \tag{3}
\end{align*}
Since $Z_1,\ldots,Z_d$ are independent, $\sum_{j=1}^{d}H(Z_j)=H(Z_1\ldots Z_d)$, hence the above equals
\[
\frac{1}{d}I(Z_1\ldots Z_d;W) \;=\; \frac{1}{d}\left(H(W)-H(W|Z_1\ldots Z_d)\right) \;\leq\; \frac{1}{d}H(W),
\]
which is at most b/d since W is only allowed to have $2^b$ values.
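As an aside (not part of the paper's argument), Lemma 5 can be checked numerically by exhaustive enumeration on a toy example; in the Python sketch below, the choice of d, b and of the map g producing the b-bit message is arbitrary and ours.

```python
# Exact numerical check of Lemma 5 on a toy example: Z_1..Z_d are independent
# fair bits and W = g(Z) takes at most 2^b values. We enumerate all 2^d inputs
# and verify (1/d) * sum_j I(W; Z_j) <= b/d. The map g is an arbitrary choice.

from collections import Counter
from itertools import product
from math import log2


def entropy(counts):
    total = sum(counts.values())
    return -sum((c / total) * log2(c / total) for c in counts.values() if c > 0)


d, b = 6, 2
g = lambda z: (z[0] ^ z[1], z[2] & z[3])    # some fixed map into 2^b values

zs = list(product([0, 1], repeat=d))        # uniform distribution over {0,1}^d
ws = [g(z) for z in zs]

avg_mi = 0.0
for j in range(d):
    h_w = entropy(Counter(ws))
    # H(W | Z_j): average the entropy of W over the two values of Z_j.
    h_w_given_zj = sum(
        entropy(Counter(w for z, w in zip(zs, ws) if z[j] == v)) / 2 for v in (0, 1)
    )
    avg_mi += (h_w - h_w_given_zj) / d

print(avg_mi, "<=", b / d)                  # the average mutual information is below b/d
assert avg_mi <= b / d + 1e-9
```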

A.3 Proof of Thm. 1

On top of the distributions Prj(·) defined in the hide-and-seek problem (Definition 3), wedefine an additional ‘reference’ distribution Pr0(·), which corresponds to the instances xchosen uniformly at random from −1,+1d (i.e. there is no biased coordinate).

The upper bound in the theorem statement is an immediate consequence of Hoeffding’sinequality and a union bound.

Let w1, . . . , wm denote the messages computed by the protocol. To show the lower bound,it is enough to prove that

2

d

d∑j=1

Dkl

(Pr0(w

1 . . . wm)∣∣∣∣Prj(w

1 . . . wm))≤ 15

mb

d, (4)

21

Page 22: Fundamental Limits of Online and Distributed Algorithms for … · 2019-04-02 · Memory constraints: The standard implementation of many common learning tasks require memory which

since then by applying Lemma 1, we get that for some $j$, $\Pr_j(\hat{J}=j) \le (3/d)+2\sqrt{15mb/d} \le (3/d)+8\sqrt{mb/d}$ as required.

Using the chain rule, the left hand side in Eq. (4) equals

$$\frac{2}{d}\sum_{j=1}^{d}\sum_{t=1}^{m}\mathbb{E}_{w^1\ldots w^{t-1}\sim \Pr_0}\Big[D_{kl}\big(\Pr_0(w^t|w^1\ldots w^{t-1})\,\big\|\,\Pr_j(w^t|w^1\ldots w^{t-1})\big)\Big]$$
$$= 2\sum_{t=1}^{m}\mathbb{E}_{w^1\ldots w^{t-1}\sim \Pr_0}\Bigg[\frac{1}{d}\sum_{j=1}^{d} D_{kl}\big(\Pr_0(w^t|w^1\ldots w^{t-1})\,\big\|\,\Pr_j(w^t|w^1\ldots w^{t-1})\big)\Bigg]. \tag{5}$$

Let us focus on a particular choice of $t$ and values $w^1\ldots w^{t-1}$. To simplify the presentation, we drop the $t$ superscript from the message $w^t$, and denote the previous messages $w^1\ldots w^{t-1}$ collectively as $\bar{w}$. Thus, we consider the quantity
$$\frac{1}{d}\sum_{j=1}^{d} D_{kl}\big(\Pr_0(w|\bar{w})\,\big\|\,\Pr_j(w|\bar{w})\big). \tag{6}$$
Recall that $w$ is some function of $\bar{w}$ and a set of $n$ instances received in the current round. Let $\mathbf{x}_j$ denote the vector of values at coordinate $j$ across these $n$ instances. We can now use Lemma 2, where $p(\cdot)=\Pr_j(\cdot|\bar{w})$, $q(\cdot)=\Pr_0(\cdot|\bar{w})$ and $A_i = \{-1,+1\}^n$ (i.e. the vector of values at a single coordinate $i$ across the $n$ data points). The lemma's conditions are satisfied since $\mathbf{x}_i$ for $i\ne j$ has the same distribution under $\Pr_0(\cdot|\bar{w})$ and $\Pr_j(\cdot|\bar{w})$, and also $w$ is only a function of $\mathbf{x}_1\ldots\mathbf{x}_d$ and $\bar{w}$. Thus, we can rewrite Eq. (6) as
$$\frac{1}{d}\sum_{j=1}^{d} D_{kl}\Bigg(\Pr_0(w|\bar{w})\,\Bigg\|\,\sum_{\mathbf{x}_j}\Pr_0(w|\mathbf{x}_j,\bar{w})\,\Pr_j(\mathbf{x}_j|\bar{w})\Bigg).$$

Using Lemma 3, we can reverse the expressions in the relative entropy term, and upper bound the above by
$$\frac{1}{d}\sum_{j=1}^{d}\left(\max_{w}\frac{\Pr_0(w|\bar{w})}{\sum_{\mathbf{x}_j}\Pr_0(w|\mathbf{x}_j,\bar{w})\Pr_j(\mathbf{x}_j|\bar{w})}\right) D_{kl}\Bigg(\sum_{\mathbf{x}_j}\Pr_0(w|\mathbf{x}_j,\bar{w})\Pr_j(\mathbf{x}_j|\bar{w})\,\Bigg\|\,\Pr_0(w|\bar{w})\Bigg). \tag{7}$$
The max term equals

$$\max_{w}\frac{\sum_{\mathbf{x}_j}\Pr_0(w|\mathbf{x}_j,\bar{w})\Pr_0(\mathbf{x}_j|\bar{w})}{\sum_{\mathbf{x}_j}\Pr_0(w|\mathbf{x}_j,\bar{w})\Pr_j(\mathbf{x}_j|\bar{w})} \le \max_{\mathbf{x}_j}\frac{\Pr_0(\mathbf{x}_j|\bar{w})}{\Pr_j(\mathbf{x}_j|\bar{w})},$$
and using Jensen's inequality and the fact that relative entropy is convex in its arguments, we can upper bound the relative entropy term by
$$\sum_{\mathbf{x}_j}\Pr_j(\mathbf{x}_j|\bar{w})\, D_{kl}\big(\Pr_0(w|\mathbf{x}_j,\bar{w})\,\big\|\,\Pr_0(w|\bar{w})\big) \le \left(\max_{\mathbf{x}_j}\frac{\Pr_j(\mathbf{x}_j|\bar{w})}{\Pr_0(\mathbf{x}_j|\bar{w})}\right)\sum_{\mathbf{x}_j}\Pr_0(\mathbf{x}_j|\bar{w})\, D_{kl}\big(\Pr_0(w|\mathbf{x}_j,\bar{w})\,\big\|\,\Pr_0(w|\bar{w})\big).$$


The sum in the expression above equals the mutual information between the message $w$ and the coordinate vector $\mathbf{x}_j$ (seen as random variables with respect to the distribution $\Pr_0(\cdot|\bar{w})$). Writing this as $I_{\Pr_0(\cdot|\bar{w})}(w;\mathbf{x}_j)$, we can thus upper bound Eq. (7) by
$$\frac{1}{d}\sum_{j=1}^{d}\left(\max_{\mathbf{x}_j}\frac{\Pr_0(\mathbf{x}_j|\bar{w})}{\Pr_j(\mathbf{x}_j|\bar{w})}\right)\left(\max_{\mathbf{x}_j}\frac{\Pr_j(\mathbf{x}_j|\bar{w})}{\Pr_0(\mathbf{x}_j|\bar{w})}\right) I_{\Pr_0(\cdot|\bar{w})}(w;\mathbf{x}_j)$$
$$\le \left(\max_{j,\mathbf{x}_j}\frac{\Pr_0(\mathbf{x}_j|\bar{w})}{\Pr_j(\mathbf{x}_j|\bar{w})}\right)\left(\max_{j,\mathbf{x}_j}\frac{\Pr_j(\mathbf{x}_j|\bar{w})}{\Pr_0(\mathbf{x}_j|\bar{w})}\right)\frac{1}{d}\sum_{j=1}^{d} I_{\Pr_0(\cdot|\bar{w})}(w;\mathbf{x}_j).$$
Since $\{\mathbf{x}_j\}_j$ are independent of each other and $w$ contains at most $b$ bits, we can use the key Lemma 5 to upper bound the above by
$$\left(\max_{j,\mathbf{x}_j}\frac{\Pr_0(\mathbf{x}_j|\bar{w})}{\Pr_j(\mathbf{x}_j|\bar{w})}\right)\left(\max_{j,\mathbf{x}_j}\frac{\Pr_j(\mathbf{x}_j|\bar{w})}{\Pr_0(\mathbf{x}_j|\bar{w})}\right)\frac{b}{d}.$$
Now, recall that for any $j$, $\mathbf{x}_j$ refers to a column of $n$ i.i.d. entries, drawn independently of any previous messages $\bar{w}$, where under $\Pr_0$ each entry is chosen to be $\pm 1$ with equal probability, whereas under $\Pr_j$ each is chosen to be $1$ with probability $\frac{1}{2}+\rho$, and $-1$ with probability $\frac{1}{2}-\rho$. Therefore, we can upper bound the expression above by
$$\left(\max\left\{\left(\frac{1/2+\rho}{1/2}\right)^{n},\left(\frac{1/2}{1/2-\rho}\right)^{n}\right\}\right)^{2}\frac{b}{d}.$$
Since we assume $\rho \le 1/4n$, it's easy to verify that the expression above is at most
$$\big((1+4\rho)^{n}\big)^{2}\,\frac{b}{d} \le \big((1+1/n)^{n}\big)^{2}\,\frac{b}{d} \le \exp(2)\,\frac{b}{d}.$$

To summarize, this constitutes an upper bound on Eq. (7), which in turn is an upper bound on Eq. (6), i.e. on any individual term inside the expectation in Eq. (5). Thus, we can upper bound Eq. (5) by $2\exp(2)mb/d < 15mb/d$. This shows that Eq. (4) indeed holds, which as explained earlier implies the required result.

A.4 Proof of Thm. 2

On top of the distributions $\Pr_j(\cdot)$ defined in the hide-and-seek problem (Definition 4), we define an additional 'reference' distribution $\Pr_0(\cdot)$, which corresponds to the instances being chosen uniformly at random from $\{-\mathbf{e}_i,+\mathbf{e}_i\}_{i=1}^{d}$ (i.e. there is no biased coordinate).

We start by proving the upper bound. Any individual entry in the instances has magnitude at most $1$ and $\mathbb{E}[x_j^2] \le \frac{1}{d}$ with respect to any $\Pr_J(\cdot)$. Applying Bernstein's inequality, the probability of its empirical average deviating from its mean by at least $\rho/d$ is at most
$$2\exp\Bigg(-\frac{mn(\rho/d)^{2}}{\frac{2}{d}+\frac{2}{3}\cdot\frac{\rho}{d}}\Bigg) = 2\exp\Bigg(-\frac{mn\rho^{2}}{\big(2+\frac{2}{3}\rho\big)d}\Bigg).$$


Since $\rho \le 1/27$, this can be upper bounded by $2\exp(-mn\rho^2/(3d))$. Thus, by a union bound, with probability at least $1-2d\exp(-mn\rho^2/(3d))$, the empirical averages of all coordinates will deviate from their respective means by less than $\rho/d$. If this occurs, we can detect $J$ simply by computing the empirical averages and picking the estimate $\hat{J}$ to be the coordinate with the highest empirical average.
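In other words (a restatement that is not in the original text), the unconstrained detector succeeds with probability at least $1-\delta$ as soon as
$$2d\exp\Big(-\frac{mn\rho^{2}}{3d}\Big) \le \delta \quad\Longleftrightarrow\quad mn \;\ge\; \frac{3d}{\rho^{2}}\,\log\Big(\frac{2d}{\delta}\Big).$$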

The proof of the lower bound is quite similar to the lower bound proof of Thm. 1, butwith a few more technical intricacies.

Let $w^1,\ldots,w^m$ denote the messages computed by the protocol. To show the lower bound, it is enough to prove that
$$\frac{2}{d}\sum_{j=1}^{d} D_{kl}\big(\Pr_0(w^1\ldots w^m)\,\big\|\,\Pr_j(w^1\ldots w^m)\big) \le \frac{26mb}{d}, \tag{8}$$
since then by applying Lemma 1, we get that for some $j$, $\Pr_j(\hat{J}=j) \le (3/d)+2\sqrt{26mb/d} < (3/d)+11\sqrt{mb/d}$ as required.

Using the chain rule, the left hand side in Eq. (8) equals

$$\frac{2}{d}\sum_{j=1}^{d}\sum_{t=1}^{m}\mathbb{E}_{w^1\ldots w^{t-1}\sim \Pr_0}\Big[D_{kl}\big(\Pr_0(w^t|w^1\ldots w^{t-1})\,\big\|\,\Pr_j(w^t|w^1\ldots w^{t-1})\big)\Big]$$
$$= 2\sum_{t=1}^{m}\mathbb{E}_{w^1\ldots w^{t-1}\sim \Pr_0}\Bigg[\frac{1}{d}\sum_{j=1}^{d} D_{kl}\big(\Pr_0(w^t|w^1\ldots w^{t-1})\,\big\|\,\Pr_j(w^t|w^1\ldots w^{t-1})\big)\Bigg]. \tag{9}$$

Let us focus on a particular choice of $t$ and values $w^1\ldots w^{t-1}$. To simplify the presentation, we drop the $t$ superscript from the message $w^t$, and denote the previous messages $w^1\ldots w^{t-1}$ collectively as $\bar{w}$. Thus, we consider the quantity
$$\frac{1}{d}\sum_{j=1}^{d} D_{kl}\big(\Pr_0(w|\bar{w})\,\big\|\,\Pr_j(w|\bar{w})\big). \tag{10}$$
Recall that $w$ is some function of $\bar{w}$ and a set of $n$ instances received in the current round. Moreover, each instance is non-zero at a single coordinate, with a value in $\{-1,+1\}$. Thus, given an ordered sequence of $n$ instances, we can uniquely specify them using vectors $\mathbf{e},\mathbf{x}_1,\ldots,\mathbf{x}_d$, where

• $\mathbf{e}\in\{1\ldots d\}^n$, where each $e_i$ indicates what is the non-zero coordinate of the $i$-th instance.

• Each $\mathbf{x}_j\in\{-1,+1\}^{|\{i:e_i=j\}|}$ is a (possibly empty) vector of the non-zero values, when those values fell in coordinate $j$.

For example, if $d=3$ and the instances are $(-1,0,0); (0,1,0); (0,-1,0)$, then $\mathbf{e}=(1,2,2)$, $\mathbf{x}_1=(-1)$, $\mathbf{x}_2=(1,-1)$, $\mathbf{x}_3=\emptyset$. Note that under both $\Pr_0(\cdot)$ and $\Pr_j(\cdot)$, $\mathbf{e}$ is uniformly distributed in $\{1\ldots d\}^n$, and $\{\mathbf{x}_j\}_j$ are mutually independent conditioned on $\mathbf{e}$.
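For concreteness, here is a minimal sketch (not from the paper) of this encoding for 1-sparse instances; the function and variable names are illustrative.

\begin{verbatim}
def encode_sparse_instances(instances, d):
    """Encode n instances, each non-zero at a single coordinate with value +-1, as
    (e, x): e[i] is the (1-based) non-zero coordinate of instance i, and x[j] lists
    the non-zero values that fell in coordinate j (possibly an empty list)."""
    e, x = [], {j: [] for j in range(1, d + 1)}
    for inst in instances:
        j = next(k for k, v in enumerate(inst, start=1) if v != 0)
        e.append(j)
        x[j].append(inst[j - 1])
    return e, x

# Reproduces the worked example above (d = 3):
e, x = encode_sparse_instances([(-1, 0, 0), (0, 1, 0), (0, -1, 0)], d=3)
print(e, x)  # [1, 2, 2] {1: [-1], 2: [1, -1], 3: []}
\end{verbatim}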


With this notation, we can rewrite Eq. (10) as

$$\frac{1}{d}\sum_{j=1}^{d} D_{kl}\Big(\mathbb{E}_{\mathbf{e}}\big[\Pr_0(w|\mathbf{e},\bar{w})\big]\,\Big\|\,\mathbb{E}_{\mathbf{e}}\big[\Pr_j(w|\mathbf{e},\bar{w})\big]\Big).$$
Since the relative entropy is jointly convex in its arguments, we have by Jensen's inequality that this is at most
$$\frac{1}{d}\sum_{j=1}^{d}\mathbb{E}_{\mathbf{e}}\Big[D_{kl}\big(\Pr_0(w|\mathbf{e},\bar{w})\,\big\|\,\Pr_j(w|\mathbf{e},\bar{w})\big)\Big] = \mathbb{E}_{\mathbf{e}}\Bigg[\frac{1}{d}\sum_{j=1}^{d} D_{kl}\big(\Pr_0(w|\mathbf{e},\bar{w})\,\big\|\,\Pr_j(w|\mathbf{e},\bar{w})\big)\Bigg]. \tag{11}$$
We now decompose $\Pr_j(w|\mathbf{e},\bar{w})$ using Lemma 2, where $p(\cdot)=\Pr_j(\cdot|\mathbf{e},\bar{w})$, $q(\cdot)=\Pr_0(\cdot|\mathbf{e},\bar{w})$ and each $z_i$ is $\mathbf{x}_i$. The lemma's conditions are satisfied since the distribution of $\mathbf{x}_i$, $i\ne j$, is the same under $\Pr_0(\cdot|\mathbf{e},\bar{w})$ and $\Pr_j(\cdot|\mathbf{e},\bar{w})$, and also $w$ is only a function of $\mathbf{e},\mathbf{x}_1\ldots\mathbf{x}_d$ and $\bar{w}$. Thus, we can rewrite Eq. (11) as
$$\mathbb{E}_{\mathbf{e}}\Bigg[\frac{1}{d}\sum_{j=1}^{d} D_{kl}\Bigg(\Pr_0(w|\mathbf{e},\bar{w})\,\Bigg\|\,\sum_{\mathbf{x}_j}\Pr_j(\mathbf{x}_j|\mathbf{e},\bar{w})\,\Pr_0(w|\mathbf{e},\mathbf{x}_j,\bar{w})\Bigg)\Bigg].$$
Using Lemma 3, we can reverse the expressions in the relative entropy term, and upper bound the above by

$$\mathbb{E}_{\mathbf{e}}\Bigg[\frac{1}{d}\sum_{j=1}^{d}\left(\max_{w}\frac{\Pr_0(w|\mathbf{e},\bar{w})}{\sum_{\mathbf{x}_j}\Pr_j(\mathbf{x}_j|\mathbf{e},\bar{w})\Pr_0(w|\mathbf{e},\mathbf{x}_j,\bar{w})}\right) D_{kl}\Bigg(\sum_{\mathbf{x}_j}\Pr_j(\mathbf{x}_j|\mathbf{e},\bar{w})\Pr_0(w|\mathbf{e},\mathbf{x}_j,\bar{w})\,\Bigg\|\,\Pr_0(w|\mathbf{e},\bar{w})\Bigg)\Bigg]. \tag{12}$$

The max term equals

$$\max_{w}\frac{\sum_{\mathbf{x}_j}\Pr_0(\mathbf{x}_j|\mathbf{e},\bar{w})\Pr_0(w|\mathbf{e},\mathbf{x}_j,\bar{w})}{\sum_{\mathbf{x}_j}\Pr_j(\mathbf{x}_j|\mathbf{e},\bar{w})\Pr_0(w|\mathbf{e},\mathbf{x}_j,\bar{w})} \le \max_{\mathbf{x}_j}\frac{\Pr_0(\mathbf{x}_j|\mathbf{e},\bar{w})}{\Pr_j(\mathbf{x}_j|\mathbf{e},\bar{w})},$$
and using Jensen's inequality and the fact that relative entropy is convex in its arguments, we can upper bound the relative entropy term by
$$\sum_{\mathbf{x}_j}\Pr_j(\mathbf{x}_j|\mathbf{e},\bar{w})\, D_{kl}\big(\Pr_0(w|\mathbf{e},\mathbf{x}_j,\bar{w})\,\big\|\,\Pr_0(w|\mathbf{e},\bar{w})\big) \le \left(\max_{\mathbf{x}_j}\frac{\Pr_j(\mathbf{x}_j|\mathbf{e},\bar{w})}{\Pr_0(\mathbf{x}_j|\mathbf{e},\bar{w})}\right)\sum_{\mathbf{x}_j}\Pr_0(\mathbf{x}_j|\mathbf{e},\bar{w})\, D_{kl}\big(\Pr_0(w|\mathbf{e},\mathbf{x}_j,\bar{w})\,\big\|\,\Pr_0(w|\mathbf{e},\bar{w})\big).$$


The sum in the expression above equals the mutual information between the message $w$ and the coordinate vector $\mathbf{x}_j$ (seen as random variables with respect to the distribution $\Pr_0(\cdot|\mathbf{e},\bar{w})$). Writing this as $I_{\Pr_0(\cdot|\mathbf{e},\bar{w})}(w;\mathbf{x}_j)$, we can thus upper bound Eq. (12) by
$$\mathbb{E}_{\mathbf{e}}\Bigg[\frac{1}{d}\sum_{j=1}^{d}\left(\max_{\mathbf{x}_j}\frac{\Pr_0(\mathbf{x}_j|\mathbf{e},\bar{w})}{\Pr_j(\mathbf{x}_j|\mathbf{e},\bar{w})}\right)\left(\max_{\mathbf{x}_j}\frac{\Pr_j(\mathbf{x}_j|\mathbf{e},\bar{w})}{\Pr_0(\mathbf{x}_j|\mathbf{e},\bar{w})}\right) I_{\Pr_0(\cdot|\mathbf{e},\bar{w})}(w;\mathbf{x}_j)\Bigg]$$
$$\le \mathbb{E}_{\mathbf{e}}\Bigg[\left(\max_{j,\mathbf{x}_j}\frac{\Pr_0(\mathbf{x}_j|\mathbf{e},\bar{w})}{\Pr_j(\mathbf{x}_j|\mathbf{e},\bar{w})}\right)\left(\max_{j,\mathbf{x}_j}\frac{\Pr_j(\mathbf{x}_j|\mathbf{e},\bar{w})}{\Pr_0(\mathbf{x}_j|\mathbf{e},\bar{w})}\right)\frac{1}{d}\sum_{j=1}^{d} I_{\Pr_0(\cdot|\mathbf{e},\bar{w})}(w;\mathbf{x}_j)\Bigg].$$
Since $\mathbf{x}_1\ldots\mathbf{x}_d$ are mutually independent conditioned on $\mathbf{e}$ and $\bar{w}$, and also $w$ contains at most $b$ bits, we can use the key Lemma 5 to upper bound the above by
$$\mathbb{E}_{\mathbf{e}}\Bigg[\left(\max_{j,\mathbf{x}_j}\frac{\Pr_0(\mathbf{x}_j|\mathbf{e},\bar{w})}{\Pr_j(\mathbf{x}_j|\mathbf{e},\bar{w})}\right)\left(\max_{j,\mathbf{x}_j}\frac{\Pr_j(\mathbf{x}_j|\mathbf{e},\bar{w})}{\Pr_0(\mathbf{x}_j|\mathbf{e},\bar{w})}\right)\frac{b}{d}\Bigg].$$

Now, recall that conditioned on $\mathbf{e}$, each $\mathbf{x}_j$ refers to a column of $|\{i:e_i=j\}|$ i.i.d. entries, drawn independently of any previous messages $\bar{w}$, where under $\Pr_0$ each entry is chosen to be $\pm 1$ with equal probability, whereas under $\Pr_j$ each is chosen to be $1$ with probability $\frac{1}{2}+\rho$, and $-1$ with probability $\frac{1}{2}-\rho$. Therefore, we can upper bound the expression above by
$$\mathbb{E}_{\mathbf{e}}\Bigg[\Bigg(\max_{j}\max\Bigg\{\Big(\frac{1/2+\rho}{1/2}\Big)^{|\{i:e_i=j\}|},\Big(\frac{1/2}{1/2-\rho}\Big)^{|\{i:e_i=j\}|}\Bigg\}\Bigg)^{2}\frac{b}{d}\Bigg].$$
Since we assume $\rho \le 1/27$, it's easy to verify that the expression above is at most
$$\mathbb{E}_{\mathbf{e}}\Bigg[\Big(\max_{j}(1+2.2\rho)^{|\{i:e_i=j\}|}\Big)^{2}\frac{b}{d}\Bigg] = \mathbb{E}_{\mathbf{e}}\Bigg[\Bigg(\max_{j}\Big(1+\frac{4.4\rho\,|\{i:e_i=j\}|}{2\,|\{i:e_i=j\}|}\Big)^{2|\{i:e_i=j\}|}\Bigg)\frac{b}{d}\Bigg]$$
$$\le \mathbb{E}_{\mathbf{e}}\Big[\max_{j}\exp\big(4.4\rho\,|\{i:e_i=j\}|\big)\Big]\frac{b}{d} = \mathbb{E}_{\mathbf{e}}\Big[\exp\Big(4.4\rho\,\max_{j}|\{i:e_i=j\}|\Big)\Big]\frac{b}{d}.$$

Since $\mathbf{e}$ is uniformly distributed in $\{1\ldots d\}^n$, $\max_j|\{i: e_i=j\}|$ corresponds to the largest number of balls in a bin when we randomly throw $n$ balls into $d$ bins. By Lemma 4, and since we assume $\rho \le \min\big\{\frac{1}{27},\,\frac{1}{9\log(d)},\,\frac{d}{14n}\big\}$, it holds that the expression above is at most $13b/d$. To summarize, this is a valid upper bound on Eq. (10), i.e. on any individual term inside the expectation in Eq. (9). Thus, we can upper bound Eq. (9) by $26mb/d$. This shows that Eq. (8) indeed holds, which as explained earlier implies the required result.
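As a sanity check (not spelled out in the text), the $13b/d$ bound comes from applying Lemma 4 with $\epsilon = 4.4\rho$, and the three assumed bounds on $\rho$ indeed give the three conditions on $\epsilon$ used in the proof of Lemma 4:
$$\rho\le\tfrac{1}{27}\ \Rightarrow\ 4.4\rho\le\tfrac{4.4}{27}<\tfrac{1}{6},\qquad \rho\le\tfrac{1}{9\log(d)}\ \Rightarrow\ 4.4\rho<\tfrac{1}{2\log(d)},\qquad \rho\le\tfrac{d}{14n}\ \Rightarrow\ 4.4\rho<\tfrac{d}{3n}.$$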

B Proof of Results from Sec. 4

B.1 Proof of Thm. 4

Let us assume by contradiction that we can guarantee $\mathbb{E}\big[\sum_{t=1}^{T}\ell_{t,i_t} - \sum_{t=1}^{T}\ell_{t,j}\big] < 0.01T$.


Consider the set of distributions $\Pr_j(\cdot)$ over $\{0,1\}^d$, where each coordinate is chosen independently and uniformly, except coordinate $j$, which equals $0$ with probability $\frac{1}{2}+\rho$, where $\rho = 1/4$. Clearly, the coordinate $i$ which minimizes $\mathbb{E}[\ell_{t,i}]$ is $j$. Moreover, if at round $t$ the learner chooses some $i_t \ne j$, then $\mathbb{E}[\ell_{t,i_t}-\ell_{t,j}] = \rho = 1/4$. Thus, to have $\mathbb{E}\big[\sum_{t=1}^{T}\ell_{t,i_t} - \sum_{t=1}^{T}\ell_{t,j}\big] < 0.01T$ requires that the expected number of rounds where $i_t \ne j$ is at most $0.04T$. By Markov's inequality, this means that the probability of $j$ not being the most-commonly chosen coordinate is at most $0.04/(1/2) = 0.08$. In other words, if we can guarantee regret smaller than $0.01T$, then we can detect $j$ with probability at least $0.92$.

However, by Thm. 1, for any $(b,1,T)$ protocol, there is some $j$ such that the protocol would correctly detect $j$ with probability at most $\frac{3}{d} + 8\sqrt{\frac{Tb}{d}}$. (Thm. 1 discusses the case where the distribution is over $\{-1,+1\}^d$ and coordinate $j$ has a slight positive bias, but it is easily seen that the lower bound also holds here, where the domain is $\{0,1\}^d$.) Since in the theorem we assume $d \ge 30$ and $Tb/d \le 0.01$, this is at most $0.9$. This contradicts the conclusion from the previous paragraph that we can detect $j$ with probability at least $0.92$. Hence, our initial hypothesis is false and the result follows.

B.2 Proof of Thm. 5

Since $F(\mathbf{w},\mathbf{v}) = \mathbf{w}^{\top}\mathbb{E}[Z]\mathbf{v}$, and $\mathbf{w},\mathbf{v}$ are vectors in the simplex, an optimal solution is always attained at some $\mathbf{w}=\mathbf{e}_{i^*}$, $\mathbf{v}=\mathbf{e}_{j^*}$ where $(i^*,j^*)\in\arg\min_{i,j}\mathbb{E}[Z_{i,j}]$. By Hoeffding's inequality and a union bound argument, it holds with probability $1-\delta$ that
$$\max_{i,j}\left|\frac{1}{m}\sum_{t=1}^{m} Z^{t}_{i,j} - \mathbb{E}\big[Z^{t}_{i,j}\big]\right| \le \sqrt{\frac{2\log(d^2/\delta)}{m}}.$$

If this event occurs, then

$$F(\mathbf{e}_{\hat{I}},\mathbf{e}_{\hat{J}}) = \mathbb{E}[Z_{\hat{I},\hat{J}}] \le \frac{1}{m}\sum_{t=1}^{m} Z^{t}_{\hat{I},\hat{J}} + \sqrt{\frac{\log(d^2/\delta)}{m}} \le \frac{1}{m}\sum_{t=1}^{m} Z^{t}_{i^*,j^*} + \sqrt{\frac{\log(d^2/\delta)}{m}}$$
$$\le \mathbb{E}[Z_{i^*,j^*}] + 2\sqrt{\frac{\log(d^2/\delta)}{m}} = \min_{\mathbf{w},\mathbf{v}} F(\mathbf{w},\mathbf{v}) + 2\sqrt{\frac{\log(d^2/\delta)}{m}},$$

which establishes the upper bound in the theorem statement.

To get the lower bound, consider the case where $Z\in\{-1,+1\}^{d\times d}$ with probability $1$, each of the $d^2$ entries is chosen independently, and $\mathbb{E}[Z]$ is zero except at some coordinate $(i^*,j^*)$ where it equals $-1/4$. Detecting $(i^*,j^*)$ thus reduces to the hide-and-seek problem defined in Definition 3, using $d^2$ coordinates and a bias $\rho=1/4$. Thm. 1 tells us that any $(b,1,m)$ protocol on $d^2$ coordinates can detect $(i^*,j^*)$ with probability at most $3/d^2 + 8\sqrt{mb/d^2}$,

√mb/d2,



which is less than $1/2$ by the theorem assumptions. However, if the protocol returns an estimate $\hat{\mathbf{w}},\hat{\mathbf{v}}$ such that
$$\mathbb{E}[f((\hat{\mathbf{w}},\hat{\mathbf{v}});Z)] - \min_{\mathbf{w},\mathbf{v}}\mathbb{E}[f((\mathbf{w},\mathbf{v});Z)] < \frac{1}{8}, \tag{13}$$
it means that $-\frac{1}{4}\hat{w}_{i^*}\hat{v}_{j^*} - \big(-\frac{1}{4}\big) < \frac{1}{8}$, hence $\hat{w}_{i^*}\hat{v}_{j^*} > \frac{1}{2}$. But since $\hat{\mathbf{w}},\hat{\mathbf{v}}$ are in the simplex, each of their entries is at most $1$, so it must hold that $\min\{\hat{w}_{i^*},\hat{v}_{j^*}\} > \frac{1}{2}$; hence we can detect $i^*,j^*$ by choosing the coordinates on which $\hat{\mathbf{w}},\hat{\mathbf{v}}$ are largest. Therefore, the probability of Eq. (13) occurring must be less than $1/2$, which completes the proof of the lower bound.

B.3 Proof of Thm. 6

The lower bound follows from the concentration of measure assumption on $x_ix_j$, and a union bound, which implies that
$$\Pr\Big(\forall\, i<j,\ \big|\overline{x_ix_j} - \mathbb{E}[x_ix_j]\big| < \frac{\tau}{2}\Big) \ge 1 - \frac{d(d-1)}{2}\cdot 2\exp\big(-m\tau^2/6\big) \ge 1 - d^2\exp\big(-m\tau^2/6\big),$$
where $\overline{x_ix_j}$ denotes the empirical average of $x_ix_j$ over the $m$ instances. If this event occurs, then picking $(\hat{I},\hat{J})$ to be the coordinates with the largest empirical mean would indeed succeed in detecting $(i^*,j^*)$, since $\mathbb{E}[x_{i^*}x_{j^*}] \ge \mathbb{E}[x_ix_j]+\tau$ for all $(i,j)\ne(i^*,j^*)$.

The upper bound in the theorem statement follows from a reduction to the setting discussed in Thm. 2. Let $\{\Pr_{i^*,j^*}(\cdot)\}_{1\le i^*<j^*\le d}$ be a set of distributions over $d$-dimensional vectors $\mathbf{x}$, parameterized by coordinate pairs $(i^*,j^*)$. Each such $\Pr_{i^*,j^*}(\cdot)$ is defined as a distribution over vectors of the form $\sqrt{\frac{d}{2}}\,(\sigma_1\mathbf{e}_i + \sigma_2\mathbf{e}_j)$ in the following way:

• $(i,j)$ is picked uniformly at random from $\{(i,j): 1\le i<j\le d\}$.

• $\sigma_1$ is picked uniformly at random from $\{-1,+1\}$.

• If $(i,j)\ne(i^*,j^*)$, $\sigma_2$ is picked uniformly at random from $\{-1,+1\}$. If $(i,j)=(i^*,j^*)$, then $\sigma_2$ is chosen to equal $\sigma_1$ with probability $\frac{1}{2}+\rho$ (for some $\rho\in(0,1/2)$ to be determined later), and $-\sigma_1$ otherwise.

In words, each instance is a 2-sparse random vector, where the two non-zero coordinates are chosen at random, and are slightly correlated if and only if those coordinates are $(i^*,j^*)$.
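For concreteness, here is a minimal sketch (not from the paper) of a sampler for $\Pr_{i^*,j^*}(\cdot)$ as just described; the use of numpy, the parameter values and all names are illustrative assumptions.

\begin{verbatim}
import numpy as np

rng = np.random.default_rng(0)

def sample_instance(d, i_star, j_star, rho):
    """One draw from Pr_{i*,j*}: a 2-sparse vector sqrt(d/2)*(s1*e_i + s2*e_j), where
    s1, s2 agree with probability 1/2 + rho iff (i, j) = (i*, j*) (0-based indices)."""
    i, j = sorted(rng.choice(d, size=2, replace=False))  # uniform pair with i < j
    s1 = rng.choice([-1.0, 1.0])
    if (i, j) == (i_star, j_star):
        s2 = s1 if rng.random() < 0.5 + rho else -s1
    else:
        s2 = rng.choice([-1.0, 1.0])
    x = np.zeros(d)
    x[i], x[j] = np.sqrt(d / 2) * s1, np.sqrt(d / 2) * s2
    return x

# Empirical moments of many draws roughly match the properties verified next:
X = np.array([sample_instance(d=10, i_star=2, j_star=5, rho=0.1) for _ in range(200000)])
print((X ** 2).mean(axis=0)[:3])   # each close to 1
print((X[:, 2] * X[:, 5]).mean())  # close to 2*rho/(d-1), about 0.022 here
\end{verbatim}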

Let us first verify that any such distribution $\Pr_{i^*,j^*}(\cdot)$ belongs to the distribution family specified in the theorem:

1. For any coordinate $k$, $x_k$ is non-zero with probability $2/d$ (i.e. the probability that either $i$ or $j$ above equals $k$), in which case $x_k^2 = d/2$. Therefore, $\mathbb{E}[x_k^2] = 1$ for all $k$.

2. When $(i,j)\ne(i^*,j^*)$, then $\sigma_1,\sigma_2$ are uncorrelated, hence $\mathbb{E}[x_ix_j] = 0$. On the other hand,
$$\mathbb{E}[x_{i^*}x_{j^*}] = \frac{2}{d(d-1)}\Big(\big(\tfrac{1}{2}+\rho\big)\tfrac{d}{2} + \big(\tfrac{1}{2}-\rho\big)\big(-\tfrac{d}{2}\big)\Big) = \frac{2\rho}{d-1}.$$
So we can take $\tau = \frac{2\rho}{d-1}$, and have that $\mathbb{E}[x_{i^*}x_{j^*}] = \tau$.


3. For any $i<j$, $x_ix_j$ is a random variable which is non-zero with probability $2/(d(d-1))$, in which case its magnitude is $d/2$. Thus, $\mathbb{E}[(x_ix_j)^2] \le \frac{d}{2(d-1)}$. Applying Bernstein's inequality, if $\overline{x_ix_j}$ is the empirical average of $x_ix_j$ over $m$ i.i.d. instances, then
$$\Pr\Big(\big|\overline{x_ix_j} - \mathbb{E}[x_ix_j]\big| \ge \frac{\tau}{2}\Big) \le 2\exp\Bigg(-\frac{m\tau^2}{4\big(\frac{d}{d-1}+\frac{d}{3}\tau\big)}\Bigg).$$
Since we chose $\tau = \frac{2\rho}{d-1} < \frac{1}{d-1}$, and we assume $d\ge 9$, this bound is at most
$$2\exp\Bigg(-\frac{m\tau^2}{4\frac{d}{d-1}\big(1+\frac{1}{3}\big)}\Bigg) \le 2\exp\Big(-\frac{m\tau^2}{6}\Big).$$

Therefore, this distribution satisfies the theorem's conditions.

The crucial observation now is that the problem of detecting $(i^*,j^*)$ can be reduced to a hide-and-seek problem as defined in Definition 4. To see why, let us consider the distribution over $d\times d$ matrices induced by $\mathbf{x}\mathbf{x}^{\top}$, where $\mathbf{x}$ is sampled according to $\Pr_{i^*,j^*}(\cdot)$ as described above, and in particular the distribution on the entries above the main diagonal. It is easily seen to be equivalent to a distribution which picks one entry $(i,j)$ uniformly at random from $\{(i,j): 1\le i<j\le d\}$, and assigns to it the value $-\frac{d}{2}$ or $+\frac{d}{2}$ with equal probability, unless $(i,j)=(i^*,j^*)$, in which case the positive value is picked with probability $\frac{1}{2}+\rho$, and the negative value with probability $\frac{1}{2}-\rho$. This is equivalent to the hide-and-seek problem described in Definition 4, over $\frac{d(d-1)}{2}$ coordinates. Thus, we can apply Thm. 2 for $\frac{d(d-1)}{2}$ coordinates, and get that if $\rho \le \min\Big\{\frac{1}{27},\ \frac{1}{9\log(d(d-1)/2)},\ \frac{d(d-1)}{28n}\Big\}$, then for some $(i^*,j^*)$ and any estimator $(\hat{I},\hat{J})$ returned by a $(b,n,m)$ protocol,
$$\Pr_{i^*,j^*}\Big((\hat{I},\hat{J})=(i^*,j^*)\Big) \le \frac{6}{d(d-1)} + 11\sqrt{\frac{2mb}{d(d-1)}}.$$

Our theorem deals with two types of protocols: $\big(b,\ d(d-1),\ \big\lceil\frac{m}{d(d-1)}\big\rceil\big)$ protocols, and $b$-memory online protocols over $m$ instances. In the former case, we can simply plug in $\big\lceil\frac{m}{d(d-1)}\big\rceil,\ d(d-1)$ instead of $m,n$, while in the latter case we can still replace $m,n$ by $\big\lceil\frac{m}{d(d-1)}\big\rceil,\ d(d-1)$ thanks to Thm. 3. In both cases, doing this replacement and choosing $\rho = \frac{1}{9\log(d(d-1)/2)}$ (which is justified when $d\ge 9$, as we assume), we get that
$$\Pr_{i^*,j^*}\Big((\hat{I},\hat{J})=(i^*,j^*)\Big) \le \frac{6}{d(d-1)} + 11\sqrt{\frac{2b}{d(d-1)}\Big\lceil\frac{m}{d(d-1)}\Big\rceil} \le \mathcal{O}\Bigg(\frac{1}{d^2}+\sqrt{\frac{m}{d^4/b}}\Bigg). \tag{14}$$


This implies the upper bound stated in the theorem; note also that
$$\tau = \frac{2\rho}{d-1} = \frac{2}{9(d-1)\log\big(\frac{d(d-1)}{2}\big)} = \Theta\Big(\frac{1}{d\log(d)}\Big).$$

Having finished with the proof of the theorem as stated, we note that it is possible to extend the construction used here to show performance gaps for other sample sizes $m$. For example, instead of using a distribution supported on
$$\Big\{\sqrt{\tfrac{d}{2}}\,(\sigma_1\mathbf{e}_i+\sigma_2\mathbf{e}_j)\Big\}_{1\le i<j\le d}$$
for any pair of coordinates $1\le i<j\le d$, we can use a distribution supported on
$$\Big\{\sqrt{\tfrac{\lambda}{2}}\,(\sigma_1\mathbf{e}_i+\sigma_2\mathbf{e}_j)\Big\}_{1\le i<j\le\lambda}$$
for some $\lambda\le d$. By choosing the bias $\tau = \Theta(1/(\lambda\log(\lambda)))$, we can show a performance gap (in detecting the correlated coordinates) in the regime $\frac{\lambda^4}{b} \gg m \gg \lambda^2\log^2(\lambda)$. This regime exists for $\lambda$ as small as $\sqrt{b}$ (up to log-factors), in which case we already get performance gaps when $m$ is roughly linear in the memory $b$.
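A quick check of the last claim (not in the original text): the regime is non-empty precisely when
$$\frac{\lambda^{4}}{b} \gg \lambda^{2}\log^{2}(\lambda) \quad\Longleftrightarrow\quad \lambda \gg \sqrt{b}\,\log(\lambda),$$
so taking $\lambda = \tilde{\Theta}(\sqrt{b})$ (hiding log-factors), the gap already appears at sample sizes $m = \tilde{\Theta}(\lambda^{2}\log^{2}(\lambda)) = \tilde{\Theta}(b)$, i.e. when $m$ is roughly linear in the memory budget $b$.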

C Basic Results in Information Theory

The proofs of Thm. 1 and Thm. 2 make extensive use of quantities and basic results from information theory. We briefly review here the technical results relevant for our paper. A more complete introduction may be found in [25]. Following the settings considered in the paper, we will focus only on discrete distributions taking values on a finite set.

Given a random variable $X$ taking values in a domain $\mathcal{X}$, and having a distribution function $p(\cdot)$, we define its entropy as
$$H(X) = \sum_{x\in\mathcal{X}} p(x)\log_2(1/p(x)) = \mathbb{E}_X\Big[\log_2\Big(\frac{1}{p(X)}\Big)\Big].$$
Intuitively, this quantity measures the uncertainty in the value of $X$. This definition can be extended to joint entropy of two (or more) random variables, e.g. $H(X,Y) = \sum_{x,y} p(x,y)\log_2(1/p(x,y))$, and to conditional entropy
$$H(X|Y) = \sum_{y} p(y)\sum_{x} p(x|y)\log_2\Big(\frac{1}{p(x|y)}\Big).$$
For a particular value $y$ of $Y$, we have
$$H(X|Y=y) = \sum_{x} p(x|y)\log_2\Big(\frac{1}{p(x|y)}\Big).$$


It is possible to show that $\sum_{j=1}^{n} H(X_j) \ge H(X_1,\ldots,X_n)$, with equality when $X_1,\ldots,X_n$ are independent. Also, $H(X) \ge H(X|Y)$ (i.e. conditioning can only reduce entropy). Finally, if $X$ is supported on a discrete set of size $2^b$, then $H(X)$ is at most $b$.

Mutual information $I(X;Y)$ between two random variables $X,Y$ is defined as
$$I(X;Y) = H(X)-H(X|Y) = H(Y)-H(Y|X) = \sum_{x,y} p(x,y)\log_2\Big(\frac{p(x,y)}{p(x)p(y)}\Big).$$

Intuitively, this measures the amount of information each variable carries on the other one, or in other words, the reduction in uncertainty on one variable given that we know the other. Since entropy is always non-negative, we immediately get $I(X;Y) \le \min\{H(X),H(Y)\}$. As for entropy, one can define the conditional mutual information between random variables $X,Y$ given some other random variable $Z$ as
$$I(X;Y|Z) = \mathbb{E}_{z\sim Z}\big[I(X;Y|Z=z)\big] = \sum_{z} p(z)\sum_{x,y} p(x,y|z)\log_2\Big(\frac{p(x,y|z)}{p(x|z)p(y|z)}\Big).$$

Finally, we define the relative entropy (or Kullback-Leibler divergence) between two distributions $p,q$ on the same set as
$$D_{kl}(p\|q) = \sum_{x} p(x)\log_2\Big(\frac{p(x)}{q(x)}\Big).$$
It is possible to show that relative entropy is always non-negative, and jointly convex in its two arguments (viewed as vectors in the simplex). It also satisfies the following chain rule:
$$D_{kl}\big(p(x_1\ldots x_n)\,\big\|\,q(x_1\ldots x_n)\big) = \sum_{i=1}^{n}\mathbb{E}_{x_1\ldots x_{i-1}\sim p}\Big[D_{kl}\big(p(x_i|x_1\ldots x_{i-1})\,\big\|\,q(x_i|x_1\ldots x_{i-1})\big)\Big].$$

Also, it is easily verified that
$$I(X;Y) = \sum_{y} p(y)\, D_{kl}\big(p_X(\cdot|y)\,\big\|\,p_X(\cdot)\big),$$
where $p_X$ is the distribution of the random variable $X$. Finally, we will make use of Pinsker's inequality, which upper bounds the so-called total variation distance of two distributions $p,q$ in terms of the relative entropy between them:
$$\sum_{x}|p(x)-q(x)| \le \sqrt{2\,D_{kl}(p\|q)}.$$
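The following short numerical sketch (not from the paper) evaluates these quantities for a toy joint distribution on $\{0,1\}\times\{0,1\}$ and verifies the inequalities quoted above; the particular distribution is an arbitrary illustrative choice, and all logarithms are base 2 as in the definitions.

\begin{verbatim}
import math
from itertools import product

pxy = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.2, (1, 1): 0.3}   # toy joint p(x, y)
px = {x: sum(pxy[(x, y)] for y in (0, 1)) for x in (0, 1)}   # marginal of X
py = {y: sum(pxy[(x, y)] for x in (0, 1)) for y in (0, 1)}   # marginal of Y

def H(dist):
    return -sum(p * math.log2(p) for p in dist.values() if p > 0)

def Dkl(p, q):
    return sum(p[k] * math.log2(p[k] / q[k]) for k in p if p[k] > 0)

# Mutual information from its definition.
I = sum(pxy[(x, y)] * math.log2(pxy[(x, y)] / (px[x] * py[y]))
        for x, y in product((0, 1), repeat=2))

print(I <= min(H(px), H(py)))                                 # I(X;Y) <= min{H(X),H(Y)}
pX_given = {y: {x: pxy[(x, y)] / py[y] for x in (0, 1)} for y in (0, 1)}
print(abs(I - sum(py[y] * Dkl(pX_given[y], px) for y in (0, 1))) < 1e-12)  # identity above
print(sum(abs(px[x] - py[x]) for x in (0, 1)) <= math.sqrt(2 * Dkl(px, py)))  # Pinsker
\end{verbatim}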
