
    On the Fundamental Limits of Adaptive Sensing

Ery Arias-Castro, Emmanuel J. Candès, and Mark A. Davenport

    Abstract

Suppose we can sequentially acquire arbitrary linear measurements of an n-dimensional vector x, resulting in the linear model y = Ax + z, where z represents measurement noise. If the signal is known to be sparse, one would expect the following folk theorem to be true: choosing an adaptive strategy which cleverly selects the next row of A based on what has been previously observed should do far better than a nonadaptive strategy which sets the rows of A ahead of time, thus not trying to learn anything about the signal in between observations. This paper shows that the folk theorem is false. We prove that the advantages offered by clever adaptive strategies and sophisticated estimation procedures, no matter how intractable, over classical compressed acquisition/recovery schemes are, in general, minimal.

Keywords: sparse signal estimation, adaptive sensing, compressed sensing, support recovery, information bounds, hypothesis tests.

    1 Introduction

This paper is concerned with the fundamental question of how well one can estimate a sparse vector from noisy linear measurements in the general situation where one has the flexibility to design those measurements at will (in the language of statistics, one would say that there is nearly complete freedom in designing the experiment). This question is of importance in a variety of sparse signal estimation or sparse regression scenarios, but perhaps arises most naturally in the context of compressive sensing (CS) [3, 4, 9]. In a nutshell, CS asserts that it is possible to reliably acquire sparse signals from just a few linear measurements selected a priori. More specifically, suppose we wish to acquire a sparse signal $x \in \mathbb{R}^n$. A possible CS acquisition protocol would proceed as follows. (i) Pick an $m \times n$ random projection matrix $A$ (the first $m$ rows of a random unitary matrix) in advance, and collect data of the form

    y = Ax + z, (1.1)

where $z$ is a vector of errors modeling the fact that any real world measurement is subject to at least a small amount of noise. (ii) Recover the signal by solving an $\ell_1$ minimization problem such as the Dantzig selector [5] or the LASSO [23]. As is now well known, theoretical results guarantee that such convex programs yield accurate solutions. In particular, when $z = 0$, the recovery is exact, and the error degrades gracefully as the noise level increases.
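To make the acquisition-and-recovery protocol concrete, here is a minimal Python sketch of steps (i) and (ii). Iterative soft-thresholding (ISTA) for the LASSO is used as a simple stand-in for the convex solvers cited in the text ([5], [23]); the dimensions, amplitude, and regularization parameter are illustrative choices of ours, not values taken from the paper.

# Minimal sketch of the nonadaptive CS protocol (1.1): acquire y = Ax + z with a
# random projection chosen in advance, then recover x by sparse regression.
# ISTA for the LASSO is a simple stand-in for the solvers cited in the text;
# all parameter choices below are illustrative.
import numpy as np

rng = np.random.default_rng(0)
n, m, k, sigma = 512, 128, 5, 1.0

# k-sparse signal with positive amplitude mu on a random support
mu = 10.0
x = np.zeros(n)
x[rng.choice(n, k, replace=False)] = mu

# Nonadaptive sensing: m rows of a random orthonormal basis (unit-norm rows)
A = np.linalg.qr(rng.standard_normal((n, m)))[0].T   # m x n, orthonormal rows
y = A @ x + sigma * rng.standard_normal(m)

# ISTA for min_x 0.5*||Ax - y||^2 + lam*||x||_1
lam = sigma * np.sqrt(2 * np.log(n) * m / n)         # rough noise-level choice
t = 1.0 / np.linalg.norm(A, 2) ** 2                  # step size 1 / ||A||^2
x_hat = np.zeros(n)
for _ in range(500):
    v = x_hat - t * (A.T @ (A @ x_hat - y))
    x_hat = np.sign(v) * np.maximum(np.abs(v) - t * lam, 0.0)

print("per-coordinate MSE:", np.mean((x_hat - x) ** 2))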

A remarkable feature of the CS acquisition protocol is that the sensing is completely nonadaptive; that is to say, no effort whatsoever is made to understand the signal. One simply selects a collection $\{a_i\}$ of sensing vectors a priori (the rows of the matrix $A$), and measures correlations between the signal and these vectors. One then uses numerical optimization (e.g., linear programming [5]) to tease out the sparse signal $x$ from the data vector $y$.

Department of Mathematics, University of California, San Diego. Departments of Mathematics and Statistics, Stanford University. Department of Statistics, Stanford University.


While this may make sense when there is no noise, this protocol might draw some severe skepticism in a noisy environment. To see why, note that in the scenario above, most of the power is actually spent measuring the signal at locations where there is no information content, i.e., where the signal vanishes. Specifically, let $a$ be a row of the matrix $A$ which, in the scheme discussed above, has uniform distribution on the unit sphere. The dot product is
$$\langle a, x\rangle = \sum_j a_j x_j,$$
and since most of the coordinates $x_j$ are zero, one might think that most of the power is wasted. Another way to express all of this is that by design, the sensing vectors are approximately orthogonal to the signal, yielding measurements with low signal power or a poor signal-to-noise ratio (SNR).
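As a quick numerical check on this intuition (our illustration, not an experiment from the paper), the following sketch compares the signal energy $\langle a, x\rangle^2$ captured by a sensing vector drawn uniformly from the unit sphere with that captured by a unit-norm vector aligned with the (in practice unknown) support; the gap is a factor of $n$.

# Quick illustration of the "wasted power" point: compare the signal energy
# <a, x>^2 for a sensing vector drawn uniformly from the unit sphere with that of
# a unit-norm vector supported on the k nonzero coordinates of x.
import numpy as np

rng = np.random.default_rng(1)
n, k, mu, trials = 512, 5, 1.0, 20000

support = rng.choice(n, k, replace=False)
x = np.zeros(n)
x[support] = mu

# random directions on the unit sphere
a = rng.standard_normal((trials, n))
a /= np.linalg.norm(a, axis=1, keepdims=True)
random_power = np.mean((a @ x) ** 2)          # E<a,x>^2 = ||x||^2 / n

# unit-norm vector aligned with the support (an oracle choice)
a_oracle = np.zeros(n)
a_oracle[support] = 1.0 / np.sqrt(k)
oracle_power = (a_oracle @ x) ** 2            # = ||x||^2

print(f"random sphere:   {random_power:.4f}  (theory {mu**2 * k / n:.4f})")
print(f"support-aligned: {oracle_power:.4f}  (= ||x||^2 = {mu**2 * k:.1f})")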

The idea behind adaptive sensing is that one should localize the sensing vectors around locations where the signal is nonzero in order to increase the SNR, or equivalently, not waste sensing power. In other words, one should try to learn as much as possible about the signal while acquiring it in order to design more effective subsequent measurements. Roughly speaking, one would (i) detect those entries which are nonzero or significant, (ii) progressively localize the sensing vectors on those entries, and (iii) estimate the signal from such localized linear functionals. This is akin to the game of 20 questions in which the search is narrowed by formulating the next question in a way that depends upon the answers to the previous ones. Note that in some applications, such as in the acquisition of wideband radio frequency signals, aggressive adaptive sensing mechanisms may not be practical because they would require near instantaneous feedback. However, there do exist applications where adaptive sensing is practical and where the potential benefits of adaptivity are too tantalizing to ignore.

    The formidable possibilities offered by adaptive sensing give rise to the following natural folk theorem.

Folk Theorem. The estimation error one can get by using a clever adaptive sensing scheme is far better than what is achievable by a nonadaptive scheme.

In other words, learning about the signal along the way and adapting the questions (the next sensing vectors) to what has been learned to date is bound to help. In stark contrast, the main result of this paper is this:

Surprise. The folk theorem is wrong in general. No matter how clever the adaptive sensing mechanism, no matter how intractable the estimation procedure, in general it is not possible to achieve a fundamentally better mean-squared error (MSE) of estimation than that offered by a naive random projection followed by $\ell_1$ minimization.

The rest of this article is mostly devoted to making this claim precise. In doing so, we shall show that adaptivity also does not help in obtaining a fundamentally better estimate of the signal support, which is of independent interest.

    1.1 Main result

To formalize matters, we assume that the error vector $z$ in (1.1) has i.i.d. $N(0, \sigma^2)$ entries. Then if $A$ is a random projection with unit-norm rows as discussed above, [5] shows that the Dantzig selector estimate $\hat{x}^{\mathrm{DS}}$ (obtained by solving a simple linear program) achieves an MSE obeying
$$\frac{1}{n}\,\mathbb{E}\,\|x - \hat{x}^{\mathrm{DS}}\|_2^2 \;\le\; C \log(n)\,\frac{k\sigma^2}{m}, \qquad (1.2)$$


where $C$ is some numerical constant. The bound holds universally over all $k$-sparse signals¹ provided that the number of measurements $m$ is sufficiently large (on the order of at least $k \log(n/k)$). The fundamental question is thus: how much lower can the mean-squared error be when (i) we are allowed to sense the signal adaptively and (ii) we can use any estimation algorithm we like to recover $x$.

The distinction between adaptive and nonadaptive sensing can be expressed in the following manner. Begin by rewriting the statistical model (1.1) as
$$y_i = \langle a_i, x\rangle + z_i, \qquad i = 1, \ldots, m, \qquad (1.3)$$
in which a power constraint imposes that each $a_i$ is of norm at most 1, i.e., $\|a_i\|_2 \le 1$; then in a nonadaptive sensing scheme the vectors $a_1, \ldots, a_m$ are chosen in advance and do not depend on $x$ or $z$, whereas in an adaptive setting, the measurement vectors may be chosen depending on the history of the sensing process, i.e., $a_i$ is a (possibly random) function of $(a_1, y_1, \ldots, a_{i-1}, y_{i-1})$.
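The distinction can also be phrased programmatically. The sketch below is a hypothetical interface of ours (the function names and signatures are not from the paper): a nonadaptive scheme fixes the whole design up front, while an adaptive scheme supplies a callback that sees only the history $(a_1, y_1, \ldots, a_{i-1}, y_{i-1})$ and must respect the power constraint $\|a_i\|_2 \le 1$.

# Illustrative skeleton (ours) contrasting the two sensing modes in model (1.3):
# y_i = <a_i, x> + z_i with the power constraint ||a_i||_2 <= 1.
import numpy as np

def nonadaptive_sense(x, design, sigma=1.0, rng=None):
    """design: (m, n) array fixed in advance, each row with norm at most 1."""
    rng = rng or np.random.default_rng()
    return design @ x + sigma * rng.standard_normal(design.shape[0])

def adaptive_sense(x, m, next_vector, sigma=1.0, rng=None):
    """next_vector(history) -> a_i; it may only look at past (a_j, y_j) pairs."""
    rng = rng or np.random.default_rng()
    history = []
    for _ in range(m):
        a = next_vector(history)                # chosen from the history only
        a = a / max(1.0, np.linalg.norm(a))     # enforce ||a_i||_2 <= 1
        y = a @ x + sigma * rng.standard_normal()
        history.append((a, y))
    return np.array([y for _, y in history]), history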

If we follow the principle that you cannot get something for nothing, one might argue that giving up the freedom to adaptively select the sensing vectors would result in a far worse MSE. Our main contribution is to show that this is not the case. We prove that the minimax rate of adaptive sensing strategies is within a logarithmic factor of that of nonadaptive schemes.

Theorem 1. For $k, n$ sufficiently large, with $k < n/2$, and any $m$, the following minimax lower bound holds:
$$\inf_{\hat{x}}\ \sup_{x\,:\,k\text{-sparse}}\ \frac{1}{n}\,\mathbb{E}\,\|\hat{x} - x\|_2^2 \;\ge\; \frac{4}{7}\,\frac{k\sigma^2}{m}.$$
For all $k, n$, Section 2 develops a lower bound of the form $c_0\, k\sigma^2/m$. For instance, for any $k, n$ such that $k \le n/4$ with $k \ge 15$, $n \ge 100$, the lower bound holds if $4/7$ is replaced by
$$\frac{1 - 2k/n}{14}.$$

This bound can also be established for the full range of $k$ and $n$ at the cost of a somewhat worse constant. See Theorem 4 for details. In short, Theorem 1 says that if one ignores a logarithmic factor, then adaptive measurement schemes cannot (substantially) outperform nonadaptive strategies. While seemingly counterintuitive, we find that precisely the same sparse vectors which determine the minimax rate in the nonadaptive setting are essentially so difficult to estimate that by the time we have identified the support, we will have already exhausted our measurement budget.

Theorem 1 should not be interpreted as an exotic minimax result. For example, the second statement is a direct consequence of a lower bound on the Bayes risk under the prior that selects $k$ coordinates uniformly at random and sets each of these coordinates to the same positive amplitude $\mu$; see Corollary 1 in Section 2.2. In fact, this paper develops other results of this type with other signal priors.

    1.2 Connections with testing problems

The arguments we develop to reach our conclusions are quite intuitive, simple, and yet they seem different from the classical Fano-type arguments for obtaining information-theoretic lower bounds (see Section 1.3 for a discussion of the latter methods). Our approach involves proving a lower bound for the Bayes risk under a suitable prior. The arguments are simpler if we consider the prior that chooses the support by selecting each coordinate with probability $k/n$. To obtain such a lower bound, we make a detour through testing, multiple testing to be exact. Suppose we take $m$ possibly adaptive measurements of the form (1.3); then we show that estimating the support of the signal is hard, which in turn yields a lower bound on the MSE.

¹A signal is said to be $k$-sparse if it has at most $k$ nonzero components.


Support recovery in Hamming distance. We consider the multiple testing problem of deciding which components of the signal are zero and which are not. We show that no matter which adaptive strategy and tests are used, the Hamming distance between the estimated and true supports is large. Put differently, the multiple testing problem is shown to be difficult. In passing, this establishes that adaptive schemes are not substantially better than nonadaptive schemes for support recovery.

Estimation with mean-squared loss. Any estimator with a low MSE can be converted into an effective support estimator simply by selecting the largest coordinates or those above a certain threshold. Hence, a lower bound on the Hamming distance immediately gives a lower bound on the MSE.

The crux of our argument is thus to show that it is not possible to choose sensing vectors adaptively in such a way that the support of the signal may be estimated accurately. Although our method of proof is nonstandard, it is still based on well-known tools from information and decision theory.

    1.3 Differential entropies and Fano-type arguments

Our approach is significantly different from classical methods for getting lower bounds in decision and information theory. Such methods rely one way or the other on Fano's inequality [8], and are all intimately related to methods in statistical decision theory (see [24, 25]). Before continuing, we would like to point out that Fano-type arguments have been used successfully to obtain (often sharp) lower bounds for some adaptive methods. For example, the work [7] uses results from [24] to establish a bound on the minimax rate for binary classification (see the references therein for additional literature on active learning). Other examples include the recent paper [22], which derives lower bounds for bandit problems, and [20], which develops an information theoretic approach suitable for stochastic optimization, a form of online learning, and gives bounds on the convergence rate at which iterative convex optimization schemes approach a solution.

Following the standard approaches in our setting leads to major obstacles that we would like to briefly describe. Our hope is that this will help the reader to better appreciate our easy itinerary. As usual, we start by choosing a prior for $x$, which we take to have zero mean. Coming from information theory, one would want to bound the mutual information between $x$ (what we want to learn about) and $y$ (the information we have), for any measurement scheme $a_1, \ldots, a_m$. Assuming a deterministic measurement scheme, by the chain rule, we have
$$I(x, y) = h(y) - h(y \,|\, x) = \sum_{i=1}^{m} \bigl[h(y_i \,|\, y_{[i-1]}) - h(y_i \,|\, y_{[i-1]}, x)\bigr], \qquad (1.4)$$
where $y_{[i]} := (y_1, \ldots, y_i)$. Since the history up to time $i-1$ determines $a_i$, the conditional distribution of $y_i$ given $y_{[i-1]}$ and $x$ is then normal with mean $\langle a_i, x\rangle$ and variance $\sigma^2$. Hence, $h(y_i \,|\, y_{[i-1]}, x) = \frac{1}{2}\log(2\pi e \sigma^2)$, which is a constant since we assume $\sigma^2$ to be fixed. This is the easy term to handle; the challenging term is $h(y_i \,|\, y_{[i-1]})$ and it is not clear how one should go about finding a good upper bound. To see this, observe that
$$\mathrm{Var}(y_i \,|\, y_{[i-1]}) = \mathrm{Var}(\langle a_i, x\rangle \,|\, y_{[i-1]}) + \sigma^2.$$
A standard approach to bound $h(y_i \,|\, y_{[i-1]})$ is to write
$$h(y_i \,|\, y_{[i-1]}) \le \tfrac{1}{2}\,\mathbb{E}\log\bigl[2\pi e\bigl(\mathrm{Var}(\langle a_i, x\rangle \,|\, y_{[i-1]}) + \sigma^2\bigr)\bigr],$$
using the fact that the Gaussian distribution maximizes the entropy among distributions with a given variance. If we simplify the problem by applying Jensen's inequality, we get the following bound on the mutual information
$$I(x, y) \le \sum_{i=1}^{m} \frac{1}{2}\log\bigl(\mathbb{E}\langle a_i, x\rangle^2/\sigma^2 + 1\bigr). \qquad (1.5)$$


The RHS needs to be bounded uniformly over all choices of measurement schemes, which is a daunting task given that, for $i \ge 2$, $a_i$ is a function of $y_{[i-1]}$.

Note that we have chosen to present the problem in this form because information theorists will find an analogy with the problem of understanding the role of feedback in a Gaussian channel [8]. Specifically, we can view the inner products $\langle a_i, x\rangle$ as inputs to a Gaussian channel where we get to observe the output of the channel via feedback. It is well-known that feedback does not substantially increase the capacity of a Gaussian channel, so one might expect this argument to be relevant to our problem as well. Crucially, however, in the case of a Gaussian channel the user has full control over the channel input, whereas in the absence of a priori knowledge of $x$, in our problem we are much more restricted in our control over the channel input $\langle a_i, x\rangle$.

More amenable to computations is the approach described in [24, Th. 2.5], which is closely related to the variant of Fano's inequality presented in [25]. This approach calls for bounding the average Kullback-Leibler divergence with respect to a reference distribution, which in our case leads to bounding
$$\sum_{i=1}^{m} \mathbb{E}\langle a_i, x\rangle^2,$$
again, over all measurement schemes. We do not see any easy way to do this. We note however that the latter approach is fruitful in the nonadaptive setting; see [2], and also [1, 21] for other asymptotic results in this direction.

    1.4 Can adaptivity sometimes help?

On the one hand, our main result states that one cannot universally improve on bounds achievable via nonadaptive sensing strategies. Indeed, there are classes of $k$-sparse signals for which using an adaptive or a nonadaptive strategy does not really matter. As mentioned earlier, these classes are constructed in such a way that even after applying the most clever sensing scheme and the most subtle testing procedure, one would still not be sure about where the signal lies. This remains true even after having used up the entirety of our measurement budget. On the other hand, our result does not say that adaptive sensing never helps. In fact, there are many instances in which it will. For example, when some or most of the nonzero entries in $x$ are sufficiently large, they may be detected sufficiently early so that one can ultimately get a far better MSE than what would be obtained via a nonadaptive scheme; see Section 3 for simple experiments in this direction and Section 4 for further discussion.

    1.5 Connections with other works

A number of papers have studied the advantages (or sometimes the lack thereof) offered by adaptive sensing in the setting where one has noiseless data; see for example [9, 13, 19] and references therein. Of course, it is well known that one can uniquely determine a $k$-sparse vector from $2k$ linear nonadaptive noise-free measurements and, therefore, there is not much to dwell on. The aforementioned works of course do not study such a trivial problem. Rather, the point of view is that the signal is not exactly sparse, only approximately sparse, and the question is thus whether one can get a lower approximation error by employing an adaptive scheme. Whereas we study a statistical problem, this is a question in approximation theory. Consequently, the techniques and results of this line of research have no bearing on our problem.

There is much research suggesting intelligent adaptive sensing strategies in the presence of noise and we mention a few of these works. In a setting closely related to ours (that of detecting the locations of the nonzeros of a sparse signal from possibly more measurements than the signal dimension $n$), [12] shows that by adaptively allocating sensing resources one can significantly improve upon the best nonadaptive schemes [10].


Closer to home, [11] considers a CS sampling algorithm (with $m < n$) which performs sequential subset selection via the random projections typical of CS, but which focuses in on promising areas of the signal. When the signal (i) is very sparse, (ii) has sufficiently large entries, and (iii) has constant dynamic range, the method is able to remove a logarithmic factor from the MSE achieved by the Dantzig selector with (nonadaptive) i.i.d. Gaussian measurements. In a different direction, [6, 15] suggest Bayesian approaches where the measurement vectors are sequentially chosen so as to maximize the conditional differential entropy of $y_i$ given $y_{[i-1]}$. Finally, another approach in [14] suggests a bisection method based on repeated measurements for the detection of 1-sparse vectors, subsequently extended to $k$-sparse vectors via hashing. None of these works, however, establishes a lower bound on the MSE of the recovered signal.

    1.6 Content

We prove all of our results in Section 2, trying to give as much insight as possible as to why adaptive methods are not much more powerful than nonadaptive ones for detecting the support of a sparse signal. We will also attempt to describe the regime in which adaptivity might be helpful via simple numerical simulations in Section 3. These simulations show that adaptive algorithms are subject to a fundamental phase transition phenomenon. Finally, we shall offer some comments on open problems and future directions of research in Section 4.

    2 Limits of Adaptive Sensing Strategies

This section establishes nonasymptotic lower bounds for the estimation of a sparse vector from adaptively selected noisy linear measurements. To begin with, we remind ourselves that we collect possibly adaptive measurements of the form (1.3) of an $n$-dimensional signal $x$; from now on, we assume for simplicity and without loss of generality that $\sigma = 1$. Remember that $y_{[i]} = (y_1, \ldots, y_i)$, which gathers all the information available after taking $i$ measurements, and let $\mathbb{P}_x$ be the distribution of these measurements when the target vector is $x$. Without loss of generality, we consider a deterministic measurement scheme, meaning that $a_1$ is a deterministic vector and, for $i \ge 2$, $a_i$ is a deterministic function of $y_{[i-1]}$. In this case, using the fact that $y_i$ is conditionally independent of $y_{[i-1]}$ given $a_i$, the likelihood factorizes as
$$\mathbb{P}_x(y_{[m]}) = \prod_{i=1}^{m} \mathbb{P}_x(y_i \,|\, a_i). \qquad (2.1)$$

We denote the total-variation metric between any two probability distributions $\mathbb{P}$ and $\mathbb{Q}$ by $\|\mathbb{P} - \mathbb{Q}\|_{\mathrm{TV}}$, and their KL divergence by $K(\mathbb{P}, \mathbb{Q})$ [18]. Our arguments will make use of Pinsker's inequality, which relates these two quantities via
$$\|\mathbb{P} - \mathbb{Q}\|_{\mathrm{TV}} \le \sqrt{K(\mathbb{Q}, \mathbb{P})/2}. \qquad (2.2)$$
We shall also use the convexity of the KL divergence, which states that for $\lambda_i \ge 0$ and $\sum_i \lambda_i = 1$, we have
$$K\Bigl(\sum_i \lambda_i \mathbb{P}_i, \sum_i \lambda_i \mathbb{Q}_i\Bigr) \le \sum_i \lambda_i\, K(\mathbb{P}_i, \mathbb{Q}_i), \qquad (2.3)$$
in which $\{\mathbb{P}_i\}$ and $\{\mathbb{Q}_i\}$ are families of probability distributions.

    2.1 The Bernoulli prior

For pedagogical reasons, it is best to study a simpler model in which our argument is most transparent. The ideas to prove our main theorem, namely Theorem 1, are essentially the same. The only real differences are secondary technicalities.


In this model, we suppose that $x \in \mathbb{R}^n$ is sampled from a product prior $\pi$: for each $j \in \{1, \ldots, n\}$,
$$x_j = \begin{cases} 0 & \text{w.p. } 1 - k/n \\ \mu & \text{w.p. } k/n \end{cases}$$
and the $x_j$'s are independent. In this model, $x$ has on average $k$ nonzero entries, all with known positive amplitudes equal to $\mu$. This model is easier to study than the related model in which one selects $k$ coordinates uniformly at random and sets those to $\mu$. The reason is that in this Bernoulli model, the independence between the coordinates of $x$ brings welcomed simplifications, as we shall see.
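For concreteness, here is a tiny sketch (ours, not the paper's) of sampling from the two priors being contrasted here: the Bernoulli product prior and the uniform prior over exactly $k$-sparse vectors treated in Section 2.2.

# Sampling from the two signal priors discussed in the text (illustrative code).
import numpy as np

def sample_bernoulli_prior(n, k, mu, rng):
    # each x_j equals mu independently with probability k/n, so E[#nonzeros] = k
    return mu * (rng.random(n) < k / n)

def sample_uniform_prior(n, k, mu, rng):
    # exactly k coordinates, chosen uniformly at random, are set to mu
    x = np.zeros(n)
    x[rng.choice(n, size=k, replace=False)] = mu
    return x

rng = np.random.default_rng(0)
x_b = sample_bernoulli_prior(512, 8, 1.0, rng)   # random number of nonzeros, mean 8
x_u = sample_uniform_prior(512, 8, 1.0, rng)     # exactly 8 nonzeros
print(int(np.count_nonzero(x_b)), int(np.count_nonzero(x_u)))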

Our goal here is to prove a version of Theorem 1 when $x$ is drawn from this prior. We do this in two steps. First, we look at recovering the support of $x$, which is done via a reduction to multiple testing. Second, we show that a lower bound on the error for support recovery implies a lower bound on the MSE, leading to Theorem 3.

    2.1.1 Support recovery in Hamming distance

We would like to understand how well we can estimate the support $S = \{j : x_j \neq 0\}$ of $x$ from the data (1.3), and shall measure performance by means of the expected Hamming distance. Here, the error of a procedure $\hat{S}$ for estimating the support $S$ is defined as
$$\mathbb{E}_x(|\hat{S} \,\triangle\, S|) = \sum_{j=1}^{n} \mathbb{P}_x(\hat{S}_j \neq S_j).$$
Here, $\triangle$ denotes the symmetric difference, and for a subset $S$, $S_j = 1$ if $j \in S$ and equals zero otherwise.

Theorem 2. Set $\theta_0 = \mathbb{P}(x_j = 0) = 1 - k/n$. Then for $k \le n/2$,
$$\mathbb{E}(|\hat{S} \,\triangle\, S|) \ge k\left(1 - \sqrt{\frac{\mu^2 m}{16\,\theta_0\, n}}\right). \qquad (2.4)$$

Hence, if the amplitude $\mu$ of the signal is below $\sqrt{n/m}$, we are essentially guessing at random. For instance, if $\mu = \sqrt{\theta_0\, n/m}$, then $\mathbb{E}_x|\hat{S} \,\triangle\, S| \ge 3k/4$!

Proof. With $\theta_1 = 1 - \theta_0$, we have
$$\sum_j \mathbb{P}(\hat{S}_j \neq S_j) = \sum_j \bigl[\mathbb{P}(\hat{S}_j = 1, S_j = 0) + \mathbb{P}(\hat{S}_j = 0, S_j = 1)\bigr] = \sum_j \bigl[\theta_0\, \mathbb{P}(\hat{S}_j = 1 \,|\, S_j = 0) + \theta_1\, \mathbb{P}(\hat{S}_j = 0 \,|\, S_j = 1)\bigr], \qquad (2.5)$$

so everything reduces to testing $n$ hypotheses $H_{0,j} : x_j = 0$. Fix $j$, and set $\mathbb{P}_{0,j} = \mathbb{P}(\cdot \,|\, x_j = 0)$ and $\mathbb{P}_{1,j} = \mathbb{P}(\cdot \,|\, x_j \neq 0)$. The test with minimum risk is of course the Bayes test rejecting $H_{0,j}$ if and only if
$$\frac{\theta_1\, \mathbb{P}_{1,j}(y_{[m]})}{\theta_0\, \mathbb{P}_{0,j}(y_{[m]})} > 1;$$
that is, if the adjusted likelihood ratio exceeds one; see [17, Pbm. 3.10]. It is easy to see that the risk $B_j$ of this test obeys
$$B_j \ge \min(\theta_0, \theta_1)\left(1 - \tfrac{1}{2}\,\|\mathbb{P}_{1,j} - \mathbb{P}_{0,j}\|_{\mathrm{TV}}\right). \qquad (2.6)$$


We now use Pinsker's inequality and obtain
$$B_j \ge \min(\theta_0, \theta_1)\left(1 - \sqrt{K(\mathbb{P}_{0,j}, \mathbb{P}_{1,j})/8}\right). \qquad (2.7)$$
It remains to find an upper bound on the KL divergence between $\mathbb{P}_{0,j}$ and $\mathbb{P}_{1,j}$.

Set $j = 1$ for concreteness and write $\mathbb{P}_0 = \mathbb{P}_{0,1}$ for short and likewise for the other. Then
$$\mathbb{P}_0(y_{[m]}) = \sum_{x_2, \ldots, x_n} \pi(x_2, \ldots, x_n)\, \mathbb{P}(y_{[m]} \,|\, x_1 = 0, x_2, \ldots, x_n) := \sum_{x'} \pi(x')\, \mathbb{P}_{0,x'},$$
where $x' = (x_2, \ldots, x_n)$ and similarly for $\mathbb{P}_1(y_{[m]})$. The convexity of the KL divergence (2.3) gives
$$K(\mathbb{P}_0, \mathbb{P}_1) \le \sum_{x'} \pi(x')\, K(\mathbb{P}_{0,x'}, \mathbb{P}_{1,x'}). \qquad (2.8)$$
We now calculate this divergence. In order to do this, observe that we have $y_i = \langle a_i, x\rangle + z_i = c_i + z_i$ under $\mathbb{P}_{0,x'}$ while $y_i = a_{i,1}\mu + c_i + z_i$ under $\mathbb{P}_{1,x'}$. This yields
$$K(\mathbb{P}_{0,x'}, \mathbb{P}_{1,x'}) = \mathbb{E}_{0,x'} \log\frac{\mathbb{P}_{0,x'}}{\mathbb{P}_{1,x'}} = \sum_{i=1}^{m} \mathbb{E}_{0,x'}\Bigl[\tfrac{1}{2}(y_i - a_{i,1}\mu - c_i)^2 - \tfrac{1}{2}(y_i - c_i)^2\Bigr] = \sum_{i=1}^{m} \mathbb{E}_{0,x'}\Bigl[-z_i a_{i,1}\mu + (a_{i,1}\mu)^2/2\Bigr] = \frac{\mu^2}{2}\sum_{i=1}^{m} \mathbb{E}_{0,x'}(a_{i,1}^2).$$
The first equality holds by definition, the second follows from (2.1), the third from $y_i = c_i + z_i$ under $\mathbb{P}_{0,x'}$, and the last holds since $z_i$ is independent of $a_{i,1}$ and has zero mean. Using (2.8), we obtain
$$K(\mathbb{P}_0, \mathbb{P}_1) \le \frac{\mu^2}{2}\sum_{i=1}^{m} \mathbb{E}\bigl[a_{i,1}^2 \,\big|\, x_1 = 0\bigr].$$
We shall upper bound the right-hand side via $\mathbb{E}[a_{i,1}^2 \,|\, x_1 = 0] = [\mathbb{P}(x_1 = 0)]^{-1}\, \mathbb{E}[a_{i,1}^2\, \mathbf{1}_{x_1 = 0}] \le \theta_0^{-1}\, \mathbb{E}[a_{i,1}^2]$. Hence, generalizing this to any $j$, we have shown that
$$K(\mathbb{P}_{0,j}, \mathbb{P}_{1,j}) \le \frac{\mu^2}{2\theta_0}\sum_{i=1}^{m} \mathbb{E}(a_{i,j}^2).$$

Returning to (2.5), via (2.7) and the bound above, we obtain
$$\sum_j \mathbb{P}(\hat{S}_j \neq S_j) \ge \min(\theta_0, \theta_1) \sum_j \left(1 - \frac{\mu}{4}\sqrt{\theta_0^{-1}\sum_i \mathbb{E}\bigl(a_{i,j}^2\bigr)}\right).$$
Finally, Cauchy-Schwarz gives
$$\sum_j \sqrt{\sum_i \mathbb{E}\bigl(a_{i,j}^2\bigr)} \le \sqrt{n \sum_{i,j} \mathbb{E}\bigl(a_{i,j}^2\bigr)} \le \sqrt{nm}.$$
When $k \le n/2$, $\min(\theta_0, \theta_1) = \theta_1 = k/n$, and we conclude that
$$\sum_j \mathbb{P}(\hat{S}_j \neq S_j) \ge k\left(1 - \sqrt{\frac{\mu^2 m}{16\,\theta_0\, n}}\right),$$
which establishes the theorem.


    2.1.2 Estimation in mean-squared error

    It is now straightforward to obtain a lower bound on the MSE from Theorem 2.

Theorem 3. Consider the Bernoulli model with $k \le n/2$ and $\mu^2 = \frac{64\,\theta_0}{9}\cdot\frac{n}{m}$. Then any estimate $\hat{x}$ obeys
$$\mathbb{E}\,\mathbb{E}_x\, \frac{1}{n}\,\|\hat{x} - x\|_2^2 \;\ge\; \frac{16\,\theta_0}{27}\cdot\frac{k}{m}.$$
In other words, for $k \ll n$, the MSE is at least about $0.59\, k/m$.

Proof. Let $S$ be the support of $x$ and set $\hat{S} := \{j : |\hat{x}_j| \ge \mu/2\}$. We have
$$\|\hat{x} - x\|_2^2 = \sum_{j \in S} (\hat{x}_j - x_j)^2 + \sum_{j \notin S} \hat{x}_j^2 \;\ge\; \frac{\mu^2}{4}\,|S \setminus \hat{S}| + \frac{\mu^2}{4}\,|\hat{S} \setminus S| = \frac{\mu^2}{4}\,|\hat{S} \,\triangle\, S|$$
and, therefore,
$$\mathbb{E}\,\mathbb{E}_x\, \|\hat{x} - x\|_2^2 \;\ge\; \frac{\mu^2}{4}\,\mathbb{E}\,\mathbb{E}_x\, |\hat{S} \,\triangle\, S| \;\ge\; k\,\frac{\mu^2}{4}\left(1 - \sqrt{\frac{\mu^2 m}{16\,\theta_0\, n}}\right),$$
where the last inequality is from Theorem 2. We then plug in the value of $\mu^2$ to conclude.

    2.2 The uniform prior

We hope to have made clear how our argument above is considerably simpler than a Fano-type argument, at least for the Bernoulli prior. We now wish to prove a similar result for exactly $k$-sparse signals in order to establish Theorem 1. The natural prior in this context is the uniform prior over exactly $k$-sparse vectors with nonzero entries all equal. The main obstacle in following the previous approach is the lack of independence between the coordinates of the sampled signal $x$. We present an approach that bypasses this difficulty and directly relates the risk under the uniform prior to that under the Bernoulli prior.

The uniform prior selects a support $S \subset \{1, \ldots, n\}$ of size $|S| = k$ uniformly at random and sets all the entries indexed by $S$ to $\mu$, and the others to zero; that is, $x_j = \mu$ if $j \in S$, and $x_j = 0$ otherwise. Let $U(k, n)$ denote the Hamming risk for support recovery under the uniform prior (we mean the risk achievable with the best adaptive sensing and testing strategies), and let $B(k, n)$ be the corresponding quantity for the Bernoulli prior with $\theta_1 = k/n$, leaving $\mu$ and $m$ implicit. Since the Bernoulli prior is a binomial mixture of uniform priors, we have
$$B(k, n) = \mathbb{E}\, U(L, n), \qquad L \sim \mathrm{Bin}(n, k/n). \qquad (2.9)$$

To take advantage of (2.9), we establish some order relationships between the $U(k, n)$. We speculate that $U(k, n)$ is monotone increasing in $k$ as long as $k \le n/2$, but do not prove this. The following is sufficient for our purpose.

Lemma 1. For all $k$ and $n$, (i): $U(k, n) \le U(k + 1, n + 1)$, and (ii): $U(k, n) \le U(k, n + 1)$.

Proof. We prove (i) first. Given a method consisting of a sampling scheme and a support estimator for the $(k + 1, n + 1)$ problem from the uniform prior, denoted $(a, \hat{S})$, we obtain a method for the $(k, n)$ problem as follows. Presented with $x \in \mathbb{R}^n$ from the latter, insert a coordinate equal to $\mu$ at random, yielding $\tilde{x} \in \mathbb{R}^{n+1}$. Note that $\tilde{x}$ is effectively drawn from the uniform prior for the $(k + 1, n + 1)$ problem. We then apply the method $(a, \hat{S})$. It might seem that we need to know $x$ to do this, but that is not the case. Indeed, assuming we insert the coordinate in the $j$th position, we have $\langle a, \tilde{x}\rangle = \langle \tilde{a}, x\rangle + a_j\mu$, where $\tilde{a}$ is $a$ with the $j$th coordinate removed; hence, this procedure may be carried out by taking inner products of $x$ with $\tilde{a}$, which we can do, and adding $a_j\mu$.


The resulting method $(\tilde{a}, \tilde{S})$ for the $(k, n)$ problem has Hamming risk no greater than that of $(a, \hat{S})$, itself bounded by $U(k + 1, n + 1)$; at the same time, $(\tilde{a}, \tilde{S})$ has Hamming risk at least as large as $U(k, n)$, since $\|\tilde{a}\|_2 \le \|a\|_2 \le 1$. This yields (i). The proof of (ii) follows the same lines and is actually simpler.

    We also need the following elementary result whose proof is omitted.

Lemma 2. For $p \in (0, 1)$ and $n \in \mathbb{N}$, let $Y_{n,p}$ denote a random variable with distribution $\mathrm{Bin}(n, p)$. Then for any $k$,
$$\mathbb{E}\bigl[Y_{n,p}\,\mathbf{1}\{Y_{n,p} > k\}\bigr] = np\,\mathbb{P}(Y_{n-1,p} > k - 1), \qquad \mathbb{E}\bigl[Y_{n,p}\,\mathbf{1}\{Y_{n,p} < k\}\bigr] \le k\,\mathbb{P}(Y_{n-1,p} < k - 1),$$
and
$$\mathbb{P}(Y_{n,p} \le k) = (1 - p)\,\mathbb{P}(Y_{n-1,p} = k) + \mathbb{P}(Y_{n-1,p} \le k - 1).$$

Combining these two lemmas with (2.9) and the fact that $U(\ell, n) \le \ell$ for all $\ell$, we obtain
$$B(k, n) \le \mathbb{P}(Y_{n,k/n} \le k)\, U(k, n + k) + \mathbb{E}\bigl[Y_{n,k/n}\,\mathbf{1}\{Y_{n,k/n} > k\}\bigr] = \mathbb{P}(Y_{n,k/n} \le k)\, U(k, n + k) + k\,\mathbb{P}(Y_{n-1,k/n} > k - 1).$$
Since
$$\mathbb{P}(Y_{n-1,k/n} > k - 1) = 1 - \mathbb{P}(Y_{n,k/n} \le k) + (1 - k/n)\,\mathbb{P}(Y_{n-1,k/n} = k),$$
we have the lower bound
$$U(k, n + k) \ge k + \frac{B(k, n) - k - k\,\mathbb{P}(Y_{n-1,k/n} = k)}{\mathbb{P}(Y_{n,k/n} \le k)}.$$

Theorem 2 states that $B(k, n) \ge k(1 - \sqrt{\tau_n})$, where $\tau_n := \mu^2 m/(16(n - k))$. Since $\mathbb{P}(Y_{n,k/n} \le k) \ge 1/2$ for all $k, n$ [16], we have:

Theorem 4. Set $\tau_n = \mu^2 m/(16(n - k))$. Then under the uniform prior over $k$-sparse signals in dimension $n$ with $k < n/2$,
$$\mathbb{E}(|\hat{S} \,\triangle\, S|) \ge k\left(1 - \delta_{k,n-k} - 2\sqrt{\tau_{n-k}}\right), \qquad \delta_{k,n} := \frac{\mathbb{P}(Y_{n-1,k/n} = k)}{\mathbb{P}(Y_{n,k/n} \le k)}. \qquad (2.10)$$

Setting $\mu^2 = 16(n - 2k)(1 - \delta_{k,n-k})^2/(9m)$ and following the same argument as in Theorem 3 gives the bound
$$\mathbb{E}\,\mathbb{E}_x\, \frac{1}{n}\,\|\hat{x} - x\|_2^2 \;\ge\; \frac{4\,(1 - \delta_{k,n-k})^3\,(1 - 2k/n)}{27}\cdot\frac{k}{m}.$$
We note that $\delta_{k,n-k} \to 0$ as $k, n \to \infty$ and that $\delta_{k,n-k} < 1/5$, and thus $(1 - \delta_{k,n-k})^3 > (4/5)^3 > 1/2$, for all $k \ge 15$, $n \ge 100$ provided that $k \le n/4$. This leads to a lower bound of the form in Theorem 1 with a constant of $(2/27)(1 - 2k/n) > (1 - 2k/n)/14$.

When $k, n$ are large, this approach leads to a constant worse than that obtained in the Bernoulli case by a factor of 8. For sufficiently large $k, n$, the term $(1 - \delta_{k,n-k}) \approx 1$, but there remains a factor of 4 which arises because of the manner in which we have truncated the summation in our bound on $B(k, n)$ above (which results in the factor of 2 multiplying $\sqrt{\tau_{n-k}}$ in (2.10)). This can be eliminated by truncating on both sides as follows:
$$B(k, n) \le \mathbb{E}\bigl[Y_{n,k/n}\,\mathbf{1}\{Y_{n,k/n} \notin [k - t, k + t]\}\bigr] + \mathbb{P}(k - t \le Y_{n,k/n} \le k + t)\, U(k + t, n + 2t)$$
and proceeding in a similar manner as before. Here, one uses
$$\mathbb{E}\bigl[Y_{n,k/n}\,\mathbf{1}\{Y_{n,k/n} < k - t\}\bigr] \le k\,\mathbb{P}(Y_{n-1,k/n} < k - t - 1)$$


and
$$\mathbb{E}\bigl[Y_{n,k/n}\,\mathbf{1}\{Y_{n,k/n} > k + t\}\bigr] \le k\,\mathbb{P}(Y_{n-1,k/n} > k + t - 1).$$
Choosing $t = t_k = 5\sqrt{k \log(k)}$ and using well-known deviation bounds, we then obtain that
$$B(k, n) \le (1 + o(1))\, U(k + t_k, n + 2t_k) + o(1), \qquad \text{as } k, n \to \infty.$$
Reversing this inequality, using Theorem 2 and following the proof of Theorem 3, we arrive at the following.

Corollary 1. Set $\theta_0 = 1 - k/n$. For the uniform prior and as $k, n \to \infty$, we have
$$\mathbb{E}(|\hat{S} \,\triangle\, S|) \ge k\,(1 - \epsilon_{n,k})\left(1 - \sqrt{\frac{\mu^2 m}{16\,\theta_0\, n}}\right), \qquad \mathbb{E}\,\mathbb{E}_x\, \frac{1}{n}\,\|\hat{x} - x\|_2^2 \;\ge\; \frac{16\,\theta_0}{27}\,(1 - \epsilon_{n,k})\,\frac{k}{m},$$
for some sequence $\epsilon_{n,k} \to 0$.

We see that this finishes the proof of Theorem 1 because the minimax risk over a class of alternatives always upper-bounds an average risk over that same class.

    3 Numerical Experiments

In order to briefly illustrate the implications of the lower bounds in Section 2 and the potential limitations and benefits of adaptivity in general, we include a few simple numerical experiments. To simplify our discussion, we limit ourselves to existing adaptive procedures that aim at consistent support recovery: the adaptive procedure from [6] and the recursive bisection algorithm of [14].

We emphasize that in the case of a generic $k$-sparse signal, there are many possibilities for adaptively estimating the support of the signal. For example, the approach in [11] iteratively rules out indices and could, in principle, proceed until only $k$ candidate indices remain. In contrast, the approaches in [6] and [14] are built upon algorithms for estimating the support of 1-sparse signals. An algorithm for a 1-sparse signal could then be run $k$ times to estimate a $k$-sparse signal as in [6], or used in conjunction with a hashing scheme as in [14]. Since our goal is not to provide a thorough evaluation of the merits of all the different possibilities, but merely to illustrate the general limits of adaptivity, we simplify our discussion and focus exclusively on the simple case of one-sparse signals, i.e., where $k = 1$.

Specifically, in our experiments we will consider the uniform prior on the set of vectors with a single nonzero entry equal to $\mu > 0$ as in Section 2. Since we are focusing only on the case of $k = 1$, the algorithms in [6] and [14] are extremely simple and are shown in Algorithm 1 and Algorithm 2 respectively. Note that in Algorithm 1 the step of updating the posterior distribution $p$ consists of an iterative update rule given in [6] and does not require any a priori knowledge of the signal $x$ or $\mu$. In Algorithm 2, we simplify the recursive bisection algorithm of [14] using the knowledge that $\mu > 0$, which allows us to eliminate the second stage of the algorithm aimed at detecting negative coefficients. Note that this algorithm proceeds through $s_{\max} = \log_2 n$ stages and we must allocate a certain number of measurements to each stage. In our experiments we set $m_s \propto 2^{-s}$, with the constant of proportionality selected to ensure that $\sum_{s=1}^{\log_2 n} m_s \le m$.

    3.1 Evolution of the posterior

We begin by showing the results of a simple simulation that illustrates the behavior of the posterior distribution of $x$ as a function of $\mu$ for both adaptive schemes. Specifically, we assume that $m$ is fixed and collect $m$ measurements using each approach. Given the measurements $y$, we then compute the posterior distribution $p$ using the true prior used to generate the signal, which can be computed using the fact that
$$p_j \propto \exp\Bigl(-\frac{1}{2\sigma^2}\,\|y - \mu A e_j\|_2^2\Bigr), \qquad (3.1)$$


Algorithm 1 Adaptive algorithm from [6]

input: $m \times n$ random matrix $B$ with i.i.d. Rademacher ($\pm 1$ with equal probability) entries.
initialize: $p = \frac{1}{n}(1, \ldots, 1)^{\mathrm{T}}$.
for $i = 1$ to $i = m$ do
    Compute $a_i = (b_{i,1}\sqrt{p_1}, \ldots, b_{i,n}\sqrt{p_n})^{\mathrm{T}}$.
    Observe $y_i = \langle a_i, x\rangle + z_i$.
    Update the posterior distribution $p$ of $x$ given $(a_1, y_1), \ldots, (a_i, y_i)$ using the rule in [6].
end for
output: Estimate for $\mathrm{supp}(x)$ is the index where $p$ attains its maximum value.

Algorithm 2 Recursive bisection algorithm of [14]

input: $m_1, \ldots, m_{s_{\max}}$.
initialize: $J_1^{(1)} = \{1, \ldots, \frac{n}{2}\}$, $J_2^{(1)} = \{\frac{n}{2} + 1, \ldots, n\}$.
for $s = 1$ to $s = s_{\max}$ do
    Construct the $m_s \times n$ matrix $A^{(s)}$ with rows $|J_1^{(s)}|^{-1/2}\,\mathbf{1}_{J_1^{(s)}} - |J_2^{(s)}|^{-1/2}\,\mathbf{1}_{J_2^{(s)}}$.
    Observe $y^{(s)} = A^{(s)} x + z^{(s)}$.
    Compute $w^{(s)} = \sum_{i=1}^{m_s} y_i^{(s)}$.
    Subdivide: Update $J_1^{(s+1)}$ and $J_2^{(s+1)}$ by partitioning $J_1^{(s)}$ if $w^{(s)} \ge 0$ or $J_2^{(s)}$ if $w^{(s)} < 0$.
end for
output: Estimate for $\mathrm{supp}(x)$ is $J_1^{(s_{\max})}$ if $w^{(s_{\max})} \ge 0$, $J_2^{(s_{\max})}$ if $w^{(s_{\max})} < 0$.
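For reference, here is a minimal Python rendering of the bisection idea in Algorithm 2 for the case $k = 1$, $\mu > 0$. The function name, the row normalization, and the geometric allocation of measurements across stages are our own illustrative choices based on the description above, not code from [14].

# Minimal rendering (ours) of the recursive bisection idea in Algorithm 2 for a
# 1-sparse signal with positive amplitude.  Normalization and measurement
# allocation are illustrative choices.
import numpy as np

def bisection_support(x, m, sigma=1.0, rng=None):
    rng = rng or np.random.default_rng()
    n = len(x)
    s_max = int(np.log2(n))
    # allocate the m measurements across stages, decaying geometrically with s
    weights = 2.0 ** -np.arange(1, s_max + 1)
    m_s = np.maximum(1, np.floor(m * weights / weights.sum())).astype(int)
    J1, J2 = np.arange(n // 2), np.arange(n // 2, n)
    for s in range(s_max):
        # each row contrasts the two halves of the current active set
        a = np.zeros(n)
        a[J1] = 1.0 / np.sqrt(len(J1))
        a[J2] = -1.0 / np.sqrt(len(J2))
        y = np.array([a @ x + sigma * rng.standard_normal() for _ in range(m_s[s])])
        w = y.sum()
        J = J1 if w >= 0 else J2            # half that appears to contain the nonzero
        if len(J) == 1:
            return int(J[0])
        J1, J2 = J[: len(J) // 2], J[len(J) // 2:]
    return int(J[0])

For a 1-sparse $x$ with a sufficiently large positive amplitude, this routine returns the index of the nonzero with high probability after roughly $m$ measurements.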

where $\sigma^2$ is the noise variance and $e_j$ denotes the $j$th element of the standard basis. What we expect is that once $\mu$ exceeds a certain threshold (which depends on $m$), the posterior will become highly concentrated on the true support of $x$. To quantify this, we consider the case where $j^*$ denotes the true location of the nonzero element of $x$ and define
$$\tau = \frac{p_{j^*}}{\max_{j \neq j^*} p_j}.$$

Note that when $\tau \ll 1$, we cannot reliably detect the nonzero, but when $\tau \gg 1$ we can.

In Figure 1 we show the results for a few representative values of $m$ (a) when using nonadaptive measurements, i.e., a (normalized) i.i.d. Rademacher random matrix $A$, compared to the results of (b) Algorithm 1, and (c) Algorithm 2. For each value of $m$ and for each value of $\mu$, we acquire $m$ measurements using each approach and compute the posterior $p$ according to (3.1). We then compute the value of $\tau$. We repeat this for 10,000 iterations and plot the median value of $\tau$ for each value of $\mu$ for all three approaches. In our experiments we set $n = 512$ and $\sigma^2 = 1$. We truncate the vertical axis at $10^4$ to ensure that all curves are comparable. We observe that in each case, once $\mu$ exceeds a certain threshold proportional to $\sqrt{n/m}$, the ratio of $p_{j^*}$ to the second largest posterior probability grows exponentially fast. As expected, this occurs for both the nonadaptive and adaptive strategies, with no substantial difference in terms of how large $\mu$ must be before support recovery is assured (although Algorithm 2 seems to improve upon the nonadaptive strategy by a small constant).
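The computation of (3.1) and of the ratio $\tau$ is easy to reproduce; the sketch below (our code, with illustrative parameters) does so for the nonadaptive, normalized i.i.d. Rademacher design used in panel (a).

# Posterior (3.1) and the ratio tau for a nonadaptive Rademacher design
# (illustrative parameters).
import numpy as np

def posterior_and_tau(x, A, mu, sigma, j_star, rng):
    y = A @ x + sigma * rng.standard_normal(A.shape[0])
    # log p_j = -||y - mu * A e_j||^2 / (2 sigma^2) + const,  cf. (3.1)
    resid = y[:, None] - mu * A                   # column j holds y - mu * A e_j
    logp = -np.sum(resid ** 2, axis=0) / (2 * sigma ** 2)
    p = np.exp(logp - logp.max())                 # stabilize before normalizing
    p /= p.sum()
    return p, p[j_star] / np.delete(p, j_star).max()

rng = np.random.default_rng(0)
n, m, sigma, mu = 512, 64, 1.0, 25.0
j_star = int(rng.integers(n))
x = np.zeros(n)
x[j_star] = mu
A = rng.choice([-1.0, 1.0], size=(m, n)) / np.sqrt(n)   # unit-norm Rademacher rows
p, tau = posterior_and_tau(x, A, mu, sigma, j_star, rng)
print("tau =", tau)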

    3.2 MSE performance

We have just observed that for a given number of measurements $m$, there is a critical value of $\mu$ below which we cannot reliably detect the support. In this section we examine the impact of this phenomenon on the resulting MSE of a two-stage procedure that first uses $m_d = pm$ adaptive measurements to detect the location of the nonzero with either Algorithm 1 or Algorithm 2 and then reserves $m_e = (1 - p)m$ measurements to directly estimate the value of the identified coefficient.


Figure 1: Behavior of the posterior distribution as a function of $\mu$ for several values of $m$ ($m = 16$, $32$, and $64$). (a) shows the results for nonadaptive measurements. (b) shows the results for Algorithm 1. (c) shows the results for Algorithm 2. We see that Algorithm 2 is able to detect somewhat weaker signals than Algorithm 1. However, for both cases we observe that once $\mu$ exceeds a certain threshold proportional to $\sqrt{n/m}$, the ratio of $p_{j^*}$ to the second largest posterior probability grows exponentially fast, but that this does not differ substantially from the behavior observed in (a) when using nonadaptive measurements.

It is not hard to show that if we correctly identify the location of the nonzero, then this will result in an MSE of $(m_e n)^{-1} = ((1 - p)mn)^{-1}$. As a point of comparison, if an oracle provided us with the location of the nonzero a priori, we could devote all $m$ measurements to estimating its value, with the best possible MSE being $\frac{1}{mn}$. Thus, if we can correctly detect the nonzero, this procedure will perform within a constant factor of the oracle.

We illustrate the performance of Algorithm 1 and Algorithm 2 in terms of the resulting MSE as a function of the amplitude $\mu$ of the nonzero in Figure 2. In this experiment we set $n = 512$ and $m = 128$ with $p = \frac{1}{2}$ so that $m_d = 64$ and $m_e = 64$. We then compute the average MSE over 100,000 iterations for each value of $\mu$ and for both algorithms. We compare this to a nonadaptive procedure which uses a (normalized) i.i.d. Rademacher matrix followed by orthogonal matching pursuit (OMP). Note that in the worst case the MSE of the adaptive algorithms is comparable to the MSE obtained by the nonadaptive algorithm and exceeds the lower bound in Theorem 1 by only a small constant factor. However, when $\mu$ begins to exceed a critical threshold, the MSE rapidly decays and approaches the optimal value of $\frac{1}{m_e n}$. Note that by reserving a larger fraction of measurements for estimation when $\mu$ is large, we could actually get arbitrarily close to $\frac{1}{mn}$ in the asymptotic regime.
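The two-stage procedure is summarized in the sketch below (our code; parameters illustrative). It reuses the bisection_support routine sketched after Algorithm 2 for the detection stage, and the helper name two_stage_estimate is ours.

# Two-stage procedure: m_d adaptive measurements to locate the nonzero, then m_e
# direct measurements of the identified coordinate (illustrative code).
import numpy as np

def two_stage_estimate(x, m, sigma=1.0, p=0.5, rng=None):
    rng = rng or np.random.default_rng()
    n = len(x)
    m_d = int(p * m)
    m_e = m - m_d
    j_hat = bisection_support(x, m_d, sigma=sigma, rng=rng)   # detection stage
    # estimation stage: m_e measurements of coordinate j_hat with ||a_i||_2 = 1
    e = np.zeros(n)
    e[j_hat] = 1.0
    y = e @ x + sigma * rng.standard_normal(m_e)
    x_hat = np.zeros(n)
    x_hat[j_hat] = y.mean()                                   # variance sigma^2 / m_e
    return x_hat

If the detection stage succeeds, the per-coordinate MSE of this estimate is about $1/(m_e n)$, within a constant factor of the oracle value $1/(mn)$.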

    4 Discussion

The contribution of this paper is to show that even if one has the freedom to choose any adaptive sensing strategy and any estimation procedure, no matter how complicated or computationally intractable, one would not be able to universally improve over a simple nonadaptive strategy that projects the signal onto a lower dimensional space and performs recovery via $\ell_1$ minimization. This negative result should not conceal the fact that adaptivity may help tremendously if the SNR is sufficiently large, as illustrated in Section 3. Hence, we regard the design and analysis of effective adaptive schemes as a subject of important future research. At the methodological level, it seems important to develop adaptive strategies and algorithms for support estimation that are as accurate and as robust as possible. Further, a transition towards practical applications would need to involve engineering hardware that can effectively implement this sort of feedback, an issue which poses all kinds of very concrete challenges. Finally, at the theoretical level, it would be of interest to analyze the phase transition phenomenon we expect to occur in simple Bayesian signal models. For instance, a central question would be how many measurements are required to transition from a nearly flat posterior to one mostly concentrated on the true support.


Figure 2: The performance of Algorithm 1 and Algorithm 2 in the context of a two-stage procedure that first uses $m_d = \frac{m}{2}$ adaptive measurements to detect the location of the nonzero and then uses $m_e = \frac{m}{2}$ measurements to directly estimate the value of the identified coefficient. We show the resulting MSE as a function of the amplitude $\mu$ of the nonzero entry, and compare this to a nonadaptive procedure which uses a (normalized) i.i.d. Rademacher matrix followed by OMP. In the worst case, the MSE of the adaptive algorithms is comparable to the MSE obtained by the nonadaptive algorithm and exceeds the lower bound in Theorem 1 by only a small constant factor. When $\mu$ begins to exceed this critical threshold, the MSE of the adaptive algorithms rapidly decays below that of the nonadaptive algorithm and approaches $\frac{1}{m_e n}$, which is the MSE one would obtain given $m_e$ measurements and a priori knowledge of the support.

    Acknowledgements

E. A.-C. is partially supported by ONR grant N00014-09-1-0258. E. C. is partially supported by NSF via grant CCF-0963835 and the 2006 Waterman Award, by AFOSR under grant FA9550-09-1-0643, and by ONR under grant N00014-09-1-0258. M. D. is supported by NSF grant DMS-1004718.

    References

[1] S. Aeron, V. Saligrama, and M. Zhao. Information theoretic bounds for compressed sensing. IEEE Trans. Inform. Theory, 56(10):5111–5130, 2010.

[2] E. Candès and M. Davenport. How well can we estimate a sparse vector? arXiv preprint arXiv:1104.5246, 2011.

[3] E. Candès, J. Romberg, and T. Tao. Robust uncertainty principles: Exact signal reconstruction from highly incomplete frequency information. IEEE Trans. Inform. Theory, 52(2):489–509, 2006.

[4] E. Candès and T. Tao. Near-optimal signal recovery from random projections: Universal encoding strategies? IEEE Trans. Inform. Theory, 52(12):5406–5425, 2006.

[5] E. Candès and T. Tao. The Dantzig selector: Statistical estimation when p is much larger than n. Ann. Stat., 35(6):2313–2351, 2007.

[6] R. Castro, J. Haupt, R. Nowak, and G. Raz. Finding needles in noisy haystacks. In Proc. IEEE Int. Conf. Acoust., Speech, and Signal Processing (ICASSP), Las Vegas, NV, Apr. 2008.


[7] R. Castro and R. Nowak. Minimax bounds for active learning. IEEE Trans. Inform. Theory, 54(5):2339–2353, 2008.

[8] T. Cover and J. Thomas. Elements of information theory. Wiley-Interscience, Hoboken, NJ, 2006.

[9] D. Donoho. Compressed sensing. IEEE Trans. Inform. Theory, 52(4):1289–1306, 2006.

[10] D. Donoho and J. Jin. Higher criticism for detecting sparse heterogeneous mixtures. Ann. Stat., 32(3):962–994, 2004.

[11] J. Haupt, R. Baraniuk, R. Castro, and R. Nowak. Compressive distilled sensing: Sparse recovery using adaptivity in compressive measurements. In Proc. Asilomar Conf. Signals, Systems, and Computers, Pacific Grove, CA, Oct. 2009.

[12] J. Haupt, R. Castro, and R. Nowak. Distilled sensing: Selective sampling for sparse signal recovery. In Proc. Int. Conf. Art. Intell. Stat. (AISTATS), Clearwater Beach, FL, Apr. 2009.

[13] P. Indyk, E. Price, and D. Woodruff. On the power of adaptivity in sparse recovery. In Proc. IEEE Symp. Found. Comp. Science (FOCS), Palm Springs, CA, Oct. 2011.

[14] M. Iwen and A. Tewfik. Adaptive group testing strategies for target detection and localization in noisy environments. IMA Preprint Series, 2010.

[15] S. Ji, Y. Xue, and L. Carin. Bayesian compressive sensing. IEEE Trans. Signal Processing, 56(6):2346–2356, 2008.

[16] R. Kaas and J. Buhrman. Mean, median and mode in binomial distributions. Statistica Neerlandica, 34(1):13–18, 1980.

[17] E. Lehmann and J. Romano. Testing statistical hypotheses. Springer Texts in Statistics. Springer, New York, 2005.

[18] P. Massart. Concentration inequalities and model selection, volume 1896 of Lecture Notes in Mathematics. Springer, Berlin, 2007.

[19] E. Novak. On the power of adaptation. J. Complexity, 12(3):199–237, 1996.

[20] M. Raginsky and A. Rakhlin. Information complexity of black-box convex optimization: A new look via feedback information theory. In Proc. Allerton Conf. Communication, Control, and Computing, Monticello, IL, Oct. 2009.

[21] G. Raskutti, M. Wainwright, and B. Yu. Minimax rates of estimation for high-dimensional linear regression over $\ell_q$-balls. arXiv preprint arXiv:0910.2042, 2009.

[22] P. Rigollet and A. Zeevi. Nonparametric bandits with covariates. arXiv preprint arXiv:1003.1630, 2010.

[23] R. Tibshirani. Regression shrinkage and selection via the lasso. J. Roy. Statist. Soc. Ser. B, 58(1):267–288, 1996.

[24] A. Tsybakov. Introduction to nonparametric estimation. Springer Series in Statistics. Springer, New York, 2009.

[25] B. Yu. Assouad, Fano, and Le Cam. In Festschrift for Lucien Le Cam, pages 423–435. Springer, New York, 1997.