
Robust Sensing of Low-Rank Matrices with Non-Orthogonal Sparse Decomposition

Johannes Maly1

1Department of Scientific Computing, KU Eichstaett/Ingolstadt, Germany

Abstract

We consider the problem of recovering an unknown low-rank matrix X with (possibly) non-orthogonal, effectively sparse rank-1 decomposition from incomplete and inaccurate measurements y gathered in a linear measurement process A. We propose a variational formulation that lends itself to alternating minimization and whose global minimizers provably approximate X from y up to noise level. Working with a variant of robust injectivity, we derive reconstruction guarantees for various choices of A including subgaussian, Gaussian rank-1, and heavy-tailed measurements. Numerical experiments support the validity of our theoretical considerations.

Keywords: Matrix sensing, Sparse and low-rank reconstruction, Alternating minimization

1 Introduction

In this paper, we treat the reconstruction of effectively sparse, low-rank matrices X ∈ R^{n_1×n_2} from incomplete and inaccurate measurements y = A(X) + η ∈ R^m, where A : R^{n_1×n_2} → R^m represents a linear measurement process and η ∈ R^m models additive noise. This problem, which stems from compressed sensing [11] and related fields, is relevant in several modern applications such as sparse phase retrieval, blind deconvolution of sparse signals, machine learning, and data mining [18, 15, 14, 22, 40, 6].

1.1 Related work I

Recovering low-rank matrices — without additional sparsity constraints — from linear measurements has been well studied in the context of classical compressed sensing, i.e., compressed sensing of vectors [5, 32]. The bar is notably raised when the unknown matrix is assumed to be sparse and of low rank, and both structures shall contribute to reducing the number of measurements m. As Oymak et al. pointed out in [29], a mere linear combination of regularizers for different sparsity structures in general does not allow one to outperform the recovery guarantees of the “best” one of them alone. To further improve recovery, one has to go beyond linear combinations of already known convex regularizers.

A subtle approach to overcome the aforementioned limitations of purely convex methods is to assume a nested structure of the measurement operator A [2, 10]. In this particular scenario, basic solvers for low-rank resp. row-sparse recovery can be applied in two consecutive steps. Although elegant, the nested approach clearly restricts possible choices for A and is of limited practical use.

In contrast, Lee et al. [23] proposed and analyzed the so-called Sparse Power Factorization (SPF), a modified version of Power Factorization [16], without assuming any special structure of A. Power Factorization recovers low-rank matrices by representing them as a product of two orthogonal matrices X = UV^⊤ and then applying alternating minimization over the (de)composing matrices U ∈ C^{n_1×R}, V ∈ C^{n_2×R}. To enforce sparsity of the columns of U and/or V, SPF introduces Hard Thresholding Pursuit into each of the alternating steps. Lee et al. were able to show that, using suitable initializations and assuming the noise level to be sufficiently small, SPF approximates low-rank matrices X that are row- and/or column-sparse from a nearly optimal number of measurements: If X is rank-R, has s_1-sparse columns and s_2-sparse rows, then


m ≳ R(s_1 + s_2) log(max{en_1/s_1, en_2/s_2}) Gaussian measurements suffice for robust recovery, which is up to the log-factor at the information-theoretic limit. Despite its theoretical optimality, the setting of SPF is actually quite restrictive as all columns (resp. rows) need to share a common support and the matrices U, V need to be simultaneously orthogonal. On the one hand, empirically, it has been shown in [23] that SPF outperforms methods based on convex relaxation. On the other hand, SPF is heavily based on the assumption that the operator A possesses a suitable restricted isometry property and cannot be applied to arbitrary inverse problems, as it may even fail to converge otherwise. The reason is that SPF is based on hard-thresholding [3], which is not a Lipschitz continuous map. Let us mention that the authors could extend their analysis of SPF to the measurement set-up of blind deconvolution in [22].

Inspired by recent works on multi-penalty regularization [28, 13], the authors of [9] aimed at enhancing the robustness of recovery by alternating minimization of an ℓ_1-norm based multi-penalty functional. Though not as close to the information-theoretic limit as SPF, the theoretical results therein hold for arbitrarily large noise magnitudes and a wider class of ground-truth matrices than the one considered in [23].

Let us finally point out that a closely related line of work comes from the statistical literature under the name Sparse Principal Component Analysis (SPCA) [40, 6]. In order to defeat the curse of dimensionality when finding principal subspaces of covariance matrices, SPCA admits non-orthogonal subspace decompositions and enforces sparsity on the vectors spanning the respective spaces. However, observations in SPCA are provided as noisy samples of the underlying distribution, whereas in our case the matrix itself is observed indirectly. It is thus hard to directly compare the results of this paper with corresponding results for SPCA.

1.2 Setting

We begin by specifying the problem at hand. Given a linear measurement operator A : R^{n_1×n_2} → R^m and a corrupted vector of measurements

y = A(X) + η = (1/√m) (⟨A_1, X⟩_F, ..., ⟨A_m, X⟩_F)^⊤ + η,    (1)

we wish to estimate the unknown signal matrix X ∈ R^{n_1×n_2}. Here, the vector η ∈ R^m, of which only the ℓ_2-norm is (approximately) known, models additive noise. Note that A is fully determined by the m matrices A_i ∈ R^{n_1×n_2} and that the individual measurements correspond to the Frobenius products ⟨A_i, X⟩_F = tr(A_i X^⊤), for i ∈ [m] := {1, ..., m}.
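
As a concrete illustration (our own sketch, not part of the original text), the measurement process (1) can be implemented in a few lines of NumPy; the function and variable names below are purely illustrative.

import numpy as np

def apply_A(A_list, X):
    # evaluate (1): the i-th measurement is <A_i, X>_F / sqrt(m)
    m = len(A_list)
    return np.array([np.sum(Ai * X) for Ai in A_list]) / np.sqrt(m)

# example: m Gaussian measurement matrices and noisy data y = A(X) + eta
rng = np.random.default_rng(0)
n1, n2, m = 20, 30, 50
A_list = [rng.standard_normal((n1, n2)) for _ in range(m)]
X = rng.standard_normal((n1, n2))
eta = 0.05 * rng.standard_normal(m)
y = apply_A(A_list, X) + eta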

While (1) is ill-posed if m < n_1 n_2, it becomes well-posed if we assume some prior knowledge on X. In the rest of the work, we thus suppose that X is of rank R ≥ 1 and possesses a decomposition of the form

X = UV^⊤,    (2)

where U ∈ R^{n_1×R} and V ∈ R^{n_2×R} are effectively sparse, a useful concept introduced by Plan and Vershynin [30, Section 3].

Definition 1.1 (Effectively sparse vectors). The set of effectively s-sparse vectors of dimension n is defined by

K_{n,s} = {z ∈ R^n : ‖z‖_2 ≤ 1 and ‖z‖_1 ≤ √s}.

Remark 1.2. Note that any unit-norm s-sparse vector is also effectively s-sparse. Effectively sparse vectors are well approximated by sparse vectors, as made precise in [30, Lemma 3.2]. Moreover, the set K_{n,s} may be viewed as a convex hull of the set of s-sparse unit-norm vectors [30, Lemma 3.1].
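
As a small illustration of Definition 1.1 (a sketch of ours, not from the paper), membership in K_{n,s} only requires checking two norms:

import numpy as np

def is_effectively_sparse(z, s):
    # z in K_{n,s}  iff  ||z||_2 <= 1 and ||z||_1 <= sqrt(s)
    return np.linalg.norm(z, 2) <= 1 and np.linalg.norm(z, 1) <= np.sqrt(s)

z = np.zeros(100)
z[:4] = 0.5                           # unit-norm, exactly 4-sparse
print(is_effectively_sparse(z, 5))    # True: also effectively 5-sparse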

To make the initial assumption in (2) precise, we introduce the set

K^R_{s_1,s_2} = { Z = UV^⊤ : U ∈ R^{n_1×R}, V ∈ R^{n_2×R} and vec(U) ∈ √R K_{Rn_1,Rs_1}, vec(V) ∈ √R K_{Rn_2,Rs_2} }

of low-rank matrices with (possibly non-orthogonal) effectively sparse decomposition. From here on we call such a decomposition an SD. The set K^R_{s_1,s_2} can equivalently be written as

K^R_{s_1,s_2} = { Z = UV^⊤ : max{‖U‖_F^2, ‖V‖_F^2} ≤ R and ‖U‖_1 ≤ R√s_1, ‖V‖_1 ≤ R√s_2 },    (3)

where ‖·‖_1 denotes the vector ℓ_1-norm and is by abuse of notation applied to matrices, i.e., ‖Z‖_1 = ∑_{i,j} |Z_{i,j}| for any matrix Z. The SD of Z in (3) is reminiscent of the singular value decomposition (SVD) defined as

Z = UΣV^⊤ = ∑_{r=1}^R σ_r u_r (v_r)^⊤,    (4)

where Σ is a diagonal matrix containing the singular values σ_1 ≥ ... ≥ σ_R > 0, while U ∈ R^{n_1×R} and V ∈ R^{n_2×R} have orthonormal columns, which are called left and right singular vectors. Note, however, that the SD is neither unique nor is the SVD of Z necessarily an SD of Z.

The set K^R_{s_1,s_2} particularly includes all matrices Z = UV^⊤ for which U consists of s_1-sparse unit-norm columns and V consists of s_2-sparse unit-norm columns. The important difference with respect to [23] is that the columns of U and V neither need to be orthogonal nor exactly s-sparse, nor do they have to share a common support. To simplify notation, we will often assume s_1 = s_2 = s in the following and use the shorthand notation K^R_s = K^R_{s,s}. It is straightforward to generalize the respective results to the case s_1 ≠ s_2.

One of the most important features of the class K^R_{s_1,s_2} is that it is to a certain extent closed under summation. In fact, if Z ∈ ΓK^R_{s_1,s_2} and Z̃ ∈ ΓK^R_{s̃_1,s̃_2}, for Γ ≥ 0, then

Z − Z̃ ∈ ΓK^{2R}_{max{s_1,s̃_1}, max{s_2,s̃_2}}.    (5)

We call U (resp. V) the left (resp. right) component of X. Although (2) does not need to be the SVD of X, this case is also covered by our analysis.

1.3 Contribution

We build upon ideas in [9] and reconstruct X by a variational approach. It is based on alternating minimization of the multi-penalty functional J^R_{α,β} : R^{n_1×R} × R^{n_2×R} → R defined, for α_1, α_2, β_1, β_2 > 0, by

J^R_{α,β}(U,V) := ‖y − A(UV^⊤)‖_2^2 + α_1 ‖U‖_F^2 + α_2 ‖U‖_1 + β_1 ‖V‖_F^2 + β_2 ‖V‖_1,    (6)

where α = (α_1, α_2)^⊤, β = (β_1, β_2)^⊤ are regularization parameters and ‖·‖_1 denotes the vector ℓ_1-norm. The functional in (6) can be interpreted as a generalization of the linear-regression-based SPCA approach in [40]. Despite the convex multi-penalty regularization term α_1 ‖U‖_F^2 + α_2 ‖U‖_1 + β_1 ‖V‖_F^2 + β_2 ‖V‖_1, also known as the elastic net [39], the functional (6) is highly non-convex; hence, in light of the negative results in [29], it provides hope for better performance than lifting and convex relaxation. At the same time, one notices that J^R_{α,β} becomes convex when U or V is fixed. We can thus efficiently minimize (6) by alternating schemes, e.g., alternating minimization

U_{k+1} = argmin_U ‖y − A(U V_k^⊤)‖_2^2 + α_1 ‖U‖_F^2 + α_2 ‖U‖_1,
V_{k+1} = argmin_V ‖y − A(U_{k+1} V^⊤)‖_2^2 + β_1 ‖V‖_F^2 + β_2 ‖V‖_1.    (7)
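
For concreteness, the functional (6) is cheap to evaluate once the measurement operator is available as a function. The following NumPy sketch (illustrative only; apply_A stands for any implementation of A acting on a matrix) computes J^R_{α,β}(U, V):

import numpy as np

def J(U, V, y, apply_A, alpha, beta):
    # multi-penalty functional (6): data misfit + elastic-net terms on U and V
    a1, a2 = alpha
    b1, b2 = beta
    r = y - apply_A(U @ V.T)
    return (np.sum(r ** 2)
            + a1 * np.sum(U ** 2) + a2 * np.sum(np.abs(U))
            + b1 * np.sum(V ** 2) + b2 * np.sum(np.abs(V)))

# example with a Gaussian operator in matrix form, A(X) = A_mat @ vec(X) / sqrt(m)
rng = np.random.default_rng(1)
n1, n2, R, m = 10, 15, 2, 40
A_mat = rng.standard_normal((m, n1 * n2))
apply_A = lambda X: A_mat @ X.ravel() / np.sqrt(m)
U0, V0 = rng.standard_normal((n1, R)), rng.standard_normal((n2, R))
y = apply_A(U0 @ V0.T)
print(J(U0, V0, y, apply_A, alpha=(1e-3, 1e-3), beta=(1e-3, 1e-3)))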

In addition to proposing the functional J^R_{α,β} and the signal set K^R_{s_1,s_2}, our main contribution¹ is threefold:

1. Under minimal assumptions on A, we show for any global minimizer (U_{α,β}, V_{α,β}) of (6): if X ∈ ΓK^R_s, for Γ ≥ 1, and (α, β) are suitably chosen, then X_{α,β} = U_{α,β} V_{α,β}^⊤ ∈ ΓK^{4R}_s and A(X_{α,β}) ≈ y, i.e., X_{α,β} is regular and satisfies the measurements up to noise level (Lemmas 2.1 and 2.2).

2. If A, in addition, satisfies a robust injectivity property on ΓK^{5R}_s (Definition 2.3), then X_{α,β} ≈ X up to noise level (Theorem 2.4).

3. Various types of measurement operators A (subgaussian, heavy-tailed, rank-1) satisfy the required robust injectivity for m ≪ n_1 n_2 (Theorems 2.7, 2.10, and 2.14).

Furthermore, we examine two approaches to minimizing (6), for which we show global convergence to stationary points and local convergence to global minimizers, cf. Section 2.4. As for most non-convex methods, convergence to global minimizers and thus the empirical performance of (7) depends on a proper initialization (U_0, V_0). We, however, do not provide such an initialization method but leave it as an open problem for future research.

¹Although the functional in (6) and its predecessor in [9] have some resemblances, there are crucial differences to note: first, the use of the elastic net as regularizer allows us to control both the ℓ_1- and ℓ_2-norms simultaneously. Second, the regularizers are applied to the component matrices U and V as a whole, so that their non-zero entries may be distributed without further restrictions. This stands in contrast to [9], where each column of U and V has been sparsified. When combined, these ideas admit a smoother analysis and tighter control of the effective sparsity level, and result in improved error estimates. Also note at this point the differences between K^R_{s_1,s_2} and the corresponding signal sets in [9].

1.4 Related Work II

Finding a reliable and tractable initialization procedure is challenging. This reflects the intrinsic hardness of the presented problem, a hardness that stems from its deep connection to SPCA, which is known to be NP-hard in general [25]. To the best of our knowledge, all existing approaches to the reconstruction of jointly sparse and low-rank matrices share this impediment in one form or another. The above-mentioned guarantees for SPF, which come with a tractable initialization, are restricted to signals with few dominant entries [23, 12]. The recent work [7], which builds upon ideas on generalized projections in [10] and uses a Riemannian version of Iterative Hard Thresholding to reconstruct jointly row-sparse and low-rank matrices, only provides local convergence results. The alternative approach of using optimally weighted sums or maxima of convex regularizers [19] — the only existing work apart from our predecessor paper [9] that considers non-orthogonal sparse low-rank decompositions — requires optimal tuning of the parameters under knowledge of the ground truth. Whereas one may not hope for a general solution to these problems, the previously mentioned tractable results on nested measurements [2, 10] suggest that suitable initialization methods can be found for specific applications.

1.5 Outline and Notation

The organization of the paper is as follows. Section 2 provides the main results and part of the proofs. The remaining proofs can be found in Section 3. In Section 4, we compare our theoretical findings to actual empirical evidence. We conclude in Section 5 with a discussion on open problems and future work.

We use the shorthand notation [R] := {1, ..., R} to write index sets. The relation a ≳ b is used to express a ≥ Cb for some positive constant C, and a ≃ b stands for a ≳ b and b ≳ a. For a matrix Z ∈ R^{n_1×n_2}, we denote its transpose by Z^⊤. The support of Z, i.e., the index set of the non-zero entries, is denoted by supp(Z). The function vec vectorizes any matrix and vec^{−1} reverses the vectorization. Hence, vec(Z) ∈ R^{n_1n_2} and Z = vec^{−1}(vec(Z)). We denote by ‖Z‖_1 the ℓ_1-norm of vec(Z), by ‖Z‖_F the Frobenius norm of Z (ℓ_2-norm of the vector of singular values), by ‖Z‖_{2→2} the operator norm of Z (top singular value), and by ‖Z‖_∗ the nuclear norm of Z (sum of singular values). The set-valued operator ∂ denotes the limiting Fréchet subdifferential, and dom ∂f = {x ∈ R^n : ∂f(x) ≠ ∅} its domain when applied to a function f : R^n → R ∪ {∞}, cf. [33, 27]. The covering number N(M, ‖·‖, ε) of a set M is the minimal number of ‖·‖-balls of radius ε that are needed to cover the set M. The cardinality of any ε-net M̃ of M, i.e., for all z ∈ M there is z̃ ∈ M̃ with ‖z − z̃‖ < ε, yields an upper bound for N(M, ‖·‖, ε).

2 Main Results

We now state the main results of the paper. In Section 2.1, we show how minimizers of J^R_{α,β} yield, under minimal assumptions, solutions to the inverse problem (1). In Section 2.2, we estimate the approximation error assuming robust injectivity of A on K^R_{s_1,s_2}. We then present in Section 2.3 various types of operators fulfilling robust injectivity and, finally, provide local convergence guarantees for (7) and related methods in Section 2.4.


2.1 General Properties of Minimizers

Let us begin with some basic properties that minimizers of J^R_{α,β} have under fairly general assumptions. For any minimizer (U_{α,β}, V_{α,β}) of J^R_{α,β} we denote

X_{α,β} = U_{α,β} V_{α,β}^⊤.    (8)

The first result bounds the measurement misfit of X_{α,β}.

Lemma 2.1 (Measurement misfit). Assume X with rank(X) ≤ R is generating the noisy measurements y = A(X) + η and let (U_{α,β}, V_{α,β}) be a global minimizer of J^R_{α,β} where α and β are chosen sufficiently small (depending on ‖X‖_F and ‖η‖_2). Then,

‖y − A(X_{α,β})‖_2 ≤ 2‖η‖_2.    (9)

Proof: Let X = UΣV^⊤ denote the SVD of X, where U ∈ R^{n_1×R} and V ∈ R^{n_2×R}. By minimality of X_{α,β}, we get that

‖y − A(X_{α,β})‖_2^2 ≤ J^R_{α,β}(U_{α,β}, V_{α,β}) ≤ J^R_{α,β}(UΣ, V) = ‖η‖_2^2 + α_1 ‖UΣ‖_F^2 + α_2 ‖UΣ‖_1 + β_1 ‖V‖_F^2 + β_2 ‖V‖_1 ≤ 4‖η‖_2^2,

for α_1 ≤ ‖η‖_2^2/(2‖UΣ‖_F^2), α_2 ≤ ‖η‖_2^2/(2‖UΣ‖_1), β_1 ≤ ‖η‖_2^2/(2‖V‖_F^2), and β_2 ≤ ‖η‖_2^2/(2‖V‖_1).

The second result states that if the ratio of the parameters is fixed in a suitable way and X is low-rank with effectively sparse decomposition, then the same holds true for global minimizers of J^R_{α,β} as long as α and β are chosen sufficiently large to avoid overfitting.

Lemma 2.2 (Regularity). Assume X ∈ ΓK^R_s, for Γ > 0, is generating the noisy measurements y = A(X) + η and let (U_{α,β}, V_{α,β}) be a global minimizer of J^R_{α,β} where α_1 = √(s/Γ) α_2 = β_1 = √(s/Γ) β_2. If ‖y − A(X_{α,β})‖_2 ≥ ‖η‖_2, we have that X_{α,β} ∈ ΓK^{4R}_s.

Proof: Let X = UV^⊤ denote an SD of X ∈ ΓK^R_s such that

max{‖U‖_F^2, ‖V‖_F^2} ≤ ΓR and max{‖U‖_1, ‖V‖_1} ≤ R√(Γs)

(note that such an SD always exists by taking an arbitrary SD of (1/Γ)X ∈ K^R_s and multiplying both the left and right components with √Γ). By minimality of X_{α,β}, we get that

‖y − A(X_{α,β})‖_2^2 + α_1 ‖U_{α,β}‖_F^2 + α_2 ‖U_{α,β}‖_1 + β_1 ‖V_{α,β}‖_F^2 + β_2 ‖V_{α,β}‖_1
= J^R_{α,β}(U_{α,β}, V_{α,β})
≤ J^R_{α,β}(U, V)
= ‖η‖_2^2 + α_1 ‖U‖_F^2 + α_2 ‖U‖_1 + β_1 ‖V‖_F^2 + β_2 ‖V‖_1.

Subtracting ‖y − A(X_{α,β})‖_2^2 on both sides and using that by assumption ‖η‖_2 ≤ ‖y − A(X_{α,β})‖_2 and α_1 = √(s/Γ) α_2 = β_1 = √(s/Γ) β_2 leads to

max{‖U_{α,β}‖_F^2, ‖V_{α,β}‖_F^2} ≤ ‖U‖_F^2 + √(Γ/s) ‖U‖_1 + ‖V‖_F^2 + √(Γ/s) ‖V‖_1 ≤ 4RΓ

and

max{‖U_{α,β}‖_1, ‖V_{α,β}‖_1} ≤ √(s/Γ) ‖U‖_F^2 + ‖U‖_1 + √(s/Γ) ‖V‖_F^2 + ‖V‖_1 ≤ 4R√(Γs),

which shows that X_{α,β} ∈ ΓK^{4R}_s.

The assumption ‖y − A(X_{α,β})‖_2 ≥ ‖η‖_2 in Lemma 2.2 is not restrictive. As soon as ‖y − A(X_{α,β})‖_2 = ‖η‖_2, decreasing α and β further becomes undesirable since this would lead to overfitting.

To conclude, we can claim that X_{α,β} is, for a suitable parameter choice and even without any requirements on A, a reasonable approximation of X ∈ ΓK^R_s: it is of rank at most R, fulfills the measurements up to noise level, and has an effectively sparse decomposition of the same order as X. However, the parameters α and β have to be chosen with care, neither too small nor too large. Moreover, Lemma 2.2 shows that α and β should be of similar magnitude. Otherwise either the left or the right components of X_{α,β} cannot be controlled.

2.2 Reconstruction Properties of Minimizers

To obtain proper reconstruction guarantees, we introduce a variant of robust injectivity for linear operators A acting on K^R_{s_1,s_2}. This property may be viewed as a generalization of the rank-R and (s_1, s_2)-sparse RIP of Lee et al. in [23].

Definition 2.3 (Robust injectivity). Let Γ > 0. A linear operator A : R^{n_1×n_2} → R^m satisfies robust injectivity on ΓK^R_{s_1,s_2} with injectivity constants γ ∈ (0, 1] and δ > 0 if

‖A(Z)‖_2^2 ≥ γ ‖Z‖_F^2 − δ    (10)

for all Z ∈ ΓK^R_{s_1,s_2}.

The requirement in (10) is weak when compared to restricted isometry properties, see [11]. It only lower bounds the contractivity of A when applied to matrices in ΓK^R_{s_1,s_2}. For further discussion on the relation between (10) and restricted isometry properties, see Remark 2.9 below.

We are ready to state the main reconstruction result: If A is injective in the sense of Definition 2.3 and α, β are suitably chosen, any global minimizer of J^R_{α,β} provides an approximation of X up to noise level and injectivity constant δ.

Theorem 2.4 (Reconstruction of signals). Let Γ > 0. Assume that A satisfies robust injectivity on ΓK^{5R}_s with injectivity constants γ ∈ (0, 1] and δ > 0. If X ∈ ΓK^R_s and y = A(X) + η ∈ R^m, then by choosing α, β sufficiently small with α_1 = √(s/Γ) α_2 = β_1 = √(s/Γ) β_2 one obtains

‖X − X_{α,β}‖_F ≤ (2/√γ) ‖η‖_2 + √δ,    (11)

for any global minimizer (U_{α,β}, V_{α,β}) of J^R_{α,β}. In particular, X_{α,β} ∈ ΓK^{5R}_s with the SD in (8).

Proof: By assumption X ∈ ΓK^R_s and α_1 = √(s/Γ) α_2 = β_1 = √(s/Γ) β_2. According to Lemma 2.1 we can choose (α, β) such that ‖η‖_2 ≤ ‖y − A(X_{α,β})‖_2 ≤ 2‖η‖_2. Lemma 2.2 thus yields that X_{α,β} ∈ ΓK^{4R}_s. Since X − X_{α,β} ∈ ΓK^{5R}_s by a modification of (5), we can now apply the robust injectivity of A to obtain

‖X − X_{α,β}‖_F ≤ (1/√γ) ‖y − A(X_{α,β})‖_2 + √δ ≤ (2/√γ) ‖η‖_2 + √δ,

and hence the claim.


The main challenge in applying Theorem 2.4 is to tune the parameters α and β. This is common to various regularized recovery procedures. Lemma 2.2, nevertheless, suggests a simple heuristic. As long as minimizers of J^R_{α,β} are regular in the sense that the (effective) sparsity of U_{α,β} and V_{α,β} is small, the critical point of ‖y − A(X_{α,β})‖_2 = ‖η‖_2 has not yet been reached and α, β may be further decreased. If the critical point ‖y − A(X_{α,β})‖_2 < ‖η‖_2 is hit, the (effective) sparsity is not controlled anymore and likely to increase massively. This behavior can be observed in numerical experiments, see Section 4.

2.3 Robust Injectivity

As mentioned before, a linear operator A of the form (1) that is drawn from a suitable random distribution fulfills the robust injectivity property introduced above with high probability. We consider here several important types of measurements: first, measurement operators A characterized by matrices A_i with subgaussian i.i.d. entries exhibiting strong tail decay; second, measurement operators A whose components A_i may be heavy-tailed but obey a small-ball estimate; and third, measurement operators A whose components are Gaussian rank-1 matrices, i.e., A_i = a_i a_i^⊤ where a_i ∈ R^n has i.i.d. Gaussian entries.

To derive and discuss the results, we need to understand the complexity of the set ΓK^R_{s_1,s_2}. In order to measure the complexity of general subsets of Euclidean space, we use the Gaussian width defined as

w(K) = E[ sup_{z∈K} ⟨z, g⟩ ],

for K ⊂ R^n and g ∈ R^n having i.i.d. standard Gaussian entries. The Gaussian width has become an established measure of set complexity due to its numerous favorable properties [38]. In particular,

• the width of the Euclidean unit ball B^n fulfills w(B^n) ≃ √n,

• the width of the set of s-sparse vectors Σ^n_s intersected with B^n fulfills w(Σ^n_s ∩ B^n) ≃ √(s log(en/s)),

illustrating that w(K ∩ B^n)^2 extends the linear dimension of subspaces in a consistent way to arbitrary sets K ⊂ R^n. The restriction to B^n in the above examples is necessary since w(K) scales with the diameter of K.
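
The second example can be checked numerically: for fixed g, the supremum over Σ^n_s ∩ B^n is attained by the normalized restriction of g to its s largest entries in magnitude, so sup_z ⟨z, g⟩ equals the ℓ_2-norm of these entries. The following Monte Carlo sketch (ours, for illustration only) compares the resulting estimate of w(Σ^n_s ∩ B^n) with √(s log(en/s)):

import numpy as np

def width_sparse_cap(n, s, trials=2000, seed=0):
    # Monte Carlo estimate of w(Sigma^n_s intersected with B^n)
    rng = np.random.default_rng(seed)
    vals = []
    for _ in range(trials):
        g = rng.standard_normal(n)
        top = np.sort(np.abs(g))[-s:]        # s largest entries of g in magnitude
        vals.append(np.linalg.norm(top))     # value of sup_z <z, g>
    return np.mean(vals)

n, s = 1000, 10
print(width_sparse_cap(n, s), np.sqrt(s * np.log(np.e * n / s)))
# the two values agree up to an absolute constant, as stated above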

2.3.1 Subgaussian measurement operator with i.i.d. entries

We first recall the definition of subgaussian random variables (for further details see [37]).

Definition 2.5 (Subgaussian random variable). A random variable ξ ∈ R is called K-subgaussian if the tail bound Pr[|ξ| > t] ≤ 2 exp(−t²/K²) holds for all t > 0. The smallest possible number K > 0 is called the subgaussian norm of ξ and denoted by ‖ξ‖_{ψ_2}.

Remark 2.6. The class of subgaussian random variables covers important special cases such as Gaussian, Bernoulli, and, more generally, all bounded random variables, cf. [37].

For subgaussian measurement operators we can derive a stronger quasi-isometric property implying the robust injectivity introduced in Definition 2.3.

Theorem 2.7 (Robust injectivity for subgaussian operators). Let Γ > 0 and A : R^{n_1×n_2} → R^m be a linear measurement operator of form (1). Assume that all A_i, for i ∈ [m], have i.i.d. K-subgaussian entries a_{i,j,k} with mean 0 and variance 1. If

m ≳ (δ/(Γ²R²))^{−2} R(s_1 + s_2) log³(max{n_1, n_2}),    (12)

then A satisfies, for all Z ∈ ΓK^R_{s_1,s_2},

|‖A(Z)‖_2^2 − ‖Z‖_F^2| ≤ δ,    (13)

with probability at least 1 − 2 exp(−C(δ/(Γ²R²))m), where C > 0 is a constant depending on K. In particular, A satisfies (10) with γ = 1 and δ.


The proof of Theorem 2.7 can be found in Section 3.1.

Remark 2.8. Note that Γ²R² is the squared Frobenius diameter of ΓK^R_{s_1,s_2}. For δ = ∆Γ²R², ∆ ∈ (0, 1), Theorem 2.7 hence states that, up to log-factors, m ≈ O(∆^{−2}R(s_1 + s_2)) subgaussian measurements are sufficient to δ-stably embed ΓK^R_{s_1,s_2} in R^m, cf. [31, Def. 1.1 & Thm. 1.5]. How does this relate to the preliminary work on SPF in [23], where orthogonality of the matrices U and V was assumed? It is easy to check that Z ∈ (1/√R) K^R_{s_1,s_2} if Z is of unit Frobenius norm and composed of orthogonal component matrices U and V with s_1- resp. s_2-sparse columns. The bound in (12) then becomes

m ≳ δ^{−2} R³(s_1 + s_2) log³(max{n_1, n_2}).    (14)

This is by a factor R² worse than the results in [23], which have been shown to be near-optimal for recovery of matrices with orthogonal sparse decomposition. However, [23] neither treats effective sparsity nor non-orthogonal decompositions. For a fair comparison, the restriction to ‖Z‖_F = 1 is necessary here since our non-orthogonal setting forbids deriving scaling-invariant versions of (13) that are independent of the diameter of ΓK^R_{s_1,s_2}, cf. Remark 2.9 below.

Remark 2.9. Let us repeat an important observation from [9] at this point. The additive quasi-isometric property in (13) differs from the commonly used multiplicative Restricted Isometry Properties (RIP) of the form

(1 − δ)‖Z‖_F^2 ≤ ‖A(Z)‖_2^2 ≤ (1 + δ)‖Z‖_F^2    (15)

as it is not scaling invariant, and A(Z) = A(Z′) does not imply Z = Z′ but only ‖Z − Z′‖_F^2 ≤ δ. In fact, it is not possible to derive a classical scaling-invariant RIP like (15) on K^R_{s_1,s_2} under similar conditions as (12). The main problem is non-orthogonality of the SD. A simple example illustrates this point: Assume R = 2 and m ≃ 2(n_1 + s) log³(max{n_1, n_2}), and that the linear operator A fulfills (15) for all Z ∈ K²_{n_1,s}. Choose some u ∈ R^{n_1}, v_1 ∈ R^{n_2} of unit norm with ‖v_1‖_1 ≤ √s/2. Define v_2 := −v_1 + εw for any w ∈ R^{n_2} and choose ε > 0 sufficiently small to ensure ‖v_2‖_1 ≤ √s and ‖v_2‖_2 ≈ 1. Then Z := (1/2)uv_1^⊤ + (1/2)uv_2^⊤ ∈ K²_{n_1,s} and (15) holds. But this implies, by definition of Z and scaling invariance of (15), that

(1 − δ)‖uw^⊤‖_F^2 ≤ ‖A(uw^⊤)‖_2^2 ≤ (1 + δ)‖uw^⊤‖_F^2,

which means the RIP directly extends to all rank-1 matrices (not only those with sparse right component). If n_1, s ≪ n_2, this is a clear contradiction to information-theoretic lower bounds, as corresponding RIPs would require at least m ≃ max{n_1, n_2} (see [5, Section 2.1]).

2.3.2 Measurements with small ball property

The shape of (10) allows us to consider more general classes of measurement operators than those concentrating around their mean. Instead of assuming concentration as in the subgaussian case, it suffices that the probability mass is not strongly concentrated around zero. For a measurement operator A of form (1) whose component matrices A_i are i.i.d. copies of a random matrix A, this may be formally expressed in the small ball estimate

Pr[⟨A, Z⟩_F^2 ≥ θ‖Z‖_F^2] ≥ c,  for all Z ∈ K,    (16)

for some signal set K ⊂ R^{n_1×n_2} and some constants θ, c > 0 (Equation (16) does not require independent entries of A). We easily deduce the following theorem from Mendelson's work in [26].

Theorem 2.10 (Robust injectivity for small ball operators). Let Γ > 0 and A : R^{n_1×n_2} → R^m be a linear measurement operator of form (1) obeying (16) on ΓK^R_{s_1,s_2} − ΓK^R_{s_1,s_2}. If

m ≳ δ^{−2} E_m(ΓK^R_{s_1,s_2} − ΓK^R_{s_1,s_2}, A)²,    (17)

where

E_m(K, A) = E[ sup_{Z∈K} ⟨ (1/√m) ∑_{i=1}^m ε_i A_i, Z ⟩_F ]    (18)

denotes the mean empirical width of a set K ⊂ R^{n_1×n_2} under the distribution of A (the ε_i are i.i.d. copies of a Rademacher variable and the A_i are i.i.d. copies of A), then A satisfies with probability at least 1 − exp(−(c²/2)m) robust injectivity in (10) with constants γ = θ²c/16 and δ.

The proof of Theorem 2.10 can be found in Section 3.2.

Remark 2.11. If A is in addition subgaussian, the bound in (17) reduces to an improved version of (12). To see this, note that the mean empirical width E_m(K, A) can be bounded by the Gaussian width w(K) [36, Eq. (6.4)] for subgaussian A. Hence,

E_m(ΓK^R_{s_1,s_2} − ΓK^R_{s_1,s_2}, A)² ≤ C w(ΓK^R_{s_1,s_2} − ΓK^R_{s_1,s_2})² ≤ 4C w(ΓK^R_{s_1,s_2})² ≲ Γ²R³(s_1 + s_2) log³(max{n_1, n_2}),

where the second step follows from elementary properties of w and the third step from Lemma 3.2 below. Consequently, (17) becomes

m ≳ (δ/(ΓR))^{−2} R(s_1 + s_2) log³(max{n_1, n_2}).    (19)

When applying Theorem 2.10 to subgaussian measurements with small ball property, the bound in (14) is thus improved by a factor R (recall from Remark 2.8 that Z ∈ (1/√R) K^R_{s_1,s_2} for all unit Frobenius-norm Z that have orthogonal component matrices U and V with s_1- resp. s_2-sparse columns). This differs from the optimal complexity [23] only by a factor R.

By Theorem 2.10 we can, for instance, consider operators A of form (1) whose component matrices A_i are i.i.d. copies of an isotropic random matrix A (i.e., E[⟨A, Z⟩_F^2] = ‖Z‖_F^2 for all Z ∈ R^{n_1×n_2}) fulfilling the L1-L2-equivalence

E[⟨A, Z⟩_F^2]^{1/2} ≤ c E[|⟨A, Z⟩_F|],  for all Z ∈ R^{n_1×n_2},    (20)

where c > 1 is an absolute constant. The condition in (20) is fulfilled by various heavy-tailed distributions. By isotropy and the Paley-Zygmund inequality, one obtains from (20), for all θ ∈ (0, 1/c²), the small ball estimate

Pr[⟨A, Z⟩_F^2 ≥ θ‖Z‖_F^2] ≥ ((1 − √θ c)/c)²,  for all Z ∈ R^{n_1×n_2},    (21)

see [26, Lemma 4.1]. It is now straightforward to deduce the following corollary from Theorem 2.10.

Corollary 2.12 (Robust injectivity for heavy-tailed operators). Let Γ > 0 and A : R^{n_1×n_2} → R^m be a linear measurement operator of form (1) obeying (20). If

m ≳ δ^{−2} E_m(ΓK^R_{s_1,s_2} − ΓK^R_{s_1,s_2}, A)²,    (22)

where E_m is defined in (18), then A satisfies with probability at least 1 − exp(−c′m) robust injectivity in (10) with constants γ = 1/(256c⁴) and δ, where c′ > 0 is a constant depending on c from (20).

Remark 2.13. Note that by [26, Lemma 4.2] the estimate in (16) can be deduced as well if A consists of i.i.d. entries A_{i,j} which satisfy E[A_{i,j}^2]^{1/2} ≤ c E[|A_{i,j}|] (only the constant on the right-hand side of (16) changes). However, the requirement in (20) is more general as it does not require independence of the entries of A.

2.3.3 Rank-1 measurement operator

Another noteworthy class of measurement operators for matrix sensing is given by rank-1 measurements, i.e., the matrices A_i defining A are i.i.d. copies of a rank-one matrix A = aa^⊤, where a ∈ R^n has i.i.d. standard Gaussian entries (in this case we restrict ourselves to square matrices, n_1 = n_2 = n). The advantage of a structured measurement operator of this type lies in the reduced complexity in both storage and evaluation costs. The drawback, when compared to subgaussian operators with i.i.d. entries, is the additional dependence between the entries of each A_i. However, by using the observation that Gaussian rank-1 measurements satisfy the small ball estimate (16), cf. [21], we may deduce the following result. Talagrand's γ_α-functional appearing in the statement is, for a metric space (K, dist), defined as

γ_α(K, dist) = inf_{(K_i)_{i∈N}} sup_{z∈K} ∑_{i=0}^∞ 2^{i/α} dist(z, K_i),    (23)

where the infimum is taken over all admissible sequences (K_i)_{i∈N}, i.e., sequences of subsets of K for which |K_0| = 1 and |K_i| ≤ 2^{2^i}. Note that γ_α(K, dist) can be seen as a measure of the intrinsic complexity of K when measured in dist, and that γ_2(K, ‖·‖_2) is, up to an absolute constant, equivalent to the Gaussian width [38, Theorem 8.6.1], a result due to Fernique [8] and Talagrand [34] widely known as the majorizing measures theorem.

Theorem 2.14 (Robust injectivity for Gaussian rank-1 operators). Let Γ > 0 and A : R^{n×n} → R^m be a linear measurement operator of form (1) with A = aa^⊤, a ∈ R^n having i.i.d. standard Gaussian entries, and A_i being i.i.d. copies of A. There exist absolute constants c, c′ > 0 such that the following holds: if

m ≳ max{ δ^{−2} R γ_2(ΓK^R_{s_1,s_2} − ΓK^R_{s_1,s_2}, ‖·‖_F)², δ^{−1} γ_1(ΓK^R_{s_1,s_2} − ΓK^R_{s_1,s_2}, ‖·‖) },    (24)

then A satisfies with probability at least 1 − exp(−c′m) robust injectivity in (10) with constants γ = 1/c and δ.

The proof of Theorem 2.14 can be found in Section 3.3.

Remark 2.15. By the majorizing measures theorem [35], one can replace γ_2(ΓK^R_{s_1,s_2} − ΓK^R_{s_1,s_2}, ‖·‖_F)² in Theorem 2.14 with w(ΓK^R_{s_1,s_2} − ΓK^R_{s_1,s_2})², leading to a bound as in (19). Unfortunately, it is not as simple to bound γ_1(ΓK^R_{s_1,s_2} − ΓK^R_{s_1,s_2}, ‖·‖) in an intuitive way without deriving a tight bound on the covering number of ΓK^R_{s_1,s_2} in operator norm. Independently of this open point, let us mention that in its current form we do not expect (24) to be optimal, since the additional factor R in front of the γ_2-functional appears to be an artifact of the proof.

2.4 Computing Minimizers

The previous sections showed that minimizers of J^R_{α,β} uniformly approximate original ground truths as long as the measurement operator A satisfies the robust injectivity in Definition 2.3, a rather mild condition fulfilled by many popular choices of A. The last crucial question is how to compute global minimizers of the non-differentiable functional J^R_{α,β} in an efficient way. We discuss here two schemes based on the proximal operator of the elastic net Enet_θ(Z) = θ_1 ‖Z‖_F^2 + θ_2 ‖Z‖_1, for θ = (θ_1, θ_2)^⊤ ≥ 0. The proximal operator of a proper and lower semi-continuous function f : R^n → R ∪ {∞} is defined as

prox_{µf}(u) = argmin_{z ∈ R^n} (1/2)‖z − u‖_2^2 + µ f(z),

where µ > 0 is a design parameter. Since prox_{χ_C}(z) is the orthogonal projection of z onto C, for C convex and χ_C the corresponding indicator function [33], proximal operators may be viewed as generalized projections. By separability of the components of Enet_θ it is straightforward to verify that

prox_{µ Enet_θ}(Z) = (S_{θ,µ}(Z_{i,j}))_{i,j},  where  S_{θ,µ}(Z_{i,j}) = (Z_{i,j} − µθ_2)/(1 + 2µθ_1) if Z_{i,j} > µθ_2,  0 if |Z_{i,j}| ≤ µθ_2,  and (Z_{i,j} + µθ_2)/(1 + 2µθ_1) if Z_{i,j} < −µθ_2.
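
In code, S_{θ,µ} is entrywise soft-thresholding followed by a rescaling. A minimal NumPy sketch (our own naming, not the reference implementation):

import numpy as np

def prox_enet(Z, mu, theta1, theta2):
    # proximal operator of mu * (theta1*||Z||_F^2 + theta2*||Z||_1), applied entrywise
    shrunk = np.sign(Z) * np.maximum(np.abs(Z) - mu * theta2, 0.0)   # soft-thresholding at mu*theta2
    return shrunk / (1.0 + 2.0 * mu * theta1)                        # rescaling from the squared term

Z = np.array([[3.0, -0.5], [0.2, -2.0]])
print(prox_enet(Z, mu=1.0, theta1=0.25, theta2=1.0))
# entries with |Z_ij| <= 1 vanish; the remaining ones are shrunk by 1 and divided by 1.5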

Due to the non-convex structure of J^R_{α,β}, we are not able to derive guaranteed global convergence to global minimizers. However, we guarantee global convergence of all presented methods to stationary points of J^R_{α,β} and local convergence to global minimizers. Our analysis relies on the fact that J^R_{α,β} has the so-called Kurdyka-Łojasiewicz property, which requires J^R_{α,β} to behave well around stationary points.


Definition 2.16 (Kurdyka-Łojasiewicz property). A proper lower semicontinuous function f : R^n → R ∪ {∞} is said to have the KL-property at x̄ ∈ dom ∂f if there exist η ∈ (0, ∞], a neighborhood U of x̄, and a continuous concave function ϕ : [0, ∞) → R_+ such that

- ϕ(0) = 0,

- ϕ is C¹ on (0, η),

- ϕ′(t) > 0, for all t ∈ (0, η),

- and, for all x ∈ U ∩ {x ∈ R^n : f(x̄) < f(x) < f(x̄) + η}, the KL-inequality holds:

ϕ′(f(x) − f(x̄)) dist(0, ∂f(x)) ≥ 1.

Remark 2.17. It is easy to see that J^R_{α,β} has the KL-property with ϕ(t) = c t^{1−θ}, for c > 0 and θ ∈ [0, 1), since its graph is a semialgebraic set in R^{n_1×R} × R^{n_2×R} × R, cf. [1] and the more detailed discussion on the related functional in [9, Section 6].

As [1, Theorem 11] shows, a characterization of θ would determine the convergence speed of alternating descent schemes. While [24] can be used to compute θ for piecewise convex polynomials, it is unclear how to do the same for non-convex polynomials. Addressing this more general issue would, in particular, provide convergence rates for (7). The main difficulty in characterizing the convergence radius in Theorems 2.18-2.19 is to characterize the KL-parameters U and η of J^R_{α,β}. Doing so for a non-convex functional is a challenging task on its own and thus the main reason for us to defer the treatment of initialization to future work.

2.4.1 Alternating Minimization

To analyze (7), i.e., Algorithm 1, we follow the arguments in [1]. It is then straightforward to obtain the result below. Note, however, that Algorithm 1 is rather inefficient since each iteration requires solving a convex optimization problem, which can, for instance, be done via proximal gradient descent, cf. Algorithm 2.

Theorem 2.18. The sequence (U_k, V_k) generated by Algorithm 1 converges to a stationary point of J^R_{α,β}. Moreover, for any global minimizer (U_{α,β}, V_{α,β}) of J^R_{α,β}, there exist ε, η > 0 such that the initial conditions

‖(U_0, V_0) − (U_{α,β}, V_{α,β})‖_F < ε,    min J^R_{α,β} < J^R_{α,β}(U_0, V_0) < min J^R_{α,β} + η,

imply that the iterates (U_k, V_k) converge to some (U^∗, V^∗) ∈ argmin J^R_{α,β}.

The proof is briefly discussed in Section 3.4.

Algorithm 1: Alternating Minimization

Given: y ∈ R^m, A ∈ R^{m×n_1n_2} such that A(UV^⊤) = A · vec(UV^⊤), rank R, V_0 ∈ R^{n_2×R}, and α_1, α_2, β_1, β_2 > 0

1: while stop condition is not satisfied do
2:   U_{k+1} ← argmin_{U ∈ R^{n_1×R}} J^R_{α,β}(U, V_k)        ▷ use, e.g., Algorithm 2
3:   V_{k+1} ← argmin_{V ∈ R^{n_2×R}} J^R_{α,β}(U_{k+1}, V)     ▷ use, e.g., Algorithm 2
4: end while
5: return (U_final, V_final)

2.4.2 Proximal Alternating Linearized Minimization

A second, more efficient descent scheme is given by Proximal Alternating Linearized Minimization [4], see Algorithm 3. It is straightforward to verify that the coercive functional J^R_{α,β} and Algorithm 3 satisfy Assumptions 1 & 2 in [4], such that the following holds.


Algorithm 2: Proximal Gradient Descent

Given: objective function h = f + g : R^n → R with f continuously differentiable and g lower semicontinuous and convex, x_0 = 0, step-size µ > 0

1: while stop condition is not satisfied do
2:   x_{k+1} ← prox_{µg}(x_k − µ∇f(x_k))
3: end while
4: return x_final
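
A generic implementation of Algorithm 2 only needs the gradient of the smooth part f and the proximal operator of the non-smooth part g. The following fixed-step sketch is ours (no line search; names are illustrative); for the subproblems of Algorithm 1, f is the quadratic data misfit and g is the elastic net, so prox_g is the elastic-net prox given above.

import numpy as np

def proximal_gradient(grad_f, prox_g, x0, mu, n_iter=500, tol=1e-8):
    # minimize f(x) + g(x) via x_{k+1} = prox_{mu*g}(x_k - mu * grad_f(x_k));
    # prox_g(v, mu) is expected to return prox_{mu*g}(v)
    x = x0.copy()
    for _ in range(n_iter):
        x_new = prox_g(x - mu * grad_f(x), mu)
        if np.linalg.norm(x_new - x) <= tol * max(1.0, np.linalg.norm(x)):
            break                                # stop condition: iterates stagnate
        x = x_new
    return x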

Theorem 2.19 ([4, Lemma 3, Theorem 1] + [1, Theorem 10]). The sequence (U_k, V_k) generated by Algorithm 3 converges to a stationary point of J^R_{α,β}. Moreover, for any global minimizer (U_{α,β}, V_{α,β}) of J^R_{α,β}, there exist ε, η > 0 such that the initial conditions

‖(U_0, V_0) − (U_{α,β}, V_{α,β})‖_F < ε,    min J^R_{α,β} < J^R_{α,β}(U_0, V_0) < min J^R_{α,β} + η,

imply that the iterates (U_k, V_k) converge to some (U^∗, V^∗) ∈ argmin J^R_{α,β}.

Algorithm 3: Proximal Alternating Linearized Minimization

Given: y ∈ R^m, A ∈ R^{m×n_1n_2} such that A(UV^⊤) = A · vec(UV^⊤), rank R, V_0 ∈ R^{n_2×R}, α_1, α_2, β_1, β_2 > 0, and sequences of step-sizes (λ_k)_k, (µ_k)_k ⊂ (r_−, r_+), for r_+ > r_− > 0

1: while stop condition is not satisfied do
2:   U_{k+1} ← prox_{(1/λ_k) Enet_α}( U_k − λ_k vec^{−1}( A_{V_k}^⊤ (A_{V_k} vec(U_k) − y) ) )        ▷ A(UV^⊤) = A_V · vec(U)
3:   V_{k+1} ← prox_{(1/µ_k) Enet_β}( V_k − µ_k vec^{−1}( A_{U_{k+1}}^⊤ (A_{U_{k+1}} vec(V_k) − y) ) )  ▷ A(UV^⊤) = A_U · vec(V)
4: end while
5: return (U_final, V_final)
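
The following NumPy sketch implements a PALM-type iteration in the spirit of Algorithm 3 under two simplifying assumptions that are ours, not the paper's: the operator is given in matrix form Amat with A(UV^⊤) = Amat·vec(UV^⊤) (row-major vectorization), and the step sizes are derived from spectral norms instead of prescribed sequences (λ_k), (µ_k).

import numpy as np

def prox_enet(Z, mu, t1, t2):
    # proximal operator of mu*(t1*||Z||_F^2 + t2*||Z||_1), cf. Section 2.4
    return np.sign(Z) * np.maximum(np.abs(Z) - mu * t2, 0.0) / (1.0 + 2.0 * mu * t1)

def palm(y, Amat, n1, n2, R, alpha, beta, n_iter=300, seed=0):
    # PALM-type alternating scheme for J^R_{alpha,beta}; Amat has shape (m, n1*n2)
    rng = np.random.default_rng(seed)
    U, V = rng.standard_normal((n1, R)), rng.standard_normal((n2, R))
    a1, a2 = alpha
    b1, b2 = beta
    A2 = np.linalg.norm(Amat, 2) ** 2                       # squared operator norm of A
    for _ in range(n_iter):
        # gradient of ||y - A vec(U V^T)||_2^2 with respect to U is 2*G@V with G = mat(A^T r)
        G = (Amat.T @ (Amat @ (U @ V.T).ravel() - y)).reshape(n1, n2)
        step = 1.0 / (2.0 * A2 * max(np.linalg.norm(V, 2) ** 2, 1e-12))
        U = prox_enet(U - step * 2.0 * G @ V, step, a1, a2)
        # gradient with respect to V is 2*G^T@U (recompute the residual with the new U)
        G = (Amat.T @ (Amat @ (U @ V.T).ravel() - y)).reshape(n1, n2)
        step = 1.0 / (2.0 * A2 * max(np.linalg.norm(U, 2) ** 2, 1e-12))
        V = prox_enet(V - step * 2.0 * G.T @ U, step, b1, b2)
    return U, V

# small synthetic test (assumed setup, not the paper's experiments)
rng = np.random.default_rng(1)
n1, n2, R, m = 10, 12, 2, 100
Amat = rng.standard_normal((m, n1 * n2)) / np.sqrt(m)
X = rng.standard_normal((n1, R)) @ rng.standard_normal((R, n2))
y = Amat @ X.ravel()
U, V = palm(y, Amat, n1, n2, R, alpha=(1e-4, 1e-4), beta=(1e-4, 1e-4))
print(np.linalg.norm(U @ V.T - X) / np.linalg.norm(X))   # relative reconstruction error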

3 Proofs

This section provides the remaining proofs for the main results from Section 2.

3.1 Proof of Theorem 2.7

To prove Theorem 2.7, we need bounds on the complexity of ΓK^R_{s_1,s_2} in terms of the Gaussian width w(ΓK^R_{s_1,s_2}). Since the Gaussian width of a set K is strongly connected to the covering number N(K, ‖·‖_2, ε) of K, we begin by bounding the covering number of ΓK^R_{s_1,s_2}.

Lemma 3.1 (Covering number of ΓK^R_{s_1,s_2}). Let Γ > 0 and let K^R_{s_1,s_2} be the set defined in (3). Then, for all ε ∈ (0, 1), one has that

log(N(ΓK^R_{s_1,s_2}, ‖·‖_F, ε)) ≲ ε^{−2} Γ² R³ (s_1 + s_2) log(max{en_1/s_1, en_2/s_2}).    (25)

log (N(Kn,s, ‖·‖2 , ε)) . ε−2s log(ens

).

12

Page 13: arXiv:2103.05523v1 [cs.IT] 9 Mar 2021

We now construct a net for KRs1,s2 . Let KRn1,Rs1 and KRn2,Rs2 be minimal ε

2√R-nets of

√RKRn1,Rs1

and√RKRn2,Rs2 , and define

K ={

Z = UV> : vec(U) ∈ KRn1,Rs1 and vec(V) ∈ KRn2,Rs2

}.

Hence, for any Z = UV> ∈ KRs1,s2 there exists Z = UV> ∈ K such that ‖U − U‖F ≤ ε

2√R

and‖V − V‖F ≤ ε

2√R. This implies

‖Z− Z‖F ≤ ‖U− U‖F ‖V‖F + ‖U‖F ‖V − V‖F ≤ε

2√R

√R+√R

ε

2√R

= ε,

i.e., K is an ε-net ofKRs1,s2 . By construction, the cardinality of K is bounded by |K| ≤ |KRn1,Rs1 | · |KRn2,Rs2 |

and, consequently, using that N(cK, ‖·‖ , ε) = N(K, ‖·‖ , εc

), for c > 0, we get

log(N(KR

s1,s2 , ‖ · ‖F , ε))≤ log

(N

(√RKRn1,Rs1 , ‖·‖2 ,

ε

2√R

))+ log

(N

(√RKRn2,Rs2 , ‖·‖2 ,

ε

2√R

)). ε−2R3(s1 + s2) log

(max

{en1

s1,en2

s2

}).

The claim follows from N(ΓKRs1,s2 , ‖ · ‖F , ε) = N

(KRs1,s2 , ‖ · ‖F ,

εΓ

).

From Lemma 3.1, we deduce the following bound on the Gaussian width of ΓK^R_{s_1,s_2}.

Lemma 3.2 (Gaussian width of ΓK^R_{s_1,s_2}). Let Γ > 0 and let K^R_{s_1,s_2} be the set defined in Section 2. Then,

w(ΓK^R_{s_1,s_2}) ≲ √(Γ² R³ (s_1 + s_2) log³(max{n_1, n_2})).

Proof: The two-sided Sudakov inequality [38, Theorem 8.1.13] yields that

w(ΓK^R_{s_1,s_2}) ≲ log(n_1 n_2) · sup_{ε>0} ε √(log(N(ΓK^R_{s_1,s_2}, ‖·‖_F, ε))).

Lemma 3.1 now yields the claim.

Theorem 2.7 can be proven by applying a bound on suprema of chaos processes [20, Theorems 1.4 & 3.1] in combination with the above bound on w(ΓK^R_{s_1,s_2}). We first recall the relevant result in the form presented in [17]. We refer the reader to [20] and [17] for further details. Here and below, d_◦(H) = sup_{H∈H} ‖H‖_◦, where ‖·‖_◦ is a generic norm. Recall the definition of γ_α from (23).

Theorem 3.3 ([17, Theorem 3.7]). Let H be a symmetric set of matrices, i.e., H = −H, and let ξ be a random vector whose entries ξ_i are independent K-subgaussian random variables with mean 0 and variance 1. Set

E = γ_2(H, ‖·‖_{2→2}) (γ_2(H, ‖·‖_{2→2}) + d_F(H)),
V = d_{2→2}(H) (γ_2(H, ‖·‖_{2→2}) + d_F(H)),
U = d_{2→2}^2(H).

Then, for t > 0,

Pr[ sup_{H∈H} | ‖Hξ‖_2^2 − E[‖Hξ‖_2^2] | ≥ c_1 E + t ] ≤ 2 exp(−c_2 min(t²/V², t/U)).

The constants c_1 and c_2 only depend on K.


Proof of Theorem 2.7: The proof consists of three main parts. We start by fitting our setting into the one of Theorem 3.3. Then, we bound the γ_2-functional for ΓK^R_{s_1,s_2} and conclude by applying Theorem 3.3.

Let us first interchange the roles of our random measurement operator A applied to the fixed matrices Z to have fixed operators H_Z applied to a random vector ξ. Recall that vec(Z) ∈ R^{n_1n_2} denotes the vectorization of Z. Observe, for all Z ∈ R^{n_1×n_2},

A(Z) = (1/√m) (⟨vec(A_1), vec(Z)⟩, ..., ⟨vec(A_m), vec(Z)⟩)^⊤ = (1/√m) blockdiag(vec(Z)^⊤, ..., vec(Z)^⊤) · (vec(A_1)^⊤, ..., vec(A_m)^⊤)^⊤ = H_Z · ξ,

where H_Z ∈ R^{m×mn_1n_2} is a (block diagonal) matrix depending on Z and ξ ∈ R^{mn_1n_2} has i.i.d. K-subgaussian entries ξ_l of mean 0 and variance 1. We define H_K = {H_Z : Z ∈ ΓK^R_{s_1,s_2}}. Note that the mapping Z ↦ H_Z is an isometric linear bijection. In particular, we have ‖H_Z‖_F = ‖Z‖_F and ‖H_Z‖_{2→2} = ‖Z‖_F/√m. For Z ∈ ΓK^R_{s_1,s_2} it holds that ‖Z‖_F ≤ ‖U‖_F ‖V^⊤‖_F ≤ ΓR. Consequently, d_F(H_K) ≤ ΓR and d_{2→2}(H_K) ≤ ΓR/√m.

Since ‖H_Z‖_{2→2} = ‖Z‖_F/√m and Z ↦ H_Z is a linear bijection, it follows by the definition of γ_2 in (23) that γ_2(H_K, ‖·‖_{2→2}) = (1/√m) γ_2(H_K, ‖·‖_F). We hence estimate by [38, Theorem 8.6.1] that

γ_2(H_K, ‖·‖_{2→2}) ≲ w(H_K)/√m ≲ √(Γ²R³(s_1 + s_2) log³(max{n_1, n_2})/m) =: L.

We assume m ≳ C∆^{−2} R(s_1 + s_2) log³(max{n_1, n_2}), for some 0 < ∆ < 1. Then, L ≤ ∆ΓR and

L² + ΓRL ≤ Γ²R²(∆² + ∆) ≤ 2Γ²R²∆.    (26)

We obtain the following bounds on the quantities of Theorem 3.3:

E ≤ L² + ΓRL,    V ≤ (ΓRL + Γ²R²)/√m,    U ≤ Γ²R²/m.    (27)

Since E[‖H_Z ξ‖_2^2] = ‖H_Z‖_F^2 = ‖Z‖_F^2, we finally get, for δ = (2c_1 + 1)Γ²R²∆ (which implies by (26) that δ ≥ c_1 E + Γ²R²∆),

Pr[ sup_{Z ∈ ΓK^R_{s_1,s_2}} |‖A(Z)‖_2^2 − ‖Z‖_F^2| ≥ δ ] ≤ Pr[ sup_{H_Z ∈ H_K} |‖H_Z ξ‖_2^2 − E[‖H_Z ξ‖_2^2]| ≥ c_1 E + Γ²R²∆ ]
≤ 2 exp(−c_2 min{ m Γ⁴R⁴∆² / (Γ⁴R⁴(1 + ∆)²), m Γ²R²∆ / (Γ²R²) })
≤ 2 exp(−c_2 ∆² m).

3.2 Proof of Theorem 2.10

The proof of Theorem 2.10 is a straightforward application of Mendelson's small ball method [26]. We first recap the key result, transferred to our setting.

Theorem 3.4 ([26, Corollary 5.5]). Let K ⊂ R^{n_1×n_2} be star-shaped around 0, i.e., for any t ∈ [0, 1] and Z ∈ K one has tZ ∈ K. Let A ∈ R^{n_1×n_2} be an isotropic random matrix satisfying

Q_K(2τ) := inf_{Z∈K} Pr[|⟨A, Z⟩_F| ≥ 2τ‖Z‖_F] > 0,

and let A_i, i ∈ [m], be i.i.d. copies of A. Let r > 0 be sufficiently large to have

E[ sup_{Z ∈ K ∩ rB_F} | ⟨ (1/m) ∑_{i=1}^m ε_i A_i, Z ⟩_F | ] ≤ (τ/16) Q_K(2τ) r,

where B_F denotes the Frobenius unit ball. Then, with probability at least 1 − 2e^{−Q_K(2τ)² m / 2},

|{ i : |⟨A_i, Z⟩_F| ≥ τ‖Z‖_F }| ≥ (1/4) Q_K(2τ) m,

for all Z ∈ K with ‖Z‖_F ≥ r.

Proof of Theorem 2.10: Let us define K = ΓK^R_{s_1,s_2} − ΓK^R_{s_1,s_2}. First note that K is star-shaped and, since (16) holds for A, we have Q_K(2τ) ≥ c, where 2τ = θ. By (17) and symmetry of K, we get that

E[ sup_{Z ∈ K ∩ rB_F} | ⟨ (1/m) ∑_{i=1}^m ε_i A_i, Z ⟩_F | ] = E[ sup_{Z ∈ K ∩ rB_F} ⟨ (1/m) ∑_{i=1}^m ε_i A_i, Z ⟩_F ] = E_m(K, A)/√m ≤ (τ/16) Q_K(2τ) δ.

We may now apply Theorem 3.4 with τ = θ/2 and r = δ to obtain, with probability at least 1 − 2e^{−c²m/2},

‖A(Z)‖_2^2 = ∑_{i=1}^m ⟨ (1/√m) A_i, Z ⟩_F^2 = (1/m) ∑_{i=1}^m ⟨A_i, Z⟩_F^2 ≥ (1/m) · ((1/4) Q_K(2τ) m) · (τ² ‖Z‖_F^2) ≥ (θ²c/16) ‖Z‖_F^2,

for all Z ∈ K with ‖Z‖_F ≥ δ. Since ΓK^R_{s_1,s_2} ⊂ K, the claim follows.

3.3 Proof of Theorem 2.14

First, note that A satisfies (16) with θ = 1/√2 and c = 1/96, cf. [21]. To prove Theorem 2.14, we may apply Theorem 2.10 and use the following lemma to bound E_m(K, A) in terms of Talagrand's γ_α-functional.

Lemma 3.5. For K ⊂ R^{n×n} and A = aa^⊤ with a ∈ R^n having i.i.d. K-subgaussian entries, one has

E_m(K, A) ≤ C ( √R γ_2(K, ‖·‖_F) + γ_1(K, ‖·‖)/√m ),

where C > 0 only depends on K.

Proof: Define X_Z := ⟨∑_{i=1}^m ε_i a_i a_i^⊤, Z⟩_F, such that E_m(K, A) = (1/√m) E[sup_{Z∈K} X_Z], and recall that ε_1, ..., ε_m ∈ {−1, 1} are i.i.d. Bernoulli variables. We show that

Pr[|X_Z − X_{Z′}| > t] ≤ 2 e^{−c min{ t²/(mR‖Z−Z′‖_F²), t/‖Z−Z′‖ }},    (28)

for Z, Z′ ∈ R^{n×n}, and apply generic chaining to obtain the claim. Note that (28) is equivalent to showing

‖X_Z‖_{L_q} := E[|X_Z|^q]^{1/q} ≲ √(mR) ‖Z‖_F √q + ‖Z‖ q,  for all q ≥ 1.    (29)

By using the triangle inequality, we can estimate

E[|X_Z|^q]^{1/q} = E_ε[E_A[|X_Z|^q]]^{1/q} = E_ε[‖X_Z‖_{L_q,A}^q]^{1/q}
≤ E_ε[(‖X_Z − E_A[X_Z]‖_{L_q,A} + ‖E_A[X_Z]‖_{L_q,A})^q]^{1/q}
= ‖ ‖X_Z − E_A[X_Z]‖_{L_q,A} + ‖E_A[X_Z]‖_{L_q,A} ‖_{L_q,ε}
≤ ‖ ‖X_Z − E_A[X_Z]‖_{L_q,A} ‖_{L_q,ε} + ‖ ‖E_A[X_Z]‖_{L_q,A} ‖_{L_q,ε},    (30)

where E_ε[·] and E_A[·] denote conditional expectations and ‖·‖_{L_q,ε} and ‖·‖_{L_q,A} denote the L_q-norm with respect to ε resp. A. To estimate the first summand on the right-hand side of (30), rewrite X_Z = a^⊤ M_{Z,ε} a, where a = (a_1^⊤, ..., a_m^⊤)^⊤ ∈ R^{mn} is the concatenation of all a_i and M_{Z,ε} ∈ R^{mn×mn} is a block diagonal matrix with blocks ε_i Z. Note that ‖M_{Z,ε}‖_F = √m ‖Z‖_F and ‖M_{Z,ε}‖ = ‖Z‖. For ε and Z fixed, the Hanson-Wright inequality [38, Theorem 6.2.1] yields

Pr[|X_Z − E_A[X_Z]| > t] ≤ 2 e^{−c min{ t²/(m‖Z‖_F²), t/‖Z‖ }},

for c > 0 only depending on K, which is equivalent to

‖X_Z − E_A[X_Z]‖_{L_q,A} ≲ √m ‖Z‖_F √q + ‖Z‖ q    (31)

(and independent of ε). To bound the second term, note that

E_A[X_Z] = ∑_{i=1}^m ε_i tr(Z),

which implies by Hoeffding's inequality [38, Theorem 2.2.5]

Pr_ε[|E_A[X_Z]| > t] ≤ 2 e^{−t²/(2m tr(Z)²)}.

Consequently,

‖ ‖E_A[X_Z]‖_{L_q,A} ‖_{L_q,ε} = ‖E_A[X_Z]‖_{L_q,ε} ≲ √m tr(Z) √q.    (32)

Combining (31) and (32) and using that tr(Z) ≤ ‖Z‖_∗ ≤ √R ‖Z‖_F yields (29). We conclude by applying generic chaining [35, Theorem 2.2.23] to obtain

E_m(K, A) = (1/√m) E[ sup_{Z∈K} X_Z ] ≤ C ( √R γ_2(K, ‖·‖_F) + γ_1(K, ‖·‖)/√m ).

3.4 Proof of Theorem 2.18

Let us briefly discuss why the claim of [1, Theorem 10] still holds if (U_k, V_k) is defined via alternating minimization in Algorithm 1 and not proximal alternating minimization as in [1]. We need the following observation.

Lemma 3.6. For J^R_{α,β} and (U_k, V_k) defined by Algorithm 1, we have that

J^R_{α,β}(U_k, V_k) − J^R_{α,β}(U_{k+1}, V_{k+1}) ≥ α_1 ‖U_k − U_{k+1}‖_F^2 + β_1 ‖V_k − V_{k+1}‖_F^2,

and

∑_{k=0}^∞ ( ‖U_k − U_{k+1}‖_F^2 + ‖V_k − V_{k+1}‖_F^2 ) < ∞,

which implies that lim_{k→∞} ( ‖U_k − U_{k+1}‖_F^2 + ‖V_k − V_{k+1}‖_F^2 ) = 0. Moreover,

2 ( A*_{V_k}(A(U_k V_k^⊤) − y) − A*_{V_{k−1}}(A(U_k V_{k−1}^⊤) − y), 0 ) ∈ ∂J^R_{α,β}(U_k, V_k),

where A_V : R^{n_1×R} → R^m is defined such that A_V(U) = A(UV^⊤).


Proof: First note that J^R_{α,β}(·, V) is 2α_1-strongly convex and J^R_{α,β}(U, ·) is 2β_1-strongly convex, for all U ∈ R^{n_1×R}, V ∈ R^{n_2×R}. To see this, check that J^R_{α,β}(·, V) − α_1 ‖·‖_F^2 and J^R_{α,β}(U, ·) − β_1 ‖·‖_F^2 are convex. Since U_{k+1} minimizes J^R_{α,β}(·, V_k), we know that −2α_1 U_{k+1} ∈ ∂[J^R_{α,β}(·, V_k) − α_1 ‖·‖_F^2](U_{k+1}). By convexity of J^R_{α,β}(·, V) − α_1 ‖·‖_F^2, we hence get that

J^R_{α,β}(U_k, V_k) − α_1 ‖U_k‖_F^2 ≥ (J^R_{α,β}(U_{k+1}, V_k) − α_1 ‖U_{k+1}‖_F^2) − 2α_1 ⟨U_{k+1}, U_k − U_{k+1}⟩,

which implies

J^R_{α,β}(U_k, V_k) − J^R_{α,β}(U_{k+1}, V_k) ≥ α_1 ‖U_k − U_{k+1}‖_F^2.

The same argument applied to V_{k+1} yields

J^R_{α,β}(U_{k+1}, V_k) − J^R_{α,β}(U_{k+1}, V_{k+1}) ≥ β_1 ‖V_k − V_{k+1}‖_F^2.

We obtain

J^R_{α,β}(U_k, V_k) ≥ J^R_{α,β}(U_{k+1}, V_k) + α_1 ‖U_k − U_{k+1}‖_F^2 ≥ J^R_{α,β}(U_{k+1}, V_{k+1}) + α_1 ‖U_k − U_{k+1}‖_F^2 + β_1 ‖V_k − V_{k+1}‖_F^2,

and hence the first claim. The second claim directly follows by

∑_{k=0}^∞ ( ‖U_k − U_{k+1}‖_F^2 + ‖V_k − V_{k+1}‖_F^2 ) ≤ (1/min{α_1, β_1}) ∑_{k=0}^∞ ( α_1 ‖U_k − U_{k+1}‖_F^2 + β_1 ‖V_k − V_{k+1}‖_F^2 ) ≤ c ∑_{k=0}^∞ ( J^R_{α,β}(U_k, V_k) − J^R_{α,β}(U_{k+1}, V_{k+1}) ) ≤ c J^R_{α,β}(U_0, V_0) < ∞.

Let us now turn to the last claim. By minimality of U_{k+1} and V_{k+1}, we know that

0 ∈ ∂_U J^R_{α,β}(U_{k+1}, V_k)  and  0 ∈ ∂_V J^R_{α,β}(U_{k+1}, V_{k+1}).

Since ∂_U J^R_{α,β}(U, V) = 2A*_V(A(UV^⊤) − y) + 2α_1 U + α_2 sign(U), where sign is applied componentwise, we have that

−2A*_{V_k}(A(U_{k+1} V_k^⊤) − y) − 2α_1 U_{k+1} ∈ α_2 sign(U_{k+1}),

and thus

2A*_{V_{k+1}}(A(U_{k+1} V_{k+1}^⊤) − y) − 2A*_{V_k}(A(U_{k+1} V_k^⊤) − y) ∈ ∂_U J^R_{α,β}(U_{k+1}, V_{k+1}),
0 ∈ ∂_V J^R_{α,β}(U_{k+1}, V_{k+1}).

By replacing [1, Lemma 5] with Lemma 3.6 and noting that [1, Proposition 6] holds for J^R_{α,β} with (U_k, V_k) defined by Algorithm 1, it is easy to verify that [1, Theorem 8] holds as well, implying Theorem 2.18.

4 Numerical Experiments

We finally compare our theoretical predictions to the empirical performance² of our alternating methods. Although Algorithm 1 is the less efficient implementation (in each alternating step one has to compute a full proximal gradient descent), we prefer it in our simulations. It is not as sensitive to step-size adaptation as Algorithm 3 and thus diminishes the need for parameter tuning.

Let us now turn toward the numerical simulations. First, we check if the main theoretical result stated in Theorem 2.4 describes the qualitative and quantitative behavior of the approximation error well. Then, we demonstrate the performance of Algorithm 1 on different measurement ensembles, comparing reconstruction for Gaussian, log-normal, and Gaussian rank-1 type measurements. Finally, we compare Algorithm 1 to the initially mentioned Sparse Power Factorization (SPF) [23], which will serve as a general benchmark (SPF has been shown to outperform conventional reconstruction methods that solely rely on low-rankness or sparsity).

²The corresponding Python code is provided at https://johannes-maly.github.io/

Ground-truth. In order to produce effectively sparse random samples X = UV^⊤ ∈ K^R_{s_1,s_2}, we draw for U resp. V randomly Rs_1 resp. Rs_2 positions in an n_1×R- resp. n_2×R-zero matrix, fill them with i.i.d. Gaussian entries, re-normalize to Frobenius norm √R, and add a dense Gaussian random matrix of size n_1×R resp. n_2×R with Frobenius norm 0.1√R.
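
A sketch of this sampling procedure (our own illustrative code; the experiments themselves use the implementation linked in the footnote above):

import numpy as np

def sample_factor(n, R, s, rng):
    # n x R factor: R*s random Gaussian entries, Frobenius norm sqrt(R), plus a small dense part
    F = np.zeros((n, R))
    idx = rng.choice(n * R, size=R * s, replace=False)       # R*s random positions
    np.put(F, idx, rng.standard_normal(R * s))
    F *= np.sqrt(R) / np.linalg.norm(F)                      # re-normalize to ||F||_F = sqrt(R)
    D = rng.standard_normal((n, R))
    D *= 0.1 * np.sqrt(R) / np.linalg.norm(D)                # dense part of Frobenius norm 0.1*sqrt(R)
    return F + D

rng = np.random.default_rng(0)
n1, n2, R, s1, s2 = 20, 300, 1, 20, 20
U = sample_factor(n1, R, s1, rng)
V = sample_factor(n2, R, s2, rng)
X = U @ V.T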

Initialization. In all experiments, we initialize the algorithms by randomly perturbing U and V to obtain U_0 and V_0 such that ‖X − X_0‖_F = ‖UV^⊤ − U_0 V_0^⊤‖_F ≈ 0.6 ‖X‖_F.

Parameter tuning. Since the first experiment in Section 4.1 suggests that Algorithm 1 with α_1 = α_2 = β_1 = β_2 performs slightly better than with the parameter ratio given in Theorem 2.4, in all remaining experiments we set α_1 = α_2 = β_1 = β_2 = µ and loosely tune µ by the heuristic proposed at the end of Section 2.2 (sketched in code after this list):

1. Initialize µ sufficiently large to guarantee X_{α,β} = 0, i.e., ‖X − X_{α,β}‖_F/‖X‖_F = 1.

2. Iteratively shrink µ by a factor 1/2 until the relative error ‖X − X_{α,β}‖_F/‖X‖_F stops decreasing.
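
In code, this heuristic is a simple geometric search over µ. In the sketch below, solve_J is a hypothetical stand-in for any solver of (6) with α = β = (µ, µ) (e.g., Algorithm 1 or the PALM sketch of Section 2.4), and mu0 is an assumed starting value:

import numpy as np

def tune_mu(y, solve_J, X_true, mu0=10.0, shrink=0.5, max_steps=30):
    # shrink mu by a factor 1/2 until the relative error stops decreasing
    mu, best_err, best_X = mu0, np.inf, None
    for _ in range(max_steps):
        X_hat = solve_J(y, mu)                                     # hypothetical solver for (6)
        err = np.linalg.norm(X_hat - X_true) / np.linalg.norm(X_true)
        if err >= best_err:
            break                                                  # relative error stopped decreasing
        best_err, best_X, mu = err, X_hat, mu * shrink
    return best_X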

4.1 Validation of Theorem 2.4

In the first experiment we study the influence of the parameters α, β on the reconstruction accuracy. Figure 1 shows, first, the average relative error in reconstructing 100 randomly drawn X ∈ K^1_{20,20} ⊂ R^{20×300} from m = 160 measurements with ‖η‖_F = 0.05‖X‖_F and, second, the corresponding (effective) sparsity level measured relative to n_2 = 300, i.e., s/n_2 ∈ [0, 1]. In particular, we set α_2 = µ and compare the performance of Algorithm 1 for α_1 = √20 µ = β_1 = √20 β_2 (as proposed in Lemma 2.2) with α_1 = µ = β_1 = β_2.

The behavior is as predicted by Theorem 2.4: decreasing the parameter(s) shrinks the reconstruction error up to a small multiple of the noise level while the sparsity stays under control, cf. Lemma 2.2. As soon as the noise level is hit (to be precise, a slightly higher value caused by the injectivity constants γ and δ in Theorem 2.4), the assumptions of Lemma 2.2 fail, the regularity of the solution vanishes, and the approximation guarantee breaks. Note that setting all parameters α_1, α_2, β_1, β_2 to the same value performs better than the theoretically motivated choice, so we keep this in the remaining simulations. Furthermore, let us mention that sparse reconstruction (information-theoretic lower bound: m ≥ s_1 s_2 = 400) and low-rank matrix sensing (information-theoretic lower bound: m ≥ rank(X)(n_1 + n_2) = 320) would not allow reliable reconstruction of X in this setting.

As a byproduct, the experiment suggests the simple parameter choice heuristic of starting with large α, β and shrinking the parameters until there is a drastic change in regularity of the solution.

4.2 Validation of Theorems 2.7, 2.10, and 2.14

In a second experiment we compare the reconstruction performance of Algorithm 1 with respect to different measurement ensembles A. Figure 2 shows the average relative approximation error when reconstructing 100 randomly drawn X for varying m. We compare here three types of measurements corresponding to Theorems 2.7, 2.10, and 2.14: first, operators A whose components A_i have i.i.d. standard normal entries; second, operators A whose components A_i have i.i.d. log-normally distributed entries; third, operators A whose components satisfy A_i = a_i a_i^⊤ for Gaussian random vectors a_i, i ∈ [m]. While the first two choices (Figure 2a) allow us to re-use the setting of Section 4.1, i.e., X ∈ K^1_{20,20} ⊂ R^{20×300}, the Gaussian rank-1 measurements (Figure 2b) require square matrices. We thus consider X ∈ K^1_{10,10} ⊂ R^{50×50} in the third case. In all cases, the noise level is ‖η‖_F = 0.1‖X‖_F.


[Figure 1 shows, for both parameter choices, the relative error and the (effective) sparsity of V_final plotted against the regularization parameter, together with the initial error; curves: Relative Error, Initial Error, Relative Sparsity, Relative Effective Sparsity.]

(a) α_1 = √20 µ = β_1 = √20 β_2.    (b) α_1 = µ = β_1 = β_2.

Figure 1: Approximation quality and sparsity depending on parameter size. The approximation error is measured relative to ‖X‖_F, while the sparsity and effective sparsity of V_final are measured relative to rank(X)·n_2. For comparison, the approximation error of the initialization is added.

(Figure 2b) require square matrices. We thus consider X ∈ K^1_{10,10} ⊂ R^{50×50} in the third case. In all cases, the noise level is ‖η‖F = 0.1‖X‖F.

As Figure 2 shows, Algorithm 1 yields a good approximation for all three types of measurements. More importantly, the error is already close to the noise level for a number of measurements far below the number required for mere sparse or low-rank reconstruction. To substantiate this point, we provide for comparison the outcome of AltMinSense, a state-of-the-art method using low-rankness only [16].
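The three ensembles can be instantiated as follows; this is a sketch only, and the parameters of the log-normal distribution are an assumption (the text merely requires i.i.d. log-normal entries). The first two ensembles are used with n1 = 20, n2 = 300 as in Section 4.1, the rank-1 ensemble with n1 = n2 = 50.

```python
import numpy as np

def make_operator(kind, m, n1, n2, rng):
    """Return A as an (m, n1*n2) matrix acting on vec(X), i.e. A(X) = A @ X.ravel()."""
    if kind == "gaussian":            # i.i.d. standard normal entries of each A_i
        return rng.standard_normal((m, n1 * n2))
    if kind == "heavy_tailed":        # i.i.d. log-normal entries (parameters assumed)
        return rng.lognormal(mean=0.0, sigma=1.0, size=(m, n1 * n2))
    if kind == "rank1":               # A_i = a_i a_i^T for Gaussian a_i (square case)
        assert n1 == n2, "rank-1 measurements require square matrices"
        a = rng.standard_normal((m, n1))
        return np.stack([np.outer(ai, ai).ravel() for ai in a])
    raise ValueError(f"unknown ensemble: {kind}")
```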

[Figure 2: two panels; vertical axis: relative error; horizontal axis: number of measurements m (roughly 50 to 225 in (a) and 20 to 100 in (b)); curves: S3PCA and AltMinSense for each ensemble, plus the initialization error.]

(a) Gaussian and heavy-tailed measurements. (b) Gaussian rank-1 measurements.

Figure 2: Approximation for various choices of A. The approximation error is measured relative to ‖X‖F. For comparison, the state-of-the-art low-rank matrix sensing algorithm AltMinSense is added. Note that the information theoretic lower bound for guaranteed reconstruction of matrices using only the low-rank structure is m ≥ rank(X)(n1 + n2) = 320 in (a) and m ≥ 100 in (b).

4.3 Algorithm 1 vs SPF

After having provided empirical evidence for our theoretical results, we now turn to the comparison of Algorithm 1 with its state-of-the-art counterpart SPF [23]. To the best of our knowledge, SPF is so far the only available algorithm that simultaneously leverages low-rankness and sparsity constraints and comes with near-optimal recovery guarantees (without relying on a special structure of A as in [2]). As [23] contains exhaustive numerical comparisons of SPF with low-rank/sparse reconstruction strategies based on convex relaxation, SPF suffices as a numerical benchmark for Algorithm 1.

In Figure 3 we compare, for s/n2 ∈ [0, 0.3] and m/(n1n2) ∈ [0.05, 0.3], the number of successful recoveries of 10 randomly drawn X ∈ K^3_{16,s} ⊂ R^{16×100} from m Gaussian measurements. We set the noise level to ‖η‖F = 0.2‖X‖F and count a reconstruction as successful if ‖X − XSPF‖F /‖X‖F ≤ 0.4 resp. ‖X − Xα,β‖F /‖X‖F ≤ 0.4. The sparsity parameter s′ of SPF is optimized over the grid {5, 10, . . . , 45, 50}. In Figures 3 (a)-(b) we compare the two algorithms when their respective parameters are tuned with knowledge of X (Best Approximation), whereas in Figures 3 (c)-(d) only the noise level ‖η‖F is known (Discrepancy Principle). In the latter case, the parameter s′ of SPF minimizes |‖X − XSPF‖F − ‖η‖F | and the heuristic for Algorithm 1 reduces µ until |‖X − Xα,β‖F − ‖η‖F | stops decreasing. Both algorithms show comparable performance. Nevertheless, the theoretical guarantees of Algorithm 1 cover a considerably larger class of signals, and Algorithm 1 outputs non-orthogonal decompositions.
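The phase-transition diagrams are obtained by counting successes cell by cell over the (s, m) grid. A minimal sketch, in which recover stands for either tuned SPF or Algorithm 1 and sample_X for a draw from K^3_{16,s} (both placeholders, not part of the paper's code), reads:

```python
import numpy as np

def empirical_success(recover, sample_X, s, m, n1=16, n2=100,
                      trials=10, noise_level=0.2, threshold=0.4, seed=0):
    """One cell of the phase-transition diagrams in Figure 3.

    recover(y, A) -- placeholder for tuned SPF or Algorithm 1, returning an n1 x n2 matrix;
    sample_X(s)   -- placeholder drawing X from K^3_{16,s}.
    """
    rng = np.random.default_rng(seed)
    hits = 0
    for _ in range(trials):
        X = sample_X(s)                                  # ground truth in R^{16x100}
        A = rng.standard_normal((m, n1 * n2))            # Gaussian measurements
        eta = rng.standard_normal(m)
        eta *= noise_level * np.linalg.norm(X, "fro") / np.linalg.norm(eta)
        X_rec = recover(A @ X.ravel() + eta, A)
        rel_err = np.linalg.norm(X - X_rec, "fro") / np.linalg.norm(X, "fro")
        hits += rel_err <= threshold                     # success if error <= 0.4
    return hits / trials
```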

[Figure 3: four phase-transition panels; horizontal axis: s/n2 from 0.05 to 0.30; vertical axis: m/(n1n2) from 0.05 to 0.30.]

(a) SPF, Best Approximation. (b) Algorithm 1, Best Approximation. (c) SPF, Discrepancy Principle. (d) Algorithm 1, Discrepancy Principle.

Figure 3: Phase transition diagrams comparing SPF and Algorithm 1. Empirical recovery probability is depicted by color from zero (blue) to one (yellow).


5 Discussion and Open Questions

In this paper we proposed a multi-penalty approach to recover low-rank matrices with sparsity structure from incomplete and inaccurate linear measurements. To improve on the results of [9], we introduced a conceptually different functional J^R_{α,β} and a revised signal set K^R_{s1,s2}, the combination of which allowed for a rigorous analysis of the approximation quality of global minimizers. In particular, the new approach encompasses heavy-tailed measurement ensembles and structured rank-1 measurements. While existing theoretical guarantees for SPF with Gaussian measurements are still sharper when considering ground truths with jointly sparse SVD, cf. Remark 2.8, our method tackles the recovery of a significantly larger class of matrices, namely matrices with non-orthogonal rank-1 decompositions and effectively sparse components. Remarkably, the analysis of J^R_{α,β} is far more elementary than that of SPF.

Several intriguing questions remain open at this point. First, there is a gap between our bounds on m inTheorem 2.7 and the corresponding results in [23]. As mentioned above, this is partly due to the much largersignal set KR

s1,s2 , which covers non-orthogonal decomposable matrices. Nevertheless, it would be desirable tounderstand the information theoretic limit for this specific class of matrices to better evaluate the quality ofour bounds.

Second, we did not solve the problem of initialization here. A crucial task for the near future is thus to provide an initialization procedure that guarantees computation of global minimizers of J^R_{α,β} via Algorithms 1 & 3. Spectral initialization, the state-of-the-art procedure for non-convex methods in low-rank matrix sensing, certainly works for m sufficiently large. However, we doubt that it can be used to reconstruct at the information theoretic limit if sparsity and low-rankness are considered simultaneously.

Finally, further structured measurement ensembles should be examined, e.g., sub-sampled circulant matrices. This kind of measurement naturally appears in applications such as blind deconvolution [14].

Acknowledgments

The author gratefully acknowledges funding by the Deutsche Forschungsgemeinschaft (DFG, German Research Foundation) through the project CoCoMIMO funded within the priority program SPP 1798 Compressed Sensing in Information Processing (COSIP).

References

[1] H. Attouch, J. Bolte, P. Redont, and A. Soubeyran, “Proximal alternating minimization and projection methods for nonconvex problems: An approach based on the Kurdyka-Łojasiewicz inequality,” Mathematics of Operations Research, vol. 35, no. 2, pp. 438–457, 2010.

[2] S. Bahmani and J. Romberg, “Near-optimal estimation of simultaneously sparse and low-rank matrices from nested linear measurements,” Information and Inference: A Journal of the IMA, vol. 5, no. 3, pp. 331–351, 2016.

[3] T. Blumensath and M. E. Davies, “Iterative hard thresholding for compressed sensing,” Applied and Computational Harmonic Analysis, vol. 27, no. 3, pp. 265–274, 2009.

[4] J. Bolte, S. Sabach, and M. Teboulle, “Proximal alternating linearized minimization for nonconvex and nonsmooth problems,” Mathematical Programming, vol. 146, no. 1-2, pp. 459–494, 2014.

[5] E. J. Candes and Y. Plan, “Tight oracle inequalities for low-rank matrix recovery from a minimal number of noisy random measurements,” IEEE Transactions on Information Theory, vol. 57, no. 4, pp. 2342–2359, 2011.

[6] A. d’Aspremont, L. E. Ghaoui, M. I. Jordan, and G. R. Lanckriet, “A direct formulation for sparse PCA using semidefinite programming,” in Advances in Neural Information Processing Systems, 2005, pp. 41–48.

[7] H. Eisenmann, F. Krahmer, M. Pfeffer, and A. Uschmajew, “Riemannian thresholding methods for row-sparse and low-rank matrix recovery,” arXiv:2103.02356, 2021.


[8] X. Fernique, “Régularité des trajectoires des fonctions aléatoires gaussiennes,” in École d’Été de Probabilités de Saint-Flour IV—1974. Springer, 1975, pp. 1–96.

[9] M. Fornasier, J. Maly, and V. Naumova, “Robust recovery of low-rank matrices with non-orthogonal sparse decomposition from incomplete measurements,” Applied Mathematics and Computation, vol. 392, 2021.

[10] S. Foucart, R. Gribonval, L. Jacques, and H. Rauhut, “Jointly low-rank and bisparse recovery: Questions and partial answers,” Analysis and Applications, vol. 18, no. 01, pp. 25–48, 2020.

[11] S. Foucart and H. Rauhut, A Mathematical Introduction to Compressive Sensing. Birkhäuser Basel, 2013.

[12] J. Geppert, F. Krahmer, and D. Stöger, “Sparse power factorization: balancing peakiness and sample complexity,” Advances in Computational Mathematics, vol. 45, no. 3, pp. 1711–1728, 2019.

[13] M. Grasmair and V. Naumova, “Conditions on optimal support recovery in unmixing problems by means of multi-penalty regularization,” Inverse Problems, vol. 32, no. 10, p. 104007, 2016.

[14] S. Haykin, “The blind deconvolution problem,” Blind Deconvolution, p. 1, 1994.

[15] M. Iwen, A. Viswanathan, and Y. Wang, “Robust sparse phase retrieval made easy,” Applied and Computational Harmonic Analysis, vol. 42, no. 1, pp. 135–142, 2017.

[16] P. Jain, P. Netrapalli, and S. Sanghavi, “Low-rank matrix completion using alternating minimization,” Proceedings of the Forty-Fifth Annual ACM Symposium on Theory of Computing, pp. 665–674, 2013.

[17] P. Jung, F. Krahmer, and D. Stöger, “Blind demixing and deconvolution at near-optimal rate,” IEEE Transactions on Information Theory, vol. 64, no. 2, pp. 704–727, 2017.

[18] M. V. Klibanov, P. E. Sacks, and A. V. Tikhonravov, “The phase retrieval problem,” Inverse Problems, vol. 11, no. 1, p. 1, 1995.

[19] M. Kliesch, S. J. Szarek, and P. Jung, “Simultaneous structures in convex signal recovery—revisiting the convex combination of norms,” Frontiers in Applied Mathematics and Statistics, vol. 5, p. 23, 2019.

[20] F. Krahmer, S. Mendelson, and H. Rauhut, “Suprema of chaos processes and the restricted isometry property,” Communications on Pure and Applied Mathematics, vol. 67, no. 11, pp. 1877–1904, 2014.

[21] R. Kueng, H. Rauhut, and U. Terstiege, “Low rank matrix recovery from rank one measurements,” Applied and Computational Harmonic Analysis, vol. 42, no. 1, pp. 88–116, 2017.

[22] K. Lee, Y. Li, M. Junge, and Y. Bresler, “Blind recovery of sparse signals from subsampled convolution,” IEEE Transactions on Information Theory, vol. 63, no. 2, pp. 802–821, 2016.

[23] K. Lee, Y. Wu, and Y. Bresler, “Near-optimal compressed sensing of a class of sparse low-rank matrices via sparse power factorization,” IEEE Transactions on Information Theory, vol. 64, no. 3, pp. 1666–1698, 2018.

[24] G. Li, “Global error bounds for piecewise convex polynomials,” Mathematical Programming, vol. 137, no. 1-2, pp. 37–64, 2013.

[25] M. Magdon-Ismail, “NP-hardness and inapproximability of sparse PCA,” Information Processing Letters, vol. 126, pp. 35–38, 2017.

[26] S. Mendelson, “Learning without concentration,” in Conference on Learning Theory, 2014, pp. 25–39.

[27] B. S. Mordukhovich, Variational Analysis and Generalized Differentiation I: Basic Theory. Springer Science & Business Media, 2006, vol. 330.

[28] V. Naumova and S. Peter, “Minimization of multi-penalty functionals by alternating iterative thresholding and optimal parameter choices,” Inverse Problems, vol. 30, no. 12, p. 125003, 2014.


[29] S. Oymak, A. Jalali, M. Fazel, Y. C. Eldar, and B. Hassibi, “Simultaneously structured models with application to sparse and low-rank matrices,” IEEE Transactions on Information Theory, vol. 61, no. 5, pp. 2886–2908, 2015.

[30] Y. Plan and R. Vershynin, “One-bit compressed sensing by linear programming,” Communications on Pure and Applied Mathematics, vol. 66, no. 8, pp. 1275–1297, 2013.

[31] ——, “Dimension reduction by random hyperplane tessellations,” Discrete & Computational Geometry, vol. 51, no. 2, pp. 438–461, 2014.

[32] B. Recht, M. Fazel, and P. A. Parrilo, “Guaranteed minimum-rank solutions of linear matrix equations via nuclear norm minimization,” SIAM Review, vol. 52, no. 3, pp. 471–501, 2010.

[33] R. T. Rockafellar and R. J.-B. Wets, Variational Analysis. Springer Science & Business Media, 2009, vol. 317.

[34] M. Talagrand, The Generic Chaining: Upper and Lower Bounds of Stochastic Processes. Springer Science & Business Media, 2006.

[35] ——, Upper and Lower Bounds for Stochastic Processes: Modern Methods and Classical Problems. Springer Science & Business Media, 2014, vol. 60.

[36] J. A. Tropp, “Convex recovery of a structured signal from independent random linear measurements,” in Sampling Theory, a Renaissance. Springer, 2015, pp. 67–101.

[37] R. Vershynin, “Introduction to the non-asymptotic analysis of random matrices,” in Compressed Sensing: Theory and Applications. Cambridge Univ. Press, 2012, pp. 210–268.

[38] ——, High-Dimensional Probability: An Introduction with Applications in Data Science. Cambridge University Press, 2018, vol. 47.

[39] H. Zou and T. Hastie, “Regularization and variable selection via the elastic net,” Journal of the Royal Statistical Society: Series B (Statistical Methodology), vol. 67, no. 2, pp. 301–320, 2005.

[40] H. Zou, T. Hastie, and R. Tibshirani, “Sparse principal component analysis,” Journal of Computational and Graphical Statistics, vol. 15, no. 2, pp. 265–286, 2006.
