Lecture 15 Matrix deviation inequality
May 29 - June 3 2020
Matrix deviation inequality DAMC Lecture 15 May 29 - June 3 2020 1 22
$A$ is an $m \times n$ random matrix whose rows are independent, mean-zero, isotropic, and sub-gaussian random vectors in $\mathbb{R}^n$.
(If you find it helpful to think in terms of a concrete example, let the entries of $A$ be independent $N(0,1)$ random variables.)
For a fixed vector $x \in \mathbb{R}^n$ we have
$$\mathbb{E}\|Ax\|_2^2 = \mathbb{E}\sum_{j=1}^m (A_j x)^2 = \sum_{j=1}^m \mathbb{E}(A_j x)^2 = \sum_{j=1}^m x^T\,\mathbb{E}(A_j^T A_j)\,x = m\|x\|_2^2.$$
Further, if we assume that concentration about the mean holds here (and in fact it does), we should expect that
$$\|Ax\|_2 \approx \sqrt{m}\,\|x\|_2$$
with high probability.
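This concentration is easy to see numerically. The sketch below is not part of the lecture; it uses Gaussian rows and arbitrarily chosen dimensions $m = 400$, $n = 50$:

```python
import numpy as np

rng = np.random.default_rng(0)
m, n = 400, 50                       # arbitrary illustrative dimensions
x = rng.standard_normal(n)

# Rows of A are independent N(0, I_n) vectors: mean zero, isotropic, sub-gaussian.
# Average ||Ax||_2 / (sqrt(m) ||x||_2) over fresh draws of A; it should sit near 1.
trials = 200
norms = [np.linalg.norm(rng.standard_normal((m, n)) @ x) for _ in range(trials)]
ratio = np.mean(norms) / (np.sqrt(m) * np.linalg.norm(x))
print(ratio)  # close to 1
```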
Similarly to the Johnson–Lindenstrauss Lemma, our next goal is to make the last formula hold simultaneously over all vectors $x$ in some fixed set $T \subset \mathbb{R}^n$. Precisely, we may ask: how large is the average uniform deviation
$$\mathbb{E}\sup_{x\in T}\big|\|Ax\|_2 - \sqrt{m}\|x\|_2\big|\,?$$
This quantity should clearly depend on some notion of the size of $T$: the larger $T$ is, the larger the uniform deviation should be. So how can we quantify the size of $T$ for this problem? In the next section we will do precisely this: introduce a convenient geometric measure of the size of sets in $\mathbb{R}^n$, called the Gaussian width.
1 Gaussian width
Definition. Let $T \subset \mathbb{R}^n$ be a bounded set and let $g$ be a standard normal random vector in $\mathbb{R}^n$, i.e. $g \sim N(0, I_n)$. Then the quantities
$$\omega(T) = \mathbb{E}\sup_{x\in T}\langle g, x\rangle \qquad \text{and} \qquad \gamma(T) = \mathbb{E}\sup_{x\in T}|\langle g, x\rangle|$$
are called the Gaussian width of $T$ and the Gaussian complexity of $T$, respectively.
Gaussian width and Gaussian complexity are closely related. Indeed (Exercise),
$$2\omega(T) = \omega(T - T) = \mathbb{E}\sup_{x,y\in T}\langle g, x - y\rangle = \mathbb{E}\sup_{x,y\in T}|\langle g, x - y\rangle| = \gamma(T - T).$$
Gaussian width has a natural geometric interpretation. Suppose $g$ is a unit vector in $\mathbb{R}^n$. Then a moment's thought reveals that $\sup_{x,y\in T}\langle g, x - y\rangle$ is simply the width of $T$ in the direction of $g$, i.e. the distance between the two hyperplanes with normal $g$ that touch $T$ on both sides, as shown in the figure. Then $2\omega(T)$ can be obtained by averaging the width of $T$ over all directions $g$ in $\mathbb{R}^n$.
This reasoning is valid except where we assumed that $g$ is a unit vector. Instead, for $g \sim N(0, I_n)$ we have $\mathbb{E}\|g\|_2^2 = n$ and
$$\|g\|_2 \approx \sqrt{n} \quad \text{with high probability.}$$
(Check both of these claims using Bernstein's inequality.) Thus we need to scale by the factor $\sqrt{n}$. Ultimately, the geometric interpretation of the Gaussian width becomes the following: $\omega(T)$ is approximately $\sqrt{n}/2$ times larger than the usual geometric width of $T$ averaged over all directions.
A good exercise is to compute the Gaussian width and complexity of some simple sets, such as the unit balls of the $\ell_p$ norms in $\mathbb{R}^n$, which we denote by $B_p^n = \{x \in \mathbb{R}^n : \|x\|_p \le 1\}$. We have
$$\gamma(B_2^n) \sim \sqrt{n}, \qquad \gamma(B_1^n) \sim \sqrt{\log n}.$$
For any finite set $T \subset B_2^n$, we have $\gamma(T) \lesssim \sqrt{\log|T|}$. The same holds for the Gaussian width $\omega(T)$.
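These asymptotics can be checked by Monte Carlo simulation. The sketch below (not from the lecture) uses the closed forms $\gamma(B_2^n) = \mathbb{E}\|g\|_2$ and $\gamma(B_1^n) = \mathbb{E}\max_i|g_i|$, which follow by maximizing $\langle g, x\rangle$ over each ball:

```python
import numpy as np

rng = np.random.default_rng(1)
n, trials = 1000, 2000
G = rng.standard_normal((trials, n))    # each row is one sample of g ~ N(0, I_n)

# gamma(B_2^n) = E sup_{||x||_2 <= 1} |<g, x>| = E ||g||_2  ~  sqrt(n)
gamma_b2 = np.linalg.norm(G, axis=1).mean()

# gamma(B_1^n) = E sup_{||x||_1 <= 1} |<g, x>| = E max_i |g_i|  ~  sqrt(log n)
gamma_b1 = np.abs(G).max(axis=1).mean()

print(gamma_b2 / np.sqrt(n))            # ~ 1
print(gamma_b1 / np.sqrt(np.log(n)))    # a bounded constant
```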
2 Matrix deviation inequality
Theorem 1 (Matrix deviation inequality)
Let $A$ be an $m \times n$ matrix whose rows $A_i$ are independent, isotropic, and sub-gaussian random vectors in $\mathbb{R}^n$, and let $T \subset \mathbb{R}^n$ be a fixed bounded set. Then
$$\mathbb{E}\sup_{x\in T}\big|\|Ax\|_2 - \sqrt{m}\|x\|_2\big| \le CK^2\gamma(T),$$
where $K = \max_i \|A_i\|_{\psi_2}$ is the maximal sub-gaussian norm of the rows of $A$.
2.1 Tail bound

It is often useful to have results that hold with high probability rather than in expectation. There exists a high-probability version of the matrix deviation inequality, and it states the following.
Let $u \ge 0$. Then the event
$$\sup_{x\in T}\big|\|Ax\|_2 - \sqrt{m}\|x\|_2\big| \le CK^2\big[\gamma(T) + u\cdot\mathrm{rad}(T)\big]$$
holds with probability at least $1 - 2\exp(-u^2)$. Here $\mathrm{rad}(T)$ is the radius of $T$, defined as
$$\mathrm{rad}(T) = \sup_{x\in T}\|x\|_2.$$
Since rad(T ) ≲ γ(T ) we improve the bound as
supxisinT
983055983055983042Ax9830422 minusradicm983042x9830422
983055983055 ≲ K2uγ(T )
for all u ge 1 This is a weaker but still a useful inequality Forexample we can use it to bound all higher moments of thedeviation
983061E sup
xisinT
983055983055983042Ax9830422 minusradicm983042x9830422
983055983055p9830621p
983249 CpK2γ(T )
where Cp le Cradicp for p ge 1
2.2 Deviation of squares
It is sometimes helpful to bound the deviation of the square $\|Ax\|_2^2$, rather than $\|Ax\|_2$ itself. We can easily deduce the deviation of squares by using the identity
$$a^2 - b^2 = (a - b)^2 + 2b(a - b)$$
for $a = \|Ax\|_2$ and $b = \sqrt{m}\|x\|_2$. We obtain
$$\mathbb{E}\sup_{x\in T}\big|\|Ax\|_2^2 - m\|x\|_2^2\big| \le CK^4\gamma(T)^2 + CK^2\sqrt{m}\,\mathrm{rad}(T)\gamma(T).$$
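The identity $a^2 - b^2 = (a-b)^2 + 2b(a-b)$ is elementary and can be sanity-checked numerically (a trivial sketch, not part of the lecture):

```python
import numpy as np

rng = np.random.default_rng(2)
a = rng.random(1000)   # stand-ins for ||Ax||_2
b = rng.random(1000)   # stand-ins for sqrt(m) ||x||_2

# Expand the right-hand side: (a-b)^2 + 2b(a-b) = a^2 - 2ab + b^2 + 2ab - 2b^2 = a^2 - b^2.
lhs = a**2 - b**2
rhs = (a - b)**2 + 2 * b * (a - b)
print(np.allclose(lhs, rhs))  # True
```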
2.3 Deriving the Johnson–Lindenstrauss Lemma
Let $X \subset \mathbb{R}^n$ be a set of $N$ points and let $T = \{(x - y)/\|x - y\|_2 : x, y \in X \text{ distinct}\}$. Then $T$ is finite and we have
$$\gamma(T) \lesssim \sqrt{\log|T|} \le \sqrt{\log|X|^2} \lesssim \sqrt{\log|X|}.$$
The matrix deviation inequality and $m \ge C\varepsilon^{-2}\log N$ then yield
$$\sup_{x,y\in X}\left|\frac{\|A(x - y)\|_2}{\|x - y\|_2} - \sqrt{m}\right| \lesssim \sqrt{\log N} \le \varepsilon\sqrt{m}$$
with high probability, say $0.99$. Multiplying both sides by $\|x - y\|_2/\sqrt{m}$, we can write the last bound as follows. With probability at least $0.99$, we have
$$(1 - \varepsilon)\|x - y\|_2 \le \frac{1}{\sqrt{m}}\|Ax - Ay\|_2 \le (1 + \varepsilon)\|x - y\|_2$$
for all $x, y \in X$.
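This Johnson–Lindenstrauss-type guarantee is easy to test empirically. In the sketch below (not from the lecture), the point count $N = 30$ and distortion $\varepsilon = 0.5$ are arbitrary choices, and the constant $8$ in the choice of $m$ is a guess standing in for the unspecified constant $C$:

```python
import numpy as np
from itertools import combinations

rng = np.random.default_rng(3)
n, N, eps = 200, 30, 0.5
X = rng.standard_normal((N, n))              # N arbitrary points in R^n

m = int(np.ceil(8 * eps**-2 * np.log(N)))    # m ~ C eps^-2 log N; C = 8 is a guess
A = rng.standard_normal((m, n))              # Gaussian random projection

# Worst relative distortion of pairwise distances under x -> Ax / sqrt(m).
worst = max(
    abs(np.linalg.norm(A @ (X[i] - X[j]))
        / (np.sqrt(m) * np.linalg.norm(X[i] - X[j])) - 1)
    for i, j in combinations(range(N), 2)
)
print(worst)  # below eps with high probability
```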
3 Covariance estimation
We already showed that $N \sim n\log n$ samples are enough to estimate the covariance matrix of a general distribution in $\mathbb{R}^n$. We can do better if the distribution is sub-gaussian: we can get rid of the logarithmic oversampling and of the boundedness condition.
Theorem 2 (Covariance estimation for sub-gaussian distributions)
Let $X$ be a random vector in $\mathbb{R}^n$ with covariance matrix $\Sigma$. Suppose $X$ is sub-gaussian, and more specifically, for any $x \in \mathbb{R}^n$,
$$\|\langle X, x\rangle\|_{\psi_2} \lesssim \|\langle X, x\rangle\|_{L^2} = \|\Sigma^{1/2}x\|_2.$$
Then for every $N \ge 1$ we have
$$\mathbb{E}\|\Sigma_N - \Sigma\| \lesssim \|\Sigma\|\left(\sqrt{\frac{n}{N}} + \frac{n}{N}\right).$$

This result shows that $N \sim \varepsilon^{-2} n$ samples give $\mathbb{E}\|\Sigma_N - \Sigma\| \lesssim \varepsilon\|\Sigma\|$.
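A small simulation (not part of the lecture; the covariance matrix below is a hypothetical example) illustrates the $\sqrt{n/N}$ decay of the error for the sample covariance $\Sigma_N = \frac{1}{N}\sum_i X_iX_i^T$ of a mean-zero Gaussian vector:

```python
import numpy as np

rng = np.random.default_rng(4)
n = 20
# Hypothetical covariance with decaying spectrum: Sigma = U diag(1/k) U^T.
U, _ = np.linalg.qr(rng.standard_normal((n, n)))
eigs = 1.0 / (1 + np.arange(n))
Sigma = (U * eigs) @ U.T
root = (U * np.sqrt(eigs)) @ U.T                # Sigma^{1/2}

def rel_error(N):
    X = rng.standard_normal((N, n)) @ root      # X_i = Sigma^{1/2} Z_i, Z_i ~ N(0, I_n)
    Sigma_N = X.T @ X / N                       # sample covariance (mean-zero case)
    return np.linalg.norm(Sigma_N - Sigma, 2) / np.linalg.norm(Sigma, 2)

print(rel_error(50), rel_error(20000))          # error shrinks roughly like sqrt(n/N)
```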
Proof. We first bring the random vectors $X, X_1, \dots, X_N$ to the isotropic position. This can be done by a suitable linear transformation: you can easily check that there exist isotropic random vectors $Z, Z_1, \dots, Z_N$ such that
$$X = \Sigma^{1/2}Z, \qquad X_i = \Sigma^{1/2}Z_i, \quad i = 1, \dots, N.$$
The sub-gaussian assumption implies that $\|Z\|_{\psi_2} \lesssim 1$. Then
$$\|\Sigma_N - \Sigma\| = \|\Sigma^{1/2}R_N\Sigma^{1/2}\| = \max_{\|x\|_2 = 1}\big|\langle\Sigma^{1/2}R_N\Sigma^{1/2}x, x\rangle\big|, \qquad \text{where} \quad R_N = \frac{1}{N}\sum_{i=1}^N Z_iZ_i^T - I_n.$$
Let $T = \{\Sigma^{1/2}x : x \in \mathbb{R}^n, \|x\|_2 = 1\}$. We can rewrite $\|\Sigma_N - \Sigma\|$ as
$$\|\Sigma_N - \Sigma\| = \max_{x\in T}\big|\langle R_N x, x\rangle\big| = \max_{x\in T}\left|\frac{1}{N}\sum_{i=1}^N\langle Z_i, x\rangle^2 - \|x\|_2^2\right| = \frac{1}{N}\max_{x\in T}\big|\|Ax\|_2^2 - N\|x\|_2^2\big|,$$
where $A$ denotes the $N \times n$ matrix with rows $Z_i$.
Now apply the matrix deviation inequality for squares to conclude that
$$\mathbb{E}\|\Sigma_N - \Sigma\| \lesssim \frac{1}{N}\Big(\gamma(T)^2 + \sqrt{N}\,\mathrm{rad}(T)\gamma(T)\Big).$$
The radius and Gaussian width of the ellipsoid $T$ are easy to compute:
$$\mathrm{rad}(T) = \|\Sigma\|^{1/2} \qquad \text{and} \qquad \gamma(T) \le (\operatorname{tr}\Sigma)^{1/2}.$$
Using $\operatorname{tr}(\Sigma) \le n\|\Sigma\|$, we obtain
$$\mathbb{E}\|\Sigma_N - \Sigma\| \lesssim \|\Sigma\|\left(\sqrt{\frac{n}{N}} + \frac{n}{N}\right). \qquad \square$$
3.1 Low-dimensional distributions
We can show that far fewer samples are needed for covariance estimation of low-dimensional sub-gaussian distributions. Indeed, the proof above actually yields
$$\mathbb{E}\|\Sigma_N - \Sigma\| \lesssim \|\Sigma\|\left(\sqrt{\frac{r}{N}} + \frac{r}{N}\right),$$
where
$$r = r(\Sigma^{1/2}) = \frac{\operatorname{tr}\Sigma}{\|\Sigma\|}$$
is the stable rank of $\Sigma^{1/2}$. This means that covariance estimation is possible with
$$N \sim r$$
samples.
4 Underdetermined linear equations
Suppose we need to solve a severely underdetermined system of linear equations: say, we have $m$ equations in $n \gg m$ variables,
$$Ax = y.$$
When the linear system is underdetermined, we cannot find $x$ with any accuracy unless we know something extra about $x$. So let us assume that we do have some a priori information. We can describe this situation mathematically by assuming that
$$x \in K,$$
where $K \subset \mathbb{R}^n$ is some known set that describes anything we know about $x$ a priori.

Summarizing, here is the problem we are trying to solve: determine a solution $x = x(A, y, K)$ to the underdetermined linear equation $Ax = y$ as accurately as possible, assuming that $x \in K$.
4.1 An optimization approach
We convert the set $K$ into a function on $\mathbb{R}^n$, called the Minkowski functional of $K$. This is basically a function whose level sets are multiples of $K$.

To define it formally, assume that $K$ is star-shaped, which means that together with any point $x$, the set $K$ must contain the entire interval that connects $x$ with the origin; see the figure for an illustration. (The set on the left, whose boundary is shown, is star-shaped; the set on the right is not.)
The Minkowski functional of $K$ is defined as
$$\|x\|_K = \inf\{t > 0 : x/t \in K\}, \qquad x \in \mathbb{R}^n.$$
If the set $K$ is convex and symmetric about the origin, $\|x\|_K$ is actually a norm on $\mathbb{R}^n$ (Exercise).

Now we propose the following way to solve the recovery problem: solve the optimization program
$$\min \|x'\|_K \quad \text{subject to} \quad y = Ax'.$$
Note that this is a very natural program: it looks at all solutions to the equation $y = Ax'$ and tries to "shrink" the solution $x'$ toward $K$. (This is what minimization of the Minkowski functional is about.) Also note that if $K$ is convex, this is a convex optimization program, and thus can be solved effectively by one of the many available numerical algorithms.
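As a concrete sanity check (a sketch, not from the lecture): for $K = B_1^n$ the Minkowski functional should coincide with the $\ell_1$ norm, and a simple bisection on $t$ against a membership oracle recovers it:

```python
import numpy as np

def minkowski(x, in_K, tol=1e-9):
    """||x||_K = inf{t > 0 : x/t in K}, by bisection (assumes ||x||_K < 1e6)."""
    lo, hi = 0.0, 1e6
    while hi - lo > tol:
        t = (lo + hi) / 2
        if in_K(x / t):     # x/t in K: the infimum is at most t
            hi = t
        else:               # x/t outside K: the infimum exceeds t
            lo = t
    return hi

in_l1_ball = lambda v: np.abs(v).sum() <= 1.0   # membership oracle for K = B_1^n
x = np.array([1.0, -2.0, 0.5])
print(minkowski(x, in_l1_ball), np.abs(x).sum())  # both ~ 3.5
```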
The main question we should now be asking is: would the solution to this program approximate the original vector $x$? The following result bounds the approximation error for a probabilistic model of linear equations.
Theorem 3 (Recovery by optimization)
Assume that $A$ is an $m \times n$ random matrix whose rows $A_i$ are independent, isotropic, and sub-gaussian random vectors in $\mathbb{R}^n$. The solution $\widehat{x}$ of the optimization problem satisfies
$$\mathbb{E}\|\widehat{x} - x\|_2 \lesssim \frac{\omega(K)}{\sqrt{m}},$$
where $\omega(K)$ is the Gaussian width of $K$.

Proof. Both the original vector $x$ and the solution $\widehat{x}$ are feasible vectors for the optimization program. Since $\widehat{x}$ minimizes the Minkowski functional and $x \in K$ means $\|x\|_K \le 1$, we have $\|\widehat{x}\|_K \le \|x\|_K \le 1$. Thus both $\widehat{x}, x \in K$.
Since $A\widehat{x} = Ax = y$, we have $A(\widehat{x} - x) = 0$. Let us apply the matrix deviation inequality for $T = K - K$. It gives
$$\mathbb{E}\sup_{u,v\in K}\big|\|A(u - v)\|_2 - \sqrt{m}\|u - v\|_2\big| \lesssim \gamma(T) = 2\omega(K).$$
Substitute $u = \widehat{x}$ and $v = x$ here. We may do this since, as we noted above, both of these vectors belong to $K$. But then the term $\|A(u - v)\|_2$ equals zero; it disappears from the bound, and we get
$$\sqrt{m}\,\mathbb{E}\|\widehat{x} - x\|_2 \lesssim \omega(K).$$
Dividing both sides by $\sqrt{m}$, we complete the proof. $\square$

This theorem says that a signal $x \in K$ can be efficiently recovered from
$$m \sim \omega(K)^2$$
random linear measurements.
5 Sparse recovery
Suppose we know that the signal $x$ is sparse, which means that only a few coordinates of $x$ are nonzero. As before, our task is to recover $x$ from the random linear measurements given by the vector
$$y = Ax,$$
where $A$ is an $m \times n$ random matrix.

The number of nonzero coefficients of a vector $x \in \mathbb{R}^n$, or the sparsity of $x$, is often denoted $\|x\|_0$. This is reminiscent of the notation for the $\ell_p$ norm $\|x\|_p = (\sum_{i=1}^n |x_i|^p)^{1/p}$, and for a reason: you can quickly check that
$$\|x\|_0 = \lim_{p\to 0^+}\|x\|_p^p.$$
Keep in mind that neither $\|x\|_0$ nor $\|x\|_p$ for $0 < p < 1$ is actually a norm on $\mathbb{R}^n$, since they fail the axioms of a norm ($\|x\|_0$ is not homogeneous, and $\|x\|_p$ for $p < 1$ fails the triangle inequality).
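The limit $\|x\|_0 = \lim_{p\to 0^+}\|x\|_p^p$ can be seen directly by computing $\|x\|_p^p = \sum_i |x_i|^p$ for small $p$ (a quick sketch, not part of the lecture):

```python
import numpy as np

x = np.array([3.0, 0.0, -0.5, 0.0, 2.0])    # ||x||_0 = 3
for p in [1.0, 0.5, 0.1, 0.001]:
    # Each nonzero |x_i|^p -> 1 and each zero contributes 0, so the sum -> ||x||_0.
    print(p, np.sum(np.abs(x) ** p))        # -> 3 as p -> 0
```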
Our first attempt to recover $x$ is to try the following optimization problem:
$$\min \|x'\|_0 \quad \text{subject to} \quad y = Ax'.$$
This is sensible because this program selects the sparsest feasible solution. But there is an implementation caveat: the function $f(x) = \|x\|_0$ is highly non-convex and even discontinuous. There is simply no known algorithm to solve this optimization problem efficiently.

To overcome this difficulty, we use instead
$$\min \|x'\|_1 \quad \text{subject to} \quad y = Ax'.$$
This is a convexification of the non-convex program, and a variety of numerical convex optimization methods are available to solve it efficiently.

We will now show that an $s$-sparse signal $x \in \mathbb{R}^n$ can be efficiently recovered from $m \sim s\log n$ random linear measurements.
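The $\ell_1$ program is a linear program in disguise: writing $x' = x^+ - x^-$ with $x^\pm \ge 0$ turns $\min\|x'\|_1$ into minimizing $\sum_i(x_i^+ + x_i^-)$ under linear constraints. Here is a sketch (not part of the lecture) using SciPy's LP solver, with arbitrarily chosen small dimensions:

```python
import numpy as np
from scipy.optimize import linprog

rng = np.random.default_rng(5)
n, m, s = 60, 30, 2                        # arbitrary sizes with m well above s log n
x = np.zeros(n)
x[rng.choice(n, size=s, replace=False)] = rng.standard_normal(s)   # s-sparse signal
A = rng.standard_normal((m, n))            # Gaussian measurement matrix
y = A @ x

# min sum(x_plus + x_minus)  s.t.  A x_plus - A x_minus = y,  x_plus, x_minus >= 0
res = linprog(c=np.ones(2 * n), A_eq=np.hstack([A, -A]), b_eq=y, bounds=(0, None))
x_hat = res.x[:n] - res.x[n:]
print(np.linalg.norm(x_hat - x))           # ~ 0: exact recovery at these sizes
```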
Theorem 4 (Sparse recovery by optimization)
Assume $A$ is a random matrix as in Theorem 3. If an unknown vector $x \in \mathbb{R}^n$ has at most $s$ nonzero coordinates, i.e. $\|x\|_0 \le s$, then the solution $\widehat{x}$ of the $\ell_1$ optimization program satisfies
$$\mathbb{E}\|\widehat{x} - x\|_2 \lesssim \sqrt{\frac{s\log n}{m}}\,\|x\|_2.$$

Proof. Since $x$ is $s$-sparse, the Cauchy–Schwarz inequality shows that $\|x\|_1 \le \sqrt{s}\,\|x\|_2$. Denote the unit ball of the $\ell_1$ norm in $\mathbb{R}^n$ by $B_1^n$. Then we can rewrite $\|x\|_1 \le \sqrt{s}\,\|x\|_2$ as the inclusion
$$x \in \sqrt{s}\,\|x\|_2\cdot B_1^n =: K.$$
By the bound on the Gaussian width of $B_1^n$, we have
$$\omega(K) = \sqrt{s}\,\|x\|_2\cdot\omega(B_1^n) \le \sqrt{s}\,\|x\|_2\cdot\gamma(B_1^n) \lesssim \sqrt{s}\,\|x\|_2\cdot\sqrt{\log n}.$$
Substituting this into Theorem 3 completes the proof. $\square$
A is an mtimes n random matrix whose rows are independent meanzero isotropic and sub-gaussian random vectors in Rn
(If you find it helpful to think in terms of concrete examples letthe entries of A be independent N (0 1) random variables)
For a fixed vector x isin Rn we have
E983042Ax98304222 = Em983131
j=1
(Ajx)2 =
m983131
j=1
E(Ajx)2
=
m983131
j=1
xTE(ATjAj)x = m983042x98304222
Further if we assume that concentration about the mean holdshere (and in fact it does) we should expect that
983042Ax9830422 asympradicm983042x9830422
with high probability
Matrix deviation inequality DAMC Lecture 15 May 29 - June 3 2020 2 22
Similarly to Johnson-Lindenstrauss Lemma our next goal is tomake the last formula hold simultaneously over all vectors x insome fixed set T sub Rn Precisely we may ask ndash how large is theaverage uniform deviation
E supxisinT
983055983055983042Ax9830422 minusradicm983042x9830422
983055983055
This quantity should clearly depend on some notion of the size ofT the larger T the larger should the uniform deviation be Sohow can we quantify the size of T for this problem In the nextsection we will do precisely this introduce a convenient geometricmeasure of the sizes of sets in Rn which is called Gaussian width
Matrix deviation inequality DAMC Lecture 15 May 29 - June 3 2020 3 22
1 Gaussian width
Definition Let T sub Rn be a bounded set and g be a standardnormal random vector in Rn ie g sim N (0 In) Then thequantities
ω(T ) = E supxisinT
〈gx〉 and γ(T ) = E supxisinT
|〈gx〉|
are called the Gaussian width of T and the Gaussian complexity ofT respectively
Gaussian width and Gaussian complexity are closely relatedIndeed (Exercise)
2ω(T ) = ω(T minus T ) = E supxyisinT
〈gxminus y〉
= E supxyisinT
|〈gxminus y〉|
= γ(T minus T )
Matrix deviation inequality DAMC Lecture 15 May 29 - June 3 2020 4 22
Gaussian width has a natural geometric interpretation Suppose gis a unit vector in Rn Then a momentrsquos thought reveals thatsupxyisinT 〈gxminus y〉 is simply the width of T in the direction of gie the distance between the two hyperplanes with normal g thattouch T on both sides as shown in the figure Then 2ω(T ) can beobtained by averaging the width of T over all directions g in Rn
Matrix deviation inequality DAMC Lecture 15 May 29 - June 3 2020 5 22
This reasoning is valid except where we assumed that g is a unitvector Instead for g sim N (0 In) we have E983042g9830422 = n and
983042g9830422 asympradicn with high probability
(Check both these claims using Bernsteinrsquos inequality) Thus weneed to scale by the factor
radicn Ultimately the geometric
interpretation of the Gaussian width becomes the following ω(T )is approximately
radicn2 larger than the usual geometric width of
T averaged over all directions
A good exercise is to compute the Gaussian width and complexityfor some simple sets such as the unit balls of the ℓp norms in Rnwhich we denote by Bn
p = x isin Rn 983042x983042p le 1 We have
γ(Bn2 ) sim
radicn γ(Bn
1 ) sim983155
log n
For any finite set T sub Bn2 we have γ(T ) ≲
983155log |T | The same
holds for Gaussian width ω(T )
Matrix deviation inequality DAMC Lecture 15 May 29 - June 3 2020 6 22
2 Matrix deviation inequality
Theorem 1 (Matrix deviation inequality)
Let A be an mtimes n matrix whose rows Ai are independent isotropicand sub-gaussian random vectors in Rn Let T sub Rn be a fixed boundedset Then
E supxisinT
983055983055983042Ax9830422 minusradicm983042x9830422
983055983055 le CK2γ(T )
whereK = max
i983042Ai983042ψ2
is the maximal sub-gaussian norm of the rows of A
21 Tail bound
It is often useful to have results that hold with high probabilityrather than in expectation There exists a high-probability versionof the matrix deviation inequality and it states the following
Matrix deviation inequality DAMC Lecture 15 May 29 - June 3 2020 7 22
Let u ge 0 Then the event
supxisinT
983055983055983042Ax9830422 minusradicm983042x9830422
983055983055 983249 CK2[γ(T ) + u middot rad(T )]
holds with probability at least 1minus 2 exp(minusu2) Here rad(T ) is theradius of T defined as
rad(T ) = supxisinT
983042x9830422
Since rad(T ) ≲ γ(T ) we improve the bound as
supxisinT
983055983055983042Ax9830422 minusradicm983042x9830422
983055983055 ≲ K2uγ(T )
for all u ge 1 This is a weaker but still a useful inequality Forexample we can use it to bound all higher moments of thedeviation
983061E sup
xisinT
983055983055983042Ax9830422 minusradicm983042x9830422
983055983055p9830621p
983249 CpK2γ(T )
where Cp le Cradicp for p ge 1
Matrix deviation inequality DAMC Lecture 15 May 29 - June 3 2020 8 22
22 Deviation of squares
It is sometimes helpful to bound the deviation of the square983042Ax98304222 rather than 983042Ax9830422 itself
We can easily deduce the deviation of squares by using the identity
a2 minus b2 = (aminus b)2 + 2b(bminus a)
for a = 983042Ax9830422 and b =radicm983042x9830422
We have
E supxisinT
983055983055983042Ax98304222 minusm983042x98304222983055983055 le CK4γ(T )2 + CK2radicmrad(T )γ(T )
Matrix deviation inequality DAMC Lecture 15 May 29 - June 3 2020 9 22
23 Deriving Johnson-Lindenstrauss Lemma
X sub Rn and T = (xminus y)983042xminus y9830422 xy isin X Then T is finiteand we have
γ(T ) ≲983155
log |T | le983155
log |X |2 ≲983155
log |X |
Matrix deviation inequality and m ge Cεminus2 logN then yield
supxyisinX
983055983055983055983055983042A(xminus y)9830422983042xminus y9830422
minusradicm
983055983055983055983055 ≲983155
logN le εradicm
with high probability say 099 Multiplying both sides by983042xminus y9830422
radicm we can write the last bound as follows With
probability at least 099 we have
(1minus ε)983042xminus y9830422 9832491radicm983042AxminusAy9830422 983249 (1 + ε)983042xminus y9830422
for all xy isin X
Matrix deviation inequality DAMC Lecture 15 May 29 - June 3 2020 10 22
3 Covariance estimation
We already showed that N sim n log n samples are enough toestimate the covariance matrix of a general distribution in Rn
We can do better if the distribution is sub-gaussian we can get ridof the logarithmic oversampling and the boundedness condition
Theorem 2 (Covariance estimation for sub-gaussian distributions)
Let X be a random vector in Rn with covariance matrix Σ Suppose Xis sub-gaussian and more specifically for any x isin Rn
983042〈Xx〉983042ψ2 ≲ 983042〈Xx〉983042L2 = 983042Σ12x9830422
Then for every N ge 1 we have
E983042ΣN minus Σ983042 ≲ 983042Σ983042983061983157
n
N+
n
N
983062
This result shows N sim εminus2n gives E983042ΣN minus Σ983042 ≲ ε983042Σ983042
Matrix deviation inequality DAMC Lecture 15 May 29 - June 3 2020 11 22
Proof We first bring the random vectors XX1 XN to theisotropic position This can be done by a suitable lineartransformation You will easily check that there exists isotropicrandom vectors ZZ1 ZN such that
X = Σ12Z Xi = Σ12Zi i = 1 N
The sub-gaussian assumption implies that
983042Z983042ψ2 ≲ 1
Then
983042ΣN minus Σ983042 = 983042Σ12RNΣ12983042 = max983042x9830422=1
〈Σ12RNΣ12xx〉
where
RN =1
N
N983131
i=1
ZiZTi minus In
Matrix deviation inequality DAMC Lecture 15 May 29 - June 3 2020 12 22
Let T = Σ12x isin Rn 983042x9830422 = 1 We can rewrite 983042ΣN minus Σ983042 as
983042ΣN minus Σ983042 = maxxisinT
〈RNxx〉 = maxxisinT
9830559830559830559830551
N
983131N
i=1〈Zix〉2 minus 983042x98304222
983055983055983055983055
=1
NmaxxisinT
983055983055983042Ax98304222 minusN983042x98304222983055983055
Now apply the matrix deviation inequality for squares to conclude that
E 983042ΣN minus Σ983042 ≲ 1
N
983059γ(T )2 +
radicN rad(T )γ(T )
983060
The radius and Gaussian width of the ellipsoid T are easy to compute
rad(T ) = 983042Σ98304212 and γ(T ) le (trΣ)12
By using tr(Σ) le n983042Σ983042 we have
E983042ΣN minus Σ983042 ≲ 983042Σ983042983061983157
n
N+
n
N
983062
Matrix deviation inequality DAMC Lecture 15 May 29 - June 3 2020 13 22
31 Low-dimensional distributions
We can show that much fewer samples are needed for covarianceestimation of low-dimensional sub-gaussian distributions Indeedthe proof actually yields
E983042ΣN minus Σ983042 ≲ 983042Σ983042983061983157
r
N+
r
N
983062
where
r = r(Σ12) =trΣ
983042Σ983042
is the stable rank of Σ12 This means that covariance estimationis possible with
N sim r
samples
Matrix deviation inequality DAMC Lecture 15 May 29 - June 3 2020 14 22
4 Underdetermined linear equations
Suppose we need to solve a severely underdetermined system oflinear equations say we have m equations in n ≫ m variables
Ax = y
When the linear system is underdetermined we can not find xwith any accuracy unless we know something extra about x Solet us assume that we do have some a-priori information We candescribe this situation mathematically by assuming that
x isin K
where K sub Rn is some known set in Rn that describes anythingthat we know about x a-priori
Summarizing here is the problem we are trying to solveDetermine a solution x = x(AyK) to the underdetermined linearequation Ax = y as accurately as possible assuming that x isin K
Matrix deviation inequality DAMC Lecture 15 May 29 - June 3 2020 15 22
41 An optimization approach
We convert the set K into a function on Rn which is called theMinkowski functional of K This is basically a function whose levelsets are multiples of K
To define it formally assume that K is star-shaped which meansthat together with any point x the set K must contain the entireinterval that connects x with the origin see the figure forillustration
The set on the left (whose boundary is shown) is star-shaped theset on the right is not
Matrix deviation inequality DAMC Lecture 15 May 29 - June 3 2020 16 22
The Minkowski functional of K is defined as
983042x983042K = inft gt 0 xt isin K x isin Rn
If the set K is convex and symmetric about the origin 983042x983042K isactually a norm on Rn (Exercise)
Now we propose the following way to solve the recovery problemsolve the optimization program
min 983042xprime983042K subject to y = Axprime
Note that this is a very natural program it looks at all solutions tothe equation y = Axprime and tries to ldquoshrinkrdquo the solution xprime towardK (This is what minimization of Minkowski functional is about)
Also note that if K is convex this is a convex optimizationprogram and thus can be solved effectively by one of the manyavailable numeric algorithms
Matrix deviation inequality DAMC Lecture 15 May 29 - June 3 2020 17 22
The main question we should now be asking is ndash would thesolution to this program approximate the original vector x Thefollowing result bounds the approximation error for a probabilisticmodel of linear equations
Theorem 3 (Recovery by optimization)
Assume that A is an mtimes n random matrix whose rows Ai areindependent isotropic and sub-gaussian random vectors in Rn Thesolution 983141x of the optimization problem satisfies
E983042983141xminus x9830422 ≲ω(K)radic
m
where ω(K) is the Gaussian width of K
Proof Both the original vector x and the solution 983141x are feasible vectorsfor the optimization program Then 983042983141x983042K le 983042x983042K le 1 Thus both983141xx isin K
Matrix deviation inequality DAMC Lecture 15 May 29 - June 3 2020 18 22
By A983141x = Ax = y we have A(983141xminus x) = 0 Let us apply matrixdeviation inequality for T = K minusK It gives
supuvisinK
983055983055983042A(uminus v)9830422 minusradicm983042uminus v9830422
983055983055 ≲ γ(T ) = 2ω(K)
Substitute u = 983141x and v = x here We may do this since as we notedabove both these vectors belong to K But then the term 983042A(uminus v)9830422will equal zero It disappears from the bound and we get
Eradicm983042983141xminus x9830422 ≲ ω(K)
Dividing both sides byradicm we complete the proof
This theorem says that a signal x isin K can be efficiently recovered from
m sim ω(K)2
random linear measurements
Matrix deviation inequality DAMC Lecture 15 May 29 - June 3 2020 19 22
5 Sparse recovery
Suppose we know that the signal x is sparse which means thatonly a few coordinates of x are nonzero As before our task is torecover x from the random linear measurements given by thevector
y = Ax
where A is an mtimes n random matrix
The number of nonzero coefficients of a vector x isin Rn or thesparsity of x is often denoted 983042x9830420 This is reminiscent of thenotation for the ℓp norm 983042x983042p = (
983123ni=1 |xi|p)1p and for a reason
You can quickly check that
983042x9830420 = limprarr0
983042x983042p
Keep in mind that neither 983042x9830420 nor 983042x983042p for 0 lt p lt 1 areactually norms on Rn since they fail triangle inequality
Matrix deviation inequality DAMC Lecture 15 May 29 - June 3 2020 20 22
Our first attempt to recover x is to try the following optimizationproblem
min 983042xprime9830420 subject to y = Axprime
This is sensible because this program selects the sparsest feasiblesolution But there is an implementation caveat the functionf(x) = 983042x9830420 is highly non-convex and even discontinuous There issimply no known algorithm to solve the optimization problemefficiently
To overcome this difficulty we use
min 983042xprime9830421 subject to y = Axprime
This is a convexification of the non-convex program and a varietyof numeric convex optimization methods are available to solve itefficiently
We will now show that an s-sparse signal x isin Rn can be efficientlyrecovered from m sim s log n random linear measurements
Matrix deviation inequality DAMC Lecture 15 May 29 - June 3 2020 21 22
Theorem 4 (Sparse recovery by optimization)
Assume A is a random matrix as in Theorem 3 If an unknown vectorx isin Rn has at most s non-zero coordinates ie 983042x9830420 le s then thesolution 983141x of the ℓ1 optimization program satisfies
E983042983141xminus x9830422 ≲983155
(s log n)m983042x9830422
Proof Cauchy-Schwarz inequality shows that 983042x9830421 leradics983042x9830422 Denote
the unit ball of the ℓ1 norm in Rn by Bn1 Then we can rewrite
983042x9830421 leradics983042x9830422 as the inclusion
x isinradics983042x9830422 middot Bn
1 = K
By the Gaussian width of Bn1 we have
ω(K) =radics983042x9830422 middot ω(Bn
1 ) leradics983042x9830422 middot γ(Bn
1 ) leradics983042x9830422 middot
983155log n
Substitute this in Theorem 3 and complete the proof
Matrix deviation inequality DAMC Lecture 15 May 29 - June 3 2020 22 22
Similarly to Johnson-Lindenstrauss Lemma our next goal is tomake the last formula hold simultaneously over all vectors x insome fixed set T sub Rn Precisely we may ask ndash how large is theaverage uniform deviation
E supxisinT
983055983055983042Ax9830422 minusradicm983042x9830422
983055983055
This quantity should clearly depend on some notion of the size ofT the larger T the larger should the uniform deviation be Sohow can we quantify the size of T for this problem In the nextsection we will do precisely this introduce a convenient geometricmeasure of the sizes of sets in Rn which is called Gaussian width
Matrix deviation inequality DAMC Lecture 15 May 29 - June 3 2020 3 22
1 Gaussian width
Definition Let T sub Rn be a bounded set and g be a standardnormal random vector in Rn ie g sim N (0 In) Then thequantities
ω(T ) = E supxisinT
〈gx〉 and γ(T ) = E supxisinT
|〈gx〉|
are called the Gaussian width of T and the Gaussian complexity ofT respectively
Gaussian width and Gaussian complexity are closely relatedIndeed (Exercise)
2ω(T ) = ω(T minus T ) = E supxyisinT
〈gxminus y〉
= E supxyisinT
|〈gxminus y〉|
= γ(T minus T )
Matrix deviation inequality DAMC Lecture 15 May 29 - June 3 2020 4 22
Gaussian width has a natural geometric interpretation Suppose gis a unit vector in Rn Then a momentrsquos thought reveals thatsupxyisinT 〈gxminus y〉 is simply the width of T in the direction of gie the distance between the two hyperplanes with normal g thattouch T on both sides as shown in the figure Then 2ω(T ) can beobtained by averaging the width of T over all directions g in Rn
Matrix deviation inequality DAMC Lecture 15 May 29 - June 3 2020 5 22
This reasoning is valid except where we assumed that g is a unitvector Instead for g sim N (0 In) we have E983042g9830422 = n and
983042g9830422 asympradicn with high probability
(Check both these claims using Bernsteinrsquos inequality) Thus weneed to scale by the factor
radicn Ultimately the geometric
interpretation of the Gaussian width becomes the following ω(T )is approximately
radicn2 larger than the usual geometric width of
T averaged over all directions
A good exercise is to compute the Gaussian width and complexityfor some simple sets such as the unit balls of the ℓp norms in Rnwhich we denote by Bn
p = x isin Rn 983042x983042p le 1 We have
γ(Bn2 ) sim
radicn γ(Bn
1 ) sim983155
log n
For any finite set T sub Bn2 we have γ(T ) ≲
983155log |T | The same
holds for Gaussian width ω(T )
Matrix deviation inequality DAMC Lecture 15 May 29 - June 3 2020 6 22
2 Matrix deviation inequality
Theorem 1 (Matrix deviation inequality)
Let A be an mtimes n matrix whose rows Ai are independent isotropicand sub-gaussian random vectors in Rn Let T sub Rn be a fixed boundedset Then
E supxisinT
983055983055983042Ax9830422 minusradicm983042x9830422
983055983055 le CK2γ(T )
whereK = max
i983042Ai983042ψ2
is the maximal sub-gaussian norm of the rows of A
21 Tail bound
It is often useful to have results that hold with high probabilityrather than in expectation There exists a high-probability versionof the matrix deviation inequality and it states the following
Matrix deviation inequality DAMC Lecture 15 May 29 - June 3 2020 7 22
Let $u \ge 0$. Then the event
$$\sup_{x\in T}\bigl|\,\|Ax\|_2 - \sqrt{m}\,\|x\|_2\,\bigr| \le C K^2\bigl[\gamma(T) + u\cdot \mathrm{rad}(T)\bigr]$$
holds with probability at least $1 - 2\exp(-u^2)$. Here $\mathrm{rad}(T)$ is the radius of $T$, defined as
$$\mathrm{rad}(T) = \sup_{x\in T}\|x\|_2.$$
Since $\mathrm{rad}(T) \lesssim \gamma(T)$, we can simplify the bound:
$$\sup_{x\in T}\bigl|\,\|Ax\|_2 - \sqrt{m}\,\|x\|_2\,\bigr| \lesssim K^2 u\, \gamma(T)$$
for all $u \ge 1$, with probability at least $1 - 2\exp(-u^2)$. This is a weaker but still useful inequality. For example, we can use it to bound all higher moments of the deviation:
$$\Bigl(\mathbb{E}\sup_{x\in T}\bigl|\,\|Ax\|_2 - \sqrt{m}\,\|x\|_2\,\bigr|^p\Bigr)^{1/p} \le C_p K^2 \gamma(T),$$
where $C_p \le C\sqrt{p}$ for $p \ge 1$.
2.2 Deviation of squares

It is sometimes helpful to bound the deviation of the square $\|Ax\|_2^2$, rather than of $\|Ax\|_2$ itself. We can easily deduce the deviation of squares by using the identity
$$a^2 - b^2 = (a-b)^2 + 2b(a-b)$$
for $a = \|Ax\|_2$ and $b = \sqrt{m}\,\|x\|_2$. We obtain
$$\mathbb{E}\sup_{x\in T}\bigl|\,\|Ax\|_2^2 - m\|x\|_2^2\,\bigr| \le C K^4 \gamma(T)^2 + C K^2 \sqrt{m}\,\mathrm{rad}(T)\,\gamma(T).$$
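To spell out how the identity yields the bound, write $D = \sup_{x\in T}\bigl|\,\|Ax\|_2 - \sqrt{m}\,\|x\|_2\,\bigr|$. Then for every $x \in T$,

```latex
\bigl|\, \|Ax\|_2^2 - m\|x\|_2^2 \,\bigr|
  \le \bigl( \|Ax\|_2 - \sqrt{m}\,\|x\|_2 \bigr)^2
    + 2\sqrt{m}\,\|x\|_2 \,\bigl|\, \|Ax\|_2 - \sqrt{m}\,\|x\|_2 \,\bigr|
  \le D^2 + 2\sqrt{m}\,\mathrm{rad}(T)\,D .
```

Taking suprema and expectations, and using $\mathbb{E}\,D \lesssim K^2\gamma(T)$ together with the second-moment bound $(\mathbb{E}\,D^2)^{1/2} \lesssim K^2\gamma(T)$ from the moment inequality above, gives the stated bound.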
2.3 Deriving the Johnson-Lindenstrauss Lemma

Let $X \subset \mathbb{R}^n$ be a set of $N$ points and let $T = \{(x-y)/\|x-y\|_2 : x, y \in X,\ x \ne y\}$. Then $T$ is finite, and we have
$$\gamma(T) \lesssim \sqrt{\log |T|} \le \sqrt{\log |X|^2} \lesssim \sqrt{\log |X|} = \sqrt{\log N}.$$
The matrix deviation inequality and the choice $m \ge C\varepsilon^{-2}\log N$ then yield
$$\sup_{x,y\in X}\Bigl|\,\frac{\|A(x-y)\|_2}{\|x-y\|_2} - \sqrt{m}\,\Bigr| \lesssim \sqrt{\log N} \le \varepsilon\sqrt{m}$$
with high probability, say 0.99. Multiplying both sides by $\|x-y\|_2/\sqrt{m}$, we can write the last bound as follows. With probability at least 0.99, we have
$$(1-\varepsilon)\|x-y\|_2 \le \frac{1}{\sqrt{m}}\|Ax - Ay\|_2 \le (1+\varepsilon)\|x-y\|_2$$
for all $x, y \in X$.
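A numerical illustration of this embedding with a Gaussian matrix (the dimensions below are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(2)
n, N, m = 1000, 20, 400
X = rng.standard_normal((N, n))    # N arbitrary points in R^n
A = rng.standard_normal((m, n))    # Gaussian measurement matrix

# Worst relative distortion of pairwise distances under x -> Ax / sqrt(m)
worst = 0.0
for i in range(N):
    for j in range(i + 1, N):
        d = np.linalg.norm(X[i] - X[j])
        d_emb = np.linalg.norm(A @ (X[i] - X[j])) / np.sqrt(m)
        worst = max(worst, abs(d_emb - d) / d)
print(worst)  # of order sqrt(log N / m): all pairwise distances nearly preserved
```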
3 Covariance estimation
We already showed that $N \sim n \log n$ samples are enough to estimate the covariance matrix of a general distribution in $\mathbb{R}^n$. We can do better if the distribution is sub-gaussian: we can get rid of the logarithmic oversampling and the boundedness condition.
Theorem 2 (Covariance estimation for sub-gaussian distributions)
Let $X$ be a random vector in $\mathbb{R}^n$ with covariance matrix $\Sigma$. Suppose $X$ is sub-gaussian, and more specifically, for any $x \in \mathbb{R}^n$,
$$\|\langle X, x\rangle\|_{\psi_2} \lesssim \|\langle X, x\rangle\|_{L^2} = \|\Sigma^{1/2}x\|_2.$$
Then, for every $N \ge 1$, we have
$$\mathbb{E}\|\Sigma_N - \Sigma\| \lesssim \|\Sigma\|\Bigl(\sqrt{\frac{n}{N}} + \frac{n}{N}\Bigr).$$
This result shows that $N \sim \varepsilon^{-2} n$ samples give $\mathbb{E}\|\Sigma_N - \Sigma\| \lesssim \varepsilon\|\Sigma\|$.
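A quick simulation of the $\sqrt{n/N}$ rate for Gaussian samples, using the mean-zero sample covariance $\Sigma_N = \frac{1}{N}\sum_i X_i X_i^{\mathsf T}$ (dimensions chosen arbitrarily):

```python
import numpy as np

rng = np.random.default_rng(3)
n, N = 20, 2000
Sigma = np.diag(np.linspace(1.0, 2.0, n))                  # ground-truth covariance
X = rng.standard_normal((N, n)) * np.sqrt(np.diag(Sigma))  # rows ~ N(0, Sigma)

Sigma_N = X.T @ X / N                     # sample covariance (mean-zero model)
err = np.linalg.norm(Sigma_N - Sigma, 2)  # spectral-norm error
print(err)  # of order ||Sigma|| * sqrt(n/N), small compared with ||Sigma|| = 2
```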
Proof. We first bring the random vectors $X, X_1, \ldots, X_N$ to the isotropic position. This can be done by a suitable linear transformation: you can easily check that there exist isotropic random vectors $Z, Z_1, \ldots, Z_N$ such that
$$X = \Sigma^{1/2}Z, \qquad X_i = \Sigma^{1/2}Z_i, \quad i = 1, \ldots, N.$$
The sub-gaussian assumption implies that
$$\|Z\|_{\psi_2} \lesssim 1.$$
Then
$$\|\Sigma_N - \Sigma\| = \|\Sigma^{1/2} R_N \Sigma^{1/2}\| = \max_{\|x\|_2 = 1}\bigl|\langle \Sigma^{1/2} R_N \Sigma^{1/2} x,\, x\rangle\bigr|,$$
where
$$R_N = \frac{1}{N}\sum_{i=1}^N Z_i Z_i^{\mathsf{T}} - I_n.$$
Let $T = \{\Sigma^{1/2}x : x \in \mathbb{R}^n,\ \|x\|_2 = 1\}$. We can rewrite $\|\Sigma_N - \Sigma\|$ as
$$\|\Sigma_N - \Sigma\| = \max_{x\in T}\bigl|\langle R_N x, x\rangle\bigr| = \max_{x\in T}\Bigl|\frac{1}{N}\sum_{i=1}^N \langle Z_i, x\rangle^2 - \|x\|_2^2\Bigr| = \frac{1}{N}\max_{x\in T}\bigl|\,\|Ax\|_2^2 - N\|x\|_2^2\,\bigr|,$$
where $A$ is the $N \times n$ matrix whose rows are $Z_i^{\mathsf{T}}$.
Now apply the matrix deviation inequality for squares to conclude that
$$\mathbb{E}\|\Sigma_N - \Sigma\| \lesssim \frac{1}{N}\Bigl(\gamma(T)^2 + \sqrt{N}\,\mathrm{rad}(T)\,\gamma(T)\Bigr).$$
The radius and Gaussian width of the ellipsoid $T$ are easy to compute:
$$\mathrm{rad}(T) = \|\Sigma\|^{1/2} \quad \text{and} \quad \gamma(T) \le (\operatorname{tr}\Sigma)^{1/2}.$$
Using $\operatorname{tr}(\Sigma) \le n\|\Sigma\|$, we get
$$\mathbb{E}\|\Sigma_N - \Sigma\| \lesssim \|\Sigma\|\Bigl(\sqrt{\frac{n}{N}} + \frac{n}{N}\Bigr). \qquad \square$$
3.1 Low-dimensional distributions

We can show that much fewer samples are needed for covariance estimation of low-dimensional sub-gaussian distributions. Indeed, the proof actually yields
$$\mathbb{E}\|\Sigma_N - \Sigma\| \lesssim \|\Sigma\|\Bigl(\sqrt{\frac{r}{N}} + \frac{r}{N}\Bigr),$$
where
$$r = r(\Sigma^{1/2}) = \frac{\operatorname{tr}\Sigma}{\|\Sigma\|}$$
is the stable rank of $\Sigma^{1/2}$. This means that covariance estimation is possible with
$$N \sim r$$
samples.
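The stable rank is easy to compute, and it can be far smaller than $n$ when the distribution is approximately low-dimensional. A small illustration (the dimension and spectrum are arbitrary choices):

```python
import numpy as np

n = 100
eigs = np.array([1.0, 1.0, 1.0] + [1e-4] * (n - 3))  # 3 dominant directions in R^100
Sigma = np.diag(eigs)

stable_rank = np.trace(Sigma) / np.linalg.norm(Sigma, 2)  # r = tr(Sigma) / ||Sigma||
print(stable_rank)  # about 3.01: close to 3, far below n = 100
```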
4 Underdetermined linear equations
Suppose we need to solve a severely underdetermined system of linear equations: say, we have $m$ equations in $n \gg m$ variables,
$$Ax = y.$$
When the linear system is underdetermined, we cannot find $x$ with any accuracy unless we know something extra about $x$. So let us assume that we do have some a priori information. We can describe this situation mathematically by assuming that
$$x \in K,$$
where $K \subset \mathbb{R}^n$ is some known set that describes anything we know about $x$ a priori.
Summarizing, here is the problem we are trying to solve. Determine a solution $\widehat{x} = \widehat{x}(A, y, K)$ to the underdetermined linear equation $Ax = y$ as accurately as possible, assuming that $x \in K$.
4.1 An optimization approach
We convert the set $K$ into a function on $\mathbb{R}^n$, called the Minkowski functional of $K$. This is basically a function whose level sets are multiples of $K$. To define it formally, assume that $K$ is star-shaped, which means that together with any point $x$, the set $K$ contains the entire interval that connects $x$ with the origin; see the figure for an illustration. The set on the left (whose boundary is shown) is star-shaped; the set on the right is not.
The Minkowski functional of $K$ is defined as
$$\|x\|_K = \inf\{t > 0 : x/t \in K\}, \qquad x \in \mathbb{R}^n.$$
If the set $K$ is convex and symmetric about the origin, $\|x\|_K$ is actually a norm on $\mathbb{R}^n$. (Exercise.)
Now we propose the following way to solve the recovery problem: solve the optimization program
$$\min \|x'\|_K \quad \text{subject to} \quad y = Ax'.$$
Note that this is a very natural program: it looks at all solutions of the equation $y = Ax'$ and tries to "shrink" the solution $x'$ toward $K$. (This is what minimization of the Minkowski functional is about.) Also note that if $K$ is convex, this is a convex optimization program, and thus it can be solved effectively by one of the many available numerical algorithms.
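For a star-shaped $K$ given only by a membership oracle, the Minkowski functional can be evaluated by bisection in $t$, since $x/t \in K$ is monotone in $t$. A sketch (the function name, bracket, and tolerance are arbitrary choices, not from the lecture):

```python
import numpy as np

def minkowski_functional(x, in_K, hi=1e6, tol=1e-9):
    """||x||_K = inf{t > 0 : x/t in K}, for star-shaped K given by oracle in_K."""
    lo = 0.0
    for _ in range(200):          # bisection: membership x/t in K is monotone in t
        mid = (lo + hi) / 2
        if in_K(x / mid):
            hi = mid              # x/mid already in K: the infimum is at most mid
        else:
            lo = mid
        if hi - lo < tol:
            break
    return hi

# Sanity check: for K the Euclidean unit ball, ||x||_K is the Euclidean norm
x = np.array([3.0, 4.0])
val = minkowski_functional(x, lambda v: np.linalg.norm(v) <= 1.0)
print(val)  # ~5.0
```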
The main question we should now be asking is: would the solution to this program approximate the original vector $x$? The following result bounds the approximation error for a probabilistic model of linear equations.
Theorem 3 (Recovery by optimization)
Assume that $A$ is an $m \times n$ random matrix whose rows $A_i$ are independent, isotropic, and sub-gaussian random vectors in $\mathbb{R}^n$. Then the solution $\widehat{x}$ of the optimization problem satisfies
$$\mathbb{E}\|\widehat{x} - x\|_2 \lesssim \frac{\omega(K)}{\sqrt{m}},$$
where $\omega(K)$ is the Gaussian width of $K$.
Proof. Both the original vector $x$ and the solution $\widehat{x}$ are feasible vectors for the optimization program. Then $\|\widehat{x}\|_K \le \|x\|_K \le 1$, and thus both $\widehat{x}, x \in K$.
Since $A\widehat{x} = Ax = y$, we have $A(\widehat{x} - x) = 0$. Let us apply the matrix deviation inequality for $T = K - K$. It gives
$$\sup_{u,v\in K}\bigl|\,\|A(u-v)\|_2 - \sqrt{m}\,\|u-v\|_2\,\bigr| \lesssim \gamma(T) = 2\omega(K)$$
in expectation. Substitute $u = \widehat{x}$ and $v = x$ here. We may do this since, as we noted above, both these vectors belong to $K$. But then the term $\|A(u-v)\|_2$ equals zero. It disappears from the bound, and we get
$$\mathbb{E}\,\sqrt{m}\,\|\widehat{x} - x\|_2 \lesssim \omega(K).$$
Dividing both sides by $\sqrt{m}$, we complete the proof. $\square$
This theorem says that a signal $x \in K$ can be efficiently recovered from
$$m \sim \omega(K)^2$$
random linear measurements.
5 Sparse recovery
Suppose we know that the signal $x$ is sparse, which means that only a few coordinates of $x$ are nonzero. As before, our task is to recover $x$ from the random linear measurements given by the vector
$$y = Ax,$$
where $A$ is an $m \times n$ random matrix.
The number of nonzero coefficients of a vector $x \in \mathbb{R}^n$, or the sparsity of $x$, is often denoted $\|x\|_0$. This is reminiscent of the notation for the $\ell_p$ norm $\|x\|_p = \bigl(\sum_{i=1}^n |x_i|^p\bigr)^{1/p}$, and for a reason: you can quickly check that
$$\|x\|_0 = \lim_{p\to 0^+} \|x\|_p^p.$$
Keep in mind that neither $\|x\|_0$ nor $\|x\|_p$ for $0 < p < 1$ is actually a norm on $\mathbb{R}^n$: the former fails homogeneity, and the latter fails the triangle inequality.
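The limit identity is easy to check numerically: for small $p$, the sum $\sum_i |x_i|^p$ counts the nonzero entries, since $|x_i|^p \to 1$ for $x_i \ne 0$ while $0^p = 0$. A quick sanity check (the test vector is an arbitrary choice):

```python
import numpy as np

x = np.array([3.0, 0.0, -0.5, 0.0, 2.0])   # ||x||_0 = 3

for p in [1.0, 0.5, 0.1, 0.01]:
    print(p, np.sum(np.abs(x) ** p))        # tends to ||x||_0 = 3 as p -> 0
```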
Our first attempt to recover $x$ is to try the following optimization problem:
$$\min \|x'\|_0 \quad \text{subject to} \quad y = Ax'.$$
This is sensible because this program selects the sparsest feasible solution. But there is an implementation caveat: the function $f(x) = \|x\|_0$ is highly non-convex and even discontinuous. There is simply no known algorithm to solve this optimization problem efficiently.
To overcome this difficulty, we use
$$\min \|x'\|_1 \quad \text{subject to} \quad y = Ax'.$$
This is a convexification of the non-convex program, and a variety of numerical convex optimization methods are available to solve it efficiently.
We will now show that an $s$-sparse signal $x \in \mathbb{R}^n$ can be efficiently recovered from $m \sim s\log n$ random linear measurements.
Theorem 4 (Sparse recovery by optimization)
Assume $A$ is a random matrix as in Theorem 3. If an unknown vector $x \in \mathbb{R}^n$ has at most $s$ nonzero coordinates, i.e. $\|x\|_0 \le s$, then the solution $\widehat{x}$ of the $\ell_1$ optimization program satisfies
$$\mathbb{E}\|\widehat{x} - x\|_2 \lesssim \sqrt{\frac{s\log n}{m}}\;\|x\|_2.$$
Proof. The Cauchy-Schwarz inequality shows that $\|x\|_1 \le \sqrt{s}\,\|x\|_2$ for every $s$-sparse $x$. Denote the unit ball of the $\ell_1$ norm in $\mathbb{R}^n$ by $B_1^n$. Then we can rewrite $\|x\|_1 \le \sqrt{s}\,\|x\|_2$ as the inclusion
$$x \in \sqrt{s}\,\|x\|_2 \cdot B_1^n =: K.$$
By the formula for the Gaussian width of $B_1^n$, we have
$$\omega(K) = \sqrt{s}\,\|x\|_2\cdot \omega(B_1^n) \le \sqrt{s}\,\|x\|_2\cdot \gamma(B_1^n) \lesssim \sqrt{s}\,\|x\|_2\cdot\sqrt{\log n}.$$
Substituting this into Theorem 3 completes the proof. $\square$
1 Gaussian width
Definition Let T sub Rn be a bounded set and g be a standardnormal random vector in Rn ie g sim N (0 In) Then thequantities
ω(T ) = E supxisinT
〈gx〉 and γ(T ) = E supxisinT
|〈gx〉|
are called the Gaussian width of T and the Gaussian complexity ofT respectively
Gaussian width and Gaussian complexity are closely relatedIndeed (Exercise)
2ω(T ) = ω(T minus T ) = E supxyisinT
〈gxminus y〉
= E supxyisinT
|〈gxminus y〉|
= γ(T minus T )
Matrix deviation inequality DAMC Lecture 15 May 29 - June 3 2020 4 22
Gaussian width has a natural geometric interpretation Suppose gis a unit vector in Rn Then a momentrsquos thought reveals thatsupxyisinT 〈gxminus y〉 is simply the width of T in the direction of gie the distance between the two hyperplanes with normal g thattouch T on both sides as shown in the figure Then 2ω(T ) can beobtained by averaging the width of T over all directions g in Rn
Matrix deviation inequality DAMC Lecture 15 May 29 - June 3 2020 5 22
This reasoning is valid except where we assumed that g is a unitvector Instead for g sim N (0 In) we have E983042g9830422 = n and
983042g9830422 asympradicn with high probability
(Check both these claims using Bernsteinrsquos inequality) Thus weneed to scale by the factor
radicn Ultimately the geometric
interpretation of the Gaussian width becomes the following ω(T )is approximately
radicn2 larger than the usual geometric width of
T averaged over all directions
A good exercise is to compute the Gaussian width and complexityfor some simple sets such as the unit balls of the ℓp norms in Rnwhich we denote by Bn
p = x isin Rn 983042x983042p le 1 We have
γ(Bn2 ) sim
radicn γ(Bn
1 ) sim983155
log n
For any finite set T sub Bn2 we have γ(T ) ≲
983155log |T | The same
holds for Gaussian width ω(T )
Matrix deviation inequality DAMC Lecture 15 May 29 - June 3 2020 6 22
2 Matrix deviation inequality
Theorem 1 (Matrix deviation inequality)
Let A be an mtimes n matrix whose rows Ai are independent isotropicand sub-gaussian random vectors in Rn Let T sub Rn be a fixed boundedset Then
E supxisinT
983055983055983042Ax9830422 minusradicm983042x9830422
983055983055 le CK2γ(T )
whereK = max
i983042Ai983042ψ2
is the maximal sub-gaussian norm of the rows of A
21 Tail bound
It is often useful to have results that hold with high probabilityrather than in expectation There exists a high-probability versionof the matrix deviation inequality and it states the following
Matrix deviation inequality DAMC Lecture 15 May 29 - June 3 2020 7 22
Let u ge 0 Then the event
supxisinT
983055983055983042Ax9830422 minusradicm983042x9830422
983055983055 983249 CK2[γ(T ) + u middot rad(T )]
holds with probability at least 1minus 2 exp(minusu2) Here rad(T ) is theradius of T defined as
rad(T ) = supxisinT
983042x9830422
Since rad(T ) ≲ γ(T ) we improve the bound as
supxisinT
983055983055983042Ax9830422 minusradicm983042x9830422
983055983055 ≲ K2uγ(T )
for all u ge 1 This is a weaker but still a useful inequality Forexample we can use it to bound all higher moments of thedeviation
983061E sup
xisinT
983055983055983042Ax9830422 minusradicm983042x9830422
983055983055p9830621p
983249 CpK2γ(T )
where Cp le Cradicp for p ge 1
Matrix deviation inequality DAMC Lecture 15 May 29 - June 3 2020 8 22
22 Deviation of squares
It is sometimes helpful to bound the deviation of the square983042Ax98304222 rather than 983042Ax9830422 itself
We can easily deduce the deviation of squares by using the identity
a2 minus b2 = (aminus b)2 + 2b(bminus a)
for a = 983042Ax9830422 and b =radicm983042x9830422
We have
E supxisinT
983055983055983042Ax98304222 minusm983042x98304222983055983055 le CK4γ(T )2 + CK2radicmrad(T )γ(T )
Matrix deviation inequality DAMC Lecture 15 May 29 - June 3 2020 9 22
23 Deriving Johnson-Lindenstrauss Lemma
X sub Rn and T = (xminus y)983042xminus y9830422 xy isin X Then T is finiteand we have
γ(T ) ≲983155
log |T | le983155
log |X |2 ≲983155
log |X |
Matrix deviation inequality and m ge Cεminus2 logN then yield
supxyisinX
983055983055983055983055983042A(xminus y)9830422983042xminus y9830422
minusradicm
983055983055983055983055 ≲983155
logN le εradicm
with high probability say 099 Multiplying both sides by983042xminus y9830422
radicm we can write the last bound as follows With
probability at least 099 we have
(1minus ε)983042xminus y9830422 9832491radicm983042AxminusAy9830422 983249 (1 + ε)983042xminus y9830422
for all xy isin X
Matrix deviation inequality DAMC Lecture 15 May 29 - June 3 2020 10 22
3 Covariance estimation
We already showed that N sim n log n samples are enough toestimate the covariance matrix of a general distribution in Rn
We can do better if the distribution is sub-gaussian we can get ridof the logarithmic oversampling and the boundedness condition
Theorem 2 (Covariance estimation for sub-gaussian distributions)
Let X be a random vector in Rn with covariance matrix Σ Suppose Xis sub-gaussian and more specifically for any x isin Rn
983042〈Xx〉983042ψ2 ≲ 983042〈Xx〉983042L2 = 983042Σ12x9830422
Then for every N ge 1 we have
E983042ΣN minus Σ983042 ≲ 983042Σ983042983061983157
n
N+
n
N
983062
This result shows N sim εminus2n gives E983042ΣN minus Σ983042 ≲ ε983042Σ983042
Matrix deviation inequality DAMC Lecture 15 May 29 - June 3 2020 11 22
Proof We first bring the random vectors XX1 XN to theisotropic position This can be done by a suitable lineartransformation You will easily check that there exists isotropicrandom vectors ZZ1 ZN such that
X = Σ12Z Xi = Σ12Zi i = 1 N
The sub-gaussian assumption implies that
983042Z983042ψ2 ≲ 1
Then
983042ΣN minus Σ983042 = 983042Σ12RNΣ12983042 = max983042x9830422=1
〈Σ12RNΣ12xx〉
where
RN =1
N
N983131
i=1
ZiZTi minus In
Matrix deviation inequality DAMC Lecture 15 May 29 - June 3 2020 12 22
Let T = Σ12x isin Rn 983042x9830422 = 1 We can rewrite 983042ΣN minus Σ983042 as
983042ΣN minus Σ983042 = maxxisinT
〈RNxx〉 = maxxisinT
9830559830559830559830551
N
983131N
i=1〈Zix〉2 minus 983042x98304222
983055983055983055983055
=1
NmaxxisinT
983055983055983042Ax98304222 minusN983042x98304222983055983055
Now apply the matrix deviation inequality for squares to conclude that
E 983042ΣN minus Σ983042 ≲ 1
N
983059γ(T )2 +
radicN rad(T )γ(T )
983060
The radius and Gaussian width of the ellipsoid T are easy to compute
rad(T ) = 983042Σ98304212 and γ(T ) le (trΣ)12
By using tr(Σ) le n983042Σ983042 we have
E983042ΣN minus Σ983042 ≲ 983042Σ983042983061983157
n
N+
n
N
983062
Matrix deviation inequality DAMC Lecture 15 May 29 - June 3 2020 13 22
31 Low-dimensional distributions
We can show that much fewer samples are needed for covarianceestimation of low-dimensional sub-gaussian distributions Indeedthe proof actually yields
E983042ΣN minus Σ983042 ≲ 983042Σ983042983061983157
r
N+
r
N
983062
where
r = r(Σ12) =trΣ
983042Σ983042
is the stable rank of Σ12 This means that covariance estimationis possible with
N sim r
samples
Matrix deviation inequality DAMC Lecture 15 May 29 - June 3 2020 14 22
4 Underdetermined linear equations
Suppose we need to solve a severely underdetermined system oflinear equations say we have m equations in n ≫ m variables
Ax = y
When the linear system is underdetermined we can not find xwith any accuracy unless we know something extra about x Solet us assume that we do have some a-priori information We candescribe this situation mathematically by assuming that
x isin K
where K sub Rn is some known set in Rn that describes anythingthat we know about x a-priori
Summarizing here is the problem we are trying to solveDetermine a solution x = x(AyK) to the underdetermined linearequation Ax = y as accurately as possible assuming that x isin K
Matrix deviation inequality DAMC Lecture 15 May 29 - June 3 2020 15 22
41 An optimization approach
We convert the set K into a function on Rn which is called theMinkowski functional of K This is basically a function whose levelsets are multiples of K
To define it formally assume that K is star-shaped which meansthat together with any point x the set K must contain the entireinterval that connects x with the origin see the figure forillustration
The set on the left (whose boundary is shown) is star-shaped theset on the right is not
Matrix deviation inequality DAMC Lecture 15 May 29 - June 3 2020 16 22
The Minkowski functional of K is defined as
983042x983042K = inft gt 0 xt isin K x isin Rn
If the set K is convex and symmetric about the origin 983042x983042K isactually a norm on Rn (Exercise)
Now we propose the following way to solve the recovery problemsolve the optimization program
min 983042xprime983042K subject to y = Axprime
Note that this is a very natural program it looks at all solutions tothe equation y = Axprime and tries to ldquoshrinkrdquo the solution xprime towardK (This is what minimization of Minkowski functional is about)
Also note that if K is convex this is a convex optimizationprogram and thus can be solved effectively by one of the manyavailable numeric algorithms
Matrix deviation inequality DAMC Lecture 15 May 29 - June 3 2020 17 22
The main question we should now be asking is ndash would thesolution to this program approximate the original vector x Thefollowing result bounds the approximation error for a probabilisticmodel of linear equations
Theorem 3 (Recovery by optimization)
Assume that A is an mtimes n random matrix whose rows Ai areindependent isotropic and sub-gaussian random vectors in Rn Thesolution 983141x of the optimization problem satisfies
E983042983141xminus x9830422 ≲ω(K)radic
m
where ω(K) is the Gaussian width of K
Proof Both the original vector x and the solution 983141x are feasible vectorsfor the optimization program Then 983042983141x983042K le 983042x983042K le 1 Thus both983141xx isin K
Matrix deviation inequality DAMC Lecture 15 May 29 - June 3 2020 18 22
By A983141x = Ax = y we have A(983141xminus x) = 0 Let us apply matrixdeviation inequality for T = K minusK It gives
supuvisinK
983055983055983042A(uminus v)9830422 minusradicm983042uminus v9830422
983055983055 ≲ γ(T ) = 2ω(K)
Substitute u = 983141x and v = x here We may do this since as we notedabove both these vectors belong to K But then the term 983042A(uminus v)9830422will equal zero It disappears from the bound and we get
Eradicm983042983141xminus x9830422 ≲ ω(K)
Dividing both sides byradicm we complete the proof
This theorem says that a signal x isin K can be efficiently recovered from
m sim ω(K)2
random linear measurements
Matrix deviation inequality DAMC Lecture 15 May 29 - June 3 2020 19 22
5 Sparse recovery
Suppose we know that the signal x is sparse which means thatonly a few coordinates of x are nonzero As before our task is torecover x from the random linear measurements given by thevector
y = Ax
where A is an mtimes n random matrix
The number of nonzero coefficients of a vector x isin Rn or thesparsity of x is often denoted 983042x9830420 This is reminiscent of thenotation for the ℓp norm 983042x983042p = (
983123ni=1 |xi|p)1p and for a reason
You can quickly check that
983042x9830420 = limprarr0
983042x983042p
Keep in mind that neither 983042x9830420 nor 983042x983042p for 0 lt p lt 1 areactually norms on Rn since they fail triangle inequality
Matrix deviation inequality DAMC Lecture 15 May 29 - June 3 2020 20 22
Our first attempt to recover x is to try the following optimizationproblem
min 983042xprime9830420 subject to y = Axprime
This is sensible because this program selects the sparsest feasiblesolution But there is an implementation caveat the functionf(x) = 983042x9830420 is highly non-convex and even discontinuous There issimply no known algorithm to solve the optimization problemefficiently
To overcome this difficulty we use
min 983042xprime9830421 subject to y = Axprime
This is a convexification of the non-convex program and a varietyof numeric convex optimization methods are available to solve itefficiently
We will now show that an s-sparse signal x isin Rn can be efficientlyrecovered from m sim s log n random linear measurements
Matrix deviation inequality DAMC Lecture 15 May 29 - June 3 2020 21 22
Theorem 4 (Sparse recovery by optimization)
Assume A is a random matrix as in Theorem 3 If an unknown vectorx isin Rn has at most s non-zero coordinates ie 983042x9830420 le s then thesolution 983141x of the ℓ1 optimization program satisfies
E983042983141xminus x9830422 ≲983155
(s log n)m983042x9830422
Proof Cauchy-Schwarz inequality shows that 983042x9830421 leradics983042x9830422 Denote
the unit ball of the ℓ1 norm in Rn by Bn1 Then we can rewrite
983042x9830421 leradics983042x9830422 as the inclusion
x isinradics983042x9830422 middot Bn
1 = K
By the Gaussian width of Bn1 we have
ω(K) =radics983042x9830422 middot ω(Bn
1 ) leradics983042x9830422 middot γ(Bn
1 ) leradics983042x9830422 middot
983155log n
Substitute this in Theorem 3 and complete the proof
Matrix deviation inequality DAMC Lecture 15 May 29 - June 3 2020 22 22
Gaussian width has a natural geometric interpretation Suppose gis a unit vector in Rn Then a momentrsquos thought reveals thatsupxyisinT 〈gxminus y〉 is simply the width of T in the direction of gie the distance between the two hyperplanes with normal g thattouch T on both sides as shown in the figure Then 2ω(T ) can beobtained by averaging the width of T over all directions g in Rn
Matrix deviation inequality DAMC Lecture 15 May 29 - June 3 2020 5 22
This reasoning is valid except where we assumed that g is a unitvector Instead for g sim N (0 In) we have E983042g9830422 = n and
983042g9830422 asympradicn with high probability
(Check both these claims using Bernsteinrsquos inequality) Thus weneed to scale by the factor
radicn Ultimately the geometric
interpretation of the Gaussian width becomes the following ω(T )is approximately
radicn2 larger than the usual geometric width of
T averaged over all directions
A good exercise is to compute the Gaussian width and complexityfor some simple sets such as the unit balls of the ℓp norms in Rnwhich we denote by Bn
p = x isin Rn 983042x983042p le 1 We have
γ(Bn2 ) sim
radicn γ(Bn
1 ) sim983155
log n
For any finite set T sub Bn2 we have γ(T ) ≲
983155log |T | The same
holds for Gaussian width ω(T )
Matrix deviation inequality DAMC Lecture 15 May 29 - June 3 2020 6 22
2 Matrix deviation inequality
Theorem 1 (Matrix deviation inequality)
Let A be an mtimes n matrix whose rows Ai are independent isotropicand sub-gaussian random vectors in Rn Let T sub Rn be a fixed boundedset Then
E supxisinT
983055983055983042Ax9830422 minusradicm983042x9830422
983055983055 le CK2γ(T )
whereK = max
i983042Ai983042ψ2
is the maximal sub-gaussian norm of the rows of A
21 Tail bound
It is often useful to have results that hold with high probabilityrather than in expectation There exists a high-probability versionof the matrix deviation inequality and it states the following
Matrix deviation inequality DAMC Lecture 15 May 29 - June 3 2020 7 22
Let u ge 0 Then the event
supxisinT
983055983055983042Ax9830422 minusradicm983042x9830422
983055983055 983249 CK2[γ(T ) + u middot rad(T )]
holds with probability at least 1minus 2 exp(minusu2) Here rad(T ) is theradius of T defined as
rad(T ) = supxisinT
983042x9830422
Since rad(T ) ≲ γ(T ) we improve the bound as
supxisinT
983055983055983042Ax9830422 minusradicm983042x9830422
983055983055 ≲ K2uγ(T )
for all u ge 1 This is a weaker but still a useful inequality Forexample we can use it to bound all higher moments of thedeviation
983061E sup
xisinT
983055983055983042Ax9830422 minusradicm983042x9830422
983055983055p9830621p
983249 CpK2γ(T )
where Cp le Cradicp for p ge 1
Matrix deviation inequality DAMC Lecture 15 May 29 - June 3 2020 8 22
22 Deviation of squares
It is sometimes helpful to bound the deviation of the square983042Ax98304222 rather than 983042Ax9830422 itself
We can easily deduce the deviation of squares by using the identity
a2 minus b2 = (aminus b)2 + 2b(bminus a)
for a = 983042Ax9830422 and b =radicm983042x9830422
We have
E supxisinT
983055983055983042Ax98304222 minusm983042x98304222983055983055 le CK4γ(T )2 + CK2radicmrad(T )γ(T )
Matrix deviation inequality DAMC Lecture 15 May 29 - June 3 2020 9 22
23 Deriving Johnson-Lindenstrauss Lemma
X sub Rn and T = (xminus y)983042xminus y9830422 xy isin X Then T is finiteand we have
γ(T ) ≲983155
log |T | le983155
log |X |2 ≲983155
log |X |
Matrix deviation inequality and m ge Cεminus2 logN then yield
supxyisinX
983055983055983055983055983042A(xminus y)9830422983042xminus y9830422
minusradicm
983055983055983055983055 ≲983155
logN le εradicm
with high probability say 099 Multiplying both sides by983042xminus y9830422
radicm we can write the last bound as follows With
probability at least 099 we have
(1minus ε)983042xminus y9830422 9832491radicm983042AxminusAy9830422 983249 (1 + ε)983042xminus y9830422
for all xy isin X
Matrix deviation inequality DAMC Lecture 15 May 29 - June 3 2020 10 22
3 Covariance estimation
We already showed that N sim n log n samples are enough toestimate the covariance matrix of a general distribution in Rn
We can do better if the distribution is sub-gaussian we can get ridof the logarithmic oversampling and the boundedness condition
Theorem 2 (Covariance estimation for sub-gaussian distributions)
Let X be a random vector in Rn with covariance matrix Σ Suppose Xis sub-gaussian and more specifically for any x isin Rn
983042〈Xx〉983042ψ2 ≲ 983042〈Xx〉983042L2 = 983042Σ12x9830422
Then for every N ge 1 we have
E983042ΣN minus Σ983042 ≲ 983042Σ983042983061983157
n
N+
n
N
983062
This result shows N sim εminus2n gives E983042ΣN minus Σ983042 ≲ ε983042Σ983042
Matrix deviation inequality DAMC Lecture 15 May 29 - June 3 2020 11 22
Proof We first bring the random vectors XX1 XN to theisotropic position This can be done by a suitable lineartransformation You will easily check that there exists isotropicrandom vectors ZZ1 ZN such that
X = Σ12Z Xi = Σ12Zi i = 1 N
The sub-gaussian assumption implies that
983042Z983042ψ2 ≲ 1
Then
983042ΣN minus Σ983042 = 983042Σ12RNΣ12983042 = max983042x9830422=1
〈Σ12RNΣ12xx〉
where
RN =1
N
N983131
i=1
ZiZTi minus In
Matrix deviation inequality DAMC Lecture 15 May 29 - June 3 2020 12 22
Let T = Σ12x isin Rn 983042x9830422 = 1 We can rewrite 983042ΣN minus Σ983042 as
983042ΣN minus Σ983042 = maxxisinT
〈RNxx〉 = maxxisinT
9830559830559830559830551
N
983131N
i=1〈Zix〉2 minus 983042x98304222
983055983055983055983055
=1
NmaxxisinT
983055983055983042Ax98304222 minusN983042x98304222983055983055
Now apply the matrix deviation inequality for squares to conclude that
E 983042ΣN minus Σ983042 ≲ 1
N
983059γ(T )2 +
radicN rad(T )γ(T )
983060
The radius and Gaussian width of the ellipsoid T are easy to compute
rad(T ) = 983042Σ98304212 and γ(T ) le (trΣ)12
By using tr(Σ) le n983042Σ983042 we have
E983042ΣN minus Σ983042 ≲ 983042Σ983042983061983157
n
N+
n
N
983062
Matrix deviation inequality DAMC Lecture 15 May 29 - June 3 2020 13 22
31 Low-dimensional distributions
We can show that much fewer samples are needed for covarianceestimation of low-dimensional sub-gaussian distributions Indeedthe proof actually yields
E983042ΣN minus Σ983042 ≲ 983042Σ983042983061983157
r
N+
r
N
983062
where
r = r(Σ12) =trΣ
983042Σ983042
is the stable rank of Σ12 This means that covariance estimationis possible with
N sim r
samples
Matrix deviation inequality DAMC Lecture 15 May 29 - June 3 2020 14 22
4 Underdetermined linear equations
Suppose we need to solve a severely underdetermined system oflinear equations say we have m equations in n ≫ m variables
Ax = y
When the linear system is underdetermined we can not find xwith any accuracy unless we know something extra about x Solet us assume that we do have some a-priori information We candescribe this situation mathematically by assuming that
x isin K
where K sub Rn is some known set in Rn that describes anythingthat we know about x a-priori
Summarizing here is the problem we are trying to solveDetermine a solution x = x(AyK) to the underdetermined linearequation Ax = y as accurately as possible assuming that x isin K
Matrix deviation inequality DAMC Lecture 15 May 29 - June 3 2020 15 22
41 An optimization approach
We convert the set K into a function on Rn which is called theMinkowski functional of K This is basically a function whose levelsets are multiples of K
To define it formally assume that K is star-shaped which meansthat together with any point x the set K must contain the entireinterval that connects x with the origin see the figure forillustration
The set on the left (whose boundary is shown) is star-shaped theset on the right is not
Matrix deviation inequality DAMC Lecture 15 May 29 - June 3 2020 16 22
The Minkowski functional of K is defined as
983042x983042K = inft gt 0 xt isin K x isin Rn
If the set K is convex and symmetric about the origin 983042x983042K isactually a norm on Rn (Exercise)
Now we propose the following way to solve the recovery problemsolve the optimization program
min 983042xprime983042K subject to y = Axprime
Note that this is a very natural program it looks at all solutions tothe equation y = Axprime and tries to ldquoshrinkrdquo the solution xprime towardK (This is what minimization of Minkowski functional is about)
Also note that if K is convex this is a convex optimizationprogram and thus can be solved effectively by one of the manyavailable numeric algorithms
Matrix deviation inequality DAMC Lecture 15 May 29 - June 3 2020 17 22
The main question we should now ask is: does the solution of this program approximate the original vector $x$? The following result bounds the approximation error for a probabilistic model of linear equations.

Theorem 3 (Recovery by optimization)

Assume that $A$ is an $m \times n$ random matrix whose rows $A_i$ are independent, isotropic and sub-gaussian random vectors in $\mathbb{R}^n$, and that $x \in K$. The solution $\widehat{x}$ of the optimization program satisfies

$$\mathbb{E}\,\|\widehat{x} - x\|_2 \lesssim \frac{\omega(K)}{\sqrt{m}},$$

where $\omega(K)$ is the Gaussian width of $K$.

Proof. Both the original vector $x$ and the solution $\widehat{x}$ are feasible for the optimization program, so by optimality $\|\widehat{x}\|_K \le \|x\|_K \le 1$. Thus both $\widehat{x}, x \in K$.
Since $A\widehat{x} = Ax = y$, we have $A(\widehat{x} - x) = 0$. Let us apply the matrix deviation inequality with $T = K - K$. It gives

$$\mathbb{E} \sup_{u, v \in K} \big|\, \|A(u - v)\|_2 - \sqrt{m}\,\|u - v\|_2 \,\big| \lesssim \gamma(K - K) = 2\,\omega(K).$$

Substitute $u = \widehat{x}$ and $v = x$ here. We may do this since, as noted above, both of these vectors belong to $K$. But then the term $\|A(u - v)\|_2$ equals zero; it disappears from the bound, and we get

$$\mathbb{E}\,\sqrt{m}\,\|\widehat{x} - x\|_2 \lesssim \omega(K).$$

Dividing both sides by $\sqrt{m}$ completes the proof. $\square$

This theorem says that a signal $x \in K$ can be efficiently recovered from

$$m \sim \omega(K)^2$$

random linear measurements.
5 Sparse recovery

Suppose we know that the signal $x$ is sparse, which means that only a few coordinates of $x$ are nonzero. As before, our task is to recover $x$ from the random linear measurements given by the vector

$$y = Ax,$$

where $A$ is an $m \times n$ random matrix.

The number of nonzero coefficients of a vector $x \in \mathbb{R}^n$, or the sparsity of $x$, is often denoted $\|x\|_0$. This is reminiscent of the notation for the $\ell_p$ norm, $\|x\|_p = \left( \sum_{i=1}^{n} |x_i|^p \right)^{1/p}$, and for a reason: you can quickly check that

$$\|x\|_0 = \lim_{p \to 0^+} \|x\|_p^p.$$

Keep in mind that neither $\|x\|_0$ nor $\|x\|_p$ for $0 < p < 1$ is actually a norm on $\mathbb{R}^n$: the former fails homogeneity and the latter fails the triangle inequality.
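This limit is easy to check numerically (the example vector is an arbitrary choice): $\|x\|_p^p = \sum_i |x_i|^p$ tends to the number of nonzero coordinates as $p \to 0^+$, since $|x_i|^p \to 1$ for every $x_i \ne 0$ while $0^p = 0$.

```python
import numpy as np

x = np.array([0.0, 2.0, 0.0, -0.5, 3.0])    # ||x||_0 = 3

# ||x||_p^p approaches the sparsity as p -> 0+.
for p in (0.5, 0.1, 0.01, 0.001):
    print(p, np.sum(np.abs(x) ** p))        # tends to 3
```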
Our first attempt to recover $x$ is to try the following optimization problem:

$$\min \|x'\|_0 \quad \text{subject to} \quad y = Ax'.$$

This is sensible because this program selects the sparsest feasible solution. But there is an implementation caveat: the function $f(x) = \|x\|_0$ is highly non-convex and even discontinuous, and there is simply no known algorithm that solves this optimization problem efficiently.

To overcome this difficulty, we use instead

$$\min \|x'\|_1 \quad \text{subject to} \quad y = Ax'.$$

This is a convexification of the non-convex program, and a variety of numerical convex optimization methods are available to solve it efficiently.

We will now show that an $s$-sparse signal $x \in \mathbb{R}^n$ can be efficiently recovered from $m \sim s \log n$ random linear measurements.
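The $\ell_1$ program is a linear program in disguise: introducing auxiliary variables $u \ge |x'|$ turns it into $\min \sum_i u_i$ subject to $Ax' = y$ and $-u \le x' \le u$. Below is a sketch using `scipy.optimize.linprog` (this standard basis-pursuit encoding and the problem sizes are our own illustrative choices, not from the lecture):

```python
import numpy as np
from scipy.optimize import linprog

rng = np.random.default_rng(1)
n, m, s = 60, 40, 3                      # m ~ s log n, with room to spare

x = np.zeros(n)
x[rng.choice(n, size=s, replace=False)] = rng.standard_normal(s)
A = rng.standard_normal((m, n))
y = A @ x

# Variables z = (x', u); minimize sum(u) = ||x'||_1 at the optimum.
c = np.concatenate([np.zeros(n), np.ones(n)])
I_n = np.eye(n)
A_ub = np.block([[I_n, -I_n], [-I_n, -I_n]])   # x' - u <= 0 and -x' - u <= 0
b_ub = np.zeros(2 * n)
A_eq = np.hstack([A, np.zeros((m, n))])        # A x' = y
bounds = [(None, None)] * n + [(0, None)] * n

res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=y, bounds=bounds)
x_hat = res.x[:n]
err = np.linalg.norm(x_hat - x)
print(err)   # near zero: exact recovery is typical at this sparsity level
```

At these sizes the LP solves in milliseconds; the point is that "convexification" here is not an approximation heuristic but an exactly solvable surrogate.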
Theorem 4 (Sparse recovery by optimization)

Assume $A$ is a random matrix as in Theorem 3. If an unknown vector $x \in \mathbb{R}^n$ has at most $s$ nonzero coordinates, i.e. $\|x\|_0 \le s$, then the solution $\widehat{x}$ of the $\ell_1$ optimization program satisfies

$$\mathbb{E}\,\|\widehat{x} - x\|_2 \lesssim \sqrt{\frac{s \log n}{m}}\;\|x\|_2.$$

Proof. The Cauchy–Schwarz inequality shows that $\|x\|_1 \le \sqrt{s}\,\|x\|_2$. Denote the unit ball of the $\ell_1$ norm in $\mathbb{R}^n$ by $B_1^n$. Then we can rewrite $\|x\|_1 \le \sqrt{s}\,\|x\|_2$ as the inclusion

$$x \in \sqrt{s}\,\|x\|_2 \cdot B_1^n =: K.$$

By the bound on the Gaussian width of $B_1^n$, we have

$$\omega(K) = \sqrt{s}\,\|x\|_2 \cdot \omega(B_1^n) \le \sqrt{s}\,\|x\|_2 \cdot \gamma(B_1^n) \lesssim \sqrt{s}\,\|x\|_2 \cdot \sqrt{\log n}.$$

Substituting this into Theorem 3 completes the proof. $\square$
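The width bound used in the last step, $\gamma(B_1^n) \lesssim \sqrt{\log n}$, is easy to verify by Monte Carlo: the supremum of $|\langle g, x\rangle|$ over the $\ell_1$ ball is attained at a signed coordinate vector, so it equals $\|g\|_\infty$. A quick sketch (dimension and trial count are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(2)
n, trials = 10_000, 1000

G = rng.standard_normal((trials, n))

# sup over B_1^n of |<g, x>| equals ||g||_inf, so gamma(B_1^n) is
# the expected maximum absolute coordinate of a standard Gaussian vector.
gamma_B1 = np.abs(G).max(axis=1).mean()
print(gamma_B1, np.sqrt(np.log(n)))   # same order: ~ sqrt(2 log n) vs sqrt(log n)
```

The empirical value hovers near $\sqrt{2 \log n}$, matching the $\sqrt{\log n}$ order claimed above up to a constant.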
5 Sparse recovery
Suppose we know that the signal x is sparse which means thatonly a few coordinates of x are nonzero As before our task is torecover x from the random linear measurements given by thevector
y = Ax
where A is an mtimes n random matrix
The number of nonzero coefficients of a vector x isin Rn or thesparsity of x is often denoted 983042x9830420 This is reminiscent of thenotation for the ℓp norm 983042x983042p = (
983123ni=1 |xi|p)1p and for a reason
You can quickly check that
983042x9830420 = limprarr0
983042x983042p
Keep in mind that neither 983042x9830420 nor 983042x983042p for 0 lt p lt 1 areactually norms on Rn since they fail triangle inequality
Matrix deviation inequality DAMC Lecture 15 May 29 - June 3 2020 20 22
Our first attempt to recover x is to try the following optimizationproblem
min 983042xprime9830420 subject to y = Axprime
This is sensible because this program selects the sparsest feasiblesolution But there is an implementation caveat the functionf(x) = 983042x9830420 is highly non-convex and even discontinuous There issimply no known algorithm to solve the optimization problemefficiently
To overcome this difficulty we use
min 983042xprime9830421 subject to y = Axprime
This is a convexification of the non-convex program and a varietyof numeric convex optimization methods are available to solve itefficiently
We will now show that an s-sparse signal x isin Rn can be efficientlyrecovered from m sim s log n random linear measurements
Matrix deviation inequality DAMC Lecture 15 May 29 - June 3 2020 21 22
Theorem 4 (Sparse recovery by optimization)
Assume A is a random matrix as in Theorem 3 If an unknown vectorx isin Rn has at most s non-zero coordinates ie 983042x9830420 le s then thesolution 983141x of the ℓ1 optimization program satisfies
E983042983141xminus x9830422 ≲983155
(s log n)m983042x9830422
Proof Cauchy-Schwarz inequality shows that 983042x9830421 leradics983042x9830422 Denote
the unit ball of the ℓ1 norm in Rn by Bn1 Then we can rewrite
983042x9830421 leradics983042x9830422 as the inclusion
x isinradics983042x9830422 middot Bn
1 = K
By the Gaussian width of Bn1 we have
ω(K) =radics983042x9830422 middot ω(Bn
1 ) leradics983042x9830422 middot γ(Bn
1 ) leradics983042x9830422 middot
983155log n
Substitute this in Theorem 3 and complete the proof
Matrix deviation inequality DAMC Lecture 15 May 29 - June 3 2020 22 22
2.2 Deviation of squares

It is sometimes helpful to bound the deviation of the square $\|Ax\|_2^2$ rather than of $\|Ax\|_2$ itself.

We can easily deduce the deviation of squares by using the identity

$$a^2 - b^2 = (a - b)^2 + 2b(a - b)$$

for $a = \|Ax\|_2$ and $b = \sqrt{m}\,\|x\|_2$.

We have

$$\mathbb{E}\sup_{x \in T}\bigl|\,\|Ax\|_2^2 - m\|x\|_2^2\,\bigr| \le CK^4\gamma(T)^2 + CK^2\sqrt{m}\,\operatorname{rad}(T)\,\gamma(T),$$

where $\operatorname{rad}(T) = \sup_{x \in T}\|x\|_2$ denotes the radius of $T$ and $K$ bounds the sub-gaussian norms of the rows of $A$.
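As a quick numerical sanity check (not part of the lecture), the following sketch draws a Gaussian matrix with illustrative sizes and verifies both the concentration $\|Ax\|_2 \approx \sqrt{m}\,\|x\|_2$ and the algebraic identity for the squares.

```python
import numpy as np

# Sanity check: for a Gaussian matrix A, ||Ax||_2 concentrates around
# sqrt(m)*||x||_2, and the identity a^2 - b^2 = (a - b)^2 + 2b(a - b)
# converts a bound on |a - b| into one on |a^2 - b^2|.
rng = np.random.default_rng(0)
m, n = 2000, 50
A = rng.standard_normal((m, n))
x = rng.standard_normal(n)

a = np.linalg.norm(A @ x)              # ||Ax||_2
b = np.sqrt(m) * np.linalg.norm(x)     # sqrt(m) * ||x||_2

rel_dev = abs(a - b) / b               # small with high probability
identity_gap = abs((a**2 - b**2) - ((a - b)**2 + 2 * b * (a - b)))
```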
2.3 Deriving the Johnson-Lindenstrauss Lemma

Let $X \subset \mathbb{R}^n$ be a finite set and let

$$T = \left\{ \frac{x - y}{\|x - y\|_2} : x, y \in X \text{ distinct} \right\}.$$

Then $T$ is finite, and we have

$$\gamma(T) \lesssim \sqrt{\log |T|} \le \sqrt{\log |X|^2} \lesssim \sqrt{\log |X|}.$$

Set $N = |X|$. The matrix deviation inequality with $m \ge C\varepsilon^{-2}\log N$ then yields

$$\sup_{x, y \in X} \left| \frac{\|A(x - y)\|_2}{\|x - y\|_2} - \sqrt{m} \right| \lesssim \sqrt{\log N} \le \varepsilon\sqrt{m}$$

with high probability, say $0.99$. Multiplying both sides by $\|x - y\|_2/\sqrt{m}$, we can write the last bound as follows: with probability at least $0.99$, we have

$$(1 - \varepsilon)\|x - y\|_2 \le \frac{1}{\sqrt{m}}\|Ax - Ay\|_2 \le (1 + \varepsilon)\|x - y\|_2$$

for all $x, y \in X$.
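The lemma can be checked empirically. The sketch below (with made-up sizes and an illustrative constant $8$ standing in for the unspecified $C$) projects a random point set and measures the worst relative distortion of pairwise distances.

```python
import numpy as np

# Empirical Johnson-Lindenstrauss check: a Gaussian projection A/sqrt(m)
# with m ~ eps^-2 log N preserves all pairwise distances of X up to 1 +/- eps.
rng = np.random.default_rng(1)
n, num_points, eps = 1000, 30, 0.3
m = int(np.ceil(8 * np.log(num_points) / eps**2))  # the constant 8 is a guess

X = rng.standard_normal((num_points, n))
A = rng.standard_normal((m, n))

worst = 0.0
for i in range(num_points):
    for j in range(i + 1, num_points):
        d = np.linalg.norm(X[i] - X[j])
        d_proj = np.linalg.norm(A @ (X[i] - X[j])) / np.sqrt(m)
        worst = max(worst, abs(d_proj - d) / d)
```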
3 Covariance estimation
We already showed that $N \sim n \log n$ samples are enough to estimate the covariance matrix of a general distribution in $\mathbb{R}^n$.

We can do better if the distribution is sub-gaussian: we can get rid of the logarithmic oversampling and of the boundedness condition.
Theorem 2 (Covariance estimation for sub-gaussian distributions)

Let $X$ be a random vector in $\mathbb{R}^n$ with covariance matrix $\Sigma$, and let $\Sigma_N = \frac{1}{N}\sum_{i=1}^N X_i X_i^T$ be the sample covariance computed from independent samples $X_1, \ldots, X_N$ distributed as $X$. Suppose $X$ is sub-gaussian, and more specifically, for any $x \in \mathbb{R}^n$,

$$\|\langle X, x\rangle\|_{\psi_2} \lesssim \|\langle X, x\rangle\|_{L^2} = \|\Sigma^{1/2}x\|_2.$$

Then for every $N \ge 1$ we have

$$\mathbb{E}\|\Sigma_N - \Sigma\| \lesssim \|\Sigma\|\left(\sqrt{\frac{n}{N}} + \frac{n}{N}\right).$$

This result shows that $N \sim \varepsilon^{-2} n$ samples give $\mathbb{E}\|\Sigma_N - \Sigma\| \lesssim \varepsilon\|\Sigma\|$ for $\varepsilon \in (0, 1)$.
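A numerical sketch of the theorem's scaling, for Gaussian data with illustrative sizes; the factor-of-5 slack in the check stands in for the unspecified absolute constant.

```python
import numpy as np

# Sketch: the operator-norm error of the sample covariance behaves like
# ||Sigma|| * (sqrt(n/N) + n/N) for Gaussian data.
rng = np.random.default_rng(2)
n, N = 20, 5000
Sigma = np.diag(1.0 / np.arange(1, n + 1))     # a covariance with decaying spectrum
X = rng.multivariate_normal(np.zeros(n), Sigma, size=N)

Sigma_N = (X.T @ X) / N                        # sample covariance (mean-zero model)
err = np.linalg.norm(Sigma_N - Sigma, 2)       # operator norm
bound = np.linalg.norm(Sigma, 2) * (np.sqrt(n / N) + n / N)
ratio = err / bound
```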
Proof. We first bring the random vectors $X, X_1, \ldots, X_N$ to the isotropic position. This can be done by a suitable linear transformation: you can easily check that there exist isotropic random vectors $Z, Z_1, \ldots, Z_N$ such that

$$X = \Sigma^{1/2}Z, \qquad X_i = \Sigma^{1/2}Z_i, \quad i = 1, \ldots, N.$$

The sub-gaussian assumption implies that $\|Z\|_{\psi_2} \lesssim 1$. Then

$$\|\Sigma_N - \Sigma\| = \|\Sigma^{1/2} R_N \Sigma^{1/2}\| = \max_{\|x\|_2 = 1} \bigl|\langle \Sigma^{1/2} R_N \Sigma^{1/2} x, x\rangle\bigr|, \quad \text{where} \quad R_N = \frac{1}{N}\sum_{i=1}^N Z_i Z_i^T - I_n.$$
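The reduction $\Sigma_N - \Sigma = \Sigma^{1/2} R_N \Sigma^{1/2}$ is purely algebraic and can be checked numerically. The sketch below (sizes are arbitrary) uses the Cholesky factor $L$ with $LL^T = \Sigma$ as a stand-in for $\Sigma^{1/2}$, which is all the identity needs.

```python
import numpy as np

# Check of the proof's algebraic identity Sigma_N - Sigma = L R_N L^T,
# where L L^T = Sigma (any factorization of Sigma works).
rng = np.random.default_rng(3)
n, N = 5, 100
M = rng.standard_normal((n, n))
Sigma = M @ M.T + np.eye(n)                  # positive definite
L = np.linalg.cholesky(Sigma)

Z = rng.standard_normal((N, n))              # isotropic rows Z_i
X = Z @ L.T                                  # X_i = L Z_i, so Cov(X_i) = Sigma

Sigma_N = (X.T @ X) / N
R_N = (Z.T @ Z) / N - np.eye(n)
gap = np.linalg.norm(Sigma_N - Sigma - L @ R_N @ L.T)
```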
Let $T = \{\Sigma^{1/2}x : x \in \mathbb{R}^n, \|x\|_2 = 1\}$. We can rewrite $\|\Sigma_N - \Sigma\|$ as

$$\|\Sigma_N - \Sigma\| = \max_{x \in T} \bigl|\langle R_N x, x\rangle\bigr| = \max_{x \in T}\left|\frac{1}{N}\sum_{i=1}^N \langle Z_i, x\rangle^2 - \|x\|_2^2\right| = \frac{1}{N}\max_{x \in T}\bigl|\,\|Ax\|_2^2 - N\|x\|_2^2\,\bigr|,$$

where $A$ is the $N \times n$ random matrix whose rows are the $Z_i$. Now apply the matrix deviation inequality for squares to conclude that

$$\mathbb{E}\|\Sigma_N - \Sigma\| \lesssim \frac{1}{N}\left(\gamma(T)^2 + \sqrt{N}\,\operatorname{rad}(T)\,\gamma(T)\right).$$

The radius and Gaussian width of the ellipsoid $T$ are easy to compute:

$$\operatorname{rad}(T) = \|\Sigma\|^{1/2} \quad \text{and} \quad \gamma(T) \le (\operatorname{tr}\Sigma)^{1/2}.$$

Using $\operatorname{tr}\Sigma \le n\|\Sigma\|$, we obtain

$$\mathbb{E}\|\Sigma_N - \Sigma\| \lesssim \|\Sigma\|\left(\sqrt{\frac{n}{N}} + \frac{n}{N}\right). \qquad \square$$
3.1 Low-dimensional distributions

Far fewer samples are needed for covariance estimation of low-dimensional sub-gaussian distributions. Indeed, the proof above actually yields

$$\mathbb{E}\|\Sigma_N - \Sigma\| \lesssim \|\Sigma\|\left(\sqrt{\frac{r}{N}} + \frac{r}{N}\right),$$

where

$$r = r(\Sigma^{1/2}) = \frac{\operatorname{tr}\Sigma}{\|\Sigma\|}$$

is the stable rank of $\Sigma^{1/2}$. This means that covariance estimation is possible with

$$N \sim r$$

samples.
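For instance (a made-up example, with an illustrative choice of $N$): a geometrically decaying spectrum has stable rank below 2, and a number of samples far below the ambient dimension already estimates $\Sigma$ with constant relative accuracy.

```python
import numpy as np

# Sketch: with a rapidly decaying spectrum the stable rank
# r = tr(Sigma)/||Sigma|| is tiny compared to n, and N << n samples
# give a nontrivial covariance estimate.
rng = np.random.default_rng(4)
n = 200
evals = np.exp(-np.arange(n, dtype=float))   # spectrum 1, e^-1, e^-2, ...
Sigma = np.diag(evals)
r = evals.sum() / evals.max()                # stable rank, about 1.58

N = 100                                      # far fewer than n samples
X = rng.multivariate_normal(np.zeros(n), Sigma, size=N)
Sigma_N = (X.T @ X) / N
rel_err = np.linalg.norm(Sigma_N - Sigma, 2) / np.linalg.norm(Sigma, 2)
```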
4 Underdetermined linear equations
Suppose we need to solve a severely underdetermined system of linear equations: say, we have $m$ equations in $n \gg m$ variables,

$$Ax = y.$$

When the linear system is underdetermined, we cannot find $x$ with any accuracy unless we know something extra about $x$. So let us assume that we do have some a priori information. We can describe this situation mathematically by assuming that

$$x \in K,$$

where $K \subset \mathbb{R}^n$ is some known set that describes anything we know about $x$ a priori.

Summarizing, here is the problem we are trying to solve: determine an estimate $\hat{x} = \hat{x}(A, y, K)$ of the solution to the underdetermined linear equation $Ax = y$, as accurately as possible, assuming that $x \in K$.
4.1 An optimization approach

We convert the set $K$ into a function on $\mathbb{R}^n$, called the Minkowski functional of $K$. This is basically a function whose level sets are multiples of $K$.

To define it formally, assume that $K$ is star-shaped, which means that together with any point $x$, the set $K$ must contain the entire interval that connects $x$ with the origin; see the figure for illustration.

(Figure: the set on the left, whose boundary is shown, is star-shaped; the set on the right is not.)
The Minkowski functional of $K$ is defined as

$$\|x\|_K = \inf\{t > 0 : x/t \in K\}, \quad x \in \mathbb{R}^n.$$

If the set $K$ is convex and symmetric about the origin, $\|x\|_K$ is actually a norm on $\mathbb{R}^n$ (Exercise).

Now we propose the following way to solve the recovery problem: solve the optimization program

$$\min \|x'\|_K \quad \text{subject to} \quad y = Ax'.$$

Note that this is a very natural program: it looks at all solutions of the equation $y = Ax'$ and tries to "shrink" the solution $x'$ toward $K$. (This is what minimization of the Minkowski functional is about.)

Also note that if $K$ is convex, this is a convex optimization program, and thus it can be solved effectively by one of the many available numerical algorithms.
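As an illustration of the definition (not from the lecture), here is a minimal sketch that evaluates $\|x\|_K$ by bisection, given only a membership oracle for $K$; the helper name and tolerances are made up, and it assumes $K$ is star-shaped, bounded, and contains $x/t_{\max}$.

```python
import numpy as np

# Minkowski functional ||x||_K = inf{t > 0 : x/t in K} by bisection.
# For star-shaped K (containing the origin), t -> [x/t in K] is monotone,
# so bisection on t converges to ||x||_K.
def minkowski_functional(x, in_K, t_max=1e6, iters=80):
    lo, hi = 0.0, t_max
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if in_K(x / mid):
            hi = mid          # x/mid in K: a smaller t might still work
        else:
            lo = mid          # x/mid outside K: need a larger t
    return hi

# Example: K = unit Euclidean ball, so ||x||_K should equal ||x||_2.
x = np.array([3.0, 4.0])
val = minkowski_functional(x, lambda z: np.linalg.norm(z) <= 1.0)
```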
The main question we should now be asking is: would the solution of this program approximate the original vector $x$? The following result bounds the approximation error for a probabilistic model of linear equations.

Theorem 3 (Recovery by optimization)

Assume that $A$ is an $m \times n$ random matrix whose rows $A_i$ are independent, isotropic and sub-gaussian random vectors in $\mathbb{R}^n$. The solution $\hat{x}$ of the optimization problem satisfies

$$\mathbb{E}\|\hat{x} - x\|_2 \lesssim \frac{\omega(K)}{\sqrt{m}},$$

where $\omega(K)$ is the Gaussian width of $K$.

Proof. Both the original vector $x$ and the solution $\hat{x}$ are feasible vectors for the optimization program. Then, by minimality, $\|\hat{x}\|_K \le \|x\|_K \le 1$, where the second inequality holds because $x \in K$. Thus both $\hat{x}, x \in K$.
By $A\hat{x} = Ax = y$ we have $A(\hat{x} - x) = 0$. Let us apply the matrix deviation inequality for $T = K - K$. It gives

$$\mathbb{E}\sup_{u, v \in K}\bigl|\,\|A(u - v)\|_2 - \sqrt{m}\|u - v\|_2\,\bigr| \lesssim \gamma(T) = 2\omega(K).$$

Substitute $u = \hat{x}$ and $v = x$ here. We may do this since, as we noted above, both these vectors belong to $K$. But then the term $\|A(u - v)\|_2$ equals zero; it disappears from the bound, and we get

$$\mathbb{E}\,\sqrt{m}\,\|\hat{x} - x\|_2 \lesssim \omega(K).$$

Dividing both sides by $\sqrt{m}$, we complete the proof. $\square$

This theorem says that a signal $x \in K$ can be efficiently recovered from

$$m \sim \omega(K)^2$$

random linear measurements.
5 Sparse recovery
Suppose we know that the signal $x$ is sparse, which means that only a few coordinates of $x$ are nonzero. As before, our task is to recover $x$ from the random linear measurements given by the vector

$$y = Ax,$$

where $A$ is an $m \times n$ random matrix.

The number of nonzero coefficients of a vector $x \in \mathbb{R}^n$, or the sparsity of $x$, is often denoted $\|x\|_0$. This is reminiscent of the notation for the $\ell_p$ norm $\|x\|_p = \left(\sum_{i=1}^n |x_i|^p\right)^{1/p}$, and for a reason: you can quickly check that

$$\|x\|_0 = \lim_{p \to 0^+} \|x\|_p^p.$$

Keep in mind that neither $\|x\|_0$ nor $\|x\|_p$ for $0 < p < 1$ is actually a norm on $\mathbb{R}^n$: the former fails homogeneity, and the latter fails the triangle inequality.
Our first attempt to recover $x$ is to try the following optimization problem:

$$\min \|x'\|_0 \quad \text{subject to} \quad y = Ax'.$$

This is sensible because this program selects the sparsest feasible solution. But there is an implementation caveat: the function $f(x) = \|x\|_0$ is highly non-convex and even discontinuous. There is simply no known algorithm to solve this optimization problem efficiently.

To overcome this difficulty, we use

$$\min \|x'\|_1 \quad \text{subject to} \quad y = Ax'.$$

This is a convexification of the non-convex program, and a variety of numerical convex optimization methods are available to solve it efficiently.

We will now show that an $s$-sparse signal $x \in \mathbb{R}^n$ can be efficiently recovered from $m \sim s \log n$ random linear measurements.
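To make the convexification concrete, here is a sketch (using SciPy, with illustrative sizes and a guessed constant in $m \sim s\log n$) that solves the $\ell_1$ program as a linear program via the standard split $x' = u - v$ with $u, v \ge 0$, so that $\|x'\|_1 = \sum_i (u_i + v_i)$.

```python
import numpy as np
from scipy.optimize import linprog

# Basis pursuit min ||x'||_1 s.t. Ax' = y, recast as a linear program.
rng = np.random.default_rng(5)
n, s = 60, 3
m = 4 * s * int(np.log(n))           # m ~ s log n; the constant 4 is a guess
A = rng.standard_normal((m, n))

x = np.zeros(n)                      # an s-sparse ground truth
x[rng.choice(n, size=s, replace=False)] = rng.standard_normal(s)
y = A @ x

c = np.ones(2 * n)                   # objective: sum(u) + sum(v) = ||x'||_1
A_eq = np.hstack([A, -A])            # A(u - v) = y
res = linprog(c, A_eq=A_eq, b_eq=y, bounds=(0, None))
x_hat = res.x[:n] - res.x[n:]
err = np.linalg.norm(x_hat - x)
```

With this much oversampling the LP typically recovers the sparse signal exactly, up to solver tolerance.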
Theorem 4 (Sparse recovery by optimization)

Assume $A$ is a random matrix as in Theorem 3. If an unknown vector $x \in \mathbb{R}^n$ has at most $s$ nonzero coordinates, i.e. $\|x\|_0 \le s$, then the solution $\hat{x}$ of the $\ell_1$ optimization program satisfies

$$\mathbb{E}\|\hat{x} - x\|_2 \lesssim \sqrt{\frac{s\log n}{m}}\,\|x\|_2.$$

Proof. The Cauchy-Schwarz inequality, applied to the $s$ nonzero coordinates of $x$, shows that $\|x\|_1 \le \sqrt{s}\,\|x\|_2$. Denote the unit ball of the $\ell_1$ norm in $\mathbb{R}^n$ by $B_1^n$. Then we can rewrite $\|x\|_1 \le \sqrt{s}\,\|x\|_2$ as the inclusion

$$x \in \sqrt{s}\,\|x\|_2 \cdot B_1^n =: K.$$

For the Gaussian width of $B_1^n$ we have

$$\omega(K) = \sqrt{s}\,\|x\|_2 \cdot \omega(B_1^n) \le \sqrt{s}\,\|x\|_2 \cdot \gamma(B_1^n) \lesssim \sqrt{s}\,\|x\|_2 \cdot \sqrt{\log n}.$$

Substituting this into Theorem 3 completes the proof. $\square$
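The bound $\gamma(B_1^n) \lesssim \sqrt{\log n}$ can also be seen numerically: the supremum of $\langle g, x\rangle$ over $B_1^n$ is attained at a signed coordinate vector, so the width equals $\mathbb{E}\max_i |g_i| \approx \sqrt{2\log n}$. A Monte Carlo sketch with illustrative sizes:

```python
import numpy as np

# Monte Carlo estimate of the Gaussian width of the l1 ball:
# sup over B_1^n of <g, x> equals ||g||_inf, which grows like sqrt(2 log n).
rng = np.random.default_rng(6)
n, trials = 5000, 1000
G = rng.standard_normal((trials, n))
width_mc = np.abs(G).max(axis=1).mean()   # estimate of omega(B_1^n)
ratio = width_mc / np.sqrt(2 * np.log(n))
```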
23 Deriving Johnson-Lindenstrauss Lemma
X sub Rn and T = (xminus y)983042xminus y9830422 xy isin X Then T is finiteand we have
γ(T ) ≲983155
log |T | le983155
log |X |2 ≲983155
log |X |
Matrix deviation inequality and m ge Cεminus2 logN then yield
supxyisinX
983055983055983055983055983042A(xminus y)9830422983042xminus y9830422
minusradicm
983055983055983055983055 ≲983155
logN le εradicm
with high probability say 099 Multiplying both sides by983042xminus y9830422
radicm we can write the last bound as follows With
probability at least 099 we have
(1minus ε)983042xminus y9830422 9832491radicm983042AxminusAy9830422 983249 (1 + ε)983042xminus y9830422
for all xy isin X
Matrix deviation inequality DAMC Lecture 15 May 29 - June 3 2020 10 22
3 Covariance estimation
We already showed that N sim n log n samples are enough toestimate the covariance matrix of a general distribution in Rn
We can do better if the distribution is sub-gaussian we can get ridof the logarithmic oversampling and the boundedness condition
Theorem 2 (Covariance estimation for sub-gaussian distributions)
Let X be a random vector in Rn with covariance matrix Σ Suppose Xis sub-gaussian and more specifically for any x isin Rn
983042〈Xx〉983042ψ2 ≲ 983042〈Xx〉983042L2 = 983042Σ12x9830422
Then for every N ge 1 we have
E983042ΣN minus Σ983042 ≲ 983042Σ983042983061983157
n
N+
n
N
983062
This result shows N sim εminus2n gives E983042ΣN minus Σ983042 ≲ ε983042Σ983042
Matrix deviation inequality DAMC Lecture 15 May 29 - June 3 2020 11 22
Proof We first bring the random vectors XX1 XN to theisotropic position This can be done by a suitable lineartransformation You will easily check that there exists isotropicrandom vectors ZZ1 ZN such that
X = Σ12Z Xi = Σ12Zi i = 1 N
The sub-gaussian assumption implies that
983042Z983042ψ2 ≲ 1
Then
983042ΣN minus Σ983042 = 983042Σ12RNΣ12983042 = max983042x9830422=1
〈Σ12RNΣ12xx〉
where
RN =1
N
N983131
i=1
ZiZTi minus In
Matrix deviation inequality DAMC Lecture 15 May 29 - June 3 2020 12 22
Let T = Σ12x isin Rn 983042x9830422 = 1 We can rewrite 983042ΣN minus Σ983042 as
983042ΣN minus Σ983042 = maxxisinT
〈RNxx〉 = maxxisinT
9830559830559830559830551
N
983131N
i=1〈Zix〉2 minus 983042x98304222
983055983055983055983055
=1
NmaxxisinT
983055983055983042Ax98304222 minusN983042x98304222983055983055
Now apply the matrix deviation inequality for squares to conclude that
E 983042ΣN minus Σ983042 ≲ 1
N
983059γ(T )2 +
radicN rad(T )γ(T )
983060
The radius and Gaussian width of the ellipsoid T are easy to compute
rad(T ) = 983042Σ98304212 and γ(T ) le (trΣ)12
By using tr(Σ) le n983042Σ983042 we have
E983042ΣN minus Σ983042 ≲ 983042Σ983042983061983157
n
N+
n
N
983062
Matrix deviation inequality DAMC Lecture 15 May 29 - June 3 2020 13 22
31 Low-dimensional distributions
We can show that much fewer samples are needed for covarianceestimation of low-dimensional sub-gaussian distributions Indeedthe proof actually yields
E983042ΣN minus Σ983042 ≲ 983042Σ983042983061983157
r
N+
r
N
983062
where
r = r(Σ12) =trΣ
983042Σ983042
is the stable rank of Σ12 This means that covariance estimationis possible with
N sim r
samples
Matrix deviation inequality DAMC Lecture 15 May 29 - June 3 2020 14 22
4 Underdetermined linear equations
Suppose we need to solve a severely underdetermined system oflinear equations say we have m equations in n ≫ m variables
Ax = y
When the linear system is underdetermined we can not find xwith any accuracy unless we know something extra about x Solet us assume that we do have some a-priori information We candescribe this situation mathematically by assuming that
x isin K
where K sub Rn is some known set in Rn that describes anythingthat we know about x a-priori
Summarizing here is the problem we are trying to solveDetermine a solution x = x(AyK) to the underdetermined linearequation Ax = y as accurately as possible assuming that x isin K
Matrix deviation inequality DAMC Lecture 15 May 29 - June 3 2020 15 22
41 An optimization approach
We convert the set K into a function on Rn which is called theMinkowski functional of K This is basically a function whose levelsets are multiples of K
To define it formally assume that K is star-shaped which meansthat together with any point x the set K must contain the entireinterval that connects x with the origin see the figure forillustration
The set on the left (whose boundary is shown) is star-shaped theset on the right is not
Matrix deviation inequality DAMC Lecture 15 May 29 - June 3 2020 16 22
The Minkowski functional of K is defined as
983042x983042K = inft gt 0 xt isin K x isin Rn
If the set K is convex and symmetric about the origin 983042x983042K isactually a norm on Rn (Exercise)
Now we propose the following way to solve the recovery problemsolve the optimization program
min 983042xprime983042K subject to y = Axprime
Note that this is a very natural program it looks at all solutions tothe equation y = Axprime and tries to ldquoshrinkrdquo the solution xprime towardK (This is what minimization of Minkowski functional is about)
Also note that if K is convex this is a convex optimizationprogram and thus can be solved effectively by one of the manyavailable numeric algorithms
Matrix deviation inequality DAMC Lecture 15 May 29 - June 3 2020 17 22
The main question we should now be asking is ndash would thesolution to this program approximate the original vector x Thefollowing result bounds the approximation error for a probabilisticmodel of linear equations
Theorem 3 (Recovery by optimization)
Assume that A is an mtimes n random matrix whose rows Ai areindependent isotropic and sub-gaussian random vectors in Rn Thesolution 983141x of the optimization problem satisfies
E983042983141xminus x9830422 ≲ω(K)radic
m
where ω(K) is the Gaussian width of K
Proof Both the original vector x and the solution 983141x are feasible vectorsfor the optimization program Then 983042983141x983042K le 983042x983042K le 1 Thus both983141xx isin K
Matrix deviation inequality DAMC Lecture 15 May 29 - June 3 2020 18 22
By A983141x = Ax = y we have A(983141xminus x) = 0 Let us apply matrixdeviation inequality for T = K minusK It gives
supuvisinK
983055983055983042A(uminus v)9830422 minusradicm983042uminus v9830422
983055983055 ≲ γ(T ) = 2ω(K)
Substitute u = 983141x and v = x here We may do this since as we notedabove both these vectors belong to K But then the term 983042A(uminus v)9830422will equal zero It disappears from the bound and we get
Eradicm983042983141xminus x9830422 ≲ ω(K)
Dividing both sides byradicm we complete the proof
This theorem says that a signal x isin K can be efficiently recovered from
m sim ω(K)2
random linear measurements
Matrix deviation inequality DAMC Lecture 15 May 29 - June 3 2020 19 22
5 Sparse recovery
Suppose we know that the signal x is sparse which means thatonly a few coordinates of x are nonzero As before our task is torecover x from the random linear measurements given by thevector
y = Ax
where A is an mtimes n random matrix
The number of nonzero coefficients of a vector x isin Rn or thesparsity of x is often denoted 983042x9830420 This is reminiscent of thenotation for the ℓp norm 983042x983042p = (
983123ni=1 |xi|p)1p and for a reason
You can quickly check that
983042x9830420 = limprarr0
983042x983042p
Keep in mind that neither 983042x9830420 nor 983042x983042p for 0 lt p lt 1 areactually norms on Rn since they fail triangle inequality
Matrix deviation inequality DAMC Lecture 15 May 29 - June 3 2020 20 22
Our first attempt to recover x is to try the following optimizationproblem
min 983042xprime9830420 subject to y = Axprime
This is sensible because this program selects the sparsest feasiblesolution But there is an implementation caveat the functionf(x) = 983042x9830420 is highly non-convex and even discontinuous There issimply no known algorithm to solve the optimization problemefficiently
To overcome this difficulty we use
min 983042xprime9830421 subject to y = Axprime
This is a convexification of the non-convex program and a varietyof numeric convex optimization methods are available to solve itefficiently
We will now show that an s-sparse signal x isin Rn can be efficientlyrecovered from m sim s log n random linear measurements
Matrix deviation inequality DAMC Lecture 15 May 29 - June 3 2020 21 22
Theorem 4 (Sparse recovery by optimization)
Assume A is a random matrix as in Theorem 3 If an unknown vectorx isin Rn has at most s non-zero coordinates ie 983042x9830420 le s then thesolution 983141x of the ℓ1 optimization program satisfies
E983042983141xminus x9830422 ≲983155
(s log n)m983042x9830422
Proof Cauchy-Schwarz inequality shows that 983042x9830421 leradics983042x9830422 Denote
the unit ball of the ℓ1 norm in Rn by Bn1 Then we can rewrite
983042x9830421 leradics983042x9830422 as the inclusion
x isinradics983042x9830422 middot Bn
1 = K
By the Gaussian width of Bn1 we have
ω(K) =radics983042x9830422 middot ω(Bn
1 ) leradics983042x9830422 middot γ(Bn
1 ) leradics983042x9830422 middot
983155log n
Substitute this in Theorem 3 and complete the proof
Matrix deviation inequality DAMC Lecture 15 May 29 - June 3 2020 22 22
3 Covariance estimation
We already showed that N sim n log n samples are enough toestimate the covariance matrix of a general distribution in Rn
We can do better if the distribution is sub-gaussian we can get ridof the logarithmic oversampling and the boundedness condition
Theorem 2 (Covariance estimation for sub-gaussian distributions)
Let X be a random vector in Rn with covariance matrix Σ Suppose Xis sub-gaussian and more specifically for any x isin Rn
983042〈Xx〉983042ψ2 ≲ 983042〈Xx〉983042L2 = 983042Σ12x9830422
Then for every N ge 1 we have
E983042ΣN minus Σ983042 ≲ 983042Σ983042983061983157
n
N+
n
N
983062
This result shows N sim εminus2n gives E983042ΣN minus Σ983042 ≲ ε983042Σ983042
Matrix deviation inequality DAMC Lecture 15 May 29 - June 3 2020 11 22
Proof We first bring the random vectors XX1 XN to theisotropic position This can be done by a suitable lineartransformation You will easily check that there exists isotropicrandom vectors ZZ1 ZN such that
X = Σ12Z Xi = Σ12Zi i = 1 N
The sub-gaussian assumption implies that
983042Z983042ψ2 ≲ 1
Then
983042ΣN minus Σ983042 = 983042Σ12RNΣ12983042 = max983042x9830422=1
〈Σ12RNΣ12xx〉
where
RN =1
N
N983131
i=1
ZiZTi minus In
Matrix deviation inequality DAMC Lecture 15 May 29 - June 3 2020 12 22
Let T = Σ12x isin Rn 983042x9830422 = 1 We can rewrite 983042ΣN minus Σ983042 as
983042ΣN minus Σ983042 = maxxisinT
〈RNxx〉 = maxxisinT
9830559830559830559830551
N
983131N
i=1〈Zix〉2 minus 983042x98304222
983055983055983055983055
=1
NmaxxisinT
983055983055983042Ax98304222 minusN983042x98304222983055983055
Now apply the matrix deviation inequality for squares to conclude that
E 983042ΣN minus Σ983042 ≲ 1
N
983059γ(T )2 +
radicN rad(T )γ(T )
983060
The radius and Gaussian width of the ellipsoid T are easy to compute
rad(T ) = 983042Σ98304212 and γ(T ) le (trΣ)12
By using tr(Σ) le n983042Σ983042 we have
E983042ΣN minus Σ983042 ≲ 983042Σ983042983061983157
n
N+
n
N
983062
Matrix deviation inequality DAMC Lecture 15 May 29 - June 3 2020 13 22
31 Low-dimensional distributions
We can show that much fewer samples are needed for covarianceestimation of low-dimensional sub-gaussian distributions Indeedthe proof actually yields
E983042ΣN minus Σ983042 ≲ 983042Σ983042983061983157
r
N+
r
N
983062
where
r = r(Σ12) =trΣ
983042Σ983042
is the stable rank of Σ12 This means that covariance estimationis possible with
N sim r
samples
Matrix deviation inequality DAMC Lecture 15 May 29 - June 3 2020 14 22
4 Underdetermined linear equations
Suppose we need to solve a severely underdetermined system oflinear equations say we have m equations in n ≫ m variables
Ax = y
When the linear system is underdetermined we can not find xwith any accuracy unless we know something extra about x Solet us assume that we do have some a-priori information We candescribe this situation mathematically by assuming that
x isin K
where K sub Rn is some known set in Rn that describes anythingthat we know about x a-priori
Summarizing here is the problem we are trying to solveDetermine a solution x = x(AyK) to the underdetermined linearequation Ax = y as accurately as possible assuming that x isin K
Matrix deviation inequality DAMC Lecture 15 May 29 - June 3 2020 15 22
41 An optimization approach
We convert the set K into a function on Rn which is called theMinkowski functional of K This is basically a function whose levelsets are multiples of K
To define it formally assume that K is star-shaped which meansthat together with any point x the set K must contain the entireinterval that connects x with the origin see the figure forillustration
The set on the left (whose boundary is shown) is star-shaped theset on the right is not
Matrix deviation inequality DAMC Lecture 15 May 29 - June 3 2020 16 22
The Minkowski functional of K is defined as
983042x983042K = inft gt 0 xt isin K x isin Rn
If the set K is convex and symmetric about the origin 983042x983042K isactually a norm on Rn (Exercise)
Now we propose the following way to solve the recovery problemsolve the optimization program
min 983042xprime983042K subject to y = Axprime
Note that this is a very natural program it looks at all solutions tothe equation y = Axprime and tries to ldquoshrinkrdquo the solution xprime towardK (This is what minimization of Minkowski functional is about)
Also note that if K is convex this is a convex optimizationprogram and thus can be solved effectively by one of the manyavailable numeric algorithms
Matrix deviation inequality DAMC Lecture 15 May 29 - June 3 2020 17 22
The main question we should now be asking is ndash would thesolution to this program approximate the original vector x Thefollowing result bounds the approximation error for a probabilisticmodel of linear equations
Theorem 3 (Recovery by optimization)
Assume that A is an mtimes n random matrix whose rows Ai areindependent isotropic and sub-gaussian random vectors in Rn Thesolution 983141x of the optimization problem satisfies
E983042983141xminus x9830422 ≲ω(K)radic
m
where ω(K) is the Gaussian width of K
Proof Both the original vector x and the solution 983141x are feasible vectorsfor the optimization program Then 983042983141x983042K le 983042x983042K le 1 Thus both983141xx isin K
Matrix deviation inequality DAMC Lecture 15 May 29 - June 3 2020 18 22
By A983141x = Ax = y we have A(983141xminus x) = 0 Let us apply matrixdeviation inequality for T = K minusK It gives
supuvisinK
983055983055983042A(uminus v)9830422 minusradicm983042uminus v9830422
983055983055 ≲ γ(T ) = 2ω(K)
Substitute u = 983141x and v = x here We may do this since as we notedabove both these vectors belong to K But then the term 983042A(uminus v)9830422will equal zero It disappears from the bound and we get
Eradicm983042983141xminus x9830422 ≲ ω(K)
Dividing both sides byradicm we complete the proof
This theorem says that a signal x isin K can be efficiently recovered from
m sim ω(K)2
random linear measurements
Matrix deviation inequality DAMC Lecture 15 May 29 - June 3 2020 19 22
5 Sparse recovery
Suppose we know that the signal x is sparse which means thatonly a few coordinates of x are nonzero As before our task is torecover x from the random linear measurements given by thevector
y = Ax
where A is an mtimes n random matrix
The number of nonzero coefficients of a vector x isin Rn or thesparsity of x is often denoted 983042x9830420 This is reminiscent of thenotation for the ℓp norm 983042x983042p = (
983123ni=1 |xi|p)1p and for a reason
You can quickly check that
983042x9830420 = limprarr0
983042x983042p
Keep in mind that neither 983042x9830420 nor 983042x983042p for 0 lt p lt 1 areactually norms on Rn since they fail triangle inequality
Matrix deviation inequality DAMC Lecture 15 May 29 - June 3 2020 20 22
Our first attempt to recover x is to try the following optimizationproblem
min 983042xprime9830420 subject to y = Axprime
This is sensible because this program selects the sparsest feasiblesolution But there is an implementation caveat the functionf(x) = 983042x9830420 is highly non-convex and even discontinuous There issimply no known algorithm to solve the optimization problemefficiently
To overcome this difficulty we use
min 983042xprime9830421 subject to y = Axprime
This is a convexification of the non-convex program and a varietyof numeric convex optimization methods are available to solve itefficiently
We will now show that an s-sparse signal x isin Rn can be efficientlyrecovered from m sim s log n random linear measurements
Matrix deviation inequality DAMC Lecture 15 May 29 - June 3 2020 21 22
Theorem 4 (Sparse recovery by optimization)
Assume A is a random matrix as in Theorem 3 If an unknown vectorx isin Rn has at most s non-zero coordinates ie 983042x9830420 le s then thesolution 983141x of the ℓ1 optimization program satisfies
E983042983141xminus x9830422 ≲983155
(s log n)m983042x9830422
Proof Cauchy-Schwarz inequality shows that 983042x9830421 leradics983042x9830422 Denote
the unit ball of the ℓ1 norm in Rn by Bn1 Then we can rewrite
983042x9830421 leradics983042x9830422 as the inclusion
x isinradics983042x9830422 middot Bn
1 = K
By the Gaussian width of Bn1 we have
ω(K) =radics983042x9830422 middot ω(Bn
1 ) leradics983042x9830422 middot γ(Bn
1 ) leradics983042x9830422 middot
983155log n
Substitute this in Theorem 3 and complete the proof
Matrix deviation inequality DAMC Lecture 15 May 29 - June 3 2020 22 22
Proof We first bring the random vectors XX1 XN to theisotropic position This can be done by a suitable lineartransformation You will easily check that there exists isotropicrandom vectors ZZ1 ZN such that
X = Σ12Z Xi = Σ12Zi i = 1 N
The sub-gaussian assumption implies that
983042Z983042ψ2 ≲ 1
Then
983042ΣN minus Σ983042 = 983042Σ12RNΣ12983042 = max983042x9830422=1
〈Σ12RNΣ12xx〉
where
RN =1
N
N983131
i=1
ZiZTi minus In
Matrix deviation inequality DAMC Lecture 15 May 29 - June 3 2020 12 22
Let T = {Σ^{1/2} x : x ∈ Rⁿ, ‖x‖₂ = 1}. We can rewrite ‖Σ_N − Σ‖ as

‖Σ_N − Σ‖ = max_{x∈T} |⟨R_N x, x⟩| = max_{x∈T} | (1/N) ∑_{i=1}^{N} ⟨Z_i, x⟩² − ‖x‖₂² |
          = (1/N) max_{x∈T} | ‖Ax‖₂² − N ‖x‖₂² |,

where A is the N × n matrix whose rows are the Z_iᵀ. Now apply the matrix deviation inequality for squares to conclude that

E ‖Σ_N − Σ‖ ≲ (1/N) ( γ(T)² + √N · rad(T) · γ(T) ).

The radius and Gaussian width of the ellipsoid T are easy to compute:

rad(T) = ‖Σ‖^{1/2}   and   γ(T) ≤ (tr Σ)^{1/2}.

Using tr(Σ) ≤ n ‖Σ‖, we obtain

E ‖Σ_N − Σ‖ ≲ ‖Σ‖ ( √(n/N) + n/N ).

Matrix deviation inequality DAMC Lecture 15 May 29 - June 3 2020 13 / 22
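The rate ‖Σ‖(√(n/N) + n/N) can be observed in simulation: the operator-norm error of the sample covariance shrinks roughly like √(n/N) once N ≳ n. A rough Monte Carlo sketch, assuming numpy is available; the diagonal Σ below is an illustrative choice.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 20
Sigma = np.diag(np.linspace(0.5, 2.0, n))    # illustrative diagonal covariance

def cov_error(N, trials=20):
    """Average operator-norm error ||Sigma_N - Sigma|| over several trials."""
    errs = []
    for _ in range(trials):
        X = rng.standard_normal((N, n)) * np.sqrt(np.diag(Sigma))
        Sigma_N = X.T @ X / N                # sample covariance (mean-zero data)
        errs.append(np.linalg.norm(Sigma_N - Sigma, 2))
    return float(np.mean(errs))

for N in [100, 400, 1600, 6400]:
    print(N, round(cov_error(N), 3))         # error decays roughly like sqrt(n/N)
```

Quadrupling N should roughly halve the error in the regime N ≳ n, matching the √(n/N) term.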
3.1 Low-dimensional distributions

We can show that far fewer samples are needed for covariance estimation of low-dimensional sub-gaussian distributions. Indeed, the proof above actually yields

E ‖Σ_N − Σ‖ ≲ ‖Σ‖ ( √(r/N) + r/N ),

where

r = r(Σ^{1/2}) = tr Σ / ‖Σ‖

is the stable rank of Σ^{1/2}. This means that covariance estimation is possible with

N ∼ r

samples.

Matrix deviation inequality DAMC Lecture 15 May 29 - June 3 2020 14 / 22
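The stable rank r = tr Σ / ‖Σ‖ is small whenever the spectrum is dominated by a few directions, even in high ambient dimension. A tiny sketch, assuming numpy is available; the "spiked" spectrum below is an illustrative choice.

```python
import numpy as np

n = 1000
# A "spiked" covariance: one strong direction plus weak isotropic noise
eigvals = np.full(n, 0.01)
eigvals[0] = 10.0
Sigma = np.diag(eigvals)

# Stable rank of Sigma^{1/2}: trace over operator norm of Sigma
r = np.trace(Sigma) / np.linalg.norm(Sigma, 2)
print(n, round(r, 3))   # r is about 2, far below the ambient dimension n
```

So N ∼ r ≈ 2 samples (up to constants) suffice here, rather than N ∼ n = 1000.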
4 Underdetermined linear equations

Suppose we need to solve a severely underdetermined system of linear equations: say, we have m equations in n ≫ m variables,

Ax = y.

When the linear system is underdetermined, we cannot find x with any accuracy unless we know something extra about x. So let us assume that we do have some a priori information. We can describe this situation mathematically by assuming that

x ∈ K,

where K ⊂ Rⁿ is some known set that describes everything we know about x a priori.

Summarizing, here is the problem we are trying to solve. Determine a solution x̂ = x̂(A, y, K) to the underdetermined linear equation Ax = y as accurately as possible, assuming that x ∈ K.

Matrix deviation inequality DAMC Lecture 15 May 29 - June 3 2020 15 / 22
4.1 An optimization approach

We convert the set K into a function on Rⁿ called the Minkowski functional of K. This is basically a function whose level sets are multiples of K.

To define it formally, assume that K is star-shaped, which means that together with any point x, the set K contains the entire interval connecting x with the origin; see the figure for an illustration.

The set on the left (whose boundary is shown) is star-shaped; the set on the right is not.

Matrix deviation inequality DAMC Lecture 15 May 29 - June 3 2020 16 / 22
The Minkowski functional of K is defined as

‖x‖_K = inf{ t > 0 : x/t ∈ K },   x ∈ Rⁿ.

If the set K is convex and symmetric about the origin, then ‖x‖_K is actually a norm on Rⁿ. (Exercise)

Now we propose the following way to solve the recovery problem: solve the optimization program

min ‖x′‖_K subject to y = Ax′.

Note that this is a very natural program: it looks at all solutions of the equation y = Ax′ and tries to "shrink" the solution x′ toward K. (This is what minimization of the Minkowski functional is about.)

Also note that if K is convex, this is a convex optimization program, and thus it can be solved effectively by one of the many available numerical algorithms.

Matrix deviation inequality DAMC Lecture 15 May 29 - June 3 2020 17 / 22
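Since {t > 0 : x/t ∈ K} is an interval [‖x‖_K, ∞) for a star-shaped K with 0 in its interior, the Minkowski functional can be computed by bisection from nothing more than a membership test. A small sketch (plain Python); the ℓ₁-ball example is a sanity check, because there the gauge is exactly the ℓ₁ norm.

```python
def gauge(x, in_K, lo=1e-9, hi=1e9, iters=100):
    """Minkowski functional ||x||_K = inf{t > 0 : x/t in K}, by bisection.
    Assumes in_K is a membership test for a bounded star-shaped set K with 0
    in its interior, so {t > 0 : x/t in K} is an interval [||x||_K, infinity)."""
    for _ in range(iters):
        mid = (lo + hi) / 2
        if in_K([xi / mid for xi in x]):
            hi = mid          # x/mid lies in K, so the infimum is at most mid
        else:
            lo = mid
    return hi

# Sanity check: for K the unit l1 ball, the gauge is exactly the l1 norm
in_l1_ball = lambda v: sum(abs(vi) for vi in v) <= 1.0
x = [0.5, -1.5, 2.0]
print(round(gauge(x, in_l1_ball), 6))   # 4.0 = |0.5| + |-1.5| + |2.0|
```

A membership oracle is all the optimization program on this slide needs in principle; for convex K, specialized convex solvers are of course far more efficient.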
The main question we should now be asking is: does the solution of this program approximate the original vector x? The following result bounds the approximation error for a probabilistic model of linear equations.

Theorem 3 (Recovery by optimization)
Assume that A is an m × n random matrix whose rows A_i are independent, isotropic and sub-gaussian random vectors in Rⁿ, and that x ∈ K. The solution x̂ of the optimization program satisfies

E ‖x̂ − x‖₂ ≲ ω(K) / √m,

where ω(K) is the Gaussian width of K.

Proof. Both the original vector x and the solution x̂ are feasible for the optimization program. Then ‖x̂‖_K ≤ ‖x‖_K ≤ 1, so both x̂, x ∈ K.

Matrix deviation inequality DAMC Lecture 15 May 29 - June 3 2020 18 / 22
Since Ax̂ = Ax = y, we have A(x̂ − x) = 0. Let us apply the matrix deviation inequality with T = K − K. It gives

E sup_{u,v∈K} | ‖A(u − v)‖₂ − √m ‖u − v‖₂ | ≲ γ(K − K) = 2 ω(K).

Substitute u = x̂ and v = x here; we may do this since, as noted above, both vectors belong to K. But then the term ‖A(u − v)‖₂ equals zero, so it disappears from the bound, and we get

E √m ‖x̂ − x‖₂ ≲ ω(K).

Dividing both sides by √m completes the proof.

This theorem says that a signal x ∈ K can be efficiently recovered from

m ∼ ω(K)²

random linear measurements.

Matrix deviation inequality DAMC Lecture 15 May 29 - June 3 2020 19 / 22
5 Sparse recovery

Suppose we know that the signal x is sparse, which means that only a few coordinates of x are nonzero. As before, our task is to recover x from the random linear measurements given by the vector

y = Ax,

where A is an m × n random matrix.

The number of nonzero coefficients of a vector x ∈ Rⁿ, also called the sparsity of x, is often denoted ‖x‖₀. This is reminiscent of the notation for the ℓ_p norm ‖x‖_p = (∑_{i=1}^{n} |x_i|^p)^{1/p}, and for a reason: you can quickly check that

‖x‖₀ = lim_{p→0} ‖x‖_p^p.

Keep in mind that neither ‖x‖₀ nor ‖x‖_p for 0 < p < 1 is actually a norm on Rⁿ: ‖x‖₀ fails homogeneity, and ‖x‖_p for 0 < p < 1 fails the triangle inequality.

Matrix deviation inequality DAMC Lecture 15 May 29 - June 3 2020 20 / 22
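The limit ‖x‖₀ = lim_{p→0} ‖x‖_p^p is easy to see: each nonzero term |x_i|^p tends to 1 while zero terms stay 0. A quick plain-Python check on an illustrative vector:

```python
def lp_p(x, p):
    """Sum of |x_i|^p over nonzero entries, i.e. ||x||_p^p for p > 0."""
    return sum(abs(xi) ** p for xi in x if xi != 0)

x = [3.0, 0.0, -0.2, 0.0, 7.5]           # sparsity ||x||_0 = 3
for p in [1.0, 0.5, 0.1, 0.01, 0.001]:
    print(p, round(lp_p(x, p), 4))        # tends to 3 as p -> 0
```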
Our first attempt to recover x is to try the following optimization problem:

min ‖x′‖₀ subject to y = Ax′.

This is sensible because this program selects the sparsest feasible solution. But there is an implementation caveat: the function f(x) = ‖x‖₀ is highly non-convex and even discontinuous, and there is simply no known algorithm that solves this optimization problem efficiently.

To overcome this difficulty, we instead solve

min ‖x′‖₁ subject to y = Ax′.

This is a convexification of the non-convex program above, and a variety of numerical convex optimization methods are available to solve it efficiently.

We will now show that an s-sparse signal x ∈ Rⁿ can be efficiently recovered from m ∼ s log n random linear measurements.

Matrix deviation inequality DAMC Lecture 15 May 29 - June 3 2020 21 / 22
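One classical way to solve the ℓ₁ program is as a linear program: writing x′ = u − v with u, v ≥ 0 turns min ‖x′‖₁ s.t. Ax′ = y into min ∑(u + v) s.t. A(u − v) = y. A sketch under the assumption that numpy and scipy are available; the signal x0 below is an illustrative choice.

```python
import numpy as np
from scipy.optimize import linprog

def l1_minimize(A, y):
    """Solve min ||x||_1 s.t. Ax = y as an LP via the split x = u - v, u, v >= 0."""
    m, n = A.shape
    c = np.ones(2 * n)                        # objective: sum(u) + sum(v)
    A_eq = np.hstack([A, -A])                 # A u - A v = y
    res = linprog(c, A_eq=A_eq, b_eq=y, bounds=(0, None), method="highs")
    uv = res.x
    return uv[:n] - uv[n:]

rng = np.random.default_rng(7)
n, m, s = 60, 25, 3
A = rng.standard_normal((m, n))
x0 = np.zeros(n)
x0[:s] = [2.0, -1.0, 0.5]                     # an s-sparse signal (illustrative)
y = A @ x0

xhat = l1_minimize(A, y)
print(np.linalg.norm(xhat - x0))              # small when m is about s log n or more
```

The recovered x̂ is always feasible with ℓ₁ norm no larger than ‖x0‖₁; with m comfortably above s log n, recovery is typically exact.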
Theorem 4 (Sparse recovery by optimization)
Assume A is a random matrix as in Theorem 3. If an unknown vector x ∈ Rⁿ has at most s nonzero coordinates, i.e. ‖x‖₀ ≤ s, then the solution x̂ of the ℓ₁ optimization program satisfies

E ‖x̂ − x‖₂ ≲ √( (s log n) / m ) · ‖x‖₂.

Proof. The Cauchy–Schwarz inequality shows that ‖x‖₁ ≤ √s ‖x‖₂ for any s-sparse vector x. Denote the unit ball of the ℓ₁ norm in Rⁿ by B₁ⁿ. Then we can rewrite ‖x‖₁ ≤ √s ‖x‖₂ as the inclusion

x ∈ √s ‖x‖₂ · B₁ⁿ =: K.

For the Gaussian width of K we then have

ω(K) = √s ‖x‖₂ · ω(B₁ⁿ) ≤ √s ‖x‖₂ · γ(B₁ⁿ) ≲ √s ‖x‖₂ · √(log n).

Substituting this into Theorem 3 completes the proof.

Matrix deviation inequality DAMC Lecture 15 May 29 - June 3 2020 22 / 22
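The bound γ(B₁ⁿ) ≲ √(log n) used in the proof is easy to probe by Monte Carlo: sup_{x∈B₁ⁿ} ⟨g, x⟩ = ‖g‖_∞, so ω(B₁ⁿ) = E max_i |g_i|, which grows like √(2 log n). A sketch assuming numpy is available:

```python
import numpy as np

rng = np.random.default_rng(3)
n, trials = 1000, 4000

# sup over the l1 ball of <g, x> equals ||g||_inf, so the Gaussian width
# of B_1^n is E max_i |g_i|; estimate it by averaging over many draws of g
g = rng.standard_normal((trials, n))
width_est = np.abs(g).max(axis=1).mean()

print(round(width_est, 3), round(float(np.sqrt(2 * np.log(n))), 3))
# the estimate is of order sqrt(2 log n) (slightly below it for finite n)
```

So ω(B₁ⁿ) grows only logarithmically in the dimension, which is exactly why m ∼ s log n measurements suffice in Theorem 4.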
Let T = Σ12x isin Rn 983042x9830422 = 1 We can rewrite 983042ΣN minus Σ983042 as
983042ΣN minus Σ983042 = maxxisinT
〈RNxx〉 = maxxisinT
9830559830559830559830551
N
983131N
i=1〈Zix〉2 minus 983042x98304222
983055983055983055983055
=1
NmaxxisinT
983055983055983042Ax98304222 minusN983042x98304222983055983055
Now apply the matrix deviation inequality for squares to conclude that
E 983042ΣN minus Σ983042 ≲ 1
N
983059γ(T )2 +
radicN rad(T )γ(T )
983060
The radius and Gaussian width of the ellipsoid T are easy to compute
rad(T ) = 983042Σ98304212 and γ(T ) le (trΣ)12
By using tr(Σ) le n983042Σ983042 we have
E983042ΣN minus Σ983042 ≲ 983042Σ983042983061983157
n
N+
n
N
983062
Matrix deviation inequality DAMC Lecture 15 May 29 - June 3 2020 13 22
31 Low-dimensional distributions
We can show that much fewer samples are needed for covarianceestimation of low-dimensional sub-gaussian distributions Indeedthe proof actually yields
E983042ΣN minus Σ983042 ≲ 983042Σ983042983061983157
r
N+
r
N
983062
where
r = r(Σ12) =trΣ
983042Σ983042
is the stable rank of Σ12 This means that covariance estimationis possible with
N sim r
samples
Matrix deviation inequality DAMC Lecture 15 May 29 - June 3 2020 14 22
4 Underdetermined linear equations
Suppose we need to solve a severely underdetermined system oflinear equations say we have m equations in n ≫ m variables
Ax = y
When the linear system is underdetermined we can not find xwith any accuracy unless we know something extra about x Solet us assume that we do have some a-priori information We candescribe this situation mathematically by assuming that
x isin K
where K sub Rn is some known set in Rn that describes anythingthat we know about x a-priori
Summarizing here is the problem we are trying to solveDetermine a solution x = x(AyK) to the underdetermined linearequation Ax = y as accurately as possible assuming that x isin K
Matrix deviation inequality DAMC Lecture 15 May 29 - June 3 2020 15 22
41 An optimization approach
We convert the set K into a function on Rn which is called theMinkowski functional of K This is basically a function whose levelsets are multiples of K
To define it formally assume that K is star-shaped which meansthat together with any point x the set K must contain the entireinterval that connects x with the origin see the figure forillustration
The set on the left (whose boundary is shown) is star-shaped theset on the right is not
Matrix deviation inequality DAMC Lecture 15 May 29 - June 3 2020 16 22
The Minkowski functional of K is defined as
983042x983042K = inft gt 0 xt isin K x isin Rn
If the set K is convex and symmetric about the origin 983042x983042K isactually a norm on Rn (Exercise)
Now we propose the following way to solve the recovery problemsolve the optimization program
min 983042xprime983042K subject to y = Axprime
Note that this is a very natural program it looks at all solutions tothe equation y = Axprime and tries to ldquoshrinkrdquo the solution xprime towardK (This is what minimization of Minkowski functional is about)
Also note that if K is convex this is a convex optimizationprogram and thus can be solved effectively by one of the manyavailable numeric algorithms
Matrix deviation inequality DAMC Lecture 15 May 29 - June 3 2020 17 22
The main question we should now be asking is ndash would thesolution to this program approximate the original vector x Thefollowing result bounds the approximation error for a probabilisticmodel of linear equations
Theorem 3 (Recovery by optimization)
Assume that A is an mtimes n random matrix whose rows Ai areindependent isotropic and sub-gaussian random vectors in Rn Thesolution 983141x of the optimization problem satisfies
E983042983141xminus x9830422 ≲ω(K)radic
m
where ω(K) is the Gaussian width of K
Proof Both the original vector x and the solution 983141x are feasible vectorsfor the optimization program Then 983042983141x983042K le 983042x983042K le 1 Thus both983141xx isin K
Matrix deviation inequality DAMC Lecture 15 May 29 - June 3 2020 18 22
By A983141x = Ax = y we have A(983141xminus x) = 0 Let us apply matrixdeviation inequality for T = K minusK It gives
supuvisinK
983055983055983042A(uminus v)9830422 minusradicm983042uminus v9830422
983055983055 ≲ γ(T ) = 2ω(K)
Substitute u = 983141x and v = x here We may do this since as we notedabove both these vectors belong to K But then the term 983042A(uminus v)9830422will equal zero It disappears from the bound and we get
Eradicm983042983141xminus x9830422 ≲ ω(K)
Dividing both sides byradicm we complete the proof
This theorem says that a signal x isin K can be efficiently recovered from
m sim ω(K)2
random linear measurements
Matrix deviation inequality DAMC Lecture 15 May 29 - June 3 2020 19 22
5 Sparse recovery
Suppose we know that the signal x is sparse which means thatonly a few coordinates of x are nonzero As before our task is torecover x from the random linear measurements given by thevector
y = Ax
where A is an mtimes n random matrix
The number of nonzero coefficients of a vector x isin Rn or thesparsity of x is often denoted 983042x9830420 This is reminiscent of thenotation for the ℓp norm 983042x983042p = (
983123ni=1 |xi|p)1p and for a reason
You can quickly check that
983042x9830420 = limprarr0
983042x983042p
Keep in mind that neither 983042x9830420 nor 983042x983042p for 0 lt p lt 1 areactually norms on Rn since they fail triangle inequality
Matrix deviation inequality DAMC Lecture 15 May 29 - June 3 2020 20 22
Our first attempt to recover x is to try the following optimizationproblem
min 983042xprime9830420 subject to y = Axprime
This is sensible because this program selects the sparsest feasiblesolution But there is an implementation caveat the functionf(x) = 983042x9830420 is highly non-convex and even discontinuous There issimply no known algorithm to solve the optimization problemefficiently
To overcome this difficulty we use
min 983042xprime9830421 subject to y = Axprime
This is a convexification of the non-convex program and a varietyof numeric convex optimization methods are available to solve itefficiently
We will now show that an s-sparse signal x isin Rn can be efficientlyrecovered from m sim s log n random linear measurements
Matrix deviation inequality DAMC Lecture 15 May 29 - June 3 2020 21 22
Theorem 4 (Sparse recovery by optimization)
Assume A is a random matrix as in Theorem 3 If an unknown vectorx isin Rn has at most s non-zero coordinates ie 983042x9830420 le s then thesolution 983141x of the ℓ1 optimization program satisfies
E983042983141xminus x9830422 ≲983155
(s log n)m983042x9830422
Proof Cauchy-Schwarz inequality shows that 983042x9830421 leradics983042x9830422 Denote
the unit ball of the ℓ1 norm in Rn by Bn1 Then we can rewrite
983042x9830421 leradics983042x9830422 as the inclusion
x isinradics983042x9830422 middot Bn
1 = K
By the Gaussian width of Bn1 we have
ω(K) =radics983042x9830422 middot ω(Bn
1 ) leradics983042x9830422 middot γ(Bn
1 ) leradics983042x9830422 middot
983155log n
Substitute this in Theorem 3 and complete the proof
Matrix deviation inequality DAMC Lecture 15 May 29 - June 3 2020 22 22
31 Low-dimensional distributions
We can show that much fewer samples are needed for covarianceestimation of low-dimensional sub-gaussian distributions Indeedthe proof actually yields
E983042ΣN minus Σ983042 ≲ 983042Σ983042983061983157
r
N+
r
N
983062
where
r = r(Σ12) =trΣ
983042Σ983042
is the stable rank of Σ12 This means that covariance estimationis possible with
N sim r
samples
Matrix deviation inequality DAMC Lecture 15 May 29 - June 3 2020 14 22
4 Underdetermined linear equations
Suppose we need to solve a severely underdetermined system oflinear equations say we have m equations in n ≫ m variables
Ax = y
When the linear system is underdetermined we can not find xwith any accuracy unless we know something extra about x Solet us assume that we do have some a-priori information We candescribe this situation mathematically by assuming that
x isin K
where K sub Rn is some known set in Rn that describes anythingthat we know about x a-priori
Summarizing here is the problem we are trying to solveDetermine a solution x = x(AyK) to the underdetermined linearequation Ax = y as accurately as possible assuming that x isin K
Matrix deviation inequality DAMC Lecture 15 May 29 - June 3 2020 15 22
41 An optimization approach
We convert the set K into a function on Rn which is called theMinkowski functional of K This is basically a function whose levelsets are multiples of K
To define it formally assume that K is star-shaped which meansthat together with any point x the set K must contain the entireinterval that connects x with the origin see the figure forillustration
The set on the left (whose boundary is shown) is star-shaped theset on the right is not
Matrix deviation inequality DAMC Lecture 15 May 29 - June 3 2020 16 22
The Minkowski functional of K is defined as
983042x983042K = inft gt 0 xt isin K x isin Rn
If the set K is convex and symmetric about the origin 983042x983042K isactually a norm on Rn (Exercise)
Now we propose the following way to solve the recovery problemsolve the optimization program
min 983042xprime983042K subject to y = Axprime
Note that this is a very natural program it looks at all solutions tothe equation y = Axprime and tries to ldquoshrinkrdquo the solution xprime towardK (This is what minimization of Minkowski functional is about)
Also note that if K is convex this is a convex optimizationprogram and thus can be solved effectively by one of the manyavailable numeric algorithms
Matrix deviation inequality DAMC Lecture 15 May 29 - June 3 2020 17 22
The main question we should now be asking is ndash would thesolution to this program approximate the original vector x Thefollowing result bounds the approximation error for a probabilisticmodel of linear equations
Theorem 3 (Recovery by optimization)
Assume that A is an mtimes n random matrix whose rows Ai areindependent isotropic and sub-gaussian random vectors in Rn Thesolution 983141x of the optimization problem satisfies
E983042983141xminus x9830422 ≲ω(K)radic
m
where ω(K) is the Gaussian width of K
Proof Both the original vector x and the solution 983141x are feasible vectorsfor the optimization program Then 983042983141x983042K le 983042x983042K le 1 Thus both983141xx isin K
Matrix deviation inequality DAMC Lecture 15 May 29 - June 3 2020 18 22
By A983141x = Ax = y we have A(983141xminus x) = 0 Let us apply matrixdeviation inequality for T = K minusK It gives
supuvisinK
983055983055983042A(uminus v)9830422 minusradicm983042uminus v9830422
983055983055 ≲ γ(T ) = 2ω(K)
Substitute u = 983141x and v = x here We may do this since as we notedabove both these vectors belong to K But then the term 983042A(uminus v)9830422will equal zero It disappears from the bound and we get
Eradicm983042983141xminus x9830422 ≲ ω(K)
Dividing both sides byradicm we complete the proof
This theorem says that a signal x isin K can be efficiently recovered from
m sim ω(K)2
random linear measurements
Matrix deviation inequality DAMC Lecture 15 May 29 - June 3 2020 19 22
5 Sparse recovery
Suppose we know that the signal x is sparse which means thatonly a few coordinates of x are nonzero As before our task is torecover x from the random linear measurements given by thevector
y = Ax
where A is an mtimes n random matrix
The number of nonzero coefficients of a vector x isin Rn or thesparsity of x is often denoted 983042x9830420 This is reminiscent of thenotation for the ℓp norm 983042x983042p = (
983123ni=1 |xi|p)1p and for a reason
You can quickly check that
983042x9830420 = limprarr0
983042x983042p
Keep in mind that neither 983042x9830420 nor 983042x983042p for 0 lt p lt 1 areactually norms on Rn since they fail triangle inequality
Matrix deviation inequality DAMC Lecture 15 May 29 - June 3 2020 20 22
Our first attempt to recover x is to try the following optimizationproblem
min 983042xprime9830420 subject to y = Axprime
This is sensible because this program selects the sparsest feasiblesolution But there is an implementation caveat the functionf(x) = 983042x9830420 is highly non-convex and even discontinuous There issimply no known algorithm to solve the optimization problemefficiently
To overcome this difficulty we use
min 983042xprime9830421 subject to y = Axprime
This is a convexification of the non-convex program and a varietyof numeric convex optimization methods are available to solve itefficiently
We will now show that an s-sparse signal x isin Rn can be efficientlyrecovered from m sim s log n random linear measurements
Matrix deviation inequality DAMC Lecture 15 May 29 - June 3 2020 21 22
Theorem 4 (Sparse recovery by optimization)
Assume A is a random matrix as in Theorem 3 If an unknown vectorx isin Rn has at most s non-zero coordinates ie 983042x9830420 le s then thesolution 983141x of the ℓ1 optimization program satisfies
E983042983141xminus x9830422 ≲983155
(s log n)m983042x9830422
Proof Cauchy-Schwarz inequality shows that 983042x9830421 leradics983042x9830422 Denote
the unit ball of the ℓ1 norm in Rn by Bn1 Then we can rewrite
983042x9830421 leradics983042x9830422 as the inclusion
x isinradics983042x9830422 middot Bn
1 = K
By the Gaussian width of Bn1 we have
ω(K) =radics983042x9830422 middot ω(Bn
1 ) leradics983042x9830422 middot γ(Bn
1 ) leradics983042x9830422 middot
983155log n
Substitute this in Theorem 3 and complete the proof
Matrix deviation inequality DAMC Lecture 15 May 29 - June 3 2020 22 22
4 Underdetermined linear equations
Suppose we need to solve a severely underdetermined system oflinear equations say we have m equations in n ≫ m variables
Ax = y
When the linear system is underdetermined we can not find xwith any accuracy unless we know something extra about x Solet us assume that we do have some a-priori information We candescribe this situation mathematically by assuming that
x isin K
where K sub Rn is some known set in Rn that describes anythingthat we know about x a-priori
Summarizing here is the problem we are trying to solveDetermine a solution x = x(AyK) to the underdetermined linearequation Ax = y as accurately as possible assuming that x isin K
Matrix deviation inequality DAMC Lecture 15 May 29 - June 3 2020 15 22
41 An optimization approach
We convert the set K into a function on Rn which is called theMinkowski functional of K This is basically a function whose levelsets are multiples of K
To define it formally assume that K is star-shaped which meansthat together with any point x the set K must contain the entireinterval that connects x with the origin see the figure forillustration
The set on the left (whose boundary is shown) is star-shaped theset on the right is not
Matrix deviation inequality DAMC Lecture 15 May 29 - June 3 2020 16 22
The Minkowski functional of K is defined as
983042x983042K = inft gt 0 xt isin K x isin Rn
If the set K is convex and symmetric about the origin 983042x983042K isactually a norm on Rn (Exercise)
Now we propose the following way to solve the recovery problemsolve the optimization program
min 983042xprime983042K subject to y = Axprime
Note that this is a very natural program it looks at all solutions tothe equation y = Axprime and tries to ldquoshrinkrdquo the solution xprime towardK (This is what minimization of Minkowski functional is about)
Also note that if K is convex this is a convex optimizationprogram and thus can be solved effectively by one of the manyavailable numeric algorithms
Matrix deviation inequality DAMC Lecture 15 May 29 - June 3 2020 17 22
The main question we should now be asking is ndash would thesolution to this program approximate the original vector x Thefollowing result bounds the approximation error for a probabilisticmodel of linear equations
Theorem 3 (Recovery by optimization)
Assume that A is an mtimes n random matrix whose rows Ai areindependent isotropic and sub-gaussian random vectors in Rn Thesolution 983141x of the optimization problem satisfies
E983042983141xminus x9830422 ≲ω(K)radic
m
where ω(K) is the Gaussian width of K
Proof Both the original vector x and the solution 983141x are feasible vectorsfor the optimization program Then 983042983141x983042K le 983042x983042K le 1 Thus both983141xx isin K
Matrix deviation inequality DAMC Lecture 15 May 29 - June 3 2020 18 22
By A983141x = Ax = y we have A(983141xminus x) = 0 Let us apply matrixdeviation inequality for T = K minusK It gives
supuvisinK
983055983055983042A(uminus v)9830422 minusradicm983042uminus v9830422
983055983055 ≲ γ(T ) = 2ω(K)
Substitute u = 983141x and v = x here We may do this since as we notedabove both these vectors belong to K But then the term 983042A(uminus v)9830422will equal zero It disappears from the bound and we get
Eradicm983042983141xminus x9830422 ≲ ω(K)
Dividing both sides byradicm we complete the proof
This theorem says that a signal x isin K can be efficiently recovered from
m sim ω(K)2
random linear measurements
Matrix deviation inequality DAMC Lecture 15 May 29 - June 3 2020 19 22
5 Sparse recovery
Suppose we know that the signal x is sparse which means thatonly a few coordinates of x are nonzero As before our task is torecover x from the random linear measurements given by thevector
y = Ax
where A is an mtimes n random matrix
The number of nonzero coefficients of a vector x isin Rn or thesparsity of x is often denoted 983042x9830420 This is reminiscent of thenotation for the ℓp norm 983042x983042p = (
983123ni=1 |xi|p)1p and for a reason
You can quickly check that
983042x9830420 = limprarr0
983042x983042p
Keep in mind that neither 983042x9830420 nor 983042x983042p for 0 lt p lt 1 areactually norms on Rn since they fail triangle inequality
Matrix deviation inequality DAMC Lecture 15 May 29 - June 3 2020 20 22
Our first attempt to recover x is to try the following optimizationproblem
min 983042xprime9830420 subject to y = Axprime
This is sensible because this program selects the sparsest feasiblesolution But there is an implementation caveat the functionf(x) = 983042x9830420 is highly non-convex and even discontinuous There issimply no known algorithm to solve the optimization problemefficiently
To overcome this difficulty we use
min 983042xprime9830421 subject to y = Axprime
This is a convexification of the non-convex program and a varietyof numeric convex optimization methods are available to solve itefficiently
We will now show that an s-sparse signal x isin Rn can be efficientlyrecovered from m sim s log n random linear measurements
Matrix deviation inequality DAMC Lecture 15 May 29 - June 3 2020 21 22
Theorem 4 (Sparse recovery by optimization)
Assume A is a random matrix as in Theorem 3 If an unknown vectorx isin Rn has at most s non-zero coordinates ie 983042x9830420 le s then thesolution 983141x of the ℓ1 optimization program satisfies
E983042983141xminus x9830422 ≲983155
(s log n)m983042x9830422
Proof Cauchy-Schwarz inequality shows that 983042x9830421 leradics983042x9830422 Denote
the unit ball of the ℓ1 norm in Rn by Bn1 Then we can rewrite
983042x9830421 leradics983042x9830422 as the inclusion
x isinradics983042x9830422 middot Bn
1 = K
By the Gaussian width of Bn1 we have
ω(K) =radics983042x9830422 middot ω(Bn
1 ) leradics983042x9830422 middot γ(Bn
1 ) leradics983042x9830422 middot
983155log n
Substitute this in Theorem 3 and complete the proof
Matrix deviation inequality DAMC Lecture 15 May 29 - June 3 2020 22 22
41 An optimization approach
We convert the set K into a function on Rn which is called theMinkowski functional of K This is basically a function whose levelsets are multiples of K
To define it formally assume that K is star-shaped which meansthat together with any point x the set K must contain the entireinterval that connects x with the origin see the figure forillustration
The set on the left (whose boundary is shown) is star-shaped theset on the right is not
Matrix deviation inequality DAMC Lecture 15 May 29 - June 3 2020 16 22
The Minkowski functional of K is defined as
983042x983042K = inft gt 0 xt isin K x isin Rn
If the set K is convex and symmetric about the origin 983042x983042K isactually a norm on Rn (Exercise)
Now we propose the following way to solve the recovery problemsolve the optimization program
min 983042xprime983042K subject to y = Axprime
Note that this is a very natural program it looks at all solutions tothe equation y = Axprime and tries to ldquoshrinkrdquo the solution xprime towardK (This is what minimization of Minkowski functional is about)
Also note that if K is convex this is a convex optimizationprogram and thus can be solved effectively by one of the manyavailable numeric algorithms
Matrix deviation inequality DAMC Lecture 15 May 29 - June 3 2020 17 22
The main question we should now be asking is ndash would thesolution to this program approximate the original vector x Thefollowing result bounds the approximation error for a probabilisticmodel of linear equations
Theorem 3 (Recovery by optimization)
Assume that A is an mtimes n random matrix whose rows Ai areindependent isotropic and sub-gaussian random vectors in Rn Thesolution 983141x of the optimization problem satisfies
E983042983141xminus x9830422 ≲ω(K)radic
m
where ω(K) is the Gaussian width of K
Proof Both the original vector x and the solution 983141x are feasible vectorsfor the optimization program Then 983042983141x983042K le 983042x983042K le 1 Thus both983141xx isin K
Matrix deviation inequality DAMC Lecture 15 May 29 - June 3 2020 18 22
By A983141x = Ax = y we have A(983141xminus x) = 0 Let us apply matrixdeviation inequality for T = K minusK It gives
supuvisinK
983055983055983042A(uminus v)9830422 minusradicm983042uminus v9830422
983055983055 ≲ γ(T ) = 2ω(K)
Substitute u = 983141x and v = x here We may do this since as we notedabove both these vectors belong to K But then the term 983042A(uminus v)9830422will equal zero It disappears from the bound and we get
Eradicm983042983141xminus x9830422 ≲ ω(K)
Dividing both sides byradicm we complete the proof
This theorem says that a signal x isin K can be efficiently recovered from
m sim ω(K)2
random linear measurements
Matrix deviation inequality DAMC Lecture 15 May 29 - June 3 2020 19 22
5 Sparse recovery
Suppose we know that the signal x is sparse which means thatonly a few coordinates of x are nonzero As before our task is torecover x from the random linear measurements given by thevector
y = Ax
where A is an mtimes n random matrix
The number of nonzero coefficients of a vector x isin Rn or thesparsity of x is often denoted 983042x9830420 This is reminiscent of thenotation for the ℓp norm 983042x983042p = (
983123ni=1 |xi|p)1p and for a reason
You can quickly check that
983042x9830420 = limprarr0
983042x983042p
Keep in mind that neither 983042x9830420 nor 983042x983042p for 0 lt p lt 1 areactually norms on Rn since they fail triangle inequality
Matrix deviation inequality DAMC Lecture 15 May 29 - June 3 2020 20 22
Our first attempt to recover x is to try the following optimizationproblem
min 983042xprime9830420 subject to y = Axprime
This is sensible because this program selects the sparsest feasiblesolution But there is an implementation caveat the functionf(x) = 983042x9830420 is highly non-convex and even discontinuous There issimply no known algorithm to solve the optimization problemefficiently
To overcome this difficulty we use
min 983042xprime9830421 subject to y = Axprime
This is a convexification of the non-convex program and a varietyof numeric convex optimization methods are available to solve itefficiently
We will now show that an s-sparse signal x isin Rn can be efficientlyrecovered from m sim s log n random linear measurements
Matrix deviation inequality DAMC Lecture 15 May 29 - June 3 2020 21 22
Theorem 4 (Sparse recovery by optimization)
Assume A is a random matrix as in Theorem 3 If an unknown vectorx isin Rn has at most s non-zero coordinates ie 983042x9830420 le s then thesolution 983141x of the ℓ1 optimization program satisfies
E983042983141xminus x9830422 ≲983155
(s log n)m983042x9830422
Proof Cauchy-Schwarz inequality shows that 983042x9830421 leradics983042x9830422 Denote
the unit ball of the ℓ1 norm in Rn by Bn1 Then we can rewrite
983042x9830421 leradics983042x9830422 as the inclusion
x isinradics983042x9830422 middot Bn
1 = K
By the Gaussian width of Bn1 we have
ω(K) =radics983042x9830422 middot ω(Bn
1 ) leradics983042x9830422 middot γ(Bn
1 ) leradics983042x9830422 middot
983155log n
Substitute this in Theorem 3 and complete the proof
Matrix deviation inequality DAMC Lecture 15 May 29 - June 3 2020 22 22
The Minkowski functional of K is defined as

$\|x\|_K = \inf\{t > 0 : x/t \in K\}, \qquad x \in \mathbb{R}^n.$

If the set K is convex and symmetric about the origin, then $\|x\|_K$ is actually a norm on $\mathbb{R}^n$ (Exercise).

Now we propose the following way to solve the recovery problem: solve the optimization program

$\min \|x'\|_K \quad \text{subject to} \quad y = Ax'.$

Note that this is a very natural program: it looks at all solutions of the equation $y = Ax'$ and tries to "shrink" the solution $x'$ toward K. (This is what minimization of the Minkowski functional is about.)

Also note that if K is convex, this is a convex optimization program, and thus it can be solved effectively by one of the many available numeric algorithms.
Matrix deviation inequality DAMC Lecture 15 May 29 - June 3 2020 17 22
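To make the definition concrete, here is a small numerical check (not from the lecture; the function names are illustrative and NumPy is assumed). For a scaled $\ell_1$ ball $K = r \cdot B_1^n$ the Minkowski functional has the closed form $\|x\|_K = \|x\|_1 / r$, and a generic bisection on a membership oracle for K recovers the same value:

```python
import numpy as np

def minkowski_functional(x, in_K, t_max=1e6, iters=80):
    """Evaluate ||x||_K = inf{t > 0 : x/t in K} by bisection.

    Assumes K is star-shaped about the origin, so membership of x/t
    is monotone in t.  `in_K` is a membership oracle for K.
    """
    lo, hi = 0.0, t_max
    for _ in range(iters):
        mid = (lo + hi) / 2
        if in_K(x / mid):
            hi = mid          # x/mid lies in K, so the infimum is <= mid
        else:
            lo = mid
    return hi

r = 2.5
in_scaled_l1_ball = lambda v: np.sum(np.abs(v)) <= r   # K = r * B_1^n

x = np.array([1.0, -2.0, 0.5])
approx = minkowski_functional(x, in_scaled_l1_ball)
exact = np.sum(np.abs(x)) / r                          # ||x||_K = ||x||_1 / r
print(approx, exact)
```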
The main question we should now be asking is: would the solution of this program approximate the original vector x? The following result bounds the approximation error for a probabilistic model of linear equations.

Theorem 3 (Recovery by optimization)

Assume that A is an $m \times n$ random matrix whose rows $A_i$ are independent, isotropic and sub-gaussian random vectors in $\mathbb{R}^n$, and that the unknown vector x lies in a bounded set $K \subset \mathbb{R}^n$. The solution $\widehat{x}$ of the optimization program satisfies

$\mathbb{E}\|\widehat{x} - x\|_2 \lesssim \dfrac{\omega(K)}{\sqrt{m}},$

where $\omega(K)$ is the Gaussian width of K.

Proof. Both the original vector x and the solution $\widehat{x}$ are feasible vectors for the optimization program. Then $\|\widehat{x}\|_K \le \|x\|_K \le 1$. Thus both $\widehat{x}, x \in K$.
Matrix deviation inequality DAMC Lecture 15 May 29 - June 3 2020 18 22
Since $A\widehat{x} = Ax = y$, we have $A(\widehat{x} - x) = 0$. Let us apply the matrix deviation inequality for $T = K - K$. It gives

$\mathbb{E} \sup_{u, v \in K} \bigl| \|A(u - v)\|_2 - \sqrt{m}\,\|u - v\|_2 \bigr| \lesssim \gamma(K - K) = 2\omega(K).$

Substitute $u = \widehat{x}$ and $v = x$ here. We may do this since, as we noted above, both of these vectors belong to K. The term $\|A(u - v)\|_2$ then equals zero, so it disappears from the bound and we get

$\sqrt{m}\,\mathbb{E}\|\widehat{x} - x\|_2 \lesssim \omega(K).$

Dividing both sides by $\sqrt{m}$, we complete the proof.

This theorem says that a signal $x \in K$ can be efficiently recovered from

$m \sim \omega(K)^2$

random linear measurements.
Matrix deviation inequality DAMC Lecture 15 May 29 - June 3 2020 19 22
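The concentration driving this argument is easy to see in simulation. A minimal sketch, assuming NumPy and taking Gaussian rows as one admissible sub-gaussian model: for a fixed x, the values $\|Ax\|_2 / \sqrt{m}$ cluster tightly around $\|x\|_2$, with fluctuations that do not grow with m:

```python
import numpy as np

rng = np.random.default_rng(0)
m, n, trials = 400, 50, 200

x = rng.standard_normal(n)
ratios = []
for _ in range(trials):
    # rows of A are isotropic N(0, I_n) vectors, hence sub-gaussian
    A = rng.standard_normal((m, n))
    ratios.append(np.linalg.norm(A @ x) / np.sqrt(m))

ratios = np.array(ratios)
# ||Ax||_2 / sqrt(m) hovers around ||x||_2 with a small spread
print(np.linalg.norm(x), ratios.mean(), ratios.std())
```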
5 Sparse recovery
Suppose we know that the signal x is sparse, which means that only a few coordinates of x are nonzero. As before, our task is to recover x from the random linear measurements given by the vector

$y = Ax,$

where A is an $m \times n$ random matrix.

The number of nonzero coefficients of a vector $x \in \mathbb{R}^n$, or the sparsity of x, is often denoted $\|x\|_0$. This is reminiscent of the notation for the $\ell_p$ norm $\|x\|_p = \bigl(\sum_{i=1}^n |x_i|^p\bigr)^{1/p}$, and for a reason: you can quickly check that

$\|x\|_0 = \lim_{p \to 0^+} \|x\|_p^p.$

Keep in mind that neither $\|x\|_0$ nor $\|x\|_p$ for $0 < p < 1$ is actually a norm on $\mathbb{R}^n$: $\|x\|_0$ fails homogeneity, and $\|x\|_p$ fails the triangle inequality.
Matrix deviation inequality DAMC Lecture 15 May 29 - June 3 2020 20 22
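The limit identity $\|x\|_0 = \lim_{p \to 0^+} \|x\|_p^p$ can be checked numerically. A quick sketch (assuming NumPy, with the convention $0^p = 0$): each nonzero entry contributes $|x_i|^p \to 1$ as $p \to 0$, so the sum counts the nonzero coordinates.

```python
import numpy as np

x = np.array([0.0, 3.0, 0.0, -0.5, 2.0])   # sparsity ||x||_0 = 3

def lp_power(x, p):
    """sum_i |x_i|^p, i.e. ||x||_p^p, with the convention 0^p = 0."""
    nz = np.abs(x[x != 0])
    return np.sum(nz ** p)

for p in [1.0, 0.5, 0.1, 0.01]:
    print(p, lp_power(x, p))
# as p -> 0, ||x||_p^p -> number of nonzero coordinates = 3
```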
Our first attempt to recover x is to try the following optimization problem:

$\min \|x'\|_0 \quad \text{subject to} \quad y = Ax'.$

This is sensible because this program selects the sparsest feasible solution. But there is an implementation caveat: the function $f(x) = \|x\|_0$ is highly non-convex and even discontinuous. There is simply no known algorithm to solve this optimization problem efficiently.

To overcome this difficulty, we instead use

$\min \|x'\|_1 \quad \text{subject to} \quad y = Ax'.$

This is a convexification of the non-convex program, and a variety of numeric convex optimization methods are available to solve it efficiently.

We will now show that an s-sparse signal $x \in \mathbb{R}^n$ can be efficiently recovered from $m \sim s \log n$ random linear measurements.
Matrix deviation inequality DAMC Lecture 15 May 29 - June 3 2020 21 22
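The $\ell_1$ program is a linear program in disguise, so off-the-shelf LP solvers handle it. A minimal sketch (assuming NumPy and SciPy; the dimensions are illustrative) using the standard splitting $x' = u - v$ with $u, v \ge 0$, which turns $\min \|x'\|_1$ into the linear objective $\mathbf{1}^T(u + v)$:

```python
import numpy as np
from scipy.optimize import linprog

rng = np.random.default_rng(1)
n, s = 60, 3
m = 30                                   # roughly m ~ s log n measurements

x = np.zeros(n)
support = rng.choice(n, size=s, replace=False)
x[support] = rng.standard_normal(s)      # an s-sparse signal

A = rng.standard_normal((m, n))          # Gaussian (sub-gaussian) measurements
y = A @ x

# Basis pursuit as an LP: x' = u - v, u, v >= 0,
# min 1^T(u + v)  subject to  A(u - v) = y.
c = np.ones(2 * n)
A_eq = np.hstack([A, -A])
res = linprog(c, A_eq=A_eq, b_eq=y, bounds=[(0, None)] * (2 * n))
x_hat = res.x[:n] - res.x[n:]

print(np.linalg.norm(x_hat - x))         # small: exact recovery regime
```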
Theorem 4 (Sparse recovery by optimization)
Assume A is a random matrix as in Theorem 3. If an unknown vector $x \in \mathbb{R}^n$ has at most s non-zero coordinates, i.e. $\|x\|_0 \le s$, then the solution $\widehat{x}$ of the $\ell_1$ optimization program satisfies

$\mathbb{E}\|\widehat{x} - x\|_2 \lesssim \sqrt{\dfrac{s \log n}{m}}\;\|x\|_2.$

Proof. The Cauchy–Schwarz inequality shows that $\|x\|_1 \le \sqrt{s}\,\|x\|_2$ for any s-sparse x. Denote the unit ball of the $\ell_1$ norm in $\mathbb{R}^n$ by $B_1^n$. Then we can rewrite $\|x\|_1 \le \sqrt{s}\,\|x\|_2$ as the inclusion

$x \in \sqrt{s}\,\|x\|_2 \cdot B_1^n =: K.$

By the Gaussian width of $B_1^n$, we have

$\omega(K) = \sqrt{s}\,\|x\|_2 \cdot \omega(B_1^n) \le \sqrt{s}\,\|x\|_2 \cdot \gamma(B_1^n) \lesssim \sqrt{s}\,\|x\|_2 \cdot \sqrt{\log n}.$

Substitute this into Theorem 3 (for this K, minimizing $\|\cdot\|_K$ is the same as minimizing $\|\cdot\|_1$) and complete the proof.
Matrix deviation inequality DAMC Lecture 15 May 29 - June 3 2020 22 22
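The bound $\gamma(B_1^n) \lesssim \sqrt{\log n}$ used in the proof can be probed by Monte Carlo. Over the $\ell_1$ ball, $\sup_{x \in B_1^n} \langle g, x \rangle = \max_i |g_i|$, so the width is the expected maximum of n absolute Gaussians. A sketch assuming NumPy:

```python
import numpy as np

rng = np.random.default_rng(2)
n, trials = 2_000, 1_000

# sup over the l1 ball of <g, x> is the l_infinity norm of g,
# so omega(B_1^n) = E max_i |g_i|
g = rng.standard_normal((trials, n))
width_mc = np.abs(g).max(axis=1).mean()

print(width_mc, np.sqrt(2 * np.log(n)))   # same order: ~ sqrt(log n)
```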