Lecture 14: Covariance estimation and matrix completion
DAMC, May 27-29, 2020
1 Covariance estimation
Suppose we have a sample of data points $X_1, \ldots, X_N$ in $\mathbb{R}^n$. It is often reasonable to assume that these points are independently sampled from the same probability distribution (or "population"), which is unknown. We would like to learn something useful about this distribution.
Denote by $X$ a random vector with this (unknown) distribution. The most basic parameter of the distribution is the mean $\mathbb{E}X$. One can estimate $\mathbb{E}X$ from the sample by computing the sample mean $\frac{1}{N}\sum_{i=1}^{N} X_i$. The law of large numbers guarantees that the estimate becomes tight as the sample size $N$ grows to infinity. In other words,
$$\frac{1}{N}\sum_{i=1}^{N} X_i \to \mathbb{E}X \quad \text{as } N \to \infty.$$
The next most basic parameter of the distribution is the covariance matrix
$$\Sigma = \mathbb{E}(X - \mathbb{E}X)(X - \mathbb{E}X)^T.$$
The eigenvectors of the covariance matrix $\Sigma$ are called the principal components. Principal components that correspond to large eigenvalues of $\Sigma$ are the directions in which the distribution of $X$ is most extended; see the figure. This method is called Principal Component Analysis (PCA).
One can estimate the covariance matrix $\Sigma$ from the sample by computing the sample covariance
$$\Sigma_N = \frac{1}{N}\sum_{i=1}^{N}(X_i - \mathbb{E}X_i)(X_i - \mathbb{E}X_i)^T.$$
Again, the law of large numbers guarantees that the estimate becomes tight as the sample size $N$ grows to infinity, i.e.
$$\Sigma_N \to \Sigma \quad \text{as } N \to \infty.$$
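This convergence is easy to see numerically. The following is a minimal sketch (the dimension, distribution, and sample sizes are illustrative choices, and the mean-zero form $\frac{1}{N}\sum_i X_iX_i^T$ of the sample covariance is used):

```python
import numpy as np

# Sketch: estimate a known covariance matrix from samples (mean-zero case),
# illustrating the law-of-large-numbers convergence Sigma_N -> Sigma.
rng = np.random.default_rng(0)
n = 5
A = rng.standard_normal((n, n))
Sigma = A @ A.T / n                      # a fixed ground-truth covariance

def sample_covariance(N):
    # Draw N mean-zero samples; rows are X_1, ..., X_N
    X = rng.multivariate_normal(np.zeros(n), Sigma, size=N)
    return X.T @ X / N                   # (1/N) * sum of X_i X_i^T

def op_norm_error(N):
    # Operator-norm error of the sample covariance
    return np.linalg.norm(sample_covariance(N) - Sigma, 2)

e1 = op_norm_error(200)
e2 = op_norm_error(20000)
print(e1, e2)                            # error shrinks as N grows
```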
But how large should the sample size $N$ be for covariance estimation? Generally, one cannot have $N < n$, for dimension reasons. (Why?) We are going to show that
$$N \sim n\log n$$
is enough. In other words, covariance estimation is possible with just logarithmic oversampling.
For simplicity, we shall state the covariance estimation bound for mean zero distributions. (If the mean is not zero, we can estimate it from the sample and subtract. The mean can be accurately estimated from a sample of size $N = O(n)$.)
Theorem 1 (Covariance estimation)
Let $X$ be a random vector in $\mathbb{R}^n$ with covariance matrix $\Sigma$. Suppose that
$$\|X\|_2^2 \lesssim \mathbb{E}\|X\|_2^2 = \operatorname{tr}\Sigma \quad \text{almost surely.}$$
Then, for every $N \ge 1$, we have
$$\mathbb{E}\|\Sigma_N - \Sigma\| \lesssim \|\Sigma\|\left(\sqrt{\frac{n\log n}{N}} + \frac{n\log n}{N}\right).$$
Proof. Apply matrix Bernstein's inequality (Corollary 3) for the sum of independent random matrices $X_iX_i^T - \Sigma$ and get
$$\mathbb{E}\|\Sigma_N - \Sigma\| = \frac{1}{N}\,\mathbb{E}\left\|\sum_{i=1}^{N}\left(X_iX_i^T - \Sigma\right)\right\| \lesssim \frac{1}{N}\left(\sigma\sqrt{\log n} + K\log n\right),$$
where
$$\sigma^2 = \left\|\sum_{i=1}^{N}\mathbb{E}\left(X_iX_i^T - \Sigma\right)^2\right\| = N\left\|\mathbb{E}\left(XX^T - \Sigma\right)^2\right\|$$
and $K$ is chosen so that
$$\|XX^T - \Sigma\| \le K \quad \text{almost surely.}$$
It remains to bound $\sigma$ and $K$. Let us start with $\sigma$. We have
$$\mathbb{E}(XX^T - \Sigma)^2 = \mathbb{E}\|X\|_2^2\,XX^T - \Sigma^2 \preceq \operatorname{tr}(\Sigma)\cdot\mathbb{E}XX^T = \operatorname{tr}(\Sigma)\cdot\Sigma.$$
Thus $\sigma^2 \lesssim N\operatorname{tr}(\Sigma)\|\Sigma\|$. Next, to bound $K$, we have
$$\|XX^T - \Sigma\| \le \|X\|_2^2 + \|\Sigma\| \lesssim \operatorname{tr}(\Sigma) + \|\Sigma\| \le 2\operatorname{tr}(\Sigma) = K.$$
Therefore
$$\mathbb{E}\|\Sigma_N - \Sigma\| \lesssim \frac{1}{N}\left(\sqrt{N\operatorname{tr}(\Sigma)\|\Sigma\|\log n} + \operatorname{tr}(\Sigma)\log n\right).$$
The proof is completed by using $\operatorname{tr}(\Sigma) \le n\|\Sigma\|$.
1.1 Low-dimensional distributions
Far fewer samples are needed for covariance estimation for low-dimensional, or approximately low-dimensional, distributions. To measure approximate low-dimensionality, we can use the notion of the stable rank of $\Sigma^{1/2}$. The stable rank of a matrix $A$ is defined as the square of the ratio of the Frobenius to operator norms:
$$r(A) = \frac{\|A\|_F^2}{\|A\|_2^2} \le \operatorname{rank}(A).$$
The proof of Theorem 1 yields
$$\mathbb{E}\|\Sigma_N - \Sigma\| \lesssim \|\Sigma\|\left(\sqrt{\frac{r\log n}{N}} + \frac{r\log n}{N}\right),$$
where $r = r(\Sigma^{1/2}) = \operatorname{tr}(\Sigma)/\|\Sigma\|$. Therefore, covariance estimation is possible with $N \sim r\log n$ samples.
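The identity $r(\Sigma^{1/2}) = \operatorname{tr}(\Sigma)/\|\Sigma\|$ can be checked directly: $\|\Sigma^{1/2}\|_F^2 = \operatorname{tr}(\Sigma)$ and $\|\Sigma^{1/2}\|_2^2 = \|\Sigma\|$. A minimal numerical sketch (the test matrix is an illustrative choice):

```python
import numpy as np

# Sketch: stable rank r(A) = ||A||_F^2 / ||A||_2^2, and the identity
# r(Sigma^{1/2}) = tr(Sigma) / ||Sigma|| used in the text.
def stable_rank(A):
    """Square of the Frobenius-to-operator norm ratio; always <= rank(A)."""
    return np.linalg.norm(A, "fro") ** 2 / np.linalg.norm(A, 2) ** 2

rng = np.random.default_rng(1)
G = rng.standard_normal((6, 6))
Sigma = G @ G.T                          # a positive semidefinite matrix

# Sigma^{1/2} via eigendecomposition of the symmetric PSD matrix
w, V = np.linalg.eigh(Sigma)
Sigma_half = V @ np.diag(np.sqrt(np.clip(w, 0, None))) @ V.T

r1 = stable_rank(Sigma_half)
r2 = np.trace(Sigma) / np.linalg.norm(Sigma, 2)
print(r1, r2)                            # these two quantities coincide
```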
2 Norms of random matrices
Let $A_i$ denote the $i$-th row of $A$; we have (exercise)
$$\max_i \|A_i\|_2 \le \|A\| \le \sqrt{n}\,\max_i \|A_i\|_2.$$
For random matrices with independent entries, the bound can be improved to the point where the upper and lower bounds almost match.
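Both sides of this deterministic two-sided bound are easy to verify numerically; a minimal sketch (the matrix size is an illustrative choice):

```python
import numpy as np

# Sketch: check max_i ||A_i||_2 <= ||A|| <= sqrt(n) * max_i ||A_i||_2
# (operator norm vs. row norms) on a random matrix.
rng = np.random.default_rng(2)
n = 8
A = rng.standard_normal((n, n))

row_norms = np.linalg.norm(A, axis=1)    # ||A_i||_2 for each row i
op_norm = np.linalg.norm(A, 2)           # operator (spectral) norm

print(row_norms.max(), op_norm, np.sqrt(n) * row_norms.max())
```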
Theorem 2 (Norms of random matrices without boundedness assumptions)
Let $A$ be an $n\times n$ symmetric random matrix whose entries on and above the diagonal are independent, mean zero random variables. Then
$$\mathbb{E}\max_i\|A_i\|_2 \le \mathbb{E}\|A\| \le C\log n\cdot\mathbb{E}\max_i\|A_i\|_2,$$
where $A_i$ denote the rows of $A$.
Lemma 3 (Symmetrization)
Let $X_1, \ldots, X_N$ be independent, mean zero random vectors in a normed space and $\varepsilon_1, \ldots, \varepsilon_N$ be independent Rademacher random variables. Then
$$\frac{1}{2}\,\mathbb{E}\left\|\sum_{i=1}^{N}\varepsilon_iX_i\right\| \le \mathbb{E}\left\|\sum_{i=1}^{N}X_i\right\| \le 2\,\mathbb{E}\left\|\sum_{i=1}^{N}\varepsilon_iX_i\right\|.$$
Proof. To prove the upper bound, let $(X_i')$ be an independent copy of the random vectors $(X_i)$, i.e. just different random vectors with the same joint distribution as $(X_i)$ and independent from $(X_i)$. Then
$$\mathbb{E}\left\|\sum_i X_i\right\| = \mathbb{E}\left\|\sum_i X_i - \mathbb{E}\left(\sum_i X_i'\right)\right\| \le \mathbb{E}\left\|\sum_i X_i - \sum_i X_i'\right\| = \mathbb{E}\left\|\sum_i\left(X_i - X_i'\right)\right\|.$$
The distribution of the random vectors $Y_i = X_i - X_i'$ is symmetric, which means that the distributions of $Y_i$ and $-Y_i$ are the same. (Why?) Thus the distribution of the random vectors $Y_i$ and $\varepsilon_iY_i$ is also the same, for all we do is change the signs of these vectors at random and independently of the values of the vectors. Summarizing, we can replace $X_i - X_i'$ in the sum above with $\varepsilon_i(X_i - X_i')$. Thus
$$\mathbb{E}\left\|\sum_i X_i\right\| \le \mathbb{E}\left\|\sum_i \varepsilon_i(X_i - X_i')\right\| \le \mathbb{E}\left\|\sum_i \varepsilon_iX_i\right\| + \mathbb{E}\left\|\sum_i \varepsilon_iX_i'\right\| = 2\,\mathbb{E}\left\|\sum_i \varepsilon_iX_i\right\|.$$
This proves the upper bound in the symmetrization inequality. The lower bound can be proved by a similar argument. (Do this!)
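A quick Monte Carlo sanity check of the symmetrization inequality for the Euclidean norm (the distribution of the $X_i$, the dimensions, and the number of trials are illustrative choices):

```python
import numpy as np

# Sketch: estimate E||sum X_i|| and E||sum eps_i X_i|| by Monte Carlo and
# verify (1/2) E||sum eps_i X_i|| <= E||sum X_i|| <= 2 E||sum eps_i X_i||.
rng = np.random.default_rng(3)
N, n, trials = 10, 4, 2000

sym_norms, sum_norms = [], []
for _ in range(trials):
    X = rng.standard_normal((N, n))          # mean-zero vectors X_1..X_N
    eps = rng.choice([-1.0, 1.0], size=N)    # Rademacher signs
    sum_norms.append(np.linalg.norm(X.sum(axis=0)))
    sym_norms.append(np.linalg.norm((eps[:, None] * X).sum(axis=0)))

E_sym = float(np.mean(sym_norms))
E_sum = float(np.mean(sum_norms))
print(0.5 * E_sym, E_sum, 2 * E_sym)     # middle value sits between the ends
```

For symmetric distributions such as the Gaussian here, $\sum_i X_i$ and $\sum_i \varepsilon_iX_i$ are actually equal in distribution, so the two estimates nearly coincide and the inequality holds with room to spare.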
Proof of Theorem 2. The lower bound is trivial. The proof of the upper bound will be based on matrix Bernstein's inequality.
We represent $A$ as a sum of independent, mean zero, symmetric random matrices $Z_{ij}$, each of which contains a pair of symmetric entries of $A$ (or one diagonal entry):
$$A = \sum_{i\le j} Z_{ij}.$$
By the symmetrization inequality (Lemma 3) for the random matrices $Z_{ij}$, we get
$$\mathbb{E}\|A\| = \mathbb{E}\left\|\sum_{i\le j}Z_{ij}\right\| \le 2\,\mathbb{E}\left\|\sum_{i\le j}X_{ij}\right\|,$$
where we set $X_{ij} = \varepsilon_{ij}Z_{ij}$ and $\varepsilon_{ij}$ are independent Rademacher random variables. Now we condition on $A$. The random variables $Z_{ij}$ become fixed values and all randomness remains in the Rademacher random variables $\varepsilon_{ij}$.
Note that $X_{ij}$ are (conditionally) bounded almost surely, and this is exactly what we have lacked to apply matrix Bernstein's inequality. Now we can do it. The corollary of matrix Bernstein's inequality gives
$$\mathbb{E}_\varepsilon\left\|\sum_{i\le j}X_{ij}\right\| \lesssim \sigma\sqrt{\log n} + K\log n,$$
where $\sigma^2 = \left\|\sum_{i\le j}\mathbb{E}_\varepsilon X_{ij}^2\right\|$ and $K = \max_{i\le j}\|X_{ij}\|$. A good exercise is to check that
$$\sigma \lesssim \max_i\|A_i\|_2 \quad \text{and} \quad K \lesssim \max_i\|A_i\|_2.$$
Then we have
$$\mathbb{E}_\varepsilon\left\|\sum_{i\le j}X_{ij}\right\| \lesssim \log n\cdot\max_i\|A_i\|_2.$$
Finally, we unfix $A$ by taking expectation of both sides of this inequality with respect to $A$ and using the law of total expectation.
We state Theorem 2 for symmetric matrices, but it is simple to extend it to general $m\times n$ random matrices $A$. The bound in this case becomes
$$\mathbb{E}\|A\| \le C\log(m+n)\cdot\left(\mathbb{E}\max_i\|A_i\|_2 + \mathbb{E}\max_j\|A^j\|_2\right),$$
where $A_i$ and $A^j$ denote the rows and columns of $A$. To see this, apply Theorem 2 to the $(m+n)\times(m+n)$ symmetric random matrix
$$\begin{bmatrix} 0 & A \\ A^T & 0 \end{bmatrix}.$$
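The dilation trick works because the symmetric matrix $B = \begin{bmatrix} 0 & A \\ A^T & 0 \end{bmatrix}$ has the same operator norm as $A$ (its eigenvalues are $\pm$ the singular values of $A$). A minimal numerical check (the matrix shape is an illustrative choice):

```python
import numpy as np

# Sketch: the symmetric dilation B = [[0, A], [A^T, 0]] of an m x n matrix A
# satisfies ||B|| = ||A|| in operator norm, so results for symmetric
# matrices transfer to rectangular ones.
rng = np.random.default_rng(4)
m, n = 3, 5
A = rng.standard_normal((m, n))

B = np.block([[np.zeros((m, m)), A],
              [A.T, np.zeros((n, n))]])

na = np.linalg.norm(A, 2)
nb = np.linalg.norm(B, 2)
print(na, nb)                            # the two norms agree
```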
3 Matrix completion
Consider a fixed, unknown $n\times n$ matrix $X$. Suppose we are shown $m$ randomly chosen entries of $X$. Can we guess all the missing entries? This important problem is called matrix completion. We will analyze it using the bounds on the norms of random matrices we just obtained.
Obviously, there is no way to guess the missing entries unless we know something extra about the matrix $X$. So let us assume that $X$ has low rank:
$$\operatorname{rank}(X) = r \ll n.$$
The number of degrees of freedom of an $n\times n$ matrix with rank $r$ is $O(rn)$. (Why?) So we may hope that $m \sim rn$ observed entries of $X$ will be enough to determine $X$ completely. But how?

Here we will analyze what is probably the simplest method for matrix completion. Take the matrix $Y$ that consists of the observed entries of $X$, while all unobserved entries are set to zero. Unlike $X$, the matrix $Y$ may not have small rank. Compute the best rank $r$ approximation of $Y$. The result, as we will show, will be a good approximation to $X$.
But before we show this, let us define sampling of entries more rigorously. Assume each entry of $X$ is shown or hidden independently of the others with fixed probability $p$. Which entries are shown is decided by independent Bernoulli random variables
$$\delta_{ij} \sim \mathrm{Ber}(p) \quad \text{with} \quad p = \frac{m}{n^2},$$
which are often called selectors in this context. The value of $p$ is chosen so that, among the $n^2$ entries of $X$, the expected number of selected (known) entries is $m$.

Define the $n\times n$ matrix $Y$ with entries $Y_{ij} = \delta_{ij}X_{ij}$. We can assume that we are shown $Y$, for it is a matrix that contains the observed entries of $X$ while all unobserved entries are replaced with zeros. The following result shows how to estimate $X$ based on $Y$.
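The method described above (zero-fill, rescale by $p^{-1}$, truncate to rank $r$) can be sketched as follows. The dimension, rank, and sample size are illustrative choices, and a truncated SVD plays the role of the best rank-$r$ approximation:

```python
import numpy as np

# Sketch: observe entries of a low-rank X through Bernoulli selectors,
# rescale by 1/p, and take the best rank-r approximation via truncated SVD.
rng = np.random.default_rng(5)
n, r = 60, 2
# A rank-r ground-truth matrix X
X = rng.standard_normal((n, r)) @ rng.standard_normal((r, n))

m = 10 * r * n                     # sample size, somewhat above the rn ideal
p = m / n**2                       # selector probability p = m / n^2
delta = rng.random((n, n)) < p     # delta_ij ~ Ber(p)
Y = delta * X                      # observed entries; zeros elsewhere

# Best rank-r approximation of p^{-1} Y via truncated SVD
U, s, Vt = np.linalg.svd(Y / p)
X_hat = (U[:, :r] * s[:r]) @ Vt[:r]

err_per_entry = np.linalg.norm(X_hat - X, "fro") / n
print(err_per_entry, np.abs(X).max())  # average error vs. max entry magnitude
```

Even at this moderate sample size, the average error per entry comes out below the maximum entry magnitude, in the spirit of Theorem 4 below.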
Theorem 4 (Matrix completion)
Let $\hat{X}$ be a best rank $r$ approximation to $p^{-1}Y$. Then
$$\mathbb{E}\,\frac{1}{n}\|\hat{X} - X\|_F \le C\log n\,\sqrt{\frac{rn}{m}}\,\|X\|_{\max}.$$
Here $\|X\|_{\max} = \max_{ij}|X_{ij}|$ denotes the maximum magnitude of the entries of $X$.
Remark. This theorem controls the average error per entry in the mean-squared sense. To make the error small, let us assume that we have a sample of size $m \gg rn\log^2 n$, which is slightly larger than the ideal size $m \sim rn$. This makes $C\log n\sqrt{rn/m} = o(1)$ and forces the recovery error to be bounded by $o(1)\|X\|_{\max}$. Summarizing, Theorem 4 says that the expected average error per entry is much smaller than the maximal magnitude of the entries of $X$. This is true for a sample of almost optimal size $m$. The smaller the rank $r$ of the matrix $X$, the fewer entries of $X$ we need to see in order to do matrix completion.
Proof of Theorem 4.
Step 1: The error in the operator norm. Let us first bound the recovery error in the operator norm. Decompose the error into two parts using the triangle inequality:
$$\|\hat{X} - X\| \le \|\hat{X} - p^{-1}Y\| + \|p^{-1}Y - X\|.$$
Recall that $\hat{X}$ is a best approximation to $p^{-1}Y$. Then the first part of the error is smaller than the second part, and we have
$$\|\hat{X} - X\| \le 2\|p^{-1}Y - X\| = \frac{2}{p}\|Y - pX\|.$$
The entries of the matrix $Y - pX$,
$$(Y - pX)_{ij} = (\delta_{ij} - p)X_{ij},$$
are independent, mean zero random variables.
We have
$$\mathbb{E}\|Y - pX\| \le C\log n\cdot\left(\mathbb{E}\max_i\|(Y - pX)_i\|_2 + \mathbb{E}\max_j\|(Y - pX)^j\|_2\right).$$
All that remains is to bound the norms of the rows and columns of $Y - pX$. This is not difficult if we note that they can be expressed as sums of independent random variables:
$$\|(Y - pX)_i\|_2^2 = \sum_{j=1}^{n}(\delta_{ij} - p)^2X_{ij}^2 \le \sum_{j=1}^{n}(\delta_{ij} - p)^2\cdot\|X\|_{\max}^2,$$
and similarly for columns. Taking expectation and noting that
$$\mathbb{E}(\delta_{ij} - p)^2 = \operatorname{Var}(\delta_{ij}) = p(1 - p),$$
we get
$$\mathbb{E}\|(Y - pX)_i\|_2 \le \left[\mathbb{E}\|(Y - pX)_i\|_2^2\right]^{1/2} \le \sqrt{pn}\,\|X\|_{\max}.$$
This is a good bound, but we need something stronger. Since the maximum appears inside the expectation, we need a uniform bound, which will say that all rows are bounded simultaneously with high probability. Such uniform bounds are usually proved by applying concentration inequalities followed by a union bound. Bernstein's inequality (145) yields (check!)
$$\mathbb{P}\left\{\sum_{j=1}^{n}(\delta_{ij} - p)^2 > tpn\right\} \le \exp(-ctpn) \quad \text{for } t \ge 3.$$
This probability can be further bounded by $n^{-ct}$ using the assumption that $m = pn^2 \ge n\log n$. A union bound over $n$ rows leads to
$$\mathbb{P}\left\{\max_{i\in[n]}\sum_{j=1}^{n}(\delta_{ij} - p)^2 > tpn\right\} \le n\cdot n^{-ct} \quad \text{for } t \ge 3.$$
Integrating this tail, we have
$$\mathbb{E}\max_{i\in[n]}\sum_{j=1}^{n}(\delta_{ij} - p)^2 \lesssim pn.$$
And this yields the desired bound on the rows:
$$\mathbb{E}\max_{i\in[n]}\|(Y - pX)_i\|_2 \lesssim \sqrt{pn}\,\|X\|_{\max}.$$
We can do similarly for the columns. Then
$$\mathbb{E}\|Y - pX\| \lesssim \log n\,\sqrt{pn}\,\|X\|_{\max}.$$
Therefore, we get
$$\mathbb{E}\|\hat{X} - X\| \lesssim \log n\,\sqrt{\frac{n}{p}}\,\|X\|_{\max}.$$
Step 2: Passing to the Frobenius norm. We know that $\operatorname{rank}(X) \le r$ by assumption and $\operatorname{rank}(\hat{X}) \le r$ by construction, so $\operatorname{rank}(\hat{X} - X) \le 2r$. There is a simple relationship between the operator and Frobenius norms:
$$\|\hat{X} - X\|_F \le \sqrt{2r}\,\|\hat{X} - X\|.$$
Taking expectation of both sides, we get
$$\mathbb{E}\|\hat{X} - X\|_F \le \sqrt{2r}\,\mathbb{E}\|\hat{X} - X\| \lesssim \log n\,\sqrt{\frac{rn}{p}}\,\|X\|_{\max}.$$
Dividing both sides by $n$, we can rewrite this bound as
$$\mathbb{E}\,\frac{1}{n}\|\hat{X} - X\|_F \lesssim \log n\,\sqrt{\frac{rn}{pn^2}}\,\|X\|_{\max}.$$
The proof is completed by noting the definition of the sampling probability, $p = m/n^2$.
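The norm comparison used in Step 2 is the general fact $\|M\|_F \le \sqrt{\operatorname{rank}(M)}\,\|M\|$, since $\|M\|_F^2$ is a sum of at most $\operatorname{rank}(M)$ squared singular values, each at most $\|M\|^2$. A minimal numerical check (the matrix size and rank are illustrative choices):

```python
import numpy as np

# Sketch: check ||M||_F <= sqrt(rank(M)) * ||M|| on a random low-rank matrix.
rng = np.random.default_rng(6)
n, k = 30, 4
M = rng.standard_normal((n, k)) @ rng.standard_normal((k, n))  # rank <= k

fro = np.linalg.norm(M, "fro")
op = np.linalg.norm(M, 2)
rank = np.linalg.matrix_rank(M)

print(fro, np.sqrt(rank) * op)   # Frobenius norm dominated by sqrt(rank)*op
```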
1 Covariance estimation
Suppose we have a sample of data points X1 XN in Rn It isoften reasonable to assume that these points are independentlysampled from the same probability distribution (or ldquopopulationrdquo)which is unknown We would like to learn something useful aboutthis distribution
Denote by X a random vector with this (unknown) distributionThe most basic parameter of the distribution is the mean EXOne can estimate EX from the sample by computing the samplemean
983123Ni=1XiN The law of large numbers guarantees that the
estimate becomes tight as the sample size N grows to infinity Inother words
1
N
N983131
i=1
Xi rarr EX as N rarr infin
Covariance estimation and MC DAMC Lecture 14 May 27 - 29 2020 2 22
The next most basic parameter of the distribution is thecovariance matrix
Σ = E(X minus EX)(X minus EX)T
The eigenvectors of the covariance matrix Σ are called theprincipal components Principal components that correspond tolarge eigenvalues of Σ are the directions in which the distributionof X is most extended see the figure
This method is called Principal Component Analysis (PCA)
Covariance estimation and MC DAMC Lecture 14 May 27 - 29 2020 3 22
One can estimate the covariance matrix Σ from the sample bycomputing the sample covariance
ΣN =1
N
N983131
i=1
(Xi minus EXi)(Xi minus EXi)T
Again the law of large numbers guarantees that the estimatebecomes tight as the sample size N grows to infinity ie
ΣN rarr Σ as N rarr infin
But how large should the sample size N be for covarianceestimation Generally one can not have N lt n for dimensionreasons (Why) We are going to show that
N sim n log n
is enough In other words covariance estimation is possible withjust logarithmic oversampling
Covariance estimation and MC DAMC Lecture 14 May 27 - 29 2020 4 22
For simplicity we shall state the covariance estimation bound formean zero distributions (If the mean is not zero we can estimateit from the sample and subtract The mean can be accuratelyestimated from a sample of size N = O(n))
Theorem 1 (Covariance estimation)
Let X be a random vector in Rn with covariance matrix Σ Supposethat
983042X98304222 ≲ E983042X98304222 = trΣ almost surely
Then for every N ge 1 we have
E 983042ΣN minus Σ983042 ≲ 983042Σ983042983075983157
n log n
N+
n log n
N
983076
Covariance estimation and MC DAMC Lecture 14 May 27 - 29 2020 5 22
Proof Apply matrix Bernsteinrsquos inequality corollary 3 for the sum ofindependent random matrices XiX
Ti minus Σ and get
E 983042ΣN minus Σ983042 =1
NE
983056983056983056983056983056
N983131
i=1
983043XiX
Ti minus Σ
983044983056983056983056983056983056
≲ 1
N(σ983155
log n+K log n)
where
σ2 =
983056983056983056983056983056
N983131
i=1
E983043XiX
Ti minus Σ
9830442983056983056983056983056983056 = N
983056983056983056E983043XXT minus Σ
9830442983056983056983056
and K is chosen so that
983042XXT minus Σ983042 le K almost surely
Covariance estimation and MC DAMC Lecture 14 May 27 - 29 2020 6 22
It remains to bound σ and K Let us start with σ We have
E(XXT minus Σ)2 = E983042X98304222XXT minus Σ2
≾ tr(Σ) middot EXXT
= tr(Σ) middot Σ
Thus σ2 ≲ Ntr(Σ)983042Σ983042 Next to bound K we have
983042XXT minus Σ983042 le 983042X98304222 + 983042Σ983042≲ tr(Σ) + 983042Σ983042le 2tr(Σ) = K
Therefore
E 983042ΣN minus Σ983042 ≲ 1
N(983155
Ntr(Σ)983042Σ983042 log n+ tr(Σ) log n)
The proof is completed by using tr(Σ) le n983042Σ983042
Covariance estimation and MC DAMC Lecture 14 May 27 - 29 2020 7 22
11 Low-dimensional distributions
Far fewer samples are needed for covariance estimation forlow-dimensional or approximately low-dimensional distributionsTo measure approximate low-dimensionality we can use the notionof the stable rank of Σ2 The stable rank of a matrix A is definedas the square of the ratio of the Frobenius to operator norms
r(A) =983042A9830422F983042A98304222
le rank(A)
The proof of Theorem 1 yields
E 983042ΣN minus Σ983042 983249 983042Σ983042983075983157
r log n
N+
r log n
N
983076
where r = r(Σ12) = tr(Σ)983042Σ983042 Therefore covariance estimationis possible with N sim r log n samples
Covariance estimation and MC DAMC Lecture 14 May 27 - 29 2020 8 22
2 Norms of random matrices
Let Ai denote the ith row of A we have (exercise)
maxi
983042Ai9830422 le 983042A9830422 leradicnmax
i983042Ai9830422
For random matrices with independent entries the bound can beimproved to the point where the upper and lower bounds almostmatch
Theorem 2 (Norms of random matrices without boundednessassumptions)
Let A be an ntimes n symmetric random matrix whose entries on andabove the diagonal are independent mean zero random variables Then
Emaxi
983042Ai9830422 le E983042A9830422 le C log n middot Emaxi
983042Ai9830422
where Ai denote the rows of A
Covariance estimation and MC DAMC Lecture 14 May 27 - 29 2020 9 22
Lemma 3 (Symmetrization)
Let X1 XN be independent mean zero random vectors in a normedspace and ε1 εN be independent Rademacher random variablesThen
1
2E
983056983056983056983056983056
N983131
i=1
εiXi
983056983056983056983056983056 983249 E
983056983056983056983056983056
N983131
i=1
Xi
983056983056983056983056983056 983249 2E
983056983056983056983056983056
N983131
i=1
εiXi
983056983056983056983056983056
Proof To prove the upper bound let (X primei) be an independent copy of
the random vectors (Xi) ie just different random vectors with thesame joint distribution as (Xi) and independent from (Xi) Then
E
983056983056983056983056983056983131
i
Xi
983056983056983056983056983056 = E
983056983056983056983056983056983131
i
Xi minus E
983075983131
i
X primei
983076983056983056983056983056983056
983249 E
983056983056983056983056983056983131
i
Xi minus983131
i
X primei
983056983056983056983056983056 = E
983056983056983056983056983056983131
i
983043Xi minusX prime
i
983044983056983056983056983056983056
Covariance estimation and MC DAMC Lecture 14 May 27 - 29 2020 10 22
The distribution of the random vectors Yi = Xi minusX primei is symmetric
which means that the distributions of Yi and minusYi are the same(Why) Thus the distribution of the random vectors Yi and εiYi is alsothe same for all we do is change the signs of these vectors at randomand independently of the values of the vectors Summarizing we canreplace Xi minusX prime
i in the sum above with εi(Xi minusX primei) Thus
E
983056983056983056983056983056983131
i
Xi
983056983056983056983056983056 le E
983056983056983056983056983056983131
i
εi(Xi minusX primei)
983056983056983056983056983056
le E
983056983056983056983056983056983131
i
εiXi
983056983056983056983056983056+ E
983056983056983056983056983056983131
i
εiXprimei
983056983056983056983056983056
= 2E
983056983056983056983056983056983131
i
εiXi
983056983056983056983056983056
This proves the upper bound in the symmetrization inequality Thelower bound can be proved by a similar argument (Do this)
Covariance estimation and MC DAMC Lecture 14 May 27 - 29 2020 11 22
Proof of Theorem 2 The lower bound is trivial The proof of the upperbound will be based on matrix Bernsteinrsquos inequalityWe represent A as a sum of independent mean zero symmetricrandom matrices Zij each of which contains a pair of symmetric entriesof A (or one diagonal entry)
A =983131
ilejZij
By the symmetrization inequality (Lemma 3) for the random matricesZij we get
E983042A983042 = E983056983056983056983131
i983249jZij
983056983056983056 983249 2E983056983056983056983131
i983249jXij
983056983056983056
where we set Xij = εijZij and εij are independent Rademacherrandom variables Now we condition on A The random variables Zij
become fixed values and all randomness remains in the Rademacherrandom variables εij
Covariance estimation and MC DAMC Lecture 14 May 27 - 29 2020 12 22
Note that Xij are (conditionally) bounded almost surely and this isexactly what we have lacked to apply matrix Bernsteinrsquos inequalityNow we can do it The corollary of matrix Bernsteinrsquos inequality gives
Eε
983056983056983056983131
ilejXij
983056983056983056 ≲ σ983155
log n+K log n
where σ2 = 983042983123
ilej EεX2ij983042 and K = maxilej 983042Xij983042 A good exercise is
to check that
σ ≲ maxi
983042Ai9830422 and K ≲ maxi
983042Ai9830422
Then we have
Eε
983056983056983056983131
i983249jXij
983056983056983056 ≲ log n middotmaxi
983042Ai9830422
Finally we unfix A by taking expectation of both sides of thisinequality with respect to A and using the law of total expectation
Covariance estimation and MC DAMC Lecture 14 May 27 - 29 2020 13 22
We state Theorem 2 for symmetric matrices but it is simple toextend it to general mtimes n random matrices A The bound in thiscase becomes
E983042A9830422 le C log(m+ n) middot (Emaxi
983042Ai9830422 + Emaxj
983042Aj9830422)
To see this apply Theorem 2 to the (m+ n)times (m+ n) symmetricrandom matrix 983063
0 AAT 0
983064
3 Matrix completion
Consider a fixed unknown ntimes n matrix X Suppose we are shownm randomly chosen entries of X Can we guess all the missingentries This important problem is called matrix completion Wewill analyze it using the bounds on the norms on random matriceswe just obtained
Covariance estimation and MC DAMC Lecture 14 May 27 - 29 2020 14 22
Obviously there is no way to guess the missing entries unless weknow something extra about the matrix X So let us assume thatX has low rank
rank(X) = r ≪ n
The number of degrees of freedom of an ntimes n matrix with rank ris O(rn) (Why) So we may hope that m sim rn observed entriesof X will be enough to determine X completely But how
Here we will analyze what is probably the simplest method formatrix completion Take the matrix Y that consists of theobserved entries of X while all unobserved entries are set to zeroUnlike X the matrix Y may not have small rank Compute thebest rank r approximation of Y The result as we will show willbe a good approximation to X
Covariance estimation and MC DAMC Lecture 14 May 27 - 29 2020 15 22
But before we show this let us define sampling of entries morerigorously Assume each entry of X is shown or hiddenindependently of others with fixed probability p Which entries areshown is decided by independent Bernoulli random variables
δij sim Ber(p) with p =m
n2
which are often called selectors in this context The value of p ischosen so that among n2 entries of X the expected number ofselected (known) entries is m
Define the ntimes n matrix Y with entries Yij = δijXij We canassume that we are shown Y for it is a matrix that contains theobserved entries of X while all unobserved entries are replacedwith zeros The following result shows how to estimate X basedon Y
Covariance estimation and MC DAMC Lecture 14 May 27 - 29 2020 16 22
Theorem 4 (Matrix completion)
Let 983142X be a best rank r approximation to pminus1Y Then
E1
n983042983142X minusX983042F 983249 C logn
983157rn
m983042X983042max
Here 983042X983042max = maxij |Xij | denotes the maximum magnitude of theentries of X
Remark This theorem controls the average error per entry in themean-squared sense To make the error small let us assume that wehave a sample of size m ≫ rn log2 n which is slightly larger than theideal size m sim rn This makes C log n
983155rnm = o(1) and forces the
recovery error to be bounded by o(1)983042X983042max Summarizing Theorem4 says that the expected average error per entry is much smaller thanthe maximal magnitude of the entries of X This is true for a sampleof almost optimal size m The smaller the rank r of the matrix X thefewer entries of X we need to see in order to do matrix completion
Covariance estimation and MC DAMC Lecture 14 May 27 - 29 2020 17 22
Proof of Theorem 4Step 1 The error in the operator norm Let us first bound therecovery error in the operator norm Decompose the error into twoparts using triangle inequality
983042983142X minusX983042 le 983042983142X minus pminus1Y 983042+ 983042pminus1Y minusX983042
Recall that 983142X is a best approximation to pminus1Y Then the first part ofthe error is smaller than the second part and we have
983042983142X minusX983042 le 2983042pminus1Y minusX983042 =2
p983042Y minus pX983042
The entries of the matrix Y minus pX
(Y minus pX)ij = (δij minus p)Xij
are independent and mean zero random variables
Covariance estimation and MC DAMC Lecture 14 May 27 - 29 2020 18 22
We have
E983042Y minus pX983042
983249 C log n middot983061Emax
i983042(Y minus pX)i9830422 + Emax
j983042(Y minus pX)j9830422
983062
All that remains is to bound the norms of the rows and columns ofY minus pX This is not difficult if we note that they can be expressed assums of independent random variables
983042(Y minus pX)i98304222 =n983131
j=1
(δij minus p)2X2ij 983249
n983131
j=1
(δij minus p)2 middot 983042X9830422max
and similarly for columns Taking expectation and noting that
E (δij minus p)2 = Var (δij) = p(1minus p)
we get
E 983042(Y minus pX)i9830422 983249983059E 983042(Y minus pX)i98304222
98306012983249 radic
pn983042X983042max
Covariance estimation and MC DAMC Lecture 14 May 27 - 29 2020 19 22
This is a good bound but we need something stronger Since themaximum appears inside the expectation we need a uniform boundwhich will say that all rows are bounded simultaneously with highprobability Such uniform bounds are usually proved by applyingconcentration inequalities followed by a union bound Bernsteinsinequality (145) yields (check)
P
983099983103
983101
n983131
j=1
(δij minus p)2 gt tpn
983100983104
983102 983249 exp(minusctpn) for t 983245 3
This probability can be further bounded by nminusct using the assumptionthat m = pn2 ge n log n A union bound over n rows leads to
P
983099983103
983101maxiisin[n]
n983131
j=1
(δij minus p)2 gt tpn
983100983104
983102 983249 n middot nminusct for t 983245 3
Covariance estimation and MC DAMC Lecture 14 May 27 - 29 2020 20 22
Integrating this tail we have
Emaxiisin[n]
n983131
j=1
(δij minus p)2 ≲ pn
And this yields the desired bound on the rows
Emaxiisin[n]
983042(Y minus pX)i9830422 ≲radicpn983042X983042max
We can do similarly for the columns Then
E983042Y minus pX983042 ≲ log nradicpn983042X983042max
Therefore we get
E983042983142X minusX983042 ≲ log n
983157n
p983042X983042max
Covariance estimation and MC DAMC Lecture 14 May 27 - 29 2020 21 22
Step 2 Passing to Frobenius normWe know that rank(X) le r by assumption and rank(983141X) le r by
construction so rank(983142X minusX) le 2r There is a simple relationshipbetween the operator and Frobenius norms
983042983142X minusX983042F leradic2r983042983142X minusX983042
Taking expectation of both sides we get
E983042983142X minusX983042F 983249radic2rE983042983142X minusX983042 ≲ log n
983157rn
p983042X983042max
Dividing both sides by n we can rewrite this bound as
E1
n983042983142X minusX983042F ≲ log n
983157rn
pn2983042X983042max
The proof is completed by noting the definition of the samplingprobability p = mn2
Covariance estimation and MC DAMC Lecture 14 May 27 - 29 2020 22 22
The next most basic parameter of the distribution is thecovariance matrix
Σ = E(X minus EX)(X minus EX)T
The eigenvectors of the covariance matrix Σ are called theprincipal components Principal components that correspond tolarge eigenvalues of Σ are the directions in which the distributionof X is most extended see the figure
This method is called Principal Component Analysis (PCA)
Covariance estimation and MC DAMC Lecture 14 May 27 - 29 2020 3 22
One can estimate the covariance matrix Σ from the sample bycomputing the sample covariance
ΣN =1
N
N983131
i=1
(Xi minus EXi)(Xi minus EXi)T
Again the law of large numbers guarantees that the estimatebecomes tight as the sample size N grows to infinity ie
ΣN rarr Σ as N rarr infin
But how large should the sample size N be for covarianceestimation Generally one can not have N lt n for dimensionreasons (Why) We are going to show that
N sim n log n
is enough In other words covariance estimation is possible withjust logarithmic oversampling
Covariance estimation and MC DAMC Lecture 14 May 27 - 29 2020 4 22
For simplicity we shall state the covariance estimation bound formean zero distributions (If the mean is not zero we can estimateit from the sample and subtract The mean can be accuratelyestimated from a sample of size N = O(n))
Theorem 1 (Covariance estimation)
Let X be a random vector in Rn with covariance matrix Σ Supposethat
983042X98304222 ≲ E983042X98304222 = trΣ almost surely
Then for every N ge 1 we have
E 983042ΣN minus Σ983042 ≲ 983042Σ983042983075983157
n log n
N+
n log n
N
983076
Covariance estimation and MC DAMC Lecture 14 May 27 - 29 2020 5 22
Proof Apply matrix Bernsteinrsquos inequality corollary 3 for the sum ofindependent random matrices XiX
Ti minus Σ and get
E 983042ΣN minus Σ983042 =1
NE
983056983056983056983056983056
N983131
i=1
983043XiX
Ti minus Σ
983044983056983056983056983056983056
≲ 1
N(σ983155
log n+K log n)
where
σ2 =
983056983056983056983056983056
N983131
i=1
E983043XiX
Ti minus Σ
9830442983056983056983056983056983056 = N
983056983056983056E983043XXT minus Σ
9830442983056983056983056
and K is chosen so that
983042XXT minus Σ983042 le K almost surely
Covariance estimation and MC DAMC Lecture 14 May 27 - 29 2020 6 22
It remains to bound σ and K. Let us start with σ. We have

    E(X X^T − Σ)^2 = E‖X‖_2^2 X X^T − Σ^2 ≼ tr(Σ) · E X X^T = tr(Σ) · Σ.

Thus σ^2 ≲ N tr(Σ) ‖Σ‖. Next, to bound K, we have

    ‖X X^T − Σ‖ ≤ ‖X‖_2^2 + ‖Σ‖ ≲ tr(Σ) + ‖Σ‖ ≤ 2 tr(Σ) =: K.
Therefore

    E‖Σ_N − Σ‖ ≲ (1/N)(√(N tr(Σ) ‖Σ‖ log n) + tr(Σ) log n).

The proof is completed by using tr(Σ) ≤ n ‖Σ‖.
1.1 Low-dimensional distributions
Far fewer samples are needed for covariance estimation of low-dimensional, or approximately low-dimensional, distributions. To measure approximate low-dimensionality we can use the notion of the stable rank of Σ^{1/2}. The stable rank of a matrix A is defined as the square of the ratio of the Frobenius to operator norms:

    r(A) = ‖A‖_F^2 / ‖A‖^2 ≤ rank(A).
The proof of Theorem 1 yields

    E‖Σ_N − Σ‖ ≲ ‖Σ‖ (√(r log n / N) + (r log n) / N),

where r = r(Σ^{1/2}) = tr(Σ)/‖Σ‖. Therefore covariance estimation is possible with N ∼ r log n samples.
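The stable rank takes one line to compute. A small sketch (assumes NumPy; the 2×2 matrix is an illustrative example) showing that r(A) ≤ rank(A), and that a dominant direction makes the stable rank much smaller than the rank:

```python
import numpy as np

def stable_rank(A):
    # r(A) = ||A||_F^2 / ||A||^2 : squared ratio of Frobenius to operator norm
    return np.linalg.norm(A, 'fro') ** 2 / np.linalg.norm(A, 2) ** 2

# A matrix with one dominant direction: rank 2, but stable rank barely above 1.
A = np.diag([10.0, 0.1])
r = stable_rank(A)
# r = (100 + 0.01) / 100 = 1.0001, while rank(A) = 2
assert r <= np.linalg.matrix_rank(A)
```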
2 Norms of random matrices
Let A_i denote the i-th row of A; we have (exercise)

    max_i ‖A_i‖_2 ≤ ‖A‖ ≤ √n · max_i ‖A_i‖_2.

For random matrices with independent entries the bound can be improved to the point where the upper and lower bounds almost match.
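The deterministic sandwich above can be checked directly; a sketch (assumes NumPy, arbitrary fixed seed):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 8
A = rng.standard_normal((n, n))

row_norms = np.linalg.norm(A, axis=1)   # ||A_i||_2 for each row i
op_norm = np.linalg.norm(A, 2)          # operator (spectral) norm ||A||

# max_i ||A_i||_2  <=  ||A||  <=  sqrt(n) * max_i ||A_i||_2
assert row_norms.max() <= op_norm <= np.sqrt(n) * row_norms.max()
```

The lower bound holds because ‖A‖ ≥ ‖e_i^T A‖_2 = ‖A_i‖_2 for every i; the upper bound follows from ‖A‖^2 ≤ ‖A‖_F^2 = ∑_i ‖A_i‖_2^2 ≤ n · max_i ‖A_i‖_2^2.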
Theorem 2 (Norms of random matrices without boundedness assumptions)

Let A be an n × n symmetric random matrix whose entries on and above the diagonal are independent, mean zero random variables. Then

    E max_i ‖A_i‖_2 ≤ E‖A‖ ≤ C log n · E max_i ‖A_i‖_2,

where A_i denote the rows of A.
Lemma 3 (Symmetrization)
Let X_1, ..., X_N be independent, mean zero random vectors in a normed space, and let ε_1, ..., ε_N be independent Rademacher random variables. Then

    (1/2) E‖∑_{i=1}^N ε_i X_i‖ ≤ E‖∑_{i=1}^N X_i‖ ≤ 2 E‖∑_{i=1}^N ε_i X_i‖.
Proof: To prove the upper bound, let (X′_i) be an independent copy of the random vectors (X_i), i.e. just different random vectors with the same joint distribution as (X_i) and independent of (X_i). Then

    E‖∑_i X_i‖ = E‖∑_i X_i − E(∑_i X′_i)‖ ≤ E‖∑_i X_i − ∑_i X′_i‖ = E‖∑_i (X_i − X′_i)‖.
The distribution of the random vectors Y_i = X_i − X′_i is symmetric, which means that the distributions of Y_i and −Y_i are the same. (Why?) Thus the distributions of the random vectors Y_i and ε_i Y_i are also the same, for all we do is change the signs of these vectors at random, independently of the values of the vectors. Summarizing, we can replace X_i − X′_i in the sum above with ε_i(X_i − X′_i). Thus

    E‖∑_i X_i‖ ≤ E‖∑_i ε_i(X_i − X′_i)‖
               ≤ E‖∑_i ε_i X_i‖ + E‖∑_i ε_i X′_i‖
               = 2 E‖∑_i ε_i X_i‖.
This proves the upper bound in the symmetrization inequality. The lower bound can be proved by a similar argument. (Do this!)
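A Monte Carlo illustration of the lemma (a sketch, assumes NumPy; the lemma is about true expectations, but with a fixed seed and many trials the empirical averages reflect it):

```python
import numpy as np

rng = np.random.default_rng(2)
N, n, trials = 10, 5, 2000

norm_plain, norm_sym = 0.0, 0.0
for _ in range(trials):
    X = rng.standard_normal((N, n))          # independent mean-zero vectors X_1, ..., X_N
    eps = rng.choice([-1.0, 1.0], size=N)    # independent Rademacher signs
    norm_plain += np.linalg.norm(X.sum(axis=0))
    norm_sym += np.linalg.norm((eps[:, None] * X).sum(axis=0))
norm_plain /= trials                          # estimates E||sum X_i||
norm_sym /= trials                            # estimates E||sum eps_i X_i||

# Empirically:  (1/2) E||sum eps_i X_i||  <=  E||sum X_i||  <=  2 E||sum eps_i X_i||
assert 0.5 * norm_sym <= norm_plain <= 2.0 * norm_sym
```

For symmetric X_i (as here, Gaussian), ε_i X_i has the same distribution as X_i, so the two averages come out nearly equal, well inside the factor-2 window.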
Proof of Theorem 2: The lower bound is trivial. The proof of the upper bound will be based on matrix Bernstein's inequality.
We represent A as a sum of independent, mean zero, symmetric random matrices Z_ij, each of which contains a pair of symmetric entries of A (or one diagonal entry):

    A = ∑_{i≤j} Z_ij.

By the symmetrization inequality (Lemma 3) for the random matrices Z_ij, we get

    E‖A‖ = E‖∑_{i≤j} Z_ij‖ ≤ 2 E‖∑_{i≤j} X_ij‖,

where we set X_ij = ε_ij Z_ij and the ε_ij are independent Rademacher random variables. Now we condition on A. The random variables Z_ij become fixed values, and all randomness remains in the Rademacher random variables ε_ij.
Note that the X_ij are (conditionally) bounded almost surely, and this is exactly what we lacked in order to apply matrix Bernstein's inequality. Now we can do it. The corollary of matrix Bernstein's inequality gives

    E_ε‖∑_{i≤j} X_ij‖ ≲ σ √(log n) + K log n,

where σ^2 = ‖∑_{i≤j} E_ε X_ij^2‖ and K = max_{i≤j} ‖X_ij‖. A good exercise is to check that

    σ ≲ max_i ‖A_i‖_2 and K ≲ max_i ‖A_i‖_2.

Then we have

    E_ε‖∑_{i≤j} X_ij‖ ≲ log n · max_i ‖A_i‖_2.

Finally, we unfix A by taking the expectation of both sides of this inequality with respect to A and using the law of total expectation.
We stated Theorem 2 for symmetric matrices, but it is simple to extend it to general m × n random matrices A. The bound in this case becomes

    E‖A‖ ≤ C log(m + n) · (E max_i ‖A_i‖_2 + E max_j ‖A^j‖_2),

where A_i and A^j denote the rows and columns of A. To see this, apply Theorem 2 to the (m + n) × (m + n) symmetric random matrix

    [ 0    A ]
    [ A^T  0 ].
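This dilation trick works because the symmetric dilation has the same operator norm as A (its eigenvalues are ± the singular values of A). A quick numerical check (sketch, assumes NumPy):

```python
import numpy as np

rng = np.random.default_rng(3)
m, n = 3, 5
A = rng.standard_normal((m, n))

# Symmetric (m+n) x (m+n) dilation  [[0, A], [A^T, 0]]
D = np.block([[np.zeros((m, m)), A],
              [A.T, np.zeros((n, n))]])

assert np.allclose(D, D.T)                                     # dilation is symmetric
assert np.isclose(np.linalg.norm(D, 2), np.linalg.norm(A, 2))  # same operator norm
```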
3 Matrix completion
Consider a fixed, unknown n × n matrix X. Suppose we are shown m randomly chosen entries of X. Can we guess all the missing entries? This important problem is called matrix completion. We will analyze it using the bounds on the norms of random matrices we just obtained.
Obviously there is no way to guess the missing entries unless we know something extra about the matrix X. So let us assume that X has low rank:

    rank(X) = r ≪ n.

The number of degrees of freedom of an n × n matrix with rank r is O(rn). (Why?) So we may hope that m ∼ rn observed entries of X will be enough to determine X completely. But how?
Here we will analyze what is probably the simplest method for matrix completion. Take the matrix Y that consists of the observed entries of X, with all unobserved entries set to zero. Unlike X, the matrix Y may not have small rank. Compute the best rank-r approximation of Y. The result, as we will show, will be a good approximation to X.
But before we show this, let us define the sampling of entries more rigorously. Assume each entry of X is shown or hidden independently of the others with fixed probability p. Which entries are shown is decided by independent Bernoulli random variables

    δ_ij ∼ Ber(p) with p = m / n^2,

which are often called selectors in this context. The value of p is chosen so that, among the n^2 entries of X, the expected number of selected (known) entries is m.
Define the n × n matrix Y with entries Y_ij = δ_ij X_ij. We can assume that we are shown Y, for it is a matrix that contains the observed entries of X while all unobserved entries are replaced with zeros. The following result shows how to estimate X based on Y.
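The whole procedure fits in a few lines. A sketch (assumes NumPy; the sizes n, r and the sampling probability p below are illustrative) of the estimator analyzed next, applied to a random rank-r matrix:

```python
import numpy as np

rng = np.random.default_rng(4)
n, r = 50, 2
X = rng.standard_normal((n, r)) @ rng.standard_normal((r, n))   # rank-r target matrix

p = 0.5                                    # sampling probability, p = m / n^2
delta = rng.random((n, n)) < p             # Bernoulli selectors delta_ij ~ Ber(p)
Y = np.where(delta, X, 0.0)                # observed entries of X, zeros elsewhere

# Best rank-r approximation of p^{-1} Y via truncated SVD.
U, s, Vt = np.linalg.svd(Y / p)
X_hat = (U[:, :r] * s[:r]) @ Vt[:r]

per_entry_err = np.linalg.norm(X_hat - X, 'fro') / n
# With about half the entries observed, the average per-entry error comes out
# well below the maximum entry magnitude ||X||_max, as Theorem 4 predicts.
```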
Theorem 4 (Matrix completion)
Let X̂ be a best rank-r approximation of p^{-1} Y. Then

    E (1/n) ‖X̂ − X‖_F ≤ C log n · √(rn/m) · ‖X‖_max.

Here ‖X‖_max = max_{i,j} |X_ij| denotes the maximum magnitude of the entries of X.
Remark: This theorem controls the average error per entry in the mean-squared sense. To make the error small, let us assume that we have a sample of size m ≫ rn log^2 n, which is slightly larger than the ideal size m ∼ rn. This makes C log n · √(rn/m) = o(1) and forces the recovery error to be bounded by o(1) ‖X‖_max. Summarizing, Theorem 4 says that the expected average error per entry is much smaller than the maximal magnitude of the entries of X, and this holds for a sample of almost optimal size m. The smaller the rank r of the matrix X, the fewer entries of X we need to see in order to do matrix completion.
Proof of Theorem 4.

Step 1: the error in the operator norm. Let us first bound the recovery error in the operator norm. Decompose the error into two parts using the triangle inequality:

    ‖X̂ − X‖ ≤ ‖X̂ − p^{-1} Y‖ + ‖p^{-1} Y − X‖.

Recall that X̂ is a best rank-r approximation of p^{-1} Y. Then the first part of the error is smaller than the second part, and we have

    ‖X̂ − X‖ ≤ 2 ‖p^{-1} Y − X‖ = (2/p) ‖Y − pX‖.

The entries of the matrix Y − pX,

    (Y − pX)_ij = (δ_ij − p) X_ij,

are independent, mean zero random variables.
We have

    E‖Y − pX‖ ≤ C log n · (E max_i ‖(Y − pX)_i‖_2 + E max_j ‖(Y − pX)^j‖_2).

All that remains is to bound the norms of the rows and columns of Y − pX. This is not difficult if we note that they can be expressed as sums of independent random variables:

    ‖(Y − pX)_i‖_2^2 = ∑_{j=1}^n (δ_ij − p)^2 X_ij^2 ≤ ∑_{j=1}^n (δ_ij − p)^2 · ‖X‖_max^2,

and similarly for columns. Taking the expectation and noting that

    E(δ_ij − p)^2 = Var(δ_ij) = p(1 − p),

we get

    E‖(Y − pX)_i‖_2 ≤ (E‖(Y − pX)_i‖_2^2)^{1/2} ≤ √(pn) · ‖X‖_max.
This is a good bound, but we need something stronger. Since the maximum appears inside the expectation, we need a uniform bound, one that says all rows are bounded simultaneously with high probability. Such uniform bounds are usually proved by applying concentration inequalities followed by a union bound. Bernstein's inequality (145) yields (check!)

    P{ ∑_{j=1}^n (δ_ij − p)^2 > tpn } ≤ exp(−ctpn) for t ≥ 3.

This probability can be further bounded by n^{−ct} using the assumption that m = pn^2 ≥ n log n. A union bound over the n rows leads to

    P{ max_{i∈[n]} ∑_{j=1}^n (δ_ij − p)^2 > tpn } ≤ n · n^{−ct} for t ≥ 3.
Integrating this tail, we have

    E max_{i∈[n]} ∑_{j=1}^n (δ_ij − p)^2 ≲ pn,

and this yields the desired bound on the rows:

    E max_{i∈[n]} ‖(Y − pX)_i‖_2 ≲ √(pn) · ‖X‖_max.

We can argue similarly for the columns. Then

    E‖Y − pX‖ ≲ log n · √(pn) · ‖X‖_max.

Therefore we get

    E‖X̂ − X‖ ≲ log n · √(n/p) · ‖X‖_max.
Step 2: passing to the Frobenius norm. We know that rank(X) ≤ r by assumption and rank(X̂) ≤ r by construction, so rank(X̂ − X) ≤ 2r. There is a simple relationship between the operator and Frobenius norms:

    ‖X̂ − X‖_F ≤ √(2r) · ‖X̂ − X‖.

Taking the expectation of both sides, we get

    E‖X̂ − X‖_F ≤ √(2r) · E‖X̂ − X‖ ≲ log n · √(rn/p) · ‖X‖_max.

Dividing both sides by n, we can rewrite this bound as

    E (1/n) ‖X̂ − X‖_F ≲ log n · √(rn/(pn^2)) · ‖X‖_max.

The proof is completed by recalling the definition of the sampling probability, p = m/n^2.
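The norm comparison used in Step 2 holds for any matrix M with rank(M) ≤ k: since ‖M‖_F^2 is the sum of at most k squared singular values, each at most ‖M‖^2, we get ‖M‖_F ≤ √k · ‖M‖. A quick numerical check (sketch, assumes NumPy):

```python
import numpy as np

rng = np.random.default_rng(5)
n, k = 30, 4
M = rng.standard_normal((n, k)) @ rng.standard_normal((k, n))   # rank(M) <= k

fro = np.linalg.norm(M, 'fro')   # sqrt of sum of squared singular values
op = np.linalg.norm(M, 2)        # largest singular value

# ||M||_F <= sqrt(k) * ||M||  for any matrix of rank at most k
assert fro <= np.sqrt(k) * op + 1e-9
```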
One can estimate the covariance matrix Σ from the sample bycomputing the sample covariance
ΣN =1
N
N983131
i=1
(Xi minus EXi)(Xi minus EXi)T
Again the law of large numbers guarantees that the estimatebecomes tight as the sample size N grows to infinity ie
ΣN rarr Σ as N rarr infin
But how large should the sample size N be for covarianceestimation Generally one can not have N lt n for dimensionreasons (Why) We are going to show that
N sim n log n
is enough In other words covariance estimation is possible withjust logarithmic oversampling
Covariance estimation and MC DAMC Lecture 14 May 27 - 29 2020 4 22
For simplicity we shall state the covariance estimation bound formean zero distributions (If the mean is not zero we can estimateit from the sample and subtract The mean can be accuratelyestimated from a sample of size N = O(n))
Theorem 1 (Covariance estimation)
Let X be a random vector in Rn with covariance matrix Σ Supposethat
983042X98304222 ≲ E983042X98304222 = trΣ almost surely
Then for every N ge 1 we have
E 983042ΣN minus Σ983042 ≲ 983042Σ983042983075983157
n log n
N+
n log n
N
983076
Covariance estimation and MC DAMC Lecture 14 May 27 - 29 2020 5 22
Proof Apply matrix Bernsteinrsquos inequality corollary 3 for the sum ofindependent random matrices XiX
Ti minus Σ and get
E 983042ΣN minus Σ983042 =1
NE
983056983056983056983056983056
N983131
i=1
983043XiX
Ti minus Σ
983044983056983056983056983056983056
≲ 1
N(σ983155
log n+K log n)
where
σ2 =
983056983056983056983056983056
N983131
i=1
E983043XiX
Ti minus Σ
9830442983056983056983056983056983056 = N
983056983056983056E983043XXT minus Σ
9830442983056983056983056
and K is chosen so that
983042XXT minus Σ983042 le K almost surely
Covariance estimation and MC DAMC Lecture 14 May 27 - 29 2020 6 22
It remains to bound σ and K Let us start with σ We have
E(XXT minus Σ)2 = E983042X98304222XXT minus Σ2
≾ tr(Σ) middot EXXT
= tr(Σ) middot Σ
Thus σ2 ≲ Ntr(Σ)983042Σ983042 Next to bound K we have
983042XXT minus Σ983042 le 983042X98304222 + 983042Σ983042≲ tr(Σ) + 983042Σ983042le 2tr(Σ) = K
Therefore
E 983042ΣN minus Σ983042 ≲ 1
N(983155
Ntr(Σ)983042Σ983042 log n+ tr(Σ) log n)
The proof is completed by using tr(Σ) le n983042Σ983042
Covariance estimation and MC DAMC Lecture 14 May 27 - 29 2020 7 22
11 Low-dimensional distributions
Far fewer samples are needed for covariance estimation forlow-dimensional or approximately low-dimensional distributionsTo measure approximate low-dimensionality we can use the notionof the stable rank of Σ2 The stable rank of a matrix A is definedas the square of the ratio of the Frobenius to operator norms
r(A) =983042A9830422F983042A98304222
le rank(A)
The proof of Theorem 1 yields
E 983042ΣN minus Σ983042 983249 983042Σ983042983075983157
r log n
N+
r log n
N
983076
where r = r(Σ12) = tr(Σ)983042Σ983042 Therefore covariance estimationis possible with N sim r log n samples
Covariance estimation and MC DAMC Lecture 14 May 27 - 29 2020 8 22
2 Norms of random matrices
Let Ai denote the ith row of A we have (exercise)
maxi
983042Ai9830422 le 983042A9830422 leradicnmax
i983042Ai9830422
For random matrices with independent entries the bound can beimproved to the point where the upper and lower bounds almostmatch
Theorem 2 (Norms of random matrices without boundednessassumptions)
Let A be an ntimes n symmetric random matrix whose entries on andabove the diagonal are independent mean zero random variables Then
Emaxi
983042Ai9830422 le E983042A9830422 le C log n middot Emaxi
983042Ai9830422
where Ai denote the rows of A
Covariance estimation and MC DAMC Lecture 14 May 27 - 29 2020 9 22
Lemma 3 (Symmetrization)
Let X1 XN be independent mean zero random vectors in a normedspace and ε1 εN be independent Rademacher random variablesThen
1
2E
983056983056983056983056983056
N983131
i=1
εiXi
983056983056983056983056983056 983249 E
983056983056983056983056983056
N983131
i=1
Xi
983056983056983056983056983056 983249 2E
983056983056983056983056983056
N983131
i=1
εiXi
983056983056983056983056983056
Proof To prove the upper bound let (X primei) be an independent copy of
the random vectors (Xi) ie just different random vectors with thesame joint distribution as (Xi) and independent from (Xi) Then
E
983056983056983056983056983056983131
i
Xi
983056983056983056983056983056 = E
983056983056983056983056983056983131
i
Xi minus E
983075983131
i
X primei
983076983056983056983056983056983056
983249 E
983056983056983056983056983056983131
i
Xi minus983131
i
X primei
983056983056983056983056983056 = E
983056983056983056983056983056983131
i
983043Xi minusX prime
i
983044983056983056983056983056983056
Covariance estimation and MC DAMC Lecture 14 May 27 - 29 2020 10 22
The distribution of the random vectors Yi = Xi minusX primei is symmetric
which means that the distributions of Yi and minusYi are the same(Why) Thus the distribution of the random vectors Yi and εiYi is alsothe same for all we do is change the signs of these vectors at randomand independently of the values of the vectors Summarizing we canreplace Xi minusX prime
i in the sum above with εi(Xi minusX primei) Thus
E
983056983056983056983056983056983131
i
Xi
983056983056983056983056983056 le E
983056983056983056983056983056983131
i
εi(Xi minusX primei)
983056983056983056983056983056
le E
983056983056983056983056983056983131
i
εiXi
983056983056983056983056983056+ E
983056983056983056983056983056983131
i
εiXprimei
983056983056983056983056983056
= 2E
983056983056983056983056983056983131
i
εiXi
983056983056983056983056983056
This proves the upper bound in the symmetrization inequality Thelower bound can be proved by a similar argument (Do this)
Covariance estimation and MC DAMC Lecture 14 May 27 - 29 2020 11 22
Proof of Theorem 2 The lower bound is trivial The proof of the upperbound will be based on matrix Bernsteinrsquos inequalityWe represent A as a sum of independent mean zero symmetricrandom matrices Zij each of which contains a pair of symmetric entriesof A (or one diagonal entry)
A =983131
ilejZij
By the symmetrization inequality (Lemma 3) for the random matricesZij we get
E983042A983042 = E983056983056983056983131
i983249jZij
983056983056983056 983249 2E983056983056983056983131
i983249jXij
983056983056983056
where we set Xij = εijZij and εij are independent Rademacherrandom variables Now we condition on A The random variables Zij
become fixed values and all randomness remains in the Rademacherrandom variables εij
Covariance estimation and MC DAMC Lecture 14 May 27 - 29 2020 12 22
Note that Xij are (conditionally) bounded almost surely and this isexactly what we have lacked to apply matrix Bernsteinrsquos inequalityNow we can do it The corollary of matrix Bernsteinrsquos inequality gives
Eε
983056983056983056983131
ilejXij
983056983056983056 ≲ σ983155
log n+K log n
where σ2 = 983042983123
ilej EεX2ij983042 and K = maxilej 983042Xij983042 A good exercise is
to check that
σ ≲ maxi
983042Ai9830422 and K ≲ maxi
983042Ai9830422
Then we have
Eε
983056983056983056983131
i983249jXij
983056983056983056 ≲ log n middotmaxi
983042Ai9830422
Finally we unfix A by taking expectation of both sides of thisinequality with respect to A and using the law of total expectation
Covariance estimation and MC DAMC Lecture 14 May 27 - 29 2020 13 22
We state Theorem 2 for symmetric matrices but it is simple toextend it to general mtimes n random matrices A The bound in thiscase becomes
E983042A9830422 le C log(m+ n) middot (Emaxi
983042Ai9830422 + Emaxj
983042Aj9830422)
To see this apply Theorem 2 to the (m+ n)times (m+ n) symmetricrandom matrix 983063
0 AAT 0
983064
3 Matrix completion
Consider a fixed unknown ntimes n matrix X Suppose we are shownm randomly chosen entries of X Can we guess all the missingentries This important problem is called matrix completion Wewill analyze it using the bounds on the norms on random matriceswe just obtained
Covariance estimation and MC DAMC Lecture 14 May 27 - 29 2020 14 22
Obviously there is no way to guess the missing entries unless weknow something extra about the matrix X So let us assume thatX has low rank
rank(X) = r ≪ n
The number of degrees of freedom of an ntimes n matrix with rank ris O(rn) (Why) So we may hope that m sim rn observed entriesof X will be enough to determine X completely But how
Here we will analyze what is probably the simplest method formatrix completion Take the matrix Y that consists of theobserved entries of X while all unobserved entries are set to zeroUnlike X the matrix Y may not have small rank Compute thebest rank r approximation of Y The result as we will show willbe a good approximation to X
Covariance estimation and MC DAMC Lecture 14 May 27 - 29 2020 15 22
But before we show this let us define sampling of entries morerigorously Assume each entry of X is shown or hiddenindependently of others with fixed probability p Which entries areshown is decided by independent Bernoulli random variables
δij sim Ber(p) with p =m
n2
which are often called selectors in this context The value of p ischosen so that among n2 entries of X the expected number ofselected (known) entries is m
Define the ntimes n matrix Y with entries Yij = δijXij We canassume that we are shown Y for it is a matrix that contains theobserved entries of X while all unobserved entries are replacedwith zeros The following result shows how to estimate X basedon Y
Covariance estimation and MC DAMC Lecture 14 May 27 - 29 2020 16 22
Theorem 4 (Matrix completion)
Let 983142X be a best rank r approximation to pminus1Y Then
E1
n983042983142X minusX983042F 983249 C logn
983157rn
m983042X983042max
Here 983042X983042max = maxij |Xij | denotes the maximum magnitude of theentries of X
Remark This theorem controls the average error per entry in themean-squared sense To make the error small let us assume that wehave a sample of size m ≫ rn log2 n which is slightly larger than theideal size m sim rn This makes C log n
983155rnm = o(1) and forces the
recovery error to be bounded by o(1)983042X983042max Summarizing Theorem4 says that the expected average error per entry is much smaller thanthe maximal magnitude of the entries of X This is true for a sampleof almost optimal size m The smaller the rank r of the matrix X thefewer entries of X we need to see in order to do matrix completion
Covariance estimation and MC DAMC Lecture 14 May 27 - 29 2020 17 22
Proof of Theorem 4Step 1 The error in the operator norm Let us first bound therecovery error in the operator norm Decompose the error into twoparts using triangle inequality
983042983142X minusX983042 le 983042983142X minus pminus1Y 983042+ 983042pminus1Y minusX983042
Recall that 983142X is a best approximation to pminus1Y Then the first part ofthe error is smaller than the second part and we have
983042983142X minusX983042 le 2983042pminus1Y minusX983042 =2
p983042Y minus pX983042
The entries of the matrix Y minus pX
(Y minus pX)ij = (δij minus p)Xij
are independent and mean zero random variables
Covariance estimation and MC DAMC Lecture 14 May 27 - 29 2020 18 22
We have
E983042Y minus pX983042
983249 C log n middot983061Emax
i983042(Y minus pX)i9830422 + Emax
j983042(Y minus pX)j9830422
983062
All that remains is to bound the norms of the rows and columns ofY minus pX This is not difficult if we note that they can be expressed assums of independent random variables
983042(Y minus pX)i98304222 =n983131
j=1
(δij minus p)2X2ij 983249
n983131
j=1
(δij minus p)2 middot 983042X9830422max
and similarly for columns Taking expectation and noting that
E (δij minus p)2 = Var (δij) = p(1minus p)
we get
E 983042(Y minus pX)i9830422 983249983059E 983042(Y minus pX)i98304222
98306012983249 radic
pn983042X983042max
Covariance estimation and MC DAMC Lecture 14 May 27 - 29 2020 19 22
This is a good bound but we need something stronger Since themaximum appears inside the expectation we need a uniform boundwhich will say that all rows are bounded simultaneously with highprobability Such uniform bounds are usually proved by applyingconcentration inequalities followed by a union bound Bernsteinsinequality (145) yields (check)
P
983099983103
983101
n983131
j=1
(δij minus p)2 gt tpn
983100983104
983102 983249 exp(minusctpn) for t 983245 3
This probability can be further bounded by nminusct using the assumptionthat m = pn2 ge n log n A union bound over n rows leads to
P
983099983103
983101maxiisin[n]
n983131
j=1
(δij minus p)2 gt tpn
983100983104
983102 983249 n middot nminusct for t 983245 3
Covariance estimation and MC DAMC Lecture 14 May 27 - 29 2020 20 22
Integrating this tail we have
Emaxiisin[n]
n983131
j=1
(δij minus p)2 ≲ pn
And this yields the desired bound on the rows
Emaxiisin[n]
983042(Y minus pX)i9830422 ≲radicpn983042X983042max
We can do similarly for the columns Then
E983042Y minus pX983042 ≲ log nradicpn983042X983042max
Therefore we get
E983042983142X minusX983042 ≲ log n
983157n
p983042X983042max
Covariance estimation and MC DAMC Lecture 14 May 27 - 29 2020 21 22
Step 2 Passing to Frobenius normWe know that rank(X) le r by assumption and rank(983141X) le r by
construction so rank(983142X minusX) le 2r There is a simple relationshipbetween the operator and Frobenius norms
983042983142X minusX983042F leradic2r983042983142X minusX983042
Taking expectation of both sides we get
E983042983142X minusX983042F 983249radic2rE983042983142X minusX983042 ≲ log n
983157rn
p983042X983042max
Dividing both sides by n we can rewrite this bound as
E1
n983042983142X minusX983042F ≲ log n
983157rn
pn2983042X983042max
The proof is completed by noting the definition of the samplingprobability p = mn2
Covariance estimation and MC DAMC Lecture 14 May 27 - 29 2020 22 22
For simplicity we shall state the covariance estimation bound formean zero distributions (If the mean is not zero we can estimateit from the sample and subtract The mean can be accuratelyestimated from a sample of size N = O(n))
Theorem 1 (Covariance estimation)
Let X be a random vector in Rn with covariance matrix Σ Supposethat
983042X98304222 ≲ E983042X98304222 = trΣ almost surely
Then for every N ge 1 we have
E 983042ΣN minus Σ983042 ≲ 983042Σ983042983075983157
n log n
N+
n log n
N
983076
Covariance estimation and MC DAMC Lecture 14 May 27 - 29 2020 5 22
Proof Apply matrix Bernsteinrsquos inequality corollary 3 for the sum ofindependent random matrices XiX
Ti minus Σ and get
E 983042ΣN minus Σ983042 =1
NE
983056983056983056983056983056
N983131
i=1
983043XiX
Ti minus Σ
983044983056983056983056983056983056
≲ 1
N(σ983155
log n+K log n)
where
σ2 =
983056983056983056983056983056
N983131
i=1
E983043XiX
Ti minus Σ
9830442983056983056983056983056983056 = N
983056983056983056E983043XXT minus Σ
9830442983056983056983056
and K is chosen so that
983042XXT minus Σ983042 le K almost surely
Covariance estimation and MC DAMC Lecture 14 May 27 - 29 2020 6 22
It remains to bound σ and K Let us start with σ We have
E(XXT minus Σ)2 = E983042X98304222XXT minus Σ2
≾ tr(Σ) middot EXXT
= tr(Σ) middot Σ
Thus σ2 ≲ Ntr(Σ)983042Σ983042 Next to bound K we have
983042XXT minus Σ983042 le 983042X98304222 + 983042Σ983042≲ tr(Σ) + 983042Σ983042le 2tr(Σ) = K
Therefore
E 983042ΣN minus Σ983042 ≲ 1
N(983155
Ntr(Σ)983042Σ983042 log n+ tr(Σ) log n)
The proof is completed by using tr(Σ) le n983042Σ983042
Covariance estimation and MC DAMC Lecture 14 May 27 - 29 2020 7 22
11 Low-dimensional distributions
Far fewer samples are needed for covariance estimation forlow-dimensional or approximately low-dimensional distributionsTo measure approximate low-dimensionality we can use the notionof the stable rank of Σ2 The stable rank of a matrix A is definedas the square of the ratio of the Frobenius to operator norms
r(A) =983042A9830422F983042A98304222
le rank(A)
The proof of Theorem 1 yields
E 983042ΣN minus Σ983042 983249 983042Σ983042983075983157
r log n
N+
r log n
N
983076
where r = r(Σ12) = tr(Σ)983042Σ983042 Therefore covariance estimationis possible with N sim r log n samples
Covariance estimation and MC DAMC Lecture 14 May 27 - 29 2020 8 22
2 Norms of random matrices
Let Ai denote the ith row of A we have (exercise)
maxi
983042Ai9830422 le 983042A9830422 leradicnmax
i983042Ai9830422
For random matrices with independent entries the bound can beimproved to the point where the upper and lower bounds almostmatch
Theorem 2 (Norms of random matrices without boundednessassumptions)
Let A be an ntimes n symmetric random matrix whose entries on andabove the diagonal are independent mean zero random variables Then
Emaxi
983042Ai9830422 le E983042A9830422 le C log n middot Emaxi
983042Ai9830422
where Ai denote the rows of A
Covariance estimation and MC DAMC Lecture 14 May 27 - 29 2020 9 22
Lemma 3 (Symmetrization)
Let X1 XN be independent mean zero random vectors in a normedspace and ε1 εN be independent Rademacher random variablesThen
1
2E
983056983056983056983056983056
N983131
i=1
εiXi
983056983056983056983056983056 983249 E
983056983056983056983056983056
N983131
i=1
Xi
983056983056983056983056983056 983249 2E
983056983056983056983056983056
N983131
i=1
εiXi
983056983056983056983056983056
Proof To prove the upper bound let (X primei) be an independent copy of
the random vectors (Xi) ie just different random vectors with thesame joint distribution as (Xi) and independent from (Xi) Then
E
983056983056983056983056983056983131
i
Xi
983056983056983056983056983056 = E
983056983056983056983056983056983131
i
Xi minus E
983075983131
i
X primei
983076983056983056983056983056983056
983249 E
983056983056983056983056983056983131
i
Xi minus983131
i
X primei
983056983056983056983056983056 = E
983056983056983056983056983056983131
i
983043Xi minusX prime
i
983044983056983056983056983056983056
Covariance estimation and MC DAMC Lecture 14 May 27 - 29 2020 10 22
The distribution of the random vectors Yi = Xi minusX primei is symmetric
which means that the distributions of Yi and minusYi are the same(Why) Thus the distribution of the random vectors Yi and εiYi is alsothe same for all we do is change the signs of these vectors at randomand independently of the values of the vectors Summarizing we canreplace Xi minusX prime
i in the sum above with εi(Xi minusX primei) Thus
E
983056983056983056983056983056983131
i
Xi
983056983056983056983056983056 le E
983056983056983056983056983056983131
i
εi(Xi minusX primei)
983056983056983056983056983056
le E
983056983056983056983056983056983131
i
εiXi
983056983056983056983056983056+ E
983056983056983056983056983056983131
i
εiXprimei
983056983056983056983056983056
= 2E
983056983056983056983056983056983131
i
εiXi
983056983056983056983056983056
This proves the upper bound in the symmetrization inequality Thelower bound can be proved by a similar argument (Do this)
Covariance estimation and MC DAMC Lecture 14 May 27 - 29 2020 11 22
Proof of Theorem 2 The lower bound is trivial The proof of the upperbound will be based on matrix Bernsteinrsquos inequalityWe represent A as a sum of independent mean zero symmetricrandom matrices Zij each of which contains a pair of symmetric entriesof A (or one diagonal entry)
A =983131
ilejZij
By the symmetrization inequality (Lemma 3) for the random matricesZij we get
E983042A983042 = E983056983056983056983131
i983249jZij
983056983056983056 983249 2E983056983056983056983131
i983249jXij
983056983056983056
where we set Xij = εijZij and εij are independent Rademacherrandom variables Now we condition on A The random variables Zij
become fixed values and all randomness remains in the Rademacherrandom variables εij
Covariance estimation and MC DAMC Lecture 14 May 27 - 29 2020 12 22
Note that Xij are (conditionally) bounded almost surely and this isexactly what we have lacked to apply matrix Bernsteinrsquos inequalityNow we can do it The corollary of matrix Bernsteinrsquos inequality gives
Eε
983056983056983056983131
ilejXij
983056983056983056 ≲ σ983155
log n+K log n
where σ2 = 983042983123
ilej EεX2ij983042 and K = maxilej 983042Xij983042 A good exercise is
to check that
σ ≲ maxi
983042Ai9830422 and K ≲ maxi
983042Ai9830422
Then we have
Eε
983056983056983056983131
i983249jXij
983056983056983056 ≲ log n middotmaxi
983042Ai9830422
Finally we unfix A by taking expectation of both sides of thisinequality with respect to A and using the law of total expectation
We stated Theorem 2 for symmetric matrices, but it is simple to extend it to general $m \times n$ random matrices $A$. The bound in this case becomes
\[ E\|A\|_2 \le C \log(m+n) \cdot \Big( E \max_i \|A_i\|_2 + E \max_j \|A^j\|_2 \Big), \]
where $A_i$ denote the rows and $A^j$ the columns of $A$. To see this, apply Theorem 2 to the $(m+n) \times (m+n)$ symmetric random matrix
\[ \begin{bmatrix} 0 & A \\ A^{\mathsf{T}} & 0 \end{bmatrix}. \]
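To get a feel for Theorem 2 (an illustrative sketch with arbitrary toy sizes, not part of the lecture), one can compare the operator norm of a symmetric Gaussian matrix with its largest row norm:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 300

# Symmetric matrix with independent mean-zero entries on and above the diagonal
G = rng.standard_normal((n, n))
A = np.triu(G) + np.triu(G, 1).T

op_norm = np.linalg.norm(A, 2)             # operator (spectral) norm ||A||
max_row = np.linalg.norm(A, axis=1).max()  # max_i ||A_i||_2

# Deterministic two-sided bound: max_i ||A_i||_2 <= ||A|| <= sqrt(n) max_i ||A_i||_2.
# Theorem 2 says that, in expectation, the sqrt(n) factor improves to C log n.
print(op_norm / max_row, np.log(n))
```

For this ensemble the ratio $\|A\| / \max_i \|A_i\|_2$ stays far below the deterministic factor $\sqrt{n}$, consistent with the logarithmic factor in Theorem 2.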
3 Matrix completion
Consider a fixed, unknown $n \times n$ matrix $X$. Suppose we are shown $m$ randomly chosen entries of $X$. Can we guess all the missing entries? This important problem is called matrix completion. We will analyze it using the bounds on the norms of random matrices we just obtained.
Obviously, there is no way to guess the missing entries unless we know something extra about the matrix $X$. So let us assume that $X$ has low rank:
\[ \operatorname{rank}(X) = r \ll n. \]
The number of degrees of freedom of an $n \times n$ matrix with rank $r$ is $O(rn)$. (Why?) So we may hope that $m \sim rn$ observed entries of $X$ will be enough to determine $X$ completely. But how?

Here we will analyze what is probably the simplest method for matrix completion. Take the matrix $Y$ that consists of the observed entries of $X$, with all unobserved entries set to zero. Unlike $X$, the matrix $Y$ may not have small rank. Compute the best rank-$r$ approximation of $Y$. The result, as we will show, will be a good approximation to $X$.
But before we show this, let us define the sampling of entries more rigorously. Assume each entry of $X$ is shown or hidden independently of the others with fixed probability $p$. Which entries are shown is decided by independent Bernoulli random variables
\[ \delta_{ij} \sim \mathrm{Ber}(p) \quad \text{with} \quad p = \frac{m}{n^2}, \]
which are often called selectors in this context. The value of $p$ is chosen so that, among the $n^2$ entries of $X$, the expected number of selected (known) entries is $m$.

Define the $n \times n$ matrix $Y$ with entries $Y_{ij} = \delta_{ij} X_{ij}$. We can assume that we are shown $Y$, for it is a matrix that contains the observed entries of $X$ while all unobserved entries are replaced with zeros. The following result shows how to estimate $X$ based on $Y$.
Theorem 4 (Matrix completion)

Let $\widehat{X}$ be a best rank-$r$ approximation to $p^{-1}Y$. Then
\[ E\, \frac{1}{n} \|\widehat{X} - X\|_F \le C \log n \, \sqrt{\frac{rn}{m}}\, \|X\|_{\max}, \]
where $\|X\|_{\max} = \max_{i,j} |X_{ij}|$ denotes the maximum magnitude of the entries of $X$.

Remark. This theorem controls the average error per entry in the mean-squared sense. To make the error small, let us assume that we have a sample of size $m \gg rn \log^2 n$, which is only slightly larger than the ideal size $m \sim rn$. This makes $C \log n \sqrt{rn/m} = o(1)$ and forces the recovery error to be bounded by $o(1)\|X\|_{\max}$. Summarizing, Theorem 4 says that the expected average error per entry is much smaller than the maximal magnitude of the entries of $X$, and this is true for a sample of almost optimal size $m$. The smaller the rank $r$ of the matrix $X$, the fewer entries of $X$ we need to see in order to do matrix completion.
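The estimator of Theorem 4 is short enough to sketch in NumPy (a toy illustration with arbitrary sizes; the rank $r$ is assumed known, as in the theorem):

```python
import numpy as np

rng = np.random.default_rng(0)
n, r = 200, 3
m = 12000           # number of entries we expect to observe
p = m / n**2        # sampling probability, p = m / n^2 = 0.3

# Unknown rank-r matrix X
X = rng.standard_normal((n, r)) @ rng.standard_normal((r, n))

# Bernoulli selectors: reveal each entry independently with probability p
delta = rng.random((n, n)) < p
Y = np.where(delta, X, 0.0)   # observed entries, zeros elsewhere

# Estimator of Theorem 4: best rank-r approximation of p^{-1} Y, via truncated SVD
U, s, Vt = np.linalg.svd(Y / p, full_matrices=False)
X_hat = (U[:, :r] * s[:r]) @ Vt[:r]

# Average per-entry error (the quantity bounded in Theorem 4) vs. ||X||_max
err_per_entry = np.linalg.norm(X_hat - X) / n
print(err_per_entry, np.abs(X).max())
```

With these toy sizes the average per-entry error comes out well below $\|X\|_{\max}$, in line with the theorem.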
Proof of Theorem 4. Step 1: the error in the operator norm. Let us first bound the recovery error in the operator norm. Decompose the error into two parts using the triangle inequality:
\[ \|\widehat{X} - X\| \le \|\widehat{X} - p^{-1}Y\| + \|p^{-1}Y - X\|. \]
Recall that $\widehat{X}$ is a best rank-$r$ approximation to $p^{-1}Y$, while $X$ itself has rank at most $r$; hence the first part of the error is at most the second part, and we have
\[ \|\widehat{X} - X\| \le 2\|p^{-1}Y - X\| = \frac{2}{p} \|Y - pX\|. \]
The entries of the matrix $Y - pX$,
\[ (Y - pX)_{ij} = (\delta_{ij} - p) X_{ij}, \]
are independent, mean zero random variables.
Applying the extension of Theorem 2 to the matrix $Y - pX$, we have
\[ E\|Y - pX\| \le C \log n \cdot \Big( E \max_i \|(Y - pX)_i\|_2 + E \max_j \|(Y - pX)^j\|_2 \Big). \]
All that remains is to bound the norms of the rows and columns of $Y - pX$. This is not difficult if we note that they can be expressed as sums of independent random variables:
\[ \|(Y - pX)_i\|_2^2 = \sum_{j=1}^n (\delta_{ij} - p)^2 X_{ij}^2 \le \sum_{j=1}^n (\delta_{ij} - p)^2 \cdot \|X\|_{\max}^2, \]
and similarly for the columns. Taking expectation and noting that
\[ E(\delta_{ij} - p)^2 = \operatorname{Var}(\delta_{ij}) = p(1 - p), \]
we get
\[ E\|(Y - pX)_i\|_2 \le \big( E\|(Y - pX)_i\|_2^2 \big)^{1/2} \le \sqrt{pn}\, \|X\|_{\max}. \]
This is a good bound, but we need something stronger: since the maximum appears inside the expectation, we need a uniform bound, one which says that all rows are bounded simultaneously with high probability. Such uniform bounds are usually proved by applying concentration inequalities followed by a union bound. Bernstein's inequality (14.5) yields (check!)
\[ P\Big\{ \sum_{j=1}^n (\delta_{ij} - p)^2 > tpn \Big\} \le \exp(-ctpn) \quad \text{for } t \ge 3. \]
This probability can be further bounded by $n^{-ct}$ using the assumption that $m = pn^2 \ge n \log n$. A union bound over the $n$ rows leads to
\[ P\Big\{ \max_{i \in [n]} \sum_{j=1}^n (\delta_{ij} - p)^2 > tpn \Big\} \le n \cdot n^{-ct} \quad \text{for } t \ge 3. \]
Integrating this tail, we have
\[ E \max_{i \in [n]} \sum_{j=1}^n (\delta_{ij} - p)^2 \lesssim pn. \]
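A small simulation (my own illustration, with arbitrary toy sizes) suggesting that $\max_{i \in [n]} \sum_j (\delta_{ij} - p)^2$ is indeed of order $pn$:

```python
import numpy as np

rng = np.random.default_rng(2)
n, p = 500, 0.1                                   # toy dimension and sampling probability

delta = (rng.random((n, n)) < p).astype(float)    # Bernoulli(p) selectors
row_sums = ((delta - p) ** 2).sum(axis=1)         # sum_j (delta_ij - p)^2, one value per row

# Each row sum has mean n*p*(1-p), just below pn; the maximum over all n rows
# stays within a small constant factor of pn, matching the tail bound with t = 3.
print(row_sums.mean() / (p * n), row_sums.max() / (p * n))
```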
And this yields the desired bound on the rows:
\[ E \max_{i \in [n]} \|(Y - pX)_i\|_2 \lesssim \sqrt{pn}\, \|X\|_{\max}. \]
Arguing similarly for the columns, we obtain
\[ E\|Y - pX\| \lesssim \log n \sqrt{pn}\, \|X\|_{\max}. \]
Therefore, we get
\[ E\|\widehat{X} - X\| \lesssim \log n \sqrt{\frac{n}{p}}\, \|X\|_{\max}. \]
Step 2: passing to the Frobenius norm. We know that $\operatorname{rank}(X) \le r$ by assumption and $\operatorname{rank}(\widehat{X}) \le r$ by construction, so $\operatorname{rank}(\widehat{X} - X) \le 2r$. There is a simple relationship between the operator and Frobenius norms:
\[ \|\widehat{X} - X\|_F \le \sqrt{2r}\, \|\widehat{X} - X\|. \]
Taking expectation of both sides, we get
\[ E\|\widehat{X} - X\|_F \le \sqrt{2r}\, E\|\widehat{X} - X\| \lesssim \log n \sqrt{\frac{rn}{p}}\, \|X\|_{\max}. \]
Dividing both sides by $n$, we can rewrite this bound as
\[ E\, \frac{1}{n} \|\widehat{X} - X\|_F \lesssim \log n \sqrt{\frac{rn}{pn^2}}\, \|X\|_{\max}. \]
The proof is completed by recalling the definition of the sampling probability, $p = m/n^2$.
It remains to bound σ and K Let us start with σ We have
E(XXT minus Σ)2 = E983042X98304222XXT minus Σ2
≾ tr(Σ) middot EXXT
= tr(Σ) middot Σ
Thus σ2 ≲ Ntr(Σ)983042Σ983042 Next to bound K we have
983042XXT minus Σ983042 le 983042X98304222 + 983042Σ983042≲ tr(Σ) + 983042Σ983042le 2tr(Σ) = K
Therefore
E 983042ΣN minus Σ983042 ≲ 1
N(983155
Ntr(Σ)983042Σ983042 log n+ tr(Σ) log n)
The proof is completed by using tr(Σ) le n983042Σ983042
Covariance estimation and MC DAMC Lecture 14 May 27 - 29 2020 7 22
11 Low-dimensional distributions
Far fewer samples are needed for covariance estimation forlow-dimensional or approximately low-dimensional distributionsTo measure approximate low-dimensionality we can use the notionof the stable rank of Σ2 The stable rank of a matrix A is definedas the square of the ratio of the Frobenius to operator norms
r(A) =983042A9830422F983042A98304222
le rank(A)
The proof of Theorem 1 yields
E 983042ΣN minus Σ983042 983249 983042Σ983042983075983157
r log n
N+
r log n
N
983076
where r = r(Σ12) = tr(Σ)983042Σ983042 Therefore covariance estimationis possible with N sim r log n samples
Covariance estimation and MC DAMC Lecture 14 May 27 - 29 2020 8 22
2 Norms of random matrices
Let Ai denote the ith row of A we have (exercise)
maxi
983042Ai9830422 le 983042A9830422 leradicnmax
i983042Ai9830422
For random matrices with independent entries the bound can beimproved to the point where the upper and lower bounds almostmatch
Theorem 2 (Norms of random matrices without boundednessassumptions)
Let A be an ntimes n symmetric random matrix whose entries on andabove the diagonal are independent mean zero random variables Then
Emaxi
983042Ai9830422 le E983042A9830422 le C log n middot Emaxi
983042Ai9830422
where Ai denote the rows of A
Covariance estimation and MC DAMC Lecture 14 May 27 - 29 2020 9 22
Lemma 3 (Symmetrization)
Let X1 XN be independent mean zero random vectors in a normedspace and ε1 εN be independent Rademacher random variablesThen
1
2E
983056983056983056983056983056
N983131
i=1
εiXi
983056983056983056983056983056 983249 E
983056983056983056983056983056
N983131
i=1
Xi
983056983056983056983056983056 983249 2E
983056983056983056983056983056
N983131
i=1
εiXi
983056983056983056983056983056
Proof To prove the upper bound let (X primei) be an independent copy of
the random vectors (Xi) ie just different random vectors with thesame joint distribution as (Xi) and independent from (Xi) Then
E
983056983056983056983056983056983131
i
Xi
983056983056983056983056983056 = E
983056983056983056983056983056983131
i
Xi minus E
983075983131
i
X primei
983076983056983056983056983056983056
983249 E
983056983056983056983056983056983131
i
Xi minus983131
i
X primei
983056983056983056983056983056 = E
983056983056983056983056983056983131
i
983043Xi minusX prime
i
983044983056983056983056983056983056
Covariance estimation and MC DAMC Lecture 14 May 27 - 29 2020 10 22
The distribution of the random vectors Yi = Xi minusX primei is symmetric
which means that the distributions of Yi and minusYi are the same(Why) Thus the distribution of the random vectors Yi and εiYi is alsothe same for all we do is change the signs of these vectors at randomand independently of the values of the vectors Summarizing we canreplace Xi minusX prime
i in the sum above with εi(Xi minusX primei) Thus
E
983056983056983056983056983056983131
i
Xi
983056983056983056983056983056 le E
983056983056983056983056983056983131
i
εi(Xi minusX primei)
983056983056983056983056983056
le E
983056983056983056983056983056983131
i
εiXi
983056983056983056983056983056+ E
983056983056983056983056983056983131
i
εiXprimei
983056983056983056983056983056
= 2E
983056983056983056983056983056983131
i
εiXi
983056983056983056983056983056
This proves the upper bound in the symmetrization inequality Thelower bound can be proved by a similar argument (Do this)
Covariance estimation and MC DAMC Lecture 14 May 27 - 29 2020 11 22
Proof of Theorem 2 The lower bound is trivial The proof of the upperbound will be based on matrix Bernsteinrsquos inequalityWe represent A as a sum of independent mean zero symmetricrandom matrices Zij each of which contains a pair of symmetric entriesof A (or one diagonal entry)
A =983131
ilejZij
By the symmetrization inequality (Lemma 3) for the random matricesZij we get
E983042A983042 = E983056983056983056983131
i983249jZij
983056983056983056 983249 2E983056983056983056983131
i983249jXij
983056983056983056
where we set Xij = εijZij and εij are independent Rademacherrandom variables Now we condition on A The random variables Zij
become fixed values and all randomness remains in the Rademacherrandom variables εij
Covariance estimation and MC DAMC Lecture 14 May 27 - 29 2020 12 22
Note that Xij are (conditionally) bounded almost surely and this isexactly what we have lacked to apply matrix Bernsteinrsquos inequalityNow we can do it The corollary of matrix Bernsteinrsquos inequality gives
Eε
983056983056983056983131
ilejXij
983056983056983056 ≲ σ983155
log n+K log n
where σ2 = 983042983123
ilej EεX2ij983042 and K = maxilej 983042Xij983042 A good exercise is
to check that
σ ≲ maxi
983042Ai9830422 and K ≲ maxi
983042Ai9830422
Then we have
Eε
983056983056983056983131
i983249jXij
983056983056983056 ≲ log n middotmaxi
983042Ai9830422
Finally we unfix A by taking expectation of both sides of thisinequality with respect to A and using the law of total expectation
Covariance estimation and MC DAMC Lecture 14 May 27 - 29 2020 13 22
We state Theorem 2 for symmetric matrices but it is simple toextend it to general mtimes n random matrices A The bound in thiscase becomes
E983042A9830422 le C log(m+ n) middot (Emaxi
983042Ai9830422 + Emaxj
983042Aj9830422)
To see this apply Theorem 2 to the (m+ n)times (m+ n) symmetricrandom matrix 983063
0 AAT 0
983064
3 Matrix completion
Consider a fixed unknown ntimes n matrix X Suppose we are shownm randomly chosen entries of X Can we guess all the missingentries This important problem is called matrix completion Wewill analyze it using the bounds on the norms on random matriceswe just obtained
Covariance estimation and MC DAMC Lecture 14 May 27 - 29 2020 14 22
Obviously there is no way to guess the missing entries unless weknow something extra about the matrix X So let us assume thatX has low rank
rank(X) = r ≪ n
The number of degrees of freedom of an ntimes n matrix with rank ris O(rn) (Why) So we may hope that m sim rn observed entriesof X will be enough to determine X completely But how
Here we will analyze what is probably the simplest method formatrix completion Take the matrix Y that consists of theobserved entries of X while all unobserved entries are set to zeroUnlike X the matrix Y may not have small rank Compute thebest rank r approximation of Y The result as we will show willbe a good approximation to X
Covariance estimation and MC DAMC Lecture 14 May 27 - 29 2020 15 22
But before we show this let us define sampling of entries morerigorously Assume each entry of X is shown or hiddenindependently of others with fixed probability p Which entries areshown is decided by independent Bernoulli random variables
δij sim Ber(p) with p =m
n2
which are often called selectors in this context The value of p ischosen so that among n2 entries of X the expected number ofselected (known) entries is m
Define the ntimes n matrix Y with entries Yij = δijXij We canassume that we are shown Y for it is a matrix that contains theobserved entries of X while all unobserved entries are replacedwith zeros The following result shows how to estimate X basedon Y
Covariance estimation and MC DAMC Lecture 14 May 27 - 29 2020 16 22
Theorem 4 (Matrix completion)
Let 983142X be a best rank r approximation to pminus1Y Then
E1
n983042983142X minusX983042F 983249 C logn
983157rn
m983042X983042max
Here 983042X983042max = maxij |Xij | denotes the maximum magnitude of theentries of X
Remark This theorem controls the average error per entry in themean-squared sense To make the error small let us assume that wehave a sample of size m ≫ rn log2 n which is slightly larger than theideal size m sim rn This makes C log n
983155rnm = o(1) and forces the
recovery error to be bounded by o(1)983042X983042max Summarizing Theorem4 says that the expected average error per entry is much smaller thanthe maximal magnitude of the entries of X This is true for a sampleof almost optimal size m The smaller the rank r of the matrix X thefewer entries of X we need to see in order to do matrix completion
Covariance estimation and MC DAMC Lecture 14 May 27 - 29 2020 17 22
Proof of Theorem 4Step 1 The error in the operator norm Let us first bound therecovery error in the operator norm Decompose the error into twoparts using triangle inequality
983042983142X minusX983042 le 983042983142X minus pminus1Y 983042+ 983042pminus1Y minusX983042
Recall that 983142X is a best approximation to pminus1Y Then the first part ofthe error is smaller than the second part and we have
983042983142X minusX983042 le 2983042pminus1Y minusX983042 =2
p983042Y minus pX983042
The entries of the matrix Y minus pX
(Y minus pX)ij = (δij minus p)Xij
are independent and mean zero random variables
Covariance estimation and MC DAMC Lecture 14 May 27 - 29 2020 18 22
We have
E983042Y minus pX983042
983249 C log n middot983061Emax
i983042(Y minus pX)i9830422 + Emax
j983042(Y minus pX)j9830422
983062
All that remains is to bound the norms of the rows and columns ofY minus pX This is not difficult if we note that they can be expressed assums of independent random variables
983042(Y minus pX)i98304222 =n983131
j=1
(δij minus p)2X2ij 983249
n983131
j=1
(δij minus p)2 middot 983042X9830422max
and similarly for columns Taking expectation and noting that
E (δij minus p)2 = Var (δij) = p(1minus p)
we get
E 983042(Y minus pX)i9830422 983249983059E 983042(Y minus pX)i98304222
98306012983249 radic
pn983042X983042max
Covariance estimation and MC DAMC Lecture 14 May 27 - 29 2020 19 22
This is a good bound but we need something stronger Since themaximum appears inside the expectation we need a uniform boundwhich will say that all rows are bounded simultaneously with highprobability Such uniform bounds are usually proved by applyingconcentration inequalities followed by a union bound Bernsteinsinequality (145) yields (check)
P
983099983103
983101
n983131
j=1
(δij minus p)2 gt tpn
983100983104
983102 983249 exp(minusctpn) for t 983245 3
This probability can be further bounded by nminusct using the assumptionthat m = pn2 ge n log n A union bound over n rows leads to
P
983099983103
983101maxiisin[n]
n983131
j=1
(δij minus p)2 gt tpn
983100983104
983102 983249 n middot nminusct for t 983245 3
Covariance estimation and MC DAMC Lecture 14 May 27 - 29 2020 20 22
Integrating this tail we have
Emaxiisin[n]
n983131
j=1
(δij minus p)2 ≲ pn
And this yields the desired bound on the rows
Emaxiisin[n]
983042(Y minus pX)i9830422 ≲radicpn983042X983042max
We can do similarly for the columns Then
E983042Y minus pX983042 ≲ log nradicpn983042X983042max
Therefore we get
E983042983142X minusX983042 ≲ log n
983157n
p983042X983042max
Covariance estimation and MC DAMC Lecture 14 May 27 - 29 2020 21 22
Step 2 Passing to Frobenius normWe know that rank(X) le r by assumption and rank(983141X) le r by
construction so rank(983142X minusX) le 2r There is a simple relationshipbetween the operator and Frobenius norms
983042983142X minusX983042F leradic2r983042983142X minusX983042
Taking expectation of both sides we get
E983042983142X minusX983042F 983249radic2rE983042983142X minusX983042 ≲ log n
983157rn
p983042X983042max
Dividing both sides by n we can rewrite this bound as
E1
n983042983142X minusX983042F ≲ log n
983157rn
pn2983042X983042max
The proof is completed by noting the definition of the samplingprobability p = mn2
Covariance estimation and MC DAMC Lecture 14 May 27 - 29 2020 22 22
11 Low-dimensional distributions
Far fewer samples are needed for covariance estimation forlow-dimensional or approximately low-dimensional distributionsTo measure approximate low-dimensionality we can use the notionof the stable rank of Σ2 The stable rank of a matrix A is definedas the square of the ratio of the Frobenius to operator norms
r(A) =983042A9830422F983042A98304222
le rank(A)
The proof of Theorem 1 yields
E 983042ΣN minus Σ983042 983249 983042Σ983042983075983157
r log n
N+
r log n
N
983076
where r = r(Σ12) = tr(Σ)983042Σ983042 Therefore covariance estimationis possible with N sim r log n samples
Covariance estimation and MC DAMC Lecture 14 May 27 - 29 2020 8 22
2 Norms of random matrices
Let Ai denote the ith row of A we have (exercise)
maxi
983042Ai9830422 le 983042A9830422 leradicnmax
i983042Ai9830422
For random matrices with independent entries the bound can beimproved to the point where the upper and lower bounds almostmatch
Theorem 2 (Norms of random matrices without boundednessassumptions)
Let A be an ntimes n symmetric random matrix whose entries on andabove the diagonal are independent mean zero random variables Then
Emaxi
983042Ai9830422 le E983042A9830422 le C log n middot Emaxi
983042Ai9830422
where Ai denote the rows of A
Lemma 3 (Symmetrization)
Let $X_1, \dots, X_N$ be independent, mean zero random vectors in a normed space, and $\varepsilon_1, \dots, \varepsilon_N$ be independent Rademacher random variables. Then
$$\frac{1}{2}\, \mathbb{E}\Big\| \sum_{i=1}^N \varepsilon_i X_i \Big\| \le \mathbb{E}\Big\| \sum_{i=1}^N X_i \Big\| \le 2\, \mathbb{E}\Big\| \sum_{i=1}^N \varepsilon_i X_i \Big\|.$$
Proof. To prove the upper bound, let $(X'_i)$ be an independent copy of the random vectors $(X_i)$, i.e. just different random vectors with the same joint distribution as $(X_i)$ and independent from $(X_i)$. Then
$$\mathbb{E}\Big\|\sum_i X_i\Big\| = \mathbb{E}\Big\|\sum_i X_i - \mathbb{E}\Big(\sum_i X'_i\Big)\Big\| \le \mathbb{E}\Big\|\sum_i X_i - \sum_i X'_i\Big\| = \mathbb{E}\Big\|\sum_i \big(X_i - X'_i\big)\Big\|,$$

where the first equality holds since $\mathbb{E}\sum_i X'_i = 0$, and the inequality follows from Jensen's inequality applied to the expectation over $(X'_i)$.
The distribution of the random vectors $Y_i = X_i - X'_i$ is symmetric, which means that the distributions of $Y_i$ and $-Y_i$ are the same. (Why?) Thus the distributions of the random vectors $Y_i$ and $\varepsilon_i Y_i$ are also the same, for all we do is change the signs of these vectors at random and independently of the values of the vectors. Summarizing, we can replace $X_i - X'_i$ in the sum above with $\varepsilon_i(X_i - X'_i)$. Thus
$$\mathbb{E}\Big\|\sum_i X_i\Big\| \le \mathbb{E}\Big\|\sum_i \varepsilon_i (X_i - X'_i)\Big\| \le \mathbb{E}\Big\|\sum_i \varepsilon_i X_i\Big\| + \mathbb{E}\Big\|\sum_i \varepsilon_i X'_i\Big\| = 2\, \mathbb{E}\Big\|\sum_i \varepsilon_i X_i\Big\|.$$
This proves the upper bound in the symmetrization inequality. The lower bound can be proved by a similar argument. (Do this!)
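The two-sided symmetrization inequality of Lemma 3 can be checked by simulation. A sketch assuming NumPy; the asymmetric mean-zero distribution $\mathrm{Exp}(1) - 1$ is an arbitrary choice that makes the comparison non-trivial (for a symmetric distribution both sides would be identical in law).

```python
import numpy as np

rng = np.random.default_rng(2)

N, d, trials = 30, 10, 2000
norm_plain = np.empty(trials)
norm_sym = np.empty(trials)
for t in range(trials):
    # Independent mean-zero vectors with an asymmetric distribution.
    X = rng.exponential(1.0, size=(N, d)) - 1.0
    eps = rng.choice([-1.0, 1.0], size=(N, 1))  # Rademacher signs
    norm_plain[t] = np.linalg.norm(X.sum(axis=0))
    norm_sym[t] = np.linalg.norm((eps * X).sum(axis=0))

E_plain = norm_plain.mean()  # estimates E || sum_i X_i ||
E_sym = norm_sym.mean()      # estimates E || sum_i eps_i X_i ||
# Lemma 3: 0.5 * E_sym <= E_plain <= 2 * E_sym (up to Monte Carlo error).
```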
Proof of Theorem 2. The lower bound is trivial. The proof of the upper bound will be based on matrix Bernstein's inequality. We represent $A$ as a sum of independent, mean zero, symmetric random matrices $Z_{ij}$, each of which contains a pair of symmetric entries of $A$ (or one diagonal entry):

$$A = \sum_{i \le j} Z_{ij}.$$
By the symmetrization inequality (Lemma 3), for the random matrices $Z_{ij}$ we get

$$\mathbb{E}\|A\| = \mathbb{E}\Big\|\sum_{i \le j} Z_{ij}\Big\| \le 2\, \mathbb{E}\Big\|\sum_{i \le j} X_{ij}\Big\|,$$

where we set $X_{ij} = \varepsilon_{ij} Z_{ij}$, and $\varepsilon_{ij}$ are independent Rademacher random variables. Now we condition on $A$. The random variables $Z_{ij}$ become fixed values, and all randomness remains in the Rademacher random variables $\varepsilon_{ij}$.
Note that the $X_{ij}$ are (conditionally) bounded almost surely, and this is exactly what we have lacked to apply matrix Bernstein's inequality. Now we can do it. The corollary of matrix Bernstein's inequality gives

$$\mathbb{E}_\varepsilon \Big\|\sum_{i \le j} X_{ij}\Big\| \lesssim \sigma \sqrt{\log n} + K \log n,$$

where $\sigma^2 = \big\|\sum_{i \le j} \mathbb{E}_\varepsilon X_{ij}^2\big\|$ and $K = \max_{i \le j} \|X_{ij}\|$. A good exercise is to check that

$$\sigma \lesssim \max_i \|A_i\|_2 \quad \text{and} \quad K \lesssim \max_i \|A_i\|_2.$$
Then we have

$$\mathbb{E}_\varepsilon \Big\|\sum_{i \le j} X_{ij}\Big\| \lesssim \log n \cdot \max_i \|A_i\|_2.$$
Finally, we unfix $A$ by taking expectation of both sides of this inequality with respect to $A$ and using the law of total expectation.
We state Theorem 2 for symmetric matrices, but it is simple to extend it to general $m \times n$ random matrices $A$. The bound in this case becomes

$$\mathbb{E}\|A\| \le C \log(m+n) \cdot \Big( \mathbb{E}\max_i \|A_i\|_2 + \mathbb{E}\max_j \|A^j\|_2 \Big),$$

where $A_i$ and $A^j$ denote the rows and columns of $A$, respectively. To see this, apply Theorem 2 to the $(m+n) \times (m+n)$ symmetric random matrix

$$\begin{bmatrix} 0 & A \\ A^T & 0 \end{bmatrix}.$$
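The reduction to the symmetric case can be verified directly (a sketch assuming NumPy, with arbitrary dimensions): the symmetric dilation $\begin{bmatrix} 0 & A \\ A^T & 0 \end{bmatrix}$ has operator norm exactly $\|A\|$, since its eigenvalues are plus and minus the singular values of $A$.

```python
import numpy as np

rng = np.random.default_rng(3)

m, n = 40, 60
A = rng.standard_normal((m, n))

# The (m+n) x (m+n) symmetric dilation [[0, A], [A^T, 0]].
H = np.block([[np.zeros((m, m)), A], [A.T, np.zeros((n, n))]])

# Its operator norm equals ||A||, which is how Theorem 2 extends to rectangular A.
same_norm = np.isclose(np.linalg.norm(H, 2), np.linalg.norm(A, 2))
```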
3 Matrix completion
Consider a fixed, unknown $n \times n$ matrix $X$. Suppose we are shown $m$ randomly chosen entries of $X$. Can we guess all the missing entries? This important problem is called matrix completion. We will analyze it using the bounds on the norms of random matrices we just obtained.
Obviously, there is no way to guess the missing entries unless we know something extra about the matrix $X$. So let us assume that $X$ has low rank:

$$\operatorname{rank}(X) = r \ll n.$$

The number of degrees of freedom of an $n \times n$ matrix with rank $r$ is $O(rn)$. (Why?) So we may hope that $m \sim rn$ observed entries of $X$ will be enough to determine $X$ completely. But how?
Here we will analyze what is probably the simplest method for matrix completion. Take the matrix $Y$ that consists of the observed entries of $X$, while all unobserved entries are set to zero. Unlike $X$, the matrix $Y$ may not have small rank. Compute the best rank-$r$ approximation of $Y$. The result, as we will show, will be a good approximation to $X$.
But before we show this, let us define sampling of entries more rigorously. Assume each entry of $X$ is shown or hidden independently of the others, with fixed probability $p$. Which entries are shown is decided by independent Bernoulli random variables

$$\delta_{ij} \sim \mathrm{Ber}(p) \quad \text{with} \quad p = \frac{m}{n^2},$$

which are often called selectors in this context. The value of $p$ is chosen so that, among the $n^2$ entries of $X$, the expected number of selected (known) entries is $m$.

Define the $n \times n$ matrix $Y$ with entries $Y_{ij} = \delta_{ij} X_{ij}$. We can assume that we are shown $Y$, for it is a matrix that contains the observed entries of $X$ while all unobserved entries are replaced with zeros. The following result shows how to estimate $X$ based on $Y$.
Theorem 4 (Matrix completion)
Let $\widehat{X}$ be a best rank-$r$ approximation to $p^{-1} Y$. Then

$$\mathbb{E}\, \frac{1}{n} \|\widehat{X} - X\|_F \le C \log n \sqrt{\frac{rn}{m}}\, \|X\|_{\max}.$$

Here $\|X\|_{\max} = \max_{i,j} |X_{ij}|$ denotes the maximum magnitude of the entries of $X$.
Remark. This theorem controls the average error per entry in the mean-squared sense. To make the error small, let us assume that we have a sample of size $m \gg rn \log^2 n$, which is slightly larger than the ideal size $m \sim rn$. This makes $C \log n \sqrt{rn/m} = o(1)$ and forces the recovery error to be bounded by $o(1) \|X\|_{\max}$. Summarizing, Theorem 4 says that the expected average error per entry is much smaller than the maximal magnitude of the entries of $X$. This is true for a sample of almost optimal size $m$. The smaller the rank $r$ of the matrix $X$, the fewer entries of $X$ we need to see in order to do matrix completion.
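The whole procedure fits in a few lines. The following is an illustrative sketch, not the lecture's code: it assumes NumPy, and the choices $n = 200$, $r = 5$, $p = 0.3$ are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(4)

n, r = 200, 5
# Unknown rank-r matrix X = U V^T.
U = rng.standard_normal((n, r))
V = rng.standard_normal((n, r))
X = U @ V.T

# Reveal each entry independently with probability p, so m = p n^2 on average:
# here ~12000 observed entries versus ~2rn = 2000 degrees of freedom.
p = 0.3
delta = rng.random((n, n)) < p
Y = np.where(delta, X, 0.0)

# Estimator: best rank-r approximation of p^{-1} Y, via truncated SVD.
Uy, s, Vty = np.linalg.svd(Y / p)
X_hat = (Uy[:, :r] * s[:r]) @ Vty[:r, :]

# Average per-entry error, compared with the maximal entry magnitude.
per_entry_err = np.linalg.norm(X_hat - X, "fro") / n
rel = per_entry_err / np.abs(X).max()
```

As Theorem 4 predicts, the average per-entry error comes out well below $\|X\|_{\max}$ even though most entries were never observed.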
Proof of Theorem 4. Step 1: the error in the operator norm. Let us first bound the recovery error in the operator norm. Decompose the error into two parts using the triangle inequality:

$$\|\widehat{X} - X\| \le \|\widehat{X} - p^{-1} Y\| + \|p^{-1} Y - X\|.$$

Recall that $\widehat{X}$ is a best rank-$r$ approximation to $p^{-1} Y$. Since $X$ itself has rank at most $r$, the first part of the error is smaller than the second part, and we have

$$\|\widehat{X} - X\| \le 2 \|p^{-1} Y - X\| = \frac{2}{p} \|Y - pX\|.$$
The entries of the matrix $Y - pX$,

$$(Y - pX)_{ij} = (\delta_{ij} - p) X_{ij},$$

are independent, mean zero random variables.
By the version of Theorem 2 for general (non-symmetric) matrices, we have

$$\mathbb{E}\|Y - pX\| \le C \log n \cdot \Big( \mathbb{E}\max_i \|(Y - pX)_i\|_2 + \mathbb{E}\max_j \|(Y - pX)^j\|_2 \Big).$$
All that remains is to bound the norms of the rows and columns of $Y - pX$. This is not difficult if we note that they can be expressed as sums of independent random variables:

$$\|(Y - pX)_i\|_2^2 = \sum_{j=1}^n (\delta_{ij} - p)^2 X_{ij}^2 \le \sum_{j=1}^n (\delta_{ij} - p)^2 \cdot \|X\|_{\max}^2,$$
and similarly for columns. Taking expectation and noting that

$$\mathbb{E}(\delta_{ij} - p)^2 = \operatorname{Var}(\delta_{ij}) = p(1 - p),$$

we get

$$\mathbb{E}\|(Y - pX)_i\|_2 \le \Big( \mathbb{E}\|(Y - pX)_i\|_2^2 \Big)^{1/2} \le \sqrt{pn}\, \|X\|_{\max}.$$
This is a good bound, but we need something stronger. Since the maximum appears inside the expectation, we need a uniform bound, which will say that all rows are bounded simultaneously with high probability. Such uniform bounds are usually proved by applying concentration inequalities followed by a union bound. Bernstein's inequality (14.5) yields (check):

$$\mathbb{P}\Big\{ \sum_{j=1}^n (\delta_{ij} - p)^2 > tpn \Big\} \le \exp(-ctpn) \quad \text{for } t \ge 3.$$
This probability can be further bounded by $n^{-ct}$ using the assumption that $m = pn^2 \ge n \log n$. A union bound over the $n$ rows leads to

$$\mathbb{P}\Big\{ \max_{i \in [n]} \sum_{j=1}^n (\delta_{ij} - p)^2 > tpn \Big\} \le n \cdot n^{-ct} \quad \text{for } t \ge 3.$$
Integrating this tail, we have

$$\mathbb{E} \max_{i \in [n]} \sum_{j=1}^n (\delta_{ij} - p)^2 \lesssim pn.$$
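The uniform bound on the row sums is easy to see numerically. A sketch assuming NumPy, with arbitrary $n$ and $p$:

```python
import numpy as np

rng = np.random.default_rng(5)

n, p = 500, 0.05
delta = (rng.random((n, n)) < p).astype(float)  # Bernoulli(p) selectors

# Row sums sum_j (delta_ij - p)^2; each has mean n p (1 - p) <= p n.
row_sums = ((delta - p) ** 2).sum(axis=1)

# Even the worst of the n rows stays within a small constant factor of p n,
# exactly as the tail bound and union bound predict.
ratio = row_sums.max() / (p * n)
```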
And this yields the desired bound on the rows:

$$\mathbb{E} \max_{i \in [n]} \|(Y - pX)_i\|_2 \lesssim \sqrt{pn}\, \|X\|_{\max}.$$

We can do similarly for the columns. Then

$$\mathbb{E}\|Y - pX\| \lesssim \log n \sqrt{pn}\, \|X\|_{\max}.$$
Therefore we get

$$\mathbb{E}\|\widehat{X} - X\| \lesssim \log n \sqrt{\frac{n}{p}}\, \|X\|_{\max}.$$
Step 2: passing to the Frobenius norm. We know that $\operatorname{rank}(X) \le r$ by assumption and $\operatorname{rank}(\widehat{X}) \le r$ by construction, so $\operatorname{rank}(\widehat{X} - X) \le 2r$. There is a simple relationship between the operator and Frobenius norms:

$$\|\widehat{X} - X\|_F \le \sqrt{2r}\, \|\widehat{X} - X\|.$$
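This norm comparison, $\|M\|_F \le \sqrt{\operatorname{rank}(M)}\, \|M\|$, holds because $\|M\|_F^2$ is the sum of at most $\operatorname{rank}(M)$ squared singular values, each at most $\|M\|^2$. A one-line check (sketch, assumes NumPy, arbitrary sizes):

```python
import numpy as np

rng = np.random.default_rng(6)

n, k = 100, 8
# A matrix of rank at most k, as a product of thin factors.
M = rng.standard_normal((n, k)) @ rng.standard_normal((k, n))

fro = np.linalg.norm(M, "fro")  # ||M||_F
op = np.linalg.norm(M, 2)       # ||M||
holds = fro <= np.sqrt(k) * op + 1e-8
```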
Taking expectation of both sides, we get

$$\mathbb{E}\|\widehat{X} - X\|_F \le \sqrt{2r}\, \mathbb{E}\|\widehat{X} - X\| \lesssim \log n \sqrt{\frac{rn}{p}}\, \|X\|_{\max}.$$
Dividing both sides by $n$, we can rewrite this bound as

$$\mathbb{E}\, \frac{1}{n} \|\widehat{X} - X\|_F \lesssim \log n \sqrt{\frac{rn}{pn^2}}\, \|X\|_{\max}.$$

The proof is completed by noting the definition of the sampling probability, $p = m/n^2$.
Covariance estimation and MC DAMC Lecture 14 May 27 - 29 2020 22 22
2 Norms of random matrices
Let Ai denote the ith row of A we have (exercise)
maxi
983042Ai9830422 le 983042A9830422 leradicnmax
i983042Ai9830422
For random matrices with independent entries the bound can beimproved to the point where the upper and lower bounds almostmatch
Theorem 2 (Norms of random matrices without boundednessassumptions)
Let A be an ntimes n symmetric random matrix whose entries on andabove the diagonal are independent mean zero random variables Then
Emaxi
983042Ai9830422 le E983042A9830422 le C log n middot Emaxi
983042Ai9830422
where Ai denote the rows of A
Covariance estimation and MC DAMC Lecture 14 May 27 - 29 2020 9 22
Lemma 3 (Symmetrization)
Let X1 XN be independent mean zero random vectors in a normedspace and ε1 εN be independent Rademacher random variablesThen
1
2E
983056983056983056983056983056
N983131
i=1
εiXi
983056983056983056983056983056 983249 E
983056983056983056983056983056
N983131
i=1
Xi
983056983056983056983056983056 983249 2E
983056983056983056983056983056
N983131
i=1
εiXi
983056983056983056983056983056
Proof To prove the upper bound let (X primei) be an independent copy of
the random vectors (Xi) ie just different random vectors with thesame joint distribution as (Xi) and independent from (Xi) Then
E
983056983056983056983056983056983131
i
Xi
983056983056983056983056983056 = E
983056983056983056983056983056983131
i
Xi minus E
983075983131
i
X primei
983076983056983056983056983056983056
983249 E
983056983056983056983056983056983131
i
Xi minus983131
i
X primei
983056983056983056983056983056 = E
983056983056983056983056983056983131
i
983043Xi minusX prime
i
983044983056983056983056983056983056
Covariance estimation and MC DAMC Lecture 14 May 27 - 29 2020 10 22
The distribution of the random vectors Yi = Xi minusX primei is symmetric
which means that the distributions of Yi and minusYi are the same(Why) Thus the distribution of the random vectors Yi and εiYi is alsothe same for all we do is change the signs of these vectors at randomand independently of the values of the vectors Summarizing we canreplace Xi minusX prime
i in the sum above with εi(Xi minusX primei) Thus
E
983056983056983056983056983056983131
i
Xi
983056983056983056983056983056 le E
983056983056983056983056983056983131
i
εi(Xi minusX primei)
983056983056983056983056983056
le E
983056983056983056983056983056983131
i
εiXi
983056983056983056983056983056+ E
983056983056983056983056983056983131
i
εiXprimei
983056983056983056983056983056
= 2E
983056983056983056983056983056983131
i
εiXi
983056983056983056983056983056
This proves the upper bound in the symmetrization inequality Thelower bound can be proved by a similar argument (Do this)
Covariance estimation and MC DAMC Lecture 14 May 27 - 29 2020 11 22
Proof of Theorem 2 The lower bound is trivial The proof of the upperbound will be based on matrix Bernsteinrsquos inequalityWe represent A as a sum of independent mean zero symmetricrandom matrices Zij each of which contains a pair of symmetric entriesof A (or one diagonal entry)
A =983131
ilejZij
By the symmetrization inequality (Lemma 3) for the random matricesZij we get
E983042A983042 = E983056983056983056983131
i983249jZij
983056983056983056 983249 2E983056983056983056983131
i983249jXij
983056983056983056
where we set Xij = εijZij and εij are independent Rademacherrandom variables Now we condition on A The random variables Zij
become fixed values and all randomness remains in the Rademacherrandom variables εij
Covariance estimation and MC DAMC Lecture 14 May 27 - 29 2020 12 22
Note that Xij are (conditionally) bounded almost surely and this isexactly what we have lacked to apply matrix Bernsteinrsquos inequalityNow we can do it The corollary of matrix Bernsteinrsquos inequality gives
Eε
983056983056983056983131
ilejXij
983056983056983056 ≲ σ983155
log n+K log n
where σ2 = 983042983123
ilej EεX2ij983042 and K = maxilej 983042Xij983042 A good exercise is
to check that
σ ≲ maxi
983042Ai9830422 and K ≲ maxi
983042Ai9830422
Then we have
Eε
983056983056983056983131
i983249jXij
983056983056983056 ≲ log n middotmaxi
983042Ai9830422
Finally we unfix A by taking expectation of both sides of thisinequality with respect to A and using the law of total expectation
Covariance estimation and MC DAMC Lecture 14 May 27 - 29 2020 13 22
We state Theorem 2 for symmetric matrices but it is simple toextend it to general mtimes n random matrices A The bound in thiscase becomes
E983042A9830422 le C log(m+ n) middot (Emaxi
983042Ai9830422 + Emaxj
983042Aj9830422)
To see this apply Theorem 2 to the (m+ n)times (m+ n) symmetricrandom matrix 983063
0 AAT 0
983064
3 Matrix completion
Consider a fixed unknown ntimes n matrix X Suppose we are shownm randomly chosen entries of X Can we guess all the missingentries This important problem is called matrix completion Wewill analyze it using the bounds on the norms on random matriceswe just obtained
Covariance estimation and MC DAMC Lecture 14 May 27 - 29 2020 14 22
Obviously there is no way to guess the missing entries unless weknow something extra about the matrix X So let us assume thatX has low rank
rank(X) = r ≪ n
The number of degrees of freedom of an ntimes n matrix with rank ris O(rn) (Why) So we may hope that m sim rn observed entriesof X will be enough to determine X completely But how
Here we will analyze what is probably the simplest method formatrix completion Take the matrix Y that consists of theobserved entries of X while all unobserved entries are set to zeroUnlike X the matrix Y may not have small rank Compute thebest rank r approximation of Y The result as we will show willbe a good approximation to X
Covariance estimation and MC DAMC Lecture 14 May 27 - 29 2020 15 22
But before we show this let us define sampling of entries morerigorously Assume each entry of X is shown or hiddenindependently of others with fixed probability p Which entries areshown is decided by independent Bernoulli random variables
δij sim Ber(p) with p =m
n2
which are often called selectors in this context The value of p ischosen so that among n2 entries of X the expected number ofselected (known) entries is m
Define the ntimes n matrix Y with entries Yij = δijXij We canassume that we are shown Y for it is a matrix that contains theobserved entries of X while all unobserved entries are replacedwith zeros The following result shows how to estimate X basedon Y
Covariance estimation and MC DAMC Lecture 14 May 27 - 29 2020 16 22
Theorem 4 (Matrix completion)
Let 983142X be a best rank r approximation to pminus1Y Then
E1
n983042983142X minusX983042F 983249 C logn
983157rn
m983042X983042max
Here 983042X983042max = maxij |Xij | denotes the maximum magnitude of theentries of X
Remark This theorem controls the average error per entry in themean-squared sense To make the error small let us assume that wehave a sample of size m ≫ rn log2 n which is slightly larger than theideal size m sim rn This makes C log n
983155rnm = o(1) and forces the
recovery error to be bounded by o(1)983042X983042max Summarizing Theorem4 says that the expected average error per entry is much smaller thanthe maximal magnitude of the entries of X This is true for a sampleof almost optimal size m The smaller the rank r of the matrix X thefewer entries of X we need to see in order to do matrix completion
Covariance estimation and MC DAMC Lecture 14 May 27 - 29 2020 17 22
Proof of Theorem 4Step 1 The error in the operator norm Let us first bound therecovery error in the operator norm Decompose the error into twoparts using triangle inequality
983042983142X minusX983042 le 983042983142X minus pminus1Y 983042+ 983042pminus1Y minusX983042
Recall that 983142X is a best approximation to pminus1Y Then the first part ofthe error is smaller than the second part and we have
983042983142X minusX983042 le 2983042pminus1Y minusX983042 =2
p983042Y minus pX983042
The entries of the matrix Y minus pX
(Y minus pX)ij = (δij minus p)Xij
are independent and mean zero random variables
Covariance estimation and MC DAMC Lecture 14 May 27 - 29 2020 18 22
We have
E983042Y minus pX983042
983249 C log n middot983061Emax
i983042(Y minus pX)i9830422 + Emax
j983042(Y minus pX)j9830422
983062
All that remains is to bound the norms of the rows and columns ofY minus pX This is not difficult if we note that they can be expressed assums of independent random variables
983042(Y minus pX)i98304222 =n983131
j=1
(δij minus p)2X2ij 983249
n983131
j=1
(δij minus p)2 middot 983042X9830422max
and similarly for columns Taking expectation and noting that
E (δij minus p)2 = Var (δij) = p(1minus p)
we get
E 983042(Y minus pX)i9830422 983249983059E 983042(Y minus pX)i98304222
98306012983249 radic
pn983042X983042max
Covariance estimation and MC DAMC Lecture 14 May 27 - 29 2020 19 22
This is a good bound but we need something stronger Since themaximum appears inside the expectation we need a uniform boundwhich will say that all rows are bounded simultaneously with highprobability Such uniform bounds are usually proved by applyingconcentration inequalities followed by a union bound Bernsteinsinequality (145) yields (check)
P
983099983103
983101
n983131
j=1
(δij minus p)2 gt tpn
983100983104
983102 983249 exp(minusctpn) for t 983245 3
This probability can be further bounded by nminusct using the assumptionthat m = pn2 ge n log n A union bound over n rows leads to
P
983099983103
983101maxiisin[n]
n983131
j=1
(δij minus p)2 gt tpn
983100983104
983102 983249 n middot nminusct for t 983245 3
Covariance estimation and MC DAMC Lecture 14 May 27 - 29 2020 20 22
Integrating this tail we have
Emaxiisin[n]
n983131
j=1
(δij minus p)2 ≲ pn
And this yields the desired bound on the rows
Emaxiisin[n]
983042(Y minus pX)i9830422 ≲radicpn983042X983042max
We can do similarly for the columns Then
E983042Y minus pX983042 ≲ log nradicpn983042X983042max
Therefore we get
E983042983142X minusX983042 ≲ log n
983157n
p983042X983042max
Covariance estimation and MC DAMC Lecture 14 May 27 - 29 2020 21 22
Step 2 Passing to Frobenius normWe know that rank(X) le r by assumption and rank(983141X) le r by
construction so rank(983142X minusX) le 2r There is a simple relationshipbetween the operator and Frobenius norms
983042983142X minusX983042F leradic2r983042983142X minusX983042
Taking expectation of both sides we get
E983042983142X minusX983042F 983249radic2rE983042983142X minusX983042 ≲ log n
983157rn
p983042X983042max
Dividing both sides by n we can rewrite this bound as
E1
n983042983142X minusX983042F ≲ log n
983157rn
pn2983042X983042max
The proof is completed by noting the definition of the samplingprobability p = mn2
Covariance estimation and MC DAMC Lecture 14 May 27 - 29 2020 22 22
Lemma 3 (Symmetrization)
Let X1 XN be independent mean zero random vectors in a normedspace and ε1 εN be independent Rademacher random variablesThen
1
2E
983056983056983056983056983056
N983131
i=1
εiXi
983056983056983056983056983056 983249 E
983056983056983056983056983056
N983131
i=1
Xi
983056983056983056983056983056 983249 2E
983056983056983056983056983056
N983131
i=1
εiXi
983056983056983056983056983056
Proof To prove the upper bound let (X primei) be an independent copy of
the random vectors (Xi) ie just different random vectors with thesame joint distribution as (Xi) and independent from (Xi) Then
E
983056983056983056983056983056983131
i
Xi
983056983056983056983056983056 = E
983056983056983056983056983056983131
i
Xi minus E
983075983131
i
X primei
983076983056983056983056983056983056
983249 E
983056983056983056983056983056983131
i
Xi minus983131
i
X primei
983056983056983056983056983056 = E
983056983056983056983056983056983131
i
983043Xi minusX prime
i
983044983056983056983056983056983056
Covariance estimation and MC DAMC Lecture 14 May 27 - 29 2020 10 22
The distribution of the random vectors Yi = Xi minusX primei is symmetric
which means that the distributions of Yi and minusYi are the same(Why) Thus the distribution of the random vectors Yi and εiYi is alsothe same for all we do is change the signs of these vectors at randomand independently of the values of the vectors Summarizing we canreplace Xi minusX prime
i in the sum above with εi(Xi minusX primei) Thus
E
983056983056983056983056983056983131
i
Xi
983056983056983056983056983056 le E
983056983056983056983056983056983131
i
εi(Xi minusX primei)
983056983056983056983056983056
le E
983056983056983056983056983056983131
i
εiXi
983056983056983056983056983056+ E
983056983056983056983056983056983131
i
εiXprimei
983056983056983056983056983056
= 2E
983056983056983056983056983056983131
i
εiXi
983056983056983056983056983056
This proves the upper bound in the symmetrization inequality Thelower bound can be proved by a similar argument (Do this)
Covariance estimation and MC DAMC Lecture 14 May 27 - 29 2020 11 22
Proof of Theorem 2 The lower bound is trivial The proof of the upperbound will be based on matrix Bernsteinrsquos inequalityWe represent A as a sum of independent mean zero symmetricrandom matrices Zij each of which contains a pair of symmetric entriesof A (or one diagonal entry)
A =983131
ilejZij
By the symmetrization inequality (Lemma 3) for the random matricesZij we get
E983042A983042 = E983056983056983056983131
i983249jZij
983056983056983056 983249 2E983056983056983056983131
i983249jXij
983056983056983056
where we set Xij = εijZij and εij are independent Rademacherrandom variables Now we condition on A The random variables Zij
become fixed values and all randomness remains in the Rademacherrandom variables εij
Covariance estimation and MC DAMC Lecture 14 May 27 - 29 2020 12 22
Note that Xij are (conditionally) bounded almost surely and this isexactly what we have lacked to apply matrix Bernsteinrsquos inequalityNow we can do it The corollary of matrix Bernsteinrsquos inequality gives
Eε
983056983056983056983131
ilejXij
983056983056983056 ≲ σ983155
log n+K log n
where σ2 = 983042983123
ilej EεX2ij983042 and K = maxilej 983042Xij983042 A good exercise is
to check that
σ ≲ maxi
983042Ai9830422 and K ≲ maxi
983042Ai9830422
Then we have
Eε
983056983056983056983131
i983249jXij
983056983056983056 ≲ log n middotmaxi
983042Ai9830422
Finally we unfix A by taking expectation of both sides of thisinequality with respect to A and using the law of total expectation
Covariance estimation and MC DAMC Lecture 14 May 27 - 29 2020 13 22
We state Theorem 2 for symmetric matrices but it is simple toextend it to general mtimes n random matrices A The bound in thiscase becomes
E983042A9830422 le C log(m+ n) middot (Emaxi
983042Ai9830422 + Emaxj
983042Aj9830422)
To see this apply Theorem 2 to the (m+ n)times (m+ n) symmetricrandom matrix 983063
0 AAT 0
983064
3 Matrix completion
Consider a fixed unknown ntimes n matrix X Suppose we are shownm randomly chosen entries of X Can we guess all the missingentries This important problem is called matrix completion Wewill analyze it using the bounds on the norms on random matriceswe just obtained
Covariance estimation and MC DAMC Lecture 14 May 27 - 29 2020 14 22
Obviously there is no way to guess the missing entries unless weknow something extra about the matrix X So let us assume thatX has low rank
rank(X) = r ≪ n
The number of degrees of freedom of an ntimes n matrix with rank ris O(rn) (Why) So we may hope that m sim rn observed entriesof X will be enough to determine X completely But how
Here we will analyze what is probably the simplest method formatrix completion Take the matrix Y that consists of theobserved entries of X while all unobserved entries are set to zeroUnlike X the matrix Y may not have small rank Compute thebest rank r approximation of Y The result as we will show willbe a good approximation to X
Covariance estimation and MC DAMC Lecture 14 May 27 - 29 2020 15 22
But before we show this let us define sampling of entries morerigorously Assume each entry of X is shown or hiddenindependently of others with fixed probability p Which entries areshown is decided by independent Bernoulli random variables
δij sim Ber(p) with p =m
n2
which are often called selectors in this context The value of p ischosen so that among n2 entries of X the expected number ofselected (known) entries is m
Define the ntimes n matrix Y with entries Yij = δijXij We canassume that we are shown Y for it is a matrix that contains theobserved entries of X while all unobserved entries are replacedwith zeros The following result shows how to estimate X basedon Y
Covariance estimation and MC DAMC Lecture 14 May 27 - 29 2020 16 22
Theorem 4 (Matrix completion)
Let 983142X be a best rank r approximation to pminus1Y Then
E1
n983042983142X minusX983042F 983249 C logn
983157rn
m983042X983042max
Here 983042X983042max = maxij |Xij | denotes the maximum magnitude of theentries of X
Remark This theorem controls the average error per entry in themean-squared sense To make the error small let us assume that wehave a sample of size m ≫ rn log2 n which is slightly larger than theideal size m sim rn This makes C log n
983155rnm = o(1) and forces the
recovery error to be bounded by o(1)983042X983042max Summarizing Theorem4 says that the expected average error per entry is much smaller thanthe maximal magnitude of the entries of X This is true for a sampleof almost optimal size m The smaller the rank r of the matrix X thefewer entries of X we need to see in order to do matrix completion
Covariance estimation and MC DAMC Lecture 14 May 27 - 29 2020 17 22
Proof of Theorem 4Step 1 The error in the operator norm Let us first bound therecovery error in the operator norm Decompose the error into twoparts using triangle inequality
983042983142X minusX983042 le 983042983142X minus pminus1Y 983042+ 983042pminus1Y minusX983042
Recall that 983142X is a best approximation to pminus1Y Then the first part ofthe error is smaller than the second part and we have
983042983142X minusX983042 le 2983042pminus1Y minusX983042 =2
p983042Y minus pX983042
The entries of the matrix Y minus pX
(Y minus pX)ij = (δij minus p)Xij
are independent and mean zero random variables
Covariance estimation and MC DAMC Lecture 14 May 27 - 29 2020 18 22
We have
E983042Y minus pX983042
983249 C log n middot983061Emax
i983042(Y minus pX)i9830422 + Emax
j983042(Y minus pX)j9830422
983062
All that remains is to bound the norms of the rows and columns ofY minus pX This is not difficult if we note that they can be expressed assums of independent random variables
983042(Y minus pX)i98304222 =n983131
j=1
(δij minus p)2X2ij 983249
n983131
j=1
(δij minus p)2 middot 983042X9830422max
and similarly for columns Taking expectation and noting that
E (δij minus p)2 = Var (δij) = p(1minus p)
we get
E 983042(Y minus pX)i9830422 983249983059E 983042(Y minus pX)i98304222
98306012983249 radic
pn983042X983042max
Covariance estimation and MC DAMC Lecture 14 May 27 - 29 2020 19 22
This is a good bound but we need something stronger Since themaximum appears inside the expectation we need a uniform boundwhich will say that all rows are bounded simultaneously with highprobability Such uniform bounds are usually proved by applyingconcentration inequalities followed by a union bound Bernsteinsinequality (145) yields (check)
P
983099983103
983101
n983131
j=1
(δij minus p)2 gt tpn
983100983104
983102 983249 exp(minusctpn) for t 983245 3
This probability can be further bounded by nminusct using the assumptionthat m = pn2 ge n log n A union bound over n rows leads to
P
983099983103
983101maxiisin[n]
n983131
j=1
(δij minus p)2 gt tpn
983100983104
983102 983249 n middot nminusct for t 983245 3
Covariance estimation and MC DAMC Lecture 14 May 27 - 29 2020 20 22
Integrating this tail we have
Emaxiisin[n]
n983131
j=1
(δij minus p)2 ≲ pn
And this yields the desired bound on the rows
Emaxiisin[n]
983042(Y minus pX)i9830422 ≲radicpn983042X983042max
We can do similarly for the columns Then
E983042Y minus pX983042 ≲ log nradicpn983042X983042max
Therefore we get
E983042983142X minusX983042 ≲ log n
983157n
p983042X983042max
Covariance estimation and MC DAMC Lecture 14 May 27 - 29 2020 21 22
Step 2 Passing to Frobenius normWe know that rank(X) le r by assumption and rank(983141X) le r by
construction so rank(983142X minusX) le 2r There is a simple relationshipbetween the operator and Frobenius norms
983042983142X minusX983042F leradic2r983042983142X minusX983042
Taking expectation of both sides we get
E983042983142X minusX983042F 983249radic2rE983042983142X minusX983042 ≲ log n
983157rn
p983042X983042max
Dividing both sides by n we can rewrite this bound as
E1
n983042983142X minusX983042F ≲ log n
983157rn
pn2983042X983042max
The proof is completed by noting the definition of the samplingprobability p = mn2
Covariance estimation and MC DAMC Lecture 14 May 27 - 29 2020 22 22
The distribution of the random vectors Yi = Xi minusX primei is symmetric
which means that the distributions of Yi and minusYi are the same(Why) Thus the distribution of the random vectors Yi and εiYi is alsothe same for all we do is change the signs of these vectors at randomand independently of the values of the vectors Summarizing we canreplace Xi minusX prime
i in the sum above with εi(Xi minusX primei) Thus
E
983056983056983056983056983056983131
i
Xi
983056983056983056983056983056 le E
983056983056983056983056983056983131
i
εi(Xi minusX primei)
983056983056983056983056983056
le E
983056983056983056983056983056983131
i
εiXi
983056983056983056983056983056+ E
983056983056983056983056983056983131
i
εiXprimei
983056983056983056983056983056
= 2E
983056983056983056983056983056983131
i
εiXi
983056983056983056983056983056
This proves the upper bound in the symmetrization inequality Thelower bound can be proved by a similar argument (Do this)
Covariance estimation and MC DAMC Lecture 14 May 27 - 29 2020 11 22
Proof of Theorem 2. The lower bound is trivial. The proof of the upper bound will be based on matrix Bernstein's inequality.

We represent A as a sum of independent, mean zero, symmetric random matrices Z_ij, each of which contains a pair of symmetric entries of A (or one diagonal entry):

A = ∑_{i≤j} Z_ij.

By the symmetrization inequality (Lemma 3) for the random matrices Z_ij, we get

E ‖A‖ = E ‖∑_{i≤j} Z_ij‖ ≤ 2 E ‖∑_{i≤j} X_ij‖,

where we set X_ij = ε_ij Z_ij and the ε_ij are independent Rademacher random variables. Now we condition on A. The random variables Z_ij become fixed values, and all randomness remains in the Rademacher random variables ε_ij.
Note that the X_ij are (conditionally) bounded almost surely, and this is exactly what we lacked in order to apply matrix Bernstein's inequality. Now we can do it. The corollary of matrix Bernstein's inequality gives

E_ε ‖∑_{i≤j} X_ij‖ ≲ σ √(log n) + K log n,

where σ² = ‖∑_{i≤j} E_ε X_ij²‖ and K = max_{i≤j} ‖X_ij‖. A good exercise is to check that

σ ≲ max_i ‖A_i‖₂ and K ≲ max_i ‖A_i‖₂.

Then we have

E_ε ‖∑_{i≤j} X_ij‖ ≲ log n · max_i ‖A_i‖₂.

Finally, we unfix A by taking expectation of both sides of this inequality with respect to A and using the law of total expectation.
We state Theorem 2 for symmetric matrices, but it is simple to extend it to general m×n random matrices A. The bound in this case becomes

E ‖A‖ ≤ C log(m+n) · (E max_i ‖A_i‖₂ + E max_j ‖A^j‖₂),

where A_i and A^j denote the rows and columns of A. To see this, apply Theorem 2 to the (m+n)×(m+n) symmetric random matrix

[ 0    A ]
[ Aᵀ   0 ].
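The reason this dilation trick works is that the eigenvalues of the symmetric dilation are ± the singular values of A, so its spectral norm equals ‖A‖. A minimal numerical check in Python (the dimensions and seed are arbitrary illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(0)
m, n = 5, 7
A = rng.standard_normal((m, n))

# Symmetric (m+n) x (m+n) dilation of A
D = np.block([[np.zeros((m, m)), A],
              [A.T, np.zeros((n, n))]])

# Spectral norms agree: eigenvalues of D are +/- the singular values of A
assert np.isclose(np.linalg.norm(D, 2), np.linalg.norm(A, 2))
```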
3 Matrix completion
Consider a fixed, unknown n×n matrix X. Suppose we are shown m randomly chosen entries of X. Can we guess all the missing entries? This important problem is called matrix completion. We will analyze it using the bounds on the norms of random matrices we just obtained.
Obviously, there is no way to guess the missing entries unless we know something extra about the matrix X. So let us assume that X has low rank:

rank(X) = r ≪ n.

The number of degrees of freedom of an n×n matrix with rank r is O(rn). (Why?) So we may hope that m ∼ rn observed entries of X will be enough to determine X completely. But how?

Here we will analyze what is probably the simplest method for matrix completion. Take the matrix Y that consists of the observed entries of X, while all unobserved entries are set to zero. Unlike X, the matrix Y may not have small rank. Compute the best rank-r approximation of Y. The result, as we will show, will be a good approximation to X.
But before we show this, let us define sampling of entries more rigorously. Assume each entry of X is shown or hidden independently of the others with fixed probability p. Which entries are shown is decided by independent Bernoulli random variables

δ_ij ∼ Ber(p) with p = m/n²,

which are often called selectors in this context. The value of p is chosen so that, among the n² entries of X, the expected number of selected (known) entries is m.

Define the n×n matrix Y with entries Y_ij = δ_ij X_ij. We can assume that we are shown Y, for it is a matrix that contains the observed entries of X while all unobserved entries are replaced with zeros. The following result shows how to estimate X based on Y.
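The whole procedure is short enough to sketch in code. The following Python simulation is illustrative only (the rank-r matrix X, the sample size m, and the seed are assumptions made for the demo): it draws Bernoulli selectors, forms Y, and computes the best rank-r approximation of p⁻¹Y by truncated SVD.

```python
import numpy as np

rng = np.random.default_rng(1)
n, r = 200, 3
m = 20 * r * n                      # sample size, somewhat above the ideal m ~ rn
p = m / n**2                        # selector probability p = m / n^2

# A rank-r matrix X to be recovered (illustrative choice)
X = rng.standard_normal((n, r)) @ rng.standard_normal((r, n))

# Bernoulli selectors: each entry of X is shown independently with probability p
delta = rng.random((n, n)) < p
Y = delta * X                       # observed entries; unobserved entries set to zero

# Estimator: best rank-r approximation of p^{-1} Y, via truncated SVD
U, s, Vt = np.linalg.svd(Y / p, full_matrices=False)
X_hat = (U[:, :r] * s[:r]) @ Vt[:r, :]

# Average per-entry error (Frobenius norm over n); the rank-r truncation
# should beat the raw rescaled observations p^{-1} Y
err_hat = np.linalg.norm(X_hat - X, "fro") / n
err_raw = np.linalg.norm(Y / p - X, "fro") / n
assert err_hat < err_raw
```

The truncation step is what removes most of the noise: p⁻¹Y − X has full rank, while keeping only r singular directions retains X and discards the bulk of the perturbation's Frobenius mass.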
Theorem 4 (Matrix completion)
Let X̂ be a best rank-r approximation to p⁻¹Y. Then

E (1/n) ‖X̂ − X‖_F ≤ C log n · √(rn/m) · ‖X‖_max.

Here ‖X‖_max = max_{i,j} |X_ij| denotes the maximum magnitude of the entries of X.

Remark. This theorem controls the average error per entry in the mean-squared sense. To make the error small, let us assume that we have a sample of size m ≫ rn log² n, which is slightly larger than the ideal size m ∼ rn. This makes C log n · √(rn/m) = o(1) and forces the recovery error to be bounded by o(1) · ‖X‖_max. Summarizing, Theorem 4 says that the expected average error per entry is much smaller than the maximal magnitude of the entries of X. This is true for a sample of almost optimal size m. The smaller the rank r of the matrix X, the fewer entries of X we need to see in order to do matrix completion.
Proof of Theorem 4.
Step 1: the error in the operator norm. Let us first bound the recovery error in the operator norm. Decompose the error into two parts using the triangle inequality:

‖X̂ − X‖ ≤ ‖X̂ − p⁻¹Y‖ + ‖p⁻¹Y − X‖.

Recall that X̂ is a best rank-r approximation to p⁻¹Y, while X itself has rank at most r. Then the first part of the error is smaller than the second part, and we have

‖X̂ − X‖ ≤ 2 ‖p⁻¹Y − X‖ = (2/p) ‖Y − pX‖.

The entries of the matrix Y − pX,

(Y − pX)_ij = (δ_ij − p) X_ij,

are independent, mean zero random variables.
We have

E ‖Y − pX‖ ≤ C log n · (E max_i ‖(Y − pX)_i‖₂ + E max_j ‖(Y − pX)^j‖₂).

All that remains is to bound the norms of the rows and columns of Y − pX. This is not difficult if we note that they can be expressed as sums of independent random variables:

‖(Y − pX)_i‖₂² = ∑_{j=1}^n (δ_ij − p)² X_ij² ≤ ∑_{j=1}^n (δ_ij − p)² · ‖X‖²_max,

and similarly for columns. Taking expectation and noting that

E (δ_ij − p)² = Var(δ_ij) = p(1 − p),

we get

E ‖(Y − pX)_i‖₂ ≤ (E ‖(Y − pX)_i‖₂²)^{1/2} ≤ √(pn) · ‖X‖_max.
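The variance computation behind this row bound, E ∑_j (δ_ij − p)² = n p(1 − p), is easy to confirm by simulation. A minimal Python sketch (the parameters n, p, and the number of trials are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(2)
n, p, trials = 1000, 0.05, 2000

# Each trial draws one row of selectors delta_ij ~ Ber(p) and computes
# sum_j (delta_ij - p)^2, whose mean should be n * p * (1 - p)
delta = rng.random((trials, n)) < p
row_sums = ((delta - p) ** 2).sum(axis=1)

assert np.isclose(row_sums.mean(), n * p * (1 - p), rtol=0.05)
```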
This is a good bound, but we need something stronger. Since the maximum appears inside the expectation, we need a uniform bound, which will say that all rows are bounded simultaneously with high probability. Such uniform bounds are usually proved by applying concentration inequalities followed by a union bound. Bernstein's inequality (14.5) yields (check!)

P{ ∑_{j=1}^n (δ_ij − p)² > tpn } ≤ exp(−ctpn) for t ≥ 3.

This probability can be further bounded by n^{−ct} using the assumption that m = pn² ≥ n log n. A union bound over the n rows leads to

P{ max_{i∈[n]} ∑_{j=1}^n (δ_ij − p)² > tpn } ≤ n · n^{−ct} for t ≥ 3.
Integrating this tail, we have

E max_{i∈[n]} ∑_{j=1}^n (δ_ij − p)² ≲ pn.

And this yields the desired bound on the rows:

E max_{i∈[n]} ‖(Y − pX)_i‖₂ ≲ √(pn) · ‖X‖_max.

We can do similarly for the columns. Then

E ‖Y − pX‖ ≲ log n · √(pn) · ‖X‖_max.

Therefore, we get

E ‖X̂ − X‖ ≲ log n · √(n/p) · ‖X‖_max.
Step 2: passing to the Frobenius norm. We know that rank(X) ≤ r by assumption and rank(X̂) ≤ r by construction, so rank(X̂ − X) ≤ 2r. There is a simple relationship between the operator and Frobenius norms:

‖X̂ − X‖_F ≤ √(2r) · ‖X̂ − X‖.

Taking expectation of both sides, we get

E ‖X̂ − X‖_F ≤ √(2r) · E ‖X̂ − X‖ ≲ log n · √(rn/p) · ‖X‖_max.

Dividing both sides by n, we can rewrite this bound as

E (1/n) ‖X̂ − X‖_F ≲ log n · √(rn/(pn²)) · ‖X‖_max.

The proof is completed by noting the definition of the sampling probability, p = m/n².
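The norm comparison used in this step follows from ‖M‖_F² = ∑_i σ_i² ≤ rank(M) · σ_max². A quick numerical check in Python (the dimensions, rank, and seed are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(3)
n, r = 100, 4

# A rank-2r matrix, standing in for the difference X_hat - X
M = rng.standard_normal((n, 2 * r)) @ rng.standard_normal((2 * r, n))

fro = np.linalg.norm(M, "fro")
op = np.linalg.norm(M, 2)

# ||M||_F <= sqrt(rank(M)) * ||M|| = sqrt(2r) * ||M||
assert fro <= np.sqrt(2 * r) * op
```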
Proof of Theorem 2 The lower bound is trivial The proof of the upperbound will be based on matrix Bernsteinrsquos inequalityWe represent A as a sum of independent mean zero symmetricrandom matrices Zij each of which contains a pair of symmetric entriesof A (or one diagonal entry)
A =983131
ilejZij
By the symmetrization inequality (Lemma 3) for the random matricesZij we get
E983042A983042 = E983056983056983056983131
i983249jZij
983056983056983056 983249 2E983056983056983056983131
i983249jXij
983056983056983056
where we set Xij = εijZij and εij are independent Rademacherrandom variables Now we condition on A The random variables Zij
become fixed values and all randomness remains in the Rademacherrandom variables εij
Covariance estimation and MC DAMC Lecture 14 May 27 - 29 2020 12 22
Note that Xij are (conditionally) bounded almost surely and this isexactly what we have lacked to apply matrix Bernsteinrsquos inequalityNow we can do it The corollary of matrix Bernsteinrsquos inequality gives
Eε
983056983056983056983131
ilejXij
983056983056983056 ≲ σ983155
log n+K log n
where σ2 = 983042983123
ilej EεX2ij983042 and K = maxilej 983042Xij983042 A good exercise is
to check that
σ ≲ maxi
983042Ai9830422 and K ≲ maxi
983042Ai9830422
Then we have
Eε
983056983056983056983131
i983249jXij
983056983056983056 ≲ log n middotmaxi
983042Ai9830422
Finally we unfix A by taking expectation of both sides of thisinequality with respect to A and using the law of total expectation
Covariance estimation and MC DAMC Lecture 14 May 27 - 29 2020 13 22
We state Theorem 2 for symmetric matrices but it is simple toextend it to general mtimes n random matrices A The bound in thiscase becomes
E983042A9830422 le C log(m+ n) middot (Emaxi
983042Ai9830422 + Emaxj
983042Aj9830422)
To see this apply Theorem 2 to the (m+ n)times (m+ n) symmetricrandom matrix 983063
0 AAT 0
983064
3 Matrix completion
Consider a fixed unknown ntimes n matrix X Suppose we are shownm randomly chosen entries of X Can we guess all the missingentries This important problem is called matrix completion Wewill analyze it using the bounds on the norms on random matriceswe just obtained
Covariance estimation and MC DAMC Lecture 14 May 27 - 29 2020 14 22
Obviously there is no way to guess the missing entries unless weknow something extra about the matrix X So let us assume thatX has low rank
rank(X) = r ≪ n
The number of degrees of freedom of an ntimes n matrix with rank ris O(rn) (Why) So we may hope that m sim rn observed entriesof X will be enough to determine X completely But how
Here we will analyze what is probably the simplest method formatrix completion Take the matrix Y that consists of theobserved entries of X while all unobserved entries are set to zeroUnlike X the matrix Y may not have small rank Compute thebest rank r approximation of Y The result as we will show willbe a good approximation to X
Covariance estimation and MC DAMC Lecture 14 May 27 - 29 2020 15 22
But before we show this let us define sampling of entries morerigorously Assume each entry of X is shown or hiddenindependently of others with fixed probability p Which entries areshown is decided by independent Bernoulli random variables
δij sim Ber(p) with p =m
n2
which are often called selectors in this context The value of p ischosen so that among n2 entries of X the expected number ofselected (known) entries is m
Define the ntimes n matrix Y with entries Yij = δijXij We canassume that we are shown Y for it is a matrix that contains theobserved entries of X while all unobserved entries are replacedwith zeros The following result shows how to estimate X basedon Y
Covariance estimation and MC DAMC Lecture 14 May 27 - 29 2020 16 22
Theorem 4 (Matrix completion)
Let 983142X be a best rank r approximation to pminus1Y Then
E1
n983042983142X minusX983042F 983249 C logn
983157rn
m983042X983042max
Here 983042X983042max = maxij |Xij | denotes the maximum magnitude of theentries of X
Remark This theorem controls the average error per entry in themean-squared sense To make the error small let us assume that wehave a sample of size m ≫ rn log2 n which is slightly larger than theideal size m sim rn This makes C log n
983155rnm = o(1) and forces the
recovery error to be bounded by o(1)983042X983042max Summarizing Theorem4 says that the expected average error per entry is much smaller thanthe maximal magnitude of the entries of X This is true for a sampleof almost optimal size m The smaller the rank r of the matrix X thefewer entries of X we need to see in order to do matrix completion
Covariance estimation and MC DAMC Lecture 14 May 27 - 29 2020 17 22
Proof of Theorem 4Step 1 The error in the operator norm Let us first bound therecovery error in the operator norm Decompose the error into twoparts using triangle inequality
983042983142X minusX983042 le 983042983142X minus pminus1Y 983042+ 983042pminus1Y minusX983042
Recall that 983142X is a best approximation to pminus1Y Then the first part ofthe error is smaller than the second part and we have
983042983142X minusX983042 le 2983042pminus1Y minusX983042 =2
p983042Y minus pX983042
The entries of the matrix Y minus pX
(Y minus pX)ij = (δij minus p)Xij
are independent and mean zero random variables
Covariance estimation and MC DAMC Lecture 14 May 27 - 29 2020 18 22
We have
E983042Y minus pX983042
983249 C log n middot983061Emax
i983042(Y minus pX)i9830422 + Emax
j983042(Y minus pX)j9830422
983062
All that remains is to bound the norms of the rows and columns ofY minus pX This is not difficult if we note that they can be expressed assums of independent random variables
983042(Y minus pX)i98304222 =n983131
j=1
(δij minus p)2X2ij 983249
n983131
j=1
(δij minus p)2 middot 983042X9830422max
and similarly for columns Taking expectation and noting that
E (δij minus p)2 = Var (δij) = p(1minus p)
we get
E 983042(Y minus pX)i9830422 983249983059E 983042(Y minus pX)i98304222
98306012983249 radic
pn983042X983042max
Covariance estimation and MC DAMC Lecture 14 May 27 - 29 2020 19 22
This is a good bound but we need something stronger Since themaximum appears inside the expectation we need a uniform boundwhich will say that all rows are bounded simultaneously with highprobability Such uniform bounds are usually proved by applyingconcentration inequalities followed by a union bound Bernsteinsinequality (145) yields (check)
P
983099983103
983101
n983131
j=1
(δij minus p)2 gt tpn
983100983104
983102 983249 exp(minusctpn) for t 983245 3
This probability can be further bounded by nminusct using the assumptionthat m = pn2 ge n log n A union bound over n rows leads to
P
983099983103
983101maxiisin[n]
n983131
j=1
(δij minus p)2 gt tpn
983100983104
983102 983249 n middot nminusct for t 983245 3
Covariance estimation and MC DAMC Lecture 14 May 27 - 29 2020 20 22
Integrating this tail we have
Emaxiisin[n]
n983131
j=1
(δij minus p)2 ≲ pn
And this yields the desired bound on the rows
Emaxiisin[n]
983042(Y minus pX)i9830422 ≲radicpn983042X983042max
We can do similarly for the columns Then
E983042Y minus pX983042 ≲ log nradicpn983042X983042max
Therefore we get
E983042983142X minusX983042 ≲ log n
983157n
p983042X983042max
Covariance estimation and MC DAMC Lecture 14 May 27 - 29 2020 21 22
Step 2 Passing to Frobenius normWe know that rank(X) le r by assumption and rank(983141X) le r by
construction so rank(983142X minusX) le 2r There is a simple relationshipbetween the operator and Frobenius norms
983042983142X minusX983042F leradic2r983042983142X minusX983042
Taking expectation of both sides we get
E983042983142X minusX983042F 983249radic2rE983042983142X minusX983042 ≲ log n
983157rn
p983042X983042max
Dividing both sides by n we can rewrite this bound as
E1
n983042983142X minusX983042F ≲ log n
983157rn
pn2983042X983042max
The proof is completed by noting the definition of the samplingprobability p = mn2
Covariance estimation and MC DAMC Lecture 14 May 27 - 29 2020 22 22
Note that Xij are (conditionally) bounded almost surely and this isexactly what we have lacked to apply matrix Bernsteinrsquos inequalityNow we can do it The corollary of matrix Bernsteinrsquos inequality gives
Eε
983056983056983056983131
ilejXij
983056983056983056 ≲ σ983155
log n+K log n
where σ2 = 983042983123
ilej EεX2ij983042 and K = maxilej 983042Xij983042 A good exercise is
to check that
σ ≲ maxi
983042Ai9830422 and K ≲ maxi
983042Ai9830422
Then we have
Eε
983056983056983056983131
i983249jXij
983056983056983056 ≲ log n middotmaxi
983042Ai9830422
Finally we unfix A by taking expectation of both sides of thisinequality with respect to A and using the law of total expectation
Covariance estimation and MC DAMC Lecture 14 May 27 - 29 2020 13 22
We state Theorem 2 for symmetric matrices but it is simple toextend it to general mtimes n random matrices A The bound in thiscase becomes
E983042A9830422 le C log(m+ n) middot (Emaxi
983042Ai9830422 + Emaxj
983042Aj9830422)
To see this apply Theorem 2 to the (m+ n)times (m+ n) symmetricrandom matrix 983063
0 AAT 0
983064
3 Matrix completion
Consider a fixed unknown ntimes n matrix X Suppose we are shownm randomly chosen entries of X Can we guess all the missingentries This important problem is called matrix completion Wewill analyze it using the bounds on the norms on random matriceswe just obtained
Covariance estimation and MC DAMC Lecture 14 May 27 - 29 2020 14 22
Obviously there is no way to guess the missing entries unless weknow something extra about the matrix X So let us assume thatX has low rank
rank(X) = r ≪ n
The number of degrees of freedom of an ntimes n matrix with rank ris O(rn) (Why) So we may hope that m sim rn observed entriesof X will be enough to determine X completely But how
Here we will analyze what is probably the simplest method formatrix completion Take the matrix Y that consists of theobserved entries of X while all unobserved entries are set to zeroUnlike X the matrix Y may not have small rank Compute thebest rank r approximation of Y The result as we will show willbe a good approximation to X
Covariance estimation and MC DAMC Lecture 14 May 27 - 29 2020 15 22
But before we show this let us define sampling of entries morerigorously Assume each entry of X is shown or hiddenindependently of others with fixed probability p Which entries areshown is decided by independent Bernoulli random variables
δij sim Ber(p) with p =m
n2
which are often called selectors in this context The value of p ischosen so that among n2 entries of X the expected number ofselected (known) entries is m
Define the ntimes n matrix Y with entries Yij = δijXij We canassume that we are shown Y for it is a matrix that contains theobserved entries of X while all unobserved entries are replacedwith zeros The following result shows how to estimate X basedon Y
Covariance estimation and MC DAMC Lecture 14 May 27 - 29 2020 16 22
Theorem 4 (Matrix completion)
Let 983142X be a best rank r approximation to pminus1Y Then
E1
n983042983142X minusX983042F 983249 C logn
983157rn
m983042X983042max
Here 983042X983042max = maxij |Xij | denotes the maximum magnitude of theentries of X
Remark This theorem controls the average error per entry in themean-squared sense To make the error small let us assume that wehave a sample of size m ≫ rn log2 n which is slightly larger than theideal size m sim rn This makes C log n
983155rnm = o(1) and forces the
recovery error to be bounded by o(1)983042X983042max Summarizing Theorem4 says that the expected average error per entry is much smaller thanthe maximal magnitude of the entries of X This is true for a sampleof almost optimal size m The smaller the rank r of the matrix X thefewer entries of X we need to see in order to do matrix completion
Covariance estimation and MC DAMC Lecture 14 May 27 - 29 2020 17 22
Proof of Theorem 4Step 1 The error in the operator norm Let us first bound therecovery error in the operator norm Decompose the error into twoparts using triangle inequality
983042983142X minusX983042 le 983042983142X minus pminus1Y 983042+ 983042pminus1Y minusX983042
Recall that 983142X is a best approximation to pminus1Y Then the first part ofthe error is smaller than the second part and we have
983042983142X minusX983042 le 2983042pminus1Y minusX983042 =2
p983042Y minus pX983042
The entries of the matrix Y minus pX
(Y minus pX)ij = (δij minus p)Xij
are independent and mean zero random variables
Covariance estimation and MC DAMC Lecture 14 May 27 - 29 2020 18 22
We have
E983042Y minus pX983042
983249 C log n middot983061Emax
i983042(Y minus pX)i9830422 + Emax
j983042(Y minus pX)j9830422
983062
All that remains is to bound the norms of the rows and columns ofY minus pX This is not difficult if we note that they can be expressed assums of independent random variables
983042(Y minus pX)i98304222 =n983131
j=1
(δij minus p)2X2ij 983249
n983131
j=1
(δij minus p)2 middot 983042X9830422max
and similarly for columns Taking expectation and noting that
E (δij minus p)2 = Var (δij) = p(1minus p)
we get
E 983042(Y minus pX)i9830422 983249983059E 983042(Y minus pX)i98304222
98306012983249 radic
pn983042X983042max
Covariance estimation and MC DAMC Lecture 14 May 27 - 29 2020 19 22
This is a good bound but we need something stronger Since themaximum appears inside the expectation we need a uniform boundwhich will say that all rows are bounded simultaneously with highprobability Such uniform bounds are usually proved by applyingconcentration inequalities followed by a union bound Bernsteinsinequality (145) yields (check)
P
983099983103
983101
n983131
j=1
(δij minus p)2 gt tpn
983100983104
983102 983249 exp(minusctpn) for t 983245 3
This probability can be further bounded by nminusct using the assumptionthat m = pn2 ge n log n A union bound over n rows leads to
P
983099983103
983101maxiisin[n]
n983131
j=1
(δij minus p)2 gt tpn
983100983104
983102 983249 n middot nminusct for t 983245 3
Covariance estimation and MC DAMC Lecture 14 May 27 - 29 2020 20 22
Integrating this tail we have
Emaxiisin[n]
n983131
j=1
(δij minus p)2 ≲ pn
And this yields the desired bound on the rows
Emaxiisin[n]
983042(Y minus pX)i9830422 ≲radicpn983042X983042max
We can do similarly for the columns Then
E983042Y minus pX983042 ≲ log nradicpn983042X983042max
Therefore we get
E983042983142X minusX983042 ≲ log n
983157n
p983042X983042max
Covariance estimation and MC DAMC Lecture 14 May 27 - 29 2020 21 22
Step 2 Passing to Frobenius normWe know that rank(X) le r by assumption and rank(983141X) le r by
construction so rank(983142X minusX) le 2r There is a simple relationshipbetween the operator and Frobenius norms
983042983142X minusX983042F leradic2r983042983142X minusX983042
Taking expectation of both sides we get
E983042983142X minusX983042F 983249radic2rE983042983142X minusX983042 ≲ log n
983157rn
p983042X983042max
Dividing both sides by n we can rewrite this bound as
E1
n983042983142X minusX983042F ≲ log n
983157rn
pn2983042X983042max
The proof is completed by noting the definition of the samplingprobability p = mn2
Covariance estimation and MC DAMC Lecture 14 May 27 - 29 2020 22 22
We state Theorem 2 for symmetric matrices but it is simple toextend it to general mtimes n random matrices A The bound in thiscase becomes
E983042A9830422 le C log(m+ n) middot (Emaxi
983042Ai9830422 + Emaxj
983042Aj9830422)
To see this apply Theorem 2 to the (m+ n)times (m+ n) symmetricrandom matrix 983063
0 AAT 0
983064
3 Matrix completion
Consider a fixed unknown ntimes n matrix X Suppose we are shownm randomly chosen entries of X Can we guess all the missingentries This important problem is called matrix completion Wewill analyze it using the bounds on the norms on random matriceswe just obtained
Covariance estimation and MC DAMC Lecture 14 May 27 - 29 2020 14 22
Obviously there is no way to guess the missing entries unless weknow something extra about the matrix X So let us assume thatX has low rank
rank(X) = r ≪ n
The number of degrees of freedom of an ntimes n matrix with rank ris O(rn) (Why) So we may hope that m sim rn observed entriesof X will be enough to determine X completely But how
Here we will analyze what is probably the simplest method formatrix completion Take the matrix Y that consists of theobserved entries of X while all unobserved entries are set to zeroUnlike X the matrix Y may not have small rank Compute thebest rank r approximation of Y The result as we will show willbe a good approximation to X
Covariance estimation and MC DAMC Lecture 14 May 27 - 29 2020 15 22
But before we show this let us define sampling of entries morerigorously Assume each entry of X is shown or hiddenindependently of others with fixed probability p Which entries areshown is decided by independent Bernoulli random variables
δij sim Ber(p) with p =m
n2
which are often called selectors in this context The value of p ischosen so that among n2 entries of X the expected number ofselected (known) entries is m
Define the ntimes n matrix Y with entries Yij = δijXij We canassume that we are shown Y for it is a matrix that contains theobserved entries of X while all unobserved entries are replacedwith zeros The following result shows how to estimate X basedon Y
Covariance estimation and MC DAMC Lecture 14 May 27 - 29 2020 16 22
Theorem 4 (Matrix completion)
Let 983142X be a best rank r approximation to pminus1Y Then
E1
n983042983142X minusX983042F 983249 C logn
983157rn
m983042X983042max
Here 983042X983042max = maxij |Xij | denotes the maximum magnitude of theentries of X
Remark This theorem controls the average error per entry in themean-squared sense To make the error small let us assume that wehave a sample of size m ≫ rn log2 n which is slightly larger than theideal size m sim rn This makes C log n
983155rnm = o(1) and forces the
recovery error to be bounded by o(1)983042X983042max Summarizing Theorem4 says that the expected average error per entry is much smaller thanthe maximal magnitude of the entries of X This is true for a sampleof almost optimal size m The smaller the rank r of the matrix X thefewer entries of X we need to see in order to do matrix completion
Covariance estimation and MC DAMC Lecture 14 May 27 - 29 2020 17 22
Proof of Theorem 4Step 1 The error in the operator norm Let us first bound therecovery error in the operator norm Decompose the error into twoparts using triangle inequality
983042983142X minusX983042 le 983042983142X minus pminus1Y 983042+ 983042pminus1Y minusX983042
Recall that 983142X is a best approximation to pminus1Y Then the first part ofthe error is smaller than the second part and we have
983042983142X minusX983042 le 2983042pminus1Y minusX983042 =2
p983042Y minus pX983042
The entries of the matrix Y minus pX
(Y minus pX)ij = (δij minus p)Xij
are independent and mean zero random variables
Covariance estimation and MC DAMC Lecture 14 May 27 - 29 2020 18 22
We have
E983042Y minus pX983042
983249 C log n middot983061Emax
i983042(Y minus pX)i9830422 + Emax
j983042(Y minus pX)j9830422
983062
All that remains is to bound the norms of the rows and columns ofY minus pX This is not difficult if we note that they can be expressed assums of independent random variables
983042(Y minus pX)i98304222 =n983131
j=1
(δij minus p)2X2ij 983249
n983131
j=1
(δij minus p)2 middot 983042X9830422max
and similarly for columns Taking expectation and noting that
E (δij minus p)2 = Var (δij) = p(1minus p)
we get
E 983042(Y minus pX)i9830422 983249983059E 983042(Y minus pX)i98304222
98306012983249 radic
pn983042X983042max
Covariance estimation and MC DAMC Lecture 14 May 27 - 29 2020 19 22
This is a good bound but we need something stronger Since themaximum appears inside the expectation we need a uniform boundwhich will say that all rows are bounded simultaneously with highprobability Such uniform bounds are usually proved by applyingconcentration inequalities followed by a union bound Bernsteinsinequality (145) yields (check)
P
983099983103
983101
n983131
j=1
(δij minus p)2 gt tpn
983100983104
983102 983249 exp(minusctpn) for t 983245 3
This probability can be further bounded by nminusct using the assumptionthat m = pn2 ge n log n A union bound over n rows leads to
P
983099983103
983101maxiisin[n]
n983131
j=1
(δij minus p)2 gt tpn
983100983104
983102 983249 n middot nminusct for t 983245 3
Covariance estimation and MC DAMC Lecture 14 May 27 - 29 2020 20 22
Integrating this tail we have
Emaxiisin[n]
n983131
j=1
(δij minus p)2 ≲ pn
And this yields the desired bound on the rows
Emaxiisin[n]
983042(Y minus pX)i9830422 ≲radicpn983042X983042max
We can do similarly for the columns Then
E983042Y minus pX983042 ≲ log nradicpn983042X983042max
Therefore we get
E983042983142X minusX983042 ≲ log n
983157n
p983042X983042max
Covariance estimation and MC DAMC Lecture 14 May 27 - 29 2020 21 22
Step 2 Passing to Frobenius normWe know that rank(X) le r by assumption and rank(983141X) le r by
construction so rank(983142X minusX) le 2r There is a simple relationshipbetween the operator and Frobenius norms
983042983142X minusX983042F leradic2r983042983142X minusX983042
Taking expectation of both sides we get
E983042983142X minusX983042F 983249radic2rE983042983142X minusX983042 ≲ log n
983157rn
p983042X983042max
Dividing both sides by n we can rewrite this bound as
E1
n983042983142X minusX983042F ≲ log n
983157rn
pn2983042X983042max
The proof is completed by noting the definition of the samplingprobability p = mn2
Covariance estimation and MC DAMC Lecture 14 May 27 - 29 2020 22 22
Obviously there is no way to guess the missing entries unless weknow something extra about the matrix X So let us assume thatX has low rank
rank(X) = r ≪ n
The number of degrees of freedom of an ntimes n matrix with rank ris O(rn) (Why) So we may hope that m sim rn observed entriesof X will be enough to determine X completely But how
Here we will analyze what is probably the simplest method formatrix completion Take the matrix Y that consists of theobserved entries of X while all unobserved entries are set to zeroUnlike X the matrix Y may not have small rank Compute thebest rank r approximation of Y The result as we will show willbe a good approximation to X
Covariance estimation and MC DAMC Lecture 14 May 27 - 29 2020 15 22
But before we show this let us define sampling of entries morerigorously Assume each entry of X is shown or hiddenindependently of others with fixed probability p Which entries areshown is decided by independent Bernoulli random variables
δij sim Ber(p) with p =m
n2
which are often called selectors in this context The value of p ischosen so that among n2 entries of X the expected number ofselected (known) entries is m
Define the ntimes n matrix Y with entries Yij = δijXij We canassume that we are shown Y for it is a matrix that contains theobserved entries of X while all unobserved entries are replacedwith zeros The following result shows how to estimate X basedon Y
Covariance estimation and MC DAMC Lecture 14 May 27 - 29 2020 16 22
Theorem 4 (Matrix completion)
Let 983142X be a best rank r approximation to pminus1Y Then
E1
n983042983142X minusX983042F 983249 C logn
983157rn
m983042X983042max
Here 983042X983042max = maxij |Xij | denotes the maximum magnitude of theentries of X
Remark This theorem controls the average error per entry in themean-squared sense To make the error small let us assume that wehave a sample of size m ≫ rn log2 n which is slightly larger than theideal size m sim rn This makes C log n
983155rnm = o(1) and forces the
recovery error to be bounded by o(1)983042X983042max Summarizing Theorem4 says that the expected average error per entry is much smaller thanthe maximal magnitude of the entries of X This is true for a sampleof almost optimal size m The smaller the rank r of the matrix X thefewer entries of X we need to see in order to do matrix completion
Covariance estimation and MC DAMC Lecture 14 May 27 - 29 2020 17 22
Proof of Theorem 4.
Step 1: The error in the operator norm. Let us first bound the recovery error in the operator norm. Decompose the error into two parts using the triangle inequality:
$$\|\widehat{X} - X\| \le \|\widehat{X} - p^{-1}Y\| + \|p^{-1}Y - X\|.$$
Recall that $\widehat{X}$ is a best rank-$r$ approximation to $p^{-1}Y$. Then the first part of the error is no larger than the second part, and we have
$$\|\widehat{X} - X\| \le 2\,\|p^{-1}Y - X\| = \frac{2}{p}\,\|Y - pX\|.$$
The entries of the matrix $Y - pX$,
$$(Y - pX)_{ij} = (\delta_{ij} - p)X_{ij},$$
are independent, mean zero random variables.
We have
$$\mathbb{E}\|Y - pX\| \le C \log n \cdot \Big( \mathbb{E}\max_i \|(Y - pX)_{i:}\|_2 + \mathbb{E}\max_j \|(Y - pX)_{:j}\|_2 \Big),$$
where $(Y - pX)_{i:}$ and $(Y - pX)_{:j}$ denote the rows and columns of $Y - pX$. All that remains is to bound the norms of these rows and columns. This is not difficult if we note that they can be expressed as sums of independent random variables:
$$\|(Y - pX)_{i:}\|_2^2 = \sum_{j=1}^n (\delta_{ij} - p)^2 X_{ij}^2 \le \sum_{j=1}^n (\delta_{ij} - p)^2 \cdot \|X\|_{\max}^2,$$
and similarly for columns. Taking expectation and noting that
$$\mathbb{E}(\delta_{ij} - p)^2 = \mathrm{Var}(\delta_{ij}) = p(1 - p) \le p,$$
we get
$$\mathbb{E}\,\|(Y - pX)_{i:}\|_2 \le \big(\mathbb{E}\,\|(Y - pX)_{i:}\|_2^2\big)^{1/2} \le \sqrt{pn}\,\|X\|_{\max}.$$
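This row-norm bound is easy to verify empirically. The Monte Carlo sketch below uses hypothetical sizes and draws $X$ with entries in $[-1,1]$, so $\|X\|_{\max} \le 1$ by construction.

```python
import numpy as np

rng = np.random.default_rng(2)
n, p = 500, 0.1
X = rng.uniform(-1.0, 1.0, (n, n))   # |X_ij| <= 1, so ||X||_max <= 1

# Monte Carlo estimate of E ||(Y - pX)_i||_2 for one fixed row i = 0
trials = 200
norms = []
for _ in range(trials):
    delta_row = rng.random(n) < p            # fresh selectors for the row
    row = (delta_row - p) * X[0]             # ((delta_ij - p) X_ij)_j
    norms.append(np.linalg.norm(row))

bound = np.sqrt(p * n)                       # sqrt(pn) ||X||_max, with ||X||_max <= 1
print(np.mean(norms), bound)                 # empirical mean vs the theoretical bound
```

The empirical mean lands visibly below $\sqrt{pn}$, with slack coming from the factors $1-p$ and $\mathbb{E}X_{ij}^2 < \|X\|_{\max}^2$ that the bound discards.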
This is a good bound, but we need something stronger. Since the maximum appears inside the expectation, we need a uniform bound, which will say that all rows are bounded simultaneously with high probability. Such uniform bounds are usually proved by applying concentration inequalities followed by a union bound. Bernstein's inequality (14.5) yields (check!)
$$\mathbb{P}\Big\{ \sum_{j=1}^n (\delta_{ij} - p)^2 > tpn \Big\} \le \exp(-ctpn) \quad \text{for } t \ge 3.$$
This probability can be further bounded by $n^{-ct}$, using the assumption that $m = pn^2 \ge n \log n$. A union bound over the $n$ rows leads to
$$\mathbb{P}\Big\{ \max_{i \in [n]} \sum_{j=1}^n (\delta_{ij} - p)^2 > tpn \Big\} \le n \cdot n^{-ct} \quad \text{for } t \ge 3.$$
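The rarity of this event can be observed directly. The simulation below (illustrative sizes; $p$ is chosen so that $pn$ is at least of order $\log n$, mirroring the assumption $m \ge n \log n$) counts how often the maximal row sum exceeds $3pn$.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 300
p = max(np.log(n) / n, 0.05)   # keep pn at least on the order of log n

# empirical frequency of the event  max_i sum_j (delta_ij - p)^2 > 3 p n
trials = 100
exceed = 0
for _ in range(trials):
    delta = rng.random((n, n)) < p
    row_sums = ((delta - p) ** 2).sum(axis=1)   # sum_j (delta_ij - p)^2, one value per row
    exceed += row_sums.max() > 3 * p * n
print(exceed / trials)   # the event is rare: frequency should be 0 or very small
```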
Integrating this tail, we have
$$\mathbb{E}\max_{i \in [n]} \sum_{j=1}^n (\delta_{ij} - p)^2 \lesssim pn,$$
and this yields the desired bound on the rows:
$$\mathbb{E}\max_{i \in [n]} \|(Y - pX)_{i:}\|_2 \lesssim \sqrt{pn}\,\|X\|_{\max}.$$
We can argue similarly for the columns. Then
$$\mathbb{E}\|Y - pX\| \lesssim \log n \,\sqrt{pn}\,\|X\|_{\max}.$$
Therefore, we get
$$\mathbb{E}\|\widehat{X} - X\| \lesssim \log n \,\sqrt{\frac{n}{p}}\,\|X\|_{\max}.$$
Step 2: Passing to the Frobenius norm. We know that $\mathrm{rank}(X) \le r$ by assumption and $\mathrm{rank}(\widehat{X}) \le r$ by construction, so $\mathrm{rank}(\widehat{X} - X) \le 2r$. There is a simple relationship between the operator and Frobenius norms:
$$\|\widehat{X} - X\|_F \le \sqrt{2r}\,\|\widehat{X} - X\|.$$
Taking the expectation of both sides, we get
$$\mathbb{E}\|\widehat{X} - X\|_F \le \sqrt{2r}\,\mathbb{E}\|\widehat{X} - X\| \lesssim \log n \,\sqrt{\frac{rn}{p}}\,\|X\|_{\max}.$$
Dividing both sides by $n$, we can rewrite this bound as
$$\mathbb{E}\,\frac{1}{n}\,\|\widehat{X} - X\|_F \lesssim \log n \,\sqrt{\frac{rn}{pn^2}}\,\|X\|_{\max}.$$
The proof is completed by noting the definition of the sampling probability, $p = m/n^2$. $\square$
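The norm comparison used in Step 2, $\|A\|_F \le \sqrt{\operatorname{rank} A}\,\|A\|$, can be sanity-checked numerically for a matrix of rank at most $2r$, playing the role of $\widehat{X} - X$ (an illustrative sketch with arbitrary sizes):

```python
import numpy as np

rng = np.random.default_rng(4)
n, r = 100, 4
# a random matrix of rank at most 2r, like Xhat - X in the proof
A = rng.standard_normal((n, 2 * r)) @ rng.standard_normal((2 * r, n))

fro = np.linalg.norm(A, "fro")
op = np.linalg.norm(A, 2)          # operator (spectral) norm
print(fro <= np.sqrt(2 * r) * op)  # prints True: ||A||_F <= sqrt(2r) ||A||
```

The inequality holds for every matrix of rank at most $2r$, since $\|A\|_F^2$ is the sum of at most $2r$ squared singular values, each at most $\|A\|^2$.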