Foundations of Data Analysis (MA4800)
Massimo Fornasier
Fakultät für Mathematik, Technische Universität München
[email protected]
http://www-m15.ma.tum.de/
Slides of Lecture, July 12, 2017
Preliminaries on Linear Algebra (LA)
We take for granted familiarity with the basics of LA taught in standard courses, in particular,
§ vector spaces, spans and linear combinations, linear bases, linear maps and matrices, eigenvalues, complex numbers, scalar products, theory of symmetric matrices.
For more details we refer to the lecture notes (in German)
G. Kemper, Lineare Algebra für Informatik, TUM, 2017,
of the course taught for Computer Scientists at TUM, or to any other standard international text on LA.
Matrix Notations
The index sets $I, J, L, \ldots$ are assumed to be finite.
As soon as complex conjugate values appear¹, the scalar field is restricted to $\mathbb{K} \in \{\mathbb{R}, \mathbb{C}\}$.
The set $\mathbb{K}^{I \times J}$ is the vector space of matrices $A \in \mathbb{K}^{I \times J}$, whose entries are denoted by $A_{ij}$, for $i \in I$ and $j \in J$. Vice versa, numbers $a_{ij} \in \mathbb{K}$ may be used to define $A := (a_{ij})_{i \in I, j \in J} \in \mathbb{K}^{I \times J}$.
If $A \in \mathbb{K}^{I \times J}$, the transposed matrix is $A^T \in \mathbb{K}^{J \times I}$, with $A^T_{ji} := A_{ij}$. A matrix $A \in \mathbb{K}^{I \times I}$ is symmetric if $A^T = A$. The Hermitian transposed matrix $A^H \in \mathbb{K}^{J \times I}$ coincides with $\overline{A}^T$. If $\mathbb{K} = \mathbb{R}$ then clearly $A^H = A^T$. A Hermitian matrix satisfies $A^H = A$.
Often we will need to consider the rows $A^{(i)}$ and the columns $A_{(j)}$ of a matrix, defined respectively as the vectors $A^{(i)} = (a_{ij})_{j \in J}$ and $A_{(j)} = (a_{ij})_{i \in I} = (A^T)^{(j)}$.
¹In the case $\mathbb{K} = \mathbb{R}$, $\overline{\alpha} = \alpha$ for all $\alpha \in \mathbb{K}$.
Matrix Notations
We assume that the standard matrix-vector and matrix-matrix multiplications are defined in the usual way: $(Ax)_i = \sum_{j \in J} a_{ij} x_j$ and $(AB)_{i\ell} = \sum_{j \in J} A_{ij} B_{j\ell}$, for $x \in \mathbb{K}^J$, $A \in \mathbb{K}^{I \times J}$ and $B \in \mathbb{K}^{J \times L}$.
The Kronecker symbol is defined by
$\delta_{ij} = \begin{cases} 1, & \text{if } i = j \in I, \\ 0, & \text{otherwise.} \end{cases}$
The unit vector $e^{(i)} \in \mathbb{K}^I$ is defined by $e^{(i)} = (\delta_{ij})_{j \in I}$.
The symbol $I = (\delta_{ij})_{i \in I, j \in I}$ is used for the identity matrix. Since matrices and index sets do not appear in the same place, the simultaneous use of the symbol $I$ does not create confusion (example: $I \in \mathbb{K}^{I \times I}$).
Matrix Notations
The range of a matrix $A \in \mathbb{K}^{I \times J}$ is
$\mathrm{range}(A) = \{Ax : x \in \mathbb{K}^J\} = \mathrm{span}\{A_{(j)},\ j \in J\}.$
Hence, the range of a matrix is the vector space spanned by its columns.
The Euclidean scalar product in $\mathbb{K}^I$ is given by
$\langle x, y \rangle = y^H x = \sum_{i \in I} x_i \overline{y_i}, \qquad (1)$
where for $\mathbb{K} = \mathbb{R}$ the conjugation can be ignored.
It is often useful and very important to notice that the matrix-vector product $Ax$ and the matrix-matrix product $AB$ can also be expressed in terms of scalar products of the rows of $A$ with $x$ and of the rows of $A$ with the columns of $B$, i.e., according to our notation,
$(Ax)_i = \langle A^{(i)}, x \rangle, \qquad (AB)_{i\ell} = \langle A^{(i)}, B_{(\ell)} \rangle. \qquad (2)$
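As a quick sanity check (not part of the original slides), the identities in (2) can be verified numerically. A minimal NumPy sketch over $\mathbb{K} = \mathbb{R}$, where the scalar product needs no conjugation; the matrix sizes are arbitrary choices:

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((3, 4))
B = rng.standard_normal((4, 2))
x = rng.standard_normal(4)

# (Ax)_i = <A^(i), x>: entry i of Ax is the scalar product of row i of A with x
Ax_via_rows = np.array([np.dot(A[i, :], x) for i in range(A.shape[0])])
assert np.allclose(Ax_via_rows, A @ x)

# (AB)_il = <A^(i), B_(l)>: entry (i, l) pairs row i of A with column l of B
AB_via_rows = np.array([[np.dot(A[i, :], B[:, l]) for l in range(B.shape[1])]
                        for i in range(A.shape[0])])
assert np.allclose(AB_via_rows, A @ B)
```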
Matrix Notations
Two vectors $x, y \in \mathbb{K}^I$ are orthogonal (and we may write $x \perp y$) if $\langle x, y \rangle = 0$.
We often consider sets which are mutually orthogonal. In this case, for $X, Y \subset \mathbb{K}^I$ we say that $X$ is orthogonal to $Y$, and we write $X \perp Y$, if $\langle x, y \rangle = 0$ for all $x \in X$ and $y \in Y$.
When $X$ and $Y$ are linear subspaces, their orthogonality can be simply checked by showing that they possess bases which are mutually orthogonal. This is a very important principle we will use often.
A family of vectors $X = \{x_\nu\}_{\nu \in F} \subset \mathbb{K}^I$ is orthogonal if the vectors $x_\nu$ are pairwise orthogonal, i.e., $\langle x_\nu, x_{\nu'} \rangle = 0$ for $\nu \neq \nu'$. The family is additionally called orthonormal if $\langle x_\nu, x_\nu \rangle = 1$ for all $\nu \in F$.
Matrix Notations
A matrix $A \in \mathbb{K}^{I \times J}$ is called orthogonal if the columns of $A$ are orthonormal, equivalently if
$A^H A = I \in \mathbb{K}^{J \times J}.$
An orthogonal square matrix $A \in \mathbb{K}^{I \times I}$ is called unitary. Differently from merely orthogonal matrices, unitary matrices satisfy
$A^H A = A A^H = I,$
i.e., $A^H = A^{-1}$ is the inverse matrix of $A$.
Assume that the index sets satisfy either $I \subset J$ or $J \subset I$. Then a (rectangular) matrix $A \in \mathbb{K}^{I \times J}$ is diagonal if $A_{ij} = 0$ for all $i \neq j$.
Matrix Rank
Proposition
Let $A \in \mathbb{K}^{I \times J}$. The following statements are equivalent and may all be used as a definition of the matrix rank $r = \mathrm{rank}(A)$.
§ (a) $r = \dim \mathrm{range}(A)$;
§ (b) $r = \dim \mathrm{range}(A^H)$;
§ (c) $r$ is the maximal number of linearly independent rows of $A$;
§ (d) $r$ is the maximal number of linearly independent columns of $A$;
§ (e) $r$ is minimal with the property $A = \sum_{i=1}^r a_i b_i^H$, where $a_i \in \mathbb{K}^I$, $b_i \in \mathbb{K}^J$;
§ (f) $r$ is maximal with the property that there exists an invertible $r \times r$ submatrix of $A$;
§ (g) $r$ is the number of positive singular values (soon!).
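Several of these characterizations can be illustrated numerically. A minimal NumPy sketch (the rank-2 test matrix built from two outer products is an arbitrary choice, not from the lecture):

```python
import numpy as np

rng = np.random.default_rng(1)
# property (e): build a rank-2 matrix as a minimal sum of two outer products
a1, a2 = rng.standard_normal(5), rng.standard_normal(5)
b1, b2 = rng.standard_normal(4), rng.standard_normal(4)
A = np.outer(a1, b1) + np.outer(a2, b2)

r = np.linalg.matrix_rank(A)
assert r == 2                           # (a)/(e): dim range(A) = 2
assert np.linalg.matrix_rank(A.T) == r  # (b)-(d): row rank equals column rank
sigmas = np.linalg.svd(A, compute_uv=False)
assert np.sum(sigmas > 1e-10) == r      # (g): number of positive singular values
```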
Matrix Rank
The rank is bounded by the maximal rank, which, for matrices, is always given by $r_{\max} = \min\{\#I, \#J\}$, and this bound is attained by full-rank matrices.
As linear independence may depend on the field $\mathbb{K}$, one may question whether the rank of a real-valued matrix depends on considering it in $\mathbb{R}$ or in $\mathbb{C}$. For matrices it turns out that it does not matter: the rank is the same.
We will often work with matrices of bounded rank $r \le k$. We denote accordingly with $R_k = \{A \in \mathbb{K}^{I \times J} : \mathrm{rank}(A) \le k\}$ such a set. Notice that this set is not a vector space (exercise!).
Norms
In the following we consider an abstract vector space $V$ over the field $\mathbb{K}$. As a typical example we may keep in mind the Euclidean plane $V = \mathbb{R}^2$ endowed with the classical Euclidean norm.
We recall below the axioms of an abstract norm on $V$: a norm is a map $\|\cdot\| : V \to [0, \infty)$ with the following properties
§ $\|v\| = 0$ if and only if $v = 0$;
§ $\|\lambda v\| = |\lambda| \, \|v\|$ for all $v \in V$ and $\lambda \in \mathbb{K}$;
§ $\|v + w\| \le \|v\| + \|w\|$ for all $v, w \in V$ (triangle inequality).
A norm is always continuous as a consequence of the (inverse) triangle inequality:
$|\,\|v\| - \|w\|\,| \le \|v - w\|, \quad \text{for all } v, w \in V.$
A vector space $V$ endowed with a norm $\|\cdot\|$ (we write the pair $(V, \|\cdot\|)$ to indicate it) is called a normed vector space. As mentioned above, a typical example is the Euclidean plane.
Scalar products and (pre)-Hilbert spaces
A normed vector space $(V, \|\cdot\|)$ is a pre-Hilbert space if its norm is defined by
$\|v\| = \sqrt{\langle v, v \rangle}, \quad v \in V,$
where $\langle \cdot, \cdot \rangle : V \times V \to \mathbb{K}$ is a scalar product on $V$, i.e., it fulfills the properties
§ $\langle v, v \rangle > 0$ for $v \neq 0$;
§ $\langle v, w \rangle = \overline{\langle w, v \rangle}$, for $v, w \in V$;
§ $\langle u + \lambda v, w \rangle = \langle u, w \rangle + \lambda \langle v, w \rangle$ for $u, v, w \in V$ and $\lambda \in \mathbb{K}$;
§ $\langle w, u + \lambda v \rangle = \langle w, u \rangle + \overline{\lambda} \langle w, v \rangle$ for $u, v, w \in V$ and $\lambda \in \mathbb{K}$.
The triangle inequality for the norm follows from the Schwarz inequality
$|\langle v, w \rangle| \le \|v\| \, \|w\|, \quad v, w \in V.$
Scalar products and (pre)-Hilbert spaces
We describe the pre-Hilbert space also by the pair $(V, \langle \cdot, \cdot \rangle)$. A typical example of a scalar product over $\mathbb{K}^I$ is the one we introduced in (1), which generates the Euclidean norm on $\mathbb{K}^I$:
$\|v\|_2 = \sqrt{\sum_{i \in I} |v_i|^2}, \quad v \in \mathbb{K}^I.$
As before, one can define orthogonality between vectors, and orthogonal and orthonormal sets of vectors.
Hilbert spaces
A Hilbert space is a pre-Hilbert space $(V, \langle \cdot, \cdot \rangle)$ whose topology, induced by the associated norm $\|\cdot\|$, is complete, i.e., any Cauchy sequence in such a vector space is convergent.
It is important to stress that every finite dimensional pre-Hilbert space $(V, \langle \cdot, \cdot \rangle)$ is actually complete, hence always a Hilbert space.
Thus, those who are not familiar with topological notions (e.g., completeness, Cauchy sequences, etc.) should just reason in what follows according to their Euclidean geometric intuition.
Projections
Recall now that a set $C$ in a vector space is convex if, for all $v, w \in C$ and all $t \in [0, 1]$, $tv + (1 - t)w \in C$.
Given a closed convex set $C$ in a Hilbert space $(V, \langle \cdot, \cdot \rangle)$, one defines the projection of any vector $v$ on $C$ as
$P_C(v) = \arg\min_{w \in C} \|v - w\|.$
This definition is well-posed, as the projection is actually unique, and an equivalent definition is given by fulfilling the inequality
$\Re \langle z - P_C(v), v - P_C(v) \rangle \le 0,$
for all $z \in C$. This is left as an exercise.
Projections onto subspaces: Pythagoras-Fourier Theorem
Of extreme importance for us are the orthogonal projections onto subspaces.
In case $C = W \subset V$ is actually a closed linear subspace of $V$, then the projection onto $W$ can be readily computed as soon as one disposes of an orthonormal basis of $W$. Let $\{w_\nu\}_{\nu \in F}$ be a (countable) orthonormal basis of $W$; then
$P_W(v) = \sum_{\nu \in F} \langle v, w_\nu \rangle w_\nu, \quad \text{for all } v \in V. \qquad (3)$
Moreover, the Pythagoras-Fourier Theorem holds:
$\|P_W(v)\|^2 = \sum_{\nu \in F} |\langle v, w_\nu \rangle|^2, \quad \text{for all } v \in V.$
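Formula (3) and the Pythagoras-Fourier identity can be checked numerically. A minimal NumPy sketch, where the orthonormal basis of a 2-dimensional subspace $W \subset \mathbb{R}^5$ is an arbitrary one obtained via a QR factorization:

```python
import numpy as np

rng = np.random.default_rng(2)
# orthonormal basis {w_1, w_2} of a 2-dimensional subspace W of R^5
W, _ = np.linalg.qr(rng.standard_normal((5, 2)))
v = rng.standard_normal(5)

# formula (3): P_W(v) = sum_nu <v, w_nu> w_nu
Pv = sum(np.dot(v, W[:, k]) * W[:, k] for k in range(W.shape[1]))

# Pythagoras-Fourier: ||P_W(v)||^2 = sum_nu |<v, w_nu>|^2
assert np.isclose(np.dot(Pv, Pv),
                  sum(np.dot(v, W[:, k]) ** 2 for k in range(W.shape[1])))
# the residual v - P_W(v) is orthogonal to W
assert np.allclose(W.T @ (v - Pv), 0)
```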
Pythagoras-Fourier Theorem and orthonormal expansions
We will make frequent use of this characterization of the orthogonal projections and of the Pythagoras-Fourier Theorem, especially in the case where $W = V$.
In this case, obviously, $P_V = I$ and we have the orthonormal expansion of any vector $v \in V$,
$v = \sum_{\nu \in F} \langle v, w_\nu \rangle w_\nu,$
and the norm identity
$\|v\|^2 = \sum_{\nu \in F} |\langle v, w_\nu \rangle|^2.$
Trace of a matrix
The effort of defining abstract scalar products and norms allows us now to introduce several norms for matrices.
First of all we need to introduce the concept of trace, which is the map $\mathrm{tr} : \mathbb{K}^{I \times I} \to \mathbb{K}$ defined by
$\mathrm{tr}(A) = \sum_{i \in I} A_{ii},$
i.e., it is the sum of the diagonal elements of the matrix $A$.
Trace of a matrix: properties
The trace enjoys several properties, which we collect in the following:
Proposition (Properties of the trace)
(a) $\mathrm{tr}(AB) = \mathrm{tr}(BA)$, for any $A \in \mathbb{K}^{I \times J}$ and $B \in \mathbb{K}^{J \times I}$;
(b) $\mathrm{tr}(ABC) = \mathrm{tr}(BCA)$, for any matrices $A, B, C$ of compatible sizes and indexes; this property is called the circularity property of the trace;
(c) as a consequence of the previous property we obtain the invariance of the trace under unitary transformations, i.e., $\mathrm{tr}(A) = \mathrm{tr}(U A U^H)$ for $A \in \mathbb{K}^{I \times I}$ and any unitary matrix $U \in \mathbb{K}^{I \times I}$;
(d) $\mathrm{tr}(A) = \sum_{i \in I} \lambda_i$, where $\{\lambda_i : i \in I\}$ are the eigenvalues of $A$ (counted with multiplicity).
Matrix norms
The Frobenius norm of a matrix is essentially the Euclidean norm computed over the entries of the matrix (considered as a vector):
$\|A\|_F := \sqrt{\sum_{i \in I, j \in J} |A_{ij}|^2}, \quad A \in \mathbb{K}^{I \times J}.$
It is also known as the Schur norm or Hilbert-Schmidt norm. This norm is generated by the scalar product (that's why we made the effort of introducing abstract scalar products!)
$\langle A, B \rangle_F := \sum_{i \in I} \sum_{j \in J} A_{ij} \overline{B_{ij}} = \mathrm{tr}(A B^H) = \mathrm{tr}(B^H A). \qquad (4)$
In particular,
$\|A\|_F^2 = \mathrm{tr}(A A^H) = \mathrm{tr}(A^H A) \qquad (5)$
holds.
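A quick NumPy check of (5) on an arbitrary complex test matrix (not part of the slides):

```python
import numpy as np

rng = np.random.default_rng(3)
A = rng.standard_normal((3, 4)) + 1j * rng.standard_normal((3, 4))

fro = np.sqrt(np.sum(np.abs(A) ** 2))               # definition of ||A||_F
via_trace = np.sqrt(np.trace(A @ A.conj().T).real)  # (5): ||A||_F^2 = tr(A A^H)

assert np.isclose(fro, via_trace)
assert np.isclose(fro, np.linalg.norm(A))  # NumPy's default matrix norm is Frobenius
```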
Matrix norms
Let $\|\cdot\|_X$ and $\|\cdot\|_Y$ be vector norms on the vector spaces $X = \mathbb{K}^J$ and $Y = \mathbb{K}^I$, respectively. Then the associated matrix norm is
$\|A\| := \|A\|_{X \to Y} := \sup_{z \neq 0} \frac{\|Az\|_Y}{\|z\|_X}, \quad A \in \mathbb{K}^{I \times J}.$
If both $\|\cdot\|_X$ and $\|\cdot\|_Y$ coincide with the Euclidean norms $\|\cdot\|_2$ on $X = \mathbb{K}^J$ and $Y = \mathbb{K}^I$, respectively, then the associated matrix norm is called the spectral norm, and it is denoted with $\|A\|_\infty$.
As the spectral norm has a central importance in many applications, it is often denoted just by $\|A\|$.
Unitary invariance and submultiplicativity
Both matrix norms we introduced so far are invariant with respect to unitary transformations, i.e., $\|A\|_F = \|U A V^H\|_F$ and $\|A\|_\infty = \|U A V^H\|_\infty$ for unitary matrices $U, V$.
Moreover, both are submultiplicative, i.e., $\|AB\|_F \le \|A\|_\infty \|B\|_F \le \|A\|_F \|B\|_F$ and $\|AB\| \le \|A\| \, \|B\|$, for matrices $A, B$ of compatible sizes.
Introduction of the Singular Value Decomposition
We introduce and analyze the singular value decomposition (SVD) of a matrix $A$, which is the factorization of $A$ into the product of three matrices, $A = U \Sigma V^H$, where $U, V$ are orthogonal matrices of compatible size and the matrix $\Sigma$ is diagonal with positive real entries.
All these terms (orthogonal matrix, diagonal matrix) have been introduced in the previous lectures.
Geometrical derivation: principal component analysis
To gain insight into the SVD, treat the $n = \#I$ rows of a matrix $A \in \mathbb{K}^{I \times J}$ as points in a $d$-dimensional space, where $d = \#J$, and consider the problem of finding the best $k$-dimensional subspace with respect to this set of points.
Here best means minimizing the sum of the squares of the perpendicular distances of the points to the subspace. An orthonormal basis for this subspace is built from the fundamental directions of maximal variance of the group of high dimensional points; these directions are called principal components.
Pythagoras Theorem, scalar products, and orthogonal projections
We begin with a special case of the problem, where the subspace is 1-dimensional: a line through the origin.
We will see later that the best-fit $k$-dimensional subspace can be found by $k$ applications of the best-fit line algorithm.
Finding the best fitting line through the origin with respect to a set of points $\{x_i := A^{(i)} \in \mathbb{K}^2 : i \in I\}$ in the Euclidean plane $\mathbb{K}^2$ means minimizing the sum of the squared distances of the points to the line. Here distance is measured perpendicular to the line. The problem is called the best least squares fit.
Pythagoras Theorem, scalar products, and orthogonal projections
In the best least squares fit, one is minimizing the distance to a subspace. Now, consider projecting a point $x_i$ orthogonally onto a line through the origin. Then, by Pythagoras Theorem,
$x_{i1}^2 + x_{i2}^2 + \cdots + x_{id}^2 = (\text{length of projection})^2 + (\text{distance of point to line})^2.$
Pythagoras Theorem, scalar products, and orthogonal projections
In particular, from the formula above, one has
$(\text{distance of point to line})^2 = x_{i1}^2 + x_{i2}^2 + \cdots + x_{id}^2 - (\text{length of projection})^2.$
To minimize the sum of the squares of the distances to the line, one could minimize $\sum_i (x_{i1}^2 + x_{i2}^2 + \cdots + x_{id}^2)$ minus the sum of the squares of the lengths of the projections of the points onto the line.
However, the first term of this difference is a constant (independent of the line), so minimizing the sum of the squares of the distances is equivalent to maximizing the sum of the squares of the lengths of the projections onto the line.
Pythagoras Theorem, scalar products, and orthogonal projections
Averaging over the points and matrix notation
For best-fit subspaces, we could maximize the sum of the squared lengths of the projections onto the subspace instead of minimizing the sum of squared distances to the subspace.
Consider the $n = \#I$ rows of a matrix $A \in \mathbb{K}^{I \times J}$ as points in a $d$-dimensional space, where $d = \#J$, and consider the best-fit line through the origin.
Let $v$ be a unit vector along this line. The length of the projection of $A^{(i)}$, the $i$-th row of $A$, onto $v$ is, according to our definition of the scalar product, $|\langle A^{(i)}, v \rangle|$.
From this, we see that the sum of the squared lengths of the projections is
$\|Av\|_2^2 = \sum_{i \in I} |\langle A^{(i)}, v \rangle|^2. \qquad (6)$
Best-fit and first singular direction
The best-fit line is the one maximizing $\|Av\|_2^2$ and hence minimizing the sum of the squared distances of the points to the line.
With this in mind, define the first singular vector $v_1$ of $A$, which is a column vector, as giving the best-fit line through the origin for the $n$ points in $d$-dimensional space that are the rows of $A$. Thus
$v_1 = \arg\max_{\|v\|_2 = 1} \|Av\|_2.$
The value $\sigma_1(A) = \|Av_1\|_2$ is called the first singular value of $A$. Note that $\sigma_1(A)^2$ is the sum of the squares of the projections of the points onto the line determined by $v_1$.
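This variational characterization of $\sigma_1$ is easy to test numerically. A minimal NumPy sketch (the random $50 \times 3$ point cloud is an arbitrary choice): $v_1$ from the SVD attains the maximum, and no random unit vector beats it:

```python
import numpy as np

rng = np.random.default_rng(4)
A = rng.standard_normal((50, 3))  # n = 50 points in d = 3 dimensions, as rows

_, s, Vt = np.linalg.svd(A)
v1 = Vt[0]                        # first right singular vector

# sigma_1(A) = ||A v_1||_2
assert np.isclose(np.linalg.norm(A @ v1), s[0])

# no random unit vector yields a larger sum of squared projections
for _ in range(100):
    v = rng.standard_normal(3)
    v /= np.linalg.norm(v)
    assert np.linalg.norm(A @ v) <= s[0] + 1e-12
```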
First link: linear algebra VS optimization!!!
This is the first time we use an optimization criterion to approach a data interpretation problem and to create a connection between linear algebra and optimization.
As we will discuss along this course, optimization plays indeed a huge role in data analysis, and we will have to discuss it in more detail later on.
A greedy approach to best k-dimensional fit
The greedy approach to finding the best-fit 2-dimensional subspace for a matrix $A$ takes $v_1$ as the first basis vector for the 2-dimensional subspace and finds the best 2-dimensional subspace containing $v_1$.
The fact that we are using the sum of squared distances will again help, thanks to Pythagoras Theorem: for every 2-dimensional subspace containing $v_1$, the sum of squared lengths of the projections onto the subspace equals the sum of squared projections onto $v_1$ plus the sum of squared projections along a vector perpendicular to $v_1$ in the subspace.
Thus, instead of looking for the best 2-dimensional subspace containing $v_1$, look for a unit vector, call it $v_2$, perpendicular to $v_1$ that maximizes $\|Av\|_2$ among all such unit vectors.
Using the same greedy strategy to find the best three- and higher-dimensional subspaces, one defines $v_3, v_4, \ldots$ in a similar manner.
Formal definitions
More formally, the second singular vector $v_2$ is defined by the best-fit line perpendicular to $v_1$:
$v_2 = \arg\max_{\|v\|_2 = 1,\ \langle v_1, v \rangle = 0} \|Av\|_2.$
The value $\sigma_2(A) = \|Av_2\|_2$ is called the second singular value of $A$.
The $k$-th singular vector $v_k$ is defined similarly by
$v_k = \arg\max_{\|v\|_2 = 1,\ \langle v_1, v \rangle = 0, \ldots, \langle v_{k-1}, v \rangle = 0} \|Av\|_2, \qquad (7)$
and so on.
The process stops when we have found the singular vectors $v_1, \ldots, v_r$ and
$0 = \max_{\|v\|_2 = 1,\ \langle v_1, v \rangle = 0, \ldots, \langle v_r, v \rangle = 0} \|Av\|_2.$
Optimality of the greedy algorithm
If, instead of finding $v_1$ that maximized $\|Av\|_2$ and then the best-fit 2-dimensional subspace containing $v_1$, we had directly found the best-fit 2-dimensional subspace, we might have done better.
Surprisingly enough, this is actually not the case. We now give a simple proof that the greedy algorithm indeed finds the best subspaces of every dimension.
Proposition
Let $A \in \mathbb{K}^{I \times J}$ and let $v_1, v_2, \ldots, v_r$ be the singular vectors defined above. For $1 \le k \le r$, let $V_k$ be the subspace spanned by $v_1, v_2, \ldots, v_k$. Then for each $k$, $V_k$ is the best-fit $k$-dimensional subspace for $A$.
The natural order of the singular values
We conclude this part with an important property left as an exercise: show that necessarily
$\sigma_1(A) \ge \sigma_2(A) \ge \cdots \ge \sigma_r(A), \qquad (8)$
i.e., the singular values always come in a natural nonincreasing order.
On matrix norms again
Note that the vector $Av_i$ is really a list of lengths (with signs) of the projections of the rows of $A$ onto $v_i$. Think of $\sigma_i(A) = \|Av_i\|_2$ as the component of the matrix $A$ along $v_i$.
For this interpretation to make sense, it should be true that adding up the squares of the components of $A$ along each of the $v_i$ gives the square of the whole content of the matrix $A$.
This is indeed the case, and it is the matrix analogue of decomposing a vector into its components along orthogonal directions.
On matrix norms again
Consider one row, say $A^{(i)}$, of $A$. Since $v_1, v_2, \ldots, v_r$ span the space of all rows of $A$ (exercise!), $\langle A^{(i)}, v \rangle = 0$ for all $v$ orthogonal to $v_1, v_2, \ldots, v_r$.
Thus, for each row $A^{(i)}$, by Pythagoras Theorem, $\|A^{(i)}\|_2^2 = \sum_{k=1}^r |\langle A^{(i)}, v_k \rangle|^2$.
Summing over all rows $i \in I$ and recalling (6), we obtain
$\sum_{i \in I} \|A^{(i)}\|_2^2 = \sum_{k=1}^r \sum_{i \in I} |\langle A^{(i)}, v_k \rangle|^2 = \sum_{k=1}^r \|Av_k\|_2^2 = \sum_{k=1}^r \sigma_k(A)^2.$
But
$\sum_{i \in I} \|A^{(i)}\|_2^2 = \sum_{i \in I} \sum_{j \in J} |A_{ij}|^2$
is the squared Frobenius norm of $A$, so that we obtain
$\|A\|_F = \sqrt{\sum_{k=1}^r \sigma_k(A)^2}. \qquad (9)$
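Identity (9) can be confirmed in a few lines of NumPy (an arbitrary test matrix, not from the slides); the last assertion also recalls the nonincreasing order (8):

```python
import numpy as np

rng = np.random.default_rng(5)
A = rng.standard_normal((6, 4))
s = np.linalg.svd(A, compute_uv=False)

# (9): the Frobenius norm is the Euclidean norm of the singular values
assert np.isclose(np.linalg.norm(A, 'fro'), np.sqrt(np.sum(s ** 2)))
# the spectral norm is the largest singular value
assert np.isclose(np.linalg.norm(A, 2), s[0])
# (8): NumPy returns the singular values in nonincreasing order
assert np.all(np.diff(s) <= 0)
```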
On matrix norms again
We might also observe that
$\sigma_1(A) = \max_{\|v\|_2 = 1} \|Av\|_2 = \max_{v \neq 0} \frac{\|Av\|_2}{\|v\|_2} = \|A\|, \qquad (10)$
i.e., the first singular value corresponds to the spectral norm of $A$. This allows us to define other and more general norms, called the Schatten-$p$-norms for matrices, defined for $1 \le p < \infty$ by
$\|A\|_p = \left( \sum_{k=1}^r \sigma_k(A)^p \right)^{1/p},$
and for $p = \infty$ by
$\|A\|_\infty = \max_{k = 1, \ldots, r} \sigma_k(A).$
Notice that these definitions give precisely $\|A\|_F = \|A\|_2$ and $\|A\| = \|A\|_\infty$ as particular Schatten norms. Of particular relevance in certain applications related to recommender systems is the so-called nuclear norm $\|A\|_* = \|A\|_1$, corresponding to the Schatten-1-norm.
On rank again
We just mentioned above, and left it as an exercise, that the orthogonal basis constituted by the vectors $v_1, v_2, \ldots, v_r$ spans the space of all rows of $A$.
This means that their number $r$ is actually the rank of the matrix! And this is a proof of (g) in the Proposition above characterizing the rank!
The Singular Value Decomposition in its full glory
The vectors $Av_1, Av_2, \ldots, Av_r$ form yet another fundamental set of vectors associated with $A$. We normalize them to length one by setting
$u_i = \frac{1}{\sigma_i(A)} A v_i.$
The vectors $u_1, u_2, \ldots, u_r$ are called the left singular vectors of $A$.
The $v_i$ are called the right singular vectors.
The SVD theorem (soon!) will fully explain the reason for these terms.
Orthonormality of the left singular vectors
Clearly, the right singular vectors are orthogonal by definition.
We now show that the left singular vectors are also orthogonal.
Theorem
Let $A \in \mathbb{K}^{I \times J}$ be a rank-$r$ matrix. The left singular vectors $u_1, u_2, \ldots, u_r$ of $A$ are orthogonal.
SVD in its full glory
Theorem
Let $A \in \mathbb{K}^{I \times J}$ with right singular vectors $v_1, v_2, \ldots, v_r$, left singular vectors $u_1, u_2, \ldots, u_r$, and corresponding singular values $\sigma_1, \ldots, \sigma_r$. Then
$A = \sum_{k=1}^r \sigma_k u_k v_k^H.$
Best rank-k approximation
Given the singular value decomposition $A = \sum_{k=1}^r \sigma_k u_k v_k^H$ of a matrix $A \in \mathbb{K}^{I \times J}$, we define the rank-$k$ truncation by
$A_k = \sum_{\ell=1}^k \sigma_\ell u_\ell v_\ell^H.$
It should be clear that $A_k$ is a rank-$k$ matrix.
Lemma
The rows of $A_k$ are the orthogonal projections of the rows of $A$ onto the subspace $V_k$ spanned by the first $k$ singular vectors of $A$.
Best rank-k Frobenius approximation
Theorem
For any matrix $B$ of rank at most $k$,
$\|A - A_k\|_F \le \|A - B\|_F.$
Best rank-k spectral approximation
Lemma
$\|A - A_k\|^2 = \sigma_{k+1}^2.$
Theorem
For any matrix $B$ of rank at most $k$,
$\|A - A_k\| \le \|A - B\|.$
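These optimality statements can be illustrated numerically. A minimal NumPy sketch, comparing $A_k$ against one arbitrary rank-$k$ competitor (a single competitor is only an illustration, of course, not a proof):

```python
import numpy as np

rng = np.random.default_rng(6)
A = rng.standard_normal((8, 6))
U, s, Vt = np.linalg.svd(A, full_matrices=False)

k = 2
Ak = (U[:, :k] * s[:k]) @ Vt[:k]   # rank-k truncation A_k

# spectral error of the truncation is sigma_{k+1}
assert np.isclose(np.linalg.norm(A - Ak, 2), s[k])

# A_k beats a random rank-2 competitor B in both norms
B = np.outer(rng.standard_normal(8), rng.standard_normal(6)) \
    + np.outer(rng.standard_normal(8), rng.standard_normal(6))
assert np.linalg.norm(A - Ak, 'fro') <= np.linalg.norm(A - B, 'fro')
assert np.linalg.norm(A - Ak, 2) <= np.linalg.norm(A - B, 2)
```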
Nonconvexity and curse of dimensionality
We introduced the SVD by means of a greedy algorithm, which at each step is supposed to solve a nonconvex optimization problem of the type
$v_k = \arg\max_{\|v\|_2 = 1,\ \langle v_1, v \rangle = 0, \ldots, \langle v_{k-1}, v \rangle = 0} \|Av\|_2. \qquad (11)$
Well, the issue with this is that it is neither a convex optimization problem nor, in principle, stated on a low dimensional space. In fact, our goal is often to deal with "big data" (i.e., many data points, each of very high dimension), which eventually implies large dimensionalities of the matrix $A$.
How can we then approach the computation of the SVD without incurring the so-called "curse of dimensionality", which is an emphatic way of describing the computational infeasibility of a certain numerical task?
The Power Method
It is easiest to describe the method first in the case where $A$ is real-valued, square, and symmetric and has the same right and left singular vectors, namely, $A = \sum_{k=1}^r \sigma_k v_k v_k^T$.
In this case we have
$A^2 = \left( \sum_{k=1}^r \sigma_k v_k v_k^T \right) \left( \sum_{\ell=1}^r \sigma_\ell v_\ell v_\ell^T \right) = \sum_{k,\ell=1}^r \sigma_k \sigma_\ell v_k \underbrace{v_k^T v_\ell}_{= \delta_{k\ell}} v_\ell^T = \sum_{k=1}^r \sigma_k^2 v_k v_k^T.$
Similarly, if we take the $m$-th power of $A$, again all the cross terms are zero and we get
$A^m = \sum_{k=1}^r \sigma_k^m v_k v_k^T.$
The spectral norm of A and a spectral gap
If we had $\sigma_1 \gg \sigma_2$, we would have
$\lim_{m \to \infty} \frac{A^m}{\sigma_1^m} = v_1 v_1^T.$
While it is not so simple to compute $\sigma_1$ (why?!), which corresponds to the spectral norm of $A$, one can easily compute the Frobenius norm $\|A^m\|_F$, so that we can consider $\frac{A^m}{\|A^m\|_F}$, which again converges for $m \to \infty$ to $v_1 v_1^T$, from which $v_1$ may be computed (exercise!).
What if A is not square?
But then $B = A A^T$ is square. If again the SVD of $A$ is $A = \sum_{k=1}^r \sigma_k u_k v_k^T$, then
$B = \sum_{k=1}^r \sigma_k^2 u_k u_k^T.$
This is the spectral decomposition of $B$. Using the same kind of calculation as above, one obtains
$B^m = \sum_{k=1}^r \sigma_k^{2m} u_k u_k^T.$
As $m$ increases, for $k > 1$ the ratio $\sigma_k^{2m} / \sigma_1^{2m}$ goes to zero, and $B^m$ gets approximately equal to
$\sigma_1^{2m} u_1 u_1^T,$
provided again that $\sigma_1 \gg \sigma_2$. This suggests how to compute both $\sigma_1$ and $u_1$, simply by powering $B$.
Two relevant issues
§ If we do not have a spectral gap $\sigma_1 \gg \sigma_2$, we get into trouble.
§ Computing $B^m$ costs $m$ matrix-matrix multiplications; done in the schoolbook way, this costs $O(m d^3)$ operations, or $O(d^3 \log m)$ when done by successive squaring.
Use a random vector instead!
Instead, we compute
$B^m x,$
where $x$ is a random unit length vector.
The idea is that the component of $x$ in the direction of $u_1$, i.e., $x_1 := \langle x, u_1 \rangle$, is actually bounded away from zero with high probability, and at each application of $B$ it gets multiplied by $\sigma_1^2$, while the components of $x$ along the other $u_k$'s are multiplied only by $\sigma_k^2 \ll \sigma_1^2$.
Each increase in $m$ requires multiplying $B$ by the vector $B^{m-1} x$, which we can further break up into
$B^m x = A (A^T (B^{m-1} x)).$
This requires two matrix-vector products, involving the matrices $A^T$ and $A$. Since $B^m x \approx \sigma_1^{2m} u_1 (u_1^T x)$ is a scalar multiple of $u_1$, $u_1$ can be recovered from $B^m x$ by normalization.
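The resulting scheme takes only a few lines. A NumPy sketch, where the $100 \times 20$ test matrix with an explicit gap $\sigma_1 = 5 \gg \sigma_2 = 1$ is an assumption for illustration:

```python
import numpy as np

rng = np.random.default_rng(8)
# synthetic matrix A = U0 diag(5,1,...,1) V0^T with a clear spectral gap
U0, _ = np.linalg.qr(rng.standard_normal((100, 20)))
V0, _ = np.linalg.qr(rng.standard_normal((20, 20)))
A = (U0 * np.array([5.0] + [1.0] * 19)) @ V0.T

x = rng.standard_normal(100)
x /= np.linalg.norm(x)          # random unit-length start vector

# power iteration on B = A A^T using only matrix-vector products:
# B^m x = A (A^T (B^{m-1} x)); B itself is never formed
for _ in range(20):
    x = A @ (A.T @ x)
    x /= np.linalg.norm(x)      # renormalize instead of tracking sigma_1^{2m}

u1 = U0[:, 0]                   # the true top left singular vector
assert np.isclose(abs(np.dot(x, u1)), 1.0)
```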
Concentration of measure phenomenon
The concentration of measure phenomenon, informally speaking, describes the fact that, in high dimension, certain events happen with high probability. In particular, we have:
Lemma
Let $x = (x_1, \ldots, x_d) \in \mathbb{R}^d$ be a unit $d$-dimensional vector, with components taken with respect to the canonical basis, picked at random from the set $\{x : \|x\|_2 \le 1\}$. The probability that $|x_1| \ge \alpha > 0$ is at least $1 - 2\alpha \sqrt{d - 1}$.
Extensions
Remark
Notice that the previous result essentially shows also that, independently of the dimension $d$, the component $x_1 = \langle x, u_1 \rangle$ of a random unit vector $x$ with respect to any orthonormal basis $\{u_1, \ldots, u_d\}$² is bounded away from zero with overwhelming probability.
Remark
In view of the isometric mapping $(a, b) \mapsto a + ib$ from $\mathbb{R}^2$ to $\mathbb{C}$, the previous result extends to random unit vectors in $\mathbb{C}^d$ simply by modifying the statement as follows: the probability that, for a randomly chosen unit vector $z \in \mathbb{C}^d$, $|z_1| \ge \alpha > 0$ holds is at least $1 - 2\alpha \sqrt{2d - 1}$.
²Since the sphere is rotation invariant, the arguments above apply to any orthonormal basis by rotating it to coincide with the canonical basis!
Randomized Power Method
By injecting randomization into the process, we are able to use the blessing of dimensionality (concentration of measure) against the curse of dimensionality.
Theorem
Let $A \in \mathbb{K}^{I \times J}$ and let $x \in \mathbb{K}^I$ be a random unit length vector. Let $V$ be the space spanned by the left singular vectors of $A$ corresponding to singular values greater than $(1 - \varepsilon) \sigma_1$. Let $m$ be $\Omega\left( \frac{\ln(d / \varepsilon)}{\varepsilon} \right)$. Let $w^*$ be the unit vector after $m$ iterations of the power method, namely,
$w^* = \frac{(A A^H)^m x}{\|(A A^H)^m x\|_2}.$
The probability that $w^*$ has a component of at least $O\left( \frac{\varepsilon}{\alpha d} \right)$ orthogonal to $V$ is at most $2\alpha \sqrt{2d - 1}$.
To make the previous statement concrete, let us take $\alpha = \frac{1}{20d}$. Then, with probability at least $9/10$ (i.e., very close to 1), after a number $m$ of iterations of order $\Omega\left( \frac{\ln(d / \varepsilon)}{\varepsilon} \right)$, the vector $w^*$ has a component of at most $O(\varepsilon)$ orthogonal to $V$.
Applications of the SVD: Principal Component Analysis
The traditional use of the SVD is in Principal Component Analysis (PCA). PCA is illustrated by an example: customer-product data, where there are $n$ customers buying $d$ products.
Applications of the SVD: Principal Component Analysis
Let the matrix $A$ with elements $A_{ij}$ represent the probability of customer $i$ purchasing product $j$ (or the amount or utility of product $j$ to customer $i$).
One hypothesizes that there are really only $k$ underlying basic factors, like age, income, family size, etc., that determine a customer's purchase behavior.
An individual customer's behavior is determined by some weighted combination of these underlying factors.
Principal Component Analysis
That is, customer $i$'s purchase behavior can be characterized by a $k$-dimensional vector $(u_{\ell,i})_{\ell = 1, \ldots, k}$, where $k$ is much smaller than $n$ and $d$.
The components $u_{\ell,i}$, for $\ell = 1, \ldots, k$, of the vector are the weights of the basic factors.
Associated with each basic factor $\ell$ is a vector of probabilities $v_\ell$, each component of which is the probability of purchasing a given product by someone whose behavior depends only on that factor.
More abstractly, $A = U V^H$ is an $n \times d$ matrix that can be expressed as the product of two matrices, where $U$ is an $n \times k$ matrix expressing the factor weights for each customer and $V^H$ is a $k \times d$ matrix expressing the purchase probabilities of products corresponding to each factor:
$(A^{(i)})^H = \sum_{\ell=1}^k u_{\ell,i} v_\ell^H.$
Principal Component Analysis
One twist is that $A$ may not be exactly equal to $U V^H$, but only close to it, since there may be noise or random perturbations.
Taking the best rank-$k$ approximation $A_k$ from the SVD (as described above) gives us such a $U, V$. In this traditional setting, one assumes that $A$ is fully available and we wish to find $U, V$ in order to identify the basic factors or, in some applications, to denoise $A$ (if we think of $A - U V^H$ as noise).
Low rankness and recommender systems
Now imagine that $n$ and $d$ are very large, on the order of thousands or even millions; then there is probably little one could do to estimate or even store $A$.
In this setting, we may assume that we are given just a few entries of $A$ and wish to estimate the rest. If $A$ were an arbitrary matrix of size $n \times d$, this would require $\Omega(nd)$ pieces of information and cannot be done with a few entries.
But again hypothesize that $A$ is a small rank matrix with added noise. If now we also assume that the given entries are randomly drawn according to some known distribution, then there is a possibility that the SVD can be used to estimate the whole of $A$.
This area is called collaborative filtering or low-rank matrixcompletion, and one of its uses is to target an ad to a customerbased on one or two purchases.
SVD as a compression algorithm
Suppose A is the pixel intensity matrix of a large image.
The entry $A_{ij}$ gives the intensity of the $ij$-th pixel. If $A$ is $n \times n$, the transmission of $A$ requires transmitting $O(n^2)$ real numbers. Instead, one could send $A_k$, that is, the top $k$ singular values $\sigma_1, \sigma_2, \ldots, \sigma_k$ along with the left and right singular vectors $u_1, u_2, \ldots, u_k$ and $v_1, v_2, \ldots, v_k$.
This would require sending $O(kn)$ real numbers instead of $O(n^2)$ real numbers.
If $k$ is much smaller than $n$, this results in savings. For many images, a $k$ much smaller than $n$ can be used to reconstruct the image, provided that a very low resolution version of the image is sufficient.
Thus, one could use SVD as a compression method.
Sparsifying bases
It turns out that, in a more sophisticated approach, for certain classes of pictures one could use a fixed sparsifying basis such that the top (say) hundred singular vectors are sufficient to represent any picture approximately.
This means that the space spanned by the top hundred sparsifying basis vectors is not too different from the space spanned by the top two hundred singular vectors of a given matrix in that class.
Compressing these matrices by this sparsifying basis can save substantially, since the sparsifying basis is transmitted only once and a matrix is then transmitted by sending the top several hundred singular values for the sparsifying basis.
We will discuss later on the use of sparsifying bases.
Singular Vectors and ranking documents
Consider representing a document by a vector, each component of which corresponds to the number of occurrences of a particular word in the document.
The English language has on the order of 25,000 words. Thus, such a document is represented by a 25,000-dimensional vector. This representation of a document is called the word vector model.
A collection of $n$ documents may be represented by a collection of 25,000-dimensional vectors, one vector per document. The vectors may be arranged as the rows of an $n \times 25{,}000$ matrix.
Singular Vectors and ranking documents
An important task for a document collection is to rank the documents.
A naive method would be to rank in order of the total length of the document (which is the sum of the components of its term-document vector). Clearly, this is not a good measure in many cases.
This naive method attaches equal weight to each term and essentially takes the projection of each term-document vector in the direction of the vector whose components all have value 1.
Is there a better weighting of terms, i.e., a better projection direction, which would measure the intrinsic relevance of the document to the collection?
Singular Vectors and ranking documents
A good candidate is the best-fit direction for the collection of term-document vectors, namely the top (right) singular vector of the term-document matrix.
An intuitive reason for this is that this direction has the maximum sum of squared projections of the collection and so can be thought of as a synthetic term-document vector best representing the document collection.
WWW ranking
Ranking in order of the projection of each document's term vector along the best-fit direction has a nice interpretation in terms of the power method. For this, we consider a different example: that of the web with hypertext links.
The World Wide Web can be represented by a directed graph whose nodes correspond to web pages and whose directed edges correspond to hypertext links between pages.
Some web pages, called authorities, are the most prominent sources of information on a given topic. Other pages, called hubs, are ones that identify the authorities on a topic.
Authority pages are pointed to by many hub pages, and hub pages point to many authorities.
WWW ranking
One would like to assign hub weights and authority weights to each node of the web.
If there are $n$ nodes, the hub weights form an $n$-dimensional vector $u$ and the authority weights form an $n$-dimensional vector $v$. Suppose $A$ is the adjacency matrix representing the directed graph: $A_{ij}$ is 1 if there is a hypertext link from page $i$ to page $j$, and 0 otherwise. Given the hub vector $u$, the authority vector $v$ could be computed by the formula
$v_j = \sum_{i=1}^n u_i a_{ij},$
since the right-hand side is the sum of the hub weights of all the nodes that point to node $j$. In matrix terms,
$v = A^T u.$
Similarly, given an authority vector v , the hub vector could becomputed by u “ Av .
WWW ranking
Of course, at the start we have neither vector.

But the above suggests a power-like iteration: start with any v; set u = Av; then set v = A^T u, and repeat the process.

We know from the power method that this converges to the top left and right singular vectors. So, after sufficiently many iterations, we may use the left vector u as the hub-weights vector, project each column of A onto this direction, and rank the columns (authorities) in order of this projection.

But the projections just form the vector A^T u, which equals v. So we can simply rank by the order of the v_j.

This is the basis of an algorithm called the HITS algorithm, which was one of the early proposals for ranking web pages. A different ranking, called PageRank, is widely used. It is based on a random walk on the graph described above, but we do not explore it here.
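The hub/authority iteration above can be sketched in a few lines of NumPy. The adjacency matrix here is a made-up toy example, and the per-step normalization (implicit on the slides) is the usual one that keeps the iterates bounded:

```python
import numpy as np

# Toy web graph (hypothetical example): A[i, j] = 1 if page i links to page j.
A = np.array([[0, 1, 1, 0],
              [0, 0, 1, 0],
              [0, 0, 0, 1],
              [0, 1, 0, 0]], dtype=float)

v = np.ones(A.shape[1])          # start with any authority vector
for _ in range(100):
    u = A @ v                    # hub scores:       u = A v
    u /= np.linalg.norm(u)       # normalize to avoid blow-up
    v = A.T @ u                  # authority scores: v = A^T u
    v /= np.linalg.norm(v)

# v now approximates the top right singular vector of A;
# ranking the columns (authorities) by v_j is the HITS ranking.
ranking = np.argsort(-v)
```

On this example, pages 1 and 2 (which receive the most hub-weighted links) come out as the strongest authorities.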
Pseudo-inverse matrices
Given A ∈ K^{I×J} of rank r and its singular value decomposition

A = U Σ V^H = ∑_{k=1}^r σ_k u_k v_k^H,

we define the Moore-Penrose pseudo-inverse A^† ∈ K^{J×I} as

A^† = ∑_{k=1}^r σ_k^{−1} v_k u_k^H.

In matrix form it is

A^† = V Σ^{−1} U^H,

where the inverse ·^{−1} is applied only to the nonzero singular values of A.
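As a quick numerical sanity check (a sketch, not part of the slides), the rank-truncated formula A^† = V Σ^{−1} U^H can be implemented directly and compared against NumPy's built-in pseudo-inverse:

```python
import numpy as np

def pinv_via_svd(A, tol=1e-12):
    """Moore-Penrose pseudo-inverse: invert only the nonzero singular values."""
    U, s, Vh = np.linalg.svd(A, full_matrices=False)
    r = int(np.sum(s > tol * s[0]))       # numerical rank
    # A^dagger = sum_{k <= r} sigma_k^{-1} v_k u_k^H
    return (Vh[:r].conj().T / s[:r]) @ U[:, :r].conj().T

A = np.array([[1., 2., 3.],
              [4., 5., 6.]])              # rank 2, more columns than rows
Ad = pinv_via_svd(A)
```

The result agrees with `np.linalg.pinv` and satisfies the defining identities A A^† A = A and A^† A A^† = A^†.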
Pseudo-inverse matrices

If A^H A ∈ K^{J×J} is invertible (implying #I ≥ #J), which is the case if r = #J and the columns are linearly independent, then

A^† = (A^H A)^{−1} A^H.

Indeed,

(A^H A)^{−1} A^H = (V Σ² V^H)^{−1} V Σ U^H = (V Σ^{−2} V^H) V Σ U^H = V Σ^{−1} U^H = A^†.

In this case, two fundamental properties of A^† should also be noted. First of all, it is a left inverse of A, i.e.,

I = (A^H A)^{−1} A^H A = A^† A,

where I ∈ K^{r×r}. On the other side, the matrix

P_range(A) = A A^†

corresponds to the orthogonal projection of K^I onto the range of the matrix A, i.e., the span of the columns of A.
Pseudo-inverse matrices
Similarly, if A A^H ∈ K^{I×I} is invertible (implying #I ≤ #J), which is the case if r = #I and the rows are linearly independent, then

A^† = A^H (A A^H)^{−1}.

In this case, A^† is a right inverse of A, i.e.,

I = A A^H (A A^H)^{−1} = A A^†.
Least square problems
§ The system Ax = y is overdetermined, i.e., there are more equations than unknowns. In this case the problem might not even be solvable, and approximate solutions are needed, i.e., those that minimize the mismatch ‖Ax − y‖_2;

§ The system Ax = y is underdetermined, i.e., there are fewer equations than unknowns. In this case the problem has an infinite number of solutions, and one would wish to determine one of them with special properties, by means of a proper selection criterion. A common criterion for selecting a solution of the system Ax = y is to pick the solution with minimal Euclidean norm ‖x‖_2.
Pseudo-inverse matrices and least squares problems
Let us first connect least squares problems with the Moore-Penrose pseudo-inverse introduced above.
Proposition
Let A ∈ K^{I×J} and y ∈ K^I. Define M ⊂ K^J to be the set of minimizers of the map x ↦ ‖Ax − y‖_2. The convex optimization problem

arg min_{x ∈ M} ‖x‖_2    (12)

has the unique solution x* = A^† y.
Proof at the blackboard ...
Overdetermined systems
Corollary
Let A ∈ K^{I×J} with n = #I ≥ #J = d be of full rank r = d, and let y ∈ K^I. Then the least squares problem

arg min_{x ∈ K^J} ‖Ax − y‖_2    (13)

has the unique solution x = A^† y.

This result follows from the Proposition above because the minimizer x* of ‖Ax − y‖_2 is in this case unique.
Underdetermined systems
Corollary
Let A ∈ K^{I×J} with n = #I ≤ #J = d be of full rank r = n, and let y ∈ K^I. Then the least squares problem

arg min_{x ∈ K^J} ‖x‖_2 subject to Ax = y    (14)

has the unique solution x = A^† y.
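Both corollaries are easy to check numerically. The sketch below (random data, purely for illustration) treats the underdetermined case: x* = A^† y solves the system, and perturbing it by any null-space direction gives another solution with larger norm:

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((3, 6))   # underdetermined: 3 equations, 6 unknowns
y = rng.standard_normal(3)

x_star = np.linalg.pinv(A) @ y    # minimum-norm solution x* = A^dagger y

# Any other solution differs from x* by a vector in the null space of A:
h = rng.standard_normal(6)
h -= np.linalg.pinv(A) @ (A @ h)  # null-space component of h (h - A^dagger A h)
other = x_star + h                # another exact solution of Ax = y
```

Since x* lies in the row space of A and h in its null space, ‖other‖_2² = ‖x*‖_2² + ‖h‖_2², so x* is indeed the shortest solution.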
Stability of the Singular Value Decomposition

In many situations there is a data matrix of interest A, but one has only a perturbed version Ã = A + E of it, where E is a relatively small, quantifiable perturbation. To what extent can we rely on the singular value decomposition of Ã to estimate the one of A?

In many real-life cases, certain sets of data tend to behave as random vectors generated in the vicinity of a lower dimensional linear subspace.

The more data we get, the more they take a specific enveloping shape around that subspace. Hence, we could consider many matrices built out of sets of such randomly generated points. What do the SVDs of such matrices look like?

What happens if we acquire more and more data and the matrices become larger and larger?

To be able to respond in a quantitative way to such questions, we need to develop a theory of stability of singular value decompositions under small perturbations.
Spectral theorem

Assume A ∈ K^{I×I} is a Hermitian matrix, i.e., A = A^H. We recall that a nonzero vector v ∈ K^I is an eigenvector of A if Av = λv for some scalar λ ∈ K, called the corresponding eigenvalue.

Theorem (Spectral theorem for Hermitian matrices)
If A ∈ K^{I×I} and A = A^H, then there exists an orthonormal basis {v_1, ..., v_n} consisting of eigenvectors of A with real corresponding eigenvalues {λ_1, ..., λ_n} such that

A = ∑_{k=1}^n λ_k v_k v_k^H.

The previous representation is called the spectral decomposition of A.

In fact, by denoting u_k = sign(λ_k) v_k and σ_k = |λ_k|, then

A = ∑_{k=1}^n λ_k v_k v_k^H = ∑_{k=1}^n sign(λ_k) |λ_k| v_k v_k^H = ∑_{k=1}^n σ_k u_k v_k^H,

which is actually an SVD of A (up to ordering the σ_k nonincreasingly).
Weyl’s bounds
Assume now E is Hermitian, so that Ã = Ã^H holds as well. As the eigenvalues of such matrices are always real, we can order them nonincreasingly, so that λ_1 is the largest (not necessarily positive!). Then

λ_1(A + E) = max_{‖v‖_2 = 1} v^H (A + E) v ≤ max_{‖v‖_2 = 1} v^H A v + max_{‖v‖_2 = 1} v^H E v = λ_1(A) + λ_1(E).

By using the spectral decomposition of A + E, prove the first equality above as an exercise.

Weyl's bounds

Let v_1 be the eigenvector of A associated to the top eigenvalue of A. Then

λ_1(A + E) = max_{‖v‖_2 = 1} v^H (A + E) v ≥ v_1^H (A + E) v_1 = λ_1(A) + v_1^H E v_1 ≥ λ_1(A) + λ_n(E).

From these estimates we obtain the bounds

λ_1(A) + λ_n(E) ≤ λ_1(A + E) ≤ λ_1(A) + λ_1(E).
Weyl’s boundsThis chain of inequalities can be properly extended to the 2nd, 3rd,etc. eigenvalues (we do not perform this extension), leading to thefollowing important result.
Theorem (Weyl)
If A,E P KIˆI are two Hermitian matrices, then for all k “ 1, . . . , n
λkpAq ` λnpE q ď λkpA` E q ď λkpAq ` λ1pE q.
This is a very strong result because it does not depend at all fromthe magnitude of the perturbation matrix E . As a corollary
Corollary
If A,E P KIˆI are two arbitrary matrices, then for all k “ 1, . . . , n
|σkpA` E q ´ σkpAq| ď E,
where σk ’s are the corresponding singular vectors and E is thespectral norm of E .
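The corollary is straightforward to test numerically. This illustrative sketch (random matrices, not from the slides) perturbs a matrix and checks that no singular value moves by more than the spectral norm of the perturbation:

```python
import numpy as np

rng = np.random.default_rng(2)
A = rng.standard_normal((6, 6))
E = 0.1 * rng.standard_normal((6, 6))

sA = np.linalg.svd(A, compute_uv=False)        # sigma_k(A), nonincreasing
sAE = np.linalg.svd(A + E, compute_uv=False)   # sigma_k(A + E), same ordering
specE = np.linalg.norm(E, 2)                   # spectral norm ||E||

# Weyl's corollary: every singular value shifts by at most ||E||.
max_shift = np.max(np.abs(sAE - sA))
```

The comparison is valid because `np.linalg.svd` returns both lists of singular values in nonincreasing order, matching the indexing in the corollary.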
Mirsky’s bounds
Actually more can be said:
Theorem (Mirsky)
If A, E ∈ K^{I×I} are two arbitrary matrices, then

√( ∑_{k=1}^n |σ_k(A + E) − σ_k(A)|² ) ≤ ‖E‖_F,

where ‖E‖_F is the Frobenius norm of E.
In its original form, Mirsky's theorem holds for an arbitrary unitarily invariant norm (hence for the spectral norm) and includes Weyl's theorem as a special case.
Stability of singular spaces: principal angles
To quantify the stability properties of the singular value decomposition under perturbations, we need to introduce the notion of principal angles between subspaces. To introduce the concept, we recall that in the Euclidean space R^d the angle θ(v, w) between two vectors v, w ∈ R^d is defined by

cos θ(v, w) = (w^T v) / (‖v‖_2 ‖w‖_2).

In particular, for unit vectors v, w one has cos θ(v, w) = w^T v.
Stability of singular spaces: principal angles
Definition
Let V, W ∈ K^{J×I} be orthogonal matrices spanning V and W, n-dimensional subspaces of K^J (here we may assume that n = #I ≤ #J = d).³ We define the principal angles between V and W by

cos Θ(V, W) = Σ(W^H V),

where Σ(W^H V) is the list of the singular values of the matrix W^H V.

³Here we commit an abuse of notation, as we denote by V, W the matrices of orthonormal bases of the two subspaces V, W. However, this identification is legitimate, as all we need to know about the subspaces are some orthonormal bases of them.
Stability of singular spaces: principal angles

This definition sounds a bit abstract, and it is perhaps useful to characterize the principal angles in terms of a geometric description. Recalling the explicit characterization of the orthogonal projection onto a subspace given in (3), we observe that

P_V = V V^H,   P_W = W W^H.

The following result essentially says that the principal angles quantify how different the orthogonal projections onto V and W are.

Proposition
Let V, W ∈ K^{J×I} be orthogonal matrices spanning V and W, n-dimensional subspaces of K^J (here we may assume that n = #I ≤ #J = d). Then

‖P_V − P_W‖_F = ‖V V^H − W W^H‖_F = √2 ‖sin Θ(V, W)‖_2.
Proof at the blackboard ...
Stability of singular spaces: principal angles
The orthogonal complement W^⊥ of any subspace W of K^J is defined by W^⊥ = {v ∈ K^J : ⟨v, w⟩ = 0 for all w ∈ W}. We give yet another description of the principal angles. The projections onto W and W^⊥ are related by

P_W = I − P_{W^⊥}, or W W^H = I − W_⊥ W_⊥^H.    (15)

Proposition
Let V, W ∈ K^{J×I} be orthogonal matrices spanning V and W, n-dimensional subspaces of K^J (here we may assume that n = #I ≤ #J = d). We denote by W_⊥ an orthogonal matrix spanning W^⊥. Then

‖W_⊥^H V‖_F = ‖sin Θ(V, W)‖_2.
Proof at the blackboard ...
Stability of singular spaces: principal angles
We leave it as an additional exercise to show that, as a corollary of the previous result, one also has

‖P_{W^⊥} P_V‖_F = ‖sin Θ(V, W)‖_2.    (16)
Again, one has to use the properties of the trace operator!
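Principal angles as defined above are easy to compute. The sketch below (random subspaces, for illustration only) builds two orthonormal bases and verifies the identity ‖P_V − P_W‖_F = √2 ‖sin Θ(V, W)‖_2 from the Proposition:

```python
import numpy as np

rng = np.random.default_rng(3)
d, n = 8, 3
V, _ = np.linalg.qr(rng.standard_normal((d, n)))   # orthonormal basis of a random V
W, _ = np.linalg.qr(rng.standard_normal((d, n)))   # orthonormal basis of a random W

# cos Theta(V, W) = singular values of W^H V  (real case: W^T V)
cos_theta = np.linalg.svd(W.T @ V, compute_uv=False)
sin_theta = np.sqrt(np.clip(1.0 - cos_theta**2, 0.0, None))

lhs = np.linalg.norm(V @ V.T - W @ W.T, 'fro')     # ||P_V - P_W||_F
rhs = np.sqrt(2.0) * np.linalg.norm(sin_theta)     # sqrt(2) ||sin Theta(V, W)||_2
```

The `np.clip` guards against tiny negative values of 1 − cos²θ caused by rounding.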
Stability of singular spaces: Wedin's bound

Let A, Ã ∈ K^{I×J} and define E = Ã − A. We fix 1 ≤ k ≤ r_max = min{#I, #J} and consider the SVD

A = [U_1 U_0] [ Σ_1 0 ; 0 Σ_0 ] [ V_1^H ; V_0^H ] = U_1 Σ_1 V_1^H + U_0 Σ_0 V_0^H =: A_1 + A_0,

where

V_1 = [v_1, ..., v_k], V_0 = [v_{k+1}, ..., v_{r_max}], U_1 = [u_1, ..., u_k], U_0 = [u_{k+1}, ..., u_{r_max}],

and

Σ_1 = diag(σ_1, ..., σ_k), Σ_0 = diag(σ_{k+1}, ..., σ_{r_max}).

For the same k we use corresponding notations for an analogous decomposition of Ã, except that all the symbols carry a ˜ on top. It is obvious that

A + E = A_1 + A_0 + E = Ã_1 + Ã_0 = Ã.    (17)
Stability of singular spaces: Wedin’s bound
Let us notice that range(A_1) is indeed spanned by the orthonormal system U_1, while range(A_1^H) is spanned by V_1. Similar statements hold for Ã_1.

We now define two "residual" matrices

R_11 = A Ṽ_1 − Ũ_1 Σ̃_1 and R_21 = A^H Ũ_1 − Ṽ_1 Σ̃_1^H.
Stability of singular spaces: Wedin’s bound
A connection between R_11 and E is seen through the rewriting

R_11 = A Ṽ_1 − Ũ_1 Σ̃_1 = (Ã − E) Ṽ_1 − Ũ_1 (Ũ_1^H Ã Ṽ_1) = −E Ṽ_1.

We used that Σ̃_1 = Ũ_1^H Ã Ṽ_1 = Ũ_1^H Ã_1 Ṽ_1, that Ũ_1 Ũ_1^H is the orthogonal projection onto range(Ã_1), hence Ũ_1 Ũ_1^H Ã_1 = Ã_1, and that Ã_1 Ṽ_1 = Ã Ṽ_1. Analogously, it can be shown that

R_21 = −E^H Ũ_1.
Stability of singular spaces: Wedin’s bound
Theorem (Wedin)
Under the notations used so far in this section, assume that there exist α ≥ 0 and δ > 0 such that

σ_k(Ã) ≥ α + δ and σ_{k+1}(A) ≤ α.    (18)

Then for every unitarily invariant norm ‖·‖_* (in particular, for the spectral and Frobenius norms)

max{ ‖sin Θ(Ṽ_1, V_1)‖_*, ‖sin Θ(Ũ_1, U_1)‖_* } ≤ max{ ‖R_11‖_*, ‖R_21‖_* } / δ ≤ ‖E‖_* / δ.
Introduction to probability theory

In the probabilistic proof of the convergence of the power method for computing singular vectors of a matrix A, we used the following random process to define a random point on the sphere: we considered an imaginary darts player X, with no experience, throwing darts (actually at random) at the target, the two-dimensional disk Ω = B_r of radius r and center 0. To every point ω ∈ Ω within the target hit by a dart we assign a point on the boundary of the target (the one-dimensional sphere S^1): X(ω) = ω/‖ω‖_2 ∈ S^1.
Introduction to probability theory
It is also true that the one-dimensional sphere S^1 is parametrized by the angle θ ∈ [0, 2π); hence we can also consider the player X as a map from ω ∈ Ω to θ = arg(ω/‖ω‖_2) ∈ [0, 2π) ⊂ R,

X : Ω → R.

We defined the probability that the very inexperienced player X hits a point ω ∈ B ⊂ Ω simply as the ratio of the area of B relative to the entire area of the target Ω:

P(ω ∈ B) = Vol(B)/Vol(Ω).

What is then the probability of the event

B = {ω ∈ Ω : θ_1 ≤ X(ω) ≤ θ_2}?
Introduction to probability theory
What is then the probability of the event

B = {ω ∈ Ω : θ_1 ≤ X(ω) ≤ θ_2}?

The area of B is the area of the disk sector of vectors with angles in [θ_1, θ_2], which is computed using polar coordinates:

Vol(B) = ∫_{θ_1}^{θ_2} ∫_0^r s ds dθ = r²(θ_2 − θ_1)/2.

Hence, recalling that Vol(Ω) = r²π, we have

P({ω ∈ Ω : θ_1 ≤ X(ω) ≤ θ_2}) = Vol(B)/Vol(Ω) = r²(θ_2 − θ_1)/(2πr²) = (θ_2 − θ_1)/(2π).
Introduction to probability theory
What is the average outcome of the (random) player? Well, if we consider N attempts of the player, the empirical average result would be

(1/N) ∑_{i=1}^N X(ω_i),

which, for N → ∞, converges to

E X := ∫_Ω X(ω) dP(ω) = (1/(2π)) ∫_0^{2π} θ dθ = (2π)²/(4π) = π.

The function φ(θ) = (1/(2π)) I_{[0,2π)}(θ) is called the probability density of the random player X. In general one computes the quantities

E g(X) := ∫_Ω g(X(ω)) dP(ω) = ∫_R g(θ) φ(θ) dθ.
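The computation E X = π is easy to confirm by simulation. The sketch below (illustrative only, not from the slides) throws uniform random darts at the unit disk via rejection sampling and averages the resulting angles:

```python
import numpy as np

rng = np.random.default_rng(4)
N = 400_000

# Uniform points in the disk of radius r = 1 via rejection sampling
# from the enclosing square [-1, 1]^2.
pts = rng.uniform(-1, 1, size=(2 * N, 2))
pts = pts[np.einsum('ij,ij->i', pts, pts) <= 1.0][:N]

# X(omega) = angle of the dart, mapped into [0, 2*pi)
theta = np.arctan2(pts[:, 1], pts[:, 0]) % (2 * np.pi)
empirical_mean = theta.mean()        # should approach E X = pi
```

Since the angle is uniform on [0, 2π), the empirical average of a few hundred thousand throws lands very close to π.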
Introduction to probability theory
If we denote by Σ the set of all possible events (identified with measurable subsets of Ω), the triple (Ω, Σ, P) is called a probability space.

The random player, which we described above with the function X : Ω → R, is an example of a random variable.

The average outcome of its playing, E X, is the expected value of the random variable.

As already mentioned above, this particular random variable has a density function φ(θ), which allows us to compute interesting quantities (such as the probabilities of certain events) by integrals on the real line.

This relatively intuitive construction is generalized to abstract probability spaces as follows.
Abstract probability theory
Let (Ω, Σ, P) be a probability space, where Σ denotes a σ-algebra (of the events) on the sample space and P a probability measure on (Ω, Σ).
The probability of an event B ∈ Σ is denoted by

P(B) = ∫_B dP(ω) = ∫_Ω I_B(ω) dP(ω),

where the characteristic function I_B(ω) takes the value 1 if ω ∈ B and 0 otherwise.

The union bound (or Bonferroni's inequality, or Boole's inequality) states that for a collection of events B_ℓ ∈ Σ, ℓ = 1, ..., n, we have

P( ∪_{ℓ=1}^n B_ℓ ) ≤ ∑_{ℓ=1}^n P(B_ℓ).    (19)
Random variables
A random variable X is a real-valued measurable function on (Ω, Σ). Recall that X is called measurable if the preimage X^{−1}(A) = {ω ∈ Ω : X(ω) ∈ A} is contained in Σ for all Borel measurable subsets A ⊂ R.

Usually, every reasonable function X will be measurable; in particular, all functions appearing in these lectures.

In what follows we will usually not mention the underlying probability space (Ω, Σ, P) when speaking about random variables.

The distribution function F = F_X of X is defined as

F(t) = P(X ≤ t), t ∈ R.    (20)
Densities

A random variable X possesses a probability density function φ : R → R_+ if

P(a < X ≤ b) = ∫_a^b φ(t) dt for all a < b ∈ R.    (21)

Then φ(t) = (d/dt) F(t). The expectation or mean of a random variable will be denoted by

E X = ∫_Ω X(ω) dP(ω).    (22)

If X has probability density function φ, then for a function g : R → R,

E g(X) = ∫_{−∞}^{+∞} g(t) φ(t) dt,    (23)

whenever the integral exists.
Moments

The quantities E X^p, p > 0, are called moments of X, while the E|X|^p are called absolute moments. (Sometimes we may omit "absolute".)

The quantity E(X − E X)² = E X² − (E X)² is called the variance.

For 1 ≤ p < ∞, (E|X|^p)^{1/p} defines a norm on the space L_p(Ω, P) of all p-integrable random variables; in particular, the triangle inequality

(E|X + Y|^p)^{1/p} ≤ (E|X|^p)^{1/p} + (E|Y|^p)^{1/p}    (24)

holds for all p-integrable random variables X, Y on (Ω, Σ, P).

Hölder's inequality states that, for random variables X, Y on a common probability space and p, q ≥ 1 with 1/p + 1/q = 1, we have

|E XY| ≤ (E|X|^p)^{1/p} (E|Y|^q)^{1/q}.
Sequences of random variables and convergence
Let X_n, n ∈ N, be a sequence of random variables such that X_n converges to X as n → ∞, in the sense that lim_{n→∞} X_n(ω) = X(ω) for all ω.

Lebesgue's dominated convergence theorem states that if there exists a random variable Y with E|Y| < ∞ such that |X_n| ≤ |Y| a.s., then lim_{n→∞} E X_n = E X.

Lebesgue's dominated convergence theorem also has an obvious formulation for integrals of sequences of functions.
Fubini’s theorem
Fubini’s theorem on the integration of functions of two variablescan be formulated as follows.
Let f : Aˆ B Ñ C be measurable, where pA, νq and pB, µq aremeasurable spaces.
Ifş
AˆB |f px , yq|dpν b µqpx , yq ă 8 then
ż
A
˜
ż
Bf px , yq dµpyq
¸
dνpxq “
ż
B
˜
ż
Af px , yq dνpxq
¸
dµpyq .
Moment computations and Cavalieri’s formula
Absolute moments can be computed by means of the following formula.

Proposition
The absolute moments of a random variable X can be expressed as

E|X|^p = p ∫_0^∞ P(|X| ≥ t) t^{p−1} dt, p > 0.

Corollary
For a random variable X the expectation satisfies

E X = ∫_0^∞ P(X ≥ t) dt − ∫_0^∞ P(X ≤ −t) dt.
Tail bounds and Markov inequality
The function t ↦ P(|X| ≥ t) is called the tail of X. A simple but often effective tool to estimate the tail by expectations and moments is the Markov inequality.

Theorem
Let X be a random variable. Then

P(|X| ≥ t) ≤ E|X| / t for all t > 0.
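A quick empirical illustration of the Markov inequality (an illustrative sketch with exponentially distributed samples, not part of the slides): for nonnegative samples the empirical tail is always dominated by the empirical mean divided by t, since each indicator 1{X_i ≥ t} is at most X_i/t.

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.exponential(scale=1.0, size=200_000)   # nonnegative samples, E|X| = 1

thresholds = (1.0, 2.0, 5.0)
tails = {t: np.mean(X >= t) for t in thresholds}   # empirical P(|X| >= t)
bounds = {t: X.mean() / t for t in thresholds}     # Markov bound  E|X| / t
```

For the exponential distribution the true tail is e^{−t}, so the Markov bound 1/t is quite loose here, which is typical: Markov trades sharpness for generality.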
Tail bounds and Markov inequality
Remark
As an important consequence we note that, for p > 0,

P(|X| ≥ t) = P(|X|^p ≥ t^p) ≤ t^{−p} E|X|^p for all t > 0.

The special case p = 2 is referred to as the Chebyshev inequality. Similarly, for θ > 0 we obtain

P(X ≥ t) = P( exp(θX) ≥ exp(θt) ) ≤ exp(−θt) E exp(θX) for all t ∈ R.

The function θ ↦ E exp(θX) is usually called the Laplace transform or the moment generating function of X.
Normal distributions
A normally distributed random variable, or Gaussian random variable, X has probability density function

ψ(t) = (1/√(2πσ²)) exp( −(t − µ)²/(2σ²) ).    (25)

Its distribution is often denoted by N(µ, σ).

It has mean E X = µ and variance E(X − µ)² = σ².

A standard Gaussian random variable (or standard normal, or simply standard Gaussian), usually denoted g, is a Gaussian random variable with E g = 0 and E g² = 1.
Random vectors

A random vector X = [X_1, ..., X_n]^T ∈ R^n is a collection of n random variables on a common probability space (Ω, Σ, P).

Its expectation is the vector E X = [E X_1, ..., E X_n]^T ∈ R^n, while its joint distribution function is defined as

F(t_1, ..., t_n) = P(X_1 ≤ t_1, ..., X_n ≤ t_n), t_1, ..., t_n ∈ R.

Similarly to the univariate case, the random vector X has a joint probability density if there exists a function φ : R^n → R_+ such that for any measurable domain D ⊂ R^n

P(X ∈ D) = ∫_D φ(t_1, ..., t_n) dt_1 ⋯ dt_n.

A complex random vector Z = X + iY ∈ C^n is a special case of a 2n-dimensional real random vector (X, Y) ∈ R^{2n}.
Independence

A collection of random variables X_1, ..., X_n is (stochastically) independent if, for all t_1, ..., t_n ∈ R,

P(X_1 ≤ t_1, ..., X_n ≤ t_n) = ∏_{ℓ=1}^n P(X_ℓ ≤ t_ℓ).

For independent random variables, we have

E[ ∏_{ℓ=1}^n X_ℓ ] = ∏_{ℓ=1}^n E[X_ℓ].    (26)

If they have a joint probability density function φ, then the latter factorizes as

φ(t_1, ..., t_n) = φ_1(t_1) × ⋯ × φ_n(t_n),

where φ_1, ..., φ_n are the probability density functions of X_1, ..., X_n.
Independence
In general, a collection X_1 ∈ R^{n_1}, ..., X_m ∈ R^{n_m} of random vectors is independent if for any collection of measurable sets A_ℓ ⊂ R^{n_ℓ}, ℓ ∈ [m],

P(X_1 ∈ A_1, ..., X_m ∈ A_m) = ∏_{ℓ=1}^m P(X_ℓ ∈ A_ℓ).

If furthermore f_ℓ : R^{n_ℓ} → R^{N_ℓ}, ℓ = 1, ..., m, are measurable functions, then the random vectors f_1(X_1), ..., f_m(X_m) are also independent.

A collection X_1, ..., X_m ∈ R^n of independent random vectors that all have the same distribution is called independent identically distributed (i.i.d.). A random vector X' will be called an independent copy of X if X and X' are independent and have the same distribution.
Sum of independent random variables
The sum X + Y of two independent random variables X, Y having probability density functions φ_X, φ_Y has probability density function φ_{X+Y} given by the convolution

φ_{X+Y}(t) = (φ_X ∗ φ_Y)(t) = ∫_{−∞}^{+∞} φ_X(u) φ_Y(t − u) du.    (27)
Fubini’s theorem for expectations
The theorem takes the following form. Let X, Y ∈ R^n be two independent random vectors (or simply random variables) and let f : R^n × R^n → R be a measurable function such that E|f(X, Y)| < ∞. Then the functions

f_1 : R^n → R, f_1(x) = E f(x, Y), and f_2 : R^n → R, f_2(y) = E f(X, y)

are measurable, E|f_1(X)| < ∞ and E|f_2(Y)| < ∞, and

E f_1(X) = E f_2(Y) = E f(X, Y).    (28)

The random variable f_1(X) is also called the conditional expectation, or expectation conditional on X, and will sometimes be denoted by E_Y f(X, Y).
Gaussian vectors

A random vector g ∈ R^n is called a standard Gaussian vector if its components are independent standard normally distributed random variables. More generally, a random vector X ∈ R^n is said to be a Gaussian vector, or multivariate normally distributed, if there exists a matrix A ∈ R^{n×k} such that X = Ag + µ, where g ∈ R^k is a standard Gaussian vector and µ ∈ R^n is the mean of X.

The matrix Σ = A A^T is then the covariance matrix of X, i.e., Σ = E (X − µ)(X − µ)^T. If Σ is non-degenerate, i.e., Σ is positive definite, then X has a joint probability density function of the form

ψ(x) = (1 / ((2π)^{n/2} √det(Σ))) exp( −(1/2) ⟨x − µ, Σ^{−1}(x − µ)⟩ ).

In the degenerate case, when Σ is not invertible, X does not have a density.

It is easily deduced from the density that a rotated standard Gaussian Ug, where U is an orthogonal matrix, has the same distribution as g itself.
Sum of gaussian variables
If X_1, ..., X_n are independent and normally distributed random variables with means µ_1, ..., µ_n and variances σ_1², ..., σ_n², then X = [X_1, ..., X_n]^T has a multivariate normal distribution, and the sum Z = ∑_{ℓ=1}^n X_ℓ has the univariate normal distribution with mean µ = ∑_{ℓ=1}^n µ_ℓ and variance σ² = ∑_{ℓ=1}^n σ_ℓ², as can be calculated from (27).
Jensen’s inequality
Jensen’s inequality reads as follows.
Theorem
Let f : R^n → R be a convex function, and let X ∈ R^n be a random vector. Then

f(E X) ≤ E f(X).

Note that −f is convex if f is concave, so that for concave functions f, Jensen's inequality reads

E f(X) ≤ f(E X).    (29)
Cramer’s Theorem
We often encounter sums of independent mean-zero random variables. Deviation inequalities bound the tail of such sums.
We recall that the moment generating function of a (real-valued) random variable X is defined by

θ ↦ E exp(θX),

for all θ ∈ R for which the expectation on the right-hand side is well-defined.

Its logarithm is the cumulant generating function

C_X(θ) = ln E exp(θX).

With the help of these definitions we can formulate Cramer's theorem.
Cramer’s Theorem
Theorem
Let X_1, ..., X_M be a sequence of independent (real-valued) random variables with cumulant generating functions C_{X_ℓ}, ℓ ∈ [M] := {1, ..., M}. Then, for t > 0,

P( ∑_{ℓ=1}^M X_ℓ ≥ t ) ≤ exp( inf_{θ>0} { −θt + ∑_{ℓ=1}^M C_{X_ℓ}(θ) } ).

Remark
The function

t ↦ inf_{θ>0} { −θt + ∑_{ℓ=1}^M C_{X_ℓ}(θ) }

appearing in the exponential is closely connected to the convex conjugate function appearing in convex analysis.
Hoeffdings’ inequality
TheoremLet X1, . . . ,XM be a sequence of independent random variablessuch that EX` “ 0 and |X`| ď B` almost surely, ` P rMs. Then
P
˜
Mÿ
`“1
X` ě t
¸
ď exp
˜
´t2
2řM`“1 B2
`
¸
,
and consequently,
P
˜∣∣∣∣∣ Mÿ
`“1
X`
∣∣∣∣∣ ě t
¸
ď 2 exp
˜
´t2
2řM`“1 B2
`
¸
. (30)
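Bound (30) is simple to probe by simulation. This illustrative sketch (not from the slides) uses Rademacher ±1 variables, so that B_ℓ = 1 and ∑ B_ℓ² = M, and compares the empirical two-sided tail with the Hoeffding bound:

```python
import numpy as np

rng = np.random.default_rng(5)
M, trials = 100, 50_000
X = rng.choice([-1.0, 1.0], size=(trials, M))   # mean-zero, |X_l| <= B_l = 1
S = X.sum(axis=1)                               # sum of M independent terms

t = 25.0
empirical = np.mean(np.abs(S) >= t)             # estimated P(|sum X_l| >= t)
hoeffding = 2 * np.exp(-t**2 / (2 * M))         # bound (30) with sum B_l^2 = M
```

Here the true tail is roughly an order of magnitude below the bound, which is expected: Hoeffding only uses boundedness, not the actual variance.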
Bernstein inequalities
Theorem
Let X_1, ..., X_M be independent mean-zero random variables such that, for all integers n ≥ 2,

E|X_ℓ|^n ≤ (n!/2) R^{n−2} σ_ℓ² for all ℓ ∈ [M],    (31)

for some constants R > 0 and σ_ℓ > 0, ℓ ∈ [M]. Then, for all t > 0,

P( | ∑_{ℓ=1}^M X_ℓ | ≥ t ) ≤ 2 exp( −t² / (2(σ² + Rt)) ),    (32)

where σ² := ∑_{ℓ=1}^M σ_ℓ².
Bernstein inequalities
Before providing the proof we give two consequences. The first is the Bernstein inequality for bounded random variables.

Corollary
Let X_1, ..., X_M be independent mean-zero random variables such that |X_ℓ| < K almost surely, for ℓ ∈ [M] and some constant K > 0. Further assume E|X_ℓ|² ≤ σ_ℓ² for constants σ_ℓ > 0, ℓ ∈ [M]. Then, for all t > 0,

P( | ∑_{ℓ=1}^M X_ℓ | ≥ t ) ≤ 2 exp( −t² / (2(σ² + Kt/3)) ),    (33)

where σ² := ∑_{ℓ=1}^M σ_ℓ².
Bernstein inequalities
As a second consequence, we present the Bernstein inequality for subexponential random variables.

Corollary
Let X_1, ..., X_M be independent mean-zero subexponential random variables, that is, P(|X_ℓ| ≥ t) ≤ β e^{−κt} for some constants β, κ > 0, for all t > 0 and ℓ ∈ [M]. Then

P( | ∑_{ℓ=1}^M X_ℓ | ≥ t ) ≤ 2 exp( −(κt)² / (2(2βM + κt)) ).    (34)
The Johnson-Lindenstrauss Lemma
Theorem (Johnson-Lindenstrauss Lemma)
For any 0 < ε < 1 and any integer n, let k be a positive integer such that

k ≥ 2β (ε²/2 − ε³/3)^{−1} ln n,

for some β ≥ 2. Then for any set P of n points in R^d there is a map f : R^d → R^k such that for all v, w ∈ P

(1 − ε)‖v − w‖_2² ≤ ‖f(v) − f(w)‖_2² ≤ (1 + ε)‖v − w‖_2².    (35)

Furthermore, this map can be generated at random with probability 1 − (n^{2−β} − n^{1−β}).

Let us comment on the probability: choosing β ≫ 2 leads to a very favorable probability 1 − (n^{2−β} − n^{1−β}) ≈ 1 as soon as n is large enough.
The Johnson-Lindenstrauss Lemma
To clarify the impact of this result, consider a cloud of n = 10^7 points (ten million points) in dimension d = 10^6 (one million).

Then there exists a map f, generated at random with probability close to 1, which allows us to project the 10 million points into dimension k = O(ε^{−2} log 10^7) = O(7ε^{−2}), preserving the distances within the point cloud up to a maximal distortion of O(ε).

For example, for ε = 0.1 we obtain k of the order of a thousand, while the original dimension d was one million. Hence, we obtain a quasi-isometric dimensionality reduction of 3-4 orders of magnitude.
Projection onto random subspaces
One proof of the lemma takes f to be a suitable multiple of the orthogonal projection onto a random subspace of dimension k in R^d (Dasgupta-Gupta).

In this case f would be linear and representable by a matrix A.

Then, by considering the unit vectors z = (v − w)/‖v − w‖_2, we obtain that (35) is equivalent to

(1 − ε) ≤ ‖Az‖_2² ≤ (1 + ε),

or

|‖Az‖_2² − 1| ≤ ε.
... or a random point onto a fixed subspace
Hence the aim is essentially to estimate the length of a unit vector in R^d when it is projected onto a random k-dimensional subspace.

However, this length has the same distribution as the length of a random unit vector projected down onto a fixed k-dimensional subspace.

Here we take this subspace to be the space spanned by the first k coordinate vectors, for simplicity.
Gaussianly distributed vectors
We recall that X ∼ N(0, 1) if its density is

ψ(x) = (1/√(2π)) exp(−x²/2).

Let X_1, ..., X_d be d independent Gaussian N(0, 1) random variables, and let Y = ‖X‖_2^{−1} (X_1, ..., X_d).

It is easy to see that Y is a point chosen uniformly at random from the surface of the d-dimensional sphere S^{d−1} (exercise!).

Let the vector Z ∈ R^k be the projection of Y onto its first k coordinates, and let L = ‖Z‖_2². Clearly the expected squared length of Z is µ = E[‖Z‖_2²] = k/d (exercise!).
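The claims marked "exercise!" can be checked empirically. This sketch (illustrative, with arbitrary d and k) samples uniform points on the sphere as normalized Gaussian vectors and confirms E[L] ≈ k/d:

```python
import numpy as np

rng = np.random.default_rng(6)
d, k, N = 50, 5, 200_000

X = rng.standard_normal((N, d))
Y = X / np.linalg.norm(X, axis=1, keepdims=True)  # uniform points on S^{d-1}
L = np.sum(Y[:, :k]**2, axis=1)                   # squared length of first k coords

mean_L = L.mean()                                  # should approach k/d = 0.1
```

The rotation invariance of the Gaussian density is exactly what makes the normalized vector Y uniform on the sphere, tying this back to the remark on rotated standard Gaussians above.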
Concentration inequalities
Lemma
Let k < d. Then:

(a) If α < 1, then

P( L ≤ αk/d ) ≤ α^{k/2} ( 1 + (1 − α)k/(d − k) )^{(d−k)/2} ≤ exp( (k/2)(1 − α + ln α) ).

(b) If α > 1, then

P( L ≥ αk/d ) ≤ α^{k/2} ( 1 + (1 − α)k/(d − k) )^{(d−k)/2} ≤ exp( (k/2)(1 − α + ln α) ).
Proof techniques
The proof of this lemma uses the by-now-standard techniques for proving large deviation bounds on sums of random variables, such as the ones we used for proving the Hoeffding and Bernstein inequalities.

The proof of the Johnson-Lindenstrauss Lemma (as developed at the blackboard) follows by a union bound over the applications of the concentration inequalities of the Lemma.
Johnson-Lindenstrauss Lemma by Gaussian matrices
Theorem (Johnson-Lindenstrauss Lemma)
For any 0 < ε < 1/2 and any integer n, let k be a positive integer such that

k ≥ β ε^{−2} ln n.

Then for any set P of n points in R^d there is a map f : R^d → R^k such that for all v, w ∈ P

(1 − ε)‖v − w‖_2² ≤ ‖f(v) − f(w)‖_2² ≤ (1 + ε)‖v − w‖_2².    (36)

Furthermore, this map can be generated at random with probability 1 − (n^{2−β(1−ε)} − n^{1−β(1−ε)}).
Concentration inequalities for Gaussian matrices
Lemma
Let x ∈ R^d. Assume that the entries of A ∈ R^{k×d} are sampled independently and identically distributed according to N(0, 1). Then

P( | ‖(1/√k) Ax‖_2² − ‖x‖_2² | > ε ‖x‖_2² ) ≤ 2 e^{−(ε²−ε³)k/4},

or, equivalently,

P( (1 − ε)‖x‖_2² ≤ ‖(1/√k) Ax‖_2² ≤ (1 + ε)‖x‖_2² ) ≥ 1 − 2 e^{−(ε²−ε³)k/4}.
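The lemma can be illustrated numerically: embedding one fixed vector with many independently drawn Gaussian matrices, the fraction of trials violating the two-sided bound stays below 2e^{−(ε²−ε³)k/4}. A sketch with arbitrarily chosen d, k and ε (not from the slides):

```python
import numpy as np

rng = np.random.default_rng(7)
d, k, eps, trials = 200, 100, 0.5, 1000
x = rng.standard_normal(d)
nx2 = x @ x                                      # ||x||_2^2

ratios = np.empty(trials)
for i in range(trials):
    A = rng.standard_normal((k, d))              # i.i.d. N(0, 1) entries
    ratios[i] = np.sum((A @ x)**2) / (k * nx2)   # ||(1/sqrt k) A x||^2 / ||x||^2

fail_rate = np.mean(np.abs(ratios - 1.0) > eps)  # empirical failure probability
bound = 2 * np.exp(-(eps**2 - eps**3) * k / 4)   # the lemma's bound
```

Since ‖(1/√k)Ax‖² / ‖x‖² is a chi-square variable with k degrees of freedom divided by k, the observed failure rate is typically well below the bound.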
Nothing special in Gaussian matrices
Lemma
Let x ∈ R^d. Assume that the entries of A ∈ R^{k×d} are independent and take the values +1 or −1, each with probability 1/2. Then

P( | ‖(1/√k) Ax‖_2² − ‖x‖_2² | > ε ‖x‖_2² ) ≤ 2 e^{−(ε²−ε³)k/4},

or, equivalently,

P( (1 − ε)‖x‖_2² ≤ ‖(1/√k) Ax‖_2² ≤ (1 + ε)‖x‖_2² ) ≥ 1 − 2 e^{−(ε²−ε³)k/4}.
Scalar product preservation
Corollary
Let P be a set of points in the unit ball B(1) ⊂ R^d, and let f : R^d → R^k be a linear function such that

(1 − ε)‖u ± v‖_2² ≤ ‖f(u ± v)‖_2² ≤ (1 + ε)‖u ± v‖_2²,    (37)

for all u, v ∈ P. Then

|⟨u, v⟩ − ⟨f(u), f(v)⟩| ≤ ε,

for all u, v ∈ P.
Scalar product preservation
Corollary
Let u, v be two points in the unit ball B(1) ⊂ R^d, and let A ∈ R^{k×d} be a matrix with random entries such that, for any given x ∈ R^d,

P( (1 − ε)‖x‖_2² ≤ ‖(1/√k) Ax‖_2² ≤ (1 + ε)‖x‖_2² ) ≥ 1 − 2 e^{−(ε²−ε³)k/4}.

(For instance, A is a matrix with Gaussian random entries, or with entries ±1 with probability 1/2.) Then the map f(x) = (1/√k) Ax satisfies, with probability at least 1 − 4 e^{−(ε²−ε³)k/4},

|⟨u, v⟩ − ⟨f(u), f(v)⟩| ≤ ε.
Convex sets
Definition
A subset K ⊂ R^N is called convex if for all x, z ∈ K the line segment connecting x and z is also completely contained in K, that is,

tx + (1 − t)z ∈ K for all t ∈ [0, 1].

It is straightforward to check that a set K ⊂ R^N is convex if and only if for all x_1, ..., x_n ∈ K and t_1, ..., t_n ≥ 0 with ∑_{j=1}^n t_j = 1, the convex combination ∑_{j=1}^n t_j x_j is also contained in K.
Convex sets
Definition
Let T ⊂ R^N be a set. Its convex hull conv(T) is the smallest convex set containing T.

It is well known that the convex hull of T consists of the convex combinations of elements of T:

conv(T) = { ∑_j t_j x_j : t_j ≥ 0, ∑_j t_j = 1, x_j ∈ T }.

Simple examples of convex sets include subspaces, affine spaces, half-spaces, polygons, and norm balls B(x, R). The intersection of convex sets is again convex.
Cones
Definition
A set K ⊂ R^N is called a cone if for all x ∈ K and all t ≥ 0, the vector tx is also contained in K. If, in addition, K is convex, then K is called a convex cone.

Obviously, the zero vector is contained in every cone. A set K is a convex cone if and only if for all x, z ∈ K and s, t ≥ 0, the vector sx + tz is also contained in K.

Simple examples of convex cones include the positive orthant R^N_+ = {x ∈ R^N : x_i ≥ 0 for all i ∈ [N]} and the set of positive semidefinite matrices in R^{N×N}. Another important example of a convex cone is the second-order cone

{ x ∈ R^{N+1} : √( ∑_{j=1}^N x_j² ) ≤ x_{N+1} }.    (38)
Dual cones
For a cone K ⊂ R^N, its dual cone K* is defined via

K* = { z ∈ R^N : ⟨x, z⟩ ≥ 0 for all x ∈ K }.    (39)

As an intersection of half-spaces, K* is closed and convex, and it is straightforward to check that K* is again a cone.

If K is a closed cone, then K** = K.

Moreover, if H, K ⊂ R^N are cones such that H ⊂ K, then K* ⊂ H*.

As an example, the dual cone of the positive orthant R^N_+ is R^N_+ itself, i.e., it is self-dual.
Polar cones
Note that the dual cone is closely related to the polar cone, which is defined by

K° = { z ∈ R^N : ⟨x, z⟩ ≤ 0 for all x ∈ K } = −K*.    (40)

The conic hull cone(T) of a set T ⊂ R^N is the smallest convex cone containing T.

It can be described as

cone(T) = { ∑_j t_j x_j : t_j ≥ 0, x_j ∈ T }.    (41)
Geometrical Hahn-Banach theorem
Convex sets can be separated by hyperplanes, as stated next.

Theorem
Let K_1, K_2 ⊂ R^N be convex sets whose interiors have empty intersection. Then there exist a vector w ∈ R^N and a scalar λ such that

K_1 ⊂ { x ∈ R^N : ⟨x, w⟩ ≤ λ },
K_2 ⊂ { x ∈ R^N : ⟨x, w⟩ ≥ λ }.
Extremal points
Definition
Let K ⊂ R^N be a convex set. A point x ∈ K is called an extreme point of K if x = tw + (1 − t)z for w, z ∈ K and t ∈ (0, 1) implies x = w = z.

Theorem
A compact convex set is the convex hull of its extreme points.

If K is a polygon, then its extreme points are the zero-dimensional faces of K, and the above statement is clearly intuitive.
Proper functions
We work with extended-valued functions F : R^N → (−∞, ∞].

Sometimes we also consider an additional extension of the values to −∞. Addition, multiplication and inequalities in (−∞, ∞] are understood in the "natural" sense; for instance, x + ∞ = ∞ for all x ∈ R.

The domain of an extended-valued function F is defined as

dom(F) = { x ∈ R^N : F(x) ≠ ∞ }.

A function with dom(F) ≠ ∅ is called proper.

A function F : K → R on a subset K ⊂ R^N can be extended to an extended-valued function by setting F(x) = ∞ for x ∉ K. Then clearly dom(F) = K, and we call this extension the canonical one.
Convex functions
DefinitionAn extended valued function F : RN Ñ p´8,8s is called convex iffor all x, z P RN and t P r0, 1s,
F ptx` p1´ tqzq ď tF pxq ` p1´ tqF pzq. (42)
F is called strictly convex if the above inequality is strict for x ‰ z and t P p0, 1q.
F is called strongly convex with parameter γ ą 0 if for allx, z P RN and t P r0, 1s
F ptx` p1´ tqzq ď tF pxq ` p1´ tqF pzq ´ pγ{2qtp1´ tq‖x´ z‖22. (43)
A function F : RN Ñ r´8,8q is called (strictly, strongly) concaveif ´F is (strictly, strongly) convex.
Convex functions
Obviously, a strongly convex function is strictly convex.
The domain of a convex function is convex, and a function F : K Ñ R on a convex subset K Ă RN is called convex if its canonical extension to RN is convex.
A function F is convex if and only if its epigraph
epipF q “ tpx, rq : r ě F pxqu Ă RN ˆ R
is a convex set.
Smooth convex functions
Proposition
Let F : RN Ñ R be differentiable.
1. F is convex if and only if for all x, y P RN
F pxq ě F pyq ` x∇F pyq, x´ yy,
where the gradient is defined as ∇F pyq “ pBy1F pyq, ..., ByNF pyqqT .
2. F is strongly convex with parameter γ ą 0 if and only if for all x, y P RN
F pxq ě F pyq ` x∇F pyq, x´ yy ` pγ{2q‖x´ y‖22.
3. Assume that F is twice differentiable. Then F is convex if and only if
∇2F pxq ľ 0
for all x P RN , where ∇2F is the Hessian of F .
Composition of convex functions
Proposition
1. Let F ,G be convex functions on RN . Then, for α, β ě 0 thefunction αF ` βG is convex.
2. Let F : RÑ R be convex and nondecreasing, andG : RN Ñ R be convex. Then Hpxq “ F pG pxqq is convex.
Examples of convex functions
§ Every norm ‖¨‖ on RN is convex by the triangle inequality and homogeneity.
§ The `p-norms are strictly convex for 1 ă p ă 8 and notstrictly convex for p “ 1,8.
§ For A P RNˆN positive semidefinite, the function F pxq “ x˚Axis convex. It is strictly convex if A is positive definite.
§ For a convex set K the characteristic function
χK pxq “
#
0 x P K ,
8 else(44)
is convex.
Convex functions, continuity, and lower-semicontinuity
Proposition
Let F : RN Ñ R be a convex function. Then F is continuous onRN .
Treatment of extended valued functions requires the concept oflower semicontinuity.
Definition
A function F : RN Ñ p´8,8s is called lower semicontinuous if for all x P RN and every sequence pxjqjPN Ă RN converging to x it holds that

lim infjÑ8 F pxjq ě F pxq.
Convex functions, continuity, and lower-semicontinuity
Clearly, a continuous function F : RN Ñ R is lower semicontinuous.

A nontrivial example is the characteristic function χK of a proper subset K Ă RN . Clearly, χK is not continuous, but it is lower semicontinuous if and only if K is closed.
A function is lower semicontinuous if and only if its epigraph isclosed.
We remark that the notion of lower semicontinuity is particularly useful in infinite-dimensional Hilbert spaces, where for instance the norm ‖¨‖ is not continuous with respect to the weak topology but is still lower semicontinuous with respect to the weak topology.
Convex functions and minimization
Convex functions have nice properties related to minimization.
A (global) minimum (or minimizer) of a function F : RN Ñ p´8,8s is a point x P RN satisfying F pxq ď F pyq for all y P RN .
A local minimum of F is a point x P RN such that there exists ε ą 0 with F pxq ď F pyq for all y satisfying ‖x´ y‖2 ď ε.
Proposition
Let F : RN Ñ p´8,8s be convex.
1. A local minimum of F is a global minimum.
2. The set of minima of F is convex.
3. If F is strictly convex then the minimum is unique.
The fact that local minima of convex functions are automatically global minima is the essential reason why efficient optimization methods are available for convex optimization problems.
Jointly convex functions
We say that a function f px, yq of two arguments x P Rn, y P Rm is jointly convex if it is convex as a function of the variable z “ px, yq.
Partial minimization of a jointly convex function in one variableyields again a convex function as stated next.
Theorem
Let f : Rn ˆ Rm Ñ p´8,8s be a jointly convex function. Then the function gpxq “ infyPRm f px, yq, x P Rn, is convex.
Maxima of convex functions on compact convex sets
Theorem
Let K Ă RN be a compact convex set, and F : K Ñ R be a convex function. Then F attains its maximum at an extreme point of K .
The convex conjugate
The convex conjugate is a very useful concept in convex analysisand optimization.
Definition
Let F : RN Ñ p´8,8s. Then its convex conjugate (or Fenchel dual) function F ˚ : RN Ñ p´8,8s is defined by

F ˚pyq :“ supxPRN txx, yy ´ F pxqu .
The convex conjugate F ˚ is always a convex function, no matter whether the function F is convex or not. The definition of F ˚ immediately gives the Fenchel (or Young, or Fenchel-Young) inequality
xx, yy ď F pxq ` F ˚pyq for all x, y P RN . (45)
The convex conjugate
Let us summarize some properties of convex conjugate functions.
Proposition
Let F : RN Ñ p´8,8s.
1. The convex conjugate F ˚ is lower semicontinuous.
2. The biconjugate F ˚˚ is the largest lower semicontinuous convex function satisfying F ˚˚pxq ď F pxq for all x P RN . In particular, if F is convex and lower semicontinuous then F “ F ˚˚.
3. For τ ‰ 0 let Fτ pxq :“ F pτxq. Then pFτ q˚pyq “ F ˚py{τq.
4. For τ ą 0, pτF q˚pyq “ τF ˚py{τq.
5. For z P RN let Fzpxq :“ F px´ zq. Then pFzq˚pyq “ xz, yy ` F ˚pyq.
The biconjugate F ˚˚ is sometimes also called the convex relaxationof F because of 2.
The convex conjugate
Let us compute the convex conjugate for some examples.
Example

§ Let F pxq “ p1{2q‖x‖22, x P RN . Then F ˚pyq “ p1{2q‖y‖22 “ F pyq, y P RN . Indeed, since

xx, yy ď p1{2q‖x‖22 ` p1{2q‖y‖22

we have

F ˚pyq “ supxPRN txx, yy ´ F pxqu ď p1{2q‖y‖22 .

For the converse inequality, we just set x “ y in the definition of the convex conjugate to obtain

F ˚pyq ě ‖y‖22 ´ p1{2q‖y‖22 “ p1{2q‖y‖22 .

Note that this example is the only function F on RN satisfying F “ F ˚.
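This computation can be sanity-checked numerically. The sketch below approximates F ˚pyq for F pxq “ p1{2qx2 in one dimension by a brute-force grid maximization of xy ´ F pxq (an illustration only, not a practical method; the grid bounds and resolution are our own choices) and compares it with p1{2qy2.

```python
# Approximate F*(y) = sup_x { x*y - F(x) } on a fine grid and compare
# with the exact value 0.5*y^2 for F(x) = 0.5*x^2.

def conjugate_on_grid(F, y, lo=-10.0, hi=10.0, steps=100001):
    xs = (lo + (hi - lo) * k / (steps - 1) for k in range(steps))
    return max(x * y - F(x) for x in xs)

F = lambda x: 0.5 * x * x
for y in (-2.0, 0.0, 1.5):
    approx = conjugate_on_grid(F, y)
    exact = 0.5 * y * y          # the exact maximizer is x = y
    assert abs(approx - exact) < 1e-6
```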
The convex conjugate
§ Let F pxq “ exppxq, x P R. The map x ÞÑ xy ´ exppxq takes its maximum at x “ ln y if y ą 0, so that

F ˚pyq “ supxPR txy ´ exu “ y ln y ´ y if y ą 0, 0 if y “ 0, and 8 if y ă 0.

The Young inequality for this particular pair reads

xy ď ex ` y lnpyq ´ y for all x P R, y ą 0.
§ Let F “ χK be the characteristic function of a convex set K . Its convex conjugate is given by

F ˚pyq “ supxPK xx, yy ,

the support function of K .
The subdifferential
The subdifferential generalizes the gradient to functions that are not necessarily differentiable.
Definition
The subdifferential of a convex function F : RN Ñ R at a point x P RN is defined by
BF pxq “ tv P RN : F pzq ´ F pxq ě xv, z´ xy for all z P RNu . (46)
The elements of BF pxq are called subgradients of F at x.
The subdifferential BF pxq of a convex function F is always non-empty. If F is differentiable at x then BF pxq contains only the gradient,
BF pxq “ t∇F pxqu .
The subdifferential
A simple example of a function with a non-trivial subdifferential is the absolute value F pxq “ |x |, for which
BF pxq “ tsignpxqu if x ‰ 0, and BF pxq “ r´1, 1s if x “ 0,
where signpxq “ `1 for x ą 0 and signpxq “ ´1 for x ă 0 as usual.
The subdifferential
The subdifferential allows a simple characterization of minimizers of convex functions.
Theorem
A vector x is a minimum of F if and only if 0 P BF pxq.
The subdifferential and conjugation
Convex conjugate functions and subdifferentials are related in the following way.
Theorem
Let F : RN Ñ p´8,8s be a convex function and x, y P RN . The following conditions are equivalent:
1. y P BF pxq,
2. F pxq ` F ˚pyq “ xx, yy.
If additionally F is lower semicontinuous, then 1 and 2 are equivalent to
3. x P BF ˚pyq.
As a consequence, if F is a convex lower semicontinuous function then BF is the inverse of BF ˚ in the sense that x P BF ˚pyq if and only if y P BF pxq.
Proximal mapping
Let F : RN Ñ p´8,8s be a convex function. Then, for z P RN , the function

x ÞÑ F pxq ` p1{2q‖x´ z‖22

is strictly convex due to the strict convexity of x ÞÑ ‖x‖22. By a previous result, its minimizer is unique. The mapping

PF pzq :“ arg min tF pxq ` p1{2q‖x´ z‖22 : x P RNu

is called the proximal mapping associated with F . In the special case that F “ χK is the characteristic function of a convex set K defined in (44), then PK :“ PχK is the orthogonal projection onto K , that is,

PK pzq “ arg minxPK ‖x´ z‖2 .
If K is a subspace of RN then PK is the usual orthogonal projection onto K and, in particular, a linear map.
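The subspace case can be made concrete. The sketch below (our own example, not from the lecture) projects onto the line K “ spantp1, 1qu in R2; the projection z ÞÑ xz, uy u with the unit vector u is visibly linear in z.

```python
# Orthogonal projection onto the subspace K = span{(1, 1)} in R^2,
# i.e. the proximal mapping of the characteristic function of K.
import math

u = (1 / math.sqrt(2), 1 / math.sqrt(2))   # unit vector spanning K

def project_line(z):
    c = z[0] * u[0] + z[1] * u[1]          # coefficient <z, u>
    return (c * u[0], c * u[1])

p = project_line((3.0, 1.0))
print(p)   # close to (2.0, 2.0): each coordinate becomes the mean
```

Doubling the input doubles the output, which illustrates that the projection onto a subspace is a linear map.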
Proximal mapping
Proposition
Let F : RN Ñ p´8,8s be a convex function. Then x “ PF pzq if and only if z P x` BF pxq.
The previous proposition justifies writing
PF “ pI ` BF q´1
Moreau’s identity relates the proximal mappings of F and F˚.
Theorem (Moreau's Identity)
Let F : RN Ñ p´8,8s be a lower semicontinuous convex function. Then, for all z P RN ,
PF pzq ` PF˚pzq “ z .
If PF is easy to compute, then the previous result shows that also PF˚pzq “ z´ PF pzq is easy to compute. It is useful to note that applying Moreau's identity to the function τF for some τ ą 0 shows that

PτF pzq ` τPτ´1F˚pz{τq “ z. (47)
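Moreau's identity can be checked numerically in one dimension. For F pxq “ |x |, PF is soft thresholding with threshold 1 (see the example below on the shrinkage operator), and F ˚ is the characteristic function of r´1, 1s, so PF˚ is the projection (clipping) onto r´1, 1s; this sketch verifies that the two proximal mappings sum to the identity.

```python
# Numerical check of Moreau's identity P_F(z) + P_{F*}(z) = z for F = |.|:
# P_F is soft thresholding with threshold 1, and P_{F*} clips to [-1, 1].

def soft_threshold(z, tau=1.0):
    if z > tau:
        return z - tau
    if z < -tau:
        return z + tau
    return 0.0

def clip(z, lo=-1.0, hi=1.0):
    return max(lo, min(hi, z))

for z in (-3.0, -0.4, 0.0, 0.7, 2.5):
    assert abs(soft_threshold(z) + clip(z) - z) < 1e-12
```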
Nonexpansiveness of proximal mappings
Theorem
For a convex function F : RN Ñ p´8,8s the proximal mapping PF is nonexpansive,

‖PF pzq ´ PF pz1q‖2 ď ‖z´ z1‖2 for all z, z1 P RN .
Relevant example
Let F pxq “ |x |, x P R, be the absolute value function. Then a straightforward computation shows that, for τ ą 0,
PτF pyq “ arg minxPR tp1{2qpx ´ yq2 ` τ |x |u “ Sτ pyq, (48)

where

Sτ pyq “ y ´ τ if y ě τ, 0 if |y | ď τ, and y ` τ if y ď ´τ.
The function Sτ is called the soft thresholding or shrinkage operator. More generally, if F pxq “ ‖x‖1 is the `1-norm on RN , then the minimization problem defining the proximal operator decouples, and PτF pyq, y P RN , is given entrywise by

PτF pyql “ Sτ pylq, l P rNs. (49)
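The shrinkage operator (48)-(49) can be sketched as follows; the brute-force check (a grid search over a bounded interval, our own testing device) confirms in each coordinate that Sτ indeed minimizes p1{2qpx ´ yq2 ` τ |x |.

```python
# Soft thresholding S_tau from (48), applied entrywise as in (49),
# plus a brute-force check that it minimizes 0.5*(x - y)^2 + tau*|x|.

def S(tau, y):
    if y >= tau:
        return y - tau
    if y <= -tau:
        return y + tau
    return 0.0

def prox_l1(tau, y_vec):
    return [S(tau, yl) for yl in y_vec]

tau = 0.5
y = [2.0, -0.3, 0.75, -1.5]
print(prox_l1(tau, y))    # [1.5, 0.0, 0.25, -1.0]

obj = lambda x, yl: 0.5 * (x - yl) ** 2 + tau * abs(x)
for yl in y:
    grid = [-3 + 6 * k / 60000 for k in range(60001)]
    best = min(grid, key=lambda x: obj(x, yl))
    assert abs(best - S(tau, yl)) < 1e-3
```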
Optimization problems
An optimization problem is of the form
minxPRN F0pxq subject to Ax “ y, (50)

Fjpxq ď bj , j P rMs, (51)
where the function F0 : RN Ñ p´8,8s is called the objective function, the functions F1, ...,FM : RN Ñ p´8,8s are called constraint functions, and A P RmˆN , y P Rm provide the equality constraint. A point x satisfying the constraints is called feasible, and (50)-(51) is called feasible if there exists a feasible point. A feasible point x7 for which the minimum is attained, that is, F0px7q ď F0pxq for all feasible x, is called a minimizer or optimal point, and F0px7q is the optimal value.
Optimization problems
An optimization problem is of the form
minxPRN F0pxq subject to Ax “ y, Fjpxq ď bj , j P rMs.
We note that the equality constraint may be removed and represented by inequality constraints of the form xAj , xy ď yj and ´xAj , xy ď ´yj , where Aj P RN is a row of A. The set of feasible points described by the constraints is given by
K “ tx P RN : Ax “ y, Fjpxq ď bj , j P rMsu. (52)
Optimization problems
The optimization problem (50)-(51) is equivalent to the problem of minimizing F0 over K ,

minxPK F0pxq. (53)

Then our optimization problem is likewise equivalent to the unconstrained optimization problem

minxPRN F0pxq ` χK pxq.
Convex optimization problems
A convex optimization problem (or convex program) is a problem of the form (50)-(51), in which the objective function F0 and the constraint functions Fj are convex. In this case, the set of feasible points K defined in (52) is convex. The convex optimization problem is then equivalent to the unconstrained optimization problem of finding the minimum of the convex function
F pxq “ F0pxq ` χK pxq.
Due to this equivalence we may freely switch between constrained and unconstrained optimization problems. A linear optimization problem (or linear program) is one where the objective function F0 and all constraint functions F1, ...,FM are linear. Clearly, this is a special case of a convex optimization problem.
Lagrange function
The Lagrange function of an optimization problem of the form (50)-(51) is defined for x P RN , ξ P Rm, ν P RM , νl ě 0, l P rMs, by
Lpx, ξ,νq “ F0pxq ` ξ˚pAx´ yq ` řMl“1 νlpFlpxq ´ blq. (54)
For an optimization problem without inequality constraints we clearly set
Lpx, ξq “ F0pxq ` ξ˚pAx´ yq. (55)
The variables ξ,ν are called Lagrange multipliers. For ease of notation we write ν ě 0 if νl ě 0 for all l P rMs.
Lagrange dual function
The Lagrange dual function is defined by
Hpξ,νq “ infxPRN Lpx, ξ,νq, ξ P Rm, ν P RM , ν ě 0.
If x ÞÑ Lpx, ξ,νq is unbounded from below, then we set Hpξ,νq “ ´8. Again, if there are no inequality constraints then
Hpξq “ infxPRN Lpx, ξq “ infxPRN tF0pxq ` ξ˚pAx´ yqu , ξ P Rm.
The dual function is always concave because it is the pointwise infimum of a family of affine functions, even if the original problem (50)-(51) is not convex.
Lagrange dual function
The dual function provides a bound on the optimal value F0px7q of the minimization problem (50),

Hpξ,νq ď F0px7q for all ξ P Rm, ν ě 0. (56)
Indeed, if x is a feasible point for (50) then Ax´ y “ 0 and Flpxq ď bl , l “ 1, ...,M, so that, for all ξ P Rm and ν ě 0,
ξ˚pAx´ yq ` řMl“1 νlpFlpxq ´ blq ď 0.
Therefore,
Lpx, ξ,νq “ F0pxq ` ξ˚pAx´ yq ` řMl“1 νlpFlpxq ´ blq ď F0pxq.
Taking the infimum over all x P RN on the left hand side, and over all feasible x on the right hand side, shows (56).
Dual problem
We would like this lower bound to be as tight as possible. This motivates considering the optimization problem
max Hpξ,νq subject to ν ě 0. (57)
This optimization problem is called the dual problem to (50)-(51), which in this context is sometimes called the primal problem. Since H is concave, this problem is equivalent to the convex optimization problem of minimizing the convex function ´H subject to the positivity constraint ν ě 0. A pair pξ,νq, ξ P Rm, ν ě 0, is called dual feasible. A (feasible) maximizer pξ7,ν7q of (57) is referred to as dual optimal or optimal Lagrange multipliers. If x7 is optimal for the primal problem (50)-(51) then the triple px7, ξ7,ν7q is called primal-dual optimal.
Weak and strong duality
Clearly

Hpξ7,ν7q ď F0px7q. (58)
This inequality is called weak duality. For most (but not all)convex optimization problems even strong duality holds,
Hpξ7,ν7q “ F0px7q. (59)
Theorem (Slater’s constraint qualification)
Assume that F0,F1, ...,FM are convex functions with dompF0q “ RN . If there exists x P RN such that Ax “ y and Flpxq ă bl for all l P rMs, then strong duality holds for the optimization problem (50)-(51). In case there are no inequality constraints, strong duality holds provided there exists x with Ax “ y, that is, if (50)-(51) is feasible.
Saddle-point interpretation
Lagrange duality has a saddle-point interpretation. For ease of exposition we consider (50)-(51) without inequality constraints, but note that extensions that include inequality constraints are easily derived in the same way. Let px7, ξ7q be primal-dual optimal. Recalling the definition of the Lagrange function L we have
supξPRm Lpx, ξq “ supξPRm pF0pxq ` ξ˚pAx´ yqq “ F0pxq if Ax “ y, and 8 otherwise. (60)
In other words, the above supremum is 8 if x is not feasible. The (feasible) minimizer x7 of the primal problem (50)-(51) therefore satisfies
F0px7q “ infxPRN supξPRm Lpx, ξq .
Saddle-point interpretation
On the other hand, a dual optimal vector ξ7 satisfies
Hpξ7q “ supξPRm infxPRN Lpx, ξq
by definition of the Lagrange dual function. Weak duality therefore implies
supξPRm infxPRN Lpx, ξq ď infxPRN supξPRm Lpx, ξq ,
while strong duality reads
supξPRm infxPRN Lpx, ξq “ infxPRN supξPRm Lpx, ξq .
In other words, the order of minimization and maximization can be interchanged in the case of strong duality. This property is called the strong max-min property or saddle-point property. Indeed, in this case, a primal-dual optimal px7, ξ7q is a saddle point of the Lagrange function,

Lpx7, ξq ď Lpx7, ξ7q ď Lpx, ξ7q for all x P RN , ξ P Rm. (61)
One application
Theorem
Let ‖¨‖,~¨~ be two norms on RN . For A P RmˆN , y P Rm and η ą 0 consider the optimization problem

minxPRN ~x~ subject to ‖Ax´ y‖ ď η . (62)
Assume that (62) is strictly feasible in the sense that there exists x P RN such that ‖Ax´ y‖ ă η. Consider a minimizer x7 of (62). Then there exists a parameter λ ě 0 such that x7 is also the minimizer of the optimization problem

minxPRN ~x~ ` λ‖Ax´ y‖ . (63)
Conversely, for λ ą 0 let x7 be a minimizer of (63). Then there exists η ą 0 such that x7 is a minimizer of (62).
Equivalent problems
1. The same type of statement and proof is valid for the pair of optimization problems (63) and
minxPRN ‖Ax´ y‖ subject to ~x~ ď t, t ą 0. (64)

Note that in this case strict feasibility always holds because the zero vector x “ 0 satisfies ~x~ “ 0 ă t.
2. The equivalence of the problems (62), (63), (64) is somewhat "theoretical" because the parameter transformation between λ and η, or λ and t, is implicit.
A typical general form of convex optimization problem in data analysis
We consider a convex optimization problem of the form
minxPRN F pAxq ` G pxq, (65)
with A P RmˆN and convex functions F : Rm Ñ p´8,8s, G : RN Ñ p´8,8s. Most of the relevant convex optimization problems in data analysis fall into this class. The substitution z “ Ax yields the equivalent problem
minxPRN ,zPRm F pzq ` G pxq subject to Ax´ z “ 0. (66)
A typical general form of convex optimization problem in data analysis
The Lagrange dual function to this problem is given by
Hpξq “ infx,z tF pzq ` G pxq ` xA˚ξ, xy ´ xξ, zyu
“ ´ supzPRm txξ, zy ´ F pzqu ´ supxPRN txx,´A˚ξy ´ G pxqu
“ ´F ˚pξq ´ G˚p´A˚ξq, (67)
where F ˚ and G˚ are the convex conjugate functions of F and G ,respectively. Therefore, the dual problem of (65) is
maxξPRm ´F ˚pξq ´ G˚p´A˚ξq. (68)
A typical general form of convex optimization problem in data analysis
Theorem
Let A P RmˆN and F : Rm Ñ p´8,8s, G : RN Ñ p´8,8s be proper convex functions such that either dompF q “ Rm or dompG q “ RN and there exists x such that Ax P dompF q. Assume that the optima in (65) and (68) are attained. Then strong duality holds in the form
minxPRN F pAxq ` G pxq “ maxξPRm ´F ˚pξq ´ G˚p´A˚ξq.
Furthermore, a primal-dual optimum px7, ξ7q is a solution to the saddle point problem
minxPRN maxξPRm xAx, ξy ` G pxq ´ F ˚pξq, (69)
where F ˚ is the convex conjugate of F .
Gradient descent
Assume we are given a function f : RN Ñ R which we wish to minimize, finding x˚ “ arg minx f pxq. We know that a function locally grows according to its gradient ∇f pxq. Hence, if we move a bit from any point x in the direction opposite to the gradient, i.e., we pass to x ´ α∇f pxq for a small α ą 0, we may expect that f gets smaller; in other words, we are descending.
Gradient descent
Let us introduce the gradient descent algorithm by considering first a time-dependent gradient flow evolution (ordinary differential equation) of the type
9xptq “ ´∇f pxptqq xp0q “ x p0q. (70)
In the following we make the following assumption: ∇f pxq is Lipschitz continuous with constant L. If f is twice differentiable, this is achieved by asking that the largest eigenvalue of the Hessian is bounded by L (∇2f pxq ĺ LI ). But more directly, there must be an L such that

‖∇f pxq ´∇f pyq‖2 ď L‖x ´ y‖2, for all x , y P RN .
Co-coercivity lemma
Lemma (Co-coercivity)
For a convex smooth function f whose gradient has Lipschitzconstant L ą 0, it holds
‖∇f pxq ´∇f pyq‖22 ď Lxx ´ y ,∇f pxq ´∇f pyqy.
Convergence of the gradient flow
Theorem
Assume f P C2pRNq is strongly convex, i.e., ∇2f pxq ľ γ2I for γ ą 0, and has bounded Hessian, i.e., ∇2f pxq ĺ LI . Then the evolution defined by (70) always converges to the minimizer x˚ “ arg minxPRN f pxq.
The gradient descent iteration
A natural time discretization of the gradient flow is given by theso-called explicit Euler scheme
x pn`1q “ x pnq ´ αpnq∇f px pnqq for x p0q given. (71)
Here the parameters αpnq are typically chosen adaptively with a line search. However, a constant choice αpnq “ α, sufficiently small, will also suffice. Let us now show a convergence result for the discrete algorithm (71).
The gradient descent iteration
Let us recall once again that the convexity assumption on f tells us that f is lower bounded by an affine function,

f pyq ě f pxq `∇f pxqT py ´ xq for all x , y P RN . (72)
If ∇f pxq is Lipschitz continuous with constant L then f is additionally upper bounded by a quadratic function (this is not obvious and is left as an exercise),

f pyq ď f pxq `∇f pxqT py ´ xq ` pL{2q‖y ´ x‖22 . (73)
The gradient descent iteration
Theorem
Assume that f is convex and finite for all x, a finite solution x˚ exists, and ∇f pxq is Lipschitz continuous with constant L. Then the iteration (71) with α “ αpnq ď 1{L converges to x˚.
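The iteration (71) with the constant step size α “ 1{L can be sketched on a strongly convex quadratic, where the gradient and the Lipschitz constant are explicit (the matrix Q and vector b below are our own toy data).

```python
# Gradient descent x^(n+1) = x^(n) - alpha * grad f(x^(n)) with alpha = 1/L
# on f(x) = 0.5*<x, Qx> - <b, x>; here grad f(x) = Qx - b and L is the
# largest eigenvalue of the symmetric positive definite matrix Q.
import numpy as np

Q = np.array([[3.0, 1.0], [1.0, 2.0]])
b = np.array([1.0, -1.0])

grad = lambda x: Q @ x - b
L = np.linalg.eigvalsh(Q).max()
alpha = 1.0 / L

x = np.zeros(2)
for _ in range(500):
    x = x - alpha * grad(x)

x_star = np.linalg.solve(Q, b)   # exact minimizer solves Qx = b
assert np.allclose(x, x_star)
```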
Stochastic gradient descent
The gradient descent algorithm is simple to implement and widely used for machine learning purposes. However, it requires the ability of computing the gradient ∇f pxq at any point very efficiently. This is not always the case. In fact, often in machine learning the objective function to be minimized is of the type
f pxq “ Ef px , ¨q “ żΩ f px , ωq dPpωq,
where f px , ¨q : Ω Ñ R are all random variables over an appropriate probability space pΩ,Σ,Pq. This would imply that
∇f pxq “ żΩ ∇f px , ωq dPpωq,
and the computation of ∇f pxq would require the computation of ∇f px , ωq for all the events ω P Ω. Often Ω represents the training data in supervised problems, and it can be extremely large.
Stochastic gradient descent
For this particular setting there is a way out, which stems from injecting again a randomized sampling to reduce the complexity: at each iteration n we pick ωpnq at random, independently, and instead of considering ∇f px pnqq in a gradient descent iteration, we consider ∇x f px pnq, ωpnqq:
x pn`1q “ x pnq ´ α∇x f px pnq, ωpnqq for x p0q given. (74)
Of course, no two runs of this algorithm are the same, as it depends on the sequence of randomized samples ωpnq. However, the average behavior of this algorithm can be estimated with elementary methods.
Stochastic gradient descent
Theorem
Let each f p¨, ωq be convex, where ∇x f px , ωq has Lipschitz constant Lpωq ą 0 with sup L :“ supωPΩ Lpωq ă `8, and let f pxq “ Ef px , ¨q be γ-strongly convex. Set σ2 “ E‖∇x f px˚, ¨q‖22, where x˚ “ arg minx f pxq. Suppose that α ă 1{ sup L. Then the stochastic gradient descent iteration satisfies

E‖x pnq ´ x˚‖22 ď p1´ 2αγp1´ α sup Lqqn ‖x p0q ´ x˚‖22 ` ασ2{pγp1´ α sup Lqq, (75)

where the expectation is over the sequence of sampling pωpnqqnPN.
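The iteration (74) can be sketched for a least-squares objective, where Ω is a finite training set of m samples and each ωpnq picks one row. The data below are synthetic and noiseless (our own toy instance), so σ2 “ 0 at the minimizer and the iterates converge to x_true.

```python
# Stochastic gradient descent (74) for f(x) = (1/m) * sum_i 0.5*(<a_i, x> - y_i)^2:
# at each step a single term, drawn uniformly at random, supplies the gradient.
import numpy as np

rng = np.random.default_rng(0)
m, N = 200, 3
A = rng.standard_normal((m, N))
x_true = np.array([1.0, -2.0, 0.5])
y = A @ x_true                       # noiseless data: the minimizer is x_true

alpha = 0.01
x = np.zeros(N)
for n in range(20000):
    i = rng.integers(m)              # sample omega^(n)
    g = (A[i] @ x - y[i]) * A[i]     # gradient of the sampled term
    x = x - alpha * g

assert np.linalg.norm(x - x_true) < 1e-3
```

With noisy data the error would instead saturate at the variance-dependent level ασ2{pγp1´ α sup Lqq from (75).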
Stochastic gradient descent
If we are given a desired tolerance E‖x pnq ´ x˚‖22 ď ε to be achieved, and we knew the Lipschitz constants and the parameter of strong convexity, we may optimize the step-size α and obtain:
Corollary
For any desired ε ą 0, using a step-size of
α “ γε{p2εγ sup L` 2σ2q
we have that after
n ě 2 logp2ε0{εq psup L{γ ` σ2{pγ2εqq , (76)
the stochastic gradient descent iterations fulfill
E‖x pnq ´ x˚‖22 ď ε,

where ε0 “ ‖x p0q ´ x˚‖22.
A Primal Dual Algorithm
Many of the most common tasks in data analysis and machine learning can often be approached by solving optimization problems of the form

minxPCN F pAxq ` G pxq (77)
with A being an m ˆ N matrix and F ,G being convex functions with values in p´8,8s. We already investigated the corresponding dual problem, given by

maxξPCm ´F ˚pξq ´ G˚p´A˚ξq, (78)
where

F ˚pξq “ suptRexz , ξy ´ F pzq : z P Cmu
is the convex (Fenchel) conjugate of F , and likewise G˚ is theconvex conjugate of G . The joint primal-dual optimization isequivalent to solving the saddle point problem
minxPCN maxξPCm RexAx , ξy ` G pxq ´ F ˚pξq. (79)
A Primal Dual Algorithm
The primal dual algorithm introduced by Chambolle and Pock and described below solves this problem iteratively. In order to formulate it, one needs to introduce the so-called proximal mapping (proximity operator). For the convex function G it is defined as

PG pτ ; zq :“ arg minxPCN tτG pxq ` p1{2q‖x ´ z‖22u .
One also requires below the proximal mapping associated with the convex conjugate F ˚.
A Primal Dual Algorithm
The primal dual algorithm performs an iteration on the dual variable, the primal variable and an auxiliary primal variable. Starting from initial points x0 P CN , x̄0 “ x0, ξ0 P Cm and parameters τ, σ ą 0, θ P r0, 1s, one iteratively computes

ξn`1 :“ PF˚pσ; ξn ` σAx̄nq, (80)

xn`1 :“ PG pτ ; xn ´ τA˚ξn`1q, (81)

x̄n`1 :“ xn`1 ` θpxn`1 ´ xnq. (82)
It can be shown that px , ξq is a fixed point of these iterations if and only if px , ξq is a saddle point.
A Primal Dual Algorithm
For the case that θ “ 1, the convergence of the primal dual algorithm has been established:

Theorem
Consider the primal dual algorithm (80)-(82) with θ “ 1 and τ, σ ą 0 such that τσ‖A‖2 ă 1. If the problem (79) has a saddle point, then the sequence pxn, ξnq converges to a saddle point of (79), and in particular pxnq converges to a minimizer of (77).
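The iteration (80)-(82) can be sketched on the model problem min p1{2q‖Ax ´ y‖22 ` λ‖x‖1, i.e. F pzq “ p1{2q‖z ´ y‖22 and G “ λ‖¨‖1, for which both proximal mappings are explicit; the instance below (our own choice, with A “ I so that the solution is the soft thresholding Sλpyq) makes the result easy to verify.

```python
# Chambolle-Pock primal dual iteration (80)-(82) with theta = 1 for
# min_x 0.5*||Ax - y||^2 + lam*||x||_1.  Here prox of sigma*F* is
# v |-> (v - sigma*y)/(1 + sigma), and prox of tau*G is soft thresholding.
import numpy as np

def soft(v, t):
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

A = np.eye(2)
y = np.array([2.0, -0.3])
lam = 0.5

tau = sigma = 0.99 / np.linalg.norm(A, 2)   # ensures tau*sigma*||A||^2 < 1
theta = 1.0

x = np.zeros(2)
x_bar = x.copy()
xi = np.zeros(2)
for _ in range(20000):
    xi = (xi + sigma * (A @ x_bar) - sigma * y) / (1.0 + sigma)  # prox of sigma*F*
    x_new = soft(x - tau * (A.T @ xi), tau * lam)                # prox of tau*G
    x_bar = x_new + theta * (x_new - x)                          # extrapolation
    x = x_new

assert np.allclose(x, soft(y, lam), atol=1e-5)   # solution is S_lam(y)
```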
Nonconvex optimization: convex relaxation
Two approaches. The first is determining an equivalent convex optimization problem, which can be tackled, e.g., with the methods we described above.

[Figure: a nonconvex function f and a convex envelope g ď f from below.]
If we were interested in solving an optimization problem

minxPK f pxq,

where f is a nonconvex, lower semicontinuous function, and K is a closed convex set, it might be convenient to recast the problem by considering its convexification, i.e.,

minxPK f̄ pxq,

where f̄ is called the convex relaxation or the convex envelope of f and is given by

f̄ pxq :“ sup tgpxq : g is a convex function with g ď f u.
Nonconvex optimization: convex relaxation
The motivation of this choice is simply geometrical. While f can have many minimizers on K , its convex envelope f̄ has global minimizers, and such global minimizers are likely to be in a neighborhood of a global minimizer of f . Actually, if K is compact and f is smooth enough, then the global minima of f and f̄ must necessarily coincide. We show concrete examples of convex relaxation of nonconvex problems, also nonsmooth ones related to sparsity optimization, in following lectures on compressed sensing.

Unfortunately, the precise computation of f̄ might in some cases be a very difficult problem, and one might need to consider a second option.
Nonconvex optimization: graduated nonconvexity or sequential quadratic programming

If the nonconvex objective function is sufficiently smooth, then it is locally convexifiable, and one can use this property to solve the nonconvex optimization by using so-called graduated nonconvexity or sequential quadratic programming.
From now on, we will make the following more specific assumptions about the function f :
(A1) f is ω-semi-convex, that is, there exists ω ą 0 such that f p¨q ` ω‖¨‖2 is convex;
(A2) the subdifferential of f satisfies the following growth condition: there exist two nonnegative numbers K , L such that, for every v P RN and ξ P Bf pvq,

‖ξ‖ ď Kf pvq ` L . (83)
Nonconvex optimization: graduated nonconvexity or sequential quadratic programming
Given ω ą 0 and u P RN we will denote

fω,upvq :“ f pvq ` ω‖v ´ u‖2 . (84)
We observe that, if f satisfies condition (A1), we can always assume that ω is chosen in such a way that fω,u is also γ-strongly convex, with γ depending on f and ω but not on u. Analogously, if (A1) and (A2) are satisfied, it is easy to see that fω,u satisfies (83) with two constants K , L depending again on f and ω, but not on u. We now present a typical algorithm for constrained nonsmooth and nonconvex minimization, based on successive convex approximations:
Nonconvex optimization: graduated nonconvexity or sequential quadratic programming
We pick an initial v p0q P RN . Notice that there is no restriction to any specific neighborhood for the choice of the initial iterate. We define the successive iterations by:
v p``1q “ arg minvPK fω,v p`qpvq. (85)
Here, thanks to condition (A1), ω is chosen in such a way that fω,v p`q is γ-strongly convex, so the minimization executed at each iteration has a unique solution and can be performed by one's favorite convex optimization method/algorithm.
Under appropriate assumptions on f one can prove that the iterates v p`q produced by the algorithm (85) converge to critical points of the nonconvex function f .
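The iteration (85) can be sketched in one dimension for the double-well function f pvq “ pv2 ´ 1q2, which satisfies (A1) since f 2pvq “ 12v2 ´ 4 ě ´4; we take ω “ 3 (our choice) so that each surrogate fω,v p`q is strongly convex. The inner convex minimization is carried out by a simple grid search over K “ r´2, 2s, standing in for one's favorite convex solver.

```python
# Successive convex approximation (85): v^(l+1) = argmin_{v in K} f(v) + omega*(v - v^(l))^2
# for the double-well f(v) = (v^2 - 1)^2, omega = 3 (so the surrogate is strongly convex).

f = lambda v: (v * v - 1.0) ** 2
omega = 3.0
grid = [-2.0 + 4.0 * k / 4000 for k in range(4001)]   # K = [-2, 2], step 1e-3

def step(v_prev):
    # minimize the strongly convex surrogate over K (grid search for simplicity)
    return min(grid, key=lambda v: f(v) + omega * (v - v_prev) ** 2)

v = 0.5                    # initial point; no neighborhood restriction needed
for _ in range(100):
    v = step(v)

print(v)                   # approaches the critical point v = 1 of f
```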