Foundations of Data Analysis (MA4800)
Massimo Fornasier
Fakultät für Mathematik, Technische Universität München
[email protected]
http://www-m15.ma.tum.de/
Slides of Lecture, July 12, 2017
Preliminaries on Linear Algebra (LA)
We take for granted familiarity with the basics of LA taught in standard courses, in particular,
§ vector spaces, spans and linear combinations, linear bases, linear maps and matrices, eigenvalues, complex numbers, scalar products, theory of symmetric matrices.
For more details we refer to the lecture notes (in German)
G. Kemper, Lineare Algebra für Informatik, TUM, 2017,
of the course taught for Computer Scientists at TUM, or to any other standard international text on LA.
Matrix Notations
The index sets $I, J, L, \ldots$ are assumed to be finite.
As soon as complex conjugate values appear¹, the scalar field is restricted to $\mathbb{K} \in \{\mathbb{R}, \mathbb{C}\}$.
The set $\mathbb{K}^{I \times J}$ is the vector space of matrices $A \in \mathbb{K}^{I \times J}$, whose entries are denoted by $A_{ij}$, for $i \in I$ and $j \in J$. Vice versa, numbers $a_{ij} \in \mathbb{K}$ may be used to define $A := (a_{ij})_{i \in I, j \in J} \in \mathbb{K}^{I \times J}$.
If $A \in \mathbb{K}^{I \times J}$, the transposed matrix is $A^T \in \mathbb{K}^{J \times I}$, with $A^T_{ji} := A_{ij}$. A matrix $A \in \mathbb{K}^{I \times I}$ is symmetric if $A^T = A$. The Hermitian transposed matrix $A^H \in \mathbb{K}^{J \times I}$ coincides with $\overline{A}^T$. If $\mathbb{K} = \mathbb{R}$ then clearly $A^H = A^T$. A Hermitian matrix satisfies $A^H = A$.
Often we will need to consider the rows $A^{(i)}$ and the columns $A_{(j)}$ of a matrix, defined respectively as the vectors $A^{(i)} = (a_{ij})_{j \in J}$ and $A_{(j)} = (a_{ij})_{i \in I} = (A^T)^{(j)}$.
¹In the case $\mathbb{K} = \mathbb{R}$, $\overline{\alpha} = \alpha$ for all $\alpha \in \mathbb{K}$.
Matrix Notations
We assume that the standard matrix-vector and matrix-matrix multiplications are defined in the usual way: $(Ax)_i = \sum_{j \in J} a_{ij} x_j$ and $(AB)_{i\ell} = \sum_{j \in J} A_{ij} B_{j\ell}$, for $x \in \mathbb{K}^J$, $A \in \mathbb{K}^{I \times J}$ and $B \in \mathbb{K}^{J \times L}$.
The Kronecker symbol is defined by
$\delta_{ij} = \begin{cases} 1, & \text{if } i = j \in I, \\ 0, & \text{otherwise.} \end{cases}$
The unit vector $e^{(i)} \in \mathbb{K}^I$ is defined by $e^{(i)} = (\delta_{ij})_{j \in I}$.
The symbol $I = (\delta_{ij})_{i \in I, j \in I}$ is used for the identity matrix. Since matrices and index sets do not appear in the same place, the simultaneous use of the symbol $I$ does not create confusion (example: $I \in \mathbb{K}^{I \times I}$).
Matrix Notations
The range of a matrix $A \in \mathbb{K}^{I \times J}$ is
$\mathrm{range}(A) = \{Ax : x \in \mathbb{K}^J\} = \mathrm{span}\{A_{(j)},\ j \in J\}.$
Hence, the range of a matrix is the vector space spanned by its columns.
The Euclidean scalar product in $\mathbb{K}^I$ is given by
$\langle x, y \rangle = y^H x = \sum_{i \in I} x_i \overline{y_i}, \qquad (1)$
where for $\mathbb{K} = \mathbb{R}$ the conjugation can be ignored.
It is often useful and very important to notice that the matrix-vector product $Ax$ and the matrix-matrix product $AB$ can also be expressed in terms of scalar products of the rows of $A$ with $x$ and of the rows of $A$ with the columns of $B$, i.e., according to our notation,
$(Ax)_i = \langle A^{(i)}, x \rangle, \qquad (AB)_{i\ell} = \langle A^{(i)}, B_{(\ell)} \rangle. \qquad (2)$
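As a quick sanity check (not part of the original slides), the identities in (2) can be verified numerically. A minimal NumPy sketch over $\mathbb{K} = \mathbb{R}$, where the scalar product needs no conjugation; the matrix sizes are arbitrary choices:

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((3, 4))
B = rng.standard_normal((4, 2))
x = rng.standard_normal(4)

# (Ax)_i = <A^(i), x>: entry i of Ax is the scalar product of row i of A with x
Ax_via_rows = np.array([np.dot(A[i, :], x) for i in range(A.shape[0])])
assert np.allclose(Ax_via_rows, A @ x)

# (AB)_il = <A^(i), B_(l)>: entry (i, l) pairs row i of A with column l of B
AB_via_rows = np.array([[np.dot(A[i, :], B[:, l]) for l in range(B.shape[1])]
                        for i in range(A.shape[0])])
assert np.allclose(AB_via_rows, A @ B)
```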
Matrix Notations
Two vectors $x, y \in \mathbb{K}^I$ are orthogonal (and we may write $x \perp y$) if $\langle x, y \rangle = 0$.
We often consider sets which are mutually orthogonal. In this case, for $X, Y \subset \mathbb{K}^I$ we say that $X$ is orthogonal to $Y$, and we write $X \perp Y$, if $\langle x, y \rangle = 0$ for all $x \in X$ and $y \in Y$.
When $X$ and $Y$ are linear subspaces, their orthogonality can be simply checked by showing that they possess bases which are mutually orthogonal. This is a very important principle we will use often.
A family of vectors $X = \{x_\nu\}_{\nu \in F} \subset \mathbb{K}^I$ is orthogonal if the vectors $x_\nu$ are pairwise orthogonal, i.e., $\langle x_\nu, x_{\nu'} \rangle = 0$ for $\nu \neq \nu'$. The family is additionally called orthonormal if $\langle x_\nu, x_\nu \rangle = 1$ for all $\nu \in F$.
Matrix Notations
A matrix $A \in \mathbb{K}^{I \times J}$ is called orthogonal if the columns of $A$ are orthonormal, equivalently if
$A^H A = I \in \mathbb{K}^{J \times J}.$
An orthogonal square matrix $A \in \mathbb{K}^{I \times I}$ is called unitary. Differently from merely orthogonal matrices, unitary matrices satisfy
$A^H A = A A^H = I,$
i.e., $A^H = A^{-1}$ is the inverse matrix of $A$.
Assume that the index sets satisfy either $I \subset J$ or $J \subset I$. Then a (rectangular) matrix $A \in \mathbb{K}^{I \times J}$ is diagonal if $A_{ij} = 0$ for all $i \neq j$.
Matrix Rank
Proposition
Let $A \in \mathbb{K}^{I \times J}$. The following statements are equivalent and may all be used as a definition of the matrix rank $r = \mathrm{rank}(A)$.
§ (a) $r = \dim \mathrm{range}(A)$;
§ (b) $r = \dim \mathrm{range}(A^H)$;
§ (c) $r$ is the maximal number of linearly independent rows of $A$;
§ (d) $r$ is the maximal number of linearly independent columns of $A$;
§ (e) $r$ is minimal with the property $A = \sum_{i=1}^r a_i b_i^H$, where $a_i \in \mathbb{K}^I$, $b_i \in \mathbb{K}^J$;
§ (f) $r$ is maximal with the property that there exists an invertible $r \times r$ submatrix of $A$;
§ (g) $r$ is the number of positive singular values (soon!).
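Several of these characterizations can be illustrated numerically. A minimal NumPy sketch (the rank-2 test matrix built from two outer products is an arbitrary choice, not from the lecture):

```python
import numpy as np

rng = np.random.default_rng(1)
# property (e): build a rank-2 matrix as a minimal sum of two outer products
a1, a2 = rng.standard_normal(5), rng.standard_normal(5)
b1, b2 = rng.standard_normal(4), rng.standard_normal(4)
A = np.outer(a1, b1) + np.outer(a2, b2)

r = np.linalg.matrix_rank(A)
assert r == 2                           # (a)/(e): dim range(A) = 2
assert np.linalg.matrix_rank(A.T) == r  # (b)-(d): row rank equals column rank
sigmas = np.linalg.svd(A, compute_uv=False)
assert np.sum(sigmas > 1e-10) == r      # (g): number of positive singular values
```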
Matrix Rank
The rank is bounded by the maximal rank, which, for matrices, is always given by $r_{\max} = \min\{\#I, \#J\}$, and this bound is attained by full-rank matrices.
As linear independence may depend on the field $\mathbb{K}$, one may question whether the rank of a real-valued matrix depends on considering it in $\mathbb{R}$ or in $\mathbb{C}$. For matrices it turns out that it does not matter: the rank is the same.
We will often work with matrices of bounded rank $r \le k$. We denote accordingly with $R_k = \{A \in \mathbb{K}^{I \times J} : \mathrm{rank}(A) \le k\}$ such a set. Notice that this set is not a vector space (exercise!).
Norms
In the following we consider an abstract vector space $V$ over the field $\mathbb{K}$. As a typical example we may keep in mind the Euclidean plane $V = \mathbb{R}^2$ endowed with the classical Euclidean norm.
We recall below the axioms of an abstract norm on $V$: a norm is a map $\|\cdot\| : V \to [0, \infty)$ with the following properties
§ $\|v\| = 0$ if and only if $v = 0$;
§ $\|\lambda v\| = |\lambda| \, \|v\|$ for all $v \in V$ and $\lambda \in \mathbb{K}$;
§ $\|v + w\| \le \|v\| + \|w\|$ for all $v, w \in V$ (triangle inequality).
A norm is always continuous as a consequence of the (inverse) triangle inequality:
$|\,\|v\| - \|w\|\,| \le \|v - w\|, \quad \text{for all } v, w \in V.$
A vector space $V$ endowed with a norm $\|\cdot\|$ (we write the pair $(V, \|\cdot\|)$ to indicate it) is called a normed vector space. As mentioned above, a typical example is the Euclidean plane.
Scalar products and (pre)-Hilbert spaces
A normed vector space $(V, \|\cdot\|)$ is a pre-Hilbert space if its norm is defined by
$\|v\| = \sqrt{\langle v, v \rangle}, \quad v \in V,$
where $\langle \cdot, \cdot \rangle : V \times V \to \mathbb{K}$ is a scalar product on $V$, i.e., it fulfills the properties
§ $\langle v, v \rangle > 0$ for $v \neq 0$;
§ $\langle v, w \rangle = \overline{\langle w, v \rangle}$, for $v, w \in V$;
§ $\langle u + \lambda v, w \rangle = \langle u, w \rangle + \lambda \langle v, w \rangle$ for $u, v, w \in V$ and $\lambda \in \mathbb{K}$;
§ $\langle w, u + \lambda v \rangle = \langle w, u \rangle + \overline{\lambda} \langle w, v \rangle$ for $u, v, w \in V$ and $\lambda \in \mathbb{K}$.
The triangle inequality for the norm follows from the Schwarz inequality
$|\langle v, w \rangle| \le \|v\| \, \|w\|, \quad v, w \in V.$
Scalar products and (pre)-Hilbert spaces
We describe the pre-Hilbert space also by the pair $(V, \langle \cdot, \cdot \rangle)$. A typical example of a scalar product over $\mathbb{K}^I$ is the one we introduced in (1), which generates the Euclidean norm on $\mathbb{K}^I$:
$\|v\|_2 = \sqrt{\sum_{i \in I} |v_i|^2}, \quad v \in \mathbb{K}^I.$
As before, one can define orthogonality between vectors, and orthogonal and orthonormal sets of vectors.
Hilbert spaces
A Hilbert space is a pre-Hilbert space $(V, \langle \cdot, \cdot \rangle)$ whose topology, induced by the associated norm $\|\cdot\|$, is complete, i.e., any Cauchy sequence in such a vector space is convergent.
It is important to stress that every finite dimensional pre-Hilbert space $(V, \langle \cdot, \cdot \rangle)$ is actually complete, hence always a Hilbert space.
Thus, those who are not familiar with topological notions (e.g., completeness, Cauchy sequences, etc.) should just reason in what follows according to their Euclidean geometric intuition.
Projections
Recall now that a set $C$ in a vector space is convex if, for all $v, w \in C$ and all $t \in [0, 1]$, $tv + (1 - t)w \in C$.
Given a closed convex set $C$ in a Hilbert space $(V, \langle \cdot, \cdot \rangle)$, one defines the projection of any vector $v$ on $C$ as
$P_C(v) = \arg\min_{w \in C} \|v - w\|.$
This definition is well-posed, as the projection is actually unique, and an equivalent definition is given by fulfilling the inequality
$\Re \langle z - P_C(v), v - P_C(v) \rangle \le 0,$
for all $z \in C$. This is left as an exercise.
Projections onto subspaces: Pythagoras-Fourier Theorem
Of extreme importance for us are the orthogonal projections onto subspaces.
In case $C = W \subset V$ is actually a closed linear subspace of $V$, then the projection onto $W$ can be readily computed as soon as one disposes of an orthonormal basis of $W$. Let $\{w_\nu\}_{\nu \in F}$ be a (countable) orthonormal basis of $W$; then
$P_W(v) = \sum_{\nu \in F} \langle v, w_\nu \rangle w_\nu, \quad \text{for all } v \in V. \qquad (3)$
Moreover, the Pythagoras-Fourier Theorem holds:
$\|P_W(v)\|^2 = \sum_{\nu \in F} |\langle v, w_\nu \rangle|^2, \quad \text{for all } v \in V.$
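Formula (3) and the Pythagoras-Fourier identity can be checked numerically. A minimal NumPy sketch, where the orthonormal basis of a 2-dimensional subspace $W \subset \mathbb{R}^5$ is an arbitrary one obtained via a QR factorization:

```python
import numpy as np

rng = np.random.default_rng(2)
# orthonormal basis {w_1, w_2} of a 2-dimensional subspace W of R^5
W, _ = np.linalg.qr(rng.standard_normal((5, 2)))
v = rng.standard_normal(5)

# formula (3): P_W(v) = sum_nu <v, w_nu> w_nu
Pv = sum(np.dot(v, W[:, k]) * W[:, k] for k in range(W.shape[1]))

# Pythagoras-Fourier: ||P_W(v)||^2 = sum_nu |<v, w_nu>|^2
assert np.isclose(np.dot(Pv, Pv),
                  sum(np.dot(v, W[:, k]) ** 2 for k in range(W.shape[1])))
# the residual v - P_W(v) is orthogonal to W
assert np.allclose(W.T @ (v - Pv), 0)
```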
Pythagoras-Fourier Theorem and orthonormal expansions
We will make frequent use of this characterization of the orthogonal projections and of the Pythagoras-Fourier Theorem, especially in the case where $W = V$.
In this case, obviously, $P_V = I$ and we have the orthonormal expansion of any vector $v \in V$,
$v = \sum_{\nu \in F} \langle v, w_\nu \rangle w_\nu,$
and the norm identity
$\|v\|^2 = \sum_{\nu \in F} |\langle v, w_\nu \rangle|^2.$
Trace of a matrix
The effort of defining abstract scalar products and norms allows us now to introduce several norms for matrices.
First of all we need to introduce the concept of trace, which is the map $\mathrm{tr} : \mathbb{K}^{I \times I} \to \mathbb{K}$ defined by
$\mathrm{tr}(A) = \sum_{i \in I} A_{ii},$
i.e., it is the sum of the diagonal elements of the matrix $A$.
Trace of a matrix: properties
The trace enjoys several properties, which we collect in the following:
Proposition (Properties of the trace)
(a) $\mathrm{tr}(AB) = \mathrm{tr}(BA)$, for any $A \in \mathbb{K}^{I \times J}$ and $B \in \mathbb{K}^{J \times I}$;
(b) $\mathrm{tr}(ABC) = \mathrm{tr}(BCA)$, for any matrices $A, B, C$ of compatible sizes and indexes; this property is called the circularity property of the trace;
(c) as a consequence of the previous property we obtain the invariance of the trace under unitary transformations, i.e., $\mathrm{tr}(A) = \mathrm{tr}(U A U^H)$ for $A \in \mathbb{K}^{I \times I}$ and any unitary matrix $U \in \mathbb{K}^{I \times I}$;
(d) $\mathrm{tr}(A) = \sum_{i \in I} \lambda_i$, where $\{\lambda_i : i \in I\}$ are the eigenvalues of $A$ (counted with multiplicity).
Matrix norms
The Frobenius norm of a matrix is essentially the Euclidean norm computed over the entries of the matrix (considered as a vector):
$\|A\|_F := \sqrt{\sum_{i \in I, j \in J} |A_{ij}|^2}, \quad A \in \mathbb{K}^{I \times J}.$
It is also known as the Schur norm or Hilbert-Schmidt norm. This norm is generated by the scalar product (that's why we made the effort of introducing abstract scalar products!)
$\langle A, B \rangle_F := \sum_{i \in I} \sum_{j \in J} A_{ij} \overline{B_{ij}} = \mathrm{tr}(A B^H) = \mathrm{tr}(B^H A). \qquad (4)$
In particular,
$\|A\|_F^2 = \mathrm{tr}(A A^H) = \mathrm{tr}(A^H A) \qquad (5)$
holds.
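A quick NumPy check of (5) on an arbitrary complex test matrix (not part of the slides):

```python
import numpy as np

rng = np.random.default_rng(3)
A = rng.standard_normal((3, 4)) + 1j * rng.standard_normal((3, 4))

fro = np.sqrt(np.sum(np.abs(A) ** 2))               # definition of ||A||_F
via_trace = np.sqrt(np.trace(A @ A.conj().T).real)  # (5): ||A||_F^2 = tr(A A^H)

assert np.isclose(fro, via_trace)
assert np.isclose(fro, np.linalg.norm(A))  # NumPy's default matrix norm is Frobenius
```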
Matrix norms
Let $\|\cdot\|_X$ and $\|\cdot\|_Y$ be vector norms on the vector spaces $X = \mathbb{K}^J$ and $Y = \mathbb{K}^I$, respectively. Then the associated matrix norm is
$\|A\| := \|A\|_{X \to Y} := \sup_{z \neq 0} \frac{\|Az\|_Y}{\|z\|_X}, \quad A \in \mathbb{K}^{I \times J}.$
If both $\|\cdot\|_X$ and $\|\cdot\|_Y$ coincide with the Euclidean norms $\|\cdot\|_2$ on $X = \mathbb{K}^J$ and $Y = \mathbb{K}^I$, respectively, then the associated matrix norm is called the spectral norm, and it is denoted with $\|A\|_\infty$.
As the spectral norm has a central importance in many applications, it is often denoted just by $\|A\|$.
Unitary invariance and submultiplicativity
Both matrix norms we introduced so far are invariant with respect to unitary transformations, i.e., $\|A\|_F = \|U A V^H\|_F$ and $\|A\|_\infty = \|U A V^H\|_\infty$ for unitary matrices $U, V$.
Moreover, both are submultiplicative, i.e., $\|AB\|_F \le \|A\|_\infty \|B\|_F \le \|A\|_F \|B\|_F$ and $\|AB\| \le \|A\| \, \|B\|$, for matrices $A, B$ of compatible sizes.
Introduction of the Singular Value Decomposition
We introduce and analyze the singular value decomposition (SVD) of a matrix $A$, which is the factorization of $A$ into the product of three matrices, $A = U \Sigma V^H$, where $U, V$ are orthogonal matrices of compatible size and the matrix $\Sigma$ is diagonal with positive real entries.
All these terms (orthogonal matrix, diagonal matrix) have been introduced in the previous lectures.
Geometrical derivation: principal component analysis
To gain insight into the SVD, treat the $n = \#I$ rows of a matrix $A \in \mathbb{K}^{I \times J}$ as points in a $d$-dimensional space, where $d = \#J$, and consider the problem of finding the best $k$-dimensional subspace with respect to this set of points.
Here best means minimizing the sum of the squares of the perpendicular distances of the points to the subspace. An orthonormal basis for this subspace is built from the fundamental directions of maximal variance of the group of high dimensional points; these directions are called principal components.
Pythagoras Theorem, scalar products, and orthogonal projections
We begin with a special case of the problem, where the subspace is 1-dimensional: a line through the origin.
We will see later that the best-fit $k$-dimensional subspace can be found by $k$ applications of the best-fit line algorithm.
Finding the best fitting line through the origin with respect to a set of points $\{x_i := A^{(i)} \in \mathbb{K}^2 : i \in I\}$ in the Euclidean plane $\mathbb{K}^2$ means minimizing the sum of the squared distances of the points to the line. Here distance is measured perpendicular to the line. The problem is called the best least squares fit.
Pythagoras Theorem, scalar products, and orthogonal projections
In the best least squares fit, one is minimizing the distance to a subspace. Now, consider projecting a point $x_i$ orthogonally onto a line through the origin. Then, by Pythagoras Theorem,
$x_{i1}^2 + x_{i2}^2 + \cdots + x_{id}^2 = (\text{length of projection})^2 + (\text{distance of point to line})^2.$
Pythagoras Theorem, scalar products, and orthogonal projections
In particular, from the formula above, one has
$(\text{distance of point to line})^2 = x_{i1}^2 + x_{i2}^2 + \cdots + x_{id}^2 - (\text{length of projection})^2.$
To minimize the sum of the squares of the distances to the line, one could minimize $\sum_i (x_{i1}^2 + x_{i2}^2 + \cdots + x_{id}^2)$ minus the sum of the squares of the lengths of the projections of the points onto the line.
However, the first term of this difference is a constant (independent of the line), so minimizing the sum of the squares of the distances is equivalent to maximizing the sum of the squares of the lengths of the projections onto the line.
Pythagoras Theorem, scalar products, and orthogonal projections
Averaging over the points and matrix notation
For best-fit subspaces, we could maximize the sum of the squared lengths of the projections onto the subspace instead of minimizing the sum of squared distances to the subspace.
Consider the $n = \#I$ rows of a matrix $A \in \mathbb{K}^{I \times J}$ as points in a $d$-dimensional space, where $d = \#J$, and consider the best-fit line through the origin.
Let $v$ be a unit vector along this line. The length of the projection of $A^{(i)}$, the $i$-th row of $A$, onto $v$ is, according to our definition of the scalar product, $|\langle A^{(i)}, v \rangle|$.
From this, we see that the sum of the squared lengths of the projections is
$\|Av\|_2^2 = \sum_{i \in I} |\langle A^{(i)}, v \rangle|^2. \qquad (6)$
Best-fit and first singular direction
The best-fit line is the one maximizing $\|Av\|_2^2$ and hence minimizing the sum of the squared distances of the points to the line.
With this in mind, define the first singular vector $v_1$ of $A$, which is a column vector, as giving the best-fit line through the origin for the $n$ points in $d$-dimensional space that are the rows of $A$. Thus
$v_1 = \arg\max_{\|v\|_2 = 1} \|Av\|_2.$
The value $\sigma_1(A) = \|Av_1\|_2$ is called the first singular value of $A$. Note that $\sigma_1(A)^2$ is the sum of the squares of the projections of the points onto the line determined by $v_1$.
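This variational characterization of $\sigma_1$ is easy to test numerically. A minimal NumPy sketch (the random $50 \times 3$ point cloud is an arbitrary choice): $v_1$ from the SVD attains the maximum, and no random unit vector beats it:

```python
import numpy as np

rng = np.random.default_rng(4)
A = rng.standard_normal((50, 3))  # n = 50 points in d = 3 dimensions, as rows

_, s, Vt = np.linalg.svd(A)
v1 = Vt[0]                        # first right singular vector

# sigma_1(A) = ||A v_1||_2
assert np.isclose(np.linalg.norm(A @ v1), s[0])

# no random unit vector yields a larger sum of squared projections
for _ in range(100):
    v = rng.standard_normal(3)
    v /= np.linalg.norm(v)
    assert np.linalg.norm(A @ v) <= s[0] + 1e-12
```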
First link: linear algebra VS optimization!!!
This is the first time we use an optimization criterion to approach a data interpretation problem and to create a connection between linear algebra and optimization.
As we will discuss along this course, optimization plays indeed a huge role in data analysis, and we will have to discuss it in more detail later on.
A greedy approach to best k-dimensional fit
The greedy approach to finding the best-fit 2-dimensional subspace for a matrix $A$ takes $v_1$ as the first basis vector for the 2-dimensional subspace and finds the best 2-dimensional subspace containing $v_1$.
The fact that we are using the sum of squared distances will again help, thanks to Pythagoras Theorem: for every 2-dimensional subspace containing $v_1$, the sum of squared lengths of the projections onto the subspace equals the sum of squared projections onto $v_1$ plus the sum of squared projections along a vector perpendicular to $v_1$ in the subspace.
Thus, instead of looking for the best 2-dimensional subspace containing $v_1$, look for a unit vector, call it $v_2$, perpendicular to $v_1$ that maximizes $\|Av\|_2$ among all such unit vectors.
Using the same greedy strategy to find the best three- and higher-dimensional subspaces, one defines $v_3, v_4, \ldots$ in a similar manner.
Formal definitions
More formally, the second singular vector $v_2$ is defined by the best-fit line perpendicular to $v_1$:
$v_2 = \arg\max_{\|v\|_2 = 1,\ \langle v_1, v \rangle = 0} \|Av\|_2.$
The value $\sigma_2(A) = \|Av_2\|_2$ is called the second singular value of $A$.
The $k$-th singular vector $v_k$ is defined similarly by
$v_k = \arg\max_{\|v\|_2 = 1,\ \langle v_1, v \rangle = 0, \ldots, \langle v_{k-1}, v \rangle = 0} \|Av\|_2, \qquad (7)$
and so on.
The process stops when we have found the singular vectors $v_1, \ldots, v_r$ and
$0 = \max_{\|v\|_2 = 1,\ \langle v_1, v \rangle = 0, \ldots, \langle v_r, v \rangle = 0} \|Av\|_2.$
Optimality of the greedy algorithm
If, instead of finding $v_1$ that maximized $\|Av\|_2$ and then the best-fit 2-dimensional subspace containing $v_1$, we had directly found the best-fit 2-dimensional subspace, we might have done better.
Surprisingly enough, this is actually not the case. We now give a simple proof that the greedy algorithm indeed finds the best subspaces of every dimension.
Proposition
Let $A \in \mathbb{K}^{I \times J}$ and let $v_1, v_2, \ldots, v_r$ be the singular vectors defined above. For $1 \le k \le r$, let $V_k$ be the subspace spanned by $v_1, v_2, \ldots, v_k$. Then for each $k$, $V_k$ is the best-fit $k$-dimensional subspace for $A$.
The natural order of the singular values
We conclude this part with an important property left as an exercise: show that necessarily
$\sigma_1(A) \ge \sigma_2(A) \ge \cdots \ge \sigma_r(A), \qquad (8)$
i.e., the singular values always come in a natural nonincreasing order.
On matrix norms again
Note that the vector $Av_i$ is really a list of lengths (with signs) of the projections of the rows of $A$ onto $v_i$. Think of $\sigma_i(A) = \|Av_i\|_2$ as the component of the matrix $A$ along $v_i$.
For this interpretation to make sense, it should be true that adding up the squares of the components of $A$ along each of the $v_i$ gives the square of the whole content of the matrix $A$.
This is indeed the case, and it is the matrix analogue of decomposing a vector into its components along orthogonal directions.
On matrix norms again
Consider one row, say $A^{(i)}$, of $A$. Since $v_1, v_2, \ldots, v_r$ span the space of all rows of $A$ (exercise!), $\langle A^{(i)}, v \rangle = 0$ for all $v$ orthogonal to $v_1, v_2, \ldots, v_r$.
Thus, for each row $A^{(i)}$, by Pythagoras Theorem, $\|A^{(i)}\|_2^2 = \sum_{k=1}^r |\langle A^{(i)}, v_k \rangle|^2$.
Summing over all rows $i \in I$ and recalling (6), we obtain
$\sum_{i \in I} \|A^{(i)}\|_2^2 = \sum_{k=1}^r \sum_{i \in I} |\langle A^{(i)}, v_k \rangle|^2 = \sum_{k=1}^r \|Av_k\|_2^2 = \sum_{k=1}^r \sigma_k(A)^2.$
But
$\sum_{i \in I} \|A^{(i)}\|_2^2 = \sum_{i \in I} \sum_{j \in J} |A_{ij}|^2$
is the squared Frobenius norm of $A$, so that we obtain
$\|A\|_F = \sqrt{\sum_{k=1}^r \sigma_k(A)^2}. \qquad (9)$
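Identity (9) can be confirmed in a few lines of NumPy (an arbitrary test matrix, not from the slides); the last assertion also recalls the nonincreasing order (8):

```python
import numpy as np

rng = np.random.default_rng(5)
A = rng.standard_normal((6, 4))
s = np.linalg.svd(A, compute_uv=False)

# (9): the Frobenius norm is the Euclidean norm of the singular values
assert np.isclose(np.linalg.norm(A, 'fro'), np.sqrt(np.sum(s ** 2)))
# the spectral norm is the largest singular value
assert np.isclose(np.linalg.norm(A, 2), s[0])
# (8): NumPy returns the singular values in nonincreasing order
assert np.all(np.diff(s) <= 0)
```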
On matrix norms again
We might also observe that
$\sigma_1(A) = \max_{\|v\|_2 = 1} \|Av\|_2 = \max_{v \neq 0} \frac{\|Av\|_2}{\|v\|_2} = \|A\|, \qquad (10)$
i.e., the first singular value corresponds to the spectral norm of $A$. This allows us to define other and more general norms, called the Schatten-$p$-norms for matrices, defined for $1 \le p < \infty$ by
$\|A\|_p = \left( \sum_{k=1}^r \sigma_k(A)^p \right)^{1/p},$
and for $p = \infty$ by
$\|A\|_\infty = \max_{k = 1, \ldots, r} \sigma_k(A).$
Notice that these definitions give precisely $\|A\|_F = \|A\|_2$ and $\|A\| = \|A\|_\infty$ as particular Schatten norms. Of particular relevance in certain applications related to recommender systems is the so-called nuclear norm $\|A\|_* = \|A\|_1$, corresponding to the Schatten-1-norm.
On rank again
We just mentioned above, and left it as an exercise, that the orthogonal basis constituted by the vectors $v_1, v_2, \ldots, v_r$ spans the space of all rows of $A$.
This means that their number $r$ is actually the rank of the matrix! And this is a proof of (g) in the Proposition above characterizing the rank!
The Singular Value Decomposition in its full glory
The vectors $Av_1, Av_2, \ldots, Av_r$ form yet another fundamental set of vectors associated with $A$. We normalize them to length one by setting
$u_i = \frac{1}{\sigma_i(A)} A v_i.$
The vectors $u_1, u_2, \ldots, u_r$ are called the left singular vectors of $A$.
The $v_i$ are called the right singular vectors.
The SVD theorem (soon!) will fully explain the reason for these terms.
Orthonormality of the left singular vectors
Clearly, the right singular vectors are orthogonal by definition.
We now show that the left singular vectors are also orthogonal.
Theorem
Let $A \in \mathbb{K}^{I \times J}$ be a rank-$r$ matrix. The left singular vectors $u_1, u_2, \ldots, u_r$ of $A$ are orthogonal.
SVD in its full glory
Theorem
Let $A \in \mathbb{K}^{I \times J}$ with right singular vectors $v_1, v_2, \ldots, v_r$, left singular vectors $u_1, u_2, \ldots, u_r$, and corresponding singular values $\sigma_1, \ldots, \sigma_r$. Then
$A = \sum_{k=1}^r \sigma_k u_k v_k^H.$
Best rank-k approximation
Given the singular value decomposition $A = \sum_{k=1}^r \sigma_k u_k v_k^H$ of a matrix $A \in \mathbb{K}^{I \times J}$, we define the rank-$k$ truncation by
$A_k = \sum_{\ell=1}^k \sigma_\ell u_\ell v_\ell^H.$
It should be clear that $A_k$ is a rank-$k$ matrix.
Lemma
The rows of $A_k$ are the orthogonal projections of the rows of $A$ onto the subspace $V_k$ spanned by the first $k$ singular vectors of $A$.
Best rank-k Frobenius approximation
Theorem
For any matrix $B$ of rank at most $k$,
$\|A - A_k\|_F \le \|A - B\|_F.$
Best rank-k spectral approximation
Lemma
$\|A - A_k\|^2 = \sigma_{k+1}^2.$
Theorem
For any matrix $B$ of rank at most $k$,
$\|A - A_k\| \le \|A - B\|.$
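These optimality statements can be illustrated numerically. A minimal NumPy sketch, comparing $A_k$ against one arbitrary rank-$k$ competitor (a single competitor is only an illustration, of course, not a proof):

```python
import numpy as np

rng = np.random.default_rng(6)
A = rng.standard_normal((8, 6))
U, s, Vt = np.linalg.svd(A, full_matrices=False)

k = 2
Ak = (U[:, :k] * s[:k]) @ Vt[:k]   # rank-k truncation A_k

# spectral error of the truncation is sigma_{k+1}
assert np.isclose(np.linalg.norm(A - Ak, 2), s[k])

# A_k beats a random rank-2 competitor B in both norms
B = np.outer(rng.standard_normal(8), rng.standard_normal(6)) \
    + np.outer(rng.standard_normal(8), rng.standard_normal(6))
assert np.linalg.norm(A - Ak, 'fro') <= np.linalg.norm(A - B, 'fro')
assert np.linalg.norm(A - Ak, 2) <= np.linalg.norm(A - B, 2)
```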
Nonconvexity and curse of dimensionality
We introduced the SVD by means of a greedy algorithm, which at each step is supposed to solve a nonconvex optimization problem of the type
$v_k = \arg\max_{\|v\|_2 = 1,\ \langle v_1, v \rangle = 0, \ldots, \langle v_{k-1}, v \rangle = 0} \|Av\|_2. \qquad (11)$
Well, the issue with this is that it is neither a convex optimization problem nor, in principle, stated on a low dimensional space. In fact, our goal is often to deal with "big data" (i.e., many data points, each of very high dimension), which eventually implies large dimensionalities of the matrix $A$.
How can we then approach the computation of the SVD without incurring the so-called "curse of dimensionality", which is an emphatic way of describing the computational infeasibility of a certain numerical task?
The Power Method
It is easiest to describe the method first in the case where $A$ is real-valued, square, and symmetric and has the same right and left singular vectors, namely, $A = \sum_{k=1}^r \sigma_k v_k v_k^T$.
In this case we have
$A^2 = \left( \sum_{k=1}^r \sigma_k v_k v_k^T \right) \left( \sum_{\ell=1}^r \sigma_\ell v_\ell v_\ell^T \right) = \sum_{k,\ell=1}^r \sigma_k \sigma_\ell v_k \underbrace{v_k^T v_\ell}_{= \delta_{k\ell}} v_\ell^T = \sum_{k=1}^r \sigma_k^2 v_k v_k^T.$
Similarly, if we take the $m$-th power of $A$, again all the cross terms are zero and we get
$A^m = \sum_{k=1}^r \sigma_k^m v_k v_k^T.$
The spectral norm of A and a spectral gap
If we had $\sigma_1 \gg \sigma_2$, we would have
$\lim_{m \to \infty} \frac{A^m}{\sigma_1^m} = v_1 v_1^T.$
While it is not so simple to compute $\sigma_1$ (why?!), which corresponds to the spectral norm of $A$, one can easily compute the Frobenius norm $\|A^m\|_F$, so that we can consider $\frac{A^m}{\|A^m\|_F}$, which again converges for $m \to \infty$ to $v_1 v_1^T$, from which $v_1$ may be computed (exercise!).
What if A is not square?
But then $B = A A^T$ is square. If again the SVD of $A$ is $A = \sum_{k=1}^r \sigma_k u_k v_k^T$, then
$B = \sum_{k=1}^r \sigma_k^2 u_k u_k^T.$
This is the spectral decomposition of $B$. Using the same kind of calculation as above, one obtains
$B^m = \sum_{k=1}^r \sigma_k^{2m} u_k u_k^T.$
As $m$ increases, for $k > 1$ the ratio $\sigma_k^{2m} / \sigma_1^{2m}$ goes to zero, and $B^m$ gets approximately equal to
$\sigma_1^{2m} u_1 u_1^T,$
provided again that $\sigma_1 \gg \sigma_2$. This suggests how to compute both $\sigma_1$ and $u_1$, simply by powering $B$.
Two relevant issues
§ If we do not have a spectral gap $\sigma_1 \gg \sigma_2$, we get into trouble.
§ Computing $B^m$ costs $m$ matrix-matrix multiplications; done in the schoolbook way, this costs $O(m d^3)$ operations, or $O(d^3 \log m)$ when done by successive squaring.
Use a random vector instead!
Instead, we compute
$B^m x,$
where $x$ is a random unit length vector.
The idea is that the component of $x$ in the direction of $u_1$, i.e., $x_1 := \langle x, u_1 \rangle$, is actually bounded away from zero with high probability, and at each application of $B$ it gets multiplied by $\sigma_1^2$, while the components of $x$ along the other $u_k$'s are multiplied only by $\sigma_k^2 \ll \sigma_1^2$.
Each increase in $m$ requires multiplying $B$ by the vector $B^{m-1} x$, which we can further break up into
$B^m x = A (A^T (B^{m-1} x)).$
This requires two matrix-vector products, involving the matrices $A^T$ and $A$. Since $B^m x \approx \sigma_1^{2m} u_1 (u_1^T x)$ is a scalar multiple of $u_1$, $u_1$ can be recovered from $B^m x$ by normalization.
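The resulting scheme takes only a few lines. A NumPy sketch, where the $100 \times 20$ test matrix with an explicit gap $\sigma_1 = 5 \gg \sigma_2 = 1$ is an assumption for illustration:

```python
import numpy as np

rng = np.random.default_rng(8)
# synthetic matrix A = U0 diag(5,1,...,1) V0^T with a clear spectral gap
U0, _ = np.linalg.qr(rng.standard_normal((100, 20)))
V0, _ = np.linalg.qr(rng.standard_normal((20, 20)))
A = (U0 * np.array([5.0] + [1.0] * 19)) @ V0.T

x = rng.standard_normal(100)
x /= np.linalg.norm(x)          # random unit-length start vector

# power iteration on B = A A^T using only matrix-vector products:
# B^m x = A (A^T (B^{m-1} x)); B itself is never formed
for _ in range(20):
    x = A @ (A.T @ x)
    x /= np.linalg.norm(x)      # renormalize instead of tracking sigma_1^{2m}

u1 = U0[:, 0]                   # the true top left singular vector
assert np.isclose(abs(np.dot(x, u1)), 1.0)
```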
Concentration of measure phenomenon
The concentration of measure phenomenon, informally speaking, describes the fact that, in high dimension, certain events happen with high probability. In particular, we have:
Lemma
Let $x = (x_1, \ldots, x_d) \in \mathbb{R}^d$ be a unit $d$-dimensional vector, with components taken with respect to the canonical basis, picked at random from the set $\{x : \|x\|_2 \le 1\}$. The probability that $|x_1| \ge \alpha > 0$ is at least $1 - 2\alpha \sqrt{d - 1}$.
Extensions
Remark
Notice that the previous result essentially shows also that, independently of the dimension $d$, the component $x_1 = \langle x, u_1 \rangle$ of a random unit vector $x$ with respect to any orthonormal basis $\{u_1, \ldots, u_d\}$² is bounded away from zero with overwhelming probability.
Remark
In view of the isometric mapping $(a, b) \mapsto a + ib$ from $\mathbb{R}^2$ to $\mathbb{C}$, the previous result extends to random unit vectors in $\mathbb{C}^d$ simply by modifying the statement as follows: the probability that, for a randomly chosen unit vector $z \in \mathbb{C}^d$, $|z_1| \ge \alpha > 0$ holds is at least $1 - 2\alpha \sqrt{2d - 1}$.
²Since the sphere is rotation invariant, the arguments above apply to any orthonormal basis by rotating it to coincide with the canonical basis!
Randomized Power Method
By injecting randomization into the process, we are able to use the blessing of dimensionality (concentration of measure) against the curse of dimensionality.
Theorem
Let $A \in \mathbb{K}^{I \times J}$ and let $x \in \mathbb{K}^I$ be a random unit length vector. Let $V$ be the space spanned by the left singular vectors of $A$ corresponding to singular values greater than $(1 - \varepsilon) \sigma_1$. Let $m$ be $\Omega\left( \frac{\ln(d / \varepsilon)}{\varepsilon} \right)$. Let $w^*$ be the unit vector after $m$ iterations of the power method, namely,
$w^* = \frac{(A A^H)^m x}{\|(A A^H)^m x\|_2}.$
The probability that $w^*$ has a component of at least $O\left( \frac{\varepsilon}{\alpha d} \right)$ orthogonal to $V$ is at most $2\alpha \sqrt{2d - 1}$.
To make the previous statement concrete, let us take $\alpha = \frac{1}{20d}$. Then, with probability at least $9/10$ (i.e., very close to 1), after a number $m$ of iterations of order $\Omega\left( \frac{\ln(d / \varepsilon)}{\varepsilon} \right)$, the vector $w^*$ has a component of at most $O(\varepsilon)$ orthogonal to $V$.
Applications of the SVD: Principal Component Analysis
The traditional use of the SVD is in Principal Component Analysis (PCA). PCA is illustrated by an example: customer-product data, where there are $n$ customers buying $d$ products.
Applications of the SVD: Principal Component Analysis
Let the matrix $A$ with elements $A_{ij}$ represent the probability of customer $i$ purchasing product $j$ (or the amount or utility of product $j$ to customer $i$).
One hypothesizes that there are really only $k$ underlying basic factors, like age, income, family size, etc., that determine a customer's purchase behavior.
An individual customer's behavior is determined by some weighted combination of these underlying factors.
Principal Component Analysis
That is, customer $i$'s purchase behavior can be characterized by a $k$-dimensional vector $(u_{\ell,i})_{\ell = 1, \ldots, k}$, where $k$ is much smaller than $n$ and $d$.
The components $u_{\ell,i}$, for $\ell = 1, \ldots, k$, of the vector are the weights of the basic factors.
Associated with each basic factor $\ell$ is a vector of probabilities $v_\ell$, each component of which is the probability of purchasing a given product by someone whose behavior depends only on that factor.
More abstractly, $A = U V^H$ is an $n \times d$ matrix that can be expressed as the product of two matrices, where $U$ is an $n \times k$ matrix expressing the factor weights for each customer and $V^H$ is a $k \times d$ matrix expressing the purchase probabilities of products corresponding to each factor:
$(A^{(i)})^H = \sum_{\ell=1}^k u_{\ell,i} v_\ell^H.$
Principal Component Analysis
One twist is that $A$ may not be exactly equal to $U V^H$, but only close to it, since there may be noise or random perturbations.
Taking the best rank-$k$ approximation $A_k$ from the SVD (as described above) gives us such a $U, V$. In this traditional setting, one assumes that $A$ is fully available and we wish to find $U, V$ in order to identify the basic factors or, in some applications, to denoise $A$ (if we think of $A - U V^H$ as noise).
Low rankness and recommender systems
Now imagine that $n$ and $d$ are very large, on the order of thousands or even millions; then there is probably little one could do to estimate or even store $A$.
In this setting, we may assume that we are given just a few entries of $A$ and wish to estimate the rest. If $A$ were an arbitrary matrix of size $n \times d$, this would require $\Omega(nd)$ pieces of information and cannot be done with a few entries.
But again hypothesize that $A$ is a small rank matrix with added noise. If now we also assume that the given entries are randomly drawn according to some known distribution, then there is a possibility that the SVD can be used to estimate the whole of $A$.
This area is called collaborative filtering or low-rank matrixcompletion, and one of its uses is to target an ad to a customerbased on one or two purchases.
SVD as a compression algorithm
Suppose A is the pixel intensity matrix of a large image.
The entry $A_{ij}$ gives the intensity of the $ij$-th pixel. If $A$ is $n \times n$, the transmission of $A$ requires transmitting $O(n^2)$ real numbers. Instead, one could send $A_k$, that is, the top $k$ singular values $\sigma_1, \sigma_2, \ldots, \sigma_k$ along with the left and right singular vectors $u_1, u_2, \ldots, u_k$ and $v_1, v_2, \ldots, v_k$.
This would require sending $O(kn)$ real numbers instead of $O(n^2)$ real numbers.
If $k$ is much smaller than $n$, this results in savings. For many images, a $k$ much smaller than $n$ can be used to reconstruct the image, provided that a very low resolution version of the image is sufficient.
Thus, one could use SVD as a compression method.
Sparsifying bases
It turns out that, in a more sophisticated approach, for certain classes of pictures one could use a fixed sparsifying basis such that the top (say) hundred singular vectors are sufficient to represent any picture approximately.
This means that the space spanned by the top hundred sparsifying basis vectors is not too different from the space spanned by the top two hundred singular vectors of a given matrix in that class.
Compressing these matrices by this sparsifying basis can save substantially, since the sparsifying basis is transmitted only once and a matrix is then transmitted by sending the top several hundred singular values for the sparsifying basis.
We will discuss later on the use of sparsifying bases.
Singular Vectors and ranking documents
Consider representing a document by a vector, each component of which corresponds to the number of occurrences of a particular word in the document.
The English language has on the order of 25,000 words. Thus, such a document is represented by a 25,000-dimensional vector. This representation of a document is called the word vector model.
A collection of $n$ documents may be represented by a collection of 25,000-dimensional vectors, one vector per document. The vectors may be arranged as the rows of an $n \times 25{,}000$ matrix.
Singular Vectors and ranking documents
An important task for a document collection is to rank the documents.
A naive method would be to rank in order of the total length of the document (which is the sum of the components of its term-document vector). Clearly, this is not a good measure in many cases.
This naive method attaches equal weight to each term and essentially takes the projection of each term-document vector in the direction of the vector whose components all have value 1.
Is there a better weighting of terms, i.e., a better projection direction, which would measure the intrinsic relevance of the document to the collection?
Singular Vectors and ranking documents
A good candidate is the best-fit direction for the collection of term-document vectors, namely the top (right) singular vector of the term-document matrix.
An intuitive reason for this is that this direction has the maximum sum of squared projections of the collection and so can be thought of as a synthetic term-document vector best representing the document collection.
WWW ranking
Ranking in order of the projection of each document's term vector along the best-fit direction has a nice interpretation in terms of the power method. For this, we consider a different example: that of the web with hypertext links.
The World Wide Web can be represented by a directed graph whose nodes correspond to web pages and whose directed edges correspond to hypertext links between pages.
Some web pages, called authorities, are the most prominent sources of information on a given topic. Other pages, called hubs, are ones that identify the authorities on a topic.
Authority pages are pointed to by many hub pages, and hub pages point to many authorities.
WWW ranking
One would like to assign hub weights and authority weights to each node of the web.
If there are $n$ nodes, the hub weights form an $n$-dimensional vector $u$ and the authority weights form an $n$-dimensional vector $v$. Suppose $A$ is the adjacency matrix representing the directed graph: $A_{ij}$ is 1 if there is a hypertext link from page $i$ to page $j$, and 0 otherwise. Given the hub vector $u$, the authority vector $v$ could be computed by the formula
$v_j = \sum_{i=1}^n u_i a_{ij},$
since the right-hand side is the sum of the hub weights of all the nodes that point to node $j$. In matrix terms,
$v = A^T u.$
Similarly, given an authority vector v , the hub vector could becomputed by u “ Av .
WWW ranking
Of course, at the start we have neither vector.

But the above suggests a power-like iteration: start with any v; set u = Av; then set v = A^T u, and repeat the process.

We know from the power method that this converges to the top left and right singular vectors. So, after sufficiently many iterations, we may use the left vector u as the hub-weights vector, project each column of A onto this direction, and rank the columns (authorities) in order of this projection.

But the projections just form the vector A^T u, which equals v. So we can simply rank by the order of the v_j.

This is the basis of an algorithm called the HITS algorithm, which was one of the early proposals for ranking web pages. A different ranking, called PageRank, is widely used. It is based on a random walk on the graph described above, but we do not explore it here.
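The hub/authority iteration above can be sketched in a few lines of NumPy. The adjacency matrix here is a made-up toy example, and the per-step normalization (implicit on the slides) is the usual one that keeps the iterates bounded:

```python
import numpy as np

# Toy web graph (hypothetical example): A[i, j] = 1 if page i links to page j.
A = np.array([[0, 1, 1, 0],
              [0, 0, 1, 0],
              [0, 0, 0, 1],
              [0, 1, 0, 0]], dtype=float)

v = np.ones(A.shape[1])          # start with any authority vector
for _ in range(100):
    u = A @ v                    # hub scores:       u = A v
    u /= np.linalg.norm(u)       # normalize to avoid blow-up
    v = A.T @ u                  # authority scores: v = A^T u
    v /= np.linalg.norm(v)

# v now approximates the top right singular vector of A;
# ranking the columns (authorities) by v_j is the HITS ranking.
ranking = np.argsort(-v)
```

On this example, pages 1 and 2 (which receive the most hub-weighted links) come out as the strongest authorities.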
Pseudo-inverse matrices
Given A ∈ K^{I×J} of rank r and its singular value decomposition

A = U Σ V^H = ∑_{k=1}^r σ_k u_k v_k^H,

we define the Moore-Penrose pseudo-inverse A^† ∈ K^{J×I} as

A^† = ∑_{k=1}^r σ_k^{−1} v_k u_k^H.

In matrix form it is

A^† = V Σ^{−1} U^H,

where the inverse ·^{−1} is applied only to the nonzero singular values of A.
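As a quick numerical sanity check (a sketch, not part of the slides), the rank-truncated formula A^† = V Σ^{−1} U^H can be implemented directly and compared against NumPy's built-in pseudo-inverse:

```python
import numpy as np

def pinv_via_svd(A, tol=1e-12):
    """Moore-Penrose pseudo-inverse: invert only the nonzero singular values."""
    U, s, Vh = np.linalg.svd(A, full_matrices=False)
    r = int(np.sum(s > tol * s[0]))       # numerical rank
    # A^dagger = sum_{k <= r} sigma_k^{-1} v_k u_k^H
    return (Vh[:r].conj().T / s[:r]) @ U[:, :r].conj().T

A = np.array([[1., 2., 3.],
              [4., 5., 6.]])              # rank 2, more columns than rows
Ad = pinv_via_svd(A)
```

The result agrees with `np.linalg.pinv` and satisfies the defining identities A A^† A = A and A^† A A^† = A^†.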
Pseudo-inverse matrices

If A^H A ∈ K^{J×J} is invertible (implying #I ≥ #J), which is the case if r = #J and the columns are linearly independent, then

A^† = (A^H A)^{−1} A^H.

Indeed,

(A^H A)^{−1} A^H = (V Σ² V^H)^{−1} V Σ U^H = (V Σ^{−2} V^H) V Σ U^H = V Σ^{−1} U^H = A^†.

In this case, two fundamental properties of A^† should also be noted. First of all, it is a left inverse of A, i.e.,

I = (A^H A)^{−1} A^H A = A^† A,

where I ∈ K^{r×r}. On the other side, the matrix

P_range(A) = A A^†

corresponds to the orthogonal projection of K^I onto the range of the matrix A, i.e., the span of the columns of A.
Pseudo-inverse matrices
Similarly, if A A^H ∈ K^{I×I} is invertible (implying #I ≤ #J), which is the case if r = #I and the rows are linearly independent, then

A^† = A^H (A A^H)^{−1}.

In this case, A^† is a right inverse of A, i.e.,

I = A A^H (A A^H)^{−1} = A A^†.
Least square problems
§ The system Ax = y is overdetermined, i.e., there are more equations than unknowns. In this case the problem might not even be solvable, and approximate solutions are needed, i.e., those that minimize the mismatch ‖Ax − y‖_2;

§ The system Ax = y is underdetermined, i.e., there are fewer equations than unknowns. In this case the problem has an infinite number of solutions, and one would wish to determine one of them with special properties, by means of a proper selection criterion. A common criterion for selecting a solution of the system Ax = y is to pick the solution with minimal Euclidean norm ‖x‖_2.
Pseudo-inverse matrices and least squares problems
Let us first connect least squares problems with the Moore-Penrose pseudo-inverse introduced above.
Proposition
Let A ∈ K^{I×J} and y ∈ K^I. Define M ⊂ K^J to be the set of minimizers of the map x ↦ ‖Ax − y‖_2. The convex optimization problem

arg min_{x ∈ M} ‖x‖_2    (12)

has the unique solution x* = A^† y.
Proof at the blackboard ...
Overdetermined systems
Corollary
Let A ∈ K^{I×J} with n = #I ≥ #J = d be of full rank r = d, and let y ∈ K^I. Then the least squares problem

arg min_{x ∈ K^J} ‖Ax − y‖_2    (13)

has the unique solution x = A^† y.

This result follows from the Proposition above because the minimizer x* of ‖Ax − y‖_2 is in this case unique.
Underdetermined systems
Corollary
Let A ∈ K^{I×J} with n = #I ≤ #J = d be of full rank r = n, and let y ∈ K^I. Then the least squares problem

arg min_{x ∈ K^J} ‖x‖_2 subject to Ax = y    (14)

has the unique solution x = A^† y.
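Both corollaries are easy to check numerically. The sketch below (random data, purely for illustration) treats the underdetermined case: x* = A^† y solves the system, and perturbing it by any null-space direction gives another solution with larger norm:

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((3, 6))   # underdetermined: 3 equations, 6 unknowns
y = rng.standard_normal(3)

x_star = np.linalg.pinv(A) @ y    # minimum-norm solution x* = A^dagger y

# Any other solution differs from x* by a vector in the null space of A:
h = rng.standard_normal(6)
h -= np.linalg.pinv(A) @ (A @ h)  # null-space component of h (h - A^dagger A h)
other = x_star + h                # another exact solution of Ax = y
```

Since x* lies in the row space of A and h in its null space, ‖other‖_2² = ‖x*‖_2² + ‖h‖_2², so x* is indeed the shortest solution.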
Stability of the Singular Value Decomposition

In many situations there is a data matrix of interest A, but one has only a perturbed version Ã = A + E of it, where E is a relatively small, quantifiable perturbation. To what extent can we rely on the singular value decomposition of Ã to estimate the one of A?

In many real-life cases, certain sets of data tend to behave as random vectors generated in the vicinity of a lower dimensional linear subspace.

The more data we get, the more they take a specific enveloping shape around that subspace. Hence, we could consider many matrices built out of sets of such randomly generated points. What do the SVDs of such matrices look like?

What happens if we acquire more and more data and the matrices become larger and larger?

To be able to respond in a quantitative way to such questions, we need to develop a theory of stability of singular value decompositions under small perturbations.
Spectral theorem

Assume A ∈ K^{I×I} is a Hermitian matrix, i.e., A = A^H. We recall that a nonzero vector v ∈ K^I is an eigenvector of A if Av = λv for some scalar λ ∈ K, called the corresponding eigenvalue.

Theorem (Spectral theorem for Hermitian matrices)
If A ∈ K^{I×I} and A = A^H, then there exists an orthonormal basis {v_1, ..., v_n} consisting of eigenvectors of A with real corresponding eigenvalues {λ_1, ..., λ_n} such that

A = ∑_{k=1}^n λ_k v_k v_k^H.

The previous representation is called the spectral decomposition of A.

In fact, by denoting u_k = sign(λ_k) v_k and σ_k = |λ_k|, then

A = ∑_{k=1}^n λ_k v_k v_k^H = ∑_{k=1}^n sign(λ_k) |λ_k| v_k v_k^H = ∑_{k=1}^n σ_k u_k v_k^H,

which is actually an SVD of A (up to ordering the σ_k nonincreasingly).
Weyl’s bounds
Assume now E is Hermitian, so that Ã = Ã^H holds as well. As the eigenvalues of such matrices are always real, we can order them nonincreasingly, so that λ_1 is the largest (not necessarily positive!). Then

λ_1(A + E) = max_{‖v‖_2 = 1} v^H (A + E) v ≤ max_{‖v‖_2 = 1} v^H A v + max_{‖v‖_2 = 1} v^H E v = λ_1(A) + λ_1(E).

By using the spectral decomposition of A + E, prove the first equality above as an exercise.

Weyl's bounds

Let v_1 be the eigenvector of A associated to the top eigenvalue of A. Then

λ_1(A + E) = max_{‖v‖_2 = 1} v^H (A + E) v ≥ v_1^H (A + E) v_1 = λ_1(A) + v_1^H E v_1 ≥ λ_1(A) + λ_n(E).

From these estimates we obtain the bounds

λ_1(A) + λ_n(E) ≤ λ_1(A + E) ≤ λ_1(A) + λ_1(E).
Weyl’s boundsThis chain of inequalities can be properly extended to the 2nd, 3rd,etc. eigenvalues (we do not perform this extension), leading to thefollowing important result.
Theorem (Weyl)
If A,E P KIˆI are two Hermitian matrices, then for all k “ 1, . . . , n
λkpAq ` λnpE q ď λkpA` E q ď λkpAq ` λ1pE q.
This is a very strong result because it does not depend at all fromthe magnitude of the perturbation matrix E . As a corollary
Corollary
If A,E P KIˆI are two arbitrary matrices, then for all k “ 1, . . . , n
|σkpA` E q ´ σkpAq| ď E,
where σk ’s are the corresponding singular vectors and E is thespectral norm of E .
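The corollary is straightforward to test numerically. This illustrative sketch (random matrices, not from the slides) perturbs a matrix and checks that no singular value moves by more than the spectral norm of the perturbation:

```python
import numpy as np

rng = np.random.default_rng(2)
A = rng.standard_normal((6, 6))
E = 0.1 * rng.standard_normal((6, 6))

sA = np.linalg.svd(A, compute_uv=False)        # sigma_k(A), nonincreasing
sAE = np.linalg.svd(A + E, compute_uv=False)   # sigma_k(A + E), same ordering
specE = np.linalg.norm(E, 2)                   # spectral norm ||E||

# Weyl's corollary: every singular value shifts by at most ||E||.
max_shift = np.max(np.abs(sAE - sA))
```

The comparison is valid because `np.linalg.svd` returns both lists of singular values in nonincreasing order, matching the indexing in the corollary.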
Mirsky’s bounds
Actually more can be said:
Theorem (Mirsky)
If A, E ∈ K^{I×I} are two arbitrary matrices, then

√( ∑_{k=1}^n |σ_k(A + E) − σ_k(A)|² ) ≤ ‖E‖_F,

where ‖E‖_F is the Frobenius norm of E.
In its original form, Mirsky's theorem holds for an arbitrary unitarily invariant norm (hence for the spectral norm) and includes Weyl's theorem as a special case.
Stability of singular spaces: principal angles
To quantify the stability properties of the singular value decomposition under perturbations, we need to introduce the notion of principal angles between subspaces. To introduce the concept, we recall that in the Euclidean space R^d the angle θ(v, w) between two vectors v, w ∈ R^d is defined by

cos θ(v, w) = (w^T v) / (‖v‖_2 ‖w‖_2).

In particular, for unit vectors v, w one has cos θ(v, w) = w^T v.
Stability of singular spaces: principal angles
Definition
Let V, W ∈ K^{J×I} be orthogonal matrices spanning V and W, n-dimensional subspaces of K^J (here we may assume that n = #I ≤ #J = d).³ We define the principal angles between V and W by

cos Θ(V, W) = Σ(W^H V),

where Σ(W^H V) is the list of the singular values of the matrix W^H V.

³Here we commit an abuse of notation, as we denote by V, W the matrices of orthonormal bases of the two subspaces V, W. However, this identification is legitimate, as all we need to know about the subspaces are some orthonormal bases of them.
Stability of singular spaces: principal angles

This definition sounds a bit abstract, and it is perhaps useful to characterize the principal angles in terms of a geometric description. Recalling the explicit characterization of the orthogonal projection onto a subspace given in (3), we observe that

P_V = V V^H,   P_W = W W^H.

The following result essentially says that the principal angles quantify how different the orthogonal projections onto V and W are.

Proposition
Let V, W ∈ K^{J×I} be orthogonal matrices spanning V and W, n-dimensional subspaces of K^J (here we may assume that n = #I ≤ #J = d). Then

‖P_V − P_W‖_F = ‖V V^H − W W^H‖_F = √2 ‖sin Θ(V, W)‖_2.
Proof at the blackboard ...
Stability of singular spaces: principal angles
The orthogonal complement W^⊥ of any subspace W of K^J is defined by W^⊥ = {v ∈ K^J : ⟨v, w⟩ = 0 for all w ∈ W}. We give yet another description of the principal angles. The projections onto W and W^⊥ are related by

P_W = I − P_{W^⊥}, or W W^H = I − W_⊥ W_⊥^H.    (15)

Proposition
Let V, W ∈ K^{J×I} be orthogonal matrices spanning V and W, n-dimensional subspaces of K^J (here we may assume that n = #I ≤ #J = d). We denote by W_⊥ an orthogonal matrix spanning W^⊥. Then

‖W_⊥^H V‖_F = ‖sin Θ(V, W)‖_2.
Proof at the blackboard ...
Stability of singular spaces: principal angles
We leave it as an additional exercise to show that, as a corollary of the previous result, one also has

‖P_{W^⊥} P_V‖_F = ‖sin Θ(V, W)‖_2.    (16)
Again, one has to use the properties of the trace operator!
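Principal angles as defined above are easy to compute. The sketch below (random subspaces, for illustration only) builds two orthonormal bases and verifies the identity ‖P_V − P_W‖_F = √2 ‖sin Θ(V, W)‖_2 from the Proposition:

```python
import numpy as np

rng = np.random.default_rng(3)
d, n = 8, 3
V, _ = np.linalg.qr(rng.standard_normal((d, n)))   # orthonormal basis of a random V
W, _ = np.linalg.qr(rng.standard_normal((d, n)))   # orthonormal basis of a random W

# cos Theta(V, W) = singular values of W^H V  (real case: W^T V)
cos_theta = np.linalg.svd(W.T @ V, compute_uv=False)
sin_theta = np.sqrt(np.clip(1.0 - cos_theta**2, 0.0, None))

lhs = np.linalg.norm(V @ V.T - W @ W.T, 'fro')     # ||P_V - P_W||_F
rhs = np.sqrt(2.0) * np.linalg.norm(sin_theta)     # sqrt(2) ||sin Theta(V, W)||_2
```

The `np.clip` guards against tiny negative values of 1 − cos²θ caused by rounding.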
Stability of singular spaces: Wedin's bound

Let A, Ã ∈ K^{I×J} and define E = Ã − A. We fix 1 ≤ k ≤ r_max = min{#I, #J} and consider the SVD

A = [U_1 U_0] [ Σ_1 0 ; 0 Σ_0 ] [ V_1^H ; V_0^H ] = U_1 Σ_1 V_1^H + U_0 Σ_0 V_0^H =: A_1 + A_0,

where

V_1 = [v_1, ..., v_k], V_0 = [v_{k+1}, ..., v_{r_max}], U_1 = [u_1, ..., u_k], U_0 = [u_{k+1}, ..., u_{r_max}],

and

Σ_1 = diag(σ_1, ..., σ_k), Σ_0 = diag(σ_{k+1}, ..., σ_{r_max}).

For the same k we use corresponding notations for an analogous decomposition of Ã, except that all the symbols carry a ˜ on top. It is obvious that

A + E = A_1 + A_0 + E = Ã_1 + Ã_0 = Ã.    (17)
Stability of singular spaces: Wedin’s bound
Let us notice that range(A_1) is indeed spanned by the orthonormal system U_1, while range(A_1^H) is spanned by V_1. Similar statements hold for Ã_1.

We now define two "residual" matrices

R_11 = A Ṽ_1 − Ũ_1 Σ̃_1 and R_21 = A^H Ũ_1 − Ṽ_1 Σ̃_1^H.
Stability of singular spaces: Wedin’s bound
A connection between R_11 and E is seen through the rewriting

R_11 = A Ṽ_1 − Ũ_1 Σ̃_1 = (Ã − E) Ṽ_1 − Ũ_1 (Ũ_1^H Ã Ṽ_1) = −E Ṽ_1.

We used that Σ̃_1 = Ũ_1^H Ã Ṽ_1 = Ũ_1^H Ã_1 Ṽ_1, that Ũ_1 Ũ_1^H is the orthogonal projection onto range(Ã_1), hence Ũ_1 Ũ_1^H Ã_1 = Ã_1, and that Ã_1 Ṽ_1 = Ã Ṽ_1. Analogously, it can be shown that

R_21 = −E^H Ũ_1.
Stability of singular spaces: Wedin’s bound
Theorem (Wedin)
Under the notations used so far in this section, assume that there exist α ≥ 0 and δ > 0 such that

σ_k(Ã) ≥ α + δ and σ_{k+1}(A) ≤ α.    (18)

Then for every unitarily invariant norm ‖·‖_* (in particular, for the spectral and Frobenius norms)

max{ ‖sin Θ(Ṽ_1, V_1)‖_*, ‖sin Θ(Ũ_1, U_1)‖_* } ≤ max{ ‖R_11‖_*, ‖R_21‖_* } / δ ≤ ‖E‖_* / δ.
Introduction to probability theory

In the probabilistic proof of the convergence of the power method for computing singular vectors of a matrix A, we used the following random process to define a random point on the sphere: we considered an imaginary darts player X, with no experience, throwing darts (actually at random) at the target, the two-dimensional disk Ω = B_r of radius r and center 0. To every point ω ∈ Ω within the target hit by a dart we assign a point on the boundary of the target (the one-dimensional sphere S^1): X(ω) = ω/‖ω‖_2 ∈ S^1.
Introduction to probability theory
It is also true that the one-dimensional sphere S^1 is parametrized by the angle θ ∈ [0, 2π); hence we can also consider the player X as a map from ω ∈ Ω to θ = arg(ω/‖ω‖_2) ∈ [0, 2π) ⊂ R,

X : Ω → R.

We defined the probability that the very inexperienced player X hits a point ω ∈ B ⊂ Ω simply as the ratio of the area of B relative to the entire area of the target Ω:

P(ω ∈ B) = Vol(B)/Vol(Ω).

What is then the probability of the event

B = {ω ∈ Ω : θ_1 ≤ X(ω) ≤ θ_2}?
Introduction to probability theory
What is then the probability of the event

B = {ω ∈ Ω : θ_1 ≤ X(ω) ≤ θ_2}?

The area of B is the area of the disk sector of vectors with angles in [θ_1, θ_2], which is computed using polar coordinates:

Vol(B) = ∫_{θ_1}^{θ_2} ∫_0^r s ds dθ = r²(θ_2 − θ_1)/2.

Hence, recalling that Vol(Ω) = r²π, we have

P({ω ∈ Ω : θ_1 ≤ X(ω) ≤ θ_2}) = Vol(B)/Vol(Ω) = r²(θ_2 − θ_1)/(2πr²) = (θ_2 − θ_1)/(2π).
Introduction to probability theory
What is the average outcome of the (random) player? Well, if we consider N attempts of the player, the empirical average result would be

(1/N) ∑_{i=1}^N X(ω_i),

which, for N → ∞, converges to

E X := ∫_Ω X(ω) dP(ω) = (1/(2π)) ∫_0^{2π} θ dθ = (2π)²/(4π) = π.

The function φ(θ) = (1/(2π)) I_{[0,2π)}(θ) is called the probability density of the random player X. In general one computes the quantities

E g(X) := ∫_Ω g(X(ω)) dP(ω) = ∫_R g(θ) φ(θ) dθ.
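The computation E X = π is easy to confirm by simulation. The sketch below (illustrative only, not from the slides) throws uniform random darts at the unit disk via rejection sampling and averages the resulting angles:

```python
import numpy as np

rng = np.random.default_rng(4)
N = 400_000

# Uniform points in the disk of radius r = 1 via rejection sampling
# from the enclosing square [-1, 1]^2.
pts = rng.uniform(-1, 1, size=(2 * N, 2))
pts = pts[np.einsum('ij,ij->i', pts, pts) <= 1.0][:N]

# X(omega) = angle of the dart, mapped into [0, 2*pi)
theta = np.arctan2(pts[:, 1], pts[:, 0]) % (2 * np.pi)
empirical_mean = theta.mean()        # should approach E X = pi
```

Since the angle is uniform on [0, 2π), the empirical average of a few hundred thousand throws lands very close to π.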
Introduction to probability theory
If we denote by Σ the set of all possible events (identified with measurable subsets of Ω), the triple (Ω, Σ, P) is called a probability space.

The random player, which we described above with the function X : Ω → R, is an example of a random variable.

The average outcome of its playing, E X, is the expected value of the random variable.

As already mentioned above, this particular random variable has a density function φ(θ), which allows us to compute interesting quantities (such as the probabilities of certain events) by integrals on the real line.

This relatively intuitive construction is generalized to abstract probability spaces as follows.
Abstract probability theory
Let (Ω, Σ, P) be a probability space, where Σ denotes a σ-algebra (of the events) on the sample space and P a probability measure on (Ω, Σ).
The probability of an event B ∈ Σ is denoted by

P(B) = ∫_B dP(ω) = ∫_Ω I_B(ω) dP(ω),

where the characteristic function I_B(ω) takes the value 1 if ω ∈ B and 0 otherwise.

The union bound (or Bonferroni's inequality, or Boole's inequality) states that for a collection of events B_ℓ ∈ Σ, ℓ = 1, ..., n, we have

P( ∪_{ℓ=1}^n B_ℓ ) ≤ ∑_{ℓ=1}^n P(B_ℓ).    (19)
Random variables
A random variable X is a real-valued measurable function on (Ω, Σ). Recall that X is called measurable if the preimage X^{−1}(A) = {ω ∈ Ω : X(ω) ∈ A} is contained in Σ for all Borel measurable subsets A ⊂ R.

Usually, every reasonable function X will be measurable; in particular, all functions appearing in these lectures.

In what follows we will usually not mention the underlying probability space (Ω, Σ, P) when speaking about random variables.

The distribution function F = F_X of X is defined as

F(t) = P(X ≤ t), t ∈ R.    (20)
Densities

A random variable X possesses a probability density function φ : R → R_+ if

P(a < X ≤ b) = ∫_a^b φ(t) dt for all a < b ∈ R.    (21)

Then φ(t) = (d/dt) F(t). The expectation or mean of a random variable will be denoted by

E X = ∫_Ω X(ω) dP(ω).    (22)

If X has probability density function φ, then for a function g : R → R,

E g(X) = ∫_{−∞}^{+∞} g(t) φ(t) dt,    (23)

whenever the integral exists.
Moments

The quantities E X^p, p > 0, are called moments of X, while the E|X|^p are called absolute moments. (Sometimes we may omit "absolute".)

The quantity E(X − E X)² = E X² − (E X)² is called the variance.

For 1 ≤ p < ∞, (E|X|^p)^{1/p} defines a norm on the space L_p(Ω, P) of all p-integrable random variables; in particular, the triangle inequality

(E|X + Y|^p)^{1/p} ≤ (E|X|^p)^{1/p} + (E|Y|^p)^{1/p}    (24)

holds for all p-integrable random variables X, Y on (Ω, Σ, P).

Hölder's inequality states that, for random variables X, Y on a common probability space and p, q ≥ 1 with 1/p + 1/q = 1, we have

|E XY| ≤ (E|X|^p)^{1/p} (E|Y|^q)^{1/q}.
Sequences of random variables and convergence
Let X_n, n ∈ N, be a sequence of random variables such that X_n converges to X as n → ∞, in the sense that lim_{n→∞} X_n(ω) = X(ω) for all ω.

Lebesgue's dominated convergence theorem states that if there exists a random variable Y with E|Y| < ∞ such that |X_n| ≤ |Y| a.s., then lim_{n→∞} E X_n = E X.

Lebesgue's dominated convergence theorem also has an obvious formulation for integrals of sequences of functions.
Fubini’s theorem
Fubini’s theorem on the integration of functions of two variablescan be formulated as follows.
Let f : Aˆ B Ñ C be measurable, where pA, νq and pB, µq aremeasurable spaces.
Ifş
AˆB |f px , yq|dpν b µqpx , yq ă 8 then
ż
A
˜
ż
Bf px , yq dµpyq
¸
dνpxq “
ż
B
˜
ż
Af px , yq dνpxq
¸
dµpyq .
Moment computations and Cavalieri’s formula
Absolute moments can be computed by means of the following formula.

Proposition
The absolute moments of a random variable X can be expressed as

E|X|^p = p ∫_0^∞ P(|X| ≥ t) t^{p−1} dt, p > 0.

Corollary
For a random variable X the expectation satisfies

E X = ∫_0^∞ P(X ≥ t) dt − ∫_0^∞ P(X ≤ −t) dt.
Tail bounds and Markov inequality
The function t ↦ P(|X| ≥ t) is called the tail of X. A simple but often effective tool to estimate the tail by expectations and moments is the Markov inequality.

Theorem
Let X be a random variable. Then

P(|X| ≥ t) ≤ E|X| / t for all t > 0.
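A quick empirical illustration of the Markov inequality (an illustrative sketch with exponentially distributed samples, not part of the slides): for nonnegative samples the empirical tail is always dominated by the empirical mean divided by t, since each indicator 1{X_i ≥ t} is at most X_i/t.

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.exponential(scale=1.0, size=200_000)   # nonnegative samples, E|X| = 1

thresholds = (1.0, 2.0, 5.0)
tails = {t: np.mean(X >= t) for t in thresholds}   # empirical P(|X| >= t)
bounds = {t: X.mean() / t for t in thresholds}     # Markov bound  E|X| / t
```

For the exponential distribution the true tail is e^{−t}, so the Markov bound 1/t is quite loose here, which is typical: Markov trades sharpness for generality.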
Tail bounds and Markov inequality
Remark
As an important consequence we note that, for p > 0,

P(|X| ≥ t) = P(|X|^p ≥ t^p) ≤ t^{−p} E|X|^p for all t > 0.

The special case p = 2 is referred to as the Chebyshev inequality. Similarly, for θ > 0 we obtain

P(X ≥ t) = P( exp(θX) ≥ exp(θt) ) ≤ exp(−θt) E exp(θX) for all t ∈ R.

The function θ ↦ E exp(θX) is usually called the Laplace transform or the moment generating function of X.
Normal distributions
A normally distributed random variable, or Gaussian random variable, X has probability density function

ψ(t) = (1/√(2πσ²)) exp( −(t − µ)²/(2σ²) ).    (25)

Its distribution is often denoted by N(µ, σ).

It has mean E X = µ and variance E(X − µ)² = σ².

A standard Gaussian random variable (or standard normal, or simply standard Gaussian), usually denoted g, is a Gaussian random variable with E g = 0 and E g² = 1.
Random vectors

A random vector X = [X_1, ..., X_n]^T ∈ R^n is a collection of n random variables on a common probability space (Ω, Σ, P).

Its expectation is the vector E X = [E X_1, ..., E X_n]^T ∈ R^n, while its joint distribution function is defined as

F(t_1, ..., t_n) = P(X_1 ≤ t_1, ..., X_n ≤ t_n), t_1, ..., t_n ∈ R.

Similarly to the univariate case, the random vector X has a joint probability density if there exists a function φ : R^n → R_+ such that for any measurable domain D ⊂ R^n

P(X ∈ D) = ∫_D φ(t_1, ..., t_n) dt_1 ⋯ dt_n.

A complex random vector Z = X + iY ∈ C^n is a special case of a 2n-dimensional real random vector (X, Y) ∈ R^{2n}.
Independence

A collection of random variables X_1, ..., X_n is (stochastically) independent if, for all t_1, ..., t_n ∈ R,

P(X_1 ≤ t_1, ..., X_n ≤ t_n) = ∏_{ℓ=1}^n P(X_ℓ ≤ t_ℓ).

For independent random variables, we have

E[ ∏_{ℓ=1}^n X_ℓ ] = ∏_{ℓ=1}^n E[X_ℓ].    (26)

If they have a joint probability density function φ, then the latter factorizes as

φ(t_1, ..., t_n) = φ_1(t_1) × ⋯ × φ_n(t_n),

where φ_1, ..., φ_n are the probability density functions of X_1, ..., X_n.
Independence
In general, a collection X_1 ∈ R^{n_1}, ..., X_m ∈ R^{n_m} of random vectors is independent if for any collection of measurable sets A_ℓ ⊂ R^{n_ℓ}, ℓ ∈ [m],

P(X_1 ∈ A_1, ..., X_m ∈ A_m) = ∏_{ℓ=1}^m P(X_ℓ ∈ A_ℓ).

If furthermore f_ℓ : R^{n_ℓ} → R^{N_ℓ}, ℓ = 1, ..., m, are measurable functions, then the random vectors f_1(X_1), ..., f_m(X_m) are also independent.

A collection X_1, ..., X_m ∈ R^n of independent random vectors that all have the same distribution is called independent identically distributed (i.i.d.). A random vector X' will be called an independent copy of X if X and X' are independent and have the same distribution.
Sum of independent random variables
The sum X + Y of two independent random variables X, Y having probability density functions φ_X, φ_Y has probability density function φ_{X+Y} given by the convolution

φ_{X+Y}(t) = (φ_X ∗ φ_Y)(t) = ∫_{−∞}^{+∞} φ_X(u) φ_Y(t − u) du.    (27)
Fubini’s theorem for expectations
The theorem takes the following form. Let X, Y ∈ R^n be two independent random vectors (or simply random variables) and let f : R^n × R^n → R be a measurable function such that E|f(X, Y)| < ∞. Then the functions

f_1 : R^n → R, f_1(x) = E f(x, Y), and f_2 : R^n → R, f_2(y) = E f(X, y)

are measurable, E|f_1(X)| < ∞ and E|f_2(Y)| < ∞, and

E f_1(X) = E f_2(Y) = E f(X, Y).    (28)

The random variable f_1(X) is also called the conditional expectation, or expectation conditional on X, and will sometimes be denoted by E_Y f(X, Y).
Gaussian vectors

A random vector g ∈ R^n is called a standard Gaussian vector if its components are independent standard normally distributed random variables. More generally, a random vector X ∈ R^n is said to be a Gaussian vector, or multivariate normally distributed, if there exists a matrix A ∈ R^{n×k} such that X = Ag + µ, where g ∈ R^k is a standard Gaussian vector and µ ∈ R^n is the mean of X.

The matrix Σ = A A^T is then the covariance matrix of X, i.e., Σ = E (X − µ)(X − µ)^T. If Σ is non-degenerate, i.e., Σ is positive definite, then X has a joint probability density function of the form

ψ(x) = (1 / ((2π)^{n/2} √det(Σ))) exp( −(1/2) ⟨x − µ, Σ^{−1}(x − µ)⟩ ).

In the degenerate case, when Σ is not invertible, X does not have a density.

It is easily deduced from the density that a rotated standard Gaussian Ug, where U is an orthogonal matrix, has the same distribution as g itself.
Sum of gaussian variables
If X_1, ..., X_n are independent and normally distributed random variables with means µ_1, ..., µ_n and variances σ_1², ..., σ_n², then X = [X_1, ..., X_n]^T has a multivariate normal distribution, and the sum Z = ∑_{ℓ=1}^n X_ℓ has the univariate normal distribution with mean µ = ∑_{ℓ=1}^n µ_ℓ and variance σ² = ∑_{ℓ=1}^n σ_ℓ², as can be calculated from (27).
Jensen’s inequality
Jensen’s inequality reads as follows.
Theorem
Let f : R^n → R be a convex function, and let X ∈ R^n be a random vector. Then

f(E X) ≤ E f(X).

Note that −f is convex if f is concave, so that for concave functions f, Jensen's inequality reads

E f(X) ≤ f(E X).    (29)
Cramer’s Theorem
We often encounter sums of independent mean-zero random variables. Deviation inequalities bound the tail of such sums.
We recall that the moment generating function of a (real-valued) random variable X is defined by

θ ↦ E exp(θX),

for all θ ∈ R for which the expectation on the right-hand side is well-defined.

Its logarithm is the cumulant generating function

C_X(θ) = ln E exp(θX).

With the help of these definitions we can formulate Cramer's theorem.
Cramer’s Theorem
Theorem
Let X_1, ..., X_M be a sequence of independent (real-valued) random variables with cumulant generating functions C_{X_ℓ}, ℓ ∈ [M] := {1, ..., M}. Then, for t > 0,

P( ∑_{ℓ=1}^M X_ℓ ≥ t ) ≤ exp( inf_{θ>0} { −θt + ∑_{ℓ=1}^M C_{X_ℓ}(θ) } ).

Remark
The function

t ↦ inf_{θ>0} { −θt + ∑_{ℓ=1}^M C_{X_ℓ}(θ) }

appearing in the exponential is closely connected to the convex conjugate function appearing in convex analysis.
Hoeffdings’ inequality
TheoremLet X1, . . . ,XM be a sequence of independent random variablessuch that EX` “ 0 and |X`| ď B` almost surely, ` P rMs. Then
P
˜
Mÿ
`“1
X` ě t
¸
ď exp
˜
´t2
2řM`“1 B2
`
¸
,
and consequently,
P
˜∣∣∣∣∣ Mÿ
`“1
X`
∣∣∣∣∣ ě t
¸
ď 2 exp
˜
´t2
2řM`“1 B2
`
¸
. (30)
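Bound (30) is simple to probe by simulation. This illustrative sketch (not from the slides) uses Rademacher ±1 variables, so that B_ℓ = 1 and ∑ B_ℓ² = M, and compares the empirical two-sided tail with the Hoeffding bound:

```python
import numpy as np

rng = np.random.default_rng(5)
M, trials = 100, 50_000
X = rng.choice([-1.0, 1.0], size=(trials, M))   # mean-zero, |X_l| <= B_l = 1
S = X.sum(axis=1)                               # sum of M independent terms

t = 25.0
empirical = np.mean(np.abs(S) >= t)             # estimated P(|sum X_l| >= t)
hoeffding = 2 * np.exp(-t**2 / (2 * M))         # bound (30) with sum B_l^2 = M
```

Here the true tail is roughly an order of magnitude below the bound, which is expected: Hoeffding only uses boundedness, not the actual variance.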
Bernstein inequalities
Theorem
Let X_1, ..., X_M be independent mean-zero random variables such that, for all integers n ≥ 2,

E|X_ℓ|^n ≤ (n!/2) R^{n−2} σ_ℓ² for all ℓ ∈ [M],    (31)

for some constants R > 0 and σ_ℓ > 0, ℓ ∈ [M]. Then, for all t > 0,

P( | ∑_{ℓ=1}^M X_ℓ | ≥ t ) ≤ 2 exp( −t² / (2(σ² + Rt)) ),    (32)

where σ² := ∑_{ℓ=1}^M σ_ℓ².
Bernstein inequalities
Before providing the proof we give two consequences. The first is the Bernstein inequality for bounded random variables.

Corollary
Let X_1, ..., X_M be independent mean-zero random variables such that |X_ℓ| < K almost surely, for ℓ ∈ [M] and some constant K > 0. Further assume E|X_ℓ|² ≤ σ_ℓ² for constants σ_ℓ > 0, ℓ ∈ [M]. Then, for all t > 0,

P( | ∑_{ℓ=1}^M X_ℓ | ≥ t ) ≤ 2 exp( −t² / (2(σ² + Kt/3)) ),    (33)

where σ² := ∑_{ℓ=1}^M σ_ℓ².
Bernstein inequalities
As a second consequence, we present the Bernstein inequality for subexponential random variables.

Corollary
Let X_1, ..., X_M be independent mean-zero subexponential random variables, that is, P(|X_ℓ| ≥ t) ≤ β e^{−κt} for some constants β, κ > 0, for all t > 0 and ℓ ∈ [M]. Then

P( | ∑_{ℓ=1}^M X_ℓ | ≥ t ) ≤ 2 exp( −(κt)² / (2(2βM + κt)) ).    (34)
The Johnson-Lindenstrauss Lemma
Theorem (Johnson-Lindenstrauss Lemma)
For any 0 < ε < 1 and any integer n, let k be a positive integer such that

k ≥ 2β (ε²/2 − ε³/3)^{−1} ln n,

for some β ≥ 2. Then for any set P of n points in R^d there is a map f : R^d → R^k such that for all v, w ∈ P

(1 − ε)‖v − w‖_2² ≤ ‖f(v) − f(w)‖_2² ≤ (1 + ε)‖v − w‖_2².    (35)

Furthermore, this map can be generated at random with probability 1 − (n^{2−β} − n^{1−β}).

Let us comment on the probability: choosing β ≫ 2 leads to a very favorable probability 1 − (n^{2−β} − n^{1−β}) ≈ 1 as soon as n is large enough.
The Johnson-Lindenstrauss Lemma
To clarify the impact of this result, consider a cloud of n = 10^7 points (ten million points) in dimension d = 10^6 (one million).

Then there exists a map f, generated at random with probability close to 1, which allows us to project the 10 million points into dimension k = O(ε^{−2} log 10^7) = O(7ε^{−2}), preserving the distances within the point cloud up to a maximal distortion of O(ε).

For example, for ε = 0.1 we obtain k of the order of a thousand, while the original dimension d was one million. Hence, we obtain a quasi-isometric dimensionality reduction of 3-4 orders of magnitude.
Projection onto random subspaces
One proof of the lemma takes f to be a suitable multiple of the orthogonal projection onto a random subspace of dimension k in R^d (Dasgupta-Gupta).

In this case f would be linear and representable by a matrix A.

Then, by considering the unit vectors z = (v − w)/‖v − w‖_2, we obtain that (35) is equivalent to

(1 − ε) ≤ ‖Az‖_2² ≤ (1 + ε),

or

|‖Az‖_2² − 1| ≤ ε.
... or a random point onto a fixed subspace
Hence the aim is essentially to estimate the length of a unit vector in R^d when it is projected onto a random k-dimensional subspace.

However, this length has the same distribution as the length of a random unit vector projected down onto a fixed k-dimensional subspace.

Here we take this subspace to be the space spanned by the first k coordinate vectors, for simplicity.
Gaussianly distributed vectors
We recall that X ∼ N(0, 1) if its density is

ψ(x) = (1/√(2π)) exp(−x²/2).

Let X_1, ..., X_d be d independent Gaussian N(0, 1) random variables, and let Y = ‖X‖_2^{−1} (X_1, ..., X_d).

It is easy to see that Y is a point chosen uniformly at random from the surface of the d-dimensional sphere S^{d−1} (exercise!).

Let the vector Z ∈ R^k be the projection of Y onto its first k coordinates, and let L = ‖Z‖_2². Clearly the expected squared length of Z is µ = E[‖Z‖_2²] = k/d (exercise!).
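The claims marked "exercise!" can be checked empirically. This sketch (illustrative, with arbitrary d and k) samples uniform points on the sphere as normalized Gaussian vectors and confirms E[L] ≈ k/d:

```python
import numpy as np

rng = np.random.default_rng(6)
d, k, N = 50, 5, 200_000

X = rng.standard_normal((N, d))
Y = X / np.linalg.norm(X, axis=1, keepdims=True)  # uniform points on S^{d-1}
L = np.sum(Y[:, :k]**2, axis=1)                   # squared length of first k coords

mean_L = L.mean()                                  # should approach k/d = 0.1
```

The rotation invariance of the Gaussian density is exactly what makes the normalized vector Y uniform on the sphere, tying this back to the remark on rotated standard Gaussians above.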
Concentration inequalities
Lemma
Let k < d. Then:

(a) If α < 1, then

P( L ≤ αk/d ) ≤ α^{k/2} ( 1 + (1 − α)k/(d − k) )^{(d−k)/2} ≤ exp( (k/2)(1 − α + ln α) ).

(b) If α > 1, then

P( L ≥ αk/d ) ≤ α^{k/2} ( 1 + (1 − α)k/(d − k) )^{(d−k)/2} ≤ exp( (k/2)(1 − α + ln α) ).
Proof techniques
The proof of this lemma uses the by-now-standard techniques for proving large deviation bounds on sums of random variables, such as the ones we used for proving the Hoeffding and Bernstein inequalities.

The proof of the Johnson-Lindenstrauss Lemma (as developed at the blackboard) follows by a union bound over the applications of the concentration inequalities of the Lemma.
Johnson-Lindenstrauss Lemma by Gaussian matrices
Theorem (Johnson-Lindenstrauss Lemma)
For any 0 < ε < 1/2 and any integer n, let k be a positive integer such that

k ≥ β ε^{−2} ln n.

Then for any set P of n points in R^d there is a map f : R^d → R^k such that for all v, w ∈ P

(1 − ε)‖v − w‖_2² ≤ ‖f(v) − f(w)‖_2² ≤ (1 + ε)‖v − w‖_2².    (36)

Furthermore, this map can be generated at random with probability 1 − (n^{2−β(1−ε)} − n^{1−β(1−ε)}).
Concentration inequalities for Gaussian matrices
Lemma
Let x ∈ R^d. Assume that the entries of A ∈ R^{k×d} are sampled independently and identically distributed according to N(0, 1). Then

P( | ‖(1/√k) Ax‖_2² − ‖x‖_2² | > ε ‖x‖_2² ) ≤ 2 e^{−(ε²−ε³)k/4},

or, equivalently,

P( (1 − ε)‖x‖_2² ≤ ‖(1/√k) Ax‖_2² ≤ (1 + ε)‖x‖_2² ) ≥ 1 − 2 e^{−(ε²−ε³)k/4}.
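The lemma can be illustrated numerically: embedding one fixed vector with many independently drawn Gaussian matrices, the fraction of trials violating the two-sided bound stays below 2e^{−(ε²−ε³)k/4}. A sketch with arbitrarily chosen d, k and ε (not from the slides):

```python
import numpy as np

rng = np.random.default_rng(7)
d, k, eps, trials = 200, 100, 0.5, 1000
x = rng.standard_normal(d)
nx2 = x @ x                                      # ||x||_2^2

ratios = np.empty(trials)
for i in range(trials):
    A = rng.standard_normal((k, d))              # i.i.d. N(0, 1) entries
    ratios[i] = np.sum((A @ x)**2) / (k * nx2)   # ||(1/sqrt k) A x||^2 / ||x||^2

fail_rate = np.mean(np.abs(ratios - 1.0) > eps)  # empirical failure probability
bound = 2 * np.exp(-(eps**2 - eps**3) * k / 4)   # the lemma's bound
```

Since ‖(1/√k)Ax‖² / ‖x‖² is a chi-square variable with k degrees of freedom divided by k, the observed failure rate is typically well below the bound.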
Nothing special in Gaussian matrices
Lemma
Let x ∈ R^d. Assume that the entries of A ∈ R^{k×d} are independent and take the values +1 or −1, each with probability 1/2. Then

P( | ‖(1/√k) Ax‖_2² − ‖x‖_2² | > ε ‖x‖_2² ) ≤ 2 e^{−(ε²−ε³)k/4},

or, equivalently,

P( (1 − ε)‖x‖_2² ≤ ‖(1/√k) Ax‖_2² ≤ (1 + ε)‖x‖_2² ) ≥ 1 − 2 e^{−(ε²−ε³)k/4}.
Scalar product preservation
Corollary
Let P be a set of points in the unit ball B(1) ⊂ R^d, and let f : R^d → R^k be a linear function such that

(1 − ε)‖u ± v‖_2² ≤ ‖f(u ± v)‖_2² ≤ (1 + ε)‖u ± v‖_2²,    (37)

for all u, v ∈ P. Then

|⟨u, v⟩ − ⟨f(u), f(v)⟩| ≤ ε,

for all u, v ∈ P.
Scalar product preservation
Corollary
Let u, v be two points in the unit ball B(1) ⊂ R^d, and let A ∈ R^{k×d} be a matrix with random entries such that, for any given x ∈ R^d,

P( (1 − ε)‖x‖_2² ≤ ‖(1/√k) Ax‖_2² ≤ (1 + ε)‖x‖_2² ) ≥ 1 − 2 e^{−(ε²−ε³)k/4}.

(For instance, A is a matrix with Gaussian random entries, or with entries ±1 with probability 1/2.) Then the map f(x) = (1/√k) Ax satisfies, with probability at least 1 − 4 e^{−(ε²−ε³)k/4},

|⟨u, v⟩ − ⟨f(u), f(v)⟩| ≤ ε.
Convex sets
Definition
A subset K ⊂ R^N is called convex if for all x, z ∈ K the line segment connecting x and z is also completely contained in K, that is,

tx + (1 − t)z ∈ K for all t ∈ [0, 1].

It is straightforward to check that a set K ⊂ R^N is convex if and only if for all x_1, ..., x_n ∈ K and t_1, ..., t_n ≥ 0 with ∑_{j=1}^n t_j = 1, the convex combination ∑_{j=1}^n t_j x_j is also contained in K.
Convex sets
Definition
Let T ⊂ R^N be a set. Its convex hull conv(T) is the smallest convex set containing T.

It is well known that the convex hull of T consists of the convex combinations of elements of T:

conv(T) = { ∑_j t_j x_j : t_j ≥ 0, ∑_j t_j = 1, x_j ∈ T }.

Simple examples of convex sets include subspaces, affine spaces, half-spaces, polygons, and norm balls B(x, R). The intersection of convex sets is again convex.
Cones
Definition
A set K ⊂ R^N is called a cone if for all x ∈ K and all t ≥ 0, the vector tx is also contained in K. If, in addition, K is convex, then K is called a convex cone.

Obviously, the zero vector is contained in every cone. A set K is a convex cone if and only if for all x, z ∈ K and s, t ≥ 0, the vector sx + tz is also contained in K.

Simple examples of convex cones include the positive orthant R^N_+ = {x ∈ R^N : x_i ≥ 0 for all i ∈ [N]} and the set of positive semidefinite matrices in R^{N×N}. Another important example of a convex cone is the second-order cone

{ x ∈ R^{N+1} : √( ∑_{j=1}^N x_j² ) ≤ x_{N+1} }.    (38)
Dual cones
For a cone K ⊂ R^N, its dual cone K* is defined via

K* = { z ∈ R^N : ⟨x, z⟩ ≥ 0 for all x ∈ K }.    (39)

As an intersection of half-spaces, K* is closed and convex, and it is straightforward to check that K* is again a cone.

If K is a closed cone, then K** = K.

Moreover, if H, K ⊂ R^N are cones such that H ⊂ K, then K* ⊂ H*.

As an example, the dual cone of the positive orthant R^N_+ is R^N_+ itself, i.e., it is self-dual.
Polar cones
Note that the dual cone is closely related to the polar cone, which is defined by

K° = { z ∈ R^N : ⟨x, z⟩ ≤ 0 for all x ∈ K } = −K*.    (40)

The conic hull cone(T) of a set T ⊂ R^N is the smallest convex cone containing T.

It can be described as

cone(T) = { ∑_j t_j x_j : t_j ≥ 0, x_j ∈ T }.    (41)
Geometrical Hahn-Banach theorem
Convex sets can be separated by hyperplanes, as stated next.

Theorem
Let K_1, K_2 ⊂ R^N be convex sets whose interiors have empty intersection. Then there exist a vector w ∈ R^N and a scalar λ such that

K_1 ⊂ { x ∈ R^N : ⟨x, w⟩ ≤ λ },
K_2 ⊂ { x ∈ R^N : ⟨x, w⟩ ≥ λ }.
Extremal points
Definition
Let K ⊂ R^N be a convex set. A point x ∈ K is called an extreme point of K if x = tw + (1 − t)z for w, z ∈ K and t ∈ (0, 1) implies x = w = z.

Theorem
A compact convex set is the convex hull of its extreme points.

If K is a polygon, then its extreme points are the zero-dimensional faces of K, and the above statement is clearly intuitive.
Proper functions
We work with extended-valued functions F : R^N → (−∞, ∞].

Sometimes we also consider an additional extension of the values to −∞. Addition, multiplication and inequalities in (−∞, ∞] are understood in the "natural" sense; for instance, x + ∞ = ∞ for all x ∈ R.

The domain of an extended-valued function F is defined as

dom(F) = { x ∈ R^N : F(x) ≠ ∞ }.

A function with dom(F) ≠ ∅ is called proper.

A function F : K → R on a subset K ⊂ R^N can be extended to an extended-valued function by setting F(x) = ∞ for x ∉ K. Then clearly dom(F) = K, and we call this extension the canonical one.
Convex functions
DefinitionAn extended valued function F : RN Ñ p´8,8s is called convex iffor all x, z P RN and t P r0, 1s,
F ptx` p1´ tqzq ď tF pxq ` p1´ tqF pzq. (42)
F is called strictly convex if the above inequality is strict for x ‰ z and t P p0, 1q.
F is called strongly convex with parameter γ ą 0 if for allx, z P RN and t P r0, 1s
F ptx` p1´ tqzq ď tF pxq ` p1´ tqF pzq ´ pγ{2qtp1´ tq‖x´ z‖22. (43)
A function F : RN Ñ r´8,8q is called (strictly, strongly) concaveif ´F is (strictly, strongly) convex.
Convex functions
Obviously, a strongly convex function is strictly convex.
The domain of a convex function is convex, and a function F : K Ñ R on a convex subset K Ă RN is called convex if its canonical extension to RN is convex.
A function F is convex if and only if its epigraph
epipF q “ tpx, rq : r ě F pxqu Ă RN ˆ R
is a convex set.
Smooth convex functions
Proposition
Let F : RN Ñ R be differentiable.
1. F is convex if and only if for all x, y P RN
F pxq ě F pyq ` x∇F pyq, x´ yy,
where the gradient is defined as ∇F pyq “ pBy1F pyq, ..., ByNF pyqqT .
2. F is strongly convex with parameter γ ą 0 if and only if for all x, y P RN
F pxq ě F pyq ` x∇F pyq, x´ yy ` pγ{2q‖x´ y‖22.
3. Assume that F is twice differentiable. Then F is convex if and only if
∇2F pxq ľ 0
for all x P RN , where ∇2F is the Hessian of F .
Composition of convex functions
Proposition
1. Let F ,G be convex functions on RN . Then, for α, β ě 0 thefunction αF ` βG is convex.
2. Let F : RÑ R be convex and nondecreasing, andG : RN Ñ R be convex. Then Hpxq “ F pG pxqq is convex.
Examples of convex functions
§ Every norm ‖¨‖ on RN is convex by the triangle inequality and homogeneity.
§ The `p-norms are strictly convex for 1 ă p ă 8 and notstrictly convex for p “ 1,8.
§ For A P RNˆN positive semidefinite, the function F pxq “ x˚Axis convex. It is strictly convex if A is positive definite.
§ For a convex set K the characteristic function
χK pxq “
#
0 x P K ,
8 else(44)
is convex.
Convex functions, continuity, and lower-semicontinuity
Proposition
Let F : RN Ñ R be a convex function. Then F is continuous onRN .
Treatment of extended valued functions requires the concept oflower semicontinuity.
Definition
A function F : RN Ñ p´8,8s is called lower semicontinuous if for all x P RN and every sequence pxjqjPN Ă RN converging to x it holds that

lim infjÑ8 F pxjq ě F pxq.
Convex functions, continuity, and lower-semicontinuity
Clearly, a continuous function F : RN Ñ R is lower semicontinuous.

A nontrivial example is the characteristic function χK of a proper subset K Ă RN . Clearly, χK is not continuous, but it is lower semicontinuous if and only if K is closed.
A function is lower semicontinuous if and only if its epigraph isclosed.
We remark that the notion of lower semicontinuity is particularly useful in infinite-dimensional Hilbert spaces, where for instance the norm ‖¨‖ is not continuous with respect to the weak topology but is still lower semicontinuous with respect to the weak topology.
Convex functions and minimization
Convex functions have nice properties related to minimization.
A (global) minimum (or minimizer) of a function F : RN Ñ p´8,8s is a point x P RN satisfying F pxq ď F pyq for all y P RN .
A local minimum of F is a point x P RN such that there exists ε ą 0 with F pxq ď F pyq for all y satisfying ‖x´ y‖2 ď ε.
Proposition
Let F : RN Ñ p´8,8s be convex.
1. A local minimum of F is a global minimum.
2. The set of minima of F is convex.
3. If F is strictly convex then the minimum is unique.
The fact that local minima of convex functions are automatically global minima is the essential reason why efficient optimization methods are available for convex optimization problems.
Jointly convex functions
We say that a function f px, yq of two arguments x P Rn, y P Rm is jointly convex if it is convex as a function of the variable z “ px, yq.
Partial minimization of a jointly convex function in one variableyields again a convex function as stated next.
Theorem
Let f : Rn ˆ Rm Ñ p´8,8s be a jointly convex function. Then the function gpxq “ infyPRm f px, yq, x P Rn, is convex.
Maxima of convex functions on compact convex sets
Theorem
Let K Ă RN be a compact convex set, and F : K Ñ R be a convex function. Then F attains its maximum at an extreme point of K .
The convex conjugate
The convex conjugate is a very useful concept in convex analysisand optimization.
Definition
Let F : RN Ñ p´8,8s. Then its convex conjugate (or Fenchel dual) function F ˚ : RN Ñ p´8,8s is defined by

F ˚pyq :“ supxPRN txx, yy ´ F pxqu .
The convex conjugate F ˚ is always a convex function, no matter whether the function F is convex or not. The definition of F ˚ immediately gives the Fenchel (or Young, or Fenchel-Young) inequality
xx, yy ď F pxq ` F ˚pyq for all x, y P RN . (45)
The convex conjugate
Let us summarize some properties of convex conjugate functions.
Proposition
Let F : RN Ñ p´8,8s.
1. The convex conjugate F ˚ is lower semicontinuous.
2. The biconjugate F ˚˚ is the largest lower semicontinuous convex function satisfying F ˚˚pxq ď F pxq for all x P RN . In particular, if F is convex and lower semicontinuous then F “ F ˚˚.
3. For τ ‰ 0 let Fτ pxq :“ F pτxq. Then pFτ q˚pyq “ F ˚py{τq.
4. For τ ą 0, pτF q˚pyq “ τF ˚py{τq.
5. For z P RN let Fzpxq :“ F px´ zq. Then pFzq˚pyq “ xz, yy ` F ˚pyq.
The biconjugate F ˚˚ is sometimes also called the convex relaxationof F because of 2.
The convex conjugate
Let us compute the convex conjugate for some examples.
Example

§ Let F pxq “ p1{2q‖x‖22, x P RN . Then F ˚pyq “ p1{2q‖y‖22 “ F pyq, y P RN . Indeed, since

xx, yy ď p1{2q‖x‖22 ` p1{2q‖y‖22

we have

F ˚pyq “ supxPRN txx, yy ´ F pxqu ď p1{2q‖y‖22 .

For the converse inequality, we just set x “ y in the definition of the convex conjugate to obtain

F ˚pyq ě ‖y‖22 ´ p1{2q‖y‖22 “ p1{2q‖y‖22 .

Note that this example is the only function F on RN satisfying F “ F ˚.
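This computation can be sanity-checked numerically. The sketch below approximates F ˚pyq for F pxq “ p1{2qx2 in one dimension by a brute-force grid maximization of xy ´ F pxq (an illustration only, not a practical method; the grid bounds and resolution are our own choices) and compares it with p1{2qy2.

```python
# Approximate F*(y) = sup_x { x*y - F(x) } on a fine grid and compare
# with the exact value 0.5*y^2 for F(x) = 0.5*x^2.

def conjugate_on_grid(F, y, lo=-10.0, hi=10.0, steps=100001):
    xs = (lo + (hi - lo) * k / (steps - 1) for k in range(steps))
    return max(x * y - F(x) for x in xs)

F = lambda x: 0.5 * x * x
for y in (-2.0, 0.0, 1.5):
    approx = conjugate_on_grid(F, y)
    exact = 0.5 * y * y          # the exact maximizer is x = y
    assert abs(approx - exact) < 1e-6
```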
The convex conjugate
§ Let F pxq “ exppxq, x P R. The map x ÞÑ xy ´ exppxq takes its maximum at x “ ln y if y ą 0, so that

F ˚pyq “ supxPR txy ´ exu “ y ln y ´ y if y ą 0, 0 if y “ 0, and 8 if y ă 0.

The Young inequality for this particular pair reads

xy ď ex ` y lnpyq ´ y for all x P R, y ą 0.
§ Let F “ χK be the characteristic function of a convex set K . Its convex conjugate is given by

F ˚pyq “ supxPK xx, yy ,

the support function of K .
The subdifferential
The subdifferential generalizes the gradient to functions that are not necessarily differentiable.
Definition
The subdifferential of a convex function F : RN Ñ R at a point x P RN is defined by
BF pxq “ tv P RN : F pzq ´ F pxq ě xv, z´ xy for all z P RNu . (46)
The elements of BF pxq are called subgradients of F at x.
The subdifferential BF pxq of a convex function F is always non-empty. If F is differentiable at x then BF pxq contains only the gradient,
BF pxq “ t∇F pxqu .
The subdifferential
A simple example of a function with a non-trivial subdifferential is the absolute value F pxq “ |x |, for which
BF pxq “ tsignpxqu if x ‰ 0, and BF pxq “ r´1, 1s if x “ 0,
where signpxq “ `1 for x ą 0 and signpxq “ ´1 for x ă 0 as usual.
The subdifferential
The subdifferential allows a simple characterization of minimizers of convex functions.
Theorem
A vector x is a minimum of F if and only if 0 P BF pxq.
The subdifferential and conjugation
Convex conjugate functions and subdifferentials are related in the following way.
Theorem
Let F : RN Ñ p´8,8s be a convex function and x, y P RN . The following conditions are equivalent:
1. y P BF pxq,
2. F pxq ` F ˚pyq “ xx, yy.
If additionally F is lower semicontinuous, then 1 and 2 are equivalent to
3. x P BF ˚pyq.
As a consequence, if F is a convex lower semicontinuous function then BF is the inverse of BF ˚ in the sense that x P BF ˚pyq if and only if y P BF pxq.
Proximal mapping
Let F : RN Ñ p´8,8s be a convex function. Then, for z P RN , the function

x ÞÑ F pxq ` p1{2q‖x´ z‖22

is strictly convex due to the strict convexity of x ÞÑ ‖x‖22. By a previous result, its minimizer is unique. The mapping

PF pzq :“ arg min tF pxq ` p1{2q‖x´ z‖22 : x P RNu

is called the proximal mapping associated with F . In the special case that F “ χK is the characteristic function of a convex set K defined in (44), then PK :“ PχK is the orthogonal projection onto K , that is,

PK pzq “ arg minxPK ‖x´ z‖2 .
If K is a subspace of RN then PK is the usual orthogonal projection onto K and, in particular, a linear map.
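The subspace case can be made concrete. The sketch below (our own example, not from the lecture) projects onto the line K “ spantp1, 1qu in R2; the projection z ÞÑ xz, uy u with the unit vector u is visibly linear in z.

```python
# Orthogonal projection onto the subspace K = span{(1, 1)} in R^2,
# i.e. the proximal mapping of the characteristic function of K.
import math

u = (1 / math.sqrt(2), 1 / math.sqrt(2))   # unit vector spanning K

def project_line(z):
    c = z[0] * u[0] + z[1] * u[1]          # coefficient <z, u>
    return (c * u[0], c * u[1])

p = project_line((3.0, 1.0))
print(p)   # close to (2.0, 2.0): each coordinate becomes the mean
```

Doubling the input doubles the output, which illustrates that the projection onto a subspace is a linear map.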
Proximal mapping
Proposition
Let F : RN Ñ p´8,8s be a convex function. Then x “ PF pzq if and only if z P x` BF pxq.
The previous proposition justifies writing
PF “ pI ` BF q´1
Moreau’s identity relates the proximal mappings of F and F˚.
Theorem (Moreau's Identity)
Let F : RN Ñ p´8,8s be a lower semicontinuous convex function. Then, for all z P RN ,
PF pzq ` PF˚pzq “ z .
If PF is easy to compute, then the previous result shows that also PF˚pzq “ z´ PF pzq is easy to compute. It is useful to note that applying Moreau's identity to the function τF for some τ ą 0 shows that

PτF pzq ` τPτ´1F˚pz{τq “ z. (47)
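Moreau's identity can be checked numerically in one dimension. For F pxq “ |x |, PF is soft thresholding with threshold 1 (see the example below on the shrinkage operator), and F ˚ is the characteristic function of r´1, 1s, so PF˚ is the projection (clipping) onto r´1, 1s; this sketch verifies that the two proximal mappings sum to the identity.

```python
# Numerical check of Moreau's identity P_F(z) + P_{F*}(z) = z for F = |.|:
# P_F is soft thresholding with threshold 1, and P_{F*} clips to [-1, 1].

def soft_threshold(z, tau=1.0):
    if z > tau:
        return z - tau
    if z < -tau:
        return z + tau
    return 0.0

def clip(z, lo=-1.0, hi=1.0):
    return max(lo, min(hi, z))

for z in (-3.0, -0.4, 0.0, 0.7, 2.5):
    assert abs(soft_threshold(z) + clip(z) - z) < 1e-12
```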
Nonexpansiveness of proximal mappings
Theorem
For a convex function F : RN Ñ p´8,8s the proximal mapping PF is nonexpansive,

‖PF pzq ´ PF pz1q‖2 ď ‖z´ z1‖2 for all z, z1 P RN .
Relevant example
Let F pxq “ |x |, x P R, be the absolute value function. Then a straightforward computation shows that, for τ ą 0,
PτF pyq “ arg minxPR tp1{2qpx ´ yq2 ` τ |x |u “ Sτ pyq, (48)

where

Sτ pyq “ y ´ τ if y ě τ, 0 if |y | ď τ, and y ` τ if y ď ´τ.
The function Sτ is called the soft thresholding or shrinkage operator. More generally, if F pxq “ ‖x‖1 is the `1-norm on RN , then the minimization problem defining the proximal operator decouples, and PτF pyq, y P RN , is given entrywise by

PτF pyql “ Sτ pylq, l P rNs. (49)
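The shrinkage operator (48)-(49) can be sketched as follows; the brute-force check (a grid search over a bounded interval, our own testing device) confirms in each coordinate that Sτ indeed minimizes p1{2qpx ´ yq2 ` τ |x |.

```python
# Soft thresholding S_tau from (48), applied entrywise as in (49),
# plus a brute-force check that it minimizes 0.5*(x - y)^2 + tau*|x|.

def S(tau, y):
    if y >= tau:
        return y - tau
    if y <= -tau:
        return y + tau
    return 0.0

def prox_l1(tau, y_vec):
    return [S(tau, yl) for yl in y_vec]

tau = 0.5
y = [2.0, -0.3, 0.75, -1.5]
print(prox_l1(tau, y))    # [1.5, 0.0, 0.25, -1.0]

obj = lambda x, yl: 0.5 * (x - yl) ** 2 + tau * abs(x)
for yl in y:
    grid = [-3 + 6 * k / 60000 for k in range(60001)]
    best = min(grid, key=lambda x: obj(x, yl))
    assert abs(best - S(tau, yl)) < 1e-3
```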
Optimization problems
An optimization problem is of the form
minxPRN F0pxq subject to Ax “ y, (50)

Fjpxq ď bj , j P rMs, (51)
where the function F0 : RN Ñ p´8,8s is called the objective function, the functions F1, ...,FM : RN Ñ p´8,8s are called constraint functions, and A P RmˆN , y P Rm provide the equality constraint. A point x satisfying the constraints is called feasible, and (50)-(51) is called feasible if there exists a feasible point. A feasible point x7 for which the minimum is attained, that is, F0px7q ď F0pxq for all feasible x, is called a minimizer or optimal point, and F0px7q is the optimal value.
Optimization problems
An optimization problem is of the form
minxPRN F0pxq subject to Ax “ y, Fjpxq ď bj , j P rMs.
We note that the equality constraint may be removed and represented by inequality constraints of the form xAj , xy ď yj and ´xAj , xy ď ´yj , where Aj P RN is a row of A. The set of feasible points described by the constraints is given by
K “ tx P RN : Ax “ y, Fjpxq ď bj , j P rMsu. (52)
Optimization problems
The optimization problem (50)-(51) is equivalent to the problem of minimizing F0 over K ,

minxPK F0pxq. (53)

Then our optimization problem is likewise equivalent to the unconstrained optimization problem

minxPRN F0pxq ` χK pxq.
Convex optimization problems
A convex optimization problem (or convex program) is a problem of the form (50)-(51), in which the objective function F0 and the constraint functions Fj are convex. In this case, the set of feasible points K defined in (52) is convex. The convex optimization problem is then equivalent to the unconstrained optimization problem of finding the minimum of the convex function
F pxq “ F0pxq ` χK pxq.
Due to this equivalence we may freely switch between constrained and unconstrained optimization problems. A linear optimization problem (or linear program) is one where the objective function F0 and all constraint functions F1, ...,FM are linear. Clearly, this is a special case of a convex optimization problem.
Lagrange function
The Lagrange function of an optimization problem of the form (50)-(51) is defined for x P RN , ξ P Rm, ν P RM , νl ě 0, l P rMs, by
Lpx, ξ,νq “ F0pxq ` ξ˚pAx´ yq ` řMl“1 νlpFlpxq ´ blq. (54)
For an optimization problem without inequality constraints we clearly set
Lpx, ξq “ F0pxq ` ξ˚pAx´ yq. (55)
The variables ξ,ν are called Lagrange multipliers. For ease of notation we write ν ě 0 if νl ě 0 for all l P rMs.
Lagrange dual function
The Lagrange dual function is defined by
Hpξ,νq “ infxPRN Lpx, ξ,νq, ξ P Rm, ν P RM , ν ě 0.
If x ÞÑ Lpx, ξ,νq is unbounded from below, then we set Hpξ,νq “ ´8. Again, if there are no inequality constraints then
Hpξq “ infxPRN Lpx, ξq “ infxPRN tF0pxq ` ξ˚pAx´ yqu , ξ P Rm.
The dual function is always concave because it is the pointwise infimum of a family of affine functions, even if the original problem (50)-(51) is not convex.
Lagrange dual function
The dual function provides a bound on the optimal value F0px7q of the minimization problem (50),

Hpξ,νq ď F0px7q for all ξ P Rm, ν ě 0. (56)
Indeed, if x is a feasible point for (50) then Ax´ y “ 0 and Flpxq ď bl , l “ 1, ...,M, so that, for all ξ P Rm and ν ě 0,
ξ˚pAx´ yq ` řMl“1 νlpFlpxq ´ blq ď 0.
Therefore,
Lpx, ξ,νq “ F0pxq ` ξ˚pAx´ yq ` řMl“1 νlpFlpxq ´ blq ď F0pxq.
Taking the infimum over all x P RN on the left hand side, and over all feasible x on the right hand side, shows (56).
Dual problem
We would like this lower bound to be as tight as possible. This motivates considering the optimization problem
max Hpξ,νq subject to ν ě 0. (57)
This optimization problem is called the dual problem to (50)-(51), which in this context is sometimes called the primal problem. Since H is concave, this problem is equivalent to the convex optimization problem of minimizing the convex function ´H subject to the positivity constraint ν ě 0. A pair pξ,νq, ξ P Rm, ν ě 0, is called dual feasible. A (feasible) maximizer pξ7,ν7q of (57) is referred to as dual optimal or optimal Lagrange multipliers. If x7 is optimal for the primal problem (50)-(51) then the triple px7, ξ7,ν7q is called primal-dual optimal.
Weak and strong duality
Clearly

Hpξ7,ν7q ď F0px7q. (58)
This inequality is called weak duality. For most (but not all)convex optimization problems even strong duality holds,
Hpξ7,ν7q “ F0px7q. (59)
Theorem (Slater’s constraint qualification)
Assume that F0,F1, ...,FM are convex functions with dompF0q “ RN . If there exists x P RN such that Ax “ y and Flpxq ă bl for all l P rMs, then strong duality holds for the optimization problem (50)-(51). In case there are no inequality constraints, strong duality holds provided there exists x with Ax “ y, that is, if (50)-(51) is feasible.
Saddle-point interpretation
Lagrange duality has a saddle-point interpretation. For ease of exposition we consider (50)-(51) without inequality constraints, but note that extensions that include inequality constraints are easily derived in the same way. Let px7, ξ7q be primal-dual optimal. Recalling the definition of the Lagrange function L we have
supξPRm Lpx, ξq “ supξPRm pF0pxq ` ξ˚pAx´ yqq “ F0pxq if Ax “ y, and 8 otherwise. (60)
In other words, the above supremum is 8 if x is not feasible. The (feasible) minimizer x7 of the primal problem (50)-(51) therefore satisfies
F0px7q “ infxPRN supξPRm Lpx, ξq .
Saddle-point interpretation
On the other hand, a dual optimal vector ξ7 satisfies
Hpξ7q “ supξPRm infxPRN Lpx, ξq
by definition of the Lagrange dual function. Weak duality therefore implies
supξPRm infxPRN Lpx, ξq ď infxPRN supξPRm Lpx, ξq ,
while strong duality reads
supξPRm infxPRN Lpx, ξq “ infxPRN supξPRm Lpx, ξq .
In other words, the order of minimization and maximization can be interchanged in the case of strong duality. This property is called the strong max-min property or saddle-point property. Indeed, in this case, a primal-dual optimal px7, ξ7q is a saddle point of the Lagrange function,

Lpx7, ξq ď Lpx7, ξ7q ď Lpx, ξ7q for all x P RN , ξ P Rm. (61)
One application
Theorem
Let ‖¨‖,~¨~ be two norms on RN . For A P RmˆN , y P Rm and η ą 0 consider the optimization problem

minxPRN ~x~ subject to ‖Ax´ y‖ ď η . (62)
Assume that (62) is strictly feasible in the sense that there exists x P RN such that ‖Ax´ y‖ ă η. Consider a minimizer x7 of (62). Then there exists a parameter λ ě 0 such that x7 is also the minimizer of the optimization problem

minxPRN ~x~ ` λ‖Ax´ y‖ . (63)
Conversely, for λ ą 0 let x7 be a minimizer of (63). Then there exists η ą 0 such that x7 is a minimizer of (62).
Equivalent problems
1. The same type of statement and proof is valid for the pair of optimization problems (63) and
minxPRN ‖Ax´ y‖ subject to ~x~ ď t, t ą 0. (64)

Note that in this case strict feasibility always holds because the zero vector x “ 0 satisfies ~x~ “ 0 ă t.
2. The equivalence of the problems (62), (63), (64) is somewhat "theoretical" because the parameter transformation between λ and η, or λ and t, is implicit.
A typical general form of convex optimization problem in data analysis
We consider a convex optimization problem of the form
minxPRN F pAxq ` G pxq, (65)
with A P RmˆN and convex functions F : Rm Ñ p´8,8s, G : RN Ñ p´8,8s. Most of the relevant convex optimization problems in data analysis fall into this class. The substitution z “ Ax yields the equivalent problem
minxPRN ,zPRm F pzq ` G pxq subject to Ax´ z “ 0. (66)
A typical general form of convex optimization problem in data analysis
The Lagrange dual function to this problem is given by
Hpξq “ infx,z tF pzq ` G pxq ` xA˚ξ, xy ´ xξ, zyu
“ ´ supzPRm txξ, zy ´ F pzqu ´ supxPRN txx,´A˚ξy ´ G pxqu
“ ´F ˚pξq ´ G˚p´A˚ξq, (67)
where F ˚ and G˚ are the convex conjugate functions of F and G ,respectively. Therefore, the dual problem of (65) is
maxξPRm ´F ˚pξq ´ G˚p´A˚ξq. (68)
A typical general form of convex optimization problem in data analysis
Theorem
Let A P RmˆN and F : Rm Ñ p´8,8s, G : RN Ñ p´8,8s be proper convex functions such that either dompF q “ Rm or dompG q “ RN and there exists x such that Ax P dompF q. Assume that the optima in (65) and (68) are attained. Then strong duality holds in the form
minxPRN F pAxq ` G pxq “ maxξPRm ´F ˚pξq ´ G˚p´A˚ξq.
Furthermore, a primal-dual optimum px7, ξ7q is a solution to the saddle point problem
minxPRN maxξPRm xAx, ξy ` G pxq ´ F ˚pξq, (69)
where F ˚ is the convex conjugate of F .
Gradient descent
Assume we are given a function f : RN Ñ R which we wish to minimize, finding x˚ “ arg minx f pxq. We know that a function locally grows according to its gradient ∇f pxq. Hence, if we move a bit from any point x in the direction opposite to the gradient, i.e., we pass to x ´ α∇f pxq for a small α ą 0, we may expect that f gets smaller; in other words, we are descending.
Gradient descent
Let us introduce the gradient descent algorithm by considering first a time-dependent gradient flow evolution (ordinary differential equation) of the type
9xptq “ ´∇f pxptqq xp0q “ x p0q. (70)
In the following we make the following assumption: ∇f pxq is Lipschitz continuous with constant L. If f is twice differentiable, this is achieved by asking that the largest eigenvalue of the Hessian is bounded by L (∇2f pxq ĺ LI ). But more directly, there must be an L such that

‖∇f pxq ´∇f pyq‖2 ď L‖x ´ y‖2, for all x , y P RN .
Co-coercivity lemma
Lemma (Co-coercivity)
For a convex smooth function f whose gradient has Lipschitzconstant L ą 0, it holds
‖∇f pxq ´∇f pyq‖22 ď Lxx ´ y ,∇f pxq ´∇f pyqy.
Convergence of the gradient flow
Theorem
Assume f P C2pRNq is strongly convex, i.e., ∇2f pxq ľ γ2I for γ ą 0, and has bounded Hessian, i.e., ∇2f pxq ĺ LI . Then the evolution defined by (70) always converges to the minimizer x˚ “ arg minxPRN f pxq.
The gradient descent iteration
A natural time discretization of the gradient flow is given by theso-called explicit Euler scheme
x pn`1q “ x pnq ´ αpnq∇f px pnqq for x p0q given. (71)
Here the parameters αpnq are typically chosen adaptively with a line search. However, a constant choice αpnq “ α, sufficiently small, will also suffice. Let us now show a convergence result for the discrete algorithm (71).
The gradient descent iteration
Let us recall once again that the convexity assumption on f tells us that f is lower bounded by an affine function,

f pyq ě f pxq `∇f pxqT py ´ xq for all x , y P RN . (72)
If ∇f pxq is Lipschitz continuous with constant L then f is additionally upper bounded by a quadratic function (this is not obvious and is left as an exercise),

f pyq ď f pxq `∇f pxqT py ´ xq ` pL{2q‖y ´ x‖22 . (73)
The gradient descent iteration
Theorem
Assume that f is convex and finite for all x, a finite solution x˚ exists, and ∇f pxq is Lipschitz continuous with constant L. Then the iteration (71) with α “ αpnq ď 1{L converges to x˚.
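The iteration (71) with the constant step size α “ 1{L can be sketched on a strongly convex quadratic, where the gradient and the Lipschitz constant are explicit (the matrix Q and vector b below are our own toy data).

```python
# Gradient descent x^(n+1) = x^(n) - alpha * grad f(x^(n)) with alpha = 1/L
# on f(x) = 0.5*<x, Qx> - <b, x>; here grad f(x) = Qx - b and L is the
# largest eigenvalue of the symmetric positive definite matrix Q.
import numpy as np

Q = np.array([[3.0, 1.0], [1.0, 2.0]])
b = np.array([1.0, -1.0])

grad = lambda x: Q @ x - b
L = np.linalg.eigvalsh(Q).max()
alpha = 1.0 / L

x = np.zeros(2)
for _ in range(500):
    x = x - alpha * grad(x)

x_star = np.linalg.solve(Q, b)   # exact minimizer solves Qx = b
assert np.allclose(x, x_star)
```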
Stochastic gradient descent
The gradient descent algorithm is simple to implement and widely used for machine learning purposes. However, it requires the ability of computing the gradient ∇f pxq at any point very efficiently. This is not always the case. In fact, often in machine learning the objective function to be minimized is of the type
f pxq “ Ef px , ¨q “ żΩ f px , ωq dPpωq,
where f px , ¨q : Ω Ñ R are all random variables over an appropriate probability space pΩ,Σ,Pq. This would imply that
∇f pxq “ żΩ ∇f px , ωq dPpωq,
and the computation of ∇f pxq would require the computation of ∇f px , ωq for all the events ω P Ω. Often Ω represents the training data in supervised problems, and it can be extremely large.
Stochastic gradient descent
For this particular setting there is a way out, which stems from injecting again a randomized sampling to reduce the complexity: at each iteration n we pick ωpnq at random, independently, and instead of considering ∇f px pnqq in a gradient descent iteration, we consider ∇x f px pnq, ωpnqq:
x pn`1q “ x pnq ´ α∇x f px pnq, ωpnqq for x p0q given. (74)
Of course, no two runs of this algorithm are the same, as it depends on the sequence of randomized samples ωpnq. However, the average behavior of this algorithm can be estimated with elementary methods.
Stochastic gradient descent
Theorem
Let each f p¨, ωq be convex, where ∇x f px , ωq has Lipschitz constant Lpωq ą 0 with sup L :“ supωPΩ Lpωq ă `8, and let f pxq “ Ef px , ¨q be γ-strongly convex. Set σ2 “ E‖∇x f px˚, ¨q‖22, where x˚ “ arg minx f pxq. Suppose that α ă 1{ sup L. Then the stochastic gradient descent iteration satisfies

E‖x pnq ´ x˚‖22 ď p1´ 2αγp1´ α sup Lqqn ‖x p0q ´ x˚‖22 ` ασ2{pγp1´ α sup Lqq, (75)

where the expectation is over the sequence of sampling pωpnqqnPN.
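The iteration (74) can be sketched for a least-squares objective, where Ω is a finite training set of m samples and each ωpnq picks one row. The data below are synthetic and noiseless (our own toy instance), so σ2 “ 0 at the minimizer and the iterates converge to x_true.

```python
# Stochastic gradient descent (74) for f(x) = (1/m) * sum_i 0.5*(<a_i, x> - y_i)^2:
# at each step a single term, drawn uniformly at random, supplies the gradient.
import numpy as np

rng = np.random.default_rng(0)
m, N = 200, 3
A = rng.standard_normal((m, N))
x_true = np.array([1.0, -2.0, 0.5])
y = A @ x_true                       # noiseless data: the minimizer is x_true

alpha = 0.01
x = np.zeros(N)
for n in range(20000):
    i = rng.integers(m)              # sample omega^(n)
    g = (A[i] @ x - y[i]) * A[i]     # gradient of the sampled term
    x = x - alpha * g

assert np.linalg.norm(x - x_true) < 1e-3
```

With noisy data the error would instead saturate at the variance-dependent level ασ2{pγp1´ α sup Lqq from (75).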
Stochastic gradient descent
If we are given a desired tolerance E‖x pnq ´ x˚‖22 ď ε to be achieved, and we knew the Lipschitz constants and the parameter of strong convexity, we may optimize the step-size α and obtain:
Corollary
For any desired ε ą 0, using a step-size of
α “ γε{p2εγ sup L` 2σ2q
we have that after
n ě 2 logp2ε0{εq psup L{γ ` σ2{pγ2εqq , (76)
the stochastic gradient descent iterations fulfill
E‖x pnq ´ x˚‖22 ď ε,

where ε0 “ ‖x p0q ´ x˚‖22.
A Primal Dual Algorithm
Many of the most common tasks in data analysis and machine learning can often be approached by solving optimization problems of the form

minxPCN F pAxq ` G pxq (77)
with A being an m ˆ N matrix and F ,G being convex functions with values in p´8,8s. We already investigated the corresponding dual problem, given by

maxξPCm ´F ˚pξq ´ G˚p´A˚ξq, (78)
where

F ˚pξq “ suptRexz , ξy ´ F pzq : z P Cmu
is the convex (Fenchel) conjugate of F , and likewise G˚ is theconvex conjugate of G . The joint primal-dual optimization isequivalent to solving the saddle point problem
minxPCN maxξPCm RexAx , ξy ` G pxq ´ F ˚pξq. (79)
A Primal Dual Algorithm
The primal dual algorithm introduced by Chambolle and Pock and described below solves this problem iteratively. In order to formulate it, one needs to introduce the so-called proximal mapping (proximity operator). For the convex function G it is defined as

PG pτ ; zq :“ arg minxPCN tτG pxq ` p1{2q‖x ´ z‖22u .
One also requires below the proximal mapping associated with the convex conjugate F ˚.
A Primal Dual Algorithm
The primal dual algorithm performs an iteration on the dual variable, the primal variable and an auxiliary primal variable. Starting from initial points x0 P CN , x̄0 “ x0, ξ0 P Cm and parameters τ, σ ą 0, θ P r0, 1s, one iteratively computes

ξn`1 :“ PF˚pσ; ξn ` σAx̄nq, (80)

xn`1 :“ PG pτ ; xn ´ τA˚ξn`1q, (81)

x̄n`1 :“ xn`1 ` θpxn`1 ´ xnq. (82)
It can be shown that px , ξq is a fixed point of these iterations if and only if px , ξq is a saddle point.
A Primal Dual Algorithm
For the case that θ “ 1, the convergence of the primal dual algorithm has been established:

Theorem
Consider the primal dual algorithm (80)-(82) with θ “ 1 and τ, σ ą 0 such that τσ‖A‖2 ă 1. If the problem (79) has a saddle point, then the sequence pxn, ξnq converges to a saddle point of (79), and in particular pxnq converges to a minimizer of (77).
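The iteration (80)-(82) can be sketched on the model problem min p1{2q‖Ax ´ y‖22 ` λ‖x‖1, i.e. F pzq “ p1{2q‖z ´ y‖22 and G “ λ‖¨‖1, for which both proximal mappings are explicit; the instance below (our own choice, with A “ I so that the solution is the soft thresholding Sλpyq) makes the result easy to verify.

```python
# Chambolle-Pock primal dual iteration (80)-(82) with theta = 1 for
# min_x 0.5*||Ax - y||^2 + lam*||x||_1.  Here prox of sigma*F* is
# v |-> (v - sigma*y)/(1 + sigma), and prox of tau*G is soft thresholding.
import numpy as np

def soft(v, t):
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

A = np.eye(2)
y = np.array([2.0, -0.3])
lam = 0.5

tau = sigma = 0.99 / np.linalg.norm(A, 2)   # ensures tau*sigma*||A||^2 < 1
theta = 1.0

x = np.zeros(2)
x_bar = x.copy()
xi = np.zeros(2)
for _ in range(20000):
    xi = (xi + sigma * (A @ x_bar) - sigma * y) / (1.0 + sigma)  # prox of sigma*F*
    x_new = soft(x - tau * (A.T @ xi), tau * lam)                # prox of tau*G
    x_bar = x_new + theta * (x_new - x)                          # extrapolation
    x = x_new

assert np.allclose(x, soft(y, lam), atol=1e-5)   # solution is S_lam(y)
```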
Nonconvex optimization: convex relaxation
Two approaches. The first is determining an equivalent convex optimization problem, which can be tackled, e.g., with the methods we described above.

[Figure: a nonconvex function f and a convex envelope g ď f from below.]
If we were interested in solving an optimization problem

minxPK f pxq,

where f is a nonconvex, lower semicontinuous function, and K is a closed convex set, it might be convenient to recast the problem by considering its convexification, i.e.,

minxPK f̄ pxq,

where f̄ is called the convex relaxation or the convex envelope of f and is given by

f̄ pxq :“ sup tgpxq : g is a convex function with g ď f u.
Nonconvex optimization: convex relaxation
The motivation of this choice is simply geometrical. While f can have many minimizers on K , its convex envelope f̄ has global minimizers, and such global minimizers are likely to be in a neighborhood of a global minimizer of f . Actually, if K is compact and f is smooth enough, then the global minima of f and f̄ must necessarily coincide. We show concrete examples of convex relaxation of nonconvex problems, also nonsmooth ones related to sparsity optimization, in following lectures on compressed sensing.

Unfortunately, the precise computation of f̄ might in some cases be a very difficult problem, and one might need to consider a second option.
Nonconvex optimization: graduated nonconvexity or sequential quadratic programming

If the nonconvex objective function is sufficiently smooth, then it is locally convexifiable, and one can use this property to solve the nonconvex optimization by using so-called graduated nonconvexity or sequential quadratic programming.
From now on, we will make the following more specific assumptions about the function f :
(A1) f is ω-semi-convex, that is, there exists ω ą 0 such that f p¨q ` ω‖¨‖2 is convex;
(A2) the subdifferential of f satisfies the following growth condition: there exist two nonnegative numbers K , L such that, for every v P RN and ξ P Bf pvq,

‖ξ‖ ď Kf pvq ` L . (83)
Nonconvex optimization: graduated nonconvexity or sequential quadratic programming
Given ω ą 0 and u P RN we will denote

fω,upvq :“ f pvq ` ω‖v ´ u‖2 . (84)
We observe that, if f satisfies condition (A1), we can always assume that ω is chosen in such a way that fω,u is also γ-strongly convex, with γ depending on f and ω but not on u. Analogously, if (A1) and (A2) are satisfied, it is easy to see that fω,u satisfies (83) with two constants K , L depending again on f and ω, but not on u. We now present a typical algorithm for constrained nonsmooth and nonconvex minimization, based on successive convex approximations:
Nonconvex optimization: graduated nonconvexity or sequential quadratic programming
We pick an initial v p0q P RN . Notice that there is no restriction to any specific neighborhood for the choice of the initial iterate. We define the successive iterations by:
v p``1q “ arg minvPK fω,v p`qpvq. (85)
Here, thanks to condition (A1), ω is chosen in such a way that fω,v p`q is γ-strongly convex, so the minimization executed at each iteration has a unique solution and can be performed by one's favorite convex optimization method/algorithm.
Under appropriate assumptions on f one can prove that the iterates v p`q produced by the algorithm (85) converge to critical points of the nonconvex function f .
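The iteration (85) can be sketched in one dimension for the double-well function f pvq “ pv2 ´ 1q2, which satisfies (A1) since f 2pvq “ 12v2 ´ 4 ě ´4; we take ω “ 3 (our choice) so that each surrogate fω,v p`q is strongly convex. The inner convex minimization is carried out by a simple grid search over K “ r´2, 2s, standing in for one's favorite convex solver.

```python
# Successive convex approximation (85): v^(l+1) = argmin_{v in K} f(v) + omega*(v - v^(l))^2
# for the double-well f(v) = (v^2 - 1)^2, omega = 3 (so the surrogate is strongly convex).

f = lambda v: (v * v - 1.0) ** 2
omega = 3.0
grid = [-2.0 + 4.0 * k / 4000 for k in range(4001)]   # K = [-2, 2], step 1e-3

def step(v_prev):
    # minimize the strongly convex surrogate over K (grid search for simplicity)
    return min(grid, key=lambda v: f(v) + omega * (v - v_prev) ** 2)

v = 0.5                    # initial point; no neighborhood restriction needed
for _ in range(100):
    v = step(v)

print(v)                   # approaches the critical point v = 1 of f
```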