

MATHEMATICAL FOUNDATIONS OF MACHINE LEARNING

(19ANMAG469P1, FALL TERM 2019-2020)

HÔNG VÂN LÊ ∗

Contents

1. Basic mathematical problems in machine learning
1.1. Learning, inductive learning and machine learning
1.2. A brief history of machine learning
1.3. Main tasks of current machine learning
1.4. Main types of machine learning
1.5. Basic mathematical problems in machine learning
1.6. Conclusion
2. Mathematical models for supervised learning
2.1. Discriminative model of supervised learning
2.2. Generative model of supervised learning
2.3. Empirical Risk Minimization and overfitting
2.4. Conclusion
3. Mathematical models for unsupervised learning
3.1. Mathematical models for density estimation
3.2. Mathematical models for clustering
3.3. Mathematical models for dimension reduction and manifold learning
3.4. Conclusion
4. Mathematical model for reinforcement learning
4.1. Knowledge to accumulate in reinforcement learning
4.2. Markov decision process
4.3. Existence and uniqueness of the optimal policy
4.4. Conclusion
5. The Fisher metric and maximum likelihood estimator
5.1. The space of all probability measures and total variation norm
5.2. The Fisher metric on a statistical model
5.3. The Fisher metric, MSE and Cramér-Rao inequality
5.4. Efficient estimators and MLE
5.5. Consistency of MLE
5.6. Conclusion

Date: February 9, 2020.
∗ Institute of Mathematics of ASCR, Zitna 25, 11567 Praha 1, email: hvle@math.cas.cz.



6. Consistency of a learning algorithm
6.1. Consistent learning algorithm and its sample complexity
6.2. Uniformly consistent learning and VC-dimension
6.3. Fundamental theorem of binary classification
6.4. Conclusions
7. Generalization ability of a learning machine
7.1. Covering number and sample complexity
7.2. Rademacher complexities and sample complexity
7.3. Model selection
7.4. Undecidability of learnability
7.5. Conclusion
8. Support vector machines
8.1. Linear classifier and hard SVM
8.2. Soft SVM
8.3. Sample complexities of SVM
8.4. Conclusion
9. Kernel based SVMs
9.1. Kernel trick
9.2. PSD kernels and reproducing kernel Hilbert spaces
9.3. Kernel based SVMs and their generalization ability
9.4. Conclusion
10. Neural networks
10.1. Neural networks as a model of computation
10.2. The expressive power of neural networks
10.3. Sample complexities of neural networks
10.4. Conclusion
11. Training neural networks
11.1. Gradient and subgradient descend
11.2. Stochastic gradient descend (SGD)
11.3. Online gradient descend and online learnability
11.4. Conclusion
12. Bayesian machine learning
12.1. Bayesian concept of learning
12.2. Applications of Bayesian machine learning
12.3. Consistency
12.4. Conclusion
Appendix A. Some basic notions in probability theory
A.1. Dominating measures and the Radon-Nikodym theorem
A.2. Conditional expectation and regular conditional measure
A.3. Joint distribution, regular conditional probability and Bayes' theorem
A.4. Probabilistic morphism and regular conditional probability
A.5. The Kolmogorov theorem
Appendix B. Law of large numbers and concentration-of-measure inequalities


B.1. Laws of large numbers
B.2. Markov's inequality
B.3. Chebyshev's inequality
B.4. Hoeffding's inequality
B.5. Bernstein's inequality
B.6. McDiarmid's inequality
References

It is not knowledge, but the act of learning ... which grants the greatest enjoyment.

Carl Friedrich Gauss

Machine learning is an interdisciplinary field in the intersection of mathematical statistics and computer science. Machine learning studies statistical models and algorithms for deriving predictors, or meaningful patterns, from empirical data. Machine learning techniques are applied in search engines, speech recognition and natural language processing, image detection, robotics, etc. In our course we address the following questions: What is the mathematical model of learning? How to quantify the difficulty/hardness/complexity of a learning problem? How to choose a learning model and learning algorithm? How to measure the success of machine learning?

The syllabus of our course:
1. Supervised learning, unsupervised learning
2. Generalization ability of machine learning
3. Support vector machine, kernel machine
4. Neural networks and deep learning
5. Bayesian machine learning and Bayesian networks.

Recommended Literature.
1. S. Shalev-Shwartz and S. Ben-David, Understanding Machine Learning: From Theory to Algorithms, Cambridge University Press, 2014.
2. S. Theodoridis, Machine Learning: A Bayesian and Optimization Perspective, Elsevier, 2015.
3. M. Mohri, A. Rostamizadeh, A. Talwalkar, Foundations of Machine Learning, MIT Press, 2012.

1. Basic mathematical problems in machine learning

Machine learning is the foundation of countless important applications, including speech recognition, image detection, self-driving cars and many more, which I shall discuss today in my lecture. Machine learning techniques are developed using many mathematical theories. In my lecture course I shall explain the mathematical model of machine learning and how we design a machine that shall learn successfully.


In today's lecture I shall discuss the following questions:
1. What are learning, inductive learning and machine learning?
2. History of machine learning and artificial intelligence.
3. Current tasks and main types of machine learning.
4. Basic mathematical problems in machine learning.

1.1. Learning, inductive learning and machine learning. To start our discussion on machine learning let us begin with the notion of learning. Every one of us knows what learning is from our experiences at a very early age.

(a) Small children learn to speak by observing, repeating and mimicking adults' phrases. At the beginning their language is very simple and often erroneous. Gradually they speak freely with fewer and fewer mistakes. Their way of learning is inductive learning: from examples of words and phrases they learn the rules of combination of these words and phrases into meaningful sentences.

(b) In school we learn mathematics, physics, biology, chemistry by following the instructions of our teachers and those in textbooks. We learn general rules and apply them to particular cases. This type of learning is deductive learning. Of course we also learn inductively in school, by searching for similar patterns in new problems and then applying the most appropriate methods, possibly with modifications, to solve the problem.

(c) Experimental physicists design experiments and observe the outcomes of the experiments to validate/support or dispute/refute a statement/conjecture on the nature of the observables. In other words, experimental physicists learn about the dependence of certain features of the observables from empirical data, which are outcomes of the experiments. This type of learning is inductive learning.

In the mathematical theory of machine learning, or more generally, in the mathematical theory of learning, we consider only inductive learning. Deductive learning is not very interesting; essentially it is equivalent to performing a set of computations using a finite set of rules and a knowledge database. Classical computer programs learn or gain some new information by deductive learning.

Let me suggest a definition of learning that will be updated later to become more and more precise.

Definition 1.1. Learning is a process of gaining new knowledge, more precisely, new correlations of features of an observable, by examination of empirical data of the observable. Furthermore, a learning is successful if the correlations can be tested in examination of new data and become more precise with the increase of data.

The above definition is an expansion of Vapnik's mathematical postulation: "Learning is a problem of function estimation on the basis of empirical data".


Example 1.2. A classical example of learning is that of learning a physical law by curve fitting to data. In mathematical terms, a physical law is expressed by a function f, and data are the values y_i of f at observable points x_i. The goal of learning in this case is to estimate the unknown f from a set of pairs (x_1, y_1), · · · , (x_m, y_m). Usually we assume that f belongs to a finite dimensional family of functions of the observable points x. For instance, f is assumed to be a polynomial of degree d over R, i.e., we can write

$f = f_w(x) := \sum_{j=0}^{d} w_j x^j$ where $w = (w_0, \cdots, w_d) \in \mathbb{R}^{d+1}$.

In this case, to estimate f = f_w is the same as to estimate the parameter w of f_w, observing the data (x_i, y_i). The most popular method of curve fitting is the least square method, which quantifies the error R(w) of the estimation of the parameter w in terms of the value

(1.1) $R(w) := \sum_{i=1}^{m} (f_w(x_i) - y_i)^2$

which the desired function f should minimize. If the measurements generating the data (x_i, y_i) were exact, then f(x_i) would be equal to y_i and the learning problem would be an interpolation problem. But in general one expects the values y_i to be affected by noise.

The least square technique, going back to Gauss and Legendre 1, which is computationally efficient and relies on numerical linear algebra, solves this minimization problem.
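For illustration, here is a minimal numerical sketch of the least square method for polynomial curve fitting as in Example 1.2. It assumes Python with numpy; the synthetic data, the noise level and the degree d = 3 are illustrative choices, not prescribed by the text.

```python
import numpy as np

# Synthetic data (x_i, y_i): noisy observations of an unknown function
# (the target function and the noise level are illustrative assumptions).
rng = np.random.default_rng(0)
x = rng.uniform(-1.0, 1.0, size=30)
y = np.sin(np.pi * x) + 0.1 * rng.normal(size=x.shape)

d = 3                                        # degree of the polynomial model f_w
V = np.vander(x, d + 1, increasing=True)     # design matrix with columns x^0, ..., x^d

# Minimize R(w) = sum_i (f_w(x_i) - y_i)^2 by numerical linear algebra (least squares).
w, _, _, _ = np.linalg.lstsq(V, y, rcond=None)

f_w = lambda t: np.vander(np.atleast_1d(t), d + 1, increasing=True) @ w
print("estimated parameter w =", w)
print("empirical error R(w) =", np.sum((V @ w - y) ** 2))
```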

In the case of measurement noise, which is the reality according to quantum physics, we need to use the language of probability theory to model the noise and therefore to use tools of mathematical statistics in learning theory. That is why statistical learning theory is an important part of machine learning theory.

1.2. A brief history of machine learning. Machine learning was born as a domain of artificial intelligence and it was reorganized as a separate field only in the 1990s. Below I recall several important events when the concept of machine learning was discussed by famous mathematicians and computer scientists.
• In 1948 John von Neumann suggested that a machine can do anything that people are able to do. 2

1 The least-squares method is usually credited to Carl Friedrich Gauss (1809), but it was first published by Adrien-Marie Legendre (1805).

2 According to Jaynes [Jaynes2003, p. 8], who attended the talk by J. von Neumann on computers given in Princeton in 1948, J. von Neumann replied to the question from the audience "But of course, a mere machine can't really think, can it?" as follows: "You insist that there is something a machine cannot do. If you will tell me precisely what it is that a machine cannot do, then I can always make a machine which will do just that".


• In 1950 Alan Turing asked "Can machines think?" in "Computing Machinery and Intelligence" and proposed the famous Turing test. The Turing test is carried out as an imitation game. On one side of a computer screen sits a human judge, whose job is to chat to an unknown gamer on the other side. Most of those gamers will be humans; one will be a chatbot with the purpose of tricking the judge into thinking that it is the real human.
• In 1956 John McCarthy coined the term "artificial intelligence".
• In 1959, Arthur Samuel, the American pioneer in the field of computer gaming and artificial intelligence, defined machine learning as a field of study that gives computers the ability to learn without being explicitly programmed. The Samuel Checkers-playing Program appears to be the world's first self-learning program, and as such a very early demonstration of the fundamental concept of artificial intelligence (AI).

In the early days of AI, statistical and probabilistic methods were employed. Perceptrons, which are simple models used in statistics, were used for classification problems in machine learning. Perceptrons were later developed into more complicated neural networks. Because of many theoretical problems, and because of the small capacity of hardware memory and the slow speed of computers, statistical methods fell out of favour. By 1980, expert systems, which were based on knowledge databases, and inductive logic programming had come to dominate AI. Neural networks returned to machine learning with success in the mid-1980s with the reinvention of a new algorithm, called back-propagation, and thanks to the increasing speed of computers and increasing hardware memory.

Machine learning, reorganized as a separate field, started to flourish in the 1990s. The current trend benefits from the Internet.

In the book by Russell and Norvig "Artificial Intelligence: A Modern Approach" (2010) AI encompasses the following domains:
- natural language processing,
- knowledge representation,
- automated reasoning, to use the stored information to answer questions and to draw new conclusions,
- machine learning, to adapt to new circumstances and to detect and extrapolate patterns,
- computer vision, to perceive objects,
- robotics.

All the domains of artificial intelligence listed above except robotics are now considered domains of machine learning.
• Representation learning and feature engineering are part of machine learning, replacing the field of knowledge representation.
• Pattern detection and recognition become a part of machine learning.
• Robotics becomes a combination of machine learning and mechatronics.
Why did such a move from artificial intelligence to machine learning happen?


The answer is that we are able to formalize most concepts and model problems of artificial intelligence using mathematical language and to represent as well as unify them in such a way that we can apply mathematical methods to solve many problems in terms of algorithms that machines are able to perform.

As a final remark on the history of machine learning I would like to note that data science, much hyped in 2018, has the same goal as machine learning: data science seeks actionable and consistent patterns for predictive uses. 3

Now I shall briefly describe which tasks current machine learning can perform and how they do them.

1.3. Main tasks of current machine learning.
• Classification task is the construction of a mapping from a set of items we are interested in to a smaller countable set of other items, which are regarded as possible "features" of the items in the domain set, using given data of the values of the desired mapping at given items. For example, document classification may assign items features such as politics, email spam, sports, or weather, while image classification may assign items categories such as landscape, portrait, or animal. Items in the target set are also called categories in the machine learning community and we shall use this terminology from now on. The number of categories in such tasks can be unbounded, as in OCR, text classification, or speech recognition.

As we have remarked in the classical example of learning (Example 1.2), usually we have ambiguous/incorrect measurements and we have to add "noise" to our measurement model. If everything were exact, the classification task would be the classical interpolation problem in mathematics: the features are the values of an unknown function on the set of items, which we need to construct knowing its values at given items using given data. (Don't give your data away for free on the internet!)

• Regression task is the construction of a real valued function on the set of items, using given data of the values of the desired function at given items. Examples of regression tasks include learning a physical law by curve fitting to data (Example 1.2), with applications to predictions of stock values or variations of economic variables. In this problem, the error of the prediction, which is also called the error of the estimation in Example 1.2, depends on the magnitude of the distance between the true and predicted values, in contrast with the classification problem, where there is typically no notion of closeness between various categories. As in the classification task, in regression problems we also need to take into account "noise" from incorrect measurements. 4

3 According to Dhar, V. (2013). "Data science and prediction". Communications of the ACM, 56 (12): 64. doi:10.1145/2500499; see also the wiki site on data science.



• Density estimation task is the construction (or estimation) of the distribution of observed items. In other words, given a sample space (X, Σ_X), where Σ_X denotes the σ-algebra of a measurable space X, and a finite number of observed items x_1, · · · , x_l ∈ X, usually assumed to be i.i.d. for the sake of simplicity, such that the x_i are subject to an unknown probability measure p_u on (X, Σ_X), we need to estimate p_u. This is one of the basic problems in statistics. Over one hundred years ago Karl Pearson (1857-1936), the founder of modern statistics, 5 proposed that all observations come from some probability distribution and the purpose of science is to estimate the parameters of these distributions, assuming that they belong to a known finite dimensional family S of probability measures on (X, Σ_X). The density estimation problem was proposed by Ronald Fisher (1890-1962), the father of modern statistics and experiment design, 6 as a key element of his simplification of statistical theory, namely he assumed the existence of a density function p(ξ) that governs the randomness of the problem of interest.

Digression. A measure ν on (X, Σ_X) is called dominated by a measure µ on (X, Σ_X) (or absolutely continuous with respect to µ) if ν(A) = 0 for every set A ∈ Σ_X with µ(A) = 0. Notation: ν ≪ µ. By the Radon-Nikodym theorem, see Appendix, Subsection A.1, we can write

ν = f · µ

where f ∈ L^1(X, µ) is the density function of ν w.r.t. µ.
For example, the multivariate Gaussian (family of) probability distributions, also denoted by MVN (multivariate normal), is a multi-dimensional generalization of the 2-dimensional family of normal distributions. It is an n(n+3)/2-dimensional parameterized statistical model of dominated measures on R^n, whose density function N is given in (1.2) below.

4 The term "regression" was coined by Francis Galton in the nineteenth century to describe a biological phenomenon. The phenomenon was that the heights of descendants of tall ancestors tend to regress down towards a normal average (a phenomenon also known as regression toward the mean of a population). For Galton, regression had only this biological meaning, but his work was later extended by Udny Yule and Karl Pearson to a more general statistical context: movement toward the mean of a statistical population. Galton's method of investigation was non-standard at that time: first he collected the data, then he guessed the relationship model of the events.

5 He founded the world's first university statistics department at University College London in 1911, the Biometrical Society and Biometrika, the first journal of mathematical statistics and biometry.

6 Fisher introduced the main models of statistical inference in the unified framework of parametric statistics. He described different problems of estimating functions from given data (the problems of discriminant analysis, regression analysis, and density estimation) as the problems of parameter estimation of specific (parametric) models and suggested the maximum likelihood method for estimating the unknown parameters in all these models.


(1.2) $\mathcal{N}(x|\mu, \Sigma) = \frac{1}{(2\pi)^{n/2}|\Sigma|^{1/2}} \exp\Big(-\frac{1}{2}\,\Sigma^{-1}(x-\mu, x-\mu)\Big)$

Here the couple (µ, Σ) is the parameter of the family, where µ ∈ R^n and Σ is a positive definite quadratic form (symmetric bilinear form) on R^n. Using the canonical Euclidean metric on R^n, we identify R^n with its dual space (R^n)^*. Thus the inverse Σ^{-1} is regarded as a quadratic form on R^n and the norm |Σ| is also well-defined.

The classical problem of density estimation is the problem of estimation of the unknown probability measure p_u ∈ S, knowing observed data as formulated above, assuming that S consists of measures dominated by a measure µ. As in the classical example of learning by curve fitting (Example 1.2), usually we assume that the dominated measure family S is finite dimensional, e.g., S is an MVN family, whose density depends on the parameter (µ, Σ). In this case we need to estimate (µ, Σ) based on observables (x_1, · · · , x_m) ∈ X^m.
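As a small numerical companion to the MVN example, the sketch below evaluates the density (1.2) and estimates the parameter (µ, Σ) from i.i.d. observations by the sample mean and sample covariance (the maximum likelihood estimates). It assumes Python with numpy; the true parameters used to simulate the data are illustrative choices.

```python
import numpy as np

def mvn_density(x, mu, Sigma):
    """Density N(x | mu, Sigma) as in (1.2), for a positive definite Sigma."""
    n = mu.shape[0]
    diff = x - mu
    quad = diff @ np.linalg.solve(Sigma, diff)          # Sigma^{-1}(x - mu, x - mu)
    norm = (2 * np.pi) ** (n / 2) * np.sqrt(np.linalg.det(Sigma))
    return np.exp(-0.5 * quad) / norm

# Estimate (mu, Sigma) from observations x_1, ..., x_m in R^2 (simulated here).
rng = np.random.default_rng(1)
true_mu = np.array([1.0, -2.0])
true_Sigma = np.array([[2.0, 0.3], [0.3, 0.5]])
sample = rng.multivariate_normal(true_mu, true_Sigma, size=500)

mu_hat = sample.mean(axis=0)                            # sample mean
Sigma_hat = np.cov(sample, rowvar=False, bias=True)     # maximum likelihood covariance
print(mu_hat, Sigma_hat, mvn_density(np.zeros(2), mu_hat, Sigma_hat))
```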

• Ranking task orders observed items according to some criterion. Web search, e.g., returning web pages relevant to a search query, is a typical ranking example. If the number of ranks is finite, then this task is close to the classification problem, but not the same, since in the ranking task we need to specify each rank during the task and not before the task as in the classification problem.

• Clustering task partitions items into (homogeneous) regions. Clustering is often performed to analyze very large data sets. Clustering is one of the most widely used techniques for exploratory data analysis. In all disciplines, from social sciences to biology to computer science, people try to get a first intuition about their data by identifying meaningful groups among the data points. For example, computational biologists cluster genes on the basis of similarities in their expression in different experiments; retailers cluster customers, on the basis of their customer profiles, for the purpose of targeted marketing; and astronomers cluster stars on the basis of their spatial proximity.

• Dimensionality reduction or manifold learning transforms an initial representation of items in a high dimensional space into a representation of the items in a space of lower dimension while preserving some properties of the initial representation. A common example involves pre-processing digital images in computer vision tasks. Many dimension reduction techniques are linear. When the technique is non-linear, we speak about a manifold learning technique. We can regard clustering as dimension reduction too.
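To make the dimensionality reduction task concrete, here is a minimal sketch of one common linear technique, principal component analysis via the singular value decomposition. It assumes Python with numpy; the synthetic data and the target dimension k = 2 are illustrative choices, and PCA is named here only as an example of a linear technique, not prescribed by the text.

```python
import numpy as np

rng = np.random.default_rng(6)
X = rng.normal(size=(200, 50))            # 200 items represented in a 50-dimensional space

k = 2                                     # target dimension of the new representation
Xc = X - X.mean(axis=0)                   # center the data
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
Z = Xc @ Vt[:k].T                         # k-dimensional representation (leading components)
print(Z.shape)                            # (200, 2)
```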


1.4. Main types of machine learning. The type of a machine learning task is defined by the type of interaction between the learner and the environment. In particular, we consider the types of training data, i.e., the data available to the learner before making decisions and predictions, and the type of the outcomes.

Main types of machine learning are supervised, unsupervised and reinforcement learning.
• In supervised learning training data are labelled, i.e., each item in the training data set consists of the pair of a known instance and its feature, also called a label. Examples of labeled data are emails that are labeled "spam" or "no spam" and medical histories that are labeled with the occurrence or absence of a certain disease. In these cases the learner's output would be a spam filter and a diagnostic program, respectively. Most classification and regression problems of machine learning belong to supervised learning. We also interpret a learning machine in supervised learning as a student who gives his supervisor a known instance and the supervisor answers with the known feature.

training data and the task is to describe structure of data. Density estima-tion, clustering and dimensionality reduction are examples of unsupervisedlearning problems. Regarding a learning machine in unsupervised learningas a student, then the student has to learn by himself without teacher. Thislearning is harder but happens more often in life. At the current time, unsu-pervised learning is most exciting part of machine learning where many newdeep mathematical methods, e.g., persistent homology and other topologi-cal methods, are invented to described the structure of data. Furthermore,many tasks in supervised learning can be reduced to the density estimationproblem, which is unsupervised learning.

Remark 1.3. A machine learning task can be performed using supervised learning or unsupervised learning or some intermediate type between supervised learning and unsupervised learning. For instance, let us consider the problem of anomaly detection, i.e., the task of finding samples that are "inconsistent" with the rest of the data. Unsupervised anomaly detection techniques detect anomalies in an unlabeled test data set under the assumption that the majority of the instances in the data set are normal, by looking for instances that seem to fit least to the remainder of the data set. Unsupervised anomaly detection can be regarded as a clustering task. Supervised anomaly detection techniques require a data set that has been labeled as "normal" and "abnormal" and involve training a classifier.

• Reinforcement learning is the type of machine learning where a learner actively interacts with the environment to achieve a certain goal. For example, reinforcement learning is used in self-driving cars and chess play. In reinforcement learning the learner collects information through a course of actions by interacting with the environment. This active interaction justifies


the terminology of an agent used to refer to the learner. The achievement of the agent's goal is typically measured by the reward it receives from the environment and which it seeks to maximize. In reinforcement learning the training data are not labeled, i.e., the supervisor does not directly give answers to the student's questions. Instead, the supervisor evaluates the student's behavior and gives feedback about it in terms of rewards.

1.5. Basic mathematical problems in machine learning. Let me recall that learning is a process of gaining knowledge on a feature of an observable by examination of partially available data. The learning is successful if we can make a prediction on unseen data which improves as we have more data. For example, in a classification problem, the learning machine has to predict the category of a new item from a specific set, after seeing a lot of labeled data consisting of items and their categories. The classification task is a typical task in supervised learning, where we can explain how and why a learning machine works and how and why a machine learns successfully. Mathematical foundations of machine learning aim to answer these questions in mathematical language.

Question 1.4. How to find a mathematical model of learning?

To answer Question 1.4 we need to specify our definition of learning in a mathematical language which can be used to build instructions for machines. A model can be quite abstract for a large class of problems, and we also want to build a specific model for a specific problem. The search for an optimal model of a specific problem is called model selection. It can be formalized as a task of machine learning we discussed above: model selection can be a supervised learning task or an unsupervised learning task.

Question 1.5. How to quantify the difficulty/complexity of a learning problem?

We quantify the difficulty of a problem in terms of its time complexity, which is the minimum time needed for a computer program to solve the problem, and in terms of its resource complexity, which measures the capacity of data storage and the energy resource needed to solve the problem. If the complexity of a problem is very large then we cannot learn it. So Question 1.5 contains the sub-question "why can we learn a problem?"

Question 1.6. How to choose a learning algorithm?

Clearly we want to have the best learning algorithm, once we know a model of machine learning which specifies the set of possible predictors (decisions) and the associated error/reward function.

By Definition 1.1, a learning process is successful if its prediction/estimation improves with the increase of data. Thus the notion of success of a learning process requires a mathematical treatment of the asymptotic rate of error/reward in the presence of the complexity of the problem.


1.6. Conclusion. Machine learning is automatized learning, whose performance improves with increasing volume of empirical data. In machine learning we use mathematical statistics to model incomplete information and the random nature of the observed data; in other words, we build mathematical/statistical models for problems whose solutions are learning algorithms. To build mathematical models for problems in machine learning we need to specify the type of the problem we want to solve and the type of available data, which result in the type of machine learning model, which we also call the type of machine learning.

One of the main problems of mathematical foundations of machine learning concerns the quantification of the difficulty/complexity of a learning problem. This problem is ultimately related to the question of learnability of a learning problem, i.e., whether a learning problem admits a successful learning algorithm in the sense of Definition 1.1. We shall discuss in our course several characterizations of learnability. Recently S. Ben-David, P. Hrubes, S. Moran, A. Shpilka and A. Yehudayoff proved that the question of learnability of a learning problem is not always decidable [BHMSY2019].

2. Mathematical models for supervised learning

What I cannot create, I do not understand.
Richard Feynman

Last week we discussed the concept of learning and examined several examples. Today I shall specify the concept of learning by presenting basic mathematical models of supervised learning. A model for machine learning must be able to make predictions and improve its ability to make predictions in light of new data.

The model of supervised learning I present today is based on Vapnik's statistical learning theory, which starts from the following concise concept of learning.

Definition 2.1. ([Vapnik2000, p. 17]) Learning is a problem of function estimation on the basis of empirical data.

There are two main model types for machine learning: discriminative models and generative models. They are distinguished by the type of functions we want to estimate for understanding the feature of an observable.

2.1. Discriminative model of supervised learning. Let us consider the following toy example of a binary classification task, which, like regression tasks (Example 1.2), is a typical example of a task in supervised learning.

Example 2.2 (Toy example). We are working for a dermatological clinic and we want to develop a program for prediction of a certain type of skin disease by examination of image scans of the skin condition of patients.


First we need to define the set of essential parameters of the skin image. For example: the area of the spot in trouble, which has values in an interval I_1 ⊂ R; the thickness of the skin spot in trouble, which has values in an interval I_2 ⊂ R; the color of the skin spot in trouble, which can be quantified by a real number in an interval I_3; and the age of the patient, which can be divided into five age groups A_1, A_2, A_3, A_4, A_5. We want to construct a function f : X := ∪_{i=1}^{5} I_1 × I_2 × I_3 × A_i → Y := {±1}, whose value predicts the "Yes" or "No" possibility of the type of disease. Such a function f can be identified with the subset f^{-1}(1) ⊂ X. Our function f shall depend on the data at time n, i.e., depend on the set S_n consisting of n parameters x_1, · · · , x_n ∈ X with given "labels" f(x_1), · · · , f(x_n). Thus our function should be denoted by f_{S_n}, and the success of our construction of f_{S_n} means that as n grows the value f_{S_n}(y) on new data y ∈ X gets closer to the true value f(y).

A discriminative model of supervised learning consists of the following components.
• A domain set X (also called an input space) consists of elements whose features we would like to learn. Elements x ∈ X are called random inputs (or random instances), which are distributed by an unknown probability measure µ_X. In other words, the probability that x belongs to a subset A ⊂ X is µ_X(A). In general we don't know the distribution µ_X.

In Example 2.2 of a classification task, the domain set is also denoted by X in the example. In Example 1.2 of learning a physical law, a regression task, the domain set X is the domain R.

• An output space Y, also called a label set, consists of the possible features (also called labels) y of inputs x ∈ X. We are interested in finding a predictor/mapping h : X → Y such that a feature of x is h(x). If such a mapping h exists and is measurable, the feature h(x) is distributed by the measure h_*(µ_X). In general such a function does not exist, and we assume that there exists only a probability measure µ_{X×Y} on the space X × Y that defines the probability that y is a feature of x, i.e., the probability of (x, y) ∈ A ⊂ X × Y being a labeled pair is equal to µ_{X×Y}(A). In general we don't know µ_{X×Y}.

In Example 2.2 the label set is also denoted by Y. In Example 1.2 of learning a physical law the label set is the set R of all possible values of f(x).

• A training data is a sequence S = ((x_1, y_1), · · · , (x_n, y_n)) ∈ (X × Y)^n of observed labeled pairs, which are usually assumed to be i.i.d. (independently identically distributed). In this case S is distributed by the product measure µ^n_{X×Y} on (X × Y)^n. The number n is called the size of S. The training data S is thought of as given by a "supervisor".

• A hypothesis space H ⊂ Y^X of possible predictors h : X → Y.


For instance, in Example 2.2 we may wish to choose H consisting of subsets A_i ⊂ X_i := I_1 × I_2 × I_3 × A_i, i ∈ {1, 2, 3, 4, 5}, where A_i is a cuboid in I_1 × I_2 × I_3.

In Example 1.2 we already let H := {h : R → R | h is a polynomial of degree at most d} ≅ R^{d+1}.

• The aim of a learner is to find a best prediction rule A that assigns to a training data S a prediction h_S ∈ H. In other words, the learner needs to find a rule, more precisely, an algorithm

(2.1) $A : \bigcup_{n \in \mathbb{N}} (\mathcal{X} \times \mathcal{Y})^n \to \mathcal{H}, \quad S \mapsto h_S,$

such that h_S(x) predicts the label of an (unseen) instance x with the least error.

• The error function, also called a risk function, measures the discrepancy between a hypothesis h ∈ H and an ideal predictor. The error function is a central notion in learning theory. The error function, denoted by R : H → R, is usually defined as the averaged discrepancy of h(x) and y, where (x, y) runs over X × Y, see also (3.14) for an alternative definition. The averaging is calculated using the probability measure µ := µ_{X×Y} that governs the distribution of labeled pairs (x, y). Thus a risk function R must depend on µ, so we denote it by R_µ. It is accepted that the risk function R_µ : H → R is defined as follows:

(2.2) $R^L_\mu(h) := \int_{\mathcal{X}\times\mathcal{Y}} L(x, y, h)\, d\mu$

where L : X × Y × H → R is an instantaneous loss function that measures the discrepancy between the value of a prediction/hypothesis h at x and the possible feature y:

(2.3) $L(x, y, h) := d(y, h(x)).$

Here d : Y × Y → R_{≥0} is a non-negative function that vanishes on the diagonal {(y, y) | y ∈ Y} of Y × Y. For example, d(y, y') = |y − y'|^2. By taking the average over X × Y using µ, we effectively count only the points (x, y) which are correlated as labeled pairs.

Note that the expected risk function is well defined on H only if L(x, y, h) ∈ L^1(X × Y, µ) for all h ∈ H.

• The main question of learning theory is to find necessary and sufficient conditions for the existence of a prediction rule A in (2.1) such that, as the size of S goes to infinity, the error of h_S converges to the error of an ideal predictor, or more precisely, to the infimum of the errors of h ∈ H, and then to construct such a prediction rule A.


We summarize our discussion in the following

Definition 2.3. A discriminative model of supervised learning consists of a quintuple (X, Y, H, L, P_{X×Y}) where X is a measurable input space, Y is a measurable output space, H ⊂ Y^X is a hypothesis space, L : X × Y × H → R is an instantaneous loss function and P_{X×Y} ⊂ P(X × Y) is a statistical model. An optimal predictor h_{S_n} ∈ H should minimize the risk function R^L_µ based on empirical labeled data S_n ∈ (X × Y)^n that have been distributed by an unknown measure µ^n, where µ ∈ P_{X×Y}.
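To make the components of Definition 2.3 concrete, the sketch below instantiates a toy discriminative model with X = [0, 1], Y = {0, 1}, a class H of threshold classifiers and the 0-1 loss, and approximates the expected risk R^L_µ(h) by a Monte Carlo average. Assuming that µ is known and can be sampled is an illustrative simplification made only for this sketch (in the learning problem µ is unknown); Python with numpy is assumed.

```python
import numpy as np

rng = np.random.default_rng(2)

def sample_mu(n):
    """Draw labeled pairs (x, y) from a toy measure mu with P(y = 1 | x) = 0.8 x."""
    x = rng.uniform(0.0, 1.0, size=n)
    y = (rng.uniform(size=n) < 0.8 * x).astype(int)
    return x, y

def h(t, x):
    """Hypothesis class H: threshold classifiers h_t(x) = 1_{x > t}."""
    return (x > t).astype(int)

def loss_01(y, y_pred):
    """Instantaneous 0-1 loss L(x, y, h) = 1 - delta_{y, h(x)}."""
    return (y != y_pred).astype(float)

def risk(t, n=100_000):
    """Monte Carlo approximation of the expected risk R^L_mu(h_t) in (2.2)."""
    x, y = sample_mu(n)
    return loss_01(y, h(t, x)).mean()

for t in (0.25, 0.5, 0.75):
    print(f"approximate risk of h_{t}: {risk(t):.3f}")
```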

Remark 2.4. (1) In our discriminative model of supervised learning we model the random nature of the training data S ∈ (X × Y)^n via a probability measure µ^n on (X × Y)^n, where µ = µ_{X×Y} ∈ P_{X×Y}. Thus we assume implicitly that the training data are i.i.d. In general, the distribution µ_{X×Y} of labeled pairs (x, y) does not correlate with the distribution µ_X of x ∈ X. The main difficulty in the search for the best prediction rule A is that we don't know µ; we know only training data S distributed by µ^n.

(2) Note that the expected risk function is well defined on H only if L(x, y, h) ∈ L^1(X × Y, µ) for all h ∈ H. We don't know µ exactly, but based on our prior knowledge, we know that µ ∈ P_{X×Y}. Thus we should assume that L ∈ L^1(X × Y, ν) for any ν ∈ P_{X×Y}.

(3) The quasi-distance function d : Y × Y → R induces a quasi-distance function d_n : Y^n × Y^n → R as follows:

(2.4) $d_n([y_1, \cdots, y_n], [y'_1, \cdots, y'_n]) = \sum_{i=1}^{n} d(y_i, y'_i),$

and therefore it induces the expected loss function $R^{L(d_n)}_{\mu^n} : \mathcal{H} \to \mathbb{R}$ as follows:

(2.5) $R^{L(d_n)}_{\mu^n}(h) = \int_{(\mathcal{X}\times\mathcal{Y})^n} d_n([y_1, \cdots, y_n], [h(x_1), \cdots, h(x_n)])\, d\mu^n = n \int_{\mathcal{X}\times\mathcal{Y}} L(x, y, h)\, d\mu,$

where the last equality holds because µ^n is a product measure, so each of the n summands in (2.4) contributes the same integral ∫_{X×Y} L(x, y, h) dµ.

Thus it suffices to consider only the loss function $R^{L(d)}_{\mu}(h)$, if S is a sequence of i.i.d. observables.
(4) Now we show that the classical case of learning a physical law by fitting to data, assuming exact measurement, is a "classical limit" of our discriminative model of supervised learning. In the classical learning problem, since we know the exact position S := ((x_1, y_1), · · · , (x_n, y_n)) ∈ (X × Y)^n, we assign the Dirac probability measure µ_{δ_S} := δ_{x_1,y_1} × · · · × δ_{x_n,y_n} to the space (X × Y)^n. 7 Now let d(y, y') = |y − y'|^2. Then we have

(2.6) $R^{L(d_n)}_{\mu_{\delta_S}}(h) = \sum_{i=1}^{n} |h(x_i) - y_i|^2.$

7 If (x_1, y_1), · · · , (x_n, y_n) are i.i.d., then the measure $\mu_{S_n} := \frac{1}{n}\sum_{i=1}^{n} \delta_{(x_i, y_i)} \in \mathcal{P}(\mathcal{X}\times\mathcal{Y})$ is called the empirical measure defined by the data S_n.


Note that the RHS of (2.6) coincides with the error of estimation in (1.1), i.e., our empirical risk function $R^{L(d_n)}_{\mu_{\delta_S}}$ defined on H = R^{d+1} coincides with the error function R in (1.1).

Example 2.5 (0-1 loss). Let H be the set Y^X of all mappings X → Y. The 0-1 instantaneous loss function L^{(0-1)} : X × Y × H → {0, 1} is defined as follows:

$L^{(0-1)}(x, y, h) := d(y, h(x)) = 1 - \delta_{y, h(x)}.$

The corresponding expected 0-1 loss determines the probability of the answer h(x) that does not correlate with x:

(2.7) $R^{(0-1)}_{\mu_{\mathcal{X}\times\mathcal{Y}}}(h) = \mu_{\mathcal{X}\times\mathcal{Y}}\{(x, y) \in \mathcal{X}\times\mathcal{Y} \,|\, h(x) \neq y\} = 1 - \mu_{\mathcal{X}\times\mathcal{Y}}\{(x, h(x)) \,|\, x \in \mathcal{X}\}.$

Example 2.6. Assume that x ∈ X is distributed by a probability measure µ_X and its feature y is defined by y = h(x), where h : X → Y is a measurable mapping. Denote by Γ_h : X → X × Y, x ↦ (x, h(x)), the graph of h. Then (x, y) is distributed by the push-forward measure µ_h := (Γ_h)_*(µ_X), where

(2.8) $(\Gamma_h)_*\mu_{\mathcal{X}}(A) = \mu_{\mathcal{X}}\big(\Gamma_h^{-1}(A)\big) = \mu_{\mathcal{X}}\big(\Gamma_h^{-1}(A \cap \Gamma_h(\mathcal{X}))\big).$

Let us compute the expected 0-1 loss function for a mapping f ∈ H := Y^X w.r.t. the measure µ_h. By (2.7) and by (2.8), we have

(2.9) $R^{(0-1)}_{\mu_h}(f) = 1 - \mu_{\mathcal{X}}(\{x \,|\, f(x) = h(x)\}).$

Hence $R^{(0-1)}_{\mu_h}(f) = 0$ iff f = h µ_X-a.e.

2.2. Generative model of supervised learning. In many cases a discriminative model of supervised learning may not yield a successful learning algorithm because the hypothesis space H is too small and cannot approximate a desired prediction for a feature y ∈ Y of an instance x ∈ X with a satisfying accuracy, i.e., the optimal performance error of the class H,

(2.10) $R^L_{\mu,\mathcal{H}} := \inf_{h \in \mathcal{H}} R^L_\mu(h),$

which represents the optimal performance of a learner using H, is quite large.
One possible reason for this failure is that a feature y ∈ Y of x cannot be accurately approximated by any function h : X → Y. Recall that in general we have only a correlation between x and its feature y, which is expressed in terms of a measure µ_{X×Y} on X × Y, but this measure cannot always be represented in terms of (Γ_h)_*µ_X for some mapping h : X → Y and µ_X ∈ P(X), see the exercise below.

Exercise 2.7. Let X = R^n and Y = R^m be Euclidean spaces. The observed feature y ∈ Y of an input x ∈ X satisfies the following relation: |y|^2 + |x|^2 = 1, where |·| denotes the Euclidean norm on Y and X respectively. Denote by S^{n+m-1} the unit sphere in the product space X × Y = R^{n+m}. Then y is a feature of x if and only if (x, y) ∈ S^{n+m-1}. Since S^{n+m-1} ∈ Σ_{X×Y} = B(R^{n+m}), this correlation is expressed via a measure µ_S ∈ P(X × Y) with support in S^{n+m-1},


i.e., for any D ∈ Σ_{X×Y} we have µ_S(D) = µ_S(D ∩ S^{n+m-1}). Let µ^0_S be the uniform measure on S^{n+m-1}, i.e., for any g ∈ SO(n+m) we have g_*(µ^0_S) = µ^0_S. Prove that µ^0_S cannot be represented as a graph measure (Γ_h)_*µ_X for any µ_X ∈ P(X).

Hint. Let sppt(µ_X) denote the support of the measure µ_X, i.e.,

sppt(µ_X) := {x ∈ X | µ_X(U_x) > 0 for every open neighborhood U_x ∋ x}.

Show that sppt((Γ_h)_*µ) = Γ_h(sppt(µ)).

Therefore we may wish to estimate the probability that an element y ∈ Y is a feature of x. This is expressed in terms of the conditional probability P(y ∈ B|x), the probability that a feature y of x ∈ X belongs to a subset B ∈ Σ_Y.

Digression. Conditional probability is one of the most basic concepts in probability theory. In general we always have prior information before making a decision, e.g. before estimating the probability of a future event. The conditional probability P(A|B) formalizes the probability of an event A given the knowledge that event B happens. Here we assume that A, B are elements of the sigma-algebra Σ_X of a probability space (X, Σ_X, P). If X is countable, the concept of conditional probability can be defined straightforwardly:

(2.11) $P(A|B) := \frac{P(A \cap B)}{P(B)}.$

It is not hard to see that, given B, the conditional probability P(·|B) defined in (2.11) is a probability measure on X, A ↦ P(A|B), which is called the conditional probability measure given B. In turn, by taking integration over X using the conditional probability P(·|B), we obtain the notion of conditional expectation given B, which shall be denoted by E_P(·|B), see Subsection A.2.2.

In the general case when X is not countable, the definition of conditional probability is more subtle, especially when we have to define P(A|B) where P(B) = 0. A typical situation is the case B = h^{-1}(z_0), where h : X → Z is a measurable mapping and h_*(P)(z_0) = 0. To treat this important case we need to define first the notion of conditional expectation, see Subsection A.2 in the Appendix.

Remark 2.8. (1) We may also wish to estimate the joint distribution µ := µ_{X×Y} of i.i.d. labeled pairs (x, y). By Formula (A.14) the knowledge of a joint distribution µ_{X×Y} implies the knowledge of the conditional probability measure µ_{Y|X}(B|x); and conversely, by (A.12) the joint distribution µ_{X×Y} can be recovered from the conditional probability µ_{Y|X}(·|x), given the marginal probability measure µ_X ∈ P(X).

(2) The knowledge of a joint distribution µ_{X×Y} leads to a solution in discriminative models (X, Y, H, L, µ_{X×Y}) as follows. First, knowing µ := µ_{X×Y}, we compute the expected risk R^L_µ for an instantaneous loss function


L. Then we determine a minimizing sequence h_i ∈ H of R^L_µ, i.e.,

$\lim_{i \to \infty} R^L_\mu(h_i) = R^L_{\mu,\mathcal{H}}.$

Hence generative models give more information than discriminative models.
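The passage between a joint distribution and a conditional probability in Remark 2.8 (1) is easy to see on a finite sample space. The sketch below, assuming Python with numpy and purely illustrative numbers, computes the conditional probabilities from a joint table and recovers the joint table from the conditional together with the marginal on X.

```python
import numpy as np

# A joint distribution mu on a finite X x Y (rows: x in {0,1,2}, columns: y in {0,1}).
joint = np.array([[0.10, 0.20],
                  [0.25, 0.05],
                  [0.15, 0.25]])

marginal_x = joint.sum(axis=1)                    # marginal measure of x
conditional = joint / marginal_x[:, None]         # mu(y | x), cf. (2.11)

# Recover the joint distribution from the conditional and the marginal on X.
recovered = conditional * marginal_x[:, None]
print(np.allclose(recovered, joint))              # True
```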

• The Bayes principle states that if the probability distribution µ of labeled pairs is known, then the optimal predictor h_µ can be expressed explicitly.

Example 2.9 (The Bayes Optimal Predictor). Let us consider a discriminative model of supervised learning (X, Y = {0, 1}, H = Y^X, L^{(0-1)}, P_{X×Y} = {µ}), where X is a discrete set 8. Denote by Π_X the projection X × Y → X. For such a model we define the Bayes Optimal Predictor f_µ ∈ H by

$f_\mu(x) = \begin{cases} 1 & \text{if } \mu^{\sigma(\Pi_{\mathcal{X}})}(y = 1|x) \ge 1/2,\\ 0 & \text{otherwise.}\end{cases}$

Here µ^{σ(Π_X)}(y = 1|x) is the measure of the measurable subset {(x, 1) | x ∈ X} ⊂ X × Y under the conditioning Π_X(x, 1) = x, see (A.6) and Remark A.6. We claim that the Bayes Optimal Predictor f_µ is optimal. In other words, for every classifier g ∈ Y^X we have

(2.12) $R^{(0-1)}_\mu(f_\mu) \le R^{(0-1)}_\mu(g).$

To prove our claim, note that for any h ∈ Y^X we have

(2.13) $R^{(0-1)}_\mu(h) \overset{(2.7)}{=} \mathbb{E}_\mu(|h(x) - y|) \overset{(A.9)}{=} \int_{\mathcal{X}\times\mathcal{Y}} \mathbb{E}^{\sigma(\Pi_{\mathcal{X}})}_\mu(|h(x) - y|)\, d\mu.$

By (2.13), it suffices to prove that for all x ∈ X we have

(2.14) $\mathbb{E}_\mu(|h(x) - y|\,|x) \ge \mathbb{E}_\mu(|f_\mu(x) - y|\,|x) = \mu^{\sigma(\Pi_{\mathcal{X}})}\{f_\mu(x) \neq y\,|x\}.$

To prove (2.14), we write

(2.15) $\mathbb{E}_\mu(|h(x) - y|\,|x) = h(x)\,\mu^{\sigma(\Pi_{\mathcal{X}})}(y = 0|x) + (1 - h(x))\,\big(1 - \mu^{\sigma(\Pi_{\mathcal{X}})}(y = 0|x)\big),$

and notice that the RHS of (2.15) is at least equal to $\min\{\mu^{\sigma(\Pi_{\mathcal{X}})}(y = 0|x),\ 1 - \mu^{\sigma(\Pi_{\mathcal{X}})}(y = 0|x)\} = \mu^{\sigma(\Pi_{\mathcal{X}})}\{f_\mu(x) \neq y\,|x\}$. This proves (2.14) and hence our claim.
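The optimality claim (2.12) can also be checked numerically on a small discrete X. The sketch below, assuming Python with numpy and an illustrative known measure µ given by a marginal on X and the conditional probabilities µ(y = 1|x), compares the Bayes Optimal Predictor with every classifier in Y^X.

```python
import numpy as np

# A known measure mu on a finite X x {0, 1} (illustrative assumption):
# marginal of x and conditional probabilities eta(x) = mu(y = 1 | x).
eta = np.array([0.1, 0.4, 0.5, 0.7, 0.9])
px = np.full(5, 0.2)

def risk01(h):
    """Expected 0-1 risk (2.7): sum over x of px(x) * mu(h(x) != y | x)."""
    return np.sum(px * np.where(h == 1, 1.0 - eta, eta))

f_bayes = (eta >= 0.5).astype(int)        # the Bayes Optimal Predictor of Example 2.9

# Check (2.12): no classifier g in Y^X has smaller risk than f_bayes.
risks = [risk01(np.array(g)) for g in np.ndindex(2, 2, 2, 2, 2)]
print("risk of the Bayes predictor:", risk01(f_bayes))
print("minimum risk over all 2^5 classifiers:", min(risks))
```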

Exercise 2.10 (Regression optimal Bayesian estimator). In the regression problem the output space Y is R. Let us define the following embeddings

$i_1 : \mathbb{R}^{\mathcal{X}} \to \mathbb{R}^{\mathcal{X}\times\mathcal{Y}},\ [i_1(f)](x, y) := f(x),$
$i_2 : \mathbb{R}^{\mathcal{Y}} \to \mathbb{R}^{\mathcal{X}\times\mathcal{Y}},\ [i_2(f)](x, y) := f(y).$

8 If X is not discrete, then the Bayes optimal predictor depends on the choice of a representative of the conditional measure µ^{σ(Π_X)}(y = 1) representing a function (Π_X)^µ_*(1_{(y=1)}) ∈ L^1(X, (Π_X)^µ_*).


(These embeddings are adjoint to the projections Π_X : X × R → X and Π_R : X × R → R.) For a given probability measure µ on X × R we set

$L^2(\mathcal{X}, (\Pi_{\mathcal{X}})_*\mu) = \{f \in \mathbb{R}^{\mathcal{X}} \,|\, i_1(f) \in L^2(\mathcal{X}\times\mathbb{R}, \mu)\},$
$L^2(\mathbb{R}, (\Pi_{\mathbb{R}})_*\mu) = \{f \in \mathbb{R}^{\mathbb{R}} \,|\, i_2(f) \in L^2(\mathcal{X}\times\mathbb{R}, \mu)\}.$

Now we let F := L^2(X, (Π_X)_*µ) ⊂ Y^X. Let Y denote the identity function on R, i.e., Y(y) = y. Assume that Y ∈ L^2(R, (Π_R)_*µ) and define the quadratic loss function L : X × Y × F → R,

(2.16) $L(x, y, h) := |y - h(x)|^2,$

(2.17) $R^L_\mu(h) = \mathbb{E}_\mu\big(|Y(y) - h(x)|^2\big) = |i_2(Y) - i_1(h)|^2_{L^2(\mathcal{X}\times\mathbb{R}, \mu)}.$

The expected risk R^L_µ is called the L^2-risk, also known as the mean squared error (MSE). Show that the regression function $r(x) := \mathbb{E}_\mu\big(i_2(Y)\,|\,X = x\big)$ belongs to F and minimizes the L^2(µ)-risk.

Hint. Establish the following formula for the L^2-risk:

(2.18) $R_\mu(h) = \mathbb{E}_\mu(|i_2(Y) - i_1(r)|^2) + \mathbb{E}_\mu(|i_1(h) - i_1(r)|^2).$
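One way to obtain (2.18) is the standard orthogonality argument; the following short computation, which uses only the tower property of conditional expectation, records the key step. Expanding the square,

$$R_\mu(h) = \mathbb{E}_\mu\big(|i_2(Y)-i_1(r)|^2\big) + \mathbb{E}_\mu\big(|i_1(h)-i_1(r)|^2\big) + 2\,\mathbb{E}_\mu\big((i_2(Y)-i_1(r))(i_1(r)-i_1(h))\big),$$

and the cross term vanishes because, conditioning on x, the factor $r(x) - h(x)$ is constant while $\mathbb{E}_\mu\big(i_2(Y)-i_1(r)\,|\,x\big) = \mathbb{E}_\mu(Y|X=x) - r(x) = 0$ by the definition of the regression function.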

• Generative models give us more complete information about the correlation between a feature y and an instance x. Often they are more complicated, since even in the case of a regular conditional probability we should interpret a probability measure p(B|x) as a probabilistic mapping from X to Y, equivalently as a measurable mapping p : X → P(Y), see e.g. [JLLT2019]. Since any measurable function is a probabilistic mapping, discriminative models of supervised learning are a particular case of generative models of supervised learning. This approximation of a generative model by a discriminative model agrees with Fisher's suggestion that a probabilistic mapping p : X ⇝ R can be written as a function f : X → R up to a white noise, i.e.,

p(x) = f(x) + ξ, where E_{µ_u} ξ = 0.

Here ξ : X × R → R is a measurable map, and µ_u ∈ P(X × R) is the unknown probability measure that governs the distribution of the i.i.d. observables (x_i, y_i) ∈ X × R.

Practically we always assume that conditional probability measures are regular. A further simplification is to assume that the family of regular conditional measures p(x) ∈ P(Y) is dominated by a measure µ_0 ∈ P(Y), and hence we need to estimate the density function p(y|x) := dp(x)/dµ_0, understood as a representative of the equivalence class of the function dp(x)/dµ_0 ∈ L^1(Y, µ_0).

This simplified/classical setting of a generative model (of supervised learning) is a discriminative model of supervised learning.


Definition 2.11. A (simplified/classical) generative model of supervised learning is a quintuple (X, Y, H, L, µ_0 ∈ P(Y)) where X and Y are measurable spaces, H is a family of density functions p(y|x) and L is a loss function.

Example 2.12. Let us revisit Example 2.2 of prediction of a skin disease. A generative model in this case considers the probability p(Y|x) of the skin disease (recall that "Y" means "yes" for the disease), given a skin image x ∈ X = ∪_{i=1}^{5} I_1 × I_2 × I_3 × A_i. Clearly p(Y|x) can be regarded as a function X → [0, 1]. If we know the values p(Y|x_i) for the training data (x_1, · · · , x_n), the generative model for supervised learning in this case is a model for a regression task in supervised learning. If we don't know the values p(Y|x_i), this problem becomes a density estimation problem, so it is unsupervised machine learning.

We shall learn about popular generative models of supervised learning, called Bayesian learning models, in the last lecture.

2.3. Empirical Risk Minimization and overfitting. In a discriminative model of supervised learning our aim is to construct a prediction rule A that assigns a predictor h_S to each sequence

S = ((x_1, y_1), · · · , (x_n, y_n)) ∈ (X × Y)^n

of i.i.d. labeled data such that the expected error R^L_µ(h_S) tends to the optimal performance error R^L_{µ,H} of the class H. One of the most popular ways to find a prediction rule A is to use Empirical Risk Minimization.

For a loss function L : X × Y × H → R and a training data S ∈ (X × Y)^n we define the empirical risk of a predictor h as follows:

(2.19) $R^L_S(h) := \frac{1}{n}\sum_{i=1}^{n} L(x_i, y_i, h).$

If L is fixed, then we also omit the superscript L.
The empirical risk is a function of two variables: the "empirical data" S and the predictor h. Given S, a learner can compute R_S(h) for any function h : X → Y. A minimizer of the empirical risk should also "approximately" minimize the expected risk. This is the empirical risk minimization principle, abbreviated as ERM.
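The following sketch illustrates the ERM principle on a toy binary problem with a finite class of threshold classifiers and the 0-1 loss, assuming Python with numpy; the simulated data, the label noise and the grid of thresholds are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(4)

# Training data S = (x_1, y_1), ..., (x_n, y_n), i.i.d. from an (here simulated) measure mu.
n = 200
x = rng.uniform(0.0, 1.0, size=n)
y = (x > 0.6).astype(int)
y = np.where(rng.uniform(size=n) < 0.1, 1 - y, y)       # 10% label noise

thresholds = np.linspace(0.0, 1.0, 101)                 # finite class H of h_t(x) = 1_{x > t}

def empirical_risk(t):
    """R_S(h_t) = (1/n) sum_i L(x_i, y_i, h_t) with the 0-1 loss, cf. (2.19)."""
    return np.mean((x > t).astype(int) != y)

t_hat = min(thresholds, key=empirical_risk)             # the ERM output h_S
print("ERM threshold:", t_hat, "training error:", empirical_risk(t_hat))
```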

Remark 2.13. We note that

(2.20) $R^L_S(h) = R^L_{\mu_S}(h)$

where µ_S is the empirical measure on X × Y associated to S, cf. (2.6). If h is fixed, then by the weak law of large numbers, see e.g. Proposition B.2 in the Appendix, the RHS of (2.20) converges in probability to the expected risk


R^L_µ(h), so we could hope to find a condition under which the RHS of (2.20), for a sequence of h_S instead of a fixed h, converges to R^L_{µ,H}.

Example 2.14. In this example we shall show the failure of ERM in certain cases. The 0-1 empirical risk corresponding to the 0-1 loss function L : X × Y × Y^X → {0, 1} is defined as follows:

(2.21) $R^{0-1}_S(h) := \frac{|\{i \in [n] : h(x_i) \neq y_i\}|}{n}$

for a training data S = ((x_1, y_1), · · · , (x_n, y_n)) and a function h : X → Y. We also often call R^{0-1}_S(h) the training error or the empirical error.
Now we assume that the labeled data (x, y) are generated by a map f : X → Y, i.e., y = f(x), and furthermore, x is distributed by a measure µ_X on X as in Example 2.6. Then (x, f(x)) is distributed by the measure µ_f = (Γ_f)_*(µ_X).

Let H = Y^X. Then f ∈ H and R^{0-1}_{µ_f}(f) = 0. For any given ε > 0 and any n we shall find a map f, a measure µ_X, and a predictor h_{S_n} such that R^{0-1}_{S_n}(h_{S_n}) = 0 and R^{0-1}_{µ_f}(h_{S_n}) = ε, which shall imply that the ERM is invalid in this case.
Set

(2.22) $h_{S_n}(x) = \begin{cases} f(x_i) & \text{if there exists } i \in [n] \text{ s.t. } x_i = x,\\ 0 & \text{otherwise.}\end{cases}$

Clearly R^{0-1}_{S_n}(h_{S_n}) = 0. We also note that h_{S_n}(x) = 0 except at finitely many (at most n) points x in X.

Let X be the unit cube I^k in R^k and Y = Z_2. Let µ_X be the Lebesgue measure on I^k, k ≥ 1. We decompose X into a disjoint union of two measurable subsets A_1 and A_2 such that µ_X(A_1) = ε. Let f : X → Z_2 be equal to 1_{A_1}, the indicator function of A_1. By (2.7) we have

(2.23) $R_{\mu_f}(h_{S_n}) = \mu_{\mathcal{X}}(\{x \in \mathcal{X} \,|\, h_{S_n}(x) \neq 1_{A_1}(x)\}).$

Since h_{S_n}(x) = 0 a.e. on X, it follows from (2.23) that

$R_{\mu_f}(h_{S_n}) = \mu_{\mathcal{X}}(A_1) = \varepsilon.$

Such a predictor h_{S_n} is said to be overfitting, i.e., it fits well the training data but not real life.
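The overfitting phenomenon in Example 2.14 can be reproduced numerically. The sketch below, assuming Python with numpy, takes X = [0, 1] with the Lebesgue measure, f = 1_{[0, ε]} and the memorizing predictor (2.22); its training error is 0 while an estimate of its expected 0-1 risk on fresh samples is close to ε.

```python
import numpy as np

rng = np.random.default_rng(5)
eps, n = 0.2, 100

f = lambda x: (x <= eps).astype(int)          # f = indicator function of A_1 = [0, eps]
x_train = rng.uniform(size=n)
y_train = f(x_train)

def h_memorize(x):
    """The predictor (2.22): the training label on training points, 0 elsewhere."""
    x = np.atleast_1d(x)
    seen = np.isin(x, x_train)
    out = np.zeros(x.shape, dtype=int)
    out[seen] = f(x[seen])        # on training points h agrees with f, since y_i = f(x_i)
    return out

train_error = np.mean(h_memorize(x_train) != y_train)      # empirical risk (2.21), equals 0
x_fresh = rng.uniform(size=200_000)                         # fresh samples from mu_X
true_error = np.mean(h_memorize(x_fresh) != f(x_fresh))     # approximates R_{mu_f}(h_{S_n})
print(train_error, true_error)                              # ~0.0 and ~eps
```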

Exercise 2.15 (Empirical risk minimization). Let X × Y = R^d × R and F := {h : X → Y | ∃v ∈ R^d : h(x) = ⟨v, x⟩} be the class of linear functions in Y^X. For S = ((x_i, y_i))_{i=1}^{n} ∈ (X × Y)^n and the quadratic loss L (defined in (2.16)), find the hypothesis h_S ∈ F that minimizes the empirical risk R^L_S.

The phenomenon of overfitting suggests the following questions concerning the ERM principle [Vapnik2000, p. 21]:
1) Can we learn in a discriminative model of supervised learning using the ERM principle?
2) If we can learn, we would like to know the rate of convergence of the learning process as well as construction methods of learning algorithms.


We shall address these questions later in our course and recommend the books by Vapnik on statistical learning theory for further reading.

2.4. Conclusion. A discriminative model for supervised learning is a quintuple (X, Y, H, L, P_{X×Y}) where H ⊂ Y^X. In this setting the aim of a learner is to find a prediction rule A : S ↦ h_S ∈ H for any sequence S of i.i.d. labeled training data such that h_S approximates the labeling of the data best, i.e., the expected error R^L_µ(h_S) is as small as possible. The ERM principle suggests that we choose h_S to be a minimizer of the empirical risk R_S instead of the unknown function R^L_µ, and we hope that, as the size of S increases, the expected error R^L_µ(h_S) converges to the optimal performance error R^L_{µ,H}. Without further conditions on H and L the ERM principle does not work.

The (simplified/classical) generative model of supervised learning is a quintuple (X, Y, H, L, µ_0 ∈ P(Y)) where H is the set of possible conditional density functions p(y|x). In this setting the learner aims to find a best prediction rule A : S ↦ h_S ∈ H based on training data ((x_1, y_1), · · · , (x_n, y_n)).

Both discriminative models and generative models for supervised learning can be unified in the same framework using the language of probabilistic mappings developed in [JLLT2019].

3. Mathematical models for unsupervised learning

Last week we discussed models (X, Y, H, L, P_{X×P(Y)}) of supervised learning where H ⊂ P(Y)^X is a subset of probabilistic mappings from X to Y. The error function L : X × Y × H → R is a central notion of learning theory that specifies the idea of a "best approximation" or "best predictor".9

Today we shall study statistical models of machine learning for several important tasks in unsupervised learning: density estimation, clustering, dimension reduction and manifold learning. The key problem is to specify the error function that measures the accuracy of an estimator or the fitness of a decision.

3.1. Mathematical models for density estimation. Recall that for a measurable space X we denote by P(X) the space of all probability measures on X. In a density estimation problem we are given a sequence of observables S_n = (x_1, · · · , x_n) ∈ X^n, which are i.i.d. by an unknown probability measure µ_u. We have to estimate the measure µ_u ∈ P(X) using S_n. Usually, having prior knowledge, we assume that µ_u belongs to a statistical model P_X ⊂ P(X). Simplifying further, we assume that P_X consists of probability measures that are dominated by a measure µ_0 ∈ P(X). Thus we regard P_X as a family of density functions on X. If P_X is finite dimensional, then estimating µ_u ∈ P_X is called a parametric problem of density estimation; otherwise it is called a nonparametric problem.

9 The notion of a loss function was introduced by A. Wald in his statistical decision theory [Wald1950].


Example 3.1. As we have noted in Remark 2.8, the density estimation problem encompasses the generative model (X, Y, H, L, µ_0 ∈ P(Y)) of supervised learning as a particular case, by regarding a labeled pair (x, y) as an instance in the product space X × Y. Of course this model for estimating the joint distribution µ_{X×Y} is very general, since we are not interested in the full structure of the relationship between x and y but only in the question: given x ∈ X, what is the probability µ(y|x) that y ∈ Y is a feature of x?

• In the parametric density estimation problem we assume that P_X is parameterized by a nice parameter set Θ, e.g., Θ is an open subset of R^n. That is, there exists a surjective map p : Θ → P_X, which is usually assumed to be continuous. If Θ is an open subset of R^n then Θ contains a countable dense set, hence, by [JLS2017, Proposition 2.3], the image p(Θ) = P_X is a dominated family of measures, i.e., there exists a measure µ_0 ∈ P(X) such that p(θ) = p_θ µ_0 for all θ ∈ Θ, where p_θ ∈ L^1(X, µ_0). Thus the probability measure p(θ) is specified by its density p_θ.

In this lecture we shall assume that Θ is an open subset of R^n and p is a continuous 1-1 map. This condition always holds in the classical theory of density estimation. Thus we shall identify P_X with Θ, and parametric density estimation in this case is equivalent to estimating the parameter θ ∈ Θ from the knowledge of S_n.

As in mathematical models for supervised learning, we define an expected risk function R_µ : Θ → R by averaging an instantaneous loss function

$L : \mathcal X \times \Theta \to \mathbb R.$

Usually the instantaneous loss function L is given by the minus log-likelihood function 10

(3.1) $L(x, \theta) = -\log p_\theta(x).$

Given the instantaneous loss function L in (3.1), the expected risk function

$R^L : \Theta \times P_{\mathcal X} \to \mathbb R$

is (minus) the expected log-likelihood function, cf. (2.2),

(3.2) $R^L(\theta, \mu_u) := R^L_{\mu_u}(\theta) = -\int_{\mathcal X} \log p_\theta(x)\, p_u(x)\, d\mu_0,$

where µ_u = p_u · µ_0 is the measure we have to estimate. Given a data S_n = (x_1, · · · , x_n) ∈ X^n, by (3.1) the corresponding empirical risk function is (up to the constant factor 1/n, which does not affect the minimizer)

(3.3) $R^L_{S_n}(\theta) := R^L_{\mu_{S_n}}(\theta) = -\sum_{i=1}^{n} \log p_\theta(x_i) = -\log\big[p^n_\theta(S_n)\big],$

10 Note that the log-likelihood function log p_θ(x) does not depend on the choice of a dominating measure µ_0, i.e., if we replace µ_0 by µ'_0 that also dominates p(θ), then $\log \frac{dp(\theta)}{d\mu_0} = \log \frac{dp(\theta)}{d\mu'_0}$.


where $p^n_\theta(S_n)$ is the density of the product measure $\mu^n_\theta$ on X^n. It follows that the minimizer θ of the empirical risk $R^L_{S_n} = R^L_{\mu_{S_n}}$ is the maximizer of the log-likelihood function $\log[p^n_\theta(S_n)]$, cf. (2.5).

According to the ERM principle, which is inspired by the weak law of large numbers, the minimizer θ of $R^L_{S_n}$ should provide an "approximation" of the density p_u of the unknown probability measure µ_u that governs the distribution of the i.i.d. instances x_i ∈ X.
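
The following sketch is an illustration under the assumption of a one-dimensional Gaussian family p_θ = N(θ_1, e^{2θ_2}); the parametrization is my choice, not fixed by the text. It shows that numerically minimizing the empirical risk (3.3) recovers the usual maximum likelihood estimates.

import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(2)
S_n = rng.normal(loc=1.5, scale=0.7, size=500)   # i.i.d. sample from the unknown mu_u

def empirical_risk(theta):
    # R^L_{S_n}(theta) = - sum_i log p_theta(x_i) for the Gaussian family
    mu, log_sigma = theta
    sigma = np.exp(log_sigma)
    return np.sum(0.5 * ((S_n - mu) / sigma) ** 2 + log_sigma + 0.5 * np.log(2 * np.pi))

res = minimize(empirical_risk, x0=np.array([0.0, 0.0]))
mu_hat, sigma_hat = res.x[0], np.exp(res.x[1])

# closed-form MLE for comparison: sample mean and (biased) sample standard deviation
print(mu_hat, sigma_hat)
print(S_n.mean(), S_n.std())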

Remark 3.2. (1) The instantaneous loss function L(x, y, h) in supervised learning measures the discrepancy between the value of a predictor at x ∈ X and a possible feature y, see (2.3). In a density estimation problem we do not have a separation of instances z into an input and a feature, and therefore it is harder to interpret the meaning of an instantaneous loss function L : X × Θ → R. In Theorem 5.19 we shall give an interpretation of the instantaneous loss function given by the minus log-likelihood function (3.1) using the notion of an efficient estimator and the Cramér-Rao inequality. In unsupervised learning, when one defines a risk function R : H × P_X → R, one has to justify that a minimizer h_min of the expected risk function R_µ(h) := R(h, µ) is the best approximation of the true hypothesis h we are searching for. In particular, the true hypothesis h must be a minimizer of the risk function R_µ. For example, one uses the Bretagnolle-Huber inequality

$\int |p(x, \alpha) - p(x, \alpha_0)|\, dx \;\le\; 2\sqrt{1 - \exp\big(R(\alpha_0) - R(\alpha)\big)},$

to show that the difference between the values of the risk function at two densities gives an upper bound for the distance between the densities, see [Vapnik1998, p. 30] for a detailed analysis.

(2) For X = R the ERM principle for the expected log-likelihood function holds. Namely, one can show that the minimum of the risk functional in (3.2), if it exists, is attained at a function p*_u which may differ from p_u only on a set of measure zero, see [Vapnik1998, p. 30] for a proof.

(3) Note that minimizing the expected minus log-likelihood function R_µ(θ) is the same as minimizing the following modified risk function [Vapnik2000, p. 32]

(3.4) $R^*_{\mu_u}(\theta) := R^L_{\mu_u}(\theta) + \int_{\mathcal X} \log p_u(x)\, p_u(x)\, d\mu_0 = -\int_{\mathcal X} \log\frac{p_\theta(x)}{p_u(x)}\, p_u(x)\, d\mu_0.$

The expression on the RHS of (3.4) is the Kullback-Leibler divergence KL(p_θµ_0 | µ_u) that is used in statistics for measuring the divergence between p_θµ_0 and µ_u = p_uµ_0. The Kullback-Leibler divergence KL(µ|µ') is defined for probability measures (µ, µ') ∈ P(X) × P(X) such that µ ≪ µ', see also Remark 5.9 below. It is a divergence, i.e., it satisfies the following properties:

(3.5) $KL(\mu|\mu') \ge 0 \quad\text{and}\quad KL(\mu|\mu') = 0 \text{ iff } \mu = \mu'.$


By (3.4), a minimizer of the expected risk function R^L_{µ_u}(θ) on Θ minimizes the KL-divergence KL(µ_θ|µ_u), regarded as a function on Θ. The relations in (3.5) justify the choice of the expected risk function R^L_{µ_u}.

(4) It is important to find quasi-distance functions on P(X) that satisfy certain natural statistical requirements. This problem has been considered in information geometry, see [Amari2016, AJLS2017] for further reading.

(5) If the parameter set Θ is large, e.g., Θ = P_X = P(X) where dim(X) ≥ 1, then P_X cannot be dominated by a probability measure µ_0 ∈ P(X). In this case we have to consider nonparametric techniques.

• A popular nonparametric technique for estimating density functions on X = R^m from empirical data S_n ∈ X^n is kernel density estimation (KDE), cf. [Tsybakov2009, p. 2]. To understand the idea of KDE we shall consider only the case X = R and µ_0 = dx. We need to estimate the unknown measure µ_u := p_u dx ∈ P(R). Let

$F_{\mu_u}(t) = \int_{-\infty}^{t} p_u(x)\, dx = \mu_u\big((-\infty, t)\big)$

be the corresponding cumulative distribution function. Then we have

(3.6) $p_u(t) = \frac{F_{\mu_u}(t+h) - F_{\mu_u}(t-h)}{2h} + O(h).$

Since µ_u governs the distribution of S_n = (x_1, · · · , x_n) ∈ X^n, we replace µ_u in the RHS of (3.6) by µ_{S_n} and obtain the Rosenblatt estimator

(3.7) $p^{R}_{S_n}(h; t) := \frac{F_{S_n}(t+h) - F_{S_n}(t-h)}{2h},$

where h is the bandwidth of the estimator and F_{S_n} is the empirical distribution function

$F_{S_n}(t) = F_{\mu_{S_n}}(t) = \frac{1}{n}\sum_{i=1}^{n} 1_{(x_i \le t)}.$

Furthermore, the LHS of (3.7) can be rewritten in the following form

(3.8) $p^{R}_{S_n}(h; t) = \frac{1}{2nh}\sum_{i=1}^{n} 1_{(t-h \le x_i \le t+h)} = \frac{1}{nh}\sum_{i=1}^{n} K_0\Big(\frac{x_i - t}{h}\Big),$

where $K_0(u) := \frac{1}{2}\, 1_{(-1 \le u \le 1)}$. A simple generalization of the Rosenblatt estimator is given by

(3.9) $p^{PR}_{S_n}(h; t) = \frac{1}{nh}\sum_{i=1}^{n} K\Big(\frac{x_i - t}{h}\Big),$

where K : R → R is an integrable function satisfying $\int K(u)\, du = 1$. Such a function K is called a kernel and the parameter h is called the bandwidth of the kernel density estimator (3.9), also called the Parzen-Rosenblatt estimator. Popular kernels are the rectangular kernel $K(u) = \frac{1}{2} 1_{(|u| \le 1)}$, the Gaussian kernel $K(u) = (2\pi)^{-1/2}\exp(-u^2/2)$, etc., see e.g. [Tsybakov2009, p. 3].
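
A minimal sketch of the Parzen-Rosenblatt estimator (3.9) with a Gaussian kernel; the data-generating mixture and the bandwidth value are assumptions chosen only for illustration, and no attempt is made to select h optimally.

import numpy as np

rng = np.random.default_rng(3)
# unknown density p_u: a two-component Gaussian mixture (assumed for the demo)
S_n = np.concatenate([rng.normal(-1.0, 0.5, 300), rng.normal(2.0, 1.0, 700)])

def gaussian_kernel(u):
    return np.exp(-0.5 * u ** 2) / np.sqrt(2 * np.pi)

def parzen_rosenblatt(t, sample, h):
    # p^{PR}_{S_n}(h; t) = (1/(n h)) * sum_i K((x_i - t)/h), evaluated at points t
    t = np.atleast_1d(t)[:, None]              # shape (m, 1)
    u = (sample[None, :] - t) / h              # shape (m, n)
    return gaussian_kernel(u).sum(axis=1) / (len(sample) * h)

grid = np.linspace(-4, 6, 201)
density_estimate = parzen_rosenblatt(grid, S_n, h=0.3)
print(np.trapz(density_estimate, grid))        # should be close to 1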


To measure the accuracy of the estimator $p^{PR}_{S_n}(h; \cdot)$ we first need to define an expected loss function R^L on the hypothesis class H of possible densities of the probability measures we are interested in. In the given case we choose H to be defined as follows

(3.10) $\mathcal H = \mathcal P(\beta, L) := \Big\{ p \in \mathbb R^{\mathbb R}_{\ge 0} \cap \Sigma_{\mathbb R}(\beta, L) \,\Big|\, \int_{\mathbb R} p(x)\, dx = 1 \Big\},$

where $\Sigma_{\mathbb R}(\beta, L)$ is the Hölder class on R, i.e., the set of $l = \lfloor\beta\rfloor$ 11 times differentiable functions f : R → R whose derivative $f^{(l)}$ satisfies

(3.11) $|f^{(l)}(x) - f^{(l)}(x')| \le L|x - x'|^{\beta - l}, \quad \forall x, x' \in \mathbb R.$

Let P_R ⊂ P(R) be a statistical model of possible distributions µ_u = p_u dx. By definition P_R = H. A natural candidate for an instantaneous loss function

(3.12) $L : \mathbb R \times \mathcal H \times P_{\mathbb R} \to \mathbb R$

is the squared distance function

(3.13) $L(x, p, p_u) := |p(x) - p_u(x)|^2.$

This induces the expected loss function at a point x_0 ∈ R as follows

(3.14) $R^L := R^L|_{x_0} : \mathcal H \times P_{\mathbb R} \to \mathbb R, \quad (p, p_u) \mapsto |p(x_0) - p_u(x_0)|^2.$

Let $\mathcal A_n \subset \mathcal H^{\mathbb R^n}$ be a set of possible estimators $p_n : \mathbb R^n \to \mathcal H,\ S_n \mapsto p_{S_n} \in \mathcal H$. Then we define the expected accuracy at x_0 function

$MR^L|_{x_0} : \mathcal A_n \to \mathbb R$

by

(3.15) $MR^L|_{x_0}(p_n) := MSE|_{x_0}(p_n) := E_{\mu^n_u}\big( R^L|_{x_0}(p_n(S_n), p_u) \big),$

where µ_u = p_u dx and S_n is distributed by $\mu^n_u$. For instance, for the Parzen-Rosenblatt estimator $p^{PR}_n(h;\cdot) : S_n \mapsto p^{PR}_{S_n}(h;\cdot)$ we have

(3.16) $MSE_{x_0}\big(p^{PR}_n(h;\cdot)\big) = E_{\mu^n_u}\Big[ \big( p^{PR}_{S_n}(h; x_0) - p_u(x_0) \big)^2 \Big].$

Note that the RHS of (3.16) measures the expected accuracy at x_0 of the estimator $p^{PR}_{S_n}$, averaging over S_n ∈ R^n. This is an important notion of the accuracy of an estimator in the presence of uncertainty.
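
The expectation in (3.16) can be approximated by a straightforward Monte Carlo simulation when the true density is known; the self-contained sketch below assumes a standard normal p_u, a fixed bandwidth, and a particular evaluation point, all of which are choices made only for the illustration.

import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(4)
n, h, x0, n_repeats = 500, 0.3, 0.5, 200

def kde_at(t, sample, h):
    # Parzen-Rosenblatt estimate (3.9) at a single point t, Gaussian kernel
    u = (sample - t) / h
    return norm.pdf(u).sum() / (len(sample) * h)

errors = []
for _ in range(n_repeats):
    sample = rng.normal(size=n)              # S_n drawn from mu_u = N(0, 1)
    errors.append((kde_at(x0, sample, h) - norm.pdf(x0)) ** 2)

mse_at_x0 = np.mean(errors)                  # Monte Carlo approximation of (3.16)
print(mse_at_x0)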

It is known that, under certain conditions on the kernel function K and assuming the infinite dimensional statistical model P = H of density functions given in (3.10), the mean expected risk $MSE_{x_0}(p^{PR}_{S_n}(h;\cdot))$ converges to zero uniformly in x_0 ∈ R as n goes to infinity and h goes to zero [Tsybakov2009, Theorem 1.1, p. 9].

11 ⌊β⌋ denotes the greatest integer strictly less than β ∈ R.


Remark 3.3. We have considered two models for density estimation, using different instantaneous loss functions to measure the accuracy of our estimators in terms of an expected loss function. In the first model of density estimation (X, H = Θ, L : X × H → R, P_X = H) we let L(x, θ) := −log p_θ(x). Similarly as in the mathematical model of supervised learning, the ERM principle says that a minimizer of the empirical risk function $R^L_{\mu_{S_n}}(\theta) = -\log[p^n_\theta(S_n)]$ should be the best estimator, i.e., it should minimize the expected loss function R^L_{µ_u} : Θ → R. Note that a minimizer of $R^L_{\mu_{S_n}}$ is the MLE: θ must maximize the value $\log p^n_\theta(S_n)$ for the given data S_n ∈ X^n.

In the second model for density estimation (X = R^m, H, L, P_X) the instantaneous loss function L : X × H × P_X → R depends also on the unknown probability measure µ_u ∈ P_X, see (3.12), cf. (2.2). Then we defined a new concept, the expected accuracy function MR^L(ρ_n) of an estimator ρ_n : X^n → H, which measures the accuracy of ρ_n averaging over observable data S_n. In the second model we do not discuss how to find a "best" estimator ρ_n : X^n → H; we found the Parzen-Rosenblatt estimator by a heuristic argument. The notion of a mean expected loss function is an important concept in the theory of generalization ability of machine learning, which gives an answer to Question 1.5: how to quantify the difficulty/complexity of a learning problem.

3.2. Mathematical models for clustering. Given an input measurable space X, clustering is the process of grouping similar objects x ∈ X together. There are two possible types of grouping in machine learning: partitional clustering, where we partition the objects into disjoint sets; and hierarchical clustering, where we create a nested tree of partitions. We propose the following unified notion of a clustering.

Let Ω_k denote a finite sample space of k elements ω_1, · · · , ω_k.

Definition 3.4. A clustering of X is a measurable mapping χ_k : X → Ω_k. A clustering χ_k : X → Ω_k is called hierarchical if there exists a number l ≤ k − 1 such that $\chi_k = \chi_{lk} \circ \chi_l : \mathcal X \xrightarrow{\chi_l} \Omega_l \xrightarrow{\chi_{lk}} \Omega_k$.

A probabilistic clustering of X is a probabilistic mapping χ_k : X ⇝ Ω_k. A probabilistic clustering χ_k : X ⇝ Ω_k is called hierarchical if there exists a number l ≤ k − 1 such that $\chi_k = \chi_{lk} \circ \chi_l : \mathcal X \overset{\chi_l}{\rightsquigarrow} \Omega_l \overset{\chi_{lk}}{\rightsquigarrow} \Omega_k$.

Now we can define a model of clustering similarly to the case of density estimation. To formalize the notion of similarity we introduce a quasi-distance function on X, that is, a function d : X × X → R_+ that is symmetric and satisfies d(x, x) = 0 for all x ∈ X. Then we pick a set C of possible clusterings, i.e., C is a subset of all measurable/probabilistic mappings $\bigcup_{k=1}^{\infty}\{\chi_k : \mathcal X \to \Omega_k\}$, and the goal of the clustering algorithm is to find a clustering χ_k ∈ C of minimal cost. Under this paradigm, like other tasks in statistical decision theory, the clustering task is turned into an optimization problem. The function to be minimized is called the objective function, which is a function

(3.17) $G : (\mathcal X, d) \times \mathcal C \to \mathbb R.$


Given G, the goal of a clustering algorithm is to find, for a given input (X, d), a clustering χ_k ∈ C so that G((X, d), χ_k) is minimized. In order to reach that goal, one has to apply some appropriate search algorithm. As it turns out, most of the resulting optimization problems are NP-hard, and some are even NP-hard to approximate.

Example 3.5. The k-means objective function is one of the most popular clustering objectives. In k-means, given a clustering χ_k : X → Ω_k, the data is partitioned into disjoint sets $C_1 = \chi_k^{-1}(\omega_1), \cdots, C_k = \chi_k^{-1}(\omega_k)$, where each C_i is represented by a centroid c_i := c(C_i). It is assumed that the input set X is embedded in some larger metric space (X', d) and c_i ∈ X'. We define c_i as follows

$c_i := \arg\min_{c \in \mathcal X'} \int_{C_i} d(x, c)\, d\mu,$

where µ is a probability measure on X. The k-means objective function G_k is defined as follows

(3.18) $G_k((\mathcal X, d), \chi_k) = \sum_{i=1}^{k} \int_{C_i} d(x, c_i)\, d\mu,$

where µ is a probability measure on X.

The k-means objective function G_k is used in digital communication tasks, where the members of X may be viewed as a collection of signals that have to be transmitted. For further reading on k-means algorithms, see [SSBD2014, p. 313], [HTF2008, §14.3.6, p. 505].
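
The objective (3.18) is usually attacked with Lloyd's alternating algorithm, a heuristic that is not guaranteed to find the global minimizer; the sketch below works with the empirical measure of a finite sample and the squared Euclidean distance, both standard but assumed choices.

import numpy as np

def lloyd_kmeans(X, k, n_iter=100, seed=0):
    # heuristic minimization of the empirical k-means objective (3.18)
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # assignment step: chi_k(x) = index of the nearest centroid
        labels = np.argmin(((X[:, None, :] - centroids[None, :, :]) ** 2).sum(-1), axis=1)
        # update step: centroid of each cluster C_i w.r.t. the empirical measure
        for i in range(k):
            if np.any(labels == i):
                centroids[i] = X[labels == i].mean(axis=0)
    objective = sum(((X[labels == i] - centroids[i]) ** 2).sum() for i in range(k)) / len(X)
    return labels, centroids, objective

rng = np.random.default_rng(5)
X = np.vstack([rng.normal(m, 0.3, size=(100, 2)) for m in ([0, 0], [3, 0], [0, 3])])
labels, centroids, obj = lloyd_kmeans(X, k=3)
print(centroids, obj)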

Remark 3.6. The above formulation of clustering is deterministic. For probabilistic clustering, where the output is a probabilistic mapping χ_k : X ⇝ Ω_k, i.e., χ_k(x) = (p_1(x), · · · , p_k(x)) ∈ P(Ω_k), we modify the cost function G_k as follows

(3.19) $G_k((\mathcal X, d), \chi_k) = \sum_{i=1}^{k} \int_{C_i} d(x, c_i)\, p_i(x)\, d\mu,$

where µ is a probability measure on X.

3.3. Mathematical models for dimension reduction and manifold learning. A central cause of the difficulties with unsupervised learning is the high dimensionality of the random variables being modeled. As the dimensionality of the input x grows, any learning problem gets significantly harder. Handling high-dimensional data is cumbersome in practice, which is often referred to as the curse of dimensionality. Hence various methods of dimensionality reduction have been introduced. Dimension reduction is the process of taking data in a high dimensional space and mapping it into a new space whose dimension is much smaller.

• Classical (linear) dimension reduction methods. Given original data $S_m := \{x_i \in \mathbb R^d \,|\, i \in [1, m]\}$ that we want to embed into R^n, n < d, we


would like to find a linear transformation W ∈ Hom(R^d, R^n) such that $W(S_m) := \{W(x_i) \,|\, i \in [1, m]\} \subset \mathbb R^n$. To find the "best" transformation W = W(S_m) we define an error function on the space Hom(R^d, R^n) and solve the associated optimization problem.

Example 3.7. A popular linear method for dimension reduction is Principal Component Analysis (PCA), also known under the name SVD (singular value decomposition). Given S_m ⊂ R^d, we use a linear transformation W ∈ Hom(R^d, R^n), where n < d, to compress S_m into R^n. Then a second linear transformation U ∈ Hom(R^n, R^d) can be used to (approximately) recover S_m from its compression W(S_m). In PCA, we search for W and U that minimize the following reconstruction error function $R_{S_m} : \mathrm{Hom}(\mathbb R^d, \mathbb R^n) \times \mathrm{Hom}(\mathbb R^n, \mathbb R^d) \to \mathbb R$,

(3.20) $R_{S_m}(W, U) = \sum_{i=1}^{m} \|x_i - U W(x_i)\|^2,$

where ‖·‖ denotes the Euclidean norm.

Exercise 3.8. ([SSBD2014, Lemma 23.1, p. 324]) Let (W, U) be a minimizer of R_{S_m} defined in (3.20). Show that U can be chosen as an orthogonal embedding and $W \circ U = \mathrm{Id}|_{\mathbb R^n}$.

Hint. First show that if a solution (W, U) of (3.20) exists, then there is a solution (W', U') of (3.20) such that dim ker(U'W') = d − n.

Let Hom^g(R^d, R^n) denote the set of all orthogonal projections from R^d to R^n and Hom^g(R^n, R^d) the set of all orthogonal embeddings from R^n to R^d. Let $\mathcal F \subset \mathrm{Hom}^g(\mathbb R^d, \mathbb R^n) \times \mathrm{Hom}^g(\mathbb R^n, \mathbb R^d)$ be the subset of all pairs (W, U) of transformations such that $W \circ U = \mathrm{Id}|_{\mathbb R^n}$. Exercise 3.8 implies that any minimizer (W, U) of R_{S_m} can be chosen to be an element of F.

Exercise 3.9. ([SSBD2014, Theorem 23.2, p. 325]) Let C(S_m) ∈ End(R^d) be defined as follows

$C(S_m)(v) := \sum_{i=1}^{m} \langle x_i, v\rangle\, x_i.$

Assume that $\xi_1, \cdots, \xi_d \in \mathbb R^d$ are eigenvectors of C(S_m) with eigenvalues $\lambda_1 \ge \cdots \ge \lambda_d \ge 0$. Show that any (W, U) ∈ F with $W(\xi_j) = 0$ for all j ≥ n + 1 is a solution of (3.20).

Thus a PCA problem can be solved using linear algebra methods.
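
A short sketch of the linear-algebra solution: the optimal compression projects onto the span of the top n eigenvectors of C(S_m), i.e., C(S_m) written as the matrix X^T X when the rows of X are the data points. The synthetic data below is an assumption made for the illustration.

import numpy as np

def pca(X, n_components):
    # PCA of the rows of X via the eigendecomposition of C(S_m) = X^T X
    C = X.T @ X                                   # the operator C(S_m) in matrix form
    eigvals, eigvecs = np.linalg.eigh(C)          # ascending eigenvalues
    top = eigvecs[:, ::-1][:, :n_components]      # top-n eigenvectors xi_1, ..., xi_n
    W = top.T                                     # compression   W : R^d -> R^n
    U = top                                       # decompression U : R^n -> R^d, U W = orthogonal projection
    return W, U

rng = np.random.default_rng(6)
X = rng.normal(size=(200, 2)) @ np.array([[3.0, 0.0, 0.0], [0.0, 0.5, 0.1]])  # data near a 2-plane in R^3
W, U = pca(X, n_components=2)
reconstruction_error = np.sum((X - X @ W.T @ U.T) ** 2)   # the objective (3.20)
print(reconstruction_error)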

• Manifold learning and autoencoders. In real life, data are often not concentrated on a linear subspace of R^d but around a submanifold M ⊂ R^d. A current challenge in the ML community is to reduce the representation of data in R^d using not all of R^d but only the data concentrated around


M. For that purpose we use an autoencoder, which is a non-linear analogue of PCA.

In an autoencoder we learn a pair of functions: an encoder function ψ : R^d → R^n and a decoder function ϕ : R^n → R^d. The goal of the learning process is to find a pair of functions (ψ, ϕ) such that the reconstruction error

$R_{S_m}(\psi, \varphi) := \sum_{i=1}^{m} \|x_i - \varphi(\psi(x_i))\|^2$

is small. We therefore must restrict ψ and ϕ in some way. In PCA, we constrain n < d and further restrict ψ and ϕ to be linear functions.
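
As a minimal sketch (not the general definition above, which allows arbitrary ψ and ϕ), one can take a one-hidden-layer encoder with a tanh nonlinearity and a linear decoder and fit both by plain gradient descent on the reconstruction error; the architecture, learning rate and data are all assumptions made only for the illustration.

import numpy as np

rng = np.random.default_rng(7)
# data concentrated around a 1-dimensional curve (a noisy circle arc) in R^2
t = rng.uniform(0, np.pi, size=300)
X = np.c_[np.cos(t), np.sin(t)] + 0.02 * rng.normal(size=(300, 2))

d, n = 2, 1                                            # ambient and code dimensions
W1 = 0.1 * rng.normal(size=(n, d)); b1 = np.zeros(n)   # encoder psi
W2 = 0.1 * rng.normal(size=(d, n)); b2 = np.zeros(d)   # decoder phi
lr = 1e-3

for step in range(5000):
    Z = np.tanh(X @ W1.T + b1)                # psi(x_i), shape (m, n)
    X_hat = Z @ W2.T + b2                     # phi(psi(x_i)), shape (m, d)
    G = 2 * (X_hat - X)                       # gradient of the reconstruction error w.r.t. X_hat
    dW2, db2 = G.T @ Z, G.sum(0)
    dA = (G @ W2) * (1 - Z ** 2)              # back-propagate through tanh
    dW1, db1 = dA.T @ X, dA.sum(0)
    W1 -= lr * dW1; b1 -= lr * db1
    W2 -= lr * dW2; b2 -= lr * db2

print(np.sum((X - X_hat) ** 2))               # reconstruction error R_{S_m}(psi, phi)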

Remark 3.10. Modern autoencoders have generalized the idea of an en-coder and a decoder beyond deterministic functions to probabilistic mappingsT : X ; Y. For further reading I recommend [SSBD2014], [GBC2016] and[Bishop2006, §12.2, p.570].

3.4. Conclusion. A mathematical model for supervised learning and unsupervised learning consists of a quadruple (Z, H, L, P_Z) where
• Z is a measurable space of instances z whose features we want to learn. For example Z = X × Y in classification and regression problems,
• H is a decision space containing all possible decisions we have to find based on a sequence of observables (z_1, · · · , z_n) ∈ Z^n. For example, H is a hypothesis space in the space of all functions X → Y in the discriminative model for supervised learning,
• P_Z ⊂ P(Z) is a statistical model containing all possible distributions µ_u that may govern the distribution of the i.i.d. sequence z_1, · · · , z_n,
• an instantaneous loss function L : Z × H × P_Z → R_{≥0}, which generates the expected loss function $R^L : \mathcal H \times P_{\mathcal Z} \to \mathbb R_{\ge 0}$ by setting $R^L(h, \mu) = R^L_\mu(h) := E_\mu L(z, h, \mu)$, where µ is the unknown probability measure that governs the distribution of the observables z_i, or the loss function evaluated at z_0: $R^L|_{z_0}(h, \mu) := E_{\delta_{z_0}} L(z, h, \mu) = L(z_0, h, \mu)$. Let us write µ_0 for either µ or δ_{z_0} used in the definition of the expected loss function in these two cases,
• a learning rule ρ_n : Z^n → H, which is usually chosen to be a minimizer of the empirical loss function $R^L_{\mu_{S_n}}$,
• a µ_0-expected accuracy function $L_{\mu_0}(\rho_n) := E_{\mu^n} R^L_{\mu_0}(\rho_n(S_n))$, where µ is the unknown probability measure that governs the distribution of the observables z_i.
• Supervised learning is distinguished from unsupervised learning by the aim of learning: supervised learning is interested only in a concrete answer: given an input x ∈ X, what is the probability that y ∈ Y is a feature of x; so the decision space H is a subspace of the space PMap(X, Y) of all probabilistic mappings from X to Y. In unsupervised learning we are interested in a structure of Z, e.g., the probability measure µ_Z that governs the distribution of z_i ∈ Z, or the homogeneity of Z in a clustering problem, etc.
• To define a decision rule/algorithm A_n : Z^n → H we usually use the ERM


principle, which minimizes the expected loss $R^L_{\mu_{S_n}} : \mathcal H \to \mathbb R$ for each given data S_n ∈ Z^n, or some heuristic method as in the Parzen-Rosenblatt nonparametric estimation. To estimate the success of a learning rule A_n is to estimate the accuracy of the algorithm A_n as n grows to infinity.

4. Mathematical model for reinforcement learning

Today we shall discuss a mathematical model for reinforcement learning, in which a learner, also called an agent, has to learn to act correctly by correlating the immediate reward given to it by the environment after every action with its past actions.

4.1. Knowledge to accumulate in reinforcement learning. The knowledge the learner acquires during the course of reinforcement learning shall be measured by a cumulative reward, which the learner has to maximize by choosing an optimal decision, also called an optimal policy.

A reinforcement learning agent interacts with its environment in discrete time steps as follows.

- At each time t, the agent observes a state s_t in the set S of possible states of the environment, and selects an action a_t ∈ A.

- The environment moves to a new state s_{t+1} ∈ S with probability Pr[s_{t+1}|a_t, s_t], and an immediate reward r_{t+1} is given to the agent (with probability Pr[r_{t+1}|a_t, s_t]).

- The discounted cumulative reward at time t is a discounted sum of immediate rewards along the trajectory h(t) := [s_1, a_1, · · · , s_t, a_t, s_{t+1}].

- A policy of an agent is its choice of a course of actions a based on the state s and the trajectory of its past actions.

- The value of a policy π is the expected discounted cumulative reward of a trajectory following the policy π, averaged over all initial states s ∈ S.

- The goal of the agent in reinforcement learning is to choose an optimal policy, i.e., one with maximal value.

Remark 4.1. A mathematical model of reinforcement learning is a Markov decision process (MDP), since we assume the following Markov property in modeling reinforcement learning: the conditional probability distribution Pr[r_{t+1}, s_{t+1}|a_t, s_t] of the state (and reward) at the next step depends only on the current state and action of the system, and not additionally on the states of the system at previous steps. In [Sugiyama2015, p. 8, 11] the author assumes that r_{t+1} = r(s_t, a_t, s_{t+1}), where r is a (probabilistic) mapping. Since s_{t+1} = s(a_t, s_t), we can write r_{t+1} = r(s_t, a_t), where r is a probabilistic mapping, [JLLT2019].

4.2. Markov decision process. An MDP consists of the following components:
• S is a set of possible states,
• A is a set of possible actions,
• R is a set of possible rewards,


• a state-reward transition is a probabilistic mapping p : S × A ⇝ S × R,
• a policy is a probabilistic mapping π : S ⇝ A,
• the discounted cumulative reward

(4.1) $G(h(t)) := \sum_{k=0}^{\infty} \gamma^k\, r_{t+k+1}$

is assigned to any trajectory $h(t) = [s_t, a_t, \cdots, s_{t+k}, a_{t+k}, s_{t+k+1}, \cdots]$ (here γ ∈ [0, 1) is the discount factor),
• the value function V_π(s) of a policy π, started at a state s and followed thereafter, is typically defined as follows, cf. [SB2018, (3.12), p. 58],

(4.2) $V_\pi(s) := \int_{\mathcal R} r\, d\,G(s, \pi),$

where µ ∈ P(S) is unknown, G(·, π) : S ⇝ R is the probabilistic mapping defined by (4.1) and the trajectory h(t) generated by the policy π starting at the state s. Here G(·, π) : S → P(R) is the associated measurable mapping [JLLT2019].
• The aim of an agent is to choose a policy π with maximum value V_π(s) for a given state s. A policy π is called optimal if it has maximal value V_π(s) for all s ∈ S.

4.3. Existence and uniqueness of the optimal policy. (1) If the MDP is finite, i.e., both S and A are finite sets, then there exists a unique optimal policy π such that the value V_π(s) is maximal for all s ∈ S [MRT2012, Theorem 14.1, p. 318].

(2) If the environment is known, i.e., if the probabilistic mapping p : S × A ⇝ S × R is known, then finding an optimal policy is a planning problem, for which several algorithms are known [MRT2012, §14.4, p. 319]; see the sketch after this list.

(3) If the environment is unknown, then the agent needs to learn it by estimating the probabilistic mapping p : S × A ⇝ S × R. As we have discussed, this problem can be solved within the supervised and unsupervised frameworks.
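
For the planning problem in point (2), a standard method is value iteration, which repeatedly applies the Bellman optimality update to the value function; the toy two-state, two-action MDP below (transition probabilities, rewards, and the discount factor γ = 0.9) is entirely made up for the illustration.

import numpy as np

n_states, n_actions, gamma = 2, 2, 0.9
# P[s, a, s'] = transition probability, R[s, a] = expected immediate reward (toy numbers)
P = np.array([[[0.8, 0.2], [0.1, 0.9]],
              [[0.5, 0.5], [0.0, 1.0]]])
R = np.array([[1.0, 0.0],
              [0.0, 2.0]])

V = np.zeros(n_states)
for _ in range(500):
    # Bellman optimality update: V(s) <- max_a [ R(s,a) + gamma * sum_s' P(s,a,s') V(s') ]
    Q = R + gamma * P @ V          # shape (n_states, n_actions)
    V_new = Q.max(axis=1)
    if np.max(np.abs(V_new - V)) < 1e-10:
        V = V_new
        break
    V = V_new

optimal_policy = Q.argmax(axis=1)  # a greedy policy w.r.t. the converged value function
print(V, optimal_policy)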

4.4. Conclusion. Reinforcement learning is a type of learning where the agent needs to learn to interact with the environment correctly. For this purpose it needs to know/learn the environment, which leads to unsupervised or supervised learning.

5. The Fisher metric and maximum likelihood estimator

We have considered several mathematical models in unsupervised learning, which is currently the most interesting part of machine learning. The most important problem in unsupervised learning is the problem of density estimation or, more generally, the problem of probability measure estimation, which is also a problem in generative models of supervised learning and the essential problem in reinforcement learning when we do not have complete


information about the environment. We learned a parametric model for the density estimation problem, where the error function is defined as the expected minus log-likelihood function. This error function, combined with the ERM principle, leads to the well known maximum likelihood estimator (MLE). The popularity of this estimator stems from its asymptotic accuracy, also called consistency, which holds under mild conditions. This type of asymptotic accuracy also holds for the Parzen-Rosenblatt estimator that we have learned. Today we shall study the MLE using the Fisher metric, the associated MSE function and the Cramér-Rao inequality.

We shall also clarify the relation between the Fisher metric and another choice of the error function for parametric density estimation, the Kullback-Leibler divergence.

5.1. The space of all probability measures and total variation norm. We begin today's lecture with an investigation of a natural geometry of P(X) for an arbitrary measurable space (X, Σ). This geometry induces the Fisher metric on any statistical model P ⊂ P(X) that satisfies a mild condition.

Let us fix some notations. Recall that a signed finite measure µ on X is a function µ : Σ → R which satisfies all axioms of a measure except that µ need not take non-negative values. Now we set

$\mathcal M(\mathcal X) := \{\mu \,:\, \mu \text{ is a finite measure on } \mathcal X\}, \qquad \mathcal S(\mathcal X) := \{\mu \,:\, \mu \text{ is a signed finite measure on } \mathcal X\}.$

It is known that S(X) is a Banach space whose norm is given by the total variation of a signed measure, defined as

$\|\mu\|_{TV} := \sup \sum_{i=1}^{n} |\mu(A_i)|,$

where the supremum is taken over all finite partitions $\mathcal X = A_1 \,\dot\cup\, \cdots \,\dot\cup\, A_n$ with disjoint sets $A_i \in \Sigma(\mathcal X)$ (see e.g. [Halmos1950]). Here the symbol $\dot\cup$ stands for the disjoint union of sets.

Let me describe the total variation norm using the Jordan decomposition theorem for signed measures, which is an analogue of the decomposition theorem for a measurable function. For a measurable function φ : X → [−∞, ∞] we define φ_+ := max(φ, 0) and φ_− := max(−φ, 0), so that φ_± ≥ 0 are measurable with disjoint support, and

(5.1) $\varphi = \varphi_+ - \varphi_-, \qquad |\varphi| = \varphi_+ + \varphi_-.$

Similarly, by the Jordan decomposition theorem, each measure µ ∈ S(X) can be decomposed uniquely as

(5.2) $\mu = \mu_+ - \mu_- \quad \text{with } \mu_\pm \in \mathcal M(\mathcal X),\ \mu_+ \perp \mu_-.$

That is, there is a Hahn decomposition $\mathcal X = X_+ \,\dot\cup\, X_-$ with $\mu_+(X_-) = \mu_-(X_+) = 0$ (in this case the measures µ_+ and µ_− are called mutually


singular). Thus, if we define

$|\mu| := \mu_+ + \mu_- \in \mathcal M(\mathcal X),$

then (5.2) implies

(5.3) $|\mu(A)| \le |\mu|(A) \quad \text{for all } \mu \in \mathcal S(\mathcal X) \text{ and } A \in \Sigma(\mathcal X),$

so that

$\|\mu\|_{TV} = \|\,|\mu|\,\|_{TV} = |\mu|(\mathcal X).$

In particular,

$\mathcal P(\mathcal X) = \{\mu \in \mathcal M(\mathcal X) : \|\mu\|_{TV} = 1\}.$

Next let us consider the important subsets of dominated measures and equivalent measures in the Banach space S(X), which are the most frequently used subsets in statistics and ML.

Given a measure µ_0 ∈ M(X), we let

$\mathcal S(\mathcal X, \mu_0) := \{\mu \in \mathcal S(\mathcal X) : \mu \text{ is dominated by } \mu_0\}.$

By the Radon-Nikodym theorem, we may canonically identify S(X, µ_0) with L^1(X, µ_0) by the correspondence

(5.4) $\imath_{can} : L^1(\mathcal X, \mu_0) \longrightarrow \mathcal S(\mathcal X, \mu_0), \quad \varphi \longmapsto \varphi\,\mu_0.$

Observe that $\imath_{can}$ is an isomorphism of Banach spaces, since evidently

$\|\varphi\|_{L^1(\mathcal X, \mu_0)} = \int_{\mathcal X} |\varphi|\, d\mu_0 = \|\varphi\,\mu_0\|_{TV}.$

Example 5.1. Let $\mathcal X_n := \{\omega_1, \cdots, \omega_n\}$ be a finite set of n elementary events. Let $\delta_{\omega_i}$ denote the Dirac measure concentrated at ω_i. Then

$\mathcal S(\mathcal X_n) = \Big\{\mu = \sum_{i=1}^{n} x_i\, \delta_{\omega_i} \,\Big|\, x_i \in \mathbb R\Big\} = \mathbb R^n(x_1, \cdots, x_n)$

and

$\mathcal M(\mathcal X_n) = \Big\{\sum_{i=1}^{n} x_i\, \delta_{\omega_i} \,\Big|\, x_i \in \mathbb R_{\ge 0}\Big\} = \mathbb R^n_{\ge 0}.$

For µ ∈ M(X_n) of the form

$\mu = \sum_{i=1}^{k} c_i\, \delta_{\omega_i}, \quad c_i > 0,$

we have $\|\mu\|_{TV} = \sum c_i$. Thus the space $L^1(\mathcal X_n, \mu)$ with the total variation norm is isomorphic to R^k with the l^1-norm. The space P(X_n) with the induced total variation topology is homeomorphic to the (n−1)-dimensional simplex $\{(c_1, \cdots, c_n) \in \mathbb R^n_+ \,|\, \sum_i c_i = 1\}$.

Exercise 5.2. ([JLS2017]) For any countable family of signed measures µ_n ∈ S(X) show that there exists a measure µ ∈ M(X) dominating all the measures µ_n.


Remark 5.3. On (possibly infinite dimensional) Banach spaces we can do analysis, since we can define the notion of differentiable mappings. Let V and W be Banach spaces and U ⊂ V an open subset. Denote by Lin(V, W) the space of all continuous linear maps from V to W. A map φ : U → W is called differentiable at x ∈ U if there is a bounded linear operator $d_x\varphi \in \mathrm{Lin}(V, W)$ such that

(5.5) $\lim_{h \to 0} \frac{\|\varphi(x+h) - \varphi(x) - d_x\varphi(h)\|_W}{\|h\|_V} = 0.$

In this case, $d_x\varphi$ is called the (total) differential of φ at x. Moreover, φ is called continuously differentiable, or shortly a C¹-map, if it is differentiable at every x ∈ U and the map $d\varphi : U \to \mathrm{Lin}(V, W),\ x \mapsto d_x\varphi$, is continuous. Furthermore, a differentiable map c : (−ε, ε) → W is called a curve in W.

A map φ from an open subset Θ of a Banach space V to a subset X of a Banach space W is called differentiable if the composition $i \circ \varphi : \Theta \to W$ is differentiable, where i denotes the inclusion of X into W. The notion of differentiable mappings can be further generalized to the setting of convenient analysis on locally convex vector spaces [KM1997] and to the general setting of diffeology [IZ2013].

5.2. The Fisher metric on a statistical model. Given a statistical model P ⊂ P(X), we shall show that P is endowed with a nice geometric structure induced from the Banach space (S(X), ‖·‖_TV). Under a mild condition this implies the existence of the Fisher metric on P. Then we shall compare the Fisher metric and the Kullback-Leibler divergence.

We study P by investigating the space of functions on P (which is an infinite dimensional linear vector space) and by investigating its dual version: the space of all smooth curves on P.12 The tangent fibration of P describes the first order approximation of the latter space.

Definition 5.4. (cf. [AJLS2017, Definition 3.2, p. 141]) (1) Let (V, ‖·‖) be a Banach space, $\mathcal X \overset{i}{\hookrightarrow} V$ an arbitrary subset, where i denotes the inclusion, and x_0 ∈ X. Then v ∈ V is called a tangent vector of X at x_0 if there is a C¹-map c : R → X, i.e., the composition $i \circ c : \mathbb R \to V$ is a C¹-map, such that c(0) = x_0 and $\dot c(0) = v$.

(2) The tangent (double) cone $C_x\mathcal X$ at a point x ∈ X is defined as the subset of the tangent space $T_x V = V$ that consists of tangent vectors of X at x. The tangent space $T_x\mathcal X$ is the linear hull of the tangent cone $C_x\mathcal X$.

(3) The tangent cone fibration CX (resp. the tangent fibration TX) is the union $\bigcup_{x\in\mathcal X} C_x\mathcal X$ (resp. $\bigcup_{x\in\mathcal X} T_x\mathcal X$), which is a subset of V × V and therefore endowed with the topology induced from V × V.

Example 5.5. Let us consider a mixture family P_X of probability measures p_η µ_0 on X that are dominated by µ_0 ∈ P(X), where the density functions

12 By [KM1997, Lemma 3.11, p. 30] a map f : P_1 → P_2 is smooth iff its composition with any smooth curve c : (0, 1) → P_1 is a smooth curve in P_2.


p_η are of the following form:

(5.6) $p_\eta(x) := g_1(x)\,\eta_1 + g_2(x)\,\eta_2 + g_3(x)\,(1 - \eta_1 - \eta_2) \quad \text{for } x \in \mathcal X.$

Here g_i, for i = 1, 2, 3, are nonnegative functions on X such that $E_{\mu_0}(g_i) = 1$, and $\eta = (\eta_1, \eta_2) \in D_b \subset \mathbb R^2$ is a parameter, which will be specified as follows. Let us divide the square D = [0, 1] × [0, 1] ⊂ R^2 into smaller squares and color them black and white like a chessboard. Let D_b be the closure of the subset of D colored black. If η is an interior point of D_b then $C_{p_\eta}P_{\mathcal X} = \mathbb R^2$. If η is a boundary point of D_b (but not a corner) then $C_{p_\eta}P_{\mathcal X} = \mathbb R$. If η is a corner point of D_b, then $C_{p_\eta}P_{\mathcal X}$ consists of two intersecting lines.

Exercise 5.6. (cf. [AJLS2018, Theorem 2.1]) Let P be a statistical model. Show that any $v \in C_\xi P$ is dominated by ξ. Hence the logarithmic representation of v,

$\log v := dv/d\xi,$

is an element of $L^1(\mathcal X, \xi)$.

Next we want to put a Riemannian metric on P, i.e., a positive quadratic form g on each tangent space $T_\xi P$. By Exercise 5.6, the logarithmic representation $\log(C_\xi P)$ of $C_\xi P$ is a subset of $L^1(\mathcal X, \xi)$. The space $L^1(\mathcal X, \xi)$ does not have a natural metric, but its subspace $L^2(\mathcal X, \xi)$ is a Hilbert space.

Definition 5.7. (1) A statistical model P that satisfies

(5.7) $\log(C_\xi P) \subset L^2(\mathcal X, \xi)$

for all ξ ∈ P is called almost 2-integrable.

(2) Assume that P is an almost 2-integrable statistical model. For $v, w \in C_\xi P$ the Fisher metric on P is defined as follows:

(5.8) $g(v, w) := \langle \log v, \log w\rangle_{L^2(\mathcal X, \xi)} = \int_{\mathcal X} \log v \cdot \log w\; d\xi.$

(3) An almost 2-integrable statistical model P is called 2-integrable if the function $v \mapsto |v|_g$ is continuous on CP.

Since $T_\xi P$ is the linear hull of $C_\xi P$, the formula (5.8) extends uniquely to a positive quadratic form on $T_\xi P$, which is also called the Fisher metric.

Example 5.8. Let P ⊂ P(X) be a 2-integrable statistical model parameterized by a differentiable map $p : \Theta \to P,\ \theta \mapsto p_\theta\mu_0$, where Θ is an open subset of R^n. It follows from (5.8) that the Fisher metric on P has the following form:

(5.9) $g|_{p(\theta)}(dp(v), dp(w)) = \int_{\mathcal X} \frac{\partial_v p_\theta}{p_\theta} \cdot \frac{\partial_w p_\theta}{p_\theta}\; p_\theta\, d\mu_0$

for any $v, w \in T_\theta\Theta$.
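
As a quick sanity check of formula (5.9) (a standard computation, not taken from the text), consider two classical one-parameter families. For the Gaussian location family $p_\theta(x) = (2\pi\sigma^2)^{-1/2}\exp\!\big(-(x-\theta)^2/(2\sigma^2)\big)$ on X = R with σ fixed, $\partial_\theta \log p_\theta(x) = (x-\theta)/\sigma^2$, hence
\[
g(\partial_\theta, \partial_\theta) = \int_{\mathbb R} \Big(\frac{x-\theta}{\sigma^2}\Big)^2 p_\theta(x)\, dx = \frac{1}{\sigma^2}.
\]
For the Bernoulli family $p_\theta(1) = \theta$, $p_\theta(0) = 1-\theta$, $\theta \in (0,1)$, on X = {0, 1} with µ_0 the counting measure,
\[
g(\partial_\theta, \partial_\theta) = \frac{(\partial_\theta p_\theta(1))^2}{p_\theta(1)} + \frac{(\partial_\theta p_\theta(0))^2}{p_\theta(0)} = \frac{1}{\theta} + \frac{1}{1-\theta} = \frac{1}{\theta(1-\theta)}.
\]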


Remark 5.9. (1) The Fisher metric was introduced by Fisher in 1925 to characterize the "information" of a statistical model. One of the most notable applications of the Fisher metric is the Cramér-Rao inequality, which measures our ability to have a good density estimator in terms of the geometry of the underlying statistical model, see Theorem 5.16 below.

(2) The Fisher metric $g_{p(\theta)}(dp(v), dp(v))$ of a parameterized statistical model P of dominated measures in Example 5.8 can be obtained from the Taylor expansion of the Kullback-Leibler divergence $I(p(\theta), p(\theta + \varepsilon v))$, assuming that log p_θ has continuous partial derivatives in θ up to order 3. Indeed we have

$I(p(\theta), p(\theta + \varepsilon v)) = \int_{\mathcal X} p_\theta(x)\,\log\frac{p_\theta(x)}{p_{\theta+\varepsilon v}(x)}\, d\mu_0$

(5.10) $\qquad = -\varepsilon\int_{\mathcal X} p_\theta(x)\,\partial_v \log p_\theta(x)\, d\mu_0$

(5.11) $\qquad\quad - \frac{\varepsilon^2}{2}\int_{\mathcal X} p_\theta(x)\,(\partial_v)^2 \log p_\theta(x)\, d\mu_0 + O(\varepsilon^3).$

Since log p_θ(x) is continuously differentiable in θ up to order 3, we can apply differentiation under the integral sign, see e.g. [Jost2005, Theorem 16.11, p. 213], to (5.10), which then must vanish, and integration by parts to (5.11). Hence we obtain

$I(p(\theta), p(\theta + \varepsilon v)) = \frac{\varepsilon^2}{2}\, g_{p(\theta)}(dp(v), dp(v)) + O(\varepsilon^3),$

which is what we had to prove.

5.3. The Fisher metric, MSE and Cramér-Rao inequality. Given a statistical model P ⊂ P(X), we wish to measure the accuracy (also called the efficiency) of an estimator σ : X → P via the MSE. For this purpose we need a further formalization, "linearizing" σ via composition with a mapping ϕ : P → V, where V is a real Hilbert vector space with a scalar product 〈·, ·〉 and the associated norm ‖·‖.

Remark 5.10. Our definition of an estimator agrees with the notion of an estimator in nonparametric statistics. In parametric statistics we consider a statistical model P ⊂ P_X that is parameterized by a parameter space Θ, and an estimator is a map from X to Θ.

Definition 5.11. A ϕ-estimator is a composition of an estimator σ : X → P and a map ϕ : P → V.

Example 5.12. (1) Let us reconsider the example of nonparametric density estimation where X = R^k and H ⊂ C^∞(R^k). Let x_0 ∈ R^k and S_n ∈ (R^k)^n. There we considered estimators ρ_{S_n} ∈ H and their associated ϕ-estimators ρ_{S_n}(x_0) ∈ R, where ϕ : H → R, f ↦ f(x_0), is the evaluation mapping.


(2) Assume that k : X × X → R is a symmetric and positive definite kernel function and let V be the associated RKHS. Then we define the kernel mean embedding ϕ : P(X) → V as follows [MFSS2017]:

$\varphi(\xi) := \int_{\mathcal X} k(x, \cdot)\, d\xi(x).$

Denote by Map(P, V) the space of all mappings ϕ : P → V. For l ∈ V let $\varphi^l : P \to \mathbb R$ be defined by

$\varphi^l(\xi) := \langle l, \varphi(\xi)\rangle,$

and we set

$L^2_\varphi(P, \mathcal X) := \{\sigma : \mathcal X \to P \,|\, \varphi^l \circ \sigma \in L^2(\mathcal X, \xi) \text{ for all } \xi \in P \text{ and for all } l \in V\}.$

For $\sigma \in L^2_\varphi(\mathcal X, P)$ we define the ϕ-mean value of σ, denoted by $\varphi_\sigma : P \to V$, as follows (cf. [AJLS2017, (5.54), p. 279]):

$\varphi_\sigma(\xi)(l) := E_\xi(\varphi^l \circ \sigma) \quad \text{for } \xi \in P \text{ and } l \in V.$

The difference $b^\varphi_\sigma := \varphi_\sigma - \varphi \in \mathrm{Map}(P, V)$ will be called the bias of the ϕ-estimator ϕ ∘ σ. An estimator σ will be called ϕ-unbiased if $\varphi_\sigma = \varphi$, equivalently, $b^\varphi_\sigma = 0$.

For all ξ ∈ P we define a quadratic function $MSE^\varphi_\xi[\sigma]$ on V, which is called the mean square error quadratic function at ξ, by setting for l, h ∈ V (cf. [AJLS2017, (5.56), p. 279])

(5.12) $MSE^\varphi_\xi[\sigma](l, h) := E_\xi\big[(\varphi^l \circ \sigma(x) - \varphi^l(\xi)) \cdot (\varphi^h \circ \sigma(x) - \varphi^h(\xi))\big].$

Thus the function $\xi \mapsto MSE^\varphi_\xi[\sigma](l, l)$ on P is the expected risk of the quadratic instantaneous loss function

$L^l : \mathcal X \times P \to \mathbb R, \quad L^l(x, \xi) = |\varphi^l \circ \sigma(x) - \varphi^l(\xi)|^2.$

Similarly we define the variance quadratic function of the ϕ-estimator ϕ ∘ σ at ξ ∈ P as the quadratic form $V^\varphi_\xi[\sigma]$ on V such that for all l, h ∈ V we have (cf. [AJLS2017, (5.57), p. 279])

$V^\varphi_\xi[\sigma](l, h) = E_\xi\big[\big(\varphi^l \circ \sigma(x) - E_\xi(\varphi^l \circ \sigma(x))\big) \cdot \big(\varphi^h \circ \sigma(x) - E_\xi(\varphi^h \circ \sigma(x))\big)\big].$

Remark 5.13. For $\sigma \in L^2_\varphi(\mathcal X, P)$ the mean square error $MSE^\varphi_\xi(\sigma)$ of the ϕ-estimator ϕ ∘ σ is defined by

(5.13) $MSE^\varphi_\xi(\sigma) := E_\xi\big(\|\varphi \circ \sigma - \varphi(\xi)\|^2\big).$

The RHS of (5.13) is well-defined, since $\sigma \in L^2_\varphi(\mathcal X, P)$ and therefore

$\langle \varphi \circ \sigma(x), \varphi \circ \sigma(x)\rangle \in L^1(\mathcal X, \xi) \quad\text{and}\quad \langle \varphi \circ \sigma(x), \varphi(\xi)\rangle \in L^2(\mathcal X, \xi).$

Similarly, we define the variance of the ϕ-estimator ϕ ∘ σ at ξ as follows:

$V^\varphi_\xi(\sigma) := E_\xi\big(\|\varphi \circ \sigma - E_\xi(\varphi \circ \sigma)\|^2\big).$


If V has a countable basis of orthonormal vectors $v_1, v_2, \cdots$, then we have

(5.14) $MSE^\varphi_\xi(\sigma) = \sum_{i=1}^{\infty} MSE^\varphi_\xi[\sigma](v_i, v_i),$

(5.15) $V^\varphi_\xi(\sigma) = \sum_{i=1}^{\infty} V^\varphi_\xi[\sigma](v_i, v_i).$

Exercise 5.14. ([AJLS2017]) Prove the following formula

(5.16) $MSE^\varphi_\xi[\sigma](l, k) = V^\varphi_\xi[\sigma](l, k) + \langle b^\varphi_\sigma(\xi), l\rangle \cdot \langle b^\varphi_\sigma(\xi), k\rangle$

for all ξ ∈ P and all l, k ∈ V.

Now we assume that P is an almost 2-integrable statistical model. For any ξ ∈ P let $T^g_\xi P$ be the completion of $T_\xi P$ w.r.t. the Fisher metric g. Since $T^g_\xi P$ is a Hilbert space, the map

$L_g : T^g_\xi P \to (T^g_\xi P)', \quad L_g(v)(w) := \langle v, w\rangle_g,$

is an isomorphism. Then we define the inverse $g^{-1}$ of the Fisher metric g on $(T^g_\xi P)'$ as follows:

(5.17) $\langle L_g v, L_g w\rangle_{g^{-1}} := \langle v, w\rangle_g.$

Definition 5.15. (cf. [AJLS2017, Definition 5.18, p. 281]) Assume that $\sigma \in L^2_\varphi(\mathcal X, P)$. We shall call σ a ϕ-regular estimator if for all l ∈ V the function $\xi \mapsto \|\varphi^l \circ \sigma\|_{L^2(\mathcal X, \xi)}$ is locally bounded, i.e., for all ξ_0 ∈ P

$\limsup_{\xi \to \xi_0} \|\varphi^l \circ \sigma\|_{L^2(\mathcal X, \xi)} < \infty.$

In [AJLS2017, JLS2017, Le2019] it has been proved that if σ is a ϕ-regular estimator, then for all l ∈ V the function $\varphi^l_\sigma$ is differentiable; moreover, there is a differential $d\varphi^l_\sigma(\xi) \in (T^g_\xi P)'$ for any ξ ∈ P such that

(5.18) $\partial_v \varphi^l_\sigma(\xi) = d\varphi^l_\sigma(v) \quad \text{for all } v \in T_\xi P.$

Here, for a function f on P, we define $\partial_v f(\xi) := \frac{d}{dt}\big|_{t=0} f(c(t))$, where c(t) ⊂ P is a curve with c(0) = ξ and $\dot c(0) = v$.

Set

$\nabla_g \varphi^l_\sigma(\xi) := L_g^{-1}(d\varphi^l_\sigma).$

Then (5.18) is equivalent to the following equality:

$\partial_v \varphi^l_\sigma(\xi) = \langle v, \nabla_g \varphi^l_\sigma(\xi)\rangle_g.$

Theorem 5.16 (The Cramér-Rao inequality). (cf. [AJLS2017, JLS2017]) Let P be a 2-integrable statistical model, V a Hilbert vector space, ϕ a V-valued function on P and $\sigma \in L^2_\varphi(P, \mathcal X)$ a ϕ-regular estimator. Then for all l ∈ V we have

$V^\varphi_\xi[\sigma](l, l) - \|d\varphi^l_\sigma\|^2_{g^{-1}(\xi)} \ge 0.$


Example 5.17. Assume that σ is an unbiased estimator. Then the terms involving $b_\sigma := b^\varphi_\sigma$ vanish. Since $\varphi_\sigma = \varphi$ we have $\|d\varphi^l\|^2_{g^{-1}(\xi)} = \|d\varphi^l_\sigma\|^2_{g^{-1}(\xi)}$. If ϕ : P → R^n is a local coordinate mapping at ξ ∈ P, then the Cramér-Rao inequality in Theorem 5.16 becomes the well-known Cramér-Rao inequality for an unbiased estimator:

(5.19) $V_\xi := V_\xi[\sigma] \ge g^{-1}(\xi).$
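
A standard illustration (not from the text): for the Gaussian location family of Example 5.8 with known σ, combining (5.19) with the Fisher metric computed above gives, for any unbiased estimator $\hat\theta : \mathbb R \to \Theta$ based on a single observation,
\[
V_\theta[\hat\theta] \;\ge\; g^{-1}(\theta) \;=\; \sigma^2,
\]
and the estimator $\hat\theta(x) = x$ attains this bound, so it is efficient; consistently with Theorem 5.19 below, it is exactly the maximum likelihood estimator of θ.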

5.4. Efficient estimators and MLE.

Definition 5.18. Assume that P ⊂ P(X) is a 2-integrable statistical model and ϕ : P → R^n is a feature map. An estimator $\sigma \in L^2_\varphi(P, \mathcal X)$ is called efficient if the Cramér-Rao inequality for σ becomes an equality, i.e., $V^\varphi_\xi[\sigma] = \|d\varphi_\sigma\|^2_{g^{-1}(\xi)}$ for any ξ ∈ P.

Theorem 5.19. ([AJLS2017, Corollary 5.6, p. 291]) Let P be a 2-integrable statistical model parameterized by a differentiable map $p : \Theta \to \mathcal P(\mathcal X),\ \theta \mapsto p_\theta\mu_0$, where Θ is an open subset of R^n, and let ϕ : P → R^n be a differentiable coordinate mapping, i.e., $p \circ \varphi = \mathrm{Id}$. Assume that the function $p(x, \theta) := p_\theta(x)$ has continuous partial derivatives up to order 3. If σ : X → P is an unbiased efficient estimator, then σ is a maximum likelihood estimator (MLE), i.e.,

(5.20) $\partial_v \log p(x, \theta)\big|_{\theta = \varphi \circ \sigma(x)} = 0$

for all x ∈ X and all $v \in T_\theta\Theta$.

5.5. Consistency of MLE. Assume that $P_1 := P \subset \mathcal P(\mathcal X)$ is a 2-integrable statistical model that contains an unknown probability measure µ_u governing the distribution of random instances x_i ∈ X. Then $P_n := \{\mu^n \,|\, \mu \in P\} \subset \mathcal P(\mathcal X^n)$ is a 2-integrable statistical model containing the probability measure $\mu^n_u$ that governs the distribution of the i.i.d. random instances $(x_1, \cdots, x_n) \in \mathcal X^n$. Denote by g^n the Fisher metric on the statistical model P_n. The map $\lambda_n : P_1 \to P_n,\ \mu \mapsto \mu^n$, is a 1-1 map; it is the restriction of the differentiable map, also denoted by λ_n,

$\lambda_n : \mathcal S(\mathcal X) \to \mathcal S(\mathcal X^n), \quad \lambda_n(\mu) = \mu^n.$

It is easy to see that for any $v \in T_\mu P$ we have

(5.21) $g^n(d\lambda_n(v), d\lambda_n(v)) = n \cdot g^1(v, v).$

This implies that the lower bound in the Cramér-Rao inequality (5.19) for unbiased estimators $\sigma^*_n := \lambda_n^{-1} \circ \sigma_n : \mathcal X^n \to P$ converges to zero, and there is hope that the MLE is asymptotically accurate as n goes to infinity. Now we shall give the concept of a consistent sequence of ϕ-estimators, which formalizes the notion of an asymptotically accurate sequence of estimators $\sigma^*_k : \mathcal X^k \to P_1$, and use it to examine the MLE.13

13 The notion of a consistent sequence of estimators that is asymptotically accurate was suggested by Fisher in [Fisher1925].


Definition 5.20. (cf. [IH1981, p. 30]) Let P ⊂ P(X) be a statistical model, V a real Hilbert space and ϕ a V-valued function on P. A sequence of ϕ-estimators $\sigma^*_k : \mathcal X^k \to P \to V$ is called a consistent sequence of ϕ-estimators for the value ϕ(µ_u) if for all δ > 0 we have

(5.22) $\lim_{k\to\infty} \mu^k_u\big(\{x \in \mathcal X^k : |\varphi \circ \sigma^*_k(x) - \varphi(\mu_u)| > \delta\}\big) = 0.$

Remark 5.21. Under quite general conditions on the density functions of a statistical model P ⊂ P(X) of dominated measures, see e.g. [IH1981, Theorem 4.3, p. 36], the sequence of MLEs is consistent.

5.6. Conclusion. In this lecture we derived a natural geometry on a statistical model P ⊂ P(X) by regarding it as a subset of the Banach space S(X) with the total variation norm. Since P is non-linear, we linearize an estimator $\sigma^*_k : \mathcal X^k \to P_k = \lambda_k(P)$ by composing it with a map $\varphi_k := \varphi \circ \lambda_k^{-1} : P_k \to \mathbb R^n$. Then we define the MSE and variance of the φ_k-estimator $\varphi_k \circ \sigma_k$, which can be estimated using the Cramér-Rao inequality. It turns out that an efficient unbiased estimator w.r.t. the MSE is an MLE. The notion of a ϕ-estimator allows us to define the notion of a consistent sequence of ϕ-estimators, which formalizes the notion of asymptotically accurate estimators. Since the concept of a ϕ-estimator also encompasses the Parzen-Rosenblatt estimator, the notion of the accuracy of $\sigma^*_k$ defined in (5.22) also agrees with the notion of accuracy of estimators defined in (3.15).

6. Consistency of a learning algorithm

In the last lecture we learned the important concept of a consistent sequence of ϕ-estimators, which formalizes the notion of the asymptotic accuracy of a sequence of estimators, and applied this concept to the MLE. Since the notion of a ϕ-estimator is a technical modification of the notion of an estimator, we shall not consider ϕ-estimators separately in our consideration of consistency of a learning algorithm. We replace L by L_ϕ when we consider ϕ-consistency of a learning algorithm.

In this lecture we extend the concept of a consistent sequence of estimators to the notion of consistency of a learning algorithm in a unified learning model of supervised learning and unsupervised learning. Then we shall apply this concept to examine the success of the ERM algorithm in binary classification problems.

6.1. Consistent learning algorithm and its sample complexity. Recall that in a unified learning model (Z, H, L, P_Z) we are given a measurable space Z, a hypothesis/decision class H together with a mapping ϕ : H → V, where V is a real Hilbert space, and an instantaneous loss function L : Z × H × P_Z → R_+, where P_Z ⊂ P(Z) is a statistical model. We define the expected loss/risk function as follows:

$R^L : \mathcal H \times P_{\mathcal Z} \to \mathbb R_+, \quad (h, \mu) \mapsto E_\mu L(z, h, \mu).$


We also set, for µ ∈ P,

$R^L_\mu : \mathcal H \to \mathbb R, \quad h \mapsto R^L(h, \mu).$

A learning algorithm is a map

$A : \bigcup_n \mathcal Z^n \to \mathcal H, \quad S \mapsto h_S,$

where S is distributed by some unknown $\mu^n \in P_n = \lambda_n(P)$, cf. (2.1).

The expected accuracy of a learning algorithm A with data size n is defined as follows:

$L(A_n) := E_{\mu^n} R^L_\mu(A(S_n)),$

where $\mu^n$ is the unknown probability measure that governs the distribution of the observable S_n ∈ Z^n.

Example 6.1. In a unified learning model (Z, H, L, P_Z) there is always a learning algorithm using ERM if H is finite, or if H is compact and $R^L_{\mu_{S_n}}$ is a continuous function on H. The ERM algorithm A^{erm} for (Z, H, L, P_Z) is defined as follows:

(6.1) $A^{erm}(S_n) := \arg\min_{h \in \mathcal H} R^L_{S_n}(h),$

where we recall that $R^L_{S_n} := R^L_{\mu_{S_n}}$. Observe that $\arg\min_{h \in \mathcal H} R^L_{S_n}(h)$ may not exist if H is not compact. In this case we denote by A^{erm}(S_n) any hypothesis that satisfies the inequality $R^L_{S_n}(A^{erm}(S_n)) - \inf_{h \in \mathcal H} R^L_{S_n}(h) = O(n^{-k})$, where k ≥ 1 depends on the computational complexity of defining A^{erm}. In other words, A^{erm} is only asymptotically ERM.14

In general we expect a close relationship between a discriminative model H and a statistical model P_Z.

Recall that $R^L_{\mu,\mathcal H} = \inf_{h\in\mathcal H} R^L_\mu(h)$, see (2.10).

Definition 6.2. (1) A learning algorithm A in a model (Z, H, L, P) is called consistent if for any ε ∈ (0, 1) and for every probability measure µ ∈ P,

(6.2) $\lim_{n\to\infty} \mu^n\big\{S \in \mathcal Z^n : |R^L_\mu(A(S)) - R^L_{\mu,\mathcal H}| \ge \varepsilon\big\} = 0.$

(2) A learning algorithm A is called uniformly consistent if (6.2) converges to zero uniformly on P, i.e., for any (ε, δ) ∈ (0, 1)² there exists a number m_A(ε, δ) such that for any µ ∈ P and any m ≥ m_A(ε, δ) we have

(6.3) $\mu^m\big\{S \in \mathcal Z^m : |R^L_\mu(A(S)) - R^L_{\mu,\mathcal H}| \le \varepsilon\big\} \ge 1 - \delta.$

If (6.3) holds we say that A predicts with accuracy ε and confidence 1 − δ using m samples.

We characterize the uniform consistency of a learning algorithm A via the notion of the sample complexity function of A.

14 Vapnik considered the ERM algorithm also for the case that the function $R^L_{S_n}$ may not attain its infimum on H, using a slightly different language than ours. In [SSBD2014] the authors assume that a minimizer of $R^L_{S_n}$ always exists.


Definition 6.3. Let A be an algorithm on (Z, H, L, P) and let m_A(ε, δ) be the minimal number m_0 ∈ R_+ ∪ {∞} such that (6.3) holds for any µ ∈ P and any m ≥ m_0. Then the function m_A : (ε, δ) ↦ m_A(ε, δ) is called the sample complexity function of the algorithm A.

Clearly a learning algorithm A is uniformly consistent if and only if m_A takes values in R_+. Furthermore, A is consistent if and only if the sample complexity function of A on the sub-model (Z, H, L, µ) takes values in R_+ for all µ ∈ P.

Example 6.4. Let (Z, H, L, P) be a unified learning model. Denote by $\pi_n : \mathcal Z^\infty \to \mathcal Z^n$ the map $(z_1, \cdots, z_\infty) \mapsto (z_1, \cdots, z_n)$. A sequence $S_\infty \in \mathcal Z^\infty$ of i.i.d. instances distributed by µ ∈ P is called overfitting if there exists ε ∈ (0, 1) such that for all n we have

(6.4) $|R^L_\mu(A^{erm}(\pi_n(S_\infty))) - R^L_{\mu,\mathcal H}| \ge \varepsilon.$

Thus A^{erm} is consistent if and only if the set of all overfitting sequences $S_\infty \in \mathcal Z^\infty$ has µ^∞-zero measure, see Subsection A.5 for the definition of µ^∞; equivalently, if (6.2) holds for any µ ∈ P. In Example 2.14 we showed the existence of a measure µ_f on Z = X × Y such that any sequence $S_\infty \in \mathcal Z^\infty$ distributed by $\mu_f^\infty$ is overfitting. Hence the unified learning model in Example 2.14, for any P containing µ_f, is not consistent using A^{erm}.

The following simple lemma relates the uniform consistency of A^{erm} to the convergence in probability of R^L_S(h) to R^L_µ(h) (by the weak law of large numbers) that is uniform on H.

Lemma 6.5. Assume that for any (ε, δ) ∈ (0, 1)² there exists a function $m_{\mathcal H}(\varepsilon, \delta)$ taking values in R_+ such that for all $m \ge m_{\mathcal H}(\varepsilon, \delta)$, for all µ ∈ P and for all h ∈ H we have

(6.5) $\mu^m\big\{S \in \mathcal Z^m : |R^L_S(h) - R^L_\mu(h)| < \varepsilon\big\} \ge 1 - \delta.$

Then A^{erm} is uniformly consistent.

Proof. For simplicity of exposition we assume first that $A^{erm}(S_n) = \arg\min_{h\in\mathcal H} R^L_{S_n}(h)$. The argument for the general case can easily be adapted from this simple case.

Given $m \ge m_{\mathcal H}(\varepsilon/2, \delta/2)$, µ ∈ P and $h_\varepsilon \in \mathcal H$ such that $R^L_\mu(h_\varepsilon) \le R^L_{\mu,\mathcal H} + \varepsilon$, we have

$\mu^m\{S \in \mathcal Z^m : |R^L_\mu(A^{erm}(S)) - R^L_{\mu,\mathcal H}| \le 2\varepsilon\}$
$\ge \mu^m\{S \in \mathcal Z^m : R^L_\mu(A^{erm}(S)) \le R^L_\mu(h_\varepsilon) + \varepsilon\}$
$\ge \mu^m\{S \in \mathcal Z^m : R^L_\mu(A^{erm}(S)) \le R^L_S(h_\varepsilon) + \tfrac{\varepsilon}{2} \ \&\ |R^L_S(h_\varepsilon) - R^L_\mu(h_\varepsilon)| < \tfrac{\varepsilon}{2}\}$
$\ge \mu^m\{S \in \mathcal Z^m : R^L_\mu(A^{erm}(S)) \le R^L_S(h_\varepsilon) + \tfrac{\varepsilon}{2}\} - \tfrac{\delta}{2}$
$\ge \mu^m\{S \in \mathcal Z^m : |R^L_\mu(A^{erm}(S)) - R^L_S(A^{erm}(S))| \le \tfrac{\varepsilon}{2}\} - \tfrac{\delta}{2} \ \ge\ 1 - \delta,$

since $R^L_S(A^{erm}(S)) \le R^L_S(h_\varepsilon)$. This completes the proof of Lemma 6.5.


Theorem 6.6. (cf. [SSBD2014, Corollary 4.6, p. 57]) Let (Z, H, L, P(Z)) be a unified learning model. If H is finite and L(Z × H) ⊂ [0, c] with c < ∞, then the ERM algorithm is uniformly consistent.

Proof. By Lemma 6.5, it suffices to find for each (ε, δ) ∈ (0, 1) × (0, 1) a number $m_{\mathcal H}(\varepsilon, \delta)$ such that for all $m \ge m_{\mathcal H}(\varepsilon, \delta)$ and for all µ ∈ P(Z) we have

(6.6) $\mu^m\Big(\bigcap_{h\in\mathcal H}\{S \in \mathcal Z^m : |R^L_S(h) - R^L_\mu(h)| \le \varepsilon\}\Big) \ge 1 - \delta.$

In order to prove (6.6) it suffices to establish the following inequality:

(6.7) $\sum_{h\in\mathcal H} \mu^m\big(\{S \in \mathcal Z^m \,|\, |R^L_S(h) - R^L_\mu(h)| > \varepsilon\}\big) < \delta.$

Since #H < ∞, it suffices to find $m_{\mathcal H}(\varepsilon, \delta)$ such that for $m \ge m_{\mathcal H}(\varepsilon, \delta)$ each summand in the LHS of (6.7) is small enough. For this purpose we shall apply the well-known Hoeffding inequality, which specifies the rate of convergence in the weak law of large numbers, see Subsection B.4.

To apply Hoeffding's inequality to the proof of Theorem 6.6 we observe that, for each h ∈ H,

$\theta^h_i(z) := L(h, z) \in [0, c]$

are i.i.d. R-valued random variables on Z. Furthermore, for any h ∈ H and S = (z_1, · · · , z_m) we have

$R^L_S(h) = \frac{1}{m}\sum_{i=1}^{m}\theta^h_i(z_i), \qquad R^L_\mu(h) = E_\mu\theta^h_i.$

Hence the Hoeffding inequality implies

(6.8) $\mu^m\{S \in \mathcal Z^m : |R^L_S(h) - R^L_\mu(h)| > \varepsilon\} \le 2\exp(-2m\varepsilon^2 c^{-2}).$

Now plugging

$m \ge m_{\mathcal H}(\varepsilon, \delta) := \frac{\log(2\#(\mathcal H)/\delta)}{2\varepsilon^2 c^{-2}}$

into (6.8) we obtain (6.7). This completes the proof of Theorem 6.6.
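
The bound $m_{\mathcal H}(\varepsilon, \delta) = \frac{c^2\log(2\#\mathcal H/\delta)}{2\varepsilon^2}$ from the proof is easy to evaluate numerically; the class size, loss bound, accuracy and confidence below are arbitrary example values chosen only for illustration.

import math

def sample_complexity(card_H, eps, delta, c=1.0):
    # finite-class sample complexity from the proof of Theorem 6.6 (Hoeffding bound)
    return (c ** 2) * math.log(2 * card_H / delta) / (2 * eps ** 2)

# e.g. 1000 hypotheses, 0-1 loss (c = 1), accuracy 0.05, confidence 0.95
print(math.ceil(sample_complexity(card_H=1000, eps=0.05, delta=0.05)))   # about 2120 samples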

Definition 6.7. The function $m_{\mathcal H} : (0, 1)^2 \to \mathbb R$ defined by the requirement that $m_{\mathcal H}(\varepsilon, \delta)$ is the least number for which (6.6) holds is called the sample complexity of a (unified) learning model (Z, H, L, P).

Remark 6.8. (1) We have proved that the sample complexity of the algorithm A^{erm} is bounded from above by the sample complexity $m_{\mathcal H}$ of (Z, H, L, P).

(2) The definition of the sample complexity $m_{\mathcal H}$ in our lecture is different from the definition of the sample complexity $m_{\mathcal H}$ in [SSBD2014, Definition


3.1, p. 43], where the authors consider the concept of PAC-learnability^{15} of a hypothesis class H under a "realizability" assumption. Their mH depends on a learning algorithm f, and it is not hard to see that it is equivalent to the notion of the sample complexity mf of the learning algorithm f in our lecture.

6.2. Uniformly consistent learning and VC-dimension. In this subsection we shall examine sufficient and necessary conditions for the existence of a uniformly consistent learning algorithm on a unified learning model with infinite hypothesis class H ⊂ Y^X. First we prove a version of the No-Free-Lunch theorem, which asserts that there is no uniformly consistent learning algorithm on a learning model with a very large hypothesis class and a very large statistical model. Denote by PX(X × Y) the set of all probability measures (Γf)∗(µX) ∈ P(X × Y), where f : X → Y is a measurable map and µX ∈ P(X).

Theorem 6.9. Let X be an infinite domain set, Y = {0, 1} and L(0−1) the 0-1 loss function. Then there is no uniformly consistent learning algorithm on a unified learning model (X × Y, H, L(0−1), PX(X × Y)).

Proof. To prove Theorem 6.9 it suffices to show that mA(ε, δ) = ∞ for (ε, δ) = (1/8, 1/8) and any learning algorithm A. Assume the opposite, i.e., mA(1/8, 1/8) = m < ∞. Then we shall find µ(m) ∈ PX(X × Y) such that Equation (6.3) is violated for m, µ(m) and (ε, δ) = (1/8, 1/8).

To describe the measure µ(m) we need some notation. For a subset C[k] ⊂ X of k elements let µ^{C[k]}_X ∈ P(X) be defined by

(6.9) µ^{C[k]}_X(B) := #(B ∩ C[k]) / k for any B ⊂ X.

For a map f : X → Y we set

µ^{C[k]}_f := (Γf)∗ µ^{C[k]}_X ∈ PX(X × Y).

Lemma 6.10. Assume that X, Y are finite sets and #X ≥ n + 1. For f ∈ Y^X set µf := µ^X_f ∈ PX(X × Y). Then for any learning algorithm A : S 7→ AS and any f ∈ Y^X we have

(6.10) ∫_{Y^X} ∫_{(X×Y)^n} R^{(0−1)}_{µf}(AS) d(µ^n_f)(S) dµ^{Y^X}_{Y^X}(f) ≥ (1 − 1/#Y)(1 − n/#X).

Proof of Lemma 6.10. For S ∈ (X × Y)^n we set

Pri : (X × Y)^n → X, ((x1, y1), · · · , (xn, yn)) 7→ xi ∈ X,

XS := {Pr1(S), · · · , Prn(S)} ⊂ X.

^{15} The concept of PAC-learnability was introduced by the computer scientist L. Valiant [Valiant1984]; it is equivalent to the existence of a uniformly consistent learning algorithm, taking into account computational complexity.


Note that S being distributed by µ^n_f means that S = ((x1, f(x1)), · · · , (xn, f(xn))), so S is essentially distributed by the uniform probability measure (µ^X_X)^n. Let us compute and estimate the double integral on the LHS of (6.10) using (2.9) and the Fubini theorem:

(6.11)
E_{µ^{Y^X}_{Y^X}}( E_{µ^n_f}( R^{(0−1)}_{µf}(AS) ) )
= (1/#X) E_{µ^{Y^X}_{Y^X}}( E_{(µ^X_X)^n}( Σ_{x∈X} (1 − δ_{AS(x), f(x)}) ) )
≥ (1/#X) E_{µ^{Y^X}_{Y^X}}( E_{(µ^X_X)^n}( Σ_{x∉XS} (1 − δ_{AS(x), f(x)}) ) )
= (1/#X) E_{(µ^X_X)^n}( Σ_{x∉XS} E_{µ^{Y^X}_{Y^X}}(1 − δ_{AS(x), f(x)}) )
= (1/#X) E_{(µ^X_X)^n}( #[X \ XS] · (1 − 1/#Y) )
≥ (1 − 1/#Y)(1 − n/#X),

since #[X \ XS] ≥ #X − n. This completes the proof of Lemma 6.10.

Continuation of the proof of Theorem 6.9. It follows from Lemma 6.10 (applied to the finite domain C[2m] and n = m) that there exists f ∈ Y^X such that, denoting µ := µ^{C[2m]}_f, we have

(6.12) ∫_{(X×Y)^m} R^{(0−1)}_µ(AS) dµ^m(S) = ∫_{(C[2m]×Y)^m} R^{(0−1)}_µ(AS) dµ^m(S) ≥ 1/4.

Since 0 ≤ R^{(0−1)}_µ ≤ 1, we obtain from (6.12)

µ^m{S ∈ (X × Y)^m | R^{(0−1)}_µ(A(S)) ≥ 1/8} > 1/8.

This implies that (6.3) does not hold for (ε, δ) = (1/8, 1/8) for this m, with µ(m) = µ^{C[2m]}_f. This proves Theorem 6.9.

Remark 6.11. In the proof of Theorem 6.9 we showed that if there is a subset C ⊂ X of size 2m such that the restriction of H to C is the full set of functions {0, 1}^C, then (6.3) does not hold for this m, any learning algorithm A and (ε, δ) = (1/8, 1/8). This motivates the following

Definition 6.12. (1) A hypothesis class H ⊂ {0, 1}^X shatters a finite subset C ⊂ X if #H|_C = 2^{#C}.

(2) The VC-dimension of a hypothesis class H ⊂ {0, 1}^X, denoted by V C dim(H), is the maximal size of a set C ⊂ X that can be shattered by H. If H can shatter sets of arbitrarily large size we say that H has infinite VC-dimension.

Example 6.13. (1) A hypothesis class H shatters a subset consisting of one point x0 ∈ X if and only if there are two functions f, g ∈ H such that f(x0) ≠ g(x0).


(2) Let H be the class of intervals in the real line, namely,

H = {1_{(a,b)} : a < b ∈ R},

where 1_{(a,b)} : R → {0, 1} is the indicator function of the interval (a, b). Take the set C = {1, 2}. Then H shatters C, since all the functions in {0, 1}^C can be obtained as the restriction of some function from H to C. Hence V C dim(H) ≥ 2. Now take an arbitrary set C = {c1 < c2 < c3} and the corresponding labeling (1, 0, 1). Clearly this labeling cannot be obtained by an interval: any interval (a, b) that contains c1 and c3 (and hence labels c1 and c3 with the value 1) must contain c2 (and hence labels c2 with 1 instead of 0). Hence H does not shatter C. We therefore conclude that V C dim(H) = 2. Note that H has infinitely many elements.
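Definition 6.12 can be checked by brute force on small examples. The following Python sketch (an illustration only, not part of the cited references) enumerates all labelings of a finite set C ⊂ R and tests whether each is realized by some interval indicator 1_(a,b), reproducing the conclusion of Example 6.13 (2).

import itertools

def interval_labels(points, a, b):
    # Labeling of `points` induced by the indicator function of the interval (a, b).
    return tuple(1 if a < x < b else 0 for x in points)

def shattered_by_intervals(points):
    # C is shattered iff every labeling in {0,1}^C is realized by some 1_(a,b).
    # Endpoints strictly between / outside the sorted points suffice as candidates.
    xs = sorted(points)
    cuts = ([xs[0] - 2.0, xs[0] - 1.0]
            + [(xs[i] + xs[i + 1]) / 2.0 for i in range(len(xs) - 1)]
            + [xs[-1] + 1.0, xs[-1] + 2.0])
    realizable = {interval_labels(points, a, b) for a in cuts for b in cuts if a < b}
    return all(lab in realizable for lab in itertools.product([0, 1], repeat=len(points)))

print(shattered_by_intervals([1, 2]))      # True:  H shatters {1, 2}, so VC dim >= 2
print(shattered_by_intervals([1, 2, 3]))   # False: the labeling (1, 0, 1) is not realizable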

Exercise 6.14 (VC-dimension of threshold functions). Consider the hypothesis class F ⊂ {−1, 1}^R of all threshold functions sign_b : R → {−1, 1}, where b ∈ R, defined by

sign_b(x) := sign(x − b).

Show that V C dim(F) = 1.

Exercise 6.15. Let us reconsider the toy Example 2.2, where we want to predict skin diseases by examining skin images. Recall that X = ⋃_{i=1}^5 I1 × I2 × I3 × Ai and Y = {±1}. Recall that a function f : X → Y can be identified with the subset f^{−1}(1) ⊂ X. Let us consider the hypothesis class H ⊂ Y^X consisting of the cubes [a1, b1] × [a2, b2] × [a3, b3] × Ai ⊂ X, i = 1, · · · , 5. Prove that V C dim(H) < ∞.

In Remark 6.11 we observed that the finiteness of V C dim(H) is a necessary condition for the existence of a uniformly consistent learning algorithm on a unified learning model (X, H, L(0−1), PX(X × Y)). In the next subsection we shall show that the finiteness of V C dim(H) is also a sufficient condition for the uniform consistency of Aerm on (X, H, L(0−1), PX(X × Y)).

6.3. Fundamental theorem of binary classification.

Theorem 6.16 (Fundamental theorem of binary classification). A learning model (X, H ⊂ {0, 1}^X, L(0−1), P(X × {0, 1})) has a uniformly consistent learning algorithm if and only if V C dim(H) < ∞.

Outline of the proof. Note that the "only if" assertion of Theorem 6.16 follows from Remark 6.11. Thus we need only prove the "if" assertion. By Lemma 6.5 it suffices to show that if V C dim(H) = k < ∞ then mH(ε, δ) < ∞ for all (ε, δ) ∈ (0, 1)^2. In other words, we need to bound the LHS of (6.5) from below, in terms of the VC-dimension, by a quantity that is at least 1 − δ whenever ε ∈ (0, 1) and m is sufficiently large. This shall be done in three steps.


In step 1, setting Z = X × Y and omitting the superscript of the risk function R, we use the Markov inequality to obtain

(6.13) µm{S ∈ Zm : |Rµ(h) − RS(h)| ≥ a} ≤ Eµm|Rµ(h) − RS(h)| / a

for any a > 0 and any h ∈ H.

In step 2 we define the growth function ΓH : N → N and use it to upper bound the RHS of (6.13). Namely, one shows that for every µ ∈ P(X × Y) and any δ ∈ (0, 1)

(6.14) µm{S ∈ Zm : sup_{h∈H} |Rµ(h) − RS(h)| ≤ (4 + √(log ΓH(2m))) / (δ√(2m))} ≥ 1 − δ.

The proof of (6.14) is delicate and can be found in [SSBD2014, p. 76-77].

Definition 6.17 (Growth function). Let F ⊂ Y^X be a class of functions with finite target space Y. The growth function ΓF assigned to F is defined for all n ∈ N as

ΓF(n) := max_{Σ⊂X : #Σ=n} #F|_Σ.

We also set ΓF(0) = 1.

Example 6.18. Consider the set F := {sign_b | b ∈ R} of all threshold functions. Given a set Σ = {x1, · · · , xn} ⊂ R of distinct points, there are n + 1 functions in F|_Σ, corresponding to the n + 1 possible ways of placing b relative to the xi's. Hence, in this case, ΓF(n) ≥ n + 1.

Exercise 6.19. Show that ΓF(n) = n + 1 for the set F of all threshold functions.

In step 3 we use the following Vapnik-Chervonenkis Lemma, also known as Sauer's Lemma, whose proof can be found in [SSBD2014, p. 74-75].

Lemma 6.20. Let H ⊂ {0, 1}^X be a hypothesis class with V C dim(H) = d < ∞. Then for all n ∈ N we have

ΓH(n) ≤ Σ_{i=0}^d \binom{n}{i}.

In particular, if n > d + 1 then ΓH(n) ≤ (en/d)^d.

It follows from Lemma 6.20 that for any (ε, δ) ∈ (0, 1)^2 there exists m such that

(4 + √(log ΓH(2m))) / (δ√(2m)) < ε,

and therefore by (6.14) for any (ε, δ) ∈ (0, 1)^2 the value mH(ε, δ) is finite. This completes the proof of Theorem 6.16.
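The polynomial growth asserted by Lemma 6.20 is easy to observe numerically. The following Python snippet (a simple illustration; the value d = 3 is an arbitrary assumed VC-dimension) compares the Sauer bound Σ_{i≤d} \binom{n}{i} and (en/d)^d with the trivial bound 2^n.

import math

def sauer_bound(n, d):
    # Vapnik-Chervonenkis / Sauer bound: Gamma_H(n) <= sum_{i=0}^{d} C(n, i).
    return sum(math.comb(n, i) for i in range(d + 1))

d = 3  # assumed VC-dimension, for illustration only
for n in [5, 10, 20, 50]:
    poly = (math.e * n / d) ** d          # the bound (en/d)^d, valid for n > d + 1
    print(f"n={n:3d}  sum_i C(n,i)={sauer_bound(n, d):8d}  (en/d)^d={poly:12.1f}  2^n={2 ** n}")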


6.4. Conclusions. In this lecture we define the notion of the (uniform) consistency of a learning algorithm A on a unified learning model (Z, H, L, P) and characterize this notion via the sample complexity of A. We relate the consistency of the ERM algorithm to the uniform convergence of the law of large numbers over the parameter space H and use it to prove the uniform consistency of ERM in the binary classification problem (X, H ⊂ {0, 1}^X, L(0−1), P(X × {0, 1})). We show that the finiteness of the VC-dimension of H is a necessary and sufficient condition for the existence of a uniformly consistent learning algorithm on a binary classification problem (X × {0, 1}, H ⊂ {0, 1}^X, L(0−1), PX(X × {0, 1})) (Theorem 6.16).

7. Generalization ability of a learning machine

In the last lecture we measured the consistency of a learning algorithm A in a unified learning model (Z, H, L, P) via the sample complexity function mA : (0, 1)^2 → R+ ∪ {∞}. The sample complexity mA(ε, δ) is the least number of samples that A requires in order to make a prediction with ε accuracy and (1 − δ) confidence. In 1984 Valiant suggested the PAC model of learning, which corresponds to the notion of uniform (w.r.t. P) consistency of A with the additional requirement that the sample complexity be efficiently bounded, i.e., the function mA(ε, δ) must be polynomial in ε^{−1} and δ^{−1}. Furthermore, Valiant also requires that A is efficiently computable, which can be expressed in terms of the computational complexity of A; see [SSBD2014, Chapter 8] for a discussion of the running time of A. This requirement is reasonable. In particular, the uniform consistency requirement is one important feature that distinguishes the generalizability of a learning machine from the consistency of an estimator in classical mathematical statistics.

Thus, given a learning machine (Z, H, L, P, A), where A is a learning algorithm, the generalization ability of (Z, H, L, P, A) is measured in terms of the sample complexity of A. In the previous lecture, Lemma 6.5, we gave upper bounds for the sample complexity mAerm(ε, δ) in terms of the sample complexity mH(ε/2, δ/2) of the learning model. Then we showed that in a binary classification problem the sample complexity mH takes finite values if and only if the VC-dimension of H is finite (Theorem 6.16).

Today we shall discuss two further methods for obtaining upper bounds on the sample complexities mH and mAerm of some important learning models. Then we discuss the problem of learning model selection.

7.1. Covering number and sample complexity. In the binary classification problem of supervised learning the VC-dimension is a combinatorial characterization of the hypothesis class H, which carries no topology, since the domain X and the target space Y are discrete. The expected zero-one loss function is therefore the preferred choice of a risk function. In [CS2001] Cucker-Smale estimated the sample complexity mH of a discriminative model for a regression problem with the MSE as the expected loss function.


Before stating Cucker-Smale's results we introduce the necessary notation.
• X is a topological space and ρ denotes a Borel measure on X × R^n.
• Let C^n(X) be the Banach space of continuous bounded R^n-valued functions on X with the norm^{16}

||f||_{C^0} = sup_{x∈X} ||f(x)||.

• For f ∈ C^n(X) let fY ∈ C^n(X × R^n) denote the function

fY(x, y) := f(x) − y.

Then MSEρ(f) = Eρ(||fY||^2), cf. (5.13).
• For a function g ∈ C^n(X × R^n) denote by Vρ(g) its variance,

Vρ(g) = Eρ(||g − Eρ(g)||^2) = Eρ(||g||^2) − ||Eρ(g)||^2.

• For a compact hypothesis class H ⊂ C^n(X) define the following quantities:

(7.1) MSEρ,H(f) := MSEρ(f) − min_{g∈H} MSEρ(g),

which is called the estimation error of f, or the sample error of f [CS2001], and

Vρ(H) := sup_{f∈H} Vρ(fY),

N(H, s) := min{l ∈ N | there exist l disks in H of radius s covering H}.

• For S = (z1, · · · , zm) ∈ Zm := (X × Y)^m, where zi = (xi, yi), denote by fS a minimizer of the empirical function MSES : H → R given by

MSES(f) := (1/m) Σ_{i=1}^m ||f(xi) − yi||^2 = MSEµS(f).

The existence of fS follows from the compactness of H and the continuity of the functional MSES on H.

Theorem 7.1. ([CS2001, Theorem C]) Let H be a compact subset of C(X) := C^1(X). Assume that for all f ∈ H we have |f(x) − y| ≤ M ρ-almost everywhere, where ρ is a probability measure on X × R. Then for all ε > 0

(7.2) ρ^m{S ∈ Zm | MSEρ,H(fS) ≤ ε} ≥ 1 − N(H, ε/(16M)) · 2 e^{−mε^2 / (8(4Vρ(H) + (1/3)M^2 ε))}.

Theorem 7.1 implies that for any n < ∞ the ERM algorithm is consistent on the unified learning model (X, H ⊂ C^n(X), L2, PB(X × R^n)), where L2 is the quadratic loss function and PB(X × R^n) denotes the space of Borel measures on the topological space X × R^n, provided H is compact. To increase the accuracy of the estimate in (7.2) we need to decrease the radius of the covering balls and therefore increase the number of covering balls.

^{16} In [CS2001, p. 8] the authors used the L^∞-norm, but they considered only the subspace of continuous functions.
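The covering number N(H, s) appearing in Theorem 7.1 can be estimated numerically for simple compact classes. The following Python sketch uses a greedy covering heuristic; the one-parameter class H = {x ↦ θx : θ ∈ [0, 1]}, for which the C^0-distance between two members equals the distance between their parameters, is an illustrative assumption.

import numpy as np

def covering_number(points, dist, s):
    # Greedy upper bound on N(H, s): repeatedly pick an uncovered element as a
    # new center and remove everything within distance s of it.
    remaining = list(points)
    centers = []
    while remaining:
        c = remaining[0]
        centers.append(c)
        remaining = [p for p in remaining if dist(p, c) > s]
    return len(centers)

# H = { f_theta(x) = theta * x : theta in [0, 1] } on X = [0, 1]; in the C^0-norm
# ||f_theta - f_eta|| = |theta - eta|, so we may work with the parameters directly.
thetas = np.linspace(0, 1, 1001)
for s in [0.2, 0.1, 0.05]:
    print(f"s={s}: greedy cover size {covering_number(thetas, lambda a, b: abs(a - b), s)}")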


The proof of Theorem 7.1 is based on Lemma 6.5. To prove that the sample complexity of Aerm is bounded it suffices to prove that the sample complexity mH of the learning model (X, H ⊂ C(X), L2, PB(X × R^n)) is bounded.

Outline of the proof of the Cucker-Smale theorem. The strategy of the proof of Theorem 7.1 is similar to that of the Fundamental Theorem of binary classification (Theorem 6.16).

In the first step we prove the following lemma, which gives a lower bound on the rate of the convergence in probability of the empirical risk MSES(f) to the expected risk MSEρ(f) for a given f ∈ H.

Lemma 7.2. ([CS2001, Theorem A, p. 8]) Let M > 0 and f ∈ C^0(X) be such that |f(x) − y| ≤ M ρ-a.e. Then for all ε > 0 we have

ρ^m{S ∈ Zm : |MSEρ(f) − MSES(f)| ≤ ε} ≥ 1 − 2 e^{−mε^2 / (2(σ^2 + (1/3)M^2 ε))},

where σ^2 = Vρ(f_Y^2).

Lemma 7.2 is a version of the inequality (6.8), for which we used the Hoeffding inequality. The Hoeffding inequality does not involve the variance, so Cucker-Smale used the Bernstein inequality instead; see Appendix B.

In the second step we reduce the problem of estimating an upper bound for the sample complexity mH to that of estimating upper bounds for the sample complexities mDj, where {Dj | j ∈ [1, l]} is a cover of H by disks, using the covering number. Namely, we have the following easy inequality:

(7.3) ρ^m{S ∈ Zm | sup_{f∈H} |MSEρ(f) − MSES(f)| ≥ ε} ≤ Σ_{j=1}^l ρ^m{S ∈ Zm | sup_{f∈Dj} |MSEρ(f) − MSES(f)| ≥ ε}.

In the third and last step we prove the following

Lemma 7.3. ([CS2001, Proposition 3, p. 12]) Let f1, f2 ∈ C(X). If |fj(x) − y| ≤ M on a set U ⊂ Z of full measure for j = 1, 2, then for any S ∈ U^m we have

|MSES(f1) − MSES(f2)| ≤ 4M ||f1 − f2||_{C^0}.

Lemma 7.3 implies that for all S ∈ U^m, if fj denotes the center of the disk Dj and the disks are chosen of radius ε/(8M) (cf. Proposition 7.4 below), then

sup_{f∈Dj} |MSEρ(f) − MSES(f)| ≥ 2ε =⇒ |MSEρ(fj) − MSES(fj)| ≥ ε.

Combining the last relation with (7.3), we derive the following desired upper estimate for the sample complexity mH.


Proposition 7.4. Assume that for all f ∈ H we have |f(x) − y| ≤ M ρ-a.e. Then for all ε > 0 we have

ρ^m{z ∈ Zm : sup_{f∈H} |MSEρ(f) − MSEz(f)| ≤ ε} ≥ 1 − N(H, ε/(8M)) · 2 e^{−mε^2 / (4(2σ^2 + (1/3)M^2 ε))},

where σ^2 = sup_{f∈H} Vρ(f_Y^2).

This completes the proof of Theorem 7.1.

Exercise 7.5. Let L2 denote the instantaneous quadratic loss function in (2.16). Derive from Theorem 7.1 an upper bound for the sample complexity mH(ε, δ) of the learning model (X, H ⊂ C^n(X), L2, PB(X × R^n)), where H is compact and PB(X × R^n) is the space of Borel measures on the topological space X × R^n.

Remark 7.6. If the hypothesis class H in Theorem 7.1 is in addition a convex subset of C(X), then Cucker-Smale obtained an improved estimate of the sample complexity mAerm [CS2001, Theorem C∗].

7.2. Rademacher complexities and sample complexity. Rademacher complexities of a learning model (Z, H, L, P) are refined versions of the sample complexity mH: one adds a parameter S ∈ Zn to obtain the empirical Rademacher complexity, and a parameter µ ∈ P to obtain the expected Rademacher complexity. Rademacher complexities are designed for data-dependent estimation of upper bounds on the sample complexity mAerm of the ERM algorithm.

For a learning model (Z, H, L, P) we set

(7.4) GLH := {gh : Z → R, gh(z) = L(z, h) | h ∈ H}.

Definition 7.7 (Rademacher complexity). Let S = (z1, · · · , zn) ∈ Zn. The empirical Rademacher complexity of a family G ⊂ R^Z w.r.t. the sample S is defined as

RS(G) := E_{(µ^{Z2}_{Z2})^n} [ sup_{g∈G} (1/n) Σ_{i=1}^n σi g(zi) ],

where σi ∈ Z2 = {±1} for i ∈ [1, n] and µ^{Z2}_{Z2} is the uniform (counting) probability measure on Z2, see (6.9).

If S is distributed according to a probability measure µ^n on Zn, where µ ∈ P, then the Rademacher n-complexity of the family G w.r.t. µ is defined by

Rn,µ(G) := Eµn [RS(G)].

Let (Z, H, L, P) be a learning model and µ ∈ P. The Rademacher n-complexity Rn(Z, H, L, µ) (resp. the empirical Rademacher n-complexity RS(Z, H, L)) is defined to be the complexity Rn,µ(GLH) (resp. the empirical complexity RS(GLH)), where GLH is the family associated to the model (Z, H, L, P) by (7.4).
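The empirical Rademacher complexity of Definition 7.7 can be approximated by Monte Carlo, averaging the supremum over random sign vectors σ. The following Python sketch does this for the zero-one losses of a small class of threshold classifiers on a random sample; the class, the sample distribution and the number of draws are illustrative assumptions.

import numpy as np

rng = np.random.default_rng(1)

def empirical_rademacher(loss_values, n_draws=2000):
    # loss_values: array of shape (#hypotheses, n) with entries g_h(z_i) = L(z_i, h).
    # Monte Carlo estimate of R_S(G) = E_sigma sup_h (1/n) sum_i sigma_i g_h(z_i),
    # with sigma_i uniform on {-1, +1}.
    H, n = loss_values.shape
    sigmas = rng.choice([-1.0, 1.0], size=(n_draws, n))
    sups = (sigmas @ loss_values.T / n).max(axis=1)   # sup over hypotheses for each sigma
    return sups.mean()

# Illustrative setup: zero-one losses of threshold classifiers on a random sample.
n = 50
x = rng.uniform(0, 1, size=n)
y = (x >= 0.5).astype(float)
thresholds = np.linspace(0, 1, 21)
losses = np.array([((x >= t).astype(float) != y).astype(float) for t in thresholds])
print("estimated R_S(G) =", empirical_rademacher(losses))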


Example 7.8. Let us consider a learning model (X × Z2, H ⊂ (Z2)^X, L(0−1), µ). For a sample S = ((x1, y1), · · · , (xm, ym)) ∈ (X × Z2)^m denote by Pr(S) the sample (x1, · · · , xm) ∈ X^m. We shall show that

(7.5) RS(G^{(0−1)}_H) = (1/2) R_{Pr(S)}(H).

Using the identity

L(0−1)(x, y, h) = 1 − δ_{h(x), y} = (1 − y h(x))/2

we compute

RS(G^{(0−1)}_H) = E_{(µ^{Z2}_{Z2})^m} [ sup_{h∈H} (1/m) Σ_{i=1}^m σi (1 − δ_{h(xi), yi}) ]
= E_{(µ^{Z2}_{Z2})^m} [ sup_{h∈H} (1/m) Σ_{i=1}^m σi (1 − yi h(xi))/2 ]
= (1/2) E_{(µ^{Z2}_{Z2})^m} [ sup_{h∈H} (1/m) Σ_{i=1}^m (−σi yi h(xi)) ]   (using E σi = 0)
= (1/2) E_{(µ^{Z2}_{Z2})^m} [ sup_{h∈H} (1/m) Σ_{i=1}^m σi h(xi) ] = (1/2) R_{Pr(S)}(H),

which is what we wanted to prove.

We have the following relation between the empirical Rademacher complexity and the Rademacher complexity, obtained using the McDiarmid concentration inequality, see (B.7) and [MRT2012, (3.14), p. 36]:

(7.6) µ^n{S ∈ Zn | Rn,µ(GLH) ≤ RS(GLH) + √(ln(2/δ)/(2n))} ≥ 1 − δ/2.

Theorem 7.9. (see e.g. [SSBD2014, Theorems 26.3, 26.5, p. 377-378]) Assume that (Z, H, L, µ) is a learning model with |L(z, h)| < c for all z ∈ Z and all h ∈ H. Then for any δ > 0 and any h ∈ H we have

(7.7) µ^n{S ∈ Zn | RLµ(h) − RLS(h) ≤ Rn,µ(GLH) + c√(2 ln(2/δ)/n)} ≥ 1 − δ,

(7.8) µ^n{S ∈ Zn | RLµ(h) − RLS(h) ≤ RS(GLH) + 4c√(2 ln(4/δ)/n)} ≥ 1 − δ,

(7.9) µ^n{S ∈ Zn | RLµ(Aerm(S)) − RLµ(h) ≤ 2RS(GLH) + 5c√(2 ln(8/δ)/n)} ≥ 1 − δ,

(7.10) Eµn(RLµ(Aerm(S)) − RLµ,H) ≤ 2Rn,µ(GLH).


Using the Markov inequality, it follows from (7.10) that the sample complexity mAerm admits the following bound in terms of the Rademacher complexity:

(7.11) µ^n{S ∈ Zn | RLµ(Aerm(S)) − RLµ,H ≤ 2Rn,µ(GLH)/δ} ≥ 1 − δ.

Remark 7.10. (1) The first two assertions of Theorem 7.9 give an upper bound of a "half" of the sample complexity mH of a unified learning model (Z, H, L, µ) by the (empirical) Rademacher complexity Rn,µ(GLH) (resp. RS(GLH)) of the associated family GLH. The last assertion of Theorem 7.9 is derived from the second assertion and the Hoeffding inequality.

(2) For the binary classification problem (X × {0, 1}, H ⊂ {0, 1}^X, L(0−1), P(X × {0, 1})) there is a close relationship between the Rademacher complexity and the growth function ΓH(m); see [MRT2012, Lemma 3.1, Theorem 3.2, p. 37] for a detailed discussion.

7.3. Model selection. The choice of the right prior information in machine learning is often interpreted as the choice of the right hypothesis class H in a learning model (Z, H, L, P); this choice is also called model selection. A right choice of H should balance the approximation error and the estimation error of H, which appear in the error decomposition of H below.

7.3.1. Error decomposition. We assume that the maximal domain of the expected loss function RLµ is a subspace HL,µ ⊃ H, given L and a probability measure µ ∈ P.

We define the Bayes risk of the learning problem RLµ on the maximal domain HL,µ as follows:

RLb,µ := inf_{h∈HL,µ} RLµ(h).

Recall that RLµ,H := inf_{h∈H} RLµ(h) quantifies the optimal performance of a learner in H. We decompose the difference between the expected risk of a predictor h ∈ H and the Bayes risk as follows:

(7.12) RLµ(h) − RLb,µ = (RLµ(h) − RLµ,H) + (RLµ,H − RLb,µ).

The first term on the RHS of (7.12) is called the estimation error of h, cf. (7.1), and the second term is called the approximation error. If h = Aerm(S) is a minimizer of the empirical risk RLS, then the estimation error of Aerm(S) is also called the sample error [CS2001, p. 9].

The approximation error quantifies how well the hypothesis class H is suited for the problem under consideration. The estimation error measures how well the hypothesis h performs relative to the best hypotheses in H. Typically, the approximation error decreases when H is enlarged, but the estimation error increases, as demonstrated by the No-Free-Lunch Theorem 6.9, because the statistical model P has to be enlarged as H is enlarged.


Example 7.11. (cf. [Vapnik2000, p. 19], [CS2001, p. 4, 5]) Let us compute the error decomposition of a discriminative model (X × R, H ⊂ R^X, L2, PB(X × R)) for a regression problem. Let π : X × R → X denote the natural projection. Then the measure ρ, the push-forward measure π∗(ρ) ∈ P(X) and the conditional probability measure ρ(y|x) on each fiber π^{−1}(x) = R are related as follows:

∫_{X×R} ϕ(x, y) dρ = ∫_X ( ∫_R ϕ(x, y) dρ(y|x) ) dπ∗(ρ)

for ϕ ∈ L1(ρ), similarly to the Fubini theorem.

Let us compute the Bayes risk RLb,µ for L = L2. The maximal subspace HL,µ where the expected loss RLµ is well defined is the space L2(X, π∗(ρ)). We claim that the regression function of ρ (see Exercise 2.10),

rρ(x) := Eρ(i2(Y)|X = x) = ∫_R y dρ(y|x),

minimizes MSEπ∗(ρ) defined on the space L2(X, π∗(ρ)). Indeed, for any f ∈ L2(X, π∗(ρ)) we have

(7.13) MSEπ∗(ρ)(f) = ∫_X (f(x) − rρ(x))^2 dπ∗(ρ) + MSEπ∗(ρ)(rρ).

Equation (7.13) implies that the Bayes risk RLb,π∗(ρ) equals MSEπ∗(ρ)(rρ). It follows that the approximation error of a hypothesis class H equals

MSEπ∗(ρ)(fmin) − MSEπ∗(ρ)(rρ) = ∫_X (fmin(x) − rρ(x))^2 dπ∗(ρ),

where fmin is a minimizer of the MSE in H. Since MSEπ∗(ρ)(rρ) is a constant, if H is compact then fmin exists and satisfies the condition

(7.14) fmin = argmin_{g∈H} d(g, rρ),

where d is the L2-distance on L2(X, π∗(ρ)).
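The decomposition (7.13) is easy to verify numerically. In the following Python sketch the joint distribution (X uniform on [0, 1], Y = sin(2πX) plus Gaussian noise) is an assumed toy model, so the regression function rρ is known in closed form; the two sides of (7.13) then agree up to Monte Carlo error, and MSE(rρ) plays the role of the Bayes risk.

import numpy as np

rng = np.random.default_rng(2)

# Assumed joint distribution: X ~ U[0, 1], Y = sin(2*pi*X) + 0.3 * noise.
m = 200_000
x = rng.uniform(0, 1, size=m)
y = np.sin(2 * np.pi * x) + 0.3 * rng.standard_normal(m)

def mse(f):
    return np.mean((f(x) - y) ** 2)

r = lambda t: np.sin(2 * np.pi * t)        # the regression function r_rho(x) = E[Y | X = x]
g = lambda t: 2.0 * t - 1.0                # some other square-integrable predictor

# Decomposition (7.13): MSE(g) should equal E[(g - r)^2] + MSE(r) up to sampling error.
print("MSE(r) (Bayes risk, ~noise variance):", mse(r))
print("MSE(g)                              :", mse(g))
print("MSE(r) + E[(g - r)^2]               :", mse(r) + np.mean((g(x) - r(x)) ** 2))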

7.3.2. Validation and cross-validation. An important empirical approach to model selection is validation and its refinement, (k-fold) cross-validation. Validation is used for model selection as follows. We first train different algorithms (or the same algorithm with different parameters) on the given training set S. Let H := {h1, · · · , hr} be the set of all output predictors of the different algorithms. To choose a single predictor from H we sample a fresh validation set S′ ∈ Zm and choose the predictor hi that minimizes the error over the validation set.

The basic idea of cross-validation is to partition the training set S = (z1, · · · , zn) ∈ Zn into two sets S = S1 ∪ S2, where S1 ∈ Zk and S2 ∈ Zn−k. The set S1 is used for training each of the candidate models, and the second set S2 is used for deciding which of them yields the best results.


The n-fold cross-validation scheme refines this idea: the training set is partitioned into n subsets, one subset is held out for testing while the remaining subsets are used for training, and the procedure is repeated n times, once for each choice of the testing subset.
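A minimal Python sketch of the k-fold cross-validation procedure just described is given below; the candidate models (polynomial least-squares fits of several degrees) and the synthetic data are illustrative assumptions.

import numpy as np

rng = np.random.default_rng(3)

def k_fold_cv(x, y, fit, predict, candidates, k=5):
    # Generic k-fold cross-validation for model selection: for each candidate
    # parameter, average the validation error over the k train/validation splits.
    idx = rng.permutation(len(x))
    folds = np.array_split(idx, k)
    scores = []
    for c in candidates:
        errs = []
        for j in range(k):
            val = folds[j]
            train = np.concatenate([folds[i] for i in range(k) if i != j])
            model = fit(x[train], y[train], c)
            errs.append(np.mean((predict(model, x[val]) - y[val]) ** 2))
        scores.append(np.mean(errs))
    return candidates[int(np.argmin(scores))], scores

# Illustrative use: choosing a polynomial degree for least-squares regression.
x = rng.uniform(-1, 1, size=80)
y = np.sin(3 * x) + 0.2 * rng.standard_normal(80)
fit = lambda xs, ys, d: np.polyfit(xs, ys, d)
predict = lambda coef, xs: np.polyval(coef, xs)
best, scores = k_fold_cv(x, y, fit, predict, candidates=[1, 3, 5, 9, 15])
print("selected degree:", best)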

7.4. Undecidability of learnability. Kurt Gödel proved in 1940 that the negation of the continuum hypothesis, i.e., the existence of a set with intermediate cardinality, could not be proved in standard set theory. The second half of the independence of the continuum hypothesis, i.e., the unprovability of the nonexistence of an intermediate-sized set, was proved in 1963 by Paul Cohen. Thus Gödel and Cohen showed that not everything is provable. Recently, using similar ideas, S. Ben-David, P. Hrubes, S. Moran, A. Shpilka and A. Yehudayoff showed that learnability can be undecidable [BHMSY2019]. Specifically, they consider a learning problem called estimating the maximum, formulated as follows. Given a family F of subsets of some domain X, find a set in F whose measure w.r.t. an unknown probability distribution P is close to maximal. This should be done based on a finite sample generated i.i.d. from P. They show that the standard axioms of mathematics cannot be used to prove that we can solve this problem, nor can they be used to prove that we cannot solve this problem. In other words, they show that the learnability statement is independent of the ZFC axioms (Zermelo-Fraenkel set theory with the axiom of choice).

7.5. Conclusion. In this lecture we used several complexities of a learning model (Z, H, L, P) for obtaining upper bounds on the sample complexity of the ERM algorithm Aerm. Among them, the Rademacher complexities are the most sophisticated ones: they measure the capacity of a hypothesis class on a specific sample, which can be used to bound the difference between the empirical and the expected error, and thus the excess generalization error of empirical risk minimization. To find an ideal hypothesis class H for a learning problem we have to take into account the error decomposition of a learning model and the resulting bias-variance trade-off, and use empirical cross-validation methods. Finally, the learnability of a learning problem can be undecidable.

8. Support vector machines

In this lecture we shall consider a class of simple supervised learning machines for binary classification problems and apply results of the previous lectures on consistent learning algorithms. Our learning machines are (V × Z2, Hlin(V), L, P(V × Z2), A), where V is a real Hilbert space, Hlin(V) consists of linear classifiers on V, defined below, L is the (0−1) loss function (resp. a regularized (0−1) loss function) and A is a hard SVM algorithm (resp. a soft SVM algorithm), which we shall learn in today's lecture. The original SVM algorithm is the hard SVM algorithm, which was invented by Vapnik and Chervonenkis in 1963. The current standard incarnation (soft margin) was proposed by Cortes and Vapnik in 1993 and published in 1995.


8.1. Linear classifier and hard SVM. For (w, b) ∈ V × R we set

(8.1) f(w,b)(x) := 〈w, x〉 + b.

Definition 8.1. A linear classifier is a function sign f(w,b) : V → Z2, x 7→ sign f(w,b)(x) ∈ {−1, 1} = Z2.

We identify each linear classifier with the half space H^+_{(w,b)} := (sign f(w,b))^{−1}(1) = f^{−1}_{(w,b)}(R≥0) ⊂ V and set H(w,b) := f^{−1}_{(w,b)}(0) ⊂ V. Note that each hyperplane H(w,b) ⊂ V defines H^+_{(w,b)} up to a reflection of V around H(w,b), and therefore defines the affine function f(w,b) up to a multiplicative factor λ ∈ R∗. Let

HA(V) := {H(w,b) ⊂ V | (w, b) ∈ V × R}

be the set of all hyperplanes in the affine space V. Then Hlin(V) is a double cover of HA(V) with the natural projection π : Hlin(V) → HA(V) defined above.

Definition 8.2. A training sample S = ((x1, y1), · · · , (xm, ym)) ∈ (V × {±1})^m is called separable if there is a half space H^+_{(w,b)} ⊂ V that correctly classifies S, i.e., for all i ∈ [1, m] we have xi ∈ H^+_{(w,b)} iff yi = 1. In other words, the linear classifier sign f(w,b) is a minimizer of the empirical risk function R(0−1)S : Hlin(V) → R associated to the 0-1 loss function L(0−1).

Remark 8.3. (1) A half space H^+_{(w,b)} correctly classifies S if and only if the empirical risk R(0−1)S(sign f(w,b)) = 0, where sign f(w,b) is the linear classifier associated to H^+_{(w,b)}.

(2) Write S = S+ ∪ S−, where

S± := {(x, y) ∈ S | y = ±1}.

Let Pr : (V × {±1})^m → V^m denote the canonical projection. Then S is separable if and only if there exists a hyperplane H(w,b) that separates [Pr(S+)] and [Pr(S−)], where we recall that [(x1, · · · , xm)] = ∪_{i=1}^m {xi} ⊂ V. In this case we say that H(w,b) correctly separates S.

(3) If a training sample S is separable then the separating hyperplane is not unique, and hence there are many minimizers of the empirical risk function R(0−1)S. Thus, given S, we need a strategy for selecting one of these ERM's, or equivalently for selecting a separating hyperplane H(w,b), since the associated half space H^+_{(w,b)} is determined by H(w,b) and any training value (xi, yi). The standard approach in the SVM framework is to choose the H(w,b) that maximizes the distance to the closest points xi ∈ [Pr(S)]. This approach is called the hard SVM rule. To formulate the hard SVM rule we need a formula for the distance of a point to a hyperplane H(w,b).

Lemma 8.4 (Distance to a hyperplane). Let V be a real Hilbert space and H(w,b) := {z ∈ V | 〈z, w〉 + b = 0}. The distance of a point x ∈ V to H(w,b)


is given by

(8.2) ρ(x, H(w,b)) := inf_{z∈H(w,b)} ||x − z|| = |〈x, w〉 + b| / ||w||.

Proof. Since H(w,b) = H(λw,λb) for all λ > 0, it suffices to prove (8.2) in the case ||w|| = 1, and hence we may assume that w = e1. Now formula (8.2) follows immediately, noting that H(e1,b) = H(e1,0) − be1.

Let H(w,b) separate S = ((x1, y1), · · · , (xm, ym)) correctly. Then we have

yi = sign(〈xi, w〉 + b), and hence |〈xi, w〉 + b| = yi(〈xi, w〉 + b).

Hence, by Lemma 8.4, the distance between H(w,b) and S is

(8.3) ρ(S, H(w,b)) := min_i ρ(xi, H(w,b)) = min_i yi(〈xi, w〉 + b) / ||w||.

The distance ρ(S, H(w,b)) is also called the margin of the hyperplane H(w,b) w.r.t. S. The hyperplanes that are parallel to the separating hyperplane and pass through the closest points on the negative or positive side are called marginal.

Set

HS := {H(w,b) ∈ π(Hlin) = HA(V) | H(w,b) separates S correctly},

(8.4) A∗hs(S) := arg max_{H(w,b)′∈HS} ρ(S, H(w,b)′),

Ahs : ∪_m (V × Z2)^m → Hlin, Ahs(S) ∈ π^{−1}(A∗hs(S)) and Ahs(S) ∩ {(x, −1) | x ∈ V} = ∅.

Definition 8.5. A hard SVM is a learning machine (V × Z2, Hlin(V), L(0−1), P(V × Z2), Ahs).

The domain of the optimization problem in (8.4) is HS, which is not easy to determine. So we replace this problem by another optimization problem over a larger convex domain as follows.

Lemma 8.6. For S = ((x1, y1), · · · , (xm, ym)) we have

(8.5) Ahs(S) = H^+_{(w,b)}, where (w, b) = arg max_{(w,b): ||w||≤1} min_i yi(〈w, xi〉 + b).

Proof. If H(w,b) separates S then, by (8.3), ρ(S, H(w,b)) = min_i yi(〈w, xi〉 + b)/||w||; in particular, for ||w|| = 1 this equals min_i yi(〈w, xi〉 + b). Since the constraint ||w|| ≤ 1 does not affect the hyperplane H(w,b), which is invariant under positive rescaling of (w, b), (8.3) implies that

(8.6) max_{(w,b): ||w||≤1} min_i yi(〈w, xi〉 + b) ≥ max_{H(w,b)′∈HS} ρ(S, H(w,b)′).

Next we observe that if H(w,b) ∉ HS then

min_i yi(〈w, xi〉 + b) < 0.


Combining this with (8.6) we obtain

max_{(w,b): ||w||≤1} min_i yi(〈w, xi〉 + b) = max_{(w,b): H(w,b)∈HS, ||w||≤1} min_i yi(〈w, xi〉 + b).

This completes the proof of Lemma 8.6.

A solution Ahs(S) of equation (8.5) maximizes the numerator of the far RHS of (8.3) under the constraint ||w|| ≤ 1. In the proposition below we shall show that Ahs(S) can be found as a solution of the dual optimization problem of minimizing the denominator of the RHS of (8.3) under the constraint that the numerator of the far RHS of (8.3) is fixed.

Proposition 8.7. A solution of the following optimization problem, which is called Hard-SVM,

(8.7) (w0, b0) = argmin_{(w,b)} { ||w||^2 : yi(〈w, xi〉 + b) ≥ 1 for all i }

produces a solution (w, b) := (w0/||w0||, b0/||w0||) of the optimization problem (8.5).

Proof. Let (w0, b0) be a solution of (8.7). We shall show that (w0/||w0||, b0/||w0||) is a solution of (8.5). It suffices to show that the margin of the hyperplane H(w0,b0) is greater than or equal to the margin of the hyperplane associated to a (and hence any) solution of (8.5).

Let (w∗, b∗) be a solution of Equation (8.5). Set

γ∗ := min_i yi(〈w∗, xi〉 + b∗),

which is the margin of the hyperplane H(w∗,b∗) by (8.3). Therefore for all i we have

yi(〈w∗, xi〉 + b∗) ≥ γ∗,

or equivalently

yi(〈w∗/γ∗, xi〉 + b∗/γ∗) ≥ 1.

Hence the pair (w∗/γ∗, b∗/γ∗) satisfies the condition of the quadratic optimization problem (8.7). It follows that

||w0|| ≤ ||w∗/γ∗|| = 1/γ∗.

Hence for all i we have

yi(〈w0/||w0||, xi〉 + b0/||w0||) = yi(〈w0, xi〉 + b0)/||w0|| ≥ 1/||w0|| ≥ γ∗.

This implies that the margin of H(w0,b0) satisfies the required condition. This completes the proof of Proposition 8.7.
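The problem (8.7) is a quadratic program and can be handed to a generic constrained solver. The following Python sketch (using scipy's SLSQP routine on a separable toy data set; both the data and the solver choice are illustrative assumptions, not the method of the cited references) recovers (w0, b0) and the margin 1/||w0||.

import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(4)

# Separable toy data in R^2: two well-separated Gaussian clouds.
X_pos = rng.normal([2, 2], 0.5, size=(20, 2))
X_neg = rng.normal([-2, -2], 0.5, size=(20, 2))
X = np.vstack([X_pos, X_neg])
y = np.array([1.0] * 20 + [-1.0] * 20)

# Hard-SVM (8.7): minimize ||w||^2 subject to y_i (<w, x_i> + b) >= 1 for all i.
def objective(wb):
    w = wb[:2]
    return w @ w

constraints = [{"type": "ineq",
                "fun": lambda wb, xi=xi, yi=yi: yi * (wb[:2] @ xi + wb[2]) - 1.0}
               for xi, yi in zip(X, y)]

res = minimize(objective, x0=np.zeros(3), constraints=constraints, method="SLSQP")
w0, b0 = res.x[:2], res.x[2]
margin = 1.0 / np.linalg.norm(w0)          # margin of the separating hyperplane
print("w0 =", w0, "b0 =", b0, "margin =", margin)
print("all constraints satisfied:", np.all(y * (X @ w0 + b0) >= 1 - 1e-6))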

Remark 8.8. (1) The optimization problem (8.7) is a specific instance of quadratic programming (QP), a family of problems extensively studied in optimization. A variety of commercial and open-source solvers are available


for solving convex QP problems. It is well known that there is a unique solution of (8.7).

(2) In practice, when the sample set S is large, S is typically not separable, and thus the applicability of the hard SVM is limited.

Exercise 8.9. (1) Show that the vector w0 of the solution (w0, b0) in (8.7) of the SVM problem is a linear combination of the training set vectors x1, · · · , xm.

(2) Show that any xi appearing with a nonzero coefficient in the linear expansion of w0 in (1) lies on one of the marginal hyperplanes 〈w0, x〉 + b0 = ±1.

A vector xi that appears in the linear expansion of the weight vector w0 in Exercise 8.9 is called a support vector.

8.2. Soft SVM. Now we consider the case when the sample set S is not separable. There are at least two possibilities to overcome this difficulty. The first one is to find a nonlinear embedding of patterns into a higher-dimensional space. To realize this approach we use a kernel trick that embeds the patterns into an infinite dimensional Hilbert space, which we shall learn in the next lecture. The second way is to seek a predictor sign f(w,b) such that H(w,b) = f^{−1}_{(w,b)}(0) still has maximal margin in some sense. More precisely, we shall relax the hard SVM rule (8.7) by replacing the constraint

(8.8) yi(〈w, xi〉 + b) ≥ 1

by the relaxed constraint

(8.9) yi(〈w, xi〉 + b) ≥ 1 − ξi,

where ξi ≥ 0 are called the slack variables. Slack variables are commonly used in optimization to define relaxed versions of constraints. In our case the slack variable ξi measures the distance by which the vector xi violates the original inequality (8.8).

The relaxed hard SVM rule is called the soft SVM rule.

Definition 8.10. The soft SVM algorithm Ass : (R^d × Z2)^m → Hlin(R^d) with slack variables ξ ∈ R^m_{≥0} is defined as follows:

Ass(S) = sign f(w0,b0),

where (w0, b0) (depending on S) satisfies the following equation with ξ = (ξ1, · · · , ξm):

(8.10) (w0, b0, ξ) = argmin_{w,b,ξ} ( λ||w||^2 + (1/m) ||ξ||_{l1} )
(8.11) subject to: for all i, yi(〈w, xi〉 + b) ≥ 1 − ξi and ξi ≥ 0.

In what follows we shall show that the soft SVM algorithm Ass is a solution of a regularized loss minimization rule, which is a refinement of the ERM rule.

Digression. Regularized Loss Minimization (RLM) is a learning algorithm on a learning model (Z, H, L, P) in which we jointly minimize the empirical


risk RLS and a regularization function. Formally, a regularization function is a mapping R : H → R, and the regularized loss minimization rule is a map Arlm : Zn → H such that Arlm(S) is a minimizer of the empirical regularized loss function RLS + R : H → R.

As the ERM algorithm works under certain conditions, the RLM algorithm also works under certain conditions; see e.g. [SSBD2014, Chapter 13] for a detailed discussion.

The loss function for the soft SVM learning machine is the hinge loss function Lhinge : Hlin(V) × (V × {±1}) → R defined as follows:

(8.12) Lhinge(h(w,b), (x, y)) := max{0, 1 − y(〈w, x〉 + b)}.

Hence the empirical hinge risk function is defined, for S = ((x1, y1), · · · , (xm, ym)), by

RhingeS(h(w,b)) = (1/m) Σ_{i=1}^m max{0, 1 − yi(〈w, xi〉 + b)}.

Lemma 8.11. Equation (8.10) with the constraint (8.11) for Ass is equivalent to the following regularized risk minimization problem, which does not involve the slack variables ξ:

(8.13) Ass(S) = arg min_{h(w,b)∈Hlin} ( λ||w||^2 + RhingeS(h(w,b)) ).

Proof. Let us fix (w, b) and minimize the RHS of (8.10) over ξ under the constraint (8.11). It is straightforward to see that the optimal choice is ξi = Lhinge(h(w,b), (xi, yi)). Using this and comparing (8.10) with (8.13), we complete the proof of Lemma 8.11.

From Lemma 8.11 we obtain immediately the following

Corollary 8.12 (Definition). A soft SVM is a learning machine (V × Z2, Hlin, Lhinge, P(V × Z2), Arlm).

Remark 8.13. The hinge loss function Lhinge enjoys several good properties that justify preferring Lhinge as a loss function over the zero-one loss function L(0−1); see [SSBD2014, Subsection 12.3, p. 167] for a discussion.
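Because of Lemma 8.11, the soft SVM can be trained by directly minimizing the regularized hinge risk (8.13). The following Python sketch performs a plain sub-gradient descent on that objective; the step-size schedule, the regularization weight and the overlapping toy data are illustrative assumptions, not a tuned implementation.

import numpy as np

rng = np.random.default_rng(5)

def soft_svm_subgradient(X, y, lam=0.01, epochs=1000):
    # Sub-gradient descent on the regularized hinge risk (8.13):
    #   lam * ||w||^2 + (1/m) * sum_i max(0, 1 - y_i (<w, x_i> + b)).
    m, d = X.shape
    w, b = np.zeros(d), 0.0
    for t in range(1, epochs + 1):
        eta = 1.0 / np.sqrt(t)                      # decreasing step size (a common heuristic)
        margins = y * (X @ w + b)
        active = margins < 1                        # points with non-zero hinge loss
        grad_w = 2 * lam * w - (y[active, None] * X[active]).sum(axis=0) / m
        grad_b = -y[active].sum() / m
        w, b = w - eta * grad_w, b - eta * grad_b
    return w, b

# Overlapping (non-separable) toy data.
X = np.vstack([rng.normal([1, 1], 1.0, size=(50, 2)),
               rng.normal([-1, -1], 1.0, size=(50, 2))])
y = np.array([1.0] * 50 + [-1.0] * 50)
w, b = soft_svm_subgradient(X, y)
print("training zero-one error:", np.mean(np.sign(X @ w + b) != y))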

8.3. Sample complexities of SVM.

Exercise 8.14. Prove that V C dim(Hlin(V)) = dim V + 1.

Hint. To show that some set Sd+1 of d + 1 points in R^d can be shattered by Hlin(V), it suffices to exhibit, for every subset of Sd+1, a half space separating it from its complement; the vertices of a d-dimensional simplex will do. To show that no set Sd+2 of d + 2 points in R^d can be shattered by Hlin(V), one may use Radon's theorem: Sd+2 can be partitioned into two subsets that cannot be separated by a hyperplane.


From the Fundamental Theorem of binary classification (Theorem 6.16) and Exercise 8.14 we obtain immediately the following

Proposition 8.15. The binary classification problem (V × Z2, Hlin, L(0−1), P(V × Z2)) has a uniformly consistent learning algorithm if and only if V is finite dimensional.

In what follows we shall show the uniform consistency of the hard SVM and the soft SVM after replacing the statistical model P(V × Z2) by a smaller class P(γ,ρ)(V × Z2), respectively Pρ(V × Z2), to be defined below.

Definition 8.16. ([SSBD2014, Definition 15.3]) Let µ ∈ P(V × Z2). We say that µ is separable with a (γ, ρ)-margin if there exists (w∗, b∗) ∈ V × R such that ||w∗|| = 1 and such that

µ{(x, y) ∈ V × Z2 | y(〈w∗, x〉 + b∗) ≥ γ and ||x|| ≤ ρ} = 1.

Similarly, we say that µ is separable with a (γ, ρ)-margin using a homogeneous half space if the preceding holds with a half space defined by a vector (w∗, 0).

Definition 8.16 means that the set of labeled pairs (x, y) ∈ V × Z2 satisfying the condition

y(〈w∗, x〉 + b∗) ≥ γ and ||x|| ≤ ρ

has full µ-measure, where µ ∈ P(V × Z2) is a measure separable with (γ, ρ)-margin.

Remark 8.17. (1) We can regard an affine function f(w,b) : V → R as a linear function fw′ : V′ → R, where V′ = 〈e1〉_R ⊕ V, i.e., we incorporate the bias term b of f(w,b) in (8.1) into w as an extra coordinate. More precisely, we set

w′ := be1 + w and x′ := e1 + x.

Then

f(w,b)(x) = fw′(x′).

Note that the natural projection of the zero set f^{−1}_{w′}(0) ⊂ V′ to V is the zero set H(w,b) of f(w,b).

(2) By Remark 8.17 (1), we can always assume that a separable measure with (γ, ρ)-margin is one that uses a homogeneous half space, by enlarging the instance space V.

Denote by P(γ,ρ)(V × Z2) the subset of P(V × Z2) consisting of separable measures with a (γ, ρ)-margin using a homogeneous half space. Using the Rademacher complexity, see [SSBD2014, Theorem 26.13, p. 384], we have the following estimate of the sample complexity of the learning machine (V × Z2, Hlin, L(0−1), P(γ,ρ)(V × Z2), Ahs).


Theorem 8.18. ([SSBD2014, Theorem 15.4, p. 206]) Let µ ∈ P(γ,ρ)(V × Z2). Then we have

µ^m{S ∈ (V × Z2)^m | R(0−1)µ(Ahs(S)) ≤ √(4(ρ/γ)^2/m) + √(2 log(2/δ)/m)} ≥ 1 − δ.

Denote by Pρ(V × Z2) the set of probability measures on V × Z2 whose support lies in B(0, ρ) × Z2, where B(0, ρ) is the ball of radius ρ centered at the origin of V. Now we shall examine the sample complexity of the soft SVM learning machine (V × Z2, Hlin, Lhinge, Pρ(V × Z2), Ass). The following theorem is a consequence of Lemma 8.11 and a general result on the sample complexity of RLM under certain conditions, see [SSBD2014, Corollary 13.8, p. 179].

Theorem 8.19. ([SSBD2014, Corollary 15.7, p. 208]) Let µ ∈ Pρ(V × Z2). Then for every r > 0 we have

(8.14) Eµm( R(0−1)µ(Ass(S)) ) ≤ Eµm( Rhingeµ(Ass(S)) ) ≤ min_{w∈B(0,r)} Rhingeµ(hw) + √(8ρ^2 r^2/m).

Exercise 8.20. Using the Markov inequality, derive from Theorem 8.19 an upper bound for the sample complexity of the soft SVM.

Theorem 8.19 and Exercise 8.20 imply that we can control the sample complexity of the soft SVM algorithm in terms of the radius ρ of the support of µ and the norm bound r, independently of the dimension of the underlying Hilbert space V. This becomes highly significant when we learn classifiers h : V → Z2 via embeddings into high dimensional feature spaces.

8.4. Conclusion. In this section we considered two classes of learning machines for binary classification. The first class consists of learning models (V × Z2, Hlin, L(0−1), P(V × Z2)), which have finite VC-dimension if and only if V is finite dimensional. If µ ∈ P(V × Z2) is separable with a (γ, ρ)-margin then we can upper bound the sample complexity of the hard SVM algorithm Ahs for (V × Z2, Hlin, L(0−1), µ) using the ratio ρ/γ and the Rademacher complexity. The second class consists of learning machines (V × Z2, Hlin, Lhinge, Pρ(V × Z2), Ass). The soft SVM algorithm Ass is a solution of a regularized ERM problem, and therefore general results on the sample complexity of regularized ERM algorithms apply.

9. Kernel based SVMs

In the previous lecture we considered the hypothesis class Hlin of linear classifiers. A linear classifier sign f(w,b) correctly classifies a training sample S ∈ (V × {±1})^m iff the zero set H(w,b) of f(w,b) separates the subsets [Pr(S−)] and [Pr(S+)]. By Radon's theorem, any set of distinct (d + 2)


points in R^d can be partitioned into two subsets that cannot be separated by a hyperplane in R^d. Thus it is reasonable to enlarge the hypothesis class Hlin by adding polynomial classifiers, or to enlarge the dimension d of the Euclidean space R^d. However, learning by a polynomial embedding x 7→ (x, f(x)), x ∈ R^d, may be computationally expensive. In this lecture we shall learn the kernel based SVMs, which implement the idea of enlarging the space R^d containing the sample points x1, · · · , xm. The term "kernel" is used in this context to describe inner products in the enlarged feature space. Namely, we are interested in classifiers of the form

sign h : X → Z2, h(x) := 〈h, ψ(x)〉,

where h is an element of a Hilbert space W, ψ : X → W is a "feature map" and the kernel function Kψ : X × X → R is defined by

(9.1) Kψ(x, y) := 〈ψ(x), ψ(y)〉.

We shall see that to solve the hard SVM optimization problem (8.7) for h ∈ W it suffices to know Kψ. This kernel trick requires less computational effort than learning ψ : X → W directly.

9.1. Kernel trick. It is known that a solution of the hard SVM equation can be expressed as a linear combination of support vectors (Exercise 8.9). If the number of support vectors is less than the dimension of the instance space, then this expressibility as a linear combination of support vectors simplifies the search for a solution of the hard SVM equation. Below we shall show that this expressibility is a consequence of the Representer Theorem concerning solutions of a special optimization problem. The optimization problem we are interested in is of the following form:

(9.2) w0 = arg min_{w∈W} ( f(〈w, ψ(x1)〉, · · · , 〈w, ψ(xm)〉) + R(‖w‖) ),

where xi ∈ X, w and ψ(xi) are elements of a Hilbert space W, f : R^m → R is an arbitrary function and R : R+ → R is a monotonically non-decreasing function. The map ψ : X → W is often called the feature map, and W is called the feature space.

The following examples show that the optimization problems for the hard and soft SVM algorithms are instances of the optimization problem (9.2).

Example 9.1. Let S = ((x1, y1), · · · , (xm, ym)) ∈ (V × Z2)^m.
(1) Plugging into Equation (9.2)

R(a) := a^2,
f(a1, · · · , am) := 0 if yi ai ≥ 1 for all i, and ∞ otherwise,

we obtain Equation (8.7) of a hard SVM for homogeneous vectors (w, 0), replacing ai by 〈w, xi〉. The general case of non-homogeneous solutions (w, b) is reduced to the homogeneous case using Remark 8.17.


(2) Plugging into Equation (9.2)

R(a) := λa^2,
f(a1, · · · , am) := (1/m) Σ_{i=1}^m max{0, 1 − yi ai},

we obtain Equation (8.13) of a soft SVM for a homogeneous solution Ass(S), identifying Ass(S) with its parameter (w, 0), S with ((x1, y1), · · · , (xm, ym)), and replacing ai with 〈w, xi〉.

Theorem 9.2 (Representer Theorem). Let ψ : X → W be a feature map from an instance space X to a Hilbert space W and let w0 be a solution of (9.2). Then the projection of w0 to the subspace of W spanned by ψ(x1), · · · , ψ(xm) is also a solution of (9.2).

Proof. Assume that w0 is a solution of (9.2). Then we can write

w0 = Σ_{i=1}^m αi ψ(xi) + u,

where 〈u, ψ(xi)〉 = 0 for all i. Set w̄0 := w0 − u. Then

(9.3) ‖w̄0‖ ≤ ‖w0‖,

and since 〈w̄0, ψ(xi)〉 = 〈w0, ψ(xi)〉 we have

(9.4) f(〈w̄0, ψ(x1)〉, · · · , 〈w̄0, ψ(xm)〉) = f(〈w0, ψ(x1)〉, · · · , 〈w0, ψ(xm)〉).

From (9.3), (9.4) and the monotonicity of R we conclude that w̄0 is also a solution of (9.2). This completes the proof of Theorem 9.2.

The Representer Theorem suggests that we can restrict the search for a solution of Equation (9.2) to the finite dimensional subspace W1 ⊂ W generated by the vectors ψ(x1), · · · , ψ(xm). In what follows we describe a method for solving the minimization problem (9.2) on W1, which is called the kernel trick.

Let
• K : X × X → R, K(x, x′) := 〈ψ(x), ψ(x′)〉, be the kernel function,
• G = (Gij) := (K(xi, xj)) be the Gram matrix,
• w0 = Σ_{j=1}^m αj ψ(xj) ∈ W1 be a solution of Equation (9.2).

Then α = (α1, · · · , αm) is a solution of the following minimization problem:

(9.5) arg min_{α∈R^m} ( f( Σ_{j=1}^m αj Gj1, · · · , Σ_{j=1}^m αj Gjm ) + R( √(Σ_{i,j=1}^m αi αj Gij) ) ).


Recall that the solution w0 = Σ_{j=1}^m αj ψ(xj) of the hard (resp. soft) SVM optimization problem, where (α1, · · · , αm) is a solution of (9.5), produces a "nonlinear" classifier ŵ0 : X → Z2 associated to w0 as follows:

ŵ0(x) := sign w̃0(x),

where

(9.6) w̃0(x) := 〈w0, ψ(x)〉 = Σ_{j=1}^m αj 〈ψ(xj), ψ(x)〉 = Σ_{j=1}^m αj K(xj, x).

To compute (9.6) we need to know only the kernel function K; neither the mapping ψ nor the inner product 〈 , 〉 on the Hilbert space W is needed.
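The following Python sketch illustrates this point. For concreteness it solves an instance of (9.5) with a squared loss and R(a) = λa^2 (i.e., kernel ridge regression rather than the hinge loss of the soft SVM), because that instance has a closed-form solution; the Gaussian kernel, the toy data and the parameter values are assumptions made for the illustration. The resulting classifier has exactly the form (9.6) and uses only kernel evaluations.

import numpy as np

rng = np.random.default_rng(6)

def gaussian_kernel(A, B, gamma=1.0):
    # K(x, y) = exp(-gamma/2 * ||x - y||^2) for all pairs of rows of A and B.
    sq = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-0.5 * gamma * sq)

# Toy data that is not linearly separable in R^2: a circular label rule.
X = rng.uniform(-2, 2, size=(200, 2))
y = np.where((X ** 2).sum(axis=1) < 1.5, 1.0, -1.0)

# Instance of (9.5) with squared loss and R(a) = lam * a^2 (kernel ridge):
#   alpha = argmin ||G alpha - y||^2 + lam * alpha^T G alpha,
# which gives (G + lam I) alpha = y when G is invertible (true for a Gaussian
# kernel on distinct points).
G = gaussian_kernel(X, X)
lam = 0.1
alpha = np.linalg.solve(G + lam * np.eye(len(y)), y)

def predict(x_new):
    # The classifier (9.6): only kernel evaluations K(x_j, x) are needed, never psi.
    return np.sign(gaussian_kernel(x_new, X) @ alpha)

print("training error:", np.mean(predict(X) != y))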

This motivates the following question.

Problem 9.3. Find a sufficient and necessary condition on a kernel function (also simply called a kernel) K : X × X → R for K to be representable as K(x, x′) = 〈ψ(x), ψ(x′)〉 for some feature map ψ : X → W, where W is a real Hilbert space.

Definition 9.4. If K satisfies the condition in Problem 9.3 we shall say that K is generated by the (feature) map ψ. The target Hilbert space W is also called a feature space.

9.2. PSD kernels and reproducing kernel Hilbert spaces.

9.2.1. Positive semi-definite kernel.

Definition 9.5. Let X be an arbitrary set. A map K : X × X → R is called a positive semi-definite kernel (PSD kernel) iff for all x1, · · · , xm ∈ X the Gram matrix Gij := K(xi, xj) is positive semi-definite.

Theorem 9.6. A kernel K : X × X → R is induced by a feature map to a Hilbert space if and only if it is positive semi-definite.

Proof. 1) First let us prove the "only if" assertion of Theorem 9.6. Assume that K(x, x′) = 〈ψ(x), ψ(x′)〉 for a map ψ : X → W, where W is a Hilbert space. Given m points x1, · · · , xm ∈ X, consider the subspace Wm ⊂ W generated by ψ(x1), · · · , ψ(xm). Using the positive semi-definiteness of the inner product on Wm, we conclude that the Gram matrix Gij = K(xi, xj) is positive semi-definite. This proves the "only if" part of Theorem 9.6.

2) Now let us prove the "if" part. Assume that K : X × X → R is positive semi-definite. For each x ∈ X let Kx ∈ R^X be the function defined by

Kx(y) := K(x, y).

Denote by

W := {f ∈ R^X | f = Σ_{i=1}^{N(f)} ai Kxi, ai ∈ R, N(f) < ∞}.


Then W is equipped with the following inner product:

〈Σ_i αi Kxi, Σ_j βj Kyj〉 := Σ_{i,j} αi βj K(xi, yj).

The PSD property of K implies that this inner product is positive semi-definite, i.e.,

〈Σ_i αi Kxi, Σ_i αi Kxi〉 ≥ 0.

Since the inner product is positive semi-definite, the Cauchy-Schwarz inequality implies, for f ∈ W and x ∈ X,

(9.7) 〈f, Kx〉^2 ≤ 〈f, f〉〈Kx, Kx〉.

Since for all x, y we have Ky(x) = K(y, x) = 〈Ky, Kx〉, it follows that for all f ∈ W we have

(9.8) f(x) = 〈f, Kx〉.

Using (9.8), we obtain from (9.7), for all x ∈ X,

|f(x)|^2 ≤ 〈f, f〉 K(x, x).

This proves that the inner product on W is positive definite. Hence W is a pre-Hilbert space. Let H be the completion of W. The map x 7→ Kx is the desired feature map from X to H. This completes the proof of Theorem 9.6.

Example 9.7. (1) (Polynomial kernels). Assume that P is a polynomial in one variable with non-negative coefficients. Then the polynomial kernel K : R^d × R^d → R, (x, y) 7→ P(〈x, y〉), is a PSD kernel. This follows from the observations that if Ki : X × X → R, i = 1, 2, are PSD kernels then (K1 + K2)(x, y) := K1(x, y) + K2(x, y) and (K1 · K2)(x, y) := K1(x, y) · K2(x, y) are PSD kernels. In particular, K(x, y) := (1 + 〈x, y〉)^2 is a PSD kernel.

(2) (Exponential kernel). For any γ > 0 the kernel K(x, y) := exp(γ〈x, y〉) is a PSD kernel, since it is the limit of polynomials in 〈x, y〉 with non-negative coefficients.
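The defining property of a PSD kernel can be probed numerically: for randomly chosen points, the Gram matrix of a PSD kernel must have no (significantly) negative eigenvalues. The Python sketch below checks this for the two kernels of Example 9.7 and, for contrast, for a function that is not a PSD kernel; the sample points are random and the comparison is only up to numerical round-off.

import numpy as np

rng = np.random.default_rng(7)
X = rng.standard_normal((30, 3))

def min_gram_eigenvalue(kernel):
    # A PSD kernel must give a positive semi-definite Gram matrix for any points,
    # i.e., its smallest eigenvalue should be >= 0 (up to numerical round-off).
    G = np.array([[kernel(x, y) for y in X] for x in X])
    return np.linalg.eigvalsh(G).min()

poly   = lambda x, y: (1.0 + x @ y) ** 2             # polynomial kernel of Example 9.7 (1)
expk   = lambda x, y: np.exp(0.5 * (x @ y))          # exponential kernel with gamma = 1/2
nonpsd = lambda x, y: -np.linalg.norm(x - y)         # not a PSD kernel (for contrast)

for name, k in [("poly", poly), ("exp", expk), ("non-PSD", nonpsd)]:
    print(name, "min Gram eigenvalue:", round(float(min_gram_eigenvalue(k)), 6))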

Exercise 9.8. (1) Show that the Gaussian kernel K(x, y) := exp(−(γ/2)||x − y||^2) is a PSD kernel.

(2) Let X = B(0, 1) be the open ball of radius 1 centered at the origin 0 ∈ R^d. Show that K(x, y) := (1 − 〈x, y〉)^{−p} is a PSD kernel for any p ∈ N+.

9.2.2. Reproducing kernel Hilbert space. Given a PSD kernel K : X × X → R, there exist many feature maps ϕ : X → W generating K. Indeed, if K(x, x′) = 〈ϕ(x), ϕ(x′)〉, then also K(x, x′) = 〈e ∘ ϕ(x), e ∘ ϕ(x′)〉 for any isometric embedding e : W → W′. However, there is a canonical choice for the feature space W, the so-called reproducing kernel Hilbert space.


Definition 9.9 (Reproducing kernel Hilbert space). Let X be an instance set and H ⊂ R^X a real Hilbert space of functions on X with the unique vector space structure such that for each x ∈ X the evaluation map

evx : H → R, evx(f) := f(x),

is a linear map.^{17} Then H is called a reproducing kernel Hilbert space (RKHS) on X if for all x ∈ X the linear map evx is bounded, i.e.,

sup_{f∈B(0,1)⊂H} evx(f) < ∞.

Remark 9.10. Let H be a RKHS on X and x ∈ X. Since evx is bounded, by the Riesz representation theorem there is a function kx ∈ H such that f(x) = 〈f, kx〉 for all f ∈ H. Then the kernel

K(x, y) := 〈kx, ky〉

is a PSD kernel; K is called the reproducing kernel of H.

Thus every RKHS H on X produces a PSD kernel K : X × X → R. Conversely, Theorem 9.11 below asserts that every PSD kernel reproduces a RKHS H.

Theorem 9.11. Let K : X × X → R be a PSD kernel. Then there exists a unique RKHS H(K) such that K is the reproducing kernel of H(K).

Proof. Given a PSD kernel K : X × X → R, for any x ∈ X let us define a function Kx on X by

Kx(y) := K(x, y) for all y ∈ X.

Let us denote by H0(K) the pre-Hilbert space whose elements are finite linear combinations of the functions Kx, with the inner product defined on the generating elements Kx by

(9.9) 〈Kx, Ky〉 := K(x, y) for all x, y ∈ X.

From (9.9) we see that evx(f) = 〈f, Kx〉 for every f ∈ H0(K); hence, by the Cauchy-Schwarz inequality, the evaluation map evx : H0(K) → R is a bounded linear map with

||evx|| ≤ ||Kx|| = √(K(x, x)).

Hence the evaluation map evx extends to a bounded linear map on the completion H(K) of H0(K). It follows that H(K) is a RKHS.

To show the uniqueness of a RKHS H such that K is the reproducing kernel of H, assume that there exists another RKHS H′ such that for all x, y ∈ X there exist kx, ky ∈ H′ with the following properties:

K(x, y) = 〈kx, ky〉 and f(x) = 〈f, kx〉 for all f ∈ H′.

We define a map g : H(K) → H′ by setting g(Kx) := kx. It is not hard to see that g is an isometric embedding. To show that g extends to an isometry

^{17} In other words, the vector space structure on H is induced from the vector space structure on R via the evaluation maps.


it suffices to show that the span of {kx | x ∈ X} is dense in H′. Assume the opposite; then there exists a nonzero f ∈ H′ such that 〈f, kx〉 = 0 for all x. But this implies that f(x) = 0 for all x and hence f = 0, a contradiction. This completes the proof of Theorem 9.11.

9.3. Kernel based SVMs and their generalization ability.

9.3.1. Kernel based SVMs. Let K : X × X → R be a PSD kernel. Denote by H(K) the RKHS of functions on X that reproduces K. Each function h ∈ H(K) defines a binary classifier

sign h : X → Z2.

Denote by Klin the set of all binary classifiers sign h with h ∈ H(K). Using the Representer Theorem 9.2 and Example 9.1 (1), we replace the algorithm Ahs of a hard SVM by a kernel based algorithm.

Definition 9.12. A kernel based hard SVM is a learning machine (X × Z2, Klin, L(0−1), P(X × Z2), Ahk), where for S = ((x1, y1), · · · , (xm, ym)) ∈ (X × Z2)^m and x ∈ X we have

Ahk(S)(x) = sign Σ_{j=1}^m αj K(xj, x) ∈ Z2.

Here α := (α1, · · · , αm) is the solution of the following optimization problem:

(9.10) α = argmin_{α∈R^m} ( f( Σ_{j=1}^m αj K(xj, x1), · · · , Σ_{j=1}^m αj K(xj, xm) ) + R( √(Σ_{i,j=1}^m αi αj K(xi, xj)) ) ),

where R and f are defined in Example 9.1 (1).

Similarly, using the Representer Theorem 9.2 and Example 9.1 (2), we replace the algorithm Ass of a soft SVM by a kernel based algorithm.

Definition 9.13. A kernel based soft SVM is a learning machine (X × Z2, Klin, L^hinge, P(X × Z2), Ask), where for S = ((x1, y1), · · · , (xm, ym)) ∈ (X × Z2)^m and x ∈ X we have

Ask(S)(x) = sign( ∑_{j=1}^m αj K(xj, x) ) ∈ Z2.

Here α := (α1, · · · , αm) is the solution of the following optimization problem:

(9.11) α = argmin( f( ∑_{j=1}^m αj K(xj, x1), · · · , ∑_{j=1}^m αj K(xj, xm) ) + R( √(∑_{i,j=1}^m αi αj K(xi, xj)) ) ),

where R and f are defined in Example 9.1(2).
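For illustration only, the following Python sketch trains a soft-margin SVM on a precomputed Gram matrix and evaluates the resulting classifier x ↦ sign(∑j αj K(xj, x) + b). The Gaussian kernel, the synthetic data and the use of scikit-learn's SVC (which solves the standard hinge-loss dual, a reformulation of (9.11) up to parameterization) are assumptions of the sketch, not part of the notes:

import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(1)
X = rng.normal(size=(60, 2))
y = np.where(X[:, 0] ** 2 + X[:, 1] ** 2 > 1.0, 1, -1)   # labels in Z_2 = {±1}

def gauss_kernel(A, B, sigma=1.0):
    # K(x, x') = exp(-||x - x'||^2 / (2 sigma^2)), a PSD kernel
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2 * sigma ** 2))

G = gauss_kernel(X, X)               # Gram matrix K(x_i, x_j)
clf = SVC(kernel="precomputed", C=1.0).fit(G, y)

X_test = rng.normal(size=(5, 2))
K_test = gauss_kernel(X_test, X)     # rows K(x, x_j) for new inputs x
print(clf.predict(K_test))           # sign of sum_j alpha_j y_j K(x_j, x) + b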


9.3.2. Generalization ability of kernel based SVMs.
• The advantage of working with kernels rather than optimizing directly in the feature space is that in some situations the dimension of the feature space is extremely large while implementing the kernel function is very simple. Moreover, in many cases, the computational time complexity of solving (9.10) is polynomial in the size of the xi, i ∈ [1, m], see [SSBD2014, p. 221-223].
• The upper bound for the sample complexity of a hard SVM in Theorem 8.18 is also valid for the sample complexity of the kernel based hard SVM [SSBD2014, Theorem 26.3, p. 384], after translating the condition of separability with (γ, ρ)-margin of a measure µ ∈ P(X × Z2) in terms of the kernel function.
• The upper bound for the sample complexity of a soft SVM in Theorem 8.19 is also valid for the sample complexity of the kernel based soft SVM, after translating the support condition on a measure µ ∈ P(X × Z2) in terms of the kernel function.

9.4. Conclusion. In this section we learned the kernel trick, which simplifies the algorithms solving the hard SVM and soft SVM optimization problems by embedding patterns into a Hilbert space. The kernel trick is based on the theory of RKHS and has many applications, e.g., for defining a feature map ϕ : P(X ) → V , where V is a RKHS, see e.g. [MFSS2017]. The main difficulty of the kernel method is that we still have no general method of selecting a suitable kernel for a concrete problem. Another open problem is to improve the upper bound for the sample complexity of SVM algorithms, i.e., to find new conditions on µ ∈ P such that the sample complexity of Ahk, Ask computed w.r.t. µ is bounded.

10. Neural networks

In the last lecture we examined kernel based SVMs, which are generalizations of linear classifiers sign f(w,b). Today we shall study further generalizations of linear classifiers, namely artificial neural networks, shortened as neural networks. Neural networks are inspired by the structure of neural networks in the brain. The idea behind neural networks is that many neurons can be joined together by communication links to carry out complex computations.

A neural network is a model of computation. Therefore, a neural network also denotes a hypothesis class of a learning model consisting of functions realizable by a neural network.

In today's lecture we shall investigate several types of neural networks and their expressive power, i.e., the class of functions that can be realized as elements in a hypothesis class of a neural network. In the next lecture we shall discuss the current learning algorithm for learning machines whose hypothesis class is a neural network.


10.1. Neural networks as a model of computation. A neural network has a graphical representation for a multivariate multivariable function hV,E,σ,w : Rn → Rm (in the case of a feedforward neural network), or for a sequence of such functions {h^i_{V,E,σ,w} : Rn → Rm | i ∈ N} (in the case of a recurrent neural network).
• The quadruple (V, E, σ, w) consists of
+ the network graph (V, E), also called the underlying graph of the network, where V is the set of nodes n, also called neurons, and E is the set of directed edges connecting nodes of V ;
+ a family of functions σn : R → R, also called the activation function of the neuron n. Usually σn = σ is independent of n. The most common activation functions are:
- the sign function, σ(x) = sign(x),
- the threshold function, σ(x) = 1_{R+}(x),
- the sigmoid function σ(x) := 1/(1 + e^{−x}), which is a smooth approximation to the threshold function;
+ w : E → R, the weight function of the network.
• The network architecture of a neural network is the triple G = (V, E, σ).
• A weight w : E → R endows each neuron n with a computing instruction of “input-output” type. The input I(n) of a neuron n is equal to the weighted sum of the outputs of all the neurons connected to it: I(n) = ∑ w(n′n) O(n′), where n′n ∈ E is a directed edge and O(n′) is the output of the neuron n′ in the network.
• The output O(n) of a neuron n is obtained from the input I(n) as follows: O(n) = σ(I(n)).

Page 72: MATHEMATICAL FOUNDATIONS OF MACHINE LEARNING · theory. That is why statistical learning theory is important part of machine learning theory. 1.2. A brief history of machine learning

72 HONG VAN LE ∗

• The i-th input node gives the output xi. If the input space is Rn then we have n + 1 input nodes, one of them being the “constant” neuron, whose output is 1.
• There is a neuron in the hidden layer that has no incoming edges. This neuron will output the constant σ(0).

graph. Each FNN (E, V,w, σ) represents a multivariate multivariable func-tion hV,E,σ,w : Rn → Rm which is obtained by composing the computinginstruction of each neuron on directed paths from input neurons to outputneurons. For each architecture (V,E, σ) of a FNN we denote by

HV,E,σ = hV,E,σ,w : w ∈ RE

the underlying hypothesis class of functions from the input space to theoutput space of the network.• A recurrent neural network (RNN) has underlying directed graph with

a cycle. By unrolling cycles in a RNN in discrete time n ∈ N, a RNNdefines a map r : N+ :→ FNN such that [r(n)] ⊂ [r(n + 1)], where[r(n)] is the underlying graph of r(n), see [GBC2016, §10.2, p. 368] and[Graves2012, §3.2, p. 22]. Thus a RNN can be regarded as a sequenceof multivariate multivariable functions which serves as in a discriminativemodel for supervised sequence labelling.
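The following Python sketch (our illustration; the tanh activation and the particular layer sizes are arbitrary choices) implements the rules I(n) = ∑ w(n′n) O(n′) and O(n) = σ(I(n)) layer by layer for a layered FNN with a constant neuron per layer:

import numpy as np

def forward(x, weights, sigma=np.tanh):
    """Forward pass of a layered FNN: I(n) = sum_{n'} w(n'n) O(n'), O(n) = sigma(I(n)).

    weights: list of matrices W_t of shape (|V_t|, |V_{t-1}| + 1); the extra
    column multiplies the "constant" neuron whose output is 1.
    """
    out = np.asarray(x, dtype=float)
    for W in weights:
        out = np.append(out, 1.0)        # output of the constant neuron
        out = sigma(W @ out)             # O(n) = sigma(I(n)) for every neuron n in the layer
    return out

# a hypothetical 2-3-1 architecture with arbitrary weights
rng = np.random.default_rng(0)
weights = [rng.normal(size=(3, 3)), rng.normal(size=(1, 4))]
print(forward([0.5, -1.0], weights))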


Digression. The goal of supervised sequence labelling is to assign sequences of labels, drawn from a label space, to sequences of input data. Denoting the set of labels by Y and the set of input data by X , Y∞ is the space of all sequences of labels and X∞ is the space of all sequences of input data. For example, one might wish to transcribe a sequence of acoustic features with spoken words (speech recognition), or a sequence of video frames with hand gestures (gesture recognition). If the sequences are assumed to be independent and identically distributed, we recover the basic framework of pattern classification. In practice, this assumption may not hold.

Example 10.1. A multilayer perceptron (MLP) is a special FNN that has vertices arranged in a disjoint union of layers V = ∪_{i=0}^n Vi such that every edge in E connects nodes in neighboring layers Vi, Vi+1. The depth of the MLP is n. V0 is called the input layer, Vn is called the output layer, and the other layers are called hidden.

(1) An MLP

(10.1) φ(x) = 0 if ψ(x) ≤ 1/2, and φ(x) = 1 otherwise,

for ψ(x) = c0 + ∑_{i=1}^d ci xi, ci ∈ R, is a perceptron (linear classifier) with the activation function sign(ψ(x) − 1/2). This neural network has no hidden layer and the ci are the weights of the neural network.

(2) In a feedforward neural network with one hidden layer one takes φ(x) of the form (10.1) and

ψ(x) = c0 + ∑_{i=1}^k ci σ(ψi(x)),

where the ci are the weights of the network and σ is an activation function.

Remark 10.2. Neural networks are abstractions of biological neural networks: the connection weights w represent the strength of the synapses between the neurons, and the activation function σ is usually an abstraction representing the rate of action potential firing in the cell. In its simplest form, this function is binary, that is, either the neuron is firing or not. We can consider the activation function as a filter of relevant information.

In the remaining part of today's lecture we consider only FNNs. In particular, by NNs we mean FNNs.

10.2. The expressive power of neural networks. In this section we want to address the following

Question 10.3. What type of functions can be implemented using neural networks?


First we consider representations of Boolean functions by neural networks.

Proposition 10.4 (Representation of Boolean functions). ([SSBD2014, Claim 20.1, p. 271]) Every Boolean function f : Z2^d → Z2 can be represented exactly by a feedforward neural network HV,E,sign with a single hidden layer containing at most 2^d + 1 neurons and with the activation function σ(x) = sign(x).

Proof. Let (V, E) be a two-layer FNN with #V0 = d + 1, #V1 = 2^d + 1, #V2 = 1 and E consisting of all possible edges between adjacent layers. As before Z2 = {±1}. Now let f : Z2^d → Z2, let u1, · · · , uk ∈ f^{−1}(1) ⊂ Z2^d, where k := #f^{−1}(1). Set

gi(x) := sign(〈x, ui〉 − d + 1).

Then {gi | i ∈ [1, k]} are linear classifiers and therefore can be implemented by the neurons in V1. Now set

f(x) := sign( ∑_{i=1}^k gi(x) + k − 1 ),

which is also a linear classifier, implemented by the output neuron. This completes the proof of Proposition 10.4.
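The construction in the proof can be checked directly. The following Python sketch (our illustration, with a randomly chosen Boolean function) builds the hidden neurons gi and the output neuron and verifies that the resulting network computes f on all of Z2^d:

import numpy as np
from itertools import product

d = 4
rng = np.random.default_rng(0)
points = np.array(list(product([-1, 1], repeat=d)))          # Z_2^d with Z_2 = {±1}
f_table = {tuple(p): rng.choice([-1, 1]) for p in points}    # an arbitrary Boolean f

U = [np.array(p) for p in points if f_table[tuple(p)] == 1]  # u_i in f^{-1}(1)
k = len(U)

def network(x):
    # hidden layer: g_i(x) = sign(<x, u_i> - d + 1), one neuron per u_i
    g = np.array([np.sign(x @ u - d + 1) for u in U])
    # output neuron: sign(sum_i g_i(x) + k - 1)
    return int(np.sign(g.sum() + k - 1))

assert all(network(p) == f_table[tuple(p)] for p in points)
print("the network with", k, "hidden neurons represents f exactly")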

In the general case we have the following Universal Approximation Theorem, see e.g. [Haykin2008, p. 167].

Theorem 10.5. Let ϕ be a nonconstant, bounded and monotone increasing continuous function. For any m ∈ N, ε > 0 and any function F ∈ C0([0, 1]^m) there exist an integer m1 ∈ N and constants αi, bi, wij, where i ∈ [1, m1], j ∈ [1, m], such that the function

f(x1, · · · , xm) := ∑_{i=1}^{m1} αi ϕ( ∑_{j=1}^m wij xj + bi )

satisfies, for all (x1, · · · , xm) ∈ [0, 1]^m,

|F(x1, · · · , xm) − f(x1, · · · , xm)| < ε.
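Theorem 10.5 is an existence statement; the following Python sketch only illustrates its flavour. It fixes m = 1, draws the inner weights wi and biases bi at random (a simplification that is not part of the theorem) and fits the outer coefficients αi by least squares to approximate F(x) = sin(2πx) on [0, 1]:

import numpy as np

rng = np.random.default_rng(0)
F = lambda x: np.sin(2 * np.pi * x)          # target F in C^0([0, 1])
phi = lambda t: 1.0 / (1.0 + np.exp(-t))     # sigmoid activation

m1 = 50                                      # number of hidden units
w = rng.normal(scale=10.0, size=m1)          # inner weights w_i (fixed at random, our choice)
b = rng.uniform(-10.0, 10.0, size=m1)        # biases b_i (fixed at random, our choice)

x = np.linspace(0.0, 1.0, 200)
Phi = phi(np.outer(x, w) + b)                # matrix of values phi(w_i x + b_i)

alpha, *_ = np.linalg.lstsq(Phi, F(x), rcond=None)   # outer coefficients alpha_i
f = Phi @ alpha
print("max |F - f| on the grid:", np.max(np.abs(F(x) - f)))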

Remark 10.6. The universal approximation theorem may be viewed as a natural extension of the Weierstrass theorem stating that any continuous function over a closed interval on the real axis can be expressed in that interval as an absolutely and uniformly convergent series of polynomials. Many researchers regard the Universal Approximation Theorem as a natural generalization of Kolmogorov's theorem [Kolmogorov1957] and Lorentz's theorem [Lorentz1976], which states that every continuous function on [0, 1]^d can be written as

f(x) = ∑_{i=1}^{2d+1} Fi( ∑_{j=1}^d Gij(xj) )


where the Gij and Fi are continuous functions. In 1989 Cybenko demonstrated rigorously for the first time that a single hidden layer is sufficient to uniformly approximate any continuous function with support in a unit hypercube [Cybenko1989].

Example 10.7 (Neural networks for regression problems). In neural networks with hypothesis class HV,E,σ for regression problems one often chooses the activation function σ to be the sigmoid function σ(a) := (1 + e^{−a})^{−1}, and the loss function L to be L2, i.e., L(x, y, hw) := (1/2) ||hw(x) − y||^2 for hw ∈ HV,E,σ and x ∈ X = Rn, y ∈ Y = Rm.

Example 10.8 (Neural networks for generative models in supervised learning). In generative models of supervised learning we need to estimate the conditional distribution p(t|x), where t ∈ R is the target variable in a regression problem. In many regression problems p is chosen as follows [Bishop2006, (5.12), p. 232]:

(10.2) p(t|x) = N(t | y(x, w), β^{−1}) = (β/(2π))^{1/2} exp( −(β/2)(t − y(x, w))^2 ),

where β is an unknown parameter and y(x, w) is the expected value of t. Thus the learning model is of the form (X , H, L, P ), where H := {y(·, w)} is parameterized by the parameters w, β, and the statistical model P is a subset of P(X ), since the joint distribution µ(x, y) is completely defined by the marginal distribution µX and the conditional distribution µ(y|x), see Remark 2.8.

Now assume that X = (x1, · · · , xn) are i.i.d. by µ ∈ P along with labels (t1, · · · , tn). Then (10.2) implies


(10.3) − log p(t|X, w, β) = −(n/2) log β + (n/2) log(2π) + (β/2) ∑_{i=1}^n |ti − y(xi, w)|^2.

As in the density estimation problem we want to minimize the LHS of (10.3). Keeping β constant, we first minimize the β-independent component of the loss function

(10.4) LS(w) = (1/2) ∑_{i=1}^n |y(xi, w) − ti|^2.

Once we know a minimizer wML of LS(w), the value of β can be found by the following formula:

(10.5) 1/βML = (1/n) ∑_{i=1}^n |y(xi, wML) − ti|^2.
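As a small numerical illustration of (10.4) and (10.5) (with a linear model y(x, w) = w0 + w1 x standing in for the network output and synthetic data; both are assumptions of the sketch), one can compute wML by least squares and then βML:

import numpy as np

rng = np.random.default_rng(0)
n = 200
x = rng.uniform(-1, 1, size=(n, 1))
t = 1.5 * x[:, 0] - 0.7 + rng.normal(scale=0.3, size=n)   # synthetic targets

# design matrix for y(x, w) = w_0 + w_1 x
X = np.hstack([np.ones((n, 1)), x])
w_ml, *_ = np.linalg.lstsq(X, t, rcond=None)      # minimizer of L_S(w) in (10.4)

residuals = t - X @ w_ml
beta_ml = 1.0 / np.mean(residuals ** 2)           # formula (10.5)
print("w_ML =", w_ml, " 1/beta_ML =", 1.0 / beta_ml)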

10.3. Sample complexities of neural networks. A learning model (X × Y, HV,E,σ, L, P ) is called a neural network if HV,E,σ is a neural network and HV,E,σ ⊂ Y^X .


10.3.1. Neural networks for the binary classification problem. In neural networks with hypothesis class HV,E,σ for binary classification problems one often chooses the activation function σ to be the sign function, and the loss function L to be L(0−1).

Proposition 10.9. ([SSBD2014, Theorem 20.6, p. 274]) Let HV,E,sign be an MLP. The VC-dimension of HV,E,sign is O(|E| log |E|).

Outline of the proof. Assume first that H := HV,E,sign consists of exactly one perceptron. In this case Proposition 10.9 is valid since by Exercise 8.14 we have VC dim H = |Ein| = O(|E| log |E|), where Ein denotes the set of directed edges coming into the perceptron.

Next, we reduce the proof for the general case of a neural network to the case of a single perceptron. Let m := VC dim(H). Using

(10.6) ΓH(m) = 2^m,

to prove Proposition 10.9 it suffices to show that

(10.7) Γ_{HV,E,σ}(m) ≤ (em)^{|E|},

since (10.6) and (10.7) imply 2^m ≤ (em)^{|E|}, and hence m = O(|E| log |E|). Let V0, · · · , VT be the layers of (V, E). For t ∈ [1, T ] denote by Ht the

neural network H_{Wt,Et,sign}, where Wt consists of the input neurons in Vt−1 and the output neurons in Vt, and Et consists of the edges of H that connect Vt−1 with Vt. Now we decompose

(10.8) H = HT ∘ · · · ∘ H1.

Lemma 10.10 (Exercises). (1) ([SSBD2014, Exercise 4, p. 282]) Let F1 ⊂ Z^X and F2 ⊂ Y^Z. Set H := F2 ∘ F1. Then ΓH(n) ≤ ΓF2(n) ΓF1(n).

(2) ([SSBD2014, Exercise 3, p. 282]) Let Fi be a set of functions from X to Yi for i = 1, 2. Then Γ_{F1×F2}(n) ≤ ΓF1(n) ΓF2(n).

By Lemma 10.10 (1) we have

ΓH(m) ≤ ∏_{t=1}^T ΓHt(m).

Next we observe that

(10.9) Ht = Ht,1 × · · · ×Ht,|Vt|.

Each neuron ni in Vt has dt,i incoming edges, representing the number of inputs of the linear classifier ni. Hence VC dim Ht,i = dt,i < m − 1. By Lemma 10.10 and the Vapnik-Chervonenkis-Sauer lemma we have

ΓH(m) ≤ ∏_{t=1}^T ∏_{i=1}^{|Vt|} (em/dt,i)^{dt,i} < (em)^{|E|},

which completes the proof of Proposition 10.9.

It follows from Proposition 10.9 that the sample complexity of the ERM algorithm for (X × Z2, HV,E,sign, L(0−1), P(X × Z2)) is finite. But the running time of the ERM algorithm in a neural network HV,E,sign is non-polynomial and


therefore it is impractical to use it [SSBD2014, Theorem 20.7, p. 276]. The solution is to use stochastic gradient descent, which we shall learn in the next lecture.

10.4. Conclusion. In this lecture we considered learning machines whose hypothesis classes consist of functions, or sequences of functions, that can be graphically represented by neural networks. Neural networks have good expressive power and finite VC-dimension in binary classification problems, but the ERM algorithm in these networks has very high computational complexity and therefore is impractical.

11. Training neural networks

Training a neural network is a popular name for running a learning algorithm in a neural network learning model. We consider in this lecture only the case where the input space and the output space of a network are Euclidean spaces Rn and Rm respectively. Our learning model is of the form (Rn × Rm, HV,E,σ, L, P ) and the learning algorithm is stochastic gradient descent (SGD), which aims to find a minimizer of the expected risk function RLµ : HV,E,σ → R. Since HV,E,σ is parameterized by the weight functions w ∈ RE ≅ R^{|E|}, we regard RLµ as a function of the variable w on RN, where N = |E|. We begin with classical (deterministic) gradient and subgradient descent of a function on RN and then analyze SGD, assuming the loss function L is convex. Under this assumption we get an upper bound for the sample complexity of SGD. Finally we discuss SGD in general FNNs.

11.1. Gradient and subgradient descent. For a differentiable function f on RN denote by ∇gf the gradient of f w.r.t. a Riemannian metric g on RN, i.e., for any x ∈ RN and any V ∈ RN we have

(11.1) df(V ) = 〈∇gf, V 〉.

If g is fixed, for instance if g is the standard Euclidean metric on RN, we just write ∇f instead of ∇gf.

The negative gradient flow of f on RN is a dynamical system on RN defined by the following ODE with initial value w0 ∈ RN:

(11.2) w(0) = w0 ∈ RN and ẇ(t) = −∇f(w(t)).

If w(t) is a solution of (11.2) then f(w(t′)) < f(w(t)) for any t′ > t unless ∇f(w(t)) = 0, i.e., w(t) is a critical point of f.

If f is not differentiable we modify the notion of the gradient of f as follows.

Definition 11.1. Let f : S → R be a function on an open convex set S ⊂ RN. A vector v ∈ RN is called a subgradient of f at w ∈ S if

(11.3) ∀u ∈ S, f(u) ≥ f(w) + 〈u− w, v〉.

Page 78: MATHEMATICAL FOUNDATIONS OF MACHINE LEARNING · theory. That is why statistical learning theory is important part of machine learning theory. 1.2. A brief history of machine learning

78 HONG VAN LE ∗

The set of subgradients of f at w is also called the differential set and is denoted by ∂f(w).

Exercise 11.2. (1) Show that if f is differentiable at w then ∂f(w) contains a single element ∇f(w).

(2) Find a subgradient of the generalized hinge loss function f_{a,b,c}(w) = max{a, 1 − b〈w, c〉}, where a, b ∈ R, w, c ∈ RN and 〈·, ·〉 is a scalar product.

Remark 11.3. It is known that a subgradient of a function f on a convex open domain S exists at every point w ∈ S iff f is convex, see e.g. [SSBD2014, Lemma 14.3].

• Gradient descent algorithm. The gradient descent (GD) algorithm discretizes the solution of the gradient flow equation (11.2). We begin with an arbitrary initial point w0 ∈ RN and set

(11.4) wn+1 = wn − γn ∇f(wn),

where γn ∈ R+ is a constant, called a “learning rate” in machine learning, to be optimized. This algorithm can be slightly modified. For example, after T iterations we set the output point wT to be

(11.5) wT := (1/T) ∑_{i=1}^T wi,

or

(11.6) wT := argmin_{i∈[1,T]} f(wi).
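A minimal Python sketch of the GD algorithm (11.4) with the averaged output (11.5), run on the convex 1-Lipschitz test function f(w) = ||w − a|| (our choice of example), is as follows; the final error can be compared with the bound rρ/√T of Proposition 11.4 below:

import numpy as np

def gd(grad, w0, T, eta):
    """GD update (11.4) with constant learning rate, returning the averaged output (11.5)."""
    w = np.asarray(w0, dtype=float)
    avg = np.zeros_like(w)
    for _ in range(T):
        avg += w
        w = w - eta * grad(w)
    return avg / T

# f(w) = ||w - a|| is convex and 1-Lipschitz; its gradient away from a:
a = np.array([3.0, -2.0])
grad = lambda w: (w - a) / np.linalg.norm(w - a)

r, rho, T = 5.0, 1.0, 10_000
w_bar = gd(grad, np.zeros(2), T, eta=r / (rho * np.sqrt(T)))
print("f(w_T) =", np.linalg.norm(w_bar - a), " bound r*rho/sqrt(T) =", r * rho / np.sqrt(T))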

If a function f on RN has a critical point which is not a minimizer of f, then the gradient flow (11.2) and its discrete version (11.4) may not converge to the required minimizer of f. If f is convex and differentiable, then every critical point of f is a global minimizer of f. In fact we have the following stronger assertion:

(11.7) f(w) − f(u) ≤ 〈w − u, ∇f(w)〉 for any w, u ∈ RN.

It follows from (11.7) that every critical point of f is a global minimizer, and hence the gradient flow (11.2) works. Its discrete version (11.4) also works, as stated in the following.

Proposition 11.4. ([SSBD2014, Corollary 14.2, p. 188]) Let f be a convex ρ-Lipschitz function on RN,18 and let w∗ ∈ argmin_{w∈B(0,r)⊂RN} f(w). If we run the GD algorithm (11.4) on f for T steps with γt = η = r/(ρ√T) for t ∈ [1, T ], then the output wT defined by (11.5) satisfies

f(wT) − f(w∗) ≤ rρ/√T.

18i.e., |f(w)− f(u)| ≤ ρ |w − u|


Under the conditions in Proposition 11.4, for every ε > 0, to achieve f(wT) − f(w∗) ≤ ε it suffices to run the GD algorithm for a number of iterations T that satisfies

T ≥ r²ρ²/ε².

Lemma 11.5. ([SSBD2014, Lemma 14.1, p. 187]) Let w∗, v1, · · · , vT ∈ RN. Any algorithm with initialization w1 = 0 and

(11.8) wt+1 = wt − η vt

satisfies

(11.9) ∑_{t=1}^T 〈wt − w∗, vt〉 ≤ ||w∗||²/(2η) + (η/2) ∑_{t=1}^T ||vt||².

In particular, for every r, ρ > 0, if for all t ∈ [1, T ] we have ||vt|| ≤ ρ and if we set η = (r/ρ) T^{−1/2}, then for ||w∗|| ≤ r we have

(11.10) (1/T) ∑_{t=1}^T 〈wt − w∗, vt〉 ≤ rρ/√T.

To apply Lemma 11.5 to Proposition 11.4 we set vt := ∇f(wt) and note that ||∇f(wt)|| ≤ ρ since f is ρ-Lipschitz. Moreover,

f(wT) − f(w∗) = f( (1/T) ∑_{t=1}^T wt ) − f(w∗)
≤ (1/T) ∑_{t=1}^T f(wt) − f(w∗)   (since f is convex)
= (1/T) ∑_{t=1}^T ( f(wt) − f(w∗) )
≤ (1/T) ∑_{t=1}^T 〈wt − w∗, ∇f(wt)〉   (by (11.7)).

• Subgradient descent algorithm. Comparing (11.3) with (11.7) and taking into account the technical Lemma 11.5, we conclude that the gradient descent algorithm can also be applied to a non-differentiable function f that has a subgradient at every point.

11.2. Stochastic gradient descent (SGD). Let (Z, H, L, P ) be a learning model. Given a sample S := (z1, · · · , zn) ∈ Zn consisting of observables zi that are i.i.d. by µ ∈ P, SGD searches for an approximate minimizer hS ∈ H of the function RLµ : H → R using the following formula of “differentiation under integration”, assuming L is differentiable:

(11.11) ∇RLµ(h) = ∫_Z ∇h L(h, z) dµ(z).


Here ∇h L(h, z) is the gradient of the function L in the variable h with parameter z. Thus ∇RLµ(h) can be computed in two steps. First we compute ∇h L(h, zi) for zi ∈ S. Then we approximate the RHS of (11.11) by the empirical gradient (1/n) ∑_{zi∈S} ∇h L(h, zi), which is equal to the gradient of the empirical risk function:

∇RLS(h) = (1/n) ∑_{zi∈S} ∇h L(h, zi).

Next we apply the gradient descent algorithm described above to ∇RLS(h). The weak law of large numbers ensures the convergence in probability of ∇RLS(h) to the RHS of (11.11), and heuristically the convergence of the empirical gradient descent algorithm to the gradient descent of the expected risk function RLµ.

There are several versions of SGD with minor modifications. For simplicity and for applications to NNs we assume Z = Rn × Rm, H := HV,E,σ is parameterized by w ∈ RN and L is differentiable in w. A version of SGD works as follows (a minimal numerical sketch is given after the list).
1) Choose a parameter η > 0 and T > 0.
2) Assume that S = (z1, z2, · · · , zn) ∈ Zn; at each iteration take an arbitrary z ∈ S.
3) Set w1 = 0 ∈ RN.
4) For t ∈ [1, T ] set wt+1 := wt − η ∇w L(wt, z).
5) Set the output wT(z) := (1/T) ∑_{t=1}^T wt.
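Here is the promised minimal numerical sketch of this SGD version (the squared loss, the synthetic data and the uniform sampling of z from S are our illustrative choices, not part of the notes):

import numpy as np

def sgd(grad_L, S, T, eta, dim):
    """w_1 = 0, w_{t+1} = w_t - eta * grad_w L(w_t, z_t) with z_t drawn from S,
    output = average of the iterates, as in steps 1)-5) above."""
    rng = np.random.default_rng(0)
    w = np.zeros(dim)
    avg = np.zeros(dim)
    for _ in range(T):
        z = S[rng.integers(len(S))]
        avg += w
        w = w - eta * grad_L(w, z)
    return avg / T

# squared loss L(w, (x, y)) = 0.5 * (<w, x> - y)^2 on synthetic data
rng = np.random.default_rng(1)
w_true = np.array([1.0, -2.0, 0.5])
X = rng.normal(size=(500, 3))
Y = X @ w_true + 0.1 * rng.normal(size=500)
S = list(zip(X, Y))

grad_L = lambda w, z: (w @ z[0] - z[1]) * z[0]
w_hat = sgd(grad_L, S, T=20_000, eta=0.01, dim=3)
print("estimated w:", w_hat)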

Proposition 11.6. ([SSBD2014, Corollary 14.12, p. 197]) Assume L is a convex function in the variable w and µ ∈ P governs the probability distribution of the i.i.d. zi in S ∈ (Rn × Rm)^n. Assume that r, ρ ∈ R+ are given with the following properties.
1) w∗ ∈ argmin_{w∈B(0,r)} RLµ(w).
2) The SGD is run for T iterations with η = √(r²/(ρ²T)).
3) For all t ∈ [1, T ] we have Eµ(||∇w L(wt, z)||) ≤ ρ (e.g., ||∇w L(wt, z)|| ≤ ρ for all z).
4) T ≥ r²ρ²/ε².
Then

(11.12) Eµ( RLµ(wT(z)) ) ≤ RLµ(w∗) + ε.

Exercise 11.7. Find an upper bound for the sample complexity of the SGD in Proposition 11.6.

Example 11.8. Let us consider a layered FNN with H = HV,E,σ, where V = V0 ∪ V1 ∪ · · · ∪ VT. For the loss function

L(x, y, w) := (1/2) ||hw(x) − y||²


and a vector v ∈ RN we compute the gradient of L w.r.t. the Euclidean metric on RN, regarding x, y as parameters:

〈∇L(x, y, w), v〉 = 〈hw(x) − y, ∇v hw(x)〉.

To compute ∇v hw(x) = dh(v) we decompose hw = hT ∘ · · · ∘ h1 as in (10.8) and use the chain rule:

d(hT ∘ · · · ∘ h1)(v) = dhT ∘ · · · ∘ dh1(v).

To compute dht we use the decomposition (10.9):

d(ht,1 × · · · × ht,|Vt|) = dht,1 × · · · × dht,|Vt|.

Finally, for ht,j = σ(∑ aj xj) we have

dht,j = dσ ∘ (∑ aj dxj).

The algorithm for computing the gradient ∇L w.r.t. w efficiently is called backpropagation. 19
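The following Python sketch implements this chain-rule computation for a layered network ht(o) = σ(Wt o) with σ = tanh and the loss L(x, y, w) = (1/2)||hw(x) − y||² (the choice of σ and the absence of bias terms are simplifications of ours), and checks one entry of the gradient against a finite difference:

import numpy as np

def backprop(x, y, Ws, sigma=np.tanh, dsigma=lambda a: 1 - np.tanh(a) ** 2):
    """Gradient of 0.5 * ||h_w(x) - y||^2 for h_w = h_T o ... o h_1, h_t(o) = sigma(W_t o)."""
    # forward pass: store pre-activations a_t and layer outputs o_t
    outs, pres = [np.asarray(x, float)], []
    for W in Ws:
        pres.append(W @ outs[-1])
        outs.append(sigma(pres[-1]))
    # backward pass: delta_t = dL/da_t, propagated through dh_t = dsigma o W_t
    grads = [None] * len(Ws)
    delta = (outs[-1] - y) * dsigma(pres[-1])
    for t in range(len(Ws) - 1, -1, -1):
        grads[t] = np.outer(delta, outs[t])      # dL/dW_t
        if t > 0:
            delta = (Ws[t].T @ delta) * dsigma(pres[t - 1])
    return grads

# check one weight against a finite-difference approximation
rng = np.random.default_rng(0)
Ws = [rng.normal(size=(4, 3)), rng.normal(size=(2, 4))]
x, y = rng.normal(size=3), rng.normal(size=2)
g = backprop(x, y, Ws)
eps = 1e-6
Wp = [W.copy() for W in Ws]; Wp[0][1, 2] += eps
loss = lambda Ws_: 0.5 * np.sum((np.tanh(Ws_[1] @ np.tanh(Ws_[0] @ x)) - y) ** 2)
print(g[0][1, 2], (loss(Wp) - loss(Ws)) / eps)   # the two values should agree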

Remark 11.9. (1) In a general FNN the loss function L is not convex, therefore we cannot apply Proposition 11.6. Training an FNN is therefore subject to experimental tuning.

(2) Training a RNN is reduced to training a sequence of FNNs given a sequence of labelled data, see [Haykin2008, §15.6, p. 806] for more details.

11.3. Online gradient descent and online learnability. For training neural networks one also uses Online Gradient Descent (OGD), which works as an alternative to SGD [SSBD2014, p. 300]. Let L : RN × Z → R be a loss function. A version of OGD works almost like SGD:
1) Choose a parameter η > 0 and T > 0.
2) A sample S = (z1, · · · , zT) ∈ ZT is given.
3) Set w1 = 0.
4) For t ∈ [1, T ] set vt := ∇w L(wt, zt).
5) Set wt+1 := wt − η vt.

Despite their similarity, at the moment there is no sample complexity analysis of OGD. Instead, the ML community has developed a concept of online learnability for understanding OGD.

11.3.1. Setting of online learning. Let (X × Y, H, L, P ) be a supervised learning model. The general online learning setting involves T rounds. At the t-th round, 1 ≤ t ≤ T, the algorithm A receives an instance xt ∈ X and makes a prediction A(xt) ∈ Y. It then receives the true label yt ∈ Y and computes a loss L(A(xt), yt), where L : Y × Y → R+ is a loss function. The goal of A is to minimize the cumulative loss RA(T) := ∑_{t=1}^T L(A(xt), yt) over T rounds, which is an analogue of the notion of empirical risk in our unified learning model [MRT2012, p. 148].

19 According to [Bishop2006, p. 241] the term “backpropagation” is used in the neural computing literature to mean a variety of different things.


In the case of the 0-1 loss function L(0−1) the value RA(T) is called the number of mistakes that A makes after T rounds.

Definition 11.10 (Mistake Bounds, Online Learnability). ([SSBD2014, Definition 21.1, p. 288]) Given any sequence S = ((x1, h∗(x1)), · · · , (xT, h∗(xT))), where T is any integer and h∗ ∈ H, let MA(S) be the number of mistakes A makes on the sequence S. We denote by MA(H) the supremum of MA(S) over all sequences of the above form. A bound of the form MA(H) ≤ B < ∞ is called a mistake bound. We say that a hypothesis class H is online learnable if there exists an algorithm A for which MA(H) ≤ B < ∞.

Remark 11.11. 1) Similarly we also have the notion of a successful online learner in regression problems [SSBD2014, p. 300], and within this concept online gradient descent is a successful online learner whenever the loss function is convex and Lipschitz.

2) In the online learning setting the notion of certainty and therefore the notion of a probability measure are absent. In particular we do not have the notion of expected risk. So it is an open question whether we can justify the online learning setting using statistical learning theory.

11.4. Conclusion. In this section we studied stochastic gradient descent as a learning algorithm which works if the loss function is convex. To apply stochastic gradient descent as a learning algorithm in FNNs, where the loss function is not convex, one needs to modify the algorithm experimentally so that it does not stay at a critical point which is not a minimizer of the empirical risk function. One also trains NNs with online gradient descent, for which we need the new concept of online learnability, which has not yet been interpreted within a probabilistic framework.

12. Bayesian machine learning

Under “Bayesian learning” one means the application of Bayesian statistics to statistical learning theory. Bayesian inference is an approach to statistics in which all forms of uncertainty are expressed in terms of probability. Ultimately in Bayesian statistics we regard all unknown quantities as random variables and we consider a joint probability distribution for all of them, which contains the most complete information about the correlation between the unknown quantities. We denote by Θ the set of parameters that govern the distribution of the data x ∈ X that we would like to estimate. The crucial observation is that the joint distribution µΘ×X is defined by the conditional probability P(A|x) := µΘ|X(A|x) and the marginal probability measure µΘ on Θ, by Remark 2.8 (1).

12.1. Bayesian concept of learning. A Bayesian approach to a problem starts with the formulation of a Bayesian statistical model (Θ, µΘ, p, X ) that we hope is adequate to describe the situation of interest. Here µΘ ∈ P(Θ) and p : Θ → P(X ) is a measurable map, see Subsection A.4. We then formulate a prior distribution µΘ over the unknown parameters θ ∈ Θ of


the model, which is meant to capture our beliefs about the situation before seeing the data x ∈ X . After observing some data, we apply Bayes' Theorem A.11 to obtain a posterior distribution for these unknowns, which takes account of both the prior and the data. From this posterior distribution we can compute predictive distributions P(zn+1 | z1, · · · , zn) for future observations using (12.1).

To predict the value of an unknown quantity zn+1, given a sample (z1, · · · , zn) and a prior distribution µΘ, one uses the following formula:

(12.1) P(zn+1 | z1, · · · , zn) = ∫ P(zn+1|θ) P(θ | z1, · · · , zn) dµΘ(θ),

which is a consequence of the disintegration formula (A.8). The conditional distribution P(zn+1|θ) := p(θ)(zn+1) is called the sampling distribution of the data zn+1, and the conditional probability P(θ | z1, · · · , zn) is called the posterior distribution of θ after observing (z1, · · · , zn). The learning algorithm A gives a rule to compute the posterior distribution P(θ | z1, · · · , zn).

Definition 12.1. A Bayesian machine learning is a quintuple (Θ, µΘ, p, X , A), where (Θ, µΘ, p, X ) is a Bayesian statistical model and A is an algorithm for computing the posterior distribution P(θ | x1, · · · , xn).

Example 12.2. Let us consider a simple Bayesian learning machine with countable Θ = {θi}_{i=1}^∞ and X = {xj}_{j=1}^∞, called a discrete naive Bayesian learning machine. In naive Bayesian machine learning we make the assumption that, conditionally on each θk, the observables xi are mutually independent, i.e.,

(12.2) P(xi | xi+1, . . . , xn, θk) = P(xi | θk).

Then by (A.14) we have the following algorithm for computing the posterior distributions:

(12.3) P(θk | x1, · · · , xn) = (1/P(x)) · P(θk) ∏_{i=1}^n P(xi|θk),

where x = (x1, · · · , xn).
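A minimal numerical sketch of (12.3) for a hypothetical discrete naive Bayes model with two classes and three binary features (all probability tables below are made up for illustration) reads:

import numpy as np

prior = np.array([0.6, 0.4])                      # P(theta_k)
# P(x_i = 1 | theta_k), one row per class theta_k
p_feat = np.array([[0.9, 0.2, 0.5],
                   [0.3, 0.7, 0.5]])

def posterior(x):
    """P(theta_k | x_1, ..., x_n) proportional to P(theta_k) * prod_i P(x_i | theta_k), cf. (12.3)."""
    x = np.asarray(x)
    lik = np.prod(np.where(x == 1, p_feat, 1 - p_feat), axis=1)
    unnorm = prior * lik
    return unnorm / unnorm.sum()     # dividing by P(x) normalizes the posterior

print(posterior([1, 0, 1]))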

Example 12.3. When dealing with continuous data, a typical assumption is that the continuous values associated with each class are distributed according to a normal (or Gaussian) distribution. The Gaussian naive Bayesian model assumes that the Bayesian statistical model (Θ, µΘ, p, X ) is Gaussian, i.e.,

P(x = v | θ) = (1/√(2πσ(θ)²)) exp( −(v − θ)²/(2σ(θ)²) ).

Remark 12.4 (MAP estimator). In many cases it is difficult to compute the predictive distribution P(xn+1 | x1, · · · , xn) using (12.1). A popular solution is the Maximum A Posteriori (MAP) estimator: we take the value θ̂ that maximizes the posterior probability P(θ | x1, · · · , xn) and plug it in to compute P(xn+1 | θ̂).


12.2. Applications of Bayesian machine learning.

Example 12.5 (Bayesian neural networks). ([Neal1996, §1.1.2, p. 5]) In a Bayesian neural network the aim of a learner is to find the conditional probability P(y | xn+1, (x1, y1), · · · , (xn, yn)), where y is a label, xn+1 is a new input and {(xi, yi) | i = 1, . . . , n} is the training data. Let θ be a parameter of the neural network. Then we have

(12.4) P(y | xn+1, (x1, y1), · · · , (xn, yn)) = ∫ P(y | xn+1, θ) P(θ | (x1, y1), · · · , (xn, yn)) dθ.

The conditional sampling probability P(y | xn+1, θ) is assumed to be known. Hence we can compute the LHS of (12.4), which is called the predictive distribution of y.
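In practice the integral (12.4) is rarely available in closed form. The following Python sketch approximates it for a toy one-parameter "network" y(x, θ) = tanh(θx) with a Gaussian likelihood and a Gaussian prior (all of these modelling choices are ours, not Neal's) by evaluating the posterior on a grid and computing the predictive mean:

import numpy as np

rng = np.random.default_rng(0)
x_train = rng.uniform(-2, 2, size=30)
y_train = np.tanh(1.3 * x_train) + 0.1 * rng.normal(size=30)

thetas = np.linspace(-3, 3, 601)                          # grid over the parameter
log_prior = -0.5 * thetas ** 2                            # N(0, 1) prior (assumption)
log_lik = np.array([np.sum(-0.5 * ((y_train - np.tanh(t * x_train)) / 0.1) ** 2)
                    for t in thetas])
post = np.exp(log_prior + log_lik - np.max(log_prior + log_lik))
post /= post.sum()                                        # posterior P(theta | data) on the grid

x_new = 0.7
pred_mean = np.sum(post * np.tanh(thetas * x_new))        # approximation of E[y | x_new, data]
print("predictive mean at x_new:", pred_mean)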

Another application of Bayesian methods is model selection. First we enumerate all reasonable models Mi of the data and assign a prior belief µi to each of them. Then, upon observing the data x, we evaluate how probable the data was under each of these models to compute P(x|Mi). To compare two models Mi and Mj, we need to compute their relative probability given the data: µi P(x|Mi) / (µj P(x|Mj)).

There are more applications of Bayesian machine learning, for example in decision theory using posterior distributions, see e.g. [Robert2007].

12.3. Consistency. There are many viewpoints on Bayesian consistency, see e.g. [GV2017, Chapters 6, 7], where the authors devote a sizable part of their book to the convergence of posterior distributions to the true value of the parameters as the amount of data increases indefinitely. On the other hand, according to [Robert2007, p. 48], in Bayesian decision theory one does not consider asymptotic theory. Firstly, the Bayesian point of view is intrinsically conditional: when conditioning on the observation S ∈ Xn, there is no reason to wonder what might happen if n goes to infinity since n is fixed by the sample size. Theorizing on future values of the observations thus leads to a frequentist analysis, opposite to the imperatives of the Bayesian perspective. Secondly, even though it does not integrate asymptotic requirements, Bayesian procedures perform well in a vast majority of cases under asymptotic criteria. In a general context, Ibragimov and Hasminskii showed that Bayes estimators are consistent [IH1981, Chapter 1].

12.4. Conclusion. In this lecture we considered the main ideas and some applications of Bayesian methods in machine learning. Bayesian machine learning is an emerging, promising trend in machine learning that is well suited for solving complex problems on the one hand and consistent with most basic techniques of non-Bayesian machine learning on the other. There are several problems in implementing the Bayesian approach, for instance translating our subjective prior beliefs into a mathematically formulated model and prior. There may also be computational difficulties with the Bayesian approach.


Appendix A. Some basic notions in probability theory

The true logic of this world is the calculus of probabilities. — James Clerk Maxwell

Basic objects in probability theory (and mathematical statistics) are measurable spaces (X , ΣX ), where ΣX is a σ-algebra of subsets of a space X . We shall often write X if we don't need to specify ΣX . A signed measure µ on X is a countably additive function µ : ΣX → R ∪ {−∞, +∞}. A signed measure µ is called a (nonnegative) measure20 if µ(A) ≥ 0 for all A ∈ ΣX . A measure µ is called a probability measure if µ(X ) = 1. We shall denote by S(X ) the set of all signed measures on X , by M(X ) the set of all finite measures µ on X , i.e., µ(X ) < ∞, and by P(X ) the set of all probability measures on X , i.e., µ(X ) = 1.

For this Appendix I use [Bogachev2007] as my main reference on (signed) measure theory and [Schervish1997] for theoretical statistics. I recommend also [JP2003] for a clear and short exposition of probability theory, [Kallenberg2002] for a modern treatment of probability theory and [AJLS2017, JLLT2019] for geometric and categorical approaches in statistics.

A.1. Dominating measures and the Radon-Nikodym theorem. Let µ ∈ M(X ) and ν ∈ S(X ).
(i) The measure ν is called absolutely continuous with respect to µ (or dominated by µ) if |ν|(A) = 0 for every set A ∈ ΣX with |µ|(A) = 0. In this case we write ν ≪ µ.
(ii) The measure ν is called singular with respect to µ if there exists a set Ω ∈ ΣX such that

|µ|(Ω) = 0 and |ν|(X \ Ω) = 0.

In this case we write ν ⊥ µ.

Theorem A.1. (cf. [Bogachev2007, Theorem 3.2.2, vol 1, p. 178]) Let µ and ν be two finite measures on a measurable space (X , Σ). The measure ν is dominated by the measure µ precisely when there exists a function f ∈ L1(X , µ) such that for any A ∈ ΣX we have

(A.1) ν(A) = ∫_A f dµ.

We denote ν by fµ for µ, ν, f satisfying the equation (A.1). The function f is called the (Radon-Nikodym) density (or the Radon-Nikodym derivative) of ν w.r.t. µ. We denote f by dν/dµ.

Remark A.2. The Radon-Nikodym derivative dν/dµ should be understood as the equivalence class of functions f ∈ L1(X , µ) that satisfy the relation (A.1) for any A ∈ ΣX . It is important in many cases to find a function on

20we shall omit in our lecture notes the adjective “nonnegative”


X which represents the equivalence class dν/dµ ∈ L1(X , µ), see [JLLT2019, Theorem A.1].

A.2. Conditional expectation and regular conditional measure. The notion of conditioning is constantly used as a basic tool to describe and analyze systems involving randomness. The heuristic concept of conditional probability existed long before the fundamental work by Kolmogorov [Kolmogoroff1933], where the concept of conditional probability was rigorously defined via the concept of conditional expectation.

A.2.1. Conditional expectation. In this section we define the notion of conditional expectation using the Radon-Nikodym theorem. We note that any sub-σ-algebra B ⊂ ΣX can be written as B = Id^{−1}(B), where Id : (X , ΣX ) → (X , B) is the identity mapping, and hence a measurable mapping. In what follows, w.l.o.g. we shall assume that B := σ(η), where η : (X , ΣX ) → (Y, ΣY) is a measurable mapping.

(Commutative diagram: η : (X , ΣX ) → (Y, ΣY) factors through the identity map Idη : (X , ΣX ) → (X , η^{−1}(ΣY)) followed by η : (X , η^{−1}(ΣY)) → (Y, ΣY).)

For a measurable mapping η : (X , ΣX ) → (Y, ΣY) we denote by η∗ : S(X ) → S(Y) the push-forward mapping

(A.2) η∗(µ)(B) := µ(η^{−1}(B))

for any B ∈ ΣY . Clearly η∗(M(X )) ⊂ M(Y) and η∗(P(X )) ⊂ P(Y). Denote by Idη the identity measurable mapping (X , ΣX ) → (X , η^{−1}(ΣY)).

Definition A.3. Let µ ∈ P(X ), let η : X → Y be a measurable mapping and f ∈ L1(X , µ). The conditional expectation E^{σ(η)}_µ f is defined as follows:

(A.3) E^{σ(η)}_µ f := d(Idη)∗(fµ) / d(Idη)∗(µ) ∈ L1(X , η^{−1}(ΣY), (Idη)∗µ).

Remark A.4. (1) The conditional expectation E^{σ(η)}_µ f can be expressed via a function in L1(Y, ΣY , η∗µ) as follows. Set

η^µ_∗ f := dη∗(fµ) / dη∗(µ) ∈ L1(Y, ΣY , η∗µ).

Assume that g is a function on Y that represents the equivalence class η^µ_∗ f. Then g(η(x)) is a representative of E^{σ(η)}_µ f, since for any A = η^{−1}(B) ∈ η^{−1}(ΣY) we have

fµ(A) = ∫_{η^{−1}(B)} f dµ = ∫_B η^µ_∗ f dη∗(µ) = ∫_B g dη∗(µ).

(2) In the probabilistic literature, for B = σ(η) one uses the notation

E(f |B) := E^B_µ f,


since in this case µ is assumed to be the known probability measure and therefore we don't need to specify it in the expression E(f |B). We also use the notation Eµ(f |B) later.

(3) There are many approaches to conditional expectations. The present approach using the Radon-Nikodym theorem is more or less the same as in [Halmos1950]. In [JP2003, Definition 23.5, p. 200] the author takes the orthogonal projection of f ∈ L2(X , µ) to the subspace L2(X , σ(η), (Idη)∗µ) as the definition of the conditional expectation of f and then extends this definition to the whole of L1(X , µ).

A.2.2. Conditional measure and conditional probability. Let µ ∈ M(X ). The measure µ, conditioned w.r.t. a sub-σ-algebra B ⊂ ΣX , is defined as follows:

(A.4) µ(A|B) := Eµ(1A|B) ∈ L1(X , B, µ)

for A ∈ ΣX . In the probabilistic literature one omits µ in (A.4) and writes instead

P(A|B) := µ(A|B).

If B = η^{−1}(ΣY), where η : (X , ΣX ) → (Y, ΣY) is a measurable map, one uses the notation

P(A|η) := µ(A|η) := µ(A|B).

For any A ∈ ΣX and B ∈ B = η^{−1}(ΣY), formulas (A.3) and (A.1) imply

(A.5) µ(A ∩ B) = ∫_B 1_A dµ = ∫_B [d(Idη)∗(1_A µ) / d(Idη)∗µ] dµ = ∫_B E^{σ(η)}_µ(1_A) dµ = ∫_B µ(A|B) dµ.

If a measurable function ζA : Y → R represents η^µ_∗(1A) ∈ L1(Y, η∗µ), then one sets

(A.6) µB(A|y) := µB(A | η(x) = y) := ζA(y).

The RHS of (A.6) is called the measure of A under the conditioning η = y. Clearly the measure of A under the conditioning η = y, as a function of y ∈ Y, is well-defined up to a set of zero η∗µ-measure.

We also rewrite formula (A.6) as follows. For E ∈ ΣY

(A.7) µ(A ∩ η^{−1}(E)) = ∫_E µy(A) dη∗(µ)(y),

where µy(A) is a function in L1(Y, ΣY , η∗µ). Note that it is not always the case that for η∗(µ)-almost all y ∈ Y the set function µy(A) is countably additive (any two representatives of the equivalence class of µy(A) ∈ L1(Y, η∗(µ)) coincide outside a set of zero η∗(µ)-measure), see Example 10.4.9 in [Bogachev2007, p. 367, v. 2]. Nevertheless this becomes possible under some additional conditions of set-theoretic or topological character, and in this case we say that µy(A) is a regular conditional measure.


A.2.3. Regular conditional measure and Markov kernel.

Definition A.5. ([Bogachev2007, Definition 10.4.1, p. 357]) Suppose we are given a sub-σ-algebra B ⊂ ΣX . A function

µB(·, ·) : ΣX × X → R

is called a regular conditional measure on ΣX w.r.t. B if
(1) for every x ∈ X the function A 7→ µB(A, x) is a measure on ΣX ,
(2) for every A ∈ ΣX the function x 7→ µB(A, x) is B-measurable and µ-integrable,
(3) for all A ∈ ΣX , B ∈ B the following formula holds, cf. (A.5):

(A.8) µ(A ∩ B) = ∫_B µB(A, x) dµ(x).

Remark A.6. (1) Assume that µB(A, x) is a regular conditional measure. Formulas (A.8) and (A.5) imply that µB(A, x) : X → R is a representative of the conditional measure µ(A|B) ∈ L1(X , B, µ). Thus one also uses the notation µB(A|x), defined in (A.6), instead of µB(A, x).

(2) The equality (A.8) can be written in the following integral form: for every bounded ΣX -measurable function f and every B ∈ B one has

(A.9) ∫_B f(x) µ(dx) = ∫_B ∫_X f(y) dµB(y, x) dµ(x) = ∫_B E^B_µ f dµ.

The RHS of (A.9) can be written as Eµ(1B E^B_µ f). The equality (A.9) is also utilized to define the notion of conditional expectation, see e.g. [Bogachev2007, (10.1.2), p. 339, vol. 2]. It implies that for any bounded B-measurable function g on X we have

(A.10) ∫_X g f dµ = ∫_X g E^B_µ f dµ.

If f ∈ L2(X , µ) then (A.10) implies the following simple formula for E^B_µ(f):

(A.11) E^B_µ(f) = Π_{L2(X ,B,µ)}(f),

where Π_{L2(X ,B,µ)} : L2(X , ΣX , µ) → L2(X , B, µ) is the orthogonal projection. See [JP2003, p. 200] for defining conditional expectation using (A.11).

Regular conditional measures in Definition A.5 are examples of transition measures, for which we shall have a generalized version of the Fubini theorem (Theorem A.8).

Definition A.7. ([Bogachev2007, Definition 10.7.1, vol. 2, p. 384]) Let (X1, Σ1) and (X2, Σ2) be a pair of measurable spaces. A transition measure for this pair is a function P(·|·) : X1 × Σ2 → R with the following properties:
(i) for every fixed x ∈ X1 the function B 7→ P(x|B) is a measure on Σ2;
(ii) for every fixed B ∈ Σ2 the function x 7→ P(x|B) is measurable w.r.t. Σ1.


In the case where transition measures are probabilities in the second argument, they are called transition probabilities. In the probabilistic literature a transition probability is also called a Markov kernel, or a (probability) kernel [Kallenberg2002, p. 20].

Theorem A.8. ([Bogachev2007, Theorem 10.7.2, p. 384, vol. 2]) Let P(·|·) be a transition probability for spaces (X1, Σ1) and (X2, Σ2) and let ν be a probability measure on Σ1. Then there exists a unique probability measure µ on (X1 × X2, Σ1 ⊗ Σ2) with

(A.12) µ(B1 × B2) = ∫_{B1} P(x|B2) dν(x) for all B1 ∈ Σ1, B2 ∈ Σ2.

In addition, given any f ∈ L1(µ), for ν-a.e. x1 ∈ X1 the function x2 7→ f(x1, x2) on X2 is measurable w.r.t. the completed σ-algebra (Σ2)_{P(x1|·)} and P(x1|·)-integrable, the function

x1 7→ ∫_{X2} f(x1, x2) dP(x1|x2)

is measurable w.r.t. (Σ1)ν and ν-integrable, and one has

(A.13) ∫_{X1×X2} f(x1, x2) dµ(x1, x2) = ∫_{X1} ∫_{X2} f(x1, x2) dP(x1|x2) dν(x1).

Corollary A.9. If a parametrization (Θ, ΣΘ) → P(X ), θ 7→ pθ, defines a transition measure then pθ can be regarded as a regular conditional probability measure µ(·|θ) for the measure µ defined by (A.12).

Corollary A.10. Assume that Θ is a topological space and ΣΘ is its Borel σ-algebra. If the parametrization mapping Θ → P(X ), θ 7→ pθ, is continuous w.r.t. the strong topology, then pθ can be regarded as a regular conditional probability measure.

Proof. Since the parametrization is continuous, for any A ∈ ΣX the function θ 7→ pθ(A) is continuous and bounded, and hence measurable. Hence the parametrization Θ → P(X ) defines a transition probability measure, and applying Theorem A.8 we obtain Corollary A.10.

A.3. Joint distribution, regular conditional probability and Bayes' theorem. Till now we have defined the conditional probability measure µ(A|B) on a measurable space (X , Σ, µ), where B is a sub-σ-algebra of Σ. We can also define a conditional probability measure µX×Y(A|B), where A is an element of the σ-algebra ΣX of a measurable space X and B is a sub-σ-algebra of the σ-algebra ΣY of a measurable space Y, if a joint probability measure µX×Y on (X × Y, ΣX ⊗ ΣY) is given.

Let ΠX and ΠY denote the projections (X × Y, ΣX ⊗ ΣY) → (X , ΣX ) and (X × Y, ΣX ⊗ ΣY) → (Y, ΣY) respectively. Clearly ΠX and ΠY are measurable mappings. The marginal probability measure µX ∈ P(X ) is defined as

µX := (ΠX )∗(µX×Y).


The conditional probability measure µX|Y(·|y) is defined as follows for A ∈ ΣX :

(A.14) µX|Y(A|y) := E^{σ(ΠY)}_{µX×Y}(A × Y | y) = [d(ΠY)∗(1_{A×Y} µX×Y) / d(ΠY)∗µX×Y](y) ∈ L1(Y, (ΠY)∗µX×Y),

where the equality should be understood as an equality of equivalence classes of functions in L1(Y, (ΠY)∗µX×Y).

The conditional probability measure µX|Y(·|y) is called regular if for all A ∈ ΣX there is a function on Y, denoted by µX|Y(A|y), that represents the equivalence class µX|Y(A|y) in L1(Y, (ΠY)∗µX×Y), such that the function P(·|·) : Y × ΣX → R, (y, A) 7→ µX|Y(A|y), is a transition probability, i.e. a Markov kernel. A convenient way to express this condition is to use the language of probabilistic morphisms, see Subsection A.4.

• The Bayes theorem stated below assumes the existence of a regular conditional measure µX|Θ(A|θ),21 where µ is a joint distribution of random elements x ∈ X and θ ∈ Θ. Furthermore we also assume the condition that there exists a measure ν ∈ P(X ) such that ν dominates µθ := µ(·|θ) for all θ ∈ Θ.

Theorem A.11 (Bayes' theorem). ([Schervish1997, Theorem 1.31, p. 16]) Suppose that X has a parametric family {Pθ | θ ∈ Θ} such that Pθ ≪ ν for some ν ∈ P(X ) for all θ ∈ Θ. Let fX|Θ(x|θ) denote the conditional density of Pθ w.r.t. ν. Let µΘ be the prior distribution of Θ and let µΘ|X (·|x) be the conditional distribution of Θ given x. Then µΘ|X ≪ µΘ ν-a.s. and the Radon-Nikodym derivative is

dµΘ|X /dµΘ (θ|x) = fX|Θ(x|θ) / ∫_Θ fX|Θ(x|t) dµΘ(t)

for those x such that the denominator is neither 0 nor infinite. The prior predictive probability of the set of x values such that the denominator is 0 or infinite is 0, hence the posterior can be defined arbitrarily for such x values.

For a version of Bayes' theorem without the dominance condition for a family {µθ | θ ∈ Θ} see [JLLT2019, Theorem 3.6].

A.4. Probabilistic morphism and regular conditional probability. In 1962 Lawvere proposed a categorical approach to probability theory in which morphisms are Markov kernels; most importantly, he supplied the space P(X ) with a natural σ-algebra Σw, making the notion of Markov kernels, and hence many constructions in probability theory and mathematical statistics, functorial.

Given a measurable space X , let Fs(X ) denote the linear space of simple functions on X . There is a natural homomorphism I : Fs(X ) → S∗(X ) :=

21 Schervish considered only parametric families of conditional distributions [Schervish1997, p. 13].


Hom(S(X ), R), f 7→ If, defined by integration: If(µ) := ∫_X f dµ for f ∈ Fs(X ) and µ ∈ S(X ). Following Lawvere [Lawvere1962], we define Σw to be the smallest σ-algebra on S(X ) such that If is measurable for all f ∈ Fs(X ). We also denote by Σw the restriction of Σw to M(X ), M∗(X ) := M(X ) \ {0}, and P(X ).
• For a topological space X we shall consider the natural Borel σ-algebra B(X ). Let Cb(X ) be the space of bounded continuous functions on a topological space X . We denote by τv the smallest topology on S(X ) such that for any f ∈ Cb(X ) the map If : (S(X ), τv) → R is continuous. We also denote by τv the restriction of τv to M(X ) and P(X ). If X is separable and metrizable then the Borel σ-algebra on P(X ) generated by τv coincides with Σw.

Definition A.12. ([JLLT2019, Definition 2.4]) A probabilistic morphism22 (or an arrow) from a measurable space X to a measurable space Y is a measurable mapping from X to (P(Y), Σw).

Example A.13. Recall that the conditional probability measure µX|Y(·|y) is defined in (A.14). Then µX|Y(·|y) is regular iff there exists a measurable mapping P : Y → (P(X ), Σw) such that for all A ∈ ΣX the function P(·)(A) represents the equivalence class of µX|Y(A|·) in L1(Y, (ΠY)∗µX×Y).

We shall denote by T : X → (P(Y), Σw) the measurable mapping defining/generating a probabilistic morphism T : X ⇝ Y. Similarly, for a measurable mapping p : X → P(Y) we shall denote by p : X ⇝ Y the generated probabilistic morphism. Note that a probabilistic morphism is denoted by a curved arrow and a measurable mapping by a straight arrow.

Example A.14. ([JLLT2019, Example 2.6]) Let δx denote the Dirac measure concentrated at x. It is known that the map δ : X → (P(X ), Σw), x 7→ δ(x) := δx, is measurable [Giry1982]. If X is a topological space, then the map δ : X → (P(X ), τv) is continuous, since the composition If ∘ δ : X → R is continuous for any f ∈ Cb(X ). Hence, if κ : X → Y is a measurable mapping between measurable spaces (resp. a continuous mapping between separable metrizable spaces), then the map δ ∘ κ : X → P(Y) is a measurable mapping (resp. a continuous mapping). We regard κ as a probabilistic morphism defined by δ ∘ κ : X → P(Y). In particular, the identity mapping Id : X → X of a measurable space X is a probabilistic morphism generated by δ : X → P(X ). Graphically speaking, any straight arrow (a measurable mapping) κ : X → Y between measurable spaces can be seen as a curved arrow (a probabilistic morphism).

Given a probabilistic morphism T : X ⇝ Y, we define a linear map S∗(T) : S(X ) → S(Y), called the Markov morphism, as follows [Chentsov1972,

22 We thank J.P. Vigneaux for suggesting that we use "probabilistic morphism" instead of "probabilistic mapping".


Lemma 5.9, p. 72]:

(A.15) S∗(T)(µ)(B) := ∫_X T(x)(B) dµ(x)

for any µ ∈ S(X ) and B ∈ ΣY .

Proposition A.15. ([JLLT2019]) Assume that T : X ⇝ Y is a probabilistic morphism.

(1) Then T induces a linear bounded map S∗(T) : S(X ) → S(Y) w.r.t. the total variation norm || · ||TV . The restriction M∗(T) of S∗(T) to M(X ) (resp. P∗(T) of S∗(T) to P(X )) maps M(X ) to M(Y) (resp. P(X ) to P(Y)).

(2) Probabilistic morphisms are morphisms in the category of measurable spaces, i.e., for any probabilistic morphisms T1 : X ⇝ Y and T2 : Y ⇝ Z we have

(A.16) M∗(T2 ∘ T1) = M∗(T2) ∘ M∗(T1), P∗(T2 ∘ T1) = P∗(T2) ∘ P∗(T1).

(3) M∗ and P∗ are faithful functors.
(4) If ν ≪ µ ∈ M∗(X ) then M∗(T)(ν) ≪ M∗(T)(µ).

A.5. The Kolmogorov theorem. In the theory of consistency of estimators one would like to consider infinite sequences of instances (z1, · · · , z∞) and put a probability measure on such sequences. To do so one uses the Kolmogorov theorem, which provides the existence of such a probability measure with useful compatibility properties.

Theorem A.16. ([Bogachev2007, Theorem 7.7.1, p. 95, vol 1]) Suppose that for every finite set Λ ⊂ T we are given a probability measure µΛ on (ΩΛ, BΛ) such that the following consistency condition is fulfilled: if Λ1 ⊂ Λ2, then the image of the measure µΛ2 under the natural projection ΩΛ2 → ΩΛ1 coincides with µΛ1. Suppose that for every t ∈ T the measure µt on Bt possesses an approximating compact class Kt ⊂ Bt. Then there exists a probability measure µ on the measurable space (Ω := Πt∈T Ωt, B := ⊗t∈T Bt) such that the image of µ under the natural projection from Ω to ΩΛ is µΛ for each finite set Λ ⊂ T.

Appendix B. Law of large numbers and concentration-of-measure inequalities

In probability theory, the concentration of measure is a property of a large number of variables, such as in laws of large numbers. Concentration-of-measure inequalities provide bounds on the probability that a measurable mapping f : X → R deviates from its mean value Eµ(f), or other typical values, by a given amount. Thus a concentration-of-measure inequality quantifies the rate of convergence in the law of large numbers.


B.1. Laws of large numbers.

Proposition B.1 (Strong law of large numbers). ([Kallenberg2002, Theorem 4.23, p. 73]) Let ξ, ξ1, ξ2, · · · ∈ (R∞, µ∞), where µ ∈ P(R), and p ∈ (0, 2). Then n^{−1/p} ∑_{k≤n} ξk converges a.s. iff Eµ|ξ|^p < ∞ and either p ≤ 1 or Eµξ = 0. In that case the limit equals Eµξ for p = 1 and is otherwise 0.

If we replace the strong condition Eµ|ξ|^p < ∞ in the law of large numbers by weaker conditions, then we have to replace the convergence a.e. (or a.s., or with probability 1) in the law of large numbers by convergence in probability in the weak law of large numbers.

Proposition B.2 (Weak law of large numbers). ([Kallenberg2002, Theorem 5.16, p. 95]) Let ξ, ξ1, ξ2, · · · ∈ (R∞, µ∞), where µ ∈ P(R), p ∈ (0, 2) and c ∈ R. Then ξ̄n := n^{−1/p} ∑_{k≤n} ξk converges in probability to c, i.e., for all ε > 0

lim_{n→∞} µ( |ξ̄n − c| > ε ) = 0,

iff the following conditions hold as r → ∞, depending on the value of p:
p < 1: r^p µ( |ξ| > r ) → 0 and c = 0;
p = 1: r µ( |ξ| > r ) → 0 and Eµ(ξ; |ξ| ≤ r) → c;
p > 1: r^p µ( |ξ| > r ) → 0 and Eµ(ξ) = c = 0.

B.2. Markov's inequality. For any f ∈ L1(X , µ) such that f(x) ≥ 0 for all x ∈ X and any t > 0 we have

(B.1) µ{x ∈ X | f(x) > t} ≤ Eµ(f)/t.

B.3. Chebyshev's inequality. For any f ∈ L1(X , µ) and t > 0 we have

(B.2) µ{x : |f(x) − Eµ(f)| ≥ t} ≤ Eµ( |f − Eµ(f)|² ) / t².

Remark B.3. (1) Using the identity

(B.3) Eµ(f) = E_{fµ}(1_X ) = E_{f∗(fµ)}(1_R) = ∫_0^∞ f∗(µ)( (t, +∞) ) dt

and noting that the LHS of (B.1) is equal to f∗(µ)( (t, +∞) ), which is a monotone function in t, we obtain the Markov inequality (B.1) immediately.

(2) For any monotone function ϕ : (R≥0, dt) → (R≥0, dt), applying (B.1) we have

(B.4) f∗(µ)( (t, +∞) ) = ϕ∗ f∗(µ)( (ϕ(t), +∞) ) ≤ Eµ(ϕ ∘ f) / ϕ(t).

The Chebyshev inequality (B.2) follows from (B.4) by replacing f in (B.4) with |g − Eµ(g)| for g ∈ L1(X , µ) and setting ϕ(t) := t².


B.4. Hoeffding's inequality. ([Hoeffding1963]) Let θ_1, ..., θ_m be i.i.d. R-valued random variables on Z, where µ ∈ P(Z). Assume that E_µ(θ_i) = θ̄ for all i and that µ{z ∈ Z | a ≤ θ_i(z) ≤ b} = 1. Then for any ε > 0 we have

(B.5)   µ^m{z ∈ Z^m : |(1/m) ∑_{i=1}^m θ_i(z_i) − θ̄| > ε} ≤ 2 exp(−2mε²/(b − a)²),

where z = (z_1, ..., z_m).
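A minimal Monte Carlo check of (B.5) (my addition, assuming NumPy; the choices θ_i ~ Uniform[0, 1], so a = 0, b = 1, θ̄ = 1/2, and the values of m and ε are arbitrary): the observed frequency of deviations larger than ε should not exceed the right-hand side of (B.5).

    # Empirical check of Hoeffding's inequality (B.5) for theta_i ~ Uniform[0, 1]:
    # the deviation frequency should be at most 2 exp(-2 m eps^2 / (b - a)^2).
    import numpy as np

    rng = np.random.default_rng(2)
    m, eps, trials = 100, 0.1, 100_000
    samples = rng.uniform(0.0, 1.0, size=(trials, m))
    deviation = np.abs(samples.mean(axis=1) - 0.5)

    empirical = np.mean(deviation > eps)
    bound = 2.0 * np.exp(-2.0 * m * eps**2 / (1.0 - 0.0)**2)
    print(empirical, bound)                            # empirical frequency <= bound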

B.5. Bernstein's inequality. Let θ be an R-valued random variable on a probability space (Z, µ) with mean E_µ(θ) = θ̄ and variance σ² = V_µ(θ). If |θ − E_µ(θ)| ≤ M a.s., then for all ε > 0 we have

(B.6)   µ^m{z ∈ Z^m : |(1/m) ∑_{i=1}^m θ(z_i) − θ̄| > ε} ≤ 2 exp(−mε²/(2(σ² + Mε/3))),

where z = (z_1, ..., z_m).
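When the variance σ² is small, the Bernstein bound (B.6) is considerably sharper than the Hoeffding bound (B.5). The short computation below (my addition; the numbers m = 100, ε = 0.1, σ² = 0.01, M = 1, b − a = 1 are arbitrary and correspond to a [0, 1]-valued variable with small variance) evaluates both right-hand sides.

    # Comparing the Hoeffding bound (B.5) and the Bernstein bound (B.6)
    # for a [0, 1]-valued variable with small variance sigma^2 = 0.01.
    import math

    m, eps = 100, 0.1
    sigma2, M, b_minus_a = 0.01, 1.0, 1.0

    hoeffding = 2.0 * math.exp(-2.0 * m * eps**2 / b_minus_a**2)
    bernstein = 2.0 * math.exp(-m * eps**2 / (2.0 * (sigma2 + M * eps / 3.0)))
    print(hoeffding, bernstein)    # Bernstein is much smaller when sigma^2 is small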

B.6. McDiarmid's inequality. (also known as the bounded differences inequality, or the Hoeffding/Azuma inequality). Let X_1, ..., X_m ∈ X be i.i.d. random variables with distribution µ ∈ P(X). Assume that f : X^m → R satisfies the following bounded-differences property for some c > 0: for all i ∈ [1, m] and all x_1, ..., x_m, x'_i ∈ X we have

|f(x_1, ..., x_m) − f(x_1, ..., x_{i−1}, x'_i, x_{i+1}, ..., x_m)| ≤ c.

Then for all δ ∈ (0, 1) we have

(B.7)   µ^m{S ∈ X^m : |f(S) − E_{µ^m} f(S)| ≤ c √(m ln(2/δ)/2)} ≥ 1 − δ.
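As a standard illustration of (B.7) (my example, not from the notes): for the empirical mean f(x_1, ..., x_m) = m^{-1} ∑_i x_i of [0, 1]-valued samples, replacing one coordinate changes f by at most c = 1/m, so (B.7) gives a deviation bound of order 1/√m. The helper below evaluates this bound; the name mcdiarmid_bound and the chosen parameters are illustrative only.

    # McDiarmid's bound (B.7) applied to the empirical mean of [0, 1]-valued samples:
    # c = 1/m, so with probability >= 1 - delta the deviation is at most
    # (1/m) * sqrt(m * ln(2/delta) / 2) = sqrt(ln(2/delta) / (2 m)).
    import math

    def mcdiarmid_bound(c: float, m: int, delta: float) -> float:
        return c * math.sqrt(m * math.log(2.0 / delta) / 2.0)

    m, delta = 1000, 0.05
    print(mcdiarmid_bound(1.0 / m, m, delta))          # about 0.043 for m = 1000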

References

[Amari2016] S. Amari, Information Geometry and Its Applications, Springer, 2016.
[AJLS2015] N. Ay, J. Jost, H. V. Le and L. Schwachhofer, Information geometry and sufficient statistics, Probability Theory and Related Fields, 162 (2015), 327-364, arXiv:1207.6736.
[AJLS2017] N. Ay, J. Jost, H. V. Le and L. Schwachhofer, Information Geometry, Springer, 2017.
[AJLS2018] N. Ay, J. Jost, H. V. Le and L. Schwachhofer, Parametrized measure models, Bernoulli, 24 (2018), no. 3, 1692-1725, arXiv:1510.07305.
[BHMSY2019] S. Ben-David, P. Hrubes, S. Moran and A. Yehudayoff, Learnability can be undecidable, Nature Machine Intelligence, 1 (2019), 44-48.
[Billingsley1999] P. Billingsley, Convergence of Probability Measures, 2nd edition, John Wiley and Sons, 1999.
[Bishop2006] C. M. Bishop, Pattern Recognition and Machine Learning, Springer, 2006.
[Bogachev2007] V. I. Bogachev, Measure Theory I, II, Springer, 2007.
[Bogachev2010] V. I. Bogachev, Differentiable Measures and the Malliavin Calculus, AMS, 2010.
[Bogachev2018] V. I. Bogachev, Weak Convergence of Measures, AMS, 2018.
[Borovkov1998] A. A. Borovkov, Mathematical Statistics, Gordon and Breach Science Publishers, 1998.


[CP1997] J. T. Chang and D. Pollard, Conditioning as disintegration, Statistica Neerlandica, 51 (1997), 287-317.
[Chentsov1972] N. N. Chentsov, Statistical Decision Rules and Optimal Inference, Translations of Mathematical Monographs, AMS, Providence, Rhode Island, 1982; translated from the Russian original, Nauka, Moscow, 1972.
[CS2001] F. Cucker and S. Smale, On the mathematical foundations of learning, Bulletin of the AMS, 39 (2001), 1-49.
[Cybenko1989] G. Cybenko, Approximation by superpositions of a sigmoidal function, Mathematics of Control, Signals, and Systems, 2 (1989), 303-314.
[DGL1997] L. Devroye, L. Gyorfi and G. Lugosi, A Probabilistic Theory of Pattern Recognition, Springer, 1996.
[Fisher1925] R. A. Fisher, Theory of statistical estimation, Proceedings of the Cambridge Philosophical Society, 22 (1925), 700-725.
[GBC2016] I. Goodfellow, Y. Bengio and A. Courville, Deep Learning, MIT Press, 2016.
[Ghahramani2013] Z. Ghahramani, Bayesian nonparametrics and the probabilistic approach to modelling, Philosophical Transactions of the Royal Society A, 371 (2013), 20110553.
[Ghahramani2015] Z. Ghahramani, Probabilistic machine learning and artificial intelligence, Nature, 521 (2015), 452-459.
[Giry1982] M. Giry, A categorical approach to probability theory, in: B. Banaschewski (ed.), Categorical Aspects of Topology and Analysis, Lecture Notes in Mathematics 915, 68-85, Springer, 1982.
[GS2003] J. K. Ghosh and R. V. Ramamoorthi, Bayesian Nonparametrics, Springer, 2003.
[Graves2012] A. Graves, Supervised Sequence Labelling with Recurrent Neural Networks, Springer, 2012.
[GV2017] S. Ghosal and A. van der Vaart, Fundamentals of Nonparametric Bayesian Inference, Cambridge University Press, 2017.
[JLS2017] J. Jost, H. V. Le and L. Schwachhofer, The Cramer-Rao inequality on singular statistical models, arXiv:1703.09403.
[Halmos1950] P. R. Halmos, Measure Theory, Van Nostrand, 1950.
[Haykin2008] S. Haykin, Neural Networks and Learning Machines, 2008.
[HTF2008] T. Hastie, R. Tibshirani and J. Friedman, The Elements of Statistical Learning: Data Mining, Inference, and Prediction, Springer, 2008.
[Hoeffding1963] W. Hoeffding, Probability inequalities for sums of bounded random variables, Journal of the American Statistical Association, 58 (1963), no. 301, 13-30.
[IH1981] I. A. Ibragimov and R. Z. Has'minskii, Statistical Estimation: Asymptotic Theory, Springer, 1981.
[IZ2013] P. Iglesias-Zemmour, Diffeology, AMS, 2013.
[Janssen2003] A. Janssen, A nonparametric Cramer-Rao inequality, Statistics & Probability Letters, 64 (2003), 347-358.
[Jaynes2003] E. T. Jaynes, Probability Theory: The Logic of Science, Cambridge University Press, 2003.
[Jost2005] J. Jost, Postmodern Analysis, Springer, 2005.
[JLLT2019] J. Jost, H. V. Le, D. H. Luu and T. D. Tran, Probabilistic mappings and Bayesian nonparametrics, arXiv:1905.11448.
[JP2003] J. Jacod and P. Protter, Probability Essentials, Springer, 2nd edition, 2004.
[KM1997] A. Kriegl and P. W. Michor, The Convenient Setting of Global Analysis, AMS, 1997.
[Kallenberg2002] O. Kallenberg, Foundations of Modern Probability, Springer, 2002.
[Kallenberg2017] O. Kallenberg, Random Measures, Springer, 2017.


[Kolmogoroff1933] A. Kolmogoroff, Grundbegriffe der Wahrscheinlichkeitsrechnung, Springer, Berlin, 1933. English translation: Foundations of the Theory of Probability, Chelsea, New York, 1950.
[Kolmogorov1957] A. Kolmogorov, On the representation of continuous functions of many variables by superposition of continuous functions of one variable and addition, Dokl. Acad. Nauk USSR, 114 (1957), 953-956.
[Kullback1959] S. Kullback, Information Theory and Statistics, John Wiley and Sons, 1959.
[Lawvere1962] W. F. Lawvere, The category of probabilistic mappings (1962), unpublished; available at https://ncatlab.org/nlab/files/lawvereprobability1962.pdf.
[Le2017] H. V. Le, The uniqueness of the Fisher metric as information metric, AISM, 69 (2017), 879-896, arXiv:math/1306.1465.
[Le2019] H. V. Le, Diffeological statistical models, the Fisher metric and probabilistic mappings, Mathematics 2020, 8, 167; doi:10.3390/math8020167, arXiv:1912.02090.
[Ledoux2001] M. Ledoux, The Concentration of Measure Phenomenon, AMS, 2001.
[Lorentz1976] G. Lorentz, The thirteenth problem of Hilbert, in: Proceedings of Symposia in Pure Mathematics, 28 (1976), 419-430, Providence, RI.
[Lugosi2009] G. Lugosi, Concentration-of-measure inequalities, lecture notes, 2009, available at http://www.econ.upf.edu/~lugosi/anu.pdf.
[LC1998] E. L. Lehmann and G. Casella, Theory of Point Estimation, Springer, 1998.
[MFSS2017] K. Muandet, K. Fukumizu, B. Sriperumbudur and B. Scholkopf, Kernel Mean Embedding of Distributions: A Review and Beyond, Foundations and Trends in Machine Learning, 10 (2017), no. 1-2, 1-141, arXiv:1605.09522.
[MRT2012] M. Mohri, A. Rostamizadeh and A. Talwalkar, Foundations of Machine Learning, MIT Press, 2012.
[Murphy2012] K. P. Murphy, Machine Learning: A Probabilistic Perspective, MIT Press, 2012.
[Neal1996] R. M. Neal, Bayesian Learning for Neural Networks, Springer, 1996.
[RN2010] S. J. Russell and P. Norvig, Artificial Intelligence: A Modern Approach, Prentice Hall, 2010.
[Robert2007] C. P. Robert, The Bayesian Choice: From Decision-Theoretic Foundations to Computational Implementation, Springer, 2007.
[SB2018] R. S. Sutton and A. G. Barto, Reinforcement Learning, second edition, MIT Press, 2018.
[Shioya2016] T. Shioya, Metric Measure Geometry: Gromov's Theory of Convergence and Concentration of Metrics and Measures, EMS, 2016.
[SSBD2014] S. Shalev-Shwartz and S. Ben-David, Understanding Machine Learning: From Theory to Algorithms, Cambridge University Press, 2014.
[Schervish1997] M. J. Schervish, Theory of Statistics, Springer, 2nd corrected printing, 1997.
[Sugiyama2015] M. Sugiyama, Statistical Reinforcement Learning, CRC Press, 2015.
[Sugiyama2016] M. Sugiyama, Introduction to Statistical Machine Learning, Elsevier, 2016.
[Tsybakov2009] A. B. Tsybakov, Introduction to Nonparametric Estimation, Springer, 2009.
[Valiant1984] L. Valiant, A theory of the learnable, Communications of the ACM, 27, 1984.
[Vapnik1998] V. Vapnik, Statistical Learning Theory, John Wiley and Sons, 1998.
[Vapnik2000] V. Vapnik, The Nature of Statistical Learning Theory, Springer, 2000.


[Vapnik2006] V. Vapnik, Estimation of Dependences Based on Empirical Data, Springer, 2006.
[Wald1950] A. Wald, Statistical Decision Functions, Wiley, New York; Chapman & Hall, London, 1950.