
Introduction to Machine Learning CMU-10701

8. Stochastic Convergence

Barnabás Póczos

Motivation


What have we seen so far?


Several algorithms that seem to work fine on training datasets:
• Linear regression
• Naïve Bayes classifier
• Perceptron classifier
• Support Vector Machines for regression and classification

How good are these algorithms on unknown test sets? How many training samples do we need to achieve small error? What is the smallest possible error we can achieve?

⇒ Learning Theory. To answer these questions, we will need a few powerful tools.

Basic Estimation Theory


Rolling a Die: Estimation of the Parameters θ1, θ2, …, θ6


Does the MLE converge to the right value? How fast does it converge?
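As a quick empirical check (a minimal Python sketch, assuming a fair die, so each θi = 1/6): the MLE of a multinomial parameter is the relative frequency of each face, and its error shrinks as the number of rolls grows.

```python
import numpy as np

rng = np.random.default_rng(0)
true_theta = np.full(6, 1 / 6)  # assumed: a fair die, theta_i = 1/6

# MLE for a multinomial: theta_hat_i = (# times face i came up) / n
for n in [60, 600, 6000, 60000]:
    rolls = rng.integers(1, 7, size=n)            # faces 1..6
    counts = np.bincount(rolls, minlength=7)[1:]  # counts for faces 1..6
    theta_hat = counts / n
    print(n, np.abs(theta_hat - true_theta).max())
```

The printed maximum error shrinks roughly like 1/√n, previewing the convergence rates studied below.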


Rolling a Die: Calculating the Empirical Average

Does the empirical average converge to the true mean? How fast does it converge?

[Figure: 5 sample traces of the running empirical average]
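The traces in the figure can be reproduced with a short simulation (a sketch, assuming a fair die whose true mean is 3.5):

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(1)
n = 10_000

# 5 independent traces of the running empirical average of die rolls
for _ in range(5):
    rolls = rng.integers(1, 7, size=n)
    running_avg = np.cumsum(rolls) / np.arange(1, n + 1)
    plt.plot(running_avg, lw=0.8)

plt.axhline(3.5, color="k", ls="--", label="true mean = 3.5")
plt.xlabel("number of rolls n")
plt.ylabel("empirical average")
plt.legend()
plt.show()
```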


How fast do they converge?


Key Questions

I want to know the coin parameter θ ∈ [0,1] within ε = 0.1 error, with probability at least 1−δ = 0.95. How many flips do I need?


• Do empirical averages converge?
• Does the MLE converge in the dice rolling problem?
• What do we mean by convergence?
• What is the rate of convergence?

Applications:
• drug testing (Does this drug modify the average blood pressure?)
• user interface design (We will see this later.)

Outline

Theory:
• Stochastic convergence:
– Weak convergence = convergence in distribution
– Convergence in probability
– Strong convergence (almost surely)
– Convergence in Lp norm


• Limit theorems:
– Law of large numbers
– Central limit theorem

• Tail bounds:
– Markov, Chebyshev

Stochastic convergence definitions and properties


Convergence of vectors

(For deterministic vectors, recall: vn ∈ R^d converges to v if ‖vn − v‖ → 0, equivalently componentwise.)


Convergence in Distribution = Convergence Weakly = Convergence in Law


Notation:

Let {Z, Z1, Z2, …} be a sequence of random variables, with distribution functions F, F1, F2, … We write Zn ⇝ Z (or Zn →d Z).

Definition: Zn converges to Z in distribution if

lim (n→∞) Fn(z) = F(z)

at every point z where F is continuous.

This is the “weakest” convergence.

Only the distribution functions converge! (NOT the values of the random variables)

[Figure: a distribution function jumping from 0 to 1 at the point a]


Continuity is important!

Example: Let Zn be uniform on [0, 1/n]. Then Zn ⇝ Z, where the limit random variable is the constant 0.

Proof: For z < 0, Fn(z) = 0 = F(z). For z > 0, Fn(z) = 1 whenever n > 1/z, so Fn(z) → 1 = F(z). At the discontinuity point z = 0, however, Fn(0) = 0 for every n while F(0) = 1. This is exactly why the definition requires convergence only at the continuity points of F.

In this example the limit Z is discrete, not random (constant 0), although Zn is a continuous random variable.


Properties

Scheffé's theorem: convergence of the probability density functions ⇒ convergence in distribution.

Example (Central Limit Theorem): if X1, X2, … are i.i.d. with mean μ and variance 0 < σ² < ∞, then √n (X̄n − μ) / σ ⇝ N(0, 1).

Zn and Z can still be independent even if their distributions are the same!
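A minimal simulation sketch (assuming fair die rolls, so μ = 3.5 and σ² = 35/12) showing the distribution functions converging to the standard normal CDF, as the CLT example asserts:

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(2)
mu, sigma = 3.5, np.sqrt(35 / 12)  # mean and std of a single fair die roll

# Z_n = sqrt(n) * (empirical average - mu) / sigma, for many repetitions
n, reps = 500, 20_000
z = np.sqrt(n) * (rng.integers(1, 7, size=(reps, n)).mean(axis=1) - mu) / sigma

# Compare the empirical CDF of Z_n with the N(0,1) CDF at a few points.
for t in [-2.0, -1.0, 0.0, 1.0, 2.0]:
    print(f"t={t:+.1f}  P(Z_n <= t) ≈ {np.mean(z <= t):.4f}   Phi(t) = {norm.cdf(t):.4f}")
```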


Convergence in Probability


Notation: Zn →P Z

Definition: Zn converges to Z in probability if for every ε > 0,

lim (n→∞) P( |Zn − Z| > ε ) = 0.

This indeed measures how far the values of Zn(ω) and Z(ω) are from each other.

Almost Sure Convergence


Notation: Zn →a.s. Z

Definition: Zn converges to Z almost surely if

P( lim (n→∞) Zn = Z ) = 1.

Convergence in p-th mean, Lp norm

Notation: Zn →Lp Z

Definition: for p ≥ 1, Zn converges to Z in Lp norm (in p-th mean) if

lim (n→∞) E[ |Zn − Z|^p ] = 0.

Properties:
• Zn →a.s. Z ⇒ Zn →P Z
• Zn →Lp Z ⇒ Zn →P Z (for p ≥ 1)
• Zn →P Z ⇒ Zn ⇝ Z
• Zn →Lq Z ⇒ Zn →Lp Z for q ≥ p ≥ 1

None of the reverse implications hold in general.

Counterexamples
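The counterexamples themselves did not survive this extraction, but a classic one (a sketch, not necessarily the lecture's example) separates convergence in probability from L1 convergence:

```python
import numpy as np

rng = np.random.default_rng(3)

# Z_n = n with probability 1/n, and 0 otherwise.
# P(|Z_n| > eps) = 1/n -> 0, so Z_n -> 0 in probability,
# but E|Z_n| = n * (1/n) = 1 for every n, so Z_n does NOT converge to 0 in L1.
for n in [10, 100, 1000, 10000]:
    samples = np.where(rng.random(100_000) < 1 / n, n, 0)
    print(f"n={n:5d}  P(|Z_n| > 0.1) ≈ {np.mean(samples > 0.1):.4f}   E|Z_n| ≈ {samples.mean():.3f}")
```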


Further Reading on Stochastic Convergence

• http://en.wikipedia.org/wiki/Convergence_of_random_variables

• Patrick Billingsley: Probability and Measure

• Patrick Billingsley: Convergence of Probability Measures


Finite sample tail bounds

Useful tools!


Markov's inequality

If X is any nonnegative random variable and a > 0, then

P( X ≥ a ) ≤ E[X] / a.

Proof: Decompose the expectation:

E[X] = E[ X · 1{X ≥ a} ] + E[ X · 1{X < a} ] ≥ a · E[ 1{X ≥ a} ] = a · P( X ≥ a ).

Dividing both sides by a gives the claim.

Corollary: Chebyshev's inequality
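A quick numerical sanity check of the bound (a sketch, using an exponential variable with mean 2 as an arbitrary nonnegative example):

```python
import numpy as np

rng = np.random.default_rng(4)
x = rng.exponential(scale=2.0, size=1_000_000)  # nonnegative, E[X] = 2

# Markov: P(X >= a) <= E[X] / a for every a > 0
for a in [1.0, 2.0, 5.0, 10.0]:
    print(f"a={a:4.1f}  P(X >= a) ≈ {np.mean(x >= a):.4f}  <=  E[X]/a ≈ {x.mean() / a:.4f}")
```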

Chebyshev's inequality

If X is any random variable with finite variance and a > 0, then

P( |X − E[X]| ≥ a ) ≤ Var(X) / a².

Proof: Apply Markov's inequality to the nonnegative random variable (X − E[X])²:

P( |X − E[X]| ≥ a ) = P( (X − E[X])² ≥ a² ) ≤ E[ (X − E[X])² ] / a² = Var(X) / a².

Here Var(X) is the variance of X, defined as Var(X) = E[ (X − E[X])² ].
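The same kind of check for Chebyshev, back on the die-rolling example (a sketch, assuming a fair die):

```python
import numpy as np

rng = np.random.default_rng(5)
x = rng.integers(1, 7, size=1_000_000).astype(float)  # fair die rolls
mu, var = x.mean(), x.var()

# Chebyshev: P(|X - E[X]| >= a) <= Var(X) / a^2
for a in [1.0, 1.5, 2.0, 2.5]:
    empirical = np.mean(np.abs(x - mu) >= a)
    print(f"a={a:3.1f}  P(|X-mu| >= a) ≈ {empirical:.4f}  <=  Var/a² ≈ {var / a**2:.4f}")
```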


Generalizations of Chebyshev's inequality

Chebyshev: P( |X − E[X]| ≥ a ) ≤ Var(X) / a²

Asymmetric two-sided case (X has an asymmetric distribution)

Symmetric two-sided case (X has a symmetric distribution)

There are many other generalizations, for example for multivariate X.
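One well-known one-sided variant (an illustration, not necessarily the version given on the slide) is Cantelli's inequality: for any X with mean μ and variance σ²,

```latex
\[
  P(X - \mu \ge a) \;\le\; \frac{\sigma^2}{\sigma^2 + a^2}, \qquad a > 0 .
\]
```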

Higher moments?

Chebyshev: P( |X − E[X]| ≥ a ) ≤ Var(X) / a²

Markov: P( |X| ≥ a ) ≤ E|X| / a

Higher moments: P( |X| ≥ a ) ≤ E[ |X|^n ] / a^n, where n ≥ 1

Other functions instead of polynomials? The exp function: for any t > 0,

P( X ≥ a ) = P( e^(tX) ≥ e^(ta) ) ≤ E[ e^(tX) ] / e^(ta).

Proof: apply Markov's inequality to the nonnegative random variable e^(tX).
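A sketch comparing these bounds for the sum of n fair coin flips (my own illustration; the moment generating function of a Binomial(n, 1/2) is ((1 + e^t)/2)^n):

```python
import numpy as np

n, a = 100, 70   # S ~ Binomial(n, 1/2); bound the tail P(S >= a)
d = a - n / 2    # deviation from the mean

# Chebyshev via |S - n/2| >= d, with Var(S) = n/4
chebyshev = (n / 4) / d**2

# Exponential-moment (Chernoff-style) bound: min over t > 0 of E[e^{tS}] / e^{ta}
t = np.linspace(1e-3, 3, 10_000)
chernoff = np.exp((n * np.log((1 + np.exp(t)) / 2) - t * a).min())

# Monte Carlo estimate of the true tail probability
rng = np.random.default_rng(6)
s = rng.binomial(n, 0.5, size=2_000_000)

print(f"empirical ≈ {np.mean(s >= a):.2e}")
print(f"Chebyshev <= {chebyshev:.2e}")
print(f"Chernoff  <= {chernoff:.2e}")
```

The exponential bound decays geometrically in the deviation, while Chebyshev only decays polynomially; this is why exponential-moment tail bounds are so useful later in learning theory.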


Law of Large Numbers


Do empirical averages converge?

Answer: Yes, they do. (Law of large numbers)

Chebyshev’s inequality is good enough to study the question: Do the empirical averages converge to the true mean?
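In fact, Chebyshev already answers the earlier coin-flipping question (a worked sketch: the variance of a Bernoulli(θ) variable is θ(1 − θ) ≤ 1/4):

```latex
\[
  P\bigl(|\hat\theta_n - \theta| \ge \varepsilon\bigr)
  \;\le\; \frac{\mathrm{Var}(\hat\theta_n)}{\varepsilon^2}
  \;\le\; \frac{1}{4 n \varepsilon^2} \;\le\; \delta
  \quad\Longleftarrow\quad
  n \;\ge\; \frac{1}{4 \varepsilon^2 \delta}
  \;=\; \frac{1}{4 \cdot 0.1^2 \cdot 0.05} \;=\; 500 .
\]
```

So about 500 flips suffice for ε = 0.1 and δ = 0.05 via Chebyshev (sharper exponential bounds need far fewer).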

Law of Large Numbers


Strong Law of Large Numbers: if X1, X2, … are i.i.d. with mean μ and E|X1| < ∞, then the empirical average X̄n = (1/n) Σ Xi converges to μ almost surely.

Weak Law of Large Numbers: under the same assumptions, X̄n converges to μ in probability.

Weak Law of Large Numbers, Proof I:

Assume finite variance σ² = Var(X1) < ∞. (This assumption is not essential; it just simplifies the proof.)

Since Var(X̄n) = σ²/n, Chebyshev's inequality gives, for any ε > 0,

P( |X̄n − μ| ≥ ε ) ≤ Var(X̄n) / ε² = σ² / (n ε²).

Therefore,

P( |X̄n − μ| < ε ) ≥ 1 − σ² / (n ε²).

As n approaches infinity, this expression approaches 1, so X̄n →P μ.
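A simulation sketch of the proof's quantitative content (assuming fair die rolls, so μ = 3.5 and σ² = 35/12): the deviation probability stays below the Chebyshev bound σ²/(n ε²), and both go to 0.

```python
import numpy as np

rng = np.random.default_rng(7)
mu, var, eps = 3.5, 35 / 12, 0.1  # fair die: mean 3.5, variance 35/12

# Estimate P(|empirical average - mu| >= eps) and compare with var / (n * eps^2).
# (For small n the bound exceeds 1 and is vacuous, but it is still valid.)
for n in [100, 1000, 10000]:
    means = rng.integers(1, 7, size=(2_000, n)).mean(axis=1)
    empirical = np.mean(np.abs(means - mu) >= eps)
    bound = var / (n * eps**2)
    print(f"n={n:6d}  P ≈ {empirical:.4f}  <=  bound = {bound:.4f}")
```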


What we have learned today


Theory:
• Stochastic convergence:
– Weak convergence = convergence in distribution
– Convergence in probability
– Strong convergence (almost surely)
– Convergence in Lp norm

• Limit theorems:

– Law of large numbers
– Central limit theorem

• Tail bounds:
– Markov, Chebyshev

Thanks for your attention

