Machine Learning
Seth Flaxman, Ph.D.
Upcoming Seminar: January 13-15, 2022, Remote Seminar
Day 1: Introduction to Machine Learning
Seth Flaxman www.sethrf.com
What is machine learning?
https://www.youtube.com/watch?v=f_uwKZIAeM0
What is machine learning?
For our purposes, follow Tom Mitchell:
A computer program is said to learn from experience E with respect to some class of tasks T and performance measure P if its performance at tasks in T, as measured by P, improves with experience E.
What is statistical machine learning?
Both computer science and statistics provide methods for learning from data.
Computer science takes an algorithmic perspective: propose an algorithm for data, study the algorithm formally.
Statistics takes an inferential perspective: propose a model for data, study the model formally.
Statistical machine learning (and computational statistics) is the intersection: an algorithmic perspective on statistical methods, a statistical perspective on algorithms.
Supervised vs. unsupervised learning: terminology
▶ Supervised learning, also known as: regression, classification, pattern recognition, recovery, sensing, ...
▶ Unsupervised learning, also known as: clustering, data mining, dimensionality reduction, ...
▶ Inputs, also known as: independent variables, predictors, covariates, patterns, x, X, ...
▶ Outputs, also known as: dependent variables, responses, labels, y, Y, ...
[Diagram: inputs x map through a function f to outputs y, i.e. y = f(x)]
Supervised learning
Supervised learning, most basic setup
[Diagram: inputs x map through a learned function f to outputs y]

Given training inputs $x \in \mathcal{X}$ and outputs $y \in \mathcal{Y}$:
$$(x_i, y_i), \quad i = 1, \ldots, n \tag{1}$$
Learn a function $f$ (algorithm, black box, decision rule, classifier, probability distribution)
$$f : \mathcal{X} \to \mathcal{Y} \tag{2}$$
i.e. on the training inputs, we would like our function $f$ to approximately recover the training outputs:
$$f(x_i) \approx y_i \tag{3}$$
Unsupervised learning, clustering and dim. reduction
Given training inputs $x \in \mathcal{X}$, learn:
▶ Clustering: a function $f$ giving cluster assignments $1, \ldots, K$,
$$f(x) \in \{1, \ldots, K\} \tag{4}$$
such that $C_k = \{x_i \mid f(x_i) = k\}$ is homogeneous for each $k$.
▶ Dimensionality reduction: if $X \in \mathbb{R}^p$ for large $p$, learn a latent representation $Z \in \mathbb{R}^d$, $d \ll p$, such that $Z$ explains most of the variance in $X$ (see the PCA sketch below).
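As a concrete illustration of the dimensionality reduction setup (not from the original slides), here is a minimal PCA sketch via the SVD; the toy data and the function name `pca` are illustrative:

```python
import numpy as np

def pca(X, d):
    """Project X (n x p) onto its top-d principal components."""
    Xc = X - X.mean(axis=0)                          # center the data
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    Z = Xc @ Vt[:d].T                                # n x d latent representation Z
    explained = (S[:d] ** 2).sum() / (S ** 2).sum()  # fraction of variance explained
    return Z, explained

# toy example: 3-dimensional data that is mostly 1-dimensional
rng = np.random.default_rng(0)
t = rng.normal(size=(100, 1))
X = t @ np.array([[1.0, 2.0, 3.0]]) + 0.1 * rng.normal(size=(100, 3))
Z, frac = pca(X, d=1)
print(frac)  # close to 1: one component explains most of the variance
```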
Supervised learning: k-nearest neighbors
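The k-nearest neighbors figure from this slide is not reproduced here; in its place, a minimal numpy sketch of the classifier (the toy data and helper name are illustrative):

```python
import numpy as np

def knn_predict(X_train, y_train, x_new, k=3):
    """Predict the label of x_new by majority vote among its k nearest neighbors."""
    dists = np.linalg.norm(X_train - x_new, axis=1)  # Euclidean distance to each training point
    nearest = np.argsort(dists)[:k]                  # indices of the k closest points
    labels, counts = np.unique(y_train[nearest], return_counts=True)
    return labels[np.argmax(counts)]                 # majority vote

# toy data: two Gaussian blobs
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (20, 2)), rng.normal(3, 1, (20, 2))])
y = np.array([0] * 20 + [1] * 20)
print(knn_predict(X, y, np.array([2.5, 2.5]), k=5))  # most likely predicts 1
```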
Unsupervised learning: k-means clustering
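Likewise, the k-means figure is not reproduced; a minimal sketch of Lloyd's algorithm (illustrative code, not the slide's own):

```python
import numpy as np

def kmeans(X, K, n_iters=100, seed=0):
    """Lloyd's algorithm: alternate cluster assignment and centroid update."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=K, replace=False)]  # initialize from data points
    for _ in range(n_iters):
        # assignment step: each point goes to its nearest center
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        assign = dists.argmin(axis=1)
        # update step: each center becomes the mean of its cluster
        # (assumes no cluster goes empty, which can happen in general)
        new_centers = np.array([X[assign == k].mean(axis=0) for k in range(K)])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return assign, centers
```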
Supervised learning: further considerations
▶ Loss function: the standard choice in regression is squared error (L2) loss:
$$L(x, y, f) := (y - f(x))^2 \tag{5}$$
▶ The standard choice in classification is misclassification rate (1 - accuracy):
$$L(x, y, f) := 1 - \mathbb{I}(y = f(x)) \tag{6}$$
Loss is bad: you want to avoid loss, so smaller loss is better! (Some losses are always positive; others can be positive or negative.)
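A direct translation of the two losses into numpy (a sketch; the function names are mine):

```python
import numpy as np

def squared_error_loss(y, f_x):
    """Eq. (5): squared error, always non-negative."""
    return (y - f_x) ** 2

def misclassification_loss(y, f_x):
    """Eq. (6): 0 if the prediction is correct, 1 otherwise."""
    return 1.0 - (y == f_x).astype(float)
```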
Supervised learning: further considerations
Quiz: what value of k for k-nearest neighbors gives training loss = 0? Does this make sense? (See the check after the formulas below.)
▶ Risk: expected loss
$$R(f) := \mathbb{E}_{X,Y}[L(X, Y, f)] \tag{7}$$
▶ Empirical risk: average over the data, e.g. "ordinary least squares":
$$\hat{R}(f) := \frac{1}{n} \sum_{i=1}^{n} (y_i - f(x_i))^2 \tag{8}$$
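You can check the quiz empirically: with k = 1, each training point is its own nearest neighbor, so the training loss is 0 (assuming no duplicate inputs with conflicting labels). A sketch with illustrative toy data:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(30, 2))
y = (X[:, 0] > 0).astype(int)  # labels determined by the first coordinate

def knn_predict(X_train, y_train, x_new, k):
    nearest = np.argsort(np.linalg.norm(X_train - x_new, axis=1))[:k]
    labels, counts = np.unique(y_train[nearest], return_counts=True)
    return labels[np.argmax(counts)]

for k in (1, 5, 15):
    preds = np.array([knn_predict(X, y, x, k) for x in X])
    print(k, np.mean(preds != y))  # empirical misclassification risk; k = 1 gives 0.0
```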
An algorithmic vs. statistical perspective
▶ k-nearest neighbors and k-means clustering are algorithms for handling data.
▶ Algorithmic questions: what is their time complexity in terms of p and n? Storage complexity?
▶ Statistical perspective: can the performance of either algorithm be analyzed with reference to an underlying probabilistic model?
▶ Statistical questions: what kind of performance do we expect on unseen data (generalization)? How does performance vary with n and p? How robust is the model to outliers?
The curse of dimensionality (Bellman 1961)
As p increases, all points are about equally distant from one another:
[Figure: fraction of points that are "close" as a function of dimension p (x-axis: p from 5 to 20; y-axis: fraction from 0.0 to 0.4)]
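A quick simulation of this effect (a sketch, with arbitrary choices of n and p): the relative spread of pairwise distances between uniform random points shrinks as p grows, i.e. all points become about equally distant.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200
for p in (2, 5, 10, 20, 50):
    X = rng.uniform(size=(n, p))
    dists = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    dists = dists[np.triu_indices(n, k=1)]  # all pairwise distances
    print(p, dists.std() / dists.mean())    # relative spread shrinks with p
```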
Linear regression as a statistical machine learning method
Given $(x_i, y_i)$, $i = 1, \ldots, n$, we consider fitting a linear model:
$$f(x) = \alpha + \beta x \tag{9}$$
Finding $f$ in this case means finding values for $\alpha$ and $\beta$.
▶ Algorithmic perspective: assuming squared error loss, find $\alpha$ and $\beta$ to minimize the empirical risk:
$$\hat{R}(f) := \frac{1}{n} \sum_{i=1}^{n} (y_i - f(x_i))^2 \tag{10}$$
Closed-form solutions exist for $\hat{\alpha}$ and $\hat{\beta}$ which minimize $\hat{R}(f)$. Exercise: find them! Hint: you will need to solve $\nabla_\beta \hat{R}(f) = 0$ and $\nabla_\alpha \hat{R}(f) = 0$.
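If you want to check your derivation against simulated data: the standard textbook solution is $\hat{\beta} = \widehat{\mathrm{Cov}}(x, y) / \widehat{\mathrm{Var}}(x)$ and $\hat{\alpha} = \bar{y} - \hat{\beta}\bar{x}$ (spoiler for the exercise above). A minimal numpy sketch with illustrative toy data:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=200)
y = 2.0 + 3.0 * x + rng.normal(scale=0.5, size=200)  # true alpha = 2, beta = 3

beta_hat = np.cov(x, y, bias=True)[0, 1] / x.var()   # Cov(x, y) / Var(x)
alpha_hat = y.mean() - beta_hat * x.mean()
print(alpha_hat, beta_hat)                           # close to 2 and 3
```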
Linear regression as a statistical machine learning method
Given $(x_i, y_i)$, $i = 1, \ldots, n$, we consider fitting a linear model:
$$f(x) = \alpha + \beta x \tag{11}$$
Finding $f$ in this case means finding values for $\alpha$ and $\beta$.
▶ Statistical perspective: assume that the errors are iid $\mathcal{N}(0, \sigma^2)$, or equivalently:
$$p(y \mid x) = \mathcal{N}(f(x), \sigma^2) \tag{12}$$
▶ Use maximum likelihood to estimate $\hat{\alpha}$ and $\hat{\beta}$.
▶ The statistical and algorithmic perspectives coincide: maximizing the Gaussian likelihood gives the same $\hat{\alpha}$, $\hat{\beta}$ as minimizing squared error loss (see the numerical check below).
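To see the coincidence numerically, one can maximize the Gaussian likelihood directly and compare with least squares. A sketch using scipy.optimize.minimize (the toy data and starting values are arbitrary):

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
x = rng.normal(size=200)
y = 2.0 + 3.0 * x + rng.normal(scale=0.5, size=200)

def neg_log_lik(params):
    alpha, beta, log_sigma = params
    sigma = np.exp(log_sigma)            # parameterize on the log scale so sigma > 0
    resid = y - (alpha + beta * x)
    # Gaussian negative log-likelihood, dropping constants
    return len(y) * np.log(sigma) + np.sum(resid ** 2) / (2 * sigma ** 2)

mle = minimize(neg_log_lik, x0=np.zeros(3)).x
ols = np.polyfit(x, y, deg=1)            # least-squares fit: [slope, intercept]
print(mle[:2])                           # MLE for (alpha, beta)
print(ols[::-1])                         # OLS (alpha, beta): the two agree
```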
Linear regression as a statistical machine learning method
▶ Closed-form optima aren't always available: we then need some sort of optimization method (e.g. gradient descent) to learn the parameters of a model (see the sketch after this list).
▶ Many machine learning papers back in the day contained pages of math deriving gradients.
▶ More common these days to rely on automatic differentiation methods (see the deep learning revolution).
▶ Distinction between parameters (usually fit with optimization) and hyperparameters (usually learned by cross-validation).
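A minimal gradient-descent sketch for the linear model above (the learning rate and iteration count are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=200)
y = 2.0 + 3.0 * x + rng.normal(scale=0.5, size=200)

alpha, beta = 0.0, 0.0
lr = 0.1                                   # learning rate (arbitrary choice)
for _ in range(500):
    resid = y - (alpha + beta * x)
    grad_alpha = -2.0 * resid.mean()       # gradient of mean squared error w.r.t. alpha
    grad_beta = -2.0 * (resid * x).mean()  # gradient of mean squared error w.r.t. beta
    alpha -= lr * grad_alpha
    beta -= lr * grad_beta
print(alpha, beta)                         # approaches the closed-form OLS solution
```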
A quick tour of classic supervised learning methods
▶ k-nearest neighbors [Friedman, Tibshirani, Hastie 2009]
▶ Linear regression
▶ Naive Bayes [Mitchell 1997]
▶ Logistic regression
▶ Linear Discriminant Analysis
▶ Support Vector Machines (SVMs) [Schölkopf and Smola 2002]
▶ Gaussian process regression and classification [Rasmussen and Williams 2006]
▶ Neural networks [Goodfellow, Bengio, Courville 2016]
▶ Random forests [Breiman 2001]
▶ Probabilistic Graphical Models [Murphy 2012]
A quick tour of classic unsupervised learning methods
▶ k-means clustering [Friedman, Tibshirani, Hastie 2009]
▶ Spectral clustering [von Luxburg 2007]
▶ Principal Components Analysis
▶ Latent Dirichlet Allocation [Blei, Ng, Jordan 2003]
▶ Gaussian Mixture Models
▶ Neural networks, especially VAEs and GANs [Goodfellow, Bengio, Courville 2016]