
Page 1: Day 1: Introduction to Machine Learning

Machine Learning
Seth Flaxman, Ph.D.

Upcoming Seminar: January 13-15, 2022, Remote Seminar

Page 2: Day 1: Introduction to Machine Learning

Day 1: Introduction to Machine Learning

Seth Flaxman www.sethrf.com

Page 3: Day 1: Introduction to Machine Learning

What is machine learning?

https://www.youtube.com/watch?v=f_uwKZIAeM0


Page 4: Day 1: Introduction to Machine Learning

What is machine learning?

For our purposes, follow Tom Mitchell:

A computer program is said to learn from experience E with respect to some class of tasks T and performance measure P if its performance at tasks in T, as measured by P, improves with experience E.


Page 5: Day 1: Introduction to Machine Learning

What is statistical machine learning?

Both computer science and statistics provide methods for learning from data.

Computer science takes an algorithmic perspective: propose an algorithm for data, study the algorithm formally.

Statistics takes an inferential perspective: propose a model for data, study the model formally.

Statistical machine learning (and computational statistics) is the intersection: an algorithmic perspective on statistical methods, a statistical perspective on algorithms.


Page 6: Day 1: Introduction to Machine Learning

Supervised vs. unsupervised learning: terminology

▶ Supervised learning, also known as: regression, classification, pattern recognition, recovery, sensing, . . .

▶ Unsupervised learning, also known as: clustering, data mining, dimensionality reduction, . . .

▶ Inputs, also known as: independent variables, predictors, covariates, patterns, x, X, . . .

▶ Outputs, also known as: dependent variables, responses, labels, y, Y, . . .

[Diagram: input x → function f → output y]


Page 7: Day 1: Introduction to Machine Learning

Supervised learning


Page 8: Day 1: Introduction to Machine Learning

Supervised learning, most basic setup

[Diagram: input x → function f → output y]

Given training inputs x ∈ X and outputs y ∈ Y

(x_i, y_i), i = 1, . . . , n    (1)

Learn a function (algorithm, black box, decision rule, classifier, probability distribution)

f : X → Y    (2)

i.e. on the training inputs, we would like our function f to approximately recover the training outputs:

f(x_i) ≈ y_i    (3)


Page 9: Day 1: Introduction to Machine Learning

Unsupervised learning, clustering and dim. reduction

Given training inputs x ∈ X, learn:

▶ Clustering: a function f giving cluster assignments 1, . . . , K

f(x) ∈ {1, . . . , K}    (4)

such that C_k = {x_i | f(x_i) = k} is homogeneous for each k.

▶ Dimensionality reduction: if X ∈ R^p for large p, learn a latent representation Z ∈ R^d, d ≪ p, such that Z explains most of the variance in X.
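
A minimal sketch of the dimensionality-reduction idea, using PCA via numpy's SVD (my own illustrative code, not from the slides; PCA itself appears on the unsupervised methods slide at the end):

```python
import numpy as np

def pca(X, d):
    """Project the n x p data matrix X onto its top d principal components."""
    Xc = X - X.mean(axis=0)                          # center each column
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    Z = Xc @ Vt[:d].T                                # n x d latent representation
    explained = (S[:d] ** 2).sum() / (S ** 2).sum()  # fraction of variance kept
    return Z, explained

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2)) @ rng.normal(size=(2, 20))  # p = 20, true rank 2
Z, frac = pca(X, d=2)
print(Z.shape, frac)  # (100, 2), ~1.0: two components explain almost everything
```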


Page 10: Day 1: Introduction to Machine Learning


Supervised learning: k-nearest neighbors
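
A minimal numpy sketch of k-nearest neighbors (my own illustration, not the code behind the figure): classify a new point by majority vote among the k closest training points, giving an f with f(x_i) ≈ y_i in the sense of (3).

```python
import numpy as np

def knn_predict(X_train, y_train, x, k=3):
    """Classify x by majority vote among its k nearest training points."""
    dists = np.linalg.norm(X_train - x, axis=1)    # Euclidean distance to each x_i
    nearest = np.argsort(dists)[:k]                # indices of the k closest points
    return np.bincount(y_train[nearest]).argmax()  # most common label among them

X_train = np.array([[0.0, 0.0], [0.1, 0.2], [1.0, 1.0], [0.9, 1.1]])
y_train = np.array([0, 0, 1, 1])
print(knn_predict(X_train, y_train, np.array([0.8, 0.9])))  # -> 1
```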

Page 11: Day 1: Introduction to Machine Learning


Unsupervised learning: k-means clustering
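
A minimal sketch of k-means (Lloyd's algorithm), again my own illustration: alternate between assigning each point to its nearest centroid and moving each centroid to the mean of its cluster.

```python
import numpy as np

def kmeans(X, K, n_iters=100, seed=0):
    """Lloyd's algorithm (ignores the empty-cluster edge case)."""
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), K, replace=False)]  # init at K data points
    for _ in range(n_iters):
        dists = np.linalg.norm(X[:, None] - centroids[None], axis=2)
        labels = dists.argmin(axis=1)                    # assignment step
        new = np.array([X[labels == k].mean(axis=0) for k in range(K)])
        if np.allclose(new, centroids):                  # converged
            break
        centroids = new
    return labels, centroids

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 0.3, (50, 2)), rng.normal(3, 0.3, (50, 2))])
labels, centroids = kmeans(X, K=2)
print(centroids.round(1))  # roughly (0, 0) and (3, 3)
```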

Page 12: Day 1: Introduction to Machine Learning

Supervised learning: further considerations

▶ Loss function: the standard choice in regression is squared-error (L2) loss:

L(x, y, f) := (y − f(x))²    (5)

▶ The standard choice in classification is the misclassification rate (1 − accuracy):

L(x, y, f) := 1 − I(y = f(x))    (6)

Loss is bad: you want to avoid it, so smaller loss is better! (Some losses are always positive; others can be positive or negative.)
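
The two losses in code (a minimal sketch; the function names are mine):

```python
import numpy as np

def squared_error_loss(y, y_hat):
    """L2 loss of eq. (5)."""
    return (y - y_hat) ** 2

def misclassification_loss(y, y_hat):
    """0-1 loss of eq. (6): 0 if correct, 1 if wrong."""
    return 1 - (y == y_hat)

print(squared_error_loss(3.0, 2.5))                                # 0.25
print(misclassification_loss(np.array([0, 1]), np.array([0, 0])))  # [0 1]
```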


Page 13: Day 1: Introduction to Machine Learning

Supervised learning: further considerations

Quiz: what value of k for k-nearest neighbors gives training loss = 0? Does this make sense?

▶ Risk: expected loss

R(f) := E_{X,Y}[L(x, y, f)]    (7)

▶ Empirical risk: average over data, e.g. "ordinary least squares":

R̂(f) := ∑_{i=1}^{n} (y_i − f(x_i))²    (8)
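
For the quiz: with k = 1, each training point is its own nearest neighbor, so training loss is 0, which says nothing about performance on unseen data. A minimal sketch of the empirical risk (8) in code (my own example):

```python
import numpy as np

def empirical_risk(f, X, y):
    """Sum of squared errors of f over the training data, eq. (8)."""
    preds = np.array([f(x) for x in X])
    return np.sum((y - preds) ** 2)

X = np.array([0.0, 1.0, 2.0])
y = np.array([0.1, 1.9, 4.2])
print(empirical_risk(lambda x: 2 * x, X, y))  # (0.1)^2 + (0.1)^2 + (0.2)^2 ≈ 0.06
```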


Page 14: Day 1: Introduction to Machine Learning

An algorithmic vs. statistical perspective

▶ k-nearest neighbors and k-means clustering are algorithms for handling data

▶ Algorithmic questions: what is their time complexity in terms of p and n? Storage complexity?

▶ Statistical perspective: can the performance of either algorithm be analyzed with reference to an underlying probabilistic model?

▶ Statistical questions: what kind of performance do we expect on unseen data (generalization)? How does performance vary with n and p? How robust is the model to outliers?


Page 15: Day 1: Introduction to Machine Learning

The curse of dimensionality (Bellman 1961)

As p increases, all points are about equally distant from one another:

[Figure: fraction of points that are "close" vs. dimension p (x-axis: p from 5 to 20; y-axis: fraction close, from 0.0 to 0.4)]
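
A small simulation of the effect behind the figure (my own sketch, not the original code): as p grows, the spread of pairwise distances shrinks relative to their mean, so no point is much closer than any other.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500
for p in [2, 5, 10, 20]:
    X = rng.uniform(size=(n, p))                  # n points in the unit cube
    d = np.linalg.norm(X[:, None] - X[None], axis=2)
    d = d[np.triu_indices(n, k=1)]                # all pairwise distances
    print(p, round(d.std() / d.mean(), 3))        # relative spread falls with p
```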

Page 16: Day 1: Introduction to Machine Learning

Linear regression as a statistical machine learning method

Given (x_i, y_i), i = 1, . . . , n we consider fitting a linear model:

f(x) = α + βx    (9)

Finding f in this case means finding values for α and β.

▶ Algorithmic perspective: assuming squared-error loss, find α and β to minimize the empirical risk:

R̂(f) := ∑_{i=1}^{n} (y_i − f(x_i))²    (10)

Closed-form solutions exist for α̂ and β̂ which minimize R̂(f). Exercise: find them! Hint: you will need to solve ∇_β R̂(f) = 0 and ∇_α R̂(f) = 0.
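
If you want to check your answer to the exercise, here are the standard simple-regression solutions in code (a minimal sketch):

```python
import numpy as np

def fit_linear(x, y):
    """Closed-form minimizers of the empirical risk (10)."""
    x_bar, y_bar = x.mean(), y.mean()
    beta = np.sum((x - x_bar) * (y - y_bar)) / np.sum((x - x_bar) ** 2)
    alpha = y_bar - beta * x_bar
    return alpha, beta

x = np.array([0.0, 1.0, 2.0, 3.0])
y = np.array([1.0, 3.0, 5.0, 7.0])
print(fit_linear(x, y))  # (1.0, 2.0): recovers y = 1 + 2x exactly
```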


Page 17: Day 1: Introduction to Machine Learning

Linear regression as a statistical machine learning method

Given (x_i, y_i), i = 1, . . . , n we consider fitting a linear model:

f(x) = α + βx    (11)

Finding f in this case means finding values for α and β.

▶ Statistical perspective: assume that errors are iid N(0, σ²), or equivalently:

p(y | x) = N(f(x), σ²)    (12)

▶ Use maximum likelihood to estimate α̂ and β̂.

▶ The statistical and algorithmic perspectives coincide!
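
A numerical check that the two perspectives coincide (my own sketch, assuming scipy is available): minimizing the negative Gaussian log-likelihood implied by (12) gives the same α̂, β̂ as least squares.

```python
import numpy as np
from scipy.optimize import minimize

x = np.array([0.0, 1.0, 2.0, 3.0])
y = np.array([1.2, 2.9, 5.1, 6.8])

def neg_log_lik(params, sigma=1.0):
    """Negative log-likelihood under (12), dropping constant terms."""
    alpha, beta = params
    return np.sum((y - (alpha + beta * x)) ** 2) / (2 * sigma ** 2)

mle = minimize(neg_log_lik, x0=[0.0, 0.0]).x
beta_ls, alpha_ls = np.polyfit(x, y, deg=1)   # least-squares fit for comparison
print(mle, (alpha_ls, beta_ls))               # same values up to tolerance
```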


Page 18: Day 1: Introduction to Machine Learning

Linear regression as a statistical machine learning method

▶ Closed-form optima aren't always available: we need to use some sort of optimization method (e.g. gradient descent) to learn the parameters of a model.

▶ Many machine learning papers back in the day contained pages of math deriving gradients

▶ More common these days to rely on autodifferentiation methods (see the deep learning revolution)

▶ Distinction between parameters (usually fit with optimization) and hyperparameters (usually learned by cross-validation)
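
A minimal sketch of gradient descent on the linear model, with the gradient of the empirical risk (10) derived by hand, the kind of derivation autodiff now automates:

```python
import numpy as np

def gradient_descent(x, y, lr=0.01, n_steps=5000):
    """Minimize eq. (10) by repeatedly stepping against its gradient."""
    alpha, beta = 0.0, 0.0
    for _ in range(n_steps):
        resid = y - (alpha + beta * x)
        grad_alpha = -2 * np.sum(resid)       # d R̂ / d alpha
        grad_beta = -2 * np.sum(resid * x)    # d R̂ / d beta
        alpha -= lr * grad_alpha
        beta -= lr * grad_beta
    return alpha, beta

x = np.array([0.0, 1.0, 2.0, 3.0])
y = np.array([1.0, 3.0, 5.0, 7.0])
print(gradient_descent(x, y))  # ≈ (1.0, 2.0), matching the closed form
```

Note that the learning rate lr is itself a hyperparameter, in the sense of the last bullet.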


Page 19: Day 1: Introduction to Machine Learning

A quick tour of classic supervised learning methods

▶ k-nearest neighbors [Friedman, Tibshirani, Hastie 2009]

▶ Linear regression

▶ Naive Bayes [Mitchell 1997]

▶ Logistic regression

▶ Linear Discriminant Analysis

▶ Support Vector Machines (SVMs) [Schölkopf and Smola 2002]

▶ Gaussian process regression and classification [Rasmussen and Williams 2006]

▶ Neural networks [Goodfellow, Bengio, Courville 2016]

▶ Random forests [Breiman 2001]

▶ Probabilistic Graphical Models [Murphy 2012]


Page 20: Day 1: Introduction to Machine Learning

A quick tour of classic unsupervised learning methods

▶ k-means clustering [Friedman, Tibshirani, Hastie 2009]

▶ Spectral clustering [von Luxburg 2007]

▶ Principal Components Analysis

▶ Latent Dirichlet Allocation [Blei, Ng, Jordan 2003]

▶ Gaussian Mixture Models

▶ Neural networks, especially VAEs and GANs [Goodfellow, Bengio, Courville 2016]
