Machine Learning
Seth Flaxman, Ph.D.
Upcoming Seminar: January 13-15, 2022, Remote Seminar
Day 1: Introduction to Machine Learning
Seth Flaxman www.sethrf.com
What is machine learning?
https://www.youtube.com/watch?v=f_uwKZIAeM0
What is machine learning?
For our purposes, follow Tom Mitchell:
A computer program is said to learn from experience E with respect to some class of tasks T and performance measure P if its performance at tasks in T, as measured by P, improves with experience E.
What is statistical machine learning?
Both computer science and statistics provide methods for learning from data.
Computer science takes an algorithmic perspective: propose an algorithm for data, study the algorithm formally.
Statistics takes an inferential perspective: propose a model for data, study the model formally.
Statistical machine learning (and computational statistics) is the intersection: an algorithmic perspective on statistical methods, a statistical perspective on algorithms.
Supervised vs. unsupervised learning: terminology
▶ Supervised learning, also known as: regression, classification, pattern recognition, recovery, sensing, ...
▶ Unsupervised learning, also known as: clustering, data mining, dimensionality reduction, ...
▶ Inputs, also known as: independent variables, predictors, covariates, patterns, x, X, ...
▶ Outputs, also known as: dependent variables, responses, labels, y, Y, ...
[Diagram: inputs x map through a function f to outputs y, i.e. y = f(x)]
Supervised learning
Supervised learning, most basic setup
[Diagram: inputs x map through a learned function f to outputs y]

Given training inputs $x \in \mathcal{X}$ and outputs $y \in \mathcal{Y}$:
$$(x_i, y_i), \quad i = 1, \ldots, n \tag{1}$$
Learn a function $f$ (algorithm, black box, decision rule, classifier, probability distribution)
$$f : \mathcal{X} \to \mathcal{Y} \tag{2}$$
i.e. on the training inputs, we would like our function $f$ to approximately recover the training outputs:
$$f(x_i) \approx y_i \tag{3}$$
Unsupervised learning, clustering and dim. reduction
Given training inputs $x \in \mathcal{X}$, learn:
▶ Clustering: a function $f$ giving cluster assignments $1, \ldots, K$,
$$f(x) \in \{1, \ldots, K\} \tag{4}$$
such that $C_k = \{x_i \mid f(x_i) = k\}$ is homogeneous for each $k$.
▶ Dimensionality reduction: if $X \in \mathbb{R}^p$ for large $p$, learn a latent representation $Z \in \mathbb{R}^d$, $d \ll p$, such that $Z$ explains most of the variance in $X$ (see the PCA sketch below).
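As a concrete illustration of the dimensionality reduction setup (not from the original slides), here is a minimal PCA sketch via the SVD; the toy data and the function name `pca` are illustrative:

```python
import numpy as np

def pca(X, d):
    """Project X (n x p) onto its top-d principal components."""
    Xc = X - X.mean(axis=0)                          # center the data
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    Z = Xc @ Vt[:d].T                                # n x d latent representation Z
    explained = (S[:d] ** 2).sum() / (S ** 2).sum()  # fraction of variance explained
    return Z, explained

# toy example: 3-dimensional data that is mostly 1-dimensional
rng = np.random.default_rng(0)
t = rng.normal(size=(100, 1))
X = t @ np.array([[1.0, 2.0, 3.0]]) + 0.1 * rng.normal(size=(100, 3))
Z, frac = pca(X, d=1)
print(frac)  # close to 1: one component explains most of the variance
```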
Supervised learning: k-nearest neighbors
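The k-nearest neighbors figure from this slide is not reproduced here; in its place, a minimal numpy sketch of the classifier (the toy data and helper name are illustrative):

```python
import numpy as np

def knn_predict(X_train, y_train, x_new, k=3):
    """Predict the label of x_new by majority vote among its k nearest neighbors."""
    dists = np.linalg.norm(X_train - x_new, axis=1)  # Euclidean distance to each training point
    nearest = np.argsort(dists)[:k]                  # indices of the k closest points
    labels, counts = np.unique(y_train[nearest], return_counts=True)
    return labels[np.argmax(counts)]                 # majority vote

# toy data: two Gaussian blobs
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (20, 2)), rng.normal(3, 1, (20, 2))])
y = np.array([0] * 20 + [1] * 20)
print(knn_predict(X, y, np.array([2.5, 2.5]), k=5))  # most likely predicts 1
```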
Unsupervised learning: k-means clustering
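Likewise, the k-means figure is not reproduced; a minimal sketch of Lloyd's algorithm (illustrative code, not the slide's own):

```python
import numpy as np

def kmeans(X, K, n_iters=100, seed=0):
    """Lloyd's algorithm: alternate cluster assignment and centroid update."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=K, replace=False)]  # initialize from data points
    for _ in range(n_iters):
        # assignment step: each point goes to its nearest center
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        assign = dists.argmin(axis=1)
        # update step: each center becomes the mean of its cluster
        # (assumes no cluster goes empty, which can happen in general)
        new_centers = np.array([X[assign == k].mean(axis=0) for k in range(K)])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return assign, centers
```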
Supervised learning: further considerations
▶ Loss function: the standard choice in regression is squared error (L2) loss:
$$L(x, y, f) := (y - f(x))^2 \tag{5}$$
▶ The standard choice in classification is misclassification rate (1 - accuracy):
$$L(x, y, f) := 1 - \mathbb{I}(y = f(x)) \tag{6}$$
Loss is bad: you want to avoid loss, so smaller loss is better! (Some losses are always positive; others can be positive or negative.)
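A direct translation of the two losses into numpy (a sketch; the function names are mine):

```python
import numpy as np

def squared_error_loss(y, f_x):
    """Eq. (5): squared error, always non-negative."""
    return (y - f_x) ** 2

def misclassification_loss(y, f_x):
    """Eq. (6): 0 if the prediction is correct, 1 otherwise."""
    return 1.0 - (y == f_x).astype(float)
```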
Supervised learning: further considerations
Quiz: what value of k for k-nearest neighbors gives training loss = 0? Does this make sense? (See the check after the formulas below.)
▶ Risk: expected loss
$$R(f) := \mathbb{E}_{X,Y}[L(X, Y, f)] \tag{7}$$
▶ Empirical risk: average over the data, e.g. "ordinary least squares":
$$\hat{R}(f) := \frac{1}{n} \sum_{i=1}^{n} (y_i - f(x_i))^2 \tag{8}$$
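You can check the quiz empirically: with k = 1, each training point is its own nearest neighbor, so the training loss is 0 (assuming no duplicate inputs with conflicting labels). A sketch with illustrative toy data:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(30, 2))
y = (X[:, 0] > 0).astype(int)  # labels determined by the first coordinate

def knn_predict(X_train, y_train, x_new, k):
    nearest = np.argsort(np.linalg.norm(X_train - x_new, axis=1))[:k]
    labels, counts = np.unique(y_train[nearest], return_counts=True)
    return labels[np.argmax(counts)]

for k in (1, 5, 15):
    preds = np.array([knn_predict(X, y, x, k) for x in X])
    print(k, np.mean(preds != y))  # empirical misclassification risk; k = 1 gives 0.0
```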
An algorithmic vs. statistical perspective
▶ k-nearest neighbors and k-means clustering are algorithms for handling data.
▶ Algorithmic questions: what is their time complexity in terms of p and n? Storage complexity?
▶ Statistical perspective: can the performance of either algorithm be analyzed with reference to an underlying probabilistic model?
▶ Statistical questions: what kind of performance do we expect on unseen data (generalization)? How does performance vary with n and p? How robust is the model to outliers?
The curse of dimensionality (Bellman 1961)
As p increases, all points are about equally distant from one another:
[Figure: fraction of points that are "close" as a function of dimension p (x-axis: p from 5 to 20; y-axis: fraction from 0.0 to 0.4)]
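A quick simulation of this effect (a sketch, with arbitrary choices of n and p): the relative spread of pairwise distances between uniform random points shrinks as p grows, i.e. all points become about equally distant.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200
for p in (2, 5, 10, 20, 50):
    X = rng.uniform(size=(n, p))
    dists = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    dists = dists[np.triu_indices(n, k=1)]  # all pairwise distances
    print(p, dists.std() / dists.mean())    # relative spread shrinks with p
```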
Linear regression as a statistical machine learning method
Given $(x_i, y_i)$, $i = 1, \ldots, n$, we consider fitting a linear model:
$$f(x) = \alpha + \beta x \tag{9}$$
Finding $f$ in this case means finding values for $\alpha$ and $\beta$.
▶ Algorithmic perspective: assuming squared error loss, find $\alpha$ and $\beta$ to minimize the empirical risk:
$$\hat{R}(f) := \frac{1}{n} \sum_{i=1}^{n} (y_i - f(x_i))^2 \tag{10}$$
Closed-form solutions exist for $\hat{\alpha}$ and $\hat{\beta}$ which minimize $\hat{R}(f)$. Exercise: find them! Hint: you will need to solve $\nabla_\beta \hat{R}(f) = 0$ and $\nabla_\alpha \hat{R}(f) = 0$.
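If you want to check your derivation against simulated data: the standard textbook solution is $\hat{\beta} = \widehat{\mathrm{Cov}}(x, y) / \widehat{\mathrm{Var}}(x)$ and $\hat{\alpha} = \bar{y} - \hat{\beta}\bar{x}$ (spoiler for the exercise above). A minimal numpy sketch with illustrative toy data:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=200)
y = 2.0 + 3.0 * x + rng.normal(scale=0.5, size=200)  # true alpha = 2, beta = 3

beta_hat = np.cov(x, y, bias=True)[0, 1] / x.var()   # Cov(x, y) / Var(x)
alpha_hat = y.mean() - beta_hat * x.mean()
print(alpha_hat, beta_hat)                           # close to 2 and 3
```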
Linear regression as a statistical machine learning method
Given $(x_i, y_i)$, $i = 1, \ldots, n$, we consider fitting a linear model:
$$f(x) = \alpha + \beta x \tag{11}$$
Finding $f$ in this case means finding values for $\alpha$ and $\beta$.
▶ Statistical perspective: assume that the errors are iid $\mathcal{N}(0, \sigma^2)$, or equivalently:
$$p(y \mid x) = \mathcal{N}(f(x), \sigma^2) \tag{12}$$
▶ Use maximum likelihood to estimate $\hat{\alpha}$ and $\hat{\beta}$.
▶ The statistical and algorithmic perspectives coincide: maximizing the Gaussian likelihood gives the same $\hat{\alpha}$, $\hat{\beta}$ as minimizing squared error loss (see the numerical check below).
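To see the coincidence numerically, one can maximize the Gaussian likelihood directly and compare with least squares. A sketch using scipy.optimize.minimize (the toy data and starting values are arbitrary):

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
x = rng.normal(size=200)
y = 2.0 + 3.0 * x + rng.normal(scale=0.5, size=200)

def neg_log_lik(params):
    alpha, beta, log_sigma = params
    sigma = np.exp(log_sigma)            # parameterize on the log scale so sigma > 0
    resid = y - (alpha + beta * x)
    # Gaussian negative log-likelihood, dropping constants
    return len(y) * np.log(sigma) + np.sum(resid ** 2) / (2 * sigma ** 2)

mle = minimize(neg_log_lik, x0=np.zeros(3)).x
ols = np.polyfit(x, y, deg=1)            # least-squares fit: [slope, intercept]
print(mle[:2])                           # MLE for (alpha, beta)
print(ols[::-1])                         # OLS (alpha, beta): the two agree
```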
Linear regression as a statistical machine learning method
▶ Closed-form optima aren't always available: we then need some sort of optimization method (e.g. gradient descent) to learn the parameters of a model (see the sketch after this list).
▶ Many machine learning papers back in the day contained pages of math deriving gradients.
▶ More common these days to rely on automatic differentiation methods (see the deep learning revolution).
▶ Distinction between parameters (usually fit with optimization) and hyperparameters (usually learned by cross-validation).
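A minimal gradient-descent sketch for the linear model above (the learning rate and iteration count are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=200)
y = 2.0 + 3.0 * x + rng.normal(scale=0.5, size=200)

alpha, beta = 0.0, 0.0
lr = 0.1                                   # learning rate (arbitrary choice)
for _ in range(500):
    resid = y - (alpha + beta * x)
    grad_alpha = -2.0 * resid.mean()       # gradient of mean squared error w.r.t. alpha
    grad_beta = -2.0 * (resid * x).mean()  # gradient of mean squared error w.r.t. beta
    alpha -= lr * grad_alpha
    beta -= lr * grad_beta
print(alpha, beta)                         # approaches the closed-form OLS solution
```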
A quick tour of classic supervised learning methods
▶ k-nearest neighbors [Friedman, Tibshirani, Hastie 2009]
▶ Linear regression
▶ Naive Bayes [Mitchell 1997]
▶ Logistic regression
▶ Linear Discriminant Analysis
▶ Support Vector Machines (SVMs) [Schölkopf and Smola 2002]
▶ Gaussian process regression and classification [Rasmussen and Williams 2006]
▶ Neural networks [Goodfellow, Bengio, Courville 2016]
▶ Random forests [Breiman 2001]
▶ Probabilistic Graphical Models [Murphy 2012]
A quick tour of classic unsupervised learning methods
▶ k-means clustering [Friedman, Tibshirani, Hastie 2009]
▶ Spectral clustering [von Luxburg 2007]
▶ Principal Components Analysis
▶ Latent Dirichlet Allocation [Blei, Ng, Jordan 2003]
▶ Gaussian Mixture Models
▶ Neural networks, especially VAEs and GANs [Goodfellow, Bengio, Courville 2016]