VBM683
Machine Learning
Pinar Duygulu
Slides are adapted from
Dhruv Batra,
Fei-Fei Li & Andrej Karpathy & Justin Johnson,
Aykut Erdem
Python
• Python is a high-level, dynamically typed
multiparadigm programming language. Python code
is often said to be almost like pseudocode, since it
allows you to express very powerful ideas in very few
lines of code while being very readable.
• http://cs231n.github.io/python-numpy-tutorial/
Fei-Fei Li & Andrej Karpathy & Justin Johnson
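For a taste of that pseudocode-like readability, a short example in the spirit of the linked tutorial:

```python
# Quicksort in a few readable lines, as in the cs231n tutorial linked above.
def quicksort(arr):
    if len(arr) <= 1:
        return arr
    pivot = arr[len(arr) // 2]
    left = [x for x in arr if x < pivot]
    middle = [x for x in arr if x == pivot]
    right = [x for x in arr if x > pivot]
    return quicksort(left) + quicksort(middle) + quicksort(right)

print(quicksort([3, 6, 8, 10, 1, 2, 1]))  # [1, 1, 2, 3, 6, 8, 10]
```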
NumPy
• NumPy is the fundamental package for scientific
computing with Python (http://www.numpy.org/). It
contains, among other things:
• a powerful N-dimensional array object
• sophisticated (broadcasting) functions
• tools for integrating C/C++ and Fortran code
• useful linear algebra, Fourier transform, and random
number capabilities
• Besides its obvious scientific uses, NumPy can also
be used as an efficient multi-dimensional container of
generic data. Arbitrary data-types can be defined.
This allows NumPy to seamlessly and speedily
integrate with a wide variety of databases.
Fei-Fei Li & Andrej Karpathy & Justin Johnson
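A minimal sketch touching the capabilities listed above (N-dimensional arrays, broadcasting, linear algebra, Fourier transforms, random numbers):

```python
import numpy as np

a = np.array([[1.0, 2.0], [3.0, 4.0]])   # 2-D array object
b = np.array([10.0, 20.0])

print(a + b)                 # broadcasting: b is stretched across the rows of a
print(a.T @ a)               # linear algebra: matrix product
print(np.fft.fft(b))         # Fourier transform
print(np.random.rand(2, 2))  # random number capabilities
```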
Plan for today
• Introduction to Supervised/Inductive Learning
• Your first classifier: k-Nearest Neighbour
(C) Dhruv Batra
Types of Learning
• Supervised learning
– Training data includes desired outputs
• Unsupervised learning
– Training data does not include desired outputs
• Weakly or Semi-supervised learning
– Training data includes a few desired outputs
• Reinforcement learning
– Rewards from sequence of actions
(C) Dhruv Batra
Supervised / Inductive Learning
• Given
– examples of a function (x, f(x))
• Predict function f(x) for new examples x
– Discrete f(x): Classification
– Continuous f(x): Regression
– f(x) = Probability(x): Probability estimation
(C) Dhruv Batra
Slide Credit: Pedro Domingos, Tom Mitchell, Tom Dietterich
Supervised Learning
• Input: x (images, text, emails…)
• Output: y (spam or non-spam…)
• (Unknown) Target Function
– f: X → Y (the “true” mapping / reality)
• Data
– (x1,y1), (x2,y2), …, (xN,yN)
• Model / Hypothesis Class
– g: X → Y
– e.g. y = g(x) = sign(wᵀx)
• Learning = Search in hypothesis space
– Find the best g in the model class.
(C) Dhruv Batra
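To make the hypothesis class above concrete, a toy sketch of g(x) = sign(wᵀx); the weight vector here is invented for illustration:

```python
import numpy as np

# A toy instance of the hypothesis class g(x) = sign(w^T x).
# The weight vector w is made up for illustration.
w = np.array([0.5, -1.0, 2.0])

def g(x):
    # Predict +1 or -1 for a single feature vector x.
    return np.sign(w @ x)

x_new = np.array([1.0, 0.5, 0.25])
print(g(x_new))  # w @ x = 0.5 - 0.5 + 0.5 = 0.5, so the sign is 1.0
```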
(C) Dhruv Batra Slide Credit: Yaser Abu-Mostafa
Basic Steps of Supervised Learning
• Set up a supervised learning problem
• Data collection
– Start with training data for which we know the correct outcome, provided by a teacher or oracle.
• Representation
– Choose how to represent the data.
• Modeling
– Choose a hypothesis class: H = {g: X → Y}
• Learning/Estimation
– Find the best hypothesis you can in the chosen class.
• Model Selection
– Try different models. Pick the best one. (More on this later)
• If happy, stop
– Else refine one or more of the above
(C) Dhruv Batra
Learning is hard!
• No assumptions = No learning
(C) Dhruv Batra
Klingon vs Mlingon Classification
• Training Data
– Klingon: klix, kour, koop
– Mlingon: moo, maa, mou
• Testing Data: kap
• Which language?
• Why?
(C) Dhruv Batra
Concept Learning
• Definition: Acquire an operational definition of a
general category of objects given positive and
negative training examples.
• Also called binary classification, binary supervised learning
Slide credit: Thorsten Joachims
(C) Dhruv Batra
Loss/Error Functions
• How do we measure performance?
• Regression:
– L2 error
• Classification:
– #misclassifications
– Weighted misclassification via a cost matrix
– For 2-class classification:
• True Positive, False Positive, True Negative, False Negative
– For k-class classification:
• Confusion Matrix
(C) Dhruv Batra
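A hedged NumPy sketch of these losses; the label vectors are invented for illustration:

```python
import numpy as np

y_true = np.array([1, 0, 1, 1, 0])
y_pred = np.array([1, 1, 1, 0, 0])

# Regression-style L2 error (applied to the same toy vectors for brevity)
l2_error = np.sum((y_true - y_pred) ** 2)

# Classification: count of misclassifications (0/1 loss)
n_errors = np.sum(y_true != y_pred)

# 2-class confusion counts: TP, FP, TN, FN
tp = np.sum((y_pred == 1) & (y_true == 1))
fp = np.sum((y_pred == 1) & (y_true == 0))
tn = np.sum((y_pred == 0) & (y_true == 0))
fn = np.sum((y_pred == 0) & (y_true == 1))
print(l2_error, n_errors, (tp, fp, tn, fn))
```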
Training vs Testing
• What do we want?
– Good performance (low loss) on training data?
– No, Good performance on unseen test data!
• Training Data:
– { (x1,y1), (x2,y2), …, (xN,yN) }
– Given to us for learning f
• Testing Data
– { x1, x2, …, xM }
– Used to see if we have learnt anything
(C) Dhruv Batra
Procedural View
• Training Stage:
– Raw Data → x (Feature Extraction)
– Training Data { (x,y) } → f (Learning)
• Testing Stage:
– Raw Data → x (Feature Extraction)
– Test Data x → f(x) (Apply function, Evaluate error)
(C) Dhruv Batra
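A minimal sketch of the two stages, with hypothetical stand-ins for feature extraction and learning (here, an ordinary least-squares fit):

```python
import numpy as np

def extract_features(raw):
    # Feature extraction: raw data -> feature vector x (a placeholder here)
    return np.asarray(raw, dtype=float)

def train(X, y):
    # Learning: return a least-squares linear predictor f
    w, *_ = np.linalg.lstsq(X, y, rcond=None)
    return lambda x: x @ w

X_train = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
y_train = np.array([1.0, 2.0, 3.0])
f = train(X_train, y_train)
print(f(extract_features([2.0, 1.0])))  # apply f to a new example -> ~4.0
```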
Statistical Estimation View
• Probabilities to the rescue:
– x and y are random variables
– D = (x1,y1), (x2,y2), …, (xN,yN) ~ P(X,Y)
• IID: Independent Identically Distributed
– Both training & testing data sampled IID from P(X,Y)
– Learn on training set
– Have some hope of generalizing to test set
(C) Dhruv Batra
Concepts
• Capacity
– Measures how large the hypothesis class H is.
– Are all functions allowed?
• Overfitting
– f works well on training data
– Works poorly on test data
• Generalization
– The ability to achieve low error on new test data
(C) Dhruv Batra
Guarantees
• 20 years of research in Learning Theory
oversimplified:
• If you have:
– Enough training data D
– and H is not too complex
– then probably we can generalize to unseen test data
(C) Dhruv Batra
Your First Classifier:
Nearest Neighbours
(C) Dhruv Batra Image Credit: Wikipedia
Synonyms
• Nearest Neighbours
• k-Nearest Neighbours
• A member of the following families:
– Instance-based Learning
– Memory-based Learning
– Exemplar methods
– Non-parametric methods
(C) Dhruv Batra
Nearest Neighbor is an example of…
Instance-based learning
• Has been around since about 1910.
• To make a prediction, search the database for similar
datapoints, and fit with the local points.
• Assumption: nearby points behave similarly w.r.t. y
• Stored database: (x1, y1), (x2, y2), …, (xn, yn)
(C) Dhruv Batra
Instance/Memory-based Learning
Four things make a memory based learner:
• A distance metric
• How many nearby neighbors to look at?
• A weighting function (optional)
• How to fit with the local points?
Slide Credit: Carlos Guestrin
(C) Dhruv Batra
1-Nearest Neighbour
Four things make a memory based learner:
• A distance metric
– Euclidean (and others)
• How many nearby neighbors to look at?
– 1
• A weighting function (optional)
– unused
• How to fit with the local points?
– Just predict the same output as the nearest neighbour.
Slide Credit: Carlos Guestrin
(C) Dhruv Batra
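A minimal sketch of 1-NN under these four choices; the training points and labels are invented for illustration:

```python
import numpy as np

# 1-NN: Euclidean distance, 1 neighbour, no weighting,
# copy the nearest neighbour's label.
def predict_1nn(X_train, y_train, x_query):
    dists = np.linalg.norm(X_train - x_query, axis=1)  # Euclidean distances
    return y_train[np.argmin(dists)]                   # label of the closest point

X_train = np.array([[0.0, 0.0], [1.0, 1.0], [5.0, 5.0]])
y_train = np.array(["red", "red", "blue"])
print(predict_1nn(X_train, y_train, np.array([4.0, 4.5])))  # -> "blue"
```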
k-Nearest Neighbour
Four things make a memory based learner:
• A distance metric
– Euclidean (and others)
• How many nearby neighbors to look at?
– k
• A weighting function (optional)
– unused
• How to fit with the local points?
– Just predict the average output among the nearest
neighbours.
(C) Dhruv Batra Slide Credit: Carlos Guestrin
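The same sketch with k neighbours; for classification, a majority vote plays the role of the average mentioned above:

```python
import numpy as np
from collections import Counter

# k-NN: Euclidean distance, k neighbours, no weighting,
# majority vote for classification (use the mean of y for regression).
def predict_knn(X_train, y_train, x_query, k=3):
    dists = np.linalg.norm(X_train - x_query, axis=1)
    nearest = np.argsort(dists)[:k]  # indices of the k closest points
    return Counter(y_train[nearest]).most_common(1)[0][0]

X_train = np.array([[0, 0], [1, 0], [0, 1], [5, 5], [6, 5]])
y_train = np.array(["red", "red", "red", "blue", "blue"])
print(predict_knn(X_train, y_train, np.array([0.5, 0.5]), k=3))  # -> "red"
```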
1 vs k Nearest Neighbour
(C) Dhruv Batra Image Credit: Ying Wu
1 vs k Nearest Neighbour
(C) Dhruv Batra Image Credit: Ying Wu
Nearest Neighbour
• Demo 1
– http://cgm.cs.mcgill.ca/~soss/cs644/projects/perrier/Nearest.html
• Demo 2
– http://www.cs.technion.ac.il/~rani/LocBoost/
(C) Dhruv Batra
• Gender Classification from body proportions
– Igor Janjic & Daniel Friedman, Juniors
(C) Dhruv Batra
Scene Completion [Hays & Efros, SIGGRAPH 2007]
(C) Dhruv Batra
[Figures: scene completion results: context matching against candidate
scenes (… 200 total), then graph cut + Poisson blending of the best match.
Hays and Efros, SIGGRAPH 2007]
1-NN for Regression
(C) Dhruv Batra
[Figure: for a query point x, the prediction is the y value of the
closest datapoint.]
Figure Credit: Carlos Guestrin
1-NN for Regression
• Often bumpy (overfits)
(C) Dhruv Batra Figure Credit: Andrew Moore
9-NN for Regression
• Smoother than 1-NN, but can underfit
(C) Dhruv Batra Figure Credit: Andrew Moore
Multivariate distance metrics
Suppose the input vectors x1, x2, …xN are two dimensional:
x1 = ( x11 , x12 ) , x2 = ( x21 , x22 ) , …xN = ( xN1 , xN2 ).
One can draw the nearest-neighbor regions in input space.
Dist(xi, xj) = (xi1 − xj1)² + (3·xi2 − 3·xj2)²
The relative scalings in the distance metric affect region shapes.
Dist(xi, xj) = (xi1 − xj1)² + (xi2 − xj2)²
Slide Credit: Carlos Guestrin
Euclidean distance metric
D(x, x′) = sqrt( Σi σi² (xi − x′i)² )
Or equivalently, D(x, x′) = sqrt( (x − x′)ᵀ A (x − x′) ),
where A is a diagonal matrix with entries σi².
Slide Credit: Carlos Guestrin

Notable distance metrics (and their level sets)
[Figure: level sets of the Scaled Euclidean (L2) and Mahalanobis
(non-diagonal A) distances]
Slide Credit: Carlos Guestrin
Minkowski distance
(C) Dhruv Batra Image Credit: By Waldir (Based on File:MinkowskiCircles.svg)
[CC BY-SA 3.0 (http://creativecommons.org/licenses/by-sa/3.0)], via Wikimedia Commons
Notable distance metrics (and their level sets)
[Figure: level sets of the L1 (absolute) norm and the L∞ (max) norm,
alongside Scaled Euclidean (L2) and Mahalanobis (non-diagonal A)]
Slide Credit: Carlos Guestrin
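A hedged sketch of the metrics above; sigma, A, p, and the test points are assumed example values:

```python
import numpy as np

def scaled_euclidean(x, xp, sigma):
    # sqrt( sum_i sigma_i^2 (x_i - x'_i)^2 )
    return np.sqrt(np.sum((sigma ** 2) * (x - xp) ** 2))

def mahalanobis(x, xp, A):
    # sqrt( (x - x')^T A (x - x') ), A symmetric positive definite
    d = x - xp
    return np.sqrt(d @ A @ d)

def minkowski(x, xp, p):
    # Lp norm of the difference; p=1 gives L1, p=2 gives L2, p -> inf the max norm
    return np.sum(np.abs(x - xp) ** p) ** (1.0 / p)

x, xp = np.array([1.0, 2.0]), np.array([2.0, 0.0])
print(scaled_euclidean(x, xp, np.array([1.0, 3.0])))          # sqrt(37)
print(mahalanobis(x, xp, np.array([[2.0, 0.5], [0.5, 1.0]]))) # 2.0
print(minkowski(x, xp, 1), minkowski(x, xp, 2))               # 3.0, sqrt(5)
```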
Kernel Regression/Classification
Four things make a memory based learner:
• A distance metric
– Euclidean (and others)
• How many nearby neighbors to look at?
– All of them
• A weighting function (optional)
– wi = exp(−d(xi, query)² / σ²)
– Nearby points to the query are weighted strongly, far points
weakly. The σ parameter is the kernel width. Very important.
• How to fit with the local points?
– Predict the weighted average of the outputs:
predict = Σ wi yi / Σ wi
(C) Dhruv Batra Slide Credit: Carlos Guestrin
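A minimal sketch of kernel regression with the Gaussian weighting above; the data and σ are invented for illustration:

```python
import numpy as np

def kernel_regression(X_train, y_train, x_query, sigma=1.0):
    d2 = np.sum((X_train - x_query) ** 2, axis=1)  # squared distances d(xi, query)^2
    w = np.exp(-d2 / sigma ** 2)                   # wi = exp(-d^2 / sigma^2)
    return np.sum(w * y_train) / np.sum(w)         # predict = sum(wi yi) / sum(wi)

X_train = np.array([[0.0], [1.0], [2.0], [3.0]])
y_train = np.array([0.0, 1.0, 4.0, 9.0])
print(kernel_regression(X_train, y_train, np.array([1.5]), sigma=0.5))
```

Varying sigma in this sketch is one way to explore the kernel-width questions on the next slide.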
Weighting/Kernel functions
wi = exp(−d(xi, query)² / σ²)
(Our examples use Gaussian)
(C) Dhruv Batra Slide Credit: Carlos Guestrin
Effect of Kernel Width
(C) Dhruv Batra
• What happens as σ → ∞?
• What happens as σ → 0?
Image Credit: Ben Taskar
Problems with Instance-Based Learning
• Expensive
– No Learning: most real work done during testing
– For every test sample, must search through the entire
dataset – very slow!
– Must use tricks like approximate nearest neighbour search
• Doesn’t work well with a large number of irrelevant
features
– Distances overwhelmed by noisy features
• Curse of Dimensionality
– Distances become meaningless in high dimensions
– (See proof in next lecture)
(C) Dhruv Batra
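A small experiment (with assumed sample sizes) illustrating the curse-of-dimensionality point above: in high dimensions, the nearest and farthest of a set of random points become nearly equidistant:

```python
import numpy as np

# As the dimension grows, the ratio of nearest to farthest distance
# from a reference point approaches 1, so "nearest" loses meaning.
rng = np.random.default_rng(0)
for d in (2, 100, 10000):
    X = rng.random((500, d))                      # 500 random points in [0,1]^d
    dists = np.linalg.norm(X - X[0], axis=1)[1:]  # distances from the first point
    print(d, dists.min() / dists.max())           # ratio -> 1 as d grows
```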
What you need to know
• k-NN
– Simplest learning algorithm
– With sufficient data, very hard to beat this “strawman” approach
– Picking k?
• Kernel regression/classification
– Set k to n (number of data points) and choose the kernel width
– Smoother than k-NN
• Problems with k-NN
– Curse of dimensionality
– Irrelevant features are often killers for instance-based approaches
– Slow NN search: must remember the (very large) dataset for prediction
(C) Dhruv Batra
Parametric vs Non-Parametric Models
• Does the capacity (size of hypothesis class) grow
with size of training data?
– Yes = Non-Parametric Models
– No = Parametric Models
• Example
– http://www.theparticle.com/applets/ml/nearest_neighbor/
(C) Dhruv Batra