
VBM683

Machine Learning

Pinar Duygulu

Slides are adapted from

Dhruv Batra,

Fei-Fei Li & Andrej Karpathy & Justin Johnson,

Aykut Erdem

Python

• Python is a high-level, dynamically typed

multiparadigm programming language. Python code

is often said to be almost like pseudocode, since it

allows you to express very powerful ideas in very few

lines of code while being very readable.

• http://cs231n.github.io/python-numpy-tutorial/

Fei-Fei Li & Andrej Karpathy & Justin Johnson

NumPy

http://www.numpy.org/

• NumPy is the fundamental package for scientific

computing with Python. It contains among other

things:

• a powerful N-dimensional array object

• sophisticated (broadcasting) functions

• tools for integrating C/C++ and Fortran code

• useful linear algebra, Fourier transform, and random

number capabilities

• Besides its obvious scientific uses, NumPy can also

be used as an efficient multi-dimensional container of

generic data. Arbitrary data-types can be defined.

This allows NumPy to seamlessly and speedily

integrate with a wide variety of databases.

Fei-Fei Li & Andrej Karpathy & Justin Johnson
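To make the list above concrete, here is a small illustrative snippet (not from the original slides) touching each bullet: an N-dimensional array, broadcasting, and the linear-algebra, Fourier-transform, and random-number routines.

    import numpy as np

    # N-dimensional array object
    a = np.arange(12).reshape(3, 4)              # 3x4 array holding 0..11

    # Broadcasting: subtract a per-column mean without writing loops
    a_centered = a - a.mean(axis=0)

    # Linear algebra, Fourier transform, random numbers
    x = np.linalg.solve(np.eye(3), np.ones(3))   # solve I x = [1, 1, 1]
    spectrum = np.fft.fft(np.ones(8))            # discrete Fourier transform
    noise = np.random.randn(5)                   # samples from N(0, 1)

    print(a_centered.shape, x, spectrum.shape, noise.shape)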

Plan for today

• Introduction to Supervised/Inductive Learning

• Your first classifier: k-Nearest Neighbour

(C) Dhruv Batra

Types of Learning

• Supervised learning

– Training data includes desired outputs

• Unsupervised learning

– Training data does not include desired outputs

• Weakly or Semi-supervised learning

– Training data includes a few desired outputs

• Reinforcement learning

– Rewards from sequence of actions

(C) Dhruv Batra

Supervised / Inductive Learning

• Given

– examples of a function (x, f(x))

• Predict function f(x) for new examples x

– Discrete f(x): Classification

– Continuous f(x): Regression

– f(x) = Probability(x): Probability estimation

(C) Dhruv Batra

Slide Credit: Pedro Domingos, Tom Mitchel, Tom Dietterich

Supervised Learning

• Input: x (images, text, emails…)

• Output: y (spam or non-spam…)

• (Unknown) Target Function – f: X → Y (the “true” mapping / reality)

• Data – (x1,y1), (x2,y2), …, (xN,yN)

• Model / Hypothesis Class – g: X → Y

– y = g(x) = sign(w^T x)

• Learning = Search in hypothesis space – Find best g in model class.

(C) Dhruv Batra
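As a rough sketch of the hypothesis class above, the snippet below evaluates g(x) = sign(w^T x) on a made-up weight vector w and input x (both hypothetical, for illustration only).

    import numpy as np

    def g(x, w):
        """Linear hypothesis g(x) = sign(w^T x); returns +1 or -1."""
        return 1 if np.dot(w, x) >= 0 else -1

    w = np.array([0.5, -1.0, 2.0])   # hypothetical weight vector
    x = np.array([1.0, 3.0, 0.2])    # one input example
    print(g(x, w))                   # -> -1 for this x and w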

(C) Dhruv Batra Slide Credit: Yaser Abu-Mostafa

Basic Steps of Supervised Learning

• Set up a supervised learning problem

• Data collection – Start with training data for which we know the correct outcome provided by a teacher or

oracle.

• Representation – Choose how to represent the data.

• Modeling – Choose a hypothesis class: H = {g: X → Y}

• Learning/Estimation – Find the best hypothesis you can in the chosen class.

• Model Selection – Try different models. Pick the best one. (More on this later)

• If happy, stop – Else refine one or more of the above

(C) Dhruv Batra

Learning is hard!

• No assumptions = No learning

(C) Dhruv Batra

Klingon vs Mlingon Classification

• Training Data

– Klingon: klix, kour, koop

– Mlingon: moo, maa, mou

• Testing Data: kap

• Which language?

• Why?

(C) Dhruv Batra

Concept Learning

• Definition: Acquire an operational definition of a

general category of objects given positive and

negative training examples.

• Also called binary classification, binary supervised learning

Slide credit: Thorsten Joachims

(C) Dhruv Batra

Loss/Error Functions

• How do we measure performance?

• Regression:

– L2 error

• Classification:

– #misclassifications

– Weighted misclassification via a cost matrix

– For 2-class classification:

• True Positive, False Positive, True Negative, False Negative

– For k-class classification:

• Confusion Matrix

(C) Dhruv Batra
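A small illustrative sketch (made-up labels, not from the slides) of the error measures listed above: L2 error for regression, the misclassification count, and a k-class confusion matrix.

    import numpy as np

    # Regression: L2 (squared) error
    y_true = np.array([1.0, 2.0, 3.0])
    y_pred = np.array([1.1, 1.8, 3.5])
    l2_error = np.sum((y_true - y_pred) ** 2)

    # Classification: number of misclassifications
    labels_true = np.array([0, 1, 2, 1])
    labels_pred = np.array([0, 2, 2, 1])
    n_errors = np.sum(labels_true != labels_pred)

    # k-class confusion matrix: C[i, j] = # examples of class i predicted as class j
    k = 3
    confusion = np.zeros((k, k), dtype=int)
    for t, p in zip(labels_true, labels_pred):
        confusion[t, p] += 1

    print(l2_error, n_errors)
    print(confusion)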

Training vs Testing

• What do we want?

– Good performance (low loss) on training data?

– No, Good performance on unseen test data!

• Training Data:

– { (x1,y1), (x2,y2), …, (xN,yN) }

– Given to us for learning f

• Testing Data

– { x1, x2, …, xM }

– Used to see if we have learnt anything

(C) Dhruv Batra
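A minimal sketch of keeping test data unseen, assuming a hypothetical dataset and an 80/20 split (both choices are illustrative, not from the slides):

    import numpy as np

    # Hypothetical dataset: 100 examples, 2 features, binary labels
    rng = np.random.default_rng(0)
    X = rng.normal(size=(100, 2))
    y = (X[:, 0] + X[:, 1] > 0).astype(int)

    # Shuffle, then hold out 20% as unseen test data
    perm = rng.permutation(len(X))
    split = int(0.8 * len(X))
    train_idx, test_idx = perm[:split], perm[split:]
    X_train, y_train = X[train_idx], y[train_idx]   # given to us for learning f
    X_test, y_test = X[test_idx], y[test_idx]       # only used to evaluate, never to fit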

Procedural View

• Training Stage:

– Raw Data → x (Feature Extraction)

– Training Data { (x,y) } → f (Learning)

• Testing Stage

– Raw Data → x (Feature Extraction)

– Test Data x → f(x) (Apply function, Evaluate error)

(C) Dhruv Batra

Statistical Estimation View

• Probabilities to rescue:

– x and y are random variables

– D = (x1,y1), (x2,y2), …, (xN,yN) ~ P(X,Y)

• IID: Independent Identically Distributed

– Both training & testing data sampled IID from P(X,Y)

– Learn on training set

– Have some hope of generalizing to test set

(C) Dhruv Batra

Concepts

• Capacity

– Measures how large the hypothesis class H is.

– Are all functions allowed?

• Overfitting

– f works well on training data

– Works poorly on test data

• Generalization

– The ability to achieve low error on new test data

(C) Dhruv Batra

Guarantees

• 20 years of research in Learning Theory

oversimplified:

• If you have:

– Enough training data D

– and H is not too complex

– then probably we can generalize to unseen test data

(C) Dhruv Batra


Your First Classifier:

Nearest Neighbours

(C) Dhruv Batra Image Credit: Wikipedia

Synonyms

• Nearest Neighbours

• k-Nearest Neighbours

• Member of following families:

– Instance-based Learning

– Memory-based Learning

– Exemplar methods

– Non-parametric methods

(C) Dhruv Batra

Nearest Neighbour is an example of…

Instance-based learning

Has been around since about 1910.

To make a prediction, search the database for similar datapoints, and fit with the local points.

Assumption: Nearby points behave similarly w.r.t. y

[Stored training pairs: (x1, y1), (x2, y2), (x3, y3), …, (xn, yn)]

(C) Dhruv Batra

Instance/Memory-based Learning

Four things make a memory based learner:

• A distance metric

• How many nearby neighbors to look at?

• A weighting function (optional)

• How to fit with the local points?

(C) Dhruv Batra Slide Credit: Carlos Guestrin

1-Nearest Neighbour

Four things make a memory based learner:

• A distance metric

– Euclidean (and others)

• How many nearby neighbors to look at?

– 1

• A weighting function (optional)

– unused

• How to fit with the local points?

– Just predict the same output as the nearest neighbour.

(C) Dhruv Batra Slide Credit: Carlos Guestrin
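A minimal 1-NN sketch of the recipe above, assuming Euclidean distance and a tiny made-up training set; the function and variable names are illustrative.

    import numpy as np

    def predict_1nn(X_train, y_train, x_query):
        """Return the label of the single closest training point (Euclidean distance)."""
        dists = np.linalg.norm(X_train - x_query, axis=1)  # distance to every training point
        return y_train[np.argmin(dists)]

    # Tiny illustrative dataset
    X_train = np.array([[0.0, 0.0], [1.0, 1.0], [5.0, 5.0]])
    y_train = np.array([0, 0, 1])
    print(predict_1nn(X_train, y_train, np.array([4.0, 4.5])))  # -> 1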


k-Nearest Neighbour

Four things make a memory based learner:

• A distance metric

– Euclidean (and others)

• How many nearby neighbors to look at?

– k

• A weighting function (optional)

– unused

• How to fit with the local points?

– Just predict the average output among the nearest

neighbours.

(C) Dhruv Batra Slide Credit: Carlos Guestrin
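A short sketch of the k-NN recipe above on made-up data. The slide says to predict the average output; for discrete labels the usual variant is a majority vote, which is what this sketch (an assumption, not the slide's exact rule) does.

    import numpy as np
    from collections import Counter

    def predict_knn(X_train, y_train, x_query, k=3):
        """k-NN with Euclidean distance and no weighting."""
        dists = np.linalg.norm(X_train - x_query, axis=1)
        nn_idx = np.argsort(dists)[:k]            # indices of the k closest points
        neighbour_labels = y_train[nn_idx]
        # Majority vote for classification (averaging the outputs would give regression)
        return Counter(neighbour_labels).most_common(1)[0][0]

    X_train = np.array([[0, 0], [1, 0], [0, 1], [5, 5], [6, 5]])
    y_train = np.array([0, 0, 0, 1, 1])
    print(predict_knn(X_train, y_train, np.array([0.4, 0.4]), k=3))  # -> 0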

1 vs k Nearest Neighbour

(C) Dhruv Batra Image Credit: Ying Wu

1 vs k Nearest Neighbour

(C) Dhruv Batra Image Credit: Ying Wu


Nearest Neighbour

• Demo 1

– http://cgm.cs.mcgill.ca/~soss/cs644/projects/perrier/Nearest.html

• Demo 2

– http://www.cs.technion.ac.il/~rani/LocBoost/

(C) Dhruv Batra

• Gender Classification from body proportions

– Igor Janjic & Daniel Friedman, Juniors

(C) Dhruv Batra

Scene Completion [Hays & Efros, SIGGRAPH 2007]

(C) Dhruv Batra

Hays and Efros, SIGGRAPH 2007

… 200 total

Context Matching

Graph cut + Poisson blending



1-NN for Regression

(C) Dhruv Batra

[Figure: x–y scatterplot highlighting the training point closest to the query]

Figure Credit: Carlos Guestrin

1-NN for Regression

• Often bumpy (overfits)

(C) Dhruv Batra Figure Credit: Andrew Moore

9-NN for Regression

• Often bumpy (overfits)

(C) Dhruv Batra Figure Credit: Andrew Moore
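A sketch of k-NN for regression matching the 1-NN / 9-NN figures above, on hypothetical 1-D data: with k = 1 the prediction copies the single closest point (bumpy), with larger k it averages the k nearest outputs.

    import numpy as np

    def knn_regress(x_train, y_train, x_query, k=1):
        """Predict the mean y of the k training points closest to x_query (1-D inputs)."""
        dists = np.abs(x_train - x_query)
        nn_idx = np.argsort(dists)[:k]
        return y_train[nn_idx].mean()

    x_train = np.linspace(0, 10, 20)
    y_train = np.sin(x_train)
    print(knn_regress(x_train, y_train, 3.1, k=1))   # follows the single nearest point exactly
    print(knn_regress(x_train, y_train, 3.1, k=9))   # smoother, averaged prediction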

Multivariate distance metrics

Suppose the input vectors x1, x2, …xN are two dimensional:

x1 = ( x11 , x12 ) , x2 = ( x21 , x22 ) , …xN = ( xN1 , xN2 ).

One can draw the nearest-neighbor regions in input space.

Dist(xi, xj) = (xi1 − xj1)^2 + (xi2 − xj2)^2

versus

Dist(xi, xj) = (xi1 − xj1)^2 + (3·xi2 − 3·xj2)^2

The relative scalings in the distance metric affect region shapes.

Slide Credit: Carlos Guestrin
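A tiny numeric sketch (points chosen for illustration) of the scaling effect described above: multiplying the second coordinate by 3 inside the distance changes which of two points is closer to the query.

    import numpy as np

    def dist_scaled(xi, xj, scale=(1.0, 1.0)):
        """Weighted squared Euclidean distance: sum_d (scale_d * (xi_d - xj_d))^2."""
        s = np.asarray(scale)
        return np.sum((s * (np.asarray(xi) - np.asarray(xj))) ** 2)

    q = (0.0, 0.0)
    a, b = (2.0, 0.1), (0.5, 1.0)
    print(dist_scaled(q, a), dist_scaled(q, b))                      # unscaled: a is farther
    print(dist_scaled(q, a, (1, 3)), dist_scaled(q, b, (1, 3)))      # scaled: now b is farther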

Euclidean distance metric

Dist(x, x') = sqrt( Σi σi^2 (xi − x'i)^2 )

Or equivalently, Dist(x, x') = sqrt( (x − x')^T A (x − x') ), where A is a diagonal matrix with entries σi^2.

Slide Credit: Carlos Guestrin

Notable distance metrics (and their level sets)

[Figure: level sets of the scaled Euclidean (L2) metric and the Mahalanobis metric (non-diagonal A)]

Slide Credit: Carlos Guestrin

Minkowski distance

(C) Dhruv Batra Image Credit: By Waldir (Based on File:MinkowskiCircles.svg)

[CC BY-SA 3.0 (http://creativecommons.org/licenses/by-sa/3.0)], via Wikimedia Commons

Notable distance metrics (and their level sets)

[Figure: level sets of the L1 (absolute) norm, the L∞ (max) norm, the scaled Euclidean (L2) metric, and the Mahalanobis metric (non-diagonal A)]

Slide Credit: Carlos Guestrin
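An illustrative sketch of the metrics listed above, using made-up vectors and a made-up positive-definite matrix A for the Mahalanobis case.

    import numpy as np

    x = np.array([1.0, 2.0])
    xp = np.array([3.0, -1.0])
    d = x - xp

    l1 = np.sum(np.abs(d))                             # L1 (absolute) norm
    linf = np.max(np.abs(d))                           # L-infinity (max) norm
    l2_scaled = np.sqrt(d @ np.diag([1.0, 4.0]) @ d)   # scaled Euclidean (diagonal A)

    A = np.array([[2.0, 0.5],
                  [0.5, 1.0]])                         # non-diagonal, positive-definite A
    mahalanobis = np.sqrt(d @ A @ d)                   # Mahalanobis-style distance

    print(l1, linf, l2_scaled, mahalanobis)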

Kernel Regression/Classification

Four things make a memory based learner:

• A distance metric

– Euclidean (and others)

• How many nearby neighbors to look at?

– All of them

• A weighting function (optional)

– wi = exp(−d(xi, query)^2 / σ^2)

– Nearby points to the query are weighted strongly, far points weakly. The σ parameter is the Kernel Width. Very important.

• How to fit with the local points?

– Predict the weighted average of the outputs: predict = Σ wi yi / Σ wi

(C) Dhruv Batra Slide Credit: Carlos Guestrin
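A direct sketch of the kernel-regression recipe above (Gaussian weights over all training points); the data and σ are illustrative.

    import numpy as np

    def kernel_regress(x_train, y_train, x_query, sigma=1.0):
        """Weighted-average prediction: sum(w_i * y_i) / sum(w_i),
        with w_i = exp(-d(x_i, query)^2 / sigma^2)."""
        d2 = np.sum((x_train - x_query) ** 2, axis=1)   # squared Euclidean distances
        w = np.exp(-d2 / sigma**2)
        return np.sum(w * y_train) / np.sum(w)

    x_train = np.array([[0.0], [1.0], [2.0], [3.0]])
    y_train = np.array([0.0, 1.0, 4.0, 9.0])
    print(kernel_regress(x_train, y_train, np.array([1.5]), sigma=0.5))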

Weighting/Kernel functions

wi = exp(−d(xi, query)^2 / σ^2)

(Our examples use Gaussian)

(C) Dhruv Batra Slide Credit: Carlos Guestrin

Effect of Kernel Width

(C) Dhruv Batra

• What happens as σ → ∞?

• What happens as σ → 0?

Image Credit: Ben Taskar

Problems with Instance-Based Learning

• Expensive

– No Learning: most real work done during testing

– For every test sample, must search through the entire dataset – very slow!

– Must use tricks like approximate nearest neighbour search

• Doesn’t work well with a large number of irrelevant features

– Distances overwhelmed by noisy features

• Curse of Dimensionality

– Distances become meaningless in high dimensions

– (See proof in next lecture)

(C) Dhruv Batra
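As one example of the “tricks” mentioned above, a KD-tree speeds up exact nearest-neighbour search in low-to-moderate dimensions (approximate methods go further); this sketch uses SciPy's cKDTree on hypothetical data.

    import numpy as np
    from scipy.spatial import cKDTree

    rng = np.random.default_rng(0)
    X_train = rng.normal(size=(10_000, 8))     # hypothetical training set
    queries = rng.normal(size=(5, 8))

    tree = cKDTree(X_train)                    # build the index once, at "training" time
    dists, idx = tree.query(queries, k=3)      # fast lookup of the 3 nearest neighbours
    print(idx.shape)                           # (5, 3)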

What you need to know

• k-NN

– Simplest learning algorithm

– With sufficient data, very hard to beat “strawman” approach

– Picking k?

• Kernel regression/classification

– Set k to n (number of data points) and choose kernel width

– Smoother than k-NN

• Problems with k-NN

– Curse of dimensionality

– Irrelevant features often killers for instance-based approaches

– Slow NN search: Must remember (very large) dataset for prediction

(C) Dhruv Batra

Parametric vs Non-Parametric Models

• Does the capacity (size of hypothesis class) grow

with size of training data?

– Yes = Non-Parametric Models

– No = Parametric Models

• Example

– http://www.theparticle.com/applets/ml/nearest_neighbor/

(C) Dhruv Batra

Recommended