
VBM683

Machine Learning

Pinar Duygulu

Slides are adapted from

Dhruv Batra,

Fei-Fei Li & Andrej Karpathy & Justin Johnson,

Aykut Erdem

Python

• Python is a high-level, dynamically typed

multiparadigm programming language. Python code

is often said to be almost like pseudocode, since it

allows you to express very powerful ideas in very few

lines of code while being very readable.

• http://cs231n.github.io/python-numpy-tutorial/

Fei-Fei Li & Andrej Karpathy & Justin Johnson

NumPy

http://www.numpy.org/

• NumPy is the fundamental package for scientific

computing with Python. It contains among other

things:

• a powerful N-dimensional array object

• sophisticated (broadcasting) functions

• tools for integrating C/C++ and Fortran code

• useful linear algebra, Fourier transform, and random

number capabilities

• Besides its obvious scientific uses, NumPy can also

be used as an efficient multi-dimensional container of

generic data. Arbitrary data-types can be defined.

This allows NumPy to seamlessly and speedily

integrate with a wide variety of databases.

Fei-Fei Li & Andrej Karpathy & Justin Johnson
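To make the list above concrete, here is a small illustrative snippet (not from the original slides) touching each bullet: an N-dimensional array, broadcasting, and the linear-algebra, Fourier-transform, and random-number routines.

    import numpy as np

    # N-dimensional array object
    a = np.arange(12).reshape(3, 4)              # 3x4 array holding 0..11

    # Broadcasting: subtract a per-column mean without writing loops
    a_centered = a - a.mean(axis=0)

    # Linear algebra, Fourier transform, random numbers
    x = np.linalg.solve(np.eye(3), np.ones(3))   # solve I x = [1, 1, 1]
    spectrum = np.fft.fft(np.ones(8))            # discrete Fourier transform
    noise = np.random.randn(5)                   # samples from N(0, 1)

    print(a_centered.shape, x, spectrum.shape, noise.shape)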

Plan for today

• Introduction to Supervised/Inductive Learning

• Your first classifier: k-Nearest Neighbour

(C) Dhruv Batra

Types of Learning

• Supervised learning

– Training data includes desired outputs

• Unsupervised learning

– Training data does not include desired outputs

• Weakly or Semi-supervised learning

– Training data includes a few desired outputs

• Reinforcement learning

– Rewards from sequence of actions

(C) Dhruv Batra

Supervised / Inductive Learning

• Given

– examples of a function (x, f(x))

• Predict function f(x) for new examples x

– Discrete f(x): Classification

– Continuous f(x): Regression

– f(x) = Probability(x): Probability estimation

(C) Dhruv Batra

Slide Credit: Pedro Domingos, Tom Mitchel, Tom Dietterich

Supervised Learning

• Input: x (images, text, emails…)

• Output: y (spam or non-spam…)

• (Unknown) Target Function – f: X → Y (the “true” mapping / reality)

• Data – (x1,y1), (x2,y2), …, (xN,yN)

• Model / Hypothesis Class – g: X → Y

– y = g(x) = sign(w^T x)

• Learning = Search in hypothesis space – Find best g in model class.

(C) Dhruv Batra
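As a rough sketch of the hypothesis class above, the snippet below evaluates g(x) = sign(w^T x) on a made-up weight vector w and input x (both hypothetical, for illustration only).

    import numpy as np

    def g(x, w):
        """Linear hypothesis g(x) = sign(w^T x); returns +1 or -1."""
        return 1 if np.dot(w, x) >= 0 else -1

    w = np.array([0.5, -1.0, 2.0])   # hypothetical weight vector
    x = np.array([1.0, 3.0, 0.2])    # one input example
    print(g(x, w))                   # -> -1 for this x and w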

(C) Dhruv Batra Slide Credit: Yaser Abu-Mostafa

Basic Steps of Supervised Learning

• Set up a supervised learning problem

• Data collection – Start with training data for which we know the correct outcome provided by a teacher or

oracle.

• Representation – Choose how to represent the data.

• Modeling – Choose a hypothesis class: H = {g: X → Y}

• Learning/Estimation – Find the best hypothesis you can in the chosen class.

• Model Selection – Try different models. Pick the best one. (More on this later)

• If happy, stop – Else refine one or more of the above

(C) Dhruv Batra

Learning is hard!

• No assumptions = No learning

(C) Dhruv Batra

Klingon vs Mlingon Classification

• Training Data

– Klingon: klix, kour, koop

– Mlingon: moo, maa, mou

• Testing Data: kap

• Which language?

• Why?

(C) Dhruv Batra

Concept Learning

• Definition: Acquire an operational definition of a

general category of objects given positive and

negative training examples.

• Also called binary classification, binary supervised learning

Slide credit: Thorsten Joachims

(C) Dhruv Batra

Loss/Error Functions

• How do we measure performance?

• Regression:

– L2 error

• Classification:

– #misclassifications

– Weighted misclassification via a cost matrix

– For 2-class classification:

• True Positive, False Positive, True Negative, False Negative

– For k-class classification:

• Confusion Matrix

(C) Dhruv Batra
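A small illustrative sketch (made-up labels, not from the slides) of the error measures listed above: L2 error for regression, the misclassification count, and a k-class confusion matrix.

    import numpy as np

    # Regression: L2 (squared) error
    y_true = np.array([1.0, 2.0, 3.0])
    y_pred = np.array([1.1, 1.8, 3.5])
    l2_error = np.sum((y_true - y_pred) ** 2)

    # Classification: number of misclassifications
    labels_true = np.array([0, 1, 2, 1])
    labels_pred = np.array([0, 2, 2, 1])
    n_errors = np.sum(labels_true != labels_pred)

    # k-class confusion matrix: C[i, j] = # examples of class i predicted as class j
    k = 3
    confusion = np.zeros((k, k), dtype=int)
    for t, p in zip(labels_true, labels_pred):
        confusion[t, p] += 1

    print(l2_error, n_errors)
    print(confusion)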

Training vs Testing

• What do we want?

– Good performance (low loss) on training data?

– No, Good performance on unseen test data!

• Training Data:

– { (x1,y1), (x2,y2), …, (xN,yN) }

– Given to us for learning f

• Testing Data

– { x1, x2, …, xM }

– Used to see if we have learnt anything

(C) Dhruv Batra
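A minimal sketch of keeping test data unseen, assuming a hypothetical dataset and an 80/20 split (both choices are illustrative, not from the slides):

    import numpy as np

    # Hypothetical dataset: 100 examples, 2 features, binary labels
    rng = np.random.default_rng(0)
    X = rng.normal(size=(100, 2))
    y = (X[:, 0] + X[:, 1] > 0).astype(int)

    # Shuffle, then hold out 20% as unseen test data
    perm = rng.permutation(len(X))
    split = int(0.8 * len(X))
    train_idx, test_idx = perm[:split], perm[split:]
    X_train, y_train = X[train_idx], y[train_idx]   # given to us for learning f
    X_test, y_test = X[test_idx], y[test_idx]       # only used to evaluate, never to fit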

Procedural View

• Training Stage:

– Raw Data → x (Feature Extraction)

– Training Data { (x,y) } → f (Learning)

• Testing Stage

– Raw Data → x (Feature Extraction)

– Test Data x → f(x) (Apply function, Evaluate error)

(C) Dhruv Batra

Statistical Estimation View

• Probabilities to rescue:

– x and y are random variables

– D = (x1,y1), (x2,y2), …, (xN,yN) ~ P(X,Y)

• IID: Independent Identically Distributed

– Both training & testing data sampled IID from P(X,Y)

– Learn on training set

– Have some hope of generalizing to test set

(C) Dhruv Batra

Concepts

• Capacity

– Measures how large the hypothesis class H is.

– Are all functions allowed?

• Overfitting

– f works well on training data

– Works poorly on test data

• Generalization

– The ability to achieve low error on new test data

(C) Dhruv Batra

Guarantees

• 20 years of research in Learning Theory

oversimplified:

• If you have:

– Enough training data D

– and H is not too complex

– then probably we can generalize to unseen test data

(C) Dhruv Batra


Your First Classifier:

Nearest Neighbours

(C) Dhruv Batra Image Credit: Wikipedia

Synonyms

• Nearest Neighbours

• k-Nearest Neighbours

• Member of following families:

– Instance-based Learning

– Memory-based Learning

– Exemplar methods

– Non-parametric methods

(C) Dhruv Batra

Nearest Neighbour is an example of…

Instance-based learning

Has been around since about 1910.

To make a prediction, search the database for similar datapoints, and fit with the local points.

Assumption: Nearby points behave similarly w.r.t. y

[Stored training pairs: (x1, y1), (x2, y2), (x3, y3), …, (xn, yn)]

(C) Dhruv Batra

Instance/Memory-based Learning

Four things make a memory based learner:

• A distance metric

• How many nearby neighbors to look at?

• A weighting function (optional)

• How to fit with the local points?

(C) Dhruv Batra Slide Credit: Carlos Guestrin

1-Nearest Neighbour

Four things make a memory based learner:

• A distance metric

– Euclidean (and others)

• How many nearby neighbors to look at?

– 1

• A weighting function (optional)

– unused

• How to fit with the local points?

– Just predict the same output as the nearest neighbour.

(C) Dhruv Batra Slide Credit: Carlos Guestrin
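A minimal 1-NN sketch of the recipe above, assuming Euclidean distance and a tiny made-up training set; the function and variable names are illustrative.

    import numpy as np

    def predict_1nn(X_train, y_train, x_query):
        """Return the label of the single closest training point (Euclidean distance)."""
        dists = np.linalg.norm(X_train - x_query, axis=1)  # distance to every training point
        return y_train[np.argmin(dists)]

    # Tiny illustrative dataset
    X_train = np.array([[0.0, 0.0], [1.0, 1.0], [5.0, 5.0]])
    y_train = np.array([0, 0, 1])
    print(predict_1nn(X_train, y_train, np.array([4.0, 4.5])))  # -> 1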


k-Nearest Neighbour

Four things make a memory based learner:

• A distance metric

– Euclidean (and others)

• How many nearby neighbors to look at?

– k

• A weighting function (optional)

– unused

• How to fit with the local points?

– Just predict the average output among the nearest

neighbours.

(C) Dhruv Batra Slide Credit: Carlos Guestrin
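A short sketch of the k-NN recipe above on made-up data. The slide says to predict the average output; for discrete labels the usual variant is a majority vote, which is what this sketch (an assumption, not the slide's exact rule) does.

    import numpy as np
    from collections import Counter

    def predict_knn(X_train, y_train, x_query, k=3):
        """k-NN with Euclidean distance and no weighting."""
        dists = np.linalg.norm(X_train - x_query, axis=1)
        nn_idx = np.argsort(dists)[:k]            # indices of the k closest points
        neighbour_labels = y_train[nn_idx]
        # Majority vote for classification (averaging the outputs would give regression)
        return Counter(neighbour_labels).most_common(1)[0][0]

    X_train = np.array([[0, 0], [1, 0], [0, 1], [5, 5], [6, 5]])
    y_train = np.array([0, 0, 0, 1, 1])
    print(predict_knn(X_train, y_train, np.array([0.4, 0.4]), k=3))  # -> 0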

1 vs k Nearest Neighbour

(C) Dhruv Batra Image Credit: Ying Wu

1 vs k Nearest Neighbour

(C) Dhruv Batra Image Credit: Ying Wu


Nearest Neighbour

• Demo 1

– http://cgm.cs.mcgill.ca/~soss/cs644/projects/perrier/Nearest.html

• Demo 2

– http://www.cs.technion.ac.il/~rani/LocBoost/

(C) Dhruv Batra

• Gender Classification from body proportions

– Igor Janjic & Daniel Friedman, Juniors

(C) Dhruv Batra

Scene Completion [Hays & Efros, SIGGRAPH 2007]

(C) Dhruv Batra

Hays and Efros, SIGGRAPH 2007

… 200 total

Context Matching

Graph cut + Poisson blending



1-NN for Regression

(C) Dhruv Batra

[Figure: x–y scatterplot highlighting the training point closest to the query]

Figure Credit: Carlos Guestrin

1-NN for Regression

• Often bumpy (overfits)

(C) Dhruv Batra Figure Credit: Andrew Moore

9-NN for Regression

• Often bumpy (overfits)

(C) Dhruv Batra Figure Credit: Andrew Moore
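A sketch of k-NN for regression matching the 1-NN / 9-NN figures above, on hypothetical 1-D data: with k = 1 the prediction copies the single closest point (bumpy), with larger k it averages the k nearest outputs.

    import numpy as np

    def knn_regress(x_train, y_train, x_query, k=1):
        """Predict the mean y of the k training points closest to x_query (1-D inputs)."""
        dists = np.abs(x_train - x_query)
        nn_idx = np.argsort(dists)[:k]
        return y_train[nn_idx].mean()

    x_train = np.linspace(0, 10, 20)
    y_train = np.sin(x_train)
    print(knn_regress(x_train, y_train, 3.1, k=1))   # follows the single nearest point exactly
    print(knn_regress(x_train, y_train, 3.1, k=9))   # smoother, averaged prediction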

Multivariate distance metrics

Suppose the input vectors x1, x2, …xN are two dimensional:

x1 = ( x11 , x12 ) , x2 = ( x21 , x22 ) , …xN = ( xN1 , xN2 ).

One can draw the nearest-neighbor regions in input space.

Dist(xi, xj) = (xi1 − xj1)^2 + (xi2 − xj2)^2

versus

Dist(xi, xj) = (xi1 − xj1)^2 + (3·xi2 − 3·xj2)^2

The relative scalings in the distance metric affect region shapes.

Slide Credit: Carlos Guestrin
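A tiny numeric sketch (points chosen for illustration) of the scaling effect described above: multiplying the second coordinate by 3 inside the distance changes which of two points is closer to the query.

    import numpy as np

    def dist_scaled(xi, xj, scale=(1.0, 1.0)):
        """Weighted squared Euclidean distance: sum_d (scale_d * (xi_d - xj_d))^2."""
        s = np.asarray(scale)
        return np.sum((s * (np.asarray(xi) - np.asarray(xj))) ** 2)

    q = (0.0, 0.0)
    a, b = (2.0, 0.1), (0.5, 1.0)
    print(dist_scaled(q, a), dist_scaled(q, b))                      # unscaled: a is farther
    print(dist_scaled(q, a, (1, 3)), dist_scaled(q, b, (1, 3)))      # scaled: now b is farther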

Euclidean distance metric

Dist(x, x') = sqrt( Σi σi^2 (xi − x'i)^2 )

Or equivalently, Dist(x, x') = sqrt( (x − x')^T A (x − x') ), where A is a diagonal matrix with entries σi^2.

Slide Credit: Carlos Guestrin

Notable distance metrics (and their level sets)

[Figure: level sets of the scaled Euclidean (L2) metric and the Mahalanobis metric (non-diagonal A)]

Slide Credit: Carlos Guestrin

Minkowski distance

(C) Dhruv Batra Image Credit: By Waldir (Based on File:MinkowskiCircles.svg)

[CC BY-SA 3.0 (http://creativecommons.org/licenses/by-sa/3.0)], via Wikimedia Commons

Notable distance metrics (and their level sets)

[Figure: level sets of the L1 (absolute) norm, the L∞ (max) norm, the scaled Euclidean (L2) metric, and the Mahalanobis metric (non-diagonal A)]

Slide Credit: Carlos Guestrin
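An illustrative sketch of the metrics listed above, using made-up vectors and a made-up positive-definite matrix A for the Mahalanobis case.

    import numpy as np

    x = np.array([1.0, 2.0])
    xp = np.array([3.0, -1.0])
    d = x - xp

    l1 = np.sum(np.abs(d))                             # L1 (absolute) norm
    linf = np.max(np.abs(d))                           # L-infinity (max) norm
    l2_scaled = np.sqrt(d @ np.diag([1.0, 4.0]) @ d)   # scaled Euclidean (diagonal A)

    A = np.array([[2.0, 0.5],
                  [0.5, 1.0]])                         # non-diagonal, positive-definite A
    mahalanobis = np.sqrt(d @ A @ d)                   # Mahalanobis-style distance

    print(l1, linf, l2_scaled, mahalanobis)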

Kernel Regression/Classification

Four things make a memory based learner:

• A distance metric

– Euclidean (and others)

• How many nearby neighbors to look at?

– All of them

• A weighting function (optional)

– wi = exp(−d(xi, query)^2 / σ^2)

– Nearby points to the query are weighted strongly, far points weakly. The σ parameter is the Kernel Width. Very important.

• How to fit with the local points?

– Predict the weighted average of the outputs: predict = Σ wi yi / Σ wi

(C) Dhruv Batra Slide Credit: Carlos Guestrin
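A direct sketch of the kernel-regression recipe above (Gaussian weights over all training points); the data and σ are illustrative.

    import numpy as np

    def kernel_regress(x_train, y_train, x_query, sigma=1.0):
        """Weighted-average prediction: sum(w_i * y_i) / sum(w_i),
        with w_i = exp(-d(x_i, query)^2 / sigma^2)."""
        d2 = np.sum((x_train - x_query) ** 2, axis=1)   # squared Euclidean distances
        w = np.exp(-d2 / sigma**2)
        return np.sum(w * y_train) / np.sum(w)

    x_train = np.array([[0.0], [1.0], [2.0], [3.0]])
    y_train = np.array([0.0, 1.0, 4.0, 9.0])
    print(kernel_regress(x_train, y_train, np.array([1.5]), sigma=0.5))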

Weighting/Kernel functions

wi = exp(−d(xi, query)^2 / σ^2)

(Our examples use Gaussian)

(C) Dhruv Batra Slide Credit: Carlos Guestrin

Effect of Kernel Width

(C) Dhruv Batra

• What happens as σ → ∞?

• What happens as σ → 0?

Image Credit: Ben Taskar

Problems with Instance-Based Learning

• Expensive

– No Learning: most real work done during testing

– For every test sample, must search through the entire dataset – very slow!

– Must use tricks like approximate nearest neighbour search

• Doesn’t work well with a large number of irrelevant features

– Distances overwhelmed by noisy features

• Curse of Dimensionality

– Distances become meaningless in high dimensions

– (See proof in next lecture)

(C) Dhruv Batra
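As one example of the “tricks” mentioned above, a KD-tree speeds up exact nearest-neighbour search in low-to-moderate dimensions (approximate methods go further); this sketch uses SciPy's cKDTree on hypothetical data.

    import numpy as np
    from scipy.spatial import cKDTree

    rng = np.random.default_rng(0)
    X_train = rng.normal(size=(10_000, 8))     # hypothetical training set
    queries = rng.normal(size=(5, 8))

    tree = cKDTree(X_train)                    # build the index once, at "training" time
    dists, idx = tree.query(queries, k=3)      # fast lookup of the 3 nearest neighbours
    print(idx.shape)                           # (5, 3)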

What you need to know

• k-NN

– Simplest learning algorithm

– With sufficient data, very hard to beat “strawman” approach

– Picking k?

• Kernel regression/classification

– Set k to n (number of data points) and choose kernel width

– Smoother than k-NN

• Problems with k-NN

– Curse of dimensionality

– Irrelevant features often killers for instance-based approaches

– Slow NN search: Must remember (very large) dataset for prediction

(C) Dhruv Batra

Parametric vs Non-Parametric Models

• Does the capacity (size of hypothesis class) grow

with size of training data?

– Yes = Non-Parametric Models

– No = Parametric Models

• Example

– http://www.theparticle.com/applets/ml/nearest_neighbor/

(C) Dhruv Batra

Recommended