Data Classification
Rong Jin
Classification Problems
• Given input: x
• Predict the output (class label): y
• Binary classification: y ∈ {−1, +1}
• Multi-class classification: y ∈ {1, 2, …, C}
• Learn a classification function: f: x → y
• Regression: the output y is continuous (real-valued)
Examples of Classification Problem
Text categorization:
Doc: "Months of campaigning and weeks of round-the-clock efforts in Iowa all came down to a final push Sunday, …"
Topic: Politics or Sport?
• Input features x: word frequencies, e.g. {(campaigning, 1), (democrats, 2), (basketball, 0), …}
• Class label y: "Politics" or "Sport"
Examples of Classification Problem
Image classification: which images contain birds, and which do not?
• Input features x: color histogram, e.g. {(red, 1004), (blue, 23000), …}
• Class label y: y = +1 for a "bird image", y = −1 for a "non-bird image"
Supervised Learning
• Training examples: (x1, y1), (x2, y2), …, (xN, yN)
• Independent and identically distributed (i.i.d.) assumption
• A critical assumption for machine learning theory
Regression for Classification
• It is easy to turn binary classification into a regression problem: simply ignore the binary nature of the class label y (a sketch follows below)
• How to convert multi-class classification into a regression problem?
• Pros: computational efficiency
• Cons: ignores the discrete nature of the class label
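A minimal sketch of this idea, assuming labels encoded as ±1 and a plain least-squares fit (the exact formulation is not specified on the slide):

```python
import numpy as np

# Toy data: 2-D inputs with binary labels encoded as +1 / -1.
X = np.array([[0.0, 1.0], [1.0, 2.0], [2.0, 0.5], [3.0, 3.0]])
y = np.array([-1.0, -1.0, +1.0, +1.0])

# Ignore the binary nature of y: fit ordinary least squares on inputs
# augmented with a constant bias column.
X_aug = np.hstack([X, np.ones((X.shape[0], 1))])
w, *_ = np.linalg.lstsq(X_aug, y, rcond=None)

# Classify a new point by thresholding the real-valued output at zero.
x_new = np.array([2.5, 1.0, 1.0])   # last entry multiplies the bias weight
print(np.sign(w @ x_new))           # -> +1.0 or -1.0
```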
K Nearest Neighbour (kNN) Classifier
[Figures: kNN classification of a query point with k = 1 and with k = 4 neighbors]
How many neighbors should we count?
K Nearest Neighbour (kNN) Classifier
• K acts as a smoother
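A minimal kNN classifier sketch, assuming Euclidean distance and a majority vote (names are illustrative):

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_query, k=3):
    """Classify x_query by a majority vote among its k nearest training points."""
    dists = np.linalg.norm(X_train - x_query, axis=1)   # Euclidean distances
    nearest = np.argsort(dists)[:k]                      # indices of the k closest
    return Counter(y_train[nearest]).most_common(1)[0][0]

X_train = np.array([[0.0, 0.0], [0.1, 0.2], [1.0, 1.0], [0.9, 1.1]])
y_train = np.array([-1, -1, +1, +1])
print(knn_predict(X_train, y_train, np.array([0.8, 0.9]), k=3))   # -> 1
```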
Cross Validation
• Divide the training examples into two sets
• A training set (80%) and a validation set (20%)
• Predict the class labels of the validation set using the examples in the training set
• Choose the number of neighbors k that maximizes the classification accuracy on the validation set
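A sketch of this procedure with a single 80/20 split; it reuses the illustrative knn_predict helper defined above:

```python
import numpy as np

def choose_k_by_validation(X, y, candidate_ks, seed=0):
    """Return the k with the highest accuracy on a held-out 20% validation set."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(X))
    n_train = int(0.8 * len(X))
    tr, va = idx[:n_train], idx[n_train:]

    best_k, best_acc = candidate_ks[0], -1.0
    for k in candidate_ks:
        preds = np.array([knn_predict(X[tr], y[tr], X[i], k=k) for i in va])
        acc = np.mean(preds == y[va])
        if acc > best_acc:
            best_k, best_acc = k, acc
    return best_k
```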
Leave-One-Out Method
• Hold out one training example at a time, predict its label from the remaining examples, and count the total number of mistakes err(k) for each candidate k
• Worked example: err(1) = 3, err(2) = 2, err(3) = 6
• Choose k = 2, the value with the fewest leave-one-out errors
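A sketch of the leave-one-out count err(k), again assuming the knn_predict helper above:

```python
import numpy as np

def leave_one_out_errors(X, y, candidate_ks):
    """err(k): how many examples are misclassified when each one is
    predicted from all of the other examples."""
    errors = {}
    for k in candidate_ks:
        err = 0
        for i in range(len(X)):
            mask = np.arange(len(X)) != i       # leave example i out
            err += int(knn_predict(X[mask], y[mask], X[i], k=k) != y[i])
        errors[k] = err
    return errors

# e.g. errs = leave_one_out_errors(X_train, y_train, [1, 2, 3])
#      best_k = min(errs, key=errs.get)
```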
K-Nearest-Neighbours for Classification (1)
Given a data set with Nk data points from class Ck and Σk Nk = N, we have
p(x) ≈ K / (N V)
and correspondingly
p(x | Ck) ≈ Kk / (Nk V).
Since p(Ck) = Nk / N, Bayes' theorem gives
p(Ck | x) = p(x | Ck) p(Ck) / p(x) = Kk / K.
K-Nearest-Neighbours for Classification (2)
[Figure: kNN decision boundaries for K = 1 and K = 3]
Probabilistic Interpretation of kNN
• Estimate the conditional probability Pr(y | x)
• Count the data points of class y in the neighborhood of x
• Bias and variance tradeoff
• A small neighborhood → large variance → unreliable estimation
• A large neighborhood → large bias → inaccurate estimation
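A small sketch of this estimate, taking the "neighborhood of x" to be the K nearest training points (an assumption of this sketch):

```python
import numpy as np

def knn_class_probability(X_train, y_train, x_query, k=5):
    """Estimate Pr(y | x) as the fraction of the k nearest neighbors with label y."""
    dists = np.linalg.norm(X_train - x_query, axis=1)
    nearest_labels = y_train[np.argsort(dists)[:k]]
    classes, counts = np.unique(nearest_labels, return_counts=True)
    return dict(zip(classes, counts / k))
```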
Weighted kNN
• Weight the contribution of each close neighbor based on its distance to the query point
• Weight function: a decreasing function of the distance between the neighbor and the query
• Prediction: a weighted vote over the neighbors' class labels
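A sketch of weighted kNN; the Gaussian weight exp(−d²/σ²) is an assumed (though common) choice, not necessarily the exact weight function on the slide:

```python
import numpy as np

def weighted_knn_predict(X_train, y_train, x_query, k=5, sigma=1.0):
    """Weighted vote: closer neighbors contribute more to the predicted label."""
    dists = np.linalg.norm(X_train - x_query, axis=1)
    nearest = np.argsort(dists)[:k]
    weights = np.exp(-dists[nearest] ** 2 / sigma ** 2)   # assumed Gaussian weights
    score = {}
    for label, w in zip(y_train[nearest], weights):
        score[label] = score.get(label, 0.0) + w          # per-class weight totals
    return max(score, key=score.get)
```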
Nonparametric Methods
âĒ Parametric distribution models are restricted to specific forms, which may not always be suitable; for example, consider modelling a multimodal distribution with a single, unimodal model.
âĒ Nonparametric approaches make few assumptions about the overall shape of the distribution being modelled.
Nonparametric Methods (2)
Histogram methods partition the data space into distinct bins with widths Δi and count the number of observations, ni, in each bin.
• Often, the same width is used for all bins, Δi = Δ.
• Δ acts as a smoothing parameter.
• In a D-dimensional space, using M bins in each dimension will require M^D bins!
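A minimal 1-D histogram density estimate, using the usual normalization pi = ni / (N·Δ) (assumed here, since the slide's formula is not reproduced):

```python
import numpy as np

def histogram_density(data, n_bins=10):
    """1-D histogram density: count points per bin, normalize by N * bin width."""
    lo, hi = float(data.min()), float(data.max())
    delta = (hi - lo) / n_bins                    # common bin width
    counts, edges = np.histogram(data, bins=n_bins, range=(lo, hi))
    return counts / (len(data) * delta), edges    # p_i = n_i / (N * delta)

data = np.random.default_rng(0).normal(size=500)
density, edges = histogram_density(data, n_bins=20)
```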
Nonparametric Methods (3)
Assume observations drawn from a density p(x) and consider a small region R containing x such that
P = ∫R p(x) dx.
The probability that K out of N observations lie inside R is Bin(K | N, P), and if N is large
K ≈ N P.
If the volume of R, V, is sufficiently small, p(x) is approximately constant over R and
P ≈ p(x) V.
Thus
p(x) ≈ K / (N V).
V small, yet K > 0, therefore N large?
Nonparametric Methods (4)
Kernel Density Estimation: fix V, estimate K from the data. Let R be a hypercube of side h centred on x and define the kernel function (Parzen window)
k(u) = 1 if |ui| ≤ 1/2 for i = 1, …, D, and 0 otherwise.
It follows that
K = Σn k((x − xn) / h)
and hence
p(x) = (1/N) Σn (1 / h^D) k((x − xn) / h).
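A sketch of the hypercube (Parzen window) estimator just described:

```python
import numpy as np

def parzen_hypercube_density(X, x, h):
    """p(x) = (1/N) * sum_n (1/h^D) * k((x - x_n)/h) with the unit-hypercube kernel."""
    N, D = X.shape
    u = (x - X) / h                               # scaled offsets to every data point
    inside = np.all(np.abs(u) <= 0.5, axis=1)     # k(u) = 1 iff all |u_i| <= 1/2
    return inside.sum() / (N * h ** D)            # K / (N * V), with V = h^D
```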
Nonparametric Methods (5)
To avoid discontinuities in p(x), use a smooth kernel, e.g. a Gaussian:
p(x) = (1/N) Σn (1 / (2πh²)^(D/2)) exp(−‖x − xn‖² / (2h²)).
Any kernel such that
k(u) ≥ 0 and ∫ k(u) du = 1
will work. h acts as a smoother.
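And a sketch of the Gaussian kernel density estimate above:

```python
import numpy as np

def gaussian_kde(X, x, h):
    """p(x) = (1/N) * sum_n N(x | x_n, h^2 I)."""
    N, D = X.shape
    sq_dists = np.sum((X - x) ** 2, axis=1)            # ||x - x_n||^2
    norm = (2.0 * np.pi * h ** 2) ** (D / 2.0)          # Gaussian normalizing constant
    return float(np.mean(np.exp(-sq_dists / (2.0 * h ** 2)) / norm))
```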
Nonparametric Methods (6)
Nearest Neighbour Density Estimation: fix K, estimate V from the data. Consider a hypersphere centred on x and let it grow to a volume, V*, that includes K of the given N data points. Then
p(x) ≈ K / (N V*).
K acts as a smoother.
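A sketch of this estimator, growing a ball around x until it contains K points (the D-ball volume formula is standard geometry, not from the slide):

```python
import numpy as np
from math import gamma, pi

def knn_density(X, x, K):
    """p(x) ~= K / (N * V*), with V* the volume of the smallest ball
    around x that contains K of the data points."""
    N, D = X.shape
    dists = np.sort(np.linalg.norm(X - x, axis=1))
    r = dists[K - 1]                                        # radius to the K-th neighbor
    volume = (pi ** (D / 2) / gamma(D / 2 + 1)) * r ** D    # volume of a D-ball
    return K / (N * volume)
```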
Nonparametric Methods
• Nonparametric models (other than histograms) require storing and computing with the entire data set.
• Parametric models, once fitted, are much more efficient in terms of storage and computation.
Estimate the Parameter in the Weight Function
• Leave-one-out cross validation
• Divide the training data D into two sets: a validation set holding a single example (xi, yi) and a training set holding the remaining examples
• Compute the leave-one-out prediction for the held-out example
• In general, this is done for every training example in turn; choose the parameter value whose leave-one-out predictions are most accurate (a sketch follows below)
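A sketch of this selection step, reusing the weighted_knn_predict sketch (and its assumed Gaussian weight with parameter sigma) from the Weighted kNN section:

```python
import numpy as np

def choose_sigma_by_loo(X, y, candidate_sigmas, k=5):
    """Pick the weight-function parameter with the fewest leave-one-out errors."""
    best_sigma, best_err = candidate_sigmas[0], np.inf
    for sigma in candidate_sigmas:
        err = 0
        for i in range(len(X)):
            mask = np.arange(len(X)) != i          # hold out example i
            pred = weighted_knn_predict(X[mask], y[mask], X[i], k=k, sigma=sigma)
            err += int(pred != y[i])
        if err < best_err:
            best_sigma, best_err = sigma, err
    return best_sigma
```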
Challenges in Optimization
• Convex functions
• Single-mode functions (quasi-convex)
• Multi-mode functions (DC)
(difficulty of optimization increases down this list)
ML = Statistics + Optimization
• Modeling: θ is the parameter(s) to be decided
• Search for the best parameter
• Maximum likelihood estimation: construct a log-likelihood function, then search for the optimal solution
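A tiny illustration of the modeling-then-optimization pattern, assuming a 1-D Gaussian model whose mean θ is found by maximizing the log-likelihood over a grid:

```python
import numpy as np

def gaussian_log_likelihood(theta, data, sigma=1.0):
    """Log-likelihood of the data under N(theta, sigma^2)."""
    return np.sum(-0.5 * ((data - theta) / sigma) ** 2
                  - 0.5 * np.log(2.0 * np.pi * sigma ** 2))

data = np.array([1.8, 2.1, 2.4, 1.9, 2.0])
grid = np.linspace(0.0, 4.0, 401)                 # crude grid search for the optimum
theta_hat = grid[np.argmax([gaussian_log_likelihood(t, data) for t in grid])]
print(theta_hat, data.mean())                     # the MLE coincides with the sample mean
```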
When to Consider Nearest Neighbor?
• Lots of training data
• Less than 20 attributes per example
• Advantages:
  • Training is very fast
  • Learn complex target functions
  • Don't lose information
• Disadvantages:
  • Slow at query time
  • Easily fooled by irrelevant attributes
KD Tree for NN Search
• Each node contains:
  • Children information
  • The tightest box that bounds all the data points within the node
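A minimal sketch of such a node and a recursive build, splitting along the widest side of the bounding box (the splitting rule is an assumption of this sketch):

```python
import numpy as np

class KDNode:
    def __init__(self, points):
        self.lo = points.min(axis=0)           # tightest box bounding the node's points
        self.hi = points.max(axis=0)
        self.points = points
        self.left = self.right = None          # children information

def build_kd_tree(points, leaf_size=2):
    node = KDNode(points)
    if len(points) > leaf_size:
        dim = int(np.argmax(node.hi - node.lo))    # split along the widest dimension
        order = np.argsort(points[:, dim])
        mid = len(points) // 2
        node.left = build_kd_tree(points[order[:mid]], leaf_size)
        node.right = build_kd_tree(points[order[mid:]], leaf_size)
    return node
```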
NN Search by KD Tree
[Figures: step-by-step illustration of a nearest-neighbor search descending the KD tree]
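A sketch of the search itself, pruning any node whose bounding box cannot contain a point closer than the best found so far (uses the KDNode sketch above):

```python
import numpy as np

def kd_nearest(node, query, best=None, best_dist=np.inf):
    """Depth-first nearest-neighbor search over the KD tree."""
    # Distance from the query to the node's bounding box (zero if inside).
    gap = np.maximum(node.lo - query, 0.0) + np.maximum(query - node.hi, 0.0)
    if np.linalg.norm(gap) >= best_dist:
        return best, best_dist                     # prune: this box cannot help
    if node.left is None:                          # leaf: scan its points
        for p in node.points:
            d = np.linalg.norm(p - query)
            if d < best_dist:
                best, best_dist = p, d
        return best, best_dist
    for child in (node.left, node.right):
        best, best_dist = kd_nearest(child, query, best, best_dist)
    return best, best_dist
```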
Curse of Dimensionality
• Imagine instances described by 20 attributes, but only 2 are relevant to the target function
• Curse of dimensionality: nearest neighbor is easily misled when X is high-dimensional
• Consider N data points uniformly distributed in a p-dimensional unit ball centered at the origin, and consider the nearest-neighbor estimate at the origin. The mean distance from the origin to the closest data point is:
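A small simulation of this effect; points are drawn uniformly in the unit ball (random direction, radius ∝ U^(1/p)), and the distance to the closest one is averaged:

```python
import numpy as np

def mean_nn_distance(p, N, trials=200, seed=0):
    """Average distance from the origin to the nearest of N points drawn
    uniformly from a p-dimensional unit ball."""
    rng = np.random.default_rng(seed)
    closest = []
    for _ in range(trials):
        directions = rng.normal(size=(N, p))
        directions /= np.linalg.norm(directions, axis=1, keepdims=True)
        radii = rng.uniform(size=(N, 1)) ** (1.0 / p)    # uniform-in-ball radii
        points = directions * radii
        closest.append(np.linalg.norm(points, axis=1).min())
    return float(np.mean(closest))

# The nearest of N points drifts toward the boundary as the dimension grows:
print(mean_nn_distance(p=2, N=50), mean_nn_distance(p=20, N=50))
```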