
Transcript
Page 1: Data Classification

Data Classification

Rong Jin

Page 2: Data Classification

Classification Problems

• Given input: a feature vector x

• Predict the output (class label) y

• Binary classification: y ∈ {-1, +1}

• Multi-class classification: y ∈ {1, 2, …, C}

• Learn a classification function: f: X → Y

• Regression: the output y is real-valued rather than a discrete label

Page 3: Data Classification

Examples of Classification Problems: Text Categorization

Doc: Months of campaigning and weeks of round-the-clock efforts in Iowa all came down to a final push Sunday, …

Topic: Politics or Sport?

Page 4: Data Classification

Examples of Classification Problems: Text Categorization

Input features: word frequency
{(campaigning, 1), (democrats, 2), (basketball, 0), …}

Class label: ‘Politics’: y = +1; ‘Sport’: y = -1

Doc: Months of campaigning and weeks of round-the-clock efforts in Iowa all came down to a final push Sunday, …

Topic: Politics or Sport?
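As a rough illustration of how such word-frequency features and ±1 labels might be prepared, here is a minimal Python sketch; the tiny vocabulary and the example document snippet are illustrative assumptions, not part of the slides.

from collections import Counter

# Hypothetical vocabulary; a real system would build it from the training corpus.
vocab = ["campaigning", "democrats", "basketball", "iowa", "score"]

def word_frequency_features(doc):
    """Map a document to a vector of word counts over the fixed vocabulary."""
    counts = Counter(doc.lower().split())
    return [counts[w] for w in vocab]

doc = "Months of campaigning and weeks of round-the-clock efforts in Iowa"
x = word_frequency_features(doc)   # e.g. [1, 0, 0, 1, 0]
y = +1                             # encode 'Politics' as +1, 'Sport' as -1
print(x, y)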

Page 5: Data Classification

Examples of Classification Problem

Image Classification:

Input features X: color histogram {(red, 1004), (blue, 23000), …}

Class label y: y = +1: ‘bird image’; y = -1: ‘non-bird image’

Which images contain birds, and which do not?

Page 6: Data Classification

Examples of Classification Problem

Image Classification:

Input features: color histogram {(red, 1004), (blue, 23000), …}

Class label: ‘bird image’: y = +1; ‘non-bird image’: y = -1

Which images contain birds, and which do not?

Page 7: Data Classification

Supervised Learning

• Training examples: {(x1, y1), (x2, y2), …, (xN, yN)}

• Independent and identically distributed (i.i.d.) assumption

• A critical assumption for machine learning theory

Page 8: Data Classification

Regression for Classification

• It is easy to turn binary classification into a regression problem

• Ignore the binary nature of the class label y (see the sketch below)

• How to convert multi-class classification into a regression problem?

• Pros: computational efficiency

• Cons: ignores the discrete nature of the class label
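A minimal sketch of the idea, assuming least-squares linear regression on labels encoded as ±1 (the slide does not fix a particular regression method): fit the regression as if y were continuous, then threshold the real-valued prediction at zero to recover a class.

import numpy as np

# Toy training data: 2-D inputs with labels encoded as +1 / -1.
X = np.array([[0.0, 1.0], [1.0, 1.5], [2.0, 0.5], [3.0, 2.0]])
y = np.array([-1.0, -1.0, +1.0, +1.0])

# Least-squares fit of y on [1, x] (bias term added), ignoring that y is binary.
A = np.hstack([np.ones((len(X), 1)), X])
w, *_ = np.linalg.lstsq(A, y, rcond=None)

def predict(x):
    """Regress, then recover a class label by thresholding at 0."""
    score = w @ np.concatenate(([1.0], x))
    return +1 if score >= 0 else -1

print(predict(np.array([2.5, 1.0])))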

Page 9: Data Classification

K Nearest Neighbour (kNN) Classifier

(k=1)

Page 10: Data Classification

K Nearest Neighbour (kNN) Classifier

K = 1

Page 11: Data Classification

K Nearest Neighbour (kNN) Classifier

(k = 1)  (k = 4)

How many neighbors should we count?

Page 12: Data Classification

K Nearest Neighbour (kNN) Classifier

• K acts as a smoother
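A minimal kNN classifier sketch in Python (Euclidean distance and majority vote; the data and function names are illustrative, not from the slides):

import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x, k=3):
    """Predict the label of x by majority vote among its k nearest training points."""
    dists = np.linalg.norm(X_train - x, axis=1)   # Euclidean distance to every training point
    nearest = np.argsort(dists)[:k]               # indices of the k closest points
    votes = Counter(y_train[i] for i in nearest)
    return votes.most_common(1)[0][0]

X_train = np.array([[0, 0], [1, 0], [0, 1], [5, 5], [6, 5]])
y_train = np.array([-1, -1, -1, +1, +1])
print(knn_predict(X_train, y_train, np.array([4.5, 4.5]), k=3))   # majority of the 3 neighbours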

Page 13: Data Classification

Cross Validation

• Divide the training examples into two sets

• A training set (80%) and a validation set (20%)

• Predict the class labels for the validation set using the examples in the training set

• Choose the number of neighbors k that maximizes the classification accuracy (a sketch follows below)
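A sketch of this 80/20 split for selecting k, reusing a knn_predict helper such as the one sketched earlier (the split proportion follows the slide; everything else is illustrative):

import numpy as np

def choose_k_by_validation(X, y, candidate_ks, knn_predict, seed=0):
    """Hold out 20% of the data and pick the k with the highest validation accuracy."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(X))
    n_train = int(0.8 * len(X))
    tr, va = idx[:n_train], idx[n_train:]
    best_k, best_acc = None, -1.0
    for k in candidate_ks:
        preds = [knn_predict(X[tr], y[tr], X[i], k=k) for i in va]
        acc = np.mean(np.array(preds) == y[va])
        if acc > best_acc:
            best_k, best_acc = k, acc
    return best_k, best_acc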

Page 14: Data Classification

Leave-One-Out Method

Page 15: Data Classification

Leave-One-Out Method

Page 16: Data Classification

Leave-One-Out Method

(k=1)

Page 17: Data Classification

Leave-One-Out Method

(k=1) err(1) = 1

Page 18: Data Classification

Leave-One-Out Method

err(1) = 1

Page 19: Data Classification

Leave-One-Out Method

err(1) = 3

err(2) = 2

err(3) = 6

Choose k = 2
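Error counts such as err(1), err(2), err(3) above can be computed with a leave-one-out loop like the following sketch (again assuming a knn_predict helper as in the earlier sketch):

import numpy as np

def loo_error(X, y, k, knn_predict):
    """err(k): number of training points misclassified when each is held out in turn."""
    errors = 0
    for i in range(len(X)):
        mask = np.arange(len(X)) != i                 # leave example i out
        pred = knn_predict(X[mask], y[mask], X[i], k=k)
        errors += int(pred != y[i])
    return errors

# Choose the k with the smallest leave-one-out error, e.g. over k = 1, 2, 3:
# best_k = min([1, 2, 3], key=lambda k: loo_error(X, y, k, knn_predict))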

Page 20: Data Classification

K-Nearest-Neighbours for Classification (1)

Given a data set with Nk data points from class Ck, where the Nk sum to N, we have the density estimate

p(x) ≈ K / (N V)

and correspondingly the class-conditional densities

p(x | Ck) ≈ Kk / (Nk V),

where Kk of the K nearest neighbours of x belong to class Ck. Since the class priors are p(Ck) = Nk / N, Bayes’ theorem gives

p(Ck | x) = p(x | Ck) p(Ck) / p(x) ≈ Kk / K.

Page 21: Data Classification

K-Nearest-Neighbours for Classification (2)

(K = 1)  (K = 3)

Page 22: Data Classification

Probabilistic Interpretation of kNN

• Estimate the conditional probability Pr(y|x)

• Count the data points of class y in the neighborhood of x

• Bias and variance tradeoff

• A small neighborhood → large variance → unreliable estimation

• A large neighborhood → large bias → inaccurate estimation

Page 23: Data Classification

Weighted kNN

• Weight the contribution of each close neighbor based on its distance

• Weight function

• Prediction (a sketch follows below)
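The slide leaves the weight function unspecified; a common choice (an assumption here, not taken from the slide) is a Gaussian function of the distance with bandwidth sigma, which gives a sketch like:

import numpy as np

def weighted_knn_predict(X_train, y_train, x, sigma=1.0):
    """Weighted vote: each training point contributes exp(-||x - x_i||^2 / (2 sigma^2))."""
    d2 = np.sum((X_train - x) ** 2, axis=1)
    w = np.exp(-d2 / (2.0 * sigma ** 2))   # Gaussian weight (assumed form)
    score = np.sum(w * y_train)            # labels assumed encoded as +1 / -1
    return +1 if score >= 0 else -1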

Page 24: Data Classification

Nonparametric Methods

• Parametric distribution models are restricted to specific forms, which may not always be suitable; for example, consider modelling a multimodal distribution with a single, unimodal model.

• Nonparametric approaches make few assumptions about the overall shape of the distribution being modelled.

Page 25: Data Classification

Nonparametric Methods (2)

Histogram methods partition the data space into distinct bins with widths Δi and count the number of observations, ni, in each bin.

• Often, the same width is used for all bins, Δi = Δ.

• Δ acts as a smoothing parameter.

• In a D-dimensional space, using M bins in each dimension will require M^D bins!
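A small sketch of a 1-D histogram density estimate (the density inside bin i is estimated as ni / (N Δi)), using numpy; the sample data are made up for illustration:

import numpy as np

def histogram_density(samples, num_bins=10):
    """Return bin edges and the estimated density n_i / (N * width_i) for each bin."""
    counts, edges = np.histogram(samples, bins=num_bins)
    widths = np.diff(edges)
    density = counts / (len(samples) * widths)
    return edges, density

rng = np.random.default_rng(0)
edges, density = histogram_density(rng.normal(size=1000), num_bins=20)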

Page 26: Data Classification

Nonparametric Methods

Assume observations are drawn from a density p(x) and consider a small region R containing x such that

P = ∫_R p(x) dx.

The probability that K out of N observations lie inside R is Bin(K | N, P), and if N is large,

K ≈ N P.

If the volume of R, V, is sufficiently small, p(x) is approximately constant over R and

P ≈ p(x) V.

Thus

p(x) ≈ K / (N V).

V small, yet K > 0, therefore N large?

Page 27: Data Classification

Nonparametric Methods

Kernel Density Estimation: fix V, estimate K from the data. Let R be a hypercube of side h centred on x and define the kernel function (Parzen window)

k(u) = 1 if |u_i| ≤ 1/2 for all i = 1, …, D, and 0 otherwise.

It follows that

K = Σ_n k((x - x_n) / h)

and hence

p(x) ≈ (1/N) Σ_n (1 / h^D) k((x - x_n) / h).

Page 28: Data Classification

Nonparametric Methods

To avoid discontinuities in p(x), use a smooth kernel, e.g. a Gaussian:

p(x) ≈ (1/N) Σ_n (1 / (2πh^2)^(D/2)) exp(-||x - x_n||^2 / (2h^2)).

Any kernel k(u) such that k(u) ≥ 0 and ∫ k(u) du = 1 will work. h acts as a smoother.
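A minimal Gaussian kernel density estimator following the formula above (D-dimensional data, bandwidth h); this is a sketch, not a tuned implementation:

import numpy as np

def gaussian_kde(X, x, h=0.5):
    """p(x) = (1/N) * sum_n N(x | x_n, h^2 I): the Gaussian Parzen-window estimate."""
    N, D = X.shape
    d2 = np.sum((X - x) ** 2, axis=1)
    norm = (2.0 * np.pi * h ** 2) ** (D / 2.0)
    return np.mean(np.exp(-d2 / (2.0 * h ** 2)) / norm)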

Page 29: Data Classification

Nonparametric Methods (6)

Nearest Neighbour Density Estimation: fix K, estimate V from the data. Consider a hypersphere centred on x and let it grow to a volume, V⋆, that includes K of the given N data points. Then

p(x) ≈ K / (N V⋆).

K acts as a smoother.
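A sketch of the corresponding nearest-neighbour density estimate: fix K, take V⋆ to be the volume of the smallest ball around x that contains K points, and return K / (N V⋆). The helper name and defaults are illustrative.

import numpy as np
from math import gamma, pi

def knn_density(X, x, K=5):
    """p(x) ~= K / (N * V), where V is the volume of the ball reaching the K-th neighbour."""
    N, D = X.shape
    r = np.sort(np.linalg.norm(X - x, axis=1))[K - 1]    # distance to the K-th nearest point
    V = (pi ** (D / 2) / gamma(D / 2 + 1)) * r ** D      # volume of a D-ball of radius r
    return K / (N * V)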

Page 30: Data Classification

Nonparametric Methods

• Nonparametric models (other than histograms) require storing and computing with the entire data set.

• Parametric models, once fitted, are much more efficient in terms of storage and computation.

Page 31: Data Classification

Estimate the Parameter in the Weight Function

Page 32: Data Classification

Estimate the Parameter in the Weight Function

• Leave-one-out cross validation

• Divide the training data D into two sets

• Validation set: a single held-out example

• Training set: all remaining examples

• Compute the leave-one-out prediction

Page 33: Data Classification

Estimate the Parameter in the Weight Function

Page 34: Data Classification

Estimate the Parameter in the Weight Function

• In general, for any training example (xi, yi), we have

• Validation set: {(xi, yi)}

• Training set: D \ {(xi, yi)}

• Compute the leave-one-out prediction (a sketch follows below)
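The slides leave the parameter of the weight function unnamed; assuming it is the bandwidth sigma of a Gaussian weight (as in the earlier weighted-kNN sketch), leave-one-out selection could look like this:

import numpy as np

def loo_select_sigma(X, y, candidate_sigmas, weighted_knn_predict):
    """Pick the bandwidth whose leave-one-out predictions make the fewest errors."""
    best_sigma, best_err = None, np.inf
    for sigma in candidate_sigmas:
        err = 0
        for i in range(len(X)):
            mask = np.arange(len(X)) != i      # example i acts as the validation set
            pred = weighted_knn_predict(X[mask], y[mask], X[i], sigma=sigma)
            err += int(pred != y[i])
        if err < best_err:
            best_sigma, best_err = sigma, err
    return best_sigma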

Page 35: Data Classification

Estimate the Parameter in the Weight Function

Page 36: Data Classification

Estimate the Parameter in the Weight Function

Page 37: Data Classification

Challenges in Optimization

• Convex functions

• Single-mode functions (quasi-convex)

• Multi-mode functions (difference-of-convex, DC)

Difficulty of optimization increases from convex to multi-mode functions.

Page 38: Data Classification

ML = Statistics + Optimization

• Modeling

• The model has parameter(s) to be decided

• Search for the best parameter

• Maximum likelihood estimation

• Construct a log-likelihood function

• Search for the optimal solution (a worked example follows below)
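As a concrete illustration of this recipe (a made-up example, not from the slides): estimating the mean of a Gaussian with known variance by writing down the log-likelihood and maximizing it numerically.

import numpy as np

def log_likelihood(mu, data, sigma=1.0):
    """Log-likelihood of i.i.d. Gaussian data with unknown mean mu and known sigma."""
    return np.sum(-0.5 * np.log(2 * np.pi * sigma ** 2)
                  - (data - mu) ** 2 / (2 * sigma ** 2))

rng = np.random.default_rng(0)
data = rng.normal(loc=2.0, scale=1.0, size=200)

# Crude grid search for the maximizer; the closed-form answer is simply the sample mean.
grid = np.linspace(-5, 5, 1001)
mu_hat = grid[np.argmax([log_likelihood(m, data) for m in grid])]
print(mu_hat, data.mean())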

Page 39: Data Classification

When to Consider Nearest Neighbor?

• Lots of training data

• Less than 20 attributes per example

• Advantages:

• Training is very fast

• Learn complex target functions

• Don’t lose information

• Disadvantages:

• Slow at query time

• Easily fooled by irrelevant attributes

Page 40: Data Classification

KD Tree for NN Search

Each node contains:

• Children information

• The tightest box that bounds all the data points within the node
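One way to experiment with kd-tree nearest-neighbour search is the scipy.spatial.KDTree class; this sketch assumes SciPy is available and only shows basic build-and-query usage.

import numpy as np
from scipy.spatial import KDTree

rng = np.random.default_rng(0)
points = rng.random((1000, 2))           # 1000 random 2-D points

tree = KDTree(points)                    # build the tree (splits the space into boxes internally)
dist, idx = tree.query([0.5, 0.5], k=3)  # 3 nearest neighbours of the query point
print(idx, dist)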

Page 41: Data Classification

NN Search by KD Tree

Pages 42-47: Data Classification

NN Search by KD Tree (these slides step through the same search example in figures)
Page 48: Data Classification

Curse of Dimensionality

• Imagine instances described by 20 attributes, but only 2 are relevant to the target function

• Curse of dimensionality: the nearest neighbor is easily misled when X is high-dimensional

• Consider N data points uniformly distributed in a p-dimensional unit ball centered at the origin, and consider the nearest-neighbor estimate at the origin. The median distance from the origin to the closest data point is d(p, N) = (1 - (1/2)^(1/N))^(1/p), which approaches 1 as p grows (an empirical sketch follows below).
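A quick empirical check of this effect (a sketch; it simulates uniform points in the unit ball and measures the typical distance from the origin to the nearest one as the dimension p grows):

import numpy as np

def nearest_distance_to_origin(N, p, trials=200, seed=0):
    """Average distance from the origin to the closest of N uniform points in the unit p-ball."""
    rng = np.random.default_rng(seed)
    dists = []
    for _ in range(trials):
        # Sample uniformly in the unit ball: random directions, radii distributed as U^(1/p).
        g = rng.normal(size=(N, p))
        directions = g / np.linalg.norm(g, axis=1, keepdims=True)
        radii = rng.random(N) ** (1.0 / p)
        points = directions * radii[:, None]
        dists.append(np.linalg.norm(points, axis=1).min())
    return float(np.mean(dists))

for p in (2, 10, 20):
    print(p, nearest_distance_to_origin(N=500, p=p))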

Page 49: Data Classification

Curse of Dimensionality

(Slide content repeated from Page 48.)

