
Machine Learning :: Introduction

Part II

Konstantin Tretyakov ([email protected])

MTAT.03.183 – Data Mining

November 5, 2009

In the previous episode


Approaches to data analysis

The general principle is the same, though:

1. Define a set of patterns of interest

2. Define a measure of goodness for the patterns

3. Find the best pattern in the data

In the previous episode


Supervised Learning: Manual

In the previous episode


Supervised Learning: Automatic

Regression, Classification

In the previous episode


Supervised Learning: Regression, Classification

Validation: Training set, Test set, Holdout, Cross-validation, Accuracy, Precision, Recall, TP, FP, TN, FN

In the previous episode


Supervised Learning: Regression, Classification

Decision trees: ID3, C4.5

Today


Supervised Learning: Regression, Classification

Naïve Bayes, Linear regression, KNN

Supervised learning

Four examples of approaches:

Ad-hoc: Decision tree induction
Probabilistic modeling: Naïve Bayes classifier
Objective function optimization: Linear least squares regression
Instance-based methods: K-nearest neighbors


Naïve Bayes Classifier


The Tennis Dataset



Shall we play tennis today?


Probabilistic model:

P(Yes) = 9/14 = 0.64
P(No) = 5/14 = 0.36

⇒ predict Yes


It’s windy today. Tennis, anyone?

Probabilistic model:


P(Weak) = 8/14

P(Strong) = 6/14

P(Yes | Weak) = 6/8

P(No | Weak) = 2/8

P(Yes | Strong) = 3/6

P(No | Strong) = 3/6
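
Reading off the model: with Weak wind, P(Yes | Weak) = 6/8 > P(No | Weak) = 2/8, so the prediction is Yes; with Strong wind, P(Yes | Strong) = P(No | Strong) = 3/6, so the conditional model alone cannot decide and the tie has to be broken some other way (e.g. by falling back on the prior P(Yes) = 9/14).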

More attributes

Probabilistic model:


P(High,Weak) = 4/14

P(Yes | High,Weak) = 2/4

P(No | High,Weak) = 2/4

P(High,Strong) = 3/14

P(Yes | High,Strong) = 1/3

P(No | High,Strong) = 2/3
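
For example, if today has Humidity = High and Wind = Strong, then P(No | High,Strong) = 2/3 > P(Yes | High,Strong) = 1/3, so the model predicts No.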

The Bayes Classifier

In general:

1. Estimate from data:

P(Class | X1,X2,X3,…)

2. For a given instance (X1, X2, X3, …), predict the class whose conditional probability is greater:

P(C1 | X1,X2,X3,…) > P(C2 | X1,X2,X3,…) ⇒ predict C1



The Bayes Classifier

In general:

1. P(C1 | X) > P(C2 | X) ⇒ predict C1

2. P(X | C1)P(C1)/P(X) > P(X | C2)P(C2)/P(X) ⇒ predict C1

3. P(X | C1)P(C1) > P(X | C2)P(C2) ⇒ predict C1

4. P(X | C1)/P(X | C2) > P(C2)/P(C1) ⇒ predict C1
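
(Step 2 follows from step 1 by Bayes' rule, P(C | X) = P(X | C)P(C)/P(X); multiplying both sides by P(X) > 0 gives step 3, and dividing both sides of step 3 by P(X | C2)P(C1) gives the likelihood-ratio form in step 4.)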



The Bayes Classifier

If the true underlying distribution is known, the Bayes classifier is optimal (i.e. it achieves the minimal possible error probability).


Problem

We need an exponential amount of data to estimate P(Class | X1, X2, …) directly:


P(High,Weak) = 4/14

P(Yes | High,Weak) = 2/4

P(No | High,Weak) = 2/4

P(High,Strong) = 3/14

P(Yes | High,Strong) = 1/3

P(No | High,Strong) = 2/3
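
To see the problem: with n attributes taking k values each, there are on the order of k^n attribute combinations, and P(Class | combination) has to be estimated separately for each of them. Already for the four tennis attributes (assuming Outlook, Temperature, Humidity and Wind with 3, 3, 2 and 2 values, i.e. 36 combinations), most combinations never occur among the 14 training days.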

The Naïve Bayes Classifier

To scale beyond 2-3 attributes, use a trick:

Assume that, within each class, the attributes are independent (i.e. conditionally independent given the class):

P(X1, X2, X3 | Class) = P(X1 | Class) · P(X2 | Class) · P(X3 | Class) · …
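
For example, assuming the standard 14-day tennis data behind these slides, P(High, Strong | Yes) is approximated by P(High | Yes) · P(Strong | Yes) = (3/9) · (3/9). Only per-attribute counts are needed, so the amount of data required grows roughly linearly, not exponentially, with the number of attributes.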


The Naïve Bayes Classifier

P(X | C1)/P(X | C2) > P(C2)/P(C1) ⇒ predict C1

becomes, under the independence assumption,

[P(X1 | C1) · P(X2 | C1) · …] / [P(X1 | C2) · P(X2 | C2) · …] > P(C2)/P(C1) ⇒ predict C1


Naïve Bayes Classifier

1. Training:

For each value v of each attribute i, compute the estimates P(Xi = v | C1) and P(Xi = v | C2) (and the class priors P(C1), P(C2)) from the training counts.

2. Classification:

For a given instance (x1, x2, …) compute the likelihood ratio

R = [P(x1 | C1) · P(x2 | C1) · …] / [P(x1 | C2) · P(x2 | C2) · …]

If R > P(C2)/P(C1), classify as C1, else C2.
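
A minimal sketch of this procedure in Python (illustrative only, not code from the course): it tabulates the per-class, per-attribute counts described above and picks the class maximizing P(C) · P(x1 | C) · P(x2 | C) · …, which is equivalent to the ratio test on the previous slide. The toy four-row dataset and the function names are made up for the example.

    from collections import Counter, defaultdict

    def train_nb(X, y):
        """Estimate class priors P(C) and per-attribute conditionals P(Xi = v | C)."""
        n = len(y)
        priors = Counter(y)                        # class -> count
        cond = defaultdict(Counter)                # (class, attribute index) -> value counts
        for xs, c in zip(X, y):
            for i, v in enumerate(xs):
                cond[(c, i)][v] += 1
        p_prior = lambda c: priors[c] / n
        p_cond = lambda v, i, c: cond[(c, i)][v] / priors[c]   # raw counts, no smoothing
        return p_prior, p_cond

    def classify_nb(x, classes, p_prior, p_cond):
        """Return the class maximizing P(C) * prod_i P(x_i | C)."""
        def score(c):
            s = p_prior(c)
            for i, v in enumerate(x):
                s *= p_cond(v, i, c)
            return s
        return max(classes, key=score)

    # Made-up toy data with two attributes (Humidity, Wind) and labels Play = Yes/No.
    X = [("High", "Weak"), ("High", "Strong"), ("Normal", "Weak"), ("Normal", "Strong")]
    y = ["No", "No", "Yes", "Yes"]
    p_prior, p_cond = train_nb(X, y)
    print(classify_nb(("High", "Strong"), ["Yes", "No"], p_prior, p_cond))   # -> No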



Naïve Bayes Classifier

Extendable to continuous attributes via kernel density estimation.

The goods:
Easy to implement, efficient
Won’t overfit, interpretable
Works better than you would expect (e.g. spam filtering)

The bads:
“Naïve”, linear
Usually won’t work well when there are many classes


Supervised learning

Four examples of approaches:

Ad-hoc: Decision tree induction
Probabilistic modeling: Naïve Bayes classifier
Objective function optimization: Linear least squares regression
Instance-based methods: K-nearest neighbors


Linear regression

Consider data points of the type (x1, y1), (x2, y2), …, (xn, yn).

Let us search for a function of the form f(x) = wx

that would have the least possible sum of squared errors:

E(w) = Σi (f(xi) - yi)²


Linear regression

Set of functions f(x) = wx for various w.


Linear regression

The function having the least error


Linear regression

We need to solve: minimize E(w) over w,

where E(w) = Σi (wxi - yi)²

Setting the derivative to 0:

dE/dw = 2 Σi (wxi - yi)xi = 0

we get

w = (Σi xiyi) / (Σi xi²)
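
A quick numerical check of this closed-form solution in Python (a sketch with made-up data; NumPy is assumed, and the comparison with the general least-squares solver is only for illustration):

    import numpy as np

    # Made-up 1-D data lying roughly on y = 2x.
    x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
    y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

    # Closed-form least-squares solution for f(x) = w*x, as derived above.
    w = np.sum(x * y) / np.sum(x * x)

    # Cross-check with NumPy's general least-squares solver (one-column design matrix).
    w_check, *_ = np.linalg.lstsq(x.reshape(-1, 1), y, rcond=None)

    print(w, w_check[0])   # both are close to 2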


Linear regression

The idea naturally generalizes to more complex cases (e.g. multivariate regression).

The goods:
Simple, easy
Interpretable, popular, extendable to many variants
Won’t overfit

The bads:
Linear, i.e. works well only for a limited class of datasets
Nearly never perfect


Supervised learning

Four examples of approaches:

Ad-hoc: Decision tree induction
Probabilistic modeling: Naïve Bayes classifier
Objective function optimization: Linear least squares regression
Instance-based methods: K-nearest neighbors



K-Nearest Neighbors

One-nearest neighbor



K-Nearest Neighbors

Training:
Store and index all training instances.

Classification:
Given an instance to classify, find the k training instances nearest to it.
Predict the majority class among these k instances.
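
A minimal sketch of this procedure in Python (brute-force nearest-neighbour search with Euclidean distance; the toy 2-D data and the function name are made up for the example):

    import math
    from collections import Counter

    def knn_classify(query, X_train, y_train, k=3):
        """Predict the majority class among the k training points nearest to `query`."""
        # Brute force: compute the distance from the query to every stored training instance.
        dists = sorted((math.dist(query, x), label) for x, label in zip(X_train, y_train))
        top_k = [label for _, label in dists[:k]]
        return Counter(top_k).most_common(1)[0][0]

    # Two made-up clusters labelled "A" and "B".
    X_train = [(0.0, 0.0), (0.5, 0.5), (1.0, 0.0), (5.0, 5.0), (5.5, 4.5), (6.0, 5.0)]
    y_train = ["A", "A", "A", "B", "B", "B"]
    print(knn_classify((0.8, 0.3), X_train, y_train, k=3))   # -> A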


K-Nearest Neighbors

The goods:
Trivial concept
Easy to implement
Asymptotically optimal
Allows various kinds of data

The bads:
Difficult to implement efficiently
Not interpretable (no model)
On smaller datasets it loses to other methods



Supervised learning

Objective optimization: Regression models, Kernel methods, SVM, RBF, Neural networks
Instance-based: K-NN, LOWESS, Kernel densities, SVM, RBF
Probabilistic models: Naïve Bayes, Graphical models, Regression models, Density estimation
Ad-hoc: Decision trees, forests, Rule induction, ILP, Fuzzy reasoning
Ensemble-learners: Arcing, Boosting, Bagging, Dagging, Voting, Stacking

Coming up next

“Machine learning”

Terminology, foundations, general framework.

Supervised machine learning

Basic ideas, algorithms & toy examples.

Statistical challenges

Learning theory, consistency, bias-variance, overfitting, ...

State of the art techniques

SVM, kernel methods, graphical models, latent variable models, boosting, bagging, LASSO, on-line learning, deep learning, reinforcement learning, …


Questions?
