DECISION TREES


Decision trees

One possible representation for hypotheses


Choosing an attribute

Idea: a good attribute splits the examples into subsets that are (ideally) "all positive" or "all negative"

Which is the better choice? Patrons, since it splits the examples into purer subsets.


Using information theory

Implement Choose-Attribute in the DTL algorithm based on information content – measured by Entropy

Entropy is a measure of the uncertainty of a random variable.
More uncertainty leads to higher entropy.
More knowledge leads to lower entropy.


Entropy

For a training set containing p positive examples and n negative examples:

$I\!\left(\frac{p}{p+n}, \frac{n}{p+n}\right) = -\frac{p}{p+n}\log_2\frac{p}{p+n} - \frac{n}{p+n}\log_2\frac{n}{p+n}$


Entropy Examples

Fair coin flip: $I(\tfrac{1}{2}, \tfrac{1}{2}) = 1$ bit.

Biased coin flip: entropy is less than 1 bit, approaching 0 as the coin becomes more biased.
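A minimal Python sketch of the entropy formula from the previous slide, evaluated for a fair and a heavily biased coin (the helper name entropy_pn and the 90/10 bias are illustrative choices, not from the slides):

```python
import math

def entropy_pn(p, n):
    """Entropy I(p/(p+n), n/(p+n)) of a set with p positive and n negative examples."""
    total = p + n
    h = 0.0
    for count in (p, n):
        if count > 0:          # 0 * log(0) is taken to be 0
            q = count / total
            h -= q * math.log2(q)
    return h

print(entropy_pn(6, 6))   # balanced set (fair coin): 1.0 bit
print(entropy_pn(9, 1))   # heavily skewed set (biased coin): ~0.469 bits
```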


Information Gain

Measures the reduction in entropy achieved by the split.

Choose the split that achieves the most reduction (maximizes information gain).

Disadvantage: tends to prefer splits that result in a large number of partitions, each being small but pure.

$\mathrm{GAIN}_{\text{split}} = \mathrm{Entropy}(p) - \sum_{i=1}^{k} \frac{n_i}{n}\,\mathrm{Entropy}(i)$

The parent node $p$ is split into $k$ partitions; $n_i$ is the number of records in partition $i$.
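A sketch of this formula in Python, assuming the parent and each partition are given as lists of class labels (the function names are illustrative):

```python
import math
from collections import Counter

def entropy(labels):
    """Entropy of a list of class labels."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total)
                for c in Counter(labels).values())

def information_gain(parent_labels, partitions):
    """GAIN_split = Entropy(parent) - sum_i (n_i / n) * Entropy(partition_i)."""
    n = len(parent_labels)
    weighted = sum(len(part) / n * entropy(part) for part in partitions)
    return entropy(parent_labels) - weighted
```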


Information Gain Example

Consider the attributes Patrons and Type:

Patrons has the highest Information Gain of all attributes and so is chosen by the DTL algorithm as the root

$\mathrm{Gain}(Patrons) = 1 - \left[\tfrac{2}{12} I(0,1) + \tfrac{4}{12} I(1,0) + \tfrac{6}{12} I\!\left(\tfrac{2}{6},\tfrac{4}{6}\right)\right] \approx 0.541$ bits

$\mathrm{Gain}(Type) = 1 - \left[\tfrac{2}{12} I\!\left(\tfrac{1}{2},\tfrac{1}{2}\right) + \tfrac{2}{12} I\!\left(\tfrac{1}{2},\tfrac{1}{2}\right) + \tfrac{4}{12} I\!\left(\tfrac{2}{4},\tfrac{2}{4}\right) + \tfrac{4}{12} I\!\left(\tfrac{2}{4},\tfrac{2}{4}\right)\right] = 0$ bits
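Plugging the partition counts from the 12-example restaurant data into the gain formula reproduces these numbers; a quick self-contained check (binary_entropy is an illustrative helper):

```python
import math

def binary_entropy(p_frac):
    """Entropy of a two-class set in which a fraction p_frac is positive."""
    if p_frac in (0.0, 1.0):
        return 0.0
    return -p_frac * math.log2(p_frac) - (1 - p_frac) * math.log2(1 - p_frac)

# Patrons splits the 12 examples into None (0+/2-), Some (4+/0-), Full (2+/4-)
gain_patrons = 1 - (2/12 * binary_entropy(0/2)
                    + 4/12 * binary_entropy(4/4)
                    + 6/12 * binary_entropy(2/6))
# Type splits them into French (1+/1-), Italian (1+/1-), Thai (2+/2-), Burger (2+/2-)
gain_type = 1 - (2/12 * binary_entropy(1/2) + 2/12 * binary_entropy(1/2)
                 + 4/12 * binary_entropy(2/4) + 4/12 * binary_entropy(2/4))
print(round(gain_patrons, 3), round(gain_type, 3))   # ~0.541 and 0.0 bits
```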


Learned Restaurant Tree

Decision tree learned from the 12 examples:

Substantially simpler than the full tree: Raining and Reservation were not necessary to classify all the data.


Stopping Criteria

Stop expanding a node when all the records belong to the same class

Stop expanding a node when all the records have similar attribute values


Overfitting

Overfitting results in decision trees that are more complex than necessary

Training error does not provide a good estimate of how well the tree will perform on previously unseen records (need a test set)


How to Address Overfitting 1… Pruning

Grow the decision tree to its entirety.

Trim the nodes of the decision tree in a bottom-up fashion.

If generalization error is reduced after trimming, replace the sub-tree by a leaf node (χ² test, see page 706).

The class label of the leaf node is determined from the majority class of instances in the sub-tree.
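A sketch of this bottom-up post-pruning, assuming a hypothetical Node class that stores its children, a leaf prediction, and the training examples that reached it; the error estimate here is simply error on held-out validation examples that reach the node, standing in for whatever generalization estimate (e.g., the χ² test) is used:

```python
from collections import Counter

class Node:
    """Hypothetical decision-tree node: leaves carry a prediction, internal nodes carry children."""
    def __init__(self, attribute=None, children=None, prediction=None, examples=()):
        self.attribute = attribute       # attribute tested at this node (None for leaves)
        self.children = children or {}   # attribute value -> child Node
        self.prediction = prediction     # class label if this is a leaf
        self.examples = list(examples)   # training (x, label) pairs that reached this node

def majority_label(examples):
    return Counter(label for _, label in examples).most_common(1)[0][0]

def classify(node, x):
    if node.prediction is not None:
        return node.prediction
    child = node.children.get(x[node.attribute])
    return classify(child, x) if child else majority_label(node.examples)

def error(node, validation):
    return sum(classify(node, x) != y for x, y in validation)

def prune(node, validation):
    """Prune children first, then try replacing this subtree by a majority-class leaf."""
    if node.prediction is not None:
        return node
    for value, child in node.children.items():
        node.children[value] = prune(child, validation)
    leaf = Node(prediction=majority_label(node.examples), examples=node.examples)
    if error(leaf, validation) <= error(node, validation):   # no worse after trimming: keep the leaf
        return leaf
    return node
```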


How to Address Overfitting 2…

Early Stopping Rule: stop the algorithm before it becomes a fully-grown tree.

Stopping conditions:

Stop if the number of instances is less than some user-specified threshold.

Stop if the class distribution of instances is independent of the available features (e.g., using a χ² test).

Stop if expanding the current node does not improve impurity measures (e.g., information gain).
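One way to check the "class distribution is independent of the feature" condition is a χ² independence test on the contingency table of feature value versus class. A sketch using scipy; the threshold of 5 instances and the 0.05 significance level are illustrative choices:

```python
from collections import Counter
from scipy.stats import chi2_contingency

def should_stop(feature_values, labels, min_instances=5, alpha=0.05):
    """Early-stopping check: too few instances, or class distribution independent of the feature."""
    if len(labels) < min_instances:
        return True
    classes = sorted(set(labels))
    values = sorted(set(feature_values))
    if len(classes) < 2 or len(values) < 2:
        return True                       # nothing to gain from splitting on this feature
    counts = Counter(zip(feature_values, labels))
    table = [[counts[(v, c)] for c in classes] for v in values]
    chi2, p_value, dof, expected = chi2_contingency(table)
    return p_value > alpha                # cannot reject independence: the split is not justified
```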


How to Address Overfitting…

Is the early stopping rule strictly better than pruning (i.e., generating the full tree and then cutting it)?


Remaining Challenges…

Continuous values: need to be split into discrete categories. Sort all values, then consider split points between two examples in sorted order that have different classifications.

Missing values: affect how an example is classified, information gain calculations, and the test set error rate. Pretend that the example has all possible values for the missing attribute, weighted by their frequency among all the examples in the current node.
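A sketch of the candidate-split-point rule for a continuous attribute: sort by value and take midpoints between consecutive examples whose classes differ (the function name and sample data are illustrative):

```python
def candidate_splits(values, labels):
    """Midpoints between consecutive (sorted) examples with different class labels."""
    pairs = sorted(zip(values, labels))
    splits = []
    for (v1, y1), (v2, y2) in zip(pairs, pairs[1:]):
        if y1 != y2 and v1 != v2:
            splits.append((v1 + v2) / 2)
    return splits

print(candidate_splits([48, 60, 80, 10, 90], ['No', 'Yes', 'Yes', 'No', 'No']))
# [54.0, 85.0] -- the midpoints where the class changes in sorted order
```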


Summary

Advantages of decision trees:
Inexpensive to construct.
Extremely fast at classifying unknown records.
Easy to interpret for small-sized trees.
Accuracy is comparable to other classification techniques for many simple data sets.

Learning performance = prediction accuracy measured on test set


K-NEAREST NEIGHBORS


K-Nearest Neighbors

What value do we assign to the green sample?


K-Nearest Neighbors

1-NN: For a given query point q, assign the class of the nearest neighbour.

k-NN: Compute the k nearest neighbours and assign the class by majority vote.

(Figure: the same query classified with k = 1 and k = 3.)
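A minimal k-NN classifier following this rule; the data format (a list of (point, label) pairs) and the plain Euclidean distance are assumptions for the sketch:

```python
import math
from collections import Counter

def euclidean(a, b):
    return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))

def knn_classify(query, training, k=3):
    """Assign the majority class among the k nearest training points."""
    neighbours = sorted(training, key=lambda pair: euclidean(pair[0], query))[:k]
    votes = Counter(label for _, label in neighbours)
    return votes.most_common(1)[0][0]

data = [((1.0, 1.0), '+'), ((1.2, 0.8), '+'), ((4.0, 4.2), 'o'), ((4.1, 3.9), 'o')]
print(knn_classify((1.1, 1.0), data, k=1))   # '+'
print(knn_classify((3.0, 3.0), data, k=3))   # 'o'
```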


Decision Regions for 1-NN


Effect of k

(Figure: decision regions for k = 1 and k = 5.)


K-Nearest Neighbors

Euclidean distance: $D(\mathbf{x}, \mathbf{q}) = \sqrt{\sum_{j=1}^{d} (x_j - q_j)^2}$

Weighted Euclidean distance: $D_w(\mathbf{x}, \mathbf{q}) = \sqrt{\sum_{j=1}^{d} w_j\,(x_j - q_j)^2}$

where d is the dimensionality of the data.
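A sketch of the weighted distance; setting a weight to 0 effectively removes that (irrelevant) feature from the comparison, which is the idea illustrated by the next figures:

```python
import math

def weighted_euclidean(a, b, weights):
    """sqrt(sum_j w_j * (a_j - b_j)^2); a weight of 0 ignores an irrelevant dimension."""
    return math.sqrt(sum(w * (ai - bi) ** 2 for w, ai, bi in zip(weights, a, b)))

# Two points that differ mostly along an irrelevant second feature:
print(weighted_euclidean((1.0, 9.0), (1.2, 2.0), weights=(1.0, 1.0)))  # ~7.0, dominated by feature 2
print(weighted_euclidean((1.0, 9.0), (1.2, 2.0), weights=(1.0, 0.0)))  # 0.2, feature 2 ignored
```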


Weighting the Distance to Remove Irrelevant Features

(Figure: + and o training points scattered in two dimensions, one of which is irrelevant to the class, with a query point marked ?.)


Weighting the Distance to Remove Irrelevant Features

(Figure: the same data with the irrelevant dimension compressed by down-weighting it in the distance.)


Weighting the Distance to Remove Irrelevant Features

(Figure: after fully down-weighting the irrelevant dimension, the points collapse onto a single axis and the query ? lies among the o points.)


Nearest Neighbors Search

Let P be a set of n training points. Given a query point q, find the nearest neighbour of q in P.

Naïve approach: compute the distance from the query point to every other point in the database, keeping track of the "best so far". Running time is O(n).

Data structure approach: construct a data structure which makes this search more efficient.
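The naïve approach as code: one linear scan over P, tracking the best distance so far (O(n) per query):

```python
import math

def nearest_neighbour(query, points):
    """Linear scan: return the point in `points` closest to `query`."""
    best, best_dist = None, math.inf
    for p in points:
        d = math.dist(p, query)          # Euclidean distance
        if d < best_dist:
            best, best_dist = p, d
    return best
```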


Quadtree

A Quadtree is a tree data structure in which each internal node has up to four children.

Every node in the Quadtree corresponds to a square.

If a node v has children, then their corresponding squares are the four quadrants of the square of v.

The leaves of a Quadtree form a Quadtree Subdivision of the square of the root.

The children of a node are labelled NE, NW, SW, and SE to indicate to which quadrant they correspond.


Quadtree Construction

Input: point set P

while Some cell C contains more than 1 point do

Split cell C

end

(Figure: an example point set a–l and the quadtree produced by repeatedly splitting every cell that contains more than one point.)
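A recursive rendering of the construction loop above, assuming 2-D points and an axis-aligned square cell given by its lower-left corner and side length; the class and field names are illustrative, and distinct points strictly inside the root cell are assumed:

```python
class QuadtreeNode:
    def __init__(self, x, y, size, points):
        self.x, self.y, self.size = x, y, size   # lower-left corner and side length of the cell
        self.points = points                     # points lying in this cell
        self.children = []                       # NE, NW, SW, SE quadrants (empty for leaves)
        if len(points) > 1:                      # "while some cell contains more than 1 point: split"
            half = size / 2
            for dx, dy in ((half, half), (0, half), (0, 0), (half, 0)):   # NE, NW, SW, SE
                quadrant = [(px, py) for px, py in points
                            if x + dx <= px < x + dx + half and y + dy <= py < y + dy + half]
                self.children.append(QuadtreeNode(x + dx, y + dy, half, quadrant))

# Usage: the root cell must enclose all points, e.g. QuadtreeNode(0, 0, 1000, pts)
```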


Nearest Neighbor Search


Quadtree Query

(Figure: a query descends the tree by comparing against the split point (X1, Y1); the four children correspond to P≥X1, P≥Y1 / P<X1, P≥Y1 / P≥X1, P<Y1 / P<X1, P<Y1.)


Quadtree Query

In many cases this works. (Figure: the quadrant containing the query also contains its nearest neighbour.)


Quadtree – Pitfall 1

In some cases it doesn't: there could be points in adjacent buckets that are closer to the query than any point in its own bucket. (Figure: the true nearest neighbour lies just across a cell boundary.)


Quadtree – Pitfall 2

Could result in query time exponential in the number of dimensions.


Quadtree

Simple data structure. Versatile, easy to implement. Often space- and time-inefficient.


kd-trees (k-dimensional trees)

Main ideas:
One-dimensional splits.
Instead of splitting in the middle, choose the split "carefully" (many variations).
Nearest neighbour queries work the same way as for quadtrees.


2-dimensional kd-trees

Algorithm:
Choose the x or y coordinate (alternate between them).
Choose the median of that coordinate; this defines a horizontal or vertical splitting line.
Recurse on both sides until there is only one point left, which is stored as a leaf.

We get a binary tree: size O(n), construction time O(n log n), depth O(log n).
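A sketch of the 2-D construction: alternate x/y, split at the median, recurse until a single point remains in a leaf (class and field names are illustrative):

```python
class KDNode:
    def __init__(self, point=None, axis=None, split=None, left=None, right=None):
        self.point = point          # set only at leaves
        self.axis = axis            # 0 = split on x, 1 = split on y
        self.split = split          # median coordinate defining the splitting line
        self.left, self.right = left, right

def build_kdtree(points, depth=0):
    if len(points) == 1:
        return KDNode(point=points[0])
    axis = depth % 2                                  # alternate between x and y
    points = sorted(points, key=lambda p: p[axis])    # re-sorting each level costs O(n log^2 n);
    mid = len(points) // 2                            # presorting once achieves the O(n log n) bound
    return KDNode(axis=axis, split=points[mid][axis],
                  left=build_kdtree(points[:mid], depth + 1),
                  right=build_kdtree(points[mid:], depth + 1))
```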


Nearest Neighbor with KD Trees

We traverse the tree looking for the nearest neighbor of the query point.

Examine nearby points first: explore the branch of the tree that is closest to the query point first.

When we reach a leaf node, compute the distance to each point in the node.

Then we can backtrack and try the other branch at each node visited.

Each time a new closest node is found, we can update the distance bounds.

Using the distance bounds and the bounds of the data below each node, we can prune parts of the tree that could NOT include the nearest neighbor.
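A sketch of this search over the KDNode tree built in the earlier sketch: descend into the branch on the query's side first, then backtrack and visit the other branch only if the splitting line is closer than the best distance found so far (that distance-to-line check is the pruning step):

```python
import math

def kd_nearest(node, query, best=None, best_dist=math.inf):
    """Return (best_point, best_distance) for `query` in the kd-tree rooted at `node`."""
    if node.point is not None:                       # leaf: measure the stored point
        d = math.dist(node.point, query)
        return (node.point, d) if d < best_dist else (best, best_dist)
    # Explore the branch on the query's side of the splitting line first.
    near, far = ((node.left, node.right) if query[node.axis] < node.split
                 else (node.right, node.left))
    best, best_dist = kd_nearest(near, query, best, best_dist)
    # Backtrack: the far branch can only help if the splitting line is within best_dist.
    if abs(query[node.axis] - node.split) < best_dist:
        best, best_dist = kd_nearest(far, query, best, best_dist)
    return best, best_dist

# Usage: point, dist = kd_nearest(build_kdtree(pts), (qx, qy))
```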


Summary of K-Nearest Neighbor

Stores all training data in memory – large space requirement

Can improve query time by representing the data within a k-d tree

K-d trees are only efficient when there are many more examples than dimensions, preferably at least 2^d examples for d dimensions.
