DECISION TREES


Decision trees

One possible representation for hypotheses


Choosing an attribute

Idea: a good attribute splits the examples into subsets that are (ideally) "all positive" or "all negative"

Which is the better choice? Patrons, since it splits the examples into purer subsets.


Using information theory

Implement Choose-Attribute in the DTL algorithm based on information content – measured by Entropy

Entropy is a measure of the uncertainty of a random variable.
More uncertainty leads to higher entropy.
More knowledge leads to lower entropy.


Entropy

For a training set containing p positive examples and n negative examples:

$I\!\left(\frac{p}{p+n}, \frac{n}{p+n}\right) = -\frac{p}{p+n}\log_2\frac{p}{p+n} - \frac{n}{p+n}\log_2\frac{n}{p+n}$


Entropy Examples

Fair coin flip: $I(\tfrac{1}{2}, \tfrac{1}{2}) = 1$ bit.

Biased coin flip: entropy is less than 1 bit, approaching 0 as the coin becomes more biased.
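A minimal Python sketch of the entropy formula from the previous slide, evaluated for a fair and a heavily biased coin (the helper name entropy_pn and the 90/10 bias are illustrative choices, not from the slides):

```python
import math

def entropy_pn(p, n):
    """Entropy I(p/(p+n), n/(p+n)) of a set with p positive and n negative examples."""
    total = p + n
    h = 0.0
    for count in (p, n):
        if count > 0:          # 0 * log(0) is taken to be 0
            q = count / total
            h -= q * math.log2(q)
    return h

print(entropy_pn(6, 6))   # balanced set (fair coin): 1.0 bit
print(entropy_pn(9, 1))   # heavily skewed set (biased coin): ~0.469 bits
```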


Information Gain

Measures the reduction in entropy achieved by the split.

Choose the split that achieves the most reduction (maximizes information gain).

Disadvantage: tends to prefer splits that result in a large number of partitions, each being small but pure.

$\mathrm{GAIN}_{\text{split}} = \mathrm{Entropy}(p) - \sum_{i=1}^{k} \frac{n_i}{n}\,\mathrm{Entropy}(i)$

The parent node $p$ is split into $k$ partitions; $n_i$ is the number of records in partition $i$.
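A sketch of this formula in Python, assuming the parent and each partition are given as lists of class labels (the function names are illustrative):

```python
import math
from collections import Counter

def entropy(labels):
    """Entropy of a list of class labels."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total)
                for c in Counter(labels).values())

def information_gain(parent_labels, partitions):
    """GAIN_split = Entropy(parent) - sum_i (n_i / n) * Entropy(partition_i)."""
    n = len(parent_labels)
    weighted = sum(len(part) / n * entropy(part) for part in partitions)
    return entropy(parent_labels) - weighted
```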


Information Gain Example

Consider the attributes Patrons and Type:

Patrons has the highest Information Gain of all attributes and so is chosen by the DTL algorithm as the root

$\mathrm{Gain}(Patrons) = 1 - \left[\tfrac{2}{12} I(0,1) + \tfrac{4}{12} I(1,0) + \tfrac{6}{12} I\!\left(\tfrac{2}{6},\tfrac{4}{6}\right)\right] \approx 0.541$ bits

$\mathrm{Gain}(Type) = 1 - \left[\tfrac{2}{12} I\!\left(\tfrac{1}{2},\tfrac{1}{2}\right) + \tfrac{2}{12} I\!\left(\tfrac{1}{2},\tfrac{1}{2}\right) + \tfrac{4}{12} I\!\left(\tfrac{2}{4},\tfrac{2}{4}\right) + \tfrac{4}{12} I\!\left(\tfrac{2}{4},\tfrac{2}{4}\right)\right] = 0$ bits
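Plugging the partition counts from the 12-example restaurant data into the gain formula reproduces these numbers; a quick self-contained check (binary_entropy is an illustrative helper):

```python
import math

def binary_entropy(p_frac):
    """Entropy of a two-class set in which a fraction p_frac is positive."""
    if p_frac in (0.0, 1.0):
        return 0.0
    return -p_frac * math.log2(p_frac) - (1 - p_frac) * math.log2(1 - p_frac)

# Patrons splits the 12 examples into None (0+/2-), Some (4+/0-), Full (2+/4-)
gain_patrons = 1 - (2/12 * binary_entropy(0/2)
                    + 4/12 * binary_entropy(4/4)
                    + 6/12 * binary_entropy(2/6))
# Type splits them into French (1+/1-), Italian (1+/1-), Thai (2+/2-), Burger (2+/2-)
gain_type = 1 - (2/12 * binary_entropy(1/2) + 2/12 * binary_entropy(1/2)
                 + 4/12 * binary_entropy(2/4) + 4/12 * binary_entropy(2/4))
print(round(gain_patrons, 3), round(gain_type, 3))   # ~0.541 and 0.0 bits
```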


Learned Restaurant Tree

Decision tree learned from the 12 examples:

Substantially simpler than the full tree: Raining and Reservation were not necessary to classify all the data.


Stopping Criteria

Stop expanding a node when all the records belong to the same class

Stop expanding a node when all the records have similar attribute values


Overfitting

Overfitting results in decision trees that are more complex than necessary

Training error does not provide a good estimate of how well the tree will perform on previously unseen records (need a test set)


How to Address Overfitting 1… Pruning

Grow the decision tree to its entirety.

Trim the nodes of the decision tree in a bottom-up fashion.

If generalization error is reduced after trimming, replace the sub-tree by a leaf node (χ² test, see page 706).

The class label of the leaf node is determined from the majority class of instances in the sub-tree.
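A sketch of this bottom-up post-pruning, assuming a hypothetical Node class that stores its children, a leaf prediction, and the training examples that reached it; the error estimate here is simply error on held-out validation examples that reach the node, standing in for whatever generalization estimate (e.g., the χ² test) is used:

```python
from collections import Counter

class Node:
    """Hypothetical decision-tree node: leaves carry a prediction, internal nodes carry children."""
    def __init__(self, attribute=None, children=None, prediction=None, examples=()):
        self.attribute = attribute       # attribute tested at this node (None for leaves)
        self.children = children or {}   # attribute value -> child Node
        self.prediction = prediction     # class label if this is a leaf
        self.examples = list(examples)   # training (x, label) pairs that reached this node

def majority_label(examples):
    return Counter(label for _, label in examples).most_common(1)[0][0]

def classify(node, x):
    if node.prediction is not None:
        return node.prediction
    child = node.children.get(x[node.attribute])
    return classify(child, x) if child else majority_label(node.examples)

def error(node, validation):
    return sum(classify(node, x) != y for x, y in validation)

def prune(node, validation):
    """Prune children first, then try replacing this subtree by a majority-class leaf."""
    if node.prediction is not None:
        return node
    for value, child in node.children.items():
        node.children[value] = prune(child, validation)
    leaf = Node(prediction=majority_label(node.examples), examples=node.examples)
    if error(leaf, validation) <= error(node, validation):   # no worse after trimming: keep the leaf
        return leaf
    return node
```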


How to Address Overfitting 2…

Early Stopping Rule: stop the algorithm before it becomes a fully-grown tree.

Stopping conditions:

Stop if the number of instances is less than some user-specified threshold.

Stop if the class distribution of instances is independent of the available features (e.g., using a χ² test).

Stop if expanding the current node does not improve impurity measures (e.g., information gain).
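One way to check the "class distribution is independent of the feature" condition is a χ² independence test on the contingency table of feature value versus class. A sketch using scipy; the threshold of 5 instances and the 0.05 significance level are illustrative choices:

```python
from collections import Counter
from scipy.stats import chi2_contingency

def should_stop(feature_values, labels, min_instances=5, alpha=0.05):
    """Early-stopping check: too few instances, or class distribution independent of the feature."""
    if len(labels) < min_instances:
        return True
    classes = sorted(set(labels))
    values = sorted(set(feature_values))
    if len(classes) < 2 or len(values) < 2:
        return True                       # nothing to gain from splitting on this feature
    counts = Counter(zip(feature_values, labels))
    table = [[counts[(v, c)] for c in classes] for v in values]
    chi2, p_value, dof, expected = chi2_contingency(table)
    return p_value > alpha                # cannot reject independence: the split is not justified
```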


How to Address Overfitting…

Is the early stopping rule strictly better than pruning (i.e., generating the full tree and then cutting it)?


Remaining Challenges…

Continuous values: need to be split into discrete categories. Sort all values, then consider split points between two examples in sorted order that have different classifications.

Missing values: affect how an example is classified, information gain calculations, and the test set error rate. Pretend that the example has all possible values for the missing attribute, weighted by their frequency among all the examples in the current node.
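A sketch of the candidate-split-point rule for a continuous attribute: sort by value and take midpoints between consecutive examples whose classes differ (the function name and sample data are illustrative):

```python
def candidate_splits(values, labels):
    """Midpoints between consecutive (sorted) examples with different class labels."""
    pairs = sorted(zip(values, labels))
    splits = []
    for (v1, y1), (v2, y2) in zip(pairs, pairs[1:]):
        if y1 != y2 and v1 != v2:
            splits.append((v1 + v2) / 2)
    return splits

print(candidate_splits([48, 60, 80, 10, 90], ['No', 'Yes', 'Yes', 'No', 'No']))
# [54.0, 85.0] -- the midpoints where the class changes in sorted order
```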


Summary

Advantages of decision trees:
Inexpensive to construct.
Extremely fast at classifying unknown records.
Easy to interpret for small-sized trees.
Accuracy is comparable to other classification techniques for many simple data sets.

Learning performance = prediction accuracy measured on test set


K-NEAREST NEIGHBORS


K-Nearest Neighbors

What value do we assign to the green sample?


K-Nearest Neighbors

1-NN: For a given query point q, assign the class of the nearest neighbour.

k-NN: Compute the k nearest neighbours and assign the class by majority vote.

(Figure: the same query classified with k = 1 and k = 3.)
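A minimal k-NN classifier following this rule; the data format (a list of (point, label) pairs) and the plain Euclidean distance are assumptions for the sketch:

```python
import math
from collections import Counter

def euclidean(a, b):
    return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))

def knn_classify(query, training, k=3):
    """Assign the majority class among the k nearest training points."""
    neighbours = sorted(training, key=lambda pair: euclidean(pair[0], query))[:k]
    votes = Counter(label for _, label in neighbours)
    return votes.most_common(1)[0][0]

data = [((1.0, 1.0), '+'), ((1.2, 0.8), '+'), ((4.0, 4.2), 'o'), ((4.1, 3.9), 'o')]
print(knn_classify((1.1, 1.0), data, k=1))   # '+'
print(knn_classify((3.0, 3.0), data, k=3))   # 'o'
```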


Decision Regions for 1-NN


Effect of k

(Figure: decision regions for k = 1 and k = 5.)


K-Nearest Neighbors

Euclidean distance: $D(\mathbf{x}, \mathbf{q}) = \sqrt{\sum_{j=1}^{d} (x_j - q_j)^2}$

Weighted Euclidean distance: $D_w(\mathbf{x}, \mathbf{q}) = \sqrt{\sum_{j=1}^{d} w_j\,(x_j - q_j)^2}$

where d is the dimensionality of the data.
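A sketch of the weighted distance; setting a weight to 0 effectively removes that (irrelevant) feature from the comparison, which is the idea illustrated by the next figures:

```python
import math

def weighted_euclidean(a, b, weights):
    """sqrt(sum_j w_j * (a_j - b_j)^2); a weight of 0 ignores an irrelevant dimension."""
    return math.sqrt(sum(w * (ai - bi) ** 2 for w, ai, bi in zip(weights, a, b)))

# Two points that differ mostly along an irrelevant second feature:
print(weighted_euclidean((1.0, 9.0), (1.2, 2.0), weights=(1.0, 1.0)))  # ~7.0, dominated by feature 2
print(weighted_euclidean((1.0, 9.0), (1.2, 2.0), weights=(1.0, 0.0)))  # 0.2, feature 2 ignored
```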


Weighting the Distance to Remove Irrelevant Features

(Figure: + and o training points scattered in two dimensions, one of which is irrelevant to the class, with a query point marked ?.)


Weighting the Distance to Remove Irrelevant Features

(Figure: the same data with the irrelevant dimension compressed by down-weighting it in the distance.)


Weighting the Distance to Remove Irrelevant Features

(Figure: after fully down-weighting the irrelevant dimension, the points collapse onto a single axis and the query ? lies among the o points.)


Nearest Neighbors Search

Let P be a set of n training points. Given a query point q, find the nearest neighbour of q in P.

Naïve approach: compute the distance from the query point to every other point in the database, keeping track of the "best so far". Running time is O(n).

Data structure approach: construct a data structure which makes this search more efficient.
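The naïve approach as code: one linear scan over P, tracking the best distance so far (O(n) per query):

```python
import math

def nearest_neighbour(query, points):
    """Linear scan: return the point in `points` closest to `query`."""
    best, best_dist = None, math.inf
    for p in points:
        d = math.dist(p, query)          # Euclidean distance
        if d < best_dist:
            best, best_dist = p, d
    return best
```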


Quadtree

A Quadtree is a tree data structure in which each internal node has up to four children.

Every node in the Quadtree corresponds to a square.

If a node v has children, then their corresponding squares are the four quadrants of the square of v.

The leaves of a Quadtree form a Quadtree Subdivision of the square of the root.

The children of a node are labelled NE, NW, SW, and SE to indicate to which quadrant they correspond.


Quadtree Construction

Input: point set P

while Some cell C contains more than 1 point do

Split cell C

end

(Figure: an example point set a–l and the quadtree produced by repeatedly splitting every cell that contains more than one point.)
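A recursive rendering of the construction loop above, assuming 2-D points and an axis-aligned square cell given by its lower-left corner and side length; the class and field names are illustrative, and distinct points strictly inside the root cell are assumed:

```python
class QuadtreeNode:
    def __init__(self, x, y, size, points):
        self.x, self.y, self.size = x, y, size   # lower-left corner and side length of the cell
        self.points = points                     # points lying in this cell
        self.children = []                       # NE, NW, SW, SE quadrants (empty for leaves)
        if len(points) > 1:                      # "while some cell contains more than 1 point: split"
            half = size / 2
            for dx, dy in ((half, half), (0, half), (0, 0), (half, 0)):   # NE, NW, SW, SE
                quadrant = [(px, py) for px, py in points
                            if x + dx <= px < x + dx + half and y + dy <= py < y + dy + half]
                self.children.append(QuadtreeNode(x + dx, y + dy, half, quadrant))

# Usage: the root cell must enclose all points, e.g. QuadtreeNode(0, 0, 1000, pts)
```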


Nearest Neighbor Search


Quadtree Query

(Figure: a query descends the tree by comparing against the split point (X1, Y1); the four children correspond to P≥X1, P≥Y1 / P<X1, P≥Y1 / P≥X1, P<Y1 / P<X1, P<Y1.)


Quadtree Query

In many cases this works. (Figure: the quadrant containing the query also contains its nearest neighbour.)


Quadtree – Pitfall 1

In some cases it doesn't: there could be points in adjacent buckets that are closer to the query than any point in its own bucket. (Figure: the true nearest neighbour lies just across a cell boundary.)


Quadtree – Pitfall 2

Could result in query time exponential in the number of dimensions.


Quadtree

Simple data structure. Versatile, easy to implement. Often space- and time-inefficient.


kd-trees (k-dimensional trees)

Main ideas:
One-dimensional splits.
Instead of splitting in the middle, choose the split "carefully" (many variations).
Nearest neighbour queries work the same way as for quadtrees.


2-dimensional kd-trees

Algorithm:
Choose the x or y coordinate (alternate between them).
Choose the median of that coordinate; this defines a horizontal or vertical splitting line.
Recurse on both sides until there is only one point left, which is stored as a leaf.

We get a binary tree: size O(n), construction time O(n log n), depth O(log n).
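A sketch of the 2-D construction: alternate x/y, split at the median, recurse until a single point remains in a leaf (class and field names are illustrative):

```python
class KDNode:
    def __init__(self, point=None, axis=None, split=None, left=None, right=None):
        self.point = point          # set only at leaves
        self.axis = axis            # 0 = split on x, 1 = split on y
        self.split = split          # median coordinate defining the splitting line
        self.left, self.right = left, right

def build_kdtree(points, depth=0):
    if len(points) == 1:
        return KDNode(point=points[0])
    axis = depth % 2                                  # alternate between x and y
    points = sorted(points, key=lambda p: p[axis])    # re-sorting each level costs O(n log^2 n);
    mid = len(points) // 2                            # presorting once achieves the O(n log n) bound
    return KDNode(axis=axis, split=points[mid][axis],
                  left=build_kdtree(points[:mid], depth + 1),
                  right=build_kdtree(points[mid:], depth + 1))
```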


Nearest Neighbor with KD Trees

We traverse the tree looking for the nearest neighbor of the query point.

Examine nearby points first: explore the branch of the tree that is closest to the query point first.

When we reach a leaf node, compute the distance to each point in the node.

Then we can backtrack and try the other branch at each node visited.

Each time a new closest node is found, we can update the distance bounds.

Using the distance bounds and the bounds of the data below each node, we can prune parts of the tree that could NOT include the nearest neighbor.
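A sketch of this search over the KDNode tree built in the earlier sketch: descend into the branch on the query's side first, then backtrack and visit the other branch only if the splitting line is closer than the best distance found so far (that distance-to-line check is the pruning step):

```python
import math

def kd_nearest(node, query, best=None, best_dist=math.inf):
    """Return (best_point, best_distance) for `query` in the kd-tree rooted at `node`."""
    if node.point is not None:                       # leaf: measure the stored point
        d = math.dist(node.point, query)
        return (node.point, d) if d < best_dist else (best, best_dist)
    # Explore the branch on the query's side of the splitting line first.
    near, far = ((node.left, node.right) if query[node.axis] < node.split
                 else (node.right, node.left))
    best, best_dist = kd_nearest(near, query, best, best_dist)
    # Backtrack: the far branch can only help if the splitting line is within best_dist.
    if abs(query[node.axis] - node.split) < best_dist:
        best, best_dist = kd_nearest(far, query, best, best_dist)
    return best, best_dist

# Usage: point, dist = kd_nearest(build_kdtree(pts), (qx, qy))
```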


Summary of K-Nearest Neighbor

Stores all training data in memory – large space requirement

Can improve query time by representing the data within a k-d tree

K-d trees are only efficient when there are many more examples than dimensions, preferably at least 2^d examples for d dimensions.
