Data Mining with Decision Trees
Lutz Hamel, Dept. of Computer Science and Statistics, University of Rhode Island

Page 1: Data Mining with Decision Trees Lutz Hamel Dept. of Computer Science and Statistics University of Rhode Island

Data Mining with Decision Trees

Lutz Hamel, Dept. of Computer Science and Statistics

University of Rhode Island

Page 2:

What is Data Mining?

Data mining is the application of machine learning techniques to large databases in order to extract knowledge.

(This is also known as KDD: Knowledge Discovery in Databases.)

This definition is no longer strictly true: data mining now encompasses computational techniques outside the classic machine learning domain.

Page 3:

What is Machine Learning?

Programs that get better with experience, given a task and some performance measure:
- Learning to classify news articles
- Learning to recognize spoken words
- Learning to play board games
- Learning to classify customers

Page 4:

What is Knowledge?

Structural descriptions of data (transparent):
- If-then-else rules
- Decision trees

Models of data (non-transparent):
- Neural networks
- Clustering (Self-Organizing Maps, k-Means)
- Naïve Bayes classifiers

Page 5:

Why Data Mining?

Oversimplifying somewhat: queries allow you to retrieve existing knowledge from a database.

Data mining induces new knowledge in a database.

Page 6:

Why Data Mining? (Cont.)

Example: Give me a description of customers who spent more than $100 in my store.

Page 7:

Why Data Mining? (Cont.)

The Query: the only thing a query can do is give you a list of every single customer who spent more than $100.

This is probably not very informative, except that you will most likely see a lot of customer records.

Page 8:

Why Data Mining? (Cont.)

Data Mining Techniques: data mining techniques allow you to generate structural descriptions of the data in question, i.e., to induce new knowledge.

In the case of rules, this might look something like:

IF age < 35 AND car = MINIVAN
THEN spent > $100

Page 9:

Why Data Mining? (Cont.)

In principle, you could generate the same kind of knowledge you gain with data mining techniques using only queries:
- look at the data set of customers who spent more than $100 and propose a hypothesis
- test this hypothesis against your data using a query
- if the query returns a non-null result set, then you have found a description of a subset of your customers

This is a time-consuming, undirected search.

Page 10:

Decision Trees

Decision trees are concept learning algorithms: once a concept is acquired, the algorithm can classify objects according to this concept.

Concept learning:
- acquiring the definition of a general category given a sample of positive and negative examples of the category;
- can be formulated as a problem of searching through a predefined space of potential concepts for the concept that best fits the training examples.

Best-known algorithms: ID3, C4.5, CART

Page 11:

Example

Below is a table of patients who entered the emergency room complaining about chest pain, with two types of diagnoses: Angina and Myocardial Infarction (MI).

Systolic Blood Pressure | White Blood Count | Diagnosis
110 | 13000 | MI
90 | 12000 | MI
85 | 18000 | MI
120 | 8000 | MI
130 | 18000 | MI
180 | 5000 | Angina
200 | 7500 | Angina
165 | 6000 | Angina
190 | 6500 | Angina
120 | 9000 | Angina

(Source: Neural Networks and Artificial Intelligence for Biomedical Engineering, IEEE Press, 2000)

Question: can we generalize beyond this data?

Page 12:

Example (Cont'd)

C4.5 induces the following decision tree for the data:

Systolic Blood Pressure
  <= 130: Myocardial Infarction
  > 130: Angina

[Figure: scatter plot of White Blood Count vs. Systolic Blood Pressure, with the decision surface at Systolic Blood Pressure = 130 separating the MI cases from the Angina cases]

Page 13:

Definition of Concept Learning

Given:
- A data universe X
- A sample set S, where S ⊂ X
- Some target concept c: X → {true, false}
- Labeled training examples D, where D = { <s, c(s)> | s ∈ S }

Using D, determine:
- A function c' such that c'(x) ≈ c(x) for all x ∈ X.

Notes:
- This is called supervised learning because of the necessity of labeled data provided by the trainer.
- Once we have determined c', we can use it to make predictions on unseen elements of the data universe.

Page 14:

The Inductive Learning Hypothesis

Any function found to approximate the target concept well over a sufficiently large set of training examples will also approximate the target concept well over other unobserved examples.

In other words, we are able to generalize beyond what we have seen.

Page 15:

Recasting our Example as a Concept Learning Problem

The data universe X consists of ordered pairs of the form (Systolic Blood Pressure, White Blood Count).

The sample set S ⊂ X is the table of value pairs we are given.

Target concept: Diagnosis: X → {Angina, MI}

The training examples D are given by the table, where D = { <s, Diagnosis(s)> | s ∈ S }.

Using D, find a function Diagnosis' that best describes D.

Page 16:

Recasting our Example as a Concept Induction Problem

A definition of the learned function Diagnosis':

Diagnosis'(Systolic Blood Pressure, White Blood Count) =
  IF Systolic Blood Pressure > 130 THEN Diagnosis = Angina
  ELSE Diagnosis = MI
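Transcribed into code, the learned function is a single conditional. A minimal Python sketch (the language and the function name are our choices, not the slides'); note that the induced tree never actually consults the white blood count:

```python
def diagnosis(systolic_bp, white_blood_count):
    """Learned classifier from the example tree; white_blood_count is unused."""
    if systolic_bp > 130:
        return "Angina"
    return "MI"

print(diagnosis(110, 13000))  # -> MI
print(diagnosis(180, 5000))   # -> Angina
```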

Page 17:

Decision Tree Representation

We can represent the learned function as a tree:
- Each internal node tests an attribute
- Each branch corresponds to an attribute value
- Each leaf node assigns a classification

Systolic Blood Pressure
  <= 130: Myocardial Infarction
  > 130: Angina

Page 18:

Entropy

- S is a sample of training examples
- p+ is the proportion of positive examples in S
- p− is the proportion of negative examples in S
- Entropy measures the impurity (randomness) of S

Entropy(S) ≡ − p+ log2 p+ − p− log2 p−

[Figure: Entropy(S) plotted against p+]
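The entropy measure can be sketched in Python; this helper is our own (not from the slides), and it also works for more than two classes:

```python
import math

def entropy(labels):
    """Impurity of a sample of class labels: 0 for a pure sample,
    1 for an even two-class split."""
    n = len(labels)
    result = 0.0
    for c in set(labels):
        p = labels.count(c) / n  # proportion of class c in the sample
        result -= p * math.log2(p)
    return result

# The ten patients from the example: 5 MI, 5 Angina -> maximal impurity
print(entropy(["MI"] * 5 + ["Angina"] * 5))  # -> 1.0
```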

Page 19:

Top-down Induction of Decision Trees

Recursive algorithm. Main loop:
- Let attribute A be the attribute that minimizes the entropy at the current node
- For each value of A, create a new descendant of the node
- Sort the training examples to the leaf nodes
- If the training examples are classified satisfactorily, then STOP; else iterate over the new leaf nodes.
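The main loop above can be sketched as a recursive Python function. Everything here is our own illustration: rows are dicts of discrete attributes, the tree is a nested dict, and the blood-pressure attribute from the example is discretized into a boolean.

```python
import math

def entropy(labels):
    n = len(labels)
    return -sum((labels.count(c) / n) * math.log2(labels.count(c) / n)
                for c in set(labels))

def gain(rows, labels, attr):
    """Expected entropy reduction from splitting on discrete attribute attr."""
    n = len(labels)
    remainder = 0.0
    for v in {r[attr] for r in rows}:
        sub = [y for r, y in zip(rows, labels) if r[attr] == v]
        remainder += len(sub) / n * entropy(sub)
    return entropy(labels) - remainder

def id3(rows, labels, attributes):
    if len(set(labels)) == 1:          # pure node: classified satisfactorily
        return labels[0]
    if not attributes:                 # nothing left to test: majority class
        return max(set(labels), key=labels.count)
    # choose the attribute that maximizes gain, i.e., minimizes entropy
    best = max(attributes, key=lambda a: gain(rows, labels, a))
    tree = {best: {}}
    for v in {r[best] for r in rows}:  # one descendant per value of best
        idx = [i for i, r in enumerate(rows) if r[best] == v]
        tree[best][v] = id3([rows[i] for i in idx],
                            [labels[i] for i in idx],
                            [a for a in attributes if a != best])
    return tree

# The example table, with blood pressure discretized to "sbp > 130"
sbp = [110, 90, 85, 120, 130, 180, 200, 165, 190, 120]
rows = [{"sbp_high": s > 130} for s in sbp]
labels = ["MI"] * 5 + ["Angina"] * 5
print(id3(rows, labels, ["sbp_high"]))
```

The impure "<= 130" branch (it contains one Angina case) falls back to the majority class, which is why the induced tree labels that whole region MI.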

Page 20:

Information Gain

Gain(S, A) = expected reduction in entropy due to sorting on A.

Gain(S, A) = Entropy(S) − Σ_{v ∈ Values(A)} (|Sv| / |S|) Entropy(Sv)

where Sv = { s ∈ S | A(s) = v }

In other words, Gain(S, A) is the information providedabout the target concept, given the value of some attribute A.
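As a quick numeric check (our own Python sketch, using the example data), consider the split Systolic Blood Pressure > 130: the full sample has entropy 1 (5 MI, 5 Angina), the ">130" branch is pure (4 Angina), and the "<=130" branch holds 5 MI and 1 Angina.

```python
import math

def H(ps):
    """Entropy of a distribution given as class proportions."""
    return -sum(p * math.log2(p) for p in ps if p > 0)

# Gain(S, SBP>130) = Entropy(S) - 4/10 * Entropy(S_>130) - 6/10 * Entropy(S_<=130)
g = H([0.5, 0.5]) - (4 / 10) * H([1.0]) - (6 / 10) * H([5 / 6, 1 / 6])
print(round(g, 2))  # -> 0.61
```

A gain of about 0.61 bits out of a possible 1 bit is why this split is such an attractive choice for the root of the tree.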

Page 21:

Training, Evaluation and Prediction

We know how to induce classification rules on the data, but:

How do we measure performance? How do we use our rules to do prediction?

Page 22:

Training & Evaluation

The simplest method of measuring performance is the hold-out method. Given labeled data D, we divide the data into two sets:
- a hold-out (test) set Dh of size h,
- a training set Dt = D − Dh.

The error of the induced function c't is given as follows:

error_h = (1/h) Σ_{<s, c(s)> ∈ Dh} δ(c't(s), c(s))

where δ(p, q) = 1 if p ≠ q and 0 otherwise.
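The hold-out error is just the fraction of test examples the induced function gets wrong. A Python sketch; the rule and the labeled pairs below are illustrative inventions of ours, echoing the blood-pressure example:

```python
def holdout_error(classifier, holdout_set):
    """error_h: fraction of hold-out pairs (s, c(s)) where classifier(s) != c(s)."""
    h = len(holdout_set)
    mistakes = sum(1 for s, label in holdout_set if classifier(s) != label)
    return mistakes / h

# Hypothetical induced rule and labeled hold-out pairs (sbp, diagnosis)
rule = lambda sbp: "Angina" if sbp > 130 else "MI"
holdout = [(110, "MI"), (180, "Angina"), (120, "Angina"), (200, "Angina")]
print(holdout_error(rule, holdout))  # -> 0.25
```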

Page 23:

Training & Evaluation

However, since we trained and evaluated the learner on a finite set of data, we want to know what the confidence interval is.

We can compute the 95% confidence interval of error_h as follows. Assume that the hold-out set Dh has h ≥ 30 members, and that each d in Dh has been selected independently and according to the probability distribution over the domain. Then the interval is:

error_h ± 1.96 √( error_h (1 − error_h) / h )
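A Python sketch of the interval (the helper name and the sample numbers are ours); for example, an error of 0.10 measured on 100 hold-out examples:

```python
import math

def confidence_interval_95(error, h):
    """95% confidence interval around a hold-out error measured on h examples
    (the normal approximation assumes h >= 30)."""
    margin = 1.96 * math.sqrt(error * (1 - error) / h)
    return error - margin, error + margin

lo, hi = confidence_interval_95(0.10, 100)
print(round(lo, 3), round(hi, 3))  # -> 0.041 0.159
```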

Page 24:

Prediction

As we said earlier, the induced function c' ≈ c; that is, the induced function is an estimate of the target concept.

Therefore, we can use c' to estimate (predict) the label for any unseen instance x ∈ X with an appropriate accuracy.

Page 25:

Summary

Data Mining is the application of machine learning algorithms to large databases in order to induce new knowledge.

Machine learning can be considered a directed search over the space of all possible descriptions of the training data for the description that best fits the data set and also generalizes well to unseen instances.

Decision trees are concept learning algorithms that learn classification functions.