
CSE5230 - Data Mining, 2002 Lecture 7.1

Data Mining - CSE5230

Decision Trees

CSE5230/DMS/2002/7


CSE5230 - Data Mining, 2002 Lecture 7.2

Lecture Outline

Why use Decision Trees?
What is a Decision Tree?
Examples
Use as a data mining technique
Popular Models:
  CART
  CHAID
  ID3 & C4.5


CSE5230 - Data Mining, 2002 Lecture 7.3

Why use Decision Trees? - 1

Whereas neural networks compute a mathematical function of their inputs to generate their outputs, decision trees use logical rules

[Figure: decision tree for the iris data: splits on Petal-length (2.6, 5), Petal-width (1.65) and Sepal-length (6.05) lead to leaves labelled Iris setosa, Iris versicolor and Iris virginica]

IF Petal-length > 2.6 AND Petal-width <= 1.65 AND Petal-length > 5 AND Sepal-length > 6.05

THEN the flower is Iris virginica

NB. This is not the only rule for this species. What is the other?

Figure adapted from [SGI2001]
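As a concrete illustration, the quoted rule can be written directly as a predicate. A minimal sketch in Python (the function and argument names are illustrative, not taken from the MLC++ software):

    def is_virginica(petal_length, petal_width, sepal_length):
        # One root-to-leaf path of the tree above: the rule fires only
        # when every test along the path is satisfied.
        return (petal_length > 2.6 and
                petal_width <= 1.65 and
                petal_length > 5 and
                sepal_length > 6.05)

    # Example: a flower with long petals and long sepals satisfies the rule
    print(is_virginica(petal_length=5.8, petal_width=1.6, sepal_length=6.5))  # True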


CSE5230 - Data Mining, 2002 Lecture 7.4

Why use Decision Trees? - 2

For some applications accuracy of classification or prediction is sufficient, e.g.:
  Direct mail firm needing to find a model for identifying customers who will respond to mail
  Predicting the stock market using past data

In other applications it is better (sometimes essential) that the decision be explained, e.g.:
  Rejection of a credit application
  Medical diagnosis

Humans generally require explanations for most decisions


CSE5230 - Data Mining, 2002 Lecture 7.5

Why use Decision Trees? - 3

Example: When a bank rejects a credit card application, it is better to explain to the customer that it was due to the fact that:
  He/she is not a permanent resident of Australia AND
  He/she has been residing in Australia for < 6 months AND
  He/she does not have a permanent job.

This is better than saying: "We are very sorry, but our neural network thinks that you are not a credit-worthy customer." (In which case the customer might become angry and move to another bank)


CSE5230 - Data Mining, 2002 Lecture 7.6

What is a Decision Tree?

Built from root node (top) to leaf nodes (bottom)
A record first enters the root node
A test is applied to determine to which child node it should go next
  A variety of algorithms for choosing the initial test exists. The aim is to discriminate best between the target classes
The process is repeated until a record arrives at a leaf node
The path from the root to a leaf node provides an expression of a rule

[Figure: the same iris decision tree as on slide 7.3, annotated to show the root node, a test, a child node, a path and the leaf nodes]
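To make the traversal concrete, here is a minimal sketch of a decision-tree node and the root-to-leaf classification loop described above (the class and attribute names are illustrative, not from any particular package):

    class Node:
        """Either a leaf carrying a class label, or an internal node
        carrying a test on a single attribute."""
        def __init__(self, label=None, attribute=None, threshold=None,
                     left=None, right=None):
            self.label = label          # class label if this is a leaf node
            self.attribute = attribute  # attribute tested at this node
            self.threshold = threshold  # go left if value <= threshold
            self.left = left
            self.right = right

    def classify(node, record):
        # Route a record from the root down to a leaf and return its label
        while node.label is None:
            if record[node.attribute] <= node.threshold:
                node = node.left
            else:
                node = node.right
        return node.label

    # The first test of the iris tree: Petal-length <= 2.6 -> Iris setosa
    root = Node(attribute="petal_length", threshold=2.6,
                left=Node(label="Iris setosa"),
                right=Node(label="not setosa (rest of tree omitted)"))
    print(classify(root, {"petal_length": 1.4}))   # Iris setosa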


CSE5230 - Data Mining, 2002 Lecture 7.7

Building a Decision Tree - 1

Algorithms for building decision trees (DTs) begin by trying to find the test which does the “best job” of splitting the data into the desired classes

The desired classes have to be identified at the start

Example: we need to describe the profiles of customers of a telephone company who "churn" (do not renew their contracts). The DT building algorithm examines the customer database to find the best splitting criterion:
  Phone technology
  Age of customer
  Time has been a customer
  Gender

The DT algorithm may discover that the "Phone technology" variable is best for separating churners from non-churners


CSE5230 - Data Mining, 2002 Lecture 7.8

Building a Decision Tree - 2

The process is repeated to discover the best splitting criterion for the records assigned to each node

Once built, the effectiveness of a decision tree can be measured by applying it to a collection of previously unseen records and observing the percentage of correctly classified records

[Figure: partial churn tree splitting on Phone technology (old / new) and Time has been a customer (<= 2.3 / > 2.3 years), with one branch labelled Churners]
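The effectiveness measure described above is simply the fraction of previously unseen records that the tree classifies correctly. A minimal sketch (the record fields and the one-test "tree" are illustrative):

    def accuracy(predict, records, true_labels):
        # Fraction of previously unseen records classified correctly
        correct = sum(1 for rec, label in zip(records, true_labels)
                      if predict(rec) == label)
        return correct / len(true_labels)

    # Toy "tree" with a single test: old phone technology -> churner
    predict = lambda rec: "churn" if rec["phone_tech"] == "old" else "no churn"

    unseen = [{"phone_tech": "old"}, {"phone_tech": "new"}, {"phone_tech": "new"}]
    truth  = ["churn", "no churn", "churn"]
    print(accuracy(predict, unseen, truth))   # ~0.67 -> 67% correctly classified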


CSE5230 - Data Mining, 2002 Lecture 7.9

Example - 1

Requirement: Classify customers who churn, i.e. do not renew their phone contracts. (adapted from [BeS1997])

  Root (Phone Technology): 50 Churners, 50 Non-churners
    old -> 20 Churners, 0 Non-churners
    new -> Time has been a Customer: 30 Churners, 50 Non-churners
      > 2.3 years  -> 5 Churners, 40 Non-churners
      <= 2.3 years -> Age: 25 Churners, 10 Non-churners
        <= 35 -> 20 Churners, 0 Non-churners
        > 35  -> 5 Churners, 10 Non-churners


CSE5230 - Data Mining, 2002 Lecture 7.10

Example - 2

The number of records in a given parent node equals the sum of the records contained in the child nodes

Quite easy to understand how the model is being built (unlike NNs)

Easy to use the model, say for a targeted marketing campaign aimed at customers likely to churn

Provides intuitive ideas about the customer base, e.g.: "Customers who have been with the company for a couple of years and have new phones are pretty loyal"


CSE5230 - Data Mining, 2002 Lecture 7.11

Use as a data mining technique - 1

Exploration

Analyzing the predictors and splitting criteria selected by the algorithm may provide interesting insights which can be acted upon

e.g. if the following rule was identified:

IF time a customer < 1.1 years AND sales channel = telesales

THEN chance of churn is 65%

It might be worthwhile conducting a study on the way the telesales operators are making their calls
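One way to act on such a rule is to pull out the records it covers and check the churn rate directly. A minimal sketch over a list of customer dictionaries (the field names time_as_customer, sales_channel and churned are illustrative):

    customers = [
        {"time_as_customer": 0.8, "sales_channel": "telesales", "churned": True},
        {"time_as_customer": 0.5, "sales_channel": "telesales", "churned": False},
        {"time_as_customer": 3.2, "sales_channel": "retail",    "churned": False},
        {"time_as_customer": 1.0, "sales_channel": "telesales", "churned": True},
    ]

    # Records covered by the rule: time a customer < 1.1 years AND channel = telesales
    covered = [c for c in customers
               if c["time_as_customer"] < 1.1 and c["sales_channel"] == "telesales"]

    churn_rate = sum(c["churned"] for c in covered) / len(covered)
    print(f"{len(covered)} customers covered, churn rate = {churn_rate:.0%}")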


CSE5230 - Data Mining, 2002 Lecture 7.12

Use as a data mining technique - 2

Exploration (continued)

Gleaning information from rules that fail, e.g. from the phone example we obtained the rule:

IF Phone technology = new AND Time has been a customer <= 2.3 years AND Age > 35

THEN there are only 15 customers (15% of total)

Can this rule be useful?

» Perhaps we can attempt to build up this small market segment. If this is possible then we have the edge over competitors since we have a head start in this knowledge

» We can remove these customers from our direct marketing campaign since there are so few of them


CSE5230 - Data Mining, 2002 Lecture 7.13

Use as a data mining technique - 3

Exploration (continued)

Again from the phone company example we noticed that:

» There was no combination of rules to reliably discriminate between churners and non-churners for the small market segment mentioned on the previous slide (5 churners, 10 non-churners).

Do we consider this as an occasion where it was not possible to achieve our objective?

From this failure we have learnt that age is not all that important for this category of churners (unlike those under 35).

Perhaps we were asking the wrong questions all along - this warrants further analysis


CSE5230 - Data Mining, 2002 Lecture 7.14

Use as a data mining technique - 4

Data Pre-processing

Decision trees are very robust at handling different predictor types (numerical/categorical), and run quickly. Therefore they can be good for a first pass over the data in a data mining operation

This will create a subset of the possibly useful predictors which can then be fed into another model, say a neural network

Prediction

Once the decision tree is built, it can then be used as a prediction tool, by applying it to a new set of data
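As a modern illustration of this "first pass" idea (scikit-learn is an assumed tool here, not something used in the lecture), a quick tree fit can rank the predictors before a slower model such as a neural network is trained:

    from sklearn.datasets import load_iris
    from sklearn.tree import DecisionTreeClassifier

    data = load_iris()
    tree = DecisionTreeClassifier(max_depth=3, random_state=0)
    tree.fit(data.data, data.target)

    # Rank the predictors by how much each contributed to the splits;
    # the top few could then be fed into another model
    ranked = sorted(zip(data.feature_names, tree.feature_importances_),
                    key=lambda pair: pair[1], reverse=True)
    for name, importance in ranked:
        print(f"{name}: {importance:.2f}")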


CSE5230 - Data Mining, 2002 Lecture 7.15

Popular Decision Tree Models: CART

CART: Classification And Regression Trees, developed in 1984 by Leo Breiman, Jerome Friedman, Richard Olshen and Charles Stone
  Used in the DM software Darwin - from Thinking Machines Corporation (recently bought by Oracle)

Often uses an entropy measure to determine the split point (Shannon's information theory):

  measure of disorder (MOD) = -Σ p log2(p)

where p is the probability of a prediction value occurring in a particular node of the tree, and the sum runs over the prediction values. Other measures used include Gini and twoing.

CART produces a binary tree


CSE5230 - Data Mining, 2002 Lecture 7.16

CART - 2

Consider the "Churn" problem from slide 7.9
At the first node there are 100 customers to split, 50 who churn and 50 who don't churn. The MOD of this node is:

  MOD = -0.5*log2(0.5) + -0.5*log2(0.5) = 1.00

The algorithm will try each predictor. For each predictor the algorithm will calculate the MOD of the split produced by several values to identify the optimum
Splitting on "Phone technology" produces two nodes, one with 30 churners and 50 non-churners, the other with 20 churners and 0 non-churners. The first of these has:

  MOD = -5/8*log2(5/8) + -3/8*log2(3/8) = 0.95

and the second has a MOD of 0.

CART will select the predictor producing nodes with the lowest MOD as the split point
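A short sketch reproducing these numbers in plain Python (the weighted comparison at the end is an added illustration of how candidate splits can be compared; the helper name mod is illustrative):

    import math

    def mod(counts):
        # Measure of disorder (entropy) of a node, given its class counts
        total = sum(counts)
        return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)

    print(mod([50, 50]))   # root node: 50 churners, 50 non-churners -> 1.0
    print(mod([30, 50]))   # "new" phone technology node             -> ~0.95
    print(mod([20, 0]))    # "old" phone technology node             -> 0.0

    # Candidate splits can be compared via the disorder of the nodes they
    # produce (weighted by node size here); the split that lowers it most wins
    weighted = (80 / 100) * mod([30, 50]) + (20 / 100) * mod([20, 0])
    print(weighted)        # ~0.76, down from 1.0 at the root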


CSE5230 - Data Mining, 2002 Lecture 7.17

Node splitting

An ideally good split

  Left node            Right node
  Name    Churned?     Name    Churned?
  Jim     Yes          Bob     No
  Sally   Yes          Betty   No
  Steve   Yes          Sue     No
  Joe     Yes          Alex    No

An ideally bad split

  Left node            Right node
  Name    Churned?     Name    Churned?
  Jim     Yes          Bob     No
  Sally   Yes          Betty   No
  Steve   No           Sue     Yes
  Joe     No           Alex    Yes


CSE5230 - Data Mining, 2002 Lecture 7.18

Popular Decision Tree Models: CHAID

CHAID: Chi-squared Automatic Interaction Detector, developed by J. A. Hartigan in 1975
  Widely used since it is distributed as part of the popular statistical packages SAS and SPSS

Differs from CART in the way it identifies the split points. Instead of the information measure, it uses the chi-squared test (a statistical test of independence between categorical variables) to identify the split points

All predictors must be categorical or put into categorical form by binning

The accuracy of the two methods, CHAID and CART, has been found to be similar
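For illustration, here is the chi-squared test applied to the phone-technology split from slide 7.9. Using scipy is an assumption for this sketch; as noted above, CHAID itself is distributed with SAS and SPSS:

    from scipy.stats import chi2_contingency

    # Contingency table for the phone-technology split of slide 7.9:
    #                churners   non-churners
    #   old phone       20           0
    #   new phone       30          50
    observed = [[20, 0],
                [30, 50]]

    chi2, p_value, dof, expected = chi2_contingency(observed)
    print(f"chi2 = {chi2:.1f}, p = {p_value:.4f}")
    # A very small p-value means churn is strongly associated with phone
    # technology, so this predictor is a good candidate split point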


CSE5230 - Data Mining, 2002 Lecture 7.19

Popular Decision Tree Models: ID3 & C4.5

ID3: Iterative Dichotomiser, developed by the Australian researcher Ross Quinlan in 1979
  Used in the data mining software Clementine of Integral Solutions Ltd. (taken over by SPSS)

ID3 picks predictors and their splitting values on the basis of the information gain provided
  Gain is the difference between the amount of information that is needed to make a correct prediction before and after the split has been made

If the amount of information required is much lower after the split is made, then the split is said to have decreased the disorder of the original data


CSE5230 - Data Mining, 2002 Lecture 7.20

ID3 & C4.5 - 2

By using the entropy -Σ p log2(p) we can compare two candidate splits, A and B, of the same 10 records:

Starting data (10 records): + + + + + - - - - -
  start entropy = -5/10 log2(5/10) - 5/10 log2(5/10) = 1

Split A: left node = + + + + -    right node = + - - - -
  left entropy  = -4/5 log2(4/5) - 1/5 log2(1/5) = 0.72
  right entropy = -1/5 log2(1/5) - 4/5 log2(4/5) = 0.72

Split B: left node = + + + + + - - - -    right node = -
  left entropy  = -5/9 log2(5/9) - 4/9 log2(4/9) = 0.99
  right entropy = -1/1 log2(1/1) = 0


CSE5230 - Data Mining, 2002 Lecture 7.21

ID3 & C4.5 - 3

Split A will be selected (see the weighted entropy and gain figures below)

C4.5 introduces a number of extensions to ID3:
  Handles unknown field values in training set
  Tree pruning method
  Automated rule generation

Weighted entropy and gain for the two candidate splits:

  Split   Weighted entropy                      Gain
  A       (5/10)*0.72 + (5/10)*0.72 = 0.72      1 - 0.72 = 0.28
  B       (9/10)*0.99 + (1/10)*0    = 0.89      1 - 0.89 = 0.11
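A small sketch reproducing the gain comparison in plain Python (the helper names entropy and information_gain are illustrative):

    import math

    def entropy(labels):
        # Entropy of a list of class labels, in bits
        n = len(labels)
        probs = [labels.count(c) / n for c in set(labels)]
        return -sum(p * math.log2(p) for p in probs if p > 0)

    def information_gain(parent, children):
        # Parent entropy minus the size-weighted entropy of the child nodes
        n = len(parent)
        weighted = sum(len(child) / n * entropy(child) for child in children)
        return entropy(parent) - weighted

    data = ["+"] * 5 + ["-"] * 5    # the 10 records from the previous slide

    split_a = [list("++++-"), list("+----")]
    split_b = [list("+++++----"), list("-")]

    print(information_gain(data, split_a))   # ~0.28 -> split A selected
    print(information_gain(data, split_b))   # ~0.11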


CSE5230 - Data Mining, 2002 Lecture 7.22

Strengths and Weaknesses

Strengths of decision trees:
  Able to generate understandable rules
  Classify with very little computation
  Handle both continuous and categorical data
  Provide a clear indication of which variables are most important for prediction or classification

Weaknesses:
  Not appropriate for estimation or prediction of continuous quantities (income, interest rates, etc.)
  Problematic with time series data (much pre-processing required), can be computationally expensive


CSE5230 - Data Mining, 2002 Lecture 7.23

References

[SGI2001] Silicon Graphics Inc., MLC++ Utilities Manual, 2001. http://www.sgi.com/tech/mlc/utils.html

[BeL1997] J. A. Berry and G. Linoff, Data Mining Techniques: For Marketing, Sales, and Customer Support, John Wiley & Sons Inc., 1997

[BeS1997] A. Berson and S. J. Smith, Data Warehousing, Data Mining and OLAP, McGraw Hill, 1997