COMP 578: Discovering Classification Rules
Keith C.C. Chan
Department of Computing
The Hong Kong Polytechnic University
2
An Example Classification Problem

[Diagram: patient records (symptoms & treatment) grouped into Recovered and Not Recovered; new patients A and B are to be classified.]
3
Classification in Relational DB
Patient  Symptom   Treatment  Recovered
Mike     Headache  Type A     Yes
Mary     Fever     Type A     No
Bill     Cough     Type B2    No
Jim      Fever     Type C1    Yes
Dave     Cough     Type C1    Yes
Anne     Headache  Type B2    Yes

"Recovered" is the class label. Will John, having a headache and treated with Type C1, recover?
4
Discovering Classification Rules

Training Data:

NAME  Symptom   Treat.   Recover?
Mike  Headache  Type A   Yes
Mary  Fever     Type A   No
Bill  Cough     Type B2  No
Jim   Fever     Type C1  Yes
Dave  Cough     Type C1  Yes
Anne  Headache  Type B2  Yes

Mining Classification Rules yields:

IF Symptom = Headache AND Treatment = C1 THEN Recover = Yes

Based on the classification rule discovered, John will recover!!!
5
The Classification Problem
Given:
– A database consisting of n records.
– Each record characterized by m attributes.
– Each record pre-classified into p different classes.
Find:
– A set of classification rules (that constitutes a classification model) that characterizes the different classes,
– so that records not originally in the database can be accurately classified,
– i.e., "predicting" class labels.
6
Typical Applications
Credit approval.
– Classes can be High Risk and Low Risk.
Target marketing.
– What are the classes?
Medical diagnosis.
– Classes can be patients with different diseases.
Treatment effectiveness analysis.
– Classes can be patients with different degrees of recovery.
7
Techniques for Discovering Classification Rules
– The k-Nearest Neighbor algorithm.
– The linear discriminant function.
– The Bayesian approach.
– The decision tree approach.
– The neural network approach.
– The genetic algorithm approach.
8
Example Using The k-NN Algorithm
Salary  Age  Insurance
15K     28   Buy
31K     39   Buy
41K     53   Buy
10K     45   Buy
14K     55   Buy
25K     27   Not Buy
42K     32   Not Buy
18K     38   Not Buy
33K     44   Not Buy

John earns 24K per month and is 42 years old. Will he buy insurance?
9
The k-Nearest Neighbor Algorithm
– All data records correspond to points in the n-dimensional space.
– "Nearest" is defined in terms of Euclidean distance.
– k-NN returns the most common class label among the k training examples nearest to the query point xq.

[Diagram: "+" and "-" training points scattered around the query point xq; the k nearest of them determine its class.]
10
The k-NN Algorithm (2)
k-NN can also be used for continuous-valued labels:
– Return the mean value of the k nearest neighbors.
Distance-weighted nearest neighbor algorithm:
– Weight the contribution of each of the k neighbors according to its distance to the query point xq, e.g. w = 1 / d(xq, xi)^2.
Advantage:
– Robust to noisy data (averages over the k nearest neighbors).
Disadvantage:
– The distance between neighbors can be dominated by irrelevant attributes.
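Both variants above can be sketched in Python (the data layout and function names are illustrative, not from the course notes):

```python
import math

def knn_predict(train, query, k=3, weighted=False):
    """Return the majority class among the k training points nearest to
    `query`; `train` is a list of (feature_tuple, label) pairs."""
    dist = lambda a, b: math.sqrt(sum((u - v) ** 2 for u, v in zip(a, b)))
    neighbors = sorted(train, key=lambda rec: dist(rec[0], query))[:k]
    votes = {}
    for features, label in neighbors:
        # Distance-weighted variant: closer neighbours count more, w = 1/d^2.
        w = 1.0 / (dist(features, query) ** 2 + 1e-9) if weighted else 1.0
        votes[label] = votes.get(label, 0.0) + w
    return max(votes, key=votes.get)

# The insurance example: (salary in K, age) -> Buy / Not Buy.
data = [((15, 28), "Buy"), ((31, 39), "Buy"), ((41, 53), "Buy"),
        ((10, 45), "Buy"), ((14, 55), "Buy"), ((25, 27), "Not Buy"),
        ((42, 32), "Not Buy"), ((18, 38), "Not Buy"), ((33, 44), "Not Buy")]
print(knn_predict(data, (24, 42), k=3))  # -> Not Buy
```

Note that salary and age are unscaled here, so salary differences dominate the Euclidean distance; this is exactly the "dominated by irrelevant attributes" caveat above.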
11
Linear Discriminant Function
How should we determine the coefficients, i.e. the wi's?
12
Linear Discriminant Function (2)

[Diagram: 3 lines separating 3 classes.]
13
An Example Using The Naïve Bayesian Approach

Luk   Tang  Pong  Cheng  B/S
Buy   Sell  Buy   Buy    B
Buy   Sell  Buy   Sell   B
Hold  Sell  Buy   Buy    S
Sell  Buy   Buy   Buy    S
Sell  Hold  Sell  Buy    S
Sell  Hold  Sell  Sell   B
Hold  Hold  Sell  Sell   S
Buy   Buy   Buy   Buy    B
Buy   Hold  Sell  Buy    S
Sell  Buy   Sell  Buy    S
Buy   Buy   Sell  Sell   S
Hold  Buy   Buy   Sell   S
Hold  Sell  Sell  Buy    S
Sell  Buy   Buy   Sell   B
14
The Example Continued
On one particular day:
– Luk recommends Sell,
– Tang recommends Sell,
– Pong recommends Buy, and
– Cheng recommends Buy.
If P(Buy | L=Sell, T=Sell, P=Buy, C=Buy) > P(Sell | L=Sell, T=Sell, P=Buy, C=Buy), then Buy; else Sell.
How do we compute the probabilities?
15
The Bayesian Approach
Given a record characterized by n attributes:
– X = <x1, …, xn>.
Calculate the probability that it belongs to a class Ci:
– P(Ci|X) = probability that record X = <x1, …, xn> is of class Ci.
– X is classified into Ci if P(Ci|X) is the greatest among all classes.
16
Estimating A-Posteriori Probabilities
How do we compute P(C|X)?
Bayes theorem:
P(C|X) = P(X|C)·P(C) / P(X)
– P(X) is constant for all classes.
– P(C) = relative frequency of class C samples.
– The C for which P(C|X) is maximum is the C for which P(X|C)·P(C) is maximum.
Problem: computing P(X|C) directly is not feasible!
17
The Naïve Bayesian Approach
Naïve assumption:
– All attributes are mutually conditionally independent given the class:
P(x1, …, xk | C) = P(x1|C) · … · P(xk|C)
If the i-th attribute is categorical:
– P(xi|C) is estimated as the relative frequency of samples having value xi for the i-th attribute in class C.
If the i-th attribute is continuous:
– P(xi|C) is estimated through a Gaussian density function.
Computationally easy in both cases.
18
An Example Using The Naïve Bayesian Approach

(The 14-record analyst recommendation table shown earlier is repeated on this slide.)
19
The Example Continued
On one particular day, X = <Sell, Sell, Buy, Buy>:
– P(X|S)·P(S) = P(Luk=Sell|S)·P(Tang=Sell|S)·P(Pong=Buy|S)·P(Cheng=Buy|S)·P(S) = 3/9 · 2/9 · 3/9 · 6/9 · 9/14 = 0.010582
– P(X|B)·P(B) = P(Luk=Sell|B)·P(Tang=Sell|B)·P(Pong=Buy|B)·P(Cheng=Buy|B)·P(B) = 2/5 · 2/5 · 4/5 · 2/5 · 5/14 = 0.018286
The Buy score is larger, so you should Buy.
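The two products can be checked mechanically. A small Python sketch (the table is hard-coded from the slide; the function name is mine) reproduces the arithmetic:

```python
records = [  # (Luk, Tang, Pong, Cheng, class) from the analyst table.
    ("Buy","Sell","Buy","Buy","B"),   ("Buy","Sell","Buy","Sell","B"),
    ("Hold","Sell","Buy","Buy","S"),  ("Sell","Buy","Buy","Buy","S"),
    ("Sell","Hold","Sell","Buy","S"), ("Sell","Hold","Sell","Sell","B"),
    ("Hold","Hold","Sell","Sell","S"),("Buy","Buy","Buy","Buy","B"),
    ("Buy","Hold","Sell","Buy","S"),  ("Sell","Buy","Sell","Buy","S"),
    ("Buy","Buy","Sell","Sell","S"),  ("Hold","Buy","Buy","Sell","S"),
    ("Hold","Sell","Sell","Buy","S"), ("Sell","Buy","Buy","Sell","B"),
]

def nb_score(x, cls):
    """Naive-Bayes score P(x1|cls)...P(x4|cls)*P(cls), estimated by
    relative frequencies as on the previous slides."""
    in_cls = [r for r in records if r[4] == cls]
    score = len(in_cls) / len(records)  # prior P(cls)
    for i, v in enumerate(x):
        score *= sum(1 for r in in_cls if r[i] == v) / len(in_cls)
    return score

x = ("Sell", "Sell", "Buy", "Buy")
print(round(nb_score(x, "S"), 6), round(nb_score(x, "B"), 6))  # 0.010582 0.018286
```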
20
Advantages of The Bayesian Approach
Probabilistic:
– Calculates explicit probabilities.
Incremental:
– Each additional example can incrementally increase/decrease a class probability.
Probabilistic classification:
– Can classify into multiple classes, weighted by their probabilities.
Standard:
– Though computationally intractable in general, the approach provides a standard of optimal decision making.
21
The independence hypothesis…
… makes computation possible.
… yields optimal classifiers when satisfied.
… but is seldom satisfied in practice, as attributes (variables) are often correlated.
Attempts to overcome this limitation:
– Bayesian networks, which combine Bayesian reasoning with causal relationships between attributes.
– Decision trees, which reason on one attribute at a time, considering the most important attributes first.
22
Bayesian Belief Networks (I)

[Network diagram over six variables: FamilyHistory, Smoker, LungCancer, Emphysema, PositiveXRay, Dyspnea; FamilyHistory and Smoker are the parents of LungCancer.]

The conditional probability table for the variable LungCancer:

      (FH, S)  (FH, ~S)  (~FH, S)  (~FH, ~S)
LC    0.8      0.5       0.7       0.1
~LC   0.2      0.5       0.3       0.9
23
Bayesian Belief Networks (II)
– A Bayesian belief network allows a subset of the variables to be conditionally independent.
– It is a graphical model of causal relationships.
– Several cases of learning Bayesian belief networks:
• Given both the network structure and all the variables: easy.
• Given the network structure but only some of the variables.
• When the network structure is not known in advance.
24
The Decision Tree Approach

age    income  student  credit_rating  buys_computer
<=30   high    no       fair           no
<=30   high    no       excellent      no
31…40  high    no       fair           yes
>40    medium  no       fair           yes
>40    low     yes      fair           yes
>40    low     yes      excellent      no
31…40  low     yes      excellent      yes
<=30   medium  no       fair           no
<=30   low     yes      fair           yes
>40    medium  yes      fair           yes
<=30   medium  yes      excellent      yes
31…40  medium  no       excellent      yes
31…40  high    yes      fair           yes
>40    medium  no       excellent      no
25
The Decision Tree Approach (2)
What is a decision tree?
– A flow-chart-like tree structure.
– An internal node denotes a test on an attribute.
– A branch represents an outcome of the test.
– Leaf nodes represent class labels or class distributions.

[Tree for the table above: age? — "<=30" leads to student? (no → no, yes → yes); "30..40" leads to yes; ">40" leads to credit rating? (excellent → no, fair → yes).]
26
Constructing A Decision Tree
Decision tree generation has 2 phases:
– At the start, all the records are at the root.
– Examples are partitioned recursively based on selected attributes.
The decision tree can then be used to classify a record not originally in the example database:
– Test the attribute values of the sample against the decision tree.
27
Tree Construction Algorithm
Basic algorithm (a greedy algorithm):
– The tree is constructed in a top-down, recursive, divide-and-conquer manner.
– At the start, all the training examples are at the root.
– Attributes are categorical (continuous-valued attributes are discretized in advance).
– Examples are partitioned recursively based on selected attributes.
– Test attributes are selected on the basis of a heuristic or statistical measure (e.g., information gain).
Conditions for stopping partitioning:
– All samples for a given node belong to the same class.
– There are no remaining attributes for further partitioning (majority voting is employed to label the leaf).
– There are no samples left.
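As a sketch (not the course's own code), the greedy algorithm above maps onto a short recursive function; the attribute-selection heuristic is passed in so that any measure, e.g. information gain, can be plugged in:

```python
from collections import Counter

def build_tree(rows, attrs, select):
    """Top-down, recursive, divide-and-conquer tree construction.
    rows: list of ({attribute: value}, label) pairs;
    select(rows, attrs): heuristic that picks the next test attribute."""
    labels = [label for _, label in rows]
    # Stop: all samples in one class, or no attributes left (majority vote).
    if len(set(labels)) == 1 or not attrs:
        return Counter(labels).most_common(1)[0][0]
    best = select(rows, attrs)
    tree = {"attr": best, "branches": {}}
    # Branch only on observed values, so no subset is ever empty here.
    for value in sorted({r[best] for r, _ in rows}):
        subset = [(r, l) for r, l in rows if r[best] == value]
        tree["branches"][value] = build_tree(
            subset, [a for a in attrs if a != best], select)
    return tree
```

Run on the eight-record trading table with a selector that picks Trading Volume first, it reproduces the tree derived later in these slides.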
28
A Decision Tree Example

Record  HS Index  Trading Vol.  DJIA  Buy/Sell
1       Drop      Large         Drop  Buy
2       Rise      Large         Rise  Sell
3       Rise      Medium        Drop  Buy
4       Drop      Small         Drop  Sell
5       Rise      Small         Drop  Sell
6       Rise      Large         Drop  Buy
7       Rise      Small         Rise  Sell
8       Drop      Large         Rise  Sell
29
A Decision Tree Example (2)
Each record is described in terms of three attributes:
– Hang Seng Index, with values {rise, drop}.
– Trading volume, with values {small, medium, large}.
– Dow Jones Industrial Average (DJIA), with values {rise, drop}.
Records contain Buy (B) or Sell (S) to indicate the correct decision; B or S can be considered a class label.
30
A Decision Tree Example (3)
If we select Trading Volume to form the root of the decision tree:
Trading Volume
– Small: {4, 5, 7}
– Medium: {3}
– Large: {1, 2, 6, 8}
31
A Decision Tree Example (4)
– The sub-collections corresponding to "Small" and "Medium" contain records of only a single class, so further partitioning is unnecessary.
– Select the DJIA attribute to test for the "Large" branch.
– Now all sub-collections contain records of one decision (class).
– We can replace each sub-collection by the decision/class name to obtain the decision tree.
32
A Decision Tree Example (5)

Trading Volume
– Small: Sell
– Medium: Buy
– Large: test DJIA
• Rise: Sell
• Drop: Buy
33
A Decision Tree Example (6)
A record can be classified by:
– Starting at the root of the decision tree.
– Finding the value of the attribute being tested in the given record.
– Taking the branch appropriate to that value.
– Continuing in the same fashion until a leaf is reached.
Two records having identical attribute values may belong to different classes.
The leaves corresponding to an empty set of examples should be kept to a minimum.
Classifying a particular record may involve evaluating only a small number of the attributes, depending on the length of the path; we never need to consider the HSI.
34
Simple Decision Trees
Selecting each attribute in turn for different levels of the tree tends to lead to a complex tree.
A simple tree is easier to understand.
Select attributes so as to make the final tree as simple as possible.
35
The ID3 Algorithm
– Uses an information-theoretic approach for this.
– A decision tree is considered an information source that, given a record, generates a message.
– The message is the classification of that record (say, Buy (B) or Sell (S)).
– ID3 selects attributes by assuming that tree complexity is related to the amount of information conveyed by this message.
36
Information Theoretic Test Selection
– Each attribute of a record contributes a certain amount of information to its classification.
– E.g., if our goal is to determine the credit risk of a customer, the discovery that it has many late-payment records may contribute a certain amount of information to that goal.
– ID3 measures the information gained by making each attribute the root of the current sub-tree.
– It then picks the attribute that provides the greatest information gain.
37
Information Gain
– Information theory was proposed by Shannon in 1948.
– It provides a useful theoretical basis for measuring the information content of a message.
– A message is considered an instance in a universe of possible messages.
– The information content of a message depends on:
• The number of possible messages (the size of the universe).
• The frequency with which each possible message occurs.
38
Information Gain (2)
– The number of possible messages determines the amount of information (e.g. gambling):
• Roulette has many outcomes; a message concerning its outcome is of more value.
– The probability of each message determines the amount of information (e.g. a rigged coin):
• If one already knows enough about the coin to wager correctly ¾ of the time, a message telling the outcome of a given toss is worth less than it would be for an honest coin.
– Such intuition is formalized in information theory:
• The amount of information in a message is defined as a function of the probability of occurrence of each possible message.
39
Information Gain (3)
– Given a universe of messages M = {m1, m2, …, mn}, suppose each message mi has probability p(mi) of being received.
– The amount of information I(mi) contained in the message is defined as:
I(mi) = -log2 p(mi)
– The uncertainty of a message set, U(M), is the sum of the information in the possible messages weighted by their probabilities:
U(M) = -Σi p(mi) log2 p(mi), i = 1 to n.
– That is, we compute the average information of the possible messages that could be sent.
– If all messages in a set are equiprobable, then uncertainty is at a maximum.
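The uncertainty formula translates directly into a few lines of Python (a sketch; the function name is mine):

```python
import math

def uncertainty(probs):
    """U(M) = -sum_i p(m_i) * log2 p(m_i): the average information,
    in bits, of a message drawn from the set."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

print(uncertainty([0.5, 0.5]))    # fair coin: 1.0 bit (the maximum for n=2)
print(uncertainty([0.75, 0.25]))  # rigged coin: about 0.811 bits
```

The rigged coin from the previous slide indeed carries less information per toss than the honest one.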
40
DT Construction Using ID3
If the probabilities of these messages are pB and pS respectively, the expected information content of the message is:

-pB log2 pB - pS log2 pS

With a known set C of records we can approximate these probabilities by relative frequencies. That is, pB becomes the proportion of records in C with class B.
41
DT Construction Using ID3 (2)
Let U(C) denote this expected information content of a message from a decision tree, i.e.,

U(C) = -pB log2 pB - pS log2 pS

and define U({ }) = 0. Now consider, as before, the possible choice of Aj as the attribute to test next. The partial decision tree is shown on the next slide.
42
DT Construction Using ID3 (3)
The values of attribute Aj are mutually exclusive, so the new expected information content will be:

E(C, Aj) = Σi Pr(Aj = aji) · U(Ci)

where Ci is the sub-collection of C whose records have Aj = aji.

[Partial tree: the root tests Aj; each branch aj1, …, ajm leads to a sub-collection C1, …, Cm.]
43
DT Construction Using ID3 (4)
Again we can replace the probabilities by relative frequencies. The suggested choice of attribute to test next is the one that gains the most information, i.e., select the Aj for which U(C) - E(C, Aj) is maximal.
For example, consider the choice of the first attribute to test, i.e., HSI. The collection of records contains 3 Buy signals (B) and 5 Sell signals (S), so:

U(C) = -(3/8) log2 (3/8) - (5/8) log2 (5/8) = 0.954 bits
44
DT Construction Using ID3 (5)
Testing the first attribute gives the results shown below.

Hang Seng Index
– Rise: {2, 3, 5, 6, 7}
– Drop: {1, 4, 8}
45
DT Construction Using ID3 (6)
The information still needed for a rule for the "Rise" branch is:

-(2/5) log2 (2/5) - (3/5) log2 (3/5) = 0.971 bits

And for the "Drop" branch:

-(1/3) log2 (1/3) - (2/3) log2 (2/3) = 0.918 bits

The expected information content is:

E(C, HSI) = (5/8) × 0.971 + (3/8) × 0.918 = 0.951 bits
46
DT Construction Using ID3 (7)
The information gained by testing this attribute is 0.954 - 0.951 = 0.003 bits, which is negligible.
The tree arising from testing the second attribute was given previously. The branches for Small (with 3 records) and Medium (1 record) require no further information. The branch for Large contains 2 Buy and 2 Sell records and so requires 1 bit:

E(C, Volume) = (3/8) × 0 + (1/8) × 0 + (4/8) × 1 = 0.5 bits
47
DT Construction Using ID3 (8)
The information gained by testing Trading Volume is 0.954 - 0.5 = 0.454 bits. In a similar way, the information gained by testing DJIA comes to 0.347 bits. The principle of maximizing expected information gain leads ID3 to select Trading Volume as the attribute to form the root of the decision tree.
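The three gains can be recomputed with a short script over the eight-record table (a sketch; the table follows the slides, the helper names are mine):

```python
import math
from collections import Counter

def U(labels):
    """Expected information content of a list of class labels, in bits."""
    n = len(labels)
    return -sum(c / n * math.log2(c / n) for c in Counter(labels).values())

def expected_info(rows, attr):
    """E(C, A) = sum over values a of Pr(A = a) * U(subset with A = a)."""
    n, subsets = len(rows), {}
    for r, label in rows:
        subsets.setdefault(r[attr], []).append(label)
    return sum(len(ls) / n * U(ls) for ls in subsets.values())

rows = [  # Record: HS Index, Trading Vol., DJIA -> Buy/Sell
    ({"HSI": "Drop", "Volume": "Large",  "DJIA": "Drop"}, "Buy"),
    ({"HSI": "Rise", "Volume": "Large",  "DJIA": "Rise"}, "Sell"),
    ({"HSI": "Rise", "Volume": "Medium", "DJIA": "Drop"}, "Buy"),
    ({"HSI": "Drop", "Volume": "Small",  "DJIA": "Drop"}, "Sell"),
    ({"HSI": "Rise", "Volume": "Small",  "DJIA": "Drop"}, "Sell"),
    ({"HSI": "Rise", "Volume": "Large",  "DJIA": "Drop"}, "Buy"),
    ({"HSI": "Rise", "Volume": "Small",  "DJIA": "Rise"}, "Sell"),
    ({"HSI": "Drop", "Volume": "Large",  "DJIA": "Rise"}, "Sell"),
]
base = U([label for _, label in rows])  # about 0.954 bits
for attr in ("HSI", "Volume", "DJIA"):
    print(attr, round(base - expected_info(rows, attr), 3))
```

Trading Volume's gain (about 0.454 bits) is the largest, matching the choice of Volume for the root.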
48
How to use a tree?
Directly:
– Test the attribute values of an unknown sample against the tree.
– A path is traced from the root to a leaf, which holds the label.
Indirectly:
– The decision tree is converted to classification rules.
– One rule is created for each path from the root to a leaf.
– IF-THEN rules are easier for humans to understand.
49
Extracting Classification Rules from Trees
– Represent the knowledge in the form of IF-THEN rules.
– One rule is created for each path from the root to a leaf.
– Each attribute-value pair along a path forms a conjunction.
– The leaf node holds the class prediction.
– Rules are easier for humans to understand.
Example:
IF age = "<=30" AND student = "no" THEN buys_computer = "no"
IF age = "<=30" AND student = "yes" THEN buys_computer = "yes"
IF age = "31…40" THEN buys_computer = "yes"
IF age = ">40" AND credit_rating = "excellent" THEN buys_computer = "yes"
IF age = "<=30" AND credit_rating = "fair" THEN buys_computer = "no"
50
Avoid Overfitting in Classification
The generated tree may overfit the training data:
– Too many branches; some may reflect anomalies due to noise or outliers.
– The result is poor accuracy for unseen samples.
Two approaches to avoid overfitting:
– Prepruning: halt tree construction early; do not split a node if this would push the goodness measure below a threshold.
• It is difficult to choose an appropriate threshold.
– Postpruning: remove branches from a "fully grown" tree to obtain a sequence of progressively pruned trees.
• Use a set of data different from the training data to decide which is the "best pruned tree".
51
Improving the C4.5/ID3 Algorithm
Allow for continuous-valued attributes:
– Dynamically define new discrete-valued attributes that partition the continuous attribute values into a discrete set of intervals.
Handle missing attribute values:
– Assign the most common value of the attribute, or
– Assign a probability to each of the possible values.
Attribute construction:
– Create new attributes based on existing ones that are sparsely represented.
– This reduces fragmentation, repetition, and replication.
52
Classifying Large Datasets
Advantages of the decision-tree approach:
– Computationally efficient compared to other classification methods.
– Convertible into simple, easy-to-understand classification rules.
– Relatively good quality rules (comparable classification accuracy).
53
Presentation of Classification Results
54
Neural Networks

[A neuron: the input vector x = (x0, x1, …, xn) is combined with the weight vector w = (w0, w1, …, wn) into a weighted sum (plus a bias term), which is passed through an activation function f to produce the output y.]
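In code, the neuron above is just a dot product plus a bias, fed through f (here a sigmoid; the numbers and names are illustrative):

```python
import math

def neuron(x, w, bias):
    """Weighted sum of the inputs plus a bias, passed through a
    sigmoid activation function f."""
    s = bias + sum(wi * xi for wi, xi in zip(w, x))
    return 1.0 / (1.0 + math.exp(-s))  # f(s) = 1 / (1 + e^-s)

print(neuron([1.0, 0.0], [2.0, -1.0], bias=-1.0))  # sigmoid(1.0), about 0.731
```

A network stacks many such units in layers; training adjusts the weights and biases.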
55
Neural Networks
Advantages:
– Prediction accuracy is generally high.
– Robust: works even when training examples contain errors.
– Output may be discrete, real-valued, or a vector of several discrete or real-valued attributes.
– Fast evaluation of the learned target function.
Criticism:
– Long training time.
– The learned function (the weights) is difficult to understand.
– Not easy to incorporate domain knowledge.
Genetic Algorithm (I)
GA: based on an analogy to biological evolution.
– A diverse population of competing hypotheses is maintained.
– At each iteration, the most fit members are selected to produce new offspring that replace the least fit ones.
– Hypotheses are encoded by strings that are combined by crossover operations, and subject to random mutation.
Learning is viewed as a special case of optimization.
– Finding optimal hypothesis according to the predefined fitness function.
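The crossover and mutation operators on bit-string hypotheses can be sketched as follows (illustrative Python, not the course's code):

```python
import random

def crossover(a, b, point):
    """Single-point crossover: swap the tails of two bit-string hypotheses."""
    return a[:point] + b[point:], b[:point] + a[point:]

def mutate(bits, rate, rng=random):
    """Flip each bit independently with probability `rate`."""
    return "".join("10"[int(b)] if rng.random() < rate else b for b in bits)

print(crossover("00111110", "10011110", 2))  # ('00011110', '10111110')
```

A GA loop would score each string with the fitness function, select the fittest, and apply these operators to produce the next generation.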
57
Genetic Algorithm (II)
A rule such as

IF (level = doctor) AND (GPA = 3.6) THEN result = approval

is encoded as a bit string: level = 001, GPA = 111, result = 10, giving 00111110.

An example population of encoded hypotheses: 00111110, 10011110, 10001101, 00101101.
58
Fuzzy Set Approaches
– Fuzzy logic uses truth values between 0.0 and 1.0 to represent the degree of membership (e.g. using a fuzzy membership graph).
– Attribute values are converted to fuzzy values; e.g., income is mapped into the discrete categories {low, medium, high} with fuzzy values calculated for each.
– For a given new sample, more than one fuzzy value may apply.
– Each applicable rule contributes a vote for membership in the categories.
– Typically, the truth values for each predicted category are summed.
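A sketch of the income example (the break-points 20K/40K/60K are illustrative assumptions, not from the slides):

```python
def fuzzy_income(income):
    """Degrees of membership of an income (in K) in {low, medium, high},
    using shoulder functions for low/high and a triangle for medium."""
    low = 1.0 if income <= 20 else max(0.0, (40 - income) / 20)
    high = 1.0 if income >= 60 else max(0.0, (income - 40) / 20)
    if income <= 20 or income >= 60:
        medium = 0.0
    else:
        medium = (income - 20) / 20 if income < 40 else (60 - income) / 20
    return {"low": low, "medium": medium, "high": high}

print(fuzzy_income(30))  # more than one fuzzy value applies: low 0.5, medium 0.5
```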
59
Evaluating Classification Rules
Constructing a classification model:
– In the form of mathematical equations? Neural networks? Classification rules?
– Requires a training set of pre-classified records.
Evaluating the classification model:
– Estimate quality by testing the classification model.
– Quality = accuracy of classification.
– Requires a testing set of records (with known class labels).
– Accuracy is the percentage of the test set that is correctly classified.
60
Construction of Classification Model
Training Data:

NAME  Undergrad U  Degree  Grade
Mike  U of A       B.Sc.   Hi
Mary  U of C       B.A.    Lo
Bill  U of B       B.Eng   Lo
Jim   U of B       B.A.    Hi
Dave  U of A       B.Sc.   Hi
Anne  U of A       B.Sc.   Hi

Classification Algorithms produce the Classifier (Model):

IF Undergrad U = 'U of A' OR Degree = B.Sc. THEN Grade = 'Hi'
61
Evaluation of Classification Model
Classifier

Testing Data:

NAME    Undergrad U  Degree  Grade
Tom     U of A       B.Sc.   Hi
Melisa  U of C       B.A.    Lo
Pete    U of B       B.Eng   Lo
Joe     U of A       B.A.    Hi

Unseen Data: (Jeff, U of A, B.Sc.). Hi Grade?
62
Classification Accuracy: Estimating Error Rates
Partition: Training-and-testing
– use two independent data sets, e.g., training set (2/3), test set(1/3)
– used for data set with large number of samples
Cross-validation
– divide the data set into k subsamples
– use k-1 subsamples as training data and one sub-sample as test data --- k-fold cross-validation
– for data set with moderate size
Bootstrapping (e.g. leave-one-out)

– for small-sized data sets
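The k-fold scheme above can be sketched as a generator of (training, test) splits (illustrative Python, not course code):

```python
def kfold_splits(records, k):
    """Divide records into k subsamples; yield k (train, test) pairs where
    each fold serves once as the test set and the other k-1 as training data."""
    folds = [records[i::k] for i in range(k)]
    for i in range(k):
        train = [r for j, fold in enumerate(folds) if j != i for r in fold]
        yield train, folds[i]

for train, test in kfold_splits(list(range(10)), k=5):
    print(len(train), len(test))  # 8 2, five times
```

The model would be trained and scored once per split; the k accuracies are then averaged into one estimate.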
63
Issues Regarding Classification: Data Preparation
Data cleaning:
– Preprocess data in order to reduce noise and handle missing values.
Relevance analysis (feature selection):
– Remove irrelevant or redundant attributes.
Data transformation:
– Generalize and/or normalize data.
64
Issues Regarding Classification (2): Evaluating Classification Methods
Predictive accuracy.
Speed and scalability:
– Time to construct the model.
– Time to use the model.
Robustness:
– Handling noise and missing values.
Scalability:
– Efficiency in disk-resident databases.
Interpretability:
– Understanding and insight provided by the model.
Goodness of rules:
– Decision tree size.
– Compactness of classification rules.