COMP 578: Discovering Classification Rules
Keith C.C. Chan
Department of Computing
The Hong Kong Polytechnic University
2
An Example Classification Problem

[Diagram: patient records (symptoms & treatment) grouped into Recovered and Not Recovered; new patients A and B are to be classified.]
3
Classification in Relational DB
Patient  Symptom   Treatment  Recovered
Mike     Headache  Type A     Yes
Mary     Fever     Type A     No
Bill     Cough     Type B2    No
Jim      Fever     Type C1    Yes
Dave     Cough     Type C1    Yes
Anne     Headache  Type B2    Yes

"Recovered" is the class label. Will John, having a headache and treated with Type C1, recover?
4
Discovering Classification Rules

Training Data:

NAME  Symptom   Treat.   Recover?
Mike  Headache  Type A   Yes
Mary  Fever     Type A   No
Bill  Cough     Type B2  No
Jim   Fever     Type C1  Yes
Dave  Cough     Type C1  Yes
Anne  Headache  Type B2  Yes

Mining Classification Rules yields:

IF Symptom = Headache AND Treatment = C1 THEN Recover = Yes

Based on the classification rule discovered, John will recover!!!
5
The Classification Problem
Given:
– A database consisting of n records.
– Each record characterized by m attributes.
– Each record pre-classified into p different classes.
Find:
– A set of classification rules (that constitutes a classification model) that characterizes the different classes,
– so that records not originally in the database can be accurately classified,
– i.e., "predicting" class labels.
6
Typical Applications
Credit approval.
– Classes can be High Risk and Low Risk.
Target marketing.
– What are the classes?
Medical diagnosis.
– Classes can be patients with different diseases.
Treatment effectiveness analysis.
– Classes can be patients with different degrees of recovery.
7
Techniques for Discovering Classification Rules
– The k-Nearest Neighbor algorithm.
– The linear discriminant function.
– The Bayesian approach.
– The decision tree approach.
– The neural network approach.
– The genetic algorithm approach.
8
Example Using The k-NN Algorithm
Salary  Age  Insurance
15K     28   Buy
31K     39   Buy
41K     53   Buy
10K     45   Buy
14K     55   Buy
25K     27   Not Buy
42K     32   Not Buy
18K     38   Not Buy
33K     44   Not Buy

John earns 24K per month and is 42 years old. Will he buy insurance?
9
The k-Nearest Neighbor Algorithm
– All data records correspond to points in the n-dimensional space.
– "Nearest" is defined in terms of Euclidean distance.
– k-NN returns the most common class label among the k training examples nearest to the query point xq.

[Diagram: "+" and "-" training points scattered around the query point xq; the k nearest of them determine its class.]
10
The k-NN Algorithm (2)
k-NN can also be used for continuous-valued labels:
– Return the mean value of the k nearest neighbors.
Distance-weighted nearest neighbor algorithm:
– Weight the contribution of each of the k neighbors according to its distance to the query point xq, e.g. w = 1 / d(xq, xi)^2.
Advantage:
– Robust to noisy data (averages over the k nearest neighbors).
Disadvantage:
– The distance between neighbors can be dominated by irrelevant attributes.
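Both variants above can be sketched in Python (the data layout and function names are illustrative, not from the course notes):

```python
import math

def knn_predict(train, query, k=3, weighted=False):
    """Return the majority class among the k training points nearest to
    `query`; `train` is a list of (feature_tuple, label) pairs."""
    dist = lambda a, b: math.sqrt(sum((u - v) ** 2 for u, v in zip(a, b)))
    neighbors = sorted(train, key=lambda rec: dist(rec[0], query))[:k]
    votes = {}
    for features, label in neighbors:
        # Distance-weighted variant: closer neighbours count more, w = 1/d^2.
        w = 1.0 / (dist(features, query) ** 2 + 1e-9) if weighted else 1.0
        votes[label] = votes.get(label, 0.0) + w
    return max(votes, key=votes.get)

# The insurance example: (salary in K, age) -> Buy / Not Buy.
data = [((15, 28), "Buy"), ((31, 39), "Buy"), ((41, 53), "Buy"),
        ((10, 45), "Buy"), ((14, 55), "Buy"), ((25, 27), "Not Buy"),
        ((42, 32), "Not Buy"), ((18, 38), "Not Buy"), ((33, 44), "Not Buy")]
print(knn_predict(data, (24, 42), k=3))  # -> Not Buy
```

Note that salary and age are unscaled here, so salary differences dominate the Euclidean distance; this is exactly the "dominated by irrelevant attributes" caveat above.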
11
Linear Discriminant Function
How should we determine the coefficients, i.e. the wi's?
12
Linear Discriminant Function (2)

[Diagram: 3 lines separating 3 classes.]
13
An Example Using The Naïve Bayesian Approach

Luk   Tang  Pong  Cheng  B/S
Buy   Sell  Buy   Buy    B
Buy   Sell  Buy   Sell   B
Hold  Sell  Buy   Buy    S
Sell  Buy   Buy   Buy    S
Sell  Hold  Sell  Buy    S
Sell  Hold  Sell  Sell   B
Hold  Hold  Sell  Sell   S
Buy   Buy   Buy   Buy    B
Buy   Hold  Sell  Buy    S
Sell  Buy   Sell  Buy    S
Buy   Buy   Sell  Sell   S
Hold  Buy   Buy   Sell   S
Hold  Sell  Sell  Buy    S
Sell  Buy   Buy   Sell   B
14
The Example Continued
On one particular day:
– Luk recommends Sell,
– Tang recommends Sell,
– Pong recommends Buy, and
– Cheng recommends Buy.
If P(Buy | L=Sell, T=Sell, P=Buy, C=Buy) > P(Sell | L=Sell, T=Sell, P=Buy, C=Buy), then Buy; else Sell.
How do we compute the probabilities?
15
The Bayesian Approach
Given a record characterized by n attributes:
– X = <x1, …, xn>.
Calculate the probability that it belongs to a class Ci:
– P(Ci|X) = probability that record X = <x1, …, xn> is of class Ci.
– X is classified into Ci if P(Ci|X) is the greatest among all classes.
16
Estimating A-Posteriori Probabilities
How do we compute P(C|X)?
Bayes theorem:
P(C|X) = P(X|C)·P(C) / P(X)
– P(X) is constant for all classes.
– P(C) = relative frequency of class C samples.
– The C for which P(C|X) is maximum is the C for which P(X|C)·P(C) is maximum.
Problem: computing P(X|C) directly is not feasible!
17
The Naïve Bayesian Approach
Naïve assumption:
– All attributes are mutually conditionally independent given the class:
P(x1, …, xk | C) = P(x1|C) · … · P(xk|C)
If the i-th attribute is categorical:
– P(xi|C) is estimated as the relative frequency of samples having value xi for the i-th attribute in class C.
If the i-th attribute is continuous:
– P(xi|C) is estimated through a Gaussian density function.
Computationally easy in both cases.
18
An Example Using The Naïve Bayesian Approach

(The 14-record analyst recommendation table shown earlier is repeated on this slide.)
19
The Example Continued
On one particular day, X = <Sell, Sell, Buy, Buy>:
– P(X|S)·P(S) = P(Luk=Sell|S)·P(Tang=Sell|S)·P(Pong=Buy|S)·P(Cheng=Buy|S)·P(S) = 3/9 · 2/9 · 3/9 · 6/9 · 9/14 = 0.010582
– P(X|B)·P(B) = P(Luk=Sell|B)·P(Tang=Sell|B)·P(Pong=Buy|B)·P(Cheng=Buy|B)·P(B) = 2/5 · 2/5 · 4/5 · 2/5 · 5/14 = 0.018286
The Buy score is larger, so you should Buy.
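The two products can be checked mechanically. A small Python sketch (the table is hard-coded from the slide; the function name is mine) reproduces the arithmetic:

```python
records = [  # (Luk, Tang, Pong, Cheng, class) from the analyst table.
    ("Buy","Sell","Buy","Buy","B"),   ("Buy","Sell","Buy","Sell","B"),
    ("Hold","Sell","Buy","Buy","S"),  ("Sell","Buy","Buy","Buy","S"),
    ("Sell","Hold","Sell","Buy","S"), ("Sell","Hold","Sell","Sell","B"),
    ("Hold","Hold","Sell","Sell","S"),("Buy","Buy","Buy","Buy","B"),
    ("Buy","Hold","Sell","Buy","S"),  ("Sell","Buy","Sell","Buy","S"),
    ("Buy","Buy","Sell","Sell","S"),  ("Hold","Buy","Buy","Sell","S"),
    ("Hold","Sell","Sell","Buy","S"), ("Sell","Buy","Buy","Sell","B"),
]

def nb_score(x, cls):
    """Naive-Bayes score P(x1|cls)...P(x4|cls)*P(cls), estimated by
    relative frequencies as on the previous slides."""
    in_cls = [r for r in records if r[4] == cls]
    score = len(in_cls) / len(records)  # prior P(cls)
    for i, v in enumerate(x):
        score *= sum(1 for r in in_cls if r[i] == v) / len(in_cls)
    return score

x = ("Sell", "Sell", "Buy", "Buy")
print(round(nb_score(x, "S"), 6), round(nb_score(x, "B"), 6))  # 0.010582 0.018286
```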
20
Advantages of The Bayesian Approach
Probabilistic:
– Calculates explicit probabilities.
Incremental:
– Each additional example can incrementally increase/decrease a class probability.
Probabilistic classification:
– Can classify into multiple classes, weighted by their probabilities.
Standard:
– Though computationally intractable in general, the approach provides a standard of optimal decision making.
21
The independence hypothesis…
… makes computation possible.
… yields optimal classifiers when satisfied.
… but is seldom satisfied in practice, as attributes (variables) are often correlated.
Attempts to overcome this limitation:
– Bayesian networks, which combine Bayesian reasoning with causal relationships between attributes.
– Decision trees, which reason on one attribute at a time, considering the most important attributes first.
22
Bayesian Belief Networks (I)

[Network diagram over six variables: FamilyHistory, Smoker, LungCancer, Emphysema, PositiveXRay, Dyspnea; FamilyHistory and Smoker are the parents of LungCancer.]

The conditional probability table for the variable LungCancer:

      (FH, S)  (FH, ~S)  (~FH, S)  (~FH, ~S)
LC    0.8      0.5       0.7       0.1
~LC   0.2      0.5       0.3       0.9
23
Bayesian Belief Networks (II)
– A Bayesian belief network allows a subset of the variables to be conditionally independent.
– It is a graphical model of causal relationships.
– Several cases of learning Bayesian belief networks:
• Given both the network structure and all the variables: easy.
• Given the network structure but only some of the variables.
• When the network structure is not known in advance.
24
The Decision Tree Approach

age    income  student  credit_rating  buys_computer
<=30   high    no       fair           no
<=30   high    no       excellent      no
31…40  high    no       fair           yes
>40    medium  no       fair           yes
>40    low     yes      fair           yes
>40    low     yes      excellent      no
31…40  low     yes      excellent      yes
<=30   medium  no       fair           no
<=30   low     yes      fair           yes
>40    medium  yes      fair           yes
<=30   medium  yes      excellent      yes
31…40  medium  no       excellent      yes
31…40  high    yes      fair           yes
>40    medium  no       excellent      no
25
The Decision Tree Approach (2)
What is a decision tree?
– A flow-chart-like tree structure.
– An internal node denotes a test on an attribute.
– A branch represents an outcome of the test.
– Leaf nodes represent class labels or class distributions.

[Tree for the table above: age? — "<=30" leads to student? (no → no, yes → yes); "30..40" leads to yes; ">40" leads to credit rating? (excellent → no, fair → yes).]
26
Constructing A Decision Tree
Decision tree generation has 2 phases:
– At the start, all the records are at the root.
– Examples are partitioned recursively based on selected attributes.
The decision tree can then be used to classify a record not originally in the example database:
– Test the attribute values of the sample against the decision tree.
27
Tree Construction Algorithm
Basic algorithm (a greedy algorithm):
– The tree is constructed in a top-down, recursive, divide-and-conquer manner.
– At the start, all the training examples are at the root.
– Attributes are categorical (continuous-valued attributes are discretized in advance).
– Examples are partitioned recursively based on selected attributes.
– Test attributes are selected on the basis of a heuristic or statistical measure (e.g., information gain).
Conditions for stopping partitioning:
– All samples for a given node belong to the same class.
– There are no remaining attributes for further partitioning (majority voting is employed to label the leaf).
– There are no samples left.
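As a sketch (not the course's own code), the greedy algorithm above maps onto a short recursive function; the attribute-selection heuristic is passed in so that any measure, e.g. information gain, can be plugged in:

```python
from collections import Counter

def build_tree(rows, attrs, select):
    """Top-down, recursive, divide-and-conquer tree construction.
    rows: list of ({attribute: value}, label) pairs;
    select(rows, attrs): heuristic that picks the next test attribute."""
    labels = [label for _, label in rows]
    # Stop: all samples in one class, or no attributes left (majority vote).
    if len(set(labels)) == 1 or not attrs:
        return Counter(labels).most_common(1)[0][0]
    best = select(rows, attrs)
    tree = {"attr": best, "branches": {}}
    # Branch only on observed values, so no subset is ever empty here.
    for value in sorted({r[best] for r, _ in rows}):
        subset = [(r, l) for r, l in rows if r[best] == value]
        tree["branches"][value] = build_tree(
            subset, [a for a in attrs if a != best], select)
    return tree
```

Run on the eight-record trading table with a selector that picks Trading Volume first, it reproduces the tree derived later in these slides.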
28
A Decision Tree Example

Record  HS Index  Trading Vol.  DJIA  Buy/Sell
1       Drop      Large         Drop  Buy
2       Rise      Large         Rise  Sell
3       Rise      Medium        Drop  Buy
4       Drop      Small         Drop  Sell
5       Rise      Small         Drop  Sell
6       Rise      Large         Drop  Buy
7       Rise      Small         Rise  Sell
8       Drop      Large         Rise  Sell
29
A Decision Tree Example (2)
Each record is described in terms of three attributes:
– Hang Seng Index, with values {rise, drop}.
– Trading volume, with values {small, medium, large}.
– Dow Jones Industrial Average (DJIA), with values {rise, drop}.
Records contain Buy (B) or Sell (S) to indicate the correct decision; B or S can be considered a class label.
30
A Decision Tree Example (3)
If we select Trading Volume to form the root of the decision tree:
Trading Volume
– Small: {4, 5, 7}
– Medium: {3}
– Large: {1, 2, 6, 8}
31
A Decision Tree Example (4)
– The sub-collections corresponding to "Small" and "Medium" contain records of only a single class, so further partitioning is unnecessary.
– Select the DJIA attribute to test for the "Large" branch.
– Now all sub-collections contain records of one decision (class).
– We can replace each sub-collection by the decision/class name to obtain the decision tree.
32
A Decision Tree Example (5)

Trading Volume
– Small: Sell
– Medium: Buy
– Large: test DJIA
• Rise: Sell
• Drop: Buy
33
A Decision Tree Example (6)
A record can be classified by:
– Starting at the root of the decision tree.
– Finding the value of the attribute being tested in the given record.
– Taking the branch appropriate to that value.
– Continuing in the same fashion until a leaf is reached.
Two records having identical attribute values may belong to different classes.
The leaves corresponding to an empty set of examples should be kept to a minimum.
Classifying a particular record may involve evaluating only a small number of the attributes, depending on the length of the path; we never need to consider the HSI.
34
Simple Decision Trees
Selecting each attribute in turn for different levels of the tree tends to lead to a complex tree.
A simple tree is easier to understand.
Select attributes so as to make the final tree as simple as possible.
35
The ID3 Algorithm
– Uses an information-theoretic approach for this.
– A decision tree is considered an information source that, given a record, generates a message.
– The message is the classification of that record (say, Buy (B) or Sell (S)).
– ID3 selects attributes by assuming that tree complexity is related to the amount of information conveyed by this message.
36
Information Theoretic Test Selection
– Each attribute of a record contributes a certain amount of information to its classification.
– E.g., if our goal is to determine the credit risk of a customer, the discovery that it has many late-payment records may contribute a certain amount of information to that goal.
– ID3 measures the information gained by making each attribute the root of the current sub-tree.
– It then picks the attribute that provides the greatest information gain.
37
Information Gain
– Information theory was proposed by Shannon in 1948.
– It provides a useful theoretical basis for measuring the information content of a message.
– A message is considered an instance in a universe of possible messages.
– The information content of a message depends on:
• The number of possible messages (the size of the universe).
• The frequency with which each possible message occurs.
38
Information Gain (2)
– The number of possible messages determines the amount of information (e.g. gambling):
• Roulette has many outcomes; a message concerning its outcome is of more value.
– The probability of each message determines the amount of information (e.g. a rigged coin):
• If one already knows enough about the coin to wager correctly ¾ of the time, a message telling the outcome of a given toss is worth less than it would be for an honest coin.
– Such intuition is formalized in information theory:
• The amount of information in a message is defined as a function of the probability of occurrence of each possible message.
39
Information Gain (3)
– Given a universe of messages M = {m1, m2, …, mn}, suppose each message mi has probability p(mi) of being received.
– The amount of information I(mi) contained in the message is defined as:
I(mi) = -log2 p(mi)
– The uncertainty of a message set, U(M), is the sum of the information in the possible messages weighted by their probabilities:
U(M) = -Σi p(mi) log2 p(mi), i = 1 to n.
– That is, we compute the average information of the possible messages that could be sent.
– If all messages in a set are equiprobable, then uncertainty is at a maximum.
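The uncertainty formula translates directly into a few lines of Python (a sketch; the function name is mine):

```python
import math

def uncertainty(probs):
    """U(M) = -sum_i p(m_i) * log2 p(m_i): the average information,
    in bits, of a message drawn from the set."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

print(uncertainty([0.5, 0.5]))    # fair coin: 1.0 bit (the maximum for n=2)
print(uncertainty([0.75, 0.25]))  # rigged coin: about 0.811 bits
```

The rigged coin from the previous slide indeed carries less information per toss than the honest one.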
40
DT Construction Using ID3
If the probabilities of these messages are pB and pS respectively, the expected information content of the message is:

-pB log2 pB - pS log2 pS

With a known set C of records we can approximate these probabilities by relative frequencies. That is, pB becomes the proportion of records in C with class B.
41
DT Construction Using ID3 (2)
Let U(C) denote this expected information content of a message from a decision tree, i.e.,

U(C) = -pB log2 pB - pS log2 pS

and define U({ }) = 0. Now consider, as before, the possible choice of Aj as the attribute to test next. The partial decision tree is shown on the next slide.
42
DT Construction Using ID3 (3)
The values of attribute Aj are mutually exclusive, so the new expected information content will be:

E(C, Aj) = Σi Pr(Aj = aji) · U(Ci)

where Ci is the sub-collection of C whose records have Aj = aji.

[Partial tree: the root tests Aj; each branch aj1, …, ajm leads to a sub-collection C1, …, Cm.]
43
DT Construction Using ID3 (4)
Again we can replace the probabilities by relative frequencies. The suggested choice of attribute to test next is the one that gains the most information, i.e., select the Aj for which U(C) - E(C, Aj) is maximal.
For example, consider the choice of the first attribute to test, i.e., HSI. The collection of records contains 3 Buy signals (B) and 5 Sell signals (S), so:

U(C) = -(3/8) log2 (3/8) - (5/8) log2 (5/8) = 0.954 bits
44
DT Construction Using ID3 (5)
Testing the first attribute gives the results shown below.

Hang Seng Index
– Rise: {2, 3, 5, 6, 7}
– Drop: {1, 4, 8}
45
DT Construction Using ID3 (6)
The information still needed for a rule for the "Rise" branch is:

-(2/5) log2 (2/5) - (3/5) log2 (3/5) = 0.971 bits

And for the "Drop" branch:

-(1/3) log2 (1/3) - (2/3) log2 (2/3) = 0.918 bits

The expected information content is:

E(C, HSI) = (5/8) × 0.971 + (3/8) × 0.918 = 0.951 bits
46
DT Construction Using ID3 (7)
The information gained by testing this attribute is 0.954 - 0.951 = 0.003 bits, which is negligible.
The tree arising from testing the second attribute was given previously. The branches for Small (with 3 records) and Medium (1 record) require no further information. The branch for Large contains 2 Buy and 2 Sell records and so requires 1 bit:

E(C, Volume) = (3/8) × 0 + (1/8) × 0 + (4/8) × 1 = 0.5 bits
47
DT Construction Using ID3 (8)
The information gained by testing Trading Volume is 0.954 - 0.5 = 0.454 bits. In a similar way, the information gained by testing DJIA comes to 0.347 bits. The principle of maximizing expected information gain leads ID3 to select Trading Volume as the attribute to form the root of the decision tree.
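The three gains can be recomputed with a short script over the eight-record table (a sketch; the table follows the slides, the helper names are mine):

```python
import math
from collections import Counter

def U(labels):
    """Expected information content of a list of class labels, in bits."""
    n = len(labels)
    return -sum(c / n * math.log2(c / n) for c in Counter(labels).values())

def expected_info(rows, attr):
    """E(C, A) = sum over values a of Pr(A = a) * U(subset with A = a)."""
    n, subsets = len(rows), {}
    for r, label in rows:
        subsets.setdefault(r[attr], []).append(label)
    return sum(len(ls) / n * U(ls) for ls in subsets.values())

rows = [  # Record: HS Index, Trading Vol., DJIA -> Buy/Sell
    ({"HSI": "Drop", "Volume": "Large",  "DJIA": "Drop"}, "Buy"),
    ({"HSI": "Rise", "Volume": "Large",  "DJIA": "Rise"}, "Sell"),
    ({"HSI": "Rise", "Volume": "Medium", "DJIA": "Drop"}, "Buy"),
    ({"HSI": "Drop", "Volume": "Small",  "DJIA": "Drop"}, "Sell"),
    ({"HSI": "Rise", "Volume": "Small",  "DJIA": "Drop"}, "Sell"),
    ({"HSI": "Rise", "Volume": "Large",  "DJIA": "Drop"}, "Buy"),
    ({"HSI": "Rise", "Volume": "Small",  "DJIA": "Rise"}, "Sell"),
    ({"HSI": "Drop", "Volume": "Large",  "DJIA": "Rise"}, "Sell"),
]
base = U([label for _, label in rows])  # about 0.954 bits
for attr in ("HSI", "Volume", "DJIA"):
    print(attr, round(base - expected_info(rows, attr), 3))
```

Trading Volume's gain (about 0.454 bits) is the largest, matching the choice of Volume for the root.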
48
How to use a tree?
Directly:
– Test the attribute values of an unknown sample against the tree.
– A path is traced from the root to a leaf, which holds the label.
Indirectly:
– The decision tree is converted to classification rules.
– One rule is created for each path from the root to a leaf.
– IF-THEN rules are easier for humans to understand.
49
Extracting Classification Rules from Trees
– Represent the knowledge in the form of IF-THEN rules.
– One rule is created for each path from the root to a leaf.
– Each attribute-value pair along a path forms a conjunction.
– The leaf node holds the class prediction.
– Rules are easier for humans to understand.
Example:
IF age = "<=30" AND student = "no" THEN buys_computer = "no"
IF age = "<=30" AND student = "yes" THEN buys_computer = "yes"
IF age = "31…40" THEN buys_computer = "yes"
IF age = ">40" AND credit_rating = "excellent" THEN buys_computer = "yes"
IF age = "<=30" AND credit_rating = "fair" THEN buys_computer = "no"
50
Avoid Overfitting in Classification
The generated tree may overfit the training data:
– Too many branches; some may reflect anomalies due to noise or outliers.
– The result is poor accuracy for unseen samples.
Two approaches to avoid overfitting:
– Prepruning: halt tree construction early; do not split a node if this would push the goodness measure below a threshold.
• It is difficult to choose an appropriate threshold.
– Postpruning: remove branches from a "fully grown" tree to obtain a sequence of progressively pruned trees.
• Use a set of data different from the training data to decide which is the "best pruned tree".
51
Improving the C4.5/ID3 Algorithm
Allow for continuous-valued attributes:
– Dynamically define new discrete-valued attributes that partition the continuous attribute values into a discrete set of intervals.
Handle missing attribute values:
– Assign the most common value of the attribute, or
– Assign a probability to each of the possible values.
Attribute construction:
– Create new attributes based on existing ones that are sparsely represented.
– This reduces fragmentation, repetition, and replication.
52
Classifying Large Datasets
Advantages of the decision-tree approach:
– Computationally efficient compared to other classification methods.
– Convertible into simple, easy-to-understand classification rules.
– Relatively good quality rules (comparable classification accuracy).
53
Presentation of Classification Results
54
Neural Networks

[A neuron: the input vector x = (x0, x1, …, xn) is combined with the weight vector w = (w0, w1, …, wn) into a weighted sum (plus a bias term), which is passed through an activation function f to produce the output y.]
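In code, the neuron above is just a dot product plus a bias, fed through f (here a sigmoid; the numbers and names are illustrative):

```python
import math

def neuron(x, w, bias):
    """Weighted sum of the inputs plus a bias, passed through a
    sigmoid activation function f."""
    s = bias + sum(wi * xi for wi, xi in zip(w, x))
    return 1.0 / (1.0 + math.exp(-s))  # f(s) = 1 / (1 + e^-s)

print(neuron([1.0, 0.0], [2.0, -1.0], bias=-1.0))  # sigmoid(1.0), about 0.731
```

A network stacks many such units in layers; training adjusts the weights and biases.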
55
Neural Networks
Advantages:
– Prediction accuracy is generally high.
– Robust: works even when training examples contain errors.
– Output may be discrete, real-valued, or a vector of several discrete or real-valued attributes.
– Fast evaluation of the learned target function.
Criticism:
– Long training time.
– The learned function (the weights) is difficult to understand.
– Not easy to incorporate domain knowledge.
Genetic Algorithm (I)
GA: based on an analogy to biological evolution.
– A diverse population of competing hypotheses is maintained.
– At each iteration, the most fit members are selected to produce new offspring that replace the least fit ones.
– Hypotheses are encoded by strings that are combined by crossover operations, and subject to random mutation.
Learning is viewed as a special case of optimization.
– Finding optimal hypothesis according to the predefined fitness function.
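The crossover and mutation operators on bit-string hypotheses can be sketched as follows (illustrative Python, not the course's code):

```python
import random

def crossover(a, b, point):
    """Single-point crossover: swap the tails of two bit-string hypotheses."""
    return a[:point] + b[point:], b[:point] + a[point:]

def mutate(bits, rate, rng=random):
    """Flip each bit independently with probability `rate`."""
    return "".join("10"[int(b)] if rng.random() < rate else b for b in bits)

print(crossover("00111110", "10011110", 2))  # ('00011110', '10111110')
```

A GA loop would score each string with the fitness function, select the fittest, and apply these operators to produce the next generation.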
57
Genetic Algorithm (II)
A rule such as

IF (level = doctor) AND (GPA = 3.6) THEN result = approval

is encoded as a bit string: level = 001, GPA = 111, result = 10, giving 00111110.

An example population of encoded hypotheses: 00111110, 10011110, 10001101, 00101101.
58
Fuzzy Set Approaches
– Fuzzy logic uses truth values between 0.0 and 1.0 to represent the degree of membership (e.g. using a fuzzy membership graph).
– Attribute values are converted to fuzzy values; e.g., income is mapped into the discrete categories {low, medium, high} with fuzzy values calculated for each.
– For a given new sample, more than one fuzzy value may apply.
– Each applicable rule contributes a vote for membership in the categories.
– Typically, the truth values for each predicted category are summed.
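A sketch of the income example (the break-points 20K/40K/60K are illustrative assumptions, not from the slides):

```python
def fuzzy_income(income):
    """Degrees of membership of an income (in K) in {low, medium, high},
    using shoulder functions for low/high and a triangle for medium."""
    low = 1.0 if income <= 20 else max(0.0, (40 - income) / 20)
    high = 1.0 if income >= 60 else max(0.0, (income - 40) / 20)
    if income <= 20 or income >= 60:
        medium = 0.0
    else:
        medium = (income - 20) / 20 if income < 40 else (60 - income) / 20
    return {"low": low, "medium": medium, "high": high}

print(fuzzy_income(30))  # more than one fuzzy value applies: low 0.5, medium 0.5
```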
59
Evaluating Classification Rules
Constructing a classification model:
– In the form of mathematical equations? Neural networks? Classification rules?
– Requires a training set of pre-classified records.
Evaluating the classification model:
– Estimate quality by testing the classification model.
– Quality = accuracy of classification.
– Requires a testing set of records (with known class labels).
– Accuracy is the percentage of the test set that is correctly classified.
60
Construction of Classification Model
Training Data:

NAME  Undergrad U  Degree  Grade
Mike  U of A       B.Sc.   Hi
Mary  U of C       B.A.    Lo
Bill  U of B       B.Eng   Lo
Jim   U of B       B.A.    Hi
Dave  U of A       B.Sc.   Hi
Anne  U of A       B.Sc.   Hi

Classification Algorithms produce the Classifier (Model):

IF Undergrad U = 'U of A' OR Degree = B.Sc. THEN Grade = 'Hi'
61
Evaluation of Classification Model
Classifier

Testing Data:

NAME    Undergrad U  Degree  Grade
Tom     U of A       B.Sc.   Hi
Melisa  U of C       B.A.    Lo
Pete    U of B       B.Eng   Lo
Joe     U of A       B.A.    Hi

Unseen Data: (Jeff, U of A, B.Sc.). Hi Grade?
62
Classification Accuracy: Estimating Error Rates
Partition: Training-and-testing
– use two independent data sets, e.g., training set (2/3), test set(1/3)
– used for data set with large number of samples
Cross-validation
– divide the data set into k subsamples
– use k-1 subsamples as training data and one sub-sample as test data --- k-fold cross-validation
– for data set with moderate size
Bootstrapping (e.g. leave-one-out)

– for small-sized data sets
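The k-fold scheme above can be sketched as a generator of (training, test) splits (illustrative Python, not course code):

```python
def kfold_splits(records, k):
    """Divide records into k subsamples; yield k (train, test) pairs where
    each fold serves once as the test set and the other k-1 as training data."""
    folds = [records[i::k] for i in range(k)]
    for i in range(k):
        train = [r for j, fold in enumerate(folds) if j != i for r in fold]
        yield train, folds[i]

for train, test in kfold_splits(list(range(10)), k=5):
    print(len(train), len(test))  # 8 2, five times
```

The model would be trained and scored once per split; the k accuracies are then averaged into one estimate.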
63
Issues Regarding Classification: Data Preparation
Data cleaning:
– Preprocess data in order to reduce noise and handle missing values.
Relevance analysis (feature selection):
– Remove irrelevant or redundant attributes.
Data transformation:
– Generalize and/or normalize data.
64
Issues Regarding Classification (2): Evaluating Classification Methods
Predictive accuracy.
Speed and scalability:
– Time to construct the model.
– Time to use the model.
Robustness:
– Handling noise and missing values.
Scalability:
– Efficiency in disk-resident databases.
Interpretability:
– Understanding and insight provided by the model.
Goodness of rules:
– Decision tree size.
– Compactness of classification rules.