Classification algorithms such as decision trees, ID3, information theory, entropy, CART and naive Bayesian classification are discussed with examples.
CLASSIFICATION IN DATA MINING

SUSHIL KULKARNI
Classification

What is classification?
Model Construction
ID3
Information Theory
Naïve Bayesian Classifier
CLASSIFICATION PROBLEM
• Given a database D = {t1, t2, …, tn} and a set of classes C = {C1, …, Cm}, the classification problem is to define a mapping f: D → C where each ti is assigned to exactly one class.

• The problem is to create the classes that classify the data, with the help of a given set of labelled data called the training set.
CLASSIFICATION EXAMPLES

• Teachers classify students’ grades as A, B, C, D, or F.

• Identify mushrooms as poisonous or edible.

• Identify individuals who are credit risks.
Why Classification? A motivating application

• Credit approval
  o A bank wants to classify its customers based on whether they are expected to pay back their approved loans
  o The history of past customers is used to train the classifier
  o The classifier provides rules, which identify potentially reliable future customers
Why Classification? A motivating application

• Credit approval
  o Classification rule:
    If age = “31…40” and income = high then credit_rating = excellent
  o Future customers:
    Suhas: age = 35, income = high → excellent credit rating
    Heena: age = 20, income = medium → fair credit rating
Classification — A Two-Step Process

 Model construction: describe a set of predetermined classes (Excellent and Fair) using the training set.
 The model is represented using classification rules.
Supervised Learning

 Supervised learning (classification)
  o Supervision: the training data (observations, measurements, etc.) are accompanied by labels indicating the class of the observations
  o New data is classified based on the training set
Classification Process (1): Model Construction

Training Data:

NAME    RANK            YEARS  TEACH
Henna   Assistant Prof  3      no
Leena   Assistant Prof  7      yes
Meena   Professor       2      yes
Dinesh  Associate Prof  7      yes
Dinu    Assistant Prof  6      no
Amar    Associate Prof  3      no

The training data are fed to a classification algorithm, which produces the classifier (model):

IF rank = ‘professor’ OR years > 6 THEN teach = ‘yes’
Classification Process (2): Use the Model in Prediction

The classifier is first checked against labelled testing data and then applied to unseen data.

Testing Data:

NAME    RANK            YEARS  TEACH
Swati   Assistant Prof  2      no
Malika  Associate Prof  7      no
Tina    Professor       5      yes
June    Assistant Prof  7      yes

Unseen Data: (Dina, Professor, 4). Teach?
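A minimal sketch (not part of the original slides; the function name is mine) of how the rule learned in step (1) could be applied to the unseen tuple:

```python
# Hypothetical sketch: apply the rule learned in step (1),
# IF rank = 'professor' OR years > 6 THEN teach = 'yes'.
def predict_teach(rank: str, years: int) -> str:
    return "yes" if rank.lower() == "professor" or years > 6 else "no"

print(predict_teach("Professor", 4))       # Dina -> 'yes'
print(predict_teach("Assistant Prof", 2))  # Swati -> 'no'
```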
Model Construction: Example

Sr. Gender Age BP Drug
1 M 20 Normal A
2 F 73 Normal B
3 M 37 High A
4 M 33 Low B
5 F 48 High A
6 M 29 Normal A
7 F 52 Normal B
8 M 42 Low B
9 M 61 Normal B
10 F 30 Normal A
11 F 26 Low B
12 M 54 High A
Model Construction: Example

[Decision tree (a directed tree): the root tests Blood Pressure. High → Drug A; Low → Drug B; Normal → test Age: ≤ 40 → Drug A, > 40 → Drug B.]
Model Construction: Example

The tree summarizes the following:
 o If BP = High, prescribe Drug A
 o If BP = Low, prescribe Drug B
 o If BP = Normal and age ≤ 40, prescribe Drug A; else prescribe Drug B

Two classes, ‘Drug A’ and ‘Drug B’, are created.
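A minimal sketch (mine, not from the slides) of these three rules as a function:

```python
# Hypothetical sketch of the rules read off the drug-prescription tree.
def prescribe(bp: str, age: int) -> str:
    if bp == "High":
        return "Drug A"
    if bp == "Low":
        return "Drug B"
    # BP is Normal: split on age, as in the tree
    return "Drug A" if age <= 40 else "Drug B"

print(prescribe("Normal", 30))  # Drug A (matches row 10 of the training set)
print(prescribe("Normal", 61))  # Drug B (matches row 9)
```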
Model Construction: Example

The tree is constructed from the training data and has no training error: all the rules we made are 100% correct according to the training data.

On practical field data, it is unlikely that we get rules with 100% accuracy and high support.
Model Construction: Example

Accuracy and Support:
 o Accuracy is 100% for the given rules.
 o If BP = High, prescribe Drug A (Support = 3/12)
 o If BP = Low, prescribe Drug B (Support = 3/12)
 o If BP = Normal and age ≤ 40, prescribe Drug A; else prescribe Drug B (Support = 3/12)
Error and Support

Let t = total number of data points, r = number of data points in a class or node, max = number of data points of the largest class in the node, and min = number of data points of the smallest class in the node. Then
 o Accuracy = max / r
 o Error = min / r
 o Support = max / t

Accuracy and Error are calculated per class (node), while the Support of a class is calculated with respect to the total number of data points in the given set.
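A minimal sketch (the function and variable names are my own) of these three measures for a node that contains a mix of two classes:

```python
# Hypothetical sketch: accuracy, error and support of a node from its class counts.
def node_measures(counts: dict, total_points: int) -> dict:
    r = sum(counts.values())         # data points reaching this node
    largest = max(counts.values())   # 'max' in the slide's notation
    smallest = min(counts.values())  # 'min' in the slide's notation
    return {
        "accuracy": largest / r,
        "error": smallest / r,
        "support": largest / total_points,
    }

# Node P from the next slide: 115 points of class A and 5 of class B, out of 180 in total.
print(node_measures({"A": 115, "B": 5}, total_points=180))
# -> accuracy = 115/120, error = 5/120, support = 115/180
```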
Rules with different accuracy & support

[Figure: a set of 180 data points is split on X into two nodes. Node P (X < 60) holds 115 A and 5 B: A = 115/120, E = 5/120, S = 115/180. Node Q (X > 60) holds 58 A and 2 B: A = 58/60, E = 2/60, S = 58/180.]
Criteria to grow the tree

 If the target attribute is categorical, the tree is called a classification tree.
 [e.g. drug prescribed]
 If the target attribute is continuous, the tree is called a regression tree.
 [e.g. income]
CLASSIFICATION TREES FOR CATEGORICAL ATTRIBUTES
INDUCTION DECISION TREE [ID3]

 Decision tree generation consists of two phases
  o Tree construction
    • At the start, all the training examples are at the root
    • Partition the examples recursively based on selected attributes
  o Tree pruning
    • Identify and remove branches that reflect noise or outliers
 Use of the decision tree: classifying an unknown sample
  o Test the attribute values of the sample against the decision tree
Training Dataset

No  age    income  student  credit_rating  buys_computer
1   <=30   high    no       fair           no
2   <=30   high    no       excellent      no
3   31…40  high    no       fair           yes
4   >40    medium  no       fair           yes
5   >40    low     yes      fair           yes
6   >40    low     yes      excellent      no
7   31…40  low     yes      excellent      yes
8   <=30   medium  no       fair           no
9   <=30   low     yes      fair           yes
10  >40    medium  yes      fair           yes
11  <=30   medium  yes      excellent      yes
12  31…40  medium  no       excellent      yes
13  31…40  high    yes      fair           yes
14  >40    medium  no       excellent      no

This follows an example from Quinlan’s ID3.
Output: ID3 for “buys_computer”

[Decision tree: the root tests age. age <=30 → test student (no → ‘no’, yes → ‘yes’); age 31…40 → ‘yes’; age >40 → test credit rating (fair → ‘yes’, excellent → ‘no’).]

‘no’ and ‘yes’ are the two classes created.
ANOTHER EXAMPLE: MARKS

• If x >= 90 then grade = A.
• If 80 <= x < 90 then grade = B.
• If 70 <= x < 80 then grade = C.
• If 60 <= x < 70 then grade = D.
• If x < 60 then grade = F.
[Decision tree: a chain of tests on x. x >= 90 → A; else x >= 80 → B; else x >= 70 → C; else x >= 60 → D; else F.]
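A minimal sketch (mine) of this chain of tests as code, mirroring the tree above:

```python
# Hypothetical sketch: the marks decision tree as a chain of threshold tests.
def grade(x: float) -> str:
    if x >= 90:
        return "A"
    if x >= 80:
        return "B"
    if x >= 70:
        return "C"
    if x >= 60:
        return "D"
    return "F"

print([grade(m) for m in (95, 84, 72, 65, 40)])  # ['A', 'B', 'C', 'D', 'F']
```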
ALGORITHM FOR ID3

 Basic algorithm (a greedy algorithm)
  o The tree is constructed in a top-down, recursive, divide-and-conquer manner
  o At the start, all the training examples are at the root
  o Attributes are categorical
  o Samples are partitioned recursively based on selected attributes
  o Test attributes are selected on the basis of a heuristic or statistical measure (e.g., information gain)
ALGORITHM FOR ID3

 Conditions for stopping the partitioning
  o All samples for a given node belong to the same class
  o There are no remaining attributes for further partitioning – majority voting is employed for classifying the leaf
  o There are no samples left
ID3: ADVANTAGES

 Easy to understand.
 Easy to generate rules.
ID3: DISADVANTAGES

 May suffer from overfitting.
 Does not easily handle continuous data.
 Trees can be quite large – pruning is necessary.
INFORMATION THEORY

When all the marbles in the bowl are mixed up, little information is given.

When the marbles in the bowl are distributed into different classes, more information is given.
ENTROPY

Entropy gives an idea of which attribute to split on when growing the tree; it is computed over the class distribution, ‘yes’ or ‘no’ in our example.

BUILDING THE TREE
Information Gain ID3

 Select the attribute with the highest information gain.
 Assume there are two classes, P and N.
  o Let the set S contain p elements of class P and n elements of class N.
  o The amount of information needed to decide whether an arbitrary object in S belongs to P or N is defined as

    I(p, n) = - (p / (p + n)) · log2 (p / (p + n)) - (n / (p + n)) · log2 (n / (p + n))
Information Gain in Decision Tree Induction

 Assume that using attribute A, a set S will be partitioned into sets {S1, S2, …, Sv}.
  o If Si contains pi elements of P and ni elements of N, the entropy, or the expected information needed to classify objects in all subtrees Si, is

    E(A) = Σ (i = 1 to v) ((pi + ni) / (p + n)) · I(pi, ni)

 The encoding information that would be gained by branching on A is

    Gain(A) = I(p, n) - E(A)
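A minimal sketch (the helper names are my own) of these quantities in code, assuming two classes P and N:

```python
import math

# Hypothetical sketch of the ID3 information-gain quantities defined above.
def info(p: int, n: int) -> float:
    """I(p, n) for a set with p elements of class P and n of class N."""
    total = p + n
    # 0 * log2(0) is taken as 0, hence the 'if count' filter.
    return -sum(count / total * math.log2(count / total) for count in (p, n) if count)

def entropy_after_split(partitions, p, n):
    """E(A): weighted information of the partitions {S1, ..., Sv} induced by attribute A."""
    return sum((pi + ni) / (p + n) * info(pi, ni) for pi, ni in partitions)

def gain(partitions, p, n):
    """Gain(A) = I(p, n) - E(A)."""
    return info(p, n) - entropy_after_split(partitions, p, n)

print(round(info(9, 5), 3))                            # 0.94, as on the next slide
print(round(gain([(2, 3), (4, 0), (3, 2)], 9, 5), 2))  # 0.25 = Gain(age)
```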
Training Dataset

(The same 14-row buys_computer training set shown earlier; it follows an example from Quinlan’s ID3.)
Attribute Selection by Information Gain Computation

 o Class P: buys_computer = “yes”
 o Class N: buys_computer = “no”
 o I(p, n) = I(9, 5) = 0.940
 o Compute the entropy for age:

   age     pi  ni  I(pi, ni)
   <=30    2   3   0.971
   31…40   4   0   0
   >40     3   2   0.971

   E(age) = (5/14) I(2, 3) + (4/14) I(4, 0) + (5/14) I(3, 2) = 0.69

 Hence

   Gain(age) = I(p, n) - E(age) = 0.940 - 0.69 = 0.25

 Similarly

   Gain(income) = 0.029
   Gain(student) = 0.151
   Gain(credit_rating) = 0.048

AGE HAS THE MAXIMUM GAIN
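A small self-contained sketch (the per-attribute counts are read off the 14-row training set) of picking the split attribute with the maximum gain:

```python
import math

# Hypothetical sketch: choose the attribute with the highest information gain.
def info(p, n):
    return -sum(c / (p + n) * math.log2(c / (p + n)) for c in (p, n) if c)

def gain(partitions, p, n):
    e = sum((pi + ni) / (p + n) * info(pi, ni) for pi, ni in partitions)
    return info(p, n) - e

# (pi, ni) counts per attribute value in the buys_computer data (p = 9, n = 5 overall).
partitions_by_attribute = {
    "age":           [(2, 3), (4, 0), (3, 2)],  # <=30, 31..40, >40
    "income":        [(2, 2), (4, 2), (3, 1)],  # high, medium, low
    "student":       [(6, 1), (3, 4)],          # yes, no
    "credit_rating": [(6, 2), (3, 3)],          # fair, excellent
}
gains = {a: round(gain(parts, 9, 5), 2) for a, parts in partitions_by_attribute.items()}
print(gains)                      # {'age': 0.25, 'income': 0.03, 'student': 0.15, 'credit_rating': 0.05}
print(max(gains, key=gains.get))  # 'age' is chosen as the root split
```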
Splitting the samples using age

age <= 30:
income  student  credit_rating  buys_computer
high    no       fair           no
high    no       excellent      no
medium  no       fair           no
low     yes      fair           yes
medium  yes      excellent      yes

age 31…40 (all samples are ‘yes’, so this branch is labelled yes):
income  student  credit_rating  buys_computer
high    no       fair           yes
low     yes      excellent      yes
medium  no       excellent      yes
high    yes      fair           yes

age > 40:
income  student  credit_rating  buys_computer
medium  no       fair           yes
low     yes      fair           yes
low     yes      excellent      no
medium  yes      fair           yes
medium  no       excellent      no
Output: ID3 for “buys_computer”

[The resulting decision tree, as before: root tests age; age <=30 → test student (no → ‘no’, yes → ‘yes’); age 31…40 → ‘yes’; age >40 → test credit rating (fair → ‘yes’, excellent → ‘no’).]
CART [CLASSIFICATION AND REGRESSION TREES]

 The algorithm is similar to ID3 but uses the Gini index, an impurity measure, to select variables.

 If the target variable is nominal and has more than two categories, the option of merging target categories into two super-categories may be considered. This process is called twoing.
Gini Index (IBM Intelligent Miner)

If a data set T contains examples from n classes, the Gini index gini(T) is defined as

  gini(T) = 1 - Σ (j = 1 to n) pj²

where pj is the relative frequency of class j in T.
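A minimal sketch (mine) of the Gini index computed from class counts:

```python
# Hypothetical sketch of the Gini impurity of a node, from its class counts.
def gini(counts):
    total = sum(counts)
    return 1.0 - sum((c / total) ** 2 for c in counts)

print(round(gini([9, 5]), 3))  # full buys_computer set: ~0.459
print(gini([4, 0]))            # a pure node (one class only): 0.0
```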
Extracting Classification Rules from Trees

 Represent the knowledge in the form of IF-THEN rules.
 One rule is created for each path from the root to a leaf.
 Each attribute-value pair along a path forms a conjunction.
Extracting Classification Rules from Trees

 The leaf node holds the class prediction.
 Rules are easy for humans to understand.
 Example:

 IF age = “<=30” AND student = “no” THEN buys_computer = “no”
 IF age = “<=30” AND student = “yes” THEN buys_computer = “yes”
 IF age = “31…40” THEN buys_computer = “yes”
 IF age = “>40” AND credit_rating = “excellent” THEN buys_computer = “no”
 IF age = “>40” AND credit_rating = “fair” THEN buys_computer = “yes”
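A minimal sketch (the nested-dict tree encoding and helper name are my own) of producing one IF-THEN rule per root-to-leaf path:

```python
# Hypothetical sketch: walk a decision tree stored as nested dicts and emit rules.
tree = {
    "age": {
        "<=30":    {"student": {"no": "no", "yes": "yes"}},
        "31...40": "yes",
        ">40":     {"credit_rating": {"excellent": "no", "fair": "yes"}},
    }
}

def extract_rules(node, conditions=()):
    if not isinstance(node, dict):  # a leaf holds the class prediction
        conj = " AND ".join(f'{attr} = "{val}"' for attr, val in conditions)
        return [f'IF {conj} THEN buys_computer = "{node}"']
    (attr, branches), = node.items()
    rules = []
    for value, child in branches.items():
        rules += extract_rules(child, conditions + ((attr, value),))
    return rules

for rule in extract_rules(tree):
    print(rule)
```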
BAYESIAN CLASSIFICATION
Classification and Regression

 What is classification? What is regression?
 Issues regarding classification and regression
 Classification by decision tree induction
 Bayesian classification
 Other classification methods
 Regression
What is Bayesian Classification?

 Bayesian classifiers are statistical classifiers.
 For each new sample they provide a probability that the sample belongs to each of the classes.
What is Bayesian Classification?

 Example:
  o sample John (age = 27, income = high, student = no, credit_rating = fair)
  o P(John, buys_computer = yes) = 20%
  o P(John, buys_computer = no) = 80%
Naive Bayesian Classifier Example

Outlook   Temperature  Humidity  Windy  Class
sunny     hot          high      false  N
sunny     hot          high      true   N
overcast  hot          high      false  P
rain      mild         high      false  P
rain      cool         normal    false  P
rain      cool         normal    true   N
overcast  cool         normal    true   P
sunny     mild         high      false  N
sunny     cool         normal    false  P
rain      mild         normal    false  P
sunny     mild         normal    true   P
overcast  mild         high      true   P
overcast  hot          normal    false  P
rain      mild         high      true   N

Should we play tennis?
Naive Bayesian Classifier Example

The 5 samples of class N (don’t play):

Outlook  Temperature  Humidity  Windy  Class
sunny    hot          high      false  N
sunny    hot          high      true   N
rain     cool         normal    true   N
sunny    mild         high      false  N
rain     mild         high      true   N

The 9 samples of class P (play):

Outlook   Temperature  Humidity  Windy  Class
overcast  hot          high      false  P
rain      mild         high      false  P
rain      cool         normal    false  P
overcast  cool         normal    true   P
sunny     cool         normal    false  P
rain      mild         normal    false  P
sunny     mild         normal    true   P
overcast  mild         high      true   P
overcast  hot          normal    false  P
Naive Bayesian Classifier Example

Given the training set, we compute the conditional probabilities:

Outlook      P    N        Humidity  P    N
sunny        2/9  3/5      high      3/9  4/5
overcast     4/9  0        normal    6/9  1/5
rain         3/9  2/5

Temperature  P    N        Windy     P    N
hot          2/9  2/5      true      3/9  3/5
mild         4/9  2/5      false     6/9  2/5
cool         3/9  1/5

We also have the prior probabilities P(P) = 9/14 and P(N) = 5/14.
Naive Bayesian Classifier

We write P(A) for the probability of an event A, and P(A|B) for the probability of A conditional on another event B. Let H be the hypothesis (the class) and E the evidence, i.e. the combination of attribute values. Then

  P(H|E) = P(E|H) · P(H) / P(E)

Example: let H be ‘yes’ and let E be the combination of attribute values for a new day: outlook = sunny, temperature = cool, humidity = high, windy = true. Call these four pieces E1, E2, E3 and E4. If they are independent given the class, then

  P(H|E) = P(E1|H) · P(E2|H) · P(E3|H) · P(E4|H) · P(H) / P(E)
Naive Bayesian Classifier

The denominator P(E) can be dropped and recovered later by a final normalizing step, since the probabilities of the different hypotheses must sum to 1. Thus

  P(H|E) ∝ P(E1|H) · P(E2|H) · P(E3|H) · P(E4|H) · P(H)
Naive Bayesian Classifier Example

To classify a new day E: outlook = sunny, temperature = cool, humidity = high, windy = false

Prob(P|E) ∝ Prob(P) · Prob(sunny|P) · Prob(cool|P) · Prob(high|P) · Prob(false|P)
          = 9/14 · 2/9 · 3/9 · 3/9 · 6/9 = 0.01

Prob(N|E) ∝ Prob(N) · Prob(sunny|N) · Prob(cool|N) · Prob(high|N) · Prob(false|N)
          = 5/14 · 3/5 · 1/5 · 4/5 · 2/5 = 0.013
Naive Bayesian Classifier Example

Probability of ‘Playing’ = 0.01 / (0.01 + 0.013) = 43%

Probability of ‘Not Playing’ = 0.013 / (0.01 + 0.013) = 57%

Therefore E takes the class label N (don’t play).
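A minimal sketch (my own layout; the conditional probabilities and priors are the ones tabulated above) of this computation in code. With the exact fractions the normalized result is about 44% / 56%; the slide's 43% / 57% comes from rounding the two scores to 0.01 and 0.013 before normalizing:

```python
import math

# Hypothetical sketch: naive Bayesian classification of one day.
cond = {
    "P": {"sunny": 2/9, "overcast": 4/9, "rain": 3/9, "hot": 2/9, "mild": 4/9,
          "cool": 3/9, "high": 3/9, "normal": 6/9, "true": 3/9, "false": 6/9},
    "N": {"sunny": 3/5, "overcast": 0.0, "rain": 2/5, "hot": 2/5, "mild": 2/5,
          "cool": 1/5, "high": 4/5, "normal": 1/5, "true": 3/5, "false": 2/5},
}
prior = {"P": 9/14, "N": 5/14}

def classify(evidence):
    # Unnormalized score per class: prior times the product of the conditionals.
    scores = {c: prior[c] * math.prod(cond[c][e] for e in evidence) for c in prior}
    total = sum(scores.values())
    return {c: round(s / total, 2) for c, s in scores.items()}  # final normalizing step

print(classify(["sunny", "cool", "high", "false"]))  # {'P': 0.44, 'N': 0.56} -> class N
print(classify(["rain", "hot", "high", "false"]))    # {'P': 0.37, 'N': 0.63} -> class N
```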
Naive Bayesian Classifier Example

Second example: X = <rain, hot, high, false>

P(X|P) · P(P) = P(rain|P) · P(hot|P) · P(high|P) · P(false|P) · P(P)
             = 3/9 · 2/9 · 3/9 · 6/9 · 9/14 = 0.010582

P(X|N) · P(N) = P(rain|N) · P(hot|N) · P(high|N) · P(false|N) · P(N)
             = 2/5 · 2/5 · 4/5 · 2/5 · 5/14 = 0.018286

Sample X is classified in class N (don’t play).
Naive Bayesian Classifier Example

Probability of ‘Playing’ = 0.010582 / (0.010582 + 0.018286) = 37%

Probability of ‘Not Playing’ = 0.018286 / (0.010582 + 0.018286) = 63%

Therefore X takes the class label N.
REGRESSION
What is Regression?

 Regression is similar to classification
  o First, construct a model
  o Second, use the model to predict an unknown value
 The major method for regression is regression analysis
  • Linear and multiple regression
  • Non-linear regression
 Regression is different from classification
  o Classification predicts a categorical class label
  o Regression models continuous-valued functions
Predictive Modeling in Databases

 Predictive modeling: predict data values or construct generalized linear models based on the database data.
 One can only predict value ranges or category distributions.
 Determine the major factors which influence the regression
  o Data relevance analysis: uncertainty measurement, entropy analysis, expert judgement, etc.
Regression Analysis and Log-Linear Models in Regression

 Linear regression: Y = α + β X
  o Two parameters, α and β, specify the line and are to be estimated by using the data at hand.
  o Using the least-squares criterion on the known values (x1, y1), (x2, y2), …, (xs, ys):

    β = Σ (i = 1 to s) (xi - x̄)(yi - ȳ) / Σ (i = 1 to s) (xi - x̄)²
    α = ȳ - β x̄
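A minimal sketch (the data and names are made up; a and b stand for α and β) of these least-squares estimates:

```python
# Hypothetical sketch: least-squares estimates for the line Y = a + b*X.
def fit_line(xs, ys):
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    b = (sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
         / sum((x - x_bar) ** 2 for x in xs))
    a = y_bar - b * x_bar
    return a, b

# Tiny made-up sample lying near the line y = 2 + 3x.
xs = [1, 2, 3, 4, 5]
ys = [5.1, 7.9, 11.2, 13.8, 17.0]
a, b = fit_line(xs, ys)
print(round(a, 2), round(b, 2))  # 2.09 2.97, close to the underlying 2 and 3
```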
Regression Analysis and Log-Linear Models in Regression

 Multiple regression: Y = b0 + b1 X1 + b2 X2
  o Many nonlinear functions can be transformed into the form above.
  o E.g., Y = b0 + b1 X + b2 X² + b3 X³ becomes linear with X1 = X, X2 = X², X3 = X³.
 Log-linear models:
  o The multi-way table of joint probabilities is approximated by a product of lower-order tables.
  o Probability: p(a, b, c, d) = αab · βac · γad · δbcd, where each factor is a parameter of a lower-order table.
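A short sketch (made-up data and my own variable names; it relies on numpy's least-squares solver) of the transformation trick, fitting the cubic as a linear model in X1 = X, X2 = X², X3 = X³:

```python
import numpy as np

# Hypothetical sketch: fit Y = b0 + b1*X + b2*X^2 + b3*X^3 by ordinary
# linear least squares on the transformed attributes X1, X2, X3.
x = np.linspace(-2, 2, 21)
y = 1.0 + 2.0 * x - 0.5 * x**2 + 0.25 * x**3           # made-up "true" curve

A = np.column_stack([np.ones_like(x), x, x**2, x**3])   # columns: 1, X1, X2, X3
coeffs, *_ = np.linalg.lstsq(A, y, rcond=None)
print(np.round(coeffs, 2))  # [ 1.    2.   -0.5   0.25]
```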
T H A N K S !