8/12/2019 Lecturer 7
http://slidepdf.com/reader/full/lecturer-7 1/23
March 6, 2014 Data Mining: Concepts and Techniques 1
Data Mining: Concepts and Techniques
Slides for Textbook
Chapter 7
www.jntuworld.com
Chapter 7. Classification and Prediction
  What is classification? What is prediction?
  Issues regarding classification and prediction
  Classification by decision tree induction
  Bayesian Classification
  Classification by backpropagation
  Classification based on concepts from association rule mining
  Other Classification Methods
  Prediction
  Classification accuracy
  Summary
Classification vs. Prediction
Classification:
  predicts categorical class labels
  classifies data (constructs a model) based on the training set and the values (class labels) in a classifying attribute, and uses it in classifying new data
Prediction:
  models continuous-valued functions, i.e., predicts unknown or missing values
Typical Applications
  credit approval
  target marketing
  medical diagnosis
  treatment effectiveness analysis
Our example
We used clustering based on temporal patterns of visits and spending to come up with the following labels:
  Loyal-bigSpender
  Loyal-moderateSpender
  SemiLoyal-bigSpender
  SemiLoyal-moderateSpender
  Other
Our classification
We will try to predict those labels from what they bought
Use spending in 3 important categories to predict/classify the label
We use the word classify instead of predict, because prediction typically is for continuous attributes (in our book, anyway)
Classification is prediction of categories or labels
Classification: A Two-Step Process
Model construction: describing a set of predetermined classes
  Each tuple/sample is assumed to belong to a predefined class, as determined by the class label attribute
  The set of tuples used for model construction: training set
  The model is represented as classification rules, decision trees, or mathematical formulae
Model usage: for classifying future or unknown objects
  Estimate accuracy of the model
    The known label of each test sample is compared with the classified result from the model
    Accuracy rate is the percentage of test set samples that are correctly classified by the model
    Test set is independent of training set, otherwise over-fitting will occur
Classification Process (1): Model Construction

Training Data

NAME  RANK            YEARS  TENURED
Mike  Assistant Prof  3      no
Mary  Assistant Prof  7      yes
Bill  Professor       2      yes
Jim   Associate Prof  7      yes
Dave  Assistant Prof  6      no
Anne  Associate Prof  3      no

Classification Algorithms

Classifier (Model)

IF rank = 'professor'
OR years > 6
THEN tenured = 'yes'
Classification Process (2): Use the Model in Prediction

Classifier

Testing Data

NAME     RANK            YEARS  TENURED
Tom      Assistant Prof  2      no
Merlisa  Associate Prof  7      no
George   Professor       5      yes
Joseph   Assistant Prof  7      yes

Unseen Data

(Jeff, Professor, 4)
Tenured?
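The two steps can be sketched in Python. This is only an illustration: the function names are hypothetical, the rule is the one shown on the model-construction slide, and George's years value (5) is an assumption that does not affect the result because the rank test already fires for him.

```python
# Classifier (model) from the training step:
# IF rank = 'professor' OR years > 6 THEN tenured = 'yes'
def predict_tenured(rank, years):
    return "yes" if rank.lower() == "professor" or years > 6 else "no"

# Testing data: (name, rank, years, known label).
# George's years value (5) is assumed; the rule's outcome for him
# depends only on his rank.
testing_data = [
    ("Tom", "Assistant Prof", 2, "no"),
    ("Merlisa", "Associate Prof", 7, "no"),
    ("George", "Professor", 5, "yes"),
    ("Joseph", "Assistant Prof", 7, "yes"),
]

def accuracy(rows):
    # Fraction of test samples whose predicted label equals the known label.
    correct = sum(
        predict_tenured(rank, years) == label for _, rank, years, label in rows
    )
    return correct / len(rows)

test_accuracy = accuracy(testing_data)             # Merlisa is misclassified
jeff_prediction = predict_tenured("Professor", 4)  # unseen sample
print(test_accuracy, jeff_prediction)
```

On this test set the rule classifies three of the four samples correctly, and it labels the unseen sample (Jeff, Professor, 4) as tenured.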
Supervised vs. Unsupervised Learning
Supervised learning (classification)
  Supervision: The training data (observations, measurements, etc.) are accompanied by labels indicating the class of the observations
  New data is classified based on the training set
Unsupervised learning (clustering)
  The class labels of the training data are unknown
  Given a set of measurements, observations, etc., the aim is to establish the existence of classes or clusters in the data
Issues regarding classification and prediction (1): Data Preparation
Data cleaning
  Preprocess data in order to reduce noise and handle missing values
Relevance analysis (feature selection)
  Remove the irrelevant or redundant attributes
Data transformation
  Generalize and/or normalize data
Issues regarding classification and prediction (2): Evaluating Classification Methods
Predictive accuracy
Speed and scalability
  time to construct the model
  time to use the model
Robustness
  handling noise and missing values
Scalability
  efficiency in disk-resident databases
Interpretability
  understanding and insight provided by the model
Goodness of rules
  decision tree size
  compactness of classification rules
Classification by Decision Tree Induction
Decision tree
  A flow-chart-like tree structure
  Internal node denotes a test on an attribute
  Branch represents an outcome of the test
  Leaf nodes represent class labels or class distribution
Decision tree generation consists of two phases
  Tree construction
    At start, all the training examples are at the root
    Partition examples recursively based on selected attributes
  Tree pruning
    Identify and remove branches that reflect noise or outliers
Use of decision tree: Classifying an unknown sample
  Test the attribute values of the sample against the decision tree
Training Dataset

age     income  student  credit_rating  buys_computer
<=30    high    no       fair           no
<=30    high    no       excellent      no
31..40  high    no       fair           yes
>40     medium  no       fair           yes
>40     low     yes      fair           yes
>40     low     yes      excellent      no
31..40  low     yes      excellent      yes
<=30    medium  no       fair           no
<=30    low     yes      fair           yes
>40     medium  yes      fair           yes
<=30    medium  yes      excellent      yes
31..40  medium  no       excellent      yes
31..40  high    yes      fair           yes
>40     medium  no       excellent      no
Notes about the table
We have four attributes used to predict/classify whether the customer bought a computer or not
Output: A Decision Tree for "buys_computer"

age?
  <=30: student?
    no: no
    yes: yes
  31..40: yes
  >40: credit rating?
    excellent: no
    fair: yes
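Using such a tree to classify an unknown sample amounts to following one branch per test until a leaf is reached. The sketch below assumes a nested-dict representation and a hypothetical `classify` helper; the leaf labels follow the training table (for age >40, fair gives yes and excellent gives no).

```python
# Decision tree as nested dicts: an internal node is
# {"attribute": name, "branches": {value: subtree}}; a leaf is a class label.
tree = {
    "attribute": "age",
    "branches": {
        "<=30": {"attribute": "student", "branches": {"no": "no", "yes": "yes"}},
        "31..40": "yes",
        ">40": {
            "attribute": "credit_rating",
            "branches": {"excellent": "no", "fair": "yes"},
        },
    },
}

def classify(node, sample):
    # Test the sample's attribute values against the tree, top down.
    while isinstance(node, dict):
        node = node["branches"][sample[node["attribute"]]]
    return node

# Hypothetical unseen customer.
sample = {"age": "<=30", "income": "medium", "student": "yes", "credit_rating": "fair"}
label = classify(tree, sample)
print(label)
```

Only the attributes on the path from root to leaf are consulted, so `income` plays no part in this prediction.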
Attribute Selection Measure
Information gain (ID3/C4.5)
  All attributes are assumed to be categorical
  Can be modified for continuous-valued attributes
Gini index (IBM IntelligentMiner)
  All attributes are assumed continuous-valued
  Assume there exist several possible split values for each attribute
  May need other tools, such as clustering, to get the possible split values
  Can be modified for categorical attributes
Information Gain (ID3/C4.5)
Select the attribute with the highest information gain
Assume there are two classes, P and N
  Let the set of examples S contain p elements of class P and n elements of class N
  The amount of information needed to decide if an arbitrary example in S belongs to P or N is defined as

$$I(p, n) = -\frac{p}{p+n}\log_2\frac{p}{p+n} - \frac{n}{p+n}\log_2\frac{n}{p+n}$$
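As a quick numerical check of this definition, a minimal Python sketch follows. The function name `info` is hypothetical, and the convention that a zero count contributes nothing (0 log 0 = 0) is an assumption the slide does not state explicitly.

```python
from math import log2

def info(p, n):
    """I(p, n): information needed to decide between class P and class N."""
    total = p + n
    result = 0.0
    for count in (p, n):
        if count:  # treat 0 * log2(0) as 0
            frac = count / total
            result -= frac * log2(frac)
    return result

balanced = info(7, 7)   # evenly mixed set: maximum uncertainty
pure = info(4, 0)       # single-class set: nothing left to decide
mixed = info(9, 5)      # the class counts used in the later example
print(balanced, pure, round(mixed, 3))
```

An evenly mixed set needs one full bit, a pure set needs none, and 9 positives against 5 negatives needs about 0.940 bits.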
Information Gain in Decision Tree Induction
Assume that using attribute A a set S will be partitioned into sets {S1, S2, ..., Sv}
  If Si contains pi examples of P and ni examples of N, the entropy, or the expected information needed to classify objects in all subtrees Si, is

$$E(A) = \sum_{i=1}^{v} \frac{p_i + n_i}{p + n}\, I(p_i, n_i)$$

The encoding information that would be gained by branching on A is

$$Gain(A) = I(p, n) - E(A)$$
Attribute Selection by Information Gain Computation
Class P: buys_computer = "yes"
Class N: buys_computer = "no"
I(p, n) = I(9, 5) = 0.940
Compute the entropy for age:

age     pi  ni  I(pi, ni)
<=30    2   3   0.971
31..40  4   0   0
>40     3   2   0.971

$$E(age) = \frac{5}{14} I(2, 3) + \frac{4}{14} I(4, 0) + \frac{5}{14} I(3, 2) = 0.694$$

Hence

$$Gain(age) = I(p, n) - E(age) = 0.940 - 0.694 = 0.246$$

Similarly

$$Gain(income) = 0.029$$
$$Gain(student) = 0.151$$
$$Gain(credit\_rating) = 0.048$$
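These figures can be reproduced from the training table. The sketch below is an illustration with hypothetical helper names. It uses the standard 14-row version of this dataset; the final row (>40, medium, no, excellent, no) is taken from that standard version so that the class counts match p = 9 and n = 5.

```python
from collections import Counter, defaultdict
from math import log2

ATTRS = ("age", "income", "student", "credit_rating")
# (age, income, student, credit_rating, buys_computer)
ROWS = [
    ("<=30", "high", "no", "fair", "no"),
    ("<=30", "high", "no", "excellent", "no"),
    ("31..40", "high", "no", "fair", "yes"),
    (">40", "medium", "no", "fair", "yes"),
    (">40", "low", "yes", "fair", "yes"),
    (">40", "low", "yes", "excellent", "no"),
    ("31..40", "low", "yes", "excellent", "yes"),
    ("<=30", "medium", "no", "fair", "no"),
    ("<=30", "low", "yes", "fair", "yes"),
    (">40", "medium", "yes", "fair", "yes"),
    ("<=30", "medium", "yes", "excellent", "yes"),
    ("31..40", "medium", "no", "excellent", "yes"),
    ("31..40", "high", "yes", "fair", "yes"),
    (">40", "medium", "no", "excellent", "no"),  # assumed from the standard dataset
]

def entropy(labels):
    # I(p, n) generalised to the class counts found in `labels`.
    total = len(labels)
    return -sum(c / total * log2(c / total) for c in Counter(labels).values())

def gain(rows, attr_index):
    # Gain(A) = I(p, n) - E(A), where E(A) weights each partition's entropy.
    groups = defaultdict(list)
    for row in rows:
        groups[row[attr_index]].append(row[-1])
    expected = sum(len(g) / len(rows) * entropy(g) for g in groups.values())
    return entropy([row[-1] for row in rows]) - expected

gains = {name: gain(ROWS, i) for i, name in enumerate(ATTRS)}
best = max(gains, key=gains.get)
for name, value in gains.items():
    print(f"{name}: {value:.3f}")
print("split on:", best)
```

Because the code carries full precision, the third decimal can differ by one unit from the hand-rounded values on the slide; the ranking of the attributes is unaffected, and age is selected for the root.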
age   income  student  credit_rating  buys_computer
<=30  high    no       fair           no
<=30  high    no       excellent      no
<=30  medium  no       fair           no
<=30  low     yes      fair           yes
<=30  medium  yes      excellent      yes

Age is the highest-level attribute
We repeat the analysis for each value of age
Activity: Find out the gain for the remaining attributes for age <=30 and age >40
There is no need for such analysis for age in 31..40, because every example in that partition has buys_computer = yes
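Repeating the analysis inside a partition is a recursive step. The sketch below applies it to the five age <=30 rows shown above; it is a simplified ID3-style illustration with hypothetical function names, without pruning or any handling of continuous attributes.

```python
from collections import Counter, defaultdict
from math import log2

ATTRS = ["income", "student", "credit_rating"]  # age is already used at the root
# The five age <=30 rows; "label" is buys_computer.
ROWS = [
    {"income": "high", "student": "no", "credit_rating": "fair", "label": "no"},
    {"income": "high", "student": "no", "credit_rating": "excellent", "label": "no"},
    {"income": "medium", "student": "no", "credit_rating": "fair", "label": "no"},
    {"income": "low", "student": "yes", "credit_rating": "fair", "label": "yes"},
    {"income": "medium", "student": "yes", "credit_rating": "excellent", "label": "yes"},
]

def entropy(rows):
    counts = Counter(r["label"] for r in rows)
    return -sum(c / len(rows) * log2(c / len(rows)) for c in counts.values())

def partition(rows, attr):
    parts = defaultdict(list)
    for r in rows:
        parts[r[attr]].append(r)
    return parts

def gain(rows, attr):
    parts = partition(rows, attr)
    return entropy(rows) - sum(len(p) / len(rows) * entropy(p) for p in parts.values())

def build(rows, attrs):
    labels = {r["label"] for r in rows}
    if len(labels) == 1:          # pure partition: make a leaf
        return labels.pop()
    if not attrs:                 # nothing left to test: majority class
        return Counter(r["label"] for r in rows).most_common(1)[0][0]
    best = max(attrs, key=lambda a: gain(rows, a))
    rest = [a for a in attrs if a != best]
    return {best: {v: build(p, rest) for v, p in partition(rows, best).items()}}

subtree = build(ROWS, ATTRS)
print(subtree)
```

For this partition `student` separates the classes completely, which matches the student? test under the <=30 branch of the tree.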
Gini Index (IBM IntelligentMiner)
If a data set T contains examples from n classes, the gini index, gini(T), is defined as

$$gini(T) = 1 - \sum_{j=1}^{n} p_j^2$$

where p_j is the relative frequency of class j in T.
If a data set T is split into two subsets T1 and T2 with sizes N1 and N2 respectively, the gini index of the split data, gini_split(T), is defined as

$$gini_{split}(T) = \frac{N_1}{N}\, gini(T_1) + \frac{N_2}{N}\, gini(T_2)$$

where N = N1 + N2.
The attribute that provides the smallest gini_split(T) is chosen to split the node (need to enumerate all possible splitting points for each attribute).
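Both definitions translate directly into code. In the sketch below the function names are hypothetical and the class counts are illustrative values rather than figures taken from the slide.

```python
def gini(counts):
    """gini(T) = 1 - sum of squared relative class frequencies."""
    total = sum(counts)
    return 1.0 - sum((c / total) ** 2 for c in counts)

def gini_split(counts1, counts2):
    """Size-weighted gini index of a binary split into T1 and T2."""
    n1, n2 = sum(counts1), sum(counts2)
    n = n1 + n2
    return n1 / n * gini(counts1) + n2 / n * gini(counts2)

# Illustrative class counts (yes, no): 14 examples split into two halves.
before = gini([9, 5])
after = gini_split([6, 1], [3, 4])
print(round(before, 3), round(after, 3))
```

A split is worthwhile when the weighted index after the split is lower than the index of the unsplit set; among candidate splits, the one with the smallest `gini_split` is preferred.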
Extracting Classification Rules from Trees
Represent the knowledge in the form of IF-THEN rules
One rule is created for each path from the root to a leaf
Each attribute-value pair along a path forms a conjunction
The leaf node holds the class prediction
Rules are easier for humans to understand
Example
  IF age = "<=30" AND student = "no" THEN buys_computer = "no"
  IF age = "<=30" AND student = "yes" THEN buys_computer = "yes"
  IF age = "31..40" THEN buys_computer = "yes"
  IF age = ">40" AND credit_rating = "excellent" THEN buys_computer = "no"
  IF age = ">40" AND credit_rating = "fair" THEN buys_computer = "yes"
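Rule extraction is a depth-first walk that collects the attribute-value pairs along each path. The sketch below assumes a nested-dict tree and a hypothetical `extract_rules` helper; the leaf labels follow the training table (for age >40, excellent gives no and fair gives yes).

```python
# Internal node: {"attribute": name, "branches": {value: subtree}}; leaf: label.
tree = {
    "attribute": "age",
    "branches": {
        "<=30": {"attribute": "student", "branches": {"no": "no", "yes": "yes"}},
        "31..40": "yes",
        ">40": {
            "attribute": "credit_rating",
            "branches": {"excellent": "no", "fair": "yes"},
        },
    },
}

def extract_rules(node, conditions=()):
    # At a leaf, the collected conditions form the conjunction of one rule.
    if not isinstance(node, dict):
        body = " AND ".join(f'{attr} = "{value}"' for attr, value in conditions)
        return [f'IF {body} THEN buys_computer = "{node}"']
    rules = []
    for value, child in node["branches"].items():
        rules += extract_rules(child, conditions + ((node["attribute"], value),))
    return rules

rules = extract_rules(tree)
for rule in rules:
    print(rule)
```

The tree has five leaves, so the walk yields five rules, one per root-to-leaf path.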