
Lecturer 7


Page 1: Lecturer 7

8/12/2019 Lecturer 7

http://slidepdf.com/reader/full/lecturer-7 1/23

March 6, 2014 Data Mining: Concepts and Techniques 1

Data Mining: Concepts and Techniques

Slides for Textbook

Chapter 7

www.jntuworld.com

Page 2: Lecturer 7


Chapter 7. Classification and Prediction

What is classification? What is prediction?
Issues regarding classification and prediction
Classification by decision tree induction
Bayesian classification
Classification by backpropagation
Classification based on concepts from association rule mining
Other classification methods
Prediction
Classification accuracy
Summary

Page 3: Lecturer 7


Classification vs. Prediction

Classification:
  predicts categorical class labels
  classifies data (constructs a model) based on the training set and the values (class labels) in a classifying attribute, and uses it in classifying new data
Prediction:
  models continuous-valued functions, i.e., predicts unknown or missing values
Typical applications:
  credit approval
  target marketing
  medical diagnosis
  treatment effectiveness analysis

Page 4: Lecturer 7


Our example

We used clustering based on temporal patterns of visits and spending to come up with the following labels:
  Loyal-bigSpender
  Loyal-moderateSpender
  SemiLoyal-bigSpender
  SemiLoyal-moderateSpender
  Other

Page 5: Lecturer 7


Our classification

We will try to predict those labels from what they bought
Use spending in 3 important categories to predict/classify the label
We use the word classify instead of predict, because prediction typically is for continuous attributes (in our book, anyway)
Classification is prediction of categories or labels

Page 6: Lecturer 7


Classification: A Two-Step Process

Model construction: describing a set of predetermined classes
  Each tuple/sample is assumed to belong to a predefined class, as determined by the class label attribute
  The set of tuples used for model construction: training set
  The model is represented as classification rules, decision trees, or mathematical formulae
Model usage: for classifying future or unknown objects
  Estimate accuracy of the model
    The known label of each test sample is compared with the classified result from the model
    Accuracy rate is the percentage of test set samples that are correctly classified by the model
    The test set must be independent of the training set; otherwise the accuracy estimate is overly optimistic and over-fitting goes undetected
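The accuracy-estimation step can be sketched in Python as follows. This is only an illustration under stated assumptions: the helper names `holdout_split` and `accuracy_rate`, the toy records, and the threshold "model" are all hypothetical and do not come from the slides.

```python
import random

def holdout_split(records, test_fraction=0.3, seed=0):
    """Shuffle labeled records and split them into disjoint training and test sets."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)
    n_test = max(1, int(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]

def accuracy_rate(classify, test_set):
    """Fraction of test samples whose known label matches the classified result."""
    correct = sum(1 for features, label in test_set if classify(features) == label)
    return correct / len(test_set)

# Hypothetical data: (features, label) pairs where the label depends on one number.
records = [({"x": i}, "high" if i >= 5 else "low") for i in range(10)]
train, test = holdout_split(records)

# Stand-in model (an assumption): a threshold halfway between the class means
# of the training set. Only the training set is used to build it.
low = [f["x"] for f, y in train if y == "low"]
high = [f["x"] for f, y in train if y == "high"]
threshold = (sum(low) / len(low) + sum(high) / len(high)) / 2

def classify(features):
    return "high" if features["x"] >= threshold else "low"

print(len(train), len(test), accuracy_rate(classify, test))
```

Because the test records never take part in choosing the threshold, the printed accuracy is an estimate on unseen data rather than a measure of how well the model memorized its training set.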

Page 7: Lecturer 7


Classification Process (1): Model Construction

Training Data -> Classification Algorithms -> Classifier (Model)

Training Data:

NAME   RANK            YEARS  TENURED
Mike   Assistant Prof  3      no
Mary   Assistant Prof  7      yes
Bill   Professor       2      yes
Jim    Associate Prof  7      yes
Dave   Assistant Prof  6      no
Anne   Associate Prof  3      no

Classifier (Model):

IF rank = 'professor'
OR years > 6
THEN tenured = 'yes'
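As a rough sketch, the learned rule can be written as a small Python function and checked against the six training tuples. The function name `classify_tenured` is hypothetical, and returning 'no' when the rule does not fire is an assumption: the slide states only the 'yes' branch.

```python
# Training data from the slide: (name, rank, years, tenured)
training_data = [
    ("Mike", "Assistant Prof", 3, "no"),
    ("Mary", "Assistant Prof", 7, "yes"),
    ("Bill", "Professor", 2, "yes"),
    ("Jim", "Associate Prof", 7, "yes"),
    ("Dave", "Assistant Prof", 6, "no"),
    ("Anne", "Associate Prof", 3, "no"),
]

def classify_tenured(rank, years):
    """IF rank = 'professor' OR years > 6 THEN 'yes'; otherwise 'no' (assumed default)."""
    return "yes" if rank.lower() == "professor" or years > 6 else "no"

# Names of training tuples that the rule classifies incorrectly.
mismatches = [name for name, rank, years, tenured in training_data
              if classify_tenured(rank, years) != tenured]
print(mismatches)  # an empty list means the rule fits every training tuple
```

A perfect fit on the training data says nothing yet about accuracy on new data; that is what the separate testing data is for.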

Page 8: Lecturer 7


Classification Process (2): Use the Model in Prediction

Testing Data -> Classifier

Testing Data:

NAME     RANK            YEARS  TENURED
Tom      Assistant Prof  2      no
Merlisa  Associate Prof  7      no
George   Professor       5      yes
Joseph   Assistant Prof  7      yes

Unseen Data -> Classifier

(Jeff, Professor, 4)

Tenured?
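A minimal sketch of the model-usage step, assuming the rule from the model-construction slide (IF rank = 'professor' OR years > 6 THEN tenured = 'yes') with 'no' as an assumed default. The function name is hypothetical. George's YEARS value is illegible in the extracted slide and is taken to be 5 here; it does not affect the outcome because the rank test already fires for him.

```python
def classify_tenured(rank, years):
    """Rule from the model-construction slide; 'no' is an assumed default."""
    return "yes" if rank.lower() == "professor" or years > 6 else "no"

# Testing data: (name, rank, years, known tenured label)
testing_data = [
    ("Tom", "Assistant Prof", 2, "no"),
    ("Merlisa", "Associate Prof", 7, "no"),
    ("George", "Professor", 5, "yes"),   # YEARS assumed to be 5
    ("Joseph", "Assistant Prof", 7, "yes"),
]

# Compare the known label of each test tuple with the classified result.
correct = sum(classify_tenured(rank, years) == tenured
              for _, rank, years, tenured in testing_data)
accuracy = correct / len(testing_data)
print(accuracy)  # 0.75

# Classify the unseen tuple (Jeff, Professor, 4).
print(classify_tenured("Professor", 4))  # yes
```

The one miss is Merlisa: the rule predicts 'yes' because her years exceed 6, but her known label is 'no', so the accuracy rate on this test set is 3 out of 4.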

Page 9: Lecturer 7


Supervised vs. Unsupervised Learning

Supervised learning (classification)
  Supervision: the training data (observations, measurements, etc.) are accompanied by labels indicating the class of the observations
  New data is classified based on the training set
Unsupervised learning (clustering)
  The class labels of the training data are unknown
  Given a set of measurements, observations, etc., the aim is to establish the existence of classes or clusters in the data

Page 10: Lecturer 7


Issues regarding classification and prediction (1): Data Preparation

Data cleaning
  Preprocess data in order to reduce noise and handle missing values
Relevance analysis (feature selection)
  Remove the irrelevant or redundant attributes
Data transformation
  Generalize and/or normalize data
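The three preparation steps might look like this in plain Python. This is a sketch under assumptions: the slides do not prescribe any particular technique, so mean imputation and min-max scaling are chosen here only as common examples, and all names and values are hypothetical.

```python
def fill_missing_with_mean(values):
    """Data cleaning: replace missing entries (None) with the mean of the known ones."""
    known = [v for v in values if v is not None]
    mean = sum(known) / len(known)
    return [mean if v is None else v for v in values]

def drop_attributes(records, unwanted):
    """Relevance analysis: remove attributes judged irrelevant or redundant."""
    return [{k: v for k, v in r.items() if k not in unwanted} for r in records]

def min_max_normalize(values):
    """Data transformation: rescale values linearly into the range [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

incomes = [30000, None, 50000, 70000]          # hypothetical attribute values
cleaned = fill_missing_with_mean(incomes)
normalized = min_max_normalize(cleaned)
print(cleaned)     # [30000, 50000.0, 50000, 70000]
print(normalized)  # [0.0, 0.5, 0.5, 1.0]

records = [{"id": 1, "income": 30000}, {"id": 2, "income": 50000}]
print(drop_attributes(records, {"id"}))  # [{'income': 30000}, {'income': 50000}]
```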

Page 11: Lecturer 7


Issues regarding classification and prediction (2): Evaluating Classification Methods

Predictive accuracy
Speed and scalability
  time to construct the model
  time to use the model
Robustness
  handling noise and missing values
Scalability
  efficiency in disk-resident databases
Interpretability
  understanding and insight provided by the model
Goodness of rules
  decision tree size
  compactness of classification rules

Page 12: Lecturer 7


Classification by Decision Tree Induction

Decision tree
  A flow-chart-like tree structure
  Internal node denotes a test on an attribute
  Branch represents an outcome of the test
  Leaf nodes represent class labels or class distribution
Decision tree generation consists of two phases
  Tree construction
    At start, all the training examples are at the root
    Partition examples recursively based on selected attributes
  Tree pruning
    Identify and remove branches that reflect noise or outliers
Use of decision tree: classifying an unknown sample
  Test the attribute values of the sample against the decision tree
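The tree-construction phase can be sketched as a short recursive function. This is a simplified illustration, not the exact ID3 or C4.5 procedure: it assumes categorical attributes, selects the attribute with the lowest expected information (equivalently, the highest information gain), and omits pruning entirely. The function names and the four toy rows are hypothetical.

```python
import math
from collections import Counter

def entropy(labels):
    """Information (in bits) needed to identify the class of one label in the list."""
    total = len(labels)
    return sum(-c / total * math.log2(c / total) for c in Counter(labels).values())

def build_tree(rows, attributes, target):
    """Recursively partition rows; returns a nested dict, or a class label at a leaf."""
    labels = [r[target] for r in rows]
    # Stop when the node is pure or no attributes remain; the leaf holds the majority label.
    if len(set(labels)) == 1 or not attributes:
        return Counter(labels).most_common(1)[0][0]

    def expected_info(attr):
        # Weighted entropy of the partitions induced by attr.
        values = Counter(r[attr] for r in rows)
        return sum(n / len(rows) * entropy([r[target] for r in rows if r[attr] == v])
                   for v, n in values.items())

    best = min(attributes, key=expected_info)  # lowest expected info = highest gain
    remaining = [a for a in attributes if a != best]
    return {best: {v: build_tree([r for r in rows if r[best] == v], remaining, target)
                   for v in sorted(set(r[best] for r in rows))}}

# Hypothetical toy data, not taken from the slides.
rows = [
    {"student": "no", "credit": "fair", "buys": "no"},
    {"student": "no", "credit": "excellent", "buys": "no"},
    {"student": "yes", "credit": "fair", "buys": "yes"},
    {"student": "yes", "credit": "excellent", "buys": "yes"},
]
tree = build_tree(rows, ["student", "credit"], "buys")
print(tree)  # {'student': {'no': 'no', 'yes': 'yes'}}
```

At the start all four rows sit at the root; splitting on `student` yields two pure partitions, so recursion stops after one level.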

Page 13: Lecturer 7


Training Dataset

age      income  student  credit_rating  buys_computer
<=30     high    no       fair           no
<=30     high    no       excellent      no
31...40  high    no       fair           yes
>40      medium  no       fair           yes
>40      low     yes      fair           yes
>40      low     yes      excellent      no
31...40  low     yes      excellent      yes
<=30     medium  no       fair           no
<=30     low     yes      fair           yes
>40      medium  yes      fair           yes
<=30     medium  yes      excellent      yes
31...40  medium  no       excellent      yes
31...40  high    yes      fair           yes

Page 14: Lecturer 7


Notes about the table

We have four attributes used to predict/classify whether the customer bought a computer or not

Page 15: Lecturer 7


Output: A Decision Tree for "buys_computer"

age?
  <=30: student?
    no: no
    yes: yes
  31...40: yes
  >40: credit rating?
    excellent: no
    fair: yes
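One way to hold such a tree in code is a nested dictionary, with a loop that tests the sample's attribute values from the root down to a leaf. This is a sketch: the dictionary layout and the name `classify` are assumptions, and the leaf labels under credit rating (fair gives yes, excellent gives no) are read from the training dataset, since the figure did not survive extraction cleanly.

```python
# Internal nodes are {attribute: {value: subtree}}; leaves are class labels.
decision_tree = {
    "age": {
        "<=30": {"student": {"no": "no", "yes": "yes"}},
        "31...40": "yes",
        ">40": {"credit_rating": {"excellent": "no", "fair": "yes"}},
    }
}

def classify(tree, sample):
    """Follow the branch matching the sample's value at each node until a leaf is reached."""
    while isinstance(tree, dict):
        attribute = next(iter(tree))            # the attribute tested at this node
        tree = tree[attribute][sample[attribute]]
    return tree

sample = {"age": "<=30", "income": "medium", "student": "yes", "credit_rating": "fair"}
print(classify(decision_tree, sample))  # yes
```

Attributes that never appear on the path, such as `income` here, are simply ignored during classification.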


Page 17: Lecturer 7


Attribute Selection Measure

Information gain (ID3/C4.5)
  All attributes are assumed to be categorical
  Can be modified for continuous-valued attributes
Gini index (IBM IntelligentMiner)
  All attributes are assumed continuous-valued
  Assume there exist several possible split values for each attribute
  May need other tools, such as clustering, to get the possible split values
  Can be modified for categorical attributes

Page 18: Lecturer 7


Information Gain (ID3/C4.5)

Select the attribute with the highest information gain
Assume there are two classes, P and N
  Let the set of examples S contain p elements of class P and n elements of class N
  The amount of information needed to decide if an arbitrary example in S belongs to P or N is defined as

$$I(p, n) = -\frac{p}{p+n}\log_2\frac{p}{p+n} - \frac{n}{p+n}\log_2\frac{n}{p+n}$$
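The definition of I(p, n) translates directly into a few lines of Python. This is a minimal sketch: the function name `info` is hypothetical, and treating 0 * log2(0) as 0 is the usual convention, adopted here as an assumption since the slide does not address empty classes.

```python
import math

def info(p, n):
    """I(p, n): bits needed to decide whether an example belongs to class P or N."""
    total = p + n
    result = 0.0
    for count in (p, n):
        if count:  # skip empty classes: 0 * log2(0) is treated as 0
            fraction = count / total
            result -= fraction * math.log2(fraction)
    return result

print(info(1, 1))            # 1.0  (evenly mixed set: one full bit)
print(info(4, 0))            # 0.0  (pure set: no information needed)
print(round(info(9, 5), 3))  # 0.94 (the class counts used later in the deck)
```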

Page 19: Lecturer 7


Information Gain in Decision Tree Induction

Assume that using attribute A a set S will be partitioned into sets {S_1, S_2, ..., S_v}
  If S_i contains p_i examples of P and n_i examples of N, the entropy, or the expected information needed to classify objects in all subtrees S_i, is

$$E(A) = \sum_{i=1}^{v} \frac{p_i + n_i}{p + n}\, I(p_i, n_i)$$

The encoding information that would be gained by branching on A is

$$\mathrm{Gain}(A) = I(p, n) - E(A)$$
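E(A) and Gain(A) can be computed from nothing more than the per-partition class counts. The sketch below assumes each partition is given as a pair (p_i, n_i); the function names and the example counts are hypothetical.

```python
import math

def info(p, n):
    """I(p, n), with 0 * log2(0) treated as 0."""
    total = p + n
    return sum(-c / total * math.log2(c / total) for c in (p, n) if c)

def expected_info(partitions):
    """E(A): weighted average of I(p_i, n_i) over the partitions S_1..S_v."""
    p = sum(pi for pi, _ in partitions)
    n = sum(ni for _, ni in partitions)
    return sum((pi + ni) / (p + n) * info(pi, ni) for pi, ni in partitions)

def gain(partitions):
    """Gain(A) = I(p, n) - E(A)."""
    p = sum(pi for pi, _ in partitions)
    n = sum(ni for _, ni in partitions)
    return info(p, n) - expected_info(partitions)

# Hypothetical attribute that separates the classes only partly.
print(round(gain([(3, 1), (1, 3)]), 3))  # 0.189
# Hypothetical attribute that separates the classes perfectly.
print(gain([(4, 0), (0, 4)]))            # 1.0
```

A perfect split recovers all of I(p, n), while an attribute whose partitions have the same class mix as S yields a gain of zero.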

Page 20: Lecturer 7


Attribute Selection by Information Gain Computation

Class P: buys_computer = "yes"
Class N: buys_computer = "no"

$$I(p, n) = I(9, 5) = 0.940$$

Compute the entropy for age:

age      p_i  n_i  I(p_i, n_i)
<=30     2    3    0.971
31...40  4    0    0
>40      3    2    0.971

$$E(\mathit{age}) = \frac{5}{14} I(2,3) + \frac{4}{14} I(4,0) + \frac{5}{14} I(3,2) = 0.69$$

Hence

$$\mathrm{Gain}(\mathit{age}) = I(p, n) - E(\mathit{age})$$

Similarly

$$\mathrm{Gain}(\mathit{income}) = 0.029$$
$$\mathrm{Gain}(\mathit{student}) = 0.151$$
$$\mathrm{Gain}(\mathit{credit\_rating}) = 0.048$$
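Spelling out the arithmetic for age may help; the intermediate values below are rounded, so the last digit can differ slightly from an exact calculation.

$$I(9,5) = -\frac{9}{14}\log_2\frac{9}{14} - \frac{5}{14}\log_2\frac{5}{14} \approx 0.4098 + 0.5305 \approx 0.940$$

$$I(2,3) = I(3,2) = -\frac{2}{5}\log_2\frac{2}{5} - \frac{3}{5}\log_2\frac{3}{5} \approx 0.5288 + 0.4422 \approx 0.971, \qquad I(4,0) = 0$$

$$E(\mathit{age}) = \frac{5}{14}(0.971) + \frac{4}{14}(0) + \frac{5}{14}(0.971) = \frac{10}{14}(0.971) \approx 0.694$$

$$\mathrm{Gain}(\mathit{age}) = I(9,5) - E(\mathit{age}) \approx 0.940 - 0.694 = 0.246$$

Since 0.246 exceeds the gains of student (0.151), credit_rating (0.048), and income (0.029), age is the attribute selected at the root.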

Page 21: Lecturer 7


age   income  student  credit_rating  buys_computer
<=30  high    no       fair           no
<=30  high    no       excellent      no
<=30  medium  no       fair           no
<=30  low     yes      fair           yes
<=30  medium  yes      excellent      yes

Age is the highest-level attribute (the root of the tree)
We repeat the analysis for each value of age
Activity: Find the gain of the remaining attributes for age <=30 and age >40
There is no need for such analysis for age in 31...40, because every example in that partition belongs to class "yes"
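For the age <=30 branch, the activity can be checked with a short script over the five tuples listed in the table. This is a sketch: the helper names are hypothetical, and the age >40 branch is left out.

```python
import math
from collections import Counter

# The five age <=30 tuples: (income, student, credit_rating, buys_computer)
rows = [
    ("high", "no", "fair", "no"),
    ("high", "no", "excellent", "no"),
    ("medium", "no", "fair", "no"),
    ("low", "yes", "fair", "yes"),
    ("medium", "yes", "excellent", "yes"),
]
COLUMNS = {"income": 0, "student": 1, "credit_rating": 2}

def entropy(labels):
    """Bits needed to identify the class of one label drawn from the list."""
    total = len(labels)
    return sum(-c / total * math.log2(c / total) for c in Counter(labels).values())

def gain(rows, col):
    """Information gain of splitting rows on the attribute stored in column col."""
    labels = [r[-1] for r in rows]
    expected = 0.0
    for value in set(r[col] for r in rows):
        subset = [r[-1] for r in rows if r[col] == value]
        expected += len(subset) / len(rows) * entropy(subset)
    return entropy(labels) - expected

gains = {name: gain(rows, col) for name, col in COLUMNS.items()}
for name, value in gains.items():
    print(name, round(value, 3))
best = max(gains, key=gains.get)
print(best)  # student
```

Within this partition student separates the classes completely (all non-students are "no", all students are "yes"), so its gain equals the full I(2, 3) of about 0.971 and it becomes the test under the age <=30 branch.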

Page 22: Lecturer 7


Gini Index (IBM IntelligentMiner)

If a data set T contains examples from n classes, the gini index gini(T) is defined as

$$\mathrm{gini}(T) = 1 - \sum_{j=1}^{n} p_j^2$$

where p_j is the relative frequency of class j in T.

If a data set T is split into two subsets T_1 and T_2 with sizes N_1 and N_2 respectively, the gini index of the split data is defined as

$$\mathrm{gini}_{split}(T) = \frac{N_1}{N}\,\mathrm{gini}(T_1) + \frac{N_2}{N}\,\mathrm{gini}(T_2)$$

where N = N_1 + N_2.

The attribute that provides the smallest gini_split(T) is chosen to split the node (need to enumerate all possible splitting points for each attribute).
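Both gini formulas can be sketched from per-class counts. The function names are hypothetical, and the example split, which groups the age table's 31...40 partition (4 yes, 0 no) against the other two partitions combined (5 yes, 5 no), is only an illustrative choice of binary split.

```python
def gini(counts):
    """gini(T) = 1 - sum of squared relative class frequencies."""
    total = sum(counts)
    return 1.0 - sum((c / total) ** 2 for c in counts)

def gini_split(counts1, counts2):
    """Size-weighted gini index of a binary split of T into T1 and T2."""
    n1, n2 = sum(counts1), sum(counts2)
    n = n1 + n2
    return n1 / n * gini(counts1) + n2 / n * gini(counts2)

# Class counts are [yes, no].
print(round(gini([9, 5]), 3))                 # 0.459 before splitting
print(round(gini_split([4, 0], [5, 5]), 3))   # 0.357 after the illustrative split
```

To choose a split, the same calculation would be repeated for every candidate splitting point of every attribute, keeping the one with the smallest result.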

Page 23: Lecturer 7


Extracting Classification Rules from Trees

Represent the knowledge in the form of IF-THEN rules
One rule is created for each path from the root to a leaf
Each attribute-value pair along a path forms a conjunction
The leaf node holds the class prediction
Rules are easier for humans to understand
Example

IF age = "<=30" AND student = "no" THEN buys_computer = "no"

IF age = "<=30" AND student = "yes" THEN buys_computer = "yes"

IF age = "31...40" THEN buys_computer = "yes"

IF age = ">40" AND credit_rating = "excellent" THEN buys_computer = "no"

IF age = ">40" AND credit_rating = "fair" THEN buys_computer = "yes"
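Rule extraction amounts to enumerating root-to-leaf paths. The sketch below assumes the tree is stored as a nested dictionary (a hypothetical representation), with leaf labels for the age >40 branch taken from the training dataset, where fair credit goes with "yes" and excellent credit with "no".

```python
decision_tree = {
    "age": {
        "<=30": {"student": {"no": "no", "yes": "yes"}},
        "31...40": "yes",
        ">40": {"credit_rating": {"excellent": "no", "fair": "yes"}},
    }
}

def extract_rules(tree, target, conditions=()):
    """Return one IF-THEN rule per path from the root to a leaf."""
    if not isinstance(tree, dict):
        # Leaf: the attribute-value pairs collected along the path form a conjunction.
        clause = " AND ".join(f'{attr} = "{value}"' for attr, value in conditions)
        return [f'IF {clause} THEN {target} = "{tree}"']
    attribute = next(iter(tree))
    rules = []
    for value, subtree in tree[attribute].items():
        rules.extend(extract_rules(subtree, target, conditions + ((attribute, value),)))
    return rules

rules = extract_rules(decision_tree, "buys_computer")
for rule in rules:
    print(rule)
```

The tree has five leaves, so five rules are produced, one per path.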