Ch3 - Classification Based on Decision Trees (Phan Lop Dua Vao Cay Quyet Dinh)


  • 7/30/2019 Ch3 - Phan Lop Dua Vao Cay Quyet Dinh


    Chapter 3: CLASSIFICATION BASED ON DECISION TREES

    DATA MINING


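    CLASSIFICATION

    Classification assigns an object to one of a set of known classes. It is a supervised learning problem.

    Given a database D = {t1, t2, ..., tn} and a set of classes C = {C1, C2, ..., Cm}, where each class is written Ci = {t, Ci(t)}, classification is the problem of determining a mapping f: D → C such that for every ti there is a Cj with ti ∈ Cj.

    Classification process:
    - Step 1: Determine a function y = f(X) that assigns a class label to an object X. The function f can be represented by rules, mathematical formulas, or a decision tree.
    - Step 2: Use f to classify (label) objects whose class is not yet known.

    PREDICTION

    Prediction determines missing or unknown values by means of classification functions built from a sample set.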


    DECISION TREE

    A decision tree is a predictive model: a mapping from the observed data about an object or phenomenon to conclusions about its target value.

    - Each internal node corresponds to a variable; each edge from an internal node to a child node represents a specific value of that variable.
    - Each leaf node represents a predicted value of the target variable, given the variable values along the path from the root to that leaf.


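    BUILDING A DECISION TREE

    Tree construction:
    - Recursively partition the training samples until the objects/samples at each leaf node belong to the same class.
    - Discretize attributes that are not already categorical.
    - All training samples start at the root of the tree.
    - Select an attribute (using a statistical measure or a heuristic) to split the training samples into branches.
    - Repeat the construction on each branch. The recursion stops when:
      - all samples have been classified (they belong to a leaf node), or
      - no attribute remains that can be used to split the samples further.

    Tree pruning: pruning merges a subtree into a single leaf node.

    Tree evaluation: assess the accuracy of the resulting tree. The evaluation criterion is the ratio of the number of correctly classified objects to the total number of objects to be classified.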


    BUILDING THE TREE

    Use a greedy strategy: split the records on the test attribute that is optimal according to some criterion.

    Questions to resolve:
    - How should the records be split?
      - How is the attribute test condition specified?
      - How is the best split determined?
    - When should splitting stop?


    THE TEST CONDITION DEPENDS ON THE ATTRIBUTE TYPE
    - Nominal
    - Ordinal
    - Continuous

    IT ALSO DEPENDS ON THE NUMBER OF BRANCHES
    - Two branches (binary split)
    - Multiple branches (multi-way split)


    Splitting on Nominal and Ordinal Attributes

    - Multi-way split: use one branch per distinct value.
    - Binary split: divide the set of values into two subsets; the optimal partition has to be found.

    Examples:
    - CarType, multi-way split: Family | Sports | Luxury
    - CarType, binary split: {Family, Luxury} vs. {Sports}, or {Sports, Luxury} vs. {Family}
    - Size, multi-way split: Small | Medium | Large
    - Size, binary split: {Medium, Large} vs. {Small}, or {Small, Medium} vs. {Large}
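Finding the optimal binary partition means enumerating the candidate partitions first. The sketch below is one way to do that in Python; the function names `binary_partitions` and `ordinal_partitions` are illustrative and not taken from the slides. For a nominal attribute with k values there are 2^(k-1) - 1 candidate partitions; for an ordinal attribute the sketch assumes only the k - 1 splits that keep the value order intact are of interest.

```python
from itertools import combinations


def binary_partitions(values):
    """Yield every split of a set of nominal values into two non-empty groups."""
    values = sorted(values)
    first, rest = values[0], values[1:]
    # Pin the first value to the left group so mirrored pairs are not repeated.
    for size in range(len(rest)):
        for extra in combinations(rest, size):
            left = {first, *extra}
            right = set(values) - left
            yield left, right


def ordinal_partitions(ordered_values):
    """Yield only the binary splits that cut an ordered scale at one point."""
    for cut in range(1, len(ordered_values)):
        yield set(ordered_values[:cut]), set(ordered_values[cut:])


if __name__ == "__main__":
    for left, right in binary_partitions(["Family", "Sports", "Luxury"]):
        print(sorted(left), "vs.", sorted(right))
    for left, right in ordinal_partitions(["Small", "Medium", "Large"]):
        print(sorted(left), "vs.", sorted(right))
```

Each candidate would then be scored with an impurity measure, and the partition with the best score kept.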


    Splitting on Continuous Attributes

    There are different ways to handle a continuous attribute:
    - Discretization into an ordinal categorical attribute
      - Static: discretize once, at the start of processing
      - Dynamic: find consecutive ranges (buckets)
    - Binary decision: (A < v) or (A ≥ v)
      - Consider all possible splits and pick the best partition
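Both options can be sketched in a few lines of Python. The bucket edges (10K, 25K, 50K, 80K) are those of the Taxable Income multi-way split example; sending a value of exactly 80K to the top bucket is an assumption, as are the names `bucket` and `candidate_thresholds`. For the binary decision, the sketch assumes the common choice of taking candidate thresholds v at the midpoints between consecutive distinct values.

```python
from bisect import bisect_right

# Bucket edges in thousands, taken from the Taxable Income example.
EDGES = [10, 25, 50, 80]
LABELS = ["< 10K", "[10K,25K)", "[25K,50K)", "[50K,80K)", ">= 80K"]


def bucket(income_k):
    """Static discretization: map an income (in thousands) to its bucket label."""
    return LABELS[bisect_right(EDGES, income_k)]


def candidate_thresholds(values):
    """Binary decision (A < v): midpoints between consecutive distinct values."""
    distinct = sorted(set(values))
    return [(a + b) / 2 for a, b in zip(distinct, distinct[1:])]


if __name__ == "__main__":
    print(bucket(5), bucket(10), bucket(79.9), bucket(80))
    print(candidate_thresholds([60, 70, 70, 85]))
```

Each candidate threshold would then be evaluated with an impurity measure to find the best partition.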

    Examples for the attribute Taxable Income:
    (i) Binary split: Taxable Income > 80K? (Yes / No)
    (ii) Multi-way split: Taxable Income? (< 10K, [10K,25K), [25K,50K), [50K,80K), > 80K)


    The best split is determined using the following measures:
    - Entropy
    - Information gain (Gain)
    - GINI index


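    The entropy of a node t, where p(j|t) is the relative frequency of class j at node t:

    $$\mathrm{Entropy}(t) = -\sum_{j} p(j \mid t)\,\log_2 p(j \mid t)$$

    Examples (with the convention 0 log2 0 = 0):

    Node with C1 = 0, C2 = 6:
    P(C1) = 0/6 = 0, P(C2) = 6/6 = 1
    Entropy = -0 log2 0 - 1 log2 1 = -0 - 0 = 0

    Node with C1 = 1, C2 = 5:
    P(C1) = 1/6, P(C2) = 5/6
    Entropy = -(1/6) log2 (1/6) - (5/6) log2 (5/6) = 0.65

    Node with C1 = 2, C2 = 4:
    P(C1) = 2/6, P(C2) = 4/6
    Entropy = -(2/6) log2 (2/6) - (4/6) log2 (4/6) = 0.92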
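The three node examples can be checked with a few lines of Python. This is a minimal sketch: the function name `entropy` and the choice to pass a node as a list of class counts are illustrative assumptions. Classes with a zero count are skipped, which implements the convention 0 log2 0 = 0.

```python
from math import log2


def entropy(counts):
    """Entropy of a node given its class counts, e.g. [1, 5]."""
    total = sum(counts)
    # p * log2(1/p) equals -p * log2(p); zero counts contribute nothing.
    return sum((c / total) * log2(total / c) for c in counts if c > 0)


if __name__ == "__main__":
    for counts in ([0, 6], [1, 5], [2, 4]):
        print(counts, round(entropy(counts), 2))
```

The printed values should be 0.0, 0.65 and 0.92, matching the three nodes.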


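    Comparing two candidate splits

    Before splitting: the parent node has class counts C0 = N00 and C1 = N01, and impurity M0.

    Split on A? (Yes / No):
    - Node N1: C0 = N10, C1 = N11, impurity M1
    - Node N2: C0 = N20, C1 = N21, impurity M2
    - Impurity after the split, combined from M1 and M2: M12

    Split on B? (Yes / No):
    - Node N3: C0 = N30, C1 = N31, impurity M3
    - Node N4: C0 = N40, C1 = N41, impurity M4
    - Impurity after the split, combined from M3 and M4: M34

    Compare Gain = M0 - M12 with Gain = M0 - M34.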
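The comparison of M0 - M12 against M0 - M34 can be written out explicitly. The following is a sketch of the usual definition of information gain, assuming entropy is the impurity measure M and each child node is weighted by its share of the parent's records. The symbols are introduced here for illustration: the parent p holds n records and is split into k children, child i holding n_i records.

```latex
\[
M_{12} = \frac{n_1}{n}\,M_1 + \frac{n_2}{n}\,M_2,
\qquad
\mathrm{Gain}_{\text{split}}
  = \mathrm{Entropy}(p) - \sum_{i=1}^{k} \frac{n_i}{n}\,\mathrm{Entropy}(i)
\]
```

Under this definition, the split with the larger gain is the one that leaves the lower weighted impurity in the children.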


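    GINI INDEX

    The GINI index of a node t:

    $$\mathrm{GINI}(t) = 1 - \sum_{j} [p(j \mid t)]^2$$

    where p(j|t) is the relative frequency of class j at node t.

    - The maximum is 1 - 1/nc, reached when the samples are evenly distributed over the nc classes.
    - The minimum is 0, reached when all samples belong to a single class.

    When a node p is split into k branches, the quality of the split is computed as:

    $$\mathrm{GINI}_{\text{split}} = \sum_{i=1}^{k} \frac{n_i}{n}\,\mathrm{GINI}(i)$$

    where n_i is the number of samples at child node i and n is the number of samples at node p.

    The attribute with the smallest GINI_split is chosen for the split.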


    Example: consider a split on a binary attribute A.

    Parent node p:
        C1    7
        C2    3
        Gini = 0.42

    Splitting on A gives two child nodes:
              N1    N2
        C1    3     4
        C2    0     3
        Gini_split = 0.343

    Gini(N1) = 1 - (3/3)^2 - (0/3)^2 = 0
    Gini(N2) = 1 - (4/7)^2 - (3/7)^2 = 0.490
    Gini_split = 3/10 × 0 + 7/10 × 0.490 = 0.343
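The arithmetic of this example can be reproduced with a short Python sketch. The function names `gini` and `gini_split`, and the representation of each node as a list of class counts, are illustrative assumptions.

```python
def gini(counts):
    """GINI index of a node given its class counts, e.g. [7, 3]."""
    total = sum(counts)
    return 1.0 - sum((c / total) ** 2 for c in counts)


def gini_split(children):
    """Weighted GINI of a split; children is a list of class-count lists."""
    n = sum(sum(child) for child in children)
    return sum(sum(child) / n * gini(child) for child in children)


if __name__ == "__main__":
    print(round(gini([7, 3]), 3))                  # parent node p
    print(round(gini([3, 0]), 3))                  # child N1
    print(round(gini([4, 3]), 3))                  # child N2
    print(round(gini_split([[3, 0], [4, 3]]), 3))  # quality of the split on A
```

The split lowers the index from 0.42 at the parent to about 0.343 across the children. Note that `gini` as written expects a non-empty node; an empty child would need a guard against division by zero.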


    BUILDING THE TREE

    The last question raised by the greedy strategy is when to stop splitting. Splitting stops at a node when:
    - all records at that node belong to the same class, or
    - the value of the measure used as the splitting criterion falls below a predefined threshold.


    The ID3 algorithm

    ID3 was formulated by Quinlan (University of Sydney, Australia) and published in the late 1970s. It was later presented in the article "Induction of Decision Trees" (Machine Learning, 1986). ID3 is regarded as an improvement on CLS: at each step it can select the best attribute with which to continue growing the tree.

    Input: a set of samples; each sample/object consists of descriptive attributes and a class value/label.

    Output: a decision tree that correctly classifies the samples in the training data set and predicts the class of samples/objects that have not yet been classified.


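    Example: consider the problem of deciding whether or not to play tennis under given weather conditions. ID3 learns a decision tree from a set of 14 training samples described by the attributes Outlook, Temperature, Humidity, and Wind, each labelled Yes or No.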


    Building the tree: the Outlook attribute

    Compute Entropy(S): |S| = 14; m = 2; C1 = Yes, C2 = No; |C1| = 9 = s1, |C2| = 5 = s2

    Entropy(S) = I(s1, s2) = -(9/14) log2 (9/14) - (5/14) log2 (5/14) = 0.940

    Outlook   Sunny   Overcast   Rain
    Yes       2       4          3
    No        3       0          2
    Total     5       4          5

    Entropy(S_Sunny) = -(2/5) log2 (2/5) - (3/5) log2 (3/5) = 0.971
    Entropy(S_Overcast) = -(4/4) log2 (4/4) - (0/4) log2 (0/4) = 0
    Entropy(S_Rain) = -(3/5) log2 (3/5) - (2/5) log2 (2/5) = 0.971

    Gain(S, Outlook) = Entropy(S) - (5/14) Entropy(S_Sunny) - (4/14) Entropy(S_Overcast) - (5/14) Entropy(S_Rain)
    = 0.940 - (5/14) × 0.971 - (4/14) × 0 - (5/14) × 0.971 = 0.247


    Building the tree: the Temperature attribute

    Temperature   Cool   Mild   Hot
    Yes           3      4      2
    No            1      2      2
    Total         4      6      4

    Entropy(S_Cool) = -(3/4) log2 (3/4) - (1/4) log2 (1/4) = 0.811
    Entropy(S_Mild) = -(4/6) log2 (4/6) - (2/6) log2 (2/6) = 0.918
    Entropy(S_Hot) = -(2/4) log2 (2/4) - (2/4) log2 (2/4) = 1

    Gain(S, Temperature) = Entropy(S) - (4/14) Entropy(S_Cool) - (6/14) Entropy(S_Mild) - (4/14) Entropy(S_Hot)
    = 0.940 - (4/14) × 0.811 - (6/14) × 0.918 - (4/14) × 1 = 0.029


    Building the tree: the Humidity attribute

    Humidity   High   Normal
    Yes        3      6
    No         4      1
    Total      7      7

    Entropy(S_High) = -(3/7) log2 (3/7) - (4/7) log2 (4/7) = 0.985
    Entropy(S_Normal) = -(6/7) log2 (6/7) - (1/7) log2 (1/7) = 0.592

    Gain(S, Humidity) = Entropy(S) - (7/14) Entropy(S_High) - (7/14) Entropy(S_Normal)
    = 0.940 - (7/14) × 0.985 - (7/14) × 0.592 = 0.151


    Building the tree: the Wind attribute

    Wind    Weak   Strong
    Yes     6      3
    No      2      3
    Total   8      6

    Entropy(S_Weak) = -(6/8) log2 (6/8) - (2/8) log2 (2/8) = 0.811
    Entropy(S_Strong) = -(3/6) log2 (3/6) - (3/6) log2 (3/6) = 1

    Gain(S, Wind) = Entropy(S) - (8/14) Entropy(S_Weak) - (6/14) Entropy(S_Strong)
    = 0.940 - (8/14) × 0.811 - (6/14) × 1 = 0.048


    Building the tree: choosing the root attribute

    Purity of the data and the gain for each attribute:

    Entropy(S) = 0.940
    Gain(S, Outlook) = 0.247
    Gain(S, Temperature) = 0.029
    Gain(S, Humidity) = 0.151
    Gain(S, Wind) = 0.048

    Outlook has the largest gain, so the first split is made on this attribute.
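The four gains can be recomputed directly from the Yes/No counts in the tables of the preceding slides. This is a minimal sketch; the names `entropy`, `gain`, `PARENT` and `SPLITS` are illustrative. Because it works at full floating-point precision rather than with three-decimal intermediate values, the last digit may differ slightly (Humidity comes out nearer 0.152 than 0.151).

```python
from math import log2


def entropy(counts):
    """Entropy of a node given its class counts (zero counts are skipped)."""
    total = sum(counts)
    return sum((c / total) * log2(total / c) for c in counts if c > 0)


def gain(parent, children):
    """Information gain: parent entropy minus weighted child entropy."""
    n = sum(parent)
    remainder = sum(sum(child) / n * entropy(child) for child in children)
    return entropy(parent) - remainder


# (Yes, No) counts copied from the tables: the whole set S, then one pair
# per attribute value.
PARENT = (9, 5)
SPLITS = {
    "Outlook": [(2, 3), (4, 0), (3, 2)],      # Sunny, Overcast, Rain
    "Temperature": [(3, 1), (4, 2), (2, 2)],  # Cool, Mild, Hot
    "Humidity": [(3, 4), (6, 1)],             # High, Normal
    "Wind": [(6, 2), (3, 3)],                 # Weak, Strong
}

gains = {name: gain(PARENT, kids) for name, kids in SPLITS.items()}
best = max(gains, key=gains.get)

if __name__ == "__main__":
    for name, value in gains.items():
        print(f"Gain(S, {name}) = {value:.3f}")
    print("Best attribute:", best)
```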


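    Splitting on the Outlook attribute gives three branches: Sunny (2 Yes, 3 No), Overcast (4 Yes, 0 No, so it becomes a leaf labelled Yes), and Rain (3 Yes, 2 No).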

    ...


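    The Sunny branch is handled in the same way.

    Compute the entropy and gain for the attributes Humidity, Temperature, and Wind:
    Gain(S_Sunny, Humidity) = 0.970; Gain(S_Sunny, Temperature) = 0.570; Gain(S_Sunny, Wind) = 0.019

    Humidity has the largest gain, so the Sunny branch is split on Humidity.

    The final decision tree has the form:
    - Outlook = Sunny: test Humidity (High: No; Normal: Yes)
    - Outlook = Overcast: Yes
    - Outlook = Rain: test Wind (Strong: No; Weak: Yes)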
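The whole walk-through can be reproduced with a compact ID3 sketch. Two assumptions are worth stating loudly. First, the 14 samples below are the classic weather/tennis table, assumed to be the training set used here because it reproduces every count in the tables of this example; the rows themselves are not listed on the slides. Second, the nested-dict tree representation and the names `entropy`, `gain` and `id3` are illustrative choices, not the presenter's implementation.

```python
from collections import Counter
from math import log2

ATTRIBUTES = ["Outlook", "Temperature", "Humidity", "Wind"]
TARGET = "Play"

# Assumed data: the classic 14-sample weather table, one sample per line.
DATA = """\
Sunny Hot High Weak No
Sunny Hot High Strong No
Overcast Hot High Weak Yes
Rain Mild High Weak Yes
Rain Cool Normal Weak Yes
Rain Cool Normal Strong No
Overcast Cool Normal Strong Yes
Sunny Mild High Weak No
Sunny Cool Normal Weak Yes
Rain Mild Normal Weak Yes
Sunny Mild Normal Strong Yes
Overcast Mild High Strong Yes
Overcast Hot Normal Weak Yes
Rain Mild High Strong No
"""
SAMPLES = [dict(zip(ATTRIBUTES + [TARGET], line.split()))
           for line in DATA.splitlines()]


def entropy(rows):
    """Entropy of the class labels in a list of samples."""
    n = len(rows)
    counts = Counter(row[TARGET] for row in rows)
    return sum(c / n * log2(n / c) for c in counts.values())


def gain(rows, attribute):
    """Information gain obtained by splitting rows on one attribute."""
    n = len(rows)
    remainder = 0.0
    for value in {row[attribute] for row in rows}:
        subset = [row for row in rows if row[attribute] == value]
        remainder += len(subset) / n * entropy(subset)
    return entropy(rows) - remainder


def id3(rows, attributes):
    """Return {attribute: {value: subtree}}; a leaf is a plain class label."""
    labels = Counter(row[TARGET] for row in rows)
    # Stop when the node is pure or no attribute is left (majority label).
    if len(labels) == 1 or not attributes:
        return labels.most_common(1)[0][0]
    best = max(attributes, key=lambda a: gain(rows, a))
    remaining = [a for a in attributes if a != best]
    return {best: {value: id3([r for r in rows if r[best] == value], remaining)
                   for value in sorted({r[best] for r in rows})}}


tree = id3(SAMPLES, ATTRIBUTES)

if __name__ == "__main__":
    print(tree)
```

On this data the sketch places Outlook at the root, Humidity under Sunny, and Wind under Rain, with Overcast as a pure Yes leaf. It does no pruning and handles neither continuous attributes nor missing values.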


    Example: consider the following training sample set:

    Split on the attribute Thu


    EXTRACTING RULES FROM A DECISION TREE

    A decision tree model and a rule-based model (IF ... THEN ...) can be converted into each other according to these principles:
    - Each path from the root to a leaf produces one rule.
    - The attribute-value pairs along the path are joined by conjunction (AND).
    - Each leaf node carries the class label.

    Example: the rules extracted from the decision tree for the Weather-Tennis data

    R1: IF (Outlook = Sunny) AND (Humidity = High) THEN Play Tennis = No
    R2: IF (Outlook = Sunny) AND (Humidity = Normal) THEN Play Tennis = Yes
    R3: IF (Outlook = Overcast) THEN Play Tennis = Yes
    R4: IF (Outlook = Rain) AND (Wind = Strong) THEN Play Tennis = No
    R5: IF (Outlook = Rain) AND (Wind = Weak) THEN Play Tennis = Yes
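The three principles translate almost directly into a recursive walk over the tree. The sketch below assumes the tree is stored as nested dictionaries of the form {attribute: {value: subtree}} with class labels at the leaves. That representation, and the names `extract_rules` and `format_rule`, are illustrative choices; the tree literal is written out from rules R1 to R5.

```python
TREE = {
    "Outlook": {
        "Sunny": {"Humidity": {"High": "No", "Normal": "Yes"}},
        "Overcast": "Yes",
        "Rain": {"Wind": {"Strong": "No", "Weak": "Yes"}},
    }
}


def extract_rules(tree, conditions=()):
    """Yield (conditions, label) for every root-to-leaf path."""
    if not isinstance(tree, dict):
        # A leaf: the conditions collected so far form one rule.
        yield list(conditions), tree
        return
    (attribute, branches), = tree.items()
    for value, subtree in branches.items():
        yield from extract_rules(subtree, conditions + ((attribute, value),))


def format_rule(conditions, label, target="Play Tennis"):
    """Join the attribute-value pairs of one path with AND."""
    body = " AND ".join(f"({a} = {v})" for a, v in conditions)
    return f"IF {body} THEN {target} = {label}"


rules = [format_rule(c, label) for c, label in extract_rules(TREE)]

if __name__ == "__main__":
    for number, rule in enumerate(rules, start=1):
        print(f"R{number}: {rule}")
```

Because Python dictionaries keep insertion order, the rules come out in the same order as R1 to R5.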


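    USING A DECISION TREE TO PREDICT THE CLASS OF NEW DATA

    The class of an object can be predicted by traversing the tree or by applying the extracted rule set.

    Choose the branch of the tree, or the rule, whose set of conditions covers the largest part of the attribute values of the object to be predicted, and use it as the basis for the prediction.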
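Prediction by tree traversal can be sketched as a short loop. The nested-dictionary tree is an assumed representation ({attribute: {value: subtree}}, class labels at the leaves), written out here from the Weather-Tennis rules; the function name `classify` and its `default` parameter are hypothetical. The sketch covers only the simple case in which the object matches a complete path, and returns `default` when it meets an attribute value that has no branch.

```python
TREE = {
    "Outlook": {
        "Sunny": {"Humidity": {"High": "No", "Normal": "Yes"}},
        "Overcast": "Yes",
        "Rain": {"Wind": {"Strong": "No", "Weak": "Yes"}},
    }
}


def classify(tree, sample, default=None):
    """Follow the branch matching the sample at each node until a leaf."""
    node = tree
    while isinstance(node, dict):
        (attribute, branches), = node.items()
        # An unseen or missing value has no branch: fall back to default.
        node = branches.get(sample.get(attribute), default)
    return node


if __name__ == "__main__":
    new_day = {"Outlook": "Sunny", "Temperature": "Cool",
               "Humidity": "High", "Wind": "Strong"}
    print(classify(TREE, new_day))
```

For the sample shown, the traversal tests Outlook, follows the Sunny branch, tests Humidity, and ends at the leaf No.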


    CONCLUSIONS AND REMARKS

    Decision trees are easy to understand and classify well.

    Limitations:
    - No backtracking
    - Two-way ("hai chiều")
    - ID3 has difficulty with continuous-valued data, missing data, and noisy data.

    The Entropy, Gain, and GINI measures serve to select the attribute on which to branch the decision tree.

    A decision tree is usually evaluated as follows:
    - Divide the sample set into two sets: a training set used to build the classifier (70-75% of the size of the original data) and an evaluation set.
    - Accuracy is the ratio of the number of correctly classified objects to the total number of objects classified.
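The evaluation procedure can be sketched with the standard library alone. The function names `holdout_split` and `accuracy`, the default 70% training fraction, and the fixed random seed are illustrative assumptions. The samples are shuffled before splitting so that the two sets are not biased by the original order.

```python
import random


def holdout_split(samples, train_fraction=0.7, seed=0):
    """Shuffle a copy of the samples and cut it into training and evaluation sets."""
    shuffled = samples[:]
    random.Random(seed).shuffle(shuffled)
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


def accuracy(predicted, actual):
    """Correctly classified objects divided by all classified objects."""
    correct = sum(p == a for p, a in zip(predicted, actual))
    return correct / len(actual)


if __name__ == "__main__":
    train, evaluation = holdout_split(list(range(20)))
    print(len(train), len(evaluation))
    print(accuracy(["Yes", "No", "Yes", "Yes"], ["Yes", "No", "No", "Yes"]))
```

In practice the tree would be built from the training set only, and `accuracy` would compare its predictions on the evaluation set with the known labels.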


    FURTHER READING

    - The Top Ten Algorithms in Data Mining, Xindong Wu, Vipin Kumar
    - Principles of Data Mining, Max Bramer
    - Slides/Lecture Notes for Chapter 4: www.cse.msu.edu/~ptan/

    Open-source data mining software:
    - Weka: www.cs.waikato.ac.nz/ml/weka/
    - DBMiner: http://db.cs.sfu.ca/DBMiner/index.html
    - TANAGRA: http://eric.univ-lyon2.fr/~ricco/tanagra/en/tanagra.html

    EXERCISES
