bai_4-P2-2010

  • Upload
    caotu90


  • 8/8/2019 bai_4-P2-2010

    DATA MINING AND APPLICATIONS

    (DATA MINING)

    Instructor: Nguyen Hoang Tu Anh

    LESSON 4, PART 2

    DATA CLASSIFICATION

    CONTENTS

    1. Introduction

    2. Naive Bayes method

    3. Instance-based methods

    4. Model evaluation

    INTRODUCTION

    Customer   Age   Income (K)   No. cards   Response
    Lm         35    35           3           Yes
    Hng        22    50           2           No
    Mai        28    40           1           Yes
    Lan        45    100          2           No
    Thy        20    30           3           Yes
    Tun        34    55           2           No
    Minh       63    200          1           No
    Vn         55    140          2           No
    Thin       59    170          1           No
    Ngc        25    40           4           Yes
    Chu        30    45           3           ???

    Time: 5 minutes
    Task: propose an idea for determining the class of the last sample (Chu), given the remaining samples.

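    INTRODUCTION

    1. Classification: given a set of samples whose classes are already known, build a model for each class.
       Goal: assign new samples to the classes with the highest possible accuracy.

    Given a database D = {t1, t2, ..., tn} and a set of classes C = {C1, ..., Cm}, classification is the problem of determining a mapping f : D → C such that each ti is assigned to one class.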

    INTRODUCTION

    [Figure: Data → (estimation, regression, learning, training) → Model → (classification, decision making) → Action]


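    INTRODUCTION

    1. Classification with a probabilistic model:

    Predict probabilities, that is, the probability that a sample is a member of a class.
    Foundation: Bayes' theorem.

    Let X and Y be arbitrary variables (discrete, numeric, structured, ...).

    To predict Y from X:

    Estimate the parameters of P(X | Y) and P(Y) directly from the training data set.
    Use Bayes' theorem to compute P(Y | X = x).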

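    INTRODUCTION

    2. Bayes' theorem

    $$P(y \mid x) = \frac{P(x \mid y)\,P(y)}{P(x)}$$

    Specifically, for arbitrary variables X and Y, where y_i is the i-th value of Y and x_j is a value of X:

    $$P(Y = y_i \mid X = x_j) = \frac{P(X = x_j \mid Y = y_i)\,P(Y = y_i)}{P(X = x_j)}$$

    Equivalently, expanding the denominator with the law of total probability:

    $$P(Y = y_i \mid X = x_j) = \frac{P(X = x_j \mid Y = y_i)\,P(Y = y_i)}{\sum_{k} P(X = x_j \mid Y = y_k)\,P(Y = y_k)}$$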

    INTRODUCTION

    3. Bayes classification

    Model building: estimate P(X | Y) and P(Y) from the training data set.

    Classification: use Bayes' theorem to compute P(Y | X_new).

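    INTRODUCTION

    4. Conditional independence

    Definition: X is conditionally independent of Y given Z if the probability distribution over X is independent of the values of Y once the values of Z are given.

    We usually write:

    $$P(X \mid Y, Z) = P(X \mid Z)$$

    Example:

    P(Thunder | Rain, Lightning) = P(Thunder | Lightning)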

    Naive Bayes algorithm

    Assume:

    D: the training set, whose samples are represented as X = (x1, ..., xn)

    Ci,D: the set of samples in D that belong to class Ci, with i = {1, ..., m}

    The attributes x1, ..., xn are pairwise conditionally independent given the class C

    Then: we need to find the class Ci for which the probability P(Ci | X) is largest

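    Naive Bayes algorithm

    By Bayes' theorem:

    $$P(C_i \mid X) = \frac{P(X \mid C_i)\,P(C_i)}{P(X)}$$

    By conditional independence:

    $$P(X \mid C_i) = \prod_{k=1}^{n} P(x_k \mid C_i) = P(x_1 \mid C_i) \times P(x_2 \mid C_i) \times \dots \times P(x_n \mid C_i)$$

    Because P(X) is the same for every class, it can be dropped from the comparison. The classification rule for Xnew = {x1, ..., xn} is:

    $$\arg\max_{C_i} \; P(C_i) \prod_{k=1}^{n} P(x_k \mid C_i)$$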

    Naive Bayes algorithm

    Step 1: Train Naive Bayes (on the training data set)

    Estimate P(Ci)
    Estimate P(xk | Ci)

    Step 2: Xnew is assigned to the class that gives the largest value of:

    $$\arg\max_{C_i} \; P(C_i) \prod_{k=1}^{n} P(x_k \mid C_i)$$

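    Case: X takes discrete values

    Assume:

    X = (x1, ..., xn)

    each xi takes discrete values

    Then: estimate P(Ci) and P(xk | Ci) with the formulas

    $$P(C_i) = \frac{|C_{i,D}|}{|D|}$$

    $$P(x_k \mid C_i) = \frac{\#C_{i,D}\{x_k\}}{|C_{i,D}|}$$

    where #C_{i,D}{x_k} is the number of samples in C_{i,D} whose k-th attribute has the value x_k.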

    Case: X takes discrete values

    To avoid P(xk | Ci) = 0 when no sample in the training data matches the numerator, we smooth the estimates by adding a number of virtual samples.

    Laplace smoothing then gives:

    $$P(C_i) = \frac{|C_{i,D}| + 1}{|D| + m}$$

    $$P(x_k \mid C_i) = \frac{\#C_{i,D}\{x_k\} + 1}{|C_{i,D}| + r}$$

    where m is the number of classes and r is the number of discrete values of the attribute.
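The two smoothed estimates can be written as small helper functions. This is only a sketch: the function and parameter names are illustrative and do not come from the slides.

```python
from fractions import Fraction

def smoothed_prior(class_count, total, m):
    """(|C_i,D| + 1) / (|D| + m), where m is the number of classes."""
    return Fraction(class_count + 1, total + m)

def smoothed_conditional(value_count, class_count, r):
    """(#C_i,D{x_k} + 1) / (|C_i,D| + r), where r is the number of attribute values."""
    return Fraction(value_count + 1, class_count + r)

# A value never observed together with a class no longer gets probability 0:
# 0 matching samples, 5 samples in the class, attribute with 3 possible values.
p_unseen = smoothed_conditional(value_count=0, class_count=5, r=3)
print(p_unseen)  # 1/8
```

Exact fractions are used here so that the results can be compared directly with hand calculations.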

    EXAMPLE 1:

    Given the training data set:

    Outlook    Temperature   Humidity   Windy    Play?
    sunny      hot           high       weak     No
    sunny      hot           high       strong   No
    overcast   hot           high       weak     Yes
    rain       mild          high       weak     Yes
    rain       cool          normal     weak     Yes
    rain       cool          normal     strong   No
    overcast   cool          normal     strong   Yes
    sunny      mild          high       weak     No
    sunny      cool          normal     weak     Yes
    rain       mild          normal     weak     Yes
    sunny      mild          normal     strong   Yes
    overcast   mild          high       strong   Yes
    overcast   hot           normal     weak     Yes
    rain       mild          high       strong   No

    EXAMPLE 1:

    Step 1: Estimate P(Ci), with C1 = yes and C2 = no, and P(xk | Ci)

    We obtain P(Ci):

    P(C1) = 9/14 = 0.643

    P(C2) = 5/14 = 0.357

    The attribute Outlook takes the values sunny, overcast and rain. For the value sunny, P(sunny | Ci) is:

    Outlook
    P(sunny | yes) = 2/9    P(sunny | no) = 3/5

    Group exercise

    Time: 5 minutes
    Estimate P(xk | Ci), with C1 = yes and C2 = no:

    P(Outlook | Ci)        Group: left row
    P(Temperature | Ci)    Group: right row
    P(Humidity | Ci)       Group: middle row (front half)
    P(Windy | Ci)          Group: middle row (back half)

    EXAMPLE 1:

    Step 2: Classification
    Xnew = (sunny, cool, high, strong)

    We need to compute:
    P(C1)*P(X|C1) = P(C1)*P(sunny|y)*P(cool|y)*P(high|y)*P(strong|y) = 0.005
    P(C2)*P(X|C2) = P(C2)*P(sunny|n)*P(cool|n)*P(high|n)*P(strong|n) = 0.021

    Xnew belongs to class C2 (no)
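The two products in Step 2 can be checked with a short script. This is a minimal sketch that uses plain relative frequencies (no smoothing) on the Example 1 training set; the helper names `train` and `score` are illustrative, not part of the lecture.

```python
from collections import Counter

# Example 1 training set: (outlook, temperature, humidity, windy, play)
DATA = [
    ("sunny", "hot", "high", "weak", "no"),
    ("sunny", "hot", "high", "strong", "no"),
    ("overcast", "hot", "high", "weak", "yes"),
    ("rain", "mild", "high", "weak", "yes"),
    ("rain", "cool", "normal", "weak", "yes"),
    ("rain", "cool", "normal", "strong", "no"),
    ("overcast", "cool", "normal", "strong", "yes"),
    ("sunny", "mild", "high", "weak", "no"),
    ("sunny", "cool", "normal", "weak", "yes"),
    ("rain", "mild", "normal", "weak", "yes"),
    ("sunny", "mild", "normal", "strong", "yes"),
    ("overcast", "mild", "high", "strong", "yes"),
    ("overcast", "hot", "normal", "weak", "yes"),
    ("rain", "mild", "high", "strong", "no"),
]

def train(data):
    """Count class sizes and (attribute index, value, class) co-occurrences."""
    class_counts = Counter(row[-1] for row in data)
    value_counts = Counter(
        (k, value, row[-1]) for row in data for k, value in enumerate(row[:-1])
    )
    return class_counts, value_counts

def score(x, label, class_counts, value_counts):
    """P(C_i) * prod_k P(x_k | C_i), estimated by relative frequencies."""
    n_class = class_counts[label]
    result = n_class / sum(class_counts.values())
    for k, value in enumerate(x):
        result *= value_counts[(k, value, label)] / n_class
    return result

class_counts, value_counts = train(DATA)
x_new = ("sunny", "cool", "high", "strong")
scores = {c: score(x_new, c, class_counts, value_counts) for c in class_counts}
prediction = max(scores, key=scores.get)
print({c: round(s, 3) for c, s in scores.items()}, prediction)
# {'no': 0.021, 'yes': 0.005} no
```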

    Individual exercise

    Time: 5 minutes

    Determine the class of the following new sample:

    Xnew =

    EXAMPLE 1: Laplace smoothing

    Step 1: Estimate P(Ci), with C1 = yes and C2 = no, and P(xk | Ci) using the Laplace smoothing formulas

    P(C1) = (9+1)/(14+2) = 10/16

    P(C2) = (5+1)/(14+2) = 6/16

    Outlook
    P(sunny | y) = 3/12       P(sunny | n) = 4/8
    P(overcast | y) = 5/12    P(overcast | n) = 1/8
    P(rain | y) = 4/12        P(rain | n) = 3/8

    Temperature
    P(hot | y) = 3/12         P(hot | n) = 3/8
    P(mild | y) = 5/12        P(mild | n) = 3/8
    P(cool | y) = 4/12        P(cool | n) = 2/8

    Humidity
    P(high | y) = 4/11        P(high | n) = 5/7
    P(normal | y) = 7/11      P(normal | n) = 2/7

    Windy
    P(strong | y) = 4/11      P(strong | n) = 4/7
    P(weak | y) = 7/11        P(weak | n) = 3/7

    EXAMPLE 1: Laplace smoothing

    Step 2: Classification

    Xnew = (overcast, cool, high, strong)

    Using the Laplace smoothing formulas, we compute:

    P(C1)*P(X|C1) = P(C1)*P(overcast|y)*P(cool|y)*P(high|y)*P(strong|y) = 0.011

    P(C2)*P(X|C2) = P(C2)*P(overcast|n)*P(cool|n)*P(high|n)*P(strong|n) = 0.005

    Xnew belongs to class C1 (yes)
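The smoothed products can be reproduced in the same way. The sketch below assumes the Example 1 training set and the Laplace formulas with m classes and r attribute values; `laplace_score` is an illustrative name, and exact fractions are used so the intermediate values match the hand-computed table.

```python
from fractions import Fraction

# Example 1 training set: (outlook, temperature, humidity, windy, play)
DATA = [
    ("sunny", "hot", "high", "weak", "no"),
    ("sunny", "hot", "high", "strong", "no"),
    ("overcast", "hot", "high", "weak", "yes"),
    ("rain", "mild", "high", "weak", "yes"),
    ("rain", "cool", "normal", "weak", "yes"),
    ("rain", "cool", "normal", "strong", "no"),
    ("overcast", "cool", "normal", "strong", "yes"),
    ("sunny", "mild", "high", "weak", "no"),
    ("sunny", "cool", "normal", "weak", "yes"),
    ("rain", "mild", "normal", "weak", "yes"),
    ("sunny", "mild", "normal", "strong", "yes"),
    ("overcast", "mild", "high", "strong", "yes"),
    ("overcast", "hot", "normal", "weak", "yes"),
    ("rain", "mild", "high", "strong", "no"),
]

def laplace_score(x, label, data):
    """Smoothed P(C_i) * prod_k P(x_k | C_i) for one class label."""
    labels = [row[-1] for row in data]
    n_class = labels.count(label)
    m = len(set(labels))                     # number of classes
    result = Fraction(n_class + 1, len(data) + m)
    for k, value in enumerate(x):
        r = len({row[k] for row in data})    # distinct values of attribute k
        hits = sum(1 for row in data if row[k] == value and row[-1] == label)
        result *= Fraction(hits + 1, n_class + r)
    return result

x_new = ("overcast", "cool", "high", "strong")
scores = {c: laplace_score(x_new, c, DATA) for c in ("yes", "no")}
prediction = max(scores, key=scores.get)
print({c: round(float(s), 3) for c, s in scores.items()}, prediction)
# {'yes': 0.011, 'no': 0.005} yes
```

Without smoothing, P(overcast | no) would be 0/5, so the score for class no would collapse to zero regardless of the other attributes.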

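    Case: X takes continuous values

    If an attribute takes continuous values, the probability P(xk | Ci) is usually computed from a Gaussian distribution with mean μ and standard deviation σ:

    $$g(x, \mu, \sigma) = \frac{1}{\sqrt{2\pi}\,\sigma}\, e^{-\frac{(x - \mu)^2}{2\sigma^2}}$$

    and P(xk | Ci) is:

    $$P(x_k \mid C_i) = g(x_k, \mu_{C_i}, \sigma_{C_i})$$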
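As an illustration of the Gaussian case, the sketch below fits the mean and standard deviation of Age for the customers with Response = Yes in the introductory table (35, 28, 20, 25) and evaluates the density at age 30. Using the sample standard deviation (n - 1 in the denominator) is an assumption; the slides do not say which estimator is intended.

```python
import math
import statistics

def gaussian(x, mu, sigma):
    """Gaussian density g(x, mu, sigma)."""
    exponent = -((x - mu) ** 2) / (2 * sigma ** 2)
    return math.exp(exponent) / (math.sqrt(2 * math.pi) * sigma)

# Ages of the customers whose Response is Yes in the introductory table.
ages_yes = [35, 28, 20, 25]
mu = statistics.mean(ages_yes)
sigma = statistics.stdev(ages_yes)   # assumption: sample standard deviation

# Density used in place of P(Age = 30 | Response = Yes).
likelihood = gaussian(30, mu, sigma)
print(mu, round(sigma, 2), likelihood)
```

The returned value is a density, not a probability, so it may exceed 1 for attributes with a very small standard deviation; only its relative size across classes matters in the argmax.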

    Naive Bayes algorithm

    Advantages:
    - Easy to implement
    - Running time comparable to that of decision trees
    - Gives good results in most cases

    Disadvantages:
    - The assumption that the attributes are conditionally independent reduces accuracy


    INTRODUCTION

    Instance-based classification methods:
    - Store the training samples/objects and process them only when a new sample/object has to be classified
    - Assign the sample/object to the class it is closest to

    Methods:
    - k-nearest neighbors algorithm (k-NN)
    - Locally weighted regression
    - Case-based reasoning

    K-NEAREST NEIGHBORS

    "Tell me who your friends are, and I will tell you what kind of person you are."
    A new sample is assigned to the class that has the most samples among its k nearest samples.

    K-NEAREST NEIGHBORS

    Algorithm for determining the class of a new sample E:
    - Compute the distance between E and every sample in the training set
    - Select the k training samples nearest to E
    - Assign E to the class with the most samples among the k neighbors (or let E take the mean value of the k samples)

    [Figure: neighbors labeled "Response" and "No response"; the new sample is assigned Class: Response]
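The three steps of the algorithm can be sketched as follows. The training points and labels are made-up toy data, `knn_classify` is an illustrative name, and ties in the vote are broken arbitrarily by `Counter.most_common`.

```python
import math
from collections import Counter

def knn_classify(train, query, k):
    """Return the majority label among the k training samples nearest to query.

    train is a list of (feature_tuple, label) pairs.
    """
    # Steps 1 and 2: sort by Euclidean distance to the query, keep the k nearest.
    neighbours = sorted(train, key=lambda item: math.dist(item[0], query))[:k]
    # Step 3: majority vote over the neighbours' labels.
    votes = Counter(label for _, label in neighbours)
    return votes.most_common(1)[0][0]

# Hypothetical toy data: two well-separated groups of 2-D points.
toy_train = [
    ((1, 1), "A"), ((1, 2), "A"), ((2, 1), "A"),
    ((5, 5), "B"), ((6, 5), "B"), ((5, 6), "B"),
]
print(knn_classify(toy_train, (2, 2), k=3))  # A
```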

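    K-NEAREST NEIGHBORS

    Computing the distance between 2 samples/objects
    - Each sample is a set of numeric attributes
    - The Euclidean distance between X = (x1, ..., xn) and Y = (y1, ..., yn) is:

    $$D(X, Y) = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2}$$

    - When distances are only compared with each other, the square root can be omitted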

    K-NEAREST NEIGHBORS

    Example: the distance between John and Rachel

    John:     Age = 35    Income = 95K     No. of credit cards = 3
    Rachel:   Age = 41    Income = 215K    No. of credit cards = 2

    D(John, Rachel) = sqrt[(35-41)^2 + (95K-215K)^2 + (3-2)^2]

    - Attributes with large values strongly influence the distance between objects (e.g. the income attribute)
    - Attributes have different value ranges

    -> The attribute values need to be normalized
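A quick calculation shows how strongly income dominates this distance. The sketch assumes 95K and 215K mean 95,000 and 215,000; the variable names are illustrative.

```python
import math

# (age, income, number of credit cards); "K" is read as thousands (assumption).
john = (35, 95_000, 3)
rachel = (41, 215_000, 2)

distance = math.dist(john, rachel)

# Fraction of the squared distance contributed by the income attribute alone.
income_share = (john[1] - rachel[1]) ** 2 / distance ** 2
print(distance, income_share)
```

The distance comes out at about 120000, almost exactly the income difference, so age and the number of cards have practically no influence until the attributes are rescaled.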

    K-NEAREST NEIGHBORS

    The data must be normalized: map the values into the interval [0,1] with the formula:

    $$a_i = \frac{v_i - \min v_i}{\max v_i - \min v_i}$$

    where: v_i is the actual value of attribute i
           a_i is the normalized value of the attribute
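A minimal sketch of this min-max normalization, applied to the Income (K) column of the customer table from the introduction. Mapping a constant attribute to 0 is an assumption made here to avoid division by zero; the slides do not cover that case.

```python
def min_max_normalize(values):
    """Map each value v to (v - min) / (max - min), i.e. into [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # Assumption: an attribute with a single value carries no information.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

# Income (K) of the ten labeled customers in the introductory table.
incomes = [35, 50, 40, 100, 30, 55, 200, 140, 170, 40]
normalized = min_max_normalize(incomes)
print([round(a, 3) for a in normalized])
```

A new sample has to be scaled with the same min and max as the training data, otherwise its distances to the training samples are not comparable.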

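    K-NEAREST NEIGHBORS

    Advantages:
    - Easy to use and implement
    - Handles noisy data well

    Disadvantages:
    - All samples must be stored
    - Determining the class of a new sample takes a long time (the distance to every sample must be computed and compared)
    - Depends on the value of k chosen by the user
      - If k is too small, the result is sensitive to noise
      - If k is too large, the neighborhood may contain points from other classes

    Non-numeric attributes?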

    K-NEAREST NEIGHBORS

    DEPENDS ON THE VALUE OF K CHOSEN BY THE USER


    Model evaluation

    Besides the learning algorithm, the performance of a model may depend on other factors:
    - The class distribution
    - The cost of misclassification
    - The sizes of the training set and the test set

    Performance measures
    Evaluation methods

    Model evaluation

    Performance evaluation
    Focus on the predictive ability of the model rather than on the speed of classification or model building, scalability, ...

                                 PREDICTED CLASS
                                 Class=Yes   Class=No
    ACTUAL CLASS   Class=Yes     a           b
                   Class=No      c           d

    a: TP (true positive)     b: FN (false negative)
    c: FP (false positive)    d: TN (true negative)

    Performance evaluation

    Accuracy of model M, acc(M)

                                 PREDICTED CLASS
                                 Class=Yes   Class=No
    ACTUAL CLASS   Class=Yes     a (TP)      b (FN)
                   Class=No      c (FP)      d (TN)

    $$\mathrm{Acc}(M) = \frac{a + d}{a + b + c + d} = \frac{TP + TN}{TP + TN + FP + FN}$$

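    Performance evaluation

    Error rate of model M: error_rate(M) = 1 - acc(M)

    Some other measures:

    $$\text{Precision } (p) = \frac{a}{a + c}$$

    $$\text{Recall } (r) = \frac{a}{a + b}$$

    $$\text{F-measure } (F) = \frac{2rp}{r + p} = \frac{2a}{2a + b + c}$$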

    Performance evaluation: example

    classes              buy_computer = yes   buy_computer = no   total   recognition (%)
    buy_computer = yes   6954                 46                  7000    99.34
    buy_computer = no    412                  2588                3000    86.27
    total                7366                 2634                10000   95.42

    acc(M) = (6954+2588)/10000 = 95.42%
    error_rate(M) = 1 - 95.42% = 4.58%
    Precision(M-Yes) = 6954/7366 = 94.41%
    Recall(M-Yes) = 6954/7000 = 99.34%
    F-measure(M-Yes) = 96.81%
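The figures of this example can be recomputed from the four confusion-matrix cells. This is a small sketch: the function name `metrics` and the dictionary keys are illustrative; a, b, c, d follow the TP, FN, FP, TN convention of the confusion matrix.

```python
def metrics(a, b, c, d):
    """Evaluation measures from a = TP, b = FN, c = FP, d = TN."""
    accuracy = (a + d) / (a + b + c + d)
    precision = a / (a + c)
    recall = a / (a + b)
    f_measure = 2 * recall * precision / (recall + precision)
    return {
        "accuracy": accuracy,
        "error_rate": 1 - accuracy,
        "precision": precision,
        "recall": recall,
        "f_measure": f_measure,
    }

# Cells of the buy_computer example, with "yes" as the positive class.
result = metrics(a=6954, b=46, c=412, d=2588)
for name, value in result.items():
    print(f"{name}: {value:.2%}")
# accuracy: 95.42%
# error_rate: 4.58%
# precision: 94.41%
# recall: 99.34%
# f_measure: 96.81%
```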

    Model evaluation

    Evaluation methods

    Holdout method:
    - Randomly partition the data set into 2 independent sets: a training set (2/3) and a test set (1/3)
    - Suitable for small data sets
    - The samples may not be representative of the whole data set: a class may be missing from the test set

    Improvements:
    - Use a sampling method such that every class is evenly distributed in both the training set and the test set
    - Random subsampling: run holdout k times and take acc(M) as the arithmetic mean of the k accuracy values
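A minimal sketch of holdout and of its random-subsampling variant. The `evaluate` callback is hypothetical: it stands for any routine that trains a model on the training part and returns its accuracy on the test part. The split is not stratified, so the class-balance improvement is not covered here.

```python
import random

def holdout_split(samples, train_fraction=2 / 3, seed=0):
    """Shuffle a copy of the samples and cut it into (training, test) lists."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]

def random_subsampling(samples, evaluate, k):
    """Repeat holdout k times and average the accuracies returned by evaluate."""
    accuracies = []
    for run in range(k):
        train, test = holdout_split(samples, seed=run)  # a new split per run
        accuracies.append(evaluate(train, test))
    return sum(accuracies) / k

train_set, test_set = holdout_split(range(30))
print(len(train_set), len(test_set))  # 20 10
```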

    Evaluation methods

    Cross-validation method (k-fold):
    - Partition the data into k subsets of equal size
    - In each iteration, use one subset as the test set and the remaining subsets as the training set
    - The value of k is usually 10
    - Leave-one-out: k = the number of samples in the data set (intended for small data sets)
    - Stratified cross-validation: use a sampling method so that the class distribution in each subset is the same as in the whole data set
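The fold construction can be sketched as an index generator. This is an illustrative, unstratified version that does not shuffle (the data are assumed to be in random order already); fold sizes differ by at most one when the number of samples is not divisible by k.

```python
def k_fold_splits(n_samples, k):
    """Yield (train_indices, test_indices) for each of the k folds."""
    indices = list(range(n_samples))
    # The first (n_samples % k) folds get one extra sample.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, test
        start += size

for train_idx, test_idx in k_fold_splits(n_samples=10, k=5):
    print(test_idx, len(train_idx))
```

Calling `k_fold_splits(n, n)` gives leave-one-out: every fold then holds exactly one test sample.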

    SUMMARY

    - Classification is a form of data analysis that extracts models describing important data classes
    - Many effective algorithms have been developed
    - No single algorithm is superior on every data set
    - Issues such as accuracy, training time, flexibility and scalability need attention and further study

    TO DO

    1. Discuss and work through the exercises of chapter 4, Part 1 and Part 2, on your own (not to be submitted)
    2. Prepare lesson 5: Data clustering. Look through the exercises of lesson 5.

    How to proceed:
    - Read the slides and study the examples
    - Consult the Internet and the references

    Group exercise

    Customer   Age   Income (K)   No. cards   Response
    Lm         35    35           3           Yes
    Hng        22    50           2           No
    Mai        28    40           1           Yes
    Lan        45    100          2           No
    Thy        20    30           3           Yes
    Tun        34    55           2           No
    Minh       63    200          1           No
    Vn         55    140          2           No
    Thin       59    170          1           No
    Ngc        25    40           4           Yes
    Vinh       40    45           2           ???

    Time: 15 minutes
    Use the k-NN algorithm with k = 3 to determine the class for Vinh

    Submission format

    Group exercise          Submission date:

    Group name: list the participating members
    Full name:
    Student ID:

    Content:

    EXERCISES, PART 2

    1. Given the training set of example 1 in lesson 5-P1 (buys / does not buy a computer), apply the Naive Bayes algorithm to example 1 and determine the class of the new sample: X= (

    age    income   student   credit_rating   buys_computer
    40     low      yes       fair            yes
    >40    low      yes       excellent       no
    3140   low      yes       excellent       yes

    EXERCISES, PART 2

    3. Given the following training set:
    a) Use the k-NN algorithm to determine the class for Tuyn with k = 3, or 5, or 7. Compare the results obtained.
    b) Normalize the data and determine the class for Dng. Compare the result with part a).
    c) Find a way to transform the data set into a form to which the decision tree, ILA and Naive Bayes methods can be applied. Apply one of these 3 methods to the transformed data to determine the class for Dng. Compare the result with part a).

    4. Compare the advantages and disadvantages of classification methods based on decision trees, rules, probability and instances.

    EXERCISES, PART 2

    Customer   Age   Income (K)   No. cards   Response
    Lm         35    35           3           Yes
    Hng        22    50           2           No
    Mai        28    40           1           Yes
    Lan        45    100          2           No
    Thy        20    30           3           Yes
    Tun        34    55           2           No
    Minh       63    200          1           No
    Vn         55    140          2           No
    Thin       59    170          1           No
    Ngc        25    40           4           Yes
    Tuyn       25    30           1           ???

    REFERENCES

    1. T. M. Mitchell, Machine Learning. McGraw Hill, 1997

    2. J. Han, M. Kamber, Chapter 7, Data Mining: Concepts and Techniques
       http://www.cs.sfu.ca/~han/dmbook
       http://www-faculty.cs.uiuc.edu/~hanj/bk2/slidesindex.html (2nd edition)

    3. P.-N. Tan, M. Steinbach, V. Kumar, Chapter 4, Introduction to Data Mining
       http://www-users.cs.umn.edu/~kumar/dmbook/ch4.pdf

    Q & A