8/8/2019 bai_4-P2-2010
DATA MINING AND APPLICATIONS
Lecturer: Nguyễn Hoàng Tú Anh
LESSON 4, PART 2
DATA CLASSIFICATION
CONTENTS
1. Introduction
2. Naive Bayes method
3. Instance-based methods
4. Model evaluation
INTRODUCTION

Customer  Age  Income (K)  No. cards  Response
Lm        35   35          3          Yes
Hng       22   50          2          No
Mai       28   40          1          Yes
Lan       45   100         2          No
Thy       20   30          3          Yes
Tun       34   55          2          No
Minh      63   200         1          No
Vn        55   140         2          No
Thin      59   170         1          No
Ngc       25   40          4          Yes
Chu       30   45          3          ???

Time: 5 minutes
Task: describe an idea for determining the class of the last sample (Chu), given the remaining samples.
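INTRODUCTION
1. Classification: given a set of samples that have already been classified, build a model for each class.
Goal: assign new samples to the classes with the highest possible accuracy.
Given a database D = {t1, t2, ..., tn} and a set of classes C = {C1, ..., Cm}, classification is the problem of determining a mapping f : D → C such that each ti is assigned to one class.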
INTRODUCTION
[Diagram: Data → Model (estimation, regression, learning, training); Model → Action (classification, decision making)]
2. NAIVE BAYES METHOD
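INTRODUCTION
1. Classification with a probabilistic model:
- Predict probabilities, i.e. the probability that a sample is a member of a class.
- Foundation: Bayes' theorem.
- Let X and Y be arbitrary variables (discrete, numeric, structured, ...).
- Predict Y from X.
- Estimate the parameters of P(X | Y) and P(Y) directly from the training data set.
- Use Bayes' theorem to compute P(Y | X = x).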
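INTRODUCTION
2. Bayes' theorem

$$P(y \mid x) = \frac{P(x \mid y)\,P(y)}{P(x)}$$

Specifically, for the i-th value y_i of an arbitrary variable Y and the j-th value x_j of X:

$$P(Y = y_i \mid X = x_j) = \frac{P(X = x_j \mid Y = y_i)\,P(Y = y_i)}{P(X = x_j)}$$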
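INTRODUCTION
2. Bayes' theorem
Equivalently, expanding the denominator over all values y_k of Y:

$$P(Y = y_i \mid X = x_j) = \frac{P(X = x_j \mid Y = y_i)\,P(Y = y_i)}{\sum_{k} P(X = x_j \mid Y = y_k)\,P(Y = y_k)}$$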
INTRODUCTION
3. Bayesian classification
Model building: estimate P(X | Y) and P(Y) from the training data set.
Classification: use Bayes' theorem to compute P(Y | X_new).
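INTRODUCTION
4. Conditional independence
Definition: X is conditionally independent of Y given Z if the probability distribution over X is independent of the values of Y once the values of Z are given.
We usually write:

$$P(X \mid Y, Z) = P(X \mid Z)$$

Example:
P(Thunder | Rain, Lightning) = P(Thunder | Lightning)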
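Naive Bayes algorithm
Assume:
- D: the training set, made of samples represented as X = <x1, ..., xn>
- Ci,D: the set of samples of D that belong to class Ci, with i = {1, ..., m}
- The attributes x1, ..., xn are pairwise conditionally independent given the class C
Then we need to find the class Ci for which the probability P(Ci|X) is largest.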
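Naive Bayes algorithm
By Bayes' theorem:

$$P(C_i \mid X) = \frac{P(X \mid C_i)\,P(C_i)}{P(X)}$$

By conditional independence:

$$P(X \mid C_i) = \prod_{k=1}^{n} P(x_k \mid C_i) = P(x_1 \mid C_i) \times P(x_2 \mid C_i) \times \dots \times P(x_n \mid C_i)$$

The classification rule for X_new = {x1, ..., xn} is:

$$\arg\max_{C_i} \; P(C_i) \prod_{k=1}^{n} P(x_k \mid C_i)$$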
Naive Bayes algorithm
Step 1: Train Naive Bayes (on the training data set)
- Estimate P(Ci)
- Estimate P(Xk|Ci)
Step 2: X_new is assigned to the class that gives the largest value of:

$$\arg\max_{C_i} \; P(C_i) \prod_{k=1}^{n} P(x_k \mid C_i)$$
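Discrete-valued X
Assume:
- X = <x1, ..., xn>
- each xi takes discrete values
Then P(Ci) and P(Xk|Ci) are estimated by:

$$P(C_i) = \frac{|C_{i,D}|}{|D|}$$

$$P(x_k \mid C_i) = \frac{\#C_{i,D}\{x_k\}}{|C_{i,D}|}$$

where #C_{i,D}{x_k} is the number of samples of class Ci whose attribute takes the value x_k.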
Discrete-valued X
To avoid P(Xk|Ci) = 0 when no sample in the training data matches the numerator, we smooth the estimates by adding a few virtual samples.
Laplace smoothing:

$$P(x_k \mid C_i) = \frac{\#C_{i,D}\{x_k\} + 1}{|C_{i,D}| + r}$$

$$P(C_i) = \frac{|C_{i,D}| + 1}{|D| + m}$$

where m is the number of classes and r is the number of discrete values of the attribute.
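The two smoothed estimators can be sketched in a few lines of Python. This is only an illustrative sketch: the function names and the sample counts are hypothetical, and exact fractions are used so the results are easy to check by hand.

```python
from fractions import Fraction


def laplace_prior(class_count: int, total_count: int, num_classes: int) -> Fraction:
    """Smoothed P(Ci) = (|Ci,D| + 1) / (|D| + m)."""
    return Fraction(class_count + 1, total_count + num_classes)


def laplace_conditional(value_count: int, class_count: int, num_values: int) -> Fraction:
    """Smoothed P(xk|Ci) = (#Ci,D{xk} + 1) / (|Ci,D| + r)."""
    return Fraction(value_count + 1, class_count + num_values)


# Hypothetical counts: a class holding 5 of 14 samples (m = 2 classes), and an
# attribute value (one of r = 3 possible values) that never occurs in that class.
print(laplace_prior(5, 14, 2))       # 3/8
print(laplace_conditional(0, 5, 3))  # 1/8, no longer zero
```

Without the added virtual samples the second estimate would be 0/5 = 0, which would force the whole product for that class to zero.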
EXAMPLE 1: Given the training data set:

Outlook   Temperature  Humidity  Windy   Play?
sunny     hot          high      weak    No
sunny     hot          high      strong  No
overcast  hot          high      weak    Yes
rain      mild         high      weak    Yes
rain      cool         normal    weak    Yes
rain      cool         normal    strong  No
overcast  cool         normal    strong  Yes
sunny     mild         high      weak    No
sunny     cool         normal    weak    Yes
rain      mild         normal    weak    Yes
sunny     mild         normal    strong  Yes
overcast  mild         high      strong  Yes
overcast  hot          normal    weak    Yes
rain      mild         high      strong  No
EXAMPLE 1:
Step 1: Estimate P(Ci), with C1 = yes and C2 = no, and P(xk|Ci).
We obtain P(Ci):
P(C1) = 9/14 = 0.643
P(C2) = 5/14 = 0.357
The attribute Outlook takes the values sunny, overcast, rain. For the value sunny, P(sunny|Ci) is:
P(sunny | yes) = 2/9    P(sunny | no) = 3/5
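The counting behind these estimates can be reproduced with a short Python sketch. The function names and the tuple layout are assumptions made for illustration; the rows are the 14 samples of the Example 1 training table.

```python
from collections import Counter
from fractions import Fraction

# Training table of Example 1: (outlook, temperature, humidity, windy, play).
ROWS = [
    ("sunny", "hot", "high", "weak", "no"),
    ("sunny", "hot", "high", "strong", "no"),
    ("overcast", "hot", "high", "weak", "yes"),
    ("rain", "mild", "high", "weak", "yes"),
    ("rain", "cool", "normal", "weak", "yes"),
    ("rain", "cool", "normal", "strong", "no"),
    ("overcast", "cool", "normal", "strong", "yes"),
    ("sunny", "mild", "high", "weak", "no"),
    ("sunny", "cool", "normal", "weak", "yes"),
    ("rain", "mild", "normal", "weak", "yes"),
    ("sunny", "mild", "normal", "strong", "yes"),
    ("overcast", "mild", "high", "strong", "yes"),
    ("overcast", "hot", "normal", "weak", "yes"),
    ("rain", "mild", "high", "strong", "no"),
]


def estimate_prior(rows):
    """P(Ci) = |Ci,D| / |D| for every class label (last column)."""
    class_counts = Counter(row[-1] for row in rows)
    return {label: Fraction(n, len(rows)) for label, n in class_counts.items()}


def estimate_conditional(rows, attr_index, value, label):
    """P(xk|Ci) = #Ci,D{xk} / |Ci,D| for one attribute value and one class."""
    in_class = [row for row in rows if row[-1] == label]
    matches = sum(1 for row in in_class if row[attr_index] == value)
    return Fraction(matches, len(in_class))


priors = estimate_prior(ROWS)
print(priors["yes"], priors["no"])                    # 9/14 5/14
print(estimate_conditional(ROWS, 0, "sunny", "yes"))  # 2/9
print(estimate_conditional(ROWS, 0, "sunny", "no"))   # 3/5
```

Changing `attr_index` and `value` gives the remaining conditional probabilities for Temperature, Humidity, and Windy.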
Group exercise
Time: 5 minutes
Estimate P(xk|Ci), with C1 = yes and C2 = no:
- P(Outlook|Ci): group in the left row
- P(Temperature|Ci): group in the right row
- P(Humidity|Ci): group in the middle row (upper half)
- P(Windy|Ci): group in the middle row (lower half)
EXAMPLE 1:
Step 2: Classification
X_new = <sunny, cool, high, strong>
We need to compute:
P(C1)*P(X|C1) = P(C1)*P(sunny|y)*P(cool|y)*P(high|y)*P(strong|y) = 0.005
P(C2)*P(X|C2) = P(C2)*P(sunny|n)*P(cool|n)*P(high|n)*P(strong|n) = 0.021
X_new belongs to class C2 (no).
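The two products can be checked with a small Python sketch. Only the two P(sunny|Ci) values appear on the slides; the other fractions are counted here from the Example 1 training table, and the variable names are illustrative.

```python
from fractions import Fraction as F
from math import prod

# Unsmoothed estimates counted from the Example 1 training table.
priors = {"yes": F(9, 14), "no": F(5, 14)}
# Order: P(sunny|Ci), P(cool|Ci), P(high|Ci), P(strong|Ci).
likelihoods = {
    "yes": [F(2, 9), F(3, 9), F(3, 9), F(3, 9)],
    "no": [F(3, 5), F(1, 5), F(4, 5), F(3, 5)],
}

# Score of each class: P(Ci) times the product of P(xk|Ci).
scores = {label: priors[label] * prod(likelihoods[label]) for label in priors}
predicted = max(scores, key=scores.get)

for label, value in scores.items():
    print(label, round(float(value), 3))  # yes 0.005, then no 0.021
print(predicted)                          # no
```

The scores are not normalized by P(X); since P(X) is the same for both classes, it does not change which class wins.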
Individual exercise
Time: 5 minutes
Determine the class of the following new sample:
Xnew =
EXAMPLE 1: Laplace smoothing
Step 1: Estimate P(Ci), with C1 = yes and C2 = no, and P(xk|Ci) using the Laplace smoothing formulas.

P(C1) = (9+1)/(14+2) = 10/16
P(C2) = (5+1)/(14+2) = 6/16

Outlook
P(sunny | y) = 3/12       P(sunny | n) = 4/8
P(overcast | y) = 5/12    P(overcast | n) = 1/8
P(rain | y) = 4/12        P(rain | n) = 3/8

Temperature
P(hot | y) = 3/12         P(hot | n) = 3/8
P(mild | y) = 5/12        P(mild | n) = 3/8
P(cool | y) = 4/12        P(cool | n) = 2/8

Humidity
P(high | y) = 4/11        P(high | n) = 5/7
P(normal | y) = 7/11      P(normal | n) = 2/7

Windy
P(strong | y) = 4/11      P(strong | n) = 4/7
P(weak | y) = 7/11        P(weak | n) = 3/7
EXAMPLE 1:
Step 2: Classification
X_new = <overcast, cool, high, strong>
Using the Laplace-smoothed estimates, we compute:
P(C1)*P(X|C1) = P(C1)*P(overcast|y)*P(cool|y)*P(high|y)*P(strong|y) = 0.011
P(C2)*P(X|C2) = P(C2)*P(overcast|n)*P(cool|n)*P(high|n)*P(strong|n) = 0.005
X_new belongs to class C1 (yes).
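A short Python sketch can confirm these values and show what smoothing changes for this sample. The smoothed fractions are the ones listed in Step 1 of the Laplace example; the unsmoothed fractions are counted from the training table. The helper `nb_score` is a hypothetical name.

```python
from fractions import Fraction as F
from math import prod


def nb_score(prior, conditionals):
    """P(Ci) multiplied by the product of the P(xk|Ci) values."""
    return prior * prod(conditionals)


# Attribute order: overcast, cool, high, strong.
smoothed = {
    "yes": nb_score(F(10, 16), [F(5, 12), F(4, 12), F(4, 11), F(4, 11)]),
    "no": nb_score(F(6, 16), [F(1, 8), F(2, 8), F(5, 7), F(4, 7)]),
}
# Without smoothing, overcast never occurs with Play = no in the table,
# so P(overcast|no) = 0/5 and the whole product for "no" collapses to 0.
unsmoothed = {
    "yes": nb_score(F(9, 14), [F(4, 9), F(3, 9), F(3, 9), F(3, 9)]),
    "no": nb_score(F(5, 14), [F(0, 5), F(1, 5), F(4, 5), F(3, 5)]),
}

for label, value in smoothed.items():
    print(label, round(float(value), 3))  # yes 0.011, then no 0.005
print(max(smoothed, key=smoothed.get))    # yes
print(unsmoothed["no"])                   # 0
```

Both variants pick "yes" here, but only the smoothed one keeps a non-zero score for "no", so the remaining attributes still contribute evidence.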
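Continuous-valued X
If an attribute takes continuous values, the probability P(Xk|Ci) is usually computed from a Gaussian distribution with mean μ and standard deviation σ:

$$g(x, \mu, \sigma) = \frac{1}{\sqrt{2\pi}\,\sigma}\; e^{-\frac{(x-\mu)^2}{2\sigma^2}}$$

and P(Xk|Ci) is:

$$P(x_k \mid C_i) = g(x_k, \mu_{C_i}, \sigma_{C_i})$$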
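As a rough sketch of the continuous case, the density g can be coded directly from the formula. The slides do not say how μ and σ are estimated, so the population standard deviation used below is an assumption, as are the function names. The ages are those of the four Response = Yes customers in the introductory table.

```python
import math


def gaussian_density(x, mean, std):
    """g(x, mu, sigma) = 1 / (sqrt(2*pi) * sigma) * exp(-(x - mu)^2 / (2 * sigma^2))."""
    coefficient = 1.0 / (math.sqrt(2.0 * math.pi) * std)
    exponent = -((x - mean) ** 2) / (2.0 * std ** 2)
    return coefficient * math.exp(exponent)


def estimate_mean_std(values):
    """Mean and population standard deviation (assumed estimator) of one class."""
    mean = sum(values) / len(values)
    variance = sum((v - mean) ** 2 for v in values) / len(values)
    return mean, math.sqrt(variance)


ages_yes = [35, 28, 20, 25]  # Age values of the Response = Yes customers
mean_yes, std_yes = estimate_mean_std(ages_yes)

# Density of Age = 30 under the Yes class; it plays the role of P(xk|Ci).
likelihood = gaussian_density(30, mean_yes, std_yes)
print(mean_yes, round(std_yes, 3), likelihood)
```

The returned value is a density rather than a probability, which is fine for the argmax rule because only the comparison between classes matters.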
Naive Bayes algorithm
Advantages:
- Easy to implement
- Running time comparable to decision trees
- Gives good results in most cases
Disadvantages:
- The assumption that the attributes are conditionally independent reduces accuracy
3. INSTANCE-BASED METHODS
INTRODUCTION
Instance-based classification:
- Store the training samples/objects and process them only when a new sample/object has to be classified
- Assign the sample/object to the class it is closest to
Methods:
- k-nearest neighbors algorithm (k-NN)
- Locally weighted regression
- Case-based reasoning
K-NEAREST NEIGHBORS
"Tell me who your friends are, and I will tell you what kind of person you are."
A new sample is assigned to the class that has the most samples similar to it among its k nearest samples.
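K-NEAREST NEIGHBORS
Algorithm for determining the class of a new sample E:
1. Compute the distance between E and every sample in the training set.
2. Choose the k samples in the training set that are nearest to E.
3. Assign E to the class with the most samples among the k neighbors (or let E take the average value of the k samples).
[Figure: the new sample and its nearest neighbors, labeled Response or No response; predicted class: Response]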
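The three steps can be sketched as follows. This is a minimal illustration, not a tuned implementation: the function names and the 2-D toy points are hypothetical, distance is Euclidean, and ties in the vote are resolved by whichever label `Counter.most_common` returns first.

```python
import math
from collections import Counter


def euclidean(a, b):
    """Euclidean distance between two equal-length numeric vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def knn_classify(train, query, k):
    """train is a list of (feature_vector, label) pairs."""
    # Steps 1 and 2: sort by distance to the query and keep the k nearest.
    neighbors = sorted(train, key=lambda item: euclidean(item[0], query))[:k]
    # Step 3: majority vote among the k neighbors.
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]


# Hypothetical toy data, for illustration only.
train = [
    ((1.0, 1.0), "Response"),
    ((1.5, 2.0), "Response"),
    ((5.0, 5.0), "No response"),
    ((6.0, 5.5), "No response"),
    ((5.5, 6.5), "No response"),
]
print(knn_classify(train, (1.2, 1.4), k=3))  # Response
```

Sorting the whole training set costs O(N log N) per query, which reflects the method's main cost: every prediction looks at all stored samples.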
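K-NEAREST NEIGHBORS
Computing the distance between 2 samples/objects:
- Each sample is a set of numeric attributes.
- The Euclidean distance between X = (x1, ..., xn) and Y = (y1, ..., yn) is:

$$D(X, Y) = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2}$$

- When distances are only compared, the square root can be omitted.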
K-NEAREST NEIGHBORS
Example: compute the distance between John and Rachel.

John: Age = 35, Income = 95K, No. of credit cards = 3
Rachel: Age = 41, Income = 215K, No. of credit cards = 2

D(John, Rachel) = sqrt[(35 - 41)^2 + (95K - 215K)^2 + (3 - 2)^2]

- Attributes with large values have a strong influence on the distance between objects (e.g. the income attribute).
- The attributes have different value ranges.
-> The attribute values need to be normalized.
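Working the example through numerically makes the imbalance visible. The sketch below assumes income is expressed in thousands (95 and 215), as written in the example; the variable names are illustrative.

```python
import math

# (age, income in thousands, number of credit cards)
john = (35, 95, 3)
rachel = (41, 215, 2)

squared_terms = [(a - b) ** 2 for a, b in zip(john, rachel)]
distance = math.sqrt(sum(squared_terms))
income_share = squared_terms[1] / sum(squared_terms)

print(squared_terms)           # [36, 14400, 1]
print(round(distance, 2))      # 120.15
print(round(income_share, 4))  # 0.9974
```

More than 99% of the squared distance comes from the income term alone, so age and the number of cards barely affect which neighbors are selected.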
K-NEAREST NEIGHBORS
The data must be normalized: map the values into the interval [0, 1] using the formula:

$$a_i = \frac{v_i - \min v_i}{\max v_i - \min v_i}$$

where v_i is the actual value of attribute i and a_i is the normalized value of the attribute.
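A minimal Python sketch of this min-max normalization, applied to the Income column of the customer table from the introduction. The function name is hypothetical, and returning 0.0 for a constant column is an assumption, since the formula is undefined when max equals min.

```python
def min_max_normalize(values):
    """Map each value v to (v - min) / (max - min), i.e. into [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # Assumption: a constant attribute carries no information, so map it to 0.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


# Income (K) of the ten labeled customers in the introductory table.
incomes = [35, 50, 40, 100, 30, 55, 200, 140, 170, 40]
normalized = min_max_normalize(incomes)

print(min(normalized), max(normalized))  # 0.0 1.0
print(round(normalized[3], 4))           # income 100 maps to 0.4118
```

After every attribute is rescaled this way, each one contributes at most 1 to the squared Euclidean distance, so no single attribute dominates because of its units.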
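K-NEAREST NEIGHBORS
Advantages:
- Easy to use and implement
- Handles noisy data well
Disadvantages:
- All samples must be stored
- Determining the class of a new sample takes a long time (distances to all samples must be computed and compared)
- Depends on the value of k chosen by the user
  - If k is too small, the result is sensitive to noise
  - If k is too large, the neighborhood may contain points of other classes
- Non-numeric attributes?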
K-NEAREST NEIGHBORS
The result depends on the value of k chosen by the user.
4. MODEL EVALUATION
Model evaluation
Besides the learning algorithm, the performance of a model can depend on other factors:
- Class distribution
- Cost of misclassification
- Size of the training set and the test set
Topics:
- Performance measures
- Evaluation methods
Model evaluation: performance evaluation
Focus on the predictive capability of the model rather than on classification speed, model-building speed, scalability, ...

                         PREDICTED CLASS
                         Class=Yes   Class=No
ACTUAL CLASS  Class=Yes  a           b
              Class=No   c           d

a: TP (true positive)    b: FN (false negative)
c: FP (false positive)   d: TN (true negative)
Performance evaluation
Accuracy of model M, acc(M):

                         PREDICTED CLASS
                         Class=Yes   Class=No
ACTUAL CLASS  Class=Yes  a (TP)      b (FN)
              Class=No   c (FP)      d (TN)

$$\mathrm{Acc}(M) = \frac{a + d}{a + b + c + d} = \frac{TP + TN}{TP + TN + FP + FN}$$
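Performance evaluation
Error rate of model M: error_rate(M) = 1 - acc(M)
Some other measures:

$$\text{Precision } (p) = \frac{a}{a + c}$$

$$\text{Recall } (r) = \frac{a}{a + b}$$

$$\text{F-measure } (F) = \frac{2rp}{r + p} = \frac{2a}{2a + b + c}$$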
Performance evaluation: example

classes             buy_computer = yes  buy_computer = no  total  recognition (%)
buy_computer = yes  6954                46                 7000   99.34
buy_computer = no   412                 2588               3000   86.27
total               7366                2634               10000  95.42

acc(M) = (6954 + 2588)/10000 = 95.42%
error_rate(M) = 1 - 95.42% = 4.58%
Precision(M-Yes) = 6954/7366 = 94.41%
Recall(M-Yes) = 6954/7000 = 99.34%
F-measure(M-Yes) = 96.81%
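The figures in this example can be recomputed from the four cells of the confusion matrix. The sketch below uses a hypothetical helper name and treats buy_computer = yes as the positive class.

```python
def classification_metrics(tp, fn, fp, tn):
    """Accuracy, error rate, precision, recall and F-measure from a 2x2 confusion matrix."""
    total = tp + fn + fp + tn
    accuracy = (tp + tn) / total
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f_measure = 2 * precision * recall / (precision + recall)
    return {
        "accuracy": accuracy,
        "error_rate": 1 - accuracy,
        "precision": precision,
        "recall": recall,
        "f_measure": f_measure,
    }


# Cells of the buy_computer example: a = TP, b = FN, c = FP, d = TN.
metrics = classification_metrics(tp=6954, fn=46, fp=412, tn=2588)
for name, value in metrics.items():
    print(f"{name}: {value * 100:.2f}%")
```

The F-measure computed this way equals 2a / (2a + b + c) = 13908 / 14366, the same value as the harmonic-mean form.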
Model evaluation: evaluation methods
Holdout method:
- Randomly split the data set into 2 independent sets: a training set (2/3) and a test set (1/3).
- Suitable for small data sets.
- The samples may not be representative of the whole data set: a class may be missing from the test set.
Improvements:
- Use a sampling method so that each class is evenly distributed in both the training set and the test set.
- Random sampling: run holdout k times and take acc(M) as the arithmetic mean of the k accuracy values.
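A rough Python sketch of the holdout split and of repeated holdout follows. The function names are hypothetical, and `evaluate` stands for any user-supplied function that trains on the first set and returns the accuracy on the second; stratified sampling is not implemented here.

```python
import random


def holdout_split(samples, train_fraction=2 / 3, seed=0):
    """Randomly split samples into a training part and a test part."""
    rng = random.Random(seed)
    shuffled = list(samples)
    rng.shuffle(shuffled)
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


def repeated_holdout(samples, evaluate, k, train_fraction=2 / 3):
    """Run holdout k times with different random splits and average the accuracies."""
    accuracies = []
    for run in range(k):
        train, test = holdout_split(samples, train_fraction, seed=run)
        accuracies.append(evaluate(train, test))
    return sum(accuracies) / k


samples = list(range(30))  # placeholder data, for illustration only
train, test = holdout_split(samples)
print(len(train), len(test))  # 20 10
```

Using a different seed for each run gives k different splits, so the averaged accuracy depends less on one lucky or unlucky partition.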
Evaluation methods
Cross-validation (k-fold):
- Split the data into k subsets of equal size.
- In each iteration, use one subset as the test set and the remaining subsets as the training set.
- The value of k is usually 10.
- Leave-one-out: k = the number of samples in the data (for small data sets).
- Stratified cross-validation: use a sampling method so that the class distribution in each subset matches that of the whole data set.
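The fold construction can be sketched with plain index lists. The function name is hypothetical, the data is assumed to be shuffled beforehand, and no stratification is done; when the number of samples is not divisible by k, the first folds get one extra sample.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_indices, test_indices) for each of the k folds."""
    indices = list(range(n_samples))
    # Fold sizes differ by at most one when n_samples is not divisible by k.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, test
        start += size


folds = list(k_fold_indices(10, 5))
for train_idx, test_idx in folds:
    print(test_idx, train_idx)

# Leave-one-out is the special case k = n_samples.
loo_folds = list(k_fold_indices(4, 4))
```

Every sample appears in exactly one test fold, so the k accuracy values together cover the whole data set once.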
SUMMARY
- Classification is a form of data analysis that extracts models describing important data classes.
- Many effective algorithms have been developed.
- No single algorithm is superior on every data set.
- Issues such as accuracy, training time, flexibility, and scalability need attention and further research.
WORK TO DO
1. Discuss and work through the exercises of Chapter 4, Part 1 and Part 2, on your own (not to be submitted).
2. Prepare Lesson 5: Data clustering. Look through the exercises of Lesson 5.
How to proceed:
- Read the slides and study the examples.
- Consult the Internet and the references.
Group exercise

Customer  Age  Income (K)  No. cards  Response
Lm        35   35          3          Yes
Hng       22   50          2          No
Mai       28   40          1          Yes
Lan       45   100         2          No
Thy       20   30          3          Yes
Tun       34   55          2          No
Minh      63   200         1          No
Vn        55   140         2          No
Thin      59   170         1          No
Ngc       25   40          4          Yes
Vinh      40   45          2          ???

Time: 15 minutes
Use the k-NN algorithm with k = 3 to determine the class for Vinh.
Submission format
Group exercise
Submission date:
Group name: list the participating members
Full name:
Student ID:
Content:
EXERCISES, PART 2
1. Given the training set from Example 1 of lesson 5-P1 (buys / does not buy a computer), apply the Naive Bayes algorithm to Example 1 and determine the class for the new sample: X= (
EXERCISES, PART 2

age      income  student  credit_rating  buys_computer
>40      low     yes      fair           yes
>40      low     yes      excellent      no
31...40  low     yes      excellent      yes
EXERCISES, PART 2
3. Given the following training set:
a) Use the k-NN algorithm to determine the class for Tuyn with k = 3, 5, or 7. Compare the results obtained.
b) Normalize the data and determine the class for Dng. Compare the result with part a).
c) Find a way to transform the data set into a form to which the decision tree, ILA, or Naive Bayes methods can be applied. Apply one of these 3 methods to the transformed data to determine the class for Dng. Compare the result with part a).
4. Compare the advantages and disadvantages of the classification methods based on decision trees, rules, probability, and instances.
EXERCISES, PART 2

Customer  Age  Income (K)  No. cards  Response
Lm        35   35          3          Yes
Hng       22   50          2          No
Mai       28   40          1          Yes
Lan       45   100         2          No
Thy       20   30          3          Yes
Tun       34   55          2          No
Minh      63   200         1          No
Vn        55   140         2          No
Thin      59   170         1          No
Ngc       25   40          4          Yes
Tuyn      25   30          1          ???
REFERENCES
1. T. M. Mitchell, Machine Learning. McGraw Hill, 1997.
2. J. Han, M. Kamber, Chapter 7, Data Mining: Concepts and Techniques. http://www.cs.sfu.ca/~han/dmbook and http://www-faculty.cs.uiuc.edu/~hanj/bk2/slidesindex.html (2nd ed.)
3. P.-N. Tan, M. Steinbach, V. Kumar, Chapter 4, Introduction to Data Mining. http://www-users.cs.umn.edu/~kumar/dmbook/ch4.pdf
Q & A