IJBSCHS (2010-16-2-16) Biomedical Soft Computing and Human Sciences, Vol.16, No.2, pp.135-145 [Original Article] Copyright © 1995 Biomedical Fuzzy Systems Association
(Accepted on February 20, 2010)
RoCeT: Rough Clustering for web Transactions
Iwan Tri Riyadi Yanto$^{1,3}$, Tutut Herawan$^{2,3}$, Mustafa Mat Deris$^{3}$
$^{1}$Department of Mathematics
$^{2}$Department of Mathematics Education
Universitas Ahmad Dahlan, Yogyakarta, Indonesia
$^{3}$Faculty of Information Technology and Multimedia
Universiti Tun Hussein Onn Malaysia, Johor, Malaysia
(received 31 December 2009, revised and accepted 20 February 2010)
Abstract: Grouping web transactions into clusters is important in order to obtain a better understanding of users' behavior. Currently, the rough approximation-based clustering technique has been used to group web transactions into clusters. However, the processing time is still an issue due to the high complexity of finding the similarity of upper approximations of a transaction, which is used to merge two or more clusters. In addition, the problem of more than one transaction falling under the given threshold is not addressed. In this paper, we propose the RoCeT model for grouping web transactions using rough set theory. It is based on two similarity classes having a non-void intersection.
Keywords: Clustering, Web transactions, Rough set theory.
1 Introduction
Web usage data includes data from web server access logs, proxy server logs, browser logs, user profiles, registration files, user sessions or transactions, user queries, bookmark folders, mouse clicks and scrolls, and any other data generated by the interaction of users and the web [1]. Generally, web mining techniques can be defined as methods that extract so-called nuggets (or knowledge) from a web data repository, such as content, linkage, and usage information, by utilizing data mining tools. Among such web data, the user click stream, i.e. usage data, can be utilized to capture users' navigation patterns and identify users' intended tasks. Once users' navigational behaviors are effectively characterized, they benefit further web applications and, in turn, facilitate and improve web service quality for both web-based organizations and end users [2-9]. In web data mining research, many data mining techniques, such as clustering [8,10], are widely adopted to improve the usability and scalability of web mining.
$^{1}$Faculty of Information Technology and Multimedia, UTHM, Parit Raja, Batu Pahat 86400, Johor. Phone: +60177061496. Email: [email protected]

Access transactions over the web can be expressed in terms of two finite sets, user transactions and hyperlinks/URLs [11]. A user transaction is a sequence of items. The set $U$ is formed by $m$ users and the set $A$ is the set of $n$ distinct clicks (hyperlinks/URLs) made by users, that is, $U = \{t_1, t_2, \ldots, t_m\}$ and $A = \{hl_1, hl_2, \ldots, hl_n\}$, where every $t_i \in T \subseteq U$ is a non-empty subset of $A$. The temporal order of users' clicks within transactions has been taken into account. A user transaction $t \in T$ is represented as a vector $t = \langle u^t_1, u^t_2, \ldots, u^t_n \rangle$, where $u^t_i = 1$ if $hl_i \in t$, and $u^t_i = 0$ otherwise.
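The binary vector representation of a transaction can be illustrated with a short Python sketch. The helper name `to_vector` and the sample data are hypothetical and chosen only for illustration; they do not come from the original paper.

```python
# Hypothetical helper: encode a transaction (a set of hyperlinks) as a 0/1 vector.
def to_vector(transaction, hyperlinks):
    """Return [u_1, ..., u_n] with u_i = 1 if hyperlinks[i] occurs in the transaction."""
    return [1 if hl in transaction else 0 for hl in hyperlinks]

# Illustrative data (assumed, not taken from the paper's experiments).
A = ["hl1", "hl2", "hl3", "hl4", "hl5"]   # set of distinct hyperlinks
t = {"hl2", "hl3", "hl5"}                 # one user transaction

vector = to_vector(t, A)
print(vector)
```

Each position of the vector corresponds to one hyperlink of $A$, so two transactions can be compared position by position or, equivalently, as sets.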
A well-known approach for clustering web transactions is rough set theory [12-14]. De and Krishna [11] proposed an algorithm for clustering web transactions using rough approximation. It is based on the similarity of upper approximations of transactions for a given threshold. However, several iterations are needed to merge two or more clusters that have the same similarity upper approximations, and the authors did not show how to handle the case in which more than one transaction falls under the given threshold. To overcome these problems, in this paper we propose an alternative technique for clustering web transactions. We use the concept of similarity class proposed by [11]. However, the RoCeT model differs in how it allocates transactions to the same cluster and in how it handles the case in which more than one transaction falls under the given threshold.
The rest of the paper is organized as follows. Section 2 describes the concept of rough set theory. Section 3 describes the work of [11]. Section 4 describes the RoCeT model. Section 5 describes the experimental test. Finally, we conclude our work in Section 6.
2 Rough Set Theory
An information system is a 4-tuple (quadruple) $S = (U, A, V, f)$, where $U = \{u_1, u_2, u_3, \ldots, u_{|U|}\}$ is a non-empty finite set of objects, $A = \{a_1, a_2, a_3, \ldots, a_{|A|}\}$ is a non-empty finite set of attributes, $V = \bigcup_{a \in A} V_a$, where $V_a$ is the domain (value set) of attribute $a$, and $f : U \times A \to V$ is a function such that $f(u, a) \in V_a$ for every $(u, a) \in U \times A$, called the information (knowledge) function. The starting point of rough set approximations is the indiscernibility relation, which is generated by information about objects of interest. Two objects in an information system are called indiscernible (indistinguishable or similar) if they have the same features.
Definition 1 Two elements $x, y \in U$ are said to be $B$-indiscernible (indiscernible by the set of attributes $B \subseteq A$ in $S$) if and only if $f(x, a) = f(y, a)$ for every $a \in B$.
Obviously, every subset of $A$ induces a unique indiscernibility relation. Notice that the indiscernibility relation induced by the set of attributes $B$, denoted by $IND(B)$, is an equivalence relation. The partition of $U$ induced by $IND(B)$ is denoted by $U/B$, and the equivalence class in the partition $U/B$ containing $x \in U$ is denoted by $[x]_B$. The notions of lower and upper approximations of a set are defined as follows.
Definition 2 (See [14].) The $B$-lower approximation of $X$, denoted by $\underline{B}(X)$, and the $B$-upper approximation of $X$, denoted by $\overline{B}(X)$, are defined respectively by
$$\underline{B}(X) = \{x \in U \mid [x]_B \subseteq X\} \quad \text{and} \quad \overline{B}(X) = \{x \in U \mid [x]_B \cap X \neq \emptyset\}.$$
The accuracy of approximation (accuracy of roughness) of any subset $X \subseteq U$ with respect to $B \subseteq A$, denoted by $\alpha_B(X)$, is measured by
$$\alpha_B(X) = \frac{|\underline{B}(X)|}{|\overline{B}(X)|},$$
where $|X|$ denotes the cardinality of $X$. For the empty set $\emptyset$, we define $\alpha_B(\emptyset) = 1$. Obviously, $0 \leq \alpha_B(X) \leq 1$. If $X$ is a union of some equivalence classes of $U$, then $\alpha_B(X) = 1$, and the set $X$ is crisp (precise) with respect to $B$. If $X$ is not a union of some equivalence classes of $U$, then $\alpha_B(X) < 1$, and the set $X$ is rough (imprecise) with respect to $B$ [13]. This means that the higher the accuracy of approximation of a subset $X \subseteq U$, the more precise (the less imprecise) the subset is.
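The definitions of this section can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions: the function names (`partition`, `approximations`, `accuracy`) and the toy information system are hypothetical and are not taken from the paper.

```python
from collections import defaultdict

def partition(universe, table, attrs):
    """Equivalence classes of U/B: objects that agree on every attribute in attrs."""
    blocks = defaultdict(set)
    for x in universe:
        blocks[tuple(table[x][a] for a in attrs)].add(x)
    return list(blocks.values())

def approximations(universe, table, attrs, X):
    """Return the B-lower and B-upper approximations of X."""
    lower, upper = set(), set()
    for block in partition(universe, table, attrs):
        if block <= X:      # [x]_B is contained in X
            lower |= block
        if block & X:       # [x]_B intersects X
            upper |= block
    return lower, upper

def accuracy(lower, upper):
    """Accuracy of approximation; defined as 1 when the upper approximation is empty."""
    return 1.0 if not upper else len(lower) / len(upper)

# Toy information system (assumed data): u1 and u2 are indiscernible on {a, b}.
table = {
    "u1": {"a": 0, "b": 1},
    "u2": {"a": 0, "b": 1},
    "u3": {"a": 1, "b": 0},
    "u4": {"a": 1, "b": 1},
}
U = set(table)
X = {"u1", "u3"}

lower, upper = approximations(U, table, ["a", "b"], X)
print(sorted(lower), sorted(upper), accuracy(lower, upper))
```

In this toy case $X$ splits the class $\{u_1, u_2\}$, so $X$ is rough: the lower approximation contains only $u_3$ while the upper approximation also pulls in $u_2$.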
3 Related Work
In this section, we discuss the technique proposed by [11]. Given two transactions $t$ and $s$, the measurement of similarity between $t$ and $s$ is given by $sim(t, s) = |t \cap s| / |t \cup s|$. Obviously, $sim(t, s) \in [0, 1]$, where $sim(t, s) = 1$ when the two transactions $t$ and $s$ are exactly identical and $sim(t, s) = 0$ when the two transactions $t$ and $s$ have no items in common. De and Krishna [11] used a binary relation $R$ on $T$ defined as follows. For any threshold value $th \in [0, 1]$ and for any two user transactions $t, s \in T$, $tRs$ iff $sim(t, s) \geq th$. This relation $R$ is a tolerance relation, as $R$ is both reflexive and symmetric, but transitivity may not always hold.
Definition 3 The similarity class of $t$, denoted by $R(t)$, is the set of transactions which are similar to $t$, given by $R(t) = \{s \in T : sRt\}$.
For different threshold values, one can get different similarity classes. A domain expert can choose the threshold based on his or her experience to get a proper similarity class. It is clear that for a fixed threshold $th \in [0, 1]$, a transaction from a given similarity class may be similar to an object of another similarity class.
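The similarity measure and the similarity class can be sketched as follows. The function names and the three sample transactions are hypothetical; the sketch assumes every transaction is a non-empty set, so the union in the denominator is never empty.

```python
def sim(t, s):
    """Similarity |t ∩ s| / |t ∪ s| of two non-empty transactions (sets of items)."""
    return len(t & s) / len(t | s)

def similarity_class(name, transactions, th):
    """R(t): names of all transactions s with sim(s, t) >= th."""
    t = transactions[name]
    return {other for other, s in transactions.items() if sim(s, t) >= th}

# Illustrative transactions (assumed data).
transactions = {
    "t": {"a", "b"},
    "s": {"b", "c"},
    "r": {"a", "b", "c"},
}

classes = {name: similarity_class(name, transactions, 0.5) for name in transactions}
print(classes)
```

With threshold 0.5, `t` is similar to `r` and `r` is similar to `s`, but `t` is not similar to `s`, which illustrates why the relation is a tolerance relation rather than an equivalence relation.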
Definition 4 Let $P \subseteq T$, and let $R$ be a binary tolerance relation defined on $T$ for a fixed threshold $th \in [0, 1]$. The lower approximation of $P$, denoted by $\underline{R}(P)$, and the upper approximation of $P$, denoted by $\overline{R}(P)$, are respectively defined as
$$\underline{R}(P) = \{t \in P : R(t) \subseteq P\} \quad \text{and} \quad \overline{R}(P) = \bigcup_{t \in P} R(t).$$
They proposed a technique for clustering the clicks of user navigations called the similarity upper approximation, denoted by $S_i$. The set of transactions that are possibly similar to $R(t_i)$ is denoted by $RR(t_i)$. This process continues until two consecutive upper approximations for $t_i$, $i = 1, 2, 3, \ldots, |U|$, are the same, and two or more clusters that have the same similarity upper approximation are merged at each iteration. This technique has a high computational complexity, because the similarity upper approximation must be recomputed until two consecutive upper approximations are the same. To overcome this problem, we propose an alternative technique to cluster the transactions.
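The iterative procedure described above can be sketched in Python as follows. This is one reading of the description in this section, not the implementation of [11]: the function names and the four sample similarity classes are assumptions made for illustration.

```python
def upper_approximation(P, classes):
    """Upper approximation of P: the union of R(t) over all t in P."""
    result = set()
    for t in P:
        result |= classes[t]
    return result

def similarity_upper_approximations(classes):
    """Apply the upper approximation repeatedly (R, RR, RRR, ...) until nothing changes."""
    current = {t: set(c) for t, c in classes.items()}
    rounds = 0
    while True:
        nxt = {t: upper_approximation(P, classes) for t, P in current.items()}
        rounds += 1
        if nxt == current:          # two consecutive approximations coincide
            return current, rounds
        current = nxt

# Hypothetical similarity classes for four transactions a, b, c, d.
classes = {
    "a": {"a", "b"},
    "b": {"a", "b", "c"},
    "c": {"b", "c"},
    "d": {"d"},
}

S, rounds = similarity_upper_approximations(classes)
clusters = {frozenset(s) for s in S.values()}   # identical S_i collapse into one cluster
print(S, rounds, clusters)
```

The loop needs one extra round only to confirm that the approximations have stopped changing, which is the source of the repeated work that the alternative technique aims to avoid.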
4 The RoCeT model
The RoCeT model for clustering the transactions is based on all transactions that are possibly similar to the similarity class of $t$, $R(t)$. Two similarity classes with a non-void intersection are united into the same cluster. The justification that a cluster is a union of similarity classes with non-void intersection is presented in Proposition 6.
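Definition 5 Two clusters $S_i$ and $S_j$, $i \neq j$, are said to be the same if
$$S_i = \bigcup R(t_i), \quad i = 1, 2, 3, \ldots, |U|.$$
Proposition 6 Let $S_i$ be a cluster. If $\bigcap R(t_i) \neq \emptyset$, then $\bigcup R(t_i) = S_i$.
Proof. We suppose that $S_i$ and $S_j$, where $i \neq j$, are the same cluster. From Definition 5, if $\bigcup R(t_i) \neq S_i$, then we have
$$\bigcup R(t_i) \neq S_i \neq S_j \neq \bigcup R(t_j)$$
$$\Rightarrow \bigcup R(t_i) \neq \bigcup R(t_j)$$
$$\Rightarrow \left(\bigcup R(t_i)\right) \cap \left(\bigcup R(t_j)\right) = \emptyset$$
$$\Rightarrow \bigcap R(t_i) = \emptyset.$$
This contradicts the hypothesis.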
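The idea of uniting similarity classes with non-void intersection can be sketched in Python as below. The paper does not spell out a procedure at this point, so the single-pass merging loop, the function name `rocet_clusters`, and the sample classes are assumptions made for illustration only.

```python
def rocet_clusters(classes):
    """Unite similarity classes that share at least one transaction.

    Invariant: the sets kept in `clusters` are pairwise disjoint, so each new
    class only has to be compared once against every existing cluster.
    """
    clusters = []
    for cls in classes.values():
        merged = set(cls)
        disjoint = []
        for c in clusters:
            if c & merged:          # non-void intersection: same cluster
                merged |= c
            else:
                disjoint.append(c)
        clusters = disjoint + [merged]
    return clusters

# Hypothetical similarity classes for four transactions a, b, c, d.
classes = {
    "a": {"a", "b"},
    "b": {"a", "b", "c"},
    "c": {"b", "c"},
    "d": {"d"},
}

clusters = rocet_clusters(classes)
print(clusters)
```

Unlike the iterative upper approximation, this sketch visits each similarity class once and never recomputes a class that has already been absorbed into a cluster.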
4.1 Complexity
Suppose that in an information system $S = (U, A, V, f)$ there are $|U|$ objects, which means there are at most $|U|$ similarity classes. The number of comparisons of a similarity class $R(t_i)$ with $R(t_j)$, where $i \neq j$, is $|U| \times (|U| - 1)$. Thus, the overall computational complexity of the RoCeT model is polynomial, $O(|U| \times (|U| - 1))$.
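As a small worked instance of this count (added for illustration, not part of the original analysis), a data set with $|U| = 4$ transactions gives
$$|U| \times (|U| - 1) = 4 \times 3 = 12$$
ordered pairs $(R(t_i), R(t_j))$ with $i \neq j$. Because set intersection is symmetric, only $12 / 2 = 6$ of these pairs are distinct, which does not change the polynomial order of the bound.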
4.2 Example
In this study, the RoCeT model and the technique proposed by [11] are compared on two examples, each using a small data set of transactions.
a. The first transaction data set is adopted from [11] and is given in Table 1. It contains four objects ($|U| = 4$) with five hyperlinks ($|A| = 5$).

Table 1. Data transactions

| U/A | hl1 | hl2 | hl3 | hl4 | hl5 |
|-----|-----|-----|-----|-----|-----|
| t1  | 1   | 1   | 0   | 0   | 0   |
| t2  | 0   | 1   | 1   | 1   | 0   |
| t3  | 1   | 0   | 1   | 0   | 1   |
| t4  | 0   | 1   | 1   | 0   | 1   |
The technique of [11] needs three main steps. The first step is to obtain the measure of similarity, which gives information about the users' access patterns related to their common areas of interest through the similarity relation between two transactions. The calculations of the measure of similarity for the web transactions from Table 1 are given below.
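$$sim(t_1, t_2) = \frac{|t_1 \cap t_2|}{|t_1 \cup t_2|} = \frac{|\{hl_2\}|}{|\{hl_1, hl_2, hl_3, hl_4\}|} = 0.25,$$
$$sim(t_1, t_3) = \frac{|t_1 \cap t_3|}{|t_1 \cup t_3|} = \frac{|\{hl_1\}|}{|\{hl_1, hl_2, hl_3, hl_5\}|} = 0.25,$$
$$sim(t_1, t_4) = \frac{|t_1 \cap t_4|}{|t_1 \cup t_4|} = \frac{|\{hl_2\}|}{|\{hl_1, hl_2, hl_3, hl_5\}|} = 0.25,$$
$$sim(t_2, t_3) = \frac{|t_2 \cap t_3|}{|t_2 \cup t_3|} = \frac{|\{hl_3\}|}{|\{hl_1, hl_2, hl_3, hl_4, hl_5\}|} = 0.2,$$
$$sim(t_2, t_4) = \frac{|t_2 \cap t_4|}{|t_2 \cup t_4|} = \frac{|\{hl_2, hl_3\}|}{|\{hl_2, hl_3, hl_4, hl_5\}|} = 0.5,$$
$$sim(t_3, t_4) = \frac{|t_3 \cap t_4|}{|t_3 \cup t_4|} = \frac{|\{hl_3, hl_5\}|}{|\{hl_1, hl_2, hl_3, hl_5\}|} = 0.5.$$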
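The six values above can be reproduced with a short Python sketch. The dictionary `table1` transcribes Table 1 as sets of hyperlinks; the variable and function names are hypothetical.

```python
from itertools import combinations

# Table 1 written as sets of clicked hyperlinks.
table1 = {
    "t1": {"hl1", "hl2"},
    "t2": {"hl2", "hl3", "hl4"},
    "t3": {"hl1", "hl3", "hl5"},
    "t4": {"hl2", "hl3", "hl5"},
}

def sim(t, s):
    """Similarity |t ∩ s| / |t ∪ s| of two non-empty transactions."""
    return len(t & s) / len(t | s)

# One similarity value per unordered pair of transactions.
similarities = {
    (a, b): sim(table1[a], table1[b])
    for a, b in combinations(sorted(table1), 2)
}

for pair, value in similarities.items():
    print(pair, round(value, 2))
```

Only the pairs $(t_2, t_4)$ and $(t_3, t_4)$ reach a similarity of 0.5, which is what determines the similarity classes in the next step.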
Second, the similarity classes are obtained for the given threshold value using Definition 3. With the threshold value 0.5, we get the following similarity classes.

$R(t_1) = \{t_1\}$, $R(t_2) = \{t_2, t_4\}$, $R(t_3) = \{t_3, t_4\}$, $R(t_4) = \{t_2, t_3, t_4\}$.

The last step is to cluster the transactions. To get the clusters, [11] used the similarity upper approximations, and the process is shown below.

$R(t_1) = \{t_1\}$, $R(t_2) = \{t_2, t_4\}$, $R(t_3) = \{t_3, t_4\}$, $R(t_4) = \{t_2, t_3, t_4\}$,
$RR(t_1) = \{t_1\}$, $RR(t_2) = \{t_2, t_3, t_4\}$, $RR(t_3) = \{t_2, t_3, t_4\}$, $RR(t_4) = \{t_2, t_3, t_4\}$,
$RRR(t_2) = \{t_2, t_3, t_4\}$, $RRR(t_3) = \{t_2, t_3, t_4\}$.

Here, we can see that two consecutive upper approximations for $\{t_1\}$, $\{t_2\}$, $\{t_3\}$ and $\{t_4\}$ are the same. Thus, we get the similarity upper approximations for $\{t_1\}$, $\{t_2\}$, $\{t_3\}$ and $\{t_4\}$ as follows.

$S_1 = \{t_1\}$, $S_2 = \{t_2, t_3, t_4\}$, $S_3 = \{t_2, t_3, t_4\}$, $S_4 = \{t_2, t_3, t_4\}$,

where $S_2 = S_3 = S_4$ and $S_1 \neq S_i$ for $i = 2, 3, 4$. Finally, we get the two clusters $\{t_1\}$ and $\{t_2, t_3, t_4\}$.
The RoCeT model, in contrast, is based on non-void intersection. According to Definition 5 and the similarity classes as used in [11], only a few computations are needed to get the clusters. Therefore, the RoCeT model clusters the transactions more efficiently than the technique of [11]. The calculation of the similarity relation is shown in Figure 1.
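$R(t_1) \cap R(t_2) = \{t_1\} \cap \{t_2, t_4\} = \emptyset$,
$R(t_1) \cap R(t_3) = \{t_1\} \cap \{t_3, t_4\} = \emptyset$,
$R(t_1) \cap R(t_4) = \{t_1\} \cap \{t_2, t_3, t_4\} = \emptyset$,
$R(t_2) \cap R(t_3) = \{t_2, t_4\} \cap \{t_3, t_4\} = \{t_4\}$,
$R(t_2) \cap R(t_4) = \{t_2, t_4\} \cap \{t_2, t_3, t_4\} = \{t_2, t_4\}$,
$R(t_3) \cap R(t_4) = \{t_3, t_4\} \cap \{t_2, t_3, t_4\} = \{t_3, t_4\}$.

Fig.1. The similarity relation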
Here, we can see that $R(t_i) \cap R(t_j) \neq \emptyset$, $i \neq j$, for $i, j = 2, 3, 4$. We get the clusters as follows.
$S_1 = R(t_1) = \{t_1\}$,
$S_2 = S_3 = S_4 = \bigcup R(t_i)$, $i = 2, 3, 4$,
$\bigcup R(t_i) = \{t_2, t_4\} \cup \{t_3, t_4\} \cup \{t_2, t_3, t_4\} = \{t_2, t_3, t_4\}$.
Hence, the two clusters are $\{t_1\}$ and $\{t_2, t_3, t_4\}$.
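b. The second transaction data set is given in Table 2. It contains eleven objects ($|U| = 11$) with six hyperlinks ($|A| = 6$).

Table 2. Data transactions

| U/A | hl1 | hl2 | hl3 | hl4 | hl5 | hl6 |
|-----|-----|-----|-----|-----|-----|-----|
| t1  | 1   | 0   | 1   | 1   | 0   | 0   |
| t2  | 0   | 0   | 1   | 1   | 1   | 1   |
| t3  | 0   | 1   | 0   | 0   | 1   | 1   |
| t4  | 1   | 0   | 0   | 0   | 0   | 0   |
| t5  | 0   | 0   | 0   | 0   | 1   | 0   |
| t6  | 0   | 1   | 1   | 0   | 0   | 0   |
| t7  | 0   | 0   | 1   | 1   | 0   | 0   |
| t8  | 0   | 0   | 1   | 1   | 0   | 1   |
| t9  | 0   | 0   | 1   | 0   | 0   | 0   |
| t10 | 0   | 1   | 0   | 1   | 0   | 0   |
| t11 | 0   | 0   | 1   | 1   | 1   | 0   |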
The similarities for the transactions are given below.

$sim(t_1, t_2) = 0.40$, $sim(t_1, t_3) = 0$, $sim(t_1, t_4) = 0.33$, $sim(t_1, t_5) = 0$, $sim(t_1, t_6) = 0.25$, $sim(t_1, t_7) = 0.67$, $sim(t_1, t_8) = 0.50$, $sim(t_1, t_9) = 0.33$, $sim(t_1, t_{10}) = 0.25$, $sim(t_1, t_{11}) = 0.50$,
$sim(t_2, t_3) = 0.40$, $sim(t_2, t_4) = 0$, $sim(t_2, t_5) = 0.25$, $sim(t_2, t_6) = 0.20$, $sim(t_2, t_7) = 0.50$, $sim(t_2, t_8) = 0.75$, $sim(t_2, t_9) = 0.25$, $sim(t_2, t_{10}) = 0.20$, $sim(t_2, t_{11}) = 0.75$,
$sim(t_3, t_4) = 0$, $sim(t_3, t_5) = 0.33$, $sim(t_3, t_6) = 0.25$, $sim(t_3, t_7) = 0$, $sim(t_3, t_8) = 0.20$, $sim(t_3, t_9) = 0$, $sim(t_3, t_{10}) = 0.25$, $sim(t_3, t_{11}) = 0.20$,
$sim(t_4, t_5) = 0$, $sim(t_4, t_6) = 0$, $sim(t_4, t_7) = 0$, $sim(t_4, t_8) = 0$, $sim(t_4, t_9) = 0$, $sim(t_4, t_{10}) = 0$, $sim(t_4, t_{11}) = 0$,
$sim(t_5, t_6) = 0$, $sim(t_5, t_7) = 0$, $sim(t_5, t_8) = 0$, $sim(t_5, t_9) = 0$, $sim(t_5, t_{10}) = 0$, $sim(t_5, t_{11}) = 0.33$,
$sim(t_6, t_7) = 0.33$, $sim(t_6, t_8) = 0.25$, $sim(t_6, t_9) = 0.50$, $sim(t_6, t_{10}) = 0.33$, $sim(t_6, t_{11}) = 0.25$,
$sim(t_7, t_8) = 0.67$, $sim(t_7, t_9) = 0.50$, $sim(t_7, t_{10}) = 0.33$, $sim(t_7, t_{11}) = 0.67$,
$sim(t_8, t_9) = 0.33$, $sim(t_8, t_{10}) = 0.25$, $sim(t_8, t_{11}) = 0.50$,
$sim(t_9, t_{10}) = 0$, $sim(t_9, t_{11}) = 0.33$,
$sim(t_{10}, t_{11}) = 0.25$.
With the threshold value 0.5, the similarity classes are as follows.

$R(t_1) = \{t_1, t_7, t_8, t_{11}\}$, $R(t_2) = \{t_2, t_7, t_8, t_{11}\}$, $R(t_3) = \{t_3\}$, $R(t_4) = \{t_4\}$, $R(t_5) = \{t_5\}$, $R(t_6) = \{t_6, t_9\}$, $R(t_7) = \{t_1, t_2, t_7, t_8, t_9, t_{11}\}$, $R(t_8) = \{t_1, t_2, t_7, t_8, t_{11}\}$, $R(t_9) = \{t_6, t_7, t_9\}$, $R(t_{10}) = \{t_{10}\}$, $R(t_{11}) = \{t_1, t_2, t_7, t_8, t_{11}\}$.

The process of finding the similarity upper approximations for each transaction can be illustrated as follows.
$R(t_1) = \{t_1, t_7, t_8, t_{11}\}$, $R(t_2) = \{t_2, t_7, t_8, t_{11}\}$, $R(t_3) = \{t_3\}$, $R(t_4) = \{t_4\}$, $R(t_5) = \{t_5\}$, $R(t_6) = \{t_6, t_9\}$, $R(t_7) = \{t_1, t_2, t_7, t_8, t_9, t_{11}\}$, $R(t_8) = \{t_1, t_2, t_7, t_8, t_{11}\}$, $R(t_9) = \{t_6, t_7, t_9\}$, $R(t_{10}) = \{t_{10}\}$, $R(t_{11}) = \{t_1, t_2, t_7, t_8, t_{11}\}$,

$RR(t_1) = \{t_1, t_2, t_7, t_8, t_9, t_{11}\}$, $RR(t_2) = \{t_1, t_2, t_7, t_8, t_9, t_{11}\}$, $RR(t_3) = \{t_3\}$, $RR(t_4) = \{t_4\}$, $RR(t_5) = \{t_5\}$, $RR(t_6) = \{t_6, t_7, t_9\}$, $RR(t_7) = \{t_1, t_2, t_6, t_7, t_8, t_9, t_{11}\}$, $RR(t_8) = \{t_1, t_2, t_7, t_8, t_9, t_{11}\}$, $RR(t_9) = \{t_1, t_2, t_6, t_7, t_8, t_9, t_{11}\}$, $RR(t_{10}) = \{t_{10}\}$, $RR(t_{11}) = \{t_1, t_2, t_7, t_8, t_9, t_{11}\}$,

$RRR(t_1) = \{t_1, t_2, t_6, t_7, t_8, t_9, t_{11}\}$, $RRR(t_2) = \{t_1, t_2, t_6, t_7, t_8, t_9, t_{11}\}$, $RRR(t_3) = \{t_3\}$, $RRR(t_4) = \{t_4\}$, $RRR(t_5) = \{t_5\}$, $RRR(t_6) = \{t_1, t_2, t_6, t_7, t_8, t_9, t_{11}\}$, $RRR(t_7) = \{t_1, t_2, t_6, t_7, t_8, t_9, t_{11}\}$, $RRR(t_8) = \{t_1, t_2, t_6, t_7, t_8, t_9, t_{11}\}$, $RRR(t_9) = \{t_1, t_2, t_6, t_7, t_8, t_9, t_{11}\}$, $RRR(t_{10}) = \{t_{10}\}$, $RRR(t_{11}) = \{t_1, t_2, t_6, t_7, t_8, t_9, t_{11}\}$,

$RRRR(t_1) = \{t_1, t_2, t_6, t_7, t_8, t_9, t_{11}\}$, $RRRR(t_2) = \{t_1, t_2, t_6, t_7, t_8, t_9, t_{11}\}$, $RRRR(t_3) = \{t_3\}$, $RRRR(t_4) = \{t_4\}$, $RRRR(t_5) = \{t_5\}$, $RRRR(t_6) = \{t_1, t_2, t_6, t_7, t_8, t_9, t_{11}\}$, $RRRR(t_7) = \{t_1, t_2, t_6, t_7, t_8, t_9, t_{11}\}$, $RRRR(t_8) = \{t_1, t_2, t_6, t_7, t_8, t_9, t_{11}\}$, $RRRR(t_9) = \{t_1, t_2, t_6, t_7, t_8, t_9, t_{11}\}$, $RRRR(t_{10}) = \{t_{10}\}$, $RRRR(t_{11}) = \{t_1, t_2, t_6, t_7, t_8, t_9, t_{11}\}$.

Hence, two consecutive upper approximations for $\{t_i\}$, $i = 1, 2, \ldots, 11$, are the same. Therefore, we get the similarity upper approximations as follows.
$S_1 = \{t_1, t_2, t_6, t_7, t_8, t_9, t_{11}\}$, $S_2 = \{t_1, t_2, t_6, t_7, t_8, t_9, t_{11}\}$, $S_3 = \{t_3\}$, $S_4 = \{t_4\}$, $S_5 = \{t_5\}$, $S_6 = \{t_1, t_2, t_6, t_7, t_8, t_9, t_{11}\}$, $S_7 = \{t_1, t_2, t_6, t_7, t_8, t_9, t_{11}\}$, $S_8 = \{t_1, t_2, t_6, t_7, t_8, t_9, t_{11}\}$, $S_9 = \{t_1, t_2, t_6, t_7, t_8, t_9, t_{11}\}$, $S_{10} = \{t_{10}\}$, $S_{11} = \{t_1, t_2, t_6, t_7, t_8, t_9, t_{11}\}$.

Since $S_i = S_j \neq S_k$, where $i, j = 1, 2, 6, 7, 8, 9, 11$ and $k = 3, 4, 5, 10$, according to [11] there are five clusters: $\{t_3\}$, $\{t_4\}$, $\{t_5\}$, $\{t_{10}\}$ and $\{t_1, t_2, t_6, t_7, t_8, t_9, t_{11}\}$.
For the proposed method, the intersections of similarity classes are summarized in Table 3. From Table 3, notice that $R(t_i) \cap R(t_j) \neq \emptyset$ for $i \neq j$; $i, j = 1, 2, 6, 7, 8, 9, 11$, and $R(t_k) \cap R(t_l) = \emptyset$ for $k \neq l$; $k = 1, 2, \ldots, 11$; $l = 3, 4, 5, 10$. We get the clusters as follows.
$S_1 = \{t_1, t_2, t_6, t_7, t_8, t_9, t_{11}\}$, $S_2 = \{t_1, t_2, t_6, t_7, t_8, t_9, t_{11}\}$, $S_3 = \{t_3\}$, $S_4 = \{t_4\}$, $S_5 = \{t_5\}$, $S_6 = \{t_1, t_2, t_6, t_7, t_8, t_9, t_{11}\}$, $S_7 = \{t_1, t_2, t_6, t_7, t_8, t_9, t_{11}\}$, $S_8 = \{t_1, t_2, t_6, t_7, t_8, t_9, t_{11}\}$, $S_9 = \{t_1, t_2, t_6, t_7, t_8, t_9, t_{11}\}$, $S_{10} = \{t_{10}\}$, $S_{11} = \{t_1, t_2, t_6, t_7, t_8, t_9, t_{11}\}$.

The five clusters are $\{t_1, t_2, t_6, t_7, t_8, t_9, t_{11}\}$, $\{t_3\}$, $\{t_4\}$, $\{t_5\}$ and $\{t_{10}\}$. These are the same clusters as those in [11]. However, the number of iterations is lower than that of the technique proposed by [11]. With the given threshold value, $\{t_3\}$, $\{t_4\}$, $\{t_5\}$ and $\{t_{10}\}$ are segregated clusters, but the transaction data suggest that there may be related transactions among these clusters. To handle this problem, we propose an alternative technique that applies a second threshold value. We therefore keep $\{t_1, t_2, t_6, t_7, t_8, t_9, t_{11}\}$ as the first cluster under the first threshold value, and for the remainder, $\{t_3\}$, $\{t_4\}$, $\{t_5\}$, $\{t_{10}\}$, we apply the second threshold value and group the remaining transactions by similarity. The similarities of the remaining transactions are shown below.
$sim(t_3, t_4) = 0$, $sim(t_3, t_5) = 0.33$, $sim(t_3, t_{10}) = 0.25$, $sim(t_4, t_5) = 0$, $sim(t_4, t_{10}) = 0$, $sim(t_5, t_{10}) = 0$.
Given a second threshold value of 0.3, the similarity classes are as follows.

$R(t_3) = \{t_3, t_5\}$, $R(t_4) = \{t_4\}$, $R(t_5) = \{t_3, t_5\}$, $R(t_{10}) = \{t_{10}\}$.

The intersections of similarity classes are summarized in Figure 2.

$R(t_3) \cap R(t_4) = \{t_3, t_5\} \cap \{t_4\} = \emptyset$,
$R(t_3) \cap R(t_5) = \{t_3, t_5\} \cap \{t_3, t_5\} = \{t_3, t_5\}$,
$R(t_3) \cap R(t_{10}) = \{t_3, t_5\} \cap \{t_{10}\} = \emptyset$,
$R(t_4) \cap R(t_5) = \{t_4\} \cap \{t_3, t_5\} = \emptyset$,
$R(t_4) \cap R(t_{10}) = \{t_4\} \cap \{t_{10}\} = \emptyset$,
$R(t_5) \cap R(t_{10}) = \{t_3, t_5\} \cap \{t_{10}\} = \emptyset$.

Fig.2. The intersection of similarity classes
Based on Figure 2, we see that $R(t_3) \cap R(t_5) \neq \emptyset$ and $R(t_i) \cap R(t_j) = \emptyset$ for $i \neq j$, $i = 3, 4, 5$; $j = 4, 10$. We get the clusters $S_3 = \{t_3, t_5\}$, $S_4 = \{t_4\}$, $S_5 = \{t_3, t_5\}$, $S_{10} = \{t_{10}\}$. Hence, the three clusters are $\{t_3, t_5\}$, $\{t_4\}$ and $\{t_{10}\}$. Overall, for both threshold values we have four clusters: $\{t_1, t_2, t_6, t_7, t_8, t_9, t_{11}\}$, $\{t_3, t_5\}$, $\{t_4\}$ and $\{t_{10}\}$.
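The two-threshold procedure of this example can be sketched end to end in Python. This is a minimal sketch under stated assumptions rather than the authors' MATLAB implementation: the function names are hypothetical, hyperlinks are written as integers 1 to 6, and the sketch assumes that the transactions re-examined with the second threshold are exactly those left in singleton clusters after the first pass.

```python
# Table 2 as sets of hyperlink indices (hl1..hl6 written as 1..6).
table2 = {
    "t1": {1, 3, 4}, "t2": {3, 4, 5, 6}, "t3": {2, 5, 6}, "t4": {1},
    "t5": {5}, "t6": {2, 3}, "t7": {3, 4}, "t8": {3, 4, 6},
    "t9": {3}, "t10": {2, 4}, "t11": {3, 4, 5},
}

def sim(t, s):
    """Similarity |t ∩ s| / |t ∪ s| of two non-empty transactions."""
    return len(t & s) / len(t | s)

def similarity_classes(data, th):
    """R(t) for every transaction t in data at threshold th."""
    return {a: {b for b in data if sim(data[a], data[b]) >= th} for a in data}

def merge_overlapping(classes):
    """Unite similarity classes that have a non-void intersection."""
    clusters = []
    for cls in classes.values():
        merged = set(cls)
        disjoint = []
        for c in clusters:
            if c & merged:
                merged |= c
            else:
                disjoint.append(c)
        clusters = disjoint + [merged]
    return clusters

def two_stage(data, th1, th2):
    """Cluster with th1, then regroup the leftover singletons with th2 (assumed rule)."""
    first = merge_overlapping(similarity_classes(data, th1))
    kept = [c for c in first if len(c) > 1]
    rest = {t: data[t] for c in first if len(c) == 1 for t in c}
    second = merge_overlapping(similarity_classes(rest, th2))
    return kept + second

clusters = two_stage(table2, 0.5, 0.3)
print(clusters)
```

Run on Table 2 with thresholds 0.5 and 0.3, the sketch yields the same four clusters as the worked example; a different rule for choosing which transactions get the second threshold would need a different `two_stage`.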
The purity of clusters was used as a measure to test the quality of the clusters [11]. The purity of a cluster and the overall purity are defined as
$$\text{Purity}(i) = \frac{t_i^{th}}{t_n},$$
where
$t_i^{th}$: the number of data occurring in the $i$th cluster under the given threshold,
$t_n$: the number of data in the data set,
and
$$\text{Overall Purity} = \frac{\sum_{i=1}^{\#\text{ of clusters}} \text{Purity}(i)}{\#\text{ of clusters}}.$$
According to this measure, a higher value of overall purity indicates a better clustering result, with perfect clustering yielding a value of 100%. The RoCeT model and the algorithm of [11] for clustering web transactions are implemented in MATLAB version 7.6.0.324 (R2008a).
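Table 3. The intersection of similarity classes (each cell lists the indices of the transactions in $R(t_i) \cap R(t_j)$; an empty cell denotes $\emptyset$)

| T/T | t1       | t2       | t3 | t4 | t5 | t6  | t7         | t8         | t9 | t10 |
|-----|----------|----------|----|----|----|-----|------------|------------|----|-----|
| t2  | 7,8,11   |          |    |    |    |     |            |            |    |     |
| t3  |          |          |    |    |    |     |            |            |    |     |
| t4  |          |          |    |    |    |     |            |            |    |     |
| t5  |          |          |    |    |    |     |            |            |    |     |
| t6  |          |          |    |    |    |     |            |            |    |     |
| t7  | 1,7,8,11 | 2,7,8,11 |    |    |    | 9   |            |            |    |     |
| t8  | 1,7,8,11 | 2,7,8,11 |    |    |    |     | 1,2,7,8,11 |            |    |     |
| t9  | 7        | 7        |    |    |    | 6,9 | 7,9        | 7          |    |     |
| t10 |          |          |    |    |    |     |            |            |    |     |
| t11 | 1,7,8,11 | 2,7,8,11 |    |    |    |     | 1,2,7,8,11 | 1,2,7,8,11 | 7  |     |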
[Figure: two panels plotting transactions (1 to 11) against clusters. Left panel: "by given threshold 0.5" (clusters 1 to 5). Right panel: "after given second threshold 0.3" (clusters 1 to 4).]
Fig.3. Visualization of example 2
They are executed sequentially on an Intel Core 2 Duo processor. The total main memory is 1 gigabyte and the operating system is Windows XP Professional SP3. The purity of the clusters is described in Figure 4.
The comparisons of computation and response time of RoCeT and [11] on the transaction data set from Table 2 are given in Figures 6 and 7, respectively.
Based on Figure 8, the RoCeT algorithm provides better solutions than the algorithm of [11].
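| Cluster | Member Transactions          | Purity |
|---------|------------------------------|--------|
| 1       | t1, t2, t6, t7, t8, t9, t11  | 1      |
| 2       | t3, t5                       | 1      |
| 3       | t4                           | 1      |
| 4       | t10                          | 1      |
| Overall Purity |                       | 100 %  |

Fig.4. The Purity of clusters

|                       | Overall Purity |
|-----------------------|----------------|
| The technique of [11] | 81.82 %        |
| RoCeT                 | 100 %          |

Fig.5. The Overall Purity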
[Figure: chart titled "Computation" comparing the technique of [11] and RoCeT; vertical axis from 55 to 100.]
Fig.6. The Computation
[Figure: chart titled "Response Time" comparing the technique of [11] and RoCeT; vertical axis in seconds, from 0.015 to 0.06.]
Fig.7. The Response Time
| Data        | Purity  | Computation | Time    |
|-------------|---------|-------------|---------|
| Transaction | 18.18 % | 46.30 %     | 80.77 % |

Fig.8. The overall improvement over [11] by RoCeT
5 Experiment test
In order to test the RoCeT model and compare it with the algorithm of [11], we use a web log data set from:
http://kdd.ics.uci.edu/databases/msnbc/msnbc.html.
The data describes the page visits by users who visited msnbc.com on September 28, 1999. Visits are recorded at the level of URL category and in chronological order. The data comes from Internet Information Server (IIS) logs for msnbc.com. Each row in the data set corresponds to the page visits of a user within a twenty-four hour period. Each item in a row corresponds to a request of a user for a page. The client-side cached data is not recorded, thus this data contains only the server-side log. From almost one million transactions, we take 2000 transactions and split them into five categories: 100, 200, 500, 1000 and 2000. The comparison of response times is captured in Figure 9 and the computation is given in Figure 10.
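Table 4. The Purity of clusters

| Number of Transactions | The RoCeT model | The technique of [11] | Improvement |
|------------------------|-----------------|-----------------------|-------------|
| 100                    | 100%            | 93.0%                 | 7.0%        |
| 200                    | 100%            | 96.0%                 | 4.0%        |
| 500                    | 100%            | 95.5%                 | 0.5%        |
| 1000                   | 100%            | 95.5%                 | 0.5%        |
| 2000                   | 100%            | 99.9%                 | 0.1%        |
| Average                |                 |                       | 2.5%        |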
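Table 5. The executing time (seconds)

| Number of Transactions | The RoCeT model | The technique of [11] | Improvement |
|------------------------|-----------------|-----------------------|-------------|
| 100                    | 1.6969          | 6.250                 | 68.79%      |
| 200                    | 9.093           | 6.250                 | 66.25%      |
| 500                    | 77.266          | 163.760               | 48.35%      |
| 1000                   | 554.426         | 2205.100              | 65.10%      |
| 2000                   | 3043.500        | 9780.900              | 64.97%      |
| Average                |                 |                       | 62.69%      |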
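Table 6. The Computation

| Number of Transactions | The RoCeT model | The technique of [11] | Improvement |
|------------------------|-----------------|-----------------------|-------------|
| 100                    | 8806            | 28213                 | 68.50%      |
| 200                    | 39349           | 116576                | 68.39%      |
| 500                    | 257003          | 497595                | 52.82%      |
| 1000                   | 1034964         | 2965579               | 74.86%      |
| 2000                   | 4161122         | 11879645              | 68.88%      |
| Average                |                 |                       | 69.69%      |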
[Figure: plot titled "Response Time"; horizontal axis: number of transactions (100, 200, 500, 1000, 2000); vertical axis: seconds (0 to 10000); series: the technique of [11] and RoCeT.]
Fig.9. The executing time
[Figure: plot titled "Computation"; horizontal axis: number of transactions (100, 200, 500, 1000, 2000); vertical axis: 0 to 12 × 10^6; series: RoCeT and the technique of [11].]
Fig.10. The Computation
[Figure: two panels plotting transactions (0 to 100) against clusters. First panel: "1st threshold 0.6". Second panel: "after given 2nd threshold 0.3".]
Fig.11. Visualization of 100 transactions
[Figure: two panels plotting transactions (0 to 200) against clusters. First panel: "1st threshold 0.6". Second panel: "after given 2nd threshold 0.3".]
Fig.12. Visualization of 200 transactions
[Figure: two panels plotting transactions (0 to 500) against clusters. First panel: "1st threshold 0.6". Second panel: "after given 2nd threshold".]
Fig.13. Visualization of 500 transactions
[Figure: two panels plotting transactions (0 to 1000) against clusters. First panel: "1st threshold 0.6". Second panel: "after given 2nd threshold 0.3".]
Fig.14. Visualization of 1000 transactions
[Figure: two panels plotting transactions (0 to 2000) against clusters. First panel: "1st threshold 0.6". Second panel: "after given 2nd threshold 0.3".]
Fig.15. Visualization of 2000 transactions
6 Conclusion
A web clustering technique can be applied to find interesting user access patterns in web logs. In this paper, we have proposed the RoCeT model for clustering web transactions using rough set theory, based on the similarity between two transactions. The analysis of the RoCeT model was presented in terms of computation, processing time and cluster purity. We evaluated the proposed technique on UCI benchmark data, i.e., the msnbc.com web log data. It is shown that the RoCeT model requires significantly lower response time, by up to 62.69 %, as compared to the technique of [11]. Meanwhile, for cluster purity it performs better by up to 2.5 %.
7 Acknowledgement
This work was supported by the FRGS under theGrant No. Vote 0402, Ministry of Higher Education,Malaysia.
References
[1] Pal, S.K., Talwar, V. and Mitra, P., (2002) Web Mining in Soft Computing Framework: Relevance, State of the Art and Future Directions, IEEE Transactions on Neural Networks, 13 (5), 1163-1177.
[2] Bucher, A.G. and Mulvenna, M.D., (1998) Discovering Internet Marketing Intelligence through Online Analytical Web Usage Mining, SIGMOD Record, 27 (4), 54-61.
[3] Cohen, E., Krishnamurthy, B. and Rexford, J., (1998) Improving end-to-end performance of the web using server volumes and proxy filters, Proceedings of the ACM SIGCOMM. Vancouver, British Columbia, Canada: ACM Press.
[4] Joachims, T., Freitag, D. and Mitchell, T., (1997) Webwatcher: A tour guide for the world wide web, In the 15th International Joint Conference on Artificial Intelligence (IJCAI97), Nagoya, Japan.
[5] Lieberman, H., (1995) Letizia: An agent that assists web browsing, Proceedings of the 1995 International Joint Conference on Artificial Intelligence. Montreal, Canada: Morgan Kaufmann.
[6] Mobasher, B., Cooley, R., and Srivastava, J., (1999) Creating adaptive web sites through usage based clustering of URLs, Proceedings of the 1999 Workshop on Knowledge and Data Engineering Exchange. IEEE Computer Society.
[7] Ngu, D.S.W. and Wu, X., (1997) Sitehelper: A localized agent that helps incremental exploration of the world wide web, Proceedings of the 6th International World Wide Web Conference. Santa Clara, CA: ACM Press.
[8] Perkowitz, M. and Etzioni, O., (1998) Adaptive Web Sites: Automatically Synthesizing Web Pages, Proceedings of the 15th National Conference on Artificial Intelligence. Madison, WI: AAAI.
[9] Yanchun, Z., Guandong, X. and Xiaofang, Z., (2005) A Latent Usage Approach for Clustering Web Transaction and Building User Profile, Springer-Verlag Berlin Heidelberg, 31-42.
[10] Han, E. et al., (1998) Hypergraph Based Clustering in High-Dimensional Data Sets: A Summary of Results, IEEE Data Engineering Bulletin, 21 (11), 15-22.
[11] De, S.K. and Krishna, P.R., (2004) Clustering web transactions using rough approximation, Fuzzy Sets and Systems, 148, 131-138.
[12] Pawlak, Z., (1982) Rough sets, International Journal of Computer and Information Science, 11, 341-356.
[13] Pawlak, Z., (1991) Rough sets: A theoretical aspect of reasoning about data, Kluwer Academic Publisher.
[14] Pawlak, Z. and Skowron, A., (2007) Rudiments of rough sets, Information Sciences. An International Journal, 177 (1), 3-27.
Iwan Tri Riyadi Yanto
He is a Master candidate in Data Mining at Universiti Tun Hussein Onn Malaysia (UTHM). His research area includes Data Mining, KDD, and Real Analysis.

Tutut Herawan
He is a Ph.D. candidate in Data Mining at Universiti Tun Hussein Onn Malaysia (UTHM). His research area includes Data Mining, KDD and Real Analysis.
Mustafa Mat Deris
He received the B.Sc. from University Putra Malaysia, the M.Sc. from University of Bradford, England, and the Ph.D. from University Putra Malaysia. He is a professor of computer science in the Faculty of Information Technology and Multimedia, UTHM, Malaysia. His research interests include distributed databases, data grid, database performance issues and data mining. He has published more than 80 papers in journals and conference proceedings. He was appointed as one of the editorial board members for the International Journal of Information Technology, World Enformatika Society, and as a reviewer for a special issue of the International Journal of Parallel and Distributed Databases, Elsevier, 2004, a special issue of the International Journal of Cluster Computing, Kluwer, 2004, the IEEE conference on Cluster and Grid Computing, held in Chicago, April 2004, and the Malaysian Journal of Computer Science. He has served as a program committee member for numerous international conferences/workshops including Grid and Peer-to-Peer Computing (GP2P 2005, 2006), Autonomic Distributed Data and Storage Systems Management (ADSM 2005, 2006), WSEAS, International Association of Science and Technology, IASTED on Database, etc.