Rough Clustering

IJBSCHS(2010-16-2-16) Biomedical Soft Computing and Human Sciences, Vol.16, No.2, pp.135-145[Original Article] Copyright c1995 Biomedical Fuzzy Systems Association

(Accepted on February 20,2010)

RoCeT: Rough Clustering for web Transactions

Iwan Tri Riyadi Yanto1,3, Tutut Herawan2,3, Mustafa Mat Deris3

1Department of Mathematics

2Department of Mathematics Education

Universitas Ahmad Dahlan, Yogyakarta, Indonesia

3Faculty of Information Technology and Multimedia

Universiti Tun Hussein Onn Malaysia, Johor, Malaysia

(received 31 December 2009, revised and accepted 20 February 2010)

Abstract: Grouping web transactions into clusters is important for obtaining a better understanding of users' behavior. Currently, the rough approximation-based clustering technique has been used to group web transactions into clusters. However, processing time is still an issue because of the high complexity of finding the similarity of upper approximations of a transaction, which is used to merge two or more clusters. Moreover, the problem of more than one transaction under a given threshold is not addressed. In this paper, we propose the RoCeT model for grouping web transactions using rough set theory. It is based on two similarity classes having a nonvoid intersection.

Keywords: Clustering, Web transactions, Rough set theory.

1 Introduction

Web usage data includes data from web server access logs, proxy server logs, browser logs, user profiles, registration files, user sessions or transactions, user queries, bookmark folders, mouse-clicks and scrolls, and any other data generated by the interaction of users and the web [1]. Generally, web mining techniques can be defined as methods that extract so-called nuggets (or knowledge) from a web data repository, such as content, linkage and usage information, by utilizing data mining tools. Among such web data, the user click stream, i.e. usage data, can be utilized to capture users' navigation patterns and identify users' intended tasks. Once the users' navigational behaviors are effectively characterized, they benefit further web applications and, in turn, facilitate and improve web service quality for both web-based organizations and end users [2-9]. In web data mining research, many data mining techniques, such as clustering [8,10], are widely adopted to improve the usability and scalability of web mining.

Access transactions over the web can be expressed in terms of two finite sets, user transactions and hyperlinks/URLs [11]. A user transaction is a sequence of items; the set $U$ is formed by the transactions of $m$ users and the set $A$ consists of the $n$ distinct clicks (hyperlinks/URLs) made by users, that is, $U = \{t_1, t_2, \dots, t_m\}$ and $A = \{hl_1, hl_2, \dots, hl_n\}$, where every $t_i \in T \subseteq U$ is a non-empty subset of $A$. The temporal order of users' clicks within transactions has been taken into account. A user transaction $t \in T$ is represented as a vector $t = \langle u^t_1, u^t_2, \dots, u^t_n \rangle$, where $u^t_i = 1$ if $hl_i \in t$, and $u^t_i = 0$ otherwise.

1Faculty of Information Technology and Multimedia, UTHM, Parit Raja, Batu Pahat 86400, Johor. Phone: +60177061496. Email: [email protected]

A well-known approach for clustering web transactions uses rough set theory [12-14]. De and Krishna [11] proposed an algorithm for clustering web transactions using rough approximation. It is based on the similarity of upper approximations of transactions for a given threshold. However, several iterations are needed to merge two or more clusters that have the same similarity upper approximation, and the authors did not show how to handle the case where more than one transaction falls under the given threshold. To overcome those problems, in this paper we propose an alternative technique for clustering web transactions. We use the concept of similarity class proposed by [11]. However, the RoCeT model differs in how it allocates transactions to the same cluster and in how it handles the case where more than one transaction falls under the given threshold.

The rest of the paper is organized as follows. Section 2 describes the concept of rough set theory. Section 3 describes the work of [11]. Section 4 describes the RoCeT model. Section 5 describes the experimental test. Finally, we conclude our work in Section 6.


2 Rough Set Theory

An information system is a 4-tuple (quadruple) $S = (U, A, V, f)$, where $U = \{u_1, u_2, u_3, \dots, u_{|U|}\}$ is a non-empty finite set of objects, $A = \{a_1, a_2, a_3, \dots, a_{|A|}\}$ is a non-empty finite set of attributes, $V = \bigcup_{a \in A} V_a$, $V_a$ is the domain (value set) of attribute $a$, and $f : U \times A \to V$ is an information (knowledge) function such that $f(u, a) \in V_a$ for every $(u, a) \in U \times A$. The starting point of rough set approximations is the indiscernibility relation, which is generated by information about objects of interest. Two objects in an information system are called indiscernible (indistinguishable or similar) if they have the same features.

Definition 1 Two elements $x, y \in U$ are said to be $B$-indiscernible (indiscernible by the set of attributes $B \subseteq A$ in $S$) if and only if $f(x, a) = f(y, a)$ for every $a \in B$.

Obviously, every subset of $A$ induces a unique indiscernibility relation. Notice that the indiscernibility relation induced by the set of attributes $B$, denoted by $IND(B)$, is an equivalence relation. The partition of $U$ induced by $IND(B)$ is denoted by $U/B$, and the equivalence class in the partition $U/B$ containing $x \in U$ is denoted by $[x]_B$. The notions of lower and upper approximations of a set are defined as follows.

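Definition 2 (See [14].) The $B$-lower approximation of $X$, denoted by $\underline{B}(X)$, and the $B$-upper approximation of $X$, denoted by $\overline{B}(X)$, are respectively defined by

$$\underline{B}(X) = \{x \in U \mid [x]_B \subseteq X\} \quad \text{and} \quad \overline{B}(X) = \{x \in U \mid [x]_B \cap X \neq \emptyset\}.$$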

The accuracy of approximation (accuracy of roughness) of any subset $X \subseteq U$ with respect to $B \subseteq A$, denoted $\alpha_B(X)$, is measured by

$$\alpha_B(X) = \frac{|\underline{B}(X)|}{|\overline{B}(X)|},$$

where $|X|$ denotes the cardinality of $X$. For the empty set $\emptyset$, we define $\alpha_B(\emptyset) = 1$. Obviously, $0 \le \alpha_B(X) \le 1$. If $X$ is a union of some equivalence classes of $U$, then $\alpha_B(X) = 1$, and the set $X$ is crisp (precise) with respect to $B$. If $X$ is not a union of some equivalence classes of $U$, then $\alpha_B(X) < 1$, and the set $X$ is rough (imprecise) with respect to $B$ [13]. This means that the higher the accuracy of approximation of a subset $X \subseteq U$, the more precise (the less imprecise) it is.
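The definitions of this section can be illustrated with a minimal Python sketch. The function names (`partition`, `approximations`, `accuracy`) and the four-object table are hypothetical and chosen only for illustration; they do not come from the paper.

```python
from collections import defaultdict

def partition(table, attrs):
    """Equivalence classes of IND(B): objects with equal values on every attribute in attrs."""
    classes = defaultdict(set)
    for obj, row in table.items():
        classes[tuple(row[a] for a in attrs)].add(obj)
    return list(classes.values())

def approximations(table, attrs, X):
    """Return the B-lower and B-upper approximations of the object set X."""
    lower, upper = set(), set()
    for eq in partition(table, attrs):
        if eq <= X:      # [x]_B is contained in X
            lower |= eq
        if eq & X:       # [x]_B meets X
            upper |= eq
    return lower, upper

def accuracy(table, attrs, X):
    """Accuracy of approximation |lower| / |upper|, with the convention that the empty set has accuracy 1."""
    lower, upper = approximations(table, attrs, X)
    return 1.0 if not upper else len(lower) / len(upper)

# Hypothetical information system with attributes a and b.
table = {
    "u1": {"a": 0, "b": 1},
    "u2": {"a": 0, "b": 1},
    "u3": {"a": 1, "b": 0},
    "u4": {"a": 1, "b": 1},
}
X = {"u1", "u3"}
lower, upper = approximations(table, ["a", "b"], X)
print(sorted(lower), sorted(upper), accuracy(table, ["a", "b"], X))
```

Here u1 and u2 are indiscernible, so a set that contains u1 but not u2 is rough, while a set such as {u1, u2} is a union of equivalence classes and therefore crisp.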

3 Related Work

In this section, we discuss the technique proposed by [11]. Given two transactions $t$ and $s$, the measurement of similarity between $t$ and $s$ is given by $sim(t, s) = |t \cap s| / |t \cup s|$. Obviously, $sim(t, s) \in [0, 1]$, where $sim(t, s) = 1$ when the two transactions $t$ and $s$ are exactly identical and $sim(t, s) = 0$ when they have no items in common. De and Krishna [11] used a binary relation $R$ on $T$ defined as follows. For any threshold value $th \in [0, 1]$ and any two user transactions $t, s \in T$, $tRs$ iff $sim(t, s) \ge th$. This relation $R$ is a tolerance relation: it is both reflexive and symmetric, but transitivity may not always hold.

Definition 3 The similarity class of $t$, denoted by $R(t)$, is the set of transactions which are similar to $t$, given by $R(t) = \{s \in T : sRt\}$.

For different threshold values, one can get different similarity classes. A domain expert can choose the threshold based on his or her experience to get a proper similarity class. It is clear that, for a fixed threshold $th \in [0, 1]$, a transaction from a given similarity class may be similar to an object of another similarity class.
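The similarity measure and Definition 3 can be sketched in a few lines of Python. The function names `sim` and `similarity_classes` are illustrative assumptions, as is the convention that two empty transactions have similarity 1 (the paper only considers non-empty transactions). The data are the four transactions listed in Table 1.

```python
def sim(t, s):
    """Similarity |t ∩ s| / |t ∪ s| of two transactions given as sets of hyperlinks."""
    union = t | s
    # Assumption: two empty transactions are treated as identical.
    return len(t & s) / len(union) if union else 1.0

def similarity_classes(transactions, th):
    """R(t) = {s : sim(s, t) >= th} for every transaction t."""
    return {
        name: {other for other, items in transactions.items() if sim(t, items) >= th}
        for name, t in transactions.items()
    }

# Transactions of Table 1, written as sets of the hyperlinks they contain.
transactions = {
    "t1": {"hl1", "hl2"},
    "t2": {"hl2", "hl3", "hl4"},
    "t3": {"hl1", "hl3", "hl5"},
    "t4": {"hl2", "hl3", "hl5"},
}
classes = similarity_classes(transactions, 0.5)
for name in sorted(classes):
    print(name, sorted(classes[name]))
```

Lowering the threshold enlarges the classes: with a threshold of 0.25, for instance, t1 becomes similar to t2, t3 and t4 as well.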

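Definition 4 Let $P \subseteq T$ and, for a fixed threshold $th \in [0, 1]$, let a binary tolerance relation $R$ be defined on $T$. The lower approximation of $P$, denoted by $\underline{R}(P)$, and the upper approximation of $P$, denoted by $\overline{R}(P)$, are respectively defined as

$$\underline{R}(P) = \{t \in P : R(t) \subseteq P\} \quad \text{and} \quad \overline{R}(P) = \bigcup_{t \in P} R(t).$$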

They proposed a technique of clustering the clicks of user navigations called the similarity upper approximation, denoted by $S_i$. The set of transactions that are possibly similar to $R(t_i)$ is denoted by $RR(t_i)$. This process continues until two consecutive upper approximations for $t_i$, $i = 1, 2, 3, \dots, |U|$, are the same, and two or more clusters that have the same similarity upper approximation are merged at each iteration. This technique has high computational complexity, because the similarity upper approximation must be recomputed until two consecutive upper approximations are the same. To overcome this problem, we propose an alternative technique to cluster the transactions.
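The iteration described above can be sketched as follows. This is one reading of the procedure of [11], not the authors' implementation: the names `upper` and `similarity_upper_approximation` are hypothetical, and the similarity classes are those obtained for the Table 1 data with threshold 0.5.

```python
def upper(classes, P):
    """Upper approximation of P: the union of R(t) over all t in P."""
    result = set()
    for t in P:
        result |= classes[t]
    return result

def similarity_upper_approximation(classes, t):
    """Iterate R(t), RR(t), RRR(t), ... until two consecutive sets coincide.

    Returns the fixed point and the number of upper approximations computed.
    """
    current = classes[t]
    steps = 0
    while True:
        nxt = upper(classes, current)
        steps += 1
        if nxt == current:
            return current, steps
        current = nxt

# Similarity classes for the Table 1 data with threshold 0.5.
classes = {
    "t1": {"t1"},
    "t2": {"t2", "t4"},
    "t3": {"t3", "t4"},
    "t4": {"t2", "t3", "t4"},
}
# Transactions with equal similarity upper approximations fall into one cluster.
clusters = {frozenset(similarity_upper_approximation(classes, t)[0]) for t in classes}
print(sorted(sorted(c) for c in clusters))
```

Every transaction needs at least one extra upper approximation just to confirm that its set has stopped growing, which is the repeated work the RoCeT model aims to avoid.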

4 The RoCeT model

The RoCeT model for clustering the transactions is based on all the transactions possibly similar to the similarity class of $t$, $R(t)$. The union of two similarity classes with non-void intersection forms a single cluster. The justification that a cluster is a union of similarity classes with non-void intersection is presented in Proposition 6.


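Definition 5 Two clusters $S_i$ and $S_j$, $i \neq j$, are said to be the same if

$$S_i = \bigcup R(t_i), \quad i = 1, 2, 3, \dots, |U|.$$

Proposition 6 Let $S_i$ be a cluster. If $\bigcap R(t_i) \neq \emptyset$, then $\bigcup R(t_i) = S_i$.

Proof. We suppose that $S_i$ and $S_j$, where $i \neq j$, are the same cluster. From Definition 5, if $\bigcup R(t_i) \neq S_i$, then we have

$$\bigcup R(t_i) \neq S_i \neq S_j \neq \bigcup R(t_j)$$

$$\Rightarrow \bigcup R(t_i) \neq \bigcup R(t_j)$$

$$\Rightarrow \Big(\bigcup R(t_i)\Big) \cap \Big(\bigcup R(t_j)\Big) = \emptyset$$

$$\Rightarrow \bigcap R(t_i) = \emptyset.$$

This contradicts the hypothesis.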

4.1 Complexity

Suppose that in an information system $S = (U, A, V, f)$ there are $|U|$ objects, which means there are at most $|U|$ similarity classes. Computing the intersections of the similarity classes $R(t_i)$ and $R(t_j)$, where $i \neq j$, takes $|U| \times (|U| - 1)$ comparisons. Thus, the overall computational complexity of the RoCeT model is polynomial, $O(|U| \times (|U| - 1))$.
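A minimal sketch of the merging step, under the assumption that clusters are formed transitively (classes linked through a chain of non-void intersections end up in one cluster, as in the second example of Section 4.2). The function name `rocet_clusters` and the union-find bookkeeping are illustrative choices, not the authors' MATLAB code. The sketch visits each unordered pair once, so it performs $|U|(|U|-1)/2$ intersection tests, half the ordered-pair count given above.

```python
from itertools import combinations

def rocet_clusters(classes):
    """Merge similarity classes whose intersection is non-void.

    Returns the clusters and the number of pairwise intersection tests.
    """
    parent = {t: t for t in classes}

    def find(t):
        # Follow parent links to the representative, halving paths on the way.
        while parent[t] != t:
            parent[t] = parent[parent[t]]
            t = parent[t]
        return t

    comparisons = 0
    for a, b in combinations(classes, 2):
        comparisons += 1
        if classes[a] & classes[b]:      # non-void intersection
            parent[find(a)] = find(b)

    clusters = {}
    for t in classes:
        # A cluster is the union of the similarity classes grouped together.
        clusters.setdefault(find(t), set()).update(classes[t])
    return list(clusters.values()), comparisons

# Similarity classes for the Table 1 data with threshold 0.5.
classes = {
    "t1": {"t1"},
    "t2": {"t2", "t4"},
    "t3": {"t3", "t4"},
    "t4": {"t2", "t3", "t4"},
}
clusters, comparisons = rocet_clusters(classes)
print(sorted(sorted(c) for c in clusters), comparisons)
```

Unlike the iterated upper approximation, a single pass over the pairs of similarity classes suffices.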

4.2 Example

In this study, the RoCeT model and the technique proposed by [11] are compared using two examples with small transaction data sets.

a. The first transaction data set is adopted from [11] and given in Table 1; it contains four objects ($|U| = 4$) with five hyperlinks ($|A| = 5$).

Table 1. Data transactions

U/A   hl1  hl2  hl3  hl4  hl5
t1    1    1    0    0    0
t2    0    1    1    1    0
t3    1    0    1    0    1
t4    0    1    1    0    1

The technique of [11] needs three main steps. The first step is to obtain the measure of similarity, which gives information about the users' access patterns related to their common areas of interest through the similarity relation between two transactions. The calculations of the measure of similarity for the web transactions from Table 1 are given below.

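$$sim(t_1, t_2) = \frac{|t_1 \cap t_2|}{|t_1 \cup t_2|} = \frac{|\{hl_2\}|}{|\{hl_1, hl_2, hl_3, hl_4\}|} = 0.25,$$

$$sim(t_1, t_3) = \frac{|t_1 \cap t_3|}{|t_1 \cup t_3|} = \frac{|\{hl_1\}|}{|\{hl_1, hl_2, hl_3, hl_5\}|} = 0.25,$$

$$sim(t_1, t_4) = \frac{|t_1 \cap t_4|}{|t_1 \cup t_4|} = \frac{|\{hl_2\}|}{|\{hl_1, hl_2, hl_3, hl_5\}|} = 0.25,$$

$$sim(t_2, t_3) = \frac{|t_2 \cap t_3|}{|t_2 \cup t_3|} = \frac{|\{hl_3\}|}{|\{hl_1, hl_2, hl_3, hl_4, hl_5\}|} = 0.2,$$

$$sim(t_2, t_4) = \frac{|t_2 \cap t_4|}{|t_2 \cup t_4|} = \frac{|\{hl_2, hl_3\}|}{|\{hl_2, hl_3, hl_4, hl_5\}|} = 0.5,$$

$$sim(t_3, t_4) = \frac{|t_3 \cap t_4|}{|t_3 \cup t_4|} = \frac{|\{hl_3, hl_5\}|}{|\{hl_1, hl_2, hl_3, hl_5\}|} = 0.5.$$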

Second, the similarity classes can be obtained for a given threshold value using Definition 3. Given the threshold value 0.5, we get the following similarity classes.

R(t1) = {t1},
R(t2) = {t2, t4},
R(t3) = {t3, t4},
R(t4) = {t2, t3, t4}.

The last step is to cluster the transactions. To get the clusters, [11] used the similarity upper approximations, and the process is shown below.

R(t1) = {t1}, R(t2) = {t2, t4}, R(t3) = {t3, t4}, R(t4) = {t2, t3, t4},
RR(t1) = {t1}, RR(t2) = {t2, t3, t4}, RR(t3) = {t2, t3, t4}, RR(t4) = {t2, t3, t4},
RRR(t2) = {t2, t3, t4}, RRR(t3) = {t2, t3, t4}.

Here, we can see that two consecutive upper approximations for {t1}, {t2}, {t3} and {t4} are the same. Thus, we get the similarity upper approximations for {t1}, {t2}, {t3} and {t4} as follows.

S1 = {t1},
S2 = {t2, t3, t4},
S3 = {t2, t3, t4},
S4 = {t2, t3, t4},

where $S_2 = S_3 = S_4$ and $S_1 \neq S_i$ for $i = 2, 3, 4$. Finally, we get the two clusters {t1} and {t2, t3, t4}.

The RoCeT model, however, is based on non-void intersection. According to Definition 5 and the similarity classes as used in [11], only a few computations are needed to get the clusters. Therefore, the RoCeT model clusters the transactions more efficiently than the technique of [11]. The calculation of the similarity relation is shown in Figure 1.

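$R(t_1) \cap R(t_2) = \{t_1\} \cap \{t_2, t_4\} = \emptyset$,
$R(t_1) \cap R(t_3) = \{t_1\} \cap \{t_3, t_4\} = \emptyset$,
$R(t_1) \cap R(t_4) = \{t_1\} \cap \{t_2, t_3, t_4\} = \emptyset$,
$R(t_2) \cap R(t_3) = \{t_2, t_4\} \cap \{t_3, t_4\} = \{t_4\}$,
$R(t_2) \cap R(t_4) = \{t_2, t_4\} \cap \{t_2, t_3, t_4\} = \{t_2, t_4\}$,
$R(t_3) \cap R(t_4) = \{t_3, t_4\} \cap \{t_2, t_3, t_4\} = \{t_3, t_4\}$.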

Fig.1. The similarity relation

Here, we can see that $R(t_i) \cap R(t_j) \neq \emptyset$, $i \neq j$, for $i, j = 2, 3, 4$. We get the clusters as follows.

$S_1 = R(t_1) = \{t_1\}$,

$S_2 = S_3 = S_4 = \bigcup R(t_i)$, $i = 2, 3, 4$,

$\bigcup R(t_i) = \{t_2, t_4\} \cup \{t_3, t_4\} \cup \{t_2, t_3, t_4\} = \{t_2, t_3, t_4\}$.

Hence, the two clusters are {t1} and {t2, t3, t4}.

b. The second transaction data set is given in Table 2; it contains eleven objects ($|U| = 11$) with six hyperlinks ($|A| = 6$).

Table 2. Data transactions

U/A   hl1  hl2  hl3  hl4  hl5  hl6
t1    1    0    1    1    0    0
t2    0    0    1    1    1    1
t3    0    1    0    0    1    1
t4    1    0    0    0    0    0
t5    0    0    0    0    1    0
t6    0    1    1    0    0    0
t7    0    0    1    1    0    0
t8    0    0    1    1    0    1
t9    0    0    1    0    0    0
t10   0    1    0    1    0    0
t11   0    0    1    1    1    0

The similarities of the transactions are given below.

sim(t1, t2) = 0.40, sim(t1, t3) = 0, sim(t1, t4) = 0.33, sim(t1, t5) = 0, sim(t1, t6) = 0.25, sim(t1, t7) = 0.67, sim(t1, t8) = 0.50, sim(t1, t9) = 0.33, sim(t1, t10) = 0.25, sim(t1, t11) = 0.50,
sim(t2, t3) = 0.40, sim(t2, t4) = 0, sim(t2, t5) = 0.25, sim(t2, t6) = 0.20, sim(t2, t7) = 0.50, sim(t2, t8) = 0.75, sim(t2, t9) = 0.25, sim(t2, t10) = 0.20, sim(t2, t11) = 0.75,
sim(t3, t4) = 0, sim(t3, t5) = 0.33, sim(t3, t6) = 0.25, sim(t3, t7) = 0, sim(t3, t8) = 0.20, sim(t3, t9) = 0, sim(t3, t10) = 0.25, sim(t3, t11) = 0.20,
sim(t4, t5) = 0, sim(t4, t6) = 0, sim(t4, t7) = 0, sim(t4, t8) = 0, sim(t4, t9) = 0, sim(t4, t10) = 0, sim(t4, t11) = 0,
sim(t5, t6) = 0, sim(t5, t7) = 0, sim(t5, t8) = 0, sim(t5, t9) = 0, sim(t5, t10) = 0, sim(t5, t11) = 0.33,
sim(t6, t7) = 0.33, sim(t6, t8) = 0.25, sim(t6, t9) = 0.50, sim(t6, t10) = 0.33, sim(t6, t11) = 0.25,
sim(t7, t8) = 0.67, sim(t7, t9) = 0.50, sim(t7, t10) = 0.33, sim(t7, t11) = 0.67,
sim(t8, t9) = 0.33, sim(t8, t10) = 0.25, sim(t8, t11) = 0.50,
sim(t9, t10) = 0, sim(t9, t11) = 0.33,
sim(t10, t11) = 0.25.

Given the threshold value 0.5, the similarity classes are as follows.

R(t1) = {t1, t7, t8, t11},
R(t2) = {t2, t7, t8, t11},
R(t3) = {t3},
R(t4) = {t4},
R(t5) = {t5},
R(t6) = {t6, t9},
R(t7) = {t1, t2, t7, t8, t9, t11},
R(t8) = {t1, t2, t7, t8, t11},
R(t9) = {t6, t7, t9},
R(t10) = {t10},
R(t11) = {t1, t2, t7, t8, t11}.

Starting from these similarity classes, the process for finding the similarity upper approximation of each transaction can be illustrated as follows.

RR(t1) = {t1, t2, t7, t8, t9, t11},
RR(t2) = {t1, t2, t7, t8, t9, t11},
RR(t3) = {t3},
RR(t4) = {t4},
RR(t5) = {t5},
RR(t6) = {t6, t7, t9},
RR(t7) = {t1, t2, t6, t7, t8, t9, t11},
RR(t8) = {t1, t2, t7, t8, t9, t11},
RR(t9) = {t1, t2, t6, t7, t8, t9, t11},
RR(t10) = {t10},
RR(t11) = {t1, t2, t7, t8, t9, t11},

RRR(ti) = {t1, t2, t6, t7, t8, t9, t11} for i = 1, 2, 6, 7, 8, 9, 11,
RRR(t3) = {t3}, RRR(t4) = {t4}, RRR(t5) = {t5}, RRR(t10) = {t10},

RRRR(ti) = {t1, t2, t6, t7, t8, t9, t11} for i = 1, 2, 6, 7, 8, 9, 11,
RRRR(t3) = {t3}, RRRR(t4) = {t4}, RRRR(t5) = {t5}, RRRR(t10) = {t10}.

Hence, two consecutive upper approximations for {ti}, i = 1, 2, ..., 11, are the same. Therefore, we get the similarity upper approximations as follows.

Si = {t1, t2, t6, t7, t8, t9, t11} for i = 1, 2, 6, 7, 8, 9, 11,
S3 = {t3}, S4 = {t4}, S5 = {t5}, S10 = {t10}.

Since $S_i = S_j \neq S_k$, where $i, j = 1, 2, 6, 7, 8, 9, 11$ and $k = 3, 4, 5, 10$, according to [11] there are five clusters: {t3}, {t4}, {t5}, {t10} and {t1, t2, t6, t7, t8, t9, t11}.

For the proposed method, the intersections of the similarity classes are summarized in Table 3. From Table 3, notice that $R(t_i) \cap R(t_j) \neq \emptyset$ for $i \neq j$; $i, j = 1, 2, 6, 7, 8, 9, 11$, and $R(t_k) \cap R(t_l) = \emptyset$ for $k \neq l$; $k = 1, 2, \dots, 11$; $l = 3, 4, 5, 10$. We get the clusters as follows.

Si = {t1, t2, t6, t7, t8, t9, t11} for i = 1, 2, 6, 7, 8, 9, 11,
S3 = {t3}, S4 = {t4}, S5 = {t5}, S10 = {t10}.

The five clusters are {t1, t2, t6, t7, t8, t9, t11}, {t3}, {t4}, {t5} and {t10}. These are the same clusters as in [11]. However, the number of iterations is lower than that of the technique proposed by [11]. With the given threshold value, {t3}, {t4}, {t5} and {t10} are segregated clusters, but the transaction data suggest that there may be related transactions among these clusters. To handle this problem, we propose applying a second threshold value. We therefore take {t1, t2, t6, t7, t8, t9, t11} as the first cluster under the first threshold value and, for the remainder {t3}, {t4}, {t5}, {t10}, apply the second threshold value and group the remaining transactions by similarity. The similarities of the remaining transactions are shown below.

sim(t3, t4) = 0,
sim(t3, t5) = 0.33,
sim(t3, t10) = 0.25,
sim(t4, t5) = 0,
sim(t4, t10) = 0,
sim(t5, t10) = 0.

Given a second threshold value of 0.3, the similarity classes are as follows.

R(t3) = {t3, t5},
R(t4) = {t4},
R(t5) = {t3, t5},
R(t10) = {t10}.

The intersections of the similarity classes are summarized in Figure 2.

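$R(t_3) \cap R(t_4) = \{t_3, t_5\} \cap \{t_4\} = \emptyset$,
$R(t_3) \cap R(t_5) = \{t_3, t_5\} \cap \{t_3, t_5\} = \{t_3, t_5\}$,
$R(t_3) \cap R(t_{10}) = \{t_3, t_5\} \cap \{t_{10}\} = \emptyset$,
$R(t_4) \cap R(t_5) = \{t_4\} \cap \{t_3, t_5\} = \emptyset$,
$R(t_4) \cap R(t_{10}) = \{t_4\} \cap \{t_{10}\} = \emptyset$,
$R(t_5) \cap R(t_{10}) = \{t_3, t_5\} \cap \{t_{10}\} = \emptyset$.

Fig.2. The intersection of similarity classes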

Based on Figure 2, we see that $R(t_3) \cap R(t_5) \neq \emptyset$ and $R(t_i) \cap R(t_j) = \emptyset$ for $i \neq j$, $i = 3, 4, 5$; $j = 4, 10$. We get the clusters S3 = {t3, t5}, S4 = {t4}, S5 = {t3, t5}, S10 = {t10}. Hence, the three clusters are {t3, t5}, {t4}, {t10}. Overall, for both threshold values we have four clusters: {t1, t2, t6, t7, t8, t9, t11}, {t3, t5}, {t4}, and {t10}.
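The two-threshold procedure of this example can be sketched end to end in Python. The names `cluster` and `two_threshold` are hypothetical, and the rule used to pick the remainder (every transaction left in a singleton cluster after the first threshold) is an assumption inferred from this example; the paper does not state a general rule. Hyperlinks are encoded by their index in Table 2.

```python
from itertools import combinations

def sim(t, s):
    """Similarity |t ∩ s| / |t ∪ s| of two non-empty item sets."""
    return len(t & s) / len(t | s)

def cluster(transactions, th):
    """Build similarity classes for threshold th and merge those that intersect."""
    names = list(transactions)
    classes = {
        a: {b for b in names if sim(transactions[a], transactions[b]) >= th}
        for a in names
    }
    parent = {a: a for a in names}

    def find(a):
        while parent[a] != a:
            a = parent[a]
        return a

    for a, b in combinations(names, 2):
        if classes[a] & classes[b]:
            parent[find(a)] = find(b)

    groups = {}
    for a in names:
        groups.setdefault(find(a), set()).update(classes[a])
    return list(groups.values())

def two_threshold(transactions, th1, th2):
    """Cluster with th1, then re-cluster the singleton remainder with th2."""
    first = cluster(transactions, th1)
    kept = [c for c in first if len(c) > 1]
    # Assumption: the remainder consists of the transactions in singleton clusters.
    rest = {t: transactions[t] for c in first if len(c) == 1 for t in c}
    return kept + cluster(rest, th2)

# Table 2, each transaction written as the set of hyperlink indices it contains.
table2 = {
    "t1": {1, 3, 4}, "t2": {3, 4, 5, 6}, "t3": {2, 5, 6}, "t4": {1},
    "t5": {5}, "t6": {2, 3}, "t7": {3, 4}, "t8": {3, 4, 6},
    "t9": {3}, "t10": {2, 4}, "t11": {3, 4, 5},
}
result = two_threshold(table2, 0.5, 0.3)
for c in result:
    print(sorted(c))
```

With thresholds 0.5 and 0.3 the sketch yields the four clusters derived above; t3 and t5 are joined in the second pass because their similarity of 1/3 exceeds 0.3.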

The purity of clusters was used as a measure to test the quality of the clusters [11]. The purity of a cluster and the overall purity are defined as

$$\mathrm{Purity}(i) = \frac{t_i^{th}}{t_n},$$

where $t_i^{th}$ is the number of data occurring in the $i$th cluster under the given threshold and $t_n$ is the number of data in the data set, and

$$\text{Overall Purity} = \frac{\sum_{i=1}^{\#\text{ of clusters}} \mathrm{Purity}(i)}{\#\text{ of clusters}}.$$

According to this measure, a higher value of overall purity indicates a better clustering result, with perfect clustering yielding a value of 100%. The RoCeT model and the algorithm of [11] for clustering web transactions are implemented in MATLAB version 7.6.0.324 (R2008a).

Table 3. The intersection of similarity classes

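Each entry lists the indices of the transactions in $R(t_i) \cap R(t_j)$; "-" marks an empty intersection.

T/T   t1         t2         t3   t4   t5   t6    t7           t8           t9   t10
t2    7,8,11
t3    -          -
t4    -          -          -
t5    -          -          -    -
t6    -          -          -    -    -
t7    1,7,8,11   2,7,8,11   -    -    -    9
t8    1,7,8,11   2,7,8,11   -    -    -    -     1,2,7,8,11
t9    7          7          -    -    -    6,9   7,9          7
t10   -          -          -    -    -    -     -            -            -
t11   1,7,8,11   2,7,8,11   -    -    -    -     1,2,7,8,11   1,2,7,8,11   7    -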

[Figure 3: two panels plotting transactions (1 to 11) against clusters. Left panel: by given threshold 0.5 (clusters 1 to 5). Right panel: after given second threshold 0.3 (clusters 1 to 4).]

Fig.3. Visualization of example 2

They are executed sequentially on an Intel Core 2 Duo processor. The total main memory is 1 Gigabyte and the operating system is Windows XP Professional SP3. The purity of clusters is described in Figure 4.

The comparisons of computation and response time of RoCeT and [11] on the transaction data set from Table 2 are given in Figures 6 and 7, respectively.

Based on Figure 8, the RoCeT model algorithm provides better solutions than the algorithm of [11].

Cluster   Member Transactions            Purity
1         t1, t2, t6, t7, t8, t9, t11    1
2         t3, t5                         1
3         t4                             1
4         t10                            1

Overall Purity   100 %

Fig.4. The Purity of clusters

                        Overall Purity
The technique of [11]   81.82 %
RoCeT                   100 %

Fig.5. The Overall Purity

[Figure 6: chart titled "Computation" comparing the technique of [11] and RoCeT; vertical axis from 55 to 100.]

Fig.6. The Computation

[Figure 7: chart titled "Response Time" comparing the technique of [11] and RoCeT; vertical axis in seconds, from 0.015 to 0.06.]

Fig.7. The Response Time

                   Purity    Computation   Time
Data Transaction   18.18 %   46.30 %       80.77 %

Fig.8. The overall improvement over [11] by RoCeT



5 Experiment test

In order to test the RoCeT model and compare it with the algorithm of [11], we use a web log data set from: http://kdd.ics.uci.edu/databases/msnbc/msnbc.html. The data describes the page visits by users who visited on September 28, 1999. Visits are recorded at the level of URL category and are recorded chronologically. The data comes from Internet Information Server (IIS) logs for msnbc.com. Each row in the data set corresponds to the page visits of a user within a twenty-four hour period. Each item in a row corresponds to a request of a user for a page. The client-side cached data is not recorded, thus this data contains only the server-side log. From almost one million transactions, we take 2000 transactions and split them into five categories: 100, 200, 500, 1000 and 2000. The comparison of response times is shown in Figure 9 and the comparison of computation is given in Figure 10.

Table 4. The Purity of clusters

Number of      The RoCeT   The technique   Improvement
Transaction    model       of [11]
100            100%        93.0%           7.0%
200            100%        96.0%           4.0%
500            100%        95.5%           0.5%
1000           100%        95.5%           0.5%
2000           100%        99.9%           0.1%

Average                                    2.5%

Table 5. The executing time


Number of      The RoCeT   The technique   Improvement
Transaction    model       of [11]
100            1.6969      6.250           68.79%
200            9.093       6.250           66.25%
500            77.266      163.760         48.35%
1000           554.426     2205.100        65.10%
2000           3043.500    9780.900        64.97%

Average                                    62.69%

Table 6. The Computation


Number of      The RoCeT   The technique   Improvement
Transaction    model       of [11]
100            8806        28213           68.50%
200            39349       116576          68.39%
500            257003      497595          52.82%
1000           1034964     2965579         74.86%
2000           4161122     11879645        68.88%

Average                                    69.69%

[Figure 9: chart titled "Response Time" plotting seconds (0 to 10000) against the number of transactions (100, 200, 500, 1000, 2000) for the technique of [11] and RoCeT.]

Fig.9. The executing time



[Figure 10: chart titled "Computation" plotting the computation count (scale x 10^6, 0 to 12) against the number of transactions (100, 200, 500, 1000, 2000) for RoCeT and the technique of [11].]

Fig.10. The Computation

[Figure 11: two panels plotting transactions (0 to 100) against clusters. Panel titles: "1st threshold 0.6" and "after given 2nd threshold 0.3".]

Fig.11. Visualization of 100 transactions

[Figure 12: two panels plotting transactions (0 to 200) against clusters. First panel title: "1st threshold 0.6".]

Fig.12. Visualization of 200 transactions


[Figure: two panels plotting transactions (0 to 500) against clusters. Panel titles: "1st threshold 0.6" and "after given 2nd threshold".]

[Figure: two panels plotting transactions (0 to 1000) against clusters. First panel title: "1st threshold 0.6".]


[Figure 15: two panels plotting transactions (0 to 2000) against clusters. First panel title: "1st threshold 0.6".]

Fig. 15. Visualization of 2000 transactions



6 Conclusion

A web clustering technique can be applied to find interesting user access patterns in web logs. In this paper, we have proposed the RoCeT model for clustering web transactions using rough set theory, based on the similarity between two transactions. The analysis of the RoCeT model was presented in terms of computation, processing time and cluster purity. We evaluated the proposed technique on UCI benchmark data, i.e., the msnbc.com web log data. It is shown that the RoCeT model achieves a significantly lower response time, by up to 62.69 %, compared to the technique of [11]. Meanwhile, its cluster purity is better by up to 2.5 %.

7 Acknowledgement

This work was supported by the FRGS under Grant No. Vote 0402, Ministry of Higher Education, Malaysia.

References

[1] Pal, S.K., Talwar, V. and Mitra, P., (2002) Web Mining in Soft Computing Framework: Relevance, State of the Art and Future Directions, IEEE Transactions on Neural Networks, 13 (5), 1163-1177.

[2] Büchner, A.G. and Mulvenna, M.D., (1998) Discovering Internet Marketing Intelligence through Online Analytical Web Usage Mining, SIGMOD Record, 27 (4), 54-61.

[3] Cohen, E., Krishnamurthy, B. and Rexford, J., (1998) Improving end-to-end performance of the web using server volumes and proxy filters, Proceedings of the ACM SIGCOMM. Vancouver, British Columbia, Canada: ACM Press.

[4] Joachims, T., Freitag, D. and Mitchell, T., (1997) Webwatcher: A tour guide for the world wide web, In the 15th International Joint Conference on Artificial Intelligence (IJCAI97), Nagoya, Japan.

[5] Lieberman, H., (1995) Letizia: An agent that assists web browsing, Proceedings of the 1995 International Joint Conference on Artificial Intelligence. Montreal, Canada: Morgan Kaufmann.

[6] Mobasher, B., Cooley, R., and Srivastava, J., (1999) Creating adaptive web sites through usage based clustering of URLs, Proceedings of the 1999 Workshop on Knowledge and Data Engineering Exchange. IEEE Computer Society.

[7] Ngu, D.S.W. and Wu, X., (1997) Sitehelper: A localized agent that helps incremental exploration of the world wide web, Proceedings of the 6th International World Wide Web Conference. Santa Clara, CA: ACM Press.

[8] Perkowitz, M. and Etzioni, O., (1998) Adaptive Web Sites: Automatically Synthesizing Web Pages, Proceedings of the 15th National Conference on Artificial Intelligence. Madison, WI: AAAI.

[9] Yanchun, Z., Guandong, X. and Xiaofang, Z., (2005) A Latent Usage Approach for Clustering Web Transaction and Building User Profile, Springer-Verlag Berlin Heidelberg, 31-42.

[10] Han, E. et al., (1998) Hypergraph Based Clustering in High-Dimensional Data Sets: A Summary of Results, IEEE Data Engineering Bulletin, 21 (11), 15-22.

[11] De, S.K. and Krishna, P.R., (2004) Clustering web transactions using rough approximation, Fuzzy Sets and Systems, 148, 131-138.

[12] Pawlak, Z., (1982) Rough sets, International Journal of Computer and Information Science, 11, 341-356.

[13] Pawlak, Z., (1991) Rough sets: A theoretical aspect of reasoning about data, Kluwer Academic Publisher.

[14] Pawlak, Z. and Skowron, A., (2007) Rudiments of rough sets, Information Sciences: An International Journal, 177 (1), 3-27.

Iwan Tri Riyadi Yanto
He is a Master candidate in Data Mining at Universiti Tun Hussein Onn Malaysia (UTHM). His research area includes Data Mining, KDD, and Real Analysis.

Tutut Herawan
He is a Ph.D. candidate in Data Mining at Universiti Tun Hussein Onn Malaysia (UTHM). His research area includes Data Mining, KDD and Real Analysis.



Mustafa Mat Deris
He received the B.Sc. from University Putra Malaysia, M.Sc. from University of Bradford, England and Ph.D. from University Putra Malaysia. He is a professor of computer science in the Faculty of Information Technology and Multimedia, UTHM, Malaysia. His research interests include distributed databases, data grid, database performance issues and data mining. He has published more than 80 papers in journals and conference proceedings. He was appointed as one of the editorial board members for International Journal of Information Technology, World Enformatika Society, a reviewer of a special issue of International Journal of Parallel and Distributed Databases, Elsevier, 2004, a special issue of International Journal of Cluster Computing, Kluwer, 2004, IEEE conference on Cluster and Grid Computing, held in Chicago, April, 2004, and Malaysian Journal of Computer Science. He has served as a program committee member for numerous international conferences/workshops including Grid and Peer-to-Peer Computing (GP2P 2005, 2006), Autonomic Distributed Data and Storage Systems Management (ADSM 2005, 2006), WSEAS, International Association of Science and Technology, IASTED on Database, etc.
