

  • IJBSCHS(2010-16-2-16) Biomedical Soft Computing and Human Sciences, Vol.16, No.2, pp.135-145 [Original Article] Copyright © 1995 Biomedical Fuzzy Systems Association

    (Accepted on February 20, 2010)

    RoCeT: Rough Clustering for web Transactions

    Iwan Tri Riyadi Yanto 1,3, Tutut Herawan 2,3, Mustafa Mat Deris 3

    1 Department of Mathematics
    2 Department of Mathematics Education
    Universitas Ahmad Dahlan, Yogyakarta, Indonesia
    3 Faculty of Information Technology and Multimedia
    Universiti Tun Hussein Onn Malaysia, Johor, Malaysia

    (received 31 December 2009, revised and accepted 20 February 2010)

    Abstract: Grouping web transactions into clusters is important for a better understanding of users' behavior. Currently, rough approximation-based clustering is used to group web transactions into clusters. However, processing time remains an issue because of the high complexity of finding the similarity of upper approximations of a transaction, which is used to merge two or more clusters. In addition, the problem of more than one transaction falling under a given threshold is not addressed. In this paper, we propose the RoCeT model for grouping web transactions using rough set theory. It is based on pairs of similarity classes that have a nonvoid intersection.

    Keywords: Clustering, Web transactions, Rough set theory.

    1 Introduction

    Web usage data includes data from web server access logs, proxy server logs, browser logs, user profiles, registration files, user sessions or transactions, user queries, bookmark folders, mouse clicks and scrolls, and any other data generated by the interaction of users and the web [1]. Generally, web mining techniques can be defined as methods that extract so-called "nuggets" (or knowledge) from a web data repository, such as content, linkage and usage information, by utilizing data mining tools. Among such web data, the user click stream, i.e. usage data, can be utilized to capture users' navigation patterns and identify the tasks users intend to perform. Once users' navigational behaviors are effectively characterized, they benefit further web applications and, in turn, facilitate and improve web service quality for both web-based organizations and end users [2-9]. In web data mining research, many data mining techniques, such as clustering [8,10], are widely adopted to improve the usability and scalability of web mining.

    1 Faculty of Information Technology and Multimedia, UTHM, Parit Raja, Batu Pahat 86400, Johor. Phone: +60177061496. Email: [email protected]

    Access transactions over the web can be expressed in terms of two finite sets, user transactions and hyperlinks/URLs [11]. Each user transaction is a sequence of items. Let $U$ be the set of transactions formed by $m$ users and $A$ the set of $n$ distinct clicks (hyperlinks/URLs) made by those users, that is, $U = \{t_1, t_2, \ldots, t_m\}$ and $A = \{hl_1, hl_2, \ldots, hl_n\}$, where every $t_i \in T \subseteq U$ is a non-empty subset of $A$. The temporal order of users' clicks within transactions has been taken into account. A user transaction $t \in T$ is represented as a vector $t = \langle u^t_1, u^t_2, \ldots, u^t_n \rangle$, where $u^t_i = 1$ if $hl_i \in t$ and $u^t_i = 0$ otherwise.
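The vector representation of a transaction can be illustrated with a minimal Python sketch. The function name `encode_transaction` and the sample hyperlink names are hypothetical and chosen only for illustration; they do not come from the paper.

```python
from typing import Iterable, List


def encode_transaction(clicked: Iterable[str], hyperlinks: List[str]) -> List[int]:
    """Return the 0/1 vector of a transaction over an ordered hyperlink list."""
    clicked_set = set(clicked)
    # Entry i is 1 when hyperlink i occurs in the transaction, 0 otherwise.
    return [1 if hl in clicked_set else 0 for hl in hyperlinks]


# Hypothetical hyperlink set A and one transaction t (illustrative values).
A = ["hl1", "hl2", "hl3", "hl4", "hl5"]
t = {"hl2", "hl3", "hl4"}
vector = encode_transaction(t, A)
print(vector)  # [0, 1, 1, 1, 0]
```

The encoding keeps only membership, so two transactions that visit the same hyperlinks map to the same vector.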

    A well-known approach for clustering web transactions uses rough set theory [12-14]. De and Krishna [11] proposed an algorithm for clustering web transactions using rough approximation. It is based on the similarity of upper approximations of transactions under a given threshold. However, it needs several iterations to merge two or more clusters that have the same similarity upper approximation, and it does not show how to handle the case where more than one transaction falls under the given threshold. To overcome these problems, in this paper we propose an alternative technique for clustering web transactions. We use the concept of similarity class proposed by [11], but the RoCeT model differs in how it allocates transactions to the same cluster and in how it handles the case where more than one transaction falls under the given threshold.

    The rest of the paper is organized as follows. Section 2 describes the concept of rough set theory. Section 3 describes the work of [11]. Section 4 describes the RoCeT model. Section 5 describes the experimental test. Finally, we conclude our work in Section 6.


  • I.T.R Yanto, et al.: RoCeT: Rough Clustering for web Transactions

    2 Rough Set Theory

    An information system is a 4-tuple (quadruple) $S = (U, A, V, f)$, where $U = \{u_1, u_2, u_3, \ldots, u_{|U|}\}$ is a non-empty finite set of objects, $A = \{a_1, a_2, a_3, \ldots, a_{|A|}\}$ is a non-empty finite set of attributes, $V = \bigcup_{a \in A} V_a$, where $V_a$ is the domain (value set) of attribute $a$, and $f : U \times A \to V$ is an information (knowledge) function such that $f(u, a) \in V_a$ for every $(u, a) \in U \times A$. The starting point of rough set approximations is the indiscernibility relation, which is generated by information about objects of interest. Two objects in an information system are called indiscernible (indistinguishable or similar) if they have the same feature.

    Definition 1 Two elements $x, y \in U$ are said to be $B$-indiscernible (indiscernible by the set of attributes $B \subseteq A$ in $S$) if and only if $f(x, a) = f(y, a)$ for every $a \in B$.

    Obviously, every subset of $A$ induces a unique indiscernibility relation. Notice that the indiscernibility relation induced by the set of attributes $B$, denoted by $IND(B)$, is an equivalence relation. The partition of $U$ induced by $IND(B)$ is denoted by $U/B$, and the equivalence class in the partition $U/B$ containing $x \in U$ is denoted by $[x]_B$. The notions of lower and upper approximations of a set are defined as follows.

    Definition 2 (See [14].) The $B$-lower approximation of $X$, denoted by $\underline{B}(X)$, and the $B$-upper approximation of $X$, denoted by $\overline{B}(X)$, are defined respectively by $\underline{B}(X) = \{x \in U \mid [x]_B \subseteq X\}$ and $\overline{B}(X) = \{x \in U \mid [x]_B \cap X \neq \emptyset\}$.

    The accuracy of approximation (accuracy of roughness) of any subset $X \subseteq U$ with respect to $B \subseteq A$, denoted $\alpha_B(X)$, is measured by

    $$\alpha_B(X) = \frac{|\underline{B}(X)|}{|\overline{B}(X)|},$$

    where $|X|$ denotes the cardinality of $X$. For the empty set $\emptyset$, we define $\alpha_B(\emptyset) = 1$. Obviously, $0 \le \alpha_B(X) \le 1$. If $X$ is a union of some equivalence classes of $U$, then $\alpha_B(X) = 1$, and the set $X$ is crisp (precise) with respect to $B$. If $X$ is not a union of some equivalence classes of $U$, then $\alpha_B(X) < 1$, and the set $X$ is rough (imprecise) with respect to $B$ [13]. This means that the higher the accuracy of approximation of a subset $X \subseteq U$, the more precise (the less imprecise) it is.
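As an illustration of Definition 2 and the accuracy measure, the following Python sketch computes the partition induced by a set of attributes, the lower and upper approximations, and the accuracy. The four-object information system and all function names are hypothetical and serve only as an example.

```python
from collections import defaultdict


def partition(universe, info, attrs):
    """Group objects that agree on every attribute in attrs (classes of IND(B))."""
    classes = defaultdict(set)
    for x in universe:
        key = tuple(info[x][a] for a in attrs)
        classes[key].add(x)
    return list(classes.values())


def lower_upper(universe, info, attrs, X):
    """Return the B-lower and B-upper approximations of X."""
    lower, upper = set(), set()
    for block in partition(universe, info, attrs):
        if block <= X:   # class entirely inside X
            lower |= block
        if block & X:    # class meets X
            upper |= block
    return lower, upper


def accuracy(lower, upper):
    """|lower| / |upper|, with the value 1 for the empty set as in the text."""
    return 1.0 if not upper else len(lower) / len(upper)


# Hypothetical information system with attributes a and b.
info = {
    "u1": {"a": 0, "b": 1},
    "u2": {"a": 0, "b": 1},
    "u3": {"a": 1, "b": 0},
    "u4": {"a": 1, "b": 1},
}
U = set(info)
X = {"u1", "u3"}
lower, upper = lower_upper(U, info, ["a", "b"], X)
print(sorted(lower), sorted(upper), accuracy(lower, upper))
```

In this example u1 and u2 are indiscernible, so X cannot separate them: only u3 is certainly in X, while u1, u2 and u3 are possibly in X, giving an accuracy of 1/3.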

    3 Related Work

    In this section, we discuss the technique proposed by [11]. Given two transactions $t$ and $s$, the similarity between $t$ and $s$ is measured by $sim(t, s) = |t \cap s| / |t \cup s|$. Obviously, $sim(t, s) \in [0, 1]$, where $sim(t, s) = 1$ when the two transactions are exactly identical and $sim(t, s) = 0$ when they have no items in common. De and Krishna [11] used a binary relation $R$ on $T$ defined as follows. For any threshold value $th \in [0, 1]$ and any two user transactions $t, s \in T$, $t R s$ iff $sim(t, s) \ge th$. This relation $R$ is a tolerance relation: it is reflexive and symmetric, but transitivity does not always hold.
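The similarity measure and the failure of transitivity can be checked with a small Python sketch. The three transactions below are hypothetical and were picked so that $t$ is related to $s$ and $s$ to $u$, but $t$ is not related to $u$ at the threshold 0.5.

```python
def sim(t, s):
    """Similarity |t & s| / |t | s| of two non-empty item sets."""
    return len(t & s) / len(t | s)


def related(t, s, th):
    """True when t R s, i.e. sim(t, s) >= th."""
    return sim(t, s) >= th


# Hypothetical transactions chosen to break transitivity at th = 0.5.
t = {"hl1", "hl2"}
s = {"hl1", "hl2", "hl3"}
u = {"hl2", "hl3"}
th = 0.5

print(related(t, s, th), related(s, u, th), related(t, u, th))  # True True False
```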

    Definition 3 The similarity class of $t$, denoted by $R(t)$, is the set of transactions that are similar to $t$, given by $R(t) = \{s \in T : s R t\}$.

    For different threshold values, one can get different similarity classes. A domain expert can choose the threshold based on experience to get a proper similarity class. It is clear that for a fixed threshold $th \in [0, 1]$, a transaction from a given similarity class may be similar to an object of another similarity class.

    Definition 4 Let $P \subseteq T$ and, for a fixed threshold $th \in [0, 1]$, let $R$ be a binary tolerance relation defined on $T$. The lower approximation of $P$, denoted by $\underline{R}(P)$, and the upper approximation of $P$, denoted by $\overline{R}(P)$, are respectively defined as $\underline{R}(P) = \{t \in P : R(t) \subseteq P\}$ and $\overline{R}(P) = \bigcup_{t \in P} R(t)$.
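Definitions 3 and 4 can be sketched in Python as follows. The data are the four transactions of Table 1 of this paper with the threshold 0.5; the function names and the choice of the subset P are assumptions made for illustration.

```python
def sim(t, s):
    """Similarity |t & s| / |t | s| of two non-empty item sets."""
    return len(t & s) / len(t | s)


def similarity_classes(T, th):
    """Map each transaction name to the names of all transactions similar to it."""
    return {a: {b for b in T if sim(T[a], T[b]) >= th} for a in T}


def lower_approx(P, R):
    """Transactions of P whose whole similarity class lies inside P."""
    return {t for t in P if R[t] <= P}


def upper_approx(P, R):
    """Union of the similarity classes of the transactions of P."""
    result = set()
    for t in P:
        result |= R[t]
    return result


# Transactions of Table 1, written as sets of hyperlinks.
T = {
    "t1": {"hl1", "hl2"},
    "t2": {"hl2", "hl3", "hl4"},
    "t3": {"hl1", "hl3", "hl5"},
    "t4": {"hl2", "hl3", "hl5"},
}
R = similarity_classes(T, 0.5)
P = {"t2", "t4"}  # illustrative subset, not taken from the paper
print(sorted(lower_approx(P, R)), sorted(upper_approx(P, R)))
```

Here the class of t4 also contains t3, so t4 drops out of the lower approximation of P while t3 enters its upper approximation.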

    They proposed a technique for clustering the clicks of user navigations called the similarity upper approximation, denoted by $S_i$. The set of transactions that are possibly similar to $R(t_i)$ is denoted by $RR(t_i)$. This process continues until two consecutive upper approximations for $t_i$, $i = 1, 2, 3, \ldots, |U|$, are the same, and two or more clusters that have the same similarity upper approximation are merged at each iteration. This technique has high computational complexity, because the similarity upper approximation must be recomputed until two consecutive upper approximations are the same. To overcome this problem, we propose an alternative technique to cluster the transactions.
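The iteration described above can be sketched as a fixed-point computation. This is a simplified reading of the procedure of [11], not its original implementation; the function name `iterate_upper` is hypothetical, and the input is the set of similarity classes reported in this paper for Table 1 at the threshold 0.5.

```python
def iterate_upper(R):
    """Repeat S <- union of R(s) for s in S until two consecutive results agree."""
    result = {}
    for t, cls in R.items():
        current = set(cls)
        while True:
            nxt = set()
            for s in current:
                nxt |= R[s]
            if nxt == current:  # two consecutive upper approximations are the same
                break
            current = nxt
        result[t] = current
    return result


# Similarity classes for Table 1 at threshold 0.5, as listed in the example.
R = {
    "t1": {"t1"},
    "t2": {"t2", "t4"},
    "t3": {"t3", "t4"},
    "t4": {"t2", "t3", "t4"},
}
S = iterate_upper(R)
clusters = {frozenset(c) for c in S.values()}
print(sorted(sorted(c) for c in clusters))  # [['t1'], ['t2', 't3', 't4']]
```

Because every transaction belongs to its own similarity class, each pass can only enlarge the current set, so the loop always terminates.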

    4 The RoCeT model

    The RoCeT model for clustering the transactions is based on all the transactions that are possibly similar to the similarity class $R(t)$ of $t$. Two similarity classes with a nonvoid intersection belong to the same cluster, which is their union. The justification that a cluster is a union of similarity classes with nonvoid intersection is presented in Proposition 6.


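    Definition 5 Two clusters $S_i$ and $S_j$, $i \neq j$, are said to be the same if

    $$S_i = \bigcup R(t_i), \quad i = 1, 2, 3, \ldots, |U|.$$

    Proposition 6 Let $S_i$ be a cluster. If $\bigcap R(t_i) \neq \emptyset$, then $\bigcup R(t_i) = S_i$.

    Proof. We suppose that $S_i$ and $S_j$, where $i \neq j$, are the same cluster. From Definition 5, if $\bigcup R(t_i) \neq S_i$, then we have

    $$\bigcup R(t_i) \neq S_i \neq S_j \neq \bigcup R(t_j)$$

    $$\Rightarrow \bigcup R(t_i) \neq \bigcup R(t_j)$$

    $$\Rightarrow \left(\bigcup R(t_i)\right) \cap \left(\bigcup R(t_j)\right) = \emptyset$$

    $$\Rightarrow \bigcap R(t_i) = \emptyset.$$

    This contradicts the hypothesis.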
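The idea of taking the union of similarity classes with nonvoid intersection can be sketched in Python. The union-find structure used here is an implementation choice of this sketch, not something the paper prescribes (the paper checks pairwise intersections), and the five similarity classes are hypothetical.

```python
def rocet_clusters(R):
    """Merge similarity classes that share a transaction and return the unions."""
    parent = {t: t for t in R}

    def find(x):
        # Follow parent links to the representative, halving the path on the way.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    # Every member of R(t) joins the cluster of t, so two classes that
    # share a transaction end up under the same representative.
    for t, cls in R.items():
        for s in cls:
            parent[find(s)] = find(t)

    groups = {}
    for t in R:
        groups.setdefault(find(t), set()).add(t)
    return sorted(sorted(g) for g in groups.values())


# Hypothetical similarity classes of a symmetric, reflexive relation.
R = {
    "a": {"a", "b"},
    "b": {"a", "b", "c"},
    "c": {"b", "c"},
    "d": {"d"},
    "e": {"e"},
}
print(rocet_clusters(R))  # [['a', 'b', 'c'], ['d'], ['e']]
```

In this example R(a) and R(c) both contain b, so the three classes collapse into one cluster even though a and c are not directly similar.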

    4.1 Complexity

    Suppose that in an information system $S = (U, A, V, f)$ there are $|U|$ objects, which means there are at most $|U|$ similarity classes. Comparing each similarity class $R(t_i)$ with every $R(t_j)$, where $i \neq j$, takes $|U| \times (|U| - 1)$ computations. Thus, the overall computational complexity of the RoCeT model is polynomial, $O(|U| \times (|U| - 1))$.
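As a worked instance of this count, take the four transactions of the first example below ($|U| = 4$). The second expression assumes, as this sketch does, that each unordered pair of classes is compared only once because intersection is symmetric; this matches the six intersections listed in Figure 1.

```latex
|U| \times (|U| - 1) = 4 \times 3 = 12,
\qquad
\binom{4}{2} = \frac{4 \times 3}{2} = 6 .
```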

    4.2 Example

    In this study, the RoCeT model and the technique proposed by [11] are compared on two examples, each using a small data set of transactions.

    a. The first transaction data set is adopted from [11] and given in Table 1. It contains four objects ($|U| = 4$) with five hyperlinks ($|A| = 5$).

    Table 1. Data transactions

    U/A   hl1  hl2  hl3  hl4  hl5
    t1    1    1    0    0    0
    t2    0    1    1    1    0
    t3    1    0    1    0    1
    t4    0    1    1    0    1

    The technique of [11] needs three main steps. The first step is obtaining the measure of similarity, which gives information about the users' access patterns related to their common areas of interest, by means of the similarity relation between two transactions. The similarity calculations for the web transactions of Table 1 are given below.

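    $sim(t_1, t_2) = \frac{|t_1 \cap t_2|}{|t_1 \cup t_2|} = \frac{|\{hl_2\}|}{|\{hl_1, hl_2, hl_3, hl_4\}|} = 0.25$,

    $sim(t_1, t_3) = \frac{|t_1 \cap t_3|}{|t_1 \cup t_3|} = \frac{|\{hl_1\}|}{|\{hl_1, hl_2, hl_3, hl_5\}|} = 0.25$,

    $sim(t_1, t_4) = \frac{|t_1 \cap t_4|}{|t_1 \cup t_4|} = \frac{|\{hl_2\}|}{|\{hl_1, hl_2, hl_3, hl_5\}|} = 0.25$,

    $sim(t_2, t_3) = \frac{|t_2 \cap t_3|}{|t_2 \cup t_3|} = \frac{|\{hl_3\}|}{|\{hl_1, hl_2, hl_3, hl_4, hl_5\}|} = 0.2$,

    $sim(t_2, t_4) = \frac{|t_2 \cap t_4|}{|t_2 \cup t_4|} = \frac{|\{hl_2, hl_3\}|}{|\{hl_2, hl_3, hl_4, hl_5\}|} = 0.5$,

    $sim(t_3, t_4) = \frac{|t_3 \cap t_4|}{|t_3 \cup t_4|} = \frac{|\{hl_3, hl_5\}|}{|\{hl_1, hl_2, hl_3, hl_5\}|} = 0.5$.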

    Second, the similarity classes are obtained for a given threshold value using Definition 3. With the threshold value 0.5, we get the following similarity classes.

    $R(t_1) = \{t_1\}$, $R(t_2) = \{t_2, t_4\}$, $R(t_3) = \{t_3, t_4\}$, $R(t_4) = \{t_2, t_3, t_4\}$.

    The last step is to cluster the transactions. To get the clusters, [11] used the similarity upper approximations; the process is shown below.

    $R(t_1) = \{t_1\}$, $R(t_2) = \{t_2, t_4\}$, $R(t_3) = \{t_3, t_4\}$, $R(t_4) = \{t_2, t_3, t_4\}$,
    $RR(t_1) = \{t_1\}$, $RR(t_2) = \{t_2, t_3, t_4\}$, $RR(t_3) = \{t_2, t_3, t_4\}$, $RR(t_4) = \{t_2, t_3, t_4\}$,
    $RRR(t_2) = \{t_2, t_3, t_4\}$, $RRR(t_3) = \{t_2, t_3, t_4\}$.

    Here, we can see that two consecutive upper approximations for $\{t_1\}$, $\{t_2\}$, $\{t_3\}$ and $\{t_4\}$ are the same. Thus, we get the following similarity upper approximations for $\{t_1\}$, $\{t_2\}$, $\{t_3\}$ and $\{t_4\}$:

    $S_1 = \{t_1\}$, $S_2 = \{t_2, t_3, t_4\}$, $S_3 = \{t_2, t_3, t_4\}$, $S_4 = \{t_2, t_3, t_4\}$,

    where $S_2 = S_3 = S_4$ and $S_1 \neq S_i$ for $i = 2, 3, 4$. Finally, we get the two clusters $\{t_1\}$ and $\{t_2, t_3, t_4\}$.

    The RoCeT model, however, is based on nonvoid intersection. According to Definition 5 and the similarity classes used in [11], only a few computations are needed to get the clusters. Therefore, the RoCeT model clusters the transactions more efficiently than the technique of [11]. The calculation of the similarity relation is shown in Figure 1.

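    $R(t_1) \cap R(t_2) = \{t_1\} \cap \{t_2, t_4\} = \emptyset$,
    $R(t_1) \cap R(t_3) = \{t_1\} \cap \{t_3, t_4\} = \emptyset$,
    $R(t_1) \cap R(t_4) = \{t_1\} \cap \{t_2, t_3, t_4\} = \emptyset$,
    $R(t_2) \cap R(t_3) = \{t_2, t_4\} \cap \{t_3, t_4\} = \{t_4\}$,
    $R(t_2) \cap R(t_4) = \{t_2, t_4\} \cap \{t_2, t_3, t_4\} = \{t_2, t_4\}$,
    $R(t_3) \cap R(t_4) = \{t_3, t_4\} \cap \{t_2, t_3, t_4\} = \{t_3, t_4\}$.

    Fig.1. The similarity relation

    Here, we can see that $R(t_i) \cap R(t_j) \neq \emptyset$, $i \neq j$, for $i, j = 2, 3, 4$. We get the clusters as follows.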


    $S_1 = R(t_1) = \{t_1\}$,

    $S_2 = S_3 = S_4 = \bigcup R(t_i)$, $i = 2, 3, 4$,

    $\bigcup R(t_i) = \{t_2, t_4\} \cup \{t_3, t_4\} \cup \{t_2, t_3, t_4\} = \{t_2, t_3, t_4\}$.

    Hence, the two clusters are $\{t_1\}$ and $\{t_2, t_3, t_4\}$.

    b. The second transaction data set is given in Table 2. It contains eleven objects ($|U| = 11$) with six hyperlinks ($|A| = 6$).

    Table 2. Data transactions

    U/A   hl1  hl2  hl3  hl4  hl5  hl6
    t1    1    0    1    1    0    0
    t2    0    0    1    1    1    1
    t3    0    1    0    0    1    1
    t4    1    0    0    0    0    0
    t5    0    0    0    0    1    0
    t6    0    1    1    0    0    0
    t7    0    0    1    1    0    0
    t8    0    0    1    1    0    1
    t9    0    0    1    0    0    0
    t10   0    1    0    1    0    0
    t11   0    0    1    1    1    0

    The similarities for the transactions are given below.

    $sim(t_1, t_2) = 0.40$, $sim(t_1, t_3) = 0$, $sim(t_1, t_4) = 0.33$, $sim(t_1, t_5) = 0$, $sim(t_1, t_6) = 0.25$, $sim(t_1, t_7) = 0.67$, $sim(t_1, t_8) = 0.50$, $sim(t_1, t_9) = 0.33$, $sim(t_1, t_{10}) = 0.25$, $sim(t_1, t_{11}) = 0.5$,
    $sim(t_2, t_3) = 0.40$, $sim(t_2, t_4) = 0$, $sim(t_2, t_5) = 0.25$, $sim(t_2, t_6) = 0.20$, $sim(t_2, t_7) = 0.50$, $sim(t_2, t_8) = 0.75$, $sim(t_2, t_9) = 0.25$, $sim(t_2, t_{10}) = 0.20$, $sim(t_2, t_{11}) = 0.75$,
    $sim(t_3, t_4) = 0.25$, $sim(t_3, t_5) = 0.33$, $sim(t_3, t_6) = 0.25$, $sim(t_3, t_7) = 0$, $sim(t_3, t_8) = 0.20$, $sim(t_3, t_9) = 0$, $sim(t_3, t_{10}) = 0.25$, $sim(t_3, t_{11}) = 0.20$,
    $sim(t_4, t_5) = 0$, $sim(t_4, t_6) = 0$, $sim(t_4, t_7) = 0$, $sim(t_4, t_8) = 0$, $sim(t_4, t_9) = 0$, $sim(t_4, t_{10}) = 0$, $sim(t_4, t_{11}) = 0$,
    $sim(t_5, t_6) = 0$, $sim(t_5, t_7) = 0$, $sim(t_5, t_8) = 0$, $sim(t_5, t_9) = 0$, $sim(t_5, t_{10}) = 0$, $sim(t_5, t_{11}) = 0.33$,
    $sim(t_6, t_7) = 0.33$, $sim(t_6, t_8) = 0.25$, $sim(t_6, t_9) = 0.50$, $sim(t_6, t_{10}) = 0.33$, $sim(t_6, t_{11}) = 0.20$,
    $sim(t_7, t_8) = 0.67$, $sim(t_7, t_9) = 0.50$, $sim(t_7, t_{10}) = 0.33$, $sim(t_7, t_{11}) = 0.50$,
    $sim(t_8, t_9) = 0.33$, $sim(t_8, t_{10}) = 0.25$, $sim(t_8, t_{11}) = 0.50$,
    $sim(t_9, t_{10}) = 0$, $sim(t_9, t_{11}) = 0.33$,
    $sim(t_{10}, t_{11}) = 0.25$.

    With the threshold value 0.5, the similarity classes are as follows.

    $R(t_1) = \{t_1, t_7, t_8, t_{11}\}$, $R(t_2) = \{t_2, t_7, t_8, t_{11}\}$, $R(t_3) = \{t_3\}$, $R(t_4) = \{t_4\}$, $R(t_5) = \{t_5\}$, $R(t_6) = \{t_6, t_9\}$, $R(t_7) = \{t_1, t_2, t_7, t_8, t_9, t_{11}\}$, $R(t_8) = \{t_1, t_2, t_7, t_8, t_{11}\}$, $R(t_9) = \{t_6, t_7, t_9\}$, $R(t_{10}) = \{t_{10}\}$, $R(t_{11}) = \{t_1, t_2, t_7, t_8, t_{11}\}$.

    The process of finding the similarity upper approximations of each transaction can be illustrated as follows.

    $R(t_1) = \{t_1, t_7, t_8, t_{11}\}$, $R(t_2) = \{t_2, t_7, t_8, t_{11}\}$, $R(t_3) = \{t_3\}$, $R(t_4) = \{t_4\}$, $R(t_5) = \{t_5\}$, $R(t_6) = \{t_6, t_9\}$, $R(t_7) = \{t_1, t_2, t_7, t_8, t_9, t_{11}\}$, $R(t_8) = \{t_1, t_2, t_7, t_8, t_{11}\}$, $R(t_9) = \{t_6, t_7, t_9\}$, $R(t_{10}) = \{t_{10}\}$, $R(t_{11}) = \{t_1, t_2, t_7, t_8, t_{11}\}$,

    $RR(t_1) = \{t_1, t_2, t_7, t_8, t_9, t_{11}\}$, $RR(t_2) = \{t_1, t_2, t_7, t_8, t_9, t_{11}\}$, $RR(t_3) = \{t_3\}$, $RR(t_4) = \{t_4\}$, $RR(t_5) = \{t_5\}$, $RR(t_6) = \{t_1, t_2, t_6, t_7, t_8, t_9, t_{11}\}$, $RR(t_7) = \{t_1, t_2, t_6, t_7, t_8, t_9, t_{11}\}$, $RR(t_8) = \{t_1, t_2, t_7, t_8, t_9, t_{11}\}$, $RR(t_9) = \{t_1, t_2, t_6, t_7, t_8, t_9, t_{11}\}$, $RR(t_{10}) = \{t_{10}\}$, $RR(t_{11}) = \{t_1, t_2, t_7, t_8, t_9, t_{11}\}$,

    $RRR(t_1) = \{t_1, t_2, t_6, t_7, t_8, t_9, t_{11}\}$, $RRR(t_2) = \{t_1, t_2, t_6, t_7, t_8, t_9, t_{11}\}$, $RRR(t_3) = \{t_3\}$, $RRR(t_4) = \{t_4\}$, $RRR(t_5) = \{t_5\}$, $RRR(t_6) = \{t_1, t_2, t_6, t_7, t_8, t_9, t_{11}\}$, $RRR(t_7) = \{t_1, t_2, t_6, t_7, t_8, t_9, t_{11}\}$, $RRR(t_8) = \{t_1, t_2, t_6, t_7, t_8, t_9, t_{11}\}$, $RRR(t_9) = \{t_1, t_2, t_6, t_7, t_8, t_9, t_{11}\}$, $RRR(t_{10}) = \{t_{10}\}$, $RRR(t_{11}) = \{t_1, t_2, t_6, t_7, t_8, t_9, t_{11}\}$,

    $RRRR(t_1) = \{t_1, t_2, t_6, t_7, t_8, t_9, t_{11}\}$, $RRRR(t_2) = \{t_1, t_2, t_6, t_7, t_8, t_9, t_{11}\}$, $RRRR(t_3) = \{t_3\}$, $RRRR(t_4) = \{t_4\}$, $RRRR(t_5) = \{t_5\}$, $RRRR(t_6) = \{t_1, t_2, t_6, t_7, t_8, t_9, t_{11}\}$, $RRRR(t_7) = \{t_1, t_2, t_6, t_7, t_8, t_9, t_{11}\}$, $RRRR(t_8) = \{t_1, t_2, t_6, t_7, t_8, t_9, t_{11}\}$, $RRRR(t_9) = \{t_1, t_2, t_6, t_7, t_8, t_9, t_{11}\}$, $RRRR(t_{10}) = \{t_{10}\}$, $RRRR(t_{11}) = \{t_1, t_2, t_6, t_7, t_8, t_9, t_{11}\}$.

    Hence, two consecutive upper approximations for $\{t_i\}$, $i = 1, 2, \ldots, 11$, are the same. Therefore, we get the similarity upper approximations as follows.

    $S_1 = \{t_1, t_2, t_6, t_7, t_8, t_9, t_{11}\}$, $S_2 = \{t_1, t_2, t_6, t_7, t_8, t_9, t_{11}\}$, $S_3 = \{t_3\}$, $S_4 = \{t_4\}$, $S_5 = \{t_5\}$, $S_6 = \{t_1, t_2, t_6, t_7, t_8, t_9, t_{11}\}$, $S_7 = \{t_1, t_2, t_6, t_7, t_8, t_9, t_{11}\}$, $S_8 = \{t_1, t_2, t_6, t_7, t_8, t_9, t_{11}\}$, $S_9 = \{t_1, t_2, t_6, t_7, t_8, t_9, t_{11}\}$, $S_{10} = \{t_{10}\}$, $S_{11} = \{t_1, t_2, t_6, t_7, t_8, t_9, t_{11}\}$.

    Since $S_i = S_j \neq S_k$, where $i, j = 1, 2, 6, 7, 8, 9, 11$ and $k = 3, 4, 5, 10$, according to [11] there are five clusters: $\{t_3\}$, $\{t_4\}$, $\{t_5\}$, $\{t_{10}\}$ and $\{t_1, t_2, t_6, t_7, t_8, t_9, t_{11}\}$.

    For the proposed method, the intersections of the similarity classes are summarized in Table 3. From Table 3, notice that $R(t_i) \cap R(t_j) \neq \emptyset$ for $i \neq j$; $i, j = 1, 2, 6, 7, 8, 9, 11$, and $R(t_k) \cap R(t_l) = \emptyset$ for $k \neq l$; $k = 1, 2, \ldots, 11$; $l = 3, 4, 5, 10$. We get the clusters as follows.

    $S_1 = \{t_1, t_2, t_6, t_7, t_8, t_9, t_{11}\}$, $S_2 = \{t_1, t_2, t_6, t_7, t_8, t_9, t_{11}\}$, $S_3 = \{t_3\}$, $S_4 = \{t_4\}$, $S_5 = \{t_5\}$, $S_6 = \{t_1, t_2, t_6, t_7, t_8, t_9, t_{11}\}$, $S_7 = \{t_1, t_2, t_6, t_7, t_8, t_9, t_{11}\}$, $S_8 = \{t_1, t_2, t_6, t_7, t_8, t_9, t_{11}\}$, $S_9 = \{t_1, t_2, t_6, t_7, t_8, t_9, t_{11}\}$, $S_{10} = \{t_{10}\}$, $S_{11} = \{t_1, t_2, t_6, t_7, t_8, t_9, t_{11}\}$.

    The five clusters are $\{t_1, t_2, t_6, t_7, t_8, t_9, t_{11}\}$, $\{t_3\}$, $\{t_4\}$, $\{t_5\}$ and $\{t_{10}\}$. These are the same clusters as in [11]. However, fewer iterations are needed than with the technique proposed by [11]. With the given threshold value, $\{t_3\}$, $\{t_4\}$, $\{t_5\}$ and $\{t_{10}\}$ are segregated clusters, but the transaction data suggest that there may be related transactions among them. We propose to handle this problem by applying a second threshold value. We therefore keep $\{t_1, t_2, t_6, t_7, t_8, t_9, t_{11}\}$ as the first cluster under the first threshold value, and for the remainder $\{t_3\}$, $\{t_4\}$, $\{t_5\}$, $\{t_{10}\}$ we apply the second threshold value and group the remaining transactions by similarity. The similarities of the remaining transactions are shown below.

    $sim(t_3, t_4) = 0.25$, $sim(t_3, t_5) = 0.33$, $sim(t_3, t_{10}) = 0.25$, $sim(t_4, t_5) = 0$, $sim(t_4, t_{10}) = 0$, $sim(t_5, t_{10}) = 0$.

    Let the second threshold value be 0.3; then the similarity classes are as given below.

    $R(t_3) = \{t_3, t_5\}$, $R(t_4) = \{t_4\}$, $R(t_5) = \{t_3, t_5\}$, $R(t_{10}) = \{t_{10}\}$.

    The intersections of the similarity classes are summarized in Figure 2.

    $R(t_3) \cap R(t_4) = \{t_3, t_5\} \cap \{t_4\} = \emptyset$,
    $R(t_3) \cap R(t_5) = \{t_3, t_5\} \cap \{t_3, t_5\} = \{t_3, t_5\}$,
    $R(t_3) \cap R(t_{10}) = \{t_3, t_5\} \cap \{t_{10}\} = \emptyset$,
    $R(t_4) \cap R(t_5) = \{t_4\} \cap \{t_3, t_5\} = \emptyset$,
    $R(t_4) \cap R(t_{10}) = \{t_4\} \cap \{t_{10}\} = \emptyset$,
    $R(t_5) \cap R(t_{10}) = \{t_3, t_5\} \cap \{t_{10}\} = \emptyset$.

    Fig.2. The intersection of similarity classes

    Based on Figure 2, we see that $R(t_3) \cap R(t_5) \neq \emptyset$ and $R(t_i) \cap R(t_j) = \emptyset$ for $i \neq j$, $i = 3, 4, 5$; $j = 4, 10$. We get the clusters $S_3 = \{t_3, t_5\}$, $S_4 = \{t_4\}$, $S_5 = \{t_3, t_5\}$, $S_{10} = \{t_{10}\}$. Hence, the three clusters are $\{t_3, t_5\}$, $\{t_4\}$, $\{t_{10}\}$. Overall, for the two threshold values given, we have four clusters: $\{t_1, t_2, t_6, t_7, t_8, t_9, t_{11}\}$, $\{t_3, t_5\}$, $\{t_4\}$, and $\{t_{10}\}$.
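The two-threshold procedure of this example can be sketched end to end in Python on the data of Table 2. This is an illustrative reading, not the authors' MATLAB implementation: the function names are hypothetical, clusters are found as connected groups of overlapping similarity classes, and the "remainder" is assumed to be the transactions left in singleton clusters after the first threshold.

```python
def sim(t, s):
    """Similarity |t & s| / |t | s| of two non-empty item sets."""
    return len(t & s) / len(t | s)


def cluster(T, th):
    """Union similarity classes that overlap, directly or through a chain."""
    names = list(T)
    R = {a: {b for b in names if sim(T[a], T[b]) >= th} for a in names}
    seen, clusters = set(), []
    for a in names:
        if a in seen:
            continue
        comp, stack = set(), [a]
        while stack:
            x = stack.pop()
            if x in comp:
                continue
            comp.add(x)
            stack.extend(R[x] - comp)  # visit the rest of x's similarity class
        seen |= comp
        clusters.append(comp)
    return clusters


def two_threshold(T, th1, th2):
    """Cluster with th1, then re-cluster the singleton remainder with th2."""
    first = cluster(T, th1)
    kept = [c for c in first if len(c) > 1]
    rest = {t: T[t] for c in first if len(c) == 1 for t in c}
    return kept + cluster(rest, th2)


# Rows of Table 2 as bit strings over hl1..hl6.
rows = {
    "t1": "101100", "t2": "001111", "t3": "010011", "t4": "100000",
    "t5": "000010", "t6": "011000", "t7": "001100", "t8": "001101",
    "t9": "001000", "t10": "010100", "t11": "001110",
}
T = {name: {i + 1 for i, bit in enumerate(bits) if bit == "1"}
     for name, bits in rows.items()}

result = two_threshold(T, 0.5, 0.3)
print(sorted(sorted(c) for c in result))
```

Under these assumptions the sketch yields the same four clusters as the worked example: the seven-transaction cluster from the first threshold, then $\{t_3, t_5\}$, $\{t_4\}$ and $\{t_{10}\}$ from the second.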

    The purity of clusters was used as a measure to test the quality of the clusters [11]. The purity of a cluster and the overall purity are defined as

    $$\mathrm{Purity}(i) = \frac{t_i^{th}}{t_n},$$

    where
    $t_i^{th}$ : the number of data occurring in both the $i$th cluster under the given threshold,
    $t_n$ : the number of data in the data set,

    $$\text{Overall Purity} = \frac{\sum_{i=1}^{\#\text{ of clusters}} \mathrm{Purity}(i)}{\#\text{ of clusters}}.$$

    According to this measure, a higher value of overall purity indicates a better clustering result, with perfect clustering yielding a value of 100%. The RoCeT model and the algorithm of [11] for clustering web transactions are implemented in MATLAB version 7.6.0.324 (R2008a).


    Table 3. The intersection of similarity classes

    T/T  t1        t2        t3  t4  t5  t6   t7          t8          t9  t10
    t2   7,8,11
    t3
    t4
    t5
    t6
    t7   1,2,8,11  2,7,8,11
    t8   1,2,8,11  2,7,8,11                   1,2,7,8,11
    t9   7         7                     6,9  7,9         7
    t10
    t11  1,2,8,11  2,7,8,11                   1,2,7,8,11  1,2,7,8,11  7

    [Two panels plotting transactions (1 to 11) against clusters: one by the given threshold 0.5 (clusters 1 to 5) and one after the given second threshold 0.3 (clusters 1 to 4)]

    Fig.3. Visualization of example 2

    They are executed sequentially on an Intel Core 2 Duo processor. The total main memory is 1 gigabyte and the operating system is Windows XP Professional SP3. The purity of clusters is described in Figure 4.

    The comparisons of computation and response time of RoCeT and [11] on the transaction data set from Table 2 are given in Figures 6 and 7, respectively.

    Based on Figure 8, the RoCeT model provides better solutions than the algorithm of [11].

    Cluster   Member Transactions            Purity
    1         t1, t2, t6, t7, t8, t9, t11    1
    2         t3, t5                         1
    3         t4                             1
    4         t10                            1

    Overall Purity 100 %

    Fig.4. The Purity of clusters

                             Overall Purity
    The technique of [11]    81.82 %
    RoCeT                    100 %

    Fig.5. The Overall Purity

    [Chart "Computation" comparing the technique of [11] and RoCeT, vertical axis from 55 to 100]

    Fig.6. The Computation

    [Chart "Response Time" comparing the technique of [11] and RoCeT, vertical axis in seconds from 0.015 to 0.06]

    Fig.7. The Response Time

                        Purity     Computation   Time
    Data Transaction    18.18 %    46.30 %       80.77 %

    Fig.8. The overall improvement over [11] by RoCeT


    5 Experiment test

    In order to test the RoCeT model and compare it with the algorithm of [11], we use a web log data set from http://kdd.ics.uci.edu/databases/msnbc/msnbc.html. The data describes the page visits by users who visited on September 28, 1999. Visits are recorded at the level of URL category and in chronological order. The data comes from Internet Information Server (IIS) logs for msnbc.com. Each row in the data set corresponds to the page visits of a user within a twenty-four hour period. Each item in a row corresponds to a request of a user for a page. Client-side cached data is not recorded, so this data contains only the server-side log. From almost one million transactions, we take 2000 transactions and split them into five categories: 100, 200, 500, 1000 and 2000. The comparison of response times is shown in Figure 9 and that of computation in Figure 10.
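Preparing such rows for clustering can be sketched as below. The sketch assumes that each data row is a whitespace-separated list of integer page-category codes and that any other line (for example a header) can be skipped; the sample lines and the function name `parse_rows` are hypothetical, and turning a row into a set drops the order and repetition of visits, which is a simplification made here so that the similarity measure of Section 3 applies directly.

```python
def parse_rows(lines, limit):
    """Turn rows of integer category codes into item sets, keeping `limit` rows."""
    transactions = []
    for line in lines:
        fields = line.split()
        if not fields or not all(f.isdigit() for f in fields):
            continue  # skip blank lines and lines that are not pure integer rows
        transactions.append(set(map(int, fields)))
        if len(transactions) == limit:
            break
    return transactions


# Hypothetical sample in the assumed row format.
sample = ["% header line", "1 1", "2", "3 2 2 4 2 2", "", "5"]
subset = parse_rows(sample, 3)
print(subset)  # [{1}, {2}, {2, 3, 4}]
```

Calling the function with limits of 100, 200, 500, 1000 and 2000 would give nested subsets of the same file, which is one possible way to obtain the five categories used in the experiment.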

    Table 4. The Purity of clusters

    Number of      The RoCeT   The technique   Improvement
    Transaction    model       of [11]
    100            100%        93.0%           7.0%
    200            100%        96.0%           4.0%
    500            100%        95.5%           0.5%
    1000           100%        95.5%           0.5%
    2000           100%        99.9%           0.1%

    Average                                    2.5%

    Table 5. The executing time

    Number of      The RoCeT   The technique   Improvement
    Transaction    model       of [11]
    100            1.6969      6.250           68.79%
    200            9.093       6.250           66.25%
    500            77.266      163.760         48.35%
    1000           554.426     2205.100        65.10%
    2000           3043.500    9780.900        64.97%

    Average                                    62.69%

    Table 6. The Computation

    Number of      The RoCeT   The technique   Improvement
    Transaction    model       of [11]
    100            8806        28213           68.50%
    200            39349       116576          68.39%
    500            257003      497595          52.82%
    1000           1034964     2965579         74.86%
    2000           4161122     11879645        68.88%

    Average                                    69.69%

    [Plot "Response Time": time in seconds (0 to 10000) against number of transactions (100, 200, 500, 1000, 2000) for the technique of [11] and RoCeT]

    Fig.9. The executing time


    [Plot "Computation": computation (0 to 12 x 10^6) against number of transactions (100, 200, 500, 1000, 2000) for RoCeT and the technique of [11]]

    Fig.10. The Computation

    [Two panels plotting transactions (0 to 100) against clusters: one for the 1st threshold 0.6 and one after the given 2nd threshold 0.3]

    Fig.11. Visualization of 100 transactions

    [Two panels plotting transactions (0 to 200) against clusters: one for the 1st threshold 0.6 and one after the given 2nd threshold 0.3]

    Fig.12. Visualization of 200 transactions


    [Two panels plotting transactions (0 to 500) against clusters: one for the 1st threshold 0.6 and one after the given 2nd threshold]

    Fig.13. Visualization of 500 transactions

    [Two panels plotting transactions (0 to 1000) against clusters: one for the 1st threshold 0.6 and one after the given 2nd threshold 0.3]

    Fig.14. Visualization of 1000 transactions

    [Two panels plotting transactions (0 to 2000) against clusters: one for the 1st threshold 0.6 and one after the given 2nd threshold 0.3]

    Fig.15. Visualization of 2000 transactions


    6 Conclusion

    A web clustering technique can be applied to find interesting user access patterns in web logs. In this paper, we have proposed the RoCeT model for clustering web transactions using rough set theory, based on the similarity between two transactions. The analysis of the RoCeT model was presented in terms of computation, processing time and cluster purity. We elaborated the proposed technique on UCI benchmark data, i.e., the msnbc.com web log data. It is shown that the RoCeT model lowers response time by up to 62.69 % compared to the technique of [11]. Meanwhile, cluster purity improves by up to 2.5 %.

    7 Acknowledgement

    This work was supported by the FRGS under the Grant No. Vote 0402, Ministry of Higher Education, Malaysia.

    References

    [1] Pal, S.K., Talwar, V. and Mitra, P.,(2002) WebMining in Soft Computing Framework: Relevance,State of the Art and Future Directions, IEEETransactions on neural network, 13 (5), 1163 -1177.

    [2] Bucher, A.G. and Mulvenna, M.D., (1998) Discovering Internet Marketing Intelligence through Online Analytical Web Usage Mining, SIGMOD Record, 27 (4), 54-61.

    [3] Cohen, E., Krishnamurthy, B. and Rexford, J., (1998) Improving end-to-end performance of the web using server volumes and proxy filters, Proceedings of the ACM SIGCOMM. Vancouver, British Columbia, Canada: ACM Press.

    [4] Joachims, T., Freitag, D. and Mitchell, T., (1997) Webwatcher: A tour guide for the world wide web, In the 15th International Joint Conference on Artificial Intelligence (IJCAI97), Nagoya, Japan.

    [5] Lieberman, H., (1995) Letizia: An agent that assists web browsing, Proceedings of the 1995 International Joint Conference on Artificial Intelligence. Montreal, Canada: Morgan Kaufmann.

    [6] Mobasher, B., Cooley, R. and Srivastava, J., (1999) Creating adaptive web sites through usage based clustering of URLs, Proceedings of the 1999 Workshop on Knowledge and Data Engineering Exchange. IEEE Computer Society.

    [7] Ngu, D.S.W. and Wu, X., (1997) Sitehelper: A localized agent that helps incremental exploration of the world wide web, Proceedings of the 6th International World Wide Web Conference. Santa Clara, CA: ACM Press.

    [8] Perkowitz, M. and Etzioni, O., (1998) Adaptive Web Sites: Automatically Synthesizing Web Pages, Proceedings of the 15th National Conference on Artificial Intelligence. Madison, WI: AAAI.

    [9] Z. Yanchun, X. Guandong and Z. Xiaofang, (2005) A Latent Usage Approach for Clustering Web Transaction and Building User Profile, Springer-Verlag Berlin Heidelberg, 31-42.

    [10] Han, E. et al., (1998) Hypergraph Based Clustering in High-Dimensional Data Sets: A Summary of Results, IEEE Data Engineering Bulletin, 21 (11), 15-22.

    [11] De, S.K. and Krishna, P.R., (2004) Clustering web transactions using rough approximation, Fuzzy Sets and Systems, 148, 131-138.

    [12] Pawlak, Z., (1982) Rough sets, International Journal of Computer and Information Science, 11, 341-356.

    [13] Pawlak, Z., (1991) Rough sets: A theoretical aspect of reasoning about data, Kluwer Academic Publisher.

    [14] Pawlak, Z. and Skowron, A., (2007) Rudiments of rough sets, Information Sciences. An International Journal, 177 (1), 3-27.

    Iwan Tri Riyadi Yanto
    He is a Master's candidate in Data Mining at Universiti Tun Hussein Onn Malaysia (UTHM). His research areas include Data Mining, KDD, and Real Analysis.

    Tutut Herawan
    He is a Ph.D. candidate in Data Mining at Universiti Tun Hussein Onn Malaysia (UTHM). His research areas include Data Mining, KDD, and Real Analysis.


    Mustafa Mat Deris
    He received the B.Sc. from University Putra Malaysia, the M.Sc. from the University of Bradford, England, and the Ph.D. from University Putra Malaysia. He is a professor of computer science in the Faculty of Information Technology and Multimedia, UTHM, Malaysia. His research interests include distributed databases, data grid, database performance issues, and data mining. He has published more than 80 papers in journals and conference proceedings. He was appointed as an editorial board member for the International Journal of Information Technology, World Enformatika Society, and has served as a reviewer for a special issue of the International Journal of Parallel and Distributed Databases, Elsevier, 2004; a special issue of the International Journal of Cluster Computing, Kluwer, 2004; the IEEE conference on Cluster and Grid Computing, held in Chicago, April 2004; and the Malaysian Journal of Computer Science. He has served as a program committee member for numerous international conferences and workshops, including Grid and Peer-to-Peer Computing (GP2P 2005, 2006), Autonomic Distributed Data and Storage Systems Management (ADSM 2005, 2006), WSEAS, International Association of Science and Technology, IASTED on Database, etc.
