[IEEE Third Latin American Web Congress (LA-WEB'2005), Buenos Aires, Argentina, 31 Oct. - 02 Nov. 2005]


Page Clustering Using a Distance Based Algorithm

Jairo Andrés Mojica
[email protected]

Diego Alexander Rojas
[email protected]

Jonatan Gómez
[email protected]

Fabio González
[email protected]

Intelligent Systems Research Lab., National University of Colombia

Abstract

This paper presents an application of a clustering algorithm based on gravitational forces to the problem of Web page clustering in a dynamic environment. The proposed algorithm uses a modification of the gravitational algorithm proposed by Gómez et al., but using only distance measures (a notion of space is not required). This approach is useful when similarities (and/or distances) between pages can be defined and computed quickly, but the definition of a space is computationally expensive. Experiments with data representing real URLs and sessions are performed, and a comparison is made with the incremental connected components algorithm, which has previously been used to solve this problem.

1. Introduction

Web mining is the use of data mining techniques in the Web environment. The main goal of Web data mining is to obtain useful information automatically by mining data from Web documents or service usage records [3]. Some of the most remarkable tasks concerning Web data mining are efficient indexed information retrieval, usage pattern discovery, personalization of information, and learning from individual user behavior.

The rapidly changing structure and content of today's Web sites cause difficult problems for the information management and retrieval processes, because new users are arriving and clusters change over time. As a consequence, Web data mining has been used to acquire information about Web site usage in order to modify the site's behavior in a personalized manner. These changes are made according to the knowledge found. Some of the content problems of Web sites arise because visitors often have different interests and therefore behave in different ways when surfing a site's pages. In those cases, classification and clustering techniques represent natural approaches to finding a solution.

In [16], Srivastava et al. describe Web Usage Mining as the applied discovery of user behavior patterns for a better understanding and serving of Web-based applications. Web Usage Mining uses techniques and algorithms from different fields such as statistical analysis, association rules, clustering, and classification.

Clustering is an unsupervised learning technique that takes unlabeled points and assigns each of them to a group, so that points in the same cluster have a high similarity measure (or low distance) and elements of two different clusters have low similarity [7]. Several techniques have been proposed for clustering, and most of them take the number of clusters as a parameter [1, 10]. When the number of clusters is not provided but is instead found by the algorithm itself, the techniques are called unsupervised [4, 9, 13].

On the other hand, page clustering represents a particular problem for soft computing techniques. Clustering methods group web pages according to different measures. In this sense, the most common approach is to use information gathered from network users, so that they themselves define the clusters of pages. The problem of page clustering is then finally reduced to finding the induced similarity and classifying the pages according to it. This kind of approach to web mining is usually called "relational clustering" because of the absence of a space where points can be placed. The information generated by these techniques is important for discovering the tendencies of a site's users.

Several approaches are known for page clustering, and most of them are designed to work with Web server logs [6, 15, 17].

There are basically two designs for modules that perform clustering and suggestion procedures. The first uses off-line clustering and on-line suggestion generation while the users are surfing the Web [11, 18]. The second, presented by Silvestri et al. [14], is designed as a tool capable of performing clustering and on-line suggestions in one step; this tool uses the incremental connected components algorithm.

Proceedings of the Third Latin American Web Congress (LA-WEB'05) 0-7695-2471-0/05 $20.00 © 2005 IEEE

The main purpose of this work is to test the DBRAIN algorithm in dynamic environments such as the page clustering problem. It also provides an alternative path for relational clustering approaches. The input/output procedures of the model proposed by Silvestri et al. [14] have been used in the application presented here, in order to perform a comparison between the clustering algorithm used in SUGGEST1 and the DBRAIN algorithm.

A description of the RAIN algorithm can be found in Section 2. In Section 3, the modifications that turn RAIN into a purely distance-based algorithm are presented. In Section 4 we present the proposed model, and Section 5 shows the experiments performed to test it. Finally, we present some conclusions and discussion, as well as lines of future work.

2. Randomized Gravitational Clustering

In [5], Gómez et al. developed a robust clustering technique based on the gravitational law and Newton's second law of motion. For an n-dimensional data set with N data points, each data point is considered an object in the n-dimensional space with mass equal to 1. Each point in the data set is moved according to a simplified version of the gravitational law combined with Newton's second law of motion. The basic ideas behind applying the gravitational law are:

1. A data point in some cluster exerts a higher gravitational force on a data point in the same cluster than on a data point that is not in the cluster. Then, points in a cluster move in the direction of the center of the cluster. In this way, the proposed technique automatically determines the clusters in the data set.

2. If some point is a noise point, i.e., it does not belong to any cluster, then the gravitational force exerted on it by other points is so small that the point is almost immobile. Therefore, noise points will not be assigned to any cluster.

In order to reduce the amount of memory and time expended in moving a data point according to the gravitational field generated by another point (y), we use the following simplified equation:

1 SUGGEST is an application that aims to build useful link suggestions for users visiting a site, based on Web server logs. Since version 2.0, SUGGEST has been designed to work fully on-line, mining data and making suggestions in one single process. The Apache server module implementation of SUGGEST 2.0 is named suggestol and originally uses the connected components algorithm.

x(t + 1) = x(t) + d · G/‖d‖³   (1)

where d = y − x, and G is the gravitational constant.

We consider the velocity at any time, v(t), to be the zero vector, and Δt = 1. Since the distance between points is reduced at each iteration, all the points would be moved to a single position after a huge (possibly infinite) number of iterations (a big crunch). The gravitational clustering algorithm would then define a single cluster.

In order to eliminate this limit effect, the gravitational constant G is reduced at each iteration by a constant proportion (the decay term ΔG). Algorithm 1 shows the randomized gravitational clustering algorithm.

Algorithm 1 Randomized Gravitational Clustering
RGC(x, G, ΔG, M, ε)

1:  for i ← 1 to N do
2:    MAKE(i)
3:  end for
4:  for i ← 1 to M do
5:    for j ← 1 to N do
6:      k ← random point index such that k ≠ j
7:      MOVE(xj, xk) (see Eq. (1))  // move both points
8:      if dist(xj, xk) ≤ ε then
9:        UNION(j, k)
10:     end if
11:   end for
12:   G ← (1 − ΔG)G
13: end for
14: for i ← 1 to N do
15:   FIND(i)
16: end for
17: return disjoint-sets

The function MOVE (line 7) moves both points xj and xk using (1), taking into consideration that neither point can move farther than half of the distance between them. In each iteration, RGC creates a set of clusters by using an optimal disjoint-set union-find structure2 and the distance between objects (after moving data points according to the gravitational force). When two points are merged, both of

2 A disjoint-set union-find structure is a structure that supports the following three operators [2]:

• MAKESET(x): create a new set containing the single element x.

• UNION(x, y): replace the two sets containing x and y by their union.

• FIND(x): return the name of the set containing the element x.

In the optimal disjoint-set union-find structure, each set is represented by a tree where the root of the tree is the canonical element of the set, and each child node has a pointer to its parent node (the root node points to itself) [2].
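To make the footnote concrete, here is a minimal Python sketch of such a structure. The class and method names are ours; only the three operators above come from [2]:

```python
class DisjointSets:
    """Optimal disjoint-set union-find structure [2]: each set is a tree
    whose root is the canonical element of the set."""

    def __init__(self):
        self.parent = {}

    def make(self, x):
        # MAKESET(x): create a new set containing the single element x
        self.parent[x] = x

    def find(self, x):
        # FIND(x): return the canonical element (root) of x's set,
        # compressing the path along the way
        root = x
        while self.parent[root] != root:
            root = self.parent[root]
        while self.parent[x] != root:
            self.parent[x], x = root, self.parent[x]
        return root

    def union(self, x, y):
        # UNION(x, y): replace the sets containing x and y by their union
        self.parent[self.find(x)] = self.find(y)


ds = DisjointSets()
for i in range(4):
    ds.make(i)
ds.union(0, 1)
ds.union(1, 2)
print(ds.find(0) == ds.find(2))  # True: 0, 1, 2 are now one set
print(ds.find(0) == ds.find(3))  # False: 3 is still on its own
```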


them are kept in the system while the associated set structure is modified. In order to determine the new position of each data point, the proposed algorithm simply selects another data point at random and moves both of them according to (1) (the MOVE function). RGC returns the sets stored in the disjoint-set union-find structure.

Because RGC assigns every point in the data set (noisy or normal) to one cluster, it is necessary to extract the valid clusters. We used an extra parameter (α) to determine the minimum number of points (as a percentage of the training data set) that a cluster should include in order to be considered valid. An additional function, GETCLUSTERS, takes the disjoint sets generated by RGC and returns the collection of clusters that have at least the minimum number of points defined; see Algorithm 2.

Algorithm 2 Cluster Extraction
GETCLUSTERS(clusters, α, N)

1: newClusters ← ∅
2: MIN_POINTS ← α · N
3: for i ← 0 to numberOfClusters do
4:   if size(clustersi) ≥ MIN_POINTS then
5:     newClusters ← newClusters ∪ {clustersi}
6:   end if
7: end for
8: return newClusters
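Algorithms 1 and 2 together can be sketched in Python as follows. This is an illustrative implementation under simplifying assumptions (Euclidean points of mass 1, a plain parent-array union-find, and the half-distance cap described for MOVE), not the authors' code:

```python
import math
import random


def rgc(points, G, dG, M, eps, alpha):
    """Sketch of Randomized Gravitational Clustering (Algorithms 1 and 2).
    points: list of n-dimensional tuples; returns clusters with at least
    alpha*N members (GETCLUSTERS folded in at the end)."""
    pts = [list(p) for p in points]
    N = len(pts)
    parent = list(range(N))              # MAKE(i) for every point

    def find(i):                         # FIND with path halving
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    def dist(a, b):
        return math.sqrt(sum((u - v) ** 2 for u, v in zip(a, b)))

    for _ in range(M):
        for j in range(N):
            k = random.choice([i for i in range(N) if i != j])
            # MOVE(x_j, x_k): both points step toward each other by
            # G/||d||^3, capped at half the distance between them (Eq. 1)
            d = dist(pts[j], pts[k])
            if d > 0:
                step = min(G / d ** 3, 0.5)
                for t in range(len(pts[j])):
                    delta = (pts[k][t] - pts[j][t]) * step
                    pts[j][t] += delta
                    pts[k][t] -= delta
            if dist(pts[j], pts[k]) <= eps:
                parent[find(j)] = find(k)    # UNION(j, k)
        G *= (1 - dG)                        # decay the gravitational constant

    # GETCLUSTERS: keep only clusters with at least alpha*N points
    clusters = {}
    for i in range(N):
        clusters.setdefault(find(i), []).append(i)
    return [c for c in clusters.values() if len(c) >= alpha * N]
```

On two tight pairs of points, e.g. `[(0, 0), (0.01, 0), (10, 10), (10.01, 10)]`, the sketch recovers the two pairs as clusters while the large inter-pair distance keeps them from merging.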

3. DBRAIN

In order to perform the clustering phase without generating the Euclidean space (or any other), some modifications were made to the original RAIN algorithm. This version of the algorithm works for problems where the positions of the points in a space (for example, a Euclidean space) cannot be defined but the distance between each pair of points is known.

The modifications contained in DBRAIN reside in the MOVE function. In RAIN, the new position of a point in space was computed using (1). In DBRAIN we do not have the concept of space, so this is done according to the following procedure:

1. Compute the new distance between the selected points.

2. Update the distances from the selected points to the rest of the points.

3. Keep the distance properties of the distance matrix.

The first step is done using a simplified version of (1). As we can see, equation (1) moves points in Rⁿ. Since we have only distances, we can see this movement as a one-dimensional move (in R) along the distance line. This leads us to (2), where dt(i, j) denotes the distance between points i and j at time t. As was done in RAIN, the velocity of any point, v(t), is considered zero at all times, every point has a mass of one, and Δt = 1. The factor of −2 means that the distance is decaying and that both points are moving.

[Figure 1. The problem of finding the distance from a point k to a linear combination of two other points when only three points are considered.]

dt+1(i, j) = dt(i, j) − 2G/dt(i, j)²   (2)

Hathaway and Bezdek [8] proved that the Euclidean distance from a point to a weighted sum of a set of points (a generalized centroid) can be calculated using only the inter-point distances. This comes as a generalization of the same problem when only three points are considered (Figure 1). The generalization is given by equation (3), and the only constraint it has to satisfy is the one in (4). In DBRAIN, only the case of Figure 1 is considered. To find the factors wi, (5) and (6) are used. When two points cross each other because of the gravitational force, DBRAIN considers that a collision has happened and sets the distance to zero, dt+1(i, j) = 0.

d²(k, Σᵢ wᵢxᵢ) = Σᵢ wᵢ d²(k, xᵢ) − (1/2) Σᵢ Σⱼ wᵢwⱼ d²(xᵢ, xⱼ)   (3)

where all sums run over i, j = 1, ..., n, and

Σᵢ wᵢ = 1   (4)

w1 = (dt(i, j) − dt+1(i, j)) / (2 dt(i, j))   (5)

w2 = 1 − w1   (6)


In the final step we have to consider some distance properties. As is well known, these conditions are the following:

d(i, i) = 0 (7)

d(i, j) = d(j, i) (8)

d(i, k) ≤ d(i, j) + d(j, k) (9)

Properties (7) and (8) can be easily maintained by keeping the symmetry of the distance matrix and keeping its diagonal equal to zero. Property (9) is maintained when all distances are greater than or equal to zero. We have already shown how to maintain this for the two selected points. Otherwise, when the value given by (3) is less than zero, we add a value to the whole distance matrix so that (3) becomes zero in that step.

Finally, the MOVE function becomes the procedure shown in Algorithm 3.

Algorithm 3 MOVE function
MOVE(i, j, G, dMatrix)

1: compute dt+1(i, j) according to (2)
2: if dt+1(i, j) < 0 then
3:   dt+1(i, j) ← 0
4: end if
5: compute w1 and w2 according to (5) and (6)
6: for k ← 1 to N, k ≠ i, k ≠ j do
7:   update dMatrix(k, i) according to (3)
8:   update dMatrix(k, j) according to (3)
9: end for
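A Python sketch of this MOVE procedure, combining (2), (3), (5), and (6), might look as follows. The matrix-wide shift used when (3) goes negative is simplified here to clamping at zero, which is an assumption of this sketch:

```python
import math


def move(d, i, j, G):
    """DBRAIN MOVE (Algorithm 3) on a distance matrix d (list of lists,
    symmetric, zero diagonal): points i and j attract each other and all
    distances to the remaining points are updated via Eq. (3)."""
    dij = d[i][j]
    if dij == 0:
        return
    # Eq. (2): new distance between i and j; a crossing is a collision
    new_dij = max(dij - 2 * G / dij ** 2, 0.0)
    # Eqs. (5) and (6): weights of the linear combination each point
    # moves to (each point travels half of the lost distance)
    w1 = (dij - new_dij) / (2 * dij)
    w2 = 1 - w1
    for k in range(len(d)):
        if k in (i, j):
            continue
        dki, dkj = d[k][i], d[k][j]
        # Eq. (3) with two points: squared distance from k to the new
        # position of i (w2 of old i plus w1 of old j), and likewise for j
        ki2 = w2 * dki ** 2 + w1 * dkj ** 2 - w1 * w2 * dij ** 2
        kj2 = w1 * dki ** 2 + w2 * dkj ** 2 - w1 * w2 * dij ** 2
        d[k][i] = d[i][k] = math.sqrt(max(ki2, 0.0))
        d[k][j] = d[j][k] = math.sqrt(max(kj2, 0.0))
    d[i][j] = d[j][i] = new_dij
```

As a sanity check, for three collinear points at 0, 1, and 5 on a line, a move with G = 0.125 brings the first pair from distance 1 to 0.75, and the updated distances to the third point (4.875 and 4.125) match the points' new positions on the line.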

4. Proposed Model

In order to generate a model that can run in real time, it has to generate clusters as new sessions appear. We used an approach similar to the one proposed by Silvestri et al. [14], but using the DBRAIN algorithm presented here. In this way, we used the same input functions as well as the co-occurrence matrix, because that matrix gives us a similarity measure in the interval [0, ∞) and it can be easily transformed into a distance matrix using (10). The factor α is used to maintain separation between points so we can avoid the big crunch (in the interval [0, 1) the gravitational force is very strong).

d(i, j) = α · exp{Matrix(i, j)}  if i ≠ j;  0 otherwise   (10)
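As an illustration, transforming a similarity matrix with (10) exactly as printed could be written as (the function name is ours):

```python
import math


def to_distance(W, alpha):
    """Eq. (10): turn a similarity matrix W into a distance matrix.
    The diagonal is zero; off-diagonal entries are alpha * exp(W_ij),
    with alpha scaling the matrix to keep points separated."""
    n = len(W)
    return [[0.0 if i == j else alpha * math.exp(W[i][j])
             for j in range(n)] for i in range(n)]
```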

In order to test the quality of the clusters, we also use the same algorithm as the one proposed in SUGGEST.

4.1. Data Input

The correlation matrix is probably the most important structure in the SUGGEST architecture. At the same time, it is the center of the data we use. This matrix stores the co-occurrence records for the pages of the site, as the number of sessions containing both pages. The diagonal of this matrix stores the total number of visits to a page. There are some details in the construction of the matrix that we think are important to clarify:

1. The co-occurrence matrix stores only one entry for multiple appearances of the same pair of pages in a session. Multiple visits to the same page within a session are not registered.

2. The position i, j of the matrix is the number of sessions containing both pages i and j, divided by the maximum number of sessions containing only one of the pages. This is done to reduce the importance of index pages, which are frequently visited together with all others. That is, if Nij is the total number of sessions in which both pages appear, then Wij is the normalized similarity, given by (11).

Wij = Nij / max{Ni, Nj}   (11)

3. The matrix is symmetric, making it possible to store only the upper triangle and the diagonal, saving memory in order to handle bigger sites.
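The construction described in the three points above can be sketched as follows. Function and variable names are ours; representing sessions as sets of page indices enforces point 1 automatically (one entry per pair, repeat visits ignored):

```python
def cooccurrence_similarity(sessions, n_pages):
    """Build the normalized co-occurrence matrix of Eq. (11).
    sessions: list of sets of page indices."""
    # Pair counts; the diagonal stores the total visits to each page
    N = [[0] * n_pages for _ in range(n_pages)]
    for s in sessions:
        for i in s:
            N[i][i] += 1
            for j in s:
                if i < j:                 # one entry per pair per session
                    N[i][j] += 1
                    N[j][i] += 1
    # W_ij = N_ij / max{N_i, N_j}: damps index pages, which co-occur
    # frequently with all others
    W = [[0.0] * n_pages for _ in range(n_pages)]
    for i in range(n_pages):
        for j in range(n_pages):
            if i != j and max(N[i][i], N[j][j]) > 0:
                W[i][j] = N[i][j] / max(N[i][i], N[j][j])
    return W
```

For example, with sessions `[{0, 1}, {0, 1}, {0, 2}]`, pages 0 and 1 co-occur twice while page 0 is visited three times, so W01 = 2/3; W02 = 1/3; W12 = 0.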

As there is a limited number of sessions stored at any given time, the program relies on the matrix for long-term information storage. In the version using the distance-based clustering algorithm, the co-occurrence matrix is the base for the distance matrix, as discussed in this paper. The original version uses the graph inferred from the matrix to apply the connected components graph algorithm.

4.2. Clustering Phase

As mentioned before, this is the most important part of this work. The clustering phase consists of a run of the DBRAIN algorithm. Notice that the clustering phase does not run every time the similarity matrix is updated; it runs when n sessions have finished, where n is a parameter the user can set.

The clustering phase begins with a transformation of the similarity matrix into a distance matrix. Once the distance matrix is generated, it is passed as a parameter to the algorithm and the cluster set is updated. Finally, the counter is reset to zero, so that when n new sessions have passed the algorithm will run again.


4.3. Generation of Suggestions

Once the clustering phase is performed, it is necessary to present to the user the m suggestions for the current page (note that m is a parameter of this algorithm). To find these suggestions, the n previous pages visited in the current session are used. This n is called the page window. The suggestion set is built by finding the cluster with the largest intersection with the page window. Once that cluster is found, we look in it for the m most relevant (closest) pages to the current page. As shown in [14], this is an algorithm with O(1) complexity.

Algorithm 4 shows how this is done. For notation, Ci = k if and only if page i belongs to cluster k (i ∈ k). C is the full cluster set.

Algorithm 4 Suggestion Generation
CREATESUGGESTION(pageWindow, C, pageID)

1:  rmax ← 0
2:  k ← 0
3:  sugg ← ∅
4:  for i ← 1 to |pageWindow| do
5:    ri ← |{n ∈ pageWindow | Cn = CpageWindowi}| + 1
6:    if ri > rmax then
7:      rmax ← ri
8:      k ← CpageWindowi
9:    end if
10: end for
11: K ← {n ∈ C | Cn = k}
12: for all k ∈ K do
13:   for i ← 0 to m do
14:     if k ∉ pageWindow then
15:       sugg ← sugg ∪ {k}
16:     end if
17:   end for
18: end for
19: from sugg take the closest points to pageID
20: return sugg
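A simplified Python sketch of the suggestion step follows. The final closeness-based ranking (line 19 of Algorithm 4) requires the distance matrix and is omitted here, so the first m candidates are returned instead; that simplification and the names are assumptions of this sketch:

```python
def create_suggestions(page_window, C, m):
    """C maps each page to its cluster id; page_window is the list of
    the n previously visited pages. Returns up to m suggested pages."""
    # Find the cluster with the largest intersection with the page window
    counts = {}
    for p in page_window:
        k = C.get(p)
        if k is not None:
            counts[k] = counts.get(k, 0) + 1
    if not counts:
        return []
    best = max(counts, key=counts.get)
    # Suggest pages of that cluster not already in the window
    window = set(page_window)
    sugg = [p for p, k in C.items() if k == best and p not in window]
    return sugg[:m]
```

For instance, with clusters `{'a': 1, 'b': 1, 'c': 2, 'd': 1, 'e': 2}` and window `['a', 'c', 'b']`, cluster 1 wins the intersection and the only unvisited member, `'d'`, is suggested.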

Finally, we can establish the general sequence of the model. Algorithm 5 shows the behavior of the model each time a new page arrives.

5. Experiments

5.1. Data Set

A web log file containing web user access sessions to the Web site of the Department of Computer Engineering and Computer Sciences at the University of Missouri, Columbia, was taken in order to apply the model. The initial web log file was transformed into a data set containing 1704 session records with a maximum of 343 attributes, which represent the pages visited in each session. "A session is defined as a sequence of temporally compact accesses by a user. We define a user session as accesses from the same IP address such that the duration of time elapsed between any two accesses in the session is within a prespecified threshold" [12]. Note that this becomes our search space, so this is a really difficult task for a traditional clustering algorithm because of the so-called curse of dimensionality.

Algorithm 5 Proposed Model (when a new page is visited)

1:  UPDATEMATRIX
2:  if a new session starts then
3:    numberSessions ← numberSessions + 1
4:  end if
5:  if numberSessions = maxSessions then
6:    generate the distance matrix according to (10)
7:    DBRAIN(dMatrix)
8:    numberSessions ← 0
9:  end if
10: MAKESUGGESTIONS(Clusters)

Table 1. Parameters for the clustering algorithm

Parameter    Value
G            0.125
ΔG           0.001
Iterations   15 (for each run)
ε            0.0001

In order to test the approach, the data set was divided into chunks of 100 records, so only 1700 records were used. The order, as well as the records belonging to each chunk, were chosen randomly.

5.2. Experimental Setup

The clustering algorithm was run with the parameters shown in Table 1, where G refers to the initial gravitational force, ΔG to the rate of decay of the gravitational force, and ε to the distance at which points can be considered close enough to perform a UNION operation. As we can see, the parameters are the same as those RAIN uses. The main difference is that we do not use any algorithm to estimate the initial gravitational force (G); instead, it was fixed for all runs.

Table 2 shows the parameters for the suggestion generator. As described before, the page window represents how many of the previous pages are considered, and the number of suggestions represents the maximum number of suggestions the program should give as output.


Table 2. Parameters for the suggestion generator

Parameter              Value
Page Window            20
Number of Suggestions  4

Table 3. Results

Method           Recall       Precision
DBRAIN           0.30 (0.02)  0.24 (0.01)
Connected Comp.  0.27 (0.02)  0.29 (0.01)

For the initial iteration of DBRAIN, the first chunk of 100 sessions was passed to the algorithm, which created the first version of the model that generates the suggestions. Next, the following chunk was used to validate this first version of the model, and we used the previous knowledge together with the new data to generate our second model. This is achieved in a new run of the clustering algorithm. These steps are repeated until all chunks have been used to validate and update the model. Note that the last chunk (the 17th) is used only for validation purposes. These iterations are done in such a way that they represent the behavior of real users on a web site.

5.3. Results and Analysis

In order to test the proposed approach against the original SUGGEST model, we use the standard measures of precision and recall. To calculate these measures we use the well-predicted and mispredicted data samples. To accomplish this, for each session we take the suggestions made by the model and check whether these predictions appear in the current session; if a prediction does, we classify it as a well-predicted sample, otherwise as mispredicted. Once all the information is collected, precision and recall are calculated. Note that the way these measures are calculated could lead to a bias, because if a page does not appear in one session, this does not necessarily mean it is unrelated to the pages in the rest of the session. The values of the measures can also be negatively affected by short sessions, which can lead to many mispredicted data samples, especially in sessions shorter than the desired number of suggestions.
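Under the counting scheme described above (our reading of it; the paper does not spell out the exact bookkeeping), precision and recall could be computed as:

```python
def precision_recall(sessions_with_suggestions):
    """A suggestion counts as well predicted if it appears among the
    pages actually visited in the same session.
    Input: list of (suggested_pages, visited_pages) pairs.
    Precision = hits / suggestions made; recall = hits / pages visited."""
    hits = suggested = relevant = 0
    for sugg, visited in sessions_with_suggestions:
        hits += len(set(sugg) & set(visited))
        suggested += len(sugg)
        relevant += len(visited)
    precision = hits / suggested if suggested else 0.0
    recall = hits / relevant if relevant else 0.0
    return precision, recall
```

The denominator choice makes the short-session bias mentioned above visible: a session with fewer visited pages than suggestions cannot score full precision.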

For this kind of application, we consider recall to be a more important performance measure than precision, because we want to obtain the relevant information (good suggestions) while also giving some flexibility to the model.

Table 3 shows the results after five runs of each algorithm. The values correspond to the means and standard deviations (in parentheses). The data was permuted

/∼joshi
/∼joshi/msim.html
/∼joshi/courses/cecs301/announce.html
/∼joshi/courses/cecs352/proj
/∼joshi/index.html
/∼saab/cecs303/private/assignment/hw2.html
/∼saab/cecs303/specialneeds.html
/faculty
/faculty/joshi.html
/faculty/palani.html

/∼joshi/sciag/node1.html
/∼joshi/scipad
/∼joshi/courses/cecs301/comments.html
/∼joshi/dbrowse/mbref.html
/∼saab/cecs303/private/assignment
/∼saab/cecs333/private/assignments/compile.html

/∼shi/cecs345/Lectures/09.html
/∼shi/cecs345/Lectures/13.html
/∼shi/cecs345/Homeworks
/∼joshi/webiq

/∼joshi/courses/cecs438/handout.html
/∼joshi/sciag/intro.html
/∼joshi/scipad/grp.html
/∼zhuang/courses

/∼shi/cecs345/Lectures/11.html
/∼joshi/courses/cecs352/proj/footnode.html
/∼joshi/courses/cecs301/projects/proj2.html

/faculty/shang.html
/∼manager/LAB/unix-mail.html
/∼saab/cecs333/private/lectureprograms
/∼zhuang/courses/cecs476/index.html

Table 4. Example clusters after the last iteration with DBRAIN

because connected components is not a randomized algorithm, so we introduced the sessions in a different order in each run. As we can see, recall for DBRAIN tends to be a little higher (3%) than for connected components; on the other hand, the precision of connected components is higher than that achieved


by DBRAIN, showing the usual trade-off between these two measures.

Table 4 shows some example clusters obtained with the approach proposed in this work. The lines serve as divisions between the clusters; only the most relevant items are shown. The size and shape of the clusters tend to be the same across all experiments.

6. Conclusions and Future Work

Our results show that the DBRAIN algorithm opens a promising research path, both in applications using it and in future improvements to the algorithm itself. The combination of distance-based clustering and the gravitational algorithm suits the Web environment well, since any similarity measure can be used to build the clusters according to particular research interests. The results obtained with our link suggestion prototype based on the SUGGESTOL module are encouraging in terms of suggestion quality, given that the algorithm is at an early design and development stage.

One path of research would be to find a version of the algorithm that takes advantage of previously generated clusters. The current prototype does not include this capability yet.

7. Acknowledgments

The authors would like to thank all the members of LISI (the Intelligent Systems Research Lab at the National University of Colombia) for their encouragement and support.

References

[1] J. C. Bezdek. Pattern Recognition with Fuzzy Objective Function Algorithms. Plenum Press, 1981.

[2] T. Cormen, C. Leiserson, and R. Rivest. Introduction to Algorithms. McGraw-Hill, 1990.

[3] O. Etzioni. The World-Wide Web: quagmire or gold mine? Commun. ACM, 39(11):65–68, 1996.

[4] H. Frigui and R. Krishnapuram. Clustering by competitive agglomeration. Pattern Recognition, 30(7):1109–1119, 1997.

[5] J. Gómez, D. Dasgupta, and O. Nasraoui. A new gravitational clustering algorithm. In Proceedings of the Third SIAM International Conference on Data Mining, pages 83–94, 2003.

[6] J. Gómez, A. Tocarruncho, and O. Nasraoui. Web profiling using gravitational clustering. Technical report, Universidad Nacional de Colombia, 2005.

[7] J. Han and M. Kamber. Data Mining: Concepts and Techniques. Morgan Kaufmann, 2000.

[8] R. J. Hathaway, J. W. Davenport, and J. C. Bezdek. Relational duals of the c-means clustering algorithms. Pattern Recognition, 22:205–212, 1989.

[9] J. M. Jolion, P. Meer, and S. Bataouche. Robust clustering with applications in computer vision. IEEE Transactions on Pattern Analysis and Machine Intelligence, 13(8):791–802, 1991.

[10] J. MacQueen. Some methods for classification and analysis of multivariate observations. In Fifth Berkeley Symposium on Mathematical Statistics and Probability, 1967.

[11] B. Mobasher, R. Cooley, and J. Srivastava. Automatic personalization based on web usage mining. Communications of the ACM, 43(8):142–151, August 2000.

[12] O. Nasraoui, H. Frigui, A. Joshi, and R. Krishnapuram. Mining web access logs using relational competitive fuzzy clustering. In Eighth International Fuzzy Systems Association World Congress (IFSA 99), August 1999.

[13] O. Nasraoui and R. Krishnapuram. A novel approach to unsupervised robust clustering using genetic niching. In Proceedings of the Ninth IEEE International Conference on Fuzzy Systems, pages 170–175, 2000.

[14] F. Silvestri, R. Baraglia, and P. Palmerini. On-line generation of suggestions for web users. In Journal of Digital Information Management, pages 104–108, June 2004.

[15] J. Srivastava, R. Cooley, M. Deshpande, and P. N. Tan. Web usage mining: discovery and applications of usage patterns from web data. ACM SIGKDD Explorations Newsletter, 1(2), 2000.

[16] J. Srivastava, R. Cooley, M. Deshpande, and P.-N. Tan. Web usage mining: Discovery and applications of usage patterns from web data. SIGKDD Explorations, 1(2):12–23, 2000.

[17] Z. Su, Q. Yang, H. Zhang, X. Xu, and Y. Hu. Correlation-based document clustering using web logs. In HICSS, 2001.

[18] T. W. Yan, M. Jacobsen, H. García-Molina, and D. Umeshwar. From user access patterns to dynamic hypertext linking. In Fifth International World Wide Web Conference, May 1996.
