Nov, 2002 Banerjee and Ghosh 1

Characterizing Visitors to a Website Across Multiple Sessions

NGDM Workshop, Nov 2002

Arindam Banerjee, Joydeep Ghosh

Motivation

Why Characterize or Predict web user behavior?

• Site-centric view: Personalization, sticky websites

• User-centric view: personal agents for information acquisition

• Universalist approaches: PageRank, web metrics, …

Clustering Users from Web Logs

• Wide variety of web behavior: segment users based on surfing behavior as a first step to further analysis.

• User: set of sessions

• Session: sequence of (page ID, time spent on that page) tuples

– How to cluster sets of sequences?

The Approach

• Cluster Sessions

– Session Similarity Measure

– Session Similarity Graph

• Outlier Detection

– Graph Partitioning

• Create a Cluster Space

• Cluster users in this Space

A Similarity Measure for Sessions

1. Overlap between two sessions represented by the longest common subsequence (LCS)

2. Obtain session similarity using LCS and time information

session similarity = (time similarity in LCS) x (importance of LCS)

• The similarity component :

– Average min-max similarity for each page in the LCS

• The importance component :

– Average of the fraction of overall session time spent in the LCS

• Both components take values in [0, 1]
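Written out as a formula, the measure described on this slide could read as follows. The notation is an assumption introduced here for illustration and does not appear in the slides: $L$ is the LCS of sessions $S_1$ and $S_2$, $t_1(p)$ and $t_2(p)$ are the times spent at LCS position $p$ in each session, and $T_1$, $T_2$ are the total times of the two sessions.

```latex
\mathrm{sim}(S_1, S_2) =
\underbrace{\frac{1}{|L|} \sum_{p \in L}
  \frac{\min\bigl(t_1(p),\, t_2(p)\bigr)}{\max\bigl(t_1(p),\, t_2(p)\bigr)}}_{\text{similarity component}}
\;\times\;
\underbrace{\frac{1}{2} \left(
  \frac{\sum_{p \in L} t_1(p)}{T_1} + \frac{\sum_{p \in L} t_2(p)}{T_2}
\right)}_{\text{importance component}}
```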

Session Clustering

• Find the pairwise similarity values between all pairs of sessions; record only similarities above a threshold

• Incrementally construct similarity graph G

– the vertices are the sessions, the edge weights are the session similarity values

– no isolated vertices (discard “outliers”)

• Balanced Graph Partitioning

– we used METIS [Karypis, Kumar]
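A minimal sketch of the graph-construction step, under stated assumptions: the function names `build_similarity_graph` and `toy_similarity` are hypothetical, the toy similarity merely stands in for the session similarity measure, and the partitioning step itself (METIS in the slides) is not shown.

```python
from itertools import combinations


def build_similarity_graph(items, similarity, threshold):
    """Return (edges, vertices, outliers) for a thresholded similarity graph.

    edges maps an index pair (i, j), i < j, to its similarity value.
    Only similarities strictly above `threshold` are recorded.
    Items left without any edge are reported as outliers.
    """
    edges = {}
    for i, j in combinations(range(len(items)), 2):
        s = similarity(items[i], items[j])
        if s > threshold:
            edges[(i, j)] = s
    # A vertex survives only if at least one recorded edge touches it.
    connected = {v for pair in edges for v in pair}
    vertices = sorted(connected)
    outliers = [i for i in range(len(items)) if i not in connected]
    return edges, vertices, outliers


def toy_similarity(a, b):
    """Hypothetical stand-in similarity in (0, 1] for two numbers."""
    return 1.0 / (1.0 + abs(a - b))


edges, vertices, outliers = build_similarity_graph(
    [1.0, 1.1, 5.0], toy_similarity, threshold=0.5
)
print(vertices, outliers)  # [0, 1] [2]
```

The all-pairs loop is quadratic in the number of items, which is consistent with the remark in the conclusions that forming the similarity graph is slow.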

The Cluster Space

• Given: each session assigned to one of k clusters (sets)

• Sessions of a user are distributed among the k sets

– vector u = [u_1 u_2 … u_k]^T where u_i = number of sessions of the user belonging to cluster i

• Stage II : User Clustering

– find pairwise similarity values using the extended Jaccard measure

– partition similarity graph

• Gives l user clusters and a set of outlier users
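A short sketch of the cluster-space mapping and the user similarity, under stated assumptions. The slides name the extended Jaccard measure without spelling it out; the formula used here, x·y / (‖x‖² + ‖y‖² − x·y), is the usual definition and is assumed to be the one intended. The function names are hypothetical, and returning 0.0 for two all-zero vectors is a choice made here.

```python
from collections import Counter


def user_vector(session_labels, k):
    """Map a user's session cluster labels (0..k-1) to the count vector u."""
    counts = Counter(session_labels)
    return [counts.get(c, 0) for c in range(k)]


def extended_jaccard(x, y):
    """Extended Jaccard similarity: x.y / (|x|^2 + |y|^2 - x.y)."""
    dot = sum(a * b for a, b in zip(x, y))
    denom = sum(a * a for a in x) + sum(b * b for b in y) - dot
    # Assumption: two all-zero vectors are treated as having similarity 0.
    return dot / denom if denom else 0.0


# A user with four sessions falling in clusters 0, 2, 2 and 1 (k = 3).
u = user_vector([0, 2, 2, 1], k=3)
v = user_vector([2, 2], k=3)
print(u, v, extended_jaccard(u, v))
```

Unlike cosine similarity, this measure depends on vector length as well as direction, so two users with the same mix of session types but very different session counts are not treated as identical.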

The Dataset : Sulekha.com

Dataset details

• Logs over a one month period

• Raw log size 184 Mb

• 453,953 files accessed

• 37,753 sessions in all

• 23,310 sessions after some preprocessing/filtering

• 2,493 users

Results : Session Clusters

Cluster 1 – interest in coffeehouse, contests

- (/,12)(/movies,6)(/contests,178)

- (/contests,142)

- (/coffeehouse,5)(/contests,183)

- (/contests,172)

- (/,10)(/contests,143)

Cluster 2 – glance through home, articles

- (/,22)(/articles,22)

- (/,20)(/articles,20)

- (/,21)(/articles,21)

- (/,19)(/articles,19)

- (/,20)(/articles,19)

Cluster 3 – interest in author, articles

- (/,148)(/authors,6)(/articles,77)

- (/authors,290)(/articles,290)

- (/authors,295)(/articles,295)

- (/,33)(/authors,90)(/articles,475)

- (/,32)(/authors,91)(/articles,425)

Cluster 4 – read articles

- (/,39)(/articles,98)(/misc,17)(/articles,2649)

- (/,9)(/articles,2666)

- (/authors,26)(/articles,2561)

- (/misc,20)(/articles,77)(/misc,32)(/articles,43)(/authors,16)(/articles,2373.1)

Results : User Clusters

• user : [(128.194.xxx.xxx)]

– (/authors,3)(/articles,129)

– (/authors,8)(/articles,8)

– (/authors,80)(/articles,2141)

• user : [(209.30.xxx.xxx)]

– (/home,77)(/articles,111)(/authors,93)(/articles,629)(/misc,58)(/coffeehouse,75)(/wo-men,967)

– (/articles,2627)

• user : [(171.68.xxx.xxx)]

– (/home,323)(/articles,24)(/authors,45)(/articles,1290)

A user cluster : people who read the articles

Results : User Clusters

• user : [(152.170.xxx.xxx)]

– (/home,21)(/wo-men,1075)(/philosophy,52)

• user : [(209.244.xxx.xxx)]

– (/home,5)(/coffeehouse,94)(/wo-men,75)(/movies,75)(/wo-men,31)

– (/home,52)(/philosophy,67)(/wo-men,955)(/philosophy,26)(/coffeehouse,382)(/biztech,298)(/philosophy,290)

– (/home,17)(/coffeehouse,12)(/wo-men,15)(/personal,6)(/biztech,94)(/coffeehouse,2)(/philosophy,1093)

A user cluster : people interested in wo-men, philosophy, coffeehouse

Results : User Clusters

• user : [(216.154.xxx.xxx)]

– (/coffeehouse,12)(/biztech,25)(/books,48)

– (/coffeehouse,13)(/biztech,26)(/books,19)

• user : [(204.220.xxx.xxx)]

– (/coffeehouse,162)

– (/coffeehouse,40)

• user : [(32.100.xxx.xxx)]

– (/coffeehouse,12)(/contests,12)

– (/coffeehouse,43)(/contests,44)

A user cluster : people interested in coffeehouse – bookmarked it!

Result Visualization using CLUSION [Strehl & Ghosh 01]

[Figure: two panels, "Sessions" and "Users"]

Conclusions

• Segmentation: a basic pre-processing step for Web Mining

• Similarity measure + Cluster Space Concept: applicable to clustering of sets of any data-structure

• For certain websites, time spent on the pages matters

– not handled by current commercial tools

• Outlier detection before clustering is important

• Results QA-ed by human subjects

– Results for clusters & outliers at both levels were subjectively good

• No good way to find cluster quality analytically

• Formation of similarity graph is a slow process

Future Work

• Improve the present method by:

– using cluster seeds for cluster growing

– using alternative clustering algorithms for each stage

– studying the effect of thresholds, number of clusters on performance

– studying the importance of order of page-visits

– studying the importance of balanced clustering

Backup

Issues : Choice of Parameters

• Number of session clusters, k, should be chosen appropriately

• Thresholds for forming session & user similarity graphs :

– threshold value should be chosen after looking at the distribution of edge weights
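One way to look at the distribution of edge weights before fixing a threshold is a simple histogram plus the fraction of edges each candidate threshold would keep. This is only an illustrative sketch: the function names are hypothetical, the weights are made up, and it assumes all weights lie in [0, 1].

```python
def weight_histogram(weights, bins=10):
    """Count weights per equal-width bin over [0, 1]; 1.0 goes in the last bin."""
    counts = [0] * bins
    for w in weights:
        idx = min(int(w * bins), bins - 1)
        counts[idx] += 1
    return counts


def fraction_kept(weights, threshold):
    """Fraction of candidate edges whose weight is strictly above the threshold."""
    if not weights:
        return 0.0
    return sum(1 for w in weights if w > threshold) / len(weights)


# Made-up pairwise similarities, for illustration only.
weights = [0.05, 0.12, 0.5, 0.95, 1.0]
print(weight_histogram(weights))
for t in (0.1, 0.5, 0.9):
    print(t, fraction_kept(weights, t))
```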

Related Work

• Research in Web Mining :

– Extraction of navigational patterns : Spiliopoulou, Faulstich

– Ordering relationships : Mannila, Meek

– Surfing prediction : Pitkow, Pirolli

– Clustering web usage sessions : Fu, Sandhu, Shih

Example

• Sessions :

– Session1 = [(a,8) (b,100) (d,8) (c,5) (e,23) (a,5)]

– Session2 = [(b,5) (d,12) (f,1) (a,7) (c,5)]

• LCS pages = [(b)(d)(c)]

• Corresponding Index, Times Sequences :

– Index1 = [(1)(2)(3)], Time1 = [(100) (8) (5)]

– Index2 = [(0)(1)(4)], Time2 = [ (5) (12) (5)]

• Similarity over each LCS page : min / max of the two times

– Similarity on page b = 5/100 = 0.05

– Similarity on page d = 8/12 = 0.67

– Similarity on page c = 5/5 = 1.00

Example (contd.)

• The similarity component = (0.05 + 0.67 + 1.00)/3 = 0.57

• The importance component :

– Fraction of time spent in the LCS by Session1 = 113/149 = 0.76

– Fraction of time spent in the LCS by Session2 = 22/30 = 0.73

– The mean = (0.76+0.73)/2 = 0.75

• The overall similarity = 0.57 x 0.75 = 0.43
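The worked example can be reproduced with a short sketch. This is an illustration under stated assumptions, not the authors' implementation: the function names are hypothetical, times are assumed to be positive, and the LCS tie-breaking rule is chosen here so that the sketch picks the same LCS as the slide. The LCS is not unique for these two sessions: (b, d, a) also has length 3 and would give a different similarity value.

```python
def lcs_indices(pages1, pages2):
    """Return index lists (idx1, idx2) of one longest common subsequence."""
    n, m = len(pages1), len(pages2)
    # length[i][j] = LCS length of the suffixes pages1[i:] and pages2[j:].
    length = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n - 1, -1, -1):
        for j in range(m - 1, -1, -1):
            if pages1[i] == pages2[j]:
                length[i][j] = 1 + length[i + 1][j + 1]
            else:
                length[i][j] = max(length[i + 1][j], length[i][j + 1])
    idx1, idx2 = [], []
    i = j = 0
    while i < n and j < m:
        if pages1[i] == pages2[j]:
            idx1.append(i)
            idx2.append(j)
            i += 1
            j += 1
        elif length[i + 1][j] > length[i][j + 1]:
            i += 1
        else:
            # Assumption: on ties, advance in the second session.
            j += 1
    return idx1, idx2


def session_similarity(s1, s2):
    """(average min/max time ratio over the LCS) x (mean fraction of time in the LCS)."""
    idx1, idx2 = lcs_indices([p for p, _ in s1], [p for p, _ in s2])
    if not idx1:
        return 0.0
    t1 = [s1[i][1] for i in idx1]
    t2 = [s2[j][1] for j in idx2]
    time_sim = sum(min(a, b) / max(a, b) for a, b in zip(t1, t2)) / len(t1)
    total1 = sum(t for _, t in s1)
    total2 = sum(t for _, t in s2)
    importance = (sum(t1) / total1 + sum(t2) / total2) / 2
    return time_sim * importance


session1 = [("a", 8), ("b", 100), ("d", 8), ("c", 5), ("e", 23), ("a", 5)]
session2 = [("b", 5), ("d", 12), ("f", 1), ("a", 7), ("c", 5)]
print(lcs_indices([p for p, _ in session1], [p for p, _ in session2]))  # ([1, 2, 3], [0, 1, 4])
print(round(session_similarity(session1, session2), 2))  # 0.43
```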

Issues : Session Resolution

• Generate coarse resolution paths making use of the concept hierarchy of the website

• Reduces computations; Increases interpretability of results

Original Path → Concept-level Path

Path 1, original: (/authors/ramesh_mahadevan.html,3)(/articles/rm_phattas.html,75)(/articles/rm_desidads.html,39)

Path 1, concept-level: (/authors,3)(/articles,114)

Path 2, original: (/authors/arun_sampath.html,109)(/philosophy/messages/1951.html,102)(/philosophy/messages/1953.html,46)(/,3)(/philosophy/messages/1954.html,69)

Path 2, concept-level: (/authors,109)(/philosophy,148)(/,3)(/philosophy,69)
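The two example paths are consistent with a simple rule: map each page to its top-level directory and merge consecutive visits to the same concept by summing their times. The sketch below implements that rule as an assumption; the slides do not say how the site's concept hierarchy is actually defined, and the function names are hypothetical.

```python
def concept_of(path):
    """Map a page path to its top-level directory; pages directly under the root map to "/"."""
    segments = path.split("/")
    directories = segments[1:-1]  # drop the leading "" and the file name
    return "/" + directories[0] if directories else "/"


def to_concept_path(session):
    """Collapse a session of (path, time) tuples to concept level, merging consecutive repeats."""
    result = []
    for path, time_spent in session:
        concept = concept_of(path)
        if result and result[-1][0] == concept:
            result[-1] = (concept, result[-1][1] + time_spent)
        else:
            result.append((concept, time_spent))
    return result


original = [
    ("/authors/arun_sampath.html", 109),
    ("/philosophy/messages/1951.html", 102),
    ("/philosophy/messages/1953.html", 46),
    ("/", 3),
    ("/philosophy/messages/1954.html", 69),
]
print(to_concept_path(original))
# [('/authors', 109), ('/philosophy', 148), ('/', 3), ('/philosophy', 69)]
```

Only consecutive visits are merged, so the order of concepts is preserved and the later return to /philosophy stays a separate entry, as in the slide.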

Comments

• Results QA-ed by human subjects

– Results for clusters & outliers at both levels were subjectively good

– No good way to find cluster quality analytically

• Clustering algorithms for the two stages

– Stage I : Graph partitioning works well for large sparse graphs, so it is desirable in this stage

– Stage II : Since the space is not high-dimensional, any reasonable clustering algorithm should be adequate

• Cluster space

– Gives a general framework for mapping any non-vector clustering problem to an equivalent vector clustering problem