Benefits of InterSite Pre-Processing and Clustering Methods in E-Commerce Domain

Sergiu Chelcea, Alzennyr Da Silva, Yves Lechevallier, Doru Tanasa, Brigitte Trousse

AxIS Research Team, INRIA Sophia Antipolis and Rocquencourt


Motivations

To show, on the clickstream dataset proposed for the ECML/PKDD 2005 Discovery Challenge:

- the benefits of our InterSite pre-processing method, proposed by Tanasa in his PhD thesis (2005)
- the benefits, on Web logs, of a new crossed clustering method developed by Lechevallier & Verde and published in 2003 and 2004

2 main viewpoints: the user and the web site load


Plan

1. Intersite Data Pre-Processing
- introduction of the user's intersite visit: « Group of SessionIDs »
- first statistical intersite analysis

2. Crossed Clustering Approach
- confusion table with classes of time periods and classes of product types
- analysis on the most used shop: shop 4

3. Conclusions


Data pre-processing

Initial data:

Table 1. Format of page requests

ShopID  Date        IP address      SessionID             Page        Referrer
11      1074585663  213.151.91.186  939dad92c4…84208dca   /
11      1074585670  213.151.91.186  87ee02ddcff…7655bb9e  /ct/?c=148  http://www.shop2.cz

Table 2. Number of requests per shop

ShopID  Site name (shop)  #Requests
10      www.shop1.cz        509,688
11      www.shop2.cz        400,045
12      www.shop3.cz        645,724
14      www.shop4.cz      1,290,870
15      www.shop5.cz        308,367
16      www.shop6.cz        298,030
17      www.shop7.cz        164,447


Data pre-processing

Tanasa & Trousse (IEEE Intelligent Systems, 2004); Tanasa's thesis (2005)


Data pre-processing

Table 3. Transformed log lines

Datetime             IP              SessionID             URL                            Referrer
2004-01-20 09:01:03  213.151.91.186  939dad92c4…84208dca   http://www.shop2.cz/           -
2004-01-20 09:01:10  213.151.91.186  87ee02ddcff…7655bb9e  http://www.shop2.cz/ct/?c=148  http://www.shop2.cz/
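The step from the raw request format (Table 1) to the transformed log lines (Table 3) can be sketched as follows. This is only an illustration, not the authors' tool: the function name `transform`, the `SHOP_HOSTS` mapping (copied from Table 2), the UTC+1 offset (inferred from comparing the timestamps in both tables) and the referrer normalisation rules are all assumptions.

```python
from datetime import datetime, timedelta, timezone

# Assumption: ShopID -> host name, copied from Table 2 (two entries shown).
SHOP_HOSTS = {10: "www.shop1.cz", 11: "www.shop2.cz"}

# Assumption: log times are local time at UTC+1, inferred from Tables 1 and 3.
LOCAL_TZ = timezone(timedelta(hours=1))


def transform(shop_id, timestamp, ip, session_id, page, referrer=""):
    """Turn one raw request (Table 1 layout) into a Table 3 style line."""
    when = datetime.fromtimestamp(timestamp, tz=LOCAL_TZ)
    url = "http://" + SHOP_HOSTS[shop_id] + page
    # Hypothetical normalisation: an empty referrer becomes "-",
    # and a bare "http://host" referrer gets a trailing "/".
    if not referrer:
        referrer = "-"
    elif referrer.count("/") == 2:
        referrer += "/"
    return (when.strftime("%Y-%m-%d %H:%M:%S"), ip, session_id, url, referrer)


row = transform(11, 1074585663, "213.151.91.186", "939dad92c4…84208dca", "/")
print(row[0], row[3], row[4])  # 2004-01-20 09:01:03 http://www.shop2.cz/ -
```

Building full URLs at this stage is what later makes it possible to compare a Referrer with the URLs previously accessed on another shop.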

• Data structuration
- SessionID = a single visit on each shop
- Towards the notion of the user's intersite visit: we group the SessionIDs that belong to a single user (same IP) into a « Group of SessionIDs ». To do so, we compare the Referrer with the URLs previously accessed (within a reasonable time window)
- 522,410 SessionIDs grouped into 397,629 Groups, equivalent to a 23.88% reduction

• Data fusion, data cleaning
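The grouping of SessionIDs into a « Group of SessionIDs » could be sketched as below. This is a minimal illustration under stated assumptions, not the method as implemented in Tanasa's thesis: the function name `group_sessions`, the tuple layout, the union-find bookkeeping and the 30-minute window are hypothetical (the slides only speak of "a reasonable time window").

```python
from datetime import datetime, timedelta

# Assumption: placeholder value for the "reasonable time window".
WINDOW = timedelta(minutes=30)


def group_sessions(requests, window=WINDOW):
    """Map each SessionID to the representative SessionID of its group.

    `requests` is an iterable of (datetime, ip, session_id, url, referrer)
    tuples sorted by time. Two SessionIDs are merged when they share an IP
    and one of them has, as referrer, a URL that the other accessed at most
    `window` earlier.
    """
    parent = {}

    def find(s):
        parent.setdefault(s, s)
        while parent[s] != s:
            parent[s] = parent[parent[s]]  # path halving
            s = parent[s]
        return s

    recent = {}  # (ip, url) -> (time, session_id) of the latest access
    for when, ip, sid, url, referrer in requests:
        find(sid)  # register the session even if it is never merged
        hit = recent.get((ip, referrer))
        if hit and hit[1] != sid and when - hit[0] <= window:
            parent[find(sid)] = find(hit[1])
        recent[(ip, url)] = (when, sid)
    return {s: find(s) for s in list(parent)}


t0 = datetime(2004, 1, 20, 9, 1, 3)
demo = [
    (t0, "213.151.91.186", "A", "http://www.shop2.cz/", "-"),
    (t0 + timedelta(seconds=7), "213.151.91.186", "B",
     "http://www.shop1.cz/", "http://www.shop2.cz/"),
    (t0 + timedelta(seconds=9), "10.0.0.1", "C",
     "http://www.shop1.cz/", "http://www.shop2.cz/"),
]
groups = group_sessions(demo)
print(len(set(groups.values())))  # 2 groups: {A, B} and {C}
```

Session C is not merged because it comes from a different IP, even though its referrer matches a URL accessed by session A.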


Data summarisation

Relational DB model


Data pre-processing

[Figure: two charts with one series per weekday (Monday to Sunday). x-axis: Hour (0 to 23); y-axis: (a) Visits, 0 to 6000; (b) Groups, 0 to 300.]

Fig. 1. Visits per days and hours: (a) globally, (b) multi-shop

• Low number of new visits on Saturdays and Sundays around lunch time
• High number of new visits on Tuesdays and Wednesdays
• Same results for (a) and (b)


Crossed Clustering Approach for Time Periods/Product Analysis

Data: selection of the ls pages in shop 4 (the most used)

Method developed by Yves Lechevallier & Rosanna Verde (2003, 2004)

[Figure: accesses per shop, by page type. x-axis: Shop (10, 11, 12, 14, 15, 16, 17); y-axis: Access, 0 to 1 400 000; series: /ct, /ls, /dt, /znacka, /akce, others.]


Crossed Clustering Approach for Time Periods/Product Analysis

Relational DB model: a crossed table is easily added

Row: an individual (weekday, hour); 7 days x 24 hours = 168 individuals

Column: a multi-categorical variable representing the number of products requested by users in the specific time slice

Method developed by Yves Lechevallier & Rosanna Verde (2003, 2004)
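Building such a crossed table from timestamped product requests might look like the sketch below. It only illustrates the data layout (168 weekday x hour individuals, each described by product counts), not the crossed clustering method itself; the function name `crossed_table` and the simplified (datetime, product type) input are assumptions.

```python
from collections import Counter
from datetime import datetime

DAYS = ["Monday", "Tuesday", "Wednesday", "Thursday",
        "Friday", "Saturday", "Sunday"]


def crossed_table(requests):
    """Count product requests per (weekday, hour) individual.

    `requests` is an iterable of (datetime, product_type) pairs, a
    hypothetical simplification of the ls page requests on shop 4.
    """
    # Pre-create all 7 x 24 = 168 rows so empty time slices still appear.
    table = {f"{day}_{hour}": Counter() for day in DAYS for hour in range(24)}
    for when, product in requests:
        # datetime.weekday() returns 0 for Monday.
        table[f"{DAYS[when.weekday()]}_{when.hour}"][product] += 1
    return table


demo = [
    (datetime(2004, 1, 20, 9, 1), "Built-in hoods"),
    (datetime(2004, 1, 20, 9, 30), "Built-in hoods"),
    (datetime(2004, 1, 20, 9, 45), "Corner single sinks"),
]
table = crossed_table(demo)
print(len(table))                 # 168
print(dict(table["Tuesday_9"]))   # {'Built-in hoods': 2, 'Corner single sinks': 1}
```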


Crossed Clustering Approach for Time Periods/Product Analysis

Table 4. Quantity of products requested by weekday x hour and registered on shop 4

Weekday x Hour  Product (number of requests)
Monday_0        Built-in electric hobs (10), Built-in dish washers 60cm (64), Corner single sinks (50), ...
Monday_1        Free standing combi refrigerators (44), Corner single sinks (50), Built-in hoods (60), ...
…               …
Sunday_22       Built-in microwave ovens (27), Built-in dish washers 45cm (38), Built-in dish washers 60cm (85), ...
Sunday_23       Built-in freezers (56), Kitchen taps with shower (45), Garbage disposers (32), ...


Crossed Clustering Approach for Time Period/Product Analysis

Table 5. Confusion table

          Product_1  Product_2  Product_3  Product_4  Product_5   Total
Period_1       2847       5084       3284       2265       2471   15951
Period_2      11305      31492      12951       1895       9610   67253
Period_3      33107      55652      36699       5345      20370  151173
Period_4      22682      46322      30200       5165      27659  132028
Period_5       9576      20477      19721       2339       7551   59664
Period_6       1783       3515       2549        392      11240   19479
Period_7      15019      14297       8608       1397       6014   45335
Total         96319     176839     114012      18798      84915  490883

Highlighted cell: Period_6 x Product_5 = 11240, i.e. 57.7% of the Period_6 total (19479)


Crossed Clustering Approach for Time Period/Product Analysis

Example of one surprising result:

- the class Product_5 is defined by a single type of product: « Free standing combi refrigerators »
- this product type is consulted predominantly on Fridays from 17:00 to 20:00 (class Period_6)
- 57.7% of the requests in this period concern this product type
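The 57.7% figure can be checked against the confusion table as a row share: the Period_6 x Product_5 cell divided by the Period_6 total. The snippet below is a small verification sketch; the values are copied from Table 5, while the names `TABLE5` and `row_shares` are illustrative.

```python
# Rows: Period_1..Period_7; columns: Product_1..Product_5 (from Table 5).
TABLE5 = [
    [2847, 5084, 3284, 2265, 2471],
    [11305, 31492, 12951, 1895, 9610],
    [33107, 55652, 36699, 5345, 20370],
    [22682, 46322, 30200, 5165, 27659],
    [9576, 20477, 19721, 2339, 7551],
    [1783, 3515, 2549, 392, 11240],
    [15019, 14297, 8608, 1397, 6014],
]


def row_shares(table):
    """Return, for each row, the fraction of the row total in each column."""
    return [[cell / sum(row) for cell in row] for row in table]


shares = row_shares(TABLE5)
print(f"{shares[5][4]:.1%}")  # Period_6, Product_5 -> 57.7%
```

Read the other way round, Period_6 accounts for only about 13.2% of all Product_5 requests (11240 of 84915), so the 57.7% describes the composition of the time period rather than of the product class.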


Conclusions

1. Intersite Data Pre-Processing
- structuration into the user's intersite visits: « Group of SessionIDs »
- first statistical intersite analysis
- anomalies and recommendations for the dataset

2. Crossed Clustering Approach
- first application of such a method to time periods of Web logs, and in the e-commerce domain
- promising results


Data pre-processing

Inconsistency problems:

- table kategorie: repeated entries, and different entries with the same ID

- for some page types (dt, df), the given parameter actually represented a specific product, not the product description (from the products table)

- extra parameters equivalent to the given ones for some page types: e.g. for the ct page type, id is equivalent to the given c parameter

- missing values (descriptions) in tables: 3 values in the product table and 64 in the category table

- multiple-site SessionIDs: 13 cross-server visits had the same SessionID on the visited sites (up to 4 sites); the SessionID should change on each new site

- multiple-IP SessionIDs: 3690 visits (SessionIDs) were made from more than one IP (anonymization proxies?)
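The last two checks (SessionIDs seen on several sites, SessionIDs seen from several IPs) can be expressed in a few lines. The sketch below assumes the requests are available as (ShopID, IP, SessionID) triples; the function name `session_anomalies` and the sample data are hypothetical and do not reproduce the counts reported above.

```python
from collections import defaultdict


def session_anomalies(requests):
    """Return (multi_shop, multi_ip) sets of SessionIDs.

    `requests` is an iterable of (shop_id, ip, session_id) triples.
    """
    shops, ips = defaultdict(set), defaultdict(set)
    for shop_id, ip, sid in requests:
        shops[sid].add(shop_id)
        ips[sid].add(ip)
    multi_shop = {s for s, seen in shops.items() if len(seen) > 1}
    multi_ip = {s for s, seen in ips.items() if len(seen) > 1}
    return multi_shop, multi_ip


demo = [
    (11, "213.151.91.186", "s1"),
    (14, "213.151.91.186", "s1"),  # same SessionID on a second shop
    (10, "10.0.0.1", "s2"),
    (10, "10.0.0.2", "s2"),        # same SessionID from a second IP
    (12, "10.0.0.3", "s3"),
]
multi_shop, multi_ip = session_anomalies(demo)
print(multi_shop, multi_ip)  # {'s1'} {'s2'}
```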