Classification of E-commerce Websites by
Product Categories
Case Study
Moiseev George, Higher School of Economics
Faculty of Computer Science
Higher School of Economics, Moscow, 2016
www.hse.ru
Outline
• Introduction
• Preprocessing
• Feature extraction
• Classification and evaluation
• Experimental results
Problem Statement
• Retrieve e-commerce websites (e-shops)
• Classify e-shops by sold product type
*Customer-to-customer websites are not counted as e-commerce shops.
Applications
• Market research
• Statistics gathering
• Organizing a knowledge base
• Goods search
Dataset
The dataset was provided by datainsight.ru.
It contains two training subsets labeled by experts:
1. 1312 e-commerce and 1077 non-e-commerce websites
2. 1448 e-shops from 15 product categories
Preprocessing
Downloading a website:
• Start from the main page
• Follow every internal hyperlink on a page that has not been downloaded yet
• Check whether an identical page was already downloaded via another hyperlink, and skip it if so
What information should be saved from pages other than the main page? Three options are compared:
1. Nothing
2. Only metadata
3. Everything
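The download procedure above can be sketched as a breadth-first crawl that stays on one host and skips pages whose content was already seen. This is a minimal sketch, not the crawler used in the study: the `fetch` callable, the regular-expression link extraction, the SHA-256 content hash and the toy `SITE` dictionary are all assumptions made for illustration.

```python
import hashlib
import re
from collections import deque
from urllib.parse import urljoin, urlparse

# Simplistic link extraction, for illustration only (assumption).
HREF_RE = re.compile(r'href="([^"]+)"')


def crawl(start_url, fetch):
    """Breadth-first crawl of a single website.

    `fetch` is a hypothetical callable mapping a URL to its HTML, or None.
    Returns {url: html} for pages with distinct content.
    """
    host = urlparse(start_url).netloc
    queue = deque([start_url])
    seen_urls = {start_url}
    seen_hashes = set()
    pages = {}
    while queue:
        url = queue.popleft()
        html = fetch(url)
        if html is None:
            continue
        digest = hashlib.sha256(html.encode("utf-8")).hexdigest()
        if digest in seen_hashes:
            # Same content already reached through another hyperlink.
            continue
        seen_hashes.add(digest)
        pages[url] = html
        for href in HREF_RE.findall(html):
            link = urljoin(url, href)
            # Only internal links that were not queued before.
            if urlparse(link).netloc == host and link not in seen_urls:
                seen_urls.add(link)
                queue.append(link)
    return pages


# Toy in-memory "website" standing in for real HTTP requests (assumption).
SITE = {
    "http://shop.example/": '<a href="/a">A</a> <a href="/b">B</a> '
                            '<a href="http://other.example/">x</a>',
    "http://shop.example/a": '<a href="/">home</a> product A',
    "http://shop.example/b": '<a href="/">home</a> product A',
}

pages = crawl("http://shop.example/", SITE.get)
```

In this toy site, `/b` has the same content as `/a`, so only one of them is stored, and the external link is never followed.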
Preprocessing
Each webpage will be stored in two versions
• Raw page:
– Remove only javascript and obvious advertisements
• Cleaned page:
– Extract only content of markup tags
– Tokenization – retrieving sentences and words
– Stemming – reducing words to their root or base form
– Lowercase conversion
– Filter out stopwords
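The cleaning steps for the second version of a page could look roughly like the sketch below. It is only an illustration under stated assumptions: the stopword list is a tiny placeholder, `naive_stem` is a toy suffix stripper standing in for a real stemmer, and the tokenizer keeps ASCII letters only.

```python
import re
from html.parser import HTMLParser

# Tiny illustrative stopword list (assumption, not the list used in the study).
STOPWORDS = {"the", "a", "an", "and", "of", "in", "for", "to", "is"}


class TextExtractor(HTMLParser):
    """Collects text content, skipping the bodies of <script> and <style>."""

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        if not self._skip:
            self.parts.append(data)


def naive_stem(word):
    """Toy suffix stripper; a real pipeline would use a proper stemmer."""
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def clean_page(html):
    """Extract tag content, lowercase, tokenize, drop stopwords, stem."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    text = " ".join(parser.parts).lower()
    tokens = re.findall(r"[a-z]+", text)
    return [naive_stem(t) for t in tokens if t not in STOPWORDS]


sample = ("<html><script>var x = 1;</script><h1>Buying Shoes</h1>"
          "<p>The best boots for walking</p></html>")
cleaned = clean_page(sample)
```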
Feature Extraction
There are many methods and models for automatic text feature
extraction:
• Bag of words
• n-grams
• word2vec
• TF-IDF (shown in the picture)
• Mutual information
• Chi-square
• …
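As a small illustration of the first two items in the list, the sketch below counts unigrams and bigrams over a token list. The function names and the sample tokens are assumptions chosen for the example.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return all contiguous n-grams of a token list as space-joined strings."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def bag_of_ngrams(tokens, max_n=2):
    """Count every n-gram for n = 1..max_n (n = 1 is the plain bag of words)."""
    counts = Counter()
    for n in range(1, max_n + 1):
        counts.update(ngrams(tokens, n))
    return counts


tokens = ["buy", "shoe", "buy", "boot"]
features = bag_of_ngrams(tokens)
```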
Feature extraction
Proposed approach:
The term weight of the i-th term in the k-th website is derived from TF-IDF as follows:

$$W_{ik} = \frac{tf_{ik} \log\frac{N}{n_i}}{\sqrt{\sum_{j=1}^{N} \left(tf_{ij} \log\frac{N}{n_j}\right)^2}}$$

where n_i is the number of websites in which the i-th term appears, N is the total number of websites in the sample, and tf_{ik} is computed as:

$$tf_{ik} = \sum_{t=1}^{T} w(t)\, f(i, k, t)$$

where w(t) is a weight inversely proportional to the frequency of tag t, and f(i, k, t) is the frequency of the i-th term inside the t-th tag of the k-th website.
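A minimal sketch of tag-weighted TF-IDF is given below. Several points are assumptions rather than details from the slides: each website is represented as a mapping from tag name to the tokens found inside that tag, the tag weights are arbitrary example values, and the denominator is read as the Euclidean norm over the terms of one website (ordinary cosine normalization).

```python
import math
from collections import defaultdict


def tag_weighted_tf(site, tag_weights):
    """site: {tag: [tokens inside that tag]}; returns {term: tf_ik}."""
    tf = defaultdict(float)
    for tag, tokens in site.items():
        w = tag_weights.get(tag, 1.0)
        for term in tokens:
            # Adding w(t) once per occurrence equals w(t) * f(i, k, t).
            tf[term] += w
    return dict(tf)


def tfidf_vectors(sites, tag_weights):
    """Return one normalized {term: W_ik} vector per website."""
    tfs = [tag_weighted_tf(s, tag_weights) for s in sites]
    total = len(tfs)                      # N: number of websites
    doc_freq = defaultdict(int)           # n_i: websites containing term i
    for tf in tfs:
        for term in tf:
            doc_freq[term] += 1
    vectors = []
    for tf in tfs:
        raw = {t: v * math.log(total / doc_freq[t]) for t, v in tf.items()}
        # Assumed normalization: Euclidean norm over this website's terms.
        norm = math.sqrt(sum(x * x for x in raw.values()))
        vectors.append({t: (x / norm if norm else 0.0) for t, x in raw.items()})
    return vectors


# Hypothetical tag weights and toy websites.
tag_weights = {"title": 3.0, "p": 1.0}
sites = [
    {"title": ["shoe"], "p": ["shoe", "shop", "cheap"]},
    {"title": ["phone"], "p": ["phone", "shop"]},
]
vectors = tfidf_vectors(sites, tag_weights)
```

Here "shop" appears on both toy websites, so its inverse document frequency and therefore its weight is zero, while "shoe" is boosted by its occurrence in the title.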
Classification and evaluation
• A Support Vector Machine is used as the classifier.
• Multiclass classification is performed in a one-vs-all fashion.
• Precision, recall and F-score are used for evaluation.
• Overall performance of the product type classification is measured by the average F-score across all categories.
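The evaluation metrics can be illustrated with the sketch below, which treats each category in turn as the positive class (matching the one-vs-all view) and then averages the per-category F-scores. The labels and predictions are made-up example data, and the classifier itself is not shown.

```python
def per_class_scores(y_true, y_pred):
    """Return {category: (precision, recall, f_score)}, one-vs-all per category."""
    scores = {}
    pairs = list(zip(y_true, y_pred))
    for c in sorted(set(y_true) | set(y_pred)):
        tp = sum(1 for t, p in pairs if t == c and p == c)
        fp = sum(1 for t, p in pairs if t != c and p == c)
        fn = sum(1 for t, p in pairs if t == c and p != c)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        denom = precision + recall
        f_score = 2 * precision * recall / denom if denom else 0.0
        scores[c] = (precision, recall, f_score)
    return scores


def average_f_score(y_true, y_pred):
    """Unweighted mean of the per-category F-scores."""
    scores = per_class_scores(y_true, y_pred)
    return sum(f for _, _, f in scores.values()) / len(scores)


# Hypothetical expert labels and classifier predictions.
y_true = ["shoes", "shoes", "phones", "phones", "books"]
y_pred = ["shoes", "phones", "phones", "phones", "books"]
scores = per_class_scores(y_true, y_pred)
avg_f = average_f_score(y_true, y_pred)
```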
Results
F-score of the e-commerce class in binary classification:

Used website information                      Pure TF-IDF   TF-IDF with tag weighting
Only main page                                0.85          0.89
Main page + meta and title from other pages   0.89          0.94
Main page + whole other pages                 0.86          0.92
Results
Average F-score of e-commerce categorization by sold product type:

Used website information                      Pure TF-IDF   TF-IDF with tag weighting
Only main page                                0.67          0.72
Main page + meta and title from other pages   0.74          0.79
Main page + whole other pages                 0.73          0.81
References
1. A. Rahmani and S. Meshkizadeh, "Webpage Classification based on Compound of Using HTML Features & URL Features and Features of Sibling Pages", International Journal of Advancements in Computing Technology, vol. 2, no. 4, pp. 36-46, 2010.
2. A. Aizawa, "An information-theoretic perspective of tf-idf measures", Information Processing & Management, vol. 39, no. 1, pp. 45-65, 2003.
3. D. Powers, "Evaluation: From Precision, Recall and F-Measure to ROC, Informedness, Markedness & Correlation", Journal of Machine Learning Technologies, vol. 1, no. 2, pp. 37-63, 2011.
4. C. Cortes and V. Vapnik, "Support-vector networks", Machine Learning, vol. 20, no. 3, pp. 273-297, 1995.
5. R. Ghani, S. Slattery and Y. Yang, "Hypertext categorization using hyperlink patterns and meta data", Proceedings of the Eighteenth International Conference on Machine Learning (ICML '01), pp. 178-185, 2001.
Moiseev George
gvmoiseev@edu.hse.ru