Confidential. Copyright 2015 Fujitsu R&D Center Co., LTD
FRDC’s approach at PAKDD’s Data Mining Competition
Ruiyu Fang, Qingliang Miao, Cuiqin Hou, Yao Meng, Lu Fang, Dajun Chen
Fujitsu Research and Development Center, Beijing, China
Background
Gender prediction: the task in this competition is to predict a user’s gender from product viewing logs.
Our solution:
Use the product viewing information within a single session.
Use information across different sessions by exploring their potential associations.
We adopt a two-step strategy for gender prediction, consisting of “gender classification” followed by the “continuous session alignment model”.
Features for gender classification[1]
Product and Category features
We treat the viewed products and product categories in each session as words in a document and apply the “bag of words” model.
Session:
u10171 2014/12/20 20:31 2014/12/20 20:31 A00001/B00001/C00075/D33237/;A00001/B00001/C00075/D34328
Features:
A00001, A00001/B00001, A00001/B00001/C00075, A00001/B00001/C00075/D33237, A00001/B00001/C00075/D34328
Product and Category features with timestamp
The timestamp is taken from the start time of each session (year, month and day only).
Sessions:
u10171 2014/12/20 20:31 2014/12/20 20:31 A00001/B00001/C00075/D33237/;A00001/B00001/C00075/D34328/ male
u10174 2014/11/14 0:37 2014/11/14 0:37 A00001/B00001/C00019/D00044/ male
Features:
2014/12/20/A00001, 2014/12/20/A00001/B00001, 2014/12/20/A00001/B00001/C00075, 2014/12/20/A00001/B00001/C00075/D33237;
2014/11/14/A00001, 2014/11/14/A00001/B00001, 2014/11/14/A00001/B00001/C00019, 2014/11/14/A00001/B00001/C00019/D00044;
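The hierarchical tokens above can be generated mechanically from a session's product list. The following is a minimal sketch, not the authors' code: the function names `hierarchy_prefixes` and `session_features` are hypothetical, and dropping duplicate tokens (rather than counting them) is an assumption based on how the slide lists each token once.

```python
def hierarchy_prefixes(path):
    """Return every cumulative prefix of one 'A/B/C/D' product path."""
    parts = [p for p in path.strip().split("/") if p]
    return ["/".join(parts[:i]) for i in range(1, len(parts) + 1)]


def session_features(products, date=None):
    """Bag-of-words tokens for one session, optionally prefixed with a date.

    `products` is the raw ';'-separated product field of a log line.
    Duplicates are dropped while keeping first-seen order (an assumption).
    """
    tokens = []
    for path in products.split(";"):
        for prefix in hierarchy_prefixes(path):
            token = f"{date}/{prefix}" if date else prefix
            if token not in tokens:
                tokens.append(token)
    return tokens


products = "A00001/B00001/C00075/D33237/;A00001/B00001/C00075/D34328"
print(session_features(products))                      # plain features
print(session_features(products, date="2014/12/20"))   # timestamped features
```

Passing `date` turns the same routine into the timestamped variant, so both feature families share one tokenizer.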
Features for gender classification[2]
Same level product and category features with timestamp
Different products target different customers, so individual customers naturally hold a few fixed preferences, such as particular products and categories. We therefore group the IDs viewed at the same hierarchy level within a session.
Session:
u10204 2014/12/19 14:40 2014/12/19 14:42 A00003/B00036/C00190/D33072/;A00003/B00036/C00175/D33078/ female
Features:
2014/12/19/A00003, 2014/12/19/B00036, 2014/12/19/{C00175,C00190}, 2014/12/19/{D33072,D33078}
Product ID prefix with timestamp
Many products in the training data share the same product ID prefix. In the session below, both products share the prefix “D3307”.
The prefix length is set to 4 (the four digits after the leading “D”).
Session:
u10204 2014/12/19 14:40 2014/12/19 14:42 A00003/B00036/C00190/D33072/;A00003/B00036/C00175/D33078 female
Feature:
2014/12/19/D3307
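Both feature families on this slide can be sketched in a few lines. This is an illustration under stated assumptions, not the authors' implementation: the function names are hypothetical, the hierarchy level is inferred from the leading letter of each ID, and a prefix is taken to be the letter “D” plus `prefix_digits` digits.

```python
def same_level_features(products, date):
    """Group the IDs seen at each hierarchy level (A/B/C/D) within a session."""
    levels = {}
    for path in products.split(";"):
        for item in (p for p in path.strip().split("/") if p):
            levels.setdefault(item[0], set()).add(item)  # level = first letter
    tokens = []
    for level in sorted(levels):
        ids = sorted(levels[level])
        body = ids[0] if len(ids) == 1 else "{" + ",".join(ids) + "}"
        tokens.append(f"{date}/{body}")
    return tokens


def product_prefix_features(products, date, prefix_digits=4):
    """Date-stamped prefixes of product (D-level) IDs."""
    prefixes = set()
    for path in products.split(";"):
        for item in path.strip().split("/"):
            if item.startswith("D"):
                prefixes.add(item[: 1 + prefix_digits])  # 'D' + first digits
    return [f"{date}/{p}" for p in sorted(prefixes)]


products = "A00003/B00036/C00190/D33072/;A00003/B00036/C00175/D33078/"
print(same_level_features(products, "2014/12/19"))
print(product_prefix_features(products, "2014/12/19"))
```

Sorting the IDs inside each braces group makes the token independent of viewing order, so two sessions that view the same set of categories produce the same feature.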
Features for gender classification[3]
Transferring features of sequential products
The transitions between sequentially viewed products may reflect the click habits of users of different genders.
Session:
u10204 2014/12/19 14:40 2014/12/19 14:42 A00003/B00036/C00190/D33072/;A00003/B00036/C00175/D33078/ female
Feature:
2014/12/19/D33072/C00175
Counts of different kinds of features:
Features   PF       PFT      SLT      PIP     TFT      Total
Count      22,464   35,828   11,403   5,157   17,231   92,083
Table 1. Counts of different kinds of features. PF denotes product and category features, PFT is PF with timestamp, SLT denotes same level features with timestamp, PIP denotes product ID prefix features, and TFT denotes transferring features.
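The transferring feature (TFT) can be sketched as follows. This reads the slide's single example as "date / previously viewed product / category of the next viewed product"; that interpretation, the function names, and the use of the leading letter to locate the product (D) and category (C) levels are all assumptions.

```python
def _level(path_items, letter):
    """Return the first ID in a path that starts with `letter`, or None."""
    return next((item for item in path_items if item.startswith(letter)), None)


def transfer_features(products, date):
    """One token per consecutive pair of viewed products in a session."""
    paths = [[p for p in path.strip().split("/") if p]
             for path in products.split(";")]
    paths = [p for p in paths if p]  # drop empty entries
    tokens = []
    for prev, nxt in zip(paths, paths[1:]):
        prev_product = _level(prev, "D")
        next_category = _level(nxt, "C")
        if prev_product and next_category:
            tokens.append(f"{date}/{prev_product}/{next_category}")
    return tokens


products = "A00003/B00036/C00190/D33072/;A00003/B00036/C00175/D33078/"
print(transfer_features(products, "2014/12/19"))
```

A session with a single viewed product yields no transferring features, since there is no transition to describe.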
Features for gender classification[4]
Feature value is calculated by:
\[
\mathrm{featurevalue}(f) = \frac{\mathrm{male}(f)}{\mathrm{male}(f) + \mathrm{female}(f)}
\]
where $\mathrm{male}(f)$ denotes the percentage of occurrence of $f$ in “male” sessions, and $\mathrm{female}(f)$ denotes its percentage of occurrence in “female” sessions.
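As a worked illustration of the formula, the sketch below computes feature values from labeled sessions. It assumes that the "percentage of occurrence" is the fraction of sessions of a given gender that contain the feature; the slide does not spell this out, and the function name and toy data are hypothetical.

```python
from collections import Counter


def feature_values(sessions):
    """Map each feature f to male(f) / (male(f) + female(f)).

    `sessions` is a list of (gender, tokens) pairs with gender in
    {"male", "female"}. male(f) / female(f) are taken to be the fraction
    of sessions of that gender containing f (an assumption).
    """
    totals = Counter(gender for gender, _ in sessions)
    hits = {"male": Counter(), "female": Counter()}
    for gender, tokens in sessions:
        hits[gender].update(set(tokens))  # count each feature once per session
    values = {}
    for f in set(hits["male"]) | set(hits["female"]):
        male = hits["male"][f] / totals["male"] if totals["male"] else 0.0
        female = hits["female"][f] / totals["female"] if totals["female"] else 0.0
        values[f] = male / (male + female)  # denominator > 0: f occurs somewhere
    return values


toy = [
    ("male", ["a", "b"]),
    ("male", ["a"]),
    ("female", ["a", "c"]),
]
print(feature_values(toy))
```

Because the two rates are normalized per gender before they are combined, a value of 0.5 means the feature is equally typical of both genders even when the training data are unbalanced.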
Classification model:
We use LIBSVM [2], a well-established SVM library, with a linear kernel.
Because the gender ratio in the training data is unbalanced, during training we set the weight of “male” sessions to 1.3 and the weight of “female” sessions to 0.25.
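A training setup along these lines could be prepared as sketched below. The helpers write LIBSVM's sparse text format and assemble an `svm-train` command using its `-t 0` (linear kernel) and `-wi` (per-class weight) options. The label mapping (male = +1, female = -1), the file names, and the helper names are assumptions made for illustration only.

```python
def to_libsvm_line(label, features):
    """Format one session as a LIBSVM sparse line: '<label> idx:val ...'.

    `features` maps 1-based feature indices to values; LIBSVM expects the
    indices in ascending order.
    """
    pairs = " ".join(f"{i}:{v:g}" for i, v in sorted(features.items()))
    return f"{label:+d} {pairs}".rstrip()


def svm_train_args(train_path, model_path, male_weight=1.3, female_weight=0.25):
    """Build the svm-train argument list for a class-weighted linear SVM.

    Assumes "male" sessions are labeled +1 and "female" sessions -1.
    """
    return [
        "svm-train",
        "-t", "0",                    # linear kernel
        "-w1", str(male_weight),      # weight for label +1
        "-w-1", str(female_weight),   # weight for label -1
        train_path,
        model_path,
    ]


print(to_libsvm_line(1, {7: 0.75, 3: 1.0}))
print(" ".join(svm_train_args("train.txt", "gender.model")))
```

The argument list can be handed to `subprocess.run` once the `svm-train` binary is installed; it is only printed here so that the sketch runs without LIBSVM.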
Summary
The final feature set is sparse and high-dimensional.
Timestamp-based features greatly increase the feature dimension, but turn out to be useful.
A linear classifier is efficient and works well on this data set.
Continuous Session Alignment Model
Continuous session: the start time of each subsequent session must be strictly later than the end time of its preceding session.
Session alignment: align continuous sessions in training data to the corresponding sessions in test data (unidirectional alignment).
Figure 1. One continuous session segment in training data can be aligned to more than one segment in test data.
For each continuous session segment in training data, we find the most similar continuous session segment in test data, based on the feature vector of each segment and the number of sessions in each segment.
Segment features include: Product and Category features with timestamp, Product ID prefix with timestamp, and Same level product and category features with timestamp.
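The segmentation and alignment steps can be sketched as below. The slides do not state the similarity measure or how the session count enters the comparison, so this sketch makes two loud assumptions: similarity is cosine similarity over token counts, and only test segments with the same number of sessions are considered. All names and the session dictionary layout are hypothetical.

```python
from datetime import datetime
from math import sqrt


def continuous_segments(sessions):
    """Split sessions (in file order) into runs where each session starts
    strictly later than the previous session ends."""
    segments = []
    for s in sessions:
        if segments and s["start"] > segments[-1][-1]["end"]:
            segments[-1].append(s)
        else:
            segments.append([s])
    return segments


def segment_vector(segment):
    """Token-count vector over all sessions of a segment."""
    vec = {}
    for s in segment:
        for token in s["tokens"]:
            vec[token] = vec.get(token, 0) + 1
    return vec


def cosine(a, b):
    """Cosine similarity between two sparse count vectors (assumed measure)."""
    dot = sum(v * b.get(k, 0) for k, v in a.items())
    na = sqrt(sum(v * v for v in a.values()))
    nb = sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def align(train_segments, test_segments):
    """For each training segment, return the index of the most similar test
    segment with the same number of sessions, or None if nothing overlaps."""
    result = []
    for tr in train_segments:
        best, best_sim = None, 0.0
        for j, te in enumerate(test_segments):
            if len(te) != len(tr):
                continue  # assumed: session counts must match
            sim = cosine(segment_vector(tr), segment_vector(te))
            if sim > best_sim:
                best, best_sim = j, sim
        result.append(best)
    return result


def t(hour, minute):
    return datetime(2014, 12, 19, hour, minute)


log = [
    {"start": t(14, 40), "end": t(14, 42), "tokens": ["x"]},
    {"start": t(14, 45), "end": t(14, 50), "tokens": ["y"]},
    {"start": t(9, 0), "end": t(9, 5), "tokens": ["z"]},
]
print([len(seg) for seg in continuous_segments(log)])
```

Because the alignment is unidirectional, several training segments may map to the same test segment, and one training segment may equally match several test segments; this sketch keeps only the first best match.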
References
1. Zellig S. Harris. Distributional structure. Word, 10:146-162, 1954.
2. Chih-Chung Chang and Chih-Jen Lin. LIBSVM: a library for support vector machines. ACM Transactions on Intelligent Systems and Technology, 2:27:1-27:27, 2011. Software available at http://www.csie.ntu.edu.tw/~cjlin/libsvm.