
Confidential (internal use only). Copyright 2015 Fujitsu R&D Center Co., LTD

FRDC’s approach at PAKDD’s Data Mining Competition

Ruiyu Fang, Qingliang Miao, Cuiqin Hou, Yao Meng, Lu Fang, Dajun Chen
Fujitsu Research and Development Center, Beijing, China

Background

Gender prediction: The task in this competition is to predict a user’s gender from product viewing logs.

Our solution:
- Use the product viewing information within a single session.
- Use information across different sessions by exploring their potential associations.
- Adopt a two-step strategy for gender prediction, consisting of “gender classification” and the “continuous session alignment model”.


Features for gender classification [1]

Product and category features: The products and product categories viewed in each session are treated as words in a document, and the “bag of words” model is applied. Each viewed product path contributes one feature per level of the category hierarchy.

Session:
u10171 2014/12/20 20:31 2014/12/20 20:31 A00001/B00001/C00075/D33237/;A00001/B00001/C00075/D34328

Features:
A00001, A00001/B00001, A00001/B00001/C00075, A00001/B00001/C00075/D33237, A00001/B00001/C00075/D34328
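The expansion of a session into bag-of-words tokens can be sketched as follows. This is a minimal illustration, not the authors' code: the function name `category_path_features` is hypothetical, and keeping each token only once per session is an assumption based on the example above.

```python
def category_path_features(products: str) -> list[str]:
    """Expand each viewed product path into all of its hierarchy prefixes.

    `products` is the last column of a log line, e.g.
    "A00001/B00001/C00075/D33237/;A00001/B00001/C00075/D34328".
    """
    seen: dict[str, None] = {}  # dict keeps insertion order and removes duplicates
    for path in products.split(";"):
        # Drop empty pieces caused by the trailing "/" in the log format.
        parts = [p for p in path.split("/") if p]
        for depth in range(1, len(parts) + 1):
            seen["/".join(parts[:depth])] = None
    return list(seen)


if __name__ == "__main__":
    line = "A00001/B00001/C00075/D33237/;A00001/B00001/C00075/D34328"
    print(category_path_features(line))
```

Run on the example session, the sketch yields the five features listed above.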

Product and category features with timestamp: The timestamp is taken from the start time of each session (year, month and day only).

Sessions:
u10171 2014/12/20 20:31 2014/12/20 20:31 A00001/B00001/C00075/D33237/;A00001/B00001/C00075/D34328/ male
u10174 2014/11/14 0:37 2014/11/14 0:37 A00001/B00001/C00019/D00044/ male

Features:
2014/12/20/A00001, 2014/12/20/A00001/B00001, 2014/12/20/A00001/B00001/C00075, 2014/12/20/A00001/B00001/C00075/D33237;
2014/11/14/A00001, 2014/11/14/A00001/B00001, 2014/11/14/A00001/B00001/C00019, 2014/11/14/A00001/B00001/C00019/D00044;
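A possible way to derive the timestamped variant is to prepend the session's start date to every hierarchy prefix. The sketch below assumes the log format shown in the examples; the function name `timestamped_path_features` is hypothetical.

```python
def timestamped_path_features(start_time: str, products: str) -> list[str]:
    """Prefix every hierarchy token with the session's start date."""
    date = start_time.split()[0]  # "2014/11/14 0:37" -> "2014/11/14"
    seen: dict[str, None] = {}    # ordered, duplicate-free
    for path in products.split(";"):
        parts = [p for p in path.split("/") if p]  # ignore the trailing "/"
        for depth in range(1, len(parts) + 1):
            seen[date + "/" + "/".join(parts[:depth])] = None
    return list(seen)


if __name__ == "__main__":
    print(timestamped_path_features("2014/11/14 0:37", "A00001/B00001/C00019/D00044/"))
```

For session u10174 this produces the four features shown for 2014/11/14.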

Features for gender classification [2]

Same-level product and category features with timestamp: Different products target different customers, so it is natural for individual customers to hold a few fixed preferences for particular products and categories. The IDs viewed at the same level of the hierarchy within one session are therefore grouped into a single feature.

Session:
u10204 2014/12/19 14:40 2014/12/19 14:42 A00003/B00036/C00190/D33072/;A00003/B00036/C00175/D33078/ female

Features:
2014/12/19/A00003, 2014/12/19/B00036, 2014/12/19/{C00175,C00190}, 2014/12/19/{D33072,D33078}
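The grouping by hierarchy level might be implemented along these lines. This is a sketch under assumptions: the function name `same_level_features` is hypothetical, IDs are written as they appear in the session log, and sorting the IDs inside the braces is a guess that makes the feature independent of viewing order.

```python
def same_level_features(start_time: str, products: str) -> list[str]:
    """Group the IDs seen at each hierarchy level of a session into one feature."""
    date = start_time.split()[0]
    levels: dict[int, set[str]] = {}  # hierarchy depth -> IDs seen at that depth
    for path in products.split(";"):
        parts = [p for p in path.split("/") if p]
        for depth, item in enumerate(parts):
            levels.setdefault(depth, set()).add(item)

    features = []
    for depth in sorted(levels):
        ids = sorted(levels[depth])
        # A single ID is written bare, several IDs as a brace-delimited set.
        joined = ids[0] if len(ids) == 1 else "{" + ",".join(ids) + "}"
        features.append(f"{date}/{joined}")
    return features


if __name__ == "__main__":
    print(same_level_features(
        "2014/12/19 14:40",
        "A00003/B00036/C00190/D33072/;A00003/B00036/C00175/D33078/",
    ))
```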

Product ID prefix with timestamp: We noticed that many products in the training data share the same product ID prefix. In the session below, both products share the prefix “D3307”. The prefix length is set to 4 (here, the four digits after the leading “D”).

Session:
u10204 2014/12/19 14:40 2014/12/19 14:42 A00003/B00036/C00190/D33072/;A00003/B00036/C00175/D33078 female

Feature:
2014/12/19/D3307
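The prefix feature can be sketched as below. It assumes that the product ID is the last element of each path and that a prefix length of 4 means the leading letter plus four digits, which is what the “D3307” example suggests; the function name `product_prefix_features` is hypothetical.

```python
def product_prefix_features(start_time: str, products: str, prefix_len: int = 4) -> list[str]:
    """Build date-stamped product ID prefix features for one session."""
    date = start_time.split()[0]
    seen: dict[str, None] = {}  # ordered, duplicate-free
    for path in products.split(";"):
        parts = [p for p in path.split("/") if p]
        if not parts:
            continue
        product_id = parts[-1]                  # assumed: last element is the product
        prefix = product_id[: 1 + prefix_len]   # leading letter + prefix_len digits
        seen[f"{date}/{prefix}"] = None
    return list(seen)


if __name__ == "__main__":
    print(product_prefix_features(
        "2014/12/19 14:40",
        "A00003/B00036/C00190/D33072/;A00003/B00036/C00175/D33078",
    ))
```

Both products in the example collapse to the single feature 2014/12/19/D3307.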

Features for gender classification [3]

Transferring features of sequential products: The transitions between sequentially viewed products may reflect the click habits of users of different genders.

Session:
u10204 2014/12/19 14:40 2014/12/19 14:42 A00003/B00036/C00190/D33072/;A00003/B00036/C00175/D33078/ female

Feature:
2014/12/19/D33072/C00175
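The slides give only one example of a transferring feature, so the exact rule is not stated. The sketch below assumes that each feature joins the session date, the ID of the product just viewed, and the lowest-level category of the product viewed next, with IDs written as in the session log. The function name `transfer_features` is hypothetical.

```python
def transfer_features(start_time: str, products: str) -> list[str]:
    """Build date/previous-product/next-category features for consecutive views."""
    date = start_time.split()[0]
    paths = [[p for p in path.split("/") if p] for path in products.split(";")]
    # Keep only paths that have at least a category and a product.
    paths = [p for p in paths if len(p) >= 2]

    features = []
    for prev, nxt in zip(paths, paths[1:]):
        prev_product = prev[-1]   # e.g. "D33072"
        next_category = nxt[-2]   # e.g. "C00175", the parent of the next product
        features.append(f"{date}/{prev_product}/{next_category}")
    return features


if __name__ == "__main__":
    print(transfer_features(
        "2014/12/19 14:40",
        "A00003/B00036/C00190/D33072/;A00003/B00036/C00175/D33078/",
    ))
```

A session with a single viewed product yields no transferring features under this reading.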

Feature   Count
PF        22,464
PFT       35,828
SLT       11,403
PIP        5,157
TFT       17,231
Total     92,083

Table 1. Counts of the different kinds of features. PF denotes product and category features, PFT is PF with timestamp, SLT denotes same-level features with timestamp, PIP denotes product ID prefix features, and TFT denotes transferring features.

Features for gender classification [4]

Feature value is calculated by:

\[
\mathrm{featurevalue}(f) = \frac{\mathrm{male}(f)}{\mathrm{male}(f) + \mathrm{female}(f)}
\]

where male(f) denotes the percentage of occurrence of f in “male” sessions, and female(f) denotes its percentage of occurrence in “female” sessions.
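As a worked illustration of the ratio male(f) / (male(f) + female(f)), the sketch below reads male(f) and female(f) as the fraction of male and female training sessions that contain f. That reading is an assumption, and the function name `feature_values` and the toy data are hypothetical.

```python
from collections import Counter


def feature_values(sessions):
    """Compute male(f) / (male(f) + female(f)) for every feature f.

    `sessions` is an iterable of (gender, features) pairs, where gender is
    "male" or "female" and features is a collection of feature strings.
    """
    n_sessions = Counter()
    occurrences = {"male": Counter(), "female": Counter()}
    for gender, feats in sessions:
        n_sessions[gender] += 1
        occurrences[gender].update(set(feats))  # count each feature once per session

    values = {}
    for f in set(occurrences["male"]) | set(occurrences["female"]):
        male = occurrences["male"][f] / n_sessions["male"] if n_sessions["male"] else 0.0
        female = occurrences["female"][f] / n_sessions["female"] if n_sessions["female"] else 0.0
        values[f] = male / (male + female)  # denominator > 0: f occurred at least once
    return values


if __name__ == "__main__":
    toy = [("male", ["a", "b"]), ("female", ["a"]), ("female", ["c"])]
    print(feature_values(toy))
```

Under this reading, a value near 1 marks a feature seen mostly in male sessions and a value near 0 one seen mostly in female sessions, independent of how many sessions each gender has.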

Classification model: We use LIBSVM [2], a well-implemented SVM library, with a linear kernel function. Because the gender ratio in the training data is unbalanced, we set the weight of “male” sessions to 1.3 and the weight of “female” sessions to 0.25 during training.
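One way to reproduce this setup with the LIBSVM command-line tool is sketched below. The helpers write sparse training lines in LIBSVM's `label index:value` format and assemble an `svm-train` call with `-t 0` (linear kernel) and the `-wi` per-class weight option. The label coding (+1 for male, -1 for female), the helper names and the file names are assumptions, not taken from the slides.

```python
def to_libsvm_line(label: int, feature_values: dict[int, float]) -> str:
    """Format one session as a LIBSVM line; indices are 1-based and ascending."""
    items = " ".join(f"{index}:{value:g}" for index, value in sorted(feature_values.items()))
    return f"{label} {items}".rstrip()


def svm_train_args(train_file: str, model_file: str,
                   male_label: int = 1, female_label: int = -1) -> list[str]:
    """Assemble the svm-train command with a linear kernel and class weights."""
    return [
        "svm-train",
        "-t", "0",                      # linear kernel
        f"-w{male_label}", "1.3",       # weight for "male" sessions
        f"-w{female_label}", "0.25",    # weight for "female" sessions
        train_file,
        model_file,
    ]


if __name__ == "__main__":
    print(to_libsvm_line(1, {3: 0.5, 1: 1.0}))
    print(" ".join(svm_train_args("train.txt", "gender.model")))
```

The argument list can be passed to `subprocess.run` on a machine where LIBSVM is installed; only non-zero features need to be written, which suits the sparse feature set.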

Summary:
- We finally use a sparse, high-dimensional feature set.
- Timestamp-based features greatly increase the feature dimensionality, but turn out to be useful.
- A linear classifier is efficient and works well on this data set.

Continuous Session Alignment Model

Continuous session: The start time of a session must be strictly later than the end time of its preceding session.

Session alignment: Align continuous sessions in the training data to the corresponding sessions in the test data (unidirectional alignment).

Figure 1. One continuous session segment in the training data can be aligned to more than one segment in the test data.

For each continuous session segment in the training data, we find the most similar continuous session segment in the test data, based on:
- the feature vector of each segment;
- the number of sessions in each segment.

The features of a segment include product and category features with timestamp, product ID prefix with timestamp, and same-level product and category features with timestamp.
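The two steps can be sketched as follows. `split_continuous` applies the "strictly later" rule to sessions in log order. `most_similar_segment` uses Jaccard similarity over feature sets as a stand-in, since the slides do not name the similarity measure or say how the number of sessions enters the comparison. All names and the toy data are hypothetical.

```python
from datetime import datetime

TIME_FORMAT = "%Y/%m/%d %H:%M"  # matches log entries such as "2014/11/14 0:37"


def split_continuous(sessions):
    """Split (session_id, start, end) tuples, given in log order, into segments.

    A session joins the current segment only if it starts strictly later
    than the preceding session ended; otherwise a new segment begins.
    """
    segments, current, prev_end = [], [], None
    for session_id, start, end in sessions:
        start_dt = datetime.strptime(start, TIME_FORMAT)
        if current and start_dt > prev_end:
            current.append(session_id)
        else:
            if current:
                segments.append(current)
            current = [session_id]
        prev_end = datetime.strptime(end, TIME_FORMAT)
    if current:
        segments.append(current)
    return segments


def jaccard(a, b):
    """Jaccard similarity of two feature collections (assumed measure)."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 0.0


def most_similar_segment(train_features, test_segments):
    """Return the key of the test segment whose features best match."""
    return max(test_segments, key=lambda name: jaccard(train_features, test_segments[name]))


if __name__ == "__main__":
    log = [
        ("u1", "2014/12/20 20:31", "2014/12/20 20:35"),
        ("u2", "2014/12/20 20:40", "2014/12/20 20:41"),
        ("u3", "2014/12/20 20:41", "2014/12/20 20:50"),
    ]
    print(split_continuous(log))
```

Because the alignment is unidirectional, several training segments may select the same test segment under this lookup.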


References

1. Zellig S. Harris. Distributional structure. Word, 10:146-162, 1954.

2. Chih-Chung Chang and Chih-Jen Lin. LIBSVM: a library for support vector machines. ACM Transactions on Intelligent Systems and Technology, 2:27:1-27:27, 2011. Software available at http://www.csie.ntu.edu.tw/~cjlin/libsvm.
