RecSys Challenge 2015: ensemble learning with categorical features
Peter Romov, Evgeny Sokolov
• Logs from an e-commerce website: a collection of sessions
• Session
  • sequence of clicks on item pages
  • could end with or without a purchase
• Click
  • Timestamp
  • ItemID (≈54k unique IDs)
  • CategoryID (≈350 unique IDs)
• Purchase
  • set of bought items with price and quantity
• Purchases are known for the train-set; we need to predict them on the test-set
Problem statement
Clicks from session s:  c(s) = (c_1(s), \dots, c_{n(s)}(s))

Purchase (actual):
y(s) = \begin{cases} \emptyset & \text{no purchase} \\ \{i_1, \dots, i_{m(s)}\} \ \text{(bought items)} & \text{otherwise} \end{cases}

Purchase (predicted):  h(s) \approx y(s)

Evaluation measure:
Q(h, S_{\mathrm{test}}) = \sum_{s \in S_{\mathrm{test}}:\, |h(s)| > 0} \begin{cases} \frac{|S^b_{\mathrm{test}}|}{|S_{\mathrm{test}}|} + J(y(s), h(s)), & \text{if } y(s) \neq \emptyset \\ -\frac{|S^b_{\mathrm{test}}|}{|S_{\mathrm{test}}|}, & \text{otherwise} \end{cases}

where
J(y(s), h(s)) = \frac{|y(s) \cap h(s)|}{|y(s) \cup h(s)|} is the Jaccard similarity between two sets,
S_{\mathrm{test}} is the set of all sessions from the test-set,
S^b_{\mathrm{test}} is the set of sessions from the test-set with a purchase.
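To make the measure concrete, here is a minimal Python sketch of Q (function and variable names are my own illustration, not the organizers' code):

```python
def jaccard(a, b):
    """Jaccard similarity |A ∩ B| / |A ∪ B| between two sets."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)

def score(h, y, test_sessions):
    """Competition score Q(h, S_test).

    h, y: dicts mapping session id -> set of item ids
    (h = predicted, y = actual; an empty set means no purchase).
    """
    buyers = [s for s in test_sessions if y[s]]
    reward = len(buyers) / len(test_sessions)  # |S^b_test| / |S_test|
    q = 0.0
    for s in test_sessions:
        if not h[s]:          # no prediction: contributes nothing
            continue
        if y[s]:              # correctly flagged buyer: reward + Jaccard
            q += reward + jaccard(y[s], h[s])
        else:                 # false alarm: penalty
            q -= reward
    return q
```

Note that only sessions with a non-empty prediction contribute, matching the summation condition |h(s)| > 0.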
Problem statement

First observations (from the task):
• the task is uncommon (set prediction with a specific loss function)
• the evaluation measure can be rewritten as

Q(h, S_{\mathrm{test}}) = \underbrace{\frac{|S^b_{\mathrm{test}}|}{|S_{\mathrm{test}}|}\,(\mathrm{TP} - \mathrm{FP})}_{\text{purchase score}} + \underbrace{\sum_{s \in S_{\mathrm{test}}} J(y(s), h(s))}_{\text{Jaccard score}}

• the original problem can be divided into two well-known binary classification problems:
  1. predict purchase given the session, i.e. estimate P(y(s) \neq \emptyset \mid s), and optimize the Purchase score
  2. predict bought items given a session with a purchase, i.e. estimate P(i \in y(s) \mid s,\, y(s) \neq \emptyset), and optimize the Jaccard score
Solution schema

• Two-stage prediction
• Two binary classification models learned on the train-set
• Both classifiers require thresholds
• Thresholds are set to optimize the Purchase score and the Jaccard score using a hold-out subsample of the train-set

S_{\mathrm{train}} is split into S_{\mathrm{learn}} (90%) and S_{\mathrm{valid}} (10%):
• purchase classifier:  s \mapsto P(\text{purchase} \mid s)
• bought-item classifier:  (s, i) \mapsto P(i \in y(s) \mid s,\, y(s) \neq \emptyset)
• classifier thresholds are tuned on S_{\mathrm{valid}}
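The two-stage schema can be sketched as a small function; a simplified illustration in which the two scoring functions and the thresholds are assumed inputs (all names are mine):

```python
def two_stage_predict(session, items_clicked, p_purchase, p_item,
                      alpha_p, alpha_i):
    """Two-stage prediction: first decide whether the session ends with
    a purchase; only then select which of the clicked items are bought.

    p_purchase(session)  -> purchase score for the session
    p_item(session, item) -> bought-item score (clicked items only,
                             since a bought item is a clicked item)
    alpha_p, alpha_i      -> thresholds tuned on the hold-out set
    """
    if p_purchase(session) < alpha_p:
        return set()  # predict no purchase at all
    return {i for i in items_clicked if p_item(session, i) > alpha_i}
```

For example, with a purchase score above alpha_p, only the items whose score clears alpha_i end up in the predicted set.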
Some relationships from data
Next observations (from the data):
• Buying rate strongly depends on time features
• Buying rate varies widely across categorical features
• timestamp — the time when the click occurred;
• item ID — the unique identifier of the item;
• category — the category identifier of the item; the same item can belong to different categories in different click events.

The total number of item IDs and category IDs is 54,287 and 347 respectively. Both training and test sets belong to an interval of 6 months. The target function y(s) corresponds to the set of items that were bought in the session s¹. In other words, the target function gives some subset of the universal item set I for each user session s. We are given the sets of bought items y(s) for all sessions s in the training set Strain, and are required to predict these sets for test sessions s ∈ Stest.
2.2 Evaluation Measure

Denote by h(s) a hypothesis that predicts a set of bought items for any user session s. The score of this hypothesis is measured by the following formula:

Q(h, S_{\mathrm{test}}) = \sum_{s \in S_{\mathrm{test}}:\, |h(s)| > 0} \left[ \frac{|S^b_{\mathrm{test}}|}{|S_{\mathrm{test}}|}\,(-1)^{\mathrm{isEmpty}(y(s))} + J(y(s), h(s)) \right],
where S^b_{\mathrm{test}} is the set of all test sessions with at least one purchase event, and J(A, B) = \frac{|A \cap B|}{|A \cup B|} is the Jaccard similarity measure. It is easy to rewrite this expression as

Q(h, S_{\mathrm{test}}) = \underbrace{\frac{|S^b_{\mathrm{test}}|}{|S_{\mathrm{test}}|}\,(\mathrm{TP} - \mathrm{FP})}_{\text{purchase score}} + \underbrace{\sum_{s \in S_{\mathrm{test}}} J(y(s), h(s))}_{\text{Jaccard score}}, \quad (1)

where TP is the number of sessions with |y(s)| > 0 and |h(s)| > 0 (i.e. true positives), and FP is the number of sessions with |y(s)| = 0 and |h(s)| > 0 (i.e. false positives). Now it is easy to see that the score consists of two parts. The first one gives a reward for each correctly guessed session with buy events and a penalty for each false alarm; the absolute values of the penalty and reward are both equal to \frac{|S^b_{\mathrm{test}}|}{|S_{\mathrm{test}}|}. The second part calculates the total similarity of the predicted sets of bought items to the real sets.
2.3 Purchase Statistics

It is interesting that the sessions in the training and test sets are not separated in time, although it is considered good practice for recommender problems to predict future events based on a training set from the past. This peculiarity allows us to use date-time features, which turned out to be quite useful.

We define the buying rate as the fraction of buyer sessions in some subset of sessions. Figure 1 shows how the buying rate changes over time. It is obvious from the plots that visiting the e-commerce site during midday leads to a purchase several times more often than in the night hours. One will also make a purchase during a session with a higher probability on a weekend than on a working day. Buying activity differs between the days in the dataset, e.g. there could be days of sales or holidays. Another observation (see Figure 2)

¹ Actually, the original data contains additional information for each buy event: timestamp, price and item quantity. However, we do not use any of it in the solution.
Figure 1: Dynamics of the buying rate in time
Figure 2: Buying rate versus number of clicked items (left) and ID of the item with the maximum number of clicks in session (right)
is that a higher number of items clicked during the session leads to a higher chance of a purchase.

Lots of information could be extracted from the data by considering the item identifiers and categories clicked during the session. For example, we can find the item m(s) with the maximum number of clicks in each session s. In Figure 2 we show buying rates among sessions with a specific most clicked item m(s) = j for some popular item IDs j. It is easy to see that buying rates vary significantly among these IDs.
3. SOLUTION METHOD

3.1 Outline

Only 5.5% of user sessions have at least one buy event, so the problem is unbalanced: it appears to be important to be able to predict whether the user will buy at least one item. We have also noticed above that the evaluation measure consists of two independent parts. These considerations give us the idea to build a two-stage classifier: at first, we predict whether the user will buy something, and if the prediction is positive, then we predict what he will buy. More precisely,
Buying rate — fraction of buyer sessions in some subset of sessions
Feature extraction
• Purchase classifier: features from session (sequence of clicks)
• Bought item classifier: features from pair session+itemID
• Observation: bought item is a clicked item
• We use two types of features
  • Numerical: a real number, e.g. seconds between two clicks
  • Categorical: an element of an unordered set of values (levels), e.g. ItemID
Feature extraction: session
1. Start/end of the session (month, day, hour, etc.) [numerical + categorical with few levels]
2. Number of clicks, unique items, categories, item-category pairs [numerical]
3. Top 10 items and categories by the number of clicks [categorical with ≈50k levels]
4. ID of the first/last item clicked at least k times [categorical with ≈50k levels]
5. Click counts for 100 items and 50 categories that were most popular in the whole training set [sparse numerical]
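A sketch of how a few of the listed session features could be computed; the click-tuple layout and the feature names here are my own assumptions for illustration:

```python
from collections import Counter
from datetime import datetime

def session_features(clicks):
    """Compute a few of the session features listed above.
    `clicks` is a time-sorted list of (timestamp, item_id, category_id)
    tuples (an illustrative layout, not the competition format)."""
    times = [c[0] for c in clicks]
    items = [c[1] for c in clicks]
    cats = [c[2] for c in clicks]
    item_counts = Counter(items)
    return {
        # 1. start/end of the session (numerical + small categorical)
        "start_hour": times[0].hour,
        "end_weekday": times[-1].weekday(),
        # 2. simple counts (numerical)
        "n_clicks": len(clicks),
        "n_items": len(set(items)),
        "n_categories": len(set(cats)),
        # 3. most clicked item (high-cardinality categorical)
        "top_item": item_counts.most_common(1)[0][0],
        # 4. last item clicked at least k=2 times (may be absent)
        "last_item_2clicks": next(
            (i for i in reversed(items) if item_counts[i] >= 2), None),
    }
```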
Feature extraction: session+ItemID
1. All session features
2. ItemID [categorical with ≈50k levels]
3. Timestamp of the first/last click on the item (month, day, hour, etc.) [numerical + categorical with few levels]
4. Number of clicks in the session for the given item [numerical]
5. Total duration (by analogy with dwell time) of the clicks on the item [numerical]
• GBM and similar ensemble learning techniques
  • useful with numerical features
  • one-hot encoding of categorical features doesn't perform well
• Matrix decompositions, FM
  • useful with categorical features
  • hard to incorporate numerical features because of the rough (bi-linear) model
• We used our internal learning algorithm: MatrixNet
  • GBM with oblivious decision trees
  • trees properly handle categorical features (multi-split decision trees)
  • SVD-like decompositions for new feature value combinations
Classification method
Oblivious decision tree with categorical features: every node on a level applies the same test, e.g.
  level 1: duration > 20 (yes/no) [numerical]
  level 2: user [categorical]
  level 3: item [categorical]
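The leaf lookup in an oblivious tree can be sketched as follows. This is a simplified illustration in which each level applies a single binary test (MatrixNet's multi-split hash tables over categorical features are more general), and the level values "u42"/"i7" are hypothetical:

```python
def oblivious_tree_leaf(x, levels):
    """Index of the leaf an example falls into in an oblivious tree:
    since every node on a level applies the same test, the leaf index
    is just the bit-string of the per-level test outcomes."""
    leaf = 0
    for test in levels:
        leaf = (leaf << 1) | int(test(x))
    return leaf

# A toy tree mirroring the figure above: one numerical split,
# then two (simplified, binary) categorical splits.
levels = [
    lambda x: x["duration"] > 20,   # numerical split
    lambda x: x["user"] == "u42",   # categorical split (hypothetical value)
    lambda x: x["item"] == "i7",    # categorical split (hypothetical value)
]
```

With 3 levels the tree has 2³ = 8 leaves, one value per outcome pattern.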
• Training classifiers
• GB with 10k trees for each classifier
• ≈12 hours to train both models on 150 machines
• Making predictions
• We made 4000 predictions per second per thread
Classification method: speed
Threshold optimization
We optimized the thresholds using the validation set (a 10% hold-out from the train-set):
1) Maximize the Jaccard score
2) Maximize the Purchase+Jaccard scores using the fixed bought-item threshold

Q(h, S_{\mathrm{valid}}) = \underbrace{\frac{|S^b_{\mathrm{valid}}|}{|S_{\mathrm{valid}}|}\,(\mathrm{TP} - \mathrm{FP})}_{\text{purchase score}} + \underbrace{\sum_{s \in S_{\mathrm{valid}}} J(y(s), h(s))}_{\text{Jaccard score}}
we train purchase detection and purchased item detection classifiers. The purchase detection classifier hp(s) predicts the outcome of the function yp(s) = isNotEmpty(y(s)) and uses the entire training set in the learning phase. The item detection classifier hi(s, j) approximates the indicator function yi(s, j) = I(j ∈ y(s)) and uses only sessions with bought items in the learning phase. Of course, it would be wise to use classifiers that output probabilities rather than binary predictions, because in this case we will be able to select thresholds that directly optimize the evaluation metric (1) instead of the classifier's internal quality measure. So, our final expression for the hypothesis can be written as
h(s) = \begin{cases} \emptyset & \text{if } h_p(s) < \alpha_p, \\ \{j \in I \mid h_i(s, j) > \alpha_i\} & \text{if } h_p(s) \geq \alpha_p. \end{cases} \quad (2)
3.2 Feature Extraction

We have outlined two groups of features: one describes a session and the other describes a session-item pair. The purchase detection classifier uses only session features and the item detection classifier uses both groups. The full feature listing can be found in Table 1; for further details, please refer to our code². We give some comments on our feature extraction decisions below.
One could use sophisticated aggregations to extract numerical features that describe items and categories. However, we utilize a simpler approach and add raw identifiers to the feature space instead. A learning method suitable for such a feature space will be discussed in the next section.
We have found it useful to calculate features based on what is called dwell time [6] in information retrieval. We define the duration of the click on the item j as the number of seconds between that click and the next click. We have calculated features such as the total duration of the item within the session, the total duration of the item's category, etc.
3.3 Classification Method

The main challenge of our dataset is that it contains dozens of categorical features with tens of thousands of levels. Popular ensembling libraries (e.g. XGBoost, ensemble in sklearn, gbm in R) do not support categorical features directly and require them to first be transformed into real-valued features, e.g. by one-hot encoding. However, one-hot encoding of our dataset would lead to an infeasible number of features. To overcome this problem, we relied on Yandex's proprietary machine learning tool called MatrixNet [3]. It is an implementation of gradient boosting [2] over oblivious decision trees [4] with careful leaf value weighting based on their variance. To extend this approach to categorical features, MatrixNet uses hash tables as base learners. One hash table corresponds to a small subset of features and contains predictions for all value combinations of these features seen on the training set. MatrixNet also builds SVD-like low-rank approximations of hash tables for the case when it encounters new feature value combinations in the test set (for example, suppose that a table contains the features "User ID" and "Movie ID", and we want to predict a user's rating for a film the user has not seen before).
Friedman [2] proposed to select features for decision trees based on MSE gain. It works quite well for real-valued features, but the presence of categorical features in the training
² https://github.com/romovpa/ydf-recsys2015-challenge
Figure 3: Item detection threshold (above) and purchase detection threshold (below) quality on the validation set.
set can lead to more complex hash tables and severe overfitting due to the nature of MSE gain. We cope with this problem in MatrixNet by using a generalization-bound-based criterion (e.g. [5]) for feature selection.
Since both learning tasks in our solution schema are binary classification tasks, we have selected binary log-likelihood as the loss function. To improve convergence for this function, MatrixNet performs an additional optimization. When the hash table is built and the predictions in each record are calculated based on MSE (recall that weak learners are always trained using the MSE criterion regardless of the actual objective function [2]), MatrixNet performs univariate optimization in each hash table record separately by gradient descent over the log-likelihood. We have trained the classification models in distributed mode on Yandex's cluster of 150 machines. It took about 12 hours to train both models, and their size turned out to be about 60 gigabytes. The final prediction for the test set was generated in 10 minutes in one thread on a single machine, which corresponds to a prediction speed of approximately 4000 sessions per second.
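The per-record refinement can be illustrated with a toy univariate update: gradient ascent on the binary log-likelihood of the examples falling into one hash-table record (a sketch of the idea described above, not MatrixNet's exact rule; all names are mine):

```python
import math

def refine_leaf_value(value, labels, lr=0.1, steps=500):
    """Refine a single record (leaf) value by gradient ascent on the
    binary log-likelihood of the examples that fall into it,
    with labels in {0, 1} and value interpreted as a logit."""
    for _ in range(steps):
        # d/dv sum_i [ y_i log σ(v) + (1 - y_i) log(1 - σ(v)) ]
        #   = sum_i (y_i - σ(v)),  where σ is the sigmoid
        grad = sum(y - 1.0 / (1.0 + math.exp(-value)) for y in labels)
        value += lr * grad / len(labels)
    return value
```

The value converges to the log-odds of the positive rate among the record's examples, e.g. log 3 for labels [1, 1, 1, 0].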
3.4 Threshold Optimization

We have selected 90% of the training set for learning purposes and 10% for validation. The purchase and item detection classifiers hp(s) and hi(s, j) were trained by MatrixNet on the learning part, with the optimal number of weak learners selected on the validation set. Then we used the validation set to find the best thresholds for the purchase detection (αp ≈ −2.97) and item detection (αi ≈ −0.43) classifiers.
To make the final predictions on the test set, we first maximize the mean Jaccard similarity for validation samples by choosing the optimal item threshold αi. Then we fix αi and optimize the competition score (1) by choosing the optimal purchase threshold αp. The first optimization is performed by using a standard univariate optimization method, the ob-
• Leaderboard: 63102 (1st place)
• Purchase detection on validation (10% hold-out):
  • 16% precision
  • 77% recall
  • AUC 0.85
• Purchased item detection on validation:
• Jaccard measure 0.765
• Features / datasets used to learn classifiers / evaluation process can be reproduced, see our code¹
Final results
¹ https://github.com/romovpa/ydf-recsys2015-challenge
1. Observations from the problem statement
   › The task is complex but decomposable into two well-known problems: binary classification of sessions and of (session, ItemID)-pairs
2. Observations from the data (user click sessions)
   › Features from sessions and (session, ItemID)-pairs
   › Easy to develop many meaningful categorical features
3. The algorithm
   › Gradient boosting on trees with categorical features
   › No sophisticated mixtures of Machine Learning techniques: one algorithm to work with many numerical and categorical features
Summary / Questions?