RecSys Challenge 2015: ensemble learning with categorical features
Peter Romov, Evgeny Sokolov
• Logs from an e-commerce website: a collection of sessions
• Session
  • sequence of clicks on item pages
  • could end with or without a purchase
• Click
  • Timestamp
  • ItemID (≈54k unique IDs)
  • CategoryID (≈350 unique IDs)
• Purchase
  • set of bought items with price and quantity
• Purchases are known for the train-set; we need to predict them on the test-set
Problem statement
Clicks from session s:  c(s) = (c_1(s), \dots, c_{n(s)}(s))

Purchase (actual):
y(s) = \begin{cases} \emptyset & \text{no purchase} \\ \{i_1, \dots, i_{m(s)}\} \ \text{(bought items)} & \text{otherwise} \end{cases}

Purchase (predicted):  h(s) \approx y(s)

Evaluation measure:
Q(h, S_{\mathrm{test}}) = \sum_{s \in S_{\mathrm{test}}:\, |h(s)| > 0} \begin{cases} \frac{|S^b_{\mathrm{test}}|}{|S_{\mathrm{test}}|} + J(y(s), h(s)), & \text{if } y(s) \neq \emptyset \\ -\frac{|S^b_{\mathrm{test}}|}{|S_{\mathrm{test}}|}, & \text{otherwise} \end{cases}

where
J(y(s), h(s)) = \frac{|y(s) \cap h(s)|}{|y(s) \cup h(s)|} is the Jaccard similarity between two sets,
S_{\mathrm{test}} is the set of all sessions from the test-set,
S^b_{\mathrm{test}} is the set of sessions from the test-set with a purchase.
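To make the measure concrete, here is a minimal Python sketch of Q (function and variable names are my own illustration, not the organizers' code):

```python
def jaccard(a, b):
    """Jaccard similarity |A ∩ B| / |A ∪ B| between two sets."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)

def score(h, y, test_sessions):
    """Competition score Q(h, S_test).

    h, y: dicts mapping session id -> set of item ids
    (h = predicted, y = actual; an empty set means no purchase).
    """
    buyers = [s for s in test_sessions if y[s]]
    reward = len(buyers) / len(test_sessions)  # |S^b_test| / |S_test|
    q = 0.0
    for s in test_sessions:
        if not h[s]:          # no prediction: contributes nothing
            continue
        if y[s]:              # correctly flagged buyer: reward + Jaccard
            q += reward + jaccard(y[s], h[s])
        else:                 # false alarm: penalty
            q -= reward
    return q
```

Note that only sessions with a non-empty prediction contribute, matching the summation condition |h(s)| > 0.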
Problem statement

First observations (from the task):
• the task is uncommon (set prediction with a specific loss function)
• the evaluation measure can be rewritten as

Q(h, S_{\mathrm{test}}) = \underbrace{\frac{|S^b_{\mathrm{test}}|}{|S_{\mathrm{test}}|}\,(\mathrm{TP} - \mathrm{FP})}_{\text{purchase score}} + \underbrace{\sum_{s \in S_{\mathrm{test}}} J(y(s), h(s))}_{\text{Jaccard score}}

• the original problem can be divided into two well-known binary classification problems:
  1. predict purchase given the session, i.e. estimate P(y(s) \neq \emptyset \mid s), and optimize the Purchase score
  2. predict bought items given a session with a purchase, i.e. estimate P(i \in y(s) \mid s,\, y(s) \neq \emptyset), and optimize the Jaccard score
Solution schema

• Two-stage prediction
• Two binary classification models learned on the train-set
• Both classifiers require thresholds
• Thresholds are set to optimize the Purchase score and the Jaccard score using a hold-out subsample of the train-set

S_{\mathrm{train}} is split into S_{\mathrm{learn}} (90%) and S_{\mathrm{valid}} (10%):
• purchase classifier:  s \mapsto P(\text{purchase} \mid s)
• bought-item classifier:  (s, i) \mapsto P(i \in y(s) \mid s,\, y(s) \neq \emptyset)
• classifier thresholds are tuned on S_{\mathrm{valid}}
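The two-stage schema can be sketched as a small function; a simplified illustration in which the two scoring functions and the thresholds are assumed inputs (all names are mine):

```python
def two_stage_predict(session, items_clicked, p_purchase, p_item,
                      alpha_p, alpha_i):
    """Two-stage prediction: first decide whether the session ends with
    a purchase; only then select which of the clicked items are bought.

    p_purchase(session)  -> purchase score for the session
    p_item(session, item) -> bought-item score (clicked items only,
                             since a bought item is a clicked item)
    alpha_p, alpha_i      -> thresholds tuned on the hold-out set
    """
    if p_purchase(session) < alpha_p:
        return set()  # predict no purchase at all
    return {i for i in items_clicked if p_item(session, i) > alpha_i}
```

For example, with a purchase score above alpha_p, only the items whose score clears alpha_i end up in the predicted set.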
Some relationships from data
Next observations (from the data):
• Buying rate strongly depends on time features
• Buying rate varies widely across categorical features
• timestamp — the time when the click occurred;
• item ID — the unique identifier of the item;
• category — the category identifier of the item; the same item can belong to different categories in different click events.

The total number of item IDs and category IDs is 54,287 and 347 respectively. Both training and test sets belong to an interval of 6 months. The target function y(s) corresponds to the set of items that were bought in the session s¹. In other words, the target function gives some subset of the universal item set I for each user session s. We are given the sets of bought items y(s) for all sessions s in the training set Strain, and are required to predict these sets for test sessions s ∈ Stest.
2.2 Evaluation Measure

Denote by h(s) a hypothesis that predicts a set of bought items for any user session s. The score of this hypothesis is measured by the following formula:

Q(h, S_{\mathrm{test}}) = \sum_{s \in S_{\mathrm{test}}:\, |h(s)| > 0} \left[ \frac{|S^b_{\mathrm{test}}|}{|S_{\mathrm{test}}|}\,(-1)^{\mathrm{isEmpty}(y(s))} + J(y(s), h(s)) \right],
where S^b_{\mathrm{test}} is the set of all test sessions with at least one purchase event, and J(A, B) = \frac{|A \cap B|}{|A \cup B|} is the Jaccard similarity measure. It is easy to rewrite this expression as

Q(h, S_{\mathrm{test}}) = \underbrace{\frac{|S^b_{\mathrm{test}}|}{|S_{\mathrm{test}}|}\,(\mathrm{TP} - \mathrm{FP})}_{\text{purchase score}} + \underbrace{\sum_{s \in S_{\mathrm{test}}} J(y(s), h(s))}_{\text{Jaccard score}}, \quad (1)

where TP is the number of sessions with |y(s)| > 0 and |h(s)| > 0 (i.e. true positives), and FP is the number of sessions with |y(s)| = 0 and |h(s)| > 0 (i.e. false positives). Now it is easy to see that the score consists of two parts. The first one gives a reward for each correctly guessed session with buy events and a penalty for each false alarm; the absolute values of the penalty and reward are both equal to \frac{|S^b_{\mathrm{test}}|}{|S_{\mathrm{test}}|}. The second part calculates the total similarity of the predicted sets of bought items to the real sets.
2.3 Purchase Statistics

It is interesting that the sessions in the training and test sets are not separated in time, although it is considered good practice for recommender problems to predict future events based on a training set from the past. This peculiarity allows us to use date-time features, which turned out to be quite useful.

We define the buying rate as the fraction of buyer sessions in some subset of sessions. Figure 1 shows how the buying rate changes over time. It is obvious from the plots that visiting the e-commerce site during midday leads to a purchase several times more often than in the night hours. One will also make a purchase during a session with a higher probability on a weekend than on a working day. Buying activity differs between the days in the dataset, e.g. there could be days of sales or holidays. Another observation (see Figure 2)

¹ Actually, the original data contains additional information for each buy event: timestamp, price and item quantity. However, we do not use any of it in the solution.
Figure 1: Dynamics of the buying rate in time
Figure 2: Buying rate versus number of clicked items (left) and ID of the item with the maximum number of clicks in session (right)
is that a higher number of items clicked during the session leads to a higher chance of a purchase.

Lots of information could be extracted from the data by considering the item identifiers and categories clicked during the session. For example, we can find the item m(s) with the maximum number of clicks in each session s. In Figure 2 we show buying rates among sessions with a specific most clicked item m(s) = j for some popular item IDs j. It is easy to see that buying rates vary significantly among these IDs.
3. SOLUTION METHOD

3.1 Outline

Only 5.5% of user sessions have at least one buy event, so the problem is unbalanced: it appears to be important to be able to predict whether the user will buy at least one item. We have also noticed above that the evaluation measure consists of two independent parts. These considerations give us the idea to build a two-stage classifier: at first, we predict whether the user will buy something, and if the prediction is positive, then we predict what he will buy. More precisely,
Buying rate — fraction of buyer sessions in some subset of sessions
Feature extraction
• Purchase classifier: features from session (sequence of clicks)
• Bought item classifier: features from pair session+itemID
• Observation: bought item is a clicked item
• We use two types of features
  • Numerical: a real number, e.g. seconds between two clicks
  • Categorical: an element of an unordered set of values (levels), e.g. ItemID
Feature extraction: session
1. Start/end of the session (month, day, hour, etc.) [numerical + categorical with few levels]
2. Number of clicks, unique items, categories, item-category pairs [numerical]
3. Top 10 items and categories by the number of clicks [categorical with ≈50k levels]
4. ID of the first/last item clicked at least k times [categorical with ≈50k levels]
5. Click counts for 100 items and 50 categories that were most popular in the whole training set [sparse numerical]
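A sketch of how a few of the listed session features could be computed; the click-tuple layout and the feature names here are my own assumptions for illustration:

```python
from collections import Counter
from datetime import datetime

def session_features(clicks):
    """Compute a few of the session features listed above.
    `clicks` is a time-sorted list of (timestamp, item_id, category_id)
    tuples (an illustrative layout, not the competition format)."""
    times = [c[0] for c in clicks]
    items = [c[1] for c in clicks]
    cats = [c[2] for c in clicks]
    item_counts = Counter(items)
    return {
        # 1. start/end of the session (numerical + small categorical)
        "start_hour": times[0].hour,
        "end_weekday": times[-1].weekday(),
        # 2. simple counts (numerical)
        "n_clicks": len(clicks),
        "n_items": len(set(items)),
        "n_categories": len(set(cats)),
        # 3. most clicked item (high-cardinality categorical)
        "top_item": item_counts.most_common(1)[0][0],
        # 4. last item clicked at least k=2 times (may be absent)
        "last_item_2clicks": next(
            (i for i in reversed(items) if item_counts[i] >= 2), None),
    }
```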
Feature extraction: session+ItemID
1. All session features
2. ItemID [categorical with ≈50k levels]
3. Timestamp of the first/last click on the item (month, day, hour, etc.) [numerical + categorical with few levels]
4. Number of clicks in the session for the given item [numerical]
5. Total duration (by analogy with dwell time) of the clicks on the item [numerical]
• GBM and similar ensemble learning techniques
  • useful with numerical features
  • one-hot encoding of categorical features doesn't perform well
• Matrix decompositions, FM
  • useful with categorical features
  • hard to incorporate numerical features because of the rough (bi-linear) model
• We used our internal learning algorithm: MatrixNet
  • GBM with oblivious decision trees
  • trees properly handle categorical features (multi-split decision trees)
  • SVD-like decompositions for new feature value combinations
Classification method
Oblivious decision tree with categorical features: every node on a level applies the same test, e.g.
  level 1: duration > 20 (yes/no) [numerical]
  level 2: user [categorical]
  level 3: item [categorical]
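The leaf lookup in an oblivious tree can be sketched as follows. This is a simplified illustration in which each level applies a single binary test (MatrixNet's multi-split hash tables over categorical features are more general), and the level values "u42"/"i7" are hypothetical:

```python
def oblivious_tree_leaf(x, levels):
    """Index of the leaf an example falls into in an oblivious tree:
    since every node on a level applies the same test, the leaf index
    is just the bit-string of the per-level test outcomes."""
    leaf = 0
    for test in levels:
        leaf = (leaf << 1) | int(test(x))
    return leaf

# A toy tree mirroring the figure above: one numerical split,
# then two (simplified, binary) categorical splits.
levels = [
    lambda x: x["duration"] > 20,   # numerical split
    lambda x: x["user"] == "u42",   # categorical split (hypothetical value)
    lambda x: x["item"] == "i7",    # categorical split (hypothetical value)
]
```

With 3 levels the tree has 2³ = 8 leaves, one value per outcome pattern.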
• Training classifiers
• GB with 10k trees for each classifier
• ≈12 hours to train both models on 150 machines
• Making predictions
• We made 4000 predictions per second per thread
Classification method: speed
Threshold optimization
We optimized the thresholds using the validation set (a 10% hold-out from the train-set):
1) Maximize the Jaccard score
2) Maximize the Purchase+Jaccard scores using the fixed bought-item threshold

Q(h, S_{\mathrm{valid}}) = \underbrace{\frac{|S^b_{\mathrm{valid}}|}{|S_{\mathrm{valid}}|}\,(\mathrm{TP} - \mathrm{FP})}_{\text{purchase score}} + \underbrace{\sum_{s \in S_{\mathrm{valid}}} J(y(s), h(s))}_{\text{Jaccard score}}
we train purchase detection and purchased item detection classifiers. The purchase detection classifier hp(s) predicts the outcome of the function yp(s) = isNotEmpty(y(s)) and uses the entire training set in the learning phase. The item detection classifier hi(s, j) approximates the indicator function yi(s, j) = I(j ∈ y(s)) and uses only sessions with bought items in the learning phase. Of course, it would be wise to use classifiers that output probabilities rather than binary predictions, because in this case we will be able to select thresholds that directly optimize the evaluation metric (1) instead of the classifier's internal quality measure. So, our final expression for the hypothesis can be written as
h(s) = \begin{cases} \emptyset & \text{if } h_p(s) < \alpha_p, \\ \{j \in I \mid h_i(s, j) > \alpha_i\} & \text{if } h_p(s) \geq \alpha_p. \end{cases} \quad (2)
3.2 Feature Extraction

We have outlined two groups of features: one describes a session and the other describes a session-item pair. The purchase detection classifier uses only session features and the item detection classifier uses both groups. The full feature listing can be found in Table 1; for further details, please refer to our code². We give some comments on our feature extraction decisions below.
One could use sophisticated aggregations to extract numerical features that describe items and categories. However, we utilize a simpler approach and add raw identifiers to the feature space instead. A learning method suitable for such a feature space will be discussed in the next section.
We have found it useful to calculate features based on what is called dwell time [6] in information retrieval. We define the duration of the click on the item j as the number of seconds between that click and the next click. We have calculated features such as the total duration of the item within the session, the total duration of the item's category, etc.
3.3 Classification Method

The main challenge of our dataset is that it contains dozens of categorical features with tens of thousands of levels. Popular ensembling libraries (e.g. XGBoost, ensemble in sklearn, gbm in R) do not support categorical features directly and require them to first be transformed into real-valued features, e.g. by one-hot encoding. However, one-hot encoding of our dataset would lead to an infeasible number of features. To overcome this problem, we relied on Yandex's proprietary machine learning tool called MatrixNet [3]. It is an implementation of gradient boosting [2] over oblivious decision trees [4] with careful leaf value weighting based on their variance. To extend this approach to categorical features, MatrixNet uses hash tables as base learners. One hash table corresponds to a small subset of features and contains predictions for all value combinations of these features seen on the training set. MatrixNet also builds SVD-like low-rank approximations of hash tables for the case when it encounters new feature value combinations in the test set (for example, suppose that a table contains the features "User ID" and "Movie ID", and we want to predict a user's rating for a film the user has not seen before).
Friedman [2] proposed to select features for decision trees based on MSE gain. It works quite well for real-valued features, but the presence of categorical features in the training
² https://github.com/romovpa/ydf-recsys2015-challenge
Figure 3: Item detection threshold (above) and purchase detection threshold (below) quality on the validation set.
set can lead to more complex hash tables and severe overfitting due to the nature of MSE gain. We cope with this problem in MatrixNet by using a generalization-bound-based criterion (e.g. [5]) for feature selection.
Since both learning tasks in our solution schema are binary classification tasks, we have selected binary log-likelihood as the loss function. To improve convergence for this function, MatrixNet performs an additional optimization. When the hash table is built and the predictions in each record are calculated based on MSE (recall that weak learners are always trained using the MSE criterion regardless of the actual objective function [2]), MatrixNet performs univariate optimization in each hash table record separately by gradient descent over the log-likelihood. We have trained the classification models in distributed mode on Yandex's cluster of 150 machines. It took about 12 hours to train both models, and their size turned out to be about 60 gigabytes. The final prediction for the test set was generated in 10 minutes in one thread on a single machine, which corresponds to a prediction speed of approximately 4000 sessions per second.
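The per-record refinement can be illustrated with a toy univariate update: gradient ascent on the binary log-likelihood of the examples falling into one hash-table record (a sketch of the idea described above, not MatrixNet's exact rule; all names are mine):

```python
import math

def refine_leaf_value(value, labels, lr=0.1, steps=500):
    """Refine a single record (leaf) value by gradient ascent on the
    binary log-likelihood of the examples that fall into it,
    with labels in {0, 1} and value interpreted as a logit."""
    for _ in range(steps):
        # d/dv sum_i [ y_i log σ(v) + (1 - y_i) log(1 - σ(v)) ]
        #   = sum_i (y_i - σ(v)),  where σ is the sigmoid
        grad = sum(y - 1.0 / (1.0 + math.exp(-value)) for y in labels)
        value += lr * grad / len(labels)
    return value
```

The value converges to the log-odds of the positive rate among the record's examples, e.g. log 3 for labels [1, 1, 1, 0].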
3.4 Threshold Optimization

We have selected 90% of the training set for learning purposes and 10% for validation. The purchase and item detection classifiers hp(s) and hi(s, j) were trained by MatrixNet on the learning part, with the optimal number of weak learners selected on the validation set. Then we used the validation set to find the best thresholds for the purchase detection (αp ≈ −2.97) and item detection (αi ≈ −0.43) classifiers.
To make the final predictions on the test set, we first maximize the mean Jaccard similarity for validation samples by choosing the optimal item threshold αi. Then we fix αi and optimize the competition score (1) by choosing the optimal purchase threshold αp. The first optimization is performed by using a standard univariate optimization method, the ob-
• Leaderboard: 63102 (1st place)
• Purchase detection on validation (10% hold-out):
  • 16% precision
  • 77% recall
  • AUC 0.85
• Purchased item detection on validation:
• Jaccard measure 0.765
• Features / datasets used to learn classifiers / evaluation process can be reproduced, see our code¹
Final results
¹ https://github.com/romovpa/ydf-recsys2015-challenge
1. Observations from the problem statement
   › The task is complex but decomposable into two well-known problems: binary classification of sessions and of (session, ItemID)-pairs
2. Observations from the data (user click sessions)
   › Features from sessions and (session, ItemID)-pairs
   › Easy to develop many meaningful categorical features
3. The algorithm
   › Gradient boosting on trees with categorical features
   › No sophisticated mixtures of Machine Learning techniques: one algorithm to work with many numerical and categorical features
Summary / Questions?