POLITECNICO DI MILANO
Corso di Laurea Magistrale in Ingegneria Informatica
Dipartimento di Elettronica e Informazione
Multi-Stack Ensemble for Job
Recommendation
Advisor: Prof. Paolo Cremonesi
Co-advisor: Ing. Roberto Pagano
Master's thesis by:
Tommaso Carpi, student ID 836986
Marco Edemanti, student ID 838979
Academic Year 2015-2016
Abstract
Recommender Systems are a subclass of information filtering systems that
try to predict the preferences of users with respect to a set of items.
This thesis was developed in collaboration with TU Delft and Xing AG, a Business Social Network, which gave us the dataset used in this research.
The technique that we created is called Multi-Stack Ensemble and it consists of a series of different hybridization layers. The general idea is to ensemble algorithms in batches, starting with weak learners at the bottom, so that recommendations become more and more accurate as we climb the layers of the stack.
As a proof of the quality of our results, we participated in the RecSys Challenge 2016, organized among others by TU Delft and Xing, representing Politecnico di Milano as team "PumpkinPie" and stating that we would renounce any monetary prize, given our relationship with the organizers.
We tackled the problem using our Multi-Stack Ensemble, which performed really well and allowed us to finish in 4th position and 1st among Academic teams (the first 3 were companies) out of more than 120 teams. We were the only Academic team composed of Master students.
We were then invited to the ACM RecSys Conference, held in Boston at the Massachusetts Institute of Technology, to present our solution to researchers and companies from all over the world. There we also received a special mention as the youngest team during the prize-giving ceremony. Our paper "Multi-Stack Ensemble for Job Recommendation" was then accepted and published in the ACM RecSys proceedings. Politecnico di Milano also awarded us a scholarship for these results.
Sommario
Recommender Systems are a subclass of information filtering systems that predict the preferences of users with respect to a set of items. This thesis was developed in collaboration with TU Delft and Xing AG, a Business Social Network, which provided the dataset for this research.
The technique we created is called Multi-Stack Ensemble and consists of a series of different hybridization layers. The general idea is to combine the algorithms in batches, starting with the weaker ones at the bottom, so as to obtain more and more accurate recommendations as we climb the layers of the stack.
As proof of the quality of our results we participated in the RecSys Challenge 2016, organized among others by TU Delft and Xing, representing Politecnico di Milano as team "PumpkinPie" and declaring that we would renounce any monetary prize, given our relationship with the organizers.
Our technique gave excellent results, allowing us to finish in 4th place and 1st among Academic teams (the first 3 were companies) out of more than 120 participants. We were the only team composed of Master students. We then received an invitation to the ACM RecSys Conference, hosted at MIT in Boston, to present our solution to researchers and companies from all over the world, and we also received a special mention as the youngest team during the prize-giving ceremony.
Our paper "Multi-Stack Ensemble for Job Recommendation" was also accepted and published in the ACM RecSys proceedings.
Finally, Politecnico di Milano awarded us a scholarship for our results.
Contents

Abstract
Sommario
1 Introduction
1.1 Thesis Structure
2 Problem Description
2.1 Job Recommendation
2.2 ACM RecSys Challenge
2.3 Our Approach
2.4 Technologies
3 State of the Art
3.1 Ensemble in the context of Recommender Systems
3.2 Ensemble Techniques
3.2.1 Voting
3.2.2 Averaging
3.2.3 Rank Averaging
3.2.4 Interleave
3.2.5 Stacked Generalization
3.2.6 Blending
3.2.7 Collaborative via Content
3.2.8 Monolithic Hybridization
4 Evaluation
4.1 Evaluation Metrics
4.1.1 Recall
4.1.2 Precision at K
4.1.3 User Success
4.1.4 Competition Metric
5 Dataset
5.1 Datasets
5.1.1 User Profile
5.1.2 Item Profile
5.1.3 Interactions
5.1.4 Impressions
5.1.5 Test Users
6 Source Algorithms
6.1 Past Interaction & Past Impression
6.2 Collaborative Filtering Algorithms
6.3 Content Based Algorithm
7 Ensemble Technique
7.1 Multi-Stack Ensemble
7.2 Voting-Based Methods
7.2.1 Linear Ensemble
7.2.2 Evaluation Score Ensemble
7.3 Reduce Function
7.4 Stack Layers
7.4.1 Layer 1
7.4.2 Layer 2
7.4.3 Layer 3
8 Results
8.1 Layers Tuning
8.1.1 Input Algorithms
8.1.2 Layer 1.1
8.1.3 Layer 1.2
8.1.4 Layer 2
8.1.5 Layer 3
8.2 Ensemble Comparison
9 Conclusion and Future Developments
List of Figures

2.1 Cluster mode overview
3.1 Hybrid recommender system's scheme
3.2 Stacked Generalization scheme
3.3 Collaborative via Content scheme
7.1 Ensemble Hierarchy
7.2 Stack Layer Structure
7.3 Reduce Function Example. Inside each block the values represents the structure (item, rating)
List of Tables

3.1 Voting Example (1)
3.2 Voting Example (2)
3.3 Voting Example (3)
3.4 Voting Example (4)
3.5 Voting Example (5)
3.6 Score Averaging Example
3.7 Rank Averaging Example
3.8 Interleave Example
5.1 Dataset
5.2 User features
5.3 Item features
5.4 Interactions description
5.5 Impressions dataset
7.1 Linear Ensemble Example
7.2 Linear Ensemble Interleaving Example
7.3 Weight and Decay set of values
7.4 Evaluation Score Example
7.5 I/O Layer 1
7.6 I/O Layer 2
7.7 I/O Layer 3
8.1 Scores of Input Algorithms. The Score value is obtained using the evaluation metric described in Section 4.1.4
8.2 Points per item of an algorithm
8.3 Layer 1.1 (Collaborative Filtering) Linear Method Parametrization
8.4 Layer 1.1 Results. The Score value is obtained using the evaluation metric described in Section 4.1.4
8.5 Layer 1.2 (Content-Based) Linear Method Parametrization
8.6 Layer 1.2 Results. The Score value is obtained using the evaluation metric described in Section 4.1.4
8.7 Layer 2 Linear Method Parametrization
8.8 Layer 2 Results. The Score value is obtained using the evaluation metric described in Section 4.1.4
8.9 Layer 3 Linear Method Parametrization
8.10 Layer 3 Results. The Score value is obtained using the evaluation metric described in Section 4.1.4
8.11 Ensemble Comparison. The Score value is obtained using the evaluation metric described in Section 4.1.4
Chapter 1
Introduction
Recommender Systems are a subclass of information filtering systems that
try to predict the preferences of users with respect to a set of items. The extensive growth of product catalogues made it necessary to create automatic systems that could help customers discover relevant items, and in doing so they changed the way websites communicate with their users. Rather than providing a static experience in which users search for and potentially buy products, Recommender Systems increase interaction to provide a richer experience. It is not just about predicting preferences to maximize click-through: the goal of a Recommender Engine is to "Assist users in accessing and understanding large digital collections in domains subject to significant personal taste" [1].
In recent years they have become really popular and are employed in a va-
riety of domains: some popular applications include movies, music, news,
books, jobs, e-commerce, e-tourism and online dating.
This thesis work was developed in collaboration with TU Delft and Xing AG,
and it explores an innovative approach for the creation of hybrid models in a
particular domain of Recommender Systems, which is job-recommendation.
Job Recommender Systems [2] [3] were born when Internet-based recruiting platforms became a primary recruitment channel for most companies, thanks to lower recruitment times and advertising costs. Many platforms exist to connect companies with employees, the so-called "Business Social Networks" such as LinkedIn [4] [5] and Xing AG. Since the company and user databases grew rapidly, a Recommender Engine is needed to help users discover jobs that match their personal interests.
The whole recruiting and hiring process has some peculiarities with respect to better-known Recommender System domains such as movies, music and e-commerce. In fact job postings, i.e. open positions posted by a company, are limited-quantity items whose expiration time is not predictable: at some point in time the company either hires a candidate or decides to withdraw the position. Moreover, employment may require life changes, and factors like geographical relocation influence users differently; it is not as simple as choosing a movie on your sofa or buying an item shipped to your address.
Ratings, i.e. preferences expressed by a user with respect to an item, are also implicit. We can only infer information from the user's behaviour, e.g. whether they clicked on or replied to a job offering. This increases the difficulty of the task, because a bad interpretation of the users' preferences may lead to poor results. On the contrary, an explicit rating system helps to clearly identify what users like or dislike, since they consciously leave a preference. For example, reviewing a movie with 1 star out of 5 is a strong signal that the user didn't like it.
As we can see job-recommendation tackles different problems with respect
to other domains, which makes it interesting from a research point of view.
The aim of this thesis work is to develop a new ensemble technique, trying
to overcome the limitations of single learners to achieve a better accuracy.
The basic models that we use as input algorithms are those presented in the thesis work of Elena Sacchi and Ervin Kamberoski [6]. They consist of two Collaborative Filtering and two Content-Based techniques, plus one list of recommendations obtained by processing the interactions (i.e. job postings that users actively clicked on in the past). To these models we added four other Collaborative Filtering techniques, which either train on or recommend impressions (i.e. job postings recommended by the Xing recommender systems), and one other list of items derived by processing the impressions directly from the data.
The dataset provided by Xing AG is the same used in the RecSys Challenge 2016, a competition hosted for the annual RecSys Conference, of which Xing and TU Delft were organizers. The topic of the challenge was job recommendation and the problem was "next click prediction": given a Xing user, the goal was to predict the job postings that the user would positively interact with.
The technique that we created is called Multi-Stack Ensemble; it consists of a series of different hybridization layers. The general idea is to ensemble the algorithms in batches, starting with weak learners at the bottom, so that recommendations become more and more accurate as we climb the layers of the stack. Inside each layer a 2-step function is applied to combine the recommendations: first a voting method is called in order to assign a score to every item, based on the characteristics of the input algorithms, then a reduce function is applied to sort and select the top items for the final recommendation. Our implementation for the Xing dataset uses three different layers.
The novel idea behind Multi-Stack ensembling is that one can create hybridization batches and apply different techniques to combine recommendations depending on the input algorithms. In each layer different voting methods, or even totally different mechanisms, can be used. We tested our solution on the aforementioned dataset and compared the results with other state-of-the-art hybridization techniques. Multi-Stack Ensemble outperforms all of them, achieving a higher accuracy, as shown in Table 8.11.
As a proof of the quality of our results, we participated in the RecSys Challenge 2016 representing Politecnico di Milano as team "PumpkinPie", stating that we would renounce any monetary prize given our relationship with TU Delft and Xing.
We tackled the problem using our Multi-Stack Ensemble, which performed really well and allowed us to finish in 4th position and 1st among Academic teams (the first 3 were companies) out of more than 120 teams. We were the only Academic team composed of Master students.
We were then invited to the ACM RecSys Conference, held in Boston at the Massachusetts Institute of Technology, to present our solution to researchers and companies from all over the world. There we also received a special mention as the youngest team during the prize-giving ceremony.
Our paper "Multi-Stack Ensemble for Job Recommendation" (Tommaso Carpi, Marco Edemanti, Ervin Kamberoski, Elena Sacchi, Paolo Cremonesi, Roberto Pagano, Massimo Quadrana) was then accepted and published in the ACM RecSys proceedings.
Politecnico di Milano also awarded us a scholarship for these results.
1.1 Thesis Structure
This thesis work is structured in the following way:
In Chapter 2 we first describe the job-recommendation domain, followed by a description of the RecSys Challenge. At the end we present the technologies used for both this thesis work and the competition.
In Chapter 3 we present the State of the Art of ensemble techniques, first describing why hybrid models are so powerful and then listing a series of well-known techniques used in both research and business environments.
In Chapter 4 we show the evaluation metric used internally at Xing, which was adopted for the competition and also used for this thesis work.
In Chapter 5 we describe the dataset used for this thesis work.
In Chapter 6 we present the input algorithms used for our Multi-Stack ensemble.
In Chapter 7 we present our Ensemble technique, first describing the 2-step algorithm we implemented: (i) the two voting-based methods, Linear and Evaluation-Score, and (ii) the Reduce function. Then we describe the overall architecture of our stack structure, covering each of the three layers of the hierarchical stack and showing the input/output parameters and the voting technique used to compute the final recommendation.
In Chapter 8 we present the results of our technique with respect to the state-of-the-art algorithms that were implemented, showing how, in this domain, our solution achieves a better accuracy.
In Chapter 9 we draw the conclusions of this thesis work and discuss possible future developments.
Chapter 2
Problem Description
This chapter highlights the problems faced during both our collaboration with Xing AG and TU Delft and our participation in the ACM RecSys Challenge 2016. In the first section we present some information about the job recommendation scenario. In the second section we describe the challenge and the problems that came with it. In the third section we describe how we propose to solve these problems in our work, and the fourth section explains the technologies involved in the deployment of our infrastructure.
2.1 Job Recommendation
Recommender Systems have multiple useful applications in the business world; one of them regards Business Social Networks, in which the goal is to suggest to a user a potential job that fits their skills and interests.
Thanks to the data provided by Xing AG, a Business Social Network well known and used in German-speaking countries, we were able to explore and exploit some characteristics that are peculiar to the job recommendation scenario with respect to the more famous e-commerce or movie recommendations. One of the most interesting characteristics is the fact that users perform multiple interactions with the same job. On a movie platform it is really uncommon to recommend an already-seen item, since users are not likely to watch the same movie again, but in the job domain it happens frequently for many reasons: users may be interested in reviewing the description of the jobs, replying to the posting or simply comparing two offers. It also follows that the preferences expressed by a user are implicit: we don't actually know how much a user is interested in a job posting, and this increases the difficulty of the task because a bad interpretation of the user behavior may lead to poor results; in an explicit rating system, on the other hand, users consciously leave a preference.
Another interesting aspect is the sometimes odd relation between a user and an item: for example, we found top managers clicking on internship job postings for new graduates. This is of course not a random event; if we think about the domain we can find some explanation for this strange behaviour: the top manager may have children who use his/her account, probably with premium membership, to look for jobs, or the manager may be interested in the way competitors hire people. All this shows that it might actually be valuable to recommend internships to top managers based on their past interactions.
Of course, all of these considerations and difficulties were taken into account when deploying our ensemble technique.
2.2 ACM RecSys Challenge
The ACM RecSys Conference 2016, hosted in Boston at the Massachusetts Institute of Technology, found the job-recommendation problem so interesting that a competition was organized on the topic. The RecSys Challenge 2016 [7] was organized by Xing AG, which provided as source data the same dataset we were given. We decided to take part in the competition representing Politecnico di Milano as team "PumpkinPie" and, for correctness, having a collaboration with Xing AG, we refused any money prize in case of victory. For us it was the best opportunity to show that our ideas and solutions actually worked in a real environment, and luckily our approaches turned out to be as effective as in the local tests.
The task of the challenge was to predict the job postings that were likely to be relevant for each user. Participants had to provide a set of up to 30 recommendations, ranked by relevance, for each of the 150k users in the test set. The evaluation metric was a hybrid of different classic metrics that we will describe in Chapter 4.
In our solution we started working with both Content-Based and Collaborative Filtering techniques, whose recommendations were then combined using an ensemble technique called "Multi-Stack Ensemble" in order to overcome the weak points of the single learners.
We finished the competition in 4th position, ranking 1st among the Academic teams (the first 3 were companies) out of more than one hundred teams, and we also received a mention as the youngest participating team; in fact our team was composed only of Master students.
2.3 Our Approach
One of the biggest issues when dealing with ensemble techniques is surely the fact that there is no single approach: the solutions are often domain dependent and you actually have to try them all in order to come up with a decent result; moreover, there was no literature regarding how to deal with the job recommendation scenario.
We thought that a good way to start was to deploy the basic hybrid approaches, common to every recommender system, see their performance and then, according to the assumptions made in Section 2.1, customize these solutions to see if they performed better.
In the end we came up with a fresh idea that tries to unify the peculiarities of all the basic ensemble techniques; probably one of its best key features is that you just need the submission files to obtain a new recommendation, no additional models are required.
We might say that we had a trial-and-error approach [8] in order to achieve the best possible result; of course this does not mean that we randomly changed the parameters, but rather that we manipulated the variables methodically in an attempt to find the best possible configuration.
Our method proved to be successful: indeed our solution was better than any traditional hybrid recommender system known so far in this domain.
2.4 Technologies
All of our research work was deployed such that all the experiments were reproducible regardless of the size of the input. We thought that, in order to have a fully customizable and scalable system, all of our algorithms should be re-implemented from scratch; that is why we did not use any framework or library, as this gave us the possibility to implement or change any small detail of our algorithms in a way that is less painful and faster than working with a black-box system.
To achieve this purpose we developed our "infrastructure" using different tools:
• Python 2.7: a widely used high-level, general-purpose, interpreted, dynamic programming language; all of our scripts and algorithms are written in Python;
• Apache Spark 1.6: an open source cluster computing framework and a fast, general engine for big data processing, with built-in modules for streaming, SQL, machine learning and graph processing. We used it to parallelize our tasks and to have a cluster infrastructure available when the computational effort becomes too heavy for a standalone machine. Figure 2.1 provides a rough idea of how a network of multiple nodes is managed: you just need to specify which node is the driver and which nodes are workers, and Apache Spark will automatically split the workload among the network (a minimal usage sketch is given after this list);
Figure 2.1: Cluster mode overview
• PoliCloud: the IaaS cloud designed, managed, and deployed by Politecnico di Milano. It is a cloud infrastructure for research and experimentation on big data, distributed computing, cloud architectures and the Internet of Things. The datasets provided by Xing were too big to be managed on our machines, thus we used this infrastructure to run our experiments; we were provided with 5 different machines, each having 16 GB of RAM and 8 cores. On top of this small infrastructure we then set up Apache Spark;
• Jupyter Notebook: a web application that allows you to create and share documents that contain live code, equations, visualizations and explanatory text. We used it as our high-level IDE;
• Graphlab: an extensible machine learning framework that enables developers and data scientists to easily build and deploy intelligent applications and services at scale. We used this framework only to perform data analysis, because we found that it offers some solutions faster than those offered by Apache Spark.
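To give a concrete feel for this setup, the following is a minimal PySpark sketch of how per-user work can be distributed over such a cluster. The master URL, file name and the body of recommend_for_user are placeholders for illustration only, not the actual thesis code.

```python
from pyspark import SparkConf, SparkContext

# Minimal sketch: distribute per-user recommendation work across the cluster.
# "spark://driver-host:7077" and "test_users.csv" are assumed placeholders.
conf = SparkConf().setAppName("multi-stack-ensemble").setMaster("spark://driver-host:7077")
sc = SparkContext(conf=conf)

def recommend_for_user(user_id):
    # placeholder: compute up to 30 ranked job postings for this user
    return user_id, []

user_ids = sc.textFile("test_users.csv").map(lambda line: int(line.strip()))
recommendations = user_ids.map(recommend_for_user).collect()
sc.stop()
```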
Chapter 3
State of the Art
In this chapter we present the state of the art of Ensemble Techniques [9] for Recommender Systems.
In the first part we describe what an Ensemble is in this domain and why it can be so powerful, especially to overcome the limitations of standalone learners.
In the second part we present the best-known techniques used in both research and business environments that are applicable to the context of job recommendation.
3.1 Ensemble in the context of Recommender Systems
The most prominent recommendation approaches, discussed in the thesis of Elena Sacchi and Ervin Kamberoski [6], exploit different sources of information and follow different paradigms to create a recommendation. Even if they produce results that are considered to be personalized based on the assumed interests of their users, they perform with varying degrees of accuracy depending on the quality of the data and the application domain. Collaborative Filtering [10] exploits a specific type of information from the user model together with community data to derive recommendations, while Content-Based [11] approaches rely on product features and/or user features.
Each of these basic approaches has its pros and cons: for instance, the former is able to exploit trends and increase serendipity in the recommendations, suggesting new items that may not be related to previously interacted ones, while the latter can mitigate the cold-start problem, providing recommendations to new users whose profile information is too scarce or anomalous to give the collaborative technique any traction. However, none of the basic approaches is able to fully exploit all of these characteristics, therefore hybrid systems help to overcome these limitations.
An excellent example of combining different recommendation algorithm variants is the Netflix Prize competition, in which hundreds of students and researchers worked to improve a collaborative movie recommender engine by hybridizing hundreds of different Collaborative Filtering techniques to improve the overall accuracy. Figure 3.1 gives a high-level overview of a hybrid recommendation system: starting from different recommendation sources as input data, it combines them and outputs a new, enriched item list.

Figure 3.1: Hybrid recommender system's scheme

Usually the methods involved in the hybridization step are based on very different approaches in order to smooth out the errors of the individual techniques, but as you will see in Chapter 7 there is no reason why several different techniques of the same type could not be hybridized: for example, two or more different Collaborative Filtering systems could work together.
Unfortunately there is little about hybrid recommender systems in the current state of the art, probably because the problem is really context-dependent, making it difficult to identify a standard solution. Therefore we had very few materials to work with while creating our Multi-Stack Ensemble: we started from the knowledge of standard techniques, but then tried to implement an innovative approach that could fit the job-recommendation domain well.
3.2 Ensemble Techniques
3.2.1 Voting
Voting ensemble techniques work in the same way as simple error-correcting codes. The simplest error-correcting code is a repetition code, where the string of bits is repeated n times and the correct original sequence is extracted using a majority vote. So if by chance one bit of a string is corrupted, it is likely that all the other copies still have the correct value; applying a majority vote, our output will then be the original string.
This technique is generally used for machine learning classification problems, but it is also used in Recommender Systems. We will discuss some applications later in this section.
Now let us describe the main idea behind the algorithm. Suppose we have a test set of 10 samples whose ground truth is ten times "1".
1111111111
We then have 3 binary classifiers (A, B and C) with 70% accuracy, which means each of them predicts seven 1s and three 0s.
That being said, our majority vote technique has 4 possible outcomes for each triple of bits:
• All 3 are correct (i.e. there are three 1s):
0.7 × 0.7 × 0.7 = 0.343
• Only 2 are correct:
0.7 × 0.7 × 0.3 + 0.7 × 0.3 × 0.7 + 0.3 × 0.7 × 0.7 = 0.441
• Only 1 is correct:
0.7 × 0.3 × 0.3 + 0.3 × 0.3 × 0.7 + 0.3 × 0.7 × 0.3 = 0.189
• All 3 are wrong:
0.3 × 0.3 × 0.3 = 0.027
What we see from these statistics is that about 44% of the time the majority vote corrects the error. To wrap everything up, we can say that overall this technique gives our ensembled prediction an accuracy of about 78% (0.343 + 0.441), higher than that of each single learner. So it actually improves our recommendation.
Just as with error-correcting codes, the more replicated predictions we have, i.e. the more basic learners, the more accurate our final result will be. As a matter of fact, using the above example with 5 binary classifiers rather than 3 we would get about 83% accuracy.
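As a sanity check of these numbers, the small script below (an illustration, not part of the thesis code) enumerates all outcomes of n independent classifiers with accuracy p and sums the probability that the majority is correct:

```python
from itertools import product

def majority_vote_accuracy(n_classifiers, p):
    """Probability that the majority of n independent classifiers,
    each correct with probability p, gives the correct answer."""
    total = 0.0
    for outcome in product([1, 0], repeat=n_classifiers):
        correct = sum(outcome)
        if 2 * correct > n_classifiers:          # strict majority is correct
            total += (p ** correct) * ((1 - p) ** (n_classifiers - correct))
    return total

print(majority_vote_accuracy(3, 0.7))   # ~0.784
print(majority_vote_accuracy(5, 0.7))   # ~0.837
```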
From here we can go a step further by noting that the more uncorrelated the single predictions are, the higher the accuracy of our final model will be. Let us start from the previous example, where we had highly correlated models.
Description Configuration Accuracy
Inputs
1111111100 80%
1111111100 80%
1011111100 70%
Ensemble 1111111100 80%
Table 3.1: Voting Example (1)
Applying the same algorithm as before we see no improvement in accuracy, which stays at 80%, as shown in Table 3.1.
Now let’s try three different models which may be less accurate, but highly
uncorrelated.
Description Configuration Accuracy
Inputs
1111111100 80%
0111011101 70%
1000101111 60%
Ensemble 1111111101 90%
Table 3.2: Voting Example (2)
When ensembling with a majority vote we get 90% accuracy, as shown in Table 3.2, which is a huge improvement with respect to the basic models.
A further improvement can be obtained using a weighting technique: since it is unlikely that all the input models are equally accurate, it makes sense to assign a higher weight to the better models. Obviously the counterpart is that low-weighted models only lightly affect high-weighted ones, leading to a smaller improvement in accuracy.
As you may imagine, this is mostly used in machine learning classification problems, but it can also be implemented in the context of Recommender Systems. For example, one can see the problem as a binary classification, where one class contains the list of recommended items while the other contains all the remaining ones.
The simplest approach is to apply a majority vote directly on the output of each basic learner. Suppose we have a list of all possible recommendable items, with value 1 if the item was recommended by that algorithm and 0 otherwise. The majority vote would select those items which are recommended by multiple algorithms, exploiting the fact that if different learners recommend the same item it is probably a good recommendation for the user.
Let us see this with an example: imagine having a list of 10 recommendable items from which you have to provide 3 recommendations. We have our three learners A, B and C, where a 1 in position i means that the i-th item is recommended. What we can do is count the number of times an item i is recommended and then select the 3 most voted ones.
Description Configuration
Inputs
1010010000
1001001000
1010001000
Occurrences 3021012000
Top 3 Rec 1010001000
Table 3.3: Voting Example (3)
Usually some learners are better than others, so it makes sense to give different weights to different algorithms. Following the example above we may obtain Table 3.4: as you can see, we get a slightly different recommendation given the fact that the three input algorithms have different weights.
The example that we have just shown is based on the output of each algorithm, so we basically perform a standalone recommendation using each single learner and then mix them together in a second step. This method is the simplest and most efficient one, as you can add new algorithms over time without the need to re-run the previous techniques: you simply use their output.
Description Configuration Weight
Inputs
2020020000 2
4004004000 4
1010001000 1
Sum 7034025000 -
Top 3 Rec 1001001000 -
Table 3.4: Voting Example (4)
There is another implementation of the voting technique which uses an explicit rating r_a(u, i), that is the rating that algorithm a assigned to item i recommended for user u. Let us see an example: suppose that each learner assigns a rating to each recommended item on a scale of 0-10, as in Table 3.5.
Description Configuration
Inputs
1080080000
4003001000
1010006000
Sum 6093087000
Top 3 Rec 0010011000
Table 3.5: Voting Example (5)
What the voting technique does is sum the ratings given to each item, so even though the 1st item is recommended by all 3 input algorithms it is not going to be recommended as a result of the ensemble, since its final rating is lower than that of other items. Obviously this implementation can also take advantage of algorithm weighting, simply multiplying the weight and the rating, w_a × r_a(u, i).
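To make the mechanism concrete, the following sketch (illustrative code, with item ids in place of the bit strings of the tables) implements both the plain count vote of Table 3.3 and the weighted vote of Table 3.4:

```python
from collections import defaultdict

def vote(recommendations, weights=None, k=3):
    """Weighted voting over recommendation lists.

    recommendations: dict algorithm -> list of recommended item ids
    weights:         dict algorithm -> weight (defaults to 1 for every algorithm)
    Returns the k items with the highest (weighted) vote count.
    """
    weights = weights or {}
    scores = defaultdict(float)
    for algo, items in recommendations.items():
        for item in items:
            scores[item] += weights.get(algo, 1.0)
    return sorted(scores, key=scores.get, reverse=True)[:k]

# Items are numbered 1..10; the lists mirror Tables 3.3 and 3.4.
recs = {"A": [1, 3, 6], "B": [1, 4, 7], "C": [1, 3, 7]}
print(vote(recs))                                      # count vote: items 1, 3, 7
print(vote(recs, weights={"A": 2, "B": 4, "C": 1}))    # weighted vote: items 1, 7, 4
```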
3.2.2 Averaging
Ensemble averaging is one of the most common strategies applied in the machine learning field. It works well for a wide range of problems (both classification and regression) and metrics (AUC, squared error or logarithmic loss).
The main idea is to create multiple models and combine them to produce the desired output. Often an ensemble of models performs better than any individual model, because the various errors of the models average out; indeed this approach should prevent overfitting: we create multiple predictors with low bias and high variance and then hopefully combine them into a predictor with low bias and low variance.
Generally, in a machine learning problem this means creating a set of learners with varying parameters, such as the learning rate, momentum, etc., and then averaging their results. In the recommender systems scenario, instead, the averaging strategy combines the recommendations of two or more different recommendation systems by computing the average of their scores. This indicates that when adopting this approach we should prefer averaging scores coming from completely different approaches, such as Collaborative Filtering and Content-Based Filtering.
Thus, given n different rating functions r_k with associated relative weights β_k, the final score of a user u for an item i is:

r_{weighted}(u, i) = \sum_{k=1}^{n} \beta_k \times r_k(u, i)    (3.1)

where all r_k(u, i) need to be normalized to have consistent values among different recommendations and \sum_{k=1}^{n} \beta_k = 1.
This technique is quite straightforward, and that is why it is a popular strategy for combining the predictive power of different recommendation techniques. Consider an example in which two recommender systems are used to suggest one out of five items to a user, Alice. As can easily be seen from Table 3.6, the two recommendation lists are hybridized using a uniform weighting scheme with β1 = β2 = 0.5. Item c is then the one that on average received the highest score.
item   r1 score   r2 score   r_weighted score
a      1          4          2.5
b      2          1          1.5
c      3          5          4
d      4          3          3.5
e      5          2          3.5
Table 3.6: Score Averaging Example
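A minimal sketch of Equation 3.1, reproducing the example of Table 3.6 (illustrative code, not the thesis implementation):

```python
def weighted_average(score_lists, betas):
    """Weighted score averaging (Equation 3.1).

    score_lists: dict recommender -> {item: normalized score}
    betas:       dict recommender -> weight, with the weights summing to 1
    """
    items = set().union(*(s.keys() for s in score_lists.values()))
    return {i: sum(betas[r] * score_lists[r].get(i, 0.0) for r in score_lists)
            for i in items}

# Example of Table 3.6, with beta1 = beta2 = 0.5
r1 = {"a": 1, "b": 2, "c": 3, "d": 4, "e": 5}
r2 = {"a": 4, "b": 1, "c": 5, "d": 3, "e": 2}
scores = weighted_average({"r1": r1, "r2": r2}, {"r1": 0.5, "r2": 0.5})
print(max(scores, key=scores.get))   # 'c' receives the highest average score
```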
3.2.3 Rank Averaging
When averaging is used to ensemble multiple different models, some problems may arise because not all predictors are consistent in their score assignment, or they may have different scales. It is therefore good practice to normalize the values before applying an averaging technique. The solution proposed here, instead, is to use the rank of the items. Each recommendation is nothing but an ordered list, so what we do is exploit the position of each item: the first item will have rank 1 while the n-th will have rank n. After this fast processing phase we can apply the averaging technique on the ranks. The basic implementation simply performs an arithmetic average of the ranks that each input algorithm assigned to an item i:
r'(i) = \frac{1}{n} \sum_{k=1}^{n} r_k(i)    (3.2)
Suppose we have 5 recommendable items [a, b, c, d, e] and 2 basic algorithms M and N; in Table 3.7 we show each item associated with its corresponding rank. The rec_ens column represents the average rank computed using Equation 3.2, while the rec_final column is the ordered final recommendation.
item   rec1 rank   rec2 rank   rec_ens rank   rec_final rank
a      1           4           2.5            2
b      2           1           1.5            1
c      3           5           4              5
d      4           3           3.5            3
e      5           2           3.5            4
Table 3.7: Rank Averaging Example
As you can see, the result of this technique is different from the one obtained in Table 3.6, where we applied the averaging to the ratings r_a(u, i).
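A short sketch of Equation 3.2 on the same example (illustrative code only):

```python
def rank_average(ranked_lists):
    """Rank averaging (Equation 3.2): average the rank each algorithm assigns
    to an item, then sort by the averaged rank (lower is better)."""
    ranks = {}
    for items in ranked_lists:
        for position, item in enumerate(items, start=1):
            ranks.setdefault(item, []).append(position)
    avg = {item: sum(r) / float(len(r)) for item, r in ranks.items()}
    return sorted(avg, key=avg.get)

# Example of Table 3.7: M = [a, b, c, d, e], N = [b, e, d, a, c]
print(rank_average([["a", "b", "c", "d", "e"], ["b", "e", "d", "a", "c"]]))
# ['b', 'a', 'd', 'e', 'c'], i.e. the rec_final column of Table 3.7
```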
3.2.4 Interleave
When it is practical to make a large number of recommendations simultaneously, it may be possible to use a hybrid approach where recommendations from more than one technique are presented together. Interleaving is a trivial hybrid technique that alternates the recommendations of the input algorithms in a round-robin fashion. Obviously this technique may lead to poor results, especially if some learners are much weaker than others. For this reason a weighted approach can be implemented which takes into consideration the different characteristics of the input data. This can be achieved using a custom interleaving factor, instead of a plain round-robin method, to privilege the stronger input recommendations.
Another problem may arise if the order of the final recommendation matters, as in the RecSys Challenge. As a matter of fact the order of our round-robin approach would be determinant, giving higher priority to the first algorithms visited during the iteration. A possible solution would be to randomly change the order whenever we perform the computation for a different user, or to define it a priori given some heuristics.
Interleaving was proposed in [12] under the name mixed hybridization. Table 3.8 shows how interleaving works with three different input sources using a round-robin approach.
rec1 item   rec2 item   rec3 item   rec_interleave item
a           e           l           a
b           f           m           e
c           g           a           l
d           h           b           b
e           i           c           f
Table 3.8: Interleave Example
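A minimal round-robin interleaving sketch (illustrative only); here we also skip items already emitted by another list, a choice not shown in the table but natural when building a single recommendation list:

```python
def interleave(ranked_lists, k=30):
    """Round-robin interleaving of several ranked lists, skipping duplicates."""
    result, seen = [], set()
    position = 0
    while len(result) < k and any(position < len(lst) for lst in ranked_lists):
        for lst in ranked_lists:
            if position < len(lst) and lst[position] not in seen:
                seen.add(lst[position])
                result.append(lst[position])
                if len(result) == k:
                    break
        position += 1
    return result

# Example of Table 3.8
print(interleave([["a", "b", "c", "d", "e"],
                  ["e", "f", "g", "h", "i"],
                  ["l", "m", "a", "b", "c"]], k=5))   # ['a', 'e', 'l', 'b', 'f']
```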
3.2.5 Stacked Generalization
Averaging and voting methods are really straightforward to understand and implement because there is no need to train new complex learners: they rely only on combining the prediction files obtained from the different models to hopefully reduce the error.
Stacked generalization was introduced by Wolpert [13] in a 1992 paper; the basic idea behind it is to use a pool of base classifiers and then use another classifier to combine their predictions, with the aim of reducing the generalization error.
Figure 3.2: Stacked Generalization scheme
Figure 3.2 highlights the two main phases of the stacked generalization process. The first step is to collect the output of each model into a new set of data: each instance in the original training set is now represented by every model's prediction of that instance's value along with its true classification. When constructing these models we must take care to ensure that the predictors are formed from a batch of samples that does not include the instance in question, just as cross-validation does. The newly constructed data set is then treated as the training set for another learning problem, so in the second step a learning algorithm is employed to solve this problem.
According to Wolpert's terminology, the data and the models constructed in the first step are referred to as level-0 data and level-0 models, while the second-stage data and learning algorithm are referred to as level-1 data and level-1 generalizer.
Let us imagine that our given data set is ν = {(y_n, x_n), n = 1, ..., N}, where y_n represents the target value and x_n represents the attribute values of the n-th instance. We randomly split the sample into J almost equal parts ν_1, ..., ν_J and define ν_j and ν^{(−j)} = ν − ν_j to be, respectively, the test and the training set for the j-th fold of a J-fold cross-validation. Now, given K learning algorithms, which we call level-0 generalizers, for each k = 1, ..., K we invoke the k-th algorithm on the data in the training set ν^{(−j)} to induce a model M_k^{(−j)}.
We can now denote the prediction of the model M_k^{(−j)} on an instance x_n belonging to ν_j as:

z_{kn} = M_k^{(−j)}(x_n)    (3.3)
At the end of the entire cross-validation process, the data assembled from the outputs of the K models is:

ν_{cv} = {(y_n, z_{1n}, ..., z_{Kn}), n = 1, ..., N}    (3.4)

We refer to this new data as the level-1 data; using a new learning algorithm we can derive a model M, or level-1 model, that takes as input the vector (z_1, ..., z_K) and outputs the final prediction or classification for our instance.
This last step completes the description of the stacked generalization method proposed by Wolpert [1992].
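A compact sketch of the level-0 / level-1 scheme is given below. It uses scikit-learn estimators purely for illustration (the thesis re-implemented its algorithms from scratch and did not use these libraries); the toy data and the choice of classifiers are assumptions.

```python
import numpy as np
from sklearn.model_selection import KFold
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier

def stacked_generalization(X, y, level0_models, level1_model, n_folds=5):
    """Build level-1 data from out-of-fold level-0 predictions (Eq. 3.3-3.4),
    then fit the level-1 generalizer on it."""
    Z = np.zeros((len(y), len(level0_models)))            # level-1 data (nu_cv)
    for train_idx, test_idx in KFold(n_splits=n_folds).split(X):
        for k, model in enumerate(level0_models):
            model.fit(X[train_idx], y[train_idx])          # M_k^(-j), fold j left out
            Z[test_idx, k] = model.predict(X[test_idx])    # z_kn for instances in fold j
    level1_model.fit(Z, y)                                 # level-1 generalizer
    for model in level0_models:                            # refit level-0 on all data
        model.fit(X, y)
    return level0_models, level1_model

# toy usage
X = np.random.rand(200, 4)
y = (X[:, 0] + X[:, 1] > 1).astype(int)
stacked_generalization(X, y,
                       [DecisionTreeClassifier(), KNeighborsClassifier()],
                       LogisticRegression())
```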
3.2.6 Blending
Blending is almost identical to the stacked generalization method proposed by Wolpert [1992]; the term was introduced by the Netflix Prize winners. It is simpler than the original version and has less risk of an information leak.
With blending [14], instead of creating out-of-fold predictions for the train set, you create a small holdout set of, say, 10% of the train set. The K level-0 models are trained on the remaining data, and the level-1 model learns only from their predictions on this holdout set.
This approach has a few benefits:
• it is simpler than stacking;
• the level-0 and level-1 generalizers use different data, preventing information leaks;
• there is no need to share a seed for stratified folds with your teammates;
and the cons are:
• you use less data;
• the final model may overfit to the holdout set;
• your cross-validation is more solid with stacking (calculated over more folds) than with a single small holdout set.
As for performance, both techniques are able to give similar results, and we can also combine them: create stacked ensembles with stacked generalization and out-of-fold predictions, then use a holdout set to further combine these models in a third stage.
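A blending sketch under the same assumptions as the stacking example above (scikit-learn used only for illustration; the same level-0 and level-1 estimators can be passed in):

```python
import numpy as np
from sklearn.model_selection import train_test_split

def blend(X, y, level0_models, level1_model, holdout_fraction=0.1):
    """Blending: level-0 models fit on ~90% of the data, the level-1 model fits
    only on their predictions for the ~10% holdout, so the two levels never see
    the same training instances."""
    X_fit, X_hold, y_fit, y_hold = train_test_split(X, y, test_size=holdout_fraction)
    Z_hold = np.column_stack([m.fit(X_fit, y_fit).predict(X_hold) for m in level0_models])
    level1_model.fit(Z_hold, y_hold)
    return level0_models, level1_model
```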
3.2.7 Collaborative via Content
A very well known issue in Recommender Systems is the so-called "cold start" problem, which means that the system cannot draw any inference for users about whom it has not yet gathered sufficient information. This happens every time the platform acquires a new user, or if the dataset is really sparse, i.e. the number of interactions between users and items is far lower than the number of items present in the database. For example, think about the number of movies present in a database like IMDB and the number of movies watched on average by a user.
Collaborative filtering techniques are those that suffer the most from the sparsity of the dataset, since they need as much matching information as possible among users. For example, if one user liked the movie "Rocky" and another liked the movie "Rocky II", they would not necessarily be matched together. On the other hand, Content-Based techniques can deal better with sparsity, given the fact that the system has at least some information about the user, be it interactions or explicit preferences. For example, if a user liked "Rocky", a Content-Based technique would find similarities with "Rocky II" and recommend it to that user.
To address this problem, a 2-step pipelined solution was proposed to lower sparsity [15]: we use the predictions obtained from a Content-Based technique to enrich the dataset, thus reducing sparsity by increasing the links between users and items. Then we can apply a collaborative filtering algorithm that exploits the less sparse dataset, as you can see in Figure 3.3.
Figure 3.3: Collaborative via Content scheme
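The pipeline of Figure 3.3 can be sketched as follows; the recommender interfaces (fit and recommend) are hypothetical and used only to show the flow of data:

```python
def collaborative_via_content(interactions, content_recommender, cf_recommender, n_enrich=5):
    """Two-step pipeline (Figure 3.3): enrich the sparse interaction set with
    pseudo-interactions predicted by a content-based model, then train the
    collaborative model on the densified data.

    interactions: set of (user, item) pairs observed in the data
    content_recommender / cf_recommender: hypothetical objects exposing fit() and recommend()
    """
    content_recommender.fit(interactions)
    enriched = set(interactions)
    for user in {u for u, _ in interactions}:
        for item in content_recommender.recommend(user, k=n_enrich):
            enriched.add((user, item))        # pseudo-interaction lowers sparsity
    cf_recommender.fit(enriched)
    return cf_recommender
```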
3.2.8 Monolithic Hybridization
Monolithic hybridization [16] denotes a design that incorporates aspects of different recommendation algorithms, mainly Content-Based and Collaborative Filtering, in one single implementation.
The idea behind this technique is that the hybrid uses additional input data that is specific to another recommendation algorithm; for example, a Content-Based recommender that also exploits community data to determine item similarities falls into this category. What is needed, then, is a preprocessing and combination of the different knowledge sources, followed by a modification of the algorithm behaviour in order to exploit the different types of input data.
There are 2 main approaches to monolithic hybridization:
• Feature Combination: the algorithm uses a different range of input data combined together, trying to build an algorithm with both content and collaborative capabilities. An example of this methodology was presented by Basu et al. (1998), who proposed a feature combination hybrid that combines collaborative features, such as a user's likes and dislikes, with content features of catalog items.
• Feature Augmentation: differently from feature combination, this hybrid does not simply combine and preprocess several types of input, but rather applies more complex transformation steps. In fact, the output of one recommender system augments the feature space of another recommender by preprocessing its knowledge sources. However, this must not be mistaken for a pipelined design, as discussed before in this chapter, because the implementation of the input recommender is strongly interwoven with the main component for reasons of performance and functionality. An example of feature augmentation can be found in the Content-boosted Collaborative Filtering approach (Melville et al. 2002), which predicts a user's assumed rating based on a collaborative mechanism that includes Content-Based predictions.
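To give an idea of what feature combination can look like in practice, the sketch below builds a single feature vector that mixes collaborative and content signals; all field names are hypothetical and only loosely inspired by the dataset of Chapter 5, not the method of Basu et al.:

```python
def combined_features(user_profile, item_profile, interactions):
    """Feature-combination sketch (hypothetical field names): one vector mixing
    collaborative signals with content attributes, to be fed to a single
    monolithic learner."""
    return [
        sum(1 for _, i in interactions if i == item_profile["id"]),          # collaborative: item popularity
        sum(1 for u, _ in interactions if u == user_profile["id"]),          # collaborative: user activity
        int(item_profile["career_level"] == user_profile["career_level"]),   # content: seniority match
        int(item_profile["discipline_id"] == user_profile["discipline_id"]), # content: discipline match
    ]
```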
Chapter 4
Evaluation
This chapter describes in detail the evaluation metric used for the competition and for this thesis work. It is a hybrid of several more standard metrics.
4.1 Evaluation Metrics
The dataset at our disposal has one important feature: it lacks explicit ratings. With such a dataset, all metrics relying on rating prediction cannot be adopted, also because we have no indication that a user disliked an item; hence recall-based methods are more suitable for this problem.
As we have seen in previous sections, the final output of our recommender system is a top-k list of items recommended for the user, ordered by an estimated score that represents the preference of the user for that item. The main idea when making recommendations is that ordering matters, therefore the items at the top of the list should be the most interesting for the user. In an online evaluation setting, the evaluation metrics can measure different aspects of the recommendation: how much the recommended item is of interest for the user, but also how different the items provided and then liked by the user are from the previously viewed ones (novelty), or how unlikely it was for the user to find these interesting items by himself (serendipity). In an offline evaluation setting like ours, all these other aspects cannot be measured, and all the metrics revolve around how much the list of recommended items corresponds to the list present in the ground truth, hoping that a high-scoring algorithm in this setting will also behave well in an online setting. In this section we first provide a description of the standard metrics that compose the competition metric and then describe the competition metric itself.
4.1.1 Recall
Recall is a measure of relevance and finds application in many fields. It was born in the information retrieval scenario: there the items are documents and the task is to return a set of relevant documents given a search term, or equivalently, to assign each document to one of two categories, "relevant" and "not relevant". The recall metric represents the proportion of relevant documents that are retrieved: it is the ratio between the number of relevant documents retrieved by a query and the size of the set of relevant documents. Using classification terms, it represents the ratio between true positive and true positive + false negative instances.
In the recommender system field, recall is calculated on the list of items returned by the algorithms to the user. This list is composed of the k most relevant items, ranked with the best one on top. The metric takes the list in input and evaluates how many of the k recommended items are truly relevant for the current user, that is, how many recommended items are present in the ground truth. The result is then calculated by dividing this number by the length of the whole list of relevant items. The mathematical representation is:

recall(k) = \frac{\#hits}{|T|}    (4.1)

where #hits is the number of good recommendations and T is the set of relevant items.
4.1.2 Precision at K
Precision is a known accuracy metric, extensively used in information retrieval and data mining. Differently from recall, precision measures the proportion of retrieved documents that are relevant, i.e. the ratio between true positive and true positive + false positive instances.
Precision takes all retrieved documents into account, but it can also be evaluated at a given cut-off rank, considering only the topmost results returned by the system. This measure is called precision at k, or P@k:

P@k = \frac{\#hits@k}{k}    (4.2)

where #hits@k is the number of good recommendations at a given cut-off k.
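The two metrics translate directly into code; a minimal sketch (list-of-item-ids representation assumed) is:

```python
def recall(recommended, relevant, k=30):
    """Fraction of the relevant items found in the top-k recommendations (Eq. 4.1)."""
    hits = len(set(recommended[:k]) & set(relevant))
    return hits / float(len(relevant)) if relevant else 0.0

def precision_at_k(recommended, relevant, k):
    """Fraction of the top-k recommendations that are relevant (Eq. 4.2)."""
    hits = len(set(recommended[:k]) & set(relevant))
    return hits / float(k)
```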
4.1.3 User Success
User success is a simple measure, not very common in information retrieval or data mining because it does not return any value regarding the quality of the recommendation: it returns 1 if there is at least one correctly recommended item and 0 otherwise. Whether you get all items right or only one, the score you get is always 1.
4.1.4 Competition Metric
The competition metric is a hybrid that combines all the aforementioned ones. It reflects the typical use cases at XING: users are presented with their top-k personalized recommendations, and a user interaction with one of the top-k is counted as a success.
According to this formula, for each user one can earn up to 100 points:

C(u) = 20 × (P@2 + P@4 + recall(30) + userSuccess) + 10 × (P@6 + P@20)    (4.3)

The final score is then computed as the sum over the test set U:

S = \sum_{u \in U} C(u)    (4.4)
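Reusing the recall and precision_at_k functions sketched in the previous section, the competition metric can be expressed as:

```python
def user_success(recommended, relevant, k=30):
    """1 if at least one of the top-k recommendations is relevant, 0 otherwise."""
    return 1 if set(recommended[:k]) & set(relevant) else 0

def competition_score(recommended, relevant):
    """Per-user score C(u) of Equation 4.3 (at most 100 points per user)."""
    return (20 * (precision_at_k(recommended, relevant, 2)
                  + precision_at_k(recommended, relevant, 4)
                  + recall(recommended, relevant, 30)
                  + user_success(recommended, relevant))
            + 10 * (precision_at_k(recommended, relevant, 6)
                    + precision_at_k(recommended, relevant, 20)))

# Leaderboard score (Equation 4.4): sum of C(u) over the users in the test set, e.g.
# total = sum(competition_score(recs[u], ground_truth[u]) for u in test_users)
```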
Chapter 5
Dataset
The current chapter provides a description of the datasets that Xing AG provided to us; differently from the thesis work of Sacchi and Kamberoski [6], in our research we also use the source data called Impressions. To reduce the computational effort, all our algorithms were first tested on a sampled dataset that preserved all of its statistical properties.
5.1 Datasets
The dataset provided by Xing was rather big and was not the classic User-Rating Matrix found in recommender systems research. We extracted a sort of implicit (and very sparse) rating matrix from the interactions and impressions files, but some side information was also available, like the user and item profiles. The training set spans about three months, from mid-August to mid-November 2015. The test set period, on the other hand, is just a week long and starts immediately after the end of the training set period. Table 5.1 shows the quantity of each entry of the dataset used for this thesis work.
DB Entry Quantity
Users 40 000
Test Users 10 000
Items 274 318
Interactions 474 198
Impressions 38 511 532
Table 5.1: Dataset
5.1.1 User Profile
The user profile contains the details about a user; its fields are described in Table 5.2.
Feature Description
id Anonymized ID of the user
jobroles List containing the anonymized skills of the user, e.g. python, java, HTML, etc.
career level Beginner, experienced, manager, etc.
discipline id Consulting, HR, etc.
industry id Internet, Automotive, Finance, etc.
country Country where the user is currently working
region Region where the user is currently working (only if in Germany)
experience n entries class Identifies the number of CV entries that the user has listed as work experiences
experience years experience Estimated number of years of work experience of the user
experience years in current Estimated number of years the user has been working in his/her current job
edu degree Estimated university degree of the user
edu fieldofstudies Engineering, Economics and Legal, etc.
Table 5.2: User features
5.1.2 Item Profile
The item profile shows information about the job postings present in the dataset; its structure is similar to the User Profile.
Feature Description
id Anonymized ID of the item
title Concepts that have been extracted from the job title of the job posting
tags Concepts that have been extracted from the tags, skills or company name
career level Beginner, experienced, manager, etc.
discipline id Consulting, HR, etc.
industry id Internet, Automotive, Finance, etc.
country Describes the country where the user is actually working
region Region where user is actually working (only if in Germany)
latitude Latitude coordinate
longitude Longitude coordinate
employment Full-time, part-time, etc.
created at Unix timestamp representing the time when the job posting was created
active during test 1 if the item is still active (= recommendable) during the test period, 0 if the item is not active anymore in the test period (= not recommendable)
Table 5.3: Item features
5.1.3 Interactions
Interactions are the actions that users performed on the job postings.
The set refers to the implicit feedback registered in the considered time period. Each entry records the user, the item interacted with, the time of the interaction and the type of interaction performed: click, bookmark, apply to job, discard.
Feature Description
user id ID of the user who performed the interaction
item id ID of the item on which the interaction was performed
interaction type Refers to the type of interaction that was performed on the item:
• 1 = the user clicked on the item
• 2 = the user bookmarked the item on XING
• 3 = the user clicked on the reply button or application form button that is shown on some job postings
• 4 = the user deleted a recommendation from his/her list of recommendations (clicking on "x"), which has the effect that the recommendation will no longer be shown to the user and that a new recommendation item will be loaded and displayed to the user
created at Unix timestamp representing the time when the interaction was created
Table 5.4: Interactions description
5.1.4 Impressions
Items shown by the existing XING recommender engine to the users of the
platform.
Feature Description
user id ID of the user
year Year when the impression was presented
week Week of the year
items Comma-separated list (not set) of items that were displayed to the user
Table 5.5: Impressions dataset
5.1.5 Test Users
Contains just the IDs of the users for which we need to provide recommendations.
Chapter 6
Source Algorithms
This chapter provides a brief description of the eleven input algorithms used in our ensemble. In the first section we describe the solutions obtained using only the Interaction and Impression datasets, in the second section we discuss the collaborative filtering approaches, while in the third the content-based methods.
Many of the approaches present in our ensemble derive from the thesis research of Sacchi and Kamberoski, so any further details about them can be found in [6]. On the other hand, all the methods that use the Impression dataset as source data were implemented by us following the approach of our colleagues Sacchi and Kamberoski; this is the reason why we decided not to cover all the particulars but just give an overview of how they work.
6.1 Past Interaction & Past Impression
As already said for the job recommendation scenario in Section 2.1, it makes sense to suggest job postings the user has already interacted with. To better take advantage of this peculiarity we decided to filter those items out of the recommendations provided by the Collaborative Filtering and Content-Based approaches and then, starting from the Interactions and Impressions source datasets, create two files to be treated as single submissions:
• Past Interactions: they refer to the jobs previously clicked by a user, which were filtered and ordered according to some criterion related to the user profile.
• Past Impressions: they concern the jobs shown to a user by the Xing recommender system; we kept only the last two weeks of data, which we considered the most relevant, and then performed the same filtering and reordering steps mentioned for the Interactions.
In this way we had disjoint sets of recommendations, making our learners recommend previously un-clicked jobs. A minimal sketch of this filtering step follows.
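The following is a minimal sketch, assuming a simple tuple-based layout for the source files (user, item, timestamp for interactions; user, week, item list for impressions); the actual filtering and reordering criteria of our pipeline are only summarized here.

# Hypothetical sketch of the disjoint-source filtering (names and data layout are assumptions).
def build_disjoint_sources(interactions, impressions, cf_cb_recs, last_week):
    """Drop already seen job postings from the CF/CB lists and keep only recent impressions."""
    past_interactions = {(u, i) for u, i, _ts in interactions}
    recent_impressions = {(u, i) for u, week, items in impressions
                          if week >= last_week - 1          # keep only the last two weeks
                          for i in items}
    seen = past_interactions | recent_impressions
    filtered_cf_cb = {u: [i for i in recs if (u, i) not in seen]
                      for u, recs in cf_cb_recs.items()}
    return past_interactions, recent_impressions, filtered_cf_cb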
6.2 Collaborative Filtering Algorithms
We have six input algorithms belonging to the Collaborative Filtering family: two out of six were completely provided and implemented by Sacchi and Kamberoski [6], while the remaining ones were implemented by us following the work of our colleagues. In the following part of this Chapter we use a particular syntactic structure to define the algorithms, e.g. IntImp: the former word "Int" refers to the training data, i.e. the base for the similarity function, while the latter refers to the recommendation part, i.e. the set from which we extract the items. A minimal sketch of the user-based variant is given after the list below.
• UBCF IntInt (User Collaborative Interaction - Interaction): starting from the past interactions performed by two users we calculate the similarity among them and recommend to a user u the interactions of the users belonging to its neighborhood (with neighborhood we refer to the set of users whose similarity value with u is above a certain threshold).
• UBCF IntImp (User Collaborative Interaction - Impression): given the similarity between the users, based on their interactions, instead of recommending to a user u the interactions of its neighbours we suggest their impressions.
• UBCF ImpImp (User Collaborative Impression - Impression): in this case we calculate the similarity between the users no longer over their interactions but over their impressions, and then we recommend to a user u the impressions shown to its most similar users.
• UBCF ImpInt (User Collaborative Impression - Interaction): provided the neighborhood of a user, calculated over the impressions, we recommend the interactions performed by its neighbours.
• IBCF IntInt (Item Collaborative Interaction): we provide to a user the jobs that are similar to its previously viewed job postings; the similarity between two jobs is computed over the users that interacted with them, hence two items are similar if the same users clicked on them.
• IBCF ImpImp (Item Collaborative Impression): we recommend to a user the jobs similar to its past impressions; the similarity between two jobs is calculated over the users they were shown to, hence two items are considered similar if they were suggested to the same users.
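As an illustration, here is a minimal sketch of the UBCF IntInt idea under simplifying assumptions (binary feedback, cosine similarity over item sets, hypothetical data structures); the actual implementation in [6] may differ in the similarity measure, thresholds and normalizations.

# Minimal sketch of UBCF IntInt (assumed data layout, not the thesis code).
# user_items maps each user id to the set of item ids it interacted with.
from collections import defaultdict
from math import sqrt

def cosine_similarity(items_u, items_v):
    """Cosine similarity between two users represented as sets of interacted items."""
    if not items_u or not items_v:
        return 0.0
    return len(items_u & items_v) / sqrt(len(items_u) * len(items_v))

def ubcf_int_int(user_items, target_user, k=50, top_n=30):
    """Recommend the items of the k most similar users, ranked by aggregated similarity."""
    target_items = user_items[target_user]
    neighbours = sorted(
        ((cosine_similarity(target_items, items), u)
         for u, items in user_items.items() if u != target_user),
        reverse=True)[:k]
    scores = defaultdict(float)
    for sim, u in neighbours:
        for item in user_items[u] - target_items:   # only previously un-clicked items
            scores[item] += sim
    return sorted(scores, key=scores.get, reverse=True)[:top_n]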
6.3 Content Based Algorithms
Three submissions composing our final ensemble derive from a content-based approach; the Baseline was the only one deployed by us, while the remaining two came from the research work of Sacchi and Kamberoski [6]. A minimal sketch of the concept-overlap idea is given after the list.
• Baseline: it is a purely content-based algorithm that Xing provided as a benchmark; it tries to exploit the similarity between the user profile and the item profile using information such as the career level, the region and so on. Being a really general approach, it is able to provide a recommendation to all users in the test set.
• KBIS (Concept Based Item Similarity): we suggest to a user the jobs that are similar to its former interactions. The similarity between two items is computed over the titles and tags that describe their profiles, thus two items are similar if they share the same concepts in their description.
• KBUIS (Concept Based joint User-Item Similarity): first we create a new user profile, based on the titles and tags of the user's old interactions combined with the user's jobroles, and then we recommend the jobs that are most similar to it. The more similar the user and job profiles are, the more likely the user is to be interested in the job posting.
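The sketch below illustrates the concept-overlap idea behind KBIS with a simple Jaccard similarity over concept sets; this is an assumption for illustration only, since the actual implementation in [6] uses IDF-weighted concepts (see Section 7.4.1).

# Hypothetical sketch of concept-overlap item similarity (not the thesis code).
# item_concepts maps each item id to the set of concepts extracted from its title and tags.
def jaccard(a, b):
    """Jaccard similarity between two concept sets."""
    return len(a & b) / len(a | b) if (a or b) else 0.0

def kbis_recommend(item_concepts, clicked_items, top_n=30):
    """Score every candidate item by its best concept overlap with the user's past clicks."""
    if not clicked_items:
        return []
    scores = {}
    for item, concepts in item_concepts.items():
        if item in clicked_items:
            continue
        scores[item] = max(jaccard(concepts, item_concepts[c]) for c in clicked_items)
    return sorted(scores, key=scores.get, reverse=True)[:top_n]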
Chapter 7
Ensemble Technique
In this chapter we will describe the ensemble algorithm that we implemented
for the RecSys Challenge. The technique that we created is called “Multi-
Stack Ensemble” [17] [18] and it consists of a hierarchy of hybrid models.
The input learners that we used are those described in Ervin and Elena's thesis work [6], plus 4 other Collaborative Filtering techniques that we implemented on the impressions of the dataset. After introducing the general concept we will describe in detail the 2-step function that we used inside each level of the stack.
7.1 Multi-Stack Ensemble
Multi-Stack Ensemble is a hybridization technique that uses a multi-layered stack structure, shown in Figure 7.1, in order to create a hierarchy of input learners. The stack implements a 2-step algorithm, shown in Figure 7.2:
1. Voting-based method: assigns to each item of the input recommendations a score using a linear or score-based function, described in Section 7.2;
2. Reduce function: performs a reduce step among the input recommendations, grouped by user, in order to exploit a majority preference, as described in Section 3.2.1.
So every layer of the hierarchy implements a possibly different voting method, always followed by the reduce function.

Figure 7.1: Ensemble Hierarchy

Figure 7.2: Stack Layer Structure

Mathematically, this technique can be described as:
s_a(u, i) = f_E(rank_a(u, i), \Theta_E)    (7.1)

s_E(u, i) = \sum_{a \in E} s_a(u, i)    (7.2)

rank_E(u, i) = sort(s_E(u, i))    (7.3)

where a ∈ E is an algorithm in the ensemble E, s_a(u, i) is the score assigned to item i of user u for algorithm a by the scoring function f_E, which depends on the parameters of the ensemble \Theta_E. s_E(u, i) is the final score for each item i of the user u for the ensemble E. rank_E(u, i) is the final rank of each item i for the user u, calculated by the sort function that sorts the items in descending s_E(u, i) order and takes the top 30 items. If an algorithm provides less than 30 recommendations, the remaining part of the list is filled with lower priority algorithms.
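A compact sketch of one stack layer under these equations is shown below; the data layout and function names are assumptions for illustration, not the code we actually ran.

# Hypothetical sketch of one stack layer (Equations 7.1-7.3).
from collections import defaultdict

def ensemble_layer(recommendations, scoring_fn, params, top_n=30):
    """recommendations: {algo: {user: [item, ...]}} with items in rank order (rank starts at 1).
    scoring_fn plays the role of f_E and params holds the per-algorithm parameters Theta_E."""
    scores = defaultdict(float)                              # (user, item) -> s_E(u, i)
    for algo, per_user in recommendations.items():
        for user, items in per_user.items():
            for rank, item in enumerate(items, start=1):
                scores[(user, item)] += scoring_fn(rank, params[algo])   # Eq. 7.1 and 7.2
    per_user_items = defaultdict(list)
    for (user, item), s in scores.items():
        per_user_items[user].append((item, s))
    # Eq. 7.3: sort by descending score and keep the top 30 items per user.
    return {u: [i for i, _ in sorted(lst, key=lambda t: -t[1])[:top_n]]
            for u, lst in per_user_items.items()}

# Example with a linear scoring function (Section 7.2.1): score = weight - rank * decay.
recs = {"alg_a": {"u1": ["i1", "i2"]}, "alg_b": {"u1": ["i2", "i3"]}}
pars = {"alg_a": (1, 0.001), "alg_b": (1, 0.002)}
print(ensemble_layer(recs, lambda r, p: p[0] - r * p[1], pars))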
The innovative idea behind Multi-Stack Ensemble is that one can start by ensembling weak models, whose combinations become more accurate while climbing the stack, until they are comparable with stronger algorithms. This cannot be achieved using a weighted version of the techniques described in Section 3.2, since correct items in weak recommendations would always be penalized. Here instead at each layer all previous computations are erased, so that hybrid models coming from lower layers are considered as vanilla inputs.
The general rule is that in each layer we should have recommendations that are as comparable as possible in terms of accuracy, which means that the number of layers depends on the domain, the dataset and the input algorithms.
7.2 Voting-Based Methods
Hereafter we describe the different voting techniques that we implemented
in order to assign a score to each element of the algorithms inside the stack.
7.2.1 Linear Ensemble
In the linear ensemble we use two per-algorithm parameters: the weight w_a and the decay d_a of the algorithm a. The score s_a(u, i) is calculated as:

s_a(u, i) = w_a - rank_a(u, i) \cdot d_a    (7.4)

We use integers for w_a in order to establish a priority over the algorithms: for example, there are techniques which are much stronger than others for a subset of users, so it makes sense to give them more importance by assigning a higher weight. For example, past interactions and impressions contain stronger recommendations with respect to what other learners provide, so whenever we find a user with interactions and impressions we give them a higher priority, i.e. higher weights.
The decay d_a on the other hand has a very low value (in the order of magnitude of 10^-3), so that it helps defining the ordering inside a recommendation list that was given weight w_a. Let us see an example: suppose we have an algorithm a, providing 5 recommendations, which was given weight w_a = 1 and decay d_a = 0.001; our algorithm would work as in Table 7.1.
rec_a Rating
a1 0.999
a2 0.998
a3 0.997
a4 0.996
a5 0.995
Table 7.1: Linear Ensemble Example.
d_a is also used as an interleaving factor that allows alternating the recommendations of algorithms with the same priority (i.e. the same weight). So here, differently from what we saw in Section 3.2.4, the interleaving is not explicit (e.g. using a round-robin approach) but implicit. As a matter of fact the processing follows no explicit rule, it is only based on the scores that the voting technique assigns to each item. Again, let us see an example: suppose we have two algorithms a and b with w_a = w_b = 1, d_a = 0.001 and d_b = 0.002; Table 7.2 shows the dynamics of our technique.
rec_a score_a rec_b score_b rec_ENS score_ENS
a1 0.999 b1 0.998 a1 0.999
a2 0.998 b2 0.996 b1 0.998
a3 0.997 b3 0.994 a2 0.998
a4 0.996 b4 0.992 a3 0.997
a5 0.995 b5 0.990 b2 0.996
Table 7.2: Linear Ensemble Interleaving Example
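As a sketch, the linear voting of Equation 7.4 with the parameters of this example can be reproduced as follows (hypothetical item names; ties such as a2 and b1 may be ordered either way):

# Sketch of the linear voting step (Equation 7.4) with the parameters used in Table 7.2.
def linear_score(rank, weight, decay):
    """s_a(u, i) = w_a - rank_a(u, i) * d_a, with rank starting at 1."""
    return weight - rank * decay

rec_a = ["a1", "a2", "a3", "a4", "a5"]     # w_a = 1, d_a = 0.001
rec_b = ["b1", "b2", "b3", "b4", "b5"]     # w_b = 1, d_b = 0.002
scored = [(item, linear_score(r, 1, 0.001)) for r, item in enumerate(rec_a, start=1)]
scored += [(item, linear_score(r, 1, 0.002)) for r, item in enumerate(rec_b, start=1)]
# Sorting by descending score yields the implicit interleaving of the two lists.
print(sorted(scored, key=lambda t: -t[1]))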
To decide which pair (w, d) to assign to each input algorithm we performed a Grid Search [19] over a limited set of values. The Grid Search is a process that tries every possible combination of the parameters in order to find the best configuration, so we had to limit the possible outcomes of this process. To do that we defined a limited pool of possible values for both weight and decay, shown in Table 7.3.
Parameter Set of values
weight 0, 1, 2, 3, 4
decay 0.001, 0.0015, 0.002, 0.0025, 0.003, 0.004, 0.005
Table 7.3: Weight and Decay set of values
The value 0 for the weight would mean that the input recommendation
is to be discarded, i.e. not considered for the ensemble.
For every configuration we then calculated the ensemble and checked the
score in our offline environment. This process was repeated for each layer
of the stack, so we started from the first layer, searched for the best set of
parameters using a Grid Search, selected the ensemble with the best config-
uration and passed it to the upper level, where the process was repeated.
Complete results will be reported in Chapter 8.
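A minimal sketch of this per-layer search is given below; the evaluate function stands for our offline evaluation (Section 4.1.4) and is assumed to exist, and in practice the full Cartesian product over many algorithms is large, so the search was restricted to the pools of Table 7.3 and run layer by layer.

# Hypothetical grid-search loop over the (weight, decay) pools of Table 7.3 (not the thesis code).
from itertools import product

WEIGHTS = [0, 1, 2, 3, 4]
DECAYS = [0.001, 0.0015, 0.002, 0.0025, 0.003, 0.004, 0.005]

def grid_search(algorithms, evaluate):
    """Try every (weight, decay) assignment per algorithm and keep the best-scoring one."""
    best_score, best_config = float("-inf"), None
    for config in product(product(WEIGHTS, DECAYS), repeat=len(algorithms)):
        params = dict(zip(algorithms, config))      # algorithm -> (weight, decay)
        score = evaluate(params)                    # offline evaluation of the resulting ensemble
        if score > best_score:
            best_score, best_config = score, params
    return best_config, best_score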
7.2.2 Evaluation Score Ensemble
The Evaluation Score ensemble assigns to each item in the recommendation
list a score that reflects the accuracy of the algorithm a, represented by the
weight wa, and the points that can be obtained using the Xing evaluation
metric, described in Section 4.1.4, given the rank of the item. As a matter
of fact the ordering of items matters, so as you can see from Equation 7.6
a relevant item in the first or second position contributes much more than
one at the end of the list.
s_a(u, i) = w_a \cdot e(rank_a(u, i))    (7.5)

where e(rank_a(u, i)) is defined as:

e(rank_a(u, i)) =
\begin{cases}
37.83, & rank_a(u, i) \in [1, 2] \\
27.83, & rank_a(u, i) \in [3, 4] \\
22.83, & rank_a(u, i) \in [5, 6] \\
21.17, & rank_a(u, i) \in [7, 20] \\
20.67, & rank_a(u, i) \in [21, N]
\end{cases}    (7.6)
As we stated before, the weight w_a represents the accuracy of the algorithm a, defined as the score density ratio, which is the ratio between what we called leaderboard score and the total number of recommended items:

w_a = \frac{l_a}{n_a}    (7.7)

where n_a is the number of items recommended by algorithm a and l_a is the score that algorithm a obtained in our offline environment.
The main idea behind this technique is to exploit the points per item of an algorithm. For instance, there may be two different learners, say A and B, that have an equal score in our leaderboard. You may think they are comparable and can be ensembled using a round-robin interleaving, as described in Section 3.2.4. But what if A recommends twice as many items as B? It would mean that the score density ratio of B is double with respect to A, making A a weaker algorithm than B, because it achieves the same score while recommending twice as many items.
Let’s see an example of how this technique works, suppose we have two
learners, a and b, with the following characteristics: la = lb = 10, na = 5
and nb = 2.
From the data above we can calculate the weights for both a and b as de-
scribed in Formula 7.7, obtaining wa = lana
= 2 and wb = lbnb
= 5. Now with
the weights and the function e(ranka(u, i)) we can compute the score for
48
rec_a score_a rec_b score_b rec_ens score_ens
a1 75.66 b1 189.15 b1 189.15
a2 75.66 b2 189.15 b2 189.15
a3 55.66 a1 75.66
a4 55.66 a2 75.66
a5 45.66 a3 55.66
Table 7.4: Evaluation Score Example
As we can see from Table 7.4, algorithm b is much more accurate than algorithm a, so the evaluation-score technique gives it a higher priority, recommending b's elements before a's.
This technique turned out to be quite powerful, especially with algorithms that make recommendations on different, but overlapping, sets of target users (e.g. users with interactions and users with impressions), because it gives more information on the actual accuracy of each learner.
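A minimal sketch of this scoring, using the point values of Equation 7.6 and the example numbers above, could look like the following (assumed helper names, not the thesis code):

# Sketch of the Evaluation Score voting (Equations 7.5-7.7).
def e(rank):
    """Points awarded by the Xing metric for a relevant item at the given 1-based rank (Eq. 7.6)."""
    if rank <= 2:
        return 37.83
    if rank <= 4:
        return 27.83
    if rank <= 6:
        return 22.83
    if rank <= 20:
        return 21.17
    return 20.67

def evaluation_score(rank, leaderboard_score, n_items):
    """s_a(u, i) = w_a * e(rank), with w_a the score density ratio l_a / n_a (Eq. 7.7)."""
    return (leaderboard_score / n_items) * e(rank)

# The example of Table 7.4: l_a = l_b = 10, n_a = 5, n_b = 2.
print(evaluation_score(1, 10, 5))   # a1 -> 2 * 37.83 = 75.66
print(evaluation_score(1, 10, 2))   # b1 -> 5 * 37.83 = 189.15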
7.3 Reduce Function
While describing the voting-based methods we have always considered the recommendations of different algorithms as belonging to disjoint sets. The reality of the competition was different though: in fact it happened that different learners recommended the same item. For example, an item i could be recommended to user u by a Collaborative Filtering method, provided other users similar to him clicked on that job posting, and at the same time it could be recommended by a Content-Based technique, provided i had attributes similar to previously clicked interactions of u. Therefore we thought that this information should not be lost or ignored, and that is why we decided to implement a way to take advantage of it.
The idea behind the reduce function is that if more than 1 algorithm recommends the same item to a user, then probably that item is a good recommendation. This technique derives from the majority method described in Section 3.2.1, but we pushed it forward: we do not count the number of occurrences of an item among multiple recommendations, instead we sum up the ratings that the item received during the voting step, described in Section 7.2. Eventually, as a final step, we use a sorting procedure which orders the items by their new ratings in a descending fashion.
Figure 7.3: Reduce Function Example. Inside each block the values represent the structure (item, rating):
Algorithm 1: (A, 3) (B, 2.9) (C, 2.8) (D, 2.7) (E, 2.6)
Algorithm 2: (R, 3) (G, 2.9) (B, 2.8) (P, 2.7) (D, 2.6)
Reduce Step: (B, 5.7) (D, 5.3) (A, 3) (R, 3) (G, 2.9)

An interesting use case for this technique is the following: suppose we have two input algorithms M and N, and that an item i is recommended to user u by both algorithms in the last position, such that:

rank_M(u, i) = rank_N(u, i) = 30    (7.8)
If we had simply counted the number of occurrences of items, then item i would have been quite a good recommendation, given that it occurs in every input algorithm (N, M). The problem is that it is in the thirtieth position, so if N and M rank the recommendations by likelihood, i is quite far from the top of the list.
Our approach on the other hand considers the rating that our voting methods assigned to i for both N and M, and being i in the thirtieth position its value may not be really high. So our reduce function will certainly rank i up, since it will get the sum of the scores s_M(u, i) and s_N(u, i), but it probably will not make it to the top of the recommendation list, where we have higher rated items.
The implementation of our reduce function is pretty straightforward: we simply used the map-reduce paradigm of Apache Spark.
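A minimal PySpark-flavoured sketch of this reduce step is shown below; the RDD layout, the toy data and the job name are assumptions for illustration, not our actual Spark code.

# Hypothetical sketch of the reduce step with Apache Spark (assumed data layout).
from pyspark import SparkContext

sc = SparkContext(appName="reduce-step-sketch")
# ((user, item), score) tuples produced by the voting step of each input algorithm.
scored = sc.parallelize([
    (("u1", "A"), 3.0), (("u1", "B"), 2.9),   # algorithm 1
    (("u1", "B"), 2.8), (("u1", "R"), 3.0),   # algorithm 2
])

reduced = (scored
           .reduceByKey(lambda x, y: x + y)                  # sum the votes per (user, item)
           .map(lambda kv: (kv[0][0], (kv[0][1], kv[1])))    # re-key by user
           .groupByKey()
           .mapValues(lambda items: sorted(items, key=lambda t: -t[1])[:30]))  # top 30 per user

print(reduced.collect())   # e.g. [('u1', [('B', 5.7), ('A', 3.0), ('R', 3.0)])]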
7.4 Stack Layers
In our solution we implemented a three-layered stack that we will describe in the following sections. The main idea is to perform a batch hybridization with the possibility to apply a different combining technique to each subgroup of algorithms.
7.4.1 Layer 1
The first layer contains 2 different ensembles:
• Ensemble CF: the aim of this ensemble is to find a single strong
Collaborative Filtering representative out of 6 different techniques [6]
that we implemented. Those 6 input algorithms were all from the same
family of Collaborative Filtering, but they provided different predic-
tions due to diverse parametrization or users on which they trained.
On average we had a 20% intersection among the predictions of these
algorithms. Ensemble CF would then provide us with a single rec-
ommendation that collected the results of all the Collaborative tech-
niques.
For Ensemble CF we used the Evaluation Score method, described in
7.2.2, for the voting step and then applied our Reduce function.
• Ensemble CB: as for the previous case this ensemble aims at finding
a single strong Content-Based representative out of 2 techniques [6]
that we implemented. Again the 2 algorithms belong to the same
family, but they provided different recommendations, this time due to
the different calculation of IDF values.
For Ensemble CB we applied the Linear method instead, described in
7.2.1, followed by our Reduce step.
The 2 ensembles generated at this stage will then be used in the upper layers of the stack. Table 7.5 shows a summary of layer 1.
7.4.2 Layer 2
The second layer is the simplest one: the goal here is to create a new single recommendation that unites the Collaborative Filtering and Content-Based approaches. Here there is no external input; the only algorithms used in this layer are those coming from the previous one, which in fact are the two representatives of Collaborative and Content. The output will then be a single recommendation list that will be propagated to the upper layer.
The voting method used is again Linear.
51
Input Output
UBCF IntInt    Ens CF
UBCF IntImp
UBCF ImpInt
UBCF ImpImp
IBCF IntInt
IBCF ImpImp
KBUIS          Ens CB
KBIS
Table 7.5: I/O Layer 1

Input Output
Ens CF         Ens CF+CB
Ens CB
Table 7.6: I/O Layer 2
7.4.3 Layer 3
The third and final layer of our stack will output the recommendation that
allowed us to achieve the 4th position in the competition.
Let us describe the input algorithms that we find in this layer:
• Interactions: contains the past interactions of the user. As we stated at the beginning of this thesis work, the goal of this research is next-click prediction, so given the characteristics of the job-recommendation domain it makes sense to recommend what we know users have already seen, since users tend to interact with the same item multiple times;
• Impressions: contains the impressions that each user received, which is what the Xing recommender system showed them. Since these are recommendations already presented in the user interface of the Xing platform, it is likely that users actually clicked on them;
• Ensemble CF+CB: contains the recommendations obtained in the second layer of our stack; basically it incorporates the results of our Collaborative Filtering and Content-Based approaches;
• Baseline: contains recommendations derived from the baseline algorithm that Xing provided. Since it is a general purpose approach it presents recommendations for almost every user, disregarding requirements that more advanced techniques may have (e.g. Collaborative Filtering needs a minimum number of past interactions for a user in order to provide valuable recommendations). That being said, we used this baseline not as an input model for our ensemble, since its score was far lower than that of the other algorithms, but as a filler: if for any reason we could not provide any or enough job postings to a user, we completed the 30-item recommendation list with the baseline (a minimal sketch of this filler step is given after Table 7.7).
Input Output
Ens CF+CB      Final Rec
Interactions
Impressions
Baseline
Table 7.7: I/O Layer 3
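The following is a minimal sketch of the baseline filler behaviour (hypothetical helper, assumed data layout, not the thesis code):

# Top up a user's recommendation list with baseline items until it reaches 30 entries.
def fill_with_baseline(user_recs, baseline_recs, size=30):
    filled = list(user_recs)
    for item in baseline_recs:
        if len(filled) >= size:
            break
        if item not in filled:
            filled.append(item)
    return filled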
In this layer we found again the Linear method to be the most accurate. Since interactions and impressions were really strong recommendations, our goal was to use the previous layers of the stack to build a new model, Ensemble CF+CB, that could somehow compare with them. It is also important to remember that the Content-Based and Collaborative Filtering methods do not include elements present in interactions and impressions, the 2 sets being completely disjoint, which makes it harder to find an equally accurate recommendation.
Chapter 8
Results
In this chapter we show the results of our research work compared to other
state of the art ensemble techniques.
In the first part we present the tuning process that characterized each layer
of our stack. This tuning involves the voting-based techniques to be used
and the related parametrization. In the second part we compare our Multi-
Stack Ensemble with some of the hybrids described in Chapter 3.
Table 8.11 summarizes the results, showing that our approach outperforms all other solutions.
8.1 Layers Tuning
As we stated in Chapter 7, our 2-step algorithm consisted of a Voting-based method followed by a Reduce function. While describing the former in
Section 7.2 we introduced 2 different techniques: Linear and Evaluation-
Score.
The problem here is that during the computation we could only apply 1
voting-based method, so in our offline testing we used both techniques to
understand which one was more suitable for each layer of the stack. In the
following subsection we will report the results of this testing to justify what
was presented in Section 7.4. To find the best set of parameters for the
Linear Voting-based method we performed a Grid Search over a fixed set
of values for both weight and decay. Hereafter we will show only the most
significant combinations for the sake of brevity.
In the following sections we will use the notation 1.1 and 1.2 to discriminate between the 2 different ensembles that reside at layer 1: the former refers to the Collaborative Filtering techniques, the latter to the Content-Based ones.
8.1.1 Input Algorithms
In Table 8.1 we present the algorithms we used as input for our Multi-Stack Ensemble. Beside each of them you can find the score obtained in our offline environment. It is the result of the evaluation metric adopted for the Challenge and this thesis work, described in Section 4.1.4.
Algorithm Score
Baseline 2k
UBCF IntInt 18.9k
IBCF ImpImp 20k
IBCF IntInt 22k
UBCF IntImp 22.4k
UBCF ImpImp 31k
UBCF ImpInt 33.2k
KBIS 37k
KBUIS 45k
Interactions 82k
Impressions 116k
Table 8.1: Scores of Input Algorithms.
The Score value is obtained using the evaluation metric
described in Section 4.1.4
An important aspect to note is that even though impressions, as a recommendation, perform better than interactions, the latter are more accurate. As a matter of fact, if we calculate the ratio between the score and the number of recommended items, i.e. the points per item shown in Equation 8.1, what we obtain is in Table 8.2.

ratio = \frac{score}{number\ of\ elements}    (8.1)
Algorithm Score Number of items Ratio (Equation 8.1)
Interactions 82k 45103 1.8181
Impressions 116k 208421 0.5565
Table 8.2: Points per item of an algorithm
As you can see, interactions have a much higher ratio, which means that even though they contain fewer items the recommendations are actually more accurate. The reason is clear: interactions are job postings users actively interacted with, hence they all represent preferences of the user, while impressions are all those postings that are presented to the user as possibly interesting items, therefore only a subset of them is actually clicked.
8.1.2 Layer 1.1
In the first layer we combine all the recommendations coming from Collab-
orative Filtering algorithms.
We start with the Linear Ensemble technique, tuning the weights and decays using a Grid Search algorithm. Table 8.3 shows the best-performing configuration of parameters for the aforementioned voting method.
Algorithm Weight Decay
UBCF IntInt 1 0.002
UBCF IntImp 1 0.002
UBCF ImpInt 1 0.0015
UBCF ImpImp 1 0.001
IBCF IntInt 1 0.002
IBCF ImpImp 1 0.005
Table 8.3: Layer 1.1 (Collaborative Filtering) Linear Method Parametrization
We then implemented the Evaluation Score technique, which needs no tuning of parameters, as shown in Section 7.2.2. Table 8.4 presents the scores obtained in our local tests.
Voting Method Score
Linear 34656
Evaluation-Score 34914
Table 8.4: Layer 1.1 Results.
The Score value is obtained using the evaluation metric
described in Section 4.1.4
The best-performing result is then obtained using the Evaluation-Score voting-based methodology, which is slightly better than the Linear one. The reason may be that the accuracy of these Collaborative Filtering techniques decays very fast, and the latter technique is not able to address this behaviour better than the former one.
8.1.3 Layer 1.2
Here instead we create a single recommendation list out of the 2 Content-
Based techniques that we implemented. Table 8.5 shows the best parametriza-
tion for the Linear method.
Algorithm Weight Decay
KBUIS 1 0.001
KBIS 1 0.001
Table 8.5: Layer 1.2 (Content-Based) Linear Method Parametrization
This time the Linear method is the most accurate one, as shown in Table 8.6. The best solution is obtained by giving the same weight and decay (1 and 0.001) to both the input algorithms. We can imagine then that they are almost equally accurate in the first half of the recommendation list, which entails that a round-robin interleaving may be the optimal solution.
Voting Method Score
Linear 46743
Evaluation-Score 44246
Table 8.6: Layer 1.2 Results.
The Score value is obtained using the evaluation metric
described in Section 4.1.4
8.1.4 Layer 2
The second layer takes as an input the 2 representatives created in the pre-
vious level and combines them together to obtain a single recommendation
with both collaborative and content characteristics. As we stated in previ-
ous chapters this hybrid helps at overcoming the limitations of the single
approaches.
Table 8.8 shows that again the Linear method performs much better
than the Evaluation Score. The best result is obtained by giving the same
weight to both input algorithms, but assigning a different set of decays,
Algorithm Weight Decay
ENS CF 1 0.0015
ENS CB 1 0.001
Table 8.7: Layer 2 Linear Method Parametrization
This gives a new rule to the interleaving process, which now differs from a standard round-robin approach. As a matter of fact the proportion of items inside the resulting ensemble will be, on average, 2:3, i.e. 2 Collaborative recommendations every 3 Content-Based ones.
Voting Method Score
Linear 52472
Evaluation-Score 49601
Table 8.8: Layer 2 Results.
The Score value is obtained using the evaluation metric
described in Section 4.1.4
8.1.5 Layer 3
The third layer ensembles the representative of our implemented algorithms, obtained at the second layer, with the recommendations processed from the interactions and impressions of the dataset. This is the final stage of our solution.
As we can see from the input data in Table 8.1, interactions and impressions perform far better than the other techniques; the reason is that we decided to work with disjoint sets, hence our Collaborative and Content-Based solutions do not include any recommendation present in the former ones. As we stated before, since this is a click-prediction problem, it happens that users click multiple times on an already seen item, for example for comparison reasons. Therefore many items present in the test set are actually hidden in the training one. That also explains the score gap between the recommendations. The choice was made to leave more space for novelty, since the other items could be inferred by directly processing the interactions and impressions. Hence it made no sense to have them repeated in multiple lists.
As we can see from Table 8.10, the Linear method performs better, achieving a great result. The best set of parameters is shown in Table 8.9.
Algorithm Weight Decay
Interactions 3 0.001
Impressions CB 2 0.001
Ensemble CF+CB 1 0.001
Baseline - -
Table 8.9: Layer 3 Linear Method Parametrization
It includes the same decay for all the input algorithms, since the ordering is created using the weights. As a matter of fact we can create a queue of recommendations where the first part is filled with interactions, followed by impressions and then closed with the combination of Collaborative and Content-Based. This is perfectly in line with what we stated in the first part of this section and in Section 8.1.1. Anyway, one should not be confused into thinking that it is just an append of submissions: in fact the reduce step will act as a reordering technique, since interactions and impressions are not disjoint sets. Therefore, given our rules, it is likely that items belonging to the intersection are pushed upwards in the list, rewriting the ordering.
The Baseline was not included in the parametrization of Table 8.9 because it acts as a filler. Whenever we are not able to recommend enough or any items to a user with the other techniques, we use this algorithm to fill the list. As you can see in Table 8.1 it is not really accurate, but it can provide a recommendation for every test user, since it is a really general technique.
Voting Method Score
Linear 164402
Evaluation-Score 157197
Table 8.10: Layer 3 Results.
The Score value is obtained using the evaluation metric
described in Section 4.1.4
8.2 Ensemble Comparison
In this section we will compare the performance of our Multi-Stack Ensemble with the other hybridization techniques that we discussed in Chapter 3, in order to demonstrate that our approach actually works better than other state of the art solutions.
Let us first discuss the standard ensembles that we implemented:
• Majority Voting: we simply grouped the recommendations per user and counted the occurrences of each element, sorting them in descending order. The general idea is that if an item is present in more than one list of recommendations it is probably an interesting item for the user. Items with only 1 occurrence were sorted in the final recommendation based on their original rank, hence giving higher priority to top-ranked ones;
• Interleaving Random: we performed a round-robin approach taking
one recommendation at a time from each input algorithm. We used a
random per-user selection for the ordering of algorithms;
• Interleaving Order : we performed a round-robin approach taking one
recommendation at a time from each input algorithm. Differently from
the previous case we used a prefixed ordering of algorithms, based on
the score that each one of them performed in our offline environment,
Table 8.1, in descending order;
• Score Averaging: each algorithm that we implemented performs rating predictions over the items a user may like and then picks the top 30 elements with the highest values. For the Score Averaging we implemented an averaging technique using the normalized scores that each algorithm assigned to an item based on the user-rating prediction. The final rating r(u, i) is calculated as in Formula 8.2, where n is the number of occurrences of item i, recommended to user u, in the input algorithms.

r(u, i) = \frac{1}{n} \sum_{a=1}^{n} r_a(u, i)    (8.2)

Obviously the recommendations represented by interactions and impressions did not have a rating, since our processing consisted of filtering and reordering techniques. Therefore the aforementioned technique was applied to the Content-Based and Collaborative algorithms, and the hybrid obtained was appended to the former 2, using the exact same weighting technique applied in the 3rd layer of our Multi-Stack Ensemble;
• Rank Averaging: we assigned to each item i a rating r_i corresponding to its rank, e.g. the first element would have r_i = 1 and so on. If i was present in multiple recommendation lists for the same user u, we computed the average rating for i. We then ranked the list in ascending order (a minimal sketch of this baseline is given after the list);
• Weighted Voting: we performed a Grid Search over a set of parameters to assign a weight to each input algorithm. After that we applied the aforementioned Majority Voting technique. What differs now is that we do not simply count the occurrences, but rather sum up the weights, as shown in Equation 8.3, where w_a(u, i) is the weight assigned to algorithm a that contains the item i recommended to user u.

r(u, i) = \sum_{a} w_a(u, i)    (8.3)

Table 8.11 presents only the result of the best-performing configuration that we managed to obtain;
• Evaluation Score: we applied the Evaluation Score technique, de-
scribed in Section 7.2.2, using all input algorithms. This approach
would be the same as having 1 single layer in our Multi-Stack Ensem-
ble;
• Linear : we applied the Linear technique, described in Section 7.2.1,
using all input algorithms. For weights and decays we performed a
Grid Search to find the best configuration of parameters. Again this
technique can be seen as if we had 1 single layer in our Multi-Stack
Ensemble;
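As an illustration of one of these baselines, here is a minimal sketch of Rank Averaging for a single user (hypothetical data layout, not the thesis code):

# Rank Averaging baseline: items with a lower average rank come first.
from collections import defaultdict

def rank_averaging(rec_lists, top_n=30):
    """rec_lists: the ranked item lists produced for one user by the input algorithms."""
    ranks = defaultdict(list)
    for rec in rec_lists:
        for position, item in enumerate(rec, start=1):
            ranks[item].append(position)
    avg_rank = {item: sum(r) / len(r) for item, r in ranks.items()}
    return sorted(avg_rank, key=avg_rank.get)[:top_n]   # ascending average rank

print(rank_averaging([["A", "B", "C"], ["B", "A", "D"]]))   # e.g. ['A', 'B', 'C', 'D']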
Table 8.11 shows the scores obtained with the aforementioned techniques compared to the Multi-Stack Ensemble.
Our solution outperforms all state of the art techniques. Most of them are too general and cannot exploit the characteristics of the domain; for example, Majority Voting cannot decide whether there are recommendations more accurate than others, as we demonstrated in Section 8.1.1, without a proper weighting function. In fact the weighted version performs much better, achieving a score which is almost 3 times the former one. Anyway, even with a proper weighting, the results show that the hierarchical structure of the Multi-Stack Ensemble is able to push the accuracy forward, being able to exploit the characteristics of the input algorithms more deeply.
Another interesting aspect is that the standalone application of the 2 techniques that we created, Evaluation Score and Linear, performed better than the other state of the art solutions, but worse with respect to Multi-Stack. Therefore we can say that both the combination of different voting-based methods and the layered structure contributed deeply to the final result.
hybrid techniques score
State of the Art
Majority Voting 45937
Interleaving Random 112735
Interleaving Order 114591
Rank Averaging 113649
Score Averaging 126322
Weighted Voting 139425
Evaluation Score 148878
Linear 152731
Our solution Multi-Stack 164402
Table 8.11: Ensemble Comparison.
The Score value is obtained using the evaluation metric
described in Section 4.1.4
Chapter 9
Conclusion and Future
Developments
In this thesis work we discussed an innovative approach to ensemble different
recommendation sources in the job-recommendation domain. We worked in
collaboration with TU Delft and Xing AG, a Business Social Network, which
gave us the dataset, consisting of real data collected from the Xing web application.
The aim of this work is to push forward the research in the field of ensemble techniques, creating an innovative approach that exploits the characteristics of this specific domain.
Our solution is based on a multi-layered stack composed of 3 separate levels.
Inside each one of them the input recommendations are combined using a
voting-based technique followed by a reduce function in order to generate
a combined recommendation from the input sources. The output is then
propagated as an input to the upper layer.
The strength of this technique is to overcome the limitations of standard
predictive models by combining the recommendations in an innovative ar-
chitecture, to obtain a more accurate and reliable prediction for the users.
Moreover, the Multi-Stack Ensemble outperforms all other state of the art hybridization techniques, which use standard approaches that may be too general for this context.
The power of our solution resides in the layered structure, which allows combining recommendations in batches with the possibility to use multiple hybridization techniques. As a matter of fact, in each layer one can implement a different methodology to ensemble the recommendations, choosing the best solution with respect to the input data.
We also participated in the RecSys Challenge 2016, organized among others by TU Delft and Xing, where our Multi-Stack Ensemble allowed us to end in the 4th position and 1st among Academic teams (the first 3 were companies) out of more than 120 teams. We were the only Academic team composed of Master students.
We were then invited to the ACM RecSys Conference, held in Boston at the
Massachusetts Institute of Technology, to present our solution to researchers
and companies from all over the world. Here we also received a special men-
tion as youngest team during the prize-giving ceremony.
Our paper “Multi-Stack Ensemble for Job Recommendation” was then ac-
cepted and published in the ACM RecSys proceedings. We were also awarded
by Politecnico of Milan for our results with a scholarship.
Next year the RecSys Challenge 2017 [20] will again be hosted by Xing,
but will introduce a really interesting new aspect: online evaluation. Basically the competition should be divided into 2 phases; the first one will be similar to the 2016 competition. The top-n teams will then be allowed to proceed to the second phase, which will consist of an online evaluation.
Teams will provide recommendations that will be proposed to real users
on the platform, to actually discover whether in a real environment some
solutions may be better than others. That being said, it would be really interesting to participate in the second phase and test the Multi-Stack technique with the online evaluation. The problem with the offline environment
is that users are biased by the presence of another recommender engine, the
one currently working on the platform (i.e. impressions for the RecSys com-
petition) during the collection of the data. Therefore the real problem for
the offline setting may turn out to be: “predict the items users will interact
with, among the ones already recommended (i.e. impressions)”. This means
that there may be an algorithm that performs really well on this task but fails in an online setting where its recommendations are actually presented to the user. On the other hand, a fancy learner that proposes novel items, never shown to the user by the company's recommender system, may capture the
attention in the online setting.
Another interesting future development would be to try the aforementioned
ensemble in different contexts, for example in movie recommender systems
or e-commerce. We think that it could work really well since it embodies
a general technique applicable to different domains. What may differ is
the structure of the stack that should be customized for every application,
but the inner concept of the 2-step algorithm is context independent. One may also apply totally different combination techniques that better fit the characteristics of the domain.
Bibliography
[1] Daniel Kluver. What is the goal of a Recommender System?
[2] Shaha T. Al-Otaibi and Mourad Ykhlef. A survey of job recommender
systems. 2012.
[3] Giacomo Domeniconi, Gianluca Moro, Andrea Pagliarani, and Roberto Pasolini. Job recommendation from semantic similarity of linkedin users' skills. 2016.
[4] Azarias Reda, Yubin Park, Mitul Tiwari, Christian Posse, and Sam Shah. Metaphor: A system for related search recommendations. 2012.
[5] Lili Wu, Sean Choi, Mitul Tiwari, Christian Posse, and Sam Shah. The browsemaps: Collaborative filtering at linkedin. 2010.
[6] Elena Sacchi and Ervin Kamberoski. Collaborative Filtering and
Content-Based Filtering Algorithms for the Job Recommendation Prob-
lem. 2016.
[7] Fabian Abel, Andras Benczur, Daniel Kohlsdorf, Martha Larson, and
Robert Palovics. Recsys challenge 2016: Job recommendations. In
Proceedings of the 10th ACM Conference on Recommender Systems,
RecSys ’16, pages 425–426, New York, NY, USA, 2016. ACM.
[8] Wikipedia. Trial and Error.
[9] Yeonjeong Lee, Kyoung-jae Kim, and Youngtae Kim. Recommender systems using ensemble techniques. International Journal of Computer, Electrical, Automation, Control and Information Engineering, 7, 2013.
[10] Michael D. Ekstrand, John T. Riedl, and Joseph A. Konstan. Collab-
orative filtering recommender systems. Found. Trends Hum.-Comput.
Interact., 4(2):81–173, February 2011.
[11] Michael J. Pazzani and Daniel Billsus. The adaptive web. chap-
ter Content-based Recommendation Systems, pages 325–341. Springer-
Verlag, Berlin, Heidelberg, 2007.
[12] Robin Burke. Hybrid recommender systems: Survey and experiments.
The adpative web, 4321:377–408, 2007.
[13] David H. Wolpert. Stacked generalization. Neural Networks, 5:241–259,
1992.
[14] Fabio Roda, Alberto Costa, and Leo Liberti. Optimal recommender systems blending. 2011.
[15] Mark Claypool, Anuja Gokhale, Tim Miranda, and Pavel Murnikov. Combining content-based and collaborative filters in an online newspaper.
[16] Dietmar Jannach, Markus Zanker, Alexander Felfernig, and Gerhard Friedrich. Recommender Systems: An Introduction. 2011.
[17] Tommaso Carpi, Marco Edemanti, Elena Sacchi, Ervin Kamberoski, Paolo Cremonesi, Roberto Pagano, and Massimo Quadrana. Multi-stack ensemble for job recommendation. RecSys Challenge 2016 Proceedings, 2016.
[18] Joseph Sill, Gabor Takacs, Lester Mackey, and David Lin. Feature-
weighted linear stacking. arXiv preprint arXiv:0911.0460, 2009.
[19] Wikipedia. Hyperparameter Optimization.
[20] ACM RecSys Conference 2017.