POLITECNICO DI MILANO
Corso di Laurea Magistrale in Ingegneria Informatica
Dipartimento di Elettronica e Informazione
Multi-Stack Ensemble for Job
Recommendation
Advisor: Prof. Paolo Cremonesi
Co-advisor: Ing. Roberto Pagano
Master's thesis by:
Tommaso Carpi, student ID 836986
Marco Edemanti, student ID 838979
Academic Year 2015-2016
Abstract
Recommender Systems are a subclass of information filtering systems that
try to predict the preferences of users with respect to a set of items.
This thesis was developed in collaboration with TU Delft and Xing AG, a Business Social Network, which gave us the dataset used in this research.
The technique that we created is called Multi-Stack Ensemble and it consists of a series of different hybridization layers. The general idea is to ensemble algorithms in batches, starting with weak learners at the bottom, so that recommendations become more and more accurate as we climb the layers of the stack.
As a proof of the quality of our results, we participated in the RecSys Challenge 2016, organized among others by TU Delft and Xing, representing Politecnico di Milano as team "PumpkinPie" and stating that we would renounce any monetary prize, given our relationship with the organizers.
We tackled the problem using our Multi-Stack Ensemble, which performed really well and allowed us to finish in 4th position and 1st among Academic teams (the first 3 were companies) out of more than 120 teams. We were the only Academic team composed of Master students.
We were then invited to the ACM RecSys Conference, held in Boston at the Massachusetts Institute of Technology, to present our solution to researchers and companies from all over the world. There we also received a special mention as the youngest team during the prize-giving ceremony. Our paper "Multi-Stack Ensemble for Job Recommendation" was then accepted and published in the ACM RecSys proceedings. Politecnico di Milano also awarded us a scholarship for these results.
Sommario
Recommender Systems are a subclass of information filtering systems that predict the preferences of users with respect to a set of items. This thesis was developed in collaboration with TU Delft and Xing AG, a Business Social Network, which provided the dataset for this research.
The technique we created is called Multi-Stack Ensemble and consists of a series of different hybridization layers. The general idea is to combine the algorithms in batches, starting with the weaker ones at the bottom, so as to obtain more and more accurate recommendations as we climb the layers of the stack.
As proof of the quality of our results we participated in the RecSys Challenge 2016, organized among others by TU Delft and Xing, representing Politecnico di Milano as team "PumpkinPie" and declaring that we would renounce any monetary prize, given our relationship with the organizers.
Our technique gave excellent results, allowing us to finish in 4th place and 1st among Academic teams (the first 3 were companies) out of more than 120 participants. We were the only team composed of Master students. We then received an invitation to the ACM RecSys Conference, hosted at MIT in Boston, to present our solution to researchers and companies from all over the world, and we also received a special mention as the youngest team during the prize-giving ceremony.
Our paper "Multi-Stack Ensemble for Job Recommendation" was also accepted and published in the ACM RecSys proceedings.
Finally, Politecnico di Milano awarded us a scholarship for our results.
Contents

Abstract
Sommario
1 Introduction
1.1 Thesis Structure
2 Problem Description
2.1 Job Recommendation
2.2 ACM RecSys Challenge
2.3 Our Approach
2.4 Technologies
3 State of the Art
3.1 Ensemble in the context of Recommender Systems
3.2 Ensemble Techniques
3.2.1 Voting
3.2.2 Averaging
3.2.3 Rank Averaging
3.2.4 Interleave
3.2.5 Stacked Generalization
3.2.6 Blending
3.2.7 Collaborative via Content
3.2.8 Monolithic Hybridization
4 Evaluation
4.1 Evaluation Metrics
4.1.1 Recall
4.1.2 Precision at K
4.1.3 User Success
4.1.4 Competition Metric
5 Dataset
5.1 Datasets
5.1.1 User Profile
5.1.2 Item Profile
5.1.3 Interactions
5.1.4 Impressions
5.1.5 Test Users
6 Source Algorithms
6.1 Past Interaction & Past Impression
6.2 Collaborative Filtering Algorithms
6.3 Content Based Algorithm
7 Ensemble Technique
7.1 Multi-Stack Ensemble
7.2 Voting-Based Methods
7.2.1 Linear Ensemble
7.2.2 Evaluation Score Ensemble
7.3 Reduce Function
7.4 Stack Layers
7.4.1 Layer 1
7.4.2 Layer 2
7.4.3 Layer 3
8 Results
8.1 Layers Tuning
8.1.1 Input Algorithms
8.1.2 Layer 1.1
8.1.3 Layer 1.2
8.1.4 Layer 2
8.1.5 Layer 3
8.2 Ensemble Comparison
9 Conclusion and Future Developments
List of Figures

2.1 Cluster mode overview
3.1 Hybrid recommender system's scheme
3.2 Stacked Generalization scheme
3.3 Collaborative via Content scheme
7.1 Ensemble Hierarchy
7.2 Stack Layer Structure
7.3 Reduce Function Example. Inside each block the values represents the structure (item, rating)
List of Tables

3.1 Voting Example (1)
3.2 Voting Example (2)
3.3 Voting Example (3)
3.4 Voting Example (4)
3.5 Voting Example (5)
3.6 Score Averaging Example
3.7 Rank Averaging Example
3.8 Interleave Example
5.1 Dataset
5.2 User features
5.3 Item features
5.4 Interactions description
5.5 Impressions dataset
7.1 Linear Ensemble Example
7.2 Linear Ensemble Interleaving Example
7.3 Weight and Decay set of values
7.4 Evaluation Score Example
7.5 I/O Layer 1
7.6 I/O Layer 2
7.7 I/O Layer 3
8.1 Scores of Input Algorithms. The Score value is obtained using the evaluation metric described in Section 4.1.4
8.2 Points per item of an algorithm
8.3 Layer 1.1 (Collaborative Filtering) Linear Method Parametrization
8.4 Layer 1.1 Results. The Score value is obtained using the evaluation metric described in Section 4.1.4
8.5 Layer 1.2 (Content-Based) Linear Method Parametrization
8.6 Layer 1.2 Results. The Score value is obtained using the evaluation metric described in Section 4.1.4
8.7 Layer 2 Linear Method Parametrization
8.8 Layer 2 Results. The Score value is obtained using the evaluation metric described in Section 4.1.4
8.9 Layer 3 Linear Method Parametrization
8.10 Layer 3 Results. The Score value is obtained using the evaluation metric described in Section 4.1.4
8.11 Ensemble Comparison. The Score value is obtained using the evaluation metric described in Section 4.1.4
Chapter 1
Introduction
Recommender Systems are a subclass of information filtering systems that
try to predict the preferences of users with respect to a set of items. The extensive growth of product catalogues made it necessary to create automatic systems that could help customers discover relevant items, and in doing so they changed the way websites communicate with their users. Rather than providing a static experience in which users search for and potentially buy products, Recommender Systems increase interaction to provide a richer experience. It is not just about predicting preferences to maximize click-through: the goal of a Recommender Engine is to "Assist users in accessing and understanding large digital collections in domains subject to significant personal taste" [1].
In recent years they have become really popular and are employed in a va-
riety of domains: some popular applications include movies, music, news,
books, jobs, e-commerce, e-tourism and online dating.
This thesis work was developed in collaboration with TU Delft and Xing AG,
and it explores an innovative approach for the creation of hybrid models in a
particular domain of Recommender Systems, which is job-recommendation.
Job Recommender Systems [2] [3] were born when Internet-based recruiting platforms became a primary recruitment channel for most companies, thanks to lower recruitment times and advertising costs. Many platforms exist to connect companies with employees, the so-called "Business Social Networks" such as LinkedIn [4] [5] and Xing AG. Since the company and user databases grew rapidly, a Recommender Engine is needed to help users discover jobs that match their personal interests.
The whole recruiting and hiring process has some peculiarities with respect to better-known Recommender System domains such as movies, music and e-commerce. In fact job postings, i.e. open positions posted by a company, are limited-quantity items whose expiration time is not predictable: at some point in time the company either hires a candidate or decides to withdraw the position. Moreover, employment may require life changes, and factors like geographical relocation influence users differently; it is not as simple as choosing a movie on your sofa or buying an item shipped to your address.
Ratings, i.e. preferences expressed by a user with respect to an item, are also implicit. We can only infer information from the user's behaviour, e.g. whether they clicked on or replied to a job offering. This increases the difficulty of the task, because a bad interpretation of the users' preferences may lead to poor results. On the contrary, an explicit rating system helps to clearly identify what users like or dislike, since they consciously leave a preference. For example, reviewing a movie with 1 star out of 5 is a strong signal that the user didn't like it.
As we can see job-recommendation tackles different problems with respect
to other domains, which makes it interesting from a research point of view.
The aim of this thesis work is to develop a new ensemble technique, trying
to overcome the limitations of single learners to achieve a better accuracy.
The basic models that we use as input algorithms are those presented in the thesis work of Elena Sacchi and Ervin Kamberoski [6]. They consist of two Collaborative Filtering and two Content-Based techniques, plus one list of recommendations obtained by processing the interactions (i.e. job postings that users actively clicked on in the past). To these models we added four other Collaborative Filtering techniques, which either train on or recommend impressions (i.e. job postings recommended by the Xing recommender systems), and one other list of items derived by processing the impressions directly from the data.
The dataset provided by Xing AG is the same used in the RecSys Challenge 2016, a competition hosted for the annual RecSys Conference, of which Xing and TU Delft were organizers. The topic of the challenge was job recommendation and the problem was "next click prediction": given a Xing user, the goal was to predict the job postings that the user would positively interact with.
The technique that we created is called Multi-Stack Ensemble; it consists of a series of different hybridization layers. The general idea is to ensemble the algorithms in batches, starting with weak learners at the bottom, so that recommendations become more and more accurate as we climb the layers of the stack. Inside each layer a 2-step function is applied to combine the recommendations: first a voting method is called in order to assign a score to every item, based on the characteristics of the input algorithms, then a reduce function is applied to sort and select the top items for the final recommendation. Our implementation for the Xing dataset uses three different layers.
The novel idea behind Multi-Stack ensembling is that one can create hybridization batches and apply different techniques to combine recommendations depending on the input algorithms. In each layer different voting methods, or even totally different mechanisms, can be used. We tested our solution on the aforementioned dataset and compared the results with other state-of-the-art hybridization techniques. Multi-Stack Ensemble outperforms all of them, achieving a higher accuracy, as shown in Table 8.11.
As a proof of the quality of our results, we participated in the RecSys Challenge 2016 representing Politecnico di Milano as team "PumpkinPie", stating that we would renounce any monetary prize given our relationship with TU Delft and Xing.
We tackled the problem using our Multi-Stack Ensemble, which performed really well and allowed us to finish in 4th position and 1st among Academic teams (the first 3 were companies) out of more than 120 teams. We were the only Academic team composed of Master students.
We were then invited to the ACM RecSys Conference, held in Boston at the Massachusetts Institute of Technology, to present our solution to researchers and companies from all over the world. There we also received a special mention as the youngest team during the prize-giving ceremony.
Our paper "Multi-Stack Ensemble for Job Recommendation" (Tommaso Carpi, Marco Edemanti, Ervin Kamberoski, Elena Sacchi, Paolo Cremonesi, Roberto Pagano, Massimo Quadrana) was then accepted and published in the ACM RecSys proceedings.
Politecnico di Milano also awarded us a scholarship for these results.
1.1 Thesis Structure
This thesis work is structured in the following way:
In Chapter 2 we first describe the job-recommendation domain, followed by a description of the RecSys Challenge. At the end we present the technologies used for both this thesis work and the competition.
In Chapter 3 we present the State of the Art of ensemble techniques, first describing why hybrid models are so powerful and then listing a series of well-known techniques used in both research and business environments.
In Chapter 4 we show the evaluation metric used internally at Xing, which was adopted for the competition and also used for this thesis work.
In Chapter 5 we describe the dataset used for this thesis work.
In Chapter 6 we present the input algorithms used for our Multi-Stack ensemble.
In Chapter 7 we present our Ensemble technique, first describing the 2-step algorithm we implemented: (i) the two voting-based methods, Linear and Evaluation-Score, and (ii) the Reduce function. Then we describe the overall architecture of our stack structure, covering each of the three layers of the hierarchical stack and showing the input/output parameters and the voting technique used to compute the final recommendation.
In Chapter 8 we present the results of our technique with respect to the state-of-the-art algorithms that were implemented, showing how, in this domain, our solution achieves a better accuracy.
In Chapter 9 we draw the conclusions of this thesis work and discuss possible future developments.
Chapter 2
Problem Description
This chapter highlights the problems faced during both our collaboration with Xing AG and TU Delft and our participation in the ACM RecSys Challenge 2016. In the first section we present some information about the job recommendation scenario. In the second section we describe the challenge and the problems that came with it. In the third section we describe how we propose to solve these problems in our work, and the fourth section explains the technologies involved in the deployment of our infrastructure.
2.1 Job Recommendation
Recommender Systems have multiple useful applications in the business world; one of them regards Business Social Networks, in which the goal is to suggest to a user a potential job that fits their skills and interests.
Thanks to the data provided by Xing AG, a Business Social Network well known and used in German-speaking countries, we were able to explore and exploit some characteristics that are peculiar to the job recommendation scenario with respect to the more famous e-commerce or movie recommendations. One of the most interesting characteristics is the fact that users perform multiple interactions with the same job. On a movie platform it is really uncommon to recommend an already-seen item, since users are not likely to watch the same movie again, but in the job domain it happens frequently for many reasons: users may be interested in reviewing the description of the jobs, replying to the posting or simply comparing two offers. It also follows that the preferences expressed by a user are implicit: we don't actually know how much a user is interested in a job posting, and this increases the difficulty of the task because a bad interpretation of the user behavior may lead to poor results; in an explicit rating system, on the other hand, users consciously leave a preference.
Another interesting aspect is the sometimes odd relation between a user and an item: for example, we found top managers clicking on internship job postings for new graduates. This is of course not a random event; if we think about the domain we can find some explanation for this strange behaviour: the top manager may have children who use his/her account, probably with premium membership, to look for jobs, or the manager may be interested in the way competitors hire people. All this shows that it might actually be valuable to recommend internships to top managers based on their past interactions.
Of course, all of these considerations and difficulties were taken into account when deploying our ensemble technique.
2.2 ACM RecSys Challenge
The ACM RecSys Conference 2016, hosted in Boston at the Massachusetts Institute of Technology, found the job-recommendation problem so interesting that a competition was organized on the topic. The RecSys Challenge 2016 [7] was organized by Xing AG, which provided as source data the same dataset we were given. We decided to take part in the competition representing Politecnico di Milano as team "PumpkinPie" and, for correctness, having a collaboration with Xing AG, we refused any money prize in case of victory. For us it was the best opportunity to show that our ideas and solutions actually worked in a real environment, and luckily our approaches turned out to be as effective as in the local tests.
The task of the challenge was to predict the job postings that were likely to be relevant for each user. Participants had to provide a set of up to 30 recommendations, ranked by relevance, for each of the 150k users in the test set. The evaluation metric was a hybrid of different classic metrics that we will describe in Chapter 4.
In our solution we started working with both Content-Based and Collaborative Filtering techniques, whose recommendations were then combined using an ensemble technique called "Multi-Stack Ensemble" in order to overcome the weak points of the single learners.
We finished the competition in 4th position, ranking 1st among the Academic teams (the first 3 were companies) out of more than one hundred teams, and we also received a mention as the youngest participating team; in fact our team was composed only of Master students.
2.3 Our Approach
One of the biggest issues when dealing with ensemble techniques is surely the fact that there is no single approach: the solutions are often domain dependent and you actually have to try them all in order to come up with a decent result; moreover, there was no literature regarding how to deal with the job recommendation scenario.
We thought that a good way to start was to deploy the basic hybrid approaches, common to every recommender system, see their performance and then, according to the assumptions made in Section 2.1, customize these solutions to see if they performed better.
In the end we came up with a fresh idea that tries to unify the peculiarities of all the basic ensemble techniques; probably one of its best key features is that you just need the submission files to obtain a new recommendation, no additional models are required.
We might say that we had a trial-and-error approach [8] in order to achieve the best possible result; of course this does not mean that we randomly changed the parameters, but rather that we manipulated the variables methodically in an attempt to find the best possible configuration.
Our method proved to be successful: indeed our solution was better than any traditional hybrid recommender system known so far in this domain.
2.4 Technologies
All of our research work was deployed such that all the experiments were reproducible regardless of the size of the input. We thought that, in order to have a fully customizable and scalable system, all of our algorithms should be re-implemented from scratch; that is why we did not use any framework or library, as this gave us the possibility to implement or change any small detail of our algorithms in a way that is less painful and faster than working with a black-box system.
To achieve this purpose we developed our "infrastructure" using different tools:
• Python 2.7: a widely used high-level, general-purpose, interpreted, dynamic programming language; all of our scripts and algorithms are written in Python;
• Apache Spark 1.6: an open source cluster computing framework and a fast, general engine for big data processing, with built-in modules for streaming, SQL, machine learning and graph processing. We used it to parallelize our tasks and to have a cluster infrastructure available when the computational effort becomes too heavy for a standalone machine. Figure 2.1 provides a rough idea of how a network of multiple nodes is managed: you just need to specify which node is the driver and which nodes are workers, and Apache Spark will automatically split the workload among the network (a minimal usage sketch is given after this list);
Figure 2.1: Cluster mode overview
• PoliCloud: the IaaS cloud designed, managed, and deployed by Politecnico di Milano. It is a cloud infrastructure for research and experimentation on big data, distributed computing, cloud architectures and the Internet of Things. The datasets provided by Xing were too big to be managed on our machines, thus we used this infrastructure to run our experiments; we were provided with 5 different machines, each having 16 GB of RAM and 8 cores. On top of this small infrastructure we then set up Apache Spark;
• Jupyter Notebook: a web application that allows you to create and share documents that contain live code, equations, visualizations and explanatory text. We used it as our high-level IDE;
• Graphlab: an extensible machine learning framework that enables developers and data scientists to easily build and deploy intelligent applications and services at scale. We used this framework only to perform data analysis, because we found that it offers some solutions faster than those offered by Apache Spark.
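To give a concrete feel for this setup, the following is a minimal PySpark sketch of how per-user work can be distributed over such a cluster. The master URL, file name and the body of recommend_for_user are placeholders for illustration only, not the actual thesis code.

```python
from pyspark import SparkConf, SparkContext

# Minimal sketch: distribute per-user recommendation work across the cluster.
# "spark://driver-host:7077" and "test_users.csv" are assumed placeholders.
conf = SparkConf().setAppName("multi-stack-ensemble").setMaster("spark://driver-host:7077")
sc = SparkContext(conf=conf)

def recommend_for_user(user_id):
    # placeholder: compute up to 30 ranked job postings for this user
    return user_id, []

user_ids = sc.textFile("test_users.csv").map(lambda line: int(line.strip()))
recommendations = user_ids.map(recommend_for_user).collect()
sc.stop()
```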
Chapter 3
State of the Art
In this chapter we present the state of the art of Ensemble Techniques [9] for Recommender Systems.
In the first part we describe what an Ensemble is in this domain and why it can be so powerful, especially to overcome the limitations of standalone learners.
In the second part we present the best-known techniques used in both research and business environments that are applicable to the context of job recommendation.
3.1 Ensemble in the context of Recommender Systems
The most prominent recommendation approaches, discussed in the thesis of Elena Sacchi and Ervin Kamberoski [6], exploit different sources of information and follow different paradigms to create a recommendation. Even if they produce results that are considered to be personalized based on the assumed interests of their users, they perform with varying degrees of accuracy depending on the quality of the data and the application domain. Collaborative Filtering [10] exploits a specific type of information from the user model together with community data to derive recommendations, while Content-Based [11] approaches rely on product features and/or user features.
Each of these basic approaches has its pros and cons: for instance, the former is able to exploit trends and increase serendipity in the recommendations, suggesting new items that may not be related to previously interacted ones, while the latter can mitigate the cold-start problem, providing recommendations to new users whose profile information is too scarce or anomalous to give the collaborative technique any traction. However, none of the basic approaches is able to fully exploit all of these characteristics, therefore hybrid systems help to overcome these limitations.
An excellent example of combining different recommendation algorithm variants is the Netflix Prize competition, in which hundreds of students and researchers worked to improve a collaborative movie recommender engine by hybridizing hundreds of different Collaborative Filtering techniques to improve the overall accuracy. Figure 3.1 gives a high-level overview of a hybrid recommendation system: starting from different recommendation sources as input data, it combines them and outputs a new, enriched item list.

Figure 3.1: Hybrid recommender system's scheme

Usually the methods involved in the hybridization step are based on very different approaches in order to smooth out the errors of the individual techniques, but as you will see in Chapter 7 there is no reason why several different techniques of the same type could not be hybridized: for example, two or more different Collaborative Filtering systems could work together.
Unfortunately there is little about hybrid recommender systems in the current state of the art, probably because the problem is really context-dependent, making it difficult to identify a standard solution. Therefore we had very few materials to work with while creating our Multi-Stack Ensemble: we started from the knowledge of standard techniques, but then tried to implement an innovative approach that could fit the job-recommendation domain well.
3.2 Ensemble Techniques
3.2.1 Voting
Voting ensemble techniques work in the same way as simple error-correcting codes. The simplest error-correcting code is a repetition code, where the string of bits is repeated n times and the correct original sequence is extracted using a majority vote. So if by chance one bit of a string is corrupted, it is likely that all the other copies still have the correct value; applying a majority vote, our output will then be the original string.
This technique is generally used for machine learning classification problems, but it is also used in Recommender Systems. We will discuss some applications later in this section.
Now let us describe the main idea behind the algorithm. Suppose we have a test set of 10 samples whose ground truth is ten times "1".
1111111111
We then have 3 binary classifiers (A, B and C) with 70% accuracy, which means each of them predicts seven 1s and three 0s.
That being said, our majority vote technique has 4 possible outcomes for each triple of bits:
• All 3 are correct (i.e. there are three 1s):
0.7 × 0.7 × 0.7 = 0.343
• Only 2 are correct:
0.7 × 0.7 × 0.3 + 0.7 × 0.3 × 0.7 + 0.3 × 0.7 × 0.7 = 0.441
• Only 1 is correct:
0.7 × 0.3 × 0.3 + 0.3 × 0.3 × 0.7 + 0.3 × 0.7 × 0.3 = 0.189
• All 3 are wrong:
0.3 × 0.3 × 0.3 = 0.027
What we see from these statistics is that about 44% of the time the majority vote corrects the error. To wrap everything up, we can say that overall this technique gives our ensembled prediction an accuracy of about 78% (0.343 + 0.441), higher than that of each single learner. So it actually improves our recommendation.
Just as with error-correcting codes, the more replicated predictions we have, i.e. the more basic learners, the more accurate our final result will be. As a matter of fact, using the above example with 5 binary classifiers rather than 3 we would get about 83% accuracy.
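As a sanity check of these numbers, the small script below (an illustration, not part of the thesis code) enumerates all outcomes of n independent classifiers with accuracy p and sums the probability that the majority is correct:

```python
from itertools import product

def majority_vote_accuracy(n_classifiers, p):
    """Probability that the majority of n independent classifiers,
    each correct with probability p, gives the correct answer."""
    total = 0.0
    for outcome in product([1, 0], repeat=n_classifiers):
        correct = sum(outcome)
        if 2 * correct > n_classifiers:          # strict majority is correct
            total += (p ** correct) * ((1 - p) ** (n_classifiers - correct))
    return total

print(majority_vote_accuracy(3, 0.7))   # ~0.784
print(majority_vote_accuracy(5, 0.7))   # ~0.837
```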
From here we can go a step further by noting that the more uncorrelated the single predictions are, the higher the accuracy of our final model will be. Let us start from the previous example, where we had highly correlated models.
Description Configuration Accuracy
Inputs
1111111100 80%
1111111100 80%
1011111100 70%
Ensemble 1111111100 80%
Table 3.1: Voting Example (1)
Applying the same algorithm as before we see no improvement in accuracy, which stays at 80%, as shown in Table 3.1.
Now let’s try three different models which may be less accurate, but highly
uncorrelated.
Description Configuration Accuracy
Inputs
1111111100 80%
0111011101 70%
1000101111 60%
Ensemble 1111111101 90%
Table 3.2: Voting Example (2)
When ensembling with a majority vote we get 90% accuracy, as shown in Table 3.2, which is a huge improvement with respect to the basic models.
A further improvement can be obtained using a weighting technique: since it is unlikely that all the input models are equally accurate, it makes sense to assign a higher weight to the better models. Obviously the counterpart is that low-weighted models only lightly affect high-weighted ones, leading to a smaller improvement in accuracy.
As you may imagine, this is mostly used in machine learning classification problems, but it can also be implemented in the context of Recommender Systems. For example, one can see the problem as a binary classification, where one class contains the list of recommended items while the other contains all the remaining ones.
The simplest approach is to apply a majority vote directly on the output of each basic learner. Suppose we have a list of all possible recommendable items, with value 1 if the item was recommended by that algorithm and 0 otherwise. The majority vote would select those items which are recommended by multiple algorithms, exploiting the fact that if different learners recommend the same item it is probably a good recommendation for the user.
Let us see this with an example: imagine having a list of 10 recommendable items from which you have to provide 3 recommendations. We have our three learners A, B and C, where a 1 in position i means that the i-th item is recommended. What we can do is count the number of times an item i is recommended and then select the 3 most voted ones.
Description Configuration
Inputs
1010010000
1001001000
1010001000
Occurrences 3021012000
Top 3 Rec 1010001000
Table 3.3: Voting Example (3)
Usually some learners are better than others, so it makes sense to give different weights to different algorithms. Following the example above we may obtain Table 3.4: as you can see, we get a slightly different recommendation given the fact that the three input algorithms have different weights.
The example that we have just shown is based on the output of each algorithm, so we basically perform a standalone recommendation using each single learner and then mix them together in a second step. This method is the simplest and most efficient one, as you can add new algorithms over time without the need to re-run the previous techniques: you simply use their output.
Description Configuration Weight
Inputs
2020020000 2
4004004000 4
1010001000 1
Sum 7034025000 -
Top 3 Rec 1001001000 -
Table 3.4: Voting Example (4)
There is another implementation of the voting technique which uses an explicit rating r_a(u, i), that is the rating that algorithm a assigned to item i recommended for user u. Let us see an example: suppose that each learner assigns a rating to each recommended item on a scale of 0-10, as in Table 3.5.
Description Configuration
Inputs
1080080000
4003001000
1010006000
Sum 6093087000
Top 3 Rec 0010011000
Table 3.5: Voting Example (5)
What the voting technique does is sum the ratings given to each item, so even though the 1st item is recommended by all 3 input algorithms it is not going to be recommended as a result of the ensemble, since its final rating is lower than that of other items. Obviously this implementation can also take advantage of algorithm weighting, simply multiplying the weight and the rating, w_a × r_a(u, i).
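To make the mechanism concrete, the following sketch (illustrative code, with item ids in place of the bit strings of the tables) implements both the plain count vote of Table 3.3 and the weighted vote of Table 3.4:

```python
from collections import defaultdict

def vote(recommendations, weights=None, k=3):
    """Weighted voting over recommendation lists.

    recommendations: dict algorithm -> list of recommended item ids
    weights:         dict algorithm -> weight (defaults to 1 for every algorithm)
    Returns the k items with the highest (weighted) vote count.
    """
    weights = weights or {}
    scores = defaultdict(float)
    for algo, items in recommendations.items():
        for item in items:
            scores[item] += weights.get(algo, 1.0)
    return sorted(scores, key=scores.get, reverse=True)[:k]

# Items are numbered 1..10; the lists mirror Tables 3.3 and 3.4.
recs = {"A": [1, 3, 6], "B": [1, 4, 7], "C": [1, 3, 7]}
print(vote(recs))                                      # count vote: items 1, 3, 7
print(vote(recs, weights={"A": 2, "B": 4, "C": 1}))    # weighted vote: items 1, 7, 4
```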
3.2.2 Averaging
Ensemble averaging is one of the most common strategies applied in the machine learning field. It works well for a wide range of problems (both classification and regression) and metrics (AUC, squared error or logarithmic loss).
The main idea is to create multiple models and combine them to produce the desired output. Often an ensemble of models performs better than any individual model, because the various errors of the models average out; indeed this approach should prevent overfitting: we create multiple predictors with low bias and high variance and then hopefully combine them into a predictor with low bias and low variance.
Generally, in a machine learning problem this means creating a set of learners with varying parameters, such as the learning rate, momentum, etc., and then averaging their results. In the recommender systems scenario, instead, the averaging strategy combines the recommendations of two or more different recommendation systems by computing the average of their scores. This indicates that when adopting this approach we should prefer averaging scores coming from completely different approaches, such as Collaborative Filtering and Content-Based Filtering.
Thus, given n different rating functions r_k with associated relative weights β_k, the final score of a user u for an item i is:

r_{weighted}(u, i) = \sum_{k=1}^{n} \beta_k \times r_k(u, i)    (3.1)

where all r_k(u, i) need to be normalized to have consistent values among different recommendations and \sum_{k=1}^{n} \beta_k = 1.
This technique is quite straightforward, and that is why it is a popular strategy for combining the predictive power of different recommendation techniques. Consider an example in which two recommender systems are used to suggest one out of five items to a user, Alice. As can easily be seen from Table 3.6, the two recommendation lists are hybridized using a uniform weighting scheme with β1 = β2 = 0.5. Item c is then the one that on average received the highest score.
item   r1 score   r2 score   r_weighted score
a      1          4          2.5
b      2          1          1.5
c      3          5          4
d      4          3          3.5
e      5          2          3.5
Table 3.6: Score Averaging Example
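A minimal sketch of Equation 3.1, reproducing the example of Table 3.6 (illustrative code, not the thesis implementation):

```python
def weighted_average(score_lists, betas):
    """Weighted score averaging (Equation 3.1).

    score_lists: dict recommender -> {item: normalized score}
    betas:       dict recommender -> weight, with the weights summing to 1
    """
    items = set().union(*(s.keys() for s in score_lists.values()))
    return {i: sum(betas[r] * score_lists[r].get(i, 0.0) for r in score_lists)
            for i in items}

# Example of Table 3.6, with beta1 = beta2 = 0.5
r1 = {"a": 1, "b": 2, "c": 3, "d": 4, "e": 5}
r2 = {"a": 4, "b": 1, "c": 5, "d": 3, "e": 2}
scores = weighted_average({"r1": r1, "r2": r2}, {"r1": 0.5, "r2": 0.5})
print(max(scores, key=scores.get))   # 'c' receives the highest average score
```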
3.2.3 Rank Averaging
When averaging is used to ensemble multiple different models, some problems may arise because not all predictors are consistent in their score assignment, or they may have different scales. It is therefore good practice to normalize the values before applying an averaging technique. The solution proposed here, instead, is to use the rank of the items. Each recommendation is nothing but an ordered list, so what we do is exploit the position of each item: the first item will have rank 1 while the n-th will have rank n. After this fast processing phase we can apply the averaging technique on the ranks. The basic implementation simply performs an arithmetic average of the ranks that each input algorithm assigned to an item i:
r'(i) = \frac{1}{n} \sum_{k=1}^{n} r_k(i)    (3.2)
Suppose we have 5 recommendable items [a, b, c, d, e] and 2 basic algorithms M and N; in Table 3.7 we show each item associated with its corresponding rank. The rec_ens column represents the average rank computed using Equation 3.2, while the rec_final column is the ordered final recommendation.
item   rec1 rank   rec2 rank   rec_ens rank   rec_final rank
a      1           4           2.5            2
b      2           1           1.5            1
c      3           5           4              5
d      4           3           3.5            3
e      5           2           3.5            4
Table 3.7: Rank Averaging Example
As you can see, the result of this technique is different from the one obtained in Table 3.6, where we applied the averaging to the ratings r_a(u, i).
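A short sketch of Equation 3.2 on the same example (illustrative code only):

```python
def rank_average(ranked_lists):
    """Rank averaging (Equation 3.2): average the rank each algorithm assigns
    to an item, then sort by the averaged rank (lower is better)."""
    ranks = {}
    for items in ranked_lists:
        for position, item in enumerate(items, start=1):
            ranks.setdefault(item, []).append(position)
    avg = {item: sum(r) / float(len(r)) for item, r in ranks.items()}
    return sorted(avg, key=avg.get)

# Example of Table 3.7: M = [a, b, c, d, e], N = [b, e, d, a, c]
print(rank_average([["a", "b", "c", "d", "e"], ["b", "e", "d", "a", "c"]]))
# ['b', 'a', 'd', 'e', 'c'], i.e. the rec_final column of Table 3.7
```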
3.2.4 Interleave
When it is practical to make a large number of recommendations simultaneously, it may be possible to use a hybrid approach where recommendations from more than one technique are presented together. Interleaving is a trivial hybrid technique that alternates the recommendations of the input algorithms in a round-robin fashion. Obviously this technique may lead to poor results, especially if some learners are much weaker than others. For this reason a weighted approach can be implemented which takes into consideration the different characteristics of the input data. This can be achieved using a custom interleaving factor, instead of a plain round-robin method, to privilege the stronger input recommendations.
Another problem may arise if the order of the final recommendation matters, as in the RecSys Challenge. As a matter of fact the order of our round-robin approach would be determinant, giving higher priority to the first algorithms visited during the iteration. A possible solution would be to randomly change the order whenever we perform the computation for a different user, or to define it a priori given some heuristics.
Interleaving was proposed in [12] under the name mixed hybridization. Table 3.8 shows how interleaving works with three different input sources using a round-robin approach.
rec1 item   rec2 item   rec3 item   rec_interleave item
a           e           l           a
b           f           m           e
c           g           a           l
d           h           b           b
e           i           c           f
Table 3.8: Interleave Example
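A minimal round-robin interleaving sketch (illustrative only); here we also skip items already emitted by another list, a choice not shown in the table but natural when building a single recommendation list:

```python
def interleave(ranked_lists, k=30):
    """Round-robin interleaving of several ranked lists, skipping duplicates."""
    result, seen = [], set()
    position = 0
    while len(result) < k and any(position < len(lst) for lst in ranked_lists):
        for lst in ranked_lists:
            if position < len(lst) and lst[position] not in seen:
                seen.add(lst[position])
                result.append(lst[position])
                if len(result) == k:
                    break
        position += 1
    return result

# Example of Table 3.8
print(interleave([["a", "b", "c", "d", "e"],
                  ["e", "f", "g", "h", "i"],
                  ["l", "m", "a", "b", "c"]], k=5))   # ['a', 'e', 'l', 'b', 'f']
```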
3.2.5 Stacked Generalization
Averaging and voting methods are really straightforward to understand and implement because there is no need to train new complex learners: they rely only on combining the prediction files obtained from the different models to hopefully reduce the error.
Stacked generalization was introduced by Wolpert [13] in a 1992 paper; the basic idea behind it is to use a pool of base classifiers and then use another classifier to combine their predictions, with the aim of reducing the generalization error.
Figure 3.2: Stacked Generalization scheme
Figure 3.2 highlights the two main phases of the stacked generalization process. The first step is to collect the output of each model into a new set of data: each instance in the original training set is now represented by every model's prediction of that instance's value along with its true classification. When constructing these models we must take care to ensure that the predictors are formed from a batch of samples that does not include the instance in question, just as cross-validation does. The newly constructed data set is then treated as the training set for another learning problem, so in the second step a learning algorithm is employed to solve this problem.
According to Wolpert's terminology, the data and the models constructed in the first step are referred to as level-0 data and level-0 models, while the second-stage data and learning algorithm are referred to as level-1 data and level-1 generalizer.
Let us imagine that our given data set is ν = {(y_n, x_n), n = 1, ..., N}, where y_n represents the target value and x_n represents the attribute values of the n-th instance. We randomly split the sample into J almost equal parts ν_1, ..., ν_J and define ν_j and ν^{(−j)} = ν − ν_j to be, respectively, the test and the training set for the j-th fold of a J-fold cross-validation. Now, given K learning algorithms, which we call level-0 generalizers, for each k = 1, ..., K we invoke the k-th algorithm on the data in the training set ν^{(−j)} to induce a model M_k^{(−j)}.
We can now denote the prediction of the model M_k^{(−j)} on an instance x_n belonging to ν_j as:

z_{kn} = M_k^{(−j)}(x_n)    (3.3)
At the end of the entire cross-validation process, the data assembled from the outputs of the K models is:

ν_{cv} = {(y_n, z_{1n}, ..., z_{Kn}), n = 1, ..., N}    (3.4)

We refer to this new data as the level-1 data; using a new learning algorithm we can derive a model M, or level-1 model, that takes as input the vector (z_1, ..., z_K) and outputs the final prediction or classification for our instance.
This last step completes the description of the stacked generalization method proposed by Wolpert [1992].
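A compact sketch of the level-0 / level-1 scheme is given below. It uses scikit-learn estimators purely for illustration (the thesis re-implemented its algorithms from scratch and did not use these libraries); the toy data and the choice of classifiers are assumptions.

```python
import numpy as np
from sklearn.model_selection import KFold
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier

def stacked_generalization(X, y, level0_models, level1_model, n_folds=5):
    """Build level-1 data from out-of-fold level-0 predictions (Eq. 3.3-3.4),
    then fit the level-1 generalizer on it."""
    Z = np.zeros((len(y), len(level0_models)))            # level-1 data (nu_cv)
    for train_idx, test_idx in KFold(n_splits=n_folds).split(X):
        for k, model in enumerate(level0_models):
            model.fit(X[train_idx], y[train_idx])          # M_k^(-j), fold j left out
            Z[test_idx, k] = model.predict(X[test_idx])    # z_kn for instances in fold j
    level1_model.fit(Z, y)                                 # level-1 generalizer
    for model in level0_models:                            # refit level-0 on all data
        model.fit(X, y)
    return level0_models, level1_model

# toy usage
X = np.random.rand(200, 4)
y = (X[:, 0] + X[:, 1] > 1).astype(int)
stacked_generalization(X, y,
                       [DecisionTreeClassifier(), KNeighborsClassifier()],
                       LogisticRegression())
```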
3.2.6 Blending
Blending is almost identical to the stacked generalization method proposed by Wolpert [1992]; the term was introduced by the Netflix Prize winners. It is simpler than the original version and has less risk of an information leak.
With blending [14], instead of creating out-of-fold predictions for the train set, you create a small holdout set of, say, 10% of the train set. The K level-0 models are trained on the remaining data, and the level-1 model learns only from their predictions on this holdout set.
This approach has a few benefits:
• it is simpler than stacking;
• the level-0 and level-1 generalizers use different data, preventing information leaks;
• there is no need to share a seed for stratified folds with your teammates;
and the cons are:
• you use less data;
• the final model may overfit to the holdout set;
• your cross-validation is more solid with stacking (calculated over more folds) than with a single small holdout set.
As for performance, both techniques are able to give similar results, and we can also combine them: create stacked ensembles with stacked generalization and out-of-fold predictions, then use a holdout set to further combine these models in a third stage.
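A blending sketch under the same assumptions as the stacking example above (scikit-learn used only for illustration; the same level-0 and level-1 estimators can be passed in):

```python
import numpy as np
from sklearn.model_selection import train_test_split

def blend(X, y, level0_models, level1_model, holdout_fraction=0.1):
    """Blending: level-0 models fit on ~90% of the data, the level-1 model fits
    only on their predictions for the ~10% holdout, so the two levels never see
    the same training instances."""
    X_fit, X_hold, y_fit, y_hold = train_test_split(X, y, test_size=holdout_fraction)
    Z_hold = np.column_stack([m.fit(X_fit, y_fit).predict(X_hold) for m in level0_models])
    level1_model.fit(Z_hold, y_hold)
    return level0_models, level1_model
```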
3.2.7 Collaborative via Content
A very well known issue in Recommender Systems is the so-called "cold start" problem, which means that the system cannot draw any inference for users about whom it has not yet gathered sufficient information. This happens every time the platform acquires a new user, or if the dataset is really sparse, i.e. the number of interactions between users and items is far lower than the number of items present in the database. For example, think about the number of movies present in a database like IMDB and the number of movies watched on average by a user.
Collaborative filtering techniques are those that suffer the most from the sparsity of the dataset, since they need as much matching information as possible among users. For example, if one user liked the movie "Rocky" and another liked the movie "Rocky II", they would not necessarily be matched together. On the other hand, Content-Based techniques can deal better with sparsity, given the fact that the system has at least some information about the user, be it interactions or explicit preferences. For example, if a user liked "Rocky", a Content-Based technique would find similarities with "Rocky II" and recommend it to that user.
To address this problem, a 2-step pipelined solution was proposed to lower sparsity [15]: we use the predictions obtained from a Content-Based technique to enrich the dataset, thus reducing sparsity by increasing the links between users and items. Then we can apply a collaborative filtering algorithm that exploits the less sparse dataset, as you can see in Figure 3.3.
Figure 3.3: Collaborative via Content scheme
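The pipeline of Figure 3.3 can be sketched as follows; the recommender interfaces (fit and recommend) are hypothetical and used only to show the flow of data:

```python
def collaborative_via_content(interactions, content_recommender, cf_recommender, n_enrich=5):
    """Two-step pipeline (Figure 3.3): enrich the sparse interaction set with
    pseudo-interactions predicted by a content-based model, then train the
    collaborative model on the densified data.

    interactions: set of (user, item) pairs observed in the data
    content_recommender / cf_recommender: hypothetical objects exposing fit() and recommend()
    """
    content_recommender.fit(interactions)
    enriched = set(interactions)
    for user in {u for u, _ in interactions}:
        for item in content_recommender.recommend(user, k=n_enrich):
            enriched.add((user, item))        # pseudo-interaction lowers sparsity
    cf_recommender.fit(enriched)
    return cf_recommender
```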
3.2.8 Monolithic Hybridization
Monolithic hybridization [16] denotes a design that incorporates aspects of different recommendation algorithms, mainly Content-Based and Collaborative Filtering, in one single implementation.
The idea behind this technique is that the hybrid uses additional input data that is specific to another recommendation algorithm; for example, a Content-Based recommender that also exploits community data to determine item similarities falls into this category. What is needed, then, is a preprocessing and combination of the different knowledge sources, followed by a modification of the algorithm behaviour in order to exploit the different types of input data.
There are 2 main approaches to monolithic hybridization:
• Feature Combination: the algorithm uses a different range of input data combined together, trying to build an algorithm with both content and collaborative capabilities. An example of this methodology was presented by Basu et al. (1998), who proposed a feature combination hybrid that combines collaborative features, such as a user's likes and dislikes, with content features of catalog items.
• Feature Augmentation: differently from feature combination, this hybrid does not simply combine and preprocess several types of input, but rather applies more complex transformation steps. In fact, the output of one recommender system augments the feature space of another recommender by preprocessing its knowledge sources. However, this must not be mistaken for a pipelined design, as discussed before in this chapter, because the implementation of the input recommender is strongly interwoven with the main component for reasons of performance and functionality. An example of feature augmentation can be found in the Content-boosted Collaborative Filtering approach (Melville et al. 2002), which predicts a user's assumed rating based on a collaborative mechanism that includes Content-Based predictions.
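To give an idea of what feature combination can look like in practice, the sketch below builds a single feature vector that mixes collaborative and content signals; all field names are hypothetical and only loosely inspired by the dataset of Chapter 5, not the method of Basu et al.:

```python
def combined_features(user_profile, item_profile, interactions):
    """Feature-combination sketch (hypothetical field names): one vector mixing
    collaborative signals with content attributes, to be fed to a single
    monolithic learner."""
    return [
        sum(1 for _, i in interactions if i == item_profile["id"]),          # collaborative: item popularity
        sum(1 for u, _ in interactions if u == user_profile["id"]),          # collaborative: user activity
        int(item_profile["career_level"] == user_profile["career_level"]),   # content: seniority match
        int(item_profile["discipline_id"] == user_profile["discipline_id"]), # content: discipline match
    ]
```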
Chapter 4
Evaluation
This chapter describes in detail the evaluation metric used for the competition and for this thesis work. It is a hybrid of several more standard metrics.
4.1 Evaluation Metrics
The dataset at our disposal has one important feature: it lacks explicit ratings. With such a dataset, all metrics relying on rating prediction cannot be adopted, also because we have no indication that a user disliked an item; hence recall-based methods are more suitable for this problem.
As we have seen in previous sections, the final output of our recommender system is a top-k list of items recommended for the user, ordered by an estimated score that represents the preference of the user for that item. The main idea when making recommendations is that ordering matters, therefore the items at the top of the list should be the most interesting for the user. In an online evaluation setting, the evaluation metrics can measure different aspects of the recommendation: how much the recommended item is of interest for the user, but also how different the items provided and then liked by the user are from the previously viewed ones (novelty), or how unlikely it was for the user to find these interesting items by himself (serendipity). In an offline evaluation setting like ours, all these other aspects cannot be measured, and all the metrics revolve around how much the list of recommended items corresponds to the list present in the ground truth, hoping that a high-scoring algorithm in this setting will also behave well in an online setting. In this section we first provide a description of the standard metrics that compose the competition metric and then describe the competition metric itself.
4.1.1 Recall
Recall is a measure of relevance and finds application in many fields. It was born in the information retrieval scenario: there the items are documents and the task is to return a set of relevant documents given a search term, or equivalently, to assign each document to one of two categories, "relevant" and "not relevant". The recall metric represents the proportion of relevant documents that are retrieved: it is the ratio between the number of relevant documents retrieved by a query and the size of the set of relevant documents. Using classification terms, it represents the ratio between true positive and true positive + false negative instances.
In the recommender system field, recall is calculated on the list of items returned by the algorithms to the user. This list is composed of the k most relevant items, ranked with the best one on top. The metric takes the list in input and evaluates how many of the k recommended items are truly relevant for the current user, that is, how many recommended items are present in the ground truth. The result is then calculated by dividing this number by the length of the whole list of relevant items. The mathematical representation is:

recall(k) = \frac{\#hits}{|T|}    (4.1)

where #hits is the number of good recommendations and T is the set of relevant items.
4.1.2 Precision at K
Precision is a known accuracy metric, extensively used in information retrieval and data mining. Differently from recall, precision measures the proportion of retrieved documents that are relevant, i.e. the ratio between true positive and true positive + false positive instances.
Precision takes all retrieved documents into account, but it can also be evaluated at a given cut-off rank, considering only the topmost results returned by the system. This measure is called precision at k, or P@k:

P@k = \frac{\#hits@k}{k}    (4.2)

where #hits@k is the number of good recommendations at a given cut-off k.
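The two metrics translate directly into code; a minimal sketch (list-of-item-ids representation assumed) is:

```python
def recall(recommended, relevant, k=30):
    """Fraction of the relevant items found in the top-k recommendations (Eq. 4.1)."""
    hits = len(set(recommended[:k]) & set(relevant))
    return hits / float(len(relevant)) if relevant else 0.0

def precision_at_k(recommended, relevant, k):
    """Fraction of the top-k recommendations that are relevant (Eq. 4.2)."""
    hits = len(set(recommended[:k]) & set(relevant))
    return hits / float(k)
```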
4.1.3 User Success
User success is a simple measure, not very common in information retrieval or data mining because it does not return any value regarding the quality of the recommendation: it returns 1 if there is at least one correctly recommended item and 0 otherwise. Whether you get all items right or only one, the score you get is always 1.
4.1.4 Competition Metric
The competition metric is a hybrid that combines all the aforementioned ones. It reflects the typical use cases at XING: users are presented with their top-k personalized recommendations, and a user interaction with one of the top-k is counted as a success.
According to this formula, for each user one can earn up to 100 points:

C(u) = 20 × (P@2 + P@4 + recall(30) + userSuccess) + 10 × (P@6 + P@20)    (4.3)

The final score is then computed as the sum over the test set U:

S = \sum_{u \in U} C(u)    (4.4)
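Reusing the recall and precision_at_k functions sketched in the previous section, the competition metric can be expressed as:

```python
def user_success(recommended, relevant, k=30):
    """1 if at least one of the top-k recommendations is relevant, 0 otherwise."""
    return 1 if set(recommended[:k]) & set(relevant) else 0

def competition_score(recommended, relevant):
    """Per-user score C(u) of Equation 4.3 (at most 100 points per user)."""
    return (20 * (precision_at_k(recommended, relevant, 2)
                  + precision_at_k(recommended, relevant, 4)
                  + recall(recommended, relevant, 30)
                  + user_success(recommended, relevant))
            + 10 * (precision_at_k(recommended, relevant, 6)
                    + precision_at_k(recommended, relevant, 20)))

# Leaderboard score (Equation 4.4): sum of C(u) over the users in the test set, e.g.
# total = sum(competition_score(recs[u], ground_truth[u]) for u in test_users)
```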
Chapter 5
Dataset
The current chapter provides a description of the datasets that Xing AG provided to us; differently from the thesis work of Sacchi and Kamberoski [6], in our research we also use the source data called Impressions. To reduce the computational effort, all our algorithms were first tested on a sampled dataset that preserved all of its statistical properties.
5.1 Datasets
The dataset provided by Xing was rather big and was not the classic User-Rating Matrix found in recommender systems research. We extracted a sort of implicit (and very sparse) rating matrix from the interactions and impressions files, but some side information was also available, like the user and item profiles. The training set spans about three months, from mid-August to mid-November 2015. The test set period, on the other hand, is just a week long and starts immediately after the end of the training set period. Table 5.1 shows the quantity of each entry of the dataset used for this thesis work.
DB Entry Quantity
Users 40 000
Test Users 10 000
Items 274 318
Interactions 474 198
Impressions 38 511 532
Table 5.1: Dataset
5.1.1 User Profile
The user profile contains the details about a user; its fields are described in Table 5.2.
Feature Description
id Anonymized ID of the user
jobroles List containing the anonymized skills of the user, e.g. python, java, HTML, etc.
career level Beginner, experienced, manager, etc.
discipline id Consulting, HR, etc.
industry id Internet, Automotive, Finance, etc.
country Country where the user is currently working
region Region where the user is currently working (only if in Germany)
experience n entries class Identifies the number of CV entries that the user has listed as work experiences
experience years experience Estimated number of years of work experience of the user
experience years in current Estimated number of years the user has been working in his/her current job
edu degree Estimated university degree of the user
edu fieldofstudies Engineering, Economics and Legal, etc.
Table 5.2: User features
5.1.2 Item Profile
The item profile shows information about the job postings present in the dataset; its structure is similar to the User Profile.
Feature Description
id Anonymized ID of the item
title Concepts that have been extracted from the job title of the job posting
tags Concepts that have been extracted from the tags, skills or company name
career level Beginner, experienced, manager, etc.
discipline id Consulting, HR, etc.
industry id Internet, Automotive, Finance, etc.
country Describes the country where the user is actually working
region Region where user is actually working (only if in Germany)
latitude Latitude coordinate
longitude Longitude coordinate
employment Full-time, part-time, etc.
created at Unix timestamp representing the time when the job posting was created
active during test 1 if the item is still active (= recommendable) during the test period, 0 if the item is not active anymore in the test period (= not recommendable)
Table 5.3: Item features
5.1.3 Interactions
Interactions are the actions that users performed on the job postings.
The set refers to the implicit feedback registered in the considered time period. Each entry records the user, the item interacted with, the time of the interaction and the type of interaction performed: click, bookmark, apply to job, discard.
Feature Description
user id ID of the user who performed the interaction
item id ID of the item on which the interaction was performed
interaction type Refers to the type of interaction that was performed on the item:
• 1 = the user clicked on the item
• 2 = the user bookmarked the item on XING
• 3 = the user clicked on the reply button or application form button that is shown on some job postings
• 4 = the user deleted a recommendation from his/her list of recommendations (clicking on "x"), which has the effect that the recommendation will no longer be shown to the user and that a new recommendation item will be loaded and displayed to the user
created at Unix timestamp representing the time when the interaction was created
Table 5.4: Interactions description
5.1.4 Impressions
Items shown by the existing XING recommender engine to the users of the
platform.
Feature Description
user id ID of the user
year Year when the impression was presented
week Week of the year
items Comma-separated list (not set) of items that were displayed to the user
Table 5.5: Impressions dataset
5.1.5 Test Users
Contains just the IDs of the users for which we need to provide recommendations.
Chapter 6
Source Algorithms
This chapter provides a brief description of the eleven input algorithms used in our ensemble. In the first section we describe the solutions obtained using only the Interaction and Impression datasets, in the second section we discuss the collaborative filtering approaches, while in the third the content-based methods.
Many of the approaches present in our ensemble derive from the thesis research of Sacchi and Kamberoski, so any further details about them can be found in [6]. On the other hand, all the methods that use the Impression dataset as source data were implemented by us following the approach of our colleagues Sacchi and Kamberoski; this is the reason why we decided not to cover all the particulars but just give an overview of how they work.
6.1 Past Interaction & Past Impression
As already said for the job recommendation scenario in Section 2.1, it makes sense to suggest job postings the user has already interacted with. To better take advantage of this peculiarity we decided to filter those items out of the recommendations provided by the Collaborative Filtering and Content-Based approaches and then, starting from the Interactions and Impressions source datasets, create two files to be treated as single submissions:
• Past Interactions: they refer to the jobs previously clicked by a user, which were filtered and ordered according to some criterion related to the user profile.
• Past Impressions: they concern the jobs shown to a user by the Xing recommender system; we kept only the last two weeks of data, which we considered the most relevant, and then performed the same filtering and reordering steps mentioned for the Interactions.
In this way we had disjoint sets of recommendations, making our learners recommend previously un-clicked jobs. A minimal sketch of this filtering step follows.
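The following is a minimal sketch, assuming a simple tuple-based layout for the source files (user, item, timestamp for interactions; user, week, item list for impressions); the actual filtering and reordering criteria of our pipeline are only summarized here.

# Hypothetical sketch of the disjoint-source filtering (names and data layout are assumptions).
def build_disjoint_sources(interactions, impressions, cf_cb_recs, last_week):
    """Drop already seen job postings from the CF/CB lists and keep only recent impressions."""
    past_interactions = {(u, i) for u, i, _ts in interactions}
    recent_impressions = {(u, i) for u, week, items in impressions
                          if week >= last_week - 1          # keep only the last two weeks
                          for i in items}
    seen = past_interactions | recent_impressions
    filtered_cf_cb = {u: [i for i in recs if (u, i) not in seen]
                      for u, recs in cf_cb_recs.items()}
    return past_interactions, recent_impressions, filtered_cf_cb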
6.2 Collaborative Filtering Algorithms
We have six input algorithms belonging to the Collaborative Filtering family: two out of six were completely provided and implemented by Sacchi and Kamberoski [6], while the remaining ones were implemented by us following the work of our colleagues. In the following part of this Chapter we use a particular syntactic structure to define the algorithms, e.g. IntImp: the former word "Int" refers to the training data, i.e. the base for the similarity function, while the latter refers to the recommendation part, i.e. the set from which we extract the items. A minimal sketch of the user-based variant is given after the list below.
• UBCF IntInt (User Collaborative Interaction - Interaction): starting from the past interactions performed by two users we calculate the similarity among them and recommend to a user u the interactions of the users belonging to its neighborhood (with neighborhood we refer to the set of users whose similarity value with u is above a certain threshold).
• UBCF IntImp (User Collaborative Interaction - Impression): given the similarity between the users, based on their interactions, instead of recommending to a user u the interactions of its neighbours we suggest their impressions.
• UBCF ImpImp (User Collaborative Impression - Impression): in this case we calculate the similarity between the users no longer over their interactions but over their impressions, and then we recommend to a user u the impressions shown to its most similar users.
• UBCF ImpInt (User Collaborative Impression - Interaction): provided the neighborhood of a user, calculated over the impressions, we recommend the interactions performed by its neighbours.
• IBCF IntInt (Item Collaborative Interaction): we provide to a user the jobs that are similar to its previously viewed job postings; the similarity between two jobs is computed over the users that interacted with them, hence two items are similar if the same users clicked on them.
• IBCF ImpImp (Item Collaborative Impression): we recommend to a user the jobs similar to its past impressions; the similarity between two jobs is calculated over the users they were shown to, hence two items are considered similar if they were suggested to the same users.
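As an illustration, here is a minimal sketch of the UBCF IntInt idea under simplifying assumptions (binary feedback, cosine similarity over item sets, hypothetical data structures); the actual implementation in [6] may differ in the similarity measure, thresholds and normalizations.

# Minimal sketch of UBCF IntInt (assumed data layout, not the thesis code).
# user_items maps each user id to the set of item ids it interacted with.
from collections import defaultdict
from math import sqrt

def cosine_similarity(items_u, items_v):
    """Cosine similarity between two users represented as sets of interacted items."""
    if not items_u or not items_v:
        return 0.0
    return len(items_u & items_v) / sqrt(len(items_u) * len(items_v))

def ubcf_int_int(user_items, target_user, k=50, top_n=30):
    """Recommend the items of the k most similar users, ranked by aggregated similarity."""
    target_items = user_items[target_user]
    neighbours = sorted(
        ((cosine_similarity(target_items, items), u)
         for u, items in user_items.items() if u != target_user),
        reverse=True)[:k]
    scores = defaultdict(float)
    for sim, u in neighbours:
        for item in user_items[u] - target_items:   # only previously un-clicked items
            scores[item] += sim
    return sorted(scores, key=scores.get, reverse=True)[:top_n]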
6.3 Content Based Algorithms
Three submissions composing our final ensemble derive from a content-based approach; the Baseline was the only one deployed by us, while the remaining two came from the research work of Sacchi and Kamberoski [6]. A minimal sketch of the concept-overlap idea is given after the list.
• Baseline: it is a purely content-based algorithm that Xing provided as a benchmark; it tries to exploit the similarity between the user profile and the item profile using information such as the career level, the region and so on. Being a really general approach, it is able to provide a recommendation to all users in the test set.
• KBIS (Concept Based Item Similarity): we suggest to a user the jobs that are similar to its former interactions. The similarity between two items is computed over the titles and tags that describe their profiles, thus two items are similar if they share the same concepts in their description.
• KBUIS (Concept Based joint User-Item Similarity): first we create a new user profile, based on the titles and tags of the user's old interactions combined with the user's jobroles, and then we recommend the jobs that are most similar to it. The more similar the user and job profiles are, the more likely the user is to be interested in the job posting.
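The sketch below illustrates the concept-overlap idea behind KBIS with a simple Jaccard similarity over concept sets; this is an assumption for illustration only, since the actual implementation in [6] uses IDF-weighted concepts (see Section 7.4.1).

# Hypothetical sketch of concept-overlap item similarity (not the thesis code).
# item_concepts maps each item id to the set of concepts extracted from its title and tags.
def jaccard(a, b):
    """Jaccard similarity between two concept sets."""
    return len(a & b) / len(a | b) if (a or b) else 0.0

def kbis_recommend(item_concepts, clicked_items, top_n=30):
    """Score every candidate item by its best concept overlap with the user's past clicks."""
    if not clicked_items:
        return []
    scores = {}
    for item, concepts in item_concepts.items():
        if item in clicked_items:
            continue
        scores[item] = max(jaccard(concepts, item_concepts[c]) for c in clicked_items)
    return sorted(scores, key=scores.get, reverse=True)[:top_n]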
Chapter 7
Ensemble Technique
In this chapter we will describe the ensemble algorithm that we implemented
for the RecSys Challenge. The technique that we created is called “Multi-
Stack Ensemble” [17] [18] and it consists of a hierarchy of hybrid models.
The input learners that we used are those described in Ervin and Elena's thesis work [6], plus 4 other Collaborative Filtering techniques that we implemented on the impressions of the dataset. After introducing the general concept we will describe in detail the 2-step function that we used inside each level of the stack.
7.1 Multi-Stack Ensemble
Multi-Stack Ensemble is a hybridization technique that uses a multi-layered stack structure, shown in Figure 7.1, in order to create a hierarchy of input learners. The stack implements a 2-step algorithm, shown in Figure 7.2:
1. Voting-based method: assigns to each item of the input recommendations a score using a linear or score-based function, described in Section 7.2;
2. Reduce function: performs a reduce step among the input recommendations, grouped by user, in order to exploit a majority preference, as described in Section 3.2.1.
So every layer of the hierarchy implements a possibly different voting method, always followed by the reduce function.

Figure 7.1: Ensemble Hierarchy

Figure 7.2: Stack Layer Structure

Mathematically, this technique can be described as:
s_a(u, i) = f_E(rank_a(u, i), \Theta_E)    (7.1)

s_E(u, i) = \sum_{a \in E} s_a(u, i)    (7.2)

rank_E(u, i) = sort(s_E(u, i))    (7.3)

where a ∈ E is an algorithm in the ensemble E, s_a(u, i) is the score assigned to item i of user u for algorithm a by the scoring function f_E, which depends on the parameters of the ensemble \Theta_E. s_E(u, i) is the final score for each item i of the user u for the ensemble E. rank_E(u, i) is the final rank of each item i for the user u, calculated by the sort function that sorts the items in descending s_E(u, i) order and takes the top 30 items. If an algorithm provides less than 30 recommendations, the remaining part of the list is filled with lower priority algorithms.
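A compact sketch of one stack layer under these equations is shown below; the data layout and function names are assumptions for illustration, not the code we actually ran.

# Hypothetical sketch of one stack layer (Equations 7.1-7.3).
from collections import defaultdict

def ensemble_layer(recommendations, scoring_fn, params, top_n=30):
    """recommendations: {algo: {user: [item, ...]}} with items in rank order (rank starts at 1).
    scoring_fn plays the role of f_E and params holds the per-algorithm parameters Theta_E."""
    scores = defaultdict(float)                              # (user, item) -> s_E(u, i)
    for algo, per_user in recommendations.items():
        for user, items in per_user.items():
            for rank, item in enumerate(items, start=1):
                scores[(user, item)] += scoring_fn(rank, params[algo])   # Eq. 7.1 and 7.2
    per_user_items = defaultdict(list)
    for (user, item), s in scores.items():
        per_user_items[user].append((item, s))
    # Eq. 7.3: sort by descending score and keep the top 30 items per user.
    return {u: [i for i, _ in sorted(lst, key=lambda t: -t[1])[:top_n]]
            for u, lst in per_user_items.items()}

# Example with a linear scoring function (Section 7.2.1): score = weight - rank * decay.
recs = {"alg_a": {"u1": ["i1", "i2"]}, "alg_b": {"u1": ["i2", "i3"]}}
pars = {"alg_a": (1, 0.001), "alg_b": (1, 0.002)}
print(ensemble_layer(recs, lambda r, p: p[0] - r * p[1], pars))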
The innovative idea behind Multi-Stack Ensemble is that one can start by ensembling weak models, whose combinations become more accurate while climbing the stack, until they are comparable with stronger algorithms. This cannot be achieved using a weighted version of the techniques described in Section 3.2, since correct items in weak recommendations would always be penalized. Here instead at each layer all previous computations are erased, so that hybrid models coming from lower layers are considered as vanilla inputs.
The general rule is that in each layer we should have recommendations that are as comparable as possible in terms of accuracy, which means that the number of layers depends on the domain, the dataset and the input algorithms.
7.2 Voting-Based Methods
Hereafter we describe the different voting techniques that we implemented
in order to assign a score to each element of the algorithms inside the stack.
7.2.1 Linear Ensemble
In the linear ensemble we use two per-algorithm parameters: the weight w_a and the decay d_a of the algorithm a. The score s_a(u, i) is calculated as:

s_a(u, i) = w_a - rank_a(u, i) \cdot d_a    (7.4)

We use integers for w_a in order to establish a priority over the algorithms: for example, there are techniques which are much stronger than others for a subset of users, so it makes sense to give them more importance by assigning a higher weight. For example, past interactions and impressions contain stronger recommendations with respect to what other learners provide, so whenever we find a user with interactions and impressions we give them a higher priority, i.e. higher weights.
The decay d_a on the other hand has a very low value (in the order of magnitude of 10^-3), so that it helps defining the ordering inside a recommendation list that was given weight w_a. Let us see an example: suppose we have an algorithm a, providing 5 recommendations, which was given weight w_a = 1 and decay d_a = 0.001; our algorithm would work as in Table 7.1.
rec_a Rating
a1 0.999
a2 0.998
a3 0.997
a4 0.996
a5 0.995
Table 7.1: Linear Ensemble Example.
d_a is also used as an interleaving factor that allows alternating the recommendations of algorithms with the same priority (i.e. the same weight). So here, differently from what we saw in Section 3.2.4, the interleaving is not explicit (e.g. using a round-robin approach) but implicit. As a matter of fact the processing follows no explicit rule, it is only based on the scores that the voting technique assigns to each item. Again, let us see an example: suppose we have two algorithms a and b with w_a = w_b = 1, d_a = 0.001 and d_b = 0.002; Table 7.2 shows the dynamics of our technique.
rec_a score_a rec_b score_b rec_ENS score_ENS
a1 0.999 b1 0.998 a1 0.999
a2 0.998 b2 0.996 b1 0.998
a3 0.997 b3 0.994 a2 0.998
a4 0.996 b4 0.992 a3 0.997
a5 0.995 b5 0.990 b2 0.996
Table 7.2: Linear Ensemble Interleaving Example
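As a sketch, the linear voting of Equation 7.4 with the parameters of this example can be reproduced as follows (hypothetical item names; ties such as a2 and b1 may be ordered either way):

# Sketch of the linear voting step (Equation 7.4) with the parameters used in Table 7.2.
def linear_score(rank, weight, decay):
    """s_a(u, i) = w_a - rank_a(u, i) * d_a, with rank starting at 1."""
    return weight - rank * decay

rec_a = ["a1", "a2", "a3", "a4", "a5"]     # w_a = 1, d_a = 0.001
rec_b = ["b1", "b2", "b3", "b4", "b5"]     # w_b = 1, d_b = 0.002
scored = [(item, linear_score(r, 1, 0.001)) for r, item in enumerate(rec_a, start=1)]
scored += [(item, linear_score(r, 1, 0.002)) for r, item in enumerate(rec_b, start=1)]
# Sorting by descending score yields the implicit interleaving of the two lists.
print(sorted(scored, key=lambda t: -t[1]))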
To decide which pair (w, d) to assign to each input algorithm we performed a Grid Search [19] over a limited set of values. The Grid Search is a process that tries every possible combination of the parameters in order to find the best configuration, so we had to limit the possible outcomes of this process. To do that we defined a limited pool of possible values for both weight and decay, shown in Table 7.3.
Parameter Set of values
weight 0, 1, 2, 3, 4
decay 0.001, 0.0015, 0.002, 0.0025, 0.003, 0.004, 0.005
Table 7.3: Weight and Decay set of values
The value 0 for the weight would mean that the input recommendation
is to be discarded, i.e. not considered for the ensemble.
For every configuration we then calculated the ensemble and checked the
score in our offline environment. This process was repeated for each layer
of the stack, so we started from the first layer, searched for the best set of
parameters using a Grid Search, selected the ensemble with the best config-
uration and passed it to the upper level, where the process was repeated.
Complete results will be reported in Chapter 8.
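A minimal sketch of this per-layer search is given below; the evaluate function stands for our offline evaluation (Section 4.1.4) and is assumed to exist, and in practice the full Cartesian product over many algorithms is large, so the search was restricted to the pools of Table 7.3 and run layer by layer.

# Hypothetical grid-search loop over the (weight, decay) pools of Table 7.3 (not the thesis code).
from itertools import product

WEIGHTS = [0, 1, 2, 3, 4]
DECAYS = [0.001, 0.0015, 0.002, 0.0025, 0.003, 0.004, 0.005]

def grid_search(algorithms, evaluate):
    """Try every (weight, decay) assignment per algorithm and keep the best-scoring one."""
    best_score, best_config = float("-inf"), None
    for config in product(product(WEIGHTS, DECAYS), repeat=len(algorithms)):
        params = dict(zip(algorithms, config))      # algorithm -> (weight, decay)
        score = evaluate(params)                    # offline evaluation of the resulting ensemble
        if score > best_score:
            best_score, best_config = score, params
    return best_config, best_score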
7.2.2 Evaluation Score Ensemble
The Evaluation Score ensemble assigns to each item in the recommendation
list a score that reflects the accuracy of the algorithm a, represented by the
weight wa, and the points that can be obtained using the Xing evaluation
metric, described in Section 4.1.4, given the rank of the item. As a matter
of fact the ordering of items matters, so as you can see from Equation 7.6
a relevant item in the first or second position contributes much more than
one at the end of the list.
s_a(u, i) = w_a \cdot e(rank_a(u, i))    (7.5)

where e(rank_a(u, i)) is defined as:

e(rank_a(u, i)) =
\begin{cases}
37.83, & rank_a(u, i) \in [1, 2] \\
27.83, & rank_a(u, i) \in [3, 4] \\
22.83, & rank_a(u, i) \in [5, 6] \\
21.17, & rank_a(u, i) \in [7, 20] \\
20.67, & rank_a(u, i) \in [21, N]
\end{cases}    (7.6)
As we stated before, the weight w_a represents the accuracy of the algorithm a, defined as the score density ratio, which is the ratio between what we called leaderboard score and the total number of recommended items:

w_a = \frac{l_a}{n_a}    (7.7)

where n_a is the number of items recommended by algorithm a and l_a is the score that algorithm a obtained in our offline environment.
The main idea behind this technique is to exploit the points per item of an algorithm. For instance, there may be two different learners, say A and B, that have an equal score in our leaderboard. You may think they are comparable and can be ensembled using a round-robin interleaving, as described in Section 3.2.4. But what if A recommends twice as many items as B? It would mean that the score density ratio of B is double with respect to A, making A a weaker algorithm than B, because it achieves the same score while recommending twice as many items.
Let’s see an example of how this technique works, suppose we have two
learners, a and b, with the following characteristics: la = lb = 10, na = 5
and nb = 2.
From the data above we can calculate the weights for both a and b as de-
scribed in Formula 7.7, obtaining wa = lana
= 2 and wb = lbnb
= 5. Now with
the weights and the function e(ranka(u, i)) we can compute the score for
48
rec_a score_a rec_b score_b rec_ens score_ens
a1 75.66 b1 189.15 b1 189.15
a2 75.66 b2 189.15 b2 189.15
a3 55.66 a1 75.66
a4 55.66 a2 75.66
a5 45.66 a3 55.66
Table 7.4: Evaluation Score Example
As we can see from Table 7.4, algorithm b is much more accurate than algorithm a, so the evaluation-score technique gives it a higher priority, recommending b's elements before a's.
This technique turned out to be quite powerful, especially with algorithms that make recommendations on different, but overlapping, sets of target users (e.g. users with interactions and users with impressions), because it gives more information on the actual accuracy of each learner.
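A minimal sketch of this scoring, using the point values of Equation 7.6 and the example numbers above, could look like the following (assumed helper names, not the thesis code):

# Sketch of the Evaluation Score voting (Equations 7.5-7.7).
def e(rank):
    """Points awarded by the Xing metric for a relevant item at the given 1-based rank (Eq. 7.6)."""
    if rank <= 2:
        return 37.83
    if rank <= 4:
        return 27.83
    if rank <= 6:
        return 22.83
    if rank <= 20:
        return 21.17
    return 20.67

def evaluation_score(rank, leaderboard_score, n_items):
    """s_a(u, i) = w_a * e(rank), with w_a the score density ratio l_a / n_a (Eq. 7.7)."""
    return (leaderboard_score / n_items) * e(rank)

# The example of Table 7.4: l_a = l_b = 10, n_a = 5, n_b = 2.
print(evaluation_score(1, 10, 5))   # a1 -> 2 * 37.83 = 75.66
print(evaluation_score(1, 10, 2))   # b1 -> 5 * 37.83 = 189.15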
7.3 Reduce Function
While describing the voting-based methods we have always considered the recommendations of different algorithms as belonging to disjoint sets. The reality of the competition was different though: in fact it happened that different learners recommended the same item. For example, an item i could be recommended to user u by a Collaborative Filtering method, provided other users similar to him clicked on that job posting, and at the same time it could be recommended by a Content-Based technique, provided i had attributes similar to previously clicked interactions of u. Therefore we thought that this information should not be lost or ignored, and that is why we decided to implement a way to take advantage of it.
The idea behind the reduce function is that if more than 1 algorithm recommends the same item to a user, then probably that item is a good recommendation. This technique derives from the majority method described in Section 3.2.1, but we pushed it forward: we do not count the number of occurrences of an item among multiple recommendations, instead we sum up the ratings that the item received during the voting step, described in Section 7.2. Eventually, as a final step, we use a sorting procedure which orders the items by their new ratings in a descending fashion.
Figure 7.3: Reduce Function Example. Inside each block the values represent the structure (item, rating):
Algorithm 1: (A, 3) (B, 2.9) (C, 2.8) (D, 2.7) (E, 2.6)
Algorithm 2: (R, 3) (G, 2.9) (B, 2.8) (P, 2.7) (D, 2.6)
Reduce Step: (B, 5.7) (D, 5.3) (A, 3) (R, 3) (G, 2.9)

An interesting use case for this technique is the following: suppose we have two input algorithms M and N, and that an item i is recommended to user u by both algorithms in the last position, such that:

rank_M(u, i) = rank_N(u, i) = 30    (7.8)
If we had simply counted the number of occurrences of items, then item i would have been quite a good recommendation, given that it occurs in every input algorithm (N, M). The problem is that it is in the thirtieth position, so if N and M rank the recommendations by likelihood, i is quite far from the top of the list.
Our approach on the other hand considers the rating that our voting methods assigned to i for both N and M, and being i in the thirtieth position its value may not be really high. So our reduce function will certainly rank i up, since it will get the sum of the scores s_M(u, i) and s_N(u, i), but it probably will not make it to the top of the recommendation list, where we have higher rated items.
The implementation of our reduce function is pretty straightforward: we simply used the map-reduce paradigm of Apache Spark.
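A minimal PySpark-flavoured sketch of this reduce step is shown below; the RDD layout, the toy data and the job name are assumptions for illustration, not our actual Spark code.

# Hypothetical sketch of the reduce step with Apache Spark (assumed data layout).
from pyspark import SparkContext

sc = SparkContext(appName="reduce-step-sketch")
# ((user, item), score) tuples produced by the voting step of each input algorithm.
scored = sc.parallelize([
    (("u1", "A"), 3.0), (("u1", "B"), 2.9),   # algorithm 1
    (("u1", "B"), 2.8), (("u1", "R"), 3.0),   # algorithm 2
])

reduced = (scored
           .reduceByKey(lambda x, y: x + y)                  # sum the votes per (user, item)
           .map(lambda kv: (kv[0][0], (kv[0][1], kv[1])))    # re-key by user
           .groupByKey()
           .mapValues(lambda items: sorted(items, key=lambda t: -t[1])[:30]))  # top 30 per user

print(reduced.collect())   # e.g. [('u1', [('B', 5.7), ('A', 3.0), ('R', 3.0)])]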
7.4 Stack Layers
In our solution we implemented a three-layered stack that we will describe in the following sections. The main idea is to perform a batch hybridization with the possibility to apply a different combining technique to each subgroup of algorithms.
7.4.1 Layer 1
The first layer contains 2 different ensembles:
• Ensemble CF: the aim of this ensemble is to find a single strong
Collaborative Filtering representative out of 6 different techniques [6]
that we implemented. Those 6 input algorithms were all from the same
family of Collaborative Filtering, but they provided different predic-
tions due to diverse parametrization or users on which they trained.
On average we had a 20% intersection among the predictions of these
algorithms. Ensemble CF would then provide us with a single rec-
ommendation that collected the results of all the Collaborative tech-
niques.
For Ensemble CF we used the Evaluation Score method, described in
7.2.2, for the voting step and then applied our Reduce function.
• Ensemble CB: as for the previous case this ensemble aims at finding
a single strong Content-Based representative out of 2 techniques [6]
that we implemented. Again the 2 algorithms belong to the same
family, but they provided different recommendations, this time due to
the different calculation of IDF values.
For Ensemble CB we applied the Linear method instead, described in
7.2.1, followed by our Reduce step.
The 2 ensembles generated at this stage will then be used in the upper layers of the stack. Table 7.5 shows a summary of layer 1.
7.4.2 Layer 2
The second layer is the simplest one: the goal here is to create a new single recommendation that unites the Collaborative Filtering and Content-Based approaches. Here there is no external input; the only algorithms used in this layer are those coming from the previous one, which in fact are the two representatives of Collaborative and Content. The output will then be a single recommendation list that will be propagated to the upper layer.
The voting method used is again Linear.
51
Input Output
UBCF IntInt    Ens CF
UBCF IntImp
UBCF ImpInt
UBCF ImpImp
IBCF IntInt
IBCF ImpImp
KBUIS          Ens CB
KBIS
Table 7.5: I/O Layer 1

Input Output
Ens CF         Ens CF+CB
Ens CB
Table 7.6: I/O Layer 2
7.4.3 Layer 3
The third and final layer of our stack will output the recommendation that
allowed us to achieve the 4th position in the competition.
Let us describe the input algorithms that we find in this layer:
• Interactions: contains the past interactions of the user. As we stated at the beginning of this thesis work, the goal of this research is next-click prediction, so given the characteristics of the job-recommendation domain it makes sense to recommend what we know users have already seen, since users tend to interact with the same item multiple times;
• Impressions: contains the impressions that each user received, which is what the Xing recommender system showed them. Since these are recommendations already presented in the user interface of the Xing platform, it is likely that users actually clicked on them;
• Ensemble CF+CB: contains the recommendations obtained in the second layer of our stack; basically it incorporates the results of our Collaborative Filtering and Content-Based approaches;
• Baseline: contains recommendations derived from the baseline algorithm that Xing provided. Since it is a general purpose approach it presents recommendations for almost every user, disregarding requirements that more advanced techniques may have (e.g. Collaborative Filtering needs a minimum number of past interactions for a user in order to provide valuable recommendations). That being said, we used this baseline not as an input model for our ensemble, since its score was far lower than that of the other algorithms, but as a filler: if for any reason we could not provide any or enough job postings to a user, we completed the 30-item recommendation list with the baseline (a minimal sketch of this filler step is given after Table 7.7).
Input Output
Ens CF+CB      Final Rec
Interactions
Impressions
Baseline
Table 7.7: I/O Layer 3
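The following is a minimal sketch of the baseline filler behaviour (hypothetical helper, assumed data layout, not the thesis code):

# Top up a user's recommendation list with baseline items until it reaches 30 entries.
def fill_with_baseline(user_recs, baseline_recs, size=30):
    filled = list(user_recs)
    for item in baseline_recs:
        if len(filled) >= size:
            break
        if item not in filled:
            filled.append(item)
    return filled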
In this layer we found again the Linear method to be the most accurate. Since interactions and impressions were really strong recommendations, our goal was to use the previous layers of the stack to build a new model, Ensemble CF+CB, that could somehow compare with them. It is also important to remember that the Content-Based and Collaborative Filtering methods do not include elements present in interactions and impressions, the 2 sets being completely disjoint, which makes it harder to find an equally accurate recommendation.
Chapter 8
Results
In this chapter we show the results of our research work compared to other
state of the art ensemble techniques.
In the first part we present the tuning process that characterized each layer
of our stack. This tuning involves the voting-based techniques to be used
and the related parametrization. In the second part we compare our Multi-
Stack Ensemble with some of the hybrids described in Chapter 3.
Table 8.11 summarizes the results, showing that our approach outperforms all other solutions.
8.1 Layers Tuning
As we stated in Chapter 7, our 2-step algorithm consisted of a Voting-based method followed by a Reduce function. While describing the former in
Section 7.2 we introduced 2 different techniques: Linear and Evaluation-
Score.
The problem here is that during the computation we could only apply 1
voting-based method, so in our offline testing we used both techniques to
understand which one was more suitable for each layer of the stack. In the
following subsection we will report the results of this testing to justify what
was presented in Section 7.4. To find the best set of parameters for the
Linear Voting-based method we performed a Grid Search over a fixed set
of values for both weight and decay. Hereafter we will show only the most
significant combinations for the sake of brevity.
In the following sections we will use the notation 1.1 and 1.2 to discriminate between the 2 different ensembles that reside at layer 1: the former refers to the Collaborative Filtering techniques, the latter to the Content-Based ones.
8.1.1 Input Algorithms
In Table 8.1 we present the algorithms we used as input for our Multi-Stack Ensemble. Beside each of them you can find the score obtained in our offline environment. It is the result of the evaluation metric adopted for the Challenge and this thesis work, described in Section 4.1.4.
Algorithm Score
Baseline 2k
UBCF IntInt 18.9k
IBCF ImpImp 20k
IBCF IntInt 22k
UBCF IntImp 22.4k
UBCF ImpImp 31k
UBCF ImpInt 33.2k
KBIS 37k
KBUIS 45k
Interactions 82k
Impressions 116k
Table 8.1: Scores of Input Algorithms.
The Score value is obtained using the evaluation metric
described in Section 4.1.4
An important aspect to note is that even though impressions, as a recommendation, perform better than interactions, the latter are more accurate. As a matter of fact, if we calculate the ratio between the score and the number of recommended items, i.e. the points per item shown in Equation 8.1, what we obtain is in Table 8.2.

ratio = \frac{score}{number\ of\ elements}    (8.1)
Algorithm Score Number of items Ratio (Equation 8.1)
Interactions 82k 45103 1.8181
Impressions 116k 208421 0.5565
Table 8.2: Points per item of an algorithm
As you can see, interactions have a much higher ratio, which means that even though they contain fewer items the recommendations are actually more accurate. The reason is clear: interactions are job postings users actively interacted with, hence they all represent preferences of the user, while impressions are all those postings that are presented to the user as possibly interesting items, therefore only a subset of them is actually clicked.
8.1.2 Layer 1.1
In the first layer we combine all the recommendations coming from Collab-
orative Filtering algorithms.
We start with the Linear Ensemble technique, tuning the weights and decays using a Grid Search algorithm. Table 8.3 shows the best-performing configuration of parameters for the aforementioned voting method.
Algorithm Weight Decay
UBCF IntInt 1 0.002
UBCF IntImp 1 0.002
UBCF ImpInt 1 0.0015
UBCF ImpImp 1 0.001
IBCF IntInt 1 0.002
IBCF ImpImp 1 0.005
Table 8.3: Layer 1.1 (Collaborative Filtering) Linear Method Parametrization
We then implemented the Evaluation Score technique, which needs no tuning of parameters, as shown in Section 7.2.2. Table 8.4 presents the scores obtained in our local tests.
Voting Method Score
Linear 34656
Evaluation-Score 34914
Table 8.4: Layer 1.1 Results.
The Score value is obtained using the evaluation metric
described in Section 4.1.4
The best-performing result is then obtained using the Evaluation-Score voting-based methodology, which is slightly better than the Linear one. The reason may be that the accuracy of these Collaborative Filtering techniques decays very fast, and the latter technique is not able to address this behaviour better than the former one.
8.1.3 Layer 1.2
Here instead we create a single recommendation list out of the 2 Content-
Based techniques that we implemented. Table 8.5 shows the best parametriza-
tion for the Linear method.
Algorithm Weight Decay
KBUIS 1 0.001
KBIS 1 0.001
Table 8.5: Layer 1.2 (Content-Based) Linear Method Parametrization
This time the Linear method is the most accurate one, as shown in Table 8.6. The best solution is obtained by giving the same weight and decay (1 and 0.001) to both the input algorithms. We can imagine then that they are almost equally accurate in the first half of the recommendation list, which entails that a round-robin interleaving may be the optimal solution.
Voting Method Score
Linear 46743
Evaluation-Score 44246
Table 8.6: Layer 1.2 Results.
The Score value is obtained using the evaluation metric
described in Section 4.1.4
8.1.4 Layer 2
The second layer takes as an input the 2 representatives created in the pre-
vious level and combines them together to obtain a single recommendation
with both collaborative and content characteristics. As we stated in previ-
ous chapters this hybrid helps at overcoming the limitations of the single
approaches.
Table 8.8 shows that again the Linear method performs much better
than the Evaluation Score. The best result is obtained by giving the same
weight to both input algorithms, but assigning a different set of decays,
Algorithm Weight Decay
ENS CF 1 0.0015
ENS CB 1 0.001
Table 8.7: Layer 2 Linear Method Parametrization
This gives a new rule to the interleaving process, which now differs from a standard round-robin approach. As a matter of fact the proportion of items inside the resulting ensemble will be, on average, 2:3, i.e. 2 Collaborative recommendations every 3 Content-Based ones.
Voting Method Score
Linear 52472
Evaluation-Score 49601
Table 8.8: Layer 2 Results.
The Score value is obtained using the evaluation metric
described in Section 4.1.4
8.1.5 Layer 3
The third layer ensembles the representative of our implemented algorithms, obtained at the second layer, with the recommendations processed from the interactions and impressions of the dataset. This is the final stage of our solution.
As we can see from the input data in Table 8.1, interactions and impressions perform far better than the other techniques; the reason is that we decided to work with disjoint sets, hence our Collaborative and Content-Based solutions do not include any recommendation present in the former ones. As we stated before, since this is a click-prediction problem, it happens that users click multiple times on an already seen item, for example for comparison reasons. Therefore many items present in the test set are actually hidden in the training one. That also explains the score gap between the recommendations. The choice was made to leave more space for novelty, since the other items could be inferred by directly processing the interactions and impressions. Hence it made no sense to have them repeated in multiple lists.
As we can see from Table 8.10, the Linear method performs better, achieving a great result. The best set of parameters is shown in Table 8.9.
Algorithm Weight Decay
Interactions 3 0.001
Impressions CB 2 0.001
Ensemble CF+CB 1 0.001
Baseline - -
Table 8.9: Layer 3 Linear Method Parametrization
It includes the same decay for all the input algorithms, since the ordering is created using the weights. As a matter of fact we can create a queue of recommendations where the first part is filled with interactions, followed by impressions and then closed with the combination of Collaborative and Content-Based. This is perfectly in line with what we stated in the first part of this section and in Section 8.1.1. Anyway, one should not be confused into thinking that it is just an append of submissions: in fact the reduce step will act as a reordering technique, since interactions and impressions are not disjoint sets. Therefore, given our rules, it is likely that items belonging to the intersection are pushed upwards in the list, rewriting the ordering.
The Baseline was not included in the parametrization of Table 8.9 because it acts as a filler. Whenever we are not able to recommend enough or any items to a user with the other techniques, we use this algorithm to fill the list. As you can see in Table 8.1 it is not really accurate, but it can provide a recommendation for every test user, since it is a really general technique.
Voting Method Score
Linear 164402
Evaluation-Score 157197
Table 8.10: Layer 3 Results.
The Score value is obtained using the evaluation metric
described in Section 4.1.4
8.2 Ensemble Comparison
In this section we will compare the performance of our Multi-Stack Ensemble with the other hybridization techniques that we discussed in Chapter 3, in order to demonstrate that our approach actually works better than other state of the art solutions.
Let us first discuss the standard ensembles that we implemented:
• Majority Voting: we simply grouped the recommendations per user and counted the occurrences of each element, sorting them in descending order. The general idea is that if an item is present in more than one list of recommendations it is probably an interesting item for the user. Items with only 1 occurrence were sorted in the final recommendation based on their original rank, hence giving higher priority to top-ranked ones;
• Interleaving Random: we performed a round-robin approach taking
one recommendation at a time from each input algorithm. We used a
random per-user selection for the ordering of algorithms;
• Interleaving Order : we performed a round-robin approach taking one
recommendation at a time from each input algorithm. Differently from
the previous case we used a prefixed ordering of algorithms, based on
the score that each one of them performed in our offline environment,
Table 8.1, in descending order;
• Score Averaging: each algorithm that we implemented performs rating predictions over the items a user may like and then picks the top 30 elements with the highest values. For the Score Averaging we implemented an averaging technique using the normalized scores that each algorithm assigned to an item based on the user-rating prediction. The final rating r(u, i) is calculated as in Formula 8.2, where n is the number of occurrences of item i, recommended to user u, in the input algorithms.

r(u, i) = \frac{1}{n} \sum_{a=1}^{n} r_a(u, i)    (8.2)

Obviously the recommendations represented by interactions and impressions did not have a rating, since our processing consisted of filtering and reordering techniques. Therefore the aforementioned technique was applied to the Content-Based and Collaborative algorithms, and the hybrid obtained was appended to the former 2, using the exact same weighting technique applied in the 3rd layer of our Multi-Stack Ensemble;
• Rank Averaging: we assigned to each item i a rating r_i corresponding to its rank, e.g. the first element would have r_i = 1 and so on. If i was present in multiple recommendation lists for the same user u, we computed the average rating for i. We then ranked the list in ascending order (a minimal sketch of this baseline is given after the list);
• Weighted Voting: we performed a Grid Search over a set of parameters to assign a weight to each input algorithm. After that we applied the aforementioned Majority Voting technique. What differs now is that we do not simply count the occurrences, but rather sum up the weights, as shown in Equation 8.3, where w_a(u, i) is the weight assigned to algorithm a that contains the item i recommended to user u.

r(u, i) = \sum_{a} w_a(u, i)    (8.3)

Table 8.11 presents only the result of the best-performing configuration that we managed to obtain;
• Evaluation Score: we applied the Evaluation Score technique, de-
scribed in Section 7.2.2, using all input algorithms. This approach
would be the same as having 1 single layer in our Multi-Stack Ensem-
ble;
• Linear : we applied the Linear technique, described in Section 7.2.1,
using all input algorithms. For weights and decays we performed a
Grid Search to find the best configuration of parameters. Again this
technique can be seen as if we had 1 single layer in our Multi-Stack
Ensemble;
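As an illustration of one of these baselines, here is a minimal sketch of Rank Averaging for a single user (hypothetical data layout, not the thesis code):

# Rank Averaging baseline: items with a lower average rank come first.
from collections import defaultdict

def rank_averaging(rec_lists, top_n=30):
    """rec_lists: the ranked item lists produced for one user by the input algorithms."""
    ranks = defaultdict(list)
    for rec in rec_lists:
        for position, item in enumerate(rec, start=1):
            ranks[item].append(position)
    avg_rank = {item: sum(r) / len(r) for item, r in ranks.items()}
    return sorted(avg_rank, key=avg_rank.get)[:top_n]   # ascending average rank

print(rank_averaging([["A", "B", "C"], ["B", "A", "D"]]))   # e.g. ['A', 'B', 'C', 'D']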
Table 8.11 shows the scores obtained with the aforementioned techniques compared to the Multi-Stack Ensemble.
Our solution outperforms all state of the art techniques. Most of them are too general and cannot exploit the characteristics of the domain; for example, Majority Voting cannot decide whether there are recommendations more accurate than others, as we demonstrated in Section 8.1.1, without a proper weighting function. In fact the weighted version performs much better, achieving a score which is almost 3 times the former one. Anyway, even with a proper weighting, the results show that the hierarchical structure of the Multi-Stack Ensemble is able to push the accuracy forward, being able to exploit the characteristics of the input algorithms more deeply.
Another interesting aspect is that the standalone application of the 2 techniques that we created, Evaluation Score and Linear, performed better than the other state of the art solutions, but worse with respect to Multi-Stack. Therefore we can say that both the combination of different voting-based methods and the layered structure contributed deeply to the final result.
hybrid techniques score
State of the Art
Majority Voting 45937
Interleaving Random 112735
Interleaving Order 114591
Rank Averaging 113649
Score Averaging 126322
Weighted Voting 139425
Evaluation Score 148878
Linear 152731
Our solution Multi-Stack 164402
Table 8.11: Ensemble Comparison.
The Score value is obtained using the evaluation metric
described in Section 4.1.4
Chapter 9
Conclusion and Future
Developments
In this thesis work we discussed an innovative approach to ensemble different
recommendation sources in the job-recommendation domain. We worked in
collaboration with TU Delft and Xing AG, a Business Social Network, which
gave us the dataset, consisting of real data collected from the Xing web application.
The aim of this work is to push forward the research in the field of ensemble techniques, creating an innovative approach that exploits the characteristics of this specific domain.
Our solution is based on a multi-layered stack composed of 3 separate levels.
Inside each one of them the input recommendations are combined using a
voting-based technique followed by a reduce function in order to generate
a combined recommendation from the input sources. The output is then
propagated as an input to the upper layer.
The strength of this technique is to overcome the limitations of standard
predictive models by combining the recommendations in an innovative ar-
chitecture, to obtain a more accurate and reliable prediction for the users.
Moreover, the Multi-Stack Ensemble outperforms all other state of the art hybridization techniques, which use standard approaches that may be too general for this context.
The power of our solution resides in the layered structure, which allows combining recommendations in batches with the possibility to use multiple hybridization techniques. As a matter of fact, in each layer one can implement a different methodology to ensemble the recommendations, choosing the best solution with respect to the input data.
We also participated in the RecSys Challenge 2016, organized among others by TU Delft and Xing, where our Multi-Stack Ensemble allowed us to end in the 4th position and 1st among Academic teams (the first 3 were companies) out of more than 120 teams. We were the only Academic team composed of Master students.
We were then invited to the ACM RecSys Conference, held in Boston at the
Massachusetts Institute of Technology, to present our solution to researchers
and companies from all over the world. Here we also received a special men-
tion as youngest team during the prize-giving ceremony.
Our paper “Multi-Stack Ensemble for Job Recommendation” was then ac-
cepted and published in the ACM RecSys proceedings. We were also awarded
by Politecnico of Milan for our results with a scholarship.
Next year the RecSys Challenge 2017 [20] will again be hosted by Xing,
but will introduce a really interesting new aspect: online evaluation. Basically the competition should be divided into 2 phases; the first one will be similar to the 2016 competition. The top-n teams will then be allowed to proceed to the second phase, which will consist of an online evaluation.
Teams will provide recommendations that will be proposed to real users
on the platform, to actually discover whether in a real environment some
solutions may be better than others. That being said, it would be really interesting to participate in the second phase and test the Multi-Stack technique with the online evaluation. The problem with the offline environment
is that users are biased by the presence of another recommender engine, the
one currently working on the platform (i.e. impressions for the RecSys com-
petition) during the collection of the data. Therefore the real problem for
the offline setting may turn out to be: “predict the items users will interact
with, among the ones already recommended (i.e. impressions)”. This means
that there may be an algorithm that performs really well on this task but fails in an online setting where its recommendations are actually presented to the user. On the other hand, a fancy learner that proposes novel items, never shown to the user by the company's recommender system, may capture the
attention in the online setting.
Another interesting future development would be to try the aforementioned
ensemble in different contexts, for example in movie recommender systems
or e-commerce. We think that it could work really well since it embodies
a general technique applicable to different domains. What may differ is
the structure of the stack that should be customized for every application,
but the inner concept of the 2-step algorithm is context independent. One may also apply totally different combination techniques that better fit the characteristics of the domain.
Bibliography
[1] Daniel Kluver. What is the goal of a Recommender System?
[2] Shaha T. Al-Otaibi and Mourad Ykhlef. A survey of job recommender
systems. 2012.
[3] Giacomo Domeniconi, Gianluca Moro, Andrea Pagliarani, and Roberto Pasolini. Job recommendation from semantic similarity of linkedin users' skills. 2016.
[4] Azarias Reda, Yubin Park, Mitul Tiwari, Christian Posse, and Sam Shah. Metaphor: A system for related search recommendations. 2012.
[5] Lili Wu, Sean Choi, Mitul Tiwari, Christian Posse, and Sam Shah. The browsemaps: Collaborative filtering at linkedin. 2010.
[6] Elena Sacchi and Ervin Kamberoski. Collaborative Filtering and
Content-Based Filtering Algorithms for the Job Recommendation Prob-
lem. 2016.
[7] Fabian Abel, Andras Benczur, Daniel Kohlsdorf, Martha Larson, and
Robert Palovics. Recsys challenge 2016: Job recommendations. In
Proceedings of the 10th ACM Conference on Recommender Systems,
RecSys ’16, pages 425–426, New York, NY, USA, 2016. ACM.
[8] Wikipedia. Trial and Error.
[9] Yeonjeong Lee, Kyoung-jae Kim, and Youngtae Kim. Recommender systems using ensemble techniques. International Journal of Computer, Electrical, Automation, Control and Information Engineering, 7, 2013.
[10] Michael D. Ekstrand, John T. Riedl, and Joseph A. Konstan. Collab-
orative filtering recommender systems. Found. Trends Hum.-Comput.
Interact., 4(2):81–173, February 2011.
[11] Michael J. Pazzani and Daniel Billsus. The adaptive web. chap-
ter Content-based Recommendation Systems, pages 325–341. Springer-
Verlag, Berlin, Heidelberg, 2007.
[12] Robin Burke. Hybrid recommender systems: Survey and experiments.
The adpative web, 4321:377–408, 2007.
[13] David H. Wolpert. Stacked generalization. Neural Networks, 5:241–259,
1992.
[14] Fabio Roda, Alberto Costa, and Leo Liberti. Optimal recommender systems blending. 2011.
[15] Mark Claypool, Anuja Gokhale, Tim Miranda, and Pavel Murnikov. Combining content-based and collaborative filters in an online newspaper.
[16] Dietmar Jannach, Markus Zanker, Alexander Felfernig, and Gerhard Friedrich. Recommender Systems: An Introduction. 2011.
[17] Tommaso Carpi, Marco Edemanti, Elena Sacchi, Ervin Kamberoski, Paolo Cremonesi, Roberto Pagano, and Massimo Quadrana. Multi-stack ensemble for job recommendation. RecSys Challenge 2016 Proceedings, 2016.
[18] Joseph Sill, Gabor Takacs, Lester Mackey, and David Lin. Feature-
weighted linear stacking. arXiv preprint arXiv:0911.0460, 2009.
[19] Wikipedia. Hyperparameter Optimization.
[20] ACM RecSys Conference 2017.