
MSc Artificial Intelligence
Track: Learning Systems

Master Thesis

Collaborative Filtering with Variational Autoencoders and Normalizing Flows

by

Francesco Stablum
6200982

August 28, 2018

36 EC

Supervisors:
Christos Louizos, MSc
Prof. Dr. Max Welling

Assessors:
Dr. Mettes Pascal
Dr. Miguel Angel Rios Gaona
Dr. Wilker Ferreira Aziz

Contents

Abstract
Preface
    Acknowledgements
1 Introduction
2 Background
    2.1 Representation Learning
    2.2 Collaborative Filtering
    2.3 Variational inference
    2.4 The Variational Auto-Encoder
    2.5 Normalizing Flows
        2.5.1 Planar Transformations
        2.5.2 RealNVP Transformations
    2.6 Variational posterior approximation collapse
        2.6.1 KL Annealing
        2.6.2 Free Bits and Soft Free Bits
3 Related work
    3.1 Probabilistic Matrix Factorization
    3.2 AutoRec
    3.3 Matrix Factorizing Variational Autoencoder
4 Method
    4.1 VAERec
        4.1.1 Sampling the ratings for the UI variants
    4.2 VAERec with Normalizing Flows
        4.2.1 Normalizing Flow using RealNVP’s invertible transformation
        4.2.2 Masking
    4.3 Tackling underfitting
        4.3.1 KL annealing
    4.4 Dealing with overfitting
    4.5 Optimization
        4.5.1 RPROP update algorithm
        4.5.2 Adam
        4.5.3 Learning rate annealing
5 Experiments
    5.1 Technologies used
    5.2 The Movielens datasets
    5.3 Scaling issues and regularization
    5.4 Prevention of exploding gradients
    5.5 Soft Free Bits settings
        5.5.1 Soft free bits settings in a deeper model with high latent dimensionality
    5.6 Choice of hyperparameters
    5.7 Normalizing Flows
        5.7.1 Experiments with Planar transformations
        5.7.2 Experiments with RealNVP transformations
    5.8 Equivalences between AutoRec and VaeRec models
    5.9 User+Item concatenation vs traditional Item or User datapoints
        5.9.1 User+Item concatenation on AutoRec
        5.9.2 User+Item concatenation on VaeRec
    5.10 Experiments with regularization techniques
        5.10.1 Dropout layer on the input of an AutoRec model
        5.10.2 Dropout layer on the input on a deep VaeRec model
        5.10.3 Tradeoff between KL divergence and weights regularization
6 Conclusion
    6.1 Future work
Appendix A Derivations
    A.1 Density of a transformed multivariate random variable
    A.2 Variational Expectation Lower Bound
    A.3 ELBO as sum of terms dependent on individual datapoints
    A.4 Rearranging the ELBO
        A.4.1 A closer look to the terms of the free energy
    A.5 Jacobian of a Planar Transformation
    A.6 Energy of single planar transformation step
        A.6.1 Model form
        A.6.2 Derivation of the free-energy F(x^(i))
    A.7 Multiple nested transformation steps
        A.7.1 Implementation with Planar Transformations
    A.8 KL between diagonal-covariance Gaussian qφ(z|x) and spherical Gaussian prior
        A.8.1 KL of diagonal covariance gaussians is a sum of the KL of the individual variables
        A.8.2 KL of a one-dimensional Gaussian vs unit circle Gaussian

Abstract

In this work we integrate collaborative filtering models that make use of Stochastic Gradient Variational Bayes with more recent posterior distribution approximation improvements, such as Planar and RealNVP Normalizing Flows. A model based on the AutoRec collaborative filtering autoencoder model is used as baseline in order to compare it to our Variational-Autoencoder-based model, named VaeRec, and its variant VaeRec-NF, which makes use of Normalizing Flows. Modifications to gradient-based parameter update algorithms are introduced in order to take into account the sparsity of the data tensors. Extensive hyperparameter search is performed and regularizing techniques have been investigated, such as soft free bits, which applies an adaptive coefficient to the Kullback-Leibler divergence of the variational lower bound. Methods to prevent gradient explosion are also utilized. A novel collaborative filtering input schema that makes use of the concatenation of user and item vectors has been tried, alongside inputs that make use of solely the item or user vectors.


Preface

When the topic of collaborative filtering was proposed to me I accepted with enthusiasm. The idea of being able to infer user-on-item preferences without any description of either users or items fascinated me. How would it be possible to do machine learning having only relational information between different entities? This lesser-known application of machine learning still puzzles me and makes me wonder about the incredible potential of these models.

This thesis has been a long journey with peaks and flats in which I could experience both the excitement of attempting new ideas on how to solve the problem, as well as reconsidered expectations. This is a normal part of the life of any research scientist and I’m glad of the opportunity of getting to know what this is all about.

The main aspect that motivates me in this thesis and, in a broader scope, in Machine Learning and Artificial Intelligence is the sheer amount of new discoveries and techniques that are being relentlessly produced by the scientific community, and my desire to combine the state of the art in terms of neural models, update algorithms, regularization techniques, and probabilistic interpretations in order to push the boundary of the best achievable precision of the predictions. I have always been interested in the nature of human conceptualization and how this can be related to computability. This thesis allowed me to get an additional perspective on this matter.

I believe that Collaborative Filtering techniques will find a broader application that goes way beyond mere user/item rating prediction. They provide another way to model learning by association, in which observations of how objects interact lead to answers about what these objects actually are, also via interpretation of their location in the so-called latent space.

I hope that the reader will find it interesting how techniques that are typically used for dimensionality reduction and probabilistic inference, with variational approximations, have been employed for attempting a solution of user/item rating modeling. I tried my best to derive all necessary math in order to lead the reader to understand, step by step, the topics of variational inference, the variational autoencoder and improvements to the approximation such as the normalizing flows.

I also included a description of the experiments and the results of many attempts at combining various algorithms in search of the synthesis that would lead to the best results, together with some attempts at explaining the different outcomes.

Acknowledgements

I would like to express my gratitude to my adviser Christos Louizos for the considerable patience and useful advice that got me unstuck many times, to professor Max Welling for the thesis topic, to the team behind the DAS4 supercomputer [Bal et al., 2016] for letting me use their computational resources very intensely, and to my company, ORTEC, and my colleagues for allowing me flexibility.

Chapter 1

Introduction

This work presents an exploration on the use of Variational Autoencoders for collaborative filtering. The baseline model has been chosen to be AutoRec [Sedhain et al., 2015], which uses a latent representation to reconstruct missing ratings. The natural evolution of this model has been considered to be a similar model based on a Variational Autoencoder, which we called VaeRec. Various extensions to this model have been examined. Specifically, Normalizing Flows transformations of the posterior approximation have been investigated, as well as regularization techniques.

The structure of this thesis represents how this research has evolved in time: Chapter 2 offers a highlight of the notions required to delve into the actual contributions of the model; Chapter 3 presents similar models that have been used as inspiration; Chapter 4 presents our models with their variants; Chapter 5 illustrates the experiments that have been performed on our models; Chapter 6 summarizes the contributions of the models with insights that emerged from all the experiments. Most detailed derivations and proofs, very useful for a beginner who is trying to figure out the mathematical details of the models, have been left to the Appendix.


Chapter 2

Background

2.1 Representation Learning

Representation Learning (RL) is a developing branch of Machine Learning that has one of its focuses on extracting representation codes $Z = \{z_i\}_{i=1}^{N}$ from the datapoints in a dataset $X = \{x_i\}_{i=1}^{N}$. It is usually desirable for these representations to be characterized by properties such as low dimensionality, clusterability, increased linear separability (especially when used for further classification tasks) and intuitive "semantic" explainability of the dimensions of the learned manifold.

One common attempt to achieve such properties is the use of Principal Component Analysis (PCA), which transforms the original features of the raw input into a set of linearly uncorrelated variables. The main drawback of PCA is the assumption that the explaining dimensions are linearly related to the directions of maximum variance in the data. This assumption is not true for most complex datasets, in which the original features of a dataset might actually be the result of arbitrarily complicated nonlinear unknown functions.

For this reason alternative approaches to RL are being employed, such as Autoencoders (AE), which are specific forms of Artificial Neural Networks (ANN). AEs typically have a "diabolo" shape, as an arbitrarily high-dimensional input is progressively reduced to lower dimensionalities over layers of progressively shrinking sizes. This part of the neural network is the encoder, as its purpose is to generate low-dimensional compressed codes from a high-dimensional input. The last layer of the encoder is usually the smallest layer of the network, hence it is called the bottleneck. This is not always the case, for example with sparse or over-complete codes. A decoder network with layers of progressively increasing size is attached to the bottleneck. The last layer of the decoder matches the dimensionality of the input layer. The learning of the network’s parameters is performed by minimizing a loss function containing the error between the reconstructed output of the network and its input. This objective function is minimized via Gradient Descent and its variants.

2.2 Collaborative Filtering

Collaborative Filtering (CF) [Bobadilla et al., 2013] is a recommendation system technique that predicts user-item ratings solely via the sparse matrix R of the available ratings given by users to items, without using any information about either users or items. The main aspect that makes CF work is that similar users are recognizable as similar by having similar ratings on the same items. Hence, it’s possible to predict a missing rating of a user for an item by considering the ratings of the users that are similar to that user.

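$$
R =
\begin{pmatrix}
r_{1,1} & \dots & r_{1,M} \\
\vdots & r_{i,j} & \vdots \\
r_{N,1} & \dots & r_{N,M}
\end{pmatrix}
\tag{2.1}
$$

Here the i-th row $r_{i,\cdot}$ is the sparse user row and the j-th column $r_{\cdot,j}$ is the sparse item column.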

2.3 Variational inference

Bayesian inference is concerned with updating an existing hypothesis about a statistical model of a data source, using data samples empirically obtained from that data source.

In this setting, the existing model hypothesis is expressed by a prior distribution p(M); the probability of the samples D under the model M is called the likelihood p(D|M). The true probability of the samples D, which is usually not available, is called the evidence p(D).

By using Bayes’ theorem it’s possible to obtain the posterior distribution p(M|D) of the model M after observation of the data D:

$$
p(M|D) = \frac{p(D|M)\,p(M)}{p(D)}
\tag{2.2}
$$

In representation learning it’s assumed that each datapoint x is generated by unknown (latent) variables z. Hence the problem of finding the generative model of the data becomes learning the parameters of a system that, given instances of z, is able to produce the respective datapoints x as faithfully as possible. In this scenario, inference is concerned with the dual problem of finding a distribution over z conditioned on the datapoint x. The initial hypothesis on how the latent variables are distributed, which is described by the prior distribution pθ(z), is updated with the datapoint x and the likelihood pθ(x|z) within the framework of a generative model represented by θ. This framework describes how x relates to a certain latent variable assignment z. In this new context Bayes’ rule is used to infer the posterior distribution of an arbitrary setting of the latent variables z:

$$
p_\theta(z|x) = \frac{p_\theta(x|z)\,p_\theta(z)}{p_\theta(x)}
\tag{2.3}
$$

As the true posterior pθ(z|x) is typically unavailable, because the evidence $p_\theta(x) = \int_z p_\theta(x|z)\,p_\theta(z)\,dz$ is intractable, an approximation q(z|x) is sought via variational inference methods. Variational inference aims to minimize the distance between the approximation and the true posterior [Fox and Roberts, 2012], which is typically done by minimizing the Kullback-Leibler divergence KL[q(z|x)||pθ(z|x)].

The KL can be decomposed into:

$$
\mathrm{KL}\left[q(z|x)\,\|\,p_\theta(z|x)\right] = \mathbb{E}_{q(z|x)}\left[\log \frac{q(z|x)}{p_\theta(x,z)}\right] + \log p_\theta(x)
\tag{2.4}
$$

We can use the shorthand $\mathcal{L}(x) = -\mathbb{E}_{q(z|x)}\left[\log \frac{q(z|x)}{p_\theta(x,z)}\right]$, so that (2.4) reads $\log p_\theta(x) = \mathcal{L}(x) + \mathrm{KL}\left[q(z|x)\,\|\,p_\theta(z|x)\right]$. The quantity log pθ(x) does not depend on the choice of q(z|x), and KL quantities are always non-negative; hence L(x) is a lower bound on log pθ(x), and maximizing L(x) with respect to q necessarily minimizes KL[q(z|x)||pθ(z|x)].

This is the basis of variational inference; Appendix A.2, A.3 and A.4 give the details of the derivation of the lower bound.
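The decomposition (2.4) can be checked numerically on a toy model. The following is a minimal sketch, assuming a discrete latent variable with three states and made-up probabilities (none of these numbers come from the experiments of this thesis); it only illustrates that the gap between log pθ(x) and L(x) is exactly the KL to the true posterior.

```python
import numpy as np

# Toy model (illustrative numbers): a discrete latent z in {0, 1, 2}
# and a single observed datapoint x.
prior = np.array([0.5, 0.3, 0.2])          # p(z)
likelihood = np.array([0.10, 0.40, 0.70])  # p(x|z), evaluated at the observed x
joint = prior * likelihood                 # p(x, z)
log_evidence = np.log(joint.sum())         # log p(x)
posterior = joint / joint.sum()            # p(z|x)


def elbo(q):
    """L(x) = -E_q[log(q(z|x) / p(x, z))]."""
    return -np.sum(q * (np.log(q) - np.log(joint)))


def kl(q, p):
    """KL[q || p] for discrete distributions with full support."""
    return np.sum(q * (np.log(q) - np.log(p)))


q = np.array([0.2, 0.5, 0.3])              # an arbitrary approximation q(z|x)
gap = log_evidence - elbo(q)               # should equal KL[q(z|x) || p(z|x)]
```

Setting q equal to the true posterior closes the gap, which is why maximizing L(x) over q is equivalent to minimizing the KL.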

2.4 The Variational Auto-Encoder

[Kingma and Welling, 2013] introduced a model aimed at posterior inference on datasets with high-dimensional datapoints.

The model is based on a generator network which outputs a conditional distribution pθ(x|z) in datapoint-space given a realization of the latent variables z.

The posterior distribution pθ(z|x) = pθ(x|z)pθ(z)/pθ(x) is intractable, because the evidence $p_\theta(x) = \int_z p_\theta(x|z)\,p_\theta(z)\,dz$ is intractable; hence an approximating recognition network qφ(z|x) is introduced, whose parameters φ are optimized via variational inference. The optimization of φ happens simultaneously with that of the parameters θ.

It was also shown experimentally that a Monte Carlo approximation of the ELBO (section A.3), obtained by sampling the posterior approximation, is sufficient to achieve good learning performance.

Moreover, [Kingma and Welling, 2013] experimentally demonstrated that even a single Monte Carlo sample might achieve a good approximation.

Since values of z are being sampled, gradients would be prevented from flowing in a backpropagation-like way. To circumvent this problem, a reparameterization trick is employed, using a sample ε which is always drawn from a N(0, I) Normal distribution. By using the transformation:

$$
z = \mu_\phi + \sigma_\phi \odot \varepsilon
\tag{2.5}
$$

a sample is obtained from the distribution N(µφ, diag(σφ²)).

The sum-based form that allows for SGD-like updates, described in section A.3, and the fact that a Monte Carlo approximation is used for the term of each datapoint are the reason why [Kingma and Welling, 2013] named this technique Stochastic Gradient Variational Bayes.
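As an illustration of equation (2.5), the following minimal NumPy sketch draws reparameterized samples. The values of `mu` and `sigma` are made up, and in a real model they would be outputs of the encoder network; the sketch only shows that the noise is sampled independently of the parameters, so that z is a deterministic, differentiable function of µφ and σφ.

```python
import numpy as np


def reparameterize(mu, sigma, rng, n_samples=1):
    """Draw z = mu + sigma * eps with eps ~ N(0, I), one row per sample."""
    eps = rng.standard_normal((n_samples,) + mu.shape)
    return mu + sigma * eps


rng = np.random.default_rng(0)
mu = np.array([1.0, -2.0])      # hypothetical encoder output (means)
sigma = np.array([0.5, 2.0])    # hypothetical encoder output (standard deviations)

samples = reparameterize(mu, sigma, rng, n_samples=50000)
empirical_mean = samples.mean(axis=0)
empirical_std = samples.std(axis=0)
```

With a large number of samples the empirical mean and standard deviation approach `mu` and `sigma`, as expected for samples from N(µφ, diag(σφ²)).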

2.5 Normalizing Flows

The original VAE model is characterized by a simple diagonal-covariance Gaussian posterior approximation. In order to achieve more complex distribution forms, multi-step transformations of the latent variable z are employed.


In our work we focused on two forms: Planar Flows and RealNVP.

2.5.1 Planar Transformations

It has been proposed [Rezende and Mohamed, 2015] to achieve a more complex posterior approximation by using a type of transformation with the following form:

$$
t(z) = z + u\,h(w^\top z + b)
\tag{2.6}
$$

This transformation can be applied to a simpler distribution, such as the diagonal-covariance Gaussian introduced in [Kingma and Welling, 2013].

The parameters are: b, which is a scalar, w ∈ R^D and u ∈ R^D; h is an element-wise nonlinearity, such as tanh.

The expression w⊤z + b is a scalar value, and h(w⊤z + b) can be seen as one perceptron layer with a single output unit. u is a parameter that acts as a coefficient vector representing the amount of the transformation h(w⊤z + b) applied to the input vector z.

The derivations in Appendix A.1 show how just the determinant of the Jacobian of the transformation is used in order to express the probability of the transformed variable as a function of the probability of the original variable z0. For the derivation of the Jacobian please refer to Appendix A.5.
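The following sketch illustrates one planar step with h = tanh. The closed form used for the log-determinant, log|1 + h′(w⊤z + b) w⊤u|, follows from the matrix determinant lemma applied to the Jacobian I + h′(w⊤z + b) u w⊤; it is stated here as a standard result rather than as a transcription of Appendix A.5. The parameter values are arbitrary, and the finite-difference Jacobian is included only as an independent check.

```python
import numpy as np


def planar(z, u, w, b):
    """t(z) = z + u * h(w^T z + b) with h = tanh."""
    return z + u * np.tanh(w @ z + b)


def planar_log_abs_det(z, u, w, b):
    """log|det dt/dz| = log|1 + h'(w^T z + b) * w^T u|."""
    h_prime = 1.0 - np.tanh(w @ z + b) ** 2
    return np.log(np.abs(1.0 + h_prime * (w @ u)))


def numerical_jacobian(f, z, step=1e-6):
    """Central finite differences, used only to verify the closed form."""
    D = z.size
    J = np.zeros((D, D))
    for j in range(D):
        e = np.zeros(D)
        e[j] = step
        J[:, j] = (f(z + e) - f(z - e)) / (2.0 * step)
    return J


# Arbitrary illustrative values for a 3-dimensional latent space.
z = np.array([0.3, -1.2, 0.5])
u = np.array([0.5, 0.1, -0.3])
w = np.array([1.0, -0.5, 0.2])
b = 0.1

analytic = planar_log_abs_det(z, u, w, b)
J = numerical_jacobian(lambda v: planar(v, u, w, b), z)
numeric = np.log(np.abs(np.linalg.det(J)))
```

The closed form costs O(D) per step, whereas a generic determinant would cost O(D³), which is the practical appeal of this family of transformations.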

2.5.2 RealNVP Transformations

[Dinh et al., 2016] introduced a very simple invertible function of the form:

$$
\begin{aligned}
t(z)_{1:d} &= z_{1:d} \\
t(z)_{d+1:K} &= z_{d+1:K} \odot \exp\left(s(z_{1:d})\right) + a(z_{1:d})
\end{aligned}
\tag{2.7}
$$

The inverse can be trivially obtained as:

$$
\begin{aligned}
z_{1:d} &= t(z)_{1:d} \\
z_{d+1:K} &= \left(t(z)_{d+1:K} - a(t(z)_{1:d})\right) \odot \exp\left(-s(t(z)_{1:d})\right)
\end{aligned}
\tag{2.8}
$$

s(·) can be any dimensionality-preserving nonlinear function, such as a neural network with nonlinear activations. a(·) is an affine transformation. In this work’s implementation d is set to d = K/2.


The main advantage of using such transformations is that the Jacobian matrix is triangular, hence its determinant is obtained from the diagonal, culminating with the form $\exp\left(\sum_j s(z_{1:d})_j\right)$.

Another great advantage over planar flows is that, while planar flows force the transformation to be channeled through a scalar value, RealNVP transformations do not have this restriction, as the nonlinearity is applied to a dimensionality which is the same as the latent variable’s.
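Equations (2.7) and (2.8) can be sketched as a pair of functions. In this minimal example s(·) and a(·) are single layers with random weights (the names `W_s`, `b_s`, `W_a`, `b_a` are hypothetical, as are the sizes K = 4 and d = 2); the sketch is only meant to show that the inverse recovers the input and that the log-determinant is the sum of the outputs of s(·).

```python
import numpy as np

K, d = 4, 2                      # latent size and split point (d = K/2)
rng = np.random.default_rng(0)

# Hypothetical parameters of the scale network s and of the affine map a.
W_s, b_s = rng.normal(scale=0.5, size=(K - d, d)), rng.normal(size=K - d)
W_a, b_a = rng.normal(scale=0.5, size=(K - d, d)), rng.normal(size=K - d)


def s(h):
    """Nonlinear scale function applied to the untouched half."""
    return np.tanh(W_s @ h + b_s)


def a(h):
    """Affine translation function applied to the untouched half."""
    return W_a @ h + b_a


def forward(z):
    """Equation (2.7); also returns log|det J| = sum_j s(z_{1:d})_j."""
    z1, z2 = z[:d], z[d:]
    y2 = z2 * np.exp(s(z1)) + a(z1)
    return np.concatenate([z1, y2]), np.sum(s(z1))


def inverse(y):
    """Equation (2.8): the first half is unchanged, so s and a can be re-evaluated."""
    y1, y2 = y[:d], y[d:]
    z2 = (y2 - a(y1)) * np.exp(-s(y1))
    return np.concatenate([y1, z2])


z = rng.normal(size=K)
y, log_det = forward(z)
z_back = inverse(y)
```

Note that neither s(·) nor a(·) needs to be inverted: the inverse only evaluates them on the half of the vector that the forward pass left untouched.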

2.6 Variational posterior approximation collapse

It was observed [Kingma, 2017] [Chen et al., 2016] that in the initial phases of training, due to the weakness of the term pθ(x|z), the term KL[qφ(z|x)||pθ(z)] encourages qφ(z|x) to collapse to the prior pθ(z).

If the latent variables are independent, then this phenomenon can be diagnosed by looking at the individual Kullback-Leibler divergences at each latent dimension, as shown in A.50 and, for the diagonal-covariance Normal, in A.51, A.48.
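The per-dimension diagnostic can be sketched with the well-known closed form of the KL divergence between a one-dimensional Gaussian N(µ, σ²) and the standard Normal, 0.5 (µ² + σ² − 1 − log σ²). The values of `mu` and `sigma` and the threshold of 0.01 nats below are illustrative assumptions, not settings used in the experiments.

```python
import numpy as np


def kl_per_dimension(mu, sigma):
    """KL[N(mu_k, sigma_k^2) || N(0, 1)] for every latent dimension k."""
    return 0.5 * (mu ** 2 + sigma ** 2 - 1.0 - np.log(sigma ** 2))


# Hypothetical posterior parameters for three latent dimensions.
mu = np.array([0.0, 1.5, 0.01])
sigma = np.array([1.0, 0.3, 0.99])

kl_dims = kl_per_dimension(mu, sigma)
collapsed = kl_dims < 0.01     # dimensions that carry (almost) no information
total_kl = kl_dims.sum()       # KL of the full diagonal-covariance Gaussian
```

Dimensions whose KL stays close to zero throughout training are the ones that have collapsed to the prior.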

The KL[qφ(z|x)||pθ(z)] term of L(x), when averaged within a minibatch M as in E_{x∼M}[KL[qφ(z|x)||pθ(z)]], can be interpreted as an approximation to a mutual information term I(z; x). Minimizing this mutual information while optimizing the ELBO pushes qφ(z|x) towards the prior pθ(z) independently of the datapoint x, leading to over-regularization of qφ(z|x).

2.6.1 KL Annealing

[Bowman et al., 2015] performed extensive experiments with variational autoencoders in recurrent neural networks, and point out that the KL term is very likely to be much easier to optimize and is quickly brought to 0, forcing the qφ(z|x) term to collapse to the prior p(z). The authors propose annealing of the KL term to prevent this phenomenon, by lowering the contribution of the term in the initial phases of the learning.

A simple implementation for the annealing is the following:

$$
\gamma = \frac{\min(t, T)}{T}
\tag{2.9}
$$

where t is the current epoch number, T is the number of epochs required to reach the steady state (γ = 1) and γ is the coefficient of the KL term.
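Equation (2.9) translates directly into a small helper. The function name `kl_weight` is a hypothetical choice; the sketch only illustrates the linear ramp followed by a constant coefficient.

```python
def kl_weight(t, T):
    """Linear KL annealing coefficient: ramps from 0 to 1 over T epochs, then stays at 1."""
    return min(t, T) / T


# Coefficient over the first epochs for a hypothetical ramp length T = 10.
schedule = [kl_weight(t, 10) for t in range(0, 15)]
```

The annealed objective is then the reconstruction term minus γ times the KL term, with γ recomputed at every epoch.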

2.6.2 Free Bits and Soft Free Bits

In order to prevent the collapse of the posterior approximation to the prior, the gradients of the KL term can be zeroed whenever the term falls below a threshold, by setting a lower-bound value λ on the nats expressed by that term, as in:

$$
\max\left[\lambda,\ \mathbb{E}_{x \sim \mathcal{M}}\left[\mathrm{KL}\left[q_\phi(z|x)\,\|\,p_\theta(z)\right]\right]\right]
\tag{2.10}
$$

Alternatively, as described in a revision of [Chen et al., 2016], Soft Free Bits can be used: a KL annealing rate γ is adapted by updating it at every iteration. γ is repeatedly multiplied by 1 + ε or 1 − ε, according to the KL being, respectively, larger or lower than the target λ. This is described by the following algorithm:

Algorithm 1 Soft Free Bits

Require:
(1) Initial annealing rate γ (to the KL)
(2) ε value to adjust the annealing rate
(3) λ desired target nats from the KL
Ensure:
(1) The annealing rate γ will be adjusted to ease the convergence of the KL to the target value λ

if KL > λ then
    γ ← γ * (1 + ε)
else
    γ ← γ * (1 - ε)
end if
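Both variants can be sketched in a few lines of Python. The function names and the sequence of KL values below are hypothetical; in a real training loop the KL would be the minibatch average measured at each iteration.

```python
def free_bits(kl, lam):
    """Hard free bits, equation (2.10): the KL term is clamped from below at lam."""
    return max(lam, kl)


def soft_free_bits_update(gamma, kl, lam, eps):
    """One step of Algorithm 1: raise gamma if the KL exceeds the target, lower it otherwise."""
    if kl > lam:
        return gamma * (1.0 + eps)
    return gamma * (1.0 - eps)


gamma = 1.0
for kl in [0.5, 0.8, 1.5, 2.5, 3.0]:   # made-up KL measurements, in nats
    gamma = soft_free_bits_update(gamma, kl, lam=2.0, eps=0.05)
```

In this toy run the KL is below the target three times and above it twice, so γ ends slightly below its initial value.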


Chapter 3

Related work

3.1 Probabilistic Matrix Factorization

Probabilistic Matrix Factorization (PMF) [Salakhutdinov and Mnih, 2008] is a dimensionality-reduction technique for the CF problem that learns a factorization of the sparse matrix of observed ratings R into two low-dimensional factor matrices U ∈ R^{D×N} and V ∈ R^{D×M}, where D is the size of the low-dimensional latent space. Hence, R ≈ U⊤V.

The learning algorithm is based on a probabilistic assumption:

$$
p(R|U, V, \sigma^2) = \prod_{i=1}^{N} \prod_{j=1}^{M} \left[\mathcal{N}\left(R_{ij}\,|\,U_i^\top V_j, \sigma^2\right)\right]^{I_{ij}}
\tag{3.1}
$$

Here I_ij is 0 if R_ij is not set and 1 if it is set.

In PMF, the log-likelihood is a sum of terms, each dependent on a specific user and item with I_ij = 1. This allows for SGD-like updates of the vectors U_i and V_j, which are progressively refined through the iterations.
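The following sketch illustrates the masked log-likelihood of equation (3.1) and one possible SGD-like update on a single observed rating. The small rating matrix, the learning rate and the number of sweeps are made up for illustration, zeros stand for missing ratings, and the Gaussian priors on U and V used by the full PMF model are deliberately left out.

```python
import numpy as np

# Toy rating matrix (made-up values); 0 marks a missing rating.
R = np.array([[5, 3, 0, 1],
              [4, 0, 0, 1],
              [1, 1, 0, 5],
              [1, 0, 0, 4],
              [0, 1, 5, 4]], dtype=float)
I = R > 0                                  # indicator of observed ratings
N, M, D = R.shape[0], R.shape[1], 2

rng = np.random.default_rng(0)
U = rng.normal(scale=0.1, size=(D, N))     # user factors
V = rng.normal(scale=0.1, size=(D, M))     # item factors


def log_likelihood(R, I, U, V, sigma2=1.0):
    """log p(R|U,V,sigma^2): Gaussian log-densities summed over observed entries only."""
    err = R - U.T @ V
    per_entry = -0.5 * np.log(2.0 * np.pi * sigma2) - err ** 2 / (2.0 * sigma2)
    return np.sum(per_entry[I])


def sgd_step(i, j, lr=0.01, sigma2=1.0):
    """Gradient ascent on the log-likelihood term of the single rating (i, j)."""
    err = R[i, j] - U[:, i] @ V[:, j]
    grad_u = err * V[:, j] / sigma2        # both gradients use the old values
    grad_v = err * U[:, i] / sigma2
    U[:, i] += lr * grad_u
    V[:, j] += lr * grad_v


ll_before = log_likelihood(R, I, U, V)
rows, cols = np.nonzero(I)
for _ in range(500):                       # sweeps over the observed ratings
    for i, j in zip(rows, cols):
        sgd_step(i, j)
ll_after = log_likelihood(R, I, U, V)
```

Only the factors of the user and item involved in a rating are touched by each step, which is what makes the per-rating decomposition convenient for sparse data.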

3.2 AutoRec

AutoRec [Sedhain et al., 2015], differently from PMF, does not store learned latent vectors, but is able to produce them on-the-fly via an encoder-decoder neural network architecture.

This model is particularly interesting because a single query with an entire sparse ratings vector results in all the missing ratings being estimated at once.


The missing ratings are not provided to the encoder, but an estimate of them is nevertheless produced by the decoder, making use of the representation learning and denoising capabilities intrinsic to an autoencoder with a low-dimensional bottleneck layer.

The loss function is hence the error between the user (or item) vector r_k and its reconstruction, considering only the existing ratings via element-wise multiplication with the mask vector M_k. Otherwise the learning would incorrectly take into account the 0 placeholders for the missing ratings in the sparse matrix:

$$
\min \sum_k \left\| \left[r_k - \mathrm{Dec}\left(\mathrm{Enc}(r_k)\right)\right] \odot M_k \right\|_2^2
\tag{3.2}
$$

Even with this model, the sum allows for SGD-like updates.
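The masked loss of equation (3.2) can be sketched for a single ratings vector as follows. The encoder and decoder here are hypothetical single layers with random, untrained weights (`W_enc`, `W_dec` and the layer sizes are illustrative choices, not the architecture used in the experiments); the point is only that the entries with mask 0 contribute nothing to the loss.

```python
import numpy as np

rng = np.random.default_rng(0)
n_items, n_hidden = 6, 3

# Hypothetical, untrained parameters of a one-hidden-layer autoencoder.
W_enc, b_enc = rng.normal(scale=0.1, size=(n_hidden, n_items)), np.zeros(n_hidden)
W_dec, b_dec = rng.normal(scale=0.1, size=(n_items, n_hidden)), np.zeros(n_items)


def enc(r):
    """Encoder: sigmoid layer producing the bottleneck code."""
    return 1.0 / (1.0 + np.exp(-(W_enc @ r + b_enc)))


def dec(h):
    """Decoder: linear layer producing one estimate per item."""
    return W_dec @ h + b_dec


def masked_loss(r, mask):
    """Squared reconstruction error restricted to the observed ratings."""
    residual = (r - dec(enc(r))) * mask
    return np.sum(residual ** 2)


r = np.array([5.0, 0.0, 3.0, 0.0, 0.0, 1.0])   # 0 marks a missing rating
mask = (r > 0).astype(float)
loss = masked_loss(r, mask)
unmasked_loss = masked_loss(r, np.ones_like(r))
```

The decoder still outputs a value for every item, so the estimates for the missing ratings are available even though they never enter the loss.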

3.3 Matrix Factorizing Variational Autoencoder

MFVA [van Baalen, 2016] makes use of the findings in [Kingma and Welling, 2013]: variational autoencoders are used in order to yield posterior distribution approximations as diagonal Gaussians q(u_i|r_{i·}) and q(v_j|r_{·j}).

The decoder functions that have been used, differently from AutoRec, output the single rating r_ij. A dot-product between u_i and v_j as well as an MLP have been employed for the task, with r_ij being expressed either with a Gaussian distribution or with a multinomial distribution.

The lower bound assumes this form:

$$
\begin{aligned}
\mathcal{L} = &-\sum_i \mathrm{KL}\left[Q_u(u_i|r_{i,\cdot})\,\|\,p(u_i)\right] \\
&-\sum_j \mathrm{KL}\left[Q_v(v_j|r_{\cdot,j})\,\|\,p(v_j)\right] \\
&+\sum_{i,j} \mathbb{E}_{Q_u,Q_v}\left[\log p(r_{ij}|u_i, v_j)\right]
\end{aligned}
\tag{3.3}
$$

Chapter 4

Method

This chapter mostly presents the original contributions of this thesis. Our models, as is typical in machine learning, are constituted of many "building blocks", which will be individually elucidated through the following sections.

4.1 VAERec

VAERec is one of the contributions of this thesis. It extends the AutoRec model making use of the VAE framework.

It has been implemented in three variants:

U-VAERec assumes the presence of latent variables u_i, which represent a specific user i in latent space, whose observed ratings are represented by the sparse row r_{i·}. This model reconstructs user rows by learning the conditional distribution pθ(r_{i·}|u_i), assumed to be a diagonal-covariance Gaussian, and, jointly, the variational approximation to the posterior qφ(u_i|r_{i·}), also assumed to be a diagonal-covariance Gaussian.

I-VAERec is dual to U-VAERec. It assumes the presence of latent variables v_j which represent a specific item j in latent space, whose observed ratings are represented by the sparse column r_{·j}. Hence, the targets of the learning are the parameters of the distributions pθ(r_{·j}|v_j) and qφ(v_j|r_{·j}).

UI-VAERec reconstructs a vector consisting of the concatenation of a user row and an item column. It learns pθ(r_{i·}, r_{·j}|z) and qφ(z|r_{i·}, r_{·j}). This differs from the MFVA model with the MLP decoder proposed by [van Baalen, 2016], as a distribution on a single latent vector z_ij representing the user-item pairing is produced, instead of two distinct distributions on u_i and v_j.

4.1.1 Sampling the ratings for the UI variants

Training the UI-VAERec would require very long epochs, as the number of training datapoints would be the number of ratings, O(N · M). To prevent problems related to excessive memory usage, as each rating would be stored as a concatenation of its user vector r_{i·} and item vector r_{·j}, epochs have been implemented as random samplings of a fixed number (5000) of ratings. The validation set is composed of a similar sampling, on different ratings. It’s worth noting that in our implementation the ratings selected in the training set will never be present in the vectors of the validation set and vice-versa. Moreover, ratings are split between training ratings and validation ratings at the very beginning and this split is kept unchanged through the epochs, effectively creating two non-overlapping sparse matrices R(t) for training and R(v) for validation.
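The splitting and sampling scheme can be sketched as follows. This is only an illustration under stated assumptions: the toy matrix, the 20% validation share and the helper names `ui_datapoint` and `sample_epoch` are hypothetical, and the sketch builds each datapoint from the same split matrix it was sampled from, which is one possible reading of the description above.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy rating matrix (made-up values); 0 marks a missing rating.
R = np.array([[5, 3, 0, 1],
              [4, 0, 0, 1],
              [1, 1, 0, 5],
              [1, 0, 0, 4],
              [0, 1, 5, 4]], dtype=float)

# Split the observed ratings once, before training, into two disjoint sets.
rows, cols = np.nonzero(R)
perm = rng.permutation(len(rows))
n_val = len(rows) // 5                     # assumed validation share of about 20%
val_idx, train_idx = perm[:n_val], perm[n_val:]

R_train = np.zeros_like(R)
R_val = np.zeros_like(R)
R_train[rows[train_idx], cols[train_idx]] = R[rows[train_idx], cols[train_idx]]
R_val[rows[val_idx], cols[val_idx]] = R[rows[val_idx], cols[val_idx]]


def ui_datapoint(R_src, i, j):
    """Concatenation of the sparse user row r_i. and the sparse item column r_.j."""
    return np.concatenate([R_src[i, :], R_src[:, j]])


def sample_epoch(R_src, n_samples, rng):
    """One 'epoch': a random sample of ratings, each expanded into a UI datapoint."""
    r, c = np.nonzero(R_src)
    pick = rng.choice(len(r), size=min(n_samples, len(r)), replace=False)
    return [(i, j, ui_datapoint(R_src, i, j)) for i, j in zip(r[pick], c[pick])]


batch = sample_epoch(R_train, 5, rng)
```

Building the vectors lazily at sampling time avoids materializing one (N + M)-dimensional vector per rating, which is the memory problem the fixed-size sampling is meant to address.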

4.2 VAERec with Normalizing Flows

The VAERec-NF model extends the VAERec by improving the posterior approximation with Normalizing Flows [Rezende and Mohamed, 2015], as explained in sections 2.5.1, A.6 and A.7.

4.2.1 Normalizing Flow using RealNVP’s invertible transformation

In this VAERec-NF variant, a transformation of the type previously described in section 2.5.2 is introduced. This transformation was considered interesting because of its ease of implementation and the very simple determinant of its Jacobian. Specifically, the function s(·) is implemented as a single perceptron layer with the nonlinear activation function tanh. The function a(·), which is required to be an affine transform, is implemented as a single perceptron layer with a linear activation function.


All the parameters of the transformation are produced as output of the encoder network, exactly as happens with the parameters of qφ(z|x) in a Variational Auto-Encoder. This differs from the model of section 2.5.2, whose network parameters are not given as a function of the inputs, but rather belong to a single globally initialized network which is the same for every input. The weights of the a(·) and s(·) layers are implemented as vectors of size K², then reshaped into (K, K) dimensions. Since the encoder has to output on the order of K² values for each of these layers, this limits the model to very low latent dimensionalities.

4.2.2 Masking

To ease the implementation, the selection of the first and second parts of z has been implemented with random hyperparameter masks. These masks are unique for each transformation step, and are computed as follows:

Algorithm 2 Half-full random masks for RealNVP transformations

Require:
(1) Latent dimensionality K
(2) Number of transformation steps k
Ensure:
(1) Random masks m_1 ... m_k which have half of their elements set to 1

for i ∈ 1 ... k do
    (a)_j ← 1 if j < K/2, 0 if K/2 ≤ j < K
    m_i = shuffle(a)
end for
return m_1 ... m_k

The invertible function, for a transformation step i, is hence implemented as:

$$
z_{i+1} = z_i \odot m_i + (1 - m_i) \odot \left[z_i \odot \exp\left(s(m_i \odot z_i)\right) + a(m_i \odot z_i)\right]
\tag{4.1}
$$

The masks are computed before training and are left unchanged, as they should be considered as hyperparameters.
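Algorithm 2 and equation (4.1) can be sketched together in NumPy. In this minimal example s(·) and a(·) are single layers with fixed random (K, K) weights (`W_s`, `W_a`, `b_a` are hypothetical names), whereas in VAERec-NF these weights would be produced by the encoder for every input; the sizes K = 6 and k = 3 are also illustrative.

```python
import numpy as np


def half_full_masks(K, k, rng):
    """Algorithm 2: k random binary masks with exactly K // 2 entries set to 1."""
    base = (np.arange(K) < K // 2).astype(float)
    return [rng.permutation(base) for _ in range(k)]


def masked_step(z, m, s, a):
    """Equation (4.1): entries with m = 1 pass through, the others are transformed."""
    h = m * z
    return z * m + (1.0 - m) * (z * np.exp(s(h)) + a(h))


K, k = 6, 3
rng = np.random.default_rng(0)
masks = half_full_masks(K, k, rng)

# Hypothetical fixed weights standing in for the encoder-produced ones.
W_s = rng.normal(scale=0.3, size=(K, K))
W_a = rng.normal(scale=0.3, size=(K, K))
b_a = rng.normal(size=K)


def s(h):
    return np.tanh(W_s @ h)


def a(h):
    return W_a @ h + b_a


z0 = rng.normal(size=K)
z = z0
for m in masks:
    z = masked_step(z, m, s, a)

# Effect of a single step with the first mask, for inspection.
z1 = masked_step(z0, masks[0], s, a)
```

Because s(·) and a(·) only see the masked half m ⊙ z, the transformed half never depends on itself, which preserves the triangular Jacobian structure of section 2.5.2 up to a permutation of the coordinates.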


4.3 Tackling underfitting

It’s very common, when working on new models, to have difficulties in getting the model to even learn from the training data. In other words, the model may be configured in such a way that, even before overfitting arises, underfitting is still an issue.

This might be caused by many factors: limited width or depth of the networks, over-regularization, low quality datasets, or a learning rate that is too small.

In our models, the attempts to tackle underfitting have been widening the network and adding more hidden layers, as well as KL annealing, described in the following section.

4.3.1 KL annealing

In VAERec models, one source of regularization is the KL divergence in the ELBO.

In our models, a slowly and linearly increasing annealing coefficient on the KL has been tried, with the following rule:

$$
a = \frac{\min(t, T)}{T}
\tag{4.2}
$$

where a is the value of the annealing coefficient of the KL, t is the current epoch and T is the epoch number from which the coefficient will be 1.0.

4.4 Dealing with overfitting

One of the major issues in some AutoRec/VaeRec models is overfitting. Notwithstanding the fact that VAE models are plagued by over-regularization caused by the posterior approximation collapse described in section 2.6, under specific circumstances overfitting is prominent, as when using UI (user+item) datapoint schemas. In these cases, the model performs very well on the training set but unsatisfactorily on the test set.

Dropout on the input layer. Dropout [Srivastava et al., 2014] is a technique aimed at preventing overfitting by randomly dropping units and their connections during training. This yields a neural network which can function even when parts of it are deactivated. Moreover, it prevents a single unit from becoming entirely representative of a certain aspect of the training data.

It has been observed that, by applying Dropout on just the input layer of the VaeRec models, overfitting can be prevented to a certain degree, possibly through a mechanism similar to that of denoising autoencoders [Vincent et al., 2010a].

Narrowing the bottleneck. This is a common technique, typically used with regular autoencoders. Information is channeled through a limited number of nodes, forcing the neural network to lose information which is hopefully irrelevant, such as dataset noise.

Regularization. Regularization has been tested with either L1 or L2 norms, without significant improvements.

Increasing the depth. Depth has been studied in [Mhaskar et al., 2017], [Poggio et al., 2015] as a method to improve representations. Better representations usually lead to a better identification and removal of the noise that would otherwise be overfitted.

4.5 Optimization

The application of gradients from the objective function to the parameters is nontrivial. Problems such as the choice of learning rate, the prevention of exploding and vanishing gradients, as well as avoiding getting stuck in local minima are well-known challenges of model training. This section describes a few optimization methods, as well as implementation details for tackling sparsity.

4.5.1 RPROP update algorithm

RPROP [Riedmiller and Braun, 1993] is a gradient-based parameter update schema that does not take into account the magnitudes of the gradients, but only their sign.

The idea is simple: if the gradient keeps pointing in the same (either positive or negative) direction, then the parameter-specific update delta needs to be increased; otherwise, if the gradient for a parameter keeps changing sign, the update delta needs to decrease, in order to fine-tune the parameter to its optimum.

A change of sign is detected through the product of the gradient of a parameter w_i, calculated to minimize an objective function J, at time step t and the gradient of that same parameter at the previous time step t − 1:

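$$ p = \left(\frac{\partial J}{\partial w_i}\right)^{(t-1)} \cdot \left(\frac{\partial J}{\partial w_i}\right)^{(t)} \tag{4.3} $$

The sign of p determines the increase or decrease of the parameter-specific delta ∆_i:

$$ \Delta_i^{(t)} = \begin{cases} \min\left(\eta^{+} \Delta_i^{(t-1)}, \Delta_{\max}\right) & \text{if } p > 0 \\ \max\left(\eta^{-} \Delta_i^{(t-1)}, \Delta_{\min}\right) & \text{if } p < 0 \\ \Delta_i^{(t-1)} & \text{if } p = 0 \end{cases} \tag{4.4} $$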

where 0 < η− < 1 < η+.

Typical parameter settings are η− = 0.5, η+ = 1.2, ∆min = 1e−6, ∆max = 50.0 and initial delta values ∆0 = 0.1.

RPROP has been used specifically for AutoRec, as suggested in the original paper [Sedhain et al., 2015].
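The sign-based update can be illustrated with a short NumPy sketch. It uses the typical settings quoted above and assumes the simplest weight update, in which each parameter moves against the sign of its gradient by its own delta (no weight backtracking); the thesis does not spell out this last step, so it is an assumption here, and the function name `rprop_step` and the toy quadratic objective are hypothetical.

```python
import numpy as np


def rprop_step(w, grad, prev_grad, delta,
               eta_minus=0.5, eta_plus=1.2,
               delta_min=1e-6, delta_max=50.0):
    """One RPROP update; returns the new weights and the new deltas."""
    p = prev_grad * grad  # sign agreement between consecutive gradients
    # Same sign: enlarge the step (capped); sign flip: shrink it (floored).
    delta = np.where(p > 0, np.minimum(eta_plus * delta, delta_max), delta)
    delta = np.where(p < 0, np.maximum(eta_minus * delta, delta_min), delta)
    # Assumed update rule: only the sign of the gradient is used.
    w = w - np.sign(grad) * delta
    return w, delta


# Toy objective J(w) = sum((w - target)^2), minimized from w = 0.
target = np.array([3.0, -2.0])
w = np.zeros(2)
delta = np.full(2, 0.1)      # initial deltas
prev_grad = np.zeros(2)
for _ in range(100):
    grad = 2.0 * (w - target)
    w, delta = rprop_step(w, grad, prev_grad, delta)
    prev_grad = grad
print(w)
```

Because only the sign of the gradient is used, the step sizes are insensitive to the scale of the objective.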

4.5.2 Adam

Given its adaptive momentum properties, Adam [Kingma and Ba, 2014] has been chosen as the optimization algorithm.

VAERec, similarly to AutoRec, needs to be selective about which parameters are updated: in both the first layer of the encoder and the last layer of the decoder, only the weights that are connected to existing ratings can be updated.

Given a binary mask M for a parameter tensor θ, the two assignments of m_t and v_t in Adam are modified from the original algorithm as follows:

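$$ \begin{aligned} m_t &\leftarrow \frac{m_t}{1 - \beta_1^t} M + m_{t-1} (1 - M) \\ v_t &\leftarrow \frac{v_t}{1 - \beta_2^t} M + v_{t-1} (1 - M) \end{aligned} \tag{4.5} $$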
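The effect of the mask can be illustrated with a simplified NumPy sketch. It is not the exact update above: here the bias correction is kept in separate variables, as in the standard Adam algorithm, and the mask only decides which entries of the moments and of the parameters are allowed to change. The function name `masked_adam_step` and the default hyperparameters are assumptions made for illustration.

```python
import numpy as np


def masked_adam_step(theta, grad, m, v, mask, t,
                     lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """Adam step that leaves the entries where mask == 0 untouched."""
    m_new = beta1 * m + (1.0 - beta1) * grad
    v_new = beta2 * v + (1.0 - beta2) * grad ** 2
    # Where the mask is 0, keep the moments of the previous step.
    m = m_new * mask + m * (1.0 - mask)
    v = v_new * mask + v * (1.0 - mask)
    # Standard bias correction, applied to temporary copies.
    m_hat = m / (1.0 - beta1 ** t)
    v_hat = v / (1.0 - beta2 ** t)
    theta = theta - lr * mask * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v


theta = np.zeros(3)
grad = np.array([1.0, -2.0, 3.0])
mask = np.array([1.0, 0.0, 1.0])   # the second weight has no rating
theta, m, v = masked_adam_step(theta, grad, np.zeros(3), np.zeros(3), mask, t=1)
print(theta, m, v)
```

In this sketch a weight connected to a missing rating keeps both its value and its moment estimates, so a later update is not distorted by gradients that were never meaningful for it.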


4.5.3 Learning rate annealing

Annealing of the learning rate is a technique that progressively reduces the learning rate in order to facilitate reaching an optimum of the parameters.

Intuitively, as learning progresses the parameters get progressively closer to the optimum and, as a consequence, the parameter adjustments need to be of smaller magnitude than the initial ones. This desirable behavior is achieved by progressively reducing the learning rate over the epochs.

This intuition is supported by [Robbins and Monro, 1951], who established conditions on the sequence of learning rates that ensure reaching an optimum.

The learning rate annealing schema that has been chosen is described by the following formula [Orr, 1999]:

$$ \gamma(t) = \frac{\gamma(0)}{1 + t/T} \tag{4.6} $$

where γ(t) is the learning rate at epoch t, γ(0) is the initial learning rate and T is a hyperparameter equal to the number of epochs it takes for the learning rate to halve.

The decaying curve is gradual and non-linear, with a long tail.
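As a small illustration of this schedule, the following sketch evaluates the annealed learning rate for an initial rate of 2e-6 and T = 10, the values reported later for the experiments. The function name is hypothetical.

```python
import math


def annealed_learning_rate(initial_rate: float, epoch: int, halving_epochs: float) -> float:
    """Learning rate gamma(t) = gamma(0) / (1 + t / T)."""
    return initial_rate / (1.0 + epoch / halving_epochs)


for epoch in (0, 10, 30, 70):
    print(epoch, annealed_learning_rate(2e-6, epoch, 10))
```

The rate halves after the first T epochs, but a further halving takes 2T more epochs, then 4T, which is the long tail mentioned above.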


Chapter 5

Experiments

This chapter presents experiments on our models. First, some contextual information is given, such as details on the technologies that have been employed, the datasets used, technicalities about the scaling of learning rate and regularization coefficients, and hyperparameter settings.

The results of the experiment runs follow. Interesting information on how learning progresses can be obtained by observing the evolution of the RMSE over the epochs, both for the training set and the validation set. Analysis of the charts provides insight into issues such as hyperparameter choice, overfitting and underfitting.

5.1 Technologies used

Python version 3. The Python programming language is well suited for data science applications. The large number of available libraries helps avoid re-inventing the wheel. Its conciseness makes it very readable and hands-on.

Theano [Al-Rfou et al., 2016]. A Python library for building computational graphs and performing automatic differentiation, specifically on tensors.

DAS4 [Bal et al., 2016]. Grid computing environment from a partnership between Dutch universities. It allows running concurrent jobs, also on nodes equipped with GPUs to speed up deep learning computations.


5.2 The Movielens datasets

To test the models, the Movielens datasets [Harper and Konstan, 2015] have been used. Specifically, the small dataset was used for local debugging, and the 1M dataset was used for the main experiments. The 1M dataset was chosen for the main results because most papers report results for models trained on this specific dataset.

Statistics on the two Movielens datasets follow.

number of users N     668
number of items M     10325
average rating        3.51685
standard deviation    1.04487

Table 5.1: small Movielens dataset statistics

Figure 5.1: small Movielens dataset ratings distribution

number of users N     6040
number of items M     3706
average rating        3.58156
standard deviation    1.1171

Table 5.2: 1M Movielens dataset statistics

Figure 5.2: 1M Movielens dataset ratings distribution

5.3 Scaling issues and regularization

In order to achieve the same intensity of learning per epoch even when varying the minibatch size, it is necessary to re-scale some hyperparameters.

Let us consider a complete objective for a typical learning task of predicting quantities y_i from respective datapoints x_i, using a dataset $D = \{(x_i, y_i)\}_{i=1}^{|D|}$. As the datapoints are independent and identically distributed, the objective can be expressed as a sum over all the datapoints. One step of learning from this objective is usually referred to as an "epoch".

$$ J = \sum_{i=1}^{|D|} \ell(x_i, y_i) + \lambda \Omega(\Theta) \tag{5.1} $$


where Ω is a regularization term, Θ is the set of the regularizable parameters and λ is a fixed hyperparameter that determines the regularization amount.

This objective is subject to Gradient Descent learning on the trainable parameters:

$$ \Theta_{t+1} = \Theta_t - \gamma \nabla_\Theta J_t \tag{5.2} $$

where γ is the learning rate hyperparameter.

As J is defined as a sum over independent datapoints, it is possible to use Stochastic Gradient Descent learning strategies, which take into account only a limited number of datapoints at a time.

It is desirable to consider the average contribution J_a of each datapoint to the objective J (5.1):

$$ J_a = \frac{1}{|D|} J = \frac{1}{|D|} \sum_{i=1}^{|D|} \ell(x_i, y_i) + \frac{\lambda}{|D|} \Omega(\Theta) \tag{5.3} $$

If learning using J_a is repeated |D| times within an epoch, then the algorithm achieves the same learning intensity as with J while keeping the same learning rate γ.

By splitting the dataset and the objective J (5.1) into a number of minibatches of size B, an approximation to J_a, useful for SGD-like algorithms, can be obtained:

$$ J_b = \frac{1}{B} \sum_{i=1}^{B} \ell(x_i, y_i) + \frac{\lambda}{|D|} \Omega(\Theta) \tag{5.4} $$

An important consequence of using J_b is that the intensity of the learning is altered, because fewer updates are applied to Θ in each epoch. In order to balance this phenomenon, the learning rule can be modified as follows:

$$ \Theta_{t+1} = \Theta_t - B \gamma \nabla_\Theta J_{b,t} \tag{5.5} $$

The B coefficient cancels out the effect of the 1/B coefficient in the average over the datapoints in J_b:

$$ \Theta_{t+1} = \Theta_t - \gamma \left[ \sum_{i=1}^{B} \nabla_\Theta \ell(x_i, y_i) + \frac{B \lambda}{|D|} \nabla_\Theta \Omega(\Theta) \right] \tag{5.6} $$
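The cancellation can be checked numerically. The sketch below is only an illustration under stated assumptions: a squared-error loss on a linear model, Ω(Θ) = ||Θ||², and a regularization factor of Bλ/|D| in the expanded rule, which is what multiplying the λ/|D| term of J_b by B gives. All names and values are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
dataset_size, batch_size, lam, gamma = 12, 4, 0.5, 0.01
X = rng.normal(size=(dataset_size, 3))
y = rng.normal(size=dataset_size)
theta = rng.normal(size=3)


def loss_grad(theta, x, target):
    """Gradient of the squared error (x . theta - target)^2 of one datapoint."""
    return 2.0 * (x @ theta - target) * x


def reg_grad(theta):
    """Gradient of Omega(theta) = ||theta||^2."""
    return 2.0 * theta


# Sum of the per-datapoint gradients over the first minibatch.
summed = sum(loss_grad(theta, X[i], y[i]) for i in range(batch_size))

# Step obtained by scaling the gradient of the minibatch average J_b by B.
grad_jb = summed / batch_size + lam / dataset_size * reg_grad(theta)
step_scaled = batch_size * gamma * grad_jb

# Expanded form: summed gradients plus the rescaled regularizer.
step_expanded = gamma * (summed + batch_size * lam / dataset_size * reg_grad(theta))

print(np.allclose(step_scaled, step_expanded))
```

The two steps coincide, so the data term is no longer divided by B, while the regularizer is applied with weight Bλ/|D| at each of the |D|/B updates in an epoch.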


5.4 Prevention of exploding gradients

It is possible, under specific circumstances, that the gradients become unstable and compromise the parameters of the model with infinities or "not a number" values.

In order to prevent this phenomenon, a few "tricks" have been implemented:

Gradient clipping [Markou, 2017] has been implemented with the following norm-based scaling algorithm:

Algorithm 3 Norm-based gradient clipping

Require:
(1) Gradient tensor g
(2) Threshold θ (defaulted to value 10)

Ensure:
(1) Scaled gradients ĝ whenever their L2 norm surpasses a threshold θ.

if ||g||_2 > θ then
    ĝ ← (θ / ||g||_2) g
else
    ĝ ← g
end if
return ĝ
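The clipping rule of Algorithm 3 translates almost directly into NumPy. The following is a minimal sketch rather than the thesis implementation (which was written in Theano); the function name is hypothetical.

```python
import numpy as np


def clip_by_norm(grad: np.ndarray, threshold: float = 10.0) -> np.ndarray:
    """Rescale `grad` so that its L2 norm never exceeds `threshold`."""
    norm = np.linalg.norm(grad)  # L2 norm of the flattened tensor
    if norm > threshold:
        return grad * (threshold / norm)
    return grad


large = clip_by_norm(np.array([30.0, 40.0]))   # norm 50, rescaled to norm 10
small = clip_by_norm(np.array([3.0, 4.0]))     # norm 5, left unchanged
print(large, small)
```

The direction of the gradient is preserved; only its length is limited.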

Scaled tanh activation function. Some layers have "log σ" outputs. As the output of these layers is exponentiated in the likelihood function, keeping the activation linear carries a great risk of instability and value explosion. For this reason a "pseudo-linear" soft-bounded activation function has been implemented by re-scaling a tanh activation function. The co-domain of tanh is bounded between -1 and +1. Moreover, its derivative is approximately 1 near the origin. It follows that tanh suits the role of pseudo-linear bounded activation function if it is rescaled as follows:

$$ f(x) = K \tanh(x/K) \tag{5.7} $$

where K is a small integer greater than 1. In this way the activation function is bounded between −K and +K. A good value for K might be 5, which keeps the σ values bounded between 0.0067 and 148.4, that is, between exp(−5) and exp(5).

For layers that are supposedly linear in their outputs, such as the Planar Flow's w, b and u quantities, as well as the means µ of the Gaussian distributions, a "pseudo-linear" function in the same guise has been implemented with K = 20.
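The behavior of the rescaled tanh can be checked with a few lines of NumPy. This is a sketch with a hypothetical function name, using K = 5 as suggested for the "log σ" outputs.

```python
import numpy as np


def scaled_tanh(x, k=5.0):
    """Pseudo-linear activation bounded between -k and +k."""
    return k * np.tanh(x / k)


# Near the origin the function is close to the identity.
near_origin = scaled_tanh(0.01)

# Far from the origin it saturates, which bounds sigma = exp(log sigma).
sigma_max = np.exp(scaled_tanh(1000.0))
sigma_min = np.exp(scaled_tanh(-1000.0))
print(near_origin, sigma_min, sigma_max)
```

Whatever the pre-activation value, the resulting σ stays between roughly 0.0067 and 148.4.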

Learning rate "warm-up" has been implemented to prevent immediate divergence in the first epochs due to steep gradients. The learning rate is raised from a very small value to its regime value over the course of the very first epochs.

Algorithm 4 Learning rate warm-up

Require:
(1) Initial learning rate γ
(2) Current epoch number t
(3) Number of initial warm-up epochs K
(4) Base value of the warm-up coefficient B, which has to be less than 1

Ensure:
(1) Adjusted and progressively increasing learning rate γ̂ during the first K epochs

if t >= K then
    γ̂ = γ
else
    γ̂ = B^(K−t) γ
end if
return γ̂

Appropriate parameter settings have been found as K = 3 and B = 0.1.
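With these settings the warm-up schedule is easy to tabulate. The sketch below assumes that epochs are counted from 0 and that the coefficient is B raised to the power K − t; the function name is hypothetical.

```python
import math


def warmup_learning_rate(base_rate: float, epoch: int,
                         warmup_epochs: int = 3, base: float = 0.1) -> float:
    """Learning rate during warm-up: base^(K - t) * rate while t < K."""
    if epoch >= warmup_epochs:
        return base_rate
    return base ** (warmup_epochs - epoch) * base_rate


rates = [warmup_learning_rate(1.0, t) for t in range(5)]
print(rates)
```

With K = 3 and B = 0.1 the learning rate is multiplied by roughly 0.001, 0.01 and 0.1 in the first three epochs and reaches its regime value from the fourth epoch on.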

5.5 Soft Free Bits settings

During experiments with VaeRec, it was noted that the marginal (per-dimension) KL values differ greatly from the total KL. This is because, as the latent dimensionality increases, it gets harder to match the prior and the posterior. For this reason, for larger latent dimensionalities, a posterior collapse can be observed through the KL marginals, even if the total KL still returns values that are reasonably high.

Within our Soft Free Bits (see section 2.6) implementation, our solution was simply to set λ linearly proportional to the latent dimensionality K, as in λ = 2 ∗ K.

The annealing ε was set, as suggested in [Chen et al., 2016], to the value 0.05. The value of γ was updated at every minibatch learning iteration.

Figure 5.3: Soft Free Bits: evolution of the KL annealing coefficient vs. the KL divergence. The values are sampled after every minibatch update. This plot has been obtained with about 50 epochs of VaeRec, without Normalizing Flows and with latent dimensionality K=1. In blue: annealing coefficient value; in green: KL measure

Figure 5.4: Zoom-in of the last part of the previous figure 5.3. It is noticeable how the KL divergence measure successfully converges towards the desired amount of 2. The annealing coefficient keeps oscillating, reflecting the dynamic nature of the annealing-vs-KL system. In blue: annealing coefficient value; in green: KL measure

Figure 5.5: Similar plot to 5.3, but with the more interesting case of K=5. The KL divergence also successfully converges to the target value 10. In blue: annealing coefficient value; in green: KL measure

Figure 5.6: Zoom-in of the last part of the previous figure 5.5. The annealing coefficient converges, with oscillations, towards small values, as in the case with K=1. In blue: annealing coefficient value; in green: KL measure

5.5.1 Soft free bits settings in a deeper model with high latent dimensionality

In order to verify the effectiveness of the linear relationship between the latent dimensionality and the amount of "soft free bits" λ, a different VaeRec model, deeper and with a higher latent dimensionality, was chosen (without Normalizing Flows).

In this model, the latent dimensionality is 250 and there are two hidden layers in both the encoder and the decoder. L2 regularization has been used with coefficient 100, together with an initial learning rate of 0.000006 and learning rate annealing with T = 10.

The following plot shows the learning evolution with different settings of λ:

Figure 5.7: "Deep" model with different settings for λ: 0.5 ∗ K, 1 ∗ K and 2 ∗ K

Table 5.3: Best results for deep VaeRec under different "soft free bits" λ settings

λ          best training RMSE   best testing RMSE
0.5 ∗ K    0.8003               0.8565
1 ∗ K      0.4951               0.8536
2 ∗ K      0.3766               0.8598

While λ = 2 ∗ K is obviously under-regularized, there might be a "sweet spot" between the under-regularized λ = K and the apparently over-regularized λ = 0.5 ∗ K. The latter case is quite interesting, as it seems not to have converged, even after 400 epochs, to a definitive optimum of the testing RMSE.


5.6 Choice of hyperparameters

The search for an acceptable model via hyperparameter search was a very long process. Since most experiments took about 8 days to complete, choosing the right hyperparameter settings took many months.

At the end of this process it was understood that most hyperparameters had a specific trade-off optimum located at the minimum of a convex curve of validation error. For example, using minibatches of size 1 often gave unsatisfactory results. Better results were obtained with a minibatch of size 64, but increasing that value, for example to 128 or 256, gave worse results.

Other hyperparameters that were problematic to set were those dedicated to L2 regularization of the network weights. A good balance was found by using 100 or 200 as L2 regularization coefficient.

The learning rate was also located at a minimum of a convex curve. High learning rates might initially progress faster but may "jump over" good minima, while lower learning rates might converge to a better minimum but take a larger number of epochs. These effects might be mitigated by the use of a moment-based descent algorithm, such as Adam [Kingma and Ba, 2014], and by the use of the learning rate annealing [Orr, 1999] described in section 4.5.3, with parameter T set at 10, meaning that the initial halving of the learning rate happens after the first 10 epochs (further decay is much slower). A good initial learning rate was found to be 2e-6.

The ideal number of epochs would have been 1000, but unfortunately reaching this target was highly impractical because of the sheer amount of time that the training required. This was aggravated by the limit on the number of concurrent jobs imposed by the administration of the DAS4 distributed supercomputer, which was about 10-20 long-running jobs at the same time. For these reasons many reported experiments have been trained for a lower number of epochs.

The depth of the network was finally chosen to be 1 hidden layer, as it eases the creation of useful intermediate-level representation values, as opposed to not using hidden layers at all. Latent dimensionalities explored were 5, 250 and 500.


5.7 Normalizing Flows

Experiments with planar and RealNVP normalizing flows have been performed.

A base VaeRec model has been chosen with 1 hidden layer of dimensionality 1000 and latent dimensionality 5. Such a limited model, in terms of latent dimensionality, was chosen because RealNVP requires the production of K*K parameters for each weight matrix produced within the RealNVP step.

5.7.1 Experiments with Planar transformations

The efficacy of planar transformations was difficult to assess, because numerical instabilities remained one of the hardest challenges.

The following plot shows the progress curves for a model without "soft free bits", with the KL coefficient fixed to 1.

Figure 5.8: VaeRec with and without single planar transformation step

The normalizing flow step introduces a conspicuous element of variability in the learning outcomes, which can be observed in the scattered learning curves, for both training and testing. Observing in particular the RMSE on the training set, it seems that either the introduction of the planar transformation step has a regularizing effect, or the tuning of the produced transformation parameters makes learning more difficult, as it might require a larger number of epochs.

The following table summarizes the minima in RMSE:

Table 5.4: Best results for VaeRec with and without single planar transformation step

                              best training RMSE   best testing RMSE
without transformation step   0.7071               0.8620
with transformation step      0.8038               0.87053

It is interesting to compare the ELBO of the two models. If the additional step has the effect of improving the posterior approximation, then the objective function on the training set should be lower in the latter case. In these experiments the objective function is exactly the negative ELBO, as no L2 regularization was set.

Figure 5.9: VaeRec with and without single planar transformation step: median of objective function (negative ELBO)

The plot seems to confirm the hypothesis that the ELBO reaches higher values with the planar transformation step, hence the posterior approximation is necessarily more accurate.

5.7.2 Experiments with RealNVP transformations

To experiment with RealNVP transformations, a variant with learning rate annealing T = 10 and soft free bits λ = 2 ∗ K = 10 has been used.

Figure 5.10: VaeRec with RealNVP Normalizing flows

Table 5.5: Best results for deep VaeRec under 0, 1 and 5 RealNVP steps

Number of RealNVP steps   best training RMSE   best testing RMSE
0 steps                   0.8191               0.8574
1 step                    0.8159               0.8566
5 steps                   0.8127               0.8554

These results show that an increase in the number of RealNVP steps leads to a better testing accuracy, although the improvement is modest.


5.8 Equivalences between AutoRec and VaeRec models

In order to perform an adequate comparison between the AutoRec and VaeRec models, it is important to establish whether there are any available equivalences. In other words, it is interesting to see if a specific choice of hyperparameters of the VaeRec leads to a model that is similar to the AutoRec both in its definition and its performance.

Luckily, such a model can be found in the VaeRec by setting the KL coefficient to 0. This way the extra regularization term is absent and the VaeRec model becomes analogous to the AutoRec model.

The per-datapoint ELBO of a VAE whose posterior distributions are diagonal-covariance Gaussians becomes, without the KL divergence:

$$ \mathcal{L}\left(x^{(i)}\right) = E_{q_\phi(z|x^{(i)})}\left[\log p_\theta\left(x^{(i)}|z\right)\right] \tag{5.8} $$

If $p_\theta(x^{(i)}|z)$ has a spherical Gaussian form (identity covariance matrix I), then this objective, which comprises only the likelihood term, becomes very similar to the reconstruction error of a regular autoencoder, but differs in that $z^{(i)}$ is stochastic and drawn from a distribution determined by the encoder. Since the KL term is absent, $q_\phi(z|x^{(i)})$, unregularized, will tend to collapse to distributions that are centered on specific µ's in latent space but have σ's that tend to 0, so that a random sample from $q_\phi(z|x^{(i)})$ is effectively always µ.

Hence, the hypothesis can be formulated that VaeRec without the KL term and AutoRec behave similarly.

Experimental results confirm the hypothesis by showing similarity of testing error:

Columns: minibatch size; hidden layer width; number of hidden layers; latent z dimensionality; AutoRec (RProp) testing RMSE; VaeRec (Adam) testing RMSE

64   1000   1   250   0.8700   0.8335
64   1000   2   250   0.8341   0.8365
64   1000   1   500   0.8767
64   1000   2   500   0.8696   0.8511

Table 5.6: Comparison of similar VaeRec and AutoRec models

The testing errors achieved by the AutoRec and VaeRec models are very similar under similar hyperparameter settings.

5.9 User+Item concatenation vs traditional Item or User datapoints

5.9.1 User+Item concatenation on AutoRec

For this experiment the base AutoRec model was used, with the purpose of observing the difference in learning outcomes between using just item vectors and using the concatenation of user and item vectors.

This experiment used a latent dimensionality of 250 and a hidden layer dimensionality of 500. These hyperparameters reflect typical settings from the original AutoRec paper [Sedhain et al., 2015]. L2 regularization was set at 200.

Figure 5.11: AutoRec: comparing item learning vs user+item learning

Unfortunately the user+item version is not able to generalize well on the dataset. Nevertheless, it is interesting to see how the user+item version overfits more than the item version. This indicates that using user+item concatenation datapoints might lead to better performance on the test set if a suitable regularization method is found.

One disadvantage of using user+item was the much longer training time, likely because of the increase in datapoint dimensionality.

5.9.2 User+Item concatenation on VaeRec

Similar comparison experiments have been performed on VaeRec, with different model hyperparameters. Specifically, these experiments differ by having used a much lower latent dimensionality (5), which might have regularizing effects, as well as L2 regularization set at 100 and hidden layer dimensionality set at 1000. Moreover, the learning rate annealing parameter T has been set to 10 and soft free bits have been employed with λ = 2 ∗ K = 10.

Figure 5.12: VaeRec: comparing item learning vs user+item and user learning

Similarly to the previous AutoRec experiment, it can be observed how user+item overfits and performs poorly on the testing set.

Interestingly, the 'user' version of VaeRec performs better than the baseline 'user' variant of AutoRec as reported in their paper.

Table 5.7: VaeRec: performance on the test set of item learning vs user+item and user learning, compared to the reported AutoRec outcomes [Sedhain et al., 2015]

                     VaeRec training   VaeRec testing   AutoRec testing
item, λ = 10         0.8240            0.8599           0.831
user, λ = 10         0.8262            0.8598           0.874
user+item, λ = 10    0.8241            1.0893           N/A
user+item, λ = 1     0.9813            0.9930

In these experiments, using λ = 1 very soon leads to a NaN (Not A Number) interruption. This led to the observation that there exists a lower bound to the KL divergence, in this case about 2.46, which causes the annealing coefficient to diverge towards infinity.

5.10 Experiments with regularization techniques

5.10.1 Dropout layer on the input of an AutoRec model

Denoising autoencoders [Vincent et al., 2010b] improve the quality of the representations by forcing the neural network to be resilient: noise is added to the input, while the original datapoint is used in the objective function.

We tried a similar mechanism on our AutoRec re-implementation by adding a Dropout [Srivastava et al., 2014] layer with parameter p = 0.1 on the input.

The following plot shows that an improvement in generalization is indeed obtained:


Figure 5.13: Dropout layer on the input of an AutoRec model

5.10.2 Dropout layer on the input of a deep VaeRec model

Additional experiments to verify the effectiveness of adding a dropout layer to the input have been performed on a deeper model, as described in section 5.5.1. This base model has been set with "soft free bits" λ = 1 ∗ K, hence slightly under-regularized.

Figure 5.14: Dropout layer on the input of a deep VaeRec model

Table 5.8: Best results for deep VaeRec under different p settings of a dropout layer applied on the input

p      best training RMSE   best testing RMSE
0      0.5102               0.8536
0.5    0.6413               0.8443
0.8    0.8469               0.8861

The use of a dropout layer on the input to enforce a denoising-like behavior seems to be beneficial. While p = 0 results in overfitting and p = 0.8 results in underfitting, the VaeRec is able to achieve an extremely good performance of 0.8443 on the testing set with p = 0.5.


5.10.3 Tradeoff between KL divergence and weights regularization

Both weight decay and the KL divergence are regularizers that enable the model to achieve generalization. The contributions of these two terms compound, so that their coefficients need to be tuned in order to avoid over-regularization.

As an example, this can be seen in the following plot, where a VaeRec model using L2 regularization with coefficient 200 is tested with and without the KL term:

Figure 5.15: VaeRec with and without KL divergence, minibatch size set at 1

For additional comparison, here are the results obtained by using a minibatch update schema with size set at 64:

Figure 5.16: VaeRec with and without KL divergence, minibatch size set at 64

Using the minibatch shows considerable improvements for both variants (with and without KL divergence). It is interesting how drastically the model with KL improves by using minibatch learning. This is probably caused by the KL regularizer being very noisy with individual samples, causing over-regularization, as is typical with SGD (without minibatch) schemas. By using a minibatch the KL regularizer becomes less noisy, as it is smoothed by averaging.


Chapter 6

Conclusion

VAERec models introduce a straightforward extension to the AutoRec models. Probabilistic information on latent variables representing users or items was examined by [van Baalen, 2016].

AutoRec has been chosen as the base model for its capability to reconstruct an entire sparse vector of ratings belonging to a user, or to an item, by estimating all its missing values during a single query.

Our work introduces additional explorations through the use of variational autoencoders specifically made to handle sparse input, implemented via parameter masking.

Moreover, posterior approximation improvements have been added, in the form of planar flows [Rezende and Mohamed, 2015] and a novel use of the more powerful invertible transformation introduced by RealNVP [Dinh et al., 2016]. Comparisons between different hyperparameter settings have been illustrated.

Experiments show that adding transformations to the posterior approximations leads to a higher ELBO, hence the posterior approximation is necessarily improved. Additional experiments show how RealNVP transformations improve the generalization capability of the model.

Overfitting and underfitting were some of the major obstacles in the attempt to obtain models with good generalization capabilities. In VAE models these phenomena can be tackled by altering the coefficient of the KL regularizer term. An adaptive method, named Soft Free Bits [Chen et al., 2016], has been employed in order to dynamically alter the KL coefficient according to the value of the KL term.

The novel use of datapoints composed of a concatenation of user and item vectors indicated some promising prospects for AutoRec-like models. This input variant leads to a better fit than item or user-based models under identical circumstances. The drawback of overfitting seems to be overcome by regularization techniques: in the case of the VaeRec, by careful handling of the coefficient of the KL divergence, with techniques such as soft free bits.

6.1 Future work

The field of representation learning and autoencoders is currently the object of growing interest from researchers. Specifically, methods to improve the posterior approximations of VAEs are being researched and could be applied to the base model VAERec. For instance, Autoregressive Flow [Kingma et al., 2016] is of particular interest.

Specifically to this work, the RealNVP transformation could be further improved by changing the function a(·) into a nonlinear function instead of an affine transformation.

The decoder, or generator network, has been chosen as having a spherical Gaussian form with identity covariance matrix I on pθ (x|z). A more informative model with an arbitrarily-valued diagonal covariance matrix could be employed in order to give a measure of uncertainty on the estimated ratings.

With an eye on different models, Generative Adversarial Networks [Goodfellow et al., 2014] seem to be well suited for collaborative filtering. The Generator-Discriminator networks might help in obtaining predicted ratings that are as "real" as they could possibly be.

Hyperparameter search for VaeRec models needs to be further investigated. Specifically, computation-intensive improvements such as an increase in depth and width of the networks should be looked into. Alternative settings of the λ parameter for soft free bits need to be tested in order to find a proper optimum. Libraries other than Theano [Al-Rfou et al., 2016] should also be tried, as Theano's development is currently discontinued, in favor of TensorFlow [Abadi et al., 2015] or PyTorch [Paszke et al., 2017], which might have better support for sparse tensors.

Appendix A

Derivations

A.1 Density of a transformed multivariate random variable

A random variable z_0 is transformed via an invertible transformation $T : \mathbb{R}^n \to \mathbb{R}^n$:

$$ z_1 = T(z_0) \tag{A.1} $$

Then, it is possible to express the pdf of z_1 by using the distribution of the original variable $p_{z_0}(z_0)$ and the invertible transformation T.

This can be demonstrated by making use of the cdf of z_1, denoted $F_{z_1}(z_1)$, and showing its relationship with the cdf of z_0 and the inverse of T [Watkins, 2009]:

$$ \begin{aligned} F_{z_1}(\gamma) &= P(z_1 \le \gamma) \\ &= P(T(z_0) \le \gamma) \\ &= P\left(z_0 \le T^{-1}(\gamma)\right) \\ &= F_{z_0}\left(T^{-1}(\gamma)\right) \end{aligned} \tag{A.2} $$

The integral is expanded and integration by substitution is used to highlight the role of the pdf of z_0 and the determinant of the Jacobian matrix of the inverse of T:

$$ \begin{aligned} F_{z_1}(z_1) = F_{z_0}\left(T^{-1}(z_1)\right) &= \int^{T^{-1}(z_1)} p_{z_0}(z_0) \, dz_0 \\ &= \int^{z_1} p_{z_0}\left(T^{-1}(z_1)\right) \cdot \left| \det \frac{\partial T^{-1}(z_1)}{\partial z_1} \right| dz_1 \end{aligned} \tag{A.3} $$

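Next, the derivative with respect to z_1 is applied to the form of the cdf of z_1 just obtained, in order to get a convenient formula for $p_{z_1}(z_1)$:

$$ \begin{aligned} p_{z_1}(z_1) &= \frac{\partial}{\partial z_{11} \cdots \partial z_{1n}} F_{z_1}(z_1) \\ &= \frac{\partial}{\partial z_{11} \cdots \partial z_{1n}} \int^{z_1} p_{z_0}\left(T^{-1}(z_1)\right) \cdot \left| \det \frac{\partial T^{-1}(z_1)}{\partial z_1} \right| dz_1 \\ &= p_{z_0}\left(T^{-1}(z_1)\right) \cdot \left| \det \frac{\partial T^{-1}(z_1)}{\partial z_1} \right| \end{aligned} \tag{A.4} $$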

The matrix inverse of the Jacobian matrix of an invertible function, such as T, is the Jacobian matrix of $T^{-1}$ [inv, 2017]:

$$ \left[ \frac{\partial T(z_0)}{\partial z_0} \right]^{-1} = \frac{\partial T^{-1}(z_1)}{\partial z_1} \tag{A.5} $$

The well-known property of the determinant of an inverse matrix, det(A^{-1}) = 1 / det(A), can be used in order to calculate only the Jacobian of T instead of the Jacobian of $T^{-1}$. This leads to:

$$ p_{z_1}(z_1) = p_{z_0}\left(T^{-1}(z_1)\right) \cdot \left| \det \frac{\partial T(z_0)}{\partial z_0} \right|^{-1} \tag{A.6} $$

This form can eventually be expressed as a function of z_0:

$$ p_{z_1}(z_1) = p_{z_0}(z_0) \cdot \left| \det \frac{\partial T(z_0)}{\partial z_0} \right|^{-1} \tag{A.7} $$
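The change-of-variables formula can be verified numerically in a case where the transformed density is known in closed form. The sketch below is an illustration only: it assumes a standard Gaussian z_0 and an affine map T(z_0) = A z_0 + b, for which z_1 is Gaussian with mean b and covariance A Aᵀ. The matrix, the vector and the helper `gaussian_pdf` are hypothetical choices.

```python
import numpy as np


def gaussian_pdf(x, mean, cov):
    """Density of a multivariate Gaussian evaluated at the point x."""
    d = x - mean
    k = len(x)
    norm_const = np.sqrt((2.0 * np.pi) ** k * np.linalg.det(cov))
    return np.exp(-0.5 * d @ np.linalg.solve(cov, d)) / norm_const


A = np.array([[2.0, 0.5],
              [0.0, 1.5]])          # Jacobian of the affine map T
b = np.array([1.0, -1.0])

z0 = np.array([0.3, -0.7])
z1 = A @ z0 + b                     # z1 = T(z0)

# Density of z1 through the formula: p(z0) divided by |det of the Jacobian of T|.
p_z0 = gaussian_pdf(z0, np.zeros(2), np.eye(2))
p_z1_formula = p_z0 / abs(np.linalg.det(A))

# Closed form: z1 is Gaussian with mean b and covariance A A^T.
p_z1_exact = gaussian_pdf(z1, b, A @ A.T)

print(p_z1_formula, p_z1_exact)
```

The two values agree up to floating point error, which is the property normalizing flows rely on when tracking the density of a transformed sample.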

A.2 Variational Expectation Lower Bound

Given a dataset X = x(1) . . .x(N) and the respective unobserved latentvariable realizations Z = z(1) . . . z(N) [Fox and Roberts, 2012] suggest to

A.2. VARIATIONAL EXPECTATION LOWER BOUND 59

derive a quantity to be maximized by considering the minimization of theKullback-Leibler divergence KL [qφ (Z|X) ||pθ (Z|X)] by decomposing it asfollows:

KL [qφ (Z|X) ||pθ (Z|X)] = Eqφ(Z|X)

[log

qφ (Z|X)

pθ (Z|X)

]= Eqφ(Z|X) [log qφ (Z|X)− log pθ (X|Z)− log pθ (Z) + log pθ (X)]

(A.8)As the integrand inside the expectation Eqφ(Z|X) [log pθ (X)] is a constant

w.r.t. Z, then Eqφ(Z|X) [log pθ (X)] = log pθ (X). Hence, the previous expres-sion can be rewritten as:

log pθ (X) = KL [qφ (Z|X) ||pθ (Z|X)] + L (X) (A.9)

Where we made use of this shorthand:

L (X) = Eqφ(Z|X) [− log qφ (Z|X) + log pθ (X|Z) + log pθ (Z)] (A.10)

As log pθ (X) is a constant w.r.t. the variational parameters φ, andKL [qφ (Z|X) ||pθ (X|Z)] is always non-negative, then the quantity L (X) canbe interpreted as a lower-bound to log pθ (X) whose maximization impliesthe minimization of KL [qφ (Z|X) ||pθ (X|Z)].

The lower bound $\mathcal{L}(X)$ can also be expressed, as in [Kingma and Welling, 2013], by grouping some terms into a negative Kullback-Leibler divergence:

$$
\mathcal{L}(X) = -\mathrm{KL}[q_\phi(Z|X) \,\|\, p_\theta(Z)] + \mathbb{E}_{q_\phi(Z|X)}[\log p_\theta(X|Z)] \tag{A.11}
$$

The Kullback-Leibler divergence $\mathrm{KL}[q_\phi(Z|X) \,\|\, p_\theta(Z)]$ can also be expressed as a cross-entropy term minus an entropy term, hence:

$$
\mathcal{L}(X) = -\mathrm{H}[q_\phi(Z|X), p_\theta(Z)] + \mathrm{H}[q_\phi(Z|X)] + \mathbb{E}_{q_\phi(Z|X)}[\log p_\theta(X|Z)] \tag{A.12}
$$

More commonly, the lower bound is rearranged by making use of the entropy of the variational approximation and the expectation of the joint log-probability:

$$
\mathcal{L}(X) = \mathrm{H}[q_\phi(Z|X)] + \mathbb{E}_{q_\phi(Z|X)}[\log p_\theta(X, Z)] \tag{A.13}
$$

This is described in [Hoffman and Johnson, 2016] as "average negative energy plus entropy".
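The identities (A.9) to (A.13) can be checked numerically on a toy model. The sketch below assumes a single observation and a discrete latent variable with three states, so that all expectations are finite sums; the probability tables and the helper `elbo_forms` are made up for illustration.

```python
import math


def elbo_forms(prior, lik, q):
    """Return log p(x), KL(q || posterior) and the ELBO in forms (A.10), (A.11), (A.13).

    prior[k] = p(z=k), lik[k] = p(x | z=k) for one fixed x, q[k] = q(z=k | x).
    """
    evidence = sum(p * l for p, l in zip(prior, lik))
    posterior = [p * l / evidence for p, l in zip(prior, lik)]

    def expect(f):
        # Expectation under q of a function of the latent index k.
        return sum(q[k] * f(k) for k in range(len(q)))

    kl_posterior = expect(lambda k: math.log(q[k] / posterior[k]))
    kl_prior = expect(lambda k: math.log(q[k] / prior[k]))
    entropy = -expect(lambda k: math.log(q[k]))
    reconstruction = expect(lambda k: math.log(lik[k]))

    elbo_a10 = expect(lambda k: -math.log(q[k]) + math.log(lik[k]) + math.log(prior[k]))
    elbo_a11 = -kl_prior + reconstruction
    elbo_a13 = entropy + expect(lambda k: math.log(lik[k] * prior[k]))
    return math.log(evidence), kl_posterior, elbo_a10, elbo_a11, elbo_a13


log_px, kl, l10, l11, l13 = elbo_forms([0.5, 0.3, 0.2], [0.1, 0.4, 0.7], [0.2, 0.3, 0.5])
```

Here `log_px` should equal `kl + l10`, and the three ELBO forms should coincide up to floating-point error.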


A.3 ELBO as sum of terms dependent on individual datapoints

As used by [Kingma and Welling, 2013], the ELBO can be decomposed into a sum of terms, each dependent only on an individual datapoint. This follows from the assumption that each datapoint, generated by its own latent variable realization, is independent of both the other datapoints and their latent variable realizations:

$$
p_\theta(X|Z) = \prod_{i=1}^{N} p_\theta\left(x^{(i)} | z^{(i)}\right) \tag{A.14}
$$

The same assumption is made for the prior distribution on the latent variables:

$$
p_\theta(Z) = \prod_{i=1}^{N} p_\theta\left(z^{(i)}\right) \tag{A.15}
$$

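Hence this is the form of the joint probability:

$$
p_\theta(X, Z) = p_\theta(X|Z)\, p_\theta(Z) = \prod_{i=1}^{N} p_\theta\left(x^{(i)} | z^{(i)}\right) p_\theta\left(z^{(i)}\right) = \prod_{i=1}^{N} p_\theta\left(x^{(i)}, z^{(i)}\right) \tag{A.16}
$$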

For convenience, the form (A.12) is chosen for $\mathcal{L}(X)$.

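It is possible to make use of information-theoretical properties [Bergstrom, 2008], where $Z^{-(1)}$ and $X^{-(1)}$ denote all realizations except the first:

$$
\begin{aligned}
\mathrm{H}[q_\phi(Z|X)] &= \mathrm{H}\left[q_\phi(z^{(1)}|x^{(1)})\right] + \mathrm{H}\left[q_\phi(Z^{-(1)}|X^{-(1)}) \,\middle|\, q_\phi(z^{(1)}|x^{(1)})\right] && \text{chain rule for joint entropy} \\
&= \mathrm{H}\left[q_\phi(z^{(1)}|x^{(1)})\right] + \mathrm{H}\left[q_\phi(Z^{-(1)}|X^{-(1)})\right] && \text{independence of datapoints} \\
&= \sum_i \mathrm{H}\left[q_\phi(z^{(i)}|x^{(i)})\right] && \text{recursion}
\end{aligned} \tag{A.17}
$$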

Similarly, for H [qφ (Z|X) , pθ (Z)]:


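$$
\begin{aligned}
\mathrm{H}[q_\phi(Z|X), p_\theta(Z)] &= \mathrm{H}\left[q_\phi(z^{(1)}|x^{(1)}), p_\theta(z^{(1)})\right] \\
&\quad + \mathrm{H}\left[q_\phi(Z^{-(1)}|X^{-(1)}), p_\theta(Z^{-(1)}) \,\middle|\, q_\phi(z^{(1)}|x^{(1)}), p_\theta(z^{(1)})\right] \\
&= \mathrm{H}\left[q_\phi(z^{(1)}|x^{(1)}), p_\theta(z^{(1)})\right] + \mathrm{H}\left[q_\phi(Z^{-(1)}|X^{-(1)}), p_\theta(Z^{-(1)})\right] \\
&= \sum_i \mathrm{H}\left[q_\phi(z^{(i)}|x^{(i)}), p_\theta(z^{(i)})\right]
\end{aligned} \tag{A.18}
$$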

For the third term Eqφ(Z|X) [log pθ (X|Z)]:

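$$
\begin{aligned}
\mathbb{E}_{q_\phi(Z|X)}[\log p_\theta(X|Z)] &= \int_{z^{(1)}} \cdots \int_{z^{(N)}} \prod_{j=1}^{N} q_\phi\left(z^{(j)}|x^{(j)}\right) \sum_{i=1}^{N} \log p_\theta\left(x^{(i)}|z^{(i)}\right) dz^{(N)} \cdots dz^{(1)} \\
&= \int_{z^{(1)}} q_\phi\left(z^{(1)}|x^{(1)}\right) \cdots \int_{z^{(N)}} q_\phi\left(z^{(N)}|x^{(N)}\right) \sum_{i=1}^{N} \log p_\theta\left(x^{(i)}|z^{(i)}\right) dz^{(N)} \cdots dz^{(1)} \\
&= \sum_{i=1}^{N} \int_{z^{(i)}} q_\phi\left(z^{(i)}|x^{(i)}\right) \log p_\theta\left(x^{(i)}|z^{(i)}\right) dz^{(i)} \\
&= \sum_{i=1}^{N} \mathbb{E}_{q_\phi(z^{(i)}|x^{(i)})}\left[\log p_\theta\left(x^{(i)}|z^{(i)}\right)\right]
\end{aligned} \tag{A.19}
$$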
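The step from the joint expectation to a sum of per-datapoint expectations in (A.19) can be illustrated by brute force. The sketch below assumes discrete latent variables so that the integrals become sums over all joint configurations; the function names and the toy tables are hypothetical.

```python
import itertools
import math


def joint_expectation(q, loglik):
    """E over the factorized q(Z|X) of sum_i log p(x_i | z_i), by enumeration.

    q[i][k] = q(z_i = k | x_i); loglik[i][k] = log p(x_i | z_i = k).
    """
    total = 0.0
    for zs in itertools.product(*(range(len(qi)) for qi in q)):
        # Weight of one joint configuration under the factorized posterior.
        weight = math.prod(q[i][k] for i, k in enumerate(zs))
        total += weight * sum(loglik[i][k] for i, k in enumerate(zs))
    return total


def per_datapoint_sum(q, loglik):
    """Last line of (A.19): a sum of per-datapoint expectations."""
    return sum(sum(qk * ll for qk, ll in zip(qi, lli)) for qi, lli in zip(q, loglik))
```

The enumeration costs grow exponentially with the number of datapoints, whereas the per-datapoint sum is linear, which is what makes minibatch training on the decomposed objective practical.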

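By plugging these forms into the ELBO (A.12), it can be written as a sum of individual objective terms, each of which depends on only a single datapoint:

$$
\mathcal{L}(X) = \sum_{i=1}^{N} \left( -\mathrm{H}\left[q_\phi(z^{(i)}|x^{(i)}), p_\theta(z^{(i)})\right] + \mathrm{H}\left[q_\phi(z^{(i)}|x^{(i)})\right] + \mathbb{E}_{q_\phi(z^{(i)}|x^{(i)})}\left[\log p_\theta\left(x^{(i)}|z^{(i)}\right)\right] \right) \tag{A.20}
$$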

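It is noteworthy that $\mathcal{L}(X)$ can be expressed through the following mutual information term:

$$
\begin{aligned}
I(z; x) &= \mathbb{E}_x\left[\mathrm{KL}[q_\phi(z|x) \,\|\, p_\theta(z)]\right] \\
&\approx \frac{1}{N} \sum_{i=1}^{N} \mathrm{KL}\left[q_\phi(z^{(i)}|x^{(i)}) \,\|\, p_\theta(z^{(i)})\right]
\end{aligned} \tag{A.21}
$$

Strictly, this average KL to the prior upper-bounds the mutual information between $x$ and $z$ under $q_\phi$; the two coincide when the aggregate posterior matches the prior.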

Hence $\mathcal{L}(X)$ can be expressed as:

$$
\mathcal{L}(X) = -N \cdot I(z; x) + \sum_{i=1}^{N} \mathbb{E}_{q_\phi(z^{(i)}|x^{(i)})}\left[\log p_\theta\left(x^{(i)}|z^{(i)}\right)\right] \tag{A.22}
$$

This is reminiscent of the "average term-by-term reconstruction minus KL to prior" interpretation of the ELBO formulated in [Hoffman and Johnson, 2016].

A.4 Rearranging the ELBO

The term $\mathrm{KL}[q_\phi(z|x^{(i)}) \,\|\, p_\theta(z)]$ has an analytical solution within the original VAE framework with a Gaussian approximation of the posterior, hence it does not require Monte Carlo sampling.

Unfortunately, when using Normalizing Flows the KL cannot be determined analytically, so it has to be estimated by Monte Carlo sampling. The lower bound $\mathcal{L}(x)$ can be interpreted as a negative free energy $-\mathcal{F}(x)$, so that the free energy $\mathcal{F}(x)$ is the quantity to be minimized.

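It is useful to reduce the free energy to its "atomic" probability components:

$$
\begin{aligned}
\mathcal{F}(x^{(i)}) &= -\mathcal{L}(x^{(i)}) \\
&= -\mathbb{E}_{q_\phi(z|x^{(i)})}\left[\log p_\theta(x^{(i)}, z) - \log q_\phi(z|x^{(i)})\right] \\
&= \mathbb{E}_{q_\phi(z|x^{(i)})}\left[-\log p_\theta(x^{(i)}|z) - \log p_\theta(z) + \log q_\phi(z|x^{(i)})\right]
\end{aligned} \tag{A.23}
$$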

The multivariate random variable $z$ can be interpreted as the result of a transformation $z = T(z_0)$ of an initial multivariate random variable $z_0$ which has a simple distribution, such as a multivariate Gaussian with diagonal covariance matrix.

By the law of the unconscious statistician (LOTUS) [Ringner, 2009], the energy can be written with expectations over the simpler distribution of $z_0$:

$$
\mathcal{F}(x^{(i)}) = \mathbb{E}_{q_0(z_0|x)}\left[-\log p_\theta\left(x^{(i)}|T(z_0)\right) - \log p_\theta(T(z_0))\right] + \mathbb{E}_{q_\phi(z|x^{(i)})}\left[\log q_\phi(z|x^{(i)})\right] \tag{A.24}
$$


The last term is clearly the negative entropy of $q_\phi(z|x^{(i)})$:

$$
\mathrm{H}_{q_\phi}\left[z|x^{(i)}\right] = -\mathbb{E}_{q_\phi(z|x^{(i)})}\left[\log q_\phi(z|x^{(i)})\right] \tag{A.25}
$$

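At this point the previous result on the transformed density (A.6) can be used in order to express this term as a function of $q_0(z_0|x)$:

$$
\begin{aligned}
\log q_\phi(z|x^{(i)}) &= \log q_0\left(T^{-1}(z)\right) + \log\left( \left| \det \frac{\partial T(z_0)}{\partial z_0} \right|^{-1} \right) \\
&= \log q_0(z_0) - \log\left( \left| \det \frac{\partial T(z_0)}{\partial z_0} \right| \right)
\end{aligned} \tag{A.26}
$$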

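Hence, this entropy term becomes:

$$
-\mathrm{H}_{q_\phi}\left[z|x^{(i)}\right] = \mathbb{E}_{q_\phi(z|x^{(i)})}\left[ \log q_0\left(T^{-1}(z)\right) - \log\left( \left| \det \frac{\partial T\left(T^{-1}(z)\right)}{\partial T^{-1}(z)} \right| \right) \right] \tag{A.27}
$$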

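By applying the law of the unconscious statistician to $-\mathrm{H}_{q_\phi}[z|x^{(i)}]$, and considering the expectation over $q_0$ instead of over $q_\phi$, a relationship between the entropies of the two distributions emerges:

$$
\begin{aligned}
-\mathrm{H}_{q_\phi}\left[z|x^{(i)}\right] &= \mathbb{E}_{q_0(z_0|x)}\left[\log q_0\left(T^{-1}(T(z_0))\right)\right] - \mathbb{E}_{q_0(z_0|x)}\left[\log\left( \left| \det \frac{\partial T(z_0)}{\partial z_0} \right| \right)\right] \\
&= \mathbb{E}_{q_0(z_0|x)}\left[\log q_0(z_0)\right] - \mathbb{E}_{q_0(z_0|x)}\left[\log\left( \left| \det \frac{\partial T(z_0)}{\partial z_0} \right| \right)\right] \\
&= -\mathrm{H}_{q_0}[z_0|x] - \mathbb{E}_{q_0(z_0|x)}\left[\log\left( \left| \det \frac{\partial T(z_0)}{\partial z_0} \right| \right)\right]
\end{aligned} \tag{A.28}
$$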
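The entropy relationship (A.28) can be illustrated with a case where both entropies are known in closed form. The sketch below assumes a one-dimensional Gaussian $q_0$ and an affine map $T(z_0) = a z_0 + b$, for which $\log|\det \partial T / \partial z_0| = \log|a|$; the function names are illustrative only.

```python
import math
import random


def gaussian_entropy(std):
    """Differential entropy of a one-dimensional Normal with the given std."""
    return 0.5 * math.log(2.0 * math.pi * math.e * std ** 2)


def flow_entropy_mc(mean0, std0, a, n_samples=20000, seed=0):
    """Monte Carlo estimate of H[z] for z = a * z0 + b via (A.28).

    Only samples of z0 ~ q0 = N(mean0, std0^2) and log|a| are needed;
    the offset b does not affect the entropy.
    """
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_samples):
        z0 = rng.gauss(mean0, std0)
        log_q0 = -0.5 * math.log(2.0 * math.pi * std0 ** 2) - 0.5 * ((z0 - mean0) / std0) ** 2
        total += log_q0 - math.log(abs(a))   # log q_phi(z) as in (A.26)
    return -total / n_samples
```

Since $z$ is Gaussian with standard deviation $|a|$ times that of $z_0$, `gaussian_entropy(abs(a) * std0)` equals `gaussian_entropy(std0) + math.log(abs(a))`, and the sampled estimate should approach the same value.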

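A.4.1 A closer look at the terms of the free energy

At this point, it is possible to write $\mathcal{F}(x^{(i)})$ in the following form:

$$
\begin{aligned}
\mathcal{F}(x^{(i)}) = {} & -\mathbb{E}_{q_0(z_0|x)}\left[\log p_\theta\left(x^{(i)}|T(z_0)\right)\right] \\
& -\mathbb{E}_{q_0(z_0|x)}\left[\log p_\theta(T(z_0))\right] \\
& -\mathrm{H}_{q_0}[z_0|x] \\
& -\mathbb{E}_{q_0(z_0|x)}\left[\log\left( \left| \det \frac{\partial T(z_0)}{\partial z_0} \right| \right)\right]
\end{aligned} \tag{A.29}
$$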


Every term can be interpreted and given meaning:

$\mathbb{E}_{q_0(z_0|x)}\left[\log p_\theta\left(x^{(i)}|T(z_0)\right)\right]$: the likelihood term

This term is the expected log-probability of the datapoint $x^{(i)}$ under the distribution conditioned on the sampled latent code $T(z_0)$. In Bayesian terms, this is the log-likelihood of the datapoint given the latent code and the model parameters.

When the model is seen as an autoencoder, this term can be interpreted as a (negative) reconstruction error computed by the decoder, also called the generative network.

$\mathbb{E}_{q_0(z_0|x)}\left[\log p_\theta(T(z_0))\right]$: the prior term on the transformed latent code

This term is the log-probability of the latent code $T(z_0)$ without considering the current datapoint $x^{(i)}$. In this model this quantity is evaluated under the prior distribution of the latent codes, whose parameters are pre-defined and not subject to learning. As usual with a fixed prior distribution, such a term in the objective function can be interpreted as a regularizer of the transformed latent code.

$\mathrm{H}_{q_0}[z_0|x]$: the entropy term on the initial code

This term is the entropy of the distribution of the sample $z_0$, i.e. the original diagonal-covariance Gaussian posterior approximation. The parameters of this distribution are generated by the inference network.

The sample $z_0$ is then transformed into the actual latent code via the transformation $T$.

As that distribution is simple, such as a multivariate Gaussian with diagonal covariance matrix, this term is known analytically.

Minimization of the free energy $\mathcal{F}(x^{(i)})$ implies maximization of this entropy term. Hence the generated variational distribution of $z_0$ will tend not to concentrate probability mass in small regions of the latent space. For this reason this term can be interpreted as a form of regularization on $z_0$.


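$\mathbb{E}_{q_0(z_0|x)}\left[\log\left( \left| \det \frac{\partial T(z_0)}{\partial z_0} \right| \right)\right]$: the transformation term

This term does not contain probability densities and only makes use of the transformation $T$. Since it enters (A.29) with a negative sign, minimizing the free energy favours transformations that expand volume around $z_0$, which increases the entropy of the transformed distribution.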

A.5 Jacobian of a Planar Transformation

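The Jacobian $\frac{\partial t(z)}{\partial z}$ of the planar transformation $t(z) = z + u\, h(w^\top z + b)$ can be found as follows:

$$
\begin{aligned}
\frac{\partial z}{\partial z} &= I \\
\frac{\partial (w^\top z + b)}{\partial z} &= w^\top && \text{the Jacobian of a scalar-valued function is a row vector} \\
\frac{\partial h(w^\top z + b)}{\partial z} &= h'(w^\top z + b)\, w^\top && \text{chain rule}
\end{aligned} \tag{A.30}
$$

Hence:

$$
\frac{\partial t(z)}{\partial z} = I + u\, h'(w^\top z + b)\, w^\top \tag{A.31}
$$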

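In order to derive the determinant of the Jacobian, it is possible to make use of the matrix determinant lemma, which states:

$$
\det(A + a b^\top) = (1 + b^\top A^{-1} a)(\det A) \tag{A.32}
$$

Considering $A = I$, $a = u$ and $b^\top = h'(w^\top z + b)\, w^\top$, this leads to:

$$
\begin{aligned}
\det \frac{\partial t(z)}{\partial z} &= \det\left(I + u\, h'(w^\top z + b)\, w^\top\right) \\
&= \left(1 + h'(w^\top z + b)\, w^\top I^{-1} u\right)(\det I) \\
&= 1 + h'(w^\top z + b)\, w^\top u
\end{aligned} \tag{A.33}
$$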
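The closed form (A.33) can be compared against a determinant computed directly from the explicit Jacobian (A.31). The sketch below works in three dimensions and assumes $h = \tanh$, which is a common choice for planar flows but is an assumption here; `det3`, `planar_jacobian` and `planar_det` are illustrative names.

```python
import math


def det3(m):
    """Determinant of a 3x3 matrix given as nested lists (cofactor expansion)."""
    return (m[0][0] * (m[1][1] * m[2][2] - m[1][2] * m[2][1])
            - m[0][1] * (m[1][0] * m[2][2] - m[1][2] * m[2][0])
            + m[0][2] * (m[1][0] * m[2][1] - m[1][1] * m[2][0]))


def planar_jacobian(z, u, w, b):
    """Explicit Jacobian I + u h'(w^T z + b) w^T of (A.31), assuming h = tanh."""
    a = sum(wi * zi for wi, zi in zip(w, z)) + b
    h_prime = 1.0 - math.tanh(a) ** 2
    n = len(z)
    return [[(1.0 if i == j else 0.0) + u[i] * h_prime * w[j] for j in range(n)]
            for i in range(n)]


def planar_det(z, u, w, b):
    """Closed-form determinant 1 + h'(w^T z + b) w^T u of (A.33), assuming h = tanh."""
    a = sum(wi * zi for wi, zi in zip(w, z)) + b
    return 1.0 + (1.0 - math.tanh(a) ** 2) * sum(wi * ui for wi, ui in zip(w, u))
```

The practical benefit of the lemma is cost: the closed form needs a single inner product, while a generic determinant of a $J \times J$ Jacobian is cubic in $J$.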

A.6 Energy of single planar transformation step

A.6.1 Model form

As a simple case, let us derive the free-energy objective function by using the following model:

- The transformation $T$ is made of a single planar transformation step.
- The distribution $q_0(z_0|x)$ of the initial sample $z_0$ is assumed to be a simple Multivariate Normal with diagonal covariance matrix.
- The likelihood distribution $p_\theta(x^{(i)}|T(z_0))$ is also assumed to be a Multivariate Normal with diagonal covariance matrix.
- The prior distribution on the transformed latent code is a spherical Multivariate Normal with identity covariance matrix, centered on 0.

A.6.2 Derivation of the free-energy F(x(i))

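The derivation of a single-sample Monte Carlo approximation of $\mathcal{F}(x^{(i)})$ follows.

The likelihood term

$$
\begin{aligned}
\mathbb{E}_{q_0(z_0|x)}\left[\log p_\theta\left(x^{(i)}|T(z_0)\right)\right] &\approx -\log\left(\sqrt{(2\pi)^D |\Sigma_\theta|}\right) - \frac{1}{2}\left(x^{(i)} - \mu_\theta\right)^\top \Sigma_\theta^{-1} \left(x^{(i)} - \mu_\theta\right) \\
&= -\frac{D}{2}\log(2\pi) - \frac{1}{2}\sum_j \log \sigma_{\theta j} - \frac{1}{2}\sum_j \left(x^{(i)}_j - \mu_{\theta j}\right)^2 \cdot \frac{1}{\sigma_{\theta j}}
\end{aligned} \tag{A.34}
$$

Here $D$ is the dimensionality of $x^{(i)}$, $\mu_\theta$ and $\Sigma_\theta$ are the decoder outputs for the sampled code $T(z_0)$, and $\sigma_{\theta j}$ denotes the $j$-th diagonal entry of $\Sigma_\theta$, i.e. a variance. It can be seen how this term expresses a regression error.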

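The prior term on the transformed latent code. The derivation for this term is made easy by the parameters of the prior ($\mu = 0$, $\Sigma = I$), with $k$ the dimensionality of the latent code:

$$
\begin{aligned}
\mathbb{E}_{q_0(z_0|x)}\left[\log p_\theta(T(z_0))\right] &\approx -\log\left(\sqrt{(2\pi)^k |I|}\right) - \frac{1}{2}(T(z_0) - 0)^\top I^{-1} (T(z_0) - 0) \\
&= -\frac{k}{2}\log(2\pi) - \frac{1}{2}\|T(z_0)\|_2^2
\end{aligned} \tag{A.35}
$$

The form of this term highlights its role as a regularizer.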

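The entropy term on the initial code. The entropy of a $k$-dimensional Multivariate Normal distribution is $\frac{1}{2}\log\left((2\pi e)^k |\Sigma|\right)$, hence the entropy of the initial sample $z_0$ can be derived as:

$$
\mathrm{H}_{q_0}[z_0|x] = \frac{k}{2}\log(2\pi) + \frac{1}{2}k + \frac{1}{2}\sum_j \log \sigma_{\phi j} \tag{A.36}
$$

where $\sigma_{\phi j}$ denotes the $j$-th diagonal entry (a variance) of the covariance matrix of $q_0$.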

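The transformation term. This follows from the derivation (A.33):

$$
\mathbb{E}_{q_0(z_0|x)}\left[\log\left( \left| \det \frac{\partial T(z_0)}{\partial z_0} \right| \right)\right] \approx \log\left| 1 + h'(w^\top z_0 + b)\, w^\top u \right| \tag{A.37}
$$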

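Free-energy $\mathcal{F}(x^{(i)})$ implementation with the Planar Transformation

By summing all the terms and removing the constant terms, the approximation of the free-energy objective $\mathcal{F}(x^{(i)})$ takes this form:

$$
\begin{aligned}
\mathcal{F}(x^{(i)}) \approx {} & \frac{1}{2}\sum_j \log \sigma_{\theta j} + \frac{1}{2}\sum_j \left(x^{(i)}_j - \mu_{\theta j}\right)^2 \cdot \frac{1}{\sigma_{\theta j}} \\
& + \frac{1}{2}\|T(z_0)\|_2^2 \\
& - \frac{1}{2}\sum_j \log \sigma_{\phi j} \\
& - \log\left| 1 + h'(w^\top z_0 + b)\, w^\top u \right|
\end{aligned} \tag{A.38}
$$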
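A minimal sketch of how (A.38) could be evaluated for one sample is given below. It is not the thesis implementation: it assumes $h = \tanh$, treats `var_phi` and the decoder variances as diagonal variances, and takes a hypothetical `decoder` callable that maps a latent code to the mean and variances of $p_\theta(x|z)$. The sample `z0` is assumed to have been drawn beforehand from $q_0(z_0|x)$.

```python
import math


def planar(z, u, w, b):
    """One planar step t(z) = z + u * h(w^T z + b) with h = tanh (an assumption)."""
    a = sum(wi * zi for wi, zi in zip(w, z)) + b
    h, h_prime = math.tanh(a), 1.0 - math.tanh(a) ** 2
    z_new = [zi + ui * h for zi, ui in zip(z, u)]
    log_det = math.log(abs(1.0 + h_prime * sum(wi * ui for wi, ui in zip(w, u))))
    return z_new, log_det


def free_energy(x, z0, var_phi, u, w, b, decoder):
    """Single-sample estimate of F(x) following (A.38), constants dropped.

    var_phi holds the diagonal variances of q0; decoder(z) is a hypothetical
    callable returning the mean and diagonal variances of p(x | z).
    """
    z, log_det = planar(z0, u, w, b)
    mu_theta, var_theta = decoder(z)
    likelihood = (0.5 * sum(math.log(v) for v in var_theta)
                  + 0.5 * sum((xj - mj) ** 2 / v
                              for xj, mj, v in zip(x, mu_theta, var_theta)))
    prior = 0.5 * sum(zi ** 2 for zi in z)
    entropy = 0.5 * sum(math.log(v) for v in var_phi)
    return likelihood + prior - entropy - log_det
```

With a decoder that always returns mean 0 and variance 1, `x = [1.0]`, `z0 = [0.0, 0.0]`, unit `var_phi` and `u = [0.0, 0.0]`, only the squared-error term survives and the result is 0.5.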

A.7 Multiple nested transformation steps

A transformation $T$ might be composed of multiple nested but similar transformation steps, each with its own parameters:

$$
T(z_0) = t_K \circ t_{K-1} \circ \dots \circ t_1(z_0) \tag{A.39}
$$

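By using the chain rule, and writing $z_k = t_k(z_{k-1})$, the Jacobian of the transformation becomes a product of the Jacobians of the individual steps (with later steps multiplying from the left):

$$
\frac{\partial T(z_0)}{\partial z_0} = \prod_{k=1}^{K} \frac{\partial t_k(z_{k-1})}{\partial z_{k-1}} \tag{A.40}
$$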


The determinant of the product of square matrices is the product of their determinants, hence:

$$
\det \frac{\partial T(z_0)}{\partial z_0} = \prod_{k=1}^{K} \det \frac{\partial t_k(z_{k-1})}{\partial z_{k-1}} \tag{A.41}
$$

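The expectation term, as previously illustrated, can be approximated with a single Monte Carlo sample:

$$
\begin{aligned}
\mathbb{E}_{q_0(z_0|x)}\left[\log\left( \left| \det \frac{\partial T(z_0)}{\partial z_0} \right| \right)\right] &\approx \log\left| \prod_{k=1}^{K} \det \frac{\partial t_k(z_{k-1})}{\partial z_{k-1}} \right| \\
&= \sum_{k=1}^{K} \log\left| \det \frac{\partial t_k(z_{k-1})}{\partial z_{k-1}} \right|
\end{aligned} \tag{A.42}
$$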

A.7.1 Implementation with Planar Transformations

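With planar transformations, the Jacobian becomes:

$$
\frac{\partial T(z_0)}{\partial z_0} = \prod_{k=1}^{K} \left( I + u_k\, h'(w_k^\top z_{k-1} + b_k)\, w_k^\top \right) \tag{A.43}
$$

The single-sample approximation of the expectation becomes:

$$
\mathbb{E}_{q_0(z_0|x)}\left[\log\left( \left| \det \frac{\partial T(z_0)}{\partial z_0} \right| \right)\right] \approx \sum_{k=1}^{K} \log\left| 1 + h'(w_k^\top z_{k-1} + b_k)\, w_k^\top u_k \right| \tag{A.44}
$$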
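The accumulation in (A.44) can be sketched as a loop over the steps, where each log-determinant is evaluated at the input $z_{k-1}$ of its own step. The code below is an illustration under the assumption $h = \tanh$; `planar_flow` and the `(u, w, b)` tuple layout are hypothetical.

```python
import math


def planar_flow(z0, steps):
    """Apply K planar steps and accumulate the sum of log|det| terms of (A.44).

    steps is a list of (u, w, b) tuples; h = tanh is an assumption of this sketch.
    """
    z, sum_log_det = list(z0), 0.0
    for u, w, b in steps:
        a = sum(wi * zi for wi, zi in zip(w, z)) + b
        h_prime = 1.0 - math.tanh(a) ** 2
        # The log-determinant uses z_{k-1}, so it is computed before z is updated.
        sum_log_det += math.log(abs(1.0 + h_prime * sum(wi * ui for wi, ui in zip(w, u))))
        z = [zi + ui * math.tanh(a) for zi, ui in zip(z, u)]
    return z, sum_log_det
```

The returned sum can be checked against the log-determinant of a finite-difference Jacobian of the whole composition, which should agree up to numerical error.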

A.8 KL between diagonal-covariance Gaussian qφ (z|x) and spherical Gaussian prior

The Variational Autoencoder introduced by [Kingma and Welling, 2013] makes use of a posterior approximation that takes the form of a diagonal-covariance Normal distribution $q_\phi(z|x) = \mathcal{N}(z|\mu, \sigma)$, with mean $\mu$ and covariance $\Sigma = \mathrm{diag}(\sigma_1^2, \dots, \sigma_J^2)$; the prior distribution on the latent codes is a spherical Gaussian $p_\theta(z) = \mathcal{N}(z|0, I)$. The dimensionality of $z$ is $J$. A derivation follows which ends up in a convenient form:

$$
\mathrm{KL}[q_\phi(z|x) \,\|\, p_\theta(z)] = \int_z q_\phi(z|x) \left( \log q_\phi(z|x) - \log p_\theta(z) \right) dz \tag{A.45}
$$

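By considering the two terms separately, starting with the one involving the prior, one obtains:

$$
\begin{aligned}
\int_z q_\phi(z|x) \log p_\theta(z)\, dz &= \int_z q_\phi(z|x) \left[ \log\left( \left((2\pi)^J |I|\right)^{-\frac{1}{2}} \right) - \frac{1}{2}\left(z^\top I^{-1} z\right) \right] dz \\
&= -\frac{1}{2}\left[ J \log 2\pi + \mathbb{E}_{q_\phi(z|x)}\left[z^\top z\right] \right] \\
&= -\frac{1}{2}\left[ J \log 2\pi + \left( \mathrm{Tr}(\Sigma) + \mu^\top \mu \right) \right] \\
&= -\frac{1}{2}\left[ J \log 2\pi + \left( \sum_{j=1}^{J} \sigma_j^2 + \sum_{j=1}^{J} \mu_j^2 \right) \right]
\end{aligned} \tag{A.46}
$$

where the identity $\mathbb{E}_x\left[x^\top A x\right] = \mathrm{Tr}(A\Sigma) + m^\top A m$, for $x$ with mean $m$ and covariance $\Sigma$, from [Petersen and Pedersen, 2012] has been used.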

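The term involving the posterior approximation is:

$$
\begin{aligned}
\int_z q_\phi(z|x) \log q_\phi(z|x)\, dz &= -\frac{1}{2}\left[ J \log(2\pi) + \log(|\Sigma|) + \mathbb{E}_{q_\phi(z|x)}\left[(z - \mu)^\top \Sigma^{-1} (z - \mu)\right] \right] \\
&= -\frac{1}{2}\left[ J \log(2\pi) + \log(|\Sigma|) + (\mu - \mu)^\top \Sigma^{-1} (\mu - \mu) + \mathrm{Tr}\left(\Sigma^{-1}\Sigma\right) \right] \\
&= -\frac{1}{2}\left[ J \log(2\pi) + \sum_{j=1}^{J} \log \sigma_j^2 + J \right]
\end{aligned} \tag{A.47}
$$

where the identity $\mathbb{E}_x\left[(x - m')^\top A (x - m')\right] = \mathrm{Tr}(A\Sigma) + (m - m')^\top A (m - m')$ from [Petersen and Pedersen, 2012] has been used.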

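Putting the two terms together:

$$
\begin{aligned}
\mathrm{KL}[q_\phi(z|x) \,\|\, p_\theta(z)] &= \frac{1}{2}\left[ \sum_{j=1}^{J} \sigma_j^2 + \sum_{j=1}^{J} \mu_j^2 - \sum_{j=1}^{J} \log \sigma_j^2 - J \right] \\
&= \frac{1}{2}\sum_{j=1}^{J} \left[ \sigma_j^2 + \mu_j^2 - \log \sigma_j^2 - 1 \right]
\end{aligned} \tag{A.48}
$$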
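The closed form (A.48) translates directly into a short function. The sketch below assumes the posterior parameters are given as plain lists of means and variances; the function name is illustrative.

```python
import math


def kl_diag_gaussian_vs_standard(mu, var):
    """KL[N(mu, diag(var)) || N(0, I)] following (A.48); var holds the variances."""
    return 0.5 * sum(v + m ** 2 - math.log(v) - 1.0 for m, v in zip(mu, var))
```

The divergence vanishes when the posterior equals the prior (zero mean, unit variances), and the per-dimension summands add up, so the value for a vector is the sum of the values of its components.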

A.8.1 KL of diagonal covariance gaussians is a sum of the KL of the individual variables

A Gaussian pdf with diagonal covariance matrix can be decomposed into a product of the pdfs of individual one-dimensional independent Gaussian variables:

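$$
\begin{aligned}
\mathcal{N}(x|\mu, \Sigma) &= \left((2\pi)^J |\Sigma|\right)^{-\frac{1}{2}} \exp\left\{ -\frac{1}{2}(x - \mu)^\top \Sigma^{-1} (x - \mu) \right\} \\
&= \left[ \prod_{j=1}^{J} (2\pi\sigma_j^2)^{-\frac{1}{2}} \right] \exp\left\{ -\frac{1}{2} \sum_{j=1}^{J} \frac{(x_j - \mu_j)^2}{\sigma_j^2} \right\} \\
&= \prod_{j=1}^{J} (2\pi\sigma_j^2)^{-\frac{1}{2}} \exp\left\{ -\frac{1}{2} \frac{(x_j - \mu_j)^2}{\sigma_j^2} \right\} \\
&= \prod_{j=1}^{J} \mathcal{N}\left(x_j | \mu_j, \sigma_j^2\right)
\end{aligned} \tag{A.49}
$$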

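This fact can be used to obtain a simpler way to calculate the KL divergence:

$$
\begin{aligned}
\mathrm{KL}[q_\phi(z|x) \,\|\, p_\theta(z)] &= \int_z \left[ \prod_{j'=1}^{J} q(z_{j'}|x) \right] \left( \log\left[ \prod_{j=1}^{J} q(z_j|x) \right] - \log\left[ \prod_{j=1}^{J} p(z_j) \right] \right) dz \\
&= \int_z \left[ \prod_{j'=1}^{J} q(z_{j'}|x) \right] \sum_{j=1}^{J} \left[ \log q(z_j|x) - \log p(z_j) \right] dz \\
&= \sum_{j=1}^{J} \int_z \left[ \prod_{j'=1}^{J} q(z_{j'}|x) \right] \left[ \log q(z_j|x) - \log p(z_j) \right] dz \\
&= \sum_{j=1}^{J} \int_{z_j} \left[ \log q(z_j|x) - \log p(z_j) \right] q(z_j|x)\, dz_j \underbrace{\int_{z_{-j}} q(z_{-j}|x)\, dz_{-j}}_{=1} \\
&= \sum_{j=1}^{J} \mathrm{KL}[q(z_j|x) \,\|\, p(z_j)]
\end{aligned} \tag{A.50}
$$

where $z_{-j}$ are the latent variables excluding $z_j$.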

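A.8.2 KL of a one-dimensional Gaussian vs unit circle Gaussian

The Kullback-Leibler divergence of an arbitrary one-dimensional Gaussian $q(z) = \mathcal{N}(z|\mu, \sigma^2)$ versus a unit circle Gaussian, i.e. the standard Normal $p(z) = \mathcal{N}(z|0, 1)$, can be derived as follows:

$$
\begin{aligned}
\mathrm{KL}[q(z) \,\|\, p(z)] &= \mathbb{E}_q\left[\log q(z) - \log p(z)\right] \\
&= \frac{1}{2}\left[ -\log(2\pi\sigma^2) - \frac{1}{\sigma^2}\mathbb{E}_q\left[z^2\right] + \frac{2\mu}{\sigma^2}\mathbb{E}_q[z] - \frac{\mu^2}{\sigma^2} + \log 2\pi + \mathbb{E}_q\left[z^2\right] \right] \\
&= \frac{1}{2}\left[ -\log(2\pi\sigma^2) - \frac{1}{\sigma^2}(\mu^2 + \sigma^2) + \frac{2\mu^2}{\sigma^2} - \frac{\mu^2}{\sigma^2} + \log 2\pi + \mu^2 + \sigma^2 \right] \\
&= \frac{1}{2}\left[ -\log \sigma^2 - 1 + \mu^2 + \sigma^2 \right]
\end{aligned} \tag{A.51}
$$

This gives the same result as (A.48) by combining (A.50) with (A.51).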
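Since A.4 notes that with Normalizing Flows the KL has to be estimated by sampling, it can be instructive to compare the closed form (A.51) with a Monte Carlo estimate of the same expectation in the case where both are available. The sketch below is illustrative; the function names, sample size and seed are arbitrary choices.

```python
import math
import random


def kl_1d(mu, var):
    """Closed form (A.51) for KL[N(mu, var) || N(0, 1)]."""
    return 0.5 * (-math.log(var) - 1.0 + mu ** 2 + var)


def kl_1d_mc(mu, var, n_samples=50000, seed=0):
    """Monte Carlo estimate of E_q[log q(z) - log p(z)] with z ~ N(mu, var)."""
    rng = random.Random(seed)
    std = math.sqrt(var)
    total = 0.0
    for _ in range(n_samples):
        z = rng.gauss(mu, std)
        log_q = -0.5 * math.log(2.0 * math.pi * var) - 0.5 * (z - mu) ** 2 / var
        log_p = -0.5 * math.log(2.0 * math.pi) - 0.5 * z ** 2
        total += log_q - log_p
    return total / n_samples
```

The sampled estimate is noisy but unbiased, so it should approach `kl_1d(mu, var)` as the number of samples grows.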


Bibliography

[inv, 2017] (2017). Inverse function theorem. https://en.wikipedia.org/wiki/Inverse_function_theorem.

[Abadi et al., 2015] Abadi, M., Agarwal, A., Barham, P., Brevdo, E., Chen, Z., Citro, C., Corrado, G. S., Davis, A., Dean, J., Devin, M., Ghemawat, S., Goodfellow, I., Harp, A., Irving, G., Isard, M., Jia, Y., Jozefowicz, R., Kaiser, L., Kudlur, M., Levenberg, J., Mane, D., Monga, R., Moore, S., Murray, D., Olah, C., Schuster, M., Shlens, J., Steiner, B., Sutskever, I., Talwar, K., Tucker, P., Vanhoucke, V., Vasudevan, V., Viegas, F., Vinyals, O., Warden, P., Wattenberg, M., Wicke, M., Yu, Y., and Zheng, X. (2015). TensorFlow: Large-scale machine learning on heterogeneous systems. Software available from tensorflow.org.

[Al-Rfou et al., 2016] Al-Rfou, R., Alain, G., Almahairi, A., Angermuller, C., Bahdanau, D., Ballas, N., Bastien, F., Bayer, J., Belikov, A., Belopolsky, A., Bengio, Y., Bergeron, A., Bergstra, J., Bisson, V., Snyder, J. B., Bouchard, N., Boulanger-Lewandowski, N., Bouthillier, X., de Brebisson, A., Breuleux, O., Carrier, P. L., Cho, K., Chorowski, J., Christiano, P. F., Cooijmans, T., Cote, M., Cote, M., Courville, A. C., Dauphin, Y. N., Delalleau, O., Demouth, J., Desjardins, G., Dieleman, S., Dinh, L., Ducoffe, M., Dumoulin, V., Kahou, S. E., Erhan, D., Fan, Z., Firat, O., Germain, M., Glorot, X., Goodfellow, I. J., Graham, M., Gulcehre, C., Hamel, P., Harlouchet, I., Heng, J., Hidasi, B., Honari, S., Jain, A., Jean, S., Jia, K., Korobov, M., Kulkarni, V., Lamb, A., Lamblin, P., Larsen, E., Laurent, C., Lee, S., Lefrancois, S., Lemieux, S., Leonard, N., Lin, Z., Livezey, J. A., Lorenz, C., Lowin, J., Ma, Q., Manzagol, P., Mastropietro, O., McGibbon, R., Memisevic, R., van Merrienboer, B., Michalski, V., Mirza, M., Orlandi, A., Pal, C. J., Pascanu, R., Pezeshki, M., Raffel, C., Renshaw, D., Rocklin, M., Romero, A., Roth, M., Sadowski, P., Salvatier, J., Savard, F., Schluter,


J., Schulman, J., Schwartz, G., Serban, I. V., Serdyuk, D., Shabanian, S., Simon, E., Spieckermann, S., Subramanyam, S. R., Sygnowski, J., Tanguay, J., van Tulder, G., Turian, J. P., Urban, S., Vincent, P., Visin, F., de Vries, H., Warde-Farley, D., Webb, D. J., Willson, M., Xu, K., Xue, L., Yao, L., Zhang, S., and Zhang, Y. (2016). Theano: A python framework for fast computation of mathematical expressions. CoRR, abs/1605.02688.

[Bal et al., 2016] Bal, H., Epema, D., de Laat, C., van Nieuwpoort, R., Romein, J., Seinstra, F., Snoek, C., and Wijshoff, H. (2016). A medium-scale distributed system for computer science research: Infrastructure for the long term. Computer, 49(5):54–63.

[Bergstrom, 2008] Bergstrom, C. (2008). Lecture 3: Joint entropy, conditional entropy, relative entropy, and mutual information. http://octavia.zoology.washington.edu/teaching/429/lecturenotes/lecture3.pdf.

[Bobadilla et al., 2013] Bobadilla, J., Ortega, F., Hernando, A., and Gutiérrez, A. (2013). Recommender systems survey. Know.-Based Syst., 46:109–132.

[Bowman et al., 2015] Bowman, S. R., Vilnis, L., Vinyals, O., Dai, A. M., Jozefowicz, R., and Bengio, S. (2015). Generating sentences from a continuous space. CoRR, abs/1511.06349.

[Chen et al., 2016] Chen, X., Kingma, D. P., Salimans, T., Duan, Y., Dhariwal, P., Schulman, J., Sutskever, I., and Abbeel, P. (2016). Variational lossy autoencoder.

[Dinh et al., 2016] Dinh, L., Sohl-Dickstein, J., and Bengio, S. (2016). Density estimation using real NVP. CoRR, abs/1605.08803.

[Fox and Roberts, 2012] Fox, C. W. and Roberts, S. J. (2012). A tutorial on variational bayesian inference. Artificial Intelligence Review, 38(2):85–95.

[Goodfellow et al., 2014] Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., and Bengio, Y. (2014). Generative adversarial nets. In Ghahramani, Z., Welling, M., Cortes, C., Lawrence, N. D., and Weinberger, K. Q., editors, Advances in Neural Information Processing Systems 27, pages 2672–2680. Curran Associates, Inc.


[Harper and Konstan, 2015] Harper, F. M. and Konstan, J. A. (2015). The movielens datasets: History and context. ACM Trans. Interact. Intell. Syst., 5(4):19:1–19:19.

[Hoffman and Johnson, 2016] Hoffman, M. D. and Johnson, M. (2016). Elbo surgery: yet another way to carve up the variational evidence lower bound. NIPS 2016 Workshop on Advances in Approximate Bayesian Inference.

[Kingma, 2017] Kingma, D. P. (2017). Variational Inference and Deep Learning: A New Synthesis. PhD thesis, Universiteit Van Amsterdam.

[Kingma and Ba, 2014] Kingma, D. P. and Ba, J. (2014). Adam: A method for stochastic optimization. CoRR, abs/1412.6980.

[Kingma et al., 2016] Kingma, D. P., Salimans, T., and Welling, M. (2016). Improving variational inference with inverse autoregressive flow. CoRR, abs/1606.04934.

[Kingma and Welling, 2013] Kingma, D. P. and Welling, M. (2013). Auto-encoding variational bayes.

[Markou, 2017] Markou, N. (2017). Deep learning : Why you should use gradient clipping. http://nmarkou.blogspot.com/2017/07/deep-learning-why-you-should-use.html.

[Mhaskar et al., 2017] Mhaskar, H., Liao, Q., and Poggio, T. A. (2017). When and why are deep networks better than shallow ones? In Proceedings of the Thirty-First AAAI Conference on Artificial Intelligence, February 4-9, 2017, San Francisco, California, USA., pages 2343–2349.

[Orr, 1999] Orr, G. B. (1999). cs449. https://www.willamette.edu/~gorr/classes/cs449/momrate.html.

[Paszke et al., 2017] Paszke, A., Gross, S., Chintala, S., Chanan, G., Yang, E., DeVito, Z., Lin, Z., Desmaison, A., Antiga, L., and Lerer, A. (2017). Automatic differentiation in pytorch.

[Petersen and Pedersen, 2012] Petersen, K. B. and Pedersen, M. S. (2012). The matrix cookbook. Version 20121115.

[Poggio et al., 2015] Poggio, T., Anselmi, F., and Rosasco, L. (2015). I-theory on depth vs width: hierarchical function composition.


[Rezende and Mohamed, 2015] Rezende, D. J. and Mohamed, S. (2015). Variational inference with normalizing flows.

[Riedmiller and Braun, 1993] Riedmiller, M. and Braun, H. (1993). A direct adaptive method for faster backpropagation learning: The rprop algorithm. In IEEE International Conference on Neural Networks, pages 586–591.

[Ringner, 2009] Ringner, B. (2009). The law of the unconscious statistician. http://www.maths.lth.se/matstat/staff/bengtr/mathprob/unconscious.pdf.

[Robbins and Monro, 1951] Robbins, H. and Monro, S. (1951). A stochastic approximation method. Annals of Mathematical Statistics, 22:400–407.

[Salakhutdinov and Mnih, 2008] Salakhutdinov, R. and Mnih, A. (2008). Probabilistic matrix factorization. In Advances in Neural Information Processing Systems, volume 20.

[Sedhain et al., 2015] Sedhain, S., Menon, A. K., Sanner, S., and Xie, L. (2015). Autorec: Autoencoders meet collaborative filtering. In Proceedings of the 24th International Conference on World Wide Web, WWW ’15 Companion, pages 111–112, New York, NY, USA. ACM.

[Srivastava et al., 2014] Srivastava, N., Hinton, G., Krizhevsky, A., Sutskever, I., and Salakhutdinov, R. (2014). Dropout: A simple way to prevent neural networks from overfitting. J. Mach. Learn. Res., 15(1):1929–1958.

[van Baalen, 2016] van Baalen, M. (2016). Deep Matrix Factorization for Recommendation. Master’s thesis, University of Amsterdam, the Netherlands.

[Vincent et al., 2010a] Vincent, P., Larochelle, H., Lajoie, I., Bengio, Y., and Manzagol, P.-A. (2010a). Stacked denoising autoencoders: Learning useful representations in a deep network with a local denoising criterion. J. Mach. Learn. Res., 11:3371–3408.

[Vincent et al., 2010b] Vincent, P., Larochelle, H., Lajoie, I., Bengio, Y., and Manzagol, P.-A. (2010b). Stacked denoising autoencoders: Learning


useful representations in a deep network with a local denoising criterion. J. Mach. Learn. Res., 11:3371–3408.

[Watkins, 2009] Watkins, J. C. (2009). Transformations of Random Variables. http://math.arizona.edu/~jwatkins/f-transform.pdf.
