Sequence generation and classification with VAEs and RNNs

Jay Hennig 1*   Akash Umakantha 1*   Ryan Williamson 1*

1. Introduction

Variational autoencoders (VAEs) (Kingma & Welling, 2013) are a popular approach to unsupervised learning that can also be used as generative models. VAEs model the data distribution as a nonlinear transformation of unobserved latent variables. Inference of the latent variables given the observed data is performed by a "recognition" (encoding) model, while the transformation from latent variables to data is modeled by a "generating" (decoding) model. During training, the distribution of latent variables is encouraged to take a simple form, such as a normal distribution, which makes the data-generating process straightforward. This approach has been shown to be successful at generating complicated forms of data such as digits and faces (Kingma & Welling, 2013). However, when the data is multi-modal, VAEs do not provide an explicit mechanism for specifying which mode a generated sample should come from. For example, we may wish to ensure we generate a '2' and not a '4'. One possible approach is to model the latent distribution as a mixture distribution such as a mixture of Gaussians (Doersch, 2016). However, this approach is difficult to train and often not successful in practice (Makhzani et al., 2015).

In this paper we introduce the Classifying VAE, in which we train a classifier to detect the mode of each data point simultaneously with training the recognition and generation models. By modeling the probability that a data point belongs to each mode as a Logistic Normal distribution, we can use the reparameterization trick of Stochastic Gradient Variational Bayes to sample from this distribution during training. For generating sequence data, VAEs can be successfully combined with recurrent neural networks (RNNs). The combination of VAEs and RNNs forms a class of models we refer to broadly as "variational RNNs" (vRNNs) (Chung et al., 2015; Boulanger-Lewandowski et al., 2012; Fraccaro et al., 2016; Bayer & Osendorfer, 2014). When each sequence comes from a discrete class, we can similarly train a Classifying vRNN.

*Equal contribution. 1Carnegie Mellon University, Pittsburgh, PA. Correspondence to: Jay Hennig <[email protected]>, Akash Umakantha <[email protected]>, Ryan Williamson <[email protected]>.

Proceedings of the 34th International Conference on Machine Learning, Sydney, Australia, 2017. JMLR: W&CP. Copyright 2017 by the author(s).


We demonstrate that our models are useful for generating data from one of several discrete modes by training our models on polyphonic music data. In music, segments of music conform to a particular key, with different notes being appropriate in different keys. When generating music, therefore, it is important to be able to generate music from only one key at a time. Previous approaches often preprocess the training data to be in only one of two keys (Boulanger-Lewandowski et al., 2012), compared to up to 24 keys present in the original music. Even in this simplified setting, we show that models such as VAEs and vRNNs are prone to generating samples that do not stay in key, resulting in dissonance and poor-sounding music. By contrast, we show that our Classifying VAE and Classifying vRNN can generate samples that are as in key as the original training data, even when the training data includes songs in 18 different keys.

2. Background & Related Work

Modeling sequence data

The standard application of RNNs to sequence data is to have them model the probability distribution over sequences as follows (Bayer & Osendorfer, 2014):

$$p(x_{1:T}) = \prod_{t=1}^{T} p(x_t \mid x_{1:t-1})$$

The output of the RNN at each timestep t is then interpreted as the parameters of the term $p(x_t \mid x_{1:t-1})$.
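As a concrete illustration, here is a minimal sketch of such an RNN in Keras (the framework used later in this paper); the layer sizes and names are our own assumptions, not any cited model's exact architecture:

```python
# Minimal sketch: an LSTM that parameterizes p(x_t | x_{1:t-1}) for
# 88-dimensional binary piano-roll frames. Sizes are illustrative.
from tensorflow import keras

T, D = 16, 88  # assumed sequence length and number of piano keys

model = keras.Sequential([
    # Consumes x_{1:t-1} and summarizes the history in its hidden state.
    keras.layers.LSTM(88, return_sequences=True, input_shape=(T, D)),
    # Sigmoid outputs act as independent Bernoulli probabilities for
    # each of the 88 notes at the next timestep.
    keras.layers.Dense(D, activation="sigmoid"),
])
# Binary cross-entropy is the Bernoulli negative log-likelihood.
model.compile(optimizer="adam", loss="binary_crossentropy")
```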

Boulanger-Lewandowski et al. (2012) combined an RNN with a restricted Boltzmann machine (RBM) in order to model correlations amongst the outputs at each timestep. Their paper established a baseline for music generation by reporting the performance of a variety of simple models such as GMMs, RBMs, HMMs, and n-grams.

More recently, there has been considerable interest in the variational Bayes approach for training deep generative models. The original paper by Kingma & Welling (2013), though not related to modeling sequence data, introduced the Stochastic Gradient Variational Bayes (SGVB) estimator and the variational autoencoder (VAE). The VAE model used a variational approximation to learn an encoder model, which mapped input data to a stochastic latent representation, and a decoder model, which mapped this latent representation back into the data space. Mathematically, their model sought to maximize the following estimate of the variational lower bound:

$$\mathcal{L}(\theta, \phi; x^{(i)}) = -D_{KL}(q_\phi(z \mid x^{(i)}) \,\|\, p_\theta(z)) + \mathbb{E}_{q_\phi(z \mid x^{(i)})}\left[\log p_\theta(x^{(i)} \mid z)\right]$$

Their key insight was what they called the "reparameterization trick". The problem with classical backpropagation in a deep generative model framework is that it requires calculating gradients of an expectation with respect to a random variable z. In general, this is difficult. The reparameterization trick rewrites the random variable as a deterministic function of some other random variable ε. The objective is then rewritten as the empirical lower bound:

$$\tilde{\mathcal{L}}(\theta, \phi; x^{(i)}) = -D_{KL}(q_\phi(z \mid x^{(i)}) \,\|\, p_\theta(z)) + \frac{1}{L} \sum_{l=1}^{L} \log p_\theta(x^{(i)} \mid z^{(i,l)})$$

where $z^{(i,l)} = g_\phi(\epsilon^{(i,l)}, x^{(i)})$ and $\epsilon^{(i,l)} \sim p(\epsilon)$. When $z \sim \mathcal{N}(\mu(x), \sigma^2(x))$ and the prior $p_\theta(z)$ is standard normal, the KL term has the following closed form:

$$D_{KL}(q_\phi(z \mid x^{(i)}) \,\|\, p_\theta(z)) = -\frac{1}{2} \sum_{j=1}^{J} \left(1 + \log \sigma_j^2(x) - \mu_j^2(x) - \sigma_j^2(x)\right)$$
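To make the trick concrete, the following is a minimal sketch (our own illustration, not code from any of the cited papers) of reparameterized sampling and this closed-form KL term for a diagonal Gaussian posterior:

```python
# Reparameterization trick for q(z|x) = N(mu(x), sigma^2(x)) with a
# standard normal prior. Function names are illustrative.
import numpy as np

def sample_z(mu, log_var, rng=np.random.default_rng()):
    # z is a deterministic, differentiable function of (mu, log_var)
    # given auxiliary noise eps ~ N(0, I).
    eps = rng.standard_normal(mu.shape)
    return mu + np.exp(0.5 * log_var) * eps

def kl_to_standard_normal(mu, log_var):
    # Closed-form KL(q(z|x) || N(0, I)), summed over latent dimensions.
    return -0.5 * np.sum(1 + log_var - mu**2 - np.exp(log_var))
```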

Recent papers have applied this variational perspective to modeling sequence data. Bayer & Osendorfer (2014) developed a model called STORN, which introduced stochastic latent variables to an RNN model, fit with SGVB. The sampled latent variables are assumed to be independent over time. Figure 1 shows a diagram of the STORN model.

Next, Chung et al. (2015) developed what they called a variational RNN (VRNN), which extended STORN so that the latent variables depend on the sequence history and therefore evolve over time.

Finally, work by Fraccaro et al. (2016) attempted to merge the RNN and state-space model perspectives. Their approach, which they call a "Stochastic RNN" (SRNN), explicitly separates the stochastic and deterministic components of the model.

Figure 1. The generative models for the VAE (Kingma & Welling, 2013) and vRNN (Bayer & Osendorfer, 2014; Chung et al., 2015) in black, with the additions of the Classifying VAE and Classifying vRNN shown in blue. z are continuous latent variables, while w specifies the class of X (in the Classifying VAE) or the class of a sequence $X_1, ..., X_T$ (in the Classifying vRNN).

Application to music

In Western music, there are twelve possible 'pitch classes' one can play (A Bb B C Db D Eb E F Gb G Ab), and most songs are written using only a subset of these. The "key" of a song (or a part of a song) refers to the central 'tonic' note (e.g., 'C', 'D', etc.), along with its 'mode' (e.g., major or minor). For example, music written in the key of C major will tend to use only notes without sharps or flats (A B C D E F G), with 'C' being the tonic note that most melodies end on. Music in the key of C minor, by contrast, will have the same tonic note (C) but a different overall set of notes (C D Eb F G Ab Bb). In this paper we consider 24 possible keys, referring to each of twelve tonic notes in either the major or minor mode.

In this paper we model music data as a series of 88-dimensional binary vectors, $X_t \in \{0, 1\}^{88}$, where the $j$th entry, $X_t^j$, represents whether or not the $j$th key on an 88-key piano was played at time $t$. We refer to $X = \{X_t \mid t = 1, ..., T\}$ as a music sequence of length $T$. If $X$ is in the key of C major, from a probabilistic standpoint this means $P(X_t^j) \approx 0$ if the note referred to by $X_t^j$ is not part of the key of C major. While some songs do occasionally change keys, we assume that the key is constant as long as $T$ is not too large.
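For later reference, here is a small sketch (our own helper, with an assumed pitch-class numbering) of which of the 88 piano keys belong to a given musical key:

```python
# Which of the 88 piano keys (A0..C8) are in a given key, here C major.
# We number pitch classes with A = 0, so C major = {C, D, E, F, G, A, B}.
C_MAJOR_PITCH_CLASSES = {3, 5, 7, 8, 10, 0, 2}

def in_key_mask(pitch_classes=C_MAJOR_PITCH_CLASSES, n_keys=88):
    # Piano key j (0-indexed from A0) has pitch class j mod 12.
    return [j % 12 in pitch_classes for j in range(n_keys)]
```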

3. Methods

Classifying Variational Autoencoder

The joint distribution we consider is

$$P(X, z, w) = P(X \mid z, w)\, P(z \mid w)\, P(w)$$

where X is the observed data, and z and w are unobserved latent variables. We suppose that w is a vector of dimension d, specifying the probabilities that X comes from one of d distinct classes.

We now follow the main idea of the variational autoencoder, which begins with constructing a variational lower bound on the log of the marginal likelihood $P(X) = \iint P(X \mid z, w) P(z \mid w) P(w) \, dw \, dz$. We can write the following equality:

$$\log P(X) - D[Q(z, w \mid X) \,\|\, P(z, w \mid X)] = \mathcal{L}(X)$$
$$\mathcal{L}(X) = \mathbb{E}_{z, w \sim Q}[\log P(X \mid z, w)] - D[Q(z, w \mid X) \,\|\, P(z, w)]$$

where $D$ is the KL divergence (which is always non-negative), and $\mathcal{L}(X)$ is the evidence lower bound, or ELBO. Applying the chain rule to the KL terms lets us rewrite the left-hand side as:

$$\log P(X) - D[Q(z \mid w, X) \,\|\, P(z \mid w, X)] - D[Q(w \mid X) \,\|\, P(w \mid X)]$$

We can similarly rewrite the right-hand side:

$$\mathbb{E}_{z, w \sim Q}[\log P(X \mid z, w)] - D[Q(z \mid w, X) \,\|\, P(z \mid w)] - D[Q(w \mid X) \,\|\, P(w)]$$

Above, P(z | w) and P(w) are priors, while Q is an arbitrary distribution. Critically, Q will be parameterized by a neural network, and we will optimize Q (with gradient descent) so as to maximize the ELBO, thereby maximizing a lower bound on the marginal likelihood.

To keep our lower bound on the marginal likelihood tight, we must ensure the KL divergences on the left-hand side are small. Typically, we do not have access to P(z | X, w). However, we suppose that during training we do have access to the true w. So instead of minimizing D[Q(w | X) || P(w | X)], we can minimize the cross-entropy between Q(w | X) and the true w. In other words, minimizing this portion of the loss encourages Q(w | X) to classify w. By simultaneously minimizing the cross-entropy loss while maximizing the ELBO, we keep the lower bound on the data likelihood tight while also maximizing it.
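The resulting objective can be written compactly in code. The sketch below is our reading of the loss, not the authors' released code; the tensor names are illustrative:

```python
# Classifying VAE objective: Bernoulli reconstruction + KL on z, plus a
# cross-entropy term pushing Q(w|X) toward the known class label.
import tensorflow as tf

def classifying_vae_loss(x, x_recon_probs, mu_z, log_var_z,
                         w_probs, w_true):
    # Reconstruction term: -E_q[log P(X | z, w)] for Bernoulli outputs.
    recon = tf.reduce_sum(
        tf.keras.losses.binary_crossentropy(x, x_recon_probs))
    # Closed-form KL[Q(z | w, X) || N(0, I)].
    kl_z = -0.5 * tf.reduce_sum(
        1.0 + log_var_z - tf.square(mu_z) - tf.exp(log_var_z))
    # Cross-entropy between Q(w | X) and the true w, standing in for
    # the intractable KL[Q(w | X) || P(w | X)].
    class_loss = tf.reduce_sum(
        tf.keras.losses.categorical_crossentropy(w_true, w_probs))
    return recon + kl_z + class_loss
```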

As with a standard VAE, we use the Stochastic Gradient Variational Bayes (SGVB) estimator of $\mathcal{L}(X)$, whereby we draw samples of $w \sim Q(w \mid X)$ and $z \sim Q(z \mid w, X)$ for each minibatch during training. One limitation of the VAE is that the latent variables z and w cannot be discrete; otherwise we could not perform backpropagation on the parameters of the neural network that specify Q. In this case, we wish for w to be the probabilities that X belongs to each of the d classes. We therefore parameterize $Q(w \mid X)$ as a Logistic Normal distribution. Samples $w \sim Q(w \mid X) = \mathcal{LN}(\mu, \Sigma)$ have the property that $0 \le w_i \le 1$ for $i = 1, ..., d$ and $\sum_{i=1}^{d} w_i = 1$, so that w can be interpreted as our estimated probability that X belongs to each of the d classes.

Critically, to use SGVB, we must be able to apply the "reparameterization trick" and generate samples of z and w as differentiable, deterministic functions of X and some auxiliary variables ε with independent marginals. (It is important to note that samples from the Dirichlet, another option for Q(w | X), cannot be trivially written in this form.) A common choice for the distribution of z is a Normal distribution, to which the reparameterization trick applies. For w, samples from the Logistic Normal distribution can be generated deterministically from a sample from a Normal distribution with the same parameters. Specifically, if $y \sim \mathcal{N}(\mu, \Sigma)$ with $y \in \mathbb{R}^{d-1}$, then $w \sim \mathcal{LN}(\mu, \Sigma)$ is a sample from a Logistic Normal if we set

$$w_j = \frac{e^{y_j}}{1 + \sum_{k=1}^{d-1} e^{y_k}} \quad \text{for } j = 1, ..., d-1, \qquad w_d = \frac{1}{1 + \sum_{k=1}^{d-1} e^{y_k}}.$$
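The following sketch (our own, with illustrative names) shows this reparameterized sampling for a diagonal-covariance Logistic Normal:

```python
# Reparameterized sampling from LN(mu, diag(exp(log_var))) for Q(w|X).
# mu and log_var have dimension d-1 for d classes.
import numpy as np

def sample_logistic_normal(mu, log_var, rng=np.random.default_rng()):
    # Draw y ~ N(mu, diag(exp(log_var))) via the reparameterization trick.
    y = mu + np.exp(0.5 * log_var) * rng.standard_normal(mu.shape)
    # The additive logistic transform maps y in R^{d-1} onto the simplex.
    denom = 1.0 + np.sum(np.exp(y))
    return np.append(np.exp(y), 1.0) / denom  # w_i >= 0, sum(w) == 1
```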

4. Experiments

Music generation. We apply our method to generate sequences of polyphonic music in the form of piano-roll data. For training data we use the entire corpus of 382 four-part harmonized chorales by J.S. Bach ("JSB Chorales"), obtained from the Python package music21 (Cuthbert & Ariza, 2010). This corpus was converted to piano-roll notation, a binary matrix denoting when each of 88 notes (from A0 to C8) was played over time (with time divided into quarter or eighth notes, depending on the data set).

One of the most popular algorithms for detecting the key of a musical sequence is the Krumhansl-Schmuckler algorithm (Krumhansl, 2001). This algorithm finds the proportion of each pitch class present in a sequence and compares this with the proportions expected in each key. We used the implementation of this algorithm in music21 to establish the ground-truth key of each musical sequence in our training data. The JSB Chorales corpus includes songs in 18 different keys. When classifying the key, we treat major and minor keys that share the same notes (e.g., C major and A minor) as the same key, resulting in 10 distinct key classes. Additionally, for comparison with previous work on this data set, we also present results for models trained on songs transposed so that each song is in either C major or C minor.
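As an illustration of this workflow, the snippet below uses music21's built-in corpus and key analysis (the specific chorale path is just an example; exact output may vary by music21 version):

```python
# Key detection with music21; Stream.analyze('key') uses a
# Krumhansl-Schmuckler-style key-profile algorithm by default.
from music21 import corpus

chorale = corpus.parse('bach/bwv66.6')  # one chorale from the corpus
detected = chorale.analyze('key')
print(detected.tonic.name, detected.mode)  # e.g. 'F#' 'minor'
```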

Finally, we perform further experiments on the 'Piano-midi.de' data set, with songs transposed to either C major or C minor, as in Boulanger-Lewandowski et al. (2012).

Implementation of the VAE and Classifying VAE.

To model temporal data with a VAE and Classifying VAE, we considered two options. First, we may consider each timestep in isolation, autoencoding $X_t$ to $X_t$ via $z_t$ and $w_t$. Note that in all models, $w_t$ is constant for all $t$ referring to the same song. Our second option is to "autoencode" $X_t$ to $X_{t+1}$. We found that the former approach led to samples with better musical qualities and lower cross-validated reconstruction error.

For the Classifying VAE, we assume the following priors on z and w:

$$P(z \mid w) = P(z) = \mathcal{N}(0, I), \qquad P(w) = \mathcal{LN}(0, I)$$

For Q, we use:

$$Q(w \mid X) = \mathcal{LN}(\mu_w(X), \text{diag}[\sigma_w^2(X)])$$
$$Q(z \mid X, w) = \mathcal{N}(\mu_z(X, w), \text{diag}[\sigma_z^2(X, w)])$$

where $\mu_w$ and $\sigma_w^2$ are implemented as the outputs of a neural network (the "classifier"), and $\mu_z$ and $\sigma_z^2$ as the outputs of a different neural network (the "encoder"). Note that the outputs of the encoder are a function of X as well as w, so during training we must draw a sample $w \sim Q(w \mid X)$ as described above before computing $\mu_z(X, w)$ and $\sigma_z^2(X, w)$. Note that the true w is used during training only in the loss function. Finally, we assume $P(X \mid w, z)$ is an 88-dimensional Bernoulli distribution whose probabilities $p(w, z)$ are the output of a neural network (the "decoder").

For the classifier, encoder, and decoder networks, we use multilayer perceptrons (one for each network) with one hidden layer and ReLU activation. In practice, we found that classifying the key of a musical sequence is not difficult, even when the training data includes songs in 18 different keys.
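A rough sketch of these three networks in Keras follows; the hidden-layer sizes match those reported below, but the variable names and wiring details are our own assumptions:

```python
# The three MLPs of the Classifying VAE, each with one ReLU hidden
# layer of 88 units. d is the number of key classes.
from tensorflow import keras
from tensorflow.keras import layers

D, d, latent_dim = 88, 10, 8  # input dim, classes, latent dim (assumed)

x_in = keras.Input(shape=(D,))
w_in = keras.Input(shape=(d,))
z_in = keras.Input(shape=(latent_dim,))

# Classifier: X -> (mu_w, log_var_w), parameters of Q(w | X) in R^{d-1}.
h_c = layers.Dense(88, activation="relu")(x_in)
mu_w, log_var_w = layers.Dense(d - 1)(h_c), layers.Dense(d - 1)(h_c)

# Encoder: (X, w) -> (mu_z, log_var_z), parameters of Q(z | X, w).
h_e = layers.Dense(88, activation="relu")(layers.concatenate([x_in, w_in]))
mu_z, log_var_z = layers.Dense(latent_dim)(h_e), layers.Dense(latent_dim)(h_e)

# Decoder: (z, w) -> Bernoulli probabilities over the 88 notes.
h_d = layers.Dense(88, activation="relu")(layers.concatenate([z_in, w_in]))
x_probs = layers.Dense(D, activation="sigmoid")(h_d)
```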

To implement the standard VAE, we ignore w in all equations above and use only an encoder and decoder network, as in Kingma & Welling (2013).

Implementation of the vRNN and Classifying vRNN.

For our standard vRNN, we implement a network similar to STORN (Bayer & Osendorfer, 2014). In this case, the encoder and decoder networks are each replaced with an LSTM with ReLU activation, followed by a single dense layer. The Classifying vRNN is similar except for the addition of a classifier, implemented as in the Classifying VAE. Critically, we classify the key using an entire sequence of X; we leave the sequence length as a hyperparameter to be chosen. A sample $w \sim Q(w \mid X)$ is then fed into the encoder and decoder networks at each timestep of the sequence.

Generating music samples.

When generating samples with a VAE, typically one would draw $z_t \sim \mathcal{N}(0, I)$, pass this value through the decoder to get $p(z_t)$, and then sample $X_t$ from those probabilities. However, we found that the generated samples had better musical qualities if we drew $z_t \sim Q(z \mid X_{t-1})$. In other words, we sample $z_t$ from the encoding of $X_{t-1}$, the previously generated sample. We used this approach when generating samples for all models.
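The sampling loop we describe amounts to the following sketch, where `encoder_sample` and `decoder_probs` are stand-ins for the trained networks (our names, not actual library calls):

```python
# Generate T piano-roll frames by encoding the previously generated
# frame: z_t ~ Q(z | X_{t-1}), then X_t ~ Bernoulli(decoder(z_t, w)).
import numpy as np

def generate(seed_x, w, encoder_sample, decoder_probs, T=32,
             rng=np.random.default_rng()):
    x_prev, frames = seed_x, []
    for _ in range(T):
        z_t = encoder_sample(x_prev, w)   # z_t ~ Q(z | X_{t-1}, w)
        p_t = decoder_probs(z_t, w)       # Bernoulli probabilities
        x_prev = (rng.random(p_t.shape) < p_t).astype(float)
        frames.append(x_prev)
    return np.stack(frames)
```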

Training details.

We built all models using Keras with a TensorFlow backend. All weights were initialized to random samples from $\mathcal{N}(0, 0.01)$ and trained with stochastic gradient descent using Adam (Kingma & Ba, 2014). To prevent overfitting, we stopped training when the loss on our validation data did not decrease for two consecutive epochs. We found that using weight normalization (Salimans & Kingma, 2016) resulted in faster convergence. We performed a hyperparameter search using grid search, choosing the hyperparameters that resulted in the lowest loss on validation data (see Tables 1, 2, and 3). For the VAE and Classifying VAE, batch sizes in the range of 10-50 were preferred, along with a latent dimensionality of around 4-8. For the vRNN and Classifying vRNN, a batch size of 50-200 was sufficient, along with a latent dimensionality of around 8-16 and a sequence length of around 8-16. For all hidden perceptron layers (i.e., for the encoders, decoders, and classifiers) we found that a dimensionality equal to 88, the input dimensionality, worked best. In both the VAE and vRNN models, the hyperparameter search found that the Classifying models generally needed fewer latent dimensions than the unsupervised models. We attribute this to the fact that the extra classifier network in the Classifying models adds predictive power and model complexity.
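For reference, the early-stopping and optimizer setup described here maps onto standard Keras utilities as in the sketch below (`model` and the data arrays are assumed to already exist):

```python
# Training setup: Adam + early stopping on validation loss.
from tensorflow import keras

early_stop = keras.callbacks.EarlyStopping(monitor="val_loss", patience=2)
model.compile(optimizer=keras.optimizers.Adam(),
              loss="binary_crossentropy")
model.fit(x_train, x_train,               # autoencoding X_t to X_t
          batch_size=20, epochs=200,
          validation_data=(x_val, x_val),
          callbacks=[early_stop])
```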

Experiment 1.


Figure 2. Comparison of the inferred values of z for the VAE and Classifying VAE on held-out test data with songs in the key of either C major (red points) or C minor (blue points). The Classifying VAE encodes z after first inferring the probability that X came from a song in one of the two keys. The Classifying VAE's encoded values of z appear to more closely match the imposed prior $z \sim \mathcal{N}(0, I)$ than do those of the VAE.

The first question we ask is: how does the latent representation differ (or not differ) between the VAE and Classifying VAE? We first explore the differences between the two models when the dimensionality of the continuous latent variables z is 2, and the training data contains songs in either C major or C minor.


JSB, 2 keys       | batch size | latent dim | hidden units (latent) | hidden units (classifier) | sequence length
VAE               | 20         | 8          | 88                    | 88                        | n/a
Classifying VAE   | 5          | 4          | 88                    | 88                        | n/a
vRNN              | 50         | 16         | 88                    | 88                        | 16
Classifying vRNN  | 50         | 8          | 88                    | 88                        | 16

Table 1. Results of hyperparameter search for models trained on the 2-key version of JSB Chorales.

JSB, 10 keys      | batch size | latent dim | hidden units (latent) | hidden units (classifier) | sequence length
VAE               | 50         | 8          | 88                    | 88                        | n/a
Classifying VAE   | 10         | 8          | 88                    | 88                        | n/a
vRNN              | 50         | 16         | 88                    | 88                        | 8
Classifying vRNN  | 100        | 4          | 88                    | 88                        | 16

Table 2. Results of hyperparameter search for models trained on the 10-key version of JSB Chorales.

In this case, we can easily visualize the representation of the latent space learned by each network when encoding music in one of two keys. Additionally, because the VAE and Classifying VAE are not temporal models, the encoding and decoding of $z_t$ depends only on the notes played at the current timestep, $X_t$, rather than on the sequence history.

Figure 3. Heatmap of decoded probabilities of X given different values of z and w, for a Classifying VAE trained on music in C major or C minor. For the same value of z, the value of w determines whether the decoded notes are appropriate for C major (when w = 0) or C minor (when w = 1).

There are several differences between the latent spaces of the VAE and Classifying VAE (Fig. 2). In the VAE, location in the latent space is important in determining what key the note/song is in. Notes in C minor (blue) have lower variance than notes in C major. Thus, a latent value near the center (e.g., [0, 0]) will be more likely to be a note from C minor, and values near the edge (e.g., [0, 2]) will be more likely to be a note from C major. Furthermore, samples/songs that have latents in the center will likely change keys over time. In the Classifying VAE, the additional parameter w encodes the key of the song. This allows the representations of C major and C minor to be highly overlapping in the latent space. Furthermore, both latent distributions more closely resemble the prior distribution $\mathcal{N}(0, I)$. Indeed, since the Classifying VAE explicitly models the key, the burden of "clustering" the keys (as in the VAE) is removed from the latent space. Fig. 3 shows that in the Classifying VAE, the same latent variable values z can give rise to very different notes and chords in the input space depending on the value of w.

Experiment 2.

Having built a better intuition of the latent space, we now move to evaluating and comparing the performance of the various models relative to each other. We trained the VAE, Classifying VAE, vRNN, and Classifying vRNN on three different datasets: JSB Chorales with 2 keys, JSB Chorales with 18 keys (10 distinct key classes), and Piano-midi.de with 2 keys. We no longer constrain the latent dimensionality to be 2, but instead use the results of our hyperparameter search (Tables 1, 2, 3).

First, we compared the average negative log likelihood (i.e., reconstruction error) of each model on held-out validation data. Note that since the VAEs are autoencoding and the vRNNs are predicting sequences, it is not necessarily fair to compare the two; the real comparison is between Classifying and non-Classifying models. From the results in Table 4, it was unclear based on this metric alone whether classification improves performance: in some cases the Classifying model was better (JSB 2 keys, Classifying VAE vs. VAE), sometimes worse, and in other cases comparable to the unsupervised model.


PianoMidi, 2 keys | batch size | latent dim | hidden units (latent) | hidden units (classifier) | sequence length
VAE               | 50         | 8          | 88                    | 88                        | n/a
Classifying VAE   | 20         | 4          | 88                    | 88                        | n/a
vRNN              | 200        | 16         | 88                    | 88                        | 8
Classifying vRNN  | 50         | 4          | 88                    | 88                        | 8

Table 3. Results of hyperparameter search for models trained on the 2-key version of Piano-midi.de.

Data set               | VAE   | Classifying VAE | vRNN  | Classifying vRNN
JSB Chorales (2 keys)  | 5.32  | 4.38            | 5.35  | 6.065
JSB Chorales (10 keys) | 4.578 | 4.617           | 3.826 | 4.866
Piano-midi.de (2 keys) | 8.02  | 10.54           | 7.31  | 8.66

Table 4. Average negative log-likelihoods on test data for each model and data set. Hyperparameters used are those in Tables 1, 2, and 3.

We recognize that log likelihood is not a perfect metric: for example, generative adversarial networks (GANs) have been gaining popularity due to their ability to generate very realistic samples, yet their log likelihoods are poor relative to those of models like VAEs. Thus, we turn to other performance metrics to compare these models.

One very important metric is how well a song stays in key. Indeed, the inspiration for this model was to generate sequences that do so better than the standard VAE and vRNN. We generated samples from each model after seeding the networks with musical sequences from held-out test data. Each model generated music for the next $T = 32$ timesteps, for $N = 7000$ different seed sequences in the test data. To measure how well these generated samples preserve the key of the seed sequence, we calculated the proportion of notes in the generated sequences that were in the key of the seed sequence. We observed that the seed sequences themselves ('Data' in Fig. 4) were not perfectly in key; this is presumably because the ground-truth key of the test data was measured per song and not per sequence. In any case, we found that the Classifying models all generated samples that were as in key as the test data itself. This is in contrast to the VAE and vRNN models, which generated samples whose notes were not consistently in key. Note that chance performance on this metric is 67% because keys are composed of eight of the twelve pitch classes, so randomly selecting notes from all twelve pitch classes would result in 8 of 12 notes being in key. Furthermore, in experiments (not shown) where we "flip" the value of w so that it represents the opposite key, we observed that the percentage of notes in key drastically decreased. This suggests that the value of w is in fact controlling how well generated notes stay in key.
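Using the in_key_mask helper sketched earlier, the in-key proportion we report can be computed as follows (our reading of the metric, not the authors' exact code):

```python
# Fraction of played notes that fall inside the seed sequence's key.
import numpy as np

def proportion_in_key(piano_roll, key_mask):
    # piano_roll: (T, 88) binary array; key_mask: 88 booleans.
    played = piano_roll.astype(bool)
    key_mask = np.asarray(key_mask)
    return played[:, key_mask].sum() / max(played.sum(), 1)
```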

Figure 4. Comparison of the percentage of notes from generated samples that are in key, after training on data in 18 different keys. The gray dotted line shows chance performance of 67%. Boxes depict the 25th, 50th, and 75th percentiles across $N = 7000$ samples. Each model generated a sample for $T = 32$ timesteps after being seeded with a musical sequence from held-out test data in one of eighteen different keys. The Classifying vRNN showed similar performance whether samples were generated by inferring the key of the seed (w) using its classifier network or by being given the true key. These percentages were similar to the percentage of notes in key in the test data ('Data').

Ideally, the benefit of a consistent, selectable key would come without any penalty in music quality. We therefore next compared the samples from each model on other musicality metrics to assess how the quality of the music compared to that of the original data set. For these comparisons, we generated samples from a set of 3450 seed sequences, all of which had been transposed to the key of C major. We assessed three musicality metrics in addition to the key consistency described above. First, we assessed the number of tones played on average at each time point (indicated as 'SimulTones' in Fig. 5).


We found that the Classifying vRNN had a slightly higher number of tones per timestep than the other models and the original data. We next measured the tone span, which is the average difference between the highest and lowest tones played across all samples. Here we found that the Classifying vRNN was much more similar to the original data than the VAE or the vRNN. Finally, we measured polyphony, the percentage of time points per sample at which more than one note is played. Here we set a maximum of 5 notes per time step, since more than 5 notes per time step would be a significant departure from the training music (which had 4 tones at nearly all time steps). All models show less polyphony than the original music, with the Classifying vRNN having the lowest polyphony. In summary, the Classifying vRNN does better than the VAE and the vRNN at maintaining a key and at mimicking the tone span of the original music. However, the Classifying vRNN had a slightly higher number of simultaneous tones than the original and a lower polyphony.
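The three metrics, as we interpret them from the text, reduce to simple statistics on the binary piano roll:

```python
# Musicality metrics on a (T, 88) binary piano-roll array.
import numpy as np

def simul_tones(piano_roll):
    # Average number of tones sounding per timestep.
    return piano_roll.sum(axis=1).mean()

def tone_span(piano_roll):
    # Distance between the highest and lowest keys played in the sample.
    keys = np.nonzero(piano_roll.any(axis=0))[0]
    return keys.max() - keys.min() if keys.size else 0

def polyphony(piano_roll):
    # Fraction of timesteps at which more than one note is played.
    return (piano_roll.sum(axis=1) > 1).mean()
```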

Figure 5. Comparison of the VAE, vRNN, and Classifying vRNN (cvRNN) on three musicality metrics. SimulTones indicates the number of tones played at each time step. ToneSpan indicates the average distance across samples between the highest tone and the lowest tone. Polyphony indicates the percentage of time points with more than one tone. Note that each bar group is normalized by its maximum value to allow comparison on the same axes.

Experiment 3.

As another validation of our proposed model, we trained a Classifying VAE on MNIST data and show that it can perform style and content separation (similar to Fig. 4) by using continuous latent variables z along with class labels w, where each class is a different digit. The implementation details are the same as before, except that for our classifier network we use a multilayer perceptron with two hidden layers containing 512 units each, with 20% dropout. Figure 6 shows examples of digits generated by the model both when holding w fixed while varying z (top), and when holding z fixed while varying w (bottom).

Figure 6. Using a Classifying VAE to separate style and content for MNIST. Top: Decoded digits when varying the continuous latents z while keeping the class label w fixed, for two examples. Bottom: The left column shows an example digit from held-out test data, and the numbers to the right show the model's generated "analogies" to this digit, found by using the same values of z but varying the class label w.

5. Conclusion

Kingma et al. (2014) proposed a generative semi-supervised model that is related to our work. In their generative process, the continuous latents z have a multivariate normal prior, class variables y are generated from a multinomial (and treated as latent if the data is unlabeled), and the data-generating process is a function of y, z, and neural network parameters. Our proposed model differs in a few ways. First, their model assumes marginal independence between y and z given data x (i.e., q(y, z | x) = q(y | x) q(z | x)). Our factorization differs in that we do not assume independence: q(y, z | x) = q(z | y, x) q(y | x). Even without this assumption, our model is able to recover a representation that disentangles style and content. Second, in our inference network we parameterize q(y | x) as a Logistic Normal distribution. Since the Logistic Normal distribution is continuous, we can implement stochastic backpropagation (SGVB) through our classifier network in addition to our network for q(z | x, y). This enables training the classifier and autoencoder simultaneously, in contrast to the model of Kingma et al. (2014), which must train these components separately. Additionally, we do not consider training in the semi-supervised case of partially labeled data.

Another architecture applied to problems similar to those addressed by VAEs is the generative adversarial network (GAN) (Goodfellow et al., 2014).


Two GAN structures have been developed that are particularly relevant to the model proposed in this paper. The first is the Adversarial Autoencoder (Makhzani et al., 2015). This model introduces an adversarial network at the output of the latent variable layer of an autoencoder. The role of the adversarial layer is to force the encoder to map input data to latents such that the distribution of the mapped data matches the prior. This model has two advantages over the VAE. First, it does not require a reparameterizable prior, and can therefore employ any arbitrary prior as long as it can be sampled from. Second, the model generally produces representations in the latent space that are closer to the imposed prior. This is especially true for multi-modal distributions (e.g., mixtures of Gaussians), which could plausibly be used to generate data from specific classes. In addition, adversarial autoencoders can incorporate labels in the hidden layer (similar to the w latent variable used here), allowing for generation of data with specific class labels. This architecture has only been constructed for static data; a sequential model has not yet been developed. Critically, the training of GANs is non-trivial and therefore less straightforward than for the VAE.

A second relevant GAN architecture is the Continuous Recurrent Neural Network GAN (C-RNN-GAN) (Mogren, 2016). This model is essentially the adversarial equivalent of the vRNN model discussed earlier. It uses a uni-directional recurrent generative model and a bi-directional discriminator model. It has been applied successfully to music generation, although we leave comparisons between it and the model proposed here for future work. We note that the C-RNN-GAN does not incorporate key labels and would therefore likely suffer from the same limitations on key stability as the vRNN when trained on samples that have not been transposed to a single key.

In summary, we developed an extension of the variational autoencoder that allows for specification of class labels in terms of categorical latent variables. We demonstrated that this method is effective at generating data instances from selected classes in two applications: one where the class label determines the key of the generated music, and a second where the class label determines the value of the MNIST digit. In the music generation task, our method demonstrated superior stability of key over the sample sequence, with comparable performance on other, non-key-related metrics of musicality.

We showed that the Classifying VAE is able to independently encode style and content in both the music generation and MNIST data sets. An interesting future direction is to ask whether the Classifying vRNN is similarly able to separate style and content in its temporal sequences. For example, given a song in a particular key, can we vary the class label to transpose the song to a different key? Another potential extension of this model would be to generate songs following a particular chord progression. This would involve changing the interpretation of the class label w to be the current chord being played (instead of the key of the entire song), which may lead to an effective method for generating melodies to play on top of a given chord progression.

References

Bayer, Justin and Osendorfer, Christian. Learning stochastic recurrent networks. arXiv preprint arXiv:1411.7610, 2014.

Boulanger-Lewandowski, Nicolas, Bengio, Yoshua, and Vincent, Pascal. Modeling temporal dependencies in high-dimensional sequences: Application to polyphonic music generation and transcription. In ICML, 2012.

Chung, Junyoung, Kastner, Kyle, Dinh, Laurent, Goel, Kratarth, Courville, Aaron C., and Bengio, Yoshua. A recurrent latent variable model for sequential data. arXiv preprint arXiv:1506.02216, 2015.

Cuthbert, Michael Scott and Ariza, Christopher. Music21: A toolkit for computer-aided musicology and symbolic music data. In International Society for Music Information Retrieval Conference (ISMIR), 2010.

Doersch, Carl. Tutorial on variational autoencoders. arXiv preprint arXiv:1606.05908, 2016.

Fraccaro, Marco, Sønderby, Søren Kaae, Paquet, Ulrich, and Winther, Ole. Sequential neural models with stochastic layers. In Advances in Neural Information Processing Systems, pp. 2199–2207, 2016.

Goodfellow, Ian J., Pouget-Abadie, Jean, Mirza, Mehdi, Xu, Bing, Warde-Farley, David, Ozair, Sherjil, Courville, Aaron C., and Bengio, Yoshua. Generative adversarial nets. In NIPS, 2014.

Kingma, Diederik P. and Ba, Jimmy. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.

Kingma, Diederik P. and Welling, Max. Auto-encoding variational Bayes. arXiv preprint arXiv:1312.6114, 2013.

Kingma, Diederik P., Mohamed, Shakir, Rezende, Danilo Jimenez, and Welling, Max. Semi-supervised learning with deep generative models. In Advances in Neural Information Processing Systems, pp. 3581–3589, 2014.

Krumhansl, Carol L. Cognitive foundations of musical pitch. Oxford University Press, 2001.

Makhzani, Alireza, Shlens, Jonathon, Jaitly, Navdeep, and Goodfellow, Ian J. Adversarial autoencoders. arXiv preprint arXiv:1511.05644, 2015.

Mogren, Olof. C-RNN-GAN: Continuous recurrent neural networks with adversarial training. arXiv preprint arXiv:1611.09904, 2016.

Salimans, Tim and Kingma, Diederik P. Weight normalization: A simple reparameterization to accelerate training of deep neural networks. arXiv preprint arXiv:1602.07868, 2016.