
IN DEGREE PROJECT COMPUTER SCIENCE AND ENGINEERING,
SECOND CYCLE, 30 CREDITS
STOCKHOLM, SWEDEN 2019

Generating Training Data for Keyword Spotting given Few Samples

PIUS FRIESCH

KTH ROYAL INSTITUTE OF TECHNOLOGY
SCHOOL OF ELECTRICAL ENGINEERING AND COMPUTER SCIENCE


Generating Training Data for Keyword Spotting given Few Samples

PIUS FRIESCH
[email protected]

Master in Machine Learning
Date: June 4, 2019
Supervisor: Jonas Beskow
Examiner: Sten Ternström
School of Electrical Engineering and Computer Science
Host company: snips.ai
Swedish title: Generering av träningsdata för nyckelordsigenkänning utifrån ett fåtal exempel


Abstract

Speech recognition systems generally need a large quantity of training data with highly variable voice and recording conditions in order to produce robust results. In the specific case of keyword spotting, where only short commands are recognized instead of large vocabularies, the resource-intensive task of data acquisition has to be repeated for each keyword individually. Over the past few years, neural methods in speech synthesis and voice conversion have made tremendous progress and generate samples that are realistic to the human ear. In this work, we explore the feasibility of using such methods to generate training data for keyword spotting. In detail, we want to evaluate whether the generated samples are indeed realistic or only sound so, and whether a model trained on these generated samples can generalize to real samples. We evaluated three neural network speech synthesis and voice conversion techniques: (1) Speaker Adaptive VoiceLoop, (2) Factorized Hierarchical Variational Autoencoder (FHVAE), (3) Vector Quantised-Variational AutoEncoder (VQVAE).

These three methods are evaluated as data augmentation or data generation techniques on a keyword spotting task. The performance of the models is compared to a baseline of changing the pitch, tempo, and speed of the original samples. The experiments show that the neural network techniques can provide up to a 20% relative accuracy improvement on the validation set. The baseline augmentation technique performs at least twice as well. This seems to indicate that naively using multi-speaker speech synthesis or voice conversion does not yield varied or realistic enough samples.


Sammanfattning

Speech recognition systems generally need a large amount of training data with varying voice and recording conditions in order to produce robust results. In the specific case of keyword spotting, where only short commands are recognized instead of large vocabularies, resource-intensive data collection has to be carried out for each keyword individually. Over the past few years, neural methods in speech synthesis and voice conversion have made great progress and generate speech that is realistic to the human ear. In this work we investigate the possibility of using such methods to generate training data for keyword spotting. In detail, we want to evaluate whether the generated training data is indeed realistic or only sounds so, and whether a model trained on these generated examples generalizes well to real speech. We evaluated three methods for neural speech synthesis and voice conversion: (1) Speaker Adaptive VoiceLoop, (2) Factorized Hierarchical Variational Autoencoder (FHVAE), (3) Vector Quantised-Variational AutoEncoder (VQVAE).

These three methods are used either to generate training data from text (speech synthesis) or to enrich an existing dataset in order to simulate several different speakers by means of voice conversion, and they are evaluated in a keyword spotting system. The performance of the models is compared with a baseline based on traditional signal processing, in which the pitch and tempo of the original training data are varied. The experiments show that the neural network methods can give up to a 20% relative accuracy improvement on the validation set compared to the original training data. The baseline method based on signal processing gives at least twice as good an improvement. This seems to indicate that naively using multi-speaker speech synthesis or voice conversion does not yield sufficiently varied or representative training data.


Contents

1 Introduction

2 Background
  2.1 Data Augmentation
  2.2 Keyword Spotting
  2.3 Generative Models
    2.3.1 Autoencoders
    2.3.2 Variational Autoencoders
    2.3.3 Generative Adversarial Nets
    2.3.4 Text to Speech Models
  2.4 Generative Models for Audio
    2.4.1 VoiceLoop
    2.4.2 Factorized Hierarchical Variational Autoencoder
    2.4.3 Vector Quantised-Variational AutoEncoder
  2.5 Feature extraction

3 Method
  3.1 Speaker Adaptive VoiceLoop
  3.2 Factorized Hierarchical Variational Autoencoder
  3.3 Vector Quantised-Variational AutoEncoder

4 Experiments
  4.0.1 Modified VQVAE architecture
  4.1 Evaluation Setup
    4.1.1 Qualitative Evaluation
    4.1.2 Quantitative Evaluation
  4.2 Experiments
    4.2.1 Experiments: VoiceLoop
    4.2.2 Experiments: Scalable FHVAE
    4.2.3 Experiments: VQVAE

5 Results
  5.0.1 Speaker VoiceLoop
  5.0.2 Voice Conversion Results
  5.1 Qualitative Evaluation
    5.1.1 Speaker VoiceLoop
    5.1.2 FHVAE
    5.1.3 VQVAE

6 Discussion

7 Conclusion

Bibliography


Chapter 1

Introduction

Recent advances in automatic speech recognition (ASR) [1, 2, 3, 4, 5, 6] using neural network based approaches have brought the state of the art (SOTA) to a human or above-human level on constrained tasks. A common requirement of deep learning approaches is the need for large amounts of training data to train deep neural networks to reach this level. For common languages like Chinese, English, etc., these large amounts of transcribed speech are usually readily available, in contrast to languages with relatively few speakers, where it is considerably more difficult to acquire the amount of transcribed data needed for well-performing models.

ASR mainly considers the case of large vocabulary recognition or phoneme recognition in combination with a language model. For online ASR it is usually not feasible to run the full ASR system all the time. For this reason, smaller models are run in practice to listen for a wake word that starts the general ASR. In environments where the hardware resources are heavily constrained, general ASR models might not be used at all; especially when only small amounts of memory are available, bigger ASR models are unfit. Wake word or keyword spotting models are very small networks highly fitted to a single keyword or a few keywords. Given the requirement for small resource usage, deep learning wake word models are not trained from a general large vocabulary speech corpus but need a specialized training set only for the particular commands. Therefore, adding a keyword is linked to a substantial effort to acquire training data. This provides a strong incentive to find methods that help to lower the amount of data needed.

One approach could be to synthesize training data using a neural network trained on a large vocabulary speech corpus, generate samples of the given keyword, and then use these samples to train the smaller keyword spotting model. Given that the learned keyword spotting model should generalize well to many speakers, the generated samples should cover a broad range of different voices.


Another method could be to follow an approach which has proven to be effective in the field of image recognition, namely augmenting the training data. Data augmentation does not change the corresponding label and can help to teach the model invariances in the data. Similar methods that enforce an implicit bias towards certain invariances are widely used in speech recognition [7, 8, 9, 10]. These digital signal processing (DSP) methods help neural networks to generalize better in most cases. However, these methods only cover variances in time, pitch or vocal tract length, whereas speakers are furthermore distinguished by their exact physical build, accent, prosody or other attributes. Given that these attributes are hard to quantify and thus not easily varied using DSP methods, we turn to approximate neural network methods in this work.

The task of turning a sample of one speaker into another sample with the same content but said by a different speaker is commonly referred to as voice conversion. The straightforward supervised method is to create a parallel dataset with time annotations of multiple speakers saying the same thing and then learn how to do the transformation. However, this kind of data is hard to acquire and usually only available in small quantities. Unsupervised methods that discover an alignment would not have the issue of costly data. Similar to classical methods for voice conversion, which rely on small parallel datasets to convert speech to a limited number of speakers, the classical methods for TTS rely on selecting short samples of recorded speech and stitching them together to form new outputs. Yet, in recent literature more parametric neural network methods have been proposed and shown to result in large improvements in naturalness and scalability compared to classical methods by utilizing large amounts of data. Given such progress in recent years, we want to evaluate if samples produced by these methods can be used as training data for the specific task of keyword spotting. We chose to evaluate the three following methods:

1. Speaker Adaptive VoiceLoop augments the basic VoiceLoop text-to-speech model with a speaker encoding prediction network. It thus enables fast adaptation to speakers and therefore more variability in the network output compared to static speaker embeddings learned during training.

2. Factorized Hierarchical Variational Autoencoder (FHVAE) is a take on the disentangled Variational Autoencoder (VAE) which promises to find hidden local and global features of audio samples in an unsupervised manner. The local features can be recombined with the hidden global features, which in our case represent speaker information, of a different audio sequence to synthesize the audio in the given speaker's style.

3. Vector Quantised-Variational AutoEncoder (VQVAE) finds discrete, speaker-independent local codes from speech samples in an unsupervised manner. The found discrete codes are combined with different speaker encodings to reconstruct a given audio sample in the voice of a different speaker. The synthesizing part of the originally proposed network is replaced by a Long Short-Term Memory (LSTM) based architecture instead of a WaveNet based architecture in order to allow for faster generation of a large number of samples.

First, we train these methods using multi-speaker datasets in order to create trained models which can produce samples with a variety of speakers. These trained networks can then be used for data generation. The produced samples are tested on a keyword spotting task in order to test the augmentation performance. To generate training data, a phoneme transcription of the different keywords is given to the Speaker Adaptive VoiceLoop model to generate keyword audio samples. The trained FHVAE and VQVAE models augment different, smaller amounts of training samples by reconstructing them with different global styles. The different numbers of provided samples correlate with the effort needed to acquire data for a new keyword. Finally, the accuracy of predicting the different keyword classes is reported. As a baseline augmentation technique, random pitch and speed adjustments of the given audio samples were chosen.

The neural network techniques produced subjectively good and varied generated samples. Applying these naively to the keyword spotting task improved the performance only when few base samples are given; larger amounts of base samples with a larger variety show no improvement. Furthermore, the baseline augmentation technique performs at least twice as well. Finally, we discuss possible reasons for the difficulty of training the keyword spotting task on the generated samples.

The remainder of the thesis is organized as follows. Chapter 2 describes related approaches for data augmentation in automatic speech recognition and keyword spotting models, as well as generative models for audio. An overview of the three methods used in this work is also given. In Chapter 3 we describe how the neural network methods are trained and used to generate the training data and for keyword spotting. This is followed by Section 4.1, where we describe how we evaluate the chosen methods. How the three data generation methods are trained is described in Section 4.2. The performance on the keyword spotting task is reported in Chapter 5, followed by a qualitative evaluation in Section 5.1.

Finally, we discuss the results in Chapter 6 and conclude the thesis in Chapter 7.

Page 13: Generating Training Data for Keyword Spotting given Few ...1336760/FULLTEXT01.pdf · provide an up to 20% relative accuracy improvement on the validation set. The baseline augmentation

Chapter 2

Background

2.1 Data Augmentation

A common practice in machine learning is the use of data augmentation. In a vision task, for example, one could use mirroring, rotation or zooming, which do not influence the resulting label of the image. Similarly, audio data can be augmented without changing the resulting label. The factors to which the model should be invariant change from application to application. Commonly expected invariances of models in ASR are the quality of the recording, the different attributes of the speaker (e.g. pitch, prosody, accent), environmental noise, the distance to the microphone or reverberations of the sound in closed rooms. This invariance is often correlated with how well a model generalizes to new environments and scenarios the model has not been exposed to during training time.

One of the simplest methods is adding a noise track to the audio. This can be a recording of a varying environment, like a noisy room or a factory setting. Hannun et al. [2] report that this helped the performance of their ASR system, especially on their noisy test set. They also stated that the length of the recorded noise sample should not be too small, since this could lead to the network overfitting to, or remembering, the specific added noise. However, just adding noise fails to capture the Lombard effect [11], which describes the tendency of speakers to increase their pitch and intonation to overcome a loud and noisy environment. The authors of that work recorded special data for this use case, where speakers were exposed to a noisy environment through headphones.
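As a rough illustration, this kind of additive-noise augmentation can be sketched in a few lines. The snippet below assumes that speech and noise are one-dimensional float waveforms at the same sample rate and that the noise track is longer than the speech clip; the function name and SNR handling are illustrative rather than taken from any of the cited systems.

import numpy as np

def mix_noise(speech, noise, snr_db):
    """Mix a random excerpt of a noise track into a speech clip at a target SNR (dB)."""
    # Assumes len(noise) > len(speech); pick a random excerpt of matching length.
    start = np.random.randint(0, len(noise) - len(speech))
    noise = noise[start:start + len(speech)]
    # Scale the noise so that the speech/noise power ratio matches snr_db.
    speech_power = np.mean(speech ** 2)
    noise_power = np.mean(noise ** 2) + 1e-12
    scale = np.sqrt(speech_power / (noise_power * 10 ** (snr_db / 10)))
    return speech + scale * noise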

Another simple method is resampling the audio signal, thereby speeding up or slowing down the signal [7]. This affects the pitch, or fundamental frequency, of the audio, so methods which keep either the speed or the pitch constant [12] can also be used to modify the audio. Collobert, Puhrsch, and Synnaeve [6] found that stretching helps with small datasets, but the effect vanishes for bigger datasets.
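A minimal sketch of such pitch and tempo perturbation is shown below, using librosa's time-stretch and pitch-shift effects; the perturbation ranges are illustrative choices, not settings taken from the cited work.

import numpy as np
import librosa

def perturb(y, sr):
    """Randomly perturb tempo and pitch of a waveform; the keyword label stays unchanged."""
    rate = np.random.uniform(0.85, 1.15)   # tempo factor (keeps pitch constant)
    steps = np.random.uniform(-2.0, 2.0)   # pitch shift in semitones (keeps tempo constant)
    y = librosa.effects.time_stretch(y, rate=rate)
    y = librosa.effects.pitch_shift(y, sr=sr, n_steps=steps)
    return y

# Example usage (hypothetical file name):
# y, sr = librosa.load("keyword.wav", sr=16000)
# y_aug = perturb(y, sr)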

Another proposed approach is vocal tract length perturbation (VTLP) [8], which takes the inverse approach of vocal tract length normalization (VTLN). VTLN helps to learn better speaker invariance with respect to vocal tract length by normalizing the input data of different speakers to a common mean, using speaker-dependent warp factors based on the physical structure of the speaker's vocal tract. VTLP inverts this by using randomly generated warp factors to artificially generate more varied samples.

Frequency-axis random distortion, as introduced in Kanda, Takeda, and Obuchi [13], adds uniform noise to the spectrogram, which is then transformed based on local averaging in small patches. This adds a distortion which changes the input spectrogram in local patches instead of shifting it based on a global value.

Another approach to augmenting training data is using the impulse response of a room to transform the audio signal. This simulates the case of far-field recordings and helps in these cases, but might hurt close-field performance according to Arik et al. [14]. Impulse responses can also be synthesized without the need for physical measurements [15].
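A sketch of this reverberation augmentation is the convolution of a clip with a room impulse response; it assumes a measured or synthesized impulse response is available as a waveform array rir, and the peak normalization is an illustrative choice.

import numpy as np
from scipy.signal import fftconvolve

def apply_rir(speech, rir):
    """Simulate a far-field recording by convolving speech with a room impulse response."""
    rir = rir / (np.max(np.abs(rir)) + 1e-12)                         # normalize the impulse response
    reverberant = fftconvolve(speech, rir, mode="full")[:len(speech)]
    # Rescale to the original peak level to avoid clipping.
    return reverberant * (np.max(np.abs(speech)) / (np.max(np.abs(reverberant)) + 1e-12))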

2.2 Keyword Spotting

The task of spotting only specific keywords is an active area of research next to general ASR. These networks are used in exactly those cases where general ASR systems are not viable, mostly as small-footprint networks that can be run very resource-efficiently at inference time. However, given that the number of keywords is highly limited, the commands can be predicted directly from the whole sequence instead of predicting sequences. Before the recent surge in interest in neural networks, HMMs with sequence search algorithms were commonly used; however, even a simple 3-layer DNN outperformed these systems in Chen, Parada, and Heigold [16]. Using more complex architectures like CNNs [17] or combining convolutional with recurrent elements [14] proved to be similarly resource efficient while improving the performance. One drawback of implementing RNNs for an online task is that one has to address the problem of a diverging hidden state. Training is done on short sequences, but at inference the network is run online without resetting the hidden state. To reproduce the same setting as in training, one would have to clear the hidden state of the network regularly to prevent a diverging state. Alternatively, one could choose to train the network emulating an online setting, as seen in Hwang, Lee, and Sung [18]. For this work, the choice fell on a simple convolutional architecture similar to LeCun et al. [19], given that convolutional networks appear to still be competitive while having a low complexity.
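For concreteness, a keyword classifier in this spirit can be as small as the following sketch: a couple of convolution/pooling stages over a log-Mel patch followed by a linear layer over the keyword classes. The layer sizes and input shape are illustrative and not the exact architecture used later in this work.

import torch
import torch.nn as nn

class KeywordCNN(nn.Module):
    """Small convolutional classifier over a (1, n_mels, n_frames) log-Mel patch."""

    def __init__(self, n_keywords, n_mels=40, n_frames=100):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
        )
        self.classifier = nn.Linear(32 * (n_mels // 4) * (n_frames // 4), n_keywords)

    def forward(self, x):                     # x: (batch, 1, n_mels, n_frames)
        h = self.features(x)
        return self.classifier(h.flatten(1))  # logits over the keyword classes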

2.3 Generative Models

2.3.1 Autoencoders

Variational autoencoders take their name from basic autoencoders, which can essentially be seen as learned compression networks. Common uses of autoencoders are de-noising as well as feature or representation learning. Given an input datapoint which is fed through a network with an information bottleneck, the model has to reconstruct the original input. The network is optimized to reconstruct the input, but it cannot simply copy the input because of the bottleneck. This forces the network to find an information-rich representation. Such an information-dense representation is also interesting for use in discriminative models, since it is learned completely unsupervised, without the need for manual feature engineering. Yet, classical autoencoders produce representations as point estimates in a hidden space. Thus, manipulating this hidden space to generate new samples is quite limited.
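As a minimal sketch, a fully connected autoencoder with a narrow bottleneck looks as follows (the dimensions are arbitrary); the point is only that the decoder must reconstruct the input from the compressed code z.

import torch.nn as nn

class Autoencoder(nn.Module):
    """Bottleneck autoencoder: compress the input, then reconstruct it."""

    def __init__(self, n_in=784, n_hidden=32):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(n_in, 128), nn.ReLU(), nn.Linear(128, n_hidden))
        self.decoder = nn.Sequential(nn.Linear(n_hidden, 128), nn.ReLU(), nn.Linear(128, n_in))

    def forward(self, x):
        z = self.encoder(x)      # point estimate in the hidden space
        return self.decoder(z)   # reconstruction, trained with e.g. an MSE loss against x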

2.3.2 Variational Autoencoders

In order to generate new samples, one could choose to estimate a probabilistic distribution over hidden variables p(z) that represents the underlying structure of the data, inferred from the given data x via p(z|x). Given a function p(x|z), this structure can then be used to generate new and realistic samples by sampling from p(z). However, it is not clear what this underlying distribution looks like given the data. We could instead define an approximate inference model q(z|x), for which we can choose the distribution, and then choose an optimization strategy that reduces its divergence from the underlying distribution. Yet, this seems like circular reasoning, since we do not know the underlying distribution. To work around this, Kingma and Welling [20] introduced the variational lower bound:

\mathcal{L}(\theta, \phi; x^{(i)}) = -D_{KL}\big(q_\phi(z \mid x^{(i)}) \,\|\, p_\theta(z)\big) + \mathbb{E}_{q_\phi(z \mid x^{(i)})}\big[\log p_\theta(x^{(i)} \mid z)\big]    (2.1)


where q_φ(z|x^(i)) is the inference network that predicts the variational parameters φ of the distribution over z, p_θ(z) is the Gaussian prior over the hidden variables, and E_{q_φ(z|x^(i))}[log p_θ(x^(i)|z)] represents the reconstruction error.


Figure 2.1: Left: Sampling operation from a Gaussian distribution. Right: Sampling operation from a reparameterized Gaussian distribution. The sample operation in red symbolizes the non-differentiable sampling operation in the computational graph.

Furthermore, the model infers the parameters of the hidden distribution and then samples from it. This sampling operation is not differentiable. To alleviate this issue, Kingma and Welling [20] proposed a reparameterization trick. In the case of a univariate Gaussian, the hidden variable z is sampled from the distribution p(z|x) = N(µ, σ²), where µ and σ are predicted by the inference network. Thus, z can be reparameterized as z = µ + σε, where ε is an auxiliary noise variable ε ~ N(0, 1). As seen in Figure 2.1, when the reparameterization trick is used, the sampling operation is moved out of the computational graph, which makes it possible to propagate the gradient around the sampling operation.
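A small sketch of the reparameterized sampling step and the resulting negative lower bound of Equation 2.1 is given below, here with an MSE reconstruction term and a unit Gaussian prior; parameterizing the variance as log σ² is a common convention, not something prescribed by the text.

import torch
import torch.nn.functional as F

def reparameterize(mu, log_var):
    """Draw z = mu + sigma * eps with eps ~ N(0, I); gradients flow through mu and sigma."""
    eps = torch.randn_like(mu)
    return mu + torch.exp(0.5 * log_var) * eps

def vae_loss(x_hat, x, mu, log_var):
    """Negative ELBO of Eq. 2.1: reconstruction error plus KL(q(z|x) || N(0, I))."""
    recon = F.mse_loss(x_hat, x, reduction="sum")
    kl = -0.5 * torch.sum(1 + log_var - mu.pow(2) - log_var.exp())
    return recon + kl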

A recent extension of the standard Variational Autoencoder is the β-Variational Autoencoder [21], which puts a tighter bottleneck on the hidden Gaussian distributions by adding a weight β on the prior term, thereby forcing the hidden distributions to be more disentangled.

One idea that is very close to the standard VAE is proposed in Hsu et al. [22], where a speaker representation is concatenated to the hidden variable z. Thus, the local or phonetic information gets encoded in z while the speaker information is given externally. At inference time, the speaker information is changed in order to reconstruct the sample in a different voice.


2.3.3 Generative Adversarial Nets

The basic idea of Generative Adversarial Nets (GAN) is that a generative network provides samples which are rated by an adversarial network that discriminates whether a given sample is real or generated. This makes it possible to model only the generation process from the random variable z; thus, in the standard GAN no approximation is necessary for the inference of z, in contrast to a Variational Autoencoder. The generator network tries to fool the discriminator network by producing realistic samples. This leads to a minimax game, where either the generator network succeeds by producing realistic samples or the discriminator succeeds by exposing fake samples. The model converges when the discriminator cannot distinguish between real and fake samples and guesses randomly. However, practice has shown that GANs are hard to train given the missing convergence guarantee [23].

One example of how this could be used for voice conversion is presented in Hsu et al. [24], where the idea is to add a GAN objective to the VAE framework to model voice conversion more directly using a Wasserstein objective.

In a more pure GAN setting, Kaneko and Kameoka [25] propose to use a CycleGAN-style network for voice conversion. The main idea of a CycleGAN is that when a sample is converted from one style to another and then back to the original one, the same sample should be reconstructed. This can be done from the direction of both domains and thus be learned without supervised parallel data.

2.3.4 Text to Speech Models

As an alternative to detecting local phonetic features in an unsupervised fashion, one could also use text or phoneme labels that have been annotated manually. This could free the model from trying to approximate a good local encoding; the network's modeling capacity could therefore shift its focus to modeling the speaker information better. Text to speech models have been shown to produce the best quality samples as rated by human judges. Recent lines of popular research are Deep Voice 1, 2 and 3 [26, 27, 28]. Deep Voice 3, the most recent paper, follows a TTS sequence-to-sequence architecture. The model first encodes the given character or phoneme sequence into a learned encoding. An autoregressive convolutional decoder is then used to predict the next frame or frames of the Mel spectrogram output. Given the autoregressive nature, new frames depend on previously predicted frames. However, during training, the autoregressive decoder network is trained using teacher forcing by feeding the network with ground truth frames instead of the previously predicted frames.


The encoded input is attended to by an attention module. In Deep Voice 3 an unconstrained attention mechanism is used, which can lead to skipping or reverting in time. The authors reported a heuristic during inference that forces the attention to be monotonic and found that this gave better performance in their experiments compared to purely monotonic attention. Furthermore, several decoding strategies for raw audio were evaluated; more on this in Section 2.5.

Similarly, the Tacotron 1 and 2 models [29, 30] rely on a sequence-to-sequence architecture which encodes the input text into a fixed encoding that is then attended over by an attention module queried by the decoder. The main differences to Deep Voice 3 are that the network mostly uses recurrent neural networks in its encoder and decoder architecture, as well as an attention module that is encouraged to be monotonic. Furthermore, the decoder is also learned using teacher forcing. This alone would lead the network to only learn how to predict the next frame and not how to model the sequence. Like in Deep Voice 3, the previous output, or here the ground truth frame, is compressed using a shallow two-layer module followed by dropout [31], labeled the PreNet. Moreover, as seen in Figure 2.2, the architecture is extended by a PostNet in order to improve the quality of the predicted spectrogram: the output of this residual layer is added to the output of the decoder to form the final predicted spectrogram. A completely different approach is the recently proposed VoiceLoop architecture [32]. Instead of using common recurrent architectures, it introduces a novel method based on a shifting memory. It uses a similar teacher forcing method, but instead of dropping out neurons, noise is added to prevent the network from learning only to predict the next frame.
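The teacher-forcing idea shared by these decoders can be sketched as a single training step: the decoder is fed the shifted ground-truth frames rather than its own predictions, optionally corrupted with noise as in VoiceLoop. The decoder interface below is hypothetical.

import torch
import torch.nn.functional as F

def teacher_forced_step(decoder, text_encoding, target_frames, noise_std=0.0):
    """One training step with teacher forcing.

    target_frames: (batch, T, n_mels) ground truth; a zero frame acts as the <go> input.
    noise_std > 0 mimics the VoiceLoop variant that adds noise instead of using dropout.
    """
    go_frame = torch.zeros_like(target_frames[:, :1])
    prev = torch.cat([go_frame, target_frames[:, :-1]], dim=1)  # shift right by one frame
    if noise_std > 0:
        prev = prev + noise_std * torch.randn_like(prev)
    pred = decoder(text_encoding, prev)                         # hypothetical decoder call
    return F.mse_loss(pred, target_frames)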

Figure 2.2: Schema of the Tacotron 2 architecture: a bi-directional linguistic encoder, location-sensitive attention, a 2-layer pre-net, 2 LSTM decoder layers, a linear projection to the mel spectrogram, a 5-convolution-layer post-net, and a WaveNet MoL vocoder producing the waveform samples.

Given that these models can produce samples that come close to being indistinguishable from a real human voice, the recent research focus has shifted to modeling the expressiveness and variability of the output. For example, Skerry-Ryan et al. [33] add a prosody embedding to their encoding process. This embedding is predicted from a sample sequence and then put through a strong bottleneck which reduces the sequence to a fixed-length encoding. The input of the decoding network is then composed of this prosody embedding, the speaker embedding and the local encoding predicted from the text transcription. Similar approaches that model these global properties of an utterance in an unsupervised fashion achieve a similar goal while differing in detail: Hsu et al. [34], Henter, Wang, and Yamagishi [35], and Wang et al. [36].

2.4 Generative Models for Audio

In this section, the general concepts of the three generative neural networks used in the experiments are presented. All of these models have in common that they can be used to model different dimensionalities of audio, for instance the hidden global information of the speaker and the local information of the content. Each model uses a different degree of supervision: while the VoiceLoop model requires phoneme transcriptions and speaker IDs at training time, VQVAE only requires the speaker ID in addition to the audio, and FHVAE is completely unsupervised.

2.4.1 VoiceLoop

The VoiceLoop [32] architecture is a text-to-speech model which takes phonemes as input and predicts vocoder features, which in turn can be synthesized to raw audio. When a speaker identity is given, the model can be trained to model the speech of different speakers.

With the VoiceLoop neural network, a unique recurrent network architecture to model the sequence was proposed. The main difference to classical recurrent neural networks (RNNs) is the use of a stacked memory in the recurrent architecture instead of a single hidden state. A number of hidden states from previous timesteps are saved in a FIFO-queue style, so there are always S hidden states saved. As seen in Figure 2.3, in each iteration one is added to the beginning and the oldest one is discarded. Therefore, the time context is implicit in the model instead of the model having to learn to compress information over time into a single hidden state.

Two inputs are given to the network. First, phonemes with force-aligned silence information are fed into the network. This is followed by an embedding of the speaker ID.


Figure 2.3: Schema of the VoiceLoop architecture.

In order for the network to focus on a specific part of the given phoneme sequence, an attention module α_t = N_a(S_{t-1}) is added. Because of this, the decoding network never sees the full phoneme sequence, but rather a weighted accumulation over the phoneme sequence as a context vector c_t = α_t ∗ x_phn. To compute α_t, a monotonic attention mechanism similar to Graves [37], in the form of a mixture of Gaussians, is used. Each mean of the Gaussian distributions is advanced monotonically based on the current memory state. This means that the model can never go back to focus on already seen phonemes.

A shallow network of two fully connected layers is used to predict the next memory frame based on the weighted phoneme input and the speaker information: u_t = N_u(z, S_{t-1}, α_t ∗ x_phn). The new memory frame is added to the buffer, while the oldest frame is discarded. The final output of the network, o_t, is computed by another shallow two-layer network N_o which takes the current complete buffer state S_t.
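Put together, one decoding step of this buffer mechanism might look like the following sketch, where N_a, N_u and N_o stand for the shallow networks described above and all shapes are illustrative; details of the real model, such as the Gaussian-mixture attention update, are omitted.

import torch

def voiceloop_step(S, x_phn, z, N_a, N_u, N_o):
    """One decoding step of a VoiceLoop-style buffer (shapes are illustrative).

    S:     (batch, buf_len, d)     current memory buffer S_{t-1}
    x_phn: (batch, n_phonemes, d)  encoded phoneme sequence
    z:     (batch, d_speaker)      speaker embedding or encoding
    """
    alpha = N_a(S)                                             # (batch, n_phonemes) attention weights
    context = torch.bmm(alpha.unsqueeze(1), x_phn).squeeze(1)  # weighted phoneme context c_t
    u = N_u(torch.cat([z, S.flatten(1), context], dim=-1))     # new memory frame u_t, shape (batch, d)
    S = torch.cat([u.unsqueeze(1), S[:, :-1]], dim=1)          # FIFO shift: push u_t, drop the oldest
    o = N_o(torch.cat([S.flatten(1), z], dim=-1))              # output vocoder frame o_t
    return o, S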

Given that speech is not deterministic when derived from a sequence of phonemes, it is very unlikely for the network to predict exactly the same sequence as the ground truth. However, when predicting only the next frame, the range of likely outputs is more limited. Thus, in order to guide the network during training to predict the same output as the ground truth, the real sequence of vocoder features is fed into the network: instead of using the previously generated frames, the ground truth frames are used. This technique is usually referred to as teacher forcing. Gaussian noise is added to the ground truth to prevent the network from just predicting the next frame from the previous output frames while ignoring the phoneme and speaker encoding inputs.

The output is learned via a mean squared error (MSE) loss comparing the predicted vocoder features with the ground truth features.

Speaker Adaptive VoiceLoop


Figure 2.4: Schema of the Speaker Adaptive VoiceLoop architecture.

The VoiceLoop architecture is extended in Nachmani et al. [38] by adding a speaker extraction network rather than giving the speaker ID to the model in a supervised manner. The speaker ID is not given but rather extracted by the model from a given audio sample into a fixed-size vector.

In order to predict a fixed-length speaker encoding, the ground truth vocoder features x_vocoder = x_1, ..., x_T are fed into the speaker network N_s as additional input. This encoding z is then used instead of the speaker ID to predict the next memory frame. The speaker encoding is also fed to the output network N_o in addition to the current buffer. In addition to the main MSE loss for predicting the vocoder features, two other losses are added to improve the learned speaker encoding.

First, a triplet-style contrastive loss is used, which takes the current sample plus another sample from the same speaker and a sample from a different speaker as inputs. The optimizer then reduces the MSE between the speaker encodings of the samples of the same speaker and maximizes the MSE between the sample and the other speaker up to a given margin. In addition, as seen in Figure 2.4, a cycle loss is employed which reduces the MSE between the speaker vector derived from the output of the network and the speaker vector derived from the ground truth.
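A sketch of these two auxiliary terms on speaker encodings is given below; the margin value and the exact way the encodings are compared are illustrative, not the precise formulation of Nachmani et al. [38].

import torch
import torch.nn.functional as F

def speaker_losses(z_anchor, z_same, z_other, z_from_output, margin=1.0):
    """Auxiliary losses on speaker encodings (a sketch of the contrastive and cycle terms).

    z_anchor, z_same: encodings of two samples from the same speaker
    z_other:          encoding of a sample from a different speaker
    z_from_output:    encoding recomputed from the network's own output
    """
    pos = F.mse_loss(z_anchor, z_same)                      # pull same-speaker encodings together
    neg = F.mse_loss(z_anchor, z_other)                     # push different speakers apart...
    contrastive = pos + torch.clamp(margin - neg, min=0.0)  # ...up to the given margin
    cycle = F.mse_loss(z_from_output, z_anchor)             # the output should keep the speaker identity
    return contrastive, cycle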

Given that the speaker network has to encode the complex information of a whole voice, a shallow two-layer linear network is not used here, contrary to the theme of the proposed method. The speaker recognition network consists of several convolutional layers followed by an average pooling layer over time to reduce the output to a single speaker encoding.

Priming

A common theme in recurrent neural networks is the presence of a memory, which at the beginning of a sequence holds its initial value, independent of the sequence to be generated. This means no contextual information has been stored yet. The hidden state can also be initialized using the hidden state from a previous run on a different sample. This technique is labeled priming. Given that voice, prosody and emotion mostly stay consistent within a given sample, the intuition is that RNNs memorize these in the hidden state, so a particular style could be enforced by giving an example sequence. Our experiments during the development process did not show a noticeable subjective difference in the resulting output when using other very similar samples to prime the memory. However, the original authors show that priming can lead to largely variable output. Thus it is left to further investigation how different the priming samples need to be to produce a noticeable difference.

2.4.2 Factorized Hierarchical Variational Autoencoder

The proposed Factorized Hierarchical Variational Autoencoder (FHVAE) [39, 40] architecture takes inspiration from the VAE model as described in Section 2.3.2. Contrary to the standard VAE, the FHVAE model has two disentangled hidden variables with different semantics.


FHVAEs are based on the assumption of the multi-scale nature of speech data. In detail, phonemes represent localized information, while the speaker or the noise environment corresponds to global information. This hierarchical nature can be modeled in an unsupervised manner by applying different bottlenecks per time scale. The goal of the model is thus to learn to disentangle the hidden distributions for a local time scale and a globally consistent time scale. Therefore, two random hidden variables are introduced instead of only one as in VAEs.

First, the random variable z2 represents global information like the speaker or recording condition and will later be used to condition the network. The second random variable z1 represents the remaining local information. Both are parameterized by Gaussian distributions. The authors of Hsu and Glass [39] chose a single-layer LSTM [41, 42] network followed by a linear transformation to parameterize the Gaussian distributions for z1, z2 and the output distribution. The full architecture can be seen in Figure 2.5.


Figure 2.5: The FHVAE inference architecture consists of two encoders (orange and green) and one decoder (blue). x = [x_1, ..., x_20] is a segment of 20 frames. Dotted lines in the encoders denote sampling from normal distributions.

The main point is that the hidden variable z1 is generated based on a global prior similar to the standard VAE, whereas the second hidden variable z2 depends on a sequence-dependent prior conditioned on a sequence-level hidden variable µ2. Following the VAE framework, the model is built as a sequence-to-sequence model with an inference or encoder network which infers the hidden variables and a decoder or generation network which reproduces the given samples from the hidden variables. The model only takes either a log- or a Mel-spectrogram as input. In order to introduce a notion of local and global, the utterances, here labeled as sequences X^(i), are split into several segments x^(i,n). A single z1 and a single z2 are sampled per segment. During training, a hidden distribution for z2 is learned which is consistent within segments of the same sequence. This is followed by the encoding network for z1, which is given z2 as input and is therefore encouraged to contain the remaining factors which change between segments.

In the variational autoencoder framework the inference of the exact posterior of the hidden variables and network parameters is intractable; therefore an inference model q_φ(Z1^(i), Z2^(i), µ2^(i) | X^(i)), which approximates the true posterior p_θ(Z1^(i), Z2^(i), µ2^(i) | X^(i)), is introduced for variational inference [20].

This approximate model is constructed so that the evidence lower bound can be computed, which in turn is maximized to approximate the true posterior as closely as possible. The complete derivation of the lower bound can be found in Hsu, Zhang, and Glass [40]. The eventual lower bound in Equation 2.2 does not depend on the whole sequence, given that the estimation of the second hidden variable's mean µ2 has been replaced with a cache h_µ2(i), where i indexes the training sequences. The inference model for µ2 then becomes q(µ2|X^(i)) = N(h_µ2(i), I) [39]. During training, the whole sequence is never seen by the model at once; only the hidden variable µ2 encodes sequence information.

\mathcal{L}(p, q; x^{(i,n)}) = \mathcal{L}\big(p, q; x^{(i,n)} \mid h_{\mu_2}(i)\big) + \frac{1}{N^{(i)}} \log p\big(h_{\mu_2}(i)\big)    (2.2)

Given this model, we can now optimize the different distributions for the different variables. For a forward pass through this model, the outputs of these variables are sampled from the corresponding distributions. This operation is not differentiable. However, we can choose to set all distributions to be Gaussian, which allows us to use the reparameterization trick [20] to be able to train the whole model using gradient descent.

Yet, no part of the network forces the different means µ2 in the lookup table to be different for different sequences, and the prior is maximized at zero mean. However, the goal for z2 is to contain sequence-level information which differentiates it from other sequences. Therefore a discriminative objective, Equation 2.3, is introduced that favors z2 from the same sequence being close to its own µ2 while also encouraging it to be far from the µ2 of all other sequences [39].

\log p\big(i \mid \bar{z}_2^{(i,n)}\big) := \log \frac{p\big(\bar{z}_2^{(i,n)} \mid \bar{\mu}_2^{(i)}\big)}{\sum_{j=1}^{M} p\big(\bar{z}_2^{(i,n)} \mid \bar{\mu}_2^{(j)}\big)}    (2.3)

To arrive at the final objective function, the lower bound is combined with the discriminative objective via a weighting parameter α; the resulting objective is then maximized by the FHVAE model.

\mathcal{L}_{\text{total}}(p, q; x^{(i,n)}) = \mathcal{L}(p, q; x^{(i,n)}) + \alpha \log p\big(i \mid \bar{z}_2^{(i,n)}\big)    (2.4)

However, the denominator of the discriminative objective depends on the number of training sequences. This influences the weighting between the variational lower bound and the discriminative objective, so α has to be adjusted for each dataset. Furthermore, for every sample in the training batch, the gradient depends on every entry in the embedding table for µ2. To be able to train the FHVAE on a larger training set, Hsu and Glass [39] introduced a hierarchical sampling approach. In this approach, only a limited number of sequences, defined by a hyperparameter K, are kept in the lookup table. When these K sequences are drawn from the training set, the corresponding µ2 are estimated using the current model. Then the normal optimization steps are run for a number of steps defined by another hyperparameter N. Thus the model can be scaled to larger amounts of training data.
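As a sketch, the discriminative term of Equation 2.3 can be computed as a log-softmax of Gaussian log-likelihoods of each sampled z2 against the cached µ2 table; the unit-variance assumption and the tensor layout below are illustrative.

import torch

def discriminative_objective(z2, mu2_table, seq_idx, var=1.0):
    """Eq. 2.3 as a sketch: log p(i | z2) under Gaussians centred at each cached mu2.

    z2:        (batch, d)  sampled sequence-level latents
    mu2_table: (M, d)      cached mu2 of the M sequences currently in the lookup table
    seq_idx:   (batch,)    index i of the sequence each z2 was drawn from
    """
    # log N(z2 | mu2_j, var * I) for every entry j, up to a constant that cancels in the softmax
    log_lik = -0.5 * torch.cdist(z2, mu2_table).pow(2) / var            # (batch, M)
    log_post = log_lik - torch.logsumexp(log_lik, dim=1, keepdim=True)  # normalize over the table
    return log_post.gather(1, seq_idx.unsqueeze(1)).mean()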

2.4.3 Vector Quantised-Variational AutoEncoder

Another ongoing line of research in unsupervised learning is finding discrete latent codes rather than the continuous representations common in a VAE setup. One proposed model is the Vector Quantised-Variational AutoEncoder (VQVAE) [43]¹, which uses vector quantization (VQ) to learn a discrete latent representation. Using discrete latent codes allows the model to circumvent a common issue in VAEs when paired with a powerful autoregressive decoder, called 'posterior collapse'. In this phenomenon, the variational posterior of the hidden variables collapses to the prior and the generative model ignores the latent variables.

The idea behind this method is relatively simple. An encoder produces a continuous representation ze(x) of the input speech. That representation is quantized, and a decoder is trained to reconstruct the original input from the quantized embedding. The quantization is achieved by clustering and saving the centers of the clusters as embeddings e in a codebook. In the quantization step the representation ze(x) is mapped to the nearest element ek in the codebook, as given by eq. 2.5.

z_q(x) = e_k, \quad \text{where } k = \operatorname{argmin}_j \lVert z_e(x) - e_j \rVert_2    (2.5)

¹ Even though the name contains "Variational", no parameterized distributions are used in this approach.
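As an illustration of eq. 2.5, the following minimal PyTorch sketch performs the nearest-neighbor lookup for a batch of encoder outputs; the tensor shapes are assumptions chosen for illustration only.

import torch

def quantize(z_e, codebook):
    # z_e: (batch, time, d) continuous encoder outputs; codebook: (K, d) embeddings e_1..e_K.
    expanded = codebook.unsqueeze(0).expand(z_e.size(0), -1, -1)   # (batch, K, d)
    distances = torch.cdist(z_e, expanded)                         # l2 distance to every code
    indices = distances.argmin(dim=-1)                             # (batch, time) discrete codes
    z_q = codebook[indices]                                        # (batch, time, d) quantized outputs
    return z_q, indices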


During the forward pass, the raw signal first gets compressed into a smaller hidden encoding by the encoder. This encoding is then mapped to the nearest centroid. These centroids are part of the model, are learned during training, and constitute the discrete codes. A powerful decoder is then used to reconstruct the original input.

In the case of an audio signal, the proposed model uses several layers of 1d-convolutions with d filters followed by a ReLU(x) = max(0, x) activation function. Each convolution has a kernel size of 4 and a stride of 2. Thus, every layer halves the frequency of the signal. In the original paper either 6 or 7 convolutions are used, which leads to an either 64 or 128 times smaller time-frequency encoding of the signal into d-dimensional vectors. For every timestep the closest embedding by l2-norm is selected. This is followed by the decoder reconstructing the sequence of discrete codes. The original authors use a WaveNet [44] architecture as the powerful autoregressive model, which has been shown to model raw speech very well.
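A minimal sketch of such an encoder stack is given below; the channel width d = 64, the padding, and the single-channel waveform input are illustrative assumptions.

import torch.nn as nn

def make_vqvae_encoder(d=64, num_layers=6):
    # Strided 1d convolutions: kernel size 4, stride 2, each layer halving the time resolution
    # (2**6 = 64x reduction for 6 layers, 2**7 = 128x for 7 layers).
    layers = []
    in_channels = 1                     # raw waveform as a single input channel
    for _ in range(num_layers):
        layers.append(nn.Conv1d(in_channels, d, kernel_size=4, stride=2, padding=1))
        layers.append(nn.ReLU())
        in_channels = d
    # input: (batch, 1, samples), output: (batch, d, samples / 2**num_layers)
    return nn.Sequential(*layers)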

However, this mapping operation on the codebook or embedding is not differentiable. Neither the embedding nor the encoder could be optimized with gradient descent this way. Therefore, Oord, Vinyals, and Kavukcuoglu [43] propose a 'trick' to make the model differentiable. To optimize the encoder, the gradient from the decoder is copied straight to the encoder. Since the encoder output and the discrete code share the same dimensionality, the assumption is that this should work: given that they are close in space, the gradient of the cluster still carries useful information for the encoder output. Yet, this mapping operation does not modify the codebook. Thus, in order to learn the codebook, the selected cluster centers are moved closer to the encoder outputs ze(x) by the optimizer, using the l2 distance as loss; a minimal sketch of both parts follows.
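The following PyTorch sketch illustrates the gradient-copying trick and the codebook loss; the commitment term with weight beta is taken from the original paper [43], and the concrete weight is an assumption.

import torch

def straight_through(z_e, z_q):
    # Forward pass uses the quantized z_q, backward pass copies the gradient to z_e.
    return z_e + (z_q - z_e).detach()

def vq_losses(z_e, z_q, beta=0.25):
    # Codebook loss: pull the selected embeddings towards the (frozen) encoder outputs.
    codebook_loss = torch.mean((z_q - z_e.detach()) ** 2)
    # Commitment loss: keep the encoder outputs close to the (frozen) selected embeddings.
    commitment_loss = beta * torch.mean((z_e - z_q.detach()) ** 2)
    return codebook_loss + commitment_loss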

Learning the codebook by pulling the selected centers towards the encoder outputs resembles learning the embedding clusters with a nearest-neighbor approach. This suggests other clustering algorithms could be used as well. For example, there is recent research in using the expectation-maximization framework [45] to cluster in the latent space. The original authors mention that an exponential moving average [43] variant, similar to a batched k-means algorithm, could help the model to learn good embeddings faster and make the optimization less volatile.
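A minimal sketch of such an exponential-moving-average update is shown below, assuming the flattened encoder outputs and their selected indices have already been computed with the nearest-neighbor lookup; the decay and smoothing constants are illustrative assumptions.

import torch

@torch.no_grad()
def ema_codebook_update(codebook, ema_count, ema_sum, z_e_flat, indices, decay=0.99, eps=1e-5):
    # codebook: (K, d), ema_count: (K,), ema_sum: (K, d), z_e_flat: (N, d), indices: (N,)
    K, _ = codebook.shape
    one_hot = torch.nn.functional.one_hot(indices, K).type_as(z_e_flat)   # (N, K)
    ema_count.mul_(decay).add_(one_hot.sum(0), alpha=1 - decay)           # per-code usage counts
    ema_sum.mul_(decay).add_(one_hot.t() @ z_e_flat, alpha=1 - decay)     # per-code summed outputs
    # Laplace smoothing avoids division by zero for rarely used codes.
    n = ema_count.sum()
    count = (ema_count + eps) / (n + K * eps) * n
    codebook.copy_(ema_sum / count.unsqueeze(1))                          # batched k-means-like step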


Figure 2.6: Left: A diagram of the VQ-VAE. Right: Visualization of the embedding space. The output of the encoder ze(x) is mapped to the nearest point e2. The gradient ∇zL (in red) will push the encoder to change its output, which could alter the configuration in the next forward pass.

The point of the model is to learn local discrete features in the given audio signal without providing similar supervised features like phonemes. However, this disregards global factors which are constant over a given sequence. In the case of multiple speakers, the decoder would benefit from information about the speaker to be synthesized. In a fully unsupervised manner, a speaker recognition network that produces a speaker encoding could be added. However, in practice, speaker labels are easy to acquire and contain little noise. Thus a speaker recognition network can be replaced by either a one-hot encoding of the different speakers or, better, a learned embedding. This not only adds the possibility to add unknown speakers but can also be used to investigate how the model uses the speaker information. Additionally, a learned speaker embedding can be used to generate 'fake' speakers by sampling from the embedding space.

In the model, the speaker encoding is concatenated to the output of the quantization step. This helps the model to focus on the residual local factors independently of the speaker. Furthermore, since the encoder only encodes local features, one can use the local encodings of one utterance and combine them with the speaker encoding of a different speaker to resynthesize the original utterance in a different speaker's voice.

2.5 Feature extraction

With the popularity of deep learning, more and more approaches that learn from raw pulse-code modulation (PCM) audio signals have gained traction [46, 47, 48]. However, these models usually need considerably more resources than comparable methods using extracted features, given the high time resolution of audio signals.


Common digital signal processing (DSP) tools can help to reduce the signal resolution considerably by increasing the information density in the signal. A spectrogram is a lossy representation of a signal in the frequency domain. To transform a signal into the frequency domain, the Fourier transform is applied to the signal in the time domain. This results in a decomposition of the frequencies present in the whole audio signal. However, since speech is a non-stationary signal, its frequency composition changes over time. To capture this change, the audio is split into a number of frames or windows in the time domain within which the frequency composition is approximately constant. Usually, overlapping frames are chosen to reduce artifacts at the frame boundaries. The change over time is then captured by the difference from frame to frame. This process is generally referred to as the short-term Fourier transform (STFT). Given both the amplitude and the phase information, the transformation is lossless. In practice, however, the phase is discarded, since it can be approximated from the amplitude and the framing parameters alone and therefore does not carry much additional information when small windows are used. The phase information is discarded by taking the absolute value or the square of the STFT output, which is then labeled a spectrogram.

Given that the objective is to recognize human speech and humans do not perceive all frequencies equally, the size of the frequency domain can be further reduced. The Mel scale is a set of frequencies that follows human perception: they are spread with equal perceived distance on the frequency scale. Thus a filter bank following the Mel scale is applied, which mainly squashes higher frequencies together since they are less distinguishable for humans.

A common approach to further compress the signal is to apply a discrete cosine transform (DCT) to the Mel-spectrogram output. The result is commonly referred to as Mel-frequency cepstral coefficients (MFCCs). This has the effect of decorrelating the different features of the representation. However, it also makes the representation more susceptible to noise.
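As an illustration of this feature extraction chain, the sketch below uses the librosa library; the file name as well as the window, hop, and filter-bank sizes are assumptions chosen for illustration only.

import librosa

# 16 kHz audio, 25 ms windows (n_fft=400) with a 10 ms hop, 40 Mel bands.
y, sr = librosa.load("keyword.wav", sr=16000)
mel = librosa.feature.melspectrogram(y=y, sr=sr, n_fft=400, hop_length=160, n_mels=40)
log_mel = librosa.power_to_db(mel)                  # log-Mel spectrogram
mfcc = librosa.feature.mfcc(S=log_mel, n_mfcc=13)   # DCT of the log-Mel bands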

Vocoder

Similar to how different representations are used in Section 2.5 to perform pattern recognition, different representations are used when the goal is to synthesize raw audio. The choice of the vocoder is mainly relevant for the subjective inspection of generated samples.

Griffin-Lim In the use case of speech recognition, the phase part of the spectrogram representation is discarded given the limited information it contains.


However, the spectrogram representation is only invertible given this phase part. One common technique to approximate it was introduced by Griffin and Lim [49] and is commonly labeled Griffin-Lim. In this algorithm, the phase part of the spectrogram is initialized with noise and then iteratively refined by inverting the spectrogram and performing the STFT again, while keeping the amplitude part constant.
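A minimal sketch of this phase reconstruction, using for example the implementation shipped with librosa, could look as follows; the file name and the STFT parameters are illustrative assumptions.

import numpy as np
import librosa

y, sr = librosa.load("sample.wav", sr=16000)
magnitude = np.abs(librosa.stft(y, n_fft=1024, hop_length=256))   # amplitude only, phase dropped
# Iteratively re-estimate the phase while keeping the amplitude fixed.
y_reconstructed = librosa.griffinlim(magnitude, n_iter=60, hop_length=256)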

WORLD Vocoder Unlike Griffin-Lim, WORLD uses different semantic components directly targeted at reconstructing raw audio containing speech. First, the fundamental frequency f0, or pitch, is estimated in short time intervals. Further features are Mel-Generalized Cepstral Coefficients (MGCEP or MGC)² and band aperiodicities (BAP). These features can then be used to resynthesize subjectively more natural sounding speech [50].
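The WORLD analysis and synthesis cycle can be sketched with, for example, the pyworld Python bindings; the file name is a placeholder, and the MGC and BAP features mentioned above would additionally be derived from the spectral envelope and aperiodicity returned here.

import soundfile as sf
import pyworld as pw

x, fs = sf.read("sample.wav")          # mono float64 signal as expected by pyworld
f0, t = pw.harvest(x, fs)              # fundamental frequency (pitch) contour
sp = pw.cheaptrick(x, f0, t, fs)       # smoothed spectral envelope
ap = pw.d4c(x, f0, t, fs)              # aperiodicity
y = pw.synthesize(f0, sp, ap, fs)      # resynthesized waveform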

WaveNet The original WaveNet publication [44] proposed WaveNet as a complete TTS model using a very deep architecture with text as input. Yet, Shen et al. [30] proposed using a smaller version of the WaveNet architecture as a vocoder to invert Mel spectrograms to raw audio. This smaller WaveNet architecture can be trained on ground truth Mel spectrograms and then fine-tuned on predicted spectrograms to increase the audio quality further.

Caveats of WaveNet Given the autoregressive nature of the WaveNet architecture, a naive implementation has the major drawback of being slow at sampling in practice, mainly because WaveNet consists of many layers which have to be evaluated sequentially while producing a very high sampling rate. On an NVIDIA K80 GPU, 1 second of audio takes roughly 10 minutes to generate with an implementation in common deep learning frameworks. Generating several thousand samples this way would not be feasible. There exists a heavily engineered CUDA implementation from NVIDIA which enables inference with the WaveNet architecture faster than real time³. However, it offers limited flexibility.

² Not to be confused with MFCC features, commonly used in ASR, which tend to remove speaker-specific information and are not easily invertible.

³ https://github.com/NVIDIA/nv-wavenet


Chapter 3

Method

Figure 3.1: Overview of the changed data preparation pipeline when using a pre-trained voice generation network.

In order to train a standard keyword spotting model, one usually starts with a large keyword-specific dataset, adds augmentation to the audio samples, extracts some form of features, and then trains a keyword spotting model on these inputs. This pipeline is shown grayed out in Figure 3.1. In our approach, we consider the case where the large keyword-specific dataset is not available. Instead, we train different neural networks that can generate new samples. These networks are trained on LVCSR data which covers many speakers and pronunciations. These pre-trained networks can then be used to generate new samples from the few existing keyword samples, resulting in the changed data preparation pipeline shown in Figure 3.1.


In this section, we explore how this could be achieved with the methods described in Sections 2.4.1, 2.4.2 and 2.4.3. The three methods follow different approaches that can be used to tackle the similar problem of generating samples with high variability. They also differ in the extent to which they receive additional supervised information as input, as seen in Table 3.1. We focus on the underlying variability introduced by different speakers. However, different recording environments, distance to the microphone, etc. also have a major impact on the ability of the trained keyword spotting model to generalize well.

Method Name        Input representation                           Output representation
SpeakerVoiceLoop   Vocoder features & phonemes & speaker IDs      Vocoder features
FHVAE              Mel-spectrograms                               Mel-spectrograms
VQVAE              Raw audio & speaker IDs                        Mel-spectrograms

Table 3.1: Input and output representations for the different methods.

3.1 Speaker Adaptive VoiceLoop

First, we explore the Speaker Adaptive VoiceLoop model described in Section 2.4.1. We use vocoder features as the feature representation, as in Nachmani et al. [38]. Experiments with a less specific Mel-spectrogram representation for the VoiceLoop method are left to further studies. Thus we use the generated vocoder features to synthesize raw audio, from which in turn a Mel-spectrogram is extracted to be presented to the keyword spotting model. The input and target of the VoiceLoop network are normalized vocoder feature frames as well as a sequence of phonemes. These phonemes are force-aligned in order to obtain silence phonemes, yet the exact timing information is discarded. This is done using the merlin toolbox [51].

Following the referenced work, the training is done in two steps. In the first step, the target vocoder features are split into smaller sequences of 1 second while the phoneme input labels are kept. This allows using bigger batch sizes for the given memory. Additionally, a higher amount of noise, with a standard deviation (SD) of 4.0, is added. This is trained for 90 epochs. In the second


step, the target sequences are set to a maximum of 8 seconds and the amount of noise is halved to an SD of 2.0. The model is trained for another 90 epochs.

In order to generate training samples for the keyword spotting model, a phoneme representation for each keyword is first taken from the CMU Pronouncing Dictionary present in the festival toolkit¹. This phoneme representation is paired with a number of random samples from the training set that was used to train the VoiceLoop. The speaker encoding is extracted with the pre-trained speaker network from each of the random training samples. Together with the speaker encoding, a new sample is generated for the given keyword phoneme sequence. Thus, each sample should be generated based on different, highly variable speaker encodings.

3.2 Factorized Hierarchical Variational Autoencoder

The second explored method is the scalable variant of the Factorized Hierarchical Variational Autoencoder described in Section 2.4.2. The scalable variant is used in order to accommodate a bigger LVCSR dataset.

Similar to VoiceLoop, the model is first trained on an LVCSR dataset in order to learn the inference network, which predicts the global variable z2 and the local variable z1, as well as the generator network, which predicts the Mel-spectrogram from the hidden variables. However, contrary to the VoiceLoop, a few ground truth samples are needed which can then be augmented. In order to augment these samples with the pre-trained FHVAE, the keyword samples are first split into short segments as required by the FHVAE. Random samples from the training set of the pre-trained FHVAE are used to generate highly variable global or speaker encodings z2. These extracted speaker encodings z2 are then paired with the content encodings z1 extracted from segments of the same ground truth keyword sample. These are then used to generate new segments using the pre-trained decoder network, which are then fed as training data into the keyword spotting model.
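The following sketch outlines this recombination of content and speaker encodings; fhvae.encode() and fhvae.decode() are assumed stand-ins for the pre-trained model's interface, and the segments are assumed to be prepared beforehand.

import random

def augment_with_fhvae(fhvae, keyword_segments, style_pool, num_new=10):
    # keyword_segments: short segments of one ground truth keyword sample.
    # style_pool: segments drawn from the LVCSR training set of the pre-trained FHVAE.
    augmented = []
    for _ in range(num_new):
        # Global/speaker encoding z2 from a random training-set segment.
        _, z2 = fhvae.encode(random.choice(style_pool))
        # Content encodings z1 from the keyword segments, decoded with the new z2.
        new_segments = [fhvae.decode(fhvae.encode(seg)[0], z2) for seg in keyword_segments]
        augmented.append(new_segments)   # segments are concatenated in time afterwards
    return augmented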

3.3 Vector Quantised-Variational AutoEncoder

This section describes the modified VQVAE architecture used in the experiments. The experimental development process that led to this architecture is described

¹ http://www.cstr.ed.ac.uk/projects/festival/


in Section 4.0.1.

The original paper uses a convolutional encoder and a WaveNet architecture

as a decoder, as described in Section 2.4.3. However, as described in Section 2.5, a WaveNet implementation is limited by its autoregressive nature in a naive implementation. The heavily engineered open source implementation from NVIDIA is not flexible enough and quite limited in its maximum model size. Furthermore, the selling point of the WaveNet architecture is the possibility to generate raw audio; for our use case, it is sufficient to generate a Mel-spectrogram representation. Therefore other powerful decoder architectures are viable. For this work, we decided to take inspiration from the recently proposed architecture in Shen et al. [30], labeled Tacotron 2, which has been shown to produce good quality samples.

For the encoder, we mostly follow Oord, Vinyals, and Kavukcuoglu [43] by reducing the raw input to a 128 times smaller frequency using a stack of convolutions with a width of 4 and a stride of 2. Thus every layer halves the resulting number of samples in the time domain. However, the last layer of the encoder is an average pooling layer instead of a convolutional layer, which reduces the bias to fit local noise according to Dieleman, Oord, and Simonyan [52]. The output of the encoder is then fed through the VQ bottleneck. The original VQVAE paper [43] uses a nearest-neighbor approach to cluster the embeddings during training but also proposes an exponential moving average (EMA) as an alternative, which we use instead.

After the encoder outputs are quantized, they are concatenated with the global speaker embedding and then downsampled once more using a convolutional layer to fit the spectrogram time scale. This requires that the STFT used to extract the spectrogram produces the same number of output frames, so the STFT configuration has to match the network. This can be achieved either by resampling the raw input or by choosing an appropriate STFT configuration. We use a stride of 256 for the STFT in our experiments, given that the input frequency is 16 kHz.

As the powerful decoder network, a single-layer LSTM with 1024 hidden units is used at its core. Several modules following the Tacotron 2 architecture described in Section 2.3.4 are added. The output of the quantization step is fed as input to the LSTM. The LSTM output is then fed into a linear projection, which also receives the current encoder output as input. The output of the linear projection is the next output frame. Since the recurrent net cannot see into the future, a residual convolutional PostNet is added to increase output quality. The model is trained the same way as Tacotron 2, using an MSE loss on the post-net and decoder outputs as well as the same teacher forcing technique. In VoiceLoop's


teacher forcing method, noise is added to the previous frame to deter the network from predicting the next frame purely from it. Here, a different technique is used: the previous frame is first passed through a fully connected bottleneck layer with added dropout. Thus the network cannot reliably predict the next frame from just the previous one. Having a PreNet is crucial for successfully training the network. The final decoder architecture can be seen on the right side of Figure 3.2.

For the quantitative evaluation, the model is trained to predict Mel spectrograms, while log-magnitude spectrograms are used for subjective inspection. Input normalization of the spectrograms is used, which greatly helps the training speed.

Figure 3.2: Modified VQVAE architecture to generate Mel-spectrograms.

Generating keyword samples The pre-trained VQVAE model consists of an encoder-decoder pair with a learned quantization embedding as well as an embedding of the speakers of the training set. Similar to FHVAE, a few ground truth samples need to be present to be augmented. Each of the keyword ground truth samples is fed into the network as raw audio. After the content of the keyword sample is extracted in the quantization step, a random speaker embedding is used to decode a Mel-spectrogram representation from the speaker-sample combination. We only use speaker embeddings learned during training of the VQVAE network; completely random or interpolated speaker embeddings, as well as speakers newly fitted after the main training step, are left for further research. These new samples are then fed as training data into the keyword spotting model.
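A sketch of this generation step is given below; encode_and_quantize() and decode() are assumed stand-ins for the trained model's interface and not its actual method names.

import random

def generate_keyword_samples(vqvae, keyword_audio, speaker_embeddings, num_new=10):
    # Extract the discrete local content codes of the ground truth keyword once.
    codes = vqvae.encode_and_quantize(keyword_audio)
    samples = []
    for _ in range(num_new):
        speaker = random.choice(speaker_embeddings)    # embedding learned during VQVAE training
        samples.append(vqvae.decode(codes, speaker))   # Mel-spectrogram of the keyword in that voice
    return samples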


Chapter 4

Experiments

4.0.1 Modified VQVAE architecture

Decoder Architecture As described in Section 3.3, we used a decoder architecture inspired by Tacotron 2. Naively following the Tacotron 2 architecture described in Section 2.3.4 seems to work. However, training with location-sensitive attention is slow, and the model starts to ignore encoder outputs after half of the sequence outputs are generated. In a follow-up paper [33] the authors proposed to use a monotonic Gaussian mixture model attention. Given that there is no non-monotonic information flow in text-to-speech, this attention showed to be a better fit. This way, the attention is learned faster, but the model just learned a straight correlation between the encoder output position and the spectrogram index. This indicated that using an attention architecture does not add value here. And indeed, given that the STFT can be implemented as a convolution, the input and output alignment can be matched perfectly when matching hyperparameter configurations are used. The assumption that adding attention gives the model more degrees of freedom, and therefore variability in the timing of the output, did not hold true. In further work, this could be explored more, e.g. by adding an implicit bias for more variability in the output to the architecture.

Learning VQ embeddings As described in Section 3.3, we use an exponential moving average instead of a nearest-neighbor approach in our experiments. Using a moving average showed that the network uses more embeddings much faster, even when compared to a nearest-neighbor approach with a greatly increased batch size, obtained by training the model on multiple GPUs to fit the batches into memory. This can be seen as an indication that the EMA approach converges


faster to a stable embedding, similar to what is described in Dieleman, Oord, and Simonyan [52]. Furthermore, samples generated from a model trained with EMA have a noticeably higher subjective quality.

Using the approach proposed in Dieleman, Oord, and Simonyan [52] of replacing the VQ embedding with an encoding that is reduced to a one-hot encoding by an argmax operation did not yield usable results.

Note on Hyperparameters During development, log-magnitude spectrograms are used, which can be approximately inverted using Griffin-Lim [49], since Mel spectrograms are not as easily invertible. A higher dropout probability of 0.7 on the PreNet output, compared to the 0.5 proposed in Shen et al. [30], showed to be crucial for usable performance at inference. This tightens the bottleneck on the information flow from the teacher-forced previous spectrogram and forces the model to encode more information into the VQ embeddings. However, with the PreNet, and thereby teacher forcing, removed, the model failed to train.

Based on subjective results, the model seems to be robust to different sizes of the discrete embedding space and even to adding dropout after the quantization bottleneck. Furthermore, anecdotally, a smaller quantized embedding seems to help capture the speaker identity slightly better, possibly because less information can be captured in the quantized embedding due to its limited size. In the final experiments, a size of 32 is used. However, the differences between quantization embedding sizes were subjectively small, suggesting that the network is quite robust to changes in the size of the quantized embedding space. This might be because the main bottleneck is the number of embeddings; we did not experiment with the number of embeddings.

4.1 Evaluation Setup

4.1.1 Qualitative Evaluation

In machine learning research it is desirable to follow hard metrics that model the desired output well. In the case of synthesized speech, the best possible result would be speech which is indistinguishable from a real recording. However, there is no metric that closely approximates the quality perceived by humans. There are methods like Mel Cepstral Distortion (MCD) [53], Gross Pitch Error (GPE) [54], Voicing Decision Error (VDE) [54] or F0 Frame Error (FFE) [55] that give an indication of the quality of a generative model. These metrics


compare the produced sample to a ground truth. For voice conversion, unlike for TTS, this is not a useful metric, given that there are no ground truth examples for the converted samples. One could use a parallel corpus of speech data to approximate a ground truth conversion, but this would be an approximation of an approximation of the quantity we actually want to measure. If such a metric were used this way, one would have to make strong arguments for why the approach is justified and how it holds internal validity with respect to the quantity to be measured. Furthermore, to our knowledge, there are no instances in the literature that use it this way.

Another approach would be to focus on the reconstruction quality. However, this introduces a bias towards good reconstruction instead of good conversion of style. Therefore, similar methods that compare the generated samples with a ground truth are not used.

It is also common in the literature to judge generative models for images by the accuracy of another classification network [56]. One proxy network that is used in the voice conversion use case is a speaker recognition network. While a speaker recognition network would give a qualitative signal as to whether the converted sample is recognized as the same speaker or not, it does not quantify the naturalness, variety or quality. To our knowledge, there is no approach in the literature trying to solve this problem yet, thus evaluation strategies that use a speaker recognition network are left to further work.

These evaluation methods are only proxies for perceived human quality. Thus, the straightforward method in the literature is conducting a survey among a number of users who are queried for the perceived quality of a number of samples, which is then reported as mean opinion scores (MOS). However, since this is a time-consuming metric, it is more common to rely on one's own perceived quality and use that for judgment calls during the development process.

4.1.2 Quantitative Evaluation

In order to compare the different approaches, the experiments have to be done under the same or very similar conditions to make them properly comparable. The use case we are focusing on is the recognition of short speech commands. Compared to large vocabulary speech recognition, models for the recognition of speech commands are generally smaller, but also need very targeted datasets. Each speech command requires a dataset of a few thousand samples of only that spoken command. In practice, it is common to have more than 500 speakers repeating the keyword around 5 times as a minimum requirement for a well-performing classifier. More data points from a greater variety of speakers and


in different environments would obviously improve the performance of the model further.

Similarly, in general large vocabulary ASR, several hours of training data are used with different speakers, noise environments and microphone configurations. Models based on such datasets can extract the important local features to recognize the correct transcription while being robust to different global conditions. Therefore, they are able to generalize well to new conditions without needing a dataset specific to every use case. Furthermore, existing datasets for these sources [57, 58, 59, 60, 61] can be used to train usable models which also generalize to new use cases, and similar datasets exist for different languages. On the contrary, acquiring a good keyword dataset is quite resource intensive. First of all, for every keyword a new dataset is required, and second, a greater variety of speakers is required than in the general ASR use case. This all leads to a quite expensive process to acquire training data. Thus we chose speech commands as our use case to evaluate the models on.

An additional constraint for keyword spotting is the use of resource-efficient models. If a model with a large number of parameters were viable, a general ASR model could be used instead which only detects a given set of commands. However, given the resource constraints, one needs to rely on small models.

Baseline

One of the most prominent features of one's voice is the pitch or fundamental frequency, generally referred to as having a high or low voice. Changing the pitch can make it very hard for humans to recognize the person speaking, especially when only a few voice samples are provided. Given that it is one of the most prominent features of one's voice, it can also provide a baseline for other voice conversion algorithms, especially to give an indication of whether they learn additional features beyond modifying the fundamental frequency. Furthermore, ASR models should be invariant to faster or slower speech. This gives us a baseline to test the models against by modifying these features using common DSP techniques.

The baseline augmentation is performed with the SoX¹ framework. First, the pitch of the audio is changed randomly by one of [-4, -2, 0, 2, 4] semitones. The speed change (pitch and tempo combined) is drawn from a normal distribution with a variance of 0.001 and a cutoff at 0.9 and 1.1. The tempo change (tempo without pitch) is drawn from a normal distribution with a variance of 0.01 and a cutoff at 0.8 and 1.2.

¹ http://sox.sourceforge.net/
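A minimal sketch of this baseline, calling the SoX command line tool from Python, is shown below; the clipping of the random draws mirrors the cutoffs described above, while the effect ordering and file handling are assumptions.

import random
import subprocess

def baseline_augment(in_wav, out_wav):
    semitones = random.choice([-4, -2, 0, 2, 4])
    speed = min(max(random.gauss(1.0, 0.001 ** 0.5), 0.9), 1.1)   # pitch and tempo combined
    tempo = min(max(random.gauss(1.0, 0.01 ** 0.5), 0.8), 1.2)    # tempo without pitch
    subprocess.run([
        "sox", in_wav, out_wav,
        "pitch", str(semitones * 100),    # SoX expects cents, 100 cents per semitone
        "speed", str(speed),
        "tempo", str(tempo),
    ], check=True)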


Evaluation Model

The different models generate audio samples which are meant to be used as training data for a discriminative model. For this evaluation, it is assumed that the choice of the recognition model does not have an impact on the relative performance of the generative models. Thus, a relatively small network can be used to compare the different methods. In practice, a model used for keyword spotting would be heavily fine-tuned and iterated over to get the maximum performance, so this model will not give the absolute performance of an augmentation technique that would be used in practice. Yet, it is sufficient to compare the relative performance of these methods. The choice is also influenced by the fact that a small network with few parameters is comparatively fast to train to convergence and therefore resource-efficient.

One classical convolutional classification model is LeNet-5 [19], which consists of only 2 convolutional layers followed by 2 fully connected layers. The model is augmented with a dropout module after the second convolutional layer and after the first fully connected layer. Despite its simplicity, it is still possible to reach a reasonably high accuracy of 91% on the reference test set [62]. Additionally, this model converges in under 2 hours for 30 speech commands in our PyTorch implementation. All the approaches are benchmarked by training this model.
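A minimal PyTorch sketch of such a LeNet-5-style classifier over log-Mel spectrogram inputs is given below; the exact layer sizes are illustrative assumptions and not the configuration used in the experiments.

import torch.nn as nn

class KeywordLeNet(nn.Module):
    def __init__(self, num_classes=30, dropout=0.5):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 6, kernel_size=5), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(6, 16, kernel_size=5), nn.ReLU(), nn.MaxPool2d(2),
            nn.Dropout(dropout),              # dropout after the second convolutional layer
        )
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.LazyLinear(120), nn.ReLU(),
            nn.Dropout(dropout),              # dropout after the first fully connected layer
            nn.Linear(120, num_classes),
        )

    def forward(self, x):                     # x: (batch, 1, n_mels, frames)
        return self.classifier(self.features(x))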

4.2 Experiments

Datasets In our experiments we use four different datasets. To train the generating networks, VCTK [57], CMU ARCTIC [58] and Librispeech [59] are used. Each method is trained with a different dataset configuration, as seen in Table 4.1.

Method Name        Dataset                                          Dataset Size
SpeakerVoiceLoop   VCTK (106 speakers)                              43h
FHVAE              Librispeech train-clean-100                      100h
VQVAE              VCTK (108 speakers) & CMU ARCTIC (18 speakers)   44h + 18h

Table 4.1: Datasets used for the different methods.


4.2.1 Experiments: VoiceLoop

The authors of the original VoiceLoop paper provided an open source implementation of the standard VoiceLoop. However, the preprocessing script is failing and has many convoluted and redundant steps, so reproducing these steps proved to be a considerable challenge. We experimented with different preprocessing strategies. For one, we took the raw audio, removed the silences at the beginning and the end using a 35 dB threshold, and transformed the text labels to phoneme labels. To transform the text to phones, the US phoneset of 42 phones is used together with the default lexicon in the festival² toolset. These extracted phonemes do not contain labels for the silences present in the corresponding audio. In order to add silence labels, we experimented with a GMM forced aligner version of the Kaldi toolkit. Both of these approaches learned to produce audible speech on a few validation samples. However, for most validation samples the attention network failed to produce a dense attention at inference time. This made it impossible for the network to produce audible speech on these samples; instead, the network output is only noise.

Eventually, we went back to retrace and reconstruct the preprocessing pipeline used by the original authors using the merlin framework³. The merlin tool ehmm is used to force-align the extracted phonemes and to add silence labels. The audio features are extracted using the WORLD vocoder toolset [63]. Finally, a DNN duration modeling network is learned and used to predict more exact timings and to remove silences in the audio. The removal of silences in the audio using forced alignment seems to be crucial for the network configuration to work, since it is more accurate than a heuristic based on a dB threshold. A finding here seems to be that the proposed VoiceLoop model is not robust to silence in the training samples, and removing the silence at the beginning and end is not sufficient. TTS models commonly have problems when silence exists in the training data; however, VoiceLoop showed to be particularly fragile in this scenario. While the original authors only evaluate the model on a subset of 85 speakers of the VCTK dataset, the full VCTK dataset except for 2 speakers is used in our experiments, e.g. to see if the speaker extraction also transfers to unseen speakers. The samples of these speakers also serve as the validation set.

For the generation of data for the keyword spotting evaluation, each keyword is translated to a single phoneme representation. To vary the speaker, speaker embeddings are generated with the speaker network by selecting random files

² http://www.cstr.ed.ac.uk/projects/festival/
³ https://github.com/CSTR-Edinburgh/merlin


from the training set. This produces WORLD vocoder features, which are then used to synthesize raw audio. As input features for the keyword spotting network, log Mel spectrograms are extracted from this synthesized raw audio.

Two-step Training The samples are put into buckets by sequence length to reduce the amount of padding needed. The first training step is run for 73 epochs (73k steps) on the whole VCTK dataset with a batch size of 200. The vocoder feature frames are truncated to 1 second of audio, and noise with an SD of 4.0 is added. The weight decay on the speaker network is set to 0.0005; without this regularization, the network started to diverge during training. In the second training step, the sequence length is increased to 5 s and noise with an SD of 2.0 is added. The L2 regularization is reduced to 0.00001 and the batch size is reduced to 20 to be able to fit the network in memory. The positive and negative speaker samples for the speaker network are randomly truncated 1-second audio samples of the respective speakers. The rest of the hyperparameters follow the original paper [38].

4.2.2 Experiments: Scalable FHVAE

The authors of the original Scalable FHVAE provided an implementation⁴ which we followed for these experiments. The dataset used is the 100-hour clean audio training split of LibriSpeech. The cache of µ2 is restricted to 5000 entries. The original authors suggest training for a maximum of 100 epochs, or stopping if the lower bound does not improve for 10 steps. The best model is then determined by the maximum lower bound on the validation set. Repeated experiments showed that the model with the given setup does not converge to a stable equilibrium but rather diverges between the 45th and the 50th epoch. This occurs independently of the input features used. However, when the model is trained on log-magnitude spectrograms and synthesized to raw audio, the subjective evaluation shows that the model learned voice conversion comparable to the originally published work. The cause of this is left to further study.

4.2.3 Experiments: VQVAE

The VQVAE model, modified as described in Section 4.0.1, is trained on a joint dataset of VCTK and CMU ARCTIC without using the text transcriptions. This leads to a total of 126 different speakers. The model is trained for 40 epochs on the combined dataset. In order to monitor the usage of the discrete embedding indices,

⁴ https://github.com/wnhsu/ScalableFHVAE


a perplexity measure e^{H(p)} = e^{-\sum_x p(x) \ln p(x)} is used, where p(x) is estimated from the number of times embedding index x is selected. The maximum perplexity is reached when every index is used equally often, while the lowest perplexity means that only a single embedding index is used. A higher value is desirable here, since it indicates that the model encodes information over most discrete embeddings; a low perplexity indicates that the model ignores the local codes since they lack differentiation between embeddings. Given that the EMA approach is used to learn the embedding, the perplexity value converges in roughly 3 epochs.
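The perplexity itself can be computed from the selection counts as in the short sketch below.

import numpy as np

def codebook_perplexity(index_counts):
    # index_counts[k]: number of times embedding index k was selected in a batch or epoch.
    counts = np.asarray(index_counts, dtype=np.float64)
    p = counts / counts.sum()
    entropy = -np.sum(p[p > 0] * np.log(p[p > 0]))   # unused codes contribute nothing
    return np.exp(entropy)                           # uniform usage of K codes gives K, a single code gives 1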


Chapter 5

Results

In order to quantify the performance of the different models on the keyword spotting task, 3000 samples were generated for each keyword. For the unsupervised voice conversion approaches, a varying number of base samples to augment is given. As validation dataset, the speech commands dataset [62] is used. This dataset includes 30 one-word speech commands spoken by a large variety of speakers. The dataset is crowdsourced; thus the microphone quality is highly diverse and the recordings are mostly of people in front of their computers in different environments.

5.0.1 Speaker VoiceLoop

Figure 5.1: Visualization of evaluation metrics for the Speaker Adaptive VoiceLoop model. (a) Confusion matrix on the validation samples. (b) Accuracy per keyword.


In order to generate samples using the speaker-adaptive VoiceLoop implementation, every keyword is translated into a single phoneme representation. Each sample is then generated with a different speaker encoding, using a random sample from the VoiceLoop training set to extract the speaker identity. This results in an average accuracy of 51% when trained only on synthetic data.

Given that it is a generative model, artifacts might be present in the produced audio, which the keyword spotting model might learn; these artifacts would not be present in the validation samples. To see if this is the case, the model is trained with 20% additive white noise. This reduced the performance on the validation set, as seen in Figure 5.1. Thus, trivial artifacts are most likely not the reason for the poor generalization power. Experiments with adding the baseline augmentation technique described in Section 4.1.2 showed even worse performance. Furthermore, there is a large gap between the accuracies of different labels, as seen in Figure 5.1b. While a few labels show acceptable performance in disambiguating between the labels, other labels fail to provide significantly better accuracy than a random guess.

SpeakerVoiceLoop           0.5105
+ 20% white noise          0.4641
+ baseline augmentation    0.3780

Table 5.1: Accuracy on the keyword spotting task.

5.0.2 Voice Conversion Results

In order to evaluate the performance of the voice conversion models VQVAE and FHVAE, a fixed number of samples from the training set is used to see how the techniques perform with a changing number of base samples. In a real usage scenario, the goal is to reduce the number of speakers necessary to introduce variability. Therefore, the training samples are sorted by the speakers with the most samples in total. For each keyword, 10, 30, 50, 150 and 300 samples were drawn. These samples are then augmented using the respective technique to generate training data for the keyword spotting network. The eventual performance is based on the accuracies on the test set of the speech commands dataset.

Additionally, the speakers in the speech commands dataset appear to be mostly male. It is not documented how strong this bias is in the speech commands dataset, neither in the training nor in the validation set. Thus, for a second set of experiments, the most prominent speakers in the dataset are manually classified into male and female.


Figure 5.2: The accuracy of the given augmentation technique given x real samples. (a) Speakers selected by the largest number of samples in the dataset. (b) Speakers selected by the largest number of samples in the dataset, but balanced by gender.

The samples are then drawn such that the base samples form a set that is balanced by gender.

The results in Figure 5.2 show that VQVAE and FHVAE improve the performance on the keyword spotting task when few real samples are available, with up to a 20% relative or 10% absolute improvement, but only in cases where at most 150 samples are available per speech command. For cases with more than 150 available samples per command, the neural augmentation techniques have no effect or reduce the performance slightly. Furthermore, the results mainly show that a simple augmentation of the pitch performs better in most cases, and when a transformation of the speed through resampling is added, the difference is even greater.

An experiment on a self-recorded single-speaker dataset with 10 and 20 samples showed no significant improvement from VQVAE or FHVAE; the baseline augmentation approach improved the performance slightly.

5.1 Qualitative Evaluation

Given that no good metric exists to quantify the quality of the voice-converted samples produced by the models, subjective and anecdotal evidence is presented in this section to judge and compare the generated output of the three models. Some samples produced by the models during the development process can be found at https://pfriesch.github.io/mscthesis_samples/. These might not reflect the final performance, since different output representations


were used for synthesis instead of the Mel-spectrograms that were used to train the keyword spotting model.

5.1.1 Speaker VoiceLoop

Figure 5.3: Attention output of the VoiceLoop model. (a) Failing attention in the Speaker VoiceLoop model. (b) Successful attention in the Speaker VoiceLoop model.

The intended usage of the model is generating training data with high variance in the prosody of the utterance, given that different speaker samples are provided to the model. The Speaker Adaptive VoiceLoop model produces natural sounding speech in most cases. However, in a few cases, words or parts of words are not pronounced: either the words are skipped completely or replaced by a noise similar to a constant fricative consonant produced by the speaker. This noise is also produced at the end of the utterance instead of silence. This appears to happen when the attention is failing, i.e. when the attention weighting vector has a high variance or no clearly peaked center of focus. At the end of a sentence this makes sense, since all words have been produced and the model cannot focus on what to produce next. However, as seen in Figure 5.3, in some cases the attention completely fails to focus, which results in samples consisting purely of the mentioned noise. Furthermore, the use of the WORLD vocoder introduces a metallic sounding noise to all samples, which limits the inherent maximum performance of the model further.

A subjective listening test indicates that the model has learned an average prosody model with little difference in the resulting prosody or intonation. As seen in Figure 5.4, the generated samples are very similar, differing mostly in pitch (spacing of the bright stripes) and speed, whereas the ground truth samples are clearly a lot more varied in additional attributes.


Figure 5.4: Speaker Adaptive VoiceLoop samples compared to ground truth samples of the word "sheila". The top row shows randomly generated samples from the Speaker VoiceLoop model; the bottom row shows random real samples from the speech commands training set.

Accents are slightly noticeable. One factor that could contribute to this is that only very short samples are fed as input, which might not be sufficient to build a more varied pronunciation model. Figure 5.5 depicts a tSNE projection of the extracted speaker embeddings of all training samples. It shows that the model distinguishes between male and female speakers. However, the two groups are not clearly separated, and individual samples are assigned to the wrong gender, or rather to another speaker cluster. Similarly, the tSNE plot of speaker accents in Figure 5.5b shows more individual examples that get assigned to a wrong speaker cluster with a different accent. One reason could be that, given that only very short samples are used, the speaker or accent is most likely not easily identifiable in a few seconds; using longer samples during training and inference might help with this issue. In general, the speakers are well clustered and mostly clearly separated. There even seems to be some light local correlation by accent. Using priming did not seem to add variability to the synthesized output.


(a) Visualization by gender. (b) Visualization by accent.

Figure 5.5: tSNE visualization of the speaker encodings extracted from the training samples using the trained speaker extraction network of the Speaker Adaptive VoiceLoop.

5.1.2 FHVAE

Samples generated from the same style sequence but different content sequences are expected to sound very similar in style and speaker while the content differs. Subjectively, however, this is not the case for samples synthesized by the FHVAE model: there is little consistent speaker identity noticeable across the samples generated from the same style sequence, and a strong residual influence of the actual content sequence remains. On the other hand, while there is little consistency when fixing the style sequence, the samples produced from a single content sequence with varying style sequences are quite varied and sound like different speakers.
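
Conceptually, the conversion amounts to combining the content latents of one utterance with the style latent of another. The sketch below illustrates this under the assumption of a model object exposing encode_z1, encode_z2 and decode; these names are illustrative and do not correspond to a specific implementation.

```python
import numpy as np

def convert_voice(model, content_segments, style_segments):
    """Voice conversion by swapping latents, assuming the model exposes
    encode_z1 (segment-level content), encode_z2 (sequence-level style)
    and decode (z1, z2 -> spectrogram segment)."""
    # Style is summarised once from the style utterance ...
    z2 = np.mean([model.encode_z2(seg) for seg in style_segments], axis=0)
    # ... while content is taken segment by segment from the source utterance.
    converted = [model.decode(model.encode_z1(seg), z2) for seg in content_segments]
    # Naive concatenation; the seams between segments are exactly the local
    # artifacts discussed below.
    return np.concatenate(converted, axis=0)
```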

The naturalness of the generated examples comes close to that of the input samples. Yet, the model introduces noticeable artifacts in the resulting audio which are local and not consistent over the whole sequence. One reason could be that the model synthesizes segment by segment rather than the sequence as a whole, so it has no inherent bias towards consistency over the whole sequence. This results in errors at the connection points of the segments, as marked in Figure 5.6 and seen in Figure 5.9. Furthermore, the model produces a smooth spectrogram without noise, which follows from the choice of a Gaussian distribution at the output. Thus, there is no inherent bias in the model towards being indistinguishable from a real spectrogram in terms of noise structure.


Figure 5.6: Connection points between segments in a sample produced by the FHVAE model.1

One indication of how well the FHVAE disentangles the local hidden variables z1 and the global hidden variables z2 is how they form clusters: hidden states z2 of the same speaker should correlate strongly, while there should be little correlation among the z1 of a speaker. Furthermore, the location in the high-dimensional latent space is expected to carry semantic meaning, for example with male speakers clustering close to male speakers and female speakers close to female speakers. As seen in Figure 5.7, there is no real correlation recognizable in the local z1 variables. In the z2 space, clustering is clearly recognizable; however, the speakers do not seem to be tightly clustered but show high variance. One contributing factor could be that very short segments are used to predict the parameters of z2, which might not be enough to create more strongly correlated speaker clusters. This indicates that, while global features are found, they are not very descriptive of a speaker's unique identity but focus on features that correlate across different speakers, e.g. the pitch. This is seen in the samples in Figure 5.9, where the change in pitch is clearly visible as the spacing between the dominant frequencies.
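
Beyond visual inspection, the tightness of the speaker clusters in the z2 space could be quantified with a standard cluster-separability measure such as the silhouette score; the sketch below shows the idea on synthetic vectors that merely stand in for real z2 values.

```python
import numpy as np
from sklearn.metrics import silhouette_score

def speaker_separability(z, speaker_ids):
    """z: (N, D) latent vectors (e.g. z2 means per segment); speaker_ids:
    length-N labels. Scores near 1 mean tight, well separated speaker clusters."""
    return silhouette_score(z, speaker_ids, metric="euclidean")

# Toy data: three well separated blobs score high, pure noise scores near 0.
z2 = np.vstack([np.random.randn(50, 32) + 10 * i for i in range(3)])
ids = np.repeat(np.arange(3), 50)
print(speaker_separability(z2, ids))
```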

Furthermore, the model does not find an embedding with a clear distinction between male and female speakers. This suggests that the model does not cluster mainly based on gender-influenced features but on other factors, such as recording environment or noise type. This is also consistent with the observation that synthesizing samples with a constant z2 does not produce samples with a very similar speaker style, but samples that vary in different ways.

1 Audio content: "Yet the most charitable criticism must refuse these sectaries any knowledge of the pure and proper divinity of Christ."


(a) z1 labeled by Speaker. (b) z2 labeled by Speaker.

(c) z2 labeled by gender. Red for female and blue for male speakers.

Figure 5.7: tSNE projection of the z1 and z2 values extracted by the FHVAE from random segments of each sample of 20 random speakers in the validation set.

5.1.3 VQVAE

First, the reconstructed audio from the model trained on log magnitude spectrograms is judged. The naturalness of the input speech is preserved in the reconstruction, and so is the prosody: particular pronunciations of words or non-voiced sounds by the speaker of the original sample are present in the reconstruction under a different speaker ID. For instance, the accent of the input speaker is mostly preserved in the reconstruction, independent of the given speaker ID. Yet, the voice of the given speaker ID is clearly recognizable and consistent when other random samples to be converted are fed into the model. Still, the quality of the output audio is limited by the use of the Griffin-Lim algorithm. Beyond that, no major artifacts are obvious in the audio.
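
For reference, Griffin-Lim phase reconstruction of this kind is available in librosa; the snippet below is only a sketch, and the hop length, iteration count and output path are illustrative assumptions rather than the settings used in the experiments.

```python
import librosa
import soundfile as sf

def spectrogram_to_audio(mag, sr=16000, hop_length=256, n_iter=60):
    """Invert a linear magnitude spectrogram (freq x time) to a waveform using
    Griffin-Lim phase reconstruction and write it to disk. If the model outputs
    log magnitudes, undo the log compression before calling this."""
    wav = librosa.griffinlim(mag, n_iter=n_iter, hop_length=hop_length)
    sf.write("reconstruction.wav", wav, sr)
    return wav
```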

(a) Speaker projection by gender. (b) Speaker projection by accent.

Figure 5.8: tSNE projection of the learned speaker embeddings in the VQVAE model.

This is supported by visualizing the learned speaker embeddings. Male and female speakers are clearly separable, as seen in Figure 5.8a. Furthermore, a light clustering by accent is recognizable in Figure 5.8b. This suggests that the VQVAE model has learned a meaningful and information-rich representation of the speakers.

Another distinguishing factor is the synthesized spectrogram. As seen in Figure 5.9, the generated spectrograms are smooth and do not show the same kind of noise as the raw spectrograms. This follows from the fact that the model is trained with an MSE loss, so there is no inherent bias in the model towards being as indistinguishable as possible from a real spectrogram. In further work, one could try an adversarial objective to force the model to generate less distorted samples.

Another limiting factor is the number of speakers in the current setup: the number of different samples that can be generated from a base sample is given by the number of speakers in the speaker embedding. More speakers could be learned by fine-tuning the speaker embedding. Another possible approach would be to interpolate between speakers, or to choose completely random speaker vectors in the embedding space, in order to generate new artificial speakers, as sketched below.
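
A minimal sketch of both ideas is given below, assuming the learned speaker embedding matrix is available as a NumPy array; neither variant was evaluated in this work.

```python
import numpy as np

def interpolate_speakers(emb_a, emb_b, num_steps=5):
    """Linear interpolation between two learned speaker embedding vectors;
    each interpolated vector can be fed to the decoder as an artificial speaker."""
    alphas = np.linspace(0.0, 1.0, num_steps)
    return [(1.0 - a) * emb_a + a * emb_b for a in alphas]

def random_speaker(embedding_matrix):
    """Sample a new vector from a Gaussian fitted to the existing embeddings."""
    mean = embedding_matrix.mean(axis=0)
    cov = np.cov(embedding_matrix, rowvar=False)
    return np.random.multivariate_normal(mean, cov)
```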


Figure 5.9: Mel spectrogram of an audio sample with the transcription "marvin", with each row showing a random change of the spectrogram under the given augmentation method.


Chapter 6

Discussion

In this work, we explored three different generative approaches to produce audio sequences from a large variety of speakers. The first approach is a TTS system with a unique recurrent architecture combined with a jointly trained speaker recognition network. The model can generate speaker-adapted text-to-speech samples. Yet, the variety of the synthesized single-word samples is not great enough for a keyword spotting system trained on them to generalize well to a real dataset. This leads to the conclusion that either the samples are not realistic enough, there is not enough variety in the generated samples for the model to learn to discriminate the test data, or the model overfits on artifacts in the generated training data. However, augmenting the generated samples with additive white noise reduces the performance, which indicates that there are no obvious artifacts the model can overfit on during training. When the generated samples are augmented with the described baseline augmentation, i.e. by changing pitch and speed, the performance on the real-life test set drops further. One explanation could be that there are artifacts in the generated samples that get amplified by resampling or modifying the pitch, so that the models learn to discriminate based on these artifacts instead of patterns that are also present in real samples. Further work is therefore necessary to determine whether the samples are not realistic enough or whether they do not provide enough variance to help the model generalize to real data. Especially the fact that some keywords generalize very badly, while others achieve considerably higher performance on the test set, needs further exploration.
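
For reference, the kind of pitch, tempo and additive-noise manipulation referred to here can be implemented with librosa as in the sketch below; the parameter values are illustrative and not the exact ranges used in the experiments.

```python
import numpy as np
import librosa

def baseline_augment(wav, sr, pitch_steps=2.0, tempo=1.1, noise_std=0.0):
    """Pitch shift (in semitones), time stretch and optional additive white
    noise, roughly in the spirit of the DSP baseline described in the text."""
    out = librosa.effects.pitch_shift(wav, sr=sr, n_steps=pitch_steps)
    out = librosa.effects.time_stretch(out, rate=tempo)
    if noise_std > 0:
        out = out + np.random.randn(len(out)) * noise_std
    return out

# Usage: wav, sr = librosa.load("sample.wav", sr=16000)
#        augmented = baseline_augment(wav, sr, pitch_steps=np.random.uniform(-2, 2))
```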

The second explored approach belongs to a recent line of research on variational autoencoders which disentangle hidden variables on different time scales with different semantics. The evaluated scalable FHVAE is able to convert given samples to different voices or recording conditions given style sequences. Yet, given the samples of one speaker as style sequences, the different converted sequences do not clearly sound like the same speaker, but they do sound very different from each other. So even if the model lacks true voice conversion capability, the generated samples sound quite different, since the implicit bias given to the model is to find global features. Thus, in the keyword spotting task with only a small number of given samples, the model is able to add variety that aids generalization. However, for a larger number of given samples, which already contain some variety, the voice-converted samples did not improve generalization.

The third evaluated approach, the VQVAE, which finds local discrete encodings that are reconstructed together with a speaker ID, shows stronger voice conversion capability, given that the speaker information is provided to the network externally. Subjectively, the different voices are clearly distinguishable when different base samples are converted with the same speaker ID. However, the same local discrete features are extracted from each sample, independent of the speaker, so there is an implicit bias in the model against variance in time. Given different speaker IDs, the model mostly varies along the frequency axis. As seen in Figure 5.2b, it seems to be on par with augmenting only the pitch in the case of very few available samples. This suggests that the model has learned to generate samples with different fundamental frequencies while being limited to the learned speakers; the baseline pitch augmentation, on the other hand, applies random pitch shifts.
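
To make the role of the local discrete encodings concrete, the following is a minimal sketch of a VQVAE-style quantization step in PyTorch; shapes and names are illustrative and not taken from the implementation used in this work.

```python
import torch

def vector_quantize(z_e, codebook):
    """Nearest-neighbour codebook lookup as in a VQVAE bottleneck.
    z_e: (T, D) encoder outputs, codebook: (K, D) learned code vectors.
    Returns the code indices and the quantised vectors, using the
    straight-through estimator so gradients reach the encoder."""
    distances = torch.cdist(z_e, codebook)   # (T, K) pairwise L2 distances
    indices = distances.argmin(dim=1)        # discrete "phoneme-like" codes
    z_q = codebook[indices]                  # (T, D) quantised latents
    z_q = z_e + (z_q - z_e).detach()         # straight-through gradient
    return indices, z_q

# The decoder would then consume z_q together with a speaker embedding looked
# up from the speaker ID, which is where the voice identity enters.
```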

Furthermore, given the nature of the optimization of these neural network techniques, samples close to an average output are favored because they produce a lower mean error, so outliers are unlikely to be produced. However, in order to learn a good discriminative system, the full range of variance should be present in the training data, yet there is no bias in the evaluated models towards producing this high variance. One result of this is that smooth spectrograms are produced. Instead, more realistic samples should be favored, even if they have a lower likelihood. One approach for further exploration that would introduce this bias is to add an adversarial network that discriminates between real and generated samples. This would bias the models towards producing samples that are more realistic in terms of artifacts, although it does not necessarily help to produce more realistic-sounding speech.
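
A sketch of such an adversarial addition is given below: a small discriminator over spectrogram patches whose loss term would be added to the existing reconstruction objective. The architecture and sizes are purely illustrative.

```python
import torch
import torch.nn as nn

# Toy discriminator over fixed-size spectrogram patches.
disc = nn.Sequential(
    nn.Conv2d(1, 32, 3, stride=2, padding=1), nn.LeakyReLU(0.2),
    nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.LeakyReLU(0.2),
    nn.Flatten(),
    nn.LazyLinear(1),
)
bce = nn.BCEWithLogitsLoss()

def adversarial_losses(real_spec, fake_spec):
    """real_spec, fake_spec: (B, 1, F, T) spectrogram batches.
    Returns the discriminator loss and the extra generator term that would be
    added to the existing reconstruction (MSE) loss."""
    d_real = disc(real_spec)
    d_fake = disc(fake_spec.detach())
    d_loss = bce(d_real, torch.ones_like(d_real)) + bce(d_fake, torch.zeros_like(d_fake))
    g_term = bce(disc(fake_spec), torch.ones_like(d_fake))
    return d_loss, g_term
```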

Additionally, all models are trained on datasets of read speech or high-quality TTS datasets. These datasets offer high-quality recordings with clearly recognizable speech and little disturbance, so the generative models learn to reproduce this style of speech data. The speech commands dataset [62], on the other hand, consists mostly of low-quality recordings that resemble a real-world scenario. This means the neural networks learn to generate or translate samples towards the read-speech domain, which creates a mismatch in data domains. The baseline augmentation technique, in contrast, does not change the domain the samples are in, but rather adds pure variety.

In order to generate samples in different recording conditions, a data acquisition effort for more in-the-wild training data has to be undertaken. For unsupervised models no transcriptions would be needed; however, added samples would need to be checked for whether only one speaker is present, given that the methods should learn a single global encoding per sample. Since VQVAE and FHVAE are unsupervised approaches, this seems to be an interesting and scalable direction to explore further.

The results clearly show that generating training data using TTS or voice conversion involves different complexities than TTS or voice conversion on its own. In the standard use case of TTS or voice conversion, where the output is played to humans, the main objective is to produce the most natural-sounding samples with few artifacts and little noise; highly variable output samples are not a focus there. When generating training data, in contrast, the objective is to generate a large variety of samples with distortions, such as room reverberation, microphone quality, noise environment, and the distance between microphone and speaker.
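
One of these distortions, room reverberation, can for instance be simulated with pyroomacoustics [15]; the sketch below is illustrative only, with room geometry, reverberation time and positions chosen arbitrarily.

```python
import numpy as np
import pyroomacoustics as pra

def simulate_room(wav, sr, room_dim=(4.0, 5.0, 2.7), rt60=0.4,
                  src_pos=(1.5, 2.0, 1.6), mic_pos=(2.8, 3.5, 1.2)):
    """Convolve a clean sample with a simulated room impulse response."""
    absorption, max_order = pra.inverse_sabine(rt60, room_dim)
    room = pra.ShoeBox(room_dim, fs=sr, materials=pra.Material(absorption),
                       max_order=max_order)
    room.add_source(list(src_pos), signal=wav)
    room.add_microphone_array(pra.MicrophoneArray(np.array([mic_pos]).T, room.fs))
    room.simulate()
    return room.mic_array.signals[0]
```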

Furthermore, exploration in this direction turned out to require heavy use of intuition and educated guesses, given the lack of proper metrics that measure the intended performance of the different methods directly during training. Developing such metrics would be a valuable research direction that would accelerate further development in highly variable generative modeling.


Chapter 7

Conclusion

This work describes three approaches for synthesizing multi-speaker speech data and evaluates whether naively generating training data for a keyword spotting task can improve its performance. The three approaches are: (1) Speaker Adaptive VoiceLoop, a TTS model with fast adaptation to new speakers, (2) the Factorized Hierarchical Variational Autoencoder, a disentangled variational autoencoder for voice conversion which exploits the different time resolutions of speaker and phonetic information, and (3) the Vector Quantised-Variational AutoEncoder, an autoencoder that finds discrete phoneme-like features which can be converted to different voices. We propose a modification of the VQVAE to allow for faster parallel synthesis. We evaluate these models subjectively and discuss the differences in the resulting samples. We use the models to generate training samples for a keyword spotting task and evaluate the accuracy on a real-world test set when the keyword spotting model is trained solely on synthesized data. We show that the voice conversion models can improve the performance when only a few real samples are available, yet our chosen DSP baseline performs at least twice as well. We analyze why this might be the case and suggest further possible research directions.

Acknowledgements

I would like to thank the whole team at snips.ai, where the work for this thesis was done during an internship. I would especially like to thank Maël Primet and Théodore Bluche for their helpful discussions and advice.


Bibliography

[1] Alex Graves, Abdel-rahman Mohamed, and Geoffrey Hinton. “Speech Recognition with Deep Recurrent Neural Networks”. In: arXiv:1303.5778 [cs] (Mar. 2013). arXiv: 1303.5778 [cs].

[2] Awni Hannun et al. “Deep Speech: Scaling up End-to-End Speech Recognition”. In: arXiv:1412.5567 [cs] (Dec. 2014). arXiv: 1412.5567 [cs].

[3] Dario Amodei et al. “Deep Speech 2: End-to-End Speech Recognition in English and Mandarin”. In: arXiv:1512.02595 [cs] (Dec. 2015). arXiv: 1512.02595 [cs].

[4] Albert Zeyer et al. “Improved Training of End-to-End Attention Models for Speech Recognition”. In: arXiv preprint arXiv:1805.03294 (2018).

[5] W. Xiong et al. “The Microsoft 2017 Conversational Speech Recognition System”. In: arXiv:1708.06073 [cs] (Aug. 2017). arXiv: 1708.06073 [cs].

[6] Ronan Collobert, Christian Puhrsch, and Gabriel Synnaeve. “Wav2Letter: An End-to-End ConvNet-Based Speech Recognition System”. In: arXiv:1609.03193 [cs] (Sept. 2016). arXiv: 1609.03193 [cs].

[7] Tom Ko et al. “Audio augmentation for speech recognition”. In: Sixteenth Annual Conference of the International Speech Communication Association. 2015.

[8] Navdeep Jaitly and Geoffrey E. Hinton. “Vocal Tract Length Perturbation (VTLP) Improves Speech Recognition”. In: Proc. ICML Workshop on Deep Learning for Audio, Speech and Language. Vol. 117. 2013.

[9] Anton Ragni et al. “Data augmentation for low resource languages”. In: Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH. Jan. 2014, pp. 810–814. DOI: 10/gfxjr9. URL: https://www.repository.cam.ac.uk/handle/1810/279192 (visited on 03/28/2019).


[10] Xiaodong Cui, Vaibhava Goel, and Brian Kingsbury. “Data Augmentation for Deep Neural Network Acoustic Modeling”. In: IEEE/ACM Transactions on Audio, Speech, and Language Processing 23.9 (Sept. 2015), pp. 1469–1477. ISSN: 2329-9290, 2329-9304. DOI: 10.1109/TASLP.2015.2438544.

[11] Jean-Claude Junqua. “The Lombard Reflex and Its Role on Human Listeners and Automatic Speech Recognizers”. In: The Journal of the Acoustical Society of America 93.1 (Jan. 1993), pp. 510–524. ISSN: 0001-4966. DOI: 10.1121/1.405631.

[12] W. Verhelst and M. Roelands. “An Overlap-Add Technique Based on Waveform Similarity (WSOLA) for High Quality Time-Scale Modification of Speech”. In: 1993 IEEE International Conference on Acoustics, Speech, and Signal Processing. Vol. 2. Apr. 1993, pp. 554–557. DOI: 10.1109/ICASSP.1993.319366.

[13] N. Kanda, R. Takeda, and Y. Obuchi. “Elastic Spectral Distortion for Low Resource Speech Recognition with Deep Neural Networks”. In: 2013 IEEE Workshop on Automatic Speech Recognition and Understanding. Dec. 2013, pp. 309–314. DOI: 10.1109/ASRU.2013.6707748.

[14] Sercan O. Arik et al. “Convolutional Recurrent Neural Networks for Small-Footprint Keyword Spotting”. In: arXiv:1703.05390 [cs] (Mar. 2017). arXiv: 1703.05390 [cs].

[15] Robin Scheibler, Eric Bezzam, and Ivan Dokmanić. “Pyroomacoustics: A Python Package for Audio Room Simulation and Array Processing Algorithms”. In: 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2018, pp. 351–355.

[16] Guoguo Chen, Carolina Parada, and Georg Heigold. “Small-Footprint Keyword Spotting Using Deep Neural Networks”. In: 2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2014, pp. 4087–4091. DOI: 10/gfws9d.

[17] Tara N. Sainath and Carolina Parada. “Convolutional Neural Networks for Small-Footprint Keyword Spotting”. In: Sixteenth Annual Conference of the International Speech Communication Association. 2015.

[18] Kyuyeon Hwang, Minjae Lee, and Wonyong Sung. “Online Keyword Spotting with a Character-Level Recurrent Neural Network”. In: arXiv:1512.08903 [cs] (Dec. 2015). arXiv: 1512.08903 [cs].


[19] Yann LeCun et al. “Gradient-Based Learning Applied to Document Recognition”. In: Proceedings of the IEEE 86.11 (1998), pp. 2278–2324.

[20] Diederik P. Kingma and Max Welling. “Auto-Encoding Variational Bayes”. In: arXiv:1312.6114 [cs, stat] (Dec. 2013). arXiv: 1312.6114 [cs, stat].

[21] Irina Higgins et al. “Beta-VAE: Learning Basic Visual Concepts with a Constrained Variational Framework”. In: International Conference on Learning Representations. 2017.

[22] Chin-Cheng Hsu et al. “Voice Conversion from Non-Parallel Corpora Using Variational Auto-Encoder”. In: arXiv:1610.04019 [cs, stat] (Oct. 2016). arXiv: 1610.04019 [cs, stat].

[23] Ian Goodfellow, Yoshua Bengio, and Aaron Courville. “Generative Adversarial Networks”. In: Deep Learning, pp. 696–699.

[24] Chin-Cheng Hsu et al. “Voice Conversion from Unaligned Corpora Using Variational Autoencoding Wasserstein Generative Adversarial Networks”. In: arXiv:1704.00849 [cs] (Apr. 2017). arXiv: 1704.00849 [cs].

[25] Takuhiro Kaneko and Hirokazu Kameoka. “Parallel-Data-Free Voice Conversion Using Cycle-Consistent Adversarial Networks”. In: arXiv:1711.11293 [cs, eess, stat] (Nov. 2017). arXiv: 1711.11293 [cs, eess, stat].

[26] Sercan O. Arik et al. “Deep Voice: Real-Time Neural Text-to-Speech”. In: arXiv:1702.07825 [cs] (Feb. 2017). arXiv: 1702.07825 [cs].

[27] Sercan Arik et al. “Deep Voice 2: Multi-Speaker Neural Text-to-Speech”. In: arXiv:1705.08947 [cs] (May 2017). arXiv: 1705.08947 [cs].

[28] Wei Ping et al. “Deep Voice 3: 2000-Speaker Neural Text-to-Speech”. In: arXiv:1710.07654 [cs, eess] (Oct. 2017). arXiv: 1710.07654 [cs, eess].

[29] Yuxuan Wang et al. “Tacotron: Towards End-to-End Speech Synthesis”. In: arXiv:1703.10135 [cs] (Mar. 2017). arXiv: 1703.10135 [cs].

[30] Jonathan Shen et al. “Natural TTS Synthesis by Conditioning WaveNet on Mel Spectrogram Predictions”. In: arXiv:1712.05884 [cs] (Dec. 2017). arXiv: 1712.05884 [cs].

[31] Nitish Srivastava et al. “Dropout: A Simple Way to Prevent Neural Networks from Overfitting”. In: The Journal of Machine Learning Research 15.1 (2014), pp. 1929–1958.


[32] Yaniv Taigman et al. “VoiceLoop: Voice Fitting and Synthesis via a Phonological Loop”. In: arXiv:1707.06588 [cs] (July 2017). arXiv: 1707.06588 [cs].

[33] R. J. Skerry-Ryan et al. “Towards End-to-End Prosody Transfer for Expressive Speech Synthesis with Tacotron”. In: arXiv:1803.09047 [cs, eess] (Mar. 2018). arXiv: 1803.09047 [cs, eess].

[34] Wei-Ning Hsu et al. “Hierarchical Generative Modeling for Controllable Speech Synthesis”. In: arXiv:1810.07217 [cs, eess] (Oct. 2018). arXiv: 1810.07217 [cs, eess].

[35] Gustav Eje Henter, Xin Wang, and Junichi Yamagishi. “Deep Encoder-Decoder Models for Unsupervised Learning of Controllable Speech Synthesis”. In: arXiv preprint arXiv:1807.11470 (2018).

[36] Yuxuan Wang et al. “Style Tokens: Unsupervised Style Modeling, Control and Transfer in End-to-End Speech Synthesis”. In: arXiv:1803.09017 [cs, eess] (Mar. 2018). arXiv: 1803.09017 [cs, eess].

[37] Alex Graves. “Generating Sequences with Recurrent Neural Networks”. In: arXiv preprint arXiv:1308.0850 (2013).

[38] Eliya Nachmani et al. “Fitting New Speakers Based on a Short Untranscribed Sample”. In: arXiv:1802.06984 [cs, eess] (Feb. 2018). arXiv: 1802.06984 [cs, eess].

[39] Wei-Ning Hsu and James Glass. “Scalable Factorized Hierarchical Variational Autoencoder Training”. In: arXiv:1804.03201 [cs, eess, stat] (Apr. 2018). arXiv: 1804.03201 [cs, eess, stat].

[40] Wei-Ning Hsu, Yu Zhang, and James Glass. “Unsupervised Learning of Disentangled and Interpretable Representations from Sequential Data”. In: arXiv:1709.07902 [cs, eess, stat] (Sept. 2017). arXiv: 1709.07902 [cs, eess, stat].

[41] Sepp Hochreiter and Jürgen Schmidhuber. “Long Short-Term Memory”. In: Neural Computation 9.8 (Nov. 1997), pp. 1735–1780. ISSN: 0899-7667. DOI: 10.1162/neco.1997.9.8.1735.

[42] Felix A. Gers, Jürgen Schmidhuber, and Fred A. Cummins. “Learning to Forget: Continual Prediction with LSTM”. In: Neural Computation 12 (2000), pp. 2451–2471. DOI: 10.1162/089976600300015015.

[43] Aaron van den Oord, Oriol Vinyals, and Koray Kavukcuoglu. “Neural Discrete Representation Learning”. In: arXiv:1711.00937 [cs] (Nov. 2017). arXiv: 1711.00937 [cs].


[44] Aaron van den Oord et al. “WaveNet: A Generative Model for Raw Audio”. In: arXiv:1609.03499 [cs] (Sept. 2016). arXiv: 1609.03499 [cs].

[45] Aurko Roy et al. “Theory and Experiments on Vector Quantized Autoencoders”. In: arXiv preprint arXiv:1805.11063 (2018).

[46] D. Palaz, M. Magimai-Doss, and R. Collobert. “Convolutional Neural Networks-Based Continuous Speech Recognition Using Raw Speech Signal”. In: 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). Apr. 2015, pp. 4295–4299. DOI: 10.1109/ICASSP.2015.7178781.

[47] Neil Zeghidour et al. “Learning Filterbanks from Raw Speech for Phone Recognition”. In: arXiv:1711.01161 [cs] (Nov. 2017). arXiv: 1711.01161 [cs].

[48] Mirco Ravanelli and Yoshua Bengio. “Speaker Recognition from Raw Waveform with SincNet”. In: arXiv:1808.00158 [cs, eess] (July 2018). arXiv: 1808.00158 [cs, eess].

[49] D. Griffin and Jae Lim. “Signal Estimation from Modified Short-Time Fourier Transform”. In: IEEE Transactions on Acoustics, Speech, and Signal Processing 32.2 (Apr. 1984), pp. 236–243. ISSN: 0096-3518. DOI: 10.1109/TASSP.1984.1164317.

[50] Masanori Morise, Fumiya Yokomori, and Kenji Ozawa. “WORLD: A Vocoder-Based High-Quality Speech Synthesis System for Real-Time Applications”. In: IEICE TRANSACTIONS on Information and Systems 99.7 (2016), pp. 1877–1884.

[51] Yuxin Wu and Kaiming He. “Group Normalization”. In: arXiv:1803.08494 [cs] (Mar. 2018). arXiv: 1803.08494 [cs].

[52] Sander Dieleman, Aäron van den Oord, and Karen Simonyan. “The Challenge of Realistic Music Generation: Modelling Raw Audio at Scale”. In: arXiv:1806.10474 [cs, eess, stat] (June 2018). arXiv: 1806.10474 [cs, eess, stat].

[53] R. Kubichek. “Mel-Cepstral Distance Measure for Objective Speech Quality Assessment”. In: Communications, Computers and Signal Processing, 1993, IEEE Pacific Rim Conference On. Vol. 1. IEEE, 1993, pp. 125–128.


[54] Tomohiro Nakatani et al. “A Method for Fundamental Frequency Estimation and Voicing Decision: Application to Infant Utterances Recorded in Real Acoustical Environments”. In: Speech Communication 50.3 (2008), pp. 203–214.

[55] Wei Chu and Abeer Alwan. “Reducing F0 Frame Error of F0 Tracking Algorithms under Noisy Conditions with an Unvoiced/Voiced Classification Frontend”. In: 2009 IEEE International Conference on Acoustics, Speech and Signal Processing. IEEE, 2009, pp. 3969–3972.

[56] Tim Salimans et al. “Improved Techniques for Training GANs”. In: Advances in Neural Information Processing Systems. 2016, pp. 2234–2242.

[57] Christophe Veaux, Junichi Yamagishi, and Kirsten MacDonald. SUPERSEDED - CSTR VCTK Corpus: English Multi-speaker Corpus for CSTR Voice Cloning Toolkit. 2016. DOI: 10.7488/ds/1495. URL: http://datashare.is.ed.ac.uk/handle/10283/2119.

[58] John Kominek and Alan W. Black. “The CMU Arctic Speech Databases”. In: Fifth ISCA Workshop on Speech Synthesis. 2004.

[59] Vassil Panayotov et al. “Librispeech: An ASR Corpus Based on Public Domain Audio Books”. In: Acoustics, Speech and Signal Processing (ICASSP), 2015 IEEE International Conference On. IEEE, 2015, pp. 5206–5210.

[60] Keith Ito. The LJ Speech Dataset. https://keithito.com/LJ-Speech-Dataset/. 2017.

[61] M-AILABS. The M-AILABS Speech Dataset. Sept. 2018.

[62] Pete Warden. “Speech Commands: A Dataset for Limited-Vocabulary Speech Recognition”. In: arXiv:1804.03209 [cs] (Apr. 2018). arXiv: 1804.03209 [cs].

[63] Masanori Morise, Fumiya Yokomori, and Kenji Ozawa. “WORLD: A Vocoder-Based High-Quality Speech Synthesis System for Real-Time Applications”. In: IEICE TRANSACTIONS on Information and Systems 99.7 (2016), pp. 1877–1884.


Figure 1: Mel spectrogram of an audio sample with the transcription "sheila", with each row showing a random change of the spectrogram under the given augmentation method.


TRITA-EECS-EX-2019:194

www.kth.se