
One Million Agents Speaking All the Languages in the World

Miguel Ângelo António Ventura

Thesis to obtain the Master of Science Degree in

Information Systems and Computer Engineering

Supervisor: Prof. Maria Luísa Torres Ribeiro Marques da Silva Coheur

Examination Committee

Chairperson: Prof. João António Madeiras Pereira
Supervisor: Prof. Maria Luísa Torres Ribeiro Marques da Silva Coheur
Members of the Committee: Prof. Bruno Emanuel da Graça Martins

June 2018


Acknowledgments

First and foremost, I have to thank my research supervisor, Prof. Luísa Coheur. Without her assistance and dedicated involvement in every step of the process, this dissertation would never have been accomplished.

I would also like to thank the 44 volunteers involved in the survey of our research. Without their participation and input, the survey could not have been successfully conducted.

A word of thanks to all the L2F/INESC-ID members who offered me tips and opinions for my research. A special thanks to Vânia Mendonça and Marco Pereira for helping me out with the legacy systems that were the starting point for my thesis.

I would also like to acknowledge the developers of the OpenNMT-tf tool for their clear and fast responses to the issues and doubts I had while using it.

I should also thank all my friends, especially those who are finishing their Master's degree at the same time as I am. All the information and feedback we exchanged contributed to my research.

I am grateful to everyone at the Direcção de Serviços Informáticos of Instituto Superior Técnico for relaxing my working hours when I needed to dedicate my time to this dissertation.

Most importantly, none of this could have happened without my family. I must express my very profound gratitude to my parents (especially my mother) for providing me with unfailing support and continuous encouragement throughout my years of study and through the process of researching and writing this thesis. This accomplishment would not have been possible without them. Thank you.


Abstract

Currently, creating a conversational agent for a specific domain is an accessible task, but the resulting agents have restricted knowledge due to the human effort needed to manually introduce the data.

Movie and TV show subtitles are available for free in ever-growing databases. They constitute a remarkable data resource, distributed across more than 70 languages.

In this document, we propose B-Subtle - a novel tool for the automatic creation of corpora and the collection of analytical data from subtitles. Since different users might have different needs, we aim to provide a flexible system that can be fully parametrized through a configuration file. The generated corpora will serve as knowledge bases for conversational agents.

Besides the corpora generation tool, another system will be described - Say Something Deep. This system is capable of creating sequence-to-sequence models to answer questions posed by its users. It relies on neural networks to implement a generative approach, taking corpora generated with B-Subtle as its knowledge base.

Keywords

Dialogue Systems; Movie Subtitles; Information Extraction; Generative Models; Deep Learning.


Resumo

Atualmente, a criação de um agente de conversação para um domínio específico é uma tarefa acessível. Contudo, os agentes resultantes têm uma base de conhecimentos restrita devido ao esforço humano necessário para introduzir manualmente os dados.

Legendas de filmes e de programas de TV estão disponíveis gratuitamente em bancos de dados em constante crescimento. Elas constituem um recurso notável de dados distribuídos em mais de 70 idiomas.

Neste documento, apresentamos o B-Subtle - uma nova ferramenta para criação automática de corpora de interações pergunta/resposta e de extração de dados estatísticos a partir de legendas. Sendo que utilizadores diferentes podem ter necessidades distintas, o nosso objetivo é fornecer um sistema flexível que possa ser totalmente parametrizado por meio de um ficheiro de configuração. Os corpora gerados servirão como bases de conhecimento para agentes de conversação.

Além da ferramenta de geração de corpora, outro sistema será apresentado - Say Something Deep. Este sistema é capaz de responder a perguntas feitas por utilizadores, utilizando uma estratégia de geração de respostas após treinar modelos com arquiteturas de redes neuronais.

Palavras Chave

Sistemas de Diálogo; Legendas de Filmes; Extração de Informação; Modelos Gerativos; Aprendizagem Profunda.


Contents

1 Introduction
1.1 Goals
1.2 Contributions
1.3 Document Outline

2 Background
2.1 Subtle Corpus and Subtle Tool
2.1.1 Pre-processing
2.1.2 Dialogue Turns
2.1.3 Corpus Data Fields
2.1.4 Final Considerations
2.2 Say Something Smart
2.2.1 Indexing Subtle Corpus and Extracting Answers
2.2.2 Electing the Best Answer
2.2.3 Flexibility of the System
2.2.4 Final Considerations
2.3 Sequence-to-Sequence Models
2.3.1 Padding and Bucketing
2.3.2 Word Embeddings
2.3.3 Attention Mechanism
2.3.4 Greedy Search and Beam Search
2.3.5 Recurrent Neural Networks
2.3.6 Long-Short Term Memory Neural Networks

3 Related Work
3.1 Corpora
3.1.1 OpenSubtitles2016 Corpus
3.1.1.A Source Data
3.1.1.B Preprocessing
3.1.1.C Output files
3.1.2 Movie-Dic Corpus
3.1.3 Cornell Movie-Dialogs Corpus
3.1.4 Ubuntu Dialog Corpus
3.2 End-to-end Sequence to Sequence Models

4 B-Subtle
4.1 Architecture Overview
4.2 B-Subtle Parts Explained
4.2.1 Input Files
4.2.2 Meta-data Collectors
4.2.3 Filters
4.2.3.A Meta-data Filters
4.2.3.B Interaction Pairs Filters
4.2.4 Producers
4.2.5 Transformers
4.2.6 Output Files
4.2.7 Analytics
4.2.8 Configuration Files

5 Building Agents
5.1 Say Something Deep
5.1.1 Architecture
5.1.2 Neural Network Model
5.2 Preliminary Experiments
5.3 Main Experiments
5.3.1 Corpora
5.3.2 Say Something Deep Setup
5.3.3 Say Something Smart Setup
5.3.4 Agents

6 Evaluation
6.1 How to Evaluate our Agents?
6.2 Human Evaluation
6.3 Results
6.3.1 Overview
6.3.2 Short and Simple Answers
6.4 Summary

7 Conclusions
7.1 Contributions
7.2 Future Work

A Preliminary Experiments

B Survey
B.1 Survey description given to volunteers
B.2 Raw data with the questions and corresponding agent answers used in survey

C Sample Configuration File

List of Figures

2.1 Information flow in a Sequence-to-Sequence (seq2seq) model.

2.2 Architecture of a Recurrent Neural Network (RNN) with a loop, then unrolled after unfolding. Xt represents what is given as input; Yt is the output generated by the network.

4.1 Possible B-Subtle pipeline for OpenSubtitles 2016 files using all the components available.

5.1 Tensorboard - visualization tool for understanding, debugging, and optimizing the models being trained.

6.1 Example of a question in our survey (in Portuguese). Each line corresponds to an answer given by one of our agents that should be classified as valid, plausible or invalid.

6.2 Amount of “Sim.” (Yes.), “Não.” (No.) or “Não sei.” answers considered valid among all answers labeled as valid by the survey participants.

6.3 Amount of “Sim.” (Yes.), “Não.” (No.) or “Não sei.” answers considered plausible among all answers labeled as plausible by the survey participants.

6.4 Amount of “Sim.” (Yes.), “Não.” (No.) or “Não sei.” answers considered invalid among all answers labeled as invalid by the survey participants.

List of Tables

6.1 Percentage of valid, plausible and invalid answer classifications given to our 5 agents (Alpha, Beta, Charlie, Delta, and Echo) by the volunteers who filled out the survey.

6.2 Amount of “Sim.” (Yes.), “Não.” (No.) and “Não sei.” (I do not know.) answers returned by Agents Alpha, Beta and Charlie for the 100 questions included in the conducted survey.

6.3 Amount of “Sim.” (Yes.), “Não.” (No.) or “Não sei.” answers evaluated independently for each answer label present in our survey (valid, plausible and invalid). The agents relying on the new Say Something Deep (SSD) system for generating answers are included for comparison.

Listings

3.1 Structure of a subtitle file in .srt format

3.2 Simplified example of an OpenSubtitles2016 input file structure.

4.1 Example of a configuration file for the OpenSubtitles2016 pipeline

5.1 B-Subtle's configuration file to generate Corpus A.

5.2 B-Subtle's configuration file to generate Corpus B.

5.3 B-Subtle's configuration file to generate Corpus C.

A.1 B-Subtle's configuration file to generate a parallel corpus from all Portuguese OpenSubtitles2016 subtitles.

A.2 B-Subtle's configuration file to generate a parallel corpus from all Portuguese OpenSubtitles2016 subtitles with answer length above 25 characters.

A.3 B-Subtle's configuration file to generate a parallel corpus from all Portuguese OpenSubtitles2016 subtitles with Horror as a genre and a subtitle rating above 5.0.

Acronyms

AI Artificial Intelligence

AutoML Automatic Machine Learning

CA Conversational Agent

DOM Document Object Model

GPU Graphics Processing Unit

HTML HyperText Markup Language

HTTP Hypertext Transfer Protocol

IMDb Internet Movie Database

IMSDb Internet Movie Script Database

IP Internet Protocol

JSON JavaScript Object Notation

LSTM Long-Short Term Memory

NER Named Entity Recognition

NLTK Natural Language Toolkit

NL Natural Language

OCR Optical Character Recognition

RNN Recurrent Neural Networks

SAX Simple API for XML

seq2seq Sequence-to-Sequence


SSD Say Something Deep

SSS Say Something Smart

StAX Streaming API for XML

XML eXtensible Markup Language

YAML YAML Ain’t Markup Language


1 Introduction

Contents

1.1 Goals
1.2 Contributions
1.3 Document Outline


Artificial Intelligence (AI) is a field of Computer Science that has been constantly evolving since its beginnings in the 1950s. Easing the communication between humans and machines remains one of its highest ambitions.

For interactions between humans and machines to be successful, a common way of communicating must be defined. Humans communicate with each other using Natural Language (NL), a language that has developed naturally over time, in contrast with artificially created languages (used by machines and based on strict rules). This demands that machines be able to interpret and generate a language understood by humans. Conversational Agents (CAs) are one of the tools that can be used for this purpose.

Creating a CA for a certain domain is an accessible task, considering the tools available. For instance, Pandorabots1 is a web service for building and deploying this type of agent. However, the resulting agents have restricted knowledge due to the human effort needed to manually introduce the data.

One of the challenges is to develop methods that build sources of knowledge automatically, without the need for human intervention. The foremost task is to find appropriate information sources and then extract useful data from them. Subtitles from movies and TV shows are one of those sources. They are available on-line for free in ever-growing databases. OpenSubtitles2 is one of those databases, where the number of subtitle files available surpasses the 4 million mark, distributed across more than 60 languages. Considering its size and language variety, it is reasonable to say that it constitutes a remarkable resource for extracting valuable features (linguistically speaking), due to the broad range of genres it covers (action, comedy, horror, etc.) and its multiple types of discourse (narrative, slang, etc.).

The L2F/INESC-ID group has already built a corpus from subtitles - the Subtle Corpus [Ameixa et al., 2013]. It is composed of interactions (pairs of triggers3 and answers) extracted from 6 thousand English subtitle files and 4 thousand Portuguese subtitle files. A tool was used to automate the process of building the corpus - we will call it the Subtle Tool from now on. The resulting corpus, from now on the Subtle Corpus, can be used as a knowledge base for a CA. However, this corpus is limited to those two languages and has not been updated since with data from new subtitles. Moreover, the Subtle Tool used to generate the corpus does not allow customizing the corpus generation process without changing its implementation. Since subtitle files have other associated information (genre of the movie, release year, timestamps of turns, etc.), it could be useful to allow the collection and analysis of that data while building the corpus. The aim of our work was to include all of this meta-data, helping the end user to customize the corpus generation process.

Selecting the responses a CA gives when dealing with user input is another challenge. The best response should be chosen or generated from the built source of knowledge. This can be accomplished by using Deep Learning techniques, following either a retrieval approach or a generative approach.

1 https://www.pandorabots.com/
2 https://www.opensubtitles.org/
3 A turn extracted from subtitles that causes the next one to appear - the answer.

On the WildML blog4 we can find more about the pros and cons of each approach; essentially, retrieval models are simpler to implement. They work with a source of predefined responses and use some kind of heuristic to pick an appropriate response. However, they may be incapable of handling unseen cases. A dialogue system developed at the L2F/INESC-ID group relies on a retrieval-based approach: Say Something Smart (SSS) [Ameixa et al., 2013] uses the Subtle Corpus as its knowledge base. The system starts by matching the user input against a set of possible responses. Afterwards, it applies some weighted measures to rank them, and the one with the best score is returned to the user. However, when SSS (explained in Section 2.2) needs to select the best response, it is highly coupled with an internal scoring algorithm used by the search engine library5.

Generative models, on the other hand, are harder to implement but can generate responses of their own, although they are more likely to make grammatical errors or give irrelevant answers. Recently, an increasing number of studies have found that end-to-end CAs can be built by following purely data-driven approaches relying on neural models. One of our main goals is to create end-to-end CAs with corpora generated by B-Subtle, following an approach based on these recent studies.

4 http://www.wildml.com/2016/04/deep-learning-for-chatbots-part-1-introduction
5 Used for indexing the corpus files and retrieving results by making queries.

1.1 Goals

We will now describe the main goals of our work.

We aim to replace the existing Subtle Tool by providing the following features:

1. Creating corpora with interactions and associated meta-data, such as the genre of the movie/TV show, release year, spoken language, subtitle language, etc. This allows the generation of corpora that meet specific requirements: for example, including only interactions collected from movies with “Action” as a genre;

2. Supporting a larger and more recent set of subtitles as input for the corpora generation process;

3. Supporting multiple languages;

4. Processing text content from subtitles in a customized way. This allows the creation of corpora with custom interactions (e.g. accepting only interactions where the trigger ends with a question mark);

5. Generating different output formats for the corpus being generated, such as JavaScript Object Notation (JSON) or eXtensible Markup Language (XML) files. This enables end-users to choose the output format that best fits their needs;


6. Offering a user-friendly configuration: this allows end-users to completely control the behavior of B-Subtle when creating corpora by simply adjusting parameters in a configuration file;

7. Generating analytical data about movies, TV shows and subtitle files.

Since we are aiming to build one million agents speaking all the languages in the world, we are determined to provide an additional system for generating the best answer given a user input. We need to accomplish the following requirements:

1. Being open-domain, with the capability of generating answers to questions not present in the knowledge base;

2. Building knowledge bases in Portuguese by using new corpora of interactions created from subtitle files;

3. Supporting languages other than Portuguese and English.

1.2 Contributions

In order to achieve the aforementioned goals, we decided to build a revamped tool for generating corpora of interactions - B-Subtle. This tool offers automatic creation of corpora and collection of analytical data from subtitles. Since different end-users might have different needs, we provide a flexible system that can be fully parametrized through a configuration file. The generated corpora will serve as knowledge bases for conversational agents.

Besides the corpora generation tool, we also offer another tool - SSD. This system is capable of generating responses upon receiving some input. It relies on neural networks, using state-of-the-art seq2seq models. Corpora generated with B-Subtle can serve as the knowledge base for the conversational models created with SSD.

1.3 Document Outline

The remainder of the document is organized as follows: in Chapter 2, previous systems are described and some details about neural network architectures are given; in Chapter 3, we describe related work carried out in the scope of building corpora and seq2seq models for CA creation; in Chapter 4, we present the architectural details of our tool for building corpora - B-Subtle; in Chapter 5, we present our experiments with neural CAs using corpora generated with B-Subtle; in Chapter 6, we describe how we evaluated our agents; finally, in Chapter 7, we present our conclusions and a discussion of future work based on the current limitations of the systems used in our experiments.


2 Background

Contents

2.1 Subtle Corpus and Subtle Tool
2.2 Say Something Smart
2.3 Sequence-to-Sequence Models


Previous work has already been done at the L2F/INESC-ID group to extract information from subtitle files and build a corpus from them. Additionally, another system was developed to receive that corpus as input and select plausible responses to user requests. Those two systems are described next.

2.1 Subtle Corpus and Subtle Tool

The Subtle Corpus [Ameixa et al., 2013] is composed of interactions - pairs of triggers and answers - collected from movie subtitles available on the OpenSubtitles website. The most recent version of the Subtle Corpus [Magarreiro et al., 2014] was generated from almost 6,000 English subtitle files and 4,000 Portuguese subtitle files covering four different movie genres: Romance, Sci-Fi, Western and Horror. Processing these files resulted in a total of 5,693,811 English interactions and 3,322,683 Portuguese interactions.

Extracting interactions involves identifying whether the information is relevant or not. The purpose of building this corpus is to allow a CA to extract responses to user requests. Therefore, the actual content of the subtitles received special treatment for some particular cases, in order to generate a corpus with useful interactions.

2.1.1 Pre-processing

Subtitles may include special annotations for people with hearing impairment. When a movie character is talking but is not shown on the screen, their name usually appears at the beginning of the utterance, followed by a colon. This information is removed from the utterance before forming a new trigger/answer pair. Sound descriptions commonly appear too; they can usually be found in uppercase between square brackets. Since they are not an actual trigger or answer, they were discarded. Subtitle files can also have tags which can be parsed by video players to change the way fonts appear on the screen. As explained in [Magarreiro et al., 2014], these tags “almost always contained the name of the person that synced the subtitles with the movie”, so they opted to discard them all.

The system can also be configured to perform Named Entity Recognition (NER), so that when it finds words in the trigger or the answer that can be categorized, it replaces them with generic tags. This allows a dialogue system that receives the corpus as input to apply similarity measures to those tags.
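The cleanup steps above can be illustrated with a minimal sketch; the regular expressions and the function name are ours and only approximate the behavior described, they are not the Subtle Tool's actual implementation:

import re

def clean_utterance(text):
    # Drop speaker annotations such as "JOHN: Hello" added for hearing-impaired viewers.
    text = re.sub(r"^[A-Z][A-Z .']*:\s*", "", text)
    # Drop sound descriptions in square brackets, e.g. "[DOOR SLAMS]".
    text = re.sub(r"\[[^\]]*\]", "", text)
    # Drop formatting tags such as <i>...</i> interpreted by video players.
    text = re.sub(r"<[^>]+>", "", text)
    return text.strip()

print(clean_utterance("JOHN: [DOOR SLAMS] <i>Who is there?</i>"))  # -> "Who is there?"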


2.1.2 Dialogue Turns

Subtitle files comprise a sequence of slots with utterances. Since the main objective is to extract actual dialogues, it must be decided whether consecutive slots constitute an interaction or not. Sometimes a character has their utterance distributed across multiple slots: when the first utterance ends with a hyphen, a comma, a colon or an ellipsis and the second starts in lowercase, they are joined together, forming a possible trigger/answer for an interaction.

The time between two consecutive slots can indicate whether an interaction was found (or not). Common sense tells us that slots that are further away from each other should not form an interaction, because they do not constitute a valid dialogue (they probably correspond to different scenes of the movie). However, it is hard to set an appropriate value for the maximum time difference between slots (movies have pace variations). For that reason, the maximum time allowed between two slots can be set by the research group using the tool (in the configuration file1). Giving it the value zero indicates that all consecutive slots will be considered as possible interactions.

1 File where the user can indicate the location of the input files and the maximum time difference allowed between triggers and answers.
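A minimal sketch of this slot-pairing step, under the assumption that each slot carries its start time, end time and text (names and data structures are illustrative, not the Subtle Tool's code):

from datetime import timedelta

def build_interactions(slots, max_gap_ms=0):
    # slots: list of (start, end, text) tuples, with start/end as timedelta objects.
    # max_gap_ms == 0 reproduces the behavior described above: every pair of
    # consecutive slots is considered a possible interaction.
    interactions = []
    for (s1, e1, t1), (s2, e2, t2) in zip(slots, slots[1:]):
        gap_ms = (s2 - e1) / timedelta(milliseconds=1)
        if max_gap_ms == 0 or gap_ms <= max_gap_ms:
            interactions.append({"trigger": t1, "answer": t2, "diff_ms": gap_ms})
    return interactions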

2.1.3 Corpus Data Fields

Besides the fields for the triggers and the answers of the interactions found, some additional fields are included.

To record where an interaction was extracted from, the filename of the source subtitle file is stored for reference.

A CA using the Subtle Corpus as its knowledge base might need context information. Therefore, each interaction has a unique identifier: every time a new trigger/answer pair is found, the identifier is incremented and assigned to the newly created interaction (effectively creating a reversed linked list).

The time difference (in milliseconds) between the time value of the trigger and the time value of the respective answer is also saved.
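Putting these fields together, a single interaction record could look like the sketch below; the field names are hypothetical and only illustrate the information described above, not the actual corpus format:

# Hypothetical interaction record (field names are ours).
interaction = {
    "id": 42,                          # unique, incremental identifier (previous interaction has id 41)
    "trigger": "Where are you going?",
    "answer": "Home. It is getting late.",
    "source_file": "some_movie.srt",   # subtitle file the pair was extracted from
    "diff_ms": 800,                    # time between trigger and answer, in milliseconds
}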

2.1.4 Final Considerations

This corpus of interactions extracted from subtitle files still has a set of drawbacks:

1. It only covers English and Portuguese subtitles, although a large number of subtitles is available for a wide range of languages;

2. Some of the collected data does not correspond to real interactions. Some efforts were made to minimize this (Section 2.1.1), but one can still find pairs of triggers and answers that do not represent valid sentences that could be used by a CA;


3. It has not evolved: the number of available subtitle files increases every day, but the corpus does not take them into account;

4. There is information related to the subtitles (e.g. release year, rating of the subtitle, information about the movie or TV show, etc.) that could be useful to include in the corpus;

5. Although a configuration file is provided, it does not give much flexibility to its end user (it only offers the fields needed for the tool to work, and there are few fields where the user can customize the behavior of the tool).

Our solution adds support for more languages. The pre-processing phase is enhanced in order to address new cases from the added languages. We built corpora from a set of subtitles collected up to the year 2016. Meta-data associated with the subtitle files is now part of the corpus and also allows analytical experiments to be performed with it. The system is fully configurable through a configuration file, giving flexibility for whatever needs an end-user may have.

2.2 Say Something Smart

SSS [Ameixa et al., 2014] is the engine that chooses an answer when a user poses a request. The input given by the user is matched against the interactions present in the Subtle Corpus. First, a list of candidate answers is retrieved; they are then scored according to some measures in order to return the answer with the best final score.

2.2.1 Indexing Subtle Corpus and Extracting Answers

The Subtle Corpus consists of thousands of files, each containing a large number of interactions. To perform queries on the data, a high-performance text search engine was needed, and Lucene was chosen. It is an open-source software library which provides fast information retrieval (users should not have to wait too long to get a response to their input) by adding content to a full-text index.

For Lucene to perform a successful search, some prior steps are mandatory when analyzing the raw data. First, it is necessary to transform the data into indexable tokens. Lucene contains some tools (analyzers) that allow that transformation to happen (a sketch of the same steps using analogous NLTK components follows the list):

• Tokenizer: splits text into tokens at punctuation marks;

• Stemmer: removes morphological affixes from words, leaving only the word stem;

• Stop-words filter: some words are ignored when searching. This is done by feeding the filter a file with a stop-words list for a specific language.
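As a rough, standalone analogue of this analyzer chain (an illustration of the concepts, not the SSS indexing code), the same three steps can be sketched with NLTK:

import nltk
from nltk.corpus import stopwords
from nltk.stem.snowball import SnowballStemmer
from nltk.tokenize import word_tokenize

nltk.download("punkt", quiet=True)
nltk.download("stopwords", quiet=True)

def analyze(text, language="english"):
    # Tokenize, drop stop-words and punctuation, then stem the remaining tokens.
    tokens = word_tokenize(text.lower(), language=language)
    stop = set(stopwords.words(language))
    stemmer = SnowballStemmer(language)
    return [stemmer.stem(t) for t in tokens if t.isalnum() and t not in stop]

print(analyze("Where are you going tonight?"))  # e.g. ['go', 'tonight']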


Since SSS is a question/answering system, it relies on answer redundancy [Mendes et al., 2013] to help the process of choosing the best response. While performing the search, Lucene compares the user request with the triggers available in the Subtle Corpus, and a relevance score is attributed to each one of them. Since Lucene applies an internal scoring algorithm2, its results are returned in descending order (from the one with the highest score to the one with the lowest). Usually, most of the results obtained from this scoring are not semantically related to the request made by the user. To maintain answer redundancy while keeping response times low, some tests were made; it was concluded that SSS should take the first 100 matches found by Lucene and then use its own algorithm to find the best answer. Additional measures (described in Section 2.2.2) were studied in order to give an improved score to each of them.

The search made by Lucene might not match any interaction for a given user input. When that happens, the system still gives an answer to its user, indicating that the user request was not understood or that the system does not know how to respond.

2 http://www.lucenetutorial.com/advanced-topics/scoring.html

2.2.2 Electing the Best Answer

Due to the semantic deviation that might exist between the user request and the triggers/answers of the interactions returned by Lucene, SSS can apply four different weighted measures. The weight of each measure, as well as which ones are used, can be customized by the end user by changing some parameters of a configuration file. These are the available measures (a sketch of how they can be combined follows the list):

• Trigger Similarity to the User Input (M1): interactions whose triggers are more similar to the user input are given a higher value for this measure. As discussed before, many of the triggers found in the interactions filtered by Lucene are semantically deviated, so this measure plays an important role in identifying which ones are more relevant in order to give a better response to the user request;

• Answer Frequency (M2): this measure favors the answer fields that are more common among all the interactions returned by Lucene. This way, the corpus redundancy is taken into account by giving the highest score to the answer that appears most often;

• Answer Similarity to the User Input (M3): although we might think that the similarity between the user input and the trigger is more useful, the similarity between the answer and the user input can also help the system give better responses;

• Time Difference (M4): the Subtle Corpus provides the time difference between the trigger and the answer. This measure does not make much sense on its own, but when used together with the other measures it can help improve the final results. When there is too much time difference between the trigger and the answer, it might be an indicator that they do not constitute a real interaction, and thus they receive a lower score.
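A minimal sketch of how such weighted measures might be combined into a final score (the similarity functions and weights below are placeholders, not the actual measures or weights used by SSS):

from collections import Counter

def jaccard(a, b):
    # Token-overlap similarity in [0, 1], used here as a stand-in for M1/M3.
    sa, sb = set(a.lower().split()), set(b.lower().split())
    return len(sa & sb) / len(sa | sb) if sa | sb else 0.0

def best_answer(user_input, candidates, weights):
    # Each candidate is a dict with "trigger", "answer" and "diff_ms" keys,
    # e.g. the top 100 interactions returned by the search engine.
    answer_counts = Counter(c["answer"] for c in candidates)

    def score(c):
        m1 = jaccard(user_input, c["trigger"])             # trigger similarity
        m2 = answer_counts[c["answer"]] / len(candidates)  # answer frequency (redundancy)
        m3 = jaccard(user_input, c["answer"])              # answer similarity
        m4 = 1.0 / (1.0 + c["diff_ms"] / 1000.0)           # penalize large time gaps
        return (weights["M1"] * m1 + weights["M2"] * m2 +
                weights["M3"] * m3 + weights["M4"] * m4)

    return max(candidates, key=score)["answer"]

weights = {"M1": 0.4, "M2": 0.2, "M3": 0.2, "M4": 0.2}  # illustrative weights only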

2.2.3 Flexibility of the System

SSS provides a configuration file3 where the end user can specify some parameters that control the way things are done. First, the language can be specified, with English and Portuguese as possible values; the path to the stop-words list should be changed accordingly. There is also a field for indicating a list of predefined answers, from which the system selects one when no suitable answer is found.

3 XML file.

2.2.4 Final Considerations

Some improvements could be made to the current state of SSS:

1. As we have seen, when choosing an appropriate response to a user input, several measures are combined in order to select the answer with the highest score, but the weights for each measure were determined empirically, meaning that the system might not yet be at its best;

2. Since the Subtle Corpus includes a reverse linked list of interactions, including the context of the conversation could improve the success rate of the system;

3. When retrieving possible responses, the system relies on an external tool with a non-customizable scoring algorithm.

We wanted to replace Lucene with a complete alternative tool, but no relevant candidate was found. There are tools available that provide the same kind of functionality, but most of them have Lucene at their base (e.g. Solr4 or Elasticsearch5).

With that in mind, we built a new dialogue system using neural networks (Section 5.1) and compared its performance against SSS. Our system relies on a generative model so that it can answer questions the system has never seen.

4 https://lucene.apache.org/solr/
5 https://www.elastic.co/products/elasticsearch

2.3 Sequence-to-Sequence Models

In this section, we briefly explain the main concepts behind seq2seq models. We start by explaining the base architecture and then describe more specific aspects and techniques for improving the base seq2seq model.

Figure 2.1: Information flow in a seq2seq model.

A seq2seq model consists of two RNNs: an encoder and a decoder. The encoder processes the input sequence one symbol6 at a time, converting the whole sequence into a fixed representation containing only the important information of the sequence. The decoder receives the encoded representation (the context7) and is trained to predict/generate another sequence, also one symbol at a time. The decoder is influenced by the context and by the previously generated symbols, as shown in Figure 2.1. All of those symbols need to have a representation, so a vocabulary is needed.

6 In our experiments, each symbol corresponds to a token that can be either a word or punctuation.
7 Given by the encoder state vectors.
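In equation form, the decoder factorizes the probability of the output sequence conditioned on the encoder's context; a standard formulation (not quoted from this thesis) is:

P(y_1, \dots, y_T \mid x_1, \dots, x_S) = \prod_{t=1}^{T} p(y_t \mid c, y_1, \dots, y_{t-1})

where c is the context produced by the encoder from the input sequence x_1, ..., x_S, and y_1, ..., y_T is the generated output sequence.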

2.3.1 Padding and Bucketing

Training a standard seq2seq model involves a large number of matrix multiplications and other operations which benefit from parallelization, and Graphics Processing Units (GPUs) are an excellent candidate for that task. For that matter, all sequences must have a fixed length in order to be divided into batches. The input dataset must be converted to fixed-length sequences. This is accomplished by padding the input sequences with special symbols8. It implies that all the sequences of an input dataset would have to be padded to match the size of the longest sequence, slowing down the training of the decoder. Fortunately, bucketing solves this problem by putting sequences into buckets of different sizes (e.g. one bucket for sequences with length between 5 and 10, another for sequences with length between 10 and 15, and so on).

8 Signaling the end of sentences, the decoding starting point, symbols not in the vocabulary, and filling slots.
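A minimal sketch of padding and bucketing (the symbol name and bucket boundaries are arbitrary choices for illustration):

PAD = "<pad>"  # illustrative padding symbol

def pad(tokens, length):
    # Right-pad a token list to a fixed length with the PAD symbol.
    return tokens + [PAD] * (length - len(tokens))

def bucket(sequences, boundaries=(10, 15, 20)):
    # Group sequences by length so each bucket only pads up to its own boundary.
    buckets = {b: [] for b in boundaries}
    for seq in sequences:
        for b in boundaries:
            if len(seq) <= b:
                buckets[b].append(pad(seq, b))
                break
        # Sequences longer than the last boundary are simply skipped in this sketch.
    return buckets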

2.3.2 Word Embeddings

Taking an input dataset of sentences as an example, the seq2seq model needs a representation for each word present in the vocabulary. This is accomplished by using word embeddings, where each word is represented by a fixed-length vector. Semantic relations between words can be captured by this technique. In seq2seq models, the word embeddings are trained simultaneously with the other parameters of the model.


2.3.3 Attention Mechanism

The seq2seq model provides the ability to process input and output sequences. However, compressing an entire input sequence into a fixed-length context can cause the loss of a considerable amount of information. In [Bahdanau et al., 2014], the authors took inspiration from the human perceptual system and introduced an attention mechanism that allows the decoder to selectively look at the input sequence while decoding. As a result, unnecessary information can be filtered out and a better performance can be achieved.

2.3.4 Greedy Search and Beam Search

The decoder needs to select the most likely output sequence. This involves searching through all the possible output sequences based on their likelihood. Usually, the size of the vocabulary tends to be very large (e.g. hundreds of thousands of words). Therefore, the search problem is exponential in the length of the output sequence and is intractable9 to search completely.

For seq2seq models it is common to use either a greedy search or a beam search approach in order to find candidates to be chosen by the decoder. A greedy search selects the most likely symbol at each step of the output sequence, making it a very fast approach. However, the quality of the final output sequences may be far from optimal. Beam search, on the other hand, expands upon greedy search and returns a list of the most likely output sequences by keeping the n most likely at each step, where n corresponds to the beam width specified by the user. Using beam search with n = 1 results in a greedy search. Higher beam-width values result in a decrease of the decoding speed.

9 NP-complete.
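A minimal sketch of beam search over a generic next-symbol distribution (a plain illustration of the idea, not OpenNMT-tf's implementation; next_probs is a placeholder for a trained decoder step):

import math

def beam_search(next_probs, start, end, beam_width=3, max_len=20):
    # next_probs(sequence) must return a dict mapping each candidate next symbol
    # to its probability given the sequence so far. With beam_width=1 this
    # degenerates to greedy search.
    beams = [([start], 0.0)]  # (sequence, accumulated log-probability)
    for _ in range(max_len):
        candidates = []
        for seq, logp in beams:
            if seq[-1] == end:              # finished sequences are kept as-is
                candidates.append((seq, logp))
                continue
            for symbol, p in next_probs(seq).items():
                candidates.append((seq + [symbol], logp + math.log(p)))
        beams = sorted(candidates, key=lambda b: b[1], reverse=True)[:beam_width]
        if all(seq[-1] == end for seq, _ in beams):
            break
    return beams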

2.3.5 Recurrent Neural Networks

As reported before, seq2seq models rely on RNNs, which are essentially standard neural networks with loops.

An RNN consists of multiple copies of the same network. Each copy passes information about the sequence being processed to its successor. This information is the hidden state - information about what happened in all of the previous time steps. In Figure 2.2, an output is generated by the network at each time step. This may not be necessary for some tasks (e.g. when predicting an answer to give based on a user input, the final result is what matters most; we might not care about what response could be given at each word of the input).

There is one case where simple RNNs (usually referred to as vanilla RNNs) might not do well: learning long-term dependencies (e.g. a dependency on information that is present in steps that are far apart). However, there are alternatives that are able to address that problem, for instance the Long-Short Term Memory (LSTM) networks described in the following section.

Figure 2.2: Architecture of an RNN with a loop, then unrolled after unfolding. Xt represents what is given as input; Yt is the output generated by the network.

2.3.6 Long-Short Term Memory Neural Networks

LSTMs [Hochreiter and Schmidhuber, 1997] have the same architecture as an RNN. The only difference is that they are designed to apply a different function to compute the hidden state, thus avoiding the long-term dependency problem usually reported with basic RNNs. Instead of having only one neural network, they have four, which are combined with some pointwise operations inside a cell that represents the memory of the LSTM. Internally, these cells have the ability to remove or add information to their state. That way, they can remember information from steps that are far apart.


3 Related Work

Contents

3.1 Corpora
3.2 End-to-end Sequence to Sequence Models


This chapter presents a review of previous work regarding the process of building corpora for dialogue systems and the process of generating responses to user input with end-to-end CAs on a turn-by-turn basis.

3.1 Corpora

In this section, we describe corpora suitable to be used as a knowledge base for a CA.

3.1.1 OpenSubtitles2016 Corpus

As previously pointed out, the number of subtitle files is constantly increasing. The OPUS Corpus1 was updated in 2016 with a new dataset based on movie and TV subtitles: the OpenSubtitles2016 Corpus [Lison and Tiedemann, 2016]. After preprocessing the source files, the subtitle files are aligned with each other to form a parallel corpus.

1 http://opus.lingfil.uu.se/

3.1.1.A Source Data

The source data that originated this corpus consists of a database dump from OpenSubtitles.org, containing a total of 3.36 million subtitle files distributed across more than 60 languages. Some files were discarded from the conversion because their formats were unsupported or their encodings were corrupt. The OpenSubtitles team has introduced multiple mechanisms that improved the quality of the subtitles available on their website; they were able to remove duplicate, spurious and misclassified subtitles. After the conversion step, the dataset includes subtitle data from a total of 152,939 movies or TV show episodes.

The raw subtitle files go through a preprocessing phase, described in the following sections.

3.1.1.B Preprocessing

Encoding detection: before the content of a file can be parsed, its encoding must be known a priori, since OpenSubtitles does not enforce any kind of encoding. The encoding of each file is determined by applying a couple of heuristics: a list of possible character encodings was created for each language in the dataset (some languages allow several alternative encodings), and the most likely encoding is then determined by auto-detection [Li and Momoi, 2001].
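As an illustration of this kind of auto-detection (not the pipeline used by the corpus authors), the chardet library implements a universal detector in the spirit of [Li and Momoi, 2001]:

import chardet

def read_subtitle(path, fallback="latin-1"):
    # Guess a subtitle file's encoding and decode it (illustrative only).
    raw = open(path, "rb").read()
    guess = chardet.detect(raw)  # e.g. {'encoding': 'ISO-8859-1', 'confidence': 0.73, ...}
    encoding = guess["encoding"] or fallback
    return raw.decode(encoding, errors="replace")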

Sentence segmentation: the raw files contain a structure of blocks (Listing 3.1), consisting of short portions of text with associated start and end times. Since there is no direct correspondence between these blocks and sentences, a sentence segmentation process is applied that finds sentence-ending markers in order to detect whether a subtitle block is a continuation of the preceding block. However, this detection of sentence-ending markers is highly language dependent and must obey some specific rules.

Listing 3.1: Structure of a subtitle file in .srt format

1
00:02:17,440 --> 00:02:20,375
Senator, we're making
our final approach into Coruscant.

2
00:02:20,476 --> 00:02:22,501
Very good, Lieutenant.

Spell correction: many of the subtitle files that serve as a data source are automatically extracted from video streams using Optical Character Recognition (OCR), which causes some spelling errors. Also, many subtitle files are made by amateurs, making spelling errors very likely to be part of the data. Following a simple noisy-channel approach (integrating handcrafted error models and statistical models), such errors (including misplaced accent marks) were automatically detected and corrected for 11 European languages2. The approach followed cannot correct words that include more than one misrepresented character.

2 English, Czech, Dutch, French, German, Italian, Polish, Portuguese, Romanian, Spanish and Swedish.

Collecting meta-data: for each subtitle, meta-data is generated that includes generic attributes of the source material3 extracted from the Internet Movie Database (IMDb), as well as attributes of the subtitle itself4. There is also some additional meta-data related to the previous phases, such as the encoding that was detected or the number of spelling errors found.

3 Release year, original language, duration and genre.
4 Upload date, subtitle rating on OpenSubtitles and subtitle duration.

3.1.1.C Output files

After preprocessing the files they align the subtitle files with each other to form a parallel corpus. The

details of that step will not be described here since the alignment of subtitles is not the focus of our work.

In addition to the bi-text files generated for the parallel corpora, they also provide XML files containing

the subtitles either with sentences tokenized or not. An example of a OpenSubtitles2016 corpus file can

be seen in Listing 3.2.


Listing 3.2: Simplified example of an OpenSubtitles2016 input file structure.

<?xml version="1.0" encoding="UTF-8"?>
<document id="66487">
  <s id="1">
    <time id="T1S" value="00:00:31,800" />
    Smeagol
    <time id="T1E" value="00:00:38,800" />
  </s>
  <s id="2">
    <time id="T2S" value="00:01:15,700" />
    Apanhei um!!
  </s>
  <meta>
    <subtitle>
      <language>Portuguese</language>
      <date>2004-03-06</date>
      <duration>00:01:15,700</duration>
      <rating>7.0</rating>
    </subtitle>
    <source>
      <original>English, Quenya, Old English, Sindarin</original>
      <year>2003</year>
      <duration>201 min</duration>
      <genre>Action, Adventure, Fantasy</genre>
      <country>USA, New Zealand</country>
    </source>
  </meta>
</document>

Does the OpenSubtitles2016 Corpus meet our requirements?

This corpus is the ideal candidate for the goals we defined. It is the largest corpus available and a recent one, with subtitles up to the year 2016. It provides meta-data about each corpus file and supports multiple languages, including 96,254 files with Portuguese subtitles.

3.1.2 Movie-Dic Corpus

The Movie-Dic Corpus [Banchs, 2012] is available for research and development purposes. It comprises 132,229 dialogues containing a total of 764,146 turns extracted from 753 English movie scripts. It can be used in chat-oriented dialogue systems, since it does not provide a knowledge base focused on a specific domain or area of interest.

The movie scripts that serve as the source data of this corpus are freely available at Internet Movie

Script Database (IMSDb)5 as HyperText Markup Language (HTML) files. Three types of information are

extracted when crawling the files:

• Speakers: corresponds to the names of the movie characters that are speaking in a given turn of

the dialogue;

• Context: additional information of narrative nature, explaining what is happening in the movie

scene;

5http://www.imsdb.com/


• Utterances: what is said at each turn by some speaker.

With that information, some heuristics were developed in order to identify proper dialogue boundaries. After identifying the dialogues, some post-processing was applied for filtering or amending parsing errors, as well as erroneous data present in the types of information already described above. Finally, all

information was organized in dialogue units and then written to XML files.

Does Movie-Dic Corpus meet our requirements?

Although this corpus contains speakers and context information that could be useful to build CAs,

it does not meet our requirements because it is only available in English. It is a small corpus when

compared with OpenSubtitles2016 corpus. Also, it does not provide meta-data.

3.1.3 Cornell Movie-Dialogs Corpus

The Cornell Movie-Dialogs Corpus6 is another dataset that contains conversations extracted from

raw movie scripts. It comprises 220,579 conversational exchanges involving 9,035 characters from 617

movies. Meta-data is also included for each conversation, containing details about the movie7 and about the characters8. The information was gathered by an algorithm that performs queries on IMDb data interfaces9.

It was mainly used to study how conversational participants adapt to each other's language styles while communicating with each other [Danescu-Niculescu-Mizil and Lee, 2011].

Does Cornell Movie-Dialogs Corpus meet our requirements?

This corpus is very similar to the Movie-Dic corpus. Thus, it does not meet our requirements either, for the same reasons explained before.

3.1.4 Ubuntu Dialog Corpus

Ubuntu Dialog Corpus [Lowe et al., 2015] is a dataset that helps build dialogue agents that are capable of interacting in one-to-one conversations on very technical subjects. Since the dataset is characterized by a multi-turn property of unstructured nature10, the resulting agents can perform multi-turn conversations.

This corpus comprises almost 1 million two-party conversations extracted from Ubuntu chat logs11 between 2004 and 2015. Each conversation has an average of 8 turns and a minimum of 3. It allows the creation of CAs for targeted applications (in this case, technical support).

6 https://www.cs.cornell.edu/~cristian/Cornell_Movie-Dialogs_Corpus.html
7 Movie title, genres, release year, IMDb rating and number of IMDb votes.
8 Gender and order of appearance in movie credits.
9 http://www.imdb.com/interfaces
10 There is no logical representation for the information exchange during a conversation.
11 https://irclogs.ubuntu.com/


Some learning architectures were studied in order to analyze how this corpus contributes to a question-answering system that has the task of selecting the best response given a user input. Before testing those learning techniques, the collected data went through a preprocessing stage. Each utterance was parsed using the Natural Language Toolkit (NLTK) library12 and a Twitter tokenizer13 (there is no information available about how they used both tools). Afterwards, some generic tagging was done using NER, with generic tags for multiple word categories (person names, locations, system paths and so on). Also, the data was further processed to create tuples with three fields: the context, the response and a flag which indicates whether the response was valid or not14.
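For clarity, such a tuple can be represented by a simple value class; the field names below are ours and purely illustrative:

// One preprocessed example from the Ubuntu Dialog Corpus: a context, a candidate
// response, and a flag telling whether the response is the true next utterance
// after that context (see footnote 14).
public final class ContextResponseExample {
    private final String context;
    private final String response;
    private final boolean validResponse;

    public ContextResponseExample(String context, String response, boolean validResponse) {
        this.context = context;
        this.response = response;
        this.validResponse = validResponse;
    }

    public String getContext() { return context; }
    public String getResponse() { return response; }
    public boolean isValidResponse() { return validResponse; }
}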

Does Ubuntu Dialog Corpus meet our requirements?

This dataset contains information about specific technical topics, which makes it unsuitable for evaluating open-domain end-to-end dialogue systems.

3.2 End-to-end Sequence to Sequence Models

The automated discovery of abstraction is the fundamental idea behind all deep learning methodolo-

gies. They capture the semantic content by building abstract representations of raw data features.

The seq2seq learning framework is one of those methodologies. It looks for an in-between repre-

sentation of content when mapping one complex structure to another [Sutskever et al., 2014]. However,

training and optimizing this process of translation between structures is exceptionally challenging [Bengio, 2013].

Given a string of inputs, a generative neural network model produces a string of outputs, both of arbitrary length. It relies on encoder-decoder models: the encoder encodes the source sequence, while the decoder produces the target sequence.

Recently, an increasing number of studies have found that seq2seq models can be used for creating

dialogue systems by relying on a purely data-driven approach [Li et al., 2015, Yin et al., 2015, Lu et al.,

2017].

Using seq2seq models for end-to-end training facilitated the creation of systems for various complex

natural language tasks with machine translation being one of the most favored tasks [Kalchbrenner and

Blunsom, 2013,Sutskever et al., 2014,Cho et al., 2014,Luong et al., 2015]. They can be used in multiple

types of systems without considerable changes in the architecture. This is ideal for tasks for which it is too difficult to design rules manually, such as dialogue systems.

Building a CA involves mapping questions to responses, and being able to do that with a straightforward model is very appealing. The seq2seq model can learn to map between questions and answers in either closed-domain or open-domain datasets, as shown by [Lu et al., 2017] and [Li et al., 2015].

12 http://www.nltk.org/
13 https://www.cs.cmu.edu/~ark/TweetNLP/
14 A response is flagged as valid if it is the next utterance after the context.

In [Vinyals and Le, 2015], a neural conversational model was built; the English subtitle files of OpenSubtitles2009 [Tiedemann, 2009]15 were one of the datasets used to test the model. They considered consecutive sentences as if they were uttered by distinct characters. The model was trained to predict the next sentence given the previous one. Their CA was capable of having basic fluent open-domain conversations. The model could generalize to new questions it had never seen during the training phase. However, the built CA has some drawbacks: it gives too many short and simple answers, and it also lacks a way to ensure consistency during a conversation, because it does not include any general world knowledge (it is an unsupervised model) and has no memory of the past conversation.

They evaluated their CA by comparing it against CleverBot16. They asked four different humans to rate the answers given by both agents for 200 questions. After analyzing the results of the human evaluation, their CA achieved the better score. They justified the choice of using human evaluators by stating that

designing a good metric to measure the quality of a conversational model remains an open research

problem.

The problem with generic answers was also referred to by [Guo et al., 2017]. After training a seq2seq model with the Cornell Movie-Dialogs Corpus (Section 3.1.3), they ended up with a high percentage of “I don’t know” answers. After removing all “I don’t know” sentences from the input dataset and training a new model, the responses remained vague, with a high percentage of “What do you mean?” answers. The evaluation

of the created CA was made by the team members involved in the research.

Generating long, informative, coherent and diverse responses remains a hard task. A recent review

of the literature on this topic [Li et al., 2015] has found that the traditional objective function that selects

the best answer is unsuited for question-answering systems (although it provides state-of-the-art results

in machine translation tasks). They proposed using Maximum Mutual Information as the objective function, which penalizes generic responses. They stated that more meaningful responses can be found

in N-best lists given by seq2seq models but rank much lower. After applying their proposed objective

function, they were able to achieve more diverse responses. OpenSubtitles2009 [Tiedemann, 2009]

dataset was also used in their experiments. While training the model they used BLEU [Papineni et al.,

2002] for parameter tuning. They also relied on human evaluation. The judges were instructed to prefer

outputs that were more relevant to the preceding context, as opposed to those that were more generic.

After analyzing the results, they were able to improve the number of diverse and interesting responses

returned by the model. In [Shao et al., 2017] similar results were obtained by using a slightly modified

version of beam-search when selecting the best response. They introduced stochastic sampling opera-

tions in the beam-search algorithm. This allowed them to inject diversity earlier in the answer generation

process. They also implemented a back-off strategy, falling back to the baseline model with the standard beam-search algorithm when the response was shorter than 40 characters17. Once again, they also relied on an evaluation done by humans, asking them to rate the answers given by their model on a 5-point scale18. Some of their methods were able to generate longer answers; however, those had worse results according to the classifications given by the judges.

15 One of the previous versions of the corpus presented in Section 3.1.1.
16 https://www.cleverbot.com/

17 Textual length.
18 Excellent, Good, Acceptable, Mediocre, and Bad.


4 B-Subtle

Contents

4.1 Architecture Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29

4.2 B-Subtle Parts Explained . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30


We present a revamped tool for creating corpora - B-Subtle. We aim to replace the existing Subtle

Tool by providing the following features:

1. Create corpora of interactions with meta-data associated such as the genre of the movie/TV show,

release year, spoken language, subtitle language, etc. This allows end-users to generate corpora

that meet specific requirements: for example including only interactions collected from movies with

“Action” as a genre;

2. Support OpenSubtitles2016 Corpus [Lison and Tiedemann, 2016] files as input;

3. Support different output formats for the corpus generated such as JSON or XML files. This could

be useful for end-users since they could choose the output format that fits their needs;

4. User-friendly configuration file: allow the user to fully control the behavior of B-Subtle when creat-

ing corpora simply by adjusting parameters in a configuration file.

5. Allow end-users to collect analytical data about movies, TV shows and subtitle files.

6. Deal with specific language details (e.g. encodings) in order to allow end-users to work with a

broader range of languages.

4.1 Architecture Overview

B-Subtle’s architecture (Figure 4.1) was designed to produce a flexible system that can be fine-tuned

according to the requirements a user may have. For that reason, a modular approach was adopted.

The system is capable of running pipelines. Those pipelines indicate how input files will be processed and are specialized in dealing with their inner details. They specify which of the components made available by B-Subtle will be used for an input dataset.

Essentially this tool receives a dataset as input and outputs a customized dataset of interaction pairs

and/or analytics.

B-Subtle is currently able to process OpenSubtitles2016 Corpus files. As seen in Figure 4.1 process-

ing each OpenSubtitles2016 corpus file comprises a sequence of steps:

1. First, it gathers all the meta-data available in the subtitle file;

2. In order to fill missing fields of the already gathered meta-data (e.g. the subtitle file might not have

any value for the genre field), additional components can be applied: the Meta-data Collectors

(explained in Section 4.2.2);


3. After having all of the meta-data collected, the system is ready to filter the file using Meta-data Filters (explained in Section 4.2.3.A). These are the components that allow the tool to provide the possibility of creating targeted corpora from subtitles;

4. If the file survives all of the Meta-data Filters applied, the system starts collecting interaction pairs from it;

5. While collecting interaction pairs some of them can be filtered out by configuring some Interaction

Pairs Filters described in detail in Section 4.2.3.B.

6. With all the interactions collected, the system can now process each one of them by applying

Producers (Section 4.2.4). These components are able to enrich each interaction with more data

(e.g. perform a tokenization of the trigger and/or the answer);

7. After applying the Producers we can use Interaction Pair Filters again, because some of them can

only be applied to the data generated by the Producers (e.g. filtering out all the triggers with more

than 5 tokens);

8. Then we can apply Transformers which are responsible for modifying the raw data fields of the

triggers and answers. We explain them in more detail in Section 4.2.5.

9. Once again Interaction Pairs Filters can be applied to the data fields modified by the Transformers.

10. Afterward, all the information collected can be written to a B-Subtle corpus file (explained in detail

in Section 4.2.6).

Analytical data can also be collected alongside corpora generation, and B-Subtle can even collect analytical data without generating any output corpus. See Section 4.2.7 to know which types of analytics can be collected and how to configure them.
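A minimal sketch of the processing loop just described is shown below; all type and method names are illustrative simplifications and do not necessarily match the actual B-Subtle implementation:

import java.io.File;
import java.util.List;

// Hypothetical, simplified view of how a B-Subtle pipeline could process one subtitle file.
public class PipelineSketch {

    interface Metadata {}
    interface InteractionPair {}
    interface MetadataCollector { void enrich(Metadata m); }
    interface MetadataFilter { boolean accepts(Metadata m); }
    interface InteractionPairFilter { boolean accepts(InteractionPair p); }
    interface Producer { void produce(InteractionPair p); }
    interface Transformer { void transform(InteractionPair p); }
    interface Output { void write(List<InteractionPair> pairs, Metadata m); }
    interface SubtitleReader {
        Metadata readMetadata(File f);
        List<InteractionPair> readInteractionPairs(File f);
    }

    void processFile(File subtitleFile, SubtitleReader reader,
                     List<MetadataCollector> collectors, List<MetadataFilter> metadataFilters,
                     List<InteractionPairFilter> pairFilters, List<Producer> producers,
                     List<Transformer> transformers, List<Output> outputs) {
        // Steps 1-2: gather the meta-data available in the file and enrich it with external collectors.
        Metadata metadata = reader.readMetadata(subtitleFile);
        collectors.forEach(c -> c.enrich(metadata));

        // Step 3: discard the whole file if any meta-data filter rejects it.
        if (!metadataFilters.stream().allMatch(f -> f.accepts(metadata))) {
            return;
        }

        // Steps 4-5: collect interaction pairs and drop those rejected by interaction pair filters.
        List<InteractionPair> pairs = reader.readInteractionPairs(subtitleFile);
        pairs.removeIf(p -> !pairFilters.stream().allMatch(f -> f.accepts(p)));

        // Steps 6 and 8: enrich each pair with producers, then apply transformers.
        pairs.forEach(p -> producers.forEach(prod -> prod.produce(p)));
        pairs.forEach(p -> transformers.forEach(t -> t.transform(p)));

        // Steps 7 and 9: interaction pair filters may be applied again on the newly generated data.
        pairs.removeIf(p -> !pairFilters.stream().allMatch(f -> f.accepts(p)));

        // Step 10: write the surviving pairs (and any analytics) to the configured outputs.
        outputs.forEach(o -> o.write(pairs, metadata));
    }
}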

4.2 B-Subtle Parts Explained

By adopting an architecture of components we are allowing B-Subtle to be easily expanded in the

future. In this section, we will describe in detail all of its constituent parts (Figure 4.1).

4.2.1 Input Files

The OpenSubtitles2016 Corpus files are in XML format, therefore a dedicated parser was implemented. To process all the data successfully, some additional steps had to be performed. For example, some files contained invalid XML characters that needed to be deleted so that the file could be correctly


Figure 4.1: Possible B-Subtle pipeline for OpenSubtitles 2016 files using all the components available.

analyzed without being immediately discarded. Also, during the meta-data collection step, the “dura-

tion”1 field of the subtitle file was found written in multiple patterns (“HH:MM:ss,3S”, “MM min”, or even

“N/A”) and we converted them to a unified format2.

While choosing the parser architectures for XML files we came across the following options:

• Document Object Model (DOM): the whole XML file is loaded into memory. It is possible to nav-

igate to parent and child elements across the document. This technique can be problematic with

large files due to heavy memory consumption;

• Simple API for XML (SAX): it starts by reading the XML file from beginning to end. It does not store

anything in memory. It fires events and a custom event handler can be used to catch all or part of

them. It lacks a parent structure like DOM but does not suffer memory consumption problems;

• Streaming API for XML (StAX): it is similar to SAX; however, the responsibility of moving the parser through the XML file belongs to the application code. The main advantage over SAX is that it also allows writing XML files.

Since the size of each XML file is relatively small, the DOM parsing architecture was chosen because it allowed a faster development of the B-Subtle tool.
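As an illustration of the DOM approach, a minimal Java parser for the structure shown in Listing 3.2 could look like the following; error handling and the removal of invalid XML characters are omitted, and this is only a sketch, not the actual B-Subtle parser:

import java.io.File;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

// Loads a whole OpenSubtitles2016 XML file into memory and prints each sentence.
public class DomSubtitleParser {

    public static void main(String[] args) throws Exception {
        DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
        Document doc = builder.parse(new File(args[0]));

        // Every subtitle sentence is an <s> element; its text content is the spoken line.
        NodeList sentences = doc.getElementsByTagName("s");
        for (int i = 0; i < sentences.getLength(); i++) {
            Element sentence = (Element) sentences.item(i);
            System.out.println(sentence.getAttribute("id") + ": " + sentence.getTextContent().trim());
        }

        // Meta-data elements (e.g. <language>, <year>) can be read the same way.
        NodeList language = doc.getElementsByTagName("language");
        if (language.getLength() > 0) {
            System.out.println("Subtitle language: " + language.item(0).getTextContent());
        }
    }
}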

4.2.2 Meta-data Collectors

These components are responsible for enriching the meta-data provided by the original input files with meta-data from external sources (for instance, the genre of the movie if this information is not available). They can be particularly useful for .srt files directly downloaded from websites, which lack such information.

1 Refers to the total duration of the source movie related to the subtitle file.
2 An integer with the amount of time in minutes.


We have currently implemented the themoviedb3 Meta-data Collector4. This component makes a Hypertext Transfer Protocol (HTTP) request to themoviedb with an IMDb identifier5 and receives back a JSON response that is parsed by B-Subtle. The relevant information is extracted (e.g. the movie

certification codes in order to classify the audience type). This information can then be filtered with the

components we will describe next.
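A rough sketch of such a collector is shown below. The endpoint URL and query parameters are placeholders (the real themoviedb API has its own URL scheme and requires an API key), and JSON parsing is omitted:

import java.io.IOException;
import java.io.InputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.charset.StandardCharsets;

// Illustrative meta-data collector: fetches extra movie information for an IMDb identifier.
// The URL below is a placeholder, not the actual themoviedb endpoint used by B-Subtle.
public class MovieMetadataClient {

    public String fetchMetadataJson(String imdbId, String apiKey) throws IOException {
        URL url = new URL("https://api.example.org/movie?imdb_id=" + imdbId + "&api_key=" + apiKey);
        HttpURLConnection connection = (HttpURLConnection) url.openConnection();
        connection.setRequestMethod("GET");
        connection.setRequestProperty("Accept", "application/json");

        try (InputStream in = connection.getInputStream()) {
            // Read the whole JSON response as a string; a JSON library would parse it afterwards.
            return new String(in.readAllBytes(), StandardCharsets.UTF_8);
        } finally {
            connection.disconnect();
        }
    }
}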

4.2.3 Filters

When dealing with input data that is rich in meta-data6, filters may be applied. This feature allows

end-users to easily generate targeted corpora (e.g. generate a corpus of interactions from movies with

“Western” as a genre and released before the year 1990).

Two types of filters can be used: Meta-data Filters and Interaction Pair Filters, described below.

4.2.3.A Meta-data Filters

• Audience: allows filtering subtitle files according to an audience rating/certification. This infor-

mation can be added by our themoviedb Meta-Data Collector. This filter supports a flag for adult

movies. It also supports filtering by motion picture content rating when provided together with a

country identifier7 (different countries have different criteria for content and age rating). (e.g. apply-

ing the flag for filtering adult movies will result in skipping those subtitle files; defining the content

rating value as M/16 with Portugal as a country would result in accepting all subtitle files from

movies with that content rating).

• Country: allows filtering subtitles of movies/TV shows made in a specific country or set of coun-

tries. Using a regular expression8 for the country name is also supported (e.g. accept only subtitle files from movies made in countries whose name starts with “Po” by using the regular expression "^Po").

• Country Quantity: allows filtering subtitles of movies/TV shows made in a determined quantity

of countries. A maximum, minimum or exact quantity can be defined. A range can also be used.

(e.g. defining a range value of 2 to 4 would result in accepting subtitle files from movies filmed in

at least 2 countries but filmed in less than 4 countries).

3 www.themoviedb.org
4 Limited to 40 API requests every 10 seconds by IP address.
5 It is provided by OpenSubtitles2016 Corpus as the filename of the subtitle files.
6 Possibly enriched with B-Subtle's own Meta-data Collectors.
7 Details about motion picture content rating in multiple countries: https://en.wikipedia.org/wiki/Motion_picture_content_rating_system
8 A sequence of characters that defines a search pattern.


• Duration: allows filtering by the total duration of the movie (in minutes). A maximum, minimum or

exact quantity can be defined. A range can also be used. (e.g. accept only subtitles from movies

with less than 90 minutes by defining a maximum quantity of minutes).

• Encoding: allows filtering subtitle files written in a specific encoding. Using a regular expression

for the encoding is also supported as well as the existence of it (one might want to filter only the

subtitle files that have the encoding correctly identified).

• Genre: allows filtering subtitles of movies/TV shows belonging to a specific genre or set of genres.

Using a regular expression for the genre type is also supported.

• Genre Quantity: allows filtering subtitles of movies/TV shows belonging to a determined quantity

of genres. A maximum, minimum or exact quantity can be defined. A range can also be used.

(e.g. accept only subtitles with “Action” and “Comedy” as genres).

• IMDb Identifier: allows filtering subtitle files that have the IMDb ID present in the meta-data fields.

• Movie Title: allows subtitle files to be filtered by movie name by providing a regular expression.

The existence of that field in the meta-data can also be tested (one might want to filter subtitles

that have a movie title associated, some of them might not have that information included, since

this field is not provided by OpenSubtitles2016 Corpus files).

• Original Language: allows filtering subtitles of movies/TV shows made in a specific language or

languages. Using a regular expression for the original language is also supported.

• Original Language Quantity: allows filtering subtitles of movies/TV shows made in a determined

quantity of original languages. A maximum, minimum or exact quantity can be defined. A range

can also be used.

• Movie Rating: allows filtering subtitle files based on the movie rating associated with the subti-

tles. It supports checking for the existence of that field in the meta-data, as well as a maximum, minimum, exact or range of values.

• Subtitle Rating: allows filtering subtitle files based on the subtitle rating associated. It supports

checking for the existence of that field in the meta-data as well as a maximum, minimum, exact

or range of values (e.g. accept only subtitle files with a rating above 6.3 on a scale of 0 to 10 by

defining a minimum value).

• Year: allows filtering files based on the release year of the movie. It supports checking for the

existence of that field in the meta-data as well as a maximum, minimum, exact or range of values.
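All of the filters above share the same accept/reject contract. The sketch below shows what a filter such as the Year filter could look like; the interface and type names are illustrative and do not necessarily match the actual B-Subtle code:

// Illustrative filter contract and one possible Year filter implementation.
public class YearFilterSketch {

    interface SubtitleMetadata {
        Integer releaseYear(); // null when the field is missing
    }

    interface MetadataFilter {
        boolean accepts(SubtitleMetadata metadata);
    }

    // Accepts only subtitles whose movie release year falls inside [minYear, maxYear].
    static final class YearRangeFilter implements MetadataFilter {
        private final int minYear;
        private final int maxYear;

        YearRangeFilter(int minYear, int maxYear) {
            this.minYear = minYear;
            this.maxYear = maxYear;
        }

        @Override
        public boolean accepts(SubtitleMetadata metadata) {
            Integer year = metadata.releaseYear();
            return year != null && year >= minYear && year <= maxYear;
        }
    }

    public static void main(String[] args) {
        MetadataFilter before1990 = new YearRangeFilter(1960, 1989);
        SubtitleMetadata sample = () -> 1975; // a subtitle from a movie released in 1975
        System.out.println(before1990.accepts(sample)); // prints: true
    }
}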


4.2.3.B Interaction Pairs Filters

The following filters are available for the interaction pairs. Some of them require a Producer to be applied a priori:

• Interaction Interval: allows filtering by the time interval between a trigger and an answer. It supports checking a maximum, minimum, exact, or range of values (e.g. collect interaction

pairs where the answer appears up to 4 seconds after the trigger);

• Trigger/Answer Sentiment: allows filtering interaction pairs where the trigger/answer expresses

a sentiment defined by the user (requires the Sentiment Producer Component). (e.g. accepting

only triggers with a positive sentiment).

• Trigger/Answer Tokens Quantity: allows filtering interaction pairs where the trigger/answer has

a determined amount of tokens (requires the Tokenizer Producer Component). A maximum, min-

imum or exact quantity can be defined. A range can also be used (e.g. accepting only answers with more than 5 tokens, so that sentences like “Yes you are!” are discarded9).

• Trigger/Answer Characters Quantity: the same as the above, but for textual content length

(e.g. the sentence “I am fine.” contains 10 characters and if we define this filter with a minimum

characters quantity of 5, that sentence is accepted if it is part of a trigger/answer).

• Trigger/Answer Regular Expression: allows filtering interaction pairs where the trigger/answer

matches some regular expression defined by the user. This filter gives a lot of flexibility such as

building a regular expression that filters out triggers containing curse words.

• Trigger/Answer Text Content: allows filtering interaction pairs where the trigger/answer starts

with, contains or ends with some sequence of characters. The same result can be achieved by

using a Trigger/Answer Regular Expression Filter, but since some users might not be advanced enough to use regular expressions, we decided to provide this filter for simple use cases of text content starting with, containing or ending with some sequence of characters.

4.2.4 Producers

These components are responsible for generating additional data for the interaction pairs. The search for and implementation of producers was limited to tools with existing Java libraries. We also preferred tools where Portuguese is available as a supported language.

9Contains only 4 tokens: “Yes”, “you”, “are” and “!”


• OpenNLP Sentiment Analyzer: uses a sentiment analysis tool from OpenNLP. Currently, it is only prepared to deal with English sentences. It evaluates a sentence's sentiment according to the following scale: very negative, negative, neutral, positive, and very positive;

• OpenNLP Stemmer: uses a snowball stemmer from Apache OpenNLP10. This stemmer supports

16 languages11. The language parameter is customizable through a B-Subtle configuration file

(Section 4.2.8). By default, it is applied to both the trigger and the answer.

• Open NLP Tokenizer: converts the raw text from triggers and/or answers into separate tokens. It is available for Danish, German, English, Dutch, Portuguese and Swedish.

• TreeTagger Lemmatizer: converts the raw text from triggers and/or answers into separated lem-

mas. It is available for 23 languages12.
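For instance, the Open NLP Tokenizer Producer essentially boils down to a call like the following; the model file name is a placeholder for the pre-trained model of the configured language:

import java.io.FileInputStream;
import java.io.InputStream;
import opennlp.tools.tokenize.Tokenizer;
import opennlp.tools.tokenize.TokenizerME;
import opennlp.tools.tokenize.TokenizerModel;

// Tokenizes a trigger/answer string with a pre-trained OpenNLP tokenizer model.
public class TokenizeExample {
    public static void main(String[] args) throws Exception {
        // "pt-token.bin" is a placeholder for a pre-trained Portuguese tokenizer model file.
        try (InputStream modelIn = new FileInputStream("pt-token.bin")) {
            Tokenizer tokenizer = new TokenizerME(new TokenizerModel(modelIn));
            String[] tokens = tokenizer.tokenize("Apanhei um!!");
            for (String token : tokens) {
                System.out.println(token);
            }
        }
    }
}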

4.2.5 Transformers

Transformers are entities responsible for transforming the raw text data present in the interaction

pairs.

• Lowercase: converts the raw text from triggers and/or answers into lowercase characters.

• Uppercase: converts the raw text from triggers and/or answers into uppercase characters.

Some of them can also be applied to the data generated by the Producers.

• Stringify Tokens: replaces the trigger and/or answer fields by joining the tokens generated by

some producer with some separator (the default is the space character).

• Stringify Lemmas: does the same as the above but applied to the lemmas generated by a pro-

ducer.
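As an example, applying a Lowercase transformation followed by Stringify Tokens (with the default space separator) is essentially the following, assuming tokens were generated beforehand by a Producer:

import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

// Lowercases tokens and joins them back into a single string with a space separator.
public class TransformExample {
    public static void main(String[] args) {
        List<String> tokens = Arrays.asList("Apanhei", "um", "!", "!");
        String stringified = tokens.stream()
                .map(String::toLowerCase)          // Lowercase transformer
                .collect(Collectors.joining(" ")); // Stringify Tokens with the default separator
        System.out.println(stringified); // prints: apanhei um ! !
    }
}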

4.2.6 Output Files

The generated corpora can be written to four types of output files: JSON files, XML files, Legacy files13 or Parallel Files. Each one of these types is described in this section. Some output types support a customizable parameter that allows the user to enable or disable pretty print (for JSON and XML), thus generating larger or smaller files, respectively.

10 opennlp.apache.org
11 Danish, Dutch, English, Finnish, French, German, Hungarian, Irish, Italian, Norwegian, Portuguese, Romanian, Russian, Spanish, Swedish, Turkish.
12 German, English, French, Italian, Danish, Dutch, Spanish, Bulgarian, Russian, Portuguese, Galician, Greek, Chinese, Swahili, Slovak, Slovenian, Latin, Estonian, Polish, Romanian, Czech, Coptic and Old French.
13 Similar to the ones generated by Subtle. Since SSS depends on the format of those files, this eases the process of evaluating the system.


• JSON Files: we provide this output file type because JSON is just text in a standardized format. It

can be useful for an end-user using our corpora files for some application that requires communi-

cation between a browser and a server. This output type supports a pretty print14 option that can

be enabled or disabled.

• XML Files: we provide this output file type because the way we implemented our JSON output

could be adapted without much effort to output XML files. This output type also supports a pretty

print option that can be enabled or disabled.

• Legacy Files: in order to be retro-compatible with previous systems developed at L2F/INESC-ID

(e.g. SSS) we decided to include this output type. We also planned our experiments to use SSS

with a corpus generated with B-Subtle. These output files are simple text files with the same fields

generated by the Subtle tool.

• Parallel Files: this output type consists of generating at least two files: one containing the triggers

and another one containing the answers. Each line of the triggers file is aligned with each line of

the answers file. If we pick the same line from both files (e.g. 14th line) we get the trigger and the

corresponding answer of an interaction pair. This output type allows generating corpora that can

be fed to seq2seq systems (which will be the case in our experiments for generating end-to-end

CAs). Two additional files can also be generated. We call them validation files (usually required

for experiments with seq2seq frameworks). One is for the validation triggers and the other is for

the validation answers. These files are also aligned. B-Subtle randomly samples a user-defined

quantity of interaction pairs to build these validation files. The interaction pairs inserted in validation

files are not present in corpus files.
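The alignment of the Parallel Files output can be illustrated with a short sketch (file names and sentences are arbitrary examples): line i of the triggers file and line i of the answers file form one interaction pair.

import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.List;

// Writes two line-aligned files: triggers on one side, answers on the other.
public class ParallelFilesExample {
    public static void main(String[] args) throws IOException {
        List<String> triggers = List.of("Queres ir para outro lugar?", "O que fazes nos tempos livres?");
        List<String> answers  = List.of("Não quero ir para outro lugar.", "Não sei.");

        Path triggersFile = Paths.get("corpus.triggers.txt");
        Path answersFile  = Paths.get("corpus.answers.txt");
        Files.write(triggersFile, triggers, StandardCharsets.UTF_8);
        Files.write(answersFile, answers, StandardCharsets.UTF_8);

        // Reading back the same line from both files yields one aligned interaction pair.
        String trigger = Files.readAllLines(triggersFile, StandardCharsets.UTF_8).get(0);
        String answer  = Files.readAllLines(answersFile, StandardCharsets.UTF_8).get(0);
        System.out.println(trigger + " -> " + answer);
    }
}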

4.2.7 Analytics

B-Subtle offers the possibility to collect analytical data about the process of creating corpora, as well as analytics about the input/output datasets. This makes it easy to obtain information about the corpora, but also to analyze movie and TV show information from a collection of subtitle files (e.g. it can be interesting to study the evolution of the pace of movies over the years). Three types of analytical data can be

generated: global, meta-data and interaction. The analytics collected are currently outputted to JSON

files.

Meta-data   Suppose that we want to study the pace of subtitles from 1990 till 2010 for the “Adventure” genre. We can define a Genre Filter for that genre and a Year Filter with the corresponding range value, and then activate both the Global Analytics and Meta-data Analytics components.

14Show JSON with indentation in multiple lines, instead of a single line with all the information.


We will be able to get the average time difference between trigger and answer, giving us the information we were interested in.

Each Meta-data Filter can be configured to collect analytical data or not. Basically, our filters fire an event that is captured by the Meta-data Analytics component, which aggregates data received from multiple filters.

There are as many types of Meta-data Analytics that can be collected as there are meta-data filters available.

Interaction Pairs   Aggregating Interaction Pair Analytics works similarly to the Meta-data Analytics component, since we also have filters for the interaction pairs. Therefore, there are as many types of interaction pair analytics that can be collected as there are interaction pair filters available.

Global   When generating a new corpus, the end user might want to collect analytical data about the generation process. We call it Global Analytics and it includes the following information:

• Total input files processed (includes quantity and size);

• Total invalid input files (includes quantity and size);

• Total output files generated per output type (includes quantity and size);

• Average time spent processing each file;

• Total time spent processing all files;

• Average interaction pairs extracted per input file;

• Input file with the most interaction pairs;

• Largest input file;

• Largest output file (per output type).

4.2.8 Configuration Files

YAML Ain’t Markup Language (YAML)15 configuration files are supported since their signal-to-noise

ratio is higher without all the brackets that we are used to seeing in XML files. This makes them subjectively easier to read and edit (Listing 4.1).

For the OpenSubtitles2016 task the following fields are currently available to be used:

• The directory where the input files can be found;

15 http://yaml.org/


• A list of Meta-data Collector components;

• A list of Meta-data Filter components;

• A list of Interaction Pair Filter components;

• A list of Producer components;

• A list of Transformer components;

• A list of Analytics components;

• A list of Output components that will generate corpora files (see Section 4.2.6).

Listing 4.1: Example of configuration file for OpenSubtitles2016 Pipeline

---
pipelines:
  - pipelineType: opensubtitles
    inputDirectory: "/input/dataset/path"

    metadataFilters:
      - filterType: country
        value: "Italy"

    interactionFilters:
      - filterType: triggerEndsWith
        value: "?"

    producers:
      - producerType: openNLPTokenizer

    transformers:
      - transformerType: lowercase

    outputs:
      - outputType: json
        outputDir: "/output/corpus/path"
        prettyPrint: true

For a full understanding of the configuration file possibilities and what their content represents, see Appendix C for a sample configuration file with all parameters briefly explained. We also provide online documentation for B-Subtle16.

16https://miguelventura.gitbook.io/bsubtle/


5 Building Agents

Contents

5.1 Say Something Deep . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41

5.2 Preliminary Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42

5.3 Main Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46


We aim to create a set of agents. In order to achieve this, we need to accomplish two distinct tasks:

1. Build knowledge bases for our agents;

2. Assign them an architecture for generating or selecting the best answer given a user input.

We can rely on our B-Subtle tool to generate knowledge bases from OpenSubtitles2016 Corpus. For

the second task, we already have a system available for selecting the answer given a user input: SSS

(described in Section 2.2). As we have seen it relies on several measures that are combined in order to

select the answer with the highest score. The weights for each measure were determined empirically.

In [Mendonca et al., 2017] the authors investigated a methodology to learn the best values for each

weight. This was done by feeding user feedback into the system. However, this learning strategy did

not bring significant improvements to the overall system performance. Therefore, we decided to create

a brand new system based on seq2seq models - SSD, described in the following section.

With two distinct architectures for answer selection, we will be able to compare them against each

other by using agents with the same knowledge base.

5.1 Say Something Deep

In recent experiments, the seq2seq model has been shown to be very appealing for purely data-

driven approaches. It provides state-of-the-art results when mapping complex structures with variable

length to other structures in many domains such as machine translation, speech recognition, and text

summarization [Sutskever et al., 2014,Cho et al., 2014,Wu et al., 2016].

Instead of translating from one language to another with seq2seq, we aim to “translate” an input

(trigger) to an output (answer). By relying on a generative approach we expect our system to generate

new answers even for interactions it has not seen during the training phase. This way we diverge from

SSS which relies on a retrieval approach that only allows it to return pre-defined answers available in

the knowledge base.

5.1.1 Architecture

SSD receives a user request (the trigger) and chooses an answer. Architecturally speaking it works

just like SSS: it receives an input and applies some procedures to select an output. However, the

process of selecting the best answer is entirely different.

SSD relies on a seq2seq framework which is commonly used for machine translation tasks: OpenNMT-

tf1. This framework is a general purpose sequence modeling tool built with TensorFlow2. It provides

1 opennmt.net/OpenNMT-tf/
2 tensorflow.org


multiple features that we need for our research:

• It defines neural network models with a simple configuration file;

• Since it is oriented towards machine translation tasks, it receives a source/target dataset as input when training a model; for instance, when translating from English to French, the source dataset is in English and the target dataset is in French. Both datasets must have the same size3. In our case, the input dataset consists of triggers and the output dataset consists of the corresponding answers;

• Being built with TensorFlow, it is ready for production thanks to TensorFlow Serving4. Once a model is trained, it can be deployed to production environments in order to make predictions with

new data samples. This allows our model to be connected to a user interface for an agent such as

Filipe [Ameixa et al., 2014];

• The training process can also be monitored by using TensorFlow’s Visualization Toolkit: Tensor-

board5 (Figure 5.1).

5.1.2 Neural Network Model

Our model bears a close resemblance to the model described in [Vinyals and Le, 2015] - a seq2seq

model with LSTMs. We also applied to our model an attention mechanism [Luong et al., 2015,Bahdanau

et al., 2014] which lets the decoder learn to focus on specific parts of the input sequence when decoding,

instead of relying only on the hidden vector of the decoder's LSTM. This model architecture consists of two LSTMs: one for the encoding phase and another for the decoding phase. Before reaching our main experiments, this model underwent multiple changes. All of those changes are described in the remaining part of this chapter.

5.2 Preliminary Experiments

Training a seq2seq model involves setting and adjusting multiple parameters related to the architec-

ture of the underlying neural network. We need to decide the type of our encoders and decoders. We

should also select an appropriate value for the number of encoding and decoding layers. Then a vocab-

ulary size must be set. This goes on and on through an extensive list of configurable parameters. There

is no right or wrong when selecting values for those parameters because they depend on the data that

we will provide. Since determining the best values for those parameters is out of the scope of this thesis

3 Number of entries/sentences.
4 https://www.tensorflow.org/serving/
5 github.com/tensorflow/tensorboard


Figure 5.1: Tensorboard - visualization tool for understanding, debugging, and optimizing the models being trained.


we chose the same values used by [Vinyals and Le, 2015] for the OpenSubtitles corpus as a starting

point: 2 LSTM layers with 4096 hidden units for the Encoder and for the Decoder.

We created three corpora with our B-Subtle tool so that we could adjust the configuration of our SSD system during our first phase of preliminary experiments. We ended up with the following corpora:

1. Corpus with all Portuguese subtitles. The B-Subtle configuration file used to generate the corpus

is available at Listing A.1. We ended up with a corpus with almost 95 Million interaction pairs;

2. Corpus with all Portuguese subtitles with answer length greater than 25 characters. The B-Subtle

configuration file used to generate the corpus is available at Listing A.2. We ended up with a

corpus with almost 43 Million interaction pairs;

3. Corpus with all Portuguese subtitles with the “Horror” genre and subtitle rating above 5.0. The

B-Subtle configuration file used to generate the corpus is available at Listing A.3. We ended up

with a corpus with just a little bit over 320 thousand interaction pairs.

Due to the large size of the first corpus, we had to adjust the configuration of the model in order to

adapt it to our time and hardware constraints6.

We started with a two-layered LSTM using AdaGrad with gradient clipping [Duchi et al., 2011]. Since we were dealing with subtitles, in which longer sentences are not so frequent, we made a compromise and reduced the number of LSTM hidden units from the original 4096 units to 2048 units. This decision might influence the efficiency of the memory mechanism of an LSTM. However, the process of training the model became faster7, allowing us to proceed with our experiments in reasonable time. For the vocabulary size, we tried 100 thousand words but got out-of-memory errors when training. To avoid those memory errors on our GPUs8, we reduced it to 75 thousand words for the first and second corpus. The corpus with fewer interactions (the third one in the previous list) had to be set at 50 thousand words since it did not contain more distinct words. We kept the embedding size at 512 units9.

Afterward, we had to decide the training time for our SSD system. The OpenNMT-tf framework

allows us to define a maximum number of steps for our training process. Each step processes a batch10

of interaction pairs from our input corpus. We ended up defining three training marks for our first phase

of preliminary experiments: 100 thousand, 250 thousand and 500 thousand steps11.

6 INESC-ID provides GPUs to their community as a shared resource any member can use responsibly.
7 We almost tripled the steps per second of the training process (each step corresponds to processing one batch of input sequences).
8 We trained our models on GeForce GTX 1080 Ti, GeForce GTX TITAN X and Tesla K20Xm GPUs, as long as they were available.
9 Default value for the medium size model provided by OpenNMT-tf.
10 We used the default value for the batch size: 64.
11 Selecting a number of epochs would be a better approach (a bigger corpus requires more steps to complete an epoch than a smaller one), but due to time constraints we were not able to test a full epoch for the first and second corpus at that time. Also, we were only doing our first phase of preliminary experiments to get a sense of what to expect from our SSD system.


To evaluate the convergence of the training we defined our validation set size to ten percent of the

original corpus (e.g. for the first corpus that has about 95 million interactions, the validation set has 9.5

million interactions). In later experiments and after some research work we found that in the context of

big data this set of validation data could be much smaller.

We manually translated to Portuguese 200 questions used in [Vinyals and Le, 2015], so that we could make a subjective analysis of the results of the models trained with the generated corpora, after reaching the training marks previously described. We also tested with a set of 161 user questions in

Portuguese made to Filipe [Ameixa et al., 2014].

We came up with the following observations:

• We detected a considerable amount of “<unk>” symbols in the answers given by the model trained

with the first corpus. This is related to the size of the vocabulary;

• We did not manage to finish the third training mark (500 thousand steps) for the first corpus since

the model training was only processing about 2 steps per second12. It would take about three days

to reach 500 thousand steps;

• Our first model provided very simple and short answers. These three were undoubtedly the most frequent: “sim” (yes), “não” (no) and “não sei” (I don't know);

• Our second model gave longer answers. Most of them were questions like “O que é que estás a fazer?” (What are you doing?) and “O que é que isso quer dizer?” (What does that mean?). Even though it produced some valid answers (e.g. question: “Queres ir para outro lugar?” (Do you want to go somewhere else?), answer: “Não quero ir para outro lugar.” (I do not want to go anywhere else.)), a great part of them did not make sense (e.g. question: “O que fazes nos tempos livres?” (What do you do in your free time?), answer: “Não quero que te aconteça nada.” (I do not want anything to happen to you.));

• Our third model gave responses with more variability. Even so, most of them were not related to the corresponding question (e.g. question: “O que fazes nos tempos livres?” (What do you do in your free time?), answer: “Não há nada de errado durante a manhã.” (There is nothing wrong in the morning.)). Also, the corpus used to train this model was taken from Horror movies; however, we did not find any of its answers conveying any sentiment of fear;

• The validation size we set was too big and slowed down our training process.

All the results obtained during our preliminary experiments are available in a spreadsheet13.

12 On a Geforce GTX 1080 Ti with a compute capability of 6.1 (https://developer.nvidia.com/cuda-gpus).
13 https://goo.gl/4tXu3v


5.3 Main Experiments

In order to conduct our main experiments, we tried to fix some of the issues we found during the preliminary experiments phase.

To partially solve the problem of the high volume of “<unk>” symbols, we decided to

normalize all of our corpora by applying a tokenization step and a transformation step that would convert

all text to lowercase. We also reduced the embedding units of our SSD system from 512 units to 256

units. This way we were able to increase the size of the vocabulary to 100 thousand words.

In preliminary experiments, the training process was too slow. To increase performance (while poten-

tially degrading the results obtained) we decided to reduce the number of our LSTM hidden units from

2048 to 102414. The performance of the training process with SSD improved by a significant amount15.

We generated three new corpora. After defining three training marks at 100 thousand, 200 thousand and 300 thousand steps, we found that they could give interesting results16 if trained for a couple of epochs17.
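As a rough illustration of the relation between steps, batch size and epochs (assuming the default batch size of 64 used in the preliminary experiments): for the smallest of the new corpora, with about 786 thousand interaction pairs (see Section 5.3.1), one epoch corresponds to roughly 786,000 / 64 ≈ 12,300 steps, while for the corpus with about 95 million interaction pairs a single epoch already requires 95,000,000 / 64 ≈ 1.5 million steps.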

We will now describe the setup for our main experiments. We start by showing how we generated the corpora in Section 5.3.1 and then give a detailed description of the configuration of our answer selection systems in Sections 5.3.2 and 5.3.3.

5.3.1 Corpora

We generated three new corpora using our B-Subtle tool. We also used the Subtle Corpus (Section 2.1) so that we could compare our answer selection systems.

We started by generating a brand-new corpus of Portuguese interactions from all subtitles provided

by OpenSubtitles2016. This is an important contribution since it provides a much larger corpus when

compared with the Portuguese version of Subtle corpus:

• Corpus A: Corpus with all Portuguese subtitles available in OpenSubtitles2016. The following B-

Subtle components were used: an Open NLP Tokenizer Producer and a Lowercase Transformer.

It outputs Parallel files for the SSD system and Legacy Files for the SSS system. We ended up

with a corpus with almost 95 million interaction pairs. The validation size was set to a fixed value

of 2500 interaction pairs. Check the B-Subtle configuration file in Listing 5.1.

14 Since we are dealing with corpora from subtitles, most of the sentences are short, so reducing the number of hidden units of our LSTMs might not have a great impact on final results.
15 From about 2 steps per second to about 8.5 steps per second on a GeForce GTX 1080 Ti.
16 Spreadsheet with results for those three training marks: https://tinyurl.com/y8kwx7lb
17 The final result of multiplying the number of steps by the batch size should be equal to the corpus size for an epoch to be complete.


Listing 5.1: B-Subtle's configuration file to generate Corpus A.

---
pipelines:
  - pipelineType: opensubtitles
    inputDirectory: "/OpenSubtitles2016/raw/pt"
    producers:
      - producerType: openNLPTokenizer
        modifyInPlace: true
    transformers:
      - transformerType: lowercase
    outputs:
      - outputType: parallel
        outputDir: "/BSubtleOutput/OpenSubtitles2016/corpusA/"
        validationSize: 2500
      - outputType: legacy
        outputDir: "/BSubtleOutput/OpenSubtitles2016/corpusA/"

Having in mind the dimension of Corpus A, we decided to create a smaller corpus18 by filtering out the subtitle files with a lower rating:

• Corpus B: Corpus with all Portuguese subtitles available in OpenSubtitles2016 that have a subtitle rating equal to or above 5.019. The following B-Subtle components were used: a Subtitle Rating Filter,

an Open NLP Tokenizer Producer and a Lowercase Transformer. It outputs Parallel files for the

SSD system. We ended up with a corpus with almost 4,5 million interaction pairs. The validation

size was set to a fixed value of 2500 interaction pairs. Check the B-Subtle configuration file in

Listing 5.2.

Listing 5.2: B-Subtle's configuration file to generate Corpus B.

---
pipelines:
  - pipelineType: opensubtitles
    inputDirectory: "/OpenSubtitles2016/raw/pt"
    metadataFilters:
      - filterType: subtitleRatingMin
        value: 5.0
    producers:
      - producerType: openNLPTokenizer
        modifyInPlace: true
    transformers:
      - transformerType: lowercase
    outputs:
      - outputType: parallel
        outputDir: "/BSubtleOutput/OpenSubtitles2016/corpusB/"
        validationSize: 2500

Since the previously generated corpora contain interactions extracted from any sequence of sentences, we decided to limit our triggers to questions by keeping only interaction pairs whose trigger ends with a question mark:

• Corpus C: Corpus with all Portuguese subtitles available in OpenSubtitles2016 that have a subtitle rating equal to or above 5.0 and whose triggers end with a question mark. The following B-Subtle

components were used: a Subtitle Rating Filter, an Interaction Pair Filter, an Open NLP Tokenizer

Producer and a Lowercase Transformer. It outputs Parallel files for the SSD system. We ended up with a corpus with almost 786 thousand interaction pairs. The validation size was set to a fixed value of 2500 interaction pairs. Check the B-Subtle configuration file in Listing 5.3.

18 It allows us to train models with SSD across more epochs, thus allowing us to further research the capabilities of our neural model for answer selection.
19 The OpenSubtitles subtitle rating scale goes from a minimum of 0 to a maximum of 10.

Listing 5.3: B-Subtle's configuration file to generate Corpus C.

---
pipelines:
  - pipelineType: opensubtitles
    inputDirectory: "/OpenSubtitles2016/raw/pt"
    metadataFilters:
      - filterType: subtitleRatingMin
        value: 5.0
    interactionFilters:
      - filterType: triggerEndsWith
        value: "?"
    producers:
      - producerType: openNLPTokenizer
        modifyInPlace: true
    transformers:
      - transformerType: lowercase
    outputs:
      - outputType: parallel
        outputDir: "/BSubtleOutput/OpenSubtitles2016/corpusC/"
        validationSize: 2500

5.3.2 Say Something Deep Setup

The preliminary experiments together with further research made us choose the following configura-

tion for the SSD:

• Encoder type: Bidirectional RNN Encoder with 2 layers, each one using LSTM as a memory cell

type with 512 hidden units;

• Decoder type: Attentional RNN Decoder with 2 layers, each one using LSTM as a memory cell

type with 512 hidden units;

• Vocabulary Size: 100 thousand words;

• Word Embedding Size: 256;

• Training time: for Corpus A we trained for 6 epochs due to time restrictions20. For the other corpora we trained until the normalized loss stabilized21.

We ended up choosing a Bidirectional Encoder for our main experiments because it does a better job at preserving the input context. It is also used in several seq2seq models, as in [Wu et al., 2016]. Since introducing this type of encoder decreased the speed of the training process, we reduced each layer of both the encoder and decoder from 1024 units to 512 units, thus reducing even further the number of hidden units of the LSTMs.

20 One epoch takes almost one day and a half using two GTX 1080 Ti.
21 When it stopped decreasing its value.


We applied the default AdaGrad optimizer22 provided by the OpenNMT-tf framework.

For each corpus we adjusted the sample buffer size and the number of the training steps in order to

train all models during full epochs.

To reduce the number of sub-optimal answers given by our models, we ran some experiments comparing a greedy search approach with a beam search approach when decoding the output. After some subjective analysis, we ended up choosing beam search with a beam width of 2 and a length penalty of 1 (neutral). With this configuration, our models seemed to return better answers overall, particularly when compared with the greedy approach.
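To make the comparison concrete, the following toy sketch contrasts greedy decoding with a beam search of width 2 over a generic next-token distribution. It is not the decoder implemented by OpenNMT-tf, and the length normalization used here is a simplification of whatever exact formula the framework applies.

import math

def greedy_search(step_fn, start_token, end_token, max_len=20):
    # step_fn(prefix) returns a dict {token: probability} for the next token.
    tokens = [start_token]
    for _ in range(max_len):
        if tokens[-1] == end_token:
            break
        probs = step_fn(tokens)
        tokens.append(max(probs, key=probs.get))
    return tokens

def beam_search(step_fn, start_token, end_token,
                beam_width=2, max_len=20, length_penalty=1.0):
    beams = [([start_token], 0.0)]          # (tokens, cumulative log-probability)
    finished = []
    for _ in range(max_len):
        candidates = []
        for tokens, logp in beams:
            if tokens[-1] == end_token:
                finished.append((tokens, logp))
                continue
            for token, prob in step_fn(tokens).items():
                candidates.append((tokens + [token], logp + math.log(prob)))
        if not candidates:
            break
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = candidates[:beam_width]     # keep the beam_width best partial hypotheses
    finished.extend(beams)
    # Simple length normalization so longer answers are not unfairly penalized.
    return max(finished, key=lambda h: h[1] / (len(h[0]) ** length_penalty))[0]

Keeping a second hypothesis alive is what allows the decoder to recover from a locally attractive but globally poor first token, which greedy decoding cannot do.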

5.3.3 Say Something Smart Setup

The Subtle Corpus had already been indexed by SSS in the past, and we reused those indexes for our experiments. However, we needed to index Corpus A ourselves, following the instructions available in a readme file provided by the tool.

5.3.4 Agents

In this section, we present the agents that we created, identifying their knowledge bases and their answer selection mechanisms.

• Agent Alpha: its knowledge base comes from Corpus A and its answer selection mechanism is

the SSD system with a model trained during 6 epochs;

• Agent Beta: its knowledge base comes from Corpus B and its answer selection mechanism is

the SSD system with a model trained during 15 epochs;

• Agent Charlie: its knowledge base comes from Corpus C and its answer selection mechanism is

the SSD system with a model trained during 15 epochs;

• Agent Delta: its knowledge base comes from Corpus A and its answer selection mechanism is

the SSS system;

• Agent Echo: its knowledge base comes from Subtle Corpus and its answer selection mechanism

is the SSS system.

[22] Adam optimizer with gradient clipping and decay.


6 Evaluation

Contents

6.1 How to Evaluate our Agents?
6.2 Human Evaluation
6.3 Results
6.4 Summary


In this chapter, we present the procedures and results of evaluating the 5 agents created in our

experiments. We start by discussing in Section 6.1 which evaluation methods seem to fit the agents

(“chatbots”) we created for open-domain conversations. Then, we describe the method we ended up

choosing in Section 6.2. Finally, in Section 6.3, we present and discuss the results obtained.

6.1 How to Evaluate our Agents?

One of the main challenges faced when evaluating open-domain dialogue agents (chatbots) is the lack of a good mechanism to measure their performance. The absence of an explicit objective in open-domain conversations makes evaluating dialogue systems a challenging research problem. Research on answer generation has led to the adoption of metrics from machine translation to automate the evaluation phase [Ritter et al., 2011, Serban et al., 2016, Li et al., 2015, Wen et al., 2015]. For instance,

BLEU [Papineni et al., 2002] is a standard for evaluating machine translation models. However, relying

on this metric to evaluate our agents could lead to an erroneous analysis of the answers given by our

agents. BLEU assumes that valid answers have significant word overlap with the ground truth answers.

In an open-domain conversation, there is a significant diversity in the space of valid answers to a particular trigger. For example, let us say we have an interaction pair whose trigger is “Do you want to go to the beach?”, whose (ground truth) answer is “Not today.”, and one of our agents responded with a plausible

“Of course I want!”. The BLEU score for the agent response would be zero because there are no words

in common between its answer and the ground truth answer. Additionally, in our experiments there is no

guarantee that the interaction pairs present in our corpora contain the required ground truth answers that

actually respond/react to the trigger they are associated with [1]. A recent study [Liu et al., 2016] provides

evidence against existing metrics for evaluating dialogue response generation systems: it shows that the correlation between automatic metrics and human judgment is very weak and sometimes nonexistent. With this in mind, we decided to rely on human evaluation by conducting a

survey.
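To make this limitation concrete, the snippet below computes BLEU with NLTK for the example above. This is only an illustration: NLTK and this simplified tokenization were not part of our evaluation pipeline.

from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

reference = "not today .".split()            # ground truth answer
hypothesis = "of course i want !".split()    # plausible agent answer, no word overlap

score = sentence_bleu([reference], hypothesis,
                      smoothing_function=SmoothingFunction().method1)
print(score)  # a value near zero, despite the answer being plausible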

6.2 Human Evaluation

We aimed to compare our agents against each other, so that we could examine the influence of the corpora we generated, but we also aimed to compare the architectures for generating (SSD) or selecting (SSS) the best answer. In order to evaluate the performance of our agents, we asked volunteers to fill in a survey. The survey consisted of a list of questions; for each question, the volunteer had to classify the answer given by each agent (Figure 6.1) in the following way:

[1] The answer can correspond to a change of scene in a movie or can come from a completely different speaker.


Figure 6.1: Example of a question in our survey (in Portuguese). Each line corresponds to an answer given by one of our agents that should be classified as valid, plausible or invalid.

• Valid: when the answer was appropriate to the subject of the question (e.g. “How old are you?”

and the answer given was “I am 24 years old.”);

• Plausible: when the answer could be appropriate in a given context (e.g. “Where do you live?”

and the answer given was “We do not have time for that, keep running.”) or referred to details that

may belong to the person who performed the interaction (e.g. the agents could return answers containing names, such as “Ok John.”, which would be a plausible answer if the agent were talking with someone called John);

• Invalid: when the answer was not adequate or included grammatical errors that made it difficult to

understand (e.g. “How old are you?” and the answer given was “Yes.”).

The survey consisted of a list of 100 questions (see Appendix B) randomly picked from a set of 361 questions: 200 questions manually translated to Portuguese from [Vinyals and Le, 2015] and 161 questions made to Filipe [Ameixa et al., 2014] that were already available in Portuguese. Although we refer to them as questions, some of them were comments (e.g. “Life is hard...”). This list of questions underwent the same pre-processing as our corpora generated with B-Subtle, being converted to lowercase and tokenized [2]. When collecting the answers given by the agents to our list of questions, we reverted the tokenization process and capitalized the first letter so that they appeared more natural to the volunteers (e.g. “bye , see you later .” was converted to “Bye, see you later.”). We ensured

that each answer entry only appeared once for the same question (e.g. when two agents answered “I


am fine.” for the question “How are you?”, only one entry would appear with “I am fine.”). Taking this into account, the number of possible entries for each question could vary between 1 (when all agents answered equally) and 5 (when all agents answered differently). Since some volunteers might not have the availability to fill in the entire survey, we decided to split it into two parts, each one with 50 questions. The first part was required and the second part was optional.

[2] Except for Agent Echo: the indexed version of the Subtle Corpus used by this agent did not pass through our pre-processing pipeline.
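The post-processing step described above (reverting tokenization and capitalizing the first letter) can be illustrated with a small sketch; it is a simplified stand-in for the actual scripts we used, not the exact code.

import re

def prettify(answer):
    # Remove the spaces the tokenizer inserted before punctuation marks.
    text = re.sub(r"\s+([.,!?;:])", r"\1", answer)
    # Capitalize the first character so the answer looks natural to volunteers.
    return text[:1].upper() + text[1:]

print(prettify("bye , see you later ."))  # -> "Bye, see you later."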

6.3 Results

A total of 44 volunteers participated in the survey. Of the participants, 32 filled in both the first and the second part. Since the survey had 100 questions, each one with 5 answers (given by our agents) to be evaluated, we were able to collect 3800 [3] human evaluations per agent.

[3] (32 × 100) + (12 × 50) = 3800

6.3.1 Overview

Table 6.1 presents the percentages of valid, plausible and invalid classifications given by the volunteers to the answers of our agents.

Table 6.1: Percentage of valid, plausible and invalid answer classifications given to our 5 agents (Alpha, Beta, Charlie, Delta, and Echo) by the volunteers who filled out the survey.

                    Agent Alpha   Agent Beta   Agent Charlie   Agent Delta   Agent Echo
Valid Answer             72.97%       57.97%          58.53%        22.58%       26.08%
Plausible Answer         18.32%       19.39%          21.47%        16.58%       20.95%
Invalid Answer            8.71%       22.63%          20.00%        60.84%       52.97%

The first observation is that the agents which use SSD (Alpha, Beta, and Charlie) gave answers that were far more preferred by our survey participants than those of the agents which use SSS (Delta and Echo).

Agent Alpha achieved impressive results, with 72.97% of its answers classified as valid and 18.32% classified as plausible. Only 8.71% of its answers were labeled as invalid. It also leads the pack by a large margin, with 14.44 percentage points more valid answers than the second-best, Agent Charlie. It is interesting to see that these results were obtained by the agent that was trained for the least time.

Although Charlie is the second-best agent among the five, Agent Beta showed similar behavior, with nearly identical percentages of valid, plausible and invalid answers. Agent Charlie is only slightly better than Agent Beta, having fewer invalid answers (-2.63 percentage points).

At the bottom of the pack, we have Agent Delta and Agent Echo. Agent Delta had the worst results, with 60.84% of its answers classified as invalid. The difference between its valid and plausible answers is small, and the sum of the two (39.16%) does not even reach the percentage of valid


answers given by the third-best Agent Beta (57.97%). Surprisingly, Agent Echo, which uses a smaller

and outdated corpus of Portuguese subtitles (Subtle Corpus), was able to achieve better results than

Agent Delta, which uses a brand new corpus created during our experiments (Corpus A).

6.3.2 Short and Simple Answers

While analyzing the answers given by our agents that rely on SSD (Agents Alpha, Beta, and Charlie), we noticed a high number of short and simple responses. “Sim.” (Yes.), “Não.” (No.) and “Não sei.” (I do not know.) were the most frequent, as shown in Table 6.2.

Table 6.2: Percentage of “Sim.” (Yes.), “Não.” (No.) and “Não sei.” (I do not know.) answers returned by Agents Alpha, Beta and Charlie for the 100 questions included in the conducted survey.

                               Agent Alpha   Agent Beta   Agent Charlie
“Sim.” (Yes.)                       19.00%       21.00%          18.00%
“Não.” (No.)                        17.00%       10.00%          10.00%
“Não sei.” (I do not know.)         35.00%       24.00%          29.00%
Total                               71.00%       55.00%          57.00%

At this point, we were intrigued by the possibility of an almost direct correlation between the percentage of short and simple answers returned by our agents and the percentage of answers marked as valid by our survey participants. As shown in Table 6.1, Agent Alpha had 72.97% of its answers considered valid, but 71.00% of its answers (Table 6.2) were either “Sim.” (Yes.), “Não.” (No.) or “Não sei.” (I do not know.). We observe similar results for Agents Beta and Charlie.

After further investigation, we identified that 23.06% of Agent Alpha's valid answers correspond to responses other than the short and simple answers already described (see Figure 6.2). We were able

to recognize a similar pattern for the other two agents. Agent Beta got 24.86% and Agent Charlie got

28.49%. Although Agent Charlie appears to return more diverse (and perhaps more interesting) answers

than its competitors for the valid label, the converse happens for the plausible label (see Figure 6.3).

Despite the fact that Agents Alpha, Beta and Charlie gave mostly short and simple answers, most of those answers were considered valid by our survey participants (Figure 6.4 makes it evident that the short and simple answers were labeled as invalid very few times in comparison with all the other invalid answers). The high number of short and simple answers can also be explained by the content of the questions, which were chosen randomly for this study. Some of them really do call for a “Sim.” (Yes.), “Não.” (No.) or “Não sei.” (I do not know.) as an answer (e.g. “Gostas do teu trabalho?” (Do you like your work?) can be satisfied by a “Não.” (No.)).

The “Não sei.” (I do not know.) answer seems to be the biggest issue that we identified in our SSD agents: that type of answer can make a chatbot boring in a conversational context. Even so, the answer “Não sei.” (I do not know.) was considered valid or plausible a significant number of times by the volunteers who participated in our survey (see Table 6.3).


Figure 6.2: Amount of “Sim.” (Yes.), “Não.” (No.) or “Não sei.” (I do not know.) answers considered valid among all answers labeled as valid by the survey participants.

Figure 6.3: Amount of “Sim.” (Yes.), “Não.” (No.) or “Não sei.” (I do not know.) answers considered plausible among all answers labeled as plausible by the survey participants.

Figure 6.4: Amount of “Sim.” (Yes.), “Não.” (No.) or “Não sei.” (I do not know.) answers considered invalid among all answers labeled as invalid by the survey participants.


Table 6.3: Classification (valid, plausible or invalid) of the “Sim.” (Yes.), “Não.” (No.) and “Não sei.” (I do not know.) answers, evaluated independently for each answer. Only the agents relying on the new SSD system for generating answers are included.

Answer is “Sim.” (Yes.)
                    Agent Alpha   Agent Beta   Agent Charlie
Valid Answer             91.22%       87.81%          93.13%
Plausible Answer          3.51%        2.74%           4.68%
Invalid Answer            5.27%        9.45%           2.19%

Answer is “Não.” (No.)
                    Agent Alpha   Agent Beta   Agent Charlie
Valid Answer             93.51%       98.60%          97.55%
Plausible Answer          1.30%        1.40%           1.90%
Invalid Answer            5.19%        0.00%           0.54%

Answer is “Não sei.” (I do not know.)
                    Agent Alpha   Agent Beta   Agent Charlie
Valid Answer             61.25%       55.34%          49.27%
Plausible Answer         32.55%       35.36%          37.23%
Invalid Answer            6.19%        9.29%          13.50%

6.4 Summary

In this chapter, we presented the results of evaluating our 5 CAs. We started by discussing why we

chose to conduct a survey to evaluate them. Then, we provided the details about how we created the

survey and the guidelines given to our human evaluators.

After presenting and analyzing the results of the survey, we showed that CAs based on a generative approach achieved a much higher proportion of plausible and valid answers than CAs based on a retrieval approach. However, we also identified that the generative agents returned far more short and simple answers than the retrieval agents.


7 Conclusions

Contents

7.1 Contributions
7.2 Future Work


7.1 Contributions

This document addressed both a hands-on project and a scientific research assignment. By analyzing the limitations of the Subtle Corpus and the Subtle Tool created by the L2F/INESC-ID group, we set out to build a brand new tool for generating corpora from movie and TV show subtitles. We called it B-Subtle.

Furthermore, we scrutinized the system behind Filipe (a CA also built by the L2F/INESC-ID group), which relies on SSS for retrieving answers from the Subtle Corpus after receiving a user input. After identifying its limitations, we faced two possibilities: improve the existing system or develop a competing system following a different approach, based on answer generation. We opted for the latter, creating SSD, a system that uses seq2seq learning with neural networks to build models capable of generating answers given a user input.

We presented B-Subtle as a powerful tool for generating new corpora of interaction pairs from subti-

tle files. It was developed with flexibility and expandability in mind. Therefore, a modular approach was

chosen, allowing the tool to be expanded in the future with additional features just by adding new mod-

ules/components. At present, we support OpenSubtitles2016 Corpus files as input. It was the biggest

and the most appropriate corpus of subtitles among all the corpora we found during our research. The

incorporation of meta-data, together with the possibility of processing the data we collect, allowed us to offer a wide and advanced range of filtering options. This enabled the generation of highly customized

corpora of interaction pairs. Considering the fact that end-users of B-Subtle might need to integrate

corpora with different systems, we offered multiple output formats. This was very useful for our experi-

ments since we needed a corpus that was both compatible with SSS and SSD. Having one tool capable

of generating the same corpora in two distinct formats was a great advantage. Besides creating new

corpora of interaction pairs from subtitles, we also added the option for collecting analytical data about

them with B-Subtle.

We also provided an overview of the present state-of-the-art in end-to-end generative dialogue sys-

tems in our related work section. We presented various types of data-driven dialogue systems relying

on neural networks for seq2seq learning. Since some of them even used movie and TV show subtitles

as input data, we followed some of their guidelines as a starting point. Then, we began creating our own

conversational models.

In our main experiments we ended up creating 5 CAs. Three of them were created with SSD, thus

being able to generate answers. The other 2 were created with SSS, thus using a retrieval strategy

for selecting answers. We relied on our B-Subtle tool to create the knowledge bases of our agents,

by creating multiple variations of Portuguese corpora. In order to evaluate how our CAs compared to each other, we opted for human evaluation by conducting a survey. The results obtained from the survey showed that the answers given by the generative agents created with SSD were far more preferred than those given by the retrieval agents created with SSS.


To sum up, we were able to:

• Provide B-Subtle as a complete tool for creating fully customizable corpora of interaction pairs

from movie and TV show subtitles. The architecture of this tool allows for further expansion of

its features by future developers;

• Create 3 CAs that (according to human evaluation through a survey) gave a much greater number of appropriate answers than the 2 CAs built with the existing SSS developed by the L2F/INESC-ID group;

• Hand over three new corpora containing interaction pairs collected from Portuguese subtitle files up to the year 2016.

7.2 Future Work

We are confident that our research will serve as a base for future studies on creating CAs using

seq2seq neural networks. Although the results of our survey reveal that agents created with SSD have a

much higher percentage of valid answers than the ones created with SSS, we found ourselves observing

a problem already known in the literature: the generation of safe, short and simple responses. In fact, that type of response seems to be appropriate most of the time (e.g. for “yes” and “no” answers), but after analyzing the answers given by the SSD agents we found that almost one-third of their answers were “Não sei.” (I do not know.). The human evaluation results showed that the survey participants had a slightly more divided opinion when classifying that type of answer as valid, plausible or invalid. The

problem lies in the simplicity of the seq2seq model: the objective function being optimized does not capture the actual objective of a conversation with a human. Recent studies have proposed alternatives to this objective function. In [Li et al., 2015], the authors suggest an objective function that avoids favoring generically high-probability responses by trading off the likelihood of the response given the input against the likelihood of the input given the response. However, they applied their new objective function a posteriori, to an N-best list returned by the seq2seq model. A similar experiment could be made by returning the N-best answers of the models generated with SSD and then applying the similarity metrics that are part of SSS; a sketch of this idea is given below. In [Shao et al., 2017], the authors changed the behavior of the beam search algorithm and introduced stochastic sampling operations. As we can see, there are multiple approaches that can be taken to reduce the number of commonplace answers.
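As a purely hypothetical sketch of the reranking idea mentioned above: the similarity function below is a stand-in for the metrics used by SSS, and the N-best list with model scores is assumed to come from the SSD decoder (neither interface is shown here).

def rerank_nbest(trigger, nbest, similarity_fn, weight=1.0):
    """Rerank an N-best list of (answer, model_score) pairs by combining the
    seq2seq score with a trigger-answer similarity score (stand-in for SSS metrics)."""
    def combined(pair):
        answer, model_score = pair
        return model_score + weight * similarity_fn(trigger, answer)
    return max(nbest, key=combined)[0]

# Dummy similarity function based on word overlap, for illustration only.
def overlap(trigger, answer):
    t, a = set(trigger.split()), set(answer.split())
    return len(t & a) / max(len(t | a), 1)

print(rerank_nbest("gostas do teu trabalho ?",
                   [("nao sei .", -0.5), ("gosto muito do meu trabalho .", -0.6)],
                   overlap))
# -> "gosto muito do meu trabalho ." wins over the more generic "nao sei ."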

We are also aware that our agents built with SSD lack a way to ensure consistency in a conversational

context since they rely on purely unsupervised models. In our survey, we asked the participants to

evaluate each question-answer pair independently, so further evaluation should be made in a fully

conversational context.

Training our models with SSD consisted of assigning values to a considerable number of parameters


related to the configuration of the neural networks. Each of these values can affect the results obtained in different ways. We followed the configuration found in similar experiments and then tweaked some of the values to fit our setting. Still, we are not sure which combination of settings might yield better results for the corpora we used as input. In order to know that, we would need another neural network to learn how to train the SSD neural networks. It is a very challenging problem that can be studied with Automated Machine Learning (AutoML) techniques.

Regarding B-Subtle, it was built with expandability in mind, so we expect additional components to be added by future developers, as well as support for other types of input data, for instance the Cornell Movie-Dialogs Corpus. Currently, the most evident limitation of B-Subtle is the use of The Movie DB Meta-Data Collector, because it fetches data through HTTP requests and a rate-limiting policy is applied (40 requests every 10 seconds per Internet Protocol (IP) address). However, at the time of writing, we are unaware of a better alternative. It is important to note that in our preliminary and main experiments we did not use this Meta-Data Collector, since the currently supported input data (OpenSubtitles2016 Corpus files) already include meta-data.
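A simple way to respect such a limit on the collector side is to throttle the request loop. The sketch below only illustrates the idea: the limit values come from the policy mentioned above, while the request function in the usage example is a hypothetical placeholder, not an actual B-Subtle or TMDb API call.

import time

class RateLimiter:
    """Allow at most max_calls calls per period (in seconds)."""
    def __init__(self, max_calls=40, period=10.0):
        self.max_calls = max_calls
        self.period = period
        self.calls = []  # timestamps of recent calls

    def wait(self):
        now = time.monotonic()
        # Drop timestamps that are outside the sliding window.
        self.calls = [t for t in self.calls if now - t < self.period]
        if len(self.calls) >= self.max_calls:
            time.sleep(self.period - (now - self.calls[0]))
        self.calls.append(time.monotonic())

# Usage sketch (fetch_metadata is a hypothetical placeholder for the HTTP call):
# limiter = RateLimiter(max_calls=40, period=10.0)
# for movie_id in movie_ids:
#     limiter.wait()
#     metadata = fetch_metadata(movie_id)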


Bibliography

[Ameixa et al., 2014] Ameixa, D., Coheur, L., Fialho, P., and Quaresma, P. (2014). Luke, I am your

father: dealing with out-of-domain requests by using movies subtitles. In International Conference on

Intelligent Virtual Agents, pages 13–21. Springer.

[Ameixa et al., 2013] Ameixa, D., Coheur, L., and Redol, R. A. (2013). From subtitles to human interactions: introducing the Subtle corpus. Technical report, INESC-ID (November 2014).

[Bahdanau et al., 2014] Bahdanau, D., Cho, K., and Bengio, Y. (2014). Neural machine translation by

jointly learning to align and translate. arXiv preprint arXiv:1409.0473.

[Banchs, 2012] Banchs, R. E. (2012). Movie-dic: a movie dialogue corpus for research and develop-

ment. In Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics:

Short Papers-Volume 2, pages 203–207. Association for Computational Linguistics.

[Bengio, 2013] Bengio, Y. (2013). Deep learning of representations: Looking forward. In International

Conference on Statistical Language and Speech Processing, pages 1–37. Springer.

[Cho et al., 2014] Cho, K., Van Merrienboer, B., Gulcehre, C., Bahdanau, D., Bougares, F., Schwenk,

H., and Bengio, Y. (2014). Learning phrase representations using rnn encoder-decoder for statistical

machine translation. arXiv preprint arXiv:1406.1078.

[Danescu-Niculescu-Mizil and Lee, 2011] Danescu-Niculescu-Mizil, C. and Lee, L. (2011). Chameleons

in imagined conversations: A new approach to understanding coordination of linguistic style in dialogs.

In Proceedings of the 2nd Workshop on Cognitive Modeling and Computational Linguistics, pages

76–87. Association for Computational Linguistics.

[Duchi et al., 2011] Duchi, J., Hazan, E., and Singer, Y. (2011). Adaptive subgradient methods for online

learning and stochastic optimization. Journal of Machine Learning Research, 12(Jul):2121–2159.

[Guo et al., 2017] Guo, P., Xiang, Y., Zhang, Y., and Zhan, W. (2017). Snowbot: An empirical study of

building chatbot using seq2seq model with different machine learning framework.


[Hochreiter and Schmidhuber, 1997] Hochreiter, S. and Schmidhuber, J. (1997). Long short-term mem-

ory. Neural computation, 9(8):1735–1780.

[Kalchbrenner and Blunsom, 2013] Kalchbrenner, N. and Blunsom, P. (2013). Recurrent continuous

translation models. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language

Processing, pages 1700–1709.

[Li et al., 2015] Li, J., Galley, M., Brockett, C., Gao, J., and Dolan, B. (2015). A diversity-promoting

objective function for neural conversation models. arXiv preprint arXiv:1510.03055.

[Li and Momoi, 2001] Li, S. and Momoi, K. (2001). A composite approach to language/encoding detec-

tion. In Proceedings of the 19th International Unicode Conference, pages 1–14.

[Lison and Tiedemann, 2016] Lison, P. and Tiedemann, J. (2016). Opensubtitles2016: Extracting large

parallel corpora from movie and tv subtitles. In Proceedings of the 10th International Conference on

Language Resources and Evaluation.

[Liu et al., 2016] Liu, C.-W., Lowe, R., Serban, I. V., Noseworthy, M., Charlin, L., and Pineau, J. (2016).

How not to evaluate your dialogue system: An empirical study of unsupervised evaluation metrics for

dialogue response generation. arXiv preprint arXiv:1603.08023.

[Lowe et al., 2015] Lowe, R., Pow, N., Serban, I., and Pineau, J. (2015). The ubuntu dialogue cor-

pus: A large dataset for research in unstructured multi-turn dialogue systems. arXiv preprint

arXiv:1506.08909.

[Lu et al., 2017] Lu, Y., Keung, P., Zhang, S., Sun, J., and Bhardwaj, V. (2017). A practical approach to

dialogue response generation in closed domains. arXiv preprint arXiv:1703.09439.

[Luong et al., 2015] Luong, M.-T., Pham, H., and Manning, C. D. (2015). Effective approaches to

attention-based neural machine translation. arXiv preprint arXiv:1508.04025.

[Magarreiro et al., 2014] Magarreiro, D., Coheur, L., and Melo, F. S. (2014). Using subtitles to deal with

out-of-domain interactions. In Proceedings of 18th Workshop on the Semantics and Pragmatics of

Dialogue (SemDial), pages 98–106.

[Mendes et al., 2013] Mendes, A. C., Coheur, L., Silva, J., and Rodrigues, H. (2013). Just. ask—a

multi-pronged approach to question answering. International Journal on Artificial Intelligence Tools,

22(01):1250036.

[Mendonca et al., 2017] Mendonca, V., Melo, F. S., Coheur, L., and Sardinha, A. (2017). A conver-

sational agent powered by online learning. In Proceedings of the 16th Conference on Autonomous


Agents and MultiAgent Systems, pages 1637–1639. International Foundation for Autonomous Agents

and Multiagent Systems.

[Papineni et al., 2002] Papineni, K., Roukos, S., Ward, T., and Zhu, W.-J. (2002). Bleu: a method for

automatic evaluation of machine translation. In Proceedings of the 40th annual meeting on association

for computational linguistics, pages 311–318. Association for Computational Linguistics.

[Ritter et al., 2011] Ritter, A., Cherry, C., and Dolan, W. B. (2011). Data-driven response generation in

social media. In Proceedings of the conference on empirical methods in natural language processing,

pages 583–593. Association for Computational Linguistics.

[Serban et al., 2016] Serban, I. V., Sordoni, A., Bengio, Y., Courville, A. C., and Pineau, J. (2016).

Building end-to-end dialogue systems using generative hierarchical neural network models. In Association for the Advancement of Artificial Intelligence Conference, volume 16, pages 3776–3784.

[Shao et al., 2017] Shao, Y., Gouws, S., Britz, D., Goldie, A., Strope, B., and Kurzweil, R. (2017). Gen-

erating high-quality and informative conversation responses with sequence-to-sequence models. In

Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages

2210–2219.

[Sutskever et al., 2014] Sutskever, I., Vinyals, O., and Le, Q. V. (2014). Sequence to sequence learning

with neural networks. In Ghahramani, Z., Welling, M., Cortes, C., Lawrence, N. D., and Weinberger,

K. Q., editors, Advances in Neural Information Processing Systems 27, pages 3104–3112. Curran

Associates, Inc.

[Tiedemann, 2009] Tiedemann, J. (2009). News from opus - a collection of multilingual parallel corpora

with tools and interfaces. In Recent advances in natural language processing, volume 5, pages 237–

248.

[Vinyals and Le, 2015] Vinyals, O. and Le, Q. (2015). A neural conversational model. arXiv preprint

arXiv:1506.05869.

[Wen et al., 2015] Wen, T.-H., Gasic, M., Mrksic, N., Su, P.-H., Vandyke, D., and Young, S. (2015).

Semantically conditioned lstm-based natural language generation for spoken dialogue systems. arXiv

preprint arXiv:1508.01745.

[Wu et al., 2016] Wu, Y., Schuster, M., Chen, Z., Le, Q. V., Norouzi, M., Macherey, W., Krikun, M., Cao,

Y., Gao, Q., Macherey, K., et al. (2016). Google’s neural machine translation system: Bridging the

gap between human and machine translation. arXiv preprint arXiv:1609.08144.

[Yin et al., 2015] Yin, J., Jiang, X., Lu, Z., Shang, L., Li, H., and Li, X. (2015). Neural generative question

answering. arXiv preprint arXiv:1512.01337.


A Preliminary Experiments

Listing A.1: B-Subtle’s configuration file to generate a parallel corpus from all Portuguese OpenSubtitles2016 subtitles.

---
pipelines:
  - pipelineType: opensubtitles
    batchSize: 2000
    inputDirectory: "/OpenSubtitles2016/raw/pt/"
    outputs:
      - outputType: parallel
        outputDir: "/BSubtleOutput/OpenSubtitles2016/pt/"

Listing A.2: B-Subtle’s configuration file to generate a parallel corpus from all Portuguese OpenSubtitles2016 subtitles with answer length above 25 characters.


---
pipelines:
  - pipelineType: opensubtitles
    batchSize: 2000
    inputDirectory: "/OpenSubtitles2016/raw/pt_1/"
    interactionFilters:
      - filterType: answerMinLength
        value: 25
    outputs:
      - outputType: parallel
        outputDir: "/BSubtleOutput/OpenSubtitles2016/pt_2/"

Listing A.3: B-Subtle’s configuration file to generate a parallel corpus from all Portuguese OpenSubtitles2016 subtitles with Horror as a genre and a subtitle rating above 5.0.

---
pipelines:
  - pipelineType: opensubtitles
    batchSize: 2000
    inputDirectory: "/OpenSubtitles2016/raw/pt_1/"
    metadataFilters:
      - filterType: genre
        value: "Horror"
      - filterType: subtitleRatingMin
        value: 5.0
    outputs:
      - outputType: parallel
        outputDir: "/BSubtleOutput/OpenSubtitles2016/pt_2/"


B Survey

B.1 Survey description given to volunteers

No questionário que se segue irá encontrar uma lista de interacções constituídas por uma pergunta/comentário e um conjunto de possibilidades de resposta. Considere uma interacção Pessoa-Máquina. Coloque-se na posição de quem está a interagir com a máquina e avalie as respostas dadas.

Classifique cada uma das possibilidades de resposta (dadas pela máquina) da seguinte forma:

• Válida: a resposta é adequada relativamente ao assunto da pergunta/comentário (está contextualizada) - (ex: “Que idade tens?” e a resposta é “28 anos.”) (ex: “És engraçado!?” e a resposta é “Tu é que és.”)

• Aceitável: a resposta pode ser adequada em determinado contexto ou refere detalhes que podem pertencer à pessoa que efectuou a interacção - (ex: “Onde é que vives?” e a resposta é “Isso agora não importa, temos que ir embora.”) (ex: “Estás cansado?” e a resposta é “Não Joana.”)

• Não válida: quando a resposta não é adequada ou contém erros gramaticais que dificultam a compreensão - (ex: “Que idade tens?” e a resposta é “Sim.”)

O questionário encontra-se dividido em duas partes. A primeira parte é de carácter obrigatório enquanto que a segunda parte é opcional (depende da sua disponibilidade). No entanto, gostaríamos de contar com a sua colaboração para as duas partes.

O tempo estimado é de 7 a 10 minutos por cada parte.

Obrigado, desde já, pelo seu contributo.


Notas: - Algumas respostas poderão não ser consistentes na utilização de maiúsculas para nomes, cidades, etc.; no entanto, isso não deverá influenciar a sua classificação.

B.2 Raw data with the questions and corresponding agent answers used in the survey

Question Agent Alpha Agent Beta Agent Charlie Agent Delta Agent Echo

qual o prato que mais

gostas ?

nao sei . nao sei . nao sei . jalebi . Foi a bela que matou o mon-

stro.

para onde vais ? nao sei . para casa . nao sei . ola . Vou tirar-te as duvidas.

que genero de musica e

que gostas ?

nao sei . gosto de musica . e uma piada . esta bem , esta bem , esta

bem .

E um genero de musica do

mundo.

gostas do teu trabalho ? sim . sim . sim . sou consciencioso e ambi-

cioso .

Tenho a sensacao que

deixamos de comunicar.

o que e que estudas ? trabalho . nao estou a brincar . nao sei . microbiologia . Tu nao queres ficar preso a

esta casa para o resto da tua

vida, pois nao?

ja viajaste para longe de

casa ?

nao . nao . nao , nao , nao . nao , estava em casa . O que queres?

o que e pensas sobre a in-

teligencia artificial ?

o que e isto ? e uma maquina do

tempo .

nao sei . um deus . Uma unica consciencia que

criou toda uma prole de

maquinas.

andas em que faculdade ? nao sei . sim . nao sei . caltech . Estavamos a espera de que?

tu fumas ? nao . nao . nao . fuma , fuma , fuma ! Nao?

qual a tua profissao ? sou um homem de

negocios .

nao sei . escultor , senhor . duro . Um pouco disso e daquilo e

nada de mais.

estas ai sozinho ? sim . sim . sim . oh , querida , que situacao

estranha .

Exijo que o venha ja buscar.

meu , nao percebo nada

do que disseste ...

o que e que disseste

?

nao sei o que dizer . nao sei . quando ? E tambem nao percebo nada

de advocacia.

moras em lisboa ? sim . sim . nao . ha 14 meses atras . O que estavas a fazer no

metro?

segues algum desporto ? nao . nao , nao . claro . e por isso que eu luto . A aeronave aproximando-se

de Cedar Creek, fala o Dr.

Sam Daniels.

jogas jogos de video ? sim . sim . sim . claro . Da-nos a latitude e a longi-

tude.

estudaste ? sim . a serio ? na escola . para que ? Nao, nem por isso.

qual e o proposito do pro-

grama de exploracao es-

pacial ?

nao sei . nao sei . temos de ir embora . charles . Por que nos apenas nao deix-

amos isso para os profission-

ais?

gostas de ficcao cientıfica

?

nao . sim . sim . nao . O ”Morte 1” ressuscitou os

mortos, que se alimentam

de carne humana, Tambem

chama a isso ficcao cientıfica?

onde e que compraste

essa roupa ?

nao sei . na casa de banho . onde e que ela esta ? na clınica da metadona ? Isto e bonito, de onde o tirou?


porque e que estas aqui ? nao sei . porque e que estas

aqui ?

porque e que nao me

disseste ?

hmm ? Estas a brincar, nao estas?

de que tipo de filmes

gostas ?

nao sei . de que ? de que e que estas a

falar ?

esta desaparecido ha oito

horas .

Sim, eu gosto de filmes.

entao e esse tempo ? nao sei . nao sei . nao sei . estavamos ansiosos . E por que nao o veria?

o que e a desonestidade ? o que ? uma investigacao . o que e que esta a

fazer ?

uma desonestidade que

envenena tudo .

Eu nao percebo.

ronald reagan e ... e o meu filho . o que e que se passa

?

eu disse que era um

homem .

ronald reagan? bem. e o vice-presidente ?

onde vais assim vestido ? nao sei . nao sei . nao sei . a uma exposicao de moda . E assim! E assim que isto fun-

ciona.

a vida e difıcil . e difıcil . a vida nao e facil . nao , nao , nao . cresceu na nossa rua . Mas esta vida esta a chegar

ao fim, e considero-me sor-

tudo por a ter vivido.

o que e que vais fazer esta

noite ?

nao sei . nao sei . nao sei . o habitual . O que eu vou fazer hoje a

noite?

quantos filhos tens ? dois . nao sei . dois . galloway . Nenhum.

qual foi a melhor coisa que

ja fizeste a alguem ?

nao sei . nao sei . nao sei . eu nao fujo as minhas re-

sponsabilidades .

E a melhor coisa que farei.

esta bem , sabes porque e

que falamos tanto ?

porque e o que eu

faco .

porque e que nao me

disseste ?

nao . esta aqui , algures no

castelo !

Porque representamos tudo o

que querias ter, mas receias

enfrentar.

onde estudaste ? na universidade . na universidade . no hotel . barnard . Nao, nem por isso.

em que trabalhas ? sou um homem de

negocios .

nao sei . nao sei . trabalha isso . De que tamanho?

qual e o teu trabalho ? o meu trabalho . o meu trabalho . nao sei . trabalha isso . De que tamanho?

quais sao os teus objetivos

para a vida e para o tra-

balho ?

nao sei . nao sei . nao sei . tao triste . Tudo o que vem e ouvem po-

dem aproveitar.

que dia e hoje ? sexta-feira . dia de natal . nao sei . vejo que nao tem mal . O que?

que hobbies tens ? o que e isso ? eu ? o que queres dizer ? galloway . O que e que somos? E isto?

Um tipo de entretenimento?

qual o dia do teu aniver-

sario ?

nao sei . o que e isso ? 8 de maio de 1969 . e feriado . Meu Deus.

que disciplinas tens ? nao sei . as minhas coisas . algum . galloway . Disciplina!

etica e moral sao a mesma

coisa ?

sim . nao sei . isso e um pouco

complicado .

tudo e um jogo para ele . Temos a teoria de que o crime

aumenta a beleza.

viste o jogo dos new york

knicks ?

sim . nao sei . a ultima vez que

o vi , o meu pai

e um homem muito

poderoso .

pois , os knicks sao os

maiores !

Ela conhece um escritor em

New York.

queres ajuda ? nao . nao . sim . apanhamo-lo fora do es-

critorio cerca de uma hora

depois de o termos apan-

hado a si .

Nao. Sim.

que horas sao ? nao sei . nao sei . nao sei . nao sei . E quase meio-dia.

tudo bem ? sim . sim . sim . isso ja chega para gente

como ele .

Esta tudo bem. Esta tudo

bem. Esta tudo bem.


queres uma goma ? sim . sim . nao , nao quero . nao , nao quero essas tuas

estupidas gomas .

Nao, nao quero essas tuas

estupidas gomas.

sou bonito ? sim . sim . sim . bonito rapaz E bonito.

gostas de cantar ? sim . sim . sim . rhonda ... Sim... de repente tudo veio-

me a cabeca.

o que e que achas dos

japoneses ?

nao sei . nao sei . nada . vamos sabe-lo assim que

escurecer .

Os japoneses, o que?

o que e que fazes ? sou um homem de

negocios .

nao sei . nao sei . nao , nao o facas ! Reparo computadores. Nao,

trabalho com computadores.

devemos dancar ? nao . sim . nao . sabes , smith... ja sei o que

tu mais odeias .

Nao sei dancar.

se fores sair avisas-me ? claro . claro . sim . os meus pais vao chegar

dentro de uma hora .

Eu sei, isto nunca aconteceu.

Tu nao existes.

o que fazes na vida ? nao sei . o que fazes aqui ? nao sei . boa noite... Sou arquitecto.

ola ! como estas ? ola . como estas ? ola . tragam-no . Ola.

sabes dancar ? nao . sim . nao . nao . Eu nao sei dancar.

costumas sair a noite ? nao . nao . nao . sim . E eu ja nao percebo.

tu bebes cervejas ? nao . nao . nao . estamos apaixonados ,

suas cretinas .

Nao, obrigado.

podes dizer-me o cam-

inho para a biblioteca mais

proxima ?

sim . sim . o que e isto ? que vergonha , salem . Nao sei quem sou.

onde vives ? em casa . em casa do meu pai

.

na escola . eu levo-a a casa . Em Sao Francisco.

como te descreves em tres

palavras ?

sim . sim . com as mulheres ? panico . Primeira palavra.

eu acho que tu es uma

maquina

eu sou uma maquina eu sou uma estrela

de rock

sim . eu sou o coronel gre-

gor yegorov da federal

seguranca de rusia .

Es uma sobrevivente.

tens algum animal de

estimacao ?

nao . nao . nao . nao . Nao.

e passatempos preferidos

?

nao sei . de que ? o resto de comida

nao me permite ir

para casa ?

os meus preferidos ? Montanhismo e um pas-

satempo.

qual o melhor elogio que ja

recebeste ?

nao sei . nao , obrigado . nao sei . a viagem foi boa ? Claro.

tens quantos anos ? seis . oito . nao sei . galloway . Ela e do Ano do Galo.

tudo bem contigo ? sim . sim . sim . sim , tudo bem . Sim, tudo bem.

qual e a cor do ceu ? nao sei . nao sei . nao sei . repare , david . Sim, mas mesmo assim pre-

firo o azul.

como queres ser lembrado

?

nao sei . eu sei . eu nao sei . deve ser o sandy . Oh, acho que nao.

gostas de lasanha ? nao . sim . sim . adoro lasanha . Sim. Gosras de lasanha?

o que e que achas da

rainha ?

nao sei . nao sei . nao sei . nao , acho que nao . O que e a rainha tem a ver

com isso?

o que fazes ? nao sei . nao sei . nao sei . eu vim para te ver , e agora

pede minhas economias ...

Um pouco disto, um pouco

daquilo. Neste momento es-

tou de maos a abanar.


amanha encontramo-nos

?

sim . sim . sim . isso significa ... Esta bem.

fazes algum desporto ? nao . nao . nao . e por isso que eu luto . Desporto? Sim, eu era o

lancador.

qual e a tua religiao ? nao sei . nao sei . nao sei . i . i. , senhor . Nenhuma, porque? E obri-

gatorio?

o que e que fazes amanha

?

nao sei . o que fazes ? nao sei . isso significa ... O feriado do 4 de Julho?

quando e que nasceste ? nao sei . nao sei . nao sei . o futuro esta em ti . E que ela nasceu 3 minutos

depois.

ja vi que es muito agres-

sivo !

o que e que estas a

fazer ?

nao me digas que

nao es o unico .

nao sei . eu fui a um medico . Sejam agressivas!

qual e a tua comida

preferida ?

nao sei . nao sei . nao sei . que belo petisco . Porque e amarela...

o que te trouxe aqui ? nao sei . nada . o que e que eu fiz ? eu nao acreditava , mas e

”ver para crer” .

O Michael estava a ter proble-

mas na escola e eu... mudei-

me para ca para encontrar um

lugar mais seguro.

qual e a coisa mais impor-

tante a saber sobre biolo-

gia ?

nao sei . o que e que eu faco

?

nao sei . com a sra . hawker ou a sra

. jennings ?

E algo que tenho de fazer.

podes falar para sempre ? sim . sim . claro . quero que te esforces para

falares decentemente .

Estava a comecar a pensar

que... tu nao me querias

mais... e que andavas a sair

com outra pessoa.

qual e o buraco mais fundo

no mundo ?

nao sei . nao sei . nao sei . sabes que mais ? Vamos recalcular com as me-

didas do teu exo-esqueleto e

ver o que da.

em que paıses ja estiveste

?

nao . nao sei . tenho que fazer isto . e bom , mas e como se

morresses e acordasses no

paraıso judeu .

Nao em todos.

tu es uma mulher

engracada !

nao , nao , nao . eu sei . sim . pelo seu corte confundia-a

com um homem .

Nao, tu e que es.

tens medo de robos ? nao . nao , nao , nao . sim . galloway . Eu nao tenho medo de nada.

o que e que acontece

se as maquinas puderem

pensar ?

nao sei . o que e que se passa

?

venham de volta . eles devem ter uma

fraqueza .

Corram o mais rapido que

puderem.

qual e a tua cor preferida ? azul . vermelho . nao sei . posso ajuda-lo , sargento ? A minha cor preferida... Nao

sei. Cinzento, talvez.

gostas de computadores ? sim . sim . sim . uso-os . Sim, o computador e meu.

costumas passar por aqui

?

nao . nao . nao . amanha filmo-o para voces

.

O costume de toda uma vida.

de onde es natural ? nao sei . de onde veio ? do meu pai . da cidade . Tenho a certeza de que pode

por no congelador as mul-

tas de estacionamento e as

licencas para fogo-de-artifıcio

que e a funcao normal do seu

departamento.


toquio e ... o que e isto ? o que ? o han esta a estudar

.

sabe , papa , nao e preciso

levar-me a escola todos os

dias .

Toquio? A hora mais cedo de

chegada a Toquio e as 6:15

am.

quem sao os teus pais ? nao sei . nao sei . nao sei . nao sei . Este... e o meu Pai.

o que gostas de fazer nos

tempos livres ?

nao sei . nao sei . nao sei . o moulin rouge e o lugar p

´ra ti

Nao ha mesmo mais nada.

gostas de estudar ? sim . sim . sim . e? Nao a perco por nada. Tranco-

me no meu quarto.

as vezes queres ser es-

pancado ?

nao . ou nao ? eu quero . aqui esta . Parece que ele... ele nao a

abandonou.

tens primos ? nao . nao . quantos sao ? galloway . Super-Homem e seu primo?

de que tipo de musicas

gostas ?

de que tipo ? de que ? nao sei . gostas de salsa ? Do tipo de musica que vos

gostardes.

tu es um homem sem in-

teligencia .

eu sei . eu sei . nao , nao sou . nao . Certo. Eu sou o homem.

gostas de trabalhar ? sim . sim . sim . sim . E o que faz David?

com quem vives ? com o meu pai . com o meu pai . com a minha mae . uma num milhao , milhao So com o meu tio.

queres ir beber um cafe ? nao , obrigado . nao . nao , obrigado . nao posso . So um cafe.

francamente , minha

querida , eu nao quero

saber !

nao quero saber ! nao quero saber ! o que ? nao e habitual conceder ao

condenado o seu ultimo de-

sejo ?

O que preciso de saber e: sou

fedorento?


C Sample Configuration File

---
pipelines:
  - pipelineType: opensubtitles
    batchSize: 2500
    inputDirectory: "/input/dataset/path"

    metadataCollector:
      - collectorType: themoviedb
        apiKey: YOURAPIKEY

    metadataFilters:
      - filterType: audience
        value: "M/16"
        country: "Portugal"
      - filterType: country


        value: "Italy"
      - filterType: countryRegex
        value: "^A.*o$"
      - filterType: countryQuantityMax
        value: 3
      - filterType: durationMin
        value: 3
      - filterType: encoding
        value: "utf-8"
      - filterType: genres
        value: ['Action', 'Comedy']
      - filterType: imdbIDExistence
      - filterType: originalLanguage
        value: "Portuguese"
      - filterType: originalLanguageQuantity
        value: 2
      - filterType: movieRatingRange
        leftValue: 2.2
        rightValue: 9.3
      - filterType: subtitleRatingExact
        value: 10.0
      - filterType: yearMin
        value: 1974

    interactionFilters:
      - filterType: intervalMax
        value: 4 # in seconds
      - filterType: triggerSentiment
        value: "positive"
      - filterType: answerTokensQuantityMin # requires using a producer to tokenize the answer
        value: 26
      - filterType: triggerCharactersRange
        leftValue: 5
        rightValue: 42
      - filterType: triggerRegex
        value: "[:alpha:]*?$"
      - filterType: answerContains
        value: "Hello"


        ignoreCase: false
        invert: false

    producers:
      - producerType: openNLPSentiment
      - producerType: openNLPTokenizer
        reverseTrigger: false
        reverseAnswer: false
      - producerType: treeTaggerLemmatizer
      - producerType: openNLPStemmer

    transformers:
      - transformerType: stringifyTokens
        separator: " "
      - transformerType: stringifyLemmas
      - transformerType: lowercase
      - transformerType: uppercase

    outputs:
      - outputType: legacy
        outputDir: "/output/corpus/legacy/path"
      - outputType: xml
        outputDir: "/output/corpus/xml/path"
      - outputType: json
        outputDir: "/output/corpus/json/path"
        prettyprint: true
      - outputType: parallel
        outputDir: "/output/corpus/parallel/path"
        triggersFilename: "source.txt"
        answersFilename: "target.txt"
        validationSize: 2000
