One Million Agents Speaking All the Languages in the World
Miguel Ângelo António Ventura
Thesis to obtain the Master of Science Degree in
Information Systems and Computer Engineering
Supervisor: Prof. Maria Luísa Torres Ribeiro Marques da Silva Coheur
Examination Committee
Chairperson: Prof. João António Madeiras Pereira
Supervisor: Prof. Maria Luísa Torres Ribeiro Marques da Silva Coheur
Members of the Committee: Prof. Bruno Emanuel da Graça Martins
June 2018
Acknowledgments
First and foremost, I have to thank my research supervisor, Prof. Luísa Coheur. Without her assistance and dedicated involvement in every step of the process, this dissertation would never have been accomplished.
I would also like to thank the 44 volunteers involved in the survey of our research. Without their participation and input, the survey could not have been successfully conducted.
A word of thanks to all the L2F/INESC-ID members who offered me tips and opinions for my research. A special thanks to Vânia Mendonça and Marco Pereira for helping me out with the legacy systems that were the starting point for my thesis.
I would also like to acknowledge the developers of the OpenNMT-tf tool for their fast and elucidating responses to the issues and doubts I raised while using it.
I should also thank all my friends, especially those finishing their Master's degree at the same time as I am. All the exchanged information and feedback contributed to my research.
I am grateful to everyone at the Direção de Serviços Informáticos, at Instituto Superior Técnico, for relaxing my working hours when I needed to dedicate time to my dissertation.
Most importantly, none of this could have happened without my family. I must express my very
profound gratitude to my parents (especially to my mother) for providing me with unfailing support and
continuous encouragement throughout my years of study and through the process of researching and
writing this thesis. This accomplishment would not have been possible without them. Thank you.
Abstract
Currently, creating a conversational agent for a specific domain is an accessible task, but the resulting agents have restricted knowledge due to the human effort needed to introduce the data manually.
Movie and TV show subtitles are freely available in ever-growing databases. They constitute a remarkable resource of data distributed across more than 70 languages.
In this document, we propose B-Subtle - a novel tool for the automatic creation of corpora and the collection of analytical data from subtitles. Since different users might have different needs, we aim to provide a flexible system that can be fully parametrized through a configuration file. The generated corpora will serve as a knowledge base for conversational agents.
Besides the corpora generation tool, another system will be described - Say Something Deep. This system is capable of creating sequence-to-sequence models to answer questions posed by its users. It relies on neural networks to implement a generative approach, taking corpora generated with B-Subtle as its knowledge base.
Keywords
Dialogue Systems; Movie Subtitles; Information Extraction; Generative Models; Deep Learning.
Resumo
Atualmente, a criação de um agente de conversação para um domínio específico é uma tarefa acessível. Contudo, os agentes resultantes têm uma base de conhecimentos restrita devido ao esforço humano necessário para introduzir manualmente os dados.
Legendas de filmes e de programas de TV estão disponíveis gratuitamente em bancos de dados em constante crescimento. Elas constituem um recurso notável de dados distribuídos em mais de 70 idiomas.
Neste documento, apresentamos o B-Subtle - uma nova ferramenta para criação automática de corpora de interações pergunta/resposta e de extração de dados estatísticos a partir de legendas. Sendo que utilizadores diferentes podem ter necessidades distintas, o nosso objetivo é fornecer um sistema flexível que possa ser totalmente parametrizado por meio de um ficheiro de configuração. Os corpora gerados servirão como bases de conhecimento para agentes de conversação.
Além da ferramenta de geração de corpora, outro sistema será apresentado - Say Something Deep. Este sistema é capaz de responder a perguntas feitas por utilizadores, utilizando uma estratégia de geração de respostas após treinar modelos com arquiteturas de redes neuronais.
Palavras Chave
Sistemas de Diálogo; Legendas de Filmes; Extração de Informação; Modelos Gerativos; Aprendizagem Profunda.
Contents
1 Introduction 1
1.1 Goals . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
1.2 Contributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
1.3 Document Outline . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
2 Background 7
2.1 Subtle Corpus and Subtle Tool . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
2.1.1 Pre-processing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
2.1.2 Dialogue Turns . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
2.1.3 Corpus Data Fields . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
2.1.4 Final Considerations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
2.2 Say Something Smart . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
2.2.1 Indexing Subtle Corpus and Extracting Answers . . . . . . . . . . . . . . . . . . . 11
2.2.2 Electing the Best Answer . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
2.2.3 Flexibility of the System . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
2.2.4 Final Considerations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
2.3 Sequence-to-Sequence Models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
2.3.1 Padding and Bucketing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
2.3.2 Word Embeddings . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
2.3.3 Attention Mechanism . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
2.3.4 Greedy Search and Beam Search . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
2.3.5 Recurrent Neural Networks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
2.3.6 Long Short-Term Memory Neural Networks . . . . . . . . . . . . . . . . . . . . . 16
3 Related Work 17
3.1 Corpora . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
3.1.1 OpenSubtitles2016 Corpus . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
3.1.1.A Source Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
3.1.1.B Preprocessing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
3.1.1.C Output files . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
3.1.2 Movie-Dic Corpus . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
3.1.3 Cornell Movie-Dialogs Corpus . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
3.1.4 Ubuntu Dialog Corpus . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
3.2 End-to-end Sequence to Sequence Models . . . . . . . . . . . . . . . . . . . . . . . . . . 23
4 B-Subtle 27
4.1 Architecture Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
4.2 B-Subtle Parts Explained . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30
4.2.1 Input Files . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30
4.2.2 Meta-data Collectors . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31
4.2.3 Filters . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32
4.2.3.A Meta-data Filters . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32
4.2.3.B Interaction Pairs Filters . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34
4.2.4 Producers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34
4.2.5 Transformers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35
4.2.6 Output Files . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35
4.2.7 Analytics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36
4.2.8 Configuration Files . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37
5 Building Agents 39
5.1 Say Something Deep . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41
5.1.1 Architecture . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41
5.1.2 Neural Network Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42
5.2 Preliminary Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42
5.3 Main Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46
5.3.1 Corpora . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46
5.3.2 Say Something Deep Setup . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48
5.3.3 Say Something Smart Setup . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49
5.3.4 Agents . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49
6 Evaluation 51
6.1 How to Evaluate our Agents? . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53
6.2 Human Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53
6.3 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 55
6.3.1 Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 55
6.3.2 Short and Simple Answers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56
6.4 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 58
7 Conclusions 59
7.1 Contributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61
7.2 Future Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 62
A Preliminary Experiments 69
B Survey 71
B.1 Survey description given to volunteers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 71
B.2 Raw data with the questions and corresponding agent answers used in survey . . . . . . 72
C Sample Configuration File 77
List of Figures
2.1 Information flow in a Sequence-to-Sequence (seq2seq) model. . . . . . . . . . . . . . . . 14
2.2 Architecture of a Recurrent Neural Network (RNN) with a loop, and the same network unrolled after unfolding. Xt represents what is given as input; Yt is the output generated by the network. 16
4.1 Possible B-Subtle pipeline for OpenSubtitles 2016 files using all the components available. 31
5.1 TensorBoard - a visualization tool for understanding, debugging, and optimizing the models being trained. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43
6.1 Example of a question in our survey (in Portuguese). Each line corresponds to an answer
given by one of our agents that should be classified as valid, plausible or invalid. . . . . . 54
6.2 Amount of “Sim.” (Yes.), “Não.” (No.) or “Não sei.” (I do not know.) answers considered valid among all answers labeled as valid by the survey participants. . . . . . . . . . . . . 57
6.3 Amount of “Sim.” (Yes.), “Não.” (No.) or “Não sei.” answers considered plausible among all answers labeled as plausible by the survey participants. . . . . . . . . . . . . . . . . 57
6.4 Amount of “Sim.” (Yes.), “Não.” (No.) or “Não sei.” answers considered invalid among all answers labeled as invalid by the survey participants. . . . . . . . . . . . . . . . . . . 57
List of Tables
6.1 Percentage of valid, plausible and invalid answer classifications given to our 5 agents (Alpha, Beta, Charlie, Delta, and Echo) by the volunteers who filled out the survey. . . . 55
6.2 Amount of “Sim.” (Yes.), “Não.” (No.) and “Não sei.” (I do not know.) answers returned by Agents Alpha, Beta and Charlie for the 100 questions included in the conducted survey. . 56
6.3 Amount of “Sim.” (Yes.), “Não.” (No.) or “Não sei.” answers evaluated independently for each answer label present in our survey (valid, plausible and invalid). The agents relying on the new Say Something Deep (SSD) system for generating answers are included for comparison. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 58
Listings
3.1 Structure of a subtitle file in .srt format . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
3.2 Simplified example of an OpenSubtitles2016 input file structure. . . . . . . . . . . . . . 21
4.1 Example of configuration file for OpenSubtitles2016 Pipeline . . . . . . . . . . . . . . . . 38
5.1 B-Subtle’s configuration file to generate Corpus A. . . . . . . . . . . . . . . . . . . . . . . 47
5.2 B-Subtle’s configuration file to generate Corpus B. . . . . . . . . . . . . . . . . . . . . . . 47
5.3 B-Subtle’s configuration file to generate Corpus C. . . . . . . . . . . . . . . . . . . . . . . 48
A.1 B-Subtle’s configuration file to generate a parallel corpus from all Portuguese OpenSubtitles2016 subtitles. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 69
A.2 B-Subtle’s configuration file to generate a parallel corpus from all Portuguese OpenSubtitles2016 subtitles with answer length above 25 characters. . . . . . . . . . . . . . . . 69
A.3 B-Subtle’s configuration file to generate a parallel corpus from all Portuguese OpenSubtitles2016 subtitles with Horror as a genre and a subtitle rating above 5.0. . . . . . . . . 70
Acronyms
AI Artificial Intelligence
AutoML Automatic Machine Learning
CA Conversational Agent
DOM Document Object Model
GPU Graphics Processing Unit
HTML HyperText Markup Language
HTTP Hypertext Transfer Protocol
IMDb Internet Movie Database
IMSDb Internet Movie Script Database
IP Internet Protocol
JSON JavaScript Object Notation
LSTM Long Short-Term Memory
NER Named Entity Recognition
NLTK Natural Language Toolkit
NL Natural Language
OCR Optical Character Recognition
RNN Recurrent Neural Networks
SAX Simple API for XML
seq2seq Sequence-to-Sequence
SSD Say Something Deep
SSS Say Something Smart
StAX Streaming API for XML
XML eXtensible Markup Language
YAML YAML Ain’t Markup Language
1 Introduction
Contents
1.1 Goals . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
1.2 Contributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
1.3 Document Outline . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
Artificial Intelligence (AI) is a field of Computer Science that has been constantly evolving since its beginnings in the 1950s. Easing the communication between humans and machines remains one of its highest ambitions.
To have successful interactions between humans and machines, a common way of communicating must be defined. Humans communicate with each other using Natural Language (NL), a language that has developed naturally over time, in contrast with artificially created ones (used by machines and based on strict rules). This demands that machines be able to interpret and generate a language understood by humans. Conversational Agents (CAs) are one of the tools that can be used for this purpose.
Creating a CA for a certain domain is an accessible task given the tools available. For instance, Pandorabots1 is a web service for building and deploying this type of agent. However, the resulting agents have restricted knowledge due to the human effort needed to introduce the data manually.
One of the challenges is to develop methods that build sources of knowledge automatically, without the need for human intervention. The foremost task is to find appropriate information sources and then extract useful data from them. Subtitles from movies and TV shows are one such source. They are available online for free in ever-growing databases. OpenSubtitles2 is one of those databases, where the number of available subtitle files surpasses the 4 million mark, distributed across more than 60 languages. Considering its size and language variety, it is reasonable to say that it constitutes a remarkable resource from which to extract linguistically valuable features, given the breadth of genres covered (action, comedy, horror, etc.) and the multiple types of discourse present (narrative, slang, etc.).
The L2F/INESC-ID group has already built a corpus from subtitles - the Subtle Corpus [Ameixa et al., 2013]. It is composed of interactions (pairs of triggers3 and answers) extracted from 6 thousand English subtitle files and 4 thousand Portuguese subtitle files. A tool was used to automate the process of building the corpus - we will call it the Subtle Tool from now on. The resulting corpus, henceforth the Subtle Corpus, can be used as a knowledge base for a CA. However, this corpus is limited to those two languages and has not been updated since with data from new subtitles. Moreover, the Subtle Tool used to generate the corpus does not allow customizing the generation process without changing its implementation. Since subtitle files have other associated information (genre of the movie, release year, timestamps of turns, etc.), it could be useful to allow the collection and analysis of that data while building the corpus. The aim of our work was to include all of this meta-data, helping the end user customize the corpus generation process.
Selecting responses for a CA to give when dealing with user input constitutes another challenge. The best response should be chosen or generated from the built source of knowledge. This could be
1 https://www.pandorabots.com/
2 https://www.opensubtitles.org/
3 A turn extracted from subtitles that will cause the next one to appear - the answer.
accomplished by using Deep Learning techniques either by following a retrieval approach or a generative
approach.
On the WildML blog4 we can find more about the pros and cons of each approach; essentially, retrieval models are simpler to implement. They work with a source of predefined responses and use some kind of heuristic to pick an appropriate one. However, they may be incapable of handling unseen cases. A dialogue system developed at the L2F/INESC-ID group relies on a retrieval-based approach: Say Something Smart (SSS) [Ameixa et al., 2013] uses the Subtle Corpus as its knowledge base. The system starts by matching the user input against a set of possible responses. Afterwards, it applies some weighted measures to rank them. The one with the best score is returned to the user. However, when SSS (explained in Section 2.2) needs to select the best response, it is highly coupled to an internal scoring algorithm used by the search engine library5.
On the other side, generative models are harder to implement but can generate responses of their own, although they are more likely to make grammatical errors or give irrelevant answers. Recently, an increasing number of studies have found that end-to-end CAs can be built by following purely data-driven approaches relying on neural models. One of our main goals is to create end-to-end CAs with corpora generated by B-Subtle, following an approach based on recent studies.
1.1 Goals
We will now describe the main goals of our work.
We aim to replace the existing Subtle Tool by providing the following features:
1. Creating corpora with interactions and associated meta-data, such as the genre of the movie/TV show, release year, spoken language, subtitle language, etc. This allows the generation of corpora that meet specific requirements: for example, including only interactions collected from movies with “Action” as a genre;
2. Supporting a larger and more recent set of subtitles as input for the corpora generation process;
3. Supporting multiple languages;
4. Processing text content from subtitles in a customized way. This allows the creation of corpora with custom interactions (e.g. accepting only interactions where the trigger ends with a question mark);
5. Generating different output formats for the corpus being generated, such as JavaScript Object Notation (JSON) or eXtensible Markup Language (XML) files. This enables end-users to choose the output format that best fits their needs;

4 http://www.wildml.com/2016/04/deep-learning-for-chatbots-part-1-introduction
5 Used for indexing the corpus files and retrieving results by making queries.
6. User-friendly configuration: this allows the end-users to completely control the behavior of B-Subtle
when creating corpora by simply adjusting parameters in a configuration file;
7. Generating analytical data about movies, TV shows and subtitle files.
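Several of these goals (1, 4, 5 and 6 in particular) revolve around configuration-driven behavior. As a purely illustrative sketch of what such a file could look like (the field names below are hypothetical; the actual schema is presented in Chapter 4 and Appendix C):

```yaml
# Hypothetical B-Subtle configuration sketch (illustrative field names only)
input:
  path: subtitles/            # location of the subtitle files
  subtitle-language: pt       # goal 3: multiple languages
filters:
  genres: [Action]            # goal 1: meta-data based filtering
  trigger-ends-with: "?"      # goal 4: custom interaction rules
output:
  format: json                # goal 5: selectable output format
analytics:
  enabled: true               # goal 7: analytical data collection
```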
Since we aim to build one million agents speaking all the languages in the world, we are determined to provide an additional system for generating the best answer given a user input. It needs to fulfill the following requirements:
1. Being open-domain, with the capability of generating answers to questions not present in the knowledge base;
2. Building knowledge bases in Portuguese by using new corpora of interactions created from subtitle files;
3. Supporting languages other than Portuguese and English.
1.2 Contributions
To achieve the aforementioned goals, we decided to build a revamped tool for generating corpora of interactions - B-Subtle. This tool offers automatic creation of corpora and collection of analytical data from subtitles. Since different end-users might have different needs, we provide a flexible system that can be fully parametrized through a configuration file. The generated corpora will serve as a knowledge base for conversational agents.
Besides the corpora generation tool, we also offer another tool - SSD. This system is capable of generating responses upon receiving input. It relies on neural networks, using state-of-the-art seq2seq models. Corpora generated with B-Subtle can serve as the knowledge base for the conversational models created with SSD.
1.3 Document Outline
The remainder of the document is organized as follows: in Chapter 2, previous systems are described and some details about neural network architectures are given; in Chapter 3, we describe related work carried out in the scope of building corpora and seq2seq models for CA creation; in Chapter 4, we present the architectural details of our corpus-building tool - B-Subtle; in Chapter 5, we present our experiments with neural CAs using corpora generated with B-Subtle; in Chapter 6, we describe how we evaluated our agents; finally, in Chapter 7, we draw conclusions and discuss future work based on the current limitations of the systems used in our experiments.
2 Background
Contents
2.1 Subtle Corpus and Subtle Tool . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
2.2 Say Something Smart . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
2.3 Sequence-to-Sequence Models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
Previous work has already been done at the L2F/INESC-ID group to extract information from subtitle files and build a corpus from it. Additionally, another system was developed to receive that corpus as input and select plausible responses to user requests. Those two systems will now be described.
2.1 Subtle Corpus and Subtle Tool
The Subtle Corpus [Ameixa et al., 2013] is composed of interactions - pairs of triggers and answers - collected from movie subtitles available on the OpenSubtitles website. The most recent version of the Subtle Corpus [Magarreiro et al., 2014] was generated from almost 6,000 English subtitle files and 4,000 Portuguese subtitle files from four different movie genres: Romance, Sci-Fi, Western and Horror. Processing these files resulted in a total of 5,693,811 English interactions and 3,322,683 Portuguese interactions.
Extracting interactions involves identifying whether the information is relevant or not. The purpose of building this corpus is to allow a CA to extract responses for user requests. Therefore, the actual content of the subtitles received special treatment in some special cases in order to generate a corpus with useful interactions.
2.1.1 Pre-processing
Subtitles may include special annotations for people with hearing impairments. When a movie character is talking but not shown on screen, their name usually appears at the beginning of the utterance, followed by a colon. This information is removed from the utterance before forming a new trigger/answer pair. Sound descriptions commonly appear too. They can usually be found in uppercase between square brackets. Since they are not an actual trigger or answer, they are discarded. Also, subtitle files can have tags which are parsed by video players to change the way fonts appear on the screen. As explained in [Magarreiro et al., 2014], these tags “almost always contained the name of the person that synced the subtitles with the movie”, so they opted to discard them all.
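The pre-processing rules above can be pictured with a minimal sketch. This is not the Subtle Tool's actual implementation: the regular expressions below are assumptions about how such annotations typically look in subtitle files.

```python
import re

def clean_subtitle_line(line):
    """Remove hearing-impaired annotations and formatting tags from a
    subtitle utterance, following the rules described above."""
    # Drop font/formatting tags such as <i>...</i> or <font color="...">
    line = re.sub(r"</?[a-zA-Z][^>]*>", "", line)
    # Drop sound descriptions, usually uppercase between square brackets
    line = re.sub(r"\[[^\]]*\]", "", line)
    # Drop a leading speaker name followed by a colon (e.g. "JOHN: Hello")
    line = re.sub(r"^[A-Z][A-Za-z .']*:\s*", "", line)
    return line.strip()
```

For instance, `clean_subtitle_line("JOHN: [SIGHS] <i>Hello there.</i>")` yields `"Hello there."`.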
The system can also be configured to perform Named Entity Recognition (NER), so that when it finds words in the trigger or the answer that can be categorized, it replaces them with generic tags. This allows a dialogue system that receives the corpus as input to apply similarity measures on those tags.
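As a toy illustration of this tag-replacement step (the actual system would use a proper NER component; the gazetteer-based lookup below is only a stand-in):

```python
def replace_named_entities(text, gazetteer):
    """Replace known entity mentions with generic tags so that a dialogue
    system can apply similarity measures on the tags. The gazetteer maps
    surface forms to tags and is purely illustrative."""
    for surface, tag in gazetteer.items():
        text = text.replace(surface, tag)
    return text
```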
2.1.2 Dialogue Turns
Subtitle files comprise a sequence of slots with utterances. Since the main objective is to extract actual dialogues, it must be decided whether consecutive slots constitute an interaction or not. Sometimes a character has their utterances distributed across multiple slots. When the first utterance ends with a hyphen, a comma, a colon or an ellipsis and the second starts in lowercase, they are joined together, forming a possible trigger/answer for an interaction.
The time between two consecutive slots can indicate whether an interaction was found (or not). Common sense suggests that slots further away from each other should not form an interaction, because they do not constitute a valid dialogue (they probably correspond to different scenes in the movie). However, it is hard to set an appropriate value for the maximum time difference between slots (movies have pace variations). For that reason, the maximum time allowed between two slots can be set by the research group using the tool (in the configuration file1). Giving it the value zero indicates that all consecutive slots will be considered possible interactions.
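The two rules above (joining continuation slots and bounding the time gap between turns) can be sketched as follows; this is an illustrative reconstruction, not the Subtle Tool's actual code:

```python
# Punctuation that signals an utterance continues in the next slot
CONTINUATION_ENDINGS = ("-", ",", ":", "...", "…")

def merge_slots(slots):
    """Join consecutive subtitle slots that belong to the same utterance.
    Each slot is a dict with 'start', 'end' (milliseconds) and 'text'."""
    merged = []
    for slot in slots:
        if (merged
                and merged[-1]["text"].endswith(CONTINUATION_ENDINGS)
                and slot["text"][:1].islower()):
            merged[-1]["text"] = (merged[-1]["text"].rstrip("-,:.…")
                                  + " " + slot["text"])
            merged[-1]["end"] = slot["end"]
        else:
            merged.append(dict(slot))
    return merged

def extract_interactions(slots, max_delta_ms=0):
    """Pair consecutive utterances into (trigger, answer, delta) tuples.
    A max_delta_ms of zero means every consecutive pair is accepted."""
    utterances = merge_slots(slots)
    pairs = []
    for prev, curr in zip(utterances, utterances[1:]):
        delta = curr["start"] - prev["end"]
        if max_delta_ms == 0 or delta <= max_delta_ms:
            pairs.append((prev["text"], curr["text"], delta))
    return pairs
```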
2.1.3 Corpus Data Fields
Besides the fields for the triggers and answers of the interactions found, some additional ones are included.
To know where an interaction was extracted from, the filename of the source subtitle file is stored along for reference.
A CA using the Subtle Corpus as its knowledge base might need context information. Therefore, each interaction has a unique identifier. Every time a new trigger/answer pair is found, the identifier is incremented and assigned to the newly created interaction (effectively creating a reversed linked list).
The time difference (in milliseconds) between the time value of the trigger and the time value of the respective answer is saved as well.
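Putting these fields together, a single interaction record might look like the following (the field names are hypothetical, for illustration only):

```json
{
  "id": 4812,
  "trigger": "Where are you going?",
  "answer": "Home.",
  "source-subtitle-file": "example-movie.srt",
  "time-difference-ms": 500
}
```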
2.1.4 Final Considerations
This corpus of interactions extracted from subtitle files still has a set of drawbacks:
1. It only covers English and Portuguese subtitles, although there are a large number of subtitles
available for a wide range of languages;
2. Some of the data collected does not correspond to real interactions. Some efforts were made to minimize this (Section 2.1.1), but one can still find pairs of triggers and answers that do not represent valid sentences usable by a CA;

1 File where the user can indicate the location of the input files and the maximum time difference allowed between triggers and answers.
3. It has not evolved: the number of available subtitle files increases every day, but the corpus does not consider them;
4. There is information related to the subtitles (e.g. release year, rating of the subtitle, information
about the movie or TV show, etc.) that could be useful to include in the corpus;
5. Although a configuration file is provided, it does not give much flexibility to its end user (e.g. it only offers the fields needed for the tool to work, and users have few fields with which to customize the behavior of the tool).
Our solution adds support for more languages. The pre-processing phase is enhanced to address new cases from the added languages. We built corpora from a set of subtitles collected up to the year 2016. Meta-data associated with the subtitle files is now part of the corpus and also allows analytical experiments to be performed with it. The system is fully configurable through a configuration file, giving flexibility for whatever needs an end-user may have.
2.2 Say Something Smart
SSS [Ameixa et al., 2014] is the engine that chooses an answer when a user poses a request. The input given by the user is matched against the interactions present in the Subtle Corpus. First, a list of candidate answers is retrieved; then, they are scored according to some measures in order to return the answer with the best final score.
2.2.1 Indexing Subtle Corpus and Extracting Answers
The Subtle Corpus consists of thousands of files, each containing a large amount of interactions. To perform queries on the data, a high-performance text search engine was needed, and Lucene was chosen. It is an open-source software library which provides fast information retrieval (users should not have to wait too long for a response to their input) by adding content to a full-text index.
For Lucene to perform a successful search, some prior steps are mandatory when analyzing the raw data. First, it is necessary to transform the data into indexable tokens. Lucene contains tools (analyzers) that perform that transformation:
• Tokenizer: splits text into tokens at punctuation marks;
• Stemmer: removes morphological affixes from words, leaving only the word stem;
• Stop-words filter: some words are ignored when searching. This is done by having a file with a
stop-words list for some specific language and feed it into the filter.
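The effect of this analysis chain can be illustrated with a self-contained sketch. Lucene ships real implementations of these steps; the toy stemmer and stop-word list below are assumptions for illustration only.

```python
import re

STOPWORDS = {"the", "a", "an", "is", "are", "to", "of"}  # illustrative list

def tokenize(text):
    """Split text into lowercase tokens at punctuation and whitespace."""
    return [t for t in re.split(r"\W+", text.lower()) if t]

def stem(token):
    """Toy stemmer: strip a few common English suffixes. A real system
    would use a proper stemmer shipped with the search library."""
    for suffix in ("ing", "ed", "s"):
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[: -len(suffix)]
    return token

def analyze(text):
    """Tokenize, stem and drop stop-words, mimicking an analyzer chain."""
    return [stem(t) for t in tokenize(text) if t not in STOPWORDS]
```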
Since SSS is a question answering system, it relies on answer redundancy [Mendes et al., 2013] to help the process of choosing the best response. While performing the search, Lucene compares the user request with the triggers available in the Subtle Corpus. A relevance score is attributed to each one of them. Since Lucene applies an internal scoring algorithm2, its results are returned in descending order (from the highest score to the lowest). Usually, most of the results obtained from this scoring system are not semantically related to the request made by the user. To maintain answer redundancy while keeping response times fast, some tests were made. It was concluded that SSS should take the first 100 matches found by Lucene and then use its own algorithm to find the best answer. Additional measures (described in Section 2.2.2) were studied in order to give an improved score to each one of them.
The search made by Lucene might not match any interaction for a given user input. When that happens, the system still gives an answer, indicating that the user request was not understood or that the system does not know how to respond.
2.2.2 Electing the Best Answer
Due to the semantic deviation that might exist between the user request and the triggers/answers
found in the interactions reported by Lucene, SSS can apply four different weighted measures. The
weight of each measure, as well as which ones are used, can be customized by the end user through
the parameters of a configuration file.
These are the measures available:
• Trigger Similarity to the User Input (M1): interactions whose triggers are more similar to the
user input are given a higher value for this measure. As discussed before, many of the triggers
found in the interactions filtered by Lucene are semantically deviated, so this measure plays an
important role in identifying which ones are more relevant in order to give a better response to the
user request;
• Answer Frequency (M2): this measure evaluates which answer fields are more common among
all the interactions returned by Lucene. This way the corpus redundancy is taken into account, by
giving the highest score to the answer that appears most often;
• Answer Similarity to the User Input (M3): although the similarity between the user input and the
trigger might seem more useful, the similarity between the answer and the user input can also
help the system give better responses;
• Time Difference (M4): the Subtle corpus provides the time difference between the trigger and the
answer. This measure does not make much sense on its own, but when used together with the
other measures it can help improve the final results. When the time difference between the trigger
and the answer is too large, it might be an indicator that they do not constitute a real interaction,
and the pair thus receives a lower score.

2 http://www.lucenetutorial.com/advanced-topics/scoring.html
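A minimal sketch of how such a weighted combination might work (the measure functions and weights below are illustrative stand-ins, not SSS's actual implementation; M2 is omitted because it is computed over the whole candidate set rather than a single interaction):

```python
def score_interaction(user_input, trigger, answer, time_diff, weights, measures):
    # combine the enabled measures as a weighted sum; in SSS the weights
    # come from the configuration file
    total = 0.0
    for name, weight in weights.items():
        total += weight * measures[name](user_input, trigger, answer, time_diff)
    return total

def word_overlap(a, b):
    # toy similarity: Jaccard overlap of lowercase words (a stand-in for M1/M3)
    wa, wb = set(a.lower().split()), set(b.lower().split())
    return len(wa & wb) / max(len(wa | wb), 1)

measures = {
    "M1": lambda u, t, a, d: word_overlap(u, t),  # trigger similarity to input
    "M3": lambda u, t, a, d: word_overlap(u, a),  # answer similarity to input
    "M4": lambda u, t, a, d: 1.0 / (1.0 + d),     # penalize large time differences
}
weights = {"M1": 0.5, "M3": 0.3, "M4": 0.2}

s = score_interaction("how are you", "how are you doing", "fine thanks", 2.0,
                      weights, measures)
print(round(s, 3))  # → 0.442
```

The candidate with the highest combined score among the 100 Lucene matches would then be returned as the answer.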
2.2.3 Flexibility of the System
SSS provides a configuration file3 where the end user can specify some parameters that control the
system's behavior. First, the language can be specified, with English and Portuguese as possible
values. The path to the stop-words list should also be changed accordingly. There is also a field for
indicating a list of predefined answers from which the system selects one when no suitable answer is
found.
2.2.4 Final Considerations
Some improvements could be made to the current state of SSS:
1. As we have seen, when choosing an appropriate response to a user input, several measures are
combined in order to select the answer with the highest score, but the weights for each measure
were determined empirically, meaning that the system might not be at its best yet;
2. Since the Subtle Corpus includes a reverse linked list of interactions, including the context of the
conversation could improve the success rate of the system;
3. When retrieving possible responses, the system relies on an external tool with a non-customizable
scoring algorithm.
We wanted to replace Lucene with an entirely different tool, but no relevant candidate was found.
There are tools available that provide the same kind of functionality, but most of them have Lucene at
their core (e.g. Solr4 or Elasticsearch5).
With that in mind, we built a new dialogue system using neural networks (Section 5.1) and compared
its performance against SSS. Our system relies on a generative model so that it can answer questions
it has never seen.
2.3 Sequence-to-Sequence Models
In this section we will briefly explain the main concepts behind seq2seq models. We will start by
explaining the base architecture. Thereon, we will describe more specific aspects and techniques for
improving the base seq2seq model.

3 XML file.
4 https://lucene.apache.org/solr/
5 https://www.elastic.co/products/elasticsearch

Figure 2.1: Information flow in a seq2seq model.
A seq2seq model consists of two RNNs: an encoder and a decoder. The encoder processes the
input sequence one symbol6 at a time, converting the whole sequence into a fixed representation
containing only the important information of the sequence. The decoder sees the encoded representation
(the context7) and is trained to predict/generate another sequence, also one symbol at a time. The decoder
is influenced by the context and by the previously generated symbols, as shown in Figure 2.1. All of those
symbols need to have a representation, so a vocabulary is needed.
2.3.1 Padding and Bucketing
Training a standard seq2seq model involves a large number of matrix multiplications and other
operations which benefit from parallelization, and Graphics Processing Units (GPUs) are an excellent
candidate for that task. For that purpose, all sequences must have a fixed length in order to be divided
into batches. The input dataset must therefore be converted to fixed-length sequences, which is
accomplished by padding the input sequences with special symbols8. This implies that all sequences in
an input dataset would have to be padded to match the size of the longest sequence, slowing down the
training of the decoder. Fortunately, bucketing solves this problem by putting sequences into buckets of
different sizes (e.g. one bucket for sequences with length between 5 and 10, another for sequences
with length between 10 and 15, and so on).
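A toy sketch of the bucketing idea (the bucket boundaries and the `<pad>` symbol are illustrative): each sequence is padded only up to the size of its own bucket, not to the global maximum.

```python
def assign_buckets(sequences, buckets):
    # buckets: ascending list of maximum lengths, e.g. [5, 10, 15]
    grouped = {b: [] for b in buckets}
    for seq in sequences:
        for max_len in buckets:
            if len(seq) <= max_len:
                # pad only up to the bucket's size, not the global maximum
                grouped[max_len].append(seq + ["<pad>"] * (max_len - len(seq)))
                break
    return grouped

seqs = [["hi"], ["how", "are", "you"], ["a"] * 12]
g = assign_buckets(seqs, [5, 10, 15])
print({b: len(v) for b, v in g.items()})  # → {5: 2, 10: 0, 15: 1}
```

Batches are then formed within each bucket, so short sequences waste far fewer padding computations.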
2.3.2 Word Embeddings
Taking an input dataset of sentences as an example, the seq2seq model needs a representation
for each word present in the vocabulary. This is accomplished by using word embeddings, where each
word is represented by a fixed-length vector. Semantic relations between words can be captured by this
technique. In seq2seq models the word embeddings are trained jointly with the other parameters
of the model.
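The intuition behind "semantic relations" can be illustrated with hand-written toy embeddings (in a real model these vectors are learned, not fixed, and have hundreds of dimensions rather than three): semantically related words end up with vectors that point in similar directions.

```python
import math

# toy 3-dimensional embeddings; real models learn these jointly with the network
embeddings = {
    "king":  [0.9, 0.1, 0.4],
    "queen": [0.85, 0.15, 0.45],
    "apple": [0.1, 0.9, 0.2],
}

def cosine(u, v):
    # cosine similarity: 1.0 for identical directions, lower for unrelated vectors
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

related = cosine(embeddings["king"], embeddings["queen"])
unrelated = cosine(embeddings["king"], embeddings["apple"])
print(related > unrelated)  # → True
```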
6 In our experiments each symbol corresponds to a token that could be either a word or punctuation.
7 Given by the encoder state vectors.
8 Signaling end of sentence, decoding starting point, symbols not in the vocabulary, and filling slots.
2.3.3 Attention Mechanism
The seq2seq model provides the ability to process input and output sequences. However, compressing
an entire input sequence into a fixed-length context can cause the loss of a considerable amount of
information. [Bahdanau et al., 2014] took inspiration from the human perceptual system and
introduced an attention mechanism that allows the decoder to selectively look at the input sequence
while decoding. As a result, unnecessary information can be filtered out and better performance can
be achieved.
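A minimal sketch of the idea, using dot-product scoring for simplicity (the mechanism of [Bahdanau et al., 2014] uses a small additive network to compute the scores instead; the vectors here are illustrative):

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def attention(decoder_state, encoder_states):
    # score each encoder state against the current decoder state
    scores = [sum(d * e for d, e in zip(decoder_state, enc)) for enc in encoder_states]
    weights = softmax(scores)
    # context vector: weighted sum of encoder states; the decoder "looks at"
    # the input positions with the highest weights
    dim = len(encoder_states[0])
    context = [sum(w * enc[i] for w, enc in zip(weights, encoder_states))
               for i in range(dim)]
    return weights, context

enc = [[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]]
weights, context = attention([1.0, 0.0], enc)
print([round(w, 2) for w in weights])
```

A new context vector is computed at every decoding step, so each output symbol can attend to different parts of the input.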
2.3.4 Greedy Search and Beam Search
The decoder needs to select the most likely output sequence, which involves searching through all the
possible output sequences based on their likelihood. Usually, the size of the vocabulary tends to be very
large (e.g. hundreds of thousands of words). Therefore, the search problem is exponential in the length
of the output sequence and is intractable9 to search completely.
For seq2seq models it is common to use either a greedy search or a beam search approach in order
to find candidates to be chosen by the decoder. A greedy search selects the most likely symbol at each
step in the output sequence, making it a very fast approach. However, the quality of the final output
sequences may be far from optimal. Beam search, on the other hand, expands upon greedy search
and returns a list of the most likely output sequences, by keeping the n most likely at each step, where n
corresponds to the beam width specified by the user. Using beam search with n = 1 results in a greedy
search. Higher beam-width values result in a decrease in decoding speed.
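A toy sketch of beam search over precomputed per-step distributions (a simplification: a real decoder's probabilities depend on the prefix generated so far, whereas here each step's distribution is fixed):

```python
import math

def beam_search(step_probs, beam_width):
    # step_probs: for each time step, a dict token -> probability (a stand-in
    # for the decoder's softmax output)
    beams = [([], 0.0)]  # (sequence, accumulated log-probability)
    for probs in step_probs:
        candidates = []
        for seq, logp in beams:
            for token, p in probs.items():
                candidates.append((seq + [token], logp + math.log(p)))
        # keep only the n most likely hypotheses
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:beam_width]
    return beams

steps = [{"i": 0.6, "we": 0.4}, {"am": 0.7, "are": 0.3}]
print(beam_search(steps, 2)[0][0])  # → ['i', 'am']
```

With `beam_width=1` the loop keeps only the single best hypothesis at each step, which is exactly greedy search.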
2.3.5 Recurrent Neural Networks
As reported before, seq2seq models rely on RNNs, which are essentially standard neural networks
with loops.
An RNN consists of multiple copies of the same network. Each copy passes information about the
sequence being processed to its successor. This information is the hidden state: information about
what happened in all of the previous time steps. In Figure 2.2, an output is generated by the network
at each time step. This may not be necessary for some tasks (e.g. when predicting an answer to a
user input, the final result is what matters most; we might not care about what response could be
given at each word in the input).
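A single-unit toy version of the recurrence (the weights are arbitrary; real RNNs use weight matrices over vectors, but the structure is the same): the same weights are reused at every time step, and the hidden state carries information forward.

```python
import math

def rnn_step(x_t, h_prev, w_x, w_h, b):
    # hidden state update of a one-unit vanilla RNN:
    # h_t = tanh(w_x * x_t + w_h * h_prev + b)
    return math.tanh(w_x * x_t + w_h * h_prev + b)

h = 0.0
for x in [1.0, 0.5, -0.5]:  # the "copies" of the network share these weights
    h = rnn_step(x, h, w_x=0.8, w_h=0.5, b=0.0)
print(round(h, 3))
```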
There is one case where the utilization of simple RNNs (usually referred to as vanilla RNNs) might
not do well: when learning long-term dependencies (e.g. dependency on information that is present in
steps that are far apart). However, there are some alternatives that are able to address that problem, for
instance Long Short-Term Memory networks (LSTMs), described in the following section.

9 NP-complete.

Figure 2.2: Architecture of an RNN with a loop, then unrolled after unfolding. Xt represents what is given as
input. Yt is the output generated by the network.
2.3.6 Long Short-Term Memory Neural Networks
LSTMs [Hochreiter and Schmidhuber, 1997] have the same architecture as an RNN. The only difference
is that they apply a different function to compute the hidden state, thus avoiding the long-term
dependency problem usually reported with basic RNNs. Instead of having only one neural network, they
have four, which are combined through some pointwise operations inside a cell that represents the
memory of the LSTM. Internally, these cells have the ability to remove or add information to their state.
That way, they can remember information from steps that are far apart.
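A scalar toy version of the cell update (real LSTMs operate on vectors with weight matrices and biases; the weights here are arbitrary). The four internal networks compute the forget, input, candidate and output values, which are combined through pointwise operations on the cell state:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def lstm_step(x, h_prev, c_prev, W):
    f = sigmoid(W["f_x"] * x + W["f_h"] * h_prev)    # how much old cell state to keep
    i = sigmoid(W["i_x"] * x + W["i_h"] * h_prev)    # how much new information to add
    g = math.tanh(W["g_x"] * x + W["g_h"] * h_prev)  # candidate cell content
    o = sigmoid(W["o_x"] * x + W["o_h"] * h_prev)    # how much cell state to expose
    c = f * c_prev + i * g                           # pointwise update of the cell state
    h = o * math.tanh(c)                             # new hidden state
    return h, c

W = {k: 0.5 for k in ["f_x", "f_h", "i_x", "i_h", "g_x", "g_h", "o_x", "o_h"]}
h, c = 0.0, 0.0
for x in [1.0, -1.0, 1.0]:
    h, c = lstm_step(x, h, c, W)
print(round(h, 3), round(c, 3))
```

Because the cell state `c` is updated additively (scaled by the forget gate) rather than squashed through a nonlinearity at every step, information can survive across many time steps.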
3 Related Work
Contents
3.1 Corpora
3.2 End-to-end Sequence to Sequence Models
This chapter presents a review of previous work regarding the process of building corpora for dialogue
systems and the process of generating responses with end-to-end CAs for user input on a turn-by-turn
basis.
3.1 Corpora
In this section we will describe corpora that are suitable to be used as a knowledge base for a CA.
3.1.1 OpenSubtitles2016 Corpus
As previously pointed out, the number of subtitle files is constantly increasing every day. The OPUS
Corpus1 was updated in 2016 with a new dataset based on movie and TV subtitles: the OpenSubtitles2016
Corpus [Lison and Tiedemann, 2016]. After preprocessing the source files, the subtitle files are aligned
with each other to form a parallel corpus.
3.1.1.A Source Data
The source data that originated this corpus consists of a database dump from OpenSubtitles.org,
containing a total of 3.36 million subtitle files distributed across more than 60 languages. Some files
were discarded from the conversion because their formats were unsupported or they had corrupt encodings.
The OpenSubtitles team has introduced multiple mechanisms that improved the quality of the subtitles
available on their website, allowing them to remove duplicate, spurious and misclassified subtitles.
After the conversion step, the dataset includes subtitle data from a total of 152,939 movies or TV
show episodes.
The raw subtitle files go through a preprocessing phase described in the following sections.
3.1.1.B Preprocessing
Encoding detection: before the content of a file can be parsed, its encoding must be known a priori,
since OpenSubtitles does not enforce any encoding type. The problem was addressed by creating a list
of possible character encodings for each language in the dataset; since some languages allow several
alternative encodings, the most likely one is determined by auto-detection [Li and Momoi, 2001].
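A simplified sketch of a first-success decoding strategy (the candidate lists below are illustrative, not the actual tables used by the corpus authors, whose detector additionally relies on statistical models):

```python
# candidate encodings per language, tried in order of likelihood
CANDIDATES = {"pt": ["utf-8", "cp1252", "iso-8859-1"],
              "ru": ["utf-8", "koi8-r", "cp1251"]}

def detect_and_decode(raw_bytes, language):
    # try each candidate encoding; the first one that decodes cleanly wins
    for encoding in CANDIDATES.get(language, ["utf-8"]):
        try:
            return raw_bytes.decode(encoding), encoding
        except UnicodeDecodeError:
            continue
    # fall back to a lossy decode if nothing matched
    return raw_bytes.decode("utf-8", errors="replace"), "utf-8 (lossy)"

text, enc = detect_and_decode("Acção".encode("cp1252"), "pt")
print(enc)  # → cp1252
```

UTF-8 is tried first because invalid UTF-8 byte sequences fail loudly, making false positives unlikely; legacy single-byte encodings accept almost any byte string, which is why real detectors need statistics.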
Sentence Segmentation: a structure of blocks is present in the raw files (Listing 3.1), consisting
of short portions of text with associated start and end times. Since there is no direct correspondence
between these blocks and sentences, a sentence segmentation process is applied that finds sentence-ending
markers in order to detect whether a subtitle block is a continuation of the preceding block. However,
this detection of sentence-ending markers is highly language dependent and must obey some specific
rules.

1 http://opus.lingfil.uu.se/
Listing 3.1: Structure of a subtitle file in .srt format

1
00:02:17,440 --> 00:02:20,375
Senator, we're making
our final approach into Coruscant.

2
00:02:20,476 --> 00:02:22,501
Very good, Lieutenant.
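The continuation detection sketched over the blocks of Listing 3.1 could look like this (a simplification: the real process uses language-specific rules rather than a fixed marker set, and also handles timing information):

```python
SENTENCE_END = (".", "!", "?")  # language-dependent in the real pipeline

def merge_blocks(blocks):
    # join a subtitle block with the next one when it does not end a sentence
    sentences, current = [], ""
    for block in blocks:
        current = (current + " " + block).strip()
        if current.endswith(SENTENCE_END):
            sentences.append(current)
            current = ""
    if current:  # keep any trailing fragment without a sentence-ending marker
        sentences.append(current)
    return sentences

blocks = ["Senator, we're making",
          "our final approach into Coruscant.",
          "Very good, Lieutenant."]
print(merge_blocks(blocks))
```

The first block lacks a sentence-ending marker, so it is merged with the second; the third block is a complete sentence on its own.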
Spell Correction: many subtitle files that serve as a data source are automatically extracted from
video streams using Optical Character Recognition (OCR), which causes some spelling errors.
Also, many subtitle files are made by amateurs, making spelling errors very likely to be present in the
data. Following a simple noisy-channel approach (integrating handcrafted error models and statistical
models), such errors (including misplaced accent marks) were automatically detected and corrected for
11 European languages2. The approach followed cannot correct words that include more than one
misrepresented character.
Collecting meta-data: for each subtitle, meta-data is generated that includes generic attributes of
the source material3 extracted from the International Movie Database (IMDb) and attributes of the subtitle
itself4. There is also some additional meta-data related to the previous phases, such as the encoding
that was detected or the number of spelling errors found.
3.1.1.C Output files
After preprocessing the files, the subtitle files are aligned with each other to form a parallel corpus. The
details of that step will not be described here, since the alignment of subtitles is not the focus of our work.
In addition to the bi-text files generated for the parallel corpora, XML files are also provided containing
the subtitles, with sentences either tokenized or not. An example of an OpenSubtitles2016 corpus file can
be seen in Listing 3.2.
2 English, Czech, Dutch, French, German, Italian, Polish, Portuguese, Romanian, Spanish and Swedish.
3 Release year, original language, duration and genre.
4 Upload date, subtitle rating on OpenSubtitles and subtitle duration.
Listing 3.2: Simplified example of an OpenSubtitles2016 input file structure.

<?xml version="1.0" encoding="UTF-8"?>
<document id="66487">
  <s id="1">
    <time id="T1S" value="00:00:31,800" />
    Smeagol
    <time id="T1E" value="00:00:38,800" />
  </s>
  <s id="2">
    <time id="T2S" value="00:01:15,700" />
    Apanhei um!!
  </s>
  <meta>
    <subtitle>
      <language>Portuguese</language>
      <date>2004-03-06</date>
      <duration>00:01:15,700</duration>
      <rating>7.0</rating>
    </subtitle>
    <source>
      <original>English, Quenya, Old English, Sindarin</original>
      <year>2003</year>
      <duration>201 min</duration>
      <genre>Action, Adventure, Fantasy</genre>
      <country>USA, New Zealand</country>
    </source>
  </meta>
</document>
Does OpenSubtitles2016 Corpus meet our requirements?
This corpus is the ideal candidate for the goals we defined. It is the largest corpus available, and a
recent one, with subtitles up to the year 2016. It provides meta-data about each corpus file and supports
multiple languages, including 96,254 files with Portuguese subtitles.
3.1.2 Movie-Dic Corpus
The Movie-Dic Corpus [Banchs, 2012] is available for research and development purposes. It comprises
132,229 dialogues containing a total of 764,146 turns, extracted from 753 English movie scripts. It can
be used in chat-oriented dialogue systems, since it does not provide a knowledge base focused on a
specific domain or area of interest.
The movie scripts that serve as the source data of this corpus are freely available at the Internet Movie
Script Database (IMSDb)5 as HyperText Markup Language (HTML) files. Three types of information are
extracted when crawling the files:
• Speakers: the names of the movie characters that are speaking in a given turn of
the dialogue;
• Context: additional information of narrative nature, explaining what is happening in the movie
scene;
5 http://www.imsdb.com/
• Utterances: what is said at each turn by some speaker.
With that information, some heuristics were developed in order to identify proper dialogue boundaries.
After identifying the dialogues, some post-processing was applied to filter out or amend parsing
errors, as well as erroneous data present in the types of information already described above. Finally, all
information was organized in dialogue units and then written to XML files.
Does Movie-Dic Corpus meet our requirements?
Although this corpus contains speaker and context information that could be useful to build CAs,
it does not meet our requirements because it is only available in English. It is also a small corpus when
compared with the OpenSubtitles2016 corpus, and it does not provide meta-data.
3.1.3 Cornell Movie-Dialogs Corpus
The Cornell Movie-Dialogs Corpus6 is another dataset that contains conversations extracted from
raw movie scripts. It comprises 220,579 conversational exchanges involving 9,035 characters from 617
movies. Meta-data is also included for each conversation, containing details about the movie7
and about the characters8. This information was gathered by an algorithm that performs queries on the
IMDb data interfaces9.
It was mainly used to study how conversational participants adapt to each other's language styles
while communicating with each other [Danescu-Niculescu-Mizil and Lee, 2011].
Does the Cornell Movie-Dialogs Corpus meet our requirements?
This corpus is very similar to the Movie-Dic corpus. Thus, it does not meet our requirements either,
for the same reasons explained before.
3.1.4 Ubuntu Dialog Corpus
The Ubuntu Dialog Corpus [Lowe et al., 2015] is a dataset that helps build dialogue agents that are
capable of interacting in one-to-one conversations on very technical subjects. Since the dataset is
characterized by a multi-turn property of unstructured nature10, the resulting agents can perform multi-turn
conversations.
This corpus comprises almost 1 million two-party conversations extracted from Ubuntu chat logs11
between 2004 and 2015. Each conversation has an average of 8 turns and a minimum of 3. It allows
the creation of CAs for targeted applications (in this case, technical support).

6 https://www.cs.cornell.edu/~cristian/Cornell_Movie-Dialogs_Corpus.html
7 Movie title, genres, release year, IMDb rating and number of IMDb votes.
8 Gender and order of appearance in movie credits.
9 http://www.imdb.com/interfaces
10 There is no logical representation for the information exchange during a conversation.
11 https://irclogs.ubuntu.com/
Some learning architectures were studied in order to analyze how this corpus contributes to a
question/answering system, using the task of selecting the best response given a user input. Before testing
those learning techniques, the collected data went through a preprocessing stage. Each
utterance was parsed using the Natural Language Toolkit (NLTK) library12 and a Twitter tokenizer13 (there is
no information available about how both tools were used). Afterwards, some generic tagging was done
using NER, with generic tags for multiple word categories (person names, locations, system paths
and so on). Also, the data was further processed to create tuples with three fields: the context, the
response, and a flag used to indicate whether a response was valid or not14.
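The tuple-creation step can be sketched as follows (a simplification of the paper's procedure; the conversations, the response pool, and drawing negatives uniformly from a small pool are illustrative assumptions):

```python
import random

def make_tuples(conversation, all_utterances, seed=0):
    # for each context, emit a positive tuple (true next utterance, flag=1)
    # and a negative tuple (randomly sampled utterance, flag=0)
    rng = random.Random(seed)
    tuples = []
    for i in range(1, len(conversation)):
        context = " ".join(conversation[:i])
        tuples.append((context, conversation[i], 1))
        tuples.append((context, rng.choice(all_utterances), 0))
    return tuples

conv = ["my wifi is down", "did you restart the router?", "yes, twice"]
pool = ["try sudo reboot", "what version of ubuntu?", "thanks!"]
for t in make_tuples(conv, pool):
    print(t)
```

A response-selection model is then trained to predict the flag, i.e. to distinguish the true next utterance from a distractor given the context.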
Does Ubuntu Dialog Corpus meet our requirements?
This dataset contains information about specific technical topics, which makes it unsuitable for
evaluating open-domain end-to-end dialogue systems.
3.2 End-to-end Sequence to Sequence Models
The automated discovery of abstraction is the fundamental idea behind all deep learning methodologies.
They capture the semantic content by building abstract representations of raw data features.
The seq2seq learning framework is one of those methodologies. It looks for an in-between representation
of content when mapping one complex structure to another [Sutskever et al., 2014]. However,
training and optimizing this process of translation between structures is exceptionally challenging
[Bengio, 2013].
Given a string of inputs, a generative neural network model produces a string of outputs, both of
arbitrary length. It relies on encoder-decoder models: the encoder encodes the source sequence,
while the decoder produces the target sequence.
Recently, an increasing number of studies have found that seq2seq models can be used for creating
dialogue systems by relying on a purely data-driven approach [Li et al., 2015, Yin et al., 2015, Lu et al.,
2017].
Using seq2seq models for end-to-end training has facilitated the creation of systems for various complex
natural language tasks, with machine translation being one of the most favored [Kalchbrenner and
Blunsom, 2013, Sutskever et al., 2014, Cho et al., 2014, Luong et al., 2015]. They can be used in multiple
types of systems without considerable changes in the architecture. This is ideal for tasks for which it is
too difficult to design rules manually, such as dialogue systems.
Building a CA involves mapping questions to responses, and being able to do that with a straightforward
model is very appealing. The seq2seq model can learn to map between questions and answers in
either closed-domain or open-domain datasets, as shown by [Lu et al., 2017] and [Li et al., 2015].

12 http://www.nltk.org/
13 https://www.cs.cmu.edu/~ark/TweetNLP/
14 A response is flagged as valid if it is the next utterance after the context.
In [Vinyals and Le, 2015], a neural conversational model was built; the English subtitle files of
OpenSubtitles2009 [Tiedemann, 2009]15 were one of the datasets used to test the model. They considered
consecutive sentences as if they were uttered by distinct characters, and the model was trained
to predict the next sentence given the previous one. Their CA was capable of having basic fluent
open-domain conversations, and the model could generalize to new questions it had never seen during
the training phase. However, the resulting CA has some drawbacks: it gives too many short and simple
answers, and it also lacks a way to ensure consistency during a conversation, because it does not include
any general world knowledge (it is an unsupervised model) and it has no memory of the past conversation.
They evaluated their CA by comparing it against CleverBot16. They asked four different humans to rate
the answers given by both agents for 200 questions, and their CA achieved a better score in this human
evaluation. They justified the choice of using human evaluators by stating that designing a good metric
to measure the quality of a conversational model remains an open research problem.
The problem with generic answers was also reported by [Guo et al., 2017]. After training a seq2seq
model with the Cornell Movie-Dialogs Corpus (Section 3.1.3), they ended up with a high percentage of "I don't
know" answers. After removing all "I don't know" sentences from the input dataset and training a new
model, the responses remained vague (with a high percentage of "What do you mean?"). The evaluation
of the created CA was made by the team members involved in the research.
Generating long, informative, coherent and diverse responses remains a hard task. [Li et al., 2015]
found that the traditional objective function that selects the best answer is unsuited for question-answering
systems (although it provides state-of-the-art results in machine translation tasks). They proposed
using Maximum Mutual Information as the objective function, which penalizes generic responses. They
stated that more meaningful responses can be found in the N-best lists given by seq2seq models, but
rank much lower. After applying their proposed objective function, they were able to achieve more
diverse responses. The OpenSubtitles2009 [Tiedemann, 2009] dataset was also used in their experiments.
While training the model they used BLEU [Papineni et al., 2002] for parameter tuning. They also relied
on human evaluation: the judges were instructed to prefer outputs that were more relevant to the preceding
context, as opposed to those that were more generic. After analyzing the results, they were able to
improve the number of diverse and interesting responses returned by the model. In [Shao et al., 2017]
similar results were obtained by using a slightly modified version of beam search when selecting the
best response. They introduced stochastic sampling operations into the beam-search algorithm, which
allowed them to inject diversity earlier in the answer generation process. They also implemented a
back-off strategy, falling back to the baseline model with the standard beam-search algorithm when
the response was shorter than 40 characters17. Once again, they relied on an evaluation done by
humans, asking them to rank the answers given by their model on a 5-point scale18. Some of their
methods were able to generate longer answers; however, those answers received worse classifications
from the judges.

15 One of the previous versions of the corpus presented in Section 3.1.1.
16 https://www.cleverbot.com/
17 Textual length.
18 Excellent, Good, Acceptable, Mediocre, and Bad.
4 B-Subtle
Contents
4.1 Architecture Overview
4.2 B-Subtle Parts Explained
We present a revamped tool for creating corpora: B-Subtle. We aim to replace the existing Subtle
Tool by providing the following features:
1. Create corpora of interactions with associated meta-data, such as the genre of the movie/TV show,
release year, spoken language, subtitle language, etc. This allows end-users to generate corpora
that meet specific requirements, for example including only interactions collected from movies with
"Action" as a genre;
2. Support OpenSubtitles2016 Corpus [Lison and Tiedemann, 2016] files as input;
3. Support different output formats for the generated corpus, such as JSON or XML files. This could
be useful for end-users, since they could choose the output format that fits their needs;
4. Provide a user-friendly configuration file, allowing the user to fully control the behavior of B-Subtle
when creating corpora simply by adjusting parameters;
5. Allow end-users to collect analytical data about movies, TV shows and subtitle files;
6. Deal with specific language details (e.g. encodings) in order to allow end-users to work with a
broader range of languages.
4.1 Architecture Overview
B-Subtle's architecture (Figure 4.1) was designed to produce a flexible system that can be fine-tuned
according to the requirements a user may have. For that reason, a modular approach was adopted.
The system is capable of running pipelines. A pipeline indicates how a set of input files will be
processed and is specialized in dealing with their inner details; it specifies which of the components
made available by B-Subtle will be used for an input dataset.
Essentially, this tool receives a dataset as input and outputs a customized dataset of interaction pairs
and/or analytics.
B-Subtle is currently able to process OpenSubtitles2016 Corpus files. As seen in Figure 4.1, processing
each OpenSubtitles2016 corpus file comprises a sequence of steps:
1. First, it gathers all the meta-data available in the subtitle file;
2. In order to fill missing fields of the gathered meta-data (e.g. the subtitle file might not have
any value for the genre field), additional components can be applied: the Meta-data Collectors
(explained in Section 4.2.2);
3. After all of the meta-data has been collected, the system is ready to filter the file using Meta-data
Filters (explained in Section 4.2.3.A). These are the components that allow the tool to create
targeted corpora from subtitles;
4. If the file survives all of the applied Meta-data Filters, the system starts collecting interaction pairs
from it;
5. While collecting interaction pairs, some of them can be filtered out by configuring Interaction
Pair Filters, described in detail in Section 4.2.3.B;
6. With all the interactions collected, the system can process each one of them by applying
Producers (Section 4.2.4). These components are able to enrich each interaction with more data
(e.g. perform a tokenization of the trigger and/or the answer);
7. After applying the Producers, Interaction Pair Filters can be used again, because some of them can
only be applied to the data generated by the Producers (e.g. filtering out all the triggers with more
than 5 tokens);
8. Then, Transformers can be applied; these are responsible for modifying the raw data fields of the
triggers and answers. We explain them in more detail in Section 4.2.5;
9. Once again, Interaction Pair Filters can be applied to the data fields modified by the Transformers;
10. Afterward, all the information collected can be written to a B-Subtle corpus file (explained in detail
in Section 4.2.6).
Analytical data can also be collected during corpus generation, and B-Subtle can likewise collect
analytical data without generating any output corpus. See Section 4.2.7 for the types of analytics that
can be collected and how to configure them.
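The pipeline described above can be sketched as a chain of function applications (the component names and data shapes here are illustrative, not B-Subtle's actual classes; the repeated Interaction Pair Filter passes of steps 5, 7 and 9 are collapsed into a single pass for brevity):

```python
def run_pipeline(subtitle_file, collectors, metadata_filters,
                 pair_filters, producers, transformers):
    meta = dict(subtitle_file["meta"])
    for collect in collectors:                      # step 2: fill missing meta-data
        meta.update(collect(meta))
    if not all(f(meta) for f in metadata_filters):  # step 3: reject whole file
        return []
    pairs = subtitle_file["pairs"]                  # step 4: collect interaction pairs
    pairs = [p for p in pairs if all(f(p) for f in pair_filters)]  # step 5
    for produce in producers:                       # step 6: enrich pairs
        pairs = [produce(p) for p in pairs]
    for transform in transformers:                  # step 8: modify raw fields
        pairs = [transform(p) for p in pairs]
    return pairs                                    # step 10: ready to be written out

subtitle = {"meta": {"genre": None},
            "pairs": [{"trigger": "Hi!", "answer": "Hello."}]}
out = run_pipeline(
    subtitle,
    collectors=[lambda m: {"genre": "Action"} if m["genre"] is None else {}],
    metadata_filters=[lambda m: m["genre"] == "Action"],
    pair_filters=[lambda p: len(p["trigger"]) > 0],
    producers=[lambda p: {**p, "tokens": p["trigger"].split()}],
    transformers=[lambda p: {**p, "trigger": p["trigger"].lower()}],
)
print(out)
```

Because every stage is just a list of callables, adding a new Collector, Filter, Producer or Transformer does not require touching the pipeline itself.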
4.2 B-Subtle Parts Explained
By adopting a component-based architecture, we allow B-Subtle to be easily expanded in the
future. In this section, we will describe all of its constituent parts in detail (Figure 4.1).
4.2.1 Input Files
The OpenSubtitles2016 Corpus files are in XML format, therefore a dedicated parser was implemented.
To process all the data successfully, some additional steps had to be performed. For example,
some files contained invalid XML characters that needed to be deleted so that the file could be correctly
analyzed without being immediately discarded. Also, during the meta-data collection step, the
"duration"1 field of the subtitle file was found written in multiple patterns ("HH:MM:ss,3S", "MM min",
or even "N/A"), and we converted them to a unified format2.

Figure 4.1: Possible B-Subtle pipeline for OpenSubtitles2016 files using all the components available.
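The normalization just described can be sketched as follows (the exact set of patterns accepted by B-Subtle may differ; the target unit, whole minutes, matches the footnote):

```python
import re

def normalize_duration(value):
    # convert the observed duration patterns to a single unit: whole minutes
    m = re.fullmatch(r"(\d{2}):(\d{2}):(\d{2}),\d{3}", value)  # "HH:MM:ss,3S"
    if m:
        hours, minutes, _seconds = map(int, m.groups())
        return hours * 60 + minutes
    m = re.fullmatch(r"(\d+) min", value)                      # "MM min"
    if m:
        return int(m.group(1))
    return None                                                # "N/A" or unrecognized

print([normalize_duration(v) for v in ["02:15:30,500", "95 min", "N/A"]])
# → [135, 95, None]
```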
While choosing the parser architectures for XML files we came across the following options:
• Document Object Model (DOM): the whole XML file is loaded into memory. It is possible to navigate
to parent and child elements across the document. This technique can be problematic with
large files due to heavy memory consumption;
• Simple API for XML (SAX): it reads the XML file from beginning to end without storing
anything in memory. It fires events, and a custom event handler can be used to catch all or part of
them. It lacks a parent structure like DOM's, but does not suffer from memory consumption problems;
• Streaming API for XML (StAX): it is similar to SAX; however, the responsibility of moving the parser
through the XML file belongs to the event handler. Its main advantage over SAX is that it
allows writing to the XML file.
Since the size of each XML file is relatively small, the DOM parsing architecture was chosen, as
it allowed a faster development of the B-Subtle tool.
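For illustration, here is a DOM-style traversal of a (simplified) OpenSubtitles2016 file, using Python's xml.dom.minidom rather than B-Subtle's actual parser; the XML fragment follows the structure of Listing 3.2.

```python
from xml.dom import minidom

xml_text = """<document id="66487">
  <s id="1"><time id="T1S" value="00:00:31,800"/>Smeagol</s>
  <meta><subtitle><language>Portuguese</language></subtitle></meta>
</document>"""

# DOM: the whole document is loaded into memory, allowing random access
doc = minidom.parseString(xml_text)
sentences = doc.getElementsByTagName("s")
lang = doc.getElementsByTagName("language")[0].firstChild.data

# extract the text content of each <s>, skipping the <time> child elements
texts = ["".join(n.data for n in s.childNodes if n.nodeType == n.TEXT_NODE).strip()
         for s in sentences]
print(lang, texts)  # → Portuguese ['Smeagol']
```

With SAX or StAX the same extraction would require tracking parser state across events; with DOM the tree can simply be queried, which is what makes development faster when files fit in memory.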
4.2.2 Meta-data Collectors
These components are responsible for enriching the meta-data provided by the original input files with
meta-data from external sources (for instance, the genre of the movie if this information is not available).
They can be particularly useful for .srt files directly downloaded from websites, which lack such information.
1 Refers to the total duration of the source movie related to the subtitle file.
2 An integer with the amount of time in minutes.
We have currently implemented the themoviedb3 Meta-data Collector4. This component makes a
Hypertext Transfer Protocol (HTTP) request to themoviedb with an IMDb identifier5 and receives back
a JSON response that is parsed by B-Subtle. The relevant information is then extracted (e.g. the movie
certification codes, in order to classify the audience type). This information can then be filtered with the
components we describe next.
4.2.3 Filters
When dealing with input data that is rich in meta-data6, filters may be applied. This feature allows
end-users to easily generate targeted corpora (e.g. generate a corpus of interactions from movies with
“Western” as a genre and released before the year 1990).
Two types of filters can be used: Meta-data Filters and Interaction Pair Filters, described below.
4.2.3.A Meta-data Filters
• Audience: allows filtering subtitle files according to an audience rating/certification. This information can be added by our themoviedb Meta-data Collector. This filter supports a flag for adult movies. It also supports filtering by motion picture content rating when provided together with a country identifier7, since different countries have different criteria for content and age rating (e.g. applying the adult-movies flag results in skipping those subtitle files; defining the content rating value as M/16 with Portugal as the country results in accepting all subtitle files from movies with that content rating).
• Country: allows filtering subtitles of movies/TV shows made in a specific country or set of countries. Using a regular expression8 for the country name is also supported (e.g. accept only subtitle files from movies made in countries whose name starts with “Po” by using the regular expression "^Po").
• Country Quantity: allows filtering subtitles of movies/TV shows made in a determined quantity of countries. A maximum, minimum or exact quantity can be defined. A range can also be used (e.g. defining a range of 2 to 4 would result in accepting subtitle files from movies filmed in at least 2 and at most 4 countries).
3 www.themoviedb.org
4 Limited to 40 API requests every 10 seconds per IP address.
5 Provided by the OpenSubtitles2016 Corpus as the filename of the subtitle files.
6 Possibly enriched with B-Subtle's own Meta-data Collectors.
7 Details about motion picture content rating in multiple countries: https://en.wikipedia.org/wiki/Motion_picture_content_rating_system
8 A sequence of characters that defines a search pattern.
• Duration: allows filtering by the total duration of the movie (in minutes). A maximum, minimum or exact quantity can be defined. A range can also be used (e.g. accept only subtitles from movies shorter than 90 minutes by defining a maximum quantity of minutes).
• Encoding: allows filtering subtitle files written in a specific encoding. Using a regular expression for the encoding is also supported, as well as checking for the existence of that field (one might want to keep only the subtitle files whose encoding is correctly identified).
• Genre: allows filtering subtitles of movies/TV shows belonging to a specific genre or set of genres.
Using a regular expression for the genre type is also supported.
• Genre Quantity: allows filtering subtitles of movies/TV shows belonging to a determined quantity of genres. A maximum, minimum or exact quantity can be defined. A range can also be used (e.g. defining an exact quantity of 2 would accept only subtitles from movies tagged with exactly two genres, such as “Action” and “Comedy”).
• IMDb Identifier: allows filtering subtitle files that have the IMDb ID present in the meta-data fields.
• Movie Title: allows subtitle files to be filtered by movie name by providing a regular expression. The existence of that field in the meta-data can also be tested (one might want to keep only subtitles that have a movie title associated; some may lack that information, since this field is not provided by the OpenSubtitles2016 Corpus files).
• Original Language: allows filtering subtitles of movies/TV shows made in a specific language or
languages. Using a regular expression for the original language is also supported.
• Original Language Quantity: allows filtering subtitles of movies/TV shows made in a determined
quantity of original languages. A maximum, minimum or exact quantity can be defined. A range
can also be used.
• Movie Rating: allows filtering subtitle files based on the movie rating associated with the subtitles. It supports checking for the existence of that field in the meta-data, as well as a maximum, minimum, exact or range of values.
• Subtitle Rating: allows filtering subtitle files based on the associated subtitle rating. It supports checking for the existence of that field in the meta-data, as well as a maximum, minimum, exact or range of values (e.g. accept only subtitle files with a rating above 6.3 on a scale of 0 to 10 by defining a minimum value).
• Year: allows filtering files based on the release year of the movie. It supports checking for the
existence of that field in the meta-data as well as a maximum, minimum, exact or range of values.
4.2.3.B Interaction Pairs Filters
The following filters are available for the interaction pairs. Some of them require a Producer to be
applied a priori :
• Interaction Interval: allows filtering interaction pairs by the time interval between a trigger and an answer. It supports checking a maximum, minimum, exact, or range of values (e.g. collect interaction pairs where the answer appears up to 4 seconds after the trigger);
• Trigger/Answer Sentiment: allows filtering interaction pairs where the trigger/answer expresses
a sentiment defined by the user (requires the Sentiment Producer Component). (e.g. accepting
only triggers with a positive sentiment).
• Trigger/Answer Tokens Quantity: allows filtering interaction pairs where the trigger/answer has a determined amount of tokens (requires the Tokenizer Producer Component). A maximum, minimum or exact quantity can be defined. A range can also be used (e.g. accepting only answers with more than 5 tokens, so that sentences like “Yes you are!” are discarded9).
• Trigger/Answer Characters Quantity: the same as the above, but for textual content length (e.g. the sentence “I am fine.” contains 10 characters; if we define this filter with a minimum characters quantity of 5, that sentence is accepted as a trigger/answer).
• Trigger/Answer Regular Expression: allows filtering interaction pairs where the trigger/answer
matches some regular expression defined by the user. This filter gives a lot of flexibility such as
building a regular expression that filters out triggers containing curse words.
• Trigger/Answer Text Content: allows filtering interaction pairs where the trigger/answer starts with, contains or ends with some sequence of characters. The same result can be achieved by using a Trigger/Answer Regular Expression Filter, but since some users might not be comfortable with regular expressions, we decided to provide this filter for the simple use cases of text content starting with, containing or ending with some sequence of characters.
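As an illustration, the core of a trigger regular-expression filter could look like the following sketch (class and method names are ours, not B-Subtle's actual API):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

public class RegexFilterSketch {
    // Keep only the triggers that match the user-supplied regular
    // expression; non-matching interaction pairs would be discarded.
    static List<String> keepMatching(List<String> triggers, String regex) {
        Pattern p = Pattern.compile(regex);
        List<String> kept = new ArrayList<>();
        for (String t : triggers) {
            if (p.matcher(t).matches()) kept.add(t);
        }
        return kept;
    }

    public static void main(String[] args) {
        List<String> triggers = List.of("Where are you going?", "Sit down.");
        // Equivalent of a trigger-ends-with filter with value "?":
        System.out.println(keepMatching(triggers, ".*\\?"));
        // prints [Where are you going?]
    }
}
```

The same pattern-matching core would serve the Text Content filter as well, with the user's literal prefix/suffix escaped before compilation.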
4.2.4 Producers
These components are responsible for generating additional data for the interaction pairs. The search for and implementation of producers was limited to tools with existing Java libraries. We also preferred tools that support Portuguese.
9 Contains only 4 tokens: “Yes”, “you”, “are” and “!”.
• OpenNLP Sentiment Analyzer: uses a sentiment analysis tool from OpenNLP. Currently, it is only prepared to deal with English sentences. It evaluates a sentence's sentiment according to the following scale: very negative, negative, neutral, positive, and very positive;
• OpenNLP Stemmer: uses a snowball stemmer from Apache OpenNLP10. This stemmer supports
16 languages11. The language parameter is customizable through a B-Subtle configuration file
(Section 4.2.8). By default, it is applied to both the trigger and the answer.
• Open NLP Tokenizer: converts the raw text from triggers and/or answers into separated tokens. It is available for Danish, German, English, Dutch, Portuguese and Swedish.
• TreeTagger Lemmatizer: converts the raw text from triggers and/or answers into separated lem-
mas. It is available for 23 languages12.
4.2.5 Transformers
Transformers are entities responsible for transforming the raw text data present in the interaction
pairs.
• Lowercase: converts the raw text from triggers and/or answers into lowercase characters.
• Uppercase: converts the raw text from triggers and/or answers into uppercase characters.
Some of them can also be applied to the data generated by the Producers.
• Stringify Tokens: replaces the trigger and/or answer fields by joining the tokens generated by
some producer with some separator (the default is the space character).
• Stringify Lemmas: does the same as the above but applied to the lemmas generated by a pro-
ducer.
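The Stringify transformers amount to joining the producer-generated pieces back into a single string; a minimal sketch (names are ours, not B-Subtle's API):

```java
import java.util.List;

public class StringifySketch {
    // Join producer-generated tokens (or lemmas) back into one string;
    // the default separator in B-Subtle is a space.
    static String stringify(List<String> tokens, String separator) {
        return String.join(separator, tokens);
    }

    public static void main(String[] args) {
        System.out.println(stringify(List.of("yes", "you", "are", "!"), " "));
        // prints: yes you are !
    }
}
```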
4.2.6 Output Files
The generated corpora can be written to four types of output files: JSON files, XML files, Legacy files13 or Parallel files. Each one of those types is described in this section. Some output types support a customizable parameter that allows the user to enable or disable pretty print (for JSON and XML), thus generating larger or smaller files respectively.
10 opennlp.apache.org
11 Danish, Dutch, English, Finnish, French, German, Hungarian, Irish, Italian, Norwegian, Portuguese, Romanian, Russian, Spanish, Swedish, Turkish.
12 German, English, French, Italian, Danish, Dutch, Spanish, Bulgarian, Russian, Portuguese, Galician, Greek, Chinese, Swahili, Slovak, Slovenian, Latin, Estonian, Polish, Romanian, Czech, Coptic and Old French.
13 Similar to the ones generated in Subtle. Since SSS depends on the format of those files, this eases the process of evaluating the system.
• JSON Files: we provide this output file type because JSON is just text in a standardized format. It
can be useful for an end-user using our corpora files for some application that requires communi-
cation between a browser and a server. This output type supports a pretty print14 option that can
be enabled or disabled.
• XML Files: we provide this output file type because the way we implemented our JSON output
could be adapted without much effort to output XML files. This output type also supports a pretty
print option that can be enabled or disabled.
• Legacy Files: in order to be backward-compatible with previous systems developed at L2F/INESC-ID (e.g. SSS), we decided to include this output type. We also planned our experiments to use SSS with a corpus generated with B-Subtle. These output files are simple text files with the same fields generated by the Subtle tool.
• Parallel Files: this output type consists of generating at least two files: one containing the triggers and another containing the answers. Each line of the triggers file is aligned with the corresponding line of the answers file. If we pick the same line from both files (e.g. the 14th line) we get the trigger and the corresponding answer of an interaction pair. This output type allows generating corpora that can
be fed to seq2seq systems (which will be the case in our experiments for generating end-to-end
CAs). Two additional files can also be generated. We call them validation files (usually required
for experiments with seq2seq frameworks). One is for the validation triggers and the other is for
the validation answers. These files are also aligned. B-Subtle randomly samples a user-defined
quantity of interaction pairs to build these validation files. The interaction pairs inserted in validation
files are not present in corpus files.
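The alignment and random validation sampling described above can be sketched as follows; names and the fixed seed are illustrative, not B-Subtle's actual implementation:

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;

public class ParallelSplitSketch {
    // Aligned parallel data: triggers.get(i) pairs with answers.get(i).
    // A random subset of `validationSize` pairs goes to the validation
    // set; the rest forms the training corpus, so the two never overlap.
    static List<List<String[]>> split(List<String> triggers,
                                      List<String> answers,
                                      int validationSize, long seed) {
        List<Integer> idx = new ArrayList<>();
        for (int i = 0; i < triggers.size(); i++) idx.add(i);
        Collections.shuffle(idx, new Random(seed));
        List<String[]> valid = new ArrayList<>();
        List<String[]> train = new ArrayList<>();
        for (int k = 0; k < idx.size(); k++) {
            int i = idx.get(k);
            String[] pair = {triggers.get(i), answers.get(i)};
            (k < validationSize ? valid : train).add(pair);
        }
        return List.of(train, valid);
    }

    public static void main(String[] args) {
        List<String> t = List.of("Hi?", "How are you?", "Ready?", "Hungry?");
        List<String> a = List.of("Hello.", "Fine.", "Yes.", "No.");
        List<List<String[]>> parts = split(t, a, 1, 42L);
        System.out.println("train=" + parts.get(0).size()
                         + " valid=" + parts.get(1).size());
    }
}
```

Writing each list to its own file, one pair element per line, then yields the aligned triggers/answers files a seq2seq toolkit expects.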
4.2.7 Analytics
B-Subtle offers the possibility of collecting analytical data about the process of creating corpora, as well as analytics about the input/output datasets. This makes it easy to obtain information about the corpora, but also to analyze movie and TV show information from a collection of subtitle files (e.g. it can be interesting to study the evolution of the pace of movies over the years). Three types of analytical data can be generated: global, meta-data and interaction. The collected analytics are currently written to JSON files.
Meta-data Suppose that we want to study the pace of subtitles from 1990 to 2010 for the “Adventure” genre. We can define a Genre Filter for that genre and a Year Filter with the corresponding range, and then activate both the Global Analytics and Meta-data Analytics components.
14Show JSON with indentation in multiple lines, instead of a single line with all the information.
36
We will then be able to get the average time difference between trigger and answer, giving us the information we were interested in.
Each Meta-data Filter can be configured to collect analytical data or not. Our filters fire events that are captured by the Meta-data Analytics component, which aggregates the data received from multiple filters. There are as many types of Meta-data Analytics as there are meta-data filters available.
Interaction Pairs Aggregating Interaction Pair Analytics works similarly to the Meta-data Analytics component, since we also have filters for the interaction pairs. Therefore, there are as many types of interaction pair analytics as there are interaction pair filters available.
Global When generating a new corpus, the end user might want to collect analytical data about the generation process. We call this Global Analytics, and it includes the following information:
• Total input files processed (includes quantity and size);
• Total invalid input files (includes quantity and size);
• Total output files generated per output type (includes quantity and size);
• Average time spent processing each file;
• Total time spent processing all files;
• Average number of interaction pairs extracted per input file;
• Input file with the most interaction pairs;
• Largest input file;
• Largest output file (per output type).
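A minimal sketch of how such global counters could be aggregated while files are processed (field and method names are illustrative, not B-Subtle's API):

```java
public class GlobalAnalyticsSketch {
    // Running totals updated once per processed input file.
    long filesProcessed, invalidFiles, totalMillis, totalPairs;

    // Record one file: whether it was valid, how long it took to
    // process, and how many interaction pairs it yielded.
    void recordFile(boolean valid, long millis, long pairs) {
        filesProcessed++;
        if (!valid) invalidFiles++;
        totalMillis += millis;
        totalPairs += pairs;
    }

    double avgMillisPerFile() {
        return filesProcessed == 0 ? 0 : (double) totalMillis / filesProcessed;
    }

    double avgPairsPerFile() {
        return filesProcessed == 0 ? 0 : (double) totalPairs / filesProcessed;
    }

    public static void main(String[] args) {
        GlobalAnalyticsSketch g = new GlobalAnalyticsSketch();
        g.recordFile(true, 120, 500);
        g.recordFile(true, 80, 300);
        g.recordFile(false, 10, 0);
        System.out.println(g.avgMillisPerFile() + " " + g.avgPairsPerFile());
    }
}
```

Serializing such an accumulator at the end of a run would produce the JSON analytics files mentioned above.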
4.2.8 Configuration Files
YAML Ain’t Markup Language (YAML)15 configuration files are supported, since their signal-to-noise ratio is higher without all the brackets we are used to seeing in XML files. This makes them subjectively easier to read and edit (Listing 4.1).
For the OpenSubtitles2016 task the following fields are currently available to be used:
• The directory where the input files can be found;
15 http://yaml.org/
• A list of Meta-data Collector components;
• A list of Meta-data Filter components;
• A list of Interaction Pair Filter components;
• A list of Producer components;
• A list of Transformer components;
• A list of Analytics components;
• A list of Output components that will generate corpus files (see Section 4.2.6).
Listing 4.1: Example of configuration file for OpenSubtitles2016 Pipeline

---
pipelines:
  - pipelineType: opensubtitles
    inputDirectory: "/input/dataset/path"

    metadataFilters:
      - filterType: country
        value: "Italy"

    interactionFilters:
      - filterType: triggerEndsWith
        value: "?"

    producers:
      - producerType: openNLPTokenizer

    transformers:
      - transformerType: lowercase

    outputs:
      - outputType: json
        outputDir: "/output/corpus/path"
        prettyPrint: true
For a full understanding of configuration files possibilities and what their content represents see
Appendix C for a sample configuration file representation with all parameters briefly explained. We also
provide online documentation for B-Subtle16.
16https://miguelventura.gitbook.io/bsubtle/
5Building Agents
Contents
5.1 Say Something Deep . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41
5.2 Preliminary Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42
5.3 Main Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46
We aim to create a set of agents. To achieve this, we need to accomplish two distinct tasks:
1. Build knowledge bases for our agents;
2. Assign them an architecture for generating or selecting the best answer given a user input.
We can rely on our B-Subtle tool to generate knowledge bases from OpenSubtitles2016 Corpus. For
the second task, we already have a system available for selecting the answer given a user input: SSS
(described in Section 2.2). As we have seen it relies on several measures that are combined in order to
select the answer with the highest score. The weights for each measure were determined empirically.
In [Mendonca et al., 2017] the authors investigated a methodology to learn the best values for each
weight. This was done by feeding user feedback into the system. However, this learning strategy did
not bring significant improvements to the overall system performance. Therefore, we decided to create a brand new system based on seq2seq models: SSD, described in the following section.
With two distinct architectures for answer selection, we will be able to compare them against each
other by using agents with the same knowledge base.
5.1 Say Something Deep
In recent experiments, the seq2seq model has been shown to be very appealing for purely data-
driven approaches. It provides state-of-the-art results when mapping complex structures with variable
length to other structures in many domains such as machine translation, speech recognition, and text
summarization [Sutskever et al., 2014,Cho et al., 2014,Wu et al., 2016].
Instead of translating from one language to another with seq2seq, we aim to “translate” an input
(trigger) to an output (answer). By relying on a generative approach we expect our system to generate
new answers even for interactions it has not seen during the training phase. This way we diverge from
SSS which relies on a retrieval approach that only allows it to return pre-defined answers available in
the knowledge base.
5.1.1 Architecture
SSD receives a user request (the trigger) and chooses an answer. Architecturally speaking it works
just like SSS: it receives an input and applies some procedures to select an output. However, the
process of selecting the best answer is entirely different.
SSD relies on a seq2seq framework commonly used for machine translation tasks: OpenNMT-tf1. This framework is a general-purpose sequence modeling tool built with TensorFlow2. It provides
1 opennmt.net/OpenNMT-tf/
2 tensorflow.org
multiple features that we need for our research:
• It defines neural network models with a simple configuration file;
• Since it is oriented for machine translation tasks, it receives a source/target dataset as an input
when training a model, for instance, when translating from English to French the source dataset is
in English and target dataset is in French. Both datasets must have the same size3. In our case,
the input dataset consists of triggers. The output dataset is made by the corresponding answers;
• By being built with TensorFlow it is ready for production because of TensorFlow Serving4. Once a
model is trained it can be deployed to production environments in order to make predictions with
new data samples. This allows our model to be connected to a user interface for an agent such as
Filipe [Ameixa et al., 2014];
• The training process can also be monitored by using TensorFlow’s Visualization Toolkit: Tensor-
board5 (Figure 5.1).
5.1.2 Neural Network Model
Our model bears a close resemblance to the model described in [Vinyals and Le, 2015]: a seq2seq model with LSTMs. We also applied an attention mechanism [Luong et al., 2015, Bahdanau et al., 2014], which lets the decoder learn to focus on specific parts of the input sequence when decoding, instead of relying only on the hidden vector of the decoder’s LSTM. This model architecture consists of two LSTMs: one for the encoding phase and another for the decoding phase. Before reaching our main experiments this model underwent multiple changes, all of which are described in the remaining part of this chapter.
5.2 Preliminary Experiments
Training a seq2seq model involves setting and adjusting multiple parameters related to the architec-
ture of the underlying neural network. We need to decide the type of our encoders and decoders. We
should also select an appropriate value for the number of encoding and decoding layers. Then a vocab-
ulary size must be set. This goes on and on through an extensive list of configurable parameters. There
is no right or wrong when selecting values for those parameters because they depend on the data that
we will provide. Since determining the best values for those parameters is out of the scope of this thesis
3 Number of entries/sentences.
4 https://www.tensorflow.org/serving/
5 github.com/tensorflow/tensorboard
Figure 5.1: Tensorboard - visualization tool for understanding, debugging, and optimizing the models being trained.
we chose the same values used by [Vinyals and Le, 2015] for the OpenSubtitles corpus as a starting
point: 2 LSTM layers with 4096 hidden units for the Encoder and for the Decoder.
We created three corpora with our B-Subtle tool so that we could adjust the configuration of our SSD system during the first phase of preliminary experiments. We ended up with the following corpora:
1. Corpus with all Portuguese subtitles. The B-Subtle configuration file used to generate the corpus is available at Listing A.1. We ended up with a corpus with almost 95 million interaction pairs;
2. Corpus with all Portuguese subtitles with answer length greater than 25 characters. The B-Subtle configuration file used to generate the corpus is available at Listing A.2. We ended up with a corpus with almost 43 million interaction pairs;
3. Corpus with all Portuguese subtitles with the “Horror” genre and subtitle rating above 5.0. The B-Subtle configuration file used to generate the corpus is available at Listing A.3. We ended up with a corpus with just over 320 thousand interaction pairs.
Due to the large size of the first corpus, we had to adjust the configuration of the model in order to
adapt it to our time and hardware constraints6.
We started with a two-layered LSTM, using AdaGrad with gradient clipping [Duchi et al., 2011]. Since we were dealing with subtitles, in which longer sentences are not so frequent, we made a compromise and reduced the number of LSTM hidden units from the original 4096 units to 2048 units. This decision might influence the efficiency of the memory mechanism of an LSTM. However, the process of training the model became faster7, allowing us to proceed with our experiments in reasonable time. For the vocabulary size, we tried 100 thousand words but got out-of-memory errors during training. To avoid those memory errors on our GPUs8, we reduced the vocabulary to 75 thousand words for the first and second corpus. The corpus with fewer interactions (the third in the previous list) had to be set at 50 thousand words, since it did not contain any more distinct words. We kept the embedding size at 512 units9.
Afterward, we had to decide the training time for our SSD system. The OpenNMT-tf framework
allows us to define a maximum number of steps for our training process. Each step processes a batch10
of interaction pairs from our input corpus. We ended up defining three training marks for our first phase
of preliminary experiments: 100 thousand, 250 thousand and 500 thousand steps11.
6 INESC-ID provides GPUs to its community as a shared resource any member can use responsibly.
7 We almost tripled the steps (each step corresponds to processing one batch of input sequences) per second of the training process.
8 We trained our models on GeForce GTX 1080 Ti, GeForce GTX TITAN X and Tesla K20Xm GPUs, as long as they were available.
9 Default value for the medium size model provided by OpenNMT-tf.
10 We used the default value for the batch size: 64.
11 Selecting a number of epochs would be a better approach (a bigger corpus requires more steps to complete an epoch than a smaller one), but due to time constraints we were not able to test a full epoch for the first and second corpus at that time. Also, we were only doing our first phase of preliminary experiments to get a sense of what to expect from our SSD system.
To evaluate the convergence of the training, we set our validation set size to ten percent of the original corpus (e.g. for the first corpus, which has about 95 million interactions, the validation set has 9.5 million interactions). In later experiments, and after some research, we found that in a big data context this validation set could be much smaller.
We manually translated to Portuguese the 200 questions used in [Vinyals and Le, 2015], so that we could make a subjective analysis of the results after training our models with the generated corpora at each of the training marks previously described. We also tested with a set of 161 user questions in Portuguese posed to Filipe [Ameixa et al., 2014].
We came up with the following observations:
• We detected a considerable amount of “<unk>” symbols in the answers given by the model trained
with the first corpus. This is related to the size of the vocabulary;
• We did not manage to finish the third training mark (500 thousand steps) for the first corpus since
the model training was only processing about 2 steps per second12. It would take about three days
to reach 500 thousand steps;
• Our first model provided very simple and short answers. These three were undoubtedly the most frequent: “sim”/(yes), “não”/(no) and “não sei”/(I don't know);
• Our second model gave longer answers. Most of them were questions like: “O que é que estás a fazer?”/(What are you doing?) and “O que é que isso quer dizer?”/(What does that mean?). Even though it produced some valid answers (e.g. question: “Queres ir para outro lugar?”/(Do you want to go somewhere else?), answer: “Não quero ir para outro lugar.”/(I do not want to go anywhere else.)), a great part of them did not make sense (e.g. question: “O que fazes nos tempos livres?”/(What do you do in your free time?), answer: “Não quero que te aconteça nada.”/(I do not want anything to happen to you.));
• Our third model gave responses with more variability. Even so, most of them are not related to the corresponding question (e.g. question: “O que fazes nos tempos livres?”/(What do you do in your free time?), answer: “Não há nada de errado durante a manhã.”/(There is nothing wrong in the morning.)). Also, the corpus used to train this model was taken from Horror movies; however, none of its answers seemed to convey any sentiment of fear;
• The validation size we set was too big and slowed down our training process.
All the results obtained during our preliminary experiments are available in a spreadsheet13.
12 On a GeForce GTX 1080 Ti with a compute capability of 6.1 (https://developer.nvidia.com/cuda-gpus)
13 https://goo.gl/4tXu3v
5.3 Main Experiments
In order to conduct our main experiments, we tried to fix some of the issues we found during the
preliminary experiments phase.
In order to partially solve the problem with the high volume of “<unk>” symbols, we decided to normalize all of our corpora by applying a tokenization step and a transformation step that converts all text to lowercase. We also reduced the embedding units of our SSD system from 512 to 256. This way we were able to increase the size of the vocabulary to 100 thousand words.
In the preliminary experiments, the training process was too slow. To increase performance (while potentially degrading the results), we decided to reduce the number of LSTM hidden units from 2048 to 102414. The performance of the training process with SSD improved by a significant amount15.
We generated three new corpora. After defining three training marks at 100 thousand, 200 thousand and 300 thousand steps, we found that they could give interesting results16 if trained for a couple of epochs17.
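The relation between steps, batch size and epochs can be sketched with a quick calculation, using the default batch size of 64 and the roughly 95 million interaction pairs of the full Portuguese corpus; the class is illustrative only:

```java
public class EpochMathSketch {
    // One training step consumes one batch of interaction pairs, so an
    // epoch is complete after ceil(corpusSize / batchSize) steps.
    static long stepsPerEpoch(long corpusSize, long batchSize) {
        return (corpusSize + batchSize - 1) / batchSize;
    }

    public static void main(String[] args) {
        // ~95 million pairs at 64 pairs per step:
        System.out.println(stepsPerEpoch(95_000_000L, 64L));
        // prints 1484375
    }
}
```

At this rate, even the 300 thousand step mark processes only about 19.2 million pairs, a fraction of an epoch on the largest corpus, which is why full epochs are only feasible on the smaller corpora.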
We will now describe the setup for our main experiments. We will start by showing how we generated
corpora in Section 5.3.1 and we will show a detailed description of the configuration of our answer
selection systems in Section 5.3.2 and in Section 5.3.3.
5.3.1 Corpora
We generated three new corpora using our B-Subtle tool. We also used the Subtle Corpus (Section 2.1) so that we could compare our answer selection systems.
We started by generating a brand-new corpus of Portuguese interactions from all subtitles provided
by OpenSubtitles2016. This is an important contribution since it provides a much larger corpus when
compared with the Portuguese version of Subtle corpus:
• Corpus A: Corpus with all Portuguese subtitles available in OpenSubtitles2016. The following B-
Subtle components were used: an Open NLP Tokenizer Producer and a Lowercase Transformer.
It outputs Parallel files for the SSD system and Legacy Files for the SSS system. We ended up
with a corpus with almost 95 million interaction pairs. The validation size was set to a fixed value
of 2500 interaction pairs. Check the B-Subtle configuration file in listing 5.1.
14 Since we are dealing with corpora from subtitles, most of the sentences are short, so reducing the number of hidden units of our LSTMs might not have a great impact on final results.
15 From about 2 steps per second to about 8.5 steps per second on a GeForce GTX 1080 Ti.
16 Spreadsheet with results for those three training marks: https://tinyurl.com/y8kwx7lb
17 The final result of multiplying the number of steps by the batch size should be equal to the corpus size for an epoch to be complete.
Listing 5.1: B-Subtle’s configuration file to generate Corpus A.

---
pipelines:
  - pipelineType: opensubtitles
    inputDirectory: "/OpenSubtitles2016/raw/pt"
    producers:
      - producerType: openNLPTokenizer
        modifyInPlace: true
    transformers:
      - transformerType: lowercase
    outputs:
      - outputType: parallel
        outputDir: "/BSubtleOutput/OpenSubtitles2016/corpusA/"
        validationSize: 2500
      - outputType: legacy
        outputDir: "/BSubtleOutput/OpenSubtitles2016/corpusA/"
Having in mind the dimension of Corpus A, we decided to create a smaller corpus18 by filtering out the subtitle files with lower ratings:
• Corpus B: Corpus with all Portuguese subtitles available in OpenSubtitles2016 that have a subtitle rating equal to or above 5.019. The following B-Subtle components were used: a Subtitle Rating Filter, an Open NLP Tokenizer Producer and a Lowercase Transformer. It outputs Parallel files for the SSD system. We ended up with a corpus with almost 4.5 million interaction pairs. The validation
size was set to a fixed value of 2500 interaction pairs. Check the B-Subtle configuration file in
listing 5.2.
Listing 5.2: B-Subtle’s configuration file to generate Corpus B.

---
pipelines:
  - pipelineType: opensubtitles
    inputDirectory: "/OpenSubtitles2016/raw/pt"
    metadataFilters:
      - filterType: subtitleRatingMin
        value: 5.0
    producers:
      - producerType: openNLPTokenizer
        modifyInPlace: true
    transformers:
      - transformerType: lowercase
    outputs:
      - outputType: parallel
        outputDir: "/BSubtleOutput/OpenSubtitles2016/corpusB/"
        validationSize: 2500
Since the previously generated corpora contain interactions extracted from any sequence of sentences, we decided to limit our triggers to questions by keeping only interaction pairs whose trigger ends with a question mark:
• Corpus C: Corpus with all Portuguese subtitles available in OpenSubtitles2016 that have a subtitle rating equal to or above 5.0 and whose triggers end with a question mark. The following B-Subtle components were used: a Subtitle Rating Filter, an Interaction Pair Filter, an Open NLP Tokenizer
18 It allows us to train models with SSD across more epochs, thus allowing us to further research the capabilities of our neural model for answer selection.
19 The OpenSubtitles subtitle rating scale goes from a minimum of 0 to a maximum of 10.
Producer and a Lowercase Transformer. It outputs Parallel files for the SSD system. We ended
up with a corpus of almost 786 thousand interaction pairs. The validation size was set to a fixed
value of 2,500 interaction pairs. Check the B-Subtle configuration file in Listing 5.3.
Listing 5.3: B-Subtle's configuration file to generate Corpus C.

---
pipelines:
- pipelineType: opensubtitles
  inputDirectory: "/OpenSubtitles2016/raw/pt"
  metadataFilters:
  - filterType: subtitleRatingMin
    value: 5.0
  interactionFilters:
  - filterType: triggerEndsWith
    value: "?"
  producers:
  - producerType: openNLPTokenizer
    modifyInPlace: true
  transformers:
  - transformerType: lowercase
  outputs:
  - outputType: parallel
    outputDir: "/BSubtleOutput/OpenSubtitles2016/corpusC/"
    validationSize: 2500
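For intuition, the effect of the `triggerEndsWith` interaction filter can be sketched in plain Python (an illustrative re-implementation, not B-Subtle's actual code; function and variable names are ours):

```python
def is_question_trigger(pair):
    """Keep only interaction pairs whose trigger ends with '?'."""
    trigger, _answer = pair
    return trigger.rstrip().endswith("?")

pairs = [
    ("queres ir à praia ?", "hoje não ."),
    ("vamos embora .", "está bem ."),
]
# keeps only the first pair, whose trigger is a question
question_pairs = [p for p in pairs if is_question_trigger(p)]
```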
5.3.2 Say Something Deep Setup
The preliminary experiments, together with further research, led us to the following SSD configura-
tion:
• Encoder type: Bidirectional RNN Encoder with 2 layers, each one using LSTM as a memory cell
type with 512 hidden units;
• Decoder type: Attentional RNN Decoder with 2 layers, each one using LSTM as a memory cell
type with 512 hidden units;
• Vocabulary Size: 100 thousand words;
• Word Embedding Size: 256;
• Training time: for Corpus A we trained for 6 epochs due to time restrictions20; for the other
corpora we trained until the normalized loss stabilized21.
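As a rough sanity check on model size, the per-layer weight counts implied by these hyperparameters can be computed by hand. The sketch below is back-of-envelope arithmetic, not OpenNMT-tf's internals: it assumes one bias vector per gate and that the bidirectional outputs are concatenated before feeding the second layer.

```python
def lstm_params(input_dim, hidden_units):
    # 4 gates, each with an (input_dim + hidden_units) x hidden_units weight
    # matrix and a bias vector of size hidden_units
    return 4 * ((input_dim + hidden_units) * hidden_units + hidden_units)

embedding = 100_000 * 256                    # vocabulary size x embedding size
enc_layer1 = 2 * lstm_params(256, 512)       # bidirectional: forward + backward
enc_layer2 = 2 * lstm_params(2 * 512, 512)   # layer 2 consumes concatenated states
```

Under these assumptions, the 25.6-million-parameter embedding table alone dwarfs each recurrent layer (roughly 3.1 and 6.3 million weights for the two encoder layers).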
We chose a Bidirectional Encoder for our main experiments because it does a better job at preserving
the input context, and it is used in several seq2seq models, such as [Wu et al., 2016]. Since introducing
this type of encoder decreased the speed of the training process, we reduced the number of hidden
units in each LSTM layer of both the encoder and the decoder from 1024 to 512.
20 One epoch takes almost a day and a half using two GTX 1080 Ti GPUs.
21 That is, when its value stopped decreasing.
We applied the default AdaGrad optimizer22 provided by the OpenNMT-tf framework.
For each corpus, we adjusted the sample buffer size and the number of training steps so that all
models were trained for whole epochs.
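The step adjustment amounts to simple arithmetic, sketched below. The batch size used here is a hypothetical value for illustration; the thesis does not report the one actually used.

```python
import math

def steps_for_full_epochs(num_pairs, batch_size, epochs):
    # one training step consumes one batch; ceil so the last partial batch counts
    return epochs * math.ceil(num_pairs / batch_size)

# e.g. Corpus B (~4.5 million pairs) for 15 epochs, assuming a batch size of 64
steps_for_full_epochs(4_500_000, 64, 15)
```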
To reduce the occurrence of sub-optimal answers given by our models, we ran experiments com-
paring a greedy search approach with a beam search approach when decoding the output. After some
subjective analysis, we chose beam search with a beam width of 2 and a length penalty of 1 (neutral).
With this configuration, our models seemed to return better overall results than the alternatives (par-
ticularly the greedy approach).
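A minimal sketch of the decoding strategies we compared is shown below (width-1 search is the greedy baseline). The score normalization is a simplification of the length penalty used in practice, and the toy model is constructed purely to illustrate why a wider beam can help.

```python
import math

def beam_search(step_fn, start, eos, beam_width=2, max_len=10, alpha=1.0):
    """step_fn(seq) -> {token: log_prob}; returns the highest-scoring sequence,
    normalizing scores by len(seq) ** alpha (a simplified length penalty)."""
    beams = [([start], 0.0)]
    finished = []
    for _ in range(max_len):
        candidates = []
        for seq, score in beams:
            for token, logp in step_fn(seq).items():
                candidates.append((seq + [token], score + logp))
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = []
        for seq, score in candidates:
            if len(beams) == beam_width:
                break
            if seq[-1] == eos:
                finished.append((seq, score / len(seq) ** alpha))
            else:
                beams.append((seq, score))
        if not beams:
            break
    if not finished:  # ran out of steps before emitting the end token
        finished = [(s, sc / len(s) ** alpha) for s, sc in beams]
    return max(finished, key=lambda c: c[1])[0]

# Toy model: greedy decoding commits to the locally likely prefix "a" and
# misses the globally better answer that starts with "b".
def toy_step(seq):
    table = {
        "<s>": {"a": 0.6, "b": 0.4},
        "a": {"<eos>": 0.3, "c": 0.7},
        "b": {"<eos>": 0.9, "c": 0.1},
        "c": {"<eos>": 0.05},
    }
    return {t: math.log(p) for t, p in table[seq[-1]].items()}

beam_search(toy_step, "<s>", "<eos>", beam_width=2)  # ['<s>', 'b', '<eos>']
```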
5.3.3 Say Something Smart Setup
The Subtle Corpus had already been indexed by SSS in the past, so we reused those indexes for
our experiments. However, we needed to index Corpus A by following the instructions available in the
readme file provided by the tool.
5.3.4 Agents
In this section, we present the agents we created, indicating their knowledge bases and answer
selection mechanisms.
• Agent Alpha: its knowledge base comes from Corpus A and its answer selection mechanism is
the SSD system with a model trained during 6 epochs;
• Agent Beta: its knowledge base comes from Corpus B and its answer selection mechanism is
the SSD system with a model trained during 15 epochs;
• Agent Charlie: its knowledge base comes from Corpus C and its answer selection mechanism is
the SSD system with a model trained during 15 epochs;
• Agent Delta: its knowledge base comes from Corpus A and its answer selection mechanism is
the SSS system;
• Agent Echo: its knowledge base comes from the Subtle Corpus and its answer selection mechanism
is the SSS system.

22 Adam optimizer with gradient clipping and decay.
6 Evaluation
Contents
6.1 How to Evaluate our Agents? . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53
6.2 Human Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53
6.3 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 55
6.4 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 58
In this chapter, we present the procedures and results of evaluating the 5 agents created in our
experiments. We start by discussing, in Section 6.1, which evaluation methods fit the agents
("chatbots") we created for open-domain conversations. Then, we describe the method we ended up
choosing in Section 6.2. Finally, in Section 6.3, we present and discuss the results obtained.
6.1 How to Evaluate our Agents?
One of the main challenges in evaluating open-domain dialogue agents ("chatbots") is the lack
of a good mechanism to measure their performance. The absence of an explicit objective for open-
domain conversations makes evaluating dialogue systems a challenging research problem. Research
on answer generation has led to the adoption of metrics from machine translation to automate the
evaluation phase [Ritter et al., 2011, Serban et al., 2016, Li et al., 2015, Wen et al., 2015]. For instance,
BLEU [Papineni et al., 2002] is a standard for evaluating machine translation models. However, relying
on this metric to evaluate our agents could lead to an erroneous analysis of the answers they give.
BLEU assumes that valid answers have significant word overlap with the ground truth answers.
In an open-domain conversation, there is significant diversity in the space of valid answers to a
particular trigger. For example, suppose we have an interaction pair whose trigger is "Do you want to
go to the beach?", whose (ground truth) answer is "Not today.", and one of our agents responded with
a plausible "Of course I want!". The BLEU score for the agent's response would be zero because there
are no words in common between its answer and the ground truth answer. Additionally, in our experiments
there is no guarantee that the interaction pairs present in our corpora contain ground truth answers that
actually respond/react to the trigger they are associated with1. A recent study [Liu et al., 2016] provides
evidence against existing metrics for evaluating dialogue response generation systems. It shows
that there is a very weak, and sometimes non-existent, correlation between automatic metrics
and human judgment. With this in mind, we decided to rely on human evaluation by conducting a
survey.
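The failure mode described above is easy to reproduce with a stripped-down unigram precision (BLEU-1 without the brevity penalty; real BLEU also combines higher-order n-grams):

```python
from collections import Counter

def unigram_precision(candidate, reference):
    """Clipped unigram precision, the first ingredient of BLEU."""
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    clipped = sum(min(count, ref[word]) for word, count in cand.items())
    return clipped / max(1, sum(cand.values()))

# A plausible answer with zero word overlap with the ground truth scores 0.
unigram_precision("of course i want !", "not today .")  # 0.0
```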
6.2 Human Evaluation
We aimed to compare our agents against each other in order to examine the influence of the corpora
we generated, but also to compare the architectures for generating (SSD) or selecting (SSS) the best
answer. To evaluate the performance of our agents, we asked volunteers to fill in a survey consisting
of a list of questions. For each question, the volunteer had to classify the answer given by each agent
(Figure 6.1) in the following way:
1 The answer may correspond to a change of scene in a movie, or may come from a completely different speaker.
Figure 6.1: Example of a question in our survey (in Portuguese). Each line corresponds to an answer given by one of our agents that should be classified as valid, plausible or invalid.
• Valid: when the answer was appropriate to the subject of the question (e.g. “How old are you?”
and the answer given was “I am 24 years old.”);
• Plausible: when the answer could be appropriate in a given context (e.g. “Where do you live?”
and the answer given was “We do not have time for that, keep running.”) or referred to details that
may belong to the person who performed the interaction (e.g. the agents could return answers
containing names such as “Ok John.” so that would be a plausible answer if the agent is talking
with someone called John);
• Invalid: when the answer was not adequate or contained grammatical errors that made it difficult to
understand (e.g. "How old are you?" and the answer given was "Yes.").
The survey consisted of a list of 100 questions (see Appendix B) randomly picked from a set of
361 questions: 200 questions manually translated to Portuguese from [Vinyals and Le, 2015] and 161
questions asked to Filipe [Ameixa et al., 2014] that were already available in Portuguese. Although
we refer to them as questions, some of them are comments (e.g. "Life is hard..."). This list of
questions underwent the same pre-processing as the corpora we generated with B-Subtle, being converted
to lowercase and tokenized2. When collecting the answers given by the agents to our list of questions, we
reverted the tokenization process and capitalized the first letter so that the answers appeared more
natural to volunteers (e.g. "bye , see you later ." was converted to "Bye, see you later."). We ensured
that each answer entry only appeared once for the same question (e.g. when two agents answered "I
am fine." to the question "How are you?", only one entry with "I am fine." would appear). Taking this
into account, the number of possible entries per question could vary between 1 (when all agents
answered equally) and 5 (when all agents answered differently). Since some volunteers might not be
able to fill in the entire survey, we decided to split it into two parts, each with 50 questions. The first
part was mandatory and the second part was optional.

2 Except for Agent Echo: the indexed version of the Subtle Corpus used by this agent did not go through our pre-processing pipeline.
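The post-processing described above (reverting tokenization and capitalizing the first letter) can be sketched as follows. This is an illustrative approximation; the detokenizer actually used may handle more cases.

```python
import re

def prettify(answer):
    # remove the space that tokenization inserted before punctuation marks
    answer = re.sub(r"\s+([,.!?;:])", r"\1", answer)
    # capitalize the first letter so the answer looks natural to volunteers
    return answer[:1].upper() + answer[1:]

prettify("bye , see you later .")  # "Bye, see you later."
```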
6.3 Results
A total of 44 volunteers participated in the survey. Of these, 32 filled in both the first and the
second part. Since the survey had 100 questions, each with 5 answers (given by our agents) to be
evaluated, we were able to collect 3,800 human evaluations per agent3.
6.3.1 Overview
Table 6.1 presents the percentage of valid, plausible and invalid classifications given by the volun-
teers to the answers of our agents.
Table 6.1: Percentage of valid, plausible and invalid classifications given to the answers of our 5 agents (Alpha, Beta, Charlie, Delta, and Echo) by the volunteers who filled out the survey.

                  Agent Alpha  Agent Beta  Agent Charlie  Agent Delta  Agent Echo
Valid Answer           72.97%      57.97%         58.53%       22.58%      26.08%
Plausible Answer       18.32%      19.39%         21.47%       16.58%      20.95%
Invalid Answer          8.71%      22.63%         20.00%       60.84%      52.97%
The first observation is that the answers given by the agents which use SSD (Alpha, Beta, and
Charlie) were clearly preferred by our survey participants over those of the agents which use SSS
(Delta and Echo).

Agent Alpha achieved impressive results, with 72.97% of its answers classified as valid and
18.32% classified as plausible. Only 8.71% of its answers were labeled as invalid. It also leads the
pack by a wide margin, with 14.44 percentage points more valid answers than the second-best, Agent
Charlie. Interestingly, these results were obtained by the agent whose model was trained for the least time.
Although Charlie is the second-best agent among the five, Agent Beta showed similar behavior, with
comparable amounts of valid, plausible and invalid answers. Agent Charlie is only slightly better than
Agent Beta, with a smaller share of invalid answers (−2.63 percentage points).
At the bottom of the pack, we have Agent Delta and Agent Echo. Agent Delta had the worst results,
with 60.84% of its answers classified as invalid. The difference between its valid and plausible
answers is small, and the sum of the two (39.16%) does not even reach the percentage of valid
3 (32 × 100) + (12 × 50) = 3800
answers given by the third-best Agent Beta (57.97%). Surprisingly, Agent Echo, which uses a smaller
and outdated corpus of Portuguese subtitles (Subtle Corpus), was able to achieve better results than
Agent Delta, which uses a brand new corpus created during our experiments (Corpus A).
6.3.2 Short and Simple Answers
While analyzing the answers given by our agents relying on SSD (Agents Alpha, Beta, and Charlie),
we noticed a high amount of short and simple responses. "Sim." (Yes.), "Não." (No.) and "Não sei." (I
do not know.) were the most frequent, as we can see in Table 6.2.
Table 6.2: Amount of "Sim." (Yes.), "Não." (No.) and "Não sei." (I do not know.) answers returned by Agents Alpha, Beta and Charlie for the 100 questions included in the conducted survey.

                             Agent Alpha  Agent Beta  Agent Charlie
"Sim." (Yes.)                     19.00%      21.00%         18.00%
"Não." (No.)                      17.00%      10.00%         10.00%
"Não sei." (I do not know.)       35.00%      24.00%         29.00%
Total                             71.00%      55.00%         57.00%
At this point, we were intrigued by the possibility of an almost direct correlation between the
percentage of short and simple answers returned by our agents and the percentage of answers marked
as valid by our survey participants. As shown in Table 6.1, Agent Alpha had 72.97% of its answers
considered valid, while 71.00% of its answers (Table 6.2) were either "Sim." (Yes.), "Não." (No.) or
"Não sei." (I do not know.). Similar numbers hold for Agents Beta and Charlie.
After further investigation, we found that 23.06% of Agent Alpha's valid answers correspond to
responses other than the short and simple answers described above (see Figure 6.2). We recognized
a similar pattern for the other two agents: Agent Beta got 24.86% and Agent Charlie got 28.49%.
Although Agent Charlie appears to return more diverse (and perhaps more interesting) answers
than its competitors for the valid label, the converse happens for the plausible label (see Figure 6.3).
Although Agents Alpha, Beta and Charlie gave a majority of short and simple answers, most of
those answers were considered valid by our survey participants (Figure 6.4 shows that the short
and simple answers were labeled as invalid very few times in comparison with all the other invalid
answers). The high amount of short and simple answers can also be justified by the content of the
randomly chosen questions: some of them do call for a "Sim." (Yes.), "Não." (No.) or "Não sei." (I do
not know.) as an answer (e.g. "Gostas do teu trabalho?" (Do you like your work?) can be satisfied by
a "Não." (No.)).
The "Não sei." (I do not know.) answer seems to be the biggest issue we identified in our SSD
agents, as that type of answer can make a chatbot boring in a conversational context. Even so, the
answer "Não sei." (I do not know.) was considered valid or plausible a significant number of times by
the volunteers who participated in our survey (see Table 6.3).
Figure 6.2: Amount of "Sim." (Yes.), "Não." (No.) or "Não sei." (I do not know.) answers considered valid among all answers labeled as valid by the survey participants.

Figure 6.3: Amount of "Sim." (Yes.), "Não." (No.) or "Não sei." (I do not know.) answers considered plausible among all answers labeled as plausible by the survey participants.

Figure 6.4: Amount of "Sim." (Yes.), "Não." (No.) or "Não sei." (I do not know.) answers considered invalid among all answers labeled as invalid by the survey participants.
Table 6.3: Amount of "Sim." (Yes.), "Não." (No.) or "Não sei." (I do not know.) answers evaluated independently for each answer label present in our survey (valid, plausible and invalid), for the agents relying on the new SSD system for generating answers.

Answer is "Sim." (Yes.)
                  Agent Alpha  Agent Beta  Agent Charlie
Valid Answer           91.22%      87.81%         93.13%
Plausible Answer        3.51%       2.74%          4.68%
Invalid Answer          5.27%       9.45%          2.19%

Answer is "Não." (No.)
                  Agent Alpha  Agent Beta  Agent Charlie
Valid Answer           93.51%      98.60%         97.55%
Plausible Answer        1.30%       1.40%          1.90%
Invalid Answer          5.19%       0.00%          0.54%

Answer is "Não sei." (I do not know.)
                  Agent Alpha  Agent Beta  Agent Charlie
Valid Answer           61.25%      55.34%         49.27%
Plausible Answer       32.55%      35.36%         37.23%
Invalid Answer          6.19%       9.29%         13.50%
6.4 Summary
In this chapter, we presented the results of evaluating our 5 CAs. We started by discussing why we
chose to conduct a survey to evaluate them. Then, we provided details about how we created the
survey and the guidelines given to our human evaluators.

After presenting and analyzing the results of the survey, we showed that CAs based on a generative
approach achieved a much higher proportion of plausible and valid answers than CAs based on a
retrieval approach. However, we also found that the generative agents returned a much higher amount
of short and simple answers than the retrieval agents.
7 Conclusions
Contents
7.1 Contributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61
7.2 Future Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 62
7.1 Contributions
This document addressed both a hands-on project and a scientific research assignment. By analyz-
ing the limitations of the Subtle Corpus and the Subtle Tool created by the L2F/INESC-ID group, we
aimed to build a brand new tool for generating corpora from movie and TV show subtitles. We called it B-Subtle.
Furthermore, we scrutinized the system behind Filipe (a CA also built by L2F/INESC-ID group), which
relies on SSS for retrieving answers from Subtle Corpus after receiving a user input. After finding its
limitations, we came across two possibilities: improve the existing system or develop a competing sys-
tem by following a different approach based on answer generation. We opted for the latter by creating
SSD, a system that uses seq2seq learning with neural networks for building models with the capability
to generate answers given a user input.
We presented B-Subtle as a powerful tool for generating new corpora of interaction pairs from subti-
tle files. It was developed with flexibility and expandability in mind. Therefore, a modular approach was
chosen, allowing the tool to be expanded in the future with additional features just by adding new mod-
ules/components. At present, we support OpenSubtitles2016 Corpus files as input. It was the biggest
and the most appropriate corpus of subtitles among all the corpora we found during our research. The
incorporation of meta-data, together with the possibility of processing the collected data, allowed us to
offer an advanced and wide range of filtering options. This enabled the generation of highly customized
corpora of interaction pairs. Since end-users of B-Subtle might need to integrate corpora with different
systems, we offered multiple output formats. This was very useful for our experiments, since we needed
a corpus that was compatible with both SSS and SSD. Having one tool capable
of generating the same corpora in two distinct formats was a great advantage. Besides creating new
corpora of interaction pairs from subtitles, we also added the option for collecting analytical data about
them with B-Subtle.
We also provided an overview of the present state-of-the-art in end-to-end generative dialogue sys-
tems in our related work section. We presented various types of data-driven dialogue systems relying
on neural networks for seq2seq learning. Since some of them even used movies and TV shows subtitles
as input data, we followed some of their guidelines as a starting point. Then, we began creating our own
conversational models.
In our main experiments we ended up creating 5 CAs. Three of them were created with SSD, thus
being able to generate answers. The other two were created with SSS, thus using a retrieval strategy
for selecting answers. We relied on our B-Subtle tool to create the knowledge bases of our agents,
by creating multiple variations of Portuguese corpora. To evaluate how our CAs compared to each
other, we opted for human evaluation by conducting a survey. The results showed that the answers
given by generative agents created with SSD were clearly preferred over those given by retrieval agents
created with SSS.
To sum up, we were able to:
• Provide B-Subtle as a complete tool for creating fully customizable corpora of interaction pairs
from movies and TV shows subtitles. The architecture of this tool allows for further expansion of
its features by future developers;
• Create 3 CAs that (according to human evaluation through a survey) gave a much larger quantity
of appropriate answers than the 2 CAs built with the existing SSS developed by the L2F/INESC-ID group;

• Hand over three new corpora containing interaction pairs collected from Portuguese subtitle files
up to the year 2016.
7.2 Future Work
We are confident that our research will serve as a base for future studies on creating CAs using
seq2seq neural networks. Although the results of our survey reveal that agents created with SSD have a
much higher percentage of valid answers than the ones created with SSS, we found ourselves observing
a problem already known in the literature: the generation of safe, short and simple responses. In fact,
those types of responses seem to be appropriate most of the time (e.g. for "yes" and "no" answers),
but after analyzing the answers given by the SSD agents we found that almost one-third of their answers
were "Não sei." (I do not know.). The human evaluation results show that the survey participants had
a somewhat more divided opinion when classifying that type of answer as valid, plausible or invalid. The
problem lies in the simplicity of the seq2seq model: the objective function being optimized
does not capture the actual objective of a conversation with a human. Recent studies have proposed
alternatives to this objective function. [Li et al., 2015] suggest an objective function that avoids favoring
generically high-probability responses by trading off the likelihood of the response given the input
against the likelihood of the input given the response. However, they applied their new objective function a posteriori to
an N-best list returned by the seq2seq. A similar experiment can be made by returning N-best answers
of our models generated with SSD and then applying the similarity metrics that are part of SSS. In [Shao
et al., 2017] they changed the behavior of the beam search algorithm and introduced stochastic sampling
operations. As we can see, there are multiple approaches that can be taken to reduce the number of
commonplace answers.
We are also aware that our agents built with SSD lack a way to ensure consistency in a conversational
context, since they rely on purely unsupervised models. In our survey, we asked the participants to
evaluate each question-answer pair independently, so further evaluation should be carried out in a fully
conversational context.
Training our models with SSD consisted of assigning values to a considerable number of parameters
related to the configuration of the neural networks. Each of these values can affect the results in
different ways. We followed the same configuration found in similar experiments and then tweaked
some of the values to fit our setting. Still, we are not sure which combination of settings might yield
better results for the corpora we used as input. To know that, we would need a procedure to learn how
to train the SSD neural networks. It is a very challenging problem that can be studied with Automatic
Machine Learning (AutoML) techniques.
Regarding B-Subtle, it was built with expandability in mind, so we expect additional components to
be added by future developers, as well as support for other types of input data such as the Cornell
Movie-Dialogs Corpus. Currently, the most evident limitation of B-Subtle is the use of The Movie DB
Meta-Data Collector, because it makes HTTP requests to get data and a rate limiting policy is applied
(40 requests every 10 seconds per Internet Protocol (IP) address). However, at the time of writing, we
are unaware of a better alternative. It is important to note that in our preliminary and main experiments
we did not use this Meta-data Collector, since the currently supported input data (OpenSubtitles2016
Corpus files) already include meta-data.
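A client-side limiter respecting such a quota could be sketched as below. This is an illustrative design under the documented 40-requests-per-10-seconds policy; the class and method names are ours, not part of B-Subtle.

```python
import collections
import time

class SlidingWindowLimiter:
    """Tracks request times in a sliding window and tells the caller how
    long to wait before the next request is allowed."""

    def __init__(self, max_requests=40, window_seconds=10.0, clock=time.monotonic):
        self.max_requests = max_requests
        self.window = window_seconds
        self.clock = clock
        self.scheduled = collections.deque()  # times at which requests (will) fire

    def acquire(self):
        """Return how many seconds to sleep before issuing the next request."""
        now = self.clock()
        # drop request times that have fallen out of the window
        while self.scheduled and now - self.scheduled[0] >= self.window:
            self.scheduled.popleft()
        if len(self.scheduled) < self.max_requests:
            self.scheduled.append(now)
            return 0.0
        delay = self.window - (now - self.scheduled[0])
        self.scheduled.append(now + delay)
        return delay
```

A caller would simply `time.sleep(limiter.acquire())` before each HTTP request to the meta-data API.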
Bibliography
[Ameixa et al., 2014] Ameixa, D., Coheur, L., Fialho, P., and Quaresma, P. (2014). Luke, i am your
father: dealing with out-of-domain requests by using movies subtitles. In International Conference on
Intelligent Virtual Agents, pages 13–21. Springer.
[Ameixa et al., 2013] Ameixa, D. and Coheur, L. (2013). From subtitles to human interactions: intro-
ducing the subtle corpus. Technical report, INESC-ID.
[Bahdanau et al., 2014] Bahdanau, D., Cho, K., and Bengio, Y. (2014). Neural machine translation by
jointly learning to align and translate. arXiv preprint arXiv:1409.0473.
[Banchs, 2012] Banchs, R. E. (2012). Movie-dic: a movie dialogue corpus for research and develop-
ment. In Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics:
Short Papers-Volume 2, pages 203–207. Association for Computational Linguistics.
[Bengio, 2013] Bengio, Y. (2013). Deep learning of representations: Looking forward. In International
Conference on Statistical Language and Speech Processing, pages 1–37. Springer.
[Cho et al., 2014] Cho, K., Van Merrienboer, B., Gulcehre, C., Bahdanau, D., Bougares, F., Schwenk,
H., and Bengio, Y. (2014). Learning phrase representations using rnn encoder-decoder for statistical
machine translation. arXiv preprint arXiv:1406.1078.
[Danescu-Niculescu-Mizil and Lee, 2011] Danescu-Niculescu-Mizil, C. and Lee, L. (2011). Chameleons
in imagined conversations: A new approach to understanding coordination of linguistic style in dialogs.
In Proceedings of the 2nd Workshop on Cognitive Modeling and Computational Linguistics, pages
76–87. Association for Computational Linguistics.
[Duchi et al., 2011] Duchi, J., Hazan, E., and Singer, Y. (2011). Adaptive subgradient methods for online
learning and stochastic optimization. Journal of Machine Learning Research, 12(Jul):2121–2159.
[Guo et al., 2017] Guo, P., Xiang, Y., Zhang, Y., and Zhan, W. (2017). Snowbot: An empirical study of
building chatbot using seq2seq model with different machine learning framework.
[Hochreiter and Schmidhuber, 1997] Hochreiter, S. and Schmidhuber, J. (1997). Long short-term mem-
ory. Neural computation, 9(8):1735–1780.
[Kalchbrenner and Blunsom, 2013] Kalchbrenner, N. and Blunsom, P. (2013). Recurrent continuous
translation models. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language
Processing, pages 1700–1709.
[Li et al., 2015] Li, J., Galley, M., Brockett, C., Gao, J., and Dolan, B. (2015). A diversity-promoting
objective function for neural conversation models. arXiv preprint arXiv:1510.03055.
[Li and Momoi, 2001] Li, S. and Momoi, K. (2001). A composite approach to language/encoding detec-
tion. In Proceedings of the 19th International Unicode Conference, pages 1–14.
[Lison and Tiedemann, 2016] Lison, P. and Tiedemann, J. (2016). Opensubtitles2016: Extracting large
parallel corpora from movie and tv subtitles. In Proceedings of the 10th International Conference on
Language Resources and Evaluation.
[Liu et al., 2016] Liu, C.-W., Lowe, R., Serban, I. V., Noseworthy, M., Charlin, L., and Pineau, J. (2016).
How not to evaluate your dialogue system: An empirical study of unsupervised evaluation metrics for
dialogue response generation. arXiv preprint arXiv:1603.08023.
[Lowe et al., 2015] Lowe, R., Pow, N., Serban, I., and Pineau, J. (2015). The ubuntu dialogue cor-
pus: A large dataset for research in unstructured multi-turn dialogue systems. arXiv preprint
arXiv:1506.08909.
[Lu et al., 2017] Lu, Y., Keung, P., Zhang, S., Sun, J., and Bhardwaj, V. (2017). A practical approach to
dialogue response generation in closed domains. arXiv preprint arXiv:1703.09439.
[Luong et al., 2015] Luong, M.-T., Pham, H., and Manning, C. D. (2015). Effective approaches to
attention-based neural machine translation. arXiv preprint arXiv:1508.04025.
[Magarreiro et al., 2014] Magarreiro, D., Coheur, L., and Melo, F. S. (2014). Using subtitles to deal with
out-of-domain interactions. In Proceedings of 18th Workshop on the Semantics and Pragmatics of
Dialogue (SemDial), pages 98–106.
[Mendes et al., 2013] Mendes, A. C., Coheur, L., Silva, J., and Rodrigues, H. (2013). Just. ask—a
multi-pronged approach to question answering. International Journal on Artificial Intelligence Tools,
22(01):1250036.
[Mendonca et al., 2017] Mendonca, V., Melo, F. S., Coheur, L., and Sardinha, A. (2017). A conver-
sational agent powered by online learning. In Proceedings of the 16th Conference on Autonomous
Agents and MultiAgent Systems, pages 1637–1639. International Foundation for Autonomous Agents
and Multiagent Systems.
[Papineni et al., 2002] Papineni, K., Roukos, S., Ward, T., and Zhu, W.-J. (2002). Bleu: a method for
automatic evaluation of machine translation. In Proceedings of the 40th annual meeting on association
for computational linguistics, pages 311–318. Association for Computational Linguistics.
[Ritter et al., 2011] Ritter, A., Cherry, C., and Dolan, W. B. (2011). Data-driven response generation in
social media. In Proceedings of the conference on empirical methods in natural language processing,
pages 583–593. Association for Computational Linguistics.
[Serban et al., 2016] Serban, I. V., Sordoni, A., Bengio, Y., Courville, A. C., and Pineau, J. (2016).
Building end-to-end dialogue systems using generative hierarchical neural network models. In AAAI
Conference on Artificial Intelligence, volume 16, pages 3776–3784.
[Shao et al., 2017] Shao, Y., Gouws, S., Britz, D., Goldie, A., Strope, B., and Kurzweil, R. (2017). Gen-
erating high-quality and informative conversation responses with sequence-to-sequence models. In
Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages
2210–2219.
[Sutskever et al., 2014] Sutskever, I., Vinyals, O., and Le, Q. V. (2014). Sequence to sequence learning
with neural networks. In Ghahramani, Z., Welling, M., Cortes, C., Lawrence, N. D., and Weinberger,
K. Q., editors, Advances in Neural Information Processing Systems 27, pages 3104–3112. Curran
Associates, Inc.
[Tiedemann, 2009] Tiedemann, J. (2009). News from opus - a collection of multilingual parallel corpora
with tools and interfaces. In Recent advances in natural language processing, volume 5, pages 237–
248.
[Vinyals and Le, 2015] Vinyals, O. and Le, Q. (2015). A neural conversational model. arXiv preprint
arXiv:1506.05869.
[Wen et al., 2015] Wen, T.-H., Gasic, M., Mrksic, N., Su, P.-H., Vandyke, D., and Young, S. (2015).
Semantically conditioned lstm-based natural language generation for spoken dialogue systems. arXiv
preprint arXiv:1508.01745.
[Wu et al., 2016] Wu, Y., Schuster, M., Chen, Z., Le, Q. V., Norouzi, M., Macherey, W., Krikun, M., Cao,
Y., Gao, Q., Macherey, K., et al. (2016). Google’s neural machine translation system: Bridging the
gap between human and machine translation. arXiv preprint arXiv:1609.08144.
[Yin et al., 2015] Yin, J., Jiang, X., Lu, Z., Shang, L., Li, H., and Li, X. (2015). Neural generative question
answering. arXiv preprint arXiv:1512.01337.
A Preliminary Experiments
Listing A.1: B-Subtle's configuration file to generate a parallel corpus from all Portuguese OpenSubtitles2016 subtitles.
---
pipelines:
- pipelineType: opensubtitles
  batchSize: 2000
  inputDirectory: "/OpenSubtitles2016/raw/pt/"
  outputs:
  - outputType: parallel
    outputDir: "/BSubtleOutput/OpenSubtitles2016/pt/"
Listing A.2: B-Subtle's configuration file to generate a parallel corpus from all Portuguese OpenSubtitles2016 subtitles with answer length above 25 characters.
---
pipelines:
- pipelineType: opensubtitles
  batchSize: 2000
  inputDirectory: "/OpenSubtitles2016/raw/pt_1/"
  interactionFilters:
  - filterType: answerMinLength
    value: 25
  outputs:
  - outputType: parallel
    outputDir: "/BSubtleOutput/OpenSubtitles2016/pt_2/"
Listing A.3: B-Subtle's configuration file to generate a parallel corpus from all Portuguese OpenSubtitles2016 subtitles with Horror as a genre and a subtitle rating above 5.0.
---
pipelines:
  - pipelineType: opensubtitles
    batchSize: 2000
    inputDirectory: "/OpenSubtitles2016/raw/pt_1/"
    metadataFilters:
      - filterType: genre
        value: "Horror"
      - filterType: subtitleRatingMin
        value: 5.0
    outputs:
      - outputType: parallel
        outputDir: "/BSubtleOutput/OpenSubtitles2016/pt_2/"
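Metadata filters such as genre and subtitleRatingMin act conjunctively: a subtitle file is kept only if every filter accepts its metadata record. A hypothetical sketch under assumed field names (not B-Subtle's actual code, and assuming "Min" means a strict lower bound as the caption suggests):

```python
# Hypothetical sketch of conjunctive metadata filtering: keep a subtitle
# only if its genre list contains the target genre AND its subtitle
# rating exceeds the threshold. Field names are assumptions.

def passes(metadata, genre="Horror", subtitle_rating_min=5.0):
    return (genre in metadata.get("genres", [])
            and metadata.get("subtitleRating", 0.0) > subtitle_rating_min)

records = [
    {"genres": ["Horror", "Thriller"], "subtitleRating": 7.5},
    {"genres": ["Comedy"], "subtitleRating": 9.0},   # wrong genre
    {"genres": ["Horror"], "subtitleRating": 4.0},   # rating too low
]
selected = [m for m in records if passes(m)]
```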
B Survey
B.1 Survey description given to volunteers
In the following questionnaire you will find a list of interactions, each consisting of a question/comment and a set of possible answers. Consider a Human-Machine interaction. Put yourself in the position of the person interacting with the machine and rate the answers it gives.

Rate each of the possible (machine-given) answers as follows:

• Valid: the answer is appropriate to the subject of the question/comment (it is in context) - (e.g., "Que idade tens?" answered with "28 anos."; "És engraçado!?" answered with "Tu é que és.")

• Acceptable: the answer may be appropriate in some context, or mentions details that could belong to the person who made the interaction - (e.g., "Onde é que vives?" answered with "Isso agora não importa, temos que ir embora."; "Estás cansado?" answered with "Não Joana.")

• Not valid: the answer is not appropriate or contains grammatical errors that hinder comprehension - (e.g., "Que idade tens?" answered with "Sim.")

The questionnaire is divided into two parts. The first part is mandatory, while the second is optional (depending on your availability). We would nonetheless appreciate your collaboration in both parts.

The estimated time is 7 to 10 minutes per part.

Thank you in advance for your contribution.
Note: some answers may not be consistent in their use of capital letters for names, cities, etc.; however, this should not influence your rating.
B.2 Raw data with the questions and corresponding agent answers used in the survey
Question Agent Alpha Agent Beta Agent Charlie Agent Delta Agent Echo
qual o prato que mais
gostas ?
nao sei . nao sei . nao sei . jalebi . Foi a bela que matou o mon-
stro.
para onde vais ? nao sei . para casa . nao sei . ola . Vou tirar-te as duvidas.
que genero de musica e
que gostas ?
nao sei . gosto de musica . e uma piada . esta bem , esta bem , esta
bem .
E um genero de musica do
mundo.
gostas do teu trabalho ? sim . sim . sim . sou consciencioso e ambi-
cioso .
Tenho a sensacao que
deixamos de comunicar.
o que e que estudas ? trabalho . nao estou a brincar . nao sei . microbiologia . Tu nao queres ficar preso a
esta casa para o resto da tua
vida, pois nao?
ja viajaste para longe de
casa ?
nao . nao . nao , nao , nao . nao , estava em casa . O que queres?
o que e pensas sobre a in-
teligencia artificial ?
o que e isto ? e uma maquina do
tempo .
nao sei . um deus . Uma unica consciencia que
criou toda uma prole de
maquinas.
andas em que faculdade ? nao sei . sim . nao sei . caltech . Estavamos a espera de que?
tu fumas ? nao . nao . nao . fuma , fuma , fuma ! Nao?
qual a tua profissao ? sou um homem de
negocios .
nao sei . escultor , senhor . duro . Um pouco disso e daquilo e
nada de mais.
estas ai sozinho ? sim . sim . sim . oh , querida , que situacao
estranha .
Exijo que o venha ja buscar.
meu , nao percebo nada
do que disseste ...
o que e que disseste
?
nao sei o que dizer . nao sei . quando ? E tambem nao percebo nada
de advocacia.
moras em lisboa ? sim . sim . nao . ha 14 meses atras . O que estavas a fazer no
metro?
segues algum desporto ? nao . nao , nao . claro . e por isso que eu luto . A aeronave aproximando-se
de Cedar Creek, fala o Dr.
Sam Daniels.
jogas jogos de video ? sim . sim . sim . claro . Da-nos a latitude e a longi-
tude.
estudaste ? sim . a serio ? na escola . para que ? Nao, nem por isso.
qual e o proposito do pro-
grama de exploracao es-
pacial ?
nao sei . nao sei . temos de ir embora . charles . Por que nos apenas nao deix-
amos isso para os profission-
ais?
gostas de ficcao cientıfica
?
nao . sim . sim . nao . O ”Morte 1” ressuscitou os
mortos, que se alimentam
de carne humana, Tambem
chama a isso ficcao cientıfica?
onde e que compraste
essa roupa ?
nao sei . na casa de banho . onde e que ela esta ? na clınica da metadona ? Isto e bonito, de onde o tirou?
porque e que estas aqui ? nao sei . porque e que estas
aqui ?
porque e que nao me
disseste ?
hmm ? Estas a brincar, nao estas?
de que tipo de filmes
gostas ?
nao sei . de que ? de que e que estas a
falar ?
esta desaparecido ha oito
horas .
Sim, eu gosto de filmes.
entao e esse tempo ? nao sei . nao sei . nao sei . estavamos ansiosos . E por que nao o veria?
o que e a desonestidade ? o que ? uma investigacao . o que e que esta a
fazer ?
uma desonestidade que
envenena tudo .
Eu nao percebo.
ronald reagan e ... e o meu filho . o que e que se passa
?
eu disse que era um
homem .
ronald reagan? bem. e o vice-presidente ?
onde vais assim vestido ? nao sei . nao sei . nao sei . a uma exposicao de moda . E assim! E assim que isto fun-
ciona.
a vida e difıcil . e difıcil . a vida nao e facil . nao , nao , nao . cresceu na nossa rua . Mas esta vida esta a chegar
ao fim, e considero-me sor-
tudo por a ter vivido.
o que e que vais fazer esta
noite ?
nao sei . nao sei . nao sei . o habitual . O que eu vou fazer hoje a
noite?
quantos filhos tens ? dois . nao sei . dois . galloway . Nenhum.
qual foi a melhor coisa que
ja fizeste a alguem ?
nao sei . nao sei . nao sei . eu nao fujo as minhas re-
sponsabilidades .
E a melhor coisa que farei.
esta bem , sabes porque e
que falamos tanto ?
porque e o que eu
faco .
porque e que nao me
disseste ?
nao . esta aqui , algures no
castelo !
Porque representamos tudo o
que querias ter, mas receias
enfrentar.
onde estudaste ? na universidade . na universidade . no hotel . barnard . Nao, nem por isso.
em que trabalhas ? sou um homem de
negocios .
nao sei . nao sei . trabalha isso . De que tamanho?
qual e o teu trabalho ? o meu trabalho . o meu trabalho . nao sei . trabalha isso . De que tamanho?
quais sao os teus objetivos
para a vida e para o tra-
balho ?
nao sei . nao sei . nao sei . tao triste . Tudo o que vem e ouvem po-
dem aproveitar.
que dia e hoje ? sexta-feira . dia de natal . nao sei . vejo que nao tem mal . O que?
que hobbies tens ? o que e isso ? eu ? o que queres dizer ? galloway . O que e que somos? E isto?
Um tipo de entretenimento?
qual o dia do teu aniver-
sario ?
nao sei . o que e isso ? 8 de maio de 1969 . e feriado . Meu Deus.
que disciplinas tens ? nao sei . as minhas coisas . algum . galloway . Disciplina!
etica e moral sao a mesma
coisa ?
sim . nao sei . isso e um pouco
complicado .
tudo e um jogo para ele . Temos a teoria de que o crime
aumenta a beleza.
viste o jogo dos new york
knicks ?
sim . nao sei . a ultima vez que
o vi , o meu pai
e um homem muito
poderoso .
pois , os knicks sao os
maiores !
Ela conhece um escritor em
New York.
queres ajuda ? nao . nao . sim . apanhamo-lo fora do es-
critorio cerca de uma hora
depois de o termos apan-
hado a si .
Nao. Sim.
que horas sao ? nao sei . nao sei . nao sei . nao sei . E quase meio-dia.
tudo bem ? sim . sim . sim . isso ja chega para gente
como ele .
Esta tudo bem. Esta tudo
bem. Esta tudo bem.
queres uma goma ? sim . sim . nao , nao quero . nao , nao quero essas tuas
estupidas gomas .
Nao, nao quero essas tuas
estupidas gomas.
sou bonito ? sim . sim . sim . bonito rapaz E bonito.
gostas de cantar ? sim . sim . sim . rhonda ... Sim... de repente tudo veio-
me a cabeca.
o que e que achas dos
japoneses ?
nao sei . nao sei . nada . vamos sabe-lo assim que
escurecer .
Os japoneses, o que?
o que e que fazes ? sou um homem de
negocios .
nao sei . nao sei . nao , nao o facas ! Reparo computadores. Nao,
trabalho com computadores.
devemos dancar ? nao . sim . nao . sabes , smith... ja sei o que
tu mais odeias .
Nao sei dancar.
se fores sair avisas-me ? claro . claro . sim . os meus pais vao chegar
dentro de uma hora .
Eu sei, isto nunca aconteceu.
Tu nao existes.
o que fazes na vida ? nao sei . o que fazes aqui ? nao sei . boa noite... Sou arquitecto.
ola ! como estas ? ola . como estas ? ola . tragam-no . Ola.
sabes dancar ? nao . sim . nao . nao . Eu nao sei dancar.
costumas sair a noite ? nao . nao . nao . sim . E eu ja nao percebo.
tu bebes cervejas ? nao . nao . nao . estamos apaixonados ,
suas cretinas .
Nao, obrigado.
podes dizer-me o cam-
inho para a biblioteca mais
proxima ?
sim . sim . o que e isto ? que vergonha , salem . Nao sei quem sou.
onde vives ? em casa . em casa do meu pai
.
na escola . eu levo-a a casa . Em Sao Francisco.
como te descreves em tres
palavras ?
sim . sim . com as mulheres ? panico . Primeira palavra.
eu acho que tu es uma
maquina
eu sou uma maquina eu sou uma estrela
de rock
sim . eu sou o coronel gre-
gor yegorov da federal
seguranca de rusia .
Es uma sobrevivente.
tens algum animal de
estimacao ?
nao . nao . nao . nao . Nao.
e passatempos preferidos
?
nao sei . de que ? o resto de comida
nao me permite ir
para casa ?
os meus preferidos ? Montanhismo e um pas-
satempo.
qual o melhor elogio que ja
recebeste ?
nao sei . nao , obrigado . nao sei . a viagem foi boa ? Claro.
tens quantos anos ? seis . oito . nao sei . galloway . Ela e do Ano do Galo.
tudo bem contigo ? sim . sim . sim . sim , tudo bem . Sim, tudo bem.
qual e a cor do ceu ? nao sei . nao sei . nao sei . repare , david . Sim, mas mesmo assim pre-
firo o azul.
como queres ser lembrado
?
nao sei . eu sei . eu nao sei . deve ser o sandy . Oh, acho que nao.
gostas de lasanha ? nao . sim . sim . adoro lasanha . Sim. Gosras de lasanha?
o que e que achas da
rainha ?
nao sei . nao sei . nao sei . nao , acho que nao . O que e a rainha tem a ver
com isso?
o que fazes ? nao sei . nao sei . nao sei . eu vim para te ver , e agora
pede minhas economias ...
Um pouco disto, um pouco
daquilo. Neste momento es-
tou de maos a abanar.
amanha encontramo-nos
?
sim . sim . sim . isso significa ... Esta bem.
fazes algum desporto ? nao . nao . nao . e por isso que eu luto . Desporto? Sim, eu era o
lancador.
qual e a tua religiao ? nao sei . nao sei . nao sei . i . i. , senhor . Nenhuma, porque? E obri-
gatorio?
o que e que fazes amanha
?
nao sei . o que fazes ? nao sei . isso significa ... O feriado do 4 de Julho?
quando e que nasceste ? nao sei . nao sei . nao sei . o futuro esta em ti . E que ela nasceu 3 minutos
depois.
ja vi que es muito agres-
sivo !
o que e que estas a
fazer ?
nao me digas que
nao es o unico .
nao sei . eu fui a um medico . Sejam agressivas!
qual e a tua comida
preferida ?
nao sei . nao sei . nao sei . que belo petisco . Porque e amarela...
o que te trouxe aqui ? nao sei . nada . o que e que eu fiz ? eu nao acreditava , mas e
”ver para crer” .
O Michael estava a ter proble-
mas na escola e eu... mudei-
me para ca para encontrar um
lugar mais seguro.
qual e a coisa mais impor-
tante a saber sobre biolo-
gia ?
nao sei . o que e que eu faco
?
nao sei . com a sra . hawker ou a sra
. jennings ?
E algo que tenho de fazer.
podes falar para sempre ? sim . sim . claro . quero que te esforces para
falares decentemente .
Estava a comecar a pensar
que... tu nao me querias
mais... e que andavas a sair
com outra pessoa.
qual e o buraco mais fundo
no mundo ?
nao sei . nao sei . nao sei . sabes que mais ? Vamos recalcular com as me-
didas do teu exo-esqueleto e
ver o que da.
em que paıses ja estiveste
?
nao . nao sei . tenho que fazer isto . e bom , mas e como se
morresses e acordasses no
paraıso judeu .
Nao em todos.
tu es uma mulher
engracada !
nao , nao , nao . eu sei . sim . pelo seu corte confundia-a
com um homem .
Nao, tu e que es.
tens medo de robos ? nao . nao , nao , nao . sim . galloway . Eu nao tenho medo de nada.
o que e que acontece
se as maquinas puderem
pensar ?
nao sei . o que e que se passa
?
venham de volta . eles devem ter uma
fraqueza .
Corram o mais rapido que
puderem.
qual e a tua cor preferida ? azul . vermelho . nao sei . posso ajuda-lo , sargento ? A minha cor preferida... Nao
sei. Cinzento, talvez.
gostas de computadores ? sim . sim . sim . uso-os . Sim, o computador e meu.
costumas passar por aqui
?
nao . nao . nao . amanha filmo-o para voces
.
O costume de toda uma vida.
de onde es natural ? nao sei . de onde veio ? do meu pai . da cidade . Tenho a certeza de que pode
por no congelador as mul-
tas de estacionamento e as
licencas para fogo-de-artifıcio
que e a funcao normal do seu
departamento.
toquio e ... o que e isto ? o que ? o han esta a estudar
.
sabe , papa , nao e preciso
levar-me a escola todos os
dias .
Toquio? A hora mais cedo de
chegada a Toquio e as 6:15
am.
quem sao os teus pais ? nao sei . nao sei . nao sei . nao sei . Este... e o meu Pai.
o que gostas de fazer nos
tempos livres ?
nao sei . nao sei . nao sei . o moulin rouge e o lugar p
´ra ti
Nao ha mesmo mais nada.
gostas de estudar ? sim . sim . sim . e? Nao a perco por nada. Tranco-
me no meu quarto.
as vezes queres ser es-
pancado ?
nao . ou nao ? eu quero . aqui esta . Parece que ele... ele nao a
abandonou.
tens primos ? nao . nao . quantos sao ? galloway . Super-Homem e seu primo?
de que tipo de musicas
gostas ?
de que tipo ? de que ? nao sei . gostas de salsa ? Do tipo de musica que vos
gostardes.
tu es um homem sem in-
teligencia .
eu sei . eu sei . nao , nao sou . nao . Certo. Eu sou o homem.
gostas de trabalhar ? sim . sim . sim . sim . E o que faz David?
com quem vives ? com o meu pai . com o meu pai . com a minha mae . uma num milhao , milhao So com o meu tio.
queres ir beber um cafe ? nao , obrigado . nao . nao , obrigado . nao posso . So um cafe.
francamente , minha
querida , eu nao quero
saber !
nao quero saber ! nao quero saber ! o que ? nao e habitual conceder ao
condenado o seu ultimo de-
sejo ?
O que preciso de saber e: sou
fedorento?
C Sample Configuration File
---
pipelines:
  - pipelineType: opensubtitles
    batchSize: 2500
    inputDirectory: "/input/dataset/path"

    metadataCollector:
      - collectorType: themoviedb
        apiKey: YOURAPIKEY

    metadataFilters:
      - filterType: audience
        value: "M/16"
        country: "Portugal"
      - filterType: country
        value: "Italy"
      - filterType: countryRegex
        value: "^A.*o$"
      - filterType: countryQuantityMax
        value: 3
      - filterType: durationMin
        value: 3
      - filterType: encoding
        value: "utf-8"
      - filterType: genres
        value: ['Action', 'Comedy']
      - filterType: imdbIDExistence
      - filterType: originalLanguage
        value: "Portuguese"
      - filterType: originalLanguageQuantity
        value: 2
      - filterType: movieRatingRange
        leftValue: 2.2
        rightValue: 9.3
      - filterType: subtitleRatingExact
        value: 10.0
      - filterType: yearMin
        value: 1974

    interactionFilters:
      - filterType: intervalMax
        value: 4 # in seconds
      - filterType: triggerSentiment
        value: "positive"
      - filterType: answerTokensQuantityMin # requires using producer to tokenize the answer
        value: 26
      - filterType: triggerCharactersRange
        leftValue: 5
        rightValue: 42
      - filterType: triggerRegex
        value: "[:alpha:]*?$"
      - filterType: answerContains
        value: "Hello"
        ignoreCase: false
        invert: false

    producers:
      - producerType: openNLPSentiment
      - producerType: openNLPTokenizer
        reverseTrigger: false
        reverseAnswer: false
      - producerType: treeTaggerLemmatizer
      - producerType: openNLPStemmer

    transformers:
      - transformerType: stringifyTokens
        separator: " "
      - transformerType: stringifyLemmas
      - transformerType: lowercase
      - transformerType: uppercase

    outputs:
      - outputType: legacy
        outputDir: "/output/corpus/legacy/path"
      - outputType: xml
        outputDir: "/output/corpus/xml/path"
      - outputType: json
        outputDir: "/output/corpus/json/path"
        prettyprint: true
      - outputType: parallel
        outputDir: "/output/corpus/parallel/path"
        triggersFilename: "source.txt"
        answersFilename: "target.txt"
        validationSize: 2000
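The parallel output type writes triggers and answers to a pair of line-aligned files, holding out the last validationSize interactions as a validation split. A minimal sketch of that behavior (file naming mirrors the configuration above; the function itself is an illustration, not B-Subtle's implementation):

```python
# Hypothetical sketch of the "parallel" output type: write aligned
# trigger/answer files and hold out the last `validation_size` pairs
# as a validation split. Not B-Subtle's actual implementation.
import os
import tempfile

def write_parallel(pairs, output_dir, triggers_filename="source.txt",
                   answers_filename="target.txt", validation_size=1):
    """Write aligned source/target files plus a held-out validation split."""
    os.makedirs(output_dir, exist_ok=True)
    split = len(pairs) - validation_size
    for subset, suffix in ((pairs[:split], ""), (pairs[split:], ".valid")):
        with open(os.path.join(output_dir, triggers_filename + suffix), "w") as src, \
             open(os.path.join(output_dir, answers_filename + suffix), "w") as tgt:
            for trigger, answer in subset:
                src.write(trigger + "\n")  # line i of source aligns with
                tgt.write(answer + "\n")   # line i of target

pairs = [("para onde vais ?", "para casa ."),
         ("tudo bem ?", "sim ."),
         ("que horas sao ?", "nao sei .")]
out_dir = tempfile.mkdtemp()
write_parallel(pairs, out_dir, validation_size=1)
train_sources = open(os.path.join(out_dir, "source.txt")).read().splitlines()
valid_answers = open(os.path.join(out_dir, "target.txt.valid")).read().splitlines()
```

Keeping the two files line-aligned is what lets sequence-to-sequence toolkits such as OpenNMT-tf consume the corpus directly as source/target training data.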