
Page 1

Towards building Intelligent Machines that we can communicate with

Tomas Mikolov, Facebook
Talk at Text, Speech and Dialogue (TSD), 2017

Page 2

Introduction

• Great progress in machine learning has produced wonderful applications:

• Robust speech recognizers

• Automatic machine translation

• Search, ranking, spam filters, …

• Many of these are today part of real-world applications

• What are the next goals for researchers?

Page 3

Introduction

• Despite all this progress, we are still very far from having ‘intelligent machines’

• We do not have datasets that could be used to build such machines

• We have not even agreed on the metrics of success – how should machine intelligence be defined?

Page 4

Talk Overview

• Where we are:

• Distributed representations

• Recurrent networks

• Where we are going:

• Learning of complex patterns

• Incremental learning, long term memory

• Learning to learn (learning without supervision)

• Virtual environments as datasets for building AI

Page 5

Distributed representations of words

• Vector representation of words computed using neural networks

• Linguistic regularities in the word vector space

• Word2vec, fastText

Page 6

Word vectors

• Each word is associated with a real valued vector in N-dimensional space (usually N = 50 – 1000)

• The word vectors have similar properties to word classes (similar words have similar vector representations)

• Often computed using various types of neural networks

Page 7

Word vectors

• These word vectors can be subsequently used as features in many NLP tasks (Collobert et al, 2011)

• As word vectors can be trained on huge text datasets, they provide generalization for systems trained with a limited amount of supervised data

Page 8

Word vectors

• Many neural architectures have been proposed for training the word vectors, usually using several hidden layers

• We need a way to compare word vectors trained using different architectures

Page 9

Linguistic regularities in vector space

• We can do a nearest-neighbor search around the result of the vector operation “king – man + woman” and obtain “queen” (see the sketch below)

Linguistic regularities in continuous space word representations (Mikolov et al, 2013)
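A minimal sketch of this search with toy vectors (the vocabulary, dimensions and random vectors here are purely illustrative; real models use vectors trained on large corpora):

import numpy as np

vocab = ["king", "man", "woman", "queen", "apple"]
vecs = np.random.default_rng(0).normal(size=(len(vocab), 50))
vecs /= np.linalg.norm(vecs, axis=1, keepdims=True)   # unit length
idx = {w: i for i, w in enumerate(vocab)}

def analogy(a, b, c):
    # nearest neighbor to vec(b) - vec(a) + vec(c), by cosine similarity
    target = vecs[idx[b]] - vecs[idx[a]] + vecs[idx[c]]
    target /= np.linalg.norm(target)
    sims = vecs @ target
    for w in (a, b, c):                  # exclude the query words, as is standard
        sims[idx[w]] = -np.inf
    return vocab[int(np.argmax(sims))]

print(analogy("man", "king", "woman"))   # "queen" with real trained vectors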

Page 10

Word vectors – datasets for evaluation

Word-based dataset, almost 20K questions, focuses on both syntax and semantics:

• Athens:Greece Oslo: ___

• Angola:kwanza Iran: ___

• brother:sister grandson: ___

• possibly:impossibly ethical: ___

• walking:walked swimming: ___

Efficient estimation of word representations in vector space (Mikolov et al, 2013)

Page 11

Word vectors – Bengio architecture

• Neural-net-based word vectors were traditionally trained as part of a neural network language model (Bengio et al, 2003)

• Training on <1M words took days

Page 12

Word vectors – word2vec architectures

• The ‘continuous bag-of-words’ model (CBOW) adds inputs from words within a short window to predict the current word

• No hidden layer, no matrix multiplications

• The weights for different positions are shared

• Computationally much more efficient than the n-gram NNLM of Bengio et al (2003)

Page 13

Word vectors – word2vec architectures

• Predict the surrounding words using the current word

• This architecture is called the ‘skip-gram NNLM’

• Performance similar to the CBOW model

Page 14

Word vectors - training

• Stochastic gradient descent + backpropagation

• Efficient solutions to the very large softmax – its size equals the vocabulary size, which can easily be in the order of millions (too many outputs to evaluate):

• Hierarchical softmax: class-based approach

• Negative sampling: 1-against-all loss instead of softmax
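For illustration, a minimal numpy sketch of one skip-gram update with negative sampling (toy sizes and uniform negative sampling are simplifications; word2vec itself is an optimized C implementation that samples negatives from a smoothed unigram distribution):

import numpy as np

rng = np.random.default_rng(0)
V, D, K, lr = 1000, 100, 5, 0.025      # vocabulary size, dimension, negatives, learning rate
W_in = rng.normal(0.0, 0.1, (V, D))    # input (word) vectors
W_out = np.zeros((V, D))               # output (context) vectors

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sgns_step(center, context):
    # one positive target plus K sampled negatives
    targets = np.concatenate(([context], rng.integers(0, V, K)))
    labels = np.zeros(K + 1)
    labels[0] = 1.0
    h = W_in[center]
    err = sigmoid(W_out[targets] @ h) - labels   # gradient of the logistic loss
    grad_h = err @ W_out[targets]                # accumulate before updating W_out
    W_out[targets] -= lr * np.outer(err, h)
    W_in[center] -= lr * grad_h

sgns_step(center=3, context=17)   # hypothetical word indices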

Page 15

Word vectors – comparison of performance (2013)

• Google 20K questions dataset (word based, both syntax and semantics)

• Almost all models are trained on different datasets

Page 16

Word vectors – more analogies

Page 17

Word vectors – visualization using PCA

Page 18

Beyond word2vec: GloVe?

• GloVe is a well-known reimplementation of word2vec from the Stanford NLP group

• The most important modification is to first count word co-occurrences, and to perform dimensionality reduction in a second step (sketched below)

• Claims superior performance to word2vec
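The count-based, two-step recipe can be sketched as follows (a toy corpus, with plain truncated SVD standing in for GloVe's weighted least-squares fit):

import numpy as np

corpus = "the cat sat on the mat the dog sat on the rug".split()
vocab = sorted(set(corpus))
idx = {w: i for i, w in enumerate(vocab)}
V, window = len(vocab), 2

# Step 1: accumulate co-occurrence counts within a symmetric window.
C = np.zeros((V, V))
for i, w in enumerate(corpus):
    for j in range(max(0, i - window), min(len(corpus), i + window + 1)):
        if j != i:
            C[idx[w], idx[corpus[j]]] += 1

# Step 2: reduce the (log-)count matrix to dense vectors via truncated SVD.
U, S, _ = np.linalg.svd(np.log1p(C))
word_vectors = U[:, :2] * S[:2]      # 2 dimensions are enough for a toy corpus
print(dict(zip(vocab, np.round(word_vectors, 2))))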

Page 19

Beyond word2vec: GloVe?

• GloVe: Global Vectors for Word Representation (Pennington, Socher, Manning, 2014):

Model   Dim.   Training size (tokens)   Accuracy
CBOW    1000   6B                       63%
SG      1000   6B                       65%
SVD-L   300    42B                      49%
GloVe   300    42B                      75%

Page 20

Beyond word2vec: GloVe?

• GloVe: Global Vectors for Word Representation (Pennington, Socher, Manning, 2014):

• … comparing quality of machine learning techniques when training on different datasets is not recommended…

Page 21

Beyond word2vec: GloVe?

• The word2vec package includes the script ‘demo-train-big-model-v1.sh’:

• Achieves 78% accuracy, higher than any of the GloVe models

• Anyone can reproduce the results; it uses only public data

• Published a long time before the GloVe project

• Results further analyzed in “Improving Distributional Similarity with Lessons Learned from Word Embeddings” (Levy, Goldberg, Dagan, 2015):

• When trained on the same dataset, GloVe is found to be slower to train, to need much more memory, and to produce vectors of lower quality than word2vec

Page 22

Beyond word2vec:

• We can further improve the accuracy of word vectors by adding subword information

• Can be achieved by using the character n-grams as additional inputs

• Helps especially for morphologically rich languages, and can form representations for out-of-vocabulary words
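A minimal sketch of the subword idea, roughly following the fastText paper: a word is represented by its character n-grams (with boundary markers) plus the word itself, and its vector is the sum of their embeddings. The table size and hashing here are simplified stand-ins:

import numpy as np

def char_ngrams(word, n_min=3, n_max=6):
    # character n-grams with < and > marking the word boundaries,
    # plus the full word itself as one extra unit
    w = "<" + word + ">"
    grams = [w[i:i + n] for n in range(n_min, n_max + 1)
                        for i in range(len(w) - n + 1)]
    return grams + [w]

# Each n-gram owns a row in an embedding table; the word vector is the sum.
# (fastText hashes n-grams into ~2M buckets; this toy table is much smaller,
#  and Python's built-in hash stands in for fastText's hashing function.)
table = np.random.default_rng(0).normal(0.0, 0.1, (1000, 100))

def word_vector(word):
    rows = [hash(g) % len(table) for g in char_ngrams(word)]
    return table[rows].sum(axis=0)

print(word_vector("unhappiness")[:5])   # works even for out-of-vocabulary words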

Page 23

Beyond word2vec:

• https://github.com/facebookresearch/fastText

• Enriching Word Vectors with Subword Information (Bojanowski, Grave, Joulin, Mikolov, 2017)

                    word2vec   fastText
Czech Semantic      26%        28%
Czech Syntactic     53%        78%
German Semantic     67%        62%
German Syntactic    45%        56%
English Semantic    79%        78%
English Syntactic   70%        75%

Page 24

Beyond word2vec:

• When word2vec / fastText and GloVe are trained on comparable corpora, it is clear which algorithms are superior:

• Models available at fasttext.cc

• Paper with details will be published later this year

                                      Semantic   Syntactic   Total
GloVe: Wikipedia + Gigaword, 300d     78%        67%         72%
word2vec: Wikipedia + News, 300d      91%        84%         87%
fastText: Wikipedia + News, 300d      88%        88%         88%

Page 25

Beyond word2vec:

• fastText also implements a supervised mode: like the CBOW architecture, but predicting a label instead of the middle word

• Comparable accuracy to deep learning models (bidirectional LSTMs, CNNs), but trains in seconds where deep learning models require days to weeks

• Bag of tricks for efficient text classification (Joulin, Grave, Bojanowski, Mikolov, 2017)

• Can use pre-trained features to build classifiers when only a limited number of training examples is available
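For example, the supervised mode can be driven from the command line roughly as follows (the file names are placeholders; training data has one example per line, with labels prefixed by __label__):

./fasttext supervised -input train.txt -output model
./fasttext test model.bin test.txt
./fasttext predict model.bin test.txt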

Page 26

Distributed word representations: summary

• Simple models seem to be sufficient: no need for every neural net to be deep

• Large text corpora are crucial for good performance

• Open-source packages: word2vec, GloVe, fastText

• word2vec & fastText are superior to GloVe when trained on the same data, and are faster and more memory efficient

Page 27

Recurrent Networks and Beyond

• Recent success of recurrent networks

• Explore limitations of recurrent networks

• Discuss what needs to be done to build machines that can understand language

Page 28

Simple RNN Architecture

• Input layer, hidden layer with recurrent connections, and the output layer

• In theory, the hidden layer can learn to represent unlimited memory

• Also called the Elman network (Finding structure in time, Elman, 1990)
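A sketch of the forward pass with toy dimensions (the sigmoid hidden layer follows the original RNNLM; the weight scales and sizes here are arbitrary):

import numpy as np

rng = np.random.default_rng(0)
D_in, D_h, D_out = 50, 100, 50              # toy layer sizes
W_xh = rng.normal(0.0, 0.1, (D_h, D_in))    # input -> hidden
W_hh = rng.normal(0.0, 0.1, (D_h, D_h))     # recurrent hidden -> hidden
W_hy = rng.normal(0.0, 0.1, (D_out, D_h))   # hidden -> output

def elman_step(x, h_prev):
    # the hidden state mixes the current input with the previous state;
    # in a language model the output would be followed by a softmax
    h = 1.0 / (1.0 + np.exp(-(W_xh @ x + W_hh @ h_prev)))
    y = W_hy @ h
    return h, y

h = np.zeros(D_h)
for x in rng.normal(size=(10, D_in)):       # a sequence of 10 input vectors
    h, y = elman_step(x, h)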

Page 29

Brief History of Recurrent Nets – 1990s to 2010

• After the initial excitement, recurrent nets vanished from mainstream research

• Despite being theoretically powerful models, RNNs were mostly considered too unstable to train

Page 30

Brief History of Recurrent Nets – 2010 to today

• In 2010 – 2012, it was shown that RNNs can significantly improve the state of the art in:

• language modeling

• machine translation

• data compression

• speech recognition

• RNNLM toolkit was published (used at Microsoft Research, Google, IBM, Facebook, Yandex, …)

• The key novel trick in RNNLM was trivial: clipping gradients to prevent training instability
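The trick itself fits in a few lines (a sketch; the threshold values are illustrative hyperparameters, and norm-based rescaling is a common alternative to the element-wise clipping used in RNNLM):

import numpy as np

def clip_gradient(grad, threshold=15.0):
    # element-wise clipping: components outside [-threshold, threshold]
    # are truncated, which keeps exploding gradients from derailing SGD
    return np.clip(grad, -threshold, threshold)

def clip_by_norm(grad, max_norm=5.0):
    # common alternative: rescale the whole gradient vector if it is too long
    norm = np.linalg.norm(grad)
    return grad * (max_norm / norm) if norm > max_norm else grad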

Page 31

Brief History of RNNLMs – 2010 - today

• Breakthrough result in 2011: 11% WER reduction over a large system from IBM (NIST RT04)

• Ensemble of big RNNLM models trained on a lot of data

Page 32

Brief History of RNNLMs – 2010 - today

• RNNs became much more accessible through open-source toolkits:

• Theano

• Torch

• TensorFlow

• …

• Training on GPUs allowed further scaling up (billions of words, thousands of hidden neurons)

Page 33

Recurrent Nets Today

• Widely applied:

• ASR (both acoustic and language models)

• MT (language & translation & alignment models, joint models, end-to-end)

• Many NLP applications

• Video modeling, handwriting recognition, user intent prediction, …

• Downside: for many problems RNNs are too powerful, models are becoming unnecessarily complex

• Complex RNN architectures are popular (LSTM, GRU), though there are simpler tricks for adding longer short-term memory to RNNs (Learning longer memory in recurrent neural networks, 2014)

Page 34

Beyond Deep Learning

• Going beyond: what can RNNs and deep networks not model efficiently?

• Surprisingly simple patterns! For example, memorization of a variable-length sequence of symbols

• Thus, these models cannot deal efficiently with novel words

Page 35

Beyond Deep Learning: Algorithmic Patterns

• Many complex patterns have short, finite description length in natural language (or in any Turing-complete computational system)

• We call such patterns Algorithmic patterns

• Examples of algorithmic patterns: a^n b^n, sequence memorization, addition of numbers learned from examples (see the sketch below)

• These patterns often cannot be learned with standard deep learning techniques (i.e. just by using many hidden layers)
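For example, the a^n b^n pattern has a one-line generative description; a sketch of the kind of data such models are trained on:

import random

def anbn_example(max_n=10):
    # one string from the a^n b^n language: n 'a's followed by n 'b's
    n = random.randint(1, max_n)
    return "a" * n + "b" * n

print([anbn_example() for _ in range(5)])   # e.g. ['aabb', 'ab', 'aaabbb', ...]
# A model sees such strings symbol by symbol and must learn that exactly
# as many 'b's follow as there were 'a's, including for unseen values of n.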

Page 36

Stack RNN

• Learns algorithms from examples

• Adds structured memory to the RNN:

• Trainable [read/write]

• Unbounded

• Actions: PUSH / POP / NO-OP

• Examples of memory structures: stacks, lists, queues, tapes, grids, …
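A sketch of the core mechanism: the stack is updated with a soft mixture of the PUSH / POP / NO-OP actions predicted by the RNN, so the whole memory stays differentiable (dimensions and the action-prediction network are omitted; names are illustrative):

import numpy as np

def stack_update(stack, action_probs, push_value):
    # soft update: action_probs = (p_push, p_pop, p_noop), produced by the RNN;
    # stack is a 1-D array whose element 0 is the top
    pushed = np.roll(stack, 1)
    pushed[0] = push_value        # push: everything moves one slot down
    popped = np.roll(stack, -1)
    popped[-1] = 0.0              # pop: everything moves one slot up
    p_push, p_pop, p_noop = action_probs
    return p_push * pushed + p_pop * popped + p_noop * stack

stack = np.zeros(8)
stack = stack_update(stack, (0.9, 0.05, 0.05), push_value=1.0)
print(stack[:3])   # the top now mostly holds the pushed value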

Page 37

Algorithmic Patterns

• Examples of simple algorithmic patterns generated by short programs (grammars)

• The goal is to learn these patterns without supervision, just by observing the example sequences

Page 38

Algorithmic Patterns - Counting

• Performance on simple counting tasks

• RNN with sigmoidal activation function cannot count

• Stack-RNN and LSTM can count

Page 39

Algorithmic Patterns - Sequences

• Sequence memorization is not learnable by LSTM

• The expandable memory of stacks allows learning a solution that generalizes

Page 40

Stack RNNs: summary

The good:

• Turing-complete model of computation (with two or more stacks)

• Learns some algorithmic patterns

• Has long term memory

• Simple model that works for some problems that break RNNs and LSTMs

• Reproducible: https://github.com/facebook/Stack-RNN

The bad:

• The long term memory is used only to store partial computation (i.e. learned skills are not stored there yet)

• Does not seem to be a good model for incremental learning

• Stacks do not seem to be a very general choice for the topology of the memory

Page 41

Beyond Stack-RNNs

• Many different types of memories have been recently proposed

• More complex tasks have been learned: binary multiplication with very big numbers (thousands of digits)

• One can argue that learning to solve these problems has limited practical value, especially when the model and task are designed together

Page 42

Towards strong AI

• It may be good to first choose the important tasks we want to solve in the long term

• Then, limit the complexity gradually until we obtain solvable tasks

• And finally develop machine learning techniques that can solve the more complex tasks

• Our plan is described in ‘A Roadmap towards Machine Intelligence’ (2015)

Page 43

The Goal for “useful general AI”

• General AI research is currently very popular…

• But different people see AI as something very different (Image recognition? Machine translation? Data compression?)

• For this talk, we assume “useful AI” is an artificial machine (computer) capable of helping human users solve a wide range of tasks, in a similar way as other humans can

Page 44

Useful General AI

• We attempted to identify crucial components of useful AI:

• Ability to perform tasks for human users

• Ability to learn and improve

• Ability to communicate

Page 45

Additional components

• The previous design can be further extended by adding:

• Grounding (virtual worlds, 2D / 3D)

• Multi-modal perception (vision, audio, …)

• Communication between AIs

• …

• However, we believe that developing even the simplest general AI is very complex, and thus we start with just the necessary components

Page 46

The Roadmap to AI

• We want to develop a machine that can learn to perform novel tasks for us through natural communication - example:

Me: Can you check the weather every day before I go to work to see if it is going to rain, so that I don’t forget to bring an umbrella?

Machine: But how do I do that?

Me: go to a search engine and enter the query ‘weather new york’

… (some morning, a week later) …

Machine: it will rain today!

Page 47

The Roadmap to AI

• The existing machine learning techniques seem to be insufficient for this goal: deep (recurrent, convolutional) neural networks have excellent performance on supervised tasks, but much future development is needed in:

• Unsupervised / reward-based learning

• Compositional and incremental learning

• Long term memory

• Currently, there is no standard dataset focusing on teaching machines to communicate in natural language while addressing these research problems

Page 48

CommAI-env

• CommAI-env is an open-source virtual environment focused on learning to communicate

• Published together with a set of very simple (but still probably unsolvable by current methods) communication tasks

CommAI: Evaluating the first steps towards a useful general AI(Baroni et al, 2017)

https://github.com/facebookresearch/CommAI-env

Page 49

CommAI tasks

• Currently there are several existing datasets, of varying degrees of complexity

• Example of a basic task:

Teacher: repeat after me: AABBB

Learner: AAB

Teacher: wrong, the correct answer is: AABBB

Page 50

CommAI tasks

• Currently there are several existing datasets, of varying degrees of complexity

• Example of a basic task:

Teacher: repeat after me: BBAA

Learner: BBAA

Teacher: good! [reward +1]

Learning to repeat a word (or, more generally, a sequence of symbols) has already been shown to be very challenging for RNNs.

Page 51

CommAI tasks

• The tasks are closely related and share some common structure through natural language

• Example of a task that re-uses previous knowledge:

Teacher: repeat twice after me: ABA

Learner: ABAABA

Teacher: good! [reward +1]

It should be much faster to learn the ‘repeat twice’ task after the Learner already knows how to repeat a sequence once.
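A toy version of the teacher/learner loop for the repeat task (an illustrative sketch only, not the CommAI-env API; the real environment streams text and rewards incrementally and keeps score across tasks):

import random

def repeat_task(learner, episodes=3):
    # toy teacher: ask the learner to repeat a string, reward exact matches
    reward = 0
    for _ in range(episodes):
        target = "".join(random.choice("AB") for _ in range(random.randint(2, 5)))
        answer = learner("repeat after me: " + target)
        if answer == target:
            print("Teacher: good! [reward +1]")
            reward += 1
        else:
            print("Teacher: wrong, the correct answer is: " + target)
    return reward

# a 'cheating' learner that already knows how to parse the instruction:
repeat_task(lambda message: message.split(": ")[-1])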

Page 52

Future of general AI research

• We need standard datasets and metrics to compare various attempts to solve communication-based general AI

• CommAI-env is a prototype of a communication-based environment, and the General AI Challenge from GoodAI is an example of a standard dataset defined together with the metrics of success: https://www.general-ai-challenge.org/

• The objective function should reflect learning speed: we aim to build Learners that can learn to perform novel tasks from as few examples as possible

Page 53

Open research problems

Unsupervised / reward-based learning:

• How can the machine “learn to learn”: a knowledgeable Learner should modify its own behavior even when no explicit reward signal is present

• Learner should be able to memorize new facts and abilities without explicit instructions

Page 54

Open research problems

Compositional and incremental learning:

• How can the Learner build new skills by re-using existing skills, i.e. without learning a solution to every new problem from scratch?

Page 55

Open research problems

Long term memory:

• What should be the structure of the long term memory? How should it be updated?

• All these research problems seem closely related. One can probably not solve compositional and incremental learning without some way to form persistent long term memory.

Page 56

Conclusion

• We may want to be more goal-oriented when we talk about strong or general AI

• Communication seems to be necessary for useful strong AI

• To achieve progress, researchers need to have standard datasets to compare various approaches, and incentives to work on very difficult unsolved problems

• https://research.fb.com/commai-fellowships-and-visiting-researcher-programs/
