
Page 1

Application of RNNs to Language Processing
Andrey Malinin, Shixiang Gu

CUED Division F Speech Group

Page 2

Overview

• Language Modelling

• Machine Translation

Page 3

Overview

• Language Modelling

• Machine Translation

Page 4

Language Modelling Problem

• Aim is to calculate the probability of a sequence (sentence) P(X)

• Can be decomposed into a product of conditional probabilities of tokens (words): P(X) = P(x1) ∙ P(x2 | x1) ∙ … ∙ P(xN | x1, …, xN-1)

• In practice, only a finite context is used

Page 5

N-Gram Language Model

• N-Grams estimate word conditional probabilities via counting (a minimal example follows this list): P(wi | wi-N+1, …, wi-1) ≈ count(wi-N+1 … wi) / count(wi-N+1 … wi-1)

• Sparse (alleviated by back-off, but not entirely)

• Doesn’t exploit word similarity

• Finite Context
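As a concrete illustration of the counting estimate above (not from the slides; the toy corpus and the maximum-likelihood choice are assumptions), a minimal bigram model in Python:

```python
from collections import defaultdict

def train_bigram_lm(sentences):
    """Estimate P(word | prev) by maximum-likelihood counting over a toy corpus."""
    bigram_counts = defaultdict(int)
    history_counts = defaultdict(int)
    for sentence in sentences:
        tokens = ["<s>"] + sentence.split() + ["</s>"]
        for prev, word in zip(tokens, tokens[1:]):
            bigram_counts[(prev, word)] += 1
            history_counts[prev] += 1
    # P(word | prev) = count(prev, word) / count(prev)
    return {bg: c / history_counts[bg[0]] for bg, c in bigram_counts.items()}

probs = train_bigram_lm(["the cat sat", "the cat slept"])
print(probs[("the", "cat")])   # 1.0
print(probs[("cat", "sat")])   # 0.5
```

Any bigram never seen in training gets zero probability, which is exactly the sparsity problem that back-off only partially alleviates.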

Page 6

Neural Network Language Model

Y. Bengio et al., JMLR’03
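The slide shows the model figure from Bengio et al. (2003). As a rough sketch of that feed-forward architecture (the dimensions, initialisation, and the omission of the optional direct input-to-output connections are assumptions here):

```python
import numpy as np

rng = np.random.default_rng(0)
V, n, d, h = 10000, 3, 64, 128                 # vocab size, context length, embed dim, hidden dim

C = rng.normal(scale=0.1, size=(V, d))         # shared word embedding matrix
H = rng.normal(scale=0.1, size=(h, n * d))     # projection of concatenated context -> hidden
U = rng.normal(scale=0.1, size=(V, h))         # hidden -> output
b, d_bias = np.zeros(V), np.zeros(h)

def nnlm_next_word_probs(context_ids):
    """P(w_t | w_{t-n}, ..., w_{t-1}) for a fixed-length context of word ids."""
    x = np.concatenate([C[i] for i in context_ids])   # concatenated context embeddings
    hidden = np.tanh(d_bias + H @ x)
    logits = b + U @ hidden                            # softmax over the whole vocabulary
    e = np.exp(logits - logits.max())
    return e / e.sum()

p = nnlm_next_word_probs([12, 407, 33])
```

The final softmax over all V output words is the costly step picked up on the next slide.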

Page 7

Limitation of Neural Network Language Model

• Sparsity – Solved

• Word Similarity – Solved

• Finite Context – Not

• Computational Complexity – softmax over the full vocabulary

Page 8

Recurrent Neural Network Language Model

[X. Liu, et al.]
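The slide reproduces the RNN LM diagram from [X. Liu, et al.]. A minimal Elman-style recurrence, with dimensions chosen only for illustration, might look like this:

```python
import numpy as np

rng = np.random.default_rng(1)
V, h = 10000, 256                          # vocabulary size, hidden size
E = rng.normal(scale=0.1, size=(V, h))     # input word embeddings
W = rng.normal(scale=0.1, size=(h, h))     # recurrent weights
O = rng.normal(scale=0.1, size=(V, h))     # output weights

def rnnlm_step(word_id, h_prev):
    """Consume one word; return the new state and P(next word | full history)."""
    h_t = np.tanh(E[word_id] + W @ h_prev)   # the hidden state summarises the whole history
    logits = O @ h_t
    e = np.exp(logits - logits.max())
    return h_t, e / e.sum()

# Score a toy word-id sequence by chaining the recurrence.
h_state, log_prob = np.zeros(h), 0.0
for w, w_next in [(1, 5), (5, 9)]:
    h_state, probs = rnnlm_step(w, h_state)
    log_prob += np.log(probs[w_next])
```

Note that the output softmax is again over the whole vocabulary, which is the remaining cost highlighted two slides below.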

Page 9

Wall Street Journal Results – T. Mikolov, 2010

Page 10

Limitation of RNN Language Model

• Sparsity – Solved!

• Word Similarity -> Sentence Similarity – Solved!

• Finite Context – Solved? Not quite…

• Softmax is still computationally complex

Page 11

Lattice Rescoring with RNNs

• Applying RNN LMs to lattices expands the search space, since the unbounded RNN history prevents distinct paths from sharing lattice states

• Lattice is expanded to a prefix tree or N-best list

• Impractical to apply to large lattices

• Approximate Lattice Expansion (sketched after this list) – expand if:

• N-gram history is different

• RNN history vector distance exceeds a threshold
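A hedged, pseudocode-style rendering of the expansion test above (the node structure, the 4-gram truncation, and the distance threshold are illustrative assumptions, not the exact scheme of X. Liu et al.):

```python
import numpy as np

def needs_new_variant(node, ngram_hist, rnn_state, n=4, threshold=0.1):
    """Decide whether a new expanded copy of a lattice node is required.

    node.variants is assumed to hold (ngram_history, rnn_state) pairs already
    created for this node during rescoring.
    """
    for cached_hist, cached_state in node.variants:
        same_ngram = tuple(cached_hist[-(n - 1):]) == tuple(ngram_hist[-(n - 1):])
        close_state = np.linalg.norm(cached_state - rnn_state) < threshold
        if same_ngram and close_state:
            return False   # histories are close enough: merge with the cached variant
    return True            # n-gram history differs or RNN states are far apart: expand
```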

Page 12

Overview

• Language Modelling

• Machine Translation

Page 13

Machine Translation Task

• Translate a source sentence E into a target sentence F

• Can be formulated in Noisy-Channel Framework:

F’ = argmaxF[P(F|E)] = argmaxF[P(E|F) ∙ P(F)]

• P(F) is just a language model – need to estimate P(E|F).

Page 14

Previous Approaches: Word Alignment

• Use IBM Models 1-5, of increasing complexity and accuracy, to create initial word alignments from sentence pairs.

• Make conditional independence assumptions to separate out sentence length, alignment and translation models.

• Bootstrap using simpler models to initialize more complex models.

W. Byrne, 4F11
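For context, a compact EM sketch in the spirit of IBM Model 1, the simplest of the models above; the uniform initialisation and toy sentence pairs are assumptions:

```python
from collections import defaultdict

def ibm_model1(sentence_pairs, iterations=10):
    """Estimate word translation probabilities t(f | e) with EM."""
    f_vocab = {f for _, fs in sentence_pairs for f in fs}
    t = defaultdict(lambda: 1.0 / len(f_vocab))           # uniform initialisation
    for _ in range(iterations):
        count, total = defaultdict(float), defaultdict(float)
        for es, fs in sentence_pairs:
            es = ["NULL"] + es                             # allow words aligned to nothing
            for f in fs:
                z = sum(t[(f, e)] for e in es)             # E-step: soft alignment posteriors
                for e in es:
                    count[(f, e)] += t[(f, e)] / z
                    total[e] += t[(f, e)] / z
        for (f, e) in count:                               # M-step: renormalise
            t[(f, e)] = count[(f, e)] / total[e]
    return t

pairs = [(["the", "house"], ["la", "maison"]),
         (["the", "book"], ["le", "livre"])]
t = ibm_model1(pairs)    # t[(f, e)] now holds the learned lexical translation probabilities
```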

Page 15

Previous Approaches: Phrase Based SMT

• Using the IBM word alignments, create phrase alignments and a phrase translation model.

• Parameters estimated by Maximum Likelihood or EM.

• Apply Synchronous Context Free Grammar to learn hierarchical rules over phrases.

W. Byrne, 4F11

Page 16

Problems with Previous Approaches

• Highly Memory Intensive

• Initial alignment makes conditional independence assumption

• Word and Phrase translation models only count co-occurrences of surface form – don’t take word similarity into account

• Highly non-trivial to decode hierarchical phrase-based translation – must combine:

• word alignments + lexical reordering model

• language model

• phrase translations

• parse a synchronous context free grammar over the text – components are very different from one another.

Page 17

Neural Machine Translation

• The translation problem is expressed as a probability P(F|E)

• Equivalent to P(fn, fn-1, …, f0 | em, em-1, …, e0) -> a sequence conditioned on another sequence.

• Create an RNN architecture where the output of one RNN (the decoder) is conditioned on another RNN (the encoder).

• We can connect them using a joint alignment and translation mechanism.

• Results in a single, end-to-end Machine Translation model which can generate candidate translations.

Page 18

Bi-Directional RNNs

Page 19

Neural Machine Translation: Encoder

[Figure: bi-directional RNN encoder mapping source words e0 … eN to annotation vectors h0 … hN]

• Can be pre-trained as a Bi Directional RNN language model
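A minimal numpy sketch of the encoder annotations, concatenating forward and backward RNN states as in the diagram; the weights, dimensions, and tanh non-linearity are assumptions:

```python
import numpy as np

rng = np.random.default_rng(2)
d_emb, d_hid = 32, 64
Wf, Uf = rng.normal(scale=0.1, size=(d_hid, d_hid)), rng.normal(scale=0.1, size=(d_hid, d_emb))
Wb, Ub = rng.normal(scale=0.1, size=(d_hid, d_hid)), rng.normal(scale=0.1, size=(d_hid, d_emb))

def encode(source_embeddings):
    """Return annotations h_j = [forward state; backward state] for each source word."""
    N = len(source_embeddings)
    fwd, bwd = [np.zeros(d_hid)], [np.zeros(d_hid)]
    for e in source_embeddings:                       # left-to-right pass
        fwd.append(np.tanh(Wf @ fwd[-1] + Uf @ e))
    for e in reversed(source_embeddings):             # right-to-left pass
        bwd.append(np.tanh(Wb @ bwd[-1] + Ub @ e))
    bwd = bwd[1:][::-1]                               # align backward states with positions 0..N-1
    return [np.concatenate([fwd[j + 1], bwd[j]]) for j in range(N)]

annotations = encode([rng.normal(size=d_emb) for _ in range(5)])   # h_0 ... h_4
```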

Page 20

Neural Machine Translation: Decoder

[Figure: RNN decoder with hidden states s0 … sM generating target words f0 … fM, starting from <S> and ending with fM = </S>]

• ft is produced by sampling from the discrete distribution given by the softmax output layer (a minimal sketch follows below).

• Can be pre-trained as an RNN language model
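A minimal sketch of the sampling step described in the first bullet; the vocabulary size and output matrix are placeholders, and the decoder state update itself is omitted:

```python
import numpy as np

rng = np.random.default_rng(3)
V, d_hid = 8000, 64
W_out = rng.normal(scale=0.1, size=(V, d_hid))

def sample_next_word(s_t):
    """Draw f_t from the softmax distribution over the target vocabulary."""
    logits = W_out @ s_t
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    return rng.choice(V, p=probs)   # at test time one could take the argmax or use beam search instead

f_t = sample_next_word(np.zeros(d_hid))
```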

Page 21

Neural Machine Translation: Joint Alignment

[Figure: at each step the decoder state st attends over the encoder annotations h0 … hN while generating f0 … fM]

zj = W ∙ tanh(V ∙ st-1 + U ∙ hj)

at,1:N = softmax(z1:N)

ct = ∑j at,j hj
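Putting the alignment equations above into a short numpy sketch (matrix shapes are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(4)
d_s, d_h, d_a = 64, 128, 32                      # decoder state, annotation, alignment dims
V = rng.normal(scale=0.1, size=(d_a, d_s))
U = rng.normal(scale=0.1, size=(d_a, d_h))
W = rng.normal(scale=0.1, size=(d_a,))

def attention_context(s_prev, annotations):
    """c_t = sum_j a_tj * h_j, with a_t = softmax(z) and z_j = W . tanh(V s_{t-1} + U h_j)."""
    z = np.array([W @ np.tanh(V @ s_prev + U @ h_j) for h_j in annotations])
    a = np.exp(z - z.max())
    a /= a.sum()                                  # alignment weights a_{t,1:N}
    c_t = sum(a_j * h_j for a_j, h_j in zip(a, annotations))
    return c_t, a

c_t, weights = attention_context(np.zeros(d_s), [rng.normal(size=d_h) for _ in range(6)])
```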

Page 22

Neural Machine Translation: Features

• End-to-end differentiable, trained using SGD with cross-entropy error function.

• Encoder and Decoder learn to represent source and target sentences in a compact, distributed manner

• Does not make conditional independence assumptions to separate out translation model, alignment model, re-ordering model, etc…

• Does not pre-align words by bootstrapping from simpler models.

• Learns translation and joint alignment in a semantic space, not over surface forms.

• Conceptually easy to decode – complexity similar to speech processing, not SMT.

• Fewer Parameters – more memory efficient.

Page 23

NMT BLEU results on English to French Translation

D. Bahdanau, K. Cho, Y. Bengio. Neural Machine Translation by Jointly Learning to Align and Translate.

Page 24

Conclusion

• RNNs and LSTM RNNs have been widely applied to a large range of language processing tasks.

• State of the art in language modelling

• Competitive performance on new tasks.

• Quickly evolving.

Page 25

Bibliography

• W. Byrne, Engineering Part IIB: Module 4F11 Speech and Language Processing. Lecture 12. http://mi.eng.cam.ac.uk/~pcw/local/4F11/4F11_2014_lect12.pdf

• D. Bahdanau, K. Cho, Y. Bengio. Neural Machine Translation by Jointly Learning to Align and Translate. 2014.

• Y. Bengio, et al., “A neural probabilistic language model”. Journal of Machine Learning Research, No. 3 (2003)

• X. Liu, et al. “Efficient Lattice Rescoring using Recurrent Neural Network Language Models”. In: Proceedings of IEEE ICASSP 2014.

• T. Mikolov. “Statistical Language Models Based on Neural Networks” (2012) PhD Thesis. Brno University of Technology, Faculty of Information Technology, Department Of Computer Graphics and Multimedia.