
RNN, LSTM and Seq-2-Seq Models


6/23/2016 Presentation Final


Introductory Presentation on RNN, LSTM and Seq-2-Seq Models

by Jayeol Chun and Sang-Hyun Eun

1. Brief Overview of Theory behind RNN

Q: What is RNN?


Feed-Forward vs. Feed-Back : Static vs. Dynamic

As opposed to a feed-forward network such as a Convolutional Neural Network (CNN), which has no cycles, a Recurrent Neural Network (RNN) maintains persistence of information by feeding the results of previous computations into later ones. It is thus well suited to processing sequences such as strings of characters, which makes it a natural tool in NLP.

Basic RNN Computation in Theory

    class RNN:
        # ...
        def step(self, x):
            # update the hidden state
            self.h = np.tanh(np.dot(self.W_hh, self.h) + np.dot(self.W_xh, x))
            # compute the output vector
            y = np.dot(self.W_hy, self.h)
            return y

    # main instruction
    rnn = RNN()
    y = rnn.step(x)  # x is an input vector, y is the RNN's output vector
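The step function above leaves the weight setup elided. As a minimal runnable sketch, the version below fills it in with assumptions of our own: the class name `VanillaRNN`, the layer sizes, and the small random initialization are illustrative choices, not part of the original snippet. It feeds the one-hot characters of "hell" through the cell to show that the hidden state carries history.

```python
import numpy as np

class VanillaRNN:
    """Minimal recurrent cell; sizes and random init are illustrative assumptions."""

    def __init__(self, input_size, hidden_size, output_size, seed=0):
        rng = np.random.default_rng(seed)
        self.W_xh = rng.normal(0.0, 0.1, (hidden_size, input_size))
        self.W_hh = rng.normal(0.0, 0.1, (hidden_size, hidden_size))
        self.W_hy = rng.normal(0.0, 0.1, (output_size, hidden_size))
        self.h = np.zeros(hidden_size)  # hidden state starts at zero

    def step(self, x):
        # New hidden state mixes the previous state with the current input.
        self.h = np.tanh(self.W_hh @ self.h + self.W_xh @ x)
        # Output is a linear readout of the hidden state.
        return self.W_hy @ self.h

rnn = VanillaRNN(input_size=4, hidden_size=8, output_size=4)
sequence = np.eye(4)[[0, 1, 2, 2]]  # one-hot vectors for 'h', 'e', 'l', 'l'
outputs = [rnn.step(x) for x in sequence]

# The two 'l' inputs are identical, but the hidden state differs between them.
same_input_same_output = np.allclose(outputs[2], outputs[3])
print(same_input_same_output)
```

The same input produces different outputs at steps 3 and 4 because `self.h` differs between the two calls, which is the "persistence of information" described earlier.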

~ Point to Take Away : Quite Simple !!

Challenge

Unstable Gradient Problem

-> "The gradient in deep neural networks is unstable, tending to either explode or vanish in earlier layers."

In at least some deep neural networks, the gradient tends to get smaller as we move backward through the hidden layers. This means that neurons in the earlier layers learn much more slowly than neurons in later layers.

Question: the more hidden layers, the better?

"Backpropagation computes gradients by chain rule -> This has the effect of multiplying n of these small numbers to compute gradients of the front layers in an n-layer network, meaning that the gradient (error signal) decreases exponentially with n and the front layers train very slowly."

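To make the exponential decay concrete, here is a small numeric sketch. It assumes the "small numbers" are derivatives of a sigmoid activation (the quote does not name the activation) and ignores the weight terms, so it shows a best-case bound rather than the gradient of any particular network.

```python
import numpy as np

def sigmoid_derivative(z):
    s = 1.0 / (1.0 + np.exp(-z))
    return s * (1.0 - s)

# The sigmoid derivative never exceeds 0.25, its value at z = 0.
max_factor = sigmoid_derivative(0.0)

# Best case: every layer contributes the maximum factor (weights ignored).
for n_layers in (1, 5, 10, 20):
    bound = max_factor ** n_layers
    print(f"{n_layers:2d} layers -> gradient scale <= {bound:.3e}")

bound_20 = max_factor ** 20
```

Even in this best case the factor reaching the first layer of a 20-layer stack is below 1e-12. An RNN unrolled over many time steps behaves like a deep network in this respect.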
2. Long Short Term Memory Network (LSTM)


- The most commonly used type of RNN, designed to address the above challenge
- Can learn to recognize context-sensitive languages
- The key is the cell state: it runs straight down the entire chain, with only minor linear interactions
- The cell state is updated through structures called gates
- There are three main types of gates:

  - Forget Gate Layer: a sigmoid layer that chooses what information to forget
  - Input Gate Layer: chooses which values to update and what new values to add
  - Output Gate Layer: filters the cell state to decide which values to output
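The three gates can be sketched as a single time step in NumPy. This follows the common LSTM formulation described in the colah.github.io post listed in the references. The function name `lstm_step`, the stacked weight layout, and the toy sizes are assumptions made for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x, h_prev, c_prev, W, b):
    """One LSTM step. W maps [h_prev, x] to the four stacked gate pre-activations."""
    hidden = h_prev.shape[0]
    z = W @ np.concatenate([h_prev, x]) + b
    f = sigmoid(z[0 * hidden:1 * hidden])  # forget gate: what to drop from the cell state
    i = sigmoid(z[1 * hidden:2 * hidden])  # input gate: which entries to update
    g = np.tanh(z[2 * hidden:3 * hidden])  # candidate values to add
    o = sigmoid(z[3 * hidden:4 * hidden])  # output gate: which part of the state to expose
    c = f * c_prev + i * g                 # cell state: elementwise scale and add only
    h = o * np.tanh(c)                     # filtered view of the cell state
    return h, c

rng = np.random.default_rng(0)
input_size, hidden_size = 4, 3             # toy sizes (assumption)
W = rng.normal(0.0, 0.1, (4 * hidden_size, hidden_size + input_size))
b = np.zeros(4 * hidden_size)
h, c = np.zeros(hidden_size), np.zeros(hidden_size)
for x in np.eye(input_size)[[0, 1, 2, 2]]:  # one-hot 'h', 'e', 'l', 'l'
    h, c = lstm_step(x, h, c, W, b)
print(h.shape, c.shape)
```

Note that the cell state `c` is updated only by an elementwise multiply and add. This is the "minor linear interactions" path that lets gradients flow across many steps.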

3. Sequence to Sequence Model


Seq-2-Seq Model consists of two RNNs: an encoder that processes the input and maps it to a vector, and a decoder that generates the output sequence of symbols from the vector representation. Specifically, the encoder maps a variable-length source sequence to a fixed-length vector, and the decoder maps the vector representation back to a variable-length target sequence of symbols. The two networks are trained jointly to maximize the conditional probability of the target sequence given a source sequence.

Each box in the picture above represents a cell of the RNN.
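The encoder/decoder split can be sketched with two small untrained RNNs. Everything here is an illustrative assumption: the parameter names, the toy vocabulary and hidden sizes, greedy decoding, and the fixed `max_len` cutoff (a real model would typically stop at an end-of-sequence symbol). Training is omitted.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab, hidden = 4, 8  # toy sizes (assumption)

# Untrained random parameters; a real model would learn these jointly.
enc_Wxh = rng.normal(0.0, 0.1, (hidden, vocab))
enc_Whh = rng.normal(0.0, 0.1, (hidden, hidden))
dec_Wxh = rng.normal(0.0, 0.1, (hidden, vocab))
dec_Whh = rng.normal(0.0, 0.1, (hidden, hidden))
dec_Why = rng.normal(0.0, 0.1, (vocab, hidden))

def encode(source_ids):
    """Run the encoder RNN and return its final hidden state."""
    h = np.zeros(hidden)
    for idx in source_ids:
        h = np.tanh(enc_Whh @ h + enc_Wxh @ np.eye(vocab)[idx])
    return h  # fixed-length vector regardless of source length

def decode(context, start_id, max_len):
    """Generate symbols greedily, starting from the encoder's vector."""
    h, prev, out = context, start_id, []
    for _ in range(max_len):
        h = np.tanh(dec_Whh @ h + dec_Wxh @ np.eye(vocab)[prev])
        prev = int(np.argmax(dec_Why @ h))  # feed the chosen symbol back in
        out.append(prev)
    return out

short_ctx = encode([0, 1])
long_ctx = encode([0, 1, 2, 2, 3])
generated = decode(long_ctx, start_id=0, max_len=3)
print(short_ctx.shape, long_ctx.shape, generated)
```

Sources of length 2 and 5 both map to a vector of the same shape, and the decoder can emit a target of any length from it.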


Example

Sample RNN / Seq-2-Seq Code


In [1]:

    # Written against TensorFlow r0.9; later releases moved or removed several of these APIs.
    import tensorflow as tf
    from tensorflow.models.rnn import rnn, rnn_cell
    import numpy as np

    char_rdic = ['h', 'e', 'l', 'o']  # id -> char
    char_dic = {w: i for i, w in enumerate(char_rdic)}  # char -> id
    sample = [char_dic[c] for c in "hello"]  # to index
    x_data = np.array([[1, 0, 0, 0],   # h
                       [0, 1, 0, 0],   # e
                       [0, 0, 1, 0],   # l
                       [0, 0, 1, 0]],  # l
                      dtype='f')

    # Configuration
    char_vocab_size = len(char_dic)
    rnn_size = 4  # char_vocab_size; 1 hot coding (one of 4)
    time_step_size = 4  # 'hell' -> predict 'ello'
    batch_size = 1  # one sample

    # RNN model (named `cell` so the rnn_cell module is not shadowed)
    cell = rnn_cell.BasicRNNCell(rnn_size)
    state = tf.zeros([batch_size, cell.state_size])
    X_split = tf.split(0, time_step_size, x_data)
    outputs, state = tf.nn.seq2seq.rnn_decoder(X_split, state, cell)
    print (state)
    print (outputs)

    # logits: list of 2D Tensors of shape [batch_size x num_decoder_symbols].
    # targets: list of 1D batch-sized int32 Tensors of the same length as logits.
    # weights: list of 1D batch-sized float-Tensors of the same length as logits.
    logits = tf.reshape(tf.concat(1, outputs), [-1, rnn_size])
    targets = tf.reshape(sample[1:], [-1])
    weights = tf.ones([time_step_size * batch_size])

    loss = tf.nn.seq2seq.sequence_loss_by_example([logits], [targets], [weights])
    cost = tf.reduce_sum(loss) / batch_size
    train_op = tf.train.RMSPropOptimizer(0.01, 0.9).minimize(cost)

    # Launch the graph in a session
    with tf.Session() as sess:
        # you need to initialize all variables
        tf.initialize_all_variables().run()
        for i in range(100):
            sess.run(train_op)
            # print the current prediction after every training step
            result = sess.run(tf.arg_max(logits, 1))
            print (result, [char_rdic[t] for t in result])


Output (the prediction is printed after every training step; each distinct prediction is shown once here, in order of first appearance, with consecutive repeats collapsed):

    Tensor("rnn_decoder/BasicRNNCell_3/Tanh:0", shape=(1, 4), dtype=float32)
    [<tf.Tensor 'rnn_decoder/BasicRNNCell/Tanh:0' shape=(1, 4) dtype=float32>,
     <tf.Tensor 'rnn_decoder/BasicRNNCell_1/Tanh:0' shape=(1, 4) dtype=float32>,
     <tf.Tensor 'rnn_decoder/BasicRNNCell_2/Tanh:0' shape=(1, 4) dtype=float32>,
     <tf.Tensor 'rnn_decoder/BasicRNNCell_3/Tanh:0' shape=(1, 4) dtype=float32>]
    (array([2, 0, 0, 0]), ['l', 'h', 'h', 'h'])
    (array([2, 0, 3, 0]), ['l', 'h', 'o', 'h'])
    (array([2, 2, 3, 0]), ['l', 'l', 'o', 'h'])
    (array([2, 2, 3, 2]), ['l', 'l', 'o', 'l'])
    (array([2, 2, 2, 2]), ['l', 'l', 'l', 'l'])
    (array([1, 2, 2, 2]), ['e', 'l', 'l', 'l'])
    (array([1, 2, 2, 3]), ['e', 'l', 'l', 'o'])
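To give a feel for what the training loop minimizes, the sketch below computes a weighted per-step softmax cross-entropy in plain NumPy. This is not TensorFlow's implementation. The helper `sequence_cross_entropy` is hypothetical, and the two sets of logits are made up to contrast an undecided model with one that predicts 'ello' confidently.

```python
import numpy as np

def sequence_cross_entropy(logits, targets, weights):
    """Weighted mean of per-step softmax cross-entropy (hypothetical helper)."""
    shifted = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    per_step = -log_probs[np.arange(len(targets)), targets]
    return float((per_step * weights).sum() / weights.sum())

char_rdic = ['h', 'e', 'l', 'o']
targets = np.array([1, 2, 2, 3])  # 'ello'
weights = np.ones(4)              # every time step counts equally

uniform_logits = np.zeros((4, 4))             # no preference among the 4 characters
confident_logits = 10.0 * np.eye(4)[targets]  # strongly favors the right character

loss_uniform = sequence_cross_entropy(uniform_logits, targets, weights)
loss_confident = sequence_cross_entropy(confident_logits, targets, weights)
decoded = ''.join(char_rdic[i] for i in confident_logits.argmax(axis=1))
print(loss_uniform, loss_confident, decoded)
```

The undecided model scores ln(4) per step, while the confident one scores close to zero and decodes to 'ello', matching the final line of the output above.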


4. RNN Applications

- Language Modeling

- Conversation Modeling / Question Answering

- Machine Translation

- Speech Recognition

- Image / Video Captioning

- Image / Music Generation

5. References

- http://colah.github.io/posts/2015-08-Understanding-LSTMs/
- https://en.wikipedia.org/wiki/Convolutional_neural_network
- https://en.wikipedia.org/wiki/Recurrent_neural_network
- http://www.wildml.com/2015/09/recurrent-neural-networks-tutorial-part-1-introduction-to-rnns/
- https://github.com/hunkim/ml
- https://www.tensorflow.org/versions/r0.9/tutorials/index.html
- http://karpathy.github.io/2015/05/21/rnn-effectiveness/
- http://arxiv.org/pdf/1409.3215v3.pdf
- http://arxiv.org/pdf/1406.1078v3.pdf

Code References:

- https://github.com/tensorflow/tensorflow/blob/master/tensorflow/python/kernel_tests/rnn_test.py
- https://github.com/hans/ipython-notebooks/blob/master/tf/TF%20tutorial.ipynb
- https://github.com/tensorflow/tensorflow/blob/master/tensorflow/models/rnn/ptb/ptb_word_lm.py
- https://gist.github.com/karpathy/d4dee566867f8291f086#file-min-char-rnn-py