LSTMs Overview - University of Texas at...

LSTMs Overview

Subhashini Venugopalan

Neural Networks

Hidden

Output

WHY RNNs/LSTMs?

Accepts only fixed size input e.g 224x224 images.

Performs a fixed number of computations (#layers).

Outputs a fixed size vector.

Hidden

Output

Can we operate over sequences of inputs?

Limitations of vanilla Neural Networks

Recurrent Neural Networks

Image Credit: Chris Olah

They are networks with loops. [Elman ‘90]

Un-Roll The Loop

Recurrent Neural Network “unrolled in time”

● Each time step has a layer with the same weights.● The repeating layer/module is a sigmoid or a tanh.● Learns to model (ht | x1, …, xt-1)

Simple RNNs

sigmoidor

sigmoid or tanh

Problems with Simple RNNs

● Can’t seem to handle “long-term dependencies” in practice● Gradients shrink through the many layers (Vanishing Gradients)

[Hochreiter ‘91][Bengio et. al. ‘94]Image Credit: Chris Olah

Long Short Term Memory (LSTMs)

Image Credit: Chris Olah[Hochreiter and Schmidhuber ‘97]

LSTM Unit

xt ht-1 xt ht-1

xt ht-1

Memory Cell

Output GateInput

Forget Gate

Input ModulationGate

Memory Cell:Core of the LSTM UnitEncodes all inputs observed

[Hochreiter and Schmidhuber ‘97][Graves ‘13]

LSTM Unit

xt ht-1 xt ht-1

xt ht-1

Memory Cell

Output GateInput

Forget Gate

Memory Cell:Core of the LSTM UnitEncodes all inputs observed

Gates:Input, Output and ForgetSigmoid [0,1]

LSTM Unit

xt ht-1 xt ht-1

xt ht-1

Memory Cell

Output GateInput

Forget Gate

Update the Cell state

Learns long-term dependencies

Can Model Sequences

● Can handle longer-term dependencies● Overcomes Vanishing Gradients problem● GRUs - Gated Recurrent Units is a much simpler variant which also

overcomes these issues.

LSTM LSTM LSTM LSTM

[Cho et. al. ‘14]

Putting Things Together

Image Credit: Sutskever et. al.

Encode a sequence of inputs to a vector.(ht | x1, …, xt-1)

Decode from the vector to a sequence of outputs.Pr(xt | x1, …, xt-1)

SOLVE A WIDER RANGE OF PROBLEMS

Image Captioning

Image Credit: Andrej Karpathy

Activity Recognition Sequence to Sequence

Vinyals et. al. ‘15,Donahue et. al. ‘15

Donahue et. al. ‘15 Machine TranslationSpeech RecognitionVideo DescriptionVQA, POS tagging, ...

Sutskever et. al. ‘14, Cho et. al. ‘14Graves & Jaitly ‘14V. et. al. ‘15, Li et. al. ‘153 of 4 papers to be discussed this class

Resources● Graves’ paper - LSTMs explanation. Generating sequences with recurrent

neural networks. Applications to handwriting and speech recognition.● Chris’ Blog - LSTM unit explanation.● Karpathy’s Blog - Applications.● Tensorflow and Caffe - Code examples.

Sequence to Sequence Video to Text

Subhashini Venugopalan, Marcus Rohrbach, Jeff Donahue Raymond Mooney, Trevor Darrell, Kate Saenko

Objective

A monkey is pulling a dog’s tail and is chased by the dog.

Encode

Recurrent Neural Networks (RNNs) can map a vector to a sequence.

[Donahue et al. CVPR’15]

[Sutskever et al. NIPS’14]

[Vinyals et al. CVPR’15]

EnglishSentence

RNN encoder

RNN decoder

FrenchSentence

Encode RNN decoder Sentence

Encode RNN decoder Sentence [Venugopalan et. al.

NAACL’15]

RNN decoder Sentence [Venugopalan et. al. ICCV’

15] (this work)

RNN encoder

S2VT Overview

LSTM LSTMLSTMLSTM LSTM LSTMLSTMLSTM

CNN CNN CNN CNN

A man is talking

...Encoding stage

Decoding stage

Now decode it to a sentence!

Sequence to Sequence - Video to Text (S2VT)S. Venugopalan, M. Rohrbach, J. Donahue, R. Mooney, T. Darrell, K. Saenko

Frames: RGB

1000 categories

Forward propagateOutput: “fc7” features

(activations before classification layer)

fc7: 4096 dimension “feature vector”

1. Train on Imagenet

2. Take activations from layer before classification

Frames: Flow

CNN (modified AlexNet)

101 Action Classes

CNNForward propagate

Output: “fc7” features(activations before classification layer)

fc7: 4096 dimension “feature vector”

1. Train CNN on Activity classes

3. Take activations from layer before classification

2. Use optical flow to extract flow images.

UCF 101

[T. Brox et. al. ECCV ‘04]

Dataset: Youtube

● A man is walking on a rope.● A man is walking across a rope.● A man is balancing on a rope.● A man is balancing on a rope at the beach.● A man walks on a tightrope at the beach.● A man is balancing on a volleyball net.● A man is walking on a rope held by poles● A man balanced on a wire.● The man is balancing on the wire.● A man is walking on a rope.● A man is standing in the sea shore.

● ~2000 clips● Avg. length: 11s per clip● ~40 sentence per clip● ~81,000 sentences

Results (Youtube)

Mean-Pool (VGG)

S2VT (RGB+Flow)

S2VT (randomized)

S2VT (RGB)

METEOR: MT metric. Considers alignment, para-phrases and similarity.

Evaluation: Movie Corpora

MPII-MD

● MPII, Germany● DVS alignment: semi-

automated and crowdsourced● 94 movies● 68,000 clips● Avg. length: 3.9s per clip● ~1 sentence per clip● 68,375 sentences

● Univ. of Montreal● DVS alignment: automated

speech extraction● 92 movies● 46,009 clips● Avg. length: 6.2s per clip● 1-2 sentences per clip● 56,634 sentences

Movie Corpus - DVS

Processed: Looking troubled, someone descends the stairs.

Someone rushes into the courtyard. She then puts a head scarf on ...

Results (MPII-MD Movie Corpus)

Best Prior Work[Rohrbach et al. CVPR’15]

Mean-Pool

S2VT (RGB)

Results (M-VAD Movie Corpus)

Best Prior Work[Yao et al. ICCV’15]

Mean-Pool

S2VT (RGB)

M-VAD: https://youtu.be/pER0mjzSYaM

● What are the advantages/drawbacks of this approach?○ End-to-end, annotations

● Detaching recognition and generation.● Why only METEOR (not BLEU or other metrics)?● Domain adaptation, Re-use RNNs (youtube -> movies, activity recognition)● Languages other than English.● Features apart from Optical Flow, RGB; temporal representation.

Discussion

Code and more exampleshttp://vsubhashini.github.io/s2vt.html

LSTMs Overview - University of Texas at...

Documents

Deep Q-learning for Active Recognition of GERMS: Baseline ...vision.cs.utexas.edu/381V-spring2016/slides/zhang-paper.pdfDeep Q-learning for Active Recognition of GERMS: Baseline performance

Temporal Segmentation of Egocentric Videosvision.cs.utexas.edu/381V-fall2016/slides/pradhan_paper.pdf · 2016-11-29 · Class 1 853K 58K 453K 999K 1053K 126K 94% 73% 72% 98% 86% 96%

Long-Short Term Memory and Other Gated RNNssrihari/CSE676/10.10 LSTM.pdf · Equation for LSTM internal state update •The LSTM cell internal state is updated as follows •But with

Object detection - University of Texas at Austinvision.cs.utexas.edu/381V-spring2016/slides/spring2016_detection.pdf · •Non-rigid, deformable objects not captured well with representations

DeViSE: A Deep Visual-Semantic Embedding Modelvision.cs.utexas.edu/381V-fall2016/slides/nagarajan...margin random incorrect class Inference - ZSL When a new image comes in: 1. Push

lecture1 intro spring2018 - University of Texas at Austinvision.cs.utexas.edu/376-spring2018/slides/lecture1-spring2018.pdf&6 &rpsxwhu 9lvlrq /hfwxuh 9lvlrq dqg judsklfv,pdjhv0rgho

Inferring Analogous Attributes - University of Texas at Austinvision.cs.utexas.edu/projects/inferring_analogous_attribute/inferring-analogous... · Inferring Analogous Attributes

Long-Short Term Memory and Other Gated RNNssrihari/CSE676/10.10 LSTM.pdf · Long-Short Term Memory and Other Gated RNNs ... 9. Leaky Units and ... is singular or plural so that we

3D ShapeNets: A Deep Representation for Volumetric Shape …vision.cs.utexas.edu/381V-spring2016/slides/sinha-paper.pdf · 2016-04-22 · Slide Credit: Wu, Song et al. 3D ShapeNets:

Temporal Segmentation of Egocentric Videosvision.cs.utexas.edu/381V-fall2017/slides/huang_paper.pdf · 2017. 10. 18. · Policeman UN Inspectors in Syria Google Glass . Video credit:

CS381V Experiment Presentationvision.cs.utexas.edu/381V-spring2016/slides/kuo-expt.pdf · The Paper • Indoor Segmentation and Support Inference from RGBD Images. N. Silberman, D

Object Classes by Between- Class Attribute Transfer ...vision.cs.utexas.edu/381V-spring2016/slides/sinha-expt.pdfTook 40 training and 19 test classes with 9 overlapping classes deer,

Grouping and segmentation - University of Texas at Austinvision.cs.utexas.edu/378h-fall2015/slides/lecture7.pdf · – Gestalt properties • Bottom-up segmentation via clustering

ShadowDraw - University of Texas at Austinvision.cs.utexas.edu/381V-fall2016/slides/priyadarshi-paper.pdf · • Shadow Creation by Weighting. Image Matching • Obtain Candidate

Long Short-Term Memory - Brigham Young Universityaxon.cs.byu.edu/~martinez/classes/778/Papers/lstm.pdf · Long Short-Term Memory M. Stanley Fujimoto CS778 – Winter 2016 30 Jan 2016

SEMANTIC IMAGE SEGMENTATION WITH DEEP CONVOLUTIONAL NETS AND FULLY CONNECTED …vision.cs.utexas.edu/381V-fall2016/slides/kelle_paper.pdf · 2016. 9. 22. · Paper by Chen, Papandreou,

What's In A Name?vision.cs.utexas.edu/381V-fall2016/slides/nelson-expt.pdf · Performed name and gender classification Also tested on ~400 randomly selected IMDB-Wiki images Dataset

Mid-level representations - University of Texas at Austinvision.cs.utexas.edu/381V-spring2016/slides/spring2016_midlevel.pdf · Matlab code available for these examples: ... Segmentation

Andrew Zisserman Mark Jaderberg, Karen Simonyan, Andrea ...vision.cs.utexas.edu/381V-spring2016/slides/lad-paper.pdf · Mark Jaderberg, Karen Simonyan, Andrea Vedaldi, Andrew Zisserman

Learning video saliency from human gaze using candidate ...vision.cs.utexas.edu/381V-spring2016/slides/bora-paper.pdfLearning video saliency from human gaze using candidate selection