Describing Videos by Exploiting Temporal Structure

Slides by Alberto Montes, Computer Vision Group, April 12th, 2016

[arXiv] [GitXiv] [video] [code]

Li Yao, Atousa Torabi, Kyunghyun Cho, Nicolas Ballas, Christopher Pal, Hugo Larochelle, Aaron Courville

Introduction

Goal: Generate captions from videos.

Video Description Generation Framework

Encoder-Decoder Framework
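Concretely, the decoder generates the caption one word at a time under the standard chain-rule factorization used by encoder-decoder captioning models (with $V$ the encoded video and $y_{<t}$ the words generated so far):

```latex
p(y_1, \dots, y_T \mid V) = \prod_{t=1}^{T} p\big(y_t \mid y_{<t},\, V\big)
```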

Encoder: Convolutional Neural Network

Basic approach:

Run a deep 2D CNN (GoogLeNet) over sampled frames and feed the resulting per-frame features to the decoder.
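A minimal sketch of this basic encoder, with a hypothetical random-projection stub standing in for the pretrained GoogLeNet: one feature vector per sampled frame, mean-pooled into a fixed-length video summary.

```python
import numpy as np

FRAME_SHAPE = (64, 64, 3)   # toy size; real frames would be e.g. 224x224x3
FEAT_DIM = 64               # GoogLeNet's pooled features would be 1024-d

# Stand-in for a pretrained 2D CNN: a fixed random projection of the
# flattened pixels. Hypothetical stub, not the paper's GoogLeNet.
rng = np.random.default_rng(0)
W_cnn = rng.standard_normal((FEAT_DIM, int(np.prod(FRAME_SHAPE))))

def cnn_features(frame):
    return W_cnn @ frame.ravel()

def encode_video(frames):
    """frames: list of FRAME_SHAPE arrays sampled from the video.
    Returns per-frame features V (kept for the attention decoder)
    and a mean-pooled vector (the 'basic' fixed-length summary)."""
    V = np.stack([cnn_features(f) for f in frames])   # (n_frames, FEAT_DIM)
    return V, V.mean(axis=0)

# Example: 26 sampled frames
V, v_summary = encode_video([np.zeros(FRAME_SHAPE) for _ in range(26)])
print(V.shape, v_summary.shape)                       # (26, 64) (64,)
```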

Decoder: Long Short-Term Memory Network

Long Short-Term Memory

At each step $t$, the decoder LSTM combines the previous word's embedding $E[y_{t-1}]$, the previous hidden state $h_{t-1}$, and the context $\varphi_t(V)$ from the encoder.

Forget gate:
$f_t = \sigma(W_f E[y_{t-1}] + U_f h_{t-1} + A_f \varphi_t(V) + b_f)$

Input gate layer:
$i_t = \sigma(W_i E[y_{t-1}] + U_i h_{t-1} + A_i \varphi_t(V) + b_i)$

New candidates for the cell state:
$\tilde{c}_t = \tanh(W_c E[y_{t-1}] + U_c h_{t-1} + A_c \varphi_t(V) + b_c)$

Update of the memory content:
$c_t = f_t \odot c_{t-1} + i_t \odot \tilde{c}_t$
$h_t = o_t \odot \tanh(c_t)$, with output gate $o_t = \sigma(W_o E[y_{t-1}] + U_o h_{t-1} + A_o \varphi_t(V) + b_o)$

$E$: word embedding matrix; $E[y_{t-1}]$: input (previous word); $h_{t-1}$: previous hidden state; $\varphi_t(V)$: context from the encoder; $W$, $U$, $A$: weight matrices; $b$: biases.
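A minimal numpy sketch of one such decoder step, with toy dimensions and randomly initialized parameters (a sketch of the standard LSTM update conditioned on the encoder context, not the paper's exact implementation):

```python
import numpy as np

rng = np.random.default_rng(0)
emb, hid, ctx = 16, 32, 64                    # toy dimensions

# One W/U/A weight matrix and bias per gate: input, forget, output, candidate.
p = {}
for g in "ifoc":
    p["W_" + g] = rng.standard_normal((hid, emb)) * 0.1
    p["U_" + g] = rng.standard_normal((hid, hid)) * 0.1
    p["A_" + g] = rng.standard_normal((hid, ctx)) * 0.1
    p["b_" + g] = np.zeros(hid)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(y_emb, h_prev, c_prev, phi):
    """One decoder step: previous word embedding y_emb, previous state
    (h_prev, c_prev), and encoder context phi = phi_t(V)."""
    z = {g: p["W_" + g] @ y_emb + p["U_" + g] @ h_prev
            + p["A_" + g] @ phi + p["b_" + g] for g in "ifoc"}
    i, f, o = sigmoid(z["i"]), sigmoid(z["f"]), sigmoid(z["o"])
    c = f * c_prev + i * np.tanh(z["c"])      # update memory content
    h = o * np.tanh(c)
    return h, c

h, c = lstm_step(np.zeros(emb), np.zeros(hid), np.zeros(hid), np.zeros(ctx))
```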

Exploiting Temporal Structure

Exploiting Local Features

● Trained for activity recognition.
● Only the convolutional layers are used.

The 3D CNN input stacks three hand-crafted motion descriptors over time:

Histograms of Oriented Gradients (HOG)

Histograms of Optical Flow (HOF)

Motion Boundary Histogram (MBH)

A Spatio-Temporal Convolutional Neural Network (3D CNN)
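A toy sketch of a spatio-temporal convolution in this spirit, assuming PyTorch (the paper's actual 3D CNN is pretrained for activity recognition on the stacked HOG/HOF/MBH maps, and only its convolutional layers are reused):

```python
import torch
import torch.nn as nn

class Local3DEncoder(nn.Module):
    """Toy 3D CNN: convolves jointly over time and space.
    Input: (batch, channels, time, H, W), where the channels would hold
    the stacked HOG/HOF/MBH descriptor maps."""
    def __init__(self, in_channels=3, feat_dim=64):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv3d(in_channels, 16, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.MaxPool3d(2),                   # halve the time and space axes
            nn.Conv3d(16, feat_dim, kernel_size=3, padding=1),
            nn.ReLU(),
        )

    def forward(self, clip):
        h = self.conv(clip)                    # (B, feat_dim, T', H', W')
        return h.mean(dim=(2, 3, 4))           # pooled motion feature (B, feat_dim)

clip = torch.zeros(1, 3, 16, 32, 32)           # 16 frames of 32x32 descriptor maps
feat = Local3DEncoder()(clip)                  # (1, 64)
```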

Exploiting Global Structure

Attention Mechanism

Update of attention weights:
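The update shown here is the paper's soft temporal attention: a relevance score $e_i^{(t)} = w^\top \tanh(W_a h_{t-1} + U_a v_i + b_a)$ is computed for each frame feature $v_i$, normalized with a softmax into weights $\alpha_i^{(t)}$, and the context is the weighted average $\varphi_t(V) = \sum_i \alpha_i^{(t)} v_i$. A minimal numpy sketch with toy, randomly initialized parameters:

```python
import numpy as np

rng = np.random.default_rng(0)
feat_dim, hid_dim, att_dim, n_frames = 64, 32, 32, 26

# Attention parameters (randomly initialized for the sketch)
W_a = rng.standard_normal((att_dim, hid_dim)) * 0.1
U_a = rng.standard_normal((att_dim, feat_dim)) * 0.1
w   = rng.standard_normal(att_dim) * 0.1
b_a = np.zeros(att_dim)

def attention_context(h_prev, V):
    """h_prev: (hid_dim,) previous decoder state.
    V: (n_frames, feat_dim) per-frame features from the encoder.
    Returns the context phi_t(V) and the attention weights alpha."""
    # Relevance score e_i = w^T tanh(W_a h_{t-1} + U_a v_i + b_a)
    e = np.tanh(W_a @ h_prev + V @ U_a.T + b_a) @ w     # (n_frames,)
    alpha = np.exp(e - e.max())
    alpha /= alpha.sum()                                # softmax over frames
    return alpha @ V, alpha                             # phi_t(V), weights

phi, alpha = attention_context(np.zeros(hid_dim),
                               rng.standard_normal((n_frames, feat_dim)))
```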

Experiments

Datasets

YouTube2Text

1,970 video clips, each with multiple descriptions

Training set: 1,200 video clips

Validation set: 100 video clips

Test set: 670 video clips

DVS (Descriptive Video Service)

Videos taken from DVDs

49,000 video clips

Training set: 39,000 video clips

Validation set: 5,000 video clips

Test set: 5,000 video clips

Setup and Training

4 setups:

◉ Basic (2D GoogLeNet CNN)
◉ Local (+ 3D CNN features)
◉ Global (+ temporal attention mechanism)
◉ Local + Global

Training

- Adadelta optimizer
- Loss function: negative log-likelihood of the reference caption, $\mathcal{L} = -\sum_{t=1}^{T} \log p(y_t \mid y_{<t}, V)$
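A minimal sketch of that loss, assuming the decoder's per-step softmax outputs have already been computed:

```python
import numpy as np

def caption_nll(probs, target_ids):
    """Negative log-likelihood of the reference caption.
    probs: (T, vocab_size) softmax output of the decoder at each step.
    target_ids: (T,) indices of the ground-truth words y_1..y_T."""
    step_probs = probs[np.arange(len(target_ids)), target_ids]
    return -np.log(step_probs).sum()

# Example: a 5-word caption over a 3-word toy vocabulary
probs = np.full((5, 3), 1.0 / 3.0)
print(caption_nll(probs, np.array([0, 1, 2, 1, 0])))   # 5 * log(3) ~ 5.49
```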

Results

Evaluation

Conclusions

● Proposed a 3D CNN to capture local, fine-grained motion information.

● Proposed a temporal attention mechanism to capture global temporal structure.

● The combination of both approaches achieves state-of-the-art results on YouTube2Text.

Thank you! Questions?
