Temporal Activity Detection in Untrimmed Videos with Recurrent Neural Networks



Alberto Montes

July 15th, 2016

Advisors: Xavi Giró, Amaia Salvador

Outline

1. Introduction
2. Related Work
3. Methodology
4. Results
5. Conclusions and Future Work

Motivation

Problem Definition

Videos → two tasks:

● Activity Classification: Longboarding
● Activity Temporal Localization: Longboarding

How? A neural network (CNN + RNN) maps each video to an activity label.

Dataset

ActivityNet: Large-Scale Activity Recognition Challenge

Stats:

● 19,994 videos
● 200 activity classes
● 660 hours of video
● 313 hours of activities
● 65.6 million frames

Outline

1. Introduction
2. Related Work
3. Methodology
4. Results
5. Conclusions and Future Work

Literature Approaches

CNN + RNN → Activity

Convolutional Neural Network

[Figure: convolutional layer]

Recurrent Neural Network

[Figure: unrolled recurrent network with cell states c0, c1, c2]


3D Convolution (C3D)

● 16-frame video clip as input
● 80 million parameters
● 3×3×3 filter size at all conv layers

Tran, D., Bourdev, L., Fergus, R., Torresani, L., & Paluri, M. "Learning spatiotemporal features with 3D convolutional networks." ICCV 2015, pp. 4489-4497.


Segment Proposals

Shou, Z., Wang, D., & Chang, S. F. "Temporal Action Localization in Untrimmed Videos via Multi-stage CNNs." CVPR 2016.


RNN for Activity Localization

Yeung, S., Russakovsky, O., Mori, G., & Fei-Fei, L. "End-to-end Learning of Action Detection from Frame Glimpses in Videos." CVPR 2016.

Outline

1. Introduction
2. Related Work
3. Methodology
4. Results
5. Conclusions and Future Work

Architecture Overview

[Figure: the network processes the video in 16-frame clips; for each clip it outputs scores over 200 activities + background]

Outline

3. Methodology
   a. Extracting C3D Features
   b. Audio Features
   c. Network Architecture
   d. Training Methodology
   e. Post-Processing

C3D Network

[Figure: the pre-trained C3D network (Caffe implementation) maps each 16-frame clip to a feature vector]
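As a minimal sketch of this extraction step (assuming a hypothetical `c3d_forward` callable that wraps the Caffe model's forward pass):

```python
import numpy as np

def extract_clip_features(frames, c3d_forward, clip_len=16):
    """Split a video into non-overlapping 16-frame clips and run each one
    through the pre-trained C3D network.

    `c3d_forward` is a hypothetical callable wrapping the Caffe forward
    pass, mapping a (clip_len, H, W, 3) clip to a fixed-size vector."""
    n_clips = len(frames) // clip_len          # trailing partial clip dropped
    feats = [c3d_forward(frames[i * clip_len:(i + 1) * clip_len])
             for i in range(n_clips)]
    return np.stack(feats)                     # shape: (n_clips, feature_dim)
```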


Audio Features

[Figure: audio features (MFCC and spectral) are concatenated with the C3D video features to form the recurrent neural network input]

Audio features provided by Ignasi Esquerra.


Network Architecture

[Figures: the architecture stacks recurrent layers over the clip feature sequence; the final variant is an LSTM with previous output feedback]
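A rough sketch of the clip-level classifier in today's tf.keras (the original code used the Keras of 2016; the 4096-d feature size and everything beyond the 512-unit LSTM are assumptions):

```python
import tensorflow as tf

NUM_CLASSES = 201  # 200 activities + background
FEAT_DIM = 4096    # assumed C3D fc6 feature size

# Sketch: a single 512-unit LSTM reads the sequence of C3D clip features
# and a time-distributed softmax emits a class distribution for every clip.
model = tf.keras.Sequential([
    tf.keras.layers.LSTM(512, return_sequences=True,
                         input_shape=(None, FEAT_DIM)),
    tf.keras.layers.TimeDistributed(
        tf.keras.layers.Dense(NUM_CLASSES, activation="softmax")),
])
model.compile(optimizer="rmsprop", loss="categorical_crossentropy")
```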


Training Methodology

Categorical cross-entropy loss.

For unbalanced data, a weighted loss: of the 660 hours of video, only 313 hours are activities, so the background class dominates.
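One standard weighting scheme, sketched here under the assumption of inverse-frequency class weights (the thesis' exact weights may differ):

```python
import numpy as np

def weighted_categorical_crossentropy(y_true, y_pred, class_weights, eps=1e-8):
    """Categorical cross-entropy with per-class weights, e.g. inverse class
    frequency, so the over-represented background class does not dominate.

    y_true: one-hot targets, shape (n, num_classes)
    y_pred: predicted probabilities, same shape
    class_weights: shape (num_classes,)
    """
    per_sample = -np.sum(class_weights * y_true * np.log(y_pred + eps), axis=-1)
    return per_sample.mean()
```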


Classification Post-Processing

[Figure: the per-clip probabilities over background + 200 activities for clips 1..N are averaged over the whole video, and the activity with maximum probability is selected]
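A minimal sketch of this average-then-argmax step, assuming the background class sits at index 0:

```python
import numpy as np

def classify_video(clip_probs):
    """clip_probs: (n_clips, 201) per-clip class probabilities, with the
    background class assumed at index 0. Average over clips, skip the
    background column, and return the most probable activity."""
    mean_probs = clip_probs.mean(axis=0)
    activity = int(np.argmax(mean_probs[1:])) + 1
    return activity, float(mean_probs[activity])
```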

Detection Post-Processing

[Figures: the per-clip activity probabilities are smoothed over time with a mean filter of k samples, then thresholded at γ to obtain the detected activity segments]
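A sketch of the smoothing-and-thresholding step; the clip duration in seconds is an assumption (16 frames at ~30 fps) and should be set from the actual frame rate:

```python
import numpy as np

def detect_segments(activity_prob, k=5, gamma=0.2, clip_seconds=16 / 30.0):
    """Smooth a per-clip probability signal with a k-sample mean filter,
    threshold at gamma, and return the resulting (start_s, end_s) segments.

    clip_seconds assumes 16-frame clips at ~30 fps; set it from the video.
    """
    smooth = np.convolve(activity_prob, np.ones(k) / k, mode="same")
    active = smooth >= gamma
    segments, start = [], None
    for i, flag in enumerate(active):
        if flag and start is None:
            start = i
        elif not flag and start is not None:
            segments.append((start * clip_seconds, i * clip_seconds))
            start = None
    if start is not None:
        segments.append((start * clip_seconds, len(active) * clip_seconds))
    return segments
```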

Outline

1. Introduction
2. Related Work
3. Methodology
4. Results
5. Conclusions and Future Work

Classification: Audio Features

mAP = 0.5938 (video only) vs. mAP = 0.5755 (video + audio)

Music unrelated to the activity is often added to the videos in post-production, causing a decrease in performance when audio and video features are combined.

Classification: Depth Analysis

mAP = 0.5938 · mAP = 0.5492 · mAP = 0.5635

Deeper networks show overfitting.

Classification Results Per Activity

Best classified activities:
● Using the Pommel Horse
● Sailing
● Playing Ice Hockey
● Rock Climbing
● BMX

Worst classified activities:
● Drinking Coffee
● Peeling Potatoes
● Having an Ice Cream
● Rock-Paper-Scissors
● Polishing Shoes

Top Level Classification


Detection

mAP = 0.2251 (no feedback) vs. mAP = 0.2067 (with feedback)

The model with output feedback did not improve results.

Training with feedback

[Figures: the 512-unit LSTM input concatenates the video features with the previous step's output. When training, the previous ground truth is fed back (e.g. 0 0 1 0 0 0); when testing, the previous prediction (e.g. 0 0.1 0.6 0.2 0.1 0)]
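A sketch of building those concatenated inputs (a hypothetical helper; `prev_outputs` holds ground-truth one-hots at train time and the model's predictions at test time):

```python
import numpy as np

def feedback_inputs(video_feats, prev_outputs):
    """Concatenate each timestep's video features with the previous step's
    output distribution: ground-truth one-hots at training time, the
    model's own predictions at test time. The first step gets all zeros."""
    shifted = np.zeros_like(prev_outputs)
    shifted[1:] = prev_outputs[:-1]
    return np.concatenate([video_feats, shifted], axis=-1)
```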

Comparing Post-Processing

Grid search over the smoothing filter size k and the activity threshold γ for the optimal parameters.
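A sketch of that search (the candidate grids are illustrative, not the thesis' values; `evaluate` stands for a validation-set scorer such as detection mAP):

```python
from itertools import product

def grid_search(evaluate, ks=(3, 5, 7, 9), gammas=(0.1, 0.2, 0.3, 0.4)):
    """Return the (k, gamma) pair maximizing `evaluate`, a caller-supplied
    scorer (e.g. detection mAP on the validation set)."""
    return max(product(ks, gammas), key=lambda kg: evaluate(*kg))
```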

Detection Results per Activity

Best detected activities:
● Windsurfing
● Riding Bumper Cars
● Playing Racquetball
● Using the Pommel Horse
● Using Parallel Bars

Worst detected activities:
● Drinking Coffee
● Putting on Shoes
● Rock-Paper-Scissors
● Removing Curlers
● Smoking a Cigarette

Top Level Detection


Qualitative Evaluation

Ground Truth: Playing water polo
Prediction:
● 0.765 Playing water polo
● 0.202 Swimming
● 0.007 Springboard diving

Ground Truth: Hopscotch
Prediction:
● 0.848 Running a marathon
● 0.023 Triple jump
● 0.022 Javelin throw


Challenge Results

Classification Task (24 participants), mAP*:

● Winner: 93.23%
● Average performance: 66.26%
● UPC Team: 58.74%
● Baseline: 42.20%

Detection Task (6 participants), mAP*:

● Winner: 42.47%
● Average performance: 29.94%
● UPC Team: 22.36%
● Baseline: 9.70%

* Results over the test subset. Slide design by Issey Masuda.

Outline

1. Introduction
2. Related Work
3. Methodology
4. Results
5. Conclusions and Future Work

Conclusions

Classification: Longboarding
Detection: 42.7s – 193.5s Longboarding

[Figure: two-stream architecture (Video → Spatial Net + Temporal Net → Output), the winning entry for the ActivityNet Classification task]

Wang, L., et al. "Towards good practices for very deep two-stream ConvNets." arXiv preprint arXiv:1507.02159 (2015).


Best results were obtained for sport categories, due to C3D being pretrained on the Sports-1M dataset.

Future Work: E2E Training

Training the whole pipeline end-to-end would reduce the bias towards sport categories.

Future Work: Attention Models

[Figure: temporal attention filters over the neural network's input sequence]

Challenge Submission


Open-Sourced Contributions

github.com/imatge-upc/activitynet-2016-cvprw

Thank you for your attention

Questions?

Support Slides

Metrics

Classification: Hit@3
Detection: IoU (Intersection over Union)
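A minimal sketch of both metrics:

```python
def temporal_iou(a, b):
    """Intersection over Union of two temporal segments (start_s, end_s)."""
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union > 0 else 0.0

def hit_at_3(ranked_labels, true_label):
    """1 if the ground-truth label is among the top-3 predictions."""
    return int(true_label in ranked_labels[:3])
```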

Smoothing Effect Comparison


Post-Processing Effect

[Plots: detection performance as a function of the smoothing filter size k and of the activity threshold γ]

Activity Durations

AP and Video Appearance Correlation

Preparing Data

[Figures: each video becomes a sequence of clip feature vectors aligned with a sequence of activity labels over time; the sequences are cut into fixed-length timesteps and arranged into batches, which bounds how far gradients propagate]
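A sketch of that chunking, with an illustrative `timesteps=20` (not necessarily the thesis' value):

```python
import numpy as np

def make_batches(features, labels, timesteps=20):
    """Cut one video's (n_clips, feat_dim) features and per-clip one-hot
    labels into chunks of `timesteps` clips; with a stateful RNN the
    chunks of a video are fed in order so state carries across them."""
    n_chunks = len(features) // timesteps      # trailing partial chunk dropped
    f = features[:n_chunks * timesteps].reshape(n_chunks, timesteps, -1)
    y = labels[:n_chunks * timesteps].reshape(n_chunks, timesteps, -1)
    return f, y
```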

Gathering Audio Features

[Figures: MFCC features are computed every 10 ms; for each 16-frame clip they are aggregated into mean and standard deviation vectors and combined with spectral features]
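A sketch of the MFCC aggregation using librosa (an assumption; the thesis' audio toolchain is not specified here, and the spectral features are omitted):

```python
import librosa
import numpy as np

def clip_audio_features(y, sr, clip_dur):
    """Aggregate frame-level MFCCs into one vector per video clip: the
    mean and standard deviation of the MFCC frames inside the clip."""
    hop = int(0.010 * sr)                      # ~10 ms hop, as in the slides
    mfcc = librosa.feature.mfcc(y=y, sr=sr, hop_length=hop)   # (n_mfcc, n_frames)
    frames_per_clip = int(clip_dur * sr / hop)
    feats = []
    for start in range(0, mfcc.shape[1] - frames_per_clip + 1, frames_per_clip):
        chunk = mfcc[:, start:start + frames_per_clip]
        feats.append(np.concatenate([chunk.mean(axis=1), chunk.std(axis=1)]))
    return np.stack(feats)                     # (n_clips, 2 * n_mfcc)
```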

Convolutional Neural Network

[Figures: convolutional layer, pooling layer, fully-connected layer]

Qualitative Evaluation