Temporal Action Localization in Untrimmed Videos via Multi-Stage CNNs

Slides by Alberto Montes

Computer Vision Group Reading Group, June 13th, 2016

Zheng Shou, Dongang Wang and Shih-Fu Chang

[arXiv] https://arxiv.org/abs/1601.02129

[code] https://github.com/zhengshou/scnn

https://github.com/imatge-upc/readcv

Introduction

Previous Work

Improved Dense Trajectory (iDT)

Fisher Vector

2D Convolution

Segment-CNN

Problem Definition

Video:

frame # frames

Annotations:

Candidates:

action category

action category

start and ending frame

Multi-Scale Segment Generation

◉ Each frame resized to 171x128 pixels

◉ Temporal sliding windows:

○ 16, 32, 64, 128, 256, 512 frames

○ 75% overlap

◉ Construct segment s by uniformly sampling 16 frames
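
The sliding-window scheme above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' code: the function name `generate_segments`, the half-open `[start, end)` convention, and the choice to sample every `length // 16`-th frame as the "uniform sampling" are all assumptions.

```python
WINDOW_LENGTHS = (16, 32, 64, 128, 256, 512)  # window lengths in frames (from the slide)
OVERLAP = 0.75                                 # 75% overlap between consecutive windows
FRAMES_PER_SEGMENT = 16                        # frames sampled from every window


def generate_segments(num_frames):
    """Return a list of (start, end, frame_indices) for one video.

    start is inclusive and end is exclusive (an assumed convention).
    """
    segments = []
    for length in WINDOW_LENGTHS:
        if length > num_frames:
            continue  # window does not fit in this video
        stride = int(length * (1 - OVERLAP))  # 75% overlap -> step of length / 4
        step = length // FRAMES_PER_SEGMENT   # spacing between sampled frames
        for start in range(0, num_frames - length + 1, stride):
            indices = [start + i * step for i in range(FRAMES_PER_SEGMENT)]
            segments.append((start, start + length, indices))
    return segments


if __name__ == "__main__":
    for start, end, indices in generate_segments(64)[-3:]:
        print(start, end, indices)
```

Every segment ends up with exactly 16 frames regardless of window length, which is what lets a single fixed-input network process all scales.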

Network Architecture

C3D Network

Training Proposal and Classification Network

◉ lr=0.0001 except fc8 lr=0.01, momentum=0.9, weight decay factor=0.0005

◉ Drop lr by factor of 2 every 10K iterations

Proposal Network:

● fc8: 2 nodes

Classification Network:

● fc8: K+1 nodes
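
As a rough illustration of the settings above, the step schedule and the fc8 sizes can be written down as plain Python. The helper names `learning_rate` and `fc8_nodes` are hypothetical, and the sketch assumes the "+1" in K+1 is the background class.

```python
BASE_LR = 1e-4        # learning rate for all layers except fc8
FC8_LR = 1e-2         # learning rate for fc8
MOMENTUM = 0.9
WEIGHT_DECAY = 5e-4
STEP_ITERATIONS = 10_000
DROP_FACTOR = 2.0     # lr is divided by this every STEP_ITERATIONS


def learning_rate(initial_lr, iteration):
    """Step schedule: divide the rate by DROP_FACTOR every STEP_ITERATIONS."""
    return initial_lr / (DROP_FACTOR ** (iteration // STEP_ITERATIONS))


def fc8_nodes(network, num_classes):
    """Output size of fc8 for the two networks named on the slide."""
    if network == "proposal":
        return 2                  # two outputs, as stated on the slide
    if network == "classification":
        return num_classes + 1    # K actions plus (assumed) background
    raise ValueError(f"unknown network: {network}")


if __name__ == "__main__":
    for it in (0, 10_000, 25_000):
        print(it, learning_rate(BASE_LR, it), learning_rate(FC8_LR, it))
    print(fc8_nodes("classification", 20))  # 20 categories in THUMOS 2014
```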

Localization Network

Add Custom Loss function


true class label

overlap sensitivity

Try to boost segments with high overlap

Works best with: λ = 1, α = 0.25
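
A minimal NumPy sketch of such a loss follows. It assumes the form L = L_softmax + λ · L_overlap, where L_overlap averages 0.5 · (P² / v^α − 1) over the segments whose true class is not background, P being the predicted probability of the true class and v the segment's overlap (IoU) with the ground truth. That form is an assumption taken from the paper rather than from the slide, and `localization_loss` is a hypothetical name.

```python
import numpy as np


def localization_loss(probs, labels, overlaps, lam=1.0, alpha=0.25):
    """Softmax loss plus an overlap term (assumed form, see lead-in).

    probs:    (N, K+1) softmax outputs, column 0 = background
    labels:   (N,) true class labels, 0 = background
    overlaps: (N,) IoU with the matched ground truth; ignored for background
    """
    probs = np.asarray(probs, dtype=float)
    labels = np.asarray(labels, dtype=int)
    overlaps = np.asarray(overlaps, dtype=float)

    n = len(labels)
    p_true = probs[np.arange(n), labels]       # probability of the true class
    softmax_loss = -np.log(p_true).mean()

    is_action = labels > 0
    # Background rows contribute 0; use v = 1 there to avoid dividing by 0.
    safe_v = np.where(is_action, overlaps, 1.0)
    overlap_terms = 0.5 * (p_true ** 2 / safe_v ** alpha - 1.0) * is_action
    overlap_loss = overlap_terms.mean()

    return softmax_loss + lam * overlap_loss


if __name__ == "__main__":
    probs = [[0.2, 0.8], [0.9, 0.1]]
    print(localization_loss(probs, labels=[1, 0], overlaps=[1.0, 0.0]))
    print(localization_loss(probs, labels=[1, 0], overlaps=[0.5, 0.0]))
```

With the same predicted probability, a segment with smaller overlap yields a larger loss under this form, so confident predictions are only rewarded on well-aligned segments.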


Learning target:


Prediction and Post-processing

◉ Keep segments with P_pro > 0.7

◉ Remove background segments

◉ Multiply P_loc by the class-specific frequency of occurrence of each window length in the training data, to leverage window length distribution patterns

◉ NMS based on P_loc to remove redundancy (overlap threshold: θ - 0.1)
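
The post-processing steps can be sketched as below. This is an illustration under assumptions, not the released code: the candidate dictionary keys, the `length_freq` lookup table, label 0 meaning background, and running NMS separately per class are all choices made for the example.

```python
def temporal_iou(a, b):
    """Intersection over union of two (start, end) intervals."""
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union > 0 else 0.0


def nms(segments, scores, threshold):
    """Greedy NMS: keep the best-scoring segment, drop overlapping ones."""
    order = sorted(range(len(segments)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        if all(temporal_iou(segments[i], segments[j]) <= threshold for j in keep):
            keep.append(i)
    return keep


def post_process(candidates, length_freq, theta, p_pro_min=0.7):
    """candidates: dicts with keys start, end, p_pro, label, p_loc.
    length_freq: {(label, window_length): frequency in the training data}.
    Returns a list of (label, start, end, score)."""
    scored = []
    for c in candidates:
        if c["p_pro"] <= p_pro_min or c["label"] == 0:
            continue  # low proposal score or background
        length = c["end"] - c["start"]
        score = c["p_loc"] * length_freq.get((c["label"], length), 0.0)
        scored.append((c, score))

    results = []
    for label in sorted({c["label"] for c, _ in scored}):
        group = [(c, s) for c, s in scored if c["label"] == label]
        segs = [(c["start"], c["end"]) for c, _ in group]
        scores = [s for _, s in group]
        for i in nms(segs, scores, theta - 0.1):
            results.append((label, segs[i][0], segs[i][1], scores[i]))
    return results
```

Setting the NMS threshold slightly below θ means two detections that survive NMS cannot both match the same ground-truth instance by a wide margin, which limits duplicate detections.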

Experiments

Datasets

MEXaction2

“Bull Charge Cape” and “Horse Riding” videos

77 hours of videos

Training set: 1336 instances

Validation set: 310 instances

Test set: 329 instances

THUMOS 2014

Temporal Action Detection Task

20 categories

Training set: 2755 videos

Validation set: 1010 videos and 3007 instances

Test set: 1574 videos and 3358 instances

Results MEXaction2

DTF: Dense Trajectory Features + SVM

Evaluation

Impact of individual networks:

Conclusions

Propose a multi-stage framework, Segment-CNN, to address temporal action localization

Thank you! Questions?