Temporal Action Localization in Untrimmed Videos via Multi-Stage CNNs
Slides by Alberto Montes
Computer Vision Group Reading Group, June 13th, 2016
[arXiv] [code]
Zheng Shou, Dongang Wang and Shih-Fu Chang
Introduction
Previous Work
Improved Dense Trajectory (iDT)
Fisher Vector
2D Convolution
Segment-CNN
Problem Definition
Video:
frame # frames
Annotations:
Candidates:
action category
action category, start and ending frame
Multi-Scale Segment Generation
◉ Each frame resized to 171x128 pixels
◉ Temporal sliding windows:
○ 16, 32, 64, 128, 256, 512 frames
○ 75% overlap
◉ Construct segment s by uniformly sampling 16 frames
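The segment generation step above can be sketched as follows. This is a minimal illustration, not the authors' code: the function name `generate_segments`, the half-open `(start, end)` convention and the choice to skip windows longer than the video are assumptions. Frame resizing to 171x128 is left out.

```python
import numpy as np

WINDOW_LENGTHS = (16, 32, 64, 128, 256, 512)  # window sizes listed on the slide
OVERLAP = 0.75                                 # 75% overlap between neighbouring windows
FRAMES_PER_SEGMENT = 16                        # frames sampled from each window


def generate_segments(num_frames):
    """Return a list of (start, end, frame_indices) for a video of num_frames frames.

    start is inclusive and end is exclusive. Windows longer than the video
    are skipped (an assumption; the slide does not cover short videos).
    """
    segments = []
    for length in WINDOW_LENGTHS:
        stride = int(length * (1 - OVERLAP))  # 75% overlap means a stride of length / 4
        for start in range(0, num_frames - length + 1, stride):
            # Uniformly spread 16 frame indices over the window.
            idx = np.linspace(start, start + length - 1, FRAMES_PER_SEGMENT)
            segments.append((start, start + length, np.round(idx).astype(int)))
    return segments
```

For a 64-frame video this yields 13 windows of 16 frames, 5 of 32 frames and 1 of 64 frames, each reduced to 16 sampled frame indices.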
Network Architecture
C3D Network
Training Proposal and Classification Network
◉ lr=0.0001 except fc8 lr=0.01, momentum=0.9, weight decay factor=0.0005
◉ Drop lr by factor of 2 every 10K iterations
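A small sketch of the step schedule described above, assuming that "drop by a factor of 2" means dividing the learning rate by 2 at every 10K-iteration boundary. The helper name `learning_rate` is hypothetical; momentum and weight decay are listed only as constants.

```python
BASE_LR = 1e-4        # all layers except fc8 (from the slide)
FC8_LR = 1e-2         # fc8 (from the slide)
MOMENTUM = 0.9        # from the slide
WEIGHT_DECAY = 5e-4   # from the slide
DROP_FACTOR = 2       # assumed to mean "divide by 2"
DROP_EVERY = 10_000   # iterations between drops


def learning_rate(iteration, base_lr):
    """Step schedule: divide base_lr by DROP_FACTOR once per DROP_EVERY iterations."""
    return base_lr / (DROP_FACTOR ** (iteration // DROP_EVERY))
```

Under this reading, fc8 trains at 0.01 for the first 10K iterations, 0.005 for the next 10K, and so on, while the other layers follow the same pattern starting from 0.0001.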
Proposal Network:
● fc8: 2 nodes
Classification Network:
● fc8: K+1 nodes
Localization Network
Add Custom Loss function
true class label
overlap sensitivity
Try to boost segments with high overlap
Works best with: λ = 1, α = 0.25
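The loss formula itself did not survive in these notes. The sketch below assumes the form recalled from the Shou et al. paper: the usual softmax loss plus λ times an overlap term, where the overlap term averages 0.5 * (p^2 / v^α - 1) over the batch for non-background segments, with p the predicted probability of the true class and v the segment's overlap (IoU) with the ground truth. Treat the exact form, the function name and the "label 0 is background" convention as assumptions.

```python
import numpy as np


def localization_loss(probs, labels, overlaps, lam=1.0, alpha=0.25):
    """Softmax loss plus an overlap-sensitive term (assumed form, see lead-in).

    probs:    (N, K+1) softmax outputs, column 0 assumed to be background
    labels:   (N,) true class labels, 0 for background
    overlaps: (N,) IoU of each segment with its ground-truth instance
    """
    probs = np.asarray(probs, dtype=float)
    labels = np.asarray(labels)
    overlaps = np.asarray(overlaps, dtype=float)
    n = len(labels)

    p_true = probs[np.arange(n), labels]       # probability of the true class
    softmax_loss = -np.mean(np.log(p_true))

    is_action = labels > 0                     # overlap term ignores background
    # Background rows are masked out below; use 1.0 there to avoid dividing by zero.
    safe_overlap = np.where(is_action, overlaps, 1.0)
    overlap_term = 0.5 * (p_true ** 2 / safe_overlap ** alpha - 1.0)
    overlap_loss = np.sum(overlap_term * is_action) / n

    return softmax_loss + lam * overlap_loss
```

With the same predicted probability, a segment with higher overlap gets a lower loss, which matches the stated goal of boosting segments with high overlap.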
Learning target:
Prediction and Post-processing
◉ Keep segments with P_pro > 0.7
◉ Remove background segments
◉ Multiply P_loc by the class-specific frequency of occurrence of each window length in the training data, to leverage window length distribution patterns
◉ NMS based on P_loc to remove redundancy (θ - 0.1)
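The post-processing steps above can be sketched as below. Several details are assumptions rather than something the slide states: the dict layout of a segment, label 0 meaning background, a missing (class, window length) entry counting as frequency 0, NMS applied per class, and reading (θ - 0.1) as the NMS overlap threshold relative to an evaluation threshold θ.

```python
def temporal_iou(a, b):
    """Intersection over union of two (start, end) intervals."""
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union > 0 else 0.0


def post_process(segments, length_freq, nms_threshold, p_pro_min=0.7):
    """Filter, rescore and de-duplicate candidate segments.

    segments:    list of dicts with keys start, end, p_pro, label, p_loc
    length_freq: dict mapping (label, window_length) to its training-set frequency
    """
    scored = []
    for seg in segments:
        # Keep only confident proposals that are not predicted as background.
        if seg["p_pro"] <= p_pro_min or seg["label"] == 0:
            continue
        length = seg["end"] - seg["start"]
        score = seg["p_loc"] * length_freq.get((seg["label"], length), 0.0)
        scored.append({**seg, "score": score})

    # Greedy NMS: visit segments by descending score and drop any segment that
    # overlaps an already kept segment of the same class by more than the threshold.
    scored.sort(key=lambda s: s["score"], reverse=True)
    kept = []
    for seg in scored:
        span = (seg["start"], seg["end"])
        if all(
            k["label"] != seg["label"]
            or temporal_iou(span, (k["start"], k["end"])) <= nms_threshold
            for k in kept
        ):
            kept.append(seg)
    return kept
```

For example, under the reading above, evaluating at θ = 0.5 would mean calling `post_process(..., nms_threshold=0.4)`.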
Experiments
MEXaction2
“Bull Charge Cape” and “Horse Riding” videos
77 hours of videos
Training set: 1336 instances
Validation set: 310 instances
Test set: 329 instances
Datasets
THUMOS 2014
Temporal Action Detection Task
20 categories
Training set: 2755 videos
Validation set: 1010 videos and 3007 instances
Test set: 1574 videos and 3358 instances
Results MEXaction2
DTF: Dense Trajectory Features + SVM
Evaluation
Impact of individual networks:
Conclusions
Propose a multi-stage framework, Segment-CNN, to address temporal action localization
Thank you! Questions?