Temporal Activity Detection in Untrimmed Videos with Recurrent
Neural Networks
Alberto Montes
July 15th, 2016
Xavi Giró, Amaia Salvador
Outline
1. Introduction
2. Related Work
3. Methodology
4. Results
5. Conclusions and Future Work
2
Motivation
3
Motivation
4
Problem Definition
5
Videos
Problem Definition
6
Videos
Activity Classification
Longboarding
Problem Definition
7
Videos
Activity Temporal Localization
Longboarding
Problem Definition
8
How?
Problem Definition
9
Neural Network
Activity
Problem Definition
10
Activity
CNN + RNN
11
Large-Scale Activity Recognition Challenge
Stats:
● 19,994 videos
● 200 activity classes
● 660 hours of video
● 313 hours of activities
● 65.6 million frames
Dataset
12
Outline
1. Introduction
2. Related Work
3. Methodology
4. Results
5. Conclusions and Future Work
13
Literature Approaches
14
Activity
CNN + RNN
Convolutional Neural Network
15
Convolutional Layer
Recurrent Neural Network
16
[Diagram: recurrent network unrolled over time with cell states c0, c1, c2]
Literature Approaches
17
Activity
CNN + RNN
3D Convolution
18
Tran, D., Bourdev, L., Fergus, R., Torresani, L., & Paluri, M. Learning Spatiotemporal Features with 3D Convolutional Networks. ICCV 2015, pp. 4489–4497.
3D Convolution
19
● 16-frame video clip as input
● 80 million parameters
● 3x3x3 filter size at all conv layers
Tran, D., Bourdev, L., Fergus, R., Torresani, L., & Paluri, M. Learning Spatiotemporal Features with 3D Convolutional Networks. ICCV 2015, pp. 4489–4497.
Literature Approaches
20
Activity
CNN + RNN
Literature Approaches
21
Activity
CNN + RNN
Segments Proposals
22
Shou, Z., Wang, D., & Chang, S. F. Temporal Action Localization in Untrimmed Videos via Multi-stage CNNs. CVPR 2016.
Literature Approaches
23
Activity
CNN + RNN
RNN for Activity Localization
24
Yeung, S., Russakovsky, O., Mori, G., & Fei-Fei, L. "End-to-end Learning of Action Detection from Frame Glimpses in Videos." CVPR 2016.
Outline
1. Introduction
2. Related Work
3. Methodology
4. Results
5. Conclusions and Future Work
25
Architecture Overview
26
[Diagram: each 16-frame clip is classified into 200 activities + background]
Outline
3. Methodology
a. Extracting C3D Features
b. Audio Features
c. Network Architecture
d. Training Methodology
e. Post-Processing
27
Outline
3. Methodology
a. Extracting C3D Features
b. Audio Features
c. Network Architecture
d. Training Methodology
e. Post-Processing
28
C3D Network
29
[Diagram: the C3D network, implemented in Caffe, maps each 16-frame clip to a feature vector]
C3D Network
30
[Diagram: C3D network in Caffe producing one feature vector per clip]
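As a rough illustration of this step, the sketch below splits a video into non-overlapping 16-frame clips and maps each one to a C3D feature vector. The `c3d_forward` helper and the 4096-dimensional feature size are assumptions standing in for the actual Caffe model.

```python
import numpy as np

def c3d_forward(clip):
    """Hypothetical stand-in for a forward pass through the C3D model.

    `clip` has shape (16, height, width, 3); the real network would return
    the activations of one of its fully-connected layers.
    """
    return np.zeros(4096, dtype=np.float32)  # placeholder feature vector

def extract_clip_features(frames, clip_length=16):
    """Split a video into non-overlapping 16-frame clips and extract one
    C3D feature vector per clip."""
    n_clips = len(frames) // clip_length
    features = []
    for i in range(n_clips):
        clip = frames[i * clip_length:(i + 1) * clip_length]
        features.append(c3d_forward(clip))
    return np.stack(features)  # shape: (n_clips, 4096)
```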
Outline
3. Methodology
a. Extracting C3D Features
b. Audio Features
c. Network Architecture
d. Training Methodology
e. Post-Processing
31
Audio Features
32
[Diagram: C3D video features are concatenated with audio features (MFCC, spectral) to form the Recurrent Neural Network input]
Audio features provided by Ignasi Esquerra.
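A minimal sketch of how the per-clip audio descriptors could be concatenated with the C3D video features to build the RNN input; the array names and dimensions are illustrative assumptions, not the exact sizes used in the original work.

```python
import numpy as np

# Assumed per-clip descriptors (one row per 16-frame clip).
video_features = np.random.rand(120, 4096)   # C3D features
mfcc_features = np.random.rand(120, 80)      # e.g. mean + std MFCC per clip
spectral_features = np.random.rand(120, 20)  # spectral descriptors per clip

# Concatenate along the feature axis to form the RNN input sequence.
rnn_input = np.concatenate(
    [video_features, mfcc_features, spectral_features], axis=1)
print(rnn_input.shape)  # (120, 4196)
```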
Outline
3. Methodology
a. Extracting C3D Features
b. Audio Features
c. Network Architecture
d. Training Methodology
e. Post-Processing
33
Network Architecture
34
Network Architecture
35
Network Architecture
36
LSTM with previous output feedback
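A minimal Keras sketch of this kind of architecture, assuming 4096-dimensional C3D features per clip, the 512-unit LSTM named on the slides, and a softmax over 200 activities plus background; any layer not named on the slides (e.g. the dropout) is an assumption.

```python
from tensorflow.keras.layers import Input, LSTM, TimeDistributed, Dense, Dropout
from tensorflow.keras.models import Model

n_classes = 201      # 200 activities + background
feature_dim = 4096   # C3D feature vector per 16-frame clip

# Sequence of clip features in, one class distribution per clip out.
clip_features = Input(shape=(None, feature_dim))
x = Dropout(0.5)(clip_features)              # assumed regularization
x = LSTM(512, return_sequences=True)(x)      # 512-LSTM from the slides
clip_predictions = TimeDistributed(
    Dense(n_classes, activation='softmax'))(x)

model = Model(clip_features, clip_predictions)
model.compile(optimizer='rmsprop', loss='categorical_crossentropy')
model.summary()
```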
Outline
3. Methodology
a. Extracting C3D Features
b. Audio Features
c. Network Architecture
d. Training Methodology
e. Post-Processing
37
Training Methodology
Categorical Cross Entropy Loss
38
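Written out, the categorical cross-entropy over the C = 201 clip-level classes, where y_{t,c} is the one-hot ground truth and ŷ_{t,c} the predicted probability for class c at clip t, is:

\[
\mathcal{L} = -\frac{1}{T}\sum_{t=1}^{T}\sum_{c=1}^{C} y_{t,c}\,\log \hat{y}_{t,c}
\]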
Training Methodology
For unbalanced data, weighted loss:
39
660 hours of video
313 hours of activities
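Since background clips dominate (660 hours of video but only 313 hours of activities), a class-weighted cross-entropy is used. A common form, with weights w_c inversely proportional to class frequency, is sketched below; the exact weighting scheme is an assumption.

\[
\mathcal{L} = -\frac{1}{T}\sum_{t=1}^{T}\sum_{c=1}^{C} w_c\, y_{t,c}\,\log \hat{y}_{t,c},
\qquad w_c \propto \frac{1}{\mathrm{freq}(c)}
\]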
Outline
3. Methodology
a. Extracting C3D Features
b. Audio Features
c. Network Architecture
d. Training Methodology
e. Post-Processing
40
Classification Post-Processing
41
[Diagram: matrix of per-clip probabilities for Background and Activities 1–200 over Clips 1…N]
Classification Post-Processing
42
[Diagram: per-clip probabilities for Background and Activities 1–200 over Clips 1…N, averaged across clips]
Classification Post-Processing
43
[Diagram: per-clip probabilities averaged across clips; the class with the maximum probability is selected]
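A minimal NumPy sketch of this post-processing: the clip-level probabilities are averaged over the whole video and the activity with the highest mean probability (ignoring background) is reported. Variable names are illustrative.

```python
import numpy as np

def classify_video(clip_probs):
    """clip_probs: (n_clips, 201) softmax outputs, class 0 = background.

    Average the per-clip probabilities and return the most likely activity.
    """
    mean_probs = clip_probs.mean(axis=0)        # average over clips
    activity_probs = mean_probs[1:]             # drop the background class
    best = int(np.argmax(activity_probs)) + 1   # index back into 1..200
    return best, float(mean_probs[best])

# Example with random predictions for a 50-clip video.
probs = np.random.dirichlet(np.ones(201), size=50)
label, score = classify_video(probs)
```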
Detection Post-Processing
44
[Diagram: per-clip probabilities for Background and Activities 1–200 over Clips 1…N along the time axis]
A mean filter of k samples is applied over time.
Detection Post-Processing
45
[Diagram: smoothed probability curve of an activity over Clips 1…N, thresholded at Ɣ]
Detection Post-Processing
46
[Plot: activity probability after smoothing, with the detection threshold Ɣ]
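A sketch of the detection post-processing described on these slides: the per-clip activity probability is smoothed with a mean filter of k samples, thresholded at Ɣ, and the surviving contiguous runs of clips become temporal activity segments. The clip duration assumes 30 fps video; the exact parameter values are assumptions.

```python
import numpy as np

def detect_segments(activity_probs, k=5, gamma=0.2, clip_seconds=16 / 30.0):
    """activity_probs: (n_clips,) probability of a given activity per clip."""
    # Mean filter of k samples over time.
    kernel = np.ones(k) / k
    smoothed = np.convolve(activity_probs, kernel, mode='same')

    # Threshold at gamma and collect contiguous runs of active clips.
    active = smoothed >= gamma
    segments, start = [], None
    for i, flag in enumerate(active):
        if flag and start is None:
            start = i
        elif not flag and start is not None:
            segments.append((start * clip_seconds, i * clip_seconds))
            start = None
    if start is not None:
        segments.append((start * clip_seconds, len(active) * clip_seconds))
    return segments  # list of (t_start, t_end) in seconds
```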
Outline
1. Introduction
2. Related Work
3. Methodology
4. Results
5. Conclusions and Future Work
47
Classification: Audio Features
48
mAP = 0.5755 (video + audio features) vs. mAP = 0.5938 (video features only)
Music unrelated to the activity is often added to the videos in post-processing,causing a decrease in performance when audio and video features are combined.
Classification: Depth Analysis
49
mAP = 0.5938 mAP = 0.5492 mAP = 0.5635
Deeper networks exhibit overfitting.
Classification Results Per Activity
50
Classification Results Per Activity
51
● Using the Pommel Horse
● Sailing
● Playing Ice Hockey
● Rock Climbing
● BMX
Classification Results Per Activity
52
● Drinking Coffee
● Peeling Potatoes
● Having an Ice Cream
● Rock-Paper-Scissors
● Polishing Shoes
Top Level Classification
53
Detection
54
mAP = 0.2251 mAP = 0.2067
Model with feedback did not improve results
Training with feedback
55
[Diagram: during training, the previous ground-truth one-hot vector (e.g. 0 0 1 0 0 0) is concatenated with the video features before the 512-LSTM]
Training with feedback
56
[Diagram: at test time, the previous prediction (e.g. 0 0.1 0.6 0.2 0.1 0) is concatenated with the video features before the 512-LSTM]
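A compact sketch of the feedback mechanism: during training the previous ground-truth one-hot vector is concatenated with the clip features (teacher forcing), while at test time the previous prediction is fed back instead. The `step` function is a placeholder standing in for one recurrent step of the real model.

```python
import numpy as np

N_CLASSES = 201  # 200 activities + background

def step(clip_features, prev_output):
    """Hypothetical single recurrent step: clip features concatenated with
    the previous output vector, mapped to class probabilities."""
    x = np.concatenate([clip_features, prev_output])  # feedback concatenation
    logits = np.random.rand(N_CLASSES)                # placeholder for the LSTM
    return np.exp(logits) / np.exp(logits).sum()

def run_sequence(features, ground_truth=None):
    """features: (n_clips, dim); ground_truth: (n_clips, N_CLASSES) or None."""
    prev = np.zeros(N_CLASSES)
    outputs = []
    for t, feats in enumerate(features):
        pred = step(feats, prev)
        outputs.append(pred)
        # Training: feed back the previous ground truth (teacher forcing).
        # Testing: feed back the model's own previous prediction.
        prev = ground_truth[t] if ground_truth is not None else pred
    return np.stack(outputs)
```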
Comparing Post-Processing
57
Grid search over the smoothing filter size k and the activity threshold Ɣ to find the optimal parameters.
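One possible way to run that grid search, assuming a hypothetical `evaluate_detection_map` callable that scores the validation set for a given filter size k and threshold Ɣ; the candidate parameter values are illustrative.

```python
from itertools import product

def grid_search(evaluate_detection_map,
                ks=(1, 3, 5, 10, 20),
                gammas=(0.05, 0.1, 0.2, 0.3, 0.5)):
    """Return the (k, gamma) pair with the best validation mAP."""
    best_params, best_map = None, -1.0
    for k, gamma in product(ks, gammas):
        score = evaluate_detection_map(k=k, gamma=gamma)
        if score > best_map:
            best_params, best_map = (k, gamma), score
    return best_params, best_map
```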
Detection Results per Activity
58
Detection Results per Activity
59
● Windsurfing
● Riding Bumper Cars
● Playing Racquetball
● Using the Pommel Horse
● Using Parallel Bars
Detection Results per Activity
60
● Drinking Coffee
● Putting on Shoes
● Rock-Paper-Scissors
● Removing Curlers
● Smoking a Cigarette
Top Level Detection
61
Qualitative Evaluation
62
Ground Truth: Playing water polo
Prediction:
● 0.765 Playing water polo
● 0.202 Swimming
● 0.007 Springboard diving
Qualitative Evaluation
63
Ground Truth: Hopscotch
Prediction:
● 0.848 Running a marathon
● 0.023 Triple jump
● 0.022 Javelin throw
Qualitative Evaluation
64
Challenge Results
66
Classification Task (24 participants), mAP*:
● Baseline: 42.20%
● Winner: 93.23%
● Average performance: 66.26%
● UPC Team: 58.74%
* Results over the test subset. Slide design by Issey Masuda.
Challenge Results
67
Detection Task (6 participants), mAP*:
● Baseline: 9.70%
● Winner: 42.47%
● Average performance: 29.94%
● UPC Team: 22.36%
* Results over the test subset. Slide design by Issey Masuda.
Outline
1. Introduction
2. Related Work
3. Methodology
4. Results
5. Conclusions and Future Work
68
Conclusions
69
Classification: Longboarding
Detection: 42.7s – 193.5s Longboarding
Conclusions
70
[Diagram: two-stream architecture — Video → Spatial Net + Temporal Net → Output]
Winning entry for the ActivityNet Classification task
Wang, Limin, et al. "Towards good practices for very deep two-stream convnets." arXiv preprint arXiv:1507.02159 (2015).
Conclusions
71
Classification: Longboarding
Detection: 42.7s – 193.5s Longboarding
Conclusions
72
Best results were obtained for sport categories, due to the pretraining of C3D with the Sports-1M dataset
Future Work: E2E Training
73
Training the whole pipeline end-to-end would reduce the bias towards sport categories
Future Work: Attention Models
74
[Diagram: Temporal Attention Filters → Neural Network]
Challenge Submission
75
Open Sourced Contributions
76
github.com/imatge-upc/activitynet-2016-cvprw
Thank you for your attention
77
78
Questions?
79
Support Slides
Metrics
80
Classification: Hit@3
Detection: IoU
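For the detection task, the temporal Intersection over Union (IoU) between a predicted segment [p_s, p_e] and a ground-truth segment [g_s, g_e] can be written as below; Hit@3 counts a video as correctly classified when the ground-truth activity is among the three highest-scoring predictions.

\[
\mathrm{IoU} = \frac{\min(p_e, g_e) - \max(p_s, g_s)}{\max(p_e, g_e) - \min(p_s, g_s)}
\]

(valid when the segments overlap; IoU = 0 otherwise)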
Smoothing Effect Comparison
81
Post-Processing Effect
82
Smoothing Filter:
Post-Processing Effect
83
Activity Threshold:
Activities Duration
84
AP and Video Appearance Correlation
85
AP and Video Appearance Correlation
86
Preparing Data
87
batch 1
batch 2
Preparing Data
88
Sequence of Video Vector Features
Sequence of Activities
time
Preparing Data
89
time
timesteps
Preparing Data
90
Preparing Data
91
Gradient Propagation
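A minimal sketch of how the clip feature sequences could be cut into fixed-length timestep chunks and grouped into batches so that gradients are propagated through the RNN over a bounded number of steps; the chunk length and batch size are assumptions.

```python
import numpy as np

def make_batches(sequences, timesteps=20, batch_size=32):
    """sequences: list of (n_clips, feature_dim) arrays, one per video.

    Cut each sequence into chunks of `timesteps` clips and group the chunks
    into (batch_size, timesteps, feature_dim) arrays, limiting gradient
    propagation to `timesteps` steps.
    """
    chunks = []
    for seq in sequences:
        for i in range(len(seq) // timesteps):
            chunks.append(seq[i * timesteps:(i + 1) * timesteps])
    return [np.stack(chunks[i:i + batch_size])
            for i in range(0, len(chunks) - batch_size + 1, batch_size)]
```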
Gathering Audio Features
92
[Diagram: for each 16-frame clip along the time axis, a sequence of 10 ms MFCC feature frames plus spectral features]
Gathering Audio Features
93
[Diagram: the 10 ms MFCC frames within each 16-frame clip are aggregated into mean MFCC and std MFCC features, alongside spectral features]
Gathering Audio Features
94
[Diagram: mean MFCC, std MFCC and spectral features are concatenated per 16-frame clip]
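A sketch of the aggregation shown above: the 10 ms MFCC frames that fall inside each 16-frame clip are summarized by their mean and standard deviation and concatenated with the clip's spectral features. The frame rate (30 fps) and feature dimensions are assumptions.

```python
import numpy as np

def aggregate_audio(mfcc_frames, spectral_per_clip,
                    clip_seconds=16 / 30.0, hop_seconds=0.010):
    """mfcc_frames: (n_frames, n_mfcc) MFCC vectors computed every 10 ms.
    spectral_per_clip: (n_clips, n_spec) spectral features per 16-frame clip.

    Returns one audio descriptor per clip: [mean MFCC, std MFCC, spectral].
    """
    frames_per_clip = int(round(clip_seconds / hop_seconds))
    descriptors = []
    for c in range(len(spectral_per_clip)):
        window = mfcc_frames[c * frames_per_clip:(c + 1) * frames_per_clip]
        if len(window) == 0:  # audio shorter than the video: stop early
            break
        descriptors.append(np.concatenate(
            [window.mean(axis=0), window.std(axis=0), spectral_per_clip[c]]))
    return np.stack(descriptors)
```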
Convolutional Neural Network
95
Convolutional Layer
Convolutional Neural Network
96
Pooling Layer
Convolutional Neural Network
97
Fully-Connected Layer
Qualitative Evaluation
98