Temporal Activity Detection in Untrimmed Videos with Recurrent Neural Networks



Alberto Montes

July 15th, 2016

Advisors: Xavi Giró, Amaia Salvador

Outline

1. Introduction
2. Related Work
3. Methodology
4. Results
5. Conclusions and Future Work

Motivation

Problem Definition

Videos → two tasks:

● Activity Classification: Longboarding
● Activity Temporal Localization: Longboarding

How? A neural network (CNN + RNN) maps each video to an activity label.

Dataset

ActivityNet: Large-Scale Activity Recognition Challenge

Stats:

● 19,994 videos
● 200 activity classes
● 660 hours of video
● 313 hours of activities
● 65.6 million frames

Outline

1. Introduction
2. Related Work
3. Methodology
4. Results
5. Conclusions and Future Work

Literature Approaches

CNN + RNN → Activity

Convolutional Neural Network

[Figure: convolutional layer]

Recurrent Neural Network

[Figure: unrolled recurrent network with cell states c0, c1, c2]


3D Convolution (C3D)

● 16-frame video clip as input
● 80 million parameters
● 3×3×3 filter size at all conv layers

Tran, D., Bourdev, L., Fergus, R., Torresani, L., & Paluri, M. "Learning spatiotemporal features with 3D convolutional networks." ICCV 2015, pp. 4489-4497.


Segment Proposals

Shou, Z., Wang, D., & Chang, S. F. "Temporal Action Localization in Untrimmed Videos via Multi-stage CNNs." CVPR 2016.


RNN for Activity Localization

Yeung, S., Russakovsky, O., Mori, G., & Fei-Fei, L. "End-to-end Learning of Action Detection from Frame Glimpses in Videos." CVPR 2016.

Outline

1. Introduction
2. Related Work
3. Methodology
4. Results
5. Conclusions and Future Work

Architecture Overview

[Figure: the network processes the video in 16-frame clips; for each clip it outputs scores over 200 activities + background]

Outline

3. Methodology
   a. Extracting C3D Features
   b. Audio Features
   c. Network Architecture
   d. Training Methodology
   e. Post-Processing

C3D Network

[Figure: the pre-trained C3D network (Caffe implementation) maps each 16-frame clip to a feature vector]
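As a minimal sketch of this extraction step (assuming a hypothetical `c3d_forward` callable that wraps the Caffe model's forward pass):

```python
import numpy as np

def extract_clip_features(frames, c3d_forward, clip_len=16):
    """Split a video into non-overlapping 16-frame clips and run each one
    through the pre-trained C3D network.

    `c3d_forward` is a hypothetical callable wrapping the Caffe forward
    pass, mapping a (clip_len, H, W, 3) clip to a fixed-size vector."""
    n_clips = len(frames) // clip_len          # trailing partial clip dropped
    feats = [c3d_forward(frames[i * clip_len:(i + 1) * clip_len])
             for i in range(n_clips)]
    return np.stack(feats)                     # shape: (n_clips, feature_dim)
```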


Audio Features

[Figure: audio features (MFCC and spectral) are concatenated with the C3D video features to form the recurrent neural network input]

Audio features provided by Ignasi Esquerra.


Network Architecture

[Figures: the architecture stacks recurrent layers over the clip feature sequence; the final variant is an LSTM with previous output feedback]
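A rough sketch of the clip-level classifier in today's tf.keras (the original code used the Keras of 2016; the 4096-d feature size and everything beyond the 512-unit LSTM are assumptions):

```python
import tensorflow as tf

NUM_CLASSES = 201  # 200 activities + background
FEAT_DIM = 4096    # assumed C3D fc6 feature size

# Sketch: a single 512-unit LSTM reads the sequence of C3D clip features
# and a time-distributed softmax emits a class distribution for every clip.
model = tf.keras.Sequential([
    tf.keras.layers.LSTM(512, return_sequences=True,
                         input_shape=(None, FEAT_DIM)),
    tf.keras.layers.TimeDistributed(
        tf.keras.layers.Dense(NUM_CLASSES, activation="softmax")),
])
model.compile(optimizer="rmsprop", loss="categorical_crossentropy")
```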


Training Methodology

Categorical cross-entropy loss.

For unbalanced data, a weighted loss: of the 660 hours of video, only 313 hours are activities, so the background class dominates.
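One standard weighting scheme, sketched here under the assumption of inverse-frequency class weights (the thesis' exact weights may differ):

```python
import numpy as np

def weighted_categorical_crossentropy(y_true, y_pred, class_weights, eps=1e-8):
    """Categorical cross-entropy with per-class weights, e.g. inverse class
    frequency, so the over-represented background class does not dominate.

    y_true: one-hot targets, shape (n, num_classes)
    y_pred: predicted probabilities, same shape
    class_weights: shape (num_classes,)
    """
    per_sample = -np.sum(class_weights * y_true * np.log(y_pred + eps), axis=-1)
    return per_sample.mean()
```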


Classification Post-Processing

[Figure: the per-clip probabilities over background + 200 activities for clips 1..N are averaged over the whole video, and the activity with maximum probability is selected]
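A minimal sketch of this average-then-argmax step, assuming the background class sits at index 0:

```python
import numpy as np

def classify_video(clip_probs):
    """clip_probs: (n_clips, 201) per-clip class probabilities, with the
    background class assumed at index 0. Average over clips, skip the
    background column, and return the most probable activity."""
    mean_probs = clip_probs.mean(axis=0)
    activity = int(np.argmax(mean_probs[1:])) + 1
    return activity, float(mean_probs[activity])
```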

Detection Post-Processing

[Figures: the per-clip activity probabilities are smoothed over time with a mean filter of k samples, then thresholded at γ to obtain the detected activity segments]
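A sketch of the smoothing-and-thresholding step; the clip duration in seconds is an assumption (16 frames at ~30 fps) and should be set from the actual frame rate:

```python
import numpy as np

def detect_segments(activity_prob, k=5, gamma=0.2, clip_seconds=16 / 30.0):
    """Smooth a per-clip probability signal with a k-sample mean filter,
    threshold at gamma, and return the resulting (start_s, end_s) segments.

    clip_seconds assumes 16-frame clips at ~30 fps; set it from the video.
    """
    smooth = np.convolve(activity_prob, np.ones(k) / k, mode="same")
    active = smooth >= gamma
    segments, start = [], None
    for i, flag in enumerate(active):
        if flag and start is None:
            start = i
        elif not flag and start is not None:
            segments.append((start * clip_seconds, i * clip_seconds))
            start = None
    if start is not None:
        segments.append((start * clip_seconds, len(active) * clip_seconds))
    return segments
```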

Outline

1. Introduction
2. Related Work
3. Methodology
4. Results
5. Conclusions and Future Work

Classification: Audio Features

mAP = 0.5938 (video only) vs. mAP = 0.5755 (video + audio)

Music unrelated to the activity is often added to the videos in post-production, causing a decrease in performance when audio and video features are combined.

Classification: Depth Analysis

mAP = 0.5938 · mAP = 0.5492 · mAP = 0.5635

Deeper networks show overfitting.

Classification Results Per Activity

Best classified activities:
● Using the Pommel Horse
● Sailing
● Playing Ice Hockey
● Rock Climbing
● BMX

Worst classified activities:
● Drinking Coffee
● Peeling Potatoes
● Having an Ice Cream
● Rock-Paper-Scissors
● Polishing Shoes

Top Level Classification


Detection

mAP = 0.2251 (no feedback) vs. mAP = 0.2067 (with feedback)

The model with output feedback did not improve results.

Training with feedback

[Figures: the 512-unit LSTM input concatenates the video features with the previous step's output. When training, the previous ground truth is fed back (e.g. 0 0 1 0 0 0); when testing, the previous prediction (e.g. 0 0.1 0.6 0.2 0.1 0)]
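A sketch of building those concatenated inputs (a hypothetical helper; `prev_outputs` holds ground-truth one-hots at train time and the model's predictions at test time):

```python
import numpy as np

def feedback_inputs(video_feats, prev_outputs):
    """Concatenate each timestep's video features with the previous step's
    output distribution: ground-truth one-hots at training time, the
    model's own predictions at test time. The first step gets all zeros."""
    shifted = np.zeros_like(prev_outputs)
    shifted[1:] = prev_outputs[:-1]
    return np.concatenate([video_feats, shifted], axis=-1)
```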

Comparing Post-Processing

Grid search over the smoothing filter size k and the activity threshold γ for the optimal parameters.
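A sketch of that search (the candidate grids are illustrative, not the thesis' values; `evaluate` stands for a validation-set scorer such as detection mAP):

```python
from itertools import product

def grid_search(evaluate, ks=(3, 5, 7, 9), gammas=(0.1, 0.2, 0.3, 0.4)):
    """Return the (k, gamma) pair maximizing `evaluate`, a caller-supplied
    scorer (e.g. detection mAP on the validation set)."""
    return max(product(ks, gammas), key=lambda kg: evaluate(*kg))
```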

Detection Results per Activity

Best detected activities:
● Windsurfing
● Riding Bumper Cars
● Playing Racquetball
● Using the Pommel Horse
● Using Parallel Bars

Worst detected activities:
● Drinking Coffee
● Putting on Shoes
● Rock-Paper-Scissors
● Removing Curlers
● Smoking a Cigarette

Top Level Detection


Qualitative Evaluation

Ground Truth: Playing water polo
Prediction:
● 0.765 Playing water polo
● 0.202 Swimming
● 0.007 Springboard diving

Ground Truth: Hopscotch
Prediction:
● 0.848 Running a marathon
● 0.023 Triple jump
● 0.022 Javelin throw


Challenge Results

Classification Task (24 participants), mAP*:

● Winner: 93.23%
● Average performance: 66.26%
● UPC Team: 58.74%
● Baseline: 42.20%

Detection Task (6 participants), mAP*:

● Winner: 42.47%
● Average performance: 29.94%
● UPC Team: 22.36%
● Baseline: 9.70%

* Results over the test subset. Slide design by Issey Masuda.

Outline

1. Introduction
2. Related Work
3. Methodology
4. Results
5. Conclusions and Future Work

Conclusions

Classification: Longboarding
Detection: 42.7s – 193.5s Longboarding

[Figure: two-stream architecture (Video → Spatial Net + Temporal Net → Output), the winning entry for the ActivityNet Classification task]

Wang, L., et al. "Towards good practices for very deep two-stream ConvNets." arXiv preprint arXiv:1507.02159 (2015).


Best results were obtained for sport categories, due to C3D being pretrained on the Sports-1M dataset.

Future Work: E2E Training

Training the whole pipeline end-to-end would reduce the bias towards sport categories.

Future Work: Attention Models

[Figure: temporal attention filters over the neural network's input sequence]

Challenge Submission


Open-Sourced Contributions

github.com/imatge-upc/activitynet-2016-cvprw

Thank you for your attention

Questions?

Support Slides

Metrics

Classification: Hit@3
Detection: IoU (Intersection over Union)
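A minimal sketch of both metrics:

```python
def temporal_iou(a, b):
    """Intersection over Union of two temporal segments (start_s, end_s)."""
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union > 0 else 0.0

def hit_at_3(ranked_labels, true_label):
    """1 if the ground-truth label is among the top-3 predictions."""
    return int(true_label in ranked_labels[:3])
```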

Smoothing Effect Comparison


Post-Processing Effect

[Plots: detection performance as a function of the smoothing filter size k and of the activity threshold γ]

Activity Durations

AP and Video Appearance Correlation

Preparing Data

[Figures: each video becomes a sequence of clip feature vectors aligned with a sequence of activity labels over time; the sequences are cut into fixed-length timesteps and arranged into batches, which bounds how far gradients propagate]
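A sketch of that chunking, with an illustrative `timesteps=20` (not necessarily the thesis' value):

```python
import numpy as np

def make_batches(features, labels, timesteps=20):
    """Cut one video's (n_clips, feat_dim) features and per-clip one-hot
    labels into chunks of `timesteps` clips; with a stateful RNN the
    chunks of a video are fed in order so state carries across them."""
    n_chunks = len(features) // timesteps      # trailing partial chunk dropped
    f = features[:n_chunks * timesteps].reshape(n_chunks, timesteps, -1)
    y = labels[:n_chunks * timesteps].reshape(n_chunks, timesteps, -1)
    return f, y
```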

Gathering Audio Features

[Figures: MFCC features are computed every 10 ms; for each 16-frame clip they are aggregated into mean and standard deviation vectors and combined with spectral features]
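A sketch of the MFCC aggregation using librosa (an assumption; the thesis' audio toolchain is not specified here, and the spectral features are omitted):

```python
import librosa
import numpy as np

def clip_audio_features(y, sr, clip_dur):
    """Aggregate frame-level MFCCs into one vector per video clip: the
    mean and standard deviation of the MFCC frames inside the clip."""
    hop = int(0.010 * sr)                      # ~10 ms hop, as in the slides
    mfcc = librosa.feature.mfcc(y=y, sr=sr, hop_length=hop)   # (n_mfcc, n_frames)
    frames_per_clip = int(clip_dur * sr / hop)
    feats = []
    for start in range(0, mfcc.shape[1] - frames_per_clip + 1, frames_per_clip):
        chunk = mfcc[:, start:start + frames_per_clip]
        feats.append(np.concatenate([chunk.mean(axis=1), chunk.std(axis=1)]))
    return np.stack(feats)                     # (n_clips, 2 * n_mfcc)
```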

Convolutional Neural Network

[Figures: convolutional layer, pooling layer, fully-connected layer]

Qualitative Evaluation