
Page 1:

Modeling Spatial-Temporal Clues in a Hybrid Deep Learning Framework for Video Classification

Zuxuan Wu, Xi Wang, Yu-Gang Jiang, Hao Ye, Xiangyang Xue

School of Computer Science, Fudan University, Shanghai, China

ACM Multimedia, Brisbane, Australia, October 2015

[email protected]

Page 2:

Video Classification

• Videos are everywhere

• Wide applications:
  - Web video search
  - Video collection management
  - Intelligent video surveillance

Page 3:

Video Classification: State of the Art

1. Improved Dense Trajectories [Wang et al., ICCV 2013]

a) Tracking trajectories

b) Computing local descriptors along the trajectories

2. Feature Encoding [Perronnin et al., CVPR 2010, Xu et al., CVPR 2015]

a) Encoding local features with Fisher Vector/VLAD

b) Normalization methods, such as Power Norm
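For context, a minimal NumPy/scikit-learn sketch of Fisher Vector encoding of local descriptors (e.g., the trajectory descriptors above), assuming a GMM already fitted with covariance_type='diag'; it follows the standard improved-FV recipe (power plus L2 normalization) rather than the exact setup of the cited papers:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def fisher_vector(descriptors, gmm):
    """Encode local descriptors (N x D) into a Fisher Vector using
    gradients w.r.t. the GMM means and (diagonal) standard deviations."""
    N, D = descriptors.shape
    q = gmm.predict_proba(descriptors)        # N x K soft assignments
    pi = gmm.weights_                         # K mixture weights
    mu = gmm.means_                           # K x D means
    sigma = np.sqrt(gmm.covariances_)         # K x D std devs (diag covariance)

    parts = []
    for k in range(gmm.n_components):
        diff = (descriptors - mu[k]) / sigma[k]                     # N x D
        # Gradient w.r.t. the k-th mean
        g_mu = (q[:, k:k+1] * diff).sum(axis=0) / (N * np.sqrt(pi[k]))
        # Gradient w.r.t. the k-th standard deviation
        g_sigma = (q[:, k:k+1] * (diff**2 - 1)).sum(axis=0) / (N * np.sqrt(2 * pi[k]))
        parts.extend([g_mu, g_sigma])
    fv = np.concatenate(parts)

    # Power normalization followed by L2 normalization
    fv = np.sign(fv) * np.sqrt(np.abs(fv))
    return fv / (np.linalg.norm(fv) + 1e-12)
```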

Page 4:

Video Classification: Deep Learning

1. Image-based CNN Classification [Zha et al., arXiv 2015]

   a) Extracting deep features for each frame

   b) Averaging frame-level deep features (see the sketch below)

2. Two-Stream CNN [Simonyan et al., NIPS 2014]
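A minimal PyTorch sketch of the image-based pipeline, using a torchvision ResNet-50 as a stand-in for the frame-level CNN (the cited works used different backbones); the frame tensor here is a random placeholder:

```python
import torch
from torchvision import models

# Placeholder clip: 16 frames, already resized to 224x224 and normalized.
frames = torch.randn(16, 3, 224, 224)

backbone = models.resnet50(weights=models.ResNet50_Weights.DEFAULT)
backbone.fc = torch.nn.Identity()          # keep the 2048-d pooled features
backbone.eval()

with torch.no_grad():
    frame_feats = backbone(frames)         # (16, 2048): one feature per frame
video_feat = frame_feats.mean(dim=0)       # average pooling over frames
```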

Page 5:

Video Classification: Deep Learning

3. Recurrent NN: LSTM [Ng et al., CVPR 2015]

[Figure: a diving video unrolled over time (jumping from the platform, rotating in the air, falling into water), with a chain of LSTM units producing outputs O_{t-1}, O_t, O_{t+1}.]

The performance is not ideal; it is similar to that of image-based classification.

Page 6:

Video Classification: Deep Learning

3. Recurrent NN: LSTM [Ng et al., CVPR 2015]

The performance of LSTM and average pooling is close.

We propose a hybrid deep learning framework to capture appearance, short-term motion and long-term temporal dynamics in videos.

Page 7:

[Framework diagram: individual frames are fed to Spatial CNNs and stacked optical flow to Motion CNNs; each stream is followed by a chain of LSTMs producing per-step predictions (y^s_1 ... y^s_T and y^m_1 ... y^m_T), which are combined with the fusion-layer weights (W^E_s, W^E_m) in a regularized fusion layer to give the final prediction for the input video.]

Our Framework

We propose a hybrid deep learning framework to model rich multimodal information (a minimal sketch follows the list):

a) Appearance and short-term motion with CNNs

b) Long-term temporal information with LSTMs

c) Regularized fusion to explore feature correlations
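A minimal PyTorch sketch of this architecture, assuming spatial CNN features per frame and motion CNN features per optical-flow stack have already been extracted; the dimensions, the use of the last LSTM step, and a single linear fusion layer are illustrative simplifications, not the paper's exact configuration:

```python
import torch
import torch.nn as nn

class HybridVideoNet(nn.Module):
    """Two streams (spatial and motion), each an LSTM over CNN features,
    combined by a fusion layer that produces class scores."""
    def __init__(self, feat_dim=2048, hidden=512, num_classes=101):
        super().__init__()
        self.spatial_lstm = nn.LSTM(feat_dim, hidden, batch_first=True)
        self.motion_lstm = nn.LSTM(feat_dim, hidden, batch_first=True)
        # Fusion layer over the concatenated stream representations
        self.fusion = nn.Linear(2 * hidden, num_classes)

    def forward(self, spatial_feats, motion_feats):
        # spatial_feats / motion_feats: (batch, T, feat_dim) CNN features
        # extracted beforehand from frames and stacked optical flow.
        s_out, _ = self.spatial_lstm(spatial_feats)
        m_out, _ = self.motion_lstm(motion_feats)
        # Use the last time step of each stream (averaging the steps also works).
        fused = torch.cat([s_out[:, -1], m_out[:, -1]], dim=1)
        return self.fusion(fused)

# Illustrative usage with random CNN features for a 25-step clip.
model = HybridVideoNet()
logits = model(torch.randn(2, 25, 2048), torch.randn(2, 25, 2048))
```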

Page 8:

Spatial and Motion CNN Features

[Pipeline: the input video yields individual frames for the Spatial Convolutional Neural Network and stacked optical flow for the Motion Convolutional Neural Network; the two streams' predictions are combined by score fusion.]
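A small sketch of how a stacked optical-flow input for the Motion CNN could be built, assuming a list of at least L+1 consecutive BGR frames; Farneback flow and L=10 frame pairs are illustrative choices following the common two-stream setup, not necessarily the exact method used here:

```python
import cv2
import numpy as np

def stacked_optical_flow(frames, L=10):
    """Stack horizontal and vertical flow fields for L consecutive frame
    pairs into a (2L, H, W) array, the usual motion-stream input format."""
    flows = []
    prev = cv2.cvtColor(frames[0], cv2.COLOR_BGR2GRAY)
    for frame in frames[1:L + 1]:
        curr = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
        flow = cv2.calcOpticalFlowFarneback(prev, curr, None,
                                            0.5, 3, 15, 3, 5, 1.2, 0)
        flows.extend([flow[..., 0], flow[..., 1]])   # x and y components
        prev = curr
    return np.stack(flows, axis=0)                   # (2L, H, W)
```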

Page 9:

Temporal Modeling with LSTM

An unrolled recurrent neural network.
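For reference, the standard LSTM update at each time step t, where x_t is the frame-level (or motion) CNN feature, σ is the logistic sigmoid and ⊙ denotes element-wise multiplication; this is the common formulation, and the exact variant used in the paper may differ slightly:

$$
\begin{aligned}
i_t &= \sigma(W_i x_t + U_i h_{t-1} + b_i) \\
f_t &= \sigma(W_f x_t + U_f h_{t-1} + b_f) \\
o_t &= \sigma(W_o x_t + U_o h_{t-1} + b_o) \\
g_t &= \tanh(W_g x_t + U_g h_{t-1} + b_g) \\
c_t &= f_t \odot c_{t-1} + i_t \odot g_t \\
h_t &= o_t \odot \tanh(c_t)
\end{aligned}
$$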

Page 10:

Regularized Feature Fusion

[Ngiam et al., ICML 2011; Srivastava et al., NIPS 2012]

Page 11:

DNN Learning Scheme

- Calculate prediction error
- Update weights in a BP manner: w(t) → w(t+1)

Regularized Feature Fusion

[Ngiam et al., ICML 2011; Srivastava et al., NIPS 2012]

In these approaches, the fusion is performed in a free manner, without explicitly exploring feature correlations.

Page 12:

Regularized Feature Fusion

Objective function, with four components:

- Empirical loss

- A regularization term to prevent overfitting

- A structural term to model feature relationships

- A term to provide robustness
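The objective itself appears only as an image in the slides, so the following is an assumption: one plausible form consistent with the four annotations and the later remarks about the l21 and l1 norms, with the fusion-layer weights W^(E) split into a row-sparse part P (shared features) and a sparse part Q (robustness against incorrect sharing):

$$
\min_{\{W^{(l)}\},\,P,\,Q}\;
\underbrace{\mathcal{L}\big(\{W^{(l)}\}\big)}_{\text{empirical loss}}
\;+\;\underbrace{\lambda_1\sum_{l}\big\|W^{(l)}\big\|_F^2}_{\text{prevent overfitting}}
\;+\;\underbrace{\lambda_2\,\|P\|_{2,1}}_{\text{model feature relationships}}
\;+\;\underbrace{\lambda_3\,\|Q\|_{1}}_{\text{provide robustness}}
\quad\text{s.t.}\;\; W^{(E)}=P+Q
$$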


Page 14:

Regularized Feature Fusion

Objective function (continued):

Minimizing the l21 norm makes the weight matrix row-sparse!

Page 15:

Regularized Feature Fusion

Objective function (continued):

Minimizing the l1 norm prevents incorrect feature sharing!

Page 16:

Regularized Feature Fusion

Objective function (continued)

Optimization:

For the E-th (fusion) layer: proximal gradient descent
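The proximal operator of the l21 term has a closed form: row-wise shrinkage of the weight matrix, applied to the fusion-layer weights right after the gradient step (the l1 term analogously reduces to element-wise soft thresholding):

$$
\big[\operatorname{prox}_{\tau\|\cdot\|_{2,1}}(W)\big]_{i\cdot}
= \max\!\Big(0,\; 1-\frac{\tau}{\|W_{i\cdot}\|_2}\Big)\,W_{i\cdot},
\qquad
W^{(E)} \leftarrow \operatorname{prox}_{\eta\lambda\|\cdot\|_{2,1}}\!\big(W^{(E)}-\eta\,\nabla_{W^{(E)}}\mathcal{L}\big)
$$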

Page 17:

Regularized Feature Fusion

Algorithm:

1. Initialize weights randomly

2. for epoch = 1 : K

   ① Calculate the prediction error with feed-forward propagation

   for l = 1 : L

      ② Back-propagate the prediction error and update the weight matrices

      ③ if l == E: evaluate the proximal operator on the fusion-layer weights

   end for
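A self-contained toy sketch of this learning scheme in NumPy, using a single ReLU fusion layer and a softmax classifier; the network sizes, learning rate, and the use of only the l21 regularizer (the other terms are omitted) are simplifications for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: two 20-d modalities concatenated into a 40-d input, 5 classes.
X = rng.standard_normal((256, 40))
y = rng.integers(0, 5, size=256)
W_E = rng.standard_normal((40, 16)) * 0.1   # fusion-layer weights (layer E)
W_C = rng.standard_normal((16, 5)) * 0.1    # classifier weights
lr, lam = 0.1, 0.01

def prox_l21(W, tau):
    """Row-wise shrinkage: proximal operator of tau * ||W||_{2,1}."""
    norms = np.maximum(np.linalg.norm(W, axis=1, keepdims=True), 1e-12)
    return np.maximum(0.0, 1.0 - tau / norms) * W

for epoch in range(50):
    # 1) Feed-forward: fusion representation, class probabilities, error signal
    H = np.maximum(X @ W_E, 0.0)                      # ReLU fusion layer
    logits = H @ W_C
    P = np.exp(logits - logits.max(1, keepdims=True))
    P /= P.sum(1, keepdims=True)
    P[np.arange(len(y)), y] -= 1.0                    # softmax cross-entropy gradient
    P /= len(y)
    # 2) Back-propagate and update weights with gradient steps
    g_WC = H.T @ P
    g_WE = X.T @ ((P @ W_C.T) * (H > 0))
    W_C -= lr * g_WC
    W_E -= lr * g_WE
    # 3) Proximal step on the fusion layer only (row-sparse l21 regularizer)
    W_E = prox_l21(W_E, lr * lam)
```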

Page 18:

Experiments

Datasets:

- UCF-101: 101 action classes, 13,320 video clips from YouTube

- Columbia Consumer Videos (CCV): 20 classes, 9,317 videos from YouTube

Page 19:

Experiments

Temporal Modeling (all numbers in %):

Method                       UCF-101   CCV
Spatial ConvNet                 80.4    75.0
Motion ConvNet                  78.3    59.1
Spatial LSTM                    83.3    43.3
Motion LSTM                     76.6    54.7
ConvNet (spatial + motion)      86.2    75.8
LSTM (spatial + motion)         86.3    61.9
ConvNet + LSTM (spatial)        84.4    77.9
ConvNet + LSTM (motion)         81.4    70.9
All streams                     90.3    82.4

LSTMs are worse than CNNs on noisy long videos. CNNs and LSTMs are highly complementary!

Page 20:

Experiments

Regularized Feature Fusion (all numbers in %):

Regularized fusion performs better than fusion in a free manner.

Method               UCF-101   CCV
Spatial SVM             78.6    74.4
Motion SVM              78.2    57.9
SVM-EF                  86.6    75.3
SVM-LF                  85.3    74.9
SVM-MKL                 86.8    75.4
NN-EF                   86.5    75.6
NN-LF                   85.1    75.2
M-DBM                   86.9    75.3
Two-Stream CNN          86.2    75.8
Regularized Fusion      88.4    76.2

Page 21:

Experiments

Hybrid Deep Learning Framework:

Page 22:

Experiments

Comparison with the State of the Art:

CCV:
- Xu et al.: 60.3%
- Ye et al.: 64.0%
- Jhuo et al.: 64.0%
- Ma et al.: 63.4%
- Liu et al.: 68.2%
- Wu et al.: 70.6%
- Ours: 83.5%

UCF-101:
- Donahue et al.: 82.9%
- Srivastava et al.: 84.3%
- Wang et al.: 85.9%
- Tran et al.: 86.7%
- Simonyan et al.: 88.0%
- Lan et al.: 89.1%
- Zha et al.: 89.6%
- Ours: 91.3%

Page 23:

Conclusion

We propose a hybrid deep learning framework to model rich multimodal information:

1. Modeling appearance and short-term motion with CNNs

2. Capturing long-term temporal information with LSTMs

3. Regularized fusion to explore feature correlations

Take-home messages:

1. LSTMs and CNNs are highly complementary.

2. Regularized feature fusion performs better.

Page 24:

Thank you! Q & A

[email protected]