Transcript
Page 1: Two Stream Self-Supervised Learning for Action Recognition (CVPRW 2018 poster)

Two Stream Self-Supervised Learning for Action Recognition
Ahmed Taha, Moustafa Meshry, Xitong Yang, Yi-Ting Chen, Larry Davis

Goals
• Tangle spatial and temporal representations for action recognition.
• Boost performance on small and imbalanced video datasets.

• HMDB51 and UCF101: 130 samples per class.
• Honda Driving Dataset is imbalanced.

Approach
1. Sequence verification? [Shuffle & Learn, O3N, OPN]
2. Spatial-temporal verification?
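The sequence-verification pretext task above can be sketched as sampling ordered (real) and shuffled (fake) frame tuples; this is a minimal illustration in the spirit of Shuffle & Learn, with hypothetical function and variable names, not the poster's actual implementation:

```python
import random

def make_verification_tuples(frames, tuple_len=3):
    """Sample a correctly ordered (positive) and a shuffled (negative)
    tuple of frames; a binary classifier trained to tell them apart
    learns temporal structure without manual labels."""
    start = random.randrange(len(frames) - tuple_len + 1)
    pos = frames[start:start + tuple_len]      # temporally ordered tuple
    neg = pos[:]
    while neg == pos:                          # force an incorrect order
        random.shuffle(neg)
    return (pos, 1), (neg, 0)                  # (frames, real/shuffled label)

# integers stand in for decoded video frames
(pos, y_pos), (neg, y_neg) = make_verification_tuples(list(range(10)))
```

Training on many such pairs yields supervision for free from unlabeled video.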

Architecture
Given a tuple of an RGB frame and a stack of difference (SOD), the network reasons about frame ordering and spatio-temporal correspondence.
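The spatio-temporal correspondence signal can be sketched as pairing an RGB frame with a motion encoding drawn from either the same or a different clip; the clip structure and names below are hypothetical stand-ins, not the poster's code:

```python
import random

def correspondence_pair(clips, rng=random):
    """Return (rgb, motion, label) for spatio-temporal verification:
    label 1 if the motion encoding (e.g. SOD) comes from the same clip
    as the RGB frame, label 0 if it comes from a different clip."""
    i = rng.randrange(len(clips))
    if rng.random() < 0.5:                     # positive: matching motion
        return clips[i]['rgb'], clips[i]['sod'], 1
    j = rng.randrange(len(clips) - 1)          # negative: pick some j != i
    if j >= i:
        j += 1
    return clips[i]['rgb'], clips[j]['sod'], 0

# toy clips: strings stand in for a frame and its motion encoding
clips = [{'rgb': f"rgb{i}", 'sod': f"sod{i}"} for i in range(4)]
rgb, sod, label = correspondence_pair(clips)
```

The two-tower network is then trained to predict the label from the (rgb, motion) pair.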

Our approach supports various motion encodings, such as stack of difference, dynamic images, or optical flow.
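As one concrete example, a stack-of-difference encoding can be approximated as consecutive frame differences; the exact SOD formulation in the poster may differ, so treat this as a sketch:

```python
import numpy as np

def stack_of_difference(frames):
    """Encode motion as a stack of consecutive frame differences:
    T grayscale frames of shape (H, W) become a (T-1, H, W) stack.
    Static regions cancel out, so only moving content survives."""
    frames = np.asarray(frames, dtype=np.float32)
    return frames[1:] - frames[:-1]

# toy clip: 5 constant 4x4 frames with intensity t -> each difference is 1
clip = np.stack([np.full((4, 4), t, dtype=np.float32) for t in range(5)])
sod = stack_of_difference(clip)                # shape (4, 4, 4)
```

Unlike optical flow, this encoding needs no preprocessing step, which keeps the pipeline cheap.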

Experiments: Quantitative evaluation on HMDB51, UCF101, and Honda.

Temporal representation learning can benefit from ImageNet spatial representations.

[Architecture figure: a Spatial Tower (RGB frame) and a Motion Tower (SOD) are combined by a Fusion FCN, which outputs the verification classes (Class I-IV).]

Supervised (Sup) vs. Self-Supervised + Supervised (Self-Sup) accuracy:

Dataset            Sup     Self-Sup
UCF101/Split#1     51.53   60.45
UCF101/Split#2     51.73   58.23
UCF101/Split#3     50.05   55.60
HMDB51/Split#1     31.47   33.54
HMDB51/Split#2     32.91   36.60
Honda              81.72   82.78

Learned Temporal Visualization

Early Overfitting

Temporal filter visualization is uninformative, so input motion reconstruction is used instead for qualitative evaluation. Self-supervised reconstructions are less noisy than those from supervised learned features.

[Figure: motion reconstructions for synthetic and real motion, comparing supervised vs. self-supervised features.]


The lack of big annotated datasets limits supervised learning in many domains, such as medicine and autonomous driving. Massive amounts of data exist for these applications, but annotating them can be costly for various reasons: it can be time-consuming, as in semantic segmentation, or expensive, as when medical-field expertise is required.

In this work, we investigate self-supervised learning for videos. Specifically, we focus on training deep CNNs for the task of action recognition.