Two Stream Self-Supervised Learning for Action Recognition
Ahmed Taha Moustafa Meshry Xitong Yang Yi-Ting Chen Larry Davis
• Tangle spatial and temporal representation for action recognition.
• Boost performance on small and imbalanced video datasets
• HMDB and UCF 130 samples per class
• Honda Driving Dataset is imbalanced
1. Sequence Verification ? [Shuffle & learn, O3N, OPN]
2. Spatial Temporal Verification ?
Given a tuple of a RGB frame and a stack of difference (SOD), the
network reasons about frame ordering and spatio-temporal
Our approach supports various motion encodings like stack of
difference, dynamic Images or optical flow
Experiments : Quantitative Evaluation on HMDB51, UCF101, Honda.
Temporal representation learning can benefit from
ImageNet Spatial representation.
RGB Fusion FCN
Class I Class II
Class III Class VI
Self-Supervised + Supervised
UCF101 / Split #1 51.53 60.45
UCF101 / Split #2 51.73 58.23
UCF101 / Split #3 50.05 55.60
HMDB51 / Split #1 31.47 33.54
HMDB51 / Split #2 32.91 36.60
Honda 81.72 82.78Motion Tower
Learned Temporal Visualization
Temporal filter visualization is uninformative.
Input motion reconstruction is utilized for
qualitative evaluation. Self-supervised
reconstructions are less noisy compared to
supervised learned features.
motion Supervised Self-Supervised Real motion Supervised Self-Supervised
More Details Here
Lack of big annotated datasets can limit supervised learning techniques for many domains like the medical, and
autonomous car driving. For these applications, there exists massive data. However, annotating such datasets
can be costly for various reasons; it can be time-consuming as in semantic segmentation or expensive due to
medical field expertise requirement.
In this work, we investigate using self-supervised
learning for videos. Specifically, we focus on training
deep CNNs for the task of action recognition.