1
Two Stream Self-Supervised Learning for Action Recognition Ahmed Taha Moustafa Meshry Xitong Yang Yi-Ting Chen Larry Davis Goals Tangle spatial and temporal representation for action recognition. Boost performance on small and imbalanced video datasets HMDB and UCF 130 samples per class Honda Driving Dataset is imbalanced Approach 1. Sequence Verification ? [Shuffle & learn, O3N, OPN] 2. Spatial Temporal Verification ? Architecture Given a tuple of a RGB frame and a stack of difference (SOD), the network reasons about frame ordering and spatio-temporal correspondence. Our approach supports various motion encodings like stack of difference, dynamic Images or optical flow Experiments : Quantitative Evaluation on HMDB51, UCF101, Honda. Temporal representation learning can benefit from ImageNet Spatial representation. Spatial Tower SOD RGB Fusion FCN Class I Class II Class III Class VI Supervised Self-Supervised + Supervised Sup Self -Sup UCF101 / Split #1 51.53 60.45 UCF101 / Split #2 51.73 58.23 UCF101 / Split #3 50.05 55.60 HMDB51 / Split #1 31.47 33.54 HMDB51 / Split #2 32.91 36.60 Honda 81.72 82.78 Motion Tower Learned Temporal Visualization Early Overfitting Temporal filter visualization is uninformative. Input motion reconstruction is utilized for qualitative evaluation. Self-supervised reconstructions are less noisy compared to supervised learned features. Syntactic motion Supervised Self-Supervised Real motion Supervised Self-Supervised More Details Here Lack of big annotated datasets can limit supervised learning techniques for many domains like the medical, and autonomous car driving. For these applications, there exists massive data. However, annotating such datasets can be costly for various reasons; it can be time-consuming as in semantic segmentation or expensive due to medical field expertise requirement. In this work, we investigate using self-supervised learning for videos. Specifically, we focus on training deep CNNs for the task of action recognition.

Two Stream Self -Supervised Learning for Action Recognitionahmdtaha/posters/CVPRW2018_poster.pdf · Ahmed Taha Moustafa Meshry Xitong Yang Yi-Ting Chen Larry Davis Goals • Tangle

  • Upload
    others

  • View
    1

  • Download
    0

Embed Size (px)

Citation preview

Page 1: Two Stream Self -Supervised Learning for Action Recognitionahmdtaha/posters/CVPRW2018_poster.pdf · Ahmed Taha Moustafa Meshry Xitong Yang Yi-Ting Chen Larry Davis Goals • Tangle

Two Stream Self-Supervised Learning for Action RecognitionAhmed Taha Moustafa Meshry Xitong Yang Yi-Ting Chen Larry Davis

Goals• Tangle spatial and temporal representation for action recognition.• Boost performance on small and imbalanced video datasets

• HMDB and UCF 130 samples per class• Honda Driving Dataset is imbalanced

Approach1. Sequence Verification ? [Shuffle & learn, O3N, OPN]

2. Spatial Temporal Verification ?

ArchitectureGiven a tuple of a RGB frame and a stack of difference (SOD), thenetwork reasons about frame ordering and spatio-temporalcorrespondence.

Our approach supports various motion encodings like stack ofdifference, dynamic Images or optical flow

Experiments : Quantitative Evaluation on HMDB51, UCF101, Honda.

Temporal representation learning can benefit from ImageNet Spatial representation.

Spatial Tower

SOD

RGB Fusion FCN

Class I Class II

Class III Class VI

Supervised

Self-Supervised + Supervised

Sup Self-Sup

UCF101/Split#1 51.53 60.45

UCF101/Split#2 51.73 58.23

UCF101/Split#3 50.05 55.60

HMDB51/Split#1 31.47 33.54

HMDB51/ Split#2 32.91 36.60

Honda 81.72 82.78Motion Tower

Learned Temporal Visualization

Early Overfitting

Temporal filter visualization is uninformative.Input motion reconstruction is utilized forqualitative evaluation. Self-supervisedreconstructions are less noisy compared tosupervised learned features.

Syntactic motion Supervised Self-Supervised Real motion Supervised Self-Supervised

More Details Here

Lack of big annotated datasets can limit supervised learning techniques for many domains like the medical, andautonomous car driving. For these applications, there exists massive data. However, annotating such datasetscan be costly for various reasons; it can be time-consuming as in semantic segmentation or expensive due tomedical field expertise requirement.In this work, we investigate using self-supervisedlearning for videos. Specifically, we focus on trainingdeep CNNs for the task of action recognition.