Two Stream Self-Supervised Learning for Action Recognition (ahmdtaha/posters/CVPRW2018_poster.pdf)

  • Two Stream Self-Supervised Learning for Action Recognition
    Ahmed Taha, Moustafa Meshry, Xitong Yang, Yi-Ting Chen, Larry Davis

    Goals
    • Tangle spatial and temporal representation for action recognition.
    • Boost performance on small and imbalanced video datasets:
      • HMDB and UCF: 130 samples per class.
      • Honda Driving Dataset: imbalanced.

    Approach
    1. Sequence verification: are the frames in the correct order? [Shuffle & Learn, O3N, OPN]
    2. Spatio-temporal verification: do the RGB frame and the motion encoding correspond?

    Architecture

    Given a tuple of an RGB frame and a stack of differences (SOD), the network reasons about frame ordering and spatio-temporal correspondence.
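The poster does not spell out how training tuples are drawn, so the following is only a minimal sketch of one plausible scheme. It assumes four pretext classes formed by crossing "motion frames in order vs. shuffled" with "RGB frame from the same video vs. a different video". The names `Sample` and `sample_tuple`, the window size, and the label numbering are all hypothetical.

```python
import random
from typing import List, NamedTuple


class Sample(NamedTuple):
    rgb_video: int            # video the RGB frame is taken from
    rgb_frame: int            # frame index of the RGB frame
    motion_video: int         # video the motion window is taken from
    motion_frames: List[int]  # frame indices used to build the motion encoding
    label: int                # 0..3, see sample_tuple


def sample_tuple(video_lengths: List[int], window: int, rng: random.Random) -> Sample:
    """Draw one self-supervised training tuple (assumed labeling scheme).

    label = 2 * shuffled + mismatched:
      0: ordered,  RGB frame from the same video
      1: ordered,  RGB frame from a different video
      2: shuffled, RGB frame from the same video
      3: shuffled, RGB frame from a different video
    """
    if len(video_lengths) < 2:
        raise ValueError("need at least two videos to draw mismatched pairs")
    if window < 2 or min(video_lengths) < window:
        raise ValueError("window must be >= 2 and fit inside every video")

    shuffled = rng.random() < 0.5
    mismatched = rng.random() < 0.5

    # Pick a window of consecutive frames for the motion stream.
    motion_video = rng.randrange(len(video_lengths))
    start = rng.randrange(video_lengths[motion_video] - window + 1)
    motion_frames = list(range(start, start + window))

    if shuffled:
        in_order = motion_frames[:]
        while motion_frames == in_order:  # reshuffle until the order really changes
            rng.shuffle(motion_frames)

    if mismatched:
        others = [v for v in range(len(video_lengths)) if v != motion_video]
        rgb_video = rng.choice(others)
        rgb_frame = rng.randrange(video_lengths[rgb_video])
    else:
        rgb_video = motion_video
        rgb_frame = start  # first frame of the motion window

    label = 2 * int(shuffled) + int(mismatched)
    return Sample(rgb_video, rgb_frame, motion_video, motion_frames, label)


rng = random.Random(0)
example = sample_tuple([30, 40, 50], window=6, rng=rng)
```

The labels come for free from how each tuple was built, which is what makes the pretext task self-supervised: no human annotation is needed.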

    Our approach supports various motion encodings, such as stack of differences, dynamic images, or optical flow.
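As an illustration of the simplest of these encodings, a stack of differences can be sketched as the differences between consecutive frames stacked along the channel axis. This is a minimal sketch assuming grayscale frames and adjacent-frame differences; the poster does not give the exact frame spacing or preprocessing, and the function name `stack_of_differences` is hypothetical.

```python
import numpy as np


def stack_of_differences(frames: np.ndarray) -> np.ndarray:
    """Build a stack of differences from a clip of grayscale frames.

    frames: array of shape (T, H, W) with T >= 2.
    Returns an array of shape (H, W, T - 1), one channel per frame difference.
    """
    if frames.ndim != 3 or frames.shape[0] < 2:
        raise ValueError("expected an array of shape (T, H, W) with T >= 2")
    # Cast first so that uint8 inputs do not wrap around on subtraction.
    frames = frames.astype(np.float32)
    diffs = frames[1:] - frames[:-1]        # shape (T - 1, H, W)
    return np.transpose(diffs, (1, 2, 0))   # channels-last: (H, W, T - 1)


# Usage: a clip of 6 frames yields a 5-channel motion input.
clip = np.random.randint(0, 256, size=(6, 224, 224), dtype=np.uint8)
sod = stack_of_differences(clip)
```

Unlike optical flow, this encoding needs no separate estimation step, which keeps the input pipeline cheap.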

    Experiments: quantitative evaluation on HMDB51, UCF101, and Honda.

    Temporal representation learning can benefit from ImageNet spatial representation.

    [Figure labels: Spatial Tower, Motion Tower, RGB, SOD, Fusion, FCN; Class I, Class II, Class III, Class VI. Legend: Supervised; Self-Supervised + Supervised.]

    Dataset / Split      Sup     Self-Sup
    UCF101 / Split #1    51.53   60.45
    UCF101 / Split #2    51.73   58.23
    UCF101 / Split #3    50.05   55.60
    HMDB51 / Split #1    31.47   33.54
    HMDB51 / Split #2    32.91   36.60
    Honda                81.72   82.78
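The gain from self-supervised pre-training can be read off the table by subtraction. The sketch below copies the reported numbers and computes the per-row difference. The names `results` and `absolute_gains` are illustrative, and treating the scores as accuracy percentages is an assumption, since the poster does not label the unit.

```python
from typing import Dict, Tuple

# (Sup, Self-Sup) scores copied from the poster's table.
results: Dict[str, Tuple[float, float]] = {
    "UCF101 / Split #1": (51.53, 60.45),
    "UCF101 / Split #2": (51.73, 58.23),
    "UCF101 / Split #3": (50.05, 55.60),
    "HMDB51 / Split #1": (31.47, 33.54),
    "HMDB51 / Split #2": (32.91, 36.60),
    "Honda": (81.72, 82.78),
}


def absolute_gains(table: Dict[str, Tuple[float, float]]) -> Dict[str, float]:
    """Return Self-Sup minus Sup for every row, rounded to two decimals."""
    return {name: round(self_sup - sup, 2) for name, (sup, self_sup) in table.items()}


gains = absolute_gains(results)
for name, gain in gains.items():
    print(f"{name:<20} +{gain:.2f}")
```

Every row improves. The largest difference is on UCF101 Split #1 (8.92 points) and the smallest is on Honda (1.06 points).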

    [Plot: Early Overfitting]

    Learned Temporal Visualization

    Temporal filter visualization is uninformative, so input motion reconstruction is used for qualitative evaluation. Self-supervised reconstructions are less noisy than those obtained from supervised features.

    [Figure: reconstructions of synthetic motion and real motion, each shown for supervised and self-supervised features.]

    The lack of large annotated datasets can limit supervised learning in many domains, such as medicine and autonomous driving. Massive amounts of data exist for these applications, but annotating them can be costly: annotation can be time-consuming, as in semantic segmentation, or expensive because it requires medical expertise. In this work, we investigate self-supervised learning for videos. Specifically, we focus on training deep CNNs for action recognition.