
Learnable pooling with Context Gating for video classification

Antoine Miech 1,2    Ivan Laptev 1,2    Josef Sivic 1,2,3

Ranked 1st at kaggle

Model Overview

1 DI ENS, École Normale Supérieure, PSL Research University    2 Inria    3 CIIRC, CTU in Prague

Goal

Challenges

Contributions

LOUPE Tensorflow toolbox: github.com/antoine77340/LOUPE    arXiv: 1706.06905

Focus on aggregation

Results on the Youtube-8M dataset

Training speed and Generalization

Learnable pooling via Clustering

Context Gating

References

• Multi-label video tagging

Groundtruth: Car - Vehicle - Sport Utility Vehicle - Dacia Duster - Renault - 4-Wheel Drive
Top 6 scores: Car - Vehicle - Sport Utility Vehicle - Dacia Duster - Fiat Automobiles - Volkswagen Beetles

Groundtruth: Tree - Christmas Tree - Christmas Decoration - Christmas
Top 6 scores: Christmas - Christmas decoration - Origami - Paper - Tree - Christmas Tree

• Large diversity of labels
• Incomplete and noisy annotation
• What is a good representation for video?

• Learnable pooling for video representations
• Context Gating: a non-linear learnable module for modeling interdependencies between activations

[Figure: Context Gating block. The input passes through a fully-connected layer and a sigmoid; the resulting gates are multiplied element-wise with the input to produce the gated output. Illustration: feature activations for Ski, Tree and Snow before and after gating.]

• Models interdependencies between activations with a self-gating mechanism

• Adds quadratic interactions

• Inspired by the Gated Linear Unit [3]

[1] R. Arandjelović et al. NetVLAD: CNN architecture for weakly supervised place recognition. CVPR 2016
[2] J. Sivic et al. Video Google: a text retrieval approach to object matching in videos. ICCV 2003
[3] Y. N. Dauphin et al. Language modeling with gated convolutional networks. arXiv:1612.08083
[4] H. Jégou et al. Aggregating local descriptors into a compact image representation. CVPR 2010
[5] F. Perronnin et al. Fisher kernels on visual vocabularies for image categorization. CVPR 2007

Release

• How to aggregate multi-modal sequences of features into a good compact representation?

[Figure: visual features and audio features are aggregated by a supervised model into a single and compact pooled representation.]

• Training time vs. performance for LSTM, GRU, NetVLAD and Gated NetVLAD.

• Performance as a function of the training set size.

• Soft-DBoW: Soft Bag-of-Words [2] formulated in terms of differentiable operations.

• NetVLAD [1] / NetRVLAD: NetVLAD approximates a VLAD [4] encoding with differentiable operations; NetRVLAD is a residual-less NetVLAD.

• NetFV: a variant of the Fisher Vector representation [5] that approximates the Gaussian Mixture Model with differentiable operations, obtained as a simple extension of NetVLAD [1] (see the pooling sketch below).
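These clustering-based pooling modules share the same recipe: each frame descriptor is softly assigned to a set of learnable clusters, and per-cluster statistics are aggregated and normalized into one fixed-size vector. The NumPy forward pass below is a minimal illustrative sketch of this idea in the NetVLAD style; the function name netvlad_pool and the parameter names (cluster_centers, assignment_w, assignment_b) are assumptions made here for illustration and do not mirror the LOUPE implementation.

import numpy as np

def netvlad_pool(features, cluster_centers, assignment_w, assignment_b):
    """NetVLAD-style pooling of a frame-feature sequence (forward pass only).

    features:        (T, D) array, one D-dimensional descriptor per frame.
    cluster_centers: (K, D) learnable cluster centers.
    assignment_w:    (D, K) learnable soft-assignment weights.
    assignment_b:    (K,)   learnable soft-assignment biases.
    Returns a (K * D,) pooled, L2-normalized video representation.
    """
    # Softly assign every descriptor to every cluster (softmax over clusters).
    logits = features @ assignment_w + assignment_b            # (T, K)
    logits -= logits.max(axis=1, keepdims=True)                # numerical stability
    soft_assign = np.exp(logits)
    soft_assign /= soft_assign.sum(axis=1, keepdims=True)      # (T, K)

    # Aggregate residuals to each cluster, weighted by the soft assignment.
    # Dropping the "- cluster_centers" term gives the residual-less NetRVLAD variant.
    residuals = features[:, None, :] - cluster_centers[None, :, :]   # (T, K, D)
    vlad = (soft_assign[:, :, None] * residuals).sum(axis=0)         # (K, D)

    # Intra-normalize each cluster vector, then L2-normalize the whole descriptor.
    vlad /= np.linalg.norm(vlad, axis=1, keepdims=True) + 1e-12
    vlad = vlad.reshape(-1)
    return vlad / (np.linalg.norm(vlad) + 1e-12)

In the same spirit, Soft-DBoW keeps only the summed soft-assignment weights per cluster, and NetFV extends the soft assignment to first- and second-order statistics of a Gaussian mixture.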

[Plot: ensemble GAP as a function of the number of models (0 to 25); GAP range 83.0 to 85.5.]

• Comparison of different pooling-based methods (with or without Context Gating) and the RNN pooling approaches.

• The effect of ensembling different methods. Averaging the prediction scores of 7 different models achieves first place in the kaggle Youtube-8M challenge out of 650 teams (a sketch of the averaging follows below). The models are: Gated NetVLAD, Gated NetRVLAD, Gated NetFV, Gated Soft-DBoW, DBoW, GRU and LSTM.
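As a rough illustration of this score averaging (the function name average_predictions is hypothetical and not taken from the challenge code), assuming each model outputs a score matrix of identical shape:

import numpy as np

def average_predictions(model_scores):
    """Ensemble by averaging per-class prediction scores.

    model_scores: list of (num_videos, num_classes) arrays, one per model
    (e.g. Gated NetVLAD, Gated NetRVLAD, Gated NetFV, GRU, LSTM, ...).
    Returns the averaged (num_videos, num_classes) score matrix.
    """
    return np.mean(np.stack(model_scores, axis=0), axis=0)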

[Plot: GAP as a function of the number of training samples (10^5 to 10^7); curves for Gated NetVLAD, NetVLAD, LSTM and Average Pooling; GAP range 65 to 85.]

Y = σ(W X + b) ∘ X

X: input layer, Y: output layer, W, b: parameters to learn (σ: element-wise sigmoid, ∘: element-wise multiplication)
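A minimal NumPy sketch of this gating equation follows; it is illustrative only (the function name context_gate is an assumption, and this is not the released LOUPE code).

import numpy as np

def context_gate(x, W, b):
    """Context Gating forward pass: y = sigmoid(W @ x + b) * x.

    x: (n,) input activations, e.g. a pooled video representation.
    W: (n, n) and b: (n,) are the parameters to learn.
    Each activation is re-weighted by a gate computed from all activations,
    which introduces the quadratic interactions mentioned above.
    """
    gates = 1.0 / (1.0 + np.exp(-(W @ x + b)))   # element-wise sigmoid gates
    return gates * x                             # element-wise multiplication with the input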

[Plot: GAP as a function of training time (2 to 14 hours); curves for Gated NetVLAD, NetVLAD, GRU and LSTM; GAP range 0.70 to 0.84.]

• 7+ million videos
• 4,716 labels
• 3.4 labels per video on average
• 450,000 hours of video

[Model overview diagram: video features and audio features are each aggregated by a learnable pooling module (e.g. NetVLAD, NetFV); the pooled representation passes through an FC layer and Context Gating, then a Mixture of Experts (MoE) classifier followed by a second Context Gating. Pipeline stages: INPUT FEATURES, FEATURE POOLING, CLASSIFICATION.]


[Bar chart: GAP per model (range 81.5 to 83.5) for NetVLAD, NetRVLAD, NetFV, Soft-DBoW, GRU and LSTM; legend: With Context Gating, Based on Clustering, RNN.]