
Few-Shot Video Classification via Temporal Alignment

Kaidi Cao, Jingwei Ji∗, Zhangjie Cao∗, Chien-Yi Chang, Juan Carlos Niebles
Stanford University

{kaidicao, jingweij, caozj18, cy3, jniebles}@cs.stanford.edu

Abstract

There is a growing interest in learning a model which could recognize novel classes with only a few labeled examples. In this paper, we propose the Temporal Alignment Module (TAM), a novel few-shot learning framework that can learn to classify a previously unseen video. While most previous works neglect long-term temporal ordering information, our proposed model explicitly leverages the temporal ordering information in video data through temporal alignment. This leads to strong data-efficiency for few-shot learning. Concretely, TAM calculates the distance value of a query video with respect to novel class proxies by averaging the per-frame distances along its alignment path. We introduce continuous relaxation to TAM so the model can be learned in an end-to-end fashion to directly optimize the few-shot learning objective. We evaluate TAM on two challenging real-world datasets, Kinetics and Something-Something V2, and show that our model leads to significant improvement of few-shot video classification over a wide range of competitive baselines.

1. Introduction

The emergence of deep learning has greatly advanced the frontiers of action recognition [41, 4]. The main focus tends to center around learning effective video representations for classification using large amounts of labeled data. In order to recognize novel classes that a pretrained network has not seen before, we typically need to manually collect hundreds of video samples for knowledge transfer. Such a procedure is rather tedious and labor-intensive, especially for videos, where the difficulty and cost of labeling are much higher than for images.

There is a growing interest in learning models capable of effectively adapting themselves to recognize novel classes with only a few labeled examples. This is known as the few-shot learning task [10, 6]. Under the setup of meta-learning-based few-shot learning, the model is explicitly trained to

∗Indicates equal contribution.

[Figure 1 shows a query video and a support set containing the classes "making tea" and "making sushi".]

Figure 1. Our few-shot video classification setting. Pairs of semantically matched frames are connected with a blue dashed line. The arrows show the direction of the temporal alignment path.

deal with scarce training data for previously unseen classes across different episodes. While the majority of recent few-shot learning works focus on image classification, adapting it to video data is not a trivial extension.

Videos are much more complicated than images, as recognizing some specific actions, such as opening a door, usually requires complete modeling of temporal information. In the previous literature on video classification, 3D convolution and optical flow are two of the most popular methods to model temporal relations. The direct output of neural network encoders is always a temporal sequence of deeply encoded features. State-of-the-art approaches commonly apply a temporal pooling module (usually mean pooling) in order to make the final prediction. As observed before, averaging the deep features only captures the co-occurrence rather than the temporal ordering of patterns, which unavoidably results in information loss.

Loss of information is even more severe for few-shot


It is hard to learn the local temporal patterns which are useful for few-shot classification with a limited amount of data. Utilizing the long-term temporal ordering information, which is often neglected in previous works on video classification, might potentially help with few-shot learning. For example, if the model could verify that there is a procedure of pouring water before a close-up view of freshly made tea, as shown in Fig. 1, the model will become quite confident about predicting the class of this query video to be making tea, rather than some other potential predictions like boiling water or serving tea. In addition, Fig. 1 also shows that for two videos in the same class, even though they both contain a procedure of pouring water followed by a close-up view of tea, the exact temporal duration of each atomic step can vary dramatically. These non-linear temporal variations of videos pose great challenges for few-shot video learning.

With these insights, we propose the Temporal Alignment Module (TAM) for few-shot video classification, a novel temporal-alignment-based approach that learns to estimate the temporal alignment score of a query video with corresponding proxies in the support set. To be specific, we compute the temporal alignment score for each potential query-support pair by averaging per-frame distances along a temporal alignment path, which enforces that the score we use to make predictions preserves temporal ordering. Furthermore, TAM is fully differentiable, so the model can be trained end-to-end and optimize the few-shot objective directly. This in turn helps the model to better utilize long-term temporal information to make few-shot learning predictions. This module allows us to better model the temporal evolution of videos, while enabling stronger data efficiency in the few-shot setting.

We evaluate our model on the few-shot video classification task on two action recognition datasets: Kinetics [17] and Something-Something V2 [12]. We show that when there is only a single example available, our method outperforms the mean pooling baseline, which does not consider temporal ordering information, by approximately 8% in top-1 accuracy. We also show qualitatively that the proposed framework is able to learn a meaningful alignment path in an end-to-end manner.

In summary, our main contributions are: (i) We are the first to explicitly address the non-linear temporal variations issue in the few-shot video classification setting. (ii) We propose the Temporal Alignment Module (TAM), a data-efficient few-shot learning framework that can dynamically align two video sequences while preserving the temporal ordering, which is often neglected in previous works. (iii) We use continuous relaxation to make our model fully differentiable and show that it outperforms previous state-of-the-art methods by a large margin on two challenging datasets.

2. Related Work

Few-Shot Learning. To address few-shot learning, a direct approach is to train a model on the training set and fine-tune it with the few data points in the novel classes. Since the data in the novel classes are not enough to fine-tune the model with general learning techniques, methods have been proposed to learn a good initialization model [9, 26, 32] or develop a novel optimizer [30, 25]. These works aim to relieve the difficulty of fine-tuning the model with limited samples. However, such methods suffer from overfitting when the training data in the novel classes are scarce but their variance is large. Another branch of works, which learns a common metric for both seen and novel classes, can avoid overfitting to some extent. Convolutional Siamese Net [20] trains a Siamese network to compare two samples. Matching Networks [39] employ an attention kernel to measure the distance. Prototypical Networks [35] utilize the Euclidean distance to the class center. Graph Neural Networks [10] construct a weighted graph to represent all the data and measure the similarity between data points. Other methods use data augmentation, learning to augment labeled data in unseen classes for supervised training [13, 44]. However, video generation is still an under-explored problem, especially generating videos conditioned on a specific category. Thus, in this paper, we employ the metric learning approach and design a temporally aligned video metric for few-shot video classification.

There are also works exploring few-shot video recognition. OSS-Metric Learning [19] measures the similarity of video pairs to enable one-shot video classification. [23] introduces a zero-shot method which learns a mapping function from an attribute to a class center, with an extension to few-shot learning by integrating labeled data of unseen classes. CMN [47] is the most related work to ours. They introduce a multi-saliency embedding algorithm to encode a video into a fixed-size matrix representation. They then propose a compound memory network (CMN) to compress and store the representation and classify videos by matching and ranking. However, previous works collapse the frame order in their representations [19, 23, 47]. Thus, the learned model is sub-optimal for video datasets where sequence order is important. In this paper, we preserve the frame order in the video representation and estimate distance with temporal alignment, which utilizes video sequence order to solve few-shot video classification.

Video Classification. A significant amount of research has tackled the problem of video classification. State-of-the-art video classification methods have evolved from hand-crafted representation learning [18, 33, 40] to deep-learning-based models. C3D [37] utilizes 3D spatial-temporal convolutional filters to extract deep features from sequences of RGB frames. TSN [41] and I3D [4] use two-stream 2D or 3D CNNs of larger size on both RGB and optical flow sequences.


Figure 2. Overview of our method. We first extract per-frame deep features using the embedding network. We then compute the distance matrices between the query video and the videos in the support set. Next, an alignment score is computed from the matrix representation. Finally, we apply a softmax operator over the alignment score of each novel class.

By factorizing 3D convolutional filters into separate spatial and temporal components, P3D [29] and R(2+1)D [38] yield models with comparable or superior classification accuracy at a smaller size. An issue with these video representation learning methods is their dependence on large-scale video datasets for training. Models with an excessive number of learnable parameters tend to fail when only a small number of training samples is available.

Another concern of video representation learning is the lack of temporal relational reasoning. Classification of videos that are sensitive to temporal ordering poses a more significant challenge to the above networks, which are tailored to capture short-term temporal features. Non-local neural networks [42] introduce self-attention to aggregate temporal information over the long term. Wang et al. [43] further employ space-time region graphs to model spatial-temporal reasoning. Recently, TRN [46] proposed a temporal relational module to achieve superior performance. Still, these networks inevitably pool or fuse features from different frames in the last layers to extract a single feature vector representing the whole video. In contrast, our model is able to learn a video representation without loss of temporal ordering in order to generate more accurate final predictions.

Sequence Alignment. Sequence alignment is of great importance in the field of bioinformatics, where it describes the arrangement of DNA/RNA or protein sequences in order to identify the regions of similarity among them [2]. In the vision community, researchers have a growing interest in tackling the sequence alignment problem with high-dimensional multi-modal data, such as finding the alignment between an untrimmed video sequence and the corresponding textual action sequence [5, 8, 31]. The main technique that has been applied to this line of work is dynamic programming.

While dynamic programming is guaranteed to find the optimal alignment between two sequences given a prescribed distance function, the discrete operations used in dynamic programming are non-differentiable and hence prevent learning distance functions with gradient-based methods. Our work is closely related to recent progress on using continuous relaxation of discrete operations to tackle the sequence alignment problem [5], which allows us to train our entire model end-to-end.

3. Methods

Our goal is to learn a model that can classify novel classes of videos with only a few labeled examples. The wide range of intra-class spatial-temporal variations of videos poses great challenges for few-shot video classification. We address this challenge by proposing a few-shot learning framework with a Temporal Alignment Module (TAM), which is, to our best knowledge, the first model that can explicitly learn a distance measure independent of non-linear temporal variations in videos. The use of TAM sets our approach apart from previous works that fail to preserve temporal ordering and relations during meta-training and meta-testing. Fig. 2 shows the outline of our model.

In the following, we first provide a problem formulation of the few-shot video classification task, and then define our model and show how it can be used at training and test time.

3.1. Problem Formulation

In the few-shot video classification setting, we split the classes we have annotations for into Ctrain, the base classes that have sufficient data for representation learning, and Ctest, the novel or unseen classes that have only a few labeled samples at the testing stage.


The goal of few-shot learning is then to train a network that can generalize well to new episodes over novel classes. In an n-way, k-shot problem, for each episode the support set contains n novel classes, and each class has a very small number of samples (k in our setting). The algorithm has to classify videos from the query set into one of the n novel classes in the support set. Episodes are randomly drawn from a larger collection of data, which we hereby denote as the meta set. In our setting, we introduce three splits over classes: the meta-training set Ttrain, the meta-validation set Tval and the meta-testing set Ttest.

We formulate few-shot learning as a representation learning problem through a distance function φ(fϕ(x1), fϕ(x2)), where x1 and x2 are two samples drawn from Ctrain and fϕ(·) is an embedding function that maps samples to their representations. The difference between our problem formulation and the majority of previous few-shot learning research lies in the fact that we are now dealing with higher-dimensional inputs, i.e., (2+1)D volumes instead of 2D images. The addition of the time dimension in the few-shot setting demands that the model learn temporal ordering and relations from limited data in order to generalize to novel classes, which poses challenges that have not been properly addressed by previous works.

3.2. Model

With the above problem formulation, our goal is to learn a video distance function by minimizing the few-shot learning objective. Our key insight is that we want to explicitly learn a distance function independent of non-linear temporal variations by aligning the frames of two videos. Unlike previous works which use a weighted average or mean pooling along the time dimension [41, 37, 42, 45, 4, 47], our model is able to infer temporal ordering and relationships during meta-training and meta-testing in an explicit and data-efficient manner. In this subsection, we break down our model following the pipeline illustrated in Fig. 2.

Embedding Module: The purpose of the embedding module fϕ is to generate a compact representation of a trimmed action video that encapsulates its visual content. A raw video usually consists of hundreds of frames, whose information would be redundant were we to perform per-frame inference. Thus frame sampling is usually adopted as a pre-processing stage for video inputs. Existing frame sampling schemes can be mainly divided into two categories: dense sampling (randomly cropping out T consecutive frames from the original full-length video and then dropping every other frame) [37, 4, 42, 45] and sparse sampling (samples distributed uniformly along the temporal dimension) [41, 48, 46, 21]. We follow the sparse sampling protocol first described in TSN [41], which divides the video sequence into T segments and extracts a short snippet from each segment.

The sparse sampling scheme allows each video sequence to be represented by a fixed number of snippets. The sampled snippets span the whole video, which enables long-term temporal modeling.
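To make the sampling protocol concrete, a minimal sketch of TSN-style sparse sampling is given below; the function name and the choice T = 8 are placeholders of ours, not values taken from the text.

```python
import random

def sparse_sample_indices(num_frames, T=8):
    """TSN-style sparse sampling: split the video into T equal-length segments
    and draw one frame index uniformly at random from each segment."""
    seg_len = num_frames / float(T)
    return [min(num_frames - 1, int(t * seg_len + random.random() * seg_len))
            for t in range(T)]
```

At test time, the random offset would typically be replaced by the segment center to make the sampling deterministic.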

Given an input sequence S = {X1, X2, ..., XT}, we apply a CNN backbone network fϕ to process each of the snippets individually. After encoding, the raw video snippets are turned into a sequence of feature vectors fϕ(S) = {fϕ(X1), fϕ(X2), ..., fϕ(XT)}. It is worth noting that the embedding of each video fϕ(S) has dimension T × Df, rather than Df as for an image embedding, which is usually chosen as the activation before the final fully-connected layer of the CNN.

Distance Measure with Temporal Alignment Module (TAM): Given two videos Si, Sj and their embedded features fϕ(Si), fϕ(Sj) ∈ R^(T×Df), we can calculate the frame-level distance matrix D ∈ R^(T×T) as

$$D_{l,m} = 1 - \frac{f_\varphi(S_i)_l \cdot f_\varphi(S_j)_m}{\lVert f_\varphi(S_i)_l \rVert \, \lVert f_\varphi(S_j)_m \rVert}, \qquad (1)$$

where D_{l,m} is the frame-level distance value between the l-th frame of video Si and the m-th frame of video Sj.
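A minimal sketch of Eq. (1) in PyTorch, assuming the per-frame embeddings of each video are stacked into a (T, Df) tensor; the function name is ours for illustration.

```python
import torch
import torch.nn.functional as F

def frame_distance_matrix(feat_i, feat_j):
    """Eq. (1): D[l, m] = 1 - cosine similarity between frame l of video S_i
    and frame m of video S_j. feat_i, feat_j: (T, D_f) per-frame embeddings."""
    a = F.normalize(feat_i, dim=1)   # L2-normalize every frame feature
    b = F.normalize(feat_j, dim=1)
    return 1.0 - a @ b.t()           # (T, T) frame-level distance matrix
```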

We further define 𝒲 ⊂ {0, 1}^(T×T) to be the set of possible binary alignment matrices, where for every W ∈ 𝒲, W_{l,m} = 1 if the l-th frame of video Si is aligned to the m-th frame of video Sj. Our goal is to find the best alignment W* ∈ 𝒲,

$$W^* = \arg\min_{W \in \mathcal{W}} \langle W, D(f_\varphi(S_i), f_\varphi(S_j)) \rangle, \qquad (2)$$

which minimizes the inner product between the alignment matrix W and the frame-level distance matrix D defined in Eq. (1). The video distance measure is thus given by

$$\phi(f_\varphi(S_i), f_\varphi(S_j)) = \langle W^*, D \rangle. \qquad (3)$$

We propose to use a variant of the Dynamic Time Warping (DTW) algorithm [24] to solve Eq. (2). This is achieved by solving for a cumulative distance function

$$\gamma(i, j) = D_{i,j} + \min\{\gamma(i-1, j-1),\, \gamma(i-1, j),\, \gamma(i, j-1)\}. \qquad (4)$$
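For reference, the plain (hard-min) recurrence in Eq. (4) can be evaluated with a standard dynamic program; this sketch only illustrates the recurrence and, unlike the relaxed version introduced below, is not differentiable.

```python
import numpy as np

def plain_dtw_score(D):
    """Evaluate the cumulative distance of Eq. (4) over a (T, T) matrix D."""
    T = D.shape[0]
    gamma = np.full((T, T), np.inf)
    gamma[0, 0] = D[0, 0]
    for i in range(T):
        for j in range(T):
            if i == 0 and j == 0:
                continue
            prev = min(
                gamma[i - 1, j - 1] if i > 0 and j > 0 else np.inf,
                gamma[i - 1, j] if i > 0 else np.inf,
                gamma[i, j - 1] if j > 0 else np.inf,
            )
            gamma[i, j] = D[i, j] + prev
    return gamma[T - 1, T - 1]   # boundary condition: the path ends at (T, T)
```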

In this setting of plain DTW, an alignment path is a contiguous set of matrix elements defining a mapping between two sequences that satisfies three conditions: boundary conditions, continuity and monotonicity. The boundary condition constrains the alignment matrix W such that W_{1,1} = 1 and W_{T,T} = 1 must hold in every possible alignment path. In our alignment formulation, although the videos are trimmed, the action in the query video does not have to match the proxy exactly at its start and end. For example, consider the action of making coffee: there might be a snippet of stirring coffee at the end of the action, or there might not. To address this issue, we propose to relax the boundary condition.


Figure 3. Methods for calculating the alignment score. Each subplot shows a distance matrix. The darker the color of an entry, the smaller the distance value of the corresponding pair of frames. The entries with green borders denote the entries contributing to the final alignment score.

Instead of having a path aligning the two videos from start to end, we allow the algorithm to find a path with flexible starting and ending points, while maintaining continuity and monotonicity. To achieve this, we pad two columns of zeros at the start and the end of the distance matrix, which enables the alignment process to start and end at arbitrary positions. So for our method, instead of computing the alignment score on a T × T matrix, we work with a padded matrix of size T × (T + 2). For simplicity, we denote the indices of the first dimension as 1, 2, ..., T, and the indices of the second dimension as 0, 1, 2, ..., T, T + 1. The cumulative distance function then becomes

$$
\gamma(i, j) = D_{i,j} +
\begin{cases}
\min\{\gamma(i-1, j-1),\, \gamma(i-1, j),\, \gamma(i, j-1)\}, & j = 0 \text{ or } j = T + 1 \\
\min\{\gamma(i-1, j-1),\, \gamma(i, j-1)\}, & \text{otherwise}
\end{cases}
\qquad (5)
$$

Note that if we follow Eq. (5) to compute the alignment score, the score is naturally normalized. Since at each time step except j = 0 and j = T + 1 the recurrence forces a path from γ(·, j − 1) to γ(·, j), the final alignment score is a summation of exactly T entries. In this light, alignment scores computed from different pairs of query and support videos are normalized, which means that the scale is not affected by the chosen path.

Differentiable TAM with Continuous Relaxation: While the above formulation is straightforward, the key technical challenge is that γ is not differentiable with respect to the distance function D. Following recent works on the continuous relaxation of discrete operations and their application to video temporal segmentation [5, 22], we introduce a continuous relaxation to our Temporal Alignment Module (TAM). We use log-sum-exp with a smoothing parameter λ > 0 to approximate the non-differentiable minimum operator in Eq. (5):

$$\min(x_1, x_2, \ldots, x_n) \approx -\lambda \log \sum_{i=1}^{n} e^{-x_i / \lambda} \quad \text{as } \lambda \to 0. \qquad (6)$$

While the use of continuous relaxation in Eq. (6) does not convexify the objective function, it helps the optimization process and allows gradients to be backpropagated through TAM.

Training and Inference: We have shown how to compute the cumulative distance function γ and how to use continuous relaxation to make it differentiable given a pair of input videos (Si, Sj). The video distance measure is given by

$$\phi(f_\varphi(S_i), f_\varphi(S_j)) = \gamma(T, T + 1). \qquad (7)$$

At training time, given a query video S, its ground-truth matched support video Ŝ, and the support set 𝒮, we train our entire model end-to-end by directly minimizing the loss function

$$\mathcal{L} = -\log \frac{\exp(-\phi(f_\varphi(S), f_\varphi(\hat{S})))}{\sum_{Z \in \mathcal{S}} \exp(-\phi(f_\varphi(S), f_\varphi(Z)))}. \qquad (8)$$

At test time, we are given an unseen query video Q and its support set 𝒮; our goal is to find the video S* ∈ 𝒮 that minimizes the video distance function

$$S^* = \arg\min_{S \in \mathcal{S}} \phi(Q, S). \qquad (9)$$

4. Experiments

In this work, our task is few-shot video classification, where the objective is to classify novel classes with only a few examples from the support set. We divide our experiments into the following sections.

4.1. Datasets

As pointed out by [45, 46], existing action recognition datasets can be roughly classified into two groups: YouTube-type videos (UCF101 [36], Sports-1M [16], Kinetics [17]) and crowd-sourced videos (Jester [1], Charades [34], Something-Something V1&V2 [12]), in which the videos are collected by asking crowd-source workers to record themselves performing instructed activities.


Crowd-sourced videos usually focus more on modeling the temporal relationships, since visual contents among different classes are more similar than those of YouTube-type videos. To demonstrate the effectiveness of our approach on these two groups of video data, we base our few-shot evaluation on two action recognition datasets, Kinetics [17] and Something-Something V2 [12].

Kinetics [17] and Something-Something V2 [12] were constructed to serve as action recognition datasets, so we have to construct their few-shot versions. For the Kinetics dataset, we follow the same split as CMN [47] and sample 64 classes for meta-training, 12 classes for meta-validation and 24 classes for meta-testing. Since there is no existing split for few-shot classification on Something-Something V2, we construct a few-shot dataset following the same rule as CMN [47]. We randomly select 100 classes from the whole dataset. The 100 classes are then split into 64, 12 and 24 classes as the meta-training, meta-validation and meta-testing sets, respectively.

4.2. Implementation Details

For an n-way, k-shot test setting, we randomly sample n classes, with each class containing k examples, to form the support set. We construct the query set to have n examples, where each unlabeled sample in the query set belongs to one of the n classes in the support set. Thus each episode has a total of n(k + 1) examples. We report the mean accuracy over 10,000 randomly sampled episodes in the following experiments.
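As an illustration of this episode construction (n-way, k-shot, with one query per class and n(k + 1) videos in total), a minimal sketch assuming a mapping from class names to lists of video identifiers:

```python
import random

def sample_episode(videos_by_class, n_way=5, k_shot=1):
    """Sample one n-way, k-shot episode: (support, query) lists of (video, label)."""
    classes = random.sample(list(videos_by_class), n_way)
    support, query = [], []
    for label, cls in enumerate(classes):
        picks = random.sample(videos_by_class[cls], k_shot + 1)
        support += [(v, label) for v in picks[:k_shot]]   # n * k support videos
        query.append((picks[-1], label))                  # n query videos
    return support, query
```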

We follow the video preprocessing procedure introduced in TSN [41]. During training we first resize each frame in the video to 256 × 256 and then randomly crop a 224 × 224 region from the video clip. For inference we change the random crop to a center crop. For the Kinetics dataset we randomly apply horizontal flips during training. Since the labels in the Something-Something V2 dataset incorporate an assumption of left and right, e.g., pulling something from left to right and pulling something from right to left, we do not use horizontal flips for this dataset.

Following the experiment setting of CMN, we use ResNet-50 [14] as the backbone network for TSN. We initialize the network with models pre-trained on ImageNet [7]. We optimize our model with SGD [3], starting from a learning rate of 0.001 and decaying it by 0.1 every 30 epochs. We use the meta-validation set to tune hyperparameters and stop training when the accuracy on the meta-validation set is about to decrease. We implement the whole framework in PyTorch [27]. Training the whole model takes 10 hours on 4 TITAN Xp GPUs.
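The optimization setup described above could be written roughly as follows; momentum and weight decay are not specified in the text and are therefore omitted, and the backbone construction is only a sketch of an ImageNet-pretrained ResNet-50 used as the frame-level embedding network.

```python
import torch
import torchvision

# ImageNet-pretrained ResNet-50 as the frame-level embedding network f_phi.
backbone = torchvision.models.resnet50(pretrained=True)
backbone.fc = torch.nn.Identity()   # keep the 2048-d pre-logit activation per frame

optimizer = torch.optim.SGD(backbone.parameters(), lr=0.001)
# Decay the learning rate by a factor of 0.1 every 30 epochs.
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=30, gamma=0.1)
```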

4.3. Evaluating Few-Shot Learning

We now evaluate the representations learned by optimizing the few-shot learning objective. We compare our method with the following two categories of baselines:

Table 1. Few-shot video classification results. We report 5-way video classification accuracy on the meta-testing set.

                        Kinetics          Something V2
Method               1-shot   5-shot    1-shot   5-shot
Matching Net [47]     53.3     74.6       -        -
MAML [47]             54.2     75.3       -        -
CMN [47]              60.5     78.9       -        -
TSN++                 64.5     77.9      33.6     43.0
CMN++                 65.4     78.8      34.4     43.8
TRN++                 68.4     82.0      38.6     48.9
TAM (ours)            73.0     85.8      42.8     52.3

4.3.1 Train from ImageNet Pretrained Features

For baselines that use ImageNet pretrained features, we follow the same setting as described in CMN. Since previous few-shot learning algorithms are all designed to deal with images, they usually take image-level features encoded by some backbone network as input. To circumvent this discrepancy, we first feed the frames of a video to a ResNet-50 network pretrained on ImageNet, and then average the frame-level features to obtain a video-level feature. The averaged video-level feature then serves as the input to the few-shot algorithms.

Matching Net [39] We use an FCE classification layer as in the original paper without fine-tuning in all experiments. The FCE module uses a bidirectional LSTM, and each training example can be viewed as an embedding of all the other examples.

MAML [9] Given the video-level feature as input, we train the model following the default hyper-parameters and other settings described in [9].

CMN [47] As CMN is specially designed for few-shot video classification, it can handle video feature inputs directly. The encoded feature sequence is first fed into a multi-saliency embedding function to get a video-level feature. The final few-shot prediction is made by a compound memory structure similar to [15].

For the experiment results using ImageNet pretrained backbones, we directly take the numbers from CMN [47] to ensure a fair comparison.

4.3.2 Finetune the Backbone on the Meta-Training Set

As raised by [6, 11, 28], using cosine distances between the input feature and a trainable proxy for each class can explicitly reduce intra-class variations among features during training. The rigorous experiments in [6] have shown that the Baseline++ model is competitive with or even surpasses other few-shot learning methods. So in the fine-tuned settings we adapt several previous approaches with the structure of Baseline++ to serve as strong baselines.


[Figure 4 shows two example episodes: a query of class "Unboxing" (CMN's match: Cutting Watermelon; our match: Unboxing) and a query of class "Pushing something from right to left" (CMN's match: Holding something in front of something; our match: Pushing something from right to left), each annotated with its mean score and alignment score.]

Figure 4. Visualization of our learning results. Comparison of our matched results with CMN's matched results in an episode. Although the averaged score between the false match and the query video is favorable, our algorithm is able to find the correct alignment path that minimizes the alignment score, which ultimately results in the correct prediction.

TSN++ For the TSN++ baseline, we also use episode-based training to simulate the few-shot setting at the meta-training stage, in order to directly optimize for generalization to unseen novel classes. To get a video-level representation, we average over the temporal dimension of the extracted per-frame features for both the query and support sets. The video-level features from the support set then serve as proxies for each novel class. We obtain the prediction probability for each class by normalizing the cosine distance values with a softmax function (see the sketch after these baseline descriptions). For inference during the meta-testing stage, we first forward each video in the support set to get proxies for each class. Given the proxies, we can then make predictions for videos in the query set.

CMN++ We follow the setting of CMN and reimplement the method ourselves. The only difference between CMN++ and CMN is that we replace the ImageNet pretrained features with the features extracted by the TSN++ model mentioned above.

TRN++ We also compare our approach against methods that attempt to learn a compact video-level representation given a sequence of image-level features. TRN [46] proposes a temporal relation module, which uses multilayer perceptrons (MLPs) to fuse features of different frames. TRN++ refers to the baseline obtained by replacing the average consensus module in TSN++ with the temporal relation module.
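A sketch of the TSN++ prediction rule described above: frame features are mean-pooled into a video-level vector, support videos provide class proxies (averaging the k shots per class is our assumption), and classes are scored by a softmax over cosine similarities. Function and variable names are ours.

```python
import torch
import torch.nn.functional as F

def tsn_pp_predict(query_feats, support_feats_by_class):
    """query_feats: (T, D_f); support_feats_by_class: list over n classes,
    each a list of k (T, D_f) support videos. Returns (predicted class, probs)."""
    q = F.normalize(query_feats.mean(dim=0), dim=0)              # video-level query feature
    proxies = torch.stack([
        F.normalize(torch.stack([s.mean(dim=0) for s in shots]).mean(dim=0), dim=0)
        for shots in support_feats_by_class
    ])                                                           # (n, D_f) class proxies
    probs = torch.softmax(proxies @ q, dim=0)                    # cosine scores -> softmax
    return probs.argmax().item(), probs
```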

By default we conduct 5-way few-shot classification unless stated otherwise. The 1-shot and 5-shot video classification results on both the Kinetics and Something-Something V2 datasets are listed in Table 1. Our approach significantly outperforms all the baselines on both datasets. In the CMN paper, the experimental observations show that fine-tuning the backbone on the meta-training set does not improve few-shot video classification performance. In contrast, we find that with proper data augmentation and training strategy, a model can be trained on the meta-training set to generalize better to unseen classes in a new domain.

By comparing the results of TSN++ and TRN++, we can conclude that explicitly considering temporal relations helps the model generalize to unseen classes. Compared to TSN++, the improvement brought by CMN++ is not as large as the gap on ImageNet pretrained features reported in the original paper. This may be because we now use a more suitable distance function (cosine distance) during meta-training, so that the frame-level features are more discriminative among unseen classes; this in turn makes it harder to improve the final prediction given those strong features as input. Finally, it is worth noting that TAM outperforms all the finetuned baselines by a large margin. This demonstrates the importance of taking temporal ordering information into consideration when dealing with the few-shot video classification problem.

4.4. Qualitative Results and Visualizations

We show qualitative results comparing CMN and TAM in Fig. 4. In particular, we observe that CMN has difficulty differentiating two actions from different classes that share very similar visual cues among all the frames, e.g., backgrounds. As can be seen from the distance matrices in Fig. 4, though our method cannot alter the fact that two visually similar action clips will have a lower average frame-wise distance value, it is able to find a temporal alignment that minimizes the cumulative distance between the query video and the true support class video even when the per-frame visual cues are not evident enough. Even when the mean score does not favor the correct match, TAM succeeds in making the right prediction by computing a lower alignment score out of the distance matrix.

4.5. Ablation Study

Here we perform ablation experiments to demonstrate the effectiveness of the design choices in our final model.


Table 2. Temporal matching ablation study. We compare our method to temporal-agnostic and temporal-aware baselines.

                   Kinetics          Something V2
matching type    1-shot   5-shot   1-shot   5-shot
Min               52.4     71.6     29.7     38.5
Mean              67.8     78.9     35.2     45.3
Diagonal          66.2     79.3     38.3     48.7
Plain DTW         69.2     80.6     39.6     49.0
TAM (ours)        73.0     85.8     42.8     52.3

We have shown in Section 4.3 that explicitly modeling the temporal ordering plays an important role for generalization to unseen classes. We now analyze the effect of different temporal alignment approaches.

Given the cosine distance matrix D, there are several choices we could adopt to extract the alignment score from the matrix, as visualized in Fig. 3. In addition to our proposed method, we consider several heuristics for generating the scores. The first is "Min", where we use the minimum element in the matrix D to represent the video distance value. The second is "Mean", for which we average the cosine distance values of all pairs of frames. These two choices both neglect the temporal ordering. We then introduce a few potential choices that explicitly consider sequence ordering when computing the temporal alignment score. An immediate scheme is to take an average over the diagonal of the distance matrix. The assumption behind this approach is that the query video sequence is perfectly aligned with its corresponding support proxy of the same class, which can be somewhat idealistic in real-world applications. To allow for a more adaptive alignment strategy, we introduce Plain DTW and our method. Here, Plain DTW in Table 2 means that there is no padding, so that W_{1,1} and W_{T,T} are assumed to be in the alignment path, and at each step of the alignment-score computation we allow a possible movement choice among →, ↘ and ↓.
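For concreteness, the temporal-agnostic and Diagonal baselines in Table 2 reduce to simple reductions of the distance matrix D; a short sketch under our reading of their definitions:

```python
import torch

def min_score(D):        # "Min": distance of the single most similar frame pair
    return D.min()

def mean_score(D):       # "Mean": average over all frame pairs, ignores ordering
    return D.mean()

def diagonal_score(D):   # "Diagonal": assumes the two videos are already frame-aligned
    return D.diagonal().mean()
```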

The results are shown in Table 2. It can be observed that we are able to improve few-shot learning by considering temporal ordering explicitly. There are some slight differences in performance between the Diagonal and Mean methods on the two datasets. There are fewer visual cues in each frame of Something-Something V2 than in Kinetics, so the improvement of Diagonal over Mean is prominent for Something-Something V2, while the gap is closed for Kinetics. However, we see that through adaptive temporal alignment, our method consistently improves over the baselines on both datasets by more than 3% across 1-shot and 5-shot settings. This shows that by encouraging the model to learn an adaptive alignment path between query videos and proxies, the final model learns to encode better representations for the video, as well as a more accurate alignment score, which in turn helps with few-shot classification.

The next ablation study is on the sensitivity of the smoothing parameter λ.

[Figure 5 plots top-1 and top-5 accuracy on Kinetics and Something-Something V2 as a function of the smoothing parameter, for λ ∈ {0.01, 0.05, 0.1, 0.5, 1}.]

Figure 5. Smoothing factor sensitivity. We compare the effect of using different smoothing factors.

Previous works [5, 22] have shown that using λ empirically helps optimization in many tasks. Intuitively, a smaller λ functions more like the min operation, and a larger λ means a heavier smoothing effect over the values at nearby positions. We experiment with λ in the value set [0.01, 0.05, 0.1, 0.5, 1].

The results are shown in Fig. 5. In general, the performance is stable across values of λ. We observe that in practice λ in the range of 0.05 to 0.1 works relatively well on both datasets. Thus a suitable λ is essential for representation learning. When λ is too small, although it functions most similarly to the real min operator, the gradients are too imbalanced, so that some pairs of frames are not adequately trained. On the contrary, a large λ might smooth too heavily, so that the differences among the possible alignments are not notable enough.

5. Conclusion

We propose the Temporal Alignment Module (TAM), a novel few-shot framework that can explicitly learn a distance measure and representation independent of non-linear temporal variations in videos using very little data. In contrast to previous works, TAM dynamically aligns two video sequences while preserving the temporal ordering, and it further uses continuous relaxation to directly optimize the few-shot learning objective in an end-to-end fashion. Our results and ablations show that our model significantly outperforms a wide range of competitive baselines and achieves state-of-the-art results on two challenging real-world datasets.


Acknowledgements This work has been partially supported by JD.com American Technologies Corporation (JD) under the SAIL-JD AI Research Initiative. This article solely reflects the opinions and conclusions of its authors and not JD or any entity associated with JD.com.

References

[1] The 20bn-jester dataset v1. https://20bn.com/datasets/jester.
[2] S. F. Altschul, T. L. Madden, A. A. Schaffer, J. Zhang, Z. Zhang, W. Miller, and D. J. Lipman. Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Research, 25(17):3389–3402, 1997.
[3] L. Bottou. Large-scale machine learning with stochastic gradient descent. In Proceedings of COMPSTAT'2010, pages 177–186. Springer, 2010.
[4] J. Carreira and A. Zisserman. Quo vadis, action recognition? A new model and the Kinetics dataset. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 6299–6308, 2017.
[5] C.-Y. Chang, D.-A. Huang, Y. Sui, L. Fei-Fei, and J. C. Niebles. D3TW: Discriminative differentiable dynamic time warping for weakly supervised action alignment and segmentation. arXiv preprint arXiv:1901.02598, 2019.
[6] W.-Y. Chen, Y.-C. Liu, Z. Kira, Y.-C. Wang, and J.-B. Huang. A closer look at few-shot classification. In International Conference on Learning Representations, 2019.
[7] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. ImageNet: A large-scale hierarchical image database. 2009.
[8] P. Dogan, B. Li, L. Sigal, and M. Gross. A neural multi-sequence alignment technique (NeuMATCH). In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 8749–8758, 2018.
[9] C. Finn, P. Abbeel, and S. Levine. Model-agnostic meta-learning for fast adaptation of deep networks. In Proceedings of the 34th International Conference on Machine Learning - Volume 70, pages 1126–1135. JMLR.org, 2017.
[10] V. Garcia and J. Bruna. Few-shot learning with graph neural networks. In ICLR, 2017.
[11] S. Gidaris and N. Komodakis. Dynamic few-shot visual learning without forgetting. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4367–4375, 2018.
[12] R. Goyal, S. E. Kahou, V. Michalski, J. Materzynska, S. Westphal, H. Kim, V. Haenel, I. Fruend, P. Yianilos, M. Mueller-Freitag, et al. The "something something" video database for learning and evaluating visual common sense. In ICCV, volume 2, page 8, 2017.
[13] B. Hariharan and R. Girshick. Low-shot visual recognition by shrinking and hallucinating features. In Proceedings of the IEEE International Conference on Computer Vision, pages 3018–3027, 2017.
[14] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 770–778, 2016.
[15] Ł. Kaiser, O. Nachum, A. Roy, and S. Bengio. Learning to remember rare events. arXiv preprint arXiv:1703.03129, 2017.
[16] A. Karpathy, G. Toderici, S. Shetty, T. Leung, R. Sukthankar, and L. Fei-Fei. Large-scale video classification with convolutional neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1725–1732, 2014.
[17] W. Kay, J. Carreira, K. Simonyan, B. Zhang, C. Hillier, S. Vijayanarasimhan, F. Viola, T. Green, T. Back, P. Natsev, et al. The Kinetics human action video dataset. arXiv preprint arXiv:1705.06950, 2017.
[18] A. Klaser, M. Marszałek, and C. Schmid. A spatio-temporal descriptor based on 3D-gradients. In BMVC 2008 - 19th British Machine Vision Conference, pages 275–1. British Machine Vision Association, 2008.
[19] O. Kliper-Gross, T. Hassner, and L. Wolf. One shot similarity metric learning for action recognition. In International Workshop on Similarity-Based Pattern Recognition, pages 31–45. Springer, 2011.
[20] G. Koch, R. Zemel, and R. Salakhutdinov. Siamese neural networks for one-shot image recognition. In ICML Deep Learning Workshop, volume 2, 2015.
[21] J. Lin, C. Gan, and S. Han. Temporal shift module for efficient video understanding. arXiv preprint arXiv:1811.08383, 2018.
[22] A. Mensch and M. Blondel. Differentiable dynamic programming for structured prediction and attention. ICML, 2018.
[23] A. Mishra, V. K. Verma, M. S. K. Reddy, S. Arulkumar, P. Rai, and A. Mittal. A generative approach to zero-shot and few-shot action recognition. In 2018 IEEE Winter Conference on Applications of Computer Vision (WACV), pages 372–380. IEEE, 2018.
[24] M. Muller. Dynamic time warping. Information Retrieval for Music and Motion, pages 69–84, 2007.
[25] T. Munkhdalai and H. Yu. Meta networks. In Proceedings of the 34th International Conference on Machine Learning - Volume 70, pages 2554–2563. JMLR.org, 2017.
[26] A. Nichol and J. Schulman. Reptile: a scalable metalearning algorithm. arXiv preprint arXiv:1803.02999, 2018.
[27] A. Paszke, S. Gross, S. Chintala, G. Chanan, E. Yang, Z. DeVito, Z. Lin, A. Desmaison, L. Antiga, and A. Lerer. Automatic differentiation in PyTorch. 2017.
[28] H. Qi, M. Brown, and D. G. Lowe. Low-shot learning with imprinted weights. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 5822–5830, 2018.
[29] Z. Qiu, T. Yao, and T. Mei. Learning spatio-temporal representation with pseudo-3D residual networks. In Proceedings of the IEEE International Conference on Computer Vision, pages 5533–5541, 2017.
[30] S. Ravi and H. Larochelle. Optimization as a model for few-shot learning. 2016.
[31] A. Richard, H. Kuehne, A. Iqbal, and J. Gall. NeuralNetwork-Viterbi: A framework for weakly supervised video learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 7386–7395, 2018.
[32] A. A. Rusu, D. Rao, J. Sygnowski, O. Vinyals, R. Pascanu, S. Osindero, and R. Hadsell. Meta-learning with latent embedding optimization. arXiv preprint arXiv:1807.05960, 2018.
[33] P. Scovanner, S. Ali, and M. Shah. A 3-dimensional SIFT descriptor and its application to action recognition. In Proceedings of the 15th ACM International Conference on Multimedia, pages 357–360. ACM, 2007.
[34] G. A. Sigurdsson, G. Varol, X. Wang, A. Farhadi, I. Laptev, and A. Gupta. Hollywood in homes: Crowdsourcing data collection for activity understanding. In European Conference on Computer Vision, pages 510–526. Springer, 2016.
[35] J. Snell, K. Swersky, and R. Zemel. Prototypical networks for few-shot learning. In Advances in Neural Information Processing Systems, pages 4077–4087, 2017.
[36] K. Soomro, A. R. Zamir, and M. Shah. UCF101: A dataset of 101 human actions classes from videos in the wild. arXiv preprint arXiv:1212.0402, 2012.
[37] D. Tran, L. Bourdev, R. Fergus, L. Torresani, and M. Paluri. Learning spatiotemporal features with 3D convolutional networks. In Proceedings of the IEEE International Conference on Computer Vision, pages 4489–4497, 2015.
[38] D. Tran, H. Wang, L. Torresani, J. Ray, Y. LeCun, and M. Paluri. A closer look at spatiotemporal convolutions for action recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 6450–6459, 2018.
[39] O. Vinyals, C. Blundell, T. Lillicrap, D. Wierstra, et al. Matching networks for one shot learning. In Advances in Neural Information Processing Systems, pages 3630–3638, 2016.
[40] H. Wang and C. Schmid. Action recognition with improved trajectories. In Proceedings of the IEEE International Conference on Computer Vision, pages 3551–3558, 2013.
[41] L. Wang, Y. Xiong, Z. Wang, Y. Qiao, D. Lin, X. Tang, and L. Van Gool. Temporal segment networks: Towards good practices for deep action recognition. In European Conference on Computer Vision, pages 20–36. Springer, 2016.
[42] X. Wang, R. Girshick, A. Gupta, and K. He. Non-local neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 7794–7803, 2018.
[43] X. Wang and A. Gupta. Videos as space-time region graphs. In Proceedings of the European Conference on Computer Vision (ECCV), pages 399–417, 2018.
[44] Y.-X. Wang, R. Girshick, M. Hebert, and B. Hariharan. Low-shot learning from imaginary data. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 7278–7286, 2018.
[45] S. Xie, C. Sun, J. Huang, Z. Tu, and K. Murphy. Rethinking spatiotemporal feature learning: Speed-accuracy trade-offs in video classification. In Proceedings of the European Conference on Computer Vision (ECCV), pages 305–321, 2018.
[46] B. Zhou, A. Andonian, A. Oliva, and A. Torralba. Temporal relational reasoning in videos. In Proceedings of the European Conference on Computer Vision (ECCV), pages 803–818, 2018.
[47] L. Zhu and Y. Yang. Compound memory networks for few-shot video classification. In Proceedings of the European Conference on Computer Vision (ECCV), pages 751–766, 2018.
[48] M. Zolfaghari, K. Singh, and T. Brox. ECO: Efficient convolutional network for online video understanding. In Proceedings of the European Conference on Computer Vision (ECCV), pages 695–712, 2018.