
Discriminative Figure-Centric Models for Joint Action Localization and Recognition

Tian Lan
School of Computing Science
Simon Fraser University
[email protected]

Yang Wang
Dept. of Computer Science
UIUC
[email protected]

Greg Mori
School of Computing Science
Simon Fraser University
[email protected]

Abstract

In this paper we develop an algorithm for action recognition and localization in videos. The algorithm uses a figure-centric visual word representation. Different from previous approaches it does not require reliable human detection and tracking as input. Instead, the person location is treated as a latent variable that is inferred simultaneously with action recognition. A spatial model for an action is learned in a discriminative fashion under a figure-centric representation. Temporal smoothness over video sequences is also enforced. We present results on the UCF-Sports dataset, verifying the effectiveness of our model in situations where detection and tracking of individuals is challenging.

1. Introduction

At a broad level, there are two prevalent approaches for action recognition from video sequences. The first is statistical, gathering histograms of local space-time interest points over videos. The second is structural, using human figure-centric representations which maintain spatial arrangements of features with respect to the person. In terms of output, many action recognition approaches do not answer the question of where an action takes place in a video, just that it does exist somewhere in the video. In this paper we argue two points: that this action localization is an important component of action recognition, and that doing so can lead to better action classification accuracy.

Impressive action recognition results have been achieved using bag-of-words statistical representations [14]. However, there are limitations in this representation of an action, lacking important cues about the spatial arrangement of features. Recent efforts have been made to extend this representation, using relative arrangements of visual words [20, 13]. Yet these representations still lack explicit modeling of the human figure, and are limited to higher-order statistics of visual word co-occurrences.

Figure 1: The goal of this paper is to learn a model using training videos annotated with action labels and bounding boxes around people performing the action. Given a new video, we can use the model to simultaneously predict the action label of the video and localize the person performing the action in each frame, including a detailed discriminative region representation.

Structural approaches that use figure-centric representations rely on either a template matching strategy [21] or human detection and tracking as input [5]. Template matching is arguably too brittle for unconstrained videos. Reliable human detectors exist [8], though only for canonical upright poses without occlusion. For recognizing more diverse categories of actions (e.g. the UCF-Sports dataset [19]), this type of human detector is unreliable. Hence, many action recognition approaches forego a human-centric representation in favour of a purely statistical approach.

In this paper we argue that building figure-centric structural information into an action representation is advantageous, if it can be done robustly. Further, if one is to go beyond just recognizing an action and proceed to localize it within a video, it is essential to actually determine where the action is taking place.

The main contribution of this paper is a representation that bridges between a bag-of-words style statistical representation and this type of figure-centric structural representation. To overcome the challenge of person detection in complex actions, we do not pre-suppose the existence of a human detector capable of producing an accurate figure-centric representation. Instead, we treat the position of the human as a latent variable in a discriminative latent variable model, and infer it while simultaneously recognizing an action. We also move beyond a simple bounding box representation for the human figure and learn a more detailed spatial model of an action, discriminatively selecting which cells of a bounding box representation should be included as a model for an action. This detailed action-aware representation explicitly models the discriminative region within a bounding box and thus can be more robust to background clutter. A similar representation was recently proposed to handle occlusion in object detection [11].

2. Previous Work

The literature on vision-based action recognition is immense. Weinland et al. [23] provide a recent survey; we review closely related work here. Visual word representations, in the form of quantized interest point descriptors, are common in the action recognition literature [18, 4, 14]. Beyond first-order bag-of-words representations, Ryoo and Aggarwal [20] developed a matching kernel that considers spatial and temporal relations between space-time interest points. Kovashka and Grauman [13] consider higher-order relations between visual words, each with discriminatively selected spatial arrangements. Instead, our work endows visual word representations with a figure-centric frame of reference. We maintain a spatial coordinate system specifying where features are with respect to the person, while these other bag-of-words methods do not. Klaser [12] evaluates bag-of-words features on different datasets with and without a figure-centric representation obtained using a person detector. Our work also uses a figure-centric visual word representation, though with latent modeling of the person location and a finer level of discriminatively chosen spatial details.

Figure-centric template matching approaches based on shape and motion features are another common strategy [21, 5]. As noted above, assuming reliable human detection and tracking as input to action recognition is problematic. Lu et al. [15] attempt to address the tracking component with a generative probabilistic model for simultaneous tracking and action recognition. Impressive results are shown for small-scale human figures performing actions in far-field hockey videos. Our work is similar in spirit, though it does not require manual initialization of tracks and learns discriminative models for actions with larger shape variations.

In this paper we develop an algorithm to both recognize and localize actions in videos. Similar work has been explored in the object recognition literature. The Implicit Shape Model (ISM) and its extensions [10] have been widely used in object category detection and segmentation. ISM is a generative model that determines possible object locations and scales in a probabilistic Hough voting procedure. ISM can be first used to find hypotheses, and a discriminative model is then applied to verify them and filter out false positives. Maji and Malik [16] develop a max-margin formulation of the Hough transform. Felzenszwalb et al. [8] develop the latent SVM that aligns parts while learning a model. Fergus et al. [9] encode spatial information of interest points and simultaneously localize and recognize objects, using a generative model (pLSA). Deselaers et al. [2] learn models for objects using features inside a bounding box. In our work, we use a latent variable to discriminatively select which information inside the bounding box is useful for our task. In addition, in contrast with all the aforementioned object recognition work, we consider video sequences and we explicitly enforce the temporal coherence of actions across time.

Figure 2: Background context for action recognition. The three images taken from the UCF-Sports dataset [19] are from the classes: diving, swinging (at the high bar) and swinging (on the pommel) respectively. Context is unhelpful for distinguishing between the actions in (b) and (c), but essential to distinguish the actions in (a) and (b).

3. Recognizing and Localizing Actions in Video

Our goal is to learn a model to both recognize and localize actions in videos, as depicted in Fig. 1. We assume we are given a training set of videos with action labels and person locations. We train a model that can determine whether an action occurs in an input video, and the spatio-temporal locations in the video at which the action occurs. Robust solutions will require that many sources of information are brought to bear on the problem. In this paper we incorporate two sources in a unified model – a statistical scene context representation and a structural representation of the individual person.

Global statistical models for scene context are now part of the standard toolkit for action recognition (e.g. Marszalek et al. [17]). Much can be gained from background scene recognition via a global statistical model – in our experiments we show that this effect seems to dominate in a widely-used dataset. However, as Fig. 2 illustrates, global statistical models are not enough, and descriptors on the person may be overwhelmed by the cloud of background features. Clearly, a structural, figure-centric representation is needed here. It is needed not only for differentiating actions where background scenes are similar, but also for localizing the actions.

In this work, we combine a global scene model with a figure-centric representation. One of the challenges in using a figure-centric representation is the need for human detection and tracking in order to form an aligned representation. We take inspiration from the commonly used LSVM-based object detector [8], which implicitly searches for an alignment of images via latent variables. However, that detector is not directly applicable to action recognition in videos. In our work, we make several important modifications to adapt it to this problem domain.

First, analysis of video sequences naturally leads one to consider temporal continuity, or tracking constraints. We use latent variables to represent the location of the person performing the action in each frame. Our tracking constraint enforces that the region of the video corresponding to the person (represented by those latent variables) should be consistent in appearance over time. Second, unlike the relatively rigid objects (pedestrians, cars, etc.) for which the LSVM-based detector works well, human figures performing various actions undergo drastic changes in shape. The global "root filter" template and set of parts described with HOG features are insufficient to capture this variation. Instead, we deploy a figure-centric bag-of-words representation combined with a flexible latent sub-region model to capture this variation. Exact learning and inference in our model are intractable. We develop efficient approximate learning and inference algorithms.

3.1. Figure-Centric Video Sequence Model

We propose a discriminative figure-centric model that jointly captures the relationship between the action label of a video and the location of the person performing the action in each frame. The location is represented by a bounding box around the person, and a detailed shape mask that selects certain regions within the bounding box. We divide a bounding box into R cells, and introduce a latent variable to discriminatively select which cells of a bounding box representation are "on". The action label of a video and the bounding boxes of the person performing the action are observed on training data (but not on test data). The discriminative cells of each bounding box are treated as latent variables in the model for both training and test data.

Each training video I is associated with an action label y. Suppose the video contains τ frames represented as I = (I1, I2, . . . , Iτ), where Ii denotes the i-th frame of the video. We use L = (l1, l2, . . . , lτ) to denote the set of bounding boxes in the video, one per frame. The i-th bounding box li is a 4-dimensional vector representing the location, height, and width of the bounding box. We use λ(li; Ii) to denote the feature vector extracted from the patch defined by li in the frame Ii. We assume λ(li; Ii) is the concatenation of three vectors, i.e. λ(li; Ii) = [xi; gi; ci]. Here xi and gi denote the appearance features (codewords from k-means quantized HOG3D descriptors [12]) and the spatial locations of interest points in the bounding box li, respectively. ci denotes the holistic feature of the patch; here we use a color histogram.

To encode which sub-regions of the bounding box are important for an action, we introduce a matrix z = {zij}, 1 ≤ i ≤ τ, 1 ≤ j ≤ R. An entry zij = 1 means the j-th cell in the i-th frame is discriminative and thus "turned on", and zij = 0 otherwise. In other words, z specifies the discriminative regions inside each bounding box. We treat z as latent variables and infer them automatically when learning the model parameters.
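To make these quantities concrete, the following minimal sketch (our own illustration, not the authors' code; all variable names and sizes are hypothetical except K = 4000, which follows the experimental settings quoted in Sec. 5.1) lays out the per-video data structures used below: the boxes L, the latent cell-selection matrix z, and the per-box feature tuple [xi; gi; ci].

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative sizes (only K follows the paper's experimental settings).
tau = 40        # number of frames in the video
R = 16          # cells per bounding box (e.g. a 4 x 4 grid)
K = 4000        # size of the HOG3D codebook
B = 9           # number of spatial bins for bin(.)

# L: one 4-d vector per frame -- location (x, y) plus height and width of the box.
L = np.zeros((tau, 4))

# z: latent binary matrix; z[i, j] = 1 means cell j of frame i is "on" (discriminative).
z = np.ones((tau, R), dtype=int)

# Per-frame features lambda(l_i; I_i) = [x_i; g_i; c_i] for one frame:
n_points = 25                                   # interest points falling in the box
x_i = rng.integers(0, K, size=n_points)          # quantized appearance codewords
g_i = rng.integers(0, B, size=n_points)          # spatial bin of each point w.r.t. the box center
c_i = rng.random(64); c_i /= c_i.sum()           # holistic color histogram of the patch
```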

The configurations of bounding boxes in a video are not independent. For example, bounding boxes of neighboring frames tend to be similar in terms of image appearance, location and size. To capture this intuition, we assume that there are connections between bounding boxes in neighboring frames. Intuitively speaking, this will enforce a tracking constraint that states bounding boxes on the action of interest should move smoothly over time. We use a chain-structured undirected graph G = (V, E) to represent the configurations of bounding boxes L in a video. A vertex vi ∈ V corresponds to the configuration li of the bounding box in the i-th frame. An edge (vi, vi+1) ∈ E corresponds to the dependency between two neighboring bounding boxes li and li+1.

Inspired by the latent SVM [8, 25], we use the following scoring function to measure the compatibility between a video I, an action label y and the configurations of bounding boxes L that localize the person performing the action in each frame of the video: fθ(L, y, I) = max_z θ⊤Φ(z, L, y, I), where θ are the model parameters, and Φ(z, L, y, I) is a feature vector defined on z, L, y, I. The model parameters have three parts θ = {α, β, γ}, and θ⊤Φ(z, L, y, I) is defined as:

\[
\theta^\top \Phi(z, L, y, I) = \sum_{i \in V} \alpha^\top \phi(l_i, z_i, y, I_i) + \sum_{(i, i+1) \in E} \beta^\top \psi(l_i, l_{i+1}, z_i, z_{i+1}, I_i, I_{i+1}) + \gamma^\top \eta(y, I) \qquad (1)
\]

The details of the potential functions in Eq. 1 are described in the following.

Unary Potential α⊤φ(li, zi, y, Ii): For the i-th frame Ii, this potential function measures the compatibility between the action label y, the configuration of the bounding box li, and the discriminative cells of the bounding box zi. Recall that we divide a bounding box into R cells; zi is a vector that denotes whether each cell in the bounding box li is discriminative or not. We define the potential function as:

\[
\alpha^\top \phi(l_i, z_i, y, I_i) = \sum_{a=1}^{Y} \sum_{j=1}^{R} \sum_{w=1}^{K} \sum_{v \in \mathcal{N}(j)} \alpha_{aw}^\top \cdot \mathbb{1}(y = a) \cdot \mathbb{1}(x_{iv} = w) \cdot \mathrm{bin}(g_{iv}) \cdot z_{ij} \qquad (2)
\]

where N(j) denotes the set of interest points in the j-th cell, and we use bin(·) to denote the feature vector that bins the relative location of an interest point with respect to the center of the bounding box. Hence bin(giv) is a sparse vector of all zeros with a single one for the bin occupied by giv. Let Y denote the number of action labels, K the number of codewords, and B the number of bins; then the parameter α is a matrix of size Y × K × B, where an entry αawr can be interpreted as how much the model prefers to see a "discriminative" interest point in the r-th bin when its codeword is w and the action label is a.
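A minimal sketch of how Eq. 2 could be evaluated for a single frame, assuming the interest points inside the box have already been assigned a cell index, a codeword, and a spatial bin (the function and variable names are ours, not the authors'). Since the indicator functions pick out the rows of α for the hypothesized label and each point's codeword, the sum reduces to a gated table lookup.

```python
import numpy as np

def unary_potential(alpha, y, cell_idx, codeword, spatial_bin, z_i):
    """Sketch of Eq. 2 for one frame.

    alpha       : (Y, K, B) array of weights alpha_{a, w, r}
    y           : hypothesized action label (int in [0, Y))
    cell_idx    : (P,) cell index j of each interest point in the bounding box
    codeword    : (P,) codeword index w of each interest point
    spatial_bin : (P,) bin index r of each point relative to the box center
    z_i         : (R,) binary vector of "on" cells for this frame
    """
    score = 0.0
    for j, w, r in zip(cell_idx, codeword, spatial_bin):
        if z_i[j] == 1:                 # only points in discriminative cells contribute
            score += alpha[y, w, r]     # alpha_{y, w}^T bin(g) picks out entry r
    return score

# Toy call with made-up sizes.
Y, K, B, R, P = 10, 4000, 9, 16, 25
rng = np.random.default_rng(1)
print(unary_potential(rng.normal(size=(Y, K, B)), 3,
                      rng.integers(0, R, P), rng.integers(0, K, P),
                      rng.integers(0, B, P), rng.integers(0, 2, R)))
```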

Pairwise Potential β⊤ψ(li, li+1, zi, zi+1, Ii, Ii+1): This potential function measures the compatibility between two neighboring frames and assesses how likely they are to contain the same person. The compatibility is measured in terms of three factors: similarity of bounding boxes, similarity of discriminative regions and similarity of patch appearances. More formally, it is written as:

\[
\beta^\top \psi(l_i, l_{i+1}, z_i, z_{i+1}, I_i, I_{i+1}) = \beta_1 \cdot s(z_i, z_{i+1}) + \beta_2 \cdot m(c_i, c_{i+1}) + \beta_3 \cdot m(l_i, l_{i+1}) \qquad (3)
\]

where the first term s(zi, zi+1) describes the shape similarity between the discriminative regions in the i-th and (i+1)-th bounding boxes, which is computed as $s(z_i, z_{i+1}) = 1 - \frac{1}{R} \sum_{j=1}^{R} |z_{ij} - z_{i+1,j}|$. The second term m(ci, ci+1) measures the similarity between the color histograms of the patches defined by the i-th and (i+1)-th bounding boxes, where m is a similarity function; here we use the reciprocal of the L2 distance. The third term denotes the similarity between the i-th and the (i+1)-th bounding boxes in terms of three cues: location, aspect ratio and area. β is a vector of model parameters that controls the relative contributions of these three terms. In essence, this potential function tries to enforce a tracking constraint: two neighboring frames should have similar bounding boxes (in terms of location, aspect ratio and area), similarly shaped discriminative regions, and similar image patches within the two bounding boxes.
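A hedged sketch of Eq. 3. The shape-similarity and color terms follow the formulas just given; the paper does not spell out the exact form of the box-similarity term m(li, li+1) beyond naming its three cues, so the version below (reciprocal distances in location, aspect ratio and area) is only one plausible choice.

```python
import numpy as np

def pairwise_potential(beta, z_i, z_j, c_i, c_j, box_i, box_j, eps=1e-8):
    """Sketch of the pairwise (tracking) potential of Eq. 3.

    beta         : (3,) weights [beta_1, beta_2, beta_3]
    z_i, z_j     : (R,) binary cell-selection vectors of neighboring frames
    c_i, c_j     : color histograms of the two patches
    box_i, box_j : boxes as (x, y, height, width)
    """
    R = len(z_i)
    # Shape similarity of discriminative regions: s = 1 - (1/R) * sum_j |z_ij - z_{i+1,j}|
    s = 1.0 - np.abs(np.asarray(z_i) - np.asarray(z_j)).sum() / R
    # Patch-appearance similarity: reciprocal of the L2 distance between color histograms.
    m_color = 1.0 / (np.linalg.norm(np.asarray(c_i) - np.asarray(c_j)) + eps)
    # Box similarity in location, aspect ratio and area (one plausible form, see lead-in).
    xi, yi, hi, wi = box_i
    xj, yj, hj, wj = box_j
    m_box = (1.0 / (np.hypot(xi - xj, yi - yj) + eps)
             + 1.0 / (abs(hi / wi - hj / wj) + eps)
             + 1.0 / (abs(hi * wi - hj * wj) + eps))
    return beta[0] * s + beta[1] * m_color + beta[2] * m_box
```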

Global Action Potential γ⊤η(y, I): Many action classes can be distinguished by scene context. To capture this, we also include a potential function that is a global template model measuring the compatibility between the action label y and a global feature vector of the whole video. It is parameterized as:

\[
\gamma^\top \eta(y, I) = \sum_{a \in \mathcal{Y}} \gamma_a^\top \cdot \mathbb{1}(y = a) \cdot x_0 \qquad (4)
\]

where x0 is a feature vector extracted from the whole video I. Here we use a statistical bag-of-words style representation for the whole video. The parameter γa is a template for the action class a.
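Putting the three potentials together, a sketch of how the full score θ⊤Φ(z, L, y, I) of Eq. 1 decomposes over the frame chain. It reuses the unary_potential and pairwise_potential sketches above; the whole-video descriptor x0 is passed in precomputed, and everything here is our own reconstruction rather than the released implementation.

```python
import numpy as np

def global_potential(gamma, y, x0):
    """Eq. 4: a per-class linear template applied to the whole-video BoW vector x0."""
    return float(np.dot(gamma[y], x0))

def video_score(theta, y, frames, L, z, x0):
    """theta^T Phi(z, L, y, I) of Eq. 1 for one hypothesis (y, L, z).

    theta  : dict with 'alpha' (Y,K,B), 'beta' (3,), 'gamma' (Y,D) parameters
    frames : list of per-frame dicts with keys 'cell', 'word', 'bin' (point arrays)
             and 'color' (patch color histogram)
    L      : (tau, 4) bounding boxes;  z : (tau, R) latent cell selections
    x0     : whole-video bag-of-words feature vector
    """
    tau = len(frames)
    score = 0.0
    for i in range(tau):                                   # unary terms over vertices
        f = frames[i]
        score += unary_potential(theta['alpha'], y,
                                 f['cell'], f['word'], f['bin'], z[i])
    for i in range(tau - 1):                               # pairwise terms over chain edges
        score += pairwise_potential(theta['beta'], z[i], z[i + 1],
                                    frames[i]['color'], frames[i + 1]['color'],
                                    L[i], L[i + 1])
    return score + global_potential(theta['gamma'], y, x0)  # global scene term
```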

4. Learning and Inference

We now describe how to infer the action label given the model parameters (Sec. 4.1), and how to learn the model parameters from a set of training data (Sec. 4.2). In training, the action labels and bounding boxes around people performing the action are provided. At test time, this information is unavailable and must be inferred.

4.1. Inference

Given the model parameters θ, the inference problem is to find the best action label y∗ and the best configurations of bounding boxes L∗ that localize the person performing the action in each frame of a video I. The inference problem requires solving the following optimization problem:

\[
\max_y \max_L f_\theta(L, y, I) = \max_y \max_L \max_z \theta^\top \Phi(z, L, y, I) \qquad (5)
\]

We can enumerate all the possible y ∈ Y. For a fixed y, we need to solve an inference problem that maximizes over L and z as follows:

\[
\max_L \max_z \theta^\top \Phi(z, L, y, I) \qquad (6)
\]

The optimization problem in Eq. 6 is in general NP-hard since it involves a combinatorial search. One possible solution is to use an iterative method as follows: (1) holding L fixed, find the optimal z; (2) holding z fixed, find the optimal L. These two steps are repeated until convergence. However, this solution is still not efficient in practice, since the second step requires searching over all the locations and scales of the bounding boxes in each frame of the video, and we have to do this during every iteration, which is very computationally expensive. So we further develop an approximation scheme to speed up this step. The intuition of the approximation scheme is to initialize the search space to all possible locations and scales of bounding boxes in each frame, and then to gradually shrink it during each iteration by keeping the candidates that are most likely to contain the true locations/scales/aspect ratios of the bounding boxes. At the beginning, we still need to do an exhaustive search over all possible locations/scales/aspect ratios, but in subsequent iterations we only need to search over smaller and smaller sets of candidates. The details of the inference method are as follows.

We initialize z to be a matrix with every entry equal to one, which means all the cells in the bounding boxes are "turned on" in the model. Initially, we set the search space of the configurations L to all possible locations, scales, and aspect ratios of the bounding boxes in each frame of the video. We then iterate the following two steps, reducing the search space in each iteration until a final configuration L∗ is reached:

1. Holding the variables z fixed, find the set L′ of the top κ% highest-scoring configurations L according to the score function θ⊤Φ(z, L, y, I), by exhaustively searching over the current search space and sorting the scores. Then shrink the search space to L′.

2. Enumerate all configurations L in the shrunken search space, and optimize the variables z according to z = arg max_{L, z′} θ⊤Φ(z′, L, y, I), where L ranges over the shrunken search space.

These two steps are repeated until only one configuration is left in the search space. The optimization problem in step 1 is done by exhaustive search over the current search space; since this space shrinks at every iteration, the exhaustive search is expensive only in the first few iterations. The optimization problems in step 1 and step 2 are standard MAP inference problems in undirected graphical models. We use standard belief propagation to solve them.

In practice, in order to reduce the initial search space, we train a HOG detector [1] on our training set that detects people of any action class. We evaluate the scores returned by the HOG detector for all possible bounding boxes in a video frame and then use the top 100 bounding boxes according to their scores as the initial search space for that frame. Essentially this very rough person detector (with 100 false positives per frame) acts as a saliency operator. Typically, the top 100 bounding boxes cover all the regions that could possibly contain people and allow us to quickly discard many background regions. This procedure greatly reduces the running time of inference.
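The shrinking-search-space procedure of Sec. 4.1 can be summarized by the loose sketch below (ours, not the authors' code). The candidate set would come from the top-100 person-detector boxes per frame described above; the MAP step over z, which the paper solves with belief propagation, is replaced here by a greedy stand-in, so this is only a structural illustration.

```python
import numpy as np

def greedy_cells(score_fn, L, z):
    """Toy stand-in for the MAP step over z: flip each cell if it raises the score."""
    z = z.copy()
    for i in range(z.shape[0]):
        for j in range(z.shape[1]):
            flipped = z.copy()
            flipped[i, j] = 1 - flipped[i, j]
            if score_fn(L, flipped) > score_fn(L, z):
                z = flipped
    return z

def infer_boxes(score_fn, candidates, R, keep_frac=0.1):
    """Approximate inference over (L, z) for a fixed action label y.

    score_fn(L, z) : evaluates theta^T Phi(z, L, y, I), e.g. a closure over video_score
    candidates     : list of candidate box configurations L, each an array of shape (tau, 4)
    R              : number of cells per bounding box
    """
    tau = candidates[0].shape[0]
    z = np.ones((tau, R), dtype=int)            # every cell starts "on"
    search_space = list(candidates)
    while True:
        # Step 1: with z fixed, keep only the top keep_frac of configurations.
        search_space.sort(key=lambda L: score_fn(L, z), reverse=True)
        search_space = search_space[:max(1, int(len(search_space) * keep_frac))]
        # Step 2: over the shrunken space, jointly improve (L, z).
        best_L, best_z, best_score = None, None, -np.inf
        for L in search_space:
            z_try = greedy_cells(score_fn, L, z)
            s = score_fn(L, z_try)
            if s > best_score:
                best_L, best_z, best_score = L, z_try, s
        z = best_z
        if len(search_space) == 1:              # stop once a single configuration remains
            return best_L, z
```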

4.2. Learning

Given a set of N training examples ⟨In, Ln, yn⟩ (n = 1, 2, . . . , N), we would like to train the model parameters θ that tend to produce the correct action label y and localize the person performing the action for a new test video I. Note that the bounding boxes L of the person performing the action are observed on training data, but the discriminative regions in each bounding box (or equivalently the variables z) are unobserved and will be automatically inferred.

We adopt the latent SVM framework [8, 25] for learning.

\[
\min_{\theta,\; \xi \geq 0} \;\; \frac{1}{2}\|\theta\|^2 + C \sum_{n=1}^{N} \xi^n \qquad (7a)
\]
\[
\text{s.t.} \;\; f_\theta(y^n, L^n, I^n) - f_\theta(y, L, I^n) \geq \Delta(y, y^n, L, L^n) - \xi^n, \quad \forall n, \forall y, \forall L \qquad (7b)
\]

where ∆(y, yn, L, Ln) measures the joint loss between the ground-truth action label and bounding boxes (yn, Ln) compared with the hypothesized ones (y, L). The joint loss ∆(y, yn, L, Ln) should reflect how well the hypothesized action label y and bounding boxes L match the ground truth yn and Ln. We define the joint loss as a weighted combination of a recognition loss and a localization loss, ∆(y, yn, L, Ln) = µ∆0/1(y, yn) + (1 − µ)∆(L, Ln), where 0 ≤ µ ≤ 1 balances the relative contributions of these two terms. In our experiments, we set µ to 0.5. The recognition loss ∆0/1 is a 0-1 loss that measures the difference between the ground-truth action label yn and a hypothesized action label y, i.e. ∆0/1(y, yn) = 1 if y ≠ yn, and ∆0/1(y, yn) = 0 otherwise.

The localization loss ∆(L, Ln) measures the difference between the scales/locations/aspect ratios of the ground-truth bounding boxes Ln and the hypothesized bounding boxes L. This loss function is in turn defined as the average of a set of local losses over frames: $\Delta(L, L^n) = \frac{1}{\tau} \sum_i \Delta_i(L_i, L^n_i)$, where τ is the number of frames in a video. We use the intersection-over-union loss from the PASCAL VOC challenge [6] as the localization loss: $\Delta_i(L_i, L^n_i) = 1 - \frac{\mathrm{Area}(L_i \cap L^n_i)}{\mathrm{Area}(L_i \cup L^n_i)}$, where the quality of localization is measured based on the amount of area overlap between the predicted bounding box $L_i$ and the ground-truth bounding box $L^n_i$ in the i-th frame. $\mathrm{Area}(L_i \cap L^n_i)$ is the area of the intersection of the two bounding boxes, while $\mathrm{Area}(L_i \cup L^n_i)$ is the area of their union.
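A short sketch of the joint loss, assuming boxes are stored as (x, y, height, width) with (x, y) the top-left corner (the paper does not fix a convention); the names are ours.

```python
import numpy as np

def iou(box_a, box_b):
    """Intersection-over-union of two boxes given as (x, y, height, width)."""
    xa, ya, ha, wa = box_a
    xb, yb, hb, wb = box_b
    iw = max(0.0, min(xa + wa, xb + wb) - max(xa, xb))
    ih = max(0.0, min(ya + ha, yb + hb) - max(ya, yb))
    inter = iw * ih
    union = wa * ha + wb * hb - inter
    return inter / union if union > 0 else 0.0

def joint_loss(y, y_gt, L, L_gt, mu=0.5):
    """Delta(y, y^n, L, L^n) = mu * Delta_01(y, y^n) + (1 - mu) * Delta(L, L^n)."""
    delta_01 = 0.0 if y == y_gt else 1.0
    delta_loc = float(np.mean([1.0 - iou(li, gi) for li, gi in zip(L, L_gt)]))
    return mu * delta_01 + (1.0 - mu) * delta_loc
```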

We use the non-convex bundle optimization in [3] to solve Eq. 7. In a nutshell, the algorithm iteratively builds an increasingly accurate piecewise quadratic approximation to the objective function. During each iteration, a new linear cutting plane is found via a subgradient of the objective function and added to the piecewise quadratic approximation. We omit the details due to space constraints.
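The paper optimizes Eq. 7 with the non-convex bundle method of [3]. As a much rougher illustration of the same objective, the sketch below takes a single plain subgradient step on the structured hinge loss for one training video; the two inference helpers it calls are assumed to exist (they are the loss-augmented and latent-completion inference problems), and none of this is the authors' actual training code.

```python
def subgradient_step(theta, video, y_gt, L_gt, C, lr,
                     loss_augmented_infer, complete_latent, feature_map):
    """One simplified update on the hinge upper bound of Eq. 7 for a single video.

    loss_augmented_infer(theta, video, y_gt, L_gt) -> (y_hat, L_hat, z_hat)
        maximizes Delta(y, y_gt, L, L_gt) + f_theta(L, y, video)
    complete_latent(theta, video, y_gt, L_gt) -> z_star
        fills in the latent cells for the ground-truth label and boxes
    feature_map(video, y, L, z) -> flat parameter-sized vector Phi(z, L, y, I)
    """
    y_hat, L_hat, z_hat = loss_augmented_infer(theta, video, y_gt, L_gt)
    z_star = complete_latent(theta, video, y_gt, L_gt)
    # Subgradient of the (assumed active) hinge term:
    # Phi(most violating hypothesis) - Phi(ground truth with completed latents).
    g = feature_map(video, y_hat, L_hat, z_hat) - feature_map(video, y_gt, L_gt, z_star)
    # Regularized step: theta <- theta - lr * (theta + C * g).
    return theta - lr * (theta + C * g)
```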

5. Experiments

We present results of both action recognition and localization on the UCF-Sports dataset [19] to demonstrate the effectiveness of our model, especially in situations where detection and tracking of individuals are challenging.

5.1. Experimental Settings and Baselines

The UCF-Sports dataset [19] contains 150 videos from 10 action classes: diving, golf swinging, kicking, lifting, horse riding, running, skating, swinging (on the pommel horse and on the floor), swinging (at the high bar), and walking. The videos are taken from real sports broadcasts. Bounding boxes around the person performing the action of interest in each frame are also available.

Reported results [19, 22, 24, 13, 12] on this dataset use Leave-One-Out (LOO) cross validation, cycling each example as a test video one at a time. There are two issues with using LOO on this dataset. First, it is not clear how parameters (e.g. regularizer weightings) are set. Second, there are strong scene correlations among videos in certain classes; many videos are captured in exactly the same location. With LOO, the learning method can exploit this correlation and memorize the background instead of learning the action. As evidence, we tested the performance of a bag-of-words model [22] using a linear kernel with different C parameters (Table 1). We can see the regularizer weighting C greatly affects the results. The best accuracy is achieved when the learning method focuses on the training error instead of a large margin (C = 1000). Essentially, the focus is on memorizing the training examples. To help alleviate these problems, we split the dataset by taking one third of the videos from each action category to form the test set, and the rest of the videos are used for training. This will reduce the chances of videos in the test set sharing the same scene with videos in the training set¹.

Table 1: Accuracies of the bag-of-words model with different C parameters (LOO).
  C value     0.1     1       10      100     1000
  accuracy    0.434   0.466   0.643   0.819   0.819

Table 2: Mean per-class action recognition accuracies (splits).
  Method                              Accuracy
  global bag-of-words                 63.1
  local bag-of-words                  65.6
  spatial bag-of-words with ∆0/1      63.1
  spatial bag-of-words with ∆joint    68.5
  latent model with ∆0/1              63.7
  our approach                        73.1

Table 3: Mean per-class action recognition accuracies (LOO).
  Method                   Accuracy
  Kovashka et al. [13]     87.3
  Klaser [12]              86.7
  Wang et al. [22]         85.6
  Yeffet et al. [24]       79.3
  Rodriguez et al. [19]    69.2
  global bag-of-words      81.9
  our approach             83.7

In order to comprehensively evaluate the performance of the proposed model in terms of both action recognition and localization, we define the following baseline methods to compare with. The first two baseline methods only do action recognition, while the last two can do both action recognition and localization. We extract the HOG3D features for interest points detected by dense sampling, using the code from the author's website², with the parameter settings following [22]. For all methods we use a 4000 word codebook, and the number of cells (R) is set to 81.

¹ The training-testing split is available at our website http://www.sfu.ca/~tla58
² http://lear.inrialpes.fr/people/klaeser/software

Global bag-of-words: The first baseline is an SVM classifier on the global feature vector x0 with a bag-of-words representation for a video, which is similar to [22].

Local bag-of-words: The second baseline is similar to the first one. The only difference is that the bag-of-words histogram is computed using only the interest point descriptors within the person's bounding box in a video. Note that in this baseline, ground-truth bounding boxes are also used at test time, so strictly speaking this is not a practical method; we include it here only for comparison purposes.

Spatial bag-of-words: The third baseline is similar to our proposed method. The difference is that it does not use the latent variables z to select the discriminative regions; in other words, z is fixed to all ones. For this baseline, we report the results of using both the classification loss (∆0/1) and the joint loss of localization and classification (∆joint).

Latent model with ∆0/1: The last baseline is equivalent to our proposed method, except that it uses the classification loss (∆0/1) in learning. The comparison with this baseline will demonstrate that it is helpful to choose a loss function that jointly considers action recognition and localization.
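For concreteness, a small sketch of what the global bag-of-words baseline amounts to, assuming the HOG3D descriptors have already been quantized into codeword indices; scikit-learn's LinearSVC (a LIBLINEAR wrapper) stands in for the linear SVM, and the data here are random placeholders. The local bag-of-words variant would differ only in restricting the codewords to those inside the ground-truth box.

```python
import numpy as np
from sklearn.svm import LinearSVC   # LIBLINEAR-backed linear SVM

def bow_histogram(codewords, K=4000):
    """L1-normalized bag-of-words histogram over quantized HOG3D codewords."""
    h = np.bincount(codewords, minlength=K).astype(float)
    return h / max(h.sum(), 1.0)

# Placeholder "videos": one array of codeword indices per video, plus random labels.
rng = np.random.default_rng(2)
train_words = [rng.integers(0, 4000, size=500) for _ in range(20)]
train_labels = rng.integers(0, 10, size=20)

X = np.vstack([bow_histogram(w) for w in train_words])
clf = LinearSVC(C=1.0).fit(X, train_labels)       # the global bag-of-words baseline
print(clf.predict(X[:3]))
```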

Since the first two baseline methods do not perform action localization, we only evaluate them for the action recognition task. The other baselines are evaluated in terms of both action recognition and localization. For simplicity, we use a linear kernel for both SVM and latent SVM, and the SVM classifier is implemented using LIBLINEAR [7].

5.2. Experimental Results

Action Recognition: We summarize mean per-class action recognition accuracies in Table 2. We can see that our method significantly outperforms the baseline methods. Baselines using the 0/1 classification loss (spatial bag-of-words with ∆0/1 and latent model with ∆0/1) do not show much improvement over the global bag-of-words. We believe this is due to the fact that ∆0/1 does not enforce correct localizations during training. The localization results at test time can be arbitrary, which in turn do not provide much useful information for recognition. We can see significant improvement when optimizing the joint loss of recognition and localization (spatial bag-of-words with ∆joint and our approach). In addition, by introducing the latent variables to suppress the non-discriminative regions inside each bounding box, our approach works significantly better than the baseline (spatial bag-of-words with ∆joint) in the same framework but without using latent variables. Per-class accuracies of our method and the baseline global bag-of-words are compared in Fig. 3.

In order to compare our method with the state-of-the-art on the UCF-Sports dataset, we also report our results in the LOO setup in Table 3 (though this setup has problems, as stated previously). Our method still outperforms the baseline of global bag-of-words, with a smaller gap than when splitting the dataset. The accuracy of our method is slightly lower than [13] and the best results reported in [22] and [12]; we believe this is for two reasons: 1) the strong scene correlation between training and test videos in LOO induces the learning method to recognize the background instead of the action, and 2) the use of linear versus complex kernels.

Figure 3: (Best viewed in color) Per-class action classification accuracies of our approach and global bag-of-words. Swinging1 and swinging2 represent the action classes swinging (on the pommel horse and on the floor) and swinging (at the high bar) respectively.

Figure 4: (Best viewed in color) Comparison of action localization performance. (a) ROC curves, with σ set to 0.2. (b) Comparison of area under ROC (AUC) measures for different σ. σ is the threshold that determines whether a video is correctly localized (see text for explanation).

Action Localization: Since the first two baselines (global bag-of-words and local bag-of-words) cannot perform action localization, we only evaluate the last three baselines in Table 2 (spatial bag-of-words with ∆0/1, spatial bag-of-words with ∆joint, and latent model with ∆0/1) and our approach for action localization. Our evaluation criterion is as follows: we compute an "intersection-over-union" score for each frame in a video according to $O(L_i, L^n_i) = \frac{\mathrm{Area}(L_i \cap L^n_i)}{\mathrm{Area}(L_i \cup L^n_i)}$, where $L_i$ denotes the predicted bounding box for the i-th frame and $L^n_i$ is the ground-truth bounding box for the i-th frame. The localization score O(L, Ln) for a video is computed by taking the average of the scores $O(L_i, L^n_i)$ over all frames in the video: $O(L, L^n) = \frac{1}{\tau} \sum_i O(L_i, L^n_i)$, where τ is the number of frames in the video. If the score O(L, Ln) is larger than σ, then the video is considered correctly localized.

Given a test video I, our model returns |Y| scores according to fθ(y, L, I), where y ∈ Y. We take each action class as the positive class one at a time, and use the scores fθ(y, L, I) to produce ROC curves for each positive class. A video is considered correctly predicted if both the predicted action label y∗ and the localization L∗ match the ground truth. Due to space limitations, we only visualize the average action localization performance over all action categories in terms of ROC curves, computed when σ is set to 0.2. We also evaluate the area under ROC (AUC) measure, with σ varying from 0.1 to 0.5 in steps of 0.1. The curves are shown in Fig. 4.

The localization score O(L, Ln) for each video is computed by taking an average over the "intersection-over-union" scores of all the frames. If we consider a person to be correctly localized when the "intersection-over-union" score is larger than 0.5 (see the evaluation criterion in the PASCAL VOC [6]), the threshold σ = 0.2 can be roughly interpreted as: a video is correctly localized if, on average, 40% of its frames are correctly localized. σ = 0.5 means almost all the frames in a video are correctly localized, which is a very stringent criterion.
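The evaluation criterion can be summarized as follows, reusing the iou sketch above (again our own paraphrase, with hypothetical names):

```python
import numpy as np

def video_localization_score(L_pred, L_gt):
    """O(L, L^n): average per-frame intersection-over-union for one video."""
    return float(np.mean([iou(p, g) for p, g in zip(L_pred, L_gt)]))

def correctly_localized(L_pred, L_gt, sigma=0.2):
    """A video counts as correctly localized when its average IoU exceeds sigma."""
    return video_localization_score(L_pred, L_gt) > sigma
```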

We can draw similar conclusions for the action localization results: optimizing a joint loss leads to better localization, and using the latent variables to suppress the non-discriminative regions inside each bounding box also improves the action localization performance.

We visualize the localization results and the learnt discriminative regions inside each bounding box (or equivalently the latent variables z) for each action category in Fig. 5 (see the figure caption for a detailed explanation).

Figure 5: (Best viewed in color) Visualization of localization results and the learnt discriminative cells of each bounding box (or equivalently the latent variables z) for each action category. The red cells indicate regions inferred as discriminative by our model; the blue cells indicate regions that are not discriminative. The first two rows show correct examples. We can see that most of the discriminative regions (red cells) are on the person performing the action of interest or on discriminative context such as the golf club, soccer ball, barbell, horse and bar in (b)-(e) and (i) respectively. The last row shows incorrect examples. We can see that most of the incorrect localizations are due to background clutter (e.g. the person-shaped logo in (k)) or strong scene context (e.g. the soccer ball in (m)).

6. Conclusion

We have presented a discriminative model for joint action localization and recognition in videos. We use a figure-centric visual word representation. Different from previous approaches that require human detection and tracking as input, we treat the position of the human as a latent variable in a discriminative latent variable model, and infer it while simultaneously recognizing an action. We also move beyond a simple bounding box representation for the human figure and learn a more detailed spatial model of an action, discriminatively selecting which cells of a bounding box representation should be included as a model for an action. Our experimental results demonstrate that our proposed model outperforms other baseline methods in both action localization and recognition.

References

[1] N. Dalal and B. Triggs. Histograms of oriented gradients for human detection. In CVPR, 2005.

[2] T. Deselaers, B. Alexe, and V. Ferrari. Localizing objects while learning their appearance. In ECCV, 2010.
[3] T.-M.-T. Do and T. Artieres. Large margin training for hidden Markov models with partially observed states. In ICML, 2009.
[4] P. Dollar, V. Rabaud, G. Cottrell, and S. Belongie. Behavior recognition via sparse spatio-temporal features. In VS-PETS, 2005.
[5] A. Efros, A. Berg, G. Mori, and J. Malik. Recognizing action at a distance. In ICCV, 2003.
[6] M. Everingham, L. Van Gool, C. Williams, J. Winn, and A. Zisserman. The PASCAL visual object classes (VOC) challenge. IJCV, 88(2):303–338, 2010.
[7] R.-E. Fan, K.-W. Chang, C.-J. Hsieh, X.-R. Wang, and C.-J. Lin. LIBLINEAR: A library for large linear classification. JMLR, 2008.
[8] P. Felzenszwalb, D. McAllester, and D. Ramanan. A discriminatively trained, multiscale, deformable part model. In CVPR, 2008.
[9] R. Fergus, L. Fei-Fei, P. Perona, and A. Zisserman. Learning object categories from Google's image search. In ICCV, 2005.
[10] M. Fritz, B. Leibe, B. Caputo, and B. Schiele. Integrating representative and discriminant models for object category detection. In ICCV, 2005.
[11] T. Gao, B. Packer, and D. Koller. A segmentation-aware object detection model with occlusion handling. In CVPR, 2011.
[12] A. Klaser. Learning human actions in videos. PhD thesis, Université de Grenoble, 2010.
[13] A. Kovashka and K. Grauman. Learning a hierarchy of discriminative space-time neighborhood features for human action recognition. In CVPR, 2010.
[14] I. Laptev, M. Marszalek, C. Schmid, and B. Rozenfeld. Learning realistic human actions from movies. In CVPR, 2008.
[15] W.-L. Lu, K. Okuma, and J. J. Little. Tracking and recognizing actions of multiple hockey players using the boosted particle filter. IVC, 27(1-2):189–205, 2009.
[16] S. Maji and J. Malik. Object detection using a max-margin Hough transform. In CVPR, 2009.
[17] M. Marszalek, I. Laptev, and C. Schmid. Actions in context. In CVPR, 2009.
[18] J. C. Niebles, H. Wang, and L. Fei-Fei. Unsupervised learning of human action categories using spatial-temporal words. IJCV, 2008.
[19] M. D. Rodriguez, J. Ahmed, and M. Shah. Action MACH: A spatio-temporal maximum average correlation height filter for action recognition. In CVPR, 2008.
[20] M. Ryoo and J. Aggarwal. Spatio-temporal relationship match: Video structure comparison for recognition of complex human activities. In ICCV, 2009.
[21] E. Shechtman and M. Irani. Space-time behavior based correlation. In CVPR, 2005.
[22] H. Wang, M. M. Ullah, A. Klaser, I. Laptev, and C. Schmid. Evaluation of local spatio-temporal features for action recognition. In BMVC, 2009.
[23] D. Weinland, R. Ronfard, and E. Boyer. A survey of vision-based methods for action representation, segmentation and recognition. CVIU, 115:224–241, 2011.
[24] L. Yeffet and L. Wolf. Local trinary patterns for human action recognition. In ICCV, 2009.
[25] C.-N. Yu and T. Joachims. Learning structural SVMs with latent variables. In ICML, 2009.