Procedural Generation of Videos to Train Deep Action Recognition Networks

César Roberto de Souza 1,2, Adrien Gaidon 1, Yohann Cabon 1, Antonio Manuel López 2

1 Computer Vision Group, Xerox Research Center Europe, Meylan, France
2 Centre de Visió per Computador, Universitat Autònoma de Barcelona, Bellaterra, Spain
{cesar.desouza, adrien.gaidon, yohann.cabon}@xrce.xerox.com, [email protected]

Abstract

Deep learning for human action recognition in videos is making significant progress, but is slowed down by its dependency on expensive manual labeling of large video collections. In this work, we investigate the generation of synthetic training data for action recognition, as it has recently shown promising results for a variety of other computer vision tasks. We propose an interpretable parametric generative model of human action videos that relies on procedural generation and other computer graphics techniques of modern game engines. We generate a diverse, realistic, and physically plausible dataset of human action videos, called PHAV for "Procedural Human Action Videos" 1. It contains a total of 37,536 videos, more than 1,000 examples for each action of 35 categories. Our approach is not limited to existing motion capture sequences, and we procedurally define 14 synthetic actions. We introduce a deep multi-task representation learning architecture to mix synthetic and real videos, even if the action categories differ. Our experiments on the UCF101 and HMDB51 benchmarks suggest that combining our large set of synthetic videos with small real-world datasets can boost recognition performance, significantly outperforming fine-tuning state-of-the-art unsupervised generative models of videos.

1. Introduction

Understanding human behavior in videos is a key problem in computer vision. Accurate representations of both appearance and motion require either carefully handcrafting features with prior knowledge (e.g., the dense trajectories of [78]) or end-to-end deep learning of high-capacity models with a large amount of labeled data (e.g., the two-stream network of [60]). These two families of methods have complementary strengths and weaknesses, and they often need to be combined to achieve state-of-the-art action recognition performance [81, 12]. Nevertheless, deep networks have the potential to significantly improve their accuracy based on training data. Hence, they are becoming the de-facto standard for recognition problems where it is possible to collect large labeled training sets, often by crowd-sourcing manual annotations (e.g., ImageNet [13], MS-COCO [34]). However, manual labeling is costly, time-consuming, error-prone, raises privacy concerns, and requires massive human intervention for every new task. This is often impractical, especially for videos, or even unfeasible for ground truth modalities like optical flow or depth.

1 https://adas.cvc.uab.es/phav/

Figure 1: Procedurally generated human action videos (clockwise): push, kick ball, walking hug, car hit.

Using realistic synthetic data generated from virtual worlds alleviates these issues. Thanks to modern modeling, rendering, and simulation software, virtual worlds allow for the efficient generation of vast amounts of controlled and algorithmically labeled data, including for modalities that cannot be labeled by a human. This approach has recently shown great promise for deep learning across a breadth of computer vision problems, including optical flow [39], depth estimation [34], object detection [36, 72, 85, 66, 46], pose and viewpoint estimation [59, 45, 64], tracking [17], and semantic segmentation [23, 53, 52].

In this work, we investigate procedural generation of synthetic human action videos from virtual worlds in order to train deep action recognition models. This is an open problem with formidable technical challenges, as it requires a full generative model of videos with realistic appearance and motion statistics conditioned on specific action categories. Our experiments suggest that our procedurally generated action videos can complement scarce real-world data. We indeed report significant performance gains on target real-world categories although they differ from the actions present in our synthetic training videos.

Our first contribution is a parametric generative model of human action videos relying on physics, scene composition rules, and procedural animation techniques like "ragdoll physics" that provide a much stronger prior than just viewing videos as tensors or sequences of frames. We show how to procedurally generate physically plausible variations of different types of action categories obtained from MOCAP datasets, animation blending, physics-based navigation, or entirely from scratch using programmatically defined behaviors. We use naturalistic actor-centric randomized camera paths to film the generated actions with care for physical interactions of the camera. Furthermore, our manually designed generative model has interpretable parameters that allow one to either randomly sample or precisely control discrete and continuous scene (weather, lighting, environment, time of day, etc.), actor, and action variations to generate large amounts of diverse, physically plausible, and realistic human action videos.

Our second contribution is a quantitative experimental validation using a modern and accessible game engine (Unity® Pro) to synthesize a labeled dataset of 37,536 videos, corresponding to more than 1,000 examples for each of 35 action categories: 21 grounded in MOCAP data, and 14 entirely synthetic ones defined procedurally. Our dataset, called PHAV for "Procedural Human Action Videos" (cf. Figure 1 for example frames), will be publicly released together with our generator upon publication. Our procedural generative model took two engineers approximately 2 months to program, and our PHAV dataset took 3 days to generate using 4 gaming GPUs. We investigate the use of this data in conjunction with the standard UCF101 [62] and HMDB51 [29] action recognition benchmarks. To allow for generic use, and as predefined procedural action categories may differ from real-world target categories (unknown a priori), we propose a multi-task learning architecture based on the recent Temporal Segment Network [82] (TSN). We call our model Cool-TSN (cf. Figure 6) in reference to the "cool world" of [71], as we mix both synthetic and real samples at the mini-batch level during training. Our experiments show that the generation of our synthetic human action videos can significantly improve action recognition accuracy, especially with small real-world training sets, in spite of differences in appearance, motion, and action categories. Moreover, we vastly outperform other state-of-the-art generative video models [77] when combined with the same number of real-world training examples.

The rest of the paper is organized as follows. Section 2 presents a brief review of related work. In Section 3, we present our parametric generative model and how we use it to procedurally generate our PHAV dataset. Section 4 presents our Cool-TSN deep learning algorithm for action recognition. We report our quantitative experiments measuring the usefulness of PHAV in Section 5. Finally, conclusions are drawn in Section 6.

2. Related work

Synthetic data has been used to train visual models for object detection and recognition, pose estimation, and indoor scene understanding [36, 72, 85, 84, 19, 1, 55, 59, 51, 57, 48, 56, 2, 10, 66, 45, 44, 46, 22, 24, 54, 37, 65, 64, 42, 5, 23]. [21] used a virtual racing circuit to generate different types of pixel-wise ground truth (depth, optical flow, and class labels). [53, 52] relied on game technology to train deep semantic segmentation networks, while [17] used it for multi-object tracking, [58] for depth estimation from RGB, and [61] for place recognition. Several works use synthetic scenarios to evaluate the performance of different feature descriptors [27, 3, 75, 74, 76] and to train and test optical and/or scene flow estimation methods [40, 7, 43, 39], stereo algorithms [20], or trackers [68, 17]. Synthetic LIDAR-style data has also been used for object detection [30, 31]. Finally, virtual worlds are used for learning artificial behaviors such as playing Atari games [41], imitating players in shooter games [35], end-to-end driving/navigating [9, 87], learning common sense [73, 88], or physical intuitions [33]. However, due to the formidable complexity of realistic animation, video generation, and scene understanding, these approaches focus on basic controlled game environments, motions, and action spaces.

To the best of our knowledge, ours is the first work to investigate virtual worlds and game engines to generate synthetic training videos for action recognition. Although some of the aforementioned related works rely on virtual characters, their actions are not the focus, are not procedurally generated, and are often reduced to just walking.

The related work of [38] uses MOCAP data to induce realistic motion in an "abstract armature" placed in an empty synthetic environment, from which 2000 clips (320x240@30FPS) of 90 frames are generated. From these non-photo-realistic clips, handcrafted motion features are selected as relevant and later used to learn action recognition models for 11 actions in real-world videos. Our approach does not just replay MOCAP, but procedurally generates new action categories – including interactions between persons, objects, and the environment – as well as random physically plausible variations. Moreover, we jointly generate and learn deep representations of both action appearance and motion thanks to our realistic synthetic data, and our multi-task learning formulation to combine real and synthetic data.

A recent alternative to our procedural generative model that also does not require manual video labeling is the unsupervised Video Generative Adversarial Network (VGAN) of [77]. Instead of leveraging prior structural knowledge about physics and human actions, the authors view videos as tensors of pixel values and learn a two-stream GAN on 5,000 hours of unlabeled Flickr videos. This method focuses on tiny videos and on capturing scene motion assuming a stationary camera. This architecture can be used for action recognition in videos when complemented with prediction layers fine-tuned on labeled videos. Compared to this approach, our proposal allows us to work with any state-of-the-art discriminative architecture, as video generation and action recognition are decoupled steps. We can, therefore, benefit from a strong ImageNet initialization for both appearance and motion streams as in [82]. Moreover, we can decide what specific actions, scenarios, and camera motions to generate, enforcing diversity thanks to our interpretable parametrization. For these reasons, we show in Section 5 that, given the same amount of labeled videos, our model nearly doubles the performance of [77].

3. PHAV: Procedural Human Action Videos

In this section we introduce our interpretable parametric generative model of videos depicting particular human actions, and how we use it to generate our PHAV dataset. We describe the procedural generation techniques we leverage to randomly sample diverse yet physically plausible appearance and motion variations, both for MOCAP-grounded actions and programmatically defined categories.

3.1. Action scene composition

In order to generate a human action video, we place a protagonist performing an action in an environment, under particular weather conditions, at a specific period of the day. There can be one or more background actors in the scene, as well as one or more supporting characters. We film the virtual scene using a parametric camera behavior.

The protagonist is the main human model performing the action. For actions involving two or more people, one is chosen to be the protagonist. Background actors can freely walk in the current virtual environment, while supporting characters are actors with a secondary role necessary to complete an action, e.g., hold hands.

The action is a human motion belonging to a predefined semantic category originating from one or more motion data sources (described in Section 3.3), including pre-determined motions from a MOCAP dataset, or programmatic actions defined using procedural animation techniques [14, 70], in particular ragdoll physics. In addition, we use these techniques to sample physically plausible motion variations (described in Section 3.4) to increase diversity.

Figure 2: Orthographic view of different world regions.

The environment refers to a region in the virtual world (cf. Fig. 2), which consists of large urban areas, natural environments (e.g., forests, lakes, and parks), indoor scenes, and sports grounds (e.g., a stadium). Each of these environments may contain moving or static background pedestrians or objects – e.g., cars, chairs – with which humans can physically interact, voluntarily or not. The outdoor weather in the virtual world can be rainy, overcast, clear, or foggy. The period of the day can be dawn, day, dusk, or night.

Similar to [17, 53], we use a library of pre-made 3D models obtained from the Unity Asset Store, which includes artist-designed human, object, and texture models, as well as semi-automatically created realistic environments.
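To make this composition concrete, the following is a minimal sketch of how one sampled scene description could be represented before being instantiated in the game engine; all field names and default values are illustrative assumptions rather than the generator's actual data structures.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class SceneDescription:
    """One procedurally sampled action scene (illustrative fields only)."""
    action: str                                  # e.g. "kick ball"
    protagonist_model: str                       # human model performing the action
    supporting_actors: List[str] = field(default_factory=list)  # e.g. partner for "walking hug"
    num_background_actors: int = 0               # free-roaming pedestrians
    environment: str = "urban"                   # urban, green, lake, stadium, house interior, ...
    weather: str = "clear"                       # clear, overcast, rain, fog
    day_phase: str = "day"                       # dawn, day, dusk, night
    camera_behavior: str = "kite"                # parametric camera used to film the scene

example = SceneDescription(action="walk hold hands",
                           protagonist_model="model03",
                           supporting_actors=["model11"],
                           num_background_actors=3,
                           weather="overcast")
```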

3.2. Camera

We use a physics-based camera, which we call the Kite camera (cf. Fig. 3), to track the protagonist in a scene. This physics-aware camera is governed by a rigid body attached by a spring to a target position that is, in turn, attached to the protagonist by another spring. By randomly sampling different parameters for the drag and weight of the rigid bodies, as well as the elasticity and length of the springs, we can achieve cameras with a wide range of shot types, 3D transformations, and tracking behaviors, such as following the actor, following the actor with a delay, or stationary. Another parameter controls the direction and strength of an initial impulse that starts moving the camera in a random direction. With different rigid body parameters, this impulse can cause our camera to simulate a handheld camera, move in a circular trajectory, or freely bounce around in the scene while filming the attached protagonist.
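The actual Kite camera is built on the engine's rigid bodies and springs; the following is only a minimal numpy sketch of the two-spring chain idea (protagonist → target → camera). All masses, drags, stiffnesses, and rest lengths here are illustrative assumptions, not the generator's settings.

```python
import numpy as np

def spring_force(pos_a, pos_b, rest_length, stiffness, damping, vel_a, vel_b):
    """Damped spring force applied to body A, pulling it toward body B."""
    delta = pos_b - pos_a
    dist = np.linalg.norm(delta) + 1e-8
    direction = delta / dist
    relative_speed = np.dot(vel_b - vel_a, direction)
    magnitude = stiffness * (dist - rest_length) + damping * relative_speed
    return magnitude * direction

def step_kite_camera(cam, target, protagonist_pos, dt=1.0 / 30):
    """One integration step of the chain protagonist -> target -> camera.
    `cam` and `target` are dicts with 'pos', 'vel', 'mass', and 'drag'."""
    # Spring 1: the target point is pulled toward the protagonist.
    f_target = spring_force(target['pos'], protagonist_pos, rest_length=2.0,
                            stiffness=30.0, damping=4.0,
                            vel_a=target['vel'], vel_b=np.zeros(3))
    # Spring 2: the camera rigid body is pulled toward the target point.
    f_cam = spring_force(cam['pos'], target['pos'], rest_length=3.0,
                         stiffness=15.0, damping=2.0,
                         vel_a=cam['vel'], vel_b=target['vel'])
    for body, force in ((target, f_target), (cam, f_cam)):
        accel = force / body['mass'] - body['drag'] * body['vel']
        body['vel'] = body['vel'] + accel * dt
        body['pos'] = body['pos'] + body['vel'] * dt
    return cam, target
```

Randomizing the masses, drags, spring parameters, and an initial velocity impulse is what produces the different shot types and tracking behaviors described above.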


Figure 3: Left: schematic representation of our Kite camera. Right: human ragdoll configuration with 15 muscles.

3.3. Actions

Our approach relies on two main existing data sources for basic human animations. First, we use the CMU MOCAP database [8], which contains 2605 sequences of 144 subjects divided in 6 broad categories and 23 subcategories, each further described with a short text. We leverage relevant motions from this dataset as a motion source for our procedural generation, based on a simple filtering of their textual motion descriptions. Second, we use a large amount of hand-designed realistic motions made by animation artists and available on the Unity Asset Store.
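As an illustration of the text-based filtering mentioned above, here is a minimal sketch assuming a hypothetical CSV index that maps CMU MOCAP sequence ids to their short textual descriptions (the file name and format are assumptions, not part of the released database).

```python
import csv

def select_motions(index_path, keywords):
    """Return sequence ids whose free-text description contains any keyword."""
    selected = []
    with open(index_path, newline='') as f:
        for seq_id, description in csv.reader(f):
            if any(k in description.lower() for k in keywords):
                selected.append(seq_id)
    return selected

# e.g. candidate base motions for a "walk"-grounded category
walk_like = select_motions("cmu_mocap_index.csv", ["walk", "stroll"])
```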

The key insight of our approach is that these sources need not necessarily contain motions from predetermined action categories of interest, neither synthetic nor target real-world actions (unknown a priori). Instead, we propose to use this library of atomic motions to procedurally generate realistic action categories. Our PHAV dataset contains 35 different action classes (cf. Table 1), including 21 simple categories present in the HMDB51 dataset and composed directly of some of the aforementioned atomic motions. In addition to these actions, we programmatically define 10 action classes involving a single actor and 4 action classes involving two-person interactions. We create these new synthetic actions by taking atomic motions as a base and using procedural animation techniques like blending and ragdoll physics (cf. Section 3.4) to compose them in a physically plausible manner according to simple rules defining each action, such as tying hands together (e.g., "walk hold hands"), disabling one or more muscles (e.g., "crawl", "limp"), or colliding the protagonist against obstacles (e.g., "car hit", "bump into each other").

3.4. Physically plausible motion variations

We now describe the procedural animation techniques [14, 70] we use to randomly generate large amounts of physically plausible and diverse action videos, far beyond simply replaying the source atomic motions.

Type | # | Actions
sub-HMDB | 21 | brush hair, catch, clap, climb stairs, golf, jump, kick ball, push, pick, pour, pull up, run, shoot ball, shoot bow, shoot gun, sit, stand, swing baseball, throw, walk, wave
One-person synthetic | 10 | car hit, crawl, dive floor, flee, hop, leg split, limp, moonwalk, stagger, surrender
Two-people synthetic | 4 | walking hug, walk hold hands, walk the line, bump into each other

Table 1: Actions included in our PHAV dataset.

Ragdoll physics. A key component of our work is the use of ragdoll physics. Ragdoll physics are limited real-time physical simulations that can be used to animate a model (such as a human model) while respecting basic physical properties such as connected joint limits, angular limits, weight, and strength. We consider ragdolls with 15 movable body parts (referred to here as muscles), as illustrated in Figure 3. For each action, we separate those 15 muscles into two disjoint groups: those that are strictly necessary for performing the action, and those that are complementary (altering their movement should not interfere with the semantics of the currently considered action). The presence of the ragdoll allows us to introduce variations of different natures in the generated samples. The other modes of variability generation described in this section assume that the physical plausibility of the models is maintained by the use of ragdoll physics. As an implementation detail, we use RootMotion's PuppetMaster2 for implementing and controlling human ragdolls in Unity® Pro.

2 http://root-motion.com

Random perturbations. Inspired by [49], we create variations of a given motion by adding random perturbations to muscles that should not alter the semantic category of the action being performed. These perturbations are implemented by adding a rigid body to a random subset of the complementary muscles. These bodies are set to orbit around the muscle's position in the original animation skeleton, drifting the movement of the puppet's muscle towards its own position in a periodic oscillating movement. More detailed references on how to implement variations of this type can be found in [49, 14, 50, 70] and references therein.

Muscle weakening. We vary the strength of the avatar performing the action. By reducing its strength, the actor performs an action with seemingly more difficulty.

Action blending. Similarly to modern video games, we use a blended ragdoll technique to constrain the output of a pre-made animation to physically plausible motions. In action blending, we randomly sample a different motion sequence (coming either from the same or from a different class) and replace the movements of the current complementary muscles with those from this new action. We limit the number of blended actions in PHAV to at most two.

Objects. The last physics-based source of variation is the use of objects. First, we manually annotated a subset of the MOCAP actions, marking the instants in time where the actor started or ended the manipulation of an object. Second, we use inverse kinematics to generate plausible programmatic interactions. For reproducibility, our annotated subset of MOCAP actions, as well as the code for interacting with objects for particular actions, will be available upon request.

Figure 4: Example generation failure cases. First row: physics violations (passing through a wall). Second row: over-constrained joints and unintended variations.

Failure cases. Although our approach uses physics-based procedural animation techniques, unsupervised generation of large amounts of random variations with a focus on diversity inevitably causes edge cases where physical models fail. This results in glitches reminiscent of typical video game bugs (cf. Fig. 4). Using a random 1% sample of our dataset, we manually estimated that this corresponds to less than 15% of the videos generated. Although this could be improved, our experiments in Section 5 show that this noise does not prevent us from significantly improving the training of deep action recognition networks using this data.
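As an illustration of the random perturbations described above, the sketch below drifts a random subset of complementary muscles around their original trajectories with a small periodic offset. It operates on raw joint trajectories for simplicity; the actual generator applies such perturbations through ragdoll physics so that joint limits and other constraints keep the result physically plausible, and all parameter values here are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def perturb_complementary_muscles(muscle_tracks, complementary, p_select=0.5,
                                  radius=0.05, freq_hz=1.5, fps=30):
    """Add a small orbiting offset to a random subset of complementary muscles.

    muscle_tracks: array of shape (T, M, 3) with muscle positions per frame.
    complementary: indices of muscles whose motion may be altered without
                   changing the semantics of the action.
    """
    tracks = muscle_tracks.copy()
    chosen = [m for m in complementary if rng.random() < p_select]
    t = np.arange(tracks.shape[0]) / fps
    for m in chosen:
        phase = rng.uniform(0.0, 2.0 * np.pi)
        # circular offset in the horizontal plane, with a different phase per muscle
        offset = radius * np.stack([np.cos(2 * np.pi * freq_hz * t + phase),
                                    np.zeros_like(t),
                                    np.sin(2 * np.pi * freq_hz * t + phase)], axis=1)
        tracks[:, m, :] += offset
    return tracks
```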

3.5. Interpretable parametric generative model

We define a human action video as a random variable X = ⟨H, A, L, B, V, C, E, D, W⟩, where H is a human model, A an action category, L a video length, B a set of basic motions (from MOCAP, manual design, or programmed), V a set of motion variations, C a camera, E an environment, D a period of the day, and W a weather condition; possible values for these parameters are shown in Table 2. Our generative model (cf. Figures 5, 8, 9, and 10) for an action video X is given by:

P(X) = P(H) P(A) P(L | B) P(B | A)
       P(Θv | V) P(V | A) P(Θe | E) P(E | A)
       P(Θc | C) P(C | A, E)
       P(Θd | D) P(D) P(Θw | W) P(W)                  (1)

where Θw is a random variable (r.v.) on weather-specific parameters (e.g., intensity of rain, clouds, fog), Θc is a r.v. on camera-specific parameters (e.g., weights and stiffness for Kite camera springs), Θe is a r.v. on environment-specific parameters (e.g., current waypoint, waypoint locations, background pedestrian starting points and destinations), Θd is a r.v. on period-specific parameters (e.g., amount of sunlight, sun orientation), and Θv is a r.v. on variation-specific parameters (e.g., strength of each muscle, strength of perturbations, blending muscles). The probability functions associated with categorical variables (e.g., A) can be set to be either uniform, or configured manually to use pre-determined weights. Similarly, probability distributions associated with continuous values (e.g., Θc) are either set using a uniform distribution with finite support, or using triangular distributions with pre-determined support and most likely value. We give additional details about our graphical model, as well as the values used to configure the parameter distributions, in section 2 of the attached supplementary material.

Parameter | # | Possible values
Human Model (H) | 20 | models designed by artists
Environment (E) | 7 | simple, urban, green, middle, lake, stadium, house interior
Weather (W) | 4 | clear, overcast, rain, fog
Period of day (D) | 4 | night, dawn, day, dusk
Variation (V) | 5 | ∅, muscle perturbation and weakening, action blending, objects

Table 2: Overview of key random variables of our parametric generative model of human action videos (cf. section 2.1 of the attached supplementary material for more details).

Figure 5: Simplified graphical model for our generator.

Statistic | Value
Clips | 37,536
Total dataset frames | 5,857,361
Total dataset duration | 2d06h14m
Average video duration | 5.2s
Average number of frames | 156.02
Frames per second | 30
Video dimensions | 340x256
Average clips per category | 1,072.6

Table 3: Statistics of the generated dataset instance.
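To make the sampling procedure concrete, the sketch below draws one scenario from a model of this form, top-down: categorical scene variables first, then continuous parameters that depend on them. The value sets follow Table 2, but the probability tables, the fog parameter, and the duration support are illustrative assumptions, and the real generator samples many more variables (human model, base motion, camera, etc.).

```python
import numpy as np

rng = np.random.default_rng(0)

ENVIRONMENTS = ["simple", "urban", "green", "middle", "lake", "stadium", "house interior"]
WEATHER      = ["clear", "overcast", "rain", "fog"]
DAY_PHASE    = ["night", "dawn", "day", "dusk"]
VARIATIONS   = ["none", "perturbation", "weakening", "blending", "objects"]

def sample_scenario(action, p_env=None, t_min=1.0, t_mode=5.0, t_max=10.0):
    """Draw one scenario following the factorization of Eq. (1)."""
    env = rng.choice(ENVIRONMENTS, p=p_env)         # P(E | A); uniform if p_env is None
    weather = rng.choice(WEATHER)                   # P(W)
    phase = rng.choice(DAY_PHASE)                   # P(D)
    variation = rng.choice(VARIATIONS)              # P(V | A)
    length = rng.triangular(t_min, t_mode, t_max)   # video length in seconds
    # Theta_w: continuous weather-specific parameters, sampled given W
    theta_w = {"fog_density": rng.uniform(0.2, 1.0)} if weather == "fog" else {}
    return {"action": action, "environment": env, "weather": weather,
            "day_phase": phase, "variation": variation,
            "length_s": float(length), "weather_params": theta_w}

print(sample_scenario("kick ball"))
```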

3.6. PHAV generation details

We generate videos between 1 and 10 seconds long, at 30 FPS, and at a resolution of 340x256, as these are the dimensions required by the state-of-the-art deep model that we train in Section 4. We use anti-aliasing, motion blur, and standard photo-realistic cinematic effects.

Figure 6: Our "Cool-TSN" deep multi-task learning architecture for end-to-end action recognition in videos.

We have generated 54 hours of video, with a total of 5.8M frames. A summary of the key statistics of the generated dataset can be seen in Table 3. Our parametric model can generate action videos at 5.61 FPS using one consumer-grade gaming GPU (NVIDIA GTX 1070). In contrast, the average annotation rate of data-annotation methods such as [52, 11, 6] is significantly below 0.5 FPS. While those works deal with semantic segmentation, where the cost of annotation is higher, we can generate all modalities for roughly the same cost as RGB. We further note that all modalities will be included in our public dataset release.

4. Cool Temporal Segment Networks

We propose to demonstrate the usefulness of our PHAV dataset via deep multi-task representation learning. Our main goal is to learn an end-to-end action recognition model for real-world target categories by combining a few examples of labeled real-world videos with a large number of procedurally generated videos for different surrogate categories. Our hypothesis is that, although the synthetic examples differ in statistics and tasks, their realism, quantity, and diversity can act as a strong prior and regularizer against overfitting, towards data-efficient representation learning that can operate with few manually labeled real videos. Figure 6 depicts our learning algorithm, inspired by [60] but adapted for the recent state-of-the-art Temporal Segment Networks (TSN) of [82] with "cool worlds" [71], i.e., mixing real and virtual data during training.

4.1. Temporal Segment Networks

The recent TSN architecture of [82] improves significantly on the original two-stream architecture of [60]. It processes both RGB frames and stacked optical flow frames using a deeper Inception architecture [67] with Batch Normalization [25] and Dropout [63]. Although it still requires massive labeled training sets, this architecture is more data efficient, and therefore more suitable for action recognition in videos. In particular, [82] shows that both the appearance and motion streams of TSNs can benefit from a strong initialization on ImageNet, which is one of the main factors responsible for the high recognition accuracy of TSN.

Another improvement of TSN is the explicit use of long-range temporal structure by jointly processing random short snippets from a uniform temporal subdivision of a video. TSN computes separate predictions for K different temporal segments of a video. These partial predictions are then condensed into a video-level decision using a segmental consensus function G. We use the same parameters as [82]: a number of segments K = 3, and the consensus function G = (1/K) Σ_{k=1}^{K} F(T_k; W), where F(T_k; W) is a function representing a CNN architecture with weight parameters W operating on short snippet T_k from video segment k.
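For reference, the sketch below implements this averaging consensus; `snippet_cnn` stands for the shared network F(·; W) applied to one snippet, and sampling a single frame per segment is a simplification of TSN's actual snippet construction (stacked flow frames, etc.).

```python
import numpy as np

def segmental_consensus(snippet_scores):
    """G = (1/K) * sum_k F(T_k; W) over K snippet-level class-score vectors."""
    return np.mean(np.stack(snippet_scores, axis=0), axis=0)

def video_prediction(video_frames, snippet_cnn, num_segments=3, seed=0):
    """Split a video into K equal segments, score one randomly chosen snippet
    (here: one frame, as a stand-in) per segment, and fuse the scores."""
    rng = np.random.default_rng(seed)
    segments = np.array_split(np.asarray(video_frames), num_segments)
    scores = [snippet_cnn(seg[rng.integers(len(seg))]) for seg in segments]  # F(T_k; W)
    return segmental_consensus(scores)
```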

4.2. Multi-task learning in a Cool World

As illustrated in Figure 6, the main differences we introduce with our "Cool-TSN" architecture are at both ends of the training procedure: (i) the mini-batch generation, and (ii) the multi-task prediction and loss layers.

Cool mixed-source mini-batches. Inspired by [71, 53], we build mini-batches containing a mix of real-world videos and synthetic ones. Following [82], we build mini-batches of 256 videos divided in blocks of 32 dispatched across 8 GPUs for efficient parallel training using MPI3. Each block of 32 contains 10 random synthetic videos and 22 real videos in all our experiments, as we observed this roughly balances the contribution of the different losses during backpropagation. Note that although we could use our generated ground-truth flow for the PHAV samples in the motion stream, we use the same fast optical flow estimation algorithm as [82] (TVL1 [86]) for all samples in order to fairly estimate the usefulness of our generated videos.

Multi-task prediction and loss layers. Starting from the last feature layer of each stream, we create two separate computation paths, one for target classes from the real-world dataset, and another for surrogate categories from the virtual world. Each path consists of its own segmental consensus, fully-connected prediction, and softmax loss layers. As a result, we obtain the following multi-task loss:

L(y, G) = Σ_{z ∈ {real, virtual}} δ{y ∈ Cz} wz Lz(y, G)                  (2)

Lz(y, G) = − Σ_{i ∈ Cz} yi ( Gi − log Σ_{j ∈ Cz} exp Gj )                  (3)

where z indexes the source dataset (real or virtual) of the video, wz is a loss weight (we use the relative proportion of z in the mini-batch), Cz denotes the set of action categories for dataset z, and δ{y ∈ Cz} is the indicator function that returns one when label y belongs to Cz and zero otherwise. We use standard SGD with backpropagation to minimize this objective, and as every mini-batch contains both real and virtual samples, every update is guaranteed to update both the shared feature layers and the separate prediction layers in a common descent direction. We discuss the setting of the learning hyper-parameters (e.g., learning rate, iterations) in the following experimental section.
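A minimal numpy sketch of Eqs. (2)–(3) as applied to one mixed mini-batch is given below; the per-source weights follow the 22/10 real/synthetic proportion per block of 32 described above, while the batch structure and the separate prediction heads are simplified stand-ins for the actual training code.

```python
import numpy as np

def softmax_cross_entropy(scores, onehot):
    """Eq. (3) for one sample: -sum_i y_i * (G_i - log sum_j exp G_j)."""
    log_z = np.log(np.sum(np.exp(scores - scores.max()))) + scores.max()  # stable log-sum-exp
    return -np.sum(onehot * (scores - log_z))

def cool_multitask_loss(batch, w_real=22 / 32, w_virtual=10 / 32):
    """Eq. (2): each sample only contributes to the loss of its source's head.

    `batch` is a list of dicts with 'source' ('real' or 'virtual'), the
    consensus 'scores' from the matching prediction head, and a one-hot 'label'."""
    weights = {"real": w_real, "virtual": w_virtual}
    total = 0.0
    for sample in batch:
        total += weights[sample["source"]] * softmax_cross_entropy(
            sample["scores"], sample["label"])
    return total / len(batch)  # averaged over the mini-batch
```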

5. Experiments

In this section, we detail our action recognition experiments on widely used real-world video benchmarks. We quantify the impact of multi-task representation learning with our procedurally generated PHAV videos on real-world accuracy, in particular in the small labeled data regime. We also compare our method with the state of the art, both fully supervised and unsupervised.

5.1. Real world action recognition datasets

We consider the two most widely used real-world public benchmarks for human action recognition in videos. The HMDB-51 [29] dataset contains 6,849 fixed-resolution video clips divided between 51 action categories. The evaluation metric for this dataset is the average accuracy over three data splits. The UCF-101 [62, 26] dataset contains 13,320 video clips divided among 101 action classes. Like HMDB, its standard evaluation metric is the average mean accuracy over three data splits. Similarly to UCF-101 and HMDB-51, we generate three random splits on our PHAV dataset, with 80% for training and the rest for testing, and report average accuracy when evaluating on PHAV.

3 https://github.com/yjxiong/temporal-segment-networks

Target | Model | Spatial | Temporal | Full
PHAV | TSN | 58.2 | 72.7 | 74.2
UCF-101 | [82] | 85.1 | 89.7 | 94.0
UCF-101 | TSN | 84.2 | 89.3 | 93.6
UCF-101 | Cool-TSN | 86.1 | 90.1 | 94.2
HMDB-51 | [82] | 51.0 | 64.2 | 68.5
HMDB-51 | TSN | 50.4 | 61.2 | 66.6
HMDB-51 | Cool-TSN | 52.4 | 64.1 | 69.5

Table 4: Impact of our PHAV dataset using Cool-TSN. [82] uses TSN with cross-modality training.

5.2. Temporal Segment Networks

In our first experiments (cf. Table 4), we reproduce the performance of the original TSN on UCF-101 and HMDB-51, and we estimate performance on PHAV separately. We use the same learning parameters as in [82]. For simplicity, we use neither cross-modality pre-training nor a third warped optical flow stream like [82], as their impact on TSN is limited with respect to the substantial increase in training time and computational complexity, degrading accuracy by only −1.9% on HMDB-51 and −0.4% on UCF-101.

TSN on PHAV alone yields an average accuracy between that of HMDB-51 and UCF-101 for all streams. This sanity check confirms that, just like real-world videos, our synthetic videos contain both appearance and motion patterns that can be captured by TSN to discriminate between our different procedural categories.

5.3. Cool Temporal Segment Networks

In Table 4 we report results of our Cool-TSN multi-task representation learning (Section 4.2), which additionally uses PHAV to train UCF-101 and HMDB-51 models. We stop training after 3,000 iterations for the RGB stream and 20,000 for the flow stream, with all other parameters as in [82]. Our results suggest that using Cool-TSN to leverage PHAV yields recognition improvements for all modalities on all datasets, especially for the smaller HMDB-51. This provides quantitative experimental evidence supporting our claim that procedural generation of synthetic human action videos can indeed act as a strong prior and regularizer when learning deep action recognition networks.


Figure 7: Impact of using subsets of the real-world training videos (split 1), with PHAV (Cool-TSN) or without (TSN).

We further validate our hypothesis by investigating the impact of reducing the number of real-world training videos (and iterations), with or without the use of PHAV. Our results, reported in Figure 7, confirm that reducing training data impacts TSN more severely than Cool-TSN. HMDB displays the largest gaps. We partially attribute this to the smaller size of HMDB and to the fact that some categories of PHAV overlap with some categories of HMDB. Our results show that it is possible to replace half of HMDB with procedural videos and still obtain performance comparable to using the full dataset (65.9 vs. 67.8). In a similar way, and although the actions differ more, we show that reducing UCF-101 to a quarter of its original training set still yields a Cool-TSN model that rivals the state of the art [83, 60, 81]. This shows that our procedural generative model of videos can indeed be used to augment different small real-world training sets and obtain better recognition accuracy at a lower cost in terms of manual labor.

5.4. Comparison with the state of the art

In this section, we compare our model with the state of the art in action recognition (Table 5). We separate the current state of the art into works that use one or multiple sources of training data (such as by pre-training, multi-task learning, or model transfer). We note that all works that use multiple sources can potentially benefit from PHAV without any modifications.

Our results indicate that our method is competitive with the state of the art, including methods that use much more manually labeled training data, such as the Sports-1M dataset [28]. In addition, our Cool-TSN almost doubles the performance of the current best generative video model, VGAN [77], on UCF-101, for the same amount of manually labeled target real-world videos. Before fine-tuning on UCF-101, VGAN is trained on 5,000 hours of unlabeled videos from Flickr, whereas with PHAV we only generated 54 hours of diverse videos without human supervision.

Method | UCF-101 %mAcc | HMDB-51 %mAcc
One source:
iDT+FV [80] | 84.8 | 57.2
iDT+StackFV [47] | - | 66.8
iDT+SFV+STP [79] | 86.0 | 60.1
iDT+MIFS [32] | 89.1 | 65.1
VideoDarwin [16] | - | 63.7
Multiple sources:
2S-CNN [60] | 88.0 | 59.4
TDD [81] | 90.3 | 63.2
TDD+iDT [81] | 91.5 | 65.9
C3D+iDT [69] | 90.4 | -
Actions∼Trans [83] | 92.0 | 62.0
2S-Fusion [15] | 93.5 | 69.2
Hybrid-iDT [12] | 92.5 | 70.4
TSN-3M [82] | 94.2 | 69.4
VGAN [77] | 52.1 | -
Cool-TSN (ours) | 94.2 | 69.5

Table 5: Comparison against the state of the art.

6. Conclusion

In this work, we have introduced PHAV, a large synthetic dataset for action recognition based on a procedural generative model of videos. Although our model does not learn video representations like VGAN, it can generate many diverse training videos that are realistic thanks to its grounding in strong prior physical knowledge about scenes, objects, lighting, motions, and humans.

We provide quantitative evidence that our procedurally generated videos can be used as a simple drop-in complement to small training sets of manually labeled real-world videos. Hence, we can leverage state-of-the-art supervised deep models for action recognition without modifications, yielding vast improvements over alternative unsupervised generative models of video. Importantly, we show that we do not need to generate training videos for particular target categories fixed a priori. Instead, surrogate categories defined procedurally enable efficient multi-task representation learning for potentially unrelated target actions that might have only a few real-world training examples.

Our approach combines standard techniques from Computer Graphics (procedural generation) with state-of-the-art deep learning for action recognition. This opens interesting new perspectives for video modeling and understanding, including action recognition models that can leverage algorithmic ground truth generation for optical flow, depth, semantic segmentation, or pose, or the combination with unsupervised generative models like VGAN [77] for dynamic background generation, domain adaptation, or real-to-virtual world style transfer [18].

Acknowledgements. Antonio M. López is supported by the Spanish MICINN project TRA2014-57088-C2-1-R, by the Secretaria d'Universitats i Recerca del Departament d'Economia i Coneixement de la Generalitat de Catalunya (2014-SGR-1506), and the CERCA Programme / Generalitat de Catalunya.

References

[1] A. Agarwal and B. Triggs. A local basis representation for estimating human pose from cluttered images. In ACCV, 2006.
[2] M. Aubry, D. Maturana, A. Efros, B. Russell, and J. Sivic. Seeing 3D chairs: exemplar part-based 2D-3D alignment using a large dataset of CAD models. In CVPR, 2014.
[3] M. Aubry and B. Russell. Understanding deep features with computer-generated imagery. In ICCV, 2015.
[4] C. M. Bishop. Pattern Recognition and Machine Learning. 2006.
[5] E. Bochinski, V. Eiselein, and T. Sikora. Training a convolutional neural network for multi-class object detection using solely virtual world data. In AVSS, 2016.
[6] G. Brostow, J. Fauqueur, and R. Cipolla. Semantic object classes in video: A high-definition ground truth database. Pattern Recognition Letters, 30(20):88–89, 2009.
[7] D. J. Butler, J. Wulff, G. B. Stanley, and M. J. Black. A naturalistic open source movie for optical flow evaluation. In ECCV, pages 611–625, 2012.
[8] Carnegie Mellon Graphics Lab. Carnegie Mellon University Motion Capture Database, 2016.
[9] C. Chen, A. Seff, A. Kornhauser, and J. Xiao. DeepDriving: Learning affordance for direct perception in autonomous driving. In ICCV, 2015.
[10] L.-C. Chen, S. Fidler, A. L. Yuille, and R. Urtasun. Beat the MTurkers: Automatic image labeling from weak 3D supervision. In CVPR, 2014.
[11] M. Cordts, M. Omran, S. Ramos, T. Rehfeld, M. Enzweiler, R. Benenson, U. Franke, S. Roth, and B. Schiele. The Cityscapes dataset for semantic urban scene understanding. In CVPR, 2016.
[12] C. R. de Souza, A. Gaidon, E. Vig, and A. M. López. Sympathy for the Details: Dense Trajectories and Hybrid Classification Architectures for Action Recognition. In ECCV, pages 1–27, 2016.
[13] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. ImageNet: A large-scale hierarchical image database. In CVPR, 2009.
[14] A. Egges, A. Kamphuis, and M. Overmars, editors. Motion in Games, volume 5277 of Lecture Notes in Computer Science. Springer Berlin Heidelberg, Berlin, Heidelberg, 2008.
[15] C. Feichtenhofer, A. Pinz, and A. Zisserman. Convolutional Two-Stream Network Fusion for Video Action Recognition. In CVPR, 2016.
[16] B. Fernando, E. Gavves, M. Oramas, A. Ghodrati, T. Tuytelaars, and K. U. Leuven. Modeling video evolution for action recognition. In CVPR, 2015.
[17] A. Gaidon, Q. Wang, Y. Cabon, and E. Vig. Virtual Worlds as Proxy for Multi-Object Tracking Analysis. In CVPR, pages 4340–4349, 2016.
[18] L. A. Gatys, A. S. Ecker, and M. Bethge. Image style transfer using convolutional neural networks. In CVPR, pages 2414–2423, 2016.
[19] K. Grauman, G. Shakhnarovich, and T. Darrell. Inferring 3D structure with a statistical image-based shape model. In ICCV, 2003.
[20] R. Haeusler and D. Kondermann. Synthesizing real world stereo challenges. In Proceedings of the German Conference on Pattern Recognition, 2013.
[21] H. Haltakov, C. Unger, and S. Ilic. Framework for generation of synthetic ground truth data for driver assistance applications. In Proceedings of the German Conference on Pattern Recognition, 2013.
[22] A. Handa, V. Patraucean, V. Badrinarayanan, S. Stent, and R. Cipolla. SynthCam3D: Semantic understanding with synthetic indoor scenes. CoRR, arXiv:1505.00171, 2015.
[23] A. Handa, V. Patraucean, V. Badrinarayanan, S. Stent, and R. Cipolla. Understanding real world indoor scenes with synthetic data. In CVPR, 2016.
[24] H. Hattori, V. Naresh Boddeti, K. M. Kitani, and T. Kanade. Learning scene-specific pedestrian detectors without real data. In CVPR, 2015.
[25] S. Ioffe and C. Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In ICML, 2015.
[26] Y.-G. Jiang, J. Liu, A. Roshan Zamir, I. Laptev, M. Piccardi, M. Shah, and R. Sukthankar. THUMOS Challenge: Action Recognition with a Large Number of Classes, 2013.
[27] B. Kaneva, A. Torralba, and W. Freeman. Evaluation of image features using a photorealistic virtual world. In ICCV, 2011.
[28] A. Karpathy, G. Toderici, S. Shetty, T. Leung, R. Sukthankar, and L. Fei-Fei. Large-scale video classification with convolutional neural networks. In CVPR, 2014.
[29] H. Kuehne, H. Jhuang, E. Garrote, T. Poggio, and T. Serre. HMDB: a large video database for human motion recognition. In ICCV, 2011.
[30] K. Lai and D. Fox. 3D laser scan classification using web data and domain adaptation. In Robotics: Science and Systems, 2009.
[31] K. Lai and D. Fox. Object recognition in 3D point clouds using web data and domain adaptation. International Journal of Robotics Research, 29(8):1019–1037, 2010.
[32] Z. Lan, M. Lin, X. Li, A. G. Hauptmann, and B. Raj. Beyond Gaussian pyramid: Multi-skip feature stacking for action recognition. In CVPR, 2015.
[33] A. Lerer, S. Gross, and R. Fergus. Learning physical intuition of block towers by example. In ICML, 2016.
[34] T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. Zitnick. Microsoft COCO: Common objects in context. In ECCV, pages 740–755, 2014.
[35] J. Llargues, J. Peralta, R. Arrabales, M. Gonzalez, P. Cortez, and A. López. Artificial intelligence approaches for the generation and assessment of believable human-like behaviour in virtual characters. Expert Systems With Applications, 41(16):7281–7290, 2014.
[36] J. Marín, D. Vázquez, D. Gerónimo, and A. López. Learning appearance in virtual scenarios for pedestrian detection. In CVPR, 2010.
[37] F. Massa, B. Russell, and M. Aubry. Deep exemplar 2D-3D detection by adapting from real to rendered views. In CVPR, 2016.
[38] P. Matikainen, R. Sukthankar, and M. Hebert. Feature seeding for action recognition. In ICCV, 2011.
[39] N. Mayer, E. Ilg, P. Hausser, P. Fischer, D. Cremers, A. Dosovitskiy, and T. Brox. A large dataset to train convolutional networks for disparity, optical flow, and scene flow estimation. In CVPR, 2016.
[40] S. Meister and D. Kondermann. Real versus realistically rendered scenes for optical flow evaluation. In CEMT, 2011.
[41] V. Mnih, K. Kavukcuoglu, D. Silver, A. Graves, I. Antonoglou, D. Wierstra, and M. Riedmiller. Playing Atari with deep reinforcement learning. In NIPS, Workshop on Deep Learning, 2013.
[42] Y. Movshovitz-Attias, T. Kanade, and Y. Sheikh. How useful is photo-realistic rendering for visual learning? CoRR, arXiv:1603.08152, 2016.
[43] N. Onkarappa and A. Sappa. Synthetic sequences and ground-truth flow field generation for algorithm validation. Multimedia Tools and Applications, 74(9):3121–3135, 2015.
[44] P. Panareda-Busto, J. Liebelt, and J. Gall. Adaptation of synthetic data for coarse-to-fine viewpoint refinement. In BMVC, 2015.
[45] J. Papon and M. Schoeler. Semantic pose using deep networks trained on synthetic RGB-D. In ICCV, 2015.
[46] X. Peng, B. Sun, K. Ali, and K. Saenko. Learning deep object detectors from 3D models. In ICCV, 2015.
[47] X. Peng, C. Zou, Y. Qiao, and Q. Peng. Action recognition with stacked Fisher vectors. In ECCV, 2014.
[48] B. Pepik, M. Stark, P. Gehler, and B. Schiele. Teaching 3D geometry to deformable part models. In CVPR, 2012.
[49] K. Perlin. Real time responsive animation with personality. T-VCG, 1(1):5–15, 1995.
[50] K. Perlin and G. Seidman. Autonomous Digital Actors. In Motion in Games, pages 246–255. Springer Berlin Heidelberg, Berlin, Heidelberg, 2008.
[51] L. Pishchulin, A. Jain, C. Wojek, M. Andriluka, T. Thormahlen, and B. Schiele. Learning people detection models from few training samples. In CVPR, 2011.
[52] S. Richter, V. Vineet, S. Roth, and V. Koltun. Playing for data: Ground truth from computer games. In ECCV, 2016.
[53] G. Ros, L. Sellart, J. Materzynska, D. Vázquez, and A. López. The SYNTHIA dataset: a large collection of synthetic images for semantic segmentation of urban scenes. In CVPR, 2016.
[54] A. Rozantsev, V. Lepetit, and P. Fua. On rendering synthetic images for training an object detector. Computer Vision and Image Understanding, 137:24–37, 2015.
[55] S. Satkin, M. Goesele, and B. Schiele. Back to the future: Learning shape models from 3D CAD data. In BMVC, 2010.
[56] S. Satkin, J. Lin, and M. Hebert. Data-driven scene understanding from 3D models. In BMVC, 2012.
[57] J. Schels, J. Liebelt, K. Schertler, and R. Lienhart. Synthetically trained multi-view object class and viewpoint detection for advanced image retrieval. In ICMR, 2011.
[58] A. Shafaei, J. Little, and M. Schmidt. Play and learn: Using video games to train computer vision models. In BMVC, 2016.
[59] J. Shotton, A. Fitzgibbon, M. Cook, T. Sharp, M. Finocchio, R. Moore, A. Kipman, and A. Blake. Real-time human pose recognition in parts from a single depth image. In CVPR, 2011.
[60] K. Simonyan and A. Zisserman. Two-stream convolutional networks for action recognition in videos. In NIPS, 2014.
[61] E. Sizikova, V. K. Singh, B. Georgescu, M. Halber, K. Ma, and T. Chen. Enhancing place recognition using joint intensity-depth analysis and synthetic data. In ECCV, VARVAI Workshop, 2016.
[62] K. Soomro, A. R. Zamir, and M. Shah. UCF101: A dataset of 101 human actions classes from videos in the wild. arXiv:1212.0402, December 2012.
[63] N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov. Dropout: A simple way to prevent neural networks from overfitting. JMLR, 15:1929–1958, 2014.
[64] H. Su, C. Qi, Y. Yi, and L. Guibas. Render for CNN: viewpoint estimation in images using CNNs trained with rendered 3D model views. In ICCV, 2015.
[65] H. Su, F. Wang, Y. Yi, and L. Guibas. 3D-assisted feature synthesis for novel views of an object. In ICCV, 2015.
[66] B. Sun and K. Saenko. From virtual to reality: Fast adaptation of virtual object detectors to real domains. In BMVC, 2014.
[67] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich. Going deeper with convolutions. In CVPR, 2015.
[68] G. Taylor, A. Chosak, and P. Brewer. OVVV: Using virtual worlds to design and evaluate surveillance systems. In CVPR, 2007.
[69] D. Tran, L. Bourdev, R. Fergus, L. Torresani, and M. Paluri. Learning spatiotemporal features with 3D convolutional networks. In CVPR, 2014.
[70] H. van Welbergen, B. J. H. Basten, A. Egges, Z. M. Ruttkay, and M. H. Overmars. Real Time Character Animation: A Trade-off Between Naturalness and Control. Eurographics State-of-the-Art Report, pages 45–72, 2009.
[71] D. Vázquez, A. López, D. Ponsa, and J. Marín. Cool world: domain adaptation of virtual and real worlds for human detection using active learning. In NIPS, Workshop on Domain Adaptation: Theory and Applications, 2011.
[72] D. Vázquez, A. M. López, J. Marín, D. Ponsa, and D. Gerónimo. Virtual and real world adaptation for pedestrian detection. T-PAMI, 36(4):797–809, 2014.
[73] R. Vedantam, X. Lin, T. Batra, C. Zitnick, and D. Parikh. Learning common sense through visual abstraction. In ICCV, 2015.
[74] V. Veeravasarapu, R. Hota, C. Rothkopf, and R. Visvanathan. Model validation for vision systems via graphics simulation. CoRR, arXiv:1512.01401, 2015.
[75] V. Veeravasarapu, R. Hota, C. Rothkopf, and R. Visvanathan. Simulations for validation of vision systems. CoRR, arXiv:1512.01030, 2015.
[76] V. Veeravasarapu, C. Rothkopf, and R. Visvanathan. Model-driven simulations for deep convolutional neural networks. CoRR, arXiv:1605.09582, 2016.
[77] C. Vondrick, H. Pirsiavash, and A. Torralba. Generating videos with scene dynamics. In NIPS, 2016.
[78] H. Wang, A. Kläser, C. Schmid, and C.-L. Liu. Dense trajectories and motion boundary descriptors for action recognition. IJCV, 103:60–79, 2013.
[79] H. Wang, D. Oneata, J. Verbeek, and C. Schmid. A robust and efficient video representation for action recognition. IJCV, pages 1–20, July 2015.
[80] H. Wang and C. Schmid. Action recognition with improved trajectories. In ICCV, 2013.
[81] L. Wang, Y. Qiao, and X. Tang. Action recognition with trajectory-pooled deep-convolutional descriptors. In CVPR, 2015.
[82] L. Wang, Y. Xiong, Z. Wang, Y. Qiao, D. Lin, X. Tang, and L. Van Gool. Temporal Segment Networks: Towards Good Practices for Deep Action Recognition. In ECCV, 2016.
[83] X. Wang, A. Farhadi, and A. Gupta. Actions ∼ Transformations. In CVPR, 2015.
[84] J. Xu, S. Ramos, D. Vázquez, and A. López. Domain adaptation of deformable part-based models. T-PAMI, 36(12):2367–2380, 2014.
[85] J. Xu, D. Vázquez, A. López, J. Marín, and D. Ponsa. Learning a part-based pedestrian detector in a virtual world. T-ITS, 15(5):2121–2131, 2014.
[86] C. Zach, T. Pock, and H. Bischof. A duality based approach for realtime TV-L1 optical flow. In DAGM-Symposium, 2007.
[87] Y. Zhu, R. Mottaghi, E. Kolve, J. Lim, and A. Gupta. Target-driven visual navigation in indoor scenes using deep reinforcement learning. CoRR, arXiv:1609.05143, 2016.
[88] C. Zitnick, R. Vedantam, and D. Parikh. Adopting abstract images for semantic scene understanding. T-PAMI, 38(4):627–638, 2016.


Procedural Generation of Videos to Train Deep Action Recognition Networks

Supplementary material

César Roberto de Souza 1,2, Adrien Gaidon 1, Yohann Cabon 1, Antonio Manuel López 2

1 Computer Vision Group, Xerox Research Center Europe, Meylan, France
2 Centre de Visió per Computador, Universitat Autònoma de Barcelona, Bellaterra, Spain
{cesar.desouza, adrien.gaidon, yohann.cabon}@xrce.xerox.com, [email protected]

1. Introduction

This material provides additional details regarding our submission. In particular, we provide in-depth details about the parametric generative model we used to generate our procedural videos, an extended version of the probabilistic graphical model (the graph shown in the submission had to be simplified due to size considerations), expanded generation statistics, and results for our Cool-TSN model for the separate RGB and flow streams.

2. Generation details

In this section, we provide more details about the interpretable parametric generative model used in our procedural generation of videos, presenting an extended version of the probabilistic graphical model given in Section 3.5 of the original submission.

2.1. Variables

We start by defining the main random variables used in our generative model. Here we focus only on the critical variables that are fundamental to understanding the orchestration of the different parts of our generation, whereas all part-specific variables are shown in section 2.2. The categorical variables that drive most of the procedural generation are:

\[
\begin{aligned}
H &: h \in \{\text{model}_1, \text{model}_2, \ldots, \text{model}_{20}\} \\
A &: a \in \{\text{``clap''}, \ldots, \text{``bump each other''}\} \\
B &: b \in \{\text{motion}_1, \text{motion}_2, \ldots, \text{motion}_{953}\} \\
V &: v \in \{\text{``none''}, \text{``random perturbation''}, \text{``weakening''}, \text{``objects''}, \text{``blending''}\} \\
C &: c \in \{\text{``kite''}, \text{``indoors''}, \text{``closeup''}, \text{``static''}\} \\
E &: e \in \{\text{``urban''}, \text{``stadium''}, \text{``middle''}, \text{``green''}, \text{``house''}, \text{``lake''}\} \\
D &: d \in \{\text{``dawn''}, \text{``day''}, \text{``dusk''}\} \\
W &: w \in \{\text{``clear''}, \text{``overcast''}, \text{``rain''}, \text{``fog''}\}
\end{aligned}
\tag{4}
\]

where H is the human model to be used by the protagonist, A is the action category to be generated, B is the base motion sequence used for the action, V is the variation to be applied to the base motion, C is the camera behavior, E is the environment of the virtual world where the action will take place, D is the day phase, and W is the weather condition.

These categorical variables are in turn controlled by a group of parameters that can be adjusted in order to drive the sample generation. These parameters include the parameters θA of a categorical distribution over action categories A, θW for weather conditions W, θD for day phases D, θH for human models H, θV for variation types V, and θC for camera behaviors C.

Additional parameters include the conditional probability tables of the dependent variables: a matrix of parameters θAE where each row contains the parameters of a categorical distribution over environments E for a given action category A, the matrix of parameters θAC over camera behaviors C for each action A, the matrix of parameters θEC over camera behaviors C for each environment E, and the matrix of parameters θAB over motions B for each action A.

Finally, other relevant parameters include Tmin, Tmax, and Tmod: the minimum, maximum, and most likely durations of the generated videos. We denote the set of all parameters in our model by θ.
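To make the structure of θ concrete, the following Python sketch (illustrative only; the actual generator is implemented inside Unity® Pro with the Accord.NET Framework, and the duration values below are placeholders) collects the categorical parameters and conditional probability tables described above, using the biased day-phase, weather, and camera probabilities given later in Section 2.3:

```python
import numpy as np

# Hypothetical container for the generator parameters theta (names and the
# duration values are placeholders; the real generator runs inside Unity Pro
# with the Accord.NET Framework).
theta = {
    # Probability vectors for the top-level categorical variables.
    "A": np.full(35, 1 / 35),              # action categories (uniform)
    "H": np.full(20, 1 / 20),              # human models (uniform)
    "V": np.full(5, 1 / 5),                # motion variations (uniform)
    "D": np.array([0.25, 0.50, 0.25]),     # dawn, day, dusk (Eq. 7)
    "W": np.array([0.4, 0.3, 0.2, 0.1]),   # clear, overcast, rain, fog (Eq. 8)
    "C": np.array([10, 1, 10, 10]) / 31,   # kite, static, closeup, indoors (Eq. 9)
    # Conditional probability tables: one row per conditioning value, e.g.
    # theta["AE"][a] is a distribution over environments E for action a.
    "AE": np.full((35, 6), 1 / 6),
    "AC": np.full((35, 4), 1 / 4),
    "AB": None,                            # filled by regex matching (Eq. 11)
    # Video duration bounds in seconds (placeholder values).
    "Tmin": 1.0, "Tmax": 10.0, "Tmod": 3.0,
}
```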

2.2. Model

The complete interpretable parametric probabilistic model used by our generation process, given our generation parameters θ, can be written as:

\[
P(H, A, L, B, V, C, E, D, W \mid \theta) =
P_1(D, W \mid \theta)\, P_2(H \mid \theta)\, P_3(A, L, B, V, C, E, W \mid \theta)
\tag{5}
\]

where P1, P2, and P3 are defined by the probabilistic graphical models shown in Figures 8, 9, and 10, respectively. We use extended plate notation [4] to indicate repeating variables, marking parameters (non-variables) with filled rectangles.
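As a rough illustration of how this factorization drives sampling, the sketch below (an assumed, simplified structure, not the actual generator code; the full model also conditions the camera on the environment via θEC and handles motion, variation, and duration selection as described in Section 2.3) draws the main scene-level variables in the order implied by Figures 8-10:

```python
import numpy as np

rng = np.random.default_rng()

def sample_categorical(probs):
    # Draw an index from a categorical distribution with the given probabilities.
    return int(rng.choice(len(probs), p=probs))

def sample_scene(theta):
    # P1: world time and weather.
    d = sample_categorical(theta["D"])
    w = sample_categorical(theta["W"])
    # P2: protagonist human model.
    h = sample_categorical(theta["H"])
    # P3: action category, then the variables that depend on it.
    a = sample_categorical(theta["A"])
    e = sample_categorical(theta["AE"][a])  # environment given the action
    c = sample_categorical(theta["AC"][a])  # camera behavior given the action
    v = sample_categorical(theta["V"])
    return {"day_phase": d, "weather": w, "human_model": h, "action": a,
            "environment": e, "camera": c, "variation": v}
```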


Figure 8: Probabilistic graphical model for P1(D, W | θ), the first part of our parametric generator (world time and weather).

Figure 9: Probabilistic graphical model for P2(H | θ), the second part of our parametric generator (human models).

2.3. Distributions

The generation process makes use of four main families of distributions: categorical, uniform, Bernoulli, and triangular. We adopt the following three-parameter formulation for the triangular distribution:

\[
\mathrm{Tr}(x \mid a, b, c) =
\begin{cases}
0 & \text{for } x < a, \\
\dfrac{2(x - a)}{(b - a)(c - a)} & \text{for } a \le x < c, \\
\dfrac{2}{b - a} & \text{for } x = c, \\
\dfrac{2(b - x)}{(b - a)(b - c)} & \text{for } c < x \le b, \\
0 & \text{for } b < x.
\end{cases}
\tag{6}
\]
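For reference, a direct transcription of Eq. (6) into Python (a minimal sketch; the generator itself relies on the Accord.NET implementation) is:

```python
def triangular_pdf(x, a, b, c):
    """Density of the triangular distribution in Eq. (6):
    a = lower limit, b = upper limit, c = mode, with a <= c <= b."""
    if x < a or x > b:
        return 0.0
    if x < c:
        return 2.0 * (x - a) / ((b - a) * (c - a))
    if x == c:
        return 2.0 / (b - a)
    return 2.0 * (b - x) / ((b - a) * (b - c))
```

Samples from this distribution can also be drawn, for instance, with NumPy's Generator.triangular(left, mode, right), which uses the same three parameters with the mode as the middle argument.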

We use the open-source Accord.NET Framework (http://accord-framework.net) to obtain these distributions from inside Unity® Pro. While we mostly use uniform distributions to configure our model, we sometimes bias the generation towards certain values so that the generated data is closer to real-world dataset statistics.

Day phase. As real-world action recognition datasets are more likely to contain video recordings captured during the day, we fixed the parameter θD such that

\[
\begin{aligned}
P(D = \text{dawn} \mid \theta_D) &= 1/4, \\
P(D = \text{day} \mid \theta_D) &= 2/4, \\
P(D = \text{dusk} \mid \theta_D) &= 1/4.
\end{aligned}
\tag{7}
\]

We note that although our system can also generate night samples, we do not include them in PHAV at this moment.

Weather. As clear conditions are the most frequent and foggy ones the rarest, we fixed the parameter θW such that

\[
\begin{aligned}
P(W = \text{clear} \mid \theta_W) &= 4/10, \\
P(W = \text{overcast} \mid \theta_W) &= 3/10, \\
P(W = \text{rain} \mid \theta_W) &= 2/10, \\
P(W = \text{fog} \mid \theta_W) &= 1/10.
\end{aligned}
\tag{8}
\]

Camera. In addition to the Kite camera, we also included specialized cameras that can be enabled only for certain environments (Indoors), certain actions (Close-Up), and far shots (Static). Since very far shots rarely occur in the target datasets, we fixed the parameter θC such that

\[
\begin{aligned}
P(C = \text{kite} \mid \theta_C) &= 10/31, \\
P(C = \text{static} \mid \theta_C) &= 1/31, \\
P(C = \text{closeup} \mid \theta_C) &= 10/31, \\
P(C = \text{indoors} \mid \theta_C) &= 10/31.
\end{aligned}
\tag{9}
\]

We have also fixed θEC and θAC such that the Indoors camera is only available for the house environment, and the Close-Up camera can also be used for the BrushHair action in addition to Kite.

Environment, human model and variations. We fixed the parameters θE, θH, and θV using equal weights, such that the variables E, H, and V have uniform distributions.

Base motions. All base motions are weighted according to the minimum video length parameter Tmin: motions whose duration is less than Tmin are assigned weight zero, and the others are given uniform weight, such that

\[
P(B = b \mid T_{\min}) \propto
\begin{cases}
1 & \text{if } \mathrm{length}(b) \ge T_{\min} \\
0 & \text{otherwise}
\end{cases}
\tag{10}
\]

We perform the selection of a motion B given a category A by introducing a list of regular expressions associated with each action category.



Figure 10: Probabilistic graphical model for P3(A, L, B, V, C, E, W | θ), the third part of our parametric generator (scene and action preparation).

We then compute matches between the textual description of each motion in its source (e.g., the short text descriptions in [8]) and these expressions, such that

\[
(\theta_{AB})_{ab} =
\begin{cases}
1 & \text{if } \mathrm{match}(\mathrm{regex}_a, \mathrm{desc}_b) \\
0 & \text{otherwise}
\end{cases}
\quad \forall a \in A, \forall b \in B.
\tag{11}
\]

We then use θAB such that

\[
P(B = b \mid A = a, \theta_{AB}) \propto (\theta_{AB})_{a,b}.
\tag{12}
\]
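A minimal sketch of this motion-selection step, combining the length filter of Eq. (10) with the regex matching of Eqs. (11)-(12), is shown below (all names and example data are hypothetical; the real generator matches against the textual motion descriptions of [8]):

```python
import re
import numpy as np

rng = np.random.default_rng()

def motion_weights(action_regexes, motions, t_min):
    """Build the (theta_AB) table of Eqs. (10)-(11): one row per action and one
    column per base motion; an entry is 1 iff the motion description matches the
    action's regular expression and the motion is at least t_min seconds long."""
    table = np.zeros((len(action_regexes), len(motions)))
    for a, pattern in enumerate(action_regexes):
        for b, (description, length) in enumerate(motions):
            if length >= t_min and re.search(pattern, description):
                table[a, b] = 1.0
    return table

def sample_motion(table, action):
    # Draw a base motion index for the given action as in Eq. (12);
    # assumes at least one motion matches the action's regular expression.
    row = table[action]
    return int(rng.choice(len(row), p=row / row.sum()))

# Hypothetical example: two action categories, three candidate motions.
motions = [("person claps hands", 4.2), ("two people bump", 2.1), ("clapping", 0.5)]
table = motion_weights([r"clap", r"bump"], motions, t_min=1.0)
motion = sample_motion(table, action=0)  # picks the only long-enough "clap" motion
```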

Weather elements. The selected weather W affects world parameters such as the sun brightness, the ambient luminosity, and multiple boolean variables that control different aspects of the world (cf. Figure 8). The activation of one of these boolean variables (e.g., fog visibility) can influence the activation of others (e.g., clouds) according to Bernoulli distributions (p = 0.5).

World clock time. The world time is controlled depending on D. In order to avoid generating a large number of samples at the boundaries between two periods of the day, where the distinction between the two phases is blurry, we use a different triangular distribution for each phase, giving a larger probability to hours of interest (sunset, dawn, noon) and smaller probabilities to hours at the transitions.

We therefore define the distribution of the world clock time P(T) as:

\[
P(T = t \mid D) \propto \sum_{d \in D} P(T = t \mid D = d)
\tag{13}
\]

where

\[
\begin{aligned}
P(T = t \mid D = \text{dawn}) &= \mathrm{Tr}(t \mid 7\text{h}, 10\text{h}, 9\text{h}) \\
P(T = t \mid D = \text{day}) &= \mathrm{Tr}(t \mid 10\text{h}, 16\text{h}, 13\text{h}) \\
P(T = t \mid D = \text{dusk}) &= \mathrm{Tr}(t \mid 17\text{h}, 20\text{h}, 18\text{h}).
\end{aligned}
\tag{14}
\]
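Sampling the world clock time then amounts to drawing from the triangular distribution associated with the selected day phase; a sketch using NumPy (hour values taken from Eq. (14); the actual generator uses the Accord.NET distributions inside Unity):

```python
import numpy as np

rng = np.random.default_rng()

# (low, high, mode) in hours for each day phase, as in Eq. (14).
CLOCK_PARAMS = {
    "dawn": (7.0, 10.0, 9.0),
    "day":  (10.0, 16.0, 13.0),
    "dusk": (17.0, 20.0, 18.0),
}

def sample_clock_time(day_phase):
    low, high, mode = CLOCK_PARAMS[day_phase]
    # NumPy's triangular sampler takes its arguments as (left, mode, right).
    return rng.triangular(low, mode, high)
```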

Generated video duration. The selection of the clip duration L given the selected motion b is performed considering the motion length Lb, the minimum and maximum video lengths Tmin and Tmax, and the desired mode Tmod:

\[
P(L = l \mid B = b) = \mathrm{Tr}\bigl(l \mid a = T_{\min},\; b = \min(L_b, T_{\max}),\; c = \min(T_{\text{mod}}, L_b)\bigr)
\tag{15}
\]
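Analogously, a hedged sketch of the duration sampling in Eq. (15), assuming Tmin ≤ Tmod and durations expressed in seconds:

```python
def sample_duration(rng, motion_length, t_min, t_max, t_mod):
    """Draw a clip duration (seconds) as in Eq. (15): lower limit t_min,
    upper limit min(motion_length, t_max), mode min(t_mod, motion_length).
    Eq. (10) already guarantees motion_length >= t_min."""
    upper = min(motion_length, t_max)
    mode = min(t_mod, motion_length)
    return rng.triangular(t_min, mode, upper)  # (left, mode, right)
```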

Actor placement and environment. Each environment E has at most two associated waypoint graphs. One graph gives possible positions for the protagonist, while an additional graph gives possible positions BWG for spawning background actors.


Figure 11: Examples of indoor and outdoor scenes.

Indoor scenes (cf. Figure 11) do not include background actor graphs. After an environment has been selected, a waypoint PW is randomly selected from the graph using a uniform distribution. The protagonist position Pxyz is then set according to the position of PW. The position Sxyz of each supporting character, if any, is set depending on Pxyz. The positions and destinations of the background actors are set depending on BWG.

Camera placement and parameters. After a camera has been selected, its position Cxyz and the position Txyz of its target are set depending on the position Pxyz of the protagonist. The camera parameters are randomly sampled using uniform distributions over sensible ranges according to the observed behavior in Unity. The most relevant secondary variables for the camera are shown in Figure 10. They include Unity-specific parameters for the camera-target (CTs, CTm) and target-protagonist (TPs, CTm) springs that control their strength, as well as a minimum-distance tolerance zone in which the spring has no effect (remains at rest). In our generator, the minimum distance is set to either 0, 1, or 2 meters with uniform probabilities. This setting is responsible for a "delay" effect that allows the protagonist not to always be at the center of the camera focus (thus avoiding creating such a bias in the data).

Action variations. After a variation mode has been selected, the generator needs to select a subset of the ragdoll muscles (cf. Figure 12) to be perturbed (random perturbations) or to be replaced with movement from a different motion (action blending). These muscles are selected using a uniform distribution over the muscles that have been marked as non-critical for the previously selected action category A. When using weakening, a subset of muscles is chosen to be weakened with varying parameters, independently of the action category. When using objects, the choice of objects and how they are used also depends on the action category. A sketch of this selection step is given below.
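The following sketch illustrates how the random-perturbation variation might select its targets (hypothetical muscle names; the actual ragdoll has 15 muscles, cf. Figure 12, and the subset-size distribution is an assumption):

```python
import numpy as np

rng = np.random.default_rng()

def select_muscles_to_perturb(all_muscles, critical_for_action):
    """Pick a uniform random, non-empty subset of the muscles that are not
    critical for the selected action category (subset size chosen uniformly)."""
    candidates = [m for m in all_muscles if m not in critical_for_action]
    k = int(rng.integers(1, len(candidates) + 1))
    return [str(m) for m in rng.choice(candidates, size=k, replace=False)]

# Hypothetical example: for a "clap" action, keep the arms untouched.
muscles = ["head", "spine", "left_arm", "right_arm", "left_leg", "right_leg"]
perturbed = select_muscles_to_perturb(muscles, {"left_arm", "right_arm"})
```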

Figure 12: Ragdoll configuration with 15 muscles.

Object placement. Interaction with objects can happen in two forms: dynamic or static. When using objects dynamically, an object of the needed type (e.g., bow, ball) is spawned around (or is attached to) the protagonist at a predetermined position, and is manipulated using 3D joints, inverse kinematics, or both. When using static (fixed) objects, the protagonist is moved to the vicinity of an object already present in the virtual world (e.g., bench, stairs).


Figure 13: Plot of the number of videos generated for each category in the version of our PHAV dataset used in the submission.

2.4. Statistics

In this section we show statistics about the version of PHAV that has been used in the experimental section of the submission. Figure 13 shows the number of videos generated for each action category in PHAV. As can be seen, there are more than 1,000 samples for every category.

Figure 14 shows the number of videos generated for each value of the main random generation variables.

Figure 14: Plot of the number of videos per parameter value in the version of our PHAV dataset used in the submission.

The histograms reflect the probability values presented in Section 2.3. While our parametric model is flexible enough to generate a wide range of world variations, we have focused on generating videos that are more similar to those in the target datasets.


Fraction   UCF101   UCF101+PHAV   HMDB51   HMDB51+PHAV

5%         68.8     72.1          30.7     36.2
10%        80.9     84.7          44.2     48.8
25%        89.0     90.9          54.8     59.5
50%        92.5     92.4          62.9     66.0
100%       92.8     93.3          67.8     69.7

Table 6: TSN and Cool-TSN (+PHAV) with different fractions of real-world training data (split 1).

3. Experiments

In this section, we provide more details about the experiments presented in the experimental section of our original submission.

Table 6 shows, in tabular format, the impact of training our Cool-TSN models using only a fraction of the real-world data (Figure 7 of the original submission). As can be seen, mixing real-world data with virtual-world data from PHAV is helpful in almost all cases.

Figure 15 shows the performance of each network stream separately. The second image in the row shows the performance of the spatial (RGB) stream, and the last image shows the performance of the temporal (optical flow) stream. One can see that the optical flow stream is largely responsible for the good performance of our Cool-TSN, including when using very small fractions of the real data. This confirms that our generator indeed produces plausible motions that are helpful for learning from both the virtual- and real-world data sources.

4. Video

We have included a video (cf. Figure 16) as additional supplementary material to our submission. The video shows random subsamples for a subset of the action categories in PHAV. Each subsample is divided into the 5 main variation categories, and each video is marked with a label indicating the variation being used, following the legend shown in Figure 17.

5. Conclusion

Our detailed graphical model shows how a complex video generation process can be driven by a few simple parameters. We have also shown that generating action videos while taking the effects of physics into account is a challenging task. Nevertheless, we have demonstrated that our approach is feasible through experimental evidence on two real-world datasets, and this supplementary material discloses further information about the performance of the separate RGB and optical flow channels.

Figure 15: TSN and Cool-TSN results for different amounts of training data, for the combined and separate streams.


Figure 16: Sample frame from the supplementary video available at https://adas.cvc.uab.es/phav/.

Figure 17: Legend for the variations shown in the video.
