
HAL Id: hal-01515027
https://hal.archives-ouvertes.fr/hal-01515027

Submitted on 15 Jun 2017


Five Challenges for Intelligent Cinematography and Editing

Rémi Ronfard

To cite this version:
Rémi Ronfard. Five Challenges for Intelligent Cinematography and Editing. Eurographics Workshop on Intelligent Cinematography and Editing, Eurographics Association, Apr 2017, Lyon, France. 10.2312/wiced.20171069. hal-01515027

Eurographics Workshop on Intelligent Cinematography and Editing (2017), pp. 1–5
W. Bares, V. Gandhi, Q. Galvane, and R. Ronfard (Editors)

Five Challenges for Intelligent Cinematography and Editing

Rémi Ronfard

Univ. Grenoble Alpes, Inria, Grenoble, France

Abstract
In this position paper, we propose five challenges for advancing the state of the art in intelligent cinematography and editing by taking advantage of the huge quantity of cinematographic data (movies) and metadata (movie scripts) available in digital formats. This suggests a data-driven approach to intelligent cinematography and editing, with at least five scientific bottlenecks that need to be carefully analyzed and resolved. We briefly describe them and suggest some possible avenues for future research in each of those new directions.

Categories and Subject Descriptors (according to ACM CCS): I.2.10 [Vision and Scene Understanding]; I.3.3 [Computer Graphics]

1. Building a database of movie scenes

There have been several attempts in the past to build databases of movie scenes for the purpose of action recognition. One interesting line of research is to use movie scripts as a source of weak annotation, where action verbs in the script are used to identify actions in the movie. Cour et al. describe a method for aligning movies and scripts using teletext subtitles and building a database of short movie clips described by action verbs [CT07, CJMT08]. Laptev et al. adopt a similar strategy for building the so-called "Hollywood" dataset (HOHA). They automatically extract video segments that likely contain the actions mentioned in the aligned script [LMSR08]. Those segments are then verified manually. Gupta et al. use teletext transcriptions of sports broadcasts for building a database of sports actions [GSSD09]. Salway et al. use audio descriptions for creating a database of movies with a rich description of actions [AVA05, SLO07]. Rohrbach et al. align movies with audio descriptions to create a parallel corpus of over 68K sentences and video snippets from 94 HD movies [RRTS15].

Such databases are useful for the purpose of action recognition, but are not sufficient for learning models of cinematography and film editing. For one thing, the number of action classes and the number of examples per class are usually small (the HOHA database contains 430 video segments labeled with 8 action classes). Furthermore, they do not preserve the structure of the movies into cinematographic scenes and shots. For the purpose of learning general models of cinematography and editing, a much larger number of movie scenes will be needed, with a much more diverse set of actions and cinematographic styles. A movie generally contains on the order of a hundred scenes. Therefore, a complete alignment of one hundred movies with their scripts can be expected to yield a database of ten thousand movie scenes. This will require an intense effort from our community, because the problem of detecting scene breaks in movies remains difficult in general.

A possible approach to this problem is to train scene break classifiers from labeled examples. Active learning methods should be used to refine the classifiers using false negatives (missed scene breaks) and false positives (spurious scene breaks) collected by film experts. Another possible approach is to detect scene breaks as part of the script-to-movie alignment process. This will require a more general form of alignment, where both the movie and the script are represented as tree structures (the movie contains scenes, which contain shots, which contain frames; the script contains scenes, which contain actions and dialogues). The alignment should then be performed between trees rather than sequences, and specific methods such as tree pattern matching [HO82] should be used, rather than the commonly used dynamic time warping (DTW).
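To make the baseline concrete, the following is a minimal sketch of script-to-movie alignment with dynamic time warping, the sequence method that tree alignment would replace. The function names and the toy bag-of-words cost are illustrative assumptions, not part of any published system.

```python
import numpy as np

def dtw_align(script_units, movie_units, cost):
    """Align two sequences with dynamic time warping (DTW).

    script_units: script elements (e.g. dialogue lines)
    movie_units:  movie elements (e.g. subtitle segments)
    cost:         mismatch cost between one element of each sequence
    Returns the optimal monotonic alignment as (script, movie) index pairs.
    """
    n, m = len(script_units), len(movie_units)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            c = cost(script_units[i - 1], movie_units[j - 1])
            # match both units, skip a movie unit, or skip a script unit
            D[i, j] = c + min(D[i - 1, j - 1], D[i, j - 1], D[i - 1, j])
    path, i, j = [], n, m
    while i > 0 and j > 0:  # backtrack the optimal warping path
        path.append((i - 1, j - 1))
        step = int(np.argmin([D[i - 1, j - 1], D[i, j - 1], D[i - 1, j]]))
        i, j = (i - 1, j - 1) if step == 0 else (i, j - 1) if step == 1 else (i - 1, j)
    return list(reversed(path))

def bag_of_words_cost(script_line, subtitle):
    # toy cost: one minus the word overlap (Jaccard) of the two texts
    a, b = set(script_line.lower().split()), set(subtitle.lower().split())
    return 1.0 - len(a & b) / max(len(a | b), 1)
```

A tree alignment in the spirit of [HO82] would replace the two linear indices with a recursive alignment over scene and shot subtrees, so that scene breaks fall out of the alignment itself.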

2. Breaking down scenes into shots

After collecting and annotating a large collection of movie scenes, we will have to confront two related methodological issues. The first issue is the size of the vocabulary of actions present in those scenes, which will likely be on the order of several thousand concepts. This makes the traditional approach of learning action concepts one by one impractical and beyond the reach of the intelligent cinematography and editing community. The second issue is that the action labels present in the script are only a very rough and incomplete description of the action actually performed on screen. The art of mise-en-scene and acting consists primarily in translating the more abstract action concepts present in the script into the more concrete actions played to the camera. This is best illustrated by comparing the original screenplay for a short scene from the movie 'Casablanca', reproduced in Figure 1, with the actions performed by the actors in the movie, which are described in Figure 2 and Figure 3.

Figure 1: A scene from the original screenplay of the movie Casablanca.

In order to overcome those difficulties, we believe it will be necessary to perform a shot-by-shot annotation of each scene in the corpus. In some exceptional cases, such shot descriptions are available in the shape of a decoupage (continuity script), which can be automatically aligned to the movie scene [RTT03, Ron04]. In the more general case, the shot-by-shot annotation must be created from scratch, using controlled vocabularies and formal description languages such as the prose storyboard language [RGB15], which has been shown to be expressive enough to describe movie shots and scenes of arbitrary complexity. Movie scenes contain on the order of twenty shots, which means that a collection of 10,000 scenes will comprise approximately 200,000 shots. Clearly, this annotation cannot be obtained manually, and future work is needed to automate it at least partially.
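As an illustration of what such machine-readable shot annotations could look like, here is a minimal sketch in Python. The field names, controlled vocabularies and frame numbers are hypothetical placeholders, much simpler than the actual prose storyboard language of [RGB15].

```python
from dataclasses import dataclass, field
from typing import List

# Hypothetical controlled vocabularies; the prose storyboard language
# [RGB15] defines a much richer grammar of compositions and movements.
SHOT_SIZES = {"extreme close-up", "close-up", "medium", "full", "long"}
ANGLES = {"eye level", "high angle", "low angle"}

@dataclass
class ShotAnnotation:
    shot_index: int        # position of the shot within the scene
    start_frame: int       # illustrative frame bounds
    end_frame: int
    size: str              # one of SHOT_SIZES
    angle: str             # one of ANGLES
    subjects: List[str]    # actors visible in the frame
    actions: List[str] = field(default_factory=list)  # concrete on-screen actions
    dialogue: str = ""     # spoken line, if any

    def __post_init__(self):
        assert self.size in SHOT_SIZES and self.angle in ANGLES

# Shot 5 of the Casablanca scene, loosely transcribed from Figure 2
shot5 = ShotAnnotation(
    shot_index=5, start_frame=1520, end_frame=1710,
    size="medium", angle="eye level",
    subjects=["Ugarte", "Rick"],
    actions=["runs through hallway", "sees Rick", "grabs Rick"],
)
```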

A promising approach will be to train conditional random fields (CRF) from examples of fully-described shots and to attempt to generalize to novel shots from the same movie or the same genre. Similar approaches have been proposed recently for describing still images using scene graphs [JKS∗15], and we conjecture that they will generalize to the more difficult problem of describing movie shots using prose storyboards.
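As a minimal sketch of the structured prediction step, the following Viterbi decoder labels a sequence of shots under a linear-chain CRF; in a real system the unary scores would come from learned features of each shot and the transition scores from annotated scenes, whereas here both are toy placeholders.

```python
import numpy as np

def viterbi_decode(unary, transition):
    """MAP label sequence for a scene under a linear-chain CRF.

    unary:      (T, K) per-shot label scores (T shots, K labels)
    transition: (K, K) label-to-label compatibility scores
    """
    T, K = unary.shape
    score = np.zeros((T, K))
    back = np.zeros((T, K), dtype=int)
    score[0] = unary[0]
    for t in range(1, T):
        cand = score[t - 1][:, None] + transition  # cand[prev, cur]
        back[t] = np.argmax(cand, axis=0)          # best previous label
        score[t] = unary[t] + np.max(cand, axis=0)
    labels = [int(np.argmax(score[-1]))]
    for t in range(T - 1, 0, -1):                  # backtrack
        labels.append(int(back[t, labels[-1]]))
    return list(reversed(labels))

# toy example: 5 shots, 3 candidate shot descriptions
rng = np.random.default_rng(0)
print(viterbi_decode(rng.normal(size=(5, 3)), rng.normal(size=(3, 3))))
```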

3. Recognizing actors and their actions

Using the temporal alignments between prose storyboards and movie shots will put us in a good position for learning to recognize movie actors and their actions, and to understand the different cinematographic and editing styles which are used to portray them. In previous work, we obtained good results in simultaneously learning models of actions and viewpoints using hidden Markov models [WBR07]. We therefore conjecture that similar approaches can be used for simultaneously learning models of movie actors and their actions (content) together with the corresponding shot composition and editing (style).

Figure 2: Shot-by-shot description (decoupage) of the same scene from the movie Casablanca as in Figure 1, shots 5 and 6 (reproduced from Loyall and Bates [LB97]). The screenplay contains only five actions: "Ugarte runs, sees Rick, grabs him", "guards rush in and grab Ugarte". The decoupage contains many more subtle interactions between dialogue, non-verbal communication and physical actions. Can a statistical model of mise-en-scene be learned for translating the (hidden) screenplay actions into the (visible) movie actions?

Marginalizing over style parameters, we can expect to obtain improved precision in the difficult task of human action recognition, which is notoriously hard in movies. Marginalizing over content parameters, we can expect to learn useful models of cinematography and editing styles, well beyond the current state of the art in the statistical analysis of film style [CDN10, CC15]. What makes this problem particularly challenging is the huge size of the vocabulary, both in content and in style (to be compared with the 11 action categories and 8 viewpoints learned by Weinland et al. [WBR07]).

Luckily, recent advances in computer vision are making human body and face detection reliable enough that it becomes possible to reformulate the action recognition problem. Instead of asking the harder question (what is happening in this shot or scene?), we can now ask an easier one: what is this actor doing in this shot or scene? Relying on actor body and face detection brings the additional advantage that we can describe the video in body coordinates, which are a suitable representation for human actions and activities. In this context, a very promising approach for recognizing a large vocabulary of actions will be to learn semi-Markov conditional random field (SMCRF) models using variants of back-propagation [Col02, SWCS08].
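The marginalization argument can be made concrete with a toy HMM whose hidden state is a joint (action, viewpoint) pair, in the spirit of [WBR07]; all distributions below are made-up placeholders, and a real emission model would score pose and appearance features.

```python
import numpy as np

actions = ["walk", "grab", "talk"]
viewpoints = ["close-up", "medium", "long"]
states = [(a, v) for a in actions for v in viewpoints]
S = len(states)

rng = np.random.default_rng(0)
pi = np.full(S, 1.0 / S)                 # uniform initial distribution
A = rng.dirichlet(np.ones(S), size=S)    # random (placeholder) transitions

def emission(frame, state):
    return 1.0 / S                       # placeholder emission model

def forward_posteriors(frames):
    """Normalized forward (filtering) distribution over joint states."""
    alpha = pi * np.array([emission(frames[0], s) for s in states])
    alpha /= alpha.sum()
    out = [alpha]
    for f in frames[1:]:
        alpha = (alpha @ A) * np.array([emission(f, s) for s in states])
        alpha /= alpha.sum()
        out.append(alpha)
    return np.array(out)

post = forward_posteriors([None] * 10)   # dummy frames
post = post.reshape(-1, len(actions), len(viewpoints))
p_action = post.sum(axis=2)              # marginalize out style (viewpoint)
p_view = post.sum(axis=1)                # marginalize out content (action)
```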

Despite the spectacular recent progress in large-scale machine learning, we would like to argue that learning models of action and cinematography in a purely data-driven fashion may not be sufficient.


(a) Shot 1 - By the time the gendarmes manage to get the door open again, Ugarte has pulled a gun. He FIRES at the doorway.

(b) Shots 2, 3 and 4 - The SHOTS bring on pandemonium in the cafe.

(c) Shot 5 - As Ugarte runs through the hallway he sees Rick, appearing from the opposite direction, and grabs him.

(d) Shot 6 - Quick dialogue between Ugarte and Rick. Guards and gendarmes rush in and grab Ugarte.

(e) Shots 7, 8 and 9 - Rick stands impassively as they drag Ugarte off.

Figure 3: Keyframes from the movie 'Casablanca' corresponding to the scene scripted in Figure 1. The scene was filmed and edited with nine different shots, elaborating on the much shorter action description present in the original screenplay. The translation from script to shots (decoupage) is a major component of film directing, involving actor direction as well as cinematography and editing. Understanding the complexity of decoupage is a key challenge for intelligent cinematography and editing, and requires a careful breakdown and analysis of classic scenes into shots.

As a supplementary source of information, it would be useful to create synthetic examples, where the different parameters of cinematic style, including blocking, lighting and camera framing, can be generated in a more systematic fashion.

This leads to the challenge of creating realistic simulations of movie scenes in 3-D animation. In previous work, we recreated one short scene from the movie 'Back to the Future' for the purpose of demonstrating the performance of our automatic film editing method [GRLC15] and comparing it to the actual editing in the movie [GRC15]. In related work, researchers have started to use game engines to reproduce movie scenes as part of the 'machinima' movement [KM05, Low05, Nit09]. Such techniques show great promise for generating variations in movie-making using virtual sets, actors and cameras, which leads us to our next challenge.


4. Reverse-engineering movie scenes

Starting from an example movie scene broken down into shots, together with a detailed screenplay describing the dramatic action and a prose storyboard describing the composition of each shot, we are in a good position for re-creating the scene in 3-D animation using the tools of machinima. Existing software tools, such as Persona and Matinee in the Unreal Engine, facilitate the creation of such cinematic sequences using a combination of live interaction and scripted animation. The Source Recorder in Valve's Source engine provides similar support. Open source game engines such as Panda3D or Blender can also be used to replicate movie scenes using existing assets.

Machinima pioneer Michael Nitsche gives a very detailed account of the machinima reconstruction of the movie 'Casablanca' during two workshops at the University of Cambridge in 2002 and 2003 [NM04, Nit09]. Those reports clearly illustrate the promises and the limitations of existing machinima tools. Even today, the effort required to recreate a movie scene in machinima remains enormous, as we have experienced ourselves while re-creating the scene from 'Back to the Future' in our lab. But the reward is also substantial, as the 3-D model makes it possible to generate a large number of variations in style for the same content, and to use them for learning invariant action recognition methods [dSGCP16].

Generating those 3-D scenes automatically is our next challenge. Currently, each step in the reconstruction requires laborious interaction, including set reconstruction, virtual actor modeling, retargeting of full-body animation from motion capture databases, facial animation and lip-sync, synchronization between actors, collision detection and physical simulation of the environment. In future work, it should be possible to use prose storyboards [RGB15] as a scripting language for automatically generating 3-D scenes in machinima.

We believe this is a realistic goal, much more so than the previous work of Loyall and Bates [LB93] or Ye and Baldwin [YB08], who attempted to create 3-D scenes directly from movie scripts, without the intermediate step of the prose storyboard. In this endeavour, Loyall and Bates proposed the HAP language [LB93], which uses the framework of behavior trees for scripting actions and reactions of virtual actors in response to their environment. Actor behaviors are computer programs with names (goals), parameters, sub-goals, pre-conditions and post-conditions. They can run sequentially or in parallel.

A promising direction for future research will be to build a probabilistic version of the HAP language, with probabilities computed from examples of real movie scenes. Such a language could be used to learn statistical models of actions and acting styles, and to re-use them during machinima generation. This process of reverse-engineering movie scenes in terms of generative models would make it possible for our community to share large numbers of scenes with a variety of contents and styles, suitable for learning more realistic models of cinematography and editing, not limited to a single movie, director, era or genre.
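As a sketch of what such a probabilistic HAP-style behavior could look like, the following Python fragment scripts goals with preconditions and sub-goals, and adds a choice node whose weights would be estimated from annotated scenes; the class names and the example probabilities are our own illustrative assumptions, not Loyall and Bates' actual language.

```python
import random

class Behavior:
    """A named goal with a precondition and a body of sub-goals or actions."""
    def __init__(self, name, precondition=lambda world: True, body=()):
        self.name, self.precondition, self.body = name, precondition, list(body)

    def run(self, world):
        if not self.precondition(world):
            return False
        for step in self.body:
            if callable(step):
                step(world)                # primitive action mutates the world
            elif not step.run(world):      # sub-goal failed: abort the behavior
                return False
        return True

class ProbabilisticChoice(Behavior):
    """Pick one sub-behavior, with probabilities learned from real scenes."""
    def __init__(self, name, options):    # options: list of (behavior, prob)
        super().__init__(name)
        self.options = options

    def run(self, world):
        behaviors, weights = zip(*self.options)
        return random.choices(behaviors, weights=weights, k=1)[0].run(world)

# Illustrative use on the Casablanca scene; the 0.7/0.3 weights stand in
# for statistics that would be estimated from annotated movie scenes.
def run_to_hallway(world): world["ugarte"] = "hallway"
def grab_rick(world): world["grabbing"] = True

escape = ProbabilisticChoice("escape", [
    (Behavior("run and grab", body=[run_to_hallway, grab_rick]), 0.7),
    (Behavior("hide in cafe"), 0.3),
])
escape.run({})
```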

5. Generating movie scenes

Given a large enough number of movie scenes and their reverse-engineered 3-D animation versions, it becomes possible to reformulate the problem of cinematography and editing as a regression problem, following recent attempts to translate video into text and vice versa using deep neural networks [SVL14, VRD∗15].
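As a minimal sketch of this regression formulation, the following encoder-decoder maps per-frame scene descriptors to per-frame editing decisions. It assumes PyTorch; the feature dimension, the label set, and the idea of predicting one shot label per frame are illustrative assumptions rather than a validated architecture.

```python
import torch
import torch.nn as nn

class SceneToEditing(nn.Module):
    """Encoder-decoder from a 3-D scene description to editing decisions."""
    def __init__(self, feat_dim=64, hidden=128, num_shot_labels=9):
        super().__init__()
        self.encoder = nn.LSTM(feat_dim, hidden, batch_first=True)
        self.decoder = nn.LSTM(hidden, hidden, batch_first=True)
        self.head = nn.Linear(hidden, num_shot_labels)

    def forward(self, scene_features):
        # scene_features: (batch, time, feat_dim), e.g. actor positions/actions
        enc, state = self.encoder(scene_features)
        dec, _ = self.decoder(enc, state)   # decoder conditioned on encoder state
        return self.head(dec)               # (batch, time, num_shot_labels)

model = SceneToEditing()
logits = model(torch.randn(2, 100, 64))     # two scenes, 100 frames each
targets = torch.randint(9, (2, 100))        # ground-truth shot labels per frame
loss = nn.CrossEntropyLoss()(logits.reshape(-1, 9), targets.reshape(-1))
```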

We expect that such methods will eventually make it possible to generate movie scenes with the complexity of the 'Casablanca' example in Figure 3, on a much larger scale than is currently possible. This short movie scene is much more complex and compelling than all the previous work in intelligent cinematography and editing, which uses relatively simple, sometimes caricatural 3-D graphics and animation.

The promises and the challenges of the data-driven approach that we advocate are equally great. Each of the challenges will require a much needed collaboration between researchers in computer vision and computer graphics, knowledge engineers, film scholars and machine learning specialists. That may be the ultimate challenge for our community.

References

[AVA05] SALWAY A., VASSILIOU A., AHMAD K.: What happens in films? In IEEE International Conference on Multimedia and Expo (2005).

[CC15] CUTTING J., CANDAN A.: Shot durations, shot classes, and the increased pace of popular movies. Projections: The Journal for Movies and Mind 9, 2 (2015), 40–62.

[CDN10] CUTTING J., DELONG J., NOTHELFER C.: Attention and the evolution of Hollywood film. Psychological Science 21 (2010), 440–447.

[CJMT08] COUR T., JORDAN C., MILTSAKAKI E., TASKAR B.: Movie and script: Alignment and parsing of video and text transcription. In ECCV (2008).

[Col02] COLLINS M.: Discriminative training methods for hidden Markov models: Theory and experiments with perceptron algorithms. In Empirical Methods in Natural Language Processing (EMNLP) (2002).

[CT07] COUR T., TASKAR B.: Video deconstruction: Revealing narrative structure through image and text alignment. In NIPS Workshop on the Grammar of Vision: Probabilistic Grammar-Based Models for Visual Scene Understanding and Object Categorization (2007).

[dSGCP16] DE SOUZA C. R., GAIDON A., CABON Y., PEÑA A. M. L.: Procedural generation of videos to train deep action recognition networks. CoRR abs/1612.00881 (2016). URL: http://arxiv.org/abs/1612.00881.

[GRC15] GALVANE Q., RONFARD R., CHRISTIE M.: Comparing film-editing. In Eurographics Workshop on Intelligent Cinematography and Editing, WICED '15 (Zurich, Switzerland, May 2015), Eurographics Association, pp. 5–12. URL: https://hal.inria.fr/hal-01160593, doi:10.2312/wiced.20151072.

[GRLC15] GALVANE Q., RONFARD R., LINO C., CHRISTIE M.: Continuity editing for 3D animation. In AAAI Conference on Artificial Intelligence (Austin, Texas, United States, Jan. 2015), AAAI Press, pp. 753–761. URL: https://hal.inria.fr/hal-01088561.

[GSSD09] GUPTA A., SRINIVASAN P., SHI J., DAVIS L. S.: Understanding videos, constructing plots: Learning a visually grounded storyline model from annotated videos. In CVPR (2009).

[HO82] HOFFMANN C. M., O'DONNELL M. J.: Pattern matching in trees. J. ACM 29, 1 (1982), 68–95. doi:10.1145/322290.322295.

[JKS∗15] JOHNSON J., KRISHNA R., STARK M., LI L.-J., SHAMMA D. A., BERNSTEIN M., FEI-FEI L.: Image retrieval using scene graphs. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (2015).

[KM05] KELLAND M., MORRIS L.: Machinima: Making Movies in 3D Virtual Environments. The Ilex Press, Cambridge, 2005.

[LB93] LOYALL A. B., BATES J.: Real-time control of animated broad agents. In Proceedings of the Fifteenth Annual Conference of the Cognitive Science Society (1993).

[LB97] LOYALL A. B., BATES J.: Personality-rich believable agents that use language. In Proceedings of the First International Conference on Autonomous Agents (New York, NY, USA, 1997), AGENTS '97, ACM, pp. 106–113. doi:10.1145/267658.267681.

[LMSR08] LAPTEV I., MARSZALEK M., SCHMID C., ROZENFELD B.: Learning realistic human actions from movies. In IEEE Conference on Computer Vision and Pattern Recognition (2008), pp. 1–8. doi:10.1109/CVPR.2008.4587756.

[Low05] LOWOOD H.: High-performance play: The making of machinima. In Videogames and Art: Intersections and Interactions, Clarke A., Mitchell G. (Eds.). Intellect Books (UK), 2005.

[Nit09] NITSCHE M.: Video Game Spaces: Image, Play, and Structure in 3D Worlds. MIT Press, 2009.

[NM04] NITSCHE M., MAUREEN M.: Play it again Sam: Film performance, virtual environments and game engines. In Visions in Performance: The Impact of Digital Technologies, Carver G., Beardon C. (Eds.). Swets & Zeitlinger, 2004.

[RGB15] RONFARD R., GANDHI V., BOIRON L.: The prose storyboard language: A tool for annotating and directing movies. CoRR abs/1508.07593 (2015). URL: http://arxiv.org/abs/1508.07593.

[Ron04] RONFARD R.: Reading movies: An integrated DVD player for browsing movies and their scripts. In ACM Multimedia (New York City, United States, 2004), ACM. URL: https://hal.inria.fr/inria-00545143.

[RRTS15] ROHRBACH A., ROHRBACH M., TANDON N., SCHIELE B.: A dataset for movie description. CoRR abs/1501.02530 (2015). URL: http://arxiv.org/abs/1501.02530.

[RTT03] RONFARD R., TRAN-THUONG T.: A framework for aligning and indexing movies with their script. In Proceedings of IEEE International Conference on Multimedia and Expo (ICME) (Baltimore, MD, United States, July 2003). URL: https://hal.inria.fr/inria-00423417.

[SLO07] SALWAY A., LEHANE B., O'CONNOR N. E.: Associating characters with events in films. In CIVR '07: Proceedings of the 6th ACM International Conference on Image and Video Retrieval (New York, NY, USA, 2007), ACM, pp. 510–517. doi:10.1145/1282280.1282354.

[SVL14] SUTSKEVER I., VINYALS O., LE Q. V.: Sequence to sequence learning with neural networks. In Advances in Neural Information Processing Systems 27 (2014), pp. 3104–3112.

[SWCS08] SHI Q., WANG L., CHENG L., SMOLA A.: Discriminative human action segmentation and recognition using semi-Markov model. In IEEE Conference on Computer Vision and Pattern Recognition (June 2008), pp. 1–8. doi:10.1109/CVPR.2008.4587557.

[VRD∗15] VENUGOPALAN S., ROHRBACH M., DONAHUE J., MOONEY R., DARRELL T., SAENKO K.: Sequence to sequence – video to text. In Proceedings of the IEEE International Conference on Computer Vision (ICCV) (2015).

[WBR07] WEINLAND D., BOYER E., RONFARD R.: Action recognition from arbitrary views using 3D exemplars. In ICCV 2007 - 11th IEEE International Conference on Computer Vision (Rio de Janeiro, Brazil, Oct. 2007), IEEE, pp. 1–7. URL: https://hal.inria.fr/inria-00544741, doi:10.1109/ICCV.2007.4408849.

[YB08] YE P., BALDWIN T.: Towards automatic animated storyboarding. In AAAI (2008), pp. 578–583.
