Authoring Multimedia Authoring Tools

Brett Adams and Svetha Venkatesh, Curtin University of Technology


IEEE MultiMedia, July–September 2004. 1070-986X/04/$20.00 © 2004 IEEE. Published by the IEEE Computer Society.

Media Impact
Frank Nack, CWI, Netherlands

It’s hard to find people with no video-capture experience these days. In fact, capturing devices, while continually becoming smaller and easier to use, have increased in capacity. They’re also more connectable and interoperable, and their propensity to show up where we least expect them is surprising. Perhaps the average household toaster will soon come equipped with a video camera: “No, honey, that’s the multimedia card slot. The toast goes over there….”

Yet, despite all these advances, the video-capture experience is still frustrating. So, what’s the problem? Put simply, it’s difficult to determine what to capture and how, and how to handle the ensuing process required to transform the raw captured footage into a presentable multimedia artifact.

Frustrated users

Consider a typical home videographer in a holiday scenario. Holidays often combine two incentives to bring out the camera: new sights and sounds, and memorable moments. Depending on how trigger-happy our videographer is, he or she might start the film rolling and in no time gather quite a bit of footage. In the worst case, no further processing occurs, and the captured footage assumes the title of movie. This could mean hours of footage peppered with wandering, panning, and zooming camera work. The end result is often boring, and the viewer falls asleep.

Our home videographer might be willing to process the captured footage with standard commercial software. But this is often frustrating because the available tools are essentially low-level film manipulations under the hood, requiring much patience and some editing knowledge. The most useful outcome is often just a shortened version of the original footage.

Not surprisingly, this postproduction process is difficult for the amateur videographer. People expect to emulate what they’ve experienced, and they’re saturated with highly produced media: finished products whose authors are learned in the use of the film medium. But Michelangelo painted and sculpted well because he practiced to distraction. Churchill orated famously, despite a speech impediment, because he rehearsed. We can’t expect our amateur videographer to automatically author effective multimedia.

Defining the domain

Researchers are only too eager to try to solve such an interesting problem. The “Multimedia Authoring Approaches” sidebar shows a few approaches to multimedia authoring. A quick glance reveals a heterogeneity of context that is representative of the field of computer-assisted multimedia authoring and serves to highlight an important requirement for researchers: They must precisely define the intended domain of the authoring technology in question. Here, “domain” means the complex of assumed context (target user, physical limitations, target audiences, and so on) and rules (conventions), which collectively constitute the air in which the solution lives and becomes efficient in achieving its stated goals. Defining this domain doesn’t require explicitly enumerating and instantiating every aspect, nor specifying every aspect to a fine point. Genres, abstractions, and catch-alls exist precisely because they help specify ranges of items that creators of authoring tools can more easily handle.

But why do we need to address these side issues? They’re kind of hard to determine anyway. Why can’t we just concentrate on simple concretes, such as the data type we’re dealing with and its intrinsic properties? Raw video capture and manipulation, object tracking, cataloging and matching, and so on would surely provide a firm foundation, wouldn’t it? Such a data-type-centric approach might seem appealing, but it doesn’t help define the chief problem of technology: What is the nature of the gap we’re trying to bridge?

Defining the gap

Consider the following example. In the editing room of a professional feature film, the editor often needs to locate a piece of footage to fill a need. The problem is retrieval, and the gap is information on the desired clip’s whereabouts. In the editing room of one of the Lord of the Rings movies, there were times when editors had to chisel hours of footage down to seconds.

Contrast that situation with a home videographer trying to assemble a vacation movie from the clips he or she has gathered. Here, the main challenge is basically to determine how much of what should go where. The problem is which sequence makes an interesting progression, and the corresponding gap is an understanding of film grammar and narrative principles (or some other theory of discourse structure). These two situations are analogous to finding a citation and knowing the principles of essay writing, respectively.

We aren’t saying every home videographer wants to produce something like a Hollywood feature. But people do have certain, at least implicit, expectations. They want, for example, a sense of continuity; a cohesive theme; and well-composed, cleanly edited material. When their production ends up being little more than an unmediated record of events, they’re naturally disappointed. Sometimes the user can’t name the source of this dissatisfaction, let alone remedy it, indicating that many people aren’t even aware of what they can do with the medium.

Closing the gap

Once we’ve better defined the gap our technology is trying to bridge, we’re in a position to propose solutions. For our amateur videographer, the solution is perhaps either education or a lowering of expectations. Such procrustean solutions, however, amount to surrender, and users are generally ill-inclined to surgery of any sort, be it film school or subliminal messages of pessimism.

However, we could provide the video equivalent of a thesaurus and spellchecker: squiggly lines under the offending text. We could even offer grammar checking, providing additional support at a syntactic level. This would help a little, but it still wouldn’t be enough. The problem is twofold. First, we must formulate the cogent argument or message we wish to convey (for example, this scene goes before that one and supports this idea). This is called the discourse. Second, we must express this discourse in a surface manifestation (such as video), using the particular medium’s powers to heighten the discourse’s effectiveness.

Sidebar: Multimedia Authoring Approaches

Traditionally, the process of creating a finished video presentation includes three main phases: preproduction, production, and postproduction, or, in abstract terms, planning, execution, and polishing. Here we present some multimedia authoring approaches from the literature, grouped by the process phase they focus on. We prefer not to emphasize any particular commercial provider, but we encourage you to search for audio or video authoring tools for each relevant phase. You’ll be amazed at what you find.

Planning

❚ R. Baecker et al., “A Multimedia System for Authoring Motion Pictures,” Proc. ACM Multimedia 1996, ACM Press, 1996, pp. 31-42.

❚ B. Bailey, J. Konstan, and J. Carlis, “DEMAIS: Designing Multimedia Applications with Interactive Storyboards,” Proc. 9th ACM Int’l Conf. Multimedia, ACM Press, 2001, pp. 241-250.

Execution

❚ L.-W. He, M. Cohen, and D. Salesin, “The Virtual Cinematographer: A Paradigm for Automatic Real-Time Camera Control and Directing,” Proc. Computer Graphics Int’l, IEEE Press, 1996, pp. 217-224.

❚ B. Tomlinson, B. Blumberg, and D. Nain, “Expressive Autonomous Cinematography for Interactive Virtual Environments,” Proc. 4th Int’l Conf. Autonomous Agents, 2000, pp. 317-324.

❚ B. Barry and G. Davenport, “Documenting Life: Videography and Common Sense,” Proc. Int’l Conf. Multimedia and Expo (ICME 03), vol. 2, IEEE Press, 2003, pp. 197-200.

Polishing

❚ A. Girgensohn et al., “A Semi-Automatic Approach to Home Video Editing,” Proc. 13th Ann. ACM Symp. User Interface Software and Technology, ACM Press, 2000, pp. 81-89.

❚ J. Casares et al., “Simplifying Video Editing Using Metadata,” Proc. Symp. Designing Interactive Systems (DIS 02), ACM Press, 2002, pp. 157-166.

Holistic: from planning to polishing, and optionally to repolishing

❚ M. Davis, “Editing out Video Editing,” IEEE MultiMedia, vol. 10, no. 2, Apr.–June 2003, pp. 54-64.

❚ F. Nack, “From Ontology-Based Semiosis to Computational Intelligence: The Future of Media Computing,” Media Computing: Computational Media Aesthetics, C. Dorai and S. Venkatesh, eds., Kluwer Academic, 2002, pp. 159-196.

Formulating content structure

Getting beneath the surface of the problem, what can we assume our amateur videographer knows? People are good at content: What is it? This footage contains two people skiing. What semantic associations does the footage carry? The older man and the younger man are father and son; skiing can be fun yet painful. This commonsense analysis is just the sort of knowledge that’s difficult to automatically extract from media, model, and ply with computation.

However, if we liken the selection and ordering of footage to the building of an argument, we note that the user’s inherent understanding of the isolated pieces (fathers and sons) isn’t sufficient to help him or her form a reasoned chain of logic. The goal might be an enjoyable home movie, but the user doesn’t know how to get there. Those very same pieces of knowledge (sons, fathers, skiing, and pain) don’t say anything about how to combine them after plucking them from their semantic webs to serve the argument.

Enumerating possibilities from a given set of rules and constraints, however, is something we’ve had success at expressing algorithmically. For example, simple software wizards can help give the needed structure some flexibility, letting the user include discourse elements relevant to his or her presentation within the provided structure.
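This kind of enumeration is easy to sketch. The following is our own minimal illustration, not code from any tool discussed here; the clip names and narrative roles are invented. It generates orderings of candidate clips and keeps only those satisfying two hypothetical rules: open with an establishing shot, and place the climax before the resolution.

```python
from itertools import permutations

# Hypothetical clip annotations: (name, narrative role). In a real tool
# these roles would come from user-supplied metadata.
clips = [("arrival", "establishing"), ("ski-lesson", "rising"),
         ("big-jump", "climax"), ("hot-chocolate", "resolution")]

def valid_orderings(clips):
    """Enumerate clip orderings that satisfy two simple narrative rules:
    start with an establishing shot, and put the climax before the
    resolution."""
    results = []
    for order in permutations(clips):
        roles = [role for _, role in order]
        if roles[0] != "establishing":
            continue  # rule 1: open with an establishing shot
        if roles.index("climax") > roles.index("resolution"):
            continue  # rule 2: climax precedes resolution
        results.append([name for name, _ in order])
    return results

orderings = valid_orderings(clips)
```

A wizard built on such enumeration could then present the surviving orderings to the user rather than force him or her to invent a structure from scratch.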

Modeling discourse structures is a necessary first step. The particular, relevant theory of discourse and corresponding models will vary. If the goal is an entertaining home movie, some type of narrative model is appropriate: perhaps something simple that allows coarse structures such as crises, climaxes, and resolutions. Or a more complex model might be appropriate, such as Dramatica, which views a story as an argument and has a host of structures and roles that a user could potentially instantiate (http://www.dramatica.com/). Rhetorical structure theory might apply to domains in which the veracity of the resulting generated presentation is more important.1 Having structured a discourse in abstract terms for a presentation, we must now instantiate those discourse terminals with living, breathing content; for amateur video, this means footage of concretes that match the abstract contract.
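To make “instantiating discourse terminals” concrete, here is a minimal sketch; the slot names, clip names, and tags are all invented for illustration. Each abstract slot of a narrative template is bound to an as-yet-unused clip whose tags match it.

```python
# A narrative template as an ordered list of abstract slots; each slot
# names the kind of content that may instantiate it. Names are invented.
template = ["setup", "crisis", "climax", "resolution"]

footage = {
    "packing-chaos": {"crisis"},
    "airport-goodbye": {"setup"},
    "summit-reached": {"climax"},
    "sunset-toast": {"resolution", "climax"},
}

def instantiate(template, footage):
    """Greedily bind each abstract slot to an unused clip tagged with it."""
    used, bound = set(), {}
    for slot in template:
        for clip, tags in footage.items():
            if slot in tags and clip not in used:
                bound[slot] = clip
                used.add(clip)
                break
    return bound

movie = instantiate(template, footage)
```

A real system would need backtracking or constraint solving rather than this greedy pass, but the shape of the problem (abstract contract, concrete footage) is the same.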

Expressing form

The next problem to address is how our home videographer can capture the footage. A given video-clip capture veritably bristles with parameters. Light, composition, camera and object motion, camera mounting, z-axis blocking, duration, and focal length (to name just a few) all greatly impact the captured footage, its content, and its aesthetic potential. We can’t expect our amateur to have domain-specific knowledge such as how to map discourse elements to well-formed, effective surface manifestations.
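One way to picture this parameter space is as a structured record a tool could reason over. The sketch below is purely illustrative: the fields mirror some of the parameters just listed, and the default values are our invention.

```python
from dataclasses import dataclass

# A structured record of some capture parameters named above, so a tool
# can inspect and suggest them; defaults are illustrative only.
@dataclass
class ShotParameters:
    light: str = "natural"
    composition: str = "rule-of-thirds"
    camera_motion: str = "static"
    mounting: str = "handheld"
    duration_s: float = 4.0
    focal_length_mm: float = 35.0

# A tool might override only the fields a shot directive cares about.
shot = ShotParameters(camera_motion="pan", duration_s=8.0)
```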

Film theory has done much work for us in defining appropriate parameterizations for given discourse-related goals.2 For example,

❚ for emphasis, shot duration patterning and audio volume;

❚ for emotional state, color atmosphere;

❚ for spatial location, focal depth and audio clues; and

❚ for temporal location, shot transition type and textual descriptions.

So, it’s possible to encode mappings relating the presence of discourse structures to cinematic parameters. However, an important caveat is that the degree of freedom for the requested manifestation parameterization (for example, a shot of your son next to his friend) is not unlimited if the user’s context is the kind of impromptu setting common for amateur videographers. At times, it’s possible to relieve such complications through reuse of existing media.
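The film-grammar mappings listed above can be encoded directly as a lookup from discourse goal to suggested cinematic parameters. This sketch is illustrative only; the concrete parameter values are our invention, not established film-theoretic prescriptions.

```python
# Encode the example discourse-goal-to-parameter mappings as a lookup.
# The concrete values are invented placeholders.
MAPPINGS = {
    "emphasis": {"shot_duration": "patterned", "audio_volume": "raised"},
    "emotional_state": {"color_atmosphere": "warm"},
    "spatial_location": {"focal_depth": "deep", "audio": "localized clues"},
    "temporal_location": {"transition": "dissolve", "text": "caption"},
}

def directives_for(goals):
    """Merge the cinematic parameters suggested for a set of discourse goals."""
    merged = {}
    for goal in goals:
        merged.update(MAPPINGS.get(goal, {}))
    return merged

directives = directives_for(["emphasis", "temporal_location"])
```

A tool holding such a table could annotate each scene of a storyboard with the parameters its discourse goals suggest.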

Media reuse

Reuse of media artifacts is a natural benefit of maturing media technology, mainly its metadata aspects. Examples include extending in time or purpose the initial use to new audiences, multiple versions of the same presentation, and so on. Reuse springs from a desire to harness the initial effort required to create media artifacts and to make it abound to repeated or repurposed uses, as with industrial manufacture: The same bolt, if cast to a standard, can secure a washing machine or a 747, so why can’t we apply the same principle here?

To answer this question, we must identify the interfaces of a piece of footage. This in turn depends on the domain context. In a genre hierarchy stretching from a moving photo album to a simple thematic narrative, to the more complex traditional narrative, it’s difficult to take footage from a lower genre and reuse it in a genre above it. The metadata to place and manipulate footage in the easier genre is insufficient to the task in the target genre.

Imagine you filmed a narrative version of your son’s birthday party, including all the fun and difficulties leading up to the climactic opening of presents. Then, you later decide to do a montage of major family events during the past year. It would be simple to select shots from the already existing narrative and insert them into your review. The only constraint would be that the footage must in some sense be a highlight. In the following year, you decide to create a birthday movie for your daughter. Somehow, you end up missing the footage of the presents! In a moment of desperation, you decide to sneak in some footage from your son’s movie. But this could cause some problems: If certain children appear in one or two shots and are never seen again, eyebrows will elevate. Or perhaps you’ve found some footage suitably devoid of people, but you’ve overlooked the fact that you changed the curtains last Christmas. These are examples of unmet continuity constraints. Another potential problem is that maybe in your son’s video you have presents framed in extreme close-up, and this runs against the grain of the style of movie you’re employing for your daughter. This would be an example of manifestation constraints in answer to aesthetic goals.
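The continuity problems in the birthday example can be cast as checks over clip metadata. The following sketch is hypothetical (the metadata fields and values are invented): a candidate reuse clip passes only if it introduces no unexplained people and its visible set dressing matches the target movie’s.

```python
# Check candidate reuse clips against simple continuity constraints drawn
# from the target movie's context. Metadata fields are illustrative.
target_context = {"people": {"daughter", "grandma"}, "curtains": "blue"}

candidates = [
    {"name": "presents-closeup", "people": {"son", "cousin"}, "curtains": "red"},
    {"name": "empty-table", "people": set(), "curtains": "blue"},
]

def continuity_ok(clip, context):
    """A clip passes if everyone in it belongs to the target movie and
    its set dressing (here, the curtains) matches."""
    strangers = clip["people"] - context["people"]
    return not strangers and clip["curtains"] == context["curtains"]

usable = [c["name"] for c in candidates if continuity_ok(c, target_context)]
```

Manifestation constraints (such as the extreme close-up clashing with the target style) could be added as further predicates over the same metadata.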

In short, authoring presentations of any sort requires a level of understanding about the content or meaning of the media presented. Reuse of existing material is no exception. Add to this the fact that the genre in use can partly predicate that meaning, and we come to the moral of the story: We must define the source and target genres. Coupled with this information, our metadata, however we obtain it, provides computational assistance for creating our presentation.

Example multimedia authoring system

Accordingly, we’ve designed a multimedia authoring system to fulfill these requirements. A typical home movie creation flows through a storyboard, capture, and edit life cycle. The process starts with the selection of a narrative template (an abstraction of an occasion, such as a wedding or a birthday party) viewed in narrative terms (climaxes and so on). The user either obtains this template from an existing library (generalized, specialized, or cannibalized) or constructs it from scratch for the intended event. The user specifies a creative purpose in terms of recognizable genres (for example, action or documentary), which in turn map to affective goals. Our multimedia authoring system then automatically transforms the narrative template, coupled with these goals, into a storyboard comprising scenes and attached shot directives with the help of aesthetic structuralizers. This storyboard is the basis for an interactive capture process that results in footage that the system can automatically edit into a film or alter for affective impact.3
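As a rough illustration of the first step of this life cycle (the genre names and directive wording below are invented, not our system’s actual vocabulary), a narrative template plus a genre choice can be expanded into a storyboard of scenes with attached shot directives:

```python
# Sketch of template + genre -> storyboard expansion. The genre-to-
# directive rules are invented placeholders, not the system's own.
GENRE_DIRECTIVES = {
    "action": "short shots, fast cuts",
    "documentary": "long takes, steady camera",
}

def make_storyboard(template, genre):
    """Attach a genre-flavored shot directive to each narrative scene."""
    directive = GENRE_DIRECTIVES[genre]
    return [{"scene": scene, "directive": directive} for scene in template]

storyboard = make_storyboard(["wedding-arrival", "vows", "first-dance"],
                             "action")
```

The resulting storyboard is what the interactive capture process would walk the user through, one directive-annotated scene at a time.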

The end

If we are to create effective multimedia authoring tools, we must avail ourselves of the various disciplines that we normally throw in the “someone-else’s-problem” basket. Discourse theory, domain distinctives such as media aesthetics, human–computer interface issues, and multimedia data description standards all have a part to play. However, none of these fields stands still, so we need to continually query them for new insights that might impact our multimedia authoring endeavor. MM

References

1. C.A. Lindley et al., The Application of Rhetorical Structure Theory to Interactive News Program Generation from Digital Archives, tech. report INS-R0101, CWI, Amsterdam, 2001.

2. H. Zettl, Sight, Sound, Motion: Applied Media Aesthetics, 3rd ed., Wadsworth Publishing, 1999.

3. B. Adams and S. Venkatesh, “Weaving Stories in Digital Media: When Spielberg Makes Home Movies,” Proc. ACM Multimedia Conf., ACM Press, 2003, pp. 207-210.

Readers may contact Brett Adams or Svetha Venkatesh at Dept. of Computing, Curtin University of Technology, GPO Box U1987, Perth, Western Australia 6845; [email protected] or [email protected].

Contact Media Impact editor Frank Nack at CWI, Kruislaan 413, PO Box 94079, 1090 GB Amsterdam, the Netherlands; [email protected].
